Hand-crafted curation: meta-#SoDS18

😎 Summer of Data Science (#SoDS18) is upon us, and (thanks to Renée (aka @BecomingDataSci)) there is already some great guidance out there. I’m a big fan of “mini” projects and, unlike in Highlander, I’m pretty sure “there can only be one” isn’t the name of the SoDS game.

So, I wanted to share some ideas I have for potential micro-projects…

🙌 Do something by hand

I know, I know — this borders on heresy. As my Archer visualization partner-in-crime, Elijah Meeks, put it:1

“Whenever you tell someone you painstakingly annotated something by hand they grimace and get uncomfortable like you told them you enjoy thrash metal.”

But, according to expert educators (Albert Y. Kim and Chester Ismay, to be specific), there’s still a lot of value in taking ye olde approach to things.2

📦 Compare/recommend packages

First off, I want to acknowledge that there have been some great algorithmic/technical approaches to this, and there are projects under way (🐬 give flipper a look when you have a chance).3 In fact, I recommend you make use of these approaches should you give it a go (detailed nicely in “packagemetrics - Helping you choose a package since runconf17”), but give it that certain human je ne sais quoi. 💅

This doesn’t have to be exhaustive! I really enjoyed two posts by Adam Medcalf, “My favourite R package for: summarising data” and “R packages for summarising data – part 2”.

This can also be great info to add to a package README or vignette. For example, Jenny Bryan (readxl’s maintainer, and all-around awesome human) discusses similar packages in the readxl README. This is a win-win, since she’ll point users who file issues to a different Excel-related package when it’s appropriate, as is often the case when it comes to tidyxl’s specialty of handling awkward, non-tabular Excel files.4

There’s no need to leave this up to the maintainers, though. If you go through a few packages while trying to accomplish a task, you are in a great position to describe what it was about them that led to your choice!

Recommending packages can also be of great help to others. Check out Sharon Machlis’ posts for some inspiration in that department.

👼 Bring a dataset to life

🎴 How many times have you used the iris dataset?

🚗 What about mtcars?

I can’t speak for Antoine Bichat’s experience with iris, but hunting down and sharing pics of the frequently-plotted ’74 vehicles was a pretty eye-opening experience. Among other things, thanks to the keen eye of Nathanael Aff we found out that the Mazda RX4 and RX4 Wagons have rotary engines. Even if you allow for the cylinder-to-rotor conversion (which is a bit of a stretch), it’s like comparing apples to oranges (or Doritos to a water pump).

Update: Thanks to Ben Bolker, I can rest knowing that the source paper from which mtcars is taken acknowledges this unsettling error.

👨‍🎤 And more…

Let the spirit move you! Share your ideas with others (including me, naturally), and make it an #SoDS18 to remember.

  1. 👀 you should read the whole piece, Visualizing Archer: Data visualization to further your enjoyment of narrative, because it’s great…and I’m totally not biased at all.

  2. Check out the slides from Albert Kim’s talk from Data Day Texas 2018, “Something old, something new, something borrowed, something blue: Ways to teach data science (and learn it too!)”.

  3. packagemetrics and its related issues from the rOpenSci 2017 unconf will give you a better sense of this problem than I could ever hope to!

  4. These include: openxlsx, writexl, the C-library libxlsxwriter, and tidyxl.

  5. In fact, there was a whole session about this at useR! 2017, which you can learn more about from Julia Silge’s posts, “How do you discover R packages?”, and “Seeking guidance in choosing and evaluating R packages”.