Data Science

Selecting columns based on type

The tidyverse, and dplyr in particular, provides functions to select columns from a data frame. Three scoped variants of select are available: select_all, select_if and select_at. In this post, we’ll look at one particular application of select_if: capturing the names of the numeric variables in a data frame. A quick Google search turns up a few solutions to this problem. As an example data set, I’ll use the diamonds data set from the ggplot2 package.
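As a minimal sketch of that application, assuming dplyr and ggplot2 are installed, something like the following should do it. The variable name numeric_names is my own choice for illustration and is not taken from the full post.

```r
library(dplyr)
library(ggplot2)  # only needed here for the diamonds data set

# Keep the columns for which is.numeric() returns TRUE,
# then pull out their names as a character vector.
# (numeric_names is an illustrative name, not from the original post.)
numeric_names <- diamonds %>%
  select_if(is.numeric) %>%
  names()

print(numeric_names)
```

The factor columns (cut, color and clarity) are dropped because is.numeric returns FALSE for factors. In dplyr 1.0.0 and later the scoped variants are superseded, and select(where(is.numeric)) is the recommended equivalent.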

Practical Data Science Cookbook

My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC. Our intention in writing this book is to give the data practitioner some guidance on navigating the data science pipeline, from data acquisition to final reports and data applications.

The many faces of statistics/data science: Can't we all just get along and learn from each other?

Two blog posts in the last 24 hours caught my attention. The first was this post by Jeff Leek noting that there are many fields that are applied statistics by another name (and I’d add operations research to his list). The second is a post on Cloudera’s blog on constructing case-control studies. It is generally excellent, but it contains this rather unfortunate (in my view) statement: “Analyzing a case-control study is a problem for a statistician.”