When evaluating the sampling variability of different statistics, I’ll often use the bootstrap procedure to resample my data, compute the statistic on each sample, and look at the distribution of the statistic over several bootstrap samples.
In principle, the bootstrap is straightforward to do. However, if you have correlated data (like repeated measures or longitudinal data or circular data), the unit of sampling no longer is the particular data point but the second-level unit within which the data are correlated; otherwise you break the correlation structure of the data by doing a naive bootstrap and distort the resultant distributions.
Context One of things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) that the name of each subgroup be repeated down one column of the table.
One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things.
Well, here’s what I was dealing with:
(You can download this dataset for your playtime here)
Notice that we have 3 header rows, first with patient IDs, second with spine region, and third with variable names (A and B, to protect the innocent).
Goal A dataset that, for each patient and each angle gives us corresponding values of A and B. So this would be a four-column data set with ID, angle, A and B.
Personal expressiveness, or how data is stored in a spreadsheet When you get data from a broad research community, the variability in how that data is formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace!
But for smaller data, or data collected in the lab, the possibilities are truly endless!
Quirks about running Rcpp on Windows through RStudio This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.
I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.
Chris Moffit has a nice blog on how to use the transform function in pandas. He provides some (fake) data on sales and asks the question of what fraction of each order is from each SKU.
Being a R nut and a tidyverse fan, I thought to compare and contrast the code for the pandas version with an implementation using the tidyverse.
First the pandas code:
import pandas as pd dat = pd.
The package ReporteRs has been getting some play on the interwebs this week, though it’s actually been around for a while. The nice thing about this package is that it allows writing Word and PowerPoint documents in an OS-independent fashion unlike some earlier packages. It also allows the editing of documents by using bookmarks within the documents.
This quick note is just to remind me that the structure of ReporteRs works beautifully with the piping conventions of magrittr.
We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier for the subject) contributes the observation where the variable X takes it’s maximum value for that subject.
R solutions Hadleyverse solutions Using the excellent R package dplyr, we can do this using windowing functions included in dplyr. The following solution is available on StackOverflow, by junkka, and gets around the real possibility that multiple observations might have the same maximum value of X by choosing one of them.
If you have ever worked with US government data or other large datasets, it is likely you have faced fixed-width format data. This format has no delimiters in it; the data look like strings of characters. A separate format file defines which columns of data represent which variables. It seems as if the format is from the punch-card era, but it is quite an efficient format to store large data in (see this StackOverflow discussion).
My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.
Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications.