Category Archives: Data Analysis

The peril of big (flu) data

There is an interesting new post at “In the Pipeline” that summarizes the performance of Google’s “big data” project to track flu trends from search terms.  In short, the predictive performance appears to be pretty bad so far, at least compared to what you might have expected given the hype around “big data.”  The author raises some key points, including the importance of high-quality data, even in very large datasets.  I particularly like this analogy:

“The quality of the data matters very, very, much, and quantity is no substitute. You can make a very large and complex structure out of toothpicks and scraps of wood, because those units are well-defined and solid. You cannot do the same with a pile of cotton balls and dryer lint, not even if you have an entire warehouse full of the stuff.”  –In the Pipeline, March 24, 2014

Data filtering and modeling approaches will likely continue to improve, however, and I think this project is worth watching in the future.

Using R to create a dotplot with jittered x values

If you need to plot several groups of data along the ‘y’ axis, with each group binned into its own category along ‘x’, you can do the following:

1) create a .csv file with your data in columns (you can use headers)

2) import the .csv file into R with: TEST <- read.table("yourfile.csv", sep=",", header=TRUE)

3) load the lattice package, which provides dotplot, and draw the plot: library(lattice); dotplot(values ~ ind, data=stack(TEST), jitter.x=TRUE)

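The important point here is the use of the “stack” function.  It reshapes the columns of the data frame into two columns: “values”, which holds all of the data in a single vector, and “ind”, a factor that records which original column each value came from.  That is what lets you plot the data along ‘y’ while every point in a group shares the same ‘x’ category.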
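
To make the steps concrete, here is a minimal sketch that skips the .csv file and builds a small data frame directly.  The column names (groupA, groupB, groupC) and the numbers are made up for illustration; substitute your own data.  It assumes the lattice package is installed.

```r
library(lattice)  # dotplot() comes from the lattice package

# Hypothetical example data: three groups of measurements, one per column.
# In practice this would come from read.table() as in step 2.
TEST <- data.frame(
  groupA = c(1.2, 1.5, 1.1, 1.8),
  groupB = c(2.3, 2.1, 2.7, 2.4),
  groupC = c(3.0, 3.4, 2.9, 3.1)
)

# stack() reshapes the columns into long format:
#   values = all measurements in one vector
#   ind    = factor naming the column each measurement came from
long <- stack(TEST)
str(long)

# Plot values along y, one x category per original column,
# with horizontal jitter so overlapping points stay visible.
p <- dotplot(values ~ ind, data = long, jitter.x = TRUE)
print(p)
```

Looking at the output of str(long) before plotting is a quick way to confirm that “ind” has one level per column of your original file.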