diff --git a/_Rmd_files/2016-12-13-leekgroup-plots.Rmd b/_Rmd_files/2016-12-13-leekgroup-plots.Rmd new file mode 100644 index 0000000..3d3a7f7 --- /dev/null +++ b/_Rmd_files/2016-12-13-leekgroup-plots.Rmd @@ -0,0 +1,40 @@ +--- +title: "Leek group guide to making plots" +output: html_document +--- + +I have written a few guides for people in academia, including: + +* [How to write your first paper](https://github.com/jtleek/firstpaper) +* [How to review a paper](https://github.com/jtleek/reviews) +* [How to share data](https://github.com/jtleek/datasharing) +* [How to write an R package](https://github.com/jtleek/rpackages) +* [How to read academic papers](https://github.com/jtleek/readingpapers) + +Part of the purpose of these guides has been for people outside of my research group to use them. But the main driver has been having a set of tutorials that can serve as a sort of "onboarding" for new members of my research group. + +Recently we worked collectively on a project where multiple members were each sending in plots, and I realized that the plots looked very different in aesthetic, color scheme, and organization. The result was that it was pretty hard to put the figures together in a paper. It also means that when we use each other's slides in talks there is no coherent pattern to what a plot will look like. + +Other organizations - like [fivethirtyeight](http://fivethirtyeight.com/) - have a consistent look and feel to their graphics. They do this (I imagine) largely as a defense mechanism - they have to produce plots every day! But I think that it also adds to the professionalism of the data analysis products they produce. + +I realized I would like my research group to have a similar type of professionalism in our plots since we regularly produce data products and have to illustrate scientific data. + +This is a guide for how plots should be made in the Leek group. I hope it will evolve over time as members of the group weigh in with their opinions. There is a corresponding + +* [Leek group plotting R package](link TBD) + +that you can use to make plots like ours if you want to, with both ggplot2 and base R plotting parameters set up. + +## Expository versus exploratory graphs + +If you are analyzing data you make plots all of the time. This is part of the interactive data analysis workflow. When exploring data you should not spend time on how the plots look. They should be ugly and fast so you can quickly explore a data set. This guide does not apply to exploratory plots. + +Expository plots are plots that we intend to distribute as part of a paper, blog post, or other communication of our results. Expository plots differ from exploratory plots because they are intended to communicate information to someone who is not you. The key principles behind Leek group expository plots are: + +(1) They communicate the answer to a specific scientific question. +(2) Each plot answers a single scientific question. +(3) Each plot will have a figure caption describing the key story in the plot. +(4) The figure and legend are sufficient to communicate a scientific message without the surrounding paper text. +(5) They have a consistent color theme, point type, and font. 
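To make principle (5) concrete, here is a minimal sketch of what a shared ggplot2 theme and color scale could look like. The names `theme_leek` and `leek_colors` are hypothetical placeholders for illustration; they are not the API of the (still unreleased) Leek group plotting package.

```r
# Minimal sketch of a shared plotting theme and palette.
# theme_leek() and leek_colors are hypothetical names, not the real package API.
library(ggplot2)

# One fixed palette that every group member uses, in this order
leek_colors <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A")

# A thin wrapper around an existing ggplot2 theme that fixes text size and layout
theme_leek <- function(base_size = 14) {
  theme_minimal(base_size = base_size) +
    theme(legend.position = "bottom",
          plot.title = element_text(face = "bold"))
}

# Every expository plot then gets the same look with two extra lines
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 2) +
  scale_color_manual(values = leek_colors) +
  theme_leek() +
  labs(title = "Sepal dimensions by species",
       x = "Sepal length (cm)", y = "Sepal width (cm)")
```

Wrapping the defaults in a function like this means a single change propagates to every figure in a paper, which is the main point of having a group-wide standard.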
+ +Point (4) is directly related to the Leek group [guide to writing the first paper](https://github.com/jtleek/firstpaper) diff --git a/_images/2017-04-06/IMG_7075.jpg b/_images/2017-04-06/IMG_7075.jpg new file mode 100644 index 0000000..4b587cd Binary files /dev/null and b/_images/2017-04-06/IMG_7075.jpg differ diff --git a/_images/2017-04-06/IMG_7076.jpg b/_images/2017-04-06/IMG_7076.jpg new file mode 100644 index 0000000..7113ccc Binary files /dev/null and b/_images/2017-04-06/IMG_7076.jpg differ diff --git a/_images/2017-04-06/cambio-en-matricula.png b/_images/2017-04-06/cambio-en-matricula.png new file mode 100644 index 0000000..d3e242f Binary files /dev/null and b/_images/2017-04-06/cambio-en-matricula.png differ diff --git a/_images/2017-04-06/costo.png b/_images/2017-04-06/costo.png new file mode 100644 index 0000000..4f35c7f Binary files /dev/null and b/_images/2017-04-06/costo.png differ diff --git a/_images/2017-04-06/matricula.png b/_images/2017-04-06/matricula.png new file mode 100644 index 0000000..fce0606 Binary files /dev/null and b/_images/2017-04-06/matricula.png differ diff --git a/_images/2017-05-04/haircuts.png b/_images/2017-05-04/haircuts.png new file mode 100644 index 0000000..6c6aa51 Binary files /dev/null and b/_images/2017-05-04/haircuts.png differ diff --git a/_images/2x2-table-results.png b/_images/2x2-table-results.png new file mode 100644 index 0000000..ae194b8 Binary files /dev/null and b/_images/2x2-table-results.png differ diff --git a/_images/2x2-table.png b/_images/2x2-table.png new file mode 100644 index 0000000..4627ae3 Binary files /dev/null and b/_images/2x2-table.png differ diff --git a/_images/Flowchart-full.png b/_images/Flowchart-full.png new file mode 100644 index 0000000..c02a43a Binary files /dev/null and b/_images/Flowchart-full.png differ diff --git a/_images/Flowchart-partial.png b/_images/Flowchart-partial.png new file mode 100644 index 0000000..b5c5f27 Binary files /dev/null and b/_images/Flowchart-partial.png differ diff --git a/_images/Flowchart.png b/_images/Flowchart.png new file mode 100644 index 0000000..a59257c Binary files /dev/null and b/_images/Flowchart.png differ diff --git a/_images/ai-album.png b/_images/ai-album.png new file mode 100644 index 0000000..875a289 Binary files /dev/null and b/_images/ai-album.png differ diff --git a/_images/alexa-ai.png b/_images/alexa-ai.png new file mode 100644 index 0000000..c2fed8f Binary files /dev/null and b/_images/alexa-ai.png differ diff --git a/_images/cartoon-phone-photos.png b/_images/cartoon-phone-photos.png new file mode 100644 index 0000000..84c8e29 Binary files /dev/null and b/_images/cartoon-phone-photos.png differ diff --git a/_images/chromebook2.jpg b/_images/chromebook2.jpg new file mode 100644 index 0000000..822ac76 Binary files /dev/null and b/_images/chromebook2.jpg differ diff --git a/_images/heights-with-outlier.png b/_images/heights-with-outlier.png new file mode 100644 index 0000000..42521cb Binary files /dev/null and b/_images/heights-with-outlier.png differ diff --git a/_images/images-to-numbers.png b/_images/images-to-numbers.png new file mode 100644 index 0000000..5f96e92 Binary files /dev/null and b/_images/images-to-numbers.png differ diff --git a/_images/importance-not-size.jpg b/_images/importance-not-size.jpg new file mode 100644 index 0000000..650b742 Binary files /dev/null and b/_images/importance-not-size.jpg differ diff --git a/_images/jeff-color-names.png b/_images/jeff-color-names.png new file mode 100644 index 0000000..14b0529 Binary files 
/dev/null and b/_images/jeff-color-names.png differ diff --git a/_images/jeff-rgb.png b/_images/jeff-rgb.png new file mode 100644 index 0000000..787459b Binary files /dev/null and b/_images/jeff-rgb.png differ diff --git a/_images/jeff-smile-dots.png b/_images/jeff-smile-dots.png new file mode 100644 index 0000000..f678d6a Binary files /dev/null and b/_images/jeff-smile-dots.png differ diff --git a/_images/jeff-smile-lines.png b/_images/jeff-smile-lines.png new file mode 100644 index 0000000..dad19a3 Binary files /dev/null and b/_images/jeff-smile-lines.png differ diff --git a/_images/jeff-smile.png b/_images/jeff-smile.png new file mode 100644 index 0000000..dfeb8ea Binary files /dev/null and b/_images/jeff-smile.png differ diff --git a/_images/jeff.jpg b/_images/jeff.jpg new file mode 100644 index 0000000..61b66c2 Binary files /dev/null and b/_images/jeff.jpg differ diff --git a/_images/labels-to-numbers.png b/_images/labels-to-numbers.png new file mode 100644 index 0000000..4f6beb4 Binary files /dev/null and b/_images/labels-to-numbers.png differ diff --git a/_images/many-workflows.png b/_images/many-workflows.png new file mode 100644 index 0000000..0576655 Binary files /dev/null and b/_images/many-workflows.png differ diff --git a/_images/movie-ai.png b/_images/movie-ai.png new file mode 100644 index 0000000..871b259 Binary files /dev/null and b/_images/movie-ai.png differ diff --git a/_images/notajftweet.png b/_images/notajftweet.png new file mode 100644 index 0000000..043e0e3 Binary files /dev/null and b/_images/notajftweet.png differ diff --git a/_images/papr.png b/_images/papr.png new file mode 100644 index 0000000..cd8a6ce Binary files /dev/null and b/_images/papr.png differ diff --git a/_images/pisa-2015-math-v-others.png b/_images/pisa-2015-math-v-others.png new file mode 100644 index 0000000..5a2522f Binary files /dev/null and b/_images/pisa-2015-math-v-others.png differ diff --git a/_images/pisa-2015-scatter.png b/_images/pisa-2015-scatter.png new file mode 100644 index 0000000..5526667 Binary files /dev/null and b/_images/pisa-2015-scatter.png differ diff --git a/_images/silver3.png b/_images/silver3.png new file mode 100644 index 0000000..6d8a756 Binary files /dev/null and b/_images/silver3.png differ diff --git a/_images/timeline-ai.png b/_images/timeline-ai.png new file mode 100644 index 0000000..111562f Binary files /dev/null and b/_images/timeline-ai.png differ diff --git a/_images/us-election-2016-538-prediction.png b/_images/us-election-2016-538-prediction.png new file mode 100644 index 0000000..9d2cf19 Binary files /dev/null and b/_images/us-election-2016-538-prediction.png differ diff --git a/_images/us-election-2016-538-v-upshot.png b/_images/us-election-2016-538-v-upshot.png new file mode 100644 index 0000000..23892e2 Binary files /dev/null and b/_images/us-election-2016-538-v-upshot.png differ diff --git a/_images/ux1.png b/_images/ux1.png new file mode 100644 index 0000000..a06128a Binary files /dev/null and b/_images/ux1.png differ diff --git a/_images/ux2.png b/_images/ux2.png new file mode 100644 index 0000000..2558c08 Binary files /dev/null and b/_images/ux2.png differ diff --git a/_images/workflow.png b/_images/workflow.png new file mode 100644 index 0000000..c090dbd Binary files /dev/null and b/_images/workflow.png differ diff --git a/_posts/2011-12-03-reverse-scooping.md b/_posts/2011-12-03-reverse-scooping.md index d3ab808..6bf39b0 100644 --- a/_posts/2011-12-03-reverse-scooping.md +++ b/_posts/2011-12-03-reverse-scooping.md @@ -18,4 +18,4 @@ tags: - 
advice - Rant --- -I would like to define a new term: r_everse scooping_ is when someone publishes your idea after you, and doesn’t cite you. It has happened to me a few times. What does one do? I usually send a polite message to the authors with a link to my related paper(s). These emails are usually ignored, but not always. Most times I don’t think it is malicious though. In fact, I almost reverse scooped a colleague recently.  People arrive at the same idea a few months (or years) later and there is just too much literature to keep track-off. And remember the culprit authors were not the only ones that missed your paper, the referees and associate editor missed it as well. One thing I have learned is that if you want to claim an idea, try to include it in the title or abstract as very few papers get read cover-to-cover. \ No newline at end of file +I would like to define a new term: _reverse scooping_ is when someone publishes your idea after you, and doesn’t cite you. It has happened to me a few times. What does one do? I usually send a polite message to the authors with a link to my related paper(s). These emails are usually ignored, but not always. Most times I don’t think it is malicious though. In fact, I almost reverse scooped a colleague recently.  People arrive at the same idea a few months (or years) later and there is just too much literature to keep track of. And remember that the culprit authors were not the only ones that missed your paper; the referees and associate editor missed it as well. One thing I have learned is that if you want to claim an idea, try to include it in the title or abstract as very few papers get read cover-to-cover. diff --git a/_posts/2012-11-07-nate-silver-does-it-again-will-pundits-finally-accept.md b/_posts/2012-11-07-nate-silver-does-it-again-will-pundits-finally-accept.md index 86bf265..ff8a97c 100644 --- a/_posts/2012-11-07-nate-silver-does-it-again-will-pundits-finally-accept.md +++ b/_posts/2012-11-07-nate-silver-does-it-again-will-pundits-finally-accept.md @@ -25,6 +25,6 @@ While the pundits were claiming the race was a “dead heat”, the day **Update****: **Congratulations also to Sam Wang (Princeton Election Consortium) and Simon Jackman (pollster) that also called the election perfectly. And thanks to the pollsters that provided the unbiased (on average) data used by all these folks. Data analysts won “experts” lost. -**Update 2**: New plot with data from here. Old graph here. +~~**Update 2**: New plot with data from here. 
Old graph here.~~ - \ No newline at end of file +![Observed versus predicted](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/silver3.png) diff --git a/_posts/2014-10-13-as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential.md b/_posts/2014-10-13-as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential.md index 008f4f7..22e7ea0 100644 --- a/_posts/2014-10-13-as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential.md +++ b/_posts/2014-10-13-as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential.md @@ -35,36 +35,34 @@ In a recent New York Times [article](http://www.nytimes.com/2014/09/30/science/t -Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this:  "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, [Jeff](http://simplystatistics.org/2013/11/26/statistical-zealots/) and [I](http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/) agree on is that this debate is [not constructive](http://noahpinionblog.blogspot.com/2013/01/bayesian-vs-frequentist-is-there-any.html). As [Rob Kass](http://arxiv.org/pdf/1106.2895v2.pdf) suggests it's time to move on to pragmatism. Here I follow up Jeff's recent post by sharing related thoughts brought about by two decades of practicing applied statistics and hope it helps put this unhelpful debate to rest.

+Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this:  "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, [Jeff](http://simplystatistics.org/2013/11/26/statistical-zealots/) and [I](http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/) agree on, it is that this debate is [not constructive](http://noahpinionblog.blogspot.com/2013/01/bayesian-vs-frequentist-is-there-any.html). As [Rob Kass](http://arxiv.org/pdf/1106.2895v2.pdf) suggests, it's time to move on to pragmatism. Here I follow up Jeff's [recent post](http://simplystatistics.org/2014/09/30/you-think-p-values-are-bad-i-say-show-me-the-data/) by sharing related thoughts brought about by two decades of practicing applied statistics, and I hope it helps put this unhelpful debate to rest. + -

Applied statisticians help answer questions with data. How should I design a roulette so my casino makes $? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don't even use probability. The casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players that have simply been lucky. -

-

- It is also important to remember that good applied statisticians also *think*. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment as the one depicted in this popular XKCD cartoon. -

-

+ + It is also important to remember that good applied statisticians also **think**. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment like the one depicted in this popular XKCD cartoon. + + + -

-

+ + Only someone who does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately we do have such people analyzing data. But their choice of technique is not the problem, it's their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes' rule and will adopt a Bayesian approach when appropriate. In the above XKCD example, any self-respecting applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do this we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can't afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematical properties. -

-

- Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In this popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of Biologists search for differential expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn't really matter. -

-

- For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without
the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion were needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper been enough? Would using priors helped given the "expert knowledge" at the time (see below)? -

-

+ Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In [this](http://www.ncbi.nlm.nih.gov/pubmed/16646809) popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of Biologists search for differentially expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn't really matter. + + + + For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion were needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper have been enough? Would using priors have helped given the "expert knowledge" at the time (see below)? + + + 

-

+ + And how would the Bayesian analysis performed by tobacco companies shape the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as opposed to frequentist confidence intervals. Going forward I hope applied statisticians continue to be free to use whatever techniques they see fit and that critical thinking about data continues to be what distinguishes us. Imposing a Bayesian or frequentist philosophy on us would be a disaster. -

\ No newline at end of file diff --git a/_posts/2014-11-04-538-election-forecasts-made-simple.md b/_posts/2014-11-04-538-election-forecasts-made-simple.md index 323ddff..c7a2c4f 100644 --- a/_posts/2014-11-04-538-election-forecasts-made-simple.md +++ b/_posts/2014-11-04-538-election-forecasts-made-simple.md @@ -21,10 +21,10 @@ categories: --- Nate Silver does a [great job](http://fivethirtyeight.com/features/how-the-fivethirtyeight-senate-forecast-model-works/) of explaining his forecast model to laypeople. However, as a statistician I've always wanted to know more details. After preparing a "predict the midterm elections" homework for my [data science class](http://cs109.github.io/2014) I have a better idea of what is going on. -[Here](http://rafalab.jhsph.edu/simplystats/midterm2012.html) is my best attempt at explaining the ideas of 538 using formulas and data. And [here](http://rafalab.jhsph.edu/simplystats/midterm2012.Rmd) is the R markdown. +[Here](http://simplystatistics.org/html/midterm2012.html) is my best attempt at explaining the ideas of 538 using formulas and data. ~~And [here](http://rafalab.jhsph.edu/simplystats/midterm2012.Rmd) is the R markdown.~~     -  \ No newline at end of file +  diff --git a/_posts/2016-10-26-datasets-new-server-rooms.md b/_posts/2016-10-26-datasets-new-server-rooms.md new file mode 100644 index 0000000..dc1b181 --- /dev/null +++ b/_posts/2016-10-26-datasets-new-server-rooms.md @@ -0,0 +1,28 @@ +--- +title: Are Datasets the New Server Rooms? +author: roger +layout: post +comments: false +--- + +Josh Nussbaum has an [interesting post](https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&_hsmi=36470868#.wz8f23tak) over at Medium about whether massive datasets are the new server rooms of tech business. + +The analogy comes from the "old days" where in order to start an Internet business, you had to buy racks and servers, rent server space, buy network bandwidth, license expensive server software, backups, and on and on. Doing all that up front required a substantial amount of capital just to get off the ground. As inconvenient as this might have been, it provided an immediate barrier to entry for any other competitors who weren't able to raise similar capital. + +Of course, + +> ...the emergence of open source software and cloud computing completely eviscerated the costs and barriers to starting a company, leading to deflationary economics where one or two people could start their company without the large upfront costs that were historically the hallmark of the VC industry. + +So if startups don't have huge capital costs in the beginning, what costs *do* they have? Well, for many new companies that rely on machine learning, they need to collect data. + +> As a startup collects the data necessary to feed their ML algorithms, the value the product/service provides improves, allowing them to access more customers/users that provide more data and so on and so forth. + +Collecting huge datasets ultimately costs money. The sooner a startup can raise money to get that data, the sooner they can defend themselves from competitors who may not yet have collected the huge datasets for training their algorithms. + +I'm not sure the analogy between datasets and server rooms quite works. 
Even back when you had to pay a lot of up front costs to set up servers and racks, a lot of that technology was already a commodity, and anyone could have access to it for a price. + +I see massive datasets used to train machine learning algorithms as more like the new proprietary software. The startups of yore spent a lot of time writing custom software for what we might now consider mundane tasks. This was a time-consuming activity but the software that was developed had value and was a differentiator for the company. Today, many companies write complex machine learning algorithms, but those algorithms and their implementations are quickly becoming commodities. So the only thing that separates one company from another is the amount and quality of data that they have to train those algorithms. + +Going forward, it will be interesting to see what these companies will do with those massive datasets once they no longer need them. Will they "open source" them and make them available to everyone? Could there be an open data movement analogous to the open source movement? + +For the most part, I doubt it. While I think many today would perhaps sympathize with the sentiment that [software shouldn't have owners](https://www.gnu.org/gnu/manifesto.en.html), those same people I think would argue vociferously that data most certainly do have owners. I'm not sure how I'd feel if Facebook made all their data available to anyone. That said, many datasets are made available by various businesses, and as these datasets grow in number and in usefulness, we may see a day where the collection of data is not a key barrier to entry, and you can train your machine learning algorithm on whatever is out there. \ No newline at end of file diff --git a/_posts/2016-10-28-nssd-episode-25.md b/_posts/2016-10-28-nssd-episode-25.md new file mode 100644 index 0000000..467b3d2 --- /dev/null +++ b/_posts/2016-10-28-nssd-episode-25.md @@ -0,0 +1,35 @@ +--- +author: roger +layout: post +title: Not So Standard Deviations Episode 25 - How Exactly Do You Pronounce SQL? +--- + +Hilary and I go through the overflowing mailbag to respond to listener questions! Topics include causal inference in trend modeling, regression model selection, using SQL, and data science certification. + +If you have questions you’d like us to answer, you can send them to +nssdeviations @ gmail.com or tweet us at [@NSSDeviations](https://twitter.com/nssdeviations). + +Show notes: + +* [Professor Kobre's Lightscoop Standard Version Bounce Flash Device](https://www.amazon.com/gp/product/B0017LNHY2/) + +* [Speechpad](https://www.speechpad.com) + +* [Speaking American by Josh Katz](https://www.amazon.com/gp/product/0544703391/) + +* [Data Sets Are The New Server Rooms](https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&_hsmi=36470868#.wybl0l3p7) + +* [Are Datasets the New Server Rooms?](http://simplystatistics.org/2016/10/26/datasets-new-server-rooms/) + +* Subscribe to the podcast on [iTunes](https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570) or [Google Play](https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna). And please [leave us a review on iTunes](https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570). + +* Support us through our [Patreon page](https://www.patreon.com/NSSDeviations?ty=h). 
+ +* Get the [Not So Standard Deviations book](https://leanpub.com/conversationsondatascience/). + + +[Download the audio for this episode](https://soundcloud.com/nssd-podcast/episode-25-how-exactly-do-you-pronounce-sql) + +Listen here: + \ No newline at end of file diff --git a/_posts/2016-11-08-chromebook-part2.md b/_posts/2016-11-08-chromebook-part2.md new file mode 100644 index 0000000..545d752 --- /dev/null +++ b/_posts/2016-11-08-chromebook-part2.md @@ -0,0 +1,24 @@ +--- +title: Data scientist on a chromebook take two +author: jeff +layout: post +comments: true +--- + +My friend Fernando showed me his collection of [old Apple dongles](https://twitter.com/jtleek/status/795749713966497793) that no longer work with the latest generation of Apple devices. This, coupled with the announcement of the Macbook pro that promises way more dongles and mostly the same computing, had me freaking out about my computing platform for the future. I've been using cloudy tools for more and more of what I do and so it had me wondering if it was time to go back and try my [Chromebook experiment](http://simplystatistics.org/2012/01/09/a-statistician-and-apple-fanboy-buys-a-chromebook-and/) again. Basically the question is whether I can do everything I need to do comfortably on a Chromebook. + +So to execute the experiment I got a brand new [ASUS chromebook flip](https://www.asus.com/us/Notebooks/ASUS_Chromebook_Flip_C100PA/) and the connector I need to plug it into HDMI monitors (there is no escaping at least one dongle I guess :(). Here is what that badboy looks like in my home office with Apple superfanboy Roger on the screen. + + +![chromebook2](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/chromebook2.jpg) + +In terms of software there have been some major improvements since I last tried this experiment out. Some of these I talk about in my book [How to be a modern scientist](https://leanpub.com/modernscientist). As of this writing this is my current setup: + +* Music on [Google Play](https://play.google.com) +* LaTeX on [Overleaf](https://www.overleaf.com) +* Blog/website/code on [Github](https://github.com/) +* R programming on an [Amazon AMI with Rstudio loaded](http://www.louisaslett.com/RStudio_AMI/) although [I hear](https://twitter.com/earino/status/795750908457984000) there may be other options that are good there that I should try. +* Email/Calendar/Presentations/Spreadsheets/Docs with [Google](https://www.google.com/) products +* Twitter with [Tweetdeck](https://tweetdeck.twitter.com/) + +That handles the vast majority of my workload so far (it's only been a day :)). But I would welcome suggestions and I'll report back either when I give up or if things are still going strong in a little while.... diff --git a/_posts/2016-11-09-not-all-forecasters-got-it-wrong.md b/_posts/2016-11-09-not-all-forecasters-got-it-wrong.md new file mode 100644 index 0000000..fa13785 --- /dev/null +++ b/_posts/2016-11-09-not-all-forecasters-got-it-wrong.md @@ -0,0 +1,76 @@ +--- +title: 'Not all forecasters got it wrong: Nate Silver does it again (again)' +date: 2016-11-09 +author: rafa +layout: post +comments: true +--- + +Four years ago we +[posted](http://simplystatistics.org/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept/) +on Nate Silver's, and other forecasters', triumph over pundits. 
In contrast, after yesterday's presidential election, in which the results contradicted +most polls and data-driven forecasts, several news articles came out +wondering how this happened. It is important to point +out that not all forecasters got it wrong. Statistically +speaking, Nate Silver, once again, got it right. + +To show this, below I include a plot showing the expected margin of +victory for Clinton versus the actual results for the most competitive states provided by 538. It includes the uncertainty bands provided by 538 on +[this site](http://projects.fivethirtyeight.com/2016-election-forecast/) +(I eyeballed the band sizes to make the plot in R, so they are not +exactly like 538's). + +![538-2016-election](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-prediction.png) + +Note that if these are 95% confidence/credible intervals, 538 got 1 +wrong. This is exactly what we expect since 15/16 is about +95%. Furthermore, judging by the plot [here](http://projects.fivethirtyeight.com/2016-election-forecast/), 538 estimated the popular vote margin to be 3.6% +with a confidence/credible interval of about 5%. +This too was an accurate +prediction since Clinton is going to win the popular vote by +about 1% ~~0.5%~~ (note this final result is in the margin of error of +several traditional polls as well). Finally, when other forecasters were +giving Trump between 14% and 0.1% chances of winning, 538 gave +him about a +30% chance, which is slightly more than what a team has when down 3-2 +in the World Series. In contrast, in 2012 538 gave Romney only a 9% +chance of winning. Also, remember, if in ten election cycles you +call it for someone with a 70% chance, you should get it wrong 3 +times. If you get it right every time then your 70% statement was wrong. + +So how did 538 outperform all other forecasters? First, as far as I +can tell they model the possibility of an overall bias, modeled as a +random effect, that affects +every state. This bias can be introduced by systematic +lying to pollsters or undersampling some group. Note that this bias +can't be estimated from data from +one election cycle but its variability can be estimated from +historical data. 538 appears +to estimate the standard error of this term to be +about 2%. More details on this are included [here](http://simplystatistics.org/html/midterm2012.html). In 2016 we saw this bias and you can see it in +the plot above (more points are above the line than below). The +confidence bands account for this source of variability and furthermore +their simulations account for the strong correlation you will see +across states: the chance of seeing an upset in Pennsylvania, Wisconsin, +and Michigan is **not** the product of an upset in each. In +fact it's much higher. Another advantage 538 had is that they somehow +were able to predict a systematic, not random, bias against +Trump. You can see this by +comparing their adjusted data to the raw data (the adjustment favored +Trump by about 1.5 percentage points on average). We can clearly see this when comparing the 538 +estimates to The Upshot's: + + +![538-2016-election](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-v-upshot.png) + +The fact that 538 did so much better than other forecasters should +remind us how hard it is to do data analysis in real life. Knowing +math, statistics and programming is not enough. It requires experience +and a deep understanding of the nuances related to the specific +problem at hand. Nate Silver and the 538 team seem to understand this +more than others. + +Update: Jason Merkin points out (via Twitter) that 538 provides 80% credible +intervals. + diff --git a/_posts/2016-11-11-im-not-moving-to-canada.md b/_posts/2016-11-11-im-not-moving-to-canada.md new file mode 100644 index 0000000..e447dbb --- /dev/null +++ b/_posts/2016-11-11-im-not-moving-to-canada.md @@ -0,0 +1,40 @@ +--- +title: 'Open letter to my lab: I am not "moving to Canada"' +date: 2016-11-11 +author: rafa +layout: post +comments: true +--- + +Dear Lab Members, + +I know that the results of Tuesday's election have many of you +concerned about your future. You are not alone. I am concerned +about my future as well. But I want you to know that I have no plans +of going anywhere and I intend to dedicate as much time to our +projects as I always have. Meeting, discussing ideas and putting them +into practice with you is, by far, the best part of my job. + +We are all concerned that if certain campaign promises are kept many +of our fellow citizens may need our help. If this happens, then we +will pause to do whatever we can to help. But I am currently +cautiously optimistic that we will be able to continue focusing on +helping society in the best way we know how: by doing scientific +research. + +This week Dr. Francis Collins assured us that there is strong +bipartisan support for scientific research. As an example consider +[this op-ed](http://www.nytimes.com/2015/04/22/opinion/double-the-nih-budget.html?_r=0) +in which Newt Gingrich advocates for doubling the NIH budget. There +also seems to be wide consensus in this country that scientific +research is highly beneficial to society and an understanding that to +do the best research we need the best of the best no matter their +gender, race, religion or country of origin. Nothing good comes from +creative, intelligent, dedicated people leaving science. + +I know there is much uncertainty but, as of now, there is nothing stopping us +from continuing to work hard. My plan is to do just that and I hope +you join me. + + + diff --git a/_posts/2016-11-17-leekgroup-colors.md b/_posts/2016-11-17-leekgroup-colors.md new file mode 100644 index 0000000..178683b --- /dev/null +++ b/_posts/2016-11-17-leekgroup-colors.md @@ -0,0 +1,19 @@ +--- +title: Help choose the Leek group color palette +author: jeff +layout: post +comments: true +--- + +My research group just recently finished a paper where several different teams within the group worked on different analyses. If you are interested, the paper describes the [recount resource](http://biorxiv.org/content/early/2016/08/08/068478) which includes processed versions of thousands of human RNA-seq data sets. + +As part of this project each team had to contribute some plots to the paper. One thing that I noticed is that each person used their own color palette and theme when building the plots. When we wrote the paper this made it a little harder for the figures to all fit together - especially when different group members worked on a single panel of a multi-panel plot. + +So I started thinking about setting up a Leek group theme for both base R and ggplot2 graphics. One of the first problems was that every group member had their own opinion about what the best color palette would be. So we are running a little competition to determine what the official Leek group color palette for plots will be in the future. 
+ +As part of that process, one of my awesome postdocs, Shannon Ellis, decided to collect some data on how people perceive different color palettes. The survey is here: + +https://docs.google.com/forms/d/e/1FAIpQLSfHMXVsl7pxYGarGowJpwgDSf9lA2DfWJjjEON1fhuCh6KkRg/viewform?c=0&w=1 + +If you have a few minutes and have an opinion about colors (I know you do!) please consider participating in our little poll and helping to determine the future of Leek group plots! + diff --git a/_posts/2016-11-30-nssd-episode-27.md b/_posts/2016-11-30-nssd-episode-27.md new file mode 100644 index 0000000..c09d14f --- /dev/null +++ b/_posts/2016-11-30-nssd-episode-27.md @@ -0,0 +1,41 @@ +--- +author: roger +layout: post +title: Not So Standard Deviations Episode 27 - Special Guest Amelia McNamara +--- + +I had the pleasure of sitting down with Amelia McNamara, Visiting Assistant Professor of Statistical and Data Sciences at Smith College, to talk about data science, data journalism, visualization, the problems with R, and adult coloring books. + +If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at [@NSSDeviations](https://twitter.com/nssdeviations). + +Show notes: + +* [Amelia McNamara’s web site](http://www.science.smith.edu/~amcnamara/index.html) + +* [Mark Hansen](http://datascience.columbia.edu/mark-hansen) + +* [Listening Post](https://www.youtube.com/watch?v=dD36IajCz6A) + +* [Moveable Type](http://www.nytimes.com/video/arts/1194817116105/moveable-type.html) + +* [Alan Kay](https://en.wikipedia.org/wiki/Alan_Kay) + +* [HARC (Human Advancement Research Community)](https://harc.ycr.org/) + +* [VPRI (Viewpoints Research Institute)](http://www.vpri.org/index.html) + +* [Interactive essays](https://www.youtube.com/watch?v=hps9r7JZQP8) + +* [Golden Ratio Coloring Book](https://rafaelaraujoart.com/products/golden-ratio-coloring-book) + +* Subscribe to the podcast on [iTunes](https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570) or [Google Play](https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna). And please [leave us a review on iTunes](https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570). + +* Support us through our [Patreon page](https://www.patreon.com/NSSDeviations?ty=h). + +* Get the [Not So Standard Deviations book](https://leanpub.com/conversationsondatascience/). + + +[Download the audio for this episode](https://soundcloud.com/nssd-podcast/episode-27-special-guest-amelia-mcnamara) + +Listen here: + diff --git a/_posts/2016-12-09-pisa-us-math.md b/_posts/2016-12-09-pisa-us-math.md new file mode 100644 index 0000000..95aa968 --- /dev/null +++ b/_posts/2016-12-09-pisa-us-math.md @@ -0,0 +1,77 @@ +--- +title: 'What is going on with math education in the US?' +date: 2016-12-09 +author: rafa +layout: post +comments: true +--- + +When colleagues with young children seeking information about schools +ask me if I like the Massachusetts public school my +children attend, my answer is always the same: "it's great...except for +math". The fact is that in our household we supplement our kids' math +education with significant extra curricular work in order to ensure +that they receive a math education comparable to what we received as +children in the public system. 
+ +The latest [results](http://www.businessinsider.com/pisa-worldwide-ranking-of-math-science-reading-skills-2016-12) from the Program for International Student Assessment (PISA) +show that there is a general problem with math education in the +US. Were it a country, Massachusetts would have been in second place +in reading, sixth in science, but 20th in math, only ten points above +the OECD average of 490. The US as a whole did not fare nearly as well +as MA, and the same discrepancy between math and the other two +subjects was present. In fact, among the top 30 performing +countries ranked by their average of science and reading scores, the +US has, by far, the largest discrepancy between math and +the other two subjects tested by PISA. The difference of 27 was +substantially greater than the second largest difference, +which came from Finland at 17. Massachusetts had a difference of 28. + + +![PISA 2015 Math minus average of science and reading](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-math-v-others.png) + + +If we look at the trend of this difference since PISA was started 16 +years ago, we see a disturbing progression. While science and reading +have +[remained stable, math has declined](http://www.artofteachingscience.org/wp-content/uploads/2013/12/Screen-Shot-2013-12-17-at-9.28.38-PM.png). In +2000 the difference between the results in math and the other subjects +was only 8.5. Furthermore, +the US is not performing exceptionally well in any subject: + +![PISA 2015 Math versus average of science and reading](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-scatter.png) + +So what is going on? I'd love to read theories in the comment +section. From my experience comparing my kids' public schools now +with those that I attended, I have one theory of my own. When I was a +kid there was a math textbook. Even when a teacher was bad, it +provided structure and an organized alternative for learning on your +own. Today this approach is seen as being "algorithmic" and has fallen +out of favor. "Project based learning" coupled with group activities has +become a popular replacement. + +Project based learning is great in principle. But, speaking from +experience, I can say it is very hard to come up with good projects, +even for highly trained mathematical minds. And it is certainly much +more time consuming for the instructor than following a +textbook. Teachers don't have more time now than they did 30 years ago +so it is no surprise that this new more open approach leads to +improvisation and mediocre lessons. A recent example of a pointless +math project involved 5th graders picking a number and preparing a +colorful poster showing "interesting" facts about this number. To +make things worse in terms of math skills, students are often rewarded +for effort, while correctness is secondary and often disregarded. + +Regardless of the reason for the decline, given the trends +we are seeing, we need to rethink the approach to math education. Math +education may have had its problems in the past, but recent evidence +suggests that the reforms of the past few decades seem to have +only worsened the situation. + +Note: To make these plots I downloaded and read the data into R as described [here](https://www.r-bloggers.com/pisa-2015-how-to-readprocessplot-the-data-with-r/). 
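As a small illustration of the gap statistic in the first figure, here is a sketch of the computation. The column names are hypothetical (not the actual layout of the PISA files described in the link above) and the values are rounded 2015 scores used only for illustration.

```r
# Sketch: the "math gap" from the first figure, i.e. math minus the average of
# reading and science. Rounded 2015 scores for a few places, illustration only;
# the real files and their layout are described in the R-bloggers post linked above.
library(ggplot2)

pisa <- data.frame(
  place   = c("Singapore", "Finland", "United States"),
  math    = c(564, 511, 470),
  reading = c(535, 526, 497),
  science = c(556, 531, 496)
)

# Negative values mean a country does worse in math than in the other two subjects
pisa$math_gap <- pisa$math - (pisa$reading + pisa$science) / 2

ggplot(pisa, aes(reorder(place, math_gap), math_gap)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Math minus average of reading and science")
```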
+ + + + diff --git a/_posts/2016-12-15-nssd-episode-28.md b/_posts/2016-12-15-nssd-episode-28.md new file mode 100644 index 0000000..6769968 --- /dev/null +++ b/_posts/2016-12-15-nssd-episode-28.md @@ -0,0 +1,24 @@ +--- +author: roger +layout: post +title: Not So Standard Deviations Episode 28 - Writing is a lot Harder than Just Talking +--- + +Hilary and I talk about building data science products that provide a good user experience while adhering to some kind of ground truth, whether it’s in medicine, education, news, or elsewhere. Also Gilmore Girls. + +If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at [@NSSDeviations](https://twitter.com/nssdeviations). + +Show notes: + +* [Hill’s criteria for causation](https://en.wikipedia.org/wiki/Bradford_Hill_criteria) +* [O’Reilly Bots Podcast](https://www.oreilly.com/topics/oreilly-bots-podcast) +* [NHTSA’s Federal Automated Vehicles Policy](http://www.nhtsa.gov/nhtsa/av/index.html) +* Subscribe to the podcast on [iTunes](https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570) or [Google Play](https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna). And please [leave us a review on iTunes](https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570). +* Support us through our [Patreon page](https://www.patreon.com/NSSDeviations?ty=h). +* Get the [Not So Standard Deviations book](https://leanpub.com/conversationsondatascience/). + + +[Download the audio for this episode](https://soundcloud.com/nssd-podcast/episode-28-writing-is-a-lot-harder-than-just-talking) + +Listen here: + \ No newline at end of file diff --git a/_posts/2016-12-16-the-four-eras-of-data.md b/_posts/2016-12-16-the-four-eras-of-data.md new file mode 100644 index 0000000..3ad9fcc --- /dev/null +++ b/_posts/2016-12-16-the-four-eras-of-data.md @@ -0,0 +1,37 @@ +--- +title: The four eras of data +author: jeff +layout: post +comments: true +--- + +I'm teaching [a class in data science](http://jtleek.com/advdatasci16/) for our masters and PhD students here at Hopkins. I've been teaching a variation on this class since 2011 and over time I've introduced a number of new components to the class: high-dimensional data methods (2011), data manipulation and cleaning (2012), real, possibly not doable data analyses (2012,2013), peer reviews (2014), building [swirl tutorials](http://swirlstats.com/) for data analysis techniques (2015), and this year building data analytic web apps/R packages. + +I'm the least efficient teacher in the world, probably because I'm very self conscious about my teaching. So I always feel like I have to completely re-do my lecture materials every year I teach the class (I know, I know I'm a dummy). This year I was reviewing my notes on high-dimensional data and I was looking at this breakdown of the three eras of statistics from Brad Efron's [book](http://statweb.stanford.edu/~ckirby/brad/other/2010LSIexcerpt.pdf): + +> 1. The age of Quetelet and his successors, in which huge census-level data +sets were brought to bear on simple but important questions: Are there +more male than female births? Is the rate of insanity rising? +2. The classical period of Pearson, Fisher, Neyman, Hotelling, and their +successors, intellectual giants who developed a theory of optimal inference +capable of wringing every drop of information out of a scientific +experiment. 
The questions dealt with still tended to be simple — Is treatment +A better than treatment B? — but the new methods were suited to +the kinds of small data sets individual scientists might collect. +3. The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data +sets of a size Quetelet would envy. But now the flood of data is accompanied +by a deluge of questions, perhaps thousands of estimates or +hypothesis tests that the statistician is charged with answering together; +not at all what the classical masters had in mind. + +While I think this is a useful breakdown, I realized I think about it in a slightly different way as a statistician. My breakdown goes more like this: + +1. __The era of not much data.__ This is everything prior to about 1995 in my field. The era when we could only collect a few measurements at a time. The whole point of statistics was to try to optimally squeeze information out of a small number of samples - so you see methods like maximum likelihood and minimum variance unbiased estimators being developed. +2. __The era of lots of measurements on a few samples.__ This one hit hard in biology with the development of the microarray and the ability to measure thousands of genes simultaneously. This is the same statistical problem as in the previous era but with a lot more noise added. Here you see the development of methods for multiple testing and regularized regression to separate signals from piles of noise. +3. __The era of a few measurements on lots of samples.__ This era overlaps to some extent with the previous one. Large-scale collections of data from EMRs and Medicare are examples where you have a huge number of people (samples) but a relatively modest number of variables measured. Here there is a big focus on statistical methods for knowing how to model different parts of the data with hierarchical models and separating signals of varying strength with model calibration. +4. __The era of all the data on everything.__ This is an era that currently we as civilians don't get to participate in. But Facebook, Google, Amazon, the NSA and other organizations have thousands or millions of measurements on hundreds of millions of people. Other than just sheer computing, I'm speculating that a lot of the problem is in segmentation (like in era 3) coupled with avoiding crazy overfitting (like in era 2). + +I've focused here on the implications of these eras from a statistical modeling perspective, but as we discussed in my class, era 4 coupled with advances in machine learning methods means that there are social, economic, and behavioral implications of these eras as well. 
+ + diff --git a/_posts/2016-12-20-noncomprehensive-list-of-awesome.md b/_posts/2016-12-20-noncomprehensive-list-of-awesome.md new file mode 100644 index 0000000..c158b3f --- /dev/null +++ b/_posts/2016-12-20-noncomprehensive-list-of-awesome.md @@ -0,0 +1,50 @@ +--- +title: A non-comprehensive list of awesome things other people did in 2016 +author: jeff +layout: post +comments: true +--- + +_Editor's note: For the last few years I have made a list of awesome things that other people did ([2015](http://simplystatistics.org/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015/), [2014](http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/), [2013](http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/)). Like in previous years I'm making a list, again right off the top of my head. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data._ + + +* Thomas Lin Pedersen created the [tweenr](https://github.com/thomasp85/tweenr) package for interpolating graphs in animations. Check out this awesome [logo](https://twitter.com/thomasp85/status/809896220906897408) he made with it. +* Yihui Xie is still blowing away everything he does. First it was [bookdown](https://bookdown.org/yihui/bookdown/) and then the yolo feature in [xaringan](https://github.com/yihui/xaringan) package. +* J Alammar built this great [visual introduction to neural networks](https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/) +* Jenny Bryan is working literal world wonders with legos to teach functional programming. I loved her [Data Rectangling](https://speakerdeck.com/jennybc/data-rectangling) talk. The analogy between exponential families and data frames is so so good. +* Hadley Wickham's book on [R for data science](http://r4ds.had.co.nz/) is everything you'd expect. Super clear, great examples, just a really nice book. +* David Robinson is a machine put on this earth to create awesome data science stuff. Here is [analyzing Trump's tweets](http://varianceexplained.org/r/trump-tweets/) and here he is on [empirical Bayes modeling explained with baseball](http://varianceexplained.org/r/hierarchical_bayes_baseball/). +* Julia Silge and David created the [tidytext](https://cran.r-project.org/web/packages/tidytext/index.html) package. This is a holy moly big contribution to NLP in R. They also have a killer [book on tidy text mining](http://tidytextmining.com/). +* Julia used the package to do this [fascinating post](http://juliasilge.com/blog/Reddit-Responds/) on mining Reddit after the election. +* It would be hard to pick just five different major contributions from JJ Allaire (great interview [here](https://www.rstudio.com/rviews/2016/10/12/interview-with-j-j-allaire/)), Joe Cheng, and the rest of the Rstudio folks. Rstudio is absolutely _churning_ out awesome stuff at a rate that is hard to keep up with. I loved [R notebooks](https://blog.rstudio.org/2016/10/05/r-notebooks/) and have used them extensively for teaching. 
+* Konrad Kording and Brett Mensh full on mike dropped on how to write a paper with their [10 simple rules piece](http://biorxiv.org/content/early/2016/11/28/088278). Figure 1 from that paper should be affixed to the office of every student/faculty in the world permanently. +* Yaniv Erlich just can't stop himself from doing interesting things like [seeq.io](https://seeq.io/) and [dna.land](https://dna.land/). +* Thomaz Berisa and Joe Pickrell set up a freaking [Python API for genomics projects](https://medium.com/the-seeq-blog/start-a-human-genomics-project-with-a-few-lines-of-code-dde90c4ef68#.g64meyjim). +* DataCamp continues to do great things. I love their [DataChats](https://www.datacamp.com/community/blog/an-interview-with-david-robinson-data-scientist-at-stack-overflow) series and they have been rolling out tons of new courses. +* Sean Rife and Michele Nuijten created [statcheck.io](http://statcheck.io/) for checking papers for p-value calculation errors. This was all over the press, but I just like the site as dummy proofing for myself. +* This was the artificial intelligence [tweet of the year](https://twitter.com/notajf/status/795717253505413122). +* I loved seeing PLoS Genetics start a policy of looking for papers in [biorxiv](http://blogs.plos.org/plos/2016/10/the-best-of-both-worlds-preprints-and-journals/). +* Matthew Stephens' [post](https://medium.com/@biostatistics/guest-post-matthew-stephens-on-biostatistics-pre-review-and-reproducibility-a14a26d83d6f#.usisi7kd3) on his preprint getting pre-accepted and reproducibility is also awesome. Preprints are so hot right now! +* Lorena Barba made this amazing [reproducibility syllabus](https://hackernoon.com/barba-group-reproducibility-syllabus-e3757ee635cf#.2orb46seg) then [won the Leamer-Rosenthal prize](https://twitter.com/LorenaABarba/status/809641955437051904) in open science. +* Colin Dewey continues to do just stellar stellar work, this time on [re-annotating genomics samples](http://biorxiv.org/content/early/2016/11/30/090506). This is one of the key open problems in genomics. +* I love FlowingData sooooo much. Here is one on [the changing American diet](http://flowingdata.com/2016/05/17/the-changing-american-diet/). +* If you like computational biology and data science and like _super_ detailed reports of meetings/talks, [Michael Hoffman](https://twitter.com/michaelhoffman) is your man. How he actually summarizes that much information in real time is still beyond me. +* I really really wish I had been at Alyssa Frazee's talk at startup.ml but loved this [review of it](http://www.win-vector.com/blog/2016/09/adversarial-machine-learning/). Sampling, inverse probability weighting? Love that stats flavor! +* I have followed Cathy O'Neil for a long time in her persona as [mathbabedotorg](https://twitter.com/mathbabedotorg) so it is no surprise to me that her new book [Weapons of Math Destruction](https://www.amazon.com/dp/B019B6VCLO/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1) is so good. One of the best works on the ethics of data out there. +* A related and very important piece is on [Machine bias in sentencing](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner at ProPublica. +* Dimitris Rizopoulos created this stellar [integrated Shiny app](http://iprogn.blogspot.com/2016/03/an-integrated-shiny-app-for-course-on.html) for his repeated measures class. I wish I could build things half this nice. 
+* Daniel Engber's piece on [Who will debunk the debunkers?](http://fivethirtyeight.com/features/who-will-debunk-the-debunkers/) at fivethirtyeight just keeps getting more relevant. +* I am rarely willing to watch a talk posted on the internet, but [Amelia McNamara's talk on seeing nothing](https://www.youtube.com/watch?v=hps9r7JZQP8) was an exception. Plus she talks so fast #jealous. +* Sherri Rose's post on [economic diversity in the academy](http://drsherrirose.com/economic-diversity-and-the-academy-statistical-science) focuses on statistics but should be required reading for anyone thinking about diversity. Everything about it is impressive. +* If you like your data science with a side of Python you should definitely be checking out Jake Vanderplas's [data science handbook](http://shop.oreilly.com/product/0636920034919.do) and the associated [Jupyter notebooks](https://github.com/jakevdp/PythonDataScienceHandbook). +* I love Thomas Lumley [being snarky](http://www.statschat.org.nz/2016/12/19/sauna-and-dementia/) about the stats news. It's a guilty pleasure. If he ever collected them into a book I'd buy it (hint Thomas :)). +* Dorothy Bishop's blog is one of the ones I read super regularly. Her post on [When is a replication a replication](http://deevybee.blogspot.com/2016/12/when-is-replication-not-replication.html) is just one example of her very clearly explaining a complicated topic in a sensible way. I find that so hard to do and she does it so well. +* Ben Goldacre's crowd is doing a bunch of interesting things. I really like their [OpenPrescribing](https://openprescribing.net/) project. +* I'm really excited to see what Elizabeth Rhodes does with the experimental design for the [Ycombinator Basic Income Experiment](http://blog.ycombinator.com/moving-forward-on-basic-income/). +* Lucy D'Agostino McGowan made this [amazing explanation](http://www.lucymcgowan.com/hill-for-data-scientists.html) of Hill's criteria using xkcd. +* It is hard to overstate how good Leslie McClure's blog is. This post on [biostatistics is public health](https://statgirlblog.wordpress.com/2016/09/16/biostatistics-is-public-health/) should be read aloud at every SPH in the US. +* The ASA's [statement on p-values](http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108) is a really nice summary of all the issues around a surprisingly controversial topic. Ron Wasserstein and Nicole Lazar did a great job putting it together. +* I really liked [this piece](http://jama.jamanetwork.com/article.aspx?articleId=2513561&guestAccessKey=4023ce75-d0fb-44de-bb6c-8a10a30a6173) on the relationship between income and life expectancy by Raj Chetty and company. +* Christie Aschwanden continues to be the voice of reason on the [statistical crises in science](http://fivethirtyeight.com/features/failure-is-moving-science-forward/). + +That's all I have for now; I know I'm missing things. Maybe my New Year's resolution will be to keep better track of the awesome things other people are doing :). 
diff --git a/_posts/2016-12-29-some-stress-reducers.md b/_posts/2016-12-29-some-stress-reducers.md new file mode 100644 index 0000000..134a671 --- /dev/null +++ b/_posts/2016-12-29-some-stress-reducers.md @@ -0,0 +1,32 @@ +--- +title: Some things I've found help reduce my stress around science +author: jeff +layout: post +comments: true +--- + +Being a scientist can be pretty stressful for any number of reasons, from the peer review process, to getting funding, to [getting blown up on the internet](http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/). + +Like a lot of academics I suffer from a lot of stress related to my own high standards and the imposter syndrome that comes from not meeting them on a regular basis. I was just reading through the excellent material in Lorena Barba's class on [essential skills in reproducibility](https://barbagroup.github.io/essential_skills_RRC/) and came across this [set of slides](http://www.stat.berkeley.edu/~stark/Seminars/reproNE16.htm#1) by Phillip Stark. The one that caught my attention said: + +> If I say just trust me and I'm wrong, I'm untrustworthy. +> If I say here's my work and it's wrong, I'm honest, human, and serving scientific progress. + +I love this quote because it shows how being open about both your successes and failures makes it less stressful to be a scientist. Inspired by this quote, I decided to make a list of things that I've learned through hard experience help me with my own imposter syndrome and help me to feel less stressed out about my science. + +1. _Put everything out in the open._ We release all of our software, data, and analysis scripts. This has led to almost exclusively positive interactions with people as they help us figure out good and bad things about our work. +2. _Admit mistakes quickly._ Since my code/data are out in the open I've had people find little bugs and big whoa-this-is-bad bugs in my code. I used to freak out when that happened. But I found the thing that minimizes my stress is to just quickly admit the error and submit updates/changes/revisions to code and papers as necessary. +3. _Respond to requests for support at my own pace._ I try to be as responsive as I can when people email me about software/data/code/papers of mine. I used to stress about doing this *right away* when I would get the emails. I still try to be prompt, but I don't let that dominate my attention/time. I also prioritize things that are wrong/problematic and then later handle the requests for free consulting that every open source person gets. +4. _Treat rejection as a feature not a bug._ This one is by far the hardest for me but preprints have helped a ton. The academic system is _designed_ to be critical. That is a good thing - skepticism is one of the key tenets of the scientific process. It took me a while to just plan on one or two rejections for each paper, one or two or more rejections for each grant, etc. But now that I plan on the rejection I find I can just focus on how to steadily move forward and constructively address criticism rather than taking it as a personal blow. +5. _Don't argue with people on the internet, especially on Twitter._ This is a new one for me and one I'm having to practice hard every single day. But I've found that I've had very few constructive debates on Twitter. I also found that this is almost purely negative energy for me and doesn't help me accomplish much. +6. 
_Redefine success._ I've found that if I recalibrate what success means to include accomplishing tasks like peer reviewing papers, getting letters of recommendation sent at the right times, providing support to people I mentor, and the submission rather than the success of papers/grants, then I'm much less stressed out. +7. _Don't compare myself to other scientists._ It is [very hard to get good evaluation in science](http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/) and I'm extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one-dimensional summary and compare myself to others there are always people who are "better" than me. I find I'm happier when I set internal, short term goals for myself and only compare myself to them. +8. _When comparing, at least pick a metric I'm good at._ I'd like to claim I never compare myself to others, but the reality is I do it more than I'd like. I've found one way to not stress myself out for my own internal comparisons is to pick metrics I'm good at - even if they aren't the "right" metrics. That way at least if I'm comparing I'm not hurting my own psyche. +9. _Let myself be bummed sometimes._ Some days despite all of that I still get the imposter syndrome feels and can't get out of the funk. I used to beat myself up about those days, but now I try to just build that into the rhythm of doing work. +10. _Try very hard to be positive in my interactions._ This is another hard one, because it is important to be skeptical/critical as a scientist. But I also try very hard to do that in as productive a way as possible. I try to assume other people are doing the right thing and I try very hard to stay positive or neutral when writing blog posts/opinion pieces, etc. +11. _Realize that giving credit doesn't take away from me._ In my research career I have worked with some extremely [generous](http://genomics.princeton.edu/storeylab/) [mentors](http://rafalab.github.io/). They taught me to always give credit whenever possible. I also learned from [Roger](http://www.biostat.jhsph.edu/~rpeng/) that you can give credit and not lose anything yourself, in fact you almost always gain. Giving credit is low cost but feels really good, so it is a nice way to help me feel better. + + +The last thing I'd say is that having a blog has helped reduce my stress, because sometimes I'm having a hard time getting going on my big project for the day and I can quickly write a blog post and still feel like I got something done... + + diff --git a/_posts/2017-01-09-nssd-episode-30.md b/_posts/2017-01-09-nssd-episode-30.md new file mode 100644 index 0000000..75dd0de --- /dev/null +++ b/_posts/2017-01-09-nssd-episode-30.md @@ -0,0 +1,31 @@ +--- +author: roger +layout: post +title: Not So Standard Deviations Episode 30 - Philately and Numismatology +--- + +Hilary and I follow up on open data and data sharing in government. We also discuss artificial intelligence, self-driving cars, and doing your taxes in R. + +If you have questions you'd like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at [@NSSDeviations](https://twitter.com/nssdeviations). 
+ +Show notes: + +* Lucy D'Agostino McGowan (@LucyStats) made a [great translation of Hill's criteria using XKCD comics](http://www.lucymcgowan.com/hill-for-data-scientists.html) + +* [Lucy's web page](http://www.lucymcgowan.com) + +* [Preparing for the Future of Artificial Intelligence](https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf) + +* [Partially Derivative White House Special – with DJ Patil, US Chief Data Scientist](http://12%20Dec%202016%20White%20House%20Special%20with%20DJ%20Patil,%20US%20Chief%20Data%20Scientist) + +* [Not So Standard Deviations – Standards are Like Toothbrushes – with Daniel Morgan, Chief Data Officer for the U.S. Department of Transportation and Terah Lyons, Policy Advisor to the Chief Technology Officer of the U.S.](https://soundcloud.com/nssd-podcast/episode-29-standards-are-like-toothbrushes) + +* [Henry Gitner Philatelists](http://www.hgitner.com) + +* [Some Pioneers of Modern Statistical Theory: A Personal Reflection by Sir David R. Cox](https://drive.google.com/file/d/0B678uTpUfn80a2RkOUc5LW51cVU/view?usp=sharing) + + +[Download the audio for this episode](https://soundcloud.com/nssd-podcast/episode-30-philately-and-numismatology) + +Listen here: + diff --git a/_posts/2017-01-17-effort-report-episode-23.md b/_posts/2017-01-17-effort-report-episode-23.md new file mode 100644 index 0000000..826d329 --- /dev/null +++ b/_posts/2017-01-17-effort-report-episode-23.md @@ -0,0 +1,23 @@ +--- +title: Interview with Al Sommer - Effort Report Episode 23 +author: roger +layout: post +--- + +My colleague [Elizabeth Matsui](https://twitter.com/elizabethmatsui) and I had a great opportunity to talk with Al Sommer on the [latest episode](http://effortreport.libsyn.com/23-special-guest-al-sommer) of our podcast [The Effort Report](http://effortreport.libsyn.com). Al is the former Dean of the Johns Hopkins Bloomberg School of Public Health and is Professor of Epidemiology and International Health at the School. He is (among other things) world renowned for his pioneering research in vitamin A deficiency and mortality in children. + +Al had some good bits of advice for academics on being successful in academia. + +> What you are excited about and interested in at the moment, you're much more likely to be successful at---because you're excited about it! So you're going to get up at 2 in the morning and think about it, you're going to be putting things together in ways that nobody else has put things together. And guess what? When you do that you're more successful [and] you actually end up getting academic promotions. + +On the slow rate of progress: + +> It took ten years, after we had seven randomized trials already to show that you get this 1/3 reduction in child mortality by giving them two cents worth of vitamin A twice a year. It took ten years to convince the child survival Nawabs of the world, and there are still some that don't believe it. + +On working overseas: + +> It used to be true [that] it's a lot easier to work overseas than it is to work here because the experts come from somewhere else. You're never an expert in your own home. + +You can listen to the entire episode here: + + \ No newline at end of file diff --git a/_posts/2017-01-18-data-prototyping-class.md b/_posts/2017-01-18-data-prototyping-class.md new file mode 100644 index 0000000..87b2b00 --- /dev/null +++ b/_posts/2017-01-18-data-prototyping-class.md @@ -0,0 +1,25 @@ +--- +title: Got a data app idea? 
Apply to get it prototyped by the JHU DSL! +author: jeff +layout: post +comments: true +--- + +![Get your app built](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/papr.png) + +Last fall we ran the first iteration of a class at the [Johns Hopkins Data Science Lab](http://jhudatascience.org/) where we teach students to build data web-apps using Shiny, R, GoogleSheets and a number of other technologies. Our goals were to teach students to build data products, to reduce friction for students who want to build things with data, and to help people solve important data problems with web and SMS apps. + +We are going to be running a second iteration of our program from March-June this year. We are looking for awesome projects for students to build that solve real world problems. We are particularly interested in projects that could have a positive impact on health but are open to any cool idea. We generally build apps that are useful for: + +* __Data donation__ - if you have a group of people you would like to have donate data to your project. +* __Data collection__ - if you would like to build an app for collecting data from people. +* __Data visualization__ - if you have a data set and would like to have a web app for interacting with the data. +* __Data interaction__ - if you have a statistical or machine learning model and you would like a web interface for it. + +But we are interested in any consumer-facing data product that you might be interested in having built. We want you to submit your wildest, most interesting ideas and we'll see if we can get them built for you. + +We are hoping to solicit a large number of projects and then build as many as possible. The best part is that we will build the prototype for you for free! If you have an idea of something you'd like built please submit it to this [Google form](https://docs.google.com/forms/d/1UPl7h8_SLw4zNFl_I9li_8GN14gyAEtPHtwO8fJ232E/edit?usp=forms_home&ths=true). + +Students in the class will select projects they are interested in during early March. We will let you know if your idea was selected for the program by mid-March. If you aren't selected you will have the opportunity to roll your submission over to our next round of prototyping. + +I'll be writing a separate post targeted at students, but if you are interested in being a data app prototyper, sign up [here](http://jhudatascience.org/prototyping_students.html). diff --git a/_posts/2017-01-19-what-is-artificial-intelligence.md b/_posts/2017-01-19-what-is-artificial-intelligence.md new file mode 100644 index 0000000..a989975 --- /dev/null +++ b/_posts/2017-01-19-what-is-artificial-intelligence.md @@ -0,0 +1,353 @@ +--- +title: What is artificial intelligence? A three part definition +author: jeff +layout: post +comments: true +--- + +_Editor's note: This is the first chapter of a book I'm working on called [Demystifying Artificial Intelligence](https://leanpub.com/demystifyai/). The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I'm developing the book over time - so if you buy the book on Leanpub know that there is only one chapter in there so far, but I'll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this [amazing tweet](https://twitter.com/notajf/status/795717253505413122) by Twitter user [@notajf](https://twitter.com/notajf/). 
Feedback is welcome and encouraged!_ + +What is artificial intelligence? +================================ + +> "If it looks like a duck and quacks like a duck but it needs +> batteries, you probably have the wrong abstraction" [Derick +> Bailey](https://lostechies.com/derickbailey/2009/02/11/solid-development-principles-in-motivational-pictures/) + +This book is about artificial intelligence. The term "artificial +intelligence" or "AI" has a long and convoluted history (Cohen and +Feigenbaum 2014). It has been used by philosophers, statisticians, +machine learning experts, mathematicians, and the general public. This +historical context means that when people say *artificial intelligence* +the term is loaded with one of many different potential meanings. + +Humanoid robots +--------------- + +Before we can demystify artificial intelligence it is helpful to have +some context for what the word means. When asked about artificial +intelligence, most people's imagination leaps immediately to images of +robots that can act like and interact with humans. Near-human robots +have long been a source of fascination for humans and have appeared in +cartoons like the *Jetsons* and science fiction like *Star Wars*. More +recently, subtler forms of near-human robots with artificial +intelligence have played roles in movies like *Her* and *Ex machina*. + +![People usually think of artificial intelligence as a human-like robot +performing all the tasks that a person could.](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/movie-ai.png) + +The type of artificial intelligence that can think and act like a human +is something that experts call artificial general intelligence +(Wikipedia contributors 2017a). + +> is the intelligence of a machine that could successfully perform any +> intellectual task that a human being can + +There is an understandable fascination and fear associated with robots, +created by humans, but evolving and thinking independently. While this +is a major area of research (Laird, Newell, and Rosenbloom 1987) and of +course the center of most people's attention when it comes to AI, there +is no near term possibility of this type of intelligence (Urban, n.d.). +There are a number of barriers to human-mimicking AI, from difficulty +with robotics (Couden 2015) to needed speedups in computational power +(Langford, n.d.). + +One of the key barriers is that most current forms of the computer +models behind AI are trained to do one thing really well, but cannot be +applied beyond that narrow task. There are extremely effective +artificial intelligence applications for translating between languages +(Wu et al. 2016), for recognizing faces in images (Taigman et al. 2014), +and even for driving cars (Santana and Hotz 2016). + +But none of these technologies are generalizable across the range of +tasks that most adult humans can accomplish. For example, the AI +application for recognizing faces in images cannot be directly applied +to drive cars and the translation application couldn't recognize a +single image. While some of the internal technology used in the +applications is the same, the final version of the applications can't be +transferred. This means that when we talk about artificial intelligence +we are not talking about a general purpose humanoid replacement. +Currently we are talking about technologies that can typically +accomplish one or two specific tasks that a human could accomplish. 
+ +Cognitive tasks +--------------- + +While modern AI applications couldn't do everything that an adult could +do (Baciu and Baciu 2016), they can perform individual tasks nearly as +well as a human. There is a second commonly used definition of +artificial intelligence that is considerably more narrow (Wikipedia +contributors 2017b) + +> ... the term "artificial intelligence" is applied when a machine +> mimics "cognitive" functions that humans associate with other human +> minds, such as "learning" and "problem solving". + +This definition encompasses applications like machine translation and +facial recognition. They are "cognitive" functions that are generally +only performed by humans. A difficulty with this definition is +that it is relative. People refer to machines that can do tasks we +thought only humans could do as artificial intelligence. But over time, +as we become used to machines performing a particular task, it is no +longer surprising and we stop calling it artificial intelligence. John +McCarthy, one of the leading early figures in artificial intelligence, +said (Vardi 2012): + +> As soon as it works, no one calls it AI anymore... + +As an example, when you send a letter in the mail, there is a machine +that scans the writing on the letter. A computer then "reads" the +characters on the front of the letter. The computer reads the characters +in several steps - the color of each pixel in the picture of the letter +is stored in a data set on the computer. Then the computer uses an +algorithm that has been built using thousands or millions of other +letters to take the pixel data and turn it into predictions of the +characters in the image. Then the characters are identified as +addresses, names, zip codes, and other relevant pieces of information. +Those are then stored in the computer as text which can be used for +sorting the mail (a toy numerical sketch of this pipeline appears below). + +This task used to be considered "artificial intelligence" (Pavlidis, +n.d.). It was surprising that a computer could perform the tasks of +recognizing characters and addresses just based on a picture of the +letter. This task is now called "optical character recognition" +(Wikipedia contributors 2016). Many tutorials on the algorithms behind +machine learning begin with this relatively simple task (Google +Tensorflow Team, n.d.). Optical character recognition is now used in a +wide range of applications, including Google's effort to digitize +millions of books (Darnton 2009). + +Since this type of algorithm has become so common it is no longer called +"artificial intelligence". This transition happened because we no longer +think it is surprising that computers can do this task - so it is no +longer considered intelligent. This process has played out with a number +of other technologies. Initially it is thought that only a human can do +a particular cognitive task. As computers become increasingly proficient +at that task they are called artificially intelligent. Finally, when +that task is performed almost exclusively by computers it is no longer +considered "intelligent" and the boundary moves. + +![Timeline of tasks we were surprised that computers could do as well as +humans.](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/timeline-ai.png) + +Over the last two decades tasks from optical character recognition, to +facial recognition in images, to playing chess have started as +artificially intelligent applications. 
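+
+To make the mail-sorting example above a little more concrete, here is a toy
+sketch in R. Everything in it is made up for illustration - the "images" are
+tiny 3x3 grids of pixel intensities and the two reference letters are
+hand-built templates, which is not how real optical character recognition
+systems are trained - but the basic flow is the one described above: pixel
+intensities go in as numbers, and a predicted character comes out.
+
+```r
+# Two made-up reference "letters", each a 3x3 grid of pixels (1 = dark, 0 = light)
+templates <- list(
+  "C" = matrix(c(1, 1, 1,
+                 1, 0, 0,
+                 1, 1, 1), nrow = 3, byrow = TRUE),
+  "L" = matrix(c(1, 0, 0,
+                 1, 0, 0,
+                 1, 1, 1), nrow = 3, byrow = TRUE)
+)
+
+# A new scanned character, already stored as numbers the computer can work with
+new_image <- matrix(c(1, 0.1, 0,
+                      1, 0.0, 0,
+                      1, 1.0, 0.9), nrow = 3, byrow = TRUE)
+
+# "Read" the character by finding the template whose pixels are closest
+distances <- sapply(templates, function(template) sum(abs(template - new_image)))
+predicted_character <- names(which.min(distances))
+predicted_character  # "L"
+```
+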
At the time of this writing there +are a number of technologies that are currently on the boundary between +doable only by a human and doable by a computer. These are the tasks +that are considered AI when you read about the term in the media. +Examples of tasks that are currently considered "artificial +intelligence" include: + +- Computers that can drive cars +- Computers that can identify human faces from pictures +- Computers that can translate text from one language to another +- Computers that can label pictures with text descriptions + +Just as it used to be with optical character recognition, self-driving +cars and facial recognition are tasks that still surprise us when +performed by a computer. So we still call them artificially intelligent. +Eventually, many or most of these tasks will be performed nearly +exclusively by computers and we will no longer think of them as +components of computer "intelligence". To go a little further we can +think about any task that is repetitive and performed by humans. For +example, picking out music that you like or helping someone buy +something at a store. An AI can eventually be built to do those tasks +provided that: (a) there is a way of measuring and storing information +about the tasks and (b) there is technology in place to perform the task +if given a set of computer instructions. + +The more narrow definition of AI is used colloquially in the news to +refer to new applications of computers to perform tasks previously +thought impossible. It is important to know both the definition of AI +used by the general public and the more narrow and relative definition +used to describe modern applications of AI by companies like Google and +Facebook. But neither of these definitions is satisfactory to help +demystify the current state of artificial intelligence applications. + +A three part definition +----------------------- + +The first definition describes a technology that we are not currently +faced with - fully functional general purpose artificial intelligence. +The second definition suffers from the fact that it is relative to the +expectations of people discussing applications. For this book, we need a +definition that is concrete, specific, and doesn't change with societal +expectations. + +We will consider specific examples of human-like tasks that computers +can perform. So we will use the definition that artificial intelligence +requires the following components: + +1. *The data set* : A set of data examples that can be used to train a + statistical or machine learning model to make predictions. +2. *The algorithm* : An algorithm that can be trained based on the data + examples to take a new example and execute a human-like task. +3. *The interface* : An interface for the trained algorithm to receive + a data input and execute the human-like task in the real world. + +This definition encompasses optical character recognition and all the +more modern examples like self-driving cars. It is also intentionally +broad, covering even examples where the data set is not large or the +algorithm is not complicated. We will use our definition to break down +modern artificial intelligence applications into their constituent +parts and make it clear how the computer represents knowledge learned +from data examples and then applies that knowledge. + +As one example, consider Amazon Echo and Alexa - an application +currently considered to be artificially intelligent (Nuñez, n.d.). 
This +combination meets our definition of artificially intelligent since each +of the components is in place. + +1. *The data set* : The large set of data examples consists of all the + recordings that Amazon has collected of people talking to their + Amazon devices. +2. *The machine learning algorithm* : The Alexa voice service (Alexa + Developers 2016) is a machine learning algorithm trained using the + previous recordings of people talking to Amazon devices. +3. *The interface* : The interface is the Amazon Echo (Amazon Inc 2016), + a speaker that can record humans talking to it and respond with + information or music. + +![The three parts of an artificial intelligence illustrated with Amazon +Echo and Alexa](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/alexa-ai.png) + +When we break down artificial intelligence into these steps it makes it +clearer why there has been such a sudden explosion of interest in +artificial intelligence over the last several years. + +First, the cost of data storage and collection has gone down steadily +(Irizarry, n.d.) but dramatically (Quigley, n.d.) over the last several +years. As the costs have come down, it is increasingly feasible for +companies, governments, and even individuals to store large collections +of data (Component 1 - *The Data*). Taking advantage of these huge +collections of data requires incredibly flexible statistical or machine +learning algorithms that can capture most of the patterns in the data +and re-use them for prediction. The most common type of algorithm used +in modern artificial intelligence is something called a "deep neural +network". These algorithms are so flexible they capture nearly all of +the important structure in the data. They can only be trained well if +huge data sets exist and computers are fast enough. Continual increases +in computing speed and power over the last several decades now make it +possible to apply these models to these huge collections of data (Component 2 - +*The Algorithm*). + +Finally, the most underappreciated component of the AI revolution does +not have to do with data or machine learning. Rather, it is the +development of new interfaces that allow people to interact directly +with machine learning models. For a number of years now, if you were an +expert with statistical and machine learning software it has been +possible to build highly accurate predictive models. But if you were a +person without technical training it was not possible to directly +interact with algorithms. + +Or as statistical experts Diego Kuonen and Rafael Irizarry have put it: + +> The big in big data refers to importance, not size + +![It isn't about how much data you have, it is about how many people you +can get to use it.](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/importance-not-size.jpg) + +The explosion of interfaces for regular, non-technical people to +interact with machine learning is an underappreciated driver of the AI +revolution of the last several years. Artificial intelligence can now +power labeling friends on Facebook, parsing your speech for your personal +assistant Siri or Google Assistant, and providing you with directions in +your car or when you talk to your Echo. More recently sensors and +devices make it possible for the instructions created by a computer to +steer and drive a car. + +These interfaces now make it possible for hundreds of millions of people +to directly interact with machine learning algorithms. 
These algorithms +can range from exceedingly simple to mind bendingly complex. But the +common result is that the interface allows the computer to perform a +human-like action and makes it look like artificial intelligence to the +person on the other side. This interface explosion only promises to +accelerate as we are building sensors for both data input and behavior +output in objects from phones to refrigerators to cars (Component 3 - +*The interface*). + +This definition of artificial intelligence in three components will +allow us to demystify artificial intelligence applications from self +driving cars to facial recognition. Our goal is to provide a high-level +interface to the current conception of AI and how it can be applied to +problems in real life. It will include discussion and references to the +sophisticated models and data collection methods used by Facebook, +Tesla, and other companies. However, the book does not assume a +mathematical or computer science background and will attempt to explain +these ideas in plain language. Of course, this means that some details +will be glossed over, so we will attempt to point the interested reader +toward more detailed resources throughout the book. + +References +---------- + +Alexa Developers. 2016. “Alexa Voice Service.” +. + +Amazon Inc. 2016. “Amazon Echo.” +. + +Baciu, Assaf, and Assaf Baciu. 2016. “Artificial Intelligence Is More +Artificial Than Intelligent.” *Wired*, 7~dec. + +Cohen, Paul R, and Edward A Feigenbaum. 2014. *The Handbook of +Artificial Intelligence*. Vol. 3. Butterworth-Heinemann. +. + +Couden, Craig. 2015. “Why It’s so Hard to Make Humanoid Robots | Make:” +. + +Darnton, Robert. 2009. *Google & the Future of Books*. na. + +Google Tensorflow Team. n.d. “MNIST for ML Beginners | TensorFlow.” +. + +Irizarry, Rafael. n.d. “The Big in Big Data Relates to Importance Not +Size · Simply Statistics.” +. + +Laird, John E, Allen Newell, and Paul S Rosenbloom. 1987. “Soar: An +Architecture for General Intelligence.” *Artificial Intelligence* 33 +(1). Elsevier: 1–64. + +Langford, John. n.d. “AlphaGo Is Not the Solution to AI « Machine +Learning (Theory).” . + +Nuñez, Michael. n.d. “Amazon Echo Is the First Artificial Intelligence +You’ll Want at Home.” +. + +Pavlidis, Theo. n.d. “Computers Versus Humans - 2002 Lecture.” +. + +Quigley, Robert. n.d. “The Cost of a Gigabyte over the Years.” +. + +Santana, Eder, and George Hotz. 2016. “Learning a Driving Simulator,” +3~aug. + +Taigman, Y, M Yang, M Ranzato, and L Wolf. 2014. “DeepFace: Closing the +Gap to Human-Level Performance in Face Verification.” In *2014 IEEE +Conference on Computer Vision and Pattern Recognition*, 1701–8. + +Urban, Tim. n.d. “The AI Revolution: How Far Away Are Our Robot +Overlords?” +. + +Vardi, Moshe Y. 2012. “Artificial Intelligence: Past and Future.” +*Commun. ACM* 55 (1). New York, NY, USA: ACM: 5–5. + +Wikipedia contributors. 2016. “Optical Character Recognition.” +. + +———. 2017a. “Artificial General Intelligence.” +. + +———. 2017b. “Artificial Intelligence.” +. + +Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, +Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine +Translation System: Bridging the Gap Between Human and Machine +Translation,” 26~sep. 
diff --git a/_posts/2017-01-20-not-artificial-not-intelligent.md b/_posts/2017-01-20-not-artificial-not-intelligent.md new file mode 100644 index 0000000..c6da41f --- /dev/null +++ b/_posts/2017-01-20-not-artificial-not-intelligent.md @@ -0,0 +1,275 @@ +--- +title: An example that isn't that artificial or intelligent +author: jeff +layout: post +comments: true +--- + +_Editor's note: This is the second chapter of a book I'm working on called [Demystifying Artificial Intelligence](https://leanpub.com/demystifyai/). The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I'm developing the book over time - so if you buy the book on Leanpub know that there are only two chapters in there so far, but I'll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this [amazing tweet](https://twitter.com/notajf/status/795717253505413122) by Twitter user [@notajf](https://twitter.com/notajf/). Feedback is welcome and encouraged!_ + + +> "I am so clever that sometimes I don't understand a single word of +> what I am saying." Oscar Wilde + +As we have described it, artificial intelligence applications consist of +three things: + +1. A large collection of data examples +2. An algorithm for learning a model from that training set. +3. An interface with the world. + +In the following chapters we will go into each of these components in +much more detail, but let's start with a couple of very simple examples +to make sure that the components of an AI are clear. We will start with +a completely artificial example and then move to more complicated +examples. + +Building an album +---------------- + +Let's start with a very simple hypothetical example that can be +understood even if you don't have a technical background. We can also +use this example to define some of the terms we will be discussing later +in the book. + +In our simple example the goal is to make an album of photos for a +friend. For example, suppose I want to take the photos in my photobook +and find all the ones that include pictures of myself and my son Dex for +his grandmother. + +![The author's drawing of the author's phone album. Don't make fun, he's +a data scientist, not an artist](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/cartoon-phone-photos.png) + +If you are anything like the author of this book, then you probably have +a very large number of pictures of your family on your phone. So the +first step in making the photo album would be to sort through all of my +pictures and pick out the ones that should be part of the album. + +This is a typical example of the type of thing we might want to train a +computer to do in an artificial intelligence application. Each of the +components of an AI application is there: + +1. **The data**: all of the pictures on the author's phone (a big + training set!) +2. **The algorithm**: finding pictures of me and my son Dex +3. **The interface**: the album to give to Dex's grandmother. + +One way to solve this problem is for me to sort through the pictures one +by one and decide whether they should be in the album or not, then +assemble them together, and then put them into the album. If I did it +like this then I myself would be the AI! That wouldn't be very +artificial though...imagine we instead wanted to teach a computer to +make this album. 
+ +> But what does it mean to "teach" a computer to do something? + +The terms "machine learning" and "artificial intelligence" invoke the +idea of teaching computers in the same way that we teach children. This +was a deliberate choice to make the analogy - both because in some ways +it is appropriate and because it is useful for explaining complicated +concepts to people with limited backgrounds. To teach a child to find +pictures of the author and his son, you would show her lots of examples +of that type of picture and maybe some examples of the author with other +kids who were not his son. You'd repeat to the child that the pictures +of the author and his son were the kinds you wanted and the others +weren't. Eventually she would retain that information and if you gave +her a new picture she could tell you whether it was the right kind or +not. + +To teach a machine to perform the same kind of recognition you go +through a similar process. You "show" the machine many pictures labeled +as either the ones you want or not. You repeat this process until the +machine "retains" the information and can correctly label a new photo. +Getting the machine to "retain" this information is a matter of getting +the machine to create a set of step by step instructions it can apply to +go from the image to the label that you want. + +The data +-------- + +The images are what people in the fields of artificial intelligence and +machine learning call *"raw data"* (Leek, n.d.). The categories of +pictures (a picture of the author and his son or a picture of something +else) are called the *"labels"* or *"outcomes"*. If the computer gets to +see the labels when it is learning then it is called *"supervised +learning"* (Wikipedia contributors 2016) and when the computer doesn't +get to see the labels it is called *"unsupervised learning"* (Wikipedia +contributors 2017a). + +Going back to our analogy with the child, supervised learning would be +teaching the child to recognize pictures of the author and his son +together. Unsupervised learning would be giving the child a pile of +pictures and asking them to sort them into groups. They might sort them +by color or subject or location - not necessarily into categories that +you care about. But probably one of the categories they would make would +be pictures of people - so she would have found some potentially useful +information even if it wasn't exactly what you wanted. One whole field +of artificial intelligence is figuring out how to use the information +learned in this "unsupervised" setting and using it for supervised tasks +- this is sometimes called *"transfer learning"* (Raina et al. 2007) by +people in the field since you are transferring information from one task +to another. + +Returning to the task of "teaching" a computer to retain information +about what kind of pictures you want we run into a problem - computers +don't know what pictures are! They also don't know what audio clips, +text files, videos, or any other kind of information is. At least not +directly. They don't have eyes, ears, and other senses along with a +brain designed to decode the information from these senses. + +So what can a computer understand? A good rule of thumb is that a +computer works best with numbers. If you want a computer to sort +pictures into an album for you, the first thing you need to do is to +find a way to turn all of the information you want to "show" the +computer into numbers. 
In the case of sorting pictures into albums - a +supervised learning problem - we need to turn the labels and the images +into numbers the computer can use. + +![Label each picture as a one or a zero depending on whether it is the +kind of picture you want in the album](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/labels-to-numbers.png) + +One way to do that would be for you to do it for the computer. You could +take every picture on your phone and label it with a 1 if it was a +picture of the author and his son and a 0 if not. Then you would have a +set of 1's and 0's corresponding to all of the pictures. This takes +something the computer can't understand (the picture) and turns it into +something the computer can understand (the label). + +While this process would turn the labels into something a computer could +understand, it still isn't something we could teach a computer to do. +The computer can't actually "look" at the image and doesn't know who the +author or his son are. So we need to figure out a way to turn the images +into numbers for the computer to use to generate those labels directly. + +This is a little more complicated but you could still do it for the +computer. Let's suppose that the author and his son always wear matching +blue shirts when they spend time together. Then you could go through and +look at each image and decide what fraction of the image is blue. So +each picture would get a number ranging from zero to one like 0.30 if +the picture was 30% blue and 0.53 if it was 53% blue. + +![Calculate the fraction of each image that is the color blue as a +"feature" of the image that is numeric](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/images-to-numbers.png) + +The fraction of the picture that is blue is called a *"feature"* and the +process of creating that feature is called *"feature engineering"* +(Wikipedia contributors 2017b). Until very recently feature engineering +of text, audio, or video files was best performed by an expert human. In +later chapters we will discuss how one of the most exciting parts about +AI applications is that it is now possible to have computers perform +feature engineering for you. + +The algorithm +------------- + +Now that we have converted the images to numbers and the labels to +numbers, we can talk about how to "teach" a computer to label the +pictures. A good rule of thumb when thinking about algorithms is that a +computer can't "do" anything without being told very explicitly what to +do. It needs a step-by-step set of instructions. The instructions should +start with a calculation on the numbers for the image and should end +with a prediction of what label to apply to that image. The image +(converted to numbers) is the *"input"* and the label (also a number) is +the *"output"*. You may have heard the phrase: + +> "Garbage in, garbage out" + +What this phrase means is that if the inputs (the images) are bad - say +they are all very dark or hard to see - then the output of the algorithm +will also be bad - the predictions won't be very good. + +A machine learning *"algorithm"* can be thought of as a set of +instructions with some of the parts left blank - sort of like mad-libs. +One example of a really simple algorithm for sorting pictures into the +album would be: + +> 1. Calculate the fraction of blue in the image. +> 2. If the fraction of blue is above *X* label it 1 +> 3. If the fraction of blue is less than *X* label it 0 +> 4. 
Put all of the images labeled 1 in the album + +The machine *"learns"* by using the examples to fill in the blanks in +the instructions. In the case of our really simple algorithm we need to +figure out what fraction of blue to use (*X*) for labeling the picture. + +To figure out a guess for *X* we need to decide what we want the +algorithm to do. If we set *X* to be too low then all of the images will +be labeled with a 1 and put into the album. If we set *X* to be too high +then all of the images will be labeled 0 and none will appear in the +album. In between there is some grey area - do we care if we +accidentally get some pictures of the ocean or the sky with our +algorithm? + +But the number of images in the album isn't even the thing we really +care about. What we might care about is making sure that the album is +mostly pictures of the author and his son. In the field of AI they +usually turn this statement around - we want to make sure the album has +a very small fraction of pictures that are not of the author and his +son. This fraction - the fraction of pictures that are incorrectly placed in the +album - is called the *"loss"*. You can think about it like a game where +the computer loses a point every time it puts the wrong kind of picture +into the album. + +Using our loss (how many pictures we incorrectly placed in the album) we +can now use the data we have created (the numbers for the labels and the +images) to fill in the blanks in our mad-lib algorithm (picking the +cutoff on the amount of blue). We have a large number of pictures where +we know what fraction of each picture is blue and whether it is a +picture of the author and his son or not. We can try each possible *X* +and calculate the fraction of pictures in the album that are incorrectly +placed into the album (the loss) and find the *X* that produces the +smallest fraction (a small R sketch of this whole recipe appears at the +end of this chapter). + +Suppose that the value of *X* that gives the smallest fraction of wrong +pictures in the album is 0.30. Then our "learned" model would be: + +> 1. Calculate the fraction of blue in the image +> 2. If the fraction of blue is above 0.30 label it 1 +> 3. If the fraction of blue is less than 0.30 label it 0 +> 4. Put all of the images labeled 1 in the album + +The interface +------------- + +The last part of an AI application is the interface. In this case, the +interface would be the way that we share the pictures with Dex's +grandmother. For example we could imagine uploading the pictures to +[Shutterfly](https://www.shutterfly.com/) and having the album delivered +to Dex's grandmother. + +Putting this all together we could imagine an application using our +trained AI. The author uploads his unlabeled photos. The photos are then +passed to the computer program which calculates the fraction of the +image that is blue, then applies a label according to the algorithm we +learned, then takes all the images predicted to be of the author and his +son and sends them off to be a Shutterfly album mailed to the author's +mother. + +![Whoa that computer is smart - from the author's picture to grandma's +hands!](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ai-album.png) + +If the algorithm was good, then from the perspective of the author the +website would look "intelligent". I just uploaded pictures and it +created an album for me with the pictures that I wanted. But the steps +in the process were very simple and understandable behind the scenes. 
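+
+For readers who want to see the whole recipe written down in one place, here is
+a small sketch in R. The photos, blue fractions, and labels below are all
+simulated stand-ins (there is no real image data here), but the steps mirror
+the chapter: turn the labels and the feature into numbers, define the loss, try
+every possible cutoff *X*, and keep the one with the smallest loss.
+
+```r
+set.seed(2017)
+
+# Simulated stand-ins for the photos on the author's phone: the fraction of
+# each photo that is blue (the "feature") and whether it really is a photo of
+# the author and Dex (the "label").
+n_photos <- 100
+is_dex_photo <- rbinom(n_photos, size = 1, prob = 0.4)
+blue_fraction <- ifelse(is_dex_photo == 1,
+                        runif(n_photos, 0.25, 0.60),  # matching blue shirts
+                        runif(n_photos, 0.00, 0.35))  # everything else
+
+# The loss, as defined above: the fraction of photos placed in the album that
+# should not be there (an empty album is treated as the worst possible answer).
+album_loss <- function(cutoff) {
+  in_album <- blue_fraction > cutoff
+  if (sum(in_album) == 0) return(1)
+  mean(is_dex_photo[in_album] == 0)
+}
+
+# "Learning" is just trying many possible cutoffs X and keeping the best one
+cutoffs <- seq(0, 1, by = 0.01)
+losses <- sapply(cutoffs, album_loss)
+best_X <- cutoffs[which.min(losses)]
+best_X
+
+# The learned model applied to the photos: label 1 goes in the album
+album <- which(blue_fraction > best_X)
+```
+
+On this simulated data the chosen cutoff is simply the smallest value of *X*
+that keeps every non-Dex photo out of the album; with real photos the loss
+would rarely drop all the way to zero.
+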
+References +------------- + +Leek, Jeffrey. n.d. “The Elements of Data Analytic Style.” +[https://leanpub.com/datastyle](https://leanpub.com/datastyle). + +Raina, Rajat, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y +Ng. 2007. “Self-Taught Learning: Transfer Learning from Unlabeled Data.” +In *Proceedings of the 24th International Conference on Machine +Learning*, 759–66. ICML ’07. New York, NY, USA: ACM. + +Wikipedia contributors. 2016. “Supervised Learning.” +. + +———. 2017a. “Unsupervised Learning.” +. + +———. 2017b. “Feature Engineering.” +. diff --git a/_posts/2017-01-23-ux-value.md b/_posts/2017-01-23-ux-value.md new file mode 100644 index 0000000..a920f70 --- /dev/null +++ b/_posts/2017-01-23-ux-value.md @@ -0,0 +1,64 @@ +--- +title: User Experience and Value in Products - What Regression and Surrogate Variables can Teach Us +author: roger +layout: post +--- + +Over the past year, there have been a number of recurring topics in my global news feed that have a shared theme to them. Some examples of these topics are: + +* **Fake news**: Before and after the election in 2016, Facebook (or Facebook's Trending News algorithm) was accused of promoting news stories that turned out to be completely false, promoted by dubious news sources in FYROM and elsewhere. +* **Theranos**: This diagnostic testing company promised to revolutionize the blood testing business and prevent disease for all by making blood testing simple and painless. This way people would not be afraid to get blood tests and would do them more often, presumably catching diseases while they were in the very early stages. Theranos lobbied to allow patients to order their own blood tests so that they wouldn't need a doctor's order. +* **Homeopathy**: This is a so-called [alternative medical system](https://nccih.nih.gov/health/homeopathy) developed in the late 18th century based on notions such as "like cures like" and the "law of minimum dose". +* **Online education**: New companies like Coursera and Udacity promised to revolutionize education by making it accessible to a broader audience than conventional universities could reach. + +What exactly do these things have in common? + +First, consumers love them. Fake news played to people's biases by confirming to them, from a seemingly trustworthy source, what they always "knew to be true". The fact that the stories weren't actually true was irrelevant given that users enjoyed the experience of seeing what they agreed with. Perhaps the best explanation of the entire Facebook fake news issue was from Kim-Mai Cutler: + + + +Theranos promised to revolutionize blood testing and change the user experience behind the whole industry. Indeed the company had some fans (particularly amongst its [investor base](https://www.axios.com/tim-drapers-keeps-defending-theranos-2192078259.html)). However, after investigations by the Center for Medicare and Medicaid Services, the FDA, and an independent laboratory, it was found that Theranos's blood testing machine was wildly inconsistent and variable, leading to Theranos ultimately retracting all of its blood test results and cutting half its workforce. + +Homeopathy is not company specific, but is touted by many as an "alternative" treatment for many diseases, with many claiming that it "works for them". However, the NIH states quite clearly on its [web site](https://nccih.nih.gov/health/homeopathy) that "There is little evidence to support homeopathy as an effective treatment for any specific condition." 
+ +Finally, companies like Coursera and Udacity in the education space have indeed produced products that people like, but in some instances have hit bumps in the road. Udacity conducted a brief experiment/program with San Jose State University that failed due to the large differences between the population that took online courses and the one that took them in person. Coursera has massive offerings from major universities (including my own) but has run into continuing [challenges with drop out](http://www.economist.com/news/special-report/21714173-alternative-providers-education-must-solve-problems-cost-and) and questions over whether the courses offered are suitable for job placement. + +## User Experience and Value + +In each of these four examples there is a consumer product that people love, often because they provide a great user experience. Take the fake news example--people love to read headlines from "trusted" news sources that agree with what they believe. With Theranos, people love to take a blood test that is not painful (maybe "love" is the wrong word here). With many consumer products companies, it is the user experience that defines the value of a product. Often when describing the user experience, you are simultaneously describing the value of the product. + +Take for example Uber. With Uber, you open an app on your phone, click a button to order a car, watch the car approach you on your phone with an estimate of how long you will be waiting, get in the car and go to your destination, and get out without having to deal with paying. If someone were to ask me "What's the value of Uber?" I would probably just repeat the description in the previous sentence. Isn't it obvious that it's better than the usual taxi experience? The same could be said for many companies that have recently come up: Airbnb, Amazon, Apple, Google. With many of the products from these companies, *the description of the user experience is a description of its value*. + +## Disruption Through User Experience + +In the example of Uber (and Airbnb, and Amazon, etc.) you could depict the relationship between the product, the user experience, and the value as such: + +![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux1.png) + +Any changes that you can make to the product to improve the user experience will then improve the value that the product offers. Another way to say it is that the user experience serves as a *surrogate outcome* for the value. We can influence the UX and know that we are improving value. Furthermore, any measurements that we take on the UX (surveys, focus groups, app data) will serve as direct observations on the value provided to customers. + +New companies in these kinds of consumer product spaces can disrupt the incumbents by providing a much better user experience. When incumbents have gotten fat and lazy, there is often a sizable segment of the customer base that feels underserved. That's when new companies can swoop in to specifically serve that segment, often with a "worse" product overall (as in fewer features) and usually much cheaper. The Internet has made the "swooping in" much easier by [dramatically reducing transaction and distribution costs](https://stratechery.com/2015/netflix-and-the-conservation-of-attractive-profits/). Once the new company has a foothold, they can gradually work their way up the ladder of customer segments to take over the market. 
It's classic disruption theory a la [Clayton Christensen](http://www.claytonchristensen.com). + + +## When Value Defines the User Experience and Product + +There has been much talk of applying the classic disruption model to every space imaginable, but I contend that not all product spaces are the same. In particular, the four examples I described in the beginning of this post cover some of those different areas: + +* Medicine (Theranos, homeopathy) +* News (Facebook/fake news) +* Education (Coursera/Udacity) + +One thing you'll notice about these areas, particularly with medicine and education, is that they are all heavily regulated. The reason is because we as a community have decided that there is a minimum level of value that is required to be provided by entities in this space. That is, the value that a product offers is *defined first*, before the product can come to market. Therefore, the value of the product actually constrains the space of products that can be produced. We can depict this relationship as such: + +![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux2.png) + +In classic regression modeling language, the value of a product must be "adjusted for" before examining the relationship between the product and the user experience. Naturally, as in any regression problem, when you adjust for a variable that is related to the product and the user experience, you reduce the overall variation in the product (a small simulated sketch of this idea appears at the end of this post). + +In situations where the value defines the product and the user experience, there is much less room to maneuver for new entrants in the market. The reason is because they, like everyone else, are constrained by the value that is agreed upon by the community, usually in the form of regulations. + +When Theranos comes in and claims that it's going to dramatically improve the user experience of blood testing, that's great, but they must be constrained by the value that society demands, which is a certain precision and accuracy in its testing results. Companies in the online education space are welcome to disrupt things by providing a better user experience. Online offerings in fact do this by allowing students to take classes according to their own schedule, wherever they may live in the world. But we still demand that the students learn an agreed-upon set of facts, skills, or lessons. + +New companies will often argue that the things that we currently value are outdated or no longer valuable. Their incentive is to change the value required so that there is more room for new companies to enter the space. This is a good thing, but it's important to realize that this cannot happen solely through changes in the product. Innovative features of a product may help us to understand that we should be valuing different things, but ultimately the change in what we perceive as value occurs independently of any given product. + +When I see new companies enter the education, medicine, or news areas, I always hesitate a bit because I want some assurance that they will still provide the value that we have come to expect. In addition, with these particular areas, there is a genuine sense that failing to deliver on what we value could cause serious harm to individuals. However, I think the discussion that is provoked by new companies entering the space is always welcome because we need to constantly re-evaluate what we value and whether it matches the needs of our time. 
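+
+To make the "adjusted for" analogy from earlier in this post a bit more
+tangible, here is a small simulated sketch in R. Every variable in it is made
+up - "value" stands in for whatever a regulated space demands (say, the
+accuracy of a blood test) - but it illustrates the point in the regression
+paragraph above: once value is adjusted for, far less variation in the product
+remains, and the apparent product-UX relationship changes.
+
+```r
+set.seed(538)
+n <- 1000
+
+# "Value" is the outcome society insists on (e.g., how accurate a test is)
+value <- rnorm(n)
+
+# In a regulated space the product is largely pinned down by that value...
+product <- 0.9 * value + rnorm(n, sd = 0.3)
+
+# ...and the user experience is related to both the product and the value
+ux <- 0.5 * product + 0.3 * value + rnorm(n)
+
+# Total variation in the product, and what is left after adjusting for value
+var(product)
+var(resid(lm(product ~ value)))
+
+# The product-UX relationship, before and after adjusting for value
+coef(lm(ux ~ product))
+coef(lm(ux ~ product + value))
+```
+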
+ diff --git a/_posts/2017-01-26-new-prototyping-class.md b/_posts/2017-01-26-new-prototyping-class.md new file mode 100644 index 0000000..e98d520 --- /dev/null +++ b/_posts/2017-01-26-new-prototyping-class.md @@ -0,0 +1,32 @@ +--- +title: New class - Data App Prototyping for Public Health and Beyond +author: jeff +layout: post +comments: true +--- + + +Are you interested in building data apps to help save the world, start the next big business, or just to see if you can? We are running a data app prototyping class for people interested in creating these apps. + +This will be a special topics class at JHU and is open to any undergrad student, grad student, postdoc, or faculty member at the university. We are also seeing if we can make the class available to people outside of JHU so even if you aren't at JHU but are interested you should let us know below. + +One of the principles of our approach is that anyone can prototype an app. Our class starts with some tutorials on Shiny and R. While we have no formal pre-reqs for the class you will have much more fun if you have the background equivalent to our Coursera classes: + +* [Data Scientist’s Toolbox](https://www.coursera.org/learn/data-scientists-tools) +* [R programming](https://www.coursera.org/learn/r-programming) +* [Building R packages](https://www.coursera.org/learn/r-packages) +* [Developing Data Products](https://www.coursera.org/learn/data-products) + +If you don't have that background you can take the classes online starting now to get up to speed! To see some examples of apps we will be building check out our [gallery](http://jhudatascience.org/data_app_gallery.html). + + +We will mostly be able to support development with R and Shiny but would be pumped to accept people with other kinds of development background - we just might not be able to give a lot of technical assistance. + + +As part of the course we are also working with JHU's [Fast Forward](https://ventures.jhu.edu/fastforward/) program to streamline and ease the process of starting a company around the app you build for the class. So if you have entrepreneurial ambitions, this is the class for you! + + +We are in the process of setting up the course times, locations, and enrollment cap. The class will run from March to May (exact dates TBD). To sign up for announcements about the class please fill out your information [here](http://jhudatascience.org/prototyping_students.html). + + + diff --git a/_posts/2017-01-31-data-into-numbers.md b/_posts/2017-01-31-data-into-numbers.md new file mode 100644 index 0000000..8c40b1d --- /dev/null +++ b/_posts/2017-01-31-data-into-numbers.md @@ -0,0 +1,153 @@ +--- +title: Turning data into numbers +author: jeff +layout: post +comments: true +--- + +_Editor's note: This is the third chapter of a book I'm working on called [Demystifying Artificial Intelligence](https://leanpub.com/demystifyai/). The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I'm developing the book over time - so if you buy the book on Leanpub know that there are only three chapters in there so far, but I'll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this [amazing tweet](https://twitter.com/notajf/status/795717253505413122) by Twitter user [@notajf](https://twitter.com/notajf/). 
Feedback is welcome and encouraged!_ + + +> "It is a capital mistake to theorize before one has data." Arthur Conan Doyle + +Data, data everywhere +--------------------- + +I already have some data about you. You are reading this book. Does that seem like data? It’s just something you did, that’s not data is it? But if I collect that piece of information about you, it actually tells me a surprising amount. It tells me you have access to an internet connection, since the only place to get the book is online. That in turn tells me something about your socioeconomic status and what part of the world you live in. It also tells me that you like to read, which suggests a certain level of education. + +Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected, it floated off unmeasured. Data were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives from our money, to our health, to our social interactions have made data collection cheap and easy. + +To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram did an experiment to determine the number of degrees of separation between two people in the U.S. (Travers and Milgram 1969). In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there. + +This is an idea that is so powerful it even became part of the popular consciousness. For example it is the foundation of the internet meme "the 6-degrees of Kevin Bacon" (Wikipedia contributors 2016a) - the idea that if you take any actor and look at the people they have been in movies with, then the people those people have been in movies with, it will take you at most six steps to end up at the actor Kevin Bacon. This idea, despite its popularity was originally studied by Milgram using only 64 data points. A 2007 study updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort (Leskovec and Horvitz 2008). + +Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion to collect the 3 billion letters in the first human genome (Venter et al. 2001). This project was actually a stunning success, most people thought it would be much more expensive. 
But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $1,000 in about a week (“The Cost of Sequencing a Human Genome,” n.d.), soon it may be less than $100 (Buhr 2017). + +You may have heard that this is the era of “big data” from The Economist or The New York Times. It is really the era of cheap data collection and storage. Measurements we never bothered to collect before are now so easy to obtain that there is no reason not to collect them. Advances in computer technology also make it easier to store huge amounts of data digitally. This may not seem like a big deal, but it is much easier to calculate the average of a bunch of numbers stored electronically than it is to calculate that same average by hand on a piece of paper. Couple these advances with the free and open distribution of data over the internet and it is no surprise that we are awash in data. But tons of data on their own are meaningless. It is understanding and interpreting the data where the real advances start to happen. + +This explosive growth in data collection is one of the key driving influences behind interest in artificial intelligence. When teaching computers to do something that only humans could do previously, it helps to have lots of examples. You can then use statistical and machine learning models to summarize that set of examples and help a computer make decisions what to do. The more examples you have, the more flexible your computer model can be in making decisions, and the more "intelligent" the resulting application. + +What is data? +------------- + +### Tidy data + +"What is data"? Seems like a relatively simple question. In some ways this question is easy to answer. According to [Wikipedia](https://en.wikipedia.org/wiki/Data): + +> Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə)\[1\] is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist's handwritten notes about her interviews with people of an Indigenous tribe. Pieces of data are individual pieces of information. While the concept of data is commonly associated with scientific research, data is collected by a huge range of organizations and institutions, ranging from businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations). + +When you think about data, you probably think of orderly sets of numbers arranged in something like an Excel spreadsheet. In the world of data science and machine learning this type of data has a name - "tidy data" (Wickham and others 2014). Tidy data has the properties that all measured quantities are represented by numbers or character strings (think words). The data are organized such that. + +1. Each variable you measured is in one column +2. Each different measurement of that variable is in a different row +3. There is one data table for each "type" of variable. +4. If there are multiple tables then they are linked by a common ID. + +This idea is borrowed from data management schemas that have long been used for storing data in databases. Here is an example of a tidy data set of swimming world records. 
+ +| year| time| sex | +|-----:|-----:|:----| +| 1905| 65.8| M | +| 1908| 65.6| M | +| 1910| 62.8| M | +| 1912| 61.6| M | +| 1918| 61.4| M | +| 1920| 60.4| M | +| 1922| 58.6| M | +| 1924| 57.4| M | +| 1934| 56.8| M | +| 1935| 56.6| M | + +This type of data, neat, organized and nicely numeric is not the kind of data people are talking about when they say the "era of big data". Data almost never start their lives in such a neat and organized format. + +### Raw data + +The explosion of interest in AI has been powered by a variety of types of data that you might not even think of when you think of "data". The data might be pictures you take and upload to social media, the text of the posts on that same platform, or the sound captured from your voice when you speak to your phone. + +Social media and cell phones aren't the only area where data is being collected more frequently. Speed cameras on roads collect data on the movement of cars, electronic medical records store information about people's health, wearable devices like Fitbit collect information on the activity of people. GPS information stores the location of people, cars, boats, airplanes, and an increasingly wide array of other objects. + +Images, voice recordings, text files, and GPS coordinates are what experts call "raw data". To create an artificial intelligence application you need to begin with a lot of raw data. But as we discussed in the simple AI example from the previous chapter - a computer doesn't understand raw data in its natural form. It is not always immediately obvious how the raw data can be turned into numbers that a computer can understand. For example, when an artificial intelligence works with a picture the computer doesn't "see" the picture file itself. It sees a set of numbers that represent that picture and operates on those numbers. The first step in almost every artificial intelligence application is to "pre-process" the data - to take the image files or the movie files or the text of a document and turn it into numbers that a computer can understand. Then those numbers can be fed into algorithms that can make predictions and ultimately be used to make an interface look intelligent. + +Turning raw data into numbers +----------------------------- + +So how do we convert raw data into a form we can work with? It depends on what type of measurement or data you have collected. Here I will use two examples to explain how you can convert images and the text of a document into numbers that an algorithm can be applied to. + +### Images + +Suppose that we were developing an AI to identify pictures of the author of this book. We would need to collect a picture of the author - maybe an embarrassing one. + +![An embarrassing picture of the author](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff.jpg) + +This picture is made of pixels. You can see that if you zoom in very close on the image and look more closely. You can see that the image consists of many hundreds of little squares, each square just one color. Those squares are called pixels and they are one step closer to turning the image into numbers. + +![A zoomed in view of the author's smile - you can see that each little square corresponds to one pixel and has an individual color](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile.png) + +You can think of each pixel like a dot of color. 
Let's zoom in a little bit more and, instead of showing each pixel as a square, show each one as a colored dot. + +![A zoomed in view of the author's smile - now each pixel is shown as a little colored dot.](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-dots.png) + +Imagine we are going to build an AI application on the basis of lots of images. Then we would like to turn a set of images into "tidy data". As described above, a tidy data set is defined as the following: + +1. Each variable you measured is in one column +2. Each different measurement of that variable is in a different row +3. There is one data table for each "type" of variable. +4. If there are multiple tables then they are linked by a common ID. + +A translation of tidy data for a collection of images would be the following: + +1. *Variables*: The pixels measured in the images. So the top left pixel is a variable, the bottom left pixel is a variable, and so on. So each pixel should be in a separate column. +2. *Measurements*: The measurements are the values for each pixel in each image. So each row corresponds to the pixel values for one image. +3. *Tables*: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them). + +To start to turn the image into a row of the data set, we need to stretch the dots into a single row. One way to do this is to snake along the image going from the top left corner to the bottom right corner, creating a single line of dots. + +![Follow the path of the arrows to see how you can turn the two dimensional picture into a one dimensional picture](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-lines.png) + +This still isn't quite data a computer can understand - a computer doesn't know about dots. But we could take each dot and label it with a color name. + +![Labeling each color with a name](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-color-names.png) + +We could take each color name and give it a number, something like `rosybrown = 1`, `mistyrose = 2`, and so on. But this approach runs into trouble: we don't have names for every possible color, and it is pretty inefficient to have a different number for every hue we could imagine. + +An alternative strategy that is often used is to encode the intensity of the red, green, and blue colors for each pixel. This is sometimes called the RGB color model (Wikipedia contributors 2016b). So for example we can take these dots and show how much red, green, and blue they have in them. + +![Breaking each color down into the amount of red, green and blue](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-rgb.png) + +Looking at it this way, we now have three measurements for each pixel. So we need to update our tidy data definition to be: + +1. *Variables*: The three colors for each pixel measured in the images. So the top left pixel red value is a variable, the top left pixel green value is a variable, and so on. So each pixel/color combination should be in a separate column. +2. *Measurements*: The measurements are the values for each pixel in each image. So each row corresponds to the pixel values for one image. +3. *Tables*: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them).
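To make this concrete, here is a minimal sketch in R of how one such row might be built. It is only an illustration: it assumes you have the `png` package installed and a small color image on disk (the file name `jeff.png` is just a placeholder), and it simply reads the pixels row by row.

```r
# A sketch of turning one image into a single row of RGB numbers.
# Assumes the 'png' package is installed and that "jeff.png" is a
# placeholder for a small color image you actually have on disk.
library(png)

img <- round(readPNG("jeff.png") * 255)  # height x width x channel array, rescaled to 0-255

red   <- as.vector(t(img[, , 1]))   # read the red channel row by row, top left to bottom right
green <- as.vector(t(img[, , 2]))
blue  <- as.vector(t(img[, , 3]))

# Interleave the channels so the columns read p1red, p1green, p1blue, p2red, ...
pixel_values <- as.vector(rbind(red, green, blue))
names(pixel_values) <- paste0("p",
                              rep(seq_along(red), each = 3),
                              c("red", "green", "blue"))

# One row of the tidy data set: an id, a label, and one column per pixel/color
tidy_row <- data.frame(id = 1, label = "jeff", t(pixel_values))
```

Real images have far too many pixels for a row like this to be readable, but the structure is the same no matter the size of the image.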
So a tidy data set might look something like this for just the image of Jeff. + +| id | label | p1red | p1green | p1blue | p2red | ... | +|-----|--------|-------|---------|--------|-------|-----| +| 1 | "jeff" | 238 | 180 | 180 | 205 | ... | + +Each additional image would then be another row in the data set. As we will see in the chapters that follow, we can then feed this data into an algorithm for performing an artificial intelligence task. + +Notes +----- + +Parts of this chapter appeared in the Simply Statistics blog post ["The vast majority of statistical analysis is not performed by statisticians"](http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/) written by the author of this book. + +References +---------- + +Buhr, Sarah. 2017. “Illumina Wants to Sequence Your Whole Genome for $100.” + +Leskovec, Jure, and Eric Horvitz. 2008. “Planetary-Scale Views on an Instant-Messaging Network.” + +“The Cost of Sequencing a Human Genome.” n.d. + +Travers, Jeffrey, and Stanley Milgram. 1969. “An Experimental Study of the Small World Problem.” *Sociometry* 32 (4). \[American Sociological Association, Sage Publications, Inc.\]: 425–43. + +Venter, J Craig, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural, Granger G Sutton, Hamilton O Smith, et al. 2001. “The Sequence of the Human Genome.” *Science* 291 (5507). American Association for the Advancement of Science: 1304–51. + +Wickham, Hadley, and others. 2014. “Tidy Data.” *Under Review*. + +Wikipedia contributors. 2016a. “Six Degrees of Kevin Bacon.” + +———. 2016b. “RGB Color Model.” diff --git a/_posts/2017-02-01-reproducible-research-limits.md b/_posts/2017-02-01-reproducible-research-limits.md new file mode 100644 index 0000000..701d636 --- /dev/null +++ b/_posts/2017-02-01-reproducible-research-limits.md @@ -0,0 +1,50 @@ +--- +title: Reproducible Research Needs Some Limiting Principles +author: roger +layout: post +--- + +Over the past 10 years of thinking and writing about reproducible research, I've come to the conclusion that much of the discussion is incomplete. While I think we as a scientific community have come a long way in changing people's thinking about data and code and making them available to others, there are some key sticking points that keep coming up and are preventing further progress in the area. + +When I used to write about reproducibility, I felt that the primary challenge/roadblock was a lack of tooling. Much has changed in just the last five years though, and many new tools have been developed to make life a lot easier. Packages like knitr (for R), markdown, and iPython notebooks have made writing reproducible data analysis documents a lot easier. Web sites like GitHub and many others have made distributing analyses a lot simpler because now everyone effectively has a free web site (this was NOT true in 2005). + +Even still, our basic definition of reproducibility is incomplete. Most people would say that a data analysis is reproducible if the analytic data and metadata are available and the code that did the analysis is available. Furthermore, it would be preferable to have some documentation to go along with both. But there are some key issues that need to be resolved to complete this general definition. + +## Reproducible for Whom?
+ +In discussions about reproducibility with others, the topic of **who** should be able to reproduce the analysis only occasionally comes up. There's a general sense, especially amongst academics, that **anyone** should be able to reproduce any analysis if they wanted to. + +There is an analogy with free software here in the sense that free software can be free for some people and not for others. This made more sense in the days before the Internet when distribution was much more costly. The idea here was that I could write software for a client and give them the source code for that software (as they would surely demand). The software is free for them but not for anyone else. But free software ultimately only matters when it comes to distribution. Once I distribute a piece of software, that's when all the restrictions come into play. However, if I only distribute it to a few people, I only need to guarantee that those few people have those freedoms. + +Richard Stallman once said that something like 90% of software was free software because almost all software being written was custom software for individual clients (I have no idea where he got this number). Even if the number is wrong, the point still stands that if I write software for a single person, it can be free for that person even if no one in the world has access to the software. + +Of course, now with the Internet, everything pretty much gets distributed to everyone because there's nothing stopping someone from taking a piece of free software and posting it on a web site. But the idea still holds: Free software only needs to be free for the people who receive it. + +That said, the analogy is not perfect. Software and research are not the same thing. The key difference is that you can't call something research unless it is generally available and disseminated. If Pfizer comes up with the cure for cancer and never tells anyone about it, it's not research. If I discover that there's a 9th planet and only tell my neighbor about it, it's not research. Many companies might call those activities research (particularly from a tax/accounting point of view) but since society doesn't get to learn about them, it's not research. + +If research is by definition disseminated to all, then it should therefore be reproducible by all. However, there are at least two circumstances in which we do not even pretend to believe this is possible. + +1. **Imbalance of resources**: If I conduct a data analysis that requires the [world's largest supercomputer](https://www.top500.org/lists/2016/06/), I can make all the code and data available that I want--few people will be able to actually reproduce it. That's an extreme case, but even if I were to make use of a [dramatically smaller computing cluster](https://jhpce.jhu.edu) it's unlikely that anyone would be able to recreate those resources. So I can distribute something that's reproducible in theory but not in reality by most people. +2. **Protected data**: Numerous analyses in the biomedical sciences make use of protected health information that cannot easily be disseminated. Privacy is an important issue, in part, because in many cases it allows us to collect the data in the first place. However, most would agree we cannot simply post that data for all to see in the name of reproducibility. First, it is against the law, and second, it would likely deter anyone from agreeing to participate in any study in the future.
+ +We can pretend that we can make data analyses reproducible for all, but in reality it's not possible. So perhaps it would make sense for us to consider whether a limiting principle should be applied. The danger of not considering it is that one may take things to the extreme---if it can't be made reproducible for all, then why bother trying? A partial solution is needed here. + + +## For How Long? + +Another question that needs to be resolved for reproducibility to be a widely implemented and sustainable phenomenon is: for how long should something be reproducible? Ultimately, this is a question about time and resources because ensuring that data and code can be made available and can run on current platforms *in perpetuity* requires substantial time and money. In the academic community, where projects are often funded off of grants or contracts with finite lifespans, often the money is long gone even though the data and code must be maintained. The question then is who pays for the maintenance and the upkeep of the data and code? + +I've never heard a satisfactory answer to this question. If the answer is that data analyses should be reproducible forever, then we need to consider a different funding model. This position would require a perpetual funds model, essentially an endowment, for each project that is disseminated and claims to be reproducible. The endowment would pay for things like servers for hosting the code and data and perhaps engineers to adapt and adjust the code as the surrounding environment changes. While there are a number of [repositories](http://dataverse.org) that have developed scalable operating models, it's not clear to me that the funding model is completely sustainable. + +If we look at how scientific publications are sustained, we see that it's largely private enterprise that shoulders the burden. Journals house most of the publications out there and they charge a fee for access (some for profit, some not for profit). Whether the reader pays or the author pays is not relevant; the point is that a decision has been made about *who* pays. + +The author-pays model is interesting though. Here, an author pays a publication charge of ~$2,000, and the reader never pays anything for access (in perpetuity, presumably). The $2,000 payment by the author is like a one-time capital expense for maintaining that one publication forever (a mini-endowment, in a sense). It works for authors because grant/contract supported research often budgets for one-time publication charges. There's no need for continued payments after a grant/contract has expired. + +The publication system is quite a bit simpler because almost all publications are the same size and require the same resources for access---basically a web site that can serve up PDF files and people to maintain it. For data analyses, one could see things potentially getting out of control. For a large analysis with terabytes of data, what would the one-time up-front fee be to house the data and pay for anyone to access it for free forever? + +Using Amazon's [monthly cost estimator](http://calculator.s3.amazonaws.com/index.html) we can get a rough sense of what the pure data storage might cost. Suppose we have a 10GB dataset that we want to store and we anticipate that it might be downloaded 10 times per month. This would cost about $7.65 per month, or $91.80 per year. If we assume Amazon raises its prices about 3% per year and apply a discount rate of 5%, the total (net present) cost of the storage is about $4,590. If we tack on 20% for other costs, that brings us to $5,508. This is perhaps not unreasonable, and the scenario would certainly include most people. For comparison, a 1 TB dataset downloaded once a year, using the same formula, gives us a one-time cost of about $40,000. This is real money when it comes to fixed research budgets and would likely require some discussion of trade-offs.
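To make the arithmetic explicit, here is a small sketch of the 10GB calculation in R. It only re-uses the numbers quoted above (the $7.65 per month estimate, 3% annual price growth, and a 5% discount rate) and treats the cost stream as a growing perpetuity; it is an illustration, not a real budget.

```r
# A rough sketch of the "mini-endowment" calculation described above.
# All inputs are the figures quoted in the text, not fresh AWS estimates.
monthly_cost <- 7.65
annual_cost  <- 12 * monthly_cost   # $91.80 per year
growth       <- 0.03                # assumed annual price increase
discount     <- 0.05                # assumed discount rate

# Present value of a growing perpetuity: first-year cost / (discount - growth)
endowment <- annual_cost / (discount - growth)   # about $4,590
endowment_with_overhead <- 1.2 * endowment       # add 20% -> about $5,508
```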
## Summary + +Reproducibility is a necessity in science, but it's high time that we start considering the practical implications of actually doing the job. There are still holdouts when it comes to the basic idea of reproducibility, but they are fewer and farther between. If we do not seriously consider the details of how to implement reproducibility, perhaps by introducing some limiting principles, we may never be able to achieve any sort of widespread adoption. \ No newline at end of file diff --git a/_posts/2017-02-13-nssd-episode-32.md b/_posts/2017-02-13-nssd-episode-32.md new file mode 100644 index 0000000..4979219 --- /dev/null +++ b/_posts/2017-02-13-nssd-episode-32.md @@ -0,0 +1,37 @@ +--- +author: roger +layout: post +title: Not So Standard Deviations Episode 32 - You Have to Reinvent the Wheel a Few Times +--- + +Hilary and I discuss training in PhD programs, estimating the variance vs. the standard deviation, the bias-variance tradeoff, and explainable machine learning. + +We're also introducing a new level of support on our Patreon page, where you can get access to some of the outtakes from our episodes. Check out our [Patreon page](https://www.patreon.com/NSSDeviations) for details. + +Show notes: + +* [Explainable AI](http://www.darpa.mil/program/explainable-artificial-intelligence) + +* [Stitch Fix Blog NBA Rankings](http://multithreaded.stitchfix.com/blog/2016/11/22/nba-rankings/) + +* [David Robinson’s Empirical Bayes book](http://varianceexplained.org/r/empirical-bayes-book/) + +* [War on the Rocks podcast](https://warontherocks.com/2017/01/introducing-bombshell-the-explosive-first-episode/) + +* [Roger on Twitter](https://twitter.com/rdpeng) + +* [Hilary on Twitter](https://twitter.com/hspter) + +* [Get the Not So Standard Deviations book](https://leanpub.com/conversationsondatascience/) + +* [Subscribe to the podcast on iTunes](https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570) + +* [Subscribe to the podcast on Google Play](https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna) + +* [Find past episodes](https://soundcloud.com/nssd-podcast) + + +[Download the audio for this episode](https://soundcloud.com/nssd-podcast/episode-32-you-have-to-reinvent-the-wheel-a-few-times) + +Listen here: + \ No newline at end of file diff --git a/_posts/2017-02-15-Data-Scientists-Clashing-at-Hedge-Funds.md b/_posts/2017-02-15-Data-Scientists-Clashing-at-Hedge-Funds.md new file mode 100644 index 0000000..7237d92 --- /dev/null +++ b/_posts/2017-02-15-Data-Scientists-Clashing-at-Hedge-Funds.md @@ -0,0 +1,35 @@ +--- +title: Data Scientists Clashing at Hedge Funds +author: roger +layout: post +--- + + +There's an interesting article over at Bloomberg about how [data scientists have struggled at some hedge funds](https://www.bloomberg.com/news/articles/2017-02-15/point72-shows-how-firms-face-culture-clash-on-road-to-quantland): + +> The firms have been loading up on data scientists and coders to deliver on the promise of quantitative investing and lift their ho-hum returns.
But they are discovering that the marriage of old-school managers and data-driven quants can be rocky. Managers who have relied on gut calls resist ceding control to scientists and their trading signals. And quants, emboldened by the success of computer-driven funds like Renaissance Technologies, bristle at their second-class status and vie for a bigger voice in investing. + +There are some interesting tidbits in the article that I think hold lessons for any collaboration between a data scientist or analyst and a non-data scientist (for lack of a better word). + +At Point72, the family office successor to SAC Capital, problems arose at the quant unit (known as Aperio): + +> The divide between Aperio quants and fundamental money managers was also intellectual. They struggled to communicate about the basics, like how big data could inform investment decisions. [Michael] Recce’s team, which was stacked with data scientists and coders, developed trading signals but didn’t always fully explain the margin of error in the analysis to make them useful to fund managers, the people said. + +It's hard to know the details of what actually happened, but for data scientists collaborating with others, there always needs to be an explanation of "what's going on". There's a general feeling that it's okay that machine learning techniques build complicated uninterpretable models because they work better. But in my experience that's not enough. People want to know why they work better, when they work better, and when they *don't* work. + +On over-theorizing: + +> Haynes, who joined Stamford, Connecticut-based Point72 in early 2014 after about two decades at McKinsey & Co., and other senior managers grew dissatisfied with Aperio’s progress and impact on returns, the people said. When the group obtained new data sets, it spent too much time developing theories about how to process them rather than quickly producing actionable results. + +I don't necessarily agree with this "criticism", but I only put it here because the land of hedge funds isn't generally viewed on the outside as a place where lots of theorizing goes on. + +At BlueMountain, another hedge fund: + +> When quants showed their risk analysis and trading signals to fundamental managers, they sometimes were rejected as nothing new, the people said. Quants at times wondered if managers simply didn’t want to give them credit for their ideas. + +I've seen this quite a bit. When a data scientist presents results to collaborators, there are often two responses: + +1. "I knew that already" and so you haven't taught me anything new +2. "I didn't know that already" and so you must be wrong + +The common link here, of course, is the inability to admit that there are things you don't know. Whether this is an inherent character flaw or something that can be overcome through teaching is not yet clear to me. But it is common when data is brought to bear on a problem that previously lacked data. One of the key tasks that a data scientist in any industry must prepare for is the task of giving people information that will make them uncomfortable.
\ No newline at end of file diff --git a/_posts/2017-02-20-podcasting-setup.md b/_posts/2017-02-20-podcasting-setup.md new file mode 100644 index 0000000..edfa1eb --- /dev/null +++ b/_posts/2017-02-20-podcasting-setup.md @@ -0,0 +1,59 @@ +--- +title: My Podcasting Setup +author: roger +layout: post +--- + +I've gotten a number of inquiries over the last 2 years about my podcasting setup and I've been meaning to write about it but.... + +But here it is! I actually wanted to write this because I felt like there wasn't a ton of good information about this on the Internet aimed at more "casual" podcasters rather than people who want to do it professionally. So here's what I've got. + +There are roughly two types of podcasts: the kind you record with everyone in the same room and the kind you record where everyone is in different rooms. They both require slightly different setups so I'll talk about both. For example, Elizabeth Matsui and I record [The Effort Report](http://effortreport.libsyn.com) locally because we're both at Johns Hopkins. But Hilary Parker and I record [Not So Standard Deviations](https://soundcloud.com/nssd-podcast) remotely because she's on the other side of the country most of the time. + +## Recording Equipment + +When Hilary and I first started we just used the microphone attached to the headphones you get with your iPhone or whatever. That's okay but the sound feels very "narrow" to me. That said, it's a good way to get started and it likely costs you nothing. + +The next level up for many people is the [Blue Yeti USB Microphone](https://www.amazon.com/Blue-Yeti-USB-Microphone-Silver/dp/B002VA464S/) which is a perfectly fine microphone and not too expensive. Also, it uses USB (as opposed to more professional XLR) so it connects to any computer, which is nice. However, it typically retails for $120, which isn't nothing, and there are probably cheaper microphones that are just as good. For example, Jason Snell recommends the [Audio Technica ATR2100](https://www.amazon.com/Audio-Technica-ATR2100-USB-Cardioid-Dynamic-Microphone/dp/B004QJOZS4/ref=as_li_ss_tl?ie=UTF8&qid=1479488629&sr=8-2&keywords=audio-technica+atr&linkCode=sl1&tag=incomparablepod-20&linkId=0919132824ac2090de45f2b1135b0163) which is only about $70. + +If you're willing to shell out a little more money, I'd highly recommend the [Zoom H4n](https://www.zoom-na.com/products/field-video-recording/field-recording/zoom-h4n-handy-recorder) portable recorder. This is actually two things: a microphone *and* a recorder. It has a nice stereo microphone built into the top along with two XLR inputs on the bottom that allow you to record from external mics. It records to SD cards so it's great for a portable setup where you don't want to carry a computer around with you. It retails for about $200 so it's *not* cheap, but in my opinion it is worth every penny. I've been using my H4n for years now. + +Because we do a lot of recording for our online courses here, we've actually got a bit more equipment in the office. So for in-person podcasts I sometimes record using a [Sennheiser MKH416-P48US](https://en-us.sennheiser.com/short-gun-tube-microphone-camera-films-mkh-416-p48u3) attached to an [Auray MS-5230T microphone stand](https://www.amazon.com/gp/product/B00D4AGIBS/) which is decidedly not cheap but is a great piece of hardware. + +By the way, a microphone stand is great to have, if you can get one, so you don't have to set the microphone on your desk or table.
That way if you bump the table by accident or generally like to bang the table, it won't get picked up on the microphone. It's not something to get right away, but maybe later when you make the big time. + +## Recording Software + +If you're recording by yourself, you can just hook up your microphone to your computer and record to any old software that records sound (on the Mac you can use Quicktime). If you have multiple people, you can either + +1. Speak into the same mic and have both your voices recorded on the same audio file +2. Use separate mics (and separate computers) and record separtely on to separate audio files. This requires synching the audio files in an editor, but that's not too big a deal if you only have 2-3 people. + +For local podcasts, I actually just use the H4n and record directly to the SD card. This creates separate WAV files for each microphone that are already synced so you can just plop them in the editor. + +For remote podcasts, you'll need some communication software. Hilary and I use [Zencastr](https://zencastr.com) which has its own VoIP system that allows you to talk to anyone by just sending your guests a link. So I create a session in Zencastr, send Hilary the link for the session, she logs in (without needing any credentials) and we just start talking. The web site records the audio directly off of your microphone and then uploads the audio files (one for each guest) to Dropbox. The service is really nice and there are now a few just like it. Zencastr costs $20 a month right now but there is a limited free tier. + +The other approach is to use something like Skype and then use something like [ecamm call-recorder](http://www.ecamm.com/mac/callrecorder/) to record the conversation. The downside with this approach is that if you have any network trouble that messes up the audio, then you will also record that. However, Zencastr (and related services) do not work on iOS devices and other devices that use WebKit based browsers. So if you have someone calling in on a mobile device via Skype or something, then you'll have to use this approach. Otherwise, I prefer the Zencastr approach and can't really see any downside except for the cost. + +## Editing Software + +There isn't a lot of software that's specifically designed for editing podcasts. I actually started off editing podcasts in Final Cut Pro X (nonlinear video editor) because that's what I was familiar with. But now I use [Logic Pro X](http://www.apple.com/logic-pro/), which is not really designed for podcasts, but it's a real digital audio workstation and has nice features (like [strip silence](https://support.apple.com/kb/PH13055?locale=en_US)). But I think something like [Audacity](http://www.audacityteam.org) would be fine for basic editing. + +The main thing I need to do with editing is merge the different audio tracks together and cut off any extraneous material at the beginning or the end. I don't usually do a lot of editing in the middle unless there's a major mishap like a siren goes by or a cat jumps on the computer. Once the editing is done I bounce to an AAC or MP3 file for uploading. + +## Hosting + +You'll need a service for hosting your audio files if you don't have your own server. You can technically host your audio files anywhere, but specific services have niceties like auto-generating the RSS feed. For Not So Standard Deviations I use [SoundCloud](https://soundcloud.com/stream) and for The Effort Report I use [Libsyn](https://www.libsyn.com). 
+ +Of the two services, I think I prefer Libsyn, because it's specifically designed for podcasting and has somewhat better analytics. The web site feels a little bit like it was designed in 2003, but otherwise it works great. Libsyn also has features for things like advertising and subscriptions, but I don't use any of those. SoundCloud is fine but wasn't really designed for podcasting and sometimes feels a little unnatural. + +## Summary + +If you're interested in getting started in podcasting, here's my bottom line: + +1. Get a partner. It's more fun that way! +2. If you and your partner are remote, use Zencastr or something similar. +3. Splurge for the Zoom H4n if you can, otherwise get a reasonable cheap microphone like the Audio Technica or the Yeti. +4. Don't focus too much on editing. Just clip off the beginning and the end. +5. Host on Libsyn. + diff --git a/_posts/2017-02-23-ml-earthquakes.md b/_posts/2017-02-23-ml-earthquakes.md new file mode 100644 index 0000000..bd1b283 --- /dev/null +++ b/_posts/2017-02-23-ml-earthquakes.md @@ -0,0 +1,153 @@ +--- +title: Learning about Machine Learning with an Earthquake Example +author: jeff +layout: post +comments: true +--- + +_Editor's note: This is the fourth chapter of a book I'm working on called [Demystifying Artificial Intelligence](https://leanpub.com/demystifyai/). I've also added a co-author, [Divya Narayanan](https://twitter.com/data_divya), a masters student here at Johns Hopkins! The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. We are developing the book over time - so if you buy the book on Leanpub know that there are only four chapters in there so far, but I'll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this [amazing tweet](https://twitter.com/notajf/status/795717253505413122) by Twitter user [@notajf](https://twitter.com/notajf/). Feedback is welcome and encouraged!_ + + +> "A learning machine is any device whose actions are influenced by past experience." - Nils John Nilsson + +Machine learning describes exactly what you would think: a machine that learns. As we described in the previous chapter a machine "learns" just like humans from previous examples. With certain experiences that give them an understanding about a particular concept, machines can be trained to have similar experiences as well, or at least mimic them. With very routine tasks, our brains become attuned to characteristics that define different objects or activities. + +Before we can dive into the algorithms - like neural networks - that are most commonly used for artificial intelligence, lets consider a real example to understand how machine learning works in practice. + +The Big One +----------- + +Earthquakes occur when the surface of the Earth experiences a shake due to displacement of the ground, and can readily occur along fault lines where there have already been massive displacements of rock or ground(Wikipedia 2017a). For people living in places like California where earthquakes occur relatively frequently, preparedness and safety are major concerns. One famous fault in southern California, called the San Andreas Fault, is expected to produce the next big earthquake in the foreseeable future, often referred to as the "Big One". Naturally, some residents are concerned and may like to know more so they are better prepared. 
+ +The following data are pulled from **fivethirtyeight**, a political and sports blogging site, and describe how worried people are about the "Big One" (Hickey 2015). Here's an example of the first few observations in this dataset: + +| | worry\_general | worry\_bigone | will\_occur | +|:-----|:-------------------|:-------------------|:------------| +| 1004 | Somewhat worried | Somewhat worried | TRUE | +| 1005 | Not at all worried | Not at all worried | FALSE | +| 1006 | Not so worried | Not so worried | FALSE | +| 1007 | Not at all worried | Not at all worried | FALSE | +| 1008 | Not at all worried | Not at all worried | FALSE | +| 1009 | Not at all worried | Not at all worried | FALSE | +| 1010 | Not so worried | Somewhat worried | FALSE | +| 1011 | Not so worried | Extremely worried | FALSE | +| 1012 | Not at all worried | Not so worried | FALSE | +| 1013 | Somewhat worried | Not so worried | FALSE | + +Just by looking at this subset of the data, we can already get a feel for how many different ways it could be structured. Here, we see that there are 10 observations which represent 10 individuals. For each individual, we have information on 11 different aspects of earthquake preparedness and experience (only 3 of which are shown here). Data can be stored as text, logical responses (true/false), or numbers. Sometimes, and quite often at that, it may be missing; for example, observation 1013. + +So what can we do with this data? For example, we could predict - or classify - whether or not someone was likely to have taken any precautions for an upcoming earthquake, like bolting their shelves to the wall or come up with an evacuation plan. Using this idea, we have now found a question that we're interested in analyzing: are you prepared for an earthquake or not? And now, based on this question and the data that we have, we can see that you can either be prepared (seen above as "true") or not (seen above as "false"). + +> Our question: How well can we predict whether or not someone is prepared for an earthquake? + +An Algorithm -- what's that? +---------------------------- + +With our question in tow, we want to design a way for our machine to determine if someone is prepared for an earthquake or not. To do this, the machine goes through a flowchart-like set of instructions. At each fork in the flowchart, there are different answers which take the machine on a different path to get to the final answer. If you go through the correct series of questions and answers, it can correctly identify a person as being prepared. Here's a small portion of the final flowchart for the San Andreas data which we will proceed to dissect (note: the ellipses on the right-hand side of the flowchart indicate where the remainder of the algorithm lies. This will be expanded later in the chapter): + +![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Flowchart-partial.png) + +The steps that we take through the flowchart, or **tree** make up the **classification algorithm**. An algorithm is essentially a set of step-by-step instructions that we follow to organize, or in other words, to make a prediction about our data. In this case, our goal is to classify an individual as prepared or not by working our way through the different branches of the tree. So how did we establish this particular set of questions to be in our framework of identifying a prepared individual? 
+ +**CART**, or a classification and regression tree, is one way to assess which of these characteristics is the most important in terms of splitting up the data into prepared and unprepared individuals (Wikipedia 2017b, Breiman et al. (1984)). There are multiple ways of implementing this method, often times with the earlier branches making larger splits in the data, and later branches making smaller splits. + +Within an algorithm, there exists another level of organization composed of **features** and **parameters**. + +In order to tell if someone is prepared for an earthquake or not, there have to be certain characteristics, known as **features**, that separate those who are prepared from those who are not. Features are basically the things you measured in your dataset that are chosen to give you insight into an individual and how to best classify them into groups. Looking at our sample data, we can see that some of the possible features are: whether or not an individual is worried about earthquakes in general, prior experiences with earthquakes, or their gender. As we will soon see, certain features will carry more weight in separating an individual into the two groups (prepared vs. unprepared). + +If we were looking at how important previously experiencing an earthquake was in classifying someone as prepared, we'd say it plays a pretty big role, since it's one of the first features that we encounter in our flowchart. The feature that seems to make a bigger split to our data is region, as it appears as the first feature in our algorithm shown above. We would expect that people in the Mountain and Pacific regions have more experience and knowledge about earthquakes, as that part of the country is more prone to experiencing an earthquake. However, someone's age may not be as important in classifying a prepared individual. Since it doesn't even show up in the top of our flowchart, it means it wasn't as crucial to know this information to decide if a person is prepared or not and it didn't separate the data much. + +The second form of organization within an algorithm are the questions and cutoffs for moving one direction or another at each node. These are known as **parameters** of our algorithm. These parameters give us insight as to how the features we have established define the observation we are trying to identify. Let us consider an example using the feature of region. As we mentioned earlier, we would expect that those in the Mountain and Pacific regions would have more experience with earthquakes, which may reflect in their level of preparedness. Looking back at our abbreviated classification tree, the first node in our tree has a parameter of "Mountain or Pacific" for the feature region, which can be split into "yes" (those living in one of these regions) or "no" (living elsewhere in the US). + +If we were looking at a continuous variable, say number of years living in a region, we may set a threshold instead, say greater than 5 years, as opposed to a yes/no distinction. In supervised learning, where we are teaching the machine to identify a prepared individual, we provide the machine multiple observations of prepared individuals and include different parameter values of features to show how a person could be prepared. To illustrate this point, let us consider the 10 sample observations below, and specifically examine the outcome, preparedness, with respect to the features: will\_occur, female, and household income. 
+ +| | prepared | will\_occur | female | hhold\_income | +|:-----|:---------|:------------|:-------|:---------------------| +| 1004 | TRUE | TRUE | FALSE | $50,000 to $74,999 | +| 1005 | FALSE | FALSE | TRUE | $10,000 to $24,999 | +| 1006 | TRUE | FALSE | TRUE | $200,000 and up | +| 1007 | FALSE | FALSE | FALSE | $75,000 to $99,999 | +| 1008 | FALSE | FALSE | TRUE | Prefer not to answer | +| 1009 | FALSE | FALSE | FALSE | Prefer not to answer | +| 1010 | TRUE | FALSE | TRUE | $50,000 to $74,999 | +| 1011 | FALSE | FALSE | TRUE | Prefer not to answer | +| 1012 | FALSE | FALSE | TRUE | $50,000 to $74,999 | +| 1013 | FALSE | FALSE | NA | NA | + +Of these ten observations, 7 are not prepared for the next earthquake and 3 are. But to make this information more useful, we can look at some of the features to see if there are any similarities that the machine can use as a classifier. For example, of the 3 individuals that are prepared, two are female and only one is male. But notice we get the same distribution of males and females by looking at those who are not prepared: you have 4 females and 2 males, the same 2:1 ratio. From such a small sample, the algorithm may not be able to tell how important gender is in classifying preparedness. But, by looking through the remaining features and a larger sample, it can start to classify individuals. This is what we mean when we say a machine learning algorithm **learns**. + +Now, let us take a closer look at observations 1005, 1011, and 1012, and more specifically the household income feature: + +| | prepared | will\_occur | female | hhold\_income | +|:-----|:---------|:------------|:-------|:---------------------| +| 1005 | FALSE | FALSE | TRUE | $10,000 to $24,999 | +| 1011 | FALSE | FALSE | TRUE | Prefer not to answer | +| 1012 | FALSE | FALSE | TRUE | $50,000 to $74,999 | + +All three of these observations are females and believe that the "Big One" won't occur in their lifetime. But despite the fact that they are all unprepared, they each report a different household income. Based on just these three observations, we may conclude that household income is not as important in determining preparedness. By showing a machine different examples of which features a prepared individual has (or unprepared, as in this case), it can start to recognize patterns and identify the features, or combination of features, and parameters that are most indicative of preparedness. + +In summary, every flowchart will have the following components: + +1. **The algorithm** - The general workflow or logic that dictates the path the machine travels, based on chosen features and parameter values. In turn, the machine classifies or predicts which group an observation belongs to + +2. **Features** - The variables or types of information we have about each observation +3. **Parameters** - The possible values a particular feature can have + +Even with the experience of seeing numerous observations with various feature values, there is no way to show our machine information on every single person that exists in the world. What will it do when it sees a brand new observation that is not identified as prepared or unprepared? Is there a way to improve how well our algorithm performs? + +Training and Testing Data +------------------------- + +You may have heard of the terms *sample* and *population*. In case these terms are unfamiliar, think of the population as the entire group of people we want to get information from, study, and describe. 
This would be like getting a piece of information, say income, from every single person in the world. Wouldn't that be a fun exercise! + +If we had the resources to do this, we could then take all those incomes and find out the average income of an individual in the world. But since this is not possible, it might be easier to get that information from a smaller number of people, or *sample*, and use the average income of that smaller pool of people to represent the average income of the world's population. We could only say that the average income of the sample is *representative* of the population if the sample of people that we picked have the same characteristics of the population. + +For example, if we assumed that our population of interest was a company with 1,000 employees, where 200 of those employees make $60,000 each and 800 of them make $30,000 each. Since we have this information on everyone, we could easily calculate the average income of an employee in the company, which would be $36,000. Now, say we randomly picked a group of 100 individuals from the company as our sample. If all of those 100 individuals came from the group of employees that made $60,000, we might think that the average income for an employee was actually much higher than the true average of the population (the whole company). The opposite would be true if all 100 of those employees came from the group making less money - we would mistakenly think the average income of employees is lower. In order to accurately reflect the distribution of income of the company employees through our sample, or rather to have a *representative* sample, we would try to pick 20 individuals from the higher income group and 80 individuals from the lower income group to get an accurate representation of this company population. + +Now heading back to our earthquake example, our big picture goal is to be able to feed our algorithm a brand new observation of someone who answered information about themselves and earthquake preparedness, and have the machine be able to correctly identify whether or not they are prepared for a future earthquake. + +One definition of a population could consist of all individuals in the world. However, since we can't get access to data on all these individuals, we decide to sample 1013 respondents and ask them earthquake related questions. Remember that in order for our machine to be able to accurately identify an individual as prepared or unprepared, it needs to have had some experience seeing some observations where features identify the individual as prepared, as well as some observations that aren't. This seems a little counterintuitive though. If we want our algorithm to identify a prepared individual, why wouldn't we show it all the observations that are prepared? + +By showing our machine observations of respondents that are not prepared, it can better strengthen its idea of what features identify a prepared individual. But we also want to make our algorithm as *robust* as possible. For an algorithm to be robust, it should be able to take in a wide range of values for each feature, and appropriately go through the algorithm to make a classification. If we only show our machine a narrow set of experiences, say people who have an income of between $10,000 and $25,000, it will be harder for the algorithm to correctly classify an individual who has an income of $50,000. 
+
+One way we can give our machine this set of experiences is to take all 1013 observations and randomly split them up into two groups. Note: for simplification, any observations that had missing data (total: 7) for the outcome variable were removed from the original dataset, leaving 1006 observations for our analysis.
+
+1. **Training data** - This serves as the wide range of experiences that we want our machine to see to have a better understanding of preparedness
+
+2. **Testing data** - This data will allow us to evaluate our algorithm and see how well it was able to pick up on features and parameter values that are specific to prepared individuals and correctly label them as such
+
+So what's the point of splitting up our data into training and testing? We could easily have fed all of the data we have into the algorithm and had it detect the most important features and parameters based on the provided observations. But there's an issue with that, known as **overfitting**. When an algorithm has overfit the data, it means that it has been fit specifically to the data at hand, and only that data. It would perform very well on that sample set, but it would struggle with new data that does not fall within the bounds of the training data. Moreover, the algorithm would only accurately classify a very narrow set of observations. This nicely illustrates the concept we introduced earlier - *robustness* - and demonstrates the importance of exposing our algorithm to a wide range of experiences. We want our algorithm to be useful, which means it needs to be able to take in all kinds of data with different distributions, and still be able to accurately classify them.
+
+To create training and testing sets, we can adopt the following idea:
+
+1. Split the 1006 observations in half: roughly 500 for training, and the remainder for testing
+2. Feed the 500 training observations through the algorithm for the machine to understand what features best classify individuals as prepared or unprepared
+3. Once the machine is trained, feed the remaining test observations through the algorithm and see how well it classifies them
+
+Algorithm Accuracy
+------------------
+
+Now that we've built up our algorithm and split our data into training and test sets, let's take a look at the full classification algorithm:
+
+![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Flowchart-full.png)
+
+Recall the question we set out to answer with respect to the earthquake data: **How well can we predict whether or not someone is prepared for an earthquake?** In a binary (yes/no) case like this, we can set up our results in a 2x2 table, where the rows represent predicted preparedness (based on the features of our algorithm) and the columns represent true preparedness (what their true label is).
+
+![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2x2-table-results.png)
+
+This simple 2x2 table carries quite a bit of information. Essentially, we can grade our machine on how well it learned to tell whether individuals are prepared or unprepared. We can see how well our algorithm did at classifying new observations by calculating the **predictive accuracy**, done by summing the two cells where the predicted and true labels agree (cells A and D) and dividing by the total number of observations, or more simply, (A + D) / N. Through this calculation, we see that the algorithm from our example correctly classified individuals as prepared or unprepared 77.9% of the time. Not bad!
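+
+Here is a rough R sketch of these steps. The data frame `quake`, its column names, and the use of a classification tree from the `rpart` package are assumptions for illustration only; the flowchart in the post is a hand-built classifier, not necessarily this exact model:
+
+```r
+library(rpart)
+
+# 'quake' is a hypothetical data frame with the 1006 complete observations,
+# a logical outcome column 'prepared', and the feature columns (will_occur, female, ...)
+quake$prepared <- factor(quake$prepared)   # treat the outcome as a class label
+
+set.seed(538)
+train_idx <- sample(seq_len(nrow(quake)), size = 500)
+train <- quake[train_idx, ]
+test  <- quake[-train_idx, ]
+
+fit  <- rpart(prepared ~ ., data = train, method = "class")   # grow a classification tree
+pred <- predict(fit, newdata = test, type = "class")           # classify the held-out data
+
+mean(pred == test$prepared)                      # predictive accuracy
+table(predicted = pred, truth = test$prepared)   # the 2x2 table described above
+```
+
+A calculation like `mean(pred == test$prepared)` is the kind of computation that produces the 77.9% figure quoted above.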
+
+When we feed the roughly 500 test observations through the algorithm, it is the first time the machine has seen those observations. As a result, there is a chance that despite going through the algorithm, the machine **misclassified** someone as prepared, when in fact they were unprepared. To determine how often this happens, we can calculate the **test error rate** from the 2x2 table above. To calculate it, we take the number of *discordant* observations, those whose predicted and true status disagree, and divide by the total number of observations assessed. Based on the above table, the test error rate would be (B + C) / N, or 22.1%.
+
+There are a number of reasons a test error rate might be high. Depending on the data set, other methods might be better suited for building the algorithm. Additionally, despite randomly splitting our data into training and testing sets, there may be some inherent differences between the two (think back to the employee income example above), making it harder for the algorithm to correctly label an observation.
+
+References
+----------
+
+Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. *Classification and Regression Trees*. Monterey, CA: Wadsworth & Brooks.
+
+Hickey, Walt. 2015. “The Rock Isn’t Alone: Lots of People Are Worried About ‘the Big One’.” *FiveThirtyEight*.
+
+Wikipedia. 2017a. “Earthquake — Wikipedia, the Free Encyclopedia.”
+
+———. 2017b. “Predictive analytics — Wikipedia, the Free Encyclopedia.”
diff --git a/_posts/2017-03-02-rr-glossy.md b/_posts/2017-03-02-rr-glossy.md
new file mode 100644
index 0000000..0b46ea7
--- /dev/null
+++ b/_posts/2017-03-02-rr-glossy.md
@@ -0,0 +1,34 @@
+---
+title: Reproducibility and replicability is a glossy science now so watch out for the hype
+author: jeff
+layout: post
+comments: true
+---
+
+[Reproducibility](http://biorxiv.org/content/early/2016/07/29/066803) is the ability to take the code and data from a previous publication, rerun the code and get the same results. [Replicability](http://biorxiv.org/content/early/2016/07/29/066803) is the ability to rerun an experiment and get "consistent" results with the original study using new data. Results that are not reproducible are hard to verify and results that do not replicate in new studies are harder to trust. It is important that we aim for reproducibility and replicability in science.
+
+Over the last few years there has been increasing concern about problems with reproducibility and replicability in science. There are a number of suggestions for why this might be happening:
+
+* Papers published by scientists who lack training in statistics and computation.
+* Treating statistics as a secondary discipline that can be "tacked on" at the end of a science experiment.
+* Financial incentives for companies and others to publish desirable results.
+* Academic incentives for scientists to publish desirable results so they can get their next grant.
+* Incentives for journals to publish surprising/eye catching/interesting results.
+* Over-hyped studies with statistical weaknesses (small sample sizes, questionable study populations, etc.).
+* TED-style sound bites of scientific results that are digested and repeated in the press despite limited scientific evidence.
+* Scientists who refuse to consider alternative explanations for their data.
+
+Usually the targets of discussion about reproducibility and replicability are highly visible scientific studies. These are typically papers in what are considered "top journals", or papers in journals like Science and Nature that seek to maximize visibility. Or, more recently, entire widely publicized fields of science, like psychology or cancer biology, are targeted for reproducibility and replicability studies.
+
+These studies have pointed out serious issues with the statistics, study designs, code availability, and methods descriptions in the papers they have studied. These are fundamental issues that deserve attention and should be taught to all scientists. As more papers have come out pointing out potential issues, they have merged into what is being called "a crisis of reproducibility", "a crisis of replicability", "a crisis of confidence in science" or other equally strong statements.
+
+As the interest around reproducibility and replicability has built to a fever pitch in the scientific community it has morphed into a glossy scientific field in its own right. All of the characteristics are in place:
+
+* A big central "positive" narrative that all science is not replicable, reproducible, or correct.
+* Incentives to publish these types of results because they can appear in Nature/Science/other glossy journals. ([I'm not immune to this](http://www.pnas.org/content/112/6/1645.full))
+* Strong and aggressive responses to papers that provide alternative explanations or don't fit the narrative.
+* Researchers whose careers depend on the narrative being true.
+* TED-style talks and sound bites ("most published research is false", "most papers don't replicate").
+* Press hype, including for papers with statistical weaknesses (small sample sizes, weaker study designs).
+
+Reproducibility and replicability research has "arrived" and become a field in its own right. That has both positives and negatives. On the positive side, it means critical statistical issues are now being talked about by a broader range of people. On the negative side, researchers now have to do the same sober evaluation of the claims in reproducibility and replicability papers that they do for any other scientific field. Papers on reproducibility and replicability must be judged with the same critical eye as we apply to any other scientific study. That way we can sift through the hype and move science forward.
diff --git a/_posts/2017-03-07-time-series-model.md b/_posts/2017-03-07-time-series-model.md
new file mode 100644
index 0000000..7c565b9
--- /dev/null
+++ b/_posts/2017-03-07-time-series-model.md
@@ -0,0 +1,9 @@
+---
+title: Model building with time series data
+author: roger
+layout: post
+---
+
+A nice post by Alex Smolyanskaya over at the [Stitch Fix blog](http://multithreaded.stitchfix.com/blog/2017/02/28/whats-wrong-with-my-time-series/) about some of the unique challenges of model building in a time series context:
+
+> Cross validation is the process of measuring a model’s predictive power by testing it on randomly selected data that was not used for training. However, autocorrelations in time series data mean that data points are not independent from each other across time, so holding out some data points from the training set doesn’t necessarily remove all their associated information. Further, time series models contain autoregressive components to deal with the autocorrelations.
These models rely on having equally spaced data points; if we leave out random subsets of the data, the training and testing sets will have holes that destroy the autoregressive components.
\ No newline at end of file
diff --git a/_posts/2017-03-08-when-do-we-need-interpretability.md b/_posts/2017-03-08-when-do-we-need-interpretability.md
new file mode 100644
index 0000000..b7cec77
--- /dev/null
+++ b/_posts/2017-03-08-when-do-we-need-interpretability.md
@@ -0,0 +1,20 @@
+---
+title: When do we need interpretability?
+author: roger
+layout: post
+---
+
+I just saw a link to an [interesting article](https://arxiv.org/abs/1702.08608) by Finale Doshi-Velez and Been Kim titled "Towards A Rigorous Science of Interpretable Machine Learning". From the abstract:
+
+> Unfortunately, there is little consensus on what interpretability in machine learning is and how to evaluate it for benchmarking. Current interpretability evaluation typically falls into two categories. The first evaluates interpretability in the context of an application: if the system is useful in either a practical application or a simplified version of it, then it must be somehow interpretable. The second evaluates interpretability via a quantifiable proxy: a researcher might first claim that some model class—e.g. sparse linear models, rule lists, gradient boosted trees—are interpretable and then present algorithms to optimize within that class.
+
+The paper raises a good point, which is that we don't really have a definition of "interpretability". We just know it when we see it. For the most part, there's been some agreement that parametric models are "more interpretable" than other models, but that's a relatively fuzzy statement.
+
+I've long heard that complex machine learning models that lack any real interpretability are okay because there are many situations where we don't care "how things work". When Netflix is recommending my next movie based on my movie history and perhaps other data, the only thing that matters is that the recommendation is something I like. In other words, the [user experience defines the value](http://simplystatistics.org/2017/01/23/ux-value/) to me. However, in other applications, such as when we're assessing the relationship between air pollution and lung cancer, a more interpretable model may be required.
+
+I think the dichotomization between these two kinds of scenarios will eventually go away for a few reasons:
+
+1. For some applications, lack of interpretability is fine...until it's not. In other words, what happens when things go wrong? Interpretability can help us to decipher why things went wrong and how things can be *modified* to be fixed. In order to move the levers of a machine to fix it, we need to know exactly where the levers are. Yet another way to say this is that it's possible to jump from one situation (interpretability not needed) to another situation (what the heck just happened?) very quickly.
+2. I think interpretability will become the new [reproducible research](http://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/), transmogrified to the machine learning and AI world. In the scientific world, reproducibility took some time to catch on (and has not quite caught on completely), but it is not so controversial now and many people in many fields accept the notion that all studies should at least be reproducible (if [not necessarily correct](http://www.pnas.org/content/112/6/1645.full)).
There was a time when people differentiated between cases that needed reproducibility (big data, computational work), and cases where it wasn't needed. But that differentiation is slowly going away. I believe interpretability in machine learning and statistical modeling will go the same way as reproducibility in science.
+
+Ultimately, I think it's the success of machine learning that brings the requirement of interpretability onto the scene. Because machine learning has become ubiquitous, we as a society begin to develop expectations for what it is supposed to do. Thus, the [value of machine learning begins to be defined externally](http://simplystatistics.org/2017/01/23/ux-value/). It will no longer be good enough to simply provide a great user experience.
\ No newline at end of file
diff --git a/_posts/2017-03-16-evo-ds-class.md b/_posts/2017-03-16-evo-ds-class.md
new file mode 100644
index 0000000..c7e5830
--- /dev/null
+++ b/_posts/2017-03-16-evo-ds-class.md
@@ -0,0 +1,37 @@
+---
+title: The levels of data science class
+author: jeff
+layout: post
+comments: true
+---
+
+In a recent post, Nathan Yau [points to a comment](http://flowingdata.com/2013/03/12/data-hackathon-challenges-and-why-questions-are-important/) by Jake Porway about data science hackathons. They both say that for data science/visualization projects to be successful you have to start with an important question, not with a pile of data. This is the [problem forward not solution backward](http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/) approach to data science and big data. This is the approach also advocated in the really nice piece on teaching data science by [Stephanie and Rafa](https://arxiv.org/abs/1612.07140).
+
+I have adopted a similar approach in the data science class here at Hopkins, largely inspired by Dan Meyer's [patient problem solving for middle school math class](https://www.ted.com/talks/dan_meyer_math_curriculum_makeover/transcript). So instead of giving students a full problem description, I give them project suggestions like:
+
+* __Option 1__: Develop a prediction algorithm for identifying and classifying users that are trolling or being mean on Twitter. If you want an idea of what I’m talking about just look at the responses to any famous person’s tweets.
+* __Option 2__: Analyze the traffic fatality data to identify any geographic, time-varying, or other characteristics that are associated with traffic fatalities: https://www.transportation.gov/fastlane/2015-traffic-fatalities-data-has-just-been-released-call-action-download-and-analyze.
+* __Option 3__: Develop a model for predicting life expectancy in Baltimore down to single block resolution with estimates of uncertainty. You may need to develop an approach for “downsampling” since the outcome data you’ll be able to find is likely aggregated at the neighborhood level (http://health.baltimorecity.gov/node/231).
+* __Option 4__: Develop a statistical model for inferring the variables you need to calculate the Gail score (http://www.cancer.gov/bcrisktool/) for a woman based on her Facebook profile. Develop a model for the Gail score prediction from Facebook and its uncertainty. You should include estimates of uncertainty in the predicted score due to your inferred variables.
+* __Option 5__: Potentially fun but super hard project: develop an algorithm for a self-driving car using the training data: http://research.comma.ai/.
Build a model for predicting at every moment what direction the car should be going, whether it should be signalling, and how fast it should be going. You might consider starting with a small subsample of the (big) training set.
+
+Each of these projects shares the characteristic that there is an interesting question - but the data may or may not be available. If it is available it may or may not have to be processed/cleaned/organized. Moreover, with the data in hand you may need to think about how it was collected or go out and collect some more data. This kind of problem is inspired by this quote from Dan's talk - he was talking about math but it could easily have been data science:
+
+> Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have sufficient information and have to go find some?
+
+I realize though that this is advanced data science. So I was thinking about the levels of data science courses and how you would build up a curriculum. I came up with the following courses/levels and would be interested in what others thought.
+
+* __Level 0: Background__: Basic computing, some calculus with a focus on optimization, basic linear algebra.
+* __Level 1: Data science thinking__: How to define a question, how to turn a question into a statement about data, how to identify data sets that may be applicable, experimental design, critical thinking about data sets.
+* __Level 2: Data science communication__: Teaching students how to write about data science, how to express models qualitatively and in mathematical notation, explaining how to interpret results of algorithms/models. Explaining how to make figures.
+* __Level 3: Data science tools__: Learning the basic tools of R, loading data of various types, reading data, plotting data.
+* __Level 4: Real data__: Manipulating different file formats, working with "messy" data, trying to organize multiple data sets into one data set.
+* __Level 5: Worked examples__: Use real data examples, but work them through from start to finish as case studies, don't make them easy clean data sets, but have a clear path from the beginning of the problem to the end.
+* __Level 6: Just the question__: Give students a question where you have done a little research to know that it is possible to get at least some data, but aren't 100% sure it is the right data or that the problem can be perfectly solved. Part of the learning process here is knowing how to define success or failure and when to keep going or when to quit.
+* __Level 7: The student is the scientist__: Have the students come up with their own questions and answer them using data.
+
+
+I think that a lot of the thought right now in biostatistics has been on level 3 and 4 courses. These are courses where we have students work with real data sets and learn about tools. To be self-sufficient as a data scientist it is clear you need to be able to work with real-world data. But what Jake/Nathan are referring to is level 5 or level 6 - cases where you have a question but the data needs a ton of work and may not even be good enough without collecting new information. Jake and Nathan have perfectly identified the ability to translate murky questions into data answers as the most valuable data skill. If I had to predict the future of data courses I would see them trending in that direction.
+
diff --git a/_posts/2017-04-03-interactive-data-analysis.md b/_posts/2017-04-03-interactive-data-analysis.md
new file mode 100644
index 0000000..8f24dbc
--- /dev/null
+++ b/_posts/2017-04-03-interactive-data-analysis.md
@@ -0,0 +1,120 @@
+---
+title: The Importance of Interactive Data Analysis for Data-Driven Discovery
+date: 2017-04-03
+author: rafa
+layout: post
+comments: true
+---
+
+Data analysis workflows and recipes are commonly used in science. They are actually indispensable since reinventing the wheel for each project would result in a colossal waste of time. On the other hand, mindlessly applying a workflow can result in totally wrong conclusions if the required assumptions don't hold. This is why successful data analysts rely heavily on interactive data analysis (IDA). I write today because I am somewhat concerned that the importance of IDA is not fully appreciated by many of the policy makers and thought leaders that will influence how we access and work with data in the future.
+
+I start by constructing a very simple example to illustrate the importance of IDA. Suppose that as part of a demographic study you are asked to summarize male heights across several counties. Since sample sizes are large and heights are known to be well approximated by a normal distribution, you feel comfortable using a tried and tested recipe: report the average and standard deviation as a summary. You are surprised to find a county with an average height of 6.1 feet and a standard deviation (SD) of 7.8 feet. Do you start writing a paper and a press release to describe this very interesting finding? Here, interactive data analysis saves us from naively reporting this. First, we note that the standard deviation is impossibly big if the data is in fact normally distributed: more than 15% of heights would be negative. Given this nonsensical result, the next obvious step for an experienced data analyst is to explore the data, say with a boxplot (see below). This immediately reveals a problem: it appears one value was reported in centimeters rather than feet (180). After fixing this, the summary changes to an average height of 5.75 feet with a 3-inch SD (a short simulation reproducing this example appears at the end of this post).
+
+![European Outlier](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/heights-with-outlier.png)
+
+
+Years of data analysis experience will show you that examples like this are common. Unfortunately, as data and analyses get more complex, workflow failures are harder to detect and often go unnoticed. An important principle many of us teach our trainees is to carefully check for hidden problems when data analysis leads you to unexpected results, especially when we stand to benefit professionally if the unexpected result holds up, for example by leading to a publication.
+
+Interactive data analysis is also indispensable for the development of new methodology. For example, in my field of research, exploring the data has led to the discovery of the need for new methods and motivated new approaches that handle specific cases that existing workflows can't handle.
+
+So why am I concerned? As public datasets become larger and more numerous, many funding agencies, policy makers and industry leaders are advocating for using cloud computing to bring computing to the data. If done correctly, this would provide a great improvement over the current redundant and unsystematic approach of everybody downloading data and working with it locally.
However, after looking into the details of some of these plans, I have become a bit concerned that perhaps the importance of IDA is not fully appreciated by decision makers.
+
+As an example, consider the NIH efforts to promote data-driven discovery that center around plans for the [_Data Commons_](https://datascience.nih.gov/commons). The linked page describes an ecosystem with four components, one of which is "Software". According to the description, the software component of _The Commons_ should provide "[a]ccess to and deployment of scientific analysis tools and pipeline workflows". There is no mention of a strategy that will grant access to the raw data. Without this, carefully checking the workflow output and developing the analysis tools and pipeline workflows of the future will be difficult.
+
+I note that data analysis workflows are very popular in fields in which data analysis is indispensable, as is the case in biomedical research, my focus area. In this field, data generators, who typically lead the scientific enterprise, are not always trained data analysts. But the literature is overflowing with proposed workflows. You can gauge the popularity of these by the vast number published in the Nature journals, as demonstrated by this [Google search](https://www.google.com/search?q=workflow+site:nature.com&biw=1706&bih=901&source=lnms&tbm=isch&sa=X&ved=0ahUKEwi3usL8-dDPAhUDMSYKHaBFBTAQ_AUIBigB#tbm=isch&q=analysis+workflow+site:nature.com):
+
+![Nature workflows](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/many-workflows.png)
+
+
+In a field in which data generators are not data analysis experts, the workflow has the added allure that it removes the need to think deeply about data analysis and instead shifts the responsibility to pre-approved software. Note that these workflows are not always described with the mathematical language or computer code needed to truly understand them, but rather with a series of PowerPoint shapes. The gist of the typical data analysis workflow can be simplified into the following:
+
+![workflows](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/workflow.png)
+
+This simplification of the data analysis process makes it particularly worrisome that the intricacies of IDA are not fully appreciated.
+
+As mentioned above, data analysis workflows are a necessary component of the scientific enterprise. Without them the process would grind to a halt. However, workflows should only be implemented once consensus is reached regarding their optimality. And even then, IDA is needed to ensure that the process is performing as expected. The careers of many of my colleagues have been dedicated mostly to the development of such analysis tools. We have learned that rushing to implement workflows before they are mature enough can have widespread negative consequences. And, at least in my experience, developing rigorous tools is impossible without interactive data analysis. So I hope that this post helps make a case for the importance of interactive data analysis and that it continues to be a part of the scientific enterprise.
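+
+As a postscript, here is the promised short R sketch of the heights example. The data are simulated (about 500 observations chosen so the numbers come out close to the ones in the text), since the original county data are not shown:
+
+```r
+set.seed(1)
+heights <- rnorm(496, mean = 5.75, sd = 0.25)  # male heights in feet (simulated)
+heights[1] <- 180                              # one value entered in centimeters
+
+round(c(average = mean(heights), SD = sd(heights)), 1)  # roughly 6.1 and 7.8
+boxplot(heights)                               # the outlier jumps out immediately
+
+heights[1] <- 180 / 30.48                      # convert the bad value to feet
+round(c(average = mean(heights), SD = sd(heights)), 2)  # back to about 5.75 and 0.25
+```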
+
+
diff --git a/_posts/2017-04-06-huelga.md b/_posts/2017-04-06-huelga.md
new file mode 100644
index 0000000..8e7a46e
--- /dev/null
+++ b/_posts/2017-04-06-huelga.md
@@ -0,0 +1,70 @@
+---
+title: Enrollment, the cost per credit, and strikes at the UPR
+date: 2017-04-06
+author: rafa
+layout: post
+comments: true
+---
+
+The University of Puerto Rico (UPR) receives approximately 800 million dollars from the state each year. This investment allows it to offer higher salaries, which attracts the best professors, to maintain the best facilities for research and teaching, and to keep the price per credit lower than at the private universities. Thanks to these great advantages, the UPR is usually the first choice of Puerto Rican students, in particular its two largest campuses, Río Piedras (UPRRP) and Mayagüez. A student who makes the most of their time at the UPR, besides developing as a citizen, can successfully enter the workforce or continue their studies at the best graduate schools. The modest price per credit, combined with federal Pell grants, has helped thousands of economically disadvantaged students complete their studies without going into debt.
+
+Over the past decade a worrying reality has emerged: while the demand for a university education has grown, as shown by the growth in enrollment at the private universities, the number of students enrolled at the UPR has gone down.
+
+![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/matricula.png)
+
+Why has enrollment at the UPR gone down? [A popular explanation](http://www.elnuevodia.com/noticias/locales/nota/protestalauniondejuventudessocialistas-1331982/) is that "the drop in enrollment is caused by the increase in the cost of tuition". The theory that a rise in cost lowers enrollment is commonly accepted because it makes economic sense: when the price goes up, sales go down. But then why has enrollment at the private universities grown? Nor is it explained by growth in the number of wealthy students, since in 2012 [the median family income of young people enrolled at a UPR campus was $32,379; in contrast, the median income of those enrolled at a private university was $25,979](http://www.80grados.net/hacia-una-universidad-mas-pequena-y-agil/). Another problem with this theory is that, once we adjust for inflation, the cost per credit has remained more or less stable both at the UPR and at the private universities.
+
+![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/costo.png)
+
+Now, if we look closely at the enrollment data, we notice that the biggest drops came precisely in the strike years (2005, 2010, 2011). In 2005 a positive trend in enrollment at Sagrado begins, with the largest growth in 2010 and 2011.
+
+![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-04-06/cambio-en-matricula.png)
+
+Currently, several campuses, including Río Piedras, [are closed indefinitely](http://www.elnuevodia.com/noticias/locales/nota/estudiantesapruebanvotodehuelgasistemicaenlaupr-2307616/). At a national assembly attended by 10% of the system's more than 50,000 students, an indefinite strike was approved by a vote of 4,522 to 1,154.
To resume operations, the students demand that "no sanctions be imposed on students who participate in the strike, that a university reform plan developed by the university community be presented, that the public debt be audited, and that the members of the public audit evaluation commission be reinstated along with its budget". This comes in response to the proposal by the [Fiscal Oversight Board (JSF)](https://en.wikipedia.org/wiki/PROMESA) and the governor to [cut](http://www.elnuevodia.com/noticias/locales/nota/revelanelplanderecortesparaelsistemadelaupr-2302675/) the UPR's budget as part of their attempts to resolve a [serious fiscal crisis](https://www.project-syndicate.org/commentary/puerto-rico-debt-plan-deep-depression-by-joseph-e--stiglitz-and-martin-guzman-2017-02).
+
+During the shutdown, the striking students block the rest of the university community from entering the campus, including those who do not consider a strike an effective form of protest. Those who oppose it and want to keep studying are accused of being selfish or of being allies of those who want to destroy the UPR. So far, these students have not received the explicit support of professors and administrators either. We should not be surprised if those who want to keep studying end up paying more at a private university.
+
+[Image: portones2]
+
+Although it is possible that the strike exerts enough political pressure for the demands set out at the assembly to be met, there are other, less favorable possibilities for the students:
+
+- The lack of academic activity results in the exodus of thousands of students to the private universities.
+- The JSF uses the shutdown to justify even more cuts: an institution does not need millions of dollars a day if it is closed.
+- The closed campuses lose their accreditation, since a university in which no classes are taught cannot meet the [required standards](http://www.msche.org/?Nav1=About&Nav2=FAQ&Nav3=Question07).
+- Pell grants are revoked for students whose studies are on hold.
+
+There is ample empirical evidence demonstrating the importance of accessible university education. The same cannot be said of strikes as a strategy for defending that education. And there is a real possibility that the indefinite strike has the opposite effect and greatly harms the students, in particular those who end up forced to enroll at a private university.
+
+
+Notes:
+
+1. Data provided by the [Consejo de Educación de Puerto Rico (CEPR)](http://www2.pr.gov/agencias/cepr/inicio/estadisticas_e_investigacion/Pages/Estadisticas-Educacion-Superior.aspx).
+
+2. The 2011 cost per credit does not include the fee.
diff --git a/_posts/2017-04-06-march-for-science.md b/_posts/2017-04-06-march-for-science.md
new file mode 100644
index 0000000..62bb949
--- /dev/null
+++ b/_posts/2017-04-06-march-for-science.md
@@ -0,0 +1,13 @@
+---
+title: "Redirect"
+date: 2017-04-06
+author: rafa
+layout: post
+comments: true
+---
+
+This page was generated in error. The "Science really is non-partisan: facts and skepticism annoy everybody" blog post is [here](http://simplystatistics.org/2017/04/24/march-for-science/).
+
+Apologies for the inconvenience.
+
+
diff --git a/_posts/2017-04-24-march-for-science.md b/_posts/2017-04-24-march-for-science.md
new file mode 100644
index 0000000..ddad582
--- /dev/null
+++ b/_posts/2017-04-24-march-for-science.md
@@ -0,0 +1,33 @@
+---
+title: "Science really is non-partisan: facts and skepticism annoy everybody"
+date: 2017-04-24
+author: rafa
+layout: post
+comments: true
+---
+
+This is a short open letter to those that believe scientists have a “liberal bias” and question their objectivity. I suspect that for many conservatives, this Saturday’s March for Science served as confirmation of this fact. In this post I will try to convince you that this is not the case, specifically by pointing out how scientists often annoy the left as much as the right.
+
+First, let me emphasize that scientists are highly appreciative of members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF. Although the current administration did propose a 20% cut to NIH, we are aware that, generally speaking, support for scientific research has traditionally been bipartisan.
+
+It is true that the typical data-driven scientist will disagree, sometimes strongly, with many stances that are considered conservative. For example, most scientists will argue that:
+
+1. Climate change is real and is driven largely by increased carbon dioxide and other human-made emissions into the atmosphere.
+2. Evolution needs to be part of children’s education and creationism has no place in Science class.
+3. Homosexuality is not a choice.
+4. Science must be publicly funded because the free market is not enough to make science thrive.
+
+But scientists will also hold positions that are often criticized heavily by some of those who identify as politically left wing:
+
+1. Current vaccination programs are safe and need to be enforced: without herd immunity thousands of children would die.
+2. Genetically modified organisms (GMOs) are safe and are indispensable to fight world hunger. There is no need for warning labels.
+3. Using nuclear energy to power our electrical grid is much less harmful than using natural gas, oil and coal and, currently, more viable than renewable energy.
+4. Alternative medicine, such as homeopathy, naturopathy, faith healing, reiki, and acupuncture, is pseudo-scientific quackery.
+
+The timing of the announcement of the March for Science, along with the organizers’ focus on environmental issues and diversity, may have made it seem like a partisan or left-leaning event, but please also note that many scientists [criticized](https://www.nytimes.com/2017/01/31/opinion/a-scientists-march-on-washington-is-a-bad-idea.html) the organizers for this very reason and there was much debate in general. Most scientists I know that went to the march did so not necessarily because they are against Republican administrations, but because they are legitimately concerned about some of the choices of this particular administration and the future of our country if we stop funding and trusting science.
+
+If you haven’t already seen this [Neil deGrasse Tyson video](https://www.youtube.com/watch?v=8MqTOEospfo) on the importance of Science to everyone, I highly recommend it.
+
+
diff --git a/_posts/2017-05-04-debt-haircuts.md b/_posts/2017-05-04-debt-haircuts.md
new file mode 100644
index 0000000..828a50e
--- /dev/null
+++ b/_posts/2017-05-04-debt-haircuts.md
@@ -0,0 +1,16 @@
+---
+title: "Some default and debt restructuring data"
+date: 2017-05-04
+author: rafa
+layout: post
+comments: true
+---
+
+Yesterday the government of Puerto Rico [asked for bankruptcy relief in federal court](https://www.nytimes.com/2017/05/03/business/dealbook/puerto-rico-debt.html). Puerto Rico owes about $70 billion to bondholders and about $50 billion in pension obligations. Before asking for protection the government offered to pay back some of the debt (50% according to some news reports) but bondholders refused. Bondholders will now fight in court to recover as much of what is owed as possible, while the government and a federal oversight board will try to lower this amount. What can we expect to happen?
+
+A case like this is unprecedented, but there are plenty of data on restructurings. An [op-ed](http://www.elnuevodia.com/opinion/columnas/ladeudaserenegociaraeneltribunal-columna-2317174/) by Juan Lara pointed me to [this](http://voxeu.org/article/argentina-s-haircut-outlier) blog post describing data on 180 debt restructurings. I am not sure how informative these data are with regard to Puerto Rico, but the plot below sheds some light on the variability of previous restructurings. Colors represent regions of the world and the lines join points from the same country. I added data from US cases shown in [this paper](http://www.nfma.org/assets/documents/RBP/wp_statliens_julydraft.pdf).
+
+![](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2017-05-04/haircuts.png)
+
+The cluster of points you see below the 30% mark appears to be made up of cases involving particularly poor countries: Albania, Argentina, Bolivia, Ethiopia, Bosnia and Herzegovina, Guinea, Guyana, Honduras, Cameroon, Iraq, Congo, Rep., Costa Rica, Mauritania, Sao Tome and Principe, Mozambique, Senegal, Nicaragua, Niger, Serbia and Montenegro, Sierra Leone, Tanzania, Togo, Uganda, Yemen, and Republic of Zambia. Note also that these restructurings happened after 1990.
+
diff --git a/html/midterm2012.html b/html/midterm2012.html
new file mode 100644
index 0000000..3ccc954
--- /dev/null
+++ b/html/midterm2012.html
@@ -0,0 +1,113 @@
+A quick guide to 538 using formulas

Nate Silver does a great job of explaining his forecast model to laypeople. However, as a statistician I’ve always wanted to know more details. After preparing a “predict the midterm elections” homework for my data science class I have a better idea of what is going on. Here is my current best explanation of the model that motivates the way they create a posterior distribution for the election day result. Note: this was written in a couple of hours and may include mistakes.

+

Let \(\theta\) represent the real difference between the republican and democratic candidates on election day. The naive approach used by individual pollsters is to obtain poll data and construct a confidence interval. For example, by using the normal approximation to the binomial distribution we can write:

+

\[Y = \theta + \varepsilon \mbox{ with } \varepsilon \sim N(0,\sigma^2)\]

+

with \(\sigma^2\) inversely proportional to the number of people polled. One of the most important insights made by poll aggregators is that this naive approach underestimates the variance because it ignores pollster effects (also referred to as house effects), as demonstrated by the plot below. For polls occurring within 1 week of the 2010 midterm election, this plot shows the difference between individual predictions and the actual outcome of each race, stratified by pollster.

+

[Figure: plot of chunk unnamed-chunk-2]

+

The model can be augmented to \[Y_{i,j} = \theta + h_i + \varepsilon_{i,j} \mbox{ for polls } i=1,\dots,M \mbox{ and } j \mbox{ an index representing days left until election}\]

+

Here \(h_i\) represents a random pollster effect. Another important insight is that by averaging these polls the estimator’s variance is reduced, and that we can estimate the across-pollster variance from the data. Note that to estimate \(\theta\) we need an assumption such as \(\mbox{E}(h_i)=0\). More on this later. Also note that we can model the pollster-specific effects to have different variances. To estimate these we can use previous elections. With these in place, we can construct weighted estimates for \(\theta\) that down-weight bad pollsters.
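
To illustrate this kind of aggregation, here is a small R sketch using simulated polls. The random-effects fit via the `lme4` package is just one way to estimate \(\theta\) and the across-pollster variance; it is not necessarily what 538 does:

```r
library(lme4)

set.seed(2014)
theta <- 2                                    # true spread, in percentage points
house <- rnorm(6, mean = 0, sd = 1.5)         # house effects h_i for 6 pollsters
polls <- data.frame(
  pollster = rep(paste0("pollster_", 1:6), each = 5),
  y        = theta + rep(house, each = 5) + rnorm(30, sd = 1)
)

fit <- lmer(y ~ 1 + (1 | pollster), data = polls)
fixef(fit)     # estimate of theta (the intercept)
VarCorr(fit)   # across-pollster and residual variance components
```

The intercept estimate plays the role of \(\theta\); allowing pollster-specific variances, as suggested above, would require a somewhat more elaborate model.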

+

This model is still insufficient as it ignores another important source of variability: time. In the figure below we see data from the Minnesota 2000 senate race. Note that had we formed a confidence interval based on aggregated data (different colors represent different pollsters) 20 days before the election, we would have been quite certain that the republican was going to win, when in fact the democrat won (the red X marks the final result). Note that the 99% confidence interval we formed 20 days before the election was not for \(\theta\) but for \(\theta\) plus some day effect.

+

[Figure: plot of chunk unnamed-chunk-3]

+

There was a well-documented internet feud in which Nate Silver explained why the Princeton Election Consortium’s snapshot predictions were overconfident: they ignored this source of variability. We therefore augment the model to

+

\[Y_{i,j} = \theta + h_i + d_j + \varepsilon_{i,j}\]

+

with \(d_j\) the day effect. Although we can model this as a fixed effect and estimate it with, for example, loess, this is not that useful for forecasting since we don’t know if the trend will continue. More useful is to model it as a random effect with its variance depending on the number of days left until the election. The plot below shows the residuals for the Rasmussen pollster and motivates the need to model a decreasing variance. Note that we also want to assume \(d\) is an auto-correlated process.

+

[Figure: plot of chunk unnamed-chunk-4]

+

If we apply this model to current data we obtain confidence intervals that are generally smaller than those implied by the current 538 forecast. This is because there are general biases that we have not accounted for. Specifically, our assumption that \(\mbox{E}(h_i)=0\) is incorrect. This assumption says that, on average, pollsters are not biased, but this is not the case. Instead we need to add a general bias to the model

+

\[Y_{i,j} = \theta + h_i + d_j + b + \varepsilon_{i,j}.\]

+

But note we can’t estimate \(b\) from the data: this model is not identifiable. However, we can model \(b\) as a random effect and estimate its variance from past elections where we know \(\theta\). Here is a plot of residuals that gives us an idea of the values \(b\) can take. Note that the standard deviation of the yearly average bias is about 2. This means that the SE has a lower bound: even with data from \(\infty\) polls we should not assume our estimates have an SE lower than 2.

+

[Figure: plot of chunk unnamed-chunk-5]
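
In symbols, here is a rough sketch of why this bias caps the precision (ignoring the day effect, and writing \(\sigma_h^2\) for the across-pollster variance and \(\sigma_b^2\) for the variance of the general bias, which is about \(2^2\)): averaging \(M\) polls shrinks the poll and pollster terms but not the shared bias, so

\[\mbox{SE}(\bar{Y}) \approx \sqrt{\frac{\sigma^2 + \sigma_h^2}{M} + \sigma_b^2} \longrightarrow \sigma_b \approx 2 \mbox{ as } M \rightarrow \infty.\]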

+

Here is a specific example where this bias resulted in all polls being wrong. Note where the red X is.

+

[Figure: plot of chunk unnamed-chunk-6]

+

Note that, despite these polls predicting a clear victory for Angle, 538 only gave her an 83% chance of winning. They must be including some extra variance term, as our model above does. Also note that we have written a model for one state. In a model including all states we could include a state-specific \(b\) as well as a general \(b\).

+

Finally, most of the aggregators report statements that treat \(\theta\) as random. For example, they report the probability that the republican candidate will win, \(\mbox{Pr}(\theta>0 | Y)\). This implies a prior distribution is set: \(\theta \sim N(\mu,\tau^2)\). As Nate Silver explained, 538 uses fundamentals to decide \(\mu\), while \(\tau\) can be deduced from the weight that the fundamentals are given in light of the poll data:

+
+

“This works by treating the state fundamentals estimate as equivalent to a “poll” with a weight of 0.35. What does that mean? Our poll weights are designed such that a 600-voter poll from a firm with an average pollster rating gets a weight of 1.00 (on the day of its release; this weight will decline as the poll ages). Only the lowest-rated pollsters will have a weight as low as 0.35. So the state fundamentals estimate is treated as tantamount to a single bad (though recent) poll. This differs from the presidential model, where the state fundamentals estimate is more reliable and gets a considerably heavier weight.”

+
+

I assume they used training/testing approaches to decide on this value of \(\tau\). But also note that it does not influence the final result of races with many polls. For example, in a race with 25 polls the data receive about 99% of the weight, making the posterior practically equivalent to the sampling distribution.
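
To see where the 99% figure comes from, here is a back-of-the-envelope calculation based on the weights quoted above, assuming 25 average-quality polls each with weight 1.00 and the fundamentals “poll” with weight 0.35:

\[\frac{25 \times 1.00}{25 \times 1.00 + 0.35} \approx 0.986.\]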

+

Finally, because the tails of the normal distribution are not fat enough to account for the upsets we occasionally see, 538 uses Fleishman’s transformation to increase these probabilities.

+

We have been discussing these ideas in class, and part of the homework was to predict the number of republican senators. Here are a few examples. The student that provides the smallest interval that includes the result wins (this explains why some took the risky approach of a one-number interval). In a few hours we will know how well they did.

+

[Figure: plot of chunk unnamed-chunk-7]
