Data Stories

Starting this fall, there is a Quantitative Reasoning with Data (QRD) requirement. This course serves this requirement. It provides you with a toolkit for thoughtful, critical, and skeptical engagement with real data sets with the goal to develop a healthy skepticism regarding data collection and interpretation. The language of multi-variable calculus is well suited for describing data. Examples: Data are vectors, the dot product can be interpreted as a covariance, the correlation the cosine of the angle between vectors. To test sensitivity of data with respect to parameters we need partial derivatives. Least square fitting is an optimization problem. For geometric data, probabilities are computed via integration. These are all topics we study in this course. On this page, we share some `data set stories' and provide data sets to experiment with.

Area of polygons

In the homework for Unit 28 we have a data problem about area of regions. Here is the Mathematica code which computes the area of a polygon using Green's theorem:
country = "Ecuador";
CountryData[country, "LandArea"]
A = First[CountryData[country, "SchematicCoordinates"]];
P[{x_, y_}] := 6371*{y*Pi/180, Sin[x*Pi/180]};
B = Map[P, A];
Area[Polygon[B]]
MyArea[A_] := Sum[n = Length[A]; a = A[[k, 1]]; b = A[[k, 2]];
  c = A[[Mod[k, n] + 1, 1]];
  d = A[[Mod[k, n] + 1, 2]]; (a*d - b*c)/2, {k, n}];
MyArea[B]
Here is how you can get a random country:
country = RandomChoice[CountryData[]];


Elevation data

In Homework 16, we have a data project dealing with elevation data. The code is on the Mathematica page and also as a Mathematica Notebook. The picture to the right for example was generated with
A = Reverse[ Normal[GeoElevationData[
     Entity["City", {"Cambridge", "Massachusetts", "UnitedStates"}],
     GeoProjection -> Automatic]]]; ListContourPlot[A]
We use elevation data from Mathematica. The source are the GIS elevation data. Here is an example of the elevation data near Mt St Helens before and after eruption:
A=Import["http://sites.fas.harvard.edu/~math21a/data/oldhelens.dem","Data"];
B=Import["http://sites.fas.harvard.edu/~math21a/data/newhelens.dem","Data"];
ListContourPlot[A,Contours->20]
ListContourPlot[B,Contours->20]


Roller coaster

In this blog, Robin Deits uses a smartphone to measure the accelerations of a roller coaster ride on Cedar point, the foremost American roller coaster park. Robin made the data available on a github repository.
We have the data available on this Mathematica notebook.


Misery (was mentioned in intro meeting)

The Misery index is the sum of unemployment and inflation. The term was coined by the economist Arthur Melvin Okun (1928-1980) in the 1970ies who thought about big questions in economics like equality and efficiency, one of the big anti-podes also in politics, taxation. It became more complicated with globalization. Okun created the index in a time, when both quantities were large. It might have influenced elections like led to the defeat of Richard Ford, where the misery index peaked to 19.9. During the Carter presidency, it hit 21.98 and might have contributed to Carters defeat in 1980. Source. So, here are the latest data sets. There are written so that they can be run directly in Mathematica. What is important, when curating data is that one always includes the source so that one can check that the data are correct and not manipulated. To the right is the outcome of the dataplot. Of course, the animation in 2D (seen here) is much cooler.

Primes ((x-2y+z)2=4 appears in HW 1}

As an experimental number theorist, you investigate prime numbers. Let p(k) denote the k'th prime number. We have p(1)=2, p(2)=3, p(3)=5, p(4)=7, p(5)=11, etc. We would like to understand these data. One day, you decide to visualize the primes in space. To do so, you plot the points (p(k),p(k+1),p(k+2)). Here are the data points obtained with
Table[{Prime[k],Prime[k+1],Prime[k+2]},{k,8}] 
When looking at the points we notice that x-2y+z is often either 2 or -2. The quantity x-2y+z measures some kind of acceleration. While v(k) = p(k+1)-p(k) is a rate of change, the rate of change of the rate of change is v(k+1)-v(k) = [p(k+2)-p(k+1)] - [p(k+1)-p(k)] = p(k+2)-2p(k+1)+p(k). From the first 8 data points 75 percent have |x-2y+z|=2. To experiment more, use the computer
f[n_]:=Sum[If[(Prime[k]-2Prime[k+1]+Prime[k+2])^2==4,1,0],{k,n}]/n;
From the first 100 data points 45 are on the set S = { (x-2y+z)2=4 } The function f(n) gives the fraction of points X(k) which are on S if k runs from 1 to n, divided by n. For example, f(8)=6/8 or f(100)=45/100. An unsolved problem is to decide how f(n) behaves for large n. Does it go to zero? If yes, how fast? Nobody seems to have looked at it yet. This example shows that data also can appear in purely mathematical frame works. Almost all theorems in mathematics, almost all conjectures come from experimenting with data at first. In the case of primes, one has first noticed that there appear to be infinitely many, then proved it. Gauss, without a computer of course, looked at the prime data and conjectured how the number of primes grow. This was later proven and is called the prime number theorem. For differences of primes, one believes (again, by looking at data) that there are infinitely many cases, where (x,y) = (p(k),p(k+1)) are on the line y-x=2. This is the famous open prime twin conjecture.
x    y    z           x-2y+z
----------------------------
2    3    5               1
3    5    7               0
5    7    11              2
7    11   13             -2
11   13   17              2
13   17   19             -2
17   19   23              2
19   23   29              2

We see that a spacial representation of data leads to new questions and possibly insight. In this case the problem is a pure mathematical problem about prime numbers. Prime numbers used to be a domain which was of interest only for theoretical mathematicians. Because of cryptological applications, it has become one of the most applied topics of mathematics overall.

University ranking

(* https://www.topuniversities.com/university-rankings/world-university-rankings/2019 *) (* Uni, Overall, Academic, Employer, Fac/Stu,cite,intern faculty, intern. students *) A={ {"MIT" ,100 ,100 ,100 ,100 ,99.8,100 ,95.5}, {"Stanford" ,98.6 ,100 ,100 ,100 ,99 ,99.8 ,70.5}, {"Harvard" ,98.5, 100 ,100 ,99.3,99.8,92.1 ,75.7}, {"Caltech" ,97.2, 98.7,81.2,100 ,100 ,96.8 ,90.3}, {"Oxford" ,96.8, 100 , 100,100 , 83 ,99.6 ,98.8}, {"Cambridge" ,95.6, 100 , 100,100 ,77.2,99.4 ,97.9}, {"ETH" ,95.3, 98.2,96.2,82.4,98.7, 100 ,98.6}, {"Imperial" ,93.3, 98.7,99.9,99.9,67.8, 100 , 100}, {"Chicago" ,93.2, 99.6,90.7,97.4,83.6,74.2 ,82.5}, {"UCL" ,92.9, 99.3,99.2,99.2,66.2,98.7 , 100} };

Capital, Labor and Production values

The original article of Cobb and Douglas A Theory of Production, American Economic Review, 18/1, 1928 contains the following data:
(* Year,Capital,Labor,Production   *)
  
A={
{1899,100,100,100},
{1900,107,105,101},
{1901,114,110,112},
{1902,122,118,122},
{1903,131,123,124},
{1904,138,116,122},
{1905,149,125,143},
{1906,163,133,152},
{1907,176,138,151},
{1908,185,121,126},
{1909,198,140,155},
{1910,208,144,159},
{1911,216,145,153},
{1912,226,152,177},
{1913,236,154,184},
{1914,244,149,169},
{1915,266,154,189},
{1916,298,182,225},
{1917,335,196,227},
{1918,366,200,223}, 
{1919,387,193,218}, 
{1920,407,193,231}, 
{1921,417,147,179}, 
{1922,431,161,240}
};