Today, I want to talk about some data interpolation I had to do recently.
As part of a project of mine, I had to deal with US census data. As you probably know, the US census collects data on many aspects of US society (population, education, income, race, and many others…), but it does so only once every 10 years. However, for my particular research, I needed yearly values for the population of some cities.
Both interpolation and regression can be used to predict unobserved values, but the basic difference between them is that, when you do a regression (let’s say a linear one), you use all your data points to find the line that minimizes the overall distance to the points (typically the sum of squared residuals), so the line does not need to pass through any of them. You are also interested in the functional form (the values of the slope and the intercept in this case). It is also common to keep the number of parameters as low as possible (you don’t choose a quadratic form if a line will do).
In the case of interpolation, you force the function to go exactly through the data points you have (that is, you assume that the points you have are the “true values”), and you use them to infer what the intermediate values would be. One option is to use a polynomial of order N-1 to interpolate N data points (remember that a polynomial of order N-1 can have up to N-2 maxima and minima). You can also use a piece-wise linear interpolation (that is, linking each point with the next with a straight line), or a piece-wise cubic interpolation: that is, if you have, let’s say, 20 points, you don’t use a polynomial of order 19, but instead use 4 points at a time to get a piece-wise cubic fit (this avoids getting a fit with too many maxima and minima). These are decisions that you have to make depending on your needs and the specifics of your data.
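To make the contrast concrete, here is a minimal sketch using only NumPy. The numbers are invented placeholders (not real census figures), and the variable names are mine, chosen just for illustration. It fits a least-squares line and a piece-wise linear interpolant to the same points and compares the residuals at the observed years.

```python
import numpy as np

# Invented placeholder values (NOT real census figures), in thousands.
years = np.array([1970, 1980, 1990, 2000, 2010], dtype=float)
population = np.array([100.0, 120.0, 115.0, 140.0, 150.0])

# Regression: one least-squares line through all points (2 parameters).
slope, intercept = np.polyfit(years, population, deg=1)
regression_at_data = slope * years + intercept

# Interpolation: piece-wise linear, forced through every data point.
interpolation_at_data = np.interp(years, years, population)

# Residuals at the observed years: non-zero for the regression,
# zero for the interpolation.
regression_residuals = population - regression_at_data
interpolation_residuals = population - interpolation_at_data

# Estimate for an intermediate (unobserved) year, both ways.
year = 1985.0
regression_1985 = slope * year + intercept
interpolation_1985 = np.interp(year, years, population)

print("regression residuals:   ", regression_residuals)
print("interpolation residuals:", interpolation_residuals)
print("1985 estimates:", regression_1985, interpolation_1985)
```

The regression line misses the observed points by design, while the interpolant reproduces them exactly and only "invents" values in between.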
As an example of what US census data on population and race looks like, let’s take one single city. This is the raw data:
Note that until the 1990’s, some minorities were not even recorded separately!
This is a simple scatter plot of the same data:
To obtain the estimated values of population for intercensal years, I’ll interpolate using this data. Let’s start with linear interpolation, that is, assuming that the behavior for the years between two data points is just linear:
To do the interpolation, I used the SciPy function interpolate.interp1d. Further down in this post I’ll share my code, but let’s keep exploring.
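As a quick illustration of how interp1d is typically called, here is a minimal sketch. The population values are invented placeholders, not the census figures for the city in this post, and the variable names are hypothetical.

```python
import numpy as np
from scipy import interpolate

# Invented placeholder values (NOT real census figures), in thousands.
census_years = np.array([1970, 1980, 1990, 2000, 2010])
population = np.array([100.0, 120.0, 115.0, 140.0, 150.0])

# interp1d returns a callable that interpolates between the given points.
linear_fit = interpolate.interp1d(census_years, population, kind="linear")

# Evaluate it for every year between the first and the last census.
all_years = np.arange(census_years[0], census_years[-1] + 1)
yearly_population = linear_fit(all_years)

print(yearly_population[:11])  # values for 1970 through 1980
```

Note that interp1d returns a function, not an array: you build it once from the census points and then evaluate it at whatever years you need.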
In fact, using the same function, I can also extrapolate beyond my data, to get the estimates after 2010:
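One way to get this behavior from interp1d is the fill_value="extrapolate" option; without it, asking for a year outside the data range raises a ValueError. The sketch below uses invented placeholder values (not real census figures) and hypothetical variable names.

```python
import numpy as np
from scipy import interpolate

# Invented placeholder values (NOT real census figures), in thousands.
census_years = np.array([1970, 1980, 1990, 2000, 2010])
population = np.array([100.0, 120.0, 115.0, 140.0, 150.0])

# Default behavior: evaluating outside the data range is an error.
strict_fit = interpolate.interp1d(census_years, population, kind="linear")
raised_out_of_range = False
try:
    strict_fit(2015)
except ValueError:
    raised_out_of_range = True

# With fill_value="extrapolate", the last segment is extended past 2010.
extrapolating_fit = interpolate.interp1d(
    census_years, population, kind="linear", fill_value="extrapolate"
)
future_years = np.arange(2011, 2016)
future_population = extrapolating_fit(future_years)

print(raised_out_of_range)
print(dict(zip(future_years.tolist(), future_population.tolist())))
```

With linear extrapolation, the estimates after 2010 simply continue the 2000-2010 trend, so they should be read with more caution the further they are from the last census.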
Note that in the previous cases, because data for some minorities was not even recorded before 1980 or 1990, I have chosen to keep the corresponding missing values. I could have also extrapolated into the past… although in that case, the different sub-populations wouldn’t add up to the total recorded by the US census, which is something to keep in mind and correct appropriately.
For the sub-populations that have 6 or 7 data points (Total, Black and White), I may want to consider a more sophisticated type of interpolation such as cubic, which will make the curves look smoother. For sub-populations that only have 3 or 4 data points, this is not advisable, because the fit could behave oddly at the beginning or end of the time series (more on this later).
This would be the result with cubic interpolation/extrapolation for the Total, Black and White series, and linear for the rest of the series:
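The rule of thumb above (cubic for the longer series, linear for the short ones) can be sketched as follows. The threshold of 6 points, the helper choose_kind, the series names and all the values are assumptions made for illustration; they are not the census data or the code behind the plot.

```python
import numpy as np
from scipy import interpolate

# Hypothetical threshold, mirroring the "6 or 7 data points" rule of thumb.
MIN_POINTS_FOR_CUBIC = 6


def choose_kind(values):
    """Pick 'cubic' for series with enough observed points, else 'linear'."""
    n_points = int(np.count_nonzero(~np.isnan(values)))
    return "cubic" if n_points >= MIN_POINTS_FOR_CUBIC else "linear"


# Invented placeholder values (NOT real census figures), in thousands.
census_years = np.array([1950, 1960, 1970, 1980, 1990, 2000, 2010], dtype=float)
series = {
    "Total": np.array([80.0, 95.0, 100.0, 120.0, 115.0, 140.0, 150.0]),
    # Not recorded before 1990, so only 3 observed points.
    "Other": np.array([np.nan, np.nan, np.nan, np.nan, 5.0, 8.0, 12.0]),
}

all_years = np.arange(1950, 2011)
results = {}
for name, values in series.items():
    observed = ~np.isnan(values)
    kind = choose_kind(values)
    # bounds_error=False with fill_value=NaN keeps years before the first
    # recorded value as missing instead of extrapolating into the past.
    fit = interpolate.interp1d(
        census_years[observed],
        values[observed],
        kind=kind,
        bounds_error=False,
        fill_value=np.nan,
    )
    results[name] = (kind, fit(all_years))

for name, (kind, yearly) in results.items():
    print(name, kind, yearly[-3:])
```

Either kind of fit still passes exactly through the observed census values; the choice only changes what happens in between.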
In case you want to use it, here is the code I wrote for this example, with comments:
Let’s assume that you have the original US census data that I showed at the very beginning, for one city, in a pandas data frame called df_census_city.
First I defined some variables to customize the type of interpolation/extrapolation:
Remember that you can create HTML code from a snippet of your Python code using hilite.
Then I list the different sub-populations that I want to interpolate, and I define some nice colors for the different series:
Then, I loop over the different populations, and I define for each one a trace for the data and another one for the interpolated data (I do this by calling a little function I wrote, see below):
Now I take care of the layout for the figure and I plot it (with Plotly):
And this is the customized function I wrote for interpolation:
Finally, this is the code to create a new data frame that includes all the interpolated data, from the original df_census_city data frame:
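For readers who want a starting point, here is a minimal sketch of this step. It is not the original listing: the layout of df_census_city (a "Year" column plus one column per sub-population), the column names and all the values are assumptions, and the numbers are invented placeholders rather than census figures.

```python
import numpy as np
import pandas as pd
from scipy import interpolate

# Hypothetical stand-in for df_census_city (layout and values are assumed).
df_census_city = pd.DataFrame(
    {
        "Year": [1970, 1980, 1990, 2000, 2010],
        "Total": [100.0, 120.0, 115.0, 140.0, 150.0],
        "Other": [np.nan, np.nan, 5.0, 8.0, 12.0],  # not recorded before 1990
    }
)

# One row per year between the first and the last census.
all_years = np.arange(df_census_city["Year"].min(), df_census_city["Year"].max() + 1)
df_interpolated = pd.DataFrame({"Year": all_years})

for column in df_census_city.columns.drop("Year"):
    # Use only the years in which this sub-population was recorded.
    observed = df_census_city[["Year", column]].dropna()
    fit = interpolate.interp1d(
        observed["Year"].to_numpy(),
        observed[column].to_numpy(),
        kind="linear",
        bounds_error=False,  # years outside the observed range stay NaN
        fill_value=np.nan,
    )
    df_interpolated[column] = fit(all_years)

print(df_interpolated.head())
```

Fitting each column separately on its own non-missing years is what lets the shorter series keep their missing values in the early decades.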
And the new data frame df_interpolated looks like this: