Imputation for an Entire Python/Pandas Dataframe with Yearly Time Series

I have a dataframe with 50 countries and 80 features (varying widely in scale), over 25 years.
The variance between features, and between countries' values within the same feature, is wide.
I'm trying to accurately impute the missing values across the whole dataframe, all at once.
I tried SimpleImputer with the mean strategy, but it computes a single mean for the entire feature column and ignores any yearly time trend for the specific country.
This made the imputed values wildly inaccurate for smaller countries, because they reflected the mean of that feature across ALL the larger countries as well.
And if a feature was declining across all countries, that trend was ignored, because the column mean was so much larger than the smaller countries' values.
TL;DR:
Currently:
Country Year x1 x2 x3 ...
USA 1990 4 581000 472
USA 1991 5 723000 389
etc...
CHN 1990 5 482000 393
CHN 1991 7 623000 512
etc...
CDR 1990 1 NaN 97
CDR 1991 NaN 91000 NaN
etc...
How can I impute the missing values most accurately and efficiently, so that the imputation takes into account the scale of the country and feature while following the yearly time trend?
Goal:
Country Year x1 x2 x3 ...
USA 1990 4 581000 472
USA 1991 5 723000 389
etc...
CHN 1990 5 482000 393
CHN 1991 7 623000 512
etc...
CDR 1990 1 (87000) 97
CDR 1991 (3) 91000 (95)
etc...
Here the (3), (87000), and (95) would be suitable values: they follow the general time trend seen in other countries, but are scaled to the other values of the same feature for that specific country (in this case CDR).
Using SimpleImputer, these values would be MUCH higher and far less logical.
I know imputation is never perfect, but surely it can be made more accurate in this case.
If there is a noticeable trend over the years for a country, how can I reflect it while keeping the imputed values at a scale that matches the feature for that particular country?

You can try the following techniques:
- Random forest imputation (you can refer to this paper).
- Backward/forward filling (though it only considers the year); see the sketch below.
- Log returns.
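As a minimal sketch of the per-country, trend-aware idea (my own illustration, not taken from the paper above): group by country, sort by year, and interpolate each feature within the group, so imputed values follow that country's own yearly trend and stay on its scale. Column names are assumed from the example.
import pandas as pd

# df is assumed to have columns: Country, Year, x1, x2, x3, ...
feature_cols = [c for c in df.columns if c not in ("Country", "Year")]

def impute_country(group):
    group = group.sort_values("Year")
    # linear interpolation within one country follows its own yearly trend;
    # limit_direction="both" extends the nearest values over leading/trailing NaNs
    group[feature_cols] = group[feature_cols].interpolate(
        method="linear", limit_direction="both")
    return group

df_imputed = df.groupby("Country", group_keys=False).apply(impute_country)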

Related

Random Forest Regression for Multiple Groups

I have monthly weather data sampled over the past four years for about 50 different locations. Is there a way to create a single Random Forest Regression model that makes predictions for each of the 50 locations? I don't want to have to create 50 different models, as that seems time-expensive. I don't think one-hot-encoding the location names works, because the train/test split then only takes data from one of the locations to test with.
My data looks something like this:
Month Year Location Temp
3 2018 city1 42
3 2018 city2 50
3 2018 city3 30
4 2018 city1 50
4 2018 city2 55
4 2018 city3 60
...
12 2021 city1 20
12 2021 city2 40
12 2021 city3 30
And I want predictions for the next x number of months for city1, city2, city3, and so on.
I think Random Forest Regression is not the most appropriate model for this kind of task. A more appropriate model is a Recurrent Neural Network (RNN). Your problem fits exactly as a many-to-many sequence prediction, as described in this blog.
In your shoes, I would create a list of 50 RNN models and train one for each city separately (this can be done programmatically and reasonably quickly with multiple processes and threads, depending on the power of your computer and the size of your data).
To get starter code, you can google many-to-many sequence prediction with RNNs. An example is this blog post.
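A minimal sketch of that per-city loop, assuming a Keras LSTM and a dataframe df with the Month/Year/Location/Temp columns from the question (window size and architecture are illustrative, not prescriptive):
import numpy as np
from tensorflow import keras

def make_windows(values, window=12):
    # slice a 1-D temperature series into (samples, window, 1) inputs
    # and next-step targets
    X, y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        y.append(values[i + window])
    return np.array(X)[..., None], np.array(y)

models = {}
for city, group in df.sort_values(["Year", "Month"]).groupby("Location"):
    X, y = make_windows(group["Temp"].to_numpy())
    model = keras.Sequential([
        keras.layers.LSTM(16, input_shape=(X.shape[1], 1)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=50, verbose=0)
    models[city] = model  # one trained model per city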

Which imputation technique to use for filling missing population data based on 3 categorical columns?

I am fairly new to data science. Apologies if the question is unclear.
My data is in the following format:
year age_group pop Gender Ethnicity
0 1957 0 - 4 Years 264727 Mixed Mixed
1 1957 5 - 9 Years 218097 Male Indian
2 1958 10 - 14 Years 136280 Female Indian
3 1958 15 - 19 Years 135679 Female Chinese
4 1959 20 - 24 Years 119266 Mixed Mixed
...
Here, Mixed means both Male and Female for Gender, and Indian, Chinese, and others for Ethnicity, whereas pop is the population.
I have some rows with missing values like the following:
year age_group pop Gender Ethnicity
344 1958 70 - 74 Years NaN Mixed Mixed
345 1958 75 - 79 Years NaN Male Indian
346 1958 80 - 84 Years NaN Mixed Mixed
349 1958 75 Years & Over NaN Mixed Mixed
350 1958 80 Years & Over NaN Female Chinese
...
These can't be dropped or filled with the mean/median/previous values.
I am looking for cold-deck or other imputation techniques that would let me fill pop based on the values in year, age_group, Gender and Ethnicity.
Please provide any sample code or documentation that would help me.
It's hard to give a specific answer without knowing what you might want to use the data for. But here are some questions you should ask:
How many null values are there?
If there are only a few, e.g. fewer than 20, and you have the time, then you can look at each one individually. In this case, you might want to look up census data on Google etc. and make a guesstimate for each cell.
If there are more than can be individually assessed then we'll need to work some other magic.
Do you know how the other variables should relate to population?
Think about how other variables should relate to population. For example, if you know there's 500 males in one age cohort of a certain ethnicity but you don't know how many females... well 500 females would be a fair guess.
This would only cover some nulls, but is a logical assumption. You might be able to step through imputations of decreasing strength:
Find all single-sex null values where the corresponding opposite-sex cohort is known, and assume a 50:50 gender ratio for the cohort (see the sketch after this list)
Find all null values where the older and younger cohorts are known, and impute this cohort's population linearly between them
Something else...
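A minimal sketch of that first step, assuming the column names from the question and at most one row per year/age_group/Ethnicity/Gender combination (the 50:50 ratio is the stated assumption, not a fact about the data):
import numpy as np
import pandas as pd

# known single-sex populations, keyed by cohort and sex
key = ["year", "age_group", "Ethnicity"]
known = df[df["Gender"].isin(["Male", "Female"])].dropna(subset=["pop"])
lookup = known.set_index(key + ["Gender"])["pop"]
opposite = {"Male": "Female", "Female": "Male"}

def fill_5050(row):
    # a single-sex null whose opposite-sex cohort is known:
    # assume a 50:50 ratio, i.e. copy the opposite sex's value
    if pd.isna(row["pop"]) and row["Gender"] in opposite:
        k = (row["year"], row["age_group"], row["Ethnicity"],
             opposite[row["Gender"]])
        return lookup.get(k, np.nan)
    return row["pop"]

df["pop"] = df.apply(fill_5050, axis=1)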
This is a lot of work -- but again -- what do you want the data for? If you're looking for a graph it's probably not worth it. But if you're doing a larger study / trying to win a kaggle competition... then maybe it is?
What real world context do you have?
For example, if this data is the population of a certain country, then you might know that country's age distribution curve. You could then impute values for an ethnicity along that curve, given where the other age cohorts for the same ethnicity sit. This is brutally simplistic, but might be OK for your use case. A rough sketch follows.
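Here national_share is a hypothetical dict mapping each age_group to its share of the national population, and eth_df holds the rows for one ethnicity (both names are illustrative):
# scale the national curve to the cohorts we do know for this ethnicity,
# then read the missing cohorts off the scaled curve
known = eth_df.dropna(subset=["pop"])
scale = (known["pop"] / known["age_group"].map(national_share)).mean()
missing = eth_df["pop"].isna()
eth_df.loc[missing, "pop"] = (
    eth_df.loc[missing, "age_group"].map(national_share) * scale
)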
Do you need this column?
If there are lots of nulls, then whatever imputation you do will likely add a good degree of error. So what are you doing with the data? If you don't have to have the column, and there are a lot of nulls, then perhaps you're better without it.
Hope that helps -- good luck!

How should I structure my time series dataframe in Python/Pandas?

I have a dataframe with multiple repeating series, over time.
I want to create visualizations that compare series over time, as well as with one another.
What is the best way to structure my data to accomplish this?
I have thus far been trying to make smaller dataframes from this, either by dropping years or by selecting only one series, using a variety of index, list, or series calls that refer to the multiple years (.loc, .drop, etc.).
I always seem to encounter the same issues when making the actual graphs, usually relating to the years.
My best result has been making simple bar graphs with countries on the x axis and GDP from only 2018 on the Y axis.
I would like to eventually be able to have countries represented by color with 3D plotly graphs, wherein a series like GDP is Z (depth), Years are Y, and some other series like GNI could be X.
For now I am just aiming to make a scatterplot.
I am also fine with using matplotlib, seaborn, or whatever makes the most sense here.
Columns: [country, series, 1994, 1995, 1996, etc..]
Country Series 1994 1995 1996 ...
USA GDP 3.12 4.13
USA Export% 25.5 32
USA GNI 867,123,111 989,666,123
UK GDP 2.87 etc.
UK Export% 43.1
UK GNI 981,125,555
China GDP 5.98
China Export% NaN
China GNI 787,123,447
...
df1 = df.loc[df['series']== 'GDP']
time = df1['1994':'1996']
gdp_time = px.scatter(df1, x = time, y= 'series', color="country")
gdp_time.show()
#Desired Graph
gdp_time = px.scatter(df1, x = years, y= GDP, color= Countries)
gdp_time.show()
I find it hard to believe that I can't simply create a series that references the multiple year columns as a singular 'time'.
What am I missing?
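One common way to get that singular 'time' is to reshape to long form with pandas.melt, so the year columns collapse into a single column that can be passed to x. A minimal sketch, assuming the column names from the example above:
import pandas as pd
import plotly.express as px

# wide -> long: one row per (country, series, year)
df_long = df.melt(id_vars=['country', 'series'],
                  var_name='year', value_name='value')
df_long['year'] = df_long['year'].astype(int)

df1 = df_long[df_long['series'] == 'GDP']
gdp_time = px.scatter(df1, x='year', y='value', color='country')
gdp_time.show()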

Combining similar rows in Stata / python

I am doing some data prep for graph analysis and my data looks as follows.
country1 country2 pair volume
USA CHN USA_CHN 10
CHN USA CHN_USA 5
AFG ALB AFG_ALB 2
ALB AFG ALB_AFG 5
I would like to combine them such that
country1 country2 pair volume
USA CHN USA_CHN 15
AFG ALB AFG_ALB 7
Is there a simple way for me to do so in Stata or Python? I've tried making a duplicate dataframe, renaming the 'pair' as country2_country1, then merging them and dropping duplicate volumes, but it's a hairy way of going about things; I was wondering if there is a better way.
If it helps to know, my data format is for a directed graph, and I am converting it to undirected.
Your key must consist of sets of two countries, so that they compare equal regardless of order. In Python/Pandas, this can be accomplished as follows.
import pandas as pd
import io
# load in your data
s = """
country1 country2 pair volume
USA CHN USA_CHN 10
CHN USA CHN_USA 5
AFG ALB AFG_ALB 2
ALB AFG ALB_AFG 5
"""
data = pd.read_table(io.StringIO(s), sep=r'\s+')
# create your key (using frozenset instead of set, since frozenset is hashable)
key = data[['country1', 'country2']].apply(frozenset, axis=1)
# group by the key and sum the volume within each pair
print(data.groupby(key)[['volume']].sum())
This results in
volume
(CHN, USA) 15
(AFG, ALB) 7
which isn't exactly what you wanted, but you should be able to get it into the right shape from here.
Here is a solution that takes advantage of pandas' automatic alignment of indexes.
df1 = df.set_index(['country1'])
df2 = df.set_index(['country2'])
df1['volume'] += df2['volume']
df1.reset_index().query('country1 > country2')
country1 country2 pair volume
0 USA CHN USA_CHN 15
3 ALB AFG ALB_AFG 7
Here is a solution based on @jean-françois-fabre's comment.
split_sorted = df.pair.str.split('_').map(sorted)
df_switch = pd.concat([split_sorted.str[0],
split_sorted.str[1],
df['volume']], axis=1, keys=['country1', 'country2', 'volume'])
df_switch.groupby(['country1', 'country2'], as_index=False, sort=False).sum()
output
country1 country2 volume
0 CHN USA 15
1 AFG ALB 7
In Stata you can just lean on the fact that alphabetical ordering gives a distinct signature to each pair.
clear
input str3 (country1 country2) volume
USA CHN 10
CHN USA 5
AFG ALB 2
ALB AFG 5
end
gen first = cond(country1 < country2, country1, country2)
gen second = cond(country1 < country2, country2, country1)
collapse (sum) volume, by(first second)
list
+-------------------------+
| first second volume |
|-------------------------|
1. | AFG ALB 7 |
2. | CHN USA 15 |
+-------------------------+
You can merge back with the original dataset if wished.
Documented and discussed here
NB: Presenting a clear data example is helpful. Presenting it as the code to input the data is even more helpful.
Note: As Nick Cox comments below, this solution gets a bit crazy when the number of countries is large. (With 200 countries, you need to accurately store a 200-bit number)
Here's a neat way to do it using pure Stata.
I effectively convert the countries into binary "flags", making something like the following mapping:
AFG 0001
ALB 0010
CHN 0100
USA 1000
This is achieved by numbering each country as normal, then calculating 2^(country_number). When we then add these binary numbers, the result is a combination of the two "flags". For example,
AFG + CHN = 0101
CHN + AFG = 0101
Notice that it now doesn't make any difference which order the countries come in!
So we can now happily add the flags and collapse by the result, summing over volume as we go.
Here's the complete code (heavily commented so it looks much longer than it is!)
// Convert country names into numbers, storing the resulting
// name/number mapping in a label called "countries"
encode country1, generate(from_country) label(countries)
// Do it again for the other country, using the existing
// mappings where the countries already exist, and adding to the
// existing mapping where they don't
encode country2, generate(to_country) label(countries)
// Add these numbers as if they were binary flags
// Thus CHN (3) + USA (4) becomes:
// 01000 +
// 10000
// -----
// 11000
// This makes combining countries commutative and unique. This means that
// the new variable doesn't care which way round the countries are
// nor can it get confused by pairs of countries adding up to the same
// number.
generate bilateral = 2^from_country + 2^to_country
// The rest is easy. Collapse by the new summed variable
// taking (arbitrarily) the lowest of the from_countries
// and the highest of the to_countries
collapse (sum) volume (min) from_country (max) to_country, by(bilateral)
// Tell Stata that these new min and max countries still have the same
// label:
label values from_country countries
label values to_country countries

Seaborn lmplot - Changing Marker Style and Color of single Datapoint

I was trying to find an answer to Harvard's CS109, Homework 1, Part 1c from 2013 using seaborn, which the course doesn't use.
"Choose a plot to show this relationship and specifically annotate the Oakland baseball team on the plot. Show this plot across multiple years. In which years can you detect a competitive advantage from the Oakland baseball team of using data science? When did this end?"
So, for multiple years and multiple teams, we have salaries as well as wins. I want to build a seaborn facet for each year regressing salary against wins AND call out the data point for Oakland. Building the facet with one regression per year works fine. But how would I change the data point for Oakland?
This is how my data looks (the first 5 entries):
teamID yearID salary W
0 ANA 1997 31135472 84
1 ANA 1998 41281000 85
2 ANA 1999 55388166 70
3 ANA 2000 51464167 82
4 ANA 2001 47535167 75
...
This is how I am plotting the data:
facetplots = sns.lmplot(x="salary", y="W", col="yearID", data=df_data, col_wrap=4, height=3)
Any help would be much appreciated.
Regards
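A hedged sketch of one way to do this: draw the lmplot facets as above, then overlay Oakland's point on each year's axes. The team code 'OAK' and the marker styling are assumptions, not taken from the question:
import seaborn as sns

facetplots = sns.lmplot(x="salary", y="W", col="yearID",
                        data=df_data, col_wrap=4, height=3)

# facetplots.col_names lists the yearID values in facet order
for year, ax in zip(facetplots.col_names, facetplots.axes.flat):
    oak = df_data[(df_data["teamID"] == "OAK") & (df_data["yearID"] == year)]
    ax.scatter(oak["salary"], oak["W"], color="red",
               marker="D", zorder=3, label="Oakland")
    ax.legend()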
