I have monthly weather data sampled over the past four years for about 50 different locations. Is there a way I can create a single Random Forest Regression model that makes predictions for each of the 50 locations individually? I don't want to create 50 different models, as that seems time-expensive. I don't think one-hot-encoding the location names works, because the train/test split then only takes data from one of the locations to test with.
My data looks something like this:
Month Year Location Temp
3 2018 city1 42
3 2018 city2 50
3 2018 city3 30
4 2018 city1 50
4 2018 city2 55
4 2018 city3 60
...
12 2021 city1 20
12 2021 city2 40
12 2021 city3 30
And I want predictions for the next x number of months for city1, city2, city3, and so on.
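For reference, a minimal sketch of the single-model setup I have in mind (assuming a DataFrame df with the columns shown above; the hyperparameters are placeholders). Passing stratify=df["Location"] to train_test_split is one way to make sure every city ends up in both the train and the test set:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# df is assumed to have the columns shown above: Month, Year, Location, Temp
X = pd.get_dummies(df[["Month", "Year", "Location"]], columns=["Location"])
y = df["Temp"]

# stratify keeps a share of every city in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=df["Location"], random_state=0
)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))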
I think Random Forest Regression is not the most appropriate model for this kind of task. A more appropriate model is a Recurrent Neural Network (RNN). Your problem fits exactly as a many-to-many sequence prediction as described in this blog.
In your shoes, I would create a list of 50 RNN models and train one for each city separately (this can be done programmatically and reasonably quickly with multiple processes or threads, depending on the power of your computer and the size of your data).
To get starter code, you can google many-to-many sequence prediction with RNNs. An example is this blog post.
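A rough sketch of that per-city loop (assuming Keras and the DataFrame df from the question; the window length and layer sizes are just placeholders):

import numpy as np
from tensorflow import keras

def make_windows(series, lookback=12):
    """Turn a 1-D series into (samples, lookback, 1) windows and next-step targets."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return np.array(X)[..., None], np.array(y)

models = {}
for city, group in df.sort_values(["Year", "Month"]).groupby("Location"):
    X, y = make_windows(group["Temp"].to_numpy(dtype="float32"))
    model = keras.Sequential([
        keras.Input(shape=(X.shape[1], 1)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=50, verbose=0)
    models[city] = model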
I am trying to carry out what I thought would be a typical groupby-and-average problem on a DataFrame, but it has gotten a bit more complex than I anticipated, since the problem deals with string/ordinal years and float values. I am using Python. I will explain below.
I have a data frame showing different model years for different models of refrigerators across several counties in a state. I want to find the average model year of refrigerator for each county.
I have this example dataframe (abbreviated since the full dataframe would be far too long to show):
County_ID Type Year Population
--------------------------------------------
1 A 2022 54355
1 A 2021 54645
1 A 2020 14554
...
1 B 2022 23454
1 B 2021 34657
1 B 2020 12343
...
1 C 2022 23454
1 C 2021 34537
1 C 2020 23323
...
2 A 2022 54355
2 A 2021 54645
2 A 2020 14554
...
2 B 2022 23454
2 B 2021 34657
2 B 2020 12343
...
2 C 2022 23454
2 C 2021 34537
2 C 2020 23323
...
3 A 2022 54355
3 A 2021 54645
3 A 2020 14554
...
3 B 2022 23454
3 B 2021 34657
3 B 2020 12343
...
3 C 2022 23454
3 C 2021 34537
3 C 2020 23323
...
I kept this abbreviated for space, but the idea is that I have many counties in my data, with county IDs going from 1 all the way to 50, so 50 counties in total. In this example, there are 3 types of refrigerators, and for each type, the model year vintages of those refrigerators are shown, i.e. how old the refrigerator is. Population shows how many of each physical unit (unique pair of type and year) are found in each county. What I am trying to find is, for each County_ID, the average year.
And so I want to produce the following DataFrame:
County_ID Average_vintage
--------------------------------
1 XXXX.XX
2 XXXX.XX
3 XXXX.XX
4 XXXX.XX
5 XXXX.XX
6 XXXX.XX
...
But here is why this is confusing me: I want to find the average year, but year is ordinal data and not float, so I am a bit confused conceptually. What I want to do, I think, is weight by population. Basically, to find the average vintage of refrigerators you would take the average of the years, but a vintage with a higher population should have more influence on that average. So I want to weight the vintages by population and treat the years like floats, so the average comes out with a decimal attached; for example, the average refrigerator vintage for County 22 might be 2015.48 or something like that. That is what I am going for. I am trying this:
avg_vintage = df.groupby(['County_ID']).mean()
but I don't think this really makes sense, since I need to account for how many (population) of each refrigerator there actually are in each county. How can I find the average year/vintage for each county, taking into account how many of each refrigerator (population) are found in that county, using Python?
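For what it's worth, a minimal sketch of the weighted version I have in mind (assuming the DataFrame above is named df and Year can be cast to float):

import numpy as np
import pandas as pd

# population-weighted mean of Year per county
avg_vintage = (
    df.groupby("County_ID")
      .apply(lambda g: np.average(g["Year"].astype(float), weights=g["Population"]))
      .rename("Average_vintage")
      .reset_index()
)
print(avg_vintage)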
I'm building a model, pretty much similar to the well-known House Price Prediction. I got to the point where I need to encode my nominal categorical variables using scikit-learn's OneHotEncoder. The so-called "Dummy Variable Trap" is clear to me, so I need to drop one of my one-hot encoded columns to avoid multicollinearity.
What's bothering me is the way to handle unseen categories. In my understanding, the unseen categories will be treated the same way as the "base category" (the category I dropped).
To make it clear have a look at this example:
This is the training data I use to fit my OneHotEncoder.
X_train:
index  city
0      Munich
1      Berlin
2      Hamburg
3      Berlin
OneHotEncoding:
from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder(handle_unknown='ignore', drop='first')
oh.fit_transform(X_train)
Because of drop='first', the first column ('city_Munich') will be dropped.
index  city_Berlin  city_Hamburg
0      0            0
1      1            0
2      0            1
3      1            0
Now it's time to encode unseen data:
X_test:
index  city
10     Munich
11     Berlin
12     Hamburg
13     Cologne
oh.transform(X_test)
index  city_Berlin  city_Hamburg
10     0            0
11     1            0
12     0            1
13     0            0
I guess you see my problem: row 10 (Munich) is treated the same way as row 13 (Cologne).
Either I run into the "dummy variable trap" by not dropping a column, or I treat unseen data as the "base category", which is in fact wrong.
What's a proper way to deal with this? Is there any option in the OneHotEncoder class to add a new column for unseen categories, like "city_unseen"?
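For what it's worth, one workaround I can think of (a sketch only; the "unseen" placeholder and the extra training row are my own construction, not an OneHotEncoder option) is to map any city the encoder has not seen during training to an explicit "unseen" category before encoding, so it gets its own column instead of collapsing into the base category:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"city": ["Munich", "Berlin", "Hamburg", "Berlin"]})
X_test = pd.DataFrame({"city": ["Munich", "Berlin", "Hamburg", "Cologne"]})

known = set(X_train["city"])
# add one placeholder row so the encoder learns the extra "unseen" column
X_train_aug = pd.concat([X_train, pd.DataFrame({"city": ["unseen"]})], ignore_index=True)
# map any city that was not seen during training to the placeholder
X_test_mapped = X_test.assign(city=X_test["city"].where(X_test["city"].isin(known), "unseen"))

oh = OneHotEncoder(drop="first")
oh.fit(X_train_aug)
print(oh.get_feature_names_out())            # one of the columns is city_unseen
print(oh.transform(X_test_mapped).toarray())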
I am fairly new to data science. Apologies if the question is unclear.
**My data is in the following format:**
   year  age_group      pop     Gender  Ethnicity
0  1957  0 - 4 Years    264727  Mixed   Mixed
1  1957  5 - 9 Years    218097  Male    Indian
2  1958  10 - 14 Years  136280  Female  Indian
3  1958  15 - 19 Years  135679  Female  Chinese
4  1959  20 - 24 Years  119266  Mixed   Mixed
...
Here "Mixed" means both Male & Female for Gender, and Indian & Chinese & others for Ethnicity, whereas pop is the population.
I have some rows with missing values like the following:
year age_group pop Gender Ethnicity
344 1958 70 - 74 Years NaN Mixed Mixed
345 1958 75 - 79 Years NaN Male Indian
346 1958 80 - 84 Years NaN Mixed Mixed
349 1958 75 Years & Over NaN Mixed Mixed
350 1958 80 Years & Over NaN Female Chinese
...
These can't be dropped or filled with mean/median/previous values.
I am looking for cold-deck or any other imputation technique that would allow me to fill in pop based on the values in year, age_group, Gender and Ethnicity.
Please do provide any sample code or documentation that would help me.
It's hard to give a specific answer without knowing what you want to use the data for. But here are some questions you should ask:
How many null values are there?
If there are a few, e.g. fewer than 20, and you have the time, then you can look at each one individually. In this case, you might want to look up census data on Google etc. and make a guesstimate for each cell.
If there are more than can be individually assessed then we'll need to work some other magic.
Do you know how the other variables should relate to population?
Think about how other variables should relate to population. For example, if you know there's 500 males in one age cohort of a certain ethnicity but you don't know how many females... well 500 females would be a fair guess.
This would only cover some nulls, but is a logical assumption. You might be able to step through imputations of decreasing strength (a rough sketch follows this list):
Find all single-sex null values where the corresponding gender cohort is known; assume a 50:50 gender ratio for the cohort
Find all null values where the older and younger cohorts are known; impute this cohort's population linearly between them
Something else...
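A rough sketch of the first two rules (column names are assumed to match the question: year, age_group, pop, Gender, Ethnicity; rows are assumed to already be in ascending age order, as in the sample shown):

import pandas as pd

key = ["year", "age_group", "Ethnicity"]

# Rule 1: where a single-sex row is missing but the opposite sex is known,
# assume a 50:50 ratio and copy the known value across.
singles = df[df["Gender"].isin(["Male", "Female"])]
opposite = (
    singles.dropna(subset=["pop"])
    .assign(Gender=lambda d: d["Gender"].map({"Male": "Female", "Female": "Male"}))
    .rename(columns={"pop": "pop_opposite"})
)
df = df.merge(opposite[key + ["Gender", "pop_opposite"]], on=key + ["Gender"], how="left")
df["pop"] = df["pop"].fillna(df["pop_opposite"])
df = df.drop(columns="pop_opposite")

# Rule 2: for what is still missing, interpolate linearly between the
# neighbouring age cohorts within each year/Gender/Ethnicity group.
df["pop"] = (
    df.groupby(["year", "Gender", "Ethnicity"])["pop"]
      .transform(lambda s: s.interpolate(limit_area="inside"))
)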
This is a lot of work -- but again -- what do you want the data for? If you're looking for a graph it's probably not worth it. But if you're doing a larger study / trying to win a kaggle competition... then maybe it is?
What real world context do you have?
For example, if this data is for population in a certain country, then you might know the age distribution curve of that country? You could then impute values for ethnicities along the age distribution curve given where other age cohorts for the same ethnicity sit. This is brutally simplistic, but might be ok for your use case.
Do you need this column?
If there are lots of nulls, then whatever imputation you do will likely add a good degree of error. So what are you doing with the data? If you don't have to have the column, and there are a lot of nulls, then perhaps you're better without it.
Hope that helps -- good luck!
I have a DataFrame with 50 countries and 80 features (varying widely in scale), over 25 years.
The variance between feature values, and between values per country within the same feature, is wide.
I am trying to accurately impute the missing values across the whole DataFrame, all at once.
I tried SimpleImputer with the mean, but this gives the mean of the entire feature column and ignores any yearly time trend for the specific country.
This led to the imputed values being wildly inaccurate for smaller countries, as their imputed values reflected the mean of that feature column across ALL the larger countries as well.
AND, if there was a trend of that feature declining across all countries, it would be ignored because the mean is so much larger than the smaller countries' values.
TLDR;
Currently:
Country Year x1 x2 x3 ...
USA 1990 4 581000 472
USA 1991 5 723000 389
etc...
CHN 1990 5 482000 393
CHN 1991 7 623000 512
etc...
CDR 1990 1 NaN 97
CDR 1991 NaN 91000 NaN
etc...
How can I impute the missing values most accurately and efficiently, where the imputation takes into account the scale of the country and feature while following the yearly time trend?
Goal:
Country Year x1 x2 x3 ...
USA 1990 3 581000 472
USA 1991 5 723000 389
etc...
CHN 1990 5 482000 393
CHN 1991 7 623000 512
etc...
CDR 1990 1 (87000) 97
CDR 1991 (3) 91000 (95)
etc...
Here 3, 87000, and 95 would be suitable values, as they follow the general time trend that the other countries do, but are scaled to the other values in that same feature for the specific country (in this case CDR).
Using SimpleImputer, these values would be MUCH higher and far less logical.
I know imputation is never perfect, but it can surely be made more accurate in this case.
If there is a noticeable trend over the years for that country, how can I reflect that while keeping the imputed values to a scale that matches the feature for the particular country?
You can try the following techniques (a rough sketch of the filling approach is below):
Random forest imputation (you can refer to this paper).
Backward/forward filling (though it will only consider the year).
Log returns.
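A rough sketch of the interpolation / backward-forward-filling idea, done per country so the imputed values stay on that country's own scale (the "Country" and "Year" column names are assumed; the sample above doesn't name them):

import pandas as pd

df = df.sort_values(["Country", "Year"])
feature_cols = [c for c in df.columns if c not in ("Country", "Year")]

def fill_country(group):
    g = group.copy()
    # interpolate along the years first (follows the within-country trend),
    # then forward/backward fill whatever is left at the series edges
    g[feature_cols] = g[feature_cols].interpolate().ffill().bfill()
    return g

df_imputed = df.groupby("Country", group_keys=False).apply(fill_country)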
I have two tables: customer information and transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records e.g. CUM_SUM(emails_sent) for John. John's record is one row, and he has one value for the amount of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data except for transactions table of course.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is the DIFF measured in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700
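Going by the quoted docstring, DIFF looks like a plain row-over-row difference of the aggregated column. A quick sanity check against the table above (the numbers are copied from it), as a sketch:

import pandas as pd

fm = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "MEAN(transactions.amount)": [21.95, 20.0, 35.604581, None, 22.782682,
                                  35.616306, 24.560536, 331.316552, 60.565852],
})
# .diff() subtracts the previous row; a row whose previous value is NaN comes
# out NaN as well, matching the DIFF column in the table
fm["manual_diff"] = fm["MEAN(transactions.amount)"].diff()
print(fm)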