Creating a function for mathematical data imputation in Python

I am performing a number of similar operations and would like to write a function for them, but I am not sure how to approach this. I am filling in the 0 values for the following series:
the formula is 2 * value in 2001 - value in 2002
I currently do it one by one in Python:
print(full_data.loc['Croatia', 'fertile_age_pct'])
print(full_data.loc['Croatia', 'working_age_pct'])
print(full_data.loc['Croatia', 'young_age'])
print(full_data.loc['Croatia', 'old_age'])
full_data.replace(to_replace={'fertile_age_pct': {0:(2*46.420061-46.326103)}}, inplace=True)
full_data.replace(to_replace={'working_age_pct': {0:(2*67.038157-66.889212)}}, inplace=True)
full_data.replace(to_replace={'young_age': {0:(2*0.723475-0.715874)}}, inplace=True)
full_data.replace(to_replace={'old_age': {0:(2*0.692245-0.709597)}}, inplace=True)
Data frame (full_data):
geo_full year fertile_age_pct working_age_pct young_age old_age
Croatia 2000 0 0 0 0
Croatia 2001 46.420061 67.038157 0.723475 0.692245
Croatia 2002 46.326103 66.889212 0.715874 0.709597
Croatia 2003 46.111822 66.771187 0.706091 0.72444
Croatia 2004 45.929829 66.782133 0.694854 0.735333
Croatia 2005 45.695932 66.742514 0.686534 0.747083

So you are trying to fill the 0 values in year 2000 with your formula. If there are other countries in the DataFrame, doing this by hand gets messy.
Assuming the year with 0's is always the first year for each country, try this:
full_data.set_index('year', inplace=True)
fixed_data = {}
for country, df in full_data.groupby('geo_full')[full_data.columns[1:]]:
    if df.iloc[0].sum() == 0:
        df.iloc[0] = 2 * df.iloc[1] - df.iloc[2]
    fixed_data[country] = df
fixed_data = pd.concat(list(fixed_data.values()), keys=fixed_data.keys(),
                       names=['geo_full'], axis=0)
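The same idea as a self-contained sketch you can run directly (the helper name fill_first_year and the trimmed two-column frame are illustrative, not from the post):

```python
import pandas as pd

full_data = pd.DataFrame({
    'geo_full': ['Croatia'] * 3,
    'year': [2000, 2001, 2002],
    'fertile_age_pct': [0.0, 46.420061, 46.326103],
    'working_age_pct': [0.0, 67.038157, 66.889212],
})
num_cols = ['fertile_age_pct', 'working_age_pct']

def fill_first_year(g):
    # If the first year's row is all zeros, extrapolate it as 2 * year2 - year3
    if (g[num_cols].iloc[0] == 0).all():
        g.loc[g.index[0], num_cols] = 2 * g[num_cols].iloc[1] - g[num_cols].iloc[2]
    return g

full_data = (full_data.sort_values('year')
             .groupby('geo_full', group_keys=False)
             .apply(fill_first_year))
print(full_data.loc[0, 'fertile_age_pct'])  # 2*46.420061 - 46.326103 = 46.514019
```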

Related

Pandas: Repeating list in column does not work

I want to reshape a dataframe from wide format (one column per year) into long format.
It took me a while to figure out the melt and transpose functions to get this far,
but I did not manage to assign the years 1990 to 2019 in a repeating pattern to every one of the 189 countries.
I tried:
year_list = []
for year in range(1990, 2020, 1):
    year_list.append(year)
years = pd.Series(year_list)
years
and then
df['year'] = years.repeat(30)
(I need to repeat it 189 times, because the frame consists of 5670 rows = 189 countries * 30 years)
I got this error message:
ValueError: cannot reindex on an axis with duplicate labels
Googling this error does not help.
One approach could be as follows:
Sample data
import pandas as pd
import numpy as np
data = {'country': ['Afghanistan','Angola']}
data.update({k: np.random.rand() for k in range(1990,1993)})
df = pd.DataFrame(data)
print(df)
country 1990 1991 1992
0 Afghanistan 0.103589 0.950523 0.323925
1 Angola 0.103589 0.950523 0.323925
Code
res = (df.set_index('country')
         .unstack()
         .sort_index(level=1)
         .reset_index(drop=False)
         .rename(columns={'country': 'geo',
                          'level_0': 'time',
                          0: 'hdi_human_development_index'})
      )
print(res)
time geo hdi_human_development_index
0 1990 Afghanistan 0.103589
1 1991 Afghanistan 0.950523
2 1992 Afghanistan 0.323925
3 1990 Angola 0.103589
4 1991 Angola 0.950523
5 1992 Angola 0.323925
Explanation
Use df.set_index on column country and apply df.unstack to add the years from the column names to the index.
Now, we use df.sort_index on level=1 to get the countries in alphabetical order.
Finally, we use df.reset_index with drop parameter set to False to get the index back as columns, and we chain df.rename to customize the column names.
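As for the ValueError itself: Series.repeat repeats each element in place and keeps duplicate index labels, which is what breaks the assignment back into the frame. If the long frame is sorted country by country with all years in order, np.tile produces the right repeating pattern instead (a sketch with 2 toy countries and 3 years; for the question's data it would be np.tile(np.arange(1990, 2020), 189)):

```python
import numpy as np
import pandas as pd

# Toy long frame: 2 "countries" x 3 "years", sorted by country
df = pd.DataFrame({'geo': np.repeat(['Afghanistan', 'Angola'], 3)})
# np.tile repeats the whole sequence end-to-end (1990, 1991, 1992, 1990, ...),
# whereas Series.repeat repeats each element in place (1990, 1990, 1990, ...)
df['year'] = np.tile(np.arange(1990, 1993), 2)
print(df['year'].tolist())  # [1990, 1991, 1992, 1990, 1991, 1992]
```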

How to create new column from another dataframe based on conditions

I am trying to join two datasets, but they are not the same or have the same criteria.
Currently I have the dataset below, which contains the number of fires based on month and year, but the months are part of the header and the years are a column.
I would like to add this data, using as target data_medicao column from this other dataset, into a new column (let's hypothetically call it nr_total_queimadas).
The date format is YYYY-MM-DD, but the day doesn't really matter here.
I tried to make a loop of this case, but I think I'm doing something wrong and I don't have much idea how to proceed in this case.
Below an example of how I would like the output with the junction of the two datasets:
As an example, I included dates whose year repeats (which will happen), so the corresponding fire count from the first dataset should repeat as well.
First, I assume that the first dataframe is in variable a and the second is in variable b.
To make looking up simpler, we set the index of a to year:
a = a.set_index('year')
Then, we take the years from the data_medicao column in dataframe b:
years = b['data_medicao'].dt.year
To get the month name from dataframe b, we use strftime. We then lower-case the month name so that it matches the column names in a, using .str.lower():
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
Then, using lookup, we can pick the value from dataframe a for each (year, month) pair:
num_fires = a.lookup(years.values, month_name_lowercase.values)
Finally, add the values as a new column in b:
b['nr_total_queimadas'] = num_fires
So the complete code is like this:
a = a.set_index('year')
years = b['data_medicao'].dt.year
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
num_fires = a.lookup(years.values, month_name_lowercase.values)
b['nr_total_queimadas'] = num_fires
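One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions the same row/column pick can be done with NumPy integer indexing (a sketch with made-up numbers; the column names mirror the answer):

```python
import pandas as pd

a = pd.DataFrame({'year': [2001, 2002],
                  'january': [5, 7],
                  'february': [3, 9]}).set_index('year')
b = pd.DataFrame({'data_medicao': pd.to_datetime(['2001-02-10', '2002-01-20'])})

years = b['data_medicao'].dt.year
months = b['data_medicao'].dt.strftime('%B').str.lower()

# Equivalent of a.lookup(years, months): translate labels to integer
# positions, then index the underlying array with one (row, col) pair per entry
rows = a.index.get_indexer(years)
cols = a.columns.get_indexer(months)
b['nr_total_queimadas'] = a.to_numpy()[rows, cols]
print(b['nr_total_queimadas'].tolist())  # [3, 7]
```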
Assume following data for year vs month. Convert month names to numbers:
columns = ["year", "jan", "feb", "mar"]
data = [
    (2001, 110, 120, 130),
    (2002, 210, 220, 230),
    (2003, 310, 320, 330),
]
df = pd.DataFrame(data=data, columns=columns)
month_map = {"jan": "1", "feb": "2", "mar": "3"}
df = df.rename(columns=month_map)
[Out]:
year 1 2 3
0 2001 110 120 130
1 2002 210 220 230
2 2003 310 320 330
Assume following data for datewise transactions. Extract year and month from date:
columns2 = ["date"]
data2 = [
    "2001-02-15",
    "2001-03-15",
    "2002-01-15",
    "2002-03-15",
    "2003-01-15",
    "2003-02-15",
]
df2 = pd.DataFrame(data=data2, columns=columns2)
df2["date"] = pd.to_datetime(df2["date"])
df2["year"] = df2["date"].dt.year
df2["month"] = df2["date"].dt.month
[Out]:
date year month
0 2001-02-15 2001 2
1 2001-03-15 2001 3
2 2002-01-15 2002 1
3 2002-03-15 2002 3
4 2003-01-15 2003 1
5 2003-02-15 2003 2
Join on year:
df2 = df2.merge(df, left_on="year", right_on="year", how="left")
[Out]:
date year month 1 2 3
0 2001-02-15 2001 2 110 120 130
1 2001-03-15 2001 3 110 120 130
2 2002-01-15 2002 1 210 220 230
3 2002-03-15 2002 3 210 220 230
4 2003-01-15 2003 1 310 320 330
5 2003-02-15 2003 2 310 320 330
Compute row-wise sum of months:
df2["nr_total_queimadas"] = df2[list(month_map.values())].sum(axis=1)
df2[["date", "nr_total_queimadas"]]
[Out]:
date nr_total_queimadas
0 2001-02-15 360
1 2001-03-15 360
2 2002-01-15 660
3 2002-03-15 660
4 2003-01-15 960
5 2003-02-15 960

Create composite variable from multiple variables and add to dataframe

I have a dataframe with three median rent variables. The dataframe looks like this:
region_id  year  1bed_med_rent  2bed_med_rent  3bed_med_rent
1          2010  800            1000           1200
1          2011  850            1050           1250
2          2010  900            1000           1100
2          2011  950            1050           1150
I would like to combine all rent variables into one variable using common elements of region and year like so:
region_id  year  med_rent
1          2010  1000
1          2011  1050
2          2010  1000
2          2011  1050
Using the agg() function in pandas, I have been able to perform functions on multiple variables, but I have not been able to combine variables and insert into the dataframe. I have attempted to use the assign() function in combination with the below code without success.
#Creating the group list of common IDs
group_list = ['region_id', 'year']
#Grouping by common ID and taking median values of each group
new_df = df.groupby(group_list).agg({'1bed_med_rent': ['median'],'2bed_med_rent':
['median'], '3bed_med_rent': ['median']}).reset_index()
What other method might there be for this?
Here set_index combined with apply applied to the rest of the row ought to do it:
(df.set_index(['region_id', 'year'])
   .apply(lambda r: r.median(), axis=1)
   .reset_index()
   .rename(columns={0: 'med_rent'})
)
produces
region_id year med_rent
0 1 2010 1000.0
1 1 2011 1050.0
2 2 2010 1000.0
3 2 2011 1050.0
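If you would rather keep the original columns and just add the composite, assigning a row-wise median also works (a sketch; filter(like='med_rent') assumes all rent columns share that suffix):

```python
import pandas as pd

df = pd.DataFrame({'region_id': [1, 1, 2, 2],
                   'year': [2010, 2011, 2010, 2011],
                   '1bed_med_rent': [800, 850, 900, 950],
                   '2bed_med_rent': [1000, 1050, 1000, 1050],
                   '3bed_med_rent': [1200, 1250, 1100, 1150]})
# Select the rent columns by their shared suffix, take the median across each row
df['med_rent'] = df.filter(like='med_rent').median(axis=1)
print(df['med_rent'].tolist())  # [1000.0, 1050.0, 1000.0, 1050.0]
```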

Get average number of days per year greater than a daily mean value with Pandas/Python

Let's say I create the following Pandas Series, which contains a daily measurement over 10 years at three different stations:
import numpy as np
import pandas as pd
stations = ['a', 'b', 'c']
dates = pd.date_range(start = '2000-01-01', end = '2009-12-31')
index = [(stations[i], dates[j]) for i in range(len(stations)) for j in range(len(dates))]
index = pd.MultiIndex.from_tuples(index, names=["station", "date"])
x = np.random.random(len(index))
df = pd.Series(index = index, data = x)
Resulting in a Series that looks like:
>>> df
station date
a 2000-01-01 0.736381
2000-01-02 0.203178
2000-01-03 0.640063
2000-01-04 0.942664
2000-01-05 0.953994
...
c 2009-12-27 0.713189
2009-12-28 0.800085
2009-12-29 0.033923
2009-12-30 0.972547
2009-12-31 0.387804
Length: 10959, dtype: float64
Now, for each station, I want to calculate the average number of days per year that have measurement values which are greater than the daily mean on a given day.
I know I can calculate the daily mean value for each station like this:
daily_mean = df.groupby(['station',index.get_level_values('date').dayofyear]).mean()
>>> daily_mean
station date
a 1 0.529211
2 0.432048
3 0.438350
4 0.629226
5 0.523919
...
c 362 0.524537
363 0.346734
364 0.423349
365 0.433348
366 0.316085
Length: 1098, dtype: float64
But after this step, I can't figure out what to do.
Basically I want to do something like:
df['a','2000-01-01'] > daily_mean['a', 1]
df['a','2000-01-02'] > daily_mean['a', 2]
...
df['a','2000-12-31'] > daily_mean['a', 365]
...Then calculate how many days that year were above average, and do this for each year, and then take the mean number of days above average across all years. And then do that for each station.
I could probably do what I want with some painful looping, but I figure there might be a more Pandas-y way to do it?
You can compare each value in a column to the within-group column average with the following pattern.
This technique uses the transform method on a grouped dataframe, which yields a result of the same length as the original dataframe, rather than condensing the rows to one per group. As an illustrative example:
test = pd.DataFrame({'A': np.random.choice(['a', 'b', 'c'], 10), 'B': np.random.beta(2, 9, 10)})
test
Out
A B
0 b 0.099245
1 c 0.081244
2 b 0.239556
3 b 0.211645
4 c 0.256624
5 c 0.091649
6 b 0.213261
7 a 0.327473
8 a 0.240529
9 c 0.235569
test.groupby('A').B.mean()
Out
A
a 0.284001
b 0.190927
c 0.166271
Name: B, dtype: float64
Using transform:
test['within_A_mean'] = test.groupby('A').B.transform('mean')
test.sort_values('A')
Out
A B within_A_mean
7 a 0.327473 0.284001
8 a 0.240529 0.284001
0 b 0.099245 0.190927
2 b 0.239556 0.190927
3 b 0.211645 0.190927
6 b 0.213261 0.190927
1 c 0.081244 0.166271
4 c 0.256624 0.166271
5 c 0.091649 0.166271
9 c 0.235569 0.166271
So, going back to your example:
# setting up the data as a dataframe instead of a series, with 'measurement' column
import numpy as np
import pandas as pd
stations = ['a', 'b', 'c']
dates = pd.date_range(start = '2000-01-01', end = '2009-12-31')
index = [(stations[i], dates[j]) for i in range(len(stations)) for j in range(len(dates))]
index = pd.MultiIndex.from_tuples(index, names=["station", "date"])
x = np.random.random(len(index))
df = pd.DataFrame(index = index, data = x, columns=['measurement'])
# create a new boolean column which will indicate if a particular measurement
# is above the average measurement for the same day of year across the dataset
df['above_average'] = df\
.groupby(df.index.get_level_values('date').dayofyear)\
.measurement\
.transform(lambda x: x > x.mean())
The expression for df['above_average'] reads: within each group (i.e. each day of year), for each row, is the row's value greater than the group's average value?
Once you have this boolean column calculated, you can easily get the fraction of days in each year that were above average (the mean of a boolean column is the share of True values):
df.groupby(df.index.get_level_values('date').year).above_average.mean()
Out
date
2000 0.478142
2001 0.515068
2002 0.508676
2003 0.466667
2004 0.534608
2005 0.518721
2006 0.478539
2007 0.484932
2008 0.467213
2009 0.509589
Name: above_average, dtype: float64
You can also get the overall fraction of days that were above the day-of-year average:
df.above_average.mean()
Out
0.49621315813486633
EDIT:
To get a count rather than a fraction, use sum() instead of mean() as your aggregate function. Getting this count by station and year is then a matter of grouping by both:
df = df.reset_index()
df.groupby(['station', df['date'].dt.year]).above_average.sum()
Out
station date
a 2000 193
2001 175
2002 181
2003 177
2004 163
2005 183
2006 200
2007 178
2008 180
2009 176
b 2000 159
2001 185
2002 186
2003 170
2004 188
2005 176
2006 190
2007 175
2008 185
2009 171
c 2000 183
2001 186
2002 194
2003 178
2004 181
2005 187
2006 185
2007 169
2008 195
2009 175
Name: above_average, dtype: int64
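To finish with the quantity the question actually asks for — the average number of above-average days per year, per station — you can take the per-station mean of those yearly counts (a sketch on a smaller two-year setup):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
stations = ['a', 'b', 'c']
dates = pd.date_range('2000-01-01', '2001-12-31')
index = pd.MultiIndex.from_product([stations, dates], names=['station', 'date'])
df = pd.DataFrame({'measurement': rng.random(len(index))}, index=index)

# Boolean flag: is this measurement above the day-of-year average?
df['above_average'] = (df.groupby(df.index.get_level_values('date').dayofyear)
                         .measurement
                         .transform(lambda x: x > x.mean()))

# Count above-average days per station/year, then average the counts per station
flat = df.reset_index()
per_year = flat.groupby(['station', flat['date'].dt.year]).above_average.sum()
avg_days_per_year = per_year.groupby('station').mean()
print(avg_days_per_year)
```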

Looping this code to get new dataframe based on previous calculated dataframe? [duplicate]

This question already has answers here:
Create multiple dataframes in loop
(6 answers)
Closed 3 years ago.
df2 = df.copy()
df2['avgTemp'] = df2['avgTemp'] * tempchange
df2['year'] = df2['year'] + 20
df_final = pd.concat([df,df2])
OUTPUT:
Country avgTemp year
0 Afghanistan 14.481583 2012
0 Afghanistan 15.502164 2032
1 Africa 24.725917 2012
1 Africa 26.468460 2032
2 Albania 13.768250 2012
... ... ... ...
240 Zambia 21.697750 2012
241 Zimbabwe 23.038036 2032
241 Zimbabwe 21.521333 2012
242 Åland 6.063917 2012
242 Åland 6.491267 2032
So currently I'm trying to make a loop so I can do the same calculations on df_2 to return df_3, and keep doing this until I have a certain number of new dataframes that I can then concatenate together. Thank you for your help! :)
So the end result should be df_1, df_2, df_3 and so on, which I can then concat into one big dataset.
Yes, I would use a loop to solve this. The x passed to range is the number of iterations you wish to run:
lists_of_dfs = []
for i in range(x):
    df_aux = df.copy()
    df_aux['avgTemp'] = df['avgTemp'] * (tempchange ** i)
    df_aux['year'] = df['year'] + (20 * i)
    lists_of_dfs.append(df_aux)
Finally with the full the list of dataframes:
final_df = pd.concat(lists_of_dfs)
The only condition is that the variable tempchange has to be a growth multiplier (1 + the percentage change, e.g. 1.05 for +5%); if it is only the raw percentage change, the compounding tempchange ** i will not work.
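A minimal runnable version of the loop (toy data, three iterations; the tempchange value is illustrative and assumed to already be a multiplier):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Afghanistan', 'Zimbabwe'],
                   'avgTemp': [14.481583, 23.038036],
                   'year': [2012, 2012]})
tempchange = 1.05  # +5% per 20-year step (illustrative value)

lists_of_dfs = []
for i in range(3):
    df_aux = df.copy()
    # Compound the change i times and step the year forward by 20 * i
    df_aux['avgTemp'] = df['avgTemp'] * (tempchange ** i)
    df_aux['year'] = df['year'] + (20 * i)
    lists_of_dfs.append(df_aux)

final_df = pd.concat(lists_of_dfs, ignore_index=True)
print(final_df['year'].unique().tolist())  # [2012, 2032, 2052]
```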
