Create composite variable from multiple variables and add to dataframe

I have a dataframe with three median rent variables. The dataframe looks like this:
region_id  year  1bed_med_rent  2bed_med_rent  3bed_med_rent
1          2010  800            1000           1200
1          2011  850            1050           1250
2          2010  900            1000           1100
2          2011  950            1050           1150
I would like to combine all rent variables into one variable using common elements of region and year like so:
region_id  year  med_rent
1          2010  1000
1          2011  1050
2          2010  1000
2          2011  1050
Using the agg() function in pandas, I have been able to apply functions to multiple variables, but I have not been able to combine the variables into one and insert the result into the dataframe. I have tried the assign() function in combination with the code below, without success.
# Create the list of common ID columns to group by
group_list = ['region_id', 'year']
# Group by the common IDs and take the median of each rent column within each group
new_df = df.groupby(group_list).agg({'1bed_med_rent': ['median'],
                                     '2bed_med_rent': ['median'],
                                     '3bed_med_rent': ['median']}).reset_index()
What other method might there be for this?

Setting region_id and year as the index and then applying a row-wise median to the remaining columns should do it:
(df.set_index(['region_id', 'year'])
   .apply(lambda r: r.median(), axis=1)
   .reset_index()
   .rename(columns={0: 'med_rent'})
)
produces
region_id year med_rent
0 1 2010 1000.0
1 1 2011 1050.0
2 2 2010 1000.0
3 2 2011 1050.0
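The same result can probably be reached without apply, because DataFrame.median accepts axis=1. This is a minimal sketch that rebuilds the question's data. The names rent_cols and out are introduced here for illustration only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'region_id': [1, 1, 2, 2],
    'year': [2010, 2011, 2010, 2011],
    '1bed_med_rent': [800, 850, 900, 950],
    '2bed_med_rent': [1000, 1050, 1000, 1050],
    '3bed_med_rent': [1200, 1250, 1100, 1150],
})

# Hypothetical helper list: the columns that feed the composite variable
rent_cols = ['1bed_med_rent', '2bed_med_rent', '3bed_med_rent']

# Row-wise median across the rent columns, added as a new column
out = df.assign(med_rent=df[rent_cols].median(axis=1))[['region_id', 'year', 'med_rent']]
print(out)
```

Listing the rent columns explicitly keeps the median from picking up any other numeric column that might be added to the dataframe later.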

Related

pandas average across dynamic number of columns

I have a dataframe like the one shown below:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to:
a) get the average revenue for each customer based on the latest two columns (revenue_m9 and revenue_m10)
b) get the average revenue for each customer based on the latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So I tried the following:
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # I also tried this, but how can I do it for only two columns (and not all columns)?
But if I want to compute the average for the past 12 months, writing it this way is not elegant. Is there a better or more efficient way? Ideally I could just key in the number of columns to look back, and the average would be computed from that input.
I expect my output to look like this:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1247.5 1240
2 5678 3425 3255 2345 2800 3675.75

Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
If the column order is not guaranteed, sort the columns with natural sorting first:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
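To key in the look-back window as a number, the filter-and-slice idea can be wrapped in a small function. This is only a sketch: add_trailing_mean and its prefix parameter are hypothetical names, and it assumes every month column is named revenue_m followed by digits. It sorts by the numeric suffix with plain Python, so it does not need natsort.

```python
import pandas as pd

def add_trailing_mean(df, n_months, prefix='revenue_m'):
    """Return a copy of df with the mean of the latest n_months month columns added."""
    # Keep only columns like revenue_m7; the id column and earlier mean columns are skipped
    month_cols = [c for c in df.columns
                  if c.startswith(prefix) and c[len(prefix):].isdigit()]
    # Order by the numeric suffix so that m10 comes after m9
    month_cols = sorted(month_cols, key=lambda c: int(c[len(prefix):]))
    out = df.copy()
    out[f'revenue_mean_{n_months}m'] = out[month_cols[-n_months:]].mean(axis=1)
    return out

# Sample data copied from the question
df = pd.DataFrame({
    'customer_id': [1, 2],
    'revenue_m7': [1234, 5678],
    'revenue_m8': [1231, 3425],
    'revenue_m9': [1256, 3255],
    'revenue_m10': [1239, 2345],
})

result = add_trailing_mean(add_trailing_mean(df, 2), 4)
print(result)
```

Because the month columns are selected by name on every call, the function can be applied repeatedly (for example once per window size) without the newly added mean columns leaking into later averages.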

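You could try something like this:
n_months = 4 # you could also loop over range(1, 13) to cover 1 to 12 months
# take the last n_months columns; this assumes the revenue columns are the last columns of df,
# so when looping, select the revenue columns once before adding any mean columns
df[f'revenue_mean_{n_months}m'] = df.iloc[:, -n_months:].mean(axis=1)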

How to create new column from another dataframe based on conditions

I am trying to join two datasets, but they do not share the same structure or keys.
Currently I have the dataset below, which contains the number of fires by month and year, but the months are part of the header and the years are a column.
I would like to add this data to the other dataset as a new column (let's hypothetically call it nr_total_queimadas), using that dataset's data_medicao column as the lookup key.
The date format is YYYY-MM-DD, but the day doesn't really matter here.
I tried to make a loop of this case, but I think I'm doing something wrong and I don't have much idea how to proceed in this case.
Below an example of how I would like the output with the junction of the two datasets:
I used as an example the case where some dates repeat (which should happen) so the number present in the dataset that contains the number of fires, should also repeat.

First, I assume that the first dataframe is in variable a and the second is in variable b.
To make the lookup simpler, we set the index of a to year:
a = a.set_index('year')
Then we take the years from the data_medicao column of dataframe b:
years = b['data_medicao'].dt.year
To get the month name from dataframe b, we use strftime. The month name then has to be lower case so that it matches the column names in a, which is what .str.lower() does:
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
Then, using lookup, we can collect the values from dataframe a at the row labels in years and the column labels in month_name_lowercase:
num_fires = a.lookup(years.values, month_name_lowercase.values)
Finally, add the values as a new column in b:
b['nr_total_queimadas'] = num_fires
So the complete code looks like this:
a = a.set_index('year')
years = b['data_medicao'].dt.year
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
num_fires = a.lookup(years.values, month_name_lowercase.values)
b['nr_total_queimadas'] = num_fires
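Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the code above only runs on older versions. One possible replacement is to translate the labels into integer positions with Index.get_indexer and index the underlying NumPy array. The sketch below uses made-up data, since the original datasets were only shown as images. The names row_pos and col_pos are illustrative.

```python
import pandas as pd

# Made-up stand-ins for the two datasets in the question
a = pd.DataFrame({
    'year': [2019, 2020],
    'january': [10, 30],
    'february': [20, 40],
}).set_index('year')
b = pd.DataFrame({
    'data_medicao': pd.to_datetime(['2019-01-05', '2019-01-20', '2020-02-11']),
})

years = b['data_medicao'].dt.year
# %B yields English month names under the default locale (an assumption here)
month_name_lowercase = b['data_medicao'].dt.strftime('%B').str.lower()

# Translate labels into integer positions; a label missing from a gives -1,
# which would silently select the last row or column, so check for -1 on real data
row_pos = a.index.get_indexer(years)
col_pos = a.columns.get_indexer(month_name_lowercase)

# Pick exactly one cell of a for every row of b
b['nr_total_queimadas'] = a.to_numpy()[row_pos, col_pos]
print(b)
```

Repeated dates in b simply repeat the same (row, column) position, so the fire count repeats as the question requires.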

Assume the following year-by-month data. Convert the month names to numbers:
columns = ["year","jan","feb","mar"]
data = [
(2001,110,120,130),
(2002,210,220,230),
(2003,310,320,330)
]
df = pd.DataFrame(data=data, columns=columns)
month_map = {"jan":"1", "feb":"2", "mar":"3"}
df = df.rename(columns=month_map)
[Out]:
year 1 2 3
0 2001 110 120 130
1 2002 210 220 230
2 2003 310 320 330
Assume the following date-wise transaction data. Extract the year and month from the date:
columns2 = ["date"]
data2 = [
("2001-02-15"),
("2001-03-15"),
("2002-01-15"),
("2002-03-15"),
("2003-01-15"),
("2003-02-15"),
]
df2 = pd.DataFrame(data=data2, columns=columns2)
df2["date"] = pd.to_datetime(df2["date"])
df2["year"] = df2["date"].dt.year
df2["month"] = df2["date"].dt.month
[Out]:
date year month
0 2001-02-15 2001 2
1 2001-03-15 2001 3
2 2002-01-15 2002 1
3 2002-03-15 2002 3
4 2003-01-15 2003 1
5 2003-02-15 2003 2
Join on year:
df2 = df2.merge(df, left_on="year", right_on="year", how="left")
[Out]:
date year month 1 2 3
0 2001-02-15 2001 2 110 120 130
1 2001-03-15 2001 3 110 120 130
2 2002-01-15 2002 1 210 220 230
3 2002-03-15 2002 3 210 220 230
4 2003-01-15 2003 1 310 320 330
5 2003-02-15 2003 2 310 320 330
Compute row-wise sum of months:
df2["nr_total_queimadas"] = df2[list(month_map.values())].apply(pd.Series.sum, axis=1)
df2[["date", "nr_total_queimadas"]]
[Out]:
date nr_total_queimadas
0 2001-02-15 360
1 2001-03-15 360
2 2002-01-15 660
3 2002-03-15 660
4 2003-01-15 960
5 2003-02-15 960
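The steps above give every date the total over all months of its year. If the goal is instead the count for the matching month only, one option is to reshape the wide table to long form with melt and merge on both year and month. This is a sketch under that assumption, reusing the sample values from this answer. The names long_df, keys and merged are introduced for illustration.

```python
import pandas as pd

# Year-by-month table (same sample values as above)
df = pd.DataFrame(
    [(2001, 110, 120, 130), (2002, 210, 220, 230), (2003, 310, 320, 330)],
    columns=['year', 'jan', 'feb', 'mar'],
)
# Date-wise rows (a subset of the sample dates above)
df2 = pd.DataFrame({'date': pd.to_datetime(['2001-02-15', '2002-03-15', '2003-01-15'])})

month_map = {'jan': 1, 'feb': 2, 'mar': 3}

# Wide to long: one row per (year, month) pair
long_df = df.melt(id_vars='year', var_name='month', value_name='nr_total_queimadas')
long_df['month'] = long_df['month'].map(month_map)

# Attach year and month keys to the dated rows, then join on both keys
keys = df2.assign(year=df2['date'].dt.year, month=df2['date'].dt.month)
merged = keys.merge(long_df, on=['year', 'month'], how='left')
print(merged[['date', 'nr_total_queimadas']])
```

With how='left', dates that have no matching year and month keep their row and get NaN instead of being dropped.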

Count ratios conditional on 2 columns

I am new to pandas and trying to figure out how to calculate the percentage change (difference) between two years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
    company_df = df2[df2.company== company]
    company_df.sort_values(by ="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0,0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount- company_df.value_year_before)/company_df.value_year_before
df2["ratio"] = company_df["diff"]
But I keep getting NaN.
Where did I make a mistake?

The main issue is that company_df is overwritten in each iteration of the loop, so only the last company's values are kept.
More generally, reaching for a for loop in pandas usually means there is an easier way to accomplish the goal. Here you could use groupby and pct_change to compute the ratio within each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
groupby keeps the order of the rows within each group, so we sort beforehand to ensure that the dates are in the correct order, and fillna replaces any NaNs with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
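For comparison, the value_year_before column from the question's loop can be built for all companies at once with a grouped shift. This is a rough sketch on a few rows of the sample data. Like pct_change, it uses the previous row within each company, so it assumes there are no gaps between the years of a company.

```python
import pandas as pd

# A few rows from the question's sample data
df = pd.DataFrame({
    'company': ['Company 1', 'Company 1', 'COMPANY2', 'Company 3', 'Company 3'],
    'date': [2020, 2021, 2020, 2020, 2021],
    'amount': [3, 1, 7, 4, 4],
})

df = df.sort_values(['company', 'date'])

# Previous row's amount within each company; NaN where there is no earlier row
df['value_year_before'] = df.groupby('company')['amount'].shift(1)

# Same formula as in the question's loop, applied to every row at once
df['ratio'] = ((df['amount'] - df['value_year_before'])
               / df['value_year_before']).fillna(0.0)
print(df)
```

Keeping value_year_before as a column makes it easy to check which rows had no previous year before the NaNs are replaced with 0.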

Apply an anonymous function that calculates the percentage change and returns it if the group has more than one value. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1]-list(x)[0])/list(x)[0] if len(x)>1 else 'not enough values')

creating a function for mathematical data imputation python

I am performing a number of similar operations and would like to write a function for them, but I am not sure how to approach this. I am imputing values for the entries that are 0 in the series below.
The formula is 2 * (value in 2001) - (value in 2002).
I currently do it one by one in Python:
print(full_data.loc['Croatia', 'fertile_age_pct'])
print(full_data.loc['Croatia', 'working_age_pct'])
print(full_data.loc['Croatia', 'young_age'])
print(full_data.loc['Croatia', 'old_age'])
full_data.replace(to_replace={'fertile_age_pct': {0:(2*46.420061-46.326103)}}, inplace=True)
full_data.replace(to_replace={'working_age_pct': {0:(2*67.038157-66.889212)}}, inplace=True)
full_data.replace(to_replace={'young_age': {0:(2*0.723475-0.715874)}}, inplace=True)
full_data.replace(to_replace={'old_age': {0:(2*0.692245-0.709597)}}, inplace=True)
Data frame (full_data):
geo_full year fertile_age_pct working_age_pct young_age old_age
Croatia 2000 0 0 0 0
Croatia 2001 46.420061 67.038157 0.723475 0.692245
Croatia 2002 46.326103 66.889212 0.715874 0.709597
Croatia 2003 46.111822 66.771187 0.706091 0.72444
Croatia 2004 45.929829 66.782133 0.694854 0.735333
Croatia 2005 45.695932 66.742514 0.686534 0.747083

So you are trying to fill the 0 values in year 2000 using your formula. If there is any other country in the DataFrame, replacing every 0 can get messy.
Assuming the year with 0's is always the first year for each country, and the next two rows are the two following years, try this:
full_data.set_index('year', inplace=True)
fixed_data = {}
for country, df in full_data.groupby('geo_full')[full_data.columns[1:]]:
    df = df.copy()  # work on a copy of the group rather than a slice of full_data
    if df.iloc[0].sum() == 0:
        # formula from the question: 2 * (second year) - (third year)
        df.iloc[0] = df.iloc[1] * 2 - df.iloc[2]
    fixed_data[country] = df
fixed_data = pd.concat(list(fixed_data.values()), keys=fixed_data.keys(), names=['geo_full'], axis=0)
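Since the question asks for a function, the same idea can be sketched without an explicit loop by using a grouped shift. The function name impute_first_year and its parameters are hypothetical. It assumes one row per country and year, and that every column other than the group and year columns is numeric.

```python
import pandas as pd

def impute_first_year(data, group_col='geo_full', year_col='year'):
    """Fill all-zero first-year rows with 2 * (next year) - (year after next)."""
    out = data.sort_values([group_col, year_col]).reset_index(drop=True)
    value_cols = [c for c in out.columns if c not in (group_col, year_col)]
    out[value_cols] = out[value_cols].astype(float)

    grouped = out.groupby(group_col)[value_cols]
    # Extrapolate backwards from the two following rows of the same group
    estimate = 2 * grouped.shift(-1) - grouped.shift(-2)

    # Only touch rows that come first in their group and contain nothing but zeros
    is_first = out.groupby(group_col).cumcount() == 0
    all_zero = (out[value_cols] == 0).all(axis=1)
    mask = is_first & all_zero

    out.loc[mask, value_cols] = estimate.loc[mask, value_cols]
    return out

# A few rows and columns from the question's data
full_data = pd.DataFrame({
    'geo_full': ['Croatia', 'Croatia', 'Croatia'],
    'year': [2000, 2001, 2002],
    'fertile_age_pct': [0, 46.420061, 46.326103],
    'old_age': [0, 0.692245, 0.709597],
})

fixed = impute_first_year(full_data)
print(fixed)
```

Rows that are not the first of their country, or that contain any non-zero value, are left unchanged, so legitimate zeros elsewhere in the data are not overwritten.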

Grouping data by specific years in python

I want to create a dataframe grouped by region and date that shows the average age of each region during specific years, so my columns would look something like:
region, year, average age
So far I have:
# specify aggregation functions for column 'age'
ageAverage = {'age':{'average age':'mean'}}
# groupby and apply functions
ageDataFrame = data.groupby(['Region', data.Date.dt.year]).agg(ageAverage)
This works great, but how can I group only data from specific years, for example between 2010 and 2015?

You need to filter first with between:
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
                .groupby(['Region', data.Date.dt.year])
                .agg(ageAverage))
Also, in pandas 0.22.0 (the latest version at the time of writing) the nested dictionary raises:
SpecificationError: cannot perform renaming for age with a nested dictionary
The correct solution is to select the column after groupby and aggregate with a (name, function) tuple, where the first value is the new column name and the second is the aggregate function:
np.random.seed(123)
rng = pd.date_range('2009-04-03', periods=10, freq='13M')
data = pd.DataFrame({'Date': rng,
'Region':['reg1'] * 3 + ['reg2'] * 7,
'average age': np.random.randint(20, size=10)})
print (data)
Date Region average age
0 2009-04-30 reg1 13
1 2010-05-31 reg1 2
2 2011-06-30 reg1 2
3 2012-07-31 reg2 6
4 2013-08-31 reg2 17
5 2014-09-30 reg2 19
6 2015-10-31 reg2 10
7 2016-11-30 reg2 1
8 2017-12-31 reg2 0
9 2019-01-31 reg2 17
ageAverage = [('age', 'mean')]
# groupby and apply functions
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
.groupby(['Region', data.Date.dt.year])['average age']
.agg(ageAverage))
print (ageDataFrame)
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
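On pandas 0.25 and later, the renaming can also be written with named aggregation, which avoids both the nested dictionary and the tuple list. The following is a sketch with small made-up data rather than the random sample above. The names years, in_range and age_by_year are illustrative.

```python
import pandas as pd

# Small made-up dataset with the same column roles as in the question
data = pd.DataFrame({
    'Date': pd.to_datetime(['2009-04-30', '2010-05-31', '2010-06-30',
                            '2012-07-31', '2016-11-30']),
    'Region': ['reg1', 'reg1', 'reg1', 'reg2', 'reg2'],
    'age': [13, 2, 4, 6, 1],
})

years = data['Date'].dt.year.rename('year')
in_range = years.between(2010, 2015)  # inclusive on both ends

age_by_year = (
    data[in_range]                                # keep only the wanted years
    .groupby(['Region', years[in_range]])         # group by region and year
    .agg(average_age=('age', 'mean'))             # output column = (input column, function)
    .reset_index()
)
print(age_by_year)
```

Renaming the year Series before grouping gives the resulting column a readable name, and reset_index turns the group keys into the region and year columns the question asks for.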

Two variations using @jezrael's data (thx)
These are very close to what @jezrael has already shown. Only view this as a demonstration of what else can be done. As pointed out in the comments by @jezrael, it is better to pre-filter first as it reduces overall processing.
pandas.IndexSlice instead of prefiltering with between
data.groupby(
['Region', data.Date.dt.year]
)['average age'].agg(
[('age', 'mean')]
).loc[pd.IndexSlice[:, 2010:2015], :]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
between as part of the groupby
data.groupby(
[data.Date.dt.year.between(2010, 2015),
'Region', data.Date.dt.year]
)['average age'].agg(
[('age', 'mean')]
).loc[True]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
