Python - get the mean of values between 2 dates

I'd like to get the mean of values between 2 dates, grouped by shop.
I have a first xlsx file with the sales (column sell) by shop and date:
shop  sell  date
a      100  2000
a      122  2001
a      300  2002
b       55  2000
b      245  2001
b     1156  2002
And I have another file with the start and stop date for each shop:
shop  start  stop
a      2000  2002
a      2000  2001
b      2000  2000
So I'd like to get the mean of sell between each start/stop pair from the second file.
I tried something like this, but I get a list of DataFrames and it is not very optimized:
dfend = []
for i in df2.values:
    filt1 = df.shop == i[0]
    filt2 = df.date >= i[1]
    filt3 = df.date <= i[2]
    dfgrouped = df.where(filt1 & filt2 & filt3).groupby('shop').agg(
        mean=('sell', 'mean'), begin=('date', 'min'), end=('date', 'max'))
    dfend.append(dfgrouped)
Can someone help me?
Thanks a lot.

Merge the two DataFrames on 'shop'. Then you can check the date condition using between to filter down to the rows that count. Finally, groupby + mean. (This assumes your second df is unique.)
m = df2.merge(df1, how='left')
(m[m['date'].between(m['start'], m['stop'])]
   .groupby(['shop', 'start', 'stop'])['sell'].mean()
   .reset_index())

#  shop  start  stop  sell
#0    a   2000  2001   111
#1    a   2000  2002   174
#2    b   2000  2000    55
If there are some rows in df2 that have no qualifying rows in df1, then instead use where, so that they still get a row after the groupby (this is also the reason why df2 is the left DataFrame in the merge). Here I added an extra row:
print(df2)
#  shop  start  stop
#0    a   2000  2002
#1    a   2000  2001
#2    b   2000  2000
#3    e   1999  2011

m = df2.merge(df1, how='left')
(m.where(m['date'].between(m['start'], m['stop']))
   .groupby([m.shop, m.start, m.stop])['sell'].mean()
   .reset_index())

#  shop  start  stop   sell
#0    a   2000  2001  111.0
#1    a   2000  2002  174.0
#2    b   2000  2000   55.0
#3    e   1999  2011    NaN
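
For reference, here is a minimal, self-contained sketch of this approach, assuming the two files above have been loaded (the sample data is built by hand here instead of being read from xlsx):

import pandas as pd

# Sample data as shown in the question.
df1 = pd.DataFrame({'shop': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'sell': [100, 122, 300, 55, 245, 1156],
                    'date': [2000, 2001, 2002, 2000, 2001, 2002]})
df2 = pd.DataFrame({'shop': ['a', 'a', 'b'],
                    'start': [2000, 2000, 2000],
                    'stop': [2002, 2001, 2000]})

# Merge on 'shop', keep rows whose date falls inside [start, stop], then average.
m = df2.merge(df1, how='left')
result = (m[m['date'].between(m['start'], m['stop'])]
          .groupby(['shop', 'start', 'stop'])['sell'].mean()
          .reset_index())
print(result)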

Related

How to create new column from another dataframe based on conditions

I am trying to join two datasets, but they do not share the same structure or criteria.
Currently I have the dataset below, which contains the number of fires by month and year; the months are part of the header and the years are a column.
I would like to add this data, using the data_medicao column from the other dataset as the target, into a new column (let's hypothetically call it nr_total_queimadas).
The date format is YYYY-MM-DD, but the day doesn't really matter here.
I tried to write a loop for this case, but I think I'm doing something wrong and I don't have much idea how to proceed.
Below is an example of how I would like the output to look after joining the two datasets:
I used as an example a case where some dates repeat (which should happen), so the number present in the dataset with the fire counts should also repeat.
First, I assume that the first dataframe is in variable a and the second is in variable b.
To make lookups simpler, we set the index of a to year:
a = a.set_index('year')
Then we take the years from the data_medicao column in dataframe b:
years = b['data_medicao'].dt.year
To get the month name from dataframe b, we use strftime. Then we lower-case the month name so that it matches the column names in a, using .str.lower():
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
Then, using lookup, we can list all the values from dataframe a at indices years and month_name_lowercase:
num_fires = a.lookup(years.values, month_name_lowercase.values)
Finally, add the new values into the new column in b:
b['nr_total_queimadas'] = num_fires
So the complete code is:
a = a.set_index('year')
years = b['data_medicao'].dt.year
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
num_fires = a.lookup(years.values, month_name_lowercase.values)
b['nr_total_queimadas'] = num_fires
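
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions the same point-wise lookup can be done with plain NumPy-style indexing; a sketch under the same assumptions (a indexed by year, b holding a datetime column data_medicao):

# Translate each row of b into a (row, column) position in a, then pick the values.
# get_indexer returns -1 for missing keys, so every year/month must exist in a.
row_idx = a.index.get_indexer(b['data_medicao'].dt.year)
col_idx = a.columns.get_indexer(b['data_medicao'].dt.strftime("%B").str.lower())
b['nr_total_queimadas'] = a.to_numpy()[row_idx, col_idx]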
Assume the following data for year vs. month, and convert the month names to numbers:
columns = ["year", "jan", "feb", "mar"]
data = [
    (2001, 110, 120, 130),
    (2002, 210, 220, 230),
    (2003, 310, 320, 330),
]
df = pd.DataFrame(data=data, columns=columns)
month_map = {"jan": "1", "feb": "2", "mar": "3"}
df = df.rename(columns=month_map)
[Out]:
   year    1    2    3
0  2001  110  120  130
1  2002  210  220  230
2  2003  310  320  330
Assume the following data for date-wise transactions, and extract year and month from the date:
columns2 = ["date"]
data2 = [
    ("2001-02-15"),
    ("2001-03-15"),
    ("2002-01-15"),
    ("2002-03-15"),
    ("2003-01-15"),
    ("2003-02-15"),
]
df2 = pd.DataFrame(data=data2, columns=columns2)
df2["date"] = pd.to_datetime(df2["date"])
df2["year"] = df2["date"].dt.year
df2["month"] = df2["date"].dt.month
[Out]:
        date  year  month
0 2001-02-15  2001      2
1 2001-03-15  2001      3
2 2002-01-15  2002      1
3 2002-03-15  2002      3
4 2003-01-15  2003      1
5 2003-02-15  2003      2
Join on year:
df2 = df2.merge(df, left_on="year", right_on="year", how="left")
[Out]:
        date  year  month    1    2    3
0 2001-02-15  2001      2  110  120  130
1 2001-03-15  2001      3  110  120  130
2 2002-01-15  2002      1  210  220  230
3 2002-03-15  2002      3  210  220  230
4 2003-01-15  2003      1  310  320  330
5 2003-02-15  2003      2  310  320  330
Compute the row-wise sum of the month columns:
df2["nr_total_queimadas"] = df2[list(month_map.values())].sum(axis=1)
df2[["date", "nr_total_queimadas"]]
[Out]:
        date  nr_total_queimadas
0 2001-02-15                 360
1 2001-03-15                 360
2 2002-01-15                 660
3 2002-03-15                 660
4 2003-01-15                 960
5 2003-02-15                 960
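
If what you actually need is the value for the specific month of each date (rather than the yearly total), a hedged alternative is to reshape the wide table to long form and merge on both keys; this sketch assumes df and df2 as originally built above (before the join):

# Wide -> long: one row per (year, month) with its fire count.
long_df = df.melt(id_vars="year", var_name="month", value_name="nr_fires")
long_df["month"] = long_df["month"].astype(int)  # "1"/"2"/"3" -> 1/2/3

# Each date row picks up the value for its own year and month.
out = df2.merge(long_df, on=["year", "month"], how="left")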

Count ratios conditional on 2 columns

I am new to pandas and am trying to figure out how to calculate the percentage change (difference) between 2 years, given that sometimes there is no previous year.
I am given a dataframe as follows:
          company  date  amount
1       Company 1  2020       3
2       Company 1  2021       1
3        COMPANY2  2020       7
4       Company 3  2020       4
5       Company 3  2021       4
..            ...   ...     ...
766     Company N  2021       9
765     Company N  2020       1
767   Company XYZ  2021       3
768     Company X  2021       3
769     Company Z  2020       2
I wrote something like this:
for company in df2.company.unique():
    company_df = df2[df2.company == company]
    company_df.sort_values(by="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0, 0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount - company_df.value_year_before) / company_df.value_year_before
    df2["ratio"] = company_df["diff"]
But I keep getting NaN.
Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, when using pandas, reaching for a for loop usually means there is an easier way to accomplish the goal. Here you can use groupby and pct_change to compute the ratio within each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
groupby keeps the order of the rows within each group, so we sort beforehand to ensure the dates are in the correct order; fillna replaces any NaNs with 0.
Result:
         company  date  amount     ratio
3       COMPANY2  2020       7  0.000000
1      Company 1  2020       3  0.000000
2      Company 1  2021       1 -0.666667
4      Company 3  2020       4  0.000000
5      Company 3  2021       4  0.000000
765    Company N  2020       1  0.000000
766    Company N  2021       9  8.000000
768    Company X  2021       3  0.000000
767  Company XYZ  2021       3  0.000000
769    Company Z  2020       2  0.000000
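
As an aside: on recent pandas versions (2.1+) the implicit forward-filling inside pct_change is deprecated, so it is safer to disable it explicitly; a small sketch:

# fill_method=None avoids the deprecated implicit ffill of missing values.
df['ratio'] = (df.groupby('company')['amount']
                 .pct_change(fill_method=None)
                 .fillna(0.0))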
Alternatively, apply an anonymous function that calculates the percentage change and returns it if there is more than one value. Use:
df = pd.DataFrame({'company': [1, 1, 3], 'date': [2020, 2021, 2020], 'amount': [4, 5, 7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1] - list(x)[0]) / list(x)[0] if len(x) > 1 else 'not enough values')
Output:
company
1                 0.25
3    not enough values
Name: amount, dtype: object

Creating a function for mathematical data imputation in Python

I am performing a number of similar operations and I would like to write a function, but I am not even sure how to approach this. I am calculating replacement values for the zeros in the following series:
the formula is 2 * value in 2001 - value in 2002.
I currently do it one by one in Python:
print(full_data.loc['Croatia', 'fertile_age_pct'])
print(full_data.loc['Croatia', 'working_age_pct'])
print(full_data.loc['Croatia', 'young_age'])
print(full_data.loc['Croatia', 'old_age'])
full_data.replace(to_replace={'fertile_age_pct': {0:(2*46.420061-46.326103)}}, inplace=True)
full_data.replace(to_replace={'working_age_pct': {0:(2*67.038157-66.889212)}}, inplace=True)
full_data.replace(to_replace={'young_age': {0:(2*0.723475-0.715874)}}, inplace=True)
full_data.replace(to_replace={'old_age': {0:(2*0.692245-0.709597)}}, inplace=True)
Data frame (full_data):
geo_full  year  fertile_age_pct  working_age_pct  young_age   old_age
Croatia   2000                0                0          0         0
Croatia   2001        46.420061        67.038157   0.723475  0.692245
Croatia   2002        46.326103        66.889212   0.715874  0.709597
Croatia   2003        46.111822        66.771187   0.706091  0.724440
Croatia   2004        45.929829        66.782133   0.694854  0.735333
Croatia   2005        45.695932        66.742514   0.686534  0.747083
So you are trying to fill the 0 values in year 2000 with your formula. If you have any other country in the DataFrame, this can get messy.
Assuming the year with 0's is always the first year for each country, try this (the formula is 2 * the 2001 row minus the 2002 row, i.e. df.iloc[1] * 2 - df.iloc[2]):
full_data.set_index('year', inplace=True)
fixed_data = {}
for country, df in full_data.groupby('geo_full')[full_data.columns[1:]]:
    if df.iloc[0].sum() == 0:
        df.iloc[0] = df.iloc[1] * 2 - df.iloc[2]
    fixed_data[country] = df
fixed_data = pd.concat(list(fixed_data.values()), keys=fixed_data.keys(), names=['geo_full'], axis=0)
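
To actually wrap this in a reusable function, here is a minimal sketch; the function name, signature, and usage are illustrative, not from the original post, and it assumes the zero row is always the first year of each country:

def impute_first_year(df, cols):
    # Replace a leading all-zero row with 2 * second row - third row.
    df = df.copy()
    if df[cols].iloc[0].sum() == 0:
        df.loc[df.index[0], cols] = 2 * df[cols].iloc[1] - df[cols].iloc[2]
    return df

cols = ['fertile_age_pct', 'working_age_pct', 'young_age', 'old_age']
fixed = (full_data.groupby('geo_full', group_keys=False)
                  .apply(impute_first_year, cols=cols))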

Search data between excel and csv file in python - it's like vlookup [duplicate]

I have previously worked with Stata and am now trying to get the same done with Python. However, I am having trouble with the merge command. Somehow I must be missing something. The two dataframes I want to merge look like this:
df1:
Date  id  Market_Cap
2000   1         400
2000   2         200
2001   1         410
2001   2         220
df2:
id      Ticker
 1       Shell
 2  ExxonMobil
My aim now is to get the following dataset:
Date  id  Market_Cap      Ticker
2000   1         400       Shell
2000   2         200  ExxonMobil
2001   1         410       Shell
2001   2         220  ExxonMobil
I tried the following command:
merged = pd.merge(df1, df2, how="left", on="id")
This merges the datasets, but gives me only NaNs in the Ticker column.
I looked at several sources and maybe I am mistaken, but isn't "left" the right option for my purpose? I also tried "right" and "outer"; they don't give the result I want, and "inner" does not seem to work here at all.
Am I missing something crucial?
The problem is that your id column is object (i.e., string) in one DataFrame and int in the other, so nothing matches and you get NaN.
If both have the same dtype:
print(df1['id'].dtypes)
int64
print(df2['id'].dtypes)
int64

merged = pd.merge(df1, df2, how="left", on="id")
print(merged)
   Date  id  Market_Cap      Ticker
0  2000   1         400       Shell
1  2000   2         200  ExxonMobil
2  2001   1         410       Shell
3  2001   2         220  ExxonMobil
Another solution, if you only need to add one new column, is map:
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print(df1)
   Date  id  Market_Cap      Ticker
0  2000   1         400       Shell
1  2000   2         200  ExxonMobil
2  2001   1         410       Shell
3  2001   2         220  ExxonMobil
Simulating your problem:
print(df1['id'].dtypes)
object
print(df2['id'].dtypes)
int64

df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print(df1)
   Date  id  Market_Cap Ticker
0  2000   1         400    NaN
1  2000   2         200    NaN
2  2001   1         410    NaN
3  2001   2         220    NaN
And the solution is to convert to int with astype (or convert column id in df2 to str):
df1['id'] = df1['id'].astype(int)
# alternatively
# df2['id'] = df2['id'].astype(str)
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print(df1)
   Date  id  Market_Cap      Ticker
0  2000   1         400       Shell
1  2000   2         200  ExxonMobil
2  2001   1         410       Shell
3  2001   2         220  ExxonMobil
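
When a merge silently produces NaNs like this, a quick way to diagnose it is the indicator flag of merge, which records where each row came from; a small sketch with the frames above:

check = pd.merge(df1, df2, how="left", on="id", indicator=True)
# _merge is 'both' for matched rows and 'left_only' where df2 had no matching id
# (with mismatched dtypes, every row shows 'left_only').
print(check['_merge'].value_counts())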
