I am somewhat new to Pandas and I have been stuck on a problem.
Assume I have the following dataframe (df1):
Name
Day
Score
Al
Monday
75
Al
Friday
88
Bo
Monday
90
Bo
Friday
100
Cy
Monday
85
Cy
Friday
95
I would like to create another dataframe (df2) with each person's name and their percent improvement from Monday to Friday.
The result would be:
Name
Improvement
Al
17.33
Bo
11.11
Cy
11.76
For example, Al improved by 17.33% between Monday and Friday (((88-75)/75) * 100)
If there is for each Name always ordered Monday and Friday like in sample data solution is GroupBy.pct_change:
df = (df[['Name']].join(df.groupby('Name')['Score'].pct_change().mul(100)
.rename('Improvement'))
.dropna())
print (df)
Name Improvement
1 Al 17.333333
3 Bo 11.111111
5 Cy 11.764706
Let us pivot to reshape then calculate pct change along column axis
s = df.pivot('Name', 'Day', 'Score')
s = s.pct_change(-1, axis=1)['Friday'].reset_index(name='Improvement')
Result
Name Improvement
0 Al 0.173333
1 Bo 0.111111
2 Cy 0.117647
Related
I have a dataframe as follows;
Country
From Date
02/04/2020 Canada
04/02/2020 Ireland
10/03/2020 France
11/03/2020 Italy
15/03/2020 Hungary
.
.
.
10/10/2020 Canada
And I simply want to do a groupby() or something similar which will tell me how many times a country occurs per month
eg.
Canada Ireland France . . .
2010 1 3 4 1
2 4 3 2
.
.
.
10 4 4 4
Is there a simple way to do this?
Any help much appreciated!
A different angle to solve your question would be to use groupBy, count_values and unstack.
It goes like this:
I assume your "from date" is type date (datetime64[ns]) if not:
df['From Date']=pd.to_datetime(df['From Date'], format= '%d/%m/%Y')
convert the date to string with Year + Month:
df['From Date'] = df['From Date'].dt.strftime('%Y-%m')
group by From Date and count the values:
df.groupby(['From Date'])['Country'].value_counts().unstack().fillna(0).astype(int).reindex()
desired result (from the snapshot in your question):
Country Canada France Hungary Ireland Italy
From Date
2020-02 0 0 0 1 0
2020-03 0 1 1 0 1
2020-04 1 0 0 0 0
note the unstack that places the countries on the horizontal, astype(int) to avoid instances such as 1.0 and fillna(0) just in case any country has nothing - show zero.
Check with crosstab
# df.index=pd.to_datetime(df.index, format= '%d/%m/%Y')
pd.crosstab(df.index.strftime('%Y-%m'), df['Country'])
I have two dataframes of differing index lengths that look like this:
df_1:
State Month Total Time ... N columns
AL 4 1000
AL 5 500
.
.
.
VA 11 750
VA 12 1500
df_2:
State Month ... N columns
AL 4
AL 5
.
.
.
VA 11
VA 12
I would like to add a Total Time column to df_2 that uses the values from the Total Time column of df_1 if the State and Month value are the same between dataframes. Ultimately, I would end up with:
df_2:
State Month Total Time ... N columns
AL 4 1000
AL 5 500
.
.
.
VA 11 750
VA 12 1500
I want to do this conditionally since the index lengths are not the same. I have tried this so far:
def f(row):
if (row['State'] == row['State']) & (row['Month'] == row['Month']):
val = x for x in df_1["Total Time"]
return val
df_2['Total Time'] = df_2.apply(f, axis=1)
This did not work. What method would you use to accomplish this?
Any help is appreciated!
You can do this:
Consider my sample dataframes:
In [2327]: df_1
Out[2327]:
State Month Total Time
0 AL 2 1000
1 AB 4 500
2 BC 1 600
In [2328]: df_2
Out[2328]:
State Month
0 AL 2
1 AB 5
In [2329]: df_2 = pd.merge(df_2, df_1, on=['State', 'Month'], how='left')
In [2330]: df_2
Out[2330]:
State Month Total Time
0 AL 2 1000.0
1 AB 5 NaN
As mentioned in other comment, pd.merge() is how you would join two dataframes and extract a column. The issue is that merging solely on 'State' and 'Month' would result in every permutation becoming a new column (all Al-4 in df_1 would join with all other AL-4 in df_2).
With your example, there'd be
df_1
State Month Total Time df_1 col...
0 AL 4 1000 6
1 AL 4 500 7
2 VA 12 750 8
3 VA 12 1500 9
df_2
State Month df_2 col...
0 AL 4 1
1 AL 4 2
2 VA 12 3
3 VA 12 4
df_1_cols_to_use = ['State', 'Month', 'Total Time']
# note the selection of the column to use from df_1. We only want the column
# we're merging on, plus the column(s) we want to bring in, in this case 'Total Time'
new_df = pd.merge(df_2, df_1[df_1_cols_to_use], on=['State', 'Month'], how='left')
new_df:
State Month df_2 col... Total Time
0 AL 4 1 1000
1 AL 4 1 500
2 AL 4 2 1000
3 AL 4 2 500
4 VA 12 3 750
5 VA 12 3 1500
6 VA 12 4 750
7 VA 12 4 1500
You mention these have differing index lengths. Based on the parameters of the question, it's not possible to determine what value of Total Time would match up with a row in df_2. If there's three AL-4 entries in df_2, do they each get 1000, 500, or some combination? That info would be needed. Without this, this would be the best guess at getting all possibilities.
I have two dataframes (with strings) that I am trying to compare to each other. One has a list of areas, the other has a list of areas with long,lat info as well. I am struggling to write a code to perform the following:
1) Check if the string in df1 matches (or a partially matches) area names in df2, then it will merge & carry over the long lat columns.
2) if df1 does not match with df2, then the new column will have NaN or zero.
Code:
import pandas as pd
df1 = pd.read_csv('Dubai Communities1.csv')
df1.head()
CNAME_E1
0 Abu Hail
1 Al Asbaq
2 Al Aweer First
3 Al Aweer Second
4 Al Bada
df2 = pd.read_csv('Dubai Communities2.csv')
df2.head()
COMM_NUM CNAME_E2 Latitude Longitude
0 315 UMM HURAIR 55.3237 25.2364
1 917 AL MARMOOM 55.4518 24.9756
2 624 WARSAN 55.4034 25.1424
3 123 AL MUTEENA 55.3228 25.2739
4 813 AL ROWAIYAH 55.3981 25.1053
The output after search and join will look like this:
CName_E1 CName_E3 Latitude Longitude
0 Area1 Area1 22 7.25
1 Area2 Area2 38 71.83
2 Area3 NaN NaN NaN
3 Area4 Area4 35 8.05
I have 2 dataframes. One with the City, dates and sales
sales = [['20101113','Miami',35],['20101114','New York',70],['20101114','Los Angeles',4],['20101115','Chicago',36],['20101114','Miami',12]]
df2 = pd.DataFrame(sales,columns=['Date','City','Sales'])
print (df2)
Date City Sales
0 20101113 Miami 35
1 20101114 New York 70
2 20101114 Los Angeles 4
3 20101115 Chicago 36
4 20101114 Miami 12
The second has some dates and cities.
date = [['20101114','New York'],['20101114','Los Angeles'],['20101114','Chicago']]
df = pd.DataFrame(date,columns=['Date','City'])
print (df)
I want to extract the sales from the first dataframe that match the city and and dates in the 3nd dataframe and add the sales to the 2nd dataframe. If the date is missing in the first table then the next highest date's sales should be retrieved.
The new dataframe should look like this
Date City Sales
0 20101114 New York 70
1 20101114 Los Angeles 4
2 20101114 Chicago 36
I am having trouble extracting and merging tables. Any suggestions?
This is pd.merge_asof, which allows you to join on a combination of exact matches and then a "close" match for some column.
import pandas as pd
df['Date'] = pd.to_datetime(df.Date)
df2['Date'] = pd.to_datetime(df2.Date)
pd.merge_asof(df.sort_values('Date'),
df2.sort_values('Date'),
by='City', on='Date',
direction='forward')
Output:
Date City Sales
0 2010-11-14 New York 70
1 2010-11-14 Los Angeles 4
2 2010-11-14 Chicago 36
I need to combine 2 pandas dataframes where df1.date is within 2 months previous of df2. I then want to calculate how many traders had traded the same stock during that period and count the total shares purchased.
I have tried using the approach listed below, but found it far to complicated. I believe there would be a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
31/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
31/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to setup the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
team_1 = {'symbol':['FDX','GOOGL','ORCL','ORCL'],
'date':['31/12/2013','30/06/2016','21/07/2015','18/07/2015'],
'shares':[154,2367,293,304],
'trader':['Max','Max','Max','Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol':['ORCL','FB','ACER','HP','ABBV'],
'date':['23/08/2015','04/07/2014','06/12/2013','31/11/2012','05/06/2010'],
'shares':[345,567,221,889,445],
'trader':['John','John','Sally','John','Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
from pandas.tseries.offsets import MonthEnd
df_ = df2.merge(df1, on=['symbol'])
df_['date_x'] = pd.to_datetime(df_['date_x'])
df_['date_y'] = pd.to_datetime(df_['date_y'])
df_2m = df_[df_['date_x'] < df_['date_y'] + MonthEnd(2)] \
.loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
.groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0