I want to join two dataframes:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Country ': {0: 'de', 1: 'it', 2: 'de'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
df2 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1',1: 'campaign2', 2: 'none',3: 'campaign4',4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
Edit: let's even consider the variant where df1 has no Country column:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
I have to join df2 and df1 on the keys:
Date
Campaign
Banner
The issue is that when no match is found on the key "Campaign", the join should fall back to matching df1's "Campaign" against df2's "id_campaign" field.
I would like to obtain this dataframe:
df_joined = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: 'none', 3: 'campaign4', 4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20, 3: 0, 4: 0},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
Any help is really appreciated.
You can merge twice, first on 3 keys and then on 2, and fill the unmatched values via combine_first with the Value_1 column of df4:
df3 = pd.merge(df2, df1.drop('Country ', axis=1), on=['Date','Campaign','Banner'], how='left')
df4 = pd.merge(df2, df1, on=['Date','Banner'], how='left')
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10.0
1 banner2 campaign2 it 2/1/2016 10 none 5.0
2 banner3 none de 1/1/2016 15 12345 NaN
3 banner4 campaign4 en 3/1/2016 20 none NaN
4 banner5 campaign5 en 4/1/2016 25 none NaN
print (df4['Value_1'])
0 10.0
1 5.0
2 20.0
3 NaN
4 NaN
Name: Value_1, dtype: float64
df3['Value_1'] = df3['Value_1'].combine_first(df4['Value_1']).fillna(0).astype(int)
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10
1 banner2 campaign2 it 2/1/2016 10 none 5
2 banner3 none de 1/1/2016 15 12345 20
3 banner4 campaign4 en 3/1/2016 20 none 0
4 banner5 campaign5 en 4/1/2016 25 none 0
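An alternative that makes the fallback explicit, matching df1's Campaign against df2's id_campaign rather than relaxing the keys entirely, could look like this sketch, using the no-Country variant of df1 from the edit:

```python
import pandas as pd

df1 = pd.DataFrame({'Banner': ['banner1', 'banner2', 'banner3'],
                    'Campaign': ['campaign1', 'campaign2', '12345'],
                    'Date': ['1/1/2016', '2/1/2016', '1/1/2016'],
                    'Value_1': [10, 5, 20]})
df2 = pd.DataFrame({'Banner': ['banner1', 'banner2', 'banner3', 'banner4', 'banner5'],
                    'Campaign': ['campaign1', 'campaign2', 'none', 'campaign4', 'campaign5'],
                    'Date': ['1/1/2016', '2/1/2016', '1/1/2016', '3/1/2016', '4/1/2016'],
                    'Value_2': [5, 10, 15, 20, 25],
                    'id_campaign': ['none', 'none', '12345', 'none', 'none']})

# Primary merge on all three keys
m1 = pd.merge(df2, df1, on=['Date', 'Campaign', 'Banner'], how='left')
# Fallback merge: df1's Campaign is matched against df2's id_campaign
m2 = pd.merge(df2, df1.rename(columns={'Campaign': 'id_campaign'}),
              on=['Date', 'id_campaign', 'Banner'], how='left')
# Keep the primary match where it exists, else the fallback, else 0
m1['Value_1'] = m1['Value_1'].combine_first(m2['Value_1']).fillna(0).astype(int)
print(m1['Value_1'].tolist())  # [10, 5, 20, 0, 0]
```

Both merges are left joins against df2, so the two results share the same row order and combine_first can align them by index.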
I have the following DF:
pd.DataFrame({'Fecha': {0: '2022-05-01',
1: '2022-04-24',
2: '2022-04-21',
3: '2022-04-16',
4: '2022-04-10'},
'team': {0: 'América ',
1: 'Tigres UANL ',
2: 'América ',
3: 'Club Tijuana ',
4: 'América '},
'opponent': {0: 'Cruz Azul',
1: 'América',
2: 'León',
3: 'América',
4: 'Juárez'},
'variable': {0: 'xG_for', 1: 'xG_for', 2: 'xG_for', 3: 'xG_for', 4: 'xG_for'},
'value': {0: 1.53, 1: 0.47, 2: 1.4, 3: 0.65, 4: 1.58},
'venue': {0: 'H', 1: 'H', 2: 'H', 3: 'H', 4: 'H'}})
I want to filter the data to create a rolling plot with the following code:
Y_for = df[(df["team"] == "América") & (df["variable"] == "xG_for")]["value"].reset_index(drop = True)
But when I run the code I get an empty series:
Series([], Name: value, dtype: float64)
What am I doing wrong?
== requires an exact match, but your team values have trailing spaces ('América '); strip them with str.strip:
Y_for = df[(df["team"].str.strip() == "América")
& (df["variable"] == "xG_for")]["value"].reset_index(drop = True)
Y_for
or use str.contains:
Y_for = df[ df["team"].str.contains("América")
& (df["variable"] == "xG_for")]["value"].reset_index(drop = True)
Y_for
output:
0 1.53
1 1.40
2 1.58
Name: value, dtype: float64
Good evening,
I have a problem with my df.
Here is a preview of df2 (both frames are reproduced as dicts in the edit below):
    Trimestre                                            level_0
0       A1101  Agriculteurs, éleveurs, sylviculteurs, bûcherons"
1       A1401              Maraîchers, jardiniers, viticulteurs"
2       A1405              Maraîchers, jardiniers, viticulteurs"
3       A1406                    Marins, pêcheurs, aquaculteurs"
4       N3101                    Marins, pêcheurs, aquaculteurs"
...       ...                                                ...
123     K1205  Professionnels de l'action sociale et de l'ori...
124     K2104  Professionnels de l'action culturelle, sportiv...
125     K2108                                       Enseignants"
126     K2110                                        Formateurs"
127     K2111                                        Formateurs"
I try to merge df1 with df2 on the "Trimestre" column:
df2.Trimestre = df2.Trimestre.astype(str)
df1.Trimestre = df1.Trimestre.astype(str)
df = pd.merge(df1, df2, on="Trimestre")
and nothing appears, only the header row:
Trimestre level_0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Help me pls
EDIT: Here is the output of df.head().to_dict() to reproduce the error
df1
{'Trimestre': {0: 'A1101 ',
1: 'A1201 ',
2: 'A1202 ',
3: 'A1203 ',
4: 'A1204 '},
'2010': {0: 2630, 1: 1380, 2: 4450, 3: 20330, 4: 130},
'2011': {0: 2790, 1: 1500, 2: 3670, 3: 20040, 4: 90},
'2012': {0: 2700, 1: 1320, 2: 4020, 3: 19140, 4: 130},
'2013': {0: 2970, 1: 1690, 2: 3520, 3: 20500, 4: 140},
'2014': {0: 2680, 1: 1980, 2: 2790, 3: 16900, 4: 150},
'2015': {0: 2440, 1: 1780, 2: 2640, 3: 16310, 4: 170},
'2016': {0: 3600, 1: 1980, 2: 2540, 3: 17680, 4: 90},
'2017': {0: 2930, 1: 2470, 2: 2510, 3: 18520, 4: 130},
'2018': {0: 2740, 1: 2010, 2: 2130, 3: 19280, 4: 150},
'2019': {0: 1600.0, 1: 1760.0, 2: 1050.0, 3: 14260.0, 4: 80.0},
'2020': {0: 11140, 1: 6490, 2: 14000, 3: 76580, 4: 510}}
df2
{'Trimestre': {0: 'A1101', 1: 'A1401', 2: 'A1405', 3: 'A1406', 4: 'N3101'},
'level_0': {0: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons"',
1: 'Maraîchers, jardiniers, viticulteurs"',
2: 'Maraîchers, jardiniers, viticulteurs"',
3: 'Marins, pêcheurs, aquaculteurs"',
4: 'Marins, pêcheurs, aquaculteurs"'}}
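The dicts point at the likely culprit: df1's Trimestre values carry trailing spaces ('A1101 ') while df2's do not ('A1101'), so an exact-key merge matches nothing. A sketch of the fix on a cut-down version of the two frames, stripping the key on both sides before merging:

```python
import pandas as pd

df1 = pd.DataFrame({'Trimestre': ['A1101 ', 'A1201 '],
                    '2010': [2630, 1380]})
df2 = pd.DataFrame({'Trimestre': ['A1101', 'A1401'],
                    'level_0': ['Agriculteurs, éleveurs, sylviculteurs, bûcherons"',
                                'Maraîchers, jardiniers, viticulteurs"']})

# Normalize the join key: remove stray leading/trailing whitespace
df1['Trimestre'] = df1['Trimestre'].str.strip()
df2['Trimestre'] = df2['Trimestre'].str.strip()

df = pd.merge(df1, df2, on='Trimestre')
print(df)  # one matching row: A1101
```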
I have the following code.
xactions = pd.DataFrame({'user_wid': {0: 3305613, 1: 57, 2: 80, 3: 31, 4: 38, 5: 12, 6: 35, 7: 25, 8: 42, 9: 16},
 'user_name': {0: 'Ter', 1: 'Am', 2: 'Wi', 3: 'Ma', 4: 'St', 5: 'Ju', 6: 'De', 7: 'Ri', 8: 'Ab', 9: 'Ti'},
 'user_age': {0: 41, 1: 34, 2: 45, 3: 47, 4: 70, 5: 64, 6: 64, 7: 63, 8: 32, 9: 24},
 'user_gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Male', 5: 'Female', 6: 'Female', 7: 'Female', 8: 'Female', 9: 'Female'},
 'sale_date': {0: '2018-05-15', 1: '2020-02-28', 2: '2020-04-02', 3: '2020-05-09', 4: '2020-11-29', 5: '2020-12-14', 6: '2020-04-21', 7: '2020-06-15', 8: '2020-07-03', 9: '2020-08-10'},
 'days_since_first_visit': {0: 426, 1: 0, 2: 0, 3: 8, 4: 126, 5: 283, 6: 0, 7: 189, 8: 158, 9: 270},
 'visit': {0: 4, 1: 1, 2: 1, 3: 2, 4: 4, 5: 3, 6: 1, 7: 2, 8: 4, 9: 2},
 'num_user_visits': {0: 4, 1: 2, 2: 1, 3: 2, 4: 10, 5: 7, 6: 1, 7: 4, 8: 4, 9: 2},
 'product': {0: 13, 1: 2, 2: 2, 3: 2, 4: 5, 5: 5, 6: 1, 7: 8, 8: 5, 9: 4},
 'sale_price': {0: 10.0, 1: 0.0, 2: 41.3, 3: 41.3, 4: 49.95, 5: 74.95, 6: 49.95, 7: 5.0, 8: 0.0, 9: 0.0},
 'whether_member': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}})
def f(x):
    d = {}
    d['user_name'] = x['user_name'].max()
    d['user_age'] = x['user_age'].max()
    d['user_gender'] = x['user_gender'].max()
    d['last_visit_date'] = x['sale_date'].max()
    d['days_since_first_visit'] = x['days_since_first_visit'].max()
    d['num_visits_window'] = x['visit'].max()
    d['num_visits_total'] = x['num_user_visits'].max()
    d['products_used'] = x['product'].max()
    d['user_total_sales'] = (x['sale_price'].sum()).round(2)
    d['avg_spend_visit'] = (x['sale_price'].sum() / x['visit'].max()).round(2)
    d['membership'] = x['whether_member'].max()
    return pd.Series(d)
users = xactions.groupby('user_wid').apply(f).reset_index()
It is taking too much time to execute; I want to optimize this function.
Any suggestions would be appreciated.
Thanks in advance.
Try:
users2 = xactions.groupby("user_wid", as_index=False).agg(
user_name=("user_name", "max"),
user_age=("user_age", "max"),
user_gender=("user_gender", "max"),
last_visit_date=("sale_date", "max"),
days_since_first_visit=("days_since_first_visit", "max"),
num_visits_window=("visit", "max"),
num_visits_total=("num_user_visits", "max"),
products_used=("product", "max"),
user_total_sales=("sale_price", "sum"),
membership=("whether_member", "max"),
)
users2["avg_spend_visit"] = (
users2["user_total_sales"] / users2["num_visits_window"]
).round(2)
print(users2)
Prints:
user_wid user_name user_age user_gender last_visit_date days_since_first_visit num_visits_window num_visits_total products_used user_total_sales membership avg_spend_visit
0 12 Ju 64 Female 2020-12-14 283 3 7 5 74.95 0 24.98
1 16 Ti 24 Female 2020-08-10 270 2 2 4 0.00 0 0.00
2 25 Ri 63 Female 2020-06-15 189 2 4 8 5.00 0 2.50
3 31 Ma 47 Male 2020-05-09 8 2 2 2 41.30 0 20.65
4 35 De 64 Female 2020-04-21 0 1 1 1 49.95 0 49.95
5 38 St 70 Male 2020-11-29 126 4 10 5 49.95 0 12.49
6 42 Ab 32 Female 2020-07-03 158 4 4 5 0.00 0 0.00
7 57 Am 34 Female 2020-02-28 0 1 2 2 0.00 0 0.00
8 80 Wi 45 Male 2020-04-02 0 1 1 2 41.30 0 41.30
9 3305613 Ter 41 Male 2018-05-15 426 4 4 13 10.00 0 2.50
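One small behavioral difference from the original f: the named aggregation above does not round user_total_sales, while the original applied .round(2) to the sum. Floating-point sums can pick up trailing digits, so if you need the rounding, apply it after aggregating. A minimal sketch with made-up data:

```python
import pandas as pd

xactions = pd.DataFrame({'user_wid': [1, 1, 2],
                         'sale_price': [0.1, 0.2, 5.0]})

users = xactions.groupby('user_wid', as_index=False).agg(
    user_total_sales=('sale_price', 'sum'))
# 0.1 + 0.2 sums to 0.30000000000000004 in binary floating point
users['user_total_sales'] = users['user_total_sales'].round(2)
print(users['user_total_sales'].tolist())  # [0.3, 5.0]
```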
I am trying to add one column at the end of another column. I have included a picture that kind of demonstrates what I want to achieve. How can this be done?
For example, in this case I added the age column under the name column
Dummy data:
{'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}}
One way is to use .append. If your data is in the DataFrame df:
# Split out the relevant parts of your DataFrame
top_df = df[['name', 'sex']]
bottom_df = df[['age ', 'sex']]  # note the trailing space in the 'age ' column name
# Make the column names match
bottom_df.columns = ['name', 'sex']
# Append the two together
full_df = top_df.append(bottom_df)
You might have to decide on what kind of indexing you want. This method above will have non-unique indexing in full_df, which could be fixed by running the following line:
full_df.reset_index(drop=True, inplace=True)
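Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same stacking works with pd.concat. A sketch on the dummy data (again minding the trailing space in the 'age ' column name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Unnamed: 0': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
                   'age ': {0: 35, 1: 56, 2: 22, 3: 16},
                   'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
                   'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}})

top_df = df[['name', 'sex']]
# Rename 'age ' so the columns line up with top_df
bottom_df = df[['age ', 'sex']].rename(columns={'age ': 'name'})
full_df = pd.concat([top_df, bottom_df], ignore_index=True)
print(full_df)
```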
You can use pd.melt and then drop the variable column with df.drop (note the value_vars entry 'age ' with its trailing space, matching the column name):
df = pd.DataFrame({'Unnamed: 0': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}})
df.melt(id_vars=['sex'], value_vars=['name', 'age ']).drop(columns='variable')
sex value
0 female andrea
1 male juan
2 male jose
3 male manuel
4 female 35
5 male 56
6 male 22
7 male 16
I am trying to use the below data to get the 'Total Facebook likes' for each unique actor. The output should have two columns: column 1 containing the unique actor names from all the actor_name columns, and column 2 the total likes from all three actor_facebook_likes columns. Any idea on how this can be done will be appreciated.
{'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: nan, 1: 27000.0, 2: 9800.0, 3: nan, 4: 3300.0}}
Use pivot_table to get the sum of likes for each actor in each facebook-likes category:
df3=pd.pivot_table(df,columns=['actor_1_name', 'actor_2_name', 'actor_3_name'],values=['actor_1_facebook_likes', 'actor_2_facebook_likes',
'actor_3_facebook_likes'],aggfunc=[np.sum]).reset_index()
Melt the actor columns, then group by actor and sum across all categories:
res=pd.melt(df3,id_vars=['sum'], value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name']).groupby('value').agg(Totallikes =('sum', 'sum')).reset_index()
Rename the columns
res.columns=['Actor','Totallikes']
print(res)
Actor Totallikes
0 Amiée Conn 33000.0
1 Amy Adams 40300.0
2 Casey Affleck 74818.0
3 Dev Patel 138800.0
4 Emma Stone 33000.0
5 Forest Whitaker 40300.0
6 Ginnifer Goodwin 57800.0
7 Idris Elba 57800.0
8 Jason Bateman 57800.0
9 Jeremy Renner 40300.0
10 Kyle Chandler 74818.0
11 Michelle Williams 74818.0
12 Nicole Kidman 138800.0
13 Rooney Mara 138800.0
14 Ryan Gosling 33000.0
This does the job:
df0 = pd.DataFrame({'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: 0, 1: 27000.0, 2: 9800.0, 3: 0, 4: 3300.0}})
# Pair each actor-name column with its likes column, stack them, then total per actor
dfa = pd.DataFrame()
for i in range(1, 4):
    part = df0[['actor_%d_name' % i, 'actor_%d_facebook_likes' % i]].copy()
    part.columns = ['Actor', 'value']
    dfa = pd.concat([dfa, part], axis=0)
res = dfa.groupby('Actor', as_index=False)['value'].sum()