Calculating total unique values per column - python

I am trying to use the data below to get the total Facebook likes for each unique actor. The output should have two columns: column 1
containing the unique actor names from all three actor_name columns, and
column 2 containing the total likes from all three
actor_facebook_likes columns. Any idea how this can be done would be
appreciated.
{'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: nan, 1: 27000.0, 2: 9800.0, 3: nan, 4: 3300.0}}

Use pivot_table to get the sum of likes for each actor combination in each Facebook-likes category:
import numpy as np
import pandas as pd

df3 = pd.pivot_table(df,
                     columns=['actor_1_name', 'actor_2_name', 'actor_3_name'],
                     values=['actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes'],
                     aggfunc=[np.sum]).reset_index()
Melt the actor columns, then group by actor and sum across all categories:
res = (pd.melt(df3, id_vars=['sum'],
               value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name'])
         .groupby('value')
         .agg(Totallikes=('sum', 'sum'))
         .reset_index())
Rename the columns:
res.columns = ['Actor', 'Totallikes']
print(res)
Actor Totallikes
0 Amiée Conn 33000.0
1 Amy Adams 40300.0
2 Casey Affleck 74818.0
3 Dev Patel 138800.0
4 Emma Stone 33000.0
5 Forest Whitaker 40300.0
6 Ginnifer Goodwin 57800.0
7 Idris Elba 57800.0
8 Jason Bateman 57800.0
9 Jeremy Renner 40300.0
10 Kyle Chandler 74818.0
11 Michelle Williams 74818.0
12 Nicole Kidman 138800.0
13 Rooney Mara 138800.0
14 Ryan Gosling 33000.0
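A more direct sketch (my variable names; assuming the question's dict is loaded as shown): stack each (name, likes) column pair into two long columns, then sum per actor. Note this credits each actor with their own likes column, so the numbers differ from the row-total output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'actor_1_name': ['Ryan Gosling', 'Ginnifer Goodwin', 'Dev Patel', 'Amy Adams', 'Casey Affleck'],
    'actor_2_name': ['Emma Stone', 'Jason Bateman', 'Nicole Kidman', 'Jeremy Renner', 'Michelle Williams '],
    'actor_3_name': ['Amiée Conn', 'Idris Elba', 'Rooney Mara', 'Forest Whitaker', 'Kyle Chandler'],
    'actor_1_facebook_likes': [14000, 2800, 33000, 35000, 518],
    'actor_2_facebook_likes': [19000.0, 28000.0, 96000.0, 5300.0, 71000.0],
    'actor_3_facebook_likes': [np.nan, 27000.0, 9800.0, np.nan, 3300.0],
})

# Stack the three (name, likes) column pairs into two long columns...
parts = [
    df[[f'actor_{i}_name', f'actor_{i}_facebook_likes']]
      .set_axis(['Actor', 'Totallikes'], axis=1)
    for i in (1, 2, 3)
]
# ...then total the likes per unique actor name.
res = (pd.concat(parts, ignore_index=True)
         .groupby('Actor', as_index=False)['Totallikes'].sum())
```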

This does the job:
df0 = pd.DataFrame({'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: 0, 1: 27000.0, 2: 9800.0, 3: 0, 4: 3300.0}})
# Stack each (name, likes) column pair, then total the likes per actor
dfa = pd.DataFrame()
for i in range(3):
    names = df0.iloc[:, i]       # actor name column i
    vals = df0.iloc[:, 3 + i]    # matching facebook_likes column
    part = pd.DataFrame({'name': names.tolist(), 'value': vals.tolist()})
    dfa = pd.concat([dfa, part], axis=0, ignore_index=True)
res = dfa.groupby('name', as_index=False)['value'].sum()


Flatten multi-index columns into one pandas

I'm trying to clean up a dataframe by merging the columns on a multi-index so all values in columns that belong to the same first-level index appear in one column.
(Before/after screenshots omitted; the raw data is reproduced as a dict below.)
I was doing it manually by defining each column and joining them like this:
df['Subjects'] = df['Which of the following subjects are you taking this semester?'].apply(lambda x: '|'.join(x.dropna()), axis=1)
df.drop('Which of the following subjects are you taking this semester?', axis=1, level=0, inplace=True)
The problem is I have a large dataframe with many more columns than this, so I was wondering if there is a way to do this dynamically for all columns instead of copying this code and defining each column individually?
data = {
    ('Name', ''): {0: 'Jane', 1: 'John', 2: 'Lisa', 3: 'Michael'},
    ('Location', ''): {0: 'Houston', 1: 'LA', 2: 'LA', 3: 'Dallas'},
    ('Which of the following subjects are you taking this semester?', 'Math'):
        {0: 'Math', 1: 'Math', 2: np.nan, 3: 'Math'},
    ('Which of the following subjects are you taking this semester?', 'Science'):
        {0: 'Science', 1: np.nan, 2: np.nan, 3: 'Science'},
    ('Which of the following subjects are you taking this semester?', 'Art'):
        {0: np.nan, 1: 'Art', 2: 'Art', 3: np.nan},
    ('Which of the following electronic devices do you own?', 'Laptop'):
        {0: 'Laptop', 1: 'Laptop', 2: 'Laptop', 3: 'Laptop'},
    ('Which of the following electronic devices do you own?', 'Phone'):
        {0: 'Phone', 1: 'Phone', 2: 'Phone', 3: 'Phone'},
    ('Which of the following electronic devices do you own?', 'TV'):
        {0: np.nan, 1: 'TV', 2: np.nan, 3: np.nan},
    ('Which of the following electronic devices do you own?', 'Tablet'):
        {0: 'Tablet', 1: np.nan, 2: 'Tablet', 3: np.nan},
    ('Age', ''): {0: 24, 1: 20, 2: 19, 3: 29},
    ('Which Social Media Platforms Do You Use?', 'Instagram'):
        {0: np.nan, 1: 'Instagram', 2: 'Instagram', 3: 'Instagram'},
    ('Which Social Media Platforms Do You Use?', 'Facebook'):
        {0: 'Facebook', 1: 'Facebook', 2: np.nan, 3: np.nan},
    ('Which Social Media Platforms Do You Use?', 'Tik Tok'):
        {0: np.nan, 1: 'Tik Tok', 2: 'Tik Tok', 3: np.nan},
    ('Which Social Media Platforms Do You Use?', 'LinkedIn'):
        {0: 'LinkedIn', 1: 'LinkedIn', 2: np.nan, 3: np.nan},
}
You can try this:
df.T.groupby(level=0).agg(list).T
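As a runnable sketch of that one-liner idea on a small stand-in frame (not the full survey data), with the agg extended to join the non-null values with "|" instead of collecting lists:

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same two-level column layout.
df = pd.DataFrame({
    ('Name', ''): ['Jane', 'John'],
    ('Subjects', 'Math'): ['Math', 'Math'],
    ('Subjects', 'Science'): ['Science', np.nan],
})

# Transpose, group the rows by the first column level, and join each
# group's non-null values per original row, then transpose back.
out = (df.T.groupby(level=0)
         .agg(lambda s: '|'.join(x for x in s if pd.notna(x)))
         .T)
```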
You can use melt as a starting point to flatten your dataframe, filter out NaN values, then use pivot_table to reshape it:
pat = r'(subjects|electronic devices|Social Media Platforms)'
cols = ['Name', 'Location', 'Age']
out = df.droplevel(1, axis=1).melt(cols, ignore_index=False).query('value.notna()')
out['variable'] = out['variable'].str.extract(pat, expand=False).str.title()
out = out.reset_index().pivot_table('value', ['index'] + cols, 'variable', aggfunc='|'.join) \
.reset_index(cols).rename_axis(index=None, columns=None)
Output:
>>> out
Name Location Age Electronic Devices Social Media Platforms Subjects
0 Jane Houston 24 Laptop|Phone|Tablet Facebook|LinkedIn Math|Science
1 John LA 20 Laptop|Phone|TV Instagram|Facebook|Tik Tok|LinkedIn Math|Art
2 Lisa LA 19 Laptop|Phone|Tablet Instagram|Tik Tok Art
3 Michael Dallas 29 Laptop|Phone Instagram Math|Science

Define a function for a df using a pandas Series

I would like to calculate the number of people (the dataframe values) for a sector (ROME column) belonging to a workgroup (FAP column) for each year, divided by the total number of people in that workgroup.
The totals per workgroup are stored in a variable Total_FAP:
Total_FAP = df2.Total
Total_FAP.head()
which shows
FAP
Agents administratifs et commerciaux des transports et du tourisme 63160.0
Agents d'entretien 718150.0
Agents d'exploitation des transports 142680.0
Agents de gardiennage et de sécurité 465010.0
Agriculteurs, éleveurs, sylviculteurs, bûcherons 121040.0
For example, for the year 2010, I have to take the number of people for ROME A1101, corresponding to the FAP "Agriculteurs, éleveurs, sylviculteurs, bûcherons" (which is 2630), and divide it by the workgroup total in the pandas Series (which is 121040).
That gives: 2630/121040 = 0.02172835426
I would like to know if there is a way to write this as a function; I considered iterating over the dataframe, but I've read that this is discouraged.
Thanks for your help
EDIT: Here is the raw data for DF1
{'FAP': {0: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
1: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
2: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
3: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
4: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons'},
'ROME': {0: 'A1101', 1: 'A1201', 2: 'A1202', 3: 'A1203', 4: 'A1204'},
'2010': {0: 2630, 1: 1380, 2: 4450, 3: 20330, 4: 130},
'2011': {0: 2790, 1: 1500, 2: 3670, 3: 20040, 4: 90},
'2012': {0: 2700, 1: 1320, 2: 4020, 3: 19130, 4: 130},
'2013': {0: 2970, 1: 1690, 2: 3520, 3: 20500, 4: 140},
'2014': {0: 2680, 1: 1980, 2: 2790, 3: 16900, 4: 150},
'2015': {0: 2440, 1: 1780, 2: 2640, 3: 16310, 4: 170},
'2016': {0: 3600, 1: 1980, 2: 2540, 3: 17680, 4: 90},
'2017': {0: 2930, 1: 2470, 2: 2510, 3: 18520, 4: 130},
'2018': {0: 2740, 1: 2010, 2: 2130, 3: 19280, 4: 150},
'2019': {0: 1600.0, 1: 1760.0, 2: 1050.0, 3: 14260.0, 4: 80.0},
'2020': {0: 11140, 1: 6490, 2: 14000, 3: 76570, 4: 510},
'1e Trimestre 2021': {0: 600, 1: 560, 2: 300, 3: 6090, 4: 30}}
You could use:
cols = df.filter(regex=r'^\d{4}$').columns
df = df.merge(Total_FAP, left_on='FAP', right_index=True)
# the merged totals column keeps the Series name, 'Total'
df[cols].div(df['Total'], axis=0)
output:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 0.021728 0.023050 0.022307 0.024537 0.022141 0.020159 0.029742 0.024207 0.022637 0.013219 0.092036
1 0.011401 0.012393 0.010905 0.013962 0.016358 0.014706 0.016358 0.020406 0.016606 0.014541 0.053619
2 0.036765 0.030321 0.033212 0.029081 0.023050 0.021811 0.020985 0.020737 0.017597 0.008675 0.115664
3 0.167961 0.165565 0.158047 0.169365 0.139623 0.134749 0.146067 0.153007 0.159286 0.117812 0.632601
4 0.001074 0.000744 0.001074 0.001157 0.001239 0.001404 0.000744 0.001074 0.001239 0.000661 0.004213

Pandas dataframes don't merge on a specific column

Good evening,
I have a problem merging my dataframes.
Here is df1 (screenshot omitted; see the dict in the edit below) and here is df2:
Trimestre level_0
0 "A1101" Agriculteurs, éleveurs, sylviculteurs, bûcherons"
1 "A1401" Maraîchers, jardiniers, viticulteurs"
2 "A1405" Maraîchers, jardiniers, viticulteurs"
3 "A1406" Marins, pêcheurs, aquaculteurs"
4 "N3101" Marins, pêcheurs, aquaculteurs"
... ... ...
123 "K1205" Professionnels de l'action sociale et de l'ori...
124 "K2104" Professionnels de l'action culturelle, sportiv...
125 "K2108" Enseignants"
126 "K2110" Formateurs"
127 "K2111" Formateurs"
I try to merge df1 with df2 on the "Trimestre" column:
df2.Trimestre = df2.Trimestre.astype(str)
df1.Trimestre = df1.Trimestre.astype(str)
df = pd.merge(df1, df2, on="Trimestre")
and nothing appears:
Trimestre level_0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Help me please.
EDIT: Here is the output of df.head().to_dict() to reproduce the error
df1
{'Trimestre': {0: 'A1101 ',
1: 'A1201 ',
2: 'A1202 ',
3: 'A1203 ',
4: 'A1204 '},
'2010': {0: 2630, 1: 1380, 2: 4450, 3: 20330, 4: 130},
'2011': {0: 2790, 1: 1500, 2: 3670, 3: 20040, 4: 90},
'2012': {0: 2700, 1: 1320, 2: 4020, 3: 19140, 4: 130},
'2013': {0: 2970, 1: 1690, 2: 3520, 3: 20500, 4: 140},
'2014': {0: 2680, 1: 1980, 2: 2790, 3: 16900, 4: 150},
'2015': {0: 2440, 1: 1780, 2: 2640, 3: 16310, 4: 170},
'2016': {0: 3600, 1: 1980, 2: 2540, 3: 17680, 4: 90},
'2017': {0: 2930, 1: 2470, 2: 2510, 3: 18520, 4: 130},
'2018': {0: 2740, 1: 2010, 2: 2130, 3: 19280, 4: 150},
'2019': {0: 1600.0, 1: 1760.0, 2: 1050.0, 3: 14260.0, 4: 80.0},
'2020': {0: 11140, 1: 6490, 2: 14000, 3: 76580, 4: 510}}
df2
{'Trimestre': {0: 'A1101', 1: 'A1401', 2: 'A1405', 3: 'A1406', 4: 'N3101'},
'level_0': {0: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons"',
1: 'Maraîchers, jardiniers, viticulteurs"',
2: 'Maraîchers, jardiniers, viticulteurs"',
3: 'Marins, pêcheurs, aquaculteurs"',
4: 'Marins, pêcheurs, aquaculteurs"'}}
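Judging from the dict dumps, df1's Trimestre values carry trailing spaces ('A1101 ' vs 'A1101'), so the equality merge matches nothing. A minimal sketch of the fix on two-row stand-in data: strip the keys before merging.

```python
import pandas as pd

df1 = pd.DataFrame({'Trimestre': ['A1101 ', 'A1201 '], '2010': [2630, 1380]})
df2 = pd.DataFrame({'Trimestre': ['A1101', 'A1401'],
                    'level_0': ['Agriculteurs, éleveurs, sylviculteurs, bûcherons"',
                                'Maraîchers, jardiniers, viticulteurs"']})

# Trailing whitespace makes 'A1101 ' != 'A1101'; strip both sides first.
df1['Trimestre'] = df1['Trimestre'].str.strip()
df2['Trimestre'] = df2['Trimestre'].str.strip()
df = pd.merge(df1, df2, on='Trimestre')
```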

Joining two columns in the same data frame

I am trying to append one column to the end of another column; the picture I included (omitted here) demonstrates what I want to achieve. How can this be done?
For example, in this case I want the age column appended under the name column.
Dummy data:
{'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}}
One way is to use .append (note: removed in pandas 2.0 in favour of pd.concat). If your data is in the DataFrame df:
# Split out the relevant parts of your DataFrame
# (the 'age ' column name carries a trailing space in the data above)
top_df = df[['name', 'sex']]
bottom_df = df[['age ', 'sex']]
# Make the column names match
bottom_df.columns = ['name', 'sex']
# Append the two together
full_df = top_df.append(bottom_df)
You might have to decide what kind of indexing you want. The method above leaves non-unique indexing in full_df, which can be fixed by running the following line:
full_df.reset_index(drop=True, inplace=True)
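Since DataFrame.append was removed in pandas 2.0, the same stacking can be written with pd.concat (a sketch on the dummy data above, keeping the trailing space in the 'age ' column name):

```python
import pandas as pd

df = pd.DataFrame({'age ': [35, 56, 22, 16],
                   'name': ['andrea', 'juan', 'jose ', 'manuel'],
                   'sex': ['female', 'male ', 'male ', 'male ']})

top_df = df[['name', 'sex']]
# Rename the 'age ' column to 'name' so the two halves line up.
bottom_df = df[['age ', 'sex']].set_axis(['name', 'sex'], axis=1)

# pd.concat replaces the removed DataFrame.append; ignore_index renumbers rows.
full_df = pd.concat([top_df, bottom_df], ignore_index=True)
```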
You can use pd.melt and drop variable column using df.drop here.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Unnamed: 0': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
                   'age ': {0: 35, 1: 56, 2: 22, 3: 16},
                   'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
                   'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}})
# note the trailing space in the 'age ' column name
df.melt(id_vars=['sex'], value_vars=['name', 'age ']).drop(columns='variable')
sex value
0 female andrea
1 male juan
2 male jose
3 male manuel
4 female 35
5 male 56
6 male 22
7 male 16

replicate iferror and vlookup in a pandas join

I want to join two dataframes:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Country ': {0: 'de', 1: 'it', 2: 'de'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
df2 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1',1: 'campaign2', 2: 'none',3: 'campaign4',4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
Edit:
Let's even consider this variant of df1 (without the Country column):
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
I have to join df2 and df1 on the keys:
Date
Campaign
Banner
The issue here is that when no match is found under the key "Campaign", the key should switch to the field "id_campaign".
I would like to obtain this dataframe:
df_joined = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: 'none', 3: 'campaign4', 4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20, 3: 0, 4: 0},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
any help is really appreciated.
You can use a double merge, first on 3 keys and then on 2, and fill the unmatched values via combine_first from column Value_1 of df4:
# note the trailing space in df1's 'Country ' column name
df3 = pd.merge(df2, df1.drop('Country ', axis=1), on=['Date', 'Campaign', 'Banner'], how='left')
df4 = pd.merge(df2, df1, on=['Date', 'Banner'], how='left')
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10.0
1 banner2 campaign2 it 2/1/2016 10 none 5.0
2 banner3 none de 1/1/2016 15 12345 NaN
3 banner4 campaign4 en 3/1/2016 20 none NaN
4 banner5 campaign5 en 4/1/2016 25 none NaN
print (df4['Value_1'])
0 10.0
1 5.0
2 20.0
3 NaN
4 NaN
Name: Value_1, dtype: float64
df3['Value_1'] = df3['Value_1'].combine_first(df4['Value_1']).fillna(0).astype(int)
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10
1 banner2 campaign2 it 2/1/2016 10 none 5
2 banner3 none de 1/1/2016 15 12345 20
3 banner4 campaign4 en 3/1/2016 20 none 0
4 banner5 campaign5 en 4/1/2016 25 none 0
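The merge-then-fallback pattern above can be wrapped in a small helper that mimics IFERROR(VLOOKUP(...)): try the primary key first, then retry matching the fallback column against the primary. A sketch with trimmed versions of the frames (merge_with_fallback is my name, not a pandas API):

```python
import pandas as pd

df1 = pd.DataFrame({'Banner': ['banner1', 'banner3'],
                    'Campaign': ['campaign1', '12345'],
                    'Date': ['1/1/2016', '1/1/2016'],
                    'Value_1': [10, 20]})
df2 = pd.DataFrame({'Banner': ['banner1', 'banner3', 'banner4'],
                    'Campaign': ['campaign1', 'none', 'campaign4'],
                    'Date': ['1/1/2016', '1/1/2016', '3/1/2016'],
                    'Value_2': [5, 15, 20],
                    'id_campaign': ['none', 'none', 'none']})
df2.loc[1, 'id_campaign'] = '12345'

def merge_with_fallback(left, right, keys, primary, fallback, value):
    """VLOOKUP-style left join; where keys+primary finds nothing,
    retry matching left's fallback column against right's primary."""
    out = left.merge(right[keys + [primary, value]],
                     on=keys + [primary], how='left')
    retry = left.merge(right[keys + [primary, value]],
                       left_on=keys + [fallback],
                       right_on=keys + [primary], how='left')
    # IFERROR part: fill misses from the retry, else 0.
    out[value] = out[value].combine_first(retry[value]).fillna(0)
    return out

df_joined = merge_with_fallback(df2, df1, keys=['Date', 'Banner'],
                                primary='Campaign', fallback='id_campaign',
                                value='Value_1')
```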
