Pandas DataFrames don't merge on a specific column

Good evening,
I have a problem with my two DataFrames, df1 and df2.
Here is df2:
Trimestre level_0
0 "A1101" Agriculteurs, éleveurs, sylviculteurs, bûcherons"
1 "A1401" Maraîchers, jardiniers, viticulteurs"
2 "A1405" Maraîchers, jardiniers, viticulteurs"
3 "A1406" Marins, pêcheurs, aquaculteurs"
4 "N3101" Marins, pêcheurs, aquaculteurs"
... ... ...
123 "K1205" Professionnels de l'action sociale et de l'ori...
124 "K2104" Professionnels de l'action culturelle, sportiv...
125 "K2108" Enseignants"
126 "K2110" Formateurs"
127 "K2111" Formateurs"
I am trying to merge df1 with df2 on the "Trimestre" column:
df2.Trimestre = df2.Trimestre.astype(str)
df1.Trimestre = df1.Trimestre.astype(str)
df=pd.merge(df1,df2,on="Trimestre")
but the result is empty; only the column headers appear:
Trimestre level_0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Please help.
EDIT: Here is the output of df.head().to_dict() to reproduce the error
df1
{'Trimestre': {0: 'A1101 ',
1: 'A1201 ',
2: 'A1202 ',
3: 'A1203 ',
4: 'A1204 '},
'2010': {0: 2630, 1: 1380, 2: 4450, 3: 20330, 4: 130},
'2011': {0: 2790, 1: 1500, 2: 3670, 3: 20040, 4: 90},
'2012': {0: 2700, 1: 1320, 2: 4020, 3: 19140, 4: 130},
'2013': {0: 2970, 1: 1690, 2: 3520, 3: 20500, 4: 140},
'2014': {0: 2680, 1: 1980, 2: 2790, 3: 16900, 4: 150},
'2015': {0: 2440, 1: 1780, 2: 2640, 3: 16310, 4: 170},
'2016': {0: 3600, 1: 1980, 2: 2540, 3: 17680, 4: 90},
'2017': {0: 2930, 1: 2470, 2: 2510, 3: 18520, 4: 130},
'2018': {0: 2740, 1: 2010, 2: 2130, 3: 19280, 4: 150},
'2019': {0: 1600.0, 1: 1760.0, 2: 1050.0, 3: 14260.0, 4: 80.0},
'2020': {0: 11140, 1: 6490, 2: 14000, 3: 76580, 4: 510}}
df2
{'Trimestre': {0: 'A1101', 1: 'A1401', 2: 'A1405', 3: 'A1406', 4: 'N3101'},
'level_0': {0: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons"',
1: 'Maraîchers, jardiniers, viticulteurs"',
2: 'Maraîchers, jardiniers, viticulteurs"',
3: 'Marins, pêcheurs, aquaculteurs"',
4: 'Marins, pêcheurs, aquaculteurs"'}}
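In the to_dict() output above, the df1 keys carry a trailing space ('A1101 ') while the df2 keys do not ('A1101'), so no key compares equal and the merge comes back empty. A minimal sketch of one possible fix, assuming the whitespace is the only difference; the frames below are trimmed copies of the sample data with shortened level_0 labels:

```python
import pandas as pd

# Trimmed sample: df1 keys have a trailing space, df2 keys do not.
df1 = pd.DataFrame({"Trimestre": ["A1101 ", "A1201 ", "A1202 "],
                    "2010": [2630, 1380, 4450]})
df2 = pd.DataFrame({"Trimestre": ["A1101", "A1401", "A1405"],
                    "level_0": ["Agriculteurs", "Maraîchers", "Maraîchers"]})

# Without cleaning, 'A1101 ' != 'A1101', so nothing matches.
empty = pd.merge(df1, df2, on="Trimestre")

# Strip surrounding whitespace from the key on both sides, then merge.
df1["Trimestre"] = df1["Trimestre"].astype(str).str.strip()
df2["Trimestre"] = df2["Trimestre"].astype(str).str.strip()
merged = pd.merge(df1, df2, on="Trimestre")

print(len(empty), len(merged))
```

Only A1101 appears in both samples, so the cleaned merge keeps one row here. Passing how="left" together with indicator=True to pd.merge can help spot which keys still fail to match.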

Related

How to highlight certain table rows in Plotly?

In a table built from my dataset, I need to show in bold the rows that contain "All" in the columns Building, Floor or Team:
My code:
headerColor = 'darkgrey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig_occ_fl_team = go.Figure(data=[go.Table(
header=dict(
values=list(final_table_occ_fl_team.columns),
line_color='black',
fill_color=headerColor,
align=['left','left','left','left','left','left','left','left','left','left'],
font=dict(color='black', size=9)
),
cells=dict(
values=[final_table_occ_fl_team['Building'],
final_table_occ_fl_team['Floor'],
final_table_occ_fl_team['Team'],
final_table_occ_fl_team['Number of Desks'],
final_table_occ_fl_team['Avg Occu (#)'],
final_table_occ_fl_team['Avg Occu (%)'],
final_table_occ_fl_team['Avg Occu 10-4 (#)'],
final_table_occ_fl_team['Avg Occu 10-4 (%)'],
final_table_occ_fl_team['Max Occu (#)'],
final_table_occ_fl_team['Max Occu (%)'],
],
line_color='black',
# 2-D list of colors for alternating rows
fill_color = [[rowOddColor,rowEvenColor]*56],
align = ['left','left','left','left','left','left','left','left','left','left'],
font = dict(color = 'black', size = 7)
))
])
fig_occ_fl_team.show()
Dataset head :
data = {'Building': {0: 'All',
1: '1LWP',
2: '1LWP',
3: '1LWP',
4: '1LWP',
5: '1LWP',
6: '1LWP',
7: '1LWP',
8: '1LWP',
9: '1LWP'},
'Floor': {0: 'All',
1: 'All',
2: '2nd',
3: '2nd',
4: '2nd',
5: '2nd',
6: '2nd',
7: '2nd',
8: '2nd',
9: '2nd'},
'Team': {0: 'All',
1: 'All',
2: 'All',
3: 'Anderson/Money',
4: 'Banking & Treasury',
5: 'Charities',
6: 'Client Management',
7: 'Compliance, Legal & Risk',
8: 'DFM',
9: 'Emmerson'},
'Number of Desks': {0: 2297,
1: 2008,
2: 381,
3: 22,
4: 8,
5: 19,
6: 9,
7: 41,
8: 20,
9: 33},
'Avg Occu (#)': {0: 1261,
1: 1126,
2: 195,
3: 14,
4: 4,
5: 9,
6: 5,
7: 21,
8: 13,
9: 18},
'Avg Occu (%)': {0: '55%',
1: '56%',
2: '51%',
3: '64%',
4: '50%',
5: '48%',
6: '56%',
7: '52%',
8: '65%',
9: '55%'},
'Avg Occu 10-4 (#)': {0: 851,
1: 759,
2: 132,
3: 8,
4: 3,
5: 6,
6: 3,
7: 14,
8: 9,
9: 12},
'Avg Occu 10-4 (%)': {0: '37%',
1: '38%',
2: '35%',
3: '37%',
4: '38%',
5: '32%',
6: '34%',
7: '35%',
8: '45%',
9: '37%'},
'Max Occu (#)': {0: 1901,
1: 1680,
2: 274,
3: 22,
4: 6,
5: 13,
6: 7,
7: 27,
8: 17,
9: 25},
'Max Occu (%)': {0: '83%',
1: '84%',
2: '72%',
3: '100%',
4: '75%',
5: '69%',
6: '78%',
7: '66%',
8: '85%',
9: '76%'}}

You can add the bold style to your dataframe prior to creating the table as follows:
import pandas as pd
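df = pd.DataFrame.from_dict(data)
# Rows where Building, Floor and Team are all equal to "All"
indices = df.index[(df[["Building", "Floor", "Team"]] == "All").all(axis=1)]
# Cast to object so numeric columns can hold the "<b>...</b>" strings
df = df.astype(object)
for i in indices:
    for j in range(len(df.columns)):
        df.iloc[i, j] = "<b>{}</b>".format(df.iloc[i, j])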
You can now create the table; I increased the font size to 12:
import plotly.graph_objects as go
headerColor = 'darkgrey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig_occ_fl_team = go.Figure(data=[go.Table(
header=dict(
values=list(df.columns),
line_color='black',
fill_color=headerColor,
align=['left','left','left','left','left','left','left','left','left','left'],
font=dict(color='black', size=9)
),
cells=dict(
values=[df['Building'],
df['Floor'],
df['Team'],
df['Number of Desks'],
df['Avg Occu (#)'],
df['Avg Occu (%)'],
df['Avg Occu 10-4 (#)'],
df['Avg Occu 10-4 (%)'],
df['Max Occu (#)'],
df['Max Occu (%)'],
],
line_color='black',
# 2-D list of colors for alternating rows
fill_color = [[rowOddColor,rowEvenColor]*56],
align = ['left','left','left','left','left','left','left','left','left','left'],
font = dict(color = 'black', size = 12)
))
])
fig_occ_fl_team.show()
Output:
You will notice that the first and fourth columns are bold. If you want to keep the original dataframe unchanged, apply the formatting to a copy instead, e.g. df2 = df1.copy().
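The question asks for rows with "All" in Building, Floor or Team, whereas .all(...) selects only rows where all three columns match. If any one of the columns should be enough, .any(axis=1) may be closer to the intent. A sketch on a trimmed copy of the sample data; the names mask and bold are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "Building": ["All", "1LWP", "1LWP", "1LWP"],
    "Floor": ["All", "All", "2nd", "2nd"],
    "Team": ["All", "All", "All", "Anderson/Money"],
    "Number of Desks": [2297, 2008, 381, 22],
})

# True for rows where at least one of the three columns equals "All".
mask = (df[["Building", "Floor", "Team"]] == "All").any(axis=1)

# Work on a string copy so the original frame keeps its numeric dtypes.
bold = df.astype(str)
bold.loc[mask] = bold.loc[mask].apply(lambda col: "<b>" + col + "</b>")

print(bold)
```

The bold frame can then be passed to go.Table in place of df, column by column, as in the answer above.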

Pandas Loc string giving KeyError

I am trying to pass a date in the format 2015-12-20 through a URL, look it up in a pandas DataFrame, and run model.predict on the matching row.
The problem is that I am converting working code from JupyterLab into a .py file so that everything runs on a Flask server, and the lookup below does not carry over.
The following code works only if the 'Date' column is converted to datetime. If the column has object dtype, it does not work either.
data.loc[2015-12-06]
The above works but the following gives an error:
data.loc['2015-12-06']
KeyError: '2015-12-06'
How do I pass in 2015-12-06 as something other than a string so that .loc works?
print(data.head(5).to_dict())
{'Date': {0: '2015-12-27', 1: '2015-12-20', 2: '2015-12-13', 3: '2015-12-06', 4: '2015-11-29'}, 'Total Volume': {0: 64236.62, 1: 54876.98, 2: 118220.22, 3: 78992.15, 4: 51039.6}, '4046': {0: 1036.74, 1: 674.28, 2: 794.7, 3: 1132.0, 4: 941.48}, '4225': {0: 54454.85, 1: 44638.81, 2: 109149.67, 3: 71976.41, 4: 43838.39}, '4770': {0: 48.16, 1: 58.33, 2: 130.5, 3: 72.58, 4: 75.78}, 'Total Bags': {0: 8696.87, 1: 9505.56, 2: 8145.35, 3: 5811.16, 4: 6183.95}, 'Small Bags': {0: 8603.62, 1: 9408.07, 2: 8042.21, 3: 5677.4, 4: 5986.26}, 'Large Bags': {0: 93.25, 1: 97.49, 2: 103.14, 3: 133.76, 4: 197.69}, 'XLarge Bags': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'type': {0: 'conventional', 1: 'conventional', 2: 'conventional', 3: 'conventional', 4: 'conventional'}, 'year': {0: 2015, 1: 2015, 2: 2015, 3: 2015, 4: 2015}, 'region': {0: 'Albany', 1: 'Albany', 2: 'Albany', 3: 'Albany', 4: 'Albany'}}
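In the to_dict() output above, 'Date' is an ordinary column and the index is the default 0, 1, 2, ..., so .loc has no '2015-12-06' label to find, which is consistent with the KeyError. Two common workarounds are sketched below on a trimmed copy of the sample; this assumes the date arrives from the URL as a plain string, and the variable names are illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13", "2015-12-06"],
    "Total Volume": [64236.62, 54876.98, 118220.22, 78992.15],
})

wanted = "2015-12-06"  # e.g. the value read from the URL

# Option 1: keep the integer index and filter with a boolean mask.
by_mask = data.loc[data["Date"] == wanted]

# Option 2: make the dates the index so .loc can use them as labels.
# Passing a list keeps the result a DataFrame even for a single match.
indexed = data.set_index("Date")
by_label = indexed.loc[[wanted]]

print(by_mask)
print(by_label)
```

Both results are two-dimensional, which is usually what model.predict expects after the non-feature columns are dropped.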

Define a function for a DataFrame using a pandas Series

For each year, I would like to take the number of people (the values in the DataFrame) in a sector (ROME column) belonging to a workgroup (FAP column) and divide it by the total number of people in that workgroup.
The total for each workgroup is stored in a variable Total_FAP:
Total_FAP = df2.Total
Total_FAP.head()
which shows
FAP
Agents administratifs et commerciaux des transports et du tourisme 63160.0
Agents d'entretien 718150.0
Agents d'exploitation des transports 142680.0
Agents de gardiennage et de sécurité 465010.0
Agriculteurs, éleveurs, sylviculteurs, bûcherons 121040.0
For example, for the year 2010, I take the number of people for ROME A1101, which belongs to the FAP "Agriculteurs, éleveurs, sylviculteurs, bûcherons" (2630), and divide it by that workgroup's total in the pandas Series (121040).
That gives: 2630 / 121040 = 0.02172835426
I would like to know if there is a way to write a function for this. I thought about iterating over the DataFrame rows, but I read that this is not advised.
Thanks for your help
EDIT: Here is the raw data for DF1
{'FAP': {0: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
1: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
2: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
3: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
4: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons'},
'ROME': {0: 'A1101', 1: 'A1201', 2: 'A1202', 3: 'A1203', 4: 'A1204'},
'2010': {0: 2630, 1: 1380, 2: 4450, 3: 20330, 4: 130},
'2011': {0: 2790, 1: 1500, 2: 3670, 3: 20040, 4: 90},
'2012': {0: 2700, 1: 1320, 2: 4020, 3: 19130, 4: 130},
'2013': {0: 2970, 1: 1690, 2: 3520, 3: 20500, 4: 140},
'2014': {0: 2680, 1: 1980, 2: 2790, 3: 16900, 4: 150},
'2015': {0: 2440, 1: 1780, 2: 2640, 3: 16310, 4: 170},
'2016': {0: 3600, 1: 1980, 2: 2540, 3: 17680, 4: 90},
'2017': {0: 2930, 1: 2470, 2: 2510, 3: 18520, 4: 130},
'2018': {0: 2740, 1: 2010, 2: 2130, 3: 19280, 4: 150},
'2019': {0: 1600.0, 1: 1760.0, 2: 1050.0, 3: 14260.0, 4: 80.0},
'2020': {0: 11140, 1: 6490, 2: 14000, 3: 76570, 4: 510},
'1e Trimestre 2021': {0: 600, 1: 560, 2: 300, 3: 6090, 4: 30}}

You could use:
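# Select the columns whose name is a four-digit year
cols = df.filter(regex=r'^\d{4}$').columns
# Name the totals 'FAP_total' so the merged column can be referenced below
df = df.merge(Total_FAP.rename('FAP_total'), left_on='FAP', right_index=True)
df[cols].div(df['FAP_total'], axis=0)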
output:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 0.021728 0.023050 0.022307 0.024537 0.022141 0.020159 0.029742 0.024207 0.022637 0.013219 0.092036
1 0.011401 0.012393 0.010905 0.013962 0.016358 0.014706 0.016358 0.020406 0.016606 0.014541 0.053619
2 0.036765 0.030321 0.033212 0.029081 0.023050 0.021811 0.020985 0.020737 0.017597 0.008675 0.115664
3 0.167961 0.165565 0.158047 0.169365 0.139623 0.134749 0.146067 0.153007 0.159286 0.117812 0.632601
4 0.001074 0.000744 0.001074 0.001157 0.001239 0.001404 0.000744 0.001074 0.001239 0.000661 0.004213
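Since the question asks for a function, the same division can be wrapped in one. The sketch below uses Series.map instead of a merge to look up each row's total; share_of_workgroup is a hypothetical name, and the data is a two-row excerpt of the sample:

```python
import pandas as pd

fap = "Agriculteurs, éleveurs, sylviculteurs, bûcherons"
df = pd.DataFrame({
    "FAP": [fap, fap],
    "ROME": ["A1101", "A1201"],
    "2010": [2630, 1380],
    "2011": [2790, 1500],
})
Total_FAP = pd.Series({fap: 121040.0}, name="Total")


def share_of_workgroup(frame, totals):
    """Divide every four-digit year column by the FAP total of each row."""
    year_cols = frame.filter(regex=r"^\d{4}$").columns
    # Look up the workgroup total for each row from the Series index.
    row_totals = frame["FAP"].map(totals)
    return frame[year_cols].div(row_totals, axis=0)


shares = share_of_workgroup(df, Total_FAP)
print(shares.round(6))
```

No explicit iteration over rows is needed: div with axis=0 aligns the totals with the rows and divides all year columns at once.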

Calculating total unique values per column

I am trying to use the data below to get the total Facebook likes for each unique actor. The output should have two columns: column 1 containing the unique actor names from all the actor_name columns, and column 2 containing the total likes from all three actor_facebook_likes columns. Any ideas on how this can be done would be appreciated.
{'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: nan, 1: 27000.0, 2: 9800.0, 3: nan, 4: 3300.0}}

Use pivot to get sum of likes for each actor in each facebook like category
df3=pd.pivot_table(df,columns=['actor_1_name', 'actor_2_name', 'actor_3_name'],values=['actor_1_facebook_likes', 'actor_2_facebook_likes',
'actor_3_facebook_likes'],aggfunc=[np.sum]).reset_index()
Melt the Actors, groupby and sum all categories
res=pd.melt(df3,id_vars=['sum'], value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name']).groupby('value').agg(Totallikes =('sum', 'sum')).reset_index()
Rename the columns
res.columns=['Actor','Totallikes']
print(res)
Actor Totallikes
0 Amiée Conn 33000.0
1 Amy Adams 40300.0
2 Casey Affleck 74818.0
3 Dev Patel 138800.0
4 Emma Stone 33000.0
5 Forest Whitaker 40300.0
6 Ginnifer Goodwin 57800.0
7 Idris Elba 57800.0
8 Jason Bateman 57800.0
9 Jeremy Renner 40300.0
10 Kyle Chandler 74818.0
11 Michelle Williams 74818.0
12 Nicole Kidman 138800.0
13 Rooney Mara 138800.0
14 Ryan Gosling 33000.0

This does the job:
import pandas as pd
df0 = pd.DataFrame({'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: 0, 1: 27000.0, 2: 9800.0, 3: 0, 4: 3300.0}})
# Pair each name column (positions 0-2) with its likes column (positions 3-5)
frames = []
for i in range(3):
    df = pd.DataFrame({'name': df0.iloc[:, i], 'value': df0.iloc[:, 3 + i]})
    frames.append(df)
# Stack the three name/value pairs into one long frame
dfa = pd.concat(frames, axis=0, ignore_index=True)
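To end up with one row per actor, the stacked name/likes pairs still need a groupby and a sum. A compact sketch of the whole pipeline on the first three rows of the sample data, assuming the column names follow the actor_N_name / actor_N_facebook_likes pattern shown in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "actor_1_name": ["Ryan Gosling", "Ginnifer Goodwin", "Dev Patel"],
    "actor_2_name": ["Emma Stone", "Jason Bateman", "Nicole Kidman"],
    "actor_3_name": ["Amiée Conn", "Idris Elba", "Rooney Mara"],
    "actor_1_facebook_likes": [14000, 2800, 33000],
    "actor_2_facebook_likes": [19000.0, 28000.0, 96000.0],
    "actor_3_facebook_likes": [np.nan, 27000.0, 9800.0],
})

# One two-column frame per actor slot, with matching column names.
pairs = [
    df[[f"actor_{i}_name", f"actor_{i}_facebook_likes"]]
    .set_axis(["Actor", "Totallikes"], axis=1)
    for i in (1, 2, 3)
]
stacked = pd.concat(pairs, ignore_index=True)

# Trim stray whitespace so 'Name ' and 'Name' count as the same actor.
stacked["Actor"] = stacked["Actor"].str.strip()

# Sum the likes per actor; missing values are skipped by sum().
res = stacked.groupby("Actor", as_index=False)["Totallikes"].sum()
print(res)
```

Every actor appears once in this sample, so each total equals that actor's own likes; an actor listed in several rows or slots would have those values added together.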

Python - making scatterplot from non-numeric values

I have a csv file with data that I have imported into a dataframe.
RI_df = pd.read_csv("../Week15/police.csv")
Using .head() my data looks like this:
state stop_date stop_time county_name driver_gender driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop district
0 RI 2005-01-04 12:55 NaN M White Equipment/Inspection Violation Equipment False NaN Citation False 0-15 Min False Zone X4
1 RI 2005-01-23 23:15 NaN M White Speeding Speeding False NaN Citation False 0-15 Min False Zone K3
2 RI 2005-02-17 04:15 NaN M White Speeding Speeding False NaN Citation False 0-15 Min False Zone X4
3 RI 2005-02-20 17:15 NaN M White Call for Service Other False NaN Arrest Driver
RI_df.head().to_dict()
Out[55]:
{'state': {0: 'RI', 1: 'RI', 2: 'RI', 3: 'RI', 4: 'RI'},
'stop_date': {0: '2005-01-04',
1: '2005-01-23',
2: '2005-02-17',
3: '2005-02-20',
4: '2005-02-24'},
'stop_time': {0: '12:55', 1: '23:15', 2: '04:15', 3: '17:15', 4: '01:20'},
'county_name': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'driver_gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'F'},
'driver_race': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'violation_raw': {0: 'Equipment/Inspection Violation',
1: 'Speeding',
2: 'Speeding',
3: 'Call for Service',
4: 'Speeding'},
'violation': {0: 'Equipment',
1: 'Speeding',
2: 'Speeding',
3: 'Other',
4: 'Speeding'},
'search_conducted': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'search_type': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'stop_outcome': {0: 'Citation',
1: 'Citation',
2: 'Citation',
3: 'Arrest Driver',
4: 'Citation'},
'is_arrested': {0: False, 1: False, 2: False, 3: True, 4: False},
'stop_duration': {0: '0-15 Min',
1: '0-15 Min',
2: '0-15 Min',
3: '16-30 Min',
4: '0-15 Min'},
'drugs_related_stop': {0: False, 1: False, 2: False, 3: False, 4: False},
'district': {0: 'Zone X4',
1: 'Zone K3',
2: 'Zone X4',
3: 'Zone X1',
4: 'Zone X3'}}
RI_df['drugs_related_stop'].value_counts()
Out[27]:
False 90879
True 862
Name: drugs_related_stop, dtype: int64
I am trying to take the true value counts of "drug related stops" and put them on a line graph, in order to see if "drug related stops" have been increasing over time.
ax = RI_df['drugs_related_stop'].value_counts().plot(kind='line',
figsize=(10,8),
title="Drug stops")
ax.set_xlabel("drug stops")
ax.set_ylabel("number of stops")

You should just use groupby().count()
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')
Here is the complete code so you can double-check:
import pandas as pd
import numpy as np
df = pd.DataFrame({'state': {0: 'RI', 1: 'RI', 2: 'RI', 3: 'RI', 4: 'RI'},
'stop_date': {0: '2005-01-23',
1: '2005-01-23',
2: '2005-02-17',
3: '2005-02-17',
4: '2005-02-24'},
'stop_time': {0: '12:55', 1: '23:15', 2: '04:15', 3: '17:15', 4: '01:20'},
'county_name': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'driver_gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'F'},
'driver_race': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'violation_raw': {0: 'Equipment/Inspection Violation',
1: 'Speeding',
2: 'Speeding',
3: 'Call for Service',
4: 'Speeding'},
'violation': {0: 'Equipment',
1: 'Speeding',
2: 'Speeding',
3: 'Other',
4: 'Speeding'},
'search_conducted': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'search_type': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'stop_outcome': {0: 'Citation',
1: 'Citation',
2: 'Citation',
3: 'Arrest Driver',
4: 'Citation'},
'is_arrested': {0: False, 1: False, 2: False, 3: True, 4: False},
'stop_duration': {0: '0-15 Min',
1: '0-15 Min',
2: '0-15 Min',
3: '16-30 Min',
4: '0-15 Min'},
'drugs_related_stop': {0: False, 1: False, 2: False, 3: False, 4: False},
'district': {0: 'Zone X4',
1: 'Zone K3',
2: 'Zone X4',
3: 'Zone X1',
4: 'Zone X3'}})
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')

This is what I'm getting with the code below...
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')
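Note that groupby('stop_date').count() counts every stop on a date, whether or not it was drug related. To follow only the True values over time, the boolean column can be summed per period instead. A sketch that groups by year; the dates and flags below are made-up values for illustration, since the five sample rows are all False:

```python
import pandas as pd

# Hypothetical values, not taken from police.csv.
df = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23", "2006-02-17",
                  "2006-02-20", "2006-02-24"],
    "drugs_related_stop": [False, True, True, False, True],
})

df["stop_date"] = pd.to_datetime(df["stop_date"])

# Summing a boolean column counts its True values.
drug_stops_per_year = (
    df.groupby(df["stop_date"].dt.year)["drugs_related_stop"].sum()
)
print(drug_stops_per_year)
```

Calling .plot(kind='line', figsize=(10,8), title="Drug stops") on drug_stops_per_year should then draw the trend, with the years on the x axis.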
