I want to concatenate two values from the same column (Region) into a single value in that column. Here is my CSV file:
Date,Region,TemperatureMax,TemperatureMin,PrecipitationMax,PrecipitationMin
01/01/2016,Champagne Ardenne,12,6,2.5,0.3
02/01/2016,Champagne Ardenne,13,9,3.9,0.6
03/01/2016,Champagne Ardenne,14,7,22.5,12.5
01/01/2016,Bourgogne,9,5,0.1,0
02/01/2016,Bourgogne,11,8,16.3,4.2
03/01/2016,Bourgogne,10,5,12.2,6.3
01/01/2016,Pays de la Loire,12,6,2.5,0.3
02/01/2016,Pays de la Loire,13,9,3.9,0.6
03/01/2016,Pays de la Loire,14,7,22.5,12.5
I want to have "Bourgogne Champagne Ardenne" instead of having them separated, and to calculate the average of TemperatureMax, TemperatureMin, PrecipitationMax and PrecipitationMin:
01/01/2016,Bourgogne Champagne Ardenne,10.5,5.5,1.3,0.15
02/01/2016,Bourgogne Champagne Ardenne,12,8.5,10.1,2.4
03/01/2016,Bourgogne Champagne Ardenne,12,6,17.35,9.4
01/01/2016,Pays de la Loire,12,6,2.5,0.3
02/01/2016,Pays de la Loire,13,9,3.9,0.6
03/01/2016,Pays de la Loire,14,7,22.5,12.5
A more general solution is to first replace the region names via a dict, then groupby and aggregate the mean:
d = {'Champagne Ardenne':'Bourgogne Champagne Ardenne',
'Bourgogne':'Bourgogne Champagne Ardenne'}
df['Region'] = df['Region'].replace(d)
df1 = df.groupby(['Date', 'Region'], as_index=False, sort=False).mean()
print(df1)
Date Region TemperatureMax TemperatureMin \
0 01/01/2016 Bourgogne Champagne Ardenne 10.5 5.5
1 02/01/2016 Bourgogne Champagne Ardenne 12.0 8.5
2 03/01/2016 Bourgogne Champagne Ardenne 12.0 6.0
3 01/01/2016 Pays de la Loire 12.0 6.0
4 02/01/2016 Pays de la Loire 13.0 9.0
5 03/01/2016 Pays de la Loire 14.0 7.0
PrecipitationMax PrecipitationMin
0 1.30 0.15
1 10.10 2.40
2 17.35 9.40
3 2.50 0.30
4 3.90 0.60
5 22.50 12.50
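If you need the result back in a CSV like the input, pandas' to_csv can write it out; a one-line sketch (the output filename is just illustrative):
df1.to_csv('output.csv', index=False)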
Use groupby's agg method:
df.groupby('Date').agg({
    'Region': lambda g: g.sort_values().str.cat(sep=' '),
    'TemperatureMax': 'mean',
    'TemperatureMin': 'mean',
    'PrecipitationMax': 'mean',
    'PrecipitationMin': 'mean'
})
Note that this concatenates all the regions for each date (Pays de la Loire included) in alphabetical order.
I have a dataset like this
import pandas as pd
df = pd.read_csv("music.csv")
df
    name        date      singer                      language  phase
1   Yes or No   02.01.20  Benjamin Smith              en        1
2   Parabens    01.06.21  Rafael Galvao;Simon Murphy  pt;en     2
3   Love        12.11.20  Michaela Condell            en        1
4   Paz         11.07.19  Ana Perez; Eduarda Pinto    es;pt     3
5   Stop        12.01.21  Michael Conway;Gabriel Lee  en;en     1
6   Shalom      18.06.21  Shimon Cohen                hebr      1
7   Habibi      22.12.19  Fuad Khoury                 ar        3
8   viva        01.08.21  Veronica Barnes             en        1
9   Buznanna    23.09.20  Kurt Azzopardi              mt        1
10  Frieden     21.05.21  Gabriel Meier               dt        1
11  Uruguay     11.04.21  Julio Ramirez               es        1
12  Beautiful   17.03.21  Cameron Armstrong           en        3
13  Holiday     19.06.20  Bianca Watson               en        3
14  Kiwi        21.10.20  Lachlan McNamara            en        1
15  Amore       01.12.20  Vasco Grimaldi              it        1
16  La vie      28.04.20  Victor Dubois               fr        3
17  Yom         21.02.20  Ori Azerad; Naeem al-Hindi  hebr;ar   2
18  Elefthería  15.06.19  Nikolaos Gekas              gr        1
I converted it to 1NF:
import pandas as pd

df = pd.read_csv("music.csv")
df['language'] = df['language'].str.split(';')
df['singer'] = df['singer'].str.split(';')
df = df.explode(['language', 'singer'])  # assign the result back; explode does not work in place
df
Now I would like to find out which phase has the most singers involved.
I used this
df= df.group.by('singer')
df['phase']. value_counts(). idxmax()
But I could not get a solution. The dataframe has 42 observations, so some singers occur more than once.
Source: convert data to 1NF
You do not need to split/explode; you can directly count the number of ; per row and add 1:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
If you want the classical split/explode:
(df.assign(singer=df['singer'].str.split(';'))
   .explode('singer')
   .groupby('phase')['singer'].count()
)
output:
phase
1 12
2 4
3 6
Name: singer, dtype: int64
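To answer the question directly (which phase has the most singers), you can take the idxmax of either result; a small sketch building on the first one:
counts = df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
print(counts.idxmax())  # 1, the phase with the most singers involved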
Can someone please show me how to merge df2 into df1 for just the matching cities, then use df2's average monthly temperature columns to fill a new column called 'Temp' in df1, matching each date to the appropriate month?
These are sample data of much larger files for state and cities in Brazil.
df1
State City Dates
0 AC Rio Branco 3/20/2020
1 BA Salvador 5/2/2020
2 CE Fortaleza 4/6/2020
3 AC Rio Branco 5/30/2020
df2 has the average monthly temperatures for each city:
State City MAR APR MAY
0 CE Fortaleza 75.6 72.7 69.4
1 ES Vitória 69.1 64.6 62.7
2 AC Rio Branco 72.8 70.5 68.9
3 BA Salvador 74.6 71.3 70.1
Desired output:
df1 with new column 'Temp'
State City Dates Temp
0 AC Rio Branco 3/20/2020 72.8
1 BA Salvador 5/2/2020 70.1
2 CE Fortaleza 4/6/2020 72.7
3 AC Rio Branco 5/30/2020 68.9
You should first convert the date field to datetime, if it is not already. Then you can extract the month as an upper-case three-letter abbreviation so that it matches the column names (MAR, APR, MAY). The Temp column can be created by checking the Month column and assigning the value from the matching month column. Finally, remove the interim columns. Hope this is what you are looking for.
import calendar
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'State': ['AC', 'BA', 'CE', 'AC'], 'City': ['Rio Branco', 'Salvador', 'Fortaleza', 'Rio Branco'],
                    'Dates': ['3/20/2020', '5/2/2020', '4/6/2020', '5/30/2020']})
df2 = pd.DataFrame({'State': ['CE', 'ES', 'AC', 'BA'], 'City': ['Fortaleza', 'Vitória', 'Rio Branco', 'Salvador'],
                    'MAR': [75.6, 69.1, 72.8, 74.6], 'APR': [72.7, 64.6, 70.5, 71.3], 'MAY': [69.4, 62.7, 68.9, 70.1]})
df1['Dates']=pd.to_datetime(df1['Dates']) ##Convert to datetime
df1 = pd.merge(df1,df2, on='City', how="inner") ##Merge the dfs using City as the primary key
df1['Month']=df1.Dates.dt.month.apply(lambda x: calendar.month_abbr[x]).str.upper() ## Get MON for each date
df1['Temp']=np.where(df1['Month'] == 'MAR', df1['MAR'], np.where(df1['Month']=='APR', df1['APR'], df1['MAY'])) ## Add Temp value
df1=df1.drop(columns=['State_y', 'MAR', 'APR', 'MAY', 'Month']).rename(columns={'State_x':'State'}) #Drop unnecessary columns
print(df1)
Output
State City Dates Temp
0 AC Rio Branco 2020-03-20 72.8
1 AC Rio Branco 2020-05-30 68.9
2 BA Salvador 2020-05-02 70.1
3 CE Fortaleza 2020-04-06 72.7
You can use a merge after reshaping df2 to long form with melt and extracting the month abbreviation with to_datetime and strftime:
(df1.assign(month=pd.to_datetime(df1['Dates']).dt.strftime('%b').str.upper())
.merge(df2.melt(['State', 'City'], var_name='month', value_name='Temp'),
on=['State', 'City', 'month'])
#.drop(columns='month') # uncomment to remove the column
)
output:
State City Dates month Temp
0 AC Rio Branco 3/20/2020 MAR 72.8
1 BA Salvador 5/2/2020 MAY 70.1
2 CE Fortaleza 4/6/2020 APR 72.7
3 AC Rio Branco 5/30/2020 MAY 68.9
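Note that merge defaults to an inner join, so a date whose month has no column in df2 would silently drop out; a hedged variant that keeps such rows with NaN in Temp:
(df1.assign(month=pd.to_datetime(df1['Dates']).dt.strftime('%b').str.upper())
    .merge(df2.melt(['State', 'City'], var_name='month', value_name='Temp'),
           on=['State', 'City', 'month'], how='left')
)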
You can use a function:
df1['Dates'] = pd.to_datetime(df1['Dates'])
df1['Month'] = df1['Dates'].dt.strftime('%b').str.upper()
final = df1.merge(df2, on='State')

def get_value(col, row):
    # look up the temperature in the column named after this row's month
    return final[col].iat[row]

final['Temp'] = final.apply(lambda x: get_value(x['Month'], row=x.name), axis=1)
# now rename the columns and format the date
final = final[['State', 'City_x', 'Dates', 'Temp']]
final.columns = ['State', 'City', 'Dates', 'Temp']
final['Dates'] = final['Dates'].dt.strftime('%d/%m/%Y')
print(final)
'''
State City Dates Temp
0 AC Rio Branco 20/03/2020 72.8
1 AC Rio Branco 30/05/2020 68.9
2 BA Salvador 02/05/2020 70.1
3 CE Fortaleza 06/04/2020 72.7
'''
I have two different Excel files which I read using pd.read_excel. The first one is kind of a master file which has a lot of columns; I am showing only those columns which are relevant:
df1
Company Name Excel Company ID
0 cleverbridge AG IQ109133656
1 BT España, Compañía de Servicios Globales de T... IQ3806173
2 Technoserv Group IQ40333012
3 Blue Media S.A. IQ50008102
4 zeb.rolfes.schierenbeck.associates gmbh IQ30413992
and the second Excel file is basically an output file which looks like this:
df2
company_id found_keywords no_of_url company_name
0 IQ137156215 insurance 15 Zühlke Technology Group AG
1 IQ3806173 insurance 15 BT España, Compañía de Servicios Globales de T...
2 IQ40333012 insurance 4 Technoserv Group
3 IQ51614192 insurance 15 Octo Telematics S.p.A.
I want this output file / df2 to also include the company_id and company_name pairs from df1 that are not already part of df2. Something like this:
df2
company_id found_keywords no_of_url company_name
0 IQ137156215 insurance 15 Zühlke Technology Group AG
1 IQ3806173 insurance 15 BT España, Compañía de Servicios Globales de T...
2 IQ40333012 insurance 4 Technoserv Group
3 IQ51614192 insurance 15 Octo Telematics S.p.A.
4 IQ30413992 NaN NaN zeb.rolfes.schierenbeck.associates gmbh
I tried several ways of achieving this using pd.merge as well as np.where, and I even tried reindexing based on columns, but nothing worked out. What exactly do I need to do so that it works as expected? Please help me out. Thanks!
EDIT:
using pd.merge
df2.merge(df, right_on='company_id', left_on='Excel Company ID', how='outer')
which gave an output with [220 rows X 31 columns]
Your expected output is unclear. If you use pd.merge with how='outer' and indicator=True, you will have:
df1 = df1.rename(columns={'Company Name': 'company_name', 'Excel Company ID': 'company_id'})
out = df2.merge(df1, on=['company_id', 'company_name'], how='outer', indicator=True)
Output:
>>> out
company_id found_keywords no_of_url company_name _merge
0 IQ137156215 insurance 15.0 Zühlke Technology Group AG left_only
1 IQ3806173 insurance 15.0 BT España, Compañía de Servicios Globales de T... both
2 IQ40333012 insurance 4.0 Technoserv Group both
3 IQ51614192 insurance 15.0 Octo Telematics S.p.A. left_only
4 IQ109133656 NaN NaN cleverbridge AG right_only
5 IQ50008102 NaN NaN Blue Media S.A. right_only
6 IQ30413992 NaN NaN zeb.rolfes.schierenbeck.associates gmbh right_only
Check the last column _merge. If you have right_only, it means the company_id and company_name are not found in df2.
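If you just want the combined table without the helper column (the expected output above, plus the other right_only rows), you can drop it afterwards; a minimal sketch:
result = out.drop(columns='_merge')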
I am a beginner at Python, and I have some issues with a code I wrote.
I have 2 dataframes: one with general informations about books (dfMediaGe), and the other with the names of books which were shown on TV (dfTV).
My goal is to compare them, and to fill the column 'At least 1 TV emission' in dfMediaGe with a 1 if the book appears in the dfTV dataframe.
My difficulty is that the dataframes do not have the same number of rows/columns.
Here is a sample of dfMediaGe :
Titre original_title AUTEUR DATE EDITEUR THEMESIMPLE THEME GENRE rating rating_average ... current_count done_count list_count recommend_count review_count TRADUITDE LANGUEECRITURE NOTE At least 1 TV emission id
0 La souris des dents NaN Roger, Marie-Sabine|Desbons, Marie 01/01/2021 Lito TIPJJ001 Eveil J000100 Jeunesse - Eveil et Fiction / Histoire... GJEU003 Jeunesse / Mini albums|GJEU013 Jeuness... NaN NaN ... 0.0 0.0 0.0 0.0 0.0 NaN fre NaN 0 46220676.0
1 La petite mare du grand crocodile NaN Buteau, Gaëlle|Hudrisier, Cécile 01/01/2021 Lito TIPJJ001 Eveil J000100 Jeunesse - Eveil et Fiction / Histoire... GJEU003 Jeunesse / Mini albums|GJEU013 Jeuness... NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 NaN fre NaN 46220678.0
and here is a sample of dfTV :
Titre AUTEUR DATE EDITEUR GENRE THEMESIMPLE TRADUITDE NOTE THEME LANGUEECRITURE FORMATNUMERIQUE PUBLIC MATIERE LEXIQUE DESCRIPTION
0 Les strates Bagieu, Pénélope 11/12/2021 Gallimard NaN TIPBD001 Albums NaN NaN T090200 Bandes dessinées / Bandes dessinées fre NaN NaN NaN NaN 1 vol. ; illustrations en noir et blanc ; 24 x...
And here is the code I wrote, which is not working at all.
for Titre, r in dfMediaGe.iterrows():
for Titre, r in dfTV.iterrows():
p = 0
if r['Titre'].values == (dfTV['Titre']).values.any():
p = 1
r['Au moins 1 passage TV'].append(p)
I do get this error :
AttributeError: 'str' object has no attribute 'values'
Thank you very much for your help !!
Your two data frames not having the same number of columns is not a problem. (The AttributeError itself comes from calling .values on r['Titre'], which is already a plain string when you iterate with iterrows.)
You can achieve what you are looking for using this:
data_dfMediaGe = [
['Les strates Bagieu'],
['La petite mare du grand crocodile'],
['La souris des dents NaN Roger'],
['Movie XYZ']
]
dfMediaGe = pd.DataFrame(data=data_dfMediaGe, columns=['Titre'])
dfMediaGe['Au moins 1 passage TV'] = 0
data_dfTV = [
['La petite mare du grand crocodile'],
['Movie XYZ']
]
dfTV = pd.DataFrame(data=data_dfTV, columns=['Titre'])
for i, row in dfMediaGe.iterrows():
    if row['Titre'] in list(dfTV['Titre']):
        dfMediaGe.at[i, 'Au moins 1 passage TV'] = 1
print(dfMediaGe)
Titre Au moins 1 passage TV
0 Les strates Bagieu 0
1 La petite mare du grand crocodile 1
2 La souris des dents NaN Roger 0
3 Movie XYZ 1
All you have to do is iterate through the rows of dfMediaGe and check whether the value in its Titre column is present in the Titre column of dfTV.
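As a side note, the loop can be replaced by a vectorized isin check; a small sketch assuming the same column names as above:
dfMediaGe['Au moins 1 passage TV'] = dfMediaGe['Titre'].isin(dfTV['Titre']).astype(int)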
I'm trying to create a new column in the dataframe called volume. The DF already consists of other columns like market. What I want to do is to group by price and company and then get their count and add it in a new column called volume. Here's what I have:
df['volume'] = df.groupby(['price', 'company']).transform('count')
This does create a new column; however, it keeps all the rows, and I don't need all the rows: after the transformation I still get the same rows as before, just with a new column added.
market company price volume
LA EK 206.0 2
LA SQ 206.0 1
LA EK 206.0 2
LA EK 36.0 3
LA EK 36.0 3
LA SQ 36.0 1
LA EK 36.0 3
I'd like to drop the duplicated rows. Is there a query that I can do with groupby that will only show the rows like so:
market company price volume
LA EK 206.0 2
LA SQ 206.0 1
LA SQ 36.0 1
LA EK 36.0 3
Simply drop_duplicates with the columns ['market', 'company', 'price']:
>>> df.drop_duplicates(['market', 'company', 'price'])
market company price volume
0 LA EK 206.0 2
1 LA SQ 206.0 1
3 LA EK 36.0 3
5 LA SQ 36.0 1
Your data contains duplicates, probably because you are only including a subset of the columns. You need something else in your data other than price (e.g. two different days could close at the same price, but you wouldn't aggregate the volume from the two).
Assuming that the price is unique for a given timestamp, market and company, and that you first sort on your timestamp column if you have one (not required if there is only one price per company and market):
df = pd.DataFrame({
'company': ['EK', 'SQ', 'EK', 'EK', 'EK', 'SQ', 'EK'],
'date': ['2018-08-13'] * 3 + ['2018-08-14'] * 4,
'market': ['LA'] * 7,
'price': [206] * 3 + [36] * 4})
>>> (df.groupby(['market', 'date', 'company'])
...    .agg(price=('price', 'last'), volume=('price', 'count'))
...    .reset_index())
market date company price volume
0 LA 2018-08-13 EK 206 2
1 LA 2018-08-13 SQ 206 1
2 LA 2018-08-14 EK 36 3
3 LA 2018-08-14 SQ 36 1
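If your real data has a timestamp column ordering the prices, a sketch of the sort step mentioned above (so that 'last' picks the latest price in each group):
df = df.sort_values('date')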