Pandas translating column of a dataframe with a lookup dataframe - python

I have a dataframe that looks like:
import pandas as pd

df = pd.DataFrame({'ISIN': ['A1kT23', '4523', 'B333', '49O33'],
                   'Name': ['Example A', 'Name Xy', 'Example B', 'Test123'],
                   'Sector': ['Energy', 'Industrials', 'Utilities', 'Real Estate'],
                   'Country': ['UK', 'USA', 'Germany', 'China']})
I would like to translate the column Sector into German by using the dataframe Sector_EN_DE
Sector_EN_DE = pd.DataFrame({'Sector_EN': ['Energy', 'Industrials', 'Utilities', 'Real Estate', 'Materials'], 'Sector_DE': ['Energie', 'Industrie', 'Versorger', 'Immobilien', 'Materialien']})
so that I get as result the dataframe
df = pd.DataFrame({'ISIN': ['A1kT23', '4523', 'B333', '49O33'], 'Name': ['Example A', 'Name Xy', 'Example B', 'Test123'], 'Sector': ['Energie', 'Industrie', 'Versorger', 'Immobilien'], 'Country': ['UK', 'USA', 'Germany', 'China']})
What would be the appropriate code line?

Another way, via map():
df['Sector'] = df['Sector'].map(dict(Sector_EN_DE[['Sector_EN', 'Sector_DE']].values))
OR
via replace():
df['Sector'] = df['Sector'].replace(dict(Sector_EN_DE[['Sector_EN', 'Sector_DE']].values))
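A behavioral difference worth knowing (standard pandas semantics, not part of the original answer): map() returns NaN for values missing from the lookup, while replace() leaves them untouched. A quick sketch, using a hypothetical unmatched sector 'Telecom':
lookup = dict(Sector_EN_DE[['Sector_EN', 'Sector_DE']].values)
pd.Series(['Energy', 'Telecom']).map(lookup)      # ['Energie', NaN]
pd.Series(['Energy', 'Telecom']).replace(lookup)  # ['Energie', 'Telecom']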

This line will do the merge and DataFrame cleanup:
df.merge(Sector_EN_DE, left_on='Sector', right_on='Sector_EN').drop(['Sector', 'Sector_EN'], axis=1).rename(columns={'Sector_DE': 'Sector'})
Explanation:
The merge function does the join between both DataFrames.
The drop function drops the English version of Sector, with axis=1 because you're dropping columns (you can also use that function to drop rows).
The rename function renames the Sector_DE column back to Sector.
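One caveat (default merge semantics, not specific to this answer): the inner join silently drops any row whose Sector has no entry in the lookup table, and the renamed Sector column ends up last. If unmatched rows should survive, a left join keeps them (their Sector becomes NaN):
df = (df.merge(Sector_EN_DE, left_on='Sector', right_on='Sector_EN', how='left')
        .drop(['Sector', 'Sector_EN'], axis=1)
        .rename(columns={'Sector_DE': 'Sector'}))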

Matching part of a string with a value in two pandas dataframes

Given the following df with street names:
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
And df2, which contains the matching streets and their corresponding county:
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
How can I create a column that tells me the state where each street of df is, by pairing df (street1) with df2 (street2)? The matching does not have to be perfect; it must match at least one word.
The following dataframe is an example of what I want to obtain:
desiredoutput = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown'], 'state': ["Utuado", "NA", "NA", "Bayamon"]})
Maybe a naive approach, but it works well.
df = pd.DataFrame({'street1': ['36 Angeles', 'New York', 'Rice Street', 'Levitown']})
df2 = pd.DataFrame({'street2': ['Angeles', 'Caguana', 'Levitown'], 'county': ["Utuado", "Utuado", "Bayamon"]})
output = {'street1': [], 'county': []}
streets1 = df['street1']
streets2 = df2['street2']
county = df2['county']
count = 0  # flags whether the current street found at least one match
for street in streets1:
    for index, street2 in enumerate(streets2):
        if street2 in street:
            output['street1'].append(street)
            output['county'].append(county[index])
            count = 1
    if count == 0:  # no match found: record the street with 'NA'
        output['street1'].append(street)
        output['county'].append('NA')
    count = 0
print(output)
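For larger frames, a vectorized sketch of the same substring idea (assuming the street2 values contain no regex metacharacters; otherwise wrap them in re.escape):
# build an alternation pattern like '(Angeles|Caguana|Levitown)'
pattern = '(' + '|'.join(df2['street2']) + ')'
# extract the first matching street2 substring (NaN where nothing matches)
matched = df['street1'].str.extract(pattern, expand=False)
# translate the matched street2 to its county, 'NA' where there was no match
df['county'] = matched.map(df2.set_index('street2')['county']).fillna('NA')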

Merge dfs in a dictionary based on a column key

I have a dictionary like so: {key_1: pd.DataFrame, key_2: pd.DataFrame, ...}.
Each of these dfs within the dictionary has a column called 'ID'.
Not all instances appear in each dataframe meaning that the dataframes are of different size.
Is there any way I could combine these into one large dataframe?
Here's a minimal reproducible example of the data:
data1 = [{'ID': 's1', 'country': 'Micronesia', 'Participants': 3},
         {'ID': 's2', 'country': 'Thailand', 'Participants': 90},
         {'ID': 's3', 'country': 'China', 'Participants': 36},
         {'ID': 's4', 'country': 'Peru', 'Participants': 30}]
data2 = [{'ID': '1', 'country': 'Micronesia', 'Kids_per_participant': 3},
         {'ID': 's2', 'country': 'Thailand', 'Kids_per_participant': 9},
         {'ID': 's3', 'country': 'China', 'Kids_per_participant': 39}]
data3 = [{'ID': 's1', 'country': 'Micronesia', 'hair_style_rank': 3},
         {'ID': 's2', 'country': 'Thailand', 'hair_style_rank': 9}]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dict_example={'df1_key':df1,'df2_key':df2,'df3_key':df3}
My attempt fails, since pd.merge expects exactly two frames:
pd.merge(dict_example.values(), on="ID", how="outer")
For a dict with an arbitrary number of keys you could do this:
i = list(dict_example.keys())
newthing = dict_example[i[0]]
for j in range(1, len(i)):
    newthing = newthing.merge(dict_example[i[j]], on='ID', how='outer')
First, make a list of your dataframes. Second, create a first DataFrame. Then iterate through the rest of your DataFrames, merging each one in turn. I did notice you have country for each ID, but it's not listed in your initial on statement. Do you want to join on country also? If so, replace the merge above with this, changing the join criteria to a list that includes country:
newthing = newthing.merge(dict_example[i[j]],on=['ID','country'], how = 'outer')
See the pandas documentation on merge.
If you don't care about altering your DataFrames, the code can be shorter:
for j in range(1, len(i)):
    df1 = df1.merge(dict_example[i[j]], on=['ID', 'country'], how='outer')
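If you'd rather not manage the key list by hand, the same chain of outer merges can be written with functools.reduce over the dict values (a sketch equivalent to the loop above, assuming every frame has the 'ID' column):
from functools import reduce
# fold the dict's DataFrames into one via successive outer merges on 'ID'
newthing = reduce(lambda left, right: left.merge(right, on='ID', how='outer'),
                  dict_example.values())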

Change column names except for certain columns

Assuming I have the following dataframe
df = pd.DataFrame(
    {
        'ID': ['AB01'],
        'Col A': ["Yes"],
        'Col B': ["L"],
        'Col C': ["Yes"],
        'Col D': ["L"],
        'Col E': ["Yes"],
        'Col F': ["L"],
        'Type': [85]
    }
)
I want to change all column names by converting them to lowercase, replacing spaces with underscores, and appending the string _filled to the end of each name, except for the columns named in the list skip = ['ID', 'Type'].
How can I achieve this? I want the resulting dataframe to have the column names ID, col_a_filled, col_b_filled, ..., Type.
You can use df.rename along with a dict comprehension to get a nice one-liner:
skip = ['ID', 'Type']
df = df.rename(columns={col: col.lower().replace(" ", "_") + "_filled" for col in df.columns if col not in skip})
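With the sample frame above, a quick check of the resulting column names (output reconstructed by hand):
print(df.columns.tolist())
# ['ID', 'col_a_filled', 'col_b_filled', 'col_c_filled', 'col_d_filled', 'col_e_filled', 'col_f_filled', 'Type']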

Aggregating and group by in Pandas considering some conditions

I have an Excel file which, simplified, has the following structure and which I read as a dataframe:
df = pd.DataFrame({'ISIN': ['US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US00206R1023'],
                   'Name': ['ALPHABET INC.CL.A DL-,001', 'Alphabet Inc Class A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'Alphabet Inc. Class C', 'Alphabet Inc. Class A', 'AT&T Inc'],
                   'Country': ['United States', 'United States', 'United States', '', 'United States', 'United States', 'United States', 'United States', 'United States'],
                   'Category': ['', 'big', 'big', '', 'big', 'test', 'test', 'test', 'average'],
                   'Category2': ['important', '', 'important', '', '', '', '', '', 'irrelevant'],
                   'Value': [1000, 750, 60, 50, 160, 9, 10, 10, 1]})
I would love to group by ISIN and add up the values to calculate the sum, like
df1 = df.groupby('ISIN')['Value'].sum()
The problem with this approach is that I don't get the other fields 'Name', 'Country', 'Category', 'Category2'.
My objective is to get as a result the following data aggregated dataframe:
df1 = pd.DataFrame({'ISIN': ['US02079K3059', 'US00206R1023'],
                    'Name': ['ALPHABET A', 'AT&T Inc'],
                    'Country': ['United States', 'United States'],
                    'Category': ['big', 'average'],
                    'Category2': ['important', 'irrelevant'],
                    'Value': [2049, 1]})
If you compare df to df1, you will recognize some criteria/conditions I applied:
for every 'ISIN', the most commonly appearing field value should be used, e.g. 'United States' in column 'Country'
if field values are equally common, the first-appearing of the most common values should be used, e.g. 'big' and 'test' in column 'Category'
exception: empty values don't count, e.g. in Category2, even though '' is the most common value, 'important' is used as the final value
How can I achieve this goal? Is there anyone who can help me out?
Try converting '' to NaN, then drop the 'Value' column, then group by 'ISIN' and calculate the mode; finally, map the sums of the 'Value' column grouped by 'ISIN' onto the 'ISIN' column to recreate the 'Value' column in your final result.
The idea is that converting the empty string '' to NaN keeps it from counting in the mode. We also define a function to handle the case where the mode of a particular column grouped by 'ISIN' is empty because of dropna=True in the mode() method:
def f(x):
    try:
        return x.mode().iat[0]
    except IndexError:
        return float('NaN')
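One subtlety (general Series.mode() behavior, not something the answer states): when values tie, mode() returns the ties sorted, so .iat[0] picks the alphabetically first one rather than the first-appearing one. With the sample data this still matches the desired output:
pd.Series(['test', 'big', 'big', 'test']).mode()
# 0     big
# 1    test
# dtype: object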
Finally:
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False).agg(f))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
OR
Via passing dropna=False to the mode() method and an anonymous function:
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False).agg(lambda x: x.mode(dropna=False).iat[0]))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
If you now print out, you will get your desired output.
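As a quick sanity check (numbers computed by hand from the sample data; the extra Value_perc column comes from the answer's code, not the question):
print(out[['ISIN', 'Value', 'Value_perc']])
#            ISIN  Value  Value_perc
# 0  US00206R1023      1     0.00049
# 1  US02079K3059   2049     0.99951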

How to iterate over several dataframes row by row and modify one of them

I am stuck with a loop that is not working for me. What I want is to extract values from several dataframes, based on a condition, into my final dataframe.
I have:
Final Dataframe:
final = {'code': ['A001', 'A002', 'A003'],
         'reg': ['2234', '3432', '6578'],
         'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
         'df2_code': ['', '', ''],
         'df2_name': ['', '', ''],
         'df3_code': ['', '', ''],
         'df3_name': ['', '', '']}
This dataframe must be filled, specifically the columns with the prefixes df2_, df3_, ...
They must be filled with the 'code' and 'name' columns of other dataframes that share the first three column names of the final dataframe (code, reg, name). A condition applies to the fill: the 'reg' number must be the same in both dataframes.
An example of the others:
df2 = {'code': ['P001', 'A002', 'P003'],
       'reg': ['2234', '3432', '9978'],
       'name': ['Chips 23', 'Flying 23', 'American99']}
So, until now, the product of this logic would be:
final = {'code': ['A001', 'A002', 'A003'],
         'reg': ['2234', '3432', '6578'],
         'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
         'df2_code': ['P001', 'A002', ''],
         'df2_name': ['Chips 23', 'Flying 23', '']}
But the problem is a little more complex: there are duplicates among the 'reg' numbers in df2, which serve as the condition. So df2 actually is:
df2 = {'code': ['P001', 'A002', 'P003', 'B004'],
       'reg': ['2234', '3432', '9978', '2234'],
       'name': ['Chips 23', 'Flying 23', 'American99', NaN]}
This must be taken into account by concatenating the 'code' and the 'name' of both matching rows in the same cells. The product would be:
final = {'code': ['A001', 'A002', 'A003'],
         'reg': ['2234', '3432', '6578'],
         'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
         'df2_code': ['P001&B004', 'A002', ''],
         'df2_name': ['Chips 23', 'Flying 23', '']}
Until now, I have written this code for only one dataframe (df2), and it takes too much time, as the final df has 200,000+ rows (I have 5 dfs to scan, but the others are smaller):
for i, row in final.iterrows():
    for j, inrow in df2.iterrows():
        if row['reg'] == inrow['reg']:
            if final['df2_code'].iloc[i] == '':
                final['df2_code'].iloc[i] = str(inrow['code'])
            else:
                final['df2_code'].iloc[i] += '&' + str(inrow['code'])
            if inrow['name'] is None:
                continue
            else:
                if final['df2_name'].iloc[i] == '':
                    final['df2_name'].iloc[i] = str(inrow['name'])
                else:
                    final['df2_name'].iloc[i] += '&' + str(inrow['name'])
Consider Series.str.cat + groupby:
df2 = pd.DataFrame({'code': ['P001','A002','P003', 'B004'],
'reg': ['2234','3432', '9978', '2234'],
'name': ['Chips 23', 'Flying 23', 'American99', float('nan')]})
agg_df = (df2.assign(name = lambda x: x["name"].fillna(""))
.groupby(['reg'])
.agg({'code': lambda g: g.str.cat(sep="&"),
'name': 'max'})
.add_prefix("df2_")
)
agg_df
# df2_code df2_name
# reg
# 2234 P001&B004 Chips 23
# 3432 A002 Flying 23
# 9978 P003 American99
To handle several data frames, run a horizontal merge from a list of aggregated data frames (using an elementwise zip in a list comprehension to build the prefixes).
# LIST OF AGGREGATED DATA FRAMES
dfs = [
(df.assign(name = lambda x: x["name"].fillna(""))
.groupby(['reg'])
.agg({'code': lambda g: g.fillna("").str.cat(sep="&"),
'name': max})
.add_prefix(f"{nm}_")
)
for df, nm
in zip([df2, df3, df4, df5], ["df2", "df3", "df4", "df5"])
]
# HORIZONTAL MERGE ON "reg"
final_df = pd.concat(dfs, axis=1)
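To write these aggregates back into the question's final frame, a left merge on 'reg' should do it (a sketch, assuming final is built as a DataFrame from the dict shown earlier):
final = pd.DataFrame(final)[['code', 'reg', 'name']]  # drop the empty placeholder columns
final = (final.merge(final_df, left_on='reg', right_index=True, how='left')
              .fillna(''))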
