Pandas Replace part of string with value from other column - python

I have a dataframe (example below, because the real data is sensitive) with a full term (string), a text snippet (one long string) that contains the full term, and an abbreviation (string). I have been struggling to replace the full term in the text snippet with the corresponding abbreviation. Can anyone help? Example:
   term                  text_snippet                           abbr
0  aanvullend onderzoek  aanvullend onderzoek is vereist om...  ao/
So I want to end up with:
   term                  text_snippet          abbr
0  aanvullend onderzoek  ao/ is vereist om...  ao/

You can use apply and replace terms with abbrs:
df['text_snippet'] = df.apply(
    lambda x: x['text_snippet'].replace(x['term'], x['abbr']), axis=1)
df
Output:
   term                  text_snippet          abbr
0  aanvullend onderzoek  ao/ is vereist om...  ao/

Here is a solution. I made up a simple dataframe:
import pandas as pd
df = pd.DataFrame({'term': ['United States of America', 'Germany', 'Japan'],
                   'text_snippet': ['United States of America is in America',
                                    'Germany is in Europe',
                                    'Japan is in Asia'],
                   'abbr': ['USA', 'DE', 'JP']})
df
Dataframe before solution:
   term                      text_snippet                            abbr
0  United States of America  United States of America is in America  USA
1  Germany                   Germany is in Europe                    DE
2  Japan                     Japan is in Asia                        JP
Use apply function on every row:
df['text_snippet'] = df.apply(lambda row: row['text_snippet'].replace(row['term'], row['abbr']), axis=1)
Output:
   term                      text_snippet       abbr
0  United States of America  USA is in America  USA
1  Germany                   DE is in Europe    DE
2  Japan                     JP is in Asia      JP
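As an alternative to apply with axis=1, the same row-wise replacement can be sketched with a list comprehension over the three columns. This is only a sketch that reuses the made-up dataframe from this answer; on larger frames it may be faster than apply, though that depends on the data.

```python
import pandas as pd

# Same made-up data as in the answer above.
df = pd.DataFrame({
    "term": ["United States of America", "Germany", "Japan"],
    "text_snippet": [
        "United States of America is in America",
        "Germany is in Europe",
        "Japan is in Asia",
    ],
    "abbr": ["USA", "DE", "JP"],
})

# Walk the three columns in parallel and use plain str.replace on each row.
df["text_snippet"] = [
    text.replace(term, abbr)
    for text, term, abbr in zip(df["text_snippet"], df["term"], df["abbr"])
]

print(df)
```

Note that str.replace substitutes every occurrence of the term in the snippet, not just the first one.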

Related

Compare the content of a dataframe with a cell value in other and replace the value matched in Pandas

I have two dataframes,
df1 =
Countries description                    Continents  values
C0001 also called America,               America     21tr
C0004 and C0003 are neighbhors           Europe      504 bn
on advancing C0005 with C0001.security   Europe      600bn
C0002, the smallest continent            Australi    1.7tr
df2 =
Countries  Id
US         C0001
Australia  C0002
Finland    C0003
Norway     C0004
Japan      C0005
df1 has a Countries description column in which codes are given instead of the actual country names.
df2 maps each country to its code.
I want to replace the country codes (like C0001, C0002) in df1 with their names, like this:
df1 =
Countries description                Continents  values
US also called America, some..       America     21tr
Norway and Finland are neighbhors    Europe      504 bn
on advancing Japan with US.security  Europe      600bn
Australia, the smallest continent    Austral     1.7tr
I tried the Pandas merge method, but that didn't work:
df3 = df1.merge(df2, on=['Countries'], how='left')
Thanks :)
Here is one way to approach it with replace:
d = dict(zip(df2["Id"], df2["Countries"]))
df1["Countries description"] = df1["Countries description"].replace(d, regex=True)
Output:
print(df1)
   Countries description                Continents  values
0  US also called America,              America     21tr
1  Norway and Finland are neighbhors    Europe      504 bn
2  on advancing Japan with US.security  Europe      600bn
3  Australia, the smallest continent    Australi    1.7tr
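With regex=True the dictionary keys are treated as regular expressions, so a code containing characters such as . or + would be read as a pattern rather than literally. A possible variant, sketched below with the question's data, escapes the codes, joins them into one compiled pattern, and looks each match up in the dictionary. The names mapping and pattern are just illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame({
    "Countries description": [
        "C0001 also called America,",
        "C0004 and C0003 are neighbhors",
        "on advancing C0005 with C0001.security",
        "C0002, the smallest continent",
    ],
})
df2 = pd.DataFrame({
    "Countries": ["US", "Australia", "Finland", "Norway", "Japan"],
    "Id": ["C0001", "C0002", "C0003", "C0004", "C0005"],
})

mapping = dict(zip(df2["Id"], df2["Countries"]))

# Escape every code so it is matched literally; longer codes go first so that
# a code which is a prefix of another one cannot shadow it.
pattern = re.compile(
    "|".join(re.escape(code) for code in sorted(mapping, key=len, reverse=True))
)

# The callable receives each match object and returns the country name.
df1["Countries description"] = df1["Countries description"].str.replace(
    pattern, lambda m: mapping[m.group(0)], regex=True
)

print(df1)
```

This does a single pass over each string instead of one pass per dictionary key.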

Compare 2 CSV files using merge and compare row by row

I have 2 CSV files. file1 has a list of research group names; file2 has the full research group names together with their locations. I want to join these 2 CSV files wherever the words in the names match.
In file1.csv:
research_groups_names
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
In file2.csv:
research_groups_names              Location
Chinese Academy of Sciences (CAS)  China
University of Michigan (U-M)       United States of America (USA)
The University of Queensland (UQ)  Australia
The expected output.csv:
f1_research_groups_names     f2_research_groups_names           Location
Chinese Academy of Sciences  Chinese Academy of Sciences (CAS)  China
CAS                          Chinese Academy of Sciences (CAS)  China
U-M                          University of Michigan (U-M)       United States of America (USA)
UQ                           The University of Queensland (UQ)  Australia
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')
def compare_nae(df):
    if df1['f1_research_groups_names'] == df2['f2_research_groups_names']:
        return 1
    else:
        return 0
result = pd.merge(df1, df2, left_on=['f1_research_groups_names'],right_on=['f2_research_groups_names'], how="left")
result.to_csv('output.csv')
You can try:
def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names"] == row["research_groups_names"]
            or row["research_groups_names"] in n["research_groups_names"]
        ):
            return n
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
Prints:
   f1_research_groups_names           f2_research_groups_names           f2_location
0  Chinese Academy of Sciences (CAS)  Chinese Academy of Sciences (CAS)  China
1  CAS                                Chinese Academy of Sciences (CAS)  China
2  U-M                                University of Michigan (U-M)       United States of America (USA)
3  UQ                                 The University of Queensland (UQ)  Australia
Note: If a name in df1 is not found in df2, the columns "f2_research_groups_names" and "f2_location" will contain None for that row.
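A variant that makes the no-match case explicit could look like the sketch below. It builds the two frames inline instead of reading the CSV files, and the "MIT" row is a hypothetical extra entry added only to show what happens when nothing matches; find_match is an illustrative helper name.

```python
import pandas as pd

df1 = pd.DataFrame({
    "research_groups_names": [
        "Chinese Academy of Sciences (CAS)", "CAS", "U-M", "UQ",
        "MIT",  # hypothetical name with no counterpart in df2
    ],
})
df2 = pd.DataFrame({
    "research_groups_names": [
        "Chinese Academy of Sciences (CAS)",
        "University of Michigan (U-M)",
        "The University of Queensland (UQ)",
    ],
    "Location": ["China", "United States of America (USA)", "Australia"],
})


def find_match(name):
    """Return (full name, location) of the first df2 row containing name."""
    for full_name, location in zip(df2["research_groups_names"], df2["Location"]):
        if name in full_name:  # substring test also covers exact equality
            return full_name, location
    return None, None  # explicit placeholder when nothing matches


matches = [find_match(name) for name in df1["research_groups_names"]]

result = pd.DataFrame({
    "f1_research_groups_names": df1["research_groups_names"],
    "f2_research_groups_names": [m[0] for m in matches],
    "Location": [m[1] for m in matches],
})
print(result)
```

Like the loop above, this compares every name in df1 against every row of df2, so it is only practical for small files.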

Removing everything after a char in a dataframe

If I have the following dataframe 'countries':
country info
england london-europe
scotland edinburgh-europe
china beijing-asia
unitedstates washington-north_america
I would like to remove everything after the '-' in the info field, so that it becomes:
country info
england london
scotland edinburgh
china beijing
unitedstates washington
How do I do this?
Try:
countries['info'] = countries['info'].str.split('-').str[0]
Output:
country info
0 england london
1 scotland edinburgh
2 china beijing
3 unitedstates washington
You just need to keep the first part of the string after a split on the dash character:
countries['info'] = countries['info'].str.split('-').str[0]
Or, equivalently, you can use
countries['info'] = countries['info'].str.split('-').map(lambda x: x[0])
You can also use str.extract with the pattern r"(\w+)(?=\-)"
Ex:
print(countries['info'].str.extract(r"(\w+)(?=\-)"))
Output:
info
0 london
1 edinburgh
2 beijing
3 washington
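The split and extract approaches differ when a value has no dash at all. The sketch below adds a hypothetical fifth row ("france", "paris") to the question's data to show this: split keeps the whole string, while the lookahead pattern finds no match and yields a missing value. Passing n=1 to split is optional; it just stops after the first dash.

```python
import pandas as pd

countries = pd.DataFrame({
    "country": ["england", "scotland", "china", "unitedstates", "france"],
    "info": [
        "london-europe",
        "edinburgh-europe",
        "beijing-asia",
        "washington-north_america",
        "paris",  # hypothetical row without a dash
    ],
})

# Split at most once and keep the part before the first dash.
by_split = countries["info"].str.split("-", n=1).str[0]

# Capture the word characters that are directly followed by a dash.
by_extract = countries["info"].str.extract(r"(\w+)(?=-)", expand=False)

print(by_split)
print(by_extract)
```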

How do I get the name of the highest value in a group in Pandas?

I have the following dataframe:
Country Continent Population
--- ------- ------------- ------------
0 United States North America 329,451,665
1 Canada North America 37,602,103
2 Brazil South America 210,147,125
3 Argentina South America 43,847,430
I want to group by the continent, and get the name of the country with the highest population in that continent, so basically I want my result to look as follows:
Continent Country
---------- -------------
North America United States
South America Brazil
How can I do this?
Use idxmax to get index of the max row:
df['Population'] = pd.to_numeric(df['Population'].str.replace(',', ''))
idx = df.groupby('Continent')['Population'].idxmax()
df.loc[idx]
Result:
Country Continent Population
0 United States North America 329451665
2 Brazil South America 210147125
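To get only the two columns asked for in the question, the rows picked by idxmax can be narrowed down afterwards. A minimal sketch, rebuilding the question's dataframe with Population stored as comma-formatted strings (an assumption based on how the table is printed):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["United States", "Canada", "Brazil", "Argentina"],
    "Continent": ["North America", "North America", "South America", "South America"],
    "Population": ["329,451,665", "37,602,103", "210,147,125", "43,847,430"],
})

# Strip the thousands separators so the column can be compared numerically.
df["Population"] = pd.to_numeric(df["Population"].str.replace(",", "", regex=False))

# Index label of the most populous row within each continent.
idx = df.groupby("Continent")["Population"].idxmax()

# Keep only the requested columns and renumber the rows.
result = df.loc[idx, ["Continent", "Country"]].reset_index(drop=True)
print(result)
```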

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df
answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the link suggested and none of them worked. I've tried to convert the values to floats using a few different methods including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations']. But none have worked so far.
Without having the exact dataframe it is difficult to say. But it is most likely a casting problem.
Let's build an MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce the type, you can cast the dataframe using the astype method, which returns a new dataframe:
df.astype(float)
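One way to check whether it really is a casting problem is to look at the dtypes before dividing and coerce the columns explicitly. The sketch below assumes, purely for illustration, that the two columns arrived as strings; the original dataframe is not available, so this may not be the actual cause.

```python
import pandas as pd

# Hypothetical input: two of the rows above, with the numbers stored as text.
df = pd.DataFrame(
    {"Self_cite": ["15606", "411683"], "Citations": ["90765", "597237"]},
    index=pd.Index(["Aus.", "China"], name="Country"),
)

# Non-numeric dtypes here would point to a casting problem.
print(df.dtypes)

# Convert every column to a numeric dtype, then divide.
df = df.apply(pd.to_numeric)
df["ratio"] = df["Self_cite"] / df["Citations"]

print(df.dtypes)
print(df)
```

If the dtypes are already int64 or float64, as in the MCVE, the division itself returns floats and the cause has to be somewhere else.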
