Find and Replace in DataFrame using Pandas in optimized way

I am trying to find and replace words in about 20K comments. The find-and-replace word pairs are stored in one DataFrame (more than 20,000 rows), and the comments are in a different DataFrame (around 20K rows).
Below is an example:
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({'Find': ["Insurence", "Non cash entry"],
                    'Replace': ["Insurance", "Blocked"]})
I am expecting the output below:
op = ["Hull Damage happened and its insured by maritime hull insurance company","Blocked and claims are blocked"]
Please help.
I am using a loop, but it takes more than 20 minutes to do this.
There are 20K records in the data and 30,000 words to be replaced.
"KeywordSynonym" -- DataFrame that holds the find and replace data in SQL
"backup" -- DataFrame that holds the data to be cleaned
import re
backup = str(backup)
TrainingClaimNotes_KwdSyn = []
for index, row in KeywordSynonym.iterrows():
    word = KeywordSynonym.Synonym[index].lower()
    value = KeywordSynonym.Keyword[index].lower()
    my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
    if re.search(my_regex, backup):
        backup = re.sub(my_regex, value, backup)
TrainingClaimNotes_KwdSyn.append(backup)
TrainingClaimNotes_KwdSyn_Cmp = backup.split('\'", "\'')

Use:
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({'Find': ["Insurence", "Non cash entry"],
                    'Replace': ["Insurance", "Blocked"]})
find_repl = dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
Output
>>> print(df1['Data_1'].tolist())
['hull damage happened and its insured by maritime hull insurance company', 'blocked and claims are blocked']
Explanation
dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower())) creates a mapping from each string you want to replace to the string you want to replace it with:
{'insurence': 'insurance', 'non cash entry': 'blocked'}
Convert the keys to regex patterns wrapped in word boundaries, so only whole words or phrases match:
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
{'(\\b)insurence(\\b)': '\\1insurance\\2', '(\\b)non cash entry(\\b)': '\\1blocked\\2'}
The final piece is making the actual replacement:
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
Note: I called .lower() everywhere so that matches are case-insensitive. You can adapt this if you need to preserve the original casing.
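With tens of thousands of find terms, a dict of separate patterns still means one regex pass per term. A possible alternative, sketched below, compiles all terms into a single alternation and looks up each match in the dict. It assumes replacements never need to chain (the output of one replacement is never the input of another). The names pattern and replace_all are illustrative and not part of the answer above. The sketch also passes each term through re.escape, so terms containing characters such as . or ( are matched literally.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Data': [
    "Hull Damage happened and its insured by maritime hull insurence company",
    "Non Cash Entry and claims are blocked",
]})
df2 = pd.DataFrame({'Find': ["Insurence", "Non cash entry"],
                    'Replace': ["Insurance", "Blocked"]})

# Lower-cased lookup table: find term -> replacement
find_repl = dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))

# Longest terms first, so a longer phrase wins over a shorter term it contains
keys = sorted(find_repl, key=len, reverse=True)

# One compiled pattern for all terms, wrapped in word boundaries
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

def replace_all(text):
    # A single pass over the text; each match is looked up in the dict
    return pattern.sub(lambda m: find_repl[m.group(0)], text.lower())

df1['Data_1'] = df1['Data'].map(replace_all)
print(df1['Data_1'].tolist())
```

Whether this is actually faster on 20K comments depends on the data, so it is worth timing both versions on a sample first.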

Related

Getting specific values from list of key value pairs in dataframe

I've written the code below to get some citation data from an API and write it to a CSV. It works fine except that one of the columns returns a list of authors and it comes into the CSV like this:
[{'authorId': '83129125', 'name': 'June A. Sekera'}, {'authorId': '13328115', 'name': 'A. Lichtenberger'}]
How can I parse this so I get simply a comma-separated list of the authors in a single cell, ignoring the authorId?
import requests
import pandas as pd
# get data from the API
r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")
# named payload rather than json, so the name of the json module is not reused
payload = r.json()
df = pd.DataFrame(payload['data'])
# new df from the column of lists
split_df = pd.DataFrame(df['citingPaper'].tolist())
# display the resulting df
print(split_df)
split_df.to_csv("citations.csv", index=False)

Something like the below (do the authors "cleanup" before we populate the df)
import requests
import pandas as pd
r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")
data = []
if r.status_code == 200:
    entries = r.json()['data']
    for entry in entries:
        entry['citingPaper']['authors'] = ','.join(x['name'] for x in entry['citingPaper'].get('authors',[]))
        data.append(entry['citingPaper'])
    df = pd.DataFrame(data)
    df.to_csv("citations.csv",index = False)
citations.csv
paperId,url,title,year,authors
1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,https://www.semanticscholar.org/paper/1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,"A comparative study on deformation mechanisms, microstructures and mechanical properties of wide thin-ribbed sections formed by sideways and forward extrusion",2021,"Wenbin Zhou,Junquan Yu,Xiaona Lu,Jianguo Lin,T. Dean"
46f209486a9e8f81c77dbbb39991f4045dbc8f7d,https://www.semanticscholar.org/paper/46f209486a9e8f81c77dbbb39991f4045dbc8f7d,The low-carbon steel industry-Interactions between the hydrogen direct reduction of steel and the electricity system,2021,A. Toktarova
5638200cc8b188a48f923b75dee9793c06c99b62,https://www.semanticscholar.org/paper/5638200cc8b188a48f923b75dee9793c06c99b62,Pore-scale assessment of subsurface carbon storage potential: implications for the UK Geoenergy Observatories project,2021,"R. Payton,M. Fellgett,B. Clark,D. Chiarella,A. Kingdon,S. Hier‐Majumder"
6055ea75b377b6776f468ccb9f21551614b5f61f,https://www.semanticscholar.org/paper/6055ea75b377b6776f468ccb9f21551614b5f61f,Can Nature-Based Solutions Deliver a Win-Win for Biodiversity and Climate Change Adaptation?,2021,"Isabel Key,Alison C. Smith,B. Turner,A. Chausson,C. Girardin,Megan MacGillivray,N. Seddon"
8734d34823bcfb362f05df2a48bad19cc026b1c1,https://www.semanticscholar.org/paper/8734d34823bcfb362f05df2a48bad19cc026b1c1,Trends in air travel inequality in the UK: From the few to the many?,2021,"M. Büchs,Giulio Mattioli"
020eecb5f6edf918b6ef1120d97276b8d0748dc7,https://www.semanticscholar.org/paper/020eecb5f6edf918b6ef1120d97276b8d0748dc7,"Decarbonising the critical sectors of aviation, shipping, road freight and industry to limit warming to 1.5–2°C",2020,"M. Sharmina,O. Edelenbosch,C. Wilson,R. Freeman,D. Gernaat,P. Gilbert,A. Larkin,E. Littleton,M. Traut,D. V. van Vuuren,N. Vaughan,F. R. Wood,C. Le Quéré"
1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,https://www.semanticscholar.org/paper/1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,Investments in climate-friendly materials to strengthen the recovery package JUNE 2020,2020,"F. Lettow,Olga Chiappinelli"
3916ee1df6b8e07f8798a90726407554e990847e,https://www.semanticscholar.org/paper/3916ee1df6b8e07f8798a90726407554e990847e,Pathways for Low-Carbon Transition of the Steel Industry—A Swedish Case Study,2020,"A. Toktarova,I. Karlsson,Johan Rootzén,L. Göransson,M. Odenberger,F. Johnsson"
55667e4e2e3c35d7c6fcf98021a083d4397a308d,https://www.semanticscholar.org/paper/55667e4e2e3c35d7c6fcf98021a083d4397a308d,Potentials for reducing climate impact from tourism transport behavior,2020,"Anneli Kamb,E. Lundberg,J. Larsson,Jonas Nilsson"
9fc21836d61fbc76e3923021cb88f94e3e8c5a41,https://www.semanticscholar.org/paper/9fc21836d61fbc76e3923021cb88f94e3e8c5a41,Decarbonization of construction supply chains-Achieving net-zero carbon emissions in the supply chains linked to the construction of buildings and transport infrastructure,2020,
b657008ea89807ccbb204ed1e2f4debe92ce252b,https://www.semanticscholar.org/paper/b657008ea89807ccbb204ed1e2f4debe92ce252b,Roadmap for Decarbonization of the Building and Construction Industry—A Supply Chain Analysis Including Primary Production of Steel and Cement,2020,"I. Karlsson,Johan Rootzén,A. Toktarova,M. Odenberger,F. Johnsson,L. Göransson"
c611929ef750ba845a9c87da862ad1f8c9711e64,https://www.semanticscholar.org/paper/c611929ef750ba845a9c87da862ad1f8c9711e64,"Assessing Carbon Capture: Public Policy, Science, and Societal Need",2020,"June A. Sekera,A. Lichtenberger"

The easiest way I can think of is to use apply(lambda x: ...), creating a list of values for dictionary key "name" in each dictionary p in each item of the column authors.
Add this underneath split_df = pd.DataFrame(...):
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x])
split_df["authors"][0]
#Out: ['Wenbin Zhou', 'Junquan Yu', 'Xiaona Lu', 'Jianguo Lin', 'T. Dean']
Edit
To have blank "" if there are no authors:
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x] if len(x) > 0 else "")
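The snippets above leave a Python list in each cell. If the goal is a single comma-separated string per cell, as the question asks, the names can be joined inside the same apply call. The sketch below uses a small hand-built frame in place of the API response; the titles "Paper A" and "Paper B" are placeholders, and the author dicts are copied from the question.

```python
import pandas as pd

# Stand-in for the frame built from the API response (titles are placeholders)
split_df = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "authors": [
        [{'authorId': '83129125', 'name': 'June A. Sekera'},
         {'authorId': '13328115', 'name': 'A. Lichtenberger'}],
        [],  # a paper with no listed authors
    ],
})

# Join the "name" values into one string; an empty list becomes ""
split_df["authors"] = split_df["authors"].apply(
    lambda authors: ", ".join(p["name"] for p in authors)
)
print(split_df["authors"].tolist())
```

Because joining an empty sequence yields an empty string, no separate check for missing authors is needed here.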

Argument 'string' has incorrect type (expected str, got list) Spacy NLP

I want to calculate cosine similarity, but I got an error message after converting the dataframe column to its list: Argument 'string' has incorrect type (expected str, got list).
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")
df= [['24, Single, Consultant, Canada, I am interested in visiting Isreal again'], ['18, Single, Student, I want to go back Costa Rica again'], ['45,Married, Unemployed, I want to take my family to Florida for the summer vacation']]
df = pd.DataFrame(df, columns = ['Free Text'])
df["N_Application"]=range(0, len(df))
# convert dataframe column to list
data=df['Free Text'].tolist()
df_spacy=nlp(data)
I would appreciate it if someone could help me fix it. Thank you.

The way you get a function to operate across an entire pd.Series is to use .apply(). And you can chain .apply() calls.
Example:
# changing to strings instead of nested list
l = ['24, Single, Consultant, Canada, I am interested in visiting Isreal again',
     '18, Single, Student, I want to go back Costa Rica again',
     '45,Married, Unemployed, I want to take my family to Florida for the summer vacation']
df = pd.DataFrame({'Free Text': l})
# remove stop words and punctuation for later similarity calculations
df_spacy = df['Free Text'].apply(nlp)\
    .apply(lambda doc: nlp(' '.join(str(t)
                                    for t in doc
                                    if not t.is_stop
                                    and not t.is_punct)))
Edit: per your comment, here is a similarity calculation between each row and all other rows:
df_spacy.apply(lambda row: df_spacy\
    .apply(lambda doc: row.similarity(doc) if row != doc else None))
Resulting similarity matrix:
          0         1         2
0       NaN  0.776098  0.716560
1  0.776098       NaN  0.705024
2  0.716560  0.705024       NaN

Categorical Nested Lists into Pandas DataFrame

I have three levels of categorical data that I need to convert into a Pandas DataFrame with repeating labels on the upper categories. I have lists for "main", "sub", and "tertiary" as follows:
main_labels = ['Certain infectious and parasitic diseases','Neoplasms']
main_icds = ['A00-B99','C00-D49']
sub_labels = ['Intestinal infectious diseases','Tuberculosis','Malignant neoplasms of lip, oral cavity and pharynx','Malignant neoplasms of digestive organs']
sub_icds = ['A00-A09','A15-A19','C00-C14','C15-C26']
ter_labels = ['Cholera','Typhoid and paratyphoid fevers','Respiratory tuberculosis','Tuberculosis of nervous system','Malignant neoplasm of lip','Malignant neoplasm of base of tongue','Malignant neoplasm of esophagus','Malignant neoplasm of stomach']
ter_icds = ['A00','A01','A15','A17','C00','C01','C15','C16']
For illustration and example purposes, I need them to look like below in a Pandas DataFrame. If I can accomplish this, I can add in the label values.
It seemed like it would be easy but I'm stumped. Any help greatly appreciated. I tried searching historical posts but was having trouble finding the right key words to get anything close to what I'm trying to do. Thanks!

I think the best way is to start with the tertiary category, then find its sub and main classifications. Python compares strings lexicographically, so inequalities work on these alphanumeric codes and this should be pretty robust.
import pandas as pd
main_icds = ['A00-B99','C00-D49']
sub_icds = ['A00-A09','A15-A19','C00-C14','C15-C26']
ter_icds = ['A00','A01','A15','A17','C00','C01','C15','C16']
# split on '-' to get bounds for each category
subs = [sub.split('-') for sub in sub_icds]
mains = [main.split('-') for main in main_icds]
df = pd.DataFrame({'ter_icd': ter_icds})
df['sub_icd'] = [sub_icd for ter in ter_icds
                 for sub_icd, sub in zip(sub_icds, subs)
                 if (ter >= sub[0]) & (ter <= sub[1])]
df['main_icd'] = [main_icd for ter in ter_icds
                  for main_icd, main in zip(main_icds, mains)
                  if (ter >= main[0]) & (ter <= main[1])]
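The question mentions adding the label values afterwards. One way to do that, sketched here with the lists from the question, is to look up each range code in a dict built with zip. The helper find_range is a hypothetical name introduced for this sketch; it does the same bounds comparison as the list comprehensions above, but returns None when a code falls in no range instead of producing a list of the wrong length.

```python
import pandas as pd

main_labels = ['Certain infectious and parasitic diseases', 'Neoplasms']
main_icds = ['A00-B99', 'C00-D49']
sub_labels = ['Intestinal infectious diseases', 'Tuberculosis',
              'Malignant neoplasms of lip, oral cavity and pharynx',
              'Malignant neoplasms of digestive organs']
sub_icds = ['A00-A09', 'A15-A19', 'C00-C14', 'C15-C26']
ter_labels = ['Cholera', 'Typhoid and paratyphoid fevers',
              'Respiratory tuberculosis', 'Tuberculosis of nervous system',
              'Malignant neoplasm of lip', 'Malignant neoplasm of base of tongue',
              'Malignant neoplasm of esophagus', 'Malignant neoplasm of stomach']
ter_icds = ['A00', 'A01', 'A15', 'A17', 'C00', 'C01', 'C15', 'C16']

def find_range(code, ranges):
    """Return the first 'LOW-HIGH' range containing code, or None."""
    for rng in ranges:
        low, high = rng.split('-')
        if low <= code <= high:  # lexicographic string comparison
            return rng
    return None

df = pd.DataFrame({'ter_icd': ter_icds, 'ter_label': ter_labels})
df['sub_icd'] = df['ter_icd'].map(lambda c: find_range(c, sub_icds))
df['main_icd'] = df['ter_icd'].map(lambda c: find_range(c, main_icds))

# Attach labels by looking up each range code in a code -> label dict
df['sub_label'] = df['sub_icd'].map(dict(zip(sub_icds, sub_labels)))
df['main_label'] = df['main_icd'].map(dict(zip(main_icds, main_labels)))
print(df[['main_icd', 'sub_icd', 'ter_icd']])
```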

How to combine two sets of data with differences in merge-index strings?

I want to merge two CSV files with soccer data. They hold different data on partly overlapping sets of games. Normally I would merge with df.merge, but the problem is that the naming differs for some teams between the two datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.
Therefore I would like to normalize the team names in the two datasets so that I can do a simple df.merge on dates and team names. At the moment this would produce extra rows whenever a team has different names in the two sets.
So my main question is: how can I normalize the team names in the two sets easily, without having to analyse all the differences "by hand" and hardcode "replace" operations on one of the sets?
Dataset1 is downloadable here: https://data.fivethirtyeight.com/#soccer-spi
Dataset2 is not available freely, but it looks like this:
hometeam           awayteam   date                 homeproba  drawproba  awayproba  homexg  awayxg
Manchester United  Leicester  2018-08-10 22:00:00  0.2812     0.3275     0.3913     1.5137  1.73813
--Edit after first comments--
So the main question is: how could I automatically analyse the naming differences between the two datasets? Helpful facts:
As the sets hold whole seasons, the overlap per team name is 30+ games.
Most of the teams have the same names; name differences are the minority.
Most name differences have at least a common substring.
Both datasets have date-information of the games.
We know, a team plays only one game a day.
So if Dataset1 says:
1.1.2018 Real - Atletic Club
And Dataset2 says:
1.1.2018 Real - Atletic Bilbao
We should be able to analyse that: {'Atletic Club':'Atletic Bilbao'}

This is how I finally solved it:
import pandas as pd
df_teamnames = pd.merge(dataset1,dataset2,on=['hometeam','date'])
df_teamnames = df_teamnames[['awayteam_x','awayteam_y']]
df_teamnames = df_teamnames.drop_duplicates()
This gives you a DataFrame that pairs each away team's name in the first dataset with its name in the second, like this:
1 Marseille Marseille
2 Atletic Club Atletic Bilbao
...
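To go from that table of name pairs to a usable mapping, the rows where the two names differ can be turned into a dict and applied with replace. The sketch below uses two tiny hand-made datasets; the "Lyon" row and the dates are made up for illustration, and the name mapping is not from the original answer.

```python
import pandas as pd

# Tiny stand-ins for the two real datasets
dataset1 = pd.DataFrame({
    'hometeam': ['Real', 'Lyon'],
    'awayteam': ['Atletic Club', 'Marseille'],
    'date': ['2018-01-01', '2018-01-02'],
})
dataset2 = pd.DataFrame({
    'hometeam': ['Real', 'Lyon'],
    'awayteam': ['Atletic Bilbao', 'Marseille'],
    'date': ['2018-01-01', '2018-01-02'],
})

# Same steps as above: pair up away-team names for the same home team and date
df_teamnames = pd.merge(dataset1, dataset2, on=['hometeam', 'date'])
df_teamnames = df_teamnames[['awayteam_x', 'awayteam_y']].drop_duplicates()

# Keep only the pairs that actually differ and build a translation dict
differing = df_teamnames[df_teamnames['awayteam_x'] != df_teamnames['awayteam_y']]
mapping = dict(zip(differing['awayteam_x'], differing['awayteam_y']))

# Rename teams in dataset1 so both sets use dataset2's spelling
for col in ['hometeam', 'awayteam']:
    dataset1[col] = dataset1[col].replace(mapping)
print(mapping)
```

This only finds a team's alternative name if it appears as an away side against a home team whose name is already identical in both sets.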

Assuming your dates are compatible (and correct), this should probably work to generate a translation dictionary. This type of thing is always super fragile I think though, and you shouldn't really rely on it.
import pandas as pd
names_1 = dataset1['hometeam'].unique().tolist()
names_2 = dataset2['hometeam'].unique().tolist()
mapping_dict = dict()
# names that are identical in both sets map to themselves
for common_name in set(names_1).intersection(set(names_2)):
    mapping_dict[common_name] = common_name
unknown_1 = set(names_1).difference(set(names_2))
unknown_2 = set(names_2).difference(set(names_1))
trim_df1 = dataset1.loc[:, ['hometeam', 'awayteam', 'date']]
trim_df2 = dataset2.loc[:, ['hometeam', 'awayteam', 'date']]
# merge (not join) is used here: join(on=...) matches against the other frame's index, not its columns
aligned_data = trim_df1.merge(trim_df2, on = ['hometeam', 'date'], how = 'inner', suffixes = ('_1', '_2'))
for unknown_name in unknown_1:
    matching_name = aligned_data.loc[aligned_data['awayteam_1'] == unknown_name, 'awayteam_2'].unique()
    if len(matching_name) != 1:
        raise ValueError("Couldn't find a unique match")
    mapping_dict[unknown_name] = matching_name[0]
    unknown_2.remove(matching_name[0])
if len(unknown_2) != 0:
    raise ValueError("We have extra team names for some reason")
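Since the question notes that most differing names share a common substring, a fuzzy string match could serve as a cross-check on the date-based mapping. The sketch below uses difflib.get_close_matches from the standard library; the candidate list and the 0.6 cutoff are assumptions chosen for illustration, and any suggestion it produces should be reviewed by hand.

```python
import difflib

# Names seen only in dataset 1, and candidate names from dataset 2 (illustrative)
unmatched_1 = ['Atletic Club']
candidates_2 = ['Atletic Bilbao', 'Marseille', 'Real']

# For each unmatched name, keep the single closest candidate above the cutoff
suggestions = {
    name: difflib.get_close_matches(name, candidates_2, n=1, cutoff=0.6)
    for name in unmatched_1
}
print(suggestions)
```

An empty list for a name means no candidate was similar enough, which is a hint that the name needs a manual decision.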

Cannot convert object type to string; and then filter on that string; python pandas dataframe

I am trying to pull all stock tickers from NYSE, and then filter for only those with a MarketCap above 5B.
I am running into a problem because, given how my data loads, all columns have the data type "Object" and I cannot find any way to convert them to anything else. See my code and comments below:
import pandas as pd
import numpy as np
# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"
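# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the closest equivalent
df = pd.read_csv(url_nyse, index_col=0)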
df = df.drop(df.columns[[0, 1, 3, 6,7]], axis=1)
This is my initial data load of NYSE stocks, and then I filter for just MarketCap, Sector, and Industry.
At first I was hoping to filter on MarketCap directly: remove anything with "M" in it, then strip the first and last characters to get a number that could be filtered to keep anything above 5. However, I think because the data type is "Object" and not string, I have not been able to do it directly. So I then created new columns that contain only letters or only numbers, hoping that I could convert them to string and float from there.
df['MarketCap_Num'] = df.MarketCap.str[1:-1]
df['Billion_Filter'] = df.MarketCap.str[-1:]
So the MarketCap_Num column has only the numbers (the first and last characters removed), and Billion_Filter is only the last character, where I will remove any value equal to M.
However, even though these columns are just numbers or just strings, I cannot find any way to change them from the object datatype, so my filtering is not working at all. Any help is much appreciated.
I have tried .astype(float), pd.to_numeric, and type functions, with no success.
My filtering code would then be:
df[df.Billion_Filter.str.contains("B")]
But when I run that, nothing happens: no error, but no filtering either. When I run this code on a different table it works, so it must be the object data type that is holding it up.

Convert the MarketCap column into floats by first removing the dollar signs and then substituting B with e9 and M with e6. This should make it easy to use .astype(float) on the column to do the conversion.
import pandas as pd
import numpy as np
# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"
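# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the closest equivalent
df = pd.read_csv(url_nyse, index_col=0)
df = df.drop(df.columns[[0, 1, 3, 6,7]], axis=1)
# raw string so that \$ reaches the regex engine as an escaped dollar sign
df = df.replace({'MarketCap': {r'\$': '', 'B': 'e9', 'M': 'e6', 'n/a': np.nan}}, regex=True)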
df.MarketCap = df.MarketCap.astype(float)
print(df[df.MarketCap > 5000000000].head(10))
Yields:
        MarketCap     Sector             industry
Symbol
MMM     1.419900e+11  Health Care        Medical/Dental Instruments
WUBA    1.039000e+10  Technology         Computer Software: Programming, Data Processing
ABB     5.676000e+10  Consumer Durables  Electrical Products
ABT     9.887000e+10  Health Care        Major Pharmaceuticals
ABBV    1.563200e+11  Health Care        Major Pharmaceuticals
ACN     9.388000e+10  Miscellaneous      Business Services
AYI     7.240000e+09  Consumer Durables  Building Products
ADNT    7.490000e+09  Capital Goods      Auto Parts:O.E.M.
AAP     7.370000e+09  Consumer Services  Other Specialty Stores
ASX     1.083000e+10  Technology         Semiconductors
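Because the download URL may no longer respond, here is a small offline sketch of the same idea on hand-made data. It swaps the nested replace dict for chained str.replace calls plus pd.to_numeric with errors='coerce', which turns 'n/a' into NaN. The symbols XXX and YYY and their values are invented for illustration; the MMM and AYI values match the output above.

```python
import pandas as pd

# Hand-made sample in the same string format as the MarketCap column
df = pd.DataFrame(
    {'MarketCap': ['$141.99B', '$850.5M', 'n/a', '$7.24B']},
    index=pd.Index(['MMM', 'XXX', 'YYY', 'AYI'], name='Symbol'),
)

# Strip the dollar sign and turn the B/M suffix into a power-of-ten exponent
cleaned = (df['MarketCap']
           .str.replace('$', '', regex=False)
           .str.replace('B', 'e9', regex=False)
           .str.replace('M', 'e6', regex=False))

# Anything that is not a valid number (here 'n/a') becomes NaN
df['MarketCap'] = pd.to_numeric(cleaned, errors='coerce')

# Keep only companies with a market cap above 5 billion
big = df[df['MarketCap'] > 5e9]
print(big)
```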

You should be able to change the type of the MarketCap_Num column by using:
df['MarketCap_Num'] = df.MarketCap.str[1:-1].astype(np.float64)
You can then check the data types by df.dtypes.
As for the filter, you can simply write
df_filtered = df[df['Billion_Filter'] =="B"].copy()
since you will only have one letter in your Billion_Filter column.

The object datatype works as a string. You should be able to use str.contains and extract the number without having to convert the object type to string:
df = df[df['MarketCap'].str.contains('B')].copy()
df['MarketCap'] = df['MarketCap'].str.extract(r'(\d+\.?\d*)', expand = False)
        MarketCap  Sector            industry
Symbol
DDD     1.12       Technology        Computer Software: Prepackaged Software
MMM     141.99     Health Care       Medical/Dental Instruments
WUBA    10.39      Technology        Computer Software: Programming, Data Processing
EGHT    1.32       Public Utilities  Telecommunications Equipment
AIR     1.48       Capital Goods     Aerospace
