Python Fuzzywuzzy matching with process and add info from comparing dataframe - python

I am trying to match names from two dataframes (in their NAME columns) using fuzzywuzzy with process. The result should be df1 (dfdum) with the best matching name from df2 (dfpep) and the similarity score. This is going very well with the code below, but besides the matching name and score I want to append more columns from df2 to df1 in the result. The dates of birth and countries of residence from df2 belonging to the matching name should also be added to df1. I cannot simply merge on names because there are duplicates.
Can anyone help me amend the code so that I can add the extra info from the matching names from df2? I thus want to add two extra columns to df1 with the related information from the matching name in df2.
pep_name = []
sim_name = []
for i in dfdum.NAME:
    ratio = process.extract(i, dfpep.NAME, limit=1, scorer=fuzz.token_set_ratio)
    pep_name.append(ratio[0][0])
    sim_name.append(ratio[0][1])
dfdum['pep_name'] = pd.Series(pep_name)
dfdum['sim_name'] = pd.Series(sim_name)

You could find the index of the best match in dfpep.NAME, and use that to retrieve the corresponding values of the other two columns.
This code (with some mock data) should give you the desired result; it assumes that dfpep.NAME has only unique values, though.
Please note that I'm far from a pandas expert so this solution is by no means the fastest or most elegant, but it should do the job :)
Also, I feel like there should be a way to do this without the for loop; maybe someone here has an idea for that (one possible loop-free sketch follows the code below).
import pandas as pd
from fuzzywuzzy import process, fuzz

dfdum = pd.DataFrame(["Johnny", "Peter", "Ben"])
dfdum.columns = ["NAME"]
dfpep = pd.DataFrame(["Pete", "John", "Bennie"])
dfpep.columns = ["NAME"]
dfpep["dob"] = pd.Series(["1990", "1991", "1992"])
dfpep["cor"] = pd.Series(["USA", "UK", "Germany"])

pep_name = []
sim_name = []
dob = []
cor = []
for i in dfdum.NAME:
    ratio = process.extract(i, dfpep.NAME, limit=1, scorer=fuzz.token_set_ratio)
    pep_name.append(ratio[0][0])
    sim_name.append(ratio[0][1])
    # Index of the best match in dfpep, used to fetch the extra columns
    j = dfpep.index[dfpep.NAME == ratio[0][0]].tolist()[0]
    dob.append(dfpep['dob'][j])
    cor.append(dfpep['cor'][j])
dfdum['pep_name'] = pd.Series(pep_name)
dfdum['sim_name'] = pd.Series(sim_name)
dfdum['dob'] = pd.Series(dob)
dfdum['cor'] = pd.Series(cor)
print(dfdum)
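For completeness, here is a rough loop-free sketch. It is an assumption on my part (not part of the answer above, and untested on the real data): it relies on process.extractOne returning a (match, score, index) tuple when the choices argument is a pandas Series, so the extra columns can be looked up by index instead of by name.
# Sketch only: extractOne returns (match, score, index) for Series choices
matches = dfdum['NAME'].apply(
    lambda name: process.extractOne(name, dfpep['NAME'], scorer=fuzz.token_set_ratio)
)
match_df = pd.DataFrame(matches.tolist(),
                        columns=['pep_name', 'sim_name', 'pep_idx'],
                        index=dfdum.index)
dfdum['pep_name'] = match_df['pep_name']
dfdum['sim_name'] = match_df['sim_name']
# Look up dob/cor by the matched dfpep index; to_numpy() avoids index alignment
dfdum['dob'] = dfpep.loc[match_df['pep_idx'], 'dob'].to_numpy()
dfdum['cor'] = dfpep.loc[match_df['pep_idx'], 'cor'].to_numpy()
This still scores every pair (apply is a loop under the hood), but it removes the manual bookkeeping and handles duplicate names in dfpep because it retrieves the extra columns by index rather than by name string.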

Related

Dynamic creation of pandas DataFrames

The end goal is to read multiple .csv files into multiple DataFrames with certain names.
I want to be able to refer to each DataFrame by the name of its city for further analysis and manipulate them separately, so it is important to achieve that and not keep them in a dictionary. But what ends up happening is that the last item in the dict gets assigned to every variable, so I get differently named dfs created, but they all have the same data.
lst0 = ['/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv',
        '/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv',
        '/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv']
names_3 = ['annapolis', 'apalachicola', 'atlantic_city']

d = {}
for fname in lst0:
    d[fname] = pd.read_csv(fname)

for nm in names_3:
    for fname in lst0:
        globals()[nm] = d[fname]
What am I doing wrong?
Thank you!
Your variable naming makes no sense to me. Please name them something relevant to the values they hold.
As to your problem:
paths = [
    "/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv",
]
cities = ["annapolis", "apalachicola", "atlantic_city"]

# Create one dataframe per CSV file
d = {
    city: pd.read_csv(path) for path, city in zip(paths, cities)
}

# Join the frames together, adding the new `city` column
df = (
    pd.concat(d.values(), keys=d.keys(), names=["city", None])
      .reset_index(level=0)
      .reset_index(drop=True)
)
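If you still want to look at one city at a time afterwards, you can either pull the frame back out of the dict or filter the combined frame on the new city column. A small example (city name taken from the question's list):
annapolis = d["annapolis"]                       # the original per-city frame
annapolis_rows = df[df["city"] == "annapolis"]   # the same rows from the combined frame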
Ok. I figured it out.
Combining what Code Different suggested, but skipping the concatenation part,
I finally get the variables (that are dataframes) with the names of the cities created.
paths = [
    "/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv",
]
cities = ["annapolis", "apalachicola", "atlantic_city"]

# Create one dataframe per CSV file
d = {
    city: pd.read_csv(path) for path, city in zip(paths, cities)
}

# Bind each dataframe to a variable named after its city
for k in d.keys():
    exec(f"{k} = d['{k}']")
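As a side note that is not part of the original answer: the same effect can be achieved without exec by pushing the dict straight into the module namespace, although keeping the frames in the dict is usually the cleaner option.
# Equivalent effect without exec (assumes this runs at module/notebook top level)
globals().update(d)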

Concatenate specific columns in pandas

I'm trying to concatenate 4 different datasets in pandas (Python). I can concatenate them, but it results in several columns with the same name. How do I produce only one column per name instead of multiples?
concatenated_dataframes = pd.concat(
    [
        dice.reset_index(drop=True),
        json.reset_index(drop=True),
        flexjobs.reset_index(drop=True),
        indeed.reset_index(drop=True),
        simply.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)

concatenated_dataframes_columns = [
    list(dice.columns),
    list(json.columns),
    list(flexjobs.columns),
    list(indeed.columns),
    list(simply.columns),
]

flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
df = concatenated_dataframes
This results in
UNNAMED: 0 TITLE COMPANY DESCRIPTION LOCATION TITLE JOBLOCATION POSTEDDATE DETAILSPAGEURL COMPANYPAGEURL COMPANYLOGOURL SALARY CLIENTBRANDID COMPANYNAME EMPLOYMENTTYPE SUMMARY SCORE EASYAPPLY EMPLOYERTYPE WORKFROMHOMEAVAILABILITY ISREMOTE UNNAMED: 0 TITLE SALARY JOBTYPE LOCATION DESCRIPTION UNNAMED: 0 TITLE SALARY JOBTYPE DESCRIPTION LOCATION UNNAMED: 0 COMPANY DESCRIPTION LOCATION SALARY TITLE
Again, how do I combine all the 'TITLE' columns into one column, all the 'LOCATION' columns into one column, and so on, instead of having multiples of them?
I think we can get away with making a blank dataframe that just has the columns we will want at the end and then concat() everything onto it.
import numpy as np
import pandas as pd

all_columns = list(dice.columns) + list(json.columns) + list(flexjobs.columns) + list(indeed.columns) + list(simply.columns)
# np.unique gives the list of just the unique column names.
# You can run print(all_unique_columns) to make sure it has what you want.
all_unique_columns = np.unique(np.array(all_columns))
# Start from a blank frame with those columns, then stack the frames
# underneath it so same-named columns line up instead of repeating.
df = pd.DataFrame(columns=all_unique_columns)
df = pd.concat([df, dice, json, flexjobs, indeed, simply], axis=0)
It's a little tricky not having reproducible examples of the dataframes that you have. I tested this on a small mock-up example I put together, but let me know if it works for your more complex example.
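For illustration, here is a tiny self-contained mock (the frame contents and column names are invented, not taken from the question) showing how the axis=0 concatenation lines up same-named columns and fills the gaps with NaN:
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two of the scraped frames
dice = pd.DataFrame({"TITLE": ["Data Engineer"], "SALARY": ["100k"]})
indeed = pd.DataFrame({"TITLE": ["Analyst"], "LOCATION": ["NYC"]})

all_unique_columns = np.unique(np.array(list(dice.columns) + list(indeed.columns)))
df = pd.DataFrame(columns=all_unique_columns)
df = pd.concat([df, dice, indeed], axis=0, ignore_index=True)
print(df)  # two rows; one LOCATION/SALARY/TITLE column each, NaN where a source frame lacked a column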

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: Replace "&FolderCTID", delete all string after
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I did search on Google and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first declare a variable with your target columns,
then use stack() and str.split to get your target output,
and finally unstack and reassign the output to your original df.
cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
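As a quick sanity check, a minimal mock (shortened, made-up URLs rather than the real data) of what that expression produces:
import pandas as pd

string1 = "&FolderCTID"
df = pd.DataFrame({
    "Column_A": ["https://example.com/AllItems.aspx?RootFolder=X&FolderCTID=0x0120&View=Y"],
    "Column_B": ["https://example.com/AllItems.aspx?RootFolder=Z&FolderCTID=0x0121&View=W"],
})
cols_to_slice = ["Column_A", "Column_B"]
out = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
print(out)  # both cells end at "...RootFolder=X" / "...RootFolder=Z"; everything from "&FolderCTID" on is gone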
You could also first get the position of the string with str.find and then slice each value up to that position:
# Position just after string1 (this keeps "&FolderCTID" in the result)
indexes = len(string1) + df_MasterData[i].str.find(string1)
# If you don't want to keep the string in the result, use this instead
indexes = df_MasterData[i].str.find(string1)
Now do
df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]

Performing similar analysis on multiple dataframes

I am reading data from multiple dataframes.
Since the indexing and inputs are different, I need to repeat the pairing and analysis, and I need dataframe-specific outputs. This pushes me to copy-paste and repeat the code.
Is there a fast way to refer to multiple dataframes to do the same analysis?
DF1 = pd.read_csv('DF1 Price.csv')
DF2 = pd.read_csv('DF2 Price.csv')
DF3 = pd.read_csv('DF3 Price.csv')  # These CSV's contain main prices

DF1['ParentPrice'] = FamPrices['Price1']  # These CSV's contain second prices
DF2['ParentPrice'] = FamPrices['Price2']
DF3['ParentPrice'] = FamPrices['Price3']

DF1['Difference'] = DF1['ParentPrice'] - DF1['Price']  # Price difference is the output
DF2['Difference'] = DF2['ParentPrice'] - DF2['Price']
DF3['Difference'] = DF3['ParentPrice'] - DF3['Price']
It is possible to parametrize strings using f-strings, available in Python >= 3.6. In an f-string, it is possible to insert the string representation of the value of a variable inside the string, as in:
>>> a = 3
>>> s = f"{a} is larger than 1"
>>> print(s)
3 is larger than 1
Your code would become:
list_of_DF = []
for symbol in ["1", "2", "3"]:
    df = pd.read_csv(f"DF{symbol} Price.csv")
    df['ParentPrice'] = FamPrices[f'Price{symbol}']
    df['Difference'] = df['ParentPrice'] - df['Price']
    list_of_DF.append(df)
then DF1 would be list_of_DF[0] and so on.
As I mentioned, this answer is only valid if you are using python 3.6 or later.
For the third part I'd suggest creating something like
DFS = [DF1, DF2, DF3]

def create_difference(dataframe):
    dataframe['Difference'] = dataframe['ParentPrice'] - dataframe['Price']

for dataframe in DFS:
    create_difference(dataframe)
For the second part there is no super-convenient, short way I can think of, except maybe
for i in range(len(DFS)):
    DFS[i]['ParentPrice'] = FamPrices[f'Price{i + 1}']

Iterate a piece of code connecting to API using two variables pulled from two lists

I'm trying to run a script (API to Google Search Console) over a table of keywords and dates in order to check if there was improvement in keyword performance (SEO) after the date.
Since I'm really clueless I'm guessing and trying, but the Jupyter notebook isn't responding, so I can't even tell if I'm wrong...
The repo I took this code from was made by Josh Carty:
https://github.com/joshcarty/google-searchconsole
I already pd.read_csv the input table (it consists of two columns, 'keyword' and 'date') and made the columns into two separate lists (or maybe it's better to use a dictionary or something else?):
KW_list and
Date_list
I tried:
for i in KW_list and j in Date_list:
    account = searchconsole.authenticate(client_config='client_secrets.json',
                                         credentials='credentials.json')
    webproperty = account['https://www.example.com/']
    report = webproperty.query.range(j, days=-30).filter('query', i, 'contains').get()
    report2 = webproperty.query.range(j, days=30).filter('query', i, 'contains').get()
    df = pd.DataFrame(report)
    df2 = pd.DataFrame(report2)
df
I expect to see a dataframe of all the different keywords (keyword1-stats1, keyword2-stats2 below it, etc., with no overwriting) for the dates 30 days before the date in the neighbouring cell (in the input file),
or at least some response from the Jupyter notebook so I know what is going on.
Try using the zip function to combine the lists into a list of tuples. This way, the date and the corresponding keyword are combined.
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']

df1 = None
df2 = None
first = True
for (keyword, date) in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    if first:
        df1 = pd.DataFrame(report)
        df2 = pd.DataFrame(report2)
        first = False
    else:
        df1 = df1.append(pd.DataFrame(report))
        df2 = df2.append(pd.DataFrame(report2))
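If you also want to be able to tell which rows came from which keyword in the combined frames (the "keyword1-stats1, keyword2-stats2 below it" layout asked for in the question), one option, sketched here as an assumption rather than part of the original answer, is to tag each report with its keyword inside the loop before appending:
frame = pd.DataFrame(report)
frame['keyword'] = keyword   # keep the rows distinguishable per keyword
df1 = frame if df1 is None else df1.append(frame)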
