I have some code and the printed output looks pretty weird; I want to fix it.
**The prints**
Matching Score
0 john carry 73.684211
Matching Score
0 alex midlane 80.0
Matching Score
0 alex midlane 50.0
Matching Score
0 robert patt 53.333333
Matching Score
0 robert patt 70.588235
Matching Score
0 david baker 100.0
**I need this format**
| Matching | Score |
| ------------ | -----------|
| john carry | 73.684211 |
| alex midlane | 80.0 |
| alex midlane | 50.0 |
| robert patt | 53.333333 |
| robert patt | 70.588235 |
| david baker | 100.0 |
**My code**
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df = pd.DataFrame({
    "NameTest": ["john carry", "alex midlane", "robert patt", "david baker", np.nan, np.nan, np.nan],
    "Name": ["john carrt", "john crat", "alex mid", "alex", "patt", "robert", "david baker"]
})
NameTests = [name for name in df["NameTest"] if isinstance(name, str)]

for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        data = {'Matching': [match[0]],
                'Score': [match[1]]}
        df1 = pd.DataFrame(data)
        print(df1)
I have tried many ways but got the same prints.
Thank you for any suggestions.
You need an array or a list to keep all the data (I use a list here), because you are creating a new dataframe in each loop iteration:
data = []
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        print(match[0])
        data.append({'Matching': match[0],
                     'Score': match[1]})
df1 = pd.DataFrame(data)
print(df1)
Here is the output: a single dataframe containing all the matches.
You create a new dataframe in each loop iteration. You can store the results in a dict of lists and create the dataframe from that dict after the loop:
data = {'Matching': [], 'Score': []}
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        data['Matching'].append(match[0])
        data['Score'].append(match[1])
df1 = pd.DataFrame(data)
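For reference, the same fix can also be written with list comprehensions (a sketch of my own, not from the original answer; it relies on extractOne returning None when no choice clears score_cutoff, so those entries are filtered out):

matches = [
    process.extractOne(name, NameTests,
                       scorer=fuzz.ratio,
                       processor=None,
                       score_cutoff=10)
    for name in df["Name"].dropna()
]
# extractOne returns (choice, score, index), or None below the cutoff
df1 = pd.DataFrame([(m[0], m[1]) for m in matches if m is not None],
                   columns=['Matching', 'Score'])
print(df1)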
Related
I'm trying to rationalize a quite scrambled phonebook XLS of several thousand records. Some fields are merged with others and/or saved into the wrong column, while other fields are split across two or more columns, and so on. I'm trying to find the pattern of the main errors and solve them through regex, placing the right record into the right column.
An example:
DataFrame as df:
| id | Name              | SecondName | Surname    | Title | Company      |
| -- | ----------------- | ---------- | ---------- | ----- | ------------ |
| 01 | Marc              |            | Gigio      |       | ETC ltd      |
| 02 | Piero (Four       | Season     | Restaurant |       | )            |
| 03 | bubbu(Caterpilar) |            |            |       |              |
| 04 |                   |            |            |       | gaby(ts Inc) |
| 05 | Pit(REV inc)      |            |            |       | REV Inc      |
| 06 | Pluto             |            |            |       |              |
In record 01: there would be nothing to do, but see how to manage the conditional exception as in point 5.
In record 02: merge Name + SecondName + Surname, then extract from the new string the name (Piero) to place in the Name column, while extracting from the same string the content of the round brackets to place in the Company column:
df['Nameall_tmp'] = df['Name'] + ' ' + df['SecondName'] + ' ' + df['Surname'] + ' ' + df['Company']
df['Name_tmp'] = df['Nameall_tmp'].str.extract(r'(.+)\(.+')
df['Company_tmp'] = df['Nameall_tmp'].str.extract(r'.*\((.+)\)')
Records 03 and 04: almost the same as 02.
In record 06:
df['Nameall_tmp'] = df['Name'] + ' ' + df['SecondName'] + ' ' + df['Surname'] + ' ' + df['Company']
df['Name_tmp'] = df['Nameall_tmp'].str.extract(r'(.+)\(.+')
df['Name_tmp'] = np.where(df['Name_tmp'] == 'nan', df['Name'], df['Name_tmp'])
In this case the np.where statement doesn't work like if/then/else. I want to check whether df['Name_tmp'] is "nan" and, in that case, fill it with the original df['Name'] to eliminate "nan" from the record; otherwise keep df['Name_tmp']. Any suggestion?
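A side note on the np.where part (my reading of the snippet above, not from this thread): if Name_tmp holds real missing values rather than the literal string 'nan', the comparison df['Name_tmp'] == 'nan' never matches. The usual sketch is to test for missing values directly:

# Compare against actual missing values instead of the string 'nan'
df['Name_tmp'] = np.where(df['Name_tmp'].isna(), df['Name'], df['Name_tmp'])
# Equivalent and more idiomatic:
# df['Name_tmp'] = df['Name_tmp'].fillna(df['Name'])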
Rough thinking here:
1. Munge the "Company" column so that if it contains a legitimate company name, wrap it in (); if not, keep the original content.
2. Concatenate all columns into one conglomerate column.
3. Use one regex with sr.str.extract(rex) to split that single conglomerate column back into the desired columns.
Anyway, following this rough thinking, I have at least reduced the problem to fine-tuning a single regex:
import re
import numpy as np
import pandas as pd

df = pd.DataFrame(
columns=" index Name SecondName Surname Company ".split(),
data= [
[ 0, "Marc", np.nan, "Gigio", "ETC ltd", ],
[ 1, "Piero", "(four", "season", "restaurant)", ],
[ 2, "bubbu(caterpilar)", np.nan, np.nan, np.nan, ],
[ 3, np.nan, np.nan, np.nan, "gaby(ts inc)", ],
[ 4, "Pit(REV inc)", np.nan, np.nan, "REV inc", ],
[ 5, "pluto", np.nan, np.nan, np.nan, ],]).set_index("index", drop=True)
df = df.fillna('')
df['Company'] = df['Company'].apply(lambda x: f'({x})' if ('(' not in x and ')' not in x and x!="") else x)
# df['sum'] = df.sum(axis=1)
df['sum'] = df['Name'] + ' ' + df['SecondName'] + ' ' + df['Surname'] + ' ' + df['Company']
df['sum'] = df['sum'].str.replace(r'\s+', ' ', regex=True) # get rid of extra \s due to above concat
rex = re.compile( # very fragile and hardcoded,
r"""
(?P<name0>[a-z]{2,})
\s?
(?P<surename0>[a-z]{2,})?
\s?
\(?
(?P<company0>[a-z\s]{3,})?
\)?
\s?
""",
re.X+re.I
)
df['sum'].str.extract(rex)
output:
+---------+---------+-------------+------------------------+
| index | name0 | surename0 | company0 |
|---------+---------+-------------+------------------------|
| 0 | Marc | Gigio | ETC ltd |
| 1 | Piero | nan | four season restaurant |
| 2 | bubbu | nan | caterpilar |
| 3 | gaby | nan | ts inc |
| 4 | Pit | nan | REV inc |
| 5 | pluto | nan | nan |
+---------+---------+-------------+------------------------+
EDIT:
My earlier answer contained an error in the regex (I forgot to put a ? after the \(), so it couldn't quite handle "pluto"; corrected now.
The moral of the story is that the regex you need to design will be very specialized, fragile, and hardcoded. It is almost worth considering a df['sum'].apply(myfoo) approach just to parse df['sum'] more thoroughly.
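For illustration, a rough sketch of what that apply-based parser could look like (myfoo is hypothetical, and the parsing rules are my guess at the intent, not code from this thread):

import re
import pandas as pd

def myfoo(s):
    # Pull out a parenthesised company part, if any
    m = re.search(r'\((?P<company>[^)]*)\)', s)
    company = m.group('company') if m else ''
    # Whatever remains outside the brackets is name material
    words = re.sub(r'\([^)]*\)', ' ', s).split()
    name = words[0] if words else ''
    surname = words[1] if len(words) > 1 else ''
    return pd.Series({'name0': name, 'surename0': surname, 'company0': company})

# parsed = df['sum'].apply(myfoo)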
Good morning, everyone!
I hope the title fits my question in some way. I have a dataframe that looks something like this:
Dataframe (before):
id | name | position
1 | jane doe | position_1
2 | john doe | position_2
3 | john smith | position_3
I also have multiple lists of groups:
group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
Now I'm wondering: what is the best way to add another column to my dataframe with the assigned group? For example:
Dataframe (after):
id | name | position | group
1 | jane doe | position_1 | group_7
2 | john doe | position_2 | group_2
3 | john smith | position_3 | group_1
Notes:
every position is unique and will never appear in more than one group!
You can create a mapping dictionary in which the key is the position and the value is the group name, then map this dictionary onto the position column:
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}
mapping_dct = {pos:grp for grp, positions in dct.items() for pos in positions}
df['group'] = df['position'].map(mapping_dct)
>>> df
id name position group
0 1 jane doe position_1 group_7
1 2 john doe position_2 group_2
2 3 john smith position_3 group_1
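One caveat (my addition, not in the original answer): map returns NaN for any position that appears in no group, so a default label can be supplied with fillna:

df['group'] = df['position'].map(mapping_dct).fillna('unassigned')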
import pandas as pd

id = [1, 2, 3]
name = ['jane doe', 'john doe', 'john smith']
position = ['position_1', 'position_2', 'position_3']
df = pd.DataFrame({'id': id, 'name': name, 'position': position})

group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}

def lookup(itemParam):
    keys = []
    for key, item in dct.items():
        if itemParam in item:
            keys.append(key)
    return keys

mylist = [*map(lookup, df['position'])]
mylist = [x[0] for x in mylist]
df['group'] = mylist
print(df.head())
output:
id name position group
0 1 jane doe position_1 group_7
1 2 john doe position_2 group_2
2 3 john smith position_3 group_1
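A note on efficiency (my addition): this lookup rescans every group list for each row, while the mapping-dictionary approach in the previous answer makes each lookup constant-time, so the latter scales better on large frames.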
I have two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How do I get a dataset in the following format, so I can build each user's taste profile and compare different users by a similarity score?
+-------+------+---------+---------+--------+-----+
|user_id|Action|Adventure|Animation|Children|Drama|
+-------+------+---------+---------+--------+-----+
|  100  |   0  |    1    |    1    |    1   |  0  |
|  101  |   1  |    1    |    0    |    1   |  0  |
+-------+------+---------+---------+--------+-----+
Where df is the movies dataframe and dfu is the users dataframe:
1. Split the 'genre' column strings with pandas.Series.str.split, then use pandas.DataFrame.explode to transform each element of the resulting lists into its own row, replicating index values.
2. pandas.merge the two dataframes on 'movie_id'.
3. Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
4. Shape the final frame:
   - .unstack converts the grouped dataframe from long to wide format
   - .fillna replaces NaN with 0
   - .astype changes the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# data
movies = {'movie_id': [1000, 1001, 1002],
          'title': ['Toy Story', 'Jumanji', 'Iron Man'],
          'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
         'movie_id': [1000, 1001, 1002],
         'timestep': [20200728, 20200727, 20200726]}
# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
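A shorter alternative (my sketch, not part of the original answer) uses Series.str.get_dummies, which yields the 0/1 indicators shown in the question rather than counts:

# reuses the movies/users dicts defined above
dfm = pd.DataFrame(users).merge(pd.DataFrame(movies), on='movie_id')
profile = (dfm.set_index('user_id')['genre']
              .str.get_dummies(sep='|')   # split on '|' into indicator columns
              .groupby(level=0).max())    # collapse multiple movies per user
print(profile)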
I have a DataFrame and I'd like only specific parts of the strings to be made uppercase, wrapped in underscores.
| TYPE               | NAME  |
|--------------------|-------|
| Contract Employee  | John  |
| Full Time Employee | Carol |
| Temporary Employee | Kyle  |
I'd like the words "Contract" and "Temporary" made uppercase, with an underscore before and after the word, like this:
| TYPE                 | NAME  |
|----------------------|-------|
| _CONTRACT_ Employee  | John  |
| Full Time Employee   | Carol |
| _TEMPORARY_ Employee | Kyle  |
I tried using str.upper() but that made the entire cell uppercase and I'm looking only for those certain words.
EDIT: I should mention sometimes the words are not capitalized if that matters. Often it will display as temporary employee instead of Temporary Employee.
Here is one option using re.sub:
import re

def type_to_upper(match):
    return match.group(1).upper()

text = "Contract Employee"
output = re.sub(r'\b(Contract|Temporary)\b', type_to_upper, text)
EDIT:
This is the same approach applied within pandas, also addressing the latest edit about the words to be replaced sometimes being lowercase:
test dataframe:
TYPE NAME
0 Contract Employee John
1 Full Time Employee Carol
2 Temporary Employee Kyle
3 contract employee John
4 Full Time employee Carol
5 temporary employee Kyle
solution:
def type_to_upper(match):
    return '_{}_'.format(match.group(1).upper())

df.TYPE = df.TYPE.str.replace(r'\b([Cc]ontract|[Tt]emporary)\b', type_to_upper, regex=True)
result:
df
TYPE NAME
0 _CONTRACT_ Employee John
1 Full Time Employee Carol
2 _TEMPORARY_ Employee Kyle
3 _CONTRACT_ employee John
4 Full Time employee Carol
5 _TEMPORARY_ employee Kyle
Note that this only addresses exactly the two cases defined in the OP's request. For complete case insensitivity it's even simpler:
df.TYPE = df.TYPE.str.replace(r'\b(contract|temporary)\b', type_to_upper, case=False, regex=True)
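Equivalently (my note), the standard re flags can be passed instead of case:

import re
df.TYPE = df.TYPE.str.replace(r'\b(contract|temporary)\b', type_to_upper,
                              flags=re.IGNORECASE, regex=True)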
Something that modifies the data-frame (without regex or anything):
l=['Contract','Temporary']
df['TYPE']=df['TYPE'].apply(lambda x: ' '.join(['_'+i.upper()+'_' if i in l else i for i in x.split()]))
join and split, inside an apply.
And then now:
print(df)
Is:
TYPE NAME
0 _CONTRACT_ Employee John
1 Full Time Employee Carol
2 _TEMPORARY_ Employee Kyle
This is a simple and easy way, using replace with a dictionary.
Please refer to the pandas docs for Series.replace.
df["TYPE"] = df["TYPE"].replace({'Contract': '_CONTRACT_', 'Temporary': '_Temporary_'}, regex=True)
Just reproduced:
>>> df
TYPE Name
0 Contract Employee John
1 Full Time Employee Carol
2 Temporary Employee Kyle
>>> df["TYPE"] = df["TYPE"].replace({'Contract': '_CONTRACT_', 'Temporary': '_TEMPORARY_'}, regex=True)
>>> df
TYPE Name
0 _CONTRACT_ Employee John
1 Full Time Employee Carol
2 _TEMPORARY_ Employee Kyle
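One caveat with this approach (my note, prompted by the OP's edit): a plain word key in replace is case-sensitive, so lowercase "contract"/"temporary" slip through. Regex keys handle both cases:

df["TYPE"] = df["TYPE"].replace({r'\b[Cc]ontract\b': '_CONTRACT_',
                                 r'\b[Tt]emporary\b': '_TEMPORARY_'}, regex=True)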
U9 beat me to it, using lambda and split() on the input:
def match_and_upper(match):
    matches = ["Contract", "Temporary"]
    if match in matches:
        return match.upper()
    return match

input = "Contract Employee"
output = " ".join(map(lambda x: match_and_upper(x), input.split()))
# Result: CONTRACT Employee
Answering part of my own question here. Using the regex @Tim Biegeleisen provided, I did a string replace on the column:
df["TYPE"] = df["TYPE"].str.replace(r'\b(Contract)\b', '_CONTRACT_', regex=True)
Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The UUIDs should be generated by the uuid.uuid4() function.
My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.
While that would work, I'm wondering if there's a more efficient way to do this?
Grouping the dataframe and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your dataframe, you should use the pandas function transform.
Using your sample dataframe, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any argument, it really doesn't matter what the column is.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4 will be called only as many times as there are distinct groups.
How about this:
names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
You could shorten it to:
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
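A dict-based variant of the same idea (my sketch; it makes the one-UUID-per-unique-name step explicit and avoids repeated boolean indexing):

import uuid

# one uuid4 per distinct name, then a vectorised lookup
name_to_uuid = {name: uuid.uuid4() for name in df['Name'].unique()}
df['UUID'] = df['Name'].map(name_to_uuid)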