I would like to create a new column in a dataframe containing pieces of 3 different columns: the first 5 letters of the last name (after removing non-alphabetical characters), or the whole last name if it is shorter than that, then the first 2 letters of the first name, with a code appended to the end.
The code below doesn't work, but it's where I am, and it isn't close to working:
df['namecode'] = df.Last.str.replace('[^a-zA-Z]', '')[:5]+df.First.str.replace('[^a-zA-Z]', '')[:2]+str(jr['code'])
Desired output:
Name  lastname  code  namecode
jeff  White     0989  Whiteje0989
Zach  Bunt      0798  Buntza0798
ken   Black     5764  Blackke5764
Here is one approach.
Use pandas str.slice instead of plain indexing: slicing a Series with [:5] takes the first 5 rows of the Series, not the first 5 characters of each string.
For example:
import pandas as pd

df = pd.DataFrame(
    {
        'First': ['jeff', 'zach', 'ken'],
        'Last': ['White^', 'Bun/t', 'Bl?ack'],
        'code': ['0989', '0798', '5764']
    }
)

# regex=True is needed on newer pandas, where str.replace
# defaults to literal (non-regex) matching
print(df['Last'].str.replace('[^a-zA-Z]', '', regex=True).str.slice(0, 5)
      + df['First'].str.slice(0, 2) + df['code'])
#0 Whiteje0989
#1 Buntza0798
#2 Blackke5764
#dtype: object
I would like to make a long-to-wide transformation of my dataframe, starting from:
match_id  player  goals  home
1         John    1      home
1         Jim     3      home
...
2         John    0      away
2         Jim     2      away
...
ending up with:
match_id  player_1  player_2  player_1_goals  player_2_goals  player_1_home  player_2_home  ...
1         John      Jim       1               3               home           home
2         John      Jim       0               2               away           away
...
Since I'm going to have columns with new names, I thought that maybe I should try to build a dictionary for that, where the outer key is the match id, like so:
# named d rather than dict, to avoid shadowing the built-in
d = {
    1: {
        'player_1': 'John',
        'player_1_goals': 1,
        'player_1_home': 'home',
        'player_2': 'Jim',
        'player_2_goals': 3,
        'player_2_home': 'home'
    },
    2: {
        'player_1': 'John',
        'player_1_goals': 0,
        'player_1_home': 'away',
        'player_2': 'Jim',
        'player_2_goals': 2,
        'player_2_home': 'away'
    },
}
and then:
pd.DataFrame.from_dict(d).T
In the real scenario, however, the number of players will vary, so I can't hardcode it.
Is this the best way of doing this with dictionaries? If so, how could I build this dict and populate it from my original pandas dataframe?
It looks like you want to pivot the dataframe. The problem is there is no column in your dataframe that "enumerates" the players for you. If you assign such a column via the assign() method, then pivot() becomes easy.
It actually looks incredibly similar to this case. The only difference is that you need to format the column names in a specific way, with the string "player" prepended to each one. The set_axis() call below does that.
(df
 .assign(
     # enumerate the players within each match: '1', '2', ...
     ind=df.groupby('match_id').cumcount().add(1).astype(str)
 )
 # recent pandas versions require keyword arguments here
 .pivot(index='match_id', columns='ind', values=['player', 'goals', 'home'])
 .pipe(lambda x: x.set_axis([
     '_'.join([c, i]) if c == 'player' else '_'.join(['player', i, c])
     for (c, i) in x
 ], axis=1))
 .reset_index()
)
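With the sample rows above, this should yield something close to (column order follows the values list):

   match_id player_1 player_2 player_1_goals player_2_goals player_1_home player_2_home
0         1     John      Jim              1              3          home          home
1         2     John      Jim              0              2          away          away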
I need to split a column called Creative, where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
Each two-letter code preceding a parenthesized section is the title of the desired column, and the codes are the same in every row. The only data that changes is what is inside the parentheses. I want the data to look like:
pn    io   ta   pt    cn    cs
2021  302  Yes  Blue  John  Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(', expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')', '')
but got an error: missing ), unterminated subpattern at position 2. I assume it has something to do with regular expressions.
Is there an easy way to split these? Thanks.
Use extract with named capturing groups (see here):
import pandas as pd

# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])

# extract with named capturing groups
res = df["Creative"].str.extract(
    r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
    expand=True)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent a row of the desired dataframe, then construct a dataframe out of it. It can be built in one nested comprehension:

import re

# non-greedy (.+?) so each capture stops at the first closing parenthesis
rows = [{r[0]: r[1] for r in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
    df[col] = subtable[col].values

Basically, I regex-search for instances of ab(*) and capture the two-letter prefix and the contents of the parentheses, storing them in a list of tuples. Then I create a dictionary out of each list of tuples, each of which is essentially a row like the one you show in your question. Then I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
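For the toy row used in the other answers, the intermediate rows list comes out as:

rows == [{'pn': '2021', 'io': '302', 'ta': 'Yes', 'pt': 'Blue', 'cn': 'John', 'cs': 'Doe'}]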
Try with extractall:

# raw strings avoid invalid-escape warnings; the first pattern pulls the
# column names from row 0, the second pulls every parenthesized value
names = df["Creative"].str.extractall(r"(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall(r"\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract the matching column name/value pairs:
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, with a different way of packaging the final DataFrame:

import re
import pandas as pd

txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
# findall returns (code, value) pairs; zip(*...) transposes them into
# a tuple of codes and a tuple of values
data = list(zip(*re.findall(r'([^(]+)\(([^)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])
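which produces a one-row frame, roughly:

     pn   io   ta    pt    cn   cs
0  2021  302  Yes  Blue  John  Doe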
I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great and somewhat uniform; however, it is still just one column. Ultimately, I'd like to append the row where a keyword is found to the end of the previous row, until the data is one long row. Then I'll use str.split() to cut sections up into columns as I need.
In Excel (code below, top) I took this same text file, removed the headers, aligned left, and performed searches for keywords. When one is found, Excel has a nice feature called offset, where you can place or append the cell value basically anywhere using offset(x, y).Value from the active-cell start position. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The Python code below will cycle down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append that row to the end of the previous row. This is where my problem is: I cannot find a way to get the active row number into a variable that I can use in place of the word [index] for the active row, or [index-1] for the previous row.
Excel code for a similar task:
Do
    Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
    If Not Rng Is Nothing Then
        Rng.Offset(-1, 2).Value = Rng.Value
        Rng.Value = ""
    End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd

file = {'Test': ['Last Name: Nobody', 'First Name: Tommy',
                 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:

for line in df.Test:
    if line.startswith('Address:'):
        # this line does not work: index is never defined
        df.loc[[index-1], :].values = df.loc[index-1].values + ' ' + df.loc[index].values
    else:
        pass
# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values # copies row 2 at the end of row 1,
# works with static row numbers only
# df.drop([2,0], inplace=True) # Deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the whole Series-vectorization approach, but I'm still stuck on loops, which I'm semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test, then Series.str.startswith to create a boolean mask, then boolean indexing with this mask to update the values in the Test column:

s = df['Test'].shift(-1)                    # each row's next line
m = s.str.startswith('Address', na=False)   # rows whose next line is an address
df.loc[m, 'Test'] += (' ' + s[m])           # append that next line to the current row
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I have a dataframe with 4 columns, each containing actor names.
The same actors appear in several columns, and I want to find the actor or actress appearing most often in the whole dataframe.
I used mode, but it doesn't work: it gives me the most frequent actor in each column.
I would strongly advise you to use the Counter class from Python's collections module. With it, you can simply feed whole rows and columns into the object. The code would look like this:
import pandas as pd
from collections import Counter

# artificially creating the DataFrame
actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)

# creating the counter
counter = Counter()

# inserting each whole row into the counter
for _, row in df.iterrows():
    counter.update(row)

print("counter object:")
print(counter)

# show the two most common actors
for actor, occurrences in counter.most_common(2):
    print("Actor {} occurred {} times".format(actor, occurrences))
The output would look like this:
counter object:
Counter({'Will Smith': 4, 'Morgan Freeman': 3, 'Johnny Depp': 3, 'Mila Kunis': 3, 'Charlie Sheen': 3})
Actor Will Smith occurred 4 times
Actor Morgan Freeman occurred 3 times
The Counter object solves your problem quite fast, but be aware that counter.update expects an iterable of items, such as a list. Do not update it with a bare string: if you do, the counter counts the individual characters.
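A small demonstration of that pitfall:

from collections import Counter

c = Counter()
c.update("Will Smith")    # a bare string is iterated character by character
print(c)                  # Counter({'i': 2, 'l': 2, 'W': 1, ...})

c = Counter()
c.update(["Will Smith"])  # a list counts the whole name once
print(c)                  # Counter({'Will Smith': 1})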
Use stack and value_counts to get the entire list of actors/actresses:
df.stack().value_counts()
Using @Ofi91's setup:
# artificially creating the DataFrame
actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)
df.stack().value_counts()
Output:
Will Smith 4
Morgan Freeman 3
Johnny Depp 3
Charlie Sheen 3
Mila Kunis 3
dtype: int64
To find the actor with the most appearances:
df.stack().value_counts().idxmax()
Output:
'Will Smith'
Let's consider your data frame to look like the samples above.
First we stack all the columns into one column; use the code below to achieve that (naming the stacked column 'actors' so it can be referenced afterwards):

df1 = df.stack().reset_index(drop=True).rename('actors').to_frame()

Now take the value_counts of the actors column using:

df2 = df1['actors'].value_counts().sort_values(ascending=False)

Here you go: the result has each actor's name and the number of occurrences in the data frame.
Happy Analysis!!!
I have two separate files: one from our service providers and the other internal (HR).
The service providers write the names of our employees in different ways: some use firstname lastname format, some the first letter of the first name plus the last name, others lastname firstname... while the HR file keeps the first and last names in separate columns.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing it with static values using the following code:

df1[['Full Name']][df1['Full Name'].str.contains('B')
                   & df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets... Any advice on that, please?
If you are just checking whether a name exists or not, this could be useful:
because it is rare for two people to have exactly the same family name, I recommend just splitting your Df1 and comparing family names; then, to be sure, you can also compare the first names.
You can easily do it with a for loop:

# 'your index' and the family-name string are placeholders to fill in
for i in range('your index'):
    if df1_splitted[i].str.contains('family you are searching for'):
        print("yes")
If you need to compare other aspects, just let me know.
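A minimal runnable sketch of that idea, using the sample df1 and df2 from the question (a case-insensitive substring search on the last name; note it will miss misspellings such as "Nickolson" vs "Nicklson"):

# for each HR row, flag the DF1 entries whose full name contains the last name
for _, hr in df2.iterrows():
    mask = df1['Full Name'].str.lower().str.contains(hr['Last'].lower(), regex=False)
    print(hr['Last'], '->', df1.loc[mask, 'Full Name'].tolist())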
I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames:

from nameparser import HumanName
import pandas as pd

df1 = pd.DataFrame({'Full Name': ['B.pitt', 'Mr Nickolson Jack', 'Johnny, Deep', 'Streep Meryl']})
df2 = pd.DataFrame({'First': ['Brad', 'Jack', 'Johnny', 'Streep'],
                    'Last': ['Pitt', 'Nicklson', 'Deep', 'Meryl']})

names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(str(row['First']) + " " + str(row['Last'])) for _, row in df2.iterrows()]
After that you can try comparing the HumanName instances, which expose the parsed fields. A parsed name looks like this:
<HumanName : [
    title: ''
    first: 'Brad'
    middle: ''
    last: 'Pitt'
    suffix: ''
    nickname: ''
]>
I have used this approach for processing thousands of names and matching them to the same names in other documents, and the results were good.
More about the module can be found at https://nameparser.readthedocs.io/en/latest/
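As an illustration of the comparison step, here is a minimal sketch of my own (not from the nameparser docs) that treats two parsed names as a match when the last names agree and the first names agree at least on their first letter, which is enough to pair "B.pitt" with "Brad Pitt", assuming the parser splits these samples into first/last as expected:

def names_match(n1, n2):
    # last names must match exactly (case-insensitive)
    if n1.last.lower() != n2.last.lower():
        return False
    f1, f2 = n1.first.lower(), n2.first.lower()
    # compare first names on their first letter only,
    # so an initial like 'B.' still matches 'Brad'
    return f1[:1] == f2[:1] if f1 and f2 else True

pairs = [(str(a), str(b)) for a in names1 for b in names2 if names_match(a, b)]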
Hey, you could use fuzzy string matching with fuzzywuzzy:

from fuzzywuzzy import fuzz

First, create a Full Name column for df2:

df2_ = df2[['First', 'Last']].agg(lambda a: a['First'] + ' ' + a['Last'], axis=1).rename('Full Name').to_frame()
Then merge the two dataframes by index:

merged_df = df2_.merge(df1, left_index=True, right_index=True)

Now you can apply fuzz.token_sort_ratio to get the similarity:

merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
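For instance, to keep only the rows that look like confident matches (the threshold of 90 is just an illustrative choice):

confident = merged_df[merged_df['similarity'] >= 90]
print(confident)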