Concatenate a set of column values based on another column in Pandas - python

Given a Pandas dataframe which has a few labeled series in it, say Name and Villain.
Say the dataframe has values such as:
Name: {'Batman', 'Batman', 'Spiderman', 'Spiderman', 'Spiderman', 'Spiderman'}
Villain: {'Joker', 'Bane', 'Green Goblin', 'Electro', 'Venom', 'Dr Octopus'}
In total the above dataframe has 2 series (or columns), each with six datapoints.
Now, based on the Name, I want to add 3 more columns (FirstName, LastName, LoveInterest) to each datapoint.
The result adds 'Bruce; Wayne; Catwoman' to every row whose Name is Batman, and 'Peter; Parker; MaryJane' to every row whose Name is Spiderman.
The final result should be a dataframe containing 5 columns (series) and 6 rows.

This is a classic inner-join scenario. In pandas, use the module-level merge function:
In [13]: df1
Out[13]:
        Name       Villain
0     Batman         Joker
1     Batman          Bane
2  Spiderman  Green Goblin
3  Spiderman       Electro
4  Spiderman         Venom
5  Spiderman   Dr. Octopus
In [14]: df2
Out[14]:
  FirstName LastName LoveInterest       Name
0     Bruce    Wayne     Catwoman     Batman
1     Peter   Parker     MaryJane  Spiderman
In [15]: pd.merge(df1, df2, on='Name')
Out[15]:
        Name       Villain FirstName LastName LoveInterest
0     Batman         Joker     Bruce    Wayne     Catwoman
1     Batman          Bane     Bruce    Wayne     Catwoman
2  Spiderman  Green Goblin     Peter   Parker     MaryJane
3  Spiderman       Electro     Peter   Parker     MaryJane
4  Spiderman         Venom     Peter   Parker     MaryJane
5  Spiderman   Dr. Octopus     Peter   Parker     MaryJane
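For reference, here is a minimal runnable version of the session above; the two frames are reconstructed from the printed output:
import pandas as pd

# The rows to enrich: one row per hero/villain matchup.
df1 = pd.DataFrame({
    'Name': ['Batman', 'Batman', 'Spiderman', 'Spiderman', 'Spiderman', 'Spiderman'],
    'Villain': ['Joker', 'Bane', 'Green Goblin', 'Electro', 'Venom', 'Dr. Octopus'],
})

# The lookup table: one row per hero with the columns to attach.
df2 = pd.DataFrame({
    'FirstName': ['Bruce', 'Peter'],
    'LastName': ['Wayne', 'Parker'],
    'LoveInterest': ['Catwoman', 'MaryJane'],
    'Name': ['Batman', 'Spiderman'],
})

# Inner join on the shared key: every df1 row gains the matching
# FirstName/LastName/LoveInterest values.
print(pd.merge(df1, df2, on='Name'))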

Related

Split pandas dataframe column of type string into multiple columns based on number of ',' characters

Let's say I have a pandas dataframe that looks like this:
import pandas as pd
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar']}
df = pd.DataFrame(data)
df
                  name
0  Tom, Jeffrey, Henry
1          Nick, James
2                Chris
3         David, Oscar
I know I can split the names into separate columns using the comma as separator, like so:
df[["name1", "name2", "name3"]] = df["name"].str.split(", ", expand=True)
df
                  name  name1    name2  name3
0  Tom, Jeffrey, Henry    Tom  Jeffrey  Henry
1          Nick, James   Nick    James   None
2                Chris  Chris     None   None
3         David, Oscar  David    Oscar   None
However, if the name column has a row that contains 4 names, like below, the code above will raise ValueError: Columns must be same length as key
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar', 'Jim, Jones, William, Oliver']}
# Create DataFrame
df = pd.DataFrame(data)
df
                          name
0          Tom, Jeffrey, Henry
1                  Nick, James
2                        Chris
3                 David, Oscar
4  Jim, Jones, William, Oliver
How can I automatically split the name column into n separate columns based on the ',' separator? The desired output would be this:
                          name  name1    name2    name3   name4
0          Tom, Jeffrey, Henry    Tom  Jeffrey    Henry    None
1                  Nick, James   Nick    James     None    None
2                        Chris  Chris     None     None    None
3                 David, Oscar  David    Oscar     None    None
4  Jim, Jones, William, Oliver    Jim    Jones  William  Oliver
Use DataFrame.join to attach the new DataFrame of split values, with rename to generate the new column names:
f = lambda x: f'name{x+1}'
df = df.join(df["name"].str.split(", ", expand=True).rename(columns=f))
print(df)
                          name  name1    name2    name3   name4
0          Tom, Jeffrey, Henry    Tom  Jeffrey    Henry    None
1                  Nick, James   Nick    James     None    None
2                        Chris  Chris     None     None    None
3                 David, Oscar  David    Oscar     None    None
4  Jim, Jones, William, Oliver    Jim    Jones  William  Oliver
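If zero-based suffixes (name0, name1, ...) are acceptable, DataFrame.add_prefix is a slightly shorter alternative to the rename; a sketch under that assumption:
# Same join, but letting add_prefix generate the column names
# (produces name0, name1, ... rather than name1, name2, ...).
df = df.join(df["name"].str.split(", ", expand=True).add_prefix("name"))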

Create the same ids for the same names in different dataframes in pandas

I have a dataset with unique names. Another dataset contains several rows with the same names as in the first dataset.
I want to create a column with unique ids in the first dataset and another column in the second dataset with the same ids corresponding to all the same names in the first dataset.
For example:
Dataframe 1:
player_id  Name
1          John Dosh
2          Michael Deesh
3          Julia Roberts
Dataframe 2:
player_id  Name
1          John Dosh
1          John Dosh
2          Michael Deesh
2          Michael Deesh
2          Michael Deesh
3          Julia Roberts
3          Julia Roberts
I want to use both data frames to run deep feature synthesis using featuretools.
To be able to do something like this:
entity_set = ft.EntitySet("basketball_players")
entity_set.add_dataframe(dataframe_name="players_set",
                         dataframe=players_set,
                         index='name')
entity_set.add_dataframe(dataframe_name="season_stats",
                         dataframe=season_stats,
                         index='season_stats_id')
entity_set.add_relationship("players_set", "player_id", "season_stats", "player_id")
This should do what your question asks:
import pandas as pd

df1 = pd.DataFrame([
    'John Dosh',
    'Michael Deesh',
    'Julia Roberts'], columns=['Name'])
df2 = pd.DataFrame([
    ['John Dosh'],
    ['John Dosh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Julia Roberts'],
    ['Julia Roberts']], columns=['Name'])

print('inputs:', '\n')
print(df1)
print(df2)

# Give df1 a 1-based id column, then attach the matching id to every
# row of df2 by joining on Name.
df1 = df1.reset_index().rename(columns={'index': 'id'}).assign(id=df1.index + 1)
df2 = df2.join(df1.set_index('Name'), on='Name')[['id'] + list(df2.columns)]

print('\noutputs:', '\n')
print(df1)
print(df2)
Input/output:
inputs:

            Name
0      John Dosh
1  Michael Deesh
2  Julia Roberts
            Name
0      John Dosh
1      John Dosh
2  Michael Deesh
3  Michael Deesh
4  Michael Deesh
5  Julia Roberts
6  Julia Roberts

outputs:

   id           Name
0   1      John Dosh
1   2  Michael Deesh
2   3  Julia Roberts
   id           Name
0   1      John Dosh
1   1      John Dosh
2   2  Michael Deesh
3   2  Michael Deesh
4   2  Michael Deesh
5   3  Julia Roberts
6   3  Julia Roberts
UPDATE:
An alternative solution which should give the same result is:
df1 = df1.assign(id=list(range(1, len(df1) + 1)))[['id'] + list(df1.columns)]
df2 = df2.merge(df1)[['id'] + list(df2.columns)]
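Equivalently (a sketch, not from the original answer), you can build a Name-to-id lookup from df1 and map it onto df2, which avoids the column reordering:
# Assign 1-based ids in df1, then map each Name in df2 to its id.
df1['id'] = range(1, len(df1) + 1)
df2['id'] = df2['Name'].map(df1.set_index('Name')['id'])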

Python - Pandas - finding matches between two data frames

Suppose I have 2 pandas data frames, both sharing the same column names, like this:
name:         dob:       role:
James Franco  1-1-1980   Actor
Cameron Diaz  4-2-1976   Actor
Jim Carey     12-1-1968  Actor
Miley Cyrus   5-23-1987  Actor
name:        dob:       role:
50 cent      4-6-1984   Singer
lil baby     12-1-1990  Singer
ghostmane    8-10-1989  Singer
Miley Cyrus  5-23-1987  Singer
And say I wanted to identify individuals who share the same name and dob, and exist in both dataframes (and thus, have 2 different roles).
How can I do this?
Similar to if everything existed in 1 dataframe and I did df.groupby(["name", "dob"]).count().
I would like to be able to identify these individuals, print them, and count the number of occurrences.
Thank you
df2 = pd.concat([df, df1])  # append the two dfs (df.append is deprecated)
dfnew = df2[df2.duplicated(subset=['name:', 'dob:'], keep=False)]  # keep all rows duplicated on the columns you wish to check
Well, this will give you just the matches:
df1.merge(df2, on=["name:", "dob:"])
output:
         name:       dob: role:_x role:_y
0  Miley Cyrus  5-23-1987   Actor  Singer
You can use an outer join to get all the results and filter them as you see fit:
df1.merge(df2, how="outer", on=["name:", "dob:"])
Output:
          name:       dob: role:_x role:_y
0  James Franco   1-1-1980   Actor     NaN
1  Cameron Diaz   4-2-1976   Actor     NaN
2     Jim Carey  12-1-1968   Actor     NaN
3   Miley Cyrus  5-23-1987   Actor  Singer
4       50 cent   4-6-1984     NaN  Singer
5      lil baby  12-1-1990     NaN  Singer
6     ghostmane  8-10-1989     NaN  Singer
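To also print the matches and count occurrences, as the question asks, one possible sketch using concat plus groupby (assuming the trailing-colon column names from the sample data):
import pandas as pd

# Stack both frames, then count rows per (name:, dob:) pair.
combined = pd.concat([df1, df2])
counts = combined.groupby(["name:", "dob:"]).size().reset_index(name="count")

# Anyone appearing more than once exists in both frames, since each
# frame holds one row per person here.
print(counts[counts["count"] > 1])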

Analyzing Token Data from a Pandas Dataframe

I'm a relative python noob and also new to natural language processing (NLP).
I have a dataframe containing names and sales. I want to: 1) break out all the tokens and 2) aggregate sales by each token.
Here's an example of the dataframe:
name        sales
Mike Smith  5
Mike Jones  3
Mary Jane   4
Here's the desired output:
token  sales
mike   8
mary   4
Smith  5
Jones  3
Jane   4
Thoughts on what to do? I'm using Python.
Assumption: you have a function tokenize that takes a string as input and returns a list of tokens.
I'll use this function as a tokenizer for now:
def tokenize(word):
    return word.casefold().split()
Solution
(df.assign(tokens=df['name'].apply(tokenize))
   .explode('tokens')
   .groupby('tokens')['sales'].sum()
   .reset_index())
In [45]: df
Out[45]:
             name  sales
0      Mike Smith      5
1      Mike Jones      3
2       Mary Jane      4
3  Mary Anne Jane      1
In [46]: df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
Out[46]:
  tokens  sales
0   anne      1
1   jane      5
2  jones      3
3   mary      5
4   mike      8
5  smith      5
Explanation
The assign step creates a column called tokens by applying the tokenize function.
Note: for this particular tokenize function you could use df['name'].str.lower().str.split() instead; however, that won't generalize to custom tokenizers, hence the .apply(tokenize).
This generates a df that looks like
             name  sales              tokens
0      Mike Smith      5       [mike, smith]
1      Mike Jones      3       [mike, jones]
2       Mary Jane      4        [mary, jane]
3  Mary Anne Jane      1  [mary, anne, jane]
Use df.explode on this to get
             name  sales tokens
0      Mike Smith      5   mike
0      Mike Smith      5  smith
1      Mike Jones      3   mike
1      Mike Jones      3  jones
2       Mary Jane      4   mary
2       Mary Jane      4   jane
3  Mary Anne Jane      1   mary
3  Mary Anne Jane      1   anne
3  Mary Anne Jane      1   jane
The last step is just a groupby-agg step.
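For this particular tokenizer, the whole pipeline can also be written with the .str accessor instead of .apply; a sketch, equivalent to the note above:
# casefold + split via the .str accessor, then the same explode/groupby.
out = (df.assign(tokens=df['name'].str.casefold().str.split())
         .explode('tokens')
         .groupby('tokens', as_index=False)['sales'].sum())
print(out)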
You can use the str.split() method and keep item 0 for the first name, using that as the groupby key and taking the sum; then do the same for item -1 (the last name) and concatenate the two.
import pandas as pd

df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
                   'sales': {0: 5, 1: 3, 2: 4}})
# Group on the first token, then on the last token, and stack the results.
df = pd.concat([df.groupby(df.name.str.split().str[0])['sales'].sum(),
                df.groupby(df.name.str.split().str[-1])['sales'].sum()]).reset_index()
df.rename(columns={'name': 'token'}, inplace=True)
df[["fname", "lname"]] = df["name"].str.split(expand=True) # getting tokens,considering separated by space
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns = {'fname': 'tokens'}),
df[['lname', 'sales']].rename(columns = {'lname': 'tokens'})])
pd.DataFrame(tokens_df.groupby('tokens')['sales'].apply(sum), columns=['sales'])

randomly select n rows from each block - pandas DataFrame [duplicate]

This question already has answers here:
Python: Random selection per group
(11 answers)
Closed 4 years ago.
Let's say I have a pandas DataFrame named df that looks like this
father_name  child_name
Robert       Julian
Robert       Emily
Robert       Dan
Carl         Jack
Carl         Rose
John         Lucy
John         Mark
John         Alysha
Paul         Christopher
Paul         Thomas
Robert       Kevin
Carl         Elisabeth
where I know for sure that each father has at least 2 children.
I would like to obtain a DataFrame where each father has exactly 2 of his children, and those two children are selected at random. An example output would be
father_name  child_name
Robert       Emily
Robert       Kevin
Carl         Jack
Carl         Elisabeth
John         Alysha
John         Mark
Paul         Thomas
Paul         Christopher
How can I do that?
You can apply sample on the grouped data; it takes the parameter n, which you can set to 2.
df.groupby('father_name').child_name.apply(lambda x: x.sample(n=2))\
  .reset_index(1, drop=True).reset_index()
  father_name   child_name
0        Carl    Elisabeth
1        Carl         Jack
2        John         Mark
3        John         Lucy
4        Paul       Thomas
5        Paul  Christopher
6      Robert        Emily
7      Robert       Julian
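On pandas 1.1 or later, DataFrameGroupBy.sample does this directly; a shorter equivalent, assuming a recent pandas version:
# Sample 2 children per father, then drop the leftover row index.
df.groupby('father_name').sample(n=2).reset_index(drop=True)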
