I'm fairly new to Python and am working with a pandas/numpy DataFrame built from The Movie Database. One of the columns lists the main cast of each movie, separated by the pipe symbol (|). I'm trying to find a way to split out each individual cast member into its own row together with the movie title. I've attached a snippet below of the results I get.
tmdb_data = pd.read_csv('tmdb-movies.csv')
cast_split = tmdb_data[['original_title', 'cast']]
df = pd.DataFrame(cast_split)
df.head()
Movie Title & Cast
Expected Output:
original_title cast
0 Jursassic World Chris Patt
1 Jursassic World Bryce Dallas Howard
2 Jursassic World Irrfan Khan
Use pop + split + stack + rename + reset_index for new Series and then join to original:
tmdb_data = pd.DataFrame({'movie':['Jursassic World', 'Insurgent'],
                          'cast':['Chris Patt|Bryce Dallas Howard|Irrfan Khan',
                                  'Shailene Woodley|Theo James']},
                         columns=['movie', 'cast'])
print (tmdb_data)
movie cast
0 Jursassic World Chris Patt|Bryce Dallas Howard|Irrfan Khan
1 Insurgent Shailene Woodley|Theo James
df1 = (tmdb_data.join(tmdb_data.pop('cast').str.split('|', expand=True)
                               .stack()
                               .reset_index(level=1, drop=True)
                               .rename('cast'))
                .reset_index(drop=True))
print (df1)
movie cast
0 Jursassic World Chris Patt
1 Jursassic World Bryce Dallas Howard
2 Jursassic World Irrfan Khan
3 Insurgent Shailene Woodley
4 Insurgent Theo James
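As a side note, on pandas 0.25 or newer (an assumption about your version) the same result can be had more directly with Series.str.split plus DataFrame.explode, a sketch on the sample data above:

```python
import pandas as pd

tmdb_data = pd.DataFrame({'movie': ['Jursassic World', 'Insurgent'],
                          'cast': ['Chris Patt|Bryce Dallas Howard|Irrfan Khan',
                                   'Shailene Woodley|Theo James']})

# Split each cast string into a list, then explode one list element per row;
# the movie value is repeated for every member of its cast list.
df1 = (tmdb_data.assign(cast=tmdb_data['cast'].str.split('|'))
                .explode('cast')
                .reset_index(drop=True))
print(df1)
```

This avoids the stack/rename/reset_index round trip entirely.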
First cast as a list (pardon the pun!), then rebuild dataframe via numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame([['Jursassic World', 'Chris Patt|Bryce Dallas Howard']], columns=['movie', 'cast'])
df.cast = df.cast.str.split('|')
df2 = pd.DataFrame({'movie': np.repeat(df.movie.values, df.cast.str.len()),
                    'cast': np.concatenate(df.cast.values)})
# cast movie
# 0 Chris Patt Jursassic World
# 1 Bryce Dallas Howard Jursassic World
I am most interested in how this is done in a good, idiomatic pandas way. In this example data, Tim from Osaka has two fruits.
import pandas as pd
data = {'name': ['Susan', 'Tim', 'Tim', 'Anna'],
'fruit': ['Apple', 'Apple', 'Banana', 'Banana'],
'town': ['Berlin', 'Osaka', 'Osaka', 'Singabpur']}
df = pd.DataFrame(data)
print(df)
Result
name fruit town
0 Susan Apple Berlin
1 Tim Apple Osaka
2 Tim Banana Osaka
3 Anna Banana Singabpur
I inspected the data and see that one of the persons has multiple fruits. I want to create a new "category" for it, named Apple&Banana (or something else). The point is that Tim's other fields are equal in their values.
df.groupby(['name', 'town', 'fruit']).size()
I am not sure if this is the correct way to explore this data set. The underlying logical question is whether some of the person+town combinations have multiple fruits.
As a result I want this
name fruit town
0 Susan Apple Berlin
1 Tim Apple&Banana Osaka
2 Anna Banana Singabpur
Use groupby agg:
new_df = (
    df.groupby(['name', 'town'], as_index=False, sort=False)
      .agg(fruit=('fruit', '&'.join))
)
new_df:
name town fruit
0 Susan Berlin Apple
1 Tim Osaka Apple&Banana
2 Anna Singabpur Banana
>>> df.groupby(["name", "town"], sort=False)["fruit"].apply(lambda f: "&".join(f)).reset_index()
name town fruit
0 Anna Singabpur Banana
1 Susan Berlin Apple
2 Tim Osaka Apple&Banana
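To answer the exploratory question directly — whether some name+town combinations carry more than one fruit — one sketch is to filter the groupby sizes rather than inspect them by eye:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Susan', 'Tim', 'Tim', 'Anna'],
                   'fruit': ['Apple', 'Apple', 'Banana', 'Banana'],
                   'town': ['Berlin', 'Osaka', 'Osaka', 'Singabpur']})

# Count rows per name+town pair and keep only pairs that occur more than once.
sizes = df.groupby(['name', 'town']).size()
multi = sizes[sizes > 1]
print(multi)
```

Here `multi` lists exactly the persons with multiple fruits (Tim in Osaka, with 2).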
I'm trying to do a 'find and replace' on a specific column type1 of a dataframe data, using terms stored in a dictionary. I first uppercase all existing values in the column. I create the dictionary mdata and make sure its keys and values are all uppercase as well. Then, with a for loop, I iterate through the items in mdata, replacing accordingly. This code used to work before I turned it into a function.
Any ideas where I've gone wrong?
def to_fish(data, fish):
    data['type1'] = data['type1'].str.upper()
    if fish == 'monument':
        mdata = {
            'natural': 'NATURAL FEATURe',
            'DITCH TERMINUS': 'DITCH',
            'DITCH RECUT': 'DITCH',
            'NATURAL_lyr': 'NATURAL FEATURE'
        }
        mdata = {k.upper(): v.upper() for k, v in mdata.items()}
        for copa, fish in mdata.items():
            data = data.str.rstrip().str.lstrip().replace(copa, fish, regex=True)
Try the map method:
data['type1'] = data['type1'].map(mdata)
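One caveat worth knowing (an assumption about the intent here): map returns NaN for any value not found in the dictionary, whereas the original loop only touched matching values. If unmatched values should survive, replace — or map with a fillna fallback — may be closer to what the question wants. A small sketch with hypothetical data:

```python
import pandas as pd

data = pd.DataFrame({'type1': ['DITCH TERMINUS', 'WALL']})
mdata = {'DITCH TERMINUS': 'DITCH'}

mapped = data['type1'].map(mdata)                        # 'WALL' becomes NaN
replaced = data['type1'].replace(mdata)                  # 'WALL' is left untouched
kept = data['type1'].map(mdata).fillna(data['type1'])    # same effect as replace
```

For whole-cell lookups on a large column, map with fillna is usually faster than replace.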
Here is a full example to illustrate the results:
import pandas as pd
df = pd.DataFrame({'A':['Hello','Bye','OK','Hi','Bonjour'],
'B':['Jack','Jill','Bryan','Kevin','Susan'],
'C':['High','High','Middle','Middle','Low']})
print (df)
lookup_dict = {'High':'California','Middle':'Chicago','Low':'New York'}
df['C'] = df['C'].map(lookup_dict)
print (df)
Before:
A B C
0 Hello Jack High
1 Bye Jill High
2 OK Bryan Middle
3 Hi Kevin Middle
4 Bonjour Susan Low
After:
A B C
0 Hello Jack California
1 Bye Jill California
2 OK Bryan Chicago
3 Hi Kevin Chicago
4 Bonjour Susan New York
I come from a SQL background and am new to Python. I have been trying to figure out how to solve this particular problem for a while now and am unable to come up with anything.
Here are my dataframes
from pandas import DataFrame
import numpy as np
Names1 = {'First_name': ['Jon','Bill','Billing','Maria','Martha','Emma']}
df = DataFrame(Names1,columns=['First_name'])
print(df)
names2 = {'name': ['Jo', 'Bi', 'Ma']}
df_2 = DataFrame(names2,columns=['name'])
print(df_2)
This results in:
First_name
0 Jon
1 Bill
2 Billing
3 Maria
4 Martha
5 Emma
name
0 Jo
1 Bi
2 Ma
This code helps me identify which First_name in df starts with a prefix from df_2:
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), 'true', df['First_name'])
which results in:
First_name like_flg
0 Jon true
1 Bill true
2 Billing true
3 Maria true
4 Martha true
5 Emma Emma
I would like the final output of the dataframe to set the like_flg to the value of the tuple in which the First_name field is being conditionally compared against. See below for final desired output:
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Here's what I've tried so far
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), tuple(list(df_2['name'])), df['First_name'])
which results in this error:
`ValueError: operands could not be broadcast together with shapes (6,) (3,) (6,)`
I've also tried aligning both dataframes, however, that won't work for the use case that I'm trying to achieve.
Is there a way to conditionally align dataframes to fill in the columns that start with the tuple?
I believe the issue I'm facing is that the tuple or dataframe that I'm using as a comparison is not the same size as the dataframe that I want to append the tuple to. Please see above for the desired output.
Thank you all in advance!
If your starting strings differ in length, you can use .str.extract:
df['like_flag'] = df['First_name'].str.extract('^('+'|'.join(df_2.name)+')')
df['like_flag'] = df['like_flag'].fillna(df.First_name) # Fill non matches.
I modified df_2 to be
name
0 Jo
1 Bi
2 Mar
which leads to:
First_name like_flag
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Mar
4 Martha Mar
5 Emma Emma
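One caveat with the str.extract approach: the joined names are compiled as a regex alternation, so any regex metacharacters in them (dots, parentheses) would match the wrong thing. A hedged sketch that escapes each candidate with re.escape first:

```python
import re
import pandas as pd

df = pd.DataFrame({'First_name': ['Jon', 'Bill', 'Billing', 'Maria', 'Martha', 'Emma']})
df_2 = pd.DataFrame({'name': ['Jo', 'Bi', 'Ma']})

# Escape each candidate prefix before building the '^(a|b|c)' pattern.
pattern = '^(' + '|'.join(re.escape(n) for n in df_2['name']) + ')'

# expand=False keeps the single capture group as a Series.
df['like_flag'] = df['First_name'].str.extract(pattern, expand=False)
df['like_flag'] = df['like_flag'].fillna(df['First_name'])  # fill non-matches
```

With the plain letter prefixes here the escaping is a no-op, but it keeps the pattern safe for names like "st. peter".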
You can use np.where,
df['like_flg'] = np.where(df.First_name.str[:2].isin(df_2.name), df.First_name.str[:2], df.First_name)
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Do it with numpy find:
v=df.First_name.values.astype(str)
s=df_2.name.values.astype(str)
df_2.name.dot((np.char.find(v,s[:,None])==0))
array(['Jo', 'Bi', 'Bi', 'Ma', 'Ma', ''], dtype=object)
Then we just assign it back
df['New']=df_2.name.dot((np.char.find(v,s[:,None])==0))
df.loc[df['New']=='','New']=df.First_name
df
First_name New
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
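To unpack why the dot trick above produces strings rather than numbers — a commented sketch of the same steps, restating the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First_name': ['Jon', 'Bill', 'Billing', 'Maria', 'Martha', 'Emma']})
df_2 = pd.DataFrame({'name': ['Jo', 'Bi', 'Ma']})

v = df.First_name.values.astype(str)
s = df_2.name.values.astype(str)

# Broadcasting s[:, None] (3, 1) against v (6,) gives a (3, 6) matrix of
# substring positions; position 0 means the name starts with that prefix.
is_prefix = np.char.find(v, s[:, None]) == 0

# Series.dot then "sums" strings column by column: each True cell contributes
# its prefix ('Jo' * True == 'Jo'), each False cell contributes '' ('Jo' * False == '').
out = df_2.name.dot(is_prefix)
```

This relies on no name matching two prefixes at once; overlapping matches would concatenate.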
I'm doing some research on a dataframe of people that are relatives. But when I find brothers, I can't manage to write them all down in a specific column. Here follows an example:
cols = ['Name','Father','Brother']
df = pd.DataFrame({'Brother':'',
                   'Father':['Erick Moon','Ralph Docker','Erick Moon','Stewart Adborn'],
                   'Name':['John Smith','Rodolph Ruppert','Mathew Common',"Patrick French"]
                  },columns=cols)
df
Name Father Brother
0 John Smith Erick Moon
1 Rodolph Ruppert Ralph Docker
2 Mathew Common Erick Moon
3 Patrick French Stewart Adborn
What I want is this:
Name Father Brother
0 John Smith Erick Moon Mathew Common
1 Rodolph Ruppert Ralph Docker
2 Mathew Common Erick Moon John Smith
3 Patrick French Stewart Adborn
I appreciate any help!
Here is an idea you can try: first create a Brother column with all brothers as a list, including the person itself, and then remove the person itself separately. The code could probably be optimized, but it is somewhere you can start from:
import numpy as np
import pandas as pd
df['Brother'] = df.groupby('Father')['Name'].transform(lambda g: [g.values])
def deleteSelf(row):
    row.Brother = np.delete(row.Brother, np.where(row.Brother == row.Name))
    return row

df.apply(deleteSelf, axis=1)
# Name Father Brother
# 0 John Smith Erick Moon [Mathew Common]
# 1 Rodolph Ruppert Ralph Docker []
# 2 Mathew Common Erick Moon [John Smith]
# 3 Patrick French Stewart Adborn []
def same_father(me, data):
    hasdad = data.Father == data.at[me, 'Father']
    notme = data.index != me
    isbro = hasdad & notme
    return data.loc[isbro].index.tolist()

df2 = df.set_index('Name')
getbro = lambda x: same_father(x.name, df2)
df2['Brother'] = df2.apply(getbro, axis=1)
I think this should work (untested).
Given dataset 1
name,x,y
st. peter,1,2
big university portland,3,4
and dataset 2
name,x,y
saint peter,3,4
uni portland,5,6
The goal is to merge on
d1.merge(d2, on="name", how="left")
There are no exact matches on name though. So I'm looking to do a kind of fuzzy matching. The technique does not matter in this case, more how to incorporate it efficiently into pandas.
For example, st. peter might match saint peter in the other, but big university portland might be too much of a deviation that we wouldn't match it with uni portland.
One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).
The resulting dataframe should only contain the row st. peter, and contain both "name" variations, and both x and y variables.
Is there a way to do this kind of merging using pandas?
Did you look at fuzzywuzzy?
You might do something like:
import pandas as pd
import fuzzywuzzy.process as fwp
choices = list(df2.name)
def fmatch(row):
    minscore = 95  # or whatever score works for you
    # Note: row['name'], not row.name -- .name on a row is its index label.
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1', '_df2'],
                  how='outer')  # assuming you want to keep unmatched records
Caveat Emptor: I haven't tried to run this.
Let's say you have that function which returns the best match if any, None otherwise:
def best_match(s, candidates):
    '''Return the item in candidates that best matches s.

    Will return None if a good enough match is not found.
    '''
    # Some code here.
Then you can join on the values returned by it, but you can do it in different ways that would lead to different output (I think so, at least; I have not looked closely at this):
(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
    .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))
The simplest idea I can get now is to create special dataframe with distances between all names:
>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
name1 name2 x2 y2
0 st. peter saint peter 3 4
1 st. peter uni portland 5 6
2 big university portland saint peter 3 4
3 big university portland uni portland 5 6
>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
1 st. peter uni portland 5 6 9
2 big university portland saint peter 3 4 18
3 big university portland uni portland 5 6 11
>>> merger = merger[merger['res'] <= 5]
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
name x y name1 name2 x2 y2
0 st. peter 1 2 st. peter saint peter 3 4
1 big university portland 3 4 NaN NaN NaN NaN
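As a side note, on pandas 1.2 or newer (an assumption about your version) the dummy-column trick can be written directly as a cross merge, which produces the same Cartesian product without adding and deleting a column:

```python
import pandas as pd

d1 = pd.DataFrame({'name': ['st. peter', 'big university portland'],
                   'x': [1, 3], 'y': [2, 4]})
d2 = pd.DataFrame({'name': ['saint peter', 'uni portland'],
                   'x': [3, 5], 'y': [4, 6]})

# Every row of d1 paired with every row of d2; overlapping column names
# (name, x, y) get the '1'/'2' suffixes used in the answer above.
merger = d1.merge(d2, how='cross', suffixes=['1', '2'])
```

From here the Levenshtein filtering proceeds exactly as shown.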