I have a pandas DataFrame with a column containing either single values or lists of values (of unequal length). I want to 'expand' the rows so that each value in a list gets its own row. An example says it all:
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})
location name
0 Amsterdam Tom
1 [Berlin, Paris] Jim
2 [Antwerp, Barcelona, Pisa] Claus
I want to turn it into:
dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
I first tried using apply, but as far as I know it's not possible to return multiple Series from it. iterrows seems to be the trick, but the code below gives me an empty dataframe:
def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)
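A note on why the loop above leaves dfOut untouched: DataFrame.append returned a new DataFrame rather than modifying dfOut in place, so each result was simply discarded (the method was later deprecated and then removed in pandas 2.0). In addition, subSeries = series only creates a second name for the same row object, not a copy. A minimal sketch of the same row-by-row idea, assuming it is acceptable to collect plain dicts in a list (the name rows is mine) and build the frame once at the end:

```python
import pandas as pd

dfIn = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Claus'],
    'location': ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']],
})

rows = []  # hypothetical accumulator: one dict per output row
for _, row in dfIn.iterrows():
    # Wrap scalar entries in a list so both cases can be handled the same way.
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    for location in locations:
        rows.append({'location': location, 'name': row['name']})

# Build the DataFrame once instead of appending repeatedly.
dfOut = pd.DataFrame(rows, columns=['location', 'name'])
print(dfOut)
```

This is still a Python-level loop, so it is unlikely to be fast on large frames; it is only meant to show what the original loop was trying to do.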
Not the most interesting or fancy pandas usage, but this works:
import numpy as np
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
all_locations = np.hstack(dfIn.location)
all_names = np.hstack([[n]*len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location':all_locations, 'name':all_names})
It's about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (didn't test how it scales with the size of the lists in each row). If you can guarantee that all location entries are already iterables, you can remove the atleast_1d call, which gives about another 20% speedup.
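The name repetition can also be expressed with np.repeat, which avoids building the intermediate Python lists. This is only a sketch of the same idea with my own variable names (locations, lengths); I have not benchmarked it against the version above:

```python
import numpy as np
import pandas as pd

dfIn = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Claus'],
    'location': ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']],
})

# Turn every entry (scalar or list) into a 1-D array.
locations = [np.atleast_1d(loc) for loc in dfIn['location']]
# Number of output rows each input row contributes.
lengths = [len(loc) for loc in locations]

dfOut = pd.DataFrame({
    'location': np.concatenate(locations),
    # Repeat each name once per location in its row.
    'name': np.repeat(dfIn['name'].to_numpy(), lengths),
})
print(dfOut)
```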
If you return a series whose index is a list of locations, then dfIn.apply will collate those series into a table:
import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'],
['Antwerp','Barcelona','Pisa'] ]})
def expand(row):
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    s = pd.Series(row['name'], index=list(set(locations)))
    return s
In [156]: dfIn.apply(expand, axis=1)
Out[156]:
Amsterdam Antwerp Barcelona Berlin Paris Pisa
0 Tom NaN NaN NaN NaN NaN
1 NaN NaN NaN Jim Jim NaN
2 NaN Claus Claus NaN NaN Claus
You can then stack this DataFrame to obtain:
In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]:
0 Amsterdam Tom
1 Berlin Jim
Paris Jim
2 Antwerp Claus
Barcelona Claus
Pisa Claus
dtype: object
This is a Series, while you want a DataFrame. A little massaging with reset_index gives you the desired result:
dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)
yields
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
import pandas as pd
dfIn = pd.DataFrame({
u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ],
})
print(dfIn.explode('location'))
>>>
name location
0 Tom Amsterdam
1 Jim Berlin
1 Jim Paris
2 Claus Antwerp
2 Claus Barcelona
2 Claus Pisa
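Note that DataFrame.explode (available since pandas 0.25) keeps the original row labels, which is why the index above repeats 1 and 2. If a fresh 0..n-1 index like the one in the question is wanted, a small sketch, assuming the original labels can be thrown away:

```python
import pandas as pd

dfIn = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Claus'],
    'location': ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']],
})

# explode() gives one row per list element; scalars are left as they are.
# reset_index(drop=True) replaces the repeated labels with 0..n-1.
dfOut = dfIn.explode('location').reset_index(drop=True)
print(dfOut)
```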
Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
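DataFrame.value_counts needs pandas 1.1 or newer and orders the result by count, largest first. If that is not available, or if ordering by the group keys is preferred, a sketch with groupby(...).size() gives the same counts without the helper column:

```python
import pandas as pd

df = pd.DataFrame(
    [('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
    columns=['city', 'country'],
)

# size() counts the rows in each group; reset_index(name='n') turns the
# resulting Series back into a DataFrame with the count column named 'n'.
counts = df.groupby(['city', 'country']).size().reset_index(name='n')
print(counts)
```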
I am trying to replace values in one dataframe's column with values from another dataframe, matched on a third column, while keeping the remaining values from the first dataframe.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on column name, turning df2 into a Series for mapping by setting its index to name (DataFrame.set_index).
Next, chain Series.fillna to replace NaN values with the original values from df1['value'] (i.e. wherever the mapping did not find a match) and assign the result to df1['value'].
df1['value'] = df1['name'].map(df2.set_index('name')['value']).fillna(df1['value'])
print(df1)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
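Put together with the integer conversion, the whole thing might look like the sketch below. It assumes every name in df2 is unique; otherwise the mapping Series would have duplicate index labels and map would fail.

```python
import pandas as pd

df1 = pd.DataFrame({
    'country': ['romania', 'russia', 'sua', 'china'],
    'name': ['john', 'emma', 'mark', 'jack'],
    'value': [100, 200, 300, 400],
})
df2 = pd.DataFrame({'name': ['emma', 'mark'], 'value': [2, 3]})

# Look up each name in df2; names without a match become NaN.
mapped = df1['name'].map(df2.set_index('name')['value'])

# Fall back to the original value where there was no match,
# then cast back to int because the NaN step produced floats.
df1['value'] = mapped.fillna(df1['value']).astype(int)
print(df1)
```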
Another option could be using pandas.DataFrame.update:
df1.set_index('name', inplace=True)
df1.update(df2.set_index('name'))
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
Another option:
df3 = df1.merge(df2, on = 'name', how = 'left')
df3['value'] = df3.value_y.fillna(df3.value_x)
df3.drop(['value_x', 'value_y'], axis = 1, inplace = True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
df1=pd.DataFrame({'country':['romania','russia','sua','china'],'name':['john','emma','mark','jack'],'value':[100,200,300,400]})
df2=pd.DataFrame({'name':['emma','mark'],'value':[2,3]})
I have a dataframe as follows:
import pandas as pd
d = {
'Name' : ['James', 'John', 'Peter', 'Thomas', 'Jacob', 'Andrew','John', 'Peter', 'Thomas', 'Jacob', 'Peter', 'Thomas'],
'Order' : [1,1,1,1,1,1,2,2,2,2,3,3],
'Place' : ['Paris', 'London', 'Rome','Paris', 'Venice', 'Rome', 'Paris', 'Paris', 'London', 'Paris', 'Milan', 'Milan']
}
df = pd.DataFrame(d)
Name Order Place
0 James 1 Paris
1 John 1 London
2 Peter 1 Rome
3 Thomas 1 Paris
4 Jacob 1 Venice
5 Andrew 1 Rome
6 John 2 Paris
7 Peter 2 Paris
8 Thomas 2 London
9 Jacob 2 Paris
10 Peter 3 Milan
11 Thomas 3 Milan
The dataframe represents people visiting various cities; the Order column defines the order of the visits.
I would like to find which city people visited before Paris.
The expected dataframe is as follows:
Name Order Place
1 John 1 London
2 Peter 1 Rome
4 Jacob 1 Venice
What is the pythonic way to find it?
Using merge
s = df.loc[df.Place.eq('Paris'), ['Name', 'Order']]
m = s.assign(Order=s.Order.sub(1))
m.merge(df, on=['Name', 'Order'])
Name Order Place
0 John 1 London
1 Peter 1 Rome
2 Jacob 1 Venice
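The merge above relies on Order values being consecutive integers per person. A possible alternative, sketched here under the assumption that "before Paris" means the immediately preceding visit, is to look at each person's next Place with groupby(...).shift(-1). It also keeps the original row labels (1, 2, 4) shown in the question. The names ordered and next_place are mine.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['James', 'John', 'Peter', 'Thomas', 'Jacob', 'Andrew',
             'John', 'Peter', 'Thomas', 'Jacob', 'Peter', 'Thomas'],
    'Order': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'Place': ['Paris', 'London', 'Rome', 'Paris', 'Venice', 'Rome',
              'Paris', 'Paris', 'London', 'Paris', 'Milan', 'Milan'],
})

# Sort so that each person's visits are in chronological order.
ordered = df.sort_values(['Name', 'Order'])

# For every visit, the place that same person visited next (NaN for the last visit).
next_place = ordered.groupby('Name')['Place'].shift(-1)

# Keep the visits whose following visit is Paris, restoring the original row order.
result = ordered[next_place.eq('Paris')].sort_index()
print(result)
```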
I come from a SQL background and am new to Python. I have been trying to figure out how to solve this particular problem for a while now and have been unable to come up with anything.
Here are my dataframes
from pandas import DataFrame
import numpy as np
Names1 = {'First_name': ['Jon','Bill','Billing','Maria','Martha','Emma']}
df = DataFrame(Names1,columns=['First_name'])
print(df)
names2 = {'name': ['Jo', 'Bi', 'Ma']}
df_2 = DataFrame(names2,columns=['name'])
print(df_2)
This results in:
First_name
0 Jon
1 Bill
2 Billing
3 Maria
4 Martha
5 Emma
name
0 Jo
1 Bi
2 Ma
This code helps me identify which First_name values in df start with one of the names in df_2:
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), 'true', df['First_name'])
This results in:
First_name like_flg
0 Jon true
1 Bill true
2 Billing true
3 Maria true
4 Martha true
5 Emma Emma
I would like the final output to set like_flg to the df_2 value that the First_name field starts with. See below for the desired output:
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Here's what I've tried so far
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), tuple(list(df_2['name'])), df['First_name'])
This results in the following error:
`ValueError: operands could not be broadcast together with shapes (6,) (3,) (6,)`
I've also tried aligning both dataframes, however, that won't work for the use case that I'm trying to achieve.
Is there a way to conditionally align dataframes to fill in the columns that start with the tuple?
I believe the issue is that the tuple or dataframe I'm using for the comparison is not the same size as the dataframe column I want to fill. Please see above for the desired output.
Thank you all in advance!
If your starting strings differ in length, you can use .str.extract
df['like_flag'] = df['First_name'].str.extract('^('+'|'.join(df_2.name)+')')
df['like_flag'] = df['like_flag'].fillna(df.First_name) # Fill non matches.
I modified df_2 to be
name
0 Jo
1 Bi
2 Mar
which leads to:
First_name like_flag
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Mar
4 Martha Mar
5 Emma Emma
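Two things to watch with the pattern above: prefixes containing regex metacharacters (such as a dot) would be interpreted as regex syntax, and with alternation the first alternative that matches wins, so a shorter prefix listed before a longer one would shadow it. A slightly more defensive sketch, assuming the longest matching prefix is the one wanted, escapes each prefix and sorts them by length first (prefixes and pattern are my own names):

```python
import re
import pandas as pd

df = pd.DataFrame({'First_name': ['Jon', 'Bill', 'Billing', 'Maria', 'Martha', 'Emma']})
df_2 = pd.DataFrame({'name': ['Jo', 'Bi', 'Mar']})

# Longest prefixes first, so 'Mar' would be tried before a shorter 'Ma'.
prefixes = sorted(df_2['name'], key=len, reverse=True)

# re.escape makes every prefix match literally.
pattern = '^(' + '|'.join(re.escape(p) for p in prefixes) + ')'

# expand=False returns a Series; non-matches are NaN and fall back to the full name.
df['like_flg'] = df['First_name'].str.extract(pattern, expand=False).fillna(df['First_name'])
print(df)
```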
You can use np.where:
df['like_flg'] = np.where(df.First_name.str[:2].isin(df_2.name), df.First_name.str[:2], df.First_name)
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
You can also do this with NumPy's char.find:
v=df.First_name.values.astype(str)
s=df_2.name.values.astype(str)
df_2.name.dot((np.char.find(v,s[:,None])==0))
array(['Jo', 'Bi', 'Bi', 'Ma', 'Ma', ''], dtype=object)
Then we just assign it back
df['New']=df_2.name.dot((np.char.find(v,s[:,None])==0))
df.loc[df['New']=='','New']=df.First_name
df
First_name New
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Given dataset 1
name,x,y
st. peter,1,2
big university portland,3,4
and dataset 2
name,x,y
saint peter,3,4
uni portland,5,6
The goal is to merge on
d1.merge(d2, on="name", how="left")
There are no exact matches on name, though, so I'm looking to do a kind of fuzzy matching. The specific technique does not matter here; what matters is how to incorporate it efficiently into pandas.
For example, st. peter might match saint peter in the other dataset, but big university portland might deviate too much to be matched with uni portland.
One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).
The resulting dataframe should only contain the row st. peter, and contain both "name" variations, and both x and y variables.
Is there a way to do this kind of merging using pandas?
Did you look at fuzzywuzzy?
You might do something like:
import pandas as pd
import fuzzywuzzy.process as fwp
choices = list(df2.name)
def fmatch(row):
    minscore = 95  # or whatever score works for you
    # Use row['name'] here: row.name would be the row's index label, not the column.
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1', '_df2'],
                  how='outer')  # assuming you want to keep unmatched records
Caveat Emptor: I haven't tried to run this.
Let's say you have a function that returns the best match, if any, and None otherwise:
def best_match(s, candidates):
    ''' Return the item in candidates that best matches s.
    Will return None if a good enough match is not found.
    '''
    # Some code here.
Then you can join on the values returned by it, but you can do it in different ways that would lead to different output (so I think, I did not look much at this issue):
(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
    .merge(df2, on='name', how='left'))
(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))
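To make the best_match placeholder concrete, here is one possible sketch that uses difflib.get_close_matches from the standard library. Note that this swaps in difflib's similarity ratio rather than Levenshtein distance, and the cutoff parameter and its value of 0.75 are my own assumptions that would need tuning on real data. The match is kept in a separate column (match, also my name) so both name variants survive the merge.

```python
import difflib
import pandas as pd

def best_match(s, candidates, cutoff=0.75):
    '''Return the item in candidates most similar to s, or None
    if nothing reaches the (assumed, tunable) similarity cutoff.'''
    matches = difflib.get_close_matches(s, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None

df1 = pd.DataFrame({'name': ['st. peter', 'big university portland'],
                    'x': [1, 3], 'y': [2, 4]})
df2 = pd.DataFrame({'name': ['saint peter', 'uni portland'],
                    'x': [3, 5], 'y': [4, 6]})

# Store the matched df2 name next to the original df1 name.
matched = df1.assign(match=df1['name'].apply(lambda s: best_match(s, df2['name'])))

# An inner join drops the rows for which no good enough match was found.
merged = matched.merge(df2, left_on='match', right_on='name',
                       how='inner', suffixes=('_df1', '_df2'))
print(merged)
```

Since best_match is called once per row and scans all candidates each time, this is quadratic in the number of names, which is worth keeping in mind for large tables.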
The simplest idea I can think of right now is to create a special dataframe holding the distances between all pairs of names:
>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
name1 name2 x2 y2
0 st. peter saint peter 3 4
1 st. peter uni portland 5 6
2 big university portland saint peter 3 4
3 big university portland uni portland 5 6
>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
1 st. peter uni portland 5 6 9
2 big university portland saint peter 3 4 18
3 big university portland uni portland 5 6 11
>>> merger = merger[merger['res'] <= 5]
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
name x y name1 name2 x2 y2
0 st. peter 1 2 st. peter saint peter 3 4
1 big university portland 3 4 NaN NaN NaN NaN
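On pandas 1.2 or newer the dummy-column trick can be replaced by merge(..., how='cross'). The sketch below follows the same cross-join idea; to stay self-contained it uses a small pure-Python levenshtein helper of my own as a stand-in for Levenshtein.distance, and it applies the "below 5 edits" threshold from the question.

```python
import pandas as pd

def levenshtein(a, b):
    '''Plain dynamic-programming edit distance (stand-in for Levenshtein.distance).'''
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

df1 = pd.DataFrame({'name': ['st. peter', 'big university portland'],
                    'x': [1, 3], 'y': [2, 4]})
df2 = pd.DataFrame({'name': ['saint peter', 'uni portland'],
                    'x': [3, 5], 'y': [4, 6]})

# Every row of df1 paired with every row of df2; overlapping columns get suffixes.
pairs = df1.merge(df2, how='cross', suffixes=('1', '2'))
pairs['dist'] = [levenshtein(a, b) for a, b in zip(pairs['name1'], pairs['name2'])]

# Keep only the pairs that are close enough.
close = pairs[pairs['dist'] < 5].reset_index(drop=True)
print(close)
```

The cross join grows with the product of the two table sizes, so this is only practical when both tables are fairly small.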