I'm trying to do a 'find and replace' on a specific column type1 of a dataframe data, using terms stored in a dictionary. I first convert all existing values in that column to uppercase. I then create the dictionary mdata and make sure its keys and values are all uppercase as well. Finally I loop through the items in mdata with a for loop, replacing accordingly. This code used to work before I turned it into a function.
Any ideas where I've gone wrong?
def to_fish(data, fish):
    data['type1'] = data['type1'].str.upper()
    if fish == 'monument':
        mdata = {
            'natural': 'NATURAL FEATURe',
            'DITCH TERMINUS': 'DITCH',
            'DITCH RECUT': 'DITCH',
            'NATURAL_lyr': 'NATURAL FEATURE'
        }
        mdata = {k.upper(): v.upper() for k, v in mdata.items()}
        for copa, fish in mdata.items():
            data = data.str.rstrip().str.lstrip().replace(copa, fish, regex=True)
Try the map method:
data['type1'] = data['type1'].map(mdata)
Here is a small example showing how map works:
import pandas as pd
df = pd.DataFrame({'A':['Hello','Bye','OK','Hi','Bonjour'],
'B':['Jack','Jill','Bryan','Kevin','Susan'],
'C':['High','High','Middle','Middle','Low']})
print (df)
lookup_dict = {'High':'California','Middle':'Chicago','Low':'New York'}
df['C'] = df['C'].map(lookup_dict)
print (df)
Before:
A B C
0 Hello Jack High
1 Bye Jill High
2 OK Bryan Middle
3 Hi Kevin Middle
4 Bonjour Susan Low
After:
A B C
0 Hello Jack California
1 Bye Jill California
2 OK Bryan Chicago
3 Hi Kevin Chicago
4 Bonjour Susan New York
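Applied back to the original to_fish function, a minimal sketch might look like this (my own rewrite, not the exact code above; it uses Series.replace so that values missing from the dictionary are kept, whereas .map would turn them into NaN, and it returns the frame, which the original function never did):
def to_fish(data, fish):
    # normalise the column first
    data['type1'] = data['type1'].str.upper().str.strip()
    if fish == 'monument':
        mdata = {
            'NATURAL': 'NATURAL FEATURE',
            'DITCH TERMINUS': 'DITCH',
            'DITCH RECUT': 'DITCH',
            'NATURAL_LYR': 'NATURAL FEATURE',
        }
        # replace() keeps values that are not in the dictionary;
        # map() would set them to NaN instead
        data['type1'] = data['type1'].replace(mdata)
    return data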
I am working with pandas and have a dataframe that contains a list of sentences and people who said them, like this:
sentence person
'hello world' Matt
'cake, delicious cake!' Matt
'lovely day' Maria
'i like cake' Matt
'a new day' Maria
'a new world' Maria
I want to count non-overlapping matches of regex strings in sentence (e.g. cake, world, day) by person. Note that each row of sentence may contain more than one match (e.g. cake):
person 'day' 'cake' 'world'
Matt 0 3 1
Maria 2 0 1
So far I am doing this:
rows_cake = df[df['sentence'].str.contains(r"cake")]
counts_cake = rows_cake.value_counts()
However, str.contains gives me the rows containing cake, but not the individual instances of cake.
I know I can use str.count(r"cake") on rows_cake. However, in practice my dataframe is extremely large (> 10 million rows) and the regexes I am using are quite complex, so I am looking for a more efficient solution if possible.
Maybe you should first pull out each sentence and then use re to do your optimized regex work, like this:
for row in df.itertuples(index=False):
    do_some_regex_stuff(row[0], row[1])  # here row[0] is the sentence and row[1] is the person
As far as I know, itertuples is quite fast (see note no. 1 here). So the only optimization problem you have is with the regex itself.
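For example, a minimal sketch of what that loop could look like, with one compiled pattern and a Counter per person (the pattern and the column order are placeholders, not from the original post):
import re
from collections import Counter, defaultdict
import pandas as pd

pattern = re.compile(r"cake|world|day")   # stand-in for the real, more complex regex
counts = defaultdict(Counter)
for sentence, person in df.itertuples(index=False):
    counts[person].update(pattern.findall(sentence))

result = pd.DataFrame(counts).T.fillna(0)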
I came up with a rather simple solution, but I can't claim it to be the fastest or most efficient.
import pandas as pd
import numpy as np
# to be used with read_clipboard()
'''
sentence person
'hello world' Matt
'cake, delicious cake!' Matt
'lovely day' Maria
'i like cake' Matt
'a new day' Maria
'a new world' Maria
'''
df = pd.read_clipboard()
# print(df)
Output:
sentence person
0 'hello world' Matt
1 'cake, delicious cake!' Matt
2 'lovely day' Maria
3 'i like cake' Matt
4 'a new day' Maria
5 'a new world' Maria
# if the list of keywords is fix and relatively small
keywords = ['day', 'cake', 'world']
# for each keyword and each string, count the occurrences
for key in keywords:
    df[key] = [(len(val.split(key)) - 1) for val in df['sentence']]
# print(df)
Output:
sentence person day cake world
0 'hello world' Matt 0 0 1
1 'cake, delicious cake!' Matt 0 2 0
2 'lovely day' Maria 1 0 0
3 'i like cake' Matt 0 1 0
4 'a new day' Maria 1 0 0
5 'a new world' Maria 0 0 1
# create a simple pivot with what data you needed
df_pivot = pd.pivot_table(df,
                          values=['day', 'cake', 'world'],
                          columns=['person'],
                          aggfunc=np.sum).T
# print(df_pivot)
Final Output:
cake day world
person
Maria 0 2 1
Matt 3 0 1
Open to suggestions on whether this seems a good approach, especially given the volume of data. Eager to learn.
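A vectorized variant of the same per-keyword loop, using str.count (which the question already mentions) plus a groupby; just a sketch, not benchmarked:
# count non-overlapping matches of each keyword per row
for key in keywords:
    df[key] = df['sentence'].str.count(key)

# then aggregate per person
counts = df.groupby('person')[keywords].sum()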
Since this primarily involves strings, I would suggest taking the computation out of Pandas; plain Python is faster than Pandas in most cases when it comes to string manipulation:
#read in data
df = pd.read_clipboard(sep='\s{2,}', engine='python')
#create a dictionary of persons and sentences :
from collections import defaultdict, ChainMap
d = defaultdict(list)
for k, v in zip(df.person, df.sentence):
    d[k].append(v)
d = {k:",".join(v) for k,v in d.items()}
#search words
strings = ("cake", "world", "day")
#get count of words and create a dict
m = defaultdict(list)
for k, v in d.items():
    for st in strings:
        m[k].append({st: v.count(st)})
res = {k:dict(ChainMap(*v)) for k,v in m.items()}
print(res)
{'Matt': {'day': 0, 'world': 1, 'cake': 3},
'Maria': {'day': 2, 'world': 1, 'cake': 0}}
output = pd.DataFrame(res).T
day world cake
Matt 0 1 3
Maria 2 1 0
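Note that v.count(st) counts literal substrings; if the real patterns are regexes, the counting step could use compiled patterns instead, roughly like this (a sketch, with new names):
import re

# compile each pattern once, then count matches in every person's concatenated text
compiled = {s: re.compile(s) for s in strings}
counts_re = {person: {s: len(p.findall(text)) for s, p in compiled.items()}
             for person, text in d.items()}
output_re = pd.DataFrame(counts_re).T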
Test the speeds and see which one is better; it would be useful for me and others as well.
I'm fairly new to python and working with a DataFrame in pandas & numpy from The Movie Database. One of the columns lists the main cast of each movie, separated by the pipe symbol (|). I'm trying to find a way to split out each individual cast member and list them in their own row together with the movie title. I've attached a snippet below of the results I get.
tmdb_data = pd.read_csv('tmdb-movies.csv')
cast_split = tmdb_data[['original_title', 'cast']]
df = pd.DataFrame(cast_split)
df.head()
(snippet: Movie Title & Cast)
Expected Output:
original_title cast
0 Jursassic World Chris Patt
1 Jursassic World Bryce Dallas Howard
2 Jursassic World Irrfan Khan
Use pop + split + stack + rename + reset_index to build a new Series and then join it to the original:
tmdb_data = pd.DataFrame({'movie':['Jursassic World', 'Insurgent'],
'cast':['Chris Patt|Bryce Dallas Howard|Irrfan Khan',
'Shailene Woodley|Theo James']},
columns=['movie', 'cast'])
print (tmdb_data)
movie cast
0 Jursassic World Chris Patt|Bryce Dallas Howard|Irrfan Khan
1 Insurgent Shailene Woodley|Theo James
df1 = (tmdb_data.join(tmdb_data.pop('cast').str.split('|', expand=True)
                               .stack()
                               .reset_index(level=1, drop=True)
                               .rename('cast'))
                .reset_index(drop=True))
print (df1)
movie cast
0 Jursassic World Chris Patt
1 Jursassic World Bryce Dallas Howard
2 Jursassic World Irrfan Khan
3 Insurgent Shailene Woodley
4 Insurgent Theo James
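On pandas 0.25 or newer, explode offers a shorter route to the same shape; a sketch starting from the original tmdb_data (i.e. before the pop above):
df_exploded = (tmdb_data.assign(cast=tmdb_data['cast'].str.split('|'))
                        .explode('cast')
                        .reset_index(drop=True))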
First cast as a list (pardon the pun!), then rebuild dataframe via numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame([['Jursassic World', 'Chris Patt|Bryce Dallas Howard']], columns=['movie', 'cast'])
df.cast = df.cast.str.split('|')
df2 = pd.DataFrame({'movie': np.repeat(df.movie.values, df.cast.str.len()),
                    'cast': np.concatenate(df.cast.values)})
# cast movie
# 0 Chris Patt Jursassic World
# 1 Bryce Dallas Howard Jursassic World
I'm doing some research on a dataframe of people who are related. But when I find brothers, I can't find a way to write them all down in a specific column. Here follows an example:
cols = ['Name','Father','Brother']
df = pd.DataFrame({'Brother': '',
                   'Father': ['Erick Moon', 'Ralph Docker', 'Erick Moon', 'Stewart Adborn'],
                   'Name': ['John Smith', 'Rodolph Ruppert', 'Mathew Common', "Patrick French"]
                   }, columns=cols)
df
Name Father Brother
0 John Smith Erick Moon
1 Rodolph Ruppert Ralph Docker
2 Mathew Common Erick Moon
3 Patrick French Stewart Adborn
What I want is this:
Name Father Brother
0 John Smith Erick Moon Mathew Common
1 Rodolph Ruppert Ralph Docker
2 Mathew Common Erick Moon John Smith
3 Patrick French Stewart Adborn
I appreciate any help!
Here is an idea you can try: first create a Brother column with all brothers as a list (including the person themselves), and then remove the person from their own list separately. The code could probably be optimized, but it gives you something to start from:
import numpy as np
import pandas as pd
df['Brother'] = df.groupby('Father')['Name'].transform(lambda g: [g.values])
def deleteSelf(row):
    row.Brother = np.delete(row.Brother, np.where(row.Brother == row.Name))
    return row
df.apply(deleteSelf, axis = 1)
# Name Father Brother
# 0 John Smith Erick Moon [Mathew Common]
# 1 Rodolph Ruppert Ralph Docker []
# 2 Mathew Common Erick Moon [John Smith]
# 3 Patrick French Stewart Adborn []
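If you would rather have plain strings in the Brother column (as in the desired output) instead of arrays, a variant sketch along the same groupby idea:
# join all children of the same father into one string, then drop the person themselves
siblings = df.groupby('Father')['Name'].transform(lambda g: ', '.join(g))
df['Brother'] = [', '.join(n for n in sibs.split(', ') if n != name)
                 for name, sibs in zip(df['Name'], siblings)]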
def same_father(me, data):
    hasdad = data.Father == data.at[me, 'Father']
    notme = data.index != me
    isbro = hasdad & notme
    return data.loc[isbro].index.tolist()
df2 = df.set_index('Name')
getbro = lambda x: same_father(x.name, df2)
df2['Brother'] = df2.apply(getbro, axis=1)
I think this should work (untested).
Given dataset 1
name,x,y
st. peter,1,2
big university portland,3,4
and dataset 2
name,x,y
saint peter,3,4
uni portland,5,6
The goal is to merge on
d1.merge(d2, on="name", how="left")
There are no exact matches on name though, so I'm looking to do a kind of fuzzy matching. The exact technique does not matter in this case; what matters more is how to incorporate it efficiently into pandas.
For example, st. peter might match saint peter in the other, but big university portland might be too much of a deviation that we wouldn't match it with uni portland.
One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).
The resulting dataframe should only contain the row st. peter, and contain both "name" variations, and both x and y variables.
Is there a way to do this kind of merging using pandas?
Did you look at fuzzywuzzy?
You might do something like:
import pandas as pd
import fuzzywuzzy.process as fwp
choices = list(df2.name)
def fmatch(row):
    minscore = 95  # or whatever score works for you
    # note: row['name'], not row.name (row.name would be the row's index label)
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1', '_df2'],
                  how='outer')  # assuming you want to keep unmatched records
Caveat Emptor: I haven't tried to run this.
Let's say you have that function which returns the best match if any, None otherwise:
def best_match(s, candidates):
    ''' Return the item in candidates that best matches s.
    Will return None if a good enough match is not found.
    '''
    # Some code here.
Then you can join on the values it returns, but you can do it in different ways that would lead to different output (I think; I did not look into this much):
(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
    .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))
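For completeness, one possible (hypothetical) implementation of best_match using only the standard library's difflib; the 0.6 cutoff is arbitrary and not part of the original answer:
import difflib

def best_match(s, candidates):
    '''Return the item in candidates that best matches s, or None.'''
    matches = difflib.get_close_matches(s, list(candidates), n=1, cutoff=0.6)
    return matches[0] if matches else None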
The simplest idea I can come up with now is to create a special dataframe with distances between all the names:
>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
name1 name2 x2 y2
0 st. peter saint peter 3 4
1 st. peter uni portland 5 6
2 big university portland saint peter 3 4
3 big university portland uni portland 5 6
>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
1 st. peter uni portland 5 6 9
2 big university portland saint peter 3 4 18
3 big university portland uni portland 5 6 11
>>> merger = merger[merger['res'] <= 5]
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
name x y name1 name2 x2 y2
0 st. peter 1 2 st. peter saint peter 3 4
1 big university portland 3 4 NaN NaN NaN NaN
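As a side note, on pandas 1.2+ the dummy column can be avoided with a cross join; a sketch:
merger = pd.merge(df1, df2, how='cross', suffixes=['1', '2'])[['name1', 'name2', 'x2', 'y2']]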
First of all, I am very new to pandas and am trying to learn, so thorough answers will be appreciated.
I want to generate a pandas DataFrame representing a map twitter tag subtoken -> poster, where tag subtoken means anything in the set {hashtagA} U {i | i in split('_', hashtagA)}, starting from a table matching poster -> tweet.
For example:
In [1]: df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])
In [2]: df
Out[2]:
0 1
0 jim i was like #yolo_omg to her
1 jack You are so #yes_omg #best_place_ever
2 neil Yo #rofl_so_funny
And from that I want to get something like
0 1
0 jim yolo_omg
1 jim yolo
2 jim omg
3 jack yes_omg
4 jack yes
5 jack omg
6 jack best_place_ever
7 jack best
8 jack place
9 jack ever
10 neil rofl_so_funny
11 neil rofl
12 neil so
13 neil funny
I managed to construct this monstrosity that actually does the job:
In [143]: df[1].str.findall('#([^\s]+)') \
.apply(pd.Series).stack() \
.apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
.apply(pd.Series).stack().to_frame().reset_index(level=0) \
.join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]
Out[143]:
0 0_l
0 0 jim yolo_omg
1 jim yolo
2 jim omg
0 jack yes_omg
1 jack yes
2 jack omg
1 0 jack best_place_ever
1 jack best
2 jack place
3 jack ever
0 0 neil rofl_so_funny
1 neil rofl
2 neil so
3 neil funny
But I have a very strong feeling that there are much better ways of doing this, especially given that the real dataset is huge.
pandas indeed has a function for doing this natively.
Series.str.findall()
This basically applies a regex and captures the group(s) you specify in it.
So if I had your dataframe:
df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])
What I would do is first to set the names of your columns, like this:
df.columns = ['user', 'tweet']
Or do it on creation of the dataframe:
df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]], columns=['user', 'tweet'])
Then I would simply apply findall with a regex:
df['tag'] = df["tweet"].str.findall("(#[^ ]*)")
And I would use a negated character class instead of a positive one; this is more likely to survive special cases.
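To go from those lists of tags to the one-row-per-subtoken shape the question asks for, one could continue along these lines (a sketch; explode needs pandas 0.25+ and the column names follow the renaming above):
# one tag per row; the regex above keeps the leading '#', so strip it here
tags = (df[['user']].join(df['tag'])
                    .explode('tag')
                    .dropna(subset=['tag']))
tags['tag'] = tags['tag'].str.lstrip('#')

# put the '_'-separated subtokens next to each full tag, then flatten again
tags['tag'] = tags['tag'].apply(lambda t: [t] + t.split('_') if '_' in t else [t])
tags = tags.explode('tag').reset_index(drop=True)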
How about using list comprehensions in Python and then going back to pandas? It requires a few lines of code but is perhaps more readable.
import re
# get the hash tags
tags = [re.findall('#([^\s]+)', t) for t in df[1]]
# make lists of the tags with subtokens for each user
st = [[t] + [s.split('_') for s in t] for t in tags]
subtokens = [[i for s in poster for i in s] for poster in st]
# put back into DataFrame with poster names
df2 = pd.DataFrame(subtokens, index=df[0]).stack()
In [250]: df2
Out[250]:
jim 0 yolo_omg
1 yolo
2 omg
jack 0 yes_omg
1 best_place_ever
2 yes
3 omg
4 best
5 place
6 ever
neil 0 rofl_so_funny
1 rofl
2 so
3 funny
dtype: object