Analyzing Token Data from a Pandas Dataframe - python

I'm a relative python noob and also new to natural language processing (NLP).
I have a dataframe containing names and sales. I want to: 1) break out all the tokens and 2) aggregate sales by each token.
Here's an example of the dataframe:
name sales
Mike Smith 5
Mike Jones 3
Mary Jane 4
Here's the desired output:
token sales
mike 8
mary 4
Smith 5
Jones 3
Jane 4
Thoughts on what to do? I'm using Python.

Assumption: you have a function tokenize that takes in a string as input and returns a list of tokens
I'll use this function as a tokenizer for now:
def tokenize(word):
    return word.casefold().split()
Solution
df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
In [45]: df
Out[45]:
name sales
0 Mike Smith 5
1 Mike Jones 3
2 Mary Jane 4
3 Mary Anne Jane 1
In [46]: df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
Out[46]:
tokens sales
0 anne 1
1 jane 5
2 jones 3
3 mary 5
4 mike 8
5 smith 5
Explanation
The assign step creates a column called tokens by applying the tokenize function to each value in the name column.
Note: for this particular tokenize function you could use df['name'].str.lower().str.split() instead. However, that won't generalize to custom tokenizers, hence the .apply(tokenize).
this generates a df that looks like
name sales tokens
0 Mike Smith 5 [mike, smith]
1 Mike Jones 3 [mike, jones]
2 Mary Jane 4 [mary, jane]
3 Mary Anne Jane 1 [mary, anne, jane]
use df.explode on this to get
name sales tokens
0 Mike Smith 5 mike
0 Mike Smith 5 smith
1 Mike Jones 3 mike
1 Mike Jones 3 jones
2 Mary Jane 4 mary
2 Mary Jane 4 jane
3 Mary Anne Jane 1 mary
3 Mary Anne Jane 1 anne
3 Mary Anne Jane 1 jane
The last step is just a groupby-agg step.
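For reference, here is a minimal self-contained sketch of the same pipeline using the vectorized .str accessor mentioned in the note above instead of .apply(tokenize). It assumes simple whitespace tokenization is good enough; the variable name token_sales is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mike Smith', 'Mike Jones', 'Mary Jane', 'Mary Anne Jane'],
                   'sales': [5, 3, 4, 1]})

token_sales = (
    # One list of lowercase tokens per row
    df.assign(tokens=df['name'].str.lower().str.split())
      # One row per token, with the row's sales repeated for each token
      .explode('tokens')
      # Sum sales per token; as_index=False keeps 'tokens' as a regular column
      .groupby('tokens', as_index=False)['sales']
      .sum()
)
print(token_sales)
```

If you need a custom tokenizer (stemming, punctuation handling and so on), swap the .str chain back for .apply(tokenize).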

You can use the str.split() method and keep item 0 for the first name, using that as the groupby key and take the sum, then do the same for item -1 (last name) and concatenate the two.
import pandas as pd
df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
                   'sales': {0: 5, 1: 3, 2: 4}})
# Select 'sales' before summing so the string column 'name' is not aggregated too
df = pd.concat([df.groupby(df.name.str.split().str[0])['sales'].sum(),
                df.groupby(df.name.str.split().str[-1])['sales'].sum()]).reset_index()
df.rename(columns={'name':'token'}, inplace=True)

# Get the tokens, assuming each name is exactly two words separated by a space
df[["fname", "lname"]] = df["name"].str.split(expand=True)
# Stack first names and last names into a single 'tokens' column
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns={'fname': 'tokens'}),
                       df[['lname', 'sales']].rename(columns={'lname': 'tokens'})])
# Sum sales per token
tokens_df.groupby('tokens')['sales'].sum().to_frame()

Related

Split pandas dataframe column of type string into multiple columns based on number of ',' characters

Let's say I have a pandas dataframe that looks like this:
import pandas as pd
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar']}
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
I know I can split the names into separate columns using the comma as separator, like so:
df[["name1", "name2", "name3"]] = df["name"].str.split(", ", expand=True)
df
name name1 name2 name3
0 Tom, Jeffrey, Henry Tom Jeffrey Henry
1 Nick, James Nick James None
2 Chris Chris None None
3 David, Oscar David Oscar None
However, if the name column would have a row that contains 4 names, like below, the code above will yield a ValueError: Columns must be same length as key
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar', 'Jim, Jones, William, Oliver']}
# Create DataFrame
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
4 Jim, Jones, William, Oliver
How can I automatically split the name column into n separate columns based on the ',' separator? The desired output would be this:
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver
Split into a new DataFrame with expand=True, use rename to build the new column names, then attach it to the original with DataFrame.join:
f = lambda x: f'name{x+1}'
df = df.join(df["name"].str.split(", ", expand=True).rename(columns=f))
print (df)
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver
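To make it clearer why this works for any number of names: str.split(..., expand=True) returns a DataFrame whose columns are labelled 0, 1, 2, ... up to the longest row, so the column names can be generated from its width. A minimal sketch of the same idea, with parts and result as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris',
                            'Jim, Jones, William, Oliver']})

# One column per name; shorter rows are padded with missing values
parts = df['name'].str.split(', ', expand=True)

# Build name1..nameN from however many columns the split produced
parts.columns = [f'name{i + 1}' for i in range(parts.shape[1])]

result = df.join(parts)
print(result)
```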

Create the same ids for the same names in different dataframes in pandas

I have a dataset with unique names. Another dataset contains several rows with the same names as in the first dataset.
I want to create a column with unique ids in the first dataset and another column in the second dataset with the same ids corresponding to all the same names in the first dataset.
For example:
Dataframe 1:
player_id Name
1 John Dosh
2 Michael Deesh
3 Julia Roberts
Dataframe 2:
player_id Name
1 John Dosh
1 John Dosh
2 Michael Deesh
2 Michael Deesh
2 Michael Deesh
3 Julia Roberts
3 Julia Roberts
I want to use both data frames to run deep feature synthesis using featuretools.
To be able to do something like this:
entity_set = ft.EntitySet("basketball_players")
entity_set.add_dataframe(dataframe_name="players_set",
dataframe=players_set,
index='name'
)
entity_set.add_dataframe(dataframe_name="season_stats",
dataframe=season_stats,
index='season_stats_id'
)
entity_set.add_relationship("players_set", "player_id", "season_stats", "player_id")
This should do what your question asks:
import pandas as pd
df1 = pd.DataFrame([
'John Dosh',
'Michael Deesh',
'Julia Roberts'], columns=['Name'])
df2 = pd.DataFrame([
['John Dosh'],
['John Dosh'],
['Michael Deesh'],
['Michael Deesh'],
['Michael Deesh'],
['Julia Roberts'],
['Julia Roberts']], columns=['Name'])
print('inputs:', '\n')
print(df1)
print(df2)
df1 = df1.reset_index().rename(columns={'index':'id'}).assign(id=df1.index + 1)
df2 = df2.join(df1.set_index('Name'), on='Name')[['id'] + list(df2.columns)]
print('\noutputs:', '\n')
print(df1)
print(df2)
Input/output:
inputs:
Name
0 John Dosh
1 Michael Deesh
2 Julia Roberts
Name
0 John Dosh
1 John Dosh
2 Michael Deesh
3 Michael Deesh
4 Michael Deesh
5 Julia Roberts
6 Julia Roberts
outputs:
id Name
0 1 John Dosh
1 2 Michael Deesh
2 3 Julia Roberts
id Name
0 1 John Dosh
1 1 John Dosh
2 2 Michael Deesh
3 2 Michael Deesh
4 2 Michael Deesh
5 3 Julia Roberts
6 3 Julia Roberts
UPDATE:
An alternative solution which should give the same result is:
df1 = df1.assign(id=list(range(1, len(df1) + 1)))[['id'] + list(df1.columns)]
df2 = df2.merge(df1)[['id'] + list(df2.columns)]
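Another option, sketched here under the assumption that the names in the first dataframe really are unique, is to build a name-to-id lookup and apply it with Series.map. The 'Unknown Player' row is made up to show what happens to a name that is missing from the first dataframe.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John Dosh', 'Michael Deesh', 'Julia Roberts']})
df2 = pd.DataFrame({'Name': ['John Dosh', 'John Dosh', 'Michael Deesh',
                             'Julia Roberts', 'Unknown Player']})

# Assign ids 1..n to the unique names
df1['player_id'] = range(1, len(df1) + 1)

# Series indexed by name, used as a lookup table
name_to_id = df1.set_index('Name')['player_id']

# Names not present in df1 come out as NaN instead of being dropped
df2['player_id'] = df2['Name'].map(name_to_id)
print(df2)
```

Unlike an inner merge, this keeps every row of the second dataframe, so unmatched names are easy to spot before building the relationship.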

Pandas: a Pythonic way to create a hyperlink from a value stored in another column of the dataframe

I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column contains a malformed URL. The URL should have a "+" between the first name and last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library function that is designed to produce correctly encoded query strings:
from urllib.parse import quote_plus
def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
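The two answers differ once a name contains something other than letters and spaces: str.replace only handles the space, while quote_plus also percent-encodes characters such as "&". A small sketch of the quote_plus route applied to the whole column; the 'Tom & Jerry' row and the base_url name are made up for illustration.

```python
import pandas as pd
from urllib.parse import quote_plus

df = pd.DataFrame({'id': [1, 2], 'name': ['John Smith', 'Tom & Jerry']})

base_url = 'https://www.google.com/search?q='

# quote_plus turns spaces into '+' and percent-encodes reserved characters
df['google_search'] = base_url + df['name'].map(quote_plus)
print(df['google_search'].tolist())
```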

In Python, trying to remove duplicate words in a dataframe, but getting an error

I'm trying to remove a duplicate word in a cell
Current Desired
0 John and Jane John and Jane
1 John and John John
2 John John
3 Jane and Jane Jane
I have tried the following, but the desired column gets filled with o d i c t _ k e y s ( [ ' n a n ' ] ):
from collections import OrderedDict
df['Current'] = (df['Desired'].astype(str).str.split()
.apply(lambda x: OrderedDict.fromkeys(x).keys())
.astype(str).str.join(' '))
I have also tried this, but the desired column gets filled with nan
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1')
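Split on ' and ', deduplicate with set, then join back. Note that a set does not preserve the original order, as row 0 of the output shows: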
df['out'] = df.Current.str.split(' and ').map(lambda x : ' and '.join(set(x)))
df
Out[876]:
Current out
0 John and Jane Jane and John
1 John and John John
2 John John
3 Jane and Jane Jane
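If the original word order matters (row 0 above comes back as "Jane and John"), one possible variation is to deduplicate with dict.fromkeys, which keeps the first occurrence of each item in insertion order. This is only a sketch: the helper name dedupe_names is made up, and it assumes ' and ' is the only separator.

```python
import pandas as pd

df = pd.DataFrame({'Current': ['John and Jane', 'John and John', 'John', 'Jane and Jane']})

def dedupe_names(cell, sep=' and '):
    # Leave missing values (NaN/None) untouched
    if not isinstance(cell, str):
        return cell
    # dict.fromkeys drops repeats while keeping first-seen order
    return sep.join(dict.fromkeys(cell.split(sep)))

df['out'] = df['Current'].map(dedupe_names)
print(df)
```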

In Pandas, how to create a unique ID based on the combination of many columns?

I have a very large dataset, that looks like
import numpy as np
import pandas as pd
df = pd.DataFrame({'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan], 'C': ['indiana jones', 'duck mc duck', 'batman','duck mc duck',np.nan]})
df
Out[173]:
B C
0 john smith indiana jones
1 john doe duck mc duck
2 adam smith batman
3 john doe duck mc duck
4 NaN NaN
I need to create a ID variable, that is unique for every B-C combination. That is, the output should be
B C ID
0 john smith indiana jones 1
1 john doe duck mc duck 2
2 adam smith batman 3
3 john doe duck mc duck 2
4 NaN NaN 0
I actually don't care whether the index starts at zero or not, or whether the value for the missing columns is 0 or any other number. I just want something fast that does not take a lot of memory and can be sorted quickly.
I use:
df['combined_id']=(df.B+df.C).rank(method='dense')
but the output is float64 and takes a lot of memory. Can we do better?
Thanks!
I think you can use factorize:
df['combined_id'] = pd.factorize(df.B+df.C)[0]
print(df)
B C combined_id
0 john smith indiana jones 0
1 john doe duck mc duck 1
2 adam smith batman 2
3 john doe duck mc duck 1
4 NaN NaN -1
Making jezrael's answer a little more general (what if the columns were not string?), you can use this compact function:
def make_identifier(df):
    str_id = df.apply(lambda x: '_'.join(map(str, x)), axis=1)
    return pd.factorize(str_id)[0]
df['combined_id'] = make_identifier(df[['B','C']])
jezrael's answer is great. But since this is for multiple columns, I prefer to use .ngroup() since this way NaN could remain NaN.
df['combined_id'] = df.groupby(['B', 'C'], sort = False).ngroup()
df
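One thing to watch with the plain df.B+df.C approach is that different pairs can concatenate to the same string. The sketch below uses made-up values chosen to trigger that collision and compares it with grouping on the column pair; the cast to int32 is an assumption about how many distinct combinations you have, added because memory was a concern in the question.

```python
import pandas as pd

# Hypothetical data: ('ab', 'c') and ('a', 'bc') both concatenate to 'abc'
df = pd.DataFrame({'B': ['ab', 'a', 'ab'], 'C': ['c', 'bc', 'c']})

# Concatenation cannot tell the two combinations apart
naive_id = pd.factorize(df['B'] + df['C'])[0]

# Grouping on the pair keeps them distinct; int32 uses half the
# memory of the default int64 result
pair_id = df.groupby(['B', 'C'], sort=False).ngroup().astype('int32')

print(naive_id)
print(pair_id.tolist())
```

The make_identifier function above avoids most such collisions by joining with '_', although a value that itself contains the separator could still clash.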
