Pandas MultiIndex from regex on column - python

I have a pandas dataframe that looks like this:
df = pd.DataFrame(
    [
        ['JoeSmith', 5],
        ['CathySmith', 3],
        ['BrianSmith', 12],
        ['MarySmith', 67],
        ['JoeJones', 23],
        ['CathyJones', 98],
        ['BrianJones', 438],
        ['MaryJones', 75],
        ['JoeCollins', 56],
        ['CathyCollins', 125],
        ['BrianCollins', 900],
        ['MaryCollins', 321],
    ], columns=['Name', 'Value']
)
print(df)
Name Value
0 JoeSmith 5
1 CathySmith 3
2 BrianSmith 12
3 MarySmith 67
4 JoeJones 23
5 CathyJones 98
6 BrianJones 438
7 MaryJones 75
8 JoeCollins 56
9 CathyCollins 125
10 BrianCollins 900
11 MaryCollins 321
The first column 'Name' needs to be split into First and Last names and put into a MultiIndex.
Value
Joe Smith 5
Cathy Smith 3
Brian Smith 12
Mary Smith 67
Joe Jones 23
Cathy Jones 98
Brian Jones 438
Mary Jones 75
Joe Collins 56
Cathy Collins 125
Brian Collins 900
Mary Collins 321

You can use str.extract to pull out the name and surname, then set_index, and finally drop the Name column:
df[['name','surname']] = df.Name.str.extract(r'([A-Z][a-z]*)([A-Z][a-z]*)', expand=True)
df = df.set_index(['name','surname']).drop('Name', axis=1)
print(df)
Value
name surname
Joe Smith 5
Cathy Smith 3
Brian Smith 12
Mary Smith 67
Joe Jones 23
Cathy Jones 98
Brian Jones 438
Mary Jones 75
Joe Collins 56
Cathy Collins 125
Brian Collins 900
Mary Collins 321

Solution
import pandas as pd
pattern = r'.*\b([A-Z][a-z]*)([A-Z][a-z]*)\b.*'
names = df.Name.str.extract(pattern, expand=True)
midx = pd.MultiIndex.from_tuples(names.values.tolist())
df.index = midx
df[['Value']]
Explanation
pattern grabs a run of letters that starts with a capital A-Z followed by any number of lowercase a-z, then a second capital A-Z followed by more lowercase a-z, splitting the match into two capture groups.
pd.MultiIndex.from_tuples creates the MultiIndex.
names.values.tolist() turns the extracted DataFrame into a list of lists, which from_tuples interprets as tuples.
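On pandas 0.24+, pd.MultiIndex.from_frame is a slightly tidier alternative to from_tuples, since it carries the column names over as level names; a minimal sketch on a cut-down version of the data above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['JoeSmith', 'CathyJones'], 'Value': [5, 98]})

# extract first and last names as a two-column DataFrame
names = df['Name'].str.extract(r'([A-Z][a-z]*)([A-Z][a-z]*)')
names.columns = ['first', 'last']

# build the MultiIndex directly from the frame; column names become level names
df.index = pd.MultiIndex.from_frame(names)
out = df[['Value']]
```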

Related

How Can I get this output using fuzzywuzzy?

If I have two dataframes (John, Alex, harry) and (ryan, kane, king), how can I use fuzzywuzzy in Python to get the following output?
fuzz.Ratio
John ryan 25
John kane 54
John king 44
alex ryan 23
alex kane 14
alex king 55
harry ryan 47
harry kane 47
harry king 50
Your ratios are wrong. What you are looking for is the Cartesian product of the corresponding columns of the two dataframes.
Sample code:
import itertools
import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.DataFrame({'name': ['John', 'Alex', 'harry']})
df2 = pd.DataFrame({'name': ['ryan', 'kane', 'king']})
for w1, w2 in itertools.product(
        df1['name'].apply(str.lower).values, df2['name'].apply(str.lower).values):
    print(f"{w1}, {w2}, {fuzz.ratio(w1, w2)}")
Output:
john, ryan, 25
john, kane, 25
john, king, 25
alex, ryan, 25
alex, kane, 50
alex, king, 0
harry, ryan, 44
harry, kane, 22
harry, king, 0
IIUC, you could do:
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
a = ('John','Alex','harry')
b = ('ryan', 'kane', 'king')
# compute the ratios for each pair
res = ((ai, bi, fuzz.ratio(ai, bi)) for ai, bi in product(a, b))
# create a DataFrame, filtering out pairs whose ratio is 0
out = pd.DataFrame([e for e in res if e[2] > 0], columns=['name_a', 'name_b', 'fuzz_ratio'])
print(out)
Output
name_a name_b fuzz_ratio
0 John ryan 25
1 John kane 25
2 John king 25
3 Alex kane 25
4 harry ryan 44
5 harry kane 22
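If fuzzywuzzy isn't available, note that fuzz.ratio on plain strings is essentially difflib.SequenceMatcher's ratio scaled to 0-100 and rounded; a dependency-free sketch (the ratio helper here is an approximation written for illustration, not fuzzywuzzy's actual code) that reproduces the numbers above:

```python
from difflib import SequenceMatcher
from itertools import product

def ratio(a, b):
    # approximate fuzz.ratio: 2*matches/total_length, scaled to 0-100 and rounded
    return round(100 * SequenceMatcher(None, a, b).ratio())

a = ('john', 'alex', 'harry')
b = ('ryan', 'kane', 'king')
pairs = [(ai, bi, ratio(ai, bi)) for ai, bi in product(a, b)]
```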

Analyzing Token Data from a Pandas Dataframe

I'm a relative python noob and also new to natural language processing (NLP).
I have a dataframe containing names and sales. I want to: 1) break out all the tokens, and 2) aggregate sales by each token.
Here's an example of the dataframe:
name sales
Mike Smith 5
Mike Jones 3
Mary Jane 4
Here's the desired output:
token sales
mike 8
mary 4
Smith 5
Jones 3
Jane 4
Thoughts on what to do? I'm using Python.
Assumption: you have a function tokenize that takes in a string as input and returns a list of tokens
I'll use this function as a tokenizer for now:
def tokenize(word):
    return word.casefold().split()
Solution
df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
In [45]: df
Out[45]:
name sales
0 Mike Smith 5
1 Mike Jones 3
2 Mary Jane 4
3 Mary Anne Jane 1
In [46]: df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
Out[46]:
tokens sales
0 anne 1
1 jane 5
2 jones 3
3 mary 5
4 mike 8
5 smith 5
Explanation
The assign step creates a column called tokens by applying the tokenize function.
Note: For this particular tokenize function - you can use df['name'].str.lower().str.split() - however this won't generalize to custom tokenizers hence the .apply(tokenize)
this generates a df that looks like
name sales tokens
0 Mike Smith 5 [mike, smith]
1 Mike Jones 3 [mike, jones]
2 Mary Jane 4 [mary, jane]
3 Mary Anne Jane 1 [mary, anne, jane]
use df.explode on this to get
name sales tokens
0 Mike Smith 5 mike
0 Mike Smith 5 smith
1 Mike Jones 3 mike
1 Mike Jones 3 jones
2 Mary Jane 4 mary
2 Mary Jane 4 jane
3 Mary Anne Jane 1 mary
3 Mary Anne Jane 1 anne
3 Mary Anne Jane 1 jane
The last step is just a groupby-agg step.
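Putting the steps above together, a self-contained version of the explode pipeline, using the three-row frame from the question:

```python
import pandas as pd

def tokenize(word):
    return word.casefold().split()

df = pd.DataFrame({'name': ['Mike Smith', 'Mike Jones', 'Mary Jane'],
                   'sales': [5, 3, 4]})

# tokenize -> explode one row per token -> sum sales per token
out = (df.assign(tokens=df['name'].apply(tokenize))
         .explode('tokens')
         .groupby('tokens')['sales'].sum()
         .reset_index())
```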
You can use the str.split() method and keep item 0 for the first name, using that as the groupby key and take the sum, then do the same for item -1 (last name) and concatenate the two.
import pandas as pd
df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
'sales': {0: 5, 1: 3, 2: 4}})
df = pd.concat([df.groupby(df.name.str.split().str[0]).sum(),
df.groupby(df.name.str.split().str[-1]).sum()]).reset_index()
df.rename(columns={'name':'token'}, inplace=True)
df[["fname", "lname"]] = df["name"].str.split(expand=True)  # split into tokens, assuming they are space-separated
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns={'fname': 'tokens'}),
                       df[['lname', 'sales']].rename(columns={'lname': 'tokens'})])
pd.DataFrame(tokens_df.groupby('tokens')['sales'].sum(), columns=['sales'])

Choose higher value based off column value between two dataframes

A question about choosing values between two dataframes.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is every row of df whose 'age' is higher than the age recorded for the same 'name' in df2.
Expected result:
age name
12 55 Anna
Per comments, use merge and filter dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name', 'age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
Try this, which looks up each name's age in df2 and keeps the rows of df that exceed it:
ages_in_df2 = df['name'].map(df2.set_index('name')['age'])
df.loc[df['age'] > ages_in_df2]
There are a few edge cases you don't mention how you would like to resolve, but generally you want to iterate down the rows, compare ages, and keep the larger. You could do so in the following manner:
df3 = pd.DataFrame(columns=['age', 'name'])
for x in range(len(df)):
    if df['age'].iloc[x] > df2['age'].iloc[x]:
        df3.loc[x] = [df['age'].iloc[x], df['name'].iloc[x]]
    else:
        df3.loc[x] = [df2['age'].iloc[x], df2['name'].iloc[x]]
You will need to modify this to reflect how you want to resolve names that appear in only one list, or lists of different sizes.
One solution that comes to mind is merge plus query:
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
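For reference, the merge-and-filter approach as a self-contained sketch on a trimmed version of the data above:

```python
import pandas as pd

df = pd.DataFrame({'age': [44, 11, 55, 43, 66],
                   'name': ['Anna', 'Anna', 'Anna', 'Bob', 'Danis']})
df2 = pd.DataFrame({'age': [44, 43, 66],
                    'name': ['Anna', 'Bob', 'Danis']})

# pair each df row with df2's age for the same name, keep only the strictly larger ones
out = (df.merge(df2, on='name', suffixes=('', '_y'))
         .query('age > age_y')[['age', 'name']])
```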

add a row at top in pandas dataframe [duplicate]

This question already has answers here:
Insert a row to pandas dataframe
(18 answers)
Closed 4 years ago.
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position:
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way, but:
df.loc[-1] = [45, 'Dean', 'male']  # adding a row
df.index = df.index + 1            # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to #Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - last row will be always on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS I wouldn't call .append(), pd.concat(), .sort_index() too frequently (for each single row) as it's pretty expensive. So the idea is to do it in chunks...
#edyvedy13's solution worked great for me, but it needs updating for the deprecation of pandas' sort method, which has been replaced by sort_index:
df.loc[-1] = [45, 'Dean', 'male']  # adding a row
df.index = df.index + 1            # shifting index
df = df.sort_index()               # sorting by index
Use pandas.concat and reindex new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate the two dataframes (.ix has been removed from pandas, so use df directly)
df2 = pd.concat([line, df]).reset_index(drop=True)
print(df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex':['male']})
df1 = pd.concat([df1, df]).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
That works
This works as well.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
...                    'age': [30,25,18,26],
...                    'sex': ['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
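The shift-and-sort trick from the top answer can be summarized as a self-contained sketch (note that a dict-built frame on modern pandas keeps insertion order, so the row list below is name, age, sex):

```python
import pandas as pd

df = pd.DataFrame({'name': ['jon', 'sam', 'jane', 'bob'],
                   'age': [30, 25, 18, 26],
                   'sex': ['male', 'male', 'female', 'male']})

# place the new row at index -1, shift every index up by one, then restore order
df.loc[-1] = ['dean', 45, 'male']
df.index = df.index + 1
df = df.sort_index()
```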

Query based on index value or value in a column in python

I have a pandas data frame from which I computed the mean scores of students. Student scores are stored in data as below:
name score
0 John 90
1 Mary 87
2 John 100
3 Suzie 90
4 Mary 88
By using meanscore = data.groupby("name").mean()
I obtain
score
name
John 95
Mary 87.5
Suzie 90
I would like to query, for instance, meanscore['score'][meanscore['name'] == 'John']. This line yields KeyError: 'name'.
I know my way of doing it is not nice, as I can actually find the meanscore of John by using meanscore['score'][0].
My question is: is there a way to find the corresponding index value of each name (e.g. [0] for John, [1] for Mary and [2] for Suzie) in my query? Thank you!!
You can use loc:
In [11]: meanscore
Out[11]:
score
name
John 95.0
Mary 87.5
Suzie 90.0
In [12]: meanscore.loc["John", "score"]
Out[12]: 95.0
You can do:
meanscore['score']['John']
Example:
>>> df
name score
0 John 90
1 Mary 87
2 John 100
3 Suzie 90
4 Mary 88
>>> meanscore = df.groupby('name').mean()
>>> meanscore
score
name
John 95.0
Mary 87.5
Suzie 90.0
>>> meanscore['score']['John']
95.0
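A self-contained version of the groupby-then-loc lookup:

```python
import pandas as pd

data = pd.DataFrame({'name': ['John', 'Mary', 'John', 'Suzie', 'Mary'],
                     'score': [90, 87, 100, 90, 88]})

# 'name' becomes the index after the groupby, so query it with .loc
meanscore = data.groupby('name').mean()
john = meanscore.loc['John', 'score']
```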
