Comparing string entries in two Pandas series - python

I have two pandas Series and would simply like to compare their string values, returning the strings (and maybe the indices too) of the values they have in common, e.g. Hannah, Frank and Ernie in the example below:
print(x)
print(y)
0 Anne
1 Beth
2 Caroline
3 David
4 Ernie
5 Frank
6 George
7 Hannah
Name: 0, dtype: object
1 Hannah
2 Frank
3 Ernie
4 NaN
5 NaN
6 NaN
7 NaN
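For reference, a minimal reconstruction of the two Series (the values and indices are assumed from the printout above):
import numpy as np
import pandas as pd

x = pd.Series(['Anne', 'Beth', 'Caroline', 'David',
               'Ernie', 'Frank', 'George', 'Hannah'])
y = pd.Series(['Hannah', 'Frank', 'Ernie', np.nan, np.nan, np.nan, np.nan],
              index=range(1, 8))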
Doing
x == y
throws a
ValueError: Can only compare identically-labeled Series objects
as does
x.sort_index(axis=0) == y.sort_index(axis=0)
and
x.reindex_like(y) > y
does something, but not the right thing!

If you need the common values only, you can convert the first Series to a set and use intersection:
a = set(x).intersection(y)
print (a)
{'Hannah', 'Frank', 'Ernie'}
For the indices, you need merge (an inner join by default) with reset_index to convert the indices to columns:
df = pd.merge(x.rename('a').reset_index(), y.rename('a').reset_index(), on='a')
print (df)
index_x a index_y
0 4 Ernie 3
1 5 Frank 2
2 7 Hannah 1
Detail:
print (x.rename('a').reset_index())
index a
0 0 Anne
1 1 Beth
2 2 Caroline
3 3 David
4 4 Ernie
5 5 Frank
6 6 George
7 7 Hannah
print (y.rename('a').reset_index())
index a
0 1 Hannah
1 2 Frank
2 3 Ernie
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
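Alternatively, if you only need the matching values of x together with their original indices, a boolean mask with Series.isin is a shorter sketch:
common = x[x.isin(y)]
print(common)
4 Ernie
5 Frank
7 Hannah
Name: 0, dtype: object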

Related

Turn 4 columns into two

In a Jupyter notebook, using pandas, I have a csv with 4 columns.
Names Number Names2 Number2
Jim 2 Greg 5
Meek 4 Drake 6
NaN 12 Tim 3
Neri 1 Nan 9
There are no duplicates between the two Name columns but there are NaN's.
I am looking to:
Create 2 new columns that append the 4 columns
Remove the NaN's in the process
Where there are NaN names, remove the associated number as well.
Desired Output
Names Number Names2 Number2 - NameList NumberList
Jim 2 Greg 5 Jim 2
Meek 4 Drake 6 Meek 4
NaN 12 Tim 3 Neri 1
Neri 1 Nan 9 Greg 5
Drake 6
Tim 3
I have tried using .append but whenever I append, my new NameList column ends up just being the same length as one of the original columns or the NaN's stay.
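For reference, the frame can be reconstructed roughly like this (assuming, as the .replace('Nan', ...) calls in the answers below suggest, that the last entry in Names2 is the literal string 'Nan'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Names': ['Jim', 'Meek', np.nan, 'Neri'],
                   'Number': [2, 4, 12, 1],
                   'Names2': ['Greg', 'Drake', 'Tim', 'Nan'],
                   'Number2': [5, 6, 3, 9]})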
This looks like pd.wide_to_long with a little modification on the first set of Names and Number columns:
d = dict(zip(['Names', 'Number'], ['Names1', 'Number1']))
(pd.wide_to_long(df.rename(columns=d).reset_index(),
                 ['Names', 'Number'], 'index', 'v')
   .dropna(subset=['Names'])
   .reset_index(drop=True))
Names Number
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
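Here wide_to_long gathers the stubbed pairs Names1/Names2 and Number1/Number2 into single Names and Number columns, storing the 1/2 suffix in the new v level; dropna(subset=['Names']) then discards the rows with a missing name (assuming the Nan entry really parses as NaN; if it is the string 'Nan', replace it first as the other answers do).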
You can try this:
import numpy as np

df = df.replace('Nan', np.nan)
df1 = (pd.concat([pd.concat([df['Names'], df['Names2']]),
                  pd.concat([df['Number'], df['Number2']])], axis=1)
         .dropna()
         .rename(columns={0: 'Nameslist', 1: 'Numberlist'})
         .reset_index(drop=True))
print(df1)
Nameslist Numberlist
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
When you want to concatenate while ignoring the column names and index, numpy can be a handy tool:
tmp = pd.DataFrame(
    np.concatenate([df[['Names', 'Number']].dropna().values,
                    df[['Names2', 'Number2']].dropna().values]),
    columns=['NameList', 'NumberList'])
It gives:
NameList NumberList
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
You can now concatenate on axis=1:
pd.concat([df, tmp], axis=1)
which gives as expected:
Names Number Names2 Number2 NameList NumberList
0 Jim 2.0 Greg 5.0 Jim 2
1 Meek 4.0 Drake 6.0 Meek 4
2 NaN 12.0 Tim 3.0 Neri 1
3 Neri 1.0 NaN 9.0 Greg 5
4 NaN NaN NaN NaN Drake 6
5 NaN NaN NaN NaN Tim 3
try this,
(pd.concat([df,
            pd.DataFrame({x.replace("2", ""): df.pop(x)
                          for x in ['Names2', 'Number2']})])
   .replace('Nan', np.nan)
   .dropna())
output,
Names Number
0 Jim 2
1 Meek 4
3 Neri 1
0 Greg 5
1 Drake 6
2 Tim 3
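One design note on this approach: df.pop removes Names2 and Number2 from df in place, so run it on a copy (e.g. df.copy()) if you still need the original four-column frame afterwards.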

Using pandas extract regex with multiple groups

I am trying to extract a number from a pandas series of strings. For example consider this series:
s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])
0 a-b-1
1 a-b-2
2 c1-d-5
3 c1-d-9
4 e-10-f-1-3.xl
5 e-10-f-2-7.s
dtype: object
There are 6 rows, and three string formats/templates (known). The goal is to extract a number for each of the rows depending on the string. Here is what I came up with:
s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
and this correctly extracts the numbers that I want from each row:
0 1 2
0 1 NaN NaN
1 2 NaN NaN
2 NaN 5 NaN
3 NaN 9 NaN
4 NaN NaN 3
5 NaN NaN 7
However, since I have three groups in the regex, I have 3 columns, and here comes the question:
Can I write a regex that has one group or that can generate a single column, or do I need to coalesce the columns into one, and how can I do that without a loop if necessary?
Desired outcome would be a series like:
0 1
1 2
2 5
3 9
4 3
5 7
Simplest thing to do is bfill/ffill:
(s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
  .bfill(axis=1)[0])
Output:
0 1
1 2
2 5
3 9
4 3
5 7
Name: 0, dtype: object
Another way is to use optional non-capturing groups:
s.str.extract('(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])')
Output:
0
0 1
1 2
2 5
3 9
4 3
5 7
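One caveat: because every prefix here is optional, this pattern will capture the first digit it finds even in strings that match none of the three templates, for example:
pd.Series(['x-9-y']).str.extract('(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])')
   0
0  9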
You could use a single capturing group at the end, and put the 3 prefixes in a non-capturing group (?:...).
As they all end with a hyphen, you could move that to after the non-capturing group to shorten it a bit:
(?:a-b|c1-d|e-10-f-[0-9])-([0-9])
s.str.extract('(?:a-b|c1-d|e-10-f-[0-9])-([0-9])')
Output
0
0 1
1 2
2 5
3 9
4 3
5 7
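If the digit you want is always the last one before an optional file extension, an assumption that happens to hold for all six sample rows, an even shorter single-group pattern would also work:
s.str.extract(r'([0-9])(?:\.\w+)?$')
which gives the same single column as above.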

Indexing by str.contains(), then inserting a value into another column

I have a dataframe of store names that I have to standardize. For example McDonalds 1234 LA -> McDonalds.
import numpy as np
import pandas as pd
import re
df = pd.DataFrame({'id': range(1, 11),
                   'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ',
                             'Taco Restaurant', 'Lidl Berlin', 'Popeyes',
                             'Wallmart', 'Aldi', 'London Lidl']})
print(df)
id store
0 1 McDonalds
1 2 Lidl
2 3 Lidl New York 123
3 4 KFC
4 5 Taco Restaurant
5 6 Lidl Berlin
6 7 Popeyes
7 8 Wallmart
8 9 Aldi
9 10 London Lidl
So let's say I want to standardize the Lidl stores. The standard name will just be "Lidl".
I would like to find where Lidl is in the dataframe, create a new column df['standard_name'], and insert the standard name there. However, I can't figure this out.
I'll first create the column where the standard name will be inserted:
df['standard_name'] = np.nan
Then search for instances of Lidl, and insert the cleaned name into standard_name.
First of all the plan is to use str.contains and then set the standardized value to the new column:
df[df.store.str.contains(r'\blidl\b',re.I,regex=True)]['standard'] = 'Lidl'
print(df)
id store standard_name
0 1 McDonalds NaN
1 2 Lidl NaN
2 3 Lidl New York 123 NaN
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin NaN
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl NaN
Nothing has been inserted. I checked the str.contains code alone, and found it returned all False:
df.store.str.contains(r'\blidl\b',re.I,regex=True)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: store, dtype: bool
I'm not sure what's happening here.
What I am trying to end up with is the standardized names filled in like this:
id store standard_name
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
I will be trying to standardize the majority of business names in the dataset (mcdonalds, burger king, etc.). Any help appreciated.
Also, is this the fastest way to do this? There are millions of rows to process.
If you want to set a new column, you can use DataFrame.loc with case=False or flags=re.I. Your version failed for two reasons: re.I was passed positionally into the case parameter of str.contains (so the match stayed case-sensitive and 'lidl' never matched 'Lidl'), and df[mask]['standard'] = ... is chained indexing, which assigns to a temporary copy rather than to df.
Notice: df['standard_name'] = np.nan is not necessary; you can omit it.
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
#alternative
#df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
Or it is possible to use another approach, Series.str.extract:
df['standard'] = df['store'].str.extract(r'(?i)(\blidl\b)')
#alternative
#df['standard'] = df['store'].str.extract(r'(\blidl\b)', re.I)
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
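Since the question mentions standardizing many chains across millions of rows, here is a rough sketch of how the extract idea scales to several brands in one vectorized pass; the brands dict below is an illustrative assumption, not part of the original question:
# hypothetical mapping: lowercase pattern -> canonical name
brands = {'lidl': 'Lidl', 'mcdonalds': 'McDonalds', 'burger king': 'Burger King'}

# build one case-insensitive alternation from the keys, extract the match,
# then map it back to the canonical spelling
pattern = r'(?i)\b(' + '|'.join(brands) + r')\b'
df['standard_name'] = df['store'].str.extract(pattern)[0].str.lower().map(brands)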

Creating dataframe from another dataframe and list

I have a list of names like this:
names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
and I have a dataframe like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Barsa
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
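For reference, a minimal reconstruction of this frame (values assumed from the printout):
import pandas as pd

df = pd.DataFrame({'Number': range(1, 11),
                   'Names': ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse',
                             'Barsa', 'Barsa', 'Talyo', 'Jon', 'Zidane']})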
I want to create a dataframe that has all the names in the names list and the corresponding numbers from this dataframe grouped; for the names that do not have corresponding numbers there should be an asterisk, like below:
Names Number
Josh 1
Jon 2,9
Adam 3
Barsa 4,6,7
Fekse 5
Bravo *
Talyo 8
Zidane 10
Do we have any built-in functions to get this done?
You can use GroupBy with str.join, then reindex with your names list:
res = (df.groupby('Names')['Number']
         .apply(lambda x: ','.join(map(str, x)))
         .to_frame()
         .reindex(names)
         .fillna('*')
         .reset_index())
print(res)
Names Number
0 Josh 1
1 Jon 2,9
2 Adam 3
3 Barsa 4,6,7
4 Fekse 5
5 Bravo *
6 Talyo 8
7 Zidane 10
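An equivalent sketch that avoids the lambda by casting Number to string up front, and uses reindex's fill_value instead of fillna:
res = (df.astype({'Number': str})
         .groupby('Names')['Number'].agg(','.join)
         .reindex(names, fill_value='*')
         .reset_index())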

Select rows from a DataFrame based on string values in a column in pandas

How do I select rows from a DataFrame based on string values in a column in pandas? I just want to display the states only, which are in all CAPS.
The states have the total number of cities.
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline
d = pd.read_csv("states.csv")
print(d)
# States/cities B C D
# 0 FL 3 5 6
# 1 Orlando 1 2 3
# 2 Miami 1 1 3
# 3 Jacksonville 1 2 0
# 4 CA 8 3 2
# 5 San diego 3 1 0
# 6 San Francisco 5 2 2
# 7 WA 4 2 1
# 8 Seattle 3 1 0
# 9 Tacoma 1 1 1
How to display like so,
# States/Cites B C D
# 0 FL 3 5 6
# 4 CA 8 3 2
# 7 WA 4 2 1
You can write a function to be applied to each value in the States/cities column. Have the function return either True or False, and the result of applying the function can act as a Boolean filter on your DataFrame.
This is a common pattern when working with pandas. In your particular case, you could check for each value in States/cities whether it's made of only uppercase letters.
So for example:
def is_state_abbrev(string):
    return string.isupper()

filter = d['States/cities'].apply(is_state_abbrev)
filtered_df = d[filter]
Here filter will be a pandas Series with True and False values.
You can also achieve the same result by using a lambda expression, as in:
filtered_df = d[d['States/cities'].apply(lambda x: x.isupper())]
This does essentially the same thing.
Consider pandas.Series.str.match, passing a regex that allows only [A-Z]:
states[states['States/cities'].str.match('^[A-Z]+$')]
# States/cities B C D
# 0 FL 3 5 6
# 4 CA 8 3 2
# 7 WA 4 2 1
Data
from io import StringIO
import pandas as pd
txt = '''"States/cities" B C D
0 FL 3 5 6
1 Orlando 1 2 3
2 Miami 1 1 3
3 Jacksonville 1 2 0
4 CA 8 3 2
5 "San diego" 3 1 0
6 "San Francisco" 5 2 2
7 WA 4 2 1
8 Seattle 3 1 0
9 Tacoma 1 1 1'''
states = pd.read_table(StringIO(txt), sep=r"\s+")
You can get the rows with all uppercase values in the column States/cities like this:
df.loc[df['States/cities'].str.isupper()]
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
Just to be safe, you can add a condition so that it only returns the rows where 'States/cities' is uppercase and only 2 characters long (in case you had a value that was SEATTLE or something like that):
df.loc[(df['States/cities'].str.isupper()) & (df['States/cities'].apply(len) == 2)]
You can use str.contains to filter out any row that contains lowercase letters:
df[~df['States/cities'].str.contains('[a-z]')]
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
If we assume the order is always a state followed by the cities from that state, we can use where and dropna:
df['States/cities'] = df['States/cities'].where(df['States/cities'].isin(['FL', 'CA', 'WA']))
df = df.dropna()
df
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
Or we can use str.len:
df[df['States/cities'].str.len()==2]
Out[39]:
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
