I have a dataframe of store names that I have to standardize. For example McDonalds 1234 LA -> McDonalds.
import pandas as pd
import numpy as np
import re
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ',
              'Taco Restaurant', 'Lidl Berlin', 'Popeyes',
              'Wallmart', 'Aldi', 'London Lidl'],
})
print(df)
id store
0 1 McDonalds
1 2 Lidl
2 3 Lidl New York 123
3 4 KFC
4 5 Taco Restaurant
5 6 Lidl Berlin
6 7 Popeyes
7 8 Wallmart
8 9 Aldi
9 10 London Lidl
So let's say I want to standardize the Lidl stores. The standard name will just be "Lidl".
I would like to find where Lidl is in the dataframe, create a new column df['standard_name'], and insert the standard name there. However, I can't figure this out.
I'll first create the column where the standard name will be inserted:
df['standard_name'] = np.nan
Then I search for instances of Lidl and insert the cleaned name into standard_name. The plan is to use str.contains and then set the standardized value in the new column:
df[df.store.str.contains(r'\blidl\b',re.I,regex=True)]['standard_name'] = 'Lidl'
print(df)
id store standard_name
0 1 McDonalds NaN
1 2 Lidl NaN
2 3 Lidl New York 123 NaN
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin NaN
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl NaN
Nothing has been inserted. I checked the str.contains call alone, and found it returned False for every row:
df.store.str.contains(r'\blidl\b',re.I,regex=True)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: store, dtype: bool
I'm not sure what's happening here.
What I am trying to end up with is the standardized names filled in like this:
id store standard_name
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
I will be trying to standardize the majority of business names in the dataset (McDonalds, Burger King, etc.). Any help appreciated.
Also, is this the fastest way to do this? There are millions of rows to process.
If you want to set a new column you can use DataFrame.loc with case=False or flags=re.I. Your version fails for two reasons: re.I is passed positionally into str.contains's second parameter, case, so the pattern is still matched case-sensitively and the lowercase 'lidl' never matches; and df[...]['standard_name'] = 'Lidl' is chained indexing, which assigns to a temporary copy rather than to df.
Notice: df['standard_name'] = np.nan is not necessary, you can omit it.
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
#alternative
#df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
Or it is possible to use another approach, Series.str.extract:
df['standard'] = df['store'].str.extract(r'(?i)(\blidl\b)')
#alternative
#df['standard'] = df['store'].str.extract(r'(\blidl\b)', flags=re.I)
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
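As for the follow-up about speed: both approaches are vectorized, but each brand costs one pass over the column. A minimal sketch of how this could scale to many brands (the patterns dict below is made up for illustration, not from the original post):
import numpy as np

# Hypothetical brand -> regex table; extend with your real brands.
patterns = {
    'Lidl': r'\blidl\b',
    'McDonalds': r"\bmcdonald'?s\b",
    'Wallmart': r'\bwall?mart\b',
}

df['standard'] = np.nan
for name, pat in patterns.items():
    hit = df['store'].str.contains(pat, case=False, na=False)
    # fill only rows that no earlier pattern has already claimed
    df.loc[hit & df['standard'].isna(), 'standard'] = name
This stays one vectorized scan per brand, which is usually acceptable even for millions of rows; profiling on a sample of your data is the only reliable way to know for sure.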
I have 2 tables:
date        James  Jamie  John  Allysia  Jean
2022-01-01    NaN      6     5        4     3
2022-01-02      7      6     7      NaN     5

names    groupings
James    guy
John     guy
Jamie    girl
Allysia  girl
Jean     girl

into

date        James  Jamie  John  Allysia  Jean  girl  guy
2022-01-01    NaN      6     5        4     3     5    5
2022-01-02      7      6     7      NaN     5   5.5    7
threshold = 3 (only scores above 3 count)
I want to create new columns with the guys'/girls' mean scores, where only scores above the threshold are taken, ignoring NaN and scores that do not meet the threshold.
I do not know how to replace scores below the threshold with NaN.
I tried doing a groupby to get the names into lists and then create the new columns with the mean:
groupingseries = groupings.groupby(['groupings'])['names'].apply(list)
for k, s in zip(groupingseries.keys(), groupingseries):
    try:
        its = '"' + ',"'.join(s) + '"'
        df[k] = df[s].mean()
    except:
        print('not in item')
Not sure why the results return NaN for girl and guy.
Please do help.
Assuming df and groupings are your two input DataFrames (your loop yields NaN because df[s].mean() returns a Series indexed by the column names, which cannot align with df's row index on assignment):
out = df.join(
    df.groupby(df.columns.map(groupings.set_index('names')['groupings']), axis=1)
      .sum()
)
Output:
date James Jamie John Allysia Jean girl guy
0 2022-01-01 NaN 6 5 4.0 3 13.0 5.0
1 2022-01-02 7.0 6 7 NaN 5 11.0 14.0
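Note that the join above aggregates with a sum and ignores the stated threshold. A sketch that reproduces the asked-for girl/guy columns (assuming the threshold means strictly greater than 3, and using a transpose instead of the axis=1 groupby that newer pandas deprecates):
import pandas as pd

df = pd.DataFrame({
    'date': ['2022-01-01', '2022-01-02'],
    'James': [None, 7], 'Jamie': [6, 6], 'John': [5, 7],
    'Allysia': [4, None], 'Jean': [3, 5],
})
groupings = pd.DataFrame({
    'names': ['James', 'John', 'Jamie', 'Allysia', 'Jean'],
    'groupings': ['guy', 'guy', 'girl', 'girl', 'girl'],
})

scores = df.set_index('date')
masked = scores.where(scores > 3)  # drop scores at or below the threshold
group_map = scores.columns.map(groupings.set_index('names')['groupings'])
means = masked.T.groupby(group_map).mean().T  # per-group mean, NaNs ignored
out = df.join(means.reset_index(drop=True))
This yields girl/guy values of 5.0/5.0 for 2022-01-01 and 5.5/7.0 for 2022-01-02, matching the expected table.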
I want to split a dataframe into 3 new dataframes according to a priority column. My dataframe is as follows:
City Priority
0 New York 3
1 Paris 1
2 Boston 7
3 La Habana 6
4 Bilbao 10
5 Roma 2
6 Barcelona 1
7 Bruselas 8
8 Tokyo 7
9 Caracas 11
There are 3 types of priorities:
Priority 7 to 9
Priority 1 to 6
Priority from 10 to 11
The idea is to divide this dataframe into 3 dataframes with the following structure, each in turn ordered by the value of its priority:
Dataframe with 3 rows of priority from 7 to 9
Dataframe with 5 rows of priority from 1 to 6
Dataframe with 2 rows of priority from 10 to 11
The result would be as follows:
Dataframe 1:
City Priority
0 Boston 7
1 Tokyo 7
2 Bruselas 8
Dataframe 2:
City Priority
0 Paris 1
1 Barcelona 1
2 Roma 2
3 New York 3
4 La Habana 6
Dataframe 3:
City Priority
0 Bilbao 10
1 Caracas 11
I think it is important to note that if there were no rows of priority 7 to 9, the priority numbers chosen for that 3-row dataframe would be 10, failing that 11, failing that 1, then 2, and so on. The same applies to the rest of the dataframes and priorities: 1, 2, 3, 4, etc. for the second one and 10, 11, 1, 2, 3, etc. for the third one.
Also, if there were 4 values such as 7, 7, 7, 8, only rows 7, 7, 7 would appear in the 3-row Dataframe and the row with value 8 would be in Dataframe 2.
Likewise, I think it is also important to say that in that iteration, when the first dataframe of 3 rows is generated, they should be "extracted" from the original dataframe so that they are not taken into account when generating the other dataframes. I hope I have explained myself well and that someone can help me. Best regards and thanks!
IIUC this should work as expected:
(1) Create a column bin_Priority which assigns each row to the right bin; the bin labels encode the order in which the bins should be picked.
(2) sort_values on bin_Priority, then within each bin on Priority.
(3) Split your df into 3 df's, the 1st with 3 rows, the 2nd with 2 rows and the 3rd with 5 rows.
If values of a priority group are missing, the right replacement values are chosen automatically because the rows are already in the correct order.
Please let me know if that is what you are searching for.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'City': ['New York','Paris','Boston','La Habana','Bilbao','Roma','Barcelona','Bruselas','Tokyo','Caracas'],
    'Priority': [3, 1, 7, 6, 10, 2, 1, 8, 7, 11]
})
#(1) .to_numpy() drops the ordered-categorical dtype returned by pd.cut so the sort below is plain numeric
df['bin_Priority'] = pd.cut(df['Priority'], bins=[0,6,9,11], labels=[3, 1, 2]).to_numpy()
#(2)
ordered_priority_df = df.sort_values(by=['bin_Priority', 'Priority'])
#(3)
out = np.split(ordered_priority_df, [3,5])
print(df, ordered_priority_df, *out, sep='\n\n')
#df
City Priority bin_Priority
0 New York 3 3
1 Paris 1 3
2 Boston 7 1
3 La Habana 6 3
4 Bilbao 10 2
5 Roma 2 3
6 Barcelona 1 3
7 Bruselas 8 1
8 Tokyo 7 1
9 Caracas 11 2
#ordered_priority_df
City Priority bin_Priority
2 Boston 7 1
8 Tokyo 7 1
7 Bruselas 8 1
4 Bilbao 10 2
9 Caracas 11 2
1 Paris 1 3
6 Barcelona 1 3
5 Roma 2 3
0 New York 3 3
3 La Habana 6 3
# out[0]
City Priority bin_Priority
2 Boston 7 1
8 Tokyo 7 1
7 Bruselas 8 1
# out[1]
City Priority bin_Priority
4 Bilbao 10 2
9 Caracas 11 2
# out[2]
City Priority bin_Priority
1 Paris 1 3
6 Barcelona 1 3
5 Roma 2 3
0 New York 3 3
3 La Habana 6 3
Here is an example where I changed the value of Paris from 1 to 7. The value 8 (which would otherwise be in the 1st df) ends up in the 2nd df, and likewise the value 11 moves from the 2nd df to the 3rd.
df = pd.DataFrame({
'City': ['New York','Paris','Boston','La Habana','Bilbao','Roma','Barcelona','Bruselas','Tokyo','Caracas'],
'Priority': [3, 7, 7, 6, 10, 2, 1, 8, 7, 11]
})
df['bin_Priority'] = pd.cut(df['Priority'], bins=[0,6,9,11], labels=[3, 1, 2]).to_numpy()
ordered_priority_df = df.sort_values(by=['bin_Priority', 'Priority'])
out = np.split(ordered_priority_df, [3,5])
print(df, *out, sep='\n\n')
City Priority bin_Priority
0 New York 3 3
1 Paris 7 1
2 Boston 7 1
3 La Habana 6 3
4 Bilbao 10 2
5 Roma 2 3
6 Barcelona 1 3
7 Bruselas 8 1
8 Tokyo 7 1
9 Caracas 11 2
City Priority bin_Priority
1 Paris 7 1
2 Boston 7 1
8 Tokyo 7 1
City Priority bin_Priority
7 Bruselas 8 1
4 Bilbao 10 2
City Priority bin_Priority
9 Caracas 11 2
6 Barcelona 1 3
5 Roma 2 3
0 New York 3 3
3 La Habana 6 3
I want to replace the direction words in the strings column: both when one appears alone and when several appear joined by a comma and a space.
id strings
0 1 south
1 2 north
2 3 east
3 4 west
4 5 west, east, south
5 6 west, west
6 7 north, north
7 8 north, south
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
My expected result will look like this. Please note that when they are components of a phrase or of longer words, I don't need to replace them.
Is it possible to do that? Thank you.
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
The following code works, but I just wonder if there are some more concise methods?
df['strings'].astype(str).replace('south', np.nan).replace('north', np.nan)\
.replace('west', np.nan).replace('east', np.nan).replace('west, east', np.nan)\
.replace('west, west', np.nan).replace('north, north', np.nan).replace('west, east', np.nan)\
.replace('north, south', np.nan)
First use Series.str.split, forward fill to replace the missing values, test whether all values match with DataFrame.isin plus DataFrame.all to build a mask, and finally set missing values with Series.mask:
L = ['south','north','east','west']
# split into word columns; ffill pads short rows with their last word so the
# NaNs don't defeat isin(), then require every column to be a direction word
m = df['strings'].str.split(', ', expand=True).ffill(axis=1).isin(L).all(axis=1)
df['strings'] = df['strings'].mask(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
Another idea with sets, isdisjoint and Series.where:
m = [set(x.split(', ')).isdisjoint(L) for x in df['strings']]
df['strings'] = df['strings'].where(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
Using Regex.
Ex:
df = pd.DataFrame({'strings': ['south', 'north', 'east', 'west', 'west, east, south',
                               'west, west', 'north, north', 'north, south',
                               'West Corporation global office', 'West-Riding',
                               'University of West Florida', 'Southwest']})
df['R'] = df['strings'].replace(r"\b(south|north|east|west)\b,?", np.nan, regex=True)
print(df)
Output:
strings R
0 south NaN
1 north NaN
2 east NaN
3 west NaN
4 west, east, south NaN
5 west, west NaN
6 north, north NaN
7 north, south NaN
8 West Corporation global office West Corporation global office
9 West-Riding West-Riding
10 University of West Florida University of West Florida
11 Southwest Southwest
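One caveat worth noting (my observation, not part of the original answer): Series.replace with regex=True and a non-string replacement blanks the whole cell whenever the pattern matches anywhere inside it, so a hypothetical lowercase value like 'go west, young man' would also become NaN. A stricter variant anchors the entire string with Series.str.fullmatch (available since pandas 1.1):
# Sketch: blank out only cells made up entirely of direction words joined by ", "
pat = r'(?:south|north|east|west)(?:, (?:south|north|east|west))*'
df['R'] = df['strings'].where(~df['strings'].str.fullmatch(pat))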
I have a dataframe of store names that I have to standardize. For example McDonalds 1234 LA -> McDonalds. You can see below that Popeyes and Wallmart have already been standardized:
id store standard
0 1 McDonalds NaN
1 2 Lidl NaN
2 3 Lidl New York 123 NaN
3 4 KFC NaN
4 5 Slidling Shop NaN
5 6 Lidi Berlin NaN
6 7 Popeyes NY Popeyes
7 8 Wallmart LA 90210 Wallmart
8 9 Aldi NaN
9 10 London Lidl NaN
I use str.contains to find the store name, and place the standardized name into the standard column. Here I am standardizing Lidl stores:
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
print(df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Slidling Shop NaN
5 6 Lidi Berlin NaN
6 7 Popeyes NY Popeyes
7 8 Wallmart LA 90210 Wallmart
8 9 Aldi NaN
9 10 London Lidl Lidl
However the problem here is that it is searching str.contains on rows that have already been standardized (Popeyes and Wallmart).
How can I run str.contains only on rows where df['standard'] == NaN and ignore the standardized rows?
I have tried something very messy, and it doesn't seem to work. I set a mask and then use that before running str.contains:
mask = df['standard'].isna()
df[mask].loc[df[mask].store.str.contains(aldi_regex,na=False), 'standard3'] = 'Aldi'
That does not work. I have also tried something even messier, and it didn't work either:
df.loc[mask].loc[df.loc[mask].store.str.contains(aldi_regex,na=False), 'standard3'] = 'Aldi'
How can I ignore the standardized rows? Without resorting to a for loop.
My example dataframe:
import pandas as pd
import numpy as np
import re

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC', 'Slidling Shop',
              'Lidi Berlin', 'Popeyes NY', 'Wallmart LA 90210', 'Aldi', 'London Lidl'],
    'standard': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
                 'Popeyes', 'Wallmart', np.nan, np.nan],
})
How can I ignore the standardized rows? Without resorting to a for loop.
By also checking for null values in the boolean filter:
df.loc[df['standard'].isnull() & df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
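An equivalent formulation, as a sketch, uses Series.mask so the "fill only missing rows" rule reads explicitly:
hit = df['store'].str.contains(r'\blidl\b', case=False, na=False)
df['standard'] = df['standard'].mask(df['standard'].isna() & hit, 'Lidl')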
I have two pandas Series and would simply like to compare their string values, returning the strings (and maybe the indices too) of the values they have in common, e.g. Hannah, Frank and Ernie in the example below:
print(x)
print(y)
0 Anne
1 Beth
2 Caroline
3 David
4 Ernie
5 Frank
6 George
7 Hannah
Name: 0, dtype: object
1 Hannah
2 Frank
3 Ernie
4 NaN
5 NaN
6 NaN
7 NaN
Doing
x == y
throws a
ValueError: Can only compare identically-labeled Series objects
as does
x.sort_index(axis=0) == y.sort_index(axis=0)
and
x.reindex_like(y) > y
does something, but not the right thing!
If you need the common values only, you can convert the first Series to a set and use intersection (x == y fails here because element-wise comparison requires the two Series to share identical labels):
a = set(x).intersection(y)
print (a)
{'Hannah', 'Frank', 'Ernie'}
And for the indices, use merge (an inner join by default) after reset_index to convert the indices to columns:
df = pd.merge(x.rename('a').reset_index(), y.rename('a').reset_index(), on='a')
print (df)
index_x a index_y
0 4 Ernie 3
1 5 Frank 2
2 7 Hannah 1
Detail:
print (x.rename('a').reset_index())
index a
0 0 Anne
1 1 Beth
2 2 Caroline
3 3 David
4 4 Ernie
5 5 Frank
6 6 George
7 7 Hannah
print (y.rename('a').reset_index())
index a
0 1 Hannah
1 2 Frank
2 3 Ernie
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
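Another compact option (my addition, not from the original answer): boolean indexing with Series.isin returns the common values together with x's indices in one step:
print(x[x.isin(y)])
4     Ernie
5     Frank
7    Hannah
Name: 0, dtype: object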