I have a dataframe of store names that I have to standardize. For example, McDonalds 1234 LA -> McDonalds. You can see below that Popeyes and Wallmart have already been standardized:
id store standard
0 1 McDonalds NaN
1 2 Lidl NaN
2 3 Lidl New York 123 NaN
3 4 KFC NaN
4 5 Slidling Shop NaN
5 6 Lidi Berlin NaN
6 7 Popeyes NY Popeyes
7 8 Wallmart LA 90210 Wallmart
8 9 Aldi NaN
9 10 London Lidl NaN
I use str.contains to find the store name, and place the standardized name into the standard column. Here I am standardizing Lidl stores:
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
print(df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Slidling Shop NaN
5 6 Lidi Berlin NaN
6 7 Popeyes NY Popeyes
7 8 Wallmart LA 90210 Wallmart
8 9 Aldi NaN
9 10 London Lidl Lidl
However, the problem here is that str.contains is run on rows that have already been standardized (Popeyes and Wallmart).
How can I run str.contains only on rows where df['standard'] is NaN, and ignore the standardized rows?
I have tried something very messy, and it doesn't seem to work. I set a mask and then use it before running str.contains:
mask = df['standard'].isna()
df[mask].loc[df[mask].store.str.contains(aldi_regex,na=False), 'standard3'] = 'Aldi'
That does not work. I have also tried something even messier, and it didn't work either:
df.loc[mask].loc[df.loc[mask].store.str.contains(aldi_regex,na=False), 'standard3'] = 'Aldi'
How can I ignore the standardized rows, without resorting to a for loop?
My example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC',
              'Slidling Shop', 'Lidi Berlin', 'Popeyes NY',
              'Wallmart LA 90210', 'Aldi', 'London Lidl'],
    'standard': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
                 'Popeyes', 'Wallmart', np.nan, np.nan],
})
You can ignore the standardized rows by also filtering for null values:
df.loc[df['standard'].isnull() & df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
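If you have many brands to standardize, you can keep the same idea and loop over a pattern-to-name mapping (a loop over patterns, not over rows). This is just a sketch; the mapping below is hypothetical:

# Hypothetical mapping of regex patterns to standardized names.
patterns = {
    r'\blidl\b': 'Lidl',
    r'\baldi\b': 'Aldi',
    r'\bwallmart\b': 'Wallmart',
}
for pat, name in patterns.items():
    # Re-check isna() on each pass so rows standardized earlier are skipped.
    todo = df['standard'].isna()
    df.loc[todo & df['store'].str.contains(pat, case=False), 'standard'] = name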
I want to split a dataframe into 3 new dataframes according to a priority column. My dataframe is as follows:
City Priority
0 New York 3
1 Paris 1
2 Boston 7
3 La Habana 6
4 Bilbao 10
5 Roma 2
6 Barcelona 1
7 Bruselas 8
8 Tokyo 7
9 Caracas 11
There are 3 types of priorities:
Priority 7 to 9
Priority 1 to 6
Priority 10 to 11
The idea is to divide this dataframe into 3 dataframes with the following structure, each one in turn sorted by its priority value:
A dataframe with 3 rows of priority 7 to 9
A dataframe with 5 rows of priority 1 to 6
A dataframe with 2 rows of priority 10 to 11.
The result would be as follows:
Dataframe 1:
City Priority
0 Boston 7
1 Tokyo 7
2 Bruselas 8
Dataframe 2:
City Priority
0 Paris 1
1 Barcelona 1
2 Roma 2
3 New York 3
4 La Habana 6
Dataframe 3:
City Priority
0 Bilbao 10
1 Caracas 11
I think it is important to note that if there were no rows with priority 7 to 9, the 3-row dataframe would fall back to priority 10, then 11, then 1, then 2, and so on. The same applies to the other dataframes: 1, 2, 3, 4, etc. for the second one, and 10, 11, 1, 2, 3, etc. for the third one.
Also, if there were 4 values such as 7, 7, 7, 8, only rows 7, 7, 7 would appear in the 3-row Dataframe and the row with value 8 would be in Dataframe 2.
Likewise, I think it is also important to say that when the first dataframe of 3 rows is generated, its rows should be "extracted" from the original dataframe so that they are not taken into account when generating the other dataframes. I hope I have explained myself well and that someone can help me. Best regards and thanks!
IIUC this should work as expected:
(1) Create a column bin_Priority that assigns each row to the right bin; the bin labels give the order in which the groups are picked.
(2) sort_values by bin_Priority, then within each bin by Priority.
(3) Split your df into 3 dfs: the 1st with 3 rows, the 2nd with 2 rows and the 3rd with 5 rows.
If values of a Priority group are missing, the right fallback values are chosen because the rows are already in the correct order.
Please let me know if this is what you are looking for.
import numpy as np
import pandas as pd

df = pd.DataFrame({
'City': ['New York','Paris','Boston','La Habana','Bilbao','Roma','Barcelona','Bruselas','Tokyo','Caracas'],
'Priority': [3, 1, 7, 6, 10, 2, 1, 8, 7, 11]
})
#(1)
df['bin_Priority'] = pd.cut(df['Priority'], bins=[0,6,9,11], labels=[3, 1, 2]).to_numpy()
#(2)
ordered_priority_df = df.sort_values(by=['bin_Priority', 'Priority'])
#(3)
out = np.split(ordered_priority_df, [3,5])
print(df, ordered_priority_df, *out, sep='\n\n')
#df
City Priority bin_Priority
0 New York 3 3
1 Paris 1 3
2 Boston 7 1
3 La Habana 6 3
4 Bilbao 10 2
5 Roma 2 3
6 Barcelona 1 3
7 Bruselas 8 1
8 Tokyo 7 1
9 Caracas 11 2
#ordered_priority_df
City Priority bin_Priority
2 Boston 7 1
8 Tokyo 7 1
7 Bruselas 8 1
4 Bilbao 10 2
9 Caracas 11 2
1 Paris 1 3
6 Barcelona 1 3
5 Roma 2 3
0 New York 3 3
3 La Habana 6 3
# out[0]
City Priority bin_Priority
2 Boston 7 1
8 Tokyo 7 1
7 Bruselas 8 1
# out[1]
City Priority bin_Priority
4 Bilbao 10 2
9 Caracas 11 2
# out[2]
City Priority bin_Priority
1 Paris 1 3
6 Barcelona 1 3
5 Roma 2 3
0 New York 3 3
3 La Habana 6 3
Here is an example where I changed the value of Paris from 1 to 7. Value 8 (which would otherwise be in the 1st df) ends up in the 2nd df, and likewise value 11 moves from the 2nd df to the 3rd.
df = pd.DataFrame({
'City': ['New York','Paris','Boston','La Habana','Bilbao','Roma','Barcelona','Bruselas','Tokyo','Caracas'],
'Priority': [3, 7, 7, 6, 10, 2, 1, 8, 7, 11]
})
df['bin_Priority'] = pd.cut(df['Priority'], bins=[0,6,9,11], labels=[3, 1, 2]).to_numpy()
ordered_priority_df = df.sort_values(by=['bin_Priority', 'Priority'])
out = np.split(ordered_priority_df, [3,5])
print(df, *out, sep='\n\n')
City Priority bin_Priority
0 New York 3 3
1 Paris 7 1
2 Boston 7 1
3 La Habana 6 3
4 Bilbao 10 2
5 Roma 2 3
6 Barcelona 1 3
7 Bruselas 8 1
8 Tokyo 7 1
9 Caracas 11 2
City Priority bin_Priority
1 Paris 7 1
2 Boston 7 1
8 Tokyo 7 1
City Priority bin_Priority
7 Bruselas 8 1
4 Bilbao 10 2
City Priority bin_Priority
9 Caracas 11 2
6 Barcelona 1 3
5 Roma 2 3
0 New York 3 3
3 La Habana 6 3
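If you also want each piece to carry the fresh 0-based index shown in the expected output (and to drop the helper column), a small follow-up to the sketch above:

out = [d.drop(columns='bin_Priority').reset_index(drop=True) for d in out]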
I want to replace the direction words in the strings column: they may appear either alone or several together, joined by a comma and a space.
id strings
0 1 south
1 2 north
2 3 east
3 4 west
4 5 west, east, south
5 6 west, west
6 7 north, north
7 8 north, south
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
My expected result looks like this. Please note that when the direction words are part of a phrase or of a larger word, I don't want them replaced.
Is it possible to do that? Thank you.
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
The following code works, but I just wonder if there are some more concise methods?
df['strings'].astype(str).replace('south', np.nan).replace('north', np.nan)\
.replace('west', np.nan).replace('east', np.nan).replace('west, east', np.nan)\
.replace('west, west', np.nan).replace('north, north', np.nan).replace('west, east', np.nan)\
.replace('north, south', np.nan)
First use Series.str.split, forward-fill to replace the missing values, test whether all the split values match with DataFrame.isin plus DataFrame.all to build a mask, and finally set the missing values with Series.mask:
L = ['south','north','east','west']
m = df['strings'].str.split(', ', expand=True).ffill(axis=1).isin(L).all(axis=1)
df['strings'] = df['strings'].mask(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
Another idea with sets, isdisjoint and Series.where:
m = [set(x.split(', ')).isdisjoint(L) for x in df['strings']]
df['strings'] = df['strings'].where(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
Using Regex.
Ex:
df = pd.DataFrame({'strings': ['south', 'north', 'east', 'west', 'west, east, south', 'west, west', 'north, north', 'north, south', 'West Corporation global office', 'West-Riding', 'University of West Florida', 'Southwest']})
df['R'] = df['strings'].replace(r"\b(south|north|east|west)\b,?", np.nan, regex=True)
print(df)
Output:
strings R
0 south NaN
1 north NaN
2 east NaN
3 west NaN
4 west, east, south NaN
5 west, west NaN
6 north, north NaN
7 north, south NaN
8 West Corporation global office West Corporation global office
9 West-Riding West-Riding
10 University of West Florida University of West Florida
11 Southwest Southwest
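For completeness, pandas >= 1.1 also offers Series.str.fullmatch, which states directly that the whole string must be direction words joined by ", ". A sketch along those lines:

word = r'(?:south|north|east|west)'
# fullmatch anchors at both ends, so 'Southwest' and 'West-Riding' survive.
m = df['strings'].str.fullmatch(f'{word}(?:, {word})*')
df['strings'] = df['strings'].mask(m)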
I have a dataframe of store names that I have to standardize. For example McDonalds 1234 LA -> McDonalds.
import numpy as np
import pandas as pd
import re

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC',
              'Taco Restaurant', 'Lidl Berlin', 'Popeyes',
              'Wallmart', 'Aldi', 'London Lidl'],
})
print(df)
id store
0 1 McDonalds
1 2 Lidl
2 3 Lidl New York 123
3 4 KFC
4 5 Taco Restaurant
5 6 Lidl Berlin
6 7 Popeyes
7 8 Wallmart
8 9 Aldi
9 10 London Lidl
So let's say I want to standardize the Lidl stores. The standard name will just be "Lidl".
I would like to find where Lidl occurs in the dataframe, create a new column df['standard_name'], and insert the standard name there. However I can't figure this out.
I'll first create the column where the standard name will be inserted:
df['standard_name'] = np.nan
Then search for instances of Lidl, and insert the cleaned name into standard_name.
First of all the plan is to use str.contains and then set the standardized value to the new column:
df[df.store.str.contains(r'\blidl\b',re.I,regex=True)]['standard'] = 'Lidl'
print(df)
id store standard_name
0 1 McDonalds NaN
1 2 Lidl NaN
2 3 Lidl New York 123 NaN
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin NaN
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl NaN
Nothing has been inserted. I checked the str.contains call on its own, and found it returned all False:
df.store.str.contains(r'\blidl\b',re.I,regex=True)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: store, dtype: bool
I'm not sure what's happening here.
What I am trying to end up with is the standardized names filled in like this:
id store standard_name
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
I will be trying to standardize the majority of business names in the dataset: McDonalds, Burger King, etc. Any help appreciated.
Also, is this the fastest way to do this? There are millions of rows to process.
If you want to set a new column you can use DataFrame.loc with case=False (or flags=re.I). Your version fails for two reasons: the second positional argument of str.contains is case, so passing re.I there merely sets case to a truthy value and the match stays case-sensitive; and df[...]['standard'] = ... is chained indexing, which assigns to a copy.
Notice: df['standard_name'] = np.nan is not necessary; you can omit it.
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
#alternative
#df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
Or it is possible to use another approach, Series.str.extract:
df['standard'] = df['store'].str.extract(r'(?i)(\blidl\b)')
#alternative
#df['standard'] = df['store'].str.extract(r'(\blidl\b)', re.I)
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
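Regarding speed with millions of rows: a single pass with one combined alternation is usually cheaper than calling str.contains once per brand. A sketch of that idea (the brand list is hypothetical, and it assumes the names contain no regex metacharacters; otherwise escape them with re.escape):

brands = ['Lidl', 'Aldi', 'KFC', 'McDonalds', 'Popeyes']  # hypothetical list
pattern = r'(?i)\b(' + '|'.join(brands) + r')\b'
canon = {b.lower(): b for b in brands}
# Extract the first match, then map it back to the canonical spelling.
df['standard'] = df['store'].str.extract(pattern)[0].str.lower().map(canon)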
I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns: 'City', 'State' and 'Country'.
0 HUN
1 ESP
2 GBR
3 ESP
4 FRA
5 ID, USA
6 GA, USA
7 Hoboken, NJ, USA
8 NJ, USA
9 AUS
Splitting the column into three columns is trivial enough:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(',')))
However, this creates left-aligned data:
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 ID USA NaN
6 GA USA NaN
7 Hoboken NJ USA
8 NJ USA NaN
9 AUS NaN NaN
How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?
I'd do something like the following:
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print(rev)
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 USA ID NaN
6 USA GA NaN
7 USA NJ Hoboken
8 USA NJ NaN
9 AUS NaN NaN
I think that gets you what you want but if you also want to pretty things up and get a City, State, Country column order, you could add the following:
rev.rename(columns={0:'Country',1:'State',2:'City'},inplace=True)
rev = rev[['City','State','Country']]
print(rev)
City State Country
0 NaN NaN HUN
1 NaN NaN ESP
2 NaN NaN GBR
3 NaN NaN ESP
4 NaN NaN FRA
5 NaN ID USA
6 NaN GA USA
7 Hoboken NJ USA
8 NaN NJ USA
9 NaN NaN AUS
Assuming your column is named target:
df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True)
Since you are dealing with strings, I would suggest this amendment to your current code:
location_df = df[['City, State, Country']].apply(lambda x: pd.Series(str(x).split(',')))
I got mine to work by testing one of the columns, but give this one a try.
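Another pandas-only option, in the same spirit as the reversal trick above, is to pad each split list on the left so the columns come out directly in City, State, Country order (a sketch assuming at most three comma-separated parts):

import numpy as np

parts = df['City, State, Country'].str.split(',')
padded = parts.apply(lambda p: [np.nan] * (3 - len(p)) + [s.strip() for s in p])
location_df = pd.DataFrame(padded.tolist(), index=df.index,
                           columns=['City', 'State', 'Country'])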
I am trying to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas.
Having saved it my script is
import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
# print(xl.sheet_names)
df = xl.parse(xl.sheet_names[0])
print(df.head())
However, this does not seem to process the headers properly, as it gives:
GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Year: 2010/11 (Final) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Coverage: England NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1. Includes International GCSE, Cambridge Inte... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2. Includes attempts and achievements by these... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
All of this should be treated as comments.
If you load the spreadsheet into LibreOffice, for example, you can see that the column headings are correctly parsed and appear in row 15, with drop-down menus to let you select the items you want.
How can you get pandas to automatically detect where the column headers are, just as LibreOffice does?
pandas is (are?) processing the file correctly, and exactly the way you asked it (them?) to. You didn't specify a header value, which means that it defaults to picking up the column names from the 0th row. The first few rows of cells aren't comments in some fundamental way, they're just not cells you're interested in.
Simply tell parse you want to skip some rows:
>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx")
>>> df = xl.parse(xl.sheet_names[0], skiprows=14)
>>> df.columns
Index([u'Local Authority Number', u'Local Authority Name', u'Local Authority Establishment Number', u'Unique Reference Number', u'School Name', u'Town', u'Number of pupils at the end of key stage 4', u'Number of pupils attempting a GCSE or an IGCSE', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', u'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object')
>>> df.head()
Local Authority Number Local Authority Name \
0 201 City of london
1 201 City of london
2 202 Camden
3 202 Camden
4 202 Camden
Local Authority Establishment Number Unique Reference Number \
0 2016005 100001
1 2016007 100003
2 2024104 100049
3 2024166 100050
4 2024196 100051
School Name Town \
0 City of London School for Girls London
1 City of London School London
2 Haverstock School London
3 Parliament Hill School London
4 Regent High School London
Number of pupils at the end of key stage 4 \
0 105
1 140
2 200
3 172
4 174
Number of pupils attempting a GCSE or an IGCSE \
0 104
1 140
2 194
3 169
4 171
Number of students achieving 8 or more GCSE or IGCSE passes at A*-G \
0 100
1 108
2 SUPP
3 22
4 0
Number of students achieving 8 or more GCSE or IGCSE passes at A*-A \
0 87
1 75
2 0
3 7
4 0
Number of students achieving 5 A*-A grades or more at GCSE or IGCSE
0 100
1 123
2 0
3 34
4 SUPP
[5 rows x 11 columns]
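If you really do want pandas to find the header row automatically, the way LibreOffice appears to, one heuristic (a sketch, not a built-in pandas feature) is to read the sheet once with header=None and treat the first fully populated row as the header:

raw = pd.read_excel("GCSE IGCSE results v3.xlsx", header=None)
# Assumption: the header is the first row with no empty cells.
# (If no such row exists, idxmax falls back to row 0.)
header_row = int(raw.notna().all(axis=1).idxmax())
df = pd.read_excel("GCSE IGCSE results v3.xlsx", header=header_row)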