I want to replace the direction words in the strings column with NaN: either when a word appears on its own, or when several of them appear joined by a comma and a space.
id strings
0 1 south
1 2 north
2 3 east
3 4 west
4 5 west, east, south
5 6 west, west
6 7 north, north
7 8 north, south
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
My expected result will look like this. Please note that when the direction words are only part of a phrase or of a longer word, I don't want to replace them.
Is it possible to do that? Thank you.
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
The following code works, but I just wonder if there are some more concise methods?
df['strings'].astype(str).replace('south', np.nan).replace('north', np.nan)\
.replace('west', np.nan).replace('east', np.nan).replace('west, east, south', np.nan)\
.replace('west, west', np.nan).replace('north, north', np.nan)\
.replace('north, south', np.nan)
First use Series.str.split, forward-fill along the rows to fill the missing values, test whether all values match with DataFrame.isin and DataFrame.all to build a mask, and finally set missing values with Series.mask:
L = ['south','north','east','west']
m = df['strings'].str.split(', ', expand=True).ffill(axis=1).isin(L).all(axis=1)
df['strings'] = df['strings'].mask(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
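To see why this builds the right mask, here is a minimal sketch (using the sample data above; the parts name is just illustrative) of the intermediate frame:
parts = df['strings'].str.split(', ', expand=True).ffill(axis=1)
# 'south'                          -> ['south', 'south', 'south']  -> all in L -> masked
# 'west, east, south'              -> ['west', 'east', 'south']    -> all in L -> masked
# 'West Corporation global office' -> repeated unchanged           -> not in L -> kept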
Another idea with sets, isdisjoint and Series.where:
m = [set(x.split(', ')).isdisjoint(L) for x in df['strings']]
df['strings'] = df['strings'].where(m)
print (df)
id strings
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 West Corporation global office
9 10 West-Riding
10 11 University of West Florida
11 12 Southwest
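As a quick check of the set logic on a couple of the sample rows: a row is kept only when its comma-separated parts share nothing with L.
set('west, east, south'.split(', '))  # {'west', 'east', 'south'} -> not disjoint -> replaced
set('West-Riding'.split(', '))        # {'West-Riding'}           -> disjoint     -> kept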
Using Regex.
Ex:
df = pd.DataFrame({'strings': ['south', 'north', 'east', 'west', 'west, east, south', 'west, west', 'north, north', 'north, south', 'West Corporation global office', 'West-Riding', 'University of West Florida', 'Southwest']})
df['R'] = df['strings'].replace(r"\b(south|north|east|west)\b,?", np.nan, regex=True)
print(df)
Output:
strings R
0 south NaN
1 north NaN
2 east NaN
3 west NaN
4 west, east, south NaN
5 west, west NaN
6 north, north NaN
7 north, south NaN
8 West Corporation global office West Corporation global office
9 West-Riding West-Riding
10 University of West Florida University of West Florida
11 Southwest Southwest
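The word boundaries and the lowercase alternation are what protect the phrases here: "Southwest" has no boundary before "west", and "West Corporation global office" starts with an uppercase "W", so neither is matched. A small sanity check (a sketch using Python's re directly):
import re
pattern = r"\b(south|north|east|west)\b,?"
print(bool(re.search(pattern, "Southwest")))                       # False - no word boundary inside the word
print(bool(re.search(pattern, "West Corporation global office")))  # False - the match is case-sensitive
print(bool(re.search(pattern, "west, east, south")))               # True  - the whole cell becomes NaN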
I am trying to create a tally that advances the Region by one row each time the next ID comes up, so that I get one result per ID.
I have tried a few methods, but nothing seems to be working and I am running short on ideas.
Data Set
ID Region
1 North
1 South
1 East
1 West
2 North
2 South
2 East
2 West
3 North
3 South
3 East
3 West
4 North
4 South
4 East
4 West
5 Northwest
5 South West
6 Northwest
6 South West
7 North
7 South
7 East
7 West
8 North
8 South
8 East
8 West
9 North
9 South
9 East
9 West
9 Northwest
9 South West
Expected Output
ID Region
1 North
2 South
3 East
4 West
5 Northwest
6 South West
7 North
8 South
9 North
You can factorize both columns and keep the rows for which the ranks are equal:
out = df.loc[pd.factorize(df['Region'])[0] == pd.factorize(df['ID'])[0]]
Output:
ID Region
0 1 North
5 2 South
10 3 East
15 4 West
16 5 Northwest
19 6 South West
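For context, pd.factorize assigns each distinct value an integer code in order of first appearance, so both columns get ranks 0, 1, 2, ... A minimal sketch on the sample data:
pd.factorize(df['ID'])[0]      # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4, 5,5, ...]
pd.factorize(df['Region'])[0]  # [0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3, 4,5, 4,5, ...]
# the rows where the two codes agree are the "n-th Region of the n-th ID"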
Another idea: what about using an intermediate rectangular matrix and taking its diagonal?
import numpy as np
df2 = (df.pivot(index='ID', columns='Region', values='Region')
.reindex(index=df['ID'].unique(), columns=df['Region'].unique())
)
out = pd.DataFrame({'ID': df2.index, 'Region': np.diag(df2)})
Output:
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
Intermediate rectangular matrix:
Region North South East West Northwest South West
ID
1 North South East West NaN NaN
2 North South East West NaN NaN
3 North South East West NaN NaN
4 North South East West NaN NaN
5 NaN NaN NaN NaN Northwest South West
6 NaN NaN NaN NaN Northwest South West
Updated example
There is no clear logic, so I can only assume here…
assuming you want to restart the diagonals:
import numpy as np
df2 = (df.pivot(index='ID', columns='Region', values='Region')
.reindex(index=df['ID'].unique(), columns=df['Region'].unique())
)
idx = np.arange(df2.shape[0])
out = pd.DataFrame({'ID': df2.index, 'Region': df2.to_numpy()[idx, idx%df2.shape[1]]})
Output:
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
6 7 North
7 8 South
8 9 East
Intermediate df2, with selected values **HIGHLIGHTED**:
Region North South East West Northwest South West
ID
1 **NORTH** South East West NaN NaN
2 North **SOUTH** East West NaN NaN
3 North South **EAST** West NaN NaN
4 North South East **WEST** NaN NaN
5 NaN NaN NaN NaN **NORTHWEST** South West
6 NaN NaN NaN NaN Northwest **SOUTH WEST**
7 **NORTH** South East West NaN NaN
8 North **SOUTH** East West NaN NaN
9 North South **EAST** West Northwest South West
assuming you want to restart when the successive groups are different:
import numpy as np
df2 = (df.pivot(index='ID', columns='Region', values='Region')
.reindex(index=df['ID'].unique(), columns=df['Region'].unique())
)
group = df2.ne(df2.shift()).where(df2.notna()).any(axis=1).cumsum()
out = (df2.groupby(group, group_keys=False)
.apply(lambda g: pd.Series(np.diag(g.dropna(axis=1)), index=g.index))
.reset_index(name='Region')
)
Output:
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
6 7 North
7 8 South
8 9 North
Intermediate df2 with groups of successive identical rows:
Region North South East West Northwest South West
ID group
1 1 North South East West NaN NaN
2 1 North South East West NaN NaN
3 1 North South East West NaN NaN
4 1 North South East West NaN NaN
5 2 NaN NaN NaN NaN Northwest South West
6 2 NaN NaN NaN NaN Northwest South West
7 3 North South East West NaN NaN
8 3 North South East West NaN NaN
9 4 North South East West Northwest South West
I think the method ".drop_duplicates" might solve your problem. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
I think you should use the "subset" parameter, as indicated in the documentation at the link above.
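For reference, a minimal sketch of what that would look like; note that on its own it only keeps the first row per group, so it would still need to be combined with the alternating selection:
df.drop_duplicates(subset='ID')      # one row per ID, always that ID's first Region
df.drop_duplicates(subset='Region')  # one row per distinct Region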
The unique region names can be collected into a set, then for each group (grouped by ID) the next available region is extracted:
def get_next_region(r_group, regions):
    # take the first region of this group that is still available,
    # removing it from the shared set so later groups skip it
    for reg in r_group:
        if reg in regions:
            regions.remove(reg)
            break
    return reg
regions = set(df.Region.unique()) # unique regions
reg_df = df.groupby('ID')['Region'].apply(get_next_region, regions=regions.copy())\
.reset_index(name="Region")
print(reg_df)
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
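Note that regions=regions.copy() is evaluated once, before groupby.apply runs, so the same copied set is shared by every group call; each region is therefore removed at most once, which is what makes the selection advance from one ID to the next.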
Here is a potential solution. It does result in the desired output for the example you gave, but would not necessarily generalize to any combination of ID values or combinations of Regions. It may be useful if you are looking for pandas methods that don't require setting up a loop.
import pandas as pd
import math
# Create pandas DataFrame
df = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,6,6],
'Region':['North','South','East','West',
'North','South','East','West',
'North','South','East','West',
'North','South','East','West',
'Northwest','South West','Northwest','South West'
]
})
# A method of indexing the regions per ID from 1 to n regions
df = df.reset_index().rename(columns={'index':'Region Idx'})
offset = df['Region Idx']*df['ID'].diff()
offset.loc[offset==0] = math.nan
offset = offset.ffill().fillna(0)
df['Region Idx'] = df['Region Idx'] - offset + 1
# Assuming your IDs are integers in ascending order
# and you don't have any special cases with the number of regions per ID,
# this can be used for the rolling region selection per ID
df['Max Region'] = df[['ID','Region Idx']].groupby('ID').transform('max')
df['Selected Region Idx'] = (df['ID']-1) % df['Max Region'] + 1
# Final result
result = df.loc[df['Region Idx']==df['Selected Region Idx'],['ID','Region']]
A possible solution
unique_ids = df['ID'].unique()
res_df = pd.DataFrame(columns=['ID', 'Region'])
for i, id in enumerate(unique_ids):
region = df[df['ID'] == id]['Region'].iloc[i % 4]
res_df = pd.concat([res_df, pd.DataFrame({'ID': [id], 'Region': [region]})],
ignore_index=True)
print(res_df)
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
Another thought using groupby.ngroup and drop_duplicates:
id_idx = df.groupby('ID').ngroup().drop_duplicates().index
region_idx = df.groupby('Region').ngroup().drop_duplicates().index
out = pd.DataFrame()
out['ID'], out['Region'] = df.loc[id_idx, 'ID'].values, df.loc[region_idx, 'Region'].values
print(out)
ID Region
0 1 North
1 2 South
2 3 East
3 4 West
4 5 Northwest
5 6 South West
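To unpack this: groupby(...).ngroup() numbers the groups, and calling drop_duplicates() on that Series keeps only the first row of each group, so id_idx and region_idx hold the index positions of the first occurrence of each ID and of each distinct Region, which are then paired up positionally.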
Suppose we have the following pandas dataframe:
df
Age Country
0 10 1
1 15 2
2 20 3
3 25 1
4 30 2
5 15 3
6 20 3
7 15 4
8 20 4
The desired result is:
Age Country Continent
0 10 1 Africa
1 15 2 Asia
2 20 3 Africa
3 25 1 Africa
4 30 2 Asia
5 15 3 Africa
6 20 3 Africa
7 15 4 Asia
8 20 4 Asia
How can we group df if the number of countries is huge? One possible solution is given here:
continent_lookup = {1: 'Africa', 2: 'Asia', 3: 'Africa', 4: 'Asia'}
df['Continent'] = df.Country.map(continent_lookup)
However, if the number of countries is very large, it would be convenient to be able to pass a dictionary of the following form:
convenient_continent_lookup = {'Africa': [1,3], 'Asia': [2,4]}
One way to go from here would be to convert convenient_continent_lookup to a dataframe as mentioned in this answer to get df2:
df2
Country Continent
0 1 Africa
1 2 Asia
2 3 Africa
3 1 Africa
4 2 Asia
5 3 Africa
6 3 Africa
7 4 Asia
8 4 Asia
If we now join df with df2 on the 'Country' column, we get the desired result. Is there an easier solution?
You don't need to create a new dataframe at all. Using a dict comprehension, you can invert convenient_continent_lookup into a dict that map can understand:
df['Continent'] = df['Country'].map({v: k for k, vals in convenient_continent_lookup.items() for v in vals})
Output:
>>> df
Age Country Continent
0 10 1 Africa
1 15 2 Asia
2 20 3 Africa
3 25 1 Africa
4 30 2 Asia
5 15 3 Africa
6 20 3 Africa
7 15 4 Asia
8 20 4 Asia
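For clarity, the comprehension simply inverts the convenient mapping so that each country points at its continent; the intermediate dict it builds looks like this:
{v: k for k, vals in convenient_continent_lookup.items() for v in vals}
# {1: 'Africa', 3: 'Africa', 2: 'Asia', 4: 'Asia'}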
Let us check explode + merge
s = pd.Series(convenient_continent_lookup).explode().reset_index()
s.columns = ['Continent','Country']
df = df.merge(s,how='left')
df
Out[359]:
Age Country Continent
0 10 1 Africa
1 25 1 Africa
2 15 2 Asia
3 30 2 Asia
4 20 3 Africa
5 15 3 Africa
6 20 3 Africa
7 15 4 Asia
8 20 4 Asia
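For reference, the intermediate Series s looks like this before the merge (a sketch on the data above):
s = pd.Series(convenient_continent_lookup).explode().reset_index()
s.columns = ['Continent', 'Country']
#   Continent Country
# 0    Africa       1
# 1    Africa       3
# 2      Asia       2
# 3      Asia       4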
I have a dataframe of store names that I have to standardize. For example McDonalds 1234 LA -> McDonalds.
import pandas as pd
import numpy as np
import re
df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],dtype='int64',index=pd.RangeIndex(start=0, stop=10, step=1)), 'store': pd.Series(['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ', 'Taco Restaurant', 'Lidl Berlin', 'Popeyes', 'Wallmart', 'Aldi', 'London Lidl'],dtype='object',index=pd.RangeIndex(start=0, stop=10, step=1))}, index=pd.RangeIndex(start=0, stop=10, step=1))
print(df)
id store
0 1 McDonalds
1 2 Lidl
2 3 Lidl New York 123
3 4 KFC
4 5 Taco Restaurant
5 6 Lidl Berlin
6 7 Popeyes
7 8 Wallmart
8 9 Aldi
9 10 London Lidl
So let's say I want to standardize the Lidl stores. The standard name will just be "Lidl".
I would like to find where Lidl appears in the dataframe, create a new column df['standard_name'], and insert the standard name there. However, I can't figure this out.
I'll first create the column where the standard name will be inserted:
df['standard_name'] = np.nan
Then search for instances of Lidl, and insert the cleaned name into standard_name.
First of all the plan is to use str.contains and then set the standardized value to the new column:
df[df.store.str.contains(r'\blidl\b',re.I,regex=True)]['standard'] = 'Lidl'
print(df)
id store standard_name
0 1 McDonalds NaN
1 2 Lidl NaN
2 3 Lidl New York 123 NaN
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin NaN
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl NaN
Nothing has been inserted. I checked the str.contains code on its own, and found it returned False for every row:
df.store.str.contains(r'\blidl\b',re.I,regex=True)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: store, dtype: bool
I'm not sure what's happening here.
What I am trying to end up with is the standardized names filled in like this:
id store standard_name
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
I will be trying to standardize the majority of business names in the dataset (McDonalds, Burger King, etc.), so any help is appreciated.
Also, is this the fastest way to do this? There are millions of rows to process.
If you want to set a new column, you can use DataFrame.loc with case=False or flags=re.I:
Notice: df['standard_name'] = np.nan is not necessary, you can omit it.
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
#alternative
#df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
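As a side note on why the original attempt returned all False and inserted nothing: the second positional parameter of str.contains is case, not flags, so passing re.I there leaves the match case-sensitive, and df[mask]['standard'] = ... assigns into a temporary copy rather than into df. A sketch of the difference:
# matches nothing: re.I lands in the case parameter, and the chained indexing writes to a copy
df[df.store.str.contains(r'\blidl\b', re.I, regex=True)]['standard'] = 'Lidl'
# matches as intended and writes into df
df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'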
Or it is possible to use another approach, Series.str.extract:
df['standard'] = df['store'].str.extract(r'(?i)(\blidl\b)')
#alternative
#df['standard'] = df['store'].str.extract(r'(\blidl\b)', re.I)
print (df)
id store standard
0 1 McDonalds NaN
1 2 Lidl Lidl
2 3 Lidl New York 123 Lidl
3 4 KFC NaN
4 5 Taco Restaurant NaN
5 6 Lidl Berlin Lidl
6 7 Popeyes NaN
7 8 Wallmart NaN
8 9 Aldi NaN
9 10 London Lidl Lidl
I have a data frame which loads a central file; new files are added to it monthly. Since a few columns are missing from the file that is copied into the data frame, I created a mapping dataframe that fills in values for the missing columns when a condition is met.
Below is the central file example:
ID Region Country Code User Order Price
1 Germany ABC 2342545
2 Italy DEF 5464545
3 USA GHI 3245325
4 India JKL 676565
5 Mexico MNO 3443252
6 China PQR 565445
7 Germany STU 765765
8 Mexico VWX 564566
9 China YZA 346534
10 India BCD 5675765
This is my mapping file for missing Region and Code
Country Region Code
Germany EU 1
Italy EU 2
USA America 3
India Asia 4
Mexico America 5
China Asia 6
Here is the expected output:
ID Region Country Code User Order Price
1 EU Germany 1 ABC 2342545
2 EU Italy 2 DEF 5464545
3 America USA 3 GHI 3245325
4 Asia India 4 JKL 676565
5 America Mexico 5 MNO 3443252
6 Asia China 6 PQR 565445
7 EU Germany 1 STU 765765
8 America Mexico 5 VWX 564566
9 Asia China 6 YZA 346534
10 Asia India 4 BCD 5675765
What I have done is use for loops with iterrows() to update the values in the data frame.
The problem is that it is super slow: it takes about an hour or more to update roughly 60,000 records.
Here is my code:
for central_update_index, central_update_row in central_bridge_file.iterrows():
    print('index: ', central_update_index)
    for bridge_match_index, bridge_match_row in central_bridge_matching_file.iterrows():
        if bridge_match_row['Country'] == central_update_row['Country']:
            if central_update_row['Country / Company (P2)'] == bridge_match_row['Country']:
                central_bridge_file.loc[central_update_index, 'Code'] = bridge_match_row['Code']
                central_bridge_file.loc[central_update_index, 'Region'] = bridge_match_row['Region']
Can someone help me write a lambda statement or something else that could do this in minutes?
Given df,
ID Region Country Code User Order Price
0 1 NaN Germany NaN ABC 2342545
1 2 NaN Italy NaN DEF 5464545
2 3 NaN USA NaN GHI 3245325
3 4 NaN India NaN JKL 676565
4 5 NaN Mexico NaN MNO 3443252
5 6 NaN China NaN PQR 565445
6 7 NaN Germany NaN STU 765765
7 8 NaN Mexico NaN VWX 564566
8 9 NaN China NaN YZA 346534
9 10 NaN India NaN BCD 5675765
and df_map,
Country Region Code
0 Germany EU 1
1 Italy EU 2
2 USA America 3
3 India Asia 4
4 Mexico America 5
5 China Asia 6
You can merge these two dataframes on 'Country':
df[['ID','Country','User','Order Price']].merge(df_map)
Output:
ID Country User Order Price Region Code
0 1 Germany ABC 2342545 EU 1
1 7 Germany STU 765765 EU 1
2 2 Italy DEF 5464545 EU 2
3 3 USA GHI 3245325 America 3
4 4 India JKL 676565 Asia 4
5 10 India BCD 5675765 Asia 4
6 5 Mexico MNO 3443252 America 5
7 8 Mexico VWX 564566 America 5
8 6 China PQR 565445 Asia 6
9 9 China YZA 346534 Asia 6
If you want to totally replace the Region and Code columns in your data, you can drop them and merge:
df = (df.drop(['Region','Code'], axis=1)
.merge(mapping, on='Country', how='left')
)
If you only want to update those columns, i.e. keep any existing values, then:
mapping = mapping.set_index('Country')
for c in ['Region', 'Code']:
df[c] = df[c].fillna(df['Country'].map(mapping[c]))
Output:
ID Region Country Code User Order Price
0 1 EU Germany 1.0 ABC 2342545
1 2 EU Italy 2.0 DEF 5464545
2 3 America USA 3.0 GHI 3245325
3 4 Asia India 4.0 JKL 676565
4 5 America Mexico 5.0 MNO 3443252
5 6 Asia China 6.0 PQR 565445
6 7 EU Germany 1.0 STU 765765
7 8 America Mexico 5.0 VWX 564566
8 9 Asia China 6.0 YZA 346534
9 10 Asia India 4.0 BCD 5675765
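One small follow-up: because Code passes through NaN during the fill, it ends up as float (1.0, 2.0, ...). If an integer column is preferred, one option in reasonably recent pandas (a sketch, not part of the answer above) is the nullable integer dtype:
df['Code'] = df['Code'].astype('Int64')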
I have a large dataframe that I need to split on empty rows.
Here's a simplified example of the DataFrame:
A B C
0 1 0 International
1 1 1 International
2 NaN 2 International
3 1 3 International
4 1 4 International
5 8 0 North American
6 8 1 North American
7 8 2 North American
8 8 3 North American
9 NaN NaN NaN
10 1 0 Internal
11 1 1 Internal
12 6 0 East
13 6 1 East
14 6 2 East
...
As you can see, row 9 is blank. What I need to do is take rows 0 through 8 and put them in one dataframe, then rows 10 up to the next blank, and so on, so that I end up with several dataframes. Note that when looking for blank rows, I need the whole row to be blank.
Here is the code I'm using to find blanks:
def find_breaks(df):
df_breaks = df[(df.loc[:,['A','B','C']].isnull()).any(axis=1)]
print(df_breaks.index)
This code works when I test it on the simplified DF but, of course, my real DataFrame has many more columns than ['A','B','C']
How can I find the next blank row (or as I am doing above, all the blank rows at once) without having to specify my column names?
Thanks
IIUC, use pd.isnull + np.split:
df_list = np.split(df, df[df.isnull().all(1)].index)
for df in df_list:
print(df, '\n')
A B C
0 1.0 0.0 International
1 1.0 1.0 International
2 NaN 2.0 International
3 1.0 3.0 International
4 1.0 4.0 International
5 8.0 0.0 North American
6 8.0 1.0 North American
7 8.0 2.0 North American
8 8.0 3.0 North American
A B C
9 NaN NaN NaN
10 1.0 0.0 Internal
11 1.0 1.0 Internal
12 6.0 0.0 East
13 6.0 1.0 East
14 6.0 2.0 East
First, obtain the indices where the entire row is null, and then use that to split your dataframe into chunks. np.split handles dataframes quite well.
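If the blank separator rows themselves should not stay inside the resulting chunks, they can be dropped from each piece afterwards, for example:
df_list = [d.dropna(how='all') for d in np.split(df, df[df.isnull().all(1)].index)]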