Replace string with np.nan if condition is met - python

I am trying to replace a string occurrence in a column if a condition is met.
My sample input dataset:
Series Name   Type
Food          ACG
Drinks        FEG
Food at Home  BON
I want to replace the strings in the Series Name column with NaN (or blank) wherever the Type column is either ACG or BON. I tried the following code using conditions, without much success.
Code:
df.loc[((df['Type'] == 'ACG') | (df['Type'] == 'BON')),
       df['Series Name'].replace(np.nan)]
Desired output:
Series Name   Type
              ACG
Drinks        FEG
              BON

Since you want to set the whole cell to nan, just do this:
df.loc[((df['Type'] == 'ACG') | (df['Type'] == 'BON')), 'Series Name'] = np.nan
Output:
  Series Name Type
0         NaN  ACG
1      Drinks  FEG
2         NaN  BON
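For completeness, here is a minimal self-contained sketch of the same fix, reconstructing the sample frame from the question (Series.isin is an equivalent, slightly shorter way to write the two-part condition):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Series Name': ['Food', 'Drinks', 'Food at Home'],
    'Type': ['ACG', 'FEG', 'BON'],
})
df.loc[df['Type'].isin(['ACG', 'BON']), 'Series Name'] = np.nan
print(df)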
Update:
Regarding your question in the comments: if you only want to change part of the string, you can use replace like this:
# new input
df = pd.DataFrame({
    'Series Name': ['Food to go', 'Fast Food', 'Food at Home'],
    'Type': ['ACG', 'FEG', 'BON']
})
    Series Name Type
0    Food to go  ACG
1     Fast Food  FEG
2  Food at Home  BON
mask = df['Type'].isin(['ACG', 'BON'])
df.loc[mask, 'Series Name'] = (df.loc[mask, 'Series Name']
                               .replace(to_replace='Food', value='NEWVAL', regex=True))
print(df)
      Series Name Type
0    NEWVAL to go  ACG
1       Fast Food  FEG
2  NEWVAL at Home  BON

Another option is to use Series.mask:
mask = df['Type'].isin(['ACG', 'BON'])
df['Series Name'] = df['Series Name'].mask(mask)
Output:
  Series Name Type
0         NaN  ACG
1      Drinks  FEG
2         NaN  BON
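By default, mask replaces values with NaN where the condition is True. If you need a different placeholder, pass it as the other argument (the 'REDACTED' label below is just an arbitrary example):

df['Series Name'] = df['Series Name'].mask(mask, other='REDACTED')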

Related

Multiple similar columns with similar values

The dataframe looks like:
name    education    education_2   education_3
name_1  NaN          some college  NaN
name_2  NaN          NaN           graduate degree
name_3  high school  NaN           NaN
I just want to keep one education column. I tried a conditional statement comparing the columns to each other, but got nothing but errors. I also looked into a merge solution, in vain. Does anyone know how to deal with this using Python or pandas? Thank you in advance. The desired output:
name    education
name_1  some college
name_2  graduate degree
name_3  high school
One day I hope pandas will have better functions for string-typed rows, rather than the limited support for columns currently available:
df['education'] = (df.filter(like='education')  # filter to only the education columns
                     .T                         # transpose
                     .convert_dtypes()          # convert to pandas dtypes, still somewhat in beta
                     .max()                     # max per column, which will be the non-null value
                   )
df = df[['name', 'education']]
print(df)
Output:
     name        education
0  name_1     some college
1  name_2  graduate degree
2  name_3      high school
Looping this wouldn't be too hard, e.g.:

cols = ['education', 'age', 'income']
for col in cols:
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
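To make that concrete, here is a minimal runnable sketch of the loop on the question's data (trimmed to the education columns, since those are the only grouped columns in the sample):

import pandas as pd

df = pd.DataFrame({
    'name': ['name_1', 'name_2', 'name_3'],
    'education': [None, None, 'high school'],
    'education_2': ['some college', None, None],
    'education_3': [None, 'graduate degree', None],
})

cols = ['education']
for col in cols:
    # filter(like=col) selects every column whose name contains col;
    # bfill(axis=1) pulls the first non-null value leftwards across them.
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
print(df)
#      name        education
# 0  name_1     some college
# 1  name_2  graduate degree
# 2  name_3      high school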
You can use df.fillna to do so.
df['combine'] = df[['education','education2','education3']].fillna('').sum(axis=1)
df
    name    education    education2       education3          combine
0  name1          NaN  some college              NaN     some college
1  name2          NaN           NaN  graduate degree  graduate degree
2  name3  high school           NaN              NaN      high school
If you have a lot of columns to combine, you can try this (note that sum(axis=1) concatenates every non-null string in a row, so it assumes at most one of them is filled):
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)
Use bfill to fill the empty (NaN) values:
df.bfill(axis=1).drop(columns=['education_2', 'education_3'])
     name        education
0  name_1     some college
1  name_2  graduate degree
2  name_3      high school
If there are other columns in between, then choose the columns to which you apply bfill. In essence, if you have multiple education columns that you need to consolidate under a single column, select just those columns, back-fill, and then drop the columns you filled from:
df[['education', 'education_2', 'education_3']].bfill(axis=1).drop(columns=['education_2', 'education_3'])

Pandas: creating a new column conditional on substring searches of one column and inverse of another column

I'd like to create a new column in a Pandas data frame based on a substring search of one column and an inverse of another column. Here is some data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Manufacturer': ['ABC-001', 'ABC-002', 'ABC-003', 'ABC-004', 'DEF-123',
                                    'DEF-124', 'DEF-125', 'ABC-987', 'ABC-986', 'ABC-985'],
                   'Color': ['04-Red', 'vs Red - 07', 'Red', 'Red--321', np.nan,
                             np.nan, np.nan, 'Blue', 'Black', 'Orange'],
                   })
  Manufacturer        Color
0      ABC-001       04-Red
1      ABC-002  vs Red - 07
2      ABC-003          Red
3      ABC-004     Red--321
4      DEF-123          NaN
5      DEF-124          NaN
6      DEF-125          NaN
7      ABC-987         Blue
8      ABC-986        Black
9      ABC-985       Orange
I would like to be able to create a new column named Country based on the following logic:
a) if the Manufacturer column contains the substring 'ABC' and the Color column contains the substring 'Red', then write 'United States' to the Country column
b) if the Manufacturer column contains the substring 'DEF', then write 'Canada' to the Country column
c) if the Manufacturer column contains the substring 'ABC' and the Color column does NOT contain the substring 'Red', then write 'England' to the Country column.
My attempt is as follows:
df['Country'] = np.where((df['Manufacturer'].str.contains('ABC')) & (df['Color'].str.contains('Red', na=False)), 'United States',  # the 'a' case
                np.where(df['Manufacturer'].str.contains('DEF', na=False), 'Canada',  # the 'b' case
                np.where((df['Manufacturer'].str.contains('ABC')) & (df[~df['Color'].str.contains('Red', na=False)]), 'England',  # the 'c' case
                'ERROR')))
But, this gets the following error:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
The error message suggests that it might be a matter of operator precedence, as mentioned in:
pandas comparison raises TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
Python error: TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
I believe I'm using parentheses properly here (although maybe I'm not).
Does anyone see the cause of this error? (Or know of a more elegant way to accomplish this?)
Thanks in advance!
You don't want to index into df here. Just change:
(df[~df['Color'].str.contains('Red', na=False)])
to:
~df['Color'].str.contains('Red', na=False)
and it should work.
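Applying just that change, the full expression from the question becomes:

df['Country'] = np.where((df['Manufacturer'].str.contains('ABC')) & (df['Color'].str.contains('Red', na=False)), 'United States',  # the 'a' case
                np.where(df['Manufacturer'].str.contains('DEF', na=False), 'Canada',  # the 'b' case
                np.where((df['Manufacturer'].str.contains('ABC')) & (~df['Color'].str.contains('Red', na=False)), 'England',  # the 'c' case
                'ERROR')))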
Also, if you want to break this up for readability and to eliminate some repetition, I would suggest something like this:
# define the parameters that define the Country variable in another table
df_countries = pd.DataFrame(
    {'letters': ['ABC', 'DEF', 'ABC'],
     'is_red': [True, False, False],
     'Country': ['United States', 'Canada', 'England']})
# add those identifying parameters to your current table as temporary columns
# (regex=True is needed in recent pandas, where str.replace is literal by default)
df['letters'] = df.Manufacturer.str.replace('-.*', '', regex=True)
df['is_red'] = df.Color.str.contains('Red', na=False)
# merge the tables together and drop the temporary key columns
df = df.merge(df_countries, how='left', on=['letters', 'is_red'])
df = df.drop(columns=['letters', 'is_red'])
Or more concise:
in_col = lambda col, string: df[col].str.contains(string, na=False)
conds = {'United States': in_col('Manufacturer', 'ABC') & in_col('Color', 'Red'),
         'Canada': in_col('Manufacturer', 'DEF'),
         'England': in_col('Manufacturer', 'ABC') & ~in_col('Color', 'Red')}
df['Country'] = np.select(condlist=list(conds.values()), choicelist=list(conds.keys()))
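One caveat: if a row matches none of the conditions, np.select falls back to its default of 0. If that can happen in your data, pass an explicit default, e.g.:

df['Country'] = np.select(condlist=list(conds.values()), choicelist=list(conds.keys()), default='')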
Another way out is to use np.select(list of conditions, list of choices). Note the na=False on both Color conditions, since that column contains NaN:

conditions = [(df['Manufacturer'].str.contains('ABC')) & (df['Color'].str.contains('Red', na=False)),
              df['Manufacturer'].str.contains('DEF', na=False),
              (df['Manufacturer'].str.contains('ABC')) & (~df['Color'].str.contains('Red', na=False))]
choices = ['United States', 'Canada', 'England']
df['Country'] = np.select(conditions, choices)
  Manufacturer        Color        Country
0      ABC-001       04-Red  United States
1      ABC-002  vs Red - 07  United States
2      ABC-003          Red  United States
3      ABC-004     Red--321  United States
4      DEF-123          NaN         Canada
5      DEF-124          NaN         Canada
6      DEF-125          NaN         Canada
7      ABC-987         Blue        England
8      ABC-986        Black        England
9      ABC-985       Orange        England
This is an easy and straightforward way to do it:
country = []
for index, row in df.iterrows():
    if 'DEF' in row['Manufacturer']:
        country.append('Canada')
    elif 'ABC' in row['Manufacturer']:
        # guard against NaN: 'Red' in np.nan would raise a TypeError
        if isinstance(row['Color'], str) and 'Red' in row['Color']:
            country.append('United States')
        else:
            country.append('England')
    else:
        country.append('')
df['Country'] = country
Of course there are more efficient ways to go about this without looping through the entire dataframe, but in almost all cases this should be sufficient.

Pandas Dataframe : Assigning integer values based on the column value

I have the following pandas dataframe.
df = pd.DataFrame({'Neighborhood': ['Marble Hill', 'Chelsea', 'Sutton Place'],
                   'Venue Category': ['Hospital', 'Bridge', 'School']})
When I execute it, I get the following table.
   Neighborhood Venue Category
0   Marble Hill       Hospital
1       Chelsea         Bridge
2  Sutton Place         School
Now, I want to assign numerical values for each Venue Category.
Hospital - 5 marks
School - 4 marks
Bridge - 2 marks
So I tried to assign marks using this code. I want to display the marks in a separate column.
def df2(df):
    if (df['Venue Category'] == 'Hospital'):
        return 5
    elif (df['Venue Category'] == 'School'):
        return 4
    elif (df['Venue Category'] != 'Hospital' or df['Venue Category'] != 'School'):
        return np.nan

df['Value'] = df.apply(df2, axis=1)
Once executed, it gives me the following warning. May I know how to fix this please?
/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
if __name__ == '__main__':
Create a dictionary of all possible Venue Category values and then use Series.map; if some value from the column does not exist among the dictionary keys, NaN is returned:
df = pd.DataFrame({'Neighborhood': ['Marble Hill', 'Chelsea', 'Sutton Place', 'aaa'],
                   'Venue Category': ['Hospital', 'Bridge', 'School', 'a']})
print (df)
   Neighborhood Venue Category
0   Marble Hill       Hospital
1       Chelsea         Bridge
2  Sutton Place         School
3           aaa              a
d = {'Hospital':5, 'School':4, 'Bridge':2}
df['Value'] = df['Venue Category'].map(d)
print (df)
   Neighborhood Venue Category  Value
0   Marble Hill       Hospital    5.0
1       Chelsea         Bridge    2.0
2  Sutton Place         School    4.0
3           aaa              a    NaN
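If you prefer a default value instead of NaN for unmapped categories, chain fillna onto the mapped result (the 0 below is an arbitrary placeholder):

df['Value'] = df['Venue Category'].map(d).fillna(0)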
Solution with np.select is possible, but in my opinion overcomplicated:
conditions = [df['Venue Category'] == 'Hospital',
              df['Venue Category'] == 'School',
              df['Venue Category'] == 'Bridge']
choices = [5, 4, 2]
df['Value'] = np.select(conditions, choices, default=np.nan)
print (df)
   Neighborhood Venue Category  Value
0   Marble Hill       Hospital    5.0
1       Chelsea         Bridge    2.0
2  Sutton Place         School    4.0
3           aaa              a    NaN

Python - Group dataframe based on certain string

I am trying to combine strings across rows according to certain logic:
s1 = ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt']
s2 = [1, 1, 2, 2, 2]
s3 = ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1', np.nan, 'Harry Potter']
df = pd.DataFrame(list(zip(s1, s2, s3)), columns=['file', 'id', 'book'])
df
Data Preview:
file     id  book
abc.txt  1   Harry Potter
abc.txt  1   Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter
I have a bunch of file names with ids associated with them, and a 'book' column where 'Vol 1' has ended up in a separate row. I know that this 'Vol 1' is associated only with 'Harry Potter' in the given dataset. Grouping by 'file' and 'id', how do I combine 'Vol 1' into the same row where the 'Harry Potter' string appears? Notice that not every row has a 'Vol 1' for 'Harry Potter'; I only want to append 'Vol 1' when it exists within the file and id group.
Two tries:
1st (doesn't work):

if (df['book'] == 'Harry Potter' and df['book'].str.contains('Vol 1', case=False) in df.groupby(['file', 'id'])):
    df.groupby(['file', 'id'], as_index=False).first()

2nd (this applies to every matching string, but I don't want it to apply to every 'Harry Potter' row):

df.loc[df['book'].str.contains('Harry Potter', case=False, na=False), 'new_book'] = 'Harry Potter - Vol 1'
Here is the output I am looking for
file     id  book
abc.txt  1   Harry Potter - Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter
Start by importing re (you will use it).
Then create your DataFrame:
df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})
The first processing step is to add a column, let's call it book2, containing the book value from the next row:
df["book2"] = df.book.shift(-1).fillna('')
I added fillna('') to replace NaN values with an empty string.
Then define a function to be applied to each row:
def fn(row):
    return f"{row.book} - {row.book2}" if row.book == 'Harry Potter' \
        and re.match(r'^Vol \d+$', row.book2) else row.book
This function checks whether book == "Harry Potter" and book2 matches
"Vol " + a sequence of digits.
If it does, it returns book + book2, otherwise it returns just book.
Then we apply this function and save the result back under book:
df["book"] = df.apply(fn, axis=1)
The only remaining things to do are to drop the rows where book matches 'Vol \d+', and to drop the book2 column.
The code is:
df = df.drop(df[df.book.str.match(r'^Vol \d+$').fillna(False)].index) \
       .drop(columns=['book2'])
fillna(False) is needed because str.match returns NaN where the source content is NaN.
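Putting the steps together, a self-contained version of this approach (the same code, just assembled in one place):

import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})

# book2 holds the book value from the next row
df['book2'] = df.book.shift(-1).fillna('')

def fn(row):
    return (f"{row.book} - {row.book2}"
            if row.book == 'Harry Potter' and re.match(r'^Vol \d+$', row.book2)
            else row.book)

df['book'] = df.apply(fn, axis=1)
df = (df.drop(df[df.book.str.match(r'^Vol \d+$').fillna(False)].index)
        .drop(columns=['book2']))
print(df)  # matches the desired output shown in the question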
Assuming that "Vol x" occurs on the row following the title, I would use an auxiliary Series obtained by shifting the book column by -1. It is then enough to append that Series to the book column where it starts with "Vol ", and to drop the rows where the book column itself starts with "Vol ". Code could be:
b2 = df.book.shift(-1).fillna('')
df['book'] = df.book + np.where(b2.str.match('Vol [0-9]+'), ' - ' + b2, '')
print(df.drop(df.loc[df.book.fillna('').str.match('Vol [0-9]+')].index))
If the order in the dataframe is not guaranteed, but each 'Vol x' row matches one other row in the dataframe with the same file and id, you can split the dataframe into two parts, one containing the 'Vol x' rows and one containing the others, and update the latter from the former:
g = df.groupby(df.book.fillna('').str.match('Vol [0-9]+'))
for k, v in g:
    if k:
        df_vol = v
    else:
        df = v
for row in df_vol.iterrows():
    r = row[1]
    df.loc[(df.file == r.file) & (df.id == r.id), 'book'] += ' - ' + r['book']
Utilizing merge, apply, update, and drop_duplicates: set_index and merge on the (file, id) index between the 'Harry Potter' rows and the 'Vol 1' rows, then join to create the appropriate string and convert it back to a dataframe.
df.set_index(['file', 'id'], inplace=True)
df1 = (df[df['book'] == 'Harry Potter']
         .merge(df[df['book'] == 'Vol 1'], left_index=True, right_index=True)
         .apply(' '.join, axis=1)
         .to_frame(name='book'))

Out[2059]:
                          book
file    id
abc.txt 1   Harry Potter Vol 1
Update the original df, drop duplicates, and reset the index:
df.update(df1)
df.drop_duplicates().reset_index()
Out[2065]:
      file  id                       book
0  abc.txt   1         Harry Potter Vol 1
1  ert.txt   2  Lord of the Rings - Vol 1
2  ert.txt   2                        NaN
3  ert.txt   2               Harry Potter

How to normalize text in a list that is itself in a column of Pandas dataframe?

Basically, I have a dataframe of Ike's sandwiches with three columns: Ingredients / Name / Price, where the ingredients column holds a list of ingredients like ['x', ' y', ' z'].
Unfortunately, when I scraped the data it retained odd spaces and other formatting, and now I'd like to amend the ingredient lists in the column to remove the spaces and force lower case.
example:
0  [Avocado, French Dressing, Gouda, Ham, Sal...      Al Bundy     $9.99
1  [Caesar, Halal Chicken, Marinated Artichoke...     Backstabber  $9.99
2  [Bacon, Swiss, Turkey]                             Barry B.     $8.98
3  [Avocado, Havarti, Turkey]                         Barry Z.     $8.98
4  [Avocado, Halal Chicken, Honey Mustard, Pep...     Bella        $9.99
And the problem is:
> [x for x in mdf.ingredients[3:4]]
[[u'Avocado', u' Havarti', u' Turkey']]
Notice the spaces
I tried doing:

for sandwich in mdf.ingredients:
    for ingredient in sandwich:
        ingredient = ingredient.strip()
        ingredient = ingredient.lower()
Which, if I print ingredient inside the loop, accomplishes my goal, but does not actually change the values within the dataframe.
Is there any way to change the values within those lists, or do I need to make a whole new column with the corrected values?
To modify df['ingredients'], you could assign it to a list of lists. For example, if df looks like this:
import pandas as pd
df = pd.DataFrame([([u'Avocado', u' Havarti', u' Turkey'], 'Barry Z', 8.98),
                   ([u'Bacon', u' Swiss', u'Turkey'], 'Barry B', 8.98)],
                  columns=['ingredients', 'name', 'price'])
print(df)
# ingredients name price
# 0 [Avocado, Havarti, Turkey] Barry Z 8.98
# 1 [Bacon, Swiss, Turkey] Barry B 8.98
then
df['ingredients'] = [[item.strip().lower() for item in lst] for lst in df['ingredients']]
makes df look like
ingredients name price
0 [avocado, havarti, turkey] Barry Z 8.98
1 [bacon, swiss, turkey] Barry B 8.98
However, having a column of lists is often not very convenient. If you want to find all the items with swiss as an ingredient, you have to loop through each row, check if that row has swiss, then return that row.
If instead you normalized the DataFrame so that each item has its own column, then that kind of search can be expressed more easily.
For example:
import pandas as pd
df = pd.DataFrame([([u'Avocado', u' Havarti', u' Turkey'], 'Barry Z', 8.98),
                   ([u'Bacon', u' Swiss', u'Turkey'], 'Barry B', 8.98)],
                  columns=['ingredients', 'name', 'price'])
ingredients = df['ingredients'].apply(
    lambda lst: pd.Series(True, index=[item.strip().lower() for item in lst]))
ingredients.fillna(False, inplace=True)
del df['ingredients']
df = df.join(ingredients)
print(df)
produces a DataFrame that looks like
      name  price  avocado  bacon  havarti  swiss  turkey
0  Barry Z   8.98     True  False     True  False    True
1  Barry B   8.98    False   True    False   True    True
Now to find all items which contain swiss you could use:
In [43]: df[df['swiss']]
Out[43]:
      name  price  avocado  bacon  havarti  swiss  turkey
1  Barry B   8.98    False   True    False   True    True
By the way, this code:

for ingredient in sandwich:
    ingredient = ingredient.strip()

does not affect sandwich, because inside the loop the variable ingredient is reassigned to a new value; it does not change the values stored in sandwich. Understanding this is a fundamental ingredient of understanding Python's name/reference model.
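If you do want to mutate the lists in place rather than building a new column, assign back by index; this works because the dataframe column holds references to the same list objects:

for sandwich in mdf.ingredients:
    for i, ingredient in enumerate(sandwich):
        sandwich[i] = ingredient.strip().lower()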
