I am trying to modify the values in columns of a pandas DataFrame based on conditionals. This answer (https://stackoverflow.com/a/50779719/1112097) is close, but its conditionals are too simple for my use case, which checks membership in a dictionary of lists.
Consider a DataFrame of individuals and their locations:
import pandas as pd

owners = pd.DataFrame([['John', 'North'],
                       ['Sara', 'South'],
                       ['Seth', 'East'],
                       ['June', 'West']],
                      columns=['Who', 'Location'])
owners
output:
    Who Location
0  John    North
1  Sara    South
2  Seth     East
3  June     West
The dictionary contains lists of locations where a type of pet can go:
pets = {
    'Cats': ['North', 'South'],
    'Dogs': ['East', 'North'],
    'Birds': ['South', 'East']}
pets
output: {'Cats': ['North', 'South'],
         'Dogs': ['East', 'North'],
         'Birds': ['South', 'East']}
I need to add a column to the owners DataFrame for each pet type, containing 'Yes' or 'No' depending on whether the row's Location appears in that pet's list of allowed locations.
In this example, the final table should look like this:
    Who Location Cats Dogs Birds
0  John    North  Yes  Yes    No
1  Sara    South  Yes   No   Yes
2  Seth     East   No  Yes   Yes
3  June     West   No   No    No
This fails:
for pet in pets:
    owners[pet] = 'Yes' if owners['Location'] in pets[pet] else 'No'
With the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I understand that the error comes from the fact that owners['Location'] is a Series, not an individual value in a row, but I don't know the proper way to apply this kind of conditional across the rows of a DataFrame.
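For reference, the fix is to vectorize the check instead of evaluating it once per loop iteration. A minimal sketch for a single pet type, assuming numpy is imported as np (the answers below generalize the same idea in several forms):

import numpy as np

# isin performs the per-row membership test and returns a boolean Series;
# np.where maps the True/False values to 'Yes'/'No' in one vectorized step
owners['Cats'] = np.where(owners['Location'].isin(pets['Cats']), 'Yes', 'No')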
Let's use isin:
for k, v in pets.items():
    owners[k] = owners['Location'].isin(v)
Who Location Cats Dogs Birds
0 John North True True False
1 Sara South True False True
2 Seth East False True True
3 June West False False False
You can also use .isin() and .map()
for pet in pets:
    owners[pet] = owners["Location"].isin(pets[pet]).map({True: "Yes", False: "No"})
print(owners)
Who Location Cats Dogs Birds
0 John North Yes Yes No
1 Sara South Yes No Yes
2 Seth East No Yes Yes
3 June West No No No
You need to iterate through both the keys and values of pets; see below.
for k, v in pets.items():
    owners[k] = owners['Location'].apply(lambda x: 'Yes' if x in v else 'No')
will output:
    Who Location Cats Dogs Birds
0  John    North  Yes  Yes    No
1  Sara    South  Yes   No   Yes
2  Seth     East   No  Yes   Yes
3  June     West   No   No    No
You can use apply:
for pet, locs in pets.items():
    owners[pet] = owners['Location'].apply(lambda l: 'Yes' if l in locs else 'No')
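Note that apply calls the lambda once per row at Python speed, so on large frames the vectorized isin approaches above will generally be faster; the result is identical.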
Some of the other answers are probably faster, but here is a way that works by inverting the dictionary's keys and values.
# invert the mapping: location -> list of pet types allowed there
d = {i: [] for i in set(j for v in pets.values() for j in v)}
for k, v in pets.items():
    for i in v:
        d[i].append(k)

# map each location to its pet list, pipe-join it, expand into indicator
# columns with get_dummies, then turn 1/0 into Yes/No
owners.join(owners['Location'].map(d).str.join('|').str.get_dummies().replace({1: 'Yes', 0: 'No'}))
Output:
Who Location Birds Cats Dogs
0 John North No Yes Yes
1 Sara South Yes Yes No
2 Seth East Yes No Yes
3 June West No No No
Related
I have a dataframe with a string column that contains a sequence of author names and their affiliations.
Address
'Smith, Jane (University of X); Doe, Betty (Institute of Y)'
'Walter, Bill (Z University); Albertson, John (Z University); Chen, Hilary (University of X)'
'Note, Joe (University of X); Cal, Stephanie (University of X)'
I want to create a new column with a Boolean TRUE/FALSE that tests whether all authors are from University of X. Note there can be any number of authors in the string.
Desired output:
T/F
FALSE
FALSE
TRUE
I think I can split the Address column using
df['Address_split'] = df['Address'].str.split(';', expand=False)
which then creates the list of names in the cell.
Address_split
['Smith, Jane (University of X)', ' Doe, Betty (Institute of Y)']
I even think I can use the all() function to test for a Boolean for one cell at a time.
all([("University X" in i) for i in df['Address_split'][2]]) returns TRUE
But I am struggling to think through how I can do this on each cell's list individually. I think I need some combination of map and/or apply.
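For what it's worth, a direct continuation of that approach (a sketch assuming the Address_split column created above already exists) applies the same all() test to every cell's list:

# apply runs the all() check on each row's list of address strings
df['T/F'] = df['Address_split'].apply(
    lambda addrs: all('University of X' in a for a in addrs))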
You can split with expand=True so you can stack everything into one long Series, then use extract to pull out the name and location.
Then the check is that all the values are 'University of X', which can be done with an equality comparison plus all within a groupby. Since the grouping is based on the original index, you can simply assign the result back to the original DataFrame.
s = (df['Address'].str.split(';', expand=True).stack()
       .str.extract(r'(.*)\s\((.*)\)')
       .rename(columns={0: 'name', 1: 'location'}))
# name location
#0 0 Smith, Jane University of X
# 1 Doe, Betty Institute of Y
#1 0 Walter, Bill Z University
# 1 Albertson, John Z University
# 2 Chen, Hilary University of X
#2 0 Note, Joe University of X
# 1 Cal, Stephanie University of X
df['T/F'] = s['location'].eq('University of X').groupby(level=0).all()
print(df)
Address T/F
0 Smith, Jane (University of X); Doe, Betty (Ins... False
1 Walter, Bill (Z University); Albertson, John (... False
2 Note, Joe (University of X); Cal, Stephanie (U... True
You can use str.extractall to extract all the parenthesized affiliations and check whether every match equals 'University of X'.
df['T/F'] = df['Address'].str.extractall(r"\(([^)]*)\)").eq('University of X').groupby(level=0).all()
Address T/F
0 'Smith, Jane (University of X); Doe, Betty (In... False
1 'Walter, Bill (Z University); Albertson, John ... False
2 'Note, Joe (University of X); Cal, Stephanie (... True
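extractall returns one row per match with a MultiIndex of (original row label, match number), which is why the comparison is grouped on level=0 before calling all.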
Here are some other options:
u = 'University of X'
df['Address'].str.count(u).eq(df['Address'].str.count(';')+1)
or
df['Address'].str.findall(r'([\w ]+)(?=\))').map(lambda x: set(x) == {u})
Output:
0 False
1 False
2 True
I have a dataframe which looks something like this:
dfA
name field country action
Sam elec USA POS
Sam elec USA POS
Sam elec USA NEG
Tommy mech Canada NEG
Tommy mech Canada NEG
Brian IT Spain NEG
Brian IT Spain NEG
Brian IT Spain POS
I want to group the dataframe based on the first 3 columns, adding a new column "No_of_data". This is something I do using:
dfB = dfA.groupby(["name", "field", "country"], dropna=False).size().reset_index(name = "No_of_data")
This gives me a new dataframe which looks something like this:
dfB
name field country No_of_data
Sam elec USA 3
Tommy mech Canada 2
Brian IT Spain 3
But now I also want to add a new column to this particular dataframe which tells me what is the count of number of "POS" for every combination of "name", "field" and "country". Which should look something like this:
dfB
name field country No_of_data No_of_POS
Sam elec USA 3 2
Tommy mech Canada 2 0
Brian IT Spain 3 1
How do I add the new column (No_of_POS) to dfB when dfB doesn't contain the POS/NEG information, so it has to be taken from dfA?
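For reference, one way to bolt the count onto the existing dfB is a separate filtered groupby plus a merge (a sketch; the answers below compute both columns in a single groupby, which is simpler):

# count POS rows per group, then left-merge onto dfB; groups with no POS get 0
pos = (dfA[dfA['action'] == 'POS']
       .groupby(['name', 'field', 'country'])
       .size()
       .reset_index(name='No_of_POS'))
dfB = dfB.merge(pos, on=['name', 'field', 'country'], how='left')
dfB['No_of_POS'] = dfB['No_of_POS'].fillna(0).astype(int)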
You can use named aggregation in the agg method (passing a dictionary of renamed aggregations is deprecated in modern pandas):
dfA.groupby(["name", "field", "country"], as_index=False)['action'] \
   .agg(No_of_data='size', No_of_POS=lambda x: x.eq('POS').sum())
You can precompute the boolean before aggregating; performance should be better as the data size increases:
(dfA.assign(action=dfA.action.eq('POS'))
    .groupby(['name', 'field', 'country'],
             sort=False,
             as_index=False)
    .agg(no_of_data=('action', 'size'),
         no_of_pos=('action', 'sum')))
name field country no_of_data no_of_pos
0 Sam elec USA 3 2
1 Tommy mech Canada 2 0
2 Brian IT Spain 3 1
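Precomputing the boolean lets the groupby use the fast built-in 'sum' aggregation instead of invoking a Python lambda for every group, which is where the speedup comes from.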
You can pass an aggregation function when you group your data; see the agg() examples above.
I'm trying to do a find-and-replace on a specific column, type1, of a DataFrame data, using terms stored in a dictionary. I first make all existing values in the column uppercase. I create the dictionary mdata and make sure its entries are all uppercase as well. Then I loop through the items in mdata with a for statement, replacing accordingly. This code used to work before I turned it into a function.
Any ideas where I've gone wrong?
def to_fish(data, fish):
    data['type1'] = data['type1'].str.upper()
    if fish == 'monument':
        mdata = {
            'natural': 'NATURAL FEATURe',
            'DITCH TERMINUS': 'DITCH',
            'DITCH RECUT': 'DITCH',
            'NATURAL_lyr': 'NATURAL FEATURE'
        }
        mdata = {k.upper(): v.upper() for k, v in mdata.items()}
        for copa, fish in mdata.items():
            data = data.str.rstrip().str.lstrip().replace(copa, fish, regex=True)
Try the map method:
data['type1'] = data['type1'].map(mdata)
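One caveat worth knowing: map does a whole-value lookup, so anything not present as a dictionary key becomes NaN. If the column contains values outside mdata, replace keeps them unchanged instead:

# .replace only touches values that appear as keys in mdata;
# everything else is left as-is (with .map they would become NaN)
data['type1'] = data['type1'].replace(mdata)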
Here is a worked example:
import pandas as pd

df = pd.DataFrame({'A': ['Hello', 'Bye', 'OK', 'Hi', 'Bonjour'],
                   'B': ['Jack', 'Jill', 'Bryan', 'Kevin', 'Susan'],
                   'C': ['High', 'High', 'Middle', 'Middle', 'Low']})
print(df)
lookup_dict = {'High': 'California', 'Middle': 'Chicago', 'Low': 'New York'}
df['C'] = df['C'].map(lookup_dict)
print(df)
Before:
A B C
0 Hello Jack High
1 Bye Jill High
2 OK Bryan Middle
3 Hi Kevin Middle
4 Bonjour Susan Low
After:
A B C
0 Hello Jack California
1 Bye Jill California
2 OK Bryan Chicago
3 Hi Kevin Chicago
4 Bonjour Susan New York
I have a dataframe as below:
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
                   'Region': ['Americas', 'NaN', 'NaN', 'Asia', 'Europe', 'NaN', 'NaN'],
                   'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
                   'Animal': ['Bison', 'NaN', 'Golden Eagle', 'Tiger', 'Lion', 'Lion', 'NaN'],
                   'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket']})
I want to group by Country and Flower and forward-fill or backward-fill the columns Region and Animal where there are missing values; the column Game should remain intact.
I have tried this but it didn't work:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
also:
df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()
I want to know how to go about this.
While this works, it removes the Game column:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Animal', 'Region'].bfill().ffill()
And if I do a transform, there is a mismatch in the length. Also please note that this is a sample DataFrame; I added "NaN" as a string here, but in the original frame the missing values are actual np.nan.
If you change your DataFrame code to actually include np.nan, then the code you provided works. Although missing values print as the text NaN, you can't create them by writing that text by hand, because it will be interpreted as a string, not an actual missing value.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
                   'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
                   'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
                   'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
                   'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket']})
Then, this:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
actually yields this:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 NaN USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 NaN UK Dandelion cricket Europe
First, you need to know that the string 'NaN' is not NaN:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]:
0 Americas
1 Americas
2 NaN  # this group has only a single row, so it stays NaN
3 Asia
4 Europe
5 Europe
6 Europe
Name: Region, dtype: object
Second, if you need to chain two group-wise operations like bfill and ffill in pandas, you need apply:
df.update(df.groupby(['Country','Flower'])['Animal', 'Region'].apply(lambda x : x.bfill().ffill()))
df
Out[119]:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 Bison USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 Lion UK Dandelion cricket Europe
Since the MEX/Lily combination has only a single row and its Region value is NaN, a group-based fillna cannot find an appropriate value for it.
If we catch the exception while filling with the group mode, the values with no usable group are left as they are; we then apply ffill and bfill to cover the values that don't have an appropriate group:
import numpy as np
import pandas as pd

df_stack = pd.DataFrame({'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
                         'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
                         'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
                         'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
                         'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket']})
print("-------Before imputation------")
print(df_stack)
def fillna_Region(grp):
    try:
        return grp.fillna(grp.mode()[0])
    except BaseException as e:
        print('Error as no corresponding group: ' + str(e))

df_stack["Region"] = df_stack["Region"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Region'].transform(lambda grp: fillna_Region(grp)))
df_stack["Animal"] = df_stack["Animal"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Animal'].transform(lambda grp: fillna_Region(grp)))
df_stack = df_stack.ffill(axis=0)
df_stack = df_stack.bfill(axis=0)
print("-------After imputation------")
print(df_stack)
I have a very large dataset that looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan],
                   'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan]})
df
Out[173]:
B C
0 john smith indiana jones
1 john doe duck mc duck
2 adam smith batman
3 john doe duck mc duck
4 NaN NaN
I need to create an ID variable that is unique for every B-C combination. That is, the output should be:
B C ID
0 john smith indiana jones 1
1 john doe duck mc duck 2
2 adam smith batman 3
3 john doe duck mc duck 2
4 NaN NaN 0
I actually don't care whether the index starts at zero, or whether the value for the row with missing values is 0 or any other number. I just want something fast that does not take a lot of memory and can be sorted quickly.
I use:
df['combined_id'] = (df.B + df.C).rank(method='dense')
but the output is float64 and takes a lot of memory. Can we do better?
Thanks!
I think you can use factorize:
df['combined_id'] = pd.factorize(df.B+df.C)[0]
print(df)
B C combined_id
0 john smith indiana jones 0
1 john doe duck mc duck 1
2 adam smith batman 2
3 john doe duck mc duck 1
4 NaN NaN -1
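factorize returns a plain integer array (with -1 for missing keys, as in row 4 above), so it is already far lighter than the float64 output of rank; if memory is very tight you can downcast further, for example:

# factorize yields int64 codes; downcasting is optional
df['combined_id'] = pd.factorize(df.B + df.C)[0].astype('int32')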
Making jezrael's answer a little more general (what if the columns are not strings?), you can use this compact function:
def make_identifier(df):
    str_id = df.apply(lambda x: '_'.join(map(str, x)), axis=1)
    return pd.factorize(str_id)[0]

df['combined_id'] = make_identifier(df[['B', 'C']])
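One detail of this generalization: str(np.nan) produces the string 'nan', so a row that is entirely missing gets a regular group id here, rather than the -1 that factorize assigns when the combined key itself is NaN.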
jezrael's answer is great. But since this is for multiple columns, I prefer to use .ngroup(), since that way NaN can remain NaN.
df['combined_id'] = df.groupby(['B', 'C'], sort=False).ngroup()
df
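Note that with sort=False the groups are numbered in order of first appearance, and rows whose keys are dropped by groupby (the all-NaN row here) come back as NaN, which also makes the combined_id column float rather than integer.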