pandas: cannot set column with substring extracted from other column - python

I'm doing something wrong when attempting to set a column for a masked subset of rows to the substring extracted from another column.
Here is some example code that illustrates the problem I am facing:
import pandas as pd
data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'}
]
df = pd.DataFrame(data)
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')
print("df:")
print(df)
print("mask:")
print(mask)
print("extraction:")
print(df[mask]['base_col'].str.extract(r'key=(.*)'))
The output I get from the above code is as follows:
df:
type base_col derived_col
0 A key=val NaN
1 B other_val NaN
2 A key=val NaN
3 B other_val NaN
mask:
0 True
1 False
2 True
3 False
Name: type, dtype: bool
extraction:
0
0 val
2 val
The boolean mask is as I expect and the extracted substrings on the subset of rows (indexes 0, 2) are also as I expect yet the new derived_col comes out as all NaN. The output I would expect in the derived_col would be 'val' for indexes 0 and 2, and NaN for the other two rows.
Please clarify what I am getting wrong here. Thanks!

You should assign a Series, not a DataFrame. str.extract returns a DataFrame, and its single capture group becomes a column labelled 0, so pick that column:
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')[0]
df
Out[449]:
type base_col derived_col
0 A key=val val
1 B other_val NaN
2 A key=val val
3 B other_val NaN
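If you prefer not to select column 0, passing expand=False (supported when there is a single capture group) makes str.extract return a Series directly; a minimal sketch of the same example:

```python
import pandas as pd

data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
]
df = pd.DataFrame(data)
mask = df['type'] == 'A'

# With expand=False and one capture group, str.extract returns a Series,
# which aligns on the row index during the .loc assignment
df.loc[mask, 'derived_col'] = df.loc[mask, 'base_col'].str.extract(r'key=(.*)', expand=False)
print(df)
```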

How to join multiple dataframe columns based on row index to specified column?

PROBLEM STATEMENT:
I'm trying to join multiple pandas data frame columns, based on row index, to a single column already in the data frame. Issues seem to happen when the data in a column is read in as np.nan.
EXAMPLE:
Original Data frame
time  msg   d0  d1  d2
0     msg0  a   b   c
1     msg1  x   x   x
2     msg0  a   b   c
3     msg2  1   2   3

What I want, if I were to filter for msg0 and msg2:

time  msg   d0   d1   d2
0     msg0  abc  NaN  NaN
1     msg1  x    x    x
2     msg0  abc  NaN  NaN
3     msg2  123  NaN  NaN
MY ATTEMPT:
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
mask = df.index[((df['msg'] == "msg0") |
                 (df['msg'] == "msg1") |
                 (df['msg'] == "msg3"))].tolist()
# Is there a better way to combine all columns after a certain point?
# This works fine here but has issues when importing large data sets.
# The 'd0' will be set to NaN too; I think this is due to np.nan
# being set to some columns' values when imported.
df.loc[mask, 'd0'] = df['d0'] + df['d1'] + df['d2']
df.iloc[mask, 3:] = "NaN"
The approach is somewhat similar to @mozway's answer; I will make it more detailed so it is easier to follow.
1- Define your target columns and messages (just to make it easier to deal with)
# the messages to filter
msgs = ["msg0", "msg2"]
# the columns to filter
columns = df.columns.drop(['time', 'msg'])
# the column to contain the result
total_col = ["d0"]
2- Mask the rows based on the (msgs) column value
mask = df['msg'].isin(msgs)
3- Find the value of the combined values
# a- mask the dataframe to the target columns and rows.
# b- apply ''.join() to join all the column values
# c- to join columns not rows apply on axis = 1
new_total_col = df.loc[mask, columns].apply(lambda x: ''.join(x.dropna().astype(str)), axis=1)
4- Set all target columns and rows to np.nan and redefine the values of the "total" column
df.loc[mask, columns] = np.nan
df.loc[mask, total_col] = new_total_col
Result
time msg d0 d1 d2
0 0 msg0 abc NaN NaN
1 1 msg1 x x x
2 2 msg0 ab NaN NaN
3 3 msg2 123 NaN NaN
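Putting the four steps together on the data from the attempt above, a minimal self-contained sketch (assigning the joined values directly to 'd0'):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

msgs = ["msg0", "msg2"]                       # messages to filter
columns = df.columns.drop(['time', 'msg'])    # data columns to combine

mask = df['msg'].isin(msgs)
# join the non-NaN column values row-wise
new_total_col = df.loc[mask, columns].apply(lambda x: ''.join(x.dropna().astype(str)), axis=1)

# blank out the target block, then write the combined value into 'd0'
df.loc[mask, columns] = np.nan
df.loc[mask, 'd0'] = new_total_col
print(df)
```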
You can use:
cols = ['d0', 'd1', 'd2']
# get the rows matching the msg condition
m = df['msg'].isin(['msg0', 'msg2'])
# get relevant columns
# concatenate the non-NaN value
# update as a DataFrame to assign NaN in the non-first columns
df.loc[m, cols] = (df
   .loc[m, cols]
   .agg(lambda r: ''.join(r.dropna()), axis=1)
   .rename(cols[0]).to_frame()
)
print(df)
Output:
time msg d0 d1 d2
0 0 msg0 abc NaN NaN
1 1 msg1 x x x
2 2 msg0 ab NaN NaN
3 3 msg2 123 NaN NaN
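This works because .loc assignment aligns on column labels: the assigned object is a one-column DataFrame named 'd0', so 'd1' and 'd2' receive NaN for the selected rows. A self-contained sketch using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

cols = ['d0', 'd1', 'd2']
m = df['msg'].isin(['msg0', 'msg2'])

# concatenate the non-NaN values row-wise, name the result 'd0', and let
# column alignment fill 'd1'/'d2' with NaN on the masked rows
df.loc[m, cols] = (df
   .loc[m, cols]
   .agg(lambda r: ''.join(r.dropna()), axis=1)
   .rename(cols[0]).to_frame()
)
print(df)
```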

create dataframe with outliers and then replace with nan

I am trying to make a function to spot the columns with "100" in the header and replace the values in these columns with NaN depending on multiple criteria.
I also want in the function the value of the column "first_column" corresponding to the outlier.
For instance let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values :
I start with this dataframe:
import pandas as pd
data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_100': ['25', '1568200', '5'],
        }
df = pd.DataFrame(data)
print(df)
expected output:
IIUC, you can use filter and boolean indexing:
# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))
# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# get rows with at least one match
out2 = df.loc[mask.any(axis=1), ['first_column']+list(df.filter(like='100'))]
output 1:
first_column second_column third_100 fourth_100
0 product_name first_value 89 25
1 product_name2 second_value 9 NaN
2 product_name3 third_value NaN 5
output 2:
first_column third_100 fourth_100
1 product_name2 9 1568200
2 product_name3 589 5
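For reference, the answer's approach runs end-to-end once the question's dict is fixed; a self-contained sketch:

```python
import pandas as pd

# data from the question, with the quoting fixed
data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_100': ['25', '1568200', '5']}
df = pd.DataFrame(data)

# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# flag values < 0 or > 100
mask = df2.lt(0) | df2.gt(100)
# replace flagged values with NaN across the full frame
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# rows with at least one outlier, keeping 'first_column' for reference
out2 = df.loc[mask.any(axis=1), ['first_column'] + list(df.filter(like='100'))]
print(out1)
print(out2)
```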

How to merge many DataFrames by index combining values where columns overlap?

I have many DataFrames that I need to merge.
Let's say:
base: id constraint
1 'a'
2 'b'
3 'c'
df_1: id value constraint
1 1 'a'
2 2 'a'
3 3 'a'
df_2: id value constraint
1 1 'b'
2 2 'b'
3 3 'b'
df_3: id value constraint
1 1 'c'
2 2 'c'
3 3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value value_x value_y
1 'a' 1 NaN NaN
2 'b' NaN 2 NaN
3 'c' NaN NaN 3
The desired output would be:
id constraint value
1 'a' 1
2 'b' 2
3 'c' 3
I know about combine_first and it works, but I can't use this approach because it is thousands of times slower.
Is there a merge that can replace values in case of columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest to concat first your dataframe (using a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
id value
0 1 1
1 2 2
2 3 3
Update
Running the code with the new version of your question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Which seems to be compliant with your expected output.
You can use ffill() for the purpose:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1, df_2, df_3), axis=1)
   .ffill(axis=1)
   .iloc[:, -1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1, df2, df3)),
           on=['id', 'constrains'],
           how='left')
output:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Conclusion: you are actually looking for the option how='left' in merge
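To make that conclusion concrete, here is a minimal self-contained sketch of the concat-then-merge approach on frames shaped like the question's (the column name constraint is assumed):

```python
import pandas as pd

base = pd.DataFrame({'id': [1, 2, 3], 'constraint': ['a', 'b', 'c']})
df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [1, 2, 3], 'constraint': ['a', 'a', 'a']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'value': [1, 2, 3], 'constraint': ['b', 'b', 'b']})
df3 = pd.DataFrame({'id': [1, 2, 3], 'value': [1, 2, 3], 'constraint': ['c', 'c', 'c']})

# one concat, one left merge: each base row picks up its single matching value
out = base.merge(pd.concat([df1, df2, df3]), on=['id', 'constraint'], how='left')
print(out)
```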
If you only need to merge all the dataframes with base (based on the edit):
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1, df_2, df_3]
for i in dataframes:
    base = base.merge(i, how='left', on=['id', 'constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any', axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
For those who simply want to do a merge that overrides the values (which is my case), this method achieves that; it is really similar to Celius Stingher's answer.
Documented version is on the original gist.
import pandas as pa

def rmerge(left, right, **kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum(([x] if not isinstance(x, list) else flatten(x) for x in lst), [])

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace': 'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # pandas merge command
    kwargs = {k: v for k, v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on', 'left_on', 'right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))
        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols, axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols, axis=1)
    df = pa.merge(left, right, **kwargs)
    return df
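The core idea of rmerge, dropping the overlapping non-key columns from one side before merging so the other side's values win, can be sketched standalone (the frames and the value column here are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pd.DataFrame({'id': [1, 2], 'value': [100, 200]})

# drop the overlapping non-key columns from `left` so `right`'s values win;
# ids missing from `right` end up as NaN rather than keeping left's value
keys = ['id']
overlap = [c for c in left.columns.intersection(right.columns) if c not in keys]
merged = pd.merge(left.drop(columns=overlap), right, on=keys, how='left')
print(merged)
```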

Reassigning Entries in a Column of Pandas DataFrame

My goal is to conditionally index a data frame and change the values in a column for these indexes.
I intend to look through column 'A' to find entries equal to 'a' and update their column 'B' with the word 'okay'.
import pandas as pd
import numpy as np

group = ['a']
df = pd.DataFrame({"A": ['a', 'b', 'a', 'a', 'c'], "B": [np.nan] * 5})
>>>df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
df[df['A'].apply(lambda x: x in group)]['B'].fillna('okay', inplace=True)
This gives me the following error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
Following the documentation (what I understood of it) I tried the following instead:
df[df['A'].apply(lambda x: x in group)].loc[:,'B'].fillna('okay', inplace=True)
I can't figure out why the reassignment of 'NaN' to 'okay' is not occurring inplace and how this can be rectified?
Thank you.
Try this with lambda:
Solution First:
>>> df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
Using lambda with map or apply:
>>> df["B"] = df["A"].map(lambda x: "okay" if "a" in x else "NaN")
OR# df["B"] = df["A"].map(lambda x: "okay" if "a" in x else np.nan)
OR# df['B'] = df['A'].apply(lambda x: 'okay' if x == 'a' else np.nan)
>>> df
A B
0 a okay
1 b NaN
2 a okay
3 a okay
4 c NaN
Solution second:
>>> df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
another fancy way: create a dictionary and apply it with the map function across the column:
>>> frame = {'a': "okay"}
>>> df['B'] = df['A'].map(frame)
>>> df
A B
0 a okay
1 b NaN
2 a okay
3 a okay
4 c NaN
Solution Third:
This has already been posted by @d_kennetz, but just to club it together here: you can do the filtering on column A and the assignment to column B in one shot:
>>> df.loc[df.A == 'a', 'B'] = "okay"
If I understand this correctly, you simply want to replace the value for a column on those rows matching a given condition (i.e. where A column belongs to a certain group, here with a single value 'a'). The following should do the trick:
import pandas as pd
group = ['a']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
print(df)
df.loc[df['A'].isin(group),'B'] = 'okay'
print(df)
What we're doing here is we're using the .loc filter, which just returns a view on the existing dataframe.
The first argument (df['A'].isin(group)) filters on those rows matching a given criterion. Notice you can use the equality operator (==) but not the in operator, so you have to use .isin() instead.
The second argument selects only the 'B' column.
Then you just assign the desired value (which is a constant).
Here's the output:
A B
0 a None
1 b None
2 a None
3 a None
4 c None
A B
0 a okay
1 b None
2 a okay
3 a okay
4 c None
If you wanted to do fancier stuff, you might do the following:
import pandas as pd
group = ['a', 'b']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
df.loc[df['A'].isin(group),'B'] = "okay, it was " + df['A']+df['A']
print(df)
Which gives you:
A B
0 a okay, it was aa
1 b okay, it was bb
2 a okay, it was aa
3 a okay, it was aa
4 c None
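Returning to the original fillna attempt: the chained version fails because it fills a copy. Routing both the read and the write through a single .loc call fixes it; a minimal sketch with the question's data:

```python
import numpy as np
import pandas as pd

group = ['a']
df = pd.DataFrame({"A": ['a', 'b', 'a', 'a', 'c'], "B": [np.nan] * 5})

# Chained indexing (df[...]['B'].fillna(..., inplace=True)) fills a copy,
# which is what the SettingWithCopyWarning is about. Reading and writing
# through one .loc call updates the original frame.
mask = df['A'].isin(group)
df.loc[mask, 'B'] = df.loc[mask, 'B'].fillna('okay')
print(df)
```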

Excluding 'None' when checking for 'NaN' values in pandas

I'm cleaning a dataset of NaN values to run linear regression on it; in the process, I replaced some NaN with None.
After doing this I check for remaining columns with NaN values using the following code, where houseprice is the name of the dataframe
def cols_NaN():
    return houseprice.columns[houseprice.isnull().any()].tolist()

print(houseprice[cols_NaN()].isnull().sum())
the problem is that the result of the above includes None values also. I want to select those columns which have NaN values. How can I do that?
The only thing I could think of is to check whether elements are float, because np.nan is of type float and is null.
Consider the dataframe df
df = pd.DataFrame(dict(A=[1., None, np.nan]), dtype=np.object)
print(df)
A
0 1
1 None
2 NaN
Then we test whether each element is both a float and null:
df.A.apply(lambda x: isinstance(x, float)) & df.A.isnull()
0 False
1 False
2 True
Name: A, dtype: bool
Working with column names is a bit different, because you need map and pandas.isnull: an Index has no apply or isnull method, so houseprice.columns.apply() and houseprice.columns.isnull() raise errors:
AttributeError: 'Index' object has no attribute 'apply'
AttributeError: 'Index' object has no attribute 'isnull'
houseprice = pd.DataFrame(columns = [np.nan, None, 'a'])
print (houseprice)
Empty DataFrame
Columns: [nan, None, a]
print (houseprice.columns[(houseprice.columns.map(type) == float) &
(pd.isnull(houseprice.columns))].tolist())
[nan]
And to check all values in the DataFrame, applymap is necessary:
houseprice = pd.DataFrame({'A':[1,2,3],
                           'B':[4,5,6],
                           'C':[np.nan,8,9],
                           'D':[1,3,5],
                           'E':['a','s',None],
                           'F':[np.nan,4,3]})
print (houseprice)
A B C D E F
0 1 4 NaN 1 a NaN
1 2 5 8.0 3 s 4.0
2 3 6 9.0 5 None 3.0
print (houseprice.columns[(houseprice.applymap(lambda x: isinstance(x, float)) &
houseprice.isnull()).any()])
Index(['C', 'F'], dtype='object')
And for the sum this code is simpler - sum the True values in the boolean mask:
print ((houseprice.applymap(lambda x: isinstance(x, float)) &
houseprice.isnull()).any().sum())
2
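A version-safe sketch of the same idea without applymap (which newer pandas deprecates in favour of DataFrame.map), using a plain comprehension over the columns:

```python
import numpy as np
import pandas as pd

houseprice = pd.DataFrame({'A': [1, 2, 3],
                           'C': [np.nan, 8, 9],
                           'E': ['a', 's', None],
                           'F': [np.nan, 4, 3]})

# np.nan is a float while None is not, so "is a float AND is null"
# keeps columns with real NaNs and skips columns that only hold None
cols_with_nan = [c for c in houseprice.columns
                 if any(isinstance(x, float) and pd.isnull(x) for x in houseprice[c])]
print(cols_with_nan)
```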
