Replace values in dataframe from another dataframe with Pandas - python

I have 3 dataframes: df1, df2, df3. I am trying to fill NaN values of df1 with some values contained in df2. The values selected from df2 are also selected according to the output of a simple function (mul_val) who processes some data stored in df3.
I was able to get such result but I would like to find in a simpler, easier way and more readable code.
Here is what I have so far:
import pandas as pd
import numpy as np
# simple function
def mul_val(a,b):
return a*b
# dataframe 1
data = {'Name':['PINO','PALO','TNCO' ,'TNTO','CUCO' ,'FIGO','ONGF','LABO'],
'Id' :[ 10 , 9 ,np.nan , 14 , 3 ,np.nan, 7 ,np.nan]}
df1 = pd.DataFrame(data)
# dataframe 2
infos = {'Info_a':[10,20,30,40,70,80,90,50,60,80,40,50,20,30,15,11],
'Info_b':[10,30,30,60,10,85,99,50,70,20,30,50,20,40,16,17]}
df2 = pd.DataFrame(infos)
dic = {'Name': {0: 'FIGO', 1: 'TNCO'},
'index': {0: [5, 6], 1: [11, 12, 13]}}
df3 = pd.DataFrame(dic)
#---------------Modify from here in the most efficient way!-----------------
for idx,row in df3.iterrows():
store_val = []
print(row['Name'])
for j in row['index']:
store_val.append([mul_val(df2['Info_a'][j],df2['Info_b'][j]),j])
store_val = np.asarray(store_val)
# - Identify which is the index of minimum value of the first column
indx_min_val = np.argmin(store_val[:,0])
# - Get the value relative number contained in the second column
col_value = row['index'][indx_min_val]
# Identify value to be replaced in df1
value_to_be_replaced = df1['Id'][df1['Name']==row['Name']]
# - Replace such value into the df1 having the same row['Name']
df1['Id'].replace(to_replace=value_to_be_replaced,value=col_value, inplace=True)
By printing store_val at every iteration I get:
FIGO
[[6800 5]
[8910 6]]
TNCO
[[2500 11]
[ 400 12]
[1200 13]]
Let's do a simple example: considering FIGO, I identify 6800 as the minimum number between 6800 and 8910. Therefore I select the number 5 who is placed in df1. Repeating such operation for the remaining rows of df3 (in this case I have only 2 rows but they could be a lot more), the final result should be like this:
In[0]: before In[0]: after
Out[0]: Out[0]:
Id Name Id Name
0 10.0 PINO 0 10.0 PINO
1 9.0 PALO 1 9.0 PALO
2 NaN TNCO -----> 2 12.0 TNCO
3 14.0 TNTO 3 14.0 TNTO
4 3.0 CUCO 4 3.0 CUCO
5 NaN FIGO -----> 5 5.0 FIGO
6 7.0 ONGF 6 7.0 ONGF
7 NaN LABO 7 NaN LABO
Nore: you can also remove the for loops if needed and use different type of formats to store the data (list, arrays...); the important thing is that the final result is still a dataframe.

I can offer two similar options that achieve the same result than your loop in a couple of lines:
1.Using apply and fillna() (fillna is faster than combine_first by a factor of two):
df3['Id'] = df3.apply(lambda row: (df2.Info_a*df2.Info_b).loc[row['index']].argmin(), axis=1)
df1 = df1.set_index('Name').fillna(df3.set_index('Name')).reset_index()
2.Using a function (lambda doesn't support assignment, so you have to apply a func)
def f(row):
df1.ix[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].argmin()
df3.apply(f, axis=1)
or a slight variant not relying on global definitions:
def f(row, df1, df2):
df1.ix[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].argmin()
df3.apply(f, args=(df1,df2,), axis=1)
Note that your solution, even though much more verbose, will take the least amount of time with this small dataset (7.5 ms versus 9.5 ms for both of mine). It makes sense that the speed would be similar, since in both cases it's a matter of looping on the rows of df3

Related

Add new column with column names of a table, based on conditions [duplicate]

I have a dataframe as below:
I want to get the name of the column if column of a particular row if it contains 1 in the that column.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there is multiple 1 per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
Firstly
Your question is very ambiguous and I recommend reading this link in #sammywemmy's comment. If I understand your problem correctly... we'll talk about this mask first:
df.columns[
(df == 1) # mask
.any(axis=0) # mask
]
What's happening? Lets work our way outward starting from within df.columns[**HERE**] :
(df == 1) makes a boolean mask of the df with True/False(1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this example to automate for your solution below
Next:
Automate to get an output of (<row index> ,[<col name>, <col name>,..]) where there is 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index and creates a dict with lists of df's as values
df_dict = dict(
list(
df.groupby(df.index)
)
)
Next step is a for loop that iterates the contents of each df in df_dict, checks them with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items(): # k: name of index, v: is a df
check = v.columns[(v == 1).any()]
if len(check) > 0:
print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to ask questions with posted sample data that can be reproduced. This way it helps you understand your problem better and it is easier for us to answer it for you.
Getting column name are dividing in 2 sections.
If you want in a new column name then condition should be unique because it will only give 1 col name for each row.
data = {'foo':[0,0,3,0], 'bar':[0, 5, 0, 0], 'baz':[0,0,2,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data)
df=df.replace(0,np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you were looking for min or maximum
max= df.idxmax(1)
min = df.idxmin(1)
out= df.assign(max=max , min=min)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
2nd case, If your condition is satisfied in multiple columns for example you are looking for columns that contain 1 and you are looking for list because its not possible to adjust in same dataframe.
str_con= df.astype(str).apply(lambda x:x.str.contains('1.0',case=False, na=False)).any()
df.column[str_con]
#output
Index(['spam'], dtype='object') #only spam contains 1
Or you are looking for numerical condition columns contains value more than 1
num_con = df.apply(lambda x:x>1.0).any()
df.columns[num_con]
#output
Index(['foo', 'bar', 'baz'], dtype='object') #these col has higher value than 1
Happy learning

How to iterate over rows and multiple columns in panda?

I have a dataframe (df1) and I want to replace the values for the columns V2 and V3 if they have the same value than V1.
import pandas as pd
import numpy as np
df_start= pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[10,5,20,17,15], "V3":[10, 25, 15, 10, 20]})
df_end = pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[np.nan,np.nan,20,17,15], "V3":[np.nan, 25, np.nan, 10, np.nan]})
I know iterrows is not recommended but I don't know what I should do.
You can use mask:
For a seperate dataframe use assign:
df_end = df_start.assign(**df_start[['V2','V3']]
.mask(df_start[['V2','V3']].eq(df_start['V1'],axis=0)))
For modifying the input dataframe just assign inplace:
df_start[['V2','V3']] = (df_start[['V2','V3']]
.mask(df_start[['V2','V3']].eq(df_start['V1'],axis=0)))
ID V1 V2 V3
0 1 10 NaN NaN
1 2 5 NaN 25.0
2 3 15 20.0 NaN
3 4 20 17.0 10.0
4 5 20 15.0 NaN
You'll still use a regular loop to go through the columns, but the apply function is your best friend for this kind of row-wise operation. If you're going to use info from more than one column (here you're comparing some column and "V1"), you use apply on the DataFrame and specify the axis. If you were only looking at info from one column (like making a column that doubles values from V1 if they're even, you can use apply with just a Series.
For both versions of the function, the argument you're going to pass is a lambda expression. If you apply it do a DataFrame like you are here, the x represents the values in a row that can be indexed by a column. Finally, you assign the result back to a new or existing column in your DataFrame.
Assuming that df_start and df_end represent your planned input and output:
cols = ["V2","V3"]
for col in cols:
df_start[col] = df.apply(lambda x[col] if x[col] != x["V1"] else np.nan, axis=1]

pandas transform with NaN values in grouped columns [duplicate]

I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
see that Pandas has dropped the rows with NaN target values. (I want to include these rows!)
Since I need many such operations (many cols have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing too complicated pieces of code.
Any suggestions? Should I write a function for this or is there a simple solution?
pandas >= 1.1
From pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
# without NA (the default)
df.groupby('b').sum()
a c
b
1.0 2 3
2.0 2 5
# with NA
df.groupby('b', dropna=False).sum()
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
a b
0 1 4
1 2 -1
2 3 6
In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
a
b
-1 2
4 1
6 3
That said, this feels pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue - which uses the same placeholder hack).
However, as described in another answer, "from pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False"
Ancient topic, if someone still stumbles over this--another workaround is to convert via .astype(str) to string before grouping. That will conserve the NaN's.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
a
b
4 1
6 3
nan 2
I am not able to add a comment to M. Kiewisch since I do not have enough reputation points (only have 41 but need more than 50 to comment).
Anyway, just want to point out that M. Kiewisch solution does not work as is and may need more tweaking. Consider for example
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
a b
0 1 4.0
1 2 NaN
2 3 6.0
3 5 4.0
>>> df.groupby(['b']).sum()
a
b
4.0 6
6.0 3
>>> df.astype(str).groupby(['b']).sum()
a
b
4.0 15
6.0 3
nan 2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating 1 and 5 as strings instead of adding it as numbers.
All answers provided thus far result in potentially dangerous behavior as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solve is to use pd.drop_duplicates() to create a unique index of value combinations each with their own ID, and then group on that id. It is more verbose but does get the job done:
def safe_groupby(df, group_cols, agg_dict):
# set name of group col to unique value
group_id = 'group_id'
while group_id in df.columns:
group_id += 'x'
# get final order of columns
agg_col_order = (group_cols + list(agg_dict.keys()))
# create unique index of grouped values
group_idx = df[group_cols].drop_duplicates()
group_idx[group_id] = np.arange(group_idx.shape[0])
# merge unique index on dataframe
df = df.merge(group_idx, on=group_cols)
# group dataframe on group id and aggregate values
df_agg = df.groupby(group_id, as_index=True)\
.agg(agg_dict)
# merge grouped value index to results of aggregation
df_agg = group_idx.set_index(group_id).join(df_agg)
# rename index
df_agg.index.name = None
# return reordered columns
return df_agg[agg_col_order]
Note that you can now simply do the following:
data_block = [np.tile([None, 'A'], 3),
np.repeat(['B', 'C'], 3),
[1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
One small point to Andy Hayden's solution – it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)
I answered this already, but some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum','size','count'])
dfgrouped['sum'][dfgrouped['size']!=dfgrouped['count']] = None
When these differ, you can set the value back to None for the result of the aggregation function for that group.

Python adding two dataframes based on index (edited)

(no idea how to introduce a matrix here for readability)
I have two dataframes obtained with Panda and Python.
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the quantities of columns based on Index. Columns do not have the same name and may contain strings. (I have in fact 50 columns on each df). I want to consider nan as 0. The result should look:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
For sure a double while or for would do the trick, just not very elegant...
indices=0
columna=0
while indices<len(df.index)-1:
while columna<numbercolumns-1:
df3.iloc[indices,columna]=df1.iloc[indices,columna] +df2.iloc[indices,columna]
indices += 1
columna += 1
Thank you.
You can try of concatenating both dataframes, then add based on the index group
df1.columns = df.columns
df1.people = pd.to_numeric(df1.people,errors='coerce')
pd.concat([df,df1]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0

How to select all elements greater than a given values in a dataframe

I have a csv that is read by my python code and a dataframe is created using pandas.
CSV file is in following format
1 1.0
2 99.0
3 20.0
7 63
My code calculates the percentile and wants to find all rows that have the value in 2nd column greater than 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, criteria to find all rows that have column 2's value greater than 60 returns
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error.
Just correct the condition inside criteria. Being the second column "1" you should write df.iloc[:,1].
Example:
import pandas as pd
import numpy as np
b =np.array([[1,2,3,7], [1,99,20,63] ])
df = pd.DataFrame(b.T) #just creating the dataframe
criteria = df[ df.iloc[:,1]>= 60 ]
print(criteria)
Why?
It seems like the cause resides inside the definition type of the condition. Let's inspect
Case 1:
type( df.iloc[:,1]>= 60 )
Returns pandas.core.series.Series,so it gives
df[ df.iloc[:,1]>= 60 ]
#out:
0 1
1 2 99
3 7 63
Case2:
type( df.iloc[:,1:2]>= 60 )
Returns a pandas.core.frame.DataFrame, and gives
df[ df.iloc[:,1:2]>= 60 ]
#out:
0 1
0 NaN NaN
1 NaN 99.0
2 NaN NaN
3 NaN 63.0
Therefore I think it changes the way the index is processed.
Always keep in mind that 3 is a scalar, and 3:4 is a array.
For more info is always good to take a look at the official doc Pandas indexing
Your indexing a bit off, since you only have two columns [0, 1] and you are interested in selecting just the one with index 1. As #applesoup mentioned the following is just enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming columns and just referencing based on name. This will allow you to avoid any mistakes in case your df structure changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0 # Return a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0 # Return a Series
df.iloc[:, [1]] >= 60.0 # Return a DataFrame with one boolean column
So correct your code by using :
criteria = df[df.iloc[:, 1] >= 60.0] # Dont slice !

Categories

Resources