Update row in a dataframe based on a second one - python

I have the following dataframe, df1:
AS AT CH TR
James Robert/01/08/2019 0 0 0 1
James Robert/18/08/2019 0 0 0 1
John Smith/01/08/2019 1 0 0 0
John Smith/02/08/2019 0 1 0 0
And df2:
TIME
Andrew Johnson/08/08/2019 1
James Robert/01/08/2019 0.5
John Smith/02/08/2019 1
If an index value is present in both dataframes (for example, James Robert/01/08/2019 and John Smith/02/08/2019), I would like to delete the row in df1 if the row's non-zero value minus df2['TIME'] equals 0; otherwise I would like to update that value.
The desired output would be :
AS AT CH TR
James Robert/01/08/2019 0 0 0 0.5
James Robert/18/08/2019 0 0 0 1
John Smith/01/08/2019 1 0 0 0
If a row is in both dataframes, I'm able to delete it from df1, but I can't find a way to express the condition on df1's non-zero column (the "column with a value").
Thanks

Instead of using the indexes directly, copy them into columns, then pass df2's 'index' column to the isin method on df1:
df2['index'] = df2.index
df1['index'] = df1.index
filtered_df1 = df1[df1['index'].isin(df2['index'].values.tolist())]
Create a dictionary mapping df2's 'index' column to its 'TIME' column, then map it onto filtered_df1:
your_dict = dict(zip(df2['index'], df2['TIME']))
filtered_df1['Subtract Value'] = filtered_df1['index'].map(your_dict).fillna(value=0)
Then do the subtraction on the value columns (restricting to them so the string 'index' column doesn't break the arithmetic):
value_cols = ['AS', 'AT', 'CH', 'TR']
final_df = filtered_df1[value_cols].sub(filtered_df1['Subtract Value'], axis=0)
Hope this helps.
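For completeness, a minimal end-to-end sketch of the full requirement (update the matching value, drop the row when it hits zero), assuming the value columns are exactly AS, AT, CH, TR as in the question:

import pandas as pd

df1 = pd.DataFrame({'AS': [0, 0, 1, 0], 'AT': [0, 0, 0, 1],
                    'CH': [0, 0, 0, 0], 'TR': [1, 1, 0, 0]},
                   index=['James Robert/01/08/2019', 'James Robert/18/08/2019',
                          'John Smith/01/08/2019', 'John Smith/02/08/2019'])
df2 = pd.DataFrame({'TIME': [1, 0.5, 1]},
                   index=['Andrew Johnson/08/08/2019', 'James Robert/01/08/2019',
                          'John Smith/02/08/2019'])

# align df2['TIME'] to df1's index; rows absent from df2 subtract 0
sub = df2['TIME'].reindex(df1.index, fill_value=0)
# subtract only in cells that hold a value, leaving the zero cells untouched
out = df1.mask(df1.ne(0), df1.sub(sub, axis=0))
# delete rows whose value fell to exactly 0
out = out[out.sum(axis=1).ne(0)]
print(out)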

Related

Group by column and Spread values of another Column into other Columns

I have the following dataframe and I'm trying to group by Name, spread the values of Weight into columns, and count how many times each occurs. Thanks!
df = pd.DataFrame({'Name':['John','Paul','Darren','John','Darren'],
'Weight':['Average','Below Average','Above Average','Average','Above Average']})
Desired output (a count of each Weight value per Name):
Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1
Try pandas crosstab:
pd.crosstab(df.Name, df.Weight)
Weight Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1
Use groupby and unstack:
df = pd.DataFrame({'Name':['John','Paul','Darren','John','Darren'],
'Weight':['Average','Below Average','Above Average','Average','Above Average']})
df = df.groupby(['Name', 'Weight'])['Weight'].count().unstack(1).fillna(0).astype(int).reset_index()
df = df.rename_axis('', axis=1).set_index('Name')
df
Out[1]:
Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1
Use get_dummies to achieve what you need here:
pd.get_dummies(df.set_index('Name'), dummy_na=False,prefix=[None]).groupby('Name').sum()
Above Average Average Below Average
Name
Darren 2 0 0
John 0 2 0
Paul 0 0 1
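For reference, pivot_table with aggfunc='size' should give the same counts (a hedged alternative, not part of the answers above):
df.pivot_table(index='Name', columns='Weight', aggfunc='size', fill_value=0)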

How to match column names with dictionary keys and add value to counter

I created a dataframe that has binary values for each cell, where each row is a user and each column is a company the user can select (or not), like this:
company1 company2 company3
1 0 0
0 0 1
0 1 1
And I created a dictionary that categorizes each company into either a high, mid, or low value company:
{'company1': 'high',
'company2': 'low',
'company3': 'low'}
Currently there are companies that are in the dataframe but not in the dictionary, but this should be fixed relatively soon. I would like to create variables for how many times each user selected a high, mid, or low value company. Ultimately should look something like this:
company1 company2 company3 total_low total_mid total_high
1 0 0 0 0 1
0 0 1 1 0 0
0 1 1 2 0 0
I started writing a loop to accomplish this, but I'm not sure how to match the column name with the dictionary key/value, or whether this is even the most efficient method (there are ~18,000 rows/users and ~100 columns/companies in total):
total_high = []
total_mid = []
total_low = []
for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        if df.iloc[row, col] == 1:
            # match column name with dict key and add value to counter
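For reference, the loop could be completed along these lines (a slow but direct sketch; the vectorized answers below are preferable at 18,000 × 100):

import pandas as pd

df = pd.DataFrame({'company1': [1, 0, 0],
                   'company2': [0, 0, 1],
                   'company3': [0, 1, 1]})
d = {'company1': 'high', 'company2': 'low', 'company3': 'low'}

rows = []
for _, row in df.iterrows():
    counts = {'low': 0, 'mid': 0, 'high': 0}
    for col in df.columns:
        if row[col] == 1 and col in d:  # skip companies missing from the dict
            counts[d[col]] += 1
    rows.append(counts)
for k in ['low', 'mid', 'high']:
    df['total_' + k] = [r[k] for r in rows]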
One possible approach:
d = {'company1': 'high',
'company2': 'low',
'company3': 'low'}
df.join(df.rename(columns=d)
.groupby(level=0, axis=1).sum()
.reindex(['low','mid','high'], axis=1, fill_value=0)
.add_prefix('total_')
)
Output:
company1 company2 company3 total_low total_mid total_high
0 1 0 0 0 0 1
1 0 0 1 1 0 0
2 0 1 1 2 0 0
Not as short as @Quang Hoang's, but another way:
Melt the dataframe:
df2=pd.melt(df, value_vars=['company1', 'company2', 'company3'])
Map the dictionary to create another column, total:
df2['total']=df2.variable.map(d)
Pivot on total (reindexing to include the missing 'mid' level) and join back to df:
compa = ['low', 'mid', 'high']
df.join(df2.groupby(['variable', 'total'])['value'].sum()
           .unstack('total', fill_value=0)
           .reindex(compa, axis=1, fill_value=0)
           .add_prefix('total_')
           .reset_index()
           .drop(columns=['variable']))

pandas drop rows after value appears

I have a dataframe:
df = pd.DataFrame({'Position': [1,2,3,4,5,'Title','Name','copy','Thanks'], 'Winner': [0,0,0,0,0,'Johnson',0,0,0]})
I want to drop all the rows after and including the row Johnson appears in. This would give me a dataframe looking like:
df = pd.DataFrame({'Position': [1,2,3,4,5], 'Winner': [0,0,0,0,0]})
I have tried finding the index where 'Johnson' appears and then slicing the dataframe using that index, but this didn't work for me.
Thanks
You just need boolean indexing and cumsum:
df[df.Winner.eq('Johnson').cumsum().lt(1)]
Output:
Position Winner
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
You could use boolean indexing:
df[~df['Winner'].eq('Johnson').cumsum().astype(bool)]
The winner could be another person, so you could instead stop at the first non-zero value:
df.loc[:df['Winner'].eq(0).idxmin() - 1]
Output
Position Winner
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
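Another hedged option, valid only if 'Johnson' is guaranteed to appear: find the position of the first match and slice with iloc:

import pandas as pd

df = pd.DataFrame({'Position': [1, 2, 3, 4, 5, 'Title', 'Name', 'copy', 'Thanks'],
                   'Winner': [0, 0, 0, 0, 0, 'Johnson', 0, 0, 0]})

stop = df['Winner'].eq('Johnson').to_numpy().argmax()  # position of the first match
df = df.iloc[:stop]  # keep everything before it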

reset a recurring multiindex in pandas

I have a pandas data frame in python coming from a pd.concat with a recurring multiindex:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
0 0 8880873
1 1000521
1 0 1135488
1 5388773
Now I want to reset only the first level of the MultiIndex, so that I get a running group number on that level. Something like this:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
In general, I have around 5 million records and not the biggest machine, so I'm looking for a memory-efficient solution.
ignore_index=True in pd.concat does not work, because then I lose the MultiIndex.
Many thanks
You can take the first level with get_level_values and convert it to_series, compare it with its shifted values, take the cumsum of the changes for the group count, and finally build the new index with MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
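For reference, a self-contained sketch reproducing the sample end to end (assuming the frame really arrives with the recurring 0/1 first level):

import pandas as pd

idx = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1),
                                 (0, 0), (0, 1), (1, 0), (1, 1)])
df = pd.DataFrame({'customer_id': [46841769, 4683936, 8880872, 8880812,
                                   8880873, 1000521, 1135488, 5388773]},
                  index=idx)

# start a new group number every time the first level changes
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
df.index = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)],
                                     names=df.index.names)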

Replace column values based on another dataframe python pandas - better way?

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult on Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Education
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_x Business Education_x Nonprofit_y Education_y
X 1 1 0 NaN NaN
Y 0 1 0 1 1
Y 0 1 0 1 1
Z 0 0 0 1 1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should stay 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.ix[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)
Attention: in the latest versions of pandas, both of the other answers here (KSD's and EdChum's, quoted below) no longer work.
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
It will work safely only if the values in the 'Name' column are unique and sorted identically in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2, on='Name', how='left')
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Nonprofit_x', 'Education_x'], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More about update: the index column names do not need to match across the two data frames before calling update; 'Name1' and 'Name2' work, as in the example below. It also works when df2 contains extra rows that are absent from df1; those rows simply update nothing. In other words, df2 does not need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
Use the boolean mask from isin to filter the df and assign the desired row values from the right-hand-side df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
This is the correct one.
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows of df1 exist in df; in other words, when df is a superset of df1. If df1 has rows that don't match df (df is not a superset of df1), filter both sides:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
Alternatively, using combine_first (df2's values take priority, df1 fills the gaps):
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
