Overwrite columns in DataFrames of different sizes (pandas / Python)

I have the following two DataFrames:
df1 = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[0,0,1,1,0]})
df2 = pd.DataFrame({'ids':[1,5],'cost':[1,4]})
And I want to update the values of df1 with the ones in df2 whenever there is a match in the ids. The desired DataFrame is this one:
df_result = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[1,0,1,1,4]})
How can I get that from the above two dataframes?
I have tried using merge, but it returns fewer records and keeps both columns:
results = pd.merge(df1,df2,on='ids')
results.to_dict()
{'cost_x': {0: 0, 1: 0}, 'cost_y': {0: 1, 1: 4}, 'ids': {0: 1, 1: 5}}

You could do this with a left merge:
merged = pd.merge(df1, df2, on='ids', how='left')
# keep cost_x where there was no match (cost_y is NaN), otherwise take cost_y
merged['cost'] = merged.cost_x.where(merged.cost_y.isnull(), merged['cost_y'])
result = merged[['ids','cost']]
However, you can avoid the need for the merge (and get better performance) if you set ids as an index column; pandas can then use it to align the results for you:
df1 = df1.set_index('ids')
df2 = df2.set_index('ids')
df1.cost.where(~df1.index.isin(df2.index), df2.cost)
ids
1 1.0
2 0.0
3 1.0
4 1.0
5 4.0
Name: cost, dtype: float64
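The float64 dtype appears because aligning df2.cost introduces NaN for the ids missing from df2. If integer output matters, a small follow-up sketch (not from the original answer) casts back once every position is filled:
# after the where(), no NaN remains, so the cast back to int is safe
result = df1.cost.where(~df1.index.isin(df2.index), df2.cost).astype(int)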

You can use set_index and combine_first to give precedence to the values in df2:
df_result = df2.set_index('ids').combine_first(df1.set_index('ids'))
df_result.reset_index()
You get
ids cost
0 1 1
1 2 0
2 3 1
3 4 1
4 5 4
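One caveat worth noting (a hedged addition, not part of the original answer): combine_first fills from df1 only where df2 is NaN, so a deliberate NaN in df2 will not overwrite df1's value:
# sketch: a NaN cost in df2 counts as "missing" and loses to df1's value
df2_nan = pd.DataFrame({'ids': [1, 5], 'cost': [None, 4]})
out = df2_nan.set_index('ids').combine_first(df1.set_index('ids'))
# out.loc[1, 'cost'] is 0.0 (taken from df1), not NaN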

Another way to do it is to use a temporary merged DataFrame, which you can discard after use.
import pandas as pd
df1 = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[0,0,1,1,0]})
df2 = pd.DataFrame({'ids':[1,5],'cost':[1,4]})
dftemp = df1.merge(df2,on='ids',how='left', suffixes=('','_r'))
print(dftemp)
# copy cost_r into cost only where the merge found a match (cost_r not null)
df1.loc[~pd.isnull(dftemp.cost_r), 'cost'] = dftemp.loc[~pd.isnull(dftemp.cost_r), 'cost_r']
del dftemp
df1 = df1[['ids','cost']]
print(df1)
Output:
dftemp:
cost ids cost_r
0 0 1 1.0
1 0 2 NaN
2 1 3 NaN
3 1 4 NaN
4 0 5 4.0
df1:
ids cost
0 1 1.0
1 2 0.0
2 3 1.0
3 4 1.0
4 5 4.0

A little late, but this did it for me and was faster than the accepted answer in my tests:
df1.update(df2.set_index('ids').reindex(df1.set_index('ids').index).reset_index())
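For context, here is a step-by-step sketch of what that one-liner does, assuming the column names from the question ('ids' and 'cost'):
import pandas as pd

df1 = pd.DataFrame({'ids': [1, 2, 3, 4, 5], 'cost': [0, 0, 1, 1, 0]})
df2 = pd.DataFrame({'ids': [1, 5], 'cost': [1, 4]})

# align df2 to df1's row order; ids missing from df2 become NaN rows
aligned = df2.set_index('ids').reindex(df1['ids']).reset_index()

# update() overwrites df1 in place wherever 'aligned' is non-NaN
# (note: the updated column may be upcast to float64)
df1.update(aligned)
print(df1)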

Related

Left join and sum results

I work with Python and am trying to merge two tables, df_agg and df_total. I used the how='left' argument expecting that all rows from the first table would be kept. For context, the first table contains duplicates in the join column id, while the second table does not.
df_new = pd.merge(df_agg,df_total, on='id', how='left')
The merge command executes successfully, but the results are surprising: instead of df_new['total'] having the same sum as df_agg['total'], the sum in df_new['total'] is greater.
Can anybody explain what causes this problem and suggest arguments to the function so that the sum is the same before and after merging?
This means id actually has duplicates in both DataFrames, so the new DataFrame has more rows than df_agg (a 'product' of the duplicated rows is created, one row per combination):
df_agg = pd.DataFrame( {"id": [1,1,2,3,3], 'a':range(5) })
df_total = pd.DataFrame( {"id": [1,1,1,3,4], 'b':range(10,15) })
df_new = pd.merge(df_agg,df_total, on='id', how='left')
print (df_new)
id a b
0 1 0 10.0
1 1 0 11.0
2 1 0 12.0
3 1 1 10.0
4 1 1 11.0
5 1 1 12.0
6 2 2 NaN
7 3 3 13.0
8 3 4 13.0
print (len(df_new), len(df_agg))
9 5
A possible solution is to remove the duplicates:
df_new = pd.merge(df_agg,df_total.drop_duplicates('id'), on='id', how='left')
print (df_new)
id a b
0 1 0 10.0
1 1 1 10.0
2 2 2 NaN
3 3 3 13.0
4 3 4 13.0
print (len(df_new), len(df_agg))
5 5
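If the duplicated rows in df_total carry values you do not want to discard, another option (a sketch, assuming you want to sum column b per id) is to aggregate before merging:
# one row per id, so the left merge can no longer multiply rows
df_new = pd.merge(df_agg,
                  df_total.groupby('id', as_index=False)['b'].sum(),
                  on='id', how='left')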

insert rows to pandas Dataframe based on condition?

I'm using a pandas DataFrame to read a .csv file. I would like to insert rows whenever the value in a specific column changes. My data is shown as follows:
Id type
1 car
1 track
2 train
2 plane
3 car
I need to add a row where Id is empty and the type value is the number 4 after any change in the Id column value. My desired output should look like this:
Id type
1 car
1 track
4
2 train
2 plane
4
3 car
How do I do this?
You could use groupby to split into groups and append the rows in a list comprehension before merging again with concat:
df2 = pd.concat([d.append(pd.Series([None, 4], index=['Id', 'type']), ignore_index=True)
                 for _, d in df.groupby('Id')], ignore_index=True).iloc[:-1]
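Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same idea using only concat:
# build the separator row once and concatenate it after each group
sep = pd.DataFrame([{'Id': None, 'type': 4}])
df2 = pd.concat([pd.concat([d, sep]) for _, d in df.groupby('Id')],
                ignore_index=True).iloc[:-1]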
If the index is sorted, another option is to find the index of the last item per group and use it to generate the new rows:
# get index of last item per group (except last)
idx = df.index.to_series().groupby(df['Id']).last().values[:-1]
# craft a DataFrame with the new rows
d = pd.DataFrame([[None, 4]]*len(idx), columns=df.columns, index=idx)
# concatenate and reorder
pd.concat([df, d]).sort_index().reset_index(drop=True)
output:
Id type
0 1.0 car
1 1.0 track
2 NaN 4.0
3 2.0 train
4 2.0 plane
5 NaN 4.0
6 3.0 car
You can do this:
df = pd.read_csv('input.csv', sep=";")
Id type
0 1 car
1 1 track
2 2 train
3 2 plane
4 3 car
# mark the last row of each Id group (the last row overall is also marked)
mask = df['Id'].ne(df['Id'].shift(-1))
# new rows get fractional indices (e.g. 1.5) so sort_index slots them in place
df1 = pd.DataFrame('4', index=mask.index[mask] + .5, columns=df.columns)
# the fractional indices do not align with df's, so Id ends up as NaN
df1['Id'] = df['Id'].replace({'4':' '})
# interleave, renumber, and drop the extra row appended after the last group
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
which gives:
Id type
0 1.0 car
1 1.0 track
2 NaN 4
3 2.0 train
4 2.0 plane
5 NaN 4
6 3.0 car
You can do:
In [244]: grp = df.groupby('Id')
In [256]: res = pd.DataFrame()
In [257]: for x, y in grp:
     ...:     if y['type'].count() > 1:
     ...:         tmp = y.append(pd.DataFrame({'Id': [''], 'type': [4]}))
     ...:         res = res.append(tmp)
     ...:     else:
     ...:         res = res.append(y)
     ...:
In [258]: res
Out[258]:
Id type
0 1 car
1 1 track
0 4
2 2 train
3 2 plane
0 4
4 3 car
Please find the solution below, using the index:
# Create a shifted copy of Id to compare each row against the previous one
df['idshift'] = df['Id'].shift(1)
# Where the shifted Id does not match Id, the value has changed
change_index = df.index[df['idshift'] != df['Id']].tolist()
change_index
# Loop through the change indices (skipping the first) and insert a row at a
# fractional position just before each change
for i in change_index[1:]:
    line = pd.DataFrame({"Id": ' ', "rate": 4}, index=[(i-1)+.5])
    df = df.append(line, ignore_index=False)
# finally, sort the index
df = df.sort_index().reset_index(drop=True)
Input DataFrame:
df = pd.DataFrame({'Id': [1,1,2,2,3,3,3,4],'rate':[1,2,3,10,12,16,10,12]})
Output from the code (posted as an image in the original answer).
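One small follow-up (not in the original answer): the helper idshift column is left behind in df, so drop it once the rows are inserted:
# remove the helper column used to detect Id changes
df = df.drop(columns=['idshift'])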

Find the latest occurrence of a class item and store how many values are between the two in a pandas DataFrame

I have a pandas DataFrame with some labels for n classes. Now I want to add a column and store how many items are between two elements of the same class.
Class
0 0
1 1
2 1
3 1
4 0
and I want to get this:
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
This is the code I used:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': [0,1,1,1,0]})
df['Shift'] = np.nan
for item in df.Class.unique():
    _df = df[df['Class'] == item]
    _df = _df.reset_index().rename({'index': 'idx'}, axis=1)
    df.loc[_df.idx, 'Shift'] = _df['idx'].diff().values
df
This seems circuitous to me. Is there a more elegant way of producing this output?
You could do:
df['shift'] = np.arange(len(df))
df['shift'] = df.groupby('Class')['shift'].diff()
print(df)
Output
Class shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
As an alternative:
df['shift'] = df.assign(shift=np.arange(len(df))).groupby('Class')['shift'].diff()
The idea is to create a column with consecutive values, group by the Class column and compute the diff on the new column.
If there is a default RangeIndex, use Index.to_series with grouping by the column df['Class'] and DataFrameGroupBy.diff:
df['Shift'] = df.index.to_series().groupby(df['Class']).diff()
A similar alternative is to create a helper column:
df['Shift'] = df.assign(tmp = df.index).groupby('Class')['tmp'].diff()
print (df)
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
Your solution with resetting the index can be simplified to:
df['Shift'] = df.reset_index().groupby('Class')['index'].diff().to_numpy()

Adding a new column from a list to a DataFrame, but the list has more values than the number of rows in the DataFrame

I have a DataFrame and a list:
df=pd.read_csv('aa.csv')
temp = ['1','2','3','4','5','6','7']
Now, my DataFrame has only 3 rows. I am adding temp as a new column:
df['temp']=pd.Series(temp)
But in the final df I am only getting the first 3 values of temp; all the others are dropped. Is there any way to add a list larger or smaller than the DataFrame as a new column?
Thanks
Use DataFrame.reindex to create rows filled with missing values before creating the new column:
df = pd.read_csv('aa.csv')
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp'] = pd.Series(temp)
Sample:
df = pd.DataFrame({'A': [1,2,3]})
print(df)
A
0 1
1 2
2 3
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp']=pd.Series(temp)
print (df)
A temp
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4
4 NaN 5
5 NaN 6
6 NaN 7
Or use concat with a Series, specifying a name for the new column:
s = pd.Series(temp, name='temp')
df = pd.concat([df, s], axis=1)
Similarly:
s = pd.Series(temp)
df = pd.concat([df, s.rename('temp')], axis=1)
print (df)
A temp
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4
4 NaN 5
5 NaN 6
6 NaN 7
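For the opposite case, where the list is shorter than the DataFrame, assigning it as a Series already pads the extra rows with NaN, whereas assigning the bare list would raise a length error. A small sketch (the column name temp2 is illustrative):
short = ['1', '2']
df['temp2'] = pd.Series(short)  # rows beyond index 1 get NaN
# df['temp2'] = short           # would raise: Length of values does not match length of index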

How can I aggregate a dataframe on specific values?

I have a pandas dataframe df like this, say
ID activity date
1 A 4
1 B 8
1 A 12
1 C 12
2 B 9
2 A 10
3 A 3
3 D 4
and I would like to return a table that counts the number of occurrences of each activity in a given list, say l = ['A', 'B'] in this case; then
ID activity(count)_A activity(count)_B
1 2 1
2 1 1
3 1 0
is what I need.
What is the quickest way to perform that, ideally without a for loop?
Thanks!
Edit: I know there is a pivot function for this kind of job, but in my case I have many more activity types than the ones I need to count in the list l. Is it still optimal to use pivot?
You can use isin with boolean indexing as a first step and then pivot. The fastest should be groupby + size + unstack, then pivot_table, and crosstab last, but it is best to test each solution with real data:
df2 = (df[df['activity'].isin(['A','B'])]
         .groupby(['ID','activity'])
         .size()
         .unstack(fill_value=0)
         .add_prefix('activity(count)_')
         .reset_index()
         .rename_axis(None, axis=1))
print (df2)
ID activity(count)_A activity(count)_B
0 1 2 1
1 2 1 1
2 3 1 0
Or:
df1 = df[df['activity'].isin(['A','B'])]
df2 = (pd.crosstab(df1['ID'], df1['activity'])
         .add_prefix('activity(count)_')
         .reset_index()
         .rename_axis(None, axis=1))
Or:
df2 = (df[df['activity'].isin(['A','B'])]
         .pivot_table(index='ID', columns='activity', aggfunc='size', fill_value=0)
         .add_prefix('activity(count)_')
         .reset_index()
         .rename_axis(None, axis=1))
I believe df.groupby(['ID', 'activity']).size().reset_index(name='count')
should give the counts you expect, in long format.
Just aggregate with Counter and use the pd.DataFrame default constructor:
from collections import Counter
agg_ = df.groupby('ID')['activity'].agg(Counter).tolist()
ndf = pd.DataFrame(agg_)
A B C D
0 2 1.0 1.0 NaN
1 1 1.0 NaN NaN
2 1 NaN NaN 1.0
If you have l = ['A', 'B'], just filter
ndf[l]
A B
0 2 1.0
1 1 1.0
2 1 NaN
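Since the question's desired output uses zeros rather than NaN, a small follow-up converts the filtered frame:
# replace missing counts with 0 and restore the integer dtype
ndf[l].fillna(0).astype(int)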
