Pandas dataframe merge with update data - python

I have two DataFrames:
df1 = pd.DataFrame({'date':['2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05'], 'value':[1,1,1,1,1]})
df2 = pd.DataFrame({'date':['2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08'], 'value':[2,2,2,2,2]})
df1:                      df2:
      date  value               date  value
2017-01-01      1         2017-01-04      2
2017-01-02      1         2017-01-05      2
2017-01-03      1         2017-01-06      2
2017-01-04      1         2017-01-07      2
2017-01-05      1         2017-01-08      2
I need to merge df1 and df2 to obtain the following result:
date value
2017-01-01 1
2017-01-02 1
2017-01-03 1
2017-01-04 2
2017-01-05 2
2017-01-06 2
2017-01-07 2
2017-01-08 2

You can use concat with drop_duplicates on the date column, keeping the last values:
print (pd.concat([df1, df2]).drop_duplicates('date', keep='last'))
date value
0 2017-01-01 1
1 2017-01-02 1
2 2017-01-03 1
0 2017-01-04 2
1 2017-01-05 2
2 2017-01-06 2
3 2017-01-07 2
4 2017-01-08 2
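The original row labels are kept, so the index repeats. If you want the output sorted by date with a fresh index, a small follow-up (a sketch on the same df1/df2):
res = (pd.concat([df1, df2])
         .drop_duplicates('date', keep='last')
         .sort_values('date')
         .reset_index(drop=True))
print (res)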

I believe you can use the combine_first method built into pandas:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.combine_first.html
In this case you would do
df3 = df1.combine_first(df2)
I'm not certain whether it works when you are replacing an integer with an integer, or whether it needs NaN values in place.
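One caveat worth checking: combine_first aligns on the index, so with the default RangeIndex it would simply keep df1's values for the overlapping positions. A sketch that aligns on date instead, preferring df2's newer values to match the desired output:
df3 = (df2.set_index('date')
          .combine_first(df1.set_index('date'))
          .reset_index())
print (df3)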

Is there a way to apply a function to a MultiIndex dataframe slice with the same outer index without iterating each slice?

Basically, what I'm trying to accomplish is to fill in the missing dates (creating new DataFrame rows) with respect to each product, then create a new column based on a cumulative sum of column 'A' (example shown below).
The data is a MultiIndex DataFrame with (product, date) as the index.
I would like to apply this answer to the MultiIndex DataFrame using only the rightmost index level (date), then calculate np.cumsum for each product (over all dates).
A
product date
0 2017-01-02 1
2017-01-03 2
2017-01-04 2
2017-01-05 1
2017-01-06 4
2017-01-07 1
2017-01-10 7
1 2018-06-29 1
2018-06-30 4
2018-07-01 1
2018-07-02 1
2018-07-04 2
What I want to accomplish (efficiently) is:
A CumSum
product date
0 2017-01-02 1 1
2017-01-03 2 3
2017-01-04 2 5
2017-01-05 1 6
2017-01-06 4 10
2017-01-07 1 11
2017-01-08 0 11
2017-01-09 0 11
2017-01-10 7 18
1 2018-06-29 1 1
2018-06-30 4 5
2018-07-01 1 6
2018-07-02 1 7
2018-07-03 0 7
2018-07-04 2 9
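For reference, the example frame above can be rebuilt like this (a minimal sketch inferred from the tables; the dates must be real datetimes for resample to work):
import pandas as pd

df = pd.DataFrame({
    'product': [0]*7 + [1]*5,
    'date': pd.to_datetime(
        ['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
         '2017-01-06', '2017-01-07', '2017-01-10',
         '2018-06-29', '2018-06-30', '2018-07-01', '2018-07-02',
         '2018-07-04']),
    'A': [1, 2, 2, 1, 4, 1, 7, 1, 4, 1, 1, 2],
}).set_index(['product', 'date'])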
There are two ways.
One way:
Use groupby with apply, together with resample and cumsum. Finally, pd.concat the result with df.A and fillna with 0:
s = (df.reset_index(0)
       .groupby('product')
       .apply(lambda x: x.resample(rule='D').asfreq(0).A.cumsum()))
pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
Out[337]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
Another way:
You need two groupbys: the first for the resample, the second for the cumsum. Finally, use pd.concat and fillna with 0:
s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
Out[351]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9

How to convert strings of different formats (YYYY-MM-DD, DD-MM-YYYY) to date objects of one format in pandas?

I have a string column that contains dates in different formats (YYYY-MM-DD, DD-MM-YYYY). How can I convert them to date objects of one format (DD-MM-YYYY)?
I tried
df['accepted_date'] = pd.to_datetime(df['accepted_date'], format='%d-%m-%Y')
and got the error
time data '1899-12-31' does not match format '%d-%m-%Y' (match)
Thanks,
You could let pandas parse the dates itself, but then days and months get swapped for some rows:
df['accepted_date'] = pd.to_datetime(df['accepted_date'])
So it is better to use to_datetime with an explicit format and the parameter errors='coerce', which returns matched datetimes and NaT for non-matched values. Then use combine_first to join the Series - the NaT values are replaced by values from the other Series:
df = pd.DataFrame({'accepted_date':['2017-01-02','07-08-2017','20-03-2017','2017-01-04']})
d1 = pd.to_datetime(df['accepted_date'], format='%d-%m-%Y', errors='coerce')
d2 = pd.to_datetime(df['accepted_date'], format='%Y-%m-%d', errors='coerce')
df['accepted_date1'] = d1.combine_first(d2)
df['accepted_date2'] = pd.to_datetime(df['accepted_date'])
print (df)
accepted_date accepted_date1 accepted_date2
0 2017-01-02 2017-01-02 2017-01-02
1 07-08-2017 2017-08-07 2017-07-08 <-swapped dd-mm
2 20-03-2017 2017-03-20 2017-03-20
3 2017-01-04 2017-01-04 2017-01-04
Detail:
print (d1)
0 NaT
1 2017-08-07
2 2017-03-20
3 NaT
Name: accepted_date, dtype: datetime64[ns]
print (d2)
0 2017-01-02
1 NaT
2 NaT
3 2017-01-04
Name: accepted_date, dtype: datetime64[ns]
EDIT:
Another solution is to use the parameter dayfirst=True:
df['accepted_date3'] = pd.to_datetime(df['accepted_date'], dayfirst=True)
print (df)
accepted_date accepted_date3
0 2017-01-02 2017-01-02
1 07-08-2017 2017-08-07
2 20-03-2017 2017-03-20
3 2017-01-04 2017-01-04
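Note (an assumption about newer pandas versions; verify against yours): since pandas 2.0, to_datetime infers the format from the first element and raises on inconsistent strings, so a mixed column needs format='mixed', which can be combined with dayfirst:
df['accepted_date3'] = pd.to_datetime(df['accepted_date'], format='mixed', dayfirst=True)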

Fill in missing dates pandas based off max and min

I have a dataframe like the one below. I was wondering how I can fill in missing dates based on the max and min dates in the dataframe.
Day Movie Rating
2017-01-01 GreatGatsby 5
2017-01-02 TopGun 5
2017-01-03 Deadpool 1
2017-01-10 PlanetOfApes 2
How can I fill in the missing dates to get something like
Day Movie Rating
2017-01-01 GreatGatsby 5
2017-01-02 TopGun 5
2017-01-03 Deadpool 1
2017-01-04 0 0
2017-01-05 0 0
2017-01-06 0 0
2017-01-07 0 0
2017-01-08 0 0
2017-01-09 0 0
2017-01-10 PlanetOfApes 2
Use resample + first/last/min/max:
df.set_index('Day').resample('1D').first().fillna(0).reset_index()
Day Movie Rating
0 2017-01-01 GreatGatsby 5.0
1 2017-01-02 TopGun 5.0
2 2017-01-03 Deadpool 1.0
3 2017-01-04 0 0.0
4 2017-01-05 0 0.0
5 2017-01-06 0 0.0
6 2017-01-07 0 0.0
7 2017-01-08 0 0.0
8 2017-01-09 0 0.0
9 2017-01-10 PlanetOfApes 2.0
If Day isn't a datetime column, use pd.to_datetime to convert it first:
df['Day'] = pd.to_datetime(df['Day'])
An alternative from Wen, using asfreq:
df.set_index('Day').asfreq('D').fillna(0).reset_index()
Day Movie Rating
0 2017-01-01 GreatGatsby 5.0
1 2017-01-02 TopGun 5.0
2 2017-01-03 Deadpool 1.0
3 2017-01-04 0 0.0
4 2017-01-05 0 0.0
5 2017-01-06 0 0.0
6 2017-01-07 0 0.0
7 2017-01-08 0 0.0
8 2017-01-09 0 0.0
9 2017-01-10 PlanetOfApes 2.0
I believe you need reindex:
df = (df.set_index('Day')
        .reindex(pd.date_range(df['Day'].min(), df['Day'].max()), fill_value=0)
        .reset_index())
print (df)
index Movie Rating
0 2017-01-01 GreatGatsby 5
1 2017-01-02 TopGun 5
2 2017-01-03 Deadpool 1
3 2017-01-04 0 0
4 2017-01-05 0 0
5 2017-01-06 0 0
6 2017-01-07 0 0
7 2017-01-08 0 0
8 2017-01-09 0 0
9 2017-01-10 PlanetOfApes 2
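Note that reindex + reset_index loses the Day column name (it comes back as index); a one-line follow-up restores it:
df = df.rename(columns={'index': 'Day'})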

how to groupby and calculate the percentage of non missing values in each column in pandas?

I have the following dataframe
var loyal_date
1 2017-01-17
1 2017-01-03
1 2017-01-11
1 NaT
1 NaT
2 2017-01-15
2 2017-01-07
2 NaT
2 NaT
2 NaT
I need to group by the var column and find the percentage of non-missing values in the loyal_date column for each group. Is there any way to do it using a lambda function?
try this:
In [59]: df
Out[59]:
var loyal_date
0 1 2017-01-17
1 1 2017-01-03
2 1 2017-01-11
3 1 NaT
4 1 NaT
5 2 2017-01-15
6 2 2017-01-07
7 2 NaT
8 2 NaT
9 2 NaT
In [60]: df.groupby('var')['loyal_date'].apply(lambda x: x.notnull().sum()/len(x)*100)
Out[60]:
var
1 60.0
2 40.0
Name: loyal_date, dtype: float64
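A lambda-free variant (a sketch of the same idea - notnull() yields booleans whose mean is the non-missing fraction):
df['loyal_date'].notnull().groupby(df['var']).mean().mul(100)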

pandas group by date, assign value to a column

I have a DataFrame with columns = ['date', 'id', 'value'], where id represents different products. Assume that we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1', ..., 'valueidn'], where the values are assigned to the corresponding date row if they exist, and NaN is assigned as the value if they don't. Many thanks
assuming you have the following DF:
In [120]: df
Out[120]:
date id value
0 2001-01-01 1 10
1 2001-01-01 2 11
2 2001-01-01 3 12
3 2001-01-02 3 20
4 2001-01-03 1 20
5 2001-01-04 2 30
you can use the pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id 1 2 3
date
2001-01-01 10.0 11.0 12.0
2001-01-02 NaN NaN 20.0
2001-01-03 20.0 NaN NaN
2001-01-04 NaN 30.0 NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id 1 2 3
date
2001-01-01 10 11 12
2001-01-02 0 0 20
2001-01-03 20 0 0
2001-01-04 0 30 0
I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=5),
                   'id':[4,5,6,4,5],
                   'value':[7,8,9,1,2]})
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-04 4 1
4 2017-01-05 5 2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-04 1.0 NaN NaN
2017-01-05 NaN 2.0 NaN
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
it is necessary to use an aggregating function like mean, sum, ... with groupby or pivot_table:
df = pd.DataFrame({'date':['2017-01-01', '2017-01-02',
                           '2017-01-03', '2017-01-05', '2017-01-05'],
                   'id':[4,5,6,4,4],
                   'value':[7,8,9,1,2]})
df.date = pd.to_datetime(df.date)
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-05 4 1 <- duplicate 2017-01-05 4
4 2017-01-05 4 2 <- duplicate 2017-01-05 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
#alternative solution (same as the pivot_table answer, only slower on big DataFrames)
#df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-05 1.5 NaN NaN <- 1.5 is mean (1 + 2)/2
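To get the exact column names the question asks for ('valueid1', ..., 'valueidn'), a small rename sketch on top of either reshape (using pivot_table so it also works with duplicates; the 'valueid' prefix comes from the question, the rest is assumed):
wide = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
wide.columns = ['valueid{}'.format(c) for c in wide.columns]
wide = wide.reset_index()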
