I have two DataFrames (with DatetimeIndex) and want to update the first frame (the older one) with data from the second frame (the newer one).
The new frame may contain more recent data for rows already contained in the old frame. In this case, data in the old frame should be overwritten with data from the new frame.
The newer frame may also have more columns/rows than the first one; in that case the old frame should be enlarged with the data from the new frame.
The pandas docs state that
"The .loc/.ix/[] operations can perform enlargement when setting a non-existent key for that axis"
and
"a DataFrame can be enlarged on either axis via .loc".
However, this doesn't seem to work; it throws a KeyError instead. Example:
In [195]: df1
Out[195]:
A B C
2015-07-09 12:00:00 1 1 1
2015-07-09 13:00:00 1 1 1
2015-07-09 14:00:00 1 1 1
2015-07-09 15:00:00 1 1 1
In [196]: df2
Out[196]:
A B C D
2015-07-09 14:00:00 2 2 2 2
2015-07-09 15:00:00 2 2 2 2
2015-07-09 16:00:00 2 2 2 2
2015-07-09 17:00:00 2 2 2 2
In [197]: df1.loc[df2.index] = df2
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-197-74e630e87cf8> in <module>()
----> 1 df1.loc[df2.index] = df2
/.../pandas/core/indexing.pyc in __setitem__(self, key, value)
112
113 def __setitem__(self, key, value):
--> 114 indexer = self._get_setitem_indexer(key)
115 self._setitem_with_indexer(indexer, value)
116
/.../pandas/core/indexing.pyc in _get_setitem_indexer(self, key)
107
108 try:
--> 109 return self._convert_to_indexer(key, is_setter=True)
110 except TypeError:
111 raise IndexingError(key)
/.../pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
1110 mask = check == -1
1111 if mask.any():
-> 1112 raise KeyError('%s not in index' % objarr[mask])
1113
1114 return _values_from_object(indexer)
KeyError: "['2015-07-09T18:00:00.000000000+0200' '2015-07-09T19:00:00.000000000+0200'] not in index"
What is the best way (with respect to performance, as my real data is much larger) to achieve the desired updated and enlarged DataFrame? This is the result I would like to see:
A B C D
2015-07-09 12:00:00 1 1 1 NaN
2015-07-09 13:00:00 1 1 1 NaN
2015-07-09 14:00:00 2 2 2 2
2015-07-09 15:00:00 2 2 2 2
2015-07-09 16:00:00 2 2 2 2
2015-07-09 17:00:00 2 2 2 2
df2.combine_first(df1) (documentation)
seems to serve your requirement; see the code snippet and output below.
import pandas as pd
print('pandas-version:', pd.__version__)
df1 = pd.DataFrame.from_records([('2015-07-09 12:00:00', 1, 1, 1),
                                 ('2015-07-09 13:00:00', 1, 1, 1),
                                 ('2015-07-09 14:00:00', 1, 1, 1),
                                 ('2015-07-09 15:00:00', 1, 1, 1)],
                                columns=['Dt', 'A', 'B', 'C']).set_index('Dt')
# print(df1)
df2 = pd.DataFrame.from_records([('2015-07-09 14:00:00', 2, 2, 2, 2),
                                 ('2015-07-09 15:00:00', 2, 2, 2, 2),
                                 ('2015-07-09 16:00:00', 2, 2, 2, 2),
                                 ('2015-07-09 17:00:00', 2, 2, 2, 2)],
                                columns=['Dt', 'A', 'B', 'C', 'D']).set_index('Dt')
res_combine1st = df2.combine_first(df1)
print(res_combine1st)
output
pandas-version: 0.15.2
A B C D
Dt
2015-07-09 12:00:00 1 1 1 NaN
2015-07-09 13:00:00 1 1 1 NaN
2015-07-09 14:00:00 2 2 2 2
2015-07-09 15:00:00 2 2 2 2
2015-07-09 16:00:00 2 2 2 2
2015-07-09 17:00:00 2 2 2 2
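One caveat worth noting: combine_first aligns both frames on the union of their labels, so columns that pick up NaN during the alignment may come back upcast to float even after being filled. If integer dtypes matter, the fully populated columns can be cast back afterwards; a minimal sketch, continuing from the snippet above:
# 'D' must stay float because it still contains NaN; A/B/C are fully
# populated after the combine, so they can safely go back to int
res_combine1st[['A', 'B', 'C']] = res_combine1st[['A', 'B', 'C']].astype(int)
print(res_combine1st.dtypes)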
You can use the combine function.
import numpy as np
import pandas as pd

# your data
# ===========================================================
df1 = pd.DataFrame(np.ones(12).reshape(4, 3), columns='A B C'.split(),
                   index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))
df2 = pd.DataFrame(np.ones(16).reshape(4, 4) * 2, columns='A B C D'.split(),
                   index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))
# processing
# =====================================================
# reindex to populate NaN
result = df2.reindex(np.union1d(df1.index, df2.index))
Out[248]:
A B C D
2015-07-09 12:00:00 NaN NaN NaN NaN
2015-07-09 13:00:00 NaN NaN NaN NaN
2015-07-09 14:00:00 2 2 2 2
2015-07-09 15:00:00 2 2 2 2
2015-07-09 16:00:00 2 2 2 2
2015-07-09 17:00:00 2 2 2 2
combiner = lambda x, y: np.where(x.isnull(), y, x)
# use df1 to update result
result.combine(df1, combiner)
Out[249]:
A B C D
2015-07-09 12:00:00 1 1 1 NaN
2015-07-09 13:00:00 1 1 1 NaN
2015-07-09 14:00:00 2 2 2 2
2015-07-09 15:00:00 2 2 2 2
2015-07-09 16:00:00 2 2 2 2
2015-07-09 17:00:00 2 2 2 2
# maybe fillna(method='ffill') if you like
In addition to the previous answer, after reindexing you can use
result.fillna(df1, inplace=True)
So, based on Jianxun Li's code (extended with one more column), you can try this:
import numpy as np
import pandas as pd

# your data
# ===========================================================
df1 = pd.DataFrame(np.ones(12).reshape(4, 3), columns='A B C'.split(),
                   index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))
df2 = pd.DataFrame(np.ones(20).reshape(4, 5) * 2, columns='A B C D E'.split(),
                   index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))
# processing
# =====================================================
# reindex to populate NaN
result = df2.reindex(np.union1d(df1.index, df2.index))
# fill NaN from df1
result.fillna(df1, inplace=True)
Out[3]:
A B C D E
2015-07-09 12:00:00 1 1 1 NaN NaN
2015-07-09 13:00:00 1 1 1 NaN NaN
2015-07-09 14:00:00 2 2 2 2 2
2015-07-09 15:00:00 2 2 2 2 2
2015-07-09 16:00:00 2 2 2 2 2
2015-07-09 17:00:00 2 2 2 2 2
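Since the question explicitly asks about performance on much larger data, it is worth timing the candidates on the real frames rather than on these toy examples; a minimal sketch with timeit, reusing the df1/df2 defined above:
import timeit
import numpy as np

# numbers vary with frame size and dtypes -- measure on your own data
n = 100
t_combine = timeit.timeit(lambda: df2.combine_first(df1), number=n)
t_fillna = timeit.timeit(
    lambda: df2.reindex(np.union1d(df1.index, df2.index)).fillna(df1),
    number=n)
print('combine_first:', t_combine, 'reindex+fillna:', t_fillna)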
Related
I have a dataframe like the one below. Each date is the Monday of its week.
df = pd.DataFrame({'date' :['2020-04-20', '2020-05-11','2020-05-18',
'2020-04-20', '2020-04-27','2020-05-04','2020-05-18'],
'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
'count': [23, 44, 125, 6, 9, 10, 122]})
date name count
0 2020-04-20 A 23
1 2020-05-11 A 44
2 2020-05-18 A 125
3 2020-04-20 B 6
4 2020-04-27 B 9
5 2020-05-04 B 10
6 2020-05-18 B 122
Neither 'A' nor 'B' covers the whole date range. Both of them have some missing dates, which means the count for that week is 0. Below are all the dates:
df_dates = pd.DataFrame({ 'date':['2020-04-20', '2020-04-27','2020-05-04','2020-05-11','2020-05-18'] })
So what I need is like the dataframe below:
date name count
0 2020-04-20 A 23
1 2020-04-27 A 0
2 2020-05-04 A 0
3 2020-05-11 A 44
4 2020-05-18 A 125
5 2020-04-20 B 6
6 2020-04-27 B 9
7 2020-05-04 B 10
8 2020-05-11 B 0
9 2020-05-18 B 122
It seems like I need to join (merge) df_dates with df for each name group (A and B) and then fill the rows with missing name and missing count with 0's. Does anyone know how to achieve that, i.e. how I can join another table with a grouped table?
I tried this with no luck:
pd.merge(df_dates, df.groupby('name'), how='left', on='date')
We can do reindex with MultiIndex creation:
idx = pd.MultiIndex.from_product([df_dates.date, df.name.unique()], names=['date', 'name'])
s = df.set_index(['date', 'name']).reindex(idx, fill_value=0).reset_index().sort_values('name')
Out[136]:
date name count
0 2020-04-20 A 23
2 2020-04-27 A 0
4 2020-05-04 A 0
6 2020-05-11 A 44
8 2020-05-18 A 125
1 2020-04-20 B 6
3 2020-04-27 B 9
5 2020-05-04 B 10
7 2020-05-11 B 0
9 2020-05-18 B 122
Or, spelling out the pivot arguments (positional unpacking like df.pivot(*df.columns) stopped working once pivot's arguments became keyword-only in pandas 2.0):
s = df.pivot(index='date', columns='name', values='count').reindex(df_dates.date).fillna(0).reset_index().melt('date')
Out[145]:
date name value
0 2020-04-20 A 23.0
1 2020-04-27 A 0.0
2 2020-05-04 A 0.0
3 2020-05-11 A 44.0
4 2020-05-18 A 125.0
5 2020-04-20 B 6.0
6 2020-04-27 B 9.0
7 2020-05-04 B 10.0
8 2020-05-11 B 0.0
9 2020-05-18 B 122.0
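Note that the counts come back as floats in this melt-based version because the pivot introduced NaN before fillna(0); if integers are wanted, one extra cast restores them (continuing from the snippet above):
# restore integer counts after the NaN-introducing pivot
s['value'] = s['value'].astype(int)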
If you are looking to just fill in the union of dates in df, you can do:
(df.set_index(['date','name'])
.unstack('date',fill_value=0)
.stack().reset_index()
)
Output:
name date count
0 A 2020-04-20 23
1 A 2020-04-27 0
2 A 2020-05-04 0
3 A 2020-05-11 44
4 A 2020-05-18 125
5 B 2020-04-20 6
6 B 2020-04-27 9
7 B 2020-05-04 10
8 B 2020-05-11 0
9 B 2020-05-18 122
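If you would rather stay close to the merge you attempted, you can first build the full (date, name) grid and then left-merge the counts onto it; a sketch, assuming the df and df_dates defined above (how='cross' requires pandas >= 1.2):
import pandas as pd

# cross join: every date paired with every name
grid = df_dates.merge(pd.DataFrame({'name': df['name'].unique()}), how='cross')
# left-merge the real counts onto the grid; missing weeks become NaN -> 0
out = grid.merge(df, on=['date', 'name'], how='left')
out['count'] = out['count'].fillna(0).astype(int)
print(out)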
I have df1
Id Data Group_Id
0 1 A 1
1 2 B 2
2 3 B 3
...
100 4 A 101
101 5 A 102
...
and df2
Timestamp Group_Id
2012-01-01 00:00:05.523 1
2013-07-01 00:00:10.757 2
2014-01-12 00:00:15.507 3
...
2016-03-05 00:00:05.743 101
2017-12-24 00:00:10.407 102
...
I want to match the two datasets by Group_Id, then copy only the date part of Timestamp in df2 into a new column in df1, based on the corresponding Group_Id, and name the column day1.
Then I want to add 6 more columns next to day1, named day2, ..., day7, holding the next six days based on day1. So it looks like:
Id Data Group_Id day1 day2 day3 ... day7
0 1 A 1 2012-01-01 2012-01-02 2012-01-03 ...
1 2 B 2 2013-07-01 2013-07-02 2013-07-03 ...
2 3 B 3 2014-01-12 2014-01-13 2014-01-14 ...
...
100 4 A 101 2016-03-05 2016-03-06 2016-03-07 ...
101 5 A 102 2017-12-24 2017-12-25 2017-12-26 ...
...
Thanks.
First we need a merge here:
df1 = df1.merge(df2, how='left')
# seven consecutive days, starting from the date part of each Timestamp
s = pd.DataFrame([pd.date_range(x.normalize(), periods=7, freq='D')
                  for x in df1.Timestamp], index=df1.index)
s.columns += 1                 # rename columns 0..6 to 1..7
df1.join(s.add_prefix('day'))  # -> day1 .. day7
Another approach here: basically just merge the dfs, grab the date from the timestamp, and make six more columns, adding a day each time:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df3 = df1.merge(df2, on='Group_Id')
df3['Timestamp'] = pd.to_datetime(df3['Timestamp']) #only necessary if not already timestamp
df3['day1'] = df3['Timestamp'].dt.date
for i in range(1, 7):
    df3['day' + str(i + 1)] = df3['day1'] + pd.Timedelta(i, unit='d')
output:
Id Data Group_Id Timestamp day1 day2 day3 day4 day5 day6 day7
0 1 A 1 2012-01-01 00:00:05.523 2012-01-01 2012-01-02 2012-01-03 2012-01-04 2012-01-05 2012-01-06 2012-01-07
1 2 B 2 2013-07-01 00:00:10.757 2013-07-01 2013-07-02 2013-07-03 2013-07-04 2013-07-05 2013-07-06 2013-07-07
2 3 B 3 2014-01-12 00:00:15.507 2014-01-12 2014-01-13 2014-01-14 2014-01-15 2014-01-16 2014-01-17 2014-01-18
3 4 A 101 2016-03-05 00:00:05.743 2016-03-05 2016-03-06 2016-03-07 2016-03-08 2016-03-09 2016-03-10 2016-03-11
4 5 A 102 2017-12-24 00:00:10.407 2017-12-24 2017-12-25 2017-12-26 2017-12-27 2017-12-28 2017-12-29 2017-12-30
Note that I copied your data frame into a CSV and only had the 5 entries, so the index is not the same as in your example (i.e. 100, 101).
You can delete the Timestamp column if it's not needed.
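If the loop feels slow on a large frame, the same seven columns can be built in one vectorized step; a sketch, assuming the merged df3 from above:
import numpy as np
import pandas as pd

# broadcast each normalized timestamp against offsets of 0..6 days
offsets = pd.to_timedelta(np.arange(7), unit='d').values
grid = df3['Timestamp'].dt.normalize().values[:, None] + offsets
days = pd.DataFrame(grid, index=df3.index,
                    columns=['day' + str(i + 1) for i in range(7)])
out = df3[['Id', 'Data', 'Group_Id']].join(days)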
My dataframe contains both NaT and NaN values
Date/Time_entry Entry Date/Time_exit Exit
0 2015-11-11 10:52:00 19.9900 2015-11-11 11:30:00 20.350
1 2015-11-11 11:36:00 20.4300 2015-11-11 11:38:00 20.565
2 2015-11-11 11:44:00 21.0000 NaT NaN
3 2009-04-20 10:28:00 13.7788 2009-04-20 10:46:00 13.700
I want to fill NaT with dates and NaN with numbers. The fillna(4) method replaces both NaT and NaN with 4. Is it possible to differentiate between NaT and NaN somehow?
My current workaround is to call fillna() separately on each column, as in df[column].fillna(...).
Since NaTs pertain to datetime columns, you can exclude them when applying your filling operation.
u = df.select_dtypes(exclude=['datetime'])
df[u.columns] = u.fillna(4)
df
Date/Time_entry Entry Date/Time_exit Exit
0 2015-11-11 10:52:00 19.9900 2015-11-11 11:30:00 20.350
1 2015-11-11 11:36:00 20.4300 2015-11-11 11:38:00 20.565
2 2015-11-11 11:44:00 21.0000 NaT 4.000
3 2009-04-20 10:28:00 13.7788 2009-04-20 10:46:00 13.700
Similarly, to fill NaT values only, change "exclude" to "include" in the code above.
u = df.select_dtypes(include=['datetime'])
df[u.columns] = u.fillna(pd.to_datetime('today'))
df
Date/Time_entry Entry Date/Time_exit Exit
0 2015-11-11 10:52:00 19.9900 2015-11-11 11:30:00.000000 20.350
1 2015-11-11 11:36:00 20.4300 2015-11-11 11:38:00.000000 20.565
2 2015-11-11 11:44:00 21.0000 2019-02-17 16:11:09.407466 4.000
3 2009-04-20 10:28:00 13.7788 2009-04-20 10:46:00.000000 13.700
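The two passes can be chained, since once the datetime columns are filled the only NaNs left are numeric; a compact sketch on the same df:
# fill NaT in the datetime columns first, then the remaining NaNs
dt_cols = df.select_dtypes(include=['datetime']).columns
df[dt_cols] = df[dt_cols].fillna(pd.to_datetime('today'))
df = df.fillna(4)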
Try something like this, using pandas.DataFrame.select_dtypes:
>>> import pandas as pd, datetime, numpy as np
>>> df = pd.DataFrame({'a': [datetime.datetime.now(), np.nan], 'b': [5, np.nan], 'c': [1, 2]})
>>> df
a b c
0 2019-02-17 18:06:15.231557 5.0 1
1 NaT NaN 2
>>> fill_dt = datetime.datetime.now()
>>> fill_value = 4
>>> dt_filled_df = df.select_dtypes('datetime').fillna(fill_dt)
>>> dt_filled_df
a
0 2019-02-17 18:06:15.231557
1 2019-02-17 18:06:36.040404
>>> value_filled_df = df.select_dtypes('int').fillna(fill_value)
>>> value_filled_df
c
0 1
1 2
>>> dt_filled_df.columns = [col + '_notnull' for col in dt_filled_df]
>>> value_filled_df.columns = [col + '_notnull' for col in value_filled_df]
>>> df = df.join(value_filled_df)
>>> df = df.join(dt_filled_df)
>>> df
a b c c_notnull a_notnull
0 2019-02-17 18:06:15.231557 5.0 1 1 2019-02-17 18:06:15.231557
1 NaT NaN 2 2 2019-02-17 18:06:36.040404
I have the following dataframe:
var loyal_date
1 2017-01-17
1 2017-01-03
1 2017-01-11
1 NaT
1 NaT
2 2017-01-15
2 2017-01-07
2 NaT
2 NaT
2 NaT
I need to group by the var column and find the percentage of non-missing values in the loyal_date column for each group. Is there any way to do it using a lambda function?
try this:
In [59]: df
Out[59]:
var loyal_date
0 1 2017-01-17
1 1 2017-01-03
2 1 2017-01-11
3 1 NaT
4 1 NaT
5 2 2017-01-15
6 2 2017-01-07
7 2 NaT
8 2 NaT
9 2 NaT
In [60]: df.groupby('var')['loyal_date'].apply(lambda x: x.notnull().sum()/len(x)*100)
Out[60]:
var
1 60.0
2 40.0
Name: loyal_date, dtype: float64
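Since the mean of a boolean mask is exactly the fraction of True values, the lambda can be shortened to an equivalent form:
# notnull() yields booleans; their mean is the non-missing fraction
df.groupby('var')['loyal_date'].apply(lambda x: x.notnull().mean() * 100)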
I have a DataFrame with columns = ['date','id','value'], where id represents different products. Assume that we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1', ..., 'valueidn'], where the values are assigned to the corresponding date row if they exist, and NaN is assigned if they don't. Many thanks.
assuming you have the following DF:
In [120]: df
Out[120]:
date id value
0 2001-01-01 1 10
1 2001-01-01 2 11
2 2001-01-01 3 12
3 2001-01-02 3 20
4 2001-01-03 1 20
5 2001-01-04 2 30
you can use the pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id 1 2 3
date
2001-01-01 10.0 11.0 12.0
2001-01-02 NaN NaN 20.0
2001-01-03 20.0 NaN NaN
2001-01-04 NaN 30.0 NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id 1 2 3
date
2001-01-01 10 11 12
2001-01-02 0 0 20
2001-01-03 20 0 0
2001-01-04 0 30 0
I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=5),
'id':[4,5,6,4,5],
'value':[7,8,9,1,2]})
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-04 4 1
4 2017-01-05 5 2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-04 1.0 NaN NaN
2017-01-05 NaN 2.0 NaN
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
it is necessary to use an aggregating function like mean, sum, ... with groupby or pivot_table:
df = pd.DataFrame({'date':['2017-01-01', '2017-01-02',
'2017-01-03','2017-01-05','2017-01-05'],
'id':[4,5,6,4,4],
'value':[7,8,9,1,2]})
df.date = pd.to_datetime(df.date)
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-05 4 1 <- duplicate 2017-01-05 4
4 2017-01-05 4 2 <- duplicate 2017-01-05 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
#alternative solution (same result as groupby, only slower on big df)
#df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-05 1.5 NaN NaN <- 1.5 is the mean (1 + 2)/2
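To know in advance which of the two paths applies, checking for duplicated (date, id) pairs is cheap; a small sketch:
# True -> pivot() would raise; aggregate via groupby/unstack or pivot_table
print(df.duplicated(subset=['date', 'id']).any())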