Merge two pandas dataframes by index and replace column values in Python

I have two pandas dataframes:
DF1
index = np.arange('2020-01-01 00:00', '2020-01-01 00:04', dtype='datetime64[m]')
df = np.random.randint(100,500, size=(4,4))
columns =['Open','High','Low','Close']
df = pd.DataFrame(df, index=index, columns = columns)
df.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 315 298 296 493
2020-01-01 00:03:00 324 411 198 101
DF2
index = np.arange('2020-01-01 00:02', '2020-01-01 00:05', dtype='datetime64[m]')
df2 = np.random.randint(100,500, size=(3,4))
columns =['Open','High','Low','Close']
df2 = pd.DataFrame(df2, index=index, columns = columns)
df2.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
I need to merge both dataframes by the index (Time) and replace the column values of DF1 by the column values of DF2.
This is my expected output:
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475 ->>>> Correspond to DF1
2020-01-01 00:01:00 362 135 456 235 ->>>> Correspond to DF1
2020-01-01 00:02:00 430 394 131 490 ->>>> Correspond to DF2
2020-01-01 00:03:00 190 211 394 359 ->>>> Correspond to DF2
2020-01-01 00:04:00 192 291 143 350 ->>>> Correspond to DF2
I have tried several functions, including merge and concat (concat([df1, df2], join="inner")), but with no success. Any help would be much appreciated. Thanks!

Try this:
df2.combine_first(df)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
Because you mentioned pd.concat, here is how you could do it with that.
out = pd.concat([df, df2])
out = out[~out.index.duplicated(keep='last')]
print(out)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
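Both answers can be checked side by side. The sketch below is only an illustration: it hard-codes the values printed in the question (the original data is random), and the names combined, stacked and deduped are made up for this example.

```python
import pandas as pd

# Fixed values copied from the question so the result is reproducible.
cols = ['Open', 'High', 'Low', 'Close']
df = pd.DataFrame(
    [[266, 397, 177, 475], [362, 135, 456, 235],
     [315, 298, 296, 493], [324, 411, 198, 101]],
    index=pd.date_range('2020-01-01 00:00', periods=4, freq='min', name='Time'),
    columns=cols,
)
df2 = pd.DataFrame(
    [[430, 394, 131, 490], [190, 211, 394, 359], [192, 291, 143, 350]],
    index=pd.date_range('2020-01-01 00:02', periods=3, freq='min', name='Time'),
    columns=cols,
)

# df2 is the caller, so its values win wherever both frames have a row;
# rows that exist only in df are filled in from df.
combined = df2.combine_first(df)

# Same idea with concat: stack both frames, then keep only the last
# occurrence of each timestamp (the df2 row when there is a clash).
stacked = pd.concat([df, df2])
deduped = stacked[~stacked.index.duplicated(keep='last')]

print(combined)
print(deduped)
```

The order matters in both variants: df.combine_first(df2) would keep the DF1 values for the overlapping rows, and keep='first' would do the same in the concat version.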

Related

Groupby DatetimeIndex month

I'm trying to group by month across a span of years, but to no avail. For example, group all the Januarys from 2011 to 2013 together, group all the Februarys together, and so on.
Partial Dataset:
Date
2011-01-01 161
2011-02-01 117
2011-03-01 239
2012-01-01 289
2012-02-01 294
2012-03-01 378
2013-01-01 383
2013-02-01 361
Expected Output:
Date
2011-01-01 161
2012-01-01 117
2013-01-01 239
2011-02-01 289
2012-02-01 294
2013-02-01 378
2011-03-01 383
2012-03-01 361
Attempted:
Date is DatetimeIndex
df = df.groupby([df.index.year],[df.index.month])
Output: TypeError: unhashable type: 'list'
You are passing two lists; pass one list with two elements instead, for example:
df = df.groupby([df.index.year, df.index.month])
We can try argsort
df=df.iloc[df.index.month.argsort()]
df
2011-01-01 161
2012-01-01 289
2013-01-01 383
2011-02-01 117
2012-02-01 294
2013-02-01 361
2011-03-01 239
2012-03-01 378
Name: Date, dtype: int64
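A small sketch of the two ideas above on the partial dataset from the question. The names s, by_month and monthly_total are made up for this example, and the stable sort is an assumption added here so that the years keep their original order within each month (the default argsort kind does not promise that).

```python
import pandas as pd

# Partial dataset from the question, with a DatetimeIndex.
s = pd.Series(
    [161, 117, 239, 289, 294, 378, 383, 361],
    index=pd.to_datetime([
        '2011-01-01', '2011-02-01', '2011-03-01',
        '2012-01-01', '2012-02-01', '2012-03-01',
        '2013-01-01', '2013-02-01',
    ]),
    name='Date',
)

# Reorder rows so that equal months sit together; a stable sort keeps
# the years in their original order within each month.
order = s.index.month.to_numpy().argsort(kind='stable')
by_month = s.iloc[order]

# Aggregate across years: one total per calendar month (1 = January).
monthly_total = s.groupby(s.index.month).sum()

print(by_month)
print(monthly_total)
```

Grouping by s.index.month alone pools the same month from every year, whereas grouping by [s.index.year, s.index.month] keeps each year-month pair separate.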

Subset selected days data in Python

I have some time series data as:
import pandas as pd
index = pd.date_range('06/01/2014',periods=24*30,freq='H')
df1 = pd.DataFrame(range(len(index)),index=index)
Now I want to select the data for the dates below:
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
I tried the following statement, but it is not working:
sub_data = df1.loc[df1.index.isin(pd.to_datetime(selec_dates))]
What am I doing wrong? Is there another way to subset the data for selected days?
You need to compare dates only; to test membership, use numpy.in1d (newer NumPy versions recommend numpy.isin instead):
sub_data = df1.loc[np.in1d(df1.index.date, pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
...
If you want to use isin, you need to create a Series with the same index:
sub_data = df1.loc[pd.Series(df1.index.date, index=df1.index)
.isin(pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
...
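To see why the original isin attempt returns so little, the sketch below compares it with a variant that normalizes the index to midnight first. This is an illustration rather than the answerer's code: the column name value and the variables wanted, exact and sub_data are assumptions, and the hourly index is written as freq='60min' only to avoid version-specific frequency aliases.

```python
import pandas as pd

# Hourly data for 30 days, as in the question.
index = pd.date_range('2014-06-01', periods=24 * 30, freq='60min')
df1 = pd.DataFrame({'value': range(len(index))}, index=index)

selec_dates = ['2014-06-10', '2014-06-15', '2014-06-20']
wanted = pd.to_datetime(selec_dates)  # each entry means midnight of that day

# Comparing full timestamps only matches the 00:00:00 row of each day.
exact = df1.loc[df1.index.isin(wanted)]

# Normalizing the index to midnight first matches every row of each day.
sub_data = df1.loc[df1.index.normalize().isin(wanted)]

print(len(exact), len(sub_data))
```

Because normalize() keeps the year and month, this also works when the data spans several months, unlike a comparison on the day number alone.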
Sorry, I misunderstood your question.
df1[pd.Series(df1.index.date, index=df1.index).isin(pd.to_datetime(selec_dates).date)]
This should do what was needed.
Original answer:
Please check the pandas documentation on selection.
You can easily do:
sub_data = df1.loc[pd.to_datetime(selec_dates)]
You can use the .query() method:
In [202]: df1.query('@index.normalize() in @selec_dates')
Out[202]:
0
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
... ...
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]
Edit: I have been made aware that this only works if your date range falls within the same month and year as the dates in your query. For a more general (and better) answer, see @jezrael's solution.
You can use np.in1d and .day on your index if you wanted to do it as you tried:
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
df1.loc[np.in1d(df1.index.day, (pd.to_datetime(selec_dates).day))]
This gives the output you require:
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
2014-06-10 12:00:00 228
2014-06-10 13:00:00 229
2014-06-10 14:00:00 230
2014-06-10 15:00:00 231
2014-06-10 16:00:00 232
2014-06-10 17:00:00 233
2014-06-10 18:00:00 234
2014-06-10 19:00:00 235
2014-06-10 20:00:00 236
2014-06-10 21:00:00 237
2014-06-10 22:00:00 238
2014-06-10 23:00:00 239
2014-06-15 00:00:00 336
2014-06-15 01:00:00 337
2014-06-15 02:00:00 338
2014-06-15 03:00:00 339
2014-06-15 04:00:00 340
2014-06-15 05:00:00 341
...
2014-06-15 18:00:00 354
2014-06-15 19:00:00 355
2014-06-15 20:00:00 356
2014-06-15 21:00:00 357
2014-06-15 22:00:00 358
2014-06-15 23:00:00 359
2014-06-20 00:00:00 456
2014-06-20 01:00:00 457
2014-06-20 02:00:00 458
2014-06-20 03:00:00 459
2014-06-20 04:00:00 460
2014-06-20 05:00:00 461
2014-06-20 06:00:00 462
2014-06-20 07:00:00 463
2014-06-20 08:00:00 464
2014-06-20 09:00:00 465
2014-06-20 10:00:00 466
2014-06-20 11:00:00 467
2014-06-20 12:00:00 468
2014-06-20 13:00:00 469
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]
I used these Sources for this answer:
- Selecting a subset of a Pandas DataFrame indexed by DatetimeIndex with a list of TimeStamps
- In Python-Pandas, How can I subset a dataframe by specific datetime index values?
- return pandas DF column with the number of days elapsed between index and today's date
- Get weekday/day-of-week for Datetime column of DataFrame
- https://stackoverflow.com/a/36893416/2254228
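Use the string representation of each date, leaving out the time of day. Recent pandas versions no longer accept a date string in plain [] for row selection, so use .loc:
pd.concat([df1.loc['2014-06-10'], df1.loc['2014-06-15'], df1.loc['2014-06-20']])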

subsetting pandas dataframe on specific date value

I have a pandas dataframe like this
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to subset this dataframe on date == '2016-01-04'. The datatypes of the df dataframe are:
df.dtypes
Out[1264]:
order_id object
buyer_id object
item_id object
time datetime64[ns]
This is what I am doing in Python:
df[df['time'] == '2016-01-04']
But it returns an empty dataframe. However, when I do
df[df['time'] < '2016-01-05'] it works. Please help.
The problem is that the comparison is an exact match: '2016-01-04' is interpreted as midnight (00:00:00) on that day, and none of the times are '00:00:00', so nothing matches. You have to compare just the date components for this to work:
In [20]:
df[df['time'].dt.date == pd.to_datetime('2016-01-04').date()]
Out[20]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
IIUC you can use DatetimeIndex Partial String Indexing:
print(df)
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
df = df.set_index('time')
print(df.loc['2016-01-04'])
order_id buyer_id item_id
time
2016-01-04 10:20:00 537 79 93
2016-01-04 10:30:00 540 191 93
2016-01-04 13:39:00 556 251 82
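If the time column should stay a regular column rather than become the index, the same selection can be sketched in two other ways. The frame below reuses the first five rows of the question; the names day, same_day and in_range are made up for this example.

```python
import pandas as pd

# First five rows of the question's data, reduced to two columns.
df = pd.DataFrame({
    'order_id': [537, 540, 556, 589, 596],
    'time': pd.to_datetime([
        '2016-01-04 10:20:00', '2016-01-04 10:30:00', '2016-01-04 13:39:00',
        '2016-01-05 10:59:00', '2016-01-05 13:48:00',
    ]),
})

day = pd.Timestamp('2016-01-04')

# Option 1: drop the time-of-day part, then compare with the day.
same_day = df[df['time'].dt.normalize() == day]

# Option 2: half-open interval [day, day + 1 day), no per-row conversion.
in_range = df[(df['time'] >= day) & (df['time'] < day + pd.Timedelta(days=1))]

print(same_day)
```

Both selections return the three orders placed on 2016-01-04. The interval form also explains why the question's df['time'] < '2016-01-05' comparison worked while the equality test did not.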

how to subset pandas dataframe on date

I have a pandas DataFrame like this:
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids that are present before 6 Jan 2016 but not after 6 Jan 2016,
so it should return buyer_id 79.
I am doing the following in Python:
df.buyer_id[(df['time'] < '2016-01-06')]
This returns all the buyer ids before 6 Jan 2016, but how do I check that a buyer is not present after 6 Jan? Please help.
IIUC you could use the isin method to achieve what you want:
df.time = pd.to_datetime(df.time)
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]
select = df.buyer_id[(df['time'] < '2016-01-06')]
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
You could use:
df.groupby('buyer_id').apply(lambda x: (x.time < '2016-01-06').any() and not (x.time > '2016-01-06').any())
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool
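The same "before but not after" logic can also be written with plain Python sets. This is a sketch under one assumption that differs slightly from the answers above: rows exactly at the cutoff are counted as "after" (>=), so that no row falls into neither group. The names cutoff, before, after and only_before are made up for this example.

```python
import pandas as pd

# buyer_id and time columns from the question.
df = pd.DataFrame({
    'buyer_id': [79, 191, 251, 191, 251, 79, 261, 64, 261, 309],
    'time': pd.to_datetime([
        '2016-01-04 10:20:00', '2016-01-04 10:30:00', '2016-01-04 13:39:00',
        '2016-01-05 10:59:00', '2016-01-05 13:48:00', '2016-01-06 10:39:00',
        '2016-01-06 10:50:00', '2016-01-11 11:58:00', '2016-01-11 12:03:00',
        '2016-01-11 12:08:00',
    ]),
})

cutoff = pd.Timestamp('2016-01-06')

# Buyers seen strictly before the cutoff, and buyers seen at or after it.
before = set(df.loc[df['time'] < cutoff, 'buyer_id'])
after = set(df.loc[df['time'] >= cutoff, 'buyer_id'])

# Buyers that appear before the cutoff and never again afterwards.
only_before = sorted(int(b) for b in before - after)

print(only_before)
```

Buyer 79 is excluded because it has an order at 2016-01-06 10:39:00, which is after the cutoff of midnight on 6 January.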

Python pandas. Group By and removing a timestamp

I have the pandas data frame below. I need to group by column B, sum column A, and remove the timestamp. So the data below should end up as one record with the A values summed. How do I do this in pandas?
A B
2013-03-15 17:00:00 1 134
2013-03-15 18:00:00 810 134
2013-03-15 19:00:00 1797 134
2013-03-15 20:00:00 813 134
2013-03-15 21:00:00 1323 134
2013-03-16 05:00:00 98 134
2013-03-16 06:00:00 515 134
2013-03-16 10:00:00 377 134
2013-03-16 11:00:00 1798 134
2013-03-16 12:00:00 985 134
2013-03-17 08:00:00 258 134
This can be done with a straightforward groupby operation:
import io
import pandas as pd
content='''\
date time A B
2013-03-15 17:00:00 1 134
2013-03-15 18:00:00 810 134
2013-03-15 19:00:00 1797 134
2013-03-15 20:00:00 813 135
2013-03-15 21:00:00 1323 134
2013-03-16 05:00:00 98 134
2013-03-16 06:00:00 515 135
2013-03-16 10:00:00 377 134
2013-03-16 11:00:00 1798 136
2013-03-16 12:00:00 985 136
2013-03-17 08:00:00 258 137'''
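# content is a str, so wrap it in StringIO (BytesIO would need bytes).
df = pd.read_csv(io.StringIO(content), sep=r'\s+')
# Combine the separate date and time columns into a single DatetimeIndex.
df.index = pd.to_datetime(df.pop('date') + ' ' + df.pop('time'))
print(df.groupby('B').sum())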
yields
A
B
134 4406
135 1328
136 2783
137 258
Some of the values in B were changed to show a more interesting groupby operation.
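If the result should be a flat table with no timestamp at all, or one total per calendar day, the groupby can be adjusted as sketched below. This is an illustration on a four-row excerpt of the answer's modified data; the names per_b and per_day_b are made up, and leaving the date column as plain strings is a simplification chosen for this example.

```python
import io
import pandas as pd

# Four rows taken from the answer's sample data.
content = """\
date time A B
2013-03-15 17:00:00 1 134
2013-03-15 20:00:00 813 135
2013-03-16 05:00:00 98 134
2013-03-16 06:00:00 515 135
"""
df = pd.read_csv(io.StringIO(content), sep=r'\s+')

# Total of A per value of B, as a flat two-column frame (no timestamp).
per_b = df.groupby('B', as_index=False)['A'].sum()

# Total of A per (calendar date, B) pair; the time of day is ignored.
per_day_b = df.groupby(['date', 'B'], as_index=False)['A'].sum()

print(per_b)
print(per_day_b)
```

Passing as_index=False keeps B (and date) as ordinary columns instead of moving them into the index of the result.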
