Merge two pandas dataframes by index and replace column values in Python


I have two pandas dataframes:
DF1
index = np.arange('2020-01-01 00:00', '2020-01-01 00:04', dtype='datetime64[m]')
df = np.random.randint(100,500, size=(4,4))
columns =['Open','High','Low','Close']
df = pd.DataFrame(df, index=index, columns = columns)
df.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 315 298 296 493
2020-01-01 00:03:00 324 411 198 101
DF2
index = np.arange('2020-01-01 00:02', '2020-01-01 00:05', dtype='datetime64[m]')
df2 = np.random.randint(100,500, size=(3,4))
columns =['Open','High','Low','Close']
df2 = pd.DataFrame(df2, index=index, columns = columns)
df2.index.name = 'Time'
Open High Low Close
Time
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
I need to merge both dataframes on the index (Time) and replace the column values of DF1 with the column values of DF2 wherever the timestamps overlap.
This is my expected output:
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475 ->>>> Correspond to DF1
2020-01-01 00:01:00 362 135 456 235 ->>>> Correspond to DF1
2020-01-01 00:02:00 430 394 131 490 ->>>> Correspond to DF2
2020-01-01 00:03:00 190 211 394 359 ->>>> Correspond to DF2
2020-01-01 00:04:00 192 291 143 350 ->>>> Correspond to DF2
I have tried several functions, including merge and concat (concat([df1, df2], join="inner")), but with no success. Any help would be much appreciated. Thanks!

Try this:
df2.combine_first(df)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
Because you mentioned pd.concat, here is how you could do it with that.
out = pd.concat([df, df2])
out = out[~out.index.duplicated(keep='last')]
print(out)
Open High Low Close
Time
2020-01-01 00:00:00 266 397 177 475
2020-01-01 00:01:00 362 135 456 235
2020-01-01 00:02:00 430 394 131 490
2020-01-01 00:03:00 190 211 394 359
2020-01-01 00:04:00 192 291 143 350
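To check both approaches side by side, here is a minimal sketch that rebuilds the two frames from the values printed in the question (instead of random numbers) and confirms that they produce the same rows. The names via_combine and via_concat are only illustrative.

```python
import pandas as pd

cols = ['Open', 'High', 'Low', 'Close']
one_minute = pd.Timedelta(minutes=1)

# Fixed values copied from the question so the result is reproducible.
df = pd.DataFrame(
    [[266, 397, 177, 475], [362, 135, 456, 235],
     [315, 298, 296, 493], [324, 411, 198, 101]],
    index=pd.date_range('2020-01-01 00:00', periods=4, freq=one_minute),
    columns=cols,
)
df2 = pd.DataFrame(
    [[430, 394, 131, 490], [190, 211, 394, 359], [192, 291, 143, 350]],
    index=pd.date_range('2020-01-01 00:02', periods=3, freq=one_minute),
    columns=cols,
)
df.index.name = df2.index.name = 'Time'

# df2 takes priority; timestamps missing from df2 are filled from df.
via_combine = df2.combine_first(df)

# Stack both frames, then keep the last row for each timestamp (the df2 row).
via_concat = pd.concat([df, df2])
via_concat = via_concat[~via_concat.index.duplicated(keep='last')]

print(via_combine)
print(via_concat)
```

The order matters in both cases: the frame whose values should win is the caller of combine_first, and it goes last in the list passed to concat when keep='last' is used.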

Related

Groupby DatetimeIndex month

I'm trying to group by month across a span of years, but to no avail. For example, group all the Januaries from 2011 to 2013 together, then group all the Februaries together.
Partial Dataset:
Date
2011-01-01 161
2011-02-01 117
2011-03-01 239
2012-01-01 289
2012-02-01 294
2012-03-01 378
2013-01-01 383
2013-02-01 361
Expected Output:
Date
2011-01-01 161
2012-01-01 289
2013-01-01 383
2011-02-01 117
2012-02-01 294
2013-02-01 361
2011-03-01 239
2012-03-01 378
Attempted:
Date is DatetimeIndex
df = df.groupby([df.index.year],[df.index.month])
Output: TypeError: unhashable type: 'list'

You are passing two separate lists; pass a single list with two elements instead, for example:
df = df.groupby([df.index.year, df.index.month])

We can reorder the rows by month with argsort:
df = df.iloc[df.index.month.argsort()]
df
2011-01-01 161
2012-01-01 289
2013-01-01 383
2011-02-01 117
2012-02-01 294
2013-02-01 361
2011-03-01 239
2012-03-01 378
Name: Date, dtype: int64
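The two answers do different things, so a small sketch may help. Grouping by [year, month] creates one group per year-month pair, while grouping by month alone pools the same month across years. The sketch below uses the partial dataset from the question; the sum aggregation and the names monthly_totals and by_month are assumptions for illustration. It also requests a stable sort explicitly, since the default sort kind used by argsort is not guaranteed to preserve the year order within a month.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [161, 117, 239, 289, 294, 378, 383, 361],
    index=pd.to_datetime([
        '2011-01-01', '2011-02-01', '2011-03-01',
        '2012-01-01', '2012-02-01', '2012-03-01',
        '2013-01-01', '2013-02-01',
    ]),
    name='Date',
)

# One group per calendar month, pooled across all years.
monthly_totals = s.groupby(s.index.month).sum()

# Reorder rows by month; a stable sort keeps the years in order within each month.
order = np.argsort(s.index.month.to_numpy(), kind='stable')
by_month = s.iloc[order]

print(monthly_totals)
print(by_month)
```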

Subset selected days data in Python

I have some time series data as:
import pandas as pd
index = pd.date_range('06/01/2014',periods=24*30,freq='H')
df1 = pd.DataFrame(range(len(index)),index=index)
Now I want to select the data for the following dates:
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
I tried the following statement, but it does not work:
sub_data = df1.loc[df1.index.isin(pd.to_datetime(selec_dates))]
What am I doing wrong? Is there another way to select the data for specific days?

You need to compare dates only; to test membership, use numpy.in1d:
sub_data = df1.loc[np.in1d(df1.index.date, pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
...
If you want to use isin, you need to create a Series with the same index:
sub_data = df1.loc[pd.Series(df1.index.date, index=df1.index)
.isin(pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
...
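The original attempt fails because pd.to_datetime(selec_dates) produces timestamps at midnight, so isin only matches the 00:00:00 row of each day. A possible alternative, sketched below, is to normalize the index to midnight before testing membership. Note also that numpy.in1d is deprecated as of NumPy 2.0 in favor of numpy.isin. The column name 'a' is taken from the printed output above and is otherwise an assumption.

```python
import pandas as pd

# Hourly index covering June 2014, as in the question.
index = pd.date_range('2014-06-01', periods=24 * 30, freq=pd.Timedelta(hours=1))
df1 = pd.DataFrame({'a': range(len(index))}, index=index)
selec_dates = ['2014-06-10', '2014-06-15', '2014-06-20']

# normalize() resets every timestamp to midnight, so whole days can be matched.
mask = df1.index.normalize().isin(pd.to_datetime(selec_dates))
sub_data = df1[mask]

print(sub_data.shape)
```

Because the full date is compared, this keeps working when the data spans several months or years.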

Sorry, I misunderstood your question.
df1[pd.Series(df1.index.date, index=df1.index).isin(pd.to_datetime(selec_dates).date)]
This should do what is needed.
Original answer:
Please check the pandas documentation on selection.
You can simply do:
sub_data = df1.loc[pd.to_datetime(selec_dates)]

You can use the .query() method:
In [202]: df1.query('@index.normalize() in @selec_dates')
Out[202]:
0
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
... ...
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]

Edit: I have been made aware that this only works if you are working with a date range in the same month and year as in your query. For a more general (and better) answer, see @jezrael's solution.
You can use np.in1d and .day on your index if you wanted to do it as you tried:
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
df1.loc[np.in1d(df1.index.day, (pd.to_datetime(selec_dates).day))]
This gives the output you require:
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
2014-06-10 12:00:00 228
2014-06-10 13:00:00 229
2014-06-10 14:00:00 230
2014-06-10 15:00:00 231
2014-06-10 16:00:00 232
2014-06-10 17:00:00 233
2014-06-10 18:00:00 234
2014-06-10 19:00:00 235
2014-06-10 20:00:00 236
2014-06-10 21:00:00 237
2014-06-10 22:00:00 238
2014-06-10 23:00:00 239
2014-06-15 00:00:00 336
2014-06-15 01:00:00 337
2014-06-15 02:00:00 338
2014-06-15 03:00:00 339
2014-06-15 04:00:00 340
2014-06-15 05:00:00 341
...
2014-06-15 18:00:00 354
2014-06-15 19:00:00 355
2014-06-15 20:00:00 356
2014-06-15 21:00:00 357
2014-06-15 22:00:00 358
2014-06-15 23:00:00 359
2014-06-20 00:00:00 456
2014-06-20 01:00:00 457
2014-06-20 02:00:00 458
2014-06-20 03:00:00 459
2014-06-20 04:00:00 460
2014-06-20 05:00:00 461
2014-06-20 06:00:00 462
2014-06-20 07:00:00 463
2014-06-20 08:00:00 464
2014-06-20 09:00:00 465
2014-06-20 10:00:00 466
2014-06-20 11:00:00 467
2014-06-20 12:00:00 468
2014-06-20 13:00:00 469
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]
I used these Sources for this answer:
- Selecting a subset of a Pandas DataFrame indexed by DatetimeIndex with a list of TimeStamps
- In Python-Pandas, How can I subset a dataframe by specific datetime index values?
- return pandas DF column with the number of days elapsed between index and today's date
- Get weekday/day-of-week for Datetime column of DataFrame
- https://stackoverflow.com/a/36893416/2254228

Use the string representation of the date, leaving out the time of day. Row selection by partial date string needs .loc in current pandas versions.
pd.concat([df1.loc['2014-06-10'], df1.loc['2014-06-15'], df1.loc['2014-06-20']])

subsetting pandas dataframe on specific date value

I have a pandas dataframe like this
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to subset this dataframe on date == '2016-01-04'. The datatypes of the df dataframe are:
df.dtypes
Out[1264]:
order_id object
buyer_id object
item_id object
time datetime64[ns]
This is what I am doing in Python:
df[df['time'] == '2016-01-04']
But it returns an empty dataframe. However, when I do
df[df['time'] < '2016-01-05']
it works. Please help.

The problem is that the comparison looks for an exact match. The string '2016-01-04' is parsed as midnight on that day, and since none of the times are '00:00:00', nothing matches. To make this work, compare only the date components:
In [20]:
df[df['time'].dt.date == pd.to_datetime('2016-01-04').date()]
Out[20]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
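Two other ways to express the same filter are sketched below, assuming the time column is already datetime64. The first truncates each timestamp to midnight with dt.normalize(); the second tests a half-open range covering the whole day. The four-row frame and the names same_day and in_range are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [537, 540, 556, 589],
    'time': pd.to_datetime([
        '2016-01-04 10:20', '2016-01-04 10:30',
        '2016-01-04 13:39', '2016-01-05 10:59',
    ]),
})

day = pd.Timestamp('2016-01-04')

# Option 1: truncate each timestamp to midnight, then compare with the day.
same_day = df[df['time'].dt.normalize() == day]

# Option 2: half-open range [day, day + 1 day), leaving the column untouched.
in_range = df[(df['time'] >= day) & (df['time'] < day + pd.Timedelta(days=1))]

print(same_day)
```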

IIUC you can use DatetimeIndex Partial String Indexing:
print(df)
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
df = df.set_index('time')
print(df.loc['2016-01-04'])
order_id buyer_id item_id
time
2016-01-04 10:20:00 537 79 93
2016-01-04 10:30:00 540 191 93
2016-01-04 13:39:00 556 251 82

how to subset pandas dataframe on date

I have a pandas DataFrame like this:
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids that are present before 6th Jan 2016 but not after 6th Jan 2016,
so it should return buyer_id 79.
I am doing the following in Python:
df.buyer_id[(df['time'] < '2016-01-06')]
This returns all the buyer ids before 6th Jan 2016, but how do I check that a buyer is not present after 6th Jan? Please help.

If I understand correctly, you could use the isin method to achieve what you want:
df.time = pd.to_datetime(df.time)
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]
select = df.buyer_id[(df['time'] < '2016-01-06')]
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64

You could use:
df.groupby('buyer_id')['time'].apply(lambda t: (t < '2016-01-06').any() and not (t > '2016-01-06').any())
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool
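Another way to frame the problem, sketched here as an assumption rather than as part of either answer: a buyer appears only before the cutoff exactly when that buyer's latest order is earlier than the cutoff. The names cutoff, last_seen and only_before are illustrative, and an order placed exactly at midnight on the cutoff day would count as not before.

```python
import pandas as pd

df = pd.DataFrame({
    'buyer_id': [79, 191, 251, 191, 251, 79, 261, 64, 261, 309],
    'time': pd.to_datetime([
        '2016-01-04 10:20', '2016-01-04 10:30', '2016-01-04 13:39',
        '2016-01-05 10:59', '2016-01-05 13:48', '2016-01-06 10:39',
        '2016-01-06 10:50', '2016-01-11 11:58', '2016-01-11 12:03',
        '2016-01-11 12:08',
    ]),
})
cutoff = pd.Timestamp('2016-01-06')

# Latest order per buyer; keep buyers whose latest order precedes the cutoff.
last_seen = df.groupby('buyer_id')['time'].max()
only_before = last_seen[last_seen < cutoff].index.tolist()

print(only_before)
```

With the sample data this selects buyers 191 and 251, matching the output of both answers above; buyer 79 is excluded because of the order on 2016-01-06 10:39.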

Python pandas. Group By and removing a timestamp

I have the pandas data frame below. I need to group by column B, sum column A, and remove the timestamp. So the data below should become one record with the A values summed up. How do I do this in pandas?
A B
2013-03-15 17:00:00 1 134
2013-03-15 18:00:00 810 134
2013-03-15 19:00:00 1797 134
2013-03-15 20:00:00 813 134
2013-03-15 21:00:00 1323 134
2013-03-16 05:00:00 98 134
2013-03-16 06:00:00 515 134
2013-03-16 10:00:00 377 134
2013-03-16 11:00:00 1798 134
2013-03-16 12:00:00 985 134
2013-03-17 08:00:00 258 134

This can be done with a straightforward groupby operation:
import io
import pandas as pd
content='''\
date time A B
2013-03-15 17:00:00 1 134
2013-03-15 18:00:00 810 134
2013-03-15 19:00:00 1797 134
2013-03-15 20:00:00 813 135
2013-03-15 21:00:00 1323 134
2013-03-16 05:00:00 98 134
2013-03-16 06:00:00 515 135
2013-03-16 10:00:00 377 134
2013-03-16 11:00:00 1798 136
2013-03-16 12:00:00 985 136
2013-03-17 08:00:00 258 137'''
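# In Python 3 the text must be read through StringIO (BytesIO expects bytes).
df = pd.read_csv(io.StringIO(content), sep=r'\s+')
# Combine the separate date and time columns into a DatetimeIndex.
df.index = pd.to_datetime(df.pop('date') + ' ' + df.pop('time'))
# Summing per value of B drops the timestamp index from the result.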
print(df.groupby(['B']).sum())
yields
A
B
134 4406
135 1328
136 2783
137 258
Some of the values in B were changed to show a more interesting groupby operation.
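If B should stay an ordinary column rather than become the index of the result, as_index=False is one option. The sketch below uses a small made-up frame with a timestamp index; the name totals is illustrative.

```python
import pandas as pd

# Made-up sample with a timestamp index, shaped like the data in the question.
df = pd.DataFrame(
    {'A': [1, 810, 1797, 98], 'B': [134, 134, 135, 135]},
    index=pd.to_datetime([
        '2013-03-15 17:00', '2013-03-15 18:00',
        '2013-03-15 19:00', '2013-03-16 05:00',
    ]),
)

# as_index=False keeps B as a regular column; the timestamps are not carried over.
totals = df.groupby('B', as_index=False)['A'].sum()

print(totals)
```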
