I currently have two pandas DataFrames, both indexed with a pandas DatetimeIndex.
df1
datetimeindex value
2014-01-01 00:00:00 204.501667
2014-01-01 01:00:00 125.345000
2014-01-01 02:00:00 119.660000
df2 (the year 1900 is a filler year I added during import; the actual year does not matter)
datetimeindex temperature
1900-01-01 00:00:00 48.2
1900-01-01 01:00:00 30.2
1900-01-01 02:00:00 42.8
I would like to use pd.merge to combine the DataFrames based on the left index; however, I would like to ignore the year altogether to yield this:
merged_df
datetimeindex value temperature
2014-01-01 00:00:00 204.501667 48.2
2014-01-01 01:00:00 125.345000 30.2
2014-01-01 02:00:00 119.660000 42.8
So far I have tried:
merged_df = pd.merge(df1, df2,
                     left_on=['df1.index.month', 'df1.index.day', 'df1.index.hour'],
                     right_on=['df2.index.month', 'df2.index.day', 'df2.index.hour'],
                     how='left')
which gave me the error KeyError: 'df2.index.month'
Is there a way to perform this merge as I have outlined it?
Thanks
You have to lose the quotes:
In [11]: pd.merge(df1, df2, left_on=[df1.index.month, df1.index.day, df1.index.hour],
right_on=[df2.index.month, df2.index.day, df2.index.hour])
Out[11]:
key_0 key_1 key_2 value temperature
0 1 1 0 204.501667 48.2
1 1 1 1 125.345000 30.2
2 1 1 2 119.660000 42.8
Here "df2.index.month" is a string whereas df2.index.month is the array of months.
Probably not as efficient because pd.to_datetime can be slow:
# Rebuild df2's index with the filler year replaced by 2014
df2['NewIndex'] = pd.to_datetime(df2.index)
df2['NewIndex'] = df2['NewIndex'].apply(lambda x: x.replace(year=2014))
df2.set_index('NewIndex', inplace=True)
Then just do a merge on the whole index.
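For completeness, a minimal sketch of that final step, assuming df2 now carries the same 2014 timestamps as df1:
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
# equivalently: merged_df = df1.join(df2)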
Related
I have a df like,
stamp value
0 00:00:00 2
1 00:00:00 3
2 01:00:00 5
Converting to timedelta:
df['stamp'] = pd.to_timedelta(df['stamp'])
Slicing only the odd indices and adding 30 minutes:
odd_df = pd.to_timedelta(df[1::2]['stamp']) + pd.to_timedelta('30 min')
#print(odd_df)
1 00:30:00
Name: stamp, dtype: timedelta64[ns]
Now, updating df with odd_df. As per the documentation, it should give my expected output.
expected output:
df.update(odd_df)
#print(df)
stamp value
0 00:00:00 2
1 00:30:00 3
2 01:00:00 5
What I am getting:
df.update(odd_df)
#print(df)
stamp value
0 00:30:00 00:30:00
1 00:30:00 00:30:00
2 00:30:00 00:30:00
Please help; what is wrong here?
Try this instead:
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
This updates just the values selected by the .loc indexer while keeping the rest of your original DataFrame. To check, run df.shape: you still get (3, 2) with the method above.
In your code here:
odd_df = pd.to_timedelta(df[1::2]['stamp']) + pd.to_timedelta('30 min')
odd_df is a Series holding only the rows you sliced; its shape is (1,).
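For reference, a minimal end-to-end sketch of the fix, rebuilding the small frame from the question:
import pandas as pd

df = pd.DataFrame({'stamp': ['00:00:00', '00:00:00', '01:00:00'],
                   'value': [2, 3, 5]})
df['stamp'] = pd.to_timedelta(df['stamp'])
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')  # shift only the odd rows in place
# row 1's stamp is now 00:30:00 and df.shape is still (3, 2)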
I have a data frame of user names, dates, and times (the sample data is recreated in the answer below). How can I make a new data frame that contains only the minimum 'Time' values for a user on the same date? So I want a data frame with the same structure, but only one 'Time' per 'Date' per user.
Sort values by the time column and check for duplicates in Date + User_name. However, to make sure 9:00 sorts before 10:00 (the time strings lack leading zeros), we can convert the strings to datetimes first.
import pandas as pd
data = {
'User_name':['user1','user1','user1', 'user2'],
'Date':['8/29/2016','8/29/2016', '8/31/2016', '8/31/2016'],
'Time':['9:07:41','9:07:42','9:07:43', '9:31:35']
}
# Recreate sample dataframe
df = pd.DataFrame(data)
Alternative 1 (quicker):
#100 loops, best of 3: 1.73 ms per loop
# Create a mask
m = (df.reindex(pd.to_datetime(df['Time']).sort_values().index)
.duplicated(['Date','User_name']))
# Apply inverted mask
df = df.loc[~m]
Alternative 2 (more readable):
An easier way would be to convert the df['Time'] column to datetime, group it by Date and User_name, and take idxmin(). This will be our mask. (Credit to jezrael.)
# 100 loops, best of 3: 4.34 ms per loop
# Create a mask
m = pd.to_datetime(df['Time']).groupby([df['Date'],df['User_name']]).idxmin()
df = df.loc[m]
Output:
Date Time User_name
0 8/29/2016 9:07:41 user1
2 8/31/2016 9:07:43 user1
3 8/31/2016 9:31:35 user2
Update 1
#User included into grouping
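The same idea can also be written with drop_duplicates instead of a mask; a sketch using a temporary parsed-time column (the '_t' name is just illustrative):
df = (df.assign(_t=pd.to_datetime(df['Time']))
        .sort_values('_t')
        .drop_duplicates(['Date', 'User_name'])
        .drop(columns='_t'))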
Not the best way, but simple:
import numpy as np
import pandas as pd

# Build 7 random hourly timestamps within the first 3 days of 2016
df = pd.DataFrame(np.datetime64('2016') +
                  np.random.randint(0, 3 * 24, size=(7, 1)).astype('<m8[h]'),
                  columns=['DT'])
df = df.join(pd.Series(list('abcdefg'), name='str_val'))
df = df.join(pd.Series(list('UAUAUAU'), name='User'))
df['Date'] = df.DT.dt.date
df['Time'] = df.DT.dt.time
df.drop(columns = ['DT'],inplace=True)
print (df)
Output:
str_val User Date Time
0 a U 2016-01-01 04:00:00
1 b A 2016-01-01 10:00:00
2 c U 2016-01-01 20:00:00
3 d A 2016-01-01 22:00:00
4 e U 2016-01-02 04:00:00
5 f A 2016-01-02 23:00:00
6 g U 2016-01-02 09:00:00
Code to get the values:
print (df.sort_values(['Date','User','Time']).groupby(['Date','User']).first())
Output:
Date User
2016-01-01 A b 10:00:00
U a 04:00:00
2016-01-02 A f 23:00:00
U e 04:00:00
I have a pandas df and I would like to add, for each row, the value from the "total_load" column to the running "Battery capacity". For example, 4755 + (-380) = 4375, and so on.
What I am doing right now is, for every row in the "Battery capacity" column, computing 5200 plus the value from the "total_load" column (the loads are negative). Any ideas how I can write this? Should I use a for loop?
df["Battery capacity"] = 5200 + df["total_load"]
Output should be something like:
time total_load battery capacity
2016-06-01 00:00:00 -445 4755
2016-06-01 01:00:00 -380 4375
2016-06-01 02:00:00 -350 4025
Thanks!
IIUC, use cumsum to get a "running total" of total_load:
df['Battery capacity'] = df['total_load'].cumsum() + 5200
Output:
Battery capacity total_load
time
2016-01-01 00:00:00 4755.0 -445.0
2016-01-01 01:00:00 4375.0 -380.0
2016-01-01 02:00:00 4025.0 -350.0
2016-01-01 03:00:00 3685.0 -340.0
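To answer the for-loop question directly: no loop is needed, since cumsum computes the running total in one vectorized step. For comparison, a sketch of the explicit loop it replaces:
capacity = 5200
levels = []
for load in df['total_load']:
    capacity += load  # each (negative) load draws down the running level
    levels.append(capacity)
df['Battery capacity'] = levels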
I have two different DataFrames that I want to merge on their date and hours columns. I saw some existing threads, but I could not find a solution for my issue. I also read the documentation and tried different combinations; however, they did not work well.
Example of my two different DataFrames,
DF1
date hours var1 var2
0 2013-07-10 00:00:00 150.322617 52.225920
1 2013-07-10 01:00:00 155.250917 53.365296
2 2013-07-10 02:00:00 124.918667 51.158249
3 2013-07-10 03:00:00 143.839217 53.138251
.....
9 2013-09-10 09:00:00 148.135818 86.676341
10 2013-09-10 10:00:00 147.833517 53.658016
11 2013-09-10 12:00:00 149.580233 69.745368
12 2013-09-10 13:00:00 163.715317 14.524894
13 2013-09-10 14:00:00 168.856650 10.762779
DF2
date hours myvar1 myvar2
0 2013-07-10 09:00:00 1.617 98.56
1 2013-07-10 10:00:00 2.917 23.60
2 2013-07-10 12:00:00 19.667 36.15
3 2013-07-10 13:00:00 14.217 45.16
.....
20 2013-09-10 20:00:00 1.517 53.56
21 2013-09-10 21:00:00 5.233 69.47
22 2013-09-10 22:00:00 13.717 14.25
23 2013-09-10 23:00:00 18.850 10.69
As you can see in both DataFrames, DF2 starts at 09:00:00 and I want to join it with DF1 at 09:00:00, i.e. on the matching dates and times. So far, I have tried many different combinations using previous threads and the documentation mentioned above. An example:
merged_df = DF2.merge(DF1, how = 'left', on = ['date', 'hours'])
This introduces NaN values for the right DataFrame's columns. I know I do not have to use both the date and hours columns; however, I still get the same result. I tried a quick version in R, which works perfectly fine:
merged_df <- left_join(DF1, DF2, by = 'date')
Is there any way in pandas to merge DataFrames on just the matching values, without getting NaN values?
Use how='inner' in pd.merge:
merged_df = DF2.merge(DF1, how = 'inner', on = ['date', 'hours'])
This performs an "inner join", omitting rows in each DataFrame that do not match. Hence, there are no NaNs in either the right or left part of the merged DataFrame.
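To see the difference between the two joins on a toy example (these mini-frames are illustrative, not your data):
import pandas as pd

left = pd.DataFrame({'date': ['2013-07-10'] * 2,
                     'hours': ['09:00:00', '10:00:00'],
                     'var1': [150.32, 155.25]})
right = pd.DataFrame({'date': ['2013-07-10'],
                      'hours': ['09:00:00'],
                      'myvar1': [1.617]})

left.merge(right, how='left', on=['date', 'hours'])   # keeps the 10:00 row, myvar1 = NaN
left.merge(right, how='inner', on=['date', 'hours'])  # keeps only the matching 09:00 row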
I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1H', how={'energy_kwh': np.sum, 'average_w': np.mean, 'norm_average_kw/kw': np.mean, 'temperature_degc': np.mean, 'voltage_v': np.mean})
df
To get a result like this:
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, in the original CSV there is a column containing URLs; in the dataset of 100,000 rows there are 3 different URLs (effectively IDs). I want each resampled individually rather than having a 'lump' resample of all of them (e.g. 9:00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by the URLs, as in this minimal example:
In [157]:
df = pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5H')  # create random dataset
df.set_index(df['Datetime'], inplace=True)
del df['Datetime']
df['Location'] = np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
print df.groupby('Location').resample('10D', how={'Val':np.mean})
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
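Note that the how= argument to resample has since been removed from pandas; in modern versions the same grouped resample would be written roughly like this (the 'url' column name for the questioner's case is an assumption):
df.groupby('Location').resample('10D')['Val'].mean()

# for the original question, something along these lines,
# where 'url' stands in for the URL/ID column:
df.groupby('url').resample('1H').agg({'energy_kwh': 'sum',
                                      'average_w': 'mean'})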