Identifying overlapping rows in multiple dataframes - python

I have two dataframes like
df1
Time accler
19.13.33 24
19.13.34 24
19.13.35 25
19.13.36 27
19.13.37 25
19.13.38 27
19.13.39 25
19.13.40 24
df2
Time accler
19.13.29 24
19.13.30 24
19.13.31 25
19.13.32 27
19.13.33 25
19.13.34 27
19.13.35 25
19.13.36 24
These two dataframes overlap on the Time column from 19.13.33 to 19.13.36. Whenever there is an overlap, I want only the rows of each dataframe that fall within the overlap.
expected output
df1
Time accler
19.13.33 24
19.13.34 24
19.13.35 25
19.13.36 27
df2
Time accler
19.13.33 25
19.13.34 27
19.13.35 25
19.13.36 24
Alternatively, a concat of the overlapping rows would also be helpful for further processing.
I tried merge, but it did not work because the dataframes are created dynamically depending on the number of CSV files. I also tried concatenating all the dataframes first and iterating over the rows, but did not find a way.

You can use merge; the default parameter how='inner' can be omitted:
df = pd.merge(df1, df2, on='Time')
print (df)
Time accler_x accler_y
0 19.13.33 24 25
1 19.13.34 24 27
2 19.13.35 25 25
3 19.13.36 27 24
df1 = df[['Time','accler_x']].rename(columns={'accler_x':'accler'})
print (df1)
Time accler
0 19.13.33 24
1 19.13.34 24
2 19.13.35 25
3 19.13.36 27
df2 = df[['Time','accler_y']].rename(columns={'accler_y':'accler'})
print (df2)
Time accler
0 19.13.33 25
1 19.13.34 27
2 19.13.35 25
3 19.13.36 24
If you need to merge multiple DataFrames, use reduce:
# Python 3
import functools
df = functools.reduce(lambda x, y: x.merge(y, on=['Time']), [df1, df2])
# Python 2 (reduce is a builtin)
df = reduce(lambda x, y: x.merge(y, on=['Time']), [df1, df2])
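Since the question mentions that the dataframes are created dynamically from CSV files, here is a minimal sketch of the same reduce idea applied to a dynamically built list; the files/dfs names and the 'data/*.csv' pattern are assumptions for illustration, not part of the question:
import functools
import glob
import pandas as pd

files = glob.glob('data/*.csv')            # assumed location of the CSV files
dfs = [pd.read_csv(f) for f in files]      # one dataframe per file

# inner-merge all frames on 'Time' to keep only the common (overlapping) times
merged = functools.reduce(lambda x, y: x.merge(y, on='Time'), dfs)

# restrict each original dataframe to the overlapping rows
overlapping = [d[d['Time'].isin(merged['Time'])] for d in dfs]

# or concatenate them for further processing
result = pd.concat(overlapping, keys=range(len(overlapping)))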

Related

pandas dataframe groupby apply multi columns and get count

I have an Excel file like this:
year    a    b
2021   12   23
2021   31    0
2021   15   21
2021   14    0
2022   32    0
2022   24   15
2022   28   29
2022   33    0
I want to get the count of rows matching the condition a >= 30 and b == 0, grouped by year.
The final output should look like this:
2021 1
2022 2
I want to use a pandas dataframe to implement this, can anyone help? I'm quite new to Python.
To count the matched rows, chain both conditions with & (bitwise AND) and aggregate with sum; True values are treated as 1 and False as 0:
df1 = (((df.a >= 30) & (df.b == 0)).astype(int)
        .groupby(df['year']).sum().reset_index(name='count'))
print (df1)
year count
0 2021 1
1 2022 2
A similar idea with a helper column:
df1 = (df.assign(count=((df.a >= 30) & (df.b == 0)).astype(int))
         .groupby('year', as_index=False)['count']
         .sum())
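Since the question title mentions groupby with apply over multiple columns, here is a sketch of the same count via GroupBy.apply, assuming the column names year, a and b from the sample data; the vectorised versions above are usually faster because they avoid a Python call per group:
out = (df.groupby('year')
         .apply(lambda g: ((g['a'] >= 30) & (g['b'] == 0)).sum())
         .reset_index(name='count'))
print(out)
   year  count
0  2021      1
1  2022      2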

How do you return dataframes with lowest date for each ID and highest date for each ID

I have a dataframe made up of synthetic student test results data, this looks like the below:
print(df)
student_ID test_date result
76 2021-02-14 60
33 2021-01-12 54
76 2021-11-23 71
76 2021-05-10 78
33 2021-06-09 81
...
The output I'm looking for would look like the below: the oldest and most recent test date and result for each student ID, along with the difference between the two:
student_ID test_date result test_date2 result2 difference
76 2021-02-14 60 2021-11-23 71 11
33 2021-01-12 54 2021-06-09 81 27
...
I was thinking of creating two separate dataframes, one with the records that have the oldest date for each student ID and the other with the most recent record for each student ID, then concatenating the two and adding a column to calculate the difference, but I'm unsure whether this is the correct way of doing it. Would there also be a way to order the records from highest difference to lowest, regardless of whether it is a positive or negative difference (10 and -10 would be the same)?
You can use DataFrameGroupBy.idxmin and DataFrameGroupBy.idxmax to pull the index values (here the result column, set as the index) at the minimal and maximal datetimes, together with the min and max aggregates in GroupBy.agg, and finally create the difference column:
df1 = (df.set_index('result')
         .groupby('student_ID', sort=False)
         .agg(test_date=('test_date', 'min'),
              result=('test_date', 'idxmin'),
              test_date2=('test_date', 'max'),
              result2=('test_date', 'idxmax'))
         .assign(difference=lambda x: x['result2'].sub(x['result']))
         .reset_index())
print (df1)
student_ID test_date result test_date2 result2 difference
0 76 2021-02-14 60 2021-11-23 71 11
1 33 2021-01-12 54 2021-06-09 81 27
For older pandas versions, here is an alternative solution:
df1 = (df.set_index('result')
         .groupby('student_ID', sort=False)['test_date']
         .agg([('test_date', 'min'),
               ('result', 'idxmin'),
               ('test_date2', 'max'),
               ('result2', 'idxmax')])
         .assign(difference=lambda x: x['result2'].sub(x['result']))
         .reset_index())
print (df1)
student_ID test_date result test_date2 result2 difference
0 76 2021-02-14 60 2021-11-23 71 11
1 33 2021-01-12 54 2021-06-09 81 27
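For the follow-up about ordering by the size of the difference (so that 10 and -10 rank the same), here is a short sketch using the absolute value as the sort key, assuming df1 from above:
# pandas >= 1.1 supports a key callable in sort_values
df1_sorted = df1.sort_values('difference', key=lambda s: s.abs(), ascending=False)
# alternative without the key argument, for older versions
df1_sorted = df1.reindex(df1['difference'].abs().sort_values(ascending=False).index)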

Find missing hours from one column and save as txt file in Python

Given a dataset as follows:
   date              NO2  SO2  O3
0  2018/11/14 10:00    9   25  80
1  2018/11/14 12:00    9   26  88
2  2018/11/14 13:00    8   26  88
3  2018/11/14 14:00    8   34  88
4  2018/11/14 15:00    8   37  89
5  2018/11/14 17:00    8   72  40
6  2018/11/14 18:00    8   56  50
7  2018/11/14 19:00    7   81  22
I would like to find the missing hours in the date column and save these missing dates as missing_date.txt.
My code:
df = df.set_index(pd.to_datetime(df['date']))
df = df.sort_index()
df = df.drop(columns=['date'])
df = df.resample('H').first().fillna(np.nan)
missing = df[df['NO2'].isnull()]
np.savetxt('./missing_date.txt', missing.index.to_series(), fmt='%s')
Out:
2018-11-14T11:00:00.000000000
2018-11-14T16:00:00.000000000
The problems:
it is not concise and can probably be improved;
the date format is not the expected 2018/11/14 11:00, 2018/11/14 16:00.
How could I improve the code above? Thanks.
Use DataFrame.asfreq, which works with unique datetimes:
# create sorted DatetimeIndex
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()
# if duplicates are possible
# df = df.resample('H').first()
# if there are no duplicates
df = df.asfreq('H')
missing = df[df['NO2'].isna()]
To write to a file, first convert the DatetimeIndex values to the custom format with DatetimeIndex.strftime, and then write with numpy or pandas:
s = missing.index.strftime('%Y/%m/%d %H:%M').to_series()
np.savetxt('./missing_date.txt', s, fmt='%s')
s.to_csv('./missing_date.txt', index=False)
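A possibly more concise variant is to build the full hourly range and take the set difference with the existing index; this sketch assumes the same DatetimeIndex-indexed df as above:
# all hours between the first and last timestamp
full_range = pd.date_range(df.index.min(), df.index.max(), freq='H')
# hours that are absent from the data
missing_idx = full_range.difference(df.index)
pd.Series(missing_idx.strftime('%Y/%m/%d %H:%M')).to_csv('./missing_date.txt',
                                                         index=False, header=False)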

Python combine monthly and minutes dataframes with TZ-aware datetime index

I have two time-series below. Datetime indices are TZ-aware.
df1: five-minute interval
value_1
Timestamp
2009-04-01 10:50:00+09:30 50
2009-04-05 11:55:00+09:30 55
2009-04-23 16:00:00+09:30 0
2009-05-03 10:50:00+09:30 50
2009-05-07 11:55:00+09:30 55
2009-05-11 16:00:00+09:30 0
2009-07-04 02:05:00+09:30 5
2009-07-21 09:10:00+09:30 10
2009-07-30 12:15:00+09:30 15
2010-09-02 11:25:00+09:30 25
2010-09-22 15:30:00+09:30 30
2010-09-30 06:15:00+09:30 15
2010-12-06 11:25:00+09:30 25
2010-12-22 15:30:00+09:30 30
2010-12-28 06:15:00+09:30 15
df2: Monthly interval obtained by groupby('Month') from a different dataset.
value_2
Timestamp
2009-04-30 00:00:00+09:30 23
2009-07-31 00:00:00+09:30 28
2010-12-31 00:00:00+09:30 23
I want to combine the two datasets by index. Any record in df1 should be included in the final result if it falls in the same month as a record in df2. The expected result is below.
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
This is my attempt.
result = pd.concat([df1, df2], axis=1)
# this combines the datasets, but not as expected, also when including join="outer"; with join="inner", no data is shown
result = pd.merge(df1, df2, left_on='value_1', right_index=True)
# this returns ValueError: You are trying to merge on Int64 and datetime64[ns, Australia/North] columns. If you wish to proceed you should use pd.concat
# using #Ben.T
mt_hMF = df1.merge(df2.reset_index().set_index(df2.index.floor('M')),
                   how='left', left_index=True, right_index=True).set_index('Timestamp')
# This gives ValueError: <MonthEnd> is a non-fixed frequency
Try this, using strftime to create a temporary merge key for both dataframes:
df1.reset_index()\
.assign(yearmonth=df1.index.strftime('%Y%m'))\
.merge(df2.assign(yearmonth=df2.index.strftime('%Y%m')))\
.set_index('Timestamp')\
.drop('yearmonth', axis=1)
Output:
value_1 value_2
Timestamp
2009-04-01 10:50:00+09:30 50 23
2009-04-05 11:55:00+09:30 55 23
2009-04-23 16:00:00+09:30 0 23
2009-07-04 02:05:00+09:30 5 28
2009-07-21 09:10:00+09:30 10 28
2009-07-30 12:15:00+09:30 15 28
2010-12-06 11:25:00+09:30 25 23
2010-12-22 15:30:00+09:30 30 23
2010-12-28 06:15:00+09:30 15 23
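An alternative sketch uses monthly periods as the join key instead of formatted strings; it assumes the same df1 and df2 as above (to_period may warn that the timezone is dropped for the key column, which is fine for matching months):
out = (df1.assign(month=df1.index.to_period('M'))
          .reset_index()
          .merge(df2.assign(month=df2.index.to_period('M')), on='month')
          .set_index('Timestamp')
          .drop(columns='month'))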

Pandas Resampling Code Runs extremely Slowly

I need to resample some data in pandas and I am using the code below; on my data it takes 5 hours:
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
This is prohibitively slow.
How can I speed up the above code, on data like:
id date value
1 16-12-1 9
1 16-12-1 8
1 17-1-1 18
2 17-3-4 19
2 17-3-4 20
1 17-4-3 21
2 17-7-13 12
3 17-8-9 12
2 17-9-12 11
1 17-11-12 19
3 17-11-12 21
giving output:
id date
1 2016-12-04 17
2017-01-01 18
2017-04-09 21
2017-11-12 19
2 2017-03-05 39
2017-07-16 12
2017-09-17 11
3 2017-08-13 12
2017-11-12 21
Name: value, dtype: int64
I set date as the index, but the code is still very slow. Any help would be great.
Give this a try.
I am going to use pd.Grouper() and set the frequency to daily, hoping that it is faster. Also, I am getting rid of the agg and using .sum() directly.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df2 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
Results:
id date
1 2016-12-01 17
2017-01-01 18
2017-04-03 21
2017-11-12 19
2 2017-03-04 39
2017-07-13 12
2017-09-12 11
3 2017-08-09 12
2017-11-12 21
Hope this works.
[EDIT]
So I just ran a small test of both methods on a randomly generated df with 100,000 rows:
df = pd.DataFrame(np.random.randint(0, 30, size=100000),
                  columns=["id"],
                  index=pd.date_range("19300101", periods=100000))
df['value'] = np.random.randint(0, 10, size=100000)
and tried both snippets; the results are:
for using resample:
startTime = time.time()
df2 = df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
print(time.time()-startTime)
1.0451831817626953 seconds
for using pd.Grouper():
startTime = time.time()
df3 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
print(time.time()-startTime)
0.08430838584899902 seconds
so approximately 12 times faster! (if my math is correct)
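One note on equivalence: resample('D') creates a row for every calendar day in each id group (including empty days), while the Grouper only aggregates the days that actually appear in the data, which accounts for much of the speed difference. To match the original output, which also dropped zero sums, the same filter can be chained onto the faster version (a sketch, assuming the df from the question):
df3 = (df.groupby(['id', pd.Grouper(freq='D')])['value']
         .sum()
         .loc[lambda x: x > 0])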
