Separate by threshold - python
I am trying to take the c_med value from input 1 as a threshold and use it to split the rows of input 2 into two outputs: rows whose c_total is above the threshold should go to above.csv, and the rest to below.csv. I then want to read above.csv back in and categorize its rows by percentage, as in the pure-python snippet below.
Input 1:
date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med
2,12,2309,19072,12515,13131,254,785,686,751
Input 2 (columns ['date','startTime','endTime','day','c_total','u_total']):
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254
I am trying to read the threshold value (c_med) from the other input CSV, and I get the following error:
Traceback (most recent call last):
  File "class_med.py", line 10, in <module>
    above_median = df_data['c_total'] > df_med['c_med']
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper
    raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare
Next I need to bucket the separated c_total column by percentage. I have a pure-python solution (below), but I am looking for a pandas solution, like in the referenced answer:
for row in csv.reader(inp):
    if int(row[1]) < (0.20 * max_value):
        val = 'viewers'
    elif int(row[1]) >= (0.20 * max_value) and int(row[1]) < (0.40 * max_value):
        val = 'event based'
    elif int(row[1]) >= (0.40 * max_value) and int(row[1]) < (0.60 * max_value):
        val = 'situational'
    elif int(row[1]) >= (0.60 * max_value) and int(row[1]) < (0.80 * max_value):
        val = 'active'
    else:
        val = 'highly active'
    writer.writerow([row[0], row[1], val])
Code:
import pandas as pd
import numpy as np

df_med = pd.read_csv('stat_result.csv')
df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med']
df_data = pd.read_csv('mini_out.csv')
df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
# comparing two Series of different lengths raises the ValueError above;
# the result would also be a boolean Series, not the filtered rows
above = df_data['c_total'] > df_med['c_med']
#print above
above.to_csv('above.csv', index=None, header=None)
df_above = pd.read_csv('above.csv')
df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
# Percentage block should come here
Edit: For a single column, qcut is the simplest solution. But when the category depends on values from two different columns, how do I achieve that in pandas?
for row in csv.reader(inp):
    if int(row[1]) > (0.80 * max_user) and int(row[2]) > (0.80 * max_key):
        val = 'highly active'
    elif int(row[1]) >= (0.60 * max_user) and int(row[2]) <= (0.60 * max_key):
        val = 'active'
    elif int(row[1]) <= (0.40 * max_user) and int(row[2]) >= (0.40 * max_key):
        val = 'event based'
    elif int(row[1]) < (0.20 * max_user) and int(row[2]) < (0.20 * max_key):
        val = 'situational'
    else:
        val = 'viewers'
Assuming you have the following DFs:
In [7]: df1
Out[7]:
date_count all_hours c_min c_max c_med c_med_med u_min u_max u_med u_med_med
0 2 12 2309 19072 12515 13131 254 785 686 751
In [8]: df2
Out[8]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254
Separate by threshold: you can compare a series against another series of the same length or against a scalar value. I assume you want to split your second data set by comparing it to the scalar c_med value from the first row of your first data set:
In [22]: above = df2[df2.c_total > df1.loc[0, 'c_med']]
In [23]: above
Out[23]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
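To produce the two files asked for in the question, a minimal sketch building on the same comparison (assuming rows equal to the threshold should go to below.csv):

    threshold = df1.loc[0, 'c_med']         # scalar threshold from the stats frame
    above = df2[df2.c_total > threshold]    # rows strictly above the threshold
    below = df2[df2.c_total <= threshold]   # the complementary rows
    above.to_csv('above.csv', index=False)
    below.to_csv('below.csv', index=False)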
You can use the qcut() method to categorize your data:
In [29]: df2['cat'] = pd.qcut(df2.c_total,
....: q=[0, .2, .4, .6, .8, 1.],
....: labels=['viewers','event based','situational','active','highly active'])
In [30]: df2
Out[30]:
date startTime endTime day c_total u_total cat
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790 highly active
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750 active
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747 active
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777 highly active
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785 highly active
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757 situational
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735 situational
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636 situational
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513 event based
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382 event based
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303 viewers
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262 viewers
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254 viewers
Check:
In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']]
Out[32]:
c_total pct cat
0 18944 0.993289 highly active
1 17534 0.919358 active
2 17262 0.905096 active
3 19072 1.000000 highly active
4 18275 0.958211 highly active
5 13589 0.712510 situational
6 16053 0.841705 situational
7 11440 0.599832 situational
8 5972 0.313129 event based
9 3424 0.179530 event based
10 2696 0.141359 viewers
11 2350 0.123217 viewers
12 2309 0.121068 viewers
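Note that qcut() bins by sample quantiles, so the resulting labels will not match the original fraction-of-max rules exactly (in the check above, 0.71 of the max lands in 'situational', while the pure-python rules would call it 'active'). If you need the exact fraction-of-max semantics, a sketch using pd.cut() with explicit edges (assuming the same df2 and numpy imported as np):

    import numpy as np

    max_value = df2.c_total.max()
    edges = [0, .2 * max_value, .4 * max_value, .6 * max_value, .8 * max_value, np.inf]
    # right=False makes each bin [lower, upper), matching the >= / < tests in the loop
    df2['cat'] = pd.cut(df2.c_total, bins=edges, right=False,
                        labels=['viewers', 'event based', 'situational', 'active', 'highly active'])

For the two-column rules in your edit, np.select() expresses the same cascading elif chain; a sketch assuming u_total plays the role of the user count and c_total the key count:

    max_user, max_key = df2.u_total.max(), df2.c_total.max()
    conds = [
        (df2.u_total > .80 * max_user) & (df2.c_total > .80 * max_key),
        (df2.u_total >= .60 * max_user) & (df2.c_total <= .60 * max_key),
        (df2.u_total <= .40 * max_user) & (df2.c_total >= .40 * max_key),
        (df2.u_total < .20 * max_user) & (df2.c_total < .20 * max_key),
    ]
    # conditions are evaluated in order, like the elif chain; rows that
    # match none of them fall through to 'viewers'
    df2['cat'] = np.select(conds, ['highly active', 'active', 'event based', 'situational'],
                           default='viewers')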
Related
Counting each day in a dataframe (Not resetting on new year)
I have two years' worth of data in a DataFrame called df, with an additional column called dayNo which labels what day of the year it is. The code which handles dayNo:

    df['dayNo'] = pd.to_datetime(df['TradeDate'], dayfirst=True).dt.day_of_year

I would like to amend dayNo so that when 2023 begins, dayNo doesn't reset to 1 but continues with 366, 367 and so on. Maybe a completely different approach will have to be taken to what I've done above. Any help greatly appreciated, thanks!
You could define a start day to start counting days from, and use the number of days from that point forward as your column. An example using self-generated data to illustrate the point:

    df = pd.DataFrame({"dates": pd.date_range("2022-12-29", "2023-01-03", freq="8H")})
    start = pd.Timestamp("2021-12-31")
    df["dayNo"] = df["dates"].sub(start).dt.days

                     dates  dayNo
    0  2022-12-29 00:00:00    363
    1  2022-12-29 08:00:00    363
    2  2022-12-29 16:00:00    363
    3  2022-12-30 00:00:00    364
    4  2022-12-30 08:00:00    364
    5  2022-12-30 16:00:00    364
    6  2022-12-31 00:00:00    365
    7  2022-12-31 08:00:00    365
    8  2022-12-31 16:00:00    365
    9  2023-01-01 00:00:00    366
    10 2023-01-01 08:00:00    366
    11 2023-01-01 16:00:00    366
    12 2023-01-02 00:00:00    367
    13 2023-01-02 08:00:00    367
    14 2023-01-02 16:00:00    367
    15 2023-01-03 00:00:00    368
You are nearly there with your solution; just apply a final transformation:

    df['dayNo'] = df['dayNo'].apply(lambda x: x if x >= df.loc[0].dayNo else x + df.loc[0].dayNo)

    df
    Out[108]:
                      dates   TradeDate  dayNo
    0   2022-12-31 00:00:00  2022-12-31    365
    1   2022-12-31 01:00:00  2022-12-31    365
    2   2022-12-31 02:00:00  2022-12-31    365
    3   2022-12-31 03:00:00  2022-12-31    365
    4   2022-12-31 04:00:00  2022-12-31    365
    ..                  ...         ...    ...
    68  2023-01-02 20:00:00  2023-01-02    367
    69  2023-01-02 21:00:00  2023-01-02    367
    70  2023-01-02 22:00:00  2023-01-02    367
    71  2023-01-02 23:00:00  2023-01-02    367
    72  2023-01-03 00:00:00  2023-01-03    368
Let's suppose we have a pandas dataframe built with this script (inspired by Chrysophylaxs' dataframe):

    import pandas as pd

    df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})

The dataframe then has dates from 2022 to 2030:

                   TradeDate
    0    2022-12-29 00:00:00
    1    2022-12-29 08:00:00
    2    2022-12-29 16:00:00
    3    2022-12-30 00:00:00
    4    2022-12-30 08:00:00
    ...                  ...
    7682 2030-01-01 16:00:00
    7683 2030-01-02 00:00:00
    7684 2030-01-02 08:00:00
    7685 2030-01-02 16:00:00
    7686 2030-01-03 00:00:00

    [7687 rows x 1 columns]

I propose the following commented-inside code to reach our target:

    import pandas as pd

    df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})

    # Initialize days counter
    dyc = df['TradeDate'].iloc[0].dayofyear
    # Initialize previous day of year
    prv_dof = dyc

    def func(row):
        global dyc, prv_dof
        # Get the day of the year
        dof = row.iloc[0].dayofyear
        # If a new day starts, increment the days counter
        if dof != prv_dof:
            dyc += 1
            prv_dof = dof
        return dyc

    df['dayNo'] = df.apply(func, axis=1)

Resulting dataframe:

                   TradeDate  dayNo
    0    2022-12-29 00:00:00    363
    1    2022-12-29 08:00:00    363
    2    2022-12-29 16:00:00    363
    3    2022-12-30 00:00:00    364
    4    2022-12-30 08:00:00    364
    ...                  ...    ...
    7682 2030-01-01 16:00:00   2923
    7683 2030-01-02 00:00:00   2924
    7684 2030-01-02 08:00:00   2924
    7685 2030-01-02 16:00:00   2924
    7686 2030-01-03 00:00:00   2925
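For long frames, the row-wise apply() with globals can be slow. A vectorized sketch of the same running count, assuming the rows are already sorted by TradeDate:

    days = df['TradeDate'].dt.normalize()   # calendar day of each row
    # each change of calendar day starts a new count; cumsum() yields a
    # 1-based running index, offset so the first row keeps its day-of-year
    df['dayNo'] = days.ne(days.shift()).cumsum() + days.iloc[0].dayofyear - 1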
Create regular time series from irregular interval with python
I wonder if it is possible to convert an irregular time series interval to a regular one, without interpolating values from another column, like this:

    Index                count
    2018-01-05 00:00:00      1
    2018-01-07 00:00:00      4
    2018-01-08 00:00:00     15
    2018-01-11 00:00:00      2
    2018-01-14 00:00:00      5
    2018-01-19 00:00:00      5
    ....
    2018-12-26 00:00:00      6
    2018-12-29 00:00:00      7
    2018-12-30 00:00:00      8

And I expect the result to be something like this:

    Index                count
    2018-01-01 00:00:00      0
    2018-01-02 00:00:00      0
    2018-01-03 00:00:00      0
    2018-01-04 00:00:00      0
    2018-01-05 00:00:00      1
    2018-01-06 00:00:00      0
    2018-01-07 00:00:00      4
    2018-01-08 00:00:00     15
    2018-01-09 00:00:00      0
    2018-01-10 00:00:00      0
    2018-01-11 00:00:00      2
    2018-01-12 00:00:00      0
    2018-01-13 00:00:00      0
    2018-01-14 00:00:00      5
    2018-01-15 00:00:00      0
    2018-01-16 00:00:00      0
    2018-01-17 00:00:00      0
    2018-01-18 00:00:00      0
    2018-01-19 00:00:00      5
    ....
    2018-12-26 00:00:00      6
    2018-12-27 00:00:00      0
    2018-12-28 00:00:00      0
    2018-12-29 00:00:00      7
    2018-12-30 00:00:00      8
    2018-12-31 00:00:00      0

So far I have tried resample from pandas, but it only partially solved my problem. Thanks in advance.
Use DataFrame.reindex with date_range:

    # if necessary
    df.index = pd.to_datetime(df.index)

    df = df.reindex(pd.date_range('2018-01-01', '2018-12-31'), fill_value=0)
    print (df)
                count
    2018-01-01      0
    2018-01-02      0
    2018-01-03      0
    2018-01-04      0
    2018-01-05      1
    ...           ...
    2018-12-27      0
    2018-12-28      0
    2018-12-29      7
    2018-12-30      8
    2018-12-31      0

    [365 rows x 1 columns]
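If the regular range only needs to span the first to the last observed date (rather than a fixed full-year range), DataFrame.asfreq should work as a shorthand; a sketch assuming the index is already a DatetimeIndex:

    df = df.asfreq('D', fill_value=0)   # daily rows between the first and last index values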
How to group columns in pandas?
I have a DataFrame like this:

       Jan  Feb  Jan.01  Feb.01
    0    0    4       6       4
    1    2    5       7       8
    2    3    6       7       7

How can I group the columns to get the result below? Which functions must I use?

      2000       2001
       Jan  Feb  Jan.01  Feb.01
    0    0    4       6       4
    1    2    5       7       8
    2    3    6       7       7
I think this will do:

    df
                       Jan                  Feb  Jan.01  Feb.01
    0  2016-01-01 00:00:00  2016-01-02 00:00:00       2     413
    1  2016-01-02 01:00:00  2016-01-03 01:00:00       1     414
    2  2016-01-03 02:00:00  2016-01-04 02:00:00       2     763
    3  2016-01-04 03:00:00  2016-01-05 03:00:00       1     837
    4  2016-01-05 04:00:00  2016-01-06 04:00:00       2     375

    level1_col = pd.Series(df.columns).str.split('.').apply(lambda x: 2000 + int(x[1]) if len(x) == 2 else 2000)
    level2_col = df.columns.tolist()
    df.columns = [level1_col, level2_col]

    df
                      2000                       2001
                       Jan                  Feb  Jan.01  Feb.01
    0  2016-01-01 00:00:00  2016-01-02 00:00:00       2     413
    1  2016-01-02 01:00:00  2016-01-03 01:00:00       1     414
    2  2016-01-03 02:00:00  2016-01-04 02:00:00       2     763
    3  2016-01-04 03:00:00  2016-01-05 03:00:00       1     837
    4  2016-01-05 04:00:00  2016-01-06 04:00:00       2     375

    df[2000]
                       Jan                  Feb
    0  2016-01-01 00:00:00  2016-01-02 00:00:00
    1  2016-01-02 01:00:00  2016-01-03 01:00:00
    2  2016-01-03 02:00:00  2016-01-04 02:00:00
    3  2016-01-04 03:00:00  2016-01-05 03:00:00
    4  2016-01-05 04:00:00  2016-01-06 04:00:00
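Assigning a plain list of two array-likes to df.columns builds the MultiIndex implicitly; pd.MultiIndex.from_arrays does the same thing explicitly and lets you name the levels, for example:

    df.columns = pd.MultiIndex.from_arrays([level1_col, level2_col], names=['year', 'month'])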
add timedelta data within a group in pandas dataframe
I am working on a dataframe in pandas with four columns: user_id, time_stamp1, time_stamp2, and interval. time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns]. I want to sum up the interval values for each user_id in the dataframe, and I tried to calculate it in several ways:

    1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
    2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
    3) df["duration"] = df.groupby('user_id').agg(np.sum)

but none of them works: the value of duration is NaT after running the code.
UPDATE: you can use the transform() method:

    In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')

    In [292]: df
    Out[292]:
                         a  user_id                    b           interval           duration
    0  2016-01-01 00:00:00     0.01  2015-11-11 00:00:00   51 days 00:00:00   838 days 08:00:00
    1  2016-03-10 10:39:00     0.01  2015-12-08 18:39:00                NaT   838 days 08:00:00
    2  2016-05-18 21:18:00     0.01  2016-01-05 13:18:00  134 days 08:00:00   838 days 08:00:00
    3  2016-07-27 07:57:00     0.01  2016-02-02 07:57:00  176 days 00:00:00   838 days 08:00:00
    4  2016-10-04 18:36:00     0.01  2016-03-01 02:36:00  217 days 16:00:00   838 days 08:00:00
    5  2016-12-13 05:15:00     0.01  2016-03-28 21:15:00  259 days 08:00:00   838 days 08:00:00
    6  2017-02-20 15:54:00     0.02  2016-04-25 15:54:00  301 days 00:00:00  1454 days 00:00:00
    7  2017-05-01 02:33:00     0.02  2016-05-23 10:33:00  342 days 16:00:00  1454 days 00:00:00
    8  2017-07-09 13:12:00     0.02  2016-06-20 05:12:00  384 days 08:00:00  1454 days 00:00:00
    9  2017-09-16 23:51:00     0.02  2016-07-17 23:51:00  426 days 00:00:00  1454 days 00:00:00

OLD answer:

Demo:

    In [260]: df
    Out[260]:
                         a                    b           interval  user_id
    0  2016-01-01 00:00:00  2015-11-11 00:00:00   51 days 00:00:00        1
    1  2016-03-10 10:39:00  2015-12-08 18:39:00                NaT        1
    2  2016-05-18 21:18:00  2016-01-05 13:18:00  134 days 08:00:00        1
    3  2016-07-27 07:57:00  2016-02-02 07:57:00  176 days 00:00:00        1
    4  2016-10-04 18:36:00  2016-03-01 02:36:00  217 days 16:00:00        1
    5  2016-12-13 05:15:00  2016-03-28 21:15:00  259 days 08:00:00        1
    6  2017-02-20 15:54:00  2016-04-25 15:54:00  301 days 00:00:00        2
    7  2017-05-01 02:33:00  2016-05-23 10:33:00  342 days 16:00:00        2
    8  2017-07-09 13:12:00  2016-06-20 05:12:00  384 days 08:00:00        2
    9  2017-09-16 23:51:00  2016-07-17 23:51:00  426 days 00:00:00        2

    In [261]: df.dtypes
    Out[261]:
    a            datetime64[ns]
    b            datetime64[ns]
    interval    timedelta64[ns]
    user_id               int64
    dtype: object

    In [262]: df.groupby('user_id')['interval'].sum()
    Out[262]:
    user_id
    1     838 days 08:00:00
    2    1454 days 00:00:00
    Name: interval, dtype: timedelta64[ns]

    In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
    Out[263]:
    user_id
    1     838 days 08:00:00
    2    1454 days 00:00:00
    Name: interval, dtype: timedelta64[ns]

    In [264]: df.groupby('user_id').agg(np.sum)
    Out[264]:
                       interval
    user_id
    1         838 days 08:00:00
    2        1454 days 00:00:00

So check your data...
pandas datetime: groupby hourly and every Monday
I'm new to pandas / python. I have a dataframe (events.number) indexed by a datetime object. I'm trying to extract an hourly event count on every Monday (or another particular weekday). I wrote:

    hour_tally_monday = events.number.groupby(lambda x: (x.hour & x.weekday==0)).count()

but this does not work correctly. I can drop the "& x.weekday==1" and it works, but presumably it then uses all the days in the frame. What's the right (simplest) syntax to just average over Mondays?
I think you need to first filter the dataframe with boolean indexing and then use groupby with size:

    import pandas as pd

    start = pd.to_datetime('2016-02-01')
    end = pd.to_datetime('2016-02-25')
    rng = pd.date_range(start, end, freq='12H')

    events = pd.DataFrame({'number': [1] * 20 + [2] * 15 + [3] * 14}, index=rng)
    print events
                         number
    2016-02-01 00:00:00       1
    2016-02-01 12:00:00       1
    2016-02-02 00:00:00       1
    2016-02-02 12:00:00       1
    2016-02-03 00:00:00       1
    2016-02-03 12:00:00       1
    2016-02-04 00:00:00       1
    2016-02-04 12:00:00       1
    2016-02-05 00:00:00       1
    2016-02-05 12:00:00       1
    2016-02-06 00:00:00       1
    2016-02-06 12:00:00       1
    2016-02-07 00:00:00       1
    ...                     ...

    filtered = events[events.index.weekday == 0]
    print filtered
                         number
    2016-02-01 00:00:00       1
    2016-02-01 12:00:00       1
    2016-02-08 00:00:00       1
    2016-02-08 12:00:00       1
    2016-02-15 00:00:00       2
    2016-02-15 12:00:00       2
    2016-02-22 00:00:00       3
    2016-02-22 12:00:00       3

In version 0.18.1 you can use the new method DatetimeIndex.weekday_name:

    filtered = events[events.index.weekday_name == 'Monday']
    print filtered
                         number
    2016-02-01 00:00:00       1
    2016-02-01 12:00:00       1
    2016-02-08 00:00:00       1
    2016-02-08 12:00:00       1
    2016-02-15 00:00:00       2
    2016-02-15 12:00:00       2
    2016-02-22 00:00:00       3
    2016-02-22 12:00:00       3

    print filtered.groupby(filtered.index.hour).size()
    0     4
    12    4
    dtype: int64
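Since the question's last sentence asks for an average rather than a count, the same filter composes with mean(); a sketch reusing the filtered frame above:

    # mean of `number` for each hour of day, Mondays only
    print filtered.groupby(filtered.index.hour)['number'].mean()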