Separate by threshold - python
I am trying to take the c_med value from input 1 as a threshold and use it to split the rows of input 2 into two outputs: rows whose c_total is above the threshold should go to above.csv, and the rest to below.csv. I then want to read above.csv back in and categorize its rows by percentage, as in the pure-python snippet below.
Input 1:
date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med
2,12,2309,19072,12515,13131,254,785,686,751
Input 2 (columns ['date','startTime','endTime','day','c_total','u_total']):
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254
I am trying to read the threshold value (c_med) from the other input CSV, and I get the following error:
Traceback (most recent call last):
  File "class_med.py", line 10, in <module>
    above_median = df_data['c_total'] > df_med['c_med']
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper
    raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare
Next I need to bucket the separated c_total column by percentage. I have a pure-python solution (below), but I am looking for a pandas solution, like in the referenced answer:
for row in csv.reader(inp):
    if int(row[1]) < (0.20 * max_value):
        val = 'viewers'
    elif int(row[1]) >= (0.20 * max_value) and int(row[1]) < (0.40 * max_value):
        val = 'event based'
    elif int(row[1]) >= (0.40 * max_value) and int(row[1]) < (0.60 * max_value):
        val = 'situational'
    elif int(row[1]) >= (0.60 * max_value) and int(row[1]) < (0.80 * max_value):
        val = 'active'
    else:
        val = 'highly active'
    writer.writerow([row[0], row[1], val])
Code:
import pandas as pd
import numpy as np

df_med = pd.read_csv('stat_result.csv')
df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med']
df_data = pd.read_csv('mini_out.csv')
df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
# comparing two Series of different lengths raises the ValueError above;
# the result would also be a boolean Series, not the filtered rows
above = df_data['c_total'] > df_med['c_med']
#print above
above.to_csv('above.csv', index=None, header=None)
df_above = pd.read_csv('above.csv')
df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
# Percentage block should come here
Edit: For a single column, qcut is the simplest solution. But when the category depends on values from two different columns, how do I achieve that in pandas?
for row in csv.reader(inp):
    if int(row[1]) > (0.80 * max_user) and int(row[2]) > (0.80 * max_key):
        val = 'highly active'
    elif int(row[1]) >= (0.60 * max_user) and int(row[2]) <= (0.60 * max_key):
        val = 'active'
    elif int(row[1]) <= (0.40 * max_user) and int(row[2]) >= (0.40 * max_key):
        val = 'event based'
    elif int(row[1]) < (0.20 * max_user) and int(row[2]) < (0.20 * max_key):
        val = 'situational'
    else:
        val = 'viewers'
Assuming you have the following DFs:
In [7]: df1
Out[7]:
date_count all_hours c_min c_max c_med c_med_med u_min u_max u_med u_med_med
0 2 12 2309 19072 12515 13131 254 785 686 751
In [8]: df2
Out[8]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254
Separate by threshold: you can compare a series against another series of the same length or against a scalar value. I assume you want to split your second data set by comparing it to the scalar c_med value from the first row of your first data set:
In [22]: above = df2[df2.c_total > df1.loc[0, 'c_med']]
In [23]: above
Out[23]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
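To produce the two files asked for in the question, a minimal sketch building on the same comparison (assuming rows equal to the threshold should go to below.csv):

    threshold = df1.loc[0, 'c_med']         # scalar threshold from the stats frame
    above = df2[df2.c_total > threshold]    # rows strictly above the threshold
    below = df2[df2.c_total <= threshold]   # the complementary rows
    above.to_csv('above.csv', index=False)
    below.to_csv('below.csv', index=False)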
You can use the qcut() method to categorize your data:
In [29]: df2['cat'] = pd.qcut(df2.c_total,
....: q=[0, .2, .4, .6, .8, 1.],
....: labels=['viewers','event based','situational','active','highly active'])
In [30]: df2
Out[30]:
date startTime endTime day c_total u_total cat
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790 highly active
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750 active
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747 active
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777 highly active
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785 highly active
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757 situational
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735 situational
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636 situational
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513 event based
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382 event based
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303 viewers
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262 viewers
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254 viewers
Check:
In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']]
Out[32]:
c_total pct cat
0 18944 0.993289 highly active
1 17534 0.919358 active
2 17262 0.905096 active
3 19072 1.000000 highly active
4 18275 0.958211 highly active
5 13589 0.712510 situational
6 16053 0.841705 situational
7 11440 0.599832 situational
8 5972 0.313129 event based
9 3424 0.179530 event based
10 2696 0.141359 viewers
11 2350 0.123217 viewers
12 2309 0.121068 viewers
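Note that qcut() bins by sample quantiles, so the resulting labels will not match the original fraction-of-max rules exactly (in the check above, 0.71 of the max lands in 'situational', while the pure-python rules would call it 'active'). If you need the exact fraction-of-max semantics, a sketch using pd.cut() with explicit edges (assuming the same df2 and numpy imported as np):

    import numpy as np

    max_value = df2.c_total.max()
    edges = [0, .2 * max_value, .4 * max_value, .6 * max_value, .8 * max_value, np.inf]
    # right=False makes each bin [lower, upper), matching the >= / < tests in the loop
    df2['cat'] = pd.cut(df2.c_total, bins=edges, right=False,
                        labels=['viewers', 'event based', 'situational', 'active', 'highly active'])

For the two-column rules in your edit, np.select() expresses the same cascading elif chain; a sketch assuming u_total plays the role of the user count and c_total the key count:

    max_user, max_key = df2.u_total.max(), df2.c_total.max()
    conds = [
        (df2.u_total > .80 * max_user) & (df2.c_total > .80 * max_key),
        (df2.u_total >= .60 * max_user) & (df2.c_total <= .60 * max_key),
        (df2.u_total <= .40 * max_user) & (df2.c_total >= .40 * max_key),
        (df2.u_total < .20 * max_user) & (df2.c_total < .20 * max_key),
    ]
    # conditions are evaluated in order, like the elif chain; rows that
    # match none of them fall through to 'viewers'
    df2['cat'] = np.select(conds, ['highly active', 'active', 'event based', 'situational'],
                           default='viewers')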
Related
Counting each day in a dataframe (Not resetting on new year)
I have two years' worth of data in a DataFrame called df, with an additional column called dayNo which labels what day of the year it is. The code which handles dayNo:

    df['dayNo'] = pd.to_datetime(df['TradeDate'], dayfirst=True).dt.day_of_year

I would like to amend dayNo so that when 2023 begins, dayNo doesn't reset to 1 but continues with 366, 367 and so on. Maybe a completely different approach will have to be taken to what I've done above. Any help greatly appreciated, thanks!
You could define a start day to start counting days from, and use the number of days from that point forward as your column. An example using self-generated data to illustrate the point:

    df = pd.DataFrame({"dates": pd.date_range("2022-12-29", "2023-01-03", freq="8H")})
    start = pd.Timestamp("2021-12-31")
    df["dayNo"] = df["dates"].sub(start).dt.days

                     dates  dayNo
    0  2022-12-29 00:00:00    363
    1  2022-12-29 08:00:00    363
    2  2022-12-29 16:00:00    363
    3  2022-12-30 00:00:00    364
    4  2022-12-30 08:00:00    364
    5  2022-12-30 16:00:00    364
    6  2022-12-31 00:00:00    365
    7  2022-12-31 08:00:00    365
    8  2022-12-31 16:00:00    365
    9  2023-01-01 00:00:00    366
    10 2023-01-01 08:00:00    366
    11 2023-01-01 16:00:00    366
    12 2023-01-02 00:00:00    367
    13 2023-01-02 08:00:00    367
    14 2023-01-02 16:00:00    367
    15 2023-01-03 00:00:00    368
You are nearly there with your solution; just apply a final transformation:

    df['dayNo'] = df['dayNo'].apply(lambda x: x if x >= df.loc[0].dayNo else x + df.loc[0].dayNo)

    df
    Out[108]:
                      dates   TradeDate  dayNo
    0   2022-12-31 00:00:00  2022-12-31    365
    1   2022-12-31 01:00:00  2022-12-31    365
    2   2022-12-31 02:00:00  2022-12-31    365
    3   2022-12-31 03:00:00  2022-12-31    365
    4   2022-12-31 04:00:00  2022-12-31    365
    ..                  ...         ...    ...
    68  2023-01-02 20:00:00  2023-01-02    367
    69  2023-01-02 21:00:00  2023-01-02    367
    70  2023-01-02 22:00:00  2023-01-02    367
    71  2023-01-02 23:00:00  2023-01-02    367
    72  2023-01-03 00:00:00  2023-01-03    368
Let's suppose we have a pandas dataframe built with this script (inspired by Chrysophylaxs' dataframe):

    import pandas as pd

    df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})

The dataframe then has dates from 2022 to 2030:

                   TradeDate
    0    2022-12-29 00:00:00
    1    2022-12-29 08:00:00
    2    2022-12-29 16:00:00
    3    2022-12-30 00:00:00
    4    2022-12-30 08:00:00
    ...                  ...
    7682 2030-01-01 16:00:00
    7683 2030-01-02 00:00:00
    7684 2030-01-02 08:00:00
    7685 2030-01-02 16:00:00
    7686 2030-01-03 00:00:00

    [7687 rows x 1 columns]

I propose the following commented-inside code to reach our target:

    import pandas as pd

    df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})

    # Initialize days counter
    dyc = df['TradeDate'].iloc[0].dayofyear
    # Initialize previous day of year
    prv_dof = dyc

    def func(row):
        global dyc, prv_dof
        # Get the day of the year
        dof = row.iloc[0].dayofyear
        # If a new day starts, increment the days counter
        if dof != prv_dof:
            dyc += 1
            prv_dof = dof
        return dyc

    df['dayNo'] = df.apply(func, axis=1)

Resulting dataframe:

                   TradeDate  dayNo
    0    2022-12-29 00:00:00    363
    1    2022-12-29 08:00:00    363
    2    2022-12-29 16:00:00    363
    3    2022-12-30 00:00:00    364
    4    2022-12-30 08:00:00    364
    ...                  ...    ...
    7682 2030-01-01 16:00:00   2923
    7683 2030-01-02 00:00:00   2924
    7684 2030-01-02 08:00:00   2924
    7685 2030-01-02 16:00:00   2924
    7686 2030-01-03 00:00:00   2925
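For long frames, the row-wise apply() with globals can be slow. A vectorized sketch of the same running count, assuming the rows are already sorted by TradeDate:

    days = df['TradeDate'].dt.normalize()   # calendar day of each row
    # each change of calendar day starts a new count; cumsum() yields a
    # 1-based running index, offset so the first row keeps its day-of-year
    df['dayNo'] = days.ne(days.shift()).cumsum() + days.iloc[0].dayofyear - 1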
Create regular time series from irregular interval with python
I wonder if it is possible to convert an irregular time series interval to a regular one, without interpolating values from another column, like this:

    Index                count
    2018-01-05 00:00:00      1
    2018-01-07 00:00:00      4
    2018-01-08 00:00:00     15
    2018-01-11 00:00:00      2
    2018-01-14 00:00:00      5
    2018-01-19 00:00:00      5
    ....
    2018-12-26 00:00:00      6
    2018-12-29 00:00:00      7
    2018-12-30 00:00:00      8

And I expect the result to be something like this:

    Index                count
    2018-01-01 00:00:00      0
    2018-01-02 00:00:00      0
    2018-01-03 00:00:00      0
    2018-01-04 00:00:00      0
    2018-01-05 00:00:00      1
    2018-01-06 00:00:00      0
    2018-01-07 00:00:00      4
    2018-01-08 00:00:00     15
    2018-01-09 00:00:00      0
    2018-01-10 00:00:00      0
    2018-01-11 00:00:00      2
    2018-01-12 00:00:00      0
    2018-01-13 00:00:00      0
    2018-01-14 00:00:00      5
    2018-01-15 00:00:00      0
    2018-01-16 00:00:00      0
    2018-01-17 00:00:00      0
    2018-01-18 00:00:00      0
    2018-01-19 00:00:00      5
    ....
    2018-12-26 00:00:00      6
    2018-12-27 00:00:00      0
    2018-12-28 00:00:00      0
    2018-12-29 00:00:00      7
    2018-12-30 00:00:00      8
    2018-12-31 00:00:00      0

So far I have tried resample from pandas, but it only partially solved my problem. Thanks in advance.
Use DataFrame.reindex with date_range:

    # if necessary
    df.index = pd.to_datetime(df.index)

    df = df.reindex(pd.date_range('2018-01-01', '2018-12-31'), fill_value=0)
    print (df)
                count
    2018-01-01      0
    2018-01-02      0
    2018-01-03      0
    2018-01-04      0
    2018-01-05      1
    ...           ...
    2018-12-27      0
    2018-12-28      0
    2018-12-29      7
    2018-12-30      8
    2018-12-31      0

    [365 rows x 1 columns]
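If the regular range only needs to span the first to the last observed date (rather than a fixed full-year range), DataFrame.asfreq should work as a shorthand; a sketch assuming the index is already a DatetimeIndex:

    df = df.asfreq('D', fill_value=0)   # daily rows between the first and last index values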
How to group columns in pandas?
I have a DataFrame like this:

       Jan  Feb  Jan.01  Feb.01
    0    0    4       6       4
    1    2    5       7       8
    2    3    6       7       7

How can I group the columns to get the result below? Which functions must I use?

      2000       2001
       Jan  Feb  Jan.01  Feb.01
    0    0    4       6       4
    1    2    5       7       8
    2    3    6       7       7
I think this will do:

    df
                       Jan                  Feb  Jan.01  Feb.01
    0  2016-01-01 00:00:00  2016-01-02 00:00:00       2     413
    1  2016-01-02 01:00:00  2016-01-03 01:00:00       1     414
    2  2016-01-03 02:00:00  2016-01-04 02:00:00       2     763
    3  2016-01-04 03:00:00  2016-01-05 03:00:00       1     837
    4  2016-01-05 04:00:00  2016-01-06 04:00:00       2     375

    level1_col = pd.Series(df.columns).str.split('.').apply(lambda x: 2000 + int(x[1]) if len(x) == 2 else 2000)
    level2_col = df.columns.tolist()
    df.columns = [level1_col, level2_col]

    df
                      2000                       2001
                       Jan                  Feb  Jan.01  Feb.01
    0  2016-01-01 00:00:00  2016-01-02 00:00:00       2     413
    1  2016-01-02 01:00:00  2016-01-03 01:00:00       1     414
    2  2016-01-03 02:00:00  2016-01-04 02:00:00       2     763
    3  2016-01-04 03:00:00  2016-01-05 03:00:00       1     837
    4  2016-01-05 04:00:00  2016-01-06 04:00:00       2     375

    df[2000]
                       Jan                  Feb
    0  2016-01-01 00:00:00  2016-01-02 00:00:00
    1  2016-01-02 01:00:00  2016-01-03 01:00:00
    2  2016-01-03 02:00:00  2016-01-04 02:00:00
    3  2016-01-04 03:00:00  2016-01-05 03:00:00
    4  2016-01-05 04:00:00  2016-01-06 04:00:00
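Assigning a plain list of two array-likes to df.columns builds the MultiIndex implicitly; pd.MultiIndex.from_arrays does the same thing explicitly and lets you name the levels, for example:

    df.columns = pd.MultiIndex.from_arrays([level1_col, level2_col], names=['year', 'month'])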
add timedelta data within a group in pandas dataframe
I am working on a dataframe in pandas with four columns: user_id, time_stamp1, time_stamp2, and interval. time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns]. I want to sum up the interval values for each user_id in the dataframe, and I tried to calculate it in several ways:

    1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
    2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
    3) df["duration"] = df.groupby('user_id').agg(np.sum)

but none of them works: the value of duration is NaT after running the code.
UPDATE: you can use the transform() method:

    In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')

    In [292]: df
    Out[292]:
                         a  user_id                    b           interval           duration
    0  2016-01-01 00:00:00     0.01  2015-11-11 00:00:00   51 days 00:00:00   838 days 08:00:00
    1  2016-03-10 10:39:00     0.01  2015-12-08 18:39:00                NaT   838 days 08:00:00
    2  2016-05-18 21:18:00     0.01  2016-01-05 13:18:00  134 days 08:00:00   838 days 08:00:00
    3  2016-07-27 07:57:00     0.01  2016-02-02 07:57:00  176 days 00:00:00   838 days 08:00:00
    4  2016-10-04 18:36:00     0.01  2016-03-01 02:36:00  217 days 16:00:00   838 days 08:00:00
    5  2016-12-13 05:15:00     0.01  2016-03-28 21:15:00  259 days 08:00:00   838 days 08:00:00
    6  2017-02-20 15:54:00     0.02  2016-04-25 15:54:00  301 days 00:00:00  1454 days 00:00:00
    7  2017-05-01 02:33:00     0.02  2016-05-23 10:33:00  342 days 16:00:00  1454 days 00:00:00
    8  2017-07-09 13:12:00     0.02  2016-06-20 05:12:00  384 days 08:00:00  1454 days 00:00:00
    9  2017-09-16 23:51:00     0.02  2016-07-17 23:51:00  426 days 00:00:00  1454 days 00:00:00

OLD answer:

Demo:

    In [260]: df
    Out[260]:
                         a                    b           interval  user_id
    0  2016-01-01 00:00:00  2015-11-11 00:00:00   51 days 00:00:00        1
    1  2016-03-10 10:39:00  2015-12-08 18:39:00                NaT        1
    2  2016-05-18 21:18:00  2016-01-05 13:18:00  134 days 08:00:00        1
    3  2016-07-27 07:57:00  2016-02-02 07:57:00  176 days 00:00:00        1
    4  2016-10-04 18:36:00  2016-03-01 02:36:00  217 days 16:00:00        1
    5  2016-12-13 05:15:00  2016-03-28 21:15:00  259 days 08:00:00        1
    6  2017-02-20 15:54:00  2016-04-25 15:54:00  301 days 00:00:00        2
    7  2017-05-01 02:33:00  2016-05-23 10:33:00  342 days 16:00:00        2
    8  2017-07-09 13:12:00  2016-06-20 05:12:00  384 days 08:00:00        2
    9  2017-09-16 23:51:00  2016-07-17 23:51:00  426 days 00:00:00        2

    In [261]: df.dtypes
    Out[261]:
    a            datetime64[ns]
    b            datetime64[ns]
    interval    timedelta64[ns]
    user_id               int64
    dtype: object

    In [262]: df.groupby('user_id')['interval'].sum()
    Out[262]:
    user_id
    1     838 days 08:00:00
    2    1454 days 00:00:00
    Name: interval, dtype: timedelta64[ns]

    In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
    Out[263]:
    user_id
    1     838 days 08:00:00
    2    1454 days 00:00:00
    Name: interval, dtype: timedelta64[ns]

    In [264]: df.groupby('user_id').agg(np.sum)
    Out[264]:
                       interval
    user_id
    1         838 days 08:00:00
    2        1454 days 00:00:00

So check your data...
pandas datetime: groupby hourly and every Monday
I'm new to pandas / python. I have a dataframe (events.number) indexed by a datetime object. I'm trying to extract an hourly event count on every Monday (or another particular weekday). I wrote:

    hour_tally_monday = events.number.groupby(lambda x: (x.hour & x.weekday==0)).count()

but this does not work correctly. I can drop the "& x.weekday==1" and it works, but presumably it then uses all the days in the frame. What's the right (simplest) syntax to just average over Mondays?
I think you need to first filter the dataframe with boolean indexing and then use groupby with size:

    import pandas as pd

    start = pd.to_datetime('2016-02-01')
    end = pd.to_datetime('2016-02-25')
    rng = pd.date_range(start, end, freq='12H')

    events = pd.DataFrame({'number': [1] * 20 + [2] * 15 + [3] * 14}, index=rng)
    print events
                         number
    2016-02-01 00:00:00       1
    2016-02-01 12:00:00       1
    2016-02-02 00:00:00       1
    2016-02-02 12:00:00       1
    2016-02-03 00:00:00       1
    2016-02-03 12:00:00       1
    2016-02-04 00:00:00       1
    2016-02-04 12:00:00       1
    2016-02-05 00:00:00       1
    2016-02-05 12:00:00       1
    2016-02-06 00:00:00       1
    2016-02-06 12:00:00       1
    2016-02-07 00:00:00       1
    ...                     ...

    filtered = events[events.index.weekday == 0]
    print filtered
                         number
    2016-02-01 00:00:00       1
    2016-02-01 12:00:00       1
    2016-02-08 00:00:00       1
    2016-02-08 12:00:00       1
    2016-02-15 00:00:00       2
    2016-02-15 12:00:00       2
    2016-02-22 00:00:00       3
    2016-02-22 12:00:00       3

In version 0.18.1 you can use the new method DatetimeIndex.weekday_name:

    filtered = events[events.index.weekday_name == 'Monday']
    print filtered
                         number
    2016-02-01 00:00:00       1
    2016-02-01 12:00:00       1
    2016-02-08 00:00:00       1
    2016-02-08 12:00:00       1
    2016-02-15 00:00:00       2
    2016-02-15 12:00:00       2
    2016-02-22 00:00:00       3
    2016-02-22 12:00:00       3

    print filtered.groupby(filtered.index.hour).size()
    0     4
    12    4
    dtype: int64
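Since the question's last sentence asks for an average rather than a count, the same filter composes with mean(); a sketch reusing the filtered frame above:

    # mean of `number` for each hour of day, Mondays only
    print filtered.groupby(filtered.index.hour)['number'].mean()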