Datetime conversion - slow in pandas - python

I have a DataFrame which looks like this:
date,time,metric_x
2016-02-27,00:00:28.0000000,31
2016-02-27,00:01:19.0000000,40
2016-02-27,00:02:55.0000000,39
2016-02-27,00:03:51.0000000,48
2016-02-27,00:05:22.0000000,42
2016-02-27,00:05:59.0000000,35
I wish to generate a new column
df['time_slot'] = df.apply(lambda row: time_slot_convert(pd.to_datetime(row['time'])), axis=1)
where
def time_slot_convert(time):
    return time.hour + 1
This function finds the hour for the record and adds 1.
This is extremely slow. I understand that the data is read in as a string. Is there a more efficient way to speed this up?

It is faster to remove apply and use the vectorized .dt accessor:
df['time_slot'] = pd.to_datetime(df['time']).dt.hour + 1
print (df)
date time metric_x time_slot
0 2016-02-27 00:00:28.0000000 31 1
1 2016-02-27 00:01:19.0000000 40 1
2 2016-02-27 00:02:55.0000000 39 1
3 2016-02-27 00:03:51.0000000 48 1
4 2016-02-27 00:05:22.0000000 42 1
5 2016-02-27 00:05:59.0000000 35 1
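If only the hour is ever needed, datetime parsing can be skipped entirely by slicing the string. A minimal sketch, assuming the time column is always a zero-padded HH:MM:SS.fffffff string (the sample frame here is hypothetical):
import pandas as pd

# hypothetical sample mirroring the question's 'time' column
df = pd.DataFrame({'time': ['00:00:28.0000000', '00:01:19.0000000', '23:59:01.0000000']})

# the first two characters are the hour, so no datetime parsing is required
df['time_slot'] = df['time'].str.slice(0, 2).astype(int) + 1
print(df)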

Counting the number of dates falling within a date range from a different dataframe [duplicate]

Existing dataframe:
df_1
Id dates time(sec)_1 time(sec)_2
1 02/02/2022 15 20
1 04/02/2022 20 30
1 03/02/2022 30 40
1 06/02/2022 50 40
2 10/02/2022 10 10
2 11/02/2022 15 20
df_2
Id min_date action_date
1 02/02/2022 04/02/2022
2 06/02/2022 10/02/2022
Expected Dataframe:
df_2
Id min_date action_date count_of_dates avg_time_1 avg_time_2
1 02/02/2022 04/02/2022 3 21.67 30
2 06/02/2022 10/02/2022 1 10 10
count_of_dates, avg_time_1, and avg_time_2 are to be created from df_1.
count_of_dates is calculated from min_date and action_date, i.e. the number of dates in df_1 falling between min_date and action_date.
avg_time_1 and avg_time_2 are averaged over those same rows.
I am stuck applying the condition on the dates :-( Any leads?
If the data is small, it is possible to filter per row with a custom function:
df_1['dates'] = df_1['dates'].apply(pd.to_datetime)
df_2[['min_date','action_date']] = df_2[['min_date','action_date']].apply(pd.to_datetime)
def f(x):
    m = df_1['Id'].eq(x['Id']) & df_1['dates'].between(x['min_date'], x['action_date'])
    s = df_1.loc[m, ['time(sec)_1','time(sec)_2']].mean()
    return pd.Series([m.sum()] + s.to_list(), index=['count_of_dates'] + s.index.tolist())
df = df_2.join(df_2.apply(f, axis=1))
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3.0 21.666667 30.0
1 2 2022-06-02 2022-10-02 1.0 10.000000 10.0
If Id in df_2 is unique, it is possible to improve performance by merging with df_1 and aggregating with size and mean:
df = df_2.merge(df_1, on='Id')
d = {'count_of_dates': ('Id', 'size'),
     'time(sec)_1': ('time(sec)_1', 'mean'),
     'time(sec)_2': ('time(sec)_2', 'mean')}
df = df_2.join(df[df['dates'].between(df['min_date'], df['action_date'])]
                 .groupby('Id').agg(**d), on='Id')
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3 21.666667 30
1 2 2022-06-02 2022-10-02 1 10.000000 10
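For completeness, a self-contained sketch of the merge-and-aggregate approach with day-first date parsing and the aggregated columns renamed to match the expected output; the avg_time_1/avg_time_2 names come from the question, not from the snippet above:
import pandas as pd

df_1 = pd.DataFrame({'Id': [1, 1, 1, 1, 2, 2],
                     'dates': pd.to_datetime(['02/02/2022', '04/02/2022', '03/02/2022',
                                              '06/02/2022', '10/02/2022', '11/02/2022'], dayfirst=True),
                     'time(sec)_1': [15, 20, 30, 50, 10, 15],
                     'time(sec)_2': [20, 30, 40, 40, 10, 20]})
df_2 = pd.DataFrame({'Id': [1, 2],
                     'min_date': pd.to_datetime(['02/02/2022', '06/02/2022'], dayfirst=True),
                     'action_date': pd.to_datetime(['04/02/2022', '10/02/2022'], dayfirst=True)})

# named aggregation using the column names from the expected output
d = {'count_of_dates': ('Id', 'size'),
     'avg_time_1': ('time(sec)_1', 'mean'),
     'avg_time_2': ('time(sec)_2', 'mean')}

merged = df_2.merge(df_1, on='Id')
mask = merged['dates'].between(merged['min_date'], merged['action_date'])
print(df_2.join(merged[mask].groupby('Id').agg(**d), on='Id'))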

How to calculate a Process Duration from a TimeSeries Dataset with Pandas

I have a huge dataset of various sensor data sorted chronologically (by timestamp) and by sensor type. I want to calculate the duration of a process in seconds by subtracting the first entry of a sensor from the last entry. This is to be done with python and pandas. Attached is an example for better understanding:
I want to subtract the first row from the last row for each sensor type to get the process duration in seconds (i.e. row 8 minus row 1: 2022-04-04T09:44:56.962Z - 2022-04-04T09:44:56.507Z = 0.455 seconds).
The duration should then be written to a newly created column in the last row of the sensor type.
Thanks in advance!
Assuming your 'timestamp' column is already converted with to_datetime, would this work?
df['diffPerSensor_type']=df.groupby('sensor_type')['timestamp'].transform('last')-df.groupby('sensor_type')['timestamp'].transform('first')
You could then extract your seconds with this
df['diffPerSensor_type'].dt.seconds
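One caveat: .dt.seconds returns only the whole-second component of the timedelta, so a duration like 0.455 s comes out as 0; .dt.total_seconds() keeps the fractional part. A minimal sketch with hypothetical timestamps, assuming 'timestamp' is already datetime:
import pandas as pd

df = pd.DataFrame({
    'sensor_type': ['a', 'a', 'b', 'b'],
    'timestamp': pd.to_datetime(['2022-04-04T09:44:56.507Z', '2022-04-04T09:44:56.962Z',
                                 '2022-04-04T09:44:57.000Z', '2022-04-04T09:44:58.250Z'])
})

grp = df.groupby('sensor_type')['timestamp']
# total_seconds() keeps the sub-second part (e.g. 0.455), unlike .dt.seconds
df['duration_s'] = (grp.transform('last') - grp.transform('first')).dt.total_seconds()
print(df)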
If someone wants to reproduce an example, here is a df:
import pandas as pd
df = pd.DataFrame({
    'sensor_type': [0]*7 + [1]*11 + [13]*5 + [8]*5,
    'timestamp': pd.date_range('2022-04-04', periods=28, freq='ms'),
    'value': [128] * 28
})
df['time_diff in milliseconds'] = (df.groupby('sensor_type')['timestamp']
                                     .transform(lambda x: x.iloc[-1] - x.iloc[0])
                                     .dt.components.milliseconds)
print(df.head(10))
sensor_type timestamp value time_diff in milliseconds
0 0 2022-04-04 00:00:00.000 128 6
1 0 2022-04-04 00:00:00.001 128 6
2 0 2022-04-04 00:00:00.002 128 6
3 0 2022-04-04 00:00:00.003 128 6
4 0 2022-04-04 00:00:00.004 128 6
5 0 2022-04-04 00:00:00.005 128 6
6 0 2022-04-04 00:00:00.006 128 6
7 1 2022-04-04 00:00:00.007 128 10
8 1 2022-04-04 00:00:00.008 128 10
9 1 2022-04-04 00:00:00.009 128 10
My solution is nearly the same as Daniel Weigel's, except that I used a lambda to calculate the difference.

Output raw value difference from one period to the next using Python

I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date Value PercentDifference ValueDifference
10/01/2020 1
10/02/2020 2 100 2
10/03/2020 5 150 3
10/04/2020 8 60 3
This is what I am doing:
import pandas as pd
df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
               on='Date')
        .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
        [['Date','Value']]
     )
df['PercentDifference'] = [f'{x:.2%}' for x in (df['Value'].div(df['Value'].shift(1)) - 1).fillna(0)]
A member helped me with the code above; I am also trying to incorporate the value difference as shown in my desired output.
Note: is there a way to incorporate a 'period', say, checking the percent difference and value difference over a 7-day period, a 30-day period, and so on?
Any suggestion is appreciated.
Use Series.pct_change and Series.diff
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or you can use df.assign:
df.assign(
    percentageDiff=df["Value"].pct_change().mul(100),
    ValueDiff=df["Value"].diff()
)
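Regarding the note about longer periods: both pct_change and diff accept a periods argument, so a 7-row or 30-row lookback is a one-liner. A sketch with hypothetical daily data, assuming exactly one row per day:
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2020-10-01', periods=10, freq='D'),
                   'Value': [1, 2, 5, 8, 9, 11, 14, 20, 22, 30]})

# with one row per day, periods=7 compares each value to the one 7 days earlier
df['PercentDiff_7d'] = df['Value'].pct_change(periods=7).mul(100)
df['ValueDiff_7d'] = df['Value'].diff(periods=7)
print(df)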

Using Wide_to_Long on 3 Columns

How do I reshape a dataframe using pandas wide_to_long, keeping the first column as the index and stacking the remaining columns (in groups of 3) into a single dataframe?
I have sample dataframe like below:
columns = [timestamp, BQ_0, BP_0, BO_0, BQ_1, BP_1, BO_1, BQ_2, BP_2, BO_2, BQ_3, BP_3, BO_3, BQ_4, BP_4, BO_4]
09:15:00 900 29450.00 2 20 29,436 1 100 29425.15 1 60 29352.05 1 20 29352.00 1
09:15:01 900 29450.00 2 20 29,436 1 100 29425.15 1 60 29352.05 1 20 29352.00 1
09:15:02 20 29412.40 1 20 29,410 1 80 29410.10 1 20 29407.60 1 20 29388.90 1
09:15:03 80 29430.20 1 80 29,430 1 80 29430.05 2 20 29430.00 1 20 29424.75 1
09:15:04 120 29445.80 1 40 29,440 2 40 29440.10 1 40 29440.05 1 20 29439.10 1
I want to melt this DataFrame into groups of [timestamp, BQ_, BP_, BO_] using pandas wide_to_long, where
Q = Quantity, P = Price, O = Orders.
I want my result dataframe a like below:
timestamp, BQ_, BP_, BO_
09:15:00 900 29450.00 2 <= 1st Row
09:15:00 20 29,436 1
09:15:00 100 29425.15 1
09:15:00 60 29352.05 1
09:15:00 20 29352.00 1
09:15:01 900 29450.00 2 <= 2nd Row
09:15:01 20 29,436 1
09:15:01 100 29425.15 1
09:15:01 60 29352.05 1
09:15:01 20 29352.00 1
09:15:02 20 29412.40 1 <= 3rd Row
09:15:02 20 29,410 1
...
Source : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html
pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\d+')
df : DataFrame
The wide-format DataFrame
stubnames : str or list-like
The stub name(s). The wide format variables are assumed to start with the stub names.
i : str or list-like
Column(s) to use as id variable(s)
j : str
The name of the sub-observation variable. What you wish to name your suffix in the long format.
sep : str, default “”
A character indicating the separation of the variable names in the wide format, to be stripped from the names in the long format. For example, if your column names are A-suffix1, A-suffix2, you can strip the hyphen by specifying sep=’-‘
suffix : str, default ‘\d+’
A regular expression capturing the wanted suffixes. ‘\d+’ captures numeric suffixes. Suffixes with no numbers could be specified with the negated character class ‘\D+’. You can also further disambiguate suffixes, for example, if your wide variables are of the form A-one, B-two,.., and you have an unrelated column A-rating, you can ignore the last one by specifying suffix=’(!?one|two)’
Changed in version 0.23.0: When all suffixes are numeric, they are cast to int64/float64.
You can try it like this:
result = pd.wide_to_long(df, stubnames=['BQ_', 'BP_', 'BO_'], i=['timestamp'], j='Number')
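A self-contained sketch of how that call behaves on a trimmed-down version of the data (only two suffixes per stub here); resetting the index and sorting reproduces the row order shown in the expected output:
import pandas as pd

# hypothetical wide frame with two price levels instead of five
df = pd.DataFrame({
    'timestamp': ['09:15:00', '09:15:01'],
    'BQ_0': [900, 900], 'BP_0': [29450.00, 29450.00], 'BO_0': [2, 2],
    'BQ_1': [20, 20], 'BP_1': [29436.00, 29436.00], 'BO_1': [1, 1],
})

result = (pd.wide_to_long(df, stubnames=['BQ_', 'BP_', 'BO_'], i=['timestamp'], j='Number')
          .reset_index()
          .sort_values(['timestamp', 'Number']))
print(result)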

Pandas: Assign multi-index DataFrame with DataFrame by index-level-0

Please suggest a more suitable title for this question.
I have a two-level indexed DF (created via groupby):
clicks yield
country report_date
AD 2016-08-06 1 31
2016-12-01 1 0
AE 2016-10-11 1 0
2016-10-13 2 0
I need to take the data country by country, process it, and put it back:
for country in set(DF.get_level_values(0)):
    DF_country = process(DF.loc[country])
    DF[country] = DF_country
where process adds new rows to DF_country.
The problem is in the last line:
ValueError: Wrong number of items passed 2, placement implies 1
I just modified your code and changed process to add. Based on my understanding, process is a self-defined function, right?
for country in set(DF.index.get_level_values(0)):  # change here
    DF_country = DF.loc[country].add(1)
    DF.loc[country] = DF_country.values  # and here
DF
Out[886]:
clicks yield
country report_date
AD 2016-08-06 2 32
2016-12-01 2 1
AE 2016-10-11 2 1
2016-10-13 3 1
EDIT :
l = []
for country in set(DF.index.get_level_values(0)):
    DF1 = DF.loc[country]
    DF1.loc['2016-01-01'] = [1, 2]  # adding row here
    l.append(DF1)
pd.concat(l, axis=0, keys=set(DF.index.get_level_values(0)))
Out[923]:
clicks yield
report_date
AE 2016-10-11 1 0
2016-10-13 2 0
2016-01-01 1 2
AD 2016-08-06 1 31
2016-12-01 1 0
2016-01-01 1 2
