DateTime Junction Vehicles ID
0 2015-11-01 00:00:00 1 15 20151101001
1 2015-11-01 01:00:00 1 13 20151101011
2 2015-11-01 02:00:00 1 10 20151101021
3 2015-11-01 03:00:00 1 7 20151101031
4 2015-11-01 04:00:00 1 9 20151101041
5 2015-11-01 05:00:00 1 6 20151101051
6 2015-11-01 06:00:00 1 9 20151101061
7 2015-11-01 07:00:00 1 8 20151101071
8 2015-11-01 08:00:00 1 11 20151101081
9 2015-11-01 09:00:00 1 12 20151101091
I want to split the ID column into two separate columns such that the first 4 digits are in one, and the remaining digits are in the second.
Code I've tried:
new_ID = data.apply(lambda x: x.rsplit(4))
But it doesn't work. How can I do this with pandas?
Option 1
The simplest and most direct approach: use the str accessor.
v = df.ID.astype(str)
df['Year'], df['ID'] = v.str[:4], v.str[4:]
df
DateTime Junction Vehicles ID Year
0 2015-11-01 00:00:00 1 15 1101001 2015
1 2015-11-01 01:00:00 1 13 1101011 2015
2 2015-11-01 02:00:00 1 10 1101021 2015
3 2015-11-01 03:00:00 1 7 1101031 2015
4 2015-11-01 04:00:00 1 9 1101041 2015
5 2015-11-01 05:00:00 1 6 1101051 2015
6 2015-11-01 06:00:00 1 9 1101061 2015
7 2015-11-01 07:00:00 1 8 1101071 2015
8 2015-11-01 08:00:00 1 11 1101081 2015
9 2015-11-01 09:00:00 1 12 1101091 2015
Option 2
str.extract
v = df.ID.astype(str).str.extract(r'(?P<Year>\d{4})(?P<ID>.*)', expand=True)
df = pd.concat([df.drop(columns='ID'), v], axis=1)
df
DateTime Junction Vehicles Year ID
0 2015-11-01 00:00:00 1 15 2015 1101001
1 2015-11-01 01:00:00 1 13 2015 1101011
2 2015-11-01 02:00:00 1 10 2015 1101021
3 2015-11-01 03:00:00 1 7 2015 1101031
4 2015-11-01 04:00:00 1 9 2015 1101041
5 2015-11-01 05:00:00 1 6 2015 1101051
6 2015-11-01 06:00:00 1 9 2015 1101061
7 2015-11-01 07:00:00 1 8 2015 1101071
8 2015-11-01 08:00:00 1 11 2015 1101081
9 2015-11-01 09:00:00 1 12 2015 1101091
Here is a numeric solution (assuming that the length of the ID column is constant):
In [10]: df['Year'], df['ID'] = df['ID'] // 10**7, df['ID'] % 10**7
In [11]: df
Out[11]:
DateTime Junction Vehicles ID Year
0 2015-11-01 00:00:00 1 15 1101001 2015
1 2015-11-01 01:00:00 1 13 1101011 2015
2 2015-11-01 02:00:00 1 10 1101021 2015
3 2015-11-01 03:00:00 1 7 1101031 2015
4 2015-11-01 04:00:00 1 9 1101041 2015
5 2015-11-01 05:00:00 1 6 1101051 2015
6 2015-11-01 06:00:00 1 9 1101061 2015
7 2015-11-01 07:00:00 1 8 1101071 2015
8 2015-11-01 08:00:00 1 11 1101081 2015
9 2015-11-01 09:00:00 1 12 1101091 2015
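The same arithmetic split can be written with divmod, which returns the quotient and remainder in one call. A minimal sketch on made-up IDs, under the same assumption that every ID has a fixed length (4-digit year followed by 7 digits):

```python
import pandas as pd

# Hypothetical 11-digit IDs: a 4-digit year followed by 7 more digits
df = pd.DataFrame({'ID': [20151101001, 20151101011]})

# divmod splits each ID into (ID // 10**7, ID % 10**7) in a single call
df['Year'], df['ID'] = divmod(df['ID'], 10**7)
print(df)
```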
df[id_col].map(lambda x: int(str(x)[:4])) # first 4 digits, as an integer
df[id_col].map(lambda x: str(x)[:4]) # first 4 digits, as a string
Here's some made up time series data on 1 minute intervals:
import pandas as pd
import numpy as np
np.random.seed(5)
rows,cols = 8760,3
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='1T')
df = pd.DataFrame(data, columns=['condition1','condition2','condition3'], index=tidx)
This is just some code to create some Boolean columns
df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)
df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)
On my screen this prints:
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0 0 0
2019-01-01 00:01:00 0 0 1 <---- Count as same event!
2019-01-01 00:02:00 0 0 1 <---- Count as same event!
2019-01-01 00:03:00 1 0 0
2019-01-01 00:04:00 0 0 0
What I am trying to figure out is how to roll up cumulative events (True or 1) per hour, but if there is no 0 between events, it's the same event! Hopefully the <---- Count as same event! markers above make clear what I am describing.
If I do:
df = df.resample('H').sum()
This will just resample and count all events, regardless of the consecutive-event rule I was trying to highlight with the <---- Count as same event! markers.
Thanks for any tips!!
Check whether the current row ("2019-01-01 00:02:00") equals 1 and the previous row ("2019-01-01 00:01:00") does not. This excludes consecutive 1s from the sum, so each run of 1s is counted as a single event.
>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 4 8 4
2019-01-01 01:00:00 9 7 6
2019-01-01 02:00:00 7 14 4
2019-01-01 03:00:00 2 8 7
2019-01-01 04:00:00 4 9 5
... ... ... ...
2019-01-06 21:00:00 4 8 2
2019-01-06 22:00:00 3 11 4
2019-01-06 23:00:00 6 11 4
2019-01-07 00:00:00 8 7 8
2019-01-07 01:00:00 4 9 6
[146 rows x 3 columns]
Using your code:
>>> df.resample('H').sum()
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 5 8 5
2019-01-01 01:00:00 9 8 6
2019-01-01 02:00:00 7 14 5
2019-01-01 03:00:00 2 9 7
2019-01-01 04:00:00 4 11 5
... ... ... ...
2019-01-06 21:00:00 5 11 3
2019-01-06 22:00:00 3 15 4
2019-01-06 23:00:00 6 12 4
2019-01-07 00:00:00 8 7 10
2019-01-07 01:00:00 4 9 7
[146 rows x 3 columns]
Check:
dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00',
closed='left', freq='T')
df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
c1
2021-11-15 21:00:00 1
>>> df1.resample('H').sum()
c1
2021-11-15 21:00:00 60
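The core of the trick can be seen on a bare Series (a small sketch with made-up values): an event starts wherever the value is 1 and the previous value is not 1, so each run of consecutive 1s is counted once.

```python
import pandas as pd

# Made-up values: two separate runs of 1s
s = pd.Series([0, 1, 1, 0, 1])

# True only at the first 1 of each run; shift() is NaN at the start,
# and NaN.ne(1) is True, so a run starting at position 0 is still counted
starts = s.eq(1) & s.shift().ne(1)
print(int(starts.sum()))  # prints 2
```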
I have a relatively big dataset that I want to split into multiple dataframes in Python based on a column containing a datetime object. The values in the column (that I want to split the dataframe by) are given in the following format:
2015-11-01 00:00:05
You may assume the dataframe looks like this.
How can I split the dataframe into 5-second intervals in the following way:
1st dataframe 2015-11-01 00:00:00 - 2015-11-01 00:00:05,
2nd dataframe 2015-11-01 00:00:05 - 2015-11-01 00:00:10, and so on.
I also need to count the number of observations in each of the resulting dataframes. In other words, it would be nice if I could get another dataframe with 2 columns (the desired output format can be found below):
1st column represents the split group (the values of this column don't matter: they could simply be 1, 2, 3, ... indicating the order of the 5-second intervals; for example, 1 could refer to the period 2015-11-01 00:00:00 - 2015-11-01 00:00:05, 2 could refer to the period 2015-11-01 00:00:05 - 2015-11-01 00:00:10,
and so on),
2nd column shows the number of observations falling in each respective interval.
Create a dictionary of DataFrames and add new columns with assign:
rng = pd.date_range('2015-11-01 00:00:00', periods=100, freq='S')
df = pd.DataFrame({'Date': rng, 'a': range(100)})
print (df.head(10))
Date a
0 2015-11-01 00:00:00 0
1 2015-11-01 00:00:01 1
2 2015-11-01 00:00:02 2
3 2015-11-01 00:00:03 3
4 2015-11-01 00:00:04 4
5 2015-11-01 00:00:05 5
6 2015-11-01 00:00:06 6
7 2015-11-01 00:00:07 7
8 2015-11-01 00:00:08 8
9 2015-11-01 00:00:09 9
g = df.groupby(pd.Grouper(key='Date', freq='5S'))
dfs = {k.strftime('%Y-%m-%d %H:%M:%S'):v.assign(A=range(1,len(v)+1), B=len(v)) for k,v in g}
print (dfs['2015-11-01 00:00:05'])
Date a A B
5 2015-11-01 00:00:05 5 1 5
6 2015-11-01 00:00:06 6 2 5
7 2015-11-01 00:00:07 7 3 5
8 2015-11-01 00:00:08 8 4 5
9 2015-11-01 00:00:09 9 5 5
If you only need to count the rows, first aggregate with size, and for Interval add 1 to the index:
df1 = df.groupby(pd.Grouper(key='Date', freq='5S')).size().reset_index(name='Count')
df1['Interval'] = df1.index + 1
print (df1.head())
Date Count Interval
0 2015-11-01 00:00:00 5 1
1 2015-11-01 00:00:05 5 2
2 2015-11-01 00:00:10 5 3
3 2015-11-01 00:00:15 5 4
4 2015-11-01 00:00:20 5 5
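If only the counts are needed, resample with the on= keyword gives the same numbers without building the dictionary (a sketch on the same made-up one-row-per-second data):

```python
import pandas as pd

# Same made-up data as above: one row per second for 100 seconds
rng = pd.date_range('2015-11-01 00:00:00', periods=100, freq='S')
df = pd.DataFrame({'Date': rng, 'a': range(100)})

# Count rows per 5-second bin directly on the 'Date' column
counts = df.resample('5S', on='Date').size()
print(counts.head())
```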
Suppose I have a dataframe df1, with columns 'A' and 'B'. A is a column of timestamps (e.g. unixtime) and 'B' is a column of some value.
Suppose I also have a dataframe df2 with columns 'C' and 'D'. C is also a unixtime column and D is a column containing some other values.
I would like to fuzzy merge the dataframes with a join on the timestamp. However, if the timestamps don't match (which they most likely don't), I would like it to merge on the closest entry before the timestamp in 'A' that it can find in 'C'.
pd.merge does not support this, and I find myself converting away from dataframes using to_dict(), and using some iteration to solve this. Is there a way in pandas to solve this?
numpy.searchsorted() finds the appropriate index positions to merge on (see the docs) - hope the below gets you closer to what you're looking for:
from datetime import datetime, timedelta
from random import randrange

import numpy as np
import pandas as pd

start = datetime(2015, 12, 1)
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))
A B C D
0 2015-12-01 00:01:00 1 NaT NaN
1 2015-12-01 00:02:00 1 2015-12-01 00:02:00 2
2 2015-12-01 00:02:00 1 NaT NaN
3 2015-12-01 00:12:00 1 2015-12-01 00:05:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
5 2015-12-01 00:28:00 1 2015-12-01 00:22:00 2
6 2015-12-01 00:30:00 1 NaT NaN
7 2015-12-01 00:39:00 1 2015-12-01 00:31:00 2
7 2015-12-01 00:39:00 1 2015-12-01 00:39:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:40:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:46:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:54:00 2
9 2015-12-01 00:57:00 1 NaT NaN
Building on @Stephan's answer and @JohnE's comment, something similar can be done with pandas.merge_asof for pandas>=0.19.0:
>>> import numpy as np
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> start = datetime(2015, 12, 1)
>>> a_timestamps = pd.date_range(start, start + timedelta(hours=4.5), freq='30Min')
>>> c_timestamps = pd.date_range(start, start + timedelta(hours=9), freq='H')
>>> df1 = pd.DataFrame({'A': a_timestamps, 'B': range(10)})
A B
0 2015-12-01 00:00:00 0
1 2015-12-01 00:30:00 1
2 2015-12-01 01:00:00 2
3 2015-12-01 01:30:00 3
4 2015-12-01 02:00:00 4
5 2015-12-01 02:30:00 5
6 2015-12-01 03:00:00 6
7 2015-12-01 03:30:00 7
8 2015-12-01 04:00:00 8
9 2015-12-01 04:30:00 9
>>> df2 = pd.DataFrame({'C': c_timestamps, 'D': range(10, 20)})
C D
0 2015-12-01 00:00:00 10
1 2015-12-01 01:00:00 11
2 2015-12-01 02:00:00 12
3 2015-12-01 03:00:00 13
4 2015-12-01 04:00:00 14
5 2015-12-01 05:00:00 15
6 2015-12-01 06:00:00 16
7 2015-12-01 07:00:00 17
8 2015-12-01 08:00:00 18
9 2015-12-01 09:00:00 19
>>> pd.merge_asof(left=df1, right=df2, left_on='A', right_on='C')
A B C D
0 2015-12-01 00:00:00 0 2015-12-01 00:00:00 10
1 2015-12-01 00:30:00 1 2015-12-01 00:00:00 10
2 2015-12-01 01:00:00 2 2015-12-01 01:00:00 11
3 2015-12-01 01:30:00 3 2015-12-01 01:00:00 11
4 2015-12-01 02:00:00 4 2015-12-01 02:00:00 12
5 2015-12-01 02:30:00 5 2015-12-01 02:00:00 12
6 2015-12-01 03:00:00 6 2015-12-01 03:00:00 13
7 2015-12-01 03:30:00 7 2015-12-01 03:00:00 13
8 2015-12-01 04:00:00 8 2015-12-01 04:00:00 14
9 2015-12-01 04:30:00 9 2015-12-01 04:00:00 14
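merge_asof also accepts a tolerance, which is handy when a match that is too far in the past should stay unmatched rather than pairing with a stale row. A small sketch with made-up timestamps:

```python
import pandas as pd

left = pd.DataFrame({'A': pd.to_datetime(['2015-12-01 00:00', '2015-12-01 02:30']),
                     'B': [1, 2]})
right = pd.DataFrame({'C': pd.to_datetime(['2015-12-01 00:00', '2015-12-01 01:00']),
                      'D': [10, 11]})

# Only match if the closest earlier 'C' is within 30 minutes of 'A';
# the second row's nearest candidate is 1.5 hours old, so it gets NaN/NaT
merged = pd.merge_asof(left, right, left_on='A', right_on='C',
                       tolerance=pd.Timedelta('30min'))
print(merged)
```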
I have a row of data (in pandas), that has a time of day:
0 8:00 AM
1 11:00 AM
2 8:00 AM
3 4:00 PM
4 9:00 AM
5
6 9:00 AM
7
8 9:00 AM
9
10 9:00 AM
11
12 9:00 AM
13
14 8:00 AM
15 11:00 AM
16 8:00 AM
17 11:00 AM
18 9:00 AM
19
20 9:00 AM
21
22 9:00 AM
23
24 9:00 AM
25
26 9:00 AM
27
28 9:00 AM
I would like to convert this to something similar to this:
0 2015-11-11 08:00:00
1 2015-11-11 11:00:00
2 2015-11-11 08:00:00
3 2015-11-11 16:00:00
4 2015-11-11 09:00:00
5 NaT
6 2015-11-11 09:00:00
7 NaT
8 2015-11-11 09:00:00
9 NaT
10 2015-11-11 09:00:00
11 NaT
12 2015-11-11 09:00:00
13 NaT
14 2015-11-11 08:00:00
15 2015-11-11 11:00:00
16 2015-11-11 08:00:00
17 2015-11-11 11:00:00
18 2015-11-11 09:00:00
19 NaT
20 2015-11-11 09:00:00
21 NaT
22 2015-11-11 09:00:00
23 NaT
24 2015-11-11 09:00:00
25 NaT
26 2015-11-11 09:00:00
27 NaT
28 2015-11-11 09:00:00
29 NaT
But without the date added to it. I am then trying to merge my pandas columns into a single column so I can iterate through it. I have tried combining them with astype(str), with no success using pd.merge.
Any ideas on how to use the to_datetime function in pandas while just keeping it as UTC time?
Considering the following input Data:
data = ['8:00 AM',
'11:00 AM',
'8:00 AM',
'4:00 PM',
'9:00 AM',
'',
'9:00 AM',
'',
'9:00 AM']
Code:
import pandas as pd
x = pd.to_datetime(data).time
pd.Series(x)
Output:
0 08:00:00
1 11:00:00
2 08:00:00
3 16:00:00
4 09:00:00
5 NaN
6 09:00:00
7 NaN
8 09:00:00
dtype: object
If you have other data in another series you would like to join into the same dataframe:
x = pd.Series(x)
y = pd.Series(range(9))
pd.concat([x, y], axis=1)
0 1
0 08:00:00 0
1 11:00:00 1
2 08:00:00 2
Finally, if you prefer the columns merged as strings, try this:
z = pd.concat([x, y], axis=1)
z[0].astype(str) + ' foo ' + z[1].astype(str)
0 08:00:00 foo 0
1 11:00:00 foo 1
2 08:00:00 foo 2
3 16:00:00 foo 3
4 09:00:00 foo 4
5 nan foo 5
6 09:00:00 foo 6
7 nan foo 7
8 09:00:00 foo 8
dtype: object
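An explicit format string plus errors='coerce' is a stricter variant of the same idea: entries that don't parse (like the empty strings) become NaT instead of raising, and .dt.time drops the date part. A sketch under the assumption that all valid entries look like '8:00 AM':

```python
import pandas as pd

data = ['8:00 AM', '11:00 AM', '', '4:00 PM']

# %I = 12-hour clock, %M = minutes, %p = AM/PM; '' cannot be parsed,
# so errors='coerce' turns it into NaT rather than raising
parsed = pd.to_datetime(pd.Series(data), format='%I:%M %p', errors='coerce')
print(parsed.dt.time)
```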
I'm having trouble simply sorting a pandas dataframe, first by a column containing a string and then by a datetime column. When I do so, the dates come back out of order. What am I doing wrong?
df looks like
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
5 2013-07-05 00:00:00 1
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
When the dataframe was created, Date was an object, and converted to datetime using:
df['Date'] = df['Date'].apply(dateutil.parser.parse)
now the dtypes are:
Date datetime64[ns]
Field 1 int64
dtype: object
when running a
df.sort_index(by=['Field 1', 'Date'])
or
df.sort(['Field 1','Date'])
I get back:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
10 2013-07-15 00:00:00 1
5 2013-07-05 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
what I really want back is:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
5 2013-07-05 00:00:00 1
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
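For what it's worth, once 'Date' really has dtype datetime64, a multi-column sort with the modern sort_values API produces the desired order (a sketch on a small made-up frame; sort_index(by=...) and sort() were removed in later pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2013-07-15', '2013-07-03', '2013-07-05', '2013-07-03']),
    'Field 1': [1, 1, 1, 2],
})

# Sort by the grouping column first, then chronologically within each group
out = df.sort_values(['Field 1', 'Date'])
print(out)
```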