DateTime Junction Vehicles ID
0 2015-11-01 00:00:00 1 15 20151101001
1 2015-11-01 01:00:00 1 13 20151101011
2 2015-11-01 02:00:00 1 10 20151101021
3 2015-11-01 03:00:00 1 7 20151101031
4 2015-11-01 04:00:00 1 9 20151101041
5 2015-11-01 05:00:00 1 6 20151101051
6 2015-11-01 06:00:00 1 9 20151101061
7 2015-11-01 07:00:00 1 8 20151101071
8 2015-11-01 08:00:00 1 11 20151101081
9 2015-11-01 09:00:00 1 12 20151101091
I want to split the ID column into two separate columns such that the first 4 digits are in one, and the remaining digits are in the second.
Code I've tried:
new_ID = data.apply(lambda x: x.rsplit(4))
But it doesn't work. How can I do this with pandas?
Option 1
The simplest and most direct approach: use the str accessor.
v = df.ID.astype(str)
df['Year'], df['ID'] = v.str[:4], v.str[4:]
df
DateTime Junction Vehicles ID Year
0 2015-11-01 00:00:00 1 15 1101001 2015
1 2015-11-01 01:00:00 1 13 1101011 2015
2 2015-11-01 02:00:00 1 10 1101021 2015
3 2015-11-01 03:00:00 1 7 1101031 2015
4 2015-11-01 04:00:00 1 9 1101041 2015
5 2015-11-01 05:00:00 1 6 1101051 2015
6 2015-11-01 06:00:00 1 9 1101061 2015
7 2015-11-01 07:00:00 1 8 1101071 2015
8 2015-11-01 08:00:00 1 11 1101081 2015
9 2015-11-01 09:00:00 1 12 1101091 2015
Option 2
str.extract
v = df.ID.astype(str).str.extract(r'(?P<Year>\d{4})(?P<ID>.*)', expand=True)
df = pd.concat([df.drop(columns='ID'), v], axis=1)
df
DateTime Junction Vehicles Year ID
0 2015-11-01 00:00:00 1 15 2015 1101001
1 2015-11-01 01:00:00 1 13 2015 1101011
2 2015-11-01 02:00:00 1 10 2015 1101021
3 2015-11-01 03:00:00 1 7 2015 1101031
4 2015-11-01 04:00:00 1 9 2015 1101041
5 2015-11-01 05:00:00 1 6 2015 1101051
6 2015-11-01 06:00:00 1 9 2015 1101061
7 2015-11-01 07:00:00 1 8 2015 1101071
8 2015-11-01 08:00:00 1 11 2015 1101081
9 2015-11-01 09:00:00 1 12 2015 1101091
Here is a numeric solution (assuming that the length of the ID column is constant):
In [10]: df['Year'], df['ID'] = df['ID'] // 10**7, df['ID'] % 10**7
In [11]: df
Out[11]:
DateTime Junction Vehicles ID Year
0 2015-11-01 00:00:00 1 15 1101001 2015
1 2015-11-01 01:00:00 1 13 1101011 2015
2 2015-11-01 02:00:00 1 10 1101021 2015
3 2015-11-01 03:00:00 1 7 1101031 2015
4 2015-11-01 04:00:00 1 9 1101041 2015
5 2015-11-01 05:00:00 1 6 1101051 2015
6 2015-11-01 06:00:00 1 9 1101061 2015
7 2015-11-01 07:00:00 1 8 1101071 2015
8 2015-11-01 08:00:00 1 11 1101081 2015
9 2015-11-01 09:00:00 1 12 1101091 2015
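The same arithmetic split can be written with divmod, which returns the quotient and remainder in one call. A minimal sketch on made-up IDs, under the same assumption that every ID has a fixed length (4-digit year followed by 7 digits):

```python
import pandas as pd

# Hypothetical 11-digit IDs: a 4-digit year followed by 7 more digits
df = pd.DataFrame({'ID': [20151101001, 20151101011]})

# divmod splits each ID into (ID // 10**7, ID % 10**7) in a single call
df['Year'], df['ID'] = divmod(df['ID'], 10**7)
print(df)
```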
df[id_col].map(lambda x: int(str(x)[:4])) # first 4 digits, as an integer
df[id_col].map(lambda x: str(x)[:4]) # first 4 digits, as a string
Here's some made up time series data on 1 minute intervals:
import pandas as pd
import numpy as np
np.random.seed(5)
rows,cols = 8760,3
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='1T')
df = pd.DataFrame(data, columns=['condition1','condition2','condition3'], index=tidx)
This is just some code to create some Boolean columns
df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)
df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)
On my screen this prints:
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0 0 0
2019-01-01 00:01:00 0 0 1 <---- Count as same event!
2019-01-01 00:02:00 0 0 1 <---- Count as same event!
2019-01-01 00:03:00 1 0 0
2019-01-01 00:04:00 0 0 0
What I am trying to figure out is how to roll up cumulative events (True or 1) per hour, but if there is no 0 between events, it's the same event! Hopefully the <---- Count as same event! markers above make clear what I am describing.
If I do:
df = df.resample('H').sum()
This will just resample and count all events, regardless of the consecutive-event rule I was trying to highlight with the <---- Count as same event! markers.
Thanks for any tips!!
Check whether the current row ("2019-01-01 00:02:00") equals 1 and the previous row ("2019-01-01 00:01:00") does not. This excludes consecutive 1s from the sum, so each run of 1s is counted as a single event.
>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 4 8 4
2019-01-01 01:00:00 9 7 6
2019-01-01 02:00:00 7 14 4
2019-01-01 03:00:00 2 8 7
2019-01-01 04:00:00 4 9 5
... ... ... ...
2019-01-06 21:00:00 4 8 2
2019-01-06 22:00:00 3 11 4
2019-01-06 23:00:00 6 11 4
2019-01-07 00:00:00 8 7 8
2019-01-07 01:00:00 4 9 6
[146 rows x 3 columns]
Using your code:
>>> df.resample('H').sum()
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 5 8 5
2019-01-01 01:00:00 9 8 6
2019-01-01 02:00:00 7 14 5
2019-01-01 03:00:00 2 9 7
2019-01-01 04:00:00 4 11 5
... ... ... ...
2019-01-06 21:00:00 5 11 3
2019-01-06 22:00:00 3 15 4
2019-01-06 23:00:00 6 12 4
2019-01-07 00:00:00 8 7 10
2019-01-07 01:00:00 4 9 7
[146 rows x 3 columns]
Check:
dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00',
closed='left', freq='T')
df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
c1
2021-11-15 21:00:00 1
>>> df1.resample('H').sum()
c1
2021-11-15 21:00:00 60
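The core of the trick can be seen on a bare Series (a small sketch with made-up values): an event starts wherever the value is 1 and the previous value is not 1, so each run of consecutive 1s is counted once.

```python
import pandas as pd

# Made-up values: two separate runs of 1s
s = pd.Series([0, 1, 1, 0, 1])

# True only at the first 1 of each run; shift() is NaN at the start,
# and NaN.ne(1) is True, so a run starting at position 0 is still counted
starts = s.eq(1) & s.shift().ne(1)
print(int(starts.sum()))  # prints 2
```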
I have a relatively big dataset that I want to split into multiple dataframes in Python based on a column containing a datetime object. The values in the column (that I want to split the dataframe by) are given in the following format:
2015-11-01 00:00:05
You may assume the dataframe looks like this.
How can I split the dataframe into 5-second intervals in the following way:
1st dataframe 2015-11-01 00:00:00 - 2015-11-01 00:00:05,
2nd dataframe 2015-11-01 00:00:05 - 2015-11-01 00:00:10, and so on.
I also need to count the number of observations in each of the resulting dataframes. In other words, it would be nice if I could get another dataframe with 2 columns (the desired output format can be found below):
1st column represents the split group (the values of this column don't matter: they could simply be 1, 2, 3, ... indicating the order of the 5-second intervals; for example, 1 could refer to the period 2015-11-01 00:00:00 - 2015-11-01 00:00:05, 2 could refer to the period 2015-11-01 00:00:05 - 2015-11-01 00:00:10,
and so on),
2nd column shows the number of observations falling in each respective interval.
Create a dictionary of DataFrames and add new columns with assign:
rng = pd.date_range('2015-11-01 00:00:00', periods=100, freq='S')
df = pd.DataFrame({'Date': rng, 'a': range(100)})
print (df.head(10))
Date a
0 2015-11-01 00:00:00 0
1 2015-11-01 00:00:01 1
2 2015-11-01 00:00:02 2
3 2015-11-01 00:00:03 3
4 2015-11-01 00:00:04 4
5 2015-11-01 00:00:05 5
6 2015-11-01 00:00:06 6
7 2015-11-01 00:00:07 7
8 2015-11-01 00:00:08 8
9 2015-11-01 00:00:09 9
g = df.groupby(pd.Grouper(key='Date', freq='5S'))
dfs = {k.strftime('%Y-%m-%d %H:%M:%S'):v.assign(A=range(1,len(v)+1), B=len(v)) for k,v in g}
print (dfs['2015-11-01 00:00:05'])
Date a A B
5 2015-11-01 00:00:05 5 1 5
6 2015-11-01 00:00:06 6 2 5
7 2015-11-01 00:00:07 7 3 5
8 2015-11-01 00:00:08 8 4 5
9 2015-11-01 00:00:09 9 5 5
If you only need to count the rows, first aggregate with size, and for Interval add 1 to the index:
df1 = df.groupby(pd.Grouper(key='Date', freq='5S')).size().reset_index(name='Count')
df1['Interval'] = df1.index + 1
print (df1.head())
Date Count Interval
0 2015-11-01 00:00:00 5 1
1 2015-11-01 00:00:05 5 2
2 2015-11-01 00:00:10 5 3
3 2015-11-01 00:00:15 5 4
4 2015-11-01 00:00:20 5 5
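If only the counts are needed, resample with the on= keyword gives the same numbers without building the dictionary (a sketch on the same made-up one-row-per-second data):

```python
import pandas as pd

# Same made-up data as above: one row per second for 100 seconds
rng = pd.date_range('2015-11-01 00:00:00', periods=100, freq='S')
df = pd.DataFrame({'Date': rng, 'a': range(100)})

# Count rows per 5-second bin directly on the 'Date' column
counts = df.resample('5S', on='Date').size()
print(counts.head())
```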
Suppose I have a dataframe df1, with columns 'A' and 'B'. A is a column of timestamps (e.g. unixtime) and 'B' is a column of some value.
Suppose I also have a dataframe df2 with columns 'C' and 'D'. C is also a unixtime column and D is a column containing some other values.
I would like to fuzzy merge the dataframes with a join on the timestamp. However, if the timestamps don't match (which they most likely don't), I would like it to merge on the closest entry before the timestamp in 'A' that it can find in 'C'.
pd.merge does not support this, and I find myself converting away from dataframes using to_dict(), and using some iteration to solve this. Is there a way in pandas to solve this?
numpy.searchsorted() finds the appropriate index positions to merge on (see the docs) - hope the below gets you closer to what you're looking for:
from datetime import datetime, timedelta
from random import randrange

import numpy as np
import pandas as pd

start = datetime(2015, 12, 1)
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))
A B C D
0 2015-12-01 00:01:00 1 NaT NaN
1 2015-12-01 00:02:00 1 2015-12-01 00:02:00 2
2 2015-12-01 00:02:00 1 NaT NaN
3 2015-12-01 00:12:00 1 2015-12-01 00:05:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
5 2015-12-01 00:28:00 1 2015-12-01 00:22:00 2
6 2015-12-01 00:30:00 1 NaT NaN
7 2015-12-01 00:39:00 1 2015-12-01 00:31:00 2
7 2015-12-01 00:39:00 1 2015-12-01 00:39:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:40:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:46:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:54:00 2
9 2015-12-01 00:57:00 1 NaT NaN
Building on @Stephan's answer and @JohnE's comment, something similar can be done with pandas.merge_asof for pandas>=0.19.0:
>>> import numpy as np
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> start = datetime(2015, 12, 1)
>>> a_timestamps = pd.date_range(start, start + timedelta(hours=4.5), freq='30Min')
>>> c_timestamps = pd.date_range(start, start + timedelta(hours=9), freq='H')
>>> df1 = pd.DataFrame({'A': a_timestamps, 'B': range(10)})
A B
0 2015-12-01 00:00:00 0
1 2015-12-01 00:30:00 1
2 2015-12-01 01:00:00 2
3 2015-12-01 01:30:00 3
4 2015-12-01 02:00:00 4
5 2015-12-01 02:30:00 5
6 2015-12-01 03:00:00 6
7 2015-12-01 03:30:00 7
8 2015-12-01 04:00:00 8
9 2015-12-01 04:30:00 9
>>> df2 = pd.DataFrame({'C': c_timestamps, 'D': range(10, 20)})
C D
0 2015-12-01 00:00:00 10
1 2015-12-01 01:00:00 11
2 2015-12-01 02:00:00 12
3 2015-12-01 03:00:00 13
4 2015-12-01 04:00:00 14
5 2015-12-01 05:00:00 15
6 2015-12-01 06:00:00 16
7 2015-12-01 07:00:00 17
8 2015-12-01 08:00:00 18
9 2015-12-01 09:00:00 19
>>> pd.merge_asof(left=df1, right=df2, left_on='A', right_on='C')
A B C D
0 2015-12-01 00:00:00 0 2015-12-01 00:00:00 10
1 2015-12-01 00:30:00 1 2015-12-01 00:00:00 10
2 2015-12-01 01:00:00 2 2015-12-01 01:00:00 11
3 2015-12-01 01:30:00 3 2015-12-01 01:00:00 11
4 2015-12-01 02:00:00 4 2015-12-01 02:00:00 12
5 2015-12-01 02:30:00 5 2015-12-01 02:00:00 12
6 2015-12-01 03:00:00 6 2015-12-01 03:00:00 13
7 2015-12-01 03:30:00 7 2015-12-01 03:00:00 13
8 2015-12-01 04:00:00 8 2015-12-01 04:00:00 14
9 2015-12-01 04:30:00 9 2015-12-01 04:00:00 14
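merge_asof also accepts a tolerance, which is handy when a match that is too far in the past should stay unmatched rather than pairing with a stale row. A small sketch with made-up timestamps:

```python
import pandas as pd

left = pd.DataFrame({'A': pd.to_datetime(['2015-12-01 00:00', '2015-12-01 02:30']),
                     'B': [1, 2]})
right = pd.DataFrame({'C': pd.to_datetime(['2015-12-01 00:00', '2015-12-01 01:00']),
                      'D': [10, 11]})

# Only match if the closest earlier 'C' is within 30 minutes of 'A';
# the second row's nearest candidate is 1.5 hours old, so it gets NaN/NaT
merged = pd.merge_asof(left, right, left_on='A', right_on='C',
                       tolerance=pd.Timedelta('30min'))
print(merged)
```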
I have a row of data (in pandas), that has a time of day:
0 8:00 AM
1 11:00 AM
2 8:00 AM
3 4:00 PM
4 9:00 AM
5
6 9:00 AM
7
8 9:00 AM
9
10 9:00 AM
11
12 9:00 AM
13
14 8:00 AM
15 11:00 AM
16 8:00 AM
17 11:00 AM
18 9:00 AM
19
20 9:00 AM
21
22 9:00 AM
23
24 9:00 AM
25
26 9:00 AM
27
28 9:00 AM
I would like to convert this to something similar to this:
0 2015-11-11 08:00:00
1 2015-11-11 11:00:00
2 2015-11-11 08:00:00
3 2015-11-11 16:00:00
4 2015-11-11 09:00:00
5 NaT
6 2015-11-11 09:00:00
7 NaT
8 2015-11-11 09:00:00
9 NaT
10 2015-11-11 09:00:00
11 NaT
12 2015-11-11 09:00:00
13 NaT
14 2015-11-11 08:00:00
15 2015-11-11 11:00:00
16 2015-11-11 08:00:00
17 2015-11-11 11:00:00
18 2015-11-11 09:00:00
19 NaT
20 2015-11-11 09:00:00
21 NaT
22 2015-11-11 09:00:00
23 NaT
24 2015-11-11 09:00:00
25 NaT
26 2015-11-11 09:00:00
27 NaT
28 2015-11-11 09:00:00
29 NaT
But without the date added to it. I am then trying to merge my pandas columns into a single column so I can iterate through it. I have tried combining them with astype(str), with no success using pd.merge.
Any ideas on how to use the to_datetime function in pandas while just keeping it as UTC time?
Considering the following input Data:
data = ['8:00 AM',
'11:00 AM',
'8:00 AM',
'4:00 PM',
'9:00 AM',
'',
'9:00 AM',
'',
'9:00 AM']
Code:
import pandas as pd
x = pd.to_datetime(data).time
pd.Series(x)
Output:
0 08:00:00
1 11:00:00
2 08:00:00
3 16:00:00
4 09:00:00
5 NaN
6 09:00:00
7 NaN
8 09:00:00
dtype: object
If you have other data in another series you would like to join into the same dataframe:
x = pd.Series(x)
y = pd.Series(range(9))
pd.concat([x, y], axis=1)
0 1
0 08:00:00 0
1 11:00:00 1
2 08:00:00 2
Finally, if you prefer the columns merged as strings, try this:
z = pd.concat([x, y], axis=1)
z[0].astype(str) + ' foo ' + z[1].astype(str)
0 08:00:00 foo 0
1 11:00:00 foo 1
2 08:00:00 foo 2
3 16:00:00 foo 3
4 09:00:00 foo 4
5 nan foo 5
6 09:00:00 foo 6
7 nan foo 7
8 09:00:00 foo 8
dtype: object
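An explicit format string plus errors='coerce' is a stricter variant of the same idea: entries that don't parse (like the empty strings) become NaT instead of raising, and .dt.time drops the date part. A sketch under the assumption that all valid entries look like '8:00 AM':

```python
import pandas as pd

data = ['8:00 AM', '11:00 AM', '', '4:00 PM']

# %I = 12-hour clock, %M = minutes, %p = AM/PM; '' cannot be parsed,
# so errors='coerce' turns it into NaT rather than raising
parsed = pd.to_datetime(pd.Series(data), format='%I:%M %p', errors='coerce')
print(parsed.dt.time)
```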
I'm having trouble simply sorting a pandas dataframe, first by a column containing a string and then by a datetime column. When I do so, the dates come back out of order. What am I doing wrong?
df looks like
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
5 2013-07-05 00:00:00 1
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
When the dataframe was created, Date was an object, and converted to datetime using:
df['Date'] = df['Date'].apply(dateutil.parser.parse)
now the dtypes are:
Date datetime64[ns]
Field 1 int64
dtype: object
when running a
df.sort_index(by=['Field 1', 'Date'])
or
df.sort(['Field 1','Date'])
I get back:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
10 2013-07-15 00:00:00 1
5 2013-07-05 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
what I really want back is:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
5 2013-07-05 00:00:00 1
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
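For what it's worth, once 'Date' really has dtype datetime64, a multi-column sort with the modern sort_values API produces the desired order (a sketch on a small made-up frame; sort_index(by=...) and sort() were removed in later pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2013-07-15', '2013-07-03', '2013-07-05', '2013-07-03']),
    'Field 1': [1, 1, 1, 2],
})

# Sort by the grouping column first, then chronologically within each group
out = df.sort_values(['Field 1', 'Date'])
print(out)
```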