Loop through rows of dataframe using re.compile().split() - python

I have a dataframe that consists of 1 column and several rows. Each of these rows is constructed in the same way: -timestamp- value1 value2 value3 -timestamp- value4 value5 value6 ...
The timestamps have this format: YYYY-MM-DD HH:MM:SS and the values are numbers with 2 decimals.
I would like to make a new dataframe that has the individual timestamps in one row and the related values in the next row.
I managed to get the expected result line by line with regex, but not for the entire dataframe.
My code so far:
#input dataframe
data.head()
values
0 2020-05-12 10:00:00 12.07 13 11.56 ... 2020-05-12 10:00:01 11.49 17 5.67...
1 2020-05-12 10:01:00 11.49 17 5.67 ... 2020-05-12 10:01:01 12.07 13 11.56...
2 2020-05-12 10:02:00 14.29 18 11.28 ... 2020-05-12 10:02:01 13.77 18 7.43...
import re
import pandas as pd

test = data['values'].iloc[0]  # first row of data
row1 = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(test)
df_row1 = pd.DataFrame(row1)
df_row1.head()
values
0 2020-05-12 10:00:00
1 12.07 13.79 15.45 17.17 18.91 14.91 12.35 14....
2 2020-05-12 10:00:01
3 12.48 13.96 13.88 15.57 18.46 15.0 13.65 14.6...
#trying the same for the entire dataframe
for row in data:
    df_new = re.compile(r"(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(row)
    print(df_new)
['values']
My question now is how can I loop through the rows of my dataframe and get the expected result?

In case you want to first split the lines and then extract the values into columns, be aware that you can use str.extract. With named groups in your regular expression, it will automatically assign the column names of your dataframe:
split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
extract_values = r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<time>\d{2}:\d{2}:\d{2})\s(?P<value_one>.*?)\s(?P<value_two>.*?)\s(?P<value_three>.*?)$"
df = pd.DataFrame([{
"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67",
},{
"value": "2020-05-13 10:00:00 14.07 13 15.56 2020-05-16 10:00:02 11.51 18 5.69",
}])
df = df["value"].str.split(split_line).explode().str.extract(extract_values, expand=True)
print(df)
# date time value_one value_two value_three
# 0 2020-05-12 10:00:00 12.07 13 11.56
# 0 2020-06-12 11:00:00 13.07 16 11.16
# 0 2020-05-12 10:00:01 11.49 17 5.67
# 1 2020-05-13 10:00:00 14.07 13 15.56
# 1 2020-05-16 10:00:02 11.51 18 5.69
In case you do not know the number of groups after the date and time, use a plain split for the values rather than a regular expression. I would suggest something like this:
split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
df = pd.DataFrame([{
"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67",
},{
"value": "2020-05-13 10:00:00 14.07 13 14 15 15.56 2020-05-16 10:00:02 11.51 18 5.69",
}])
df = df["value"].str.split(split_line).explode().reset_index()
df = df['value'].str.split(" ").apply(pd.Series)
df.columns = [f"col_{col}" for col in df.columns]
print(df)
# col_0 col_1 col_2 col_3 col_4 col_5 col_6
# 0 2020-05-12 10:00:00 12.07 13 11.56 NaN NaN
# 1 2020-06-12 11:00:00 13.07 16 11.16 NaN NaN
# 2 2020-05-12 10:00:01 11.49 17 5.67 NaN NaN
# 3 2020-05-13 10:00:00 14.07 13 14 15 15.56
# 4 2020-05-16 10:00:02 11.51 18 5.69 NaN NaN
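As a side note, str.split(expand=True) builds the columns directly and is usually faster than .apply(pd.Series); an equivalent sketch for the last two lines would be:
# expand=True returns a DataFrame straight from the split
df = df['value'].str.split(" ", expand=True)
df.columns = [f"col_{col}" for col in df.columns]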

You don't need to loop through the rows to get the result. Instead, you can use Series.str.split to split the given series around a delimiter, which in this case is a regular expression. Then you can use DataFrame.explode to transform each element of a list-like into separate rows.
Use:
data["values"] = data["values"].str.split(r'\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})')
data = data.explode("values")
data["values"] = data["values"].str.split(r'(?<=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s+')
data = data.explode("values").reset_index(drop=True)
print(data)
The resulting dataframe data should look like this:
values
0 2020-05-12 10:00:00
1 12.07 13 11.56
2 2020-05-12 10:00:01
3 11.49 17 5.67
4 2020-05-12 10:01:00
5 11.49 17 5.67
6 2020-05-12 10:01:01
7 12.07 13 11.56
8 2020-05-12 10:02:00
9 14.29 18 11.28
10 2020-05-12 10:02:01
11 13.77 18 7.43
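If you later want each timestamp sitting next to its values instead of on alternating rows, a minimal sketch (assuming the strict timestamp/values alternation shown above) is:
# every even row is a timestamp, every odd row holds the matching values
paired = pd.DataFrame({
    "timestamp": data["values"].iloc[::2].to_numpy(),
    "values": data["values"].iloc[1::2].to_numpy(),
})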

Related

Find Yesterday's High Price by Merging Two DF's on Datetime and Date Columns

I'm trying to merge two df's: one has a datetime column, and the other has just a date column. My application for this is to find yesterday's high price using an OHLC dataset. I've attached some starter code below, but I'll describe what I'm looking for.
Given this intraday dataset:
time current_intraday_high
0 2022-02-11 09:00:00 1
1 2022-02-11 10:00:00 2
2 2022-02-11 11:00:00 3
3 2022-02-11 12:00:00 4
4 2022-02-11 13:00:00 5
5 2022-02-14 09:00:00 6
6 2022-02-14 10:00:00 7
7 2022-02-14 11:00:00 8
8 2022-02-14 12:00:00 9
9 2022-02-14 13:00:00 10
10 2022-02-15 09:00:00 11
11 2022-02-15 10:00:00 12
12 2022-02-15 11:00:00 13
13 2022-02-15 12:00:00 14
14 2022-02-15 13:00:00 15
15 2022-02-16 09:00:00 16
16 2022-02-16 10:00:00 17
17 2022-02-16 11:00:00 18
18 2022-02-16 12:00:00 19
19 2022-02-16 13:00:00 20
...and this daily dataframe:
time daily_high
0 2022-02-11 5
1 2022-02-14 10
2 2022-02-15 15
3 2022-02-16 20
...how can I merge them together, and have each row of the intraday dataframe contain the previous (business) day's high price, like so:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
(Note the NaN's at the top because we don't have any data for Feb 10, 2022 from the intraday dataset, and see how each row contains the intraday data, plus the PREVIOUS day's max "high" price.)
Minimal reproducible example code below:
import pandas as pd
###################################################
# CREATE MOCK INTRADAY DATAFRAME
###################################################
intraday_date_time = [
    "2022-02-11 09:00:00",
    "2022-02-11 10:00:00",
    "2022-02-11 11:00:00",
    "2022-02-11 12:00:00",
    "2022-02-11 13:00:00",
    "2022-02-14 09:00:00",
    "2022-02-14 10:00:00",
    "2022-02-14 11:00:00",
    "2022-02-14 12:00:00",
    "2022-02-14 13:00:00",
    "2022-02-15 09:00:00",
    "2022-02-15 10:00:00",
    "2022-02-15 11:00:00",
    "2022-02-15 12:00:00",
    "2022-02-15 13:00:00",
    "2022-02-16 09:00:00",
    "2022-02-16 10:00:00",
    "2022-02-16 11:00:00",
    "2022-02-16 12:00:00",
    "2022-02-16 13:00:00",
]
intraday_date_time = pd.to_datetime(intraday_date_time)
intraday_df = pd.DataFrame(
    {
        "time": intraday_date_time,
        "current_intraday_high": [x for x in range(1, 21)],
    },
)
print(intraday_df)
# intraday_df.to_csv('intradayTEST.csv', index=True)
###################################################
# AGGREGATE/UPSAMPLE TO DAILY DATAFRAME
###################################################
# Aggregate to business days using intraday_df
agg_dict = {'current_intraday_high': 'max'}
daily_df = intraday_df.set_index('time').resample('B').agg(agg_dict).reset_index()
daily_df.rename(columns={"current_intraday_high": "daily_high"}, inplace=True)
print(daily_df)
# daily_df.to_csv('dailyTEST.csv', index=True)
###################################################
# MERGE THE TWO DATAFRAMES
###################################################
# Need to merge the daily dataset to the intraday dataset, such that,
# any row on the newly merged/joined/concat'd dataset will have:
# 1. The current intraday datetime in the 'time' column
# 2. The current 'intraday_high' value
# 3. The PREVIOUS DAY's 'daily_high' value
# This doesn't work as the daily_df just gets appended to the bottom
# of the intraday_df due to the datetimes/dates merging
merged_df = pd.merge(intraday_df, daily_df, how='outer', on='time')
print(merged_df)
pd.merge_asof allows you to easily do a merge like this.
yesterdays_high = (intraday_df.resample('B', on='time')['current_intraday_high'].max()
                   .shift()
                   .rename('yesterdays_high')
                   .reset_index())
merged_df = pd.merge_asof(intraday_df, yesterdays_high)
print(merged_df)
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
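The call above relies on pandas picking up 'time' as the common key; spelling it out does no harm, and is a reminder that merge_asof expects both inputs to be sorted on that key:
# equivalent, with the asof key named explicitly (both frames are already sorted on 'time')
merged_df = pd.merge_asof(intraday_df, yesterdays_high, on='time')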
Given your already existing code, you can map the shifted values:
intraday_df['yesterdays_high'] = (intraday_df['time']
                                  .dt.date
                                  .map(daily_df['daily_high']
                                       .set_axis(daily_df['time'].shift(-1)))
                                  )
If you don't have all days and really want to map the real previous business day:
intraday_df['yesterdays_high'] = (intraday_df['time']
                                  .dt.date
                                  .map(daily_df['daily_high']
                                       .set_axis(daily_df['time'].add(pd.offsets.BusinessDay())))
                                  )
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
We can use .dt.date as an index to join the two frames together on the same days. As for the previous day's high_price, we can apply shift on daily_df:
intra_date = intraday_df['time'].dt.date
daily_date = daily_df['time'].dt.date
answer = intraday_df.set_index(intra_date).join(
    daily_df.set_index(daily_date)['daily_high'].shift()
).reset_index(drop=True)

Extract rows based on max date for each id if another row value is duplicate in pandas df

I want to remove rows where completion_dt is duplicated for each id, keeping the row with the max date from the Maturity_dt column.
id Maturity_dt score completion_dt
1 2020-12-31 12 2022-06-24
2 2020-12-31 8 2022-03-30
2 2019-12-31 0 2022-03-30
2 2018-12-31 0 2021-08-24
3 2022-12-31 15 2022-05-31
2 2021-12-31 12 2022-05-31
Desired Output:
id Maturity_dt score completion_dt
1 2020-12-31 12 2022-06-24
2 2020-12-31 8 2022-03-30
2 2018-12-31 0 2021-08-24
3 2022-12-31 15 2022-05-31
2 2021-12-31 12 2022-05-31
I tried the following code, but it's not working as expected.
df = df.sort_values('Maturity_dt').drop_duplicates(subset=['completion_dt', 'id'], keep='last')
Alternative:
df = df.loc[df.groupby(['id', 'completion_dt'])['Maturity_dt'].idxmax()]
print(df)
Output:
id Maturity_dt score completion_dt
0 1 2020-12-31 12 2022-06-24
3 2 2018-12-31 0 2021-08-24
1 2 2020-12-31 8 2022-03-30
5 2 2021-12-31 12 2022-05-31
4 3 2022-12-31 15 2022-05-31
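For reference, a self-contained run of the groupby/idxmax approach on the sample data from the question (values transcribed from the table above):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 2, 3, 2],
    'Maturity_dt': pd.to_datetime(['2020-12-31', '2020-12-31', '2019-12-31',
                                   '2018-12-31', '2022-12-31', '2021-12-31']),
    'score': [12, 8, 0, 0, 15, 12],
    'completion_dt': pd.to_datetime(['2022-06-24', '2022-03-30', '2022-03-30',
                                     '2021-08-24', '2022-05-31', '2022-05-31']),
})
# idxmax returns the row label of the latest Maturity_dt within each (id, completion_dt) group
df = df.loc[df.groupby(['id', 'completion_dt'])['Maturity_dt'].idxmax()]
print(df)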

Adding a year to a period?

I have a column which I have converted to datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
And I need to add 1 year to this time based on a conditional:
df.loc[df['quarter'] == "Q4_", 'date'] + pd.offsets.DateOffset(years=1)
but it's not working:
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
I have tried converting it to period since I only need the year to be used in a concatenation later:
df['year'] = df['date'].dt.to_period('Y')
but I cannot add any number to a period.
This appears to be working for me:
import pandas as pd
df = pd.DataFrame({'date':pd.date_range('1/1/2021', periods=50, freq='M')})
print(df.head(24))
Input:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2021-10-31
10 2021-11-30
11 2021-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2022-10-31
22 2022-11-30
23 2022-12-31
Add a year:
df.loc[df['date'].dt.quarter == 4, 'date'] += pd.offsets.DateOffset(years=1)
print(df.head(24))
Note that, per your logic, the year increases starting in October.
Output:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2022-10-31
10 2022-11-30
11 2022-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2023-10-31
22 2023-11-30
23 2023-12-31
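Two side notes on the attempts in the question: the conditional expression there computed the offset but was never assigned back, and a period does in fact accept integer addition (in units of its frequency). A sketch, assuming the 'quarter' column from the question:
# the original conditional works once the result is assigned back
df.loc[df['quarter'] == "Q4_", 'date'] += pd.offsets.DateOffset(years=1)
# adding an integer to a yearly period advances it by that many years
df['year'] = df['date'].dt.to_period('Y') + 1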

How to extract data from df2 based on datetime data in df1 in python

The question here is that I am using 2 datasets: df1 has all time-log data for a few years; df2 has data on incidents that happen at a specific point in time. I need to extract from df1 a segment of one week of data right before the datetime of the IncidentTime in df2 (thus IncidentTime is the end time of a 7-day period for a particular Group_Id).
Note:
A Group_Id may have more than one incident (i.e. there may be multiple rows in df2 with the same Group_Id).
Here are the datasets:
df1
Timestamp Group_Id Data
2013-10-20 00:00:05.143 11 14
2013-10-21 00:05:10.377 11 15
2013-10-22 14:22:15.501 11 19
...
2016-03-05 00:00:05.743 101 21
2017-12-24 00:00:10.407 101 33
...
df2
IncidentTime Group_Id
27/10/13 16:08 11
03/12/16 16:11 2
24/10/14 12:08 11
04/07/17 08:00 100
03/04/13 14:10 26
15/11/18 17:00 46
11/02/19 00:20 101
Then create a new column in the final data for IncidentTime (it'll be the same for a Group_Id each time). For example, Group_Id = 11 has an IncidentTime of 27/10/13 16:08, thus we take all data from df1 for Group_Id = 11 where its Timestamp lies between 20/10/13 16:08 and 27/10/13 16:08. Our final data df3 then has the 7-day data corresponding to its own IncidentTime for all Group_Id:
df3
IncidentTime Group_Id Timestamp Data
27/10/13 16:08 11 2013-10-20 19:10:05.143 14
27/10/13 16:08 11 2013-10-21 00:05:10.377 15
27/10/13 16:08 11 2013-10-22 14:22:15.501 19
...
27/10/13 16:08 11 2013-10-27 05:22:15.501 20
...
You can convert the columns to datetimes first, then create an IncidentTime_start column by subtracting 7 days, and use DataFrame.merge with an outer join:
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df2['IncidentTime'] = pd.to_datetime(df2['IncidentTime'], dayfirst=True)
df2['IncidentTime_start'] = df2['IncidentTime'] - pd.offsets.DateOffset(days=7)
df = df2.merge(df1, on='Group_Id', how='outer')
print(df)
IncidentTime Group_Id IncidentTime_start Timestamp \
0 2013-10-27 16:08:00 11 2013-10-20 16:08:00 2013-10-20 00:00:05.143
1 2013-10-27 16:08:00 11 2013-10-20 16:08:00 2013-10-21 00:05:10.377
2 2013-10-27 16:08:00 11 2013-10-20 16:08:00 2013-10-22 14:22:15.501
3 2016-12-03 16:11:00 2 2016-11-26 16:11:00 NaT
4 2014-10-24 12:08:00 40 2014-10-17 12:08:00 NaT
5 2017-07-04 08:00:00 100 2017-06-27 08:00:00 NaT
6 2013-04-03 14:10:00 26 2013-03-27 14:10:00 NaT
7 2018-11-15 17:00:00 46 2018-11-08 17:00:00 NaT
8 2019-02-11 00:20:00 101 2019-02-04 00:20:00 2016-03-05 00:00:05.743
9 2019-02-11 00:20:00 101 2019-02-04 00:20:00 2017-12-24 00:00:10.407
Data
0 14.0
1 15.0
2 19.0
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 21.0
9 33.0
And then filter by Series.between with boolean indexing:
df = (df[df['Timestamp'].between(df['IncidentTime_start'], df['IncidentTime'])]
      .drop('IncidentTime_start', axis=1))
print(df)
IncidentTime Group_Id Timestamp Data
1 2013-10-27 16:08:00 11 2013-10-21 00:05:10.377 15.0
2 2013-10-27 16:08:00 11 2013-10-22 14:22:15.501 19.0
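If df1 is large, the intermediate outer merge can get big before the filter runs; a leaner variant of the same logic (assuming the frames above) uses an inner merge and filters immediately:
# inner join keeps only Group_Ids present in both frames, then the window filter applies
df = df2.merge(df1, on='Group_Id')
df = (df[df['Timestamp'].between(df['IncidentTime_start'], df['IncidentTime'])]
      .drop('IncidentTime_start', axis=1))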

Pandas select columns and data dependent on header

I have a large .csv file. I want to select only the time/date column and 20 other columns which I know by header.
As a test, I try to take only the column with the header 'TIMESTAMP'. I know this is
4207823 rows long in the .csv and that it only contains dates and times. The code below selects the TIMESTAMP column but also carries on to take values from other columns, as shown below:
import pandas

low_memory = False
f = pandas.read_csv(r'C:\Users\mmso2\Google Drive\MABL Wind\_Semester 2 2016\Wind Farm Info\DataB\DataB - NaN2.csv',
                    dtype=object)  # convert file to variable so it can be edited
time = f[['TIMESTAMP']]
time = time[0:4207823]  # test to see if this stops time taking other data
print(time)
output
TIMESTAMP
0 2007-08-15 21:10:00
1 2007-08-15 21:20:00
2 2007-08-15 21:30:00
3 2007-08-15 21:40:00
4 2007-08-15 21:50:00
5 2007-08-15 22:00:00
6 2007-08-15 22:10:00
7 2007-08-15 22:20:00
8 2007-08-15 22:30:00
9 2007-08-15 22:40:00
10 2007-08-15 22:50:00
11 2007-08-15 23:00:00
12 2007-08-15 23:10:00
13 2007-08-15 23:20:00
14 2007-08-15 23:30:00
15 2007-08-15 23:40:00
16 2007-08-15 23:50:00
17 2007-08-16 00:00:00
18 2007-08-16 00:10:00
19 2007-08-16 00:20:00
20 2007-08-16 00:30:00
21 2007-08-16 00:40:00
22 2007-08-16 00:50:00
23 2007-08-16 01:00:00
24 2007-08-16 01:10:00
25 2007-08-16 01:20:00
26 2007-08-16 01:30:00
27 2007-08-16 01:40:00
28 2007-08-16 01:50:00
29 2007-08-16 02:00:00 #these are from the TIMESTAMP column
... ...
679302 221.484 #This is from another column
679303 NaN
679304 2015-09-23 06:40:00
679305 NaN
679306 NaN
679307 2015-09-23 06:50:00
679308 NaN
679309 NaN
679310 2015-09-23 07:00:00
The problem was due to an error in the input file, so a simple use of usecols in pandas.read_csv worked.
The code below demonstrates the selection of a few columns of data:
import pandas

# read only the selected columns
df = pandas.read_csv('DataB - Copy - Copy.csv', delimiter=',', dtype=object, low_memory=False,
                     usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'])
print(df)  # see what the data looks like
df.to_csv('DataB_GreaterGabbardOnly.csv')  # save the selection to a new .csv
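If the TIMESTAMP column should come back as real datetimes rather than strings, read_csv can parse it on the way in (same columns as above):
# parse_dates converts the column while reading, instead of keeping everything as object
df = pandas.read_csv('DataB - Copy - Copy.csv', usecols=['TIMESTAMP', 'igmmx_U_77m', 'igmmx_U_58m'],
                     parse_dates=['TIMESTAMP'])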
