How do you groupby on consecutive blocks of rows where each block is separated by a threshold value?
I have the following sample pandas dataframe, and I'm having difficulty splitting it into blocks of rows wherever the difference between consecutive dates exceeds 365 days.
Date        Data
2019-01-01  A
2019-05-01  B
2020-04-01  C
2021-07-01  D
2022-02-01  E
2024-05-01  F
The output I'm looking for is the following:
Min Date    Max Date    Data
2019-01-01  2020-04-01  ABC
2021-07-01  2022-02-01  DE
2024-05-01  2024-05-01  F
I was looking at pandas .diff() and .cumsum() to get the number of days between consecutive rows and filter for rows with a difference greater than 365 days; however, that doesn't work when the dataframe has multiple blocks of rows.
I would also suggest .diff() and .cumsum():
import pandas as pd
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"])
blocks = df["Date"].diff().gt("365D").cumsum()
out = df.groupby(blocks).agg({"Date": ["min", "max"], "Data": "sum"})
out:
            Date                 Data
             min         max      sum
Date
0     2019-01-01  2020-04-01     ABC
1     2021-07-01  2022-02-01      DE
2     2024-05-01  2024-05-01       F
after which you can replace the column labels (now a two-level MultiIndex) as appropriate.
A new block starts whenever the gap from the previous row exceeds 365 days: "D" is 456 days after "C" and "F" is 820 days after "E", so each of them opens a new group.
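For example, one minimal way (a sketch, using the out frame from above) to match the labels in your expected output:
out.columns = ["Min Date", "Max Date", "Data"]
out = out.reset_index(drop=True)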
I have a column in a pandas dataframe that contains two types of information: (1) a date and time, and (2) a company name. I have to split the column into two (date_time, full_company_name). First I tried to split the column based on character count (the first 19 characters to one column, the rest to the other), but then I realized that sometimes the date is missing, so the split might not work. Then I tried using regex, but I can't seem to extract it correctly.
The column contains strings like 2021-01-01 05:00:00Acme Industries; the desired output is two separate columns, date_time and full_company_name.
If the dates are all properly formatted, maybe you don't have to use regex:
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
"2021-01-01 06:00:00Acme LLC"]})
df["date"] = pd.to_datetime(df.A.str[:19])
df["company"] = df.A.str[19:]
df
# A date company
# 0 2021-01-01 05:00:00Acme Industries 2021-01-01 05:00:00 Acme Industries
# 1 2021-01-01 06:00:00Acme LLC 2021-01-01 06:00:00 Acme LLC
OR
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")
Note:
If you have an option to avoid concatenating those strings to begin with, please do so. This is not a healthy habit.
Solution (not that pretty, but it gets the job done):
import pandas as pd
from datetime import datetime
import re
df = pd.DataFrame()
# creating a list of companies
companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM',
'Oracle', 'Intel', 'Yahoo', 'Alphabet']
# creating a list of sample datetime objects
dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
# creating the column named 'date_time/full_company_name'
df['date_time/full_company_name'] = [f'{str(dates[i])}{companies[i]}' for i in range(len(companies))]
# Before:
# date_time/full_company_name
# 2000-01-01 00:00:00Google
# 2001-01-01 00:00:00Apple
# 2002-01-01 00:00:00Microsoft
# 2003-01-01 00:00:00Facebook
# 2004-01-01 00:00:00Amazon
# 2005-01-01 00:00:00IBM
# 2006-01-01 00:00:00Oracle
# 2007-01-01 00:00:00Intel
# 2008-01-01 00:00:00Yahoo
# 2009-01-01 00:00:00Alphabet
new_rows = []
for row in df['date_time/full_company_name']:
    # extract the date_time from the row using regex
    date_time = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
    # handle the case of a missing date_time
    date_time = date_time.group() if date_time else ''
    # extract the company name from where the date_time ends
    company_name = row[len(date_time):]
    # collect the extracted date_time and company_name
    new_rows.append([date_time, company_name])
# drop the column 'date_time/full_company_name'
df = df.drop(columns=['date_time/full_company_name'])
# add the new columns to the dataframe: 'date_time' and 'full_company_name'
df['date_time'] = [row[0] for row in new_rows]
df['full_company_name'] = [row[1] for row in new_rows]
# After:
# date_time full_company_name
# 2000-01-01 00:00:00 Google
# 2001-01-01 00:00:00 Apple
# 2002-01-01 00:00:00 Microsoft
# 2003-01-01 00:00:00 Facebook
# 2004-01-01 00:00:00 Amazon
# 2005-01-01 00:00:00 IBM
# 2006-01-01 00:00:00 Oracle
# 2007-01-01 00:00:00 Intel
# 2008-01-01 00:00:00 Yahoo
# 2009-01-01 00:00:00 Alphabet
Make the date group optional with ? and use an uncaptured .* instead of (.*), so rows with a missing date still match:
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
"2021-01-01 06:00:00Acme LLC"]})
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")
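As a quick sketch (with a made-up second row that has no timestamp), the optional group yields NaN for the date instead of failing the whole match:
df2 = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries", "Acme LLC"]})
df2.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")
#                      0
# 0  2021-01-01 05:00:00
# 1                  NaN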
I have a two-column data frame with departure and arrival times (see the example below).
In order to make operations on those times, I want to convert the strings into datetime format, keeping only the hour/minute/second information.
Example of input data (file name: table):
departure_time,arrival_time
07:00:00,07:30:00
07:00:00,07:15:00
07:05:00,07:22:00
07:10:00,07:45:00
07:15:00,07:50:00
07:10:00,07:26:00
07:40:00,08:10:00
I ran this code to import the table file and then convert the type into datetime format:
import pandas as pd
from datetime import datetime
df= pd.read_excel("table.xlsx")
df['arrival_time']= pd.to_datetime(df['arrival_time'], format= '%H:%M:%S')
but get this error:
ValueError: time data ' 07:30:00' does not match format '%H:%M:%S' (match)
What mistake am I making?
Seems like an import issue: ' 07:30:00' has a space in front. If it's a CSV you're importing, you can use skipinitialspace=True.
If I import your CSV file, and use your code, it works fine:
CSV:
departure_time,arrival_time
07:00:00,07:30:00
07:00:00,07:15:00
07:05:00,07:22:00
07:10:00,07:45:00
07:15:00,07:50:00
07:10:00,07:26:00
07:40:00,08:10:00
df = pd.read_csv('test.csv', skipinitialspace=True)
df['arrival_time']= pd.to_datetime(df['arrival_time'], format='%H:%M:%S').dt.time
print(df)
departure_time arrival_time
0 07:00:00 07:30:00
1 07:00:00 07:15:00
2 07:05:00 07:22:00
3 07:10:00 07:45:00
4 07:15:00 07:50:00
5 07:10:00 07:26:00
6 07:40:00 08:10:00
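Since the code in the question uses read_excel, which has no skipinitialspace option, an alternative sketch is to strip the whitespace before parsing:
df = pd.read_excel("table.xlsx")
df['arrival_time'] = pd.to_datetime(df['arrival_time'].str.strip(), format='%H:%M:%S').dt.time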
I have a number of dataframes consisting of date time and precipitation data that I would like to later merge and plot by station ID. The station ID is present in the "header" of the df but I am not able to reason out how to assign the entire df a unique ID based on it. The .csv file name that the df is constructed from also has the station ID in the name.
For clarification, I am reading in each .csv file from path using an if statement within a loop to differentiate between the different file names and adjust the column headers as needed.
Reading in each file, where 8085 is part of the filename, looks like:
def function(path, file):
    if '8085' in file:
        df = pd.read_csv(path + file, usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         names=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         header=None, parse_dates=[[1, 2]], skiprows=[0, 1])
        df_fprd_oil = df[df['Variable'].str.contains('Precip')]
Example of the .csv file before reading in:
Station ID,Sensor Serial Num,
12345678,sn123456789,
Precip,02/01/2020,09:45:00,-2.19,
Batt Voltage,02/01/2020,09:45:00,13.4,
Temp In Box,02/01/2020,09:45:00,-2.58,
Precip,02/01/2020,10:00:00,-2.19,
Batt Voltage,02/01/2020,10:00:00,13.6,
Temp In Box,02/01/2020,10:00:00,-2.17,
Example of the df after reading in:
Date_Time Variable FPR-D Oil
0 2020-02-01 09:45:00 Precip -2.19
3 2020-02-01 10:00:00 Precip -2.19
6 2020-02-01 10:15:00 Precip -2.19
What (I think) is desired
Date_Time Station ID Variable FPR-D Oil
0 2020-02-01 09:45:00 12345678 Precip -2.19
3 2020-02-01 10:00:00 12345678 Precip -2.19
6 2020-02-01 10:15:00 12345678 Precip -2.19
Or maybe even
Date_Time Variable FPR-D Oil Unique ID
0 2020-02-01 09:45:00 Precip -2.19 1
3 2020-02-01 10:00:00 Precip -2.19 1
6 2020-02-01 10:15:00 Precip -2.19 1
If you want to add a UniqueID to your dataframe and want it to be constant for one dataframe, then you can simply do this:
df["UniqueID"] = 1
It will add a column named UniqueID to your existing df with the assigned value.
Hope it helps.
I am assuming, based on your comments above, that the station ID is always located in the second row of the first column in all the CSV files.
import pandas as pd
from io import StringIO
# sample data
s = """Station ID,Sensor, Serial Num,
12345678,123456789,
Precip,02/01/2020,09:45:00,-2.19,
Batt Voltage,02/01/2020,09:45:00,13.4,
Temp In Box,02/01/2020,09:45:00,-2.58,
Precip,02/01/2020,10:00:00,-2.19,
Batt Voltage,02/01/2020,10:00:00,13.6,
Temp In Box,02/01/2020,10:00:00,-2.17,"""
# read your file
df = pd.read_csv(StringIO(s), usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
skiprows=[0,1], names=['Variable', 'Date', 'Time', 'FPR-D Oil'])
# read it again but only get the first value of the second row
sid = pd.read_csv(StringIO(s), skiprows=1, nrows=1, header=None)[0].iloc[0]
# filter and copy so you are not assigning to a slice of the frame
new_df = df[df['Variable'] == 'Precip'].copy()
# assign sid to a new column
new_df.loc[:, 'id'] = sid
print(new_df)
Variable Date Time FPR-D Oil id
0 Precip 02/01/2020 09:45:00 -2.19 12345678
3 Precip 02/01/2020 10:00:00 -2.19 12345678
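Since you mention the station ID also appears in the .csv filename, another option (a sketch, assuming the ID is the only eight-digit run in the name) is to pull it from the file variable in your loop:
import re
m = re.search(r'\d{8}', file)
if m:
    new_df['id'] = m.group()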
Try this:
df['Station ID'] = 12345678
Where df is your dataframe
I have a dataframe (df) with two date columns, start and end, whose head looks like
    name                start                  end
0   John  2018-11-09 00:00:00  2012-03-01 00:00:00
1  Steve  1990-09-03 00:00:00
2   Debs  1977-09-07 00:00:00  2012-07-02 00:00:00
3  Mandy  2009-01-09 00:00:00
4  Colin  1993-08-22 00:00:00  2002-06-03 00:00:00
The start and end columns have the type object. I want to change the type to datetime so I can use the following:
referenceError = DeptTemplate['start'] > DeptTemplate['end']
I am trying to change the type using:
df['start'].dt.strftime('%d/%m/%Y')
df['end'].dt.strftime('%d/%m/%Y')
but I think the rows with no date in those columns are causing a problem. How can I handle the blank values so I can change the type to datetime and run my analysis?
As shown in the .to_datetime docs, you can set the behavior for unparseable values with the errors kwarg, and specify the expected input format with the format kwarg.
# Bad values will be NaT
df["start"] = pd.to_datetime(df.start, errors='coerce', format='%Y-%m-%d %H:%M:%S')
As mentioned in the comments, you can prepare the column with replace if you absolutely must use strftime.
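A minimal sketch of the whole flow, assuming the df from the question (note that comparisons involving NaT evaluate to False):
df["start"] = pd.to_datetime(df["start"], errors='coerce')
df["end"] = pd.to_datetime(df["end"], errors='coerce')
referenceError = df["start"] > df["end"]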