I have a pandas dataframe with datetimes in multiple columns; it looks similar to the example below, but with hundreds of columns, pushing almost 1k.
datetime, battery, datetime, temperature, datetime, pressure
2020-01-01 01:01:01, 13.8, 2020-01-01 01:01:02, 97, 2020-01-01 01:01:03, 10
2020-01-01 01:01:04, 13.8, 2020-01-01 01:01:05, 97, 2020-01-01 01:01:06, 11
What I have done is imported it and then converted every datetime column using pd.to_datetime. This reduces the memory usage by more than half (2.4GB to 1.0GB), but I'm wondering if this is still inefficient and whether there is a better way.
Would I benefit from converting this down to 3 columns, where I have datetime, data name, data measurement? If so, what is the best method of doing this? I've tried it but end up with a lot of empty spaces.
Would there be another way to handle this data that I'm just not seeing?
Or does what I'm doing make sense, and is it efficient enough?
I eventually want to plot some of this data by selecting specific data names.
I ran a small experiment with the above data; converting it to date / type / value columns reduces the overall memory consumption:
print(df)
datetime battery datetime.1 temperature datetime.2 pressure
0 2020-01-01 01:01:01 13.8 2020-01-01 01:01:02 97 2020-01-01 01:01:03 10
1 2020-01-01 01:01:04 13.8 2020-01-01 01:01:05 97 2020-01-01 01:01:06 11
print(df.memory_usage().sum())
==> 224
After converting the dataframe:
dfs = []
for i in range(0, 6, 2):                       # step through each (datetime, value) column pair
    d = df.iloc[:, i:i+2].copy()               # copy to avoid SettingWithCopyWarning
    d["type"] = d.columns[1]
    d.columns = ["datetime", "value", "type"]
    dfs.append(d)
new_df = pd.concat(dfs)
print(new_df)
==>
datetime value type
0 2020-01-01 01:01:01 13.8 battery
1 2020-01-01 01:01:04 13.8 battery
0 2020-01-01 01:01:02 97.0 temperature
1 2020-01-01 01:01:05 97.0 temperature
0 2020-01-01 01:01:03 10.0 pressure
1 2020-01-01 01:01:06 11.0 pressure
print(new_df.memory_usage().sum())
==> 192
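One further tweak on top of the long format, in case it helps: the type column repeats only a handful of distinct strings, so storing it as a pandas categorical should shrink it to small integer codes plus one copy of each label, and it keeps the "select by data name, then plot" workflow simple. A minimal sketch, assuming the new_df built above (memory_usage(deep=True) is used so the actual string storage of the object column is counted):
new_df["type"] = new_df["type"].astype("category")   # repeated labels become integer codes
print(new_df.memory_usage(deep=True).sum())

# Selecting a single measurement for plotting stays straightforward:
battery = new_df[new_df["type"] == "battery"]
battery.plot(x="datetime", y="value")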
Related
I have one year's worth of data at four-minute intervals. I need to load 24 hours of data at a time and run a function on that dataframe, stepping forward eight hours each time. I need to repeat this process for all the data between 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4min')
year_df = pd.DataFrame({'Timestamp': myIndex})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it's not that efficient. Until more functionality based on rolling datetime windows is added to pandas, though, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    # take every timestamp within 24 hours of e (inclusive) and aggregate it
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
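If the requirement really is one computation per eight-hour step (rather than one per timestamp as above), an explicit loop over the window starts may be both clearer and cheaper, since it only takes about 1,095 slices for the year. This is just a sketch: run_my_function is a placeholder name for whatever you actually compute on each 24-hour chunk, and year_df is the proxy frame built earlier.
ts = year_df.set_index('Timestamp')

# one 24-hour window every 8 hours; windows near the year's end will be shorter
window_starts = pd.date_range(start, end, freq='8h')
results = []
for w in window_starts:
    chunk = ts.loc[w : w + pd.Timedelta('24h') - pd.Timedelta('4min')]
    results.append(run_my_function(chunk))    # placeholder for your own function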
I'm having trouble finding an efficient way to update some column values in a large pandas DataFrame.
The code below creates a DataFrame in a similar format to what I'm working with. A summary of the data: the DataFrame contains three days of consumption data, with each day split into 10 measurement periods. Each period is recorded during four separate processes: a preliminary reading, an end-of-day reading, and two later revisions. Every update is timestamped in the Last_Update column.
import numpy as np
import pandas as pd

dates = ['2022-01-01']*40 + ['2022-01-02']*40 + ['2022-01-03']*40
periods = list(range(1,11))*12
versions = (['PRELIM'] * 10 + ['DAILY'] * 10 + ['REVISE'] * 20) * 3
data = {'Date': dates,
        'Period': periods,
        'Version': versions,
        'Consumption': np.random.randint(1, 30, 120)}
df = pd.DataFrame(data)
df.Date = pd.to_datetime(df.Date)
## Add random times to the REVISE Last_Update values
df['Last_Update'] = df['Date'].apply(lambda x: x + pd.Timedelta(hours=np.random.randint(1,23), minutes=np.random.randint(1,59)))
df['Last_Update'] = df['Last_Update'].where(df.Version == 'REVISE', df['Date'])
The problem is that the two revision categories are both specified by the same value: "REVISE". One of these "REVISE" values must be changed to something like "REVISE_2". If you group the data in the following way df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].sum() you can see there are two Last_Update dates for each period in each day for REVISE. So we need to set the REVISE with the largest date to REVISE_2.
The only way I've managed to find a solution is a very convoluted function with the apply method that tests which date is larger, stores its index, and then changes the value using loc. This ended up taking a huge amount of time for small segments of the data (the full dataset is millions of rows).
I feel like there is an easy solution using groupby functions, but I'm having difficulties navigating the multi-index output.
Any help would be appreciated cheers.
We figure out the index of the max REVISE date using idxmax after some grouping, and then change the labels:
last_revised_date_idx = df[df['Version'] == 'REVISE'].groupby(['Date', 'Period'], group_keys=False)['Last_Update'].idxmax()
df.loc[last_revised_date_idx, 'Version'] = 'REVISE_2'
check the output:
df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].count().head(20)
produces
Date        Period  Version   Last_Update
2022-01-01  1       DAILY     2022-01-01 00:00:00    1
                    PRELIM    2022-01-01 00:00:00    1
                    REVISE    2022-01-01 03:50:00    1
                    REVISE_2  2022-01-01 12:10:00    1
            2       DAILY     2022-01-01 00:00:00    1
                    PRELIM    2022-01-01 00:00:00    1
                    REVISE    2022-01-01 10:45:00    1
                    REVISE_2  2022-01-01 22:05:00    1
            3       DAILY     2022-01-01 00:00:00    1
                    PRELIM    2022-01-01 00:00:00    1
                    REVISE    2022-01-01 17:03:00    1
                    REVISE_2  2022-01-01 19:10:00    1
            4       DAILY     2022-01-01 00:00:00    1
                    PRELIM    2022-01-01 00:00:00    1
                    REVISE    2022-01-01 15:23:00    1
                    REVISE_2  2022-01-01 18:08:00    1
            5       DAILY     2022-01-01 00:00:00    1
                    PRELIM    2022-01-01 00:00:00    1
                    REVISE    2022-01-01 12:19:00    1
                    REVISE_2  2022-01-01 18:04:00    1
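If idxmax ever becomes a bottleneck on the full dataset, a transform-based comparison is another vectorised way to get the same labelling. This is only a sketch of the idea, applied to the original frame before any relabelling; note that exact ties on Last_Update (unlikely with these random times) would all be renamed rather than just one:
revise = df['Version'] == 'REVISE'
latest = df.loc[revise].groupby(['Date', 'Period'])['Last_Update'].transform('max')
# rows whose Last_Update equals their group's maximum get the new label
is_latest = df.loc[revise, 'Last_Update'].eq(latest)
df.loc[is_latest[is_latest].index, 'Version'] = 'REVISE_2'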
I have some data in a pandas dataframe that has entries at the per-second level over the course of a few hours. Entries are indexed by datetime format as TIMESTAMP. I would like to group all data within each minute and do some calculations and manipulations. That is, I would like to take all data within 09:00:00 to 09:00:59 and report some things about what happened in this minute. I would then like to do the same calculations and manipulations from 09:01:00 to 09:01:59 and so on through to the end of my dataset.
I've been fiddling around with groupby() and .resample() but I have had no success so far. I can think of a very inelegant way to do it with a series of for loops and if statements but I was wondering if there was an easier way here.
You didn't provide any data or code, so I'll just make some up. You also don't specify what calculations you want to do, so I'm just taking the mean:
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range("1/1/2020 00:00:00", "1/1/2020 03:00:00", freq="S")
>>> values = np.random.random(len(dates))
>>> df = pd.DataFrame({"dates": dates, "values": values})
>>> df.resample("1Min", on="dates").mean().reset_index()
dates values
0 2020-01-01 00:00:00 0.486985
1 2020-01-01 00:01:00 0.454880
2 2020-01-01 00:02:00 0.467397
3 2020-01-01 00:03:00 0.543838
4 2020-01-01 00:04:00 0.502764
.. ... ...
236 2020-01-01 03:56:00 0.478224
237 2020-01-01 03:57:00 0.460435
238 2020-01-01 03:58:00 0.508211
239 2020-01-01 03:59:00 0.415030
240 2020-01-01 04:00:00 0.050993
[241 rows x 2 columns]
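Since "some calculations and manipulations" probably means more than one statistic, it may be worth knowing that the same resampler accepts agg with a list of functions, giving one row per minute and one column per statistic, for example:
>>> df.resample("1Min", on="dates")["values"].agg(["mean", "min", "max", "count"])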
I'm having a terrible time finding information on this. I am tracking several completion times every single day to measure them against goal completion times.
I am reading the completion date and time into a pandas dataframe and using df.map to map a dictionary of completion times to create a "goal time" column in a dataframe.
Sample Data:
Date Process
1/2/2020 10:20:00 AM Test 1
1/2/2020 10:25:00 AM Test 2
1/3/2020 10:15:00 AM Test 1
1/3/2020 10:00:00 AM Test 2
Using df.map() to create a column with the goal time:
import datetime as dt

goalmap = {
    'Test 1': dt.datetime.strptime('10:15', '%H:%M'),
    'Test 2': dt.datetime.strptime('10:30', '%H:%M')}
df['Goal Time'] = df['Process'].map(goalmap)
I am then trying to create a new column of "Delta" that calculates the time difference between the two in minutes. Most of the issues I am running into relate to the data types. I got it to calculate a time difference by converting column one (Date) using pd.to_datetime, but because my 'Goal Time' column does not store a date, it calculates a delta that is massive (back to 1900). I've also tried parsing the time out of the Date Time column, to no avail.
Any best way to calculate the difference between time stamps only?
I recommend timedelta over datetime:
goalmap = {
    'Test 1': pd.to_timedelta('10:15:00'),
    'Test 2': pd.to_timedelta('10:30:00')}
df['Goal Time'] = df['Process'].map(goalmap)
df['Goal_Timestamp'] = df['Date'].dt.normalize() + df['Goal Time']
df['Meet_Goal'] = df['Date'] <= df['Goal_Timestamp']
Output:
Date Process Goal Time Goal_Timestamp Meet_Goal
0 2020-01-02 10:20:00 Test 1 10:15:00 2020-01-02 10:15:00 False
1 2020-01-02 10:25:00 Test 2 10:30:00 2020-01-02 10:30:00 True
2 2020-01-03 10:15:00 Test 1 10:15:00 2020-01-03 10:15:00 True
3 2020-01-03 10:00:00 Test 2 10:30:00 2020-01-03 10:30:00 True
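And to get the Delta column in minutes that the question asks for, subtracting the two timestamp columns gives a timedelta that converts directly; a short addition on top of the frame above:
# positive = completed after the goal time, negative = beat it
df['Delta'] = (df['Date'] - df['Goal_Timestamp']).dt.total_seconds() / 60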
I have a data frame that looks like this:
How can I make a new data frame that contains only the minimum 'Time' values for a user on the same date?
So I want to have a data frame with the same structure, but only one 'Time' for a 'Date' for a user.
So it should be like this:
Sort the values by the Time column and check for duplicates in Date + User_name. However, to make sure '9:00' sorts before '10:00' (as a plain string it would not), we convert the strings to datetime first.
import pandas as pd
data = {
    'User_name': ['user1', 'user1', 'user1', 'user2'],
    'Date': ['8/29/2016', '8/29/2016', '8/31/2016', '8/31/2016'],
    'Time': ['9:07:41', '9:07:42', '9:07:43', '9:31:35']
}
# Recreate sample dataframe
df = pd.DataFrame(data)
Alternative 1 (quicker):
#100 loops, best of 3: 1.73 ms per loop
# Create a mask
m = (df.reindex(pd.to_datetime(df['Time']).sort_values().index)
       .duplicated(['Date', 'User_name']))
# Apply inverted mask
df = df.loc[~m]
Alternative 2 (more readable):
One easier way is to remake the df['Time'] column as datetime, group it by Date and User_name, and get the idxmin(). This will be our mask. (Credit to jezrael)
# 100 loops, best of 3: 4.34 ms per loop
# Create a mask
m = pd.to_datetime(df['Time']).groupby([df['Date'],df['User_name']]).idxmin()
df = df.loc[m]
Output:
Date Time User_name
0 8/29/2016 9:07:41 user1
2 8/31/2016 9:07:43 user1
3 8/31/2016 9:31:35 user2
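A third option, in case it reads more naturally: sort on the converted time and drop duplicates, which avoids building a mask at all. A sketch against the same sample df; the temporary _t column exists only for sorting and is not part of the original data:
out = (df.assign(_t=pd.to_datetime(df['Time']))
         .sort_values('_t')
         .drop_duplicates(['Date', 'User_name'])   # keeps the earliest row per Date + User_name
         .drop(columns='_t'))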
Update 1: the user column is now included in the grouping.
Not the best way, but simple:
import numpy as np
import pandas as pd

# Seven random timestamps within the first three days of 2016
df = (pd.DataFrame(np.datetime64('2016') +
                   np.random.randint(0, 3*24, size=(7, 1)).astype('<m8[h]'),
                   columns=['DT'])
        .join(pd.Series(list('abcdefg'), name='str_val'))
        .join(pd.Series(list('UAUAUAU'), name='User')))
df['Date'] = df.DT.dt.date
df['Time'] = df.DT.dt.time
df.drop(columns=['DT'], inplace=True)
print(df)
Output:
str_val User Date Time
0 a U 2016-01-01 04:00:00
1 b A 2016-01-01 10:00:00
2 c U 2016-01-01 20:00:00
3 d A 2016-01-01 22:00:00
4 e U 2016-01-02 04:00:00
5 f A 2016-01-02 23:00:00
6 g U 2016-01-02 09:00:00
Code to get values
print (df.sort_values(['Date','User','Time']).groupby(['Date','User']).first())
Output:
                str_val      Time
Date       User
2016-01-01 A          b  10:00:00
           U          a  04:00:00
2016-01-02 A          f  23:00:00
           U          e  04:00:00