Group by time intervals and additional attribute - python

I have this data:
import pandas as pd
data = {
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35',
                  '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}
df = pd.DataFrame(data)
I want to create two sets of CSVs:
One CSV for each type of vehicle (8 in total), where the rows will be grouped/aggregated by timestamp (in 15-minute intervals throughout the day) and by the "FROM" column - there will be no "TO" column here.
One CSV for each type of vehicle (8 in total), where the rows will be grouped/aggregated by timestamp (in 15-minute intervals throughout the day), by the "FROM" column and by the "TO" column.
The difference between the two sets is that one will count all FROM items, and the other will group and count them by pairs of FROM and TO.
The output will be an aggregated count of vehicles of a given type for 15-minute intervals, summed up by the FROM column and also by the combination of the FROM and TO columns.
1st output can look like this for each vehicle type:
2nd output:
I tried using Pandas groupby() and resample(), but due to my limited knowledge, to no success. I can do this in Excel, but very inefficiently. I want to learn Python more and be more efficient, therefore I would like to code it in Pandas.
I tried df.groupby(['FROM', 'TO']).count(), but I lack the knowledge to use it for what I need. I keep either getting an error when I do something I should not, or the output is not what I need.
I tried df.groupby(pd.Grouper(freq='15Min')).count(), but it seems I perhaps have an incorrect data type.
And I don't know if this is applicable.
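For what it's worth, pd.Grouper does bin by 15-minute intervals once the column is an actual datetime - a minimal sketch (with two made-up rows) of the data-type issue mentioned above:

```python
import pandas as pd

# toy data: two timestamps falling into different 15-minute buckets
df = pd.DataFrame({
    "timestamp": ["2022-11-03 00:00:06", "2022-11-03 00:20:21"],
    "type": ["Car", "HGV"],
})

# pd.Grouper(freq=...) needs a datetime column (or index); plain strings raise an error
df["timestamp"] = pd.to_datetime(df["timestamp"])
counts = df.groupby(pd.Grouper(key="timestamp", freq="15min"))["type"].count()
print(counts)  # one row per 15-minute bucket: 00:00 -> 1, 00:15 -> 1
```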

If I understand you correctly, one approach could be as follows:
Data
import pandas as pd
# IIUC, you want e.g. '2022-11-03 00:00:06' to be in the `00:15` bucket, we need `to_offset`
from pandas.tseries.frequencies import to_offset
# adjusting last 2 timestamps to get a diff interval group
data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}
df = pd.DataFrame(data)
print(df)
             timestamp from to type
0  2022-11-03 00:00:06    A  B  Car
1  2022-11-03 00:00:33    A  B  Car
2  2022-11-03 00:00:35    A  B  Van
3  2022-11-03 00:00:46    A  C  Car
4  2022-11-03 00:20:21    B  C  HGV
5  2022-11-03 00:21:30    C  B  Van
# e.g. for FROM we want: `A`, `4` (COUNT), `00:15` (TIME-END)
# e.g. for FROM-TO we want: `A-B`, 3 (COUNT), `00:15` (TIME-END)
# `A-C`, 1 (COUNT), `00:15` (TIME-END)
Code
# convert time strings to datetime and set the column as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
# add a `15T` (== 15 minutes) offset to the datetime values
df.index = df.index + to_offset('15T')
# create a `dict` for renaming the columns
cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}
# we're doing basically the same for both outputs, so let's use a for loop on a nested list
nested_list = [['from'], ['from', 'to']]
for item in nested_list:
    # groupby `item` (i.e. `['from']` and `['from', 'to']`)
    # use `.agg` to create a named output (`COUNT`), applied to `item[0]`
    # (so both times on: `from`) to get the `count`. Finally, reset the index
    out = df.groupby(item).resample('15T').agg(COUNT=(item[0], 'count')).reset_index()
    # rename the columns using our `cols` dict
    out = out.rename(columns=cols)
    # convert timestamps like `2022-11-03 00:15:00` to `00:15`
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M')
    # rearrange the column order; for the second `item` we need to include `to` (now: `TO`)
    if 'TO' in out.columns:
        out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
    else:
        out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]
    # write the output to a `csv` file; e.g. use an `f-string` to customize the file name;
    # `index=False` avoids writing away the index
    out.to_csv(f'output_{"_".join(item)}.csv', index=False)  # i.e. 'output_from', 'output_from_to'
Output (loaded in Excel)
Relevant documentation:
pd.to_datetime, df.set_index, .to_offset
df.groupby, .resample
df.rename
.dt.strftime
df.to_csv
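As a side note, an equivalent formulation passes pd.Grouper directly inside groupby instead of chaining resample - a sketch on the same offset data (binning behavior at interval edges and for empty intervals may differ slightly from resample):

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')
df.index = df.index + to_offset('15min')

# one Grouper bin per 15-minute window, combined with the `from` key
out = (df.groupby(['from', pd.Grouper(freq='15min')])
         .agg(COUNT=('type', 'count'))
         .reset_index())
```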


Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still throw the whole list into the column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
              if d.get('__rowType', None) == 'DATA' and 'data' in d],
             columns=['unit', 'classification'])
NB: this assumes `test` is the input list.
output:
  unit classification
0    A        Energie
1  bar
2  CCM        Volumen
3  CDM        Volumen
Instead of just giving you the code, I'll first explain how you can do this in detail, and then show you the exact steps to follow and the final code. This way you'll understand everything for any further situation.
When you want to create a pandas dataframe with two columns, you can do this by creating a dictionary and passing it to the DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
So if you want to have the dataframe you specified in your question, the my_data dictionary should be like this:
import numpy as np

my_data = {
    'unit': ['A', 'bar', 'CCM', 'CDM'],
    'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df
(You can see the df.index = ... part. This is because the index of the desired dataframe in your question starts at 1.)
So you just have to extract these data from the data you provided and convert them into the exact dictionary mentioned above (the my_data dictionary).
To do so you can do this:
# This will get the data values like 'bar', 'CCM', etc. from your initial data
values = [x['data'] for x in d if x['__rowType'] == 'DATA']
# This gets the column names from the META row
meta = list(filter(lambda x: x['__rowType'] == 'META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to pass to the DataFrame class.
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be:
import numpy as np
import pandas as pd

d = YOUR_DATA
# This will get the data values like 'bar', 'CCM', etc.
values = [x['data'] for x in d if x['__rowType'] == 'DATA']
# This gets the column names from the META row
meta = list(filter(lambda x: x['__rowType'] == 'META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to pass to the DataFrame class.
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df  # or print(df)
Note: of course you can do all of this in one complex line of code, but to avoid confusion I decided to do it in a couple of lines.
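Putting the steps above together into one self-contained sketch on the sample input (using next instead of filter for the META row; same result):

```python
import numpy as np
import pandas as pd

test = [{'__rowType': 'META', '__type': 'units',
         'data': [{'name': 'units.unit', 'type': 'STRING'},
                  {'name': 'units.classification', 'type': 'STRING'}]},
        {'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
        {'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
        {'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
        {'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]

# column names come from the META row, values from the DATA rows
meta = next(x for x in test if x['__rowType'] == 'META')
columns = [x['name'].split('.')[-1] for x in meta['data']]
values = [x['data'] for x in test if x['__rowType'] == 'DATA']

df = pd.DataFrame(values, columns=columns)
df.index = np.arange(1, len(df) + 1)  # 1-based index, as in the desired output
```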

How to access nested data in a pandas dataframe?

Here's an example of the data I'm working with:
   variable.variableName  timeZone  values
0  turbidity              PST       [{'value': [],
                                      'qualifier': [],
                                      'qualityControlLevel': [],
                                      'method': [{'methodDescription': '[TS087: YSI 6136]',
                                                  'methodID': 15009}],
                                      'source': [],
                                      'offset': [],
                                      'sample': [],
                                      'censorCode': []},
                                     {'value': [{'value': '17.2',
                                                 'qualifiers': ['P'],
                                                 'dateTime': '2022-01-05T12:30:00.000-08:00'},
                                                {'value': '17.5',
                                                 'qualifiers': ['P'],
                                                 'dateTime': '2022-01-05T14:00:00.000-08:00'}]}]
1  degC                   PST       [{'value': [{'value': '9.3',
                                                 'qualifiers': ['P'],
                                                 'dateTime': '2022-01-05T12:30:00.000-08:00'},
                                                {'value': '9.4',
                                                 'qualifiers': ['P'],
                                                 'dateTime': '2022-01-05T12:45:00.000-08:00'}]}]
I'm trying to break each of the variables in the data out into its own dataframe. What I have so far works; however, if there are multiple sets of values (as in turbidity), it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:
import requests
import pandas as pd

url = 'https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json'
response = requests.get(url)
result = response.json()
json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)
new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']
# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, '
                 '780-900 nm, detection angle 90 ±2.5°, formazin nephelometric units (FNU)'])
This outputs:
turbidity df
Empty DataFrame
Columns: []
Index: []
degC df
  value qualifiers                       dateTime
0   9.3          P  2022-01-05T12:30:00.000-08:00
1   9.4          P  2022-01-05T12:45:00.000-08:00
Whereas I want my output to be something like:
turbidity df
  value qualifiers                       dateTime
0  17.2          P  2022-01-05T12:30:00.000-08:00
1  17.5          P  2022-01-05T14:00:00.000-08:00
degC df
  value qualifiers                       dateTime
0   9.3          P  2022-01-05T12:30:00.000-08:00
1   9.4          P  2022-01-05T12:45:00.000-08:00
Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all or check to see if the data frame is empty and grab the next one?
I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.
You can then use
new_df = df.explode("values")
which will split the "turbidity" row into two.
You can then filter rows with empty "value" dictionaries and apply .explode() once again.
You can then also use pd.json_normalize again to expand a dictionary of values into multiple columns, or also look into Series.str.get() to extract a single element from a dict or list.
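To make that concrete, here is a toy sketch of the explode-then-filter idea on a stripped-down stand-in for the payload (the field names follow the question; the frames dict is just one illustrative way to end up with one dataframe per variable):

```python
import pandas as pd

df = pd.DataFrame({
    "variable.variableName": ["turbidity", "degC"],
    "values": [
        [{"value": []},  # empty first set, as in the question
         {"value": [{"value": "17.2", "qualifiers": ["P"]}]}],
        [{"value": [{"value": "9.3", "qualifiers": ["P"]}]}],
    ],
})

# one row per inner dict, then drop rows whose 'value' list is empty
exploded = df.explode("values")
exploded = exploded[exploded["values"].apply(lambda d: len(d["value"]) > 0)]

# expand each surviving 'value' list into a proper dataframe
frames = {name: pd.DataFrame(d["value"])
          for name, d in zip(exploded["variable.variableName"], exploded["values"])}
```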
This JSON is deeply nested, so I think it takes a few steps to transform into what you want.
# First, use json_normalize on the top level to extract values and variableName.
df = pd.json_normalize(result, record_path=['values'], meta=[['variable', 'variableName']])
# Then explode 'value' to flatten the array and filter out any empty arrays.
df = df.explode('value').dropna(subset=['value'])
# Another json_normalize on the exploded value to extract value, qualifiers and dateTime,
# concatenated with variableName. explode('qualifiers') takes out the wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True),
                pd.json_normalize(df.value).explode('qualifiers')], axis=1)
The resulting dataframe should look like this:
    variable.variableName  value qualifiers                       dateTime
0  Temperature, water, °C   10.7          P  2022-01-06T12:15:00.000-08:00
1  Temperature, water, °C   10.7          P  2022-01-06T12:30:00.000-08:00
2  Temperature, water, °C   10.7          P  2022-01-06T12:45:00.000-08:00
3  Temperature, water, °C   10.8          P  2022-01-06T13:00:00.000-08:00
If you will do further data processing, it is probably better to keep everything in one dataframe, but if you really need separate dataframes, take them out with filtering:
df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]

Finding total "wait" time for concurrently running transactions

I need to evaluate several million lines of performance logging from a manufacturing execution system. I need to group the data by date, class and name, finding the total "wait time" of numerous concurrently running transactions. The data comes in looking similar to what is in this dataframe:
import pandas as pd
d = {'START_DATE': ['2021-08-07 19:11:40', '2021-08-07 19:11:40', '2021-08-07 19:11:40',
                    '2021-08-07 19:20:40', '2021-08-07 19:20:40', '2021-08-07 19:20:40',
                    '2021-08-07 19:21:40', '2021-08-07 19:21:40', '2021-08-07 19:21:40',
                    '2021-08-10 19:20:40', '2021-08-10 19:20:40', '2021-08-10 19:20:40',
                    '2021-08-10 19:21:40', '2021-08-10 19:21:40', '2021-08-10 19:21:40'],
     'ELAPSED_TIME': ['00:00:00.465', '00:00:01.000', '00:00:00.165',
                      '00:00:00.100', '00:00:00.200', '00:03:00.000',
                      '00:05:00.000', '00:00:00.200', '00:00:03.000',
                      '00:00:00.100', '00:00:00.200', '00:03:00.000',
                      '00:05:00.000', '00:00:00.200', '00:00:03.000'],
     'TRANSACTION': ['a', 'b', 'c',
                     'a', 'd', 'c',
                     'e', 'a', 'b',
                     'a', 'd', 'c',
                     'e', 'a', 'b'],
     'USER': ['Bob', 'Bob', 'Bob',
              'Biff', 'Biff', 'Biff',
              'Biff', 'Biff', 'Biff',
              'Bob', 'Bob', 'Bob',
              'Bob', 'Bob', 'Bob'],
     'CLASS': ['AA', 'AA', 'AA',
               'BB', 'BB', 'BB',
               'BB', 'BB', 'BB',
               'AA', 'AA', 'AA',
               'AA', 'AA', 'AA']}
df = pd.DataFrame(data=d)
See how the transactions start at the same time and run concurrently with each other, but are "done" at different times. E.g. Bob's first set of transactions (rows 0-2) all take a different amount of time, but when I group by DATE, CLASS, and USER, I want the total wait time to be 1000ms (based on the second row's wait time).
On 08/07/2021, Biff has two sets of transactions starting at different times, but they still overlap into one wait time - 360000ms.
Expected output would look something like:
        DATE CLASS  USER    Wait
  2021-08-07    AA   Bob    1000
  2021-08-07    BB  Biff  360000
  2021-08-10    AA   Bob  360000
Like I mentioned, the actual data has several million lines of transactions - I am looking for help in finding something better (and hopefully faster) than what I have/found:
def getSecs1(grp):
    return pd.DatetimeIndex([]).union_many([pd.date_range(
        row.START_DATE, row.END_DATE, freq='25ms', closed='left')
        for _, row in grp.iterrows()]).size
I add an END_DATE column by adding the milliseconds to the START_DATE. I have to do it in chunks of 25ms, otherwise it would take way too long.
Any help/advice would be greatly appreciated.
Edit: changed the overlap to minutes.
This solution uses a package called staircase which is built on pandas and numpy for working with (mathematical) step functions. You can think of an interval as being a step function which goes from value 0 to 1 at the start of an interval and 1 to 0 at the end of an interval.
additional setup
convert START_DATE and ELAPSED_TIME to appropriate pandas time objects
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["ELAPSED_TIME"] = pd.to_timedelta(df["ELAPSED_TIME"])
define daily bins
dates = pd.period_range("2021-08-07", "2021-08-10")
solution
Define a function which takes a dataframe, makes a step function from start and end times (calculated as start + duration), sets non-zero values to 1, slices the step function with the bins, and integrates.
import staircase as sc

def calc_dates_for_user(df_):
    return (
        sc.Stairs(  # create the step function
            start=df_["START_DATE"],
            end=df_["START_DATE"] + df_["ELAPSED_TIME"],
        )
        .make_boolean()  # where two intervals overlap, the step function's value is 2;
                         # this sets all non-zero values to 1 (effectively a union of intervals)
        .slice(dates)    # analogous to groupby
        .integral() / pd.Timedelta("1ms")  # integrate each slice (equals the covered length)
                                           # and divide to express it in milliseconds
    )
When we groupby USER and CLASS and apply this function we get a dataframe, indexed by these variables, with a column index corresponding to intervals in the period range
USER  CLASS  [2021-08-07, 2021-08-08)  [2021-08-08, 2021-08-09)  [2021-08-09, 2021-08-10)  [2021-08-10, 2021-08-11)
Biff  BB                     360000.0                       0.0                       0.0                       0.0
Bob   AA                       1000.0                       0.0                       0.0                  360000.0
We'll clean it up like so
result = (
    df.groupby(["USER", "CLASS"])
    .apply(calc_dates_for_user)
    .melt(ignore_index=False, var_name="DATE", value_name="WAIT")  # melt the column index into a single column of daily intervals
    .query("WAIT != 0")   # filter out days where no time was recorded
    .reset_index()        # move USER and CLASS from the index to columns
)
result then looks like this
   USER CLASS                      DATE      WAIT
0  Biff    BB  [2021-08-07, 2021-08-08)  360000.0
1   Bob    AA  [2021-08-07, 2021-08-08)    1000.0
2   Bob    AA  [2021-08-10, 2021-08-11)  360000.0
To get your expected result, you can replace the DATE column with the timestamps corresponding to day-start:
result["DATE"] = pd.IntervalIndex(result["DATE"]).left
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
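If an extra dependency is not an option, the same union-of-intervals idea can be sketched in plain pandas by sorting each group's intervals and merging overlaps (my own sketch, shown on a subset of the sample data; union_ms is a hypothetical helper, not part of the answer above, and a Python-level loop per group will be slower than staircase on millions of rows):

```python
import pandas as pd

def union_ms(starts, ends):
    """Total length, in ms, of the union of [start, end) intervals."""
    intervals = sorted(zip(starts, ends))
    total = pd.Timedelta(0)
    cur_start, cur_end = intervals[0]
    for s, e in intervals[1:]:
        if s <= cur_end:                 # overlapping (or touching): extend
            cur_end = max(cur_end, e)
        else:                            # gap: bank the finished interval
            total += cur_end - cur_start
            cur_start, cur_end = s, e
    total += cur_end - cur_start
    return total / pd.Timedelta("1ms")

# subset of the question's data: Bob's first three rows and Biff's two long ones
df = pd.DataFrame({
    "START_DATE": ["2021-08-07 19:11:40", "2021-08-07 19:11:40", "2021-08-07 19:11:40",
                   "2021-08-07 19:20:40", "2021-08-07 19:21:40"],
    "ELAPSED_TIME": ["00:00:00.465", "00:00:01.000", "00:00:00.165",
                     "00:03:00.000", "00:05:00.000"],
    "USER": ["Bob", "Bob", "Bob", "Biff", "Biff"],
    "CLASS": ["AA", "AA", "AA", "BB", "BB"],
})
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["END_DATE"] = df["START_DATE"] + pd.to_timedelta(df["ELAPSED_TIME"])
df["DATE"] = df["START_DATE"].dt.date

result = (df.groupby(["DATE", "CLASS", "USER"])
            .apply(lambda g: union_ms(g["START_DATE"], g["END_DATE"]))
            .rename("WAIT")
            .reset_index())
```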

Comparing date columns between two dataframes

The project I'm working on requires me to find out which 'project' has been updated since the last time it was processed. For this purpose I have two dataframes which both contain three columns, the last one of which is a date signifying the last time a project is updated. The first dataframe is derived from a query on a database table which records the date a 'project' is updated. The second is metadata I store myself in a different table about the last time my part of the application processed a project.
I think I came pretty far but I'm stuck on the following error, see the code provided below:
lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24', '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])
unprocessed = lastmatch[~lastmatch.isin(processed)].dropna()
processed.set_index(['projectid', 'stage'], inplace=True)
lastmatch.set_index(['projectid', 'stage'], inplace=True)
processed.sort_index(inplace=True)
lastmatch.sort_index(inplace=True)
print(lastmatch['lastmatchdate'])
print(processed['process_date'])
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
The result I want to achieve is a dataframe containing the rows where the 'lastmatchdate' is greater than the date the project was last processed (process_date). However this line:
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
produces a ValueError: Can only compare identically-labeled Series objects. I think it might be some syntax I don't know or have gotten wrong.
The output I expect in this case is:
                 lastmatchdate
projectid stage
1         c         2020-08-31
So concretely the question is: how do I get a dataframe containing only the rows of another dataframe having the (datetime) value of column a greater than column b of the other dataframe.
merged = pd.merge(processed, lastmatch, left_index=True, right_index=True)
merged = merged.assign(to_process=merged['lastmatchdate'] > merged['process_date'])
You will get the following:
                process_date lastmatchdate  to_process
projectid stage
1         c       2020-08-30    2020-08-31        True
2         v       2013-11-24    2013-11-24       False
You've received the ValueError because you tried to compare two differently-labeled Series. If you want to compare two dataframes row by row, merge them first:
lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24', '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])
df = pd.merge(lastmatch, processed, on=['stage', 'projectid'])
df = df[df.lastmatchdate > df.process_date]
print(df)
  projectid stage lastmatchdate process_date
0         1     c    2020-08-31   2020-08-30
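As a side note, a left merge with indicator=True can additionally flag projects that were never processed at all (what the unprocessed/isin attempt in the question seems to be after) - a sketch on the same data:

```python
import pandas as pd

lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': pd.to_datetime(['2020-08-31', '2013-11-24',
                                     '2013-11-24', '2020-08-31'])
})
processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': pd.to_datetime(['2020-08-30', '2013-11-24'])
})

merged = lastmatch.merge(processed, on=['projectid', 'stage'],
                         how='left', indicator=True)
# never processed: no matching (projectid, stage) in `processed`
never_processed = merged[merged['_merge'] == 'left_only']
# updated since the last processing run
to_process = merged[(merged['_merge'] == 'both') &
                    (merged['lastmatchdate'] > merged['process_date'])]
```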

Dask categorize() won't work after using .loc

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.
However, after I split the dataframes and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require the category columns to be 'known').
NOTE: I have created a minimum example as suggested using pandas instead of read_csv().
import pandas as pd
import dask.dataframe as dd
# Specify dtypes
b_dtypes = {
    'symbol': 'category',
    'price': 'float64',
}
i_dtypes = {
    'symbol': 'category',
    'price': 'object'
}

# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = df.loc[:, column].astype(dtype)
    return df

# Set up our test data
data = [
    ['B', 'IBN', '9.9800'],
    ['B', 'PAY', '21.5000'],
    ['I', 'PAY', 'seventeen'],
    ['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)
#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#
# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]
# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)
# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()
# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
#
## print() returns 'False' for both, this makes me want to kill myself.
## (Please help...)
#
UPDATE: It seems that if I set the 'npartitions' parameter to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However, loading both dataframes into only two partitions is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?
The answer to your problem is basically contained in the dask docs. I'm referring to the example code commented with # categorize requires computation, and results in known categoricals. I'll expand on it here, because it seems to me you're misusing loc.
import pandas as pd
import dask.dataframe as dd
# Set up our test data
data = [['B', 'IBN', '9.9800'],
        ['B', 'PAY', '21.5000'],
        ['I', 'PAY', 'seventeen'],
        ['I', 'SPY', 'ten']]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)
# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)
# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])
# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
