Convert 2D dataframe to 3D numpy array based on unique ID - python

I have a dataframe in this format:
time column           ID column   Value
2022-01-01 00:00:00   1           10
2022-01-01 00:15:00   1           0
2022-01-01 00:30:00   1           9
2022-01-01 00:45:00   1           0
2022-01-02 00:00:00   1           0
2022-01-02 00:15:00   1           0
2022-01-02 00:30:00   1           5
2022-01-02 00:45:00   1           15
2022-01-01 00:00:00   2           6
2022-01-01 00:15:00   2           2
2022-01-01 00:30:00   2           0
2022-01-01 00:45:00   2           0
2022-01-02 00:00:00   2           0
2022-01-02 00:15:00   2           0
2022-01-02 00:30:00   2           0
2022-01-02 00:45:00   2           7
... though my dataframe is much larger, with more than 500 IDs.
I want to convert this 2D dataframe into a 3D array in the format (num_time_samples, value, ID). Essentially I would like to have one 2D array for every unique ID.
I plan on using the value column to build lag-based feature vectors, but I'm stuck on how to convert the dataframe. I've searched and tried df.value, reshaping, etc., and nothing has worked.

Say you have
import pandas as pd

df = pd.DataFrame(
    {
        'time column': [
            '00:00:00', '00:15:00', '00:00:00', '00:15:00',
        ],
        'ID column': [
            1, 1, 2, 2,
        ],
        'Value': [
            10, 0, 6, 2,
        ],
    }
)
where df is effectively a subset of your dataframe, keeping everything data-type-naive.
I want to convert this 2D dataframe into a 3D array in this format (num_time_samples, value, ID).
Why not do
a = (
    df
    .set_index(['time column', 'ID column'])
    .unstack(level=-1)   # which leaves 'time column' as the first dimension index
    .to_numpy()
    .reshape(
        (
            df['time column'].unique().size,
            df['ID column'].unique().size,
            1,
        )
    )
)
a looks like
>>> a
array([[[10],
        [ 6]],

       [[ 0],
        [ 2]]], dtype=int64)
>>> a.shape
(2, 2, 1)
>>> a.ndim
3
a is structured as time column × ID column × Value (and indexable accordingly). E.g. let's get individuals' 00:15:00-data
>>> a[1]  # <=> a[1, ...] <=> a[1, :, :]
array([[0],
       [2]], dtype=int64)
Let's get the first and second individual's time series, respectively,
>>> a[:, 0]  # <=> a[:, 0, :] <=> a[..., 0, :]
array([[10],
       [ 0]], dtype=int64)
>>> a[:, 1]
array([[6],
       [2]], dtype=int64)
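Since the question mentions building lag-based feature vectors from the Value column, here is a minimal sketch of how lags could be stacked along the time axis of a. The helper name and the number of lags are illustrative assumptions, not part of the original answer:

import numpy as np

def lagged_features(a, n_lags=3):
    """Stack lagged copies of a (time, ID, 1) array along the last axis.

    Returns shape (time - n_lags, ID, n_lags + 1), where the last axis holds
    [value_t, value_{t-1}, ..., value_{t-n_lags}] for each ID.
    """
    lags = [a[n_lags - k: a.shape[0] - k, :, 0] for k in range(n_lags + 1)]
    return np.stack(lags, axis=-1)

# On the tiny example above only one lag fits; use more lags on the full data.
features = lagged_features(a, n_lags=1)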

pandas RollingGroupBy agg 'size' of rolling group (not 'count')

It is possible to perform a
df.groupby.rolling.agg({'any_df_col': 'count'})
But how about a size agg?
'count' will produce a series with the 'running count' of rows that match the groupby condition (1, 1, 1, 2, 3, ...), but I would like to know, for all of those rows, the total number of rows that match the groupby (so 1, 1, 3, 3, 3 in that case).
Usually in pandas I think this is achieved by using size instead of count.
This code may illustrate.
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4),
    ],
    'value': [1, 2, 1, 10, 10, 10, 10],
    'type': [0, 0, 0, 0, 0, 0, 0],
})
df = df.set_index(pd.DatetimeIndex(df['time_ref']), drop=True)

by = ['value']
window = '1H'
gb_rolling = df.groupby(by=by).rolling(window=window)

agg_d = {'type': 'count'}
test = gb_rolling.agg(agg_d)
print(test)
# this works
                           type
value time_ref
1     2023-01-01 00:30:00   1.0
      2023-01-01 01:00:00   2.0
2     2023-01-01 00:30:00   1.0
10    2023-01-01 02:00:00   1.0
      2023-01-01 02:15:00   2.0
      2023-01-01 02:16:00   3.0
      2023-01-01 04:00:00   1.0
# but this doesn't
agg_d = {'type': 'size'}
test = gb_rolling.agg(agg_d)
# AttributeError: 'size' is not a valid function for 'RollingGroupby' object
my desired output is to get the SIZE of the group ... this:
                          type
value time_ref
1     2023-01-01 00:30:00    2
      2023-01-01 01:00:00    2
2     2023-01-01 00:30:00    1
10    2023-01-01 02:00:00    3
      2023-01-01 02:15:00    3
      2023-01-01 02:16:00    3
      2023-01-01 04:00:00    1
I cannot think of a way to do what I need without using the rolling functionality, because the relevant windows of my data are not determined by calendar time but by the time of the events themselves. If that assumption is wrong and I can get a 'size' without using rolling, that is OK, but as far as I know I have to use rolling, since the time_ref of the event is what matters for grouping with subsequent rows, not pure calendar time.
Thanks.
I'm not completely following your question. It seems like you want the type column to be the number of rows of a given value for each 1-hour increment... But if that's the case your desired output is incorrect, and should be:
value  time_ref             type
1      2023-01-01 00:30:00     1   # <- not 2 here (1 in 0-hr, 1 in 1-hr window)
       2023-01-01 01:00:00     1   # <- same here
2      2023-01-01 00:30:00     1   # rest is ok...
...
If that's correct, then, starting with:
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4)],
    'value': [1, 2, 1, 10, 10, 10, 10]})
...just add an hour column:
df['hour'] = df.time_ref.dt.hour
and aggregate on that and value:
tmp = (
    df.groupby(['value', 'hour'])
      .agg('count')
      .reset_index()
      .rename(columns={'time_ref': 'type'}))
which gives you:
   value  hour  type
0      1     0     1
1      1     1     1
2      2     0     1
3     10     2     3
4     10     4     1
...which you can join back onto your original df:
res = df.merge(tmp, how='left', on=['value', 'hour'])
             time_ref  value  hour  type
0 2023-01-01 00:30:00      1     0     1
1 2023-01-01 00:30:00      2     0     1
2 2023-01-01 01:00:00      1     1     1
3 2023-01-01 02:00:00     10     2     3
4 2023-01-01 02:15:00     10     2     3
5 2023-01-01 02:16:00     10     2     3
6 2023-01-01 04:00:00     10     4     1
If that's not what you're looking for, please clarify your question.
Ah.. thanks for clarifying. I understand the problem now.
I played around with rolling, but couldn't find a way to get it to work either... but here is an alternate method:
df = pd.DataFrame({
    'time_ref': [
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 0, 30),
        dt.datetime(2023, 1, 1, 1),
        dt.datetime(2023, 1, 1, 2),
        dt.datetime(2023, 1, 1, 2, 15),
        dt.datetime(2023, 1, 1, 2, 16),
        dt.datetime(2023, 1, 1, 4)],
    'value': [1, 2, 1, 10, 10, 10, 10]})
df.index = df.time_ref

value_start = df.groupby('value').agg(min)

df['hrs_since_group_start'] = df.apply(
    lambda row: row.time_ref - value_start.loc[row.value, 'time_ref'],
    axis=1
).view(int) / 1_000_000_000 / 60 // 60
(.view(int) converts the timedelta to nanoseconds, so / 1_000_000_000 / 60 turns it into minutes since the group's first event, and // 60 into the number of whole hours since the group started.)
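As an aside, a sketch of that same conversion (my own variation, not the answer's code) that avoids the nanosecond arithmetic by using transform and .dt.total_seconds():

# whole hours elapsed since each value-group's first event
group_start = df.groupby('value')['time_ref'].transform('min')
df['hrs_since_group_start'] = (df['time_ref'] - group_start).dt.total_seconds() // 3600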
group_hourly_counts = (
    df.groupby(['value', 'hrs_since_group_start'])
      .agg('count')
      .reset_index()
      .rename(columns={'time_ref': 'type'}))

res = (
    df.merge(
        group_hourly_counts,
        how='left',
        on=['value', 'hrs_since_group_start'])
      .drop(columns='hrs_since_group_start'))
res:
             time_ref  value  type
0 2023-01-01 00:30:00      1     2
1 2023-01-01 00:30:00      2     1
2 2023-01-01 01:00:00      1     2
3 2023-01-01 02:00:00     10     3
4 2023-01-01 02:15:00     10     3
5 2023-01-01 02:16:00     10     3
6 2023-01-01 04:00:00     10     1
...somebody more familiar with the rolling functionality can probably find you a simpler solution though :)
If .rolling in combination with count doesn't work, then I don't think this is really a "rolling" problem. You could try the following (I think it's similar to Damian's second answer):
df = df.assign(
    hours=df["time_ref"].sub(df.groupby("value")["time_ref"].transform("first"))
                        .dt.seconds.floordiv(3_600),
    type=lambda df: df.groupby(["value", "hours"]).transform("size")
).drop(columns="hours").set_index(["value", "time_ref"]).sort_index()
Result for the sample:
                          type
value time_ref
1     2023-01-01 00:30:00    2
      2023-01-01 01:00:00    2
2     2023-01-01 00:30:00    1
10    2023-01-01 02:00:00    3
      2023-01-01 02:15:00    3
      2023-01-01 02:16:00    3
      2023-01-01 04:00:00    1

Find row with nan value and delete it

I have a dataframe. This dataframe contains three columns: id, horstid, and date. The date column has one NaN value. The code below works with pandas, but I want to do the same with numpy.
First I want to transform my dataframe to a numpy array. After that I want to find all rows where the date is NaN and print them. Then I want to remove those rows. But how can I do this in numpy?
This is my dataframe
   id  horstid        date
0   1       11  2008-09-24
1   2       22         NaN
2   3       33  2008-09-18
3   4       33  2008-10-24
This is my code. It works fine, but with pandas.
import numpy as np
import pandas as pd

d = {'id': [1, 2, 3, 4], 'horstid': [11, 22, 33, 33], 'date': ['2008-09-24', np.nan, '2008-09-18', '2008-10-24']}
df = pd.DataFrame(data=d)
df['date'].isna()
[OUT]
0 False
1 True
2 False
3 False
df.drop(df.index[df['date'].isna() == True])
[OUT]
   id  horstid        date
0   1       11  2008-09-24
2   3       33  2008-09-18
3   4       33  2008-10-24
What I want is the above code without pandas but with numpy.
npArray = df.to_numpy()
date = npArray [:,2].astype(np.datetime64)
[OUT]
ValueError: Cannot create a NumPy datetime other than NaT with generic units
Here's a solution based on NumPy and pure Python:
df = pd.DataFrame.from_dict(dict(horstid=[11, 22, 33, 33], id=[1, 2, 3, 4], data=['2008-09-24', np.nan, '2008-09-18', '2008-10-24']))
a = df.values
index = list(map(lambda x: type(x) != type(1.), a[:, 2]))
print(a[index, :])
[[11 1 '2008-09-24']
[33 3 '2008-09-18']
[33 4 '2008-10-24']]
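For reference, a mask-based sketch (my variation, not the answer above): pd.isna works element-wise on the object array, so the NaN rows can be printed and dropped without type checks. If pandas must be avoided entirely, the NaN != NaN trick gives the same mask.

import numpy as np
import pandas as pd

d = {'id': [1, 2, 3, 4], 'horstid': [11, 22, 33, 33],
     'date': ['2008-09-24', np.nan, '2008-09-18', '2008-10-24']}
arr = pd.DataFrame(data=d).to_numpy()

mask = pd.isna(arr[:, 2])   # True where the date entry is NaN
print(arr[mask])            # the rows that will be removed
cleaned = arr[~mask]        # the array without the NaN rows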

Create a new column in pandas depending on values from two other columns

I have an example data as:
datetime      column1   column2
2020-01-01    5         [0,0,0,1]
2020-01-02    4         [0,0,0,0]
2020-01-03    10        [1,1,1,0]
2020-01-04    2         [1,1,1,1]
I want a new column called action which should be 1 if column1 is below 3 or above 5, and otherwise the df.column2.any(axis=1) values.
The example output should look like this:
datetime      column1   column2     action
2020-01-01    5         [0,0,0,1]   1
2020-01-02    2         [0,0,0,0]   1
2020-01-03    10        [1,1,1,0]   1
2020-01-04    4         [0,0,0,0]   0
Use numpy.where with Series.between and any:
df['action'] = np.where(df.column1.between(3,5), df.column2.apply(any), 1)
print(df)
     datetime  column1       column2  action
0  2020-01-01        5  [0, 0, 0, 1]       1
1  2020-01-02        2  [0, 0, 0, 0]       1
2  2020-01-03       10  [1, 1, 1, 0]       1
3  2020-01-04        4  [0, 0, 0, 0]       0
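A pandas-only equivalent of the same idea (my sketch, not part of the answer), using Series.where instead of numpy.where:

# keep any(column2) where column1 lies in [3, 5], otherwise force action to 1
df['action'] = df['column2'].apply(any).astype(int).where(df['column1'].between(3, 5), 1)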

Pandas groupby with identification of an element with max value in another column

I have a dataframe with sales results of items with different pricing rules:
import pandas as pd
from datetime import timedelta
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
# Create datetimes and data
df_1['item'] = [1, 1, 2, 2, 2]
df_1['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_1['price_rule'] = ['a', 'b', 'a', 'b', 'b']
df_1['sales']= [2, 4, 1, 5, 7]
df_1['clicks']= [7, 8, 9, 10, 11]
df_2['item'] = [1, 1, 2, 2, 2]
df_2['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_2['price_rule'] = ['b', 'b', 'a', 'a', 'a']
df_2['sales']= [2, 3, 4, 5, 6]
df_2['clicks']= [7, 8, 9, 10, 11]
df_3['item'] = [1, 1, 2, 2, 2]
df_3['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_3['price_rule'] = ['b', 'a', 'b', 'a', 'b']
df_3['sales']= [6, 5, 4, 5, 6]
df_3['clicks']= [7, 8, 9, 10, 11]
df = pd.concat([df_1, df_2, df_3])
df = df.sort_values(['item', 'date'])
df.reset_index(drop=True)
df
It results with:
   item       date price_rule  sales  clicks
0     1 2018-01-01          a      2       7
0     1 2018-01-01          b      2       7
0     1 2018-01-01          b      6       7
1     1 2018-01-02          b      4       8
1     1 2018-01-02          b      3       8
1     1 2018-01-02          a      5       8
2     2 2018-01-03          a      1       9
2     2 2018-01-03          a      4       9
2     2 2018-01-03          b      4       9
3     2 2018-01-04          b      5      10
3     2 2018-01-04          a      5      10
3     2 2018-01-04          a      5      10
4     2 2018-01-05          b      7      11
4     2 2018-01-05          a      6      11
4     2 2018-01-05          b      6      11
My goal is to:
1. group all items by day (to get a single row for each item and given day)
2. aggregate 'clicks' with 'sum'
3. generate a 'winning_pricing_rule' column as follows:
- for a given item and given date, take the pricing rule with the highest 'sales' value
- in case of a 'draw' (see e.g. item 2 on 2018-01-03 in the sample above): choose just one of them (that's rare in my dataset, so it can be random...)
I imagine the result to look like this:
   item       date  winning_price_rule  clicks
0     1 2018-01-01                   b      21
1     1 2018-01-02                   a      24
2     2 2018-01-03                   b      27   << remark: could also be a (due to draw)
3     2 2018-01-04                   a      30   << remark: could also be b (due to draw)
4     2 2018-01-05                   b      33
I tried:
a.groupby(['item', 'date'], as_index = False).agg({'sales':'sum','revenue':'max'})
but failed to identify a winning pricing rule.
Any ideas? Many Thanks for help :)
Andy
First convert the price_rule column to the index with DataFrame.set_index, so that the winning price rule can be picked with DataFrameGroupBy.idxmax (which returns the index value at the maximum 'sales') inside GroupBy.agg, where the 'clicks' sum is aggregated as well:
df1 = (df.set_index('price_rule')
         .groupby(['item', 'date'])
         .agg({'sales': 'idxmax', 'clicks': 'sum'})
         .reset_index())
For pandas 0.25+ it is possible to use named aggregation:
df1 = (df.set_index('price_rule')
         .groupby(['item', 'date'])
         .agg(winning_pricing_rule=pd.NamedAgg(column='sales', aggfunc='idxmax'),
              clicks=pd.NamedAgg(column='clicks', aggfunc='sum'))
         .reset_index())
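An alternative sketch (not part of the answer above) that avoids idxmax: sort by sales, keep the last row per (item, date) as the winner, and merge the summed clicks back in. Ties are broken arbitrarily by sort order, which the question allows:

winners = (df.sort_values('sales')
             .drop_duplicates(['item', 'date'], keep='last')
             .loc[:, ['item', 'date', 'price_rule']]
             .rename(columns={'price_rule': 'winning_price_rule'}))

clicks = df.groupby(['item', 'date'], as_index=False)['clicks'].sum()
res = winners.merge(clicks, on=['item', 'date']).sort_values(['item', 'date'])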

Pad rows with no data as Na in Pandas Dataframe

I have an np array of timestamps:
ts = np.arange(5)
In [34]: ts
Out[34]: array([0, 1, 2, 3, 4])
and I have a pandas DataFrame:
data = pd.DataFrame([10, 10, 10], index = [0,3,4])
In [33]: data
Out[33]:
0
0 10
3 10
4 10
The index of data is guaranteed to be a subset of ts. I want to generate the following data frame:
res:
0 10
1 nan
2 nan
3 10
4 10
So I want the index to be ts and the values to come from data, but for rows whose timestamp doesn't exist in data, I want NaN. How can I do this?
You are looking for the reindex function.
For example:
data.reindex(index=ts)
Output:
0
0 10
1 NaN
2 NaN
3 10
4 10
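Putting the question's setup and the answer together into a minimal runnable sketch:

import numpy as np
import pandas as pd

ts = np.arange(5)                                    # full set of timestamps
data = pd.DataFrame([10, 10, 10], index=[0, 3, 4])   # sparse data

res = data.reindex(index=ts)   # timestamps missing from data become NaN rows
print(res)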
