I need to extract a block (contiguous) of lines around a particular date specified by the presence (not NaN value) in q1 column. By block I mean k days before the date and p days after the date.
For example, using the following dataframe, and setting k=5, p=2, I need to get the following blocks:
participant_id response_date q1 summary
0 11.0 2016-04-27 NaN NaN
1 11.0 2016-04-30 NaN 2.0
2 11.0 2016-05-01 1089.0 3.0
3 11.0 2016-05-02 NaN 3.0
4 11.0 2016-05-03 NaN 3.0
5 11.0 2016-05-04 NaN 3.0
6 11.0 2016-05-05 NaN 3.0
7 11.0 2016-05-06 NaN 3.0
8 11.0 2016-05-07 NaN 4.0
9 11.0 2016-05-08 NaN 4.0
10 11.0 2016-05-09 NaN 3.0
11 11.0 2016-05-10 NaN 3.0
12 11.0 2016-05-11 NaN 3.0
13 11.0 2016-05-12 NaN 3.0
14 11.0 2016-05-13 NaN 3.0
15 11.0 2016-05-14 NaN 3.0
16 11.0 2016-05-15 NaN 3.0
17 11.0 2016-05-16 NaN 3.0
18 11.0 2016-05-17 NaN 4.0
19 11.0 2016-05-18 NaN 3.0
20 11.0 2016-05-19 NaN 3.0
21 11.0 2016-05-20 NaN 3.0
22 11.0 2016-05-21 NaN 4.0
23 11.0 2016-05-22 NaN 4.0
24 11.0 2016-05-23 NaN 4.0
25 11.0 2016-05-24 NaN 3.0
26 11.0 2016-05-25 NaN 3.0
27 11.0 2016-05-26 NaN 3.0
28 11.0 2016-05-27 NaN 3.0
29 11.0 2016-05-28 NaN 3.0
30 11.0 2016-05-29 NaN 3.0
31 11.0 2016-05-30 NaN 3.0
32 11.0 2016-05-31 NaN 4.0
33 11.0 2016-06-01 NaN 4.0
34 11.0 2016-06-02 802.0 3.0
35 11.0 2016-06-03 NaN 3.0
36 11.0 2016-06-04 NaN 3.0
37 11.0 2016-06-05 NaN 3.0
38 11.0 2016-06-06 NaN 3.0
39 11.0 2016-06-07 NaN 3.0
40 11.0 2016-06-08 NaN 3.0
41 11.0 2016-06-09 NaN 3.0
42 11.0 2016-06-10 NaN 3.0
43 11.0 2016-06-11 NaN 5.0
44 11.0 2016-06-12 NaN 3.0
45 11.0 2016-06-13 NaN 4.0
46 11.0 2016-06-14 NaN 4.0
47 11.0 2016-06-15 NaN 3.0
48 11.0 2016-06-16 NaN 3.0
49 11.0 2016-06-17 NaN 3.0
Block 1 (up to 5 days before the date where q1 is not NaN, and 2 days after):
0 11.0 2016-04-27 NaN NaN
1 11.0 2016-04-30 NaN 2.0
2 11.0 2016-05-01 1089.0 3.0
3 11.0 2016-05-02 NaN 3.0
4 11.0 2016-05-03 NaN 3.0
Block 2:
30 11.0 2016-05-29 NaN 3.0
31 11.0 2016-05-30 NaN 3.0
32 11.0 2016-05-31 NaN 4.0
33 11.0 2016-06-01 NaN 4.0
34 11.0 2016-06-02 802.0 3.0
35 11.0 2016-06-03 NaN 3.0
36 11.0 2016-06-04 NaN 3.0
I have implemented this algorithm in a straightforward way, with loops and conditional flows. However, that is pretty slow for a large data set, and I would like to learn a more Pythonic/pandas-idiomatic solution. I anticipate it may involve the groupby function.
Since I do not have your starting code or data, I'll try my best. Assuming your response_date column has a datetime dtype:
import datetime as dt
# dates on which q1 holds a value
dates_not_null = your_df.loc[your_df.q1.notnull(), 'response_date']
# label-based date slicing needs response_date as a sorted index
indexed = your_df.set_index('response_date').sort_index()
blocks = []
for i in dates_not_null:
    blocks.append(indexed.loc[i - dt.timedelta(days=k) : i + dt.timedelta(days=p)])
You can then concatenate the collected dataframes or do whatever you want with them.
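To make the "collect, then concatenate" step concrete, here is a minimal sketch on made-up data. Only the column names response_date and q1 come from the question; the sample values, k=2 and p=1 are assumptions chosen to keep the example short.

```python
import datetime as dt
import pandas as pd

nan = float("nan")
# Hypothetical sample: 12 consecutive days with q1 present on two of them.
df = pd.DataFrame({
    "response_date": pd.date_range("2016-05-01", periods=12, freq="D"),
    "q1": [nan, nan, nan, 10.0, nan, nan, nan, nan, nan, 20.0, nan, nan],
})
k, p = 2, 1  # days before / after each event date

indexed = df.set_index("response_date").sort_index()
event_dates = df.loc[df["q1"].notnull(), "response_date"]

# One block per event date, keyed by that date.
blocks = {}
for d in event_dates:
    blocks[d] = indexed.loc[d - dt.timedelta(days=k): d + dt.timedelta(days=p)]

# Concatenating a dict puts its keys into an outer index level.
result = pd.concat(blocks)
result.index.names = ["mid_date", "response_date"]
print(result)
```

Because each block is sliced independently, overlapping windows simply repeat the shared rows under different mid_date keys.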
Using a helper function to get a dict of DataFrames and concatenate them:
from dateutil.relativedelta import relativedelta
def get_block(obj, d, k, p):
    # obj -> dataframe; d -> date
    start = d - relativedelta(days=k)
    end = d + relativedelta(days=p)
    obj = obj.set_index('response_date')
    return obj.loc[start:end]
dates = df[df.q1.notnull()]['response_date'].tolist()
result = {}
k = 5
p = 2
for d in dates:
    result[d] = get_block(df, d, k, p)
print(result[dates[0]])
participant_id q1 summary
response_date
2016-04-27 11 NaN NaN
2016-04-30 11 NaN 2.0
2016-05-01 11 1089.0 3.0
2016-05-02 11 NaN 3.0
2016-05-03 11 NaN 3.0
Then you can just concatenate this result:
result = pd.concat((result))
result.index = result.index.rename(['mid_date', 'response_date'])
print(result)
participant_id q1 summary
mid_date response_date
2016-05-01 2016-04-27 11 NaN NaN
2016-04-30 11 NaN 2.0
2016-05-01 11 1089.0 3.0
2016-05-02 11 NaN 3.0
2016-05-03 11 NaN 3.0
2016-06-02 2016-05-28 11 NaN 3.0
2016-05-29 11 NaN 3.0
2016-05-30 11 NaN 3.0
2016-05-31 11 NaN 4.0
2016-06-01 11 NaN 4.0
2016-06-02 11 802.0 3.0
2016-06-03 11 NaN 3.0
2016-06-04 11 NaN 3.0
I think a loop is pretty unavoidable here given that you may have overlapping sub-sections of your input.
Related
I have a column vector with, say, 30 values (1-30). I would like to reshape this vector into a matrix with 5 values in the first column, 10 values in the second and 15 values in the third. How would I implement this using Pandas or NumPy?
import numpy as np
import pandas as pd
# Create data
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
1
2
:
28
29
30
In order to get something like this:
# Manipulate the column vector to make columns where the first column has 5
# the second column has 10 and the last column has 15 values
'T1' 'T2' 'T3'
1 6 16
2 7 17
3 8 18
4 9 19
5 10 20
NA 11 21
NA 12 22
NA 13 23
NA 14 24
NA 15 25
NA NA 26
NA NA 27
NA NA 28
NA NA 29
NA NA 30
It took a little time to work out which series this is; it turns out to be the triangular series, just a modified one.
tri = lambda x:int((0.25+2*x)**0.5-0.5)
This would give results like:
0 1 1 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 ...
And after the modification:
modtri = lambda x:int((0.25+2*(x//5))**0.5-0.5)
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
So each value of the normal triangular series is repeated 5 times.
The modtri function above maps an index starting from 0 directly to the appropriate group id.
After that, this does the job:
df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
Full execution:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.linspace(1,30,30))
N = 5 #the increment value
modtri = lambda x:int((0.25+2*(x//N))**0.5-0.5)
df2 = df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
df2.rename(columns={0: "T1", 1: "T2",2:"T3"},inplace=True)
print(df2)
Output:
T1 T2 T3
0 1.0 6.0 16.0
1 2.0 7.0 17.0
2 3.0 8.0 18.0
3 4.0 9.0 19.0
4 5.0 10.0 20.0
5 NaN 11.0 21.0
6 NaN 12.0 22.0
7 NaN 13.0 23.0
8 NaN 14.0 24.0
9 NaN 15.0 25.0
10 NaN NaN 26.0
11 NaN NaN 27.0
12 NaN NaN 28.0
13 NaN NaN 29.0
14 NaN NaN 30.0
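As a quick sanity check of the group-id mapping used above, the sketch below counts how many of the 30 row positions land in each group. The formula is the one from the answer; wrapping it in a named function and using Counter are just choices made for this illustration.

```python
from collections import Counter

N = 5  # the increment value: the first column holds N values, the next 2N, then 3N

def modtri(x):
    # Same formula as the lambda above: 0-based row position -> 0-based column id.
    return int((0.25 + 2 * (x // N)) ** 0.5 - 0.5)

sizes = Counter(modtri(i) for i in range(30))
print(dict(sizes))  # {0: 5, 1: 10, 2: 15}
```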
Try slicing the column and resetting the index of each slice:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
Original data before operation:
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 20.0
20 21.0
21 22.0
22 23.0
23 24.0
24 25.0
25 26.0
26 27.0
27 28.0
28 29.0
29 30.0
Running the new code:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
print(df)
0 T1 T2 T3
0 1.0 1.0 6.0 16.0
1 2.0 2.0 7.0 17.0
2 3.0 3.0 8.0 18.0
3 4.0 4.0 9.0 19.0
4 5.0 5.0 10.0 20.0
5 6.0 NaN 11.0 21.0
6 7.0 NaN 12.0 22.0
7 8.0 NaN 13.0 23.0
8 9.0 NaN 14.0 24.0
9 10.0 NaN 15.0 25.0
10 11.0 NaN NaN 26.0
11 12.0 NaN NaN 27.0
12 13.0 NaN NaN 28.0
13 14.0 NaN NaN 29.0
14 15.0 NaN NaN 30.0
15 16.0 NaN NaN NaN
16 17.0 NaN NaN NaN
17 18.0 NaN NaN NaN
18 19.0 NaN NaN NaN
19 20.0 NaN NaN NaN
20 21.0 NaN NaN NaN
21 22.0 NaN NaN NaN
22 23.0 NaN NaN NaN
23 24.0 NaN NaN NaN
24 25.0 NaN NaN NaN
25 26.0 NaN NaN NaN
26 27.0 NaN NaN NaN
27 28.0 NaN NaN NaN
28 29.0 NaN NaN NaN
29 30.0 NaN NaN NaN
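If the column sizes are not fixed in advance, the same slicing idea can be driven by a list of sizes. This is a sketch, not part of the answer above: the sizes list and the T1..T3 naming loop are assumptions, and np.split is used in place of hand-written slices.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 31)
sizes = [5, 10, 15]  # hypothetical parameter: desired length of each column

# Split points are the cumulative sizes, excluding the final total.
chunks = np.split(values, np.cumsum(sizes)[:-1])

# Series of unequal length are aligned on their index, so shorter ones are padded with NaN.
wide = pd.DataFrame({f"T{i + 1}": pd.Series(c) for i, c in enumerate(chunks)})
print(wide)
```

Unlike assigning new columns to the original frame, this builds a separate 15-row result, so there are no trailing all-NaN rows to drop.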
I have a DataFrame where I want to replace only the rows that are NaN in every column with the row below them. I tried solutions from several posts and used ffill, but that filled only a few cells rather than the entire row.
ss s h b sb
0 NaN NaN NaN NaN NaN
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 1.0 6.0 7.0 11.0 3.0
Expected output:
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
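You can build group labels by flagging rows that contain at least one non-missing value and taking a cumulative sum over the reversed row order, so that each all-NaN row shares a label with the next row that has data. Then pass those labels to GroupBy.bfill: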
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print (df)
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
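The steps can also be run one at a time on a reduced copy of the sample data. This sketch keeps only three of the five columns from the question; the intermediate names has_data and group_id are introduced here for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ss": [np.nan, 3, 9, np.nan, np.nan, 1],
    "s":  [np.nan, np.nan, 8, np.nan, np.nan, 6],
    "b":  [np.nan, np.nan, np.nan, np.nan, np.nan, 11],
})

# True for rows that hold at least one real value.
has_data = df.notna().any(axis=1)

# Reverse, accumulate, reverse back: every all-NaN row gets the label
# of the nearest following row that has data.
group_id = has_data.iloc[::-1].cumsum().iloc[::-1]

# Back-fill only within each group, so row 1 is never filled from row 2.
filled = df.groupby(group_id).bfill()
print(filled)
```

Note that row 1 keeps its own NaN in column s: it sits alone with row 0 in its group, so nothing from row 2 leaks into it.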
I am trying to sort out a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying the following, although I feel like it's a mishmash of a few commands:
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
but I want to filter the NaNs I fill using ffill if one of the columns is NaN. i.e.
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would only like to fill a row IFF the value of A is NaN, whilst leaving C,0 and D,0 as NaN. Giving the below dataframe
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So just to clarify, the ONLY rows that get replaced with ffill are 3,8 and the reason is because the value of column A in rows 3 and 8 are NaN
Thanks
---Update---
When I'm debugging and evaluate the expression : df.loc[df['A'].isna(), :]
I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume what's happening here is that I then attempt ffill on this new dataframe containing only rows 3 and 8, and obviously I can't ffill NaNs with NaNs.
Change the values only in those rows where column A is NaN:
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null. Then take those same rows from the forward-filled dataframe.
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
You just want the forward-filled values (DataFrame.ffill) where df['A'] is NaN (DataFrame.where), and the original values (df) everywhere else:
df=df.ffill().where(df['A'].isna(),df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
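One detail worth checking when selecting rows from a forward-filled frame is where each filled value comes from. The sketch below uses a small made-up frame (the values are assumptions, not the question's data) in which the row just above an all-NaN row is itself incomplete.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [45, 62, np.nan, 90, np.nan],
    "B": [88, np.nan, np.nan, 38, np.nan],
    "C": [np.nan, 2, np.nan, 34, np.nan],
})

mask = df["A"].isna()  # rows 2 and 4

out = df.copy()
# Only the masked rows are overwritten; other NaNs are left alone.
out.loc[mask] = df.ffill().loc[mask]
print(out)
```

Row 2 ends up as 62, 88, 2: ffill works column by column, so B is taken from row 0 because row 1 has no value there. If a filled row must be an exact copy of the row directly above it, this approach only guarantees that when the row above is complete, as it is in the question's data.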
I have a pandas dataframe that summarises sales by calendar month & outputs something like:
Month level_0 UNIQUE_ID 102018 112018 12018 122017 122018 22018 32018 42018 52018 62018 72018 82018 92018
0 SOLD_QUANTITY 01 3692.0 5182.0 3223.0 1292.0 2466.0 2396.0 2242.0 2217.0 3590.0 2593.0 1665.0 3371.0 3069.0
1 SOLD_QUANTITY 011 3.0 6.0 NaN NaN 7.0 5.0 2.0 1.0 5.0 NaN 1.0 1.0 3.0
2 SOLD_QUANTITY 02 370.0 130.0 NaN NaN 200.0 NaN NaN 269.0 202.0 NaN 201.0 125.0 360.0
3 SOLD_QUANTITY 03 2.0 6.0 NaN NaN 2.0 1.0 NaN 6.0 11.0 9.0 2.0 3.0 5.0
4 SOLD_QUANTITY 08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 175.0 NaN NaN
I want to be able to programmatically re-arrange the column headers in ascending date order (e.g. starting 122017, 12018, 22018...). It needs to be programmatic because the list of months differs every time the report runs: it runs every month and covers the last 365 days.
The index data type:
Index(['level_0', 'UNIQUE_ID', '102018', '112018', '12018', '122017', '122018',
'22018', '32018', '42018', '52018', '62018', '72018', '82018', '92018'],
dtype='object', name='Month')
Move the non-date columns into the index with set_index so that only the date columns remain, convert those column labels to datetimes, get their sorted positions with argsort, and then reorder the columns with iloc:
df = df.set_index(['level_0','UNIQUE_ID'])
df = df.iloc[:, pd.to_datetime(df.columns, format='%m%Y').argsort()].reset_index()
print (df)
level_0 UNIQUE_ID 122017 12018 22018 32018 42018 52018 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0 3590.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0 5.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0 202.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0 11.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN NaN
62018 72018 82018 92018 102018 112018 122018
0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN 175.0 NaN NaN NaN NaN NaN
Another idea is to convert the columns to a monthly period index with DatetimeIndex.to_period, which makes it possible to use sort_index:
df = df.set_index(['level_0','UNIQUE_ID'])
df.columns = pd.to_datetime(df.columns, format='%m%Y').to_period('m')
#alternative for convert to datetimes
#df.columns = pd.to_datetime(df.columns, format='%m%Y')
df = df.sort_index(axis=1).reset_index()
print (df)
level_0 UNIQUE_ID 2017-12 2018-01 2018-02 2018-03 2018-04 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN
2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11 2018-12
0 3590.0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 5.0 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 202.0 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 11.0 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN NaN 175.0 NaN NaN NaN NaN NaN
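If the original string labels should be kept and date parsing avoided altogether, the columns can also be ordered with a plain sort key. This is an alternative sketch, not the method from the answer above; it assumes every month label ends in a four-digit year, and id_cols, month_cols and month_key are names made up for the example.

```python
import pandas as pd

# Hypothetical one-row frame with a few of the month labels from the question.
df = pd.DataFrame(
    [["SOLD_QUANTITY", "01", 3692.0, 3223.0, 1292.0, 2396.0]],
    columns=["level_0", "UNIQUE_ID", "102018", "12018", "122017", "22018"],
)

id_cols = ["level_0", "UNIQUE_ID"]
month_cols = [c for c in df.columns if c not in id_cols]

def month_key(label):
    # Assumption: the last four characters are the year, the rest is the month.
    return int(label[-4:]), int(label[:-4])

df = df[id_cols + sorted(month_cols, key=month_key)]
print(list(df.columns))
# ['level_0', 'UNIQUE_ID', '122017', '12018', '22018', '102018']
```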
Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0
20 21.0 205.0
I am trying to group the rows by machine number: Machine_number 1 to 5 should form one group, 6 to 10 the next group, and so on.
I think you need to subtract 1 with sub and then apply floordiv:
df['g'] = df.Machine_number.sub(1).floordiv(5)
#same as //
#df['g'] = df.Machine_number.sub(1) // 5
print (df)
Machine_number Machine_Running_Hours g
0 1.0 424.0 -0.0
1 2.0 458.0 0.0
2 3.0 465.0 0.0
3 4.0 446.0 0.0
4 5.0 466.0 0.0
5 6.0 466.0 1.0
6 7.0 445.0 1.0
7 8.0 466.0 1.0
8 9.0 447.0 1.0
9 10.0 469.0 1.0
10 11.0 467.0 2.0
11 12.0 449.0 2.0
12 13.0 436.0 2.0
13 14.0 465.0 2.0
14 15.0 463.0 2.0
15 16.0 372.0 3.0
16 17.0 460.0 3.0
17 18.0 450.0 3.0
18 19.0 467.0 3.0
19 20.0 463.0 3.0
20 21.0 205.0 4.0
If you need to store the groups in a dictionary, use groupby with a dict comprehension:
dfs = {i:g for i, g in df.groupby(df.Machine_number.astype(int).sub(1).floordiv(5))}
print (dfs)
{0: Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0, 1: Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0, 2: Machine_number Machine_Running_Hours
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0, 3: Machine_number Machine_Running_Hours
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0, 4: Machine_number Machine_Running_Hours
20 21.0 205.0}
print (dfs[0])
Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
print (dfs[1])
Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
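When only per-group statistics are needed, the same group key can be passed straight to groupby without building the dictionary. A minimal sketch using the first 11 rows of the data above; the choice of count and mean is just an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Machine_number": [float(n) for n in range(1, 12)],
    "Machine_Running_Hours": [424.0, 458.0, 465.0, 446.0, 466.0,
                              466.0, 445.0, 466.0, 447.0, 469.0, 467.0],
})

# Machines 1-5 -> 0, 6-10 -> 1, 11-15 -> 2, ...
group = df["Machine_number"].astype(int).sub(1).floordiv(5).rename("g")

summary = df.groupby(group)["Machine_Running_Hours"].agg(["count", "mean"])
print(summary)
```

Casting to int first, as in the dictionary version, gives integer group labels instead of floats.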