Add missing days in a dataframe - python

I need to fill missing days in the column 'day':
id month day trans
0 0 8 1 9
1 0 8 2 5
2 0 8 3 10
3 0 8 4 6
4 0 8 6 4
5 0 8 8 4
I am looking for output:
id month day trans
0 0 8 1 9
1 0 8 2 5
2 0 8 3 10
3 0 8 4 6
4 0 8 5 NAN
5 0 8 6 4
6 0 8 7 NAN
7 0 8 8 4

Use reindex():
df1 = df.set_index('day').reindex(range(1, 9)).reset_index()  # days 1 to 8
df1[['month', 'id']] = df1[['month', 'id']].ffill()
Following your comment:
mux = pd.MultiIndex.from_product([df['id'].unique(), range(1, 9)], names=['id', 'day'])
df1 = df.set_index(['id', 'day']).reindex(mux).reset_index()
df1['month'] = df1['month'].ffill()
   id  day  month  #trans
0   0    1    8.0     9.0
1   0    2    8.0     5.0
2   0    3    8.0    10.0
3   0    4    8.0     6.0
4   0    5    8.0     NaN
5   0    6    8.0     4.0
6   0    7    8.0     NaN
7   0    8    8.0     4.0

I think the best way to deal with it is to build a pandas DataFrame that has all the [month, day] combinations of your desired output, and left-merge your original df onto it on the [id, month, day] key.
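A minimal sketch of that idea, assuming the sample frame above is named df and the full calendar for id 0, month 8 is days 1 through 8 (the helper name full is made up here):
full = pd.DataFrame({'id': 0, 'month': 8, 'day': range(1, 9)})
out = full.merge(df, on=['id', 'month', 'day'], how='left')  # 'trans' becomes NaN for the missing days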

Using pandas upsampling.
from datetime import datetime

df['date'] = df.apply(lambda x: datetime(2020, x['month'], x['day']), axis=1)
df = df.set_index('date')
# Upsampling to daily frequency inserts the missing days as NaN rows
df_daily = df.resample('D').asfreq().reset_index()
# reassign month and day from the reconstructed dates
df_daily['month'] = df_daily.date.dt.month
df_daily['day'] = df_daily.date.dt.day
df_daily['id'] = df_daily['id'].ffill().astype(int)
del df_daily['date']
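If the frame holds several ids, a per-group resample is one way to extend this. A sketch, assuming the DatetimeIndex set above and that the value column is named trans as in the question:
df_daily = (df.groupby('id')[['month', 'day', 'trans']]
              .resample('D')
              .asfreq()          # insert the missing calendar days within each id
              .reset_index())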

Related

Pandas Dataframe aggregating function to count also nan values

I have the following dataframe
print(A)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
And I have the following Code (Pandas Dataframe count occurrences that only happen immediately), which counts the occurrences of values that happen immediately one after another.
ser = A["1or0"].ne(A["1or0"].shift().bfill()).cumsum()
B = (
    A.groupby(ser, as_index=False)
     .agg({"Index": ["first", "last", "count"],
           "1or0": "unique"})
     .set_axis(["StartNum", "EndNum", "Size", "Value"], axis=1)
     .assign(Value=lambda d: d["Value"].astype(str).str.strip("[]"))
)
print(B)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
The issue is that when NaN values occur, the code doesn't group them together into one interval; it always counts each of them as a one-sized interval instead of, e.g., 3.
print(A2)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
10 11 NaN
11 12 NaN
12 13 NaN
print(B2)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 11 1 NaN
5 12 12 1 NaN
6 13 13 1 NaN
But I want B2 to be the following
print(B2Wanted)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 13 3 NaN
What do I need to change so that it works also with NaN?
First fillna with a value that cannot otherwise occur (here -1) before creating your grouper:
group = A['1or0'].fillna(-1).diff().ne(0).cumsum()
# or
# s = A['1or0'].fillna(-1)
# group = s.ne(s.shift()).cumsum()

B = (A.groupby(group, as_index=False)
      .agg(**{'StartNum': ('Index', 'first'),
              'EndNum': ('Index', 'last'),
              'Size': ('1or0', 'size'),
              'Value': ('1or0', 'first'),
              })
    )
Output:
StartNum EndNum Size Value
0 1 3 3 0.0
1 4 7 4 1.0
2 8 8 1 0.0
3 9 10 2 1.0
4 11 13 3 NaN
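For reference, a quick way to rebuild the A2 sample above and verify this (np.nan stands in for the NaN entries):
import numpy as np
import pandas as pd

A2 = pd.DataFrame({'Index': range(1, 14),
                   '1or0': [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, np.nan, np.nan, np.nan]})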

Casting a value based on a trigger in pandas

I would like to create a new column every time I get 1 in the 'Signal' column that will cast the corresponding value from the 'Value' column (please see the expected output below).
Initial data:
Index  Value  Signal
0      3      0
1      8      0
2      8      0
3      7      1
4      9      0
5      10     0
6      14     1
7      10     0
8      10     0
9      4      1
10     10     0
11     10     0
Expected Output:
Index  Value  Signal  New_Col_1  New_Col_2  New_Col_3
0      3      0       0          0          0
1      8      0       0          0          0
2      8      0       0          0          0
3      7      1       7          0          0
4      9      0       7          0          0
5      10     0       7          0          0
6      14     1       7          14         0
7      10     0       7          14         0
8      10     0       7          14         0
9      4      1       7          14         4
10     10     0       7          14         4
11     10     0       7          14         4
What would be a way to do it?
You can use a pivot:
out = df.join(df
              # keep only the values where Signal is 1
              # and get Signal's cumsum
              .assign(val=df['Value'].where(df['Signal'].eq(1)),
                      col=df['Signal'].cumsum())
              # pivot the cumsummed Signal to columns
              .pivot(index='Index', columns='col', values='val')
              # ensure column 0 is absent (using loc to avoid a KeyError)
              .loc[:, 1:]
              # forward fill the values
              .ffill()
              # rename the columns
              .add_prefix('New_Col_')
              )
Output:
Index Value Signal New_Col_1 New_Col_2 New_Col_3
0 0 3 0 NaN NaN NaN
1 1 8 0 NaN NaN NaN
2 2 8 0 NaN NaN NaN
3 3 7 1 7.0 NaN NaN
4 4 9 0 7.0 NaN NaN
5 5 10 0 7.0 NaN NaN
6 6 14 1 7.0 14.0 NaN
7 7 10 0 7.0 14.0 NaN
8 8 10 0 7.0 14.0 NaN
9 9 4 1 7.0 14.0 4.0
10 10 10 0 7.0 14.0 4.0
11 11 10 0 7.0 14.0 4.0
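If you need 0 instead of NaN before each first signal, as in the expected output, one small follow-up on the result above would be:
out = out.fillna(0)  # only the New_Col_* columns contain NaN here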
# create a new column name from the cumulative count of signals
df['new_col'] = 'new_col_' + df['Signal'].cumsum().astype(str)
# for rows with no signal, use a placeholder name
df['new_col'] = df['new_col'].mask(df['Signal']==0, '0')
# pivot table
df2 = (df.pivot(index=['Index','Signal','Value'], columns='new_col', values='Value')
         .reset_index()
         .ffill().fillna(0)                          # forward fill and fill remaining NaN with 0
         .drop(columns=['0','Index'])                # drop the extra columns
         .rename_axis(columns={'new_col':'Index'})   # rename the columns axis
         .astype(int))                               # cast to int, removing decimals
df2
Index Signal Value new_col_1 new_col_2 new_col_3
0 0 3 0 0 0
1 0 8 0 0 0
2 0 8 0 0 0
3 1 7 7 0 0
4 0 9 7 0 0
5 0 10 7 0 0
6 1 14 7 14 0
7 0 10 7 14 0
8 0 10 7 14 0
9 1 4 7 14 4
10 0 10 7 14 4
11 0 10 7 14 4

Count preceding non NaN values in pandas

I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each cell contains the count of non-NaN values up to and including that row in the same column. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
counts = df.copy()
for i in range(len(df)):
    counts.iloc[i] = df.iloc[0:i+1].notna().sum()
However, this row-by-row loop is very slow. My real DataFrame contains thousands of columns, so iterating like this is impractical because of the low processing speed. What can I do? Maybe something involving the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN values and cumsum for the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
Check with notna and cumsum
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
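As an aside, if the count should exclude the current row (strictly the preceding values), a shifted variant of the same idea should work:
# per-column count of non-NaN values strictly before each row
out = df.notna().cumsum().shift(1, fill_value=0)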

Grouping data in pandas by rows

I have data with this structure:
id month val
1 0 4
2 0 4
3 0 5
1 1 3
2 1 7
3 1 9
1 2 12
2 2 1
3 2 5
1 3 10
2 3 4
3 3 7
...
I want to get the mean val for each id, grouped by two-month periods. Expected result:
id two_months val
1 0 3.5
2 0 5.5
3 0 7
1 1 11
2 1 2.5
3 1 6
What's the simplest way to do it using Pandas?
If months are consecutive integers starting from 0, use integer division by 2:
df = df.groupby(['id',df['month'] // 2])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 0 3.5
1 2 0 5.5
2 3 0 7.0
3 1 1 11.0
4 2 1 2.5
5 3 1 6.0
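To match the column name in the expected result, a final rename is all that's left:
df = df.rename(columns={'month': 'two_months'})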
A possible solution with conversion to datetimes:
df.index = pd.to_datetime(df['month'].add(1), format='%m')
df = df.groupby(['id', pd.Grouper(freq='2MS')])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 1900-01-01 3.5
1 2 1900-01-01 5.5
2 3 1900-01-01 7.0
3 1 1900-03-01 11.0
4 2 1900-03-01 2.5
5 3 1900-03-01 6.0

pandas dataframe sum of shift(x) for x in range(1, n)

I have a dataframe like this, and want to add a new column that is the equivalent of applying shift n times and summing the results. For example, let n = 2:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=['a', 'b'])
a b
0 0 3
1 7 0
2 6 6
3 6 0
4 5 0
5 0 7
6 8 0
7 8 7
8 4 4
9 2 2
df['c'] = df['b'].shift(1) + df['b'].shift(2)
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
In this manner, column c gets the sum of the previous n values from column b.
Other than a loop, is there a better way to accomplish this for a large n?
You can use the rolling() method with a window of 2, summing and then shifting by one row:
df['c'] = df['b'].rolling(window=2).sum().shift()
df
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
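For a general n, the window size simply becomes a variable; a sketch with n as a parameter:
n = 2  # number of previous rows of column b to sum
df['c'] = df['b'].rolling(window=n).sum().shift()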
