pandas filling nan with previous row value multiplied with another column - python

I have a dataframe in which I want to fill NaN values with the previous row's value multiplied by the pct_change column:
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 nan 0.5
4 nan 1.3
5 nan 2
6 5 3
So for row 3, 10 * 0.5 = 5, and that filled value is then used to fill the next row if it is NaN.
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 5 0.5
4 6.5 1.3
5 13 2
6 5 3
I have used this
while df['col_to_fill'].isna().sum() > 0:
    df.loc[df['col_to_fill'].isna(), 'col_to_fill'] = df['col_to_fill'].shift(1) * df['pct_change']
but it's taking too much time, since each pass only fills the rows whose previous row is non-NaN.

Try cumprod after ffill:
s = df.col_to_fill.ffill()*df.loc[df.col_to_fill.isna(),'pct_change'].cumprod()
df['col_to_fill'] = df['col_to_fill'].fillna(s)
df
Out[90]:
col_to_fill pct_change
0 1.0 NaN
1 2.0 1.0
2 10.0 0.5
3 5.0 0.5
4 6.5 1.3
5 13.0 2.0
6 5.0 3.0
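As a self-contained check, the question's sample frame can be run through this answer end to end:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_to_fill': [1, 2, 10, np.nan, np.nan, np.nan, 5],
    'pct_change':  [np.nan, 1.0, 0.5, 0.5, 1.3, 2.0, 3.0],
})

# ffill supplies the last known value (10) on every NaN row; the
# cumulative product of pct_change over just the NaN rows chains the
# successive multiplications: 10*0.5, 10*0.5*1.3, 10*0.5*1.3*2.
s = df['col_to_fill'].ffill() * df.loc[df['col_to_fill'].isna(), 'pct_change'].cumprod()
df['col_to_fill'] = df['col_to_fill'].fillna(s)
print(df['col_to_fill'].tolist())  # [1.0, 2.0, 10.0, 5.0, 6.5, 13.0, 5.0]
```

Note the cumulative product here runs over all NaN rows at once, which is correct for the single gap in this data; with several separate NaN runs you would group the runs first so factors do not carry across gaps.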

Related

Pandas countif based on multiple conditions, result in new column

How can I add a field that returns 1/0 depending on whether the value in any specified column is not NaN?
Example:
df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                   'val1': [2,2,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2],
                   'val2': [7,0.2,5,8,np.nan,1,0,np.nan,1,1],
                   })
display(df)
mycols = ['val1', 'val2']
# if the entry in mycols != np.nan, then df[row, 'countif'] = 1; else 0
Desired output dataframe:
We do not need countif logic in pandas; try notna + any:
df['out'] = df[['val1','val2']].notna().any(axis=1).astype(int)
df
Out[381]:
id val1 val2 out
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
Using the iloc accessor, filter the last two columns. Check whether the count of non-NaN values in each row is greater than zero, then convert the resulting Boolean to an integer.
df['countif'] = df.iloc[:, 1:].notna().sum(axis=1).gt(0).astype(int)
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
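Both answers can be checked on the question's frame. `axis=1` is the keyword spelling of the positional `1` used above, and the iloc slice is written `1:3` here because the frame gains an extra column along the way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 11),
                   'val1': [2, 2, np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2],
                   'val2': [7, 0.2, 5, 8, np.nan, 1, 0, np.nan, 1, 1]})

# 1 if any of the value columns holds a non-NaN entry in that row, else 0
df['out'] = df[['val1', 'val2']].notna().any(axis=1).astype(int)
# the iloc variant counts non-NaN cells per row instead
df['countif'] = df.iloc[:, 1:3].notna().sum(axis=1).gt(0).astype(int)
print(df['out'].tolist())  # [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
```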

Pandas: Iterate by two column for each iteration

Does anyone know how to iterate over a pandas DataFrame taking two columns per iteration?
Say I have
a b c d
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
So something like
for x, y in ...:
correlation of x and y
So output will be
corr_ab corr_bc corr_cd
0.1 0.3 -0.4
You can use zip with slicing to pair adjacent columns, build a dictionary of one-element lists with Series.corr, name the keys with f-strings, and pass the result to the DataFrame constructor:
L = {f'corr_{col1}{col2}': [df[col1].corr(df[col2])]
     for col1, col2 in zip(df.columns, df.columns[1:])}
df = pd.DataFrame(L)
print(df)
corr_ab corr_bc corr_cd
0 0.860108 0.61333 0.888523
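A runnable sketch of this approach, with column values taken from the question's sample table:

```python
import pandas as pd

df = pd.DataFrame({'a': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
                   'b': [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
                   'c': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7],
                   'd': [0.2, 0.2, 0.2, 0.2, 0.2, 0.4]})

# zip pairs each column with its right-hand neighbour: (a,b), (b,c), (c,d)
L = {f'corr_{c1}{c2}': [df[c1].corr(df[c2])]
     for c1, c2 in zip(df.columns, df.columns[1:])}
result = pd.DataFrame(L)
print(result.columns.tolist())  # ['corr_ab', 'corr_bc', 'corr_cd']
```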
You can use df.corr to get the correlation of the dataframe. You then use mask to avoid repeated correlations. After that you can stack your new dataframe to make it more readable. Assuming you have data like this
0 1 2 3 4
0 11 6 17 2 3
1 3 12 16 17 5
2 13 2 11 10 0
3 8 12 13 18 3
4 4 3 1 0 18
Finding the correlation,
dataCorr = data.corr(method='pearson')
We get,
0 1 2 3 4
0 1.000000 -0.446023 0.304108 -0.136610 -0.674082
1 -0.446023 1.000000 0.563112 0.773013 -0.258801
2 0.304108 0.563112 1.000000 0.494512 -0.823883
3 -0.136610 0.773013 0.494512 1.000000 -0.545530
4 -0.674082 -0.258801 -0.823883 -0.545530 1.000000
Masking out repeated correlations,
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
We get
0 1 2 3 4
0 NaN -0.446023 0.304108 -0.136610 -0.674082
1 NaN NaN 0.563112 0.773013 -0.258801
2 NaN NaN NaN 0.494512 -0.823883
3 NaN NaN NaN NaN -0.545530
4 NaN NaN NaN NaN NaN
Stacking the correlated data
dataCorr = dataCorr.stack().reset_index()
The stacked data will look as shown
level_0 level_1 0
0 0 1 -0.446023
1 0 2 0.304108
2 0 3 -0.136610
3 0 4 -0.674082
4 1 2 0.563112
5 1 3 0.773013
6 1 4 -0.258801
7 2 3 0.494512
8 2 4 -0.823883
9 3 4 -0.545530
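To attach pair labels like `corr_ab` to the stacked result, one variation of the mask-and-stack approach, reusing the four-column frame from the question rather than the random data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
                   'b': [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
                   'c': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7],
                   'd': [0.2, 0.2, 0.2, 0.2, 0.2, 0.4]})

corr = df.corr(method='pearson')
# keep only the strict upper triangle so each pair appears once
# (plain bool replaces the removed np.bool alias)
corr = corr.mask(np.tril(np.ones(corr.shape, dtype=bool)))
# stack drops the masked NaNs, leaving one entry per column pair
pairs = corr.stack()
pairs.index = [f'corr_{i}{j}' for i, j in pairs.index]
```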

pandas take average on odd rows

I want to fill in data between each row in a dataframe with an average of current and next row (where columns are numeric)
starting data:
time value value_1 value-2
0 0 0 4 3
1 2 1 6 6
intermediate df:
time value value_1 value-2
0 0 0 4 3
1 1 0 4 3 #duplicate of row 0
2 2 1 6 6
3 3 1 6 6 #duplicate of row 2
I would like to create df_1:
time value value_1 value-2
0 0 0 4 3
1 1 0.5 5 4.5 #average of row 0 and 2
2 2 1 6 6
3 3 2 8 8 #average of row 2 and 4
To do this I appended a copy of the starting dataframe to create the intermediate dataframe shown above:
df = df_0.append(df_0)
df.sort_values(['time'], ascending=[True], inplace=True)
df = df.reset_index()
df['value_shift'] = df['value'].shift(-1)
df['value_shift_1'] = df['value_1'].shift(-1)
df['value_shift_2'] = df['value_2'].shift(-1)
then I was thinking of applying a function to each column:
def average_vals(numeric_val):
    # average every odd row
    if int(row.name) % 2 != 0:
        # take the average of value and value_shift for each value
        # but this way I need to create 3 separate functions
Is there a way to do this without writing a separate function for each column and applying to each column one by one (in real data I have tens of columns)?
How about this method using DataFrame.reindex and DataFrame.interpolate
df.reindex(np.arange(len(df.index) * 2) / 2).interpolate().reset_index(drop=True)
Explanation
Reindex in half steps: reindex(np.arange(len(df.index) * 2) / 2)
This gives a DataFrame like this:
time value value_1 value-2
0.0 0.0 0.0 4.0 3.0
0.5 NaN NaN NaN NaN
1.0 2.0 1.0 6.0 6.0
1.5 NaN NaN NaN NaN
Then use DataFrame.interpolate to fill in the NaN values; the default is linear interpolation, so each gap is filled with the mean of its neighbours.
Finally, use .reset_index(drop=True) to fix your index.
Should give
time value value_1 value-2
0 0.0 0.0 4.0 3.0
1 1.0 0.5 5.0 4.5
2 2.0 1.0 6.0 6.0
3 2.0 1.0 6.0 6.0
(Note the last row repeats row 2 rather than extrapolating: linear interpolation cannot extend past the final data point, so trailing values are carried forward.)
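The whole answer as a runnable sketch, using the two-row frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0, 2], 'value': [0, 1],
                   'value_1': [4, 6], 'value_2': [3, 6]})

# half-step index 0.0, 0.5, 1.0, 1.5: the original rows land on the
# whole numbers, the .5 positions are NaN, and linear interpolation
# fills each gap with the mean of its neighbours
out = (df.reindex(np.arange(len(df.index) * 2) / 2)
         .interpolate()
         .reset_index(drop=True))
print(out['value_1'].tolist())  # [4.0, 5.0, 6.0, 6.0]
```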

Pandas data frame combine rows

My problem is a large data frame which I would like to clear out. The two main problems for me are:
The whole data frame is time-based. That means I cannot shift rows around; otherwise the timestamps would no longer match.
The data is not always in the same order.
Here is an example to clarify
index a b c d x1 x2 y1 y2 t
0 1 2 0.2
1 1 2 0.4
2 2 4 0.6
3 1 2 1.8
4 2 3 2.0
5 1 2 3.8
6 2 3 4.0
7 2 5 4.2
The result should be looking like this
index a b c d x1 x2 y1 y2 t
0 1 2 2 4 0.2
1 1 2 0.4
3 1 2 2 3 1.8
5 1 2 2 3 3.8
7 2 5 4.2
This means I would like to collapse the right half of the df, keeping the timestamp of the first entry of each group. The second problem is that rows from the left half of the df may appear in between.
This may not be the most general solution, but it solves your problem:
First, isolate the right half:
r = df[['x1', 'x2', 'y1', 'y2']].dropna(how='all')
Second, use dropna applied column by column to compress the data:
r_compressed = r.apply(
    lambda g: g.dropna().reset_index(drop=True),
    axis=0
).set_index(r.index[::2])
You need to drop the index inside the lambda, otherwise pandas will attempt to realign the data. The original index is reapplied at the end (but only every second label) to facilitate reinsertion of the left half and the t column.
Output (note the index values):
x1 x2 y1 y2
0 1.0 2.0 2.0 4.0
3 1.0 2.0 2.0 3.0
5 1.0 2.0 2.0 3.0
Third, isolate left half:
l = df[['a', 'b', 'c', 'd']].dropna(how='all')
Fourth, incorporate the left half and t column to compressed right half:
out = r_compressed.combine_first(l)
out['t'] = df['t']
Output:
a b c d x1 x2 y1 y2 t
0 NaN NaN NaN NaN 1.0 2.0 2.0 4.0 0.2
1 1.0 2.0 NaN NaN NaN NaN NaN NaN 0.4
3 NaN NaN NaN NaN 1.0 2.0 2.0 3.0 1.8
5 NaN NaN NaN NaN 1.0 2.0 2.0 3.0 3.8
7 NaN NaN 2.0 5.0 NaN NaN NaN NaN 4.2
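Rebuilding the example frame makes the whole sequence reproducible (the NaN layout is inferred from the question's table):

```python
import numpy as np
import pandas as pd

n = np.nan
df = pd.DataFrame(
    [[n, n, n, n, 1, 2, n, n, 0.2],
     [1, 2, n, n, n, n, n, n, 0.4],
     [n, n, n, n, n, n, 2, 4, 0.6],
     [n, n, n, n, 1, 2, n, n, 1.8],
     [n, n, n, n, n, n, 2, 3, 2.0],
     [n, n, n, n, 1, 2, n, n, 3.8],
     [n, n, n, n, n, n, 2, 3, 4.0],
     [n, n, 2, 5, n, n, n, n, 4.2]],
    columns=['a', 'b', 'c', 'd', 'x1', 'x2', 'y1', 'y2', 't'])

r = df[['x1', 'x2', 'y1', 'y2']].dropna(how='all')
# per-column dropna packs each column's values to the top; every
# second original label then indexes the compressed rows
r_compressed = r.apply(lambda g: g.dropna().reset_index(drop=True),
                       axis=0).set_index(r.index[::2])
l = df[['a', 'b', 'c', 'd']].dropna(how='all')
out = r_compressed.combine_first(l)
out['t'] = df['t']
print(out.index.tolist())  # [0, 1, 3, 5, 7]
```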

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in both position and length...
set_index and reset_index are your friends.
df = DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
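The three steps combined, as in Out[69], runnable end to end:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})

# naming the new index "A" lets reset_index restore it as the A column
new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
out = df.set_index("A").reindex(new_index).reset_index()
print(len(out))  # 10
```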
Using the answer by EdChum, I created the following function:
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df \
        .merge(how='right', on=field,
               right=pd.DataFrame({field: np.arange(range_from, range_to, range_step)})) \
        .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage:
fill_missing_range(df, 'A', 0.0, 4.5, 0.5, np.nan)
In this case I am overwriting your A column with a newly generated dataframe and merging this to your original df, I then resort it:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange function, which takes a start and end value; note I added 0.5 to the end since arange is half-open (the stop value is excluded), and pass a step value.
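The merge variant, reconstructed as a standalone sketch (using sort_values, the current name for the old sort method):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   'B': [1, 4, 6, 2, 4, 3],
                   'C': [3, 2, 1, 0, 5, 3]})

# right merge against the complete grid of A values; rows that are
# missing from df come back as NaN (+ 0.5 closes arange's open end)
full = pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})
out = (df.merge(full, how='right', on='A')
         .sort_values(by='A')
         .reset_index(drop=True))
print(out['A'].tolist())
```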
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df['A'] = df.index   # restore the A column, which is NaN on the inserted rows
df = df.reset_index(drop=True)
df
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df.iloc[i, j] = np.nan
will do the trick.
