Pandas - Fillna based on last non-blank value and next column - python

I have the following pandas dataframe:
       A      B    C
0  100.0  110.0  100
1   90.0  120.0  110
2    NaN  105.0  105
3    NaN  100.0  103
4    NaN    NaN  107
5    NaN    NaN  110
I need to fill NaNs in all columns in a particular way. Let's take column "A" as an example: the last non-NaN value is row #1 (90.0). So for column "A" I need to fill NaNs with the following formula:
Column_A-Row_1 * Column_B-CurrentRow / Column_B-Row_1
For example, the first NaN of column A (row #2) should be filled with: 90 * 105 / 120. The following NaN of column A should be filled with: 90 * 100 / 120.
Please note that column names can change, so I can't reference columns by name.
This is the expected output:
        A       B      C
0  100.00  110.00  100.0
1   90.00  120.00  110.0
2   78.75  105.00  105.0
3   75.00  100.00  103.0
4     NaN  103.88  107.0
5     NaN  106.80  110.0
Any ideas? Thanks

You can fill the first NaN that follows a number by using shift along both axes:
df2 = df.combine_first(df.shift().mul(df.div(df.shift()).shift(-1,axis=1)))
output:
        A           B    C
0  100.00  110.000000  100
1   90.00  120.000000  110
2   78.75  105.000000  105
3     NaN  100.000000  103
4     NaN  103.883495  107
5     NaN         NaN  110
It is unclear how you get the 75, though. Do you want to iterate the process?
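If the rule is "anchor on the last non-NaN value of each column and scale by the next column's original values" (which also produces the 75), a per-column sketch using last_valid_index could look like this; the frame is reconstructed from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [100.0, 90.0, np.nan, np.nan, np.nan, np.nan],
    "B": [110.0, 120.0, 105.0, 100.0, np.nan, np.nan],
    "C": [100, 110, 105, 103, 107, 110],
})

filled = df.copy()
cols = df.columns
for j in range(len(cols) - 1):            # the last column has no "next" column
    col, nxt = cols[j], cols[j + 1]
    mask = df[col].isna()
    if not mask.any():
        continue
    r = df[col].last_valid_index()        # row of the last non-NaN value
    # anchor_value * next_col_current / next_col_at_anchor
    filled.loc[mask, col] = df.loc[r, col] * df.loc[mask, nxt] / df.loc[r, nxt]
```

Rows where the next column is itself NaN (rows 4-5 of A) stay NaN, matching the expected output.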

Related

Merge rows duplicate values in a column using Pandas

I have DataFrame like this
A B C D
010 100 NaN 300
020 NaN 200 400
020 100 NaN NaN
030 NaN NaN 19
030 1 NaN NaN
040 NaN 2 1
How can I merge all rows that have duplicate (same value) in Column A so that other values fill the empty places?
End result
A B C D
010 100 NaN 300
020 100 200 400
030 1 NaN 19
040 NaN 2 1
Check with
df=df.groupby('A',as_index=False).first()
Out[65]:
A B C D
0 10 100.0 NaN 300.0
1 20 100.0 200.0 400.0
2 30 1.0 NaN 19.0
3 40 NaN 2.0 1.0
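The reason this works is that GroupBy.first returns the first non-NaN value per column within each group. A minimal reproduction, assuming the A values are numeric:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [10, 20, 20, 30, 30, 40],
    "B": [100, np.nan, 100, np.nan, 1, np.nan],
    "C": [np.nan, 200, np.nan, np.nan, np.nan, 2],
    "D": [300, 400, np.nan, 19, np.nan, 1],
})

# first() skips NaN, so each group keeps the first value that exists per column
merged = df.groupby("A", as_index=False).first()
```

If A should keep its leading zeros ("010"), store it as a string; the groupby behaves the same way.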

How To Map Column Values where two others match? "Reindexing only valid with uniquely valued Index objects"?

I have one DataFrame, df, I have four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number into IDP2Number by matching IDP1 against IDP2: where a value in IDP2 also appears in IDP1, replace IDP2Number with the corresponding IDP1Number; otherwise leave IDP2Number alone.
The error message that appears reads, "Reindexing only valid with uniquely valued Index objects".
The Dataframe below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = df.query("IDP1.notna()")[['IDP1', 'IDP1Number']].set_index('IDP1')['IDP1Number'].to_dict()
# create new column using ifelse condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None) if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps) else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
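An alternative that avoids the row-wise apply: build the IDP1 -> IDP1Number lookup as a Series, map IDP2 through it, and fall back to the existing IDP2Number where there is no match. A sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "IDP1": [1, 3, 5, 7, 9] + [np.nan] * 5,
    "IDP1Number": [100, 110, 120, 140, 150] + [np.nan] * 5,
    "IDP2": list(range(1, 11)),
    "IDP2Number": [np.nan, 150, np.nan, 160, 190, 130, np.nan, 200, 90, np.nan],
})

# lookup table: IDP1 -> IDP1Number (only rows where IDP1 exists)
lookup = df.dropna(subset=["IDP1"]).set_index("IDP1")["IDP1Number"]
# matched values win; unmatched rows keep their old IDP2Number
df["IDP2Number"] = df["IDP2"].map(lookup).fillna(df["IDP2Number"])
```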

How to find last index in Pandas Data Frame row and count backwards using column information?

For example:
If I have a data frame like this:
    20   40   60   80  100  120  140
1    1    1    1  NaN  NaN  NaN  NaN
2    1    1    1    1    1  NaN  NaN
3    1    1    1    1  NaN  NaN  NaN
4    1    1  NaN  NaN    1    1    1
How do I find the last index in each row and then count the difference in columns elapsed so I get something like this?
    20   40   60   80  100  120  140
1   40   20    0  NaN  NaN  NaN  NaN
2   80   60   40   20    0  NaN  NaN
3   60   40   20    0  NaN  NaN  NaN
4   20    0  NaN  NaN   40   20    0
You can reverse each row, take the cumulative count of consecutive non-null values, and scale by the column step of 20:
def fill_values(row):
    # walk the row right-to-left
    row = row[::-1]
    a = row == 1
    b = a.cumsum()
    # cumulative count of consecutive values, reset at every NaN
    return (b - b.mask(a).ffill().fillna(0).astype(int))[::-1] * 20

df.apply(fill_values, axis=1).replace(0, np.nan) - 20
Out:
     20    40    60    80   100   120   140
1  40.0  20.0   0.0   NaN   NaN   NaN   NaN
2  80.0  60.0  40.0  20.0   0.0   NaN   NaN
3  60.0  40.0  20.0   0.0   NaN   NaN   NaN
4  20.0   0.0   NaN   NaN  40.0  20.0   0.0
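Spelled out as a plain loop, the value computed for every non-NaN cell is its distance (in column steps of 20) to the end of its consecutive run. This sketch reconstructs the question's frame and makes that logic explicit:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, np.nan, np.nan, np.nan, np.nan],
     [1, 1, 1, 1, 1, np.nan, np.nan],
     [1, 1, 1, 1, np.nan, np.nan, np.nan],
     [1, 1, np.nan, np.nan, 1, 1, 1]],
    columns=[20, 40, 60, 80, 100, 120, 140],
    index=[1, 2, 3, 4],
)

m = df.notna().to_numpy()
out = np.full(m.shape, np.nan)
for i in range(m.shape[0]):
    run_end = -1                              # column where the current run ends
    for j in range(m.shape[1] - 1, -1, -1):   # scan each row right-to-left
        if m[i, j]:
            if run_end == -1:
                run_end = j
            out[i, j] = (run_end - j) * 20
        else:
            run_end = -1                      # a NaN breaks the run

result = pd.DataFrame(out, index=df.index, columns=df.columns)
```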

How can I filter consecutive data rows between NaN rows in a pandas dataframe?

I have a dataframe that looks like the following. There are one or more consecutive rows where y_l is populated and y_h is NaN, and vice versa.
When there is more than one consecutive populated row between the NaNs, we only want to keep the one with the lowest y_l (or the highest y_h).
E.g. in the df below, of the three consecutive y_l rows we would only keep the 2nd (95.0) and discard the other two.
What would be a smart way to implement that?
df = pd.DataFrame({'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['y_l', 'y_h'])
>>> df
y_l y_h
0 NaN 90.0
1 97.0 NaN
2 95.0 NaN
3 98.0 NaN
4 NaN 95
Desired result:
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95
You need to create a new column or Series to distinguish each consecutive run, then use groupby and aggregate with agg; last, to restore the column order, use reindex:
a = df['y_l'].isnull()
b = a.ne(a.shift()).cumsum()
df = (df.groupby(b, as_index=False)
.agg({'y_l':'min', 'y_h':'max'})
.reindex(columns=['y_l','y_h']))
print (df)
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95.0
Detail:
print (b)
0 1
1 2
2 2
3 2
4 3
Name: y_l, dtype: int32
What if you had more columns? For example:
df = pd.DataFrame({'A': [np.nan, 15, 20, 25, np.nan], 'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['A', 'y_l', 'y_h'])
>>> df
A y_l y_h
0 NaN NaN 90.0
1 15.0 97.0 NaN
2 20.0 95.0 NaN
3 25.0 98.0 NaN
4 NaN NaN 95.0
How could you keep the values in column A after filtering out the irrelevant rows as below?
A y_l y_h
0 NaN NaN 90.0
1 20.0 95.0 NaN
2 NaN NaN 95.0
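One way to keep the extra columns is to select whole rows instead of aggregating: label each consecutive run as above, then pick the row with the lowest y_l (or, in a y_h run, the highest y_h) via idxmin/idxmax. A sketch, assuming every row populates exactly one of y_l and y_h:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 15, 20, 25, np.nan],
    "y_l": [np.nan, 97, 95, 98, np.nan],
    "y_h": [90, np.nan, np.nan, np.nan, 95],
})

a = df["y_l"].isna()
grp = a.ne(a.shift()).cumsum()        # label each consecutive run

def pick(g):
    # keep the whole row with the lowest y_l, or the highest y_h
    if g["y_l"].notna().any():
        return g.loc[[g["y_l"].idxmin()]]
    return g.loc[[g["y_h"].idxmax()]]

result = df.groupby(grp, group_keys=False).apply(pick).reset_index(drop=True)
```

Because whole rows are selected, the values in A survive (20.0 for the kept middle row).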

Interpolate missing values using row and column values

In Python Pandas, how should I interpolate a dataframe that contains some all-NaN rows and columns?
For example, the following dataframe -
90 92.5 95 100 110 120
Index
1 NaN NaN NaN NaN NaN NaN
2 0.469690 NaN NaN NaN NaN NaN
3 0.478220 NaN 0.492232 0.505685 NaN NaN
4 0.486377 NaN 0.503853 0.518890 0.550517 NaN
5 0.485862 NaN 0.502130 0.515076 0.537675 0.564383
My goal is to interpolate and fill all the NaNs efficiently, i.e. to interpolate whatever NaNs can be interpolated. However, if I use
df.interpolate(inplace=True, axis=0, method='spline', order=1, limit=20, limit_direction='both')
it will return "TypeError: Cannot interpolate with all NaNs."
You can try this (thanks to @Boud for df.dropna(axis=1, how='all')):
In [138]: new = df.dropna(axis=1, how='all').interpolate(limit=20, limit_direction='both')
In [139]: new
Out[139]:
90 95 100 110 120
Index
1 0.469690 0.492232 0.505685 0.550517 0.564383
2 0.469690 0.492232 0.505685 0.550517 0.564383
3 0.478220 0.492232 0.505685 0.550517 0.564383
4 0.486377 0.503853 0.518890 0.550517 0.564383
5 0.485862 0.502130 0.515076 0.537675 0.564383
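If you need the dropped all-NaN column (92.5) to stay in the frame, you can reindex the interpolated result back onto the original columns. A sketch using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {90: [np.nan, 0.469690, 0.478220, 0.486377, 0.485862],
     92.5: [np.nan] * 5,
     95: [np.nan, np.nan, 0.492232, 0.503853, 0.502130],
     100: [np.nan, np.nan, 0.505685, 0.518890, 0.515076],
     110: [np.nan, np.nan, np.nan, 0.550517, 0.537675],
     120: [np.nan, np.nan, np.nan, np.nan, 0.564383]},
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

new = df.dropna(axis=1, how="all").interpolate(limit=20, limit_direction="both")
# restore the all-NaN columns in their original positions
new = new.reindex(columns=df.columns)
```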
