Change axis for pandas replace ffill - python

Suppose I have a dataframe that looks like:
df =
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Then it is possible to use df.fillna(method='ffill', axis=1) to obtain:
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
i.e. I forward fill the rows.
However, now I have a dataframe with -1 instead of np.nan. Pandas' replace function also supports method='ffill', but replace() does not take an axis argument, so to obtain the same result as above I would need to call df.T.replace(-1, method='ffill').T. Since transposing is quite expensive (especially considering I'm working with a dataframe of multiple gigabytes), this is not an option. How can I achieve the desired result?

Use mask and ffill
df.mask(df.eq(-1)).ffill(axis=1)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0

You can convert your -1 values to NaN before using pd.DataFrame.ffill:
print(df)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 -1.0
2 6.0 -1.0 -1.0
res = df.replace(-1, np.nan)\
.ffill(axis=1)
print(res)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0

IIUC, use mask and ffill with axis=1:
Where df1 = df.fillna(-1.0) (the -1 version of the frame, reconstructed here for demonstration)
df1.mask(df1 == -1).ffill(axis=1)
Output:
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0

Related

Dataframe compare, combine and merge for rectangular meshgrid

I have two dataframes shown below:
df_1 =
Lon Lat N
0 2 1 1
1 2 2 3
2 2 3 1
3 3 2 2
and
df_2 =
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 NaN
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 NaN
6 3.0 2.0 NaN
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 NaN
10 3.0 3.0 NaN
11 4.0 3.0 NaN
What I want to do is to compare these two dfs and merge them according to Lon and Lat. That is to say, the NaN values in df_2 will be filled with values from df_1 wherever the corresponding Lon and Lat are identical. The ideal output should be:
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3
6 3.0 2.0 2
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1
10 3.0 3.0 NaN
11 4.0 3.0 NaN
The reason I want to do this is that df_1's Lat and Lon coordinates form a non-rectangular (unstructured) grid, and I need to fill in some NaN values so as to get a rectangular meshgrid and make contourf applicable. It would be highly appreciated if you could suggest better ways to make the contour plot.
I have tried df_2.combine_first(df_1), but it doesn't work.
Thanks!
df_2.drop(columns = 'N').merge(df_1, on = ['Lon', 'Lat'], how = 'left')
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1.0
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3.0
6 3.0 2.0 2.0
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1.0
10 3.0 3.0 NaN
11 4.0 3.0 NaN
If you first create df_2 with all the needed coordinates, you can update it from df_1 by using pandas.DataFrame.update.
For this you need to first set the correct index by using pandas.DataFrame.set_index.
Have a look at this post for more information.
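Neither sentence above shows the actual code; a minimal sketch of the update approach, with the question's frames rebuilt by assumption (Lon/Lat kept as floats so the two indexes align), could look like this:
import numpy as np
import pandas as pd

# Rebuild the example frames from the question (assumed layout)
df_1 = pd.DataFrame({"Lon": [2.0, 2.0, 2.0, 3.0],
                     "Lat": [1.0, 2.0, 3.0, 2.0],
                     "N":   [1, 3, 1, 2]})

lon, lat = np.meshgrid([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
df_2 = pd.DataFrame({"Lon": lon.ravel(), "Lat": lat.ravel(), "N": np.nan})

# Align both frames on the coordinate columns, then let update fill the NaNs in place
df_2 = df_2.set_index(["Lon", "Lat"])
df_2.update(df_1.set_index(["Lon", "Lat"]))
df_2 = df_2.reset_index()
print(df_2)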

How do I interpolate filling in nans with the minimum value on either side?

I have a dataframe of the form:
df = {'col_1': [5,4,np.nan,np.nan,1,0,1,2,np.nan,np.nan,5],
'col_2': [5,4,3,2,np.nan,np.nan,np.nan,np.nan,3,4,5]}
df = pd.DataFrame(df)
I want to "interpolate", but taking the minimum value on either side of each gap, for the desired result of:
df_desired = {'col_1': [5,4,1,1,1,0,1,2,2,2,5],
'col_2': [5,4,3,2,2,2,2,2,3,4,5]}
df_desired = pd.DataFrame(df_desired)
Does anyone know a good way of doing this? Thanks!
Here is a way: take the element-wise np.minimum of ffill and bfill
out = np.minimum(df.ffill(),df.bfill())
print(out)
col_1 col_2
0 5.0 5.0
1 4.0 4.0
2 1.0 3.0
3 1.0 2.0
4 1.0 2.0
5 0.0 2.0
6 1.0 2.0
7 2.0 2.0
8 2.0 3.0
9 2.0 4.0
10 5.0 5.0

Negating column values and adding particular values in only some columns in a Pandas Dataframe

Given a Pandas dataframe df, I would like to be able to both negate the value in particular columns for all rows/entries and also add another value. The value to be added is a fixed constant for each of the columns.
I believe I could copy df, say dfcopy = df.copy(), set all cell values in dfcopy to the particular numbers and then subtract df from dfcopy, but I am hoping for a simpler way.
I am thinking that I need to somehow modify
df.iloc[:, [0,3,4]]
So for example of how this should look:
A B C D E
0 1.0 3.0 1.0 2.0 7.0
1 2.0 1.0 8.0 5.0 3.0
2 1.0 1.0 1.0 1.0 6.0
Then negating only those values in columns (0,3,4) and then adding 10 (for example) we would have:
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Thanks.
You can first multiply by -1 with mul and then add 10 with add for those columns we select with iloc:
df.iloc[:, [0,3,4]] = df.iloc[:, [0,3,4]].mul(-1).add(10)
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Or as anky_91 suggested in the comments:
df.iloc[:, [0,3,4]] = 10-df.iloc[:,[0,3,4]]
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
pandas is very intuitive in letting you perform these operations.
Negate:
df.iloc[:, [0,2,7,10,11]] = -df.iloc[:, [0,2,7,10,11]]
Add a constant c:
df.iloc[:, [0,2,7,10,11]] = df.iloc[:, [0,2,7,10,11]] + c
Or set to a constant value c:
df.iloc[:, [0,2,7,10,11]] = c
and any other arithmetic you can think of.

Generate New DataFrame without NaN Values

I've the following Dataframe:
a b c d e
0 NaN 2.0 NaN 4.0 5.0
1 NaN 2.0 3.0 NaN 5.0
2 1.0 NaN 3.0 4.0 NaN
3 1.0 2.0 NaN 4.0 NaN
4 NaN 2.0 NaN 4.0 5.0
What I am trying to do is generate a new DataFrame without the NaN values.
There is always the same number of NaN values in each row.
The final Dataframe should look like this:
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
Does someone know an easy way to do this?
Any help is appreciated.
Using array indexing:
pd.DataFrame(df.values[df.notnull().values].reshape(df.shape[0],3),
columns=list('xyz'),dtype=int)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If the dataframe has an inconsistent number of values across rows, e.g. the first row has 4 non-NaN values while the remaining rows have 3, then this will do:
a b c d e g
0 NaN 2.0 NaN 4.0 5.0 6.0
1 NaN 2.0 3.0 NaN 5.0 NaN
2 1.0 NaN 3.0 4.0 NaN NaN
3 1.0 2.0 NaN 4.0 NaN NaN
4 NaN 2.0 NaN 4.0 5.0 NaN
pd.DataFrame(df.apply(lambda x: x.values[x.notnull()],axis=1).tolist())
0 1 2 3
0 2.0 4.0 5.0 6.0
1 2.0 3.0 5.0 NaN
2 1.0 3.0 4.0 NaN
3 1.0 2.0 4.0 NaN
4 2.0 4.0 5.0 NaN
Here the trailing NaNs in the last column cannot be removed, since the rows have different numbers of values.
Use the justify function and select the first 3 columns:
df = pd.DataFrame(justify(df.values,invalid_val=np.nan)[:, :3].astype(int),
columns=list('xyz'),
index=df.index)
print (df)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
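Note that justify is not part of pandas or NumPy; it is a small helper that pushes the valid values of each row to one side. A simplified, row-wise-only sketch (assuming a float array with NaN marking the invalid entries; the commonly shared version is more general) might look like:
import numpy as np

def justify(a, invalid_val=np.nan):
    # Left-justify the non-NaN entries of each row of a float array,
    # padding the rest of the row with invalid_val.
    mask = ~np.isnan(a)
    out = np.full(a.shape, invalid_val, dtype=a.dtype)
    for i, row in enumerate(a):
        valid = row[mask[i]]
        out[i, :valid.size] = valid
    return out

With the question's frame, justify(df.values, invalid_val=np.nan)[:, :3] then yields the 5x3 block of values used above.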
If, as in your example, values increase across columns, you can sort over axis=1:
res = pd.DataFrame(np.sort(df.values, 1)[:, :3],
columns=list('xyz'), dtype=int)
print(res)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
You can use pandas' DataFrame method df.fillna().
This method replaces NaN/NA values with the parameter you pass in:
df.fillna(value)  # every NaN is replaced by value
import numpy as np
import pandas as pd
data = {
'A':[np.nan, 2.0, np.nan, 4.0, 5.0],
'B':[np.nan, 2.0, 3.0, np.nan, 5.0],
'C':[1.0 , np.nan, 3.0, 4.0, np.nan],
'D':[1.0 , 2.0, np.nan, 4.0, np.nan,],
'E':[np.nan, 2.0, np.nan, 4.0, 5.0]
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 NaN NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 NaN 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
df = df.fillna(0) # Applying the method with parameter 0
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
If you want to apply this method to a particular column, the syntax is:
df[column_name] = df[column_name].fillna(param)
df['A'] = df['A'].fillna(0)  # starting again from the original df
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
You can also use pandas' replace() method to replace np.nan
df = df.replace(np.nan,0)
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
df['A'] = df['A'].replace(np.nan, 0)  # Replacing only column A (starting again from the original df)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0

Pandas MultiIndex: Selecting a column knowing only the second index?

I'm working with the following DataFrame:
age height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
I added another header to the df in this way:
zipped = list(zip(df.columns, ["RHS", "height", "weight", "shoe_size"]))
df.columns = pd.MultiIndex.from_tuples(zipped)
So this is the new DataFrame:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
Now I know how to select the first column, by using the corresponding tuple ("age", "RHS"):
df[("age", "RHS")]
but I was wondering about how to do this by using only the second index "RHS".
Ideally something like:
df[(any, "RHS")]
You could use get_level_values
In [700]: df.loc[:, df.columns.get_level_values(1) == 'RHS']
Out[700]:
age
RHS
0 8.0
1 8.0
2 6.0
3 5.0
4 5.0
5 3.0
You can pass slice(None) as the first level of the column indexer to .loc, provided you sort your columns first using df.sort_index:
In [325]: df.sort_index(axis=1).loc[:, (slice(None), 'RHS')]
Out[325]:
age
RHS
0 8.0
1 8.0
2 6.0
3 5.0
4 5.0
5 3.0
You can also use pd.IndexSlice with df.loc:
In [332]: idx = pd.IndexSlice
In [333]: df.sort_index(axis=1).loc[:, idx[:, 'RHS']]
Out[333]:
age
RHS
0 8.0
1 8.0
2 6.0
3 5.0
4 5.0
5 3.0
With the slicer, you don't need to explicitly pass slice(None) because IndexSlice does that for you.
If you don't sort your columns, you get:
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
If you have multiple RHS columns in the second level, all those columns are returned.
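To illustrate that last point, here is a small made-up example (this frame is not from the question) with 'RHS' appearing twice at the second level; the get_level_values selection from the first answer returns both columns:
import numpy as np
import pandas as pd

# Hypothetical frame with 'RHS' used twice at the second level
cols = pd.MultiIndex.from_tuples([("age", "RHS"), ("height", "height"), ("weight", "RHS")])
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=cols)

# Both ('age', 'RHS') and ('weight', 'RHS') are selected
print(df.loc[:, df.columns.get_level_values(1) == "RHS"])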
