given the dataframe df
df = pd.DataFrame(data=[[np.nan,1],
[np.nan,np.nan],
[1,2],
[2,3],
[np.nan,np.nan],
[np.nan,np.nan],
[3,4],
[4,5],
[np.nan,np.nan],
[np.nan,np.nan]],columns=['A','B'])
df
Out[16]:
A B
0 NaN 1.0
1 NaN NaN
2 1.0 2.0
3 2.0 3.0
4 NaN NaN
5 NaN NaN
6 3.0 4.0
7 4.0 5.0
8 NaN NaN
9 NaN NaN
I would need to replace the nan using the following rules:
1) if nan is at the beginning replace with the first values after the nan
2) if nan is in the middle of 2 or more values replace the nan with the average of these values
3) if nan is at the end replace with the last value
df
Out[16]:
A B
0 1.0 1.0
1 1.0 1.5
2 1.0 2.0
3 2.0 3.0
4 2.5 3.5
5 2.5 3.5
6 3.0 4.0
7 4.0 5.0
8 4.0 5.0
9 4.0 5.0
Use add between forward filling and backfilling values, then divide by 2 and last replace last and first NaNs:
df = df.bfill().add(df.ffill()).div(2).ffill().bfill()
print (df)
A B
0 1.0 1.0
1 1.0 1.5
2 1.0 2.0
3 2.0 3.0
4 2.5 3.5
5 2.5 3.5
6 3.0 4.0
7 4.0 5.0
8 4.0 5.0
9 4.0 5.0
Detail:
print (df.bfill().add(df.ffill()))
A B
0 NaN 2.0
1 NaN 3.0
2 2.0 4.0
3 4.0 6.0
4 5.0 7.0
5 5.0 7.0
6 6.0 8.0
7 8.0 10.0
8 NaN NaN
9 NaN NaN
Related
I have two dataframes shown below:
df_1 =
Lon Lat N
0 2 1 1
1 2 2 3
2 2 3 1
3 3 2 2
and
df_2 =
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 NaN
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 NaN
6 3.0 2.0 NaN
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 NaN
10 3.0 3.0 NaN
11 4.0 3.0 NaN
What I want to do is to compare these two dfs and merge them according to Lon and Lat. That is to say NaN in df_2 will be covered with values in df_1 if the corresponding Lon and Lat are identical. The ideal output should be as:
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3
6 3.0 2.0 2
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1
10 3.0 3.0 NaN
11 4.0 3.0 NaN
The reason I want to do this is df_1's coordinates Lat and Lon are non-rectangular or unstructured grid, and I need to fill some NaN values so as to get a rectangular meshgrid and make contourf applicable. It would be highly appreciated if you can provide better ways to make the contour plot.
I have tried df_2.combine_first(df_1), but it doesn't work.
Thanks!
df_2.drop(columns = 'N').merge(df_1, on = ['Lon', 'Lat'], how = 'left')
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1.0
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3.0
6 3.0 2.0 2.0
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1.0
10 3.0 3.0 NaN
11 4.0 3.0 NaN
If you first create the df_2 with all needed values you can update it with the second DataFrame by using pandas.DataFrame.update.
For this you need to first set the the correct index by using pandas.DataFrame.set_index.
Have a look at this Post for more information.
Imagine I have the following Series:
inp = pd.Series(np.arange(10))
What I want to do, is to transform it to a np.array in the following way:
input
0
1
2
3
4
0
NaN
NaN
NaN
Nan
0
1
NaN
NaN
NaN
0
1
2
NaN
NaN
0
1
2
3
NaN
0
1
2
3
4
0
1
2
3
4
5
1
2
3
4
5
6
2
3
4
5
6
...and so forth.
The column called input is not expected in the output, but I placed it here to make my inquiry more clear.
What I tried is the following:
matrix = [x.to_numpy() for x in list(inp.rolling(window=5, min_periods=5))]
Problem is, i can't use np.stack() on matrix, as (even though I passed min_periods=5) the shape of every item in the list is different.
Also I feel like I am overlooking a very simple pandas command :D.
Thank you very much!
EDIT:
My current workaround is a custom function. I guess there are way better solutions than this one:
def rolling_transform_series(x):
length = len(x)
array = []
for idx in range(length):
s = x[idx-5:idx]
if idx < 5:
s = np.r_[np.zeros(5-idx), x[:idx]]
s[s==0] = np.nan
array.append(s)
return np.array(array)
df = inp.apply(rolling_transform_series)
You can try it with shift:
import pandas as pd
inp = pd.Series(np.arange(10))
pd.DataFrame([inp.shift(s) for s in range(4,-6,-1)])
0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN NaN 0.0 1.0 2.0 3.0 4.0 5.0
1 NaN NaN NaN 0.0 1.0 2.0 3.0 4.0 5.0 6.0
2 NaN NaN 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0
3 NaN 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
4 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
5 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 NaN
6 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 NaN NaN
7 3.0 4.0 5.0 6.0 7.0 8.0 9.0 NaN NaN NaN
8 4.0 5.0 6.0 7.0 8.0 9.0 NaN NaN NaN NaN
9 5.0 6.0 7.0 8.0 9.0 NaN NaN NaN NaN NaN
I have a DataFrame where I want to replace only the rows with NaN values in each column by the row below it. I tried solutions from multiple feeds and used ffill but that resulted in filling few cells and not the entire row.
ss s h b sb
0 NaN NaN NaN NaN NaN
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 1.0 6.0 7.0 11.0 3.0
Expected output:
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
You can create groups by testing rows with only missing values with cumulative sum by swapped order of column and pass to GroupBy.bfill:
df = df.groupby((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1]).bfill()
print (df)
ss s h b sb
0 3.0 NaN 14.0 NaN 8.0
1 3.0 NaN 14.0 NaN 8.0
2 9.0 8.0 23.0 NaN 2.0
3 1.0 6.0 7.0 11.0 3.0
4 1.0 6.0 7.0 11.0 3.0
5 1.0 6.0 7.0 11.0 3.0
Detail:
print ((df.notna().any(axis=1)).iloc[::-1].cumsum().iloc[::-1])
0 3
1 3
2 2
3 1
4 1
5 1
dtype: int32
I have two .csv files, one contains what could be described as a header and a body. The header contains data like the total number of rows, datetime, what application generated the data, and what line the body starts on. The second file contains a single row.
>>> import pandas as pd
>>> df = pd.read_csv("data.csv", names=list('abcdef'))
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>> df2 = pd.read_csv("extra_data.csv")
>>> df2
a b c
0 6.0 5.0 4.0
>>> row = df2.loc[0]
>>>
I am having trouble modifying the 'a', 'b' and 'c' columns and then saving the DataFrame to a new .csv file.
I have tried adding the row by way of slicing and the addition operator but this did not work:
>>> df[5:,'a':'c'] += row
TypeError: '(slice(5, None, None), slice('a', 'c', None))' is an invalid key
>>>
I also tried the answer I found here, but this gave a similar error:
>>> df[5:,row.index] += row
TypeError: '(slice(5, None, None), Index(['a', 'b', 'c'], dtype='object'))' is an invalid key
>>>
I suspect the problem is coming from object dtypes so I tried converting a subframe to the float type:
>>> sub_section = df.loc[5:,['a','b','c']].astype(float)
>>> sub_section
a b c
5 0.0 1.0 2.0
6 0.0 1.0 2.0
7 0.0 1.0 2.0
8 0.0 1.0 2.0
9 0.0 1.0 2.0
10 0.0 1.0 2.0
11 0.0 1.0 2.0
>>> sub_section += row
>>> sub_section
a b c
5 6.0 6.0 6.0
6 6.0 6.0 6.0
7 6.0 6.0 6.0
8 6.0 6.0 6.0
9 6.0 6.0 6.0
10 6.0 6.0 6.0
11 6.0 6.0 6.0
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>>
Obviously, in this case df.loc[] is returning a copy, and then modifying the copy does nothing to the df.
How do I modify parts of a DataFrame (dtype=object) and then save the changes?
I've the following Dataframe:
a b c d e
0 NaN 2.0 NaN 4.0 5.0
1 NaN 2.0 3.0 NaN 5.0
2 1.0 NaN 3.0 4.0 NaN
3 1.0 2.0 NaN 4.0 NaN
4 NaN 2.0 NaN 4.0 5.0
What I try to to is to generate a new Dataframe without the NaN values.
There are always the same number of NaN Values in a row.
The final Dataframe should look like this:
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
Does someone know an easy way to do this?
Any help is appreciated.
Using array indexing:
pd.DataFrame(df.values[df.notnull().values].reshape(df.shape[0],3),
columns=list('xyz'),dtype=int)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If the dataframe has more inconsistance values across rows like 1st row with 4 values and from 2nd row if it has 3 values, Then this will do:
a b c d e g
0 NaN 2.0 NaN 4.0 5.0 6.0
1 NaN 2.0 3.0 NaN 5.0 NaN
2 1.0 NaN 3.0 4.0 NaN NaN
3 1.0 2.0 NaN 4.0 NaN NaN
4 NaN 2.0 NaN 4.0 5.0 NaN
pd.DataFrame(df.apply(lambda x: x.values[x.notnull()],axis=1).tolist())
0 1 2 3
0 2.0 4.0 5.0 6.0
1 2.0 3.0 5.0 NaN
2 1.0 3.0 4.0 NaN
3 1.0 2.0 4.0 NaN
4 2.0 4.0 5.0 NaN
Here we cannot remove NaN's in last column.
Use justify function and select first 3 columns:
df = pd.DataFrame(justify(df.values,invalid_val=np.nan)[:, :3].astype(int),
columns=list('xyz'),
index=df.index)
print (df)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If, as in your example, values increase across columns, you can sort over axis=1:
res = pd.DataFrame(np.sort(df.values, 1)[:, :3],
columns=list('xyz'), dtype=int)
print(res)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
You can use panda's method for dataframe df.fillna()
This method is used for converting the NaN or NA to your given parameter.
df.fillna(param to replace Nan)
import numpy as np
import pandas as pd
data = {
'A':[np.nan, 2.0, np.nan, 4.0, 5.0],
'B':[np.nan, 2.0, 3.0, np.nan, 5.0],
'C':[1.0 , np.nan, 3.0, 4.0, np.nan],
'D':[1.0 , 2.0, np.nan, 4.0, np.nan,],
'E':[np.nan, 2.0, np.nan, 4.0, 5.0]
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 NaN NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 NaN 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
df = df.fillna(0) # Applying the method with parameter 0
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
If you want to apply this method to the particular column, the syntax would be like this
df[column_name] = df[column_name].fillna(param)
df['A'] = df['A'].fillna(0)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
You can also use Python's replace() method to replace np.nan
df = df.replace(np.nan,0)
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
df['A'] = df['A'].replace() # Replacing only column A
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0