How to use pandas.DataFrame.mask for NaN - python

I want to ignore NaN values in my selected DataFrame columns when normalizing with sklearn.preprocessing.normalize. Column example:
0 12.0
1 12.0
2 3.0
3 NaN
4 3.0
5 3.0
6 NaN
7 NaN
8 3.0
9 3.0
10 3.0
11 4.0
12 10.0

You can make use of the dropna() function. It returns the same DataFrame with the rows containing NaN removed.
>>> a.dropna()
0     12.0
1     12.0
2      3.0
4      3.0
5      3.0
8      3.0
9      3.0
10     3.0
11     4.0
12    10.0
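If you need to keep the NaN rows in place rather than drop them, you can normalize only the non-NaN entries through a boolean mask and write the result back. A minimal sketch, assuming a single numeric column like the one above:
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

s = pd.Series([12.0, 12.0, 3.0, np.nan, 3.0, 3.0, np.nan,
               np.nan, 3.0, 3.0, 3.0, 4.0, 10.0])

mask = s.notna()                    # True where a value is present
# normalize expects a 2-D array, so reshape the non-NaN values to one row
normed = normalize(s[mask].to_numpy().reshape(1, -1)).ravel()
s[mask] = normed                    # NaN positions are left untouched
print(s)
This keeps the original index and length, which matters if you need to write the normalized column back into a DataFrame.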

Related

Pandas ffill for certain values in a column

I have a df like this:
time data
0 1
1 1
2 nan
3 nan
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 nan
Is there a way to use pd.Series.ffill() to forward fill only certain occurrences of values? Specifically, I want to forward fill only if the values in df.data are == 1 or 4. It should look like this:
time data
0 1
1 1
2 1
3 1
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 4
One option would be to forward fill (ffill) the column, then only populate where the filled values are 1 or 4, using isin and mask:
s = df['data'].ffill()                            # forward fill the whole column
df['data'] = df['data'].mask(s.isin([1, 4]), s)   # take the fill only where it is 1 or 4
df:
time data
0 0 1.0
1 1 1.0
2 2 1.0
3 3 1.0
4 4 6.0
5 5 NaN
6 6 NaN
7 7 NaN
8 8 5.0
9 9 4.0
10 10 4.0
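For reference, a self-contained version of the above (a sketch that reconstructs the question's frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': range(11),
                   'data': [1, 1, np.nan, np.nan, 6, np.nan,
                            np.nan, np.nan, 5, 4, np.nan]})

s = df['data'].ffill()
df['data'] = df['data'].mask(s.isin([1, 4]), s)
print(df)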

replacing values in a pandas dataframe with values from another dataframe based on common columns

How can I replace values in a pandas dataframe with values from another dataframe based on common columns?
I need to replace the NaN values in dataframe1 based on the common columns "types" and "o_periods". Any suggestions?
df1
types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
8 2 1 1 44882.0 39.0
9 2 1 2 17176.0 29.0
10 2 2 1 28609.0 58.0
11 2 2 2 20370.0 53.0
12 2 3 1 7064.0 12.0
13 2 3 2 13099.0 44.0
14 2 4 1 NaN NaN
15 2 4 2 7117.0 18.0
16 3 1 1 1179.0 1.0
17 3 1 2 552.0 1.0
18 3 2 1 781.0 0.0
19 3 2 2 676.0 1.0
20 3 3 1 783.0 6.0
21 3 3 2 1948.0 2.0
22 3 4 1 NaN NaN
23 3 4 2 274.0 1.0
24 4 1 1 251.0 0.0
25 4 1 2 105.0 0.0
26 4 2 1 288.0 0.0
27 4 2 2 192.0 0.0
28 4 3 1 349.0 2.0
29 4 3 2 1208.0 11.0
30 4 4 1 NaN NaN
31 4 4 2 2051.0 4.0
32 5 1 1 45.0 0.0
33 5 1 2 NaN NaN
34 5 2 1 789.0 7.0
35 5 2 2 437.0 7.0
36 5 3 1 1157.0 5.0
37 5 3 2 2161.0 12.0
38 5 4 1 NaN NaN
39 5 4 2 542.0 1.0
df2
types o_periods s_months incidents
0 1 1 911.0 3.0
1 1 2 1689.0 8.0
2 2 1 26852.0 36.0
3 2 2 14440.0 36.0
4 3 1 914.0 2.0
5 3 2 862.0 1.0
6 4 1 296.0 1.0
7 4 2 889.0 4.0
8 5 1 664.0 4.0
9 5 2 1047.0 7.0
df3: rows with NaN
types c_years o_periods s_months incidents
6 1 4 1 NaN NaN
14 2 4 1 NaN NaN
22 3 4 1 NaN NaN
30 4 4 1 NaN NaN
33 5 1 2 NaN NaN
38 5 4 1 NaN NaN
I have tried to merge df2 with df3 but the indexing seems to reset.
First, separate the rows with NaN values out into a new dataframe called df3, and drop those rows from df1.
Then do a left join of df3 against df2 (note the key column is o_periods, plural):
df4 = pd.merge(df3, df2, how='left', on=['types', 'o_periods'])
After that is done, append the rows from df4 back into df1.
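A minimal sketch of this first approach, assuming df1 and df2 as shown above:
import pandas as pd

# rows of df1 where the values are missing
df3 = df1[df1['s_months'].isna()]
# the rest of df1, with the NaN rows dropped
df1_rest = df1.dropna(subset=['s_months', 'incidents'])

# left join the NaN rows (key columns only) against df2 to pick up the values
df4 = pd.merge(df3[['types', 'c_years', 'o_periods']], df2,
               how='left', on=['types', 'o_periods'])

# append the filled rows back and restore the original ordering
df1_filled = (pd.concat([df1_rest, df4])
              .sort_values(['types', 'c_years', 'o_periods'])
              .reset_index(drop=True))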
Another way is to combine the two lookup columns into a single key column in both frames:
df1["types_o"] = df1["types"].astype(str) + df1["o_periods"].astype(str)
df2["types_o"] = df2["types"].astype(str) + df2["o_periods"].astype(str)
Then you can fill the missing values with a lookup against df2:
lookup = df2.set_index('types_o')
df1.loc[df1['s_months'].isnull(), 's_months'] = df1['types_o'].map(lookup['s_months'])
df1.loc[df1['incidents'].isnull(), 'incidents'] = df1['types_o'].map(lookup['incidents'])
You didn't paste any code or an easily reproducible example of your data, so this is the best I can do.

Jupyter Notebook for csv file to select 3 window rolling [duplicate]

This question already has answers here:
Rolling or sliding window iterator?
(29 answers)
Closed 1 year ago.
I have this big dataset in a CSV file:
I managed to open it in a Jupyter Notebook.
Example of the data in the CSV: 1 2 3 4 5 6 7 8 9 10
I want to view it as a 3-window rolling view without applying any aggregation (sum or mean, for example).
The output I want in the notebook should look like the result shown below.
First, open the CSV to get the first column:
import pandas as pd
df = pd.read_csv("filename.csv")
I will use io only to simulate reading the data from a file:
text = """first
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
Result
first
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Next, you can use shift to create the other columns:
df['second'] = df['first'].shift(-1)
df['third'] = df['first'].shift(-2)
Result
first second third
0 1 2.0 3.0
1 2 3.0 4.0
2 3 4.0 5.0
3 4 5.0 6.0
4 5 6.0 7.0
5 6 7.0 8.0
6 7 8.0 9.0
7 8 9.0 10.0
8 9 10.0 NaN
9 10 NaN NaN
At the end, you can remove the last two rows with NaN and convert everything to integer:
df = df[:-2].astype(int)
or, if you don't have NaN anywhere else:
df = df.dropna().astype(int)
Result:
first second third
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
Minimal working code
text = """first
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
#df = pd.DataFrame(range(1,11), columns=['first'])
print(df)
df['second'] = df['first'].shift(-1) #, fill_value=0)
df['third'] = df['first'].shift(-2)
print(df)
#df = df.dropna().astype(int)
df = df[:-2].astype(int)
print(df)
EDIT:
The same using a for-loop to create any number of columns:
text = """col 1
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
#df = pd.DataFrame(range(1,11), columns=['col 1'])
print(df)
number = 5
for x in range(1, number+1):
    df[f'col {x+1}'] = df['col 1'].shift(-x)
print(df)
#df = df.dropna().astype(int)
df = df[:-number].astype(int)
print(df)
Result
col 1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
col 1 col 2 col 3 col 4 col 5 col 6
0 1 2.0 3.0 4.0 5.0 6.0
1 2 3.0 4.0 5.0 6.0 7.0
2 3 4.0 5.0 6.0 7.0 8.0
3 4 5.0 6.0 7.0 8.0 9.0
4 5 6.0 7.0 8.0 9.0 10.0
5 6 7.0 8.0 9.0 10.0 NaN
6 7 8.0 9.0 10.0 NaN NaN
7 8 9.0 10.0 NaN NaN NaN
8 9 10.0 NaN NaN NaN NaN
9 10 NaN NaN NaN NaN NaN
col 1 col 2 col 3 col 4 col 5 col 6
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
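As an aside, NumPy can build the same overlapping windows in one call. A sketch using numpy.lib.stride_tricks.sliding_window_view (available since NumPy 1.20):
import numpy as np
import pandas as pd

values = np.arange(1, 11)  # the 1..10 column from the CSV
windows = np.lib.stride_tricks.sliding_window_view(values, 3)
df = pd.DataFrame(windows, columns=['col 1', 'col 2', 'col 3'])
print(df)
This produces the 8 complete windows directly, so no NaN rows need to be dropped afterwards.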

How to fill NaN in one column depending from values two different columns

I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column. I need to fill them with median values, according to group and subgroup.
I made a pivot table with a double index and the median of the target column, but I don't understand how to get these values and put them into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A',None],
[13,1,'C',None],
[14,2,'B',None],
[15,3,'A',None]],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1
1 2 1 A 3
2 3 3 B 8
3 4 2 C 1
4 5 3 A 3
5 6 2 C 6
6 7 1 B 2
7 8 1 C 3
8 9 2 A 7
9 10 3 C 4
10 11 2 B 6
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
I will be thankful for any help.
Use pandas.DataFrame.groupby with transform, then fillna. For example, given a frame where the value in row 1 is NaN:
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # <- value with NaN
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
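The same line applied to the original df from the question fills rows 11-14 with exactly the medians from the pivot table:
df['value'] = df['value'].fillna(
    df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df.tail(4))
#     id  group subgroup  value
# 11  12      1        A    2.0
# 12  13      1        C    3.0
# 13  14      2        B    6.0
# 14  15      3        A    3.0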

How to loop through each row in pandas dataframe and set values equal to nan after a threshold is surpassed?

If I have a pandas dataframe like this:
0 1 2 3 4 5
A 5 5 10 9 4 5
B 10 10 10 8 1 1
C 8 8 0 9 6 3
D 10 10 11 4 2 9
E 0 9 1 5 8 3
If I set a threshold of 7, how do I loop through each row and set the values to np.nan after the threshold is no longer met, so that I get a dataframe like this:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2 9
E 0 9 1 5 8 NaN
Where everything after the last number greater than 7 is set equal to np.nan.
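For reference, the example frame can be reconstructed like this (a sketch of the data shown above):
import pandas as pd

df = pd.DataFrame([[5, 5, 10, 9, 4, 5],
                   [10, 10, 10, 8, 1, 1],
                   [8, 8, 0, 9, 6, 3],
                   [10, 10, 11, 4, 2, 9],
                   [0, 9, 1, 5, 8, 3]],
                  index=list('ABCDE'))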
Let's try this:
df.where(df.where(df > 7).bfill(axis=1).notna())
Output:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
Create a mask m by using df.where on df.gt(7), then bfill and notna. Finally, index df with m:
m = df.where(df.gt(7)).bfill(axis=1).notna()
df[m]
df[m]
Out[24]:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
A very nice question. Reverse the column order, then take a cumulative sum of the greater-than-7 mask: positions whose cumulative sum is still 0 (no value greater than 7 at or to their right) become NaN.
df.where(df.iloc[:, ::-1].gt(7).cumsum(axis=1).ne(0))
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
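All three answers build the same keep-mask (is there a value greater than 7 at or to the right of this cell?), so they agree. A quick check, reusing the df built above:
a = df.where(df.where(df > 7).bfill(axis=1).notna())
b = df[df.where(df.gt(7)).bfill(axis=1).notna()]
c = df.where(df.iloc[:, ::-1].gt(7).cumsum(axis=1).ne(0))
assert a.equals(b) and b.equals(c)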
