Replace Column with single value but LEAVE NaNs - python

I'm trying to replace an entire column with a single value, but I want to leave the NaNs in place. How do I go about doing that? Let's say for column 'Q1' I would like to replace every value with 1 but leave every row that has NaN untouched. In the end, every row of column 'Q1' that had a numeric value would hold the value 1, and every row that had NaN would still be NaN.
Q1 Q2 Q3 Q4
0 NaN NaN 1.33 NaN
1 NaN NaN NaN 1.35
2 0.93 NaN NaN NaN
3 NaN 1.08 NaN NaN
4 NaN NaN 1.28 NaN
...

In [13]: df
Out[13]:
Q1 Q2
0 NaN 1.0
1 NaN 2.0
2 0.93 NaN
3 NaN 3.0
4 NaN 4.0
In [14]: df.loc[~df.Q1.isna(), 'Q1'] = 1
In [15]: df
Out[15]:
Q1 Q2
0 NaN 1.0
1 NaN 2.0
2 1.0 NaN
3 NaN 3.0
4 NaN 4.0
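An equivalent way to write the same replacement is with Series.where, which keeps values where a condition holds and substitutes a replacement elsewhere. The following is a minimal sketch on a small frame made up to mirror the example above, not the asker's actual data:

```python
import numpy as np
import pandas as pd

# Illustrative data mirroring the example above (assumed, not the asker's file).
df = pd.DataFrame({
    "Q1": [np.nan, np.nan, 0.93, np.nan, np.nan],
    "Q2": [1.0, 2.0, np.nan, 3.0, 4.0],
})

# where() keeps entries for which the condition is True (the NaN rows)
# and writes 1 into every other entry.
df["Q1"] = df["Q1"].where(df["Q1"].isna(), 1)
print(df)
```

Both forms leave Q2 untouched; the loc version assigns in place, while where builds a new Series that is assigned back to the column.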

Related

Pandas set all values after first NaN to NaN

For each row I would like to set all values to NaN after the appearance of the first NaN. E.g.:
a b c
1 2 3 4
2 nan 2 nan
3 3 nan 23
Should become this:
a b c
1 2 3 4
2 nan nan nan
3 3 nan nan
So far I only know how to do this with an apply with a for loop over each column per row - it's very slow!
Check with cumprod
df=df.where(df.notna().cumprod(axis=1).eq(1))
a b c
1 2.0 3.0 4.0
2 NaN NaN NaN
3 3.0 NaN NaN
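To see why the cumprod trick works, it can help to split the one-liner into steps. This is a sketch on the example frame; the intermediate names are only for illustration, and the explicit astype(int) is added here just to make the arithmetic visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [2, np.nan, 3], "b": [3, 2, np.nan], "c": [4, np.nan, 23]},
    index=[1, 2, 3],
)

# Step 1: 1 where a value is present, 0 where it is NaN.
present = df.notna().astype(int)

# Step 2: the running product along each row stays 1 until the first 0,
# and is 0 from that point on.
before_first_nan = present.cumprod(axis=1).eq(1)

# Step 3: keep only the values that come before the first NaN in each row.
result = df.where(before_first_nan)
print(result)
```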

Add single field on df end

Is it possible, after a dataframe with 20+ rows and xx+ columns, to add a single field with the total count of a certain value? The user will add different values to the df, and before calling 'pandas.DataFrame.to_excel' it's necessary to add a single field with some specific data, like in the attached picture. Is it possible to add a single field after an already structured df?
This can work for you:
Df:
A B output
0 a 1.0 1.0
1 a 2.0 1.0
2 a 3.0 1.0
3 a 4.0 1.0
4 a 5.0 1.0
import numpy as np
for i in range(df.iloc[-1].name + 1, 25):  # Add 20 new all-NaN rows (change the upper bound as needed)
    df.loc[i, :] = np.nan
df.loc[df.iloc[-1].name + 1, 'A'] = 'Result: ' + str(df['B'].sum())  # This example stores the sum of column B; replace it with your own value.
print(df)
A B output
0 a 1.0 1.0
1 a 2.0 1.0
2 a 3.0 1.0
3 a 4.0 1.0
4 a 5.0 1.0
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN NaN
16 NaN NaN NaN
17 NaN NaN NaN
18 NaN NaN NaN
19 NaN NaN NaN
20 NaN NaN NaN
21 NaN NaN NaN
22 NaN NaN NaN
23 NaN NaN NaN
24 NaN NaN NaN
25 Result: 15.0 NaN NaN
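If the blank spacer rows are not needed, the summary can also be appended as a single extra row with pd.concat. The sketch below counts how often one value occurs; the value "a", the label text, and the file name in the comment are assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a"] * 5, "B": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Hypothetical summary: how many times the value "a" occurs in column A.
count_a = int((df["A"] == "a").sum())

# A one-row frame holding only the summary cell; columns it lacks become NaN.
summary = pd.DataFrame({"A": [f"Count of 'a': {count_a}"]})
out = pd.concat([df, summary], ignore_index=True)
print(out)

# Writing it out would then be, for example:
# out.to_excel("report.xlsx", index=False)  # file name is a placeholder
```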

Cutting dataframe loop

I have a dataset which is only one column. I want to cut the column into multiple dataframes.
I use a for loop to create a list which contains the values at which positions I want to cut the dataframe.
import pandas as pd
df = pd.read_csv("column.csv", delimiter=";", header=0, index_col=(0))
number_of_pixels = int(len(df.index))
print("You have " + str(number_of_pixels) +" pixels in your file")
number_of_rows = int(input("Enter number of rows you want to create"))
lst = []  # this list contains the number of pixels per row
for i in range(number_of_rows):  # this loop fills the list with the number of pixels per row
    pixels_per_row = int(input("Enter number of pixels in row " + str(i)))
    lst.append(pixels_per_row)
print(lst)
After cutting the column into multiple dataframes I want to transpose each dataframe and concatenate all of them back together using:
df1=df1.reset_index(drop=True)
df1=df1.T
df2=df2.reset_index(drop=True)
df2=df2.T
frames = [df1,df2]
result = pd.concat(frames, axis=0)
print(result)
So I want to create a loop that cuts my dataframe into multiple frames at the positions stored in my list.
Thank you!
This is a problem that is better solved with numpy. I'll start from the point where you have received the list from your user input. The idea is to use numpy.split to separate the values at the cumulative pixel counts, and then create a new DataFrame from the pieces.
Setup
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'val': np.random.randint(1,10,50)})
lst = [4,10,2,1,15,8,9,1]
Code
pd.DataFrame(np.split(df.val.values, np.cumsum(lst)[:-1]))
Output
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 3 3.0 7.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4 7.0 2.0 1.0 2.0 1.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN NaN
2 1 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 4.0 3.0 5.0 8.0 3.0 5.0 9.0 1.0 8.0 4.0 5.0 7.0 2.0 6.0
5 7 3.0 2.0 9.0 4.0 6.0 1.0 3.0 NaN NaN NaN NaN NaN NaN NaN
6 7 3.0 5.0 5.0 7.0 4.0 1.0 7.0 5.0 NaN NaN NaN NaN NaN NaN
7 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If your list asks for more pixels than the total number of rows in your initial DataFrame, you'll get extra all-NaN rows in your output. If your lst sums to less than the total number of pixels, all the remaining values are added to the last row. Since you didn't specify either of these conditions in your question, I'm not sure how you'd want to handle that.
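One way to deal with a mismatch is to check the requested lengths before splitting. The helper below is a sketch, and its name and the choice to raise an error are assumptions rather than part of the answer above:

```python
import numpy as np
import pandas as pd

def split_into_rows(values, lengths):
    """Cut a 1-D sequence into rows of the given lengths (hypothetical helper)."""
    values = np.asarray(values)
    if sum(lengths) != len(values):
        raise ValueError(
            f"lengths sum to {sum(lengths)} but there are {len(values)} values"
        )
    # Cut points are the cumulative lengths, except the last one (the end).
    chunks = np.split(values, np.cumsum(lengths)[:-1])
    # Shorter rows are padded with NaN when the frame is built.
    return pd.DataFrame([chunk.tolist() for chunk in chunks])

result = split_into_rows(range(1, 8), [3, 1, 3])
print(result)
```

With the one-column frame from the setup, the call would be split_into_rows(df.val.values, lst).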

Combine multi columns to one column Pandas

Hi I have the following dataframe
z a b c
a 1 NaN NaN
ss NaN 2 NaN
cc 3 NaN NaN
aa NaN 4 NaN
ww NaN 5 NaN
ss NaN NaN 6
aa NaN NaN 7
g NaN NaN 8
j 9 NaN NaN
I would like to create a new column d to do something like this
z a b c d
a 1 NaN NaN 1
ss NaN 2 NaN 2
cc 3 NaN NaN 3
aa NaN 4 NaN 4
ww NaN 5 NaN 5
ss NaN NaN 6 6
aa NaN NaN 7 7
g NaN NaN 8 8
j 9 NaN NaN 9
The numbers are not actually integers; they are np.float64. The integers are only there to keep the example clear. You may assume the numbers look like 32065431243556.62 or 763835218962767.8. Thank you for your help.
We can replace the NA by 0 and sum up the rows.
df['d'] = df[['a', 'b', 'c']].fillna(0).sum(axis=1)
In fact, it's not necessary to use fillna; sum skips NaN elements by default, which has the same effect as treating them as zeros.
I'm a Python newcomer as well, and I suggest reading the pandas cookbook first.
The code is:
df['Total']=df[['a','b','c']].sum(axis=1).astype(int)
You can use pd.DataFrame.ffill over axis=1:
df['D'] = df.ffill(axis=1).iloc[:, -1].astype(int)
print(df)
a b c D
0 1.0 NaN NaN 1
1 NaN 2.0 NaN 2
2 3.0 NaN NaN 3
3 NaN 4.0 NaN 4
4 NaN 5.0 NaN 5
5 NaN NaN 6.0 6
6 NaN NaN 7.0 7
7 NaN NaN 8.0 8
8 9.0 NaN NaN 9
Of course, if you have float values, int conversion is not required.
If there is only one value per row, as in the given example, you can drop the NaNs in each row and assign the remaining value to column d:
df['d'] = df[['a', 'b', 'c']].apply(lambda row: row.dropna().iloc[0], axis=1)
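A vectorized variant of the same idea is to back-fill across the value columns and take the first column, which avoids a row-wise apply and leaves the floats unchanged. This is a sketch with made-up float values, and the value_cols list is an assumption about which columns hold the numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "z": ["a", "ss", "cc"],
    "a": [1.5, np.nan, 3.25],
    "b": [np.nan, 2.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

value_cols = ["a", "b", "c"]  # columns to combine; the label column z is left out

# After back-filling along each row, the first column holds the first
# non-NaN value of that row (or NaN if the whole row is empty).
df["d"] = df[value_cols].bfill(axis=1).iloc[:, 0]
print(df)
```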

How to drop column according to NAN percentage for dataframe?

For certain columns of df, 80% or more of the values are NaN.
What's the simplest code to drop such columns?
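You can use isnull with mean to get the fraction of NaNs per column, and then select columns by boolean indexing with loc (loc because columns are being filtered). The condition is inverted: < .8 keeps the columns below the threshold, so all columns with a NaN fraction >= 0.8 are removed: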
df = df.loc[:, df.isnull().mean() < .8]
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = np.nan
print (df.isnull().mean())
A 0.81
B 0.00
C 0.06
D 0.80
E 0.00
dtype: float64
df = df.loc[:, df.isnull().mean() < .8]
print (df.head())
B C E
0 0.278369 NaN 0.004719
1 0.670749 NaN 0.575093
2 0.209202 NaN 0.219697
3 0.811683 NaN 0.274074
4 0.940030 NaN 0.175410
If you want to remove columns based on a minimum number of non-NaN values, dropna works nicely with the thresh parameter and axis=1:
np.random.seed(1997)
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN
1 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN
3 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 1.0
5 NaN NaN NaN 1.0 1.0 NaN NaN 1.0 NaN 1.0
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
9 1.0 NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN
df1 = df.dropna(thresh=2, axis=1)
print (df1)
0 3 4 5 7 9
0 NaN 1.0 1.0 NaN NaN NaN
1 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN 1.0 NaN NaN
3 NaN NaN 1.0 NaN NaN NaN
4 NaN NaN NaN 1.0 NaN 1.0
5 NaN 1.0 1.0 NaN 1.0 1.0
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN 1.0 NaN
9 1.0 NaN 1.0 NaN 1.0 NaN
EDIT: For non-Boolean data
Total number of NaN entries in a column must be less than 80% of total entries:
df = df.loc[:, df.isnull().sum() < 0.8*df.shape[0]]
df.dropna(thresh=int((100-percent_NA_cols_required)*(len(df.columns)/100)),inplace=True)
Basically, the thresh argument of dropna takes the number (int) of non-NA values a row must have in order to be kept; rows with fewer are removed.
You can use the pandas dropna. For example:
df.dropna(axis=1, thresh = int(0.2*df.shape[0]), inplace=True)
Notice that we used 0.2 which is 1-0.8 since the thresh refers to the number of non-NA values
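The mean-based filter and the thresh-based filter are close but can disagree for a column that sits exactly on the 80% boundary, because one uses a strict comparison on the NaN fraction and the other an inclusive minimum count of non-NA values. A small sketch, with column names invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_nan": [np.nan] * 8 + [1.0, 2.0],  # exactly 80% NaN
    "full": np.arange(10, dtype=float),        # no NaN at all
})

# Keep columns whose NaN fraction is strictly below 0.8.
by_fraction = df.loc[:, df.isnull().mean() < 0.8]

# Keep columns that have at least int(0.2 * 10) = 2 non-NA values.
by_thresh = df.dropna(axis=1, thresh=int(0.2 * df.shape[0]))

print(list(by_fraction.columns))
print(list(by_thresh.columns))
```

Here the first filter drops mostly_nan while the second keeps it, so which form to use depends on whether a column at exactly 80% NaN should be removed.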
As suggested in the comments, if you use sum() on a boolean test, you get the number of occurrences.
Code:
def get_nan_cols(df, nan_percent=0.8):
    threshold = len(df.index) * nan_percent
    return [c for c in df.columns if sum(df[c].isnull()) >= threshold]
Used as:
df = df.drop(columns=get_nan_cols(df, 0.8))
