Given the sample data:
p = [[1.234,1], [2.2134,1.2365], [1.1234,2.5432]]
q = [[2,2], [0,1], [2,4]]
p[p == 22] = np.nan
I am able to remove the rows of p that are entirely NaN by doing:
p = np.array([i for i in p if np.any(np.isfinite(i))], np.float64)
q = np.array(q, np.float64)
Is there a way, short of writing a loop, to check for NaNs and remove them? The above works for one pair of arrays, but what if I have a dataset like the one below (the real data is much bigger, around (106, 1900))?
df =
1 1.1 2 2.1 3 3.1 4 4.1 5 5.1
0 43.1024 6.7498 NaN NaN NaN NaN NaN NaN NaN NaN
1 46.0595 1.6829 25.0695 3.7463 NaN NaN NaN NaN NaN NaN
2 25.0695 5.5454 44.9727 8.6660 41.9726 2.6666 84.9566 3.8484 44.9566 1.8484
3 35.0281 7.7525 45.0322 3.7465 14.0369 3.7463 NaN NaN NaN NaN
4 35.0292 7.5616 45.0292 4.5616 23.0292 3.5616 45.0292 NaN NaN NaN
Try, for instance (to fill all NaNs with 0s):
df.fillna(0)
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You can also use the mean of each column to fill its NaN values:
df.fillna(df.mean())
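If the goal is to drop rows containing NaN rather than fill them, and the two arrays are paired row-for-row as in the original p/q example, a boolean mask avoids any Python loop. A minimal sketch (the array contents here are stand-ins):

```python
import numpy as np

p = np.array([[1.234, 1.0], [np.nan, 1.2365], [1.1234, 2.5432]])
q = np.array([[2, 2], [0, 1], [2, 4]], dtype=np.float64)

# True for rows of p with no NaN; apply the same mask to q to keep them aligned
mask = ~np.isnan(p).any(axis=1)
p_clean, q_clean = p[mask], q[mask]
```

The same idea scales to the wider DataFrame: `df.dropna()` drops rows with any NaN, and `df.dropna(axis=1)` drops columns.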
Related
This is probably an easy question, but I couldn't find any simple way to do that. Imagine the following dataframe:
df = pd.DataFrame(index=range(10), columns=range(5))
and three lists that contain indices, columns, and values of the defined dataframe that I intend to change:
idx_list = [1,5,3,7] # the indices of the cells that I want to change
col_list = [1,4,3,1] # the columns of the cells that I want to change
value_list = [9,8,7,6] # the final values of those cells
I was wondering if there exists a function in pandas that does the following efficiently:
for i in range(len(idx_list)):
    df.loc[idx_list[i], col_list[i]] = value_list[i]
Thanks.
Using .values
df.values[idx_list,col_list]=value_list
df
Out[205]:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 9 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 7 NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 8
6 NaN NaN NaN NaN NaN
7 NaN 6 NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Or, another less efficient way:
updatedf=pd.Series(value_list,index=pd.MultiIndex.from_arrays([idx_list,col_list])).unstack()
df.update(updatedf)
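Note that writing through .values only propagates when the frame is backed by a single array; on mixed-dtype frames (and under pandas' copy-on-write behaviour in newer versions) the assignment may silently do nothing. A sketch that sidesteps the issue by modifying an explicit copy and rebuilding the frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10), columns=range(5))
idx_list = [1, 5, 3, 7]
col_list = [1, 4, 3, 1]
value_list = [9, 8, 7, 6]

# copy=True guarantees a fresh array detached from the frame
arr = df.to_numpy(copy=True)
arr[idx_list, col_list] = value_list
df = pd.DataFrame(arr, index=df.index, columns=df.columns)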
You can also try the df.applymap() function with a lambda to carry out the required element-wise operations.
For certain columns of df, if 80% of the column is NaN, what's the simplest code to drop such columns?
You can use isnull with mean to get the fraction of NaNs per column, then remove columns by boolean indexing with loc (loc because we are removing columns). The condition also needs inverting: keeping columns with a NaN fraction < .8 removes all columns with >= 0.8:
df = df.loc[:, df.isnull().mean() < .8]
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = np.nan
print (df.isnull().mean())
A 0.81
B 0.00
C 0.06
D 0.80
E 0.00
dtype: float64
df = df.loc[:, df.isnull().mean() < .8]
print (df.head())
B C E
0 0.278369 NaN 0.004719
1 0.670749 NaN 0.575093
2 0.209202 NaN 0.219697
3 0.811683 NaN 0.274074
4 0.940030 NaN 0.175410
If want remove columns by minimal values dropna working nice with parameter thresh and axis=1 for remove columns:
np.random.seed(1997)
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN
1 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN
3 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 1.0
5 NaN NaN NaN 1.0 1.0 NaN NaN 1.0 NaN 1.0
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
9 1.0 NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN
df1 = df.dropna(thresh=2, axis=1)
print (df1)
0 3 4 5 7 9
0 NaN 1.0 1.0 NaN NaN NaN
1 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN 1.0 NaN NaN
3 NaN NaN 1.0 NaN NaN NaN
4 NaN NaN NaN 1.0 NaN 1.0
5 NaN 1.0 1.0 NaN 1.0 1.0
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN 1.0 NaN
9 1.0 NaN 1.0 NaN 1.0 NaN
EDIT: For non-Boolean data
Total number of NaN entries in a column must be less than 80% of total entries:
df = df.loc[:, df.isnull().sum() < 0.8*df.shape[0]]
df.dropna(thresh=int((100 - percent_NA_cols_required) * (len(df.columns) / 100)), inplace=True)
Basically, dropna's thresh parameter takes the number (int) of non-NA values a row must have in order to be kept; rows with fewer are removed.
You can use the pandas dropna. For example:
df.dropna(axis=1, thresh = int(0.2*df.shape[0]), inplace=True)
Notice that we used 0.2, which is 1 - 0.8, since thresh refers to the number of non-NA values.
As suggested in comments, if you use sum() on a boolean test, you can get the number of occurrences.
Code:
def get_nan_cols(df, nan_percent=0.8):
    threshold = len(df.index) * nan_percent
    return [c for c in df.columns if sum(df[c].isnull()) >= threshold]
Used as:
df = df.drop(columns=get_nan_cols(df, 0.8))
I have a DataFrame:
df = pd.DataFrame(
    np.random.rand(10, 3),
    columns='sensor_id|unix_timestamp|value'.split('|'))
I want to create 5 more columns in which each new column is a shifted version of the value column.
sensor_id unix_timestamp value value_shift_0 value_shift_1 value_shift_2 value_shift_3 value_shift_4
0 0.901001 0.036683 0.945908 NaN NaN NaN NaN NaN
1 0.751759 0.038600 0.117308 NaN NaN NaN NaN NaN
2 0.737604 0.484417 0.602733 NaN NaN NaN NaN NaN
3 0.259865 0.522115 0.074188 NaN NaN NaN NaN NaN
4 0.932359 0.662560 0.648445 NaN NaN NaN NaN NaN
5 0.114668 0.066766 0.285553 NaN NaN NaN NaN NaN
6 0.795851 0.565259 0.888404 NaN NaN NaN NaN NaN
7 0.082534 0.355506 0.671816 NaN NaN NaN NaN NaN
8 0.336648 0.651789 0.859373 NaN NaN NaN NaN NaN
9 0.917073 0.842281 0.458542 NaN NaN NaN NaN NaN
But I don't know how to fill in the appropriate shifted value columns.
Use pd.concat with a dictionary comprehension, along with join:
df.join(
    pd.concat(
        {'value_shift_{}'.format(i): df.value.shift(i) for i in range(5)},
        axis=1))
An alternative with numpy:
def multi_shift(s, n):
    a = np.arange(len(s))
    i = (a[:, None] - a[:n]).ravel()
    e = np.empty(i.shape)
    e.fill(np.nan)
    w = np.where(i >= 0)
    e[w] = s.values[i[w]]
    return pd.DataFrame(e.reshape(len(s), -1),
                        s.index, ['shift_%i' % j for j in range(n)])

df.join(multi_shift(df.value, 5))
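A third option with the same result, sketched here for five shifts of the value column, is a plain dict comprehension passed to assign:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3),
                  columns='sensor_id|unix_timestamp|value'.split('|'))

# One new column per shift amount; shift(0) is simply a copy of value
out = df.assign(**{'value_shift_{}'.format(i): df['value'].shift(i)
                   for i in range(5)})
```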
I have a data frame in pandas:
d1_a d2_a d3_a group
BI59 NaN 0.023333 NaN 2
BI71 NaN 0.173333 NaN 2
BI52 NaN NaN NaN 1
BI44 0.450000 NaN NaN 1
BI36 NaN 0.286667 NaN 2
BI29 NaN 0.030000 NaN 2
BI50 NaN 0.633333 NaN 2
BI63 NaN 0.110000 NaN 2
BI64 NaN 0.320000 NaN 2
BI65 0.206667 NaN NaN 1
BI67 NaN 0.216667 NaN 2
BI68 NaN 0.473333 NaN 2
BI71 NaN 0.053333 NaN 2
BI72 NaN 0.006667 NaN 2
BI75 NaN 0.430000 NaN 2
BI76 NaN 0.260000 NaN 2
BI78 NaN 0.250000 NaN 2
BI81 NaN 0.006667 NaN 2
BI83 NaN 0.603333 NaN 2
BI84 NaN NaN 0.196667 3
BI86 NaN NaN 0.046667 3
BI89 NaN 0.110000 NaN 2
BI91 NaN NaN 0.213333 3
BI93 NaN 0.443333 NaN 2
BI97 0.586667 NaN NaN 1
BI98 0.380000 NaN NaN 1
BI99 0.016667 NaN NaN 1
BI11 NaN 0.206667 NaN 2
BI12 NaN 0.500000 NaN 2
BI17 0.626667 NaN NaN 1
The BI## is the index column, the groups that the rows belong to are denoted by the group column. So d1_a is group 1, d2_a is group 2 and d3_a is group 3. Also, the numbers on the index column would be the x axis. How do I create a scatter plot, with each group being represented by a different color? When I try plotting I get empty plots.
If I try something like subset_d1_a = df['d1_a'].dropna() and do something similar for each group then I can remove the NaNs but now the arrays are of different lengths and I cannot plot them all on the same graph.
Preferably I'd like to do this in seaborn but any method in python will do.
So far this is what I'm doing; not sure if I'm going down the right path:
subset = pd.concat([df.d1_a, df.d2_a, df.d3_a], axis=1)
subset = subset.sum(axis=1)
subset = pd.concat([subset,df.group], axis=1)
subset = subset.dropna()
g = subset.groupby('group')
It is not clear what a scatter chart would look like given your data, but you could do something like this:
colors = {1: 'red', 2: 'green', 3: 'blue'}
df.iloc[:, :3].sum(axis=1).plot(kind='bar', color=df.group.map(colors).tolist())
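For an actual scatter, one approach (a sketch using a made-up three-row frame, since the full data isn't reproduced here) is to collapse the three sparse columns with a row-wise sum and colour the points by group:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical stand-in for the posted frame
df = pd.DataFrame({'d1_a': [0.45, np.nan, np.nan],
                   'd2_a': [np.nan, 0.0233, np.nan],
                   'd3_a': [np.nan, np.nan, 0.1967],
                   'group': [1, 2, 3]},
                  index=['BI44', 'BI59', 'BI84'])

# Each row holds exactly one non-NaN value, so sum(axis=1) recovers it
value = df[['d1_a', 'd2_a', 'd3_a']].sum(axis=1)
colors = {1: 'red', 2: 'green', 3: 'blue'}
plt.scatter(range(len(df)), value, c=df['group'].map(colors).tolist())
plt.xticks(range(len(df)), df.index)
```

This keeps all rows the same length (no per-group dropna needed), so everything lands on one set of axes.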
I'm trying to multiply N columns in a DataFrame by N columns in the same DataFrame, and then divide the results by a single column. I'm having trouble with the first part, see example below.
import pandas as pd
from numpy import random
foo = pd.DataFrame({'A': random.rand(10),
                    'B': random.rand(10),
                    'C': random.rand(10),
                    'N': random.randint(1, 100, 10),
                    'X': random.rand(10),
                    'Y': random.rand(10),
                    'Z': random.rand(10)})
foo[['A','B','C']].multiply(foo[['X','Y','Z']], axis=0).divide(foo['N'], axis=0)
What I'm trying to get at is column-wise multiplication (i.e. A*X, B*Y, C*Z)
The result is not an N column matrix but a 2N one, where the columns I'm trying to multiply by are added to the DataFrame, and all the entries have NaN values, like so:
A B C X Y Z
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
What's going on here, and how do I do column-wise multiplication?
This will work using the values from columns X, Y, Z and N, but perhaps it will help you see what the issue is:
>>> (foo[['A','B','C']]
.multiply(foo[['X','Y','Z']].values)
.divide(foo['N'].values, axis=0))
A B C
0 0.000452 0.004049 0.010364
1 0.004716 0.001566 0.012881
2 0.001488 0.000296 0.004415
3 0.000269 0.001168 0.000327
4 0.001386 0.008267 0.012048
5 0.000084 0.009588 0.003189
6 0.000099 0.001063 0.006493
7 0.009958 0.035766 0.012618
8 0.001252 0.000860 0.000420
9 0.006422 0.005013 0.004108
The result is indexed on columns A, B and C. Pandas aligns the operands on their column labels, and A, B, C share no labels with X, Y, Z, which is why you were getting NaNs across a doubled set of columns.
Appending .values to the expression above will give you the result you desire as a bare array, but it is then up to you to supply the index and columns.
>>> (foo[['A','B','C']]
.multiply(foo[['X','Y','Z']].values)
.divide(foo['N'].values, axis=0)).values
array([[ 4.51754797e-04, 4.04911292e-03, 1.03638836e-02],
[ 4.71588457e-03, 1.56556402e-03, 1.28805803e-02],
[ 1.48820116e-03, 2.95700572e-04, 4.41516179e-03],
[ 2.68791866e-04, 1.16836123e-03, 3.27217820e-04],
[ 1.38648301e-03, 8.26692582e-03, 1.20482313e-02],
[ 8.38762247e-05, 9.58768066e-03, 3.18903965e-03],
[ 9.94132918e-05, 1.06267623e-03, 6.49315435e-03],
[ 9.95764539e-03, 3.57657737e-02, 1.26179014e-02],
[ 1.25210929e-03, 8.59735215e-04, 4.20124326e-04],
[ 6.42175897e-03, 5.01250179e-03, 4.10783492e-03]])
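To end up with a labelled frame rather than a bare array, you can strip the labels yourself up front and rebuild the DataFrame with whatever columns you want. A sketch under the same column layout (the 'AX', 'BY', 'CZ' names are just illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
foo = pd.DataFrame({c: rng.random(10) for c in 'ABCXYZ'})
foo['N'] = rng.integers(1, 100, size=10)

# Positional arithmetic on the raw arrays, then reattach labels of our choosing
result = pd.DataFrame(
    foo[['A', 'B', 'C']].to_numpy() * foo[['X', 'Y', 'Z']].to_numpy()
    / foo[['N']].to_numpy(),          # shape (10, 1) broadcasts over 3 columns
    index=foo.index, columns=['AX', 'BY', 'CZ'])
```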