Working with NaN values in multiple columns in Pandas - python

I have multiple datasets with different numbers of rows and the same number of columns.
I would like to find the NaN values in each column. For example, consider these two datasets:
dataset1:        dataset2:
 a    b           a    b
 1   10           2   11
 2    9           3   12
 3    8           4   13
 4  NaN         NaN   14
 5  NaN         NaN   15
 6  NaN         NaN   16
I want to find the NaN values in columns a and b of the two datasets:
if a NaN occurs in column b, remove all rows that have NaN there; if it occurs in column a, fill those values with 0.
Here is my code snippet, which does not work properly:
a = pd.notnull(data['a'].values.any())
b = pd.notnull(data['b'].values.any())
if a:
    data = data.dropna(subset=['a'])
if b:
    data[['a']] = data[['a']].fillna(value=0)

You just need dropna and fillna, without any control flow:
data = data.dropna(subset=['b']).fillna(0)
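For example, on dataset2 from the question (a minimal sketch; the frame is rebuilt here for illustration):
import pandas as pd
import numpy as np

data = pd.DataFrame({'a': [2, 3, 4, np.nan, np.nan, np.nan],
                     'b': [11, 12, 13, 14, 15, 16]})
# drop the rows where 'b' is NaN first, then fill the remaining NaNs (in 'a') with 0
data = data.dropna(subset=['b']).fillna(0)
Note that the order matters: dropping on 'b' first means fillna(0) only touches the NaNs left in 'a'.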

Pass your conditions to fillna as a dict:
df = df.fillna({'a': 0, 'b': np.nan}).dropna()
Since filling 'b' with NaN is a no-op, you do not need 'b' here:
df = df.fillna({'a': 0}).dropna()
EDIT:
df.fillna({'a':0}).dropna()
Out[1319]:
a b
0 2.0 11
1 3.0 12
2 4.0 13
3 0.0 14
4 0.0 15
5 0.0 16

Related

How to filter a pandas DataFrame till it finds a value in a NaN column?

I have a data frame like this:
df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
17 NaN
18 NaN
I want to filter this data frame from the start to the row where it finds a number in the score column.
So, after filtering the data frame should look like this:
new_df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
I also want to be able to filter the data frame from the row where it finds a number in the score column to the end of the data frame.
So, after filtering the data frame should look like this:
new_df:
number score
16 10
17 NaN
18 NaN
How do I filter this data frame?
Kindly help
You can use pd.Series.last_valid_index and pd.Series.first_valid_index like this:
df.loc[df['score'].first_valid_index():]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
And,
df.loc[:df['score'].last_valid_index()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
And if you want to clip both leading and trailing NaNs, you can combine the two:
df.loc[df['score'].first_valid_index():df['score'].last_valid_index()]
Output:
number score
4 16 10.0
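For reference, a minimal runnable setup reproducing the question's frame (rebuilt here for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'number': [12, 13, 14, 15, 16, 17, 18],
                   'score': [np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan]})
print(df['score'].first_valid_index())   # 4
print(df['score'].last_valid_index())    # 4 (only one valid value here)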
You can use a reverse cummax and boolean slicing:
new_df = df[df['score'].notna()[::-1].cummax()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
For the second one, a simple cummax:
new_df = df[df['score'].notna().cummax()]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
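To see why the cummax trick works, here is a small breakdown on the same frame (a sketch; the intermediate names are illustrative):
mask = df['score'].notna()    # False for rows 0-3, True for row 4, False afterwards
lead = mask[::-1].cummax()    # True from the last valid row backwards to the start
tail = mask.cummax()          # True from the first valid row onwards
df[lead]                      # rows 12-16; df[tail] gives rows 16-18
Boolean indexing aligns on the index, so using the reversed series directly is safe.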

How to remove observations with missing values for specific columns from a pandas DataFrame?

I have a pandas DataFrame containing columns with missing values. I want to remove observations (rows) that have them, but only for specific columns. For example:
A B C D E
2 1 NaN 7 9
1 3 6 NaN 10
NaN 3 11 0 8
And let's say I want to remove observations with a missing value in column D. So I want a result like this:
A B C D E
2 1 NaN 7 9
NaN 3 11 0 8
Thank you for all suggestions.
Let's try a boolean mask built with pd.Series.notna():
df[df.D.notna()]
A B C D E
0 2.0 1 NaN 7.0 9
2 NaN 3 11.0 0.0 8
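Equivalently, dropna with a subset gives the same result:
df.dropna(subset=['D'])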

Transposing a Pandas DataFrame Without Aggregating

I have a multi-column DataFrame that holds several repeated numerical values. It looks like the following:
A B C D
0 1 1 10 1
1 1 1 20 2
2 1 5 30 3
3 2 2 40 4
4 2 3 50 5
This is great; however, I need to make A the index and B the columns. The problem is that pivoting aggregates: the values are averaged wherever the same (A, B) pair repeats.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 5, 2, 3],
                   'C': [10, 20, 30, 40, 50],
                   'D': [1, 2, 3, 4, 5]})
transposed_df = df.pivot_table(index=['A'], columns=['B'])
Instead of keeping both 10 and 20 under B = 1, it averages the two to 15.
C D
B 1 2 3 5 1 2 3 5
A
1 15.0 NaN NaN 30.0 1.5 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
Is there any way I can keep column B the same and display every value of C and D using pandas, or am I better off writing my own function to do this? Also, it is very important that the index and columns stay the same, because only one of each number can exist.
EDIT: This is the desired output. I understand that this exact layout probably isn't possible, but it shows that 10 and 20 need to both be in column 1 and index 1.
C D
B 1 2 3 5 1 2 3 5
A
1 10.0,20.0 NaN NaN 30.0 1.0,2.0 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
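One possible approach, sketched here on the assumption that collecting duplicates into lists is acceptable: pivot_table accepts any aggfunc, including the list constructor, which keeps every value instead of averaging.
transposed_df = df.pivot_table(index=['A'], columns=['B'], aggfunc=list)
Cells where the same (A, B) pair repeats then hold lists such as [10, 20] rather than the mean 15.0.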

Remove fewer than K consecutive NaNs from a pandas DataFrame

I am working with time-series data, and I am having trouble removing runs of consecutive NaNs whose length is at most some threshold from a DataFrame column. I tried looking at some links like:
Identifying consecutive NaN's with pandas: identifies where consecutive NaNs are present and what their count is.
Pandas: run length of NaN holes: outputs a run-length encoding for NaNs.
There are many more others along this line, but none of them actually explains how to remove the NaNs after identifying them.
I found one similar solution, but it is in R:
How to remove more than 2 consecutive NA's in a column?
I want a solution in Python.
So here is the example:
Here is my dataframe column:
a
0 36.45
1 35.45
2 NaN
3 NaN
4 NaN
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
10 NaN
11 NaN
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
If k = 3, my output should be:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71
How can I go about removing runs of consecutive NaNs of length less than or equal to some threshold k?
There are a few ways, but this is how I've done it:
Determine groups of consecutive values using a neat cumsum trick (illustrated in the sketch after this answer)
Use groupby + transform to determine the size of each group
Identify groups of NaNs that are within the threshold
Filter them out with boolean indexing.
k = 3
i = df.a.isnull()   # True where 'a' is NaN
# label runs of consecutive equal values, measure each run's size,
# and keep everything except NaN runs of length <= k
m = ~(df.groupby(i.ne(i.shift()).cumsum().values).a.transform('size').le(k) & i)
df[m]
a
0 36.45
1 35.45
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
You can run df = df[m].reset_index(drop=True) at the end if you want a monotonically increasing integer index.
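To see the run-labeling trick from step 1 in isolation, here is a small sketch (the names are illustrative):
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan, 3.0])
isnull = s.isnull()
run_id = isnull.ne(isnull.shift()).cumsum()   # 1, 2, 2, 3, 4, 5: a new id at every NaN/non-NaN boundary
run_len = s.groupby(run_id.values).transform('size')
print(pd.DataFrame({'s': s, 'run_id': run_id, 'run_len': run_len}))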
You can create an indicator column to count the consecutive NaNs:
k = 3
(df.groupby(pd.notna(df.a).cumsum())
   .apply(lambda x: x.dropna() if pd.isna(x.a).sum() <= k else x)
   .reset_index(drop=True))
Out[375]:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71

Add column in dataframe from list

I have a dataframe with some columns like this:
A B C
0
4
5
6
7
7
6
5
The possible values in A range only from 0 to 7.
Also, I have a list of 8 elements like this:
List = [2, 5, 6, 8, 12, 16, 26, 32]  # there are only 8 elements in this list
If the element in column A is n, I need to insert the n-th element from the List into a new column, say 'D'.
How can I do this in one go without looping over the whole dataframe?
The resulting dataframe would look like this:
A B C D
0 2
4 12
5 16
6 26
7 32
7 32
6 26
5 16
Note: the DataFrame is huge and iteration is the last option. But I can also arrange the elements in 'List' in any other data structure, like a dict, if necessary.
Just assign the list directly:
df['new_col'] = mylist
Alternatively, convert the list to a Series or array and then assign:
se = pd.Series(mylist)
df['new_col'] = se.values
or
df['new_col'] = np.array(mylist)
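One reason to assign .values or an array rather than the Series itself: Series assignment aligns on the index, which can silently introduce NaNs when the indexes differ. A small sketch (the names are illustrative):
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30]}, index=[5, 6, 7])
s = pd.Series([1, 2, 3])          # default index 0, 1, 2
df['aligned'] = s                 # aligns on index -> all NaN here
df['positional'] = s.values       # positional -> 1, 2, 3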
IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.
>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([ 0, 40, 50, 60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
A B C D
0 0 NaN NaN 0
1 4 NaN NaN 40
2 5 NaN NaN 50
3 6 NaN NaN 60
4 15 NaN NaN 150
5 15 NaN NaN 150
6 14 NaN NaN 140
7 13 NaN NaN 130
Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.
Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead-- in the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
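Applied to the question's actual data, the same idea looks like this (a sketch; the frame is rebuilt for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
m = np.asarray([2, 5, 6, 8, 12, 16, 26, 32])
df['D'] = m[df['A'].values]   # -> 2, 12, 16, 26, 32, 32, 26, 16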
A solution improving on the great one from @sparrow.
Let df be your dataset and mylist the list with the values you want to add to the DataFrame.
Let's suppose you want to call your new column simply new_column.
First make the list into a Series:
column_values = pd.Series(mylist)
Then use the insert function to add the column. This function has the advantage of letting you choose the position where you want to place the column.
In the following example we will put the new column in the first position from the left (by setting loc=0):
df.insert(loc=0, column='new_column', value=column_values)
First, let's create the DataFrame you had; I'll ignore columns B and C as they are not relevant.
df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
And the mapping that you desire:
mapping = dict(enumerate([2,5,6,8,12,16,26,32]))
df['D'] = df['A'].map(mapping)
Done!
print(df)
Output:
A D
0 0 2
1 4 12
2 5 16
3 6 26
4 7 32
5 7 32
6 6 26
7 5 16
Old question, but I always try to use the fastest code! I had a huge list of 69 million uint64 values; np.array() was the fastest for me.
df['hashes'] = hashes
Time spent: 17.034842014312744
df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673
df['key'] = np.array(hashes)
Time spent: 10.724546194076538
You can also use df.assign:
In [1559]: df
Out[1559]:
A B C
0 0 NaN NaN
1 4 NaN NaN
2 5 NaN NaN
3 6 NaN NaN
4 7 NaN NaN
5 7 NaN NaN
6 6 NaN NaN
7 5 NaN NaN
In [1560]: mylist = [2,5,6,8,12,16,26,32]
In [1567]: df = df.assign(D=mylist)
In [1568]: df
Out[1568]:
A B C D
0 0 NaN NaN 2
1 4 NaN NaN 5
2 5 NaN NaN 6
3 6 NaN NaN 8
4 7 NaN NaN 12
5 7 NaN NaN 16
6 6 NaN NaN 26
7 5 NaN NaN 32
