I have a DataFrame object df with a column like this:
[In]: df
[Out]:
id sum
0 1 NaN
1 1 NaN
2 1 2
3 1 NaN
4 1 4
5 1 NaN
6 2 NaN
7 2 NaN
8 2 3
9 2 NaN
10 2 8
11 2 NaN
... ... ...
[1810601 rows x 2 columns]
I have a lot of NaN values in my column and I want to fill them in the following way:
if the NaN is at the beginning of a group (before the first non-NaN value for that id), it should become 0
otherwise, the NaN should take the most recent preceding value for the same id
Output should be like that:
[In]: df
[Out]:
id sum
0 1 0
1 1 0
2 1 2
3 1 2
4 1 4
5 1 4
6 2 0
7 2 0
8 2 3
9 2 3
10 2 8
11 2 8
... ... ...
[1810601 rows x 2 columns]
I tried to do it "step by step" using a loop with iterrows(), but it is a very inefficient method. I believe it can be done faster with pandas methods.
Use ffill with groupby, then fill the remaining leading NaNs with 0:
df['sum'] = df.groupby('id')['sum'].ffill().fillna(0)
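A minimal, self-contained sketch of this approach (using a small sample in place of the full 1.8M-row frame):

```python
import numpy as np
import pandas as pd

# Small reproduction of the problem (assumed sample data, not the real frame)
df = pd.DataFrame({
    "id":  [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "sum": [np.nan, np.nan, 2, np.nan, 4, np.nan,
            np.nan, np.nan, 3, np.nan, 8, np.nan],
})

# Forward-fill within each id, then replace the remaining leading NaNs with 0
df["sum"] = df.groupby("id")["sum"].ffill().fillna(0)
```

Because groupby().ffill() never crosses group boundaries, the leading NaNs of each id survive the forward fill and are exactly the values that fillna(0) then replaces.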
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the count of all non-NaN values in the same column, up to and including that row. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
out = df.copy()
for i in range(len(df)):
    out.iloc[i] = df.iloc[0:i+1].notna().sum()
However, this is very slow. My real DataFrame contains thousands of columns, so iterating over them is impossible due to the low processing speed. What can I do? Maybe it should be something involving the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna marks the non-NaN values and cumsum produces the running counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
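For completeness, a self-contained reconstruction (the frame is rebuilt here from the values shown in the question):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame({
    "a": [np.nan, np.nan, np.nan, 7.0, 3.0, 5.0, 7.0, 8.0, np.nan, np.nan],
    "b": [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    "c": [np.nan] * 7 + [3.0, 5.0, 4.0],
})

# notna() marks non-NaN cells as True; cumsum() turns the booleans into a
# running count, column by column, in a single vectorized pass
out = df.notna().cumsum()
```

This scales to thousands of columns because both operations are vectorized over the whole frame at once.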
I have the following data:
one_dict = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four"}
two_dict = {0: "light", 1: "calc", 2: "line", 3: "blur", 4: "color"}
np.random.seed(2)
n = 15
a_df = pd.DataFrame(dict(a=np.random.randint(0, 4, n), b=np.random.randint(0, 3, n)))
a_df["c"] = np.nan
a_df = a_df.sort_values("b").reset_index(drop=True)
where the dataframe looks like:
In [45]: a_df
Out[45]:
a b c
0 3 0 NaN
1 1 0 NaN
2 0 0 NaN
3 2 0 NaN
4 3 0 NaN
5 1 0 NaN
6 2 1 NaN
7 2 1 NaN
8 3 1 NaN
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
I would like to replace values in c with those from dictionaries one_dict
and two_dict, with the result as follows:
In [45]: a_df
Out[45]:
a b c
0 3 0 three
1 1 0 one
2 0 0 zero
3 2 0 .
4 3 0 .
5 1 0 .
6 2 1 line
7 2 1 line
8 3 1 blur
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
Attempt
I'm not sure what a good approach to this would be though.
I thought that I might do something along the following lines:
merge_df = pd.DataFrame(dict(one = one_dict, two=two_dict)).reset_index()
merge_df['zeros'] = 0
merge_df['ones'] = 1
giving
In [62]: merge_df
Out[62]:
index one two zeros ones
0 0 zero light 0 1
1 1 one calc 0 1
2 2 two line 0 1
3 3 three blur 0 1
4 4 four color 0 1
Then merge this into a_df, but I'm not sure how to merge and update
at the same time, or whether this is a good approach.
Edit
keys correspond to the values of column a
. is just shorthand, this should be filled in with the value as others are
This is just a matter of creating a new dataframe with the correct structure and merging:
(a_df.drop('c', axis=1)
     .merge(pd.DataFrame([one_dict, two_dict])  # row 0 holds one_dict, row 1 two_dict
              .rename_axis(index='b', columns='a')
              .stack()
              .reset_index(name='c'),  # long frame with columns b, a, c
            on=['a', 'b'],
            how='left')
)
Output:
a b c
0 3 0 three
1 1 0 one
2 0 0 zero
3 2 0 two
4 3 0 three
5 1 0 one
6 2 1 line
7 2 1 line
8 3 1 blur
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
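An alternative sketch (a different technique from the merge above): select the dictionary according to the value of b with nested numpy.where calls and Series.map; rows with b == 2 fall through to NaN:

```python
import numpy as np
import pandas as pd

one_dict = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four"}
two_dict = {0: "light", 1: "calc", 2: "line", 3: "blur", 4: "color"}

np.random.seed(2)
n = 15
a_df = pd.DataFrame(dict(a=np.random.randint(0, 4, n),
                         b=np.random.randint(0, 3, n)))
a_df = a_df.sort_values("b").reset_index(drop=True)

# b == 0 -> look a up in one_dict, b == 1 -> in two_dict, otherwise NaN
a_df["c"] = np.where(a_df["b"].eq(0), a_df["a"].map(one_dict),
            np.where(a_df["b"].eq(1), a_df["a"].map(two_dict), np.nan))
```

The merge generalizes better to many dictionaries; the map version avoids building the intermediate long frame.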
I have a data frame like this:
A B C D
0 1 0 nan nan
1 8 0 nan nan
2 8 1 nan nan
3 2 1 nan nan
4 0 0 nan nan
5 1 1 nan nan
and I have a dictionary like this:
dc = {'C': 5, 'D' : 10}
I want to fill the NaN values in the data frame with the dictionary, but only for the rows in which the column B value is 0. I want to obtain this:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 nan nan
3 2 1 nan nan
4 0 0 5 10
5 1 1 nan nan
I know how to subset the dataframe, but I can't find a way to fill the values with the dictionary; any ideas?
You could use fillna with loc and pass your dict to it:
In [13]: df.loc[df.B==0,:].fillna(dc)
Out[13]:
A B C D
0 1 0 5 10
1 8 0 5 10
4 0 0 5 10
To do it for your dataframe you need to slice with the same mask and assign the result above back to it:
df.loc[df.B==0, :] = df.loc[df.B==0,:].fillna(dc)
In [15]: df
Out[15]:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 NaN NaN
3 2 1 NaN NaN
4 0 0 5 10
5 1 1 NaN NaN
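Putting the two steps together, a minimal reproducible sketch of this approach:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 8, 8, 2, 0, 1],
    "B": [0, 0, 1, 1, 0, 1],
    "C": [np.nan] * 6,
    "D": [np.nan] * 6,
})
dc = {"C": 5, "D": 10}

# fillna accepts a dict of {column: fill value}; restricting both sides of
# the assignment to the B == 0 rows leaves the other rows untouched
df.loc[df.B == 0, :] = df.loc[df.B == 0, :].fillna(dc)
```

The assignment aligns on index and columns, so only the masked rows are written back.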
beginner's question:
I want to create a cumulative sum column on my dataframe, but I only want the column to add the values from the previous 4 rows (including the current row). I also need the count to restart with each new 'Type' in the frame.
This is what I'm going for:
Type Value Desired column
A 1 -
A 2 -
A 1 -
A 1 5
A 2 6
A 2 6
B 2 -
B 2 -
B 2 -
B 2 8
B 1 7
B 1 6
You can do this by applying a rolling sum after we groupby the Type. Note that the old pd.rolling_sum function has been removed from pandas; use the .rolling method instead. For example:
>>> df["sum4"] = df.groupby("Type")["Value"].transform(lambda s: s.rolling(4).sum())
>>> df
Type Value sum4
0 A 1 NaN
1 A 2 NaN
2 A 1 NaN
3 A 1 5
4 A 2 6
5 A 2 6
6 B 2 NaN
7 B 2 NaN
8 B 2 NaN
9 B 2 8
10 B 1 7
11 B 1 6
pandas uses NaN to represent missing data; if you really want - instead, you could do that too (note this converts the column to object dtype), using
df["sum4"] = df["sum4"].fillna('-')
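A self-contained sketch of the .rolling form, rebuilt from the sample data in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Type":  list("AAAAAA") + list("BBBBBB"),
    "Value": [1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1],
})

# transform keeps the original row order and length, so the rolling window
# of 4 restarts for each Type without any re-indexing
df["sum4"] = df.groupby("Type")["Value"].transform(lambda s: s.rolling(4).sum())
```

Rows 0-2 of each group have fewer than 4 values in the window and so come back as NaN, matching the - placeholders in the question.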
I have a dataframe that looks like this:
a b c
0 1 10
1 2 10
2 2 20
3 3 30
4 1 40
4 3 10
The dataframe above has the default (0, 1, 2, 3, 4, ...) index. I would like to convert it into a dataframe that looks like this:
1 2 3
0 10 0 0
1 0 10 0
2 0 20 0
3 0 0 30
4 40 0 10
Where column 'a' in the first dataframe becomes the index in the second dataframe, the values of 'b' become the column names and the values of c are copied over, with 0 or NaN filling missing values. The original dataset is large and will result in a very sparse second dataframe. I then intend to add this dataframe to a much larger one, which is straightforward.
Can anyone advise the best way to achieve this please?
You can use the pivot method for this.
See the docs: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects
An example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a':[0,1,2,3,4,4], 'b':[1,2,2,3,1,3], 'c':[10,10,20,30,40,10]})
In [3]: df
Out[3]:
a b c
0 0 1 10
1 1 2 10
2 2 2 20
3 3 3 30
4 4 1 40
5 4 3 10
In [4]: df.pivot(index='a', columns='b', values='c')
Out[4]:
b 1 2 3
a
0 10 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 10
If you want zeros instead of NaN's as in your example, you can use fillna:
In [5]: df.pivot(index='a', columns='b', values='c').fillna(0)
Out[5]:
b 1 2 3
a
0 10 0 0
1 0 10 0
2 0 20 0
3 0 0 30
4 40 0 10
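One caveat: if the same (a, b) pair can occur more than once, pivot raises an error about duplicate entries. A sketch using pivot_table with an explicit aggregation function (here sum, as an assumption about what duplicates should mean) handles that case and fills the missing combinations in one call:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 4],
                   'b': [1, 2, 2, 3, 1, 3],
                   'c': [10, 10, 20, 30, 40, 10]})

# aggfunc decides how duplicate (a, b) pairs are combined;
# fill_value replaces the missing combinations with 0 directly
out = df.pivot_table(index='a', columns='b', values='c',
                     aggfunc='sum', fill_value=0)
```

For the data above the result is identical to pivot(...).fillna(0), but pivot_table keeps working once duplicates appear.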