I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the count of all non-NaN values up to and including that row in the same column. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
for i in range(len(df)):
    df.iloc[i] = df.iloc[0:i].isna().sum()
However, I can only do so with an individual column. My real DataFrame contains thousands of columns, so iterating over them is impractically slow. What can I do? Maybe it should be something related to using the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN values and cumsum for the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
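For a self-contained check, here is a minimal sketch that rebuilds the question's DataFrame (the constructor call is my reconstruction of the printed data, not the asker's code) and applies the same notna + cumsum chain to every column at once:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Reconstruction of the DataFrame printed in the question (assumed values).
df = pd.DataFrame({
    "a": [nan, nan, nan, 7, 3, 5, 7, 8, nan, nan],
    "b": [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    "c": [nan, nan, nan, nan, nan, nan, nan, 3, 5, 4],
})

# notna() yields a boolean mask; cumsum() counts True as 1 and accumulates
# down each column, giving a running count of non-NaN values per column.
out = df.notna().cumsum()
print(out)
```

Because both steps are vectorized over the whole frame, no Python-level loop over rows or columns is needed.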
Use notna with cumsum:
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I'm aiming to subset a pandas df using a condition and append those rows to the right of the df. For example, where Num2 is equal to 1, I want to take the following row and append it to the right of the df. The code below appends every row, whereas I only want to append the row that follows a 1 in Num2. I'd also like to be able to append specific columns only; in the example below, that would be just Num1 and Num2.
df = pd.DataFrame({
'Num1' : [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
'Num2' : [0,0,0,0,0,1,3,0,1,2,0,0,0,0,1,4],
'Value' : [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
})
df1 = df.add_suffix('1').join(df.shift(-1).add_suffix('2'))
intended output:
# grab all rows after a 1 in Num2
ones = df.loc[df["Num2"].shift().isin([1])]
# append these to the right
Num1 Num2 Value Num12 Num22
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 4 1 0 0 3
6 0 3 0
7 1 0 0
8 2 1 0 3 2
9 3 2 0
10 1 0 0
11 1 0 0
12 2 0 0
13 3 0 0
14 4 1 0 0 4
15 0 4 0
You can try:
df=df.join(df.shift(-1).mask(df['Num2'].ne(1)).drop(columns='Value').add_suffix('2'))
OR
ones.index=ones.index-1
df=df.join(ones.drop(columns='Value').add_suffix('2'))
# OR (use either one, since both methods do the same thing)
df=pd.concat([df,ones.drop(columns='Value').add_suffix('2')],axis=1)
If needed use fillna():
df[["Num12", "Num22"]]=df[["Num12", "Num22"]].fillna('')
We can do this by creating new columns that hold Num1 and Num2 shifted by -1, then setting them to "" wherever Num2 isn't 1.
mask = df.Num2 != 1
df[["Num12", "Num22"]] = df[["Num1", "Num2"]].shift(-1)
df.loc[mask, ["Num12", "Num22"]] = ""
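As a variation on the same idea, the sketch below keeps the non-matching rows as NaN instead of "", so the new columns stay numeric. It uses the question's sample data; the names cols, nxt and out are my own, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Num1": [0, 1, 2, 3, 4, 4, 0, 1, 2, 3, 1, 1, 2, 3, 4, 0],
    "Num2": [0, 0, 0, 0, 0, 1, 3, 0, 1, 2, 0, 0, 0, 0, 1, 4],
    "Value": [0] * 16,
})

cols = ["Num1", "Num2"]  # the columns to append on the right

# Shift the selected columns up by one row, then keep the shifted values
# only on rows where Num2 == 1; every other row becomes NaN.
nxt = df[cols].shift(-1).where(df["Num2"].eq(1))

# Join the suffixed columns back onto the original frame.
out = df.join(nxt.add_suffix("2"))
print(out)
```

If blank cells are preferred for display, fillna('') can be applied at the end, at the cost of turning the new columns into object dtype.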
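On the pandas version I used, this emits a FutureWarning (converting to a NumPy array before indexing, as the warning suggests, avoids it), but it works nevertheless: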
>>> df[["Num12", "Num22"]] = np.where(df[['Num1', "Num2"]]['Num2'][:,np.newaxis] == 1, df[['Num1', 'Num2']].shift(-1), [np.nan, np.nan])
<stdin>:1: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
>>> df
Num1 Num2 Value Num12 Num22
0 0 0 0 NaN NaN
1 1 0 0 NaN NaN
2 2 0 0 NaN NaN
3 3 0 0 NaN NaN
4 4 0 0 NaN NaN
5 4 1 0 0.0 3.0
6 0 3 0 NaN NaN
7 1 0 0 NaN NaN
8 2 1 0 3.0 2.0
9 3 2 0 NaN NaN
10 1 0 0 NaN NaN
11 1 0 0 NaN NaN
12 2 0 0 NaN NaN
13 3 0 0 NaN NaN
14 4 1 0 0.0 4.0
15 0 4 0 NaN NaN
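Following the warning's own advice, a version that converts to NumPy before indexing might look like the sketch below. It is an adaptation under that assumption, not the answerer's code; cond and shifted are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Num1": [0, 1, 2, 3, 4, 4, 0, 1, 2, 3, 1, 1, 2, 3, 4, 0],
    "Num2": [0, 0, 0, 0, 0, 1, 3, 0, 1, 2, 0, 0, 0, 0, 1, 4],
    "Value": [0] * 16,
})

# Convert to a NumPy array first, then add the trailing axis, so the
# condition has shape (16, 1) and broadcasts across both columns.
cond = df["Num2"].to_numpy()[:, np.newaxis] == 1

# Values of the following row, as a plain (16, 2) float array.
shifted = df[["Num1", "Num2"]].shift(-1).to_numpy()

# Keep the shifted values where Num2 == 1, NaN elsewhere.
df[["Num12", "Num22"]] = np.where(cond, shifted, np.nan)
print(df)
```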
I have a DataFrame object df with columns like this:
[In]: df
[Out]:
id sum
0 1 NaN
1 1 NaN
2 1 2
3 1 NaN
4 1 4
5 1 NaN
6 2 NaN
7 2 NaN
8 2 3
9 2 NaN
10 2 8
11 2 NaN
... ... ...
[1810601 rows x 2 columns]
I have a lot of NaN values in my column and I want to fill them in the following way:
if the NaN is at the beginning of an id group (before the first non-NaN value for that id), it should become 0
otherwise, the NaN should take the value from the previous row with the same id
The output should look like this:
[In]: df
[Out]:
id sum
0 1 0
1 1 0
2 1 2
3 1 2
4 1 4
5 1 4
6 2 0
7 2 0
8 2 3
9 2 3
10 2 8
11 2 8
... ... ...
[1810601 rows x 2 columns]
I tried to do it "step by step" using a loop with iterrows(), but that is a very inefficient method. I believe it can be done faster with pandas methods.
Try ffill with groupby, as suggested, then fill the remaining leading NaNs with 0:
df['sum'] = df.groupby('id')['sum'].ffill().fillna(0)
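To see why the groupby matters, here is a small sketch on the twelve rows shown in the question (the frame is my reconstruction of that printout; plain and grouped are illustrative names). A plain ffill would carry the last value of id 1 into the leading NaNs of id 2, while the grouped version restarts for each id.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Reconstruction of the rows printed in the question (assumed values).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "sum": [nan, nan, 2, nan, 4, nan, nan, nan, 3, nan, 8, nan],
})

# Ungrouped forward fill: values leak across the id boundary.
plain = df["sum"].ffill().fillna(0)

# Grouped forward fill: each id is filled independently, and the NaNs
# that remain at the start of a group are then replaced with 0.
grouped = df.groupby("id")["sum"].ffill().fillna(0)

print(pd.DataFrame({"id": df["id"], "plain": plain, "grouped": grouped}))
```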
This question probably already has an answer, but I could not find one.
I want to take items from a second DataFrame and append them as a new column in the first DataFrame wherever there is a match between the two.
Here is some sample data quite similar to the case I am facing.
import pandas as pd
import numpy as np
a = np.arange(3).repeat(3)
b = np.tile(np.arange(3),3)
df1 = pd.DataFrame({'a':a, 'b':b})
a b
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
6 2 0
7 2 1
8 2 2
a2 = np.arange(1, 4).repeat(3)
b2 = np.tile(np.arange(3),3)
c = np.random.randint(0, 10, size=a2.size)
df2 = pd.DataFrame({'a2':a2, 'b2':b2, 'c':c})
a2 b2 c
0 1 0 3
1 1 1 1
2 1 2 9
3 2 0 5
4 2 1 8
5 2 2 4
6 3 0 1
7 3 1 6
8 3 2 1
The desired output should be like
a b c
0 0 0 nan
1 0 1 nan
2 0 2 nan
3 1 0 3
4 1 1 1
5 1 2 9
6 2 0 5
7 2 1 8
8 2 2 4
Unfortunately, I could not come up with any way to solve it.
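Use merge with a left join after renaming the columns of df2 to match df1 (the c values printed below differ from the question's because df2 is generated randomly):
df = df1.merge(df2.rename(columns={'a2':'a', 'b2':'b'}), on=['a','b'], how='left')
print(df)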
a b c
0 0 0 NaN
1 0 1 NaN
2 0 2 NaN
3 1 0 3.0
4 1 1 5.0
5 1 2 0.0
6 2 0 2.0
7 2 1 6.0
8 2 2 2.0
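An alternative that avoids the rename is to name the key columns on each side with left_on and right_on and drop the duplicated keys afterwards. The sketch below does that with the c values hard-coded from the question's printout (an assumption made so the result is reproducible rather than random).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "a": np.arange(3).repeat(3),
    "b": np.tile(np.arange(3), 3),
})
df2 = pd.DataFrame({
    "a2": np.arange(1, 4).repeat(3),
    "b2": np.tile(np.arange(3), 3),
    # Fixed values copied from the question's printout instead of randint.
    "c": [3, 1, 9, 5, 8, 4, 1, 6, 1],
})

# Left join keeps every row of df1; rows without a match get NaN in c.
out = (
    df1.merge(df2, left_on=["a", "b"], right_on=["a2", "b2"], how="left")
       .drop(columns=["a2", "b2"])
)
print(out)
```

The c column comes back as float because unmatched rows hold NaN, which is why the matched values print as 3.0, 1.0 and so on.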
I have a data frame like this:
A B C D
0 1 0 nan nan
1 8 0 nan nan
2 8 1 nan nan
3 2 1 nan nan
4 0 0 nan nan
5 1 1 nan nan
and I have a dictionary like this:
dc = {'C': 5, 'D' : 10}
I want to fill the NaN values in the data frame from the dictionary, but only in rows where column B is 0. I want to obtain this:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 nan nan
3 2 1 nan nan
4 0 0 5 10
5 1 1 nan nan
I know how to subset the dataframe, but I can't find a way to fill the values with the dictionary. Any ideas?
You could use fillna with loc and pass your dict to it:
In [13]: df.loc[df.B==0,:].fillna(dc)
Out[13]:
A B C D
0 1 0 5 10
1 8 0 5 10
4 0 0 5 10
To apply it to your dataframe, slice with the same mask and assign the result above back to it:
df.loc[df.B==0, :] = df.loc[df.B==0,:].fillna(dc)
In [15]: df
Out[15]:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 NaN NaN
3 2 1 NaN NaN
4 0 0 5 10
5 1 1 NaN NaN
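Put together as a runnable sketch (the frame is rebuilt from the question's printout; mask and cols are names I introduce for illustration), the same approach can be limited to the columns named in the dictionary, which is a small variation of my own so that A and B are never reassigned:

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's data; C and D start as all-NaN columns.
df = pd.DataFrame({
    "A": [1, 8, 8, 2, 0, 1],
    "B": [0, 0, 1, 1, 0, 1],
    "C": np.nan,
    "D": np.nan,
})
dc = {"C": 5, "D": 10}

mask = df["B"].eq(0)   # rows that should be filled
cols = list(dc)        # only the columns the dictionary covers

# fillna with a dict fills each column with its own value; assigning the
# result back through the same mask leaves the B != 0 rows untouched.
df.loc[mask, cols] = df.loc[mask, cols].fillna(dc)
print(df)
```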