Merging data into an existing pandas dataframe column conditionally - python

I have the following data:
one_dict = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four"}
two_dict = {0: "light", 1: "calc", 2: "line", 3: "blur", 4: "color"}
np.random.seed(2)
n = 15
a_df = pd.DataFrame(dict(a=np.random.randint(0, 4, n), b=np.random.randint(0, 3, n)))
a_df["c"] = np.nan
a_df = a_df.sort_values("b").reset_index(drop=True)
where the dataframe looks as:
In [45]: a_df
Out[45]:
a b c
0 3 0 NaN
1 1 0 NaN
2 0 0 NaN
3 2 0 NaN
4 3 0 NaN
5 1 0 NaN
6 2 1 NaN
7 2 1 NaN
8 3 1 NaN
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
I would like to replace values in c with those from dictionaries one_dict
and two_dict, with the result as follows:
In [45]: a_df
Out[45]:
a b c
0 3 0 three
1 1 0 one
2 0 0 zero
3 2 0 .
4 3 0 .
5 1 0 .
6 2 1 calc
7 2 1 calc
8 3 1 blur
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN
 Attempt
I'm not sure what a good approach to this would be though.
I thought that I might do something along the following lines:
merge_df = pd.DataFrame(dict(one = one_dict, two=two_dict)).reset_index()
merge_df['zeros'] = 0
merge_df['ones'] = 1
giving
In [62]: merge_df
Out[62]:
index one two zeros ones
0 0 zero light 0 1
1 1 one calc 0 1
2 2 two line 0 1
3 3 three blur 0 1
4 4 four color 0 1
Then merge this into the a_df, but I'm not sure how to merge in and update
at the same time, or if this is a good approach.
Edit
keys correspond to the values of column a
. is just shorthand, this should be filled in with the value as others are

This is just matter of creating new dataframe with the correct structure and merge:
(a_df.drop('c', axis=1)
.merge(pd.DataFrame([one_dict,two_dict])
.rename_axis(index='b',columns='a')
.stack().reset_index(name='c'),
on=['a','b'],
how='left')
)
Output:
a b c
0 3 0 three
1 1 0 one
2 0 0 zero
3 2 0 two
4 3 0 three
5 1 0 one
6 2 1 line
7 2 1 line
8 3 1 blur
9 0 2 NaN
10 3 2 NaN
11 3 2 NaN
12 0 2 NaN
13 3 2 NaN
14 1 2 NaN

Related

Count preceding non NaN values in pandas

I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the sum of all non-NaN values before it in the same column. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
for i in range(len(df)):
df.iloc[i] = df.iloc[0:i].isna().sum()
However, I can only do so with an individual column. My real DataFrame contains thousands of columns so iterating between them is impossible due to the low processing speed. What can I do? Maybe it should be something related to using the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN values and cumsum for the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
Check with notna with cumsum
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3

Concat following row to the right of a df - python

I'm aiming to subset a pandas df using a condition and append those rows to the right of a df. For example, where Num2 is equal to 1, I want to take the following row and append it to the right of the df. The following appends every row, where as I just want to append the following row after a 1 in Num2. I'd also like to be able to append specific cols. Using below, this could be only Num1 and Num2.
df = pd.DataFrame({
'Num1' : [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
'Num2' : [0,0,0,0,0,1,3,0,1,2,0,0,0,0,1,4],
'Value' : [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
})
df1 = df.add_suffix('1').join(df.shift(-1).add_suffix('2'))
intended output:
# grab all rows after a 1 in Num2
ones = df.loc[df["Num2"].shift().isin([1])]
# append these to the right
Num1 Num2 Value Num12 Num22
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 4 1 0 0 3
6 0 3 0
7 1 0 0
8 2 1 0 3 2
9 3 2 0
10 1 0 0
11 1 0 0
12 2 0 0
13 3 0 0
14 4 1 0 0 4
15 0 4 0
You can try:
df=df.join(df.shift(-1).mask(df['Num2'].ne(1)).drop('Value',1).add_suffix('2'))
OR
ones.index=ones.index-1
df=df.join(ones.drop('Value',1).add_suffix('2'))
#OR(use any 1 since both method doing the same thing)
df=pd.concat([df,ones.drop('Value',1).add_suffix('2')],axis=1)
If needed use fillna():
df[["Num12", "Num22"]]=df[["Num12", "Num22"]].fillna('')
We can do this by making new columns that are the -1 shifts of the previous three, then setting them equal to "" if Num2 isn't 1.
mask = df.Num2 != 1
df[["Num12", "Num22"]] = df[["Num1", "Num2"]].shift(-1)
df.loc[mask, ["Num12", "Num22"]] = ""
Got a warning on this, but nevertheless
>>> df[["Num12", "Num22"]] = np.where(df[['Num1', "Num2"]]['Num2'][:,np.newaxis] == 1, df[['Num1', 'Num2']].shift(-1), [np.nan, np.nan])
<stdin>:1: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
>>> df
Num1 Num2 Value Num12 Num22
0 0 0 0 NaN NaN
1 1 0 0 NaN NaN
2 2 0 0 NaN NaN
3 3 0 0 NaN NaN
4 4 0 0 NaN NaN
5 4 1 0 0.0 3.0
6 0 3 0 NaN NaN
7 1 0 0 NaN NaN
8 2 1 0 3.0 2.0
9 3 2 0 NaN NaN
10 1 0 0 NaN NaN
11 1 0 0 NaN NaN
12 2 0 0 NaN NaN
13 3 0 0 NaN NaN
14 4 1 0 0.0 4.0
15 0 4 0 NaN NaN

Fast way to fill NaN in DataFrame

I have DataFrame object df with column like that:
[In]: df
[Out]:
id sum
0 1 NaN
1 1 NaN
2 1 2
3 1 NaN
4 1 4
5 1 NaN
6 2 NaN
7 2 NaN
8 2 3
9 2 NaN
10 2 8
10 2 NaN
... ... ...
[1810601 rows x 2 columns]
I have a lot a NaN values in my column and I want to fill these in the following way:
if NaN is on the beginning (for first index per id equals 0), then it should be 0
else if NaN I want take value from previous index for the same id
Output should be like that:
[In]: df
[Out]:
id sum
0 1 0
1 1 0
2 1 2
3 1 2
4 1 4
5 1 4
6 2 0
7 2 0
8 2 3
9 2 3
10 2 8
10 2 8
... ... ...
[1810601 rows x 2 columns]
I tried to do it "step by step" using loop with iterrows(), but it is very ineffective method. I believe it can be done faster with pandas methods
Try ffill as suggested with groupby
df['sum'] = df.groupby('id')['sum'].ffill().fillna(0)

How to get an item from a second df if there is a match between them

Probably this question has already an answer, but I could not succeed to find any.
I want to get items from a second data-frame to be appended to a new column in the first dataframe if there a match between both dataframe
Here I am showing some sample data quite similar to the case I am confronting.
import pandas as pd
import numpy as np
a = np.arange(3).repeat(3)
b = np.tile(np.arange(3),3)
df1 = pd.DataFrame({'a':a, 'b':b})
a b
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
6 2 0
7 2 1
8 2 2
a2 = np.arange(1, 4).repeat(3)
b2 = np.tile(np.arange(3),3)
c = np.random.randint(0, 10, size=a2.size)
df2 = pd.DataFrame({'a2':a2, 'b2':b2, 'c':c})
a2 b2 c
0 1 0 3
1 1 1 1
2 1 2 9
3 2 0 5
4 2 1 8
5 2 2 4
6 3 0 1
7 3 1 6
8 3 2 1
The desired output should be like
a b c
0 0 0 nan
1 0 1 nan
2 0 2 nan
3 1 0 3
4 1 1 1
5 1 2 9
6 2 0 5
7 2 1 8
8 2 2 4
Unfortunately, I could not come up with anyway to solve it.
Use merge with left join and rename columns names:
df = df1.merge(df2.rename(columns={'a2':'a', 'b2':'b'}), on=['a','b'], how='left')
print (df)
a b c
0 0 0 NaN
1 0 1 NaN
2 0 2 NaN
3 1 0 3.0
4 1 1 5.0
5 1 2 0.0
6 2 0 2.0
7 2 1 6.0
8 2 2 2.0

Pandas Fill Column with Dictionary

I have a data frame like this:
A B C D
0 1 0 nan nan
1 8 0 nan nan
2 8 1 nan nan
3 2 1 nan nan
4 0 0 nan nan
5 1 1 nan nan
and i have a dictionary like this:
dc = {'C': 5, 'D' : 10}
I want to fill the nanvalues in the data frame with the dictionary but only for the cells in which the column B values are 0, i want to obtain this:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 nan nan
3 2 1 nan nan
4 0 0 5 10
5 1 1 nan nan
I know how to subset the dataframe but i can't find a way to fill the values with the dictionary; any ideas?
You could use fillna with loc and pass your dict to it:
In [13]: df.loc[df.B==0,:].fillna(dc)
Out[13]:
A B C D
0 1 0 5 10
1 8 0 5 10
4 0 0 5 10
To do it for you dataframe you need to slice with the same mask and assign the result above to it:
df.loc[df.B==0, :] = df.loc[df.B==0,:].fillna(dc)
In [15]: df
Out[15]:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 NaN NaN
3 2 1 NaN NaN
4 0 0 5 10
5 1 1 NaN NaN

Categories

Resources