Why does pd.DataFrame with pd.isnull fail? - python

tt = pd.DataFrame({'a':[1,2,None,3],'b':[None,3,4,5]})
bb=pd.DataFrame(pd.isnull(tt).astype(int), index = tt.index, columns=map(lambda x: x + '_'+'NA',tt.columns))
bb
I want to create this dataframe from pd.isnull(tt), with column names that carry an _NA suffix. Why does this fail?

Pass the underlying values instead:
tt = pd.DataFrame({'a':[1,2,None,3],'b':[None,3,4,5]})
bb=pd.DataFrame(data=pd.isnull(tt).astype(int).values, index = tt.index, columns=list(map(lambda x: x + '_'+'NA',tt.columns)))
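The reason:
A pandas object carries its own index and column labels. pd.isnull(tt).astype(int) already has the columns a and b, so when you pass it to pd.DataFrame with different column names, pandas reindexes it to the new labels instead of renaming them, and the new columns come out as NaN.
You can see this by asking for both the old and the new column names: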
bb=pd.DataFrame(data=pd.isnull(tt).astype(int), index = tt.index,columns=['a','b', 'a_NA','b_NA'] )
bb
Out[399]:
a b a_NA b_NA
0 0 1 NaN NaN
1 0 0 NaN NaN
2 1 0 NaN NaN
3 0 0 NaN NaN
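As a minimal sketch of the same point, assuming the goal is only to rename the indicator columns, the snippet below contrasts the reindexing behaviour with add_suffix, which relabels the existing columns without re-aligning them. The variable names reindexed and bb are illustrative only.

```python
import pandas as pd

tt = pd.DataFrame({'a': [1, 2, None, 3], 'b': [None, 3, 4, 5]})

# Passing a DataFrame as data together with new column labels reindexes it:
# neither 'a_NA' nor 'b_NA' exists in the input, so both columns are all NaN.
reindexed = pd.DataFrame(tt.isna().astype(int), columns=['a_NA', 'b_NA'])

# add_suffix renames the existing labels, so the 0/1 indicators are kept.
bb = tt.isna().astype(int).add_suffix('_NA')
print(bb)
```

This avoids dropping down to .values, so the index of tt is kept automatically.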

Related

Pandas: Same indices for each column. Is there a better way to solve this?

Sorry for the lousy question title. I can't come up with a concise way to ask this.
I have a dataframe (variable df) such as the below:
ID A   B   C
1  m   nan nan
2  n   nan nan
3  b   nan nan
1  nan t   nan
2  nan e   nan
3  nan r   nan
1  nan nan y
2  nan nan u
3  nan nan i
The desired output is:
ID A B C
1  m t y
2  n e u
3  b r i
I solved this by running the following lines:
new_df = pd.DataFrame()
for column in df.columns:
    new_df = pd.concat([new_df, df[column].dropna()], join='outer', axis=1)
And then I figured this would be faster:
empty_dict = {}
for column in df.columns:
    empty_dict[column] = df[column].dropna()
new_df = pd.DataFrame.from_dict(empty_dict)
However, the dropna could represent a problem if, for example, there is a missing value in the rows that have the values to be used in each column. E.g. if df.loc[2,'A'] = nan, then that key in the dictionary will only have 2 values causing a misalignment with the rest of the columns. I'm not convinced.
I have the feeling pandas must have a builtin function that will do a better job than either of my two solutions. Is there one? If not, is there a better way of solving this?
Looks like you only need groupby().first():
df.groupby('ID', as_index=False).first()
Output:
ID A B C
0 1 m t y
1 2 n e u
2 3 b r i
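A small sketch of why this works, rebuilding the question's data by hand (the frame construction and the names out, df2 and out2 are assumptions for illustration): first() returns the first non-null value per column within each group, so a value that is missing everywhere for one ID stays missing in that cell without shifting the other rows.

```python
import numpy as np
import pandas as pd

# The frame from the question, with ID as an ordinary column.
df = pd.DataFrame({
    'ID': [1, 2, 3] * 3,
    'A': ['m', 'n', 'b'] + [np.nan] * 6,
    'B': [np.nan] * 3 + ['t', 'e', 'r'] + [np.nan] * 3,
    'C': [np.nan] * 6 + ['y', 'u', 'i'],
})

# first() picks the first non-null entry of each column within each ID group.
out = df.groupby('ID', as_index=False).first()
print(out)

# The misalignment worry from the question: remove the only 'A' value for ID 2.
df2 = df.copy()
df2.loc[1, 'A'] = np.nan
out2 = df2.groupby('ID', as_index=False).first()
print(out2)  # the A cell for ID 2 is missing; B and C stay on the right rows
```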
Use stack().unstack() as suggested by @QuangHoang if ID is the index:
>>> df.stack().unstack().reset_index()
A B C
ID
1 m t y
2 n e u
3 b r i
You can use melt and pivot:
>>> df.melt('ID').dropna().pivot(index='ID', columns='variable', values='value') \
...     .rename_axis(columns=None).reset_index()
ID A B C
0 1 m t y
1 2 n e u
2 3 b r i

Change values of one column based on values of other column pandas dataframe

I have this pandas dataframe:
id A B
1 nan 0
2 nan 1
3 6 0
4 nan 1
5 12 1
6 14 0
I want to replace every nan in 'A' based on the value of 'B'.
For example, if B = 0, A should be a random number in [0,1];
if B = 1, A should be a random number in [1,3].
How do I do this?
If performance matters, generate random values for the whole DataFrame up front and then assign them by condition.
Use numpy.random.randint to generate the random values and pass them to numpy.select, chaining the conditions with & (bitwise AND). The comparisons use Series.isna and Series.eq:
a = np.random.randint(0,2, size=len(df)) #generate 0,1
b = np.random.randint(1,4, size=len(df)) #generate 1,2,3
m1 = df.A.isna()
m2 = df.B.eq(0)
m3 = df.B.eq(1)
df['A'] = np.select([m1 & m2, m1 & m3],[a, b], df.A)
print (df)
id A B
0 1 1.0 0
1 2 3.0 1
2 3 6.0 0
3 4 3.0 1
4 5 12.0 1
5 6 14.0 0
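The answer above draws integers. If "random number between [0,1]" is meant as a continuous value, a possible variant is sketched below. It assumes half-open float ranges ([0,1) and [1,3)), an arbitrary seed, and that B only contains 0 and 1; the names rng, low, high and draws are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'A': [np.nan, np.nan, 6, np.nan, 12, 14],
                   'B': [0, 1, 0, 1, 1, 0]})

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# Per-row bounds: [0, 1) where B == 0, otherwise [1, 3).
# Assumption: any B value other than 0 is treated like B == 1.
low = np.where(df['B'].eq(0), 0.0, 1.0)
high = np.where(df['B'].eq(0), 1.0, 3.0)

# One uniform draw per row, then use the draws only where A is missing.
draws = pd.Series(rng.uniform(low, high), index=df.index)
df['A'] = df['A'].fillna(draws)
print(df)
```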

Ignore Nulls in pandas map dictionary

My Dataframe looks like this :
COL1 COL2 COL3
A M X
B F Y
NaN M Y
A nan Y
I am trying to label-encode each column while leaving the nulls as nulls. My result should look like:
COL1_ COL2_ COL3_
0 0 0
1 1 1
NaN 0 1
0 nan 1
The code i tried :
modified_l2 = {}
for val in list(df_obj.columns):
    modified_l2[val] = {k: i for i,k in enumerate(df_obj[val].unique(),0)}
for cols in modified_l2.keys():
    df_obj[cols+'_']=df_obj[cols].map(modified_l2[cols],na_action='ignore')
Try the code below. For each column, apply drops the NaNs, converts the remaining values to a list, and maps every value through list.index, which returns the position of its first occurrence. The result is wrapped in a Series that reuses the index of the NaN-free column, because dropping NaNs leaves gaps in the index (0, 1, 2, 3 becomes, say, 0, 2, 3) and the missing positions come back as NaN when the columns are aligned. Finally, add_suffix appends an underscore to each column name and join attaches the encoded columns to the original dataframe:
print(df.join(df.apply(lambda x: pd.Series(map(x.dropna().tolist().index, x.dropna()), index=x.dropna().index)).add_suffix('_')))
Output:
COL1 COL2 COL3 COL1_ COL2_ COL3_
0 A M X 0.0 0.0 0
1 B F Y 1.0 1.0 1
2 NaN M Y NaN 0.0 1
3 A NaN Y 0.0 NaN 1
The best option here is factorize, which codes NaN as -1, followed by replace:
df = df.join(df.apply(lambda x : pd.factorize(x)[0]).replace(-1, np.nan).add_suffix('_'))
print (df)
COL1 COL2 COL3 COL1_ COL2_ COL3_
0 A M X 0.0 0.0 0
1 B F Y 1.0 1.0 1
2 NaN M Y NaN 0.0 1
3 A NaN Y 0.0 NaN 1
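Both outputs show float codes (0.0, 1.0) because NaN forces the column to float. If integer codes are preferred, one possible variant, sketched here under the assumption that pandas' nullable Int64 dtype is acceptable, masks the -1 codes to <NA> instead. The helper name encode is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'B', np.nan, 'A'],
                   'COL2': ['M', 'F', 'M', np.nan],
                   'COL3': ['X', 'Y', 'Y', 'Y']})

def encode(col):
    """Label-encode one column; missing values become <NA> (hypothetical helper)."""
    codes, _ = pd.factorize(col)  # factorize assigns -1 to missing values
    codes = pd.Series(codes, index=col.index, dtype='Int64')
    return codes.mask(codes == -1)  # turn the -1 sentinel into <NA>

# Build the encoded frame column by column so the Int64 dtype is kept.
encoded = pd.DataFrame({name + '_': encode(df[name]) for name in df.columns})
print(df.join(encoded))
```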

Python Pandas adds column header as entry instead of actual data after adding new column

I see unexpected behavior when adding a new row to a pre-allocated DataFrame after I have added a new column to it.
I created the following minimal example (using Python 3.6.5 and pandas 0.23.0):
First, I create a pre-allocated DataFrame with 3 columns
import pandas as pd
df = pd.DataFrame(columns=('A', 'B', 'C'), index=range(5))
# The resulting DataFrame df
# A B C
#0 NaN NaN NaN
#1 NaN NaN NaN
#2 NaN NaN NaN
#3 NaN NaN NaN
#4 NaN NaN NaN
Then, I add a few rows, which works as expected
new_row = {'A':0, 'B':0, 'C':0}
df.loc[0] = new_row
df.loc[1] = new_row
df.loc[2] = new_row
# The resulting DataFrame df
# A B C
#0 0 0 0
#1 0 0 0
#2 0 0 0
#3 NaN NaN NaN
#4 NaN NaN NaN
Then, I am adding a new column with a default value
df['D'] = 0
# The resulting DataFrame df
# A B C D
#0 0 0 0 0
#1 0 0 0 0
#2 0 0 0 0
#3 NaN NaN NaN 0
#4 NaN NaN NaN 0
And eventually, adding a new row after adding the new column, I get this
new_row = {'A':0, 'B':0, 'C':0, 'D':0}
df.loc[3] = new_row
# The resulting DataFrame df
# A B C D
#0 0 0 0 0
#1 0 0 0 0
#2 0 0 0 0
#3 A B C D
#4 NaN NaN NaN 0
So it seems that, for some reason, the DataFrame header is added as the new row instead of the actual values. Am I doing something wrong? I noticed that this only happens when I set the size of the table with index=range(5). If I do not set the size of the table, adding columns and rows works as expected. However, I would like to pre-allocate the table for performance reasons.
It's a problem with the datatypes. When you create a dataframe without specifying any data, it automatically assigns datatype object to all columns.
Create your dataframe like this:
df = pd.DataFrame(columns=('A', 'B', 'C'), index=range(5), data=0)
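To check the dtype claim, and as a hedged sketch of a row assignment that does not depend on how pandas interprets a plain dict, the snippet below compares the two ways of pre-allocating and assigns the new row as a Series, which pandas aligns on the column labels. The names empty and filled are illustrative, and the row values are made up so that each column is distinguishable.

```python
import pandas as pd

# Pre-allocating without data gives object columns full of NaN.
empty = pd.DataFrame(columns=('A', 'B', 'C'), index=range(5))
print(empty.dtypes)

# Pre-allocating with a scalar gives integer columns.
filled = pd.DataFrame(0, columns=('A', 'B', 'C'), index=range(5))
filled['D'] = 0
print(filled.dtypes)

# Assigning a Series aligns its index with the column labels.
new_row = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
filled.loc[3] = pd.Series(new_row)
print(filled)
```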

create binary columns in a dataframe from condition on its value

I have a dataframe that looks like this one:
df = pd.DataFrame(np.nan, index=[0,1,2,3], columns=['A','B','C'])
df.iloc[0,0] = 'a'
df.iloc[1,0] = 'b'
df.iloc[1,1] = 'c'
df.iloc[2,0] = 'b'
df.iloc[3,0] = 'c'
df.iloc[3,1] = 'b'
df.iloc[3,2] = 'd'
df
out : A B C
0 a NaN NaN
1 b c NaN
2 b NaN NaN
3 c b d
I would like to add new columns whose names are the values found inside the dataframe (here 'a', 'b', 'c' and 'd'). Those columns are binary and indicate whether each of those values appears in the row.
In one picture, the output I'd like is:
A B C a b c d
0 a NaN NaN 1 0 0 0
1 b c NaN 0 1 1 0
2 b NaN NaN 0 1 0 0
3 c b d 0 1 1 1
To do this I first create the columns filled with zeros:
cols = pd.Series(df.values.ravel()).value_counts().index
for col in cols:
    df[col] = 0
(It doesn't create the columns in the right order, but that doesn't matter)
Then I loop over the rows and columns:
for row in df.index:
    for col in cols:
        if col in df.loc[row].values:
            df.loc[row, col] = 1  # originally df.ix, which was removed in pandas 1.0
You can see why I'm looking for another way to do it: even though my dataframe is relatively small (76k rows), this still takes around 8 minutes, which is far too long.
Any idea?
You're looking for get_dummies. Here I choose to use the .str version:
df.fillna('', inplace=True)
(df.A + '|' + df.B + '|' + df.C).str.get_dummies()
Output:
a b c d
0 1 0 0 0
1 0 1 1 0
2 0 1 0 0
3 0 1 1 1
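A variant of the same idea, sketched under the assumption that the original A, B and C columns should keep their NaN values and that the number of columns may vary: join every row into one '|'-separated string on a filled copy, then attach the dummies to the untouched frame. The names joined, dummies and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c'],
                   'B': [np.nan, 'c', np.nan, 'b'],
                   'C': [np.nan, np.nan, np.nan, 'd']})

# Join each row into a single '|'-separated string on a filled copy,
# so the NaN values in df itself are left alone.
joined = df.fillna('').apply('|'.join, axis=1)

# str.get_dummies splits on '|' by default and ignores the empty pieces.
dummies = joined.str.get_dummies()

result = df.join(dummies)
print(result)
```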
