strange behaviour of sub function in pandas - python

I am encountering a strange problem with the df.sub function in pandas. I wrote some simple code to subtract the columns of a DataFrame from a reference column.
import pandas as pd

def normalize(df, col):
    '''Enter the column value in "col".'''
    return df.sub(df[col], axis=0)

df = pd.read_csv('norm_debug.txt', sep='\t', index_col=0)
print(df.head(3))
new = normalize(df, 'A')
print(new.head(3))
The output of this code is the following, as expected:
df:
             A  B   C   D  E
target_id
one       10.0  3  20  10  1
two       10.0  4  30  10  1
three      6.7  5  40  10  1
new:
             A    B     C    D    E
target_id
one        0.0 -7.0  10.0  0.0 -9.0
two        0.0 -6.0  20.0  0.0 -9.0
three      0.0 -1.7  33.3  3.3 -5.7
But when I put the same logic in an executable script with argparse, I get all NaNs!
import argparse
import os
import pandas as pd

def normalize(df, col):
    '''Normalize the log table with the desired column.
    Enter the column value in "col".'''
    return df.sub(df[col], axis=0)

parser = argparse.ArgumentParser(
    description='''Manipulate tables''',
    usage='python3 %(prog)s -e input.tsv [-nm col_name] -op output.tsv',
    epilog='''Short prog. desc: Pass the expression matrix to filter, log2(val) etc.''')
parser.add_argument("-e", "--expr", metavar='', required=True,
                    help="tab-delimited expression matrix file")
parser.add_argument("-op", "--outprefix", metavar='', required=True,
                    help="output file prefix")
parser.add_argument("-nm", "--norm", metavar='', required=True, nargs=1, type=str,
                    help="Normalize table based on column chosen")
args = parser.parse_args()
print(args)

if os.path.isfile(args.expr):
    df = pd.read_csv(args.expr, sep='\t', index_col=0)
    print(df.head(3))
    if args.norm:
        norm_df = normalize(df, args.norm)
        print(norm_df.head(3))
        outfile = args.outprefix + ".normalized.tsv"
        norm_df.to_csv(outfile, sep='\t')
        print("Normalized table written to", outfile)
    else:
        print("Provide valid option...")
else:
    print("Please provide proper input..")
Output for this execution is:
python norm_debug.py -e norm_debug.txt -nm A -op norm_debug
             A  B   C   D  E
target_id
one       10.0  3  20  10  1
two       10.0  4  30  10  1
three      6.7  5  40  10  1
             A   B   C   D   E
target_id
one        0.0 NaN NaN NaN NaN
two        0.0 NaN NaN NaN NaN
three      0.0 NaN NaN NaN NaN
I am using Python 3.6.7 and pandas 1.1.2. The first (hard-coded) version was executed in a Jupyter notebook, while the argparse version was executed in a standard terminal. What is the issue here?
Thanks in advance

args.norm has been parsed as a list, ['A'], not as the scalar 'A' (because of the option nargs=1). Remove that option.
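A minimal sketch of the fix, assuming the rest of the script stays unchanged (drop nargs=1 so argparse stores a plain string):

# Without nargs, argparse stores a single string, so df[args.norm]
# selects a Series and df.sub aligns row-wise as intended.
parser.add_argument("-nm", "--norm", metavar='', required=True, type=str,
                    help="Normalize table based on column chosen")
# args.norm is now 'A' (a str), not ['A'] (a list)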

The problem is that you subtract a one-column DataFrame, as if calling:
new = normalize(df, ['A'])
print(new)
           A   B   C   D   E
target_id
one      0.0 NaN NaN NaN NaN
two      0.0 NaN NaN NaN NaN
three    0.0 NaN NaN NaN NaN
print(df.sub(df[['A']], axis=0))
           A   B   C   D   E
target_id
one      0.0 NaN NaN NaN NaN
two      0.0 NaN NaN NaN NaN
three    0.0 NaN NaN NaN NaN
This happens because the col parameter is a one-element list like [col_name], not a string like col_name, so df[col] returns a DataFrame and the subtraction aligns on column labels instead of broadcasting a Series. If removing nargs=1 is not possible, you can change the function to use DataFrame.squeeze:
def normalize(df, col):
    '''Enter the column value in "col".'''
    # squeeze() converts the one-column DataFrame into a Series
    return df.sub(df[col].squeeze(), axis=0)

# df = pd.read_csv('norm_debug.txt', sep='\t', index_col=0); print(df.head(3))
new = normalize(df, ['A'])
print(new)
             A    B     C    D    E
target_id
one        0.0 -7.0  10.0  0.0 -9.0
two        0.0 -6.0  20.0  0.0 -9.0
three      0.0 -1.7  33.3  3.3 -5.7
Or use the solution from @DYZ's answer above.

Related

why numpy place is not replacing empty strings

Hello, I have a DataFrame as shown below:
import numpy as np
import pandas as pd

daf = pd.DataFrame({'A': [10, np.nan, 20, np.nan, 30]})
daf['B'] = ''
The above code creates a DataFrame whose column B holds empty strings:
A B
0 10.0
1 NaN
2 20.0
3 NaN
4 30.0
The problem here is that I need to replace a column consisting of all empty strings (note: the entire column must be empty) with the value provided as numpy.place's last argument, here 1. So I used the following code:
np.place(daf.to_numpy(), ((daf[['A','B']] == '').all() & (daf[['A','B']] == '')).to_numpy(), [1])
which did nothing; it gave the same output:
A B
0 10.0
1 NaN
2 20.0
3 NaN
4 30.0
But when I assign daf['B'] = np.nan, the analogous code seems to work fine: it checks whether the entire column is null, then replaces it with 1. Here is the DataFrame:
here is the data frame
A B
0 10.0 NaN
1 NaN NaN
2 20.0 NaN
3 NaN NaN
4 30.0 NaN
Replacing those NaNs with 1 where the entire column is NaN:
np.place(daf.to_numpy(), (daf[['A','B']].isnull() & daf[['A','B']].isnull().all()).to_numpy(), [1])
which gave correct output
A B
0 10.0 1.0
1 NaN 1.0
2 20.0 1.0
3 NaN 1.0
4 30.0 1.0
Can someone tell me how to do this replacement with empty strings, and explain why it is not working with empty strings as input?
If I'm understanding your question correctly, you want to replace a column of empty strings with a column of 1s. This can be done with DataFrame.replace():
daf.replace('', 1.0)
A B
0 10.0 1.0
1 NaN 1.0
2 20.0 1.0
3 NaN 1.0
4 30.0 1.0
This function also works with regex if you want to be more granular with the replacement.
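As for why the np.place call did nothing: daf.to_numpy() has to copy the data when the DataFrame is mixed-dtype (float A, object B), so np.place mutates a throwaway array; with both columns float, to_numpy() can return a view into the underlying block, which is why the all-NaN version appeared to work. If the condition really must be "the entire column is empty strings", a minimal sketch that assigns through the DataFrame itself rather than a NumPy copy:

# Boolean Series: True for columns made up entirely of empty strings
all_empty = (daf == '').all()

# Assign through .loc so the DataFrame itself is modified,
# not a possibly-copied NumPy array
daf.loc[:, all_empty] = 1.0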

Convert Pandas DataFrames with different columns into an iterable and reform as one DataFrame

How can I convert the three DataFrames a,b,c below into one DF with columns A,B,C,D?
I specifically want to gather the multiple DataFrames into one iterable (dict/list of dicts) before reconstituting them as one DF instead of appending or concatenating them.
My attempt:
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
b = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
c = pd.DataFrame({'B': [13, 14, 15], 'C': [16, 17, 18], 'D': [19, 20, 21]})

list_of_dicts = []  # can be a list of lists/dicts
for i in [a, b, c]:
    x = i.to_dict('split')
    list_of_dicts.append(x)

pd.DataFrame.from_records(list_of_dicts)
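For reference, the two to_dict orientations produce very different shapes, which is why the 'split' attempt fails (a minimal illustration):

import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# 'split' yields ONE dict per frame, so from_records sees a record
# whose keys are index/columns/data rather than the rows:
print(a.to_dict('split'))
# {'index': [0, 1, 2], 'columns': ['A', 'B'], 'data': [[1, 4], [2, 5], [3, 6]]}

# 'records' yields one dict per ROW, which is what from_records expects:
print(a.to_dict('records'))
# [{'A': 1, 'B': 4}, {'A': 2, 'B': 5}, {'A': 3, 'B': 6}]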
# Solved below. Credit to Eric Truett.
import itertools
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
b = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
c = pd.DataFrame({'B': [13, 14, 15], 'C': [16, 17, 18]})

list_of_dicts = []
for i in [a, b, c]:
    x = i.to_dict('records')
    list_of_dicts.append(x)

pd.DataFrame.from_records(list(itertools.chain.from_iterable(list_of_dicts)))
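For reference, chaining the per-row dicts into one flat record stream gives the reconstructed frame a fresh 0..n-1 index (unlike plain concat, which keeps each source frame's index); with the frames above it evaluates to:

     A   B     C
0  1.0   4   NaN
1  2.0   5   NaN
2  3.0   6   NaN
3  7.0  10   NaN
4  8.0  11   NaN
5  9.0  12   NaN
6  NaN  13  16.0
7  NaN  14  17.0
8  NaN  15  18.0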
To create a single dataframe from these three, you can use the concat() function in pandas:
a=pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
b=pd.DataFrame({'A':[7,8,9],'B':[10,11,12]})
c=pd.DataFrame({'B':[13,14,15],'C':[16,17,18],'D':[19,20,21]})
d = pd.concat([a, b, c])
print(d)
will give you:
A B C D
0 1.0 4 NaN NaN
1 2.0 5 NaN NaN
2 3.0 6 NaN NaN
0 7.0 10 NaN NaN
1 8.0 11 NaN NaN
2 9.0 12 NaN NaN
0 NaN 13 16.0 19.0
1 NaN 14 17.0 20.0
2 NaN 15 18.0 21.0
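If the repeated 0, 1, 2 index labels are unwanted, concat can renumber them as it goes (a small optional variation):

d = pd.concat([a, b, c], ignore_index=True)  # index becomes 0..8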
I think you can use the append function to add up multiple DataFrame objects. In the code below I initialize a variable s that is used to combine all of the a, b, c DataFrames:
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
b = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
c = pd.DataFrame({'B': [13, 14, 15], 'C': [16, 17, 18], 'D': [19, 20, 21]})

s = pd.DataFrame()
for i in [a, b, c]:
    s = s.append(i)
print(s)
Printing s would give the following output
A B C D
0 1.0 4 NaN NaN
1 2.0 5 NaN NaN
2 3.0 6 NaN NaN
0 7.0 10 NaN NaN
1 8.0 11 NaN NaN
2 9.0 12 NaN NaN
0 NaN 13 16.0 19.0
1 NaN 14 17.0 20.0
2 NaN 15 18.0 21.0
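Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the loop above should collect the frames and concatenate once (sketched here with a hypothetical parts list):

# Forward-compatible version of the append loop
parts = []
for i in [a, b, c]:
    parts.append(i)
s = pd.concat(parts)
print(s)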

Conditional pairwise calculations in pandas

For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
First, I want to calculate the pairwise subtraction between df2 and df1. I am using scipy.spatial.distance.cdist with a custom function subtract_:
import pandas as pd
from scipy.spatial.distance import cdist

def subtract_(a, b):
    return abs(a - b)

d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns=d2_s.values.ravel())
print(dist_df)
    11   12   13
0  6.0  7.0  8.0
1  5.0  6.0  7.0
2  4.0  5.0  6.0
3  3.0  4.0  5.0
Now I want to check these new columns, named 11, 12 and 13, for any value less than 5. Wherever there is one, I want to do a further calculation, like this:
For example, in column '11' the values less than 5 are 4 (row 2) and 3 (row 3). Take row 2: I want to take df1's col2 at row 2 (value 2) and subtract df2's col2 at row 0 (value 9), because column '11' came from row 0 of df2.
My for loop for this is very complex. It would be great if there were an easier way to do this in pandas.
Any help or suggestions would be appreciated.
The expected new dataframe is this:
0,1,2
NaN,NaN,NaN
NaN,NaN,NaN
(2-9)=-7,NaN,NaN
(5-9)=-4,(5-7)=-2,NaN
Similar to Ben's answer, but with np.where:
import numpy as np
import pandas as pd

pd.DataFrame(np.where(dist_df < 5, df1.col2.values[:, None] - df2.col2.values, np.nan),
             index=dist_df.index,
             columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
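The key step here is the broadcast subtraction df1.col2.values[:, None] - df2.col2.values, which builds the whole pairwise difference matrix at once; a minimal illustration with the question's data:

import numpy as np

col2_1 = np.array([9, 3, 2, 5])   # df1.col2
col2_2 = np.array([9, 7, 2])      # df2.col2

# A (4,1) column minus a (3,) row broadcasts to a (4,3) matrix
print(col2_1[:, None] - col2_2)
# [[ 0  2  7]
#  [-6 -4  1]
#  [-7 -5  0]
#  [-4 -2  3]]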
In your case, using numpy with mask (here df is the dist_df from the question):
df.mask(df < 5, df - (df1.col2.values[:, None] + df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf = (df - (-df1.col2.values[:, None] + df2.col2.values) - df).where(df < 5)
Newdf
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN

Pandas: cell-wise fillna(method = 'pad') of a list of DataFrame

Basically, I'm trying to do something like this but for a fillna instead of a sum.
I have a list of df's, each with same colunms/indexes, ordered over time:
import numpy as np
import pandas as pd

np.random.seed(0)
df_list = []
for index in range(3):
    a = pd.DataFrame(np.random.randint(3, size=(5, 3)), columns=list('abc'))
    mask = np.random.choice([True, False], size=a.shape)
    df_list.append(a.mask(mask))
Now I want to replace the np.nan cells of the i-th DataFrame in df_list with the value of the same cell in the (i-1)-th DataFrame in df_list.
so if the first DataFrame is:
a b c
0 NaN 1.0 0.0
1 1.0 1.0 NaN
2 0.0 NaN 0.0
3 NaN 0.0 2.0
4 NaN 2.0 2.0
and the 2nd is:
a b c
0 0.0 NaN NaN
1 NaN NaN NaN
2 0.0 1.0 NaN
3 NaN NaN 2.0
4 0.0 NaN 2.0
Then the output, output_list, should be a list of the same length as df_list, also with DataFrames as elements.
The first entry of output_list is the same as the first entry of df_list.
The second entry of output_list is:
a b c
0 0.0 1.0 0.0
1 1.0 1.0 NaN
2 0.0 1.0 0.0
3 NaN 0.0 2.0
4 0.0 2.0 2.0
I believe the update functionality is very good for this; see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
It is a method that specifically allows you to update a DataFrame, in your case only its NaN elements.
In particular, you could use it like this:
new_df_list = df_list[:1]
for df_new, df_old in zip(df_list[1:], df_list[:-1]):
    # update modifies df_new in place; overwrite=False fills only
    # df_new's NaNs and leaves its existing values untouched.
    # Because the update is in place, the next iteration's df_old is
    # the already-filled frame, so values pad forward through the list.
    df_new.update(df_old, overwrite=False)
    new_df_list.append(df_new)
which will give you the desired output.
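A non-mutating variant of the same idea, assuming the frames share index and columns, is DataFrame.combine_first, which fills the caller's NaNs from its argument; each step reads the already-padded previous result, so values propagate forward:

output_list = [df_list[0]]
for df in df_list[1:]:
    # combine_first returns a new frame: df's NaNs are filled
    # from the previous (already padded) result
    output_list.append(df.combine_first(output_list[-1]))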

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
import pandas as pd

infile = "C:\****"
df = pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value, using both the column index number and the column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
Here is proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
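If method 2 still returns an empty frame on the real file, a hedged guess is that read_csv did not parse it as expected (wrong delimiter or header), leaving every value in C as NaN; a quick diagnostic before filtering:

print(df.shape)                      # how many rows/columns were parsed?
print(df.dtypes)                     # did everything land in one object column?
print(df['C'].isna().sum(), "NaNs in column C")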
