Pandas combine two columns - python

I have the following DataFrame:
df = pandas.DataFrame({'Buy':[10,np.nan,2,np.nan,np.nan,4],'Sell':[np.nan,7,np.nan,9,np.nan,np.nan]})
Out[37]:
Buy Sell
0 10.0 NaN
1 NaN 7.0
2 2.0 NaN
3 NaN 9.0
4 NaN NaN
5 4.0 NaN
I want to create two more columns called Quant and B/S.
For Quant it works fine, as follows:
df['Quant'] = df['Buy'].fillna(df['Sell']) # Fetch available value from both column and if both values are Nan then output is Nan.
Output is:
df
Out[39]:
Buy Sell Quant
0 10.0 NaN 10.0
1 NaN 7.0 7.0
2 2.0 NaN 2.0
3 NaN 9.0 9.0
4 NaN NaN NaN
5 4.0 NaN 4.0
But I want to create B/S based on which column each value was taken from when creating Quant.

You can perform an equality test and feed into numpy.where:
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
For the case where both values are null, you can use an additional step:
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Example
from io import StringIO
import numpy as np
import pandas as pd
mystr = StringIO("""Buy Sell
10 nan
nan 8
4 nan
nan 5
nan 7
3 nan
2 nan
nan nan""")
df = pd.read_csv(mystr, sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
df['Quant'] = df['Buy'].fillna(df['Sell'])
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Result
print(df)
Buy Sell Quant B/S
0 10.0 NaN 10.0 B
1 NaN 8.0 8.0 S
2 4.0 NaN 4.0 B
3 NaN 5.0 5.0 S
4 NaN 7.0 7.0 S
5 3.0 NaN 3.0 B
6 2.0 NaN 2.0 B
7 NaN NaN NaN NaN
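An alternative that avoids the equality test is to label each row directly from which column is non-null. The following is a minimal sketch, not part of the original answer: the variable name `side` is made up for illustration, and it assumes Buy should take priority when both columns are filled, mirroring the priority of `fillna`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy': [10, np.nan, 2, np.nan, np.nan, 4],
                   'Sell': [np.nan, 7, np.nan, 9, np.nan, np.nan]})
df['Quant'] = df['Buy'].fillna(df['Sell'])

# Start with all-missing labels, then fill from the lower-priority column first.
side = pd.Series(np.nan, index=df.index, dtype=object)
side[df['Sell'].notna()] = 'S'
side[df['Buy'].notna()] = 'B'   # Buy overwrites Sell, matching fillna's priority
df['B/S'] = side

print(df)
```

Rows where both Buy and Sell are null keep the initial NaN label, so no separate clean-up step is needed.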

Related

Creating non-exist columns in multiindex dataframe

Let's say we have dataframe like this
df = pd.DataFrame({
"metric": ["1","2","1" ,"1","2"],
"group1":["o", "x", "x" , "o", "x"],
"group2":['a', 'b', 'a', 'a', 'b'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
metric group1 group2 value value2
0 1 o a 0 0
1 2 x b 1 2
2 1 x a 2 4
3 1 o a 3 6
4 2 x b 4 8
then I want to have pivot format
df['g'] = df.groupby(['group1','group2'])['group2'].cumcount()
df1 = df.pivot(index=['g','metric'], columns=['group1','group2'], values=['value','value2']).sort_index(axis=1).rename_axis(columns={'g':None})
value value2
group1 o x o x
group2 a a b a a b
g metric
0 1 0.0 2.0 NaN 0.0 4.0 NaN
2 NaN NaN 1.0 NaN NaN 2.0
1 1 3.0 NaN NaN 6.0 NaN NaN
2 NaN NaN 4.0 NaN NaN 8.0
From here we can see that ("value","o","b") and ("value2","o","b") do not exist after the pivot,
but I need those columns to be present, filled with NA.
So I tried:
cols = [('value','x','a'), ('value','o','a'),('value','o','b')]
df1.assign(**{col : "NA" for col in np.setdiff1d(cols, df1.columns.values)})
which gives
Expected output
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
One corner case: if b does not exist at all, how can that column be created?
value value2
group1 o x o x
group2 a a a a
g metric
0 1 0.0 2.0 0.0 4.0
2 NaN NaN NaN NaN
1 1 3.0 NaN 6.0 NaN
2 NaN NaN NaN NaN
Multiple insert columns if not exist pandas
Pandas: Check if column exists in df from a list of columns
Pandas - How to check if multi index column exists
Use DataFrame.stack with DataFrame.unstack:
df1 = df1.stack([1,2],dropna=False).unstack([2,3])
print (df1)
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
Or select the last two levels by negative position:
df1 = df1.stack([-2,-1],dropna=False).unstack([-2,-1])
Another idea:
df1 = df1.reindex(pd.MultiIndex.from_product(df1.columns.levels), axis=1)
print (df1)
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
EDIT:
If you need to set the new columns from a list of tuples:
cols = [('value','x','a'), ('value','o','a'),('value','o','b')]
df = df1.reindex(pd.MultiIndex.from_tuples(cols).union(df1.columns), axis=1)
print (df)
value value2
o x o x
a b a b a a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN
2 NaN NaN NaN 4.0 NaN NaN 8.0
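For the corner case where a level value such as b never appears in the data, `df1.columns.levels` cannot supply it, so the full set of level values has to be spelled out. The following is a minimal sketch under that assumption: the hard-coded lists `["o", "x"]` and `["a", "b"]` and the names `full` and `out` are illustrative, not taken from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "metric": ["1", "2", "1", "1", "2"],
    "group1": ["o", "x", "x", "o", "x"],
    "group2": ["a", "b", "a", "a", "b"],
    "value": range(5),
    "value2": np.array(range(5)) * 2,
})
df["g"] = df.groupby(["group1", "group2"])["group2"].cumcount()
df1 = df.pivot(index=["g", "metric"], columns=["group1", "group2"],
               values=["value", "value2"])

# Build every combination explicitly instead of relying on the levels
# that happen to be present in df1.columns.
full = pd.MultiIndex.from_product(
    [["value", "value2"], ["o", "x"], ["a", "b"]],
    names=df1.columns.names,
)
out = df1.reindex(columns=full)  # missing combinations are filled with NaN

print(out)
```

Because the level values are listed by hand, the result has the same eight columns whether or not b occurs in the input.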

Convert two pandas rows into one

I want to convert the dataframe below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 10.0 4.0 NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows for A and B should be merged into one row, as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID',columns='TYPE', values=['A','B'])
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df.reset_index()
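As a self-contained sketch of the pivot-then-flatten approach, the snippet below rebuilds the question's input by hand and assigns the result to a new name (`wide`, chosen here for illustration). It uses an f-string instead of `'_'.join(reversed(col))`, which should produce the same `TYPE_column` names.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3, 4, 5, 5],
    'TYPE': ['MISSING', '1T', '2T', 'MISSING', '2T', 'CBN', 'DSV'],
    'A': [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    'B': [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# Pivot gives MultiIndex columns of the form (value_column, TYPE).
wide = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])

# Flatten each (value_column, TYPE) pair into "TYPE_valuecolumn".
wide.columns = [f'{typ}_{val}' for val, typ in wide.columns]
wide = wide.reset_index()

print(wide)
```

Note that `pivot` raises a ValueError if the same ID/TYPE pair occurs more than once, and that the columns come out grouped by A then B rather than in the exact order shown in the question.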

Manipulating value in a column based on a rule

I have 3 columns, A, B and C, in a pandas dataframe. What I want to do is: wherever A is not null AND either B or C is not null, that row in A should be set to null.
if(dffinal['A'].loc[dffinal['A'].notnull()] &
   (dffinal['B'].loc[dffinal['B'].notnull()] |
    dffinal['C'].loc[dffinal['C'].notnull()])):
    dffinal['A'] = np.nan
This is the error I'm getting: cannot do a non-empty take from an empty axes.
Use df.loc[]:
df.loc[df.A.notna() & (df.B.notna()|df.C.notna()),'A']=np.nan
The first condition is not necessary here, because setting A to NaN where it is already NaN changes nothing, so the solution can be simplified:
dffinal = pd.DataFrame({
'A':[np.nan,np.nan,4,5,5,np.nan],
'B':[7,np.nan,np.nan,4,np.nan,np.nan],
'C':[1,3,5,7,np.nan,np.nan],
})
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 4.0 NaN 5.0
3 5.0 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
mask = (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
The output is the same when the first condition is included:
mask = dffinal['A'].notnull() & (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN

Python Pandas Dataframe replace values below threshold

How can I apply a function element-wise to a pandas DataFrame and pass a column-wise calculated value (e.g. quantile of column)? For example, what if I want to replace all elements in a DataFrame (with NaN) where the value is lower than the 80th percentile of the column?
def _deletevalues(x, quantile):
    if x < quantile:
        return np.nan
    else:
        return x
df.applymap(lambda x: _deletevalues(x, x.quantile(0.8)))
Using applymap gives access to each value only individually and (of course) throws AttributeError: 'float' object has no attribute 'quantile'.
Thank you in advance.
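Use DataFrame.mask. Note that quantile() defaults to q=0.5 (the median), which is what the output below uses; pass 0.8 for the 80th percentile: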
df = df.mask(df < df.quantile())
print (df)
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0
In [139]: df
Out[139]:
a b c
0 1 7 3
1 1 2 6
2 3 0 5
3 8 2 1
4 7 3 5
5 6 7 2
6 0 2 1
7 8 4 1
8 5 0 6
9 7 7 6
for all columns:
In [145]: df.apply(lambda x: np.where(x < x.quantile(),np.nan,x))
Out[145]:
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0
or
In [149]: df[df < df.quantile()] = np.nan
In [150]: df
Out[150]:
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0
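The outputs above were produced with the default `quantile()`, which is the median. For the 80th percentile asked about in the question, a minimal sketch on the same sample data could look like this (the names `threshold` and `out` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 3, 8, 7, 6, 0, 8, 5, 7],
    'b': [7, 2, 0, 2, 3, 7, 2, 4, 0, 7],
    'c': [3, 6, 5, 1, 5, 2, 1, 1, 6, 6],
})

# One threshold per column; the resulting Series is indexed by column name.
threshold = df.quantile(0.8)

# The comparison aligns the Series with the DataFrame's columns, so each
# column is tested against its own threshold. mask() replaces True cells with NaN.
out = df.mask(df < threshold)

print(out)
```

The comparison and the replacement are both vectorised, so no element-wise function such as `applymap` is needed.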

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
df = pd.DataFrame([[1,1,np.nan,3],[2,3,7,np.nan],[4,5,np.nan,8],[5,np.nan,4,9],[np.nan,1,2,np.nan]], columns = ['A','B','C','D'])
df = df[df['C'].notnull()]
df
This just demonstrates that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
