pandas dataframe multiplication by column, with index matched - python

df1=pd.DataFrame(np.random.randn(6,3),index=list("ABCDEF"),columns=list("XYZ"))
df2=pd.DataFrame(np.random.randn(6,1),index=list("ABCDEF"))
I want to multiply each column of df1 with df2, and match by index label. That means:
df1["X"]*df2
df1["Y"]*df2
df1["Z"]*df2
The output would have the index and columns of df1.
How can I do this? I've tried several ways, and none of them worked...

Use the mul function and multiply the DataFrame by a Series (the column), selected by position with iloc:
print(df1.mul(df2.iloc[:,0], axis=0))
X Y Z
A -0.577748 0.299258 -0.021782
B -0.952604 0.024046 -0.276979
C 0.175287 2.507922 0.597935
D -0.002698 0.043514 -0.012256
E -1.598639 0.635508 1.532068
F 0.196783 -0.234017 -0.111166
Detail:
print(df2.iloc[:, 0])
A -2.875274
B 1.881634
C 1.369197
D 1.358094
E -0.024610
F 0.443865
Name: 0, dtype: float64
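Putting it together, a minimal reproducible sketch (random data, so the values differ from the output above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(6, 3), index=list("ABCDEF"), columns=list("XYZ"))
df2 = pd.DataFrame(np.random.randn(6, 1), index=list("ABCDEF"))

# mul with axis=0 aligns df2's single column to df1's index and
# broadcasts it across every column of df1
result = df1.mul(df2.iloc[:, 0], axis=0)

# equivalent: multiply each column separately and reassemble
expected = pd.DataFrame({c: df1[c] * df2.iloc[:, 0] for c in df1.columns})
assert result.equals(expected)
```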

You can use apply to multiply each column of df1 by df2's single column (labeled 0 by default):
df1.apply(lambda x: x * df2[0], axis=0)
X Y Z
A -0.437749 0.515611 -0.870987
B 0.105674 1.679020 -0.693983
C 0.055004 0.118673 -0.028035
D 0.704775 -1.786515 -0.982376
E 0.109218 -0.021522 -0.188369
F 1.491816 0.105558 -1.814437
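If the rows are already in the same positional order, a plain NumPy broadcast also works; note this is a sketch that bypasses label alignment entirely, so it silently assumes the indexes match position for position:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df1 = pd.DataFrame(np.random.randn(6, 3), index=list("ABCDEF"), columns=list("XYZ"))
df2 = pd.DataFrame(np.random.randn(6, 1), index=list("ABCDEF"))

# (6, 3) * (6, 1) broadcasts the single column across all three columns
result = df1 * df2.values

# same as the label-aligned version when positions already match
assert result.equals(df1.mul(df2.iloc[:, 0], axis=0))
```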

Related

Is there a way to merge 2 rows of a df into 1?

I have a df that has plenty of row pairs that need to be condensed into 1. Column B identifies the pairs. All column values except one are identical. Is there a way to accomplish this in pandas?
Existing df:
A B C D E
x c v 2 w
x c v 2 r
Desired Output:
A B C D E
x c v 2 w,r
It's a little bit unintuitive to read but works:
df2 = (
    df.groupby('B', as_index=False)
      .agg({**dict.fromkeys(df.columns, 'first'), 'E': ','.join})
)
Here we group by column B and take the first occurring value for every column, but override the aggregation for column E so that the E values sharing the same B are joined with a comma.
Hence you get:
A B C D E
0 x c v 2 w,r
This makes no assumptions about data types and leaves non-string columns alone, but it will of course error out if your E column contains non-string values (or types that can't logically be joined).
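A self-contained sketch of the same idea, leaving the grouping key out of the dict (pandas versions differ on whether the key column may appear in the agg mapping):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'x'],
    'B': ['c', 'c'],
    'C': ['v', 'v'],
    'D': [2, 2],
    'E': ['w', 'r'],
})

# 'first' for every non-grouped column, a comma join for E
agg_map = {c: 'first' for c in df.columns if c not in ('B', 'E')}
agg_map['E'] = ','.join

df2 = df.groupby('B', as_index=False).agg(agg_map)
```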
Like this:
df = df.apply(lambda x: ','.join(x), axis=0)
Note this joins every value in each column into one string. To use specific columns, subset first:
df = df[['A','B']].apply(lambda x: ','.join(x), axis=0)

Python pandas dataframe: How to count and show the number of missing value in dataframe only?

How can I count and show only the number of missing values in a dataframe?
I am using:
df.isna().sum() but it shows all columns, including those without missing values. How can I count and show only the columns that have missing values, sorted by descending counts?
Thanks so much!
In my opinion the simplest is to remove the 0 values by boolean indexing and then sort_values:
s = df.isna().sum()
s = s[s != 0].sort_values(ascending=False)
Or use any to filter only the columns with at least one True (i.e., at least one NaN):
df1 = df.isna()
s = df1.loc[:, df1.any()].sum().sort_values(ascending=False)
Sample:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, 5, np.nan, 5, 5, np.nan],
    'C': [7, 8, 9, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [np.nan, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
s = df.isna().sum()
s = s[s != 0].sort_values(ascending=False)
print (s)
B 3
E 2
C 1
dtype: int64
You can use pipe to remove zero values from your totals:
>>> df.isnull().sum().sort_values(ascending=False).pipe(lambda s: s[s > 0])
B 3
E 2
C 1
dtype: int64
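Both variants can be checked against the sample above; a quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, 5, np.nan, 5, 5, np.nan],
    'C': [7, 8, 9, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [np.nan, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

# pipe-based filter and a loc-with-callable filter give the same result
s1 = df.isna().sum().pipe(lambda s: s[s > 0]).sort_values(ascending=False)
s2 = df.isna().sum().loc[lambda s: s > 0].sort_values(ascending=False)
assert s1.equals(s2)
```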

Python: return max column from a set of columns

Let's say I have a dataframe with columns A, B, C, D:
import pandas as pd
import numpy as np
## create dataframe 100 by 4
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
df.head(10)
I would like to create a new column, "max_bcd", and this column will say 'b','c','d', indicating that for that particular row, one of those three columns contains the largest value.
Does anyone know how to accomplish that?
Try this: idxmax with axis=1 will help you find the column holding the max value:
>>> df.idxmax(axis=1)
0 B
1 C
2 D
dtype: object
import pandas as pd
import numpy as np
cols = ['B', 'C', 'D']
## create dataframe 100 by 4
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
df.head(10)
df.insert(4, 'max_BCD_name', None)
df.insert(5, 'max_BCD_value', None)
df['max_BCD_name'] = df[cols].idxmax(axis=1) # column name
df['max_BCD_value'] = df[cols].max(axis=1) # value
print(df)
Edit: Just saw your requirement of only B, C and D. Added code for that.
Output:
A B C D max_BCD_name max_BCD_value
0 -0.653010 -1.479903 3.415286 -1.246829 C 3.415286
1 0.343084 1.243901 0.502271 -0.467752 B 1.243901
2 0.099207 1.257792 -0.997121 -1.559208 B 1.257792
3 -0.646787 1.053846 -2.663767 1.022687 B 1.053846
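A runnable sketch of the vectorized version (no apply needed; random data, so the values will differ):

```python
import numpy as np
import pandas as pd

np.random.seed(2)
cols = ['B', 'C', 'D']
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

df['max_BCD_name'] = df[cols].idxmax(axis=1)   # name of the winning column
df['max_BCD_value'] = df[cols].max(axis=1)     # its value

# sanity check: the named column really holds that row's maximum
first = df.iloc[0]
assert first[first['max_BCD_name']] == first['max_BCD_value']
```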

Summing columns to form a new dataframe

I have a DataFrame
A B C D
2015-07-18 4.534390e+05 2.990611e+05 5.706540e+05 4.554383e+05
2015-07-22 3.991351e+05 2.606576e+05 3.876394e+05 4.019723e+05
2015-08-07 1.085791e+05 8.215599e+04 1.356295e+05 1.096541e+05
2015-08-19 1.397305e+06 8.681048e+05 1.672141e+06 1.403100e+06
...
I simply want to sum all columns to get a new dataframe
A B C D
sum s s s s
with the column-wise sums, and then print it with to_csv(). But when I use
print(df.sum(axis=0))
I get a Series instead:
A 9.099377e+06
B 5.897003e+06
C 1.049932e+07
D 9.208681e+06
dtype: float64
You can transform df.sum() to DataFrame and transpose it:
In [39]: df.sum().to_frame('sum').T
Out[39]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
A slightly shorter version of pd.DataFrame is (with credit to jezrael for simplification):
In [120]: pd.DataFrame([df.sum()], index=['sum'])
Out[120]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
Use DataFrame constructor:
df = pd.DataFrame(df.sum().values.reshape(-1, len(df.columns)),
                  columns=df.columns,
                  index=['sum'])
print (df)
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
I think the simplest is df.agg(['sum']) (passing the string 'sum' keeps the row labeled 'sum'):
df.agg(['sum'])
Out[40]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
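All three answers produce the same one-row frame; a sketch comparing them (random data):

```python
import numpy as np
import pandas as pd

np.random.seed(3)
df = pd.DataFrame(np.random.rand(4, 4), columns=list('ABCD'))

r1 = df.sum().to_frame('sum').T                 # transpose route
r2 = pd.DataFrame([df.sum()], index=['sum'])    # constructor route
r3 = df.agg(['sum'])                            # agg route

assert r1.equals(r2)
assert np.allclose(r1.values, r3.values)
```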

Converting multiple columns to categories in Pandas. apply?

Consider a Dataframe. I want to convert a set of columns to_convert to categories.
I can certainly do the following:
for col in to_convert:
    df[col] = df[col].astype('category')
but I was surprised that the following does not return a dataframe:
df[to_convert].apply(lambda x: x.astype('category'), axis=0)
which of course makes the following not work:
df[to_convert] = df[to_convert].apply(lambda x: x.astype('category'), axis=0)
Why does apply (axis=0) return a Series even though it is supposed to act on the columns one by one?
This was just fixed in master, and so will be in 0.17.0, see the issue here
In [7]: df = DataFrame({'A' : list('aabbcd'), 'B' : list('ffghhe')})
In [8]: df
Out[8]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [9]: df.dtypes
Out[9]:
A object
B object
dtype: object
In [10]: df.apply(lambda x: x.astype('category'))
Out[10]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [11]: df.apply(lambda x: x.astype('category')).dtypes
Out[11]:
A category
B category
dtype: object
Note that since pandas 0.23.0 you no longer need apply to convert multiple columns to categorical data types. Now you can simply do df[to_convert].astype('category') instead (where to_convert is a set of columns as defined in the question).
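For recent pandas versions, a minimal sketch of the direct astype route (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': list('aabbcd'), 'B': list('ffghhe'), 'C': range(6)})
to_convert = ['A', 'B']

# astype works on a multi-column selection directly; assign back in place
df[to_convert] = df[to_convert].astype('category')
```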
