Performing a correlation on multiple columns in pandas - python

Is it possible to correlate multiple columns against one column in pandas? Something like:
DF[['A', 'B']].corr(DF['C'])

I believe you need corrwith, selecting the multiple columns with a list:
import pandas as pd

DF = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'A': [1, 3, 5, 7, 1, 0],
})
print(DF[['A', 'B']].corrwith(DF['C']))
A    0.319717
B   -0.316862
dtype: float64
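For reference, the same numbers can be read off the full correlation matrix, which may be handier when every pairwise value is needed anyway; a minimal sketch using the DF defined above:
# Column 'C' of the correlation matrix, restricted to rows A and B
print(DF.corr().loc[['A', 'B'], 'C'])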

Related

How to factorize entire DataFrame in pyspark

I have a Pyspark DataFrame and I want to factorize the entire df rather than each column separately, so that two different values in two columns never end up with the same factorized value. I could do it with pandas as follows:
import numpy as np
import pandas as pd

_, b = pd.factorize(df.values.T.reshape(-1))
df = df.apply(lambda x: pd.Categorical(x, b).codes)
df = df.replace(-1, np.nan)
Does anyone know how to do the same in Pyspark? Thank you very much.
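To make the goal concrete, here is a minimal demonstration of the pandas approach above on made-up data (the frame and its columns x, y are illustrative):
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'a'], 'y': ['b', 'c', 'b']})
_, b = pd.factorize(df.values.T.reshape(-1))
print(df.apply(lambda x: pd.Categorical(x, b).codes))
#    x  y
# 0  0  1
# 1  1  2
# 2  0  1
The value 'b' receives the same code (1) in both columns, which per-column factorization would not guarantee.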

Mean and standard deviation with multiple dataframes

I have multiple dataframes with the same columns and the same number of observations. For example:
d1 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [1, 2, 3, 4]}
df1 = pd.DataFrame(data=d1)
d2 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [6, 0, 1, 5]}
df2 = pd.DataFrame(data=d2)
d3 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [8, 1, 2, 3]}
df3 = pd.DataFrame(data=d3)
I need to drop one ID (D) and its corresponding value in each of the dataframes, and then, for each remaining ID, calculate the mean and standard deviation.
The expected output should be
   avg  std
A    5  ...
B  ...  ...
C  ...  ...
Generally, for one dataframe, I would use drop to remove the rows and then compute the average with mean() and the standard deviation with std().
How can I do this easily and quickly across multiple dataframes? (I have at least 10 of them.)
Use concat, remove D with DataFrame.query, and aggregate with GroupBy.agg using named aggregations:
df = (pd.concat([df1, df2, df3])
        .query('ID != "D"')
        .groupby('ID')
        .agg(avg=('Amount', 'mean'), std=('Amount', 'std')))
print(df)
    avg       std
ID
A     5  3.605551
B     1  1.000000
C     2  1.000000
Or remove D in the last step with DataFrame.drop:
df = (pd.concat([df1, df2, df3])
        .groupby('ID')
        .agg(avg=('Amount', 'mean'), std=('Amount', 'std'))
        .drop('D'))
You can use pivot_table as well; passing the aggregations as strings sidesteps the warning newer pandas raises when np.mean is passed directly:
pd.concat([df1, df2, df3]).pivot_table(index='ID', aggfunc=['mean', 'std']).drop('D')
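Since the question mentions at least 10 frames, note that concat accepts any list of dataframes, so the pattern scales unchanged; dfs below is a hypothetical name for that list:
dfs = [df1, df2, df3]  # extend with df4, ..., df10 as needed
out = (pd.concat(dfs)
         .query('ID != "D"')
         .groupby('ID')
         .agg(avg=('Amount', 'mean'), std=('Amount', 'std')))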

DataFrame 'groupby' is fixing group columns with index

I have used a simple 'groupby' to condense rows in a Pandas dataframe:
df = df.groupby(['col1', 'col2', 'col3']).sum()
In the new DataFrame df, the three columns used in the groupby are now part of the index and are no longer columns 0, 1 and 2; what was previously column 4 is now column 0.
How do I stop this from happening / re-include the three groupby columns along with the original data?
Try:
df = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()
# or
df = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
Try resetting the index:
df = df.reset_index()
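A minimal demonstration on made-up data, showing that both variants keep the grouping keys as regular columns:
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'x'], 'col2': ['y', 'y'],
                   'col3': ['z', 'z'], 'val': [1, 2]})
print(df.groupby(['col1', 'col2', 'col3'], as_index=False).sum())
#   col1 col2 col3  val
# 0    x    y    z    3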

How to drop column from the target data frame, but the column(s) are required for the join in merge

I have two dataframes, df1 and df2:
df1.columns
['id', 'a', 'b']
df2.columns
['id', 'ab', 'cd', 'ab_test', 'mn_test']
The expected output columns are ['id', 'a', 'b', 'ab_test', 'mn_test'].
How do I get all the columns from df1, plus the columns from df2 whose names contain test?
Pseudocode: pd.merge(df1, df2, on='id')
You can merge and use filter on the second dataframe to keep only the columns of interest:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:, (df2.columns == 'id') | df2.columns.str.contains('test')], on='id')
df1 = pd.DataFrame(columns=['id','a','b'])
df2 = pd.DataFrame(columns=['id','ab','cd','ab_test','mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')
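Alternatively (a sketch of the title's drop-after-merge idea), merge everything and drop the unwanted df2 columns afterwards; here 'ab' and 'cd' are the columns to discard:
df1.merge(df2, on='id').drop(columns=['ab', 'cd'])
The filter approach above scales better when the unwanted columns are not known by name.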

How to perform multiple pandas data type changes on different columns with one function?

I have 41 columns in a dataframe; I want to change 22 of them to the data type 'str' and 1 column to 'float'.
Currently I am converting the columns one line at a time, and would be doing the same for 20 more columns:
df.active = df.active.astype(str)
df.total_spent = df.total_spent.astype(float)
How do I write a function that takes the columns I want converted to string, plus the one column above that I want as a float?
Let me know if you would like the list of columns; I thought it would be too much for now.
Thank you in advance.
I believe you need to select columns by a list of names or by positions, then convert:
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
If you want to select columns by name:
c = ['B', 'C']
df[c] = df[c].astype(str)
If you want to select columns by position:
p = [1, 2]
df.iloc[:, p] = df.iloc[:, p].astype(str)
print(df.dtypes)
A    object
B    object
C    object
D     int64
E     int64
F    object
dtype: object
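As a one-call alternative (a sketch; the column names stand in for the asker's real ones), astype also accepts a dict mapping column names to dtypes, which handles the mixed str/float case directly:
df = df.astype({'B': str, 'C': str, 'D': float})
print(df.dtypes)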
