I am running a Pearson correlation on my data set (imported from Excel), and this is the order the results come out in:
What I was wondering is whether it is possible to get n_hhld_trip as my first column, since it is my dependent variable.
Below is the code I have so far, but I am not sure how to make it reflect the change I want. I tried reordering the variables in the pivot_table call, but that didn't do it:
import numpy as np
import pandas as pd

# read_excel is the DataFrame loaded from the Excel file
zone_sum_mean_combo = pd.pivot_table(
    read_excel,
    index=['Zone'],
    aggfunc={'Household ID': np.mean, 'dwtype': np.mean, 'n_hhld_trip': np.sum,
             'expf': np.mean, 'n_emp_ft': np.sum, 'n_emp_home': np.sum,
             'n_emp_pt': np.sum, 'n_lic': np.sum, 'n_pers': np.sum,
             'n_student': np.sum, 'n_veh': np.sum}
)
index_reset = zone_sum_mean_combo.reset_index()
print(index_reset)
pearson_correlation = index_reset.corr(method='pearson')
print(pearson_correlation)
Sometimes it can be easier to hardcode the column order after everything is done:
df = df[["my_first_column", "my_second_column"]]
In your case, I think it's easier to just manipulate them:
columns = list(df.columns)
columns.remove("n_hhld_trip")
columns.insert(0, "n_hhld_trip")
df = df[columns]
Try set_index followed by reset_index; reset_index inserts the popped level back as the first column:
df.set_index('n_hhld_trip', append=True).reset_index(level=-1)
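As a quick check of that trick, here is a minimal sketch on a toy frame (the column names merely imitate the question's data):

```python
import pandas as pd

# Toy frame standing in for index_reset; values are invented.
df = pd.DataFrame({'Zone': [1, 2], 'expf': [0.5, 0.7], 'n_hhld_trip': [10, 20]})

# Append the column to the index, then pop it back out:
# reset_index re-inserts the popped level at position 0.
df = df.set_index('n_hhld_trip', append=True).reset_index(level=-1)

print(list(df.columns))  # n_hhld_trip is now the first column
```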
I imported a dataset into my python script and took the correlation. This is the code for correlation:
data = pd.read_excel('RQ_ID_Grouping.xlsx' , 'Sheet1')
corr = data.corr()
After the correlation the data looks like this:
I want to convert the data into below format:
I am using this code to achieve the above, but it doesn't seem to be working:
corr1 = (corr.melt(var_name='X', value_name='Y').groupby('X')['Y'].reset_index(name='Corr_Value'))
I know there should be something after the groupby part, but I don't know what. If you could help me, I would greatly appreciate it.
Use DataFrame.stack to reshape (it drops missing values by default), convert the MultiIndex to columns with DataFrame.reset_index, and finally set the column names:
df = corr.stack().reset_index()
df.columns = ['X','Y','Corr_Value']
Another solution with DataFrame.rename_axis:
df = corr.stack().rename_axis(('X','Y')).reset_index(name='Corr_Value')
And your solution with melt is also possible:
df = (corr.rename_axis('X')
.reset_index()
.melt('X', var_name='Y', value_name='Corr_Value')
.dropna()
.sort_values(['X','Y'])
.reset_index(drop=True))
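For a self-contained check of the stack approach, here is a minimal sketch on a toy frame (the column names are made up, not from RQ_ID_Grouping.xlsx):

```python
import pandas as pd

# Small frame standing in for the imported data set.
data = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [3, 1, 2]})
corr = data.corr()

# Wide correlation matrix -> long (X, Y, Corr_Value) table.
df = corr.stack().rename_axis(('X', 'Y')).reset_index(name='Corr_Value')

print(df.head())
print(list(df.columns))  # ['X', 'Y', 'Corr_Value']
```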
Quick question:
I have the following situation (table):
Imported data frame
Now what I would like to achieve is the following (or something along those lines; it does not have to be exactly that):
Goal
I do not want the following columns, so I drop them:
data.drop(data.columns[[0,5,6]], axis=1,inplace=True)
I assumed that the following line of code could solve it, but I am missing something:
pivoted = data.pivot(index=["Intentional homicides and other crimes","Unnamed: 2"],columns='Unnamed: 3', values='Unnamed: 4')
produces
ValueError: Length of passed values is 3395, index implies 2
The difference from the linked question is that I do not want any aggregation function; I just want to leave the values as they are.
Data can be found at: Data
The problem with the method pandas.DataFrame.pivot is that it does not handle duplicate values in the index. One way to solve this is to use the function pandas.pivot_table instead.
df = pd.read_csv('Crimes_UN_data.csv', skiprows=[0], encoding='latin1')
cols = list(df.columns)
cols[1] = 'Region'
df.columns = cols
pivoted = pd.pivot_table(df, values='Value', index=['Region', 'Year'], columns='Series', aggfunc=sum)
It should not sum anything despite the aggfunc argument, but without that argument it was throwing pandas.core.base.DataError: No numeric types to aggregate.
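To see the difference between the two methods, here is a minimal sketch on toy data with a duplicated (Region, Year) pair (the values are invented, not from the UN file):

```python
import pandas as pd

# Toy data with a duplicate (Region, Year) pair for the same Series.
df = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Asia', 'Asia'],
    'Year':   [2010, 2010, 2010, 2011],
    'Series': ['Homicide', 'Homicide', 'Homicide', 'Homicide'],
    'Value':  [5.0, 6.0, 2.0, 3.0],
})

# pivot raises on the duplicated ('Africa', 2010) entry...
try:
    df.pivot(index='Region', columns='Series', values='Value')
except ValueError as exc:
    print('pivot failed:', exc)

# ...while pivot_table aggregates the duplicates instead.
pivoted = pd.pivot_table(df, values='Value', index=['Region', 'Year'],
                         columns='Series', aggfunc='sum')
print(pivoted)
```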
frame = frame2.groupby(['name1', 'name2', 'date', 'agID', 'figi', 'exch',
                        'marketSector', 'name', 'fx_currency', 'id_type',
                        'id', 'currency']).agg(
    {'call_agreed_amount': 'sum',
     'pledge_current_market_value': 'sum',
     'pledge_quantity': 'sum',
     'pledge_adjusted_collateral_value': 'sum',
     'count': 'count'})
print(frame.head())
for value in frame['call_currency']:
doStuff()
In the code above, all columns exist before the groupby statement. After the groupby statement is executed, frame.head() returns all of the same columns. My code fails at the for loop with a KeyError when accessing frame['call_currency'], which definitely exists in frame.
After troubleshooting, I realized that pandas' groupby followed by agg returns a DataFrame whose index is built from the grouping columns (a hierarchical index when grouping by several). The grouped columns become the index rather than regular columns. To fix this, I added .reset_index() to the end of my groupby statement.
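A minimal reproduction of that behaviour, using a cut-down frame with invented values:

```python
import pandas as pd

# Grouping keys move into the index after agg.
frame2 = pd.DataFrame({'name1': ['a', 'a', 'b'],
                       'call_agreed_amount': [1, 2, 3]})

grouped = frame2.groupby(['name1']).agg({'call_agreed_amount': 'sum'})
print('name1' in grouped.columns)   # False: it is now the index

# reset_index moves the keys back into ordinary columns.
fixed = grouped.reset_index()
print('name1' in fixed.columns)     # True again
```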
My question is similar to this one; however, I do need to rename columns, because I aggregate my data using custom functions:
def series(x):
return ','.join(str(item) for item in x)
agg = {
'revenue': ['sum', series],
'roi': ['sum', series],
}
df.groupby('name').agg(agg)
As a result I have groups of identically named columns:
which become completely indistinguishable after I drop the higher column level:
df.columns = df.columns.droplevel(0)
So, how do I go about keeping unique names for my columns?
Use map to flatten the column names:
df.columns = df.columns.map('_'.join)
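Putting the pieces together, a small self-contained sketch with invented sample values:

```python
import pandas as pd

def series(x):
    # Same helper as above: join the group's values into one string.
    return ','.join(str(item) for item in x)

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'revenue': [1, 2, 3],
                   'roi': [0.1, 0.2, 0.3]})

out = df.groupby('name').agg({'revenue': ['sum', series], 'roi': ['sum', series]})

# Join the two column levels instead of dropping one,
# so every column keeps a unique name.
out.columns = out.columns.map('_'.join)
print(list(out.columns))
```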
I have a pandas groupby command which looks like this:
df.groupby(['year', 'month'], as_index=False).agg({'users':sum})
Is there a way I can name the agg output something other than 'users' during the groupby command? For example, what if I wanted the sum of users to be total_users? I could rename the column after the groupby is complete, but wonder if there is another way.
I like @Alexander's answer, but there is also add_prefix:
df.groupby(['year','month']).agg({'users':sum}).add_prefix('total_')
Per the docs:
If a dict is passed, the keys will be used to name the columns.
Otherwise the function’s name (stored in the function object) will be
used.
In [58]: grouped['D'].agg({'result1': np.sum, 'result2': np.mean})
In your case:
df.groupby(['year', 'month'], as_index=False).users.agg({'total_users': np.sum})
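Note that this dict-based renaming inside agg was deprecated and later removed in newer pandas versions; since pandas 0.25 the same renaming can be done during the groupby with named aggregation. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021],
                   'month': [1, 1, 2],
                   'users': [10, 20, 30]})

# Named aggregation (pandas >= 0.25): the keyword becomes the output column name.
out = df.groupby(['year', 'month'], as_index=False).agg(total_users=('users', 'sum'))
print(out)
```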