I am running a Pearson correlation on my data set (imported from Excel), and this is the order the results come out in:
What I was wondering is whether it is possible to get n_hhld_trip as my first column, since it is my dependent variable.
Below is the code I have so far, but I am not sure how to make it reflect the change I want. I tried reordering the variables in the pivot_table call, but that didn't do it:
import numpy as np
import pandas as pd

# read_excel is the DataFrame loaded from the Excel file
zone_sum_mean_combo = pd.pivot_table(
    read_excel,
    index=['Zone'],
    aggfunc={'Household ID': np.mean, 'dwtype': np.mean, 'n_hhld_trip': np.sum,
             'expf': np.mean, 'n_emp_ft': np.sum, 'n_emp_home': np.sum,
             'n_emp_pt': np.sum, 'n_lic': np.sum, 'n_pers': np.sum,
             'n_student': np.sum, 'n_veh': np.sum}
)
index_reset = zone_sum_mean_combo.reset_index()
print(index_reset)
pearson_correlation = index_reset.corr(method='pearson')
print(pearson_correlation)
Sometimes it can be easier to hardcode the column order after everything is done:
df = df[["my_first_column", "my_second_column"]]
In your case, I think it's easier to just manipulate them:
columns = list(df.columns)
columns.remove("n_hhld_trip")
columns.insert(0, "n_hhld_trip")
df = df[columns]
Try set_index followed by reset_index; reset_index inserts the popped level back as the first column:
df.set_index('n_hhld_trip', append=True).reset_index(level=-1)
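As a quick check of that trick, here is a minimal sketch on a toy frame (the column names merely imitate the question's data):

```python
import pandas as pd

# Toy frame standing in for index_reset; values are invented.
df = pd.DataFrame({'Zone': [1, 2], 'expf': [0.5, 0.7], 'n_hhld_trip': [10, 20]})

# Append the column to the index, then pop it back out:
# reset_index re-inserts the popped level at position 0.
df = df.set_index('n_hhld_trip', append=True).reset_index(level=-1)

print(list(df.columns))  # n_hhld_trip is now the first column
```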
I imported a dataset into my python script and took the correlation. This is the code for correlation:
data = pd.read_excel('RQ_ID_Grouping.xlsx' , 'Sheet1')
corr = data.corr()
After the correlation the data looks like this:
I want to convert the data into below format:
I am using this code to achieve the above, but it doesn't seem to be working:
corr1 = (corr.melt(var_name='X', value_name='Y').groupby('X')['Y'].reset_index(name='Corr_Value'))
I know there should be something after the groupby part, but I don't know what. If you could help me, I would greatly appreciate it.
Use DataFrame.stack to reshape (it drops missing values by default), convert the MultiIndex to columns with DataFrame.reset_index, and finally set the column names:
df = corr.stack().reset_index()
df.columns = ['X','Y','Corr_Value']
Another solution with DataFrame.rename_axis:
df = corr.stack().rename_axis(('X','Y')).reset_index(name='Corr_Value')
And your solution with melt is also possible:
df = (corr.rename_axis('X')
.reset_index()
.melt('X', var_name='Y', value_name='Corr_Value')
.dropna()
.sort_values(['X','Y'])
.reset_index(drop=True))
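For a self-contained check of the stack approach, here is a minimal sketch on a toy frame (the column names are made up, not from RQ_ID_Grouping.xlsx):

```python
import pandas as pd

# Small frame standing in for the imported data set.
data = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [3, 1, 2]})
corr = data.corr()

# Wide correlation matrix -> long (X, Y, Corr_Value) table.
df = corr.stack().rename_axis(('X', 'Y')).reset_index(name='Corr_Value')

print(df.head())
print(list(df.columns))  # ['X', 'Y', 'Corr_Value']
```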
Quick question:
I have the following situation (table):
Imported data frame
Now what I would like to achieve is the following (or something along those lines; it does not have to be exactly that):
Goal
I do not want the following columns, so I drop them:
data.drop(data.columns[[0,5,6]], axis=1,inplace=True)
I assumed that the following line of code could solve it, but I am missing something:
pivoted = data.pivot(index=["Intentional homicides and other crimes","Unnamed: 2"],columns='Unnamed: 3', values='Unnamed: 4')
produces
ValueError: Length of passed values is 3395, index implies 2
The difference from the linked question is that I do not want any aggregation function; I just want to leave the values as they are.
Data can be found at: Data
The problem with the method pandas.DataFrame.pivot is that it does not handle duplicate values in the index. One way to solve this is to use the function pandas.pivot_table instead.
df = pd.read_csv('Crimes_UN_data.csv', skiprows=[0], encoding='latin1')
cols = list(df.columns)
cols[1] = 'Region'
df.columns = cols
pivoted = pd.pivot_table(df, values='Value', index=['Region', 'Year'], columns='Series', aggfunc=sum)
It should not sum anything despite the aggfunc argument, but without that argument it was throwing pandas.core.base.DataError: No numeric types to aggregate.
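To see the difference between the two methods, here is a minimal sketch on toy data with a duplicated (Region, Year) pair (the values are invented, not from the UN file):

```python
import pandas as pd

# Toy data with a duplicate (Region, Year) pair for the same Series.
df = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Asia', 'Asia'],
    'Year':   [2010, 2010, 2010, 2011],
    'Series': ['Homicide', 'Homicide', 'Homicide', 'Homicide'],
    'Value':  [5.0, 6.0, 2.0, 3.0],
})

# pivot raises on the duplicated ('Africa', 2010) entry...
try:
    df.pivot(index='Region', columns='Series', values='Value')
except ValueError as exc:
    print('pivot failed:', exc)

# ...while pivot_table aggregates the duplicates instead.
pivoted = pd.pivot_table(df, values='Value', index=['Region', 'Year'],
                         columns='Series', aggfunc='sum')
print(pivoted)
```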
frame = frame2.groupby(['name1', 'name2', 'date', 'agID', 'figi', 'exch',
                        'marketSector', 'name', 'fx_currency', 'id_type',
                        'id', 'currency']).agg(
    {'call_agreed_amount': 'sum',
     'pledge_current_market_value': 'sum',
     'pledge_quantity': 'sum',
     'pledge_adjusted_collateral_value': 'sum',
     'count': 'count'})
print(frame.head())
for value in frame['call_currency']:
doStuff()
In the code above, all columns exist before the groupby statement. After the groupby statement is executed, frame.head() returns all of the same columns. My code fails at the for loop with a KeyError when accessing frame['call_currency'], which definitely exists in frame.
After troubleshooting, I realized that pandas' groupby followed by agg returns a DataFrame whose index is built from the grouping columns (a hierarchical index when grouping by several). The grouped columns become the index rather than regular columns. To fix this, I added .reset_index() to the end of my groupby statement.
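A minimal reproduction of that behaviour, using a cut-down frame with invented values:

```python
import pandas as pd

# Grouping keys move into the index after agg.
frame2 = pd.DataFrame({'name1': ['a', 'a', 'b'],
                       'call_agreed_amount': [1, 2, 3]})

grouped = frame2.groupby(['name1']).agg({'call_agreed_amount': 'sum'})
print('name1' in grouped.columns)   # False: it is now the index

# reset_index moves the keys back into ordinary columns.
fixed = grouped.reset_index()
print('name1' in fixed.columns)     # True again
```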
My question is similar to this one; however, I do need to rename columns, because I aggregate my data using custom functions:
def series(x):
return ','.join(str(item) for item in x)
agg = {
'revenue': ['sum', series],
'roi': ['sum', series],
}
df.groupby('name').agg(agg)
As a result I have groups of identically named columns:
which become completely indistinguishable after I drop the higher column level:
df.columns = df.columns.droplevel(0)
So, how do I go about keeping unique names for my columns?
Use map to flatten the column names:
df.columns = df.columns.map('_'.join)
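Putting the pieces together, a small self-contained sketch with invented sample values:

```python
import pandas as pd

def series(x):
    # Same helper as above: join the group's values into one string.
    return ','.join(str(item) for item in x)

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'revenue': [1, 2, 3],
                   'roi': [0.1, 0.2, 0.3]})

out = df.groupby('name').agg({'revenue': ['sum', series], 'roi': ['sum', series]})

# Join the two column levels instead of dropping one,
# so every column keeps a unique name.
out.columns = out.columns.map('_'.join)
print(list(out.columns))
```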
I have a pandas groupby command which looks like this:
df.groupby(['year', 'month'], as_index=False).agg({'users':sum})
Is there a way I can name the agg output something other than 'users' during the groupby command? For example, what if I wanted the sum of users to be total_users? I could rename the column after the groupby is complete, but wonder if there is another way.
I like @Alexander's answer, but there is also add_prefix:
df.groupby(['year','month']).agg({'users':sum}).add_prefix('total_')
Per the docs:
If a dict is passed, the keys will be used to name the columns.
Otherwise the function’s name (stored in the function object) will be
used.
In [58]: grouped['D'].agg({'result1': np.sum, 'result2': np.mean})
In your case:
df.groupby(['year', 'month'], as_index=False).users.agg({'total_users': np.sum})
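Note that this dict-based renaming inside agg was deprecated and later removed in newer pandas versions; since pandas 0.25 the same renaming can be done during the groupby with named aggregation. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021],
                   'month': [1, 1, 2],
                   'users': [10, 20, 30]})

# Named aggregation (pandas >= 0.25): the keyword becomes the output column name.
out = df.groupby(['year', 'month'], as_index=False).agg(total_users=('users', 'sum'))
print(out)
```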