I have the following dataframe (this is a sample there are many rows)
Student ID avg
0 205842 68.333333
1 280642 74.166667
I want to sort by decreasing average percentage grade and, if grades are equal, by increasing Student ID.
I have been able to sort by one column, as below, but I'm unsure how to sort by two.
df_pct_scores.sort_values(by='avg', ascending=False)
Please see if this works:
df_pct_scores.sort_values(by=['avg', 'Student ID'], ascending=[False, True])
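A quick runnable sketch of this two-key sort, using the two sample rows plus a hypothetical third row so the tie-break is visible:

import pandas as pd

# The two sample rows from the question plus a hypothetical tied row.
df_pct_scores = pd.DataFrame({
    'Student ID': [205842, 280642, 150001],
    'avg': [68.333333, 74.166667, 74.166667],
})

# Descending average first, then ascending Student ID to break ties.
result = df_pct_scores.sort_values(by=['avg', 'Student ID'],
                                   ascending=[False, True])
print(result)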
This is the code I wrote, but the output is too big (over 6000 rows). How do I get the first result for each year?
df_year = df.groupby('release_year')['genres'].value_counts()
Let's start with a small correction concerning the variable name:
value_counts returns a Series (not a DataFrame), so you should
not use a name starting with df.
Assume that the variable holding this Series is gen.
Then one possible solution is:
result = gen.groupby(level=0).apply(
    lambda grp: grp.droplevel(0).sort_values(ascending=False).head(1))
Initially you wrote that you wanted the most popular genre in each year,
so I sorted each group in descending order and returned the first
row from the current group.
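If the counts are already in descending order within each year (the default behaviour of value_counts), a shorter alternative is to take the first row of each group directly. A minimal sketch with hypothetical sample data:

import pandas as pd

# Hypothetical data mirroring the question's column names.
df = pd.DataFrame({'release_year': [1999, 1999, 1999, 2000, 2000],
                   'genres': ['Drama', 'Drama', 'Comedy', 'Action', 'Drama']})

gen = df.groupby('release_year')['genres'].value_counts()

# value_counts sorts counts in descending order within each year,
# so the first row of each group is the most frequent genre.
top_per_year = gen.groupby(level=0).head(1)
print(top_per_year)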
I have a dataframe which is similar to this
d1 = pd.DataFrame({'name':['xyz','abc','dfg'],
'age':[15,34,22],
'sex':['s1','s2','s3'],
'w-1(6)':[96,66,74],
'w-2(5)':[55,86,99],
'w-3(4)':[11,66,44]})
Note that in my original DataFrame the week columns w-1(6), w-2(5), and w-3(4) are generated dynamically and change every week. I want to sort by all three week columns in descending order of their values.
But the names of the columns cannot be used as they change every week.
Is there any possible way to achieve this?
Edit: The numbers might not always be present for all three weeks; if w-1 has no data, I won't have that column in the dataset at all, which means only two week columns instead of three.
You can use the column indices.
d1.sort_values(by=[d1.columns[3], d1.columns[4], d1.columns[5]], ascending=False)
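Since the number of week columns can vary (per the edit), it may be safer to select them by prefix rather than by position; a minimal sketch, assuming the week columns always start with 'w-':

# Select whatever week columns are present by their 'w-' prefix,
# so this works whether there are two or three of them.
week_cols = [c for c in d1.columns if c.startswith('w-')]
d1_sorted = d1.sort_values(by=week_cols, ascending=False)
print(d1_sorted)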
The problem is that I am trying to take a specific row I choose and calculate what percentage its value is away from the intended output's mean (which is already calculated from another column), i.e. how far it deviates from that mean.
I want to run each item individually like so:
Below, I made a dataframe column to store the result:
df['pct difference'] = ((df['tertiary_tag']['price'] - df['ab roller']['mean'])/df['ab roller']['mean']) * 100
For example, if the mean is 10 and I know the item is 8 dollars, figure out what percentage away from the mean that product is, and return that number for each item of the dataset.
Keep in mind, I don't want to solve this with a loop; I am sure pandas has something more practical for calculating the % difference than pct_change.
I also thought I could set some column as an index, use that index to access any specific row, and from there do whatever operation I want, for example calculating the percentage difference between two rows.
Maybe by indexing on the price column?
df = df.set_index(['price'])
df.index = pd.to_datetime(df.index)
def percent_diff(df, row1, row2):
    """
    Calculating the percentage difference between two specific rows in a dataframe
    """
    return (df.loc[row1, 'value'] - df.loc[row2, 'value']) / df.loc[row2, 'value'] * 100
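For the per-item case described above, a vectorized column operation would avoid a loop entirely; a minimal sketch, assuming a hypothetical 'price' column:

import pandas as pd

# Hypothetical data; only the 'price' column name is assumed.
df = pd.DataFrame({'price': [8, 10, 12]})

# Percentage deviation of each item's price from the column mean.
mean_price = df['price'].mean()
df['pct difference'] = (df['price'] - mean_price) / mean_price * 100
print(df)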
Here's my data: https://docs.google.com/spreadsheets/d/1Nyvx2GXUFLxrJdRTIKNAqIVvGP7FyiQ9NrjKiHoX3kE/edit?usp=sharing
It's a small part of a dataset with 100s of order_ids.
I want to find the duration in the #timestamp column with respect to order_id. For example, for order_id 3300400, the duration will be from index 6 to index 0. Similarly for all other order_ids.
I also want the sum of items.quantity and items.price with respect to order_id. For example, for order_id 3300400, the sum of items.quantity = 2 and the sum of items.price = 499 + 549 = 1048. Similarly for other order_ids.
I am new to Python, but I think this will need loops. Any help will be highly appreciated.
Thanks and Regards,
Shantanu Jain
You have figured out how to use the groupby() method, which is good. Working out the difference in timestamps takes a little more work.
# Function to get first and last stamps within group
def get_index(df):
    return df.iloc[[0, -1]]

# apply function and then use diff method on ['#timestamp']
df['time_diff'] = df.groupby('order_id').apply(get_index)['#timestamp'].diff()
I haven't tested any of this code, and it will only work if your timestamps are pandas Timestamps. It should at least give you an idea of where to start.
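As an alternative sketch (untested against the linked sheet; the column names #timestamp, items.quantity, and items.price are taken from the question), everything can be aggregated per order_id in one pass:

import pandas as pd

# Assumes df has columns: order_id, #timestamp, items.quantity, items.price.
df['#timestamp'] = pd.to_datetime(df['#timestamp'])

summary = df.groupby('order_id').agg(
    duration=('#timestamp', lambda s: s.max() - s.min()),
    total_quantity=('items.quantity', 'sum'),
    total_price=('items.price', 'sum'))
print(summary)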
I have a two-column dataframe named limitData, where the first column is CcyPair and the second is the trade notional:
CcyPair,TradeNotional
USDCAD,1000000
USDCAD,7600
USDCAD,40000
GBPUSD,100000
GBPUSD,345000
etc
with a large number of CcyPairs and many TradeNotionals per CcyPair. From here I generate summary statistics as follows:
limitDataStats = limitData.groupby(['CcyPair']).describe()
This is easy enough. However, I would like to add a column to limitDataStats that contains the count of TradeNotionals greater than that CcyPair's 75% value determined by .describe(). I've searched a great deal and tried a number of variations but can't figure it out. I think it should be something along the lines of the code below (I thought I could reference the index of the groupby, but that gives me the actual integer index):
limitData.groupby(['CcyPair'])['TradeNotional'].apply(lambda x: x[x > limitDataStats.loc[x.index, '75%']].count())
Any ideas? Thanks, Colin
You can compare each value to its group's 75th percentile and count how many are greater than or equal to it (use .sum(), since ge() returns a boolean Series):
limitData.groupby('CcyPair')['TradeNotional'].apply(
    lambda x: x.ge(x.quantile(.75)).sum())
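To attach those counts to the describe() output, one possible follow-up (assuming the notional column is TradeNotional, as in the sample data) is:

# Count of notionals at or above each pair's 75th percentile.
counts = limitData.groupby('CcyPair')['TradeNotional'].apply(
    lambda x: x.ge(x.quantile(.75)).sum())

# limitDataStats from .describe() has MultiIndex columns, so the new
# column goes under the 'TradeNotional' level.
limitDataStats[('TradeNotional', 'count_above_75pct')] = counts
print(limitDataStats)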