I have a quick question about sorting rows in a CSV file using Pandas. The CSV file I have contains data that looks like:
quarter week Value
5 1 200
3 2 100
2 1 50
2 2 125
4 2 175
2 3 195
3 1 10
5 2 190
I need to sort it in the following way: by quarter, and within each quarter by week. So the output should look like this:
quarter week Value
2 1 50
2 2 125
2 3 195
3 1 10
3 2 100
4 2 175
5 1 200
5 2 190
My attempt:
df = df.sort('quarter', 'week')
But this does not produce the correct result. Any help/suggestions?
Thanks!
New answer, as of 14 March 2019
df.sort_values(by=["COLUMN"], ascending=False)
This returns a new sorted DataFrame; it doesn't update the original one.
Note: you can change the ascending parameter according to your needs; if you don't pass it, it defaults to ascending=True.
Note: sort has been deprecated in favour of sort_values, which you should use in Pandas 0.17+.
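Applied to the question's data specifically, a sketch passing both columns to sort_values:
df = df.sort_values(['quarter', 'week'])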
Typing help(df.sort) gives:
sort(self, columns=None, column=None, axis=0, ascending=True, inplace=False) method of pandas.core.frame.DataFrame instance
Sort DataFrame either by labels (along either axis) or by the values in
column(s)
Parameters
----------
columns : object
Column name(s) in frame. Accepts a column name or a list or tuple
for a nested sort.
[...]
Examples
--------
>>> result = df.sort(['A', 'B'], ascending=[1, 0])
[...]
and so you pass the columns you want to sort by as a list:
>>> df
quarter week Value
0 5 1 200
1 3 2 100
2 2 1 50
3 2 2 125
4 4 2 175
5 2 3 195
6 3 1 10
7 5 2 190
>>> df.sort(["quarter", "week"])
quarter week Value
2 2 1 50
3 2 2 125
5 2 3 195
6 3 1 10
1 3 2 100
4 4 2 175
0 5 1 200
7 5 2 190
Note: in current pandas versions df.sort has been removed entirely, so the call above raises AttributeError: 'DataFrame' object has no attribute 'sort'; use sort_values instead.
I have a dataframe that looks something like this:
Individual Category Amount Extras
A 1 250 30
A 1 300 10
A 1 500 8
A 2 350 12
B 1 200 9
B 2 300 20
B 2 450 15
I want to get a dataframe that looks like this:
Individual Category Count Amount Extras
A 1 3 1050 48
A 2 1 350 12
B 1 1 200 9
B 2 2 750 35
I know that you can use groupby with Pandas, but is it possible to group using count and sum simultaneously?
You could try as follows:
output_df = df.groupby(['Individual', 'Category']).agg(
    Count=('Individual', 'count'),
    Amount=('Amount', 'sum'),
    Extras=('Extras', 'sum')).reset_index(drop=False)
print(output_df)
Individual Category Count Amount Extras
0 A 1 3 1050 48
1 A 2 1 350 12
2 B 1 1 200 9
3 B 2 2 750 35
So we use df.groupby and then apply named aggregation, allowing us to "[name] output columns when applying multiple aggregation functions to specific columns".
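If you'd rather not rely on named aggregation (available from pandas 0.25), a sketch of an equivalent approach uses size() for the count:
g = df.groupby(['Individual', 'Category'])
output_df = g[['Amount', 'Extras']].sum()   # per-group sums
output_df.insert(0, 'Count', g.size())      # per-group row counts, index-aligned
output_df = output_df.reset_index()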
I have the following DataFrame dt:
a
0 1
1 2
2 3
3 4
4 5
How do I create a new column where each row is a function of previous rows?
For instance, say the formula is:
B_row(t) = A_row(t-1)+A_row(t-2)+3
Such that:
a b
0 1 /
1 2 /
2 3 6
3 4 8
4 5 10
Also, I hear a lot that we mustn't loop through rows in Pandas; however, it seems to me that I should approach this by looping through each row in a sort of recursive loop, as I would in regular Python.
You could use shift, since b only references earlier rows of a:
dt['b'] = dt['a'].shift(1) + dt['a'].shift(2) + 3
Output:
a b
0 1 NaN
1 2 NaN
2 3 6.0
3 4 8.0
4 5 10.0
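No loop is needed above because b only depends on earlier rows of a. If the formula instead referenced b itself (a true recurrence), shift would no longer apply and a plain Python loop is a reasonable fallback; a minimal sketch for a hypothetical recurrence:
# Hypothetical example: b(t) = b(t-1) + a(t), not the question's formula.
b = [dt['a'].iloc[0]]
for value in dt['a'].iloc[1:]:
    b.append(b[-1] + value)
dt['c'] = b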
Hi, I'm trying to create a new column in my dataframe, and I want the values to be based on a calculation: each Student's share of the Score within their Class. There are two different students with the same name in different classes, which is why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help? Thanks.
The problem is that a groupby aggregation is indexed by the unique values of the grouping columns, so its index is incompatible with your original DataFrame's index. Your expression does compute each student's share of the class score, but it produces a differently-indexed object; resetting the index turns it into its own DataFrame with each student's share score.
That is how I understood your problem:
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
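If you want the share attached to the original rows of df instead (which also avoids the "incompatible index" error), a transform-based sketch keeps the result aligned with df's index:
df['share'] = (
    df.groupby(['Class', 'Student'])['Score'].transform('sum')
    / df.groupby('Class')['Score'].transform('sum')
)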
If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class')['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
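Note that the behaviour of groupby(...).apply has shifted across pandas versions; if the apply call gives you index trouble, an equivalent sketch divides by a per-class transform instead:
df['Score_Share'] = df['Score'] / df.groupby('Class')['Score'].transform('sum')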
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)
Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How could I then take those median values and expand them out into a new column of the original df, associated with their respective conditions? This will produce duplicates, but I will use this column in a subsequent calculation, and having the medians in a column makes that possible.
Example data:
import numpy as np
import pandas as pd

data = {'idx': [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
        'condition1': [1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
        'condition2': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
        'values': np.random.normal(0, 1, 16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
Example of the desired result (note the duplicates corresponding to the correct conditions):
idx condition1 condition2 values medians
0 1 1 1 0.35031 0.656355
1 1 1 2 -0.291736 -0.024304
2 1 2 1 1.593545 0.656355
3 1 2 2 -1.275154 -0.024304
4 1 3 1 0.075259 0.656355
5 1 3 2 1.054481 -0.024304
6 1 4 1 0.9624 0.656355
7 1 4 2 0.243128 -0.024304
8 2 1 1 1.717391 1.155406
9 2 1 2 0.788847 1.006583
10 2 2 1 1.145891 1.155406
11 2 2 2 -0.492063 1.006583
12 2 3 1 -0.157029 1.155406
13 2 3 2 1.224319 1.006583
14 2 4 1 1.164921 1.155406
15 2 4 2 2.042239 1.006583
I believe you need GroupBy.transform with 'median' for the new column; unlike an aggregation, transform returns a result aligned with the original frame's index:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
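Alternatively, since dfg was already computed with as_index=False, a merge-based sketch attaches the medians under a new column name:
df = df.merge(dfg.rename(columns={'values': 'medians'}), on=['idx', 'condition2'])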
I have the following dataframe df1.
import pandas as pd
df1=pd.DataFrame([[1,11,'mx212', 1000], [1,11,'rx321', 600],
[1,11,'/bc1', 5],[1,11,'/bc2', 11], [1,12,'sx234', 800],
[1,12,'mx456', 1232], [3,13,'mx322', 1000], [3,13,'/bc3', 34]],
columns=["sale","order","code","amt"])
sale order code amt
0 1 11 mx212 1000
1 1 11 rx321 600
2 1 11 /bc1 5
3 1 11 /bc2 11
4 1 12 sx234 800
5 1 12 mx456 1232
6 3 13 mx322 1000
7 3 13 /bc3 34
Here, a salesperson can have multiple orders, and each order can have multiple codes. I want to aggregate and transform amt based on specific combinations of sale, order and code. A code starting with "/bc" needs to be aggregated with a "main" code value (one starting with values like 'mx', 'rx', etc.). Note that any code value not starting with "/bc" is considered type "main". If an order has multiple combinations of "/bc" and "main" code values, the aggregation of amt should be done per combination (e.g. the first four rows contain two such "main"/"/bc" combinations). Note that a given order always has equal numbers of "/bc" and "main" codes. Once the aggregation for an order is done, I want the "/bc" rows to be dropped.
If a particular sale and order has no "/bc" code, the amt values should stay the same. For example, the sx234 and mx456 rows should be unchanged, with their code and amt values kept as they are.
The resulting dataframe df2 should ideally be this:
sale order code amt
0 1 11 mx212 1005
1 1 11 rx321 611
2 1 12 sx234 800
3 1 12 mx456 1232
4 3 13 mx322 1034
The amt value in the first row is 1000+5 and in the second row is 600+11 (each "/bc" amount is added to its respective "main" code). The amt values in the sx234 and mx456 rows remain the same, and in the last row it is 1000+34.
I know this is a lot of information, but I tried to be as coherent as possible. If there are any questions, please comment; I will appreciate it. Any kind of help is always welcome :)
You could do it like this:
g = df1.groupby(['sale', 'order', df1.code.str.startswith('/bc')]).cumcount()
df1.groupby(['sale', 'order', g], as_index=False)[['code', 'amt']]\
   .agg({'code': 'first', 'amt': 'sum'})
Output:
sale order code amt
0 1 11 mx212 1005
1 1 11 rx321 611
2 1 12 sx234 800
3 1 12 mx456 1232
4 3 13 mx322 1034
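The key is the helper g: cumcount numbers the "main" codes and the "/bc" codes separately within each (sale, order), so the nth main code lands in the same group as the nth "/bc" code:
print(g.tolist())
# [0, 1, 0, 1, 0, 1, 0, 0]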
Breaking down the steps: the key is building a helper column to determine the inner group.
import numpy as np

df1.code = df1.code.replace({'bc': np.nan}, regex=True)
df1['New'] = df1.code.isnull()
d1 = df1.groupby([df1.sale, df1.order,
                  df1.groupby(['sale', 'order', 'New']).cumcount()],
                 as_index=False).amt.sum()
pd.concat([d1, df1.dropna().code.reset_index(drop=True)], axis=1)
Output:
sale order amt code
0 1 11 1005 mx212
1 1 11 611 rx321
2 1 12 800 sx234
3 1 12 1232 mx456
4 3 13 1034 mx322