How to sort dataframe rows by multiple columns [duplicate] - python

This question already has answers here:
How to sort a dataFrame in python pandas by two or more columns?
(3 answers)
Closed last year.
I'm having trouble formatting a dataframe in a specific style. I want all data pertaining to one S/N clumped together. My ultimate goal with the dataset is to plot Dis vs Rate for all the S/Ns. I've tried iterating over rows to slice the data, but that hasn't worked. What would be the best (easiest) approach for this formatting? Thanks!
For example: S/N 332 has Dis 4.6 and Rate 91.2 in the first row. Immediately after that I want the row with S/N 332, Dis 9.19 and Rate 76.2, and so on for all rows with S/N 332.
S/N Dis Rate
0 332 4.6030 91.204062
1 445 5.4280 60.233917
2 999 4.6030 91.474156
3 332 9.1985 76.212943
4 445 9.7345 31.902842
5 999 9.1985 76.212943
6 332 14.4405 77.664282
7 445 14.6015 36.261851
8 999 14.4405 77.664282
9 332 20.2005 76.725955
10 445 19.8630 40.705467
11 999 20.2005 76.725955
12 332 25.4780 31.597510
13 445 24.9050 4.897008
14 999 25.4780 31.597510
15 332 30.6670 74.096975
16 445 30.0550 35.217889
17 999 30.6670 74.096975
Edit: I tried using sort as @Ian Kenney suggested, but that doesn't help because now the Dis values are no longer in ascending order:
0 332 4.6030 91.204062
15 332 30.6670 74.096975
3 332 9.1985 76.212943
6 332 14.4405 77.664282
9 332 20.2005 76.725955
12 332 25.4780 31.597510
1 445 5.4280 60.233917
4 445 9.7345 31.902842
7 445 14.6015 36.261851
16 445 30.0550 35.217889
10 445 19.8630 40.705467
13 445 24.9050 4.897008

Use sort_values, which can accept a list of sorting targets. In this case it sounds like you want to sort by S/N, then Dis, then Rate:
df = df.sort_values(['S/N', 'Dis', 'Rate'])
# S/N Dis Rate
# 0 332 4.6030 91.204062
# 3 332 9.1985 76.212943
# 6 332 14.4405 77.664282
# 9 332 20.2005 76.725955
# 12 332 25.4780 31.597510
# 15 332 30.6670 74.096975
# 1 445 5.4280 60.233917
# 4 445 9.7345 31.902842
# 7 445 14.6015 36.261851
# 10 445 19.8630 40.705467
# 13 445 24.9050 4.897008
# 16 445 30.0550 35.217889
# ...
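If you need a different direction for any of the columns, sort_values also accepts a list for ascending (one flag per column). A small sketch, assuming you wanted Rate in descending order within each group:
# keep S/N and Dis ascending, but sort Rate from highest to lowest
df = df.sort_values(['S/N', 'Dis', 'Rate'], ascending=[True, True, False])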

There are several ways to achieve this. Another option, in addition to the existing answer, is to sort in place:
df.sort_values(by=['S/N', 'Dis', 'Rate'], inplace=True)
df
Output:
S/N Dis Rate
0 332 4.6030 91.204062
3 332 9.1985 76.212943
6 332 14.4405 77.664282
9 332 20.2005 76.725955
12 332 25.4780 31.597510
15 332 30.6670 74.096975
1 445 5.4280 60.233917
4 445 9.7345 31.902842
7 445 14.6015 36.261851
10 445 19.8630 40.705467
13 445 24.9050 4.897008
16 445 30.0550 35.217889
2 999 4.6030 91.474156
5 999 9.1985 76.212943
8 999 14.4405 77.664282
11 999 20.2005 76.725955
14 999 25.4780 31.597510
17 999 30.6670 74.096975
Here, the inplace argument of sort_values makes the changes directly in the source dataframe, which eliminates the need to create another dataframe to store the sorted output.
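Note that sorting preserves the original row labels (0, 3, 6, ...). If you would rather have a fresh 0..n-1 index after sorting, you can reset it; a small sketch, assuming the dataframe has already been sorted:
# drop the old index and renumber the rows from 0
df.reset_index(drop=True, inplace=True)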

Related

Pandas - how to sort week and year numbers formatted as strings?

I have a pandas dataframe like this, which I sorted like this:
>>> weekly_count.sort_values(by='date_in_weeks', inplace=True)
>>> weekly_count.loc[:9,:]
date_in_weeks count
0 1-2013 362
1 1-2014 378
2 1-2015 201
3 1-2016 294
4 1-2017 300
5 1-2018 297
6 10-2013 329
7 10-2014 314
8 10-2015 324
9 10-2016 322
In the above data, every value in the first column, date_in_weeks, is simply "week number of the year - year". I now want to sort it like this:
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
How do I do this?
Use Series.argsort on the values converted to datetimes with format %W (week number of the year):
# append '-0' so %w has a weekday to parse, then reorder the rows by the resulting datetimes
df = df.iloc[pd.to_datetime(df['date_in_weeks'] + '-0', format='%W-%Y-%w').argsort()]
print (df)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can also convert to datetime, assign it back to the dataframe as a helper column, sort the values, then drop the extra column:
# note: format '%M-%Y' parses the week number into the minute field, which still sorts correctly because the year dominates
s = pd.to_datetime(df['date_in_weeks'], format='%M-%Y')
final = df.assign(dt=s).sort_values(['dt', 'count']).drop(columns='dt')
print(final)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can try using auxiliary columns:
import pandas as pd

df = pd.DataFrame({'date_in_weeks': ['1-2013', '1-2014', '1-2015', '10-2013', '10-2014'],
                   'count': [362, 378, 201, 329, 314]})
# split 'week-year' into its two parts
df['aux'] = df['date_in_weeks'].str.split('-')
df['aux_2'] = df['aux'].str.get(1).astype(int)  # year
df['aux'] = df['aux'].str.get(0).astype(int)    # week number
# sort by year first, then by week, and drop the helper columns
df = df.sort_values(['aux_2', 'aux'], ascending=True)
df = df.drop(columns=['aux', 'aux_2'])
print(df)
Output:
date_in_weeks count
0 1-2013 362
3 10-2013 329
1 1-2014 378
4 10-2014 314
2 1-2015 201

Dataframe structures in pandas/seaborn. Do I need to be careful about observations vs variables?

I'm learning how to use Python for data analysis, and I have my first few dataframes to work with, pulled from video games I play.
The dataframe I'm working with currently uses the header row for all the player names (8 players), and all the statistics are in the first column.
Is it better practice to have these positions reversed, i.e. should all the players be in the first column instead of the first row?
Arctic Shat Sly Snky Nanm Zax zack Sorn Cort
Statistics
Assists 470 415 388 182 212 92 40 5 4
Avg Damage Dealt 203.82 167.37 165.2 163.45 136.3 85.08 114.96 128.72 26.71
Boosts 1972 1807 1790 668 1392 471 103 7 33
Damage Dealt 236222.66 239680.08 164373.73 74696.195 99904.48 27991.652 13910.629 901.01385 1228.7041
Days 206 234 218 78 157 94 29 3 10
Head Shot Kills 395 307 219 119 130 29 12 0 0
Headshot % 26.37% 18.65% 18.96% 23.85% 19.58% 16.11% 17.14% 0% 0%
Heals 3139 4385 2516 1326 2007 749 382 15 78
K/D 1.36 1.2 1.22 1.13 0.95 0.58 0.59 0.57 0.07
Kills 1498 1646 1155 499 664 180 70 4 3
Longest Kill 461.77765 430.9177 410.534 292.18732 354.3065 287.72366 217.98175 110.25433 24.15225
Longest Time Survived 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Losses 1117 1376 959 448 709 320 119 7 46
Max Kill Streaks 4 4 4 3 4 3 3 1 1
Most Survival Time 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Revives 281 455 155 104 221 83 19 2 2
Ride Distance 1610093.4 2157408.8 1572710 486170.5 714986.3 524297 204585.53 156.07877 63669.613
Road Kills 1 4 5 4 0 0 0 0 0
Round Most Kills 9 8 9 7 9 6 5 2 1
Rounds Played 1159 1432 995 457 733 329 121 7 46
Suicides 16 42 14 6 10 4 4 0 2
Swim Distance 2830.028 4966.6914 2703.0044 1740.3292 2317.7866 1035.3792 395.86472 0 92.01848
Team Kills 22 47 23 9 15 4 5 0 2
Time Survived 969792.2 1284232.6 930141.94 328190.22 637273.3 284434.3 109724.04 4580.869 37748.414
Top10s 531 654 509 196 350 187 74 2 28
Vehicle Destroys 23 9 29 4 15 3 1 0 0
Walk Distance 1545281.6 1975185 1517812 505191 1039509.8 461860.53 170913.25 9665.322 63900.125
Weapons Acquired 5043 7226 4683 1551 2909 1514 433 23 204
Wins 55 63 48 17 32 19 3 0 3
dBNOs 1489 1575 1058 488 587 179 78 5 8
Yes, it is better to transpose.
The current best practice is to have one row for each observation (in your case, each player) and one column for each variable.
This is called "tidy data" (from the paper published by Hadley Wickham). Tidy data works more or less like a set of guidelines for data scientists, much like normalization rules for relational database people.
Also, most frameworks, programs and data structures are implemented with this organization in mind. For instance, with this data in pandas, if you wanted to check the average head shot kills, you would only need df['Head Shot Kills'].mean() (once the dataframe was transposed).
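As a minimal sketch of the reshaping, assuming the stats table is already loaded into a dataframe df with the statistics as the index and the players as the columns:
# transpose: players become rows, statistics become columns
tidy = df.T
tidy.index.name = 'Player'
# numeric statistics can then be aggregated column-wise
print(tidy['Head Shot Kills'].astype(float).mean())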

Read 4 lines of data into one row of pandas data frame

I have txt file with such values:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of a data frame:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex, but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f, sep=",| ", header=None)
But this takes , and (space) as separators, whereas I want it to take the newline as a separator.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
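A minimal sketch of that approach, assuming the files are all laid out like the one above and that paths is a hypothetical list of their file names:
import pandas as pd

paths = ['t1.txt', 't2.txt']  # hypothetical list of input files
rows = []
for path in paths:
    df = pd.read_csv(path, header=None)
    # flatten each file's block of rows into one long Series
    rows.append(pd.concat((df.loc[i] for i in df.index), ignore_index=True))
# build the final frame in one go: one row per file
result = pd.DataFrame(rows).reset_index(drop=True)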

Selectively remove deprecated rows in a pandas dataframe

I have a DataFrame containing data that looks like the sample below.
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, I think you want the following: perform a groupby on your columns of interest, take the max of column 'v', and then call reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
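Since the question mentions that the source data is ordered by time and that v only increases, an alternative sketch (not groupby-based, and relying on that ordering) is to keep only the last row for each (p, g, a, s) combination:
# keep the most recent, and therefore highest-v, row per p/g/a/s combination
df = df.drop_duplicates(subset=['p', 'g', 'a', 's'], keep='last')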

Python Pandas GroupBy().Sum() Having Clause

So I have this DataFrame with 3 columns: 'Order ID', 'Order Qty' and 'Fill Qty'.
I want to sum the Fill Qty per order and then compare it to Order Qty. Ideally I will return only a dataframe that gives me Order ID whenever the aggregate Fill Qty is greater than Order Qty.
In SQL I think what I'm looking for is
SELECT * FROM DataFrame GROUP BY Order ID, Order Qty HAVING sum(Fill Qty)>Order Qty
So far I have this:
SumFills= DataFrame.groupby(['Order ID','Order Qty']).sum()
Output:
                    Fill Qty
Order ID  Order Qty
1         300            300
2         80              40
3         20              20
4         110            220
5         100            200
6         100            200
The above is aggregated already. I would ideally like to return a list/array of [4, 5, 6], since those have sum(Fill Qty) > Order Qty.
View original dataframe:
In [57]: print(original_df)
Order Id Fill Qty Order Qty
0 1 419 334
1 2 392 152
2 3 167 469
3 4 470 359
4 5 447 441
5 6 154 190
6 7 365 432
7 8 209 181
8 9 140 136
9 10 112 358
10 11 384 302
11 12 307 376
12 13 119 237
13 14 147 342
14 15 279 197
15 16 280 137
16 17 148 381
17 18 313 498
18 19 193 328
19 20 291 193
20 21 100 357
21 22 161 286
22 23 453 168
23 24 349 283
Create and view new dataframe summing the Fill Qty:
In [58]: new_df = original_df.groupby(['Order Id','Order Qty'], as_index=False).sum()
In [59]: print(new_df)
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
2 3 469 167
3 4 359 470
4 5 441 447
5 6 190 154
6 7 432 365
7 8 181 209
8 9 136 140
9 10 358 112
10 11 302 384
11 12 376 307
12 13 237 119
13 14 342 147
14 15 197 279
15 16 137 280
16 17 381 148
17 18 498 313
18 19 328 193
19 20 193 291
20 21 357 100
21 22 286 161
22 23 168 453
23 24 283 349
Slice new dataframe to only those rows where Fill Qty > Order Qty:
In [60]: new_df = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'],:]
In [61]: print(new_df)
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
3 4 359 470
4 5 441 447
7 8 181 209
8 9 136 140
10 11 302 384
14 15 197 279
15 16 137 280
19 20 193 291
22 23 168 453
23 24 283 349
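If you only want the order ids as a list (the [4, 5, 6]-style result mentioned in the question), you can pull that column out of the filtered frame; a small sketch using the names above:
# extract just the matching Order Ids as a plain Python list
order_ids = new_df['Order Id'].tolist()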
