I have a pandas DataFrame, sorted like this:
>>> weekly_count.sort_values(by='date_in_weeks', inplace=True)
>>> weekly_count.loc[:9,:]
date_in_weeks count
0 1-2013 362
1 1-2014 378
2 1-2015 201
3 1-2016 294
4 1-2017 300
5 1-2018 297
6 10-2013 329
7 10-2014 314
8 10-2015 324
9 10-2016 322
In the above data, every value in the first column, date_in_weeks, is simply "week number of a year - year". I now want to sort it like this:
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
How do I do this?
Use Series.argsort on the column converted to datetimes with the %W format (week number of the year):
# append '-0' (Sunday) because %W only resolves to a date when a weekday (%w) is supplied
df = df.iloc[pd.to_datetime(df['date_in_weeks'] + '-0', format='%W-%Y-%w').argsort()]
print(df)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
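To see the trick in isolation: %W alone cannot be parsed into a concrete date, so a dummy weekday is appended before parsing and the resulting datetimes drive the sort. A minimal, self-contained sketch of the same approach:

```python
import pandas as pd

# Small unsorted sample in the question's "week-year" format.
df = pd.DataFrame({'date_in_weeks': ['1-2013', '1-2014', '10-2013', '10-2014'],
                   'count': [362, 378, 329, 314]})

# Appending '-0' supplies a weekday (Sunday), so '%W-%Y-%w'
# resolves each week string to a real date.
keys = pd.to_datetime(df['date_in_weeks'] + '-0', format='%W-%Y-%w')

# Reorder the rows by the positions that would sort the datetimes.
out = df.iloc[keys.argsort()]
print(out)  # 1-2013, 10-2013, 1-2014, 10-2014
```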
You can also convert to datetime, assign it to the df, sort the values, and drop the helper column (note that the format needs %W for week-of-year, and again a dummy weekday so it can be parsed):
s = pd.to_datetime(df['date_in_weeks'] + '-0', format='%W-%Y-%w')
final = df.assign(dt=s).sort_values(['dt', 'count']).drop(columns='dt')
print(final)
print(final)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can try using auxiliary columns:
import pandas as pd
df = pd.DataFrame({'date_in_weeks': ['1-2013', '1-2014', '1-2015', '10-2013', '10-2014'],
                   'count': [362, 378, 201, 329, 314]})
df['aux'] = df['date_in_weeks'].str.split('-')
df['aux_2'] = df['aux'].str.get(1).astype(int)  # year
df['aux'] = df['aux'].str.get(0).astype(int)    # week number
df = df.sort_values(['aux_2', 'aux'], ascending=True)  # sort by year, then week
df = df.drop(columns=['aux','aux_2'])
print(df)
Output:
date_in_weeks count
0 1-2013 362
3 10-2013 329
1 1-2014 378
4 10-2014 314
2 1-2015 201
I'm learning how to use Python for data analysis, and I have my first few DataFrames to work with, pulled from video games I play.
The DataFrame I'm working with currently uses the header row for all the player names (8 players), and all the statistics are in the first column.
Is it better practice to have these positions reversed, i.e. should all the players be in the first column instead of the first row?
Arctic Shat Sly Snky Nanm Zax zack Sorn Cort
Statistics
Assists 470 415 388 182 212 92 40 5 4
Avg Damage Dealt 203.82 167.37 165.2 163.45 136.3 85.08 114.96 128.72 26.71
Boosts 1972 1807 1790 668 1392 471 103 7 33
Damage Dealt 236222.66 239680.08 164373.73 74696.195 99904.48 27991.652 13910.629 901.01385 1228.7041
Days 206 234 218 78 157 94 29 3 10
Head Shot Kills 395 307 219 119 130 29 12 0 0
Headshot % 26.37% 18.65% 18.96% 23.85% 19.58% 16.11% 17.14% 0% 0%
Heals 3139 4385 2516 1326 2007 749 382 15 78
K/D 1.36 1.2 1.22 1.13 0.95 0.58 0.59 0.57 0.07
Kills 1498 1646 1155 499 664 180 70 4 3
Longest Kill 461.77765 430.9177 410.534 292.18732 354.3065 287.72366 217.98175 110.25433 24.15225
Longest Time Survived 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Losses 1117 1376 959 448 709 320 119 7 46
Max Kill Streaks 4 4 4 3 4 3 3 1 1
Most Survival Time 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Revives 281 455 155 104 221 83 19 2 2
Ride Distance 1610093.4 2157408.8 1572710 486170.5 714986.3 524297 204585.53 156.07877 63669.613
Road Kills 1 4 5 4 0 0 0 0 0
Round Most Kills 9 8 9 7 9 6 5 2 1
Rounds Played 1159 1432 995 457 733 329 121 7 46
Suicides 16 42 14 6 10 4 4 0 2
Swim Distance 2830.028 4966.6914 2703.0044 1740.3292 2317.7866 1035.3792 395.86472 0 92.01848
Team Kills 22 47 23 9 15 4 5 0 2
Time Survived 969792.2 1284232.6 930141.94 328190.22 637273.3 284434.3 109724.04 4580.869 37748.414
Top10s 531 654 509 196 350 187 74 2 28
Vehicle Destroys 23 9 29 4 15 3 1 0 0
Walk Distance 1545281.6 1975185 1517812 505191 1039509.8 461860.53 170913.25 9665.322 63900.125
Weapons Acquired 5043 7226 4683 1551 2909 1514 433 23 204
Wins 55 63 48 17 32 19 3 0 3
dBNOs 1489 1575 1058 488 587 179 78 5 8
Yes, it is better to transpose.
The current best practice is to have one row (instance) for each observation — in your case, each player — and one column (feature) for each variable.
This is called "tidy data" (from the paper published by Hadley Wickham). Tidy data works more or less like a set of guidelines for us data scientists, much like normalization rules for relational-database people.
Also, most frameworks/programs/data structures are implemented with this organization in mind. For instance, with a pandas DataFrame holding your data, if you wanted to check the average head-shot kills you would only need df['Head Shot Kills'].mean() — if it were transposed.
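A minimal sketch of the transpose, using made-up numbers for two of the players (the values here are illustrative, not the asker's real stats):

```python
import pandas as pd

# The asker's current layout: statistics as the index, players as columns.
df = pd.DataFrame(
    {'Arctic': [470, 1498], 'Shat': [415, 1646]},
    index=['Assists', 'Kills'],
)

# One transpose gives the tidy layout: one row per player,
# one column per statistic.
tidy = df.T
print(tidy['Kills'].mean())  # average kills across players
```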
I have a txt file with values like these:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of data frame.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex, but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f, sep=",| ", header=None)
But this takes , and space as separators, whereas I want it to also treat newlines as separators.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
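Putting the pieces together, the many-files case might look like the sketch below. Here two in-memory buffers stand in for the files so the snippet is self-contained; in practice you would pass your actual file paths to read_csv.

```python
import io
import pandas as pd

# io.StringIO objects simulate two of the asker's files;
# in practice, iterate over real file paths instead.
files = [io.StringIO('108,612,620,900\n168,960,680,1248\n'),
         io.StringIO('312,264,768,564\n516,1332,888,1596\n')]

rows = []
for f in files:
    df = pd.read_csv(f, header=None)
    # Flatten the file's values into one Series, row by row.
    rows.append(pd.concat((df.loc[i] for i in df.index), ignore_index=True))

# Build the final DataFrame once at the end: one row per file.
final = pd.DataFrame(rows).reset_index(drop=True)
print(final)
```

Collecting the Series in a list and constructing the DataFrame once avoids the cost of repeatedly appending rows to a growing DataFrame.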
I have a Dataframe containing data that looks like below.
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, you want the following: perform a groupby on your columns of interest, take the max of column 'v', and then call reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
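Applied to just the three-row example pulled out in the question, the same groupby/max collapses the group to the one row with the highest v:

```python
import pandas as pd

# The three-row example from the question: same p, g, a, s; increasing v.
df = pd.DataFrame({'p': [15, 15, 15],
                   'g': [195, 195, 195],
                   'a': [1765, 1765, 1765],
                   's': [11, 11, 11],
                   'v': [7, 8, 9]})

# Keep only the highest v per (p, g, a, s) group.
out = df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
print(out)  # a single row with v == 9
```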
So I have this DataFrame with 3 columns: 'Order ID', 'Order Qty' and 'Fill Qty'.
I want to sum the Fill Qty per order and then compare it to Order Qty. Ideally I will return a DataFrame containing only the Order IDs whose aggregate Fill Qty is greater than the Order Qty.
In SQL I think what I'm looking for is
SELECT * FROM DataFrame GROUP BY Order ID, Order Qty HAVING sum(Fill Qty)>Order Qty
So far I have this:
SumFills= DataFrame.groupby(['Order ID','Order Qty']).sum()
output:
                     Fill Qty
Order ID  Order Qty
1         300             300
2         80               40
3         20               20
4         110             220
5         100             200
6         100             200
The above is already aggregated; ideally I would like to return a list/array of [4, 5, 6], since those have sum(Fill Qty) > Order Qty.
View original dataframe:
In [57]: print original_df
Order Id Fill Qty Order Qty
0 1 419 334
1 2 392 152
2 3 167 469
3 4 470 359
4 5 447 441
5 6 154 190
6 7 365 432
7 8 209 181
8 9 140 136
9 10 112 358
10 11 384 302
11 12 307 376
12 13 119 237
13 14 147 342
14 15 279 197
15 16 280 137
16 17 148 381
17 18 313 498
18 19 193 328
19 20 291 193
20 21 100 357
21 22 161 286
22 23 453 168
23 24 349 283
Create and view new dataframe summing the Fill Qty:
In [58]: new_df = original_df.groupby(['Order Id','Order Qty'], as_index=False).sum()
In [59]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
2 3 469 167
3 4 359 470
4 5 441 447
5 6 190 154
6 7 432 365
7 8 181 209
8 9 136 140
9 10 358 112
10 11 302 384
11 12 376 307
12 13 237 119
13 14 342 147
14 15 197 279
15 16 137 280
16 17 381 148
17 18 498 313
18 19 328 193
19 20 193 291
20 21 357 100
21 22 286 161
22 23 168 453
23 24 283 349
Slice new dataframe to only those rows where Fill Qty > Order Qty:
In [60]: new_df = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'],:]
In [61]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
3 4 359 470
4 5 441 447
7 8 181 209
8 9 136 140
10 11 302 384
14 15 197 279
15 16 137 280
19 20 193 291
22 23 168 453
23 24 283 349
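Since the asker ultimately wanted a list of order IDs (like [4, 5, 6] for their aggregated data), one more step extracts it from the filtered frame. A sketch with a small made-up frame that mirrors the asker's aggregated output:

```python
import pandas as pd

# Hypothetical per-order aggregates, shaped like the asker's groupby result.
new_df = pd.DataFrame({'Order Id': [4, 5, 6, 7],
                       'Order Qty': [110, 100, 100, 432],
                       'Fill Qty': [220, 200, 200, 365]})

# Filter to over-filled orders and pull out just the Order Id column as a list.
over_filled = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'], 'Order Id'].tolist()
print(over_filled)  # [4, 5, 6]
```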