Limited Output When Using Gazpacho and Pandas to Scrape

I'm trying to scrape a web page for the table of countries and their areas.
My code compiles and runs but only outputs the top two rows, when I want them all.
I thought the problem might lie with .head(), so I played around with it, passing numbers and leaving it out altogether, but I can't get it to print more than two.
Any help would be appreciated!
from gazpacho import get, Soup
import pandas as pd
url = "https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html"
response = get(url)
soup = Soup(response)
df0 = pd.read_html(str(soup.find('table')))[0]
print(df0[['Rank', 'Country', '(SQ KM)']].head())

First off, there's no need to combine pandas' .read_html() with gazpacho (or requests/BeautifulSoup): read_html() fetches the URL and parses the HTML on its own, using lxml or BeautifulSoup under the hood.
Secondly, I can't reproduce the issue; this prints far more than two rows for me. Where are you running this? Is it possible a setting or preference in your environment limits how many lines are displayed?
import pandas as pd
url = "https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html"
df0 = pd.read_html(url)[0]
print(df0[['Rank', 'Country', '(SQ KM)']])
Output (using .to_string() so pandas prints every row instead of a truncated view):
print(df0[['Rank', 'Country', '(SQ KM)']].to_string())
Rank Country (SQ KM)
0 1 Russia 17098242
1 2 Antarctica 14000000
2 3 Canada 9984670
3 4 United States 9833517
4 5 China 9596960
5 6 Brazil 8515770
6 7 Australia 7741220
7 8 India 3287263
8 9 Argentina 2780400
9 10 Kazakhstan 2724900
10 11 Algeria 2381741
11 12 Congo, Democratic Republic of the 2344858
12 13 Greenland 2166086
13 14 Saudi Arabia 2149690
14 15 Mexico 1964375
15 16 Indonesia 1904569
16 17 Sudan 1861484
17 18 Libya 1759540
18 19 Iran 1648195
19 20 Mongolia 1564116
20 21 Peru 1285216
21 22 Chad 1284000
22 23 Niger 1267000
23 24 Angola 1246700
24 25 Mali 1240192
25 26 South Africa 1219090
26 27 Colombia 1138910
27 28 Ethiopia 1104300
28 29 Bolivia 1098581
29 30 Mauritania 1030700
30 31 Egypt 1001450
31 32 Tanzania 947300
32 33 Nigeria 923768
33 34 Venezuela 912050
34 35 Namibia 824292
35 36 Mozambique 799380
36 37 Pakistan 796095
37 38 Turkey 783562
38 39 Chile 756102
39 40 Zambia 752618
40 41 Burma 676578
41 42 Afghanistan 652230
42 43 South Sudan 644329
43 44 France 643801
44 45 Somalia 637657
45 46 Central African Republic 622984
46 47 Ukraine 603550
47 48 Madagascar 587041
48 49 Botswana 581730
49 50 Kenya 580367
50 51 Yemen 527968
51 52 Thailand 513120
52 53 Spain 505370
53 54 Turkmenistan 488100
54 55 Cameroon 475440
55 56 Papua New Guinea 462840
56 57 Sweden 450295
57 58 Uzbekistan 447400
58 59 Morocco 446550
59 60 Iraq 438317
60 61 Paraguay 406752
61 62 Zimbabwe 390757
62 63 Japan 377915
63 64 Germany 357022
64 65 Congo, Republic of the 342000
65 66 Finland 338145
66 67 Vietnam 331210
67 68 Malaysia 329847
68 69 Norway 323802
69 70 Cote d'Ivoire 322463
70 71 Poland 312685
71 72 Oman 309500
72 73 Italy 301340
73 74 Philippines 300000
74 75 Ecuador 283561
75 76 Burkina Faso 274200
76 77 New Zealand 268838
77 78 Gabon 267667
78 79 Western Sahara 266000
79 80 Guinea 245857
80 81 United Kingdom 243610
81 82 Uganda 241038
82 83 Ghana 238533
83 84 Romania 238391
84 85 Laos 236800
85 86 Guyana 214969
86 87 Belarus 207600
87 88 Kyrgyzstan 199951
88 89 Senegal 196722
89 90 Syria 185180
90 91 Cambodia 181035
91 92 Uruguay 176215
92 93 Suriname 163820
93 94 Tunisia 163610
94 95 Bangladesh 148460
95 96 Nepal 147181
96 97 Tajikistan 144100
97 98 Greece 131957
98 99 Nicaragua 130370
99 100 Korea, North 120538
100 101 Malawi 118484
101 102 Eritrea 117600
102 103 Benin 112622
103 104 Honduras 112090
104 105 Liberia 111369
105 106 Bulgaria 110879
106 107 Cuba 110860
107 108 Guatemala 108889
108 109 Iceland 103000
109 110 Korea, South 99720
110 111 Hungary 93028
111 112 Portugal 92090
112 113 Jordan 89342
113 114 Azerbaijan 86600
114 115 Austria 83871
115 116 United Arab Emirates 83600
116 117 Czechia 78867
117 118 Serbia 77474
118 119 Panama 75420
119 120 Sierra Leone 71740
120 121 Ireland 70273
121 122 Georgia 69700
122 123 Sri Lanka 65610
123 124 Lithuania 65300
124 125 Latvia 64589
125 126 Svalbard 62045
126 127 Togo 56785
127 128 Croatia 56594
128 129 Bosnia and Herzegovina 51197
129 130 Costa Rica 51100
130 131 Slovakia 49035
131 132 Dominican Republic 48670
132 133 Estonia 45228
133 134 Denmark 43094
134 135 Netherlands 41543
135 136 Switzerland 41277
136 137 Bhutan 38394
137 138 Guinea-Bissau 36125
138 139 Taiwan 35980
139 140 Moldova 33851
140 141 Belgium 30528
141 142 Lesotho 30355
142 143 Armenia 29743
143 144 Solomon Islands 28896
144 145 Albania 28748
145 146 Equatorial Guinea 28051
146 147 Burundi 27830
147 148 Haiti 27750
148 149 Rwanda 26338
149 150 Macedonia 25713
150 151 Djibouti 23200
151 152 Belize 22966
152 153 El Salvador 21041
153 154 Israel 20770
154 155 Slovenia 20273
155 156 New Caledonia 18575
156 157 Fiji 18274
157 158 Kuwait 17818
158 159 Swaziland 17364
159 160 Timor-Leste 14874
160 161 Bahamas, The 13880
161 162 Montenegro 13812
162 163 Vanuatu 12189
163 164 Falkland Islands (Islas Malvinas) 12173
164 165 Qatar 11586
165 166 Gambia, The 11300
166 167 Jamaica 10991
167 168 Kosovo 10887
168 169 Lebanon 10400
169 170 Cyprus 9251
170 171 Puerto Rico 9104
171 172 West Bank 5860
172 173 Brunei 5765
173 174 Trinidad and Tobago 5128
174 175 French Polynesia 4167
175 176 Cabo Verde 4033
176 177 South Georgia and South Sandwich Islands 3903
177 178 Samoa 2831
178 179 Luxembourg 2586
179 180 Comoros 2235
180 181 Mauritius 2040
181 182 Virgin Islands 1910
182 183 Faroe Islands 1393
183 184 Hong Kong 1108
184 185 Sao Tome and Principe 964
185 186 Turks and Caicos Islands 948
186 187 Kiribati 811
187 188 Bahrain 760
188 189 Dominica 751
189 190 Tonga 747
190 191 Micronesia, Federated States of 702
191 192 Singapore 697
192 193 Saint Lucia 616
193 194 Isle of Man 572
194 195 Guam 544
195 196 Andorra 468
196 197 Northern Mariana Islands 464
197 198 Palau 459
198 199 Seychelles 455
199 200 Curacao 444
200 201 Antigua and Barbuda 443
201 202 Barbados 430
202 203 Heard Island and McDonald Islands 412
203 204 Saint Helena, Ascension, and Tristan da Cunha 394
204 205 Saint Vincent and the Grenadines 389
205 206 Jan Mayen 377
206 207 Gaza Strip 360
207 208 Grenada 344
208 209 Malta 316
209 210 Maldives 298
210 211 Cayman Islands 264
211 212 Saint Kitts and Nevis 261
212 213 Niue 260
213 214 Saint Pierre and Miquelon 242
214 215 Cook Islands 236
215 216 American Samoa 199
216 217 Marshall Islands 181
217 218 Aruba 180
218 219 Liechtenstein 160
219 220 British Virgin Islands 151
220 221 Wallis and Futuna 142
221 222 Christmas Island 135
222 223 Dhekelia 131
223 224 Akrotiri 123
224 225 Jersey 116
225 226 Montserrat 102
226 227 Anguilla 91
227 228 Guernsey 78
228 229 San Marino 61
229 230 British Indian Ocean Territory 60
230 231 French Southern and Antarctic Lands 55
231 232 Saint Martin 54
232 233 Bermuda 54
233 234 Bouvet Island 49
234 235 Pitcairn Islands 47
235 236 Norfolk Island 36
236 237 Sint Maarten 34
237 238 Macau 28
238 239 Tuvalu 26
239 240 Saint Barthelemy 25
240 241 United States Pacific Island Wildlife Refuges 22
241 242 Nauru 21
242 243 Cocos (Keeling) Islands 14
243 244 Tokelau 12
244 245 Paracel Islands 8
245 246 Gibraltar 7
246 247 Wake Island 7
247 248 Clipperton Island 6
248 249 Navassa Island 5
249 250 Spratly Islands 5
250 251 Ashmore and Cartier Islands 5
251 252 Coral Sea Islands 3
252 253 Monaco 2
253 254 Holy See (Vatican City) 0
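If the two-row limit is coming from your environment's display settings rather than the data, you can also relax pandas' display options before printing. A minimal sketch, assuming the same URL and column names as above:
import pandas as pd

# Show every row and column instead of pandas' truncated repr
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

url = "https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html"
df0 = pd.read_html(url)[0]
print(df0[['Rank', 'Country', '(SQ KM)']])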

You can also use lxml for this:
import requests
import lxml.html
url = 'https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html'
response = requests.get(url, timeout=5)
tree = lxml.html.fromstring(response.text)
# Extract the table by its id and pull the text of every cell
table = tree.get_element_by_id('rankOrder')
data = table.xpath('.//tr/td//text()')  # the leading "." keeps the search inside this table
# Separate the columns (each row has four cells: rank, country, area, date of info)
rank = data[0::4]
country = data[1::4]
sq_km = data[2::4]
date_of_info = data[3::4]
If you need a DataFrame, the rest is just formatting:
# If you want a data frame
import pandas
df = pandas.DataFrame(dict(country=country, sq_km=sq_km, date_of_info=date_of_info))
df
country sq_km date_of_info
0 Russia 17,098,242 \r
1 Antarctica 14,000,000 \r
2 Canada 9,984,670 \r
3 United States 9,833,517 \r
4 China 9,596,960 \r
.. ... ... ...
249 Spratly Islands 5 \r
250 Ashmore and Cartier Islands 5 \r
251 Coral Sea Islands 3 \r
252 Monaco 2 \r
253 Holy See (Vatican City) 0 \r
[254 rows x 3 columns]
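The sq_km values come back as strings with thousands separators, and date_of_info is just leftover whitespace here, so if you want numeric areas a small cleanup step helps. A sketch, assuming the df built above:
# Remove the thousands separators and convert the areas to integers
df['sq_km'] = df['sq_km'].str.replace(',', '', regex=False).astype(int)
# Strip the stray carriage returns from date_of_info
df['date_of_info'] = df['date_of_info'].str.strip()
print(df.dtypes)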

Related

How can we run SciPy Optimize after doing a Group By?

I have a dataframe that looks like this.
SectorID MyDate PrevMainCost OutageCost
0 123 10/31/2022 332 193
1 123 9/30/2022 308 269
2 123 8/31/2022 33 440
3 123 7/31/2022 230 147
4 123 6/30/2022 264 184
5 123 5/31/2022 290 46
6 123 4/30/2022 51 150
7 123 3/31/2022 69 253
8 123 2/28/2022 257 308
9 123 1/31/2022 441 349
10 456 10/31/2022 280 188
11 456 9/30/2022 432 150
12 456 8/31/2022 357 307
13 456 7/31/2022 425 45
14 456 6/30/2022 101 278
15 456 5/31/2022 62 240
16 456 4/30/2022 407 46
17 456 3/31/2022 35 218
18 456 2/28/2022 403 113
19 456 1/31/2022 295 200
20 456 12/31/2021 20 235
21 456 11/30/2021 440 403
22 789 10/31/2022 145 181
23 789 9/30/2022 320 259
24 789 8/31/2022 485 472
25 789 7/31/2022 59 24
26 789 6/30/2022 345 64
27 789 5/31/2022 34 480
28 789 4/30/2022 260 162
29 789 3/31/2022 46 399
30 999 10/31/2022 491 346
31 999 9/30/2022 77 212
32 999 8/31/2022 316 112
33 999 7/31/2022 106 351
34 999 6/30/2022 481 356
35 999 5/31/2022 20 269
36 999 4/30/2022 246 268
37 999 3/31/2022 377 173
38 999 2/28/2022 426 413
39 999 1/31/2022 341 168
40 999 12/31/2021 144 471
41 999 11/30/2021 358 393
42 999 10/31/2021 340 197
43 999 9/30/2021 119 252
44 999 8/31/2021 470 203
45 999 7/31/2021 359 163
46 999 6/30/2021 410 383
47 999 5/31/2021 200 119
48 999 4/30/2021 230 291
I am trying to find the minimum of PrevMainCost and OutageCost, after grouping by SectorID. Here's my primitive code.
import numpy as np
import pandas
df = pandas.read_clipboard(sep='\\s+')
df
df_sum = df.groupby('SectorID').sum()
df_sum
df_sum.loc[df_sum['PrevMainCost'] <= df_sum['OutageCost'], 'Result'] = 'Main'
df_sum.loc[df_sum['PrevMainCost'] > df_sum['OutageCost'], 'Result'] = 'Out'
Result (the Result column flags whether PrevMainCost or OutageCost is lower):
PrevMainCost OutageCost Result
SectorID
123 2275 2339 Main
456 3257 2423 Out
789 1694 2041 Main
999 5511 5140 Out
I am trying to figure out how to use SciPy optimization to solve this problem. I Googled it and came up with this simple code sample:
from scipy.optimize import *
df_sum.groupby(by=['SectorID']).apply(lambda g: minimize(equation, g.Result, options={'xtol':0.001}).x)
When I run that, I get an error saying 'NameError: name 'equation' is not defined'.
How can I find the minimum of either the preventative maintenance cost or the outage cost, after grouping by SectorID? Also, how can I add some kind of constraint, such as no more than 30% of all resources being used by any one particular SectorID?
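For reference, minimize needs a callable objective as its first argument, which is why equation comes back undefined in the snippet above. A bare-bones sketch of that call shape, with a purely made-up objective function (not from the question), looks like:
from scipy.optimize import minimize

def equation(w, prev_main, outage):
    # Illustrative objective only: a weighted blend of the two costs
    return w[0] * prev_main + (1 - w[0]) * outage

res = minimize(
    equation,
    x0=[0.5],             # initial guess for the weight
    args=(2275, 2339),    # the SectorID 123 sums from the table above
    bounds=[(0.0, 1.0)],  # keep the weight between 0 and 1
)
print(res.x)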

Why does my replace method not work with the string [B]?

I have the following dataset called world_top_ten:
Most populous countries 2000 2015 2030[A]
0 China[B] 1270 1376 1416
1 India 1053 1311 1528
2 United States 283 322 356
3 Indonesia 212 258 295
4 Pakistan 136 208 245
5 Brazil 176 206 228
6 Nigeria 123 182 263
7 Bangladesh 131 161 186
8 Russia 146 146 149
9 Mexico 103 127 148
10 World total 6127 7349 8501
I am trying to replace the [B] with "":
world_top_ten['Most populous countries'].str.replace(r'"[B]"', '')
And it returns the column unchanged:
0 China[B]
1 India
2 United States
3 Indonesia
4 Pakistan
5 Brazil
6 Nigeria
7 Bangladesh
8 Russia
9 Mexico
10 World total
Name: Most populous countries, dtype: object
What am I doing wrong here?
Because [ and ] are special regex characters (they define a character class, so the pattern never matches the literal brackets), escape them:
world_top_ten['Most populous countries'].str.replace(r'\[B\]', '', regex=True)
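Alternatively, if you don't need a regular expression at all, you can treat the pattern as a literal string; a small sketch using the same world_top_ten frame (and assigning the result back so the change sticks):
# Literal string replacement: no escaping needed with regex=False
world_top_ten['Most populous countries'] = (
    world_top_ten['Most populous countries'].str.replace('[B]', '', regex=False)
)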

How do I explode my dataframe based on each word in a column?

I have the following df:
Score num_comments titles
0 134 518 Uhaul implement nicotine-free hiring policy
1 28 43 Orangutan saves child from a giant volcano
2 30 114 Swimmer dies in a horrific shark attack in harbour
3 745 298 More teenagers than ever are addicted to glue
4 40 67 Lebanese lawyers union accuse Al Capone of fraud
...
9366 345 32 City of Louisville closed off this summer
9367 1200 234 New york rats "stronger than ever", reports say
9368 432 123 Congolese militia shipwrecked in Norway
9369 594 203 Scientists now agree on how to use ice in drinks
9370 611 153 Historic drought hits Atlantis
Now I would like to create a new dataframe (df2) where I can see what score and how many comments each word gets, like this:
Word score num_comments
Uhaul 134 518
implement 134 518
nicotine-free 134 518
hiring 134 518
policy 134 518
Orangutan 28 43
saves 28 43
child 28 43
from 28 43
a 28 43
giant 28 43
volcano 28 43
...
etc..
I have tried splitting the title into separate words and then exploding:
df3['titles_split'] = df['titles'].str.split()
This gave me a column that looked like this:
Score num_comments titles_split
0 134 518 [Uhaul, implement, nicotine-free, hiring, policy]
1 28 43 [Orangutan, saves, child, from, a, giant, volcano]
2 30 114 [Swimmer, dies, in, a, horrific, shark, attack, in, harbour]
3 745 298 [More, teenagers, than, ever, are, addicted, to, glue]
4 40 67 [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud]
...
9366 345 32 [City, of, Louisville, closed, off, this, summer]
9367 1200 234 [New, york, rats, stronger, than, ever, reports, say]
9368 432 123 [Congolese, militia, shipwrecked, in, Norway]
9369 594 203 [Scientists, now, agree, on, how, to, use, ice, in, drinks]
9370 611 153 [Historic, drought, hits, Atlantis]
Then I tried this code:
df3.explode(df3.assign(titles_split=df3.titles_split.str.split(',')), 'titles_split')
But I got the following error message:
ValueError: column must be a scalar, tuple, or list thereof
The same thing happened when I tried it for titles in df2.
I also tried creating new columns that repeated scores and num_comments as many times as there are words in titles (or titles_split). The idea was to create a dataframe like this:
In [9]: df4
Out[9]:
Score num_comments titles_split score_repeated
0 134 518 [Uhaul, implement, nicotine-free, hiring, policy] 134,134,134,134,134,134
1 28 43 [Orangutan, saves, child, from, a, giant, volcano] 28,28,28,28,28,28,28
2 30 114 [Swimmer, dies, in, a, horrific, shark, attack, in, harbour] 30,30,30 etc..
3 745 298 [More, teenagers, than, ever, are, addicted, to, glue] etc.
4 40 67 [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud] etc
...
9366 345 32 [City, of, Louisville, closed, off, this, summer] etc
9367 1200 234 [New, york, rats, stronger, than, ever, reports, say] etc
9368 432 123 [Congolese, militia, shipwrecked, in, Norway] etc
9369 594 203 [Scientists, now, agree, on, how, to, use, ice, in, drinks] etc
9370 611 153 [Historic, drought, hits, Atlantis] etc
And then explode on titles_split, score_repeated and comments_repeated like this:
df4.explode(['titles_split', 'score_repeated', 'comments_repeated'])
But I never got to that point because I couldn't get repeated columns. I tried the following code:
df3['score_repeat'] = df3.apply(lambda x: [x.score] * len(x.titles_split) , axis =1)
Which gave me this error message:
TypeError: object of type 'float' has no len()
Then I tried:
df3['score_repeat'] = [[y] * x for x, y in zip(df3['titles_split'].str.len(),df['score'])]
Which gave me:
TypeError: can't multiply sequence by non-int of type 'float'
But I am not even sure I am going about this the right way. Do I even need to create score_repeated and comments_repeated?
Assume this is the df:
Score num_comments titles
0 134 518 Uhaul implement nicotine-free hiring policy
1 28 43 Orangutan saves child from a giant volcano
2 30 114 Swimmer dies in a horrific shark attack in harbour
3 745 298 More teenagers than ever are addicted to glue
4 40 67 Lebanese lawyers union accuse Al Capone of fraud
5 345 32 City of Louisville closed off this summer
6 1200 234 New york rats "stronger than ever", reports say
7 432 123 Congolese militia shipwrecked in Norway
8 594 203 Scientists now agree on how to use ice in drinks
9 611 153 Historic drought hits Atlantis
You can try the following:
df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '') #data cleaning
df['titles'] = df['titles'].str.split() #split sentence into a list (of single words)
df2 = df.explode('titles', ignore_index=True)
df2.columns = ['score', 'num_comments', 'word']
print(df2)
score num_comments word
0 134 518 Uhaul
1 134 518 implement
2 134 518 nicotine-free
3 134 518 hiring
4 134 518 policy
5 28 43 Orangutan
6 28 43 saves
7 28 43 child
8 28 43 from
9 28 43 a
10 28 43 giant
11 28 43 volcano
12 30 114 Swimmer
13 30 114 dies
14 30 114 in
15 30 114 a
16 30 114 horrific
17 30 114 shark
18 30 114 attack
19 30 114 in
20 30 114 harbour
21 745 298 More
22 745 298 teenagers
23 745 298 than
24 745 298 ever
25 745 298 are
26 745 298 addicted
27 745 298 to
28 745 298 glue
29 40 67 Lebanese
30 40 67 lawyers
31 40 67 union
32 40 67 accuse
33 40 67 Al
34 40 67 Capone
35 40 67 of
36 40 67 fraud
37 345 32 City
38 345 32 of
39 345 32 Louisville
40 345 32 closed
41 345 32 off
42 345 32 this
43 345 32 summer
44 1200 234 New
45 1200 234 york
46 1200 234 rats
47 1200 234 stronger
48 1200 234 than
49 1200 234 ever
50 1200 234 reports
51 1200 234 say
52 432 123 Congolese
53 432 123 militia
54 432 123 shipwrecked
55 432 123 in
56 432 123 Norway
57 594 203 Scientists
58 594 203 now
59 594 203 agree
60 594 203 on
61 594 203 how
62 594 203 to
63 594 203 use
64 594 203 ice
65 594 203 in
66 594 203 drinks
67 611 153 Historic
68 611 153 drought
69 611 153 hits
70 611 153 Atlantis
Data cleaning needed
I noticed that some titles contain double quotation marks (") and commas (,), meaning the data is not clean. You could do some data cleaning with the following:
df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '')
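If the end goal is to see the total score and comment count each word accumulates across all titles, a follow-up aggregation on the exploded frame could look like this (a sketch, assuming the df2 produced above):
# Sum the score and comment counts per word (use .mean() for averages instead)
word_stats = (
    df2.groupby('word', as_index=False)[['score', 'num_comments']]
    .sum()
    .sort_values('score', ascending=False)
)
print(word_stats.head())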

pandas syntax error when filtering on multiple conditions

I cannot figure out what the problem is with this code. It gives me an "invalid syntax" error, but I'm following the instructions exactly and it looks accurate. I'm trying to get just the people with over 30 doubles ('2B') who are in the AL league from the merged data below (d820hw5p3)... any ideas what's going on?
d820hw5p6= d820hw5p3[(d820hw5p3.2B > 30) & (d820hw5p3.LEAGUE == 'AL')]
d820hw5p6
d820hw5p3 is this data:
First Last R H AB LEAGUE 2B 3B HR RBI
0 Leonys Martin 72 128 518 AL 17 3 15 47
1 Jay Bruce 74 135 540 NL 27 6 33 99
2 Jackie Bradley Jr. 94 149 558 AL 30 7 26 87
3 George Springer 116 168 644 AL 29 5 29 82
4 Corey Dickerson 57 125 510 AL 36 3 24 70
5 Dexter Fowler 84 126 457 NL 25 7 13 48
6 Angel Pagan 71 137 495 NL 24 5 12 55
7 Adam Eaton 91 176 620 AL 29 9 14 59
8 Yasmany Tomas 72 144 529 NL 30 1 31 83
9 Gregory Polanco 79 136 527 NL 34 4 22 86
10 Nomar Mazara 59 137 515 AL 13 3 20 64
11 Justin Upton 81 140 569 AL 28 2 31 87
12 Bryce Harper 84 123 506 NL 24 2 24 86
13 Kole Calhoun 91 161 594 AL 35 5 18 75
14 Ender Inciarte 85 152 522 NL 24 7 3 29
15 Jacoby Ellsbury 71 145 551 AL 24 5 9 56
16 Curtis Granderson 88 129 544 NL 24 5 30 59
17 Mookie Betts 122 214 673 AL 42 5 31 113
18 Denard Span 70 152 571 NL 23 5 11 53
19 Adam Duvall 85 133 552 NL 31 6 33 103
20 Brett Gardner 80 143 548 AL 22 6 7 41
21 Matt Kemp 89 167 623 NL 39 0 35 108
22 Khris Davis 85 137 555 AL 24 2 42 102
23 Mike Trout 123 173 549 AL 32 5 29 100
24 Melky Cabrera 70 175 591 AL 42 5 14 86
25 Jose Bautista 68 99 423 AL 24 1 22 69
26 Ian Desmond 107 178 625 AL 29 3 22 86
27 Alex Gordon 62 98 445 AL 16 2 17 40
28 Ryan Braun 80 156 511 NL 23 3 30 91
29 Nick Markakis 67 161 599 NL 38 0 13 89
30 Carlos Gonzalez 87 174 584 NL 42 2 25 100
31 Yoenis Cespedes 72 134 479 NL 25 1 31 86
32 Stephen Piscotty 86 159 582 NL 35 3 22 85
33 Michael Saunders 70 124 490 AL 32 3 24 57
34 Jayson Werth 84 128 525 NL 28 0 21 69
35 Howie Kendrick 65 124 486 NL 26 2 8 40
36 Adam Jones 86 164 619 AL 19 0 29 83
37 Marcell Ozuna 75 148 556 NL 23 6 23 76
38 Jason Heyward 61 122 530 NL 27 1 7 49
39 Marwin Gonzalez 55 123 484 AL 26 3 13 51
40 Starling Marte 71 152 489 NL 34 5 9 46
41 J.D. Martinez 69 141 459 AL 35 2 22 68
42 Kevin Pillar 59 146 549 AL 35 2 7 53
43 Charlie Blackmon 111 187 577 NL 35 5 29 82
44 Odubel Herrera 87 167 584 NL 21 6 15 49
45 Christian Yelich 78 172 577 NL 38 3 21 98
46 Andrew McCutchen 81 153 598 NL 26 3 24 79
I went off AMC's hunch that the column starting with a 2 is problematic, and created this minimal reproducible example:
import pandas as pd

# Define a minimal DataFrame with a column name that starts with a digit
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    '2b': [1, 2, 3],
    'b2': [4, 5, 6],
})

# Try to access column 2b with attribute notation
df.2b
Which returns SyntaxError: invalid syntax
While df['2b'] returns the expected series.
I did a brief search of the documentation and didn't see anything definitive, but I expect it has something to do with this: Variable names in Python cannot start with a number or can they?
So in the end, while 2b is a valid column name, you will have to access its series using the df['column'] syntax.
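Applied to the question, that means switching the 2B comparison to bracket notation; a sketch assuming the d820hw5p3 frame shown above:
# Bracket notation works for column names that start with a digit
d820hw5p6 = d820hw5p3[(d820hw5p3['2B'] > 30) & (d820hw5p3.LEAGUE == 'AL')]
print(d820hw5p6)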

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example, the values in the first row, corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1), but this doesn't work; the values don't appear in sorted order at all.
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
.sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
One way is to apply sorted along axis=1, then apply pd.Series to get a dataframe back instead of a column of lists, and finally set and sort the index by Category:
(df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)
   .set_index(df.Category).sort_index())
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...
