In my Excel file, the only thing I see in the mileage column is the actual value; for example, on the first line I see "35 kmpl". Why is Python putting \n\n before every value when I print my file?
I was expecting just the kmpl values in that column, but instead I'm getting \n\n in front of them in Python.
Here is my code (I didn't get an error message, so I can't include one):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
bikes = pd.read_csv('/content/bikes sorted final.csv')
print(bikes)
which returns:
model_name model_year kms_driven owner
0 Bajaj Avenger Cruise 220 2017 2017 17000 Km first owner
1 Royal Enfield Classic 350cc 2016 2016 50000 Km first owner
2 Hyosung GT250R 2012 2012 14795 Km first owner
3 KTM Duke 200cc 2012 2012 24561 Km third owner
4 Bajaj Pulsar 180cc 2016 2016 19718 Km first owner
...
location mileage power price
0 hyderabad \n\n 35 kmpl 19 bhp 63500
1 hyderabad \n\n 35 kmpl 19.80 bhp 115000
2 hyderabad \n\n 30 kmpl 28 bhp 300000
3 bangalore \n\n 35 kmpl 25 bhp 63400
4 bangalore \n\n 65 kmpl 17 bhp 55000
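Those \n\n sequences are most likely newline characters embedded inside the quoted CSV cells themselves; Excel renders them as (often invisible) line breaks within a cell, while pandas preserves them as text. If the goal is just to clean the values, a minimal sketch (assuming the column is named mileage, as in the output above):

import pandas as pd

bikes = pd.read_csv('/content/bikes sorted final.csv')
# str.strip() removes leading/trailing whitespace, including the embedded newlines
bikes['mileage'] = bikes['mileage'].str.strip()
print(bikes['mileage'].head())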
When I go to scrape https://www.onthesnow.com/epic-pass/skireport for the names of all the ski resorts listed, I'm running into an issue where some of the ski resorts don't show up in my output. Here's my current code:
import requests
url = "https://www.onthesnow.com/epic-pass/skireport"
response = requests.get(url)
response.text
The current output gives all resorts up to Mont Sainte Anne, but then it skips to the resorts at the bottom of the webpage under "closed resorts". I notice that when you scroll down the webpage in a browser, the missing resort names only load once they have been scrolled into view. How do I make requests.get() obtain all of the HTML, even the HTML that still needs to load?
The data you see is loaded from an external URL in JSON form. To load it, you can use this example:
import json
import requests

url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"

data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for i, d in enumerate(data["data"], 1):
    print(i, d["title"])
Prints:
1 Beaver Creek
2 Breckenridge
3 Brides les Bains
4 Courchevel
5 Crested Butte Mountain Resort
6 Fernie Alpine
7 Folgàrida - Marilléva
8 Heavenly
9 Keystone
10 Kicking Horse
11 Kimberley
12 Kirkwood
13 La Tania
14 Les Menuires
15 Madonna di Campiglio
16 Meribel
17 Mont Sainte Anne
18 Nakiska Ski Area
19 Nendaz
20 Northstar California
21 Okemo Mountain Resort
22 Orelle
23 Park City
24 Pontedilegno - Tonale
25 Saint Martin de Belleville
26 Snowbasin
27 Stevens Pass Resort
28 Stoneham
29 Stowe Mountain
30 Sun Valley
31 Thyon 2000
32 Vail
33 Val Thorens
34 Verbier
35 Veysonnaz
36 Whistler Blackcomb
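If you would rather have the records in tabular form, a possible follow-up (a sketch, assuming the same data dict from the code above; pd.json_normalize flattens the per-resort JSON records):

import pandas as pd

# each element of data["data"] is one resort record
df = pd.json_normalize(data["data"])
print(df["title"].head())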
New coder here! I am trying to scrape web table data from multiple URLs. Each URL's page has one table, but that table is split across multiple pages. My code only iterates through the table pages of the first URL and not the rest. So I am able to get pages 1-5 of NBA data for the year 2000 only, and it stops there. How do I get my code to pull every year of data? Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup

page = 1
year = 2000

while page < 20 and year < 2020:
    base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
    response = requests.get(base_URL, headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sal_table = soup.find_all('table', class_='tablehead')
        if len(sal_table) < 2:
            sal_table = sal_table[0]
            with open('NBA_Salary_2000_2019.txt', 'a') as r:
                for row in sal_table.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(30))
                    r.write('\n')
            page += 1
        else:
            print("too many tables")
    else:
        year += 1
        page = 1
I'd consider using pandas here, as 1) its .read_html() function (which uses BeautifulSoup under the hood) makes parsing <table> tags easier, and 2) it can then easily write straight to file.
Also, it's a waste to iterate through 20 pages for every season (for example, the first season you are after only has 4 pages; the rest are blank). So I'd consider adding something that says: once it reaches a blank table, move on to the next season.
import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

results = pd.DataFrame()
year = 2000
while year < 2020:
    goToNextPage = True
    page = 1
    while goToNextPage == True:
        base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
        response = requests.get(base_URL, headers)
        if response.status_code == 200:
            temp_df = pd.read_html(base_URL)[0]
            temp_df.columns = list(temp_df.iloc[0, :])
            temp_df = temp_df[temp_df['RK'] != 'RK']
            if len(temp_df) == 0:
                goToNextPage = False
                year += 1
                continue
            print('Acquiring Season: %s\tPage: %s' % (year, page))
            temp_df['Season'] = '%s-%s' % (year - 1, year)
            results = results.append(temp_df, sort=False).reset_index(drop=True)
            page += 1

results.to_csv('c:/test/NBA_Salary_2000_2019.csv', index=False)
Output:
print (results.head(25).to_string())
RK NAME TEAM SALARY Season
0 1 Shaquille O'Neal, C Los Angeles Lakers $17,142,000 1999-2000
1 2 Kevin Garnett, PF Minnesota Timberwolves $16,806,000 1999-2000
2 3 Alonzo Mourning, C Miami Heat $15,004,000 1999-2000
3 4 Juwan Howard, PF Washington Wizards $15,000,000 1999-2000
4 5 Scottie Pippen, SF Portland Trail Blazers $14,795,000 1999-2000
5 6 Karl Malone, PF Utah Jazz $14,000,000 1999-2000
6 7 Larry Johnson, F New York Knicks $11,910,000 1999-2000
7 8 Gary Payton, PG Seattle SuperSonics $11,020,000 1999-2000
8 9 Rasheed Wallace, PF Portland Trail Blazers $10,800,000 1999-2000
9 10 Shawn Kemp, C Cleveland Cavaliers $10,780,000 1999-2000
10 11 Damon Stoudamire, PG Portland Trail Blazers $10,125,000 1999-2000
11 12 Antonio McDyess, PF Denver Nuggets $9,900,000 1999-2000
12 13 Antoine Walker, PF Boston Celtics $9,000,000 1999-2000
13 14 Shareef Abdur-Rahim, PF Vancouver Grizzlies $9,000,000 1999-2000
14 15 Allen Iverson, SG Philadelphia 76ers $9,000,000 1999-2000
15 16 Vin Baker, PF Seattle SuperSonics $9,000,000 1999-2000
16 17 Ray Allen, SG Milwaukee Bucks $9,000,000 1999-2000
17 18 Anfernee Hardaway, SF Phoenix Suns $9,000,000 1999-2000
18 19 Kobe Bryant, SF Los Angeles Lakers $9,000,000 1999-2000
19 20 Stephon Marbury, PG New Jersey Nets $9,000,000 1999-2000
20 21 Vlade Divac, C Sacramento Kings $8,837,000 1999-2000
21 22 Bryant Reeves, C Vancouver Grizzlies $8,666,000 1999-2000
22 23 Tom Gugliotta, PF Phoenix Suns $8,558,000 1999-2000
23 24 Nick Van Exel, PG Denver Nuggets $8,354,000 1999-2000
24 25 Elden Campbell, C Charlotte Hornets $7,975,000 1999-2000
...
I am trying to scrape the results table from the following URL: https://utmbmontblanc.com/en/page/107/results.html
However, when I run my code it says 'No tables found':
import pandas as pd
url = 'https://utmbmontblanc.com/en/page/107/results.html'
data = pd.read_html(url, header = 0)
data.head()
ValueError: No tables found
Having used developer tools, I know there is definitely a table in the HTML. Why is it not being found? Any help is greatly appreciated. Thanks in advance.
Build the URL for the Ajax request; for 2017 - CCC it looks like this:
url = 'https://.......com/result.php?mode=edPass&ajax=true&annee=2017&course=ccc'
data = pd.read_html(url, header = 0)
print(data[0])
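To pull several years from the same endpoint, one could loop over the annee parameter (a sketch; the host stays redacted exactly as in the answer above, and the year range shown is only an assumption):

import pandas as pd

frames = []
for annee in (2015, 2016, 2017):  # assumed years; adjust as needed
    url = 'https://.......com/result.php?mode=edPass&ajax=true&annee={}&course=ccc'.format(annee)  # redacted host, as above
    frames.append(pd.read_html(url, header=0)[0])
all_results = pd.concat(frames, ignore_index=True)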
You can also use selenium if you are unable to find any other hacks.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
from bs4 import BeautifulSoup as BSoup
import pandas as pd
url = "https://utmbmontblanc.com/en/page/107/results.html"
driver = webdriver.Chrome("/home/bitto/chromedriver")  # change this to your chromedriver path
year = 2017
driver.get(url)
element = WebDriverWait(driver, 10).until(
    # change the index of div[@class='bloc'] to change the year - [1] for 2018, [2] for 2017, etc.
    # change the index of div[@class='row'] - [1], [2] for TDS, etc.
    # change the @value of the option to match your preferred option's value - you can find this
    # with the inspect tool - the first two are Scratch and ScratchH
    EC.presence_of_element_located((By.XPATH, "//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/select[@name='cat'][1]/option[@value='Scratch']"))
)
element.click()  # select the option

# make the same changes you made above here as well
driver.find_element_by_xpath("//div[@class='bloc'][2]/div[@class='row'][4]/span[@class='selectbutton']/input").click()  # click Go
sleep(10)  # not preferred, but will do for now

table = pd.read_html(driver.page_source)
print(table)
Output
[ GeneralRanking Family name First name Club Cat. ... Time Difference/ 1st Nationality
0 1 3001 - HAWKS Hayden HOKA ONE ONE SEH ... 10:24:30 00:00:00 United States
1 2 3018 - ŚWIERC Marcin SALOMON SUUNTO TEAM POLAND SEH ... 10:42:49 00:18:19 Poland
2 3 3005 - POMMERET Ludovic TEAM HOKA V1H ... 10:50:47 00:26:17 France
3 4 3214 - EVANS Thomas COMPRESS SPORT SEH ... 10:57:44 00:33:14 United Kingdom
4 5 3002 - OWENS Tom SALOMON SEH ... 11:03:48 00:39:18 United Kingdom
5 6 3011 - JONSSON Thorbergur 66 NORTH SEH ... 11:14:22 00:49:52 Iceland
6 7 3026 - BOUVIER-GAZ Nicolas TEAM NEW BALANCE SEH ... 11:18:33 00:54:03 France
7 8 3081 - JONES Michael WWW.APEXRUNNING.CO SEH ... 11:31:50 01:07:20 United Kingdom
8 9 3020 - COLLET Aurélien HOKA ONE ONE SEH ... 11:33:10 01:08:40 France
9 10 3009 - MARAVILLA Jorge HOKA ONE ONE V1H ... 11:36:14 01:11:44 United States
10 11 3036 - PERRILLAT Christophe SEH ... 11:40:05 01:15:35 France
11 12 3070 - FRAGUELA BREIJO Alejandro STUDIO54 V1H ... 11:40:11 01:15:41 Spain
12 13 3092 - AIGROZ Mike TRUST SEH ... 11:41:53 01:17:23 Switzerland
13 14 3021 - O'LEARY Paddy THE NORTH FACE SEH ... 11:47:04 01:22:34 Ireland
14 15 3065 - PÉREZ TORREGLOSA Juan CLUB ULTRATRAIL ... SEH ... 11:47:51 01:23:21 Spain
15 16 3031 - SÁNCHEZ CEBRIÁN Miguel Ángel LURBEL-LI... V1H ... 11:49:15 01:24:45 Spain
16 17 3062 - ANDREWS Justin SEH ... 11:49:47 01:25:17 United States
17 18 3039 - PIANA Giulio TEAM MUD AND SNOW SEH ... 11:50:23 01:25:53 Italy
18 19 3047 - RONIMOISS Andris Inov8 / OSveikals.lv ... SEH ... 11:52:25 01:27:55 Latvia
19 20 3052 - DURAND Regis TEAM TRAIL ISOSTAR V1H ... 11:56:40 01:32:10 France
20 21 3027 - SANDES Ryan SALOMON SEH ... 12:04:39 01:40:09 South Africa
21 22 3014 - EL MORABITY Rachid ULTRA TRAIL ATLAS T... SEH ... 12:10:01 01:45:31 Morocco
22 23 3067 - JONES Harry RUNIVORE SEH ... 12:10:12 01:45:42 United Kingdom
23 24 3030 - CLAVERY Erik - SEH ... 12:12:56 01:48:26 France
24 25 3056 - JIMENEZ LLORENS Juan Maria GREEN POWER... SEH ... 12:13:18 01:48:48 Spain
25 26 3024 - GALLAGHER Clare THE NORTH FACE SEF ... 12:13:57 01:49:27 United States
26 27 3136 - ASSEL Garry LICENCE INDIVIDUELLE LUXEM... SEH ... 12:20:46 01:56:16 Luxembourg
27 28 3071 - RIGODANZA Francesco SPIRITO TRAIL TEAM SEH ... 12:22:49 01:58:19 Italy
28 29 3118 - POLASZEK Christophe CHARTRES VERTICAL V1H ... 12:24:49 02:00:19 France
29 30 3125 - CALERO RODRIGUEZ David Altmann Sports/... SEH ... 12:25:07 02:00:37 Spain
... ... ... ... ... ... ... ...
1712 1713 5734 - GOT Hang Fai V2H ... 26:25:01 16:00:31 Hong Kong, China
1713 1714 4154 - RAMOS Liliana NIKE RUNNING CLUB V3F ... 26:26:22 16:01:52 Argentina
1714 1715 5448 - BECKRICH Xavier PHOENIX57 V1H ... 26:26:45 16:02:15 France
1715 1716 5213 - BARBERIO ARNOULT Isabelle PHOENIX57 V1F ... 26:26:49 16:02:19 France
1716 1717 4704 - ZHANG Zheng XIAOMABENTENG SEH ... 26:28:37 16:04:07 China
1717 1718 5282 - GUISOLAN Frédéric SEH ... 26:28:46 16:04:16 Switzerland
1718 1719 5306 - MEDINA Rafael V1H ... 26:29:26 16:04:56 Mexico
1719 1720 5379 - PENTCHEFF Nicolas SEH ... 26:33:05 16:08:35 France
1720 1721 4665 - GONZALEZ SUANCES Israel BAR ES PUIG V1H ... 26:33:58 16:09:28 Spain
1721 1722 4389 - TONANNY Marie SEF ... 26:34:51 16:10:21 France
1722 1723 5616 - GLORIAN Thierry V2H ... 26:35:47 16:11:17 France
1723 1724 5684 - CHEUNG Ho FAITHWALKERS V1H ... 26:37:09 16:12:39 Hong Kong, China
1724 1725 5719 - GANDER Pascal JEFF B TRAIL SEH ... 26:39:04 16:14:34 France
1725 1726 4555 - JURGIELEWICZ Urszula SEF ... 26:39:44 16:15:14 Poland
1726 1727 4722 - HIDALGO José Miguel C.D. ATLETISMO SAN... V1H ... 26:40:27 16:15:57 Spain
1727 1728 4425 - JITTIWUTIKARN Gif V1F ... 26:41:02 16:16:32 Thailand
1728 1729 4556 - ZHU Jing SEF ... 26:41:12 16:16:42 China
1729 1730 4314 - HU Dongli V1H ... 26:41:27 16:16:57 China
1730 1731 4239 - DURET Estelle OXYGENE BELBEUF V1F ... 26:41:51 16:17:21 France
1731 1732 4525 - MAGLIERI Fabrice ATHLETIC CLUB PAYS DE... V1H ... 26:42:11 16:17:41 France
1732 1733 4433 - ANDERSEN Laura Jentsch RUN DEM CREW SEF ... 26:42:27 16:17:57 Denmark
1733 1734 4563 - CHEUNG Annie On Nai FAITHWALKERS V1F ... 26:45:35 16:21:05 Hong Kong, China
1734 1735 4355 - KHALED Naïm GENEVE AEROPORT SEH ... 26:47:50 16:23:20 Algeria
1735 1736 4749 - STELLA Sara COURMAYEUR TRAILERS V1F ... 26:48:07 16:23:37 Italy
1736 1737 4063 - LALIMAN Leslie SEF ... 26:48:09 16:23:39 France
1737 1738 5702 - BURKE Tony Alchester/CTR/Bicester Tri V2H ... 26:50:52 16:26:22 Ireland
1738 1739 5146 - OLIVEIRA Sandra BUDEGUITA RUNNERS V1F ... 26:52:23 16:27:53 Portugal
1739 1740 5545 - VELLANDI Emilio TEAM PEGGIORI SCARPA MICO V1H ... 26:55:32 16:31:02 Italy
1740 1741 5543 - GASPAROVIC Bernard STADE FRANCAIS V3H ... 26:56:31 16:32:01 France
1741 1742 4760 - MENDONCA Carine ASPTT COMPIEGNE V2F ... 27:19:15 16:54:45 Belgium
[1742 rows x 7 columns]]
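Since pd.read_html returns a list of DataFrames, a short follow-up sketch for keeping the results (the file name is illustrative):

results_df = table[0]  # the results table is the first table found on the page
results_df.to_csv('utmb_ccc_2017_results.csv', index=False)  # illustrative file name
driver.quit()  # close the browser once done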
I am attempting to use Python and Selenium to scrape dynamically loaded data from a website. The problem is that only about half of the data is reported as present, when in reality all of it should be there. Even after adding pauses before printing the page content, or using simple find-element-by-class searches, there seems to be no solution. The URL of the site is https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909. As you can see, there are 13 main sections, but I am only able to retrieve data from the first four games. To best show the problem, I'll attach the code that prints the inner HTML of the entire page, which shows the discrepancy between the loaded and non-loaded data.
from selenium import webdriver
import requests
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
print(driver.execute_script("return document.documentElement.innerText;"))
EDIT:
The problem is not the wait time; I am running it line by line and fully waiting for the page to load. The problem appears to boil down to Selenium not grabbing all of the JS-loaded text on the page, as seen in the console output in the answer below.
@sudonym's analysis was in the right direction. You need to induce WebDriverWait for the desired elements to be visible before you attempt to extract them through the execute_script() method, as follows:
Code Block:
# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")))
print(driver.execute_script("return document.documentElement.innerText;"))
Console Output:
SPORTSBOOK REVIEW
Home
Best Sportsbooks
Rating Guide
Blacklist
Bonuses
BETTING ODDS
FREE PICKS
Sports Picks
NFL
College Football
NBA
NCAAB
MLB
NHL
More Sports
How to Bet
Tools
FORUM
Home
Players Talk
Sportsbooks & Industry
Newbie Forum
Handicapper Think Tank
David Malinsky's Point Blank
Service Plays
Bitcoin Sports Betting
NBA Betting
NFL Betting
NCAAF Betting
MLB Betting
NHL Betting
CONTESTS
EARN BETPOINTS
What Are Betpoints?
SBR Sportsbook
SBR Casino
SBR Racebook
SBR Poker
SBR Store
Today
NFL
NBA
NHL
MLB
College Football
NCAA Basketball
Soccer
Soccer Odds
Major League Soccer
UEFA Champions League
UEFA Nations League
UEFA Europa League
English Premier League
World Cup 2022
Tennis
Tennis Odds
ATP
WTA
UFC
Boxing
More Sports
CFL
WNBA
AFL
Betting Odds/NFL Odds/Consensus
TODAY
|
YESTERDAY
|
DATE
?
Login
?
Settings
?
Bet Tracker
?
Bet Card
?
Favorites
NFL Consensus for Sep 09, 2018
USA - National Football League
Sunday Sep 09, 2018
01:00 PM
/
Pittsburgh vs Cleveland
453
Pittsburgh
454
Cleveland
Current Line
-3½+105
+3½-115
Wagers Placed
10040
54.07%
8530
45.93%
Amount Wagered
$381,520.00
56.10%
$298,550.00
43.90%
Average Bet Size
$38.00
$35.00
SBR Contest Best Bets
22
9
01:00 PM
/
San Francisco vs Minnesota
455
San Francisco
456
Minnesota
Current Line
+6-102
-6-108
Wagers Placed
6250
41.25%
8900
58.75%
Amount Wagered
$175,000.00
29.50%
$418,300.00
70.50%
Average Bet Size
$28.00
$47.00
SBR Contest Best Bets
5
19
01:00 PM
/
Cincinnati vs Indianapolis
457
Cincinnati
458
Indianapolis
Current Line
-1-104
+1-106
Wagers Placed
11640
66.36%
5900
33.64%
Amount Wagered
$1,338,600.00
85.65%
$224,200.00
14.35%
Average Bet Size
$115.00
$38.00
SBR Contest Best Bets
23
12
01:00 PM
/
Buffalo vs Baltimore
459
Buffalo
460
Baltimore
Current Line
+7½-103
-7½-107
Wagers Placed
5220
33.83%
10210
66.17%
Amount Wagered
$78,300.00
16.79%
$387,980.00
83.21%
Average Bet Size
$15.00
$38.00
SBR Contest Best Bets
5
17
01:00 PM
/
Jacksonville vs N.Y. Giants
461
Jacksonville
462
N.Y. Giants
01:00 PM
/
Tampa Bay vs New Orleans
463
Tampa Bay
464
New Orleans
01:00 PM
/
Houston vs New England
465
Houston
466
New England
01:00 PM
/
Tennessee vs Miami
467
Tennessee
468
Miami
04:05 PM
/
Kansas City vs L.A. Chargers
469
Kansas City
470
L.A. Chargers
04:25 PM
/
Seattle vs Denver
471
Seattle
472
Denver
04:25 PM
/
Dallas vs Carolina
473
Dallas
474
Carolina
04:25 PM
/
Washington vs Arizona
475
Washington
476
Arizona
08:20 PM
/
Chicago vs Green Bay
477
Chicago
478
Green Bay
Media
Site Map
Terms of use
Contact Us
Privacy Policy
DMCA
18+. Gamble Responsibly.
© Sportsbook Review. All Rights Reserved.
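As a follow-up, once the wait has succeeded you can also read the located elements directly instead of dumping the whole page text (a sketch reusing the XPath from the wait above; what exactly each matched span contains is an assumption based on that locator):

elements = driver.find_elements(By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")
for el in elements:
    print(el.text)  # text of each matched element, one per game section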
This solution is only worth considering if there are lots of WebDriverWait calls and reduced runtime is important - otherwise, go for DebanjanB's approach.
You need to wait some time to let your HTML load completely. You can also set a timeout for script execution. To add an unconditional wait to driver.get(URL) in Selenium, use driver.set_page_load_timeout(n) with n = time in seconds, and loop:
import random
import logging
from time import sleep

logger = logging.getLogger(__name__)  # assumes a configured logger; added so the snippet is self-contained

driver.set_page_load_timeout(n)            # Set timeout of n seconds for page load

loading_finished = 0                       # Set flag to 0
while loading_finished == 0:               # Repeat while flag == 0
    try:
        sleep(random.uniform(0.1, 0.5))    # wait some time
        website = driver.get(URL)          # try to load for n seconds
        loading_finished = 1               # Set flag to 1 and exit while loop
        logger.info("website loaded")      # Indicate load success
    except:
        logger.warning("timeout - retry")  # Indicate load fail
else:                                      # If flag == 1 (loop exited normally)
    driver.set_script_timeout(n)           # Set timeout of n seconds for script
    script_finished = 0                    # Set flag to 0
    while script_finished == 0:            # Second loop
        try:
            print(driver.execute_script("return document.documentElement.innerText;"))
            script_finished = 1            # Set flag to 1
            logger.info("script done")     # Indicate script done
        except:
            logger.warning("script timeout")
    else:
        logger.info("if you're still missing html here, increase timeout")