I am trying to get table data with the code below, but surprisingly the script prints "None" for the table, even though I can clearly see it in the HTML document.
Looking forward to any help.
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup
site = 'http://www.altrankarlstad.com/wisp'
hdr = {'User-Agent': 'Chrome/78.0.3904.108'}
req = Request(site, headers=hdr)
res = urlopen(req)
rawpage = res.read()
page = rawpage.replace("<!-->", "")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", {"class":"table workitems-table mt-2"})
print (table)
Here is also the Selenium script I tried, as suggested:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome('C:\\Users\\rugupta\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Python 3.7\\chromedriver.exe')
driver.get(url)
driver.find_element_by_id('root').click() #click on search button to fetch list of bus schedule
time.sleep(10) #depends on how long it will take to go to next page after button click
for i in range(1,50):
    url = "http://www.altrankarlstad.com/wisp".format(pagenum = i)
    text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
    for h3Tag in text_field:
        print(h3Tag.text)
The page isn't fully loaded when you fetch it with Request. You can verify this by printing rawpage (the body read from res).
It seems the page uses JavaScript to load the table.
You should use Selenium instead: load the page with a driver (e.g. chromedriver, or geckodriver for Firefox), sleep until the page has loaded (you choose the delay; this page takes quite a while to load fully), and then get the table through Selenium.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome('/path/to/chromedriver')
driver.get(url)
# I don't understand the purpose of clicking that button, so the click is left out here
time.sleep(100)
text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
print (text_field[0].text)
Your code worked fine with a bit of modification; this prints all the text from the table. You should learn to debug it and adapt it to get what you want.
This is my output when running the script above.
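A fixed time.sleep(100) either wastes time or is too short. As a minimal sketch (the helper name wait_until and its parameters are my own, not part of Selenium), polling for a condition could look like the code below. Selenium also ships WebDriverWait in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration; returns the truthy value on success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With the driver from the answer above, the predicate could be something like lambda: driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table'), which returns an empty (falsy) list until the table exists.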
I want to scrape the duration of TikTok videos for an upcoming project, but my code isn't working:
import requests; from bs4 import BeautifulSoup
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
The URL above is just an example TikTok.
I would have thought this would work. Could anyone help?
If you turn off JavaScript and inspect the element in Chrome DevTools, you will see a placeholder value like 00/000. With JavaScript turned on and the video playing, the elapsed time counts up until the video finishes. So the real value of that element depends on JavaScript, and you need an automation tool such as Selenium to grab that dynamic value. With Selenium, how much elapsed time you capture depends on time.sleep(). If time.sleep() is longer than the video length, data comes back as None and data.text fails with a NoneType error.
Example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url ='https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)
soup = BeautifulSoup(driver.page_source,"lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
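Once you have the seek-bar text, you still need to split it to get the duration. A small sketch, assuming the text always has the MM:SS/MM:SS shape shown in the output above (the function name parse_seek_bar is my own):

```python
def parse_seek_bar(text):
    """Convert 'MM:SS/MM:SS' into (elapsed_seconds, total_seconds)."""

    def to_seconds(clock):
        minutes, seconds = clock.strip().split(":")
        return int(minutes) * 60 + int(seconds)

    elapsed, total = text.split("/")
    return to_seconds(elapsed), to_seconds(total)


elapsed, total = parse_seek_bar("00:25/00:28")
print(total)  # 28
```

The second number is the video duration, so it doesn't matter how far playback had progressed when the page source was captured.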
The ID in the class name is likely randomized. Try using a regex to get the element by a class ending in 'TimeContainer', whatever ID precedes it:
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
Your next issue is that the page loads before the video does, so you'll get 0/0 for the time. Try Selenium instead, so you can add waits for the video to load.
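To see why matching on the suffix is more robust, here is a rough standalone sketch of how such a pattern behaves against individual class tokens. The second and third class strings are made up for illustration, and this only mimics the matching idea, not BeautifulSoup's internals.

```python
import re

# Matches a class token that ends in "TimeContainer", whatever precedes it.
pattern = re.compile(r"TimeContainer$")


def matching_tokens(class_attr):
    """Return the class tokens in a class attribute string that match the pattern."""
    return [token for token in class_attr.split() if pattern.search(token)]


print(matching_tokens("tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1"))
print(matching_tokens("tiktok-zz99xx-DivSeekBarTimeContainer e9abc"))  # hypothetical re-hashed name
print(matching_tokens("tiktok-abc123-DivVideoWrapper"))  # hypothetical unrelated element
```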
I am having a problem when trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is, the code below doesn't get all the information on the page, for example players' names don't appear. Maybe another framework or configuration to extract the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request
url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless client; it only parses the HTML you hand it (here, whatever urllib downloaded), so it can't see anything that JavaScript populates afterwards. You'd need a real browser driven by something like Selenium to scrape a page like that.
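A quick way to convince yourself is to count the tags in the HTML the server actually sends. The sketch below uses only the standard library and a made-up skeleton page (STATIC_HTML is an assumption standing in for a typical JavaScript-driven page, not the real response from any of these sites):

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


# Hypothetical server response: an empty mount point plus a script that fills it in later.
STATIC_HTML = """
<html><body>
<div id="root"></div>
<script src="/static/js/main.js"></script>
</body></html>
"""

counter = TagCounter()
counter.feed(STATIC_HTML)
print(counter.counts.get("table", 0))   # 0: no table in the raw HTML
print(counter.counts.get("script", 0))  # 1: the script that would build it in a browser
```

If the tag you are after has a count of zero in the raw response, no parser will find it, and you need either a browser or the underlying data request.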
You need to use Selenium instead of requests, together with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window() #optional, if you want to maximize the browser
driver.implicitly_wait(60) ##Optional, Wait the loading if error
soup = BeautifulSoup(driver.page_source, 'html.parser') #get the soup
I can clearly see the tag I need in order to get the data I want to scrape.
According to multiple tutorials, I am doing it exactly the same way.
So why does it give me "None" when I simply want to display the code inside the li with class list-item?
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
soup = BeautifulSoup(response.text,'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)
Whilst the page does dynamically update (it makes additional requests from browser to update content which you don't capture with your single request) you can find the source URI in the network tab for the content of interest. You also need to add the expected header.
import requests
from bs4 import BeautifulSoup as bs
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = bs(r.content, 'lxml')
print(len(soup.select('.list-item')))
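The long URL is easier to maintain if it is assembled from its query parameters. A sketch using only the standard library, which builds the request without sending it (the function name and its parameters are my own; the parameter names and the empty `_` value are copied from the URL above):

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://www.governmentjobs.com/careers/home/index"


def build_listing_request(agency, sort="PositionTitle", descending=False):
    """Build (but do not send) the XHR-style request for the job listing."""
    query = urlencode({
        "agency": agency,
        "sort": sort,
        "isDescendingSort": str(descending).lower(),
        "_": "",  # left empty, as in the URL above
    })
    # The header marks the request as an XMLHttpRequest, as the answer describes.
    return Request(f"{BASE}?{query}", headers={"X-Requested-With": "XMLHttpRequest"})


req = build_listing_request("sdcounty")
print(req.full_url)
```

Passing req to urllib.request.urlopen would perform the request. With requests, the same dictionary can be passed as params= instead.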
There is no such content in the original page. The search results which you're referring to, are loaded dynamically/asynchronously using JavaScript.
Print response.text to verify that (I got the same result using ReqBin). You'll find that the text list-item does not appear in it.
Unfortunately, you can't run JavaScript with BeautifulSoup.
Another way to handle dynamically loaded data is to use Selenium instead of requests to get the page source. This should wait for the JavaScript to load the data and then give you the corresponding HTML. This can be done like so:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless") # Opens the browser up in background
with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)
I am having a problem with bs4 only finding some things in the HTML. Specifically, when I select span.nav2__menu-link-main-text it prints without a problem, but when I select another part of the page it prints nothing. Here is the code that prints and the code that doesn't:
I tried parsers other than lxml and none worked.
#This one prints
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('span.nav2__menu-link-main-text'):
    print(i.text)
#This one does not print
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://osu.ppy.sh/users/https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('div.value-dispaly__value'):
    print(i.text)
I expect this program to print the current value of div.value-dispaly__value,
but when I start the program it prints nothing, even though I can see the value is 4000 when I inspect the page.
It seems the content you want is added to the page dynamically by JavaScript.
To execute the page's JavaScript you have to use a render() function. Note that render() comes from the requests-html library, not from plain requests.
The page fetches its data through JavaScript requests, so you need an automation library such as Selenium. Download the Selenium web driver that matches your browser.
Download selenium web driver for chrome browser:
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Replace your code with this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get('https://osu.ppy.sh/users/12008062')
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'lxml')
for i in soup.find_all('div',{"class":"value-display__value"}):
    print(i.get_text())
Output:
#47,514
#108
11d 19h 49m
44
4,000
11d 19h 49m
44
4,000
#47,514
#108
0
0
I am trying to get some information from Instagram by scraping it. This code worked fine on Twitter, but it shows no result on Instagram. Both versions are below.
Twitter code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://twitter.com/realmadrid"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"ProfileHeaderCard"}))
Result: Perfectly given.
Instagram Code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Result: None
If you look at the source, you will see the content is dynamically loaded, so there is no div._bugdy in what your request returns. Depending on what you want, you may be able to pull it from the JSON in the script tag:
import requests
import re
import json
from bs4 import BeautifulSoup
r = requests.get("https://www.instagram.com/barackobama/")
soup = BeautifulSoup(r.content, "html.parser")
js = soup.find("script",text=re.compile("window._sharedData")).text
_json = json.loads((js[js.find("{"):js.rfind("}")+1]))
from pprint import pprint as pp
pp(_json)
That gives you everything you see in the <script type="text/javascript">window._sharedData = ..... in the source returned.
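The slicing step can be tried in isolation. Below is a small sketch on a made-up script body. The keys in sample are invented and do not reflect Instagram's real payload; only the find/rfind slicing mirrors the answer above.

```python
import json


def extract_shared_data(script_text):
    """Slice from the first '{' to the last '}' and parse the result as JSON."""
    start = script_text.find("{")
    end = script_text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("no JSON object found in script text")
    return json.loads(script_text[start:end])


# Hypothetical script contents, for illustration only.
sample = 'window._sharedData = {"example_key": {"nested": [1, 2, 3]}};'
data = extract_shared_data(sample)
print(data["example_key"]["nested"])  # [1, 2, 3]
```

Slicing between the outermost braces drops both the assignment prefix and the trailing semicolon, which is why json.loads accepts the remainder.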
If you want to get the followers, you will need to use something like Selenium. The site is pretty much all dynamically loaded content, and to get the followers you need to click a link that is only visible when you are logged in. This will get you closer to what you want:
from selenium import webdriver
import time
login = "https://www.instagram.com"
dr = webdriver.Chrome()
dr.get(login)
dr.find_element_by_xpath("//a[@class='_k6cv7']").click()
dr.find_element_by_xpath("//input[@name='username']").send_keys("youruname")
dr.find_element_by_xpath("//input[@name='password']").send_keys("yourpass")
dr.find_element_by_css_selector("button._aj7mu._taytv._ki5uo._o0442").click()
time.sleep(5)
dr.get("https://www.instagram.com/barackobama")
dr.find_element_by_css_selector('a[href="/barackobama/followers/"]').click()
time.sleep(3)
for li in dr.find_element_by_css_selector("div._n3cp9._qjr85").find_elements_by_xpath("//ul/li"):
    print(li.text)
That pulls some text from the li tags that appear in the popup after you click the link. You can pull whatever you want from the unordered list.
First of all there seems to be a typo in the address on line 3.
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Secondly, since you are working with dynamically loaded content, Python might not be able to see all the content you see when browsing the page in your browser.
In order to solve that there are different webdrivers, such as Selenium webdriver (http://www.seleniumhq.org/projects/webdriver/) and PhantomJS (http://phantomjs.org/) which emulate the browser and can wait for Javascript to generate/display data before looking it up.