BeautifulSoup, lxml not downloading time text - python

I'm trying to download data from a website. Everything works fine except when it encounters dates, which just come back as "". I've looked at the HTML the program downloads, and there is nothing between the tags, which is why it returns nothing. When you inspect the HTML in the browser you can see the dates there clearly. Does anyone have any ideas?
from bs4 import BeautifulSoup
import requests

stocks = ["3PL"]
keys = list()
values = list()
for stock in stocks:
    source = requests.get(r"https://www.reuters.com/companies/" + stock + ".AX/key-metrics").text
    soup = BeautifulSoup(source, 'lxml')
    for data in soup.find_all("tr", class_="data"):
        keys.append(data.th.text)
        if data.td.text != "--":
            values.append(data.td.text)
        else:
            values.append("nan")
print(keys[3])
print(values[3])  # This should return the date

It would seem your data is added with JavaScript. This is something requests will not handle, as it won't render the page like a normal browser; it only fetches the raw HTML.
However, you can use the selenium package to do this successfully. To install it:
pip install selenium
You may need to set up a web driver to use Firefox or Chrome, but in the case below I used the browser that worked out of the box for me: Safari.
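If you would rather use Chrome, a minimal sketch (assuming Selenium 4.6+, where Selenium Manager downloads a matching driver automatically, so no manual driver setup is needed):

from selenium import webdriver

# Selenium 4.6+ resolves a matching chromedriver via Selenium Manager;
# on older versions you must download chromedriver and put it on PATH.
driver = webdriver.Chrome()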
I have adjusted your code a little to use the selenium package, and put your data in a dict for consistency.
from bs4 import BeautifulSoup
from selenium import webdriver

stocks = ["3PL"]
response_data = {}
driver = webdriver.Safari()
for stock in stocks:
    url = r"https://www.reuters.com/companies/" + stock + ".AX/key-metrics"
    driver.get(url)
    source = driver.page_source  # the HTML after JavaScript has run
    soup = BeautifulSoup(source, 'lxml')
    for data in soup.find_all("tr", class_="data"):
        if data.td.text != "--":
            response_data[data.th.text] = data.td.text
        else:
            response_data[data.th.text] = 'nan'
driver.close()
Now you can check if the data is correctly downloaded:
print(response_data['Pricing date'])
Sep-04

Related

I'm Trying To Scrape The Duration Of Tiktok Videos But I am Getting 'None'

I want to scrape the duration of TikTok videos for an upcoming project, but my code isn't working:
import requests
from bs4 import BeautifulSoup

content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
Using an example TikTok.
I would think this would work; could anyone help?
If you turn off JavaScript and then inspect the element in Chrome DevTools, you will see that the value is something like 00/000. Once JS is enabled and the video is playing, the duration counts up until the video finishes, so the real value of that element depends on JS. You therefore have to use an automation tool such as selenium to grab that dynamic value. How much of the duration you scrape depends on time.sleep() if you are using selenium; if time.sleep() is longer than the video length, it will throw a NoneType error.
Example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

url = 'https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)  # let the video play; the counter advances while it plays

soup = BeautifulSoup(driver.page_source, "lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
The hash in that class name is likely randomized per build. Try using a regex to get the element by a class ending in 'TimeContainer' plus some other id:
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
Your next issue is that the page loads before the video does, so you'll get 0/0 for the time. Try selenium instead, so you can add waits for loading; a sketch follows below.
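A minimal sketch of an explicit wait (assuming selenium with Chrome; the CSS locator matching any class containing 'TimeContainer' is an assumption and may need adjusting):

import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://vm.tiktok.com/ZMFFKmx3K/')

# Wait up to 30s for an element whose class mentions TimeContainer
# (assumed locator) instead of guessing with a fixed sleep.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div[class*='TimeContainer']"))
)
soup = BeautifulSoup(driver.page_source, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer')})
print(data.text if data else None)
driver.quit()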

Web scraping an element using BeautifulSoup and Python

I am trying to grab an element from tradingview.com, specifically this link. I want the price of a symbol for whatever link I give my program. Looking through the elements of the URL, I noticed I can find the price of the stock here.
<div class="tv-symbol-price-quote__value js-symbol-last">
"3.065"
<span class>57851</span>
</div>
When running the code below, I get the output shown after it.
#This will not run on online IDE
import requests
from bs4 import BeautifulSoup
URL = "https://www.tradingview.com/symbols/NEARUSD/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')  # if this line causes an error, run 'pip install html5lib'
L = [soup.find_all(class_ = "tv-symbol-price-quote__value js-symbol-last")]
print(L)
Output:
[[<div class="tv-symbol-price-quote__value js-symbol-last"></div>]]
How can I grab the entire price from this website? I would like the 3.065 as well as the 57851.
You have the most common problem: the page uses JavaScript to add and update elements, but BeautifulSoup/lxml and requests/urllib can't run JS. You may need Selenium to control a real web browser that can run JS. Alternatively, use DevTools in Firefox/Chrome (Network tab) manually to see if JavaScript reads the data from some URL, and try using that URL with requests; JS usually gets JSON, which converts easily to a Python dictionary (without BS). You can also check whether the page has a (free) API for programmers.
Using DevTools, I found that the page uses JavaScript to send a POST (with some JSON data) and gets a fresh price back.
import requests

payload = {
    "columns": ["market_cap_calc", "market_cap_diluted_calc", "total_shares_outstanding", "total_shares_diluted", "total_value_traded"],
    "range": [0, 1],
    "symbols": {"tickers": ["BINANCE:NEARUSD"]}
}

url = 'https://scanner.tradingview.com/crypto/scan'
response = requests.post(url, json=payload)
print(response.text)

data = response.json()
print(data['data'][0]["d"][1] / 1_000_000_000)  # market_cap_diluted_calc, scaled to billions
Result:
{"totalCount":1,"data":[{"s":"BINANCE:NEARUSD","d":[2507704855.0467912,3087555230,812197570,1000000000,106737372.9550421]}]}
3.08755523
EDIT:
It seems the code above gives only the market cap, and the page uses a websocket to get a fresh price every few seconds:
wss://data.tradingview.com/socket.io/websocket?from=symbols%2FNEARUSD%2F&date=2022_10_17-11_33
Handling that would need more complex code.
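As a rough sketch only (assuming the websocket-client package, pip install websocket-client, and an Origin header the server may require): this just opens the socket and prints the first raw frames; the actual quote-subscription protocol is undocumented and would need reverse engineering.

import websocket  # pip install websocket-client

ws = websocket.create_connection(
    "wss://data.tradingview.com/socket.io/websocket",
    header={"Origin": "https://www.tradingview.com"},  # assumption: server checks Origin
)
for _ in range(3):
    print(ws.recv())  # raw protocol frames, not yet a usable price
ws.close()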
The other answer (with Selenium) gives you the correct value.
The webpage's contents are loaded dynamically by JavaScript, so you have to use an automation tool such as selenium, or a hidden API.
Here I use selenium with bs4 to grab the desired dynamic content.
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

url = "https://www.tradingview.com/symbols/NEARUSD/"
driver.get(url)
driver.maximize_window()
time.sleep(5)  # give the JavaScript time to render the price

soup = BeautifulSoup(driver.page_source, "lxml")
price = soup.find('div', class_="tv-symbol-price-quote__value js-symbol-last").get_text(strip=True)
print(price)
Output:
3.07525163

Not getting all the information when scraping bet365.com

I am having problems when trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, players' names don't appear. Is there another framework or configuration for extracting the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")

soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless client; it's just something that downloads and parses HTML, so it can't see anything that's populated by JavaScript. You'd need a headless browser like selenium to scrape something like that.
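A minimal sketch of driving a headless browser (assuming Chrome and Selenium 4; --headless=new is the headless flag in recent Chrome):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.bet365.com/")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()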
You need to use selenium instead of requests, along with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window()  # optional, if you want to maximize the browser
driver.implicitly_wait(60)  # optional, waits up to 60s for elements while loading
soup = BeautifulSoup(driver.page_source, 'html.parser')  # get the soup

My python script does not print table from html

I am trying to get table data with the code below, but surprisingly the script prints None for the table, though I can clearly see it in my HTML doc.
Looking forward to help.
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup

site = 'http://www.altrankarlstad.com/wisp'
hdr = {'User-Agent': 'Chrome/78.0.3904.108'}
req = Request(site, headers=hdr)
res = urlopen(req)
rawpage = res.read()
page = rawpage.replace("<!-->", "")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", {"class": "table workitems-table mt-2"})
print(table)
Here is also the code with the Selenium script, as suggested:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome('C:\\Users\\rugupta\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Python 3.7\\chromedriver.exe')
driver.get(url)
driver.find_element_by_id('root').click()  # click on search button to fetch the list of bus schedules
time.sleep(10)  # depends on how long it takes to reach the next page after the click

for i in range(1, 50):
    url = "http://www.altrankarlstad.com/wisp".format(pagenum=i)  # note: no {pagenum} placeholder, so format() changes nothing
    text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
    for h3Tag in text_field:
        print(h3Tag.text)
The page wasn't fully loaded when you used Request; you can debug by printing res. It seems the page is using JavaScript to load the table.
You should use selenium: load the page with a driver (e.g. chromedriver or the Firefox driver), sleep a while until the page is loaded (you define how long; it takes quite a bit to load fully), then get the table using selenium.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome('/path/to/chromedriver')
driver.get(url)
# I don't understand the purpose of clicking that button
time.sleep(100)  # wait until the page is fully loaded
text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
print(text_field[0].text)
Your code worked fine with a bit of modifying; this will print all the text from the table. You should learn to debug and change it to get what you want.
This is my output from running the above script.

Why python output doesn't match html for target website

I'm trying to webscrape Target's website for details such as price, name, and a JPEG of the product, but what is pulled through Python using BeautifulSoup doesn't seem to match the HTML from the website (viewed with F12).
I've tried using html.parser and lxml within the BeautifulSoup function, but neither seems to make a difference. I've tried googling similar problems but haven't found anything. I'm using Atom to run the Python code and am on Ubuntu 18.04.2. I'm pretty new to Python, but have coded a bit before.
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.target.com/s?searchTerm=dove'
# Gets html from the given url
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
items = html_soup.find_all('li', class_='bkaxin')
print(len(items))
It's supposed to output 28, but I consistently get 0.
It looks like the elements you're trying to find aren't there because they are created dynamically after the site loads. You can see that for yourself by looking at the source code when the website first loads. You can also try printing html_soup.prettify(), and you'll see that the elements you're trying to find aren't there; the quick check below demonstrates it.
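A quick sketch of that check on the static HTML (reusing the question's URL and the corrected class name):

from requests import get
from bs4 import BeautifulSoup

response = get('https://www.target.com/s?searchTerm=dove')
html_soup = BeautifulSoup(response.text, 'html.parser')
print(html_soup.prettify()[:1000])  # static markup, before any JavaScript runs
print(len(html_soup.find_all('li', class_='bkaXIn')))  # 0 on the static page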
Inspired by this question, I present a solution based on using selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.target.com/s?searchTerm=dove"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
html_soup = BeautifulSoup(html, 'html.parser')
items = html_soup.find_all('li', class_ = 'bkaXIn')
driver.close()
print(len(items))
The previous code outputs 28 when I run it.
Note that you need to install selenium (installation guide here) and the appropriate driver for this to work (in my solution I used the Firefox driver which can be downloaded here).
Also note that I use class_ = 'bkaXIn' (case sensitive!) in html_soup.find_all.
