Scraping Webpage with Javascript Elements - python

To preface: the website I've been trying to scrape seems to use JavaScript (I'm unsure about the web-development jargon), and I've had varying success scraping different tables on different pages.
For instance, on this page: http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic I was easily able to 'inspect element', go to the Network tab, find the correct 'Name' of the script, and then find the Request URL I needed to get the table that I wanted. The code I used for this was:
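import pandas as pd
import requests
from io import StringIO
from bs4 import BeautifulSoup
url = 'http://www.minorleaguesplits.com/tennisabstract/cgi-bin/frags/NovakDjokovic.js'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find('table', id='tour-years', attrs={'class': 'tablesorter'})
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
dfs = pd.read_html(StringIO(str(table)))
df = pd.concat(dfs)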
However, now when I'm looking at a different page on the same site, say this one http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html, I'm unable to find the Request URL that will allow me to eventually get the table that I want. I repeat the same process as I did above, but there's no .js script under the Network tab that has the table. I do see the table when I'm looking at the html elements, but of course I can't get it without the correct url.
So my question is: how can I get the table from this page: http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html?
TIA!

Looking at the source code of the HTML page, you can see that all the data is already loaded in a script tag. All you need to do is extract the variable values and load them into BeautifulSoup.
The following code gets the part of the script tag that holds all the variables and their values:
import requests, re
from bs4 import BeautifulSoup
res = requests.get("http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html")
soup = BeautifulSoup(res.text, "lxml")
script = soup.find("script", attrs={"language":"JavaScript"}).text
var_only = script[:script.index("$(document)")].strip()
Next you can use regex to get the variable values - https://regex101.com/r/7cE85A/1
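To make the regex step concrete, here is a minimal sketch. It assumes the script assigns single-quoted strings in the form var name = '...'; which is an assumption about the page, not something confirmed by the regex101 link. The sample var_only string and the variable names in it are made up for illustration.

```python
import re

# Hypothetical stand-in for the `var_only` string built above;
# the real page's variable names and contents may differ.
var_only = """
var player1 = 'Roger Federer';
var serve = '<table id="serve"><tr><th>Player</th><th>Aces</th></tr></table>';
"""

# Capture the name and the single-quoted value of each assignment.
# Assumption: values are single-quoted and never contain the sequence ';
pattern = re.compile(r"var\s+(\w+)\s*=\s*'(.*?)';", re.DOTALL)
variables = dict(pattern.findall(var_only))

print(sorted(variables))
print(variables["serve"])
```

Each value that holds table markup can then be passed to BeautifulSoup (or pd.read_html) like any other HTML string.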

Related

Extract specific JS value from web page using BeautifulSoup Python

I want to extract a field from the following web page: https://www.olx.bg/d/ad/podemnitsi-haspeli-tovarni-i-kuhnenski-asansori-motor-reduktori-CID1012-ID8pWNq.html
The value that I want to get is this one ( 3143 ):
I tried to do it, but no success, the value is JS generated. Here is my code so far.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.olx.bg/d/ad/podemnitsi-haspeli-tovarni-i-kuhnenski-asansori-motor-reduktori-CID1012-ID8pWNq.html')
soup = BeautifulSoup(page.content, 'html.parser')
Do you have any idea how I can do this?
Check the source code of the webpage (Ctrl + U in the browser) and search it to see whether the value you want is present.
If it is present in the page source, extract it from there; a regex can work if, for example, the value sits inside a JSON object.
If it is not, try the selenium library.
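As a rough sketch of the "regex for a value inside JSON" case: the page source, the variable name window.__AD_DATA__ and its keys below are all hypothetical, chosen only to show the pattern. The real page will use different names, so check its source first.

```python
import json
import re

# Hypothetical page source; names and keys are made up for illustration.
page_source = """
<html><body>
<script>window.__AD_DATA__ = {"id": "8pWNq", "wanted_value": 3143};</script>
</body></html>
"""

# Grab the object literal assigned to the variable, then parse it as JSON.
# Assumption: the object is valid JSON and contains no nested "};" sequence.
match = re.search(r"window\.__AD_DATA__\s*=\s*(\{.*?\});", page_source, re.DOTALL)
ad_data = json.loads(match.group(1)) if match else {}

print(ad_data.get("wanted_value"))
```

If the search finds nothing in the raw source, the value is probably filled in by a later request, and a browser-driven tool such as selenium is the fallback.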

Unable to find tag when data scraping

I am new to Python and I've been working on a program that alerts you when a new item is uploaded to jp.mercari.com (a shopping site). I have the alert part of the program working, but it operates based on the number of items that come up on the search results. When I scrape the website I am unable to find what I am looking for despite being able to locate it when I inspect element on the page. The scraping program looks like this:
from bs4 import BeautifulSoup
import requests
url = "https://jp.mercari.com/search?keyword=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("mer-text")
print(tag)
For more context, this is the website and some of the HTML. I've circled the parts I am trying to find in red:
Does anyone know why I am unable to find what I'm looking for?
Here is another example of the same problem but from a website that is in English:
from bs4 import BeautifulSoup
import requests
url = "https://www.vinted.co.uk/vetements?search_text=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("span")
print(tag)
Again, I can see the part of HTML I want to find when I inspect element but I can't find it when I scrape the website:
Here's what happens when I run it: the element you seek (<mer-text>) is being found. However, the output is in Japanese, which may not display properly in your console. In my browser, it's translated to English automatically by Google, so that's easier to deal with.

Beautiful Soup (Python) not seeing text inside of span

I can't figure out why BS4 is not seeing the text inside of the span in the following scenario:
Page: https://pypi.org/project/requests/
Text I'm looking for - number of stars on the left hand side (around 43,000 at the time of writing)
My code:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).text
also tried:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).get_text()
Both return an empty string ''. The element itself seems to be located correctly (I can browse through parents / siblings in the PyCharm debugger without a problem). Fetching text in other parts of the website also works perfectly fine. It's just the GitHub-related stats that fail to fetch.
Any ideas?
That's because this page uses JavaScript to load the content dynamically, so you can't get it directly from response.text.
The source code of the page:
You could crawl the API directly:
import requests
r = requests.get('https://api.github.com/repos/psf/requests')
print(r.json()["stargazers_count"])
Result:
43010
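The API URL above follows GitHub's /repos/{owner}/{repo} pattern, so the same trick works for other projects. A small sketch of deriving it from a repository link; github_api_url is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlparse


def github_api_url(repo_url: str) -> str:
    """Hypothetical helper: map a github.com repo URL to its REST API URL."""
    # Keep only the non-empty path segments, e.g. ["psf", "requests"]
    parts = [p for p in urlparse(repo_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {repo_url!r}")
    owner, repo = parts[0], parts[1]
    # Tolerate clone-style URLs ending in .git
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"https://api.github.com/repos/{owner}/{repo}"


print(github_api_url("https://github.com/psf/requests"))
```

The returned URL can be passed to requests.get exactly as in the snippet above.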
With bs4 alone, we can't scrape the star count.
After inspecting the site, check the HTML of the actual response.
It contains the element with the class "github-repo-info__item", but the element has no text.
In this case, use selenium.

Python (Selenium/BeautifulSoup) Search Result Dynamic URL

Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
I am able to Inspect the information I need per this image:
Once you have your soup variable with the HTML, follow the code below.
import json
data = soup.find('search-result')['data']
print(data)
Output (a JSON string; once parsed, each value can be accessed like a dict entry):
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name. Other fields work the same way: use their key, e.g. 'date_of_birth' or 'siteId'. You can also assign them to variables.
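Nested values work the same way. As a sketch, the snippet below uses an abbreviated copy of the JSON string shown above (only a few keys kept) and collects the entries under "links" into a dict keyed by "rel". Prefixing "https:" is an assumption, since the hrefs in the data are protocol-relative.

```python
import json

# Abbreviated copy of the JSON string from the output above
data = (
    '{"first_name":"WESLEY","last_name":"ADAMS","date_of_birth":"1965-11-21",'
    '"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},'
    '{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}]}'
)

info = json.loads(data)

# Map each link's "rel" to an absolute URL (https scheme assumed)
links = {link["rel"]: "https:" + link["href"] for link in info.get("links", [])}

print(info["date_of_birth"])
print(links["self"])
```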

scraping a constantly changing integer from a website

I am trying to extract numeric data from a website. I tried using a simple web scraper to retrieve the data:
from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
url = "http://www.oanda.com/currency/live-exchange-rates/"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html, 'html.parser')
data1 = soup.find(id='EUR_USD-b-int')
print(data1)
This kind of approach would normally give the line of data from the website, including the contents of the element I am trying to extract. However, it gives everything but the contents, which is the part I need. I have tried .contents and it returns []. I've also tried .child and it returns None. Does anyone know another method that could work? I have looked through the Beautiful Soup documentation but I can't seem to find a solution.
The value on this page is updated using Javascript by making a request to
GET http://www.oanda.com/lfr/rates_lrrr?tstamp=1392757175089&lrrr_inverts=1
Referer: http://www.oanda.com/currency/live-exchange-rates/
(Be aware that I was blocked 4 times just looking at this, they are extremely block-happy. This is because they sell this data commercially as a subscription service.)
The request is made and the response parsed in http://www.oanda.com/jslib/wl/lrrr/liverates.js. The response is "encrypted" with RC4 (http://en.wikipedia.org/wiki/RC4)
The RC4 decrypt method comes from http://www.oanda.com/wandacache/rc4-ea63ca8c97e3cbcd75f72603d4e99df48eb46f66.js. It looks like this file is refreshed often, so you'll need to grab the latest link from the homepage and extract the var key=<value> to fully decrypt the value.
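For reference, RC4 itself is short enough to write by hand. The sketch below is a plain textbook RC4, not a port of the site's script. How the site encodes its response (hex, base64 or something else) and what its key looks like are not covered here, so the key and data used are just the well-known "Key" / "Plaintext" test vector.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4; encryption and decryption are the same operation."""
    # Key-scheduling algorithm: permute the state array using the key
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Pseudo-random generation: XOR each input byte with a keystream byte
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())             # bbf316e8d940af0ad3
print(rc4(b"Key", ciphertext))      # b'Plaintext'
```

Because RC4 is symmetric, decrypting the site's response would mean calling the same function with the extracted key, after undoing whatever text encoding the response uses.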
