scraping inside script tag with beautifulsoup - python

I'm scraping data from an e-commerce site and I need the model number of each laptop. The div tags don't contain the model numbers, but I found the model number inside a script tag as "productCode". For this example it's:
"productCode":"MGND3TU/A"
How can I extract the "productCode" value? I couldn't work it out from other posts.
Edit: I found the ‘productCode’ inside a script tag, but I don’t know how to get it. You can check the page source.

Since the JSON is hidden in the <head>, it can be parsed, but with some custom logic.
Unfortunately the script tag assigns the JSON to a window variable, so we'll need to strip that assignment before we can parse it.
Get url
Get all <script>
Check if PRODUCT_DETAIL_APP_INITIAL_STAT exist in the string (valid json)
Remove the prefix (hardcoded)
Find the index of the next key (hardcoded)
Remove after the suffix
Try to parse to json
Print json['product']['productCode'] if it exists
import json
import requests
from bs4 import BeautifulSoup

reqs = requests.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
soup = BeautifulSoup(reqs.text, 'html.parser')
for sc in soup.find_all('script'):
    if len(sc.contents) > 0 and "PRODUCT_DETAIL_APP_INITIAL_STAT" in sc.contents[0]:
        # Drop the hard-coded 44-character "window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=" prefix
        withoutBegin = sc.contents[0][44:]
        # Cut everything from the next window assignment onwards
        endIndex = withoutBegin.find('window.TYPageName=') - 1
        withoutEnd = withoutBegin[:endIndex]
        try:
            j = json.loads(withoutEnd)
            if j['product']['productCode']:
                print(j['product']['productCode'])
        except Exception as e:
            print("Unable to parse JSON")
            continue
Output:
MGND3TU/A
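The hard-coded offsets above stop working as soon as the site renames or reorders its window variables. Here is a minimal sketch of a less brittle variant, assuming the script text still contains an assignment of the form window.NAME={...}. The helper name extract_window_json and the sample script string are made up for illustration and are not taken from the real page. It finds the assignment with str.find and lets json.JSONDecoder.raw_decode stop at the end of the first JSON value, so no suffix has to be trimmed:

```python
import json


def extract_window_json(script_text, var_name):
    """Return the JSON value assigned to window.<var_name>, or None.

    Hypothetical helper: assumes the assignment looks like
    `window.<var_name>={...}` with valid JSON on the right-hand side.
    """
    marker = "window." + var_name
    start = script_text.find(marker)
    if start == -1:
        return None
    # Skip past the variable name to the first "{" of the JSON object.
    brace = script_text.find("{", start + len(marker))
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        value, _end = json.JSONDecoder().raw_decode(script_text, brace)
    except json.JSONDecodeError:
        return None
    return value


# Made-up sample that imitates the structure described above.
sample_script = (
    'window.__PRODUCT_DETAIL_APP_INITIAL_STATE__='
    '{"product":{"productCode":"MGND3TU/A","name":"x};y"}};'
    'window.TYPageName="ProductDetail";'
)
state = extract_window_json(sample_script, "__PRODUCT_DETAIL_APP_INITIAL_STATE__")
product_code = state["product"]["productCode"]
print(product_code)
```

Because the JSON parser itself decides where the object ends, braces or semicolons inside string values (as in the sample's "name" field) do not confuse it.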

In this case BeautifulSoup is not needed, because the response text can be searched directly with a regex:
json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
Example
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
json_data = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
print(json_data['product']['productCode'])
Output
MGND3TU/A

That's because those tags are generated using JavaScript. When you send a request to that URL, the response contains the data (technically JSON) that a JS script uses to build the DOM for you.
To see what the response actually contains, either print r.text (where r is the object returned by requests.get()) or open "view page source" in the browser (not the inspect element panel).
To solve it, you can either use something that can render JS, just like your browser does, for example Selenium. The requests module cannot render JS; it only sends requests and receives responses.
Or you can manually extract that JSON text from the response (using a regex, for example) and then create a Python dictionary from it.

Related

Python Requests Library - Scraping separate JSON and HTML responses from POST request

I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
JSON Response
HTML Response
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a post method) to include "accept: application/json, text/plain, */*" but didn't see a difference in the response I was getting with requests.post. As it stands I can't parse any JSON from the response I get with requests.post and get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
This is probably because the javascript the page sends to your browser is making a request to an API to get the json info about the movies.
You could either send the request directly to their API (see Edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with Scrapy. In my experience it is much faster than requests.
Edit:
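If the page uses dynamically loaded content, which I think is the case, you'd have to use Selenium with a headless browser such as PhantomJS instead of requests. (Selenium 4 dropped PhantomJS support, so on current versions a headless Chrome or Firefox driver takes its place.) Here is an example: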
from bs4 import BeautifulSoup
from selenium import webdriver
url = "your url"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their api you can just reproduce the request you see. Using google chrome, you can see the request if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was byte string
# (it looks like b'data' instead of 'data' when you print it)
# if this is your case, convert it to a string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the url as you see fit, for example if it is something like http://api.movies.com/?page=1&movietype=3 you could modify movietype=3 to movietype=2 to see a different type of movie, etc
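Editing the query string by hand gets error-prone once there are several parameters. The sketch below changes one parameter with the standard library's urllib.parse. The helper name with_query_param is hypothetical, and the URL is the illustrative one from the paragraph above, not a real API:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return `url` with query parameter `key` set to `value` (hypothetical helper)."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters such as "?here=" that have no value.
    # Note: building a dict collapses repeated keys to their last occurrence.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[key] = str(value)
    return urlunsplit(parts._replace(query=urlencode(query)))


base_url = "http://api.movies.com/?page=1&movietype=3"
new_url = with_query_param(base_url, "movietype", 2)
print(new_url)
```

The resulting URL can then be passed to requests.get exactly as in the snippet above.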

Why I can't scrape Indiegogo although I am using the right xpath?

Here is the code I am using to scrape an Indiegogo project, but I get nothing:
import requests
from lxml import html  # assuming `html` refers to lxml.html

url = 'https://www.indiegogo.com/projects/red-dot-watch'
page = requests.get(url=url)
tree = html.fromstring(page.content)
pledged = tree.xpath('//*[@id="main"]/div/div[2]/div/div/div[16]/div[1]/span[1]/span/span[2]/span[1]/text()')
if len(pledged) > 0:
    print(pledged[0])
else:
    print("MISSING")
As @Ron said, Indiegogo is rendering its contents mostly via JavaScript, and simply requesting the page with Requests does not do that.
Happily, though, the structure of the Indiegogo pages may make it even easier for you to scrape things; there's a gon.campaign={...} JavaScript statement that seems to contain the data you're looking for. You should be able to use a regexp in the vein of gon.campaign=(\{.+\});gon to extract the data, then parse it as JSON.
EDIT: Here's an example - should work until Indiegogo decides to change their layout.
import re
import requests
import json
url = 'https://www.indiegogo.com/projects/red-dot-watch'
resp = requests.get(url)
resp.raise_for_status()
m = re.search(r'gon\.campaign=(\{.+?\});gon', resp.text)
if m:
    data = json.loads(m.group(1))
else:
    data = {}
print(data.get('balance'), '/', data.get('target_goal'))
Because your script does not execute JavaScript, you are not seeing the same page that your browser generates.

scrape data from interactive map

I'm trying to get the data from each pop-up on the map. I've used BeautifulSoup in the past, but this is my first time getting data from an interactive map.
Any push in the right direction is helpful. So far I'm getting blanks back.
Here's what I have; it isn't substantial...
from bs4 import BeautifulSoup as bs4
import requests
url = 'https://www.oaklandconduit.com/development_map'
r = requests.get(url).text
soup = bs4(r, "html.parser")
address = soup.find_all("div", {"class": "leaflet-pane leaflet-marker-pane"})
Updated
Following the recommendations, I switched to parsing the JavaScript content with re using the script below, but loading the result into JSON returns an error:
import requests, re, json
url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default/mapscript.js'
# .text returns a str, which the str regex pattern below requires
r = requests.get(url).text
content = re.findall(r'var.*?=\s*(.*?);', r, re.DOTALL | re.MULTILINE)[2]
json_content = json.loads(content)
The interactive map is loaded and driven by JavaScript, so the requests library alone is not sufficient to get the data you want: it only gets you the initial response (in this case, the HTML source code).
If you view the source for the page (on Chrome: view-source:https://www.oaklandconduit.com/development_map) you'll see that there is an empty div like so:
<div id='map'></div>
This is the placeholder div for the map.
You'll want to use a method that allows the map to load and for you to programmatically interact with it. Selenium can do this for you but will be significantly slower than requests because it has to allow for this interactivity by launching a programmatically driven browser.
I continued with regex to parse the map contents into JSON. Here's my approach, with comments in case it is helpful to others.
import re, requests, json
url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default' \
      '/mapscript.js'
# .text returns a str, which the str regex pattern below requires
r = requests.get(url).text
# use regex to get geoJSON and replace single quotes with double
content = re.findall(r'var geoJson.*?=\s*(.*?)// Add custom popups', r, re.DOTALL | re.MULTILINE)[0].replace("'", '"')
# add quotes to key: "type" and remove trailing tab from value: "description"
content = re.sub(r"(type):", r'"type":', content).replace('\t', '')
# strip the last 5 characters, which include the ";" that ends the statement
content = content[:-5]
json_content = json.loads(content)
also open to other pythonic approaches.
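A slightly more general version of the same idea can be sketched as a small helper that turns a simple JavaScript object literal into JSON text. This is only a sketch under stated assumptions: the helper name js_literal_to_json and the sample_js string are made up for illustration (they do not reproduce the real mapscript.js), and the regexes assume that string values contain no quotes, colons, or commas that could be mistaken for syntax.

```python
import json
import re


def js_literal_to_json(js_literal):
    """Convert a simple JS object/array literal to JSON text (hypothetical helper).

    Sketch only: assumes string values contain no quotes, colons or commas
    that would confuse the regular expressions.
    """
    # JSON requires double-quoted strings.
    text = js_literal.replace("'", '"')
    # Quote bare keys: an identifier preceded by "{" or "," and followed by ":".
    text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    # Drop trailing commas before a closing brace or bracket.
    text = re.sub(r',\s*([}\]])', r'\1', text)
    return text


# Made-up sample in the style of a geoJson variable declaration.
sample_js = """var geoJson = [{
    type: 'Feature',
    properties: {title: 'Sample site', description: 'Mixed use',},
}];"""

# Take everything between the first "=" and the final ";".
literal = sample_js.split("=", 1)[1].rsplit(";", 1)[0].strip()
features = json.loads(js_literal_to_json(literal))
title = features[0]["properties"]["title"]
print(title)
```

For scripts whose values break these assumptions, a real JavaScript parser is the safer route; the regex approach only suits small, regular literals.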

Invalid Argument in Open method for web scraping

I am trying to scrape some data from Ancestry. I have a .NET background but thought I'd try a bit of Python for a project.
I'm falling at the first step: I am trying to open this page and then just print out the rows.
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
raw_html = open('https://www.ancestry.co.uk/search/collections/britisharmyservice/?birth=_merthyr+tydfil-wales-united+kingdom_1651442').read()
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('tblrow record'):
    print(p)
I am getting an illegal argument on open.
According to documentation, open is used to:
Open [a] file and return a corresponding file object.
As such, you cannot use it for downloading the HTML contents of a webpage. You probably meant to use requests.get as follows:
raw_html = get('https://www.ancestry.co.uk/search/collections/britisharmyservice/?birth=_merthyr+tydfil-wales-united+kingdom_1651442').text
# .text gets the raw text of the response
# (http://docs.python-requests.org/en/master/api/#requests.Response.text)
Here are a few recommendations to improve your code as well:
requests.get provides many useful parameters, one of them being params, which allows you to provide the URL parameters in the form of a Python dictionary.
If you need to verify whether the request was successful before accessing its text, then just check if the returned response.status_code == requests.codes.ok. This only covers status code 200, but if you need more codes, then response.raise_for_status should be helpful.
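To illustrate the first recommendation, the sketch below expresses the query string from the question as a dictionary. It uses only the standard library so that it runs without a network connection. It assumes that the "+" characters in the original URL stand for spaces, and the requests call itself is shown only as a comment:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ancestry.co.uk/search/collections/britisharmyservice/"
# The same query expressed as a dictionary; spaces are encoded as "+".
params = {"birth": "_merthyr tydfil-wales-united kingdom_1651442"}

full_url = BASE_URL + "?" + urlencode(params)
print(full_url)

# With requests installed, the equivalent call would be (not executed here):
# response = requests.get(BASE_URL, params=params)
# response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
```

Keeping the parameters in a dictionary also avoids the long URL literal that is easy to break across lines by accident.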

Python3 - Urllib & BeautifulSoup4 to Extract Specific Text

I am new to python and I am trying to get a script working alongside with urllib and BeautifulSoup4 to collect the tweets which are streamable via the emojitracker API. It outputs the tweets of a specific emoji as .json files. An example is this link (opens in chrome):
http://emojitracker.com/api/details/1F52B
I can get all the text from the .json, but I only want to get the tweet (which is after "text:"). I had a look around and there was an example to get all the links on the page, using soup.findAll("a",class_="classname").
I used inspect element and found that the tweet i need is stored as: (span class="type-string")tweet goes here(/span). So I tried the following:
from bs4 import BeautifulSoup
import urllib.request
url = "http://emojitracker.com/api/details/1F52B"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(),"html.parser")
tweets = soup.findAll("span", class_="type-string")
for tweet in tweets:
    print(tweet.string)
When I ran this, it did not print anything. How can I make it print only the tweets?
The page you provide is not an HTML page. It is formatted as JSON, so you won't be able to treat it as an HTML page.
As I understand it, what you want is to retrieve all of the recent tweets.
To do this, we get the response, as you already do, then parse the response string into a Python dictionary using the json library (which does not require installation as it's part of the standard library).
We can write the following code:
import json
import urllib.request

url = "http://emojitracker.com/api/details/1F52B"
page = urllib.request.urlopen(url)
# Decode as UTF-8 (the standard JSON encoding) and use a name that
# does not shadow the json module
data = json.loads(page.read().decode('utf-8'))
for tweet in data['recent_tweets']:
    print(tweet['text'])
Hope it helps.
