Parsing with Python Help me - python

How can I pull all the coordinate values (46.10457, 67.211815, 36.130162, 67.135758) from the text below?
<script>
data = milestonesMap.getEmptyData();
data.points.push({
properties: {
balloonContentHeader: "CORDON",
},
geometry: {
type: "Point",
coordinates: [46.10457, 67.211815]
}
});
data.points.push({
properties: {
balloonContentHeader: "CORDON",
},
geometry: {
type: "Point",
coordinates: [36.130162, 67.135758]
}
});
from bs4 import BeautifulSoup
import requests
url = 'https://xn--90adear.xn--p1ai/r/21/milestones'
page = requests.get(url)
print(page.status_code)
filteredNews = []
allNews = []
soup = BeautifulSoup(page.text, "html.parser")
print(soup)

Use a regex on the page source (re.findall needs a string, not the soup object):
import re
coord = re.findall(r"coordinates: \[([0-9., ]*),([0-9., ]*)\]", page.text)
Output:
[('46.10457', ' 67.211815'), ('36.130162', ' 67.135758')]
Or just re.findall(r"coordinates: \[([0-9., ]*)\]", page.text) to get both values of each point as a single string.
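As a minimal sketch of the same idea, the pattern below lets \s* absorb the space after the comma and converts each pair to floats. The script_text sample is copied from the question; the variable names are illustrative only.

```python
import re

# Sample script text copied from the question.
script_text = """
geometry: {
type: "Point",
coordinates: [46.10457, 67.211815]
}
geometry: {
type: "Point",
coordinates: [36.130162, 67.135758]
}
"""

# \s* absorbs optional whitespace, so the captured numbers need no stripping.
pattern = re.compile(r"coordinates:\s*\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\]")

# Convert each captured pair of strings to a tuple of floats.
points = [(float(a), float(b)) for a, b in pattern.findall(script_text)]
print(points)
```

On a live page, script_text would be page.text from the request in the question.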

Related

Execute js function in HTML page scraped by python to get json data

I am scraping a product page, https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01, and when I inspect the HTML I see that all the information is stored as JSON in a script tag, under
window.INITIAL_DATA = JSON.parse('{"pa...')
I tried to fetch the HTML with requests and extract the JSON string with a regex, but my code somehow changes the JSON structure and I cannot load it with json.loads():
response = requests.get('https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
regex = "JSON.parse\(.*;"
match = re.search(regex, str(soup))
json_string = match.group(0).replace("JSON.parse(", "")[1:-3]
json_data = json.loads(json_string)
It fails with a JSON error because the string contains odd spaces and " characters that Python's json library cannot handle:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 22173 (char 22172)
Is there a way to get the JSON data or, even better, to evaluate the window.INITIAL_DATA assignment from the HTML response directly in Python?
Try:
import re
import js2py
import requests
url = "https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01"
html_doc = requests.get(url).text
data = re.search(r"window\.INITIAL_DATA = (.*)", html_doc)
data = js2py.eval_js(data.group(1))
print(data)
Prints:
{
"currentCountry": {
"englishName": "Sweden",
"localName": "Sverige",
"twoLetterCode": "SE",
},
"currentCurrency": "SEK",
"currentLanguage": "sv-SE",
"currentLanguageRevision": "43",
"currentLanguageTwoLetterName": "sv",
"dynamicData": [
{
"data": {},
"type": "NordicNest.ContentApi.DynamicData.MenuApiModel,NordicNest.ContentApi",
},
{
"type": "NordicNest.Core.Contentful.Model.SiteLayout.Footer,NordicNest.Core"
},
...

Python Scrape specific JS data

I'm having some trouble extracting the following data from a page:
I have highlighted the json I would like to obtain from the page.
I have also pasted the javascript section it is in below:
<script type="text/x-magento-init">
{
"#conf-select-attr-173": {
"Magento_ConfigurableProduct/js/configurable/select/action": {
"config": {"attributes":{"173":{"id":"173","code":"Size","label":"Size","options":[{"id":"342","label":"Footwear-38","products":["104984"]},{"id":"345","label":"Footwear-39","products":["104985"]},{"id":"347","label":"Footwear-39.5","products":["104986"]},{"id":"349","label":"Footwear-40","products":["104987"]},{"id":"351","label":"Footwear-40.5","products":["104988"]},{"id":"354","label":"Footwear-41.5","products":["104989"]},{"id":"355","label":"Footwear-42","products":["104990"]},{"id":"357","label":"Footwear-42.5","products":["104991"]},{"id":"360","label":"Footwear-43.5","products":["104992"]},{"id":"361","label":"Footwear-44","products":["104993"]},{"id":"363","label":"Footwear-44.5","products":["104994"]},{"id":"364","label":"Footwear-45","products":["104995"]},{"id":"367","label":"Footwear-46","products":["104996"]},{"id":"369","label":"Footwear-46.5","products":["104997"]}],"position":"0"}},"template":"<%- data.price %>\u00a0 \u20ac","currencyFormat":"%s\u00a0 \u20ac","optionPrices":{"104984":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104985":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104986":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104987":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104988":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104989":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104990":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104991":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},
"104992":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104993":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104994":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104995":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104996":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104997":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]}},"priceFormat":{"pattern":"%s\u00a0 \u20ac","precision":2,"requiredPrecision":2,"decimalSymbol":",","groupSymbol":".","groupLength":3,"integerRequired":1},"prices":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9}},"productId":"104998","chooseText":"Choose an 
Option...","images":[],"index":{"104984":{"173":"342"},"104985":{"173":"345"},"104986":{"173":"347"},"104987":{"173":"349"},"104988":{"173":"351"},"104989":{"173":"354"},"104990":{"173":"355"},"104991":{"173":"357"},"104992":{"173":"360"},"104993":{"173":"361"},"104994":{"173":"363"},"104995":{"173":"364"},"104996":{"173":"367"},"104997":{"173":"369"}},"sku":{"default":"1201A429-300","104984":"1201A429-300-Footwear-38","104985":"1201A429-300-Footwear-39","104986":"1201A429-300-Footwear-39.5","104987":"1201A429-300-Footwear-40","104988":"1201A429-300-Footwear-40.5","104989":"1201A429-300-Footwear-41.5","104990":"1201A429-300-Footwear-42","104991":"1201A429-300-Footwear-42.5","104992":"1201A429-300-Footwear-43.5","104993":"1201A429-300-Footwear-44","104994":"1201A429-300-Footwear-44.5","104995":"1201A429-300-Footwear-45","104996":"1201A429-300-Footwear-46","104997":"1201A429-300-Footwear-46.5"},"stock":{"104984":{"is_salable":true,"qty":1},"104985":{"is_salable":true,"qty":1},"104986":{"is_salable":true,"qty":0},"104987":{"is_salable":true,"qty":1},"104988":{"is_salable":true,"qty":1},"104989":{"is_salable":true,"qty":2},"104990":{"is_salable":true,"qty":0},"104991":{"is_salable":true,"qty":0},"104992":{"is_salable":true,"qty":3},"104993":{"is_salable":true,"qty":2},"104994":{"is_salable":true,"qty":1},"104995":{"is_salable":true,"qty":0},"104996":{"is_salable":true,"qty":0},"104997":{"is_salable":true,"qty":0}}},
"selected": ""
}
}
}
</script>
How can I obtain this quickly and efficiently? I have tried using bs4, but I always get None back. Could someone please show me how this can be done? :)
Thanks!
The content of this script tag is JSON, so use the json module to convert it to a Python dictionary (data in the code below) and read what you need from it:
data["#conf-select-attr-173"]["Magento_ConfigurableProduct/js/configurable/select/action"]["config"]
html = '''<script type="text/x-magento-init">
{
"#conf-select-attr-173": {
"Magento_ConfigurableProduct/js/configurable/select/action": {
"config": {"attributes":{"173":{"id":"173","code":"Size","label":"Size","options":[{"id":"342","label":"Footwear-38","products":["104984"]},{"id":"345","label":"Footwear-39","products":["104985"]},{"id":"347","label":"Footwear-39.5","products":["104986"]},{"id":"349","label":"Footwear-40","products":["104987"]},{"id":"351","label":"Footwear-40.5","products":["104988"]},{"id":"354","label":"Footwear-41.5","products":["104989"]},{"id":"355","label":"Footwear-42","products":["104990"]},{"id":"357","label":"Footwear-42.5","products":["104991"]},{"id":"360","label":"Footwear-43.5","products":["104992"]},{"id":"361","label":"Footwear-44","products":["104993"]},{"id":"363","label":"Footwear-44.5","products":["104994"]},{"id":"364","label":"Footwear-45","products":["104995"]},{"id":"367","label":"Footwear-46","products":["104996"]},{"id":"369","label":"Footwear-46.5","products":["104997"]}],"position":"0"}},"template":"<%- data.price %>\u00a0 \u20ac","currencyFormat":"%s\u00a0 \u20ac","optionPrices":{"104984":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104985":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104986":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104987":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104988":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104989":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104990":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104991":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},
"104992":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104993":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104994":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104995":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104996":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104997":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]}},"priceFormat":{"pattern":"%s\u00a0 \u20ac","precision":2,"requiredPrecision":2,"decimalSymbol":",","groupSymbol":".","groupLength":3,"integerRequired":1},"prices":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9}},"productId":"104998","chooseText":"Choose an 
Option...","images":[],"index":{"104984":{"173":"342"},"104985":{"173":"345"},"104986":{"173":"347"},"104987":{"173":"349"},"104988":{"173":"351"},"104989":{"173":"354"},"104990":{"173":"355"},"104991":{"173":"357"},"104992":{"173":"360"},"104993":{"173":"361"},"104994":{"173":"363"},"104995":{"173":"364"},"104996":{"173":"367"},"104997":{"173":"369"}},"sku":{"default":"1201A429-300","104984":"1201A429-300-Footwear-38","104985":"1201A429-300-Footwear-39","104986":"1201A429-300-Footwear-39.5","104987":"1201A429-300-Footwear-40","104988":"1201A429-300-Footwear-40.5","104989":"1201A429-300-Footwear-41.5","104990":"1201A429-300-Footwear-42","104991":"1201A429-300-Footwear-42.5","104992":"1201A429-300-Footwear-43.5","104993":"1201A429-300-Footwear-44","104994":"1201A429-300-Footwear-44.5","104995":"1201A429-300-Footwear-45","104996":"1201A429-300-Footwear-46","104997":"1201A429-300-Footwear-46.5"},"stock":{"104984":{"is_salable":true,"qty":1},"104985":{"is_salable":true,"qty":1},"104986":{"is_salable":true,"qty":0},"104987":{"is_salable":true,"qty":1},"104988":{"is_salable":true,"qty":1},"104989":{"is_salable":true,"qty":2},"104990":{"is_salable":true,"qty":0},"104991":{"is_salable":true,"qty":0},"104992":{"is_salable":true,"qty":3},"104993":{"is_salable":true,"qty":2},"104994":{"is_salable":true,"qty":1},"104995":{"is_salable":true,"qty":0},"104996":{"is_salable":true,"qty":0},"104997":{"is_salable":true,"qty":0}}},
"selected": ""
}
}
}
</script>'''
from bs4 import BeautifulSoup
import json
soup = BeautifulSoup(html, 'html.parser')
text = soup.find('script').string
#print(text)
# strict=False also accepts a raw line break inside a string value
data = json.loads(text, strict=False)
config = data["#conf-select-attr-173"]["Magento_ConfigurableProduct/js/configurable/select/action"]["config"]
print(config)
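Alternatively, you can take the line of this text that holds "config" (lines[4] below), remove the leading "config": and the trailing comma, and again use json to convert it to a Python dictionary. This only works when the whole "config" entry sits on a single line: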
html = '''<script type="text/x-magento-init">
{
"#conf-select-attr-173": {
"Magento_ConfigurableProduct/js/configurable/select/action": {
"config": {"attributes":{"173":{"id":"173","code":"Size","label":"Size","options":[{"id":"342","label":"Footwear-38","products":["104984"]},{"id":"345","label":"Footwear-39","products":["104985"]},{"id":"347","label":"Footwear-39.5","products":["104986"]},{"id":"349","label":"Footwear-40","products":["104987"]},{"id":"351","label":"Footwear-40.5","products":["104988"]},{"id":"354","label":"Footwear-41.5","products":["104989"]},{"id":"355","label":"Footwear-42","products":["104990"]},{"id":"357","label":"Footwear-42.5","products":["104991"]},{"id":"360","label":"Footwear-43.5","products":["104992"]},{"id":"361","label":"Footwear-44","products":["104993"]},{"id":"363","label":"Footwear-44.5","products":["104994"]},{"id":"364","label":"Footwear-45","products":["104995"]},{"id":"367","label":"Footwear-46","products":["104996"]},{"id":"369","label":"Footwear-46.5","products":["104997"]}],"position":"0"}},"template":"<%- data.price %>\u00a0 \u20ac","currencyFormat":"%s\u00a0 \u20ac","optionPrices":{"104984":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104985":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104986":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104987":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104988":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104989":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104990":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104991":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},
"104992":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104993":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104994":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104995":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104996":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]},"104997":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9},"tierPrices":[]}},"priceFormat":{"pattern":"%s\u00a0 \u20ac","precision":2,"requiredPrecision":2,"decimalSymbol":",","groupSymbol":".","groupLength":3,"integerRequired":1},"prices":{"oldPrice":{"amount":129.9},"basePrice":{"amount":109.15966286555},"finalPrice":{"amount":129.9}},"productId":"104998","chooseText":"Choose an 
Option...","images":[],"index":{"104984":{"173":"342"},"104985":{"173":"345"},"104986":{"173":"347"},"104987":{"173":"349"},"104988":{"173":"351"},"104989":{"173":"354"},"104990":{"173":"355"},"104991":{"173":"357"},"104992":{"173":"360"},"104993":{"173":"361"},"104994":{"173":"363"},"104995":{"173":"364"},"104996":{"173":"367"},"104997":{"173":"369"}},"sku":{"default":"1201A429-300","104984":"1201A429-300-Footwear-38","104985":"1201A429-300-Footwear-39","104986":"1201A429-300-Footwear-39.5","104987":"1201A429-300-Footwear-40","104988":"1201A429-300-Footwear-40.5","104989":"1201A429-300-Footwear-41.5","104990":"1201A429-300-Footwear-42","104991":"1201A429-300-Footwear-42.5","104992":"1201A429-300-Footwear-43.5","104993":"1201A429-300-Footwear-44","104994":"1201A429-300-Footwear-44.5","104995":"1201A429-300-Footwear-45","104996":"1201A429-300-Footwear-46","104997":"1201A429-300-Footwear-46.5"},"stock":{"104984":{"is_salable":true,"qty":1},"104985":{"is_salable":true,"qty":1},"104986":{"is_salable":true,"qty":0},"104987":{"is_salable":true,"qty":1},"104988":{"is_salable":true,"qty":1},"104989":{"is_salable":true,"qty":2},"104990":{"is_salable":true,"qty":0},"104991":{"is_salable":true,"qty":0},"104992":{"is_salable":true,"qty":3},"104993":{"is_salable":true,"qty":2},"104994":{"is_salable":true,"qty":1},"104995":{"is_salable":true,"qty":0},"104996":{"is_salable":true,"qty":0},"104997":{"is_salable":true,"qty":0}}},
"selected": ""
}
}
}
</script>'''
from bs4 import BeautifulSoup
import json
soup = BeautifulSoup(html, 'html.parser')
text = soup.find('script').string
lines = text.split('\n')
line4 = lines[4].strip()
line4 = line4.replace('"config": ', '')
line4 = line4[:-1] # remove `,` at the end
config = json.loads(line4)
print(config)
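Once config is a dictionary, the sizes and stock levels can be combined with ordinary dictionary access. The sketch below uses a trimmed-down sample with the same shape as the "config" object in the question; the helper name sizes_in_stock is hypothetical.

```python
import json

# Trimmed-down sample with the same shape as the "config" object in the question.
config = json.loads("""
{
  "attributes": {"173": {"options": [
    {"id": "342", "label": "Footwear-38", "products": ["104984"]},
    {"id": "347", "label": "Footwear-39.5", "products": ["104986"]}
  ]}},
  "stock": {
    "104984": {"is_salable": true, "qty": 1},
    "104986": {"is_salable": true, "qty": 0}
  }
}
""")


def sizes_in_stock(config, attribute_id="173"):
    """Return {label: total qty} for options with at least one unit in stock."""
    result = {}
    for option in config["attributes"][attribute_id]["options"]:
        # Each option lists the product ids that carry it; sum their quantities.
        qty = sum(config["stock"][pid]["qty"] for pid in option["products"])
        if qty > 0:
            result[option["label"]] = qty
    return result


print(sizes_in_stock(config))
```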

BeautifulSoup Find within an instagram html page

I have a problem finding something with bs4.
I'm trying to automatically find some URLs in an Instagram HTML page, but (being a Python noob) I can't work out how to search the HTML source for the URLs that appear after "display_url":, as in the example below.
I want my script to find every URL that follows "display_url" and download them.
They have to be extracted as many times as they appear in the source code.
With bs4 I tried this:
f = urllib.request.urlopen(fileURL)
htmlSource = f.read()
soup = bs(htmlSource, 'html.parser')
metaTag = soup.find_all('meta', {'property': 'og:image'})
imgURL = metaTag[0]['content']
urllib.request.urlretrieve(imgURL, 'fileName.jpg')
But I can't get soup.find_all(...) to find them.
Is there a way to find this part of the page with bs4?
Thanks a lot for your help.
Here is an example of my little (Python) code as it is now: https://repl.it/@ClementJpn287/bs
<!––cropped...............-->
<body class="">
<span id="react-root"><svg width="50" height="50" viewBox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7">
<path
d="
<!––deleted part for privacy -->
" />
</svg></span>
<script type="text/javascript">
window._sharedData = {
"config": {
"csrf_token": "",
"viewer": {
<!––deleted part for privacy -->
"viewerId": ""
},
"supports_es6": true,
"country_code": "FR",
"language_code": "fr",
"locale": "fr_FR",
"entry_data": {
"PostPage": [{
"graphql": {
"shortcode_media": {
"__typename": "GraphSidecar",
<!––deleted part for privacy -->
"dimensions": {
"height": 1080,
"width": 1080
},
"gating_info": null,
"media_preview": null,
<--There's the important part that have to be extracted as many times it appear in the source code-->
"display_url": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"display_resources": [{
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 640,
"config_height": 640
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 750,
"config_height": 750
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 1080,
"config_height": 1080
}],
"is_video": false,
<!––cropped...............-->
my newest code
You could find the appropriate script tag and regex out the info. I have assumed the first script tag containing window._sharedData = is the appropriate one. You can fiddle as required.
from bs4 import BeautifulSoup as bs
import re
html = '''
<html>
<head></head>
<body class="">
<span id="react-root">
<svg width="50" height="50" viewbox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7">
<path d="
<!––deleted part for privacy -->
" />
</svg></span>
<script type="text/javascript">
window._sharedData = {
"config": {
"csrf_token": "",
"viewer": {
<!––deleted part for privacy -->
"viewerId": ""
},
"supports_es6": true,
"country_code": "FR",
"language_code": "fr",
"locale": "fr_FR",
"entry_data": {
"PostPage": [{
"graphql": {
"shortcode_media": {
"__typename": "GraphSidecar",
<!––deleted part for privacy -->
"dimensions": {
"height": 1080,
"width": 1080
},
"gating_info": null,
"media_preview": null,
<--There's the important part that have to be extracted as many times it appear in the source code-->
"display_url": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"display_resources": [{
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 640,
"config_height": 640
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 750,
"config_height": 750
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 1080,
"config_height": 1080
}],
"is_video": false,</script>
</body>
</html>
'''
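soup = bs(html, 'lxml')
scripts = soup.select('script[type="text/javascript"]')
data = ''
for script in scripts:
    # Take the first script tag that defines window._sharedData
    if 'window._sharedData =' in script.text:
        data = script.text
        break
# Capture only the URL between the quotes that follow "display_url":
r = re.compile(r'"display_url":\s*"(.*?)"')
print(r.findall(data))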
Thanks to @t.h.adam, it may be possible to shorten the above to:
soup = bs(html, 'lxml')
r = re.compile(r'"display_url":\s*"(.*?)"')
data = soup.find('script', string=r).text
print(r.findall(data))
The program has progressed and now looks something like this:
thepage = urllib.request.urlopen(html)
soup = BeautifulSoup(thepage, "html.parser")
print(soup.title.text)
txt = soup.select('script[type="text/javascript"]')[3]
texte = txt.get_text()
f1 = open("tet.txt", 'w')
f1.write(texte)
f1.close()
with open('tet.txt','r') as f:
    data=''.join(f.readlines())
print(data[data.index('"display_url":"'):data.index('","display_resources":')+1])
But now a new question has come up:
How can I make the URL-finding part of the program (lines 10 and 11) repeat for as long as the ' "display_url":" ... ","display_resources": ' pattern appears in the tet.txt file?
A while loop could be used, but how do I make it repeat the process?
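One way to repeat the search without a hand-written while loop is re.findall, which returns every match in the text. This is only a sketch: the data string is a hypothetical stand-in for the contents of tet.txt, and the example.com URLs are placeholders.

```python
import re

# Hypothetical sample standing in for the contents of tet.txt.
data = (
    '{"display_url":"https://example.com/a.jpg","display_resources":[],'
    '"display_url":"https://example.com/b.jpg","display_resources":[]}'
)

# Non-greedy group: stop at the first closing quote after each "display_url".
pattern = re.compile(r'"display_url":\s*"(.*?)"')

# findall scans the whole string, so every occurrence is collected.
urls = pattern.findall(data)
for url in urls:
    print(url)
```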
Problem Solved
Here's the code to download multiple images from an Instagram URL with Pythonista 3 on iOS:
from sys import argv
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import photos
import clipboard
url = "your url"
#p.1
thepage = urllib.request.urlopen(url)
soup = BeautifulSoup(thepage, "html.parser")
print(soup.title.text)
txt = soup.select('script[type="text/javascript"]')[3]
texte = txt.get_text()
fille = open("tet.txt", 'w')
fille.write(texte)
fille.close()
#p.2
g = open('tet.txt','r')
data=''.join(g.readlines())
le1 = 0
le2 = 0
hturl = open('url.html', 'w')
still_looking = True
while still_looking:
    still_looking = False
    dat = data.find('play_url', le1)
    det = data.find('play_resources', le2)
    if dat >= le1:
        #urls.append(dat)
        le1 = dat + 1
        still_looking = True
    if det >= le2:
        hturl.write(data[dat:det])
        le2 = det + 1
        still_looking = True
hturl.close()
#p.3
hturl2 = open('url.html', 'r')
dete = ''.join(hturl2.readlines())
le11 = 0
le22 = 0
urls = []
still_looking2 = True
while still_looking2:
    still_looking2 = False
    dat2 = dete.find('https://scontent-', le11)
    det2 = dete.find('","dis', le22)
    if dat2 >= le11:
        urls.append(dat2)
        le11 = dat2 + 1
        still_looking2 = True
    if det2 >= le22:
        urls.append(dete[dat2:det2])
        le22 = det2 + 1
        still_looking2 = True
hturl2.close()
#p.4
imgs = len(urls)
nbind = imgs
nbindr = 3
images = 1
while nbindr < imgs:
    urllib.request.urlretrieve(urls[nbindr], 'photo.jpg')
    photos.create_image_asset('photo.jpg')
    print ('Image ' + str(images) + ' downloaded')
    nbindr = nbindr +2
    images += 1
print("OK")
It's a bit tedious, but it works, and quickly too.
Thanks for your help.

Scraping a web page with Scrapy doesn't return the page content

I'm trying to scrape a web page with Scrapy, but I noticed it doesn't work. When I parsed the web page through my IPython shell, it returned this:
'دانلود کتاب و کتاب صوتی با طاقچه\n // more info: http://angulartics.github.io/\n (function (i, s, o, g, r, a, m) {\n i[\'GoogleAnalyticsObject\'] = r; i[r] = i[r] || function () {\n (i[r].q = i[r].q || []).push(arguments)\n }, i[r].l = 1 * new Date(); a = s.createElement(o),\n m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)\n })(window, document, \'script\', \'//www.google-analytics.com/analytics.js\', \'ga\');\n ga(\'create\', \'UA-57199074-1\', { \'cookieDomain\': location.hostname == \'localhost\' ? \'none\' : \'auto\' });\n ga(\'require\', \'ec\');\n Taaghche works best with JavaScript enabled{ "@context": "http://schema.org", "@type": "WebSite", "url": "https://taaghche.ir/", "name": "طاقچه", "alternateName": "نزدیکترین کتاب فروشی شهر", "potentialAction": { "@type": "SearchAction", "target": "https://taaghche.ir/search?term={search_term_string}", "query-input": "required name=search_term_string" } }{ "@context": "http://schema.org", "@type": "Organization", "url": "https://taaghche.ir", "logo": "https://taaghche.ir/assets/images/taaghchebrand.png", "contactPoint": [{ "@type": "ContactPoint", "telephone": "+۹۸-۲۱-۸۸۱۴۹۸۱۶", "contacttype": "customer support", "areaServed": "IR" }] }'
It looks more like a JSON response. How can I scrape through it? By the way, my scraper looks like this:
import pandas as pd
import scrapy
class Taaghche(scrapy.Spider):
    name='taaghche'
    def start_requests(self):
        urls = []
        link = 'https://taaghche.ir/search?term='
        data = pd.read_csv('books.csv')
        titles = data.title
        for title in titles:
            key = title.replace(" ", "%20")
            urls.append(link+key)
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_front)
    def parse_front(self,response):
        booklinks = response.xpath('//a[@class="book-link"][1]/@href').extract_first()
        #print(booklinks)
        #for booklink in booklinks:
        yield response.follow(url =booklinks, callback=self.parse_page)
    def parse_page(self,response):
        ...
The website's content is not rendered on the server side; it is rendered by JavaScript.
In this case you can use either of the following:
Selenium (integrate Selenium with Scrapy).
Check the request URLs in the browser's network tab. There might be an API URL from which you can get the data directly.
There might be other possible solutions.
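The network-tab option might look roughly like the sketch below. Everything specific in it is an assumption: the endpoint API_URL, the "books", "title" and "url" field names, and the sample body are all hypothetical, and the real values would have to be read from the requests the page actually makes.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; the real one must be read from the browser's network tab.
API_URL = "https://example.com/api/search"


def build_search_url(term):
    """Build the query URL, letting urlencode handle spaces and special characters."""
    return API_URL + "?" + urlencode({"term": term})


def parse_books(body):
    """Extract (title, url) pairs from a JSON body with a hypothetical 'books' list."""
    payload = json.loads(body)
    return [(book["title"], book["url"]) for book in payload["books"]]


# Hypothetical response body, standing in for what the API might return.
sample_body = '{"books": [{"title": "Book A", "url": "/book/1"}]}'
print(build_search_url("war and peace"))
print(parse_books(sample_body))
```

In a Scrapy spider, the same parsing would go in the callback, with response.text as the body.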

Extracting JSON from HTML using BeautifulSoup python

While practicing web scraping on a web page (cookies are required), I ran into problems extracting the JSON data embedded in the HTML. This is what I did:
import requests
from bs4 import BeautifulSoup as soup
import json
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4": "1548175412",
"Hm_lvt_7cd4710f721b473263eed1f0840391b4": "1548140525",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223832333339343739626466613939303562613535386138333266383365326132434c4b516e65494645495474764a322b706f6d6f6941453d227d", }
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
page_soup = soup(ret.text, 'html.parser')
data = page_soup.findAll('script', {'type':'application/ld+json'})
The output is as follows:
[
<script type="application/ld+json">{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"item": {
"name": "Home",
"@id": "https://www.lazada.sg/"
},
"@type": "ListItem",
"position": "1"
}
]
}</script>,
<script type="application/ld+json">{
"@context": "https://schema.org",
"@type": "ItemList",
"itemListElement": [
{
"offers": {
"priceCurrency": "SGD",
"@type": "Offer",
"price": "71.00",
"availability": "https://schema.org/InStock"
},
"image": "https://sg-test-11.slatic.net/p/670a73a9613c36b2bb01555ab4092ba2.jpg",
"@type": "Product",
"name": "Switch: Super Mario Party [Available in Stock! Immediate Shipping]",
"url": "https://www.lazada.sg/products/switch-super-mario-party-available-in-stock-immediate-shipping-i278269540-s429667097.html?search=1"
},
...
I tried to follow an existing thread, "Extract json from html in python beautifulsoup", but found myself stuck, probably because the JSON is formatted differently in this HTML. The part I scraped out contains all the different products on the page. Is there a way to further extract each product's details (e.g. title, price, rating) and count the number of products present? Thanks!
You can loop over the JSON after loading it with json.loads. All the product info for those containers is listed in one script tag, so you can just grab that.
import requests
from bs4 import BeautifulSoup as soup
import json
import pandas as pd
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4": "1548175412",
"Hm_lvt_7cd4710f721b473263eed1f0840391b4": "1548140525",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223832333339343739626466613939303562613535386138333266383365326132434c4b516e65494645495474764a322b706f6d6f6941453d227d", }
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
page_soup = soup(ret.text, 'lxml')
data = page_soup.select("[type='application/ld+json']")[1]
oJson = json.loads(data.text)["itemListElement"]
numProducts = len(oJson)
results = []
for product in oJson:
    results.append([product['name'], product['offers']['price'], product['offers']['availability'].replace('https://schema.org/', '')]) # etc......
df = pd.DataFrame(results)
print(df)
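The answer picks the product list by position ([1]). As a hedged alternative, the sketch below picks it by its "@type" instead, and uses only the standard library's html.parser in place of BeautifulSoup so that it runs offline. The LdJsonCollector class and the sample_html string are illustrative, not part of the original answer.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the open block.
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


# Shortened sample modelled on the output shown in the question.
sample_html = """
<script type="application/ld+json">{"@type": "BreadcrumbList", "itemListElement": []}</script>
<script type="application/ld+json">{"@type": "ItemList", "itemListElement": [
  {"@type": "Product", "name": "Switch: Super Mario Party",
   "offers": {"@type": "Offer", "price": "71.00", "priceCurrency": "SGD"}}
]}</script>
"""

collector = LdJsonCollector()
collector.feed(sample_html)
collector.close()

# Pick the block by its "@type" rather than by its position in the page.
item_list = next(
    doc for doc in map(json.loads, collector.blocks) if doc.get("@type") == "ItemList"
)
products = [(p["name"], p["offers"]["price"]) for p in item_list["itemListElement"]]
print(len(products), products)
```

Selecting by "@type" keeps working if the page adds or reorders its ld+json blocks.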
