I'm trying to extract embedded JSON from a webpage and pass it to json.loads(). The first URL works, but loading the second URL raises this error:
ValueError: Unterminated string starting at: line 1 column 2078 (char 2077)
Here is the code:
import requests,json
from bs4 import BeautifulSoup
urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
]
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content,'lxml')
    scripts = soup.find_all('script')[0]
    data = scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0]
    jdata = json.loads(data)
    print(jdata)
If you print scripts.text.split("window['AT_APOLLO_STATE'] = ")[1], you will see the following text, which contains a ; right after "and enthusiastic". Splitting on ; and taking [0] therefore cuts the data off at "and enthusiastic", in the middle of a string, so scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0] is not a valid JSON string.
"strapline":"In our state-of-the-art dealerships across the U.K, Sytner Group
represents the world’s most prestigious car manufacturers.
All of our staff are knowledgeable and enthusiastic; making every interaction
special by going the extra mile.",
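One way to avoid splitting on ; altogether is to let the JSON decoder find the end of the object itself. The following is only a sketch on a made-up, shortened script string (the "id" field and the trailing window['OTHER'] assignment are assumptions for illustration); it uses json.JSONDecoder().raw_decode, which parses the first JSON value and reports where it ended.

```python
import json

# Made-up, shortened stand-in for the real script text (assumption).
script_text = (
    "window['AT_APOLLO_STATE'] = "
    '{"strapline": "knowledgeable and enthusiastic; making every interaction special", "id": 9965};'
    " window['OTHER'] = 1;"
)

payload = script_text.split("window['AT_APOLLO_STATE'] = ")[1]

# Naive approach: cuts the string at the first ';', which sits inside a JSON string.
try:
    json.loads(payload.split(";")[0])
    naive_ok = True
except ValueError:
    naive_ok = False

# raw_decode parses one JSON value from the start of the string and
# returns it together with the index where parsing stopped.
jdata, end = json.JSONDecoder().raw_decode(payload)
print(naive_ok, jdata["id"], payload[end])
```

Note that raw_decode expects the JSON to begin at the first character, which holds here because the split marker includes the trailing space.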
The reason has already been given. You could also use a regex to extract the appropriate string:
import re, requests, json
urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
]
p = re.compile(r"window\['AT_APOLLO_STATE'\] =(.*?});", re.DOTALL)
for url in urls:
    r = requests.get(url)
    jdata = json.loads(p.findall(r.text)[0])
    print(jdata)
Missed a } in the original post.
Related
I'm trying to remove the extra space and "rebtel.bootstrappedData" in the second paragraph, but for some reason it won't work.
This is my output
"welcome_offer_cuba.block_1_title":"SaveonrechargetoCuba","welcome_offer_cuba.block_1_cta":"Sendrecharge!","welcome_offer_cuba.block_1_cta_prebook":"Pre-bookRecarga","welcome_offer_cuba.block_1_footprint":"Offervalidfornewusersonly.","welcome_offer_cuba.block_2_key":"","welcome_offer_cuba.block_2_title":"Howtosendarecharge?","welcome_offer_cuba.block_2_content":"<ol><li>Simplyenterthenumberyou’dliketosendrechargeinthefieldabove.</li><li>Clickthe“{{buttonText}}”button.</li><li>CreateaRebtelaccountifyouhaven’talready.</li><li>Done!Yourfriendshouldreceivetherechargeshortly.</li></ol>","welcome_offer_cuba.block_3_title":"DownloadtheRebtelapp!","welcome_offer_cuba.block_3_content":"Sendno-feerechargeandenjoythebestcallingratestoCubainoneplace."},"canonical":{"string":"<linkrel=\"canonical\"href=\"https://www.rebtel.com/en/rates/\"/>"}};
rebtel.bootstrappedData={"links":{"summary":{"collection":"country_links","ids":[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"params":{"locale":"en"},"meta":{}},"data":[{"title":"A","links":[{"iso2":"AF","route":"afghanistan","name":"Afghanistan","url":"/en/rates/afghanistan/","callingCardsUrl":"/en/calling-cards/afghanistan/","popular":false},{"iso2":"AL","route":"albania","name":"Albania","url":"/en/rates/albania/
And this is the code I used:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.rebtel.com/en/rates/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
x = range(132621, 132624)
script = soup.find_all("script")[4].text.strip()[38:]
print(script)
What should I add to "script" so it will remove the empty spaces?
Original answer
You can change the definition of your script variable to:
script = soup.find_all("script")[4].text.replace("\t", "")[38:]
This removes all tab characters from the text, and with them the indentation.
Edit after conversation in the comments
You can use the following code to extract the data as JSON:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.rebtel.com/en/rates/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
script = list(filter(None, soup.find_all("script")[4].text.replace("\t", "").split("\r\n")))
app_data = json.loads(script[1].replace("rebtel.appData = ", "")[:-1])
bootstrapped_data = json.loads(script[2].replace("rebtel.bootstrappedData = ", ""))
I split the script into lines with split("\r\n") and took the wanted data from there.
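The line-splitting idea can be sketched without relying on fixed slice offsets. The script text below is a made-up miniature of the real page (an assumption), and parse_assignment is a hypothetical helper, not part of any library; it splits each line at the first = and strips the trailing semicolon before parsing.

```python
import json

# Made-up miniature of the script contents (assumption).
script_text = (
    '\r\n\trebtel.appData = {"locale": "en"};'
    '\r\n\trebtel.bootstrappedData = {"links": {"data": []}};\r\n'
)

# Keep only non-empty lines, with surrounding tabs and spaces removed.
lines = [line.strip() for line in script_text.splitlines() if line.strip()]

def parse_assignment(line):
    """Hypothetical helper: turn 'name = {...};' into (name, parsed JSON)."""
    name, _, value = line.partition("=")
    return name.strip(), json.loads(value.strip().rstrip(";"))

data = dict(parse_assignment(line) for line in lines)
print(data["rebtel.bootstrappedData"])
```

This only works while each assignment sits on its own line; if the site changes its layout, the parsing has to change with it.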
I'm using requests and regex to scrape data from an entire website and then save it to a JSON file, hosted on github so I and anyone else can access the data from other devices.
The first thing I tried was to open every single page on the website and get all the data I want, but I found that to be unnecessary. So I decided to make two scripts: the first finds the URL of every page on the site, and the second is the one that gets called and scrapes the URL it is given. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
"Console":"/neo-geo-aes",
"Call ID":"62815",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
"Console":"/neo-geo-cd",
"Call ID":"62817",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
"Console":"/neo-geo-pocket-color",
"Call ID":"62578",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
"Console":"/playstation",
"Call ID":"62580",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution. Here's the code in question:
import re
import requests
import json
##The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text
##Find all system URLs
dataUrl = re.findall(r'(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)
##For each Item(number of consoles) find games
for i in range(len(dataUrl)):
    ##make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ##Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ##For every item in list(items per console)
    out_list = []
    for i in range(len(urlOne)):
        ##Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ##Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
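Appending several json.dump results to the same file produces JSON documents glued together, which a parser cannot read back as one document. A possible approach, sketched here under assumptions, is to collect every record in a single list and write the file once in 'w' mode. The sample input and the build_records helper are hypothetical; note also that the code above reuses i in both loops, so 'Console' ends up indexed by the item counter, which this sketch avoids by iterating over the values directly.

```python
import json
import os
import tempfile

# Hypothetical sample input: console path -> list of (call_id, game_path) pairs.
scraped = {
    "/neo-geo-aes": [("62815", "/neo-geo-aes/example-game")],
    "/neo-geo-cd": [("62817", "/neo-geo-cd/example-game")],
}

def build_records(scraped):
    """Hypothetical helper: flatten all consoles into one list of dicts."""
    records = []
    for console, items in scraped.items():
        for call_id, game_path in items:
            records.append({
                "Console": console,
                "Call ID": call_id,
                "URL": "https://www.pricecharting.com/game" + game_path,
            })
    return records

records = build_records(scraped)

# Write once, in 'w' mode, so the file holds exactly one JSON array.
out_path = os.path.join(tempfile.mkdtemp(), "result.json")
with open(out_path, "w") as f:
    json.dump(records, f, indent=4)

# Reading it back succeeds because the file is a single valid document.
with open(out_path) as f:
    loaded = json.load(f)
print(len(loaded))
```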
I am currently trying to read out the locations of a company. The information about the locations is inside a script tag (as JSON), so I read out the content of the corresponding script tag.
This is my code:
sauce = requests.get('https://www.ep.de/store-finder', verify=False, headers = {'User-Agent':'Mozilla/5.0'})
soup1 = BeautifulSoup(sauce.text, features="html.parser")
all_scripts = soup1.find_all('script')[6]
all_scripts.contents
The output is:
['\n\t\twindow.storeFinderComponent = {"center":{"lat":51.165691,"long":10.451526},"bounds":[[55.655085,5.160441],[46.439648,15.666775]],"stores":[{"code":"1238240","lat":51.411572,"long":10.425264,"name":"EP:Schulze","url":"/schulze-breitenworbis","showAsClosed":false,"isBusinessCard":false,"logoUrl":"https://cdn.prod.team-ec.com/logo/retailer/retailerlogo_epde_1238240.png","address":{"street":"Weststraße 6","zip":"37339","town":"Breitenworbis","phone":"+49 (36074) 31193"},"email":"info#ep-schulze-breitenworbis.de","openingHours":[{"day":"Mo.","openingTime":"09:00","closingTime":"18:00","startPauseTime":"13:00","endPauseTime":"14:30"},{"day":"Di.","openingTime":"09:00","closingTime":"18:00","startPauseTime":"13:00","endPauseTime":"14:30"},{"day":"Mi.","openingTime":"09:00","closingTime":"18:00","startPauseTime":"13:00","endPauseTime":"14:30"},...]
I have problems converting the content to a dictionary and reading all the lat and long data.
When I try:
data = json.loads(all_scripts.get_text())
all_scripts.get_text() returns an empty result.
So I tried:
data = json.loads(all_scripts.contents)
But then I get a TypeError: the JSON object must be str, bytes or bytearray, not list
I don't know how to convert the result of .contents to JSON:
data = json.loads(str(all_scripts.contents))
JSONDecodeError: Expecting value: line 1 column 2 (char 1)
Can anyone help me?
You could use regex to pull out the json and read that in.
import requests
import re
import json
html = requests.get('https://www.ep.de/store-finder', verify=False, headers = {'User-Agent':'Mozilla/5.0'}).text
pattern = re.compile(r'window\.storeFinderComponent = ({.*})')
result = pattern.search(html).group(1)
jsonData = json.loads(result)
You can remove the first part of the data and the trailing characters, and then load the result as JSON:
import json
data=all_scripts.contents[0]
removed_data=data.replace("\n\t\twindow.storeFinderComponent = ","")
clean_data=removed_data[:-3]
json_data=json.loads(clean_data)
Output:
{'center': {'lat': 51.165691, 'long': 10.451526},
'bounds': [[55.655085, 5.160441], [46.439648, 15.666775]],
'stores': [{'code': '1238240',
'lat': 51.411572,
....
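Once the JSON is parsed, reading every lat and long is a loop over the "stores" list. The sketch below runs on a shortened copy of the output shown in the question instead of the live page; the trailing ";\n\t" after the object is an assumption about the page's formatting.

```python
import json
import re

# Shortened copy of the script contents from the question; the trailing
# ";\n\t" is an assumption.
script_text = (
    '\n\t\twindow.storeFinderComponent = {"center":{"lat":51.165691,"long":10.451526},'
    '"stores":[{"code":"1238240","lat":51.411572,"long":10.425264,"name":"EP:Schulze"}]};\n\t'
)

# Greedy match from the first "{" to the last "}" on the same line.
match = re.search(r"window\.storeFinderComponent = (\{.*\})", script_text)
data = json.loads(match.group(1))

# Collect one (lat, long) pair per store.
coords = [(store["lat"], store["long"]) for store in data["stores"]]
print(coords)
```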
I have a txt file with these values:
https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/
http://www.redbook.com.au/cars/research/used/details/1968-ford-fairmont-xt-manual/SPOT-ITM-336135
http://www.redbook.com.au/cars/research/used/details/1968-ford-f100-manual/SPOT-ITM-317784
Code:
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
import requests
url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'
headers = {'User-Agent':'Mozilla/5.0'}
page = (requests.get(url, headers=headers))
tree = html.fromstring(page.content)
car_data = {}
# Overview
if tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()'):
    badge = tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()')[0]
    car_data["badge"] = badge
if tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()'):
    car_data["series"] = tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()')[0]
if tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()'):
    car_data["body_small"] = tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()')[0]
df=pd.DataFrame([car_data])
Output:
df=
   badge             body_small  series
0  50 Years Edition  Sedan       10th Gen
How can I take all the URLs from the txt file and loop over them so that the output appends all values into a dict or DataFrame?
Expected output:
   badge             body_small  series
0  50 Years Edition  Sedan       10th Gen
1  (No Badge)        Sedan       XT
2  (No Badge)        Utility     (No Series)
I tried converting the file into a list and used a for loop:
url = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/','http://www.redbook.com.au/cars/research/used/details/1966-ford-falcon-deluxe-xp-manual/SPOT-ITM-386381']
headers = {'User-Agent':'Mozilla/5.0'}
for lop in url:
    page = (requests.get(lop, headers=headers))
But only one URL's data is generated. Also, if there are 1000 URLs, converting them to a list will take a lot of time.
The problem with your code is that you overwrite the variable 'page' on every iteration of the for loop, so you end up with the data of the last request only.
Below is the corrected code:
url = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/','http://www.redbook.com.au/cars/research/used/details/1966-ford-falcon-deluxe-xp-manual/SPOT-ITM-386381']
headers = {'User-Agent':'Mozilla/5.0'}
page = []
for lop in url:
    page.append(requests.get(lop, headers=headers).text)
This code generates a dictionary where each key is a URL and each value is the data scraped from it:
from bs4 import BeautifulSoup
import requests
def get_cars_data(url):
    cars_data = {}
    # TODO read the data using requests and with BS populate 'cars_data'
    return cars_data
all_cars = {}
with open('urls.txt') as f:
    urls = [line.strip() for line in f.readlines()]
    for url in urls:
        all_cars[url] = get_cars_data(url)
print('done')
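On the concern about 1000 URLs: a file object can be iterated line by line, so the URLs never have to be typed out as a list by hand. The sketch below shows only that loop. The iter_urls and collect helpers and the fake_scrape stand-in are hypothetical; a real scrape function would do the requests and XPath work, and io.StringIO stands in for the opened txt file.

```python
import io

def iter_urls(lines):
    """Hypothetical helper: yield stripped, non-empty URLs one at a time."""
    for line in lines:
        url = line.strip()
        if url:
            yield url

def collect(lines, scrape):
    """Call scrape(url) for every URL and keep one row (dict) per URL."""
    return [dict(scrape(url), url=url) for url in iter_urls(lines)]

# Stand-in for the real scraping function (assumption): returns fixed data.
def fake_scrape(url):
    return {"badge": "(No Badge)"}

# io.StringIO plays the role of open('urls.txt') here.
url_file = io.StringIO("http://example.com/a\n\nhttp://example.com/b\n")
rows = collect(url_file, fake_scrape)
print(rows)
```

A list of dicts like rows can be passed directly to pd.DataFrame(rows) to get one row per URL.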
If I understood your question correctly, this should answer it.
from bs4 import BeautifulSoup
from lxml import html
import requests
cars = [] # global list for storing each car_data object
headers = {'User-Agent':'Mozilla/5.0'}
# file.txt contains all the links that you wish to read
with open("file.txt",'r') as f:
    # This for loop processes each url in the file
    for url in f:
        url = url.strip() # drop the trailing newline
        if not url:
            continue # skip blank lines
        car_data={} # use it as a local variable
        page = (requests.get(url, headers=headers))
        tree = html.fromstring(page.content)
        # Overview
        if tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()'):
            badge = tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()')[0]
            car_data["badge"] = badge
        if tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()'):
            car_data["series"] = tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()')[0]
        if tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()'):
            car_data["body_small"] = tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()')[0]
        cars.append(car_data) # append it to the global list
I'm trying to scrape data and export it to CSV. I have a main URL page and a second URL page, and I have imported the following:
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urlparse, parse_qs
import csv
def get_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    mainpage = response.read().decode('utf-8')
    return mainpage
mainpage = get_page('www.website1.com')
mainpage_parser = BeautifulSoup(mainpage,'html.parser')
secondpage = get_page('www.website2.com')
secondpage_parser = BeautifulSoup(secondpage,'html.parser')
The data follows the same patterns on both pages (Title, Address, and so on), so I use "find" or "find_all" on each class; for example:
try:
    name = page_parser.find("h1",{"class":"xxx"}).find("a").get_text()
    print(name)
except:
    print(name)
This worked.
However, I couldn't get the "lat" and "lon" values from the URL in the src attribute of this img tag:
<img class="aaa" alt="map" data-track-id="static-map" width="97" height="142" src="https://www.website.com/aaaaaaa;height=284&lat=18.111&lon=98.111&level=15&returnImage=true">
The code I'm using to get the latitude and longitude is:
for gps in secondpage_parser.find_all('img',{"class":"aaa"}, src=True):
    parsed_url = urlparse(gps['src'])
    mykeys = ['lat', 'lon']
    gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]
    print(gpslocation)
But the "gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]" line raises "KeyError: 'lat'".
I would like to know where my mistake is and how I should fix it. Please help.
This URL has no query string but does have parameters (see what is the difference between URL parameters and query strings). So when you try to parse the query string you get an empty dictionary, hence the KeyError.
"https://www.website.com/aaaaaaa;height=284&lat=18.111&lon=98.111&level=15&returnImage=true"
#                               ^--- semicolon, not question mark
Result of print(parsed_url):
ParseResult(
    scheme='https',
    netloc='www.website.com',
    path='/aaaaaaa',
    params='height=284&lat=18.111&lon=98.111&level=15&returnImage=true',
    query='',
    fragment='')
The key here is to parse the parameters. To fix your code change parsed_url.query to parsed_url.params:
gpslocation = [parse_qs(parsed_url.params)[k][0] for k in mykeys]
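A self-contained check of this fix, using the placeholder URL from the question, might look like the sketch below. It relies only on urlparse and parse_qs from the standard library; the variable names are carried over from the question.

```python
from urllib.parse import urlparse, parse_qs

src = ("https://www.website.com/aaaaaaa;height=284&lat=18.111"
       "&lon=98.111&level=15&returnImage=true")

parsed_url = urlparse(src)

# Everything after the ';' lands in .params, so .query stays empty.
query_dict = parse_qs(parsed_url.query)
params_dict = parse_qs(parsed_url.params)

mykeys = ['lat', 'lon']
gpslocation = [params_dict[k][0] for k in mykeys]
print(gpslocation)
```

parse_qs returns each value as a list of strings, which is why the [0] index is needed; convert with float() if numeric coordinates are required.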