I'm new to Scrapy.
I made a script to scrape data from a website, and it works fine: I get the results as a JSON file and it looks perfect.
Now when I try to use my script to scrape multiple URLs (same site), it works, and I can get the data in the JSON file for each URL, but there is a bug.
My output structure is as below (as coded in the script):
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:} #URL1
]
When I put two URLs to scrape, I get this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
]
It is still fine, but when I add more, the structure messes up and becomes like this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:}
]
If you look closely, you will notice that the title of the third URL is below the title of the second one.
Can somebody help, please?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = [
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
    ]

    def parse(self, response):
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                        "desc": res
                    }
                except:
                    continue
                res = ""
        for post in response.css(".lie-one-canshu"):
            try:
                dicti = {"attribute": post.css('.lie-one-canshu::text')[0].get().strip()}
                yield dicti
            except:
                continue
UPDATE:
I noticed that the bug isn't permanent: sometimes I execute the script and the result is fine.
Scrapy is asynchronous, so there is no guarantee to the order in which items are output or processed, at least not out of the box anyway. If you want all of the output from a single URL to come out together, then I suggest you only yield one item from each call to the parse method.
For example:
def parse(self, response):
    results = {
        'items': [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results['items'].append({
                    "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                    "desc": res
                })
            except:
                continue
            res = ""
    for post in response.css(".lie-one-canshu"):
        try:
            results['items'].append({
                "attribute": post.css('.lie-one-canshu::text')[0].get().strip()
            })
        except:
            continue
    yield results
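For reference, a hedged usage note: running this with Scrapy's feed export as before (for example, scrapy crawl attributes -o results.json) should now produce exactly one top-level object per URL, so each URL's title, description blocks, and attributes stay grouped even though responses still complete in arbitrary order.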
I scrape certain values using BeautifulSoup and I want to convert them into nested JSON format.
Following is my value structure.
category = development
heading = Complete Python Bootcamp | Deep Learning Into Python Coding
image = https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg
link = https://www.udemy.com/course/complete-python-bootcamp-deep-learning-into-python-coding
category = development
heading = C++ Complete Course For Beginners
image = https://i.udemycdn.com/course/750x422/3847698_8547_2.jpg
link = https://www.udemy.com/course/c-complete-course-for-beginners/?couponCode=FREE2021
category = it-software
heading = TB0-116 TIBCO Enterprise Message Service 6 Practice Exam
image = https://i.udemycdn.com/course/750x422/2931054_d555.jpg
link = https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service-6-practice-exam-t
Expected JSON output:
[
    {
        "development": [
            {
                "heading": "Complete Python Bootcamp | Deep Learning Into Python Coding",
                "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
                "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
            },
            {
                "heading": "C++ Complete Course For Beginners",
                "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
                "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
            }
        ],
        "it-software": [
            {
                "heading": "TB0-116 TIBCO Enterprise Message Service 6 Practice Exam",
                "image": "https://i.udemycdn.com/course/750x422/2931054_d555.jpg",
                "courselink": "https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service"
            }
        ]
    }
]
Below I have attached my scraping code.
import requests
from bs4 import BeautifulSoup

# headers (a User-Agent dict) is defined elsewhere in my script;
# shown here as a placeholder so the snippet runs on its own.
headers = {'User-Agent': 'Mozilla/5.0'}

def scrapeData(category):
    base_url = "https://udemycoupon.learnviral.com/coupon-category/" + category + "/"
    print(base_url)
    source = requests.get(base_url, headers=headers).text
    soup = BeautifulSoup(source, 'lxml')
    contents = soup.find_all('div', class_="item-holder")
    print()
    # print(contents)
    for item in contents:
        print(category)
        heading = item.find("h3", {"class": "entry-title"}).text.replace("[Free]", "")
        print(heading)
        image = item.find("div", {"class": "store-image"}).find("img")['src']
        imagelink = image.replace('240x135', '750x422')
        print(imagelink)
        courselink = item.find("a", {"class": "coupon-code-link btn promotion"})
Can anyone help me convert it into my expected format in Python? Thanks in advance.
import requests
from bs4 import BeautifulSoup

def scrape_category(name):
    base_url = 'https://udemycoupon.learnviral.com/coupon-category/' + name + '/'
    source = requests.get(base_url).text
    soup = BeautifulSoup(source, 'lxml')
    contents = soup.find_all('div', class_='item-holder')
    courses = []
    for item in contents:
        heading = item.find('h3', {'class': 'entry-title'}).text.replace('[Free]', '')
        image = item.find('div', {'class': 'store-image'}).find('img')['src']
        course_link = item.find('a', {'class': 'coupon-code-link btn promotion'})
        courses.append({
            'heading': heading,
            'image': image.replace('240x135', '750x422'),
            'courselink': course_link['href'],
        })
    return courses

result = {}
for category in ('development', 'it-software'):
    result[category] = scrape_category(category)
print(result)  # or print([result])
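A small follow-up: print(result) shows Python's dict repr (single quotes), not JSON text. If the literal JSON from the question is needed, a minimal sketch (assuming result was built as above):

import json

# json.dumps renders real JSON (double quotes, null, ...); wrapping the
# dict in a list matches the outer brackets in the expected output.
print(json.dumps([result], indent=4))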
You can use a defaultdict and update the scraping code to create the dictionary object for every new course:
from collections import defaultdict

main_d = defaultdict(list)
for item in contents:
    print(category)
    heading = item.find("h3", {"class": "entry-title"}).text.replace("[Free]", "")
    print(heading)
    image = item.find("div", {"class": "store-image"}).find("img")['src']
    imagelink = image.replace('240x135', '750x422')
    print(imagelink)
    courselink = item.find("a", {"class": "coupon-code-link btn promotion"})
    # courselink is a Tag, so take its href; use the resized imagelink
    d = {"heading": heading, "image": imagelink, "courselink": courselink['href']}
    main_d[category].append(d)
main_d will be a dictionary object with the following structure:
{
    "development": [
        {
            "heading": "Complete Python Bootcamp | Deep Learning Into Python Coding",
            "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
            "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
        },
        {
            "heading": "C++ Complete Course For Beginners",
            "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
            "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
        }
    ],
    "it-software": [
        {
            "heading": "TB0-116 TIBCO Enterprise Message Service 6 Practice Exam",
            "image": "https://i.udemycdn.com/course/750x422/2931054_d555.jpg",
            "courselink": "https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service"
        }
    ]
}
Note: this code is not tested and might require some modifications to make it work correctly.
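If you then want to write the expected JSON to a file, a minimal sketch (assuming main_d was filled as above; json.dump serializes a defaultdict like a plain dict, and wrapping it in a list matches the outer brackets in the question):

import json

with open('courses.json', 'w') as f:
    # a defaultdict serializes exactly like a plain dict
    json.dump([main_d], f, indent=4)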
I am trying to extract the "mobility index" values for each state and county from this webpage:
https://www.cuebiq.com/visitation-insights-mobility-index/
The preferred output would be panel data of place (state/county) by date, for all available places and dates.
There is another thread (How can I scrape tooltips value from a Tableau graph embedded in a webpage) with a similar question. I tried to follow the solution there, but it doesn't seem to work for my case.
Thanks a lot in advance.
(One way that I have tried is to download the PDF files generated from Tableau, which contain all counties' values on a specific date. However, I still need to find a way to make a request for each date in the data. Anyway, let me know if you have a better idea than this route.)
This Tableau data URL doesn't return any data. In fact, it only renders images of the values (canvas, probably), and I'm guessing it detects clicks based on coordinates. Probably it's made this way to cache the values and render quickly.
But when you click on a state, it actually returns data, though it doesn't always return the result for the state (it does work for the individual county).
The solution I've found is to use the tooltip to get the data for the state. When you click the state, it generates a request like this:
POST https://public.tableau.com/{path}/{session_id}/commands/tabsrv/render-tooltip-server
with the following form params:
worksheet: US Map - State - CMI
dashboard: CMI
tupleIds: [18]
vizRegionRect: {"r":"viz","x":496,"y":148,"w":0,"h":0,"fieldVector":null}
allowHoverActions: false
allowPromptText: true
allowWork: false
useInlineImages: true
where tupleIds: [18] refers to the index of the state in a list of states in reverse alphabetical order, like this:
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]
It gives a JSON with the HTML of the tooltip, which has the CMI and YoY values you want to extract:
{
"vqlCmdResponse": {
"cmdResultList": [{
"commandName": "tabsrv:render-tooltip-server",
"commandReturn": {
"tooltipText": "{\"htmlTooltip\": \"<HTML HERE WITH THE VALUES>\"}]},\"overlayAnchors\":[]}"
}
}]
}
}
The only caveat is that you'll have to make one request per state:
import requests
from bs4 import BeautifulSoup
import json
import time

data_host = "https://public.tableau.com"

r = requests.get(
    f"{data_host}/views/CMI-2_0/CMI",
    params={
        ":showVizHome": "no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

data = []
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]

for stateIndex, state in enumerate(stateNames):
    time.sleep(0.5)  # for throttling
    r = requests.post(
        f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabsrv/render-tooltip-server',
        data={
            "worksheet": "US Map - State - CMI",
            "dashboard": "CMI",
            "tupleIds": f"[{stateIndex+1}]",
            "vizRegionRect": json.dumps({"r": "viz", "x": 496, "y": 148, "w": 0, "h": 0, "fieldVector": None}),
            "allowHoverActions": "false",
            "allowPromptText": "true",
            "allowWork": "false",
            "useInlineImages": "true"
        })
    tooltip = json.loads(r.json()["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]["tooltipText"])["htmlTooltip"]
    soup = BeautifulSoup(tooltip, "html.parser")
    rows = [
        t.find("tr").find_all("td")
        for t in soup.find_all("table")
    ]
    entry = {"state": state}
    for row in rows:
        if row[0].text == "Mobility Index:":
            entry["CMI"] = "".join([t.text.strip() for t in row[1:]])
        if row[0].text == "YoY (%):":
            entry["YoY"] = "".join([t.text.strip() for t in row[1:]])
    print(entry)
    data.append(entry)

print(data)
Try this on repl.it
To get the county information, it's the same as in this post: use the select endpoint, which gives you the data in the same format as the post you linked in your question.
The following will extract data for all counties and states:
import requests
from bs4 import BeautifulSoup
import json
import time

data_host = "https://public.tableau.com"
worksheet = "US Map - State - CMI"
dashboard = "CMI"

r = requests.get(
    f"{data_host}/views/CMI-2_0/CMI",
    params={
        ":showVizHome": "no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

data = []
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]

for stateIndex, state in enumerate(stateNames):
    time.sleep(0.5)  # for throttling
    r = requests.post(
        f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabsrv/render-tooltip-server',
        data={
            "worksheet": worksheet,
            "dashboard": dashboard,
            "tupleIds": f"[{stateIndex+1}]",
            "vizRegionRect": json.dumps({"r": "viz", "x": 496, "y": 148, "w": 0, "h": 0, "fieldVector": None}),
            "allowHoverActions": "false",
            "allowPromptText": "true",
            "allowWork": "false",
            "useInlineImages": "true"
        })
    tooltip = json.loads(r.json()["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]["tooltipText"])["htmlTooltip"]
    soup = BeautifulSoup(tooltip, "html.parser")
    rows = [
        t.find("tr").find_all("td")
        for t in soup.find_all("table")
    ]
    entry = {"state": state}
    for row in rows:
        if row[0].text == "Mobility Index:":
            entry["CMI"] = "".join([t.text.strip() for t in row[1:]])
        if row[0].text == "YoY (%):":
            entry["YoY"] = "".join([t.text.strip() for t in row[1:]])
    r = requests.post(
        f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
        data={
            "worksheet": worksheet,
            "dashboard": dashboard,
            "selection": json.dumps({
                "objectIds": [stateIndex + 1],
                "selectionType": "tuples"
            }),
            "selectOptions": "select-options-simple"
        })
    entry["county_data"] = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
    print(entry)
    data.append(entry)

print(data)
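A hedged note on the county payload: dataSegments is Tableau's raw data dictionary, not ready-made rows, so it still needs unpacking. The exact layout can vary between Tableau versions, so treat the field names below (dataColumns, dataType, dataValues) as assumptions rather than a guaranteed schema:

# Sketch: peek at the raw values inside each data segment. The keys
# "dataColumns", "dataType" and "dataValues" are assumptions based on
# the usual shape of Tableau responses and may differ on this dashboard.
for entry in data:
    for seg_id, segment in entry["county_data"].items():
        for col in segment.get("dataColumns", []):
            print(entry["state"], col.get("dataType"), col.get("dataValues", [])[:5])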
My goal is to web scrape Google search results using BeautifulSoup. I am using Anaconda Python and IPython as the IDE console. Why don't I get any output when I run the following command?
def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent': 'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)
    linkdictionary = {}
    for li in soup.findAll('li', attrs={'class': 'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class': 'st'})
        print sSpan
    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')
You are never adding anything to linkdictionary:
def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent': 'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)
    linkdictionary = {}
    for li in soup.findAll('li', attrs={'class': 'g'}):
        sLink = li.find('a')
        sSpan = li.find('span', attrs={'class': 'st'})
        linkdictionary['href'] = sLink['href']
        linkdictionary['sSpan'] = sSpan
    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')
The problem, as Cody Bouche mentioned, is that nothing has been added to the dict().
In my opinion, you'll have a hard time updating your dict if you haven't changed {} (dict) to [] (array).
Appending to an array is much simpler (note: I could be wrong here, it's just a personal opinion from previous experience).
To make it work in a simple manner, you need to change the dict to an array ({} --> []) and then use .append({}) to append to the list.
Code and example in the online IDE:
import requests
from bs4 import BeautifulSoup
import json

# headers (a User-Agent dict) is assumed to be defined elsewhere,
# as in the question; shown as a placeholder here.
headers = {'User-Agent': 'Mozilla/5.0'}

def google_scrape(query):
    html = requests.get(f'https://www.google.com/search?q={query}', headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    data = []
    for container in soup.findAll('div', class_='tF2Cxc'):
        title = container.select_one('.DKV0Md').text
        link = container.find('a')['href']
        data.append({
            'title': title,
            'link': link,
        })
        print(f'{title}\n{link}')
    print(json.dumps(data, indent=2))

google_scrape('english')
# part of the outputs:
'''
English language - Wikipedia
https://en.wikipedia.org/wiki/English_language
[
{
"title": "English language - Wikipedia",
"link": "https://en.wikipedia.org/wiki/English_language"
},
]
'''
If you still want to append to dict() then this is one of the ways of approaching this (only part of the for loop shown):
for container in soup.findAll('div', class_='tF2Cxc'):
    data_dict = {}
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    # creates title key and assigns title value
    data_dict['title'] = title
    # creates link key and assigns link value
    data_dict['link'] = link
    print(json.dumps(data_dict, indent=2))
# part of the output:
'''
{
"title": "Minecraft Official Site | Minecraft",
"link": "https://www.minecraft.net/en-us/"
}
'''
Alternatively, you can do the same thing using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
Essentially, it does the same thing as the code above, but you don't have to figure out how to scrape certain elements; it's already done for the end user with a JSON output, so the only thing that needs to be done is to iterate over the JSON and get the desired output.
Code to integrate:
from serpapi import GoogleSearch
import json

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(json.dumps(result, indent=2, ensure_ascii=False))
# part of the json output:
'''
{
"position": 1,
"title": "Minecraft - Aplikasi di Google Play",
"link": "https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=in&gl=US",
"displayed_link": "https://play.google.com › store › apps › details › id=co...",
"rich_snippet": {
"top": {
"detected_extensions": {
"skor": 46,
"suara": 4144655,
"us": 749
},
"extensions": [
"Skor: 4,6",
"4.144.655 suara",
"US$7,49",
"Android",
"Game"
]
}
}
'''
Disclaimer, I work for SerpApi.
I am new to Python, and I have looked at some example scripts already made.
I started my script and it works fine, but now I want to add some extras and I would appreciate your help.
def listar_videos(url):
    codigo_fonte = abrir_url(url)
    soup = BeautifulSoup(abrir_url(url))
    content = BeautifulSoup(soup.find("div", {"id": "dle-content"}).prettify())
    filmes = content("div", {"class": "short-film"})
    for filme in filmes:
        nome_filme = filme.img["alt"]
        url = filme.a["href"].replace('Assistir ', '')
        img = filme.img["src"]
        addDir(nome_filme.encode('utf8'), url, 4, img, False, len(filmes))
    pagina = BeautifulSoup(soup.find('div', {"class": "pages"}).prettify())("div", {"class": "pnext"})[0]["href"]
    addDir('Página Seguinte >>', pagina, 2, artfolder + 'prox.png')
I get IndexError: list index out of range on this line:
pagina = BeautifulSoup(soup.find('div', { "class" : "pages" }).prettify())("div", { "class" : "pnext" })[0]["href"]
I already tried some different code, but no success.
It's weird because I already used this code on a WordPress website and it worked fine, but this new one gives me this error.
I need your help, guys.
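Since the traceback points at the pnext lookup, a likely cause is that this site has no div with class pnext on the page being parsed, so indexing [0] into an empty result list fails. A minimal sketch of a guarded version (assuming that, when pagination exists, the next-page link is an <a> inside that div):

# Guard against pages without a "pnext" block, which is what raises
# IndexError here. The <a>-inside-div structure is an assumption.
pages = soup.find('div', {'class': 'pages'})
pnext = pages.find('div', {'class': 'pnext'}) if pages is not None else None
if pnext is not None and pnext.a is not None:
    pagina = pnext.a['href']
    addDir('Página Seguinte >>', pagina, 2, artfolder + 'prox.png')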