I'm new to Scrapy.
I made a script to scrape data from a website, and it works fine: I get the results as a JSON file and it looks perfect.
Now when I try to use my script to scrape multiple URLs (same site), it works, and I can get the data in the JSON file for each URL, but there is a bug.
The structure I print is as below (as coded in the script):
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:} #URL1
]
When I put in two URLs to scrape, I get this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
]
That is still fine, but when I add more, the structure gets messed up and becomes like this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:}
]
If you look closely, you will notice that the title of the third URL appears directly below the title of the second one, before the rest of the second URL's data.
Can somebody help, please?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = [
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
    ]

    def parse(self, response):
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                        "desc": res
                    }
                except:
                    continue
                res = ""
        for post in response.css(".lie-one-canshu"):
            try:
                dicti = {"attribute": post.css('.lie-one-canshu::text')[0].get().strip()}
                yield dicti
            except:
                continue
UPDATE:
I've noticed that the bug isn't consistent: sometimes I execute the script and the result comes out fine.
Scrapy is asynchronous, so there is no guarantee of the order in which items are output or processed, at least not out of the box. If you want all of the output from a single URL to come out together, then I suggest you yield only one item from each call to the parse method.
For example:
def parse(self, response):
    results = {
        'items': [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results['items'].append({
                    "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                    "desc": res
                })
            except:
                continue
            res = ""
    for post in response.css(".lie-one-canshu"):
        try:
            results['items'].append({
                "attribute": post.css('.lie-one-canshu::text')[0].get().strip()
            })
        except:
            continue
    yield results
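If you do want to keep yielding separate items anyway, a workaround is to tag each one with its source URL and regroup after the crawl. A minimal sketch, assuming you add "url": response.url to every yielded dict and export the feed with scrapy crawl attributes -O output.json (both of which are assumptions, not in the original spider):

import json
from collections import defaultdict

# Assumes each item carries a "url" field added in parse(),
# and the crawl was exported with: scrapy crawl attributes -O output.json
with open("output.json") as f:
    items = json.load(f)

grouped = defaultdict(list)
for item in items:
    grouped[item["url"]].append(item)  # group the items back together per source URL

print(json.dumps(grouped, indent=2))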
I scrape certain values using BeautifulSoup and want to convert them into nested JSON format.
The following is my value structure:
category = development
heading = Complete Python Bootcamp | Deep Learning Into Python Coding
image = https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg
link = https://www.udemy.com/course/complete-python-bootcamp-deep-learning-into-python-coding
category = development
heading = C++ Complete Course For Beginners
image = https://i.udemycdn.com/course/750x422/3847698_8547_2.jpg
link = https://www.udemy.com/course/c-complete-course-for-beginners/?couponCode=FREE2021
category = it-software
heading = TB0-116 TIBCO Enterprise Message Service 6 Practice Exam
image = https://i.udemycdn.com/course/750x422/2931054_d555.jpg
link = https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service-6-practice-exam-t
Expected JSON output:
[
    {
        "development": [
            {
                "heading": " Complete Python Bootcamp | Deep Learning Into Python Coding",
                "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
                "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
            },
            {
                "heading": "C++ Complete Course For Beginners",
                "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
                "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
            }
        ],
        "it-software": [
            {
                "heading": "TB0-116 TIBCO Enterprise Message Service 6 Practice Exam",
                "image": "https://i.udemycdn.com/course/750x422/2931054_d555.jpg",
                "courselink": "https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service"
            }
        ]
    }
]
Below is my scraping code:
import requests
from bs4 import BeautifulSoup

# headers is assumed to be defined elsewhere in the script (e.g. a User-Agent dict)

def scrapeData(category):
    base_url = "https://udemycoupon.learnviral.com/coupon-category/" + category + "/"
    print(base_url)
    source = requests.get(base_url, headers=headers).text
    soup = BeautifulSoup(source, 'lxml')
    contents = soup.find_all('div', class_="item-holder")
    print()
    # print(contents)
    for item in contents:
        print(category)
        heading = item.find("h3", {"class": "entry-title"}).text.replace("[Free]", "")
        print(heading)
        image = item.find("div", {"class": "store-image"}).find("img")['src']
        imagelink = image.replace('240x135', '750x422')
        print(imagelink)
        courselink = item.find("a", {"class": "coupon-code-link btn promotion"})
Can anyone help me convert this into my expected format in Python? Thanks in advance.
import requests
from bs4 import BeautifulSoup

def scrape_category(name):
    base_url = 'https://udemycoupon.learnviral.com/coupon-category/' + name + '/'
    source = requests.get(base_url).text
    soup = BeautifulSoup(source, 'lxml')
    contents = soup.find_all('div', class_='item-holder')
    courses = []
    for item in contents:
        heading = item.find('h3', {'class': 'entry-title'}).text.replace('[Free]', '')
        image = item.find('div', {'class': 'store-image'}).find('img')['src']
        course_link = item.find('a', {'class': 'coupon-code-link btn promotion'})
        courses.append({
            'heading': heading,
            'image': image.replace('240x135', '750x422'),
            'courselink': course_link['href'],
        })
    return courses

result = {}
for category in ('development', 'it-software', ):
    result[category] = scrape_category(category)
print(result)  # or print([result])
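If you then want this as a JSON file rather than a printed dict, a minimal sketch using the standard json module (the courses.json filename is just an assumption):

import json

with open('courses.json', 'w') as f:
    json.dump([result], f, indent=2)  # wrapped in a list to match the expected output shape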
You can use defaultdict and update the scraping code to create the dictionary object for every new course:
from collections import defaultdict

main_d = defaultdict(list)
for item in contents:
    print(category)
    heading = item.find("h3", {"class": "entry-title"}).text.replace("[Free]", "")
    print(heading)
    image = item.find("div", {"class": "store-image"}).find("img")['src']
    imagelink = image.replace('240x135', '750x422')
    print(imagelink)
    courselink = item.find("a", {"class": "coupon-code-link btn promotion"})
    # use the resized image link and the anchor's href, not the raw tag
    d = {"heading": heading, "image": imagelink, "courselink": courselink['href']}
    main_d[category].append(d)
main_d will be a dictionary object with the following structure:
{
    "development": [
        {
            "heading": " Complete Python Bootcamp | Deep Learning Into Python Coding",
            "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
            "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
        },
        {
            "heading": "C++ Complete Course For Beginners",
            "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
            "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
        }
    ],
    "it-software": [
        {
            "heading": "TB0-116 TIBCO Enterprise Message Service 6 Practice Exam",
            "image": "https://i.udemycdn.com/course/750x422/2931054_d555.jpg",
            "courselink": "https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service"
        }
    ]
}
Note: this code is untested and might require some modifications to work correctly.
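If you go with the defaultdict version, serializing works the same way, since defaultdict is a dict subclass; a short sketch (the filename is an assumption):

import json

with open('courses.json', 'w') as f:
    # dict(main_d) is optional; json.dump handles a defaultdict directly
    json.dump([dict(main_d)], f, indent=2)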
I am trying to extract the "mobility index" values for each state and county from this webpage:
https://www.cuebiq.com/visitation-insights-mobility-index/
The preferred output would be a panel data of place (state/county) by date for all available places and dates.
There is another thread (How can I scrape tooltips value from a Tableau graph embedded in a webpage) with a similar question. I tried to follow the solution there but it doesn't seem to work for my case.
Thanks a lot in advance.
(One approach I have tried is to download the PDF files generated from Tableau, which contain all counties' values on a specific date. However, I still need to find a way to make a request for each date in the data. Anyway, let me know if you have a better idea than this route.)
This Tableau data URL doesn't return any data. In fact, it only renders images of the values (canvas, probably), and I'm guessing it detects clicks based on coordinates. It's probably made this way to cache the values and render quickly.
But when you click on a state, it actually returns data, though it seems it doesn't always return the result for the state (it does work for the individual counties).
The solution I've found is to use the tooltip to get the data for the state. When you click the state, it generates a request like this:
POST https://public.tableau.com/{path}/{session_id}/commands/tabsrv/render-tooltip-server
with the following form params:
worksheet: US Map - State - CMI
dashboard: CMI
tupleIds: [18]
vizRegionRect: {"r":"viz","x":496,"y":148,"w":0,"h":0,"fieldVector":null}
allowHoverActions: false
allowPromptText: true
allowWork: false
useInlineImages: true
where tupleIds: [18] refers to the index of the state in a list of states in reverse alphabetical order, like this:
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]
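Assuming that ordering holds, the tupleId for any state can be derived from its position in the list; a small sketch (the 1-based offset matches the loop in the code below):

# assuming tupleIds are 1-based positions in the reverse-alphabetical list above
def tuple_id_for(state):
    return stateNames.index(state) + 1

print(tuple_id_for("New York"))  # -> 19 with this ordering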
The tooltip request returns JSON containing the HTML of the tooltip, which has the CMI and YoY values you want to extract:
{
    "vqlCmdResponse": {
        "cmdResultList": [{
            "commandName": "tabsrv:render-tooltip-server",
            "commandReturn": {
                "tooltipText": "{\"htmlTooltip\": \"<HTML HERE WITH THE VALUES>\"}]},\"overlayAnchors\":[]}"
            }
        }]
    }
}
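Note that tooltipText is itself a JSON-encoded string embedded in the JSON response, so it has to be decoded twice; a short sketch, with r standing for the tooltip response:

import json

# first decode: the HTTP response body; second decode: the tooltipText string
command_return = r.json()["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]
html_tooltip = json.loads(command_return["tooltipText"])["htmlTooltip"]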
The only caveat is that you'll have to make one request per state:
import requests
from bs4 import BeautifulSoup
import json
import time

data_host = "https://public.tableau.com"

r = requests.get(
    f"{data_host}/views/CMI-2_0/CMI",
    params={
        ":showVizHome": "no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

data = []
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]

for stateIndex, state in enumerate(stateNames):
    time.sleep(0.5)  # for throttling
    r = requests.post(
        f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabsrv/render-tooltip-server',
        data={
            "worksheet": "US Map - State - CMI",
            "dashboard": "CMI",
            "tupleIds": f"[{stateIndex+1}]",
            "vizRegionRect": json.dumps({"r": "viz", "x": 496, "y": 148, "w": 0, "h": 0, "fieldVector": None}),
            "allowHoverActions": "false",
            "allowPromptText": "true",
            "allowWork": "false",
            "useInlineImages": "true"
        })
    tooltip = json.loads(r.json()["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]["tooltipText"])["htmlTooltip"]
    soup = BeautifulSoup(tooltip, "html.parser")
    rows = [
        t.find("tr").find_all("td")
        for t in soup.find_all("table")
    ]
    entry = {"state": state}
    for row in rows:
        if row[0].text == "Mobility Index:":
            entry["CMI"] = "".join([t.text.strip() for t in row[1:]])
        if row[0].text == "YoY (%):":
            entry["YoY"] = "".join([t.text.strip() for t in row[1:]])
    print(entry)
    data.append(entry)

print(data)
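Since the question asks for panel data, the collected list of dicts drops straight into a tabular structure; a sketch assuming pandas is installed:

import pandas as pd

df = pd.DataFrame(data)  # one row per state; columns: state, CMI, YoY
print(df.head())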
To get the county information, the approach is the same as in this post: use the select endpoint, which gives you the data in the same format as in the post you've linked in your question.
The following will extract the data for all counties and states:
import requests
from bs4 import BeautifulSoup
import json
import time

data_host = "https://public.tableau.com"
worksheet = "US Map - State - CMI"
dashboard = "CMI"

r = requests.get(
    f"{data_host}/views/CMI-2_0/CMI",
    params={
        ":showVizHome": "no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

data = []
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]

for stateIndex, state in enumerate(stateNames):
    time.sleep(0.5)  # for throttling
    r = requests.post(
        f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabsrv/render-tooltip-server',
        data={
            "worksheet": worksheet,
            "dashboard": dashboard,
            "tupleIds": f"[{stateIndex+1}]",
            "vizRegionRect": json.dumps({"r": "viz", "x": 496, "y": 148, "w": 0, "h": 0, "fieldVector": None}),
            "allowHoverActions": "false",
            "allowPromptText": "true",
            "allowWork": "false",
            "useInlineImages": "true"
        })
    tooltip = json.loads(r.json()["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]["tooltipText"])["htmlTooltip"]
    soup = BeautifulSoup(tooltip, "html.parser")
    rows = [
        t.find("tr").find_all("td")
        for t in soup.find_all("table")
    ]
    entry = {"state": state}
    for row in rows:
        if row[0].text == "Mobility Index:":
            entry["CMI"] = "".join([t.text.strip() for t in row[1:]])
        if row[0].text == "YoY (%):":
            entry["YoY"] = "".join([t.text.strip() for t in row[1:]])
    r = requests.post(
        f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
        data={
            "worksheet": worksheet,
            "dashboard": dashboard,
            "selection": json.dumps({
                "objectIds": [stateIndex+1],
                "selectionType": "tuples"
            }),
            "selectOptions": "select-options-simple"
        })
    entry["county_data"] = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
    print(entry)
    data.append(entry)

print(data)
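The county_data payload comes back in Tableau's internal dataSegments format and needs further unpacking; as a starting point, a minimal sketch that just persists everything for later processing (the filename is an assumption):

import json

with open("mobility_data.json", "w") as f:
    json.dump(data, f, indent=2)  # raw state entries plus their county_data segments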
I am fully aware that this has been covered in other topics on this website, but those solutions do not work for me.
I am trying to grab the 'last' value from this API address:
https://cryptohub.online/api/market/ticker/PLSR/
In Python, I have tried many different scripts and tried hard myself, but I always end up getting a KeyError, although "last" is in the API response. Can anyone help me?
The data in your response has this structure:
$ curl https://cryptohub.online/api/market/ticker/PLSR/ | json_pp
{
"BTC_PLSR" : {
"baseVolume" : 0.00772783,
"lowestAsk" : 0.00019999,
"percentChange" : -0.0703703704,
"quoteVolume" : 83.77319071,
"last" : 0.000251,
"id" : 78,
"highestBid" : 5e-05,
"high24hr" : 0.000251,
"low24hr" : 1.353e-05,
"isFrozen" : "0"
}
}
i.e. a dictionary inside a dictionary, so to extract values you need:
data = response.json()["BTC_PLSR"]
visitors = data["id"]
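The same pattern gives you the "last" value the question is actually after; a short sketch assuming the response shape shown above:

data = response.json()["BTC_PLSR"]  # outer key first
last_price = data["last"]           # then the inner key
print(last_price)                   # 0.000251 in the sample above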
UPDATE on comment:
I'm not sure I understand you; I did a simple test and it works fine:
$ python3 << EOF
> import requests
> url = 'https://cryptohub.online/api/market/ticker/PLSR/'
> response = requests.get(url)
> data = response.json()['BTC_PLSR']
> print('ID ==> ', data['id'])
> EOF
ID ==> 78
As you can see, the line print('ID ==> ', data['id']) produces the output ID ==> 78.
Test source code:
import requests
url = 'https://cryptohub.online/api/market/ticker/PLSR/'
response = requests.get(url)
data = response.json()['BTC_PLSR']
print('ID ==> ', data['id'])
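If you'd rather avoid the KeyError entirely when the response shape changes, a hedged variant of the same test using dict.get:

import requests

url = 'https://cryptohub.online/api/market/ticker/PLSR/'
payload = requests.get(url).json()
ticker = payload.get('BTC_PLSR', {})    # empty dict if the pair key is missing
print('last ==> ', ticker.get('last'))  # None instead of a KeyError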
I am running into an issue that I can't seem to get past. Any insight would be great.
The script is supposed to get memory allocation information from a database and return that information as a formatted JSON object. The script works fine when I give it a static JSON object with stack_ids (the information I would be passing), but it won't work when I try to pass the information via POST.
Although the current state of my code uses request.json("") to access the passed data, I have also tried request.POST.get("").
My HTML includes this post request, using D3's xhr post:
var stacks = [230323, 201100, 201108, 229390, 201106, 201114];
var stack_ids = {'stack_ids': stacks};
var my_request = d3.xhr('/pie_graph');
my_request.header("Content-Type", "application/json");
my_request.post(stack_ids, function(stuff){
    stuff = JSON.parse(stuff);
    var data1 = stuff['allocations'];
    var data2 = stuff['allocated bytes'];
    var data3 = stuff['frees'];
    var data4 = stuff['freed bytes'];
    ...
    ...
}, "json");
while my server script has this route:
@views.webapp.route('/pie_graph', method='POST')
def server_pie_graph_json():
    db = views.db
    config = views.config
    ret = {
        'allocations': [],
        'allocated bytes': [],
        'frees': [],
        'freed bytes': [],
        'leaks': [],
        'leaked bytes': []
    }
    stack_ids = request.json['stack_ids']
    # for each unique stack trace
    for pos, stack_id in stack_ids:
        stack = db.stacks[stack_id]
        nallocs = format(stack.nallocs(db, config))
        nalloc_bytes = format(stack.nalloc_bytes(db, config))
        nfrees = format(stack.nfrees(db, config))
        nfree_bytes = format(stack.nfree_bytes(db, config))
        nleaks = format(stack.nallocs(db, config) - stack.nfrees(db, config))
        nleaked_bytes = format(stack.nalloc_bytes(db, config) - stack.nfree_bytes(db, config))
        # create a dictionary representing the stack
        ret['allocations'].append({'label': stack_id, 'value': nallocs})
        ret['allocated bytes'].append({'label': stack_id, 'value': nalloc_bytes})
        ret['frees'].append({'label': stack_id, 'value': nfrees})
        ret['freed bytes'].append({'label': stack_id, 'value': nfree_bytes})
        ret['leaks'].append({'label': stack_id, 'value': nleaks})
        ret['leaked bytes'].append({'label': stack_id, 'value': nfree_bytes})
    # return dictionary of allocation information
    return ret
Most of that can be ignored; the script works when I give it a static JSON object full of data.
The request currently returns a 500 Internal Server Error: JSONDecodeError('Expecting value: line 1 column 2 (char 1)',).
Can anyone explain to me what I am doing wrong?
Also, if you need me to explain anything further, or include any other information, I am happy to do that. My brain is slightly fried after working on this for so long, so I may have missed something.
Here is what I do with POST and it works:
from bottle import *

@post('/')
def do_something():
    comment = request.forms.get('comment')
    sourcecode = request.forms.get('sourceCode')
Source:
function saveTheSourceCodeToServer(comment) {
    var path = saveLocation();
    var params = { 'sourceCode': getTheSourceCode(), 'comment': comment };
    post_to_url(path, params, 'post');
}
Source with credits to JavaScript post request like a form submit
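Note that request.forms only sees form-encoded bodies. Your client sets Content-Type: application/json, so on the Bottle side the body should come through request.json instead, and the 500 most likely means the body wasn't valid JSON: d3's xhr.post sends an object as its string form, "[object Object]", unless you JSON.stringify it first, which matches the "Expecting value: line 1 column 2 (char 1)" error. A minimal, untested sketch of the JSON variant (route name kept from the question, handler simplified):

from bottle import post, request

@post('/pie_graph')
def pie_graph_json():
    body = request.json or {}               # parsed JSON body; None if not sent as JSON
    stack_ids = body.get('stack_ids', [])
    return {'received': stack_ids}          # Bottle serializes a returned dict to JSON

On the client side, the corresponding fix would be posting JSON.stringify(stack_ids) instead of the raw object.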