Scraping json response using scrapy - python

I am new to Python, Scrapy, and JSON. I am trying to scrape the JSON response from the URL below, but it is showing an error. The code I used is:
import scrapy
import json
import re

class BlackSpider(scrapy.Spider):
    name = 'black'
    start_urls = ['https://appworld.blackberry.com/cas/content/2360/reviews/2.17.2?page=1&pagesize=100&sortby=newest&callback=_content_2360_reviews_2_17_2&_=1499161778751']

    def parse(self, response):
        data = re.findall('(\{.+\})\);', response.body_as_unicode())
        a = json.loads(data[0])
        item = MyItem()  # MyItem is the asker's scrapy Item class, defined elsewhere
        item["Reviews"] = a["reviews"][4]["review"]
        return item
The error it is showing is:
ValueError: No JSON object could be decoded

The response you are getting is a JavaScript function call with some JSON inside it:
_content_2360_reviews_2_17_2(\r\n{"some":"json"}]});\r\n
To extract the data from this you can use a simple regex solution:
import re
import json
data = re.findall('(\{.+\})\);', response.body_as_unicode())
json.loads(data[0])
It translates to: select everything between { and } that ends with );
Edit: results I'm getting with this:
{'platform': None,
 'reviews': [{'createdDate': '2017-07-04',
              'model': 'London',
              'nickname': 'aravind14-92362',
              'rating': 6,
              'review': 'Very bad ',
              'title': 'My WhatsApp no update '}],
 'totalReviews': 569909,
 'version': '2.17.2'}
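For reference, a minimal self-contained sketch of how the regex extraction could sit inside the spider's parse callback. It yields plain dicts instead of the asker's MyItem (so it runs without that item definition), iterates over all reviews rather than a single hard-coded index, and uses response.text, the newer equivalent of body_as_unicode():
import re
import json
import scrapy


class BlackSpider(scrapy.Spider):
    name = 'black'
    start_urls = ['https://appworld.blackberry.com/cas/content/2360/reviews/2.17.2?page=1&pagesize=100&sortby=newest&callback=_content_2360_reviews_2_17_2&_=1499161778751']

    def parse(self, response):
        # Strip the JSONP wrapper: keep everything between the outer braces
        # that is followed by ");".
        data = re.findall(r'(\{.+\})\);', response.text)
        if not data:
            self.logger.error('No JSON found in response')
            return
        payload = json.loads(data[0])
        # Yield one dict per review instead of indexing a single one.
        for review in payload.get('reviews', []):
            yield {'Reviews': review.get('review')}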

Related

Iterate through a nested dict inside of a list, with some missing keys

I need your guys' help on how to extract information from a nested dictionary inside a list. Here's the code to get the data:
import requests
import json
import time

all_urls = []
for x in range(5000, 5010):
    url = f'https://api.jikan.moe/v4/anime/{x}/full'
    all_urls.append(url)

all_responses = []
for page_url in all_urls:
    response = requests.get(page_url)
    all_responses.append(response)
    time.sleep(1)

print(all_responses)

data = []
for i in all_responses:
    json_data = json.loads(i.text)
    data.append(json_data)

print(data)
The sample of the extracted data is as follows:
[{'status': 404,
  'type': 'BadResponseException',
  'message': 'Resource does not exist',
  'error': '404 on https://myanimelist.net/anime/5000/'},
 {'status': 404,
  'type': 'BadResponseException',
  'message': 'Resource does not exist',
  'error': '404 on https://myanimelist.net/anime/5001/'},
 {'data': {'mal_id': 5002,
           'url': 'https://myanimelist.net/anime/5002/Bari_Bari_Densetsu',
           'images': {'jpg': {'image_url': 'https://cdn.myanimelist.net/images/anime/4/58873.jpg',
                              'small_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873t.jpg',
                              'large_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873l.jpg'},
                      'webp': {'image_url': 'https://cdn.myanimelist.net/images/anime/4/58873.webp',
                               'small_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873t.webp',
                               'large_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873l.webp'}},
           'trailer': {'youtube_id': None,
                       'url': None,
                       'embed_url': None,
                       'images': {'image_url': None,
                                  'small_image_url': None,
                                  'medium_image_url': None,
                                  'large_image_url': None,
                                  'maximum_image_url': None}},
           'title': 'Bari Bari Densetsu',
           'title_english': None,
           'title_japanese': 'バリバリ伝説',
           'title_synonyms': ['Baribari Densetsu',
           ......
I need to extract the title from the list of data. Any help is appreciated! Also, any recommendation on a better/simpler/cleaner code to extract the json data from an API is also greatly appreciated!
Firstly, no need to create multiple lists. You can do everything in one loop:
import requests
import json

data = []
for x in range(5000, 5010):
    page_url = f'https://api.jikan.moe/v4/anime/{x}/full'
    response = requests.get(page_url)
    json_data = json.loads(response.text)
    data.append(json_data)

print(data)
Second, to address your problem, you have two options. You can use dict.get:
for dic in data:
    title = dic.get('title', 'no title')
Or use the try/except pattern:
for dic in data:
    try:
        title = dic['title']
    except KeyError:
        # deal with case where dict has no title
        pass
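One detail from the sample output above: for successful responses the title sits one level down, under the 'data' key, while the 404 entries have no 'data' at all. A small sketch combining the single loop with that nested lookup (field names are taken from the sample output; this is an illustration, not the answerer's code):
import requests

titles = []
for x in range(5000, 5010):
    response = requests.get(f'https://api.jikan.moe/v4/anime/{x}/full')
    json_data = response.json()  # equivalent to json.loads(response.text)
    # 404 responses have no 'data' key, so fall back to an empty dict
    title = json_data.get('data', {}).get('title', 'no title')
    titles.append(title)

print(titles)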

Beautiful soup - html parser returns dots instead of string visible on web

I'm trying to get the number of actors from: https://apify.com/store which is under the following HTML:
<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>
When I send a GET request and parse the response with BeautifulSoup using:
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text
I get three dots ... instead of the number 895
the element is <span class="ActorStore-statusNbHitsNumber">...</span>
How can I get the number?
If you inspect the network calls in your browser (press F12) and filter by XHR, you'll see that the data is loaded dynamically by sending a POST request.
You can mimic that request by sending the correct JSON data. There's no need for BeautifulSoup; you can use the requests module alone.
Here is a complete working example:
import requests

data = {
    "query": "",
    "page": 0,
    "hitsPerPage": 24,
    "restrictSearchableAttributes": [],
    "attributesToHighlight": [],
    "attributesToRetrieve": [
        "title",
        "name",
        "username",
        "userFullName",
        "stats",
        "description",
        "pictureUrl",
        "userPictureUrl",
        "notice",
        "currentPricingInfo",
    ],
}

response = requests.post(
    "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
    json=data,
)
print(response.json()["nbHits"])
Output:
895
To view all the JSON data in order to access the key/value pairs, you can use:
from pprint import pprint
pprint(response.json(), indent=4)
Partial output:
{ 'exhaustiveNbHits': True,
  'exhaustiveTypo': True,
  'hits': [ { 'currentPricingInfo': None,
              'description': 'Crawls arbitrary websites using the Chrome '
                             'browser and extracts data from pages using '
                             'a provided JavaScript code. The actor '
                             'supports both recursive crawling and lists '
                             'of URLs and automatically manages '
                             'concurrency for maximum performance. This '
                             "is Apify's basic tool for web crawling and "
                             'scraping.',
              'name': 'web-scraper',
              'objectID': 'moJRLRc85AitArpNN',
              'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
              'stats': { 'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
                         'totalBuilds': 104,
                         'totalMetamorphs': 102660,
                         'totalRuns': 68036112,
                         'totalUsers': 23492,
                         'totalUsers30Days': 1726,
                         'totalUsers7Days': 964,
                         'totalUsers90Days': 3205},
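If you also want fields from the individual results rather than just the count, the hits list in the same JSON response can be iterated. A short follow-up sketch, reusing the response object from the example above (field names taken from the partial output):
for hit in response.json()["hits"]:
    # Each hit is one actor; print its name and total runs.
    print(hit["name"], hit["stats"]["totalRuns"])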

EBAY Finding API Date Filtering

I am trying to return a list of completed items in a given category using the eBay API. My code seems to be working; however, the results seem very limited (about 100). I assumed there would be some limitation on how far back the API would go, but even just a few days should return thousands of results for this category. Am I missing something in the code, or is this just a limitation of the eBay API? I did make sure I was using production and not the sandbox.
I have since realized that there are multiple pages to my query, up to the 100-item / 100-page max. I am now running into issues with the date filtering. I see the filter reference material on the site, but I am still not getting the result I expect. In the updated query I am trying to pull only items completed yesterday, but when I run it I am getting items from today. Is there a better way to input the date filters?
from ebaysdk.finding import Connection as finding
from bs4 import BeautifulSoup
import os
import csv

api = finding(appid=<my appid>, config_file=None)
response = api.execute(
    'findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'endTimeFrom': '2020-02-03T00:00:00.000Z',
        'endTimeTo': '2020-02-04T00:00:00.000Z',
        'paginationInput': {
            'entriesPerPage': '100',
            'pageNumber': '1'
        },
        'sortOrder': 'EndTimeSoonest'
    }
)

soup = BeautifulSoup(response.content, 'lxml')
totalitems = int(soup.find('totalentries').text)
items = soup.find_all('item')
for item in response.reply.searchResult.item:
    print(item.itemId)
    print(item.listingInfo.endTime)
I finally figured this out. I needed to add additional code for the item filters. The working code is below.
from ebaysdk.finding import Connection as finding
from bs4 import BeautifulSoup
import os
import csv

api = finding(appid=<my appid>, config_file=None)
response = api.execute(
    'findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'itemFilter': [
            {'name': 'EndTimeFrom', 'value': '2020-02-03T00:00:00.000Z'},
            {'name': 'EndTimeTo', 'value': '2020-02-04T00:00:00.000Z'}
            #{'name': 'MinPrice', 'value': '200', 'paramName': 'Currency', 'paramValue': 'GBP'},
            #{'name': 'MaxPrice', 'value': '400', 'paramName': 'Currency', 'paramValue': 'GBP'}
        ],
        'paginationInput': {
            'entriesPerPage': '100',
            'pageNumber': '100'
        },
        'sortOrder': 'EndTimeSoonest'
    }
)

soup = BeautifulSoup(response.content, 'lxml')
totalitems = int(soup.find('totalentries').text)
items = soup.find_all('item')
for item in response.reply.searchResult.item:
    print(item.itemId)
    print(item.listingInfo.endTime)
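Since the original issue was the 100-results-per-page cap, here is a hedged sketch of looping the same request over page numbers to collect every page. It reuses the item filters above, keeps the asker's <my appid> placeholder, and assumes the paginationOutput.totalPages field of the Finding API response (and that searchResult.item is a list, as in the code above):
from ebaysdk.finding import Connection as finding

api = finding(appid='<my appid>', config_file=None)  # placeholder app ID

item_filters = [
    {'name': 'EndTimeFrom', 'value': '2020-02-03T00:00:00.000Z'},
    {'name': 'EndTimeTo', 'value': '2020-02-04T00:00:00.000Z'},
]

all_items = []
page = 1
while True:
    response = api.execute('findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'itemFilter': item_filters,
        'paginationInput': {'entriesPerPage': '100', 'pageNumber': str(page)},
        'sortOrder': 'EndTimeSoonest',
    })
    reply = response.reply
    all_items.extend(reply.searchResult.item)
    # totalPages comes back in paginationOutput; the API caps results at 100 pages.
    if page >= min(int(reply.paginationOutput.totalPages), 100):
        break
    page += 1

print(len(all_items))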

Data field not properly loading dict for Request

I am trying to push this JSON into the request's data parameter, but I think the format is not being passed in correctly. I keep getting a bad request. I know this request works when I put it in my REST client.
Am I not formatting the JSON correctly for the POST request?
import json
import requests
import pprint
json_obj1 = """
    'dateStart': '2019-11-25T00:00:00.000Z',
    'dateEnd': '2019-11-26T23:59:59.999Z',
    'subscriptions':
    {
        'category': {
            'name': 'Accessories',
            'childrenUuids': [],
            'uuid': 'c35cb71f-5dcd-4ae3-86b3-d642208ad7f5'
        },
        'geography': {
            'uuid': 'ad63a8ff-f636-44e1-9fe0-1d1664dfd530',
            'name': 'New York',
            'geoType': 'METRO',
            'childrenUuids': []
        }
    }
"""
s = requests.session()
s.headers = {'Content-Type': 'application/json'}
infra_link = <someURL>
infra_content = s.request(
    method='POST', url=infra_link, data=json_obj1, headers=s.headers,
).text
RESULT:
{"timestamp":"2019-11-27T16:22:49.885+0000","status":400,"error":"Bad Request","exception":"org.springframework.http.converter.HttpMessageNotReadableException","message":"Bad Request","path":"/index"}
Try changing this:
infra_content = s.request(
    method='POST', url=infra_link, data=json_string1, headers=s.headers,
).text
to this:
infra_content = s.request(
    method='POST', url=infra_link, data=json.dumps(json_string1), headers=s.headers,
).text
json.dumps() was added to the data parameter of the request.
You're not passing a JSON string as your data argument; you're passing a dict.
Try the following:
infra_content = s.request(
    method='POST', url=infra_link, data=json.dumps(json_string1), headers=s.headers,
).text
Your variable json_string1 is badly named, which might be confusing you. As you can see, it's not a string but a dict (and json is a specific string format, but a string format nonetheless). json.dumps (which stands for dump as string) is used to serialize a dict into a json string.
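As an additional hedged sketch (not from the answers above): if the payload is built as a Python dict in the first place, requests can serialize it and set the Content-Type header itself via the json= keyword, avoiding the manual json.dumps call. The field values below are copied from the question; the URL is a placeholder standing in for the asker's <someURL>:
import requests

payload = {
    'dateStart': '2019-11-25T00:00:00.000Z',
    'dateEnd': '2019-11-26T23:59:59.999Z',
    'subscriptions': {
        'category': {
            'name': 'Accessories',
            'childrenUuids': [],
            'uuid': 'c35cb71f-5dcd-4ae3-86b3-d642208ad7f5',
        },
        'geography': {
            'uuid': 'ad63a8ff-f636-44e1-9fe0-1d1664dfd530',
            'name': 'New York',
            'geoType': 'METRO',
            'childrenUuids': [],
        },
    },
}

infra_link = 'https://example.com/index'  # placeholder for the asker's <someURL>

# json= serializes the dict and sets Content-Type: application/json automatically.
response = requests.post(infra_link, json=payload)
print(response.status_code, response.text)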

Sending list of dicts as value of dict with requests.post going wrong

I have a client-server app.
I have localized the trouble, and here is the logic:
Client:
# -*- coding: utf-8 -*-
import requests

def fixing():
    response = requests.post('http://url_for_auth/', data={'client_id': 'client_id',
                             'client_secret': 'its_secret', 'grant_type': 'password',
                             'username': 'user', 'password': 'password'})
    f = response.json()
    data = {'coordinate_x': 12.3, 'coordinate_y': 8.4, 'address': u'\u041c, 12',
            'products': [{'count': 1, 'id': 's123'}, {'count': 2, 'id': 's124'}]}
    data.update(f)
    response = requests.post('http://url_for_working/', data=data)
    response.text  # There I have an error, about which I will say more later
OAuth2 is working well, but on the server side I have no products in request.data:
<QueryDict: {u'token_type': [u'type_is_ok'], u'access_token': [u'token_is_ok'],
u'expires_in': [u'36000'], u'coordinate_y': [u'8.4'],
u'coordinate_x': [u'12.3'], u'products': [u'count', u'id', u'count',
u'id'], u'address': [u'\u041c, 12'], u'scope': [u'read write'],
u'refresh_token': [u'token_is_ok']}>
This part of the QueryDict makes me sad...
'products': [u'count', u'id', u'count', u'id']
And when I tried to make a Python dict:
request.data.dict()
... u'products': u'id', ...
The other fields work fine with the Django serializer's validation, but not this one, because the values are wrong.
It looks like requests (since it defaults to x-www-form-urlencoded) can't encode a list of dicts as the value for a key, so I should use JSON in this case.
Finally I made this function:
import requests
import json

def fixing():
    response = requests.post('http://url_for_auth/', data={'client_id': 'client_id',
                             'client_secret': 'its_secret', 'grant_type': 'password',
                             'username': 'user', 'password': 'password'})
    f = response.json()
    headers = {'authorization': f['token_type'].encode('utf-8') + ' ' + f['access_token'].encode('utf-8'),
               'Content-Type': 'application/json'}
    data = {'coordinate_x': 12.3, 'coordinate_y': 8.4, 'address': u'\u041c, 12',
            'products': [{'count': 1, 'id': 's123'}, {'count': 2, 'id': 's124'}]}
    response = requests.post('http://url_for_working/', data=json.dumps(data),
                             headers=headers)
    response.text
There I got the right response.
Solved!
Hello, I would like to refresh this topic because I have a similar problem and the above solution doesn't work for me.
import requests
import urllib.request
import pprint
import json
from requests import auth
from requests.models import HTTPBasicAuth

payload = {
    'description': 'zxcy',
    'tags': [{
        'id': 22,
        'label': 'Card'}]
}
files = {'file': open('JAM5.pdf', 'rb')}
client_id = 32590
response = requests.post('https://system...' + str(client_id), files=files, data=payload, auth=HTTPBasicAuth(...))
The above code successfully adds the file to the CRM system along with a description for it, but I also have to add a label, and that does not seem to work at all.
When I try it with data=json.dumps(payload) I get this:
raise ValueError("Data must not be a string.")
ValueError: Data must not be a string.
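No answer is shown for this follow-up, but the ValueError is raised by requests when data is a string and files is passed at the same time. A hedged sketch of one common workaround: keep data as a dict and JSON-encode only the nested tags list, assuming the CRM accepts that field as a JSON string inside the multipart form (whether it does depends on that API; the URL and credentials below are placeholders for the asker's elided values):
import json
import requests
from requests.auth import HTTPBasicAuth

client_id = 32590

# Keep the outer payload a dict (so requests can build the multipart body)
# and serialize only the nested list, since form fields must be flat strings.
payload = {
    'description': 'zxcy',
    'tags': json.dumps([{'id': 22, 'label': 'Card'}]),
}
files = {'file': open('JAM5.pdf', 'rb')}

# URL and credentials are placeholders standing in for the asker's elided values.
response = requests.post(
    'https://example.com/files/' + str(client_id),
    files=files,
    data=payload,
    auth=HTTPBasicAuth('user', 'password'),
)
print(response.status_code, response.text)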
