Scraping json response using scrapy - python

I am new to Python, Scrapy, and JSON. I am trying to scrape the JSON response from the URL below, but it is showing an error. The code I used is:
import scrapy
import json
import re

class BlackSpider(scrapy.Spider):
    name = 'black'
    start_urls = ['https://appworld.blackberry.com/cas/content/2360/reviews/2.17.2?page=1&pagesize=100&sortby=newest&callback=_content_2360_reviews_2_17_2&_=1499161778751']

    def parse(self, response):
        data = re.findall('(\{.+\})\);', response.body_as_unicode())
        a = json.loads(data[0])
        item = MyItem()  # MyItem is the asker's scrapy Item class, defined elsewhere
        item["Reviews"] = a["reviews"][4]["review"]
        return item
The error it is showing is:
ValueError: No JSON object could be decoded

The response you are getting is a JavaScript function call with some JSON inside it:
_content_2360_reviews_2_17_2(\r\n{"some":"json"}]});\r\n
To extract the data from this you can use a simple regex solution:
import re
import json
data = re.findall('(\{.+\})\);', response.body_as_unicode())
json.loads(data[0])
It translates to: select everything between { and } that ends with );
Edit: results I'm getting with this:
{'platform': None,
 'reviews': [{'createdDate': '2017-07-04',
              'model': 'London',
              'nickname': 'aravind14-92362',
              'rating': 6,
              'review': 'Very bad ',
              'title': 'My WhatsApp no update '}],
 'totalReviews': 569909,
 'version': '2.17.2'}
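For reference, a minimal self-contained sketch of how the regex extraction could sit inside the spider's parse callback. It yields plain dicts instead of the asker's MyItem (so it runs without that item definition), iterates over all reviews rather than a single hard-coded index, and uses response.text, the newer equivalent of body_as_unicode():
import re
import json
import scrapy


class BlackSpider(scrapy.Spider):
    name = 'black'
    start_urls = ['https://appworld.blackberry.com/cas/content/2360/reviews/2.17.2?page=1&pagesize=100&sortby=newest&callback=_content_2360_reviews_2_17_2&_=1499161778751']

    def parse(self, response):
        # Strip the JSONP wrapper: keep everything between the outer braces
        # that is followed by ");".
        data = re.findall(r'(\{.+\})\);', response.text)
        if not data:
            self.logger.error('No JSON found in response')
            return
        payload = json.loads(data[0])
        # Yield one dict per review instead of indexing a single one.
        for review in payload.get('reviews', []):
            yield {'Reviews': review.get('review')}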

Related

Iterate through a nested dict inside of a list, with some missing keys

I need your guys' help on how to extract information from a nested dictionary inside a list. Here's the code to get the data:
import requests
import json
import time

all_urls = []
for x in range(5000, 5010):
    url = f'https://api.jikan.moe/v4/anime/{x}/full'
    all_urls.append(url)

all_responses = []
for page_url in all_urls:
    response = requests.get(page_url)
    all_responses.append(response)
    time.sleep(1)

print(all_responses)

data = []
for i in all_responses:
    json_data = json.loads(i.text)
    data.append(json_data)

print(data)
The sample of the extracted data is as follows:
[{'status': 404,
  'type': 'BadResponseException',
  'message': 'Resource does not exist',
  'error': '404 on https://myanimelist.net/anime/5000/'},
 {'status': 404,
  'type': 'BadResponseException',
  'message': 'Resource does not exist',
  'error': '404 on https://myanimelist.net/anime/5001/'},
 {'data': {'mal_id': 5002,
           'url': 'https://myanimelist.net/anime/5002/Bari_Bari_Densetsu',
           'images': {'jpg': {'image_url': 'https://cdn.myanimelist.net/images/anime/4/58873.jpg',
                              'small_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873t.jpg',
                              'large_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873l.jpg'},
                      'webp': {'image_url': 'https://cdn.myanimelist.net/images/anime/4/58873.webp',
                               'small_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873t.webp',
                               'large_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873l.webp'}},
           'trailer': {'youtube_id': None,
                       'url': None,
                       'embed_url': None,
                       'images': {'image_url': None,
                                  'small_image_url': None,
                                  'medium_image_url': None,
                                  'large_image_url': None,
                                  'maximum_image_url': None}},
           'title': 'Bari Bari Densetsu',
           'title_english': None,
           'title_japanese': 'バリバリ伝説',
           'title_synonyms': ['Baribari Densetsu',
           ......
I need to extract the title from the list of data. Any help is appreciated! Also, any recommendation on a better/simpler/cleaner code to extract the json data from an API is also greatly appreciated!
Firstly, no need to create multiple lists. You can do everything in one loop:
import requests
import json

data = []
for x in range(5000, 5010):
    page_url = f'https://api.jikan.moe/v4/anime/{x}/full'
    response = requests.get(page_url)
    json_data = json.loads(response.text)
    data.append(json_data)

print(data)
Second, to address your problem, you have two options. You can use dict.get:
for dic in data:
    title = dic.get('title', 'no title')
Or use the try/except pattern:
for dic in data:
    try:
        title = dic['title']
    except KeyError:
        # deal with case where dict has no title
        pass
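One detail from the sample output above: for successful responses the title sits one level down, under the 'data' key, while the 404 entries have no 'data' at all. A small sketch combining the single loop with that nested lookup (field names are taken from the sample output; this is an illustration, not the answerer's code):
import requests

titles = []
for x in range(5000, 5010):
    response = requests.get(f'https://api.jikan.moe/v4/anime/{x}/full')
    json_data = response.json()  # equivalent to json.loads(response.text)
    # 404 responses have no 'data' key, so fall back to an empty dict
    title = json_data.get('data', {}).get('title', 'no title')
    titles.append(title)

print(titles)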

Beautiful soup - html parser returns dots instead of string visible on web

I'm trying to get the number of actors from: https://apify.com/store which is under the following HTML:
<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>
When I send a GET request and parse the response with BeautifulSoup using:
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text
I get three dots ... instead of the number 895
the element is <span class="ActorStore-statusNbHitsNumber">...</span>
How can I get the number?
If you inspect the network calls in your browser (press F12) and filter by XHR, you'll see that the data is loaded dynamically by sending a POST request.
You can mimic that request by sending the correct JSON data. There's no need for BeautifulSoup; you can use the requests module alone.
Here is a complete working example:
import requests

data = {
    "query": "",
    "page": 0,
    "hitsPerPage": 24,
    "restrictSearchableAttributes": [],
    "attributesToHighlight": [],
    "attributesToRetrieve": [
        "title",
        "name",
        "username",
        "userFullName",
        "stats",
        "description",
        "pictureUrl",
        "userPictureUrl",
        "notice",
        "currentPricingInfo",
    ],
}

response = requests.post(
    "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
    json=data,
)
print(response.json()["nbHits"])
Output:
895
To view all the JSON data in order to access the key/value pairs, you can use:
from pprint import pprint
pprint(response.json(), indent=4)
Partial output:
{ 'exhaustiveNbHits': True,
  'exhaustiveTypo': True,
  'hits': [ { 'currentPricingInfo': None,
              'description': 'Crawls arbitrary websites using the Chrome '
                             'browser and extracts data from pages using '
                             'a provided JavaScript code. The actor '
                             'supports both recursive crawling and lists '
                             'of URLs and automatically manages '
                             'concurrency for maximum performance. This '
                             "is Apify's basic tool for web crawling and "
                             'scraping.',
              'name': 'web-scraper',
              'objectID': 'moJRLRc85AitArpNN',
              'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
              'stats': { 'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
                         'totalBuilds': 104,
                         'totalMetamorphs': 102660,
                         'totalRuns': 68036112,
                         'totalUsers': 23492,
                         'totalUsers30Days': 1726,
                         'totalUsers7Days': 964,
                         'totalUsers90Days': 3205},
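If you also want fields from the individual results rather than just the count, the hits list in the same JSON response can be iterated. A short follow-up sketch, reusing the response object from the example above (field names taken from the partial output):
for hit in response.json()["hits"]:
    # Each hit is one actor; print its name and total runs.
    print(hit["name"], hit["stats"]["totalRuns"])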

EBAY Finding API Date Filtering

I am trying to return a list of completed items in a given category using the eBay API. My code seems to be working; however, the results seem very limited (about 100). I assumed there would be some limitation on how far back the API would go, but even just a few days should return thousands of results for this category. Am I missing something in the code, or is this just a limitation of the eBay API? I did make sure I was using production and not the sandbox.
I have since realized that there are multiple pages to my query, up to the 100-item / 100-page max. I am now running into issues with the date filtering. I see the filter reference material on the site, but I am still not getting the result I expect. In the updated query I am trying to pull only items completed yesterday, but when I run it I am getting items from today. Is there a better way to input the date filters?
from ebaysdk.finding import Connection as finding
from bs4 import BeautifulSoup
import os
import csv

api = finding(appid=<my appid>, config_file=None)
response = api.execute(
    'findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'endTimeFrom': '2020-02-03T00:00:00.000Z',
        'endTimeTo': '2020-02-04T00:00:00.000Z',
        'paginationInput': {
            'entriesPerPage': '100',
            'pageNumber': '1'
        },
        'sortOrder': 'EndTimeSoonest'
    }
)

soup = BeautifulSoup(response.content, 'lxml')
totalitems = int(soup.find('totalentries').text)
items = soup.find_all('item')
for item in response.reply.searchResult.item:
    print(item.itemId)
    print(item.listingInfo.endTime)
I finally figured this out. I needed to add additional code for the item filters. The working code is below.
from ebaysdk.finding import Connection as finding
from bs4 import BeautifulSoup
import os
import csv

api = finding(appid=<my appid>, config_file=None)
response = api.execute(
    'findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'itemFilter': [
            {'name': 'EndTimeFrom', 'value': '2020-02-03T00:00:00.000Z'},
            {'name': 'EndTimeTo', 'value': '2020-02-04T00:00:00.000Z'}
            #{'name': 'MinPrice', 'value': '200', 'paramName': 'Currency', 'paramValue': 'GBP'},
            #{'name': 'MaxPrice', 'value': '400', 'paramName': 'Currency', 'paramValue': 'GBP'}
        ],
        'paginationInput': {
            'entriesPerPage': '100',
            'pageNumber': '100'
        },
        'sortOrder': 'EndTimeSoonest'
    }
)

soup = BeautifulSoup(response.content, 'lxml')
totalitems = int(soup.find('totalentries').text)
items = soup.find_all('item')
for item in response.reply.searchResult.item:
    print(item.itemId)
    print(item.listingInfo.endTime)
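Since the original issue was the 100-results-per-page cap, here is a hedged sketch of looping the same request over page numbers to collect every page. It reuses the item filters above, keeps the asker's <my appid> placeholder, and assumes the paginationOutput.totalPages field of the Finding API response (and that searchResult.item is a list, as in the code above):
from ebaysdk.finding import Connection as finding

api = finding(appid='<my appid>', config_file=None)  # placeholder app ID

item_filters = [
    {'name': 'EndTimeFrom', 'value': '2020-02-03T00:00:00.000Z'},
    {'name': 'EndTimeTo', 'value': '2020-02-04T00:00:00.000Z'},
]

all_items = []
page = 1
while True:
    response = api.execute('findCompletedItems', {
        'categoryId': '214',
        'keywords': 'prizm',
        'itemFilter': item_filters,
        'paginationInput': {'entriesPerPage': '100', 'pageNumber': str(page)},
        'sortOrder': 'EndTimeSoonest',
    })
    reply = response.reply
    all_items.extend(reply.searchResult.item)
    # totalPages comes back in paginationOutput; the API caps results at 100 pages.
    if page >= min(int(reply.paginationOutput.totalPages), 100):
        break
    page += 1

print(len(all_items))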

Data field not properly loading dict for Request

I am trying to push this JSON into the request's data parameter, but I think the format is not being passed in correctly. I keep getting a bad request. I know this request works when I put it in my REST client.
Am I not formatting the JSON correctly for the POST request?
import json
import requests
import pprint
json_obj1 = """
    'dateStart': '2019-11-25T00:00:00.000Z',
    'dateEnd': '2019-11-26T23:59:59.999Z',
    'subscriptions':
    {
        'category': {
            'name': 'Accessories',
            'childrenUuids': [],
            'uuid': 'c35cb71f-5dcd-4ae3-86b3-d642208ad7f5'
        },
        'geography': {
            'uuid': 'ad63a8ff-f636-44e1-9fe0-1d1664dfd530',
            'name': 'New York',
            'geoType': 'METRO',
            'childrenUuids': []
        }
    }
"""
s = requests.session()
s.headers = {'Content-Type': 'application/json'}
infra_link = <someURL>
infra_content = s.request(
    method='POST', url=infra_link, data=json_obj1, headers=s.headers,
).text
RESULT:
{"timestamp":"2019-11-27T16:22:49.885+0000","status":400,"error":"Bad Request","exception":"org.springframework.http.converter.HttpMessageNotReadableException","message":"Bad Request","path":"/index"}
Try changing this:
infra_content = s.request(
    method='POST', url=infra_link, data=json_string1, headers=s.headers,
).text
to this:
infra_content = s.request(
    method='POST', url=infra_link, data=json.dumps(json_string1), headers=s.headers,
).text
json.dumps() was added to the data parameter of the request.
You're not passing a JSON string as your data argument; you're passing a dict.
Try the following:
infra_content = s.request(
    method='POST', url=infra_link, data=json.dumps(json_string1), headers=s.headers,
).text
Your variable json_string1 is badly named, which might be confusing you. As you can see, it's not a string but a dict (and json is a specific string format, but a string format nonetheless). json.dumps (which stands for dump as string) is used to serialize a dict into a json string.
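As an additional hedged sketch (not from the answers above): if the payload is built as a Python dict in the first place, requests can serialize it and set the Content-Type header itself via the json= keyword, avoiding the manual json.dumps call. The field values below are copied from the question; the URL is a placeholder standing in for the asker's <someURL>:
import requests

payload = {
    'dateStart': '2019-11-25T00:00:00.000Z',
    'dateEnd': '2019-11-26T23:59:59.999Z',
    'subscriptions': {
        'category': {
            'name': 'Accessories',
            'childrenUuids': [],
            'uuid': 'c35cb71f-5dcd-4ae3-86b3-d642208ad7f5',
        },
        'geography': {
            'uuid': 'ad63a8ff-f636-44e1-9fe0-1d1664dfd530',
            'name': 'New York',
            'geoType': 'METRO',
            'childrenUuids': [],
        },
    },
}

infra_link = 'https://example.com/index'  # placeholder for the asker's <someURL>

# json= serializes the dict and sets Content-Type: application/json automatically.
response = requests.post(infra_link, json=payload)
print(response.status_code, response.text)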

Sending list of dicts as value of dict with requests.post going wrong

I have a client-server app.
I have localized the trouble, and here is the logic:
Client:
# -*- coding: utf-8 -*-
import requests

def fixing():
    response = requests.post('http://url_for_auth/', data={'client_id': 'client_id',
                             'client_secret': 'its_secret', 'grant_type': 'password',
                             'username': 'user', 'password': 'password'})
    f = response.json()
    data = {'coordinate_x': 12.3, 'coordinate_y': 8.4, 'address': u'\u041c, 12',
            'products': [{'count': 1, 'id': 's123'}, {'count': 2, 'id': 's124'}]}
    data.update(f)
    response = requests.post('http://url_for_working/', data=data)
    response.text  # There I have an error, about which I will say more later
OAuth2 is working well, but on the server side I have no products in request.data:
<QueryDict: {u'token_type': [u'type_is_ok'], u'access_token': [u'token_is_ok'],
u'expires_in': [u'36000'], u'coordinate_y': [u'8.4'],
u'coordinate_x': [u'12.3'], u'products': [u'count', u'id', u'count',
u'id'], u'address': [u'\u041c, 12'], u'scope': [u'read write'],
u'refresh_token': [u'token_is_ok']}>
This part of the QueryDict makes me sad...
'products': [u'count', u'id', u'count', u'id']
And when I tried to make a Python dict:
request.data.dict()
... u'products': u'id', ...
The other fields work fine with the Django serializer's validation, but not this one, because the values are wrong.
It looks like requests (since it defaults to x-www-form-urlencoded) can't encode a list of dicts as the value for a key, so I should use JSON in this case.
Finally I made this function:
import requests
import json

def fixing():
    response = requests.post('http://url_for_auth/', data={'client_id': 'client_id',
                             'client_secret': 'its_secret', 'grant_type': 'password',
                             'username': 'user', 'password': 'password'})
    f = response.json()
    headers = {'authorization': f['token_type'].encode('utf-8') + ' ' + f['access_token'].encode('utf-8'),
               'Content-Type': 'application/json'}
    data = {'coordinate_x': 12.3, 'coordinate_y': 8.4, 'address': u'\u041c, 12',
            'products': [{'count': 1, 'id': 's123'}, {'count': 2, 'id': 's124'}]}
    response = requests.post('http://url_for_working/', data=json.dumps(data),
                             headers=headers)
    response.text
There I got the right response.
Solved!
Hello, I would like to refresh this topic because I have a similar problem and the above solution doesn't work for me.
import requests
import urllib.request
import pprint
import json
from requests import auth
from requests.models import HTTPBasicAuth

payload = {
    'description': 'zxcy',
    'tags': [{
        'id': 22,
        'label': 'Card'}]
}
files = {'file': open('JAM5.pdf', 'rb')}
client_id = 32590
response = requests.post('https://system...' + str(client_id), files=files, data=payload, auth=HTTPBasicAuth(...))
The above code successfully adds the file to the CRM system along with a description for it, but I also have to add a label, and that does not seem to work at all.
When I try it with data=json.dumps(payload) I get this:
raise ValueError("Data must not be a string.")
ValueError: Data must not be a string.
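No answer is shown for this follow-up, but the ValueError is raised by requests when data is a string and files is passed at the same time. A hedged sketch of one common workaround: keep data as a dict and JSON-encode only the nested tags list, assuming the CRM accepts that field as a JSON string inside the multipart form (whether it does depends on that API; the URL and credentials below are placeholders for the asker's elided values):
import json
import requests
from requests.auth import HTTPBasicAuth

client_id = 32590

# Keep the outer payload a dict (so requests can build the multipart body)
# and serialize only the nested list, since form fields must be flat strings.
payload = {
    'description': 'zxcy',
    'tags': json.dumps([{'id': 22, 'label': 'Card'}]),
}
files = {'file': open('JAM5.pdf', 'rb')}

# URL and credentials are placeholders standing in for the asker's elided values.
response = requests.post(
    'https://example.com/files/' + str(client_id),
    files=files,
    data=payload,
    auth=HTTPBasicAuth('user', 'password'),
)
print(response.status_code, response.text)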
