Python: Download Images from Google

I am using Google's Books API to fetch information about books based on ISBN number. The response includes thumbnail links along with other information and looks like this:
"imageLinks": {
"smallThumbnail": "http://books.google.com/books/content?id=tEDhAAAAMAAJ&printsec=frontcover&img=1&zoom=5&source=gbs_api",
"thumbnail": "http://books.google.com/books/content?id=tEDhAAAAMAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api"
},
I want to download the thumbnails at the links above and store them on the local file system. How can this be done in Python?

Use the urllib module.
Example (Python 2):
import urllib
d = {
    "imageLinks": {
        "smallThumbnail": "http://books.google.com/books/content?id=tEDhAAAAMAAJ&printsec=frontcover&img=1&zoom=5&source=gbs_api",
        "thumbnail": "http://books.google.com/books/content?id=tEDhAAAAMAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api"
    }
}
urllib.urlretrieve(d["imageLinks"]["thumbnail"], "MyThumbNail.jpg")
Python 3.x:
from urllib import request
with open("MyThumbNail.jpg", "wb") as outfile:
    outfile.write(request.urlopen(d["imageLinks"]["thumbnail"]).read())
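If the third-party requests library is available (an assumption; it is not bundled with Python), the same download can be sketched like this:
import requests
resp = requests.get(d["imageLinks"]["thumbnail"])
resp.raise_for_status()  # stop on HTTP errors instead of saving an error page
with open("MyThumbNail.jpg", "wb") as outfile:
    outfile.write(resp.content)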

Related

scraping data from json after using requests

I am trying to extract specific data from a requested JSON file.
After passing the Authorization header and using requests.get I got my response; I think it is called a dictionary by Python coders and JSON by JavaScript coders.
It contains much more information than I need, and I would like to extract only one or two fields,
for example {"bio" : " hello world " },
and that JSON contains more than one "bio".
For example, I am scraping 100 accounts and I would like to extract all the "bio" values in one go.
So I tried this:
from bs4 import BeautifulSoup
import requests
headers = {"Authorization" : "xxxx"}
req = requests.get('website', headers = headers)
data = req.text
soup = BeautifulSoup(data,'html.parser')
titles = soup.find_all('span',{'class':'bio'})
for title in titles:
    print(title.text)
It didn't work, and I tried multiple other ideas with no success.
If possible, please write code that I can understand, since I am trying to learn from my mistakes.
Thanks.
The Aphid library I created is perfect for this.
From the command prompt:
py -m pip install Aphid
Then it's just as easy as loading your JSON data and searching it with Aphid.
import json
import requests
import Aphid
resp = requests.get(yoururl)
data = json.loads(resp.text)
results = Aphid.findall(data, 'bio')
results is now a list of (key, value) tuples, one for every occurrence of the 'bio' key.
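A minimal usage sketch, assuming results holds those (key, value) tuples:
for key, value in results:
    print(value)  # each extracted "bio" string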
After you get your response, either:
you get a plain JSON file (in which case you load it into Python using the json module), or
you get an HTML file from which you can extract the JSON (using BeautifulSoup), which in turn you parse with the json library.
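If you would rather not install anything extra, a small standard-library sketch (assuming the JSON has already been parsed into a Python object called data) that collects every 'bio' value from nested dicts and lists could look like this:
def find_all(obj, key):
    # walk nested dicts/lists and yield every value stored under `key`
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_all(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_all(item, key)

bios = list(find_all(data, 'bio'))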

Open a JSON file link through code

I'm creating an add-on and I'm modifying some functions that come within a .py file.
What I intend to do is the following. I have this code:
def channellist():
    return json.loads(openfile('lib.json',pastafinal=os.path.join(tugapath,'resources')))
This code reads a lib.json file that lives in the resources subfolder of the tugapath folder. What I did was put the lib.json file on Dropbox, and I want to use the Dropbox link to lib.json instead of reading it from those folders.
I tried to change the code but without success.
def channellist():
    return json.loads(openfile('lib.json',pastafinal=os.path.join("https://www.dropbox.com/s/sj1246qtiodm6qd/lib.json?dl=1')))
If someone can help me, I'd be grateful!
Thank you in advance.
Given that your link holds valid json - which is not the case with the content you posted - you could use requests.
If the content at Dropbox looked like this:
{"tv":
{"epg": "tv",
"streams":
[{"url": "http://topchantv.net:3456/live/Stalker/Stalker/838.m3u8",
"name": "IPTV",
"resolve": False,
"visible": True}],
"name": "tv",
"thumb": "thumb_tv.png"
}
}
Then fetching the content would look like this:
import requests
url = 'https://www.dropbox.com/s/sj1246qtiodm6qd/lib.json?dl=1'
r = requests.get(url)
json_object = r.json()
So if you needed it inside a function, I guess you'd input the url and return the json like so:
def channellist(url):
    r = requests.get(url)
    json_object = r.json()
    return json_object
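A usage sketch (the URL is the one from the question; whether the Dropbox content is actually valid JSON is an assumption):
channels = channellist('https://www.dropbox.com/s/sj1246qtiodm6qd/lib.json?dl=1')
print(channels)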

python web-scraping yahoo finance

Since Yahoo Finance updated their website, some tables seem to be created dynamically and are not actually stored in the HTML (I used to get this information using BeautifulSoup and urllib, but this won't work anymore). I am after the Analyst tables, for example for ADP, specifically the Earnings Estimates row for Year Ago EPS (Current Year column). You cannot get this information from the API.
I found this link, which works well for the Analyst Recommendation Trends. Does anyone know how to do something similar for the main table on this page? (LINK: python lxml etree applet information from yahoo.)
I tried to follow the steps taken, but frankly it's beyond me.
Returning the whole table is all I need; I can pick out bits from there. Cheers.
In order to get that data, you need to open Chrome DevTools and select the Network tab with the XHR filter. If you click on the ADP request, you can see the link under Request URL.
You can use the Requests library to make the request and parse the JSON response from the site.
import requests
from pprint import pprint
url = 'https://query1.finance.yahoo.com/v10/finance/quoteSummary/ADP?formatted=true&crumb=ILlIC9tOoXt&lang=en-US&region=US&modules=upgradeDowngradeHistory%2CrecommendationTrend%2CfinancialData%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend&corsDomain=finance.yahoo.com'
r = requests.get(url).json()
pprint(r)
Further to vold's answer above, and using the answer in the link I posted above (credit to saaj): this gives just the dataset I need and is neater when calling the module. I am not sure what the crumb parameter is, but it seems to work OK without it.
import json
from pprint import pprint
from urllib.request import urlopen
from urllib.parse import urlencode
def parse():
    host = 'https://query1.finance.yahoo.com'
    #host = 'https://query2.finance.yahoo.com'  # try if above doesn't work
    path = '/v10/finance/quoteSummary/%s' % 'ADP'
    params = {
        'formatted': 'true',
        #'crumb': 'ILlIC9tOoXt',
        'lang': 'en-US',
        'region': 'US',
        'modules': 'earningsTrend',
        'domain': 'finance.yahoo.com'
    }
    response = urlopen('{}{}?{}'.format(host, path, urlencode(params)))
    data = json.loads(response.read().decode())
    pprint(data)

if __name__ == '__main__':
    parse()
Other modules (just add a comma between them):
assetProfile
financialData
defaultKeyStatistics
calendarEvents
incomeStatementHistory
cashflowStatementHistory
balanceSheetHistory
recommendationTrend
upgradeDowngradeHistory
earningsHistory
earningsTrend
industryTrend
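As a rough sketch of drilling into the parsed response for the earnings figures the question asks about (the exact layout of Yahoo's quoteSummary payload is an assumption here, so check the pprint output first; data is the decoded response inside parse() above):
trend = data['quoteSummary']['result'][0]['earningsTrend']['trend']
for entry in trend:
    # each entry covers one period, e.g. the current quarter or the current year
    print(entry.get('period'), entry.get('earningsEstimate'))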
On GitHub, c0redumb has proposed a whole solution. You can download yqd.py. After importing it, you can get Yahoo Finance data with one line of code, as below.
import yqd
yf_data = yqd.load_yahoo_quote('GOOG', '20170722', '20170725')
The result 'yf_data' is:
['Date,Open,High,Low,Close,Adj Close,Volume',
'2017-07-24,972.219971,986.200012,970.770020,980.340027,980.340027,3248300',
'2017-07-25,953.809998,959.700012,945.400024,950.700012,950.700012,4661000',
'']
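Since the result is a list of CSV-formatted strings, a small follow-up sketch (assuming yf_data looks like the output above) could turn it into rows:
import csv
rows = list(csv.reader(line for line in yf_data if line))  # skip the trailing empty string
header, records = rows[0], rows[1:]
print(header)
print(records[0])  # first quote row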

Python & Json - Ebay Api Upload image Error

I've been trying to upload a PNG image to the eBay API with the return_file_upload call:
http://developer.ebay.com/Devzone/post-order/post-order_v2_return-returnId_file_upload__post.html#Samples
It's weird because the documentation says it accepts an array for the data parameter, but the samples don't use arrays. When I tried using an array I got a "Can not deserialize instance of byte out of VALUE_STRING at [Source: java.io.SequenceInputStream#4d57f134; line: 1, column: 11] (through reference chain: com.ebay.marketplace.returns.v3.services.request.UploadFileRequest["data"])" error.
This is my code:
import json
import base64
import requests
with open("take_full_login.png", "rb") as image_file:
    encoded_string = base64.encodestring(image_file.read())
url2 = 'https://api.ebay.com/post-order/v2/return/123456/file/upload'
payload2 = {
    "data": encoded_string,
    "filePurpose": "LABEL_RELATED"
}
# headers (authorization etc.) are defined elsewhere and not shown in this snippet
requests.post(url=url2, data=json.dumps(payload2), headers=headers)
That currently outputs
{"error":[{"errorId":1616,"domain":"returnErrorDomain","severity":"ERROR","category":"REQUEST","message":"Invalid Input.","parameter":[{"value":"data","name":"parameter"}],"longMessage":"Invalid Input.","httpStatusCode":400}]}
Try replacing data=json.dumps(payload2) with json=payload2.
The call /post-order/v2/cancellation/check_eligibility only worked that way for me.
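A sketch of the suggested change (decoding the base64 bytes to a string first is an assumption about what the API expects, since json cannot serialize raw bytes):
payload2 = {
    "data": encoded_string.decode("ascii"),  # bytes -> str so it is JSON-serializable
    "filePurpose": "LABEL_RELATED"
}
requests.post(url=url2, json=payload2, headers=headers)  # requests serializes payload2 itself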

using xpath to parse images from the web

I've written a bit of code in an attempt to pull photos from a website. I want it to find photos, then download them so I can tweet them:
import urllib2
from lxml.html import fromstring
import sys
import time
url = "http://www.phillyhistory.org/PhotoArchive/Search.aspx"
response = urllib2.urlopen(url)
html = response.read()
dom = fromstring(html)
sels = dom.xpath('//*[(#id = "large_media")]')
for pic in sels[:1]:
    output = open("file01.jpg","w")
    output.write(pic.read())
    output.close()
#twapi = tweepy.API(auth)
#twapi.update_with_media(imagefilename, status=xxx)
I'm new at this sort of thing, so I'm not really sure why this isn't working. No file is created, and no 'sels' are being created.
Your problem is that the image search (Search.aspx) doesn't just return an HTML page with all the content in it, but instead delivers a JavaScript application that then makes several subsequent requests (see AJAX) to fetch raw information about assets and builds an HTML page dynamically containing all those search results.
You can observe this behavior by looking at the HTTP requests your browser makes when you load the page. Use the Firebug extension for Firefox or the built-in Chrome developer tools and open the Network tab. Look for requests that happen after the initial page load, particularly POST requests.
In this case the interesting requests are the ones to Thumbnails.ashx, Details.ashx and finally MediaStream.ashx. Once you identify those requests, look at what headers and form data your browser sends, and emulate that behavior with plain HTTP requests from Python.
The response from Thumbnails.ashx is actually JSON, so it's much easier to parse than HTML.
In this example I use the requests module because it's much, much better and easier to use than urllib(2). If you don't have it, install it with pip install requests.
Try this:
import requests
import urllib
BASE_URL = 'http://www.phillyhistory.org/PhotoArchive/'
QUERY_URL = BASE_URL + 'Thumbnails.ashx'
DETAILS_URL = BASE_URL + 'Details.ashx'
def get_media_url(asset_id):
    response = requests.post(DETAILS_URL, data={'assetId': asset_id})
    image_details = response.json()
    media_id = image_details['assets'][0]['medialist'][0]['mediaId']
    return '{}/MediaStream.ashx?mediaId={}'.format(BASE_URL, media_id)

def save_image(asset_id):
    filename = '{}.jpg'.format(asset_id)
    url = get_media_url(asset_id)
    with open(filename, 'wb') as f:
        response = requests.get(url)
        f.write(response.content)
    return filename
urlqs = {
    'maxx': '-8321310.550067',
    'maxy': '4912533.794965',
    'minx': '-8413034.983992',
    'miny': '4805521.955385',
    'onlyWithoutLoc': 'false',
    'sortOrderM': 'DISTANCE',
    'sortOrderP': 'DISTANCE',
    'type': 'area',
    'updateDays': '0',
    'withoutLoc': 'false',
    'withoutMedia': 'false'
}
data = {
    'start': 0,
    'limit': 12,
    'noStore': 'false',
    'request': 'Images',
    'urlqs': urllib.urlencode(urlqs)
}
response = requests.post(QUERY_URL, data=data)
result = response.json()
print '{} images found'.format(result['totalImages'])
for image in result['images']:
    asset_id = image['assetId']
    print 'Name: {}'.format(image['name'])
    print 'Asset ID: {}'.format(asset_id)
    filename = save_image(asset_id)
    print "Saved image to '{}'.\n".format(filename)
Note: I didn't check what http://www.phillyhistory.org/'s Terms of Service have to say about automated crawling. You need to check yourself and make sure you're not in violation of their ToS with whatever you're doing.
