Save image/table from a webpage while scraping - python

I need to scrape an image from the Wayback Machine (https://web.archive.org/web/) for each of a list of websites, for example stackoverflow.com and towardsdatascience.com. My input is a dataframe with a single column:
URL
stackoverflow.com
towardsdatascience.com
I do not know how to capture the information from the table/image contained in this element:
<div class="sparkline" style="width: 1225px;"><div id="wm-graph-anchor"><div id="wm-ipp-sparkline" title="Explore captures for this URL" style="height: 77px;"><canvas class="sparkline-canvas" width="1225" height="75" alt="sparklines"></canvas></div></div><div id="year-labels"><span class="sparkline-year-label">1996</span><span class="sparkline-year-label">1997</span><span class="sparkline-year-label">1998</span><span class="sparkline-year-label">1999</span><span class="sparkline-year-label">2000</span><span class="sparkline-year-label">2001</span><span class="sparkline-year-label">2002</span><span class="sparkline-year-label">2003</span><span class="sparkline-year-label">2004</span><span class="sparkline-year-label">2005</span><span class="sparkline-year-label">2006</span><span class="sparkline-year-label">2007</span><span class="sparkline-year-label">2008</span><span class="sparkline-year-label">2009</span><span class="sparkline-year-label">2010</span><span class="sparkline-year-label">2011</span><span class="sparkline-year-label">2012</span><span class="sparkline-year-label">2013</span><span class="sparkline-year-label">2014</span><span class="sparkline-year-label">2015</span><span class="sparkline-year-label">2016</span><span class="sparkline-year-label">2017</span><span class="sparkline-year-label">2018</span><span class="sparkline-year-label">2019</span><span class="sparkline-year-label selected-year">2020</span></div></div>
i.e. the graphic where the capture timeline is shown across the years.
I would like to save this image/table for each website, if possible.
I tried to write some code, but it is missing this part:
import json
import pandas as pd
import requests

def my_function(file):
    urls = list(set(file.URL.tolist()))
    df_url = pd.DataFrame(columns=['URL'])
    df_url['URL'] = urls
    api_url = 'https://web.archive.org/__wb/search/metadata'
    for url in df_url['URL']:
        res = requests.get(api_url, params={'q': url})
        # part to scrape the image
    return

my_function(df)
Can you give me some input on how to get those images?

If you have each image URL in the for loop, you can download the images using the Python standard-library function urllib.request.urlretrieve:
First import it at the beginning of the script using
import os
from urllib.parse import urlparse
import urllib.request
And then download them using
for url in df_url['URL']:
    urllib.request.urlretrieve(url, os.path.basename(urlparse(url).path))
If you don't want to save the files under the URL basename, then you don't need the first two imports.
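Below is a minimal sketch of that idea, assuming the images you want are exposed as ordinary <img> tags with a src attribute (note that the Wayback Machine sparkline in the question is a <canvas> drawn client-side, so it has no image URL to download this way). BeautifulSoup is used here only to collect candidate URLs before passing them to urlretrieve:
import os
from urllib.parse import urljoin, urlparse
import urllib.request

import requests
from bs4 import BeautifulSoup

def download_page_images(page_url):
    # Fetch the page and collect every <img> src it contains
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        src = img.get('src')
        if not src:
            continue
        img_url = urljoin(page_url, src)  # resolve relative paths against the page URL
        filename = os.path.basename(urlparse(img_url).path) or 'image'
        urllib.request.urlretrieve(img_url, filename)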

Related

How to print selected text from JSON file using Python

I'm new to python and have undertaken my first project to automate something for my role (I'm in the network space, so forgive me if this is terrible!).
I'm required to download a .json file from the link below:
https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
My script goes through and retrieves the manual download link.
The reason I'm getting the URL in this way, is that the download link changes every fortnight when MS update the file.
My preference is to extract the "addressPrefixes" contents for the entries named "AzureCloud.australiacentral", "AzureCloud.australiacentral2", "AzureCloud.australiaeast", and "AzureCloud.australiasoutheast".
I then want to strip out the " and ',' characters.
Each of the subnet ranges should then reside on a new line and be placed in a text file.
If I run the code below, I'm able to get the output that I want.
Am I correct in thinking that I can use a for loop to achieve this? If so, would it be better to use a Python dictionary as opposed to using JSON formatted output?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Script to check Azure IPs

# Import Modules for script
import requests
import re
import json
import urllib.request

search = r'https://download.*?\.json'
ms_dl_centre = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"
requests_get = requests.get(ms_dl_centre)
json_url_search = re.search(search, requests_get.text)
json_file = json_url_search.group(0)

with urllib.request.urlopen(json_file) as url:
    contents = json.loads(url.read().decode())

print(json.dumps(contents['values'][1]['properties']['addressPrefixes'], indent=0))  # use this to print contents from json entry 1
I'm not convinced that using re to parse HTML is a good idea. BeautifulSoup is more suited to the task. Upon inspection of the HTML response I note that there's a span element of class file-link-view1 that seems to uniquely identify the URL to the JSON download. Assuming that to be a robust approach (i.e. Microsoft don't change the way the download URL is presented) then this is how I'd do it:-
import requests
from bs4 import BeautifulSoup

namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]

baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'

with requests.Session() as session:
    response = session.get(baseurl)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
    response = session.get(downloadurl)
    response.raise_for_status()
    json = response.json()
    for n in json['values']:
        if n['name'] in namelist:
            print(n['name'])
            for ap in n['properties']['addressPrefixes']:
                print(ap)
@andyknight, thanks for your direction. I'd upvote you, but as I'm a noob it won't let me.
I've taken the basis of your python script and added in some additional components.
I removed the print statement for the region name in the .txt file, as this file is referenced by a firewall, which is looking for IP addresses only.
I've added try/except/else around a portion of the script, to identify whether there is ever an error reaching the URL, or some other unspecified error. I've leveraged logging to send an email based on the status of the script. If an exception is thrown, I get an email with traceback information; otherwise I receive an email advising the script was successful.
This writes out the specific prefixes for AU regions into a .txt file.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import logging
import logging.handlers
from bs4 import BeautifulSoup

smtp_handler = logging.handlers.SMTPHandler(mailhost=("sanitised.smtp[.]xyz", 25),
                                            fromaddr="UpdateIPs@sanitised[.]xyz",
                                            toaddrs="FriendlyAdmin@sanitised[.]xyz",
                                            subject=u"Check Azure IP Script completion status.")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.addHandler(smtp_handler)

namelist = ["AzureCloud.australiacentral", "AzureCloud.australiacentral2",
            "AzureCloud.australiaeast", "AzureCloud.australiasoutheast"]

baseurl = 'https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519'

with requests.Session() as session:
    response = session.get(baseurl)
    try:
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        downloadurl = soup.find('span', class_='file-link-view1').find('a')['href']
        response = session.get(downloadurl)
        response.raise_for_status()
        json = response.json()
        for n in json['values']:
            if n['name'] in namelist:
                for ap in n['properties']['addressPrefixes']:
                    with open('Check_Azure_IPs.txt', 'a') as file:
                        file.write(ap + "\n")
    except requests.exceptions.HTTPError as e:
        logger.exception(
            "URL is no longer valid, please check the URL that's defined in this script with MS, as this may have changed.\n\n")
    except Exception as e:
        logger.exception("Unknown error has occurred, please review script")
    else:
        logger.info("Script has run successfully! Azure IPs have been updated.")
Please let me know if you think there is a better way to handle this, otherwise this is marked as answered. I appreciate your help greatly!

Loop through webpages and download all images

I have a nice URL structure to loop through:
https://marco.ccr.buffalo.edu/images?page=0&score=Clear
https://marco.ccr.buffalo.edu/images?page=1&score=Clear
https://marco.ccr.buffalo.edu/images?page=2&score=Clear
...
I want to loop through each of these pages and download the 21 images (JPEG or PNG). I've seen several Beautiful Soup examples, but I'm still struggling to get something that will download multiple images and loop through the URLs. I think I can use urllib to loop through each URL like this, but I'm not sure where the image saving comes in. Any help would be appreciated, and thanks in advance!
for i in range(0, 10):
    urllib.urlretrieve('https://marco.ccr.buffalo.edu/images?page=' + str(i) + '&score=Clear')
I was trying to follow this post but I was unsuccessful:
How to extract and download all images from a website using beautifulSoup?
You can use requests:
from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url: str):
    d = soup(requests.get(url).text, 'html.parser')
    yield [[i.find('img')['src'], re.findall(r'(?<=\.)\w+$', i.find('img')['alt'])[0]]
           for i in d.find_all('a') if re.findall(r'/image/\d+', i['href'])]

n = 3  # end value
os.system('mkdir MARCO_images')  # added for automation purposes, folder can be named anything, as long as the proper name is used when saving below
for i in range(n):
    with get_images(f'https://marco.ccr.buffalo.edu/images?page={i}&score=Clear') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            with open(f'MARCO_images/MARCO_img_{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://marco.ccr.buffalo.edu{link}').content)
Now, inspecting the contents of the MARCO_images directory yields:
print(os.listdir('/Users/ajax/MARCO_images'))
Output:
['MARCO_img_1.jpg', 'MARCO_img_10.jpg', 'MARCO_img_11.jpg', 'MARCO_img_12.jpg', 'MARCO_img_13.jpg', 'MARCO_img_14.jpg', 'MARCO_img_15.jpg', 'MARCO_img_16.jpg', 'MARCO_img_17.jpg', 'MARCO_img_18.jpg', 'MARCO_img_19.jpg', 'MARCO_img_2.jpg', 'MARCO_img_20.jpg', 'MARCO_img_21.jpg', 'MARCO_img_3.jpg', 'MARCO_img_4.jpg', 'MARCO_img_5.jpg', 'MARCO_img_6.jpg', 'MARCO_img_7.jpg', 'MARCO_img_8.jpg', 'MARCO_img_9.jpg']

How to scrape data from JSON/Javascript of web page?

I'm new to Python and just got started with it today.
My system environment is Python 3.5 with some libraries on Windows 10.
I want to extract football player data from the site below as a CSV file.
Problem: I cannot extract the data from soup.find_all('script')[17] into my expected CSV format. How can I extract that data the way I want?
My code is shown as below.
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml
soup.find_all('script')[17] #My target data is in 17th
My expected output would be similar to this
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
As @Josiah Swain said, it's not going to be pretty. For this sort of thing it's generally recommended to use JS, as it can natively understand what you have.
Saying that, Python is awesome and here is your solution!
# Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

# And one more
import json

# The code you had
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')

# Store the script
script = soup.find_all('script')[17]

# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
               if line.lstrip().startswith('squad.register_players($.parseJSON')][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
    .replace('squad.register_players($.parseJSON(\'', '') \
    .replace('\'));', '')

# Extract out that useful info
data = [[p['position'], p['data']['slot_position'], p['data']['slug']]
        for p in json.loads(cleanJSON)
        if p['player'] is not None]

print('position,slot_position,slug')
for line in data:
    print(','.join(line))
The result I get from copying and pasting this into Python is:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
Edit: On reflection, this is not the easiest code to read for a beginner. Here is an easier-to-read version:
# ... All that previous code
script = soup.find_all('script')[17]

allScriptLines = script.text.split('\n')
uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));', '')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        print(player['position'], player['data']['slot_position'], player['data']['slug'])
So my understanding is that BeautifulSoup is better for HTML parsing, but you are trying to parse JavaScript nested in the HTML.
So you have two options:
Simply create a function that takes the result of soup.find_all('script')[17], then loop over and search the string manually for the data and extract it. You can even use ast.literal_eval(string_thats_really_a_dictionary) to make it even easier (see the sketch after this list). This may not be the best approach, but if you are new to Python you might want to do it this way just for practice.
Use the json library like in this example, or alternatively this way. This is probably the better way to do it.
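For example, here is a quick sketch of the ast.literal_eval idea from option 1, using a made-up string that stands in for whatever you pull out of soup.find_all('script')[17]:
import ast

# Hypothetical string extracted from the <script> tag; it looks like a Python-style dict literal
raw = "{'position': 'ST', 'slot_position': 'ST', 'slug': 'paulo-henrique'}"
player = ast.literal_eval(raw)  # evaluates literals only, so it is safer than eval()
print(player['position'], player['slot_position'], player['slug'])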

python web-scraping yahoo finance

Since Yahoo Finance updated their website, some tables seem to be created dynamically and are not actually stored in the HTML (I used to get this information using BeautifulSoup and urllib, but this won't work anymore). I am after the Analyst tables, for example for ADP, specifically the Earnings Estimates for Year Ago EPS (Current Year column). You cannot get this information from the API.
I found this link which works well for the Analyst Recommendations Trends. Does anyone know how to do something similar for the main table on this page? (LINK:
python lxml etree applet information from yahoo )
I tried to follow the steps taken, but frankly it's beyond me.
Returning the whole table is all I need; I can pick out the bits from there. Cheers
In order to get that data, you need to open Chrome DevTools and select the Network tab with the XHR filter. If you click on the ADP request, you can see the link in the Request URL.
You can use the Requests library to make the request and parse the JSON response from the site.
import requests
from pprint import pprint
url = 'https://query1.finance.yahoo.com/v10/finance/quoteSummary/ADP?formatted=true&crumb=ILlIC9tOoXt&lang=en-US&region=US&modules=upgradeDowngradeHistory%2CrecommendationTrend%2CfinancialData%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend&corsDomain=finance.yahoo.com'
r = requests.get(url).json()
pprint(r)
Further to vold's answer above, and using the answer in the link I posted above (credit to saaj), this gives just the dataset I need and is neater when calling the module. I am not sure what the crumb parameter is, but it seems to work OK without it.
import json
from pprint import pprint
from urllib.request import urlopen
from urllib.parse import urlencode

def parse():
    host = 'https://query1.finance.yahoo.com'
    # host = 'https://query2.finance.yahoo.com'  # try if above doesn't work
    path = '/v10/finance/quoteSummary/%s' % 'ADP'
    params = {
        'formatted': 'true',
        # 'crumb': 'ILlIC9tOoXt',
        'lang': 'en-US',
        'region': 'US',
        'modules': 'earningsTrend',
        'domain': 'finance.yahoo.com'
    }
    response = urlopen('{}{}?{}'.format(host, path, urlencode(params)))
    data = json.loads(response.read().decode())
    pprint(data)

if __name__ == '__main__':
    parse()
Other modules (just add a comma between them; see the example after this list):
assetProfile
financialData
defaultKeyStatistics
calendarEvents
incomeStatementHistory
cashflowStatementHistory
balanceSheetHistory
recommendationTrend
upgradeDowngradeHistory
earningsHistory
earningsTrend
industryTrend
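For example, to request several of the modules above in a single call, the 'modules' entry in the params dictionary of the script above simply becomes a comma-separated string:
'modules': 'earningsTrend,financialData,recommendationTrend',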
On GitHub, c0redumb has proposed a whole solution. You can download yqd.py; after importing it, you can get Yahoo Finance data with one line of code, as below.
import yqd
yf_data = yqd.load_yahoo_quote('GOOG', '20170722', '20170725')
The result 'yf_data' is:
['Date,Open,High,Low,Close,Adj Close,Volume',
'2017-07-24,972.219971,986.200012,970.770020,980.340027,980.340027,3248300',
'2017-07-25,953.809998,959.700012,945.400024,950.700012,950.700012,4661000',
'']
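Since the returned list is just CSV lines (with a trailing empty string), it can be parsed with the standard csv module if that helps; a small sketch:
import csv

# Drop the trailing empty entry, then split into a header row and data rows
rows = list(csv.reader(line for line in yf_data if line))
header, records = rows[0], rows[1:]
for record in records:
    print(dict(zip(header, record)))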

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract the first two meanings for a given word; from this URL: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with Scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is using curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
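Here is a minimal sketch of that first option (Python 2, matching the question's code), where the only change is building a Request with a browser-like User-Agent header before opening it:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # pretend to be a browser
doc = lh.parse(urllib2.urlopen(req))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i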
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera
