Download PDFs with Python

I am trying to download several PDFs which are located in different hyperlinks within a single URL. My approach was first to retrieve the URLs containing the "fileEntryId" text, which point to the PDFs, according to this link, and secondly to try to download the PDF files using the approach from this link.
This is "my" code so far:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
import re
import os
import requests
from urllib.parse import urljoin
http = httplib2.Http()
status, response = http.request('https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015')
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a', href=re.compile('.*fileEntryId.*'))):
    if link.has_attr('href'):
        x = link['href']

# If there is no such folder, the script will create one automatically
folder_location = r'c:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(x)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("x"):
    # Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Thank you

Create a folder anywhere and put the script in that folder. When you run the script, you should get the downloaded pdf files within the folder. If for some reason the script doesn't work for you, make sure to check whether your bs4 version is up to date, as I've used CSS pseudo-selectors to target the required links.
import requests
from bs4 import BeautifulSoup
link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text, "lxml")
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
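
If you want to keep the folder logic from the original question rather than saving next to the script, a minimal sketch (assuming the c:\webscraping path from the question) is to build the target path before writing:

import os

folder_location = r'c:\webscraping'  # folder from the question; adjust as needed
os.makedirs(folder_location, exist_ok=True)  # create the folder if it doesn't exist yet

# then, inside the download loop above, write into that folder instead:
# with open(os.path.join(folder_location, f"{file_name}.pdf"), "wb") as f:
#     f.write(s.get(pdf_link).content)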

Related

Why is the data retrieved showing as blank instead of outputting the correct numbers?

I can't seem to see what is missing. Why is the response not printing the ASINs?
import requests
import re
urls = [
    'https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2',
    'https://www.amazon.com/s?k=ps4+game&ref=nb_sb_noss_2'
]

for url in urls:
    content = requests.get(url).content
    decoded_content = content.decode()
    asins = set(re.findall(r'/[^/]+/dp/([^"]+)', decoded_content))
    print(asins)
Output:
set()
set()
[Finished in 0.735s]
Regular expressions should not be used to parse HTML. Virtually every StackOverflow answer to questions like this recommends against regex for HTML. It is difficult to write a regular expression complex enough to get the data-asin value from each <div>. The BeautifulSoup library makes this task easier. But if you must use regex, this code will return everything inside the body tags:
re.findall(r'<body.*?>(.+?)</body>', decoded_content, flags=re.DOTALL)
Also, print decoded_content and read the HTML. You might not be receiving the same website that you see in the web browser. Using your code I just get an error message from Amazon or a small test to see if I am a robot. If you do not have real headers attached to your request, big websites like Amazon will not return the page you want. They try to prevent people from scraping their site.
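As a quick diagnostic before parsing anything, a minimal sketch (the marker strings are an assumption, not Amazon's exact wording) is to fetch the page and check whether the response looks like a bot check rather than real search results:

import requests

url = 'https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2'
decoded_content = requests.get(url).content.decode()

# crude heuristic: interstitial pages tend to mention a robot or captcha check
if 'robot' in decoded_content.lower() or 'captcha' in decoded_content.lower():
    print('Got a bot-check page, not the search results')
else:
    print('Response looks like a normal page')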
Here is some code that works using the BeautifulSoup library. You need to install the library first: pip3 install bs4.
from bs4 import BeautifulSoup
import requests
def getAsins(url):
    headers = requests.utils.default_headers()
    headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'})
    decoded_content = requests.get(url, headers=headers).content.decode()
    soup = BeautifulSoup(decoded_content, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins
'''
result = getAsins('https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2')
print(result)
{None: 'B07RBN5C9C', '8652921a-81ee-4e15-b12d-5129c3d35195': 'B07P15JL3T', 'cb25b4bf-efc3-4bc6-ae7f-84f69dcf131b': 'B0886YWLC9', 'bc730e28-2818-472d-bc03-6e9fb97dcaad': 'B089F8R7SQ', '339c4ca0-1d24-4920-be60-54ef6890d542': 'B08GQW447N', '4532f725-f416-4372-8aa0-8751b2b090cc': 'B08DD5559K', 'a0e17b74-7457-4df7-85c9-5eefbfe4025b': 'B08BXHCQKR', '52ef86ef-58ac-492d-ad25-46e7bed0b8b9': 'B087XR383W', '3e79c338-525c-42a4-80da-4f2014ed6cf7': 'B07H5VVV1H', '45007b26-6d8c-4120-9ecc-0116bb5f703f': 'B07DJW4WZC', 'dc061247-2f4c-4f6b-a499-9e2c2e50324b': 'B07YLGXLYQ', '18ff6ba3-37b9-44f8-8f87-23445252ccbd': 'B01FST8A90', '6d9f29a1-9264-40b6-b34e-d4bfa9cb9b37': 'B088MZ4R82', '74569fd0-7938-4375-aade-5191cb84cd47': 'B07SXMV28K', 'd35cb3a0-daea-4c37-89c5-db53837365d4': 'B07DFJJ3FN', 'fc0b73cc-83dd-44d9-b920-d08f07be76eb': 'B07KYC1VL7', 'eaeb69d1-a2f9-4ea4-ac97-1d9a955d706b': 'B076PRWVFG', '0aafbb75-1bac-492c-848e-a046b2de9978': 'B07Q47W1B4', '9e373245-9e8b-4564-a32f-42baa7b51d64': 'B07C4SGGZ2', '4af7587a-98bf-41e0-bde6-2a2fad512d95': 'B07SJ2T3CW', '8635a92e-22a7-4474-a27d-3db75c75e500': 'B08D44W56B', '49d752ce-5d68-4323-be9b-3cbb34c8b562': 'B086JQGB7W', '6398531f-6864-4c7b-9879-84ee9de57d80': 'B07XD3TK36'}
'''
If you are reading html from a file then:
from bs4 import BeautifulSoup
import requests
def getAsins(location_to_file):
    file = open(location_to_file)
    soup = BeautifulSoup(file, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins

Scraping image hrefs from an Ordered List using BeautifulSoup

I am trying to retrieve the images from this website (with permission). Here is my code, along with the website I want to access:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.vgmuseum.com/nes.htm"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html5lib")
li = soup.select('ol > li > a')
for link in li:
    print(link.get('href'))
The images I would like to use are in this ordered list here:
list location for images
The page you are working with consists of iframes, which are basically a way of including one page inside another. Browsers understand how iframes work and will download those pages and display them in the browser window.
urllib2, though, is not a browser and cannot do that. You need to explore where the list of links is located, i.e. in which iframe, and then follow the URL that this iframe's content comes from. In your case, the list of links on the left comes from the http://www.vgmuseum.com/nes_b.html page.
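If you would rather find that iframe URL programmatically instead of inspecting the page by hand, a minimal sketch (using requests and BeautifulSoup as in the solution below; older pages sometimes use <frame> inside a <frameset> rather than <iframe>, so both are checked) would be:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# fetch the parent page and resolve every iframe/frame src to an absolute URL
parent_url = "http://www.vgmuseum.com/nes.htm"
soup = BeautifulSoup(requests.get(parent_url).content, "html.parser")
for frame in soup.find_all(["iframe", "frame"]):
    print(urljoin(parent_url, frame.get("src")))

One of the printed URLs should be the nes_b.html page used below.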
Here is a working solution that follows the links in the list of links, downloads the pages containing images, and downloads the images into the images/ directory. I am using the requests module and the lxml parser teamed up with BeautifulSoup for faster HTML parsing:
from urllib.parse import urljoin
import os
import requests
from bs4 import BeautifulSoup
url = "http://www.vgmuseum.com/nes_b.html"
def download_image(session, url):
    print(url)
    local_filename = os.path.join("images", url.split('/')[-1])
    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

with requests.Session() as session:
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    for link in soup.select('ol > li > a[href*=images]'):
        response = session.get(urljoin(response.url, link.get('href')))
        for image in BeautifulSoup(response.content, "lxml").select("img[src]"):
            download_image(session, url=urljoin(response.url, image["src"]))
I used the url in #Dan's comment above for parsing.
Code:
import requests
from bs4 import BeautifulSoup
url = 'http://www.vgmuseum.com/nes_b.html'
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')
li = soup.find('ol')
soup = BeautifulSoup(str(li), 'html.parser')
a = soup.find_all('a')
for link in a:
    if not link.get('href') == '#top' and not link.get('href') == None:
        print(link.get('href'))
Output:
images/nes/10yard.html
images/nes2/10.html
pics2/100man.html
images/nes/1942.html
images/nes2/1942.html
images/nes/1943.html
images/nes2/1943.html
pics7/1944.html
images/nes/1999.html
images/nes2/2600.html
images/nes2/3dbattles.html
images/nes2/3dblock.html
images/nes2/3in1.html
images/nes/4cardgames.html
pics2/4.html
images/nes/4wheeldrivebattle.html
images/nes/634.html
images/nes/720NES.html
images/nes/8eyes.html
images/nes2/8eyes.html
images/nes2/8eyesp.html
pics2/89.html
images/nes/01/blob.html
pics5/boy.html
images/03/a.html
images/03/aa.html
images/nes/abadox.html
images/03/abadoxf.html
images/03/abadoxj.html
images/03/abadoxp.html
images/03/abarenbou.html
images/03/aces.html
images/03/action52.html
images/03/actionin.html
images/03/adddragons.html
images/03/addheroes.html
images/03/addhillsfar.html
images/03/addpool.html
pics/addamsfamily.html
pics/addamsfamilypugsley.html
images/nes/01/adventureislandNES.html
images/nes/adventureisland2.html
images/nes/advisland3.html
pics/adventureisland4.html
images/03/ai4.html
images/nes/magickingdom.html
pics/bayou.html
images/03/bayou.html
images/03/captain.html
images/nes/adventuresofdinoriki.html
images/03/ice.html
images/nes/01/lolo1.html
images/03/lolo.html
images/nes/01/adventuresoflolo2.html
images/03/lolo2.html
images/nes/adventuresoflolo3.html
pics/radgravity.html
images/03/rad.html
images/nes/01/rockyandbullwinkle.html
images/nes/01/tomsawyer.html
images/03/afroman.html
images/03/afromario.html
pics/afterburner.html
pics2/afterburner2.html
images/03/ai.html
images/03/aigiina.html
images/nes/01/airfortress.html
images/03/air.html
images/03/airk.html
images/nes/01/airwolf.html
images/03/airwolfe.html
images/03/airwolfj.html
images/03/akagawa.html
images/nes/01/akira.html
images/03/akka.html
images/03/akuma.html
pics2/adensetsu.html
pics2/adracula.html
images/nes/01/akumajo.html
pics2/aspecial.html
pics/alunser.html
images/nes/01/alfred.html
images/03/alice.html
images/nes/01/alien3.html
images/nes/01/asyndrome.html
images/03/alien.html
images/03/all.html
images/nes/01/allpro.html
images/nes/01/allstarsoftball.html
images/nes/01/alphamission.html
pics2/altered.html

Scraping a website to download all documents on it with BeautifulSoup throws IOError

Hi, I would like to download all the files that are published on the following website (https://www.nationalgrid.com/uk/electricity/market-and-operational-data/data-explorer) via a Python, Julia, or whatever-language script. It used to be an http website where BeautifulSoup was working fine; it is now an https website and my code is unfortunately not working anymore.
All the files I want to download are in an 'a' tag and are of class 'download'. Hence the line in the code that is not working is the following:
fileDownloader.retrieve(document_url, "forecasted-demand-files/"+document_name)
which raises the following error:
raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 403, 'Forbidden', <httplib.HTTPMessage instance at 0x104f79e60>)
After some research on the net, I have not been able to find any information on how I could scrape the documents on an https website. Would anyone have a suggestion?
Thank you in advance for your answers!
Julien
import requests
import urllib
import re
from bs4 import BeautifulSoup
page = requests.get("https://www.nationalgrid.com/uk/electricity/market-and-
operational-data/data-explorer")
soup = BeautifulSoup(page.content, 'html.parser')
fileDownloader = urllib.URLopener()
mainLocation = "https://www.nationalgrid.com"
for document in soup.find_all('a', class_='download'):
document_name = document["title"]
document_url = mainLocation+document["href"]
fileDownloader.retrieve(document_url, "files/"+document_name)
The problem is that you need to pass a user agent as a header in order for the request to be fulfilled.
I don't know how to do it with urllib, but since you are already using requests (which is more human-friendly) you can achieve this with the following code:
import requests
import urllib
from bs4 import BeautifulSoup
page = requests.get("https://www.nationalgrid.com/uk/electricity/market-and-operational-data/data-explorer")
soup = BeautifulSoup(page.content, 'html.parser')
mainLocation = "http://www2.nationalgrid.com"
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
for a_link in soup.find_all('a', class_='download'):
    document_name = a_link["title"]
    document_url = mainLocation + a_link["href"]
    print('Getting file: {}'.format(document_url))
    page = requests.get(document_url, headers=header)
    file_to_store = a_link.get('href').split('/')[-1]
    with open('files/' + file_to_store, 'wb') as f_out:  # binary mode, since page.content is bytes
        f_out.write(page.content)
The only addition is a small hack to retrieve the file name from the link.
It's not an https issue; it's just that the page you're trying to scrape has some file access restricted. It's good practice to handle exceptions when you expect them. In this case, all of the file links may be broken or not accessible.
Try handling the exception like this:
import requests
import urllib
import re
from bs4 import BeautifulSoup
page = requests.get("https://www.nationalgrid.com/uk/electricity/market-and-operational-data/data-explorer")
soup = BeautifulSoup(page.content, 'html.parser')
fileDownloader = urllib.URLopener()
mainLocation = "https://www.nationalgrid.com"
for document in soup.find_all('a', class_='download'):
    document_name = document["title"]
    document_url = mainLocation + document["href"]
    try:
        fileDownloader.retrieve(document_url, "forecasted-demand-files/" + document_name)
    except IOError as e:
        print('failed to download: {}'.format(document_url))

How to get the right source code with Python from the URLs using my web crawler?

I'm trying to use Python to write a web crawler. I'm using the re and requests modules. I want to get URLs from the first page (it's a forum) and get information from every URL.
My problem now is that I already store the URLs in a list, but I can't get any further: I can't get the RIGHT source code of these URLs.
Here is my code:
import re
import requests
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/' + eachLink.encode('utf-8')
    html = getsourse(url)  # THIS IS WHERE I CAN'T GET THE RIGHT SOURCE CODE

# To get the source code of current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.text

# To get all the links in current page
def getallLinksinPage(sourceCode):
    bigClasses = re.findall('<th class="new">(.*?)</th>', sourceCode, re.S)
    allLinks = []
    for each in bigClasses:
        everylink = re.findall('</em><a href="(.*?)" onclick', each, re.S)[0]
        allLinks.append(everylink)
    return allLinks
You define your functions after you use them, so your code will error. You should also not be using re to parse HTML; use a parser like BeautifulSoup as below. Also use urlparse.urljoin to join the base URL to the links. What you actually want is the hrefs in the anchor tags inside the div with the id threadlist:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content

# To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode)
    return [a["href"] for a in soup.select("#threadlist a.xst")]

sourceCode = getsourse(url)  # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode)  # a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/'
    html = getsourse(urljoin(url, eachLink))
    print(html)
If you print urljoin(url, eachLink) in the loop, you will see you get all the correct links for the table and the correct source code returned. Below is a snippet of the links returned:
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3197510&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3201399&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3170748&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3152747&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3168498&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3176639&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203657&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3190138&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3140191&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199154&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3156814&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203435&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3089967&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199384&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3173489&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3204107&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
If you visit the links above in your browser, you will see they get the correct page. Using http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3187289&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231 from your results, you will see:
Sorry, specified thread does not exist or has been deleted or is being reviewed
[Skykiwi New Zealand Community Home]
You can clearly see the difference in the URLs. If you wanted yours to work, you would need to do a replace in your regex:
everylink = re.findall('</em><a href="(.*?)" onclick', each.replace("&","%26"), re.S)[0]
But really, don't parse HTML with a regex.

Using Python to Scrape Nested Divs and Spans in Twitter?

I'm trying to scrape the likes and retweets from the results of a Twitter search.
After running the Python below, I get an empty list, []. I'm not using the Twitter API because it doesn't look at the tweets by hashtag this far back.
The code I'm using is:
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
I can successfully save the HTML to a file using the code below, but it is missing large amounts of information when I search the text, such as the class names I am looking for...
So (part of) the problem is apparently in accurately accessing the source code.
filename = 'newfile2.txt'
with open(filename, 'w') as handle:
    handle.writelines(str(data))
This screenshot shows the span that I'm trying to scrape.
I've looked at this question, and others like it, but I'm not quite getting there.
How can I use BeautifulSoup to get deeply nested div values?
It seems that your GET request returns valid HTML but with no tweet elements in the #timeline element. However, adding a user agent to the request headers seems to remedy this.
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
