Is there a way to get all the flags from https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags using Python code?
I tried pd.read_html and did not succeed. I tried scraping, but it got so messy that I couldn't do it.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
# Scrape the webpage
soup = BeautifulSoup(page.content, 'html.parser')
flags = soup.find_all('a', attrs={'class': "image"})
Would be nice if I can download them to a specific folder too!
Thanks in advance!
Just as an alternative to your approach and the well-described one from MattieTK, you could also use CSS selectors to select your elements more specifically:
soup.select('img[src*="/Flag_of"]')
Iterate over the ResultSet, pick each src, and use a function to download the images:
for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))
Example
import requests
from bs4 import BeautifulSoup

def download_file(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        file_name = url.split('/')[-1]
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    else:
        print('Image couldn\'t be retrieved:', url)

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
soup = BeautifulSoup(page.content, 'html.parser')

for e in soup.select('img[src*="/Flag_of"]'):
    download_file('https:' + e.get('src'))
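Since you also asked about saving into a specific folder: here is a minimal sketch of that, assuming an example folder name of flags (any path will do); it is not part of the answer above.

import os
import requests

def download_file(url, folder='flags'):
    # 'flags' is just an assumed example folder; it is created if it does not exist
    os.makedirs(folder, exist_ok=True)
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        file_name = os.path.join(folder, url.split('/')[-1])
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    else:
        print('Image couldn\'t be retrieved:', url)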
In your example, flags is a ResultSet of anchor tags, each of which contains an img tag.
What you want is a way to get each individual src attribute from the image tag.
You can achieve this by looping over the results of your soup.find_all like so. Each flag is separate, which lets you get the contents of the flag (the img tag) and then the value of its src attribute.
for flag in soup.find_all('a', attrs={'class': "image"}):
    src = flag.contents[0]['src']
You can then work on downloading each of these to a file inside the loop.
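For example, a minimal sketch of that download step, assuming an output folder named flags (my choice, not part of the original answer) and reusing the soup object from your question:

import os
import requests

os.makedirs('flags', exist_ok=True)  # assumed target folder
for flag in soup.find_all('a', attrs={'class': "image"}):
    src = flag.contents[0]['src']      # protocol-relative URL like //upload.wikimedia.org/...
    full_url = 'https:' + src          # prefix the scheme
    file_name = os.path.join('flags', src.split('/')[-1])
    with open(file_name, 'wb') as f:
        f.write(requests.get(full_url).content)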
Related
I have around 900 pages and each page contains 10 buttons (each button links to a PDF). I want to download all the PDFs - the program should browse through all the pages and download the PDFs one by one.
My code only searches for links ending in .pdf, but the hrefs on this site do not contain .pdf; the page number (page_no) runs from 1 to 900.
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3
This is the website, and below is an example of one of the links:
BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the PDF files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
You only need the href associated with the links you call buttons, then prefix it with the appropriate protocol + domain.
The links can be matched with the following selector:
.bid_no > a
That is, anchor (a) tags whose direct parent element has the class bid_no.
This should pick up 10 links per page. As you will need a file name for each download, I suggest keeping a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_". You simply add to this dict during your loop over the desired number of pages.
As there are over 800 pages, I have chosen to add an additional termination page-count variable called end_number. I don't want to loop over all the pages, so this allows an early exit. You can remove this variable if you wish.
Next, you need to determine the actual number of pages. For this you can use the following CSS selector to get the Last pagination link, then extract its data-ci-pagination-page value and convert it to an integer. This then becomes num_pages (the number of pages) at which to terminate your loop:
.pagination li:last-of-type > a
That looks for an a tag which is a direct child of the last li element, where those li elements share a parent with the class pagination; i.e. the anchor tag in the last li, which is the last page link in the pagination element.
Once you have all your desired links and file suffixes (the description text for the links) in your dictionary, loop the key, value pairs and issue requests for the content. Write that content out to disk.
TODO:
I would suggest you look at ways of optimizing the final issuing of requests and the writing out to disk. For example, you could first issue all requests asynchronously and store the results in a dictionary, to optimize what is an I/O-bound process, then loop over that dictionary writing to disk, perhaps with a multi-processing approach for the more CPU-bound part (a rough sketch of this follows the code below).
I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically end up with something like (836 * 10) + 836 requests.
import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        # print(pdf_links)
        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1
    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
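As a rough sketch of the concurrency idea from the TODO above (my own addition, not part of the original answer), the final downloads could be issued in parallel with the standard-library ThreadPoolExecutor, reusing the pdf_links dict and path from the code above:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(item):
    name, link = item
    # each worker issues one request and returns the name plus the raw bytes
    return name, requests.get(link).content

# max_workers caps how many requests are in flight at once (a crude form of batching)
with ThreadPoolExecutor(max_workers=10) as executor:
    for name, content in executor.map(fetch, pdf_links.items()):
        with open(f'{path}/{name}.pdf', 'wb') as f:
            f.write(content)

A time.sleep between batches, or a smaller max_workers, would be the simplest way to add the wait discussed above.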
Your site doesn't work for 90% of people, but you provided examples of the HTML, so I hope this will help you:
import requests
from bs4 import BeautifulSoup

url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        with open('pdf_name_here.pdf', 'wb') as f:
            # if you have the full link
            href = pdf.get('href')
            # if you only have the path, like /showbidDocument/2993132
            # href = url + pdf.get('href')
            response = requests.get(href)
            f.write(response.content)
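Note that the hard-coded 'pdf_name_here.pdf' is overwritten on every iteration. A minimal tweak (my suggestion, not part of the original answer, and assuming full links as in the first branch above) is to derive a unique name from the tail of the href, e.g. the 2993132 in /showbidDocument/2993132:

for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        href = pdf.get('href')
        file_name = href.rstrip('/').split('/')[-1] + '.pdf'  # e.g. 2993132.pdf
        with open(file_name, 'wb') as f:
            f.write(requests.get(href).content)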
Hi, I've been trying all day to find a way to download some images from this
URL: https://omgcheckitout.com/these-trypophobia-photos-will
but when I run this code I only ever get the URLs of the small images in the corner, not the ones in the article.
(I've also tried other ways, but I always get the same result.)
import requests, os
from bs4 import BeautifulSoup as bs

url = 'https://omgcheckitout.com/these-trypophobia-photos-will'
r = requests.get(url)
soup = bs(r.text, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])
Converting my comment to an answer.
Original comment:
"I believe what is happening here is that the page you are seeing in the browser is being loaded dynamically with JavaScript. Try adding '.html' to the page URL and see what happens. The images in the redirect are what your code is downloading. I recommend taking a look at this thread: https://stackoverflow.com/questions/52687372/beautifulsoup-not-returning-complete-html-of-the-page"
Try to download them to your disk
import requests
from bs4 import BeautifulSoup
from os.path import basename

r = requests.get("xxx")
soup = BeautifulSoup(r.content, 'html.parser')
links = soup.find_all('img')  # the img tags whose src we will download
for link in links:
    if "http" in link.get('src'):
        lnk = link.get('src')
        with open(basename(lnk), "wb") as f:
            f.write(requests.get(lnk).content)
You can also use select to filter your tags so you only get the ones with http links:

for link in soup.select("img[src^=http]"):
    lnk = link["src"]
    with open(basename(lnk), "wb") as f:
        f.write(requests.get(lnk).content)
I'm trying to access the style of a div element on a page using Beautiful Soup 4, but I keep getting a KeyError. I know the styles are definitely there because I can see them in the browser's inspector for the div with the class "header large border first".
Here is my code:
import requests
import bs4

headers = {"User-Agent": "Mozilla/5.0"}  # not shown in the question; the same browser-like header as in the answer below
url = 'https://www.themoviedb.org/movie/595743-sas-red-notice'
response = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
header_image_style = soup.find("div", class_="header large border first")['style']
I'm not sure what I'm doing wrong. Can anyone help?
Beautiful Soup does not analyze the contents of style tags or of linked style sheets, unfortunately, so it is difficult to retrieve that value; we have to parse the CSS ourselves.
The value we are looking for is contained within the document's style tag, so we can grab the contents of that tag and parse it ourselves. Here's a working example:
from bs4 import BeautifulSoup
import cssutils
import requests
url = 'https://www.themoviedb.org/movie/595743-sas-red-notice'
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, 'html.parser')
# get the style tag contents
style_str = soup.find("style").text
# parse the tag's contents
rules = cssutils.parseString(style_str)
# find the first rule that applies to "div.header.large.first"
rule = next(filter(lambda x: x.selectorText == "div.header.large.first", rules))
# get the backgroundImage property
background_property = rule.style.backgroundImage
# Cut out the start of the text that says "url(" and ")"
img_url = background_property[4:-1]
print(img_url)
You will need to pip install cssutils in order for this example to work.
I am trying to pass a link I extracted with BeautifulSoup.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]')]
print(links[1])
This is the link I want:
Output: https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip
Now I am trying to pass this link through so I can download the contents.
import os
import io
import zipfile
import requests

# make a folder if it doesn't already exist
folder_name = '../data'  # folder_name was not defined in the snippet; '../data' matches the extract target below
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# pass the url
url = r'link from beautifulsoup result needs to go here'
response = requests.get(url, stream=True)

# extract contents
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    for elem in zf.namelist():
        zf.extract(elem, '../data')
My overall goal is to take the link I scraped and place it in the url variable, because the link on this website changes over time. I want to make this dynamic so I don't have to search for the link manually and update it by hand whenever it changes. I hope this makes sense, and I appreciate any help I can get.
If I manually enter the URL in my code as follows, I know it works:
url = r'https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip'
If I can get my code to pass that value in automatically, I know it'll work; I'm just stuck on how to accomplish this.
I think you can do it with the find_all() method in Beautiful Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content)
for a in soup.find_all('a'):
    url = a.get('href')
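To tie this back to the question's goal, here is a hedged sketch (my own, assuming the Scorecard archive is the only href on the page ending in .zip): filter the scraped links for the .zip file and feed it straight into the download-and-extract code from the question.

import io
import zipfile
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')

# assumption: the raw-data archive is the only link ending in .zip
zip_url = next(a['href'] for a in soup.find_all('a', href=True) if a['href'].endswith('.zip'))

response = requests.get(zip_url, stream=True)
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    zf.extractall('../data')  # same target folder as in the question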
https://example.net/users/x
Here, x is a number that ranges from 1 to 200000. I want to run a loop to get all the URLs and extract contents from every URL using beautiful soup.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
content = urlopen(re.compile(r"https://example.net/users/[0-9]//"))
soup = BeautifulSoup(content)
Is this the right approach? I have to perform two things.
Get a continuous set of URLs
Extract & store retrieved contents from every page/URL.
UPDATE:
I have to get only one particular value from each of the web pages.
soup = BeautifulSoup(content)
divTag = soup.find_all("div", {"class": "classname"})
for tag in divTag:
    ulTags = tag.find_all("ul", {"class": "classname"})
    for tag in ulTags:
        aTags = tag.find_all("a", {"class": "classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
You could try this:
import urllib2
import shutil

urls = []
for i in range(10):
    urls.append('https://www.example.org/users/' + str(i))

def getUrl(urls):
    for url in urls:
        # Only a file_name based on the url string
        file_name = url.replace('https://', '').replace('.', '_').replace('/', '_')
        response = urllib2.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)
If you just need the contents of a web page, you could probably use lxml to parse the content. Something like:

import requests
from lxml import etree

r = requests.get('https://example.net/users/x')
dom = etree.HTML(r.text)  # parse the HTML response
# parse something, e.g. a title element
title = dom.xpath('//h1[@class="title"]')[0].text
Additionally, if you are scraping 10s or 100s of thousands of pages, you might want to look into something like grequests where you can do multiple asynchronous HTTP requests.
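For example, a minimal sketch with grequests (an assumption on my part that it fits your setup; it needs pip install grequests), firing the user-page requests in controlled batches:

import grequests

# a much smaller range than 200000, for illustration only
urls = ['https://example.net/users/{}'.format(i) for i in range(1, 201)]
pending = (grequests.get(u) for u in urls)
# size caps how many requests are in flight at once
for response in grequests.map(pending, size=20):
    if response is not None and response.status_code == 200:
        print(response.url, len(response.text))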