How can I download all the data folders available in a website?


In general, if a website displays a series of links to folders containing data (e.g. spreadsheets with economic data), how can I write a program that identifies all the links and downloads the data?
In particular, I am trying to download all folders from 2012 to 2018 from this website: https://www.ngdc.noaa.gov/eog/viirs/download_dnb_composites.html
I tried the approach below, but the links to the data do not show up in the results.
import requests
from bs4 import BeautifulSoup

my_target = 'https://www.ngdc.noaa.gov/eog/viirs/download_dnb_composites.html'
r = requests.get(my_target)
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
links = []
# collect and print the href of every anchor tag on the page
for link in soup.find_all('a'):
    links.append(link.get('href'))
    print(link.get('href'))
Among all the URLs appended to links, none points to the data.
Finally, even once I have the right links, how can they be used to actually download the files?
Many thanks! ;)

That's a typical web scraping task:
Use requests to download the page.
Then parse the content and extract the URLs using BeautifulSoup.
You can now download the files from the extracted URLs, again using requests.
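The three steps above can be sketched roughly as follows. This is a minimal sketch, not a tested solution for the NOAA page: the helper names extract_links, download_file and download_all, the suffix filter, and the downloads directory are assumptions made for illustration. If the page inserts its links with JavaScript, the HTML returned by requests will not contain them and this approach will find nothing.

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_links(html, base_url, suffixes=(".tgz", ".zip", ".csv")):
    """Return absolute URLs of all <a href> targets whose path ends in `suffixes`."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for a in soup.find_all("a", href=True):  # skip anchors without href
        url = urljoin(base_url, a["href"])   # resolve relative links
        if urlparse(url).path.lower().endswith(suffixes):
            found.append(url)
    return found


def download_file(url, dest_dir="downloads"):
    """Stream one file to disk in chunks and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(urlparse(url).path))
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return local_path


def download_all(page_url):
    """Fetch the page, extract matching links, and download each one."""
    page = requests.get(page_url, timeout=30)
    page.raise_for_status()
    return [download_file(u) for u in extract_links(page.text, page_url)]
```

Calling download_all with the page URL would then save every matching file. Streaming with iter_content keeps large archives from being loaded into memory at once.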

Related

beautifulsoup4 find_all not finding any data on Walmart grocery website

I am attempting to scrape some basic product information from the url linked here, but the bs4 find_all command isn't finding any data given the name of the class associated with the product div. Specifically, I am trying:
import requests
from bs4 import BeautifulSoup

# the URL must be a quoted string
url = 'https://www.walmart.com/grocery/browse/Cereal-&-Breakfast-Food?aisle=1255027787111_1255027787501'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
product_list = soup.find_all('div', class_='productListTile')
print(product_list)
But this prints an empty list []. Having inspected the webpage on Chrome, I know that 'productListTile' is the correct class name. Any idea what I am doing wrong?

You will most likely need to use Selenium. Requests sent with the requests library get redirected to a "Verify Your Identity" page, so Beautiful Soup never sees the product markup.
Here is a very similar question, with code in which Selenium and Beautiful Soup work together to scrape Walmart:
python web scraping using beautiful soup is not working
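Before switching tools, it can help to confirm that the redirect described above is really what happens. A minimal sketch, assuming the bot-check page contains the text "Verify Your Identity" as the answer reports; the helper name diagnose is made up for illustration:

```python
import requests


def diagnose(response, marker="Verify Your Identity"):
    """Summarise where a requests response ended up and whether it looks like a bot check."""
    return {
        "status": response.status_code,
        "redirected": bool(response.history),  # history lists intermediate redirect responses
        "final_url": response.url,
        "marker_found": marker in response.text,
    }


def check(url):
    """Fetch `url` and report on the response that came back."""
    return diagnose(requests.get(url, timeout=30))
```

If marker_found is True, the empty list comes from parsing the wrong page, not from a wrong class name.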

Web scraping techniques vary from website to website. In this case Selenium is a good option, but here is another method that needs only requests and Beautiful Soup, and it has helped me a lot.
Inspect the web page, select the Network tab, and refresh the page.
Then sort the requests by type.
The requests marked in red in the screenshot are the API calls the page makes to get its data from the backend, so you can call the backend API directly to fetch the player's data.
Check "Headers" to see the API endpoint, and "Preview" to see the API response in JSON format.
If you also want the images, check the page source: the images are listed there, so you can download them and map them to the ids.

How to scrape a javascript website in Python?

I am trying to scrape a website. I have tried two methods, but neither gives me the full page source that I am looking for. I am trying to scrape the news titles from the URL below.
URL: "https://www.todayonline.com/"
These are the two methods I have tried, both of which failed.
Method 1: Beautiful Soup
import requests
from bs4 import BeautifulSoup

tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page)
soup  # Returns HTML containing JavaScript text
soup.find_all('h3')
### Returns an empty list []
Method 2: Selenium + BeautifulSoup
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

tdy_url = "https://www.todayonline.com/"
options = Options()
options.headless = True
driver = webdriver.Chrome("chromedriver", options=options)
driver.get(tdy_url)
time.sleep(10)  # fixed wait for the page's JavaScript to run
html = driver.page_source
soup = BeautifulSoup(html)
soup.find_all('h3')
### Returns fewer than 1/4 of the 'h3' tags found in the original page source
Please help. I have tried scraping other news websites and it is so much easier. Thank you.

The news data on the website you are trying to scrape is fetched from the server using JavaScript (this is called XHR -- XMLHttpRequest). This happens dynamically, while the page is loading or being scrolled, so the data is not included in the page that the server initially returns.
In the first example, you are getting only the page returned by the server -- without the news, but with JS that is supposed to get them. Neither requests nor BeautifulSoup can execute JS.
However, you can try to reproduce requests that are getting news titles from the server with Python requests. Do the following steps:
Open your browser's DevTools (usually by pressing F12 or Ctrl+Shift+I) and look at the requests that fetch the news titles from the server. Sometimes this is even easier than web scraping with BeautifulSoup.
Copy the request link (right-click -> Copy -> Copy link), and pass it to requests.get(...).
Call .json() on the response. It returns a dict that is easy to work with. To better understand the structure of the dict, I recommend using pprint instead of plain print. Note that you need from pprint import pprint before using it.
Here is an example of the code that gets the titles from the main news on the page:
import requests

# fetch the feed and take the list stored under "nodes"
nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7").json()["nodes"]
for node in nodes:
    print(node["node"]["title"])
If you want to scrape a group of news items under a particular caption, change the number after news_feed/ in the request URL (to find it, filter the requests by "news_feed" in DevTools and scroll down the news page).
Sometimes web sites have protection against bots (although the website you are trying to scrape doesn't). In such cases, you might need to do these steps as well.
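Changing the number after news_feed/ can be wrapped in a small loop. A minimal sketch, assuming every feed returns the same {"nodes": [{"node": {"title": ...}}]} shape as feed 7 in the example above; the helper names are made up, and any feed id other than 7 is a placeholder you would need to look up in DevTools:

```python
import requests

# Endpoint prefix taken from the example above; it may change over time.
API_ROOT = "https://www.todayonline.com/api/v3/news_feed/"


def titles_from_payload(payload):
    """Pull titles out of a payload shaped like {"nodes": [{"node": {"title": ...}}]}."""
    return [item["node"]["title"] for item in payload.get("nodes", [])]


def fetch_titles(feed_ids):
    """Map each feed id to the list of titles its endpoint returns."""
    result = {}
    with requests.Session() as session:  # reuse one connection for all feeds
        for feed_id in feed_ids:
            r = session.get(f"{API_ROOT}{feed_id}", timeout=30)
            r.raise_for_status()
            result[feed_id] = titles_from_payload(r.json())
    return result
```

For example, fetch_titles([7]) would return a dict with one entry holding the main news titles.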

You can access the data via the API (check the Network tab in DevTools):
For example,
import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()

I would suggest a fairly simple approach:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page)
news = [i.text for i in soup.find_all('news:title')]
print(news)
Output:
['DBS named world’s best bank by New York-based financial publication',
'Russia has very serious questions to answer on Navalny - UK',
"Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
'Three militants killed after fatal attack on policeman in Tunisia',
.....]
Also, you can check the XML page for more information if required.
P.S. Always check that scraping is permitted before scraping any website :)

There are different ways of gathering the content of a webpage that relies on JavaScript:
Using Selenium with the Firefox web driver
Using a headless browser such as PhantomJS
Making an API call using a REST client or the Python requests library
You have to do your research first to see which one fits the site.

Webscraping table with multiple pages using BeautifulSoup

I'm trying to scrape this webpage https://www.whoscored.com/Statistics using BeautifulSoup in order to obtain all the information in the player statistics table. I'm having a lot of difficulty and was wondering if anyone would be able to help me.
import requests
from bs4 import BeautifulSoup

url = 'https://www.whoscored.com/Statistics'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
# note the comma between the tag name and the attribute filter
text = [element.text for element in soup.find_all('div', {'id': "statistics-table-summary"})]
My problem lies in the fact that I don't know what the correct tag is to obtain that table. Also the table has several pages and I would like to scrape every single one. The only indication I've seen of a change of page in the table is the number in the code below:
<div id="statistics-table-summary" class="" data-fwsc="11">

It looks to me like that site loads its data using JavaScript. To grab the data, you'll have to mimic how a browser loads a page; the requests library isn't enough. I'd recommend taking a look at a tool like Selenium, which uses a "robotic browser" to load the page. After the page is loaded, you can use BeautifulSoup to retrieve the data you need.
Here's a link to a helpful tutorial from RealPython.
Good luck!
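The Selenium-then-BeautifulSoup approach described above might look roughly like this. It is a sketch under assumptions, not verified against whoscored.com: it assumes Chrome and a matching driver are available, and that the statistics sit in an ordinary table element inside the div with id statistics-table-summary. The helper names render_page and table_rows are made up for illustration.

```python
from bs4 import BeautifulSoup


def render_page(url, wait_css, timeout=15):
    """Load `url` in headless Chrome and return the HTML once `wait_css` matches."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait for the element to appear instead of sleeping a fixed time.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


def table_rows(html, container_css):
    """Return each table row under `container_css` as a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select(f"{container_css} tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows
```

A call such as table_rows(render_page(url, "#statistics-table-summary tr"), "#statistics-table-summary") would return the first page of the table. Reaching the other pages would additionally require clicking the pager in Selenium, which this sketch does not cover.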

Webscraping: Downloading a pdf from a javascript link

I am using the requests library in Python and attempting to scrape a website that has lots of public reports and documents in .pdf format. I have successfully done this on other websites, but I have hit a snag on this one: the links are JavaScript functions (objects? I don't know anything about JavaScript) that redirect me to another page, which then has the raw pdf link. Something like this:
import requests
from bs4 import BeautifulSoup as bs

url = 'page with search results.com'
html = requests.get(url).text
soup = bs(html, 'html.parser')
obj_list = soup.find_all('a')
# print the href of every anchor on the results page
for a in obj_list:
    link = a['href']
    print(link)
>> javascript:readfile2("F","2201","2017_2201_20170622F14.pdf")
Ideally I would like a way to find what url this would navigate to. I could use selenium and click on the links, but there are a lot of documents and that would be time- and resource-intensive. Is there a way to do this with requests or a similar library?
Edit: It looks like every link goes to the same url, which loads a different pdf depending on which link you click. This makes me think that there is no way to do this in requests, but I am still holding out hope for something non-selenium-based.

There might be a base URL under which these PDF files are served.
You need to find the URL at which a PDF opens after you click its hyperlink.
Once you have that URL, parse the PDF name from the anchor (in the example above, it is inside the javascript: href).
Then append the PDF name to that URL and request the final URL.
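Parsing the PDF name out of the href and appending it to a base URL can be sketched with the standard library alone. The base URL is something you have to discover yourself in the browser; the one in the usage note is a placeholder. The helper names are made up, and the sketch assumes the file name is the last quoted argument, as in the example href from the question.

```python
import re
from urllib.parse import urljoin

# Matches hrefs such as javascript:readfile2("F","2201","name.pdf")
JS_CALL = re.compile(r'^javascript:\s*(\w+)\((.*)\)\s*;?\s*$')


def parse_js_href(href):
    """Return (function_name, [string arguments]), or None if href is not a JS call."""
    m = JS_CALL.match(href.strip())
    if not m:
        return None
    # Collect every double- or single-quoted argument in order.
    args = re.findall(r'"([^"]*)"|\'([^\']*)\'', m.group(2))
    return m.group(1), [a or b for a, b in args]


def pdf_url(href, base_url):
    """Build a candidate PDF URL by appending the last argument to `base_url`."""
    parsed = parse_js_href(href)
    if parsed is None:
        return None
    _, args = parsed
    return urljoin(base_url, args[-1]) if args else None
```

For instance, pdf_url(a['href'], "https://example.com/files/") yields a URL that can be passed to requests.get. Whether the other arguments ("F", "2201") also belong in the path depends on the site.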

BeautifulSoup4 fails to parse multiple tables

I'd like to systematically scrape the privacy breach data found here, which is directly embedded in the HTML of the page. I've found various links on StackOverflow about missing HTML and not being able to scrape a table using BS4. Both of these threads seem very similar to the issue that I'm having; however, I'm having a difficult time reconciling the differences.
Here's my problem: When I pull the HTML using either Requests or urllib (python 3.6) the second table does not appear in the soup. The second link above details that this can occur if the table/data is added in after the page loads using javascript. However when I examine the page source the data is all there, so that doesn't seem to be the issue. A snippet of my code is below.
import requests
from bs4 import BeautifulSoup

url = 'https://www.privacyrights.org/data-breach/new?title=&page=1'
r = requests.get(url, verify=False)
soupy = BeautifulSoup(r.content, 'html5lib')
print(len(soupy.find_all('table')))
# only finds 1 table, there should be 2
This code snippet fails to find the table with the actual data in it. I've tried the lxml, html5lib, and html.parser parsers. I've tried the urllib and Requests packages to pull down the page.
Why can't requests + BS4 find the table that I'm looking for?

Looking at the HTML delivered from the URL it appears that there only IS one table in it, which is precisely why Beautiful Soup can't find two!
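One way to check this yourself is to count the table start tags in the raw response text, independently of Beautiful Soup and its parser choice. A minimal sketch using only the standard library; the names TagCounter and count_tags are made up for illustration:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags with a given name; tags inside comments are ignored."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag="table"):
    """Return how many `tag` start tags appear in the HTML string."""
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count
```

Running count_tags(r.text) on the response from the question shows how many tables the server actually sent to requests, which may differ from what a browser displays.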
