I am trying to pass a link I extracted from beautifulsoup.
import requests
r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links[1])
This is the link I am wanting.
Output: https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip
Now I am trying to pass this link through so I can download the contents.
# make a folder if it doesn't already exist
if not os.path.exists(folder_name):
os.makedirs(folder_name)
# pass the url
url = r'link from beautifulsoup result needs to go here'
response = requests.get(url, stream = True)
# extract contents
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
for elem in zf.namelist():
zf.extract(elem, '../data')
My overall goal is trying to take the link that I webscraped and place it in the url variable because the link is always changing on this website. I want to make it dynamic so I don't have to manually search for this link and change it when its changing and instead it changes dynamically. I hope this makes sense and appreciate any help I can get.
If I manually enter my code as the following I know it works
url = r'https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip'
If I can get my code to pass that exactly I know it'll work I'm just stuck with how to accomplish this.
I think you can do it with the find_all() method in Beautiful Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content)
for a in soup.find_all('a'):
url = a.get('href')
Related
I am trying to get the links to the individual search results on a website (National Gallery of Art). But the link to the search doesn't load the search results. Here is how I try to do it:
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
I can see that the links to the individual results could be found under soup.findAll('a') but they do not appear, instead the last output is a link to empty search result:
https://www.nga.gov/content/ngaweb/collection-search-result.html
How could I get a list of links, the first of which is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html) etc?
Actually, data is generating from api calls json response. Here is the desired
list of links.
Code:
import requests
import json
url= 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)
for item in r.json()['results']:
url = item['url']
abs_url = f'https://www.nga.gov{url}'
print(abs_url)
Output:
https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
This seems to work for me:
from bs4 import BeautifulSoup
import requests
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a'):
print(a['href'])
It returns all of the html a href links.
For the links from the search results specifically, those are loaded via AJAX and you would need to implement something that renders the javascript like headless chrome. You can read about one of the ways to implement this here, which fits your use case very closely. http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
If you want to ask how to render javascript from python and then parse the result, you would need to close this question and open a new one, as it is not scoped correctly as is.
Hi, I want to get the text(number 18) from em tag as shown in the picture above.
When I ran my code, it did not work and gave me only empty list. Can anyone help me? Thank you~
here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable javascript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website and then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
EDIT:
First of all, I missed that you were using urlopen incorrectly the correct way is described here: https://docs.python.org/3/howto/urllib2.html . Assuming you are using python3, which seems to be the case judging by the print statement.
Furthermore: looking at the issue again it is a bit more complicated. When you look at the source code of the page it actually loads an iframe and in that iframe you have the actual content: Hit ctrl + u to see the source code of the original url, since the side seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"
with urlopen(url) as response:
html = response.read()
soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')
# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']
# do your crawling
with urlopen(real_url) as response:
html = response.read()
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original url and the URL in the iframe. At first glance it looked like the iframe url can be constructed from the original url.
You'll still need a rendered version of the iframe url to grab your desired value.
I don't know what this site is about, but it seems they do not want to get crawled maybe you respect that.
I wrote a function to find all .pdf files from a web-page & download them. It works well when the link is publicly accessible but when I use it for a course website (which can only be accessed on my university's internet), the pdfs downloaded are corrupted and cannot be opened.
How can I fix it?
def get_pdfs(my_url):
html = urllib2.urlopen(my_url).read()
html_page = BeautifulSoup(html)
current_link = ''
links = []
for link in html_page.find_all('a'):
current_link = link.get('href')
if current_link.endswith('pdf'):
links.append(my_url + current_link)
print(links)
for link in links:
#urlretrieve(link)
wget.download(link)
get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')
When I use this grader link, the current_link is something like /courses/320241/2019_2/lectures/lecture_7_8.pdf but the /courses/320241/2019_2/ part is already included in the my_url and when I append it, it obviously doesn't work. However, the function works perfectly for [this link][1]:
Is there a way I can use the same function to work with both types of links?
OK, I think I understand the issue now. Try the code below on your data. I think it works, but obviously I couldn't try it directly on the page requiring login. Also, I changed your structure and variable definitions a bit, because I find it easier to think that way, but if it works, you can easily modify it to suit your own tastes.
Anyway, here goes:
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse
my_urls = ['https://cnds.jacobs-university.de/courses/os-2019/', 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2']
links = []
for url in my_urls:
resp = requests.get(url)
soup = bs(resp.text,'lxml')
og = soup.find("meta", property="og:url")
base = urlparse(url)
for link in soup.find_all('a'):
current_link = link.get('href')
if current_link.endswith('pdf'):
if og:
links.append(og["content"] + current_link)
else:
links.append(base.scheme+"://"+base.netloc + current_link)
for link in links:
print(link)
I'm fairly new to scraping with Python.
I am trying to obtain the number of search results from a query on Exilead. In this example I would like to get "
586,564 results".
This is the code I am running:
r = requests.get(URL, headers=headers)
tree = html.fromstring(r.text)
stats = tree.xpath('//[#id="searchform"]/div/div/small/text()')
This returns an empty list.
I copy-pasted the xPath directly from the elements' page.
As an alternative, I have tried using Beautiful soup:
html = r.text
soup = BeautifulSoup(html, 'xml')
stats = soup.find('small', {'class': 'pull-right'}).text
which returns a Attribute error: NoneType object does not have attribute text.
When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source.
Does anyone know why this is happening and how this can be resolved?
Thanks a lot!
When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source.
This suggests that the data you're looking for is dynamically generated with javascript. You'll need to be able to see the element you're looking for in the html source.
To confirm this being the cause of your error, you could try something really simple like:
html = r.text
soup = BeautifulSoup(html, 'lxml')
*note the 'lxml' above.
And then manually check 'soup' to see if your desired element is there.
I can get that with a css selector combination of small.pull-right to target the tag and the class name of the element.
from bs4 import BeautifulSoup
import requests
url = 'https://www.exalead.com/search/web/results/?q=lead+poisoning'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup.select_one('small.pull-right').text)
I am a beginner python programmer and I am trying to make a webcrawler as practice.
Currently I am facing a problem that I cannot find the right solution for. The problem is that I am trying to get a link location/address from a page that has no class, so I have no idea how to filter that specific link.
It is probably better to show you.
The page I am trying to get the link from.
As you can see, I am trying to get what is inside of the href attribute of the "Historical prices" link. Here is my python code:
import requests
from bs4 import BeautifulSoup
def find_historicalprices_link(url):
source = requests.get(url)
text = source.text
soup = BeautifulSoup(text, 'html.parser')
link = soup.find_all('li', 'fjfe-nav-sub')
href = str(link.get('href'))
find_spreadsheet(href)
def find_spreadsheet(url):
source = requests.get(url)
text = source.text
soup = BeautifulSoup(text, 'html.parser')
link = soup.find('a', {'class' : 'nowrap'})
href = str(link.get('href'))
download_spreadsheet(href)
def download_spreadsheet(url):
response = requests.get(url)
text = response.text
lines = text.split("\\n")
filename = r'google.csv'
file = open(filename, 'w')
for line in lines:
file.write(line + "\n")
file.close()
find_historicalprices_link('https://www.google.com/finance?q=NASDAQ%3AGOOGL&ei=3lowWYGRJNSvsgGPgaywDw')
In the function "find_spreadsheet(url)", I could easily filter the link by looking for the class called "nowrap". Unfortunately, the Historical prices link does not have such a class and right now my script just gives me the following error:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How do I make sure that my crawler only takes the href from the "Historical prices"?
Thank you in advance.
UPDATE:
I found the way to do it. By only looking for the link with a specific text attached to it, I could find the href I needed.
Solution:
soup.find('a', string="Historical prices")
Does the following code sniplet helps you? I think you can solve your problem with the following code as I hope:
from bs4 import BeautifulSoup
html = """<a href='http://www.google.com'>Something else</a>
<a href='http://www.yahoo.com'>Historical prices</a>"""
soup = BeautifulSoup(html, "html5lib")
urls = soup.find_all("a")
print(urls)
print([a["href"] for a in urls if a.text == "Historical prices"])