Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I need a way to extract the URLs from the list at https://iota-nodes.net/ using Python. I tried BeautifulSoup, but without success.
My code is:
from bs4 import BeautifulSoup
import requests

url = "https://iota-nodes.net/"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
No need for BeautifulSoup, as the data is coming from an AJAX request. Something like this should work:
import requests
response = requests.get('https://api.iota-nodes.net/')
data = response.json()
hostnames = [node['hostname'] for node in data]
Note that the data comes from the API endpoint https://api.iota-nodes.net/.
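Since the asker wanted URLs rather than bare hostnames, the same comprehension idea can assemble full URLs from data of that shape. The field names and values below are assumed sample data, not taken from the real API:

```python
# Assumed sample of the API's JSON shape (a list of node objects);
# the field names here are illustrative, not from the real endpoint.
data = [
    {"hostname": "node1.example.org", "port": 14265},
    {"hostname": "node2.example.org", "port": 443},
]

# Combine hostname and port into a full URL per node
urls = [f"https://{node['hostname']}:{node['port']}" for node in data]
print(urls)
```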
Closed 5 days ago.
I need the first 5 text files from this URL using Python: http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/ . Only '.txt' files should be downloaded and stored in a folder.
I tried using the requests library to access the website.
Here I've done it using requests to get the URL content, and BeautifulSoup to retrieve the URLs of the .txt files from the main page:
Download the page content using requests.
Using BeautifulSoup, find all <a> tags.
Take the first 5 tags whose href ends with .txt.
Download the content behind each of those hrefs using requests.
import requests
from bs4 import BeautifulSoup

url = "http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/"

AMOUNT_OF_FILES = 5  # Amount of txt files to download
FILES_EXTENSION = ".txt"  # Extension to download

# Getting url content
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Finding all a tags in the table
a_tags = soup.find("table").find_all("a")

# Getting urls of the .txt files to download
urls_to_download = []
for a_tag in a_tags:
    if a_tag['href'].endswith(FILES_EXTENSION):
        urls_to_download.append(url + a_tag['href'])
    if len(urls_to_download) == AMOUNT_OF_FILES:
        break

# Downloading file contents
for url in urls_to_download:
    filename = url[url.rindex("/") + 1:]
    response = requests.get(url)
    with open(filename, "wb") as file:
        file.write(response.content)
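One fragile spot above is building the download URL by string concatenation; if the page ever used absolute or root-relative hrefs, that would break. urllib.parse.urljoin handles both cases. A small sketch, with assumed sample hrefs:

```python
from urllib.parse import urljoin

base = "http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/"

# A relative href (as on this page) resolves against the base...
print(urljoin(base, "sample.txt"))
# ...while an absolute href is left untouched
print(urljoin(base, "http://example.com/other.txt"))
```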
Closed 1 year ago.
I am making a Python script which gets text data from an online site.
It is a simple web scraping script, and the language is only Python.
I don't use Selenium, only BeautifulSoup.
I can scrape text from <p>, <div>, and even heading and <a> tags,
but when I try to get text from <td>, the code does not work.
I shared my code below.
import requests
from bs4 import BeautifulSoup
from lxml import etree

detailPage = requests.get(SUBURL, headers=HEADERS)
detailsoup = BeautifulSoup(detailPage.content, "html.parser")
detaildom = etree.HTML(str(detailsoup))
name = detaildom.xpath('//*[@id="productTitle"]')[0].text
# Note: /tbody/ may not exist in the source HTML even when browser devtools show it
asin = detaildom.xpath('//*[@id="productDetails_detailBullets_sections1"]/tbody/tr[1]/td')[0].text
Here, getting name works, but asin comes back empty.
You can find the table by its ID productDetails_detailBullets_sections1 and find the <td> which contains the "ASIN".
Using a CSS selector:
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print("ASIN:", soup.select_one("#productDetails_detailBullets_sections1 tr > td").get_text(strip=True))
Using .find():
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
table_info = soup.find(id="productDetails_detailBullets_sections1").find("tr")
print("ASIN:", table_info.find('td').get_text(strip=True))
Output (in both solutions):
ASIN: B079LWYC17
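The same approach extends to reading the whole details table into a dict. The HTML below is a minimal assumed stand-in for the real page, keeping only the structure implied by the IDs in the question:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the product-details table (structure assumed;
# the real page contains many more rows and attributes).
html = """
<table id="productDetails_detailBullets_sections1">
  <tr><th>ASIN</th><td>B079LWYC17</td></tr>
  <tr><th>Item Weight</th><td>1.2 pounds</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="productDetails_detailBullets_sections1")

# Map each row's header cell to its value cell
details = {
    row.find("th").get_text(strip=True): row.find("td").get_text(strip=True)
    for row in table.find_all("tr")
}
print(details["ASIN"])  # B079LWYC17
```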
Closed 2 years ago.
There are 100 pages' links inside links.txt.
This is the code I have so far (it saves only one page); the part that saves the other 99 pages is missing:
import urllib.request, urllib.error, urllib.parse

with open('links.txt', 'r') as links:
    for link in links:
        response = urllib.request.urlopen(link)
        webContent = response.read()
        f = open('obo-t17800628-33.html', 'wb')
        f.write(webContent)
        f.close()
You need to give the files different names as you loop:
import urllib.request, urllib.error, urllib.parse

with open('links.txt', 'r') as links:
    for idx, link in enumerate(links):
        response = urllib.request.urlopen(link)
        webContent = response.read()
        with open('obo-t17800628-33.html' + str(idx), 'wb') as fout:
            fout.write(webContent)
This will append a number to the end of each file name.
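An alternative (a sketch of one option, not what the answer above does) is to derive each file name from the last path segment of its URL, falling back to an index when the URL ends in a slash. The example URL is made up:

```python
import os.path
from urllib.parse import urlparse

def filename_for(link, idx):
    # Take the last path segment of the URL as the file name;
    # strip() removes the trailing newline from lines read from links.txt
    name = os.path.basename(urlparse(link.strip()).path)
    # Fall back to a numbered name when the path ends in "/"
    return name or f"page-{idx}.html"

print(filename_for("http://example.com/trials/obo-t17800628-33.html", 0))
# obo-t17800628-33.html
```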
Closed 8 years ago.
I want to use an HTML parser like Beautiful Soup (Python) to get the contents of a specific div and store all the data within it on my local server, by running a Python script that will be executed regularly on my web server by cron.
I also need to be able to show those contents on my web site exactly as they were shown on the original page.
If the contents of the div were text alone, it would be easy enough, but it is a combination of text and images.
Although there are occasionally swf files, I do not wish to import them.
Let's say the div in question is called 'cont'.
What would be the best way to do this?
Luckily I have a spider which does almost exactly what you need:
from bs4 import BeautifulSoup as bs
from scrapy.http import Request
from scrapy.spiders import Spider
from hn.items import HnItem

class HnSpider(Spider):
    name = 'hn'
    allowed_domains = []
    start_urls = ['http://news.ycombinator.com']

    def parse(self, response):
        if 'news.ycombinator.com' in response.url:
            soup = bs(response.body, "html.parser")
            items = [(x[0].text, x[0].get('href')) for x in
                     filter(None, [
                         x.findChildren() for x in
                         soup.find_all('td', {'class': 'title'})
                     ])]
            for item in items:
                print(item)
                hn_item = HnItem()
                hn_item['title'] = item[0]
                hn_item['link'] = item[1]
                try:
                    yield Request(item[1], callback=self.parse)
                except ValueError:
                    # Relative links raise ValueError; prefix the domain
                    yield Request('http://news.ycombinator.com/' + item[1],
                                  callback=self.parse)
                yield hn_item
Refer to the GitHub link to learn more.
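For the question as actually asked (a div called 'cont', keeping text and images but not swf files), a plain BeautifulSoup sketch is enough; the HTML below is an assumed stand-in for the scraped page:

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the real page's 'cont' div is larger
html = """
<div id="cont">
  <p>Some text</p>
  <img src="pic.png">
  <embed src="movie.swf">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
cont = soup.find(id="cont")

# Remove swf containers but keep text and images untouched
for tag in cont.find_all(["embed", "object"]):
    tag.decompose()

# str(cont) is the HTML fragment to store and later re-serve
print(str(cont))
```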
Closed 9 years ago.
I want to get all the links on a web page. This function returns only one link, but I need all of them. Of course I know I need a loop, but I don't know how to use one.
I need to get all the links.
def get_next_target(page):
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote
This is where an HTML parser comes in handy. I recommend BeautifulSoup:
from bs4 import BeautifulSoup as BS

def get_next_target(page):
    soup = BS(page, "html.parser")
    return soup.find_all('a', href=True)
You may use lxml for that:
import lxml.html

def get_all_links(page):
    # fromstring() parses an HTML string; use parse() for a file or URL
    document = lxml.html.fromstring(page)
    return document.xpath("//a")
import urllib.request
from bs4 import BeautifulSoup, SoupStrainer

site = urllib.request.urlopen('http://somehwere/over/the/rainbow.html')
site_data = site.read()
for link in BeautifulSoup(site_data, "html.parser", parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
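For completeness, the asker's own get_next_target can be finished into a loop without any parser. This is fine for an exercise, but fragile on real-world HTML (attribute order, quoting, and whitespace all break it):

```python
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:  # no more anchors
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]  # continue after the last match
    return links

print(get_all_links('<a href="http://a.com">A</a> <a href="http://b.com">B</a>'))
# ['http://a.com', 'http://b.com']
```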