My question is about searching through HTML with Python.
I am using this code:
import urllib.request

with urllib.request.urlopen("http://") as url:
    data = url.read().decode()
This returns the whole HTML source of the page, and I want to extract all email addresses from it.
Can somebody lend me a hand here?
Thanks in advance
Using BeautifulSoup and Requests you could do this:
import requests
from bs4 import BeautifulSoup
import re

response = requests.get("your_url")
response_text = response.text
beautiful_response = BeautifulSoup(response_text, 'html.parser')

email_regex = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
list_of_emails = re.findall(email_regex, beautiful_response.text)

list_of_emails_decoded = []
for every_email in list_of_emails:
    list_of_emails_decoded.append(every_email.encode('utf-8'))
Remember that you should not use regex for actual HTML parsing (thanks @Patrick Artner), but you can use BeautifulSoup to extract all visible text or comments on a web page. Then you can search that text (which is just a string) for email addresses. Here is how you can do it:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

with urllib.request.urlopen("https://en.wikipedia.org/wiki/Email_address") as url:
    data = url.read().decode()

text = text_from_html(data)
print(re.findall(r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*", text))
The two helper functions just grab all text that can be seen on the page, and then the ridiculously long regex pulls all email addresses from that text. I used Wikipedia's article on email addresses as an example, and here is the output:
['John.Smith@example.com', 'local-part@domain', 'jsmith@example.com', 'john.smith@example.org', 'local-part@domain', 'John..Doe@example.com', 'fred+bah@domain', 'fred+foo@domain', 'fred@domain', 'john.smith@example.com', 'john.smith@example.com', 'jsmith@example.com', 'JSmith@example.com', 'john.smith@example.com', 'john.smith@example.com', 'prettyandsimple@example.com', 'very.common@example.com', 'disposable.style.email.with+symbol@example.com', 'other.email-with-dash@example.com', 'fully-qualified-domain@example.com', 'user.name+tag+sorting@example.com', 'user.name@example.com', 'x@example.com', 'example-indeed@strange-example.com', 'admin@mailserver1', "#!$%&'*+-/=?^_`{}|~@example.org", 'example@s.solutions', 'user@localserver', 'A@b', 'c@example.com', 'l@example.com', 'right@example.com', 'allowed@example.com', 'allowed@example.com', '1234567890123456789012345678901234567890123456789012345678901234+x@example.com', 'john..doe@example.com', 'example@localhost', 'john.doe@example', 'joeuser+tag@example.com', 'joeuser@example.com', 'foo+bar@example.com', 'foobar@example.com']
I am writing a Python script that gets text data from a website.
It is a simple web scraping script, and the only language used is Python.
I don't use Selenium, only BeautifulSoup.
I can scrape text from <p>, <div>, and even <h> and <a> tags,
but when I try to get text from a <td>, the code does not work.
I have shared my code below.
import requests
from threading import Thread
from bs4 import BeautifulSoup
from lxml import etree

detailPage = requests.get(SUBURL, headers=HEADERS)
detailsoup = BeautifulSoup(detailPage.content, "html.parser")
detaildom = etree.HTML(str(detailsoup))

name = detaildom.xpath('//*[@id="productTitle"]')[0].text
asin = detaildom.xpath('//*[@id="productDetails_detailBullets_sections1"]/tbody/tr[1]/td')[0].text
Here, getting name works, but asin returns an empty string.
You can find the table by its ID productDetails_detailBullets_sections1 and find the <td> which contains the "ASIN".
Using a CSS selector:
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print("ASIN:", soup.select_one("#productDetails_db_sections tr > td").get_text(strip=True))
Using .find():
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
table_info = soup.find(id="productDetails_detailBullets_sections1").find("tr")
print("ASIN:", table_info.find('td').get_text(strip=True))
Output (in both solutions):
ASIN: B079LWYC17
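Both snippets assume the ASIN sits in the first row of that table. If it does not, a more robust sketch (assuming the table uses <th> labels next to the <td> values, as on typical Amazon product-detail pages) is to look up the row by its header text:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
table = soup.find(id="productDetails_detailBullets_sections1")

# Walk the rows and pick the one whose <th> label is "ASIN".
for row in table.find_all("tr"):
    header = row.find("th")
    if header and header.get_text(strip=True) == "ASIN":
        print("ASIN:", row.find("td").get_text(strip=True))
        break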
I need a way to extract the URLs from the list on this web page, https://iota-nodes.net/, using Python. I tried BeautifulSoup but without success.
My code is:
from bs4 import BeautifulSoup, SoupStrainer
import requests
url = "https://iota-nodes.net/"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
for link in soup.find_all('a'):
    print(link.get('href'))
No need for BeautifulSoup, as the data is coming from an AJAX request. Something like this should work:
import requests
response = requests.get('https://api.iota-nodes.net/')
data = response.json()
hostnames = [node['hostname'] for node in data]
Note that the data comes from the API endpoint https://api.iota-nodes.net/, not from the HTML page itself.
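If you want full URLs rather than bare hostnames, you can assemble them from the returned fields. This is only a sketch; it assumes each node object also exposes a port field, so check the actual JSON keys and adapt as needed:

import requests

response = requests.get('https://api.iota-nodes.net/')
data = response.json()

# Assumption: each node dict has 'hostname' and 'port' keys.
for node in data:
    print(f"http://{node['hostname']}:{node['port']}")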
I am trying to web scrape https://in.udacity.com/courses/all. I need to get the courses shown when entering a search query. For example, if I enter "python", 17 courses come up as results, and I need to fetch only those. The search query is not passed as part of the URL (it is not a GET request), so the HTML content does not change either. How can I fetch those results without going through the entire course list?
In the code below I fetch all the course links, get the content of each, and search for the search term in that content, but it is not giving me the result I expect.
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
from urllib.request import Request, urlopen

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'lxml')
courses = soup.select('a.capitalize')
search_term = input("enter the course:")
for link in courses:
    # print("https://in.udacity.com" + link['href'])
    html = urllib.request.urlopen("https://in.udacity.com" + link['href']).read()
    if search_term in text_from_html(html).lower():
        print('\n' + link.text)
        print("https://in.udacity.com" + link['href'])
Using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
for course in courses:
    print(course.text)
OUTPUT:
VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.
EDIT:
As explained by @Martin Evans, the Ajax call behind the search is not doing what you think it is; it is probably just keeping a count of searches, i.e. how many users searched for "AI". The code below simply filters the courses based on the keyword in search_term:
import requests
from bs4 import BeautifulSoup
import re
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
search_term = "AI"
for course in courses:
    if re.search(search_term, course.text, re.IGNORECASE):
        print(course.text)
OUTPUT:
AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems
The udacity page is actually returning all available courses when you request it. When you enter a search, the page is simply filtering the available data. This is why you do not see any changes to the URL when entering a search. A check using the browser's developer tools also confirms this. It also explains why the "search" is so fast.
As such, if you are searching for a given course, you would just need to filter the results yourself. For example:
import requests
from bs4 import BeautifulSoup
req = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(req.content, "html.parser")
a_tags = soup.find_all("a", class_="capitalize")
print("Number of courses:", len(a_tags))
print()
for a_tag in a_tags:
    course = a_tag.text
    if "python" in course.lower():
        print(course)
This would display all courses with Python in the title:
Number of courses: 225
Python Foundation
AI Programming with Python
Programming Foundations with Python
Data Structures & Algorithms in Python
Read the tutorials for how to use requests (for making HTTP requests) and BeautifulSoup (for processing HTML). This will teach you what you need to know to download the pages, and extract the data from the HTML.
You will use the function BeautifulSoup.find_all() to locate all of the <div> elements in the page HTML, with class=course-summary-card. The content you want is within that <div>, and after reading the above links it should be trivial for you to figure out the rest ;)
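For instance, a minimal sketch of what that lookup could look like (the class name course-summary-card comes from the paragraph above; the live page's markup may differ, so treat this as an illustration rather than a guaranteed selector):

import requests
from bs4 import BeautifulSoup

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, "html.parser")

# Each course card <div> is expected to contain the course title text.
for card in soup.find_all("div", class_="course-summary-card"):
    print(card.get_text(" ", strip=True))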
By the way, one helpful tool as you learn how to do this is the "Inspect element" feature (in Chrome/Firefox), accessed by right-clicking an element in the browser. It lets you see the source code surrounding the element you're interested in extracting, so you can get information like its class or id, parent divs, etc. that will let you select it in BeautifulSoup/lxml/etc.
What I am trying to do is find all the hyperlinks of a web page. Here is what I have so far, but it does not work:
from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the address of the webpage to find the hyperlinks: ")
    try:
        webpage = urlopen(address)
        link = findHyperLinks(webpage)
        print("The hyperlinks are", link)
        webpage.close()
    except Exception as exceptObj:
        print("Error:", str(exceptObj))

main()
There are multiple problems in your code. One of them is that you are searching for the literal string <a href>, i.e. a link whose href attribute is present, empty, and the only attribute.
Anyway, if you use an HTML parser (well, to parse HTML), things get much easier and more reliable. Example using BeautifulSoup:
from bs4 import BeautifulSoup
from urllib.request import urlopen
soup = BeautifulSoup(urlopen(address))
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text())
Without BeautifulSoup you can use a regular expression and a simple function.
from urllib.request import urlopen
import re
def find_link(url):
    response = urlopen(url)
    res = str(response.read())
    my_dict = re.findall('(?<=<a href=")[^"]*', res)

    for x in my_dict:
        # simply skip page bookmarks, like #about
        if x[0] == '#':
            continue

        # simple handling of root-relative urls, like /about.html;
        # also be careful with redirects and add more flexible
        # processing, if needed
        if x[0] == '/':
            x = url + x

        print(x)
find_link('http://cnn.com')
I want to get all the links on one web page. This function returns only one link, but I need to get all of them. Of course I know I need a loop, but I don't know how to use it here.
I need to get all the links.
def get_next_target(page):
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote
This is where an HTML parser comes in handy. I recommend BeautifulSoup:
from bs4 import BeautifulSoup as BS

def get_next_target(page):
    soup = BS(page)
    return soup.find_all('a', href=True)
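The function returns the <a> tags themselves; a quick usage sketch (assuming page already holds the HTML as a string) is to read each tag's href attribute:

# page is the HTML source as a string, e.g. from urlopen(...).read()
for a in get_next_target(page):
    print(a['href'], a.get_text(strip=True))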
You may use lxml for that:
import lxml.html

def get_all_links(page):
    document = lxml.html.parse(page)
    return document.xpath("//a")
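lxml.html.parse() accepts a filename, file object, or URL rather than a raw HTML string, so a usage sketch might look like this (example.com is just a placeholder):

# Parse directly from a URL and list the href attributes of all anchors.
for a in get_all_links("http://example.com/"):
    href = a.get("href")  # None for anchors without an href
    if href:
        print(href, a.text_content().strip())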
Or, using BeautifulSoup with a SoupStrainer so that only the anchor tags are parsed:

from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

site = urlopen('http://somewhere/over/the/rainbow.html')
site_data = site.read()

for link in BeautifulSoup(site_data, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])