Web scraping: read all href - python

I wrote a small script to read all hrefs from a web page with Python.
But it has a problem: it doesn't read href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648", for example.
code:
import urllib
import re
urls = ["http://something.com"]
regex='href=\"(.+?)\"'
pattern = re.compile(regex)
htmlfile = urllib.urlopen(urls[0])
htmltext = htmlfile.read()
hrefs = re.findall(pattern,htmltext)
print hrefs
Can anybody help me? Thanks.

Use BeautifulSoup and requests for static websites. BeautifulSoup is a great module for web scraping; with the code below you can easily get the value of each href attribute. Hope it helps.
import requests
from bs4 import BeautifulSoup
url = 'whatever url you want to parse'
result = requests.get(url)
soup = BeautifulSoup(result.content,'html.parser')
for a in soup.find_all('a',href=True):
    print "Found the URL:", a['href']
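One possible reason the regex in the question skips some links is that those hrefs use single quotes or a different letter case than the pattern expects. This is only an assumption, since the page source is not shown. The sketch below uses Python 3 and the standard-library html.parser module (swapped in for BeautifulSoup so it runs without extra packages) to show the difference. HrefCollector and sample_html are made-up names and data for illustration.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and accepts
        # double-quoted, single-quoted and unquoted attribute values.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Made-up page: the second link uses single quotes and an upper-case HREF.
sample_html = """
<a href="http://something.com/a">A</a>
<a HREF='pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648'>B</a>
<a name="no-link">C</a>
"""

# The pattern from the question only sees the double-quoted, lower-case href.
regex_hrefs = re.findall(r'href="(.+?)"', sample_html)

collector = HrefCollector()
collector.feed(sample_html)

print(regex_hrefs)
print(collector.hrefs)
```

A real parser finds both links, while the regex finds only the first one.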

Related

How do I get only the URLs of Apple articles from a website (in Python)?

I am new to Python. I want to extract news articles that talk about Apple. My project needs to get only the Apple-related articles from the BBC website. My code, shown below, crawls the website, but I cannot work out how to get only the Apple articles. Can anyone help me solve this problem?
code
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
#pass the URL
url = urlopen("http://www.bbc.com")
#read the source from the URL
readHtml = url.read()
#close the url
url.close()
#passing HTML to scrape it
soup = BeautifulSoup(readHtml, 'html.parser')
all_tag_a = soup.find_all("a", limit=10)
for links in all_tag_a:
    #just pull the href part from each link
    print(links.get('href'))
Please try the following:
from urllib.parse import urlparse
# A URL has the form scheme://netloc/path;parameters?query#fragment
o = urlparse('https://www.apple.com/in/')
# Put whatever URLs you are getting into the statement above; a loop may help
if 'apple' in o.netloc:
    # If a match is found, this is one of Apple's URLs
    print(o.geturl())
Please refer to this for more information.
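The urlparse check above can be put in a loop over the hrefs collected from the page. Here is a rough sketch, assuming relative links should be resolved against the BBC base URL and that an Apple article has "apple" somewhere in its host name or path. That second assumption is a guess about how the URLs look, and it would also match words like "pineapple". The function name, BASE_URL and the sample hrefs are all made up for illustration.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "http://www.bbc.com"  # assumed base for resolving relative links


def is_apple_link(href, base=BASE_URL):
    """Return True if the resolved URL mentions 'apple' in its host or path."""
    parts = urlparse(urljoin(base, href))
    return "apple" in parts.netloc.lower() or "apple" in parts.path.lower()


# Hypothetical hrefs, as they might come out of links.get('href')
hrefs = [
    "/news/technology-apple-iphone",
    "https://www.apple.com/in/",
    "/sport",
    "http://www.bbc.com/news/business",
]

apple_links = [urljoin(BASE_URL, h) for h in hrefs if is_apple_link(h)]
print(apple_links)
```

If the link text is a better signal than the URL, the same filter could be applied to the text of each a tag instead.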

How to extract certain parts of an HTML paragraph

I am new to web scraping and regular expressions and am facing a problem here. My code gives me HTML output, but I need to extract only a certain part of the paragraph, not the complete paragraph. I need help with this. Below is my code.
import mechanize
from bs4 import BeautifulSoup
import urllib2
br = mechanize.Browser()
response = br.open("http://www.consultadni.info/index.php")
br.select_form(name="form1")
br['APE_PAT']='PATRICIO'
br['APE_MAT']='GAMARRA'
br['NOMBRES']='MARCELINA'
req=br.submit().read()
soup = BeautifulSoup(req, "lxml")
for link in soup.findAll("a"):
    sub=link.get("href")
    soup1 = BeautifulSoup(sub, "lxml")
    print soup1.find_all('p')
Output on screen:
[<p>/</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline</p>]
What I need: 30/06/1980 & 40631880
For Python 2.7, try it this way:
from urlparse import parse_qs
result = set()
# href=True skips anchors that have no href attribute
for link in soup.find_all("a", href=True):
    sub = parse_qs(link.get("href"))
    if "id2" in sub:
        result.add((sub["id2"][0], sub["dni3"][0]))
print result
Clean way to parse URLs (Python 3):
from urllib import parse
URL = "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880"
query_parts = parse.parse_qs(parse.urlparse(URL).query)
print(query_parts["id2"][0], query_parts["dni3"][0])
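For several links at once, the same idea can be sketched in Python 3 as a loop. The hrefs list below is copied from the output shown in the question; in the real script it would come from the a tags. Checking for both keys before reading them is an added safety step, not something the answers above do.

```python
from urllib.parse import parse_qs, urlparse

# hrefs as printed in the question's output
hrefs = [
    "/",
    "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880",
    "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880",
    "http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline",
]

result = set()  # a set drops the duplicate link
for href in hrefs:
    query = parse_qs(urlparse(href).query)
    # Only keep links that carry both values we are after
    if "id2" in query and "dni3" in query:
        result.add((query["id2"][0], query["dni3"][0]))

print(result)
```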

How do I make a regex match only from the last occurrence?

I am using Python and need a regex to get the contacts link from a web page. So I wrote <a (.*?)>(.*?)Contacts(.*?)</a>, and the result is:
href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
but I need the match to start at the last <a ..., like this:
href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
What regex pattern should I use?
Python code:
match = re.findall('<a (.*?)>(.*?)Contacts(.*?)</a>', body)
if match:
    for m in match:
        print ''.join(m)
Since you are parsing HTML, I would suggest using BeautifulSoup:
from bs4 import BeautifulSoup
# sample html from question
html = '<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>'
doc = BeautifulSoup(html, 'html.parser')
aTag = doc.find('a', id='menu583') # id for Contacts link
print(aTag['href'])
# '/ru/kontakt.html'
Try BeautifulSoup:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
links = []
urls = ['www.u1.com','www.u2.om']  # ... and so on
for url in urls:
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a'):
        # link.string is None when the tag contains nested markup;
        # the substring test also matches "Contacts"
        if link.string and 'contact' in link.string.lower():
            links.append(link.get('href'))
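If a regex is still wanted, the over-long match in the question comes from (.*?) being allowed to run across tag boundaries: the match starts at the first <a and the lazy groups expand until "Contacts" is found. A minimal sketch of one way around that, assuming the link text contains no nested tags, is to use negated character classes so no group can cross a > or <. The body string is rebuilt from the output in the question.

```python
import re

# HTML rebuilt from the output shown in the question
body = (
    '<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li>'
    '<li>Photo</li>'
    '<li class="last"><a href="/ru/kontakt.html" class="last" '
    'id="menu583" title="">Contacts</a></li>'
)

# [^>]* cannot leave the opening tag, and [^<]* cannot enter another tag,
# so the match has to begin at the <a that directly wraps "Contacts".
pattern = re.compile(r'<a ([^>]*)>([^<]*Contacts[^<]*)</a>')
matches = pattern.findall(body)
print(matches)

# Pull the href out of the captured attribute string
href = re.search(r'href="([^"]*)"', matches[0][0]).group(1)
print(href)
```

This is still fragile compared with an HTML parser (it breaks on nested tags or unusual quoting), so treat it as a stopgap.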

How to extract all urls having .html at the end?

from bs4 import BeautifulSoup
import requests
r = requests.get("xxx")
soup = BeautifulSoup(r.content)
for link in soup.find_all('html'):
    print link
This is not working for me. Can someone help?
for link in soup.find_all('a', href=True):
    if link['href'].endswith('.html'):
        print link
You might want to use a regular expression to match the href attributes. Something like this should help you get started, assuming you are searching all href attributes:
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen("xxx").read())  # "xxx" is the placeholder URL from the question
tags = soup.find_all(href = re.compile(r"\.html$"))
The tags variable will be a list of all HTML tags whose href attribute ends in .html. You can now loop through tags and extract each href.
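One thing to watch with an "ends in .html" test on the raw href is that links with a query string or fragment (for example page.html?lang=en) will be missed. Here is a small Python 3 sketch that tests only the path part of each URL instead. The function name and the sample hrefs are made up for illustration.

```python
from urllib.parse import urlparse


def ends_with_html(href):
    """True if the path part of the URL ends in .html (case-insensitive)."""
    return urlparse(href).path.lower().endswith(".html")


# Hypothetical hrefs, as they might be read from the <a> tags
hrefs = [
    "/index.html",
    "/about.html?lang=en",
    "/html-guide/",
    "/news/story.HTML#top",
    "/logo.png",
]

html_links = [h for h in hrefs if ends_with_html(h)]
print(html_links)
```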

Unable to scrape certain values of a website using regex

I've been trying to scrape the information inside of a particular set of p tags on a website and running into a lot of trouble.
My code looks like:
import urllib
import re
def scrape():
    url = "https://www.theWebsite.com"
    statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    status = re.findall(statusText,htmltext)
    print("Status: " + str(status))
scrape()
Which unfortunately returns only: "Status: []"
However, I have no idea what I am doing wrong, because when I was testing on the same website I could use the code
statusText = re.compile('(.+?)')
instead and get what I was after: "Status: ['About', 'About']"
Does anyone know what I can do to get the information within the div tags? Or more specifically the single set of p tags the div tags contain? I've tried plugging in just about any values I could think of and have gotten nowhere. After Google, YouTube, and SO searching I'm running out of ideas now.
I use BeautifulSoup for extracting information between HTML tags. Suppose you want to extract a div like this: <div class='article_body' itemprop='articleBody'>...</div>
Then you can use BeautifulSoup to extract it with:
soup = BeautifulSoup(<htmltext>) # creating bs object
ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
Also see the official documentation of bs4.
As an example, I have edited your code to extract a div from a Bloomberg article.
You can make your own changes.
import urllib
import re
from bs4 import BeautifulSoup
def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    soup = BeautifulSoup(htmltext)
    ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
    print ans
scrape()
You can get BeautifulSoup from here.
P.S.: I use Scrapy and BeautifulSoup for web scraping and I am satisfied with them.
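As for why the original regex returned an empty list: a likely cause, though only a guess since the real page is not shown, is that the div spans several lines. By default "." does not match a newline, so (.+?) can never reach </div>. The sketch below reproduces that with a made-up htmltext and shows the effect of re.DOTALL.

```python
import re

# Made-up page fragment; the div content is spread over several lines
htmltext = '<div id="holdsThePtagsIwant">\n  <p>Server is online</p>\n</div>'

div_pattern = r'<div id="holdsThePtagsIwant">(.+?)</div>'

# Without DOTALL, "." stops at the first newline and nothing matches
without_dotall = re.findall(div_pattern, htmltext)

# With DOTALL, "." also matches newlines, so the whole div body is captured
with_dotall = re.findall(div_pattern, htmltext, re.DOTALL)

# Then pull the p contents out of the captured div body
paragraphs = re.findall(r'<p>(.*?)</p>', with_dotall[0], re.DOTALL)

print(without_dotall)
print(paragraphs)
```

If the div is filled in by JavaScript after the page loads, neither a regex nor BeautifulSoup will see it in the downloaded HTML, so it is worth checking the raw htmltext first.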
