I am using Python and need a regex to get the Contacts link from a web page. I tried <a (.*?)>(.*?)Contacts(.*?)</a>, and the result is:
href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
but I need only the last <a ...> element, like this:
href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
What regex pattern should I use?
Python code:
import re
match = re.findall('<a (.*?)>(.*?)Contacts(.*?)</a>', body)
if match:
    for m in match:
        print(''.join(m))
Since you are parsing HTML, I would suggest using BeautifulSoup:
# sample html from question
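html = '<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>'
from bs4 import BeautifulSoup
doc = BeautifulSoup(html, 'html.parser')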
aTag = doc.find('a', id='menu583') # id for Contacts link
print(aTag['href'])
# '/ru/kontakt.html'
Try BeautifulSoup:
from BeautifulSoup import BeautifulSoup
import urllib2
links = []
urls = ['www.u1.com', 'www.u2.om']  # ... more URLs; urlopen needs a scheme such as http://
for url in urls:
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a'):
        # link.string is None when the tag contains nested markup
        if link.string and 'contact' in link.string.lower():
            links.append(link.get('href'))
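If a regex is still preferred for this small case, the over-matching in the question can be avoided by using character classes that cannot cross a tag boundary instead of (.*?). The following is only a sketch: the body string is reconstructed from the output shown in the question, and the pattern assumes the link text contains no nested tags.

```python
import re

# Sample markup reconstructed from the output quoted in the question.
body = ('<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li>'
        '<li>Photo</li>'
        '<li class="last"><a href="/ru/kontakt.html" class="last" '
        'id="menu583" title="">Contacts</a></li>')

# [^>]* cannot leave the opening tag and [^<]* cannot enter another tag,
# so each match stays inside a single <a ...>...</a> element.
pattern = re.compile(r'<a\s+([^>]*)>([^<]*Contacts[^<]*)</a>')

matches = pattern.findall(body)
for attrs, text in matches:
    print(attrs + '>' + text)

# Pull the href out of the captured attribute string.
href = re.search(r'href="([^"]*)"', matches[0][0]).group(1)
print(href)
```

An HTML parser remains the more robust choice once the markup gets less regular than this.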
I'm using requests and BeautifulSoup to find all the links on a web page whose text matches a specific string. It works, but if the text starts on a new line, BeautifulSoup doesn't "see" it and doesn't return that link.
soup = BeautifulSoup(webpageAdress, "lxml")
path = soup.findAll('a', href=True, text="Something3")
print(path)
Example:
Like this, it returns the href of the Something3 link:
...
Something3
...
Like this, it doesn't return the href of the Something3 link:
...
<a href="page1/somethingC.aspx">
Something3</a>
...
The difference is that the link text (Something3) starts on a new line.
I can't change the HTML because I'm not the webmaster of that page.
Any idea how I can solve this?
Note: I've already tried soup.replace('\n', ' ').replace('\r', ''), but I get the error 'NoneType' object is not callable.
You can use a regex to find any text that contains "Something3":
html = '''Something3
<a href="page1/somethingC.aspx">
Something3</a>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "lxml")
path = soup.findAll('a', href=True, text=re.compile("Something3"))
for link in path:
    print(link['href'])
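You can use the :contains pseudo-class with bs4 4.7.1 or later (newer Soup Sieve releases rename it to :-soup-contains):
from bs4 import BeautifulSoup as bs
html = '<a href="page1/somethingC.aspx">\nSomething3</a>'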
soup = bs(html, 'lxml')
links = [link.text for link in soup.select('a:contains(Something3)')]
print(links)
And a solution without regex:
path = soup.select('a')
if path[0].getText().strip() == 'Something3':
    print(path)
Output:
[<a href="page1/somethingC.aspx">
Something3</a>]
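To see why the exact-text match fails, it helps to look at the raw text node: it is "\nSomething3", not "Something3". The sketch below uses the standard library's html.parser instead of BeautifulSoup (a swapped-in technique, chosen only so the example has no third-party dependency); AnchorCollector is a hypothetical helper name, and the HTML is the multi-line case from the question.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._chunks)))
            self._href = None


html = '''<a href="page1/somethingC.aspx">
Something3</a>'''

collector = AnchorCollector()
collector.feed(html)
collector.close()

raw_text = collector.links[0][1]
print(repr(raw_text))  # the leading newline is part of the text

# Comparing the stripped text makes the match independent of line breaks.
hrefs = [href for href, text in collector.links if text.strip() == 'Something3']
print(hrefs)
```

The same idea carries over to BeautifulSoup: compare against the stripped text rather than the raw string.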
I am new to web scraping and regular expressions and have run into a problem. My code produces HTML output, but I need to extract only a certain part of each paragraph, not the complete paragraph. Below is my code.
import mechanize
from bs4 import BeautifulSoup
import urllib2
br = mechanize.Browser()
response = br.open("http://www.consultadni.info/index.php")
br.select_form(name="form1")
br['APE_PAT']='PATRICIO'
br['APE_MAT']='GAMARRA'
br['NOMBRES']='MARCELINA'
req=br.submit().read()
soup = BeautifulSoup(req, "lxml")
for link in soup.findAll("a"):
    sub = link.get("href")
    soup1 = BeautifulSoup(sub, "lxml")
    print(soup1.find_all('p'))
Output on screen:
[<p>/</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline</p>]
What I need: 30/06/1980 & 40631880
For Python 2.7 try this way:
from urlparse import parse_qs
result = set()
for link in soup.find_all("a"):
    # parse_qs splits the href on "&" and "=", so id2 and dni3 come out as plain keys
    sub = parse_qs(link.get("href", ""))
    if "id2" in sub:
        result.add((sub["id2"][0], sub["dni3"][0]))
print(result)
Clean way to parse URLs (Python 3):
from urllib import parse
URL = "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880"
query_parts = parse.parse_qs(parse.urlparse(URL).query)
print(query_parts["id2"][0], query_parts["dni3"][0])
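Putting the two answers together for Python 3, the loop over all links might look like the sketch below. The hrefs list is an assumption: it stands in for the values of link.get("href") and is copied from the output shown in the question.

```python
from urllib.parse import urlparse, parse_qs

# Stand-in for the href values printed in the question.
hrefs = [
    "/",
    "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880",
    "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880",
    "http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline",
]

result = set()  # a set drops the duplicate link
for href in hrefs:
    query = parse_qs(urlparse(href).query)
    # Skip links that do not carry both parameters.
    if "id2" in query and "dni3" in query:
        result.add((query["id2"][0], query["dni3"][0]))

print(result)
```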
I wrote a small Python script to read all the hrefs from a web page.
But it has a problem: it doesn't pick up href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648", for example.
Code:
import urllib
import re
urls = ["http://something.com"]
regex='href=\"(.+?)\"'
pattern = re.compile(regex)
htmlfile = urllib.urlopen(urls[0])
htmltext = htmlfile.read()
hrefs = re.findall(pattern,htmltext)
print hrefs
Can anybody help me? Thanks.
Use BeautifulSoup and requests for static websites. BeautifulSoup is a great module for web scraping; with the code below you can easily get the value of each href attribute. Hope it helps.
import requests
from bs4 import BeautifulSoup
url = 'whatever url you want to parse'
result = requests.get(url)
soup = BeautifulSoup(result.content,'html.parser')
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
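The question does not show the page source, so the cause of the missing href is unknown. One possibility (purely an assumption) is that the attribute is single-quoted, which the pattern href="(.+?)" cannot match. The sketch below, on made-up sample HTML, accepts either quote style by using a backreference.

```python
import re

# Hypothetical sample: the second link uses single quotes.
html = """<a href="index.php">Home</a>
<a href='pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648'>News</a>"""

# The pattern from the question only sees double-quoted values.
original = re.findall(r'href="(.+?)"', html)

# (["']) captures the opening quote and \1 requires the same closing quote.
pattern = re.compile(r'''href\s*=\s*(["'])(.*?)\1''', re.IGNORECASE)
hrefs = [m.group(2) for m in pattern.finditer(html)]

print(original)
print(hrefs)
```

If the link is instead generated by JavaScript, no regex over the downloaded HTML will find it.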
I have a number of Facebook groups whose member counts I would like to get. An example would be this group: https://www.facebook.com/groups/347805588637627/
I inspected the element on the page, and the count is stored like so:
<span id="count_text">9,413 members</span>
I am trying to get "9,413 members" out of the page. I have tried using BeautifulSoup but cannot work it out.
Thanks
Edit:
from bs4 import BeautifulSoup
import requests
url = "https://www.facebook.com/groups/347805588637627/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
span = soup.find("span", id="count_text")
print(span.text)
In case there is more than one span tag in the page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_input, 'html.parser')
span = soup.find("span", id="count_text")
span.text
You can use the text attribute of the parsed span:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<span id="count_text">9,413 members</span>', 'html.parser')
>>> soup.span
<span id="count_text">9,413 members</span>
>>> soup.span.text
'9,413 members'
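If the count is needed as a number rather than as the string '9,413 members', a small follow-up step can strip the thousands separator. This is a sketch that assumes the text always starts with a comma-grouped integer, as in the example above.

```python
import re

count_text = '9,413 members'  # the span text from the example above

# Grab the first run of digits and commas, then drop the commas.
match = re.search(r'[\d,]+', count_text)
member_count = int(match.group().replace(',', ''))

print(member_count)
```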
If you have more than one span tag you can try this
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
for tag in tags:
    print(tag.contents[0])
Facebook uses JavaScript to prevent bots from scraping. You will likely need a browser automation tool such as Selenium to extract the data from Python.
How to extract the contents between these tags when they're on multiple/ different lines?
<link>
https://widget.websta.me/rss/n/bleh
</link>
I tried:
content = re.findall('<link>(.*)</link>', web_page_contents, re.DOTALL)
But I get the next mention of <link> instead of this one.
You can use BeautifulSoup to do that. It has very good documentation and is easy to use.
The following code will work:
import requests
from bs4 import BeautifulSoup
r = requests.get(webpage_url)
# Use the XML parser: HTML parsers treat <link> as an empty tag and lose its text
soup = BeautifulSoup(r.content, 'xml')
for link in soup.find_all('link'):
    print(link.text.strip())
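For completeness, the regex attempt from the question can also be made to work by matching lazily, so that each match stops at the nearest closing tag. This is a sketch: the second link in the sample is hypothetical, added only to show that the matches stay separate.

```python
import re

# The first element is from the question; the second URL is made up.
web_page_contents = """<link>
https://widget.websta.me/rss/n/bleh
</link>
<item><link>
https://example.com/second
</link></item>"""

# (.*?) is lazy, and \s* on both sides trims the surrounding line breaks.
links = re.findall(r'<link>\s*(.*?)\s*</link>', web_page_contents, re.DOTALL)

print(links)
```

With a greedy (.*) and re.DOTALL, the single match would instead run from the first <link> to the last </link>.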