Say I'm looking for the keyword "sales" and I want to get the nearest "http://www.somewebsite.com", even if there are multiple links in the file. I want the nearest link, not the first link, which means I have to search for the link that comes just before the keyword match.
This doesn't work...
regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales
What's the best way to find the link that is closest to a keyword?
It is generally much easier and more robust to use an HTML parser rather than regex.
Using the third-party module lxml:
import lxml.html as LH
content = '''<html>
<p>other <a href="http://www.somewebsite.com">stuff</a></p><p>sales</p>
</html>
'''
doc = LH.fromstring(content)
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)
yields
http://www.somewebsite.com
I find lxml (and XPath) a convenient way to express what elements I'm looking for. However, if installing a third-party module is not an option, you could also accomplish this particular job with HTMLParser from the standard library:
import HTMLParser
import contextlib
class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']
content = '''<html>
<p>other <a href="http://www.somewebsite.com">stuff</a></p><p>sales</p>
</html>
'''
idx = content.find('sales')
with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
Regarding the XPath used in the lxml solution, it has the following meaning:
//*                            # Find all elements
[contains(text(),"sales")]     # whose text content contains "sales"
/preceding::*                  # search the elements preceding the match
[starts-with(@href,"http")]    # that have an href attribute starting with "http"
[1]                            # select only the first such element; preceding is a reverse axis, so this is the nearest one before the match
/@href                         # return the value of the href attribute
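If you want to restrict the match to <a> tags specifically rather than any element carrying an href, a small, untested variation on the same XPath should work:
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::a[starts-with(@href,"http")][1]/@href'''):
    print(url)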
I don't think you can do this one with regex alone (especially looking before the keyword match) as it has no sense of comparing distances.
I think you're best off doing something like this:
find all occurrences of sales and get their substring indices, call this salesIndex
find all occurrences of https?://[-A-Za-z0-9./]+ and get their substring indices, call this urlIndex
loop through salesIndex. For each location i in salesIndex, find the closest entry in urlIndex.
Depending on how you want to judge "closest", you may need the start and end indices of both the sales and http... occurrences to compare, i.e. find the end index of a URL that is closest to the start index of the current occurrence of sales, find the start index of a URL that is closest to the end index of the current occurrence of sales, and pick whichever is closer.
You can use matches = re.finditer(pattern, string, re.IGNORECASE) to iterate over the matches, and then match.span() to get the start/end substring indices of each match.
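As a compact illustration of that approach (a sketch with a made-up input string, taking "closest" to mean the smallest character gap between the spans, and assuming at least one URL is present):
import re

text = "intro http://www.firstlink.com lots of other words sales http://www.secondlink.com"  # made-up example
urls = list(re.finditer(r'https?://[-A-Za-z0-9./]+', text, re.IGNORECASE))
for kw in re.finditer(r'sales', text, re.IGNORECASE):
    # for each URL, take the gap to the keyword on whichever side the URL sits
    nearest = min(urls, key=lambda u: min(abs(kw.start() - u.end()), abs(u.start() - kw.end())))
    print(nearest.group())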
Building on what mathematical.coffee suggested, you could try something along these lines:
import re
myString = "" ## the string you want to search
link_matches = re.finditer('(http|https)://[-A-Za-z0-9./]+',myString,re.IGNORECASE)
sales_matches = re.finditer('sales',myString,re.IGNORECASE)
link_locations = []
for match in link_matches:
link_locations.append([match.span(),match.group()])
for match in sales_matches:
match_loc = match.span()
distances = []
for link_loc in link_locations:
if match_loc[0] > link_loc[0][1]: ## if the link is behind your keyword
## append the distance between the END of the keyword and the START of the link
distances.append(match_loc[0] - link_loc[0][1])
else:
## append the distance between the END of the link and the START of the keyword
distances.append(link_loc[0][0] - match_loc[1])
for d in range(0,len(distances)-1):
if distances[d] == min(distances):
print ("Closest Link: " + link_locations[d][1] + "\n")
break
I tested out this code and it seems to be working...
import re

def closesturl(keyword, website):
    keylist = []
    urllist = []
    closest = []
    urls = []
    urlregex = "(http|https)://[-A-Za-z0-9\\./]+"
    urlmatches = re.finditer(urlregex, website, re.IGNORECASE)
    keymatches = re.finditer(keyword, website, re.IGNORECASE)
    for n in keymatches:
        keylist.append([n.start(), n.end()])
    if len(keylist) > 0:
        for m in urlmatches:
            urllist.append([m.start(), m.end()])
    if (len(keylist) > 0) and (len(urllist) > 0):
        for i in range(0, len(keylist)):
            closest.append(abs(urllist[0][0] - keylist[i][0]))
            urls.append(website[urllist[0][0]:urllist[0][1]])
            if len(urllist) >= 1:
                for j in range(1, len(urllist)):
                    if abs(urllist[j][0] - keylist[i][0]) < closest[i]:
                        closest[i] = abs(keylist[i][0] - urllist[j][0])
                        urls[i] = website[urllist[j][0]:urllist[j][1]]
                    if abs(urllist[j][0] - keylist[i][0]) > closest[i]:
                        break  # local minimum / inflection point, break from the url list
    if (len(keylist) > 0) and (len(urllist) > 0):
        return urls  # return website[urllist[index[0]][0]:urllist[index[0]][1]]
    else:
        return ""

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
keyword = "mykeyword"
print(closesturl(keyword, somestring))
The above when run shows... http://www.secondlink.com.
If someone's got ideas on how to speed up this code, that would be awesome!
I can scrape a Wikipedia page using the wikipedia API:
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
regex_result = re.findall("==\s(.+?)\s==", text)
print(regex_result)
From every element in regex_result (the Wikipedia headers) I want to get the text below it and append it to another list. I have dug through the internet and I don't know how to do that with a function from the wikipedia API.
A second option would be to take the whole text and extract the text between headers with some module; more here: find some text in a string between specific characters.
I have tried this:
l = 0
for n in regex_result:
    try:
        regal = re.findall(f"==\s{regex_result[l]}\s==(.+?)\s=={regex_result[l+1]}\s==", text)
        l += 2
    except Exception:
        continue
But it is not working: the output is only [].
You don't want to call re twice, but rather iterate directly through the results provided by regex_result. Named groups in the form of (?P<name>...) make it even easier to extract the header name without the surrounding markup.
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
# using the number 2 for '=' means you can easily find sub-headers too by increasing the value
regex_result = re.findall("\n={2}\s(?P<header>.+?)\s={2}\n", text)
regex_result will then be a list of strings of all the top-level section headers.
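If you also want the text below each header, which was the original goal, one option is to split the content on the same header pattern; a sketch, assuming only the top-level '==' headers matter:
import re

# re.split with a capturing group keeps the headers in the result, so the list
# alternates: [text_before_first_header, header1, body1, header2, body2, ...]
parts = re.split(r"\n={2}\s(.+?)\s={2}\n", text)
sections = dict(zip(parts[1::2], parts[2::2]))  # header -> the text below it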
Here's what I use to make a table of contents from a wiki page. (Note: f-strings require Python 3.6)
def get_wikiheader_regex(level):
    '''The top wikiheader level has two = signs, so add 1 to the level to get the correct number.'''
    assert isinstance(level, int) and level > -1
    header_regex = f"^={{{level+1}}}\s(?P<section>.*?)\s={{{level+1}}}$"
    return header_regex

def get_toc(raw_page, level=1):
    '''For a single raw wiki page, return the level 1 section headers as a table of contents.'''
    toc = []
    header_regex = get_wikiheader_regex(level=level)
    for line in raw_page.splitlines():
        if line.startswith('=') and re.search(header_regex, line):
            toc.append(re.search(header_regex, line).group('section'))
    return toc
>>> get_toc(text)
The project that I am doing requires us to input a URL, follow the link at a particular position a certain number of times, and then return the last page visited. I have found a solution with a while loop, but now I am trying to do it with recursion.
Example: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah
My code is this so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re
cnt = input("Enter count:")
post = input("Enter position:")
url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
count = int(cnt)
position = int(post)
def GetLinks(initalPage, position, count):
    html = urlopen(initalPage).read()
    soup = BeautifulSoup(html, "html.parser")
    temp = ""
    links = list()
    tags = soup('a')
    for tag in tags:
        x = tag.get('href', None)
        links.append(x)
    print(links[position - 1])
    if count > 1:
        GetLinks(links[position - 1], position, count - 1)
    return links[position - 1]
y = GetLinks(url, position, count)
print("****", y)
I see two problems with my code.
I am creating a list that is expanding with every recursion, which makes it very hard to locate the proper value.
Second, I am obviously returning the wrong item.
I don't know exactly how to fix this.
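For what it's worth, the links list is already local to each call, so it is not actually growing across recursions; the real problem is that the result of the recursive call is never returned. A minimal sketch of the fix, using a hypothetical lower-case name so it doesn't clash with the function above:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_links(initial_page, position, count):
    # collect the hrefs on the current page
    html = urlopen(initial_page).read()
    soup = BeautifulSoup(html, "html.parser")
    links = [tag.get('href', None) for tag in soup('a')]
    next_url = links[position - 1]
    print(next_url)
    if count > 1:
        # the key change: pass the recursive result back up the call chain
        return get_links(next_url, position, count - 1)
    return next_url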
I have a little script in python3.7 (see related question here) that scrapes links from a website (http://digesto.asamblea.gob.ni/consultas/coleccion/) and saves them in a list. Unfortunately, they are only partial and I have to trim them to use them as links.
This is the relevant part of the script:
list_of_links = []  # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url  # for any links not starting with /
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements_by_css_selector("tbody tr")  # get all table rows
for row in rows:
    row.find_element_by_css_selector('button').click()
    link = row.find_element_by_css_selector('li a[onclick*=pdf]').get_attribute("onclick")  # href
    print(list_of_links)
    # trim
    if link.startswith('/'):
        list_of_links.append(tld + link)
    else:
        list_of_links.append(current_url + link)
    row.find_element_by_css_selector('button').click()
print(list_of_links)
How can I manipulate the list (shown here with only three entries as an example) so that this
["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]
looks like
["http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D"]
Breaking it down using the first link as an example: I get this link from the website basically as
http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;
and need to trim it to
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D.
How do I achieve this in Python for the entire list?
One approach is to split on the string /consultas/coleccion/window.open(', remove the unwanted end of the second string and concatenate the two processed strings to get your result.
This should do it:
new_links = []
for link in list_of_links:
    current_strings = link.split("/consultas/coleccion/window.open('")
    current_strings[1] = current_strings[1].split("');return")[0]
    new_link = current_strings[0] + current_strings[1]
    new_links.append(new_link)
This should do the trick:
s = "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"
s = s.replace("/consultas/coleccion/window.open('", "").replace("');return false;", "")
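Applied to the entire list, the same two replace calls can be wrapped in a list comprehension (cleaned_links is just a hypothetical name):
cleaned_links = [
    link.replace("/consultas/coleccion/window.open('", "")
        .replace("');return false;", "")
    for link in list_of_links
]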
You could use a regular expression to split the URLs in your list and let urllib.parse.urljoin() do the rest for you:
import re
from urllib.parse import urljoin
PATTERN = r"^([\S]+)window.open\('([\S]+)'"
links = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"]
result = "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D"
for link in links:
    m = re.match(PATTERN, link, re.MULTILINE).groups()
    # m is now: ('http://digesto.asamblea.gob.ni/consultas/coleccion/', '/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D')
    if len(m) == 2:
        newLink = urljoin(*m)
        print(newLink)
        assert newLink == result
Returns:
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D
For that you can use a regular expression.
Consider this code:
import re

out = list()
lst = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]
for el in lst:
    temp = re.sub(r"(.*?)/consultas/coleccion/window\.open\('(.*?)'\).*", r"\1\2", el)
    out.append(temp)
    print(temp)
The sub function replaces the part of the string that matches the specified pattern. Basically the pattern says:
(.*?): keep all the characters before /consultas/coleccion/window.open(
/consultas/coleccion/window\.open\(': the input string must contain this text, but it will not be kept
(.*?): keep all the characters after the previous pattern until ') is found (represented by '\) in the pattern); the trailing .* then drops everything after it
I am trying to scrape the content of a cell besides another cell of which I know the name e.g. "Staatsform", "Amtssprache", "Postleitzahl" etc. In the picture the needed content is always in the right cell.
The basic code is the following one, but I am stuck with it:
import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
staatsform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to exercise care in limiting the search to what is called the 'Infobox' in the English-language wikipedia. Therefore, I searched first for the heading 'Basisdaten', requiring that it be a th element. Not exactly definitive perhaps but more likely to be. Having found that I looked for tr elements under 'Basisdaten' until I found another tr including a (presumed different) heading. In this case, I search for 'Postleitzahlen:' but this approach makes it possible to find any/all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for the 'if not current.name' check. I noticed some lines consisting of just newlines, which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in the code.
import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')

def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'

basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.nextSibling
while True:
    if not current.name:
        current = current.nextSibling
        continue
    if wanted in current.text:
        items = current.findAll('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.nextSibling
Result like this: two separate td elements, as requested.
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
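If only the text of the right-hand cell is needed rather than the raw elements, the second td can be flattened with get_text; a small, untested addition inside the same branch:
# inside the `if wanted in current.text:` branch above
right_cell_text = items[1].get_text(" ", strip=True)  # roughly "20095–21149, 22041–22769, 27499"
print(right_cell_text)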
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
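Hypothetical usage, assuming soup has been built from the same Hamburg page as in the other answer:
print(get_content_from_right_column_for_left_column_containing('Staatsform:'))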
I am attempting to use BeautifulSoup to parse an html table which I uploaded to http://pastie.org/8070879 in order to get the three columns (0 to 735, 0.50 to 1.0 and 0.5 to 0.0) as lists. To explain why, I will want the integers 0-735 to be keys and the decimal numbers to be values.
From reading many of the other posts on SO, I have come up with the following which does not come close to creating the lists I want. All it does is display the text in the table as is seen here http://i1285.photobucket.com/albums/a592/TheNexulo/output_zps20c5afb8.png
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("fide.html"))
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text + "|",
    print
I'm new to Python and BeautifulSoup, so please be gentle with me! Thanks
HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the input HTML structure. But sometimes (like in this case) that model gets in the way more than helps. Pyparsing includes some HTML parsing features that are more robust than just using raw regexes, but otherwise work in similar fashion, letting you define snippets of HTML of interest, and just ignoring the rest. Here is a parser that reads through your posted HTML source:
from pyparsing import makeHTMLTags,withAttribute,Suppress,Regex,Group
""" looking for this recurring pattern:
<td valign="top" bgcolor="#FFFFCC">00-03</td>
<td valign="top">.50</td>
<td valign="top">.50</td>
and want a dict with keys 0, 1, 2, and 3 all with values (.50,.50)
"""
td,tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td,tdend,keytd = map(Suppress,(td,tdend,keytd))
realnum = Regex(r'1?\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t:int(t[0]))
DASH = Suppress('-')
# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
Group(2*(td + realnum + tdend))("vals"))
This parser not only picks out the matching triples, it also extracts the start-end integers and the pairs of real numbers (and also already converts from string to integers or floats at parse time).
Looking at the table, I'm guessing you actually want a lookup that will take a key like 700, and return the pair of values (0.99, 0.01), since 700 falls in the range of 620-735. This bit of code searches the source HTML text, iterates over the matched entries and inserts key-value pairs into the dict lookup:
# search the input HTML for matches to the entryExpr expression, and build up lookup dict
lookup = {}
for entry in entryExpr.searchString(sourcehtml):
    for i in range(entry.start, entry.end+1):
        lookup[i] = tuple(entry.vals)
And now to try out some lookups:
# print out some test values
for test in (0, 20, 100, 700):
    print(test, lookup[test])
prints:
0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)
I think the above answer is better than what I would offer, but I have a BeautifulSoup answer that can get you started. This is a bit hackish, but I figured I would offer it nevertheless.
With BeautifulSoup, you can find all the tags with certain attributes in the following way (assuming you have a soup object already set up):
soup.find_all('td', attrs={'bgcolor':'#FFFFCC'})
That will find all of your keys. The trick is to associate these with the values you want, which all show up immediately afterward and which are in pairs (if these things change, by the way, this solution won't work).
Thus, you can try the following to access what follows your key entries and put those into your_dictionary:
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling
The problem is that the "next_sibling" is actually a '\n', so you have to do the following to capture the next value (the first value you want):
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling.next_sibling.string
And if you want the two following values, you have to double this:
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = [node.next_sibling.next_sibling.string, node.next_sibling.next_sibling.next_sibling.next_sibling.string]
Disclaimer: that last line is pretty ugly to me.
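A slightly less fragile alternative (untested against your exact HTML) is find_next_siblings, which skips the intervening newline strings and grabs the next two td cells directly:
for node in soup.find_all('td', attrs={'bgcolor': '#FFFFCC'}):
    value_cells = node.find_next_siblings('td', limit=2)  # the two value cells after each key cell
    your_dictionary[node.string] = [td.string for td in value_cells]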
I've used BeautifulSoup 3, but it probably will work under 4.
# Import System libraries
import re

# Import Custom libraries
from BeautifulSoup import BeautifulSoup

# This may be different between BeautifulSoup 3 and BeautifulSoup 4
with open("fide.html") as file_h:
    # Read the file into the BeautifulSoup class
    soup = BeautifulSoup(file_h.read())

tr_location = lambda x: x.name == u"tr"  # Row location
key_location = lambda x: x.name == u"td" and bool(set([(u"bgcolor", u"#FFFFCC")]) & set(x.attrs))  # Integer key location
td_location = lambda x: x.name == u"td" and not dict(x.attrs).has_key(u"bgcolor")  # Float value location

str_key_dict = {}
num_key_dict = {}
for tr in soup.findAll(tr_location):  # Loop through all found rows
    for key in tr.findAll(key_location):  # Loop through all found Integer key tds
        key_list = []
        key_str = key.text.strip()
        for td in key.findNextSiblings(td_location)[:2]:  # Loop through the next 2 neighbouring Float values
            key_list.append(td.text)
        key_list = map(float, key_list)  # Convert the text values to floats

        # String based dictionary section
        str_key_dict[key_str] = key_list

        # Number based dictionary section
        num_range = map(int, re.split("\s*-\s*", key_str))  # Extract a value range to perform interpolation
        if(len(num_range) == 2):
            num_key_dict.update([(x, key_list) for x in range(num_range[0], num_range[1] + 1)])
        else:
            num_key_dict.update([(num_range[0], key_list)])

for x in num_key_dict.items():
    print x