I am using Beautiful Soup to grab text from an html element.
I am then using a loop and if statement to compare that text to a list of words. If they match I want to return a confirmation.
However, the code is not confirming any matches, even though print statements show there are in fact matches.
import requests
import csv
from bs4 import BeautifulSoup

def findText():
    text = ""
    url = 'www.site.com'
    #Get url and store
    page = requests.get(url)
    #Get page content
    soup = BeautifulSoup(page.content, "html.parser")
    els = soup.select(".className")
    lists = els[1].select(".className2")
    for l in lists:
        try:
            text = l.find("li").get_text()
        except(AttributeError):
            text = "null"
    return text
def isMatch(text):
    #Open csv file
    listFile = open('list.csv', 'rb')
    #prep file to be read
    newListFile = csv.reader(listFile)
    match = ""
    for r in newListFile:
        if r[0] == text.lower():
            match = True
        else:
            match = False
    return match
    congressCSVFile.close()
match is always False in the output
print(r[0]) returns (let's just say) "cat" in terminal
print(text) also returns "cat" in terminal
Your loop is the problem, or at least one of them. Once you find a record that matches, you keep going. match will only end up True if the last record matches. To fix this, simply return when you find a match:
for r in newListFile:
    if r[0] == text.lower():
        return True
return False
The match variable is not needed.
Better yet, use the any() function:
return any(r[0] == text.lower() for r in newListFile)
In your try block, use: text = l.find("li").get_text(strip=True)
BeautifulSoup, and HTML in general, adds a significant amount of whitespace. If you don't strip it out with the strip parameter, you may never get a match unless the whitespace is also included in your list file.
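A tiny sketch of the difference (the markup here is made up for illustration, not taken from the actual page):
from bs4 import BeautifulSoup

# hypothetical markup with whitespace around the text inside the <li>
soup = BeautifulSoup("<ul><li>\n  cat  \n</li></ul>", "html.parser")
print(repr(soup.find("li").get_text()))            # '\n  cat  \n' - will never equal "cat"
print(repr(soup.find("li").get_text(strip=True)))  # 'cat'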
I can scrape a Wikipedia page using the wikipedia API:
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
regex_result = re.findall(r"==\s(.+?)\s==", text)
print(regex_result)
Now, for every element in regex_result (the Wikipedia headers), I want to get the text below it and append it to another list. I have dug through the internet and I do not know how to do that with some function in the Wikipedia API.
The second option is to get the whole text and extract the text between headers with some module; more here: find some text in a string between some specific characters.
I have tried this:
l = 0
for n in regex_result:
    try:
        regal = re.findall(f"==\s{regex_result[l]}\s==(.+?)\s=={regex_result[l+1]}\s==", text)
        l += 2
    except Exception:
        continue
But it is not working:
the output is only []
You don't want to call re twice, but rather iterate directly through the results provided by regex_result. Named groups in the form of (?P<name>...) make it even easier to extract the header name without the surrounding markup.
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
# using the number 2 for '=' means you can easily find sub-headers too by increasing the value
regex_result = re.findall(r"\n={2}\s(?P<header>.+?)\s={2}\n", text)
regex_result will then be a list of strings of all the top-level section headers.
Here's what I use to make a table of contents from a wiki page. (Note: f-strings require Python 3.6)
def get_wikiheader_regex(level):
    '''The top wikiheader level has two = signs, so add 1 to the level to get the correct number.'''
    assert isinstance(level, int) and level > -1
    header_regex = rf"^={{{level+1}}}\s(?P<section>.*?)\s={{{level+1}}}$"
    return header_regex

def get_toc(raw_page, level=1):
    '''For a single raw wiki page, return the level 1 section headers as a table of contents.'''
    toc = []
    header_regex = get_wikiheader_regex(level=level)
    for line in raw_page.splitlines():
        if line.startswith('=') and re.search(header_regex, line):
            toc.append(re.search(header_regex, line).group('section'))
    return toc
>>> get_toc(text)
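To also get the text below each header (what the question asks for), one hedged option is to split the page content on the same header pattern; because the pattern has a capturing group, re.split keeps the header names in the result, so headers and section bodies alternate:
import re
import wikipedia

page = wikipedia.page("Albert Einstein")
text = page.content

# split on top-level headers; the capturing group keeps the header text in the result
parts = re.split(r"\n==\s(.+?)\s==\n", text)
# parts[0] is the intro; after that, headers (odd indices) and bodies (even indices) alternate
sections = dict(zip(parts[1::2], parts[2::2]))
print(list(sections.keys()))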
So I'm trying to search and replace rows of text from a csv file, and I keep getting errors if re.sub() can't find any matches.
Say if the text in a row is
text = "a00123 一二三四五"
And my code is
import re
html = "www.abcdefg.com/"
text = "a00123 一二三四五"
namelist_raw = re.sub(r'([a-z])00(\d{3})',r'\1-\2',text)
p = re.findall(r'\w',namelist_raw)
if p:
    q = re.findall(r'([a-z]-\d{3})', namelist_raw)
    for namelist in q:
        print(namelist)
else:
    namelist = "failed"

link = html + namelist
print(link)
So for this I should be getting a result of
www.abcdefg.com/a-123
so that's no problem.
But if the text is something like this,
text = "asdfdsdfd123 一二三四五"
I'll get a NameError saying name 'namelist' is not defined.
Why is that? I thought that with the if/else statement I had already covered every other case, so namelist would be "failed".
Your p = re.findall(r'\w',namelist_raw) is extracting every word char from a string, and later, you only extract the values from the string if there were matches. You do not need that check.
Next, namelist is only populated if there is a match for [a-z]-\d{3}, but if there is no match, you do not get it populated. You need to account for that scenario, too.
Use
import re
html = "www.abcdefg.com/"
text = "a00123 一二三四五"
p = re.findall(r'([a-z])00(\d{3})', text) # Extract a list of tuples
namelist = [] # Init the list
for letter, number in p:
    namelist.append(f"{letter}-{number}")  # Populate namelist with formatted tuple values

if len(namelist):  # If there was a match
    namelist = "/".join(namelist)  # Create a string by joining namelist items with /
else:
    namelist = "failed"  # Else, assign failed to the namelist
link = html + namelist
print(link)
See the Python demo.
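Wrapped into a small helper (the name make_link is only for illustration), the same logic can be checked against both sample strings from the question:
import re

def make_link(text, html="www.abcdefg.com/"):
    pairs = re.findall(r'([a-z])00(\d{3})', text)  # list of (letter, number) tuples
    names = [f"{letter}-{number}" for letter, number in pairs]
    return html + ("/".join(names) if names else "failed")

print(make_link("a00123 一二三四五"))        # www.abcdefg.com/a-123
print(make_link("asdfdsdfd123 一二三四五"))  # www.abcdefg.com/failed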
I have a little script in python3.7 (see related question here) that scrapes links from a website (http://digesto.asamblea.gob.ni/consultas/coleccion/) and saves them in a list. Unfortunately, the links are only partial and I have to trim them before I can use them as links.
This is the relevant part of the script:
list_of_links = [] # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url # for any links not starting with /
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements_by_css_selector("tbody tr") # get all table rows
for row in rows:
    row.find_element_by_css_selector('button').click()
    link = row.find_element_by_css_selector('li a[onclick*=pdf]').get_attribute("onclick")  # href
    print(list_of_links)  # trim
    if link.startswith('/'):
        list_of_links.append(tld + link)
    else:
        list_of_links.append(current_url + link)
    row.find_element_by_css_selector('button').click()
print(list_of_links)
How can I manipulate the list (shown here with only three entries as an example) so that this
["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]
looks like
["http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D"]
Breaking it down with the first link as an example: I get this link from the website basically as
http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;
and need to trim it to
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D.
How do I achieve this in python from the entire list?
One approach is to split on the string /consultas/coleccion/window.open(', remove the unwanted end of the second string and concatenate the two processed strings to get your result.
This should do it:
new_links = []
for link in list_of_links:
    current_strings = link.split("/consultas/coleccion/window.open('")
    current_strings[1] = current_strings[1].split("');return")[0]
    new_link = current_strings[0] + current_strings[1]
    new_links.append(new_link)
This should do the trick:
s = "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"
s = s.replace("/consultas/coleccion/window.open('", "").replace("');return false;", "")
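Applied to the whole list from the question, assuming every entry has exactly that prefix and suffix:
new_links = [
    link.replace("/consultas/coleccion/window.open('", "").replace("');return false;", "")
    for link in list_of_links
]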
You could use a regular expression to split the URLs in your list and let urllib.parse.urljoin() do the rest for you:
import re
from urllib.parse import urljoin
PATTERN = r"^([\S]+)window.open\('([\S]+)'"
links = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"]
result = "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D"
for link in links:
    m = re.match(PATTERN, link, re.MULTILINE).groups()
    # m is now: ('http://digesto.asamblea.gob.ni/consultas/coleccion/', '/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D')
    if len(m) == 2:
        newLink = urljoin(*m)
        print(newLink)
        assert newLink == result
Returns:
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D
For that you can use a regular expression:
Consider this code:
import re
out = list()
lst = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]
for el in lst:
    temp = re.sub(r"(.*?)/window.open\('(.*?)'\).*", r"\1\2", el)
    out.append(temp)
    print(temp)
The sub function allows you to replace the part of a string that matches the specified pattern. Basically the pattern says:
(.*?): keep all the characters before /window.open...
/window.open\(: the input string must contain the pattern /window.open( but it will not be kept
(.*?): keep all characters after the previous pattern until a ) is found (represented by \))
I am trying to clean this website and get every word. But using generators gives me more words than using lists. Also, these words are inconsistent: sometimes I get one extra word, sometimes none, sometimes more than 30 extra words. I have read about generators in the Python documentation and looked up some questions about generators. From what I understand, it shouldn't make a difference. I don't understand what's going on underneath the hood. I am using python 3.6. I have also read Generator Comprehension different output from list comprehension? but I can't understand the situation.
This is the first function, with generators.
def text_cleaner1(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(url).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site
    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object
    text = soup_obj.get_text()  # Get the text from this

    lines = (line.strip() for line in text.splitlines())  # break into lines
    print(type(lines))
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))  # break multi-headlines into a line each
    print(type(chunks))

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:  # in a way that this works, can occasionally throw
        return  # an exception

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++
    text = text.lower().split()  # Go to lower case and split them apart
    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]
    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)
    return text
This is the second function, with list comprehensions.
def text_cleaner2(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(url).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site
    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object
    text = soup_obj.get_text()  # Get the text from this

    lines = [line.strip() for line in text.splitlines()]  # break into lines
    chunks = [phrase.strip() for line in lines for phrase in line.split(" ")]  # break multi-headlines into a line each

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:  # in a way that this works, can occasionally throw
        return  # an exception

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++
    text = text.lower().split()  # Go to lower case and split them apart
    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]
    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)
    return text
And this code gives me different results randomly:
text_cleaner1("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae") - text_cleaner2("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae")
A generator is "lazy" - it doesn't execute the code immediately, but executes it later, when the results are needed. That means it doesn't get the values from variables or functions immediately, but keeps references to those variables and functions.
Example from the linked question:
all_configs = [
{'a': 1, 'b':3},
{'a': 2, 'b':2}
]
unique_keys = ['a','b']
for x in zip( *([c[k] for k in unique_keys] for c in all_configs) ):
    print(x)

print('---')

for x in zip( *((c[k] for k in unique_keys) for c in all_configs) ):
    print(list(x))
In the generator version there is a for loop inside another for loop.
The internal generator gets a reference to c instead of the value of c, and it will get the value later.
Later (when it has to get the results from the generators) it starts executing the external generator for c in all_configs. When the external generator is executed it loops and creates two internal generators which use the reference to c, not the value of c, but while looping it also changes the value of c - so in the end you have a "list" with two internal generators and {'a': 2, 'b': 2} in c.
After that it executes the internal generators, which finally get the value from c, but at that moment c already holds {'a': 2, 'b': 2}.
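With that in mind, the snippet above should print roughly this (the list-comprehension version first, then the generator version, where both internal generators end up reading the final value of c):
(1, 2)
(3, 2)
---
[2, 2]
[2, 2]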
BTW: there is a similar problem with lambda in a for loop when you use it with Buttons in tkinter.
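A minimal sketch of that tkinter pitfall (hypothetical buttons, unrelated to the scraping code): the plain lambda is late-bound and every callback prints the final value of i, while binding i as a default argument captures the value at definition time:
import tkinter as tk

root = tk.Tk()
for i in range(3):
    # late binding: every one of these callbacks prints the final value of i
    tk.Button(root, text=f"late {i}", command=lambda: print(i)).pack()
    # default argument: captures the current value of i when the lambda is defined
    tk.Button(root, text=f"bound {i}", command=lambda i=i: print(i)).pack()
root.mainloop()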
I am trying to scrape the content of a cell beside another cell of which I know the name, e.g. "Staatsform", "Amtssprache", "Postleitzahl" etc. In the picture the needed content is always in the cell on the right.
The basic code is the following one, but I am stuck with it:
import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
stastaform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to exercise care in limiting the search to what is called the 'Infobox' in the English-language wikipedia. Therefore, I searched first for the heading 'Basisdaten', requiring that it be a th element. Not exactly definitive perhaps but more likely to be. Having found that I looked for tr elements under 'Basisdaten' until I found another tr including a (presumed different) heading. In this case, I search for 'Postleitzahlen:' but this approach makes it possible to find any/all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for if not current.name. I noticed some lines consisting of just new lines which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in code.
import requests
import bs4
page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')
def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'

basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.nextSibling
while True:
    if not current.name:
        current = current.nextSibling
        continue
    if wanted in current.text:
        items = current.findAll('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.nextSibling
Result like this: two separate td elements, as requested.
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
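If only the plain text of the right-hand cell is needed, calling get_text() on that second td collapses the <br/> markup (the separator string is just a choice here):
print(items[1].get_text(" ", strip=True))  # e.g. 20095–21149, 22041–22769, 27499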
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
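A usage sketch, assuming soup was built from the Hamburg article as in the question (which labels are present depends on the particular infobox):
print(get_content_from_right_column_for_left_column_containing('Postleitzahlen:'))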