Python regex help

I am trying to write a regex that finds all names, URLs and phone numbers in an HTML page.
But I'm having trouble with the phone number part. I think the problem is that it searches until it finds the next </strong>, and in the process it skips people instead of producing an empty string when a person has no phone number. Simply put, instead of a list like url1+name1+num1 | url2+name2+"" | url3+name3+num3, it returns a list like url1+name1+num1 | url2+name2+num3, with url3+name3 swallowed in the process.
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):
I am searching for people in a single very long line. A person may or may not have a URL or a phone number.
An example of a person with a URL and a phone number:
<tr> <td class="lablinksName"><div> dr. Ivan Bratko akad. prof.</div></td> <td class="lablinksMail"><img src="/Static/images/gui/mail.gif" height="8" width="11"></td> <td class="lablinksPhone"><div><strong>T:</strong> +386 1 4768 393 </div></td> </tr>
And an example of a person with no URL or phone number:
<tr> <td class="lablinksName"><div> dr. Branko Matjaž Jurič prof.</div></td> <td class="lablinksMail"><img src="/Static/images/gui/mail.gif" height="8" width="11"></td> <td class="lablinksPhone"><div> </div></td> </tr>
I hope I was clear enough; any help is appreciated.

import lxml.html

root = lxml.html.parse("http://my.example.com/page.html").getroot()
rows = root.xpath("//table[@id='contactinfo']/tr")
for r in rows:
    nameText = r.xpath("td[@class='lablinksName']/div/text() | td[@class='lablinksName']/div/a/text()")
    name = u''.join(nameText).strip()
    urls = r.xpath("td[@class='lablinksName']/div/a/@href")
    url = urls[0] if urls else ''
    phoneText = r.xpath("td[@class='lablinksPhone']/div/text()")
    phone = u''.join(phoneText).strip()
    print name, url, phone
For the purpose of this code, I assume <table id="contactinfo">{your table rows}</table>.

The quick and dirty way to fix it:
Replace
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):
with
for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?', page.replace("<tr>", "\n")):
The issue is that the .*? in .*?</strong> can match strings containing <td class="lablinksMail">, but it cannot match \n. Any time you use . in a regex (rather than something like [^<]), this kind of annoyance tends to happen.
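A minimal, self-contained illustration of the effect, using toy markup in place of the real page:

import re

# Three records on one line; B has no phone number. The lazy .*?</strong>
# in B's optional group runs into C's </strong> and swallows record C.
page = ('<tr><div>A</div><strong>T:</strong> 111 </tr>'
        '<tr><div>B</div></tr>'
        '<tr><div>C</div><strong>T:</strong> 333 </tr>')
pattern = r'<div>(\w+)</div>(?:.*?</strong>([^<]*))?'

print re.findall(pattern, page)
# [('A', ' 111 '), ('B', ' 333 ')]  -- B got C's number, C vanished
print re.findall(pattern, page.replace('<tr>', '\n'))
# [('A', ' 111 '), ('B', ''), ('C', ' 333 ')]  -- "." cannot cross the "\n"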

If you're having this kind of difficulty, it's usually a good sign that you're using the wrong approach. In particular, if I were doing this via regexp, I wouldn't even attempt a match unless the line in question contained the <td class="lablinksPhone"> tag.

Looks like a job for Beautiful Soup.
I love the quote: "You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
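A minimal sketch of that route against the markup in the question, using BeautifulSoup 3 and Python 2 to match the era of the other answers (page is the fetched HTML):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(page)
people = []
for cell in soup.findAll('td', {'class': 'lablinksName'}):
    link = cell.find('a')
    url = link['href'] if link else ''
    # findAll(text=True) collects the name whether or not it sits inside an <a>
    name = ''.join(cell.div.findAll(text=True)).strip()
    # the phone lives in a sibling <td>; taking only the direct text children
    # of its <div> skips the <strong>T:</strong> label
    phone_cell = cell.parent.find('td', {'class': 'lablinksPhone'})
    phone = ''.join(phone_cell.div.findAll(text=True, recursive=False)).strip()
    people.append((url, name, phone))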

Related

Python, BeautifulSoup: Only one CSV row returned or keep getting "AttributeError: 'NoneType' object has no attribute 'text'" when parsing HTML table

UPDATE: HedgeHog's answer worked. To overcome the numpy issue, I uninstalled numpy-1.19.4 and installed the previous version numpy-1.19.3.
[Python 3.9.0 and BeautifulSoup 4.9.0.]
I am trying to use the BeautifulSoup library in Python to parse the HTML table found on the Department of Justice's Office of Legal Counsel website, and write the data to a CSV file. The table can be found at https://www.justice.gov/olc/opinions?keys=&items_per_page=40.
The table is deeply nested within 11 <div> elements. The abridged prettified version of the HTML up to the table's location is:
<html>
  <body>
    <section>
      <11 continually nested div elements>
        ...
        <table>
        </table>
        ...
      </divs>
    </section>
  </body>
</html>
The table is a simple three-column table, topped with a header row (which is inside a <thead> element), as shown below:
Date       | Title                                                                    | Headnotes
01/19/2021 | Preemption of State and Local Requirements Under a PREP Act Declaration | The Public Readiness and Emergency Preparedness Act and the COVID-19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID-19 tests and FDA-authorized or FDA-licensed COVID-19 vaccines.
The <tr> elements have one of four different classes:
<tr class="odd views-row-first"> - This only exists on the very first row after the header row.
<tr class="even"> - appears on every even table row
<tr class="odd"> - appears on every odd row after the first row
<tr class="even views-row-last"> - appears on the very last row (the user can choose to see 10, 20, or 40 items per page, which means the last row will always be even)
Within the <tr> elements, naturally, each <td> element corresponds to one of the data types (date, title, headnotes). Regardless of the specific <tr> class, each table row follows the same general format:
<tr class="odd-or-even/first-or-last">
<td class="views-field views-field-field-opinion-post-date active">
<span class="date-display-single" . . . >
01/01/1970
</span>
</td>
<td class="views-field views-field-field-opinion-attachment-file">
<a href="/olc/files/file-number/download">
Title
</a>
</td>
<td class="views-field views-field-field-opinion-overview">
<p>
Headnotes
</p>
<p>
Some headnotes have multiple paragraph elements.
</p>
</td>
</tr>
All of the Python scripts I have used have started with this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")
f = open("olc-op.csv", "w", encoding="utf-8")
headers = "Date, Title, Headnotes \n"
f.write(headers)
My tinkering has primarily been focused on the find_all() argument and the for loop.
The problem I am having is that I am either getting only a single row in my CSV file or the error in the title to this post.
Since all of the <td> elements I want to scrape are within the <tbody> element, I ran tbody through find_all():
results = soup.find_all("tbody")
In the for loop I specified <td> as the element, followed by the class name applied to each data:
for result in results:
    date = result.find("td", class_="views-field views-field-field-opinion-post-date active").text
    title = result.find("td", class_="views-field views-field-field-opinion-attachment-file").text
    headnotes = result.find("td", class_="views-field views-field-field-opinion-overview").text
    data = date + "," + title + "," + headnotes
    f.write(data)
The output of the above code in the CSV file is:
Date,Title,Headnotes
01/19/2021 ,
Preemption of State and Local Requirements Under a PREP Act Declaration ,
The Public Readiness and Emergency Preparedness Act and the COVID-19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID-19 tests and FDA-authorized or FDA-licensed COVID-19 vaccines.
Yes, the data is technically separated by a comma, but not in the way I intended. There is also some unneeded whitespace
after the header row.
I replaced the .text at the end of the .find() statements with .striped_strings, which returned the
following TypeError:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
To try and overcome this error, I changed f.write(data) to f.write(str(data)) in the for loop, and received
the same TypeError.
I did some further research, and changed the end of each variable in the for loop from .striped_strings to
.get_text(strip=True). I also changed my f.write() statement to
f.write(date + "," + title + "," + headnotes)
These changes yielded one perfectly scraped table row, in addition to the header row:
Date, Title, Headnotes
01/19/2021,Preemption of State and Local Requirements Under a PREP Act Declaration,The Public Readiness and Emergency Preparedness Act and the COVID-19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID-19 tests and FDA-authorized or FDA-licensed COVID-19 vaccines.
But I obviously wanted to loop over the entire table and get all of the table rows.
The second to last thing I tried was to possibly get more specific in the find_all() statement. I changed it from tbody to
tr with no class specified, so it would (I thought) return all of the <tr> elements, which I could then parse
for the specific <td> element. Instead, I got this error:
AttributeError: 'NoneType' object has no attribute 'get_text'
The final change I made was to change .get_text(strip=True) back to .text, which resulted in the error in the
title of this post:
AttributeError: 'NoneType' object has no attribute 'text'
Where have I gone wrong?
An alternative is to use pandas.
Always ask yourself: is there an easier way to reach my goal?
There is. You can simply use pandas to do it in two lines; in your case it does all of these things for you:
Requesting the URL
Searching for the table and scraping its contents
Pushing the results to a CSV
I will also go through your question and answer it below.
Example
import pandas as pd
pd.read_html('https://www.justice.gov/olc/opinions?keys=&items_per_page=40')[0].to_csv('olc-op.csv', index=False)
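(pd.read_html returns a list of DataFrames, one per <table> it finds on the page, so [0] selects the first, and here only, table.)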
But to answer your question
Since you put real effort into asking it, I will go a few bonus miles and tell you what happened.
There are two major points that prevented you from reaching your goal.
Selecting the right things
The reason there is only one line in your CSV is that you did this:
soup.find_all("tbody")
So your loop only runs once, because there is only one tbody. You figured out the structure and talked about the <tr> elements, but you did not select them for looping.
Writing your lines
Even if you had fixed the above, you would still have found only one line in the CSV, because the \n was missing from your data string.
Hope that helps you understand what went wrong; you can also use it in case pandas won't work, e.g. because of dynamically served content.
Example
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")

with open("olc-op.csv", "w", encoding="utf-8") as f:
    f.write("Date, Title, Headnotes\n")
    # iterate over the table rows, not the single <tbody>
    for result in soup.select("tbody tr"):
        tds = result.find_all("td")
        date = tds[0].get_text(strip=True)
        title = tds[1].get_text(strip=True)
        headnotes = tds[2].get_text(strip=True)
        f.write(date + "," + title + "," + headnotes + "\n")
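One caveat the thread doesn't mention: the headnotes themselves contain commas, so joining fields with "," produces misaligned columns. A sketch of the same loop using the standard csv module, which quotes such fields automatically:

import csv
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")

with open("olc-op.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)  # handles quoting of fields that contain commas
    writer.writerow(["Date", "Title", "Headnotes"])
    for row in soup.select("tbody tr"):
        writer.writerow(td.get_text(strip=True) for td in row.find_all("td"))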

About extracting data from a website (Python)

I made a program to extract information from a website. It works like below:
for row in table.findAll('td'):
    topas = row.find('p')
    pastoo = row.find('ul')
    if topas:
        continue
    elif pastoo:
        continue
    else:
        input = row.get_text()
        input.strip()
        file.write(input)
        file.write("~")  # adding separator
It works perfectly when the .html file is well-formatted, like this:
<table class="responsiveTable">
<tbody>
<tr><td>Country:</td><td>Belgium</td></tr>
<tr><td>Year:</td><td>various years</td></tr>
</tbody>
</table>
However, in some .html files, things are quite messy which look like this:
<table class="responsiveTable">
<tbody><tr><td>Country:</td><td>Indonesia</td></tr>
**<tr><td>Year:</td><td>2017 (Jan 27th)
</td></tr>**
</tbody></table>
As you can see, the fourth row of the code contains an unnecessary line break. I tried to use .strip() to remove it, but it didn't work. Is there a robust function which can remove the line break? Thank you!
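For what it's worth, str.strip() returns a new string rather than modifying the original, so its result has to be reassigned, and it only trims the ends. A minimal sketch (same variables as above) using split()/join() to also collapse internal line breaks:

input = row.get_text()
# strip()/split() do not modify the string in place; reassign the result.
# " ".join(input.split()) collapses newlines and runs of whitespace.
input = " ".join(input.split())
file.write(input)
file.write("~")  # adding separator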

Duplicates when extracting data from html table using lxml.html.xpath()

I am trying to extract data from this table at ESPN Cricinfo.
Each row has the following format (data replaced by headers):
<tr class="data1">
<td class="left" nowrap="nowrap"><a>Player Name</a> (Country)</td>
<td>Score</td>
<td>Minutes Played</td>
<td nowrap="nowrap">Balls Faced</td>
<td etc...
</tr>
I have used the following code in a python script to capture the values in the table:
bats = content.xpath('//tr[@class="data1"]/td[1]/a')
cntry = content.xpath('//tr[@class="data1"]/td[1]/*')
run = content.xpath('//tr[@class="data1"]/td[2]')
mins = content.xpath('//tr[@class="data1"]/td[3]')
bf = content.xpath('//tr[@class="data1"]/td[4]')
The data is then put into a csv file for storage.
All of the data is successfully being captured apart from the country of the player. The player name and country are stored inside the same <td> tag; however, the player name is also inside an <a> tag, allowing it to be captured easily. My problem is that the value captured for the players country (the cntry variable above) is the players name. I am sure that the code is incorrect but I am not sure why.
Where you have:
cntry = content.xpath('//tr[@class="data1"]/td[1]/*')
The * selects only child elements and passes over any text nodes, which is why it returns the <a> tag (the player's name) rather than the " (Country)" text.
You can replace your line of code with this to grab the text instead of the tags:
cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')
See if that works for you.
EDIT
To remove the whitespace at the beginning of each item, just do the following:
cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')
cntry = [str(x).strip() for x in cntry]

Python Mechanize login form, sending input to a field with a randomly generated name

I'm trying to automate the login to a site, http://www.tthfanfic.org/login.php.
The problem I am having is that the password field has a name that is randomly generated. I have tried using its label, type and id, all of which remain static, but to no avail.
Here is the HTML of the form:
<tr>
<th><label for="urealname">User Name</label></th>
<td><input type='text' id='urealname' name='urealname' value=''/> NOTE: Your user name may not be the same as your pen name.</td>
</tr>
<tr>
<th><label for="password">Password</label></th><td><input type='password' id='password' name='e008565a17664e26ac8c0e13af71a6d2'/></td>
</tr>
<tr>
<th>Remember Me</th><td><input type='checkbox' id='remember' name='remember'/>
<label for="remember">Log me in automatically for two weeks on this computer using a cookie. </label> Do not select this option if this is a public computer, or you have an evil sibling.</td>
</tr>
<tr>
<td colspan='2' style="text-align:center">
<input type='submit' value='Login' name='loginsubmit'/>
</td>
</tr>
I've tried to format that for readability but it still looks bad; consider checking the code on the supplied page.
Here is the code I get when printing the form through mechanize:
<POST http://www.tthfanfic.org/login.php application/x-www-form-urlencoded
<HiddenControl(ctkn=a40e5ff08d51a874d0d7b59173bf3d483142d2dde56889d35dd6914de92f2f2a) (readonly)>
<TextControl(urealname=)>
<PasswordControl(986f996e16074151964c247608da4aa6=)>
<CheckboxControl(remember=[on])>
<SubmitControl(loginsubmit=Login) (readonly)>>
The hex sequence in the PasswordControl is the part that changes each time I reload the page. In the HTML from the site the field seems to have several other attributes, but none of them work when I try to select them, that or I'm doing it incorrectly.
Here is the code I am using to try and select the control by label:
fieldTwo = br.form.find_control(label='password')
br[fieldOne] = identifier
br[fieldTwo] = password
I can post the rest of my login code if necessary, but this is the only part that is not working; I have had success with other sites where the password name remains the same.
So, is it possible for me to select the PasswordControl using its label, type or ID, or do I need to scrape its name?
EDIT: Oops, forgot to add the error message:
raise ControlNotFoundError("no control matching "+description)
mechanize._form.ControlNotFoundError: no control matching label 'password'
SOLVED:
Solution given by a guy on reddit, thanks Bliti.
Working code:
# Select the form you want to use.
br.select_form(nr=2)
names = []
for f in br.form.controls:
    # Add the name of each control in br.form.controls.
    names.append(f.name)
# Select the correct one from the list.
fieldTwo = names[2]
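For completeness, mechanize's find_control also accepts a type argument, so the password control can be looked up directly without scraping its randomly generated name. A minimal sketch (br and password as in the question); the earlier find_control(label='password') attempt may have failed simply because label matching is tested against the visible text 'Password', capital P:

br.select_form(nr=2)
# the type attribute stays constant even though the name changes per reload
pwd = br.form.find_control(type='password')
pwd.value = password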

Beautiful Soup: Get the Contents of Sub-Nodes

I have the following Python code:
import urllib2
from BeautifulSoup import BeautifulSoup

def scrapeSite(urlToCheck):
    html = urllib2.urlopen(urlToCheck).read()
    soup = BeautifulSoup(html)
    tdtags = soup.findAll('td', {"class": "c"})
    for t in tdtags:
        print t.encode('latin1')
This prints the following HTML:
<td class="c">
FOO
</td>
<td class="c">
BAR
</td>
I'd like to get the text inside the <a> node (e.g. FOO or BAR), which I thought would be t.contents.contents. Unfortunately it doesn't work that easily :)
Does anyone have an idea how to solve that?
Thanks a lot, any help is appreciated!
Cheers,
Joseph
In this case, you can use t.contents[1].contents[0] to get FOO and BAR.
The thing is that contents returns a list with all elements (Tags and NavigableStrings); if you print contents, you can see it's something like
[u'\n', <a href="more.asp">FOO</a>, u'\n']
So, to get to the actual tag you need to access contents[1] (if you have the exact same contents; this can vary depending on the source HTML). After you've found the proper index, you can use contents[0] on that tag to get the string inside the a tag.
Now, as this depends on the exact contents of the HTML source, it's very fragile. A more generic and robust solution is to use find() again to find the 'a' tag, via t.find('a'), and then use its contents list to get the value: t.find('a').contents[0], or just t.find('a').contents to get the whole list.
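A minimal sketch of that more robust route, against the snippet above:

tdtags = soup.findAll('td', {'class': 'c'})
for t in tdtags:
    link = t.find('a')
    if link:                    # skip cells without an anchor
        print link.contents[0]  # prints FOO, then BAR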
For your specific example, pyparsing's makeHTMLTags can be useful, since the expressions it builds are tolerant of many variations in HTML tags while providing a handy structure to the results:
html = """
<td class="c">
FOO
</td>
<td class="c">
BAR
</td>
<td class="d">
BAZZ
</td>
"""
from pyparsing import makeHTMLTags, withAttribute, SkipTo

td, tdEnd = makeHTMLTags("td")
a, aEnd = makeHTMLTags("a")
td.setParseAction(withAttribute(**{"class": "c"}))
pattern = td + a("anchor") + SkipTo(aEnd)("aBody") + aEnd + tdEnd

for t, _, _ in pattern.scanString(html):
    print t.aBody, '->', t.anchor.href
prints:
FOO -> more.asp
BAR -> alotmore.asp
