I have this little bit of code, and it's giving me AttributeError: 'NoneType' object has no attribute 'group'.
import sys
import re
#def extract_names(filename):
f = open('name.html', 'r')
text = f.read()
match = re.search(r'<hgroup><h1>(\w+)</h1>', text)
second = re.search(r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', text)
outf = open('details.txt', 'a')
outf.write(match)
outf.close()
My intention is to read an .html file, find the <h1> tag value and the number of employees, and append them to a file, but for some reason I can't seem to get it right.
Your help is greatly appreciated.
You are using a regular expression, but matching XML with such expressions gets too complicated, too fast. Don't do that.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
The latter two handle malformed HTML quite gracefully as well, making decent sense of many a botched website.
ElementTree example:
from xml.etree import ElementTree

tree = ElementTree.parse('filename.html')
for elem in tree.findall('.//h1'):  # './/' searches the whole tree, not just the root's children
    print(ElementTree.tostring(elem, encoding='unicode'))
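If you go the parser route, a minimal BeautifulSoup sketch for the asker's task might look like this (the tag and class names are lifted from the question's regexes; adjust them to the real markup):

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

with open('name.html', 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

h1 = soup.find('h1')                       # the <h1> inside <hgroup>
employees = soup.select_one('li.hover b')  # the <b> holding the employee count

with open('details.txt', 'a') as outf:
    outf.write('%s %s\n' % (h1.get_text(), employees.get_text()))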
Just for the sake of completeness: your error message indicates that your regular expression failed to match, so re.search() returned None...
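A minimal sketch of guarding against that, reusing the question's own pattern and filenames:

import re

with open('name.html', 'r') as f:
    text = f.read()

match = re.search(r'<hgroup><h1>(\w+)</h1>', text)
if match is None:
    # re.search() returns None when nothing matches; calling .group() on
    # None is exactly what raises the AttributeError
    print('no match found')
else:
    with open('details.txt', 'a') as outf:
        outf.write(match.group(1))  # write the captured text, not the match object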
I'm writing a parser for a .txt file. I want to find all the URLs starting with http or https, and only those, from a simple Chrome copy-paste, but I don't know how to use regular expressions together with pandas. If you can help me, I would like to use my program in PyCharm, thanks!
You can use Python's re module and a regular expression to find these, for example using re.findall():
import re

with open('filename.txt', 'r') as file:
    text = file.read()

# re.MULTILINE makes ^ and $ match at the start and end of every line
urls = re.findall(r'^https?:\/\/.*$', text, re.MULTILINE)
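A quick way to sanity-check the pattern without a file (the sample text here is made up):

import re

sample = 'https://example.com\nnot a url\nhttp://foo.bar/baz'
print(re.findall(r'^https?:\/\/.*$', sample, re.MULTILINE))
# -> ['https://example.com', 'http://foo.bar/baz']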
There are websites where you can experiment with your regular expressions until you get them right.
I wrote a little script that uses the Collins website for translation. Here's my code:
import urllib.request
import re

def translate(search):
    base_url = 'http://www.collinsdictionary.com/dictionary/american/'
    url = base_url + search
    p = urllib.request.urlopen(url).read()
    f = open('t.txt', 'w+b')
    f.write(p)
    f.close()
    f = open('t.txt', 'r')
    t = f.read()
    m = re.search(r'(<span class="def">)(\w.*)(</span>]*)', t)
    n = m.group(2)
    print(n)
    f.close()
I have some questions:
I can't use re.search on p directly; it raises this error:
TypeError: can't use a string pattern on a bytes-like object
Is there a way to use re.search without saving to a file first?
After saving the file I have to reopen it before I can use re.search, otherwise it raises this error: TypeError: must be str, not bytes. Why does this error happen?
In this program I want to extract the text between <span class="def"> and </span> from the first match, but the pattern I wrote doesn't work well in all cases. For example, translate('three') is fine; the output is "totaling one more than two". But for translate('tree') the output is:
"a treelike bush or shrub ⇒ a rose tree"
Is there a way to correct this pattern, with regular expressions or some other tool?
When you call read() on the response returned by urllib, you get a bytes object, which you need to decode to convert it to a string.
Change
p = urllib.request.urlopen(url).read()
to
p = urllib.request.urlopen(url).read().decode('utf-8')
You should read https://docs.python.org/3/howto/unicode.html to understand why; issues like this come up a lot.
Also, you probably don't want to parse HTML using regex. Some better alternatives for Python are mentioned here.
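For example, a minimal BeautifulSoup sketch (assuming the page still wraps definitions in <span class="def">; the URL is the one from the question):

import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def translate(search):
    url = 'http://www.collinsdictionary.com/dictionary/american/' + search
    html = urllib.request.urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, 'html.parser')
    span = soup.find('span', class_='def')  # first definition on the page
    return span.get_text(strip=True) if span else None

print(translate('tree'))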
I am trying to read the XML behind an SPSS file, and I would like to move from etree to objectify.
How can I convert the function below to return an objectify object? I would like to do this because an objectify XML object would be easier for me (as a newbie) to work with, as it is more Pythonic.
def get_etree(path_file):
    from io import StringIO  # needed for parsing from a string; missing from the original
    from lxml import etree
    with open(path_file, 'r+') as f:
        xml_text = f.read()
    recovering_parser = etree.XMLParser(recover=True)
    xml = etree.parse(StringIO(xml_text), parser=recovering_parser)
    return xml
my failed attempt:
def get_etree(path_file):
    from lxml import etree, objectify
    with open(path_file, 'r+') as f:
        xml_text = objectify.fromstring(xml)
    return xml
but I get this error:
lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI
The first, biggest mistake is to read a file into a string and feed that string to an XML parser.
Python will read the file as whatever your default file encoding is (unless you specify the encoding when you call open()), and that step will very likely break anything other than plain ASCII files.
XML files come in many encodings, you cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.
<?xml version="1.0" encoding="Windows-1252"?>
An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
Luckily lxml makes it very easy:
from lxml import etree, objectify

def get_etree(path_file):
    return etree.parse(path_file, parser=etree.XMLParser(recover=True))

def get_objectify(path_file):
    return objectify.parse(path_file)
and
path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)
print(xml1)  # -> <lxml.etree._ElementTree object at 0x02A7B918>
print(xml2)  # -> <lxml.etree._ElementTree object at 0x02A7B878>
P.S.: Think hard about whether you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway, or would you much rather reject it and display an error message?
I would do the latter. Using a recovering parser may cause nasty run-time errors later.
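If you'd rather reject broken input, a minimal sketch (the exception handling shown is just one way to surface the problem):

from lxml import etree, objectify

def get_objectify_strict(path_file):
    try:
        # the default parser raises on syntactically invalid XML
        return objectify.parse(path_file)
    except etree.XMLSyntaxError as err:
        raise ValueError('refusing to load broken XML: %s' % err)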
I'm working on a project to store various bits of text in XML files, but because people besides me are going to look at them and use them, they have to be properly indented and such. I looked at a question on how to generate XML files using cElementTree, and the answerer says something about adding info on making things pretty if people ask, but there isn't anything there (I guess because no one asked). So basically: is there a way to properly indent and whitespace XML using cElementTree, or should I just throw up my hands and go learn how to use lxml?
You can use minidom to prettify your XML string:
from xml.etree import ElementTree as ET
from xml.dom import minidom

def prettify(elem):
    """Return a pretty-printed XML string for the Element."""
    INDENT = "  "
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=INDENT)

# name of root tag
root = ET.Element("root")
child = ET.SubElement(root, 'child')
child.text = 'This is text of child'

prettified_xmlStr = prettify(root)
output_file = open("Output.xml", "w")
output_file.write(prettified_xmlStr)
output_file.close()
print("Done!")
Answering myself here:
Not with ElementTree. The best option would be to download and install lxml, then simply pass the keyword argument
pretty_print=True
when serializing new XML files.
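For example, a minimal sketch (the element names are just placeholders):

from lxml import etree

root = etree.Element("root")
child = etree.SubElement(root, "child")
child.text = "This is text of child"

# pretty_print inserts the indentation and newlines during serialization
print(etree.tostring(root, pretty_print=True).decode())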
I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.
I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?
data = urllib2.urlopen(url)
print data
Only gives me about 1/3 of the source.
data = urllib2.urlopen(url)
for line in data.readlines():
    print line
This gives me the entire source.
Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.
You are probably looking for Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ It's an open-source web-parsing library for Python. Best of luck!
You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like
data = urllib2.urlopen(url)
print data.read()
should give you the entire webpage.
From there, don't parse HTML with regex (well-worn post to this effect here), but use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
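A minimal sketch combining both steps (urllib2 matches the question's Python 2 code; BeautifulSoup is just one parser option, and the URL is a placeholder):

import urllib2
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

data = urllib2.urlopen('http://example.com').read()  # the whole page as one string
soup = BeautifulSoup(data, 'html.parser')
print soup.title.string  # e.g. pull out the <title> text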
Actually, print data should not give you any HTML content, because data is just a file-like object. From the official documentation https://docs.python.org/2/library/urllib2.html:
This function returns a file-like object
This is what I got:
print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>
readlines() returns a list of the lines of the HTML source, and you can store them in a string like this:
import urllib2

data = urllib2.urlopen(url)
l = []
s = ''
for line in data.readlines():
    l.append(line)
s = ''.join(l)  # the lines keep their trailing '\n', so join on the empty string
You can use either the list l or the string s, according to your needs.
I would also recommend using open-source web-parsing libraries for easy work rather than regex for complete HTML parsing; you'll still need a regex for URL parsing anyway.
If you want to parse the variable afterwards, you might use gazpacho:
from gazpacho import Soup
url = "https://www.example.com"
soup = Soup.get(url)
str(soup)
That way you can perform finds to extract the information you're after!
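For instance, assuming the page has an <h1> (the tag name here is just an illustration):

# mode="first" returns only the first matching node
h1 = soup.find("h1", mode="first")
print(h1.text)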