How to use the re module in Python to extract information? - python

I wrote a little script that uses the Collins website for translation. Here's my code:
import urllib.request
import re

def translate(search):
    base_url = 'http://www.collinsdictionary.com/dictionary/american/'
    url = base_url + search
    p = urllib.request.urlopen(url).read()
    f = open('t.txt', 'w+b')
    f.write(p)
    f.close()
    f = open('t.txt', 'r')
    t = f.read()
    m = re.search(r'(<span class="def">)(\w.*)(</span>]*)', t)
    n = m.group(2)
    print(n)
    f.close()
I have some questions:
I can't use re.search on p directly; it raises this error:
TypeError: can't use a string pattern on a bytes-like object
Is there a way to use re.search without saving the page to a file first?
After saving the file I have to reopen it before I can use re.search, otherwise it raises TypeError: must be str, not bytes. Why does this error happen?
In this program I want to extract the information between <span class="def"> and </span> from the first match, but the pattern I wrote does not work well in all cases. For example, translate('three') is fine; the output is "totaling one more than two". But for translate('tree') the output is:
"a treelike bush or shrub ⇒ a rose tree"
Is there a way to correct this pattern, with regular expressions or any other tool?

When you call read on the response returned by urllib, you get a bytes object, which you need to decode to convert it to a string.
Change
p = urllib.request.urlopen(url).read()
to
p = urllib.request.urlopen(url).read().decode('utf-8')
You should read https://docs.python.org/3/howto/unicode.html to understand why, because issues like this come up a lot.
Also, you probably don't want to parse HTML using regex. Some better alternatives for Python are mentioned here.
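A minimal sketch of both fixes, using a hard-coded bytes snippet in place of a live urlopen response: decoding first lets a string pattern match, and a non-greedy .*? stops at the first closing </span> rather than the last one, which would also fix the 'tree' case above.

```python
import re

# Hard-coded stand-in for urllib.request.urlopen(url).read()
p = b'<span class="def">a treelike bush or shrub</span> <span>a rose tree</span>'

# bytes -> str, so string patterns can be used on it
text = p.decode('utf-8')

# Non-greedy .*? stops at the first </span>
m = re.search(r'<span class="def">(.*?)</span>', text)
print(m.group(1))
```

With a greedy `.*` the match would run through to the last `</span>` and pick up the second span's text as well.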

Related

How can I search for a substring that is between two values of a string in Python?

Can someone give me an idea about how to do the following: I have a text file
of a single line, but it contains several pieces of data, all formatted as if it were XML but inside a .txt file.
E.g.:
<Respuesta><ResultEnviado><Resultado><Entrega>00123</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado></ResultEnviado><EntregaItemResulados><EntregaItem><ItemId>123</ItemId><NameItem>MuebleSala</NameItem><ValorItem>180</ValorItem></EntregaItem><EntregaItem><ItemId>124</ItemId><NameItem>MuebleComedor</NameItem>ValorItem>200</ValorItem></EntregaItem><EntregaItem><ItemId>125</ItemId><NameItem>Cama</NameItem>ValorItem>200</ValorItem></EntregaItem><EntregaItem><ItemId>126</ItemId><NameItem>escritorio</NameItem>ValorItem>200</ValorItem></EntregaItem></EntregaItemResulados></Respuesta>
As you can see, it is a file with a .txt extension. I want to extract this part:
<ResultEnviado><Resultado><Entrega>1213255654</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado></ResultEnviado>
I am using python for the exercise.
Thank you very much for your comments or ideas.
Here we can use the regular expression .search() and .match() functions to find everything between the given tags. Note that you need to import the regular expression module with import re.
More info on regular expressions in Python: here
import re

# Open the file and read it
path = "C:/temp/file.txt"
with open(path, "r") as f:
    text = f.read()

# Use a regular expression to find everything between the tags
match = re.search("<ResultEnviado>(.*?)</ResultEnviado>", text)

# Print the text if it matches
if match:
    print(match.group(1))
else:
    print("No match found.")
this prints:
<Resultado><Entrega>00123</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado>
Please let me know if you need any more help.
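If you need every repeated block rather than just the first one, re.findall returns all non-overlapping matches. A small sketch using a shortened version of the sample data:

```python
import re

# Shortened version of the one-line "XML in a txt" data
text = ('<EntregaItem><ItemId>123</ItemId></EntregaItem>'
        '<EntregaItem><ItemId>124</ItemId></EntregaItem>'
        '<EntregaItem><ItemId>125</ItemId></EntregaItem>')

# findall returns the capture group for every non-overlapping match
ids = re.findall(r"<ItemId>(.*?)</ItemId>", text)
print(ids)
```

This would let you walk every EntregaItem in the file, not just the first ResultEnviado block.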

How to extract the file name of a pdf link including numbers using regular expressions

I have the following url string:
"https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
How can I use a regular expression to get the filename of a URL?
I have tried:
text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
text = re.sub("/[^/]*$", '', text)
text
but I am receiving:
'https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022'
The desired output is:
"amtsblatt_05_20220209.pdf"
I am thankful for any advice.
You can go with:
import re

text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
pdf_name = re.findall("/([^/]*)$", text)[0]
print(pdf_name)
(note the capture group, so the leading / is not included in the result)
or simply with:
pdf_name = text.split('/')[-1]
print(pdf_name)
If you specifically want to use regex, then try
re.findall(r"/(\w+\.pdf)", text)[-1]
As an alternative to regular expressions, which may make the intent clearer:
import urllib.parse
import pathlib

text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
filename = pathlib.Path(urllib.parse.urlparse(text).path).name
Or with the additional package urlpath:
import urlpath
filename = urlpath.URL(text).name
As to why your approach did not work:
re.sub("/[^/]*$", '', text)
This does find your desired string, but it then substitutes it with nothing, so it removes exactly what you found. You probably wanted either to find the string
>>> re.search("/[^/]*$", text).group()
'/amtsblatt_05_20220209.pdf'
# Without the leading /
>>> re.search("/([^/]*)$", text).group(1)
'amtsblatt_05_20220209.pdf'
or to discard everything up to and including the last slash
>>> re.sub(r"^.*/", "", text)
'amtsblatt_05_20220209.pdf'

Create new list from old using re.sub() in python 2.7

My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, then work on the remaining text.
I started with this, which works to remove the XML tags, but only from the entire XML file:
from urllib import urlopen
import re
url = [URL of XML FILE HERE] #the url of the file to search
raw = urlopen(url).read() #open the file and read it into variable
exp = re.compile(r'<.*?>')
text_only = exp.sub('',raw).strip()
I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).
But I can't figure out how to apply the regex to the list resulting from the soup.find_all. I've tried to use text_only = [item for item in text2 if exp.sub('',item).strip()] and variations but I keep getting this error: TypeError: expected string or buffer
What am I doing wrong?
You don't want to regex this. Instead just use BeautifulSoup's existing support for grabbing text:
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
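A standard-library sketch of the same idea, using xml.etree instead of BeautifulSoup (this assumes the markup is well-formed XML, which BeautifulSoup does not require):

```python
from xml.etree import ElementTree

# Well-formed stand-in for the XML file the question describes
xml = ('<doc>'
       '<quoted-block>first quote</quoted-block>'
       '<quoted-block>second quote</quoted-block>'
       '</doc>')

root = ElementTree.fromstring(xml)
# iter() walks the whole tree, so nested quoted-block elements are found too
text_chunks = [elem.text for elem in root.iter('quoted-block')]
print(text_chunks)
```

Either way, the parser hands you the element text directly, so there is no tag-stripping regex to apply afterwards.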

Python Regex - Parsing HTML

I have this little code and it's giving me AttributeError: 'NoneType' object has no attribute 'group'.
import sys
import re
#def extract_names(filename):
f = open('name.html', 'r')
text = f.read()
match = re.search (r'<hgroup><h1>(\w+)</h1>', text)
second = re.search (r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', text)
outf = open('details.txt', 'a')
outf.write(match)
outf.close()
My intention is to read a .HTML file looking for the <h1> tag value and the number of employees and append them to a file. But for some reason I can't seem to get it right.
Your help is greatly appreciated.
You are using a regular expression, but matching HTML with such expressions gets too complicated, too fast. Don't do that.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd-party library
lxml is a fast and feature-rich C-based library
The latter two handle malformed HTML quite gracefully as well, making decent sense of many a botched website.
ElementTree example:
from xml.etree import ElementTree

tree = ElementTree.parse('filename.html')
# './/h1' searches the whole tree, not just the root's direct children
for elem in tree.findall('.//h1'):
    print ElementTree.tostring(elem)
Just for the sake of completeness: your error message indicates that your regular expression found no match, so re.search returned None, and calling .group() on None raises the AttributeError.
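A minimal sketch of that failure mode and the usual guard (the HTML snippet here is hypothetical, not the asker's actual file): re.search returns None when nothing matches, so check before calling .group().

```python
import re

# Hypothetical stand-in for the contents of name.html
text = '<hgroup><h1>Acme</h1></hgroup>'

match = re.search(r'<hgroup><h1>(\w+)</h1>', text)
if match is not None:          # .group() on None raises AttributeError
    print(match.group(1))
else:
    print('no match')
```

The same guard applies to the second pattern in the question; either search can come back None if the page layout differs from what the pattern expects.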

How to read an entire web page into a variable

I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.
I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?
data = urllib2.urlopen(url)
print data
Only gives me about 1/3 of the source.
data = urllib2.urlopen(url)
for lines in data.readlines():
    print lines
This gives me the entire source.
Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.
You are probably looking for Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ It's an open-source web parsing library for Python. Best of luck!
You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like
data = urllib2.urlopen(url)
print data.read()
should give you the entire webpage.
From there, don't parse HTML with regex (well-worn post to this effect here), but use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
Actually, print data should not give you any HTML content, because data is just a file-like object. From the official documentation, https://docs.python.org/2/library/urllib2.html:
This function returns a file-like object
This is what I got :
print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>
readlines() returns a list of the lines of the HTML source, and you can store it in a string like this:
import urllib2

data = urllib2.urlopen(url)
l = []
s = ''
for line in data.readlines():
    l.append(line)
s = '\n'.join(l)
You can use either the list l or the string s, according to your need.
I would also recommend using an open-source web parsing library for easier work, rather than regex for full HTML parsing; you may still need regex for URL parsing, though.
If you want to parse the page contents afterwards, you might use gazpacho:
from gazpacho import Soup
url = "https://www.example.com"
soup = Soup.get(url)
str(soup)
That way you can perform finds to extract the information you're after!
