I'm currently trying to write a function that takes two inputs:
1 - The URL for a web page
2 - The name of a text file containing some regular expressions
My function should read the text file line by line (each line being a different regex) and then run each regex against the web page source code. However, I've run into trouble doing this:
Example:
Suppose I want the address contained on a Yelp page with URL = http://www.yelp.com/biz/liberty-grill-cork
where the regex is \<address\>\s*([^<]*)\\b\s*<. In Python, I then run:
address = re.search('\<address\>\s*([^<]*)\\b\s*<', web_page_source_code)
The above works. However, if I write the regex in a text file as is and then read it from that file, it doesn't work. So reading the regex from a text file is what causes the problem. How can I rectify this?
EDIT: This is how I'm reading the regexes from the text file:
with open("test_file.txt","r") as file:
    for regex in file:
        address = re.search(regex, web_page_source_code)
Just to add, the reason I want to read regexes from a text file is so that my function code can stay the same and I can alter my list of regexes easily. If anyone can suggest any other alternatives that would be great.
Your string has some backslashes and other characters escaped to avoid their special meaning in a Python string literal, not only in the regex itself.
You can easily verify what happens by printing the string you load from the file. If the printed backslashes are doubled, the file contents are wrong.
The text you want in the file is:
File
\<address\>\s*([^<]*)\b\s*<
Here's how you can check it
In [1]: a = open('testfile.txt')
In [2]: line = a.readline()
-- this is the line as you'd see it in python code when properly escaped
In [3]: line
Out[3]: '\\<address\\>\\s*([^<]*)\\b\\s*<\n'
-- this is what it actually means (what re will use)
In [4]: print(line)
\<address\>\s*([^<]*)\b\s*<
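The point above can be checked without a file. This is a minimal sketch: the `html` fragment is made up for illustration, and the two variable names are hypothetical. It shows that the doubled backslashes in a normal Python literal and the single backslashes in the text file describe the same characters.

```python
import re

# How the pattern is spelled inside a normal (non-raw) Python string literal:
# every regex backslash is doubled so Python does not interpret it.
in_source = '\\<address\\>\\s*([^<]*)\\b\\s*<'

# The characters that belong in the text file (a raw literal shows them as-is).
in_file = r'\<address\>\s*([^<]*)\b\s*<'

print(in_source == in_file)

# Made-up page fragment, for illustration only.
html = "<address>\n  1 Example Street\n</address>"
match = re.search(in_file, html)
print(match.group(1))
```

If the file instead contained `\\b`, the regex engine would look for a literal backslash followed by `b`, and the search would fail.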
OK, I managed to get it working. For anyone who wants to read regular expressions from text files, you need to do the following:
Ensure that the regex in the text file is entered in the right format (thanks to MightyPork for pointing that out).
You also need to remove the newline character '\n' at the end of the line.
So overall, your code should look something like:
import re

# read the first regex from the file; the with block closes the file for us
with open("test_file.txt", "r") as a:
    line = a.readline()
line = line.strip('\n')
result = re.search(line, page_source_code)
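To apply every regex in the file rather than only the first line, the same two steps can be wrapped in a loop. The sketch below is only one way to do it: `search_all` is a hypothetical helper, `io.StringIO` stands in for the opened `test_file.txt`, the second pattern and the page fragment are made up, and each pattern is assumed to contain one capture group.

```python
import io
import re

def search_all(pattern_lines, page_source):
    """Run one regex per line against page_source; collect group 1 or None."""
    results = []
    for raw in pattern_lines:
        pattern = raw.rstrip("\n")  # drop the newline that ends each line
        if not pattern:
            continue  # skip blank lines
        match = re.search(pattern, page_source)
        results.append(match.group(1) if match else None)
    return results

# Stand-in for open("test_file.txt"); the second pattern is hypothetical.
patterns_text = "\n".join([
    r"\<address\>\s*([^<]*)\b\s*<",
    r"<title>(.*?)</title>",
]) + "\n"
fake_file = io.StringIO(patterns_text)

# Made-up page source, for illustration only.
page = "<title>Demo</title><address>\n  1 Example Street\n</address>"
results = search_all(fake_file, page)
print(results)
```

Because the function only iterates over lines, passing a real file object opened with `open("test_file.txt")` should work the same way, so the list of regexes can change without touching the code.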
Related
Could someone give me an idea of how to do the following? I have a text file
containing a single line that holds several pieces of data, all formatted as if it were XML but stored in a .txt file.
For example:
<Respuesta><ResultEnviado><Resultado><Entrega>00123</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado></ResultEnviado><EntregaItemResulados><EntregaItem><ItemId>123</ItemId><NameItem>MuebleSala</NameItem><ValorItem>180</ValorItem></EntregaItem><EntregaItem><ItemId>124</ItemId><NameItem>MuebleComedor</NameItem>ValorItem>200</ValorItem></EntregaItem><EntregaItem><ItemId>125</ItemId><NameItem>Cama</NameItem>ValorItem>200</ValorItem></EntregaItem><EntregaItem><ItemId>126</ItemId><NameItem>escritorio</NameItem>ValorItem>200</ValorItem></EntregaItem></EntregaItemResulados></Respuesta>
As you can see, it is a file with the .txt extension.
<ResultEnviado><Resultado><Entrega>1213255654</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado></ResultEnviado>
I am using python for the exercise.
Thank you very much for your comments or ideas.
Here we can use the re module's search() function to find everything between a given pair of tags (match() works the same way but only matches at the start of the string). Note that you need to import the module with import re.
More info on regular expressions in Python is available in the documentation for the re module.
import re

# open the file and read it
path = "C:/temp/file.txt"
with open(path, "r") as f:
    text = f.read()

# use a regular expression to find everything between the tags
match = re.search("<ResultEnviado>(.*?)</ResultEnviado>", text)

# print the text if it matches
if match:
    print(match.group(1))
else:
    print("No match found.")
This prints:
<Resultado><Entrega>00123</Entrega><Refer>MueblesHiroshima</Refer><Item>34</Item><Valor>780</Valor></Resultado>
Please let me know if you need any more help.
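If the individual items are needed as well, re.findall() returns every match instead of only the first. The following is a rough sketch, assuming a shortened copy of the sample line held in a string (only the ItemId and NameItem tags are used, and the variable names are hypothetical):

```python
import re

# Shortened copy of the sample data, kept in a string so the example runs as-is.
text = (
    "<Respuesta><EntregaItemResulados>"
    "<EntregaItem><ItemId>123</ItemId><NameItem>MuebleSala</NameItem></EntregaItem>"
    "<EntregaItem><ItemId>124</ItemId><NameItem>MuebleComedor</NameItem></EntregaItem>"
    "</EntregaItemResulados></Respuesta>"
)

# With two capture groups, findall returns a list of (id, name) tuples.
items = re.findall(r"<ItemId>(.*?)</ItemId><NameItem>(.*?)</NameItem>", text)

for item_id, name in items:
    print(item_id, name)
```

The non-greedy `(.*?)` keeps each match inside its own pair of tags, which matters because the whole file is a single line.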
I am new to Python and stuck on an issue that is probably easy for a Python expert. I am trying to read a text file in Python, but I am not getting the desired output using an f-string.
print(f'{lines[0]} {lines[2]}')
I am getting the output on two lines, although I didn't use \n:
Hello
I am testing!
Expected output:
Hello I am testing!
That's because every line you read from a file ends with a newline character. So even if you print only one line, you'll get a blank line after the text. You can solve it by using the strip method:
print(f'{lines[0].strip()} {lines[2].strip()}')
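The text in your file is on separate lines, so each string you read ends with a newline character \n. You just have to strip that character. Do the stripping before the f-string, because before Python 3.12 an f-string expression cannot contain a backslash or reuse the enclosing quote character:
first, third = lines[0].strip('\n'), lines[2].strip('\n')
print(f'{first} {third}')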
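Both suggestions can be tried without the original file. In this sketch the file contents are an assumption (the question does not show the middle line), and `io.StringIO` stands in for the opened file. It also shows `splitlines()`, which drops the line endings up front.

```python
import io

# Assumed file contents; the middle line is a made-up placeholder.
content = "Hello\nskipped line\nI am testing!\n"

# readlines() keeps the trailing "\n" on each line.
lines = io.StringIO(content).readlines()
joined = f"{lines[0].strip()} {lines[2].strip()}"

# splitlines() removes the line endings, so no strip is needed afterwards.
clean = content.splitlines()
also_joined = f"{clean[0]} {clean[2]}"

print(joined)
```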
I'm going through a course at work for Python. We're using PyCharm, and I'm not sure if that's what the problem is.
Basically, I have to read in a text file, scrub it, then count the frequency of specific words. The counting is not an issue. (I looped through a scrubbed list, checked the scrubbed list for the specific words, then added the specific words to a dictionary as I looped through the list. It works fine).
My issue is really about scrubbing the data. I ended up doing successive scrubs to get to a final clean list. But from what I read in the documentation, I should be able to use regex or re and scrub my file with one line of code. No matter what I do, whether I import re or regex, I get errors that stop my code.
How can I write the below code pythonically?
# Open the file in read mode
with open('chocolate.txt', 'r') as file:
    input_col = file.read().replace(',', '')
    text3 = input_col.replace('.', '')
    text2 = text3.replace('"', '')
    text = text2.split()
You could try using a regular expression that looks something like this:
import re
# a character class removes double quotes, commas, and literal periods
result = re.sub(r'[",.]', "", text)
print(result)
Here text is the string you would read from the text file.
Hope this helps!
x = re.sub(r'("|\.|,)', "", text)  # the dot must be escaped as \. to match a literal period
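Putting the scrub and the count together might look like the sketch below. The sample sentence is made up so the example runs without chocolate.txt, and `collections.Counter` is swapped in for the hand-built dictionary described in the question.

```python
import re
from collections import Counter

# Made-up text standing in for the contents of chocolate.txt.
sample = 'Dark chocolate, milk chocolate. "White" chocolate, too.'

# One substitution removes commas, periods, and double quotes; split() makes the word list.
words = re.sub(r'[",.]', "", sample).split()

# Count case-insensitively; Counter returns 0 for words that never appear.
counts = Counter(word.lower() for word in words)
print(counts["chocolate"])
```

Inside a character class the period needs no escaping, which avoids the pitfall of an unescaped `.` matching every character.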
I am currently trying to develop a Python script to sanitize a configuration file. My objective is to read the text file line by line, which I can do with the following code:
fh = open('test.txt')
for line in fh:
    print(line)
fh.close()
The output is as follows:
hostname
198.168.1.1
198.168.1.2
snmp string abck
Now I want to:
Search for the string matching "hostname" and replace it with X
Search for the IPv4 addresses using the regular expression
\b(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(?1)){3}\b and replace them with X.X\1 (replacing only the first two octets with X)
Replace anything after "snmp string" with X
So the final output I am looking for is:
X
x.x.1.1
x.x.1.2
snmp string x
I have not been able to put all of this together. Any help or guidance will be greatly appreciated.
There are lots of approaches to this, but here's one: rather than just printing each line of the file, store each line in a list:
with open("test.txt") as fh:
    contents = []
    for line in fh:
        contents.append(line)
print(contents)
Now you can loop through that list to perform your regex operations. I'm not going to write that code for you, but you can use Python's built-in re module.
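For readers who want a starting point, here is one possible sketch of those regex operations. It is not the asker's or the answerer's code: `sanitize_line` and the pattern names are hypothetical, and because Python's re module has no `(?1)` subroutine call, the octet pattern is repeated through a variable instead.

```python
import re

# One octet, 0-255. Python's re lacks (?1) recursion, so the text is reused.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Group 1 captures the last two octets so they can be kept in the replacement.
IPV4 = re.compile(rf"\b{OCTET}\.{OCTET}\.({OCTET}\.{OCTET})\b")
SNMP = re.compile(r"^(snmp string)\s+.*$")
HOSTNAME = re.compile(r"\bhostname\b")

def sanitize_line(line):
    """Return one sanitized line without its trailing newline."""
    line = line.rstrip("\n")
    line = HOSTNAME.sub("X", line)
    line = IPV4.sub(r"x.x.\1", line)
    line = SNMP.sub(r"\1 x", line)
    return line

# The lines shown in the question, as they would come out of the file.
contents = ["hostname\n", "198.168.1.1\n", "198.168.1.2\n", "snmp string abck\n"]
cleaned = [sanitize_line(line) for line in contents]
print("\n".join(cleaned))
```

Keeping each rule as its own compiled pattern makes it straightforward to add further rules later without restructuring the loop.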
I have a .gz file that contains several strings. I need to perform several regex-based operations on the data contained in the .gz file.
I get the following error when I use re.findall() on the lines of data extracted:
File "C:\Users\santoshn\AppData\Local\Continuum\anaconda3\lib\re.py", line 182, in search
return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
I have tried opening the file with the "r" option, with the same result.
Do I have to decompress this file first and then do the regex operations, or is there a way to address this?
Data contains several text lines, an example line is listed below:
ThreadContext 432 mov (8) <8;1,2>r2 <8;3,3>r4 Instruction count
I was able to fix this issue by reading the file using gzip.open():
with gzip.open(file,"rb") as f:
    binFile = f.readlines()
After the file is read, each line is decoded from bytes to a str using the 'ascii' codec. After that, regex operations like re.search() and re.findall() work fine.
for line in binFile: # go over each line
    line = line.strip().decode('ascii')
I know this is an old question, but I stumbled on it (as well as the other HTML references in the comments) when trying to sort out this same issue. Rather than opening the gzip file as binary ("rb") and then decoding it to ASCII, the gzip docs led me to simply open the GZ file as text ("rt"), which allows normal string manipulation afterwards:
with gzip.open(filepath,"rt") as f:
    data = f.readlines()
for line in data:
    split_string = date_time_pattern.split(line)
    # Whatever other string manipulation you may need.
The date_time_pattern variable is simply my compiled regex for different log date formats.
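The two approaches can be compared side by side in a self-contained sketch. The temporary file, the `register` pattern, and the variable names are assumptions made for illustration; the sample line is the one from the question.

```python
import gzip
import os
import re
import tempfile

# Build a tiny .gz file so the example does not depend on existing data.
sample = "ThreadContext 432 mov (8) <8;1,2>r2 <8;3,3>r4 Instruction count\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt", encoding="ascii") as f:
    f.write(sample)

register = re.compile(r"r\d+")  # hypothetical pattern: names like r2 or r4

# Text mode ("rt"): each line is a str, so a str pattern works directly.
with gzip.open(path, "rt", encoding="ascii") as f:
    text_hits = [register.findall(line) for line in f]

# Binary mode ("rb"): each line is bytes, so decode before using a str pattern.
with gzip.open(path, "rb") as f:
    bytes_hits = [register.findall(line.decode("ascii")) for line in f]

print(text_hits)
```

Both modes give the same matches here. Text mode simply moves the decoding into gzip.open(), which is why the TypeError from the question does not appear.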