How to get the line number in Python regex findall method [duplicate] - python

I'm doing a project on statistical machine translation in which I need to extract line numbers from a POS-tagged text file that match a regular expression (any non-separated phrasal verb with the particle 'out'), and write the line numbers to a file (in python).
I have this regular expression: '\w*_VB.?\sout_RP' and my POS-tagged text file: 'Corpus.txt'.
I would like to get an output file with the line numbers that match the above-mentioned regular expression, and the output file should just have one line number per line (no empty lines), e.g.:
2
5
44
So far all I have in my script is the following:
OutputLineNumbers = open('OutputLineNumbers', 'w')
with open('Corpus.txt', 'r') as textfile:
    phrase='\w*_VB.?\sout_RP'
    for phrase in textfile:
OutputLineNumbers.close()
Any idea how to solve this problem?
In advance, thanks for your help!

This should solve your problem, assuming the regex in the variable 'phrase' is correct:
import re
# compile the regex from the question once, outside the loop
phrase = r'\w*_VB.?\sout_RP'
regex = re.compile(phrase)
# open the input and output files
with open('Corpus.txt', 'r') as inputFile:
    with open('OutputLineNumbers', 'w') as outputLineNumbers:
        # loop through each line in the corpus, counting from 1
        for line_i, line in enumerate(inputFile, 1):
            # check whether the regex matches anywhere in the line
            if regex.search(line):
                # if so, write the line number to the output file
                outputLineNumbers.write("%d\n" % line_i)
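If the whole text is already in memory and you want to run the pattern over it in one pass (closer to the findall approach in the title), one option is to derive the line number from each match's start offset. This is only a sketch: the function name match_line_numbers is made up for illustration, and it assumes lines are separated by '\n'.

```python
import re


def match_line_numbers(text, pattern):
    """Return the 1-based line numbers on which a match of pattern starts."""
    regex = re.compile(pattern)
    numbers = []
    for match in regex.finditer(text):
        # The line number is one more than the count of newlines before the match.
        line_no = text.count("\n", 0, match.start()) + 1
        # Record each line only once, even if it contains several matches.
        if not numbers or numbers[-1] != line_no:
            numbers.append(line_no)
    return numbers
```

Counting newlines from the start of the text for every match is simple but repeats work; for a large corpus the line-by-line loop above is likely the better choice.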

You can do this directly from the shell if your regular expression is grep-friendly. Use the "-n" option to show the line numbers.
For example:
grep -n "[1-9][0-9]" tags.txt
prints each matching line prefixed with its line number and a colon:
2569:vote2012
2570:30
2574:118
2576:7248
2578:2293
2580:9594
2582:577

Related

I am able to read the txt file line by line, but I am not sure how to search for and replace a particular string with X

I am currently trying to develop a Python script to sanitize a configuration file. My objective is to read the text file line by line, which I can do using the following code:
fh = open('test.txt')
for line in fh:
    print(line)
fh.close()
The output is as follows:
hostname
198.168.1.1
198.168.1.2
snmp string abck
Now I want to:
Search for the string "hostname" and replace it with X
Search for IPv4 addresses using the regular expression
\b(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(?1)){3}\b and replace them with X.X\1 (replacing only the first two octets with X)
Replace anything after "snmp string" with X
So the final file output I am looking for is:
X
x.x.1.1
x.x.1.2
snmp string x
I could not orchestrate everything together. Any help or guidance will be greatly appreciated.
There are lots of approaches to this, but here's one: rather than just printing each line of the file, store each line in a list:
with open("test.txt") as fh:
    contents = []
    for line in fh:
        contents.append(line)
print(contents)
Now you can loop through that list to perform your regex operations. I'm not going to write that code for you, but you can use Python's built-in re module.
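For illustration, the three replacements from the question could be sketched as below. The names OCTET, IPV4, SNMP and sanitize_line are made up for this example. Note that Python's re module does not support the (?1) subroutine syntax used in the question's pattern, so the octet sub-pattern is repeated instead, and the dots are escaped so they only match a literal '.'.

```python
import re

# One IPv4 octet (0-255); repeated because re has no (?1) subroutine calls.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Capture only the last two octets so they can be kept in the replacement.
IPV4 = re.compile(rf"\b{OCTET}\.{OCTET}\.({OCTET}\.{OCTET})\b")
# Capture the "snmp string " prefix and discard the rest of the line.
SNMP = re.compile(r"(snmp string ).*")


def sanitize_line(line):
    """Return line with the hostname, IPv4 prefix and SNMP string masked."""
    line = line.replace("hostname", "X")
    line = IPV4.sub(r"X.X.\1", line)
    line = SNMP.sub(r"\1X", line)
    return line
```

Applied to the sample file this yields X, X.X.1.1, X.X.1.2 and "snmp string X" (with an uppercase X throughout; adjust the replacement strings if lowercase is wanted). You could call it on each element of the contents list, or directly inside the loop that reads the file.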

Return the lines if it matches the pattern in python

How can I print the line from a file if the pattern matches? But there is a caveat. The search should not consider any brackets in the line of the file.
My search pattern is CurrentPrincipalLegalEventAssociation
Here is the content of file
941,agg.list,CurrentPrincipalMailingAddressStreetLine1,CompanyElementDefinition
c755ad,atom.list,CurrentPrincipal[LegalEventAssociation][Type][*],CompanyElementDefinition
8798c3,atom.list,CurrentPrincipal[MailingAddressStreetLine1][*],CompanyElementDefinition
2e43d1,atom.list,CurrentPrincipal[MailingAddressStreetLine2][*],CompanyElementDefinition
b3a13b,atom.list,CurrentPrincipal[MailingContinentName][*],CompanyElementDefinition
When I do the search it should return me the below line
c755ad,atom.list,CurrentPrincipal[LegalEventAssociation][Type][*],CompanyElementDefinition
Here the pattern does not have any brackets however, the line which I want has brackets in it. I am looking for a program which can ignore the brackets while matching the pattern.
I am new to python and all I know is to print a line if it matches a specific word but not of this sort.
Here is what I have tried but that did not work.
for line in file.splitlines():
    if "CurrentPrincipalLegalEventAssociation" in line:
        print line
This should do what you need:
with open(filename) as myfile:
    for row in myfile:
        if 'CurrentPrincipalLegalEventAssociation' in row.replace('[', '').replace(']', ''):
            print(row, end='')
This loops through the lines in the file, checks for your string after removing the brackets, and prints the original line if it finds a match (end='' avoids a second newline, since each row already ends with one).
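The same idea can be wrapped in a small helper so it works on any iterable of lines. This is a sketch under the assumption that only '[' and ']' should be ignored; the name matching_lines is hypothetical.

```python
def matching_lines(lines, needle):
    """Return the lines that contain needle once '[' and ']' are ignored."""
    # A translation table that deletes both bracket characters in one pass.
    drop_brackets = str.maketrans("", "", "[]")
    return [line for line in lines if needle in line.translate(drop_brackets)]
```

The original lines are returned unchanged; the brackets are only removed from the temporary copy used for the comparison.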

Reading Regular Expressions from a text file

I'm currently trying to write a function that takes two inputs:
1 - The URL for a web page
2 - The name of a text file containing some regular expressions
My function should read the text file line by line (each line being a different regex) and then run each regex against the web page source code. However, I've run into trouble doing this.
Example:
Suppose I want the address shown on the Yelp page with URL = http://www.yelp.com/biz/liberty-grill-cork
where the regex is \<address\>\s*([^<]*)\\b\s*<. In Python, I then run:
address = re.search('\<address\>\s*([^<]*)\\b\s*<', web_page_source_code)
The above works. However, if I write the regex in a text file as is and then read it from the file, it doesn't work. So reading the regex from a text file is what causes the problem. How can I fix this?
EDIT: This is how I'm reading the regexes from the text file:
with open("test_file.txt","r") as file:
    for regex in file:
        address = re.search(regex, web_page_source_code)
Just to add, the reason I want to read regexes from a text file is so that my function code can stay the same and I can alter my list of regexes easily. If anyone can suggest any other alternatives that would be great.
The string in your Python code has backslashes and other characters escaped to avoid their special meaning in a Python string literal, not only for the regex itself.
You can easily verify what happens by printing the string you load from the file. If the backslashes are doubled in the printed output, the file content is wrong.
The text you want in the file is:
File
\<address\>\s*([^<]*)\b\s*<
Here's how you can check it
In [1]: a = open('testfile.txt')
In [2]: line = a.readline()
-- this is the line as you'd see it in python code when properly escaped
In [3]: line
Out[3]: '\\<address\\>\\s*([^<]*)\\b\\s*<\n'
-- this is what it actually means (what re will use)
In [4]: print(line)
\<address\>\s*([^<]*)\b\s*<
OK, I managed to get it working. For anyone who wants to read regular expressions from text files, you need to do the following:
Ensure that the regex in the text file is entered in the right format (thanks to MightyPork for pointing that out)
Remove the newline character '\n' at the end of each line read from the file
So overall, your code should look something like:
import re
with open("test_file.txt", "r") as a:
    line = a.readline()
line = line.rstrip('\n')
result = re.search(line, page_source_code)
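To handle a file with several regexes, one per line, the two points above can be combined into a loader like the following sketch. The function name load_patterns is made up here, and skipping blank lines is an assumption about what you would want, not something the question specifies.

```python
import re


def load_patterns(path):
    """Compile one regex per non-empty line of the text file at path."""
    patterns = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            # Drop only the trailing newline; other whitespace may be significant.
            text = raw.rstrip("\n")
            if text:
                patterns.append(re.compile(text))
    return patterns
```

Each compiled pattern can then be applied with pattern.search(web_page_source_code). The file should contain the regex exactly as the re module should see it, that is, with single backslashes.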

Python 3: Searching A Large Text File With REGEX

I wish to search a large text file with regex and have set-up the following code:
import re
regex = input("REGEX: ")
SearchFunction = re.compile(regex)
f = open('data','r', encoding='utf-8')
result = re.search(SearchFunction, f)
print(result.groups())
f.close()
Of course, this doesn't work because the second argument for re.search should be a string or buffer. However, I cannot insert all of my text file into a string as it is too long (meaning that it would take forever). What is the alternative?
Check whether the pattern matches on each line instead. This does not load the entire file into memory, but it cannot find matches that span more than one line:
for line in f:
    result = re.search(SearchFunction, line)
    if result:
        print(result.groups())
You can use a memory-mapped file with the mmap module. Think of it as a file pretending to be a string (or the opposite of a StringIO). You can find an example in the Python Module of the Week article about mmap by Doug Hellmann.

Search for a line that contains some text, replace complete line python

So I am trying to search for a certain string which for example could be:
process.control.timeout=30, but the 30 could be anything. I tried this:
for line in process:
    line = line.replace("process.control.timeout", "process.control.timeout=900")
    outFile.write(line)
But this will bring me back process.control.timeout=900=30. I have a feeling I will need to use regex with a wildcard? But I'm pretty new to python.
Without regex:
for line in process:
    if "process.control.timeout" in line:
        # You need to include a newline if you're replacing the whole line
        line = "process.control.timeout=900\n"
    outFile.write(line)
or
outFile.writelines("process.control.timeout=900\n"
                   if "process.control.timeout" in line else line
                   for line in process)
If the text you're matching is at the beginning of the line, use line.startswith("process.control.timeout") instead of "process.control.timeout" in line.
You are correct, regex is the way to go.
import re
pattern = re.compile(r"process\.control\.timeout=\d+")
for line in process:
    line = pattern.sub("process.control.timeout=900", line)
    outFile.write(line)
This is probably what you want (the = and the digits after it are optional in the match). As you are searching and replacing in a loop, compiling the regex once up front is slightly more efficient than calling re.sub with a pattern string on every iteration.
import re
pattern = re.compile(r'process\.control\.timeout(=\d+)?')
for line in process:
    # sub() returns the new string; assign it and write it out
    line = pattern.sub('process.control.timeout=900', line)
    outFile.write(line)
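If the goal is to replace the complete line whatever value it currently holds, an anchored pattern avoids touching similarly named keys. The generator below is only a sketch: set_timeout is a made-up name, and it assumes the setting starts at the beginning of the line with no surrounding whitespace.

```python
import re


def set_timeout(lines, seconds):
    """Yield lines, replacing any process.control.timeout line entirely."""
    # Match the bare key or key=<anything>, but not e.g. process.control.timeout.extra
    pattern = re.compile(r"^process\.control\.timeout(=.*)?$")
    for line in lines:
        if pattern.match(line.rstrip("\n")):
            # The replacement always ends with a newline.
            yield f"process.control.timeout={seconds}\n"
        else:
            yield line
```

With the file objects from the question this could be used as outFile.writelines(set_timeout(process, 900)).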
