I need some help on declaring a regex. My inputs are like the following:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>
The required output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100.
and there are many other lines in the txt files
with such tags
I've tried this:
#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
for line in reader:
line2 = line.replace('<[1> ', '')
line = line2.replace('</[1> ', '')
line2 = line.replace('<[1>', '')
line = line2.replace('</[1>', '')
print line
I've also tried this (but it seems like I'm using the wrong regex syntax):
line2 = line.replace('<[*> ', '')
line = line2.replace('</[*> ', '')
line2 = line.replace('<[*>', '')
line = line2.replace('</[*>', '')
I dont want to hard-code the replace from 1 to 99.
This tested snippet should do it:
import re
line = re.sub(r"</?\[\d+>", "", line)
Edit: Here's a commented version explaining how it works:
line = re.sub(r"""
(?x) # Use free-spacing mode.
< # Match a literal '<'
/? # Optionally match a '/'
\[ # Match a literal '['
\d+ # Match one or more digits
> # Match a literal '>'
""", "", line)
Regexes are fun! But I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: "metacharacters" which need to be escaped (i.e. with a backslash placed in front - and the rules are different inside and outside character classes.) There is an excellent online tutorial at: www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
str.replace() does fixed replacements. Use re.sub() instead.
I would go like this (regex explained in comments):
import re
# If you need to use the regex more than once it is suggested to compile it.
pattern = re.compile(r"</{0,}\[\d+>")
# <\/{0,}\[\d+>
#
# Match the character “<” literally «<»
# Match the character “/” literally «\/{0,}»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»
subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>"""
result = pattern.sub("", subject)
print(result)
If you want to learn more about regex I recomend to read Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.
The easiest way
import re
txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>'
out = re.sub("(<[^>]+>)", '', txt)
print out
replace method of string objects does not accept regular expressions but only fixed strings (see documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).
You have to use re module:
import re
newline= re.sub("<\/?\[[0-9]+>", "", line)
don't have to use regular expression (for your sample string)
>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'
>>> for w in s.split(">"):
... if "<" in w:
... print w.split("<")[0]
...
this is a paragraph with
in between
and then there are cases ... where the
number ranges from 1-100
.
and there are many other lines in the txt files
with
such tags
import os, sys, re, glob
pattern = re.compile(r"\<\[\d\>")
replacementStringMatchesPattern = "<[1>"
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
for line in reader:
retline = pattern.sub(replacementStringMatchesPattern, "", line)
sys.stdout.write(retline)
print (retline)
Related
I am fairly new to python. I am trying to use regular expressions to match specific text in a file.
I can extract the data but only one regular expression at a time since the both values are in different lines and I am struggling to put them together. These severa lines repeat all the time in the file.
[06/05/2020 08:30:16]
othertext <000.000.000.000> xx s
example <000.000.000.000> xx s
I managed to print one or the other regular expressions:
[06/05/2020 08:30:16]
or
example <000.000.000.000> xx s
But not combined into something like this:
(timestamp) (text)
[06/05/2020 08:30:16] example <000.000.000.000> xx s
These are the regular expressions
regex = r"^\[\d\d\/\d\d\/\d\d\d\d\s\d\d\:\d\d\:\d\d\]" #Timestamp
regex = r"(^example\s+.*\<000\.000\.000\.000\>\s+.*$)" # line that contain the text
This is the code so far, I have tried a secondary for loop with another condition but seem that only match one of the regular expression at a time.
Any pointers will be greatly appreciated.
import re
filename = input("Enter the file: ")
regex = r"^\[\d\d\/\d\d\/\d\d\d\d\s\d\d\:\d\d\:\d\d\]" #Timestamp
with open (filename, "r") as file:
list = []
for line in file:
for match in re.finditer(regex, line, re.S):
match_text = match.group()
list.append(match_text)
print (match_text)
You can match blocks of text similar to this in one go with a regex of this type:
(^\[\d\d\/\d\d\/\d\d\d\d[ ]+\d\d:\d\d:\d\d\])\s+[\s\S]*?(^example.*)
Demo
All the file's text needs to be 'gulped' to do so however.
The key elements of the regex:
[\s\S]*?
^ idiomatically, this matches all characters in regex
^ zero or more
^ not greedily or the rest of the text will match skipping
the (^example.*) part
I am having no luck getting anything from this regex search.
I have a text file that looks like this:
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
I want to extract the lines that begin with "REF*23*" and ending with the "~"
txtfile = open(i + fileName, "r")
for line in txtfile:
line = line.rstrip()
p = re.findall(r'^REF*23*.+~', line)
print(p)
But this gives me nothing. As much as I'd like to dig deep into regex with python I need a quick solution to this. What i'm eventually wanting is just the digits between the last "*" and the "~" Thanks
You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and ending with the "~":
results = []
with open(i + fileName, "r") as txtfile:
for line in txtfile:
line = line.rstrip()
if line.startswith('REF*23*') and line.endswith('~'):
results.append(line)
print(results)
If you need to get the digit chunks:
results = []
with open(i + fileName, "r") as txtfile:
for line in txtfile:
line = line.rstrip()
if line.startswith('REF*23*') and line.endswith('~'):
results.append(line[7:-1]) # Just grab the slice
See non-regex approach demo.
NOTES
In a regex, * must be escaped to match a literal asterisk
You read line by line, re.findall(r'^REF*23*.+~', line) makes little sense as the re.findall method is used to get multiple matches while you expect one
Your regex is not anchored on the right, you need $ or \Z to match ~ at the end of the line. So, if you want to use a regex, it would look like
m = re.search(r'^REF\*23\*(.*)~$', line):
if m:
results.append(m.group(1)) # To grab just the contents between delimiters
# or
results.append(line) # To get the whole line
See this Python demo
In your case, you search for lines that start and end with fixed text, thus, no need using a regex.
Edit as an answer to the comment
Another text file is a very long unbroken like with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.
You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:
results = []
with open(fileName, "r") as txtfile:
for line in txtfile:
res = re.findall(r'REF\*0F\*(\d+)~', line)
if len(res):
results.extend(res)
print(results)
You can entirely use string functions to get only the digits (though a simple regex might be more easy to understand, really):
raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""
result = [digits[:-1]
for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
for splitted in [line.split("*")]
for digits in [splitted[-1]]]
print(result)
This yields
['526344060']
* is a special character in regex, so you have to escape it as #The Fourth Bird points out. You are using an raw string, which means you don't have to escape chars from Python-language string parsing, but you still have to escape it for the regex engine.
r'^REF\*23\*.+~'
or
'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine
will work. Having to escape things twice leads to the Leaning Toothpick Syndrome. Using a raw-string means you have to escape once, "saving some trees" in this regard.
Additional changes
You might also want to throw parens around .+ to match the group, if you want to match it. Also change the findall to match, unless you expect multiple matches per line.
results = []
with open(i + fileName, "r") as txtfile:
line = line.rstrip()
p = re.match(r'^REF\*23\*(.+)~', line)
if p:
results.append(int(p.group(1)))
Consider using a regex tester such as this one.
I need to extract data in a data file beginning with the letter
"U"
or
"L"
and exclude comment lines beginning with character "/" .
Example:
/data file FLG.dat
UAB-AB LRD1503 / reminder latches
I used a regex pattern in the python program which results in only capturing the comment lines. I'm only getting comment lines but not the identity beginning with character.
You can use ^([UL].+?)(?:/.*|)$. Code:
import re
s = """/data file FLG.dat
UAB-AB LRD1503 / reminder latches
LAB-AB LRD1503 / reminder latches
SAB-AB LRD1503 / reminder latches"""
lines = re.findall(r"^([UL].+?)(?:/.*|)$", s, re.MULTILINE)
If you want to delete spaces at the end of string you can use list comprehension with same regular expression:
lines = [match.group(1).strip() for match in re.finditer(r"^([UL].+)/.*$", s, re.MULTILINE)]
OR you can edit regular expression to not include spaces before slash ^([UL].+?)(?:\s*/.*|)$:
lines = re.findall(r"^([UL].+?)(?:\s*/.*|)$", s, re.MULTILINE)
In case the comments in your data lines are optional here's a regular expression that covers both types, lines with or without a comment.
The regular expression for that is R"^([UL][^/]*)"
(edited, original RE was R"^([UL][^/]*)(/.*)?$")
The first group is the data you want to extract, the 2nd (optional group) would catch the comment if any.
This example code prints only the 2 valid data lines.
import re
lines=["/data file FLG.dat",
"UAB-AB LRD1503 / reminder latches",
"UAB-AC LRD1600",
"MAB-AD LRD1700 / does not start with U or L"
]
datare=re.compile(R"^([UL][^/]*)")
matches = ( match.group(1).strip() for match in ( datare.match(line) for line in lines) if match)
for match in matches:
print(match)
Note how match.group(1).strip() extracts the first group of your RE and strip() removes any trailing spaces in your match
Also note that you can replace lines in this example with a file handle and it would work the same way
If the matches = line looks too complicated, it's an efficient way for writing this:
for line in lines:
match = datare.match(line)
if match:
print(match.group(1).strip())
I need some help on declaring a regex. My inputs are like the following:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>
The required output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100.
and there are many other lines in the txt files
with such tags
I've tried this:
#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
for line in reader:
line2 = line.replace('<[1> ', '')
line = line2.replace('</[1> ', '')
line2 = line.replace('<[1>', '')
line = line2.replace('</[1>', '')
print line
I've also tried this (but it seems like I'm using the wrong regex syntax):
line2 = line.replace('<[*> ', '')
line = line2.replace('</[*> ', '')
line2 = line.replace('<[*>', '')
line = line2.replace('</[*>', '')
I dont want to hard-code the replace from 1 to 99.
This tested snippet should do it:
import re
line = re.sub(r"</?\[\d+>", "", line)
Edit: Here's a commented version explaining how it works:
line = re.sub(r"""
(?x) # Use free-spacing mode.
< # Match a literal '<'
/? # Optionally match a '/'
\[ # Match a literal '['
\d+ # Match one or more digits
> # Match a literal '>'
""", "", line)
Regexes are fun! But I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: "metacharacters" which need to be escaped (i.e. with a backslash placed in front - and the rules are different inside and outside character classes.) There is an excellent online tutorial at: www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
str.replace() does fixed replacements. Use re.sub() instead.
I would go like this (regex explained in comments):
import re
# If you need to use the regex more than once it is suggested to compile it.
pattern = re.compile(r"</{0,}\[\d+>")
# <\/{0,}\[\d+>
#
# Match the character “<” literally «<»
# Match the character “/” literally «\/{0,}»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»
subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>"""
result = pattern.sub("", subject)
print(result)
If you want to learn more about regex I recomend to read Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.
The easiest way
import re
txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>'
out = re.sub("(<[^>]+>)", '', txt)
print out
replace method of string objects does not accept regular expressions but only fixed strings (see documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).
You have to use re module:
import re
newline= re.sub("<\/?\[[0-9]+>", "", line)
don't have to use regular expression (for your sample string)
>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'
>>> for w in s.split(">"):
... if "<" in w:
... print w.split("<")[0]
...
this is a paragraph with
in between
and then there are cases ... where the
number ranges from 1-100
.
and there are many other lines in the txt files
with
such tags
import os, sys, re, glob
pattern = re.compile(r"\<\[\d\>")
replacementStringMatchesPattern = "<[1>"
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
for line in reader:
retline = pattern.sub(replacementStringMatchesPattern, "", line)
sys.stdout.write(retline)
print (retline)
I am trying to read in a file and every time , year is found it prints it out. For example if it finds , 2003 it will print that out, but if it finds ,2003 it will ignore it. I originally used a split and was able to get the year to match up, but when I added the , I realized that it looked at it like two different words so I dont think that would work.
Here is my code:
import string
import re
while True:
filename=raw_input('Enter a file name: ')
if filename == 'exit':
break
try:
file = open(filename, 'r')
text=file.read()
file.close()
except:
print('file does not exist')
else:
p=re.compile('^\,\s(19|20)\d\d$')//this is my regular expression
print(text)
m=p.search(text)
if m:
print(m.groups())
If you want to search the file for the regex rather than match the entire file contents, remove ^ and $ from the regex.
If you want more than one match per file, use finditer or findall instead of search.
Use raw string when specifying the regex: p=re.compile(r',\s(19|20)\d\d')
Example:
for m in re.finditer(r',\s((19|20)\d\d)', text):
print m.group(1)
>>> import re
>>> text = "foo bar, 2003, 2006,1923, derp"
>>> p = re.compile(r',\s((?:19|20)\d\d)')
>>> p.findall(text)
['2003', '2006']
Simplified example. First of all, remove the anchors (^ and $) and use findall instead of search to find all matches. I also used ?: to designate a non-matching group (it won't show up in the results) and made the year a group instead.
If you just add a * to the \s in your regex, I think it should work. This will make it match zero or more whitespace characters, instead of exactly one. If you only want it to match zero or one, add a + instead.