Regex for parsing list items - python

I want to slurp a text file into a list, and then parse each item a bit so I can keep the text I actually want.
I'm currently using:
import re

with open("C:/text.txt", "rU") as input:
    lines = [line.rstrip('\n') for line in input]

for line in lines:
    regex = r"\:\s*\"(.*)\"\s{5}\d?"
    try:
        found = re.search(regex, line).group(1)
    except AttributeError:
        found = 'nah'
    print(found)
But it doesn't work: it always goes to the exception. When applied to a string defined directly in the code, it works. Is there a difference when dealing with list items?
The text file is structured as shown in the attached image.
Thank you in advance!

It is clear from the image you provided that there are 3 whitespace characters between the text and the digits, not the 5 that \s{5} requires.
Without the exact text, it is impossible to say exactly which whitespace symbols they are, but it is clear there is at least one.
So, you need to modify the regex you are using to
r':\s*"(.*)"\s+'
Here, \s+ matches 1 or more whitespace characters.
Note that the \d? at the end of the pattern is not required if you are not interested in the whole match and only need the Group 1 value.
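Applied to the code from the question, a minimal sketch (the file path is the one from the question; the try/except is replaced by a simple guard):
import re

pattern = r':\s*"(.*)"\s+'   # \s+ tolerates any run of whitespace before the digits

with open("C:/text.txt") as input_file:
    for line in input_file:
        match = re.search(pattern, line.rstrip('\n'))
        print(match.group(1) if match else 'nah')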

Regex to search file for exact string

I have looked at the other questions around this, but I can't quite get those answers to work for my situation.
I have a string that I'm searching for inside a file. I want to match the exact string and, if there's a match, do something. I'm trying to formulate a regex for the string "author:". If this string is found, strip it from the line and give me everything to the right of it, removing any whitespace. Any ideas, looking at the code below, on how to achieve this?
import re

metadata_author = re.compile(r'\bauthor:\b')
with open(tempfile, encoding='latin-1') as search:
    for line in search:
        result = metadata_author.search(line)
        if result in line:
            author = result.strip()
            print(author)
I would use a lookbehind (with a negative lookbehind for the possible dot as mentioned in the comment):
import re

metadata_author = re.compile(r'(?<=(?<!\.)\bauthor:).+')
with open(tempfile, encoding='latin-1') as search:
    for line in search:
        result = metadata_author.search(line)
        if result:
            author = result.group().strip()
            print(author)
re.search returns a match object, and not a string, so to get the matching string you have to call result.group().
If you want to remove all whitespace (instead of just trimming), use re.sub(r'\s*', '', result.group()) instead of result.group().strip().
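For example (the author string here is hypothetical), the two behave differently on inner whitespace:
>>> import re
>>> matched = '  Jane  Q.  Author  '
>>> matched.strip()
'Jane  Q.  Author'
>>> re.sub(r'\s*', '', matched)
'JaneQ.Author'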

Need help finding the correct regex pattern for my string pattern

I'm terrible with RegEx patterns, and I'm writing a simple Python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags part into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        # Ignore comments and empty lines...
        if not line.startswith('#') and line.strip():
            pass  # ...
Forgive my likely terrible Python, it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections - with a variable to store the 'content' (for example, "The Beatles"), and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing, but I want the tags to be all lower-case and without whitespace.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?
This is a solution that gets around the problem without using regex, by relying on multiple splits instead.
# This separates the string into the content and the remainder
content, tagStr = line.split('<')
# This splits tagStr into individual tags. [:-1] is used to remove the trailing '>'
tags = tagStr[:-1].split(',')
print(content)
print(tags)
The problem with this is that it leaves a trailing whitespace after the content.
You can remove this with:
content = content[:-1]
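If you do want the single-regex route the question asks about, here is a hedged sketch (the named groups and the lower-casing/stripping of tags are my additions, following the stated requirements):
import re

line = "The Beatles <music,rock,60s,70s>"

match = re.match(r'\s*(?P<content>.*?)\s*<(?P<tags>[^>]*)>', line)
if match:
    content = match.group('content')   # capitalization and inner spacing kept
    tags = [t.strip().lower() for t in match.group('tags').split(',')]
    print(content)   # The Beatles
    print(tags)      # ['music', 'rock', '60s', '70s']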

Editing a text file using python

I have an auto-generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make the program detect such a pattern and change the citekey to the form xxxxx:2009?
It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]{4} matches exactly 4 digits.
(Edit, to incorporate suggestions)
import re

inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()
    # match your citekey pattern here: keep the colon and four digits, drop the trailing letters
    text = re.sub(r"(xxxxx:[0-9]{4})tb", r"\1", text)
    o.write(text)
You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression would work (at least it works in this example code):
import re

s = """
according to some works (newton:2009cb), gravity is not the same as
severity (darwin:1873dc; hampton:1956tr).
"""
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
Explanation: match a colon : followed by four digits [0-9]{4} followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' means you are replacing each whole match with the smaller part of it that is in the first (and only) group of parentheses. The r prefix makes the replacement a raw string, so \1 reaches re.sub as a group reference instead of being interpreted as an escape sequence.
Hope this helps!
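A hedged sketch of the same substitution applied to a whole file, reusing the read/substitute/write pattern from the first answer (the file names here are placeholders):
import re

with open('refs.bib') as f, open('refs_out.bib', 'w') as out:
    text = f.read()
    # keep the colon and the four-digit year, drop the two trailing letters of each citekey
    out.write(re.sub(r'(:[0-9]{4})\w{2}', r'\1', text))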

Replacing only a specific group within a matched expression

I'm parsing text in which I would like to make changes, but only to specific lines.
I have a regular expression pattern that catches the entire line if it's a line of interest, and within the expression I have a remembered group of the thing I would actually like to change.
I would like to be able to change only the specific group within a matched expression, and not replace the entire expression (that would replace the entire line).
For example:
I have a textual file with:
This is a completely silly example.
something something "this should be replaced" bla.
more uninteresting stuff
And I have the regex:
pattern = '.*("[^"]*").*'
Then I catch the second line, but I would like to replace only the "this should be replaced" matched group within the line, not the entire line (so using re.sub(pattern, replacement, string) won't do the job).
Thanks in advance!
What's wrong with
r'"[^"]+"'
The .* before and after the matched expression match the zero-length string (and anything else) too, so you don't need them at all.
re.sub(r'"[^"]+"', 'DEF', 'abc"def"ghi')
# returns 'abcDEFghi'
and your example text will result in:
'This is a completely silly example.\nsomething something DEF bla.\nmore uninteresting stuff'
eumiro's answer is best in this case, but for the sake of completeness, if you really need to perform some more complicated processing of the text before, inside, and after the match, you can simply use multiple groups, like:
'(.*)("[^"]*")(.*)'
(the first group gives you the text before, the third the text after; do what you like with them)
Also, you may prefer to forbid " in the pre-part:
'([^"]*)("[^"]*")(.*)'
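A sketch of how the three-group pattern could be used with re.sub, keeping the surrounding text and replacing only the quoted part (the replacement text here is arbitrary):
import re

line = 'something something "this should be replaced" bla.'
new_line = re.sub(r'([^"]*)("[^"]*")(.*)', r'\1"replacement"\3', line)
print(new_line)   # something something "replacement" bla.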
re.match and re.search return a "match object" (see the Python documentation). Supposing you want to replace group 3 in your RE, pull out its start/end indices and replace the substring directly:
mobj = re.match(pattern, line)
start = mobj.start(3)
end = mobj.end(3)
line = line[:start] + replacement + line[end:]
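For instance, with the single-group pattern from the question, a minimal sketch:
import re

pattern = '.*("[^"]*").*'
line = 'something something "this should be replaced" bla.'
replacement = '"now replaced"'

mobj = re.match(pattern, line)
if mobj:
    line = line[:mobj.start(1)] + replacement + line[mobj.end(1):]
print(line)   # something something "now replaced" bla.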

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regex's cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other no doubt related problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, and all you need is to get the text without the newline at the end of each line and then iterate over it to find a required word, then you can try the following:
data = (line for line in text.split('\n') if line.strip())  # all non-empty lines, without the '\n' at the end
Now you can either search/replace any text you need using list slicing or regex functionality.
Or you can use replace in order to replace all '\n' with whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to just trim the \n from the end of the regex, unless you need it for something.
Is the question mark there to prevent the regex matching more than one line at a time? If so, then you probably want to be using the MULTILINE flag instead of the DOTALL flag. The ^ sign will then match just after a newline or at the beginning of the string, and the $ sign will match just before a newline character or at the end of the string.
eg.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
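A short demonstration of the two substitutions chained together (the sample text here is hypothetical, with the TOTAL lines at the start of their lines, as this approach assumes):
import re

text = 'keep me\nTOTAL: 1 C2\nkeep me too\nTOTAL: 2 C2\n'

text = re.sub(r'^TOTAL:.*$', '', text, flags=re.MULTILINE)   # blank out the TOTAL lines
text = re.sub(r'\n{2,}', '\n', text)                         # collapse the empty lines left behind
print(repr(text))   # 'keep me\nkeep me too\n'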