Negative lookahead after newline? - python

I have a CSV-like text file that has about 1000 lines. Between each record in the file is a long series of dashes. The records generally end with a \n, but sometimes there is an extra \n before the end of the record. Simplified example:
"1x", "1y", "Hi there"
-------------------------------
"2x", "2y", "Hello - I'm lost"
-------------------------------
"3x", "3y", "How ya
doing?"
-------------------------------
I want to replace the extra \n's with spaces, i.e. concatenate the lines between the dashes. I thought I would be able to do this (Python 2.5):
text = open("thefile.txt", "r").read()
better_text = re.sub(r'\n(?!\-)', ' ', text)
but that seems to replace every \n, not just the ones that are not followed by a dash. What am I doing wrong?
I am asking this question in an attempt to improve my own regex skills and understand the mistakes that I made. The end goal is to generate a text file in a format that is usable by a specific VBA for Word macro that generates a styled Word document which will then be digested by a Word-friendly CMS.

This is a good place to use a generator function to skip the lines of ----'s and yield something that the csv module can read.
import csv

def readCleanLines( someFile ):
    for line in someFile:
        # skip separator lines made up entirely of dashes
        if line.strip() == len(line.strip())*'-':
            continue
        yield line

reader = csv.reader( readCleanLines( someFile ) )
for row in reader:
    print row
This should handle the line breaks inside quotes seamlessly and silently.
If you want to do other things with this file, for example, save a copy with the ---- lines removed, you can do this.
with open( "source", "r" ) as someFile:
with open( "destination", "w" ) as anotherFile:
for line in readCleanLines( someFile ):
anotherFile.write( line )
That will make a copy with the ---- lines removed. This isn't really worth the effort, since reading and skipping the lines is very, very fast and doesn't require any additional storage.

You need to exclude the line breaks at the end of the separating lines. Try this:
\n(?<!-\n)(?!-)
This regular expression uses a negative look-behind assertion to exclude any \n that is preceded by a -.

re.sub(r'(?<!-)\n(?!-)', ' ', text)
(Hyphen doesn't need escaping outside of a character class.)
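For example, applied to the sample data from the question (a quick sketch; the inline text literal is just for illustration):
import re

text = ('"1x", "1y", "Hi there"\n'
        '-------------------------------\n'
        '"2x", "2y", "Hello - I\'m lost"\n'
        '-------------------------------\n'
        '"3x", "3y", "How ya\n'
        'doing?"\n'
        '-------------------------------\n')

better_text = re.sub(r'(?<!-)\n(?!-)', ' ', text)
# Only the newline inside the third record is replaced, giving
# '"3x", "3y", "How ya doing?"'; the newlines next to the dashed
# separator lines are left alone.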

A RegEx isn't always the best tool for the job. How about running it through something like "Split" or "Tokenize" first? (I'm sure Python has an equivalent.) Then you have your records and can assume newlines are just continuations.
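For instance, a rough sketch of that idea in Python, with no regex at all (the test for a separator line is an assumption based on the sample data above):
records = []
current = []
with open("thefile.txt") as f:
    for line in f:
        stripped = line.strip()
        if stripped and not stripped.strip("-"):
            # a separator line made up entirely of dashes
            if current:
                records.append(" ".join(current))
                current = []
        elif stripped:
            current.append(stripped)
if current:
    records.append(" ".join(current))
Each entry of records is now one logical record with its wrapped lines joined by spaces.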

Problem with finding the correct match with regex

I have some data which I'm trying to process. Basically I want to change all the commas (,) to semicolons (;), but some fields contain text, usernames or passwords that also contain commas. How do I change all the commas except the ones enclosed in quotes (")?
Test data:
Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\Some\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\Some\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\Some\Folder,,
What have I tried?
secrets = """Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\Some\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\Some\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\Some\Folder,,
"""
test = re.findall(r'(.+?\")(.+)(\".+)', secrets)
for line in test:
    part1, part2, part3 = line
    processed = "".join([part1.replace(",", ";"), part2, part3.replace(",", ";")])
    print(processed)
Result:
test1;;username;"pass,word";These are the notes;\Some\Folder;;
test2;;"user1, user2, user3","pass,word","Hello, I'm mr Notes";\Some\Folder;;
It works fine when there's only one occurrence of "..." in the line and no line breaks, but when there are more, or there's a line break within the quotes, it's broken. How can I fix this?
FYI: Notes can contain multiple line breaks.
You don't need a regex here, take advantage of a CSV parser:
import csv, io

inp = csv.reader(io.StringIO(secrets),  # or use a file as input
                 quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)
with open('out.csv', 'w') as out:
    csv.writer(out, delimiter=';').writerows(inp)
output file:
Secret Name;URL;Username;Password;Notes;Folder;TOTP Key;TOTP Backup Codes
test1;;username;pass,word;These are the notes;\Some\Folder;;
test2;;user1, user2, user3;pass,word;Hello, I'm mr Notes;\Some\Folder;;
test3;http://1.2.3.4/ucsm/ucsm.jnlp;"xxxx
(use Drop down, select Hello)";password;Use the following
Server1
Server2;\Some\Folder;;
Optionally, use the quoting=csv.QUOTE_ALL parameter in csv.writer.
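For example, the writer line above would then read:
    csv.writer(out, delimiter=';', quoting=csv.QUOTE_ALL).writerows(inp)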
This should do it, I believe:
import re
print( re.sub(r'("[^"]*")|,', lambda x: x.group(1) if x.group(1) else x.group().replace(",", ";"), secrets))
mozway's solution looks like the best way to resolve this, but interestingly, SM1312's regex works almost perfectly with a much simpler replacement argument for the sub function (i.e. r'\1;'):
import re
print (re.sub(r'("[^"]*")|,', r'\1;', secrets))
The only issue is this introduces an extra semicolon after a quoted entry. This happens because the first alternation member (i.e. ("[^"]*")) does not consume a comma, but the replacement argument adds a semicolon regardless of which alternation member matches. Simply adding a comma to the first alternation member resolves this and works perfectly for the sample data:
import re
print (re.sub(r'("[^"]*"),|,', r'\1;', secrets))
However, it fails if the data includes a quoted entry as the last (i.e. the TOTP Backup Codes) column of the data; any commas in the last quoted entry will be changed to semicolons. This is likely not an acceptable failure mode since it is changing the data set. The following resolves that issue, but introduces a different error that may be tolerable; it adds an extra semicolon at the end of the line:
import re
print (re.sub(r'("[^"]*")(,|(?=\s+))|,', r'\1;', secrets))
This is accomplished by changing the first part of the original alternation member to use alternation itself. That is, the part that was matching the comma after the quoted entry is changed to additionally check for nothing but whitespace (i.e. (,|(?=\s+))), which includes an end of line, after the quoted entry using the following positive lookahead assertion: (?=\s+). The positive lookahead assertion for whitespace is used instead of simply matching whitespace to avoid consuming the whitespace and eliminating it from the resulting output.

Python regex fullmatch doesn't work as expected

I have a text file that contains some sentences. I'm checking whether they are valid sentences based on some rules and writing "valid" or "not valid" to a separate text file. My main problem is that when I use Ctrl+F and enter my regex in the search bar, it matches the strings I want to match, but in code it works wrong. Here is my code:
import re

pattern = re.compile('(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')
text = open('validSentences', "w+")
with open('sentences.txt', encoding='utf8') as file:
    lines = file.readlines()
    for line in lines:
        matches = pattern.fullmatch(line)
        if matches == None:
            text.write("not valid" + "\n")
        else:
            text.write("valid" + "\n")
file.close()
The documentation says that fullmatch only matches the whole string, and that's what I'm trying to do, but this code writes "not valid" for every sentence I have. The text file that I have:
How can you say that to me?
As he looked at his reflection in the mirror, he took a deep breath.
He nodded at himself and, feeling braver, he stepped outside the bathroom. He bumped straight into the
extremely tall man, who was waiting by the door.
David said ‘Oh, sorry!’.
The happy pair discussed their future life 2gether and shared sweet words of admiration.
We will not stop you; I promise!
Come here ASAP!
He pushed his chair back and went to the kitchen at 2 pM.
I do not know...
The main character in the movie said: "Play hard. Work harder."
When I enter my regex in VS Code with Ctrl+F, the whole first, second, fourth, seventh and eighth lines are highlighted, so according to the fullmatch() function they should be reported as "valid", but they aren't. I need help with this issue.
First, remove lines = file.readlines(), as it moves the file handle to the end of the file stream. Then, keep in mind that when iterating over the lines, the line variable has a trailing newline, so either:
- use line = line.rstrip() to remove the trailing whitespace before running the regex, or
- make sure your pattern ends in \n? (an optional newline), or even \s* (zero or more whitespace characters).
So, a possible solution looks like
with open('sentences.txt', encoding='utf8') as file:
    for line in file:
        matches = pattern.fullmatch(line.rstrip('\n'))
        ...
Or,
pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')
#...
with open('sentences.txt', encoding='utf8') as file:
    for line in file:
        ....

Python regex on a binary text file - how to use a range of numbers and a word boundary?

I have a text file that requires me to read it in binary and write out in binary. No problem. I need to mask out Social Security Numbers with Xs, pretty easy normally:
text = re.sub("\\b\d{3}-\d{2}-\d{4}\\b", "XXX-XX-XXXX", text)
This is a sample of the text I'm parsing:
more stuff here
CHILDREN�S 001-02-0003 get rid of that
stuff goes here
not001-02-0003
but ssn:001-02-0003
and I need to turn it into this:
more stuff here
CHILDREN�S XXX-XX-XXXX get rid of that
stuff goes here
not001-02-0003
but ssn:XXX-XX-XXXX
Super! So now I'm trying to write that same regex 'in binary'. Here is what I've got, and it 'works', but gosh, it doesn't feel right at all:
line = re.sub(b"\\B(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\\B", b"\x00X\x00X\x00X\x00-\x00X\x00X\x00-\x00X\x00X\x00X\x00X", line)
Notes:
- that junk in CHILDRENS: I've gotta keep it like that
- I need the word boundary, so the 4th line doesn't get masked out
Shouldn't my regex be a range of numbers instead? I just don't know how to do that in binary. And my word boundaries only work backwards as \B instead of \b, uh.. what is up with that?
UPDATE: I've also tried this:
line = re.sub(b"[\x30-\x39]", b"\x58", line)
and that does it for EVERY number, but if I try to even do something simple like:
line = re.sub(b"[\x30-\x39][\x30-\x39]", b"\x58\x58", line)
it doesn't match anything anymore, any idea why?
You might try:
import re

# Bytes in, bytes out: compile a bytes pattern and use a bytes replacement.
rx = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')
with open("test.txt", "rb") as fr, open("test2.txt", "wb+") as fp:
    repl = rx.sub(b'XXX-XX-XXXX', fr.read())
    fp.write(repl)
This keeps all the junk characters as they are and writes them to test2.txt.
Note that when you don't want to escape every backslash, you can use a raw string like r'...' (or rb'...' for bytes) in Python.
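On the range question itself: in your byte dump every digit is paired with a \x00 byte, which is also why [\x30-\x39][\x30-\x39] never matches anything (two digit bytes are never adjacent). You can collapse each (\x000|\x001|...) group into a single \x00[0-9] unit and repeat it. A rough sketch, assuming the same \x00-before-digit layout as in your pattern:
import re

# Each character in the data is a \x00 byte followed by the ASCII byte,
# so match the SSN as repeated two-byte units instead of listing every digit.
# The leading \B keeps "not001-02-0003" from matching, as in your version.
ssn = re.compile(rb"\B(?:\x00[0-9]){3}\x00-(?:\x00[0-9]){2}\x00-(?:\x00[0-9]){4}")
line = ssn.sub(b"\x00X\x00X\x00X\x00-\x00X\x00X\x00-\x00X\x00X\x00X\x00X", line)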

How can I format a txt file in python so that extra paragraph lines are removed as well as extra blank spaces?

I'm trying to format a file similar to this: (random.txt)
Hi, im trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
This is how it should look below: (randomoutput.txt)
Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.
def removespaces(input, output):
    ivar = open(input, 'r')
    ovar = open(output, 'w')
    n = ivar.read()
    ovar.write(' '.join(n.split()))
    ivar.close()
    ovar.close()
Edit:
I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:
m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))
You should process the input line by line. Not only will this make your program simpler, it will also be easier on the system's memory.
The logic for normalizing horizontal white space in a line stays the same (split into words and join with a single space).
What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag recording whether the previous line was empty too. You simply throw away the empty lines, but if you encounter a non-empty line and the flag is set, print a single empty line before it.
with open('input.txt', 'r') as istr:
    new_par = False
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par:
            print()  # print a single blank line
        print(' '.join(line.split()))
        new_par = False
If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
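A minimal sketch of that extra flag, extending the loop above (seen_content is just an illustrative name):
with open('input.txt', 'r') as istr:
    new_par = False
    seen_content = False  # becomes True at the first non-blank line
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par and seen_content:
            print()  # paragraph break, but never before the first paragraph
        print(' '.join(line.split()))
        new_par = False
        seen_content = True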
If you want to go more fancy, have a look at the textwrap module, but be aware that it has (or at least used to have, as far as I can tell) some bad worst-case performance issues.
The trick here is that you want to turn any sequence of 2 or more \n into exactly 2 \n characters. This is hard to write with just split and join—but it's dead simple to write with re.sub:
n = re.sub(r'\n\n+', r'\n\n', n)
If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.
You probably also want to change your space-stripping code to use split(' ') rather than just split(), so it doesn't screw up newlines. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)
Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
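A rough sketch of that pairwise idea (pairing each line with its predecessor rather than its successor here; with_previous and the file names are just illustrative):
from itertools import chain, tee

def with_previous(lines):
    # Pair each line with the line before it; the first line is paired with ''.
    i1, i2 = tee(lines)
    return zip(chain([''], i1), i2)

with open('input.txt') as ivar, open('output.txt', 'w') as ovar:
    for prev, cur in with_previous(ivar):
        if not cur.strip() and not prev.strip():
            continue  # drop a blank line whose predecessor was also blank
        ovar.write(cur)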
split without an argument will cut your string at each occurrence of whitespace (space, tab, newline, ...).
Write
n.split(" ")
and it will only split at spaces.
Instead of writing the output to a file, put it into a new variable, and repeat the step again, this time with
m.split("\n")
Firstly, let's see what exactly the problem is...
You cannot have more than 1 consecutive space or more than 2 consecutive newlines.
You know how to handle the 1+ spaces.
That approach won't work on the newlines, as there are 3 possible situations:
- 1 newline
- 2 newlines
- more than 2 newlines
Great so.. How do you solve this then?
There are many solutions. I'll list 3 of them.
Regex based.
This problem is very easy to solve iff1 you know how to use regex...
So, here's the code:
s = re.sub(r'\n{2,}', r'\n\n', in_file.read())
If you have memory constraints, this is not the best way, as we read the entire file into memory.
While loop based.
This code is really self-explanatory, but I wrote this line anyway...
s = in_file.read()
while "\n\n\n" in s:
    s = s.replace("\n\n\n", "\n\n")
Again, if you have memory constraints, this is not ideal, as we still read the entire file into memory.
State based.
Another way to approach this problem is line by line. By keeping track of whether the last line we encountered was blank, we can decide what to do.
was_last_line_blank = False
for line in in_file:
    # Strip the trailing newline; use line.strip() instead if you
    # consider lines containing only spaces to be blank
    line = line.rstrip("\n")
    if not line:
        was_last_line_blank = True
        continue
    if was_last_line_blank:
        # Separate paragraphs with a single blank line
        out_file.write("\n")
    # Write contents of `line` to the output file
    out_file.write(line + "\n")
    was_last_line_blank = False
Now, 2 of them need you to load the entire file into memory, while the other one is a bit more involved. My point is: all of these work, but since there is a small difference in how they work, what they need from the system varies...
1 The "iff" is intentional.
Basically, you want to keep the lines that are non-empty (for an empty line, line.strip() returns an empty string, which is False in a boolean context). You can do this using a list/generator comprehension on the result of str.splitlines(), with an if clause to filter out empty lines.
Then, for each line, you want to ensure that all words are separated by a single space; for this you can use ' '.join() on the result of str.split().
So this should do the job for you:
compressed = '\n'.join(
    ' '.join(line.split()) for line in txt.splitlines()
    if line.strip()
)
or you can use filter and map with a helper function to make it perhaps more readable:
def squash_line(line):
    return ' '.join(line.split())

non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
To fix the paragraph issue:
import re
data = open("data.txt").read()
result = re.sub("\n{2,}", "\n\n", data)
print(result)

Eliminating extra commas

I am having trouble replacing three commas with one comma in a text file of data.
I am processing a large text file to put it into comma delimited format so I can query it using a database.
I do the following at the command prompt and it works:
>>> import re
>>> line = 'one,,,two'
>>> line=re.sub(',+',',',line)
>>> print line
one,two
>>>
following below is my actual code:
with open("dmis8.txt", "r") as ifp:
with open("dmis7.txt", "w") as ofp:
for line in ifp:
#join lines by removing a line ending.
line=re.sub('(?m)(MM/ANGDEC)[\r\n]+$','',line)
#various replacements of text with nothing. This removes the text
line=re.sub('IDENTIFIER','',line)
line=re.sub('PART','50-1437',line)
line=re.sub('Eval','',line)
line=re.sub('Feat','',line)
line=re.sub('=','',line)
#line=re.sub('r"++++"','',line)
line=re.sub('r"----|"',' ',line)
line=re.sub('Nom','',line)
line=re.sub('Act',' ',line)
line=re.sub('Dev','',line)
line=re.sub('LwTol','',line)
line=re.sub('UpTol','',line)
line=re.sub(':','',line)
line=re.sub('(?m)(Trend)[\r\n]*$',' ',line)
#Remove spaces replace with semicolon
line=re.sub('[ \v\t\f]+', ',', line)
#no worky line=re.sub(r",,,",',',line)
line=re.sub(',+',',',line)
#line=line.replace(",+", ",")
#line=line.replace(",,,", ",")
ofp.write(line)
This is what I get from the code above:
There are several commas together. I don't understand why they won't get replaced down to one comma.
Never mind that I don't see how the extra commas got there in the first place.
50-1437,d
2012/05/01
00/08/27
232_PD_1_DIA,PED_HL1_CR,,,12.482,12.478,-0.004,-0.021,0.020,----|++++
232_PD_2_DIA_TOP,PED_HL2_TOP,,12.482,12.483,0.001,-0.021,0.020,----|++++
232_PD_2_DIA,PED_HL2_CR,,12.482,12.477,-0.005,-0.021,0.020,----|++++
232_PD_2_DIA_BOT,PED_HL2_BOT,,12.482,12.470,-0.012,-0.021,0.020,--|--++++
raw data for reference:
PART IDENTIFIER : d
2012/05/01
00/08/27
232_PD_1_DIA Eval Feat = PED_HL1_CR MM/ANGDEC
Nom Act Dev LwTol UpTol Trend
12.482 12.478 -0.004 -0.021 0.020 ----|++++
232_PD_2_DIA_TOP Eval Feat = PED_HL2_TOP MM/ANGDEC
12.482 12.483 0.001 -0.021 0.020 ----|++++
232_PD_2_DIA Eval Feat = PED_HL2_CR MM/ANGDEC
12.482 12.477 -0.005 -0.021 0.020 ----|++++
Can someone kindly point out what I am doing wrong?
Thanks in advance...
Your regex is working fine. The problem is that you concatenate the lines (by write()ing them) after you scrub them with your regex.
Instead, use "".join() on all of your lines, run re.sub() on the whole thing, and then write() it all to the file at once.
I think your problem is caused by the fact that removing line endings does not join lines, in combination with the fact that write does not add a newline to the end of each string. So you have multiple input lines that look like a single line in the output.
Looking at the comments, you seem to think that just replacing the end of the line with an empty string will magically append the next line to it, but that doesn't actually work. So the three commas you're seeing are not replaced by your re.sub command because they're not on one line: they are multiple input lines (which, after all the replacements, are empty except for commas) that end up on a single output line because you stripped their '\n' characters, and write doesn't automatically add '\n' to the end of each written string (unlike print).
To debug your code, just put print line after each line of code, to see what each "line" actually is - that should help you see what's going wrong.
In general, reading file formats where each "record" spans multiple lines requires more complicated methods than just a for line in file loop.
