Treat '^M' as regular characters in Python

I have a file that contains text to generate LaTeX mathematical expressions, one per line. This file should contain exactly 103,559 lines. But some lines contain the character sequence '^M' (CTRL-v CTRL-m) either at the end or interspersed within the lines, possibly multiple times. As a result, when I try to read the lines from the file using Python, the number of lines returned is greater than expected (actually returns 104,654 lines).
How do I tell Python to not generate a newline on each occurrence of the sequence '^M'? Thank you.

Use the newline argument to open().
Nearly a duplicate of Don't convert newline when reading a file, from where I got this solution:
import sys

with open(sys.argv[1], 'r', newline='\n') as fh:
    for i, line in enumerate(fh):
        print(i, line)
(Be aware that, when printing as in this example, the ^M ('\r') character will put the current point at the start of a line, overwriting existing characters.)
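To see the effect of the newline argument, here is a minimal sketch. The sample bytes and the temporary file are made up for illustration; they are not the asker's data. With the default setting, '\r', '\n' and '\r\n' all end a line, while newline='\n' leaves every '\r' inside the line text.

```python
import os
import tempfile

# Hypothetical sample: a bare '\r' inside the first record, '\r\n' ending the second.
raw = b"x^2\ry\nz\r\n"

# Write in binary mode so no newline translation happens on the way out.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default (newline=None): '\r', '\n' and '\r\n' each terminate a line.
    with open(path, "r") as fh:
        default_lines = fh.readlines()

    # newline='\n': only '\n' terminates a line; '\r' is kept as ordinary text.
    with open(path, "r", newline="\n") as fh:
        strict_lines = fh.readlines()
finally:
    os.remove(path)

print(len(default_lines), default_lines)
print(len(strict_lines), strict_lines)
```

If the stray '\r' characters are not wanted in the data at all, they can be removed afterwards with line.replace('\r', '').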

Related

Output gets printed but it has an extra empty line

I'm trying this simple code to print a txt file with an if condition.
The code works fine, but the printed output has an extra empty line after each line. How do I fix that?
with open('test.txt') as file:
    for line in file:
        if 'Efficient AP Image Upgrade ..... Enabled' in line:
            break
        print(line)
Each line read from the file already ends with a newline character. To keep the print function from adding another newline (its default behaviour), call print(line, end='') so that no extra newline is written.
You are probably using print on strings that already have a final newline -- in fact now that the question has been tidied up we can see that this is the case because you are using an iterator over file, and this will produce a sequence of lines that end with newline characters (except possibly the last line if it does not have a newline in the input file).
Note that after the data items, the print function will write an additional newline by default (more specifically, it will write the value specified by the end parameter, which defaults to a newline).
Possible approaches:
Use sys.stdout.write, which does not append a newline:
import sys
sys.stdout.write(text)
Use print but set it to write empty string instead of newline at the end:
print(text, end='')
Remove any newlines before printing (in principle this may include newlines in the middle of the string but because your strings come from an iterator over file object, there shouldn't be any):
print(text.replace('\n', ''))
Remove any leading or trailing whitespace (including newlines) before printing - note that this may include other spaces:
print(text.strip())
By default, print() appends a newline to whatever it prints. To avoid that, try using
print(line, end='')
or
print(line, end=' ')
To remove the trailing new line you can strip new lines from the right side of the line:
print(text.rstrip('\n'))
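The difference between these options can be checked with a small sketch. The render helper and the two sample lines are hypothetical; io.StringIO stands in for the opened file, and redirect_stdout captures what print writes so the results can be compared.

```python
import io
from contextlib import redirect_stdout


def render(lines, **print_kwargs):
    """Print each line with the given print() options and return the captured output."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        for line in lines:
            print(line, **print_kwargs)
    return buffer.getvalue()


SAMPLE = "first\nsecond\n"  # hypothetical file contents

# Plain print(line): the line's own '\n' plus print's '\n' gives blank lines.
doubled = render(io.StringIO(SAMPLE))

# print(line, end=''): only the line's own '\n' is written.
single = render(io.StringIO(SAMPLE), end="")

# Strip the trailing '\n' first and let print add it back.
stripped = render(line.rstrip("\n") for line in io.StringIO(SAMPLE))

print(repr(doubled))
print(repr(single))
print(repr(stripped))
```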

How can I format a txt file in python so that extra paragraph lines are removed as well as extra blank spaces?

I'm trying to format a file similar to this: (random.txt)
Hi, im trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
This is how it should look below: (randomoutput.txt)
Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.
def removespaces(input, output):
    ivar = open(input, 'r')
    ovar = open(output, 'w')
    n = ivar.read()
    ovar.write(' '.join(n.split()))
    ivar.close()
    ovar.close()
Edit:
I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:
m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))
You should process the input line by line. Not only will this make your program simpler, it will also be easier on the system's memory.
The logic for normalizing horizontal white space in a line stays the same (split words and join with a single space).
What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag whether the previous line was empty too. You simply throw away the empty lines but if you encounter a non-empty line and the flag is set, print a single empty line before it.
with open('input.txt', 'r') as istr:
    new_par = False
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par:
            print()  # print a single blank line
        print(' '.join(line.split()))
        new_par = False
If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
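The extra flag could look like the following sketch. The function name normalize_paragraphs, the seen_text flag and the sample lines are all made up for illustration; the generator yields lines instead of printing them so it can be reused for writing to a file.

```python
def normalize_paragraphs(lines):
    """Yield lines with whitespace runs collapsed and blank-line runs reduced to one."""
    seen_text = False  # becomes True after the first non-blank line
    new_par = False    # becomes True when at least one blank line was skipped
    for line in lines:
        words = line.split()
        if not words:  # blank (or whitespace-only) line
            new_par = True
            continue
        if new_par and seen_text:
            yield ""   # a single blank line between paragraphs
        yield " ".join(words)
        seen_text = True
        new_par = False


# Hypothetical input: leading blank line, doubled blank lines, trailing blank line.
sample = ["\n", "Hi,   there\n", "\n", "\n", "second   paragraph\n", "\n"]
result = list(normalize_paragraphs(sample))
print("\n".join(result))
```

Because blank lines are only emitted just before a later non-blank line, trailing blank lines at the end of the input are dropped as well.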
If you want something fancier, have a look at the textwrap module, but be aware that it has (or at least used to have, as far as I can tell) some bad worst-case performance issues.
The trick here is that you want to turn any sequence of 2 or more \n into exactly 2 \n characters. This is hard to write with just split and join—but it's dead simple to write with re.sub:
n = re.sub(r'\n\n+', r'\n\n', n)
If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.
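You probably also want to change your space-stripping code so that it doesn't screw up newlines, since split() with no argument also splits on them. Splitting with split(' ') leaves the newlines alone, but it produces an empty string for each extra space, so filter those out before joining. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)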
Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
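As a rough sketch of the regex route described above, the whole clean-up can be done with three substitutions. The tidy function and the sample string are hypothetical, and this version treats lines containing only spaces as blank; it is one possible arrangement, not the only one.

```python
import re


def tidy(text):
    """Collapse space/tab runs to one space and blank-line runs to one blank line."""
    # Runs of spaces or tabs become a single space; newlines are untouched.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single space that may be left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Two or more consecutive newlines become exactly two.
    return re.sub(r"\n\n+", "\n\n", text)


# Note: joining the result of split(' ') gives the input back unchanged,
# because every extra space shows up as an empty string in the list.
roundtrip = " ".join("a   b".split(" "))

sample = "Hi,   there \n   \n\n\nsecond    paragraph\n"
cleaned = tidy(sample)
print(repr(cleaned))
```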
split without an argument will cut your string at each occurrence of whitespace (space, tab, newline, ...).
Write
n.split(" ")
and it will only split at spaces.
Instead of writing the output to a file, put it into a new variable and repeat the step, this time with
m.split("\n")
First, let's see what exactly the problem is...
You cannot have more than 1 consecutive space or more than 2 consecutive newlines.
You know how to handle runs of spaces.
That approach won't work for runs of newlines, as there are 3 possible situations:
- 1 newline
- 2 newlines
- more than 2 newlines
Great, so how do you solve this then?
There are many solutions. I'll list 3 of them.
Regex based.
This problem is very easy to solve iff [1] you know how to use regex...
So, here's the code:
s = re.sub(r'\n{2,}', r'\n\n', in_file.read())
If you have memory constraints, this is not the best way, as we read the entire file into memory.
While loop based.
This code is really self-explanatory, but I wrote this line anyway...
s = in_file.read()
while "\n\n\n" in s:
    s = s.replace("\n\n\n", "\n\n")
Again, if you have memory constraints, note that we still read the entire file into memory.
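The two whole-file variants can be checked against each other with a small sketch. The function names and the sample string are made up for illustration; both functions work on a string that has already been read into memory.

```python
import re


def collapse_regex(s):
    """Replace every run of two or more newlines with exactly two."""
    return re.sub(r"\n{2,}", "\n\n", s)


def collapse_loop(s):
    """Shrink runs of newlines step by step until no triple newline is left."""
    while "\n\n\n" in s:
        s = s.replace("\n\n\n", "\n\n")
    return s


# Hypothetical input with one, two and five consecutive newlines.
sample = "a\nb\n\nc\n\n\n\n\nd\n"
by_regex = collapse_regex(sample)
by_loop = collapse_loop(sample)
print(repr(by_regex))
print(by_regex == by_loop)
```

The regex version makes a single pass, while the loop may rescan the string several times for long runs of newlines.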
State based.
Another way to approach this problem is line-by-line. By keeping track whether the last line we encountered was blank, we can decide what to do.
was_last_line_blank = False
for line in in_file:
    # Drop the trailing newline so that empty lines compare as blank.
    # Use line.strip() instead if lines with only spaces count as blank.
    line = line.rstrip("\n")
    if not line:
        was_last_line_blank = True
        continue
    if was_last_line_blank:
        # Write a single blank line before the new paragraph
        out_file.write("\n")
    # Write the contents of `line`, restoring its newline
    out_file.write(line + "\n")
    was_last_line_blank = False
Now, 2 of them need you to load the entire file into memory, while the other one is somewhat more complicated. My point is: all of these work, but since there is a small difference in how they work, what they need from the system varies...
[1] The "iff" is intentional.
Basically, you want to keep only the lines that are non-empty (for an empty line, line.strip() returns an empty string, which is False in a boolean context). You can do this with a list or generator comprehension over the result of str.splitlines(), using an if clause to filter out the empty lines.
Then for each line you want to ensure, that all words are separated by single space - for this you can use ' '.join() on result of str.split().
So this should do the job for you:
compressed = '\n'.join(
    ' '.join(line.split()) for line in txt.splitlines()
    if line.strip()
)
or you can use filter and map with a helper function, which may be more readable:
def squash_line(line):
    return ' '.join(line.split())
non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
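Here is a short usage sketch of this approach on made-up input; the compress wrapper and the sample text are hypothetical. Note that, as written, it drops every blank line, so paragraphs end up separated by a single line break rather than by one blank line.

```python
def squash_line(line):
    """Collapse all runs of whitespace inside one line to single spaces."""
    return " ".join(line.split())


def compress(txt):
    """Squash each non-empty line and join the results with single newlines."""
    return "\n".join(squash_line(line) for line in txt.splitlines() if line.strip())


# Hypothetical input: extra spaces, two blank lines between the paragraphs.
txt = "Hi,   there\n\n\n  second    paragraph  \n"
compressed = compress(txt)
print(repr(compressed))
```

If a blank line between paragraphs should survive, one of the blank-line collapsing approaches from the other answers would be needed instead of the filter.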
To fix the paragraph issue:
import re
with open("data.txt") as f:
    data = f.read()
# Collapse runs of two or more newlines into exactly one blank line
result = re.sub(r"\n{2,}", "\n\n", data)
print(result)

Why doesn't .rstrip('\n') work?

Let's say doc.txt contains
a
b
c
d
and that my code is
f = open('doc.txt')
doc = f.read()
doc = doc.rstrip('\n')
print(doc)
Why do I get the same values?
str.rstrip() removes the trailing newline, not all the newlines in the middle. You have one long string, after all.
Use str.splitlines() to split your document into lines without newlines; you can rejoin it if you want to:
doclines = doc.splitlines()
doc_rejoined = ''.join(doclines)
but now doc_rejoined will have all lines running together without a delimiter.
Because you read the whole document into one string that looks like:
'a\nb\nc\nd\n'
When you do a rstrip('\n') on that string, only the rightmost \n will be removed, leaving all the other untouched, so the string would look like:
'a\nb\nc\nd'
The solution would be to split the file into lines and then right strip every line. Or just replace all the newline characters with nothing: s.replace('\n', ''), which gives you 'abcd'.
rstrip strips the given characters (whitespace by default) from the end of the whole string only. If you were expecting it to work on individual lines, you'd need to split the string into lines first using something like doc.split('\n').
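The options mentioned in these answers can be compared side by side. This is only a sketch on the four-line document from the question, held in a string rather than read from doc.txt.

```python
# The contents of doc.txt from the question, as one string.
doc = "a\nb\nc\nd\n"

# rstrip on the whole string removes only the trailing newline.
whole = doc.rstrip("\n")

# splitlines() gives the lines without their newlines.
lines = doc.splitlines()

# split('\n') also works, but leaves an empty string after the final newline.
pieces = doc.split("\n")

# replace() removes every newline and runs the lines together.
joined = doc.replace("\n", "")

print(repr(whole))
print(lines)
print(pieces)
print(repr(joined))
```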
Try this instead:
with open('doc.txt') as f:
    for line in f:
        print(line, end='')
Explanation:
- The recommended way to open a file is using with, which takes care of closing the file at the end.
- You can iterate over each line in the file using for line in f.
- There's no need to call rstrip() now, because we're reading and printing one line at a time.
Consider using replace and replacing each instance of '\n' with ''. This would get rid of all the new line characters in the input text.

Eliminating extra commas

I am having trouble replacing three commas with one comma in a text file of data.
I am processing a large text file to put it into comma delimited format so I can query it using a database.
I do the following at the command prompt and it works:
>>> import re
>>> line = 'one,,,two'
>>> line=re.sub(',+',',',line)
>>> print line
one,two
>>>
Below is my actual code:
with open("dmis8.txt", "r") as ifp:
    with open("dmis7.txt", "w") as ofp:
        for line in ifp:
            #join lines by removing a line ending.
            line=re.sub('(?m)(MM/ANGDEC)[\r\n]+$','',line)
            #various replacements of text with nothing. This removes the text
            line=re.sub('IDENTIFIER','',line)
            line=re.sub('PART','50-1437',line)
            line=re.sub('Eval','',line)
            line=re.sub('Feat','',line)
            line=re.sub('=','',line)
            #line=re.sub('r"++++"','',line)
            line=re.sub('r"----|"',' ',line)
            line=re.sub('Nom','',line)
            line=re.sub('Act',' ',line)
            line=re.sub('Dev','',line)
            line=re.sub('LwTol','',line)
            line=re.sub('UpTol','',line)
            line=re.sub(':','',line)
            line=re.sub('(?m)(Trend)[\r\n]*$',' ',line)
            #Remove spaces replace with semicolon
            line=re.sub('[ \v\t\f]+', ',', line)
            #no worky line=re.sub(r",,,",',',line)
            line=re.sub(',+',',',line)
            #line=line.replace(",+", ",")
            #line=line.replace(",,,", ",")
            ofp.write(line)
This is what I get from the code above (output shown below).
There are several commas together. I don't understand why they aren't replaced by a single comma.
Never mind that I don't see how the extra commas got there in the first place.
50-1437,d
2012/05/01
00/08/27
232_PD_1_DIA,PED_HL1_CR,,,12.482,12.478,-0.004,-0.021,0.020,----|++++
232_PD_2_DIA_TOP,PED_HL2_TOP,,12.482,12.483,0.001,-0.021,0.020,----|++++
232_PD_2_DIA,PED_HL2_CR,,12.482,12.477,-0.005,-0.021,0.020,----|++++
232_PD_2_DIA_BOT,PED_HL2_BOT,,12.482,12.470,-0.012,-0.021,0.020,--|--++++
raw data for reference:
PART IDENTIFIER : d
2012/05/01
00/08/27
232_PD_1_DIA Eval Feat = PED_HL1_CR MM/ANGDEC
Nom Act Dev LwTol UpTol Trend
12.482 12.478 -0.004 -0.021 0.020 ----|++++
232_PD_2_DIA_TOP Eval Feat = PED_HL2_TOP MM/ANGDEC
12.482 12.483 0.001 -0.021 0.020 ----|++++
232_PD_2_DIA Eval Feat = PED_HL2_CR MM/ANGDEC
12.482 12.477 -0.005 -0.021 0.020 ----|++++
Can someone kindly point out what I am doing wrong?
Thanks in advance.
Your regex is working fine. The problem is that you concatenate the lines (by write()ing them) after you scrub them with your regex.
Instead, use "".join() on all of your lines, run re.sub() on the whole thing, and then write() it all to the file at once.
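A minimal sketch of why the order matters, using made-up scrubbed lines that imitate the situation in the question (one line that lost its line ending, followed by a line that is empty except for a comma). The list contents are an assumption for illustration, not the asker's real intermediate data.

```python
import re

# Hypothetical lines after the earlier replacements; only the last one
# still has its line ending.
scrubbed = ["232_PD_1_DIA,PED_HL1_CR,", ",", ",12.482,12.478\n"]

# Collapsing commas per line cannot see commas that only become
# neighbours once the lines are written next to each other.
per_line = "".join(re.sub(",+", ",", line) for line in scrubbed)

# Joining first and collapsing afterwards handles the whole run.
whole = re.sub(",+", ",", "".join(scrubbed))

print(repr(per_line))
print(repr(whole))
```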
I think your problem is caused by the fact that removing line endings does not join lines, combined with the fact that write does not add newlines to the end of each string. So you have multiple input lines that look like a single line in the output.
Looking at the comments, you seem to think that replacing the end of a line with an empty string will magically append the next line to it, but that isn't how it works. The three commas you're seeing are not replaced by your re.sub call because they're not in one line. They come from multiple input lines (which, after all the replacements, are empty except for commas) that end up on a single output line because you stripped their '\n' characters, and write, unlike print, doesn't automatically add '\n' to the end of each written string.
To debug your code, print line after each step of the processing to see what each "line" actually is. That should help you see what's going wrong.
In general, reading file formats where each "record" spans multiple lines requires more complicated methods than just a for line in file loop.

Python reading 2 strings from the same line

How can I read two strings at a time from a txt file when they are written on the same line?
e.g.
francesco 10
# out is your file
out.readline().split() # result is ['francesco', '10']
This assumes that your two strings are separated by whitespace. You can also split on any other string (comma, colon, etc.) by passing it to split().
Why not read just the line and split it up later? You'd have to read byte-by-byte and look for the space character, which is very inefficient. Better to read the entire line, and then split the resulting string on the space, giving you two strings.
'francesco 10'.split()
will give you ['francesco', '10'].
for line in fi:
    line.split()
It's ideal to just iterate over a file object.
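Putting the pieces together, a sketch of reading both strings from every line could look like this. io.StringIO stands in for the opened file, the second line of data is invented, and converting the second field to int is an assumption about what the number is for.

```python
import io

# Stand-in for the opened file; 'maria 7' is hypothetical extra data.
fi = io.StringIO("francesco 10\nmaria 7\n")

pairs = []
for line in fi:
    # split() with no argument splits on any whitespace and drops the newline.
    name, value = line.split()
    pairs.append((name, int(value)))

print(pairs)
```

The two-name unpacking raises ValueError if a line does not contain exactly two fields, which can be a useful check on the input.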
