Let's say doc.txt contains
a
b
c
d
and that my code is
f = open('doc.txt')
doc = f.read()
doc = doc.rstrip('\n')
print doc
why do I get the same values?
str.rstrip() removes the trailing newline, not all the newlines in the middle. You have one long string, after all.
Use str.splitlines() to split your document into lines without newlines; you can rejoin it if you want to:
doclines = doc.splitlines()
doc_rejoined = ''.join(doclines)
but now doc_rejoined will have all lines running together without a delimiter.
Because you read the whole document into one string that looks like:
'a\nb\nc\nd\n'
When you do a rstrip('\n') on that string, only the rightmost \n will be removed, leaving all the other untouched, so the string would look like:
'a\nb\nc\nd'
The solution would be to split the file into lines and then right strip every line. Or just replace all the newline characters with nothing: s.replace('\n', ''), which gives you 'abcd'.
rstrip strips trailing spaces from the whole string. If you were expecting it to work on individual lines, you'd need to split the string into lines first using something like doc.split('\n').
Try this instead:
with open('doc.txt') as f:
for line in f:
print line,
Explanation:
The recommended way to open a file is using with, which takes care of closing the file at the end
You can iterate over each line in the file using for line in f
There's no need to call rstrip() now, because we're reading and printing one line at a time
Consider using replace and replacing each instance of '\n' with ''. This would get rid of all the new line characters in the input text.
Related
I'm trying this simple code to print a txt file with a if condition.
Code works fine, but when the output gets printed but it has a extra empty line. how to fix that?
with open('test.txt') as file:
for line in file:
if 'Efficient AP Image Upgrade ..... Enabled' in line:
break
print(line)
The line in line contains a newline character at the end. To avoid the print function to add another newline (the default behaviour), you should call print('line', end='') to specify that you don't want the extra newline.
You are probably using print on strings that already have a final newline -- in fact now that the question has been tidied up we can see that this is the case because you are using an iterator over file, and this will produce a sequence of lines that end with newline characters (except possibly the last line if it does not have a newline in the input file).
Note that after the data items, the print function will write an additional newline by default (more specifically, it will write the value specified by the end parameter, which defaults to a newline).
Possible approaches:
Use sys.stdout.write (does not append newline):
sys.stdout.write(text)
Use print but set it to write empty string instead of newline at the end:
print(text, end='')
Remove any newlines before printing (in principle this may include newlines in the middle of the string but because your strings come from an iterator over file object, there shouldn't be any):
print(text.replace('\n', ''))
Remove any leading or trailing whitespace (including newlines) before printing - note that this may include other spaces:
print(text.strip())
print() by default, creates a new line when you execute that. For your notice, try using
print(line,end='')
or
print(line,end=' ')
To remove the trailing new line you can strip new lines from the right side of the line:
print(text.rstrip('\n'))
I'm want search newline character in string using regex in python.I don't want to include \r or \n in Message.
I have tried regex which is able to detect \r\n correctly. But when i'm removing \r\n from Line variable. still it prints the error.
Line="got less no of bytes than requested\r\n"
if(re.search('\\r|\\n',Line)):
print("Do not use \\r\\n in MSG");
It Should detect \r\n in Line variable which is as a text not the invisible \n.
It should not print when the Line is Like below:
Line="got less no of bytes than requested"
You are looking for the re.sub function.
Try to do this:
Import re
Line="got less no of bytes than requested\r\n"
replaced = re.sub('\n','',Line)
replaced = re.sub('\r','',Line)
print replaced
Instead of checking for newlines, it would probably be better to just remove them. No need to use regex for it, just use strip, it will remove all whitespace and newlines from the ends of the string:
line = 'got less no of bytes than requested\r\n'
line = line.strip()
# line = 'got less no of bytes than requested'
If you want to do it with regex you can use:
import re
line = 'got less no of bytes than requested\r\n'
line = re.sub(r'\n|\r', '', line)
# line = 'got less no of bytes than requested'
If you insist on checking for the newlines, you can do it like this:
if '\n' in line or '\r' in line:
print(r'Do not use \r\n in MSG');
Or the same with regex:
import re
if re.search(r'\n|\r', line):
print(r'Do not use \r\n in MSG');
Also: it's advisable to have your Python variables named with snake_case.
First of all consider to use strip as many guys here mentioned.
Second, if you want to match newline at ANY position in string use search not match
What is the difference between re.search and re.match?Here is more about search vs match
newline_regexp = re.compile("\n|\r")
newline_regexp.search(Line) # will give u search object or None if not found
If you just want to check for line breaks in the message, you can use the string function find(). Note the use of raw text as indicated by the r in front of strings. This removes the need to escape the backslash.
line = r"got less no of bytes than requested\r\n"
print(line)
if line.find(r'\r\n') > 0:
print("Do not use line breaks in MSG");
As others have noted, you are probably looking for line.strip(). But, in case you still want to practice regex, you would use the following code:
Line="got less no of bytes than requested\r\n"
# \r\n located anywhere in the string
prog = re.compile(r'\r\n')
# \r or \n located anywhere in the string
prog = re.compile(r'(\r|\n)')
if prog.search(Line):
print('Do not use \\r\\n in MSG');
I'm trying to format a file similar to this: (random.txt)
Hi, im trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
This is how it should look below: (randomoutput.txt)
Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.
def removespaces(input, output):
ivar = open(input, 'r')
ovar = open(output, 'w')
n = ivar.read()
ovar.write(' '.join(n.split()))
ivar.close()
ovar.close()
Edit:
I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:
m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))
You should process the input line-by line. Not only will this make your program simpler but also more easy on the system's memory.
The logic for normalizing horizontal white space in a line stays the same (split words and join with a single space).
What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag whether the previous line was empty too. You simply throw away the empty lines but if you encounter a non-empty line and the flag is set, print a single empty line before it.
with open('input.txt', 'r') as istr:
new_par = False
for line in istr:
line = line.strip()
if not line: # blank
new_par = True
continue
if new_par:
print() # print a single blank line
print(' '.join(line.split()))
new_par = False
If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
If you want to go more fancy, have a look at the textwrap module but be aware that is has (or, at least, used to have, from what I can say) some bad worst-case performance issues.
The trick here is that you want to turn any sequence of 2 or more \n into exactly 2 \n characters. This is hard to write with just split and join—but it's dead simple to write with re.sub:
n = re.sub(r'\n\n+', r'\n\n', n)
If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.
You probably also want to change your space-stripping code to use split(' ') rather than just split(), so it doesn't screw up newlines. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)
Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
split without Argument will cut your string at each occurence if a whitespace ( space, tab, new line,...).
Write
n.split(" ")
and it will only split at spaces.
Instead of writing the output to a file, put it Ingo a New variable, and repeat the step again, this time with
m.split("\n")
Firstly, let's see, what exactly is the problem...
You cannot have 1+ consecutive spaces or 2+ consecutive newlines.
You know how to handle 1+ spaces.
That approach won't work on 2+ newlines as there are 3 possible situations:
- 1 newline
- 2 newlines
- 2+ newlines
Great so.. How do you solve this then?
There are many solutions. I'll list 3 of them.
Regex based.
This problem is very easy to solve iff1 you know how to use regex...
So, here's the code:
s = re.sub(r'\n{2,}', r'\n\n', in_file.read())
If you have memory constraints, this is not the best way as we read the entire file into the momory.
While loop based.
This code is really self-explainatory, but I wrote this line anyway...
s = in_file.read()
while "\n\n\n" in s:
s = s.replace("\n\n\n", "\n\n")
Again, you have memory constraints, we still read the entire file into the momory.
State based.
Another way to approach this problem is line-by-line. By keeping track whether the last line we encountered was blank, we can decide what to do.
was_last_line_blank = False
for line in in_file:
# Uncomment if you consider lines with only spaces blank
# line = line.strip()
if not line:
was_last_line_blank = True
continue
if not was_last_line_blank:
# Add a new line to output file
out_file.write("\n")
# Write contents of `line` in file
out_file.write(line)
was_last_line_blank = False
Now, 2 of them need you to load the entire file into memory, the other one is fairly more complicated. My point is: All these work but since there is a small difference in ow they work, what they need on the system varies...
1 The "iff" is intentional.
Basically, you want to take lines that are non-empty (so line.strip() returns empty string, which is a False in boolean context). You can do this using list/generator comprehension on result of str.splitlines(), with if clause to filterout empty lines.
Then for each line you want to ensure, that all words are separated by single space - for this you can use ' '.join() on result of str.split().
So this should do the job for you:
compressed = '\n'.join(
' '.join(line.split()) for line in txt.splitlines()
if line.strip()
)
or you can use filter and map with helper function to make it maybe more readable:
def squash_line(line):
return ' '.join(line.split())
non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
To fix the paragraph issue:
import re
data = open("data.txt").read()
result = re.sub("[\n]+", "\n\n", data)
print(result)
I am having trouble replacing three commas with one comma in a text file of data.
I am processing a large text file to put it into comma delimited format so I can query it using a database.
I do the following at the command prompt and it works:
>>> import re
>>> line = 'one,,,two'
>>> line=re.sub(',+',',',line)
>>> print line
one,two
>>>
following below is my actual code:
with open("dmis8.txt", "r") as ifp:
with open("dmis7.txt", "w") as ofp:
for line in ifp:
#join lines by removing a line ending.
line=re.sub('(?m)(MM/ANGDEC)[\r\n]+$','',line)
#various replacements of text with nothing. This removes the text
line=re.sub('IDENTIFIER','',line)
line=re.sub('PART','50-1437',line)
line=re.sub('Eval','',line)
line=re.sub('Feat','',line)
line=re.sub('=','',line)
#line=re.sub('r"++++"','',line)
line=re.sub('r"----|"',' ',line)
line=re.sub('Nom','',line)
line=re.sub('Act',' ',line)
line=re.sub('Dev','',line)
line=re.sub('LwTol','',line)
line=re.sub('UpTol','',line)
line=re.sub(':','',line)
line=re.sub('(?m)(Trend)[\r\n]*$',' ',line)
#Remove spaces replace with semicolon
line=re.sub('[ \v\t\f]+', ',', line)
#no worky line=re.sub(r",,,",',',line)
line=re.sub(',+',',',line)
#line=line.replace(",+", ",")
#line=line.replace(",,,", ",")
ofp.write(line)
This is what i get from the code above:
There are several commas together. I don't understand why they won't get replaced down to one comma.
Never mind that I don't see how the extra commas got there in the first place.
50-1437,d
2012/05/01
00/08/27
232_PD_1_DIA,PED_HL1_CR,,,12.482,12.478,-0.004,-0.021,0.020,----|++++
232_PD_2_DIA_TOP,PED_HL2_TOP,,12.482,12.483,0.001,-0.021,0.020,----|++++
232_PD_2_DIA,PED_HL2_CR,,12.482,12.477,-0.005,-0.021,0.020,----|++++
232_PD_2_DIA_BOT,PED_HL2_BOT,,12.482,12.470,-0.012,-0.021,0.020,--|--++++
raw data for reference:
PART IDENTIFIER : d
2012/05/01
00/08/27
232_PD_1_DIA Eval Feat = PED_HL1_CR MM/ANGDEC
Nom Act Dev LwTol UpTol Trend
12.482 12.478 -0.004 -0.021 0.020 ----|++++
232_PD_2_DIA_TOP Eval Feat = PED_HL2_TOP MM/ANGDEC
12.482 12.483 0.001 -0.021 0.020 ----|++++
232_PD_2_DIA Eval Feat = PED_HL2_CR MM/ANGDEC
12.482 12.477 -0.005 -0.021 0.020 ----|++++
Can someone kindly point what I am doing wrong?
thanks in advance...
Your regex is working fine. The problem is that it you concatenate the lines (by write()ing them) after you scrub them with your regex.
Instead, use "".join() on all of your lines, run re.sub() on the whole thing, and then write() it all to the file at once.
I think your problem is caused by the fact that removing line endings does not join lines, in combination with the fact that write does not add newlines to the end each string. So you have multiple input lines that look like a single line in the output.
Looking at the comments, you seem to think that just replacing the end of the line by an empty string will magically append the next line to it, but that doesn't actually work. So the three commas you're seeing are not replaced by your re.sub command because they're not in one line, they're multiple input lines (which after all the replacements are empty except for commas) which get printed to a single output line because you stripped their '\n' characters, and write doesn't automatically add '\n' to the end of each written string (unlike print).
To debug your code, just put print line after each line of code, to see what each "line" actually is - that should help you see what's going wrong.
In general, reading file formats where each "record" spans multiple lines requires more complicated methods than just a for line in file loop.
Given that "empty line" is a white space:
I am trying to read a text file line by line. I want to ignore whitespace lines. Or in a more correct way, I want to detect empty lines.
An empty line can contain spaces, newline characters, etc. And it is still considered as an empty line. If you open it up in notepad, in an empty line you should not see anything.
Is there a quick way of doing this in Python? I am new to python by the way.
for line in someopenfile:
if line.isspace():
empty_line()
Using strip() on any string returns a string with all leading and trailing whitespace stripped. So calling that on a line with only whitespace gives you an empty string. You can then just filter on strings with non-zero length.
>>> lines=[' ','abc','']
>>> print filter(lambda x:len(x.strip()),lines)
['abc']
>>>