How do I write only certain lines to a file in Python?

I have a file that looks like this (it has to go in a code box so it resembles the file):
text
(starts with parentheses)
	tabbed info
text
(starts with parentheses)
	tabbed info
...repeat
I want to grab only the "text" lines from the file (or every fourth line) and copy them to another file. This is the code I have, but it copies everything to the new file:
import sys

def process_file(filename):
    output_file = open("data.txt", 'w')
    input_file = open(filename, "r")
    for line in input_file:
        line = line.strip()
        if not line.startswith("(") or line.startswith(""):
            output_file.write(line)
    output_file.close()

if __name__ == "__main__":
    process_file(sys.argv[1])

The reason your script is copying every line is that line.startswith("") is True, no matter what line equals.
You might try using isspace to test if line begins with a space:
def process_file(filename):
    with open("data.txt", 'w') as output_file:
        with open(filename, "r") as input_file:
            for line in input_file:
                line = line.rstrip()
                if not line.startswith("(") and not line[:1].isspace():
                    output_file.write(line + "\n")  # rstrip removed the newline, so add it back

with open('data.txt', 'w') as of:
    of.write(''.join(textline
                     for textline in open(filename)
                     if textline[0] not in ' \t('))
To write every fourth line, slice the result with [::4]:
with open('data.txt', 'w') as of:
    of.write(''.join([textline
                      for textline in open(filename)
                      if textline[0] not in ' \t('][::4]))
I don't need to rstrip the newlines, since write uses them as they are.

In addition to line.startswith("") always being true, line.strip() will remove the leading tab, forcing the tabbed data to be written as well. Change it to line.rstrip() and use \t to test for a tab. That part of your code should look like:
line = line.rstrip()
if not line.startswith(('(', '\t')):
    #....
In response to your question in the comments:
# edited in response to comments in post
for i, line in enumerate(input_file):
    if i % 4 == 0:
        output_file.write(line)

Try:
if not line.startswith("(") and not line.startswith("\t"):
without doing line.strip() (which would strip the tabs).
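A quick REPL sketch of why strip() gets in the way here (the literals are made up for illustration):
>>> '\tsome tabbed info\n'.strip()
'some tabbed info'
>>> '\tsome tabbed info\n'.rstrip()
'\tsome tabbed info'
strip() removes whitespace from both ends, so the leading tab is gone before any startswith("\t") test sees it; rstrip() only touches the right end.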

So the issue is that (1) you are misusing boolean logic, and (2) every possible line starts with "".
First, the boolean logic:
The way the or operator works is that it returns True if either of its operands is True. The operands here are not line.startswith('(') and line.startswith(''). Note that the not applies to only one of the operands. If you want to apply it to the whole result of the or expression, you have to put the expression in parentheses.
The second issue is your use of the startswith() method with a zero-length string as an argument. This essentially says "match any string where the first zero characters are nothing", which matches any string you could give it.
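A short REPL sketch of both points (made-up values):
>>> "anything at all".startswith("")  # a zero-length prefix matches everything
True
>>> line = "(starts with parentheses)"
>>> not line.startswith("(") or line.startswith("")  # not binds to the first test only
True
>>> not (line.startswith("(") or line.startswith("\t"))  # parenthesized version
False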
See other answers for what you should be doing here.


Parsing Logs with Regular Expressions Python

I'm new to coding and Python, so keep it lightweight :)
I've gotta iterate through some logfiles and pick out the ones that say ERROR. Boom, done, got that. What I've gotta do now is figure out how to grab the following 10 lines containing the details of the error. It's gotta be some combo of an if statement and a for/while loop, I presume. Any help would be appreciated.
import os
import re

# Regex used to match
line_regex = re.compile(r"ERROR")

# Output file, where the matched loglines will be copied to
output_filename = os.path.normpath("NodeOut.log")

# Overwrites the file, ensuring we start out with a blank file
# TODO: append this later
with open(output_filename, "w") as out_file:
    out_file.write("")

# Open output file in 'append' mode
with open(output_filename, "a") as out_file:
    # Open input file in 'read' mode
    with open("MXNode1.stdout", "r") as in_file:
        # Loop over each log line
        for line in in_file:
            # If log line matches our regex, print (remove later), and write to file
            if line_regex.search(line):
                # for i in range():
                print(line)
                out_file.write(line)
There is no need for regex to do this, you can just use the in operator ("ERROR" in line).
Also, to clear the content of the file without opening it in w mode, you can simply place the cursor at the beginning of the file and truncate.
import os

output_filename = os.path.normpath("NodeOut.log")

with open(output_filename, 'a') as out_file:
    out_file.seek(0, 0)
    out_file.truncate(0)
    with open("MXNode1.stdout", 'r') as in_file:
        line = in_file.readline()
        while line:
            if "ERROR" in line:
                out_file.write(line)
                for i in range(10):
                    out_file.write(in_file.readline())
            line = in_file.readline()
We use a while loop to read lines one by one using in_file.readline(). The advantage is that you can then easily read the next lines with a for loop.
See the doc:
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline.
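A tiny sketch of that contract (demo.txt is a made-up file whose last line has no trailing newline):
with open("demo.txt", "w") as f:
    f.write("first\n\nlast")

with open("demo.txt") as f:
    print(repr(f.readline()))  # 'first\n'
    print(repr(f.readline()))  # '\n'   -> a blank line, not EOF
    print(repr(f.readline()))  # 'last' -> last line, no trailing newline
    print(repr(f.readline()))  # ''     -> empty string signals end of file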
Assuming you always want to grab the next 10 lines, you could do something similar to:
with open("MXNode1.stdout", "r") as in_file:
# Loop over each log line
lineCount = 11
for line in in_file:
# If log line matches our regex, print remove later, and write > file
if (line_regex.search(line)):
# for i in range():
print(line)
lineCount = 0
if (lineCount < 11):
lineCount += 1
out_file.write(line)
The second if statement ensures you always grab the line the ERROR was found on. The magic number 11 is so that you grab the next 10 lines after that initial line.
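As an alternative sketch (not from the answers above): itertools.islice can pull the next 10 lines straight off the file iterator after a match. The file names and the regex are reused from the question:

import itertools
import re

line_regex = re.compile(r"ERROR")

with open("NodeOut.log", "w") as out_file, open("MXNode1.stdout", "r") as in_file:
    for line in in_file:
        if line_regex.search(line):
            out_file.write(line)
            # islice advances the same file iterator, so these 10 lines
            # are consumed here and not re-scanned by the outer loop
            out_file.writelines(itertools.islice(in_file, 10))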

splitlines() and iterating over an opened file give different results

I have files that sometimes have weird end-of-line characters like \r\r\n. With this, it works like I want:
with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')

with open('test.txt', 'rb') as f:
    for l in f:
        print(l)

# b'abc\r\r\n'
# b'def'
I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, it's not the same result.
Question: how can I get the same behaviour as for l in f, but using splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to put everything in a BytesIO or StringIO, because it halves the speed (already benchmarked); I want to keep a simple string. So it's not a duplicate of How do I wrap a string in a file in Python?.
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n; it can be added back to every line later if you really need it. For the last line you have to check whether it is actually needed, like:
fixed = [bstr + b'\n' for bstr in result]
if input[-1:] != b'\n':  # indexing bytes yields an int, so compare a 1-byte slice
    fixed[-1] = fixed[-1][:-1]
print(fixed)

[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
Another variant with a generator. This way it will be memory-friendly on huge files, and the syntax stays similar to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            yield input_str[start:]
            break
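For example, with the same made-up input as above:
for l in bin_split(b'\nabc\r\r\r\nd\ref\nghi\r\njkl'):
    print(l)
# b'\n'
# b'abc\r\r\r\n'
# b'd\ref\n'
# b'ghi\r\n'
# b'jkl'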
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if using Python2):
for line in filter(None, text.splitlines()):
    pass  # do stuff with line
I would iterate through like this:
text = "b'abc\r\r\ndef'"
results = text.split('\r\r\n')
for r in results:
print(r)
This is a for l in f: solution:
The key to this is the newline argument on the open call; see the description of the newline parameter in the documentation for open().
Therefore, you should use newline='' when writing to suppress newline translation and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')

with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
A quasi-splitlines solution:
This is, strictly speaking, not a splitlines solution: to handle arbitrary line endings, a regular-expression version of split would have to be used, capturing the line endings and then re-assembling each line with its ending. Instead, this solution just uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r' characters followed by a '\n' character:
import re

input = '\nabc\r\r\ndef\nghi\r\njkl'

with open('test.txt', 'w', newline='') as f:
    f.write(input)

with open('test.txt', 'r', newline='') as f:
    text = f.read()
    lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
    for line in lines:
        print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'

How do I split each line into two strings and print without the comma?

I'm trying to get output without the commas, and to separate each line into two strings and print them.
My code so far yields:
173,70
134,63
122,61
140,68
201,75
222,78
183,71
144,69
But I'd like it to print out without the comma, with the values on each line separated as strings.
if __name__ == '__main__':
    # Complete main section of code
    file_name = "data.txt"
    # Open the file for reading here
    my_file = open('data.txt')
    lines = my_file.read()
    with open('data.txt') as f:
        for line in f:
            lines.split()
            lines.replace(',', ' ')
    print(lines)
In your sample code, lines contains the full content of the file as a str.
my_file = open('data.txt')
lines = my_file.read()
You then later re-open the file to iterate the lines:
with open('data.txt') as f:
    for line in f:
        lines.split()
        lines.replace(',', ' ')
Note, however, that str.split and str.replace do not modify the existing value, as strs in Python are immutable. Also note you are operating on lines there, rather than on the for-loop variable line.
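A quick REPL sketch of the immutability point (made-up value):
>>> s = "173,70"
>>> s.replace(",", " ")  # returns a new string
'173 70'
>>> s  # the original is unchanged
'173,70'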
Instead, you'll need to assign the result of those functions into new values, or give them as arguments (E.g., to print). So you'll want to open the file, iterate over the lines and print the value with the "," replaced with a " ":
with open("data.txt") as f:
for line in f:
print(line.replace(",", " "))
Or, since you are operating on the whole file anyway:
with open("data.txt") as f:
print(f.read().replace(",", " "))
Or, as your file appears to be CSV content, you may wish to use the csv module from the standard library instead:
import csv

with open("data.txt", newline="") as csvfile:
    for row in csv.reader(csvfile):
        print(*row)
with open('data.txt', 'r') as f:
    for line in f:
        for value in line.split(','):
            print(value)
While Python offers several ways to open files, this is the preferred one for working with files, because we open the file lazily (especially preferable for large files), and after exiting the with scope (indentation block) the file handle is closed automatically by the system.
Here we open the file in read mode. Files follow the iterator protocol, so we can iterate over them like lists. Each line is a true line of the file and is of string type.
After getting the line into the line variable, we split (see str.split()) the line into 2 tokens, one before the comma and the other after the comma. split returns a newly constructed list of strings. If you need to omit some unwanted characters, you can use the str.strip() method; usually strip and split are combined together.
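For instance, a one-line REPL sketch of combining them (made-up value):
>>> "173,70\n".strip().split(",")
['173', '70']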
elegant and efficient file reading - method 1
with open("data.txt", 'r') as io:
for line in io:
sl=io.split(',') # now sl is a list of strings.
print("{} {}".format(sl[0],sl[1])) #now we use the format, for printing the results on the screen.
non elegant, but efficient file reading - method 2
fp = open("data.txt", 'r')
line = None
while (line := fp.readline()) != '':  # ':=' needs Python 3.8+; an empty string means EOF has been reached
    sl = line.split(',')
    print("{} {}".format(sl[0], sl[1]))
fp.close()

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken csv file so I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and when exporting the csv the tooling does not add quotation marks to mark the note as a string within the field.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3, 4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
The script shows no output, it just finishes silently. Any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub(r"\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and ;, take off the newline and stuff the line into all_the_data. That way you don't have to play with looking back 'up' the file. Again, lots of ways to do this. If you would rather use the 'starts with 3 letters and a ;' logic and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])

print("results:")
print(all_the_data)
Pretty much what was asked for. The logic: if the current line doesn't start right, take the previous line's newline out of all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the txt (each line starts with 3 letters, so the pattern always finds a valid match). The if condition is therefore never true, and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(';'.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(";".join(join_words))
I've tried not to use regex here, to keep it a little clearer if possible. But regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) in its 4th position. Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget the last line!
You could then use:
import csv

with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')  # the sample data is ;-separated
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time next() is applied to it, so a generator is appropriate.
But this is only a workaround; the correct way would be for the previous step to directly produce a correct CSV file.
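For reference, a minimal sketch of what producing a correct CSV file means here: csv.writer quotes any field containing the delimiter or a newline, so a compliant reader can round-trip it (the file name and row values are made up from the example):

import csv

row = ["tnn", "125", "3", "I am writing a comment\nthat contains new lines", "2017-11-28"]

with open("fixed.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(row)  # the note field is quoted automatically because it contains a newline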

How to write lines from an input file to an output file in reversed order in Python 3

What I want to do is take a series of lines from one text document, and put them in reverse in a second. For example text document a contains:
hi
there
people
I would therefore want to write these same lines to text document b, except like this:
people
there
hi
So far I have:
def write_matching_lines(input_filename, output_filename):
    infile = open(input_filename)
    lines = infile.readlines()
    outfile = open(output_filename, 'w')
    for line in reversed(lines):
        outfile.write(line.rstrip())
    infile.close()
    outfile.close()
but this only returns:
peopletherehi
in one line. Any help would be appreciated.
One line will do:
open("out", "wb").writelines(reversed(open("in").readlines()))
You just need to add '\n', since .write does not do that for you; alternatively you can use
print >>f, line.rstrip()
equivalently in Python 3:
print(line.rstrip(), file=f)
which will add a new line for you. Or do something like this:
>>> with open('text.txt') as fin, open('out.txt', 'w') as fout:
...     fout.writelines(reversed([line.rstrip() + '\n' for line in fin]))
This code assumes that you don't know if the last line has a newline or not, if you know it does you can just use
fout.writelines(reversed(fin.readlines()))
Why do you rstrip() your line before writing it? You're stripping the newline off the end of each line as you write it, and then you notice that you don't have any newlines. Simply remove the rstrip() in your write.
Less is more.
Update
If I couldn't prove/verify that the last line has a terminating newline, I'd personally be inclined to mess with the one line where it mattered, up front. E.g.
....
outfile = open(output_filename, 'w')
lines[-1] = lines[-1].rstrip() + '\n'  # make sure the last line has a newline
for line in reversed(lines):
    outfile.write(line)
....
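Putting that together, a minimal corrected version of the original function might look like this (a sketch; like the original, it assumes the whole file fits in memory):

def write_matching_lines(input_filename, output_filename):
    with open(input_filename) as infile:
        lines = infile.readlines()
    if lines:
        lines[-1] = lines[-1].rstrip() + '\n'  # ensure the last line ends with a newline
    with open(output_filename, 'w') as outfile:
        for line in reversed(lines):
            outfile.write(line)  # no rstrip(): keep each line's newline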
with open(your_filename) as h:
    print(''.join(reversed(h.readlines())))
or, if you want to write it to another stream:
with open(your_filename_out, 'w') as h_out:
    with open(your_filename_in) as h_in:
        h_out.write(''.join(reversed(h_in.readlines())))
