If I have a text file that has a bunch of random text before I get to the stuff I actually want, how do I move the file pointer there?
Say for example my text file looks like this:
#foeijfoijeoijoijfoiej ijfoiejoi jfeoijfoifj i jfoei joi jo ijf eoij oie jojf
#feoijfoiejf ioj oij oi jo ij i joi jo ij oij #### oijroijf 3## # o
#foeijfoiej i jo i iojf 3 ## #io joi joij oi j## io joi joi j3# 3i ojoi joij
# The stuff I care about
(The hashtags are a part of the actual text file)
How do I move the file pointer to the line of stuff I care about, and then how would I get python to tell me the number of the line, and start the reading of the file there?
I've tried doing a loop to find the line that the last hashtag is in, and then reading from there, but I still need to get rid of the hashtag, and need the line number.
Try using the readlines function. This will return a list containing each line. You can use a for loop to search each line for what you need, then obtain the number of the line via its index in the list. For instance:

with open('some_file_path.txt') as f:
    contents = f.readlines()

target = '#the line I am looking for'
for line in contents:
    if target in line:
        line_num = contents.index(line)
To get rid of the pound sign, just use the replace method, e.g. new_line = line.replace('#', '').
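Putting it together with enumerate, a minimal sketch (the marker line is taken from your example; list.index only finds a line's first occurrence, so enumerate is a bit more robust):

with open('some_file_path.txt') as f:
    contents = f.readlines()

for line_num, line in enumerate(contents):
    if '# The stuff I care about' in line:
        # everything from the marker line on, with the pound signs removed
        wanted = [l.replace('#', '') for l in contents[line_num:]]
        break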
You can't seek to it directly without knowing the size of the junk data or scanning through the junk data. But it's not too hard to wrap the file in itertools.dropwhile to discard lines until you see the "good" data, after which it iterates through all remaining lines:
import itertools

# Or def a regular function that returns True until you see the line
# delimiting the beginning of the "good" data
not_good = '# The stuff I care about\n'.__ne__

with open(filename) as f:
    for line in itertools.dropwhile(not_good, f):
        ...  # you'll iterate the lines at and after the good line
If you actually need the file descriptor positioned appropriately, not just the lines, this variant should work (note that on Python 3 this requires opening the file in binary mode, with not_good built from a bytes literal, since text-mode files don't allow non-zero relative seeks):
import io

with open(filename) as f:
    # Get first good line
    good_start = next(itertools.dropwhile(not_good, f))
    # Seek back to undo the read of the first good line:
    f.seek(-len(good_start), io.SEEK_CUR)
    # f is now positioned at the beginning of the line that begins the good data
You can tweak this to get the actual line number if you really need it (rather than just the offset). It's a little less readable though, so explicit iteration via enumerate may make more sense if you need to do it (a sketch follows the code below). The way to make Python work for you is:
from future_builtins import map  # Py2 only
from operator import itemgetter

with open(filename) as f:
    linectr = itertools.count()
    # Get first good line. Pair each line with a 0-up number to advance the
    # count generator, but strip it immediately so not_good only processes
    # lines, not line nums
    good_start = next(itertools.dropwhile(not_good, map(itemgetter(0), zip(f, linectr))))
    good_lineno = next(linectr)  # Keeps the 1-up line number by advancing once
    # Seek back to undo the read of the first good line:
    f.seek(-len(good_start), io.SEEK_CUR)
    # f is now positioned at the beginning of the line that begins the good data
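For the record, a rough sketch of that explicit enumerate approach (assuming the file is opened in binary mode so line lengths are byte counts, which also keeps relative positioning honest on Python 3):

marker = b'# The stuff I care about\n'
with open(filename, 'rb') as f:
    offset = 0
    for lineno, line in enumerate(f, start=1):
        if line == marker:
            break
        offset += len(line)
    # lineno is the 1-up line number of the first good line; seeking to
    # offset repositions f at the start of that line
    f.seek(offset)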
New to coding and trying to figure out how to fix a broken csv file so I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that the occasional note has newlines in it, and when exporting the csv the tooling does not add quotation marks to delimit the text within the field.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3, 4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
Script shows no output just finishes silently, any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub(r"\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done; if the line doesn't end in a date and a ;, take off the newline and stuff the line into all_the_data. That way you don't have to look back 'up' the file. Again, lots of ways to do this. If you would rather use the logic of "starts with 3 letters and a ;" and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic: if the current line doesn't start correctly, remove the previous line's newline from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the file (each line contains a valid match for the pattern), so the if condition is never true and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(' '.join(join_words))  # join fragments with spaces, as in the desired output
            join_words = [line]
        else:
            join_words.append(line)
    print(' '.join(join_words))
I've tried not to use regex here to keep it a little clearer. But regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) in its 4th column. Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget the last line!
You could then use:

import csv

with open('test.txt') as fd:
    # delimiter=';' matches the semicolon-separated data above
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time next() is applied to it, so a generator is appropriate.
But this is only a workaround; the correct fix would be for the previous step to directly produce a correct CSV file, for example by quoting fields as sketched below.
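For illustration, here is roughly what correct quoting would look like if the exporting step used Python's csv module (the sample row is hypothetical); fields containing newlines are quoted automatically:

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter=';', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['tnn', '125', '3', 'I am writing a comment\nthat contains new lines', '2017-11-28'])
print(buf.getvalue())
# tnn;125;3;"I am writing a comment
# that contains new lines";2017-11-28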
I am trying to figure out how to extract the 3 lines before and after a matched word.
At the moment, my word is found. I wrote up some text to test my code, and I figured out how to print three lines after my match.
But I am having difficulty figuring out how to print the three lines before the word "secure".
Here is what I have so far:
from itertools import islice

with open("testdoc.txt", "r") as f:
    for line in f:
        if "secure" in line:
            print("".join(line))
            print("".join(islice(f, 3)))
Here is the text I created for testing:
----------------------------
This is a test to see
if i can extract information
using this code
I hope, I try,
maybe secure shell will save thee
Im adding extra lines to see my output
hoping that it comes out correctly
boy im tired, sleep is nice
until then, time will suffice
You need to buffer your lines so you can recall them. The simplest way is to just load all the lines into a list:
with open("testdoc.txt", "r") as f:
lines = f.readlines() # read all lines into a list
for index, line in enumerate(lines): # enumerate the list and loop through it
if "secure" in line: # check if the current line has your substring
print(line.rstrip()) # print the current line (stripped off whitespace)
print("".join(lines[max(0,index-3):index])) # print three lines preceeding it
But if you need maximum storage efficiency you can use a buffer to store the last 3 lines as you loop over the file line by line. A collections.deque is ideal for that.
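A minimal sketch of that deque-based buffer, reusing the test file from the question:

from collections import deque

with open("testdoc.txt", "r") as f:
    before = deque(maxlen=3)  # only ever holds the 3 most recent lines
    for line in f:
        if "secure" in line:
            print("".join(before) + line, end="")  # the 3 lines before, then the match
        before.append(line)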
I came up with this solution: just add the previous lines to a list, and delete the first one once there are more than 4 elements.
from itertools import islice

with open("testdoc.txt", "r") as f:
    linesBefore = list()
    for line in f:
        linesBefore.append(line.rstrip())
        if len(linesBefore) > 4:  # keep up to 4 lines (3 before plus the current one)
            linesBefore.pop(0)
        if "secure" in line:
            if len(linesBefore) == 4:  # there are at least 3 lines before the match
                for i in range(3):
                    print(linesBefore[i])
            else:  # there are fewer than 3 lines before the match
                print('\n'.join(linesBefore[:-1]))
            print(line.rstrip())
            print("".join(islice(f, 3)))
I'm writing a Python script to read a file, and when I arrive at a certain section, the right way to read the lines in that section depends on information given within the section itself. I found that I could use something like:
fp = open('myfile')
last_pos = fp.tell()
line = fp.readline()
while line != '':
    if line == 'SPECIAL':
        fp.seek(last_pos)
        other_function(fp)
        break
    last_pos = fp.tell()
    line = fp.readline()
Yet, the structure of my current code is something like the following:
import itertools

fh = open(filename)
# get generator function and attach None at the end to stop iteration
items = itertools.chain(((lino, line) for lino, line in enumerate(fh, start=1)), (None,))
item = True
lino, line = next(items)

# handle special section
if line.startswith('SPECIAL'):
    start = fh.tell()
    for i in range(specialLines):
        lino, eline = next(items)
        # etc. get the special data I need here
    # try to set the pointer to start to reread the special section
    fh.seek(start)
    # then reread the special section
But this approach gives the following error:
telling position disabled by next() call
Is there a way to prevent this?
Using the file as an iterator (such as calling next() on it or using it in a for loop) uses an internal buffer; the actual file read position is further along the file and using .tell() will not give you the position of the next line to yield.
If you need to seek back and forth, the solution is not to use next() directly on the file object but use file.readline() only. You can still use an iterator for that, use the two-argument version of iter():
fileobj = open(filename)
fh = iter(fileobj.readline, '')
Calling next() on this iterator will invoke fileobj.readline() until that function returns an empty string. In effect, this creates a file iterator that doesn't use the internal buffer.
Demo:
>>> fh = open('example.txt')
>>> fhiter = iter(fh.readline, '')
>>> next(fhiter)
'foo spam eggs\n'
>>> fh.tell()
14
>>> fh.seek(0)
0
>>> next(fhiter)
'foo spam eggs\n'
Note that your enumerate chain can be simplified to:
items = itertools.chain(enumerate(fh, start=1), (None,))
although I am in the dark as to why you think a (None,) sentinel is needed here; StopIteration will still be raised, albeit one more next() call later.
To read specialLines lines, use itertools.islice():

from itertools import islice

for lino, eline in islice(items, specialLines):
    # etc. get the special data I need here
You can just loop directly over fh instead of using an infinite loop and next() calls here too:
with open(filename) as fh:
    enumerated = enumerate(iter(fh.readline, ''), start=1)
    for lino, line in enumerated:
        # handle special section
        if line.startswith('SPECIAL'):
            start = fh.tell()
            for lino, eline in islice(enumerated, specialLines):
                # etc. get the special data I need here
            fh.seek(start)
but do note that your line numbers will still increment even when you seek back!
You probably want to refactor your code to not need to re-read sections of your file, however.
I'm not an expert with Python 3, but it seems like you're reading using a generator that yields lines read from the file, so you can only move through it in one direction.
You'll have to use another approach.
I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
The data file is a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The second file is opened:

cont_list = open('dataT.txt','r')

It contains a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple: (contig, genome):

def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:

wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        name = p[0].strip()  # drop the trailing newline before comparing
        if name in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.discard(name)
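For completeness, a sketch of how the pieces could fit together (the file names are illustrative only), writing each wanted contig and its sequence to an output file:

wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('refT.txt') as fin, open('output.txt', 'w') as fout:
    for name, seq in pair(fin):
        if name.strip() in wanted:
            fout.write(name + seq)
            wanted.discard(name.strip())
        if not wanted:
            break  # stop early once every contig has been found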
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import izip_longest  # Python 2; on Python 3 use zip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to step through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()  # grab the line that follows
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches a desired contig, it grabs the next line and appends it to sequences after stripping out the unwanted whitespace characters.
From there, writing the grabbed sequences to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
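And writing them out could look like this (the output file name is just an example), one sequence per line:

with open('sequences_out.txt', 'w') as out:
    out.write('\n'.join(sequences) + '\n')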
I've been working on a program which assists in log analysis. It finds error or fail messages using regex and prints them to a new .txt file. However, it would be much more beneficial if the program included the top and bottom 4 lines around each match. I can't figure out how to do this! Here is part of the existing program:
def error_finder(filepath):
    source = open(filepath, "r").readlines()
    error_logs = set()
    my_data = []
    for line in source:
        line = line.strip()
        if re.search(exp, line):
            error_logs.add(line)
I'm assuming something needs to be added to the very last line, but I've been working on this for a bit and either am not applying myself fully or just can't figure it out.
Any advice or help on this is appreciated.
Thank you!
Why Python?
grep -C4 '^your_regex$' logfile > outfile.txt
Some comments:
I'm not sure why error_logs is a set instead of a list.
Using readlines() will read the entire file in memory, which will be inefficient for large files. You should be able to just iterate over the file a line at a time.
exp (which you're using for re.search) isn't defined anywhere, but I assume that's elsewhere in your code.
Anyway, here's complete code that should do what you want without reading the whole file in memory. It will also preserve the order of input lines.
import re
from collections import deque

exp = r'\d'  # matches numbers, change to what you need

def error_finder(filepath, context_lines=4):
    error_logs = []
    buffer = deque(maxlen=context_lines)
    lines_after = 0
    with open(filepath, 'r') as source:  # closes the file when done
        for line in source:
            line = line.strip()
            if re.search(exp, line):
                # add previous lines first
                for prev_line in buffer:
                    error_logs.append(prev_line)
                # clear the buffer
                buffer.clear()
                # add current line
                error_logs.append(line)
                # schedule lines that follow to be added too
                lines_after = context_lines
            elif lines_after > 0:
                # a line that matched the regex came not so long ago
                lines_after -= 1
                error_logs.append(line)
            else:
                buffer.append(line)
    # maybe do something with error_logs? I'll just return it
    return error_logs
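And a sketch of wiring it up to write the matches plus context to a new .txt file, as the original program does (the file names are placeholders):

if __name__ == '__main__':
    matches = error_finder('app.log')
    with open('errors.txt', 'w') as out:
        out.write('\n'.join(matches) + '\n')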
I suggest using an index-based loop instead of a for-each loop; try this:

error_logs = list()
for i in range(len(source)):
    line = source[i].strip()
    if re.search(exp, line):
        error_logs.append((line, max(0, i - 4), i + 4))  # clamp the start at 0

In this case your error log will contain ('line of error', line index - 4, line index + 4), so you can get those lines later from source.
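For example, pulling those context lines back out of source later could look like this (list slices clamp an end index past the last element automatically):

for line, start, end in error_logs:
    context = source[start:end + 1]  # the slice end is clamped automatically
    print(''.join(context))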