Python, Extracting 3 lines before and after a match

I am trying to figure out how to extract 3 lines before and after a matched word.
At the moment, my word is found. I wrote up some text to test my code, and I figured out how to print the three lines after my match.
But I am having difficulty figuring out how to print the three lines before the word "secure".
Here is what I have so far:
from itertools import islice

with open("testdoc.txt", "r") as f:
    for line in f:
        if "secure" in line:
            print("".join(line))
            print("".join(islice(f, 3)))
Here is the text I created for testing:
----------------------------
This is a test to see
if i can extract information
using this code
I hope, I try,
maybe secure shell will save thee
Im adding extra lines to see my output
hoping that it comes out correctly
boy im tired, sleep is nice
until then, time will suffice

You need to buffer your lines so you can recall them. The simplest way is to just load all the lines into a list:
with open("testdoc.txt", "r") as f:
    lines = f.readlines()  # read all lines into a list
    for index, line in enumerate(lines):  # enumerate the list and loop through it
        if "secure" in line:  # check if the current line has your substring
            print(line.rstrip())  # print the current line (trailing whitespace stripped)
            print("".join(lines[max(0, index - 3):index]))  # print the three lines preceding it
But if you need maximum storage efficiency you can use a buffer to store the last 3 lines as you loop over the file line by line. A collections.deque is ideal for that.
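A minimal sketch of that deque-based buffer might look like the following. The function name context_around and its before/after parameters are made up for illustration. Like the islice approach in the question, lines consumed as "after" context are not themselves searched for further matches.

```python
from collections import deque
from itertools import islice


def context_around(lines, needle, before=3, after=3):
    """Yield (preceding, match, following) for each line containing needle.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    it = iter(lines)
    history = deque(maxlen=before)  # keeps only the most recent `before` lines
    for line in it:
        if needle in line:
            following = list(islice(it, after))  # consume the next `after` lines
            yield list(history), line, following
            # the consumed lines become history for any later match
            history.append(line)
            history.extend(following)
        else:
            history.append(line)
```

Passing the open file object from the question as lines and "secure" as needle should yield one tuple holding the three lines before the match, the matching line, and the three lines after it. Only `before` lines are kept in memory at any time.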

I came up with this solution: keep the previous lines in a list and drop the oldest one once the list holds more than 4 elements.
from itertools import islice

with open("testdoc.txt", "r") as f:
    linesBefore = list()
    for line in f:
        linesBefore.append(line.rstrip())
        if len(linesBefore) > 4:  # keep at most 4 lines: 3 before plus the current one
            linesBefore.pop(0)
        if "secure" in line:
            # everything buffered except the last entry (the match itself) precedes the match
            for previous in linesBefore[:-1]:
                print(previous)
            print(line.rstrip())
            print("".join(islice(f, 3)))

Related

Read N lines until EOF in Python 3

Hi, I tried several solutions found on SO, but I am missing some info.
I want to read 4 lines at once until I hit EOF. I know how to do it in other languages, but what is the best approach in Python 3?
This is what I have. lines only ever comes from the first 4 lines, and the code stops afterwards (I know, this is because the slice only gives me the first 4 elements of all_lines). I could use some kind of counter and break and so on, but that seems rather cheap to me.
if os.path.isfile(myfile):
    with open(myfile, 'r') as fo:
        all_lines = fo.readlines()
        for lines in all_lines[:4]:
            print(lines)
I want to handle 4 lines at once until I hit EOF. The file I am working with is rather short, about 100 lines at most.

If you want to iterate the lines in chunks of 4, you can do something like this:
import os

if os.path.isfile(myfile):
    with open(myfile, 'r') as fo:
        all_lines = fo.readlines()
    for i in range(0, len(all_lines), 4):
        print(all_lines[i:i+4])

Instead of reading in the whole file and then looping over the lines four at a time, you can simply read them in four at a time. Consider
import os

def fun(myfile):
    if not os.path.isfile(myfile):
        return
    with open(myfile, 'r') as fo:
        while True:
            for line in (fo.readline() for _ in range(4)):
                if not line:
                    return
                print(line)
Here, a generator expression is used to read four lines. It is embedded in an "infinite" loop, which stops when line is falsy (the empty str ''), which only happens once we have reached EOF.
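If each group of lines is wanted as one list rather than line by line, itertools.islice can pull a fixed number of items from the file iterator. This is only a sketch of that variant; the name chunks_of is made up for illustration.

```python
from itertools import islice


def chunks_of(lines, size=4):
    """Yield lists of up to `size` items until the iterable is exhausted.

    Hypothetical helper: works on any iterable, including an open file.
    """
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:  # an empty list means the source ran dry
            return
        yield chunk
```

With an open file fo, `for group in chunks_of(fo, 4)` would hand over four lines at a time; the last group is simply shorter when the line count is not a multiple of four.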

Concatenate lines with previous line based on number of letters in first column

I am new to coding and trying to figure out how to fix a broken csv file so that I can work with it properly.
The file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that some notes have newlines in them, and when exporting the csv the tool does not put quotation marks around the field to mark it as a single string.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
The script shows no output and just finishes silently. Any thoughts?

I think I would go a slightly different way:
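import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            # no date at the end: swap the newline for a space so the next line joins on
            # (note: a header row that does not end in a date is merged the same way)
            line = re.sub("\n", " ", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)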
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and a ;, take off the newline before adding it to all_the_data. That way you don't have to look back 'up' the file. Again, there are lots of ways to do this. If you would rather use the logic of 'starts with 3 letters and a ;' and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            # continuation line: swap the previous line's newline for a space
            all_the_data = re.sub(r"\n$", " ", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic is that if the current line doesn't start right, the previous line's newline is taken out of all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com

The regex in your code matches every line in the file, because even the continuation lines start with three letters. The if condition is therefore never true, and nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            # a new record starts here, so emit the one collected so far
            print(' '.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(" ".join(join_words))  # emit the last record
I've tried not to use regex here, to keep it a little clearer if possible. But regex is a better option.

A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) as its 4th character. The code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3:4] == ';':  # slicing avoids an IndexError on lines shorter than 4 characters
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!
You could then use:
import csv

with open("test.txt") as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=";")
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next function is applied to it, so a generator is appropriate.
But this is only a workaround; the correct way would be for the previous step to produce a correct CSV file directly.
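As a rough illustration of what a correct export would look like, the sketch below writes a note containing a newline with the csv module and reads it back. The in-memory buffer and the sample row are assumptions standing in for the real export; the point is only that the writer quotes the multi-line field, so the reader can put it back together.

```python
import csv
import io

# Sample data modelled on the question; the note contains a newline.
rows = [
    ["user", "case", "hours", "note", "date"],
    ["tnn", "125", "3", "I am writing a comment\nthat contains new lines", "2017-11-28"],
]

buffer = io.StringIO()  # stands in for the exported file
# The default quoting (QUOTE_MINIMAL) wraps the multi-line note in quotation marks.
csv.writer(buffer, delimiter=";").writerows(rows)

buffer.seek(0)
round_tripped = list(csv.reader(buffer, delimiter=";"))
```

Because the note is quoted on the way out, round_tripped holds the same two rows that were written, with the newline still inside the note field.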

Too many values to unpack in python: Caused by the file format

I have two files, which have two columns as following:
file 1
------
main 46
tag 23
bear 15
moon 2
file 2
------
main 20
rocky 6
zoo 4
bear 2
I am trying to compare the first 2 rows of each file, and where the same word appears in both, sum up the numbers and write the result to a new file.
I read the file and used a for loop to go through each line, but it raises ValueError: too many values to unpack.
import os
from itertools import islice

DIR = r'dir'
for filename in os.listdir(DIR):
    with open(os.path.sep.join([DIR, filename]), 'r') as f:
        for i in range(2):
            line = f.readline().strip()
            word, freq = line.split()
            print(word)
            print(freq)
In the file, there is an extra empty line after each line of text. I searched for \n, but nothing is there.
Then I removed the empty lines manually, and it worked.

If you don't know how many items you have in the line, then you can't use the nice unpack facility. You'll need to split and check how many you got. For instance:
with open(os.path.sep.join([DIR, filename]), 'r') as f:
    for line in f:
        data = line.split()
        if len(data) >= 2:
            word, count = data[:2]  # unpack from the split fields, not from the raw line
This will get you the first two fields of any line containing at least that many. Since you haven't specified what to do with other lines or extra fields, I'll leave that (any else part) up to you. I've also left out the strip call to stay close to the existing code; split() with no arguments already discards newlines, spaces, and any other whitespace around the fields.
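Building on that check, one possible way to sum the numbers for words that appear in both files is a collections.Counter. This is a sketch only: the name add_counts is made up, and it adds up every well-formed line rather than just the first two rows mentioned in the question.

```python
from collections import Counter


def add_counts(lines, totals=None):
    """Add the 'word count' pairs found in lines to a Counter and return it.

    Hypothetical helper: blank or malformed lines are skipped silently.
    """
    totals = Counter() if totals is None else totals
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # blank line or missing count: nothing to unpack
            continue
        word, count = fields[:2]
        totals[word] += int(count)
    return totals
```

Calling add_counts once per open file with the same Counter accumulates the totals across files, and the extra empty lines no longer matter.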

reading data from multiple lines as a single item

I have a set of data from a file as such
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]
How can I read/reference the text per "johnnyboy"=splice(23) as a single line, like this:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
I am currently matching the regex based on splice(23): with a search as follows:
re_johnny = re.compile('splice')
with open("file.txt", 'r') as file:
    read = file.readlines()
    for line in read:
        if re_johnny.match(line):
            print(line)
I think I need to remove the backslashes and the spaces to merge the lines, but I am unfamiliar with how to do that without also picking up the blank lines or the lines that do not match my regex. When I tried the first suggested solution, my last row was pulled in inappropriately. Any assistance would be great.

Input file: fin
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]
Adding to tigerhawk's suggestion you can try something like this:
Code:
import re

with open('fin', 'r') as f:
    for l in [''.join([b.strip('\\') for b in a.split()]) for a in f.read().split('\n\n')]:
        if 'splice' in l:
            print(l)
Output:
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00

With regex you have multiplied your problems. Instead, keep it simple:
If a line starts with ", it begins a record.
Else, append it to the previous record.
You can implement parsing for such a scheme in just a few lines in Python. And you don't need regex.
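The two rules above fit in a few lines. Below is a minimal sketch; the function name merge_records is made up, and treating a line that starts with [ as the start of a new record is an added assumption, so that the trailing [mattplayhouse...] line is not glued onto the record before it.

```python
def merge_records(lines):
    """Merge backslash-continued lines into one string per record.

    Hypothetical helper. Assumption: a line starting with '"' or '['
    begins a new record; anything else continues the previous one.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        piece = line.rstrip("\\")  # drop the trailing continuation backslash
        if line.startswith(('"', "[")) or not records:
            records.append(piece)
        else:
            records[-1] += piece
    return records
```

Filtering the result with `[r for r in merge_records(f) if "splice" in r]` would then keep only the splice records, each as a single line.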

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-

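I would first try to generate an iterable which gives you a tuple (contig, genome):
def pair(file_obj):
    for line in file_obj:
        # strip the newlines so the names compare equal to the entries in wanted
        yield line.strip(), next(file_obj).strip()
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs, None)
        if p is None:  # file exhausted before every wanted contig was found
            break
        if p[0] in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(p[0])  # sets offer remove()/discard(); there is no forget()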

I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for:
from itertools import zip_longest  # named izip_longest in Python 2

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL
# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to step through fasta files 2 lines at a time
    for name, seq in zip_longest(*[genomefile]*2, fillvalue=''):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)

Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as genome:
        for line in genome:
            if line.startswith(valid_contigs):
                sequence = next(genome).strip()  # the line after the name holds the sequence
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('refT.txt', valid_contigs)  # refT.txt is the genome file
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
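For completeness, here is one way the name and sequence pairs could be collected before writing them out. It is a sketch under the assumption that the genome file strictly alternates name and sequence lines; the function name extract_contigs is made up for illustration.

```python
def extract_contigs(genome_lines, wanted):
    """Return (name, sequence) pairs for the wanted contig names.

    Hypothetical helper. Assumes lines alternate: name, sequence, name, ...
    """
    wanted = set(wanted)  # set membership tests are fast even for long lists
    it = iter(genome_lines)
    selected = []
    # zip(it, it) draws two consecutive lines from the same iterator per step
    for name, seq in zip(it, it):
        if name.strip() in wanted:
            selected.append((name.strip(), seq.strip()))
    return selected
```

Each returned pair can then be written to the output file as two lines, for example with `outfile.write(name + "\n" + seq + "\n")` inside a with block.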
