Iterating over a .txt file with a regular expression conditional - python

Program workflow:
1. Open "asigra_backup.txt" and read each line.
2. Search for the exact string "Errors: " plus any value ranging from 1 to 100, e.g. "Errors: 12".
3. When a match is found, open a separate .txt file in write/append mode.
4. Write the match found, e.g. "Errors: 4".
5. In addition to the above write, append the next 4 lines below the match found in step 3, as that is additional log information.
What I've done:
- Tested a regular expression that matches my sample data on regex101.com
- Used a list comprehension to find all matches in my test file
Where I need help (please):
- Figuring out how to append the 4 additional lines of log information below each match found
CURRENT CODE:
result = [line.split("\n")[0] for line in open('asigra_backup.txt') if re.match(r'^Errors:\s([1-9]|[1-9][0-9]|100)', line)]
print(result)
CURRENT OUTPUT:
['Errors: 1', 'Errors: 128']
DESIRED OUTPUT:
Errors: 1
Pasta
Fish
Dog
Doctonr
Errors: 128
Lemon
Seasoned
Rhinon
Goat
SAMPLE .TXT FILE
Errors: 1
Pasta
Fish
Dog
Doctonr
Errors: 128
Lemon
Seasoned
Rhinon
Goat
Errors: 0
Rhinon
Cat
Dog
Fish

For those wanting additional clarification, as it may help the next person, this was my final solution:
import re
import sys
from itertools import chain

def errors_to_file(self):
    """
    Opens the file containing Asigra backup logs, "asigra_backup.txt", and collects all errors within the log.
    Uses a regular expression match on each line within the asigra backup log file. The error number range is 1 - 100.
    Formats the error log by inserting a blank entry after every block of N lines in the errors list.
    Writes the formatted error log to a file in the current directory: "asigra_errors.txt"
    """
    # "asigra_backup.txt" contains log information from the performed backup.
    with open('asigra_backup.txt', "r") as f:
        lines0 = [line.rstrip() for line in f]
    # empty list that is appended with errors found in the log
    lines = []
    for i, line in enumerate(lines0):
        if re.match(r'^Errors:\s([1-9]|[1-9][0-9]|100)', line):
            # keep the matching line plus the following 8 lines of log detail
            lines.extend(lines0[i:i+9])
    if len(lines) == 0:
        print("No errors found")
        print("Gracefully exiting")
        sys.exit(1)
    k = ''
    N = 9
    # insert a blank entry after every full block of N lines
    formatted_errors = list(chain(*[lines[i:i+N] + [k]
                                    if len(lines[i:i+N]) == N
                                    else lines[i:i+N]
                                    for i in range(0, len(lines), N)]))
    with open("asigra_errors.txt", "w") as e:
        for i, line in enumerate(formatted_errors):
            e.write(f"{line}\n")
Huge thank you to those that answered my question.

Using a better regex with re.findall can make it easier. In the following regex, each Errors: line and the 4 lines following it are captured.
import re

with open('asigra_backup.txt', 'r') as f:
    content = f.read()
regex_matches = re.findall(r'(?:[\r\n]+|^)((Errors:\s*([1-9][0-9]?|100))(?:[\r\n\s\t]+.*){4})', content)
with open('separate.txt', 'a') as out:
    out.write('\n' + '\n'.join(i[0] for i in regex_matches))
To access the error numbers or the error lines, the following can be used:
error_rows = [i[1] for i in regex_matches]
error_numbers = [i[2] for i in regex_matches]
print(error_rows)
print(error_numbers)

I wrote code which prints the output as requested. The code works when an Errors: 1 line is added as the last line. See the text I have parsed:
data_to_parse = """
Errors: 56
Pasta
Fish
Dog
Doctonr
Errors: 0
Lemon
Seasoned
Rhinon
Goat
Errors: 45
Rhinon
Cat
Dog
Fish
Errors: 34
Rhinon
Cat
Dog
Fish1
Errors: 1
"""
See the code below, which gives the desired output without using regex. Indices are used to get the desired data.
lines = data_to_parse.splitlines()
errors_indices = []
i = 0
k = 0
for line in lines:  # the indices where 'Errors:' lines occur are saved in errors_indices
    if 'Errors:' in line:
        errors_indices.append(i)
    i = i + 1
while k < len(errors_indices):
    counter = False  # needed to skip the lines that follow an 'Errors: 0'
    for j in range(errors_indices[k-1], errors_indices[k]):
        if 'Errors:' in lines[j]:
            lines2 = lines[j].split(':')
            lines2_val = lines2[1].strip()
            if int(lines2_val) != 0:
                print(lines[j])
            if int(lines2_val) == 0:
                counter = True
        elif 'Errors:' not in lines[j] and counter == False:
            print(lines[j])
    k = k + 1
I have tried it a few times to check that the code works properly, and it gives the requested output. When run on data_to_parse above, it prints:
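Errors: 56
Pasta
Fish
Dog
Doctonr
Errors: 45
Rhinon
Cat
Dog
Fish
Errors: 34
Rhinon
Cat
Dog
Fish1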

How to read ONLY 1 word in python?

I've created an empty text file, and saved some stuff to it. This is what I saved:
Saish ddd TestUser ForTestUse
There is a space before these words. Anyway, I wanted to know how to read only 1 WORD from the text file using Python. This is the code I used:
#Uncommenting the commented-out line below does literally nothing.
import time
#import mmap, re
print("Loading Data...")
time.sleep(2)
with open("User_Data.txt") as f:
    lines = f.read()  ##Assume the sample file has 3 lines
first = lines.split(None, 1)[0]
print(first)
print("Type user number 1 - 4 for using different user.")
ans = input('Is the name above correct?(y/1 - 4) ')
if ans == 'y':
    print("Ok! You will be called", first)
elif ans == '1':
    print("You are already registered to", first)
elif ans == '2':
    print('Switching to accounts...')
    time.sleep(0.5)
    with open("User_Data.txt") as f:
        lines = f.read()  ##Assume the sample file has 3 lines
    second = lines.split(None, 2)[2]
    print(second)
#Fix the password issue! Very important as this is SECURITY!!!
when I run the code, my output is:
Loading Data...
Saish
Type user number 1 - 4 for using different user.
Is the name above correct?(y/1 - 4) 2
Switching to accounts...
TestUser ForTestUse
As you can see, it displays both "TestUser" and "ForTestUse", while I only want it to display "TestUser".
When you give a limit to split(), all the items from that limit to the end are combined. So if you do
lines = 'Saish ddd TestUser ForTestUse'
split = lines.split(None, 2)
the result is
['Saish', 'ddd', 'TestUser ForTestUse']
If you just want the third word, don't give a limit to split().
second = lines.split()[2]
You can use it directly without passing None:
lines.split()[2]
I understand you're passing (None, 2) because you want to get None if there is no value at index 2.
A simple way to check whether the index is available in the list:
Python 2
2 in zip(*enumerate(lines.split()))[0]
Python 3
2 in list(zip(*enumerate(lines.split())))[0]
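A plainer way to guard against a missing index, which works the same on Python 2 and 3, is a simple length check:
words = lines.split()
if len(words) > 2:
    print(words[2])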

Extract a string between other two in Python

I am trying to extract the comments from an fdf (PDF comment) file. In practice, this means extracting a string between two others. I did the following:
I open the fdf file with the following commands:
import re
import os
os.chdir("currentworkingdirectory")
archcom = open("comentarios.fdf", "r")
cadena = archcom.read()
With the opened file, I create a string called cadena with all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
Trying to get:
a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"
But I got:
a = []
The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena), but I can't see where. I have tried many combinations with no success.
I appreciate any comment.
Regards
It seems to me that there are 2 problems:
a) You are looking for nendobj, but the n is actually part of the line break \n. Thus you'll also not get a leading n in the output, because there is no literal n there.
b) Since the text you're looking for crosses some newlines, you need the re.DOTALL flag.
Final code:
a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)
Also note that there will be a second result, as confirmed on Regex101.
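A quick check with the cadena string from the question confirms both matches:
import re

cadena = ("\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>"
          "\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>"
          "\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n")

a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)
print(len(a))      # 2
print(repr(a[1]))  # '\n219 0 obj\n<</'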

Better regex implementation than for looping whole file?

I have files looking like this:
# BJD K2SC-Flux EAPFlux Err Flag Spline
2457217.463564 5848.004 5846.670 6.764 0 0.998291
2457217.483996 6195.018 6193.685 6.781 1 0.998291
2457217.504428 6396.612 6395.278 6.790 0 0.998292
2457217.524861 6220.890 6219.556 6.782 0 0.998292
2457217.545293 5891.856 5890.523 6.766 1 0.998292
2457217.565725 5581.000 5579.667 6.749 1 0.998292
2457217.586158 5230.566 5229.232 6.733 1 0.998292
2457217.606590 4901.128 4899.795 6.718 0 0.998293
2457217.627023 4604.127 4602.793 6.700 0 0.998293
I need to find and count the lines with Flag = 1 (5th column). This is how I have done it:
import re

foundlines = []
c = 0
with open('examplefile') as f:
    for index, line in enumerate(f):
        try:
            found = re.findall(r' 1 ', line)[0]
            foundlines.append(index)
            print(line)
            c += 1
        except:
            pass
print(c)
In shell, I would just do grep " 1 " examplefile | wc -l, which is much shorter than the Python script above. The Python script works, but I am interested in whether there is a shorter, more compact way to do the task. I prefer the brevity of shell and would like something similar in Python.
You have CSV data, so you can use the csv module:
import csv

with open('your file', 'r', newline='', encoding='utf8') as fp:
    rows = csv.reader(fp, delimiter=' ')
    # generator comprehension
    errors = (row for row in rows if row[4] == '1')
    for error in errors:
        print(error)
Your shell implementation can be made even shorter: grep has a -c option to get you a count, so there is no need for an anonymous pipe and wc:
grep -c " 1 " examplefile
Your shell code simply gets you the count of lines where the pattern " 1 " is found, but your Python code additionally keeps a list of the indexes of the lines where the pattern matched.
To get only the line count, you can use sum with a genexp/list comprehension; there is also no need for a regex, a simple substring check with in would do:
with open('examplefile') as f:
    count = sum(1 for line in f if ' 1 ' in line)
print(count)
If you want to keep the indexes as well, you can stick to your idea, only replacing the re test with a str test:
count = 0
indexes = []
with open('examplefile') as f:
    for idx, line in enumerate(f):
        if ' 1 ' in line:
            count += 1
            indexes.append(idx)
Additionally, a bare except is almost always a bad idea (at least use except Exception to leave out SystemExit- and KeyboardInterrupt-like exceptions); catch only the exceptions you know might be raised.
Also, while parsing structured data you should use a specific tool, e.g. here csv.reader with space as the separator (line.split(' ') should do in this case as well), and checking against index 4 would be safest (see Tomalak's answer). With the ' 1 ' in line test, there would be misleading results if any other column contains 1.
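For illustration, a minimal sketch of that column-wise check (skipping the # header and any blank lines):
with open('examplefile') as f:
    count = sum(1 for line in f
                if line.strip() and not line.startswith('#')
                and line.split()[4] == '1')
print(count)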
Considering the above, here's the shell way, using awk to match against the 5th field:
awk '$5 == "1" {count+=1}; END{print count}' examplefile
Shortest code
This is a very short version, under some specific preconditions:
- You just want to count occurrences, like your grep invocation
- There is guaranteed to be only one " 1 " per line
- That " 1 " can only occur in the desired column
- Your file fits easily into memory
Note that if these preconditions are not met, this may cause memory issues or return false positives.
print(open("examplefile").read().count(" 1 "))
Easy and versatile, slightly longer
Of course, if you're interested in actually doing something with these lines later on, I recommend Pandas:
import pandas

df = pandas.read_table('test.txt', delimiter=" ",
                       comment="#",
                       names=['BJD', 'K2SC-Flux', 'EAPFlux', 'Err', 'Flag', 'Spline'])
To get all the rows where Flag is 1:
flaggedrows = df[df.Flag == 1]
returns:
BJD K2SC-Flux EAPFlux Err Flag Spline
1 2.457217e+06 6195.018 6193.685 6.781 1 0.998291
4 2.457218e+06 5891.856 5890.523 6.766 1 0.998292
5 2.457218e+06 5581.000 5579.667 6.749 1 0.998292
6 2.457218e+06 5230.566 5229.232 6.733 1 0.998292
To count them:
print(len(flaggedrows))
returns 4
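Since Flag only takes the values 0 and 1 in this data, the count can also be read straight off the column:
print((df.Flag == 1).sum())  # 4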

matching and displaying specific lines through python

I have 15 lines in a log file and I want to read, for example, the 4th and 10th lines through Python and display them on output, saying the string is found:
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
Please suggest, through code, how to achieve this in Python.
Just to elaborate more on this example, as it may help the next person: the first string (or line) is unique and can be found easily in the logfile. The next string, B, comes within 40 lines of the first one, but it occurs in lots of places in the log file, so I need to read this string within the first 40 lines after reading string A, and print that both strings were found.
Also, I can't use the with statement of Python, as this gives me errors like 'with' will become a reserved keyword in Python 2.6. I am using Python 2.5.
You can use this:
fp = open("file")
for i, line in enumerate(fp):
    if i == 3:
        print line
    elif i == 9:
        print line
        break
fp.close()
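For the follow-up (a unique string A, then a string B somewhere within the next 40 lines), a minimal sketch along the same lines; the marker strings and the filename are placeholders:
f = open('logfile.txt')
lines = f.readlines()
f.close()

for i, line in enumerate(lines):
    if 'STRING_A' in line:  # the unique first string
        print 'STRING_A found on line', i + 1
        # search for STRING_B only within the next 40 lines
        for j, other in enumerate(lines[i + 1:i + 41]):
            if 'STRING_B' in other:
                print 'STRING_B found on line', i + 2 + j
                break
        break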
def bar(start, end, search_term):
    with open("foo.txt") as fil:
        # strip newlines so whole-line terms can match
        if search_term in [l.strip() for l in fil.readlines()[start:end]]:
            print search_term + " was found"

>>>bar(4, 10, "dsfsfs")
"dsfsfs was found"
#list of random characters
from random import randint
a = list(chr(randint(0, 100)) for x in xrange(100))

#look for this
lookfor = 'b'

for element in xrange(100):
    if lookfor == a[element]:
        print a[element], 'on', element

#b on 33
#b on 34
#b on 34
is one easy to read and simple way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
after edits by author:
The easiest thing you can do then is:
looking_for = 'findthis'
i = 1
for line in open('filename.txt', 'r'):
    if looking_for == line.strip():  # strip the trailing newline so equality can match
        print i, line
    i += 1
it's efficient and easy :)

pattern searching in text

I have text file as follows seq.txt
>S1
AACAAGAAGAAAGCCCGCCCGGAAGCAGCTCAATCAGGAGGCTGGGCTGGAATGACAGCG
CAGCGGGGCCTGAAACTATTTATATCCCAAAGCTCCTCTCAGATAAACACAAATGACTGC
GTTCTGCCTGCACTCGGGCTATTGCGAGGACAGAGAGCTGGTGCTCCATTGGCGTGAAGT
CTCCAGGGCCAGAAGGGGCCTTTGTCGCTTCCTCACAAGGCACAAGTTCCCCTTCTGCTT
CCCCGAGAAAGGTTTGGTAGGGGTGGTGGTTTAGTGCCTATAGAACAAGGCATTTCGCTT
CCTAGACGGTGAAATGAAAGGGAAAAAAAGGACACCTAATCTCCTACAAATGGTCTTTAG
TAAAGGAACCGTGTCTAAGCGCTAAGAACTGCGCAAAGTATAAATTATCAGCCGGAACGA
GCAAACAGACGGAGTTTTAAAAGATAAATACGCATTTTTTTCCGCCGTAGCTCCCAGGCC
AGCATTCCTGTGGGAAGCAAGTGGAAACCCTATAGCGCTCTCGCAGTTAGGAAGGAGGGG
TGGGGCTGTCCCTGGATTTCTTCTCGGTCTCTGCAGAGACAATCCAGAGGGAGACAGTGG
ATTCACTGCCCCCAATGCTTCTAAAACGGGGAGACAAAACAAAAAAAAACAAACTTCGGG
TTACCATCGGGGAACAGGACCGACGCCCAGGGCCACCAGCCCAGATCAAACAGCCCGCGT
CTCGGCGCTGCGGCTCAGCCCGACACACTCCCGCGCAAGCGCAGCCGCCCCCCCGCCCCG
GGGGCCCGCTGACTACCCCACACAGCCTCCGCCGCGCCCTCGGCGGGCTCAGGTGGCTGC
GACGCGCTCCGGCCCAGGTGGCGGCCGGCCGCCCAGCCTCCCCGCCTGCTGGCGGGAGAA
ACCATCTCCTCTGGCGGGGGTAGGGGCGGAGCTGGCGTCCGCCCACACCGGAAGAGGAAG
TCTAAGCGCCGGAAGTGGTGGGCATTCTGGGTAACGAGCTATTTACTTCCTGCGGGTGCA
CAGGCTGTGGTCGTCTATCTCCCTGTTGTTC
>S2
ACACGCATTCACTAAACATATTTACTATGTGCCAGGCACTGTTCTCAGTGCTGGGGATAT
AGCAGTGAAGAAACAGAAACCCTTGCACTCACTGAGCTCATATCTTAGGGTGAGAAACAG
TTATTAAGCAAGATCAGGATGGAAAACAGATGGTACGGTAGTGTGAAATGCTAAAGAGAA
AAATAACTACGGAAAAGGGATAGGAAGTGTGTGTATCGCAGTTGACTTATTTGTTCGCGT
TGTTTACCTGCGTTCTGTCTGCATCTCCCACTAAACTGTAAGCTCTACATCTCCCATCTG
TCTTATTTACCAATGCCAACCGGGGCTCAGCGCAGCGCCTGACACACAGCAGGCAGCTGA
CAGACAGGTGTTGAGCAAGGAGCAAAGGCGCATCTTCATTGCTCTGTCCTTGCTTCTAGG
AGGCGAATTGGGAAATCCAGAGGGAAAGGAAAAGCGAGGAAAGTGGCTCGCTTTTGGCGC
TGGGGAAGAGGTGTACAGTGAGCAGTCACGCTCAGAGCTGGCTTGGGGGACACTCTCACG
CTCAGGAGAGGGACAGAGCGACAGAGGCGCTCGCAGCAGCGCGCTGTACAGGTGCAACAG
CTTAGGCATTTCTATCCCTATTTTTACAGCGAGGGACACTGGGCCTCAGAAAGGGAAGTG
CCTTCCCAAGCTCCAACTGCTCATAAGCAGTCAACCTTGTCTAAGTCCAGGTCTGAAGTC
CTGGAGCGATTCTCCACCCACCACGACCACTCACCTACTCGCCTGCGCTTCACCTCACGT
GAGGATTTTCCAGGTTCCTCCCAGTCTCTGGGTAGGCGGGGAGCGCTTAGCAGGTATCAC
CTATAAGAAAATGAGAATGGGTTGGGGGCCGGTGCAAGACAAGAATATCCTGACTGTGAT
TGGTTGAATTGGCTGCCATTCCCAAAACGAGCTTTGGCGCCCGGTCTCATTCGTTCCCAG
CAGGCCCTGCGCGCGGCAACATGGCGGGGTCCAGGTGGAGGTCTTGAGGCTATCAGATCG
GTATGGCATTGGCGTCCGGGCCCGCAAGGCG
.
.
.
.
I have to count patterns in these sequences. To achieve this, I wrote the following Python script:
import re

infile = open("seq.txt", 'r')
out = open("pat.txt", 'w')
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
for line in infile:
    line = line.strip("\n")
    if line.startswith('>'):
        name = line
    else:
        s = re.findall(pattern, line)
        print '%s:%s' % (name, s)
        out.write('%s:\t%s\n' % (name, len(s)))
But it is giving the wrong result. The script is reading line by line.
S1 : 0
S1 : 0
S1 : 0
S1 : 0
S2 : 0
S2 : 1
S2 : 0
S2 : 1
But I want output as follows:
S1 : 0
S2 : 2
Can anybody help?
Use a hit counter, zero it if line.startswith('>'). Increment by len(s) otherwise.
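A minimal sketch of that hit-counter idea, reusing the pattern and file handles from the question's script:
import re

pattern = re.compile("GAAAT", flags=re.IGNORECASE)
infile = open("seq.txt", 'r')
out = open("pat.txt", 'w')

name = None
hits = 0
for line in infile:
    line = line.strip("\n")
    if line.startswith('>'):
        if name is not None:
            out.write('%s:\t%s\n' % (name, hits))  # flush the previous section
        name = line
        hits = 0  # zero the counter on each new header
    else:
        hits += len(pattern.findall(line))
if name is not None:
    out.write('%s:\t%s\n' % (name, hits))  # don't forget the last section

infile.close()
out.close()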
This code might be helpful for you:
import re

pattern = re.compile("GAAAT", flags=re.IGNORECASE)
with open('seq.txt') as f:
    sections = f.read().split('\n\n')
    for section in sections:
        lines = section.split()
        name = lines[0].lstrip('>')
        data = ''.join(lines[1:])
        print '{0}: {1}'.format(name, len(pattern.findall(data)))
Example output:
S1: 1
S2: 2
Notes:
It's assumed that two newline characters are used to separate every section as in the example.
It's assumed that every section name is preceded by a greater than (>) character as in the example.
If you already have a pattern, use pattern.findall(data) instead of re.findall(pattern, data)
You should gather input until you reach the next header. This would also solve the corner case where your pattern crosses a line boundary (not sure if that can happen with your data, but it looks like it might).
Use a counter. Also, your print call is inside the for loop, so it runs as many times as the else branch does; move the printing after the loop. Note that it's also not a good idea to use the variable line both as the iterator variable in the for loop and as another variable, as it makes the code more confusing.
counter_dict = {}
for line in infile:
    if line[0] == '>':
        name = line[1:].strip()  # drop the '>' and the trailing newline
        counter_dict[name] = 0
    else:
        counter_dict[name] += len(re.findall(pattern, line))
for (key, val) in counter_dict.items():
    print '%s:%s' % (key, val)
    out.write('%s:\t%s\n' % (key, val))
