def regexread():
    import re
    result = ''
    savefileagain = open('sliceeverfile3.txt','w')
    #text=open('emeverslicefile4.txt','r')
    text='09,11,14,34,44,10,11, 27886637, 0\n561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n559, Tue,29,Jan,2013,'
    pattern='\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
    #with open('emeverslicefile4.txt') as text:
    f = re.findall(pattern,text)
    for item in f:
        print(item)
        savefileagain.write(item)
    #savefileagain.close()
The function above, as written, parses the text and returns sets of seven numbers. I have three problems.
Firstly, reading from the 'read' file, which contains exactly the same text as text='09,...', raises TypeError: expected string or buffer, which I cannot solve even after reading some of the posts here.
Secondly, when I try to write the results to the 'write' file, nothing appears in it.
Thirdly, I am not sure how to get the same output that I get with the print statement: three lines of seven numbers each, which is the output I want.
This should do the trick:
import re

filename = 'sliceeverfile3.txt'
pattern = r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
new_file = []

# Make sure the file gets closed after being iterated
with open(filename, 'r') as f:
    # Read the file contents and generate a list with each line
    lines = f.readlines()

# Iterate over each line
for line in lines:
    # Regex applied to each line
    match = re.search(pattern, line)
    if match:
        # Make sure to add \n so the lines display correctly when written back
        new_line = match.group() + '\n'
        print(new_line)
        new_file.append(new_line)

with open(filename, 'w') as f:
    # Overwrite the file with the matched lines
    f.writelines(new_file)
You're sort of on the right track...
You'll iterate over the file:
How to iterate over the file in python
and apply the regex to each line. The link above should really answer all three of your questions once you notice that the write happens on item inside the loop, but the output file is never closed, so its buffer is never flushed to disk.
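For reference, here is a minimal patched version of the original function (a sketch, using the file names from the question and assuming emeverslicefile4.txt exists): reading the file's contents fixes the TypeError, writing a newline after each match reproduces the print layout, and the with blocks close the output file so the writes actually reach disk:

import re

def regexread():
    # Problem 1: re.findall needs a string, not a file object
    with open('emeverslicefile4.txt') as src:
        text = src.read()
    pattern = r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
    with open('sliceeverfile3.txt', 'w') as dst:
        for item in re.findall(pattern, text):
            print(item)                # three lines of seven numbers each
            dst.write(item + '\n')     # Problem 3: newline matches the print layout
    # Problem 2: leaving the with block closes (and flushes) the output file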
I've checked and compared with other questions here, but I didn't find a solution.
I open a text file and extract IP addresses from it, but I can't write this information to a new text file.
My output file shows only the last line of my log, and that line is just [].
My second question is that I'd like to group identical IP addresses together before writing them to the new text file.
import re

in_file = open("D:\BLOCK\log")
out_file = open("D:\BLOCK\output.txt", "w")

for line in in_file:
    ipki = re.findall(r'[0-9]+(?:\.[0-9]+){3}', line)
    print(ipki)

out_file.write(str(ipki))
file.close()
The output of print(ipki) is:
['70.31.28.181']
['70.31.28.181']
['70.31.28.181']
['130.43.58.196']
['130.43.58.196']
['130.43.58.196']
[]
[]
[]
As mentioned in the comments, your problem is that you keep replacing ipki in the for loop and only write its final value at the end. re.findall returns a list of zero or more matched strings; since your output file contains the string representation of an empty list ("[]"), the last line of the input file had no match.
You could add processing to append found IPv4 addresses to a master list, but since re.findall can process large blocks of text, it's easier to read the entire file at once and let it do the heavy lifting for you. Once you have the list, you can use a set to get rid of duplicates before writing the result file.
>>> import re
>>> with open('log') as fp:
... ip4addrs = set(re.findall(r'[0-9]+(?:\.[0-9]+){3}', fp.read()))
...
>>> with open('output.txt', 'w') as fp:
... fp.write('\n'.join(ip4addrs))
...
26
>>> print(ip4addrs)
{'130.43.58.196', '70.31.28.181'}
>>> print(open('output.txt').read())
130.43.58.196
70.31.28.181
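To address the second question, grouping repeated addresses, here is a hedged sketch along the same lines that uses collections.Counter instead of set, so you also get a count per address (the output format is my own choice):

import re
from collections import Counter

with open('log') as fp:
    # findall over the whole file; Counter groups duplicate addresses
    counts = Counter(re.findall(r'[0-9]+(?:\.[0-9]+){3}', fp.read()))

with open('output.txt', 'w') as fp:
    for addr, n in counts.most_common():   # most frequent address first
        fp.write('%s %d\n' % (addr, n))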
I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain, but essentially I want to parse a FASTA file, and every time the parser reaches a header (line.startswith('>')), I want it to replace part of that header with an element from a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down. Also, an important note: you don't close your file handles. This can result in errors, especially when writing to a file; either way it's bad practice. Code:
#!/usr/bin/python
import sys

# gets the list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:   # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])   # tab as the separator
    return annos

# replaces the extra info on each header with the correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)   # list of annotations
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]   # keep only the first word of the header
                line_split.append(annos.pop(0))  # append the annotation of interest
                output.write(' '.join(line_split) + '\n')  # join and write with a newline
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
This is not perfect, but it cleans things up a bit. I might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
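As an illustration only, here is a sketch of the same loop that pairs headers with annotations through an iterator rather than pop(0); it leaves the list untouched and fails loudly if the annotations run out, though it still assumes both files are in the same order:

def add_annos(infile1, infile2, outfile):
    annos = iter(get_annos(infile1))   # consume without mutating the list
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                # next() raises StopIteration if we run out of annotations
                output.write('%s %s\n' % (line.split()[0], next(annos)))
            else:
                output.write(line)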
There is a great Python library for FASTA and other DNA file parsing: Biopython. It is very helpful in bioinformatics, and you can manipulate the data however you need.
Here is a simple example taken from the library's website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
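If you go the Biopython route, here is a hedged sketch of the header rewrite itself, reusing get_annos from above (file names follow the question). The FASTA writer emits '>id description', so replacing the description replaces everything after the first word, and an empty string gives a bare '>seqN' header:

from Bio import SeqIO

annos = get_annos('testanno.out')   # list of annotation strings, as before
records = []
for record, anno in zip(SeqIO.parse('testseq.fasta', 'fasta'), annos):
    record.description = anno       # '' leaves just the sequence ID
    records.append(record)
SeqIO.write(records, 'out.txt', 'fasta')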
EDIT:
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time and a lot of memory.
#!/usr/bin/python

# Script takes an unedited FASTA file, removes seq length and
# other header info, and adds the annotation after the sequence name.
# run as: $ python addanno.py testanno.out testseq.fasta out.txt

import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1) # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)
This question already has answers here:
Search and replace a line in a file in Python
How do I modify a text file in Python?
Closed 10 years ago.
I have an input file that I need to rewrite, with certain lines modified, before running a program. I have tried a variety of the solutions on here, but none of them seem to work; I end up just overwriting my file with a blank one.
import re

f = open(filename, 'r+')
text = f.read()
text = re.sub('foobar', 'bar', text)
f.seek(0)
f.write(text)
f.truncate()
f.close()
Also, with that code, the name I am changing is different each time I run the program, so I need to replace the entire line, not just one keyword.
A simple way may be to read the text into a string, then concatenate the string with the text you want to write:
infile = open('hey.txt', 'r+')
content = infile.read()
text = ['foo', 'bar']
for item in text:
    content += item   # adds 'foo' on the first iteration, 'bar' on the second
infile.seek(0)        # rewind first, or the old contents get duplicated
infile.write(content)
infile.close()
or to change a particular keyword:
infile = open('hey.txt', 'r+')
content = infile.read()
content = content.replace('foo', 'bar')   # replaces every 'foo' with 'bar'
infile.seek(0)      # rewind before writing
infile.write(content)
infile.truncate()   # drop any leftover bytes if the new text is shorter
infile.close()
or to change by line, you can use readlines() and refer to each line by its index in the list:
infile = open('hey.txt', 'r+')
content = infile.readlines()    # reads the file into a list of lines
content[1] = 'This is a new line\n'   # replaces the content of the 2nd line (index 1)
infile.seek(0)                  # rewind before writing
infile.writelines(content)      # write() takes a string; writelines() takes the list
infile.truncate()
infile.close()
Maybe not a particularly elegant way to solve the problem, but it can be wrapped up in a function, and the 'text' variable could be one of several data types (a dictionary, a list, etc.). There are also a number of ways to replace each line in a file; it just depends on what the criteria are for changing a line (are you searching for a character or word in the line, or replacing a line based on its position in the file?), so those are also things to consider.
Edit: Added quotes to third code sample
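Since the name being changed differs on every run, here is a hedged sketch of replacing a whole line by pattern rather than by keyword (the pattern and file name are only illustrative):

import re

with open('input.txt') as f:
    text = f.read()
# under re.MULTILINE, ^ and $ match at line boundaries, so the entire
# line that starts with 'name' is replaced, whatever follows it
text = re.sub(r'^name.*$', 'name = newvalue', text, flags=re.MULTILINE)
with open('input.txt', 'w') as f:
    f.write(text)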
Though ugly, this solution ends up working:
infile = open('file.txt', 'r+')
content = infile.readlines()    # reads line by line and outputs a list of each line
content[1] = "foo \n"           # replaces content of the 2nd line (index 1)
infile.close()                  # note the parentheses: infile.close alone does nothing

infile = open('file.txt', 'w')  # clears the content of the file
infile.close()

infile = open('file.txt', 'r+')
for item in content:            # rewrites file content from the list
    infile.write("%s" % item)
infile.close()
Thanks for all the help!!
I have a file with a bunch of text that I want to tear through, match a bunch of things and then write these items to separate lines in a new file.
This is the basics of the code I have put together:
import re

f = open('this.txt', 'r')
g = open('that.txt', 'w')
text = f.read()
matches = re.findall('', text) # do some re matching here
for i in matches:
    a = i[0] + '\n'
    g.write(a)
f.close()
g.close()
My issue is that I want each matched item on a new line (hence the '\n'), but I don't want a blank line at the end of the file.
In other words, the last item in the file should not be followed by a newline character.
What is the Pythonic way of sorting this out? Also, is the way I have set this up in my code the best, or most Pythonic, way of doing it?
If you want to write out a sequence of lines with newlines between them, but no newline at the end, I'd use str.join. That is, replace your for loop with this:
output = "\n".join(i[0] for i in matches)
g.write(output)
In order to avoid having to close your files explicitly, especially if your code might be interrupted by exceptions, you can use the with statement to make things simpler. The following code replaces the entire code in your question:
with open('this.txt') as f, open('that.txt', 'w') as g:
    text = f.read()
    matches = re.findall('', text) # do some re matching here
    g.write("\n".join(i[0] for i in matches))
or, since you don't need both files open at the same time:
with open('this.txt') as f:
    text = f.read()
matches = re.findall('', text) # do some re matching here
with open('that.txt', 'w') as g:
    g.write("\n".join(i[0] for i in matches))
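As a concrete illustration (the question leaves the pattern blank): when the pattern contains groups, re.findall returns tuples, which is why the code indexes i[0]. A small assumed example:

import re

text = 'id=12 name=foo\nid=34 name=bar'
matches = re.findall(r'id=(\d+) name=(\w+)', text)
# matches == [('12', 'foo'), ('34', 'bar')]
print("\n".join(i[0] for i in matches))   # prints 12 and 34, no trailing newline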
This question already has answers here:
How to read a file without newlines?
(12 answers)
Closed 6 years ago.
My file is "xml.txt" with the following contents:
books.xml
news.xml
mix.xml
If I use the readline() function, it keeps the "\n" at the end of each file name, which is an error, because I want to open the files listed inside xml.txt. I wrote this:
fo = open("xml.tx","r")
for i in range(count.__len__()): #here count is one of may arrays that i'm using
file = fo.readline()
find_root(file) # here find_root is my own created function not displayed here
The error encountered on running this code:
IOError: [Errno 2] No such file or directory: 'books.xml\n'
To remove just the newline at the end:
line = line.rstrip('\n')
The reason readline keeps the newline character is so you can distinguish between an empty line (has the newline) and the end of the file (empty string).
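A quick sketch of that distinction, reusing the question's file and find_root:

with open('xml.txt') as fo:
    while True:
        line = fo.readline()
        if line == '':      # empty string means end of file
            break
        if line == '\n':    # a blank line still carries its newline
            continue
        find_root(line.rstrip('\n'))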
From Best method for reading newline delimited files in Python and discarding the newlines?
lines = open(filename).read().splitlines()
You could use the .rstrip() method of string objects to get a version with trailing whitespace (including newlines) removed.
E.g.:
find_root(file.rstrip())
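Note that with no argument, rstrip() removes all trailing whitespace, not just the newline; compare:

>>> 'books.xml \t\n'.rstrip()
'books.xml'
>>> 'books.xml \t\n'.rstrip('\n')
'books.xml \t'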
I timed it out of curiosity. Below are the results for a very large file.
tl;dr:
File read then split seems to be the fastest approach on a large file.
with open(FILENAME, "r") as file:
lines = file.read().split("\n")
However, if you need to loop through the lines anyway then you probably want:
with open(FILENAME, "r") as file:
for line in file:
line = line.rstrip("\n")
Python 3.4.2
import timeit

FILENAME = "mylargefile.csv"
DELIMITER = "\n"

def splitlines_read():
    """Read the file, then split the lines with the splitlines builtin method.

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = file.read().splitlines()
    return lines
# end splitlines_read

def split_read():
    """Read the file, then split the lines.

    This method will return empty strings for blank lines (same as the other methods).
    This method may also have an extra additional element as an empty string (compared to
    splitlines_read).

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = file.read().split(DELIMITER)
    return lines
# end split_read

def strip_read():
    """Loop through the file and create a new list of lines, removing any "\n" with rstrip.

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = [line.rstrip(DELIMITER) for line in file]
    return lines
# end strip_read

def strip_readlines():
    """Loop through the file's readlines() and create a new list of lines, removing any "\n"
    with rstrip. ... will probably be slower than strip_read, but might as well test everything.

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = [line.rstrip(DELIMITER) for line in file.readlines()]
    return lines
# end strip_readlines

def compare_times():
    run = 100
    splitlines_t = timeit.timeit(splitlines_read, number=run)
    print("Splitlines Read:", splitlines_t)
    split_t = timeit.timeit(split_read, number=run)
    print("Split Read:", split_t)
    strip_t = timeit.timeit(strip_read, number=run)
    print("Strip Read:", strip_t)
    striplines_t = timeit.timeit(strip_readlines, number=run)
    print("Strip Readlines:", striplines_t)
# end compare_times

def compare_values():
    """Compare the values of the file.

    Note: split_read fails because it has an extra empty string in the list of lines.
    That's the only reason why it fails.
    """
    splr = splitlines_read()
    sprl = split_read()
    strr = strip_read()
    strl = strip_readlines()
    print("splitlines_read")
    print(repr(splr[:10]))
    print("split_read", splr == sprl)
    print(repr(sprl[:10]))
    print("strip_read", splr == strr)
    print(repr(strr[:10]))
    print("strip_readlines", splr == strl)
    print(repr(strl[:10]))
# end compare_values

if __name__ == "__main__":
    compare_values()
    compare_times()
Results:
run = 1000
Splitlines Read: 201.02846901328783
Split Read: 137.51448011841822
Strip Read: 156.18040391519133
Strip Readlines: 172.12281272950372
run = 100
Splitlines Read: 19.956802833188124
Split Read: 13.657361738959867
Strip Read: 15.731161020969516
Strip Readlines: 17.434831199281092
run = 100
Splitlines Read: 20.01516321280158
Split Read: 13.786344555543899
Strip Read: 16.02410587620824
Strip Readlines: 17.09326775703279
File read then split seems to be the fastest approach on a large file.
Note: read then split("\n") will have an extra empty string at the end of the list.
Note: read then splitlines() checks for more than just "\n", e.g. "\r\n" as well.
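A small demonstration of both notes:

>>> 'a\nb\n'.split('\n')
['a', 'b', '']
>>> 'a\r\nb\n'.splitlines()
['a', 'b']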
It's better style to use a context manager for the file, and len() instead of calling .__len__():
with open("xml.txt", "r") as fo:
    for i in range(len(count)): # here count is one of my arrays that I'm using
        file = next(fo).rstrip("\n")
        find_root(file) # here find_root is my own function, not shown here
To remove the newline character from the end, you could also use something like this:
for line in file:
    print(line[:-1])
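Be aware that the slice assumes every line ends in '\n'; the last line of a file may not, in which case its final character is lost. For example:

>>> 'mix.xml'[:-1]           # last line without a trailing newline
'mix.xm'
>>> 'mix.xml\n'.rstrip('\n') # rstrip is safe either way
'mix.xml'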
A use case with @Lars Wirzenius's answer:
with open("list.txt", "r") as myfile:
    for lines in myfile:
        lines = lines.rstrip('\n') # the trick
        try:
            with open(lines) as myFile:
                print("ok")
        except IOError as e:
            print("file does not exist")
# mode: 'r', 'w', 'a'
f = open("ur_filename", "mode")
fn = open("out_filename", "w")   # output file; the name is a placeholder like ur_filename
for t in f:
    if t:
        fn.write(t.rstrip("\n"))
The "if" condition checks whether the line is non-empty; if it is, the next line strips the "\n" from the end and writes the result to the output file.
Code tested. ;)