Write output to text file - Python
I've checked and compared with other questions here, but I didn't find a solution.
I open a text file and extract IP addresses from it, but I can't write that information to a new text file.
My output file contains only the result from the last line of my log, which is just [].
My second question: I'd like to group identical IP addresses before writing them to the new text file.
import re

in_file = open("D:\BLOCK\log")
out_file = open("D:\BLOCK\output.txt", "w")
for line in in_file:
    ipki = re.findall(r'[0-9]+(?:\.[0-9]+){3}', line)
    print(ipki)
out_file.write(str(ipki))
out_file.close()
The output of print(ipki) looks like this:
['70.31.28.181']
['70.31.28.181']
['70.31.28.181']
['130.43.58.196']
['130.43.58.196']
['130.43.58.196']
[]
[]
[]
As mentioned in the comments, your problem is that you keep replacing ipki in the for loop and only write its final value at the end. re.findall returns a list of zero or more matched strings; since your output file contains the string representation of an empty list ("[]"), the last line of the input file had no match.
You could add code to collect the found IPv4 addresses into a master list, but since re.findall can process large blocks of text, it's easier to read the entire file at once and let it do the lifting for you. Once you have the list, you can use set to get rid of duplicates before writing the result file.
>>> import re
>>> with open('log') as fp:
... ip4addrs = set(re.findall(r'[0-9]+(?:\.[0-9]+){3}', fp.read()))
...
>>> with open('output.txt', 'w') as fp:
... fp.write('\n'.join(ip4addrs))
...
26
>>> print(ip4addrs)
{'130.43.58.196', '70.31.28.181'}
>>> print(open('output.txt').read())
130.43.58.196
70.31.28.181
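For the second question, grouping identical addresses, collections.Counter can tally how many times each IP occurs before you write the file. A small sketch over inline sample text (the log lines here are made up):

```python
import re
from collections import Counter

# Hypothetical sample standing in for the real log file
log_text = (
    "Jan 1 rejected 70.31.28.181\n"
    "Jan 1 rejected 70.31.28.181\n"
    "Jan 2 rejected 130.43.58.196\n"
)

# Tally each IPv4-looking match across the whole text
counts = Counter(re.findall(r'[0-9]+(?:\.[0-9]+){3}', log_text))

# most_common() yields (ip, count) pairs, most frequent first
for ip, n in counts.most_common():
    print('%s appeared %d times' % (ip, n))
```

Writing `'\n'.join('%s %d' % pair for pair in counts.most_common())` to the output file then gives one grouped line per address.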
Related
Parse text file which groups data
Trying to figure out how to extract strings and put them into a new file, on a new line for each string. I can't get my head around regex, and everything I'm looking at online shows the data all being on one line, but mine is already separated. I'm trying to parse the output of another program; it outputs three lines (Date, Address, Name), then a newline, then another set of three, and I only need Address.

fo = open("C:\Sampledata.txt", "r")
item = fo.readlines()

Not even got anything working yet!
outList = []
inText = open("C:\Sampledata.txt", "r").read()
for line in inText.split("\n"):
    Date, Address, Name = line.split(",")
    outList.append(Address)
outText = "\n".join(outList)
open("outFile.txt", "w").write(outText)
I'm not quite sure if this addresses your problem, but maybe something like:

addresses = list()
with open("file1", "r") as input:
    for line in input:
        if line.startswith("Address"):
            addresses.append(line.strip("\n"))

Edit: Or, if "Address" is only contained once per file, you can break the loop after detecting the line starting with "Address":

addresses = list()
with open("file1", "r") as input:
    for line in input:
        if line.startswith("Address"):
            addresses.append(line.strip("\n"))
            break

Then you can write all addresses into a new file:

with open("newFile", "w") as outfile:
    for address in addresses:
        outfile.write(address + "\n")
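If you'd rather keep the comma-splitting idea from the first attempt, guarding against blank separator lines avoids the unpacking error it would hit. A sketch over made-up record text in the three-line format the question describes:

```python
# Made-up sample mirroring the described format: three lines per record
# (Date, Address, Name), records separated by a blank line
records = "01/02/2013\n10 Main St\nSmith\n\n02/02/2013\n5 Oak Ave\nJones\n"

addresses = []
for block in records.split("\n\n"):
    lines = [l for l in block.splitlines() if l]   # drop empty lines
    if len(lines) == 3:                            # Date, Address, Name
        addresses.append(lines[1])                 # keep only the Address

print(addresses)
```

The same loop works on a real file by replacing the `records` string with `open("C:\Sampledata.txt").read()`.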
Reading data from multiple lines as a single item
I have a set of data from a file such as:

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"="gotwastedatthehouse"
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00
[mattplayhouse\wherecanwego\tothepoolhall]

How can I read/reference the text of each "johnnyboy"=splice(23) entry as a single line, like this:

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00

I am currently matching the regex based on splice(23): with a search as follows:

re_johnny = re.compile('splice')
with open("file.txt", 'r') as file:
    read = file.readlines()
    for line in read:
        if re_johnny.match(line):
            print(line)

I think I need to remove the backslashes and the spaces to merge the lines, but I am unfamiliar with how to do that without picking up the blank lines or the line that doesn't match my regex. When trying the first solution attempt, my last row was pulled inappropriately. Any assistance would be great.
Input file (fin):

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00

"johnnyboy"="gotwastedatthehouse"

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,\
00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,\
77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,\
00,2e,00,77,00,61,00,76,00,ff,00

[mattplayhouse\wherecanwego\tothepoolhall]

Adding to tigerhawk's suggestion, you can try something like this:

import re

with open('fin', 'r') as f:
    for l in [''.join([b.strip('\\') for b in a.split()]) for a in f.read().split('\n\n')]:
        if 'splice' in l:
            print(l)

Output:

"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
"johnnyboy"=splice(23):15,00,30,00,31,00,32,02,39,00,62,00,a3,00,33,00,2d,0f,39,00,00,5c,00,6d,00,65,00,64,00,69,00,61,00,5c,00,57,00,69,00,6e,00,64,00,6f,00,77,00,73,00,20,00,41,00,61,00,63,00,6b,00,65,aa,72,00,6f,00,75,00,6e,dd,64,00,2e,00,77,00,61,00,76,00,ff,00
With regex you have multiplied your problems. Instead, keep it simple:

If a line starts with ", it begins a record.
Else, append it to the previous record.

You can implement parsing for such a scheme in just a few lines of Python, and you don't need regex.
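That scheme really does fit in a few lines. A sketch over an inline list standing in for the file's lines (the data here is abbreviated from the question):

```python
# Inline stand-in for the file's lines; continuation lines end with a backslash
raw_lines = [
    '"johnnyboy"=splice(23):15,00,30,00,\\',
    '31,00,32,02,ff,00',
    '"johnnyboy"="gotwastedatthehouse"',
]

records = []
for line in raw_lines:
    line = line.rstrip('\\')      # drop the trailing continuation backslash
    if line.startswith('"'):      # a leading quote begins a new record
        records.append(line)
    elif records:                 # otherwise it continues the previous record
        records[-1] += line

print(records)
```

For the real file you would iterate over `open('fin')` and `.strip()` each line first so the trailing newline does not end up inside the joined record.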
Adding each item in list to end of specific lines in FASTA file
I solved this in the comments below. Essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file. Hard to explain, but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made. For example:

File1:

>seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC

I want it to keep ">seq#" but replace everything after with the next item in the list below:

mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']

Result (modified File1):

>seq1 things1
AATATTATA
ATATATATA
>seq2    # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC

As you can see, I want it to use even the blank items in the list. So once again: I want to parse this FASTA file, and every time it gets to a header (there are thousands), replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down. Also, an important note: you don't close your file handles. This could result in errors, specifically when writing to a file; either way it's bad practice.

#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:   # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])   # added tab as separator
    return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)   # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]    # split on whitespace, keep the first field
                line_split.append(annos.pop(0))   # append annotation to the current id line
                output.write(' '.join(line_split) + '\n')   # join and write with a newline
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)

This is not perfect, but it cleans things up a bit. I'd veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
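One way to drop the ordering assumption that pop() makes is to key the annotations by sequence ID and look them up per header. A sketch with made-up IDs, assuming the annotation file also carries the sequence ID somewhere (the column layout here is hypothetical):

```python
# Hypothetical (sequence id, annotation) pairs, e.g. parsed from the anno file
anno_rows = [('seq1', 'things1'), ('seq2', ''), ('seq3', 'things3')]
annos = dict(anno_rows)

# Stand-ins for FASTA header lines
headers = ['>seq1 unwanted here', '>seq3 more junk']

for h in headers:
    seq_id = h.split()[0].lstrip('>')             # '>seq1 unwanted here' -> 'seq1'
    print('>' + seq_id, annos.get(seq_id, ''))    # lookup by ID, order-independent
```

With a dict, the two files no longer have to list sequences in the same order, and a missing annotation simply falls back to the empty string.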
There is a great library in Python for FASTA and other DNA file parsing, Biopython. It is very helpful in bioinformatics, and you can also manipulate the data according to your needs. Here is a simple example taken from the library's website:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

You should get something like this on your screen:

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
***********EDIT*********
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time / lots of memory.

#!/usr/bin/python
# Script takes unedited FASTA file, removes seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt

import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)  # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)
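On the "without writing everything to a new file" question: the stdlib fileinput module can edit a file "in place". It still writes a replacement copy behind the scenes, so disk writes don't go away, but you never hold the whole file in memory and you don't manage a second file handle by hand. A sketch on a small made-up FASTA file (the filename and annotation list are invented for the demo):

```python
import fileinput

# Create a small hypothetical FASTA file to edit
with open('demo.fasta', 'w') as f:
    f.write('>seq1 unwanted here\nAATAT\n>seq2 more junk\nGTGTG\n')

annos = iter(['things1', 'things2'])   # hypothetical annotation list

# With inplace=True, anything print()ed replaces the file's contents
with fileinput.input('demo.fasta', inplace=True) as f:
    for line in f:
        if line.startswith('>'):
            print(line.split()[0], next(annos))   # keep the ID, append the annotation
        else:
            print(line, end='')                   # pass sequence lines through unchanged
```

After the loop, demo.fasta contains the rewritten headers with the sequence lines untouched.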
Referring to a list of names using Python
I am new to Python, so please bear with me. I can't get this little script to work properly:

genome = open('refT.txt','r')

datafile - a reference genome with a bunch (2 million) of contigs:

Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA

The file is opened:

cont_list = open('dataT.txt','r')

a list of contigs that I want to extract from the dataset listed above:

Contig_01
Contig_02
Contig_03
Contig_05

My hopeless script:

for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')

The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05". I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you tuples of (contig, sequence):

def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)

Now I would use that to get the desired elements:

wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        name, seq = next(pairs)
        name = name.strip()     # drop the trailing newline before comparing
        if name in wanted:
            # write to output file, store in a list or dict, ...
            wanted.remove(name)
I would recommend several things:

Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you, and it encourages you to handle all of your file IO in one place.
Try to read all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.

Here's some example code that might do what you're looking for:

from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional; remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like:

def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)

This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches a desired contig, it grabs the next line and appends it to sequences after stripping out the unwanted whitespace characters. From there, writing the grabbed sequences to an output file is pretty straightforward.

Example output:

['TGCAGGTAAAAAACTGTCACCTGCTGGT', 'TGCAGGTCTTCCCACTTTATGATCCCTTA', 'TGCAGTGTGTCACTGGCCAAGCCCAGCGC', 'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
Python using re module to parse an imported text file
def regexread():
    import re
    result = ''
    savefileagain = open('sliceeverfile3.txt','w')
    #text=open('emeverslicefile4.txt','r')
    text='09,11,14,34,44,10,11, 27886637, 0\n561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n559, Tue,29,Jan,2013,'
    pattern='\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
    #with open('emeverslicefile4.txt') as text:
    f = re.findall(pattern, text)
    for item in f:
        print(item)
        savefileagain.write(item)
    #savefileagain.close()

The above function, as written, parses the text and returns sets of seven numbers. I have three problems. Firstly, the 'read' file, which contains exactly the same text as text='09,...etc', raises a TypeError: expected string or buffer, which I cannot solve even by reading some of the posts. Secondly, when I try to write results to the 'write' file, nothing is written. Thirdly, I am not sure how to get the same output that I get with the print statement, which is three lines of seven numbers each, and that is the output I want.
This should do the trick:

import re

filename = 'sliceeverfile3.txt'
pattern = '\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
new_file = []

# Make sure file gets closed after being iterated
with open(filename, 'r') as f:
    # Read the file contents and generate a list with each line
    lines = f.readlines()

# Iterate each line
for line in lines:
    # Regex applied to each line
    match = re.search(pattern, line)
    if match:
        # Make sure to add \n to display correctly when we write it back
        new_line = match.group() + '\n'
        print(new_line)
        new_file.append(new_line)

with open(filename, 'w') as f:
    # go to start of file
    f.seek(0)
    # actually write the lines
    f.writelines(new_file)
You're sort of on the right track. You'll iterate over the file (see "How to iterate over the file in python") and apply the regex to each line. The link above should really answer all three of your questions once you realize you're trying to write 'item', which doesn't exist outside of that loop.
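In other words, do the write inside the loop so each match is recorded as it is found. A sketch over two of the inline lines from the question (the output filename here is made up):

```python
import re

pattern = r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
text = ('561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n'
        '560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n')

with open('matches.txt', 'w') as out:        # hypothetical output file
    for line in text.splitlines():
        m = re.search(pattern, line)
        if m:
            out.write(m.group() + '\n')      # write each match as it is found

print(open('matches.txt').read())
```

The with block also closes the output file for you, which fixes the "nothing gets written" symptom caused by never calling close() on the handle.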