I have a large number of entries in a file. Let me call it file A.
File A:
('aaa.dat', 'aaa.dat', 'aaa.dat')
('aaa.dat', 'aaa.dat', 'bbb.dat')
('aaa.dat', 'aaa.dat', 'ccc.dat')
I want to use these entries, line by line, in a program that would iteratively pick an entry from file A, concatenate the files in this way:
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat'] ###entry number 3
with open('out.dat', 'w') as outfile: ###the name has to be aaa-aaa-ccc.dat
for fname in filenames:
with open(fname) as infile:
outfile.write(infile.read().strip())
All I need to do is to substitute the filenames iteratively and create an output in a "aaa-aaa-aaa.dat" format. I would appreciate any help-- feeling a bit lost!
Many thanks!!!
You can retrieve and modify the file names in the following way:
import re
pattern = re.compile('\W')
with open('fnames.txt', 'r') as infile:
for line in infile:
line = (re.sub(pattern, ' ', line)).split()
# Old filenames - to concatenate contents
content = [x + '.dat' for x in line[::2]];
# New filename
new_name = ('-').join(line[::2]) + '.dat'
# Write the concatenated content to the new
# file (first read the content all at once)
with open(new_name, 'w') as outfile:
for con in content:
with open(con, 'r') as old:
new_content = old.read()
outfile.write(new_content)
This program reads your input file, here named fnames.txt with the exact structure from your post, line by line. For each line it splits the entries using a precompiled regex (precompiling regex is suitable here and should make things faster). This assumes that your filenames are only alphanumeric characters, since the regex substitutes all non-alphanumeric characters with a space.
It retrieves only 'aaa' and dat entries as a list of strings for each line and forms a new name by joining every second entry starting from 0 and adding a .dat extension to it. It joins using a - as in the post.
It then retrieves the individual file names from which it will extract the content into a list content by selecting every second entry from line.
Finally, it reads each of the files in content and writes them to the common file new_name. It reads each of them all at ones which may be a problem if these files are big and in general there may be more efficient ways of doing all this. Also, if you are planning to do more things with the content from old files before writing, consider moving the old file-specific operations to a separate function for readability and any potential debugging.
Something like this:
with open(fname) as infile, open('out.dat', 'w') as outfile:
for line in infile:
line = line.strip()
if line: # not empty
filenames = eval(line.strip()) # read tuple
filenames = [f[:-4] for f in filenames] # remove extension
filename = '-'.join(filenames) + '.dat' # make filename
outfile.write(filename + '\n') # write
If your problem is just calculating the new filenames, how about using os.path.splitext?
'-'.join([
f[0] for f in [os.path.splitext(path) for path in filenames]
]) + '.dat'
Which can be probably better understood if you see it like this:
import os
clean_fnames = []
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']
for fname in filenames:
name, extension = os.path.splitext(fname)
clean_fnames.append(name)
name_without_ext = '-'.join(clean_fnames)
name_with_ext = name_without_ext + '.dat'
print(name_with_ext)
HOWEVER: If your issue is that you can not get the filenames in a list by reading the file line by line, you must keep in mind that when you read files, you get text (strings) NOT Python structures. You need to rebuild a list from a text like: "('aaa.dat', 'aaa.dat', 'aaa.dat')\n".
You could take a look to ast.literal_eval or try to rebuild it yourself. The code below outputs a lot of messages to show what's happening:
import pprint
collected_fnames = []
with open('./fileA.txt') as f:
for line in f:
print("Read this (literal) line: %s" % repr(line))
line_without_whitespaces_on_the_sides = line.strip()
if not line_without_whitespaces_on_the_sides:
print("line is empty... skipping")
continue
else:
line_without_parenthesis = (
line_without_whitespaces_on_the_sides
.lstrip('(')
.rstrip(')')
)
print("Cleaned parenthesis: %s" % line_without_parenthesis)
chunks = line_without_parenthesis.split(', ')
print("Collected %s chunks in a %s: %s" % (len(chunks), type(chunks), chunks))
chunks_without_quotations = [chunk.replace("'", "") for chunk in chunks]
print("Now we don't have quotations: %s" % chunks_without_quotations)
collected_fnames.append(chunks_without_quotations)
print("collected %s lines with filenames:\n%s" %
(len(collected_fnames), pprint.pformat(collected_fnames)))
Related
I have a folder of 400 .txt files and am attempting to take the sixth line from every file in the directory, and output each line all into a new singular .txt file with the sixth line from each file listed one after the other in the new file. For example, the output I am attempting to create should look like:
**output.txt**
This is the sixth line from 1.txt
This is the sixth line from 2.txt
This is the sixth line from 3.txt
So far I'm able to print off all the files in the directory in a list to be acted upon with:
import os
entries = os.listdir(r'C:/Users/defaultuser/Desktop/UprocScripts')
for entry in entries:
print(entry)
I have researched and tried various combinations of the readlines() method, but I'm not sure exactly how to combine them in multiples over an entire directory of 400 files. I'm still trying to learn, any ideas if I'm on the right path and how to combine them is appreciated.
Here is another way if you want to use for loop for iterating over your text file and pick a specific line.In this code all the .txt files are fetched at the beginning.
import glob
list_of_txt = glob.glob(r"C:\Users\defaultuser\Desktop\UprocScripts\*.txt")
for textfiles in list_of_txt:
with open(r"C:\Users\defaultuser\Desktop\UprocScripts\final.txt", 'a+') as final_text_file:
with open(textfiles, 'r') as textFile:
for n, line in enumerate(textFile):
if n+1 == 6: # if it's line no. 6 then write it on your final txt file
final_text_file.writelines(line)
Also note that I am using the glob module here. In addition if you want to add "from some.txt" after each line then just replace the last line with this:
final_text_file.write(line.strip() + " from " + textfiles.split('\\')[-1] + "\r\n")
You need to read each file, get the sixth line from each of them, then write that line to the output file.
Like so:
import os
entries = os.listdir(r'C:/Users/defaultuser/Desktop/UprocScripts')
for entry in entries:
with open('output.txt', 'w') as out_file:
with open(entry) as text_file:
lines = text_file.readlines()
target_line = lines[5] # sixth line
out_file.write(target_line)
Note this does read the complete file for each of the input files- which might be inefficient. You can try to get around that by trying to utilize the hint parameter to readlines - which accepts an approximate number of bytes to read until. If you know the apprx size of each line (in bytes) you can pass 6 * line_size as hint to try & optimize the read part.
You don't need to read all the file, you can read only the first 6 lines like this:
import os
entries = os.listdir(r'C:/Users/defaultuser/Desktop/UprocScripts')
final = []
for entry in entries
# Read the first 6 lines and add the last one (you don't need to read everything):
with open(entry) as f:
lines = []
for _ in range(6):
lines.append(f.readline())
final.append(lines[-1])
# And write
with open("final.txt", "r") as f:
f.writelines(final)
import os
files_list = []
sixth_line_list = []
output_list = []
directory = 'C:\\Users\\defaultuser\\Desktop\\UprocScripts'
for file in os.listdir(directory):
if file.endswith('.txt'):
files_list.append(''.join([directory, '\\', file]))
for file in files_list:
with open(file, 'r') as file_:
sixth_line_list.append({file: file_.readlines()[5]})
for i in range(0, len(sixth_line_list), 1):
output_list.append(''.join([sixth_line_list[i].values()[0], ' from ', sixth_line_list[i].keys()[0]]))
with open(''.join([directory, '\\output.txt']), 'w') as output:
output.writelines(output_list)
I am trying to search a large group of text files (160K) for a specific string that changes for each file. I have a text file that has every file in the directory with the string value I want to search. Basically I want to use python to create a new text file that gives the file name, the string, and a 1 if the string is present and a 0 if it is not.
The approach I am using so far is to create a dictionary from a text file. From there I am stuck. Here is what I figure in pseudo-code:
**assign dictionary**
d = {}
with open('file.txt') as f:
d = dict(x.rstrip().split(None, 1) for x in f)
**loop through directory**
for filename in os.listdir(os.getcwd()):
***here is where I get lost***
match file name to dictionary
look for string
write filename, string, 1 if found
write filename, string, 0 if not found
Thank you. It needs to be somewhat efficient since its a large amount of text to go through.
Here is what I ended up with
d = {}
with open('ibes.txt') as f:
d = dict(x.rstrip().split(None, 1) for x in f)
import os
for filename in os.listdir(os.getcwd()):
string = d.get(filename, "!##$%^&*")
if string in open(filename, 'r').read():
with open("ibes_in.txt", 'a') as out:
out.write("{} {} {}\n".format(filename, string, 1))
else:
with open("ibes_in.txt", 'a') as out:
out.write("{} {} {}\n".format(filename, string, 0))
As I understand your question, the dictionary relates file names to strings
d = {
"file1.txt": "widget",
"file2.txt": "sprocket", #etc
}
If each file is not too large you can read each file into memory:
for filename in os.listdir(os.getcwd()):
string = d[filename]
if string in open(filename, 'r').read():
print(filename, string, "1")
else:
print(filename, string, "0")
This example uses print, but you could write to a file instead. Open the output file before the loop outfile = open("outfile.txt", 'w') and instead of printing use
outfile.write("{} {} {}\n".format(filename, string, 1))
On the other hand, if each file is too large to fit easily into memory, you could use a mmap as described in Search for string in txt file Python
I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = "['things1', '', 'things3', 'things4', '' 'things6', 'things7']"
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines so I've edited down to use a few less lines. Also, an important note is that you don't close your file handles. This could result in errors, specifically when writing to file, either way it's bad practice. code:
#!/usr/bin/python
import sys
# gets list of annotations
def get_annos(infile):
with open(infile, 'r') as fh: # makes sure the file is closed properly
annos = []
for line in fh:
annos.append( line.split('\t')[5] ) # added tab as separator
return annos
# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
annos = get_annos(infile1) # contains list of annos
with open(infile2, 'r') as f2, open(outfile, 'w') as output:
for line in f2:
if line.startswith('>'):
line_split = list(line.split()[0]) # split line on whitespace and store first element in list
line_split.append(annos.pop(0)) # append data of interest to current id line
output.write( ' '.join(line_split) + '\n' ) # join and write to file with a newline character
else:
output.write(line)
anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]
add_annos(anno, seq, out)
get_annos(anno)
This is not perfect but it cleans things up a bit. I'd might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
There is a great library in python for Fasta and other DNA file parsing. It is totally helpful in Bioinformatics. You can also manipulate any data according to your need.
Here is a simple example extracted from the library website:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
print(seq_record.id)
print(repr(seq_record.seq))
print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
***********EDIT*********
I solved this before anyone could help. This is my code, can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? Seems like it would take a long time/lots of memory.
#!/usr/bin/python
# Script takes unedited FASTA file, removed seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys
# gets list of annotations
def get_annos(infile):
f = open(infile)
list2 = []
for line in f:
columns = line.strip().split('\t')
list2.append(columns[5])
return list2
# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
mylist = get_annos(infile1) # contains list of annos
f2 = open(infile2, 'r')
output = open(out, 'w')
for line in f2:
if line.startswith('>'):
l = line.partition(" ")
list3 = list(l)
del list3[1:]
list3.append(' ')
list3.append(mylist.pop(0))
final = ''.join(list3)
line = line.replace(line, final)
output.write(line)
output.write('\n')
else:
output.write(line)
anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]
add_annos(anno, seq, out)
get_annos(anno)
I have a question about reading in a .txt rile and taking the string from inside to be used later on in the code.
If I have a file called 'file0.txt' and it contains:
file1.txt
file2.txt
The rest of the files either contain more string file names or are empty.
How can I save both of these strings for later use. What I attempted to do was:
infile = open(file, 'r')
line = infile.readline()
line.split('\n')
But that returned the following:
['file1.txt', '']
I understand that readline only reads one line, but I thought that by splitting it by the return key it would also grab the next file string.
I am attempting to simulate a file tree or to show which files are connected together, but as it stands now it is only going through the first file string in each .txt file.
Currently my output is:
File 1 crawled.
File 3 crawled.
Dead end reached.
My hope was that instead of just recursivley crawling the first file it would go through the entire web, but that goes back to my issue of not giving the program the second file name in the first place.
I'm not asking for a specific answer, just a push in the right direction on how to better handle the strings from the files and be able to store both of them instead of 1.
My current code is pretty ugly, but hopefully it gets the idea across, I will just post it for reference to what I'm trying to accomplish.
def crawl(file):
infile = open(file, 'r')
line = infile.readline()
print(line.split('\n'))
if 'file1.txt' in line:
print('File 1 crawled.')
return crawl('file1.txt')
if 'file2.txt' in line:
print('File 2 crawled.')
return crawl('file2.txt')
if 'file3.txt' in line:
print('File 3 crawled.')
return crawl('file3.txt')
if 'file4.txt' in line:
print('File 4 crawled.')
return crawl('file4.txt')
if 'file5.txt' in line:
print('File 5 crawled.')
return crawl('file5.txt')
#etc...etc...
else:
print('Dead end reached.')
Outside the function:
file = 'file0.txt'
crawl(file)
Using read() or readlines() will help. e.g.
infile = open(file, 'r')
lines = infile.readlines()
print list(lines)
gives
['file1.txt\n', 'file2.txt\n']
or
infile = open(file, 'r')
lines = infile.read()
print list(lines.split('\n'))
gives
['file1.txt', 'file2.txt']
Readline only gets one line from the file so it has a newline at the end. What you want is file.read() which will give you the whole file as a single string. Split that using newline and you should have what you need. Also remember that you need to save the list of lines as a new variable i.e. assign to your line.split('\n') action. You could also just use readlines which will get a list of lines from the file.
change readline to readlines. and no need to split(\n), its already a list.
here is a tutorial you should read
I prepared file0.txt with two files in it, file1.txt, with one file in it, plus file2.txt and file3.txt, which contained no data. Note, this won't extract values already in the list
def get_files(current_file, files=[]):
# Initialize file list with previous values, or intial value
new_files = []
if not files:
new_files = [current_file]
else:
new_files = files
# Read files not already in list, to the list
with open(current_file, "r") as f_in:
for new_file in f_in.read().splitlines():
if new_file not in new_files:
new_files.append(new_file.strip())
# Do we need to recurse?
cur_file_index = new_files.index(current_file)
if cur_file_index < len(new_files) - 1:
next_file = new_files[cur_file_index + 1]
# Recurse
get_files(next_file, new_files)
# We're done
return new_files
initial_file = "file0.txt"
files = get_files(initial_file)
print(files)
Returns: ['file0.txt', 'file1.txt', 'file2.txt', 'file3.txt']
file0.txt
file1.txt
file2.txt
file1.txt
file3.txt
file2.txt and file3.txt were blank
Edits: Added .strip() for safety, and added the contents of the data files so this can be replicated.
I work with spatial data that is output to text files with the following format:
COMPANY NAME
P.O. BOX 999999
ZIP CODE , CITY
+99 999 9999
23 April 2013 09:27:55
PROJECT: Link Ref
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Design DTM is 30MB 2.5X2.5
Stripping applied to design is 0.000
Point Number Easting Northing R.L. Design R.L. Difference Tol Name
3224808 422092.700 6096059.380 2.520 -19.066 -21.586 --
3224809 422092.200 6096059.030 2.510 -19.065 -21.575 --
<Remainder of lines>
3273093 422698.920 6096372.550 1.240 -20.057 -21.297 --
Average height difference is -21.390
RMS is 21.596
0.00 % above tolerance
98.37 % below tolerance
End of Report
As shown, the files have a header and a footer. The data is delimited by spaces, but not an equal amount between the columns.
What I need, is comma delimited files with Easting, Northing and Difference.
I'd like to prevent having to modify several hundred large files by hand and am writing a small script to process the files. This is what I have so far:
#! /usr/bin/env python
import csv,glob,os
from itertools import islice
list_of_files = glob.glob('C:/test/*.txt')
for filename in list_of_files:
(short_filename, extension )= os.path.splitext(filename)
print short_filename
file_out_name = short_filename + '_ed' + extension
with open (filename, 'rb') as source:
reader = csv.reader( source)
for row in islice(reader, 10, None):
file_out= open (file_out_name, 'wb')
writer= csv.writer(file_out)
writer.writerows(reader)
print 'Created file: '+ file_out_name
file_out.close()
print 'All done!'
Questions:
How can I let the line starting with 'Point number' become the header in the output file? I'm trying to put DictReader in place of the reader/writer bit but can't get it to work.
Writing the output file with delimiter ',' does work but writes a comma in place of each space, giving way too much empty columns in my output file. How do I circumvent this?
How do I remove the footer?
I can see a problem with your code, you are creating a new writer for each row; so you will end up only with the last one.
Your code could be something like this, without the need of CSV readers or writers, as it's simple enough to be parsed as simple text (problem would arise if you had text columns, with escaped characters and so on).
def process_file(source, dest):
found_header = False
for line in source:
line = line.strip()
if not header_found:
#ignore everything until we find this text
header_found = line.starswith('Point Number')
elif not line:
return #we are done when we find an empty line, I guess
else:
#write the needed columns
columns = line.split()
dest.writeline(','.join(columns[i] for i in (1, 2, 5)))
for filename in list_of_files:
short_filename, extension = os.path.splitext(filename)
file_out_name = short_filename + '_ed' + extension
with open(filename, 'r') as source:
with open(file_out_name. 'w') as dest:
process_file(source, dest)
This worked:
#! /usr/bin/env python
import glob,os
list_of_files = glob.glob('C:/test/*.txt')
def process_file(source, dest):
header_found = False
for line in source:
line = line.strip()
if not header_found:
#ignore everything until we find this text
header_found = line.startswith('Stripping applied') #otherwise, header is lost
elif not line:
return #we are done when we find an empty line
else:
#write the needed columns
columns = line.split()
dest.writelines(','.join(columns[i] for i in (1, 2, 5))+"\n") #newline character adding was necessary
for filename in list_of_files:
short_filename, extension = os.path.splitext(filename)
file_out_name = short_filename + '_ed' + ".csv"
with open(filename, 'r') as source:
with open(file_out_name, 'wb') as dest:
process_file(source, dest)
To answer to your first and last question: it is simply about ignoring the corresponding lines, i.e. not to write them to output. This corresponds to if not header_found and else if not line: blocks of fortran proposal.
Second point is that there is no dedicated delimiter in your file: you have one or more spaces, which makes it hard to be parsed using csv module. Using split() will parse each line and return the list of non-blank characters, and will therefore only return useful values.