Python: concatenating text files

Using Python, I'm seeking to iteratively combine two sets of txt files to create a third set of txt files.
I have a directory of txt files in two categories:
text_[number].txt (eg: text_0.txt, text_1.txt, text_2.txt....text_20.txt)
comments_[number].txt (eg: comments_0.txt, comments_1.txt, comments_2.txt...comments_20.txt).
I'd like to iteratively combine the text_[number] files with the matching comments_[number] files into a new file category feedback_[number].txt. The script would combine text_0.txt and comments_0.txt into feedback_0.txt, and continue through each pair in the directory. The number of text and comments files will always match, but the total number of text and comment files is variable depending on preceding scripts.
I can combine two pairs using the code below with a list of file pairs:
filenames = ['text_0.txt', 'comments_0.txt']
with open("feedback_0.txt", "w") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
However, I'm uncertain how to structure iteration for the rest of the files. I'm also curious how to generate lists from the contents of the file directory. Any advice or assistance on moving forward is greatly appreciated.

It would be far simpler (and possibly faster) to just fork a cat process:
import subprocess

n = ...  # number of file pairs
for i in range(n):
    with open(f'feedback_{i}.txt', 'w') as f:
        subprocess.run(['cat', f'text_{i}.txt', f'comments_{i}.txt'], stdout=f)
Or, if you already have lists of the file names:
for text, comment, feedback in zip(text_files, comment_files, feedback_files):
    with open(feedback, 'w') as f:
        subprocess.run(['cat', text, comment], stdout=f)
Unless these are all extremely small files, the cost of reading and writing the bytes will outweigh the cost of forking a new process for each pair.
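If you need something portable (cat is POSIX-only), a minimal pure-Python sketch using shutil.copyfileobj streams the bytes in chunks instead of reading each file fully into memory; the value of n is an assumption here, as above:
import shutil

n = 21  # assumed number of file pairs, e.g. text_0.txt .. text_20.txt
for i in range(n):
    with open(f'feedback_{i}.txt', 'wb') as outfile:
        for name in (f'text_{i}.txt', f'comments_{i}.txt'):
            with open(name, 'rb') as infile:
                shutil.copyfileobj(infile, outfile)  # copy in buffered chunks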

Maybe not the most elegant but...
length = 10
txt = [f"text_{n}.txt" for n in range(length)]
com = [f"comments_{n}.txt" for n in range(length)]
feed = [f"feedback_{n}.txt" for n in range(length)]

for f, t, c in zip(feed, txt, com):
    with open(f, "w") as outfile:
        with open(t) as infile1:
            contents = infile1.read()
            outfile.write(contents)
        with open(c) as infile2:
            contents = infile2.read()
            outfile.write(contents)
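Since the question says the number of pairs varies, length can be derived from the directory instead of hardcoded; a small sketch, assuming the files live in the current directory:
import os

# count the text_*.txt files; the question guarantees a matching comments file for each
length = sum(1 for name in os.listdir('.')
             if name.startswith('text_') and name.endswith('.txt'))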

There are many ways to achieve this, but I don't see any solution that's both beginner-friendly and takes into account the structure of the files you described.
You can iterate through the files, and for every text_[num].txt, fetch the corresponding comments_[num].txt and write to feedback_[num].txt as shown below. There's no need to add any counters or make any other assumptions about the files that might not always be true:
import os

srcpath = 'path/to/files'
for f in os.listdir(srcpath):
    if f.startswith('text'):
        index = f[5:-4]  # extract the [num] part of 'text_[num].txt'
        # Build the paths to the text, comments, and feedback files
        txt_path = os.path.join(srcpath, f)
        cmnt_path = os.path.join(srcpath, f'comments_{index}.txt')
        fb_path = os.path.join(srcpath, f'feedback_{index}.txt')
        # Write the output, reading in byte mode following chepner's advice
        with open(fb_path, 'wb') as outfile:
            with open(txt_path, 'rb') as infile:
                outfile.write(infile.read())
            with open(cmnt_path, 'rb') as infile:
                outfile.write(infile.read())

The simplest way would probably be to just iterate from 0 onwards, stopping at the first missing file. This works assuming that your files are numbered in increasing order and with no gaps (e.g. you have 0, 1, 2 and not 0, 2).
import os
from itertools import count

for i in count():  # count() starts at 0, matching text_0.txt
    t = f'text_{i}.txt'
    c = f'comments_{i}.txt'
    if not os.path.isfile(t) or not os.path.isfile(c):
        break
    with open(f'feedback_{i}.txt', 'wb') as outfile:
        with open(t, 'rb') as infile:
            outfile.write(infile.read())
        with open(c, 'rb') as infile:
            outfile.write(infile.read())

You can try this
filenames = ['text_0.txt', 'comments_0.txt', 'text_1.txt', 'comments_1.txt',
             'text_2.txt', 'comments_2.txt', 'text_3.txt', 'comments_3.txt']

for i, pair in enumerate(zip(filenames[::2], filenames[1::2])):
    with open(f'feedback_{i}.txt', 'w') as outfile:
        for name in pair:
            with open(name, 'r') as infile:
                outfile.write(infile.read())
I have hardcoded the list here. Instead, you can build it from the directory:
import os

filenames = os.listdir('path/to/folder')
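A caveat: os.listdir returns names in arbitrary order, so the interleaved text/comments pairing above needs the names sorted by their number first. A sketch of one way to do that, assuming the text_/comments_ naming from the question:
import os

names = os.listdir('path/to/folder')
# sort each category numerically by the [number] part of the name
texts = sorted((n for n in names if n.startswith('text_')),
               key=lambda n: int(n[5:-4]))
comments = sorted((n for n in names if n.startswith('comments_')),
                  key=lambda n: int(n[9:-4]))
# interleave into text_0, comments_0, text_1, comments_1, ...
filenames = [name for pair in zip(texts, comments) for name in pair]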

Related

A simple question about substituting a common part of a file name when concatenating txt files

I have many files where the starting 3 letters differ. I am trying to concatenate the monthly files into semiannual files using the code below. But rather than replace FXE 7 times with the different letters, I want to just replace it in 1 place. I tried a few methods, including using ETF = FXE and then substituting {ETF} for FXE, but my inexperience with the syntax is stumping me. Any quick advice is appreciated. Thx in advance.
# Creating a list of filenames
filenames = ['FXE_2022_01.txt', 'FXE_2022_02.txt', 'FXE_2022_03.txt',
             'FXE_2022_04.txt', 'FXE_2022_05.txt', 'FXE_2022_06.txt']

# Open the combined file in write mode
with open('FXE_2022_01_to_06.txt', 'w') as outfile:
    # Iterate through the list
    for name in filenames:
        # Open each monthly file in read mode and copy its data across
        with open(name) as infile:
            outfile.write(infile.read())
        # Add '\n' so the next file's data starts on a new line
        outfile.write("\n")
Here's one possible way to do it using list comprehension and f strings. I also generalized the year and index numbers:
f_headers = ['FXE', 'ETF', 'GBJ']
f_year = '2022'
f_range = range(1, 7)

fnames = [[f'{fh}_{f_year}_{fr:02d}.txt' for fr in f_range] for fh in f_headers]
for fn in fnames:
    print(fn)
Output:
['FXE_2022_01.txt', 'FXE_2022_02.txt', 'FXE_2022_03.txt', 'FXE_2022_04.txt', 'FXE_2022_05.txt', 'FXE_2022_06.txt']
['ETF_2022_01.txt', 'ETF_2022_02.txt', 'ETF_2022_03.txt', 'ETF_2022_04.txt', 'ETF_2022_05.txt', 'ETF_2022_06.txt']
['GBJ_2022_01.txt', 'GBJ_2022_02.txt', 'GBJ_2022_03.txt', 'GBJ_2022_04.txt', 'GBJ_2022_05.txt', 'GBJ_2022_06.txt']
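To tie it back to the original goal, each inner list can then be fed through the same concatenation loop; a sketch, where the per-header output name is an assumption:
for fh, fn in zip(f_headers, fnames):
    # e.g. 'FXE_2022_01_to_06.txt' -- assumed naming scheme for the combined file
    with open(f'{fh}_{f_year}_01_to_06.txt', 'w') as outfile:
        for name in fn:
            with open(name) as infile:
                outfile.write(infile.read())
            outfile.write("\n")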

repeating a program multiple times (conducting calculations and writing the result in another file)

I have a script for analysing a large amount of data (from two different files it takes the values, which are separated by spaces; it then calculates the relative difference between those values and writes the results to another file).
from itertools import islice

with open('ex_original_1.idl') as f1, open('ex_new_1.idl') as f2:
    with open('ex_dif_1.txt', 'w') as f3:
        f1 = islice(f1, 905, None)  # skip first 905 lines
        f2 = islice(f2, 905, None)  # skip first 905 lines
        for f1_line, f2_line in zip(f1, f2):
            f1_vals = map(float, f1_line.strip().split())
            f2_vals = map(float, f2_line.strip().split())
            for v1, v2 in zip(f1_vals, f2_vals):
                try:
                    result = v1/v2
                    f3.write(str(result) + "\n")
                except ZeroDivisionError:  # should there be a value of zero
                    print("Encountered a value equal to zero in the second file. Skipping...")
                    continue
Whilst it works well on two files (ex_original_1.idl and ex_new_1.idl), I have a lot more files of the same type (~500). I would like to run this program over all of them, with the output files named in a logical manner: ex_dif_1.txt, ex_dif_2.txt, and so on. To keep matters structured, the two different types (ex_original_i and ex_new_i) are located in different directories, and I would like to write the new files to a separate directory (if I understand correctly, I include the path before each file name, yes?). To recap, the files which I have are:
ex_original_1, ex_original_2, ex_original_3 ... ex_original_500
ex_new_1, ex_new_2, ex_new_3 ... ex_new_500
I would like to get:
ex_dif_1, ex_dif_2, ex_dif_3 ... ex_dif_500
The code above works only once. Would it be more appropriate to make a separate program that runs this one multiple times, or to include a loop in this existing program? An example would be appreciated.
Hope it was clear enough. Thanks in advance for the help.
Start your program with a loop. Guess this should work for you...
import os

list_of_files = os.listdir("path/to/files")
in1files = [f for f in list_of_files if f.startswith('ex_original_')]
in2files = [f for f in list_of_files if f.startswith('ex_new_')]
assert len(in1files) == len(in2files)

# iterate by index so that ex_dif_i always pairs ex_original_i with ex_new_i
for i in range(1, len(in1files) + 1):
    file1 = f'ex_original_{i}.idl'
    file2 = f'ex_new_{i}.idl'
    outfile = f'ex_dif_{i}.txt'
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        with open(outfile, 'w') as f3:
            {your stuff continues here on....}
If the files are present in different folders, this will differ (see the sketch below). Let me know the exact scenario.
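Since the question mentions that the two input types and the output live in separate directories, here is a minimal sketch of the same loop with the paths joined in; the directory names orig_dir, new_dir, and out_dir are assumptions:
import os

orig_dir = 'path/to/originals'   # assumed directory names
new_dir = 'path/to/new'
out_dir = 'path/to/output'

n = sum(1 for f in os.listdir(orig_dir) if f.startswith('ex_original_'))
for i in range(1, n + 1):
    file1 = os.path.join(orig_dir, f'ex_original_{i}.idl')
    file2 = os.path.join(new_dir, f'ex_new_{i}.idl')
    outfile = os.path.join(out_dir, f'ex_dif_{i}.txt')
    with open(file1) as f1, open(file2) as f2, open(outfile, 'w') as f3:
        # ... same per-line processing as in the question's code ...
        pass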
If they are present in the same folder, you can try this.
Instead of hardcoding the filenames, get them dynamically! If no outfile is present yet, use the default name; otherwise, use one plus the number of the newest file.
>>> import os
>>> files = [f for f in os.listdir('.') if f.lower().startswith('ex_dif_')]
>>> files
['ex_dif_1.txt']
>>> outfile = "outfilefolder/ex_dif_%d.txt"
>>> number = 1
>>> if files: number = int(max(files, key=os.path.getctime).split('_')[2].split('.')[0]) + 1
>>> outfile = outfile % number if files else outfile % 1
>>> outfile
'outfilefolder/ex_dif_2.txt'
Similarly for the ex_original and ex_new input files:
>>> ofile = "ofilefolder/ex_original_%d.idl"
>>> ofile = ofile % number if files else ofile % 1
>>> nfile = "nfilefolder/ex_new_%d.idl"
>>> nfile = nfile % number if files else nfile % 1
Change the part where you open the files to use the dynamically generated names:
with open(ofile) as f1, open(nfile) as f2:
    with open(outfile, 'w') as f3:
        ...

Multiple editing of CSV files

I have a small problem operating on CSV files in Python (3.5). Previously I was working with single files and there was no problem, but right now I have >100 files in one folder.
So, my goal is:
Parse all *.csv files in the directory.
From each file, delete the first 6 rows; the files consist of the following data:
"nu(Ep), 2.6.8"
"Date: 2/10/16, 11:18:21 AM"
19
Ep,nu
0.0952645,0.123776,
0.119036,0.157720,
...
0.992060,0.374300,
Save each file separately (for example adding "_edited"), so that only the numbers are saved.
As an option: I have data subdivided into two parts for one material, for example Ag(0-1_s).csv and Ag(1-4)_s.csv (after steps 1-3 they should be like Ag(*)_edited.csv). How can I merge these two files by adding the data from (1-4) to the end of (0-1) and saving it in a third file?
My code so far is the following:
import os, sys
import csv
import re
import glob
import fileinput

def get_all_files(directory, extension='.csv'):
    dir_list = os.listdir(directory)
    csv_files = []
    for i in dir_list:
        if i.endswith(extension):
            csv_files.append(os.path.realpath(i))
    return csv_files

csv_files = get_all_files('/Directory/Path/Here')

# Here is the problem with csv's, I don't know how to scan files
# which are in the list "csv_files".
for n in csv_files:
    #print(n)
    lines = []  # empty, because I don't know how to write it properly
                # per each file
    input = open(n, 'r')
    reader = csv.reader(n)
    temp = []
    for i in range(5):
        next(reader)
    # a for loop for here regarding rows?
    # for row in n: ???
    #     ???
    input.close()
    # newfilename = "".join(n.split(".csv")) + "edited.csv"
    # newfilename can be used within open() below:
    with open(n + '_edited.csv', 'w') as nf:
        writer = csv.writer(nf)
        writer.writerows(lines)
This is the fastest way I can think of. If you have a solid-state drive, you could throw multiprocessing at this for more of a performance boost
import glob
import os

for fpath in glob.glob('path/to/directory/*.csv'):
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    outpath = os.path.join('path/to/dir', fname + "_edited" + os.path.extsep + 'csv')
    with open(fpath) as infile, open(outpath, 'w') as outfile:
        for _ in range(6):
            infile.readline()  # skip the first 6 rows
        for line in infile:
            outfile.write(line)
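The optional merge step from the question isn't covered above; a minimal sketch, assuming the two edited files follow the naming produced by the loop above, and with the combined name Ag(0-4)_edited.csv made up for illustration:
# append Ag(1-4)'s data to the end of Ag(0-1)'s data in a third file
with open('Ag(0-4)_edited.csv', 'w') as merged:  # hypothetical output name
    for part in ('Ag(0-1_s)_edited.csv', 'Ag(1-4)_s_edited.csv'):
        with open(part) as infile:
            merged.write(infile.read())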

Output comes in dictionary sorted order in python

I have a collection of csv files with names like 2.csv, 3.csv, ..., 999.csv. Each file has 91 rows. I want a set of new files collecting a particular row from all the files. E.g. row1.csv should have the first row of all the 998 files, and similarly row35.csv should have the 35th row of all the 998 files. Therefore I should have in total 91 files (one for each row), with each file having 998 rows (one for each original file) after my script completes.
I use the following code for the task
import glob
import os

for i in range(2, 92):
    outfile = open("row_%i.csv" % i, 'w')
    for filename in glob.glob('DataSet-MediaEval/devFeatures/*.csv'):
        with open(filename, 'r') as infile:
            lineno = 0
            for line in infile:
                lineno += 1
                if lineno == i:
                    outfile.write(line)
    outfile.close()
Now in any outfile row_i.csv my data is arranged in dictionary sorted order. Example:
The first row in the row_50.csv file is the 50th row of 10.csv.
In other words, in any row_i.csv the rows come from 10.csv, 100.csv, 101.csv and so on.
I wanted to know why that is happening, and whether there is a way to ensure that my row_i.csv is ordered in the way the files are ordered, i.e. 2.csv, 3.csv and so on.
Thanks for the time spent reading this.
Not sure if this works out or if there are more problems, but it seems like glob either returns the filenames in sorted order (sorted as strings) or in arbitrary order. In either case, you will have to extract the number from the filenames and sort by that number.
Try something like this:
p = re.compile(r"/(\d+)\.csv")
filenames = glob.glob(...)
for filename in sorted(filenames, key=lambda s: int(re.search(p, s).group(1))):
    ...
Also, it seems like you are opening, looping and closing all those 999 files for all of the 92 outfiles again and again! It might be better to open all of the 92 outfiles at once and store them in a dictionary, mapping line numbers to files. This way, you have to loop the 999 files just once.
Something like this (totally not tested):
outfiles = {i: open("row_%i.csv" % i, 'w') for i in range(2, 92)}
p = re.compile(r"/(\d+)\.csv")
filenames = glob.glob('DataSet-MediaEval/devFeatures/*.csv')
for filename in sorted(filenames, key=lambda s: int(re.search(p, s).group(1))):
    with open(filename, 'r') as infile:
        for lineno, line in enumerate(infile, 1):  # 1-based line numbers
            if lineno in outfiles:
                outfiles[lineno].write(line)
for outfile in outfiles.values():
    outfile.close()
You need to sort the filename list before iterating over it. This can help you:
import re
import glob

filename_list = glob.glob('DataSet-MediaEval/devFeatures/*.csv')

def splitByNumbers(x):
    r = re.compile(r'(\d+)')
    l = r.split(x)
    return [int(y) if y.isdigit() else y for y in l]

filenames = sorted(filename_list, key=splitByNumbers)
then you can use instead of
for filename in glob.glob('DataSet-MediaEval/devFeatures/*.csv'):
this
for filename in filenames:
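For illustration, splitByNumbers produces a mixed list of text and integers, so the comparison is numeric where it matters; for example (paths shortened for readability):
>>> splitByNumbers('devFeatures/10.csv')
['devFeatures/', 10, '.csv']
>>> splitByNumbers('devFeatures/2.csv')
['devFeatures/', 2, '.csv']
Since 2 < 10 as integers, 2.csv sorts before 10.csv, which is the order the question wants.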

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
The data file (refT.txt) is a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple: (contig, sequence):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].strip() in wanted:  # strip the trailing newline before comparing
            # write to output file, store in a list, or dict, ...
            wanted.discard(p[0].strip())
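One caveat: if a name in wanted never appears in the file, next(pairs) will eventually raise StopIteration. A defensive variant of the same loop, bounded by the file instead (the file name refT.txt is taken from the question):
with open('refT.txt') as fin:
    for name, seq in pair(fin):
        if not wanted:
            break  # everything found; stop early
        if name.strip() in wanted:
            wanted.discard(name.strip())
            # write name + seq to the output file here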
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import zip_longest  # izip_longest on Python 2

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to step through fasta-like files 2 lines at a time
    for name, seq in zip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
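As a follow-up to "writing the sequences grabbed to an output file is pretty straightforward", a minimal sketch (the name output.txt comes from the question's own script):
with open('output.txt', 'w') as data_out:
    for sequence in sequences:
        data_out.write(sequence + '\n')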
