python file concatenation and combining files - python

My main problem is this:
I have a set of files, and I am concatenating them this way in python:
sys.stdout=open("out.dat","w")
filenames = ['bla.txt', 'bla.txt', 'bla.txt']
with open('out.dat', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
with open('out.dat') as f:
    print "".join(line.strip() for line in f)
sys.stdout.close()
The bla.txt file looks like
aaa
and the intention is to make it look like
aaaaaaaaa
(3 times the same string, not on a new line each time...)
for some reason what I do produces an output that looks like
aaaaaa
a
I am not sure why this is happening and if there is a simpler/more elegant solution.
My second problem is that eventually, my plan is to have a number of different files (letter triplets, for example) that I could concatenate in all possible combinations: aaabbbccc, aaacccbbb, ..., etc.
Any guidance appreciated! Thank you!

There are some confusing things about your code; I'll leave comments in the relevant places:
# Not sure what the reason for this is
sys.stdout=open("out.dat","w")
filenames = ['bla.txt', 'bla.txt', 'bla.txt']
# This does what you need
with open('out.dat', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
# Here you open `out.dat` and rewrite its content back into it -
# because you made `sys.stdout = open("out.dat", "w")` above.
# All these lines could be removed (along with the `sys.stdout` assignment above)
with open('out.dat') as f:
    print "".join(line.strip() for line in f)
sys.stdout.close()
The most minimalistic approach I could think of:
# Open output
with open('out.dat', 'w') as outfile:
    # Iterate over each input
    for infilename in ['bla.txt'] * 3:
        # Open each input and write it to output
        with open(infilename) as infile:
            outfile.write(infile.read())
As for your error, it should not be happening. Could you confirm that the content of bla.txt is exactly aaa?

Nihey Takizawa's post almost answers why you got this error. First, let's see what is going on at each step of the program's execution.
sys.stdout=open("out.dat","w")
This is pretty important. Because you replace sys.stdout with a file handle to "out.dat", every function or statement that uses it (such as print) will write to "out.dat" from now on.
with open('out.dat', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
After this block, the content of the file "out.dat" is:
aaa
aaa
aaa
...or in other words: aaa\naaa\naaa\n, where \n is a single character standing for a newline. Number of chars: 12 (9 times a and 3 newlines).
with open('out.dat') as f:
    print "".join(line.strip() for line in f)
Here is the important part. Remember that because you changed sys.stdout to "out.dat" in step 1, the print statement writes its output to "out.dat".
You strip each line and join them, so you write "aaaaaaaaa" to "out.dat".
1   2   3   4   5   6   7   8   9   10  11  12
a   a   a   \n  a   a   a   \n  a   a   a   \n  # this is the content of the file before print
a   a   a   a   a   a   a   a   a   \n          # this is what you write: 9 'a' chars + '\n',
                                                # which is added by the print statement by default
Note that you replaced 10 out of the 12 characters and then closed the file, so characters 11 and 12 stayed the same. The result is your output.
Solution? Never mess with sys.stdout by replacing it with another file handle unless you know exactly what you're doing.
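As a side note (not part of the original answer): if you genuinely want print output to go to a file, a safer pattern than reassigning sys.stdout is contextlib.redirect_stdout (Python 3.4+). A minimal sketch, with log.txt as an example name:
import contextlib

# Redirection only applies inside the with block;
# sys.stdout is restored automatically afterwards.
with open("log.txt", "w") as f, contextlib.redirect_stdout(f):
    print("this line goes into log.txt")

print("this line goes to the terminal again")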
EDIT: How to fix your code.
I thought that Nihey Takizawa nicely explained how to fix your code, but as I see it, it's not completely correct. Here's a solution:
filenames = ['bla.txt', 'bla.txt', 'bla.txt']
with open('out.dat', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read().strip())
Now your out.dat file contains only the string aaaaaaaaa, with no newlines.
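As for the second part of the question (concatenating a set of files in every possible order), that isn't covered above. A minimal sketch using itertools.permutations, assuming hypothetical input files aaa.txt, bbb.txt, ccc.txt and output names combo_0.dat, combo_1.dat, and so on:
from itertools import permutations

filenames = ['aaa.txt', 'bbb.txt', 'ccc.txt']  # assumed example names

for i, order in enumerate(permutations(filenames)):
    # one output file per ordering of the inputs
    with open('combo_%d.dat' % i, 'w') as outfile:
        for fname in order:
            with open(fname) as infile:
                outfile.write(infile.read().strip())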

Related

Combine two wordlist in one file Python

I have two wordlists, as per the examples below:
wordlist1.txt
aa
bb
cc
wordlist2.txt
11
22
33
I want to take every line from wordlist2.txt and put it after each line in wordlist1.txt and combine them in wordlist3.txt like this:
aa
11
bb
22
cc
33
.
.
Can you please help me with how to do it? Thanks!
Always try to include what you have tried. That said, here is a good place to start.
def read_file_to_list(filename):
    with open(filename) as file:
        lines = file.readlines()
        lines = [line.rstrip() for line in lines]
        return lines

wordlist1 = read_file_to_list("wordlist1.txt")
wordlist2 = read_file_to_list("wordlist2.txt")

with open("wordlist3.txt", 'w', encoding='utf-8') as f:
    for x, y in zip(wordlist1, wordlist2):
        f.write(x + "\n")
        f.write(y + "\n")
Check the following question for more ideas and understanding: How to read a file line-by-line into a list?
Cheers
Open wordlist1.txt and wordlist2.txt for reading and wordlist3.txt for writing. Then it's as simple as:
with open('wordlist3.txt', 'w') as w3, open('wordlist1.txt') as w1, open('wordlist2.txt') as w2:
    for l1, l2 in zip(map(str.rstrip, w1), map(str.rstrip, w2)):
        print(f'{l1}\n{l2}', file=w3)
Instead of using .splitlines(), you can also iterate over the files directly. Here's the code:
wordlist1 = open("wordlist1.txt", "r")
wordlist2 = open("wordlist2.txt", "r")
wordlist3 = open("wordlist3.txt", "w")
for txt1, txt2 in zip(wordlist1, wordlist2):
    if not txt1.endswith("\n"):
        txt1 += "\n"
    wordlist3.write(txt1)
    wordlist3.write(txt2)
wordlist1.close()
wordlist2.close()
wordlist3.close()
In the first block, we open the files. For the first two, we use "r", which stands for read, as we don't want to change anything in those files; we could omit it, since "r" is the default argument of open. For the third one, we use "w", which stands for write; if the file doesn't exist yet, it will be created.
Next, we use the zip function in the for loop. It creates an iterator of tuples from the iterables passed as arguments. In this loop, each tuple contains one line of wordlist1.txt and one line of wordlist2.txt, and the tuples are unpacked directly into the variables txt1 and txt2.
Next, we use an if statement to check whether the line from wordlist1.txt ends with a newline. This might not be the case for the last line, so it needs to be checked. We don't check the line from the second file, because it's no problem if the last line of the resulting file has no newline.
Next, we write the text to wordlist3.txt. The text is appended to the end of the file as we go; however, any text that was in the file before it was opened with "w" is lost.
Finally, we close the files. This is very important, as otherwise some output might not be flushed to disk and no other application can use the files in the meantime.
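The same logic can also be written with context managers so that all three files are closed automatically even if an error occurs (this is just a rewrite of the code above, not a different approach):
with open("wordlist1.txt") as wordlist1, \
     open("wordlist2.txt") as wordlist2, \
     open("wordlist3.txt", "w") as wordlist3:
    for txt1, txt2 in zip(wordlist1, wordlist2):
        # only the line from the first file may be missing its newline
        if not txt1.endswith("\n"):
            txt1 += "\n"
        wordlist3.write(txt1)
        wordlist3.write(txt2)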
Try this:
with open('wordlist1.txt', 'r') as f1:
    f1_list = f1.read().splitlines()
with open('wordlist2.txt', 'r') as f2:
    f2_list = f2.read().splitlines()

f3_list = [x for t in zip(f1_list, f2_list) for x in t]

with open('wordlist3.txt', 'w') as f3:
    f3.write("\n".join(f3_list))
with open('wordlist1.txt') as w1,\
     open('wordlist2.txt') as w2,\
     open('wordlist3.txt', 'w') as w3:
    for wordlist1, wordlist2 in zip(w1.readlines(), w2.readlines()):
        if wordlist1[-1] != '\n':
            wordlist1 += '\n'
        if wordlist2[-1] != '\n':
            wordlist2 += '\n'
        w3.write(wordlist1)
        w3.write(wordlist2)
Here you go :)
with open('wordlist1.txt', 'r') as f:
    file1 = f.readlines()
with open('wordlist2.txt', 'r') as f:
    file2 = f.readlines()

with open('wordlist3.txt', 'w') as f:
    for x in range(len(file1)):
        if not file1[x].endswith('\n'):
            file1[x] += '\n'
        f.write(file1[x])
        if not file2[x].endswith('\n'):
            file2[x] += '\n'
        f.write(file2[x])
Open wordlist 1 and 2 and make line pairings, separate each pair with a newline character, then join all the pairs together, again separated by a newline.
# paths
wordlist1 = #
wordlist2 = #
wordlist3 = #

with open(wordlist1, 'r') as fd1, open(wordlist2, 'r') as fd2:
    out = '\n'.join(f'{l1}\n{l2}' for l1, l2 in zip(fd1.read().split(), fd2.read().split()))

with open(wordlist3, 'w') as fd:
    fd.write(out)

How to save each line of a file to a new file (every line a new file) and do that for multiple original files

I have 5 files from which I want to take each line (24 lines in total) and save it to a new file. I managed to find code which will do that, but the way it is, I have to manually change the number of the appropriate original file, the number of the file I want to save to, and also the number of each line every time.
The code:
x1 = np.loadtxt("x_p2_40.txt")
x2 = np.loadtxt("x_p4_40.txt")
x3 = np.loadtxt("x_p6_40.txt")
x4 = np.loadtxt("x_p8_40.txt")
x5 = np.loadtxt("x_p1_40.txt")

with open("x_p1_40.txt", "r") as file:
    content = file.read()
    first_line = content.split('\n', 1)[0]

with open("1_p_40_x.txt", "a") as f:
    f.write("\n")

with open("1_p_40_x.txt", "a") as fa:
    fa.write(first_line)

print(first_line)
I am a beginner at Python, and I'm not sure how to make a loop for this, because I assume I need a loop?
Thank you!
Since you have multiple files here, you could define their names in a list, and use a list comprehension to open file handles to them all:
input_files = ["x_p2_40.txt", "x_p4_40.txt", "x_p6_40.txt", "x_p8_40.txt", "x_p1_40.txt"]
file_handles = [open(f, "r") for f in input_files]
Since each of these file handles is an iterator that yields a single line every time you iterate over it, you could simply zip() all these file handles to iterate over them simultaneously. Also throw in an enumerate() to get the line numbers:
for line_num, files_lines in enumerate(zip(*file_handles), 1):
    out_file = f"{line_num}_p_40.txt"
    # Remove trailing whitespace on all lines, then add a newline
    files_lines = [f.rstrip() + "\n" for f in files_lines]
    with open(out_file, "w") as of:
        of.writelines(files_lines)
With three files:
x_p2_40.txt:
2_1
2_2
2_3
2_4
x_p4_40.txt:
4_1
4_2
4_3
4_4
x_p6_40.txt:
6_1
6_2
6_3
6_4
I get the following output:
1_p_40.txt:
2_1
4_1
6_1
2_p_40.txt:
2_2
4_2
6_2
3_p_40.txt:
2_3
4_3
6_3
4_p_40.txt:
2_4
4_4
6_4
Finally, since we didn't use a context manager to open the original file handles, remember to close them after we're done:
for fh in file_handles:
    fh.close()
If you have files with an unequal number of lines and you want to create files for all lines, consider using itertools.zip_longest() instead of zip().
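A sketch of that variant, reusing the same file_handles list as above (zip_longest pads the missing lines with fillvalue, here an empty string):
from itertools import zip_longest

for line_num, files_lines in enumerate(zip_longest(*file_handles, fillvalue=""), 1):
    out_file = f"{line_num}_p_40.txt"
    # Missing lines become empty strings, which end up as blank lines in the output
    files_lines = [f.rstrip() + "\n" for f in files_lines]
    with open(out_file, "w") as of:
        of.writelines(files_lines)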
In order to read each of your input files, you can store them in a list and iterate over it with a for loop. Then we add every line to a single list with the function extend() :
inputFiles = ["x_p2_40.txt", "x_p4_40.txt", "x_p6_40.txt", "x_p8_40.txt", "x_p1_40.txt"]
outputFile = "outputfile.txt"
lines = []
for filename in inputFiles:
    with open(filename, 'r') as f:
        lines.extend(f.readlines())
        lines[-1] += '\n'
Finally, you can write all the lines to your output file:
with open(outputFile, 'w') as f:
    f.write(''.join(lines))

Files are not merging : python

I want to merge the contents of two files into one new output file.
I have read other threads about merging file contents and I tried several options, but I only get one file's content in my output. Here's one of the codes that I tried, and I can't see anything wrong with it.
I only get one file in my output, and even if I switch the positions of file1 and file2 in the list, I still only get file1 in my output.
Here is my code:
filenames = ['file1','file2']
with open('output.data', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
How can I do this?
Here is my whole code that leads up to merging these two files:
source1 = open('A','r')
output = open('file1','w')
output.write(',yes.\n'.join(','.join(line) for line in source1.read().split('\n')))
source1 = open('B', 'r')
output = open('file2','w')
output.write(',no.\n'.join(','.join(line) for line in source2.read().split('\n')))
filenames = ['file1','file2']
with open('output.data', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
After the edit it's clear where your mistake is. You need to close (or flush) the file after writing, before it can be read by the same code.
source1 = open('A','r')
output = open('file1','w')
output.write(',yes.\n'.join(','.join(line) for line in source1.read().split('\n')))
output.close()
source2 = open('B', 'r')
output = open('file2','w')
output.write(',no.\n'.join(','.join(line) for line in source2.read().split('\n')))
output.close()
filenames = ['file1','file2']
with open('output.data', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
The reason the first file's content is available is that you drop the reference to file1's file object when you reassign the variable output to hold the file object for file2, and the dropped object is then closed automatically by Python.
As #gnibbler suggested, it's best to use with statements to avoid this kind of problem in the future. You should enclose the source1, source2, and output in a with statement, as you did for the last part.
The first file is being closed/flushed when you rebind output to a new file. This is the behaviour of CPython, but it's not good to rely on it.
Use context managers to make sure that the files are flushed (and closed) properly before you try to read from them:
with open('A','r') as source1, open('file1','w') as output:
    output.write(',yes.\n'.join(','.join(line) for line in source1.read().split('\n')))

with open('B','r') as source2, open('file2','w') as output:
    output.write(',no.\n'.join(','.join(line) for line in source2.read().split('\n')))

filenames = ['file1','file2']
with open('output.data', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            print("Reading from: " + fname)
            data = infile.read()
            print(len(data))
            outfile.write(data)
There is a fair bit of duplication in the first two blocks. Maybe you can use a function there.
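For instance, the two duplicated write blocks could be folded into one small helper (a sketch; annotate_file is a made-up name for illustration):
def annotate_file(src_name, dst_name, suffix):
    # Same logic as the original one-liners: join each line's characters
    # with commas, then join the lines with `suffix`.
    with open(src_name) as src, open(dst_name, 'w') as dst:
        dst.write(suffix.join(','.join(line) for line in src.read().split('\n')))

annotate_file('A', 'file1', ',yes.\n')
annotate_file('B', 'file2', ',no.\n')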
You can just combine your read and writes into one with statement (if you don't really need the intermediary files); this will also solve your closing problem:
with open('A') as a, open('B') as b, open('out.txt','w') as out:
    for line in a:
        out.write(',yes.\n'.join(','.join(line)))
    for line in b:
        out.write(',no.\n'.join(','.join(line)))

How to write lines from a input file to an output file in reversed order in python 3

What I want to do is take a series of lines from one text document, and put them in reverse in a second. For example text document a contains:
hi
there
people
So therefore I would want to write these same lines to text document b, except like this:
people
there
hi
So far I have:
def write_matching_lines(input_filename, output_filename):
    infile = open(input_filename)
    lines = infile.readlines()

    outfile = open(output_filename, 'w')
    for line in reversed(lines):
        outfile.write(line.rstrip())

    infile.close()
    outfile.close()
but this only returns:
peopletherehi
in one line. Any help would be appreciated.
One line will do:
open("out", "wb").writelines(reversed(open("in").readlines()))
You just need to add '\n', since .write does not do that for you. Alternatively you can use
print >>f, line.rstrip()
equivalently in Python 3:
print(line.rstrip(), file=f)
which will add a new line for you. Or do something like this:
>>> with open('text.txt') as fin, open('out.txt', 'w') as fout:
...     fout.writelines(reversed([line.rstrip() + '\n' for line in fin]))
This code assumes that you don't know whether the last line has a newline; if you know it does, you can just use
fout.writelines(reversed(fin.readlines()))
Why do you rstrip() your line before writing it? You're stripping off the newline at the end of each line as you write it. And yet you then notice that you don't have any newlines. Simply remove the rstrip() in your write.
Less is more.
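In other words, here is a sketch of the original function with only the rstrip() call removed (and with-blocks used so the files are closed automatically; not tested against the original data):
def write_matching_lines(input_filename, output_filename):
    with open(input_filename) as infile:
        lines = infile.readlines()
    with open(output_filename, 'w') as outfile:
        for line in reversed(lines):
            outfile.write(line)  # keep each line's existing newline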
Update
If I couldn't prove/verify that the last line has a terminating newline, I'd personally be inclined to mess with the one line where it mattered, up front. E.g.
....
outfile = open(output_filename, 'w')
lines[-1] = lines[-1].rstrip() + '\n' # make sure last line has a newline
for line in reversed(lines):
    outfile.write(line)
....
with open(your_filename) as h:
    print ''.join(reversed(h.readlines()))
or, if you want to write it to another stream:
with open(your_filename_out, 'w') as h_out:
    with open(your_filename_in) as h_in:
        h_out.write(''.join(reversed(h_in.readlines())))

Using python, how to read a file starting at the seventh line ?

I have a text file structured as:
date
downland
user
date data1 date2
201102 foo bar 200 50
201101 foo bar 300 35
So the first six lines of the file are not needed. Filename: dnw.txt
f = open('dwn.txt', 'rb')
How do I "split" this file starting at line 7 to EOF?
with open('dwn.txt') as f:
    for i in xrange(6):
        f.next()
    for line in f:
        process(line)
Update: use next(f) for Python 3.x.
Itertools answer!
from itertools import islice

with open('foo') as f:
    for line in islice(f, 6, None):
        print line
Python 3:
with open("file.txt","r") as f:
for i in range(6):
f.readline()
for line in f:
# process lines 7-end
with open('test.txt', 'r') as fo:
    for i in xrange(6):
        fo.next()
    for line in fo:
        print "%s" % line.strip()
In fact, to answer precisely the question as it was written,
How do I "split" this file starting at line 7 to EOF?
you can do the following:
in case the file is not big:
with open('dwn.txt','rb+') as f:
    for i in xrange(6):
        print f.readline()
    content = f.read()
    f.seek(0,0)
    f.write(content)
    f.truncate()
in case the file is very big:
with open('dwn.txt','rb+') as ahead, open('dwn.txt','rb+') as back:
    for i in xrange(6):
        print ahead.readline()

    x = 100000
    chunk = ahead.read(x)
    while chunk:
        print repr(chunk)
        back.write(chunk)
        chunk = ahead.read(x)

    back.truncate()
The truncate() call is essential to put the EOF where you asked for it. Without truncate(), the tail of the file, corresponding to the offset of the 6 skipped lines, would remain.
The file must be opened in binary mode to prevent problems.
When Python reads '\r\n' in text mode, it transforms it into '\n' (that's Universal Newline Support, enabled by default), that is to say there would only be '\n' in the chunk strings even if the file contained '\r\n'.
If the file is of Macintosh origin, it contains only CR = '\r' newlines before the treatment, but they would be changed to '\n' or '\r\n' (according to the platform) when rewritten on a non-Macintosh machine.
If it is a file of Linux origin, it contains only LF = '\n' newlines, which, on Windows, would be changed to '\r\n' (I don't know about a Linux file processed on a Macintosh).
The reason is that Windows writes '\r\n' whatever it is asked to write, '\n', '\r' or '\r\n'. Consequently, more characters would be rewritten than had been read, and the offset between the file pointers ahead and back would shrink and cause a messy rewrite.
In HTML sources, there are also various kinds of newlines.
That's why it's always preferable to open files in binary mode when they are processed this way.
Alternative version
You can directly use read() if you know the character position pos of the linebreak separating the header part from the part of interest, e.g. a '\n', at which you want to break your input text:
with open('input.txt', 'r') as txt_in:
    txt_in.seek(pos)
    second_half = txt_in.read()
If you are interested in both halves, you could also investigate the following method:
with open('input.txt', 'r') as txt_in:
    all_contents = txt_in.read()
    first_half = all_contents[:pos]
    second_half = all_contents[pos:]
You can read the entire file into an array/list and then just start at the index appropriate to the line you wish to start reading at.
f = open('dwn.txt', 'rb')
fileAsList = f.readlines()
fileAsList[0] #first line
fileAsList[1] #second line
#!/usr/bin/python

with open('dnw.txt', 'r') as f:
    lines_7_through_end = f.readlines()[6:]

print "Lines 7+:"
i = 7
for line in lines_7_through_end:
    print "  Line %s: %s" % (i, line)
    i += 1
Prints:
Lines 7+:
Line 7: 201102 foo bar 200 50
Line 8: 201101 foo bar 300 35
Edit:
To rebuild dwn.txt without the first six lines, do this after the above code:
with open('dnw.txt', 'w') as f:
    for line in lines_7_through_end:
        f.write(line)
I have created a script used to cut an Apache access.log file several times a day.
It's not the original topic of the question, but I think it can be useful if you want to store the file cursor position after reading the first 6 lines.
In my case, I needed to set the cursor position to the last line parsed during the previous execution.
To this end, I used the file.seek() and file.tell() methods, which allow storing and restoring the cursor position for the file.
My code:
ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))
# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")
# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")
# Set in from_line
from_position = 0
try:
with open(cursor_position, "r", encoding=ENCODING) as f:
from_position = int(f.read())
except Exception as e:
pass
# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
with open(cut_file, "w", encoding=ENCODING) as fw:
# We set cursor to the last position used (during last run of script)
f.seek(from_position)
for line in f:
fw.write("%s" % (line))
# We save the last position of cursor for next usage
with open(cursor_position, "w", encoding=ENCODING) as fw:
fw.write(str(f.tell()))
Just do f.readline() six times. Ignore the returned value.
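A sketch of that idea (process() is just a placeholder for whatever you do with each remaining line):
with open('dwn.txt') as f:
    for _ in range(6):
        f.readline()   # read and discard one header line
    for line in f:
        process(line)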
Solutions with readlines() are not satisfactory in my opinion because readlines() reads the entire file. The user then has to read the lines again (in the file or in the produced list) to process what he wants, while it could have been done without reading the interesting lines a first time. Moreover, if the file is big, memory is weighed down by the file's content, while a for line in file loop would be lighter.
Repetition of readline() can be done like this:
nb = 6
exec(nb * 'f.readline()\n')
It's a short piece of code, and nb is programmatically adjustable.
