Sorting a file in ascending order - python

I was given a file with one name per line in random order myInput01.txt and I need to order it in ascending order and output the ordered names one name per line to a file named myOutput01.txt.
myhandle = open('myInput01.txt', 'r')
aLine = myhandle.readlines()
sorted(aLine)
aLine = myOutput01.txt
print myOutput01.txt

For future visitors, the easiest and most concise way of doing this in Python (assuming a sort isn't going to blow your system memory) is:
with open('myInput01.txt') as fin, open('myOutput01.txt', 'w') as fout:
fout.writelines(sorted(fin))

So, this part is ok:
myhandle = open('myInput01.txt', 'r')
aLine = myhandle.readlines()
You open a file (get a file handler in myhandle) and read its lines into aLine.
Now, there's a problem with:
sorted(aLine)
The sorted function doesn't do anything to the aLine argument. It returns a sorted new list. So it's either you use aLine.sort() to sort in-place or assign the output of the sorted function to another variable:
sorted_lines = sorted(aLine)
Take a look to this sorting tutorial.
Also, these two lines are very problematic:
aLine = myOutput01.txt
print myOutput01.txt
You're overwriting your aLine variable with something called myOutput01.txt, which is unknown to the script (what is it? where is it defined?). You need to proceed in a similar way as to read a file. You need to open a handler and write stuff to the file using that handler as a reference.
You need:
mywritehandle = open('myOutputO1.txt', 'w')
mywritehandle.writelines(sorted_lines)
mywritehandle.close()
Or, to avoid having to call close() explicitly:
with open('myOutputO1.txt', 'w') as mywritehandle:
mywritehandle.writelines(sorted_lines)
You should familiarize yourself with file objects and be aware that myOutput01.txt is very different to "myOutput01.txt".

outputFile = open('myOutput01.txt','w')
inputFile = open('myInput01.txt','r')
content = inputFile.readlines()
for name in sorted(content):
outputFile.write(name + '\n')
inputFile.close()
outputFile.close()

Related

Simple parsing and sorting data from file

Sorry if this has already been answered before; the searches I have done have not been helpful.
I have a file that stores data as such:
name,number
(Although perhaps not relevant to the question, I will have to add entries to this file. I know how to do this.)
My question is for the pythonic(?) way of analyzing the data and sorting it in ascending order. So if the file was:
alex,30
bob,20
and I have to add the entry
carol, 25
The file should be rewritten as
bob,20
carol,25
alex,30
My first attempt was to store the entire file as a string (by read()) and then split by lines to get a list of strings, procedurally split those strings by a comma, and then create a new list of scores then sort that, but this doesn't seem right and fails because I don't have a way to go "back" once I have the order of scores.
I am unable to use libraries for this program.
Edit:
My first attempt I did not test because all it manages to do is sort a list of the scores; I don't know of a way to get the "entries" back.
file = open("scores.txt" , "r")
data = file.read()
list_data = data.split()
data.append([name,score])
for i in range(len(list_data)):
list_scores = list_scores.append(list_data[i][1])
list_scores = sorted(list_scores)
As you can see, this gives me an ascending list of scores, but I do not know where to go from here in order to sort the list of name, score entries.
You will just have to write the sorted entries back to some file, using some basic string formatting:
with open('scores.txt') as f_in, open('file_out.txt', 'w') as f_out:
entries = [(x, int(y)) for x, y in (line.strip().split(',') for line in f_in)]
entries.append(('carol', 25))
entries.sort(key=lambda e: e[1])
for x, y in entries:
f_out.write('{},{}\n'.format(x, y))
I'm going to assume you're capable of putting your data into a .csv file in the following format:
Name,Number
John,20
Jane,25
Then you can use csv.DictReader to read this into a dictionary with something like as shown in the listed example:
with(open('name_age.csv', 'w') as csvfile:
reader = csv.DictReader(csvfile)
and write to it using
with(open('name_age.csv') as csvfile:
writer = csv.DictWriter(csvfile)
writer.writerow({'Name':'Carol','Number':25})
You can then sort it using python's built-in operator as shown here
this a function that will take a filename and sort it for you
def sort_file(filename):
f = open(filename, 'r')
text = f.read()
f.close()
lines = [i.split(',') for i in text.splitlines()]
lines.sort(key=lambda x: x[1])
lines = [', '.join(i) for i in lines]
text = '\n'.join(lines)
f = open(filename, 'w')
f.write(text)
f.close()

Python: Issue with Writing over Lines?

So, this is the code I'm using in Python to remove lines, hence the name "cleanse." I have a list of a few thousand words and their parts-of-speech:
NN by
PP at
PP at
... This is the issue. For whatever reason (one I can't figure out and have been trying to for a few hours), the program I'm using to go through the word inputs isn't clearing duplicates, so the next best thing I can do is the former! Y'know, cycle through the file and delete the duplicates on run. However, whenever I do, this code instead takes the last line of the list and duplicates that hundreds of thousands of times.
Thoughts, please? :(
EDIT: The idea is that cleanseArchive() goes through a file called words.txt, takes any duplicate lines and deletes them. Since Python isn't able to delete lines, though, and I haven't had luck with any other methods, I've turned to essentially saving the non-duplicate data in a list (saveList) and then writing each object from that list into a new file (deleting the old). However, as of the moment as I said, it just repeats the final object of the original list thousands upon thousands of times.
EDIT2: This is what I have so far, taking suggestions from the replies:
def cleanseArchive():
f = open("words.txt", "r+")
given_line = f.readlines()
f.seek(0)
saveList = set(given_line)
f.close()
os.remove("words.txt")
f = open("words.txt", "a")
f.write(saveList)
but ATM it's giving me this error:
Traceback (most recent call last):
File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 154, in <module>
initialize()
File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 100, in initialize
cleanseArchive()
File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 29, in cleanseArchive
f.write(saveList)
TypeError: must be str, not set
for i in saveList:
f.write(n+"\n")
You basically print the value of n over and over.
Try this:
for i in saveList:
f.write(i+"\n")
If you just want to delete "duplicated lines", I've modified your reading code:
saveList = []
duplicates = []
with open("words.txt", "r") as ins:
for line in ins:
if line not in duplicates:
duplicates.append(line)
saveList.append(line)
Additionally take the correction above!
def cleanseArchive():
f = open("words.txt", "r+")
f.seek(0)
given_line = f.readlines()
saveList = set()
for x,y in enumerate(given_line):
t=(y)
saveList.add(t)
f.close()
os.remove("words.txt")
f = open("words.txt", "a")
for i in saveList: f.write(i)
Finished product! I ended up digging into enumerate and essentially just using that to get the strings. Man, Python has some bumpy roads when you get into sets/lists, holy shit. So much stuff not working for very ambiguous reasons! Whatever the case, fixed it up.
Let's clean up this code you gave us in your update:
def cleanseArchive():
f = open("words.txt", "r+")
given_line = f.readlines()
f.seek(0)
saveList = set(given_line)
f.close()
os.remove("words.txt")
f = open("words.txt", "a")
f.write(saveList)
We have bad names that don't respect the Style Guide for Python Code, we have superfluous code parts, we don't use the full power of Python and part of it is not working.
Let us start with dropping unneeded code while at the same time using meaningful names.
def cleanse_archive():
infile = open("words.txt", "r")
given_lines = infile.readlines()
words = set(given_lines)
infile.close()
outfile = open("words.txt", "w")
outfile.write(words)
The seek was not needed, the mode for opening a file to read is now just r, the mode for writing is now w and we dropped the removing of the file because it will be overwritten anyway. Having a look at this now clearer code we see, that we missed to close the file after writing. If we open the file with the with statement Python will take care of that for us.
def cleanse_archive():
with open("words.txt", "r") as infile:
words = set(infile.readlines())
with open("words.txt", "w") as outfile:
outfile.write(words)
Now that we have clear code we'll deal with the error message that occurs when outfile.write is called: TypeError: must be str, not set. This message is clear: You can't write a set directly to the file. Obviously you'll have to loop over the content of the set.
def cleanse_archive():
with open("words.txt", "r") as infile:
words = set(infile.readlines())
with open("words.txt", "w") as outfile:
for word in words:
outfile.write(word)
That's it.

"Move" some parts of the file to another file

Let say I have a file with 48,222 lines. I then give an index value, let say, 21,000.
Is there any way in Python to "move" the contents of the file starting from index 21,000 such that now I have two files: the original one and the new one. But the original one now is having 21,000 lines and the new one 27,222 lines.
I read this post which uses partition and is quite describing what I want:
with open("inputfile") as f:
contents1, sentinel, contents2 = f.read().partition("Sentinel text\n")
with open("outputfile1", "w") as f:
f.write(contents1)
with open("outputfile2", "w") as f:
f.write(contents2)
Except that (1) it uses "Sentinel Text" as separator, (2) it creates two new files and require me to delete the old file. As of now, the way I do it is like this:
for r in result.keys(): #the filenames are in my dictionary, don't bother that
f = open(r)
lines = f.readlines()
f.close()
with open("outputfile1.txt", "w") as fn:
for line in lines[0:21000]:
#write each line
with open("outputfile2.txt", "w") as fn:
for line in lines[21000:]:
#write each line
Which is quite a manual work. Is there a built-in or more efficient way?
You can also use writelines() and dump the sliced list of lines from 0 to 20999 into one file and another sliced list from 21000 to the end into another file.
with open("inputfile") as f:
content = f.readlines()
content1 = content[:21000]
content2 = content[21000:]
with open("outputfile1.txt", "w") as fn1:
fn1.writelines(content1)
with open('outputfile2.txt','w') as fn2:
fn2.writelines(content2)

Python: Concise / elegant way to reformat a set of text files?

I have written a python script to process a set of ASCII files within a given dir. I wonder if there is a more concise and/or "pythonesque" way to do it, without loosing readability?
Python Code
import os
import fileinput
import glob
import string
indir='./'
outdir='./processed/'
for filename in glob.glob(indir+'*.asc'): # get a list of input ASCII files to be processed
fin=open(indir+filename,'r') # input file
fout=open(outdir+filename,'w') # out: processed file
lines = iter(fileinput.input([indir+filename])) # iterator over all lines in the input file
fout.write(next(lines)) # just copy the first line (the header) to output
for line in lines:
val=iter(string.split(line,' '))
fout.write('{0:6.2f}'.format(float(val.next()))), # first value in the line has it's own format
for x in val: # iterate over the rest of the numbers in the line
fout.write('{0:10.6f}'.format(float(val.next()))), # the rest of the values in the line has a different format
fout.write('\n')
fin.close()
fout.close()
An example:
Input:
;;; This line is the header line
-5.0 1.090074154029272 1.0034662411357929 0.87336062116561186 0.78649408279093869 0.65599958665017222 0.4379879132749317 0.26310799350679176 0.087808018565486673
-4.9900000000000002 1.0890770415316042 1.0025480136545413 0.87256100700428996 0.78577373527626004 0.65539842673645277 0.43758616966566649 0.26286647978335914 0.087727357602906453
-4.9800000000000004 1.0880820021223023 1.0016316956763136 0.87176305623792771 0.78505488659611744 0.65479851808106115 0.43718526271594083 0.26262546925502467 0.087646864773454014
-4.9700000000000006 1.0870890372077564 1.0007172884938402 0.87096676998908273 0.78433753775986659 0.65419986152386733 0.4367851929843618 0.26238496225635727 0.087566540188423345
-4.9600000000000009 1.086098148170821 0.99980479337809591 0.87017214936140763 0.78362168975984026 0.65360245789061966 0.4363859610200459 0.26214495911617541 0.087486383957276398
Processed:
;;; This line is the header line
-5.00 1.003466 0.786494 0.437988 0.087808
-4.99 1.002548 0.785774 0.437586 0.087727
-4.98 1.001632 0.785055 0.437185 0.087647
-4.97 1.000717 0.784338 0.436785 0.087567
-4.96 0.999805 0.783622 0.436386 0.087486
Other than a few minor changes, due to how Python has changed through time, this looks fine.
You're mixing two different styles of next(); the old way was it.next() and the new is next(it). You should use the string method split() instead of going through the string module (that module is there mostly for backwards compatibility to Python 1.x). There's no need to use go through the almost useless "fileinput" module, since open file handle are also iterators (that module comes from a time before Python's file handles were iterators.)
Edit: As #codeape pointed out, glob() returns the full path. Your code would not have worked if indir was something other than "./". I've changed the following to use the correct listdir/os.path.join solution. I'm also more familiar with the "%" string interpolation than string formatting.
Here's how I would write this in more idiomatic modern Python
def reformat(fin, fout):
fout.write(next(fin)) # just copy the first line (the header) to output
for line in fin:
fields = line.split(' ')
# Make a format header specific to the number of fields
fmt = '%6.2f' + ('%10.6f' * (len(fields)-1)) + '\n'
fout.write(fmt % tuple(map(float, fields)))
basenames = os.listdir(indir) # get a list of input ASCII files to be processed
for basename in basenames:
input_filename = os.path.join(indir, basename)
output_filename = os.path.join(outdir, basename)
with open(input_filename, 'r') as fin, open(output_filename, 'w') as fout:
reformat(fin, fout)
The Zen of Python is "There should be one-- and preferably only one --obvious way to do it". It's interesting how you functions which, during the last 10+ years, was "obviously" the right solution, but are no longer. :)
fin=open(indir+filename,'r') # input file
fout=open(outdir+filename,'w') # out: processed file
#code
fin.close()
fout.close()
can be written as:
with open(indir+filename,'r') as fin, open(outdir+filename,'w') as fout:
#code
In python 2.6, you can use:
with open(indir+filename,'r') as fin:
with open(outdir+filename,'w') as fout:
#code
And the line
lines = iter(fileinput.input([indir+filename]))
is useless. You can just iterate over an open file(fin in your case)
You can also do line.split(' ') instead of string.split(line, ' ')
If you change those things, there is no need to import string and fileinput.
Edit: I didn't know you can use inline code. That's cool
In my build script, I have this code:
inFile = open(sourceFile,'r')
outFile = open(targetFile,'w')
for line in inFile:
line = doKeywordSubstitution(line)
outFile.write(line)
inFile.close()
outFile.close()
I don't know of a way to make this any more concise. Putting the line-changing logic in a different function looks neater to me though.
I may be missing the point of your code, but I don't understand why you have lines = iter(fileinput.input([indir+filename])).
I don't understand why do you use: string.split(line, ' ') instead of just line.split(' ').
Well maybe I would write the string-processing part like this:
values = line.split(' ')
values[0] = '{0:6.2f}'.format(float(values[0]))
values[1:] = ['{0:10.6f}'.format(float(v)) for v in values[1:]]
fout.write(' '.join(values))
At least for me this looks better but this might be subjective :)
Instead of indir I would use os.curdir. Instead of "./processed" I would do: os.path.join(os.curdir, 'processed').

Read a multielement list, look for an element and print it out in python

I am writing a python script in order to write a tex file. But I had to use some information from another file. Such file has names of menus in each line that I need to use. I use split to have a list for each line of my "menu".
For example, I had to write a section with the each second element of my lists but after running, I got anything, what could I do?
This is roughly what I am doing:
texfile = open(outputtex.tex', 'w')
infile = open(txtfile.txt, 'r')
for line in infile.readlines():
linesplit = line.split('^')
for i in range(1,len(infile.readlines())):
texfile.write('\section{}\n'.format(linesplit[1]))
texfile.write('\\begin{figure*}[h!]\n')
texfile.write('\centering\n')
texfile.write('\includegraphics[scale=0.95]{pg_000%i.pdf}\n' %i)
texfile.write('\end{figure*}\n')
texfile.write('\\newpage\n')
texfile.write('\end{document}')
texfile.close()
By the way, in the inclugraphics line, I had to increace the number after pg_ from "0001" to "25050". Any clues??
I really appreciate your help.
I don't quite follow your question. But I see several errors in your code. Most importantly:
for line in infile.readlines():
...
...
for i in range(1,len(infile.readlines())):
Once you read a file, it's gone. (You can get it back, but in this case there's no point.) That means that the second call to readlines is yielding nothing, so len(infile.readlines()) == 0. Assuming what you've written here really is what you want to do (i.e. write file_len * (file_len - 1) + 1 lines?) then perhaps you should save the file to a list. Also, you didn't put quotes around your filenames, and your indentation is strange. Try this:
with open('txtfile.txt', 'r') as infile: # (with automatically closes infile)
in_lines = infile.readlines()
in_len = len(in_lines)
texfile = open('outputtex.tex', 'w')
for line in in_lines:
linesplit = line.split('^')
for i in range(1, in_len):
texfile.write('\section{}\n'.format(linesplit[1]))
texfile.write('\\begin{figure*}[h!]\n')
texfile.write('\centering\n')
texfile.write('\includegraphics[scale=0.95]{pg_000%i.pdf}\n' %i)
texfile.write('\end{figure*}\n')
texfile.write('\\newpage\n')
texfile.write('\end{document}')
texfile.close()
Perhaps you don't actually want nested loops?
infile = open('txtfile.txt', 'r')
texfile = open('outputtex.tex', 'w')
for line_number, line in enumerate(infile):
linesplit = line.split('^')
texfile.write('\section{{{0}}}\n'.format(linesplit[1]))
texfile.write('\\begin{figure*}[h!]\n')
texfile.write('\centering\n')
texfile.write('\includegraphics[scale=0.95]{pg_000%i.pdf}\n' % line_number)
texfile.write('\end{figure*}\n')
texfile.write('\\newpage\n')
texfile.write('\end{document}')
texfile.close()
infile.close()

Categories

Resources