Difference between 2 text files with unknown size [closed] - python

I have spent a couple of days hoping to find a solution, but with no success! I have two text files, each with many lines. One file can contain thousands of lines of numbers, one number per line, for example:
79357795
79357796
68525650
The second file also contains numbers, but far fewer, maybe a hundred lines (again one number per line). I tried some "algorithms" but had no success. Now, my question is: can I check the first line from the first file against all lines from the second file, then the second line from the first file against all lines from the second file, and so on up to the end of the file? As a result, I want to save the difference between these two files in a third file. Thank you all for your responses, and sorry for my bad English. :)
PS: Oh yes, I need to do this in Python.
More details:
first_file.txt contains:
79790104
79873070
69274656
69180377
60492209
78177852
79023241
69736256
68699620
79577311
78509545
69656007
68188871
60643247
78898817
79924105
79684143
79036022
69445507
60605544
79348181
69748018
69486323
69102802
68651099
second_file.txt contains:
78509545
69656007
68188871
60643247
78898817
79924105
79684143
79036022
69445507
60605544
79348181
69748018
69486323
69102802
68651099
79357794
78953958
69350610
78383111
68629321
78886856
third_file.txt needs to contain the numbers that do not exist in first_file.txt but do exist in second_file.txt, in this case:
79357794
78953958
69350610
78383111
68629321
78886856

Something like:
from itertools import ifilterfalse
with open('first') as fst, open('second') as snd, open('not_second', 'w') as fout:
    snd_nums = set(int(line) for line in snd)
    fst_not_in_snd = ifilterfalse(snd_nums.__contains__, (int(line) for line in fst))
    fout.writelines(str(num) + '\n' for num in fst_not_in_snd)
(This writes the numbers that are in the first file but not the second; swap the roles of the two files to get the output your sample shows.)

Yes.
Edit: This will give you all numbers that are in both lists (which is what you first asked for). See the other answers for what your data set actually wants. (I like 1_CR's answer.)
with open('firstfile.txt') as f:
    file1 = f.read().splitlines()
with open('secondfile.txt') as f:
    file2 = f.read().splitlines()
for x in file1:
    for y in file2:
        if x == y:
            print "Found matching: " + x
            # do what you want here
It could be made more efficient, but the files don't sound that big, and this is the simplest way.

Well, if I were you, I would load those files into two lists and then iterate through one of them, looking up each value in the second one.
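A minimal sketch of that idea (the file names are assumptions, and the numbers are compared as strings):
with open('first_file.txt') as f:
    first = [line.strip() for line in f]
with open('second_file.txt') as f:
    second = [line.strip() for line in f]

with open('third_file.txt', 'w') as out:
    for num in second:
        if num not in first:  # linear lookup in the first list
            out.write(num + '\n')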

If the files are small enough to load into memory, sets are an option
with open('firstfile.txt') as f1, open('second_file.txt') as f2:
    print '\n'.join(set(f2.read().splitlines()).difference(f1.read().splitlines()))

Related

Editing of txt files not saving when I concatenate them

I am fairly new to programming, so bear with me!
We have a task at school in which we have to clean up three text files ("Balance1", "Saving", and "Withdrawal") and append them together into a new file. These files are just names and sums of money listed downwards, but some of it is jumbled. This is my code for the first file, Balance1:
with open('Balance1.txt', 'r+') as f:
    f_contents = f.readlines()
    # Then I start cleaning up the lines. Here I edit Anna's savings to an integer.
    f_contents[8] = "Anna, 600000"
    # Here I delete the blank lines and edit in the 50000 to Philip.
    del f_contents[3]
    del f_contents[3]
In the original text file, Anna's savings is written like this: "Anna, six hundred thousand", and we have to make it look clean, so it is rather "NAME, SUM" (with the sum as an integer). When I print this as a list it looks good, but after I have done this with all three files, I try to append them together into a file called "Balance.txt" like this:
filenames = ["Balance1.txt", "Saving.txt", "Withdrawal.txt"]
with open("Balance.txt", "a") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
When I check the new text file "Balance", it has appended them together, but just as they were in the beginning, not with my edits. So it is not "cleaned up". Can anyone help me understand why this happens, and what I have to do so that it appends the edited, clean versions?
In the first part, where you do the "editing" of the Balance1.txt file, this is what happens:
You open the file in read mode
You load the data into memory
You edit the in memory data
And voila.
You never persisted the changes to any file on the disk. So when in the second part you read the content of all the files, you will read the data that was originally there.
So if you want to concatenate the edited data, you have 2 choices:
Pre-process the data by creating 3 final correct files (editing Balance1.txt and persisting it to another file, say Balance1_fixed.txt) and then, in the second part, concatenate ["Balance1_fixed.txt", "Saving.txt", "Withdrawal.txt"]. Total of 4 data file openings, so more IO.
Use only the second loop you have, and correct the contents before writing them to the outfile. You can use readlines() like you did first, edit the specific lines, and then use writelines(). Total of 3 data file openings, so less IO than the previous option; a sketch of this follows below.
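A minimal sketch of the second option, reusing the edits from the question (the line indices are assumptions carried over from the original code):
filenames = ["Balance1.txt", "Saving.txt", "Withdrawal.txt"]
with open("Balance.txt", "a") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            contents = infile.readlines()
        if filename == "Balance1.txt":
            # Apply the in-memory edits before writing anything out
            contents[8] = "Anna, 600000\n"
            del contents[3]
            del contents[3]
        outfile.writelines(contents)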

File conversion between .fasta and .genbank format

I have to create two functions that should allow me to open .genbank files and convert them into a .fasta file and the other way around. What I have for the moment is this:
def Convert(file, file1)
    handle_input=open('file', 'rU')
    handle_output=open('file1', 'w')
    while True:
        s=handle_input.readline()
        t=handle_output.write(s, '.genbank')
        print(t)
Convert('file.fas', 'file.genbank')
It is also probably not correct, but I have no idea what to do.
You can find a lot of documentation about this on the internet. Take a look here: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
But to get you started:
I assume that the two files will not be identical in the future, because otherwise you could just copy the file.
I have a couple of remarks.
1) Your while True loop will run till the end of time. Change it to something like:
for line in handle_input:
2) Close your files when you are done:
handle_input.close()
handle_output.close()
3) In t=handle_output.write(s, '.genbank'), remove the '.genbank' argument; write() takes a single string.
4) There is no need to do print(t).
Note: I haven't tested this code, so I could have made some small mistakes.
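Putting those remarks together, a minimal corrected sketch might look like the following. Note that it only copies lines from one file to the other; a real format conversion between FASTA and GenBank needs actual parsing (a library such as Biopython provides this), so treat it as a starting point only:
def Convert(file_in, file_out):
    handle_input = open(file_in, 'rU')
    handle_output = open(file_out, 'w')
    for line in handle_input:
        handle_output.write(line)
    handle_input.close()
    handle_output.close()

Convert('file.fas', 'file.genbank')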

Building on "How to read and write a table / matrix to file with python?"

Back on Feb 8 '13 at 20:20, YamSMit asked a question (see: How to read and write a table / matrix to file with python?) similar to what I am struggling with: starting out with an Excel table (CSV) that has 3 columns and a varying number of rows. The contents of the columns are string, floating point, and string. The first string will vary in length, while the other string can be fixed (e.g., 2 characters). The table needs to go into a 2-dimensional array so that I can do manipulations on the data to produce a final file (which will be a text file). I have experimented with a variety of strategies presented on Stack Overflow, but I am always missing something, and I haven't seen an example with all the parts, which is the reason for the struggle to figure this out.
Sample data will be similar to:
Ray Smith, 41645.87778, V1
I have read about and explored numpy and astropy, since the available documentation says they make this type of code easy. I have tried import csv. Somehow, the code doesn't come together. I should add that I am writing in Python 3.2.3 (which seems to be a mistake, since a lot of documentation is for Python 2.x).
I realize the basic nature of this question directs me to read more tutorials. I have been reading many, yet the tutorials always refer to enough that is different, that I fail to assemble the right pieces: read the table file, write into a 2D array, then... do more stuff.
I am grateful to anyone who might provide me with a workable outline of the code, or pointing me to specific documentation I should read to handle the specific nature of the code I am trying to write.
Many thanks in advance. (Sorry for the wordiness - just trying to be complete.)
I am more familiar with 2.x, but from the 3.3 csv documentation found here, it seems to be mostly the same as 2.x. The following function will read a csv file, and return a 2D array of the rows found in the file.
import csv
def read_csv(file_name):
    array_2D = []
    # In Python 3, open the file in text mode with newline='' for the csv module
    with open(file_name, newline='') as csvfile:
        # Assuming your csv file has been set up with the ';' delimiter -
        # there are other options, for which you should see the first link.
        read = csv.reader(csvfile, delimiter=';')
        for row in read:
            array_2D.append(row)
    return array_2D
You would then be able to manipulate the data as follows (assuming your csv file is called 'foo.csv' and the desired text file is 'foo.txt'):
data = read_csv('foo.csv')
with open('foo.txt', 'w') as textwrite:
    for row in data:
        string = '{0} has {1} apples in his Ford {2}.\n'.format(row[0], row[1], row[2])
        textwrite.write(string)
        # if you know the second column is a float:
        manipulate = float(row[1])*3
        textwrite.write(str(manipulate) + '\n')
string would then be written to 'foo.txt' as:
Ray Smith has 41645.87778 apples in his Ford V1.
and manipulate would be written to 'foo.txt' as:
124937.63334

Python - Import txt in a sequential pattern

In the directory I have, say, 30 txt files, each containing two columns of numbers with roughly 6000 numbers in each column. What I want to do is import the first 3 txt files and process the data, which gives me the desired output, then move on to the next 3 txt files.
The directory looks like:
file0a
file0b
file0c
file1a
file1b
file1c ... and so on.
I don't want to import all of the txt files simultaneously; I want to import the first 3, process the data, then the next 3, and so forth. I was thinking of making a dictionary, though I have a feeling this might involve writing each file name into the dictionary, which would take far too long.
EDIT:
For those that are interested, I think I have come up with a workaround. Any feedback would be greatly appreciated, since I'm not sure if this is the quickest way to do things or the most Pythonic.
import glob
import numpy as np

def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

Data = []
txt_files = glob.iglob("./*.txt")
for data in txt_files:
    d = np.loadtxt(data, dtype=np.float64)
    Data.append(d)
Data_raw_all = list(chunks(Data, 3))
Here the list 'Data' holds all of the text files from the directory, and 'Data_raw_all' uses the function 'chunks' to group the elements of 'Data' into sets of 3. This way, selecting one element of Data_raw_all selects the corresponding 3 text files in the directory. Note that glob.iglob does not guarantee any particular order, so if the grouping must follow the file names, use sorted(glob.glob("./*.txt")) instead.
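For example, to work through the groups one at a time (process_group here is a hypothetical placeholder for your own processing routine):
for group in Data_raw_all:
    # group is a list of up to 3 numpy arrays, one per text file
    process_group(group)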
First of all, I have nothing original to include here and I definitely do not want to claim credit for it at all because it all comes from the Python Cookbook 3rd Ed and from this wonderful presentation on generators by David Beazley (one of the co-authors of the aforementioned Cookbook). However, I think you might really benefit from the examples given in the slideshow on generators.
What Beazley does is chain a bunch of generators together in order to do the following:
yields filenames matching a given filename pattern;
yields open file objects from a sequence of filenames;
concatenates a sequence of generators into a single sequence;
greps a series of lines for those that match a regex pattern.
All of these code examples are located here. The beauty of this method is that the chained generators simply chew up the next pieces of information: they don't load all files into memory in order to process all the data. It's really a nice solution.
Anyway, if you read through the slideshow, I believe it will give you a blueprint for exactly what you want to do: you just have to change it for the information you are seeking.
In short, check out the slideshow linked above and follow along and it should provide a blueprint for solving your problem.
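For illustration, here is a minimal sketch in the spirit of those slides; the function names mirror Beazley's examples, but treat it as an outline rather than his exact code:
import os
import fnmatch
import re

def gen_find(filepat, top):
    # Yield filenames matching a pattern while walking a directory tree
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

def gen_open(filenames):
    # Yield open file objects from a sequence of filenames
    for name in filenames:
        yield open(name)

def gen_cat(sources):
    # Concatenate a sequence of iterables into a single sequence
    for src in sources:
        for item in src:
            yield item

def gen_grep(pattern, lines):
    # Yield only the lines that match a regex pattern
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# Chained together, no file is ever fully loaded into memory:
matching = gen_grep(r'^\d+$', gen_cat(gen_open(gen_find("*.txt", "."))))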
I'm presuming you want to hardcode as few of the file names as possible, so most of this code is for generating the filenames. The files are then opened with a with statement.
Example code:
from itertools import count

root = "UVF2CNa"
for n in count(1):
    for char in "abc":  # not cycle("abc"): that would never let the outer loop advance
        first_part = "{}{}{}".format(root, n, char)
        try:
            with open(first_part + "i") as i,\
                 open(first_part + "j") as j,\
                 open(first_part + "k") as k:
                # do stuff with files i, j and k here
                pass
        except FileNotFoundError:
            # deal with this however, e.g. break out of both loops
            pass

Python/Excel: How to freeze the top line of a generated sheet (or sheets) [closed]

This isn't actually a question -- it's more of an FYI. It took me a while to figure this out, since my searching with Google turned up bits and pieces (including some related questions here on Stack Overflow), but no article or blurb gave me the whole answer in one place. I was finally able to figure it out by recording a macro in Excel, looking at its source code, and combining what I found there with what I found on the web.
Anyway, the following code snippets show two helper functions, plus a sample main() type of function (which I always use, even in Perl and Python apps, since I have a C/C++/Java background).
The first helper method will auto-fit every used column on the given sheet. Since the auto-fit logic depends on the data in the cells, as well as the font(s) in use, you shouldn't call this method until the entire sheet has been completely populated with your data. Also, I'm pretty new to VBA, so I'm not so sure that this line:
sheet.Cells(1, i).EntireColumn.AutoFit()
is correct, but it does seem to work just fine. If it's not entirely correct, maybe you can post a correction.
def autoFitSheet(sheet):
    """Auto-fits all the used columns on the given sheet. Obviously, this
    method should only be called _after_ the sheet has been populated."""
    firstCol = sheet.UsedRange.Column
    numCols = sheet.UsedRange.Columns.Count
    for i in range(firstCol, (firstCol + numCols)):
        sheet.Cells(1, i).EntireColumn.AutoFit()
The second helper is a "freeze the top line" method. All it does is freeze the top line of the sheet that you pass it. A side effect of this method is that the sheet becomes the active sheet in the spreadsheet, so you might need/want to activate some other sheet once all your sheets' top lines have been frozen.
def freezeTopLine(window, sheet):
    """Freezes the top line of the given sheet."""
    sheet.Activate()
    window.SplitColumn = 0
    window.SplitRow = 1
    window.FreezePanes = True
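Since each call leaves a different sheet active, you may want to activate the sheet the user should see first once the loop in main() below has finished; something like this one-liner should do it:
workbook.Worksheets(1).Activate()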
The last snippet is a sample main() function, which just shows how to call the two helper methods. I'm actually using code very similar to this to populate a spreadsheet with multiple tabs/sheets, and it works perfectly (on Windows 7 -- can't guarantee how well it will work on your system).
import win32com.client as win32

def main(...):
    #
    # (whatever...)
    #
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    workbook = excel.Workbooks.Add()
    sheets = workbook.Sheets
    window = workbook.Windows(1)
    sheet1 = workbook.Worksheets('Sheet1')
    sheet1.Name = 'My First Sheet'
    sheet1.Tab.ColorIndex = 6  # yellow
    sheet2 = workbook.Worksheets('Sheet2')
    sheet2.Name = 'Some Other Sheet'
    sheet2.Tab.ColorIndex = 4  # bright green
    #
    # Populate sheets...
    #
    for sheet in sheets:
        autoFitSheet(sheet)
        freezeTopLine(window, sheet)
    #
    # (whatever...)
    #
Like I said, it took me a while to figure this out, and I was hoping that maybe I could save someone some of the headache that I went through. Good luck.
