I have 2 lists of strings of the same length, but when I write them to a file with each item on a separate line, the length of the list and the number of lines in the file do not match:
print len(x)
print len(y)
317858
317858
However, when I write each item in the list to a text file, the number of lines in the text file does not match the length of the list.
with open('a.txt', 'wb') as f:
    for i in x[:222500]:
        print >> f, i
In Linux, wc -l a.txt gives 222499, which is right.
with open('b.txt', 'wb') as f:
    for i in y[:222500]:
        print >> f, i
In Linux, wc -l b.txt gives 239610, which is wrong.
When I open b.txt with vi in the terminal, it does have 239610 lines, so I am quite confused as to why this is happening.
How can I debug this?
The only way b.txt can end up with more lines than the number of strings written is that some of the strings in y themselves contain newlines.
Here is a small example:
l = ['a', 'b\nc']
print len(l)
with open('tst.txt', 'wb') as fd:
    for i in l:
        print >> fd, i
This little code will print 2 because list l contains 2 elements, but the resulting file will contain 3 lines:
a
b
c
I'm sure others will quickly point out the cause of this difference (it's related to newline characters), but since you asked 'How can I debug this?' I'd like to address that question:
Since the only difference between the passing and the failing run is the lists themselves, I'd concentrate on those. There is some difference between the lists (i.e. at least one differing list element) which triggers this. Hence, you could perform a binary search to locate the first differing list element triggering this.
To do so, just chop the list in half, e.g. take the first 317858/2 = 158929 lines of each list. Do you still observe the same symptom? If so, repeat the exercise with that first half; otherwise, repeat it with the second half. That way, you'll need at most 19 tries to identify the line which triggers this (since 2^19 > 317858). And at that point, the issue is simplified to a single string.
Chances are that you can spot the issue by just looking at the strings, but in principle (e.g. if the strings are very long), you could then proceed to do a binary search on those strings to identify the first character triggering this issue.
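If you'd rather let Python do the scan directly, here is a minimal sketch (assuming y is the list from the question); any string whose repr() shows an embedded \n or \r is an offender:
suspects = [(i, s) for i, s in enumerate(y) if '\n' in s or '\r' in s]
for i, s in suspects[:10]:   # show the first few offenders only
    print i, repr(s)         # repr() makes embedded newlines visible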
Related
I am very new to Python and am looking for assistance to where I am going wrong with an assignment. I have attempted different ways to approach the problem but keep getting stuck at the same point(s):
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
Problem 2: When I try and combine the lists I keep receiving "None" for my result or NoneType errors [which I think means I have added the Nones together(?)].
The assignment is:
#8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order. You can download the sample data at http://www.py4e.com/code3/romeo.txt
My current code, which is giving me a NoneType error, is:
poem = input("enter file:")
play = open(poem)
lst = list()
for line in play:
    line = line.rstrip()
    word = line.split()
    if not word in lst:
        lst = lst.append(word)
print(lst.sort())
If someone could just talk me through where I am going wrong that will be greatly appreciated!
Your problem was lst = lst.append(word): list.append() modifies the list in place and returns None, so that line replaces your list with None.
with open(poem) as f:
    lines = f.read().split('\n')  # you can also use readlines()
lst = []
for line in lines:
    words = line.split()
    for word in words:
        if word not in lst:  # append each word only once
            lst.append(word)
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
You are doing play = open(poem) and then for line in play:, which is the idiom for processing a file line by line; if you want to process the whole content at once, then do:
play = open(poem)
content = play.read()
words = content.split()
Please always remember to close the file after you have used it, i.e. do
play.close()
unless you use the context-manager way (i.e. with open(poem) as f:).
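Putting the pieces together, a minimal sketch of the corrected assignment (keeping your input() prompt) could look like this:
poem = input("enter file:")
lst = list()
with open(poem) as play:            # the context manager closes the file for us
    for line in play:
        for word in line.split():   # split() already discards the newline
            if word not in lst:     # only append words we haven't seen yet
                lst.append(word)
lst.sort()   # list.sort() sorts in place and returns None
print(lst)   # so sort first, then print the list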
Just to help you get into Python a little more:
You can:
1. Read the whole file at once (if it is big, it is better to grab it all into RAM if you have enough of it; if not, grab as big a chunk as is reasonable, then grab another one, and so on)
2. Split the data you read into words, and
3. Use set() or dict() to remove duplicates
Along the way, you shouldn't forget to pay attention to upper and lower case,
if you want the same words, not just distinct, non-repeating strings.
This will work in Py2 and Py3 as long as you do something about the input() function in Py2 (or just use quotes when entering the path), so:
path = input("Filename: ")
f = open(path)
c = f.read()
f.close()
words = set(x.lower() for x in c.split()) # This will make a set of lower case words, with no repetitions
# This equals to:
#words = set()
#for x in c.split():
# words.add(x.lower())
# set() is an unordered datatype ignoring duplicate items
# and it mimics mathematical sets and their properties (unions, intersections, ...)
# it is also fast as hell.
# Checking a list() each time for existence of the word is slow as hell
#----
# OK, you need a sorted list, so:
words = sorted(words)
# Or step-by-step:
#words = list(words)
#words.sort()
# Now words is your list
As for your errors, do not worry, they are common at the beginning in almost any object-oriented language.
Others explained them well in their comments. But so this answer isn't lacking...:
Always pay attention to which functions or methods operate on the datatype in place (list.sort(), list.append(), list.insert(), set.add()...) and which ones return a new version of the datatype (sorted(), str.lower()...).
If you run into a similar situation again, use help() in the interactive shell to see what exactly a function you used does.
>>> help(list.append)
>>> help(list.sort)
>>> help(str.lower)
>>> # Or any short documentation you need
Python, especially Python 3.x, is strict about operations between mixed types, but some combinations have a different meaning and will actually work while doing unexpected stuff.
E.g. you can do:
print(40*"x")
It will print out 40 'x' characters, because multiplying a string by an integer repeats the string.
But:
print([1, 2, 3]+None)
will, logically, not work, and that is the kind of thing happening somewhere in the rest of your code.
In some languages like JavaScript (terrible stuff) this will work perfectly well:
v = "abc "+123+" def";
inserting the 123 seamlessly into the string, which is useful, but a programming nightmare and nonsense from another viewing angle.
Also, in Py3 the reasonable Py2 assumption that you can mix unicode and byte strings and that an automatic cast will be performed no longer holds.
I.e. this is a TypeError:
print(b"abc"+"def")
because b"abc" is bytes() and "def" (or u"def") is str() in Py3 - what is unicode() in Py2)
Enjoy Python, it is the best!
I have a text file with a list of repeated names (some of which have accented alphabets like é, à, î etc.)
e.g. List: Précilia, Maggie, Précilia
I need to write a code that will give an output of the unique names.
But my text file seems to have different character encodings for the accented é in the two occurrences of Précilia (I'm guessing perhaps ASCII for one and UTF-8 for the other). Thus my code gives both occurrences of Précilia as different unique elements. You can find my code below:
seen = set()
with open('./Desktop/input1.txt') as infile:
    with open('./Desktop/output.txt', 'w') as outfile:
        for line in infile:
            if line not in seen:
                outfile.write(line)
                seen.add(line)
Expected output: Précilia, Maggie
Actual and incorrect output: Précilia, Maggie, Précilia
Update: The original file is a very large file. I need a way to consider both these occurrences as a single one.
So my boss suggested we use Unicode normalization, which replaces equivalent sequences of characters so that any two texts that are equivalent are reduced to the same sequence of code points, called the normalization form (or normal form) of the original text.
More details can be found on https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html and https://github.com/aws/aws-cli/issues/1639
As of now we got positive results on our test cases and hopefully our main data set will work with this too.
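A minimal sketch of that approach with the standard library's unicodedata module (assuming Python 3 and UTF-8 files; the paths are the ones from the question):
import unicodedata

seen = set()
with open('./Desktop/input1.txt', encoding='utf-8') as infile:
    with open('./Desktop/output.txt', 'w', encoding='utf-8') as outfile:
        for line in infile:
            # NFC reduces equivalent sequences (precomposed 'é' vs.
            # 'e' + combining accent) to the same code points
            key = unicodedata.normalize('NFC', line)
            if key not in seen:
                outfile.write(line)
                seen.add(key)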
First of all, I'm new to Python, so maybe my code is a little weird or borderline wrong, but it works, so there is that.
I've been googling this problem but can't find anyone who writes about it. I have this huge list written like this:
1 2 3 4 5
2 2 2 2 2
3 3 3 3 3
etc. Note that it is spaces and not tabs, and this I can't change, since I'm working with a printout from LS-DYNA.
So I am using this script to remove the whitespace before the numbers, since it has been giving me trouble when trying to format the numbers into a matrix; I then remove the empty lines afterwards:
for line in input:
    print >> output, line.lstrip(' ')
but for some reason I have 4442 lines in the input (and here I mean written lines, which are easy to track since they are enumerated), yet the output only has 4411, so it removes 31 lines with numbers I need.
Why is this?
The lstrip() won't remove lines, because it is used inside the print statement, which (the way you use it) will always append a newline character. But the for line in input might step through the lines in an unexpected way, i.e. it could skip lines or combine them in a manner you didn't expect.
Maybe a mix of newline and carriage-return characters results in this strange problem.
I propose to leave the .lstrip(' ') out for testing and compare the output with the input to find the places where something gets changed. Probably you should use output.write(line) to circumvent all the automatics of the print statement (especially the appending of newline characters).
Then you should use a special separator when outputting (output.write('###' + line) or similar) to find out how the iteration through the input takes place.
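A minimal sketch of that diagnostic, combining both suggestions (the file names input.txt and output.txt are placeholders for your actual files):
with open('input.txt') as inp:
    with open('output.txt', 'w') as outp:
        for line in inp:
            # no lstrip(), no print statement: write each line unchanged, with
            # a marker in front, to see exactly how iteration splits the input
            outp.write('###' + line)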
I am having an issue with writing a list to a file. I am annotating certain files to change them into a certain format, so I read sequence alignment files, store them in lists, do the necessary formatting, and then write them to a new file. The problem is that while my list containing sequence alignments is structured correctly, the output produced when it writes them to new files is incorrect (it does not replicate my list structure). I include only a section of my output and what it should look like, because the list itself is far too long to post.
OUTPUT WRITTEN TO FILE:
>
TRFE_CHICK
From XALIGN
MKLILCTVLSLGIAAVCFAAP (seq spans multiple lines) ...
ADYIKAVSNLRKCS--TSRLLEAC*> (end of sequence, * should be on a newline, followed by > on a newline as well)
OUTPUT IS SUPPOSED TO BE WRITTEN AS:
>
TRFE_CHICK
From XALIGN
MKLILCTVLSLGIAAVCFAAP (seq spans many lines) ...
ADYIKAVSNLRKCS--TSRLLEAC
*
>
It does this misformatting multiple times over. I have tried pickling and unpickling the list but that misformats it further.
My code for producing the list and writing to file:
new = []
for line in alignment1:
    if line.endswith('*\n'):
        new.append(line.strip('*\n'))
        new.append('*')
    else:
        new.append(line)
new1 = []
for line in new:
    if line.startswith('>'):
        twolines = line[0] + '\n' + line[1:]
        new1.append(twolines)
        continue
    else:
        new1.append(line)
for line in new1:
    alignfile_annot.write(line)
Basically, I have coded it so that it reads the alignment file, inserts a line between the end of the sequence and the * character, and also so that > followed by the ID code is always on a new line. This is the way my list is built, but not the way it is written to the file. Anyone know why the misformatting happens?
Apologies for the long text, I tried to keep it as short as possible to make my issue clear
I'm running Python 2.6.5
new.append(line.strip('*\n'))
new.append('*')
You have a list of lines (with newline terminators each), so you need to include \n for these two lines, too:
new.append(line[:-2] + "\n") # slice as you just checked line.endswith("*\n")
new.append("*\n")
Remember the strip (or slice, as I've changed it to) will remove the newline, so splitting a single item in the list with a value of "...*\n" into two items of "..." and "*" actually removes a newline from what you had originally.
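A quick demonstration of the difference, using the sequence line from the question:
line = "ADYIKAVSNLRKCS--TSRLLEAC*\n"
print repr(line.strip('*\n'))   # 'ADYIKAVSNLRKCS--TSRLLEAC'   -- newline lost
print repr(line[:-2] + "\n")    # 'ADYIKAVSNLRKCS--TSRLLEAC\n' -- newline kept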
I'm using python 2.6 on linux.
I have two text files
first.txt has a single string of text on each line. So it looks like
lorem
ipus
asfd
The second file doesn't quite have the same format.
it would look more like this
1231 lorem
1311 assss 31 1
etc
I want to take each line of text from first.txt and determine if there's a match in the second file. If there isn't a match, then I would like to save the missing text to a third file. I would like to ignore case, but that's not completely necessary. This is why I was looking at regex, but I didn't have much luck.
So I'm opening the files, using readlines() to create a list.
Iterating through the lists and printing out the matches.
Here's my code
first_file = open('first.txt', "r")
first = first_file.readlines()
first_file.close()
second_file = open('second.txt', "r")
second = second_file.readlines()
second_file.close()
while i < len(first):
    j = search[i]
    while k < len(second):
        m = compare[k]
        if not j.find(m):
            print m
        i = i + 1
        k = k + 1
exit()
It's definitely not elegant. Anyone have suggestions how to fix this or a better solution?
My approach is this: Read the second file, convert it into lowercase and then create a list of the words it contains. Then convert this list into a set, for better performance with large files.
Then go through each line in the first file, and if it (also converted to lowercase, and with extra whitespace removed) is not in the set we created, write it to the third file.
with open("second.txt") as second_file:
second_values = set(second_file.read().lower().split())
with open("first.txt") as first_file:
with open("third.txt", "wt") as third_file:
for line in first_file:
if line.lower().strip() not in second_values:
third_file.write(line + "\n")
set objects are a simple container type that is unordered and cannot contain duplicate values. It is designed to allow you to quickly add or remove items, or tell if an item is already in the set.
with statements are a convenient way to ensure that a file is closed, even if an exception occurs. They are enabled by default from Python 2.6 onwards; in Python 2.5 they require that you put the line from __future__ import with_statement at the top of your file.
The in operator does what it sounds like: tell you if a value can be found in a collection. When used with a list it just iterates through, like your code does, but when used with a set object it uses hashes to perform much faster. not in does the opposite. (Possible point of confusion: in is also used when defining a for loop (for x in [1, 2, 3]), but this is unrelated.)
Assuming that you're looking for the entire line in the second file:
second_file = open('second.txt', "r")
second = second_file.readlines()
second_file.close()
first_file = open('first.txt', "r")
for line in first_file:
    if line not in second:
        print line,  # trailing comma: line already ends with a newline
first_file.close()
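Since you said you'd like to ignore case, here is a minimal variant of the same idea that also uses a set for fast lookups, as in the other answer (lowercasing both sides is an assumption about what counts as a match):
second_file = open('second.txt', "r")
second = set(line.lower() for line in second_file)  # set membership tests are fast
second_file.close()
first_file = open('first.txt', "r")
third_file = open('third.txt', "w")
for line in first_file:
    if line.lower() not in second:
        third_file.write(line)  # line keeps its own newline
first_file.close()
third_file.close()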