IndexError: string index out of range, line is empty? - python

I'm trying to make a program that takes a letter the user inputs, reads a text file and then prints the words that start with that letter.
item = "file_name"
letter = raw_input("Words starting with: ")
letter = letter.lower()
found = 0
with open(item) as f:
filelength = sum(1 for line in f)
for i in range (filelength):
word = f.readline()
print word
if word[0] == letter:
print word
found += 1
print "words found:", found
I keep receiving the error
"if word[0] == letter: IndexError: string index out of range"
with no lines being printed. I think this is what happens if there's nothing there, but there are 50 lines of random words in the file, so I'm not sure why it is being read this way.

You have two problems:
You are trying to read the whole file twice (once to determine the filelength, then again to get the lines themselves), which won't work; and
You aren't dealing with empty lines, so if any are introduced (e.g. if the last line is blank) your code will break anyway.
The easiest way to do this is:
found = 0
with open(item) as f:
    for line in f:  # iterate over lines directly
        if line and line[0] == letter:  # skip blank lines and any that don't match
            found += 1
print "words found:", found
if line skips blanks because empty sequences are falsy, and the short-circuit ("lazy") evaluation of and means that line[0] is only evaluated when the line isn't empty. You could alternatively use line.startswith(letter).
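A quick illustration of both points (hypothetical values, Python 2 as in the question):

line = ""
if line and line[0] == "a":  # "" is falsy, so line[0] is never evaluated
    print "match"
else:
    print "blank line or no match"
print "apple".startswith("a")  # True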

When you use sum(1 for line in f) you are already consuming all the lines in your file, so now your handle points to the end of the file. Try using f.seek(0) to return the read cursor to the start of the file.
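For completeness, a minimal sketch of the original approach with the cursor reset and a blank-line guard added (Python 2, as in the question; the file name is a placeholder):

item = "file_name"  # placeholder file name, as in the question
letter = raw_input("Words starting with: ").lower()
found = 0
with open(item) as f:
    filelength = sum(1 for line in f)  # this pass consumes every line
    f.seek(0)                          # rewind the read cursor to the start
    for i in range(filelength):
        word = f.readline()
        if word and word[0] == letter:  # guard against blank lines
            print word
            found += 1
print "words found:", found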


How to get only 5mers from a sequence

I have a file that contains millions of sequences. What I want to do is to get 5mers from each sequence in every line of my file.
My file looks like this:
CGATGCATAGGAA
GCAGGAGTGATCC
my code is:
with open('test.txt','r') as file:
    for line in file:
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
With this code I should not get 4-mers, but I do, even though I have an if statement checking for a length of 5. Could anyone help me with this? Thanks
my output is:
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
ATCC
but the ideal output should be only the one with length equal to 5 (for each line separately):
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
When iterating through a file, every character is represented somewhere. In particular, the last character for each of those lines is a newline \n, which you're printing.
with open('test.txt') as f: data = list(f)
# data[0] == 'CGATGCATAGGAA\n'
# data[1] == 'GCAGGAGTGATCC\n'
So the very last substring you're trying to print from the first line is 'GGAA\n', which has a length of 5, but it's giving you the extra whitespace and the appearance of 4mers. One of the comments proposed a satisfactory solution, but when you know the root of the problem you have lots of options:
with open('test.txt', 'r') as file:
    for line_no, line in enumerate(file):
        if line_no: print()  # for the space between chunks which you seem to want in your final output -- omit if not desired
        line = line.strip()  # remove surrounding whitespace, including the pesky newlines
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
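Another option along the same lines is to strip each line first and only loop over the positions that can still yield a full 5-mer (a minimal sketch, not the only way to do it):

with open('test.txt', 'r') as file:
    for line in file:
        line = line.strip()              # drop the trailing newline
        for i in range(len(line) - 4):   # last valid start position for a 5-mer
            print(line[i:i + 5])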

How to get first word from text file removing \n - python

If the text file is \n\n Hello world!\n I like python.\n
How do I get the first word from that text?
I tried to code:
def word_file(file):
    files = open(file, 'r')
    l = files.readlines()
    for i in range(len(l)):
        a = l[i].rstrip("\n")
    line = l[0]
    word = line.strip().split(" ")[0]
    return word
There is a space in front of Hello.
The result I get is NONE. How should I correct it?
Can anybody help?
Assuming there is a word in the file:
def word_file(f):
    with open(f) as file:
        return file.read().split()[0]
file.read() reads the entire file as a string. Do a split with no arguments on that string (i.e. sep=None). Then, according to the Python manual, "runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace." So the splitting is done on consecutive whitespace and no empty strings are returned as a result of the split. Therefore the first element of the returned list will be the first word in the file.
If there is a possibility that the file is empty or contains nothing but white space, then you would need to check the return value from file.read().split() to ensure it is not an empty list.
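The difference is easy to see on a small example (a hypothetical string shaped like the file in the question):

s = "\n\n Hello world!\n I like python.\n"
s.split(" ")[0]  # '\n\n'  -- splitting on a single space keeps the newlines
s.split()[0]     # 'Hello' -- splitting on any run of whitespace drops empty strings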
If you need to avoid having to read the entire file into memory at once, then the following, less terse code can be used:
def word_file(f):
    with open(f) as file:
        for line in file:
            words = line.split()
            if words:
                return words[0]
    return None  # No words found
Edit: @Booboo's answer is far better than mine.
This should work:
def word_file(file):
    with open(file, 'r') as f:
        for line in f:
            for index, character in enumerate(line):
                if not character.isspace():
                    line = line[index:]
                    for ind, ch in enumerate(line):
                        if ch.isspace():
                            return line[:ind]
                    return line  # could not find whitespace character at end
    return None  # no words found
output:
Hello

Reading from a file with f.readline() reads more lines than the real file has

I'm trying to make a sorting function to order words by length, so the first step is finding the longest word.
In order to do this I've produced the following code, it works as intended, but there's a bug that caught my attention and I'm trying to figure out what causes it.
After replacing a word, the buffer always reads the next line as if it were empty; it eventually reaches the next word, but reports that the read happened on the wrong line.
def sort():
    f = open("source.txt","r")
    w = open("result.txt","a+")
    word = ''
    og_lines = sum(1 for line in open('source.txt'))
    print("Sorting",og_lines,"words...")
    new_lines = 1
    lenght = 0
    word = f.readline(new_lines)
    new_lines+=1
    buffer=f.readline(new_lines)
    while (buffer != ''):
        if (len(buffer)>len(word)):
            word=buffer
            print("change")
        new_lines+=1
        print(new_lines, "lines read")
        buffer=f.readline(new_lines)
        buffer.rstrip()
    lenght = len(word.rstrip())
    print("Longest word is",word.rstrip(),lenght)
I expected it to read 25 lines, but since it found a longer word 4 times along the way, it ended up reading the nonexistent line 29, yet it returned the word from the real line 25.
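For reference, the number passed to readline is a size limit in characters, not a line number, so f.readline(new_lines) can return less than a full line and leave the file position mid-line, which is likely why the counts drift. A minimal sketch with a hypothetical file:

with open("source.txt") as f:
    part = f.readline(3)   # at most 3 characters of the first line
    rest = f.readline()    # the remainder of that same line, including the newline
    print(part, rest)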

reading strings from large file faster

I have a large text file (parsed.txt) which includes almost 1,500,000 lines. Each line is in this format:
foobar foo[Noun]+lAr[A3pl]+[Pnon]+[Nom]
loremipsum lorem[A1sg]+lAr[A3pl]+[Pl]+[Nom]
I pass the second field (after the space) and get back the first field (before the space) with this function:
def find_postag(word,postag):
    with open('parsed.txt',"r") as zemberek:
        for line in zemberek:
            if all(i in line for i in (word,postag)):
                if line.split(" ")[0].startswith(word) and line.split(" ")[1] == word+postag:
                    selectedword = line.split(" ")[0]
                    break
    return selectedword
However, it works too slowly, and I'm not sure how to make the process faster. My idea is this: the parsed.txt file is alphabetically ordered, so if the given word starts with the letter "z", the function reads almost 900,000 lines unnecessarily. Maybe it would be faster to start checking from line 900,000 when the given word starts with "z". Are there any better ideas, and how can I implement them?
Since your input file is alphabetical, what you could do is create a dictionary that contains the line number where each letter starts, like this:
with open('parsed.txt', 'r') as f:
    data = [line.strip() for line in f if line.strip()]

index = dict()
for i in range(len(data)):
    line = data[i]
    first_letter = line[0].lower()
    if first_letter not in index:
        index[first_letter] = i
You would want to add that code at the beginning so it only runs once before you start doing the searches. This way when you search for a word, you can have it start searching where its first letter starts, like this:
def find_postag(word, postag):
    start = index[word[0].lower()]
    for line in data[start:]:
        # your code here
        if all(i in line for i in (word,postag)):
            if line.split(" ")[0].startswith(word) and line.split(" ")[1] == word+postag:
                selectedword = line.split(" ")[0]
                break
    return selectedword
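Since the file is sorted, another option (a sketch, assuming data is the list of stripped lines built above) is to binary-search for the word with the standard bisect module instead of indexing by first letter:

import bisect

def find_postag(word, postag):
    start = bisect.bisect_left(data, word)  # first line that sorts at or after the word
    for line in data[start:]:
        first = line.split(" ")[0]
        if not first.startswith(word):
            break  # we have moved past the word alphabetically
        if line.split(" ")[1] == word + postag:
            return first
    return None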

Python: List not correct after appending lines

I'm trying to append lines to an empty list while reading from a file, and I've already stripped the lines of carriage returns and newlines, but what should be one sequence is being entered as two separate items in the list.
DNA = open('DNAGCex.txt')
DNAID = []
DNASEQ = []
for line in DNA:
    line = line.rstrip()
    line = line.lstrip()
    if line.startswith('>')==True:
        DNAID.append(line)
    if line.startswith('>')==False:
        DNASEQ.append(line)
print DNAID
print DNASEQ
And here's the output
['>Rosalind_6404', '>Rosalind_5959', '>Rosalind_0808']
['CCTGCGGAAGATCGGCACTAGA', 'TCCCACTAATAATTCTGAGG', 'CCATCGGTAGCGCATCCTTAGTCCA', 'ATATCCATTTGTCAGCAGACACGC', 'CCACCCTCGTGGTATGGCTAGGCATTCAG', 'TGGGAACCTGCGGGCAGTAGGTGGAAT']
I want it to look like this:
['>Rosalind_6404', '>Rosalind_5959', '>Rosalind_0808']
['CCTGCGGAAGATCGGCACTAGATCCCACTAATAATTCTGAGG', 'CCATCGGTAGCGCATCCTTAGTCCAATATCCATTTGTCAGCAGACACGC', 'CCACCCTCGTGGTATGGCTAGGCATTCAGTGGGAACCTGCGGGCAGTAGGTGGAAT']
Here is the source material, just remove the ''s:
['>Rosalind_6404'
CCTGCGGAAGATCGGCACTAGA
TCCCACTAATAATTCTGAGG
'>Rosalind_5959'
CCATCGGTAGCGCATCCTTAGTCCA
ATATCCATTTGTCAGCAGACACGC
'>Rosalind_0808'
CCACCCTCGTGGTATGGCTAGGCATTCAG
TGGGAACCTGCGGGCAGTAGGTGGAAT]
You can combine the .lstrip() and .rstrip() into a single .strip() call.
Also, you seem to be assuming that .append() both adds lines to a list and joins them into a single string; it only does the former. Here, we start each DNASEQ entry as an empty string and use += to join the lines into one long string:
DNA = open('DNAGCex.txt')
DNAID = []
DNASEQ = []
for line in DNA:
    line = line.strip()
    if line.startswith('>'):
        DNAID.append(line)
        DNASEQ.append('')
    else:
        DNASEQ[-1] += line
print DNAID
print DNASEQ
Within each iteration of the loop, you're only looking at one line from the file. This means that, although you are indeed appending lines with no linefeed at the end, you're still appending one of the file's lines at a time. You have to tell the interpreter that you want certain lines combined, for example by setting a flag when you read a DNAID line and clearing it once the first line of the following sequence has been appended.
for line in DNA:
    line = line.strip()  # gets both sides
    if line.startswith('>'):
        starting = True
        DNAID.append(line)
    elif starting:
        starting = False
        DNASEQ.append(line)
    else:
        DNASEQ[-1] += line
