reading strings from large file faster - python

I have a large text file (parsed.txt) which includes almost 1.500.000 lines. Each line is in this format:
foobar foo[Noun]+lAr[A3pl]+[Pnon]+[Nom]
loremipsum lorem[A1sg]+lAr[A3pl]+[Pl]+[Nom]
I supply the second field (the part after the space, as word + postag) and look up the first field (the part before the space) with this function:
def find_postag(word,postag):
    with open('parsed.txt',"r") as zemberek:
        for line in zemberek:
            if all(i in line for i in (word,postag)):
                if line.split(" ")[0].startswith(word) and line.split(" ")[1] == word+postag:
                    selectedword = line.split(" ")[0]
                    break
    return selectedword
However, it runs too slowly, and I'm not sure how to make it faster. My idea: parsed.txt is in alphabetical order, so if the given word starts with "z", the function reads almost 900.000 lines unnecessarily. It might be faster to start checking from line 900.000 when the word starts with "z". Are there any better ideas, and how can I implement them?

Since your input file is alphabetical, what you could do is create a dictionary that contains the line number where each letter starts, like this:
with open('parsed.txt', 'r') as f:
    # keep only non-empty lines, without surrounding whitespace
    data = [line.strip() for line in f if line.strip()]
index = dict()
for i, line in enumerate(data):
    first_letter = line[0].lower()
    # remember the first line number seen for each letter
    if first_letter not in index:
        index[first_letter] = i
You would want to add that code at the beginning so it only runs once before you start doing the searches. This way when you search for a word, you can have it start searching where its first letter starts, like this:
def find_postag(word, postag):
    selectedword = None  # returned if no line matches
    start = index[word[0].lower()]
    for line in data[start:]:
        # your code here
        if all(i in line for i in (word,postag)):
            if line.split(" ")[0].startswith(word) and line.split(" ")[1] == word+postag:
                selectedword = line.split(" ")[0]
                break
    return selectedword
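Because the file is sorted, a binary search can go one step further and jump straight to the first line that starts with the word, not just to the first line for its letter. The following is a minimal sketch using the standard-library bisect module on the in-memory list. It assumes the lines are stripped and sorted by plain Python string comparison (the question only says "alphabetic", so a locale-aware sort order would need a matching key). The name find_postag_bisect and the lines parameter are illustrative, not from the original code.

```python
from bisect import bisect_left


def find_postag_bisect(lines, word, postag):
    """Return the first field of the line whose second field is word + postag.

    Assumes `lines` is a list of stripped lines sorted by plain string
    comparison, so all lines starting with `word` form one contiguous run.
    """
    target = word + postag
    i = bisect_left(lines, word)  # index of the first line that is >= word
    while i < len(lines) and lines[i].startswith(word):
        surface, _, analysis = lines[i].partition(" ")
        if analysis == target:
            return surface
        i += 1
    return None  # no matching line


# Hypothetical sample data in the question's format
sample = sorted([
    "foobar foo[Noun]+lAr[A3pl]+[Pnon]+[Nom]",
    "loremipsum lorem[A1sg]+lAr[A3pl]+[Pl]+[Nom]",
])
print(find_postag_bisect(sample, "foo", "[Noun]+lAr[A3pl]+[Pnon]+[Nom]"))
```

The loop stops as soon as a line no longer starts with the word, so only the handful of candidate lines is inspected.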

How to get first word from text file removing \n - python

If the text file contains \n\n Hello world!\n I like python.\n
how do I get the first word from that text?
I tried this code:
def word_file(file):
    files = open(file, 'r')
    l = files.readlines()
    for i in range(len(l)):
        a = l[i].rstrip("\n")
        line = l[0]
        word = line.strip().split(" ")[0]
    return word
There is a space in front of Hello.
The result I get is None. How should I correct it?
Can anybody help?

Assuming there is a word in the file:
def word_file(f):
    with open(f) as file:
        return file.read().split()[0]
file.read reads the entire file as a string. Do a split with no parameters on that string (i.e. sep=None). Then according to the Python manual "runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace." So the splitting will be done on consecutive white space and there will be no empty strings returned as a result of the split. Therefore the first element of the returned list will be the first word in the file.
If there is a possibility that the file is empty or contains nothing but white space, then you would need to check the return value from file.read().split() to ensure it is not an empty list.
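A minimal sketch of that check, assuming the function receives an already open file object (so io.StringIO can stand in for a real file here); the name first_word is illustrative:

```python
import io


def first_word(file):
    """Return the first whitespace-separated word in `file`, or None."""
    words = file.read().split()
    # split() returns an empty list for empty or whitespace-only input
    return words[0] if words else None


print(first_word(io.StringIO("\n\n Hello world!\n I like python.\n")))
print(first_word(io.StringIO("  \n\n")))
```

The first call prints Hello and the second prints None.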
If you need to avoid having to read the entire file into memory at once, then the following, less terse code can be used:
def word_file(f):
    with open(f) as file:
        for line in file:
            words = line.split()
            if words:
                return words[0]
    return None # No words found

Edit: @Booboo's answer is far better than mine.
This should work:
def word_file(file):
    with open(file, 'r') as f:
        for line in f:
            for index, character in enumerate(line):
                if not character.isspace():
                    line = line[index:]
                    for ind, ch in enumerate(line):
                        if ch.isspace():
                            return line[:ind]
                    return line # could not find whitespace character at end
    return None # no words found
output:
Hello

IndexError: string index out of range, line is empty?

I'm trying to make a program that takes a letter the user inputs, reads a text file and then prints the words that start with that letter.
item = "file_name"
letter = raw_input("Words starting with: ")
letter = letter.lower()
found = 0
with open(item) as f:
    filelength = sum(1 for line in f)
    for i in range (filelength):
        word = f.readline()
        print word
        if word[0] == letter:
            print word
            found += 1
print "words found:", found
I keep receiving the error
"if word[0] == letter: IndexError: string index out of range"
with no lines being printed. I think this is what happens if there's nothing there, but there are 50 lines of random words in the file, so I'm not sure why it is being read this way.

You have two problems:
You are trying to read the whole file twice (once to determine the filelength, then again to get the lines themselves), which won't work; and
You aren't dealing with empty lines, so if any are introduced (e.g. if the last line is blank) your code will break anyway.
The easiest way to do this is:
found = 0
with open(item) as f:
    for line in f: # iterate over lines directly
        if line and line[0] == letter: # skip blank lines and any that don't match
            found += 1
print "words found:", found
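if line guards against empty strings because empty sequences are false-y, and the short-circuit evaluation of and means that line[0] is only tried when the line isn't empty. Note that a blank line read from a file still holds its newline character, so it is not empty; it simply fails the comparison with letter. You could alternatively use line.startswith(letter), which is safe on empty strings too.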

When you use sum(1 for line in f) you are already consuming all the lines in your file, so now your handle points to the end of the file. Try using f.seek(0) to return the read cursor to the start of the file.
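A minimal sketch of the seek(0) approach, written for Python 3 and using io.StringIO as a stand-in for the real file; the function name count_matches and the sample words are illustrative:

```python
import io


def count_matches(f, letter):
    """Count the lines in `f`, rewind, then count lines starting with `letter`."""
    total_lines = sum(1 for _ in f)  # consumes the file; the cursor is now at the end
    f.seek(0)                        # rewind so the lines can be read again
    found = sum(1 for line in f if line.startswith(letter))
    return total_lines, found


words = io.StringIO("apple\nbanana\n\navocado\n")
print(count_matches(words, "a"))
```

Without the f.seek(0) call the second pass would see no lines at all, which is the same situation that made readline() return an empty string in the question.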

Python: List not correct after appending lines

I'm trying to append lines to an empty list reading from a file, and I've already stripped the lines of returns and newlines, but what should be one line is being entered as two separate items into the list.
DNA = open('DNAGCex.txt')
DNAID = []
DNASEQ = []
for line in DNA:
    line = line.rstrip()
    line = line.lstrip()
    if line.startswith('>')==True:
        DNAID.append(line)
    if line.startswith('>')==False:
        DNASEQ.append(line)
print DNAID
print DNASEQ
And here's the output
['>Rosalind_6404', '>Rosalind_5959', '>Rosalind_0808']
['CCTGCGGAAGATCGGCACTAGA', 'TCCCACTAATAATTCTGAGG', 'CCATCGGTAGCGCATCCTTAGTCCA', 'ATATCCATTTGTCAGCAGACACGC', 'CCACCCTCGTGGTATGGCTAGGCATTCAG', 'TGGGAACCTGCGGGCAGTAGGTGGAAT']
I want it to look like this:
['>Rosalind_6404', '>Rosalind_5959', '>Rosalind_0808']
['CCTGCGGAAGATCGGCACTAGATCCCACTAATAATTCTGAGG', 'CCATCGGTAGCGCATCCTTAGTCCAATATCCATTTGTCAGCAGACACGC', 'CCACCCTCGTGGTATGGCTAGGCATTCAGTGGGAACCTGCGGGCAGTAGGTGGAAT']
Here is the source material, just remove the ''s:
['>Rosalind_6404'
CCTGCGGAAGATCGGCACTAGA
TCCCACTAATAATTCTGAGG
'>Rosalind_5959'
CCATCGGTAGCGCATCCTTAGTCCA
ATATCCATTTGTCAGCAGACACGC
'>Rosalind_0808'
CCACCCTCGTGGTATGGCTAGGCATTCAG
TGGGAACCTGCGGGCAGTAGGTGGAAT]

You can combine the .lstrip() and .rstrip() into a single .strip() call.
Then, you were expecting .append() both to add lines to the list and to join lines into a single line, but it only adds. Here, we append an empty string to DNASEQ whenever a new ID starts and use += to join the following lines onto it:
DNA = open('DNAGCex.txt')
DNAID = []
DNASEQ = []
for line in DNA:
    line = line.strip()
    if line.startswith('>'):
        DNAID.append(line)
        DNASEQ.append('')
    else:
        DNASEQ[-1] += line
print DNAID
print DNASEQ

Within each iteration of the loop, you're only looking at a certain line from the file. This means that, although you certainly are appending lines that don't contain a linefeed at the end, you're still appending one of the file's lines at a time. You'll have to let the interpreter know that you want to combine certain lines, by doing something like setting a flag when you first start to read in a DNASEQ and clearing it when the next DNAID starts.
for line in DNA:
    line = line.strip() # gets both sides
    if line.startswith('>'):
        starting = True
        DNAID.append(line)
    elif starting:
        starting = False
        DNASEQ.append(line)
    else:
        DNASEQ[-1] += line
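The same idea can be sketched as a small Python 3 function that collects the pieces of each record in a list and joins them once at the end. It accepts any iterable of lines; the name parse_fasta is illustrative, and skipping sequence lines that appear before the first header is an assumption about how malformed input should be handled.

```python
def parse_fasta(lines):
    """Return (ids, sequences), joining each multi-line sequence into one string."""
    ids, chunks = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            ids.append(line)
            chunks.append([])  # start collecting pieces for this record
        elif chunks:
            # assumption: sequence lines before the first header are skipped
            chunks[-1].append(line)
    return ids, ["".join(parts) for parts in chunks]


sample = [
    ">Rosalind_6404\n",
    "CCTGCGGAAGATCGGCACTAGA\n",
    "TCCCACTAATAATTCTGAGG\n",
    ">Rosalind_5959\n",
    "CCATCGGTAGCGCATCCTTAGTCCA\n",
    "ATATCCATTTGTCAGCAGACACGC\n",
]
ids, seqs = parse_fasta(sample)
print(ids)
print(seqs)
```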

how to loop around every word in a line and then every line in a file?

I have a dictionary like
list1={'ab':10,'ba':20,'def':30}.
Now my input file contains :
ab def
ba ab
I have coded:
filename=raw_input("enter file:")
f=open(filename,'r')
ff=open(filename+'_value','w')
for word in f.read().split():
    s=0
    if word in list1:
        ff.write(word+'\t'+list1[word]+'\n');
        s+=int(list1[word])
    else:
        ff.write(word+'\n')
    ff.write("\n"+"total:%d"%(s)+"\n")
Now I want my output file to contain:
ab 10
def 30
total: 40
ba 20
ab 10
total: 30
I am not able to loop over each line. How should I do it? I tried a few variations using f.readlines() and f.read(), looping once and then twice with them, but I cannot get it right.

Instead of giving the answer right away, let me give you the gist of what you need:
To read the whole file:
f = open('myfile','r')
data = f.read()
To loop through each line in the file (looping over data itself would yield single characters, so split it into lines first):
for line in data.splitlines():
To loop through each word in the line:
    for word in line.split():
Use these wisely to get what you want.

You need to make 2 loops and not only one:
filename = raw_input("enter file:")
with open(filename, 'r') as f, open(filename + '_value','w') as ff:
    # Read each line sequentially (iterate over the file itself; f.read() would yield characters)
    for line in f:
        # In each line, read each word
        total = 0
        for word in line.split():
            if word in list1:
                ff.write("%s\t%s\n" % (word, list1[word]))
                total += int(list1[word])
            else:
                ff.write(word+'\n')
        ff.write("\ntotal: %s\n" % total)
I have also cleaned up your code a little to make it more readable. See also the question "What is the python "with" statement designed for?" if you want to understand the with block.
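For reference, the two-loop structure can be sketched in Python 3 as a function that works on any iterable of lines and any writable object, which makes it easy to try without touching the disk. The name write_line_totals and the use of io.StringIO are illustrative; list1 is the dictionary from the question.

```python
import io


def write_line_totals(lines, values, out):
    """For each input line, write every word (with its value if known) and a total."""
    for line in lines:              # outer loop: one pass per line
        words = line.split()
        if not words:
            continue                # skip blank lines
        total = 0
        for word in words:          # inner loop: one pass per word
            if word in values:
                out.write("%s\t%d\n" % (word, values[word]))
                total += values[word]
            else:
                out.write(word + "\n")
        out.write("total: %d\n" % total)


list1 = {'ab': 10, 'ba': 20, 'def': 30}
out = io.StringIO()
write_line_totals(io.StringIO("ab def\nba ab\n"), list1, out)
print(out.getvalue())
```

The total is reset at the top of the outer loop and written after the inner loop finishes, which is what produces one total per input line.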

with open("in.txt","r") as f:
    with open("out.txt","w") as f1:
        for line in f:
            words = line.split() # split into list of two words
            f1.write("{} {}\n".format((words[0]),list1[words[0]])) # write first word plus value
            f1.write("{} {}\n".format((words[1]),list1[words[1]])) # second word plus value
            f1.write("Total: {}\n".format((int(list1[words[0]]) + int(list1[words[1]])))) # finally add first and second and get total

Store words of file in dictionary

I want to store the words of a text file in a dictionary.
My code is
word=0
char=0
i=0
a=0
d={}
with open("m.txt","r") as f:
    for line in f:
        w=line.split()
        d[i]=w[a]
        i=i+1
        a=a+1
        word=word+len(w)
        char=char+len(line)
print(word,char)
print(d)
My text file is
jdfjdnv dj g gjv,kjvbm
but the problem is that the dictionary stores only the first word of the text file. How can I store the rest of the words? Please help.

How many lines does your text file have? If it only has one line your loop executes only once, splits whole line into separate words, then saves one word in Python dict. If you want to save all words from this text file with one line you need to add another loop. Like this:
for word in line.split():
    d[i] = word
    i += 1

You only store the first word because you only have one line in the file, and your only for loop is over the lines.
Generally, if you are going to key the dictionary by index, you can just use the list you are already making:
w = []
char = 0
with open("m.txt", "r") as f:
    for line in f:
        char += len(line)
        w.extend(line.split())
word = len(w)  # number of words, matching the question's running count
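If a dictionary keyed by position is still wanted, it can be built from that list in one step with dict(enumerate(...)). A minimal Python 3 sketch, using io.StringIO in place of m.txt; the name index_words is illustrative:

```python
import io


def index_words(f):
    """Return (words_by_index, word_count, char_count) for an open text file."""
    words = []
    char_count = 0
    for line in f:
        char_count += len(line)      # includes the trailing newline, if any
        words.extend(line.split())
    # enumerate pairs each word with its position: 0, 1, 2, ...
    return dict(enumerate(words)), len(words), char_count


d, word_count, char_count = index_words(io.StringIO("jdfjdnv dj g gjv,kjvbm\n"))
print(d)
print(word_count, char_count)
```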
