how to perform XOR of all words in a file - python

I want to convert all words in a standard dictionary (for example: /usr/share/dict/words on a Unix machine) to integers, find the XOR between every two words in the dictionary (of course, after converting them to integers), and probably store the results in a new file.
Since I am new to Python, and because of the large file size, the program keeps hanging.
import os

dictionary = open("/usr/share/dict/words", "r")
'''a = os.path.getsize("/usr/share/dict/words")
c = fo.read(a)'''
words = dictionary.readlines()

foo = open("word_integer.txt", "a")

for word in words:
    foo.write(word)
    foo.write("\t")
    int_word = int(word.encode('hex'), 16)
    '''print int_word'''
    foo.write(str(int_word))
    foo.write("\n")

foo.close()

First we need a method to convert your string to an int. I'll make one up (since what you're doing isn't working for me at all; maybe you mean to encode as Unicode?):
def word_to_int(word):
    return sum(ord(i) for i in word.strip())
Next, we need to process the files. The following works in Python 2.7 onward (in 2.6, just nest two separate with blocks, or use contextlib.nested):
with open("/usr/share/dict/words","rU") as dictionary:
with open("word_integer.txt", "a") as foo:
while dictionary:
try:
w1, w2 = next(dictionary), next(dictionary)
foo.write(str(word_to_int(w1) ^ word_to_int(w2)))
except StopIteration:
print("We've run out of words!")
break
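As a hedged aside (my own addition, not part of the answer above): the same consecutive pairing can be expressed by zipping the file iterator with itself, which avoids the explicit StopIteration handling. This sketch assumes the word_to_int helper defined above and Python 2.7 (on Python 3, use the built-in zip, which is already lazy):

from itertools import izip  # Python 2; on Python 3 use zip instead

with open("/usr/share/dict/words", "rU") as dictionary:
    with open("word_integer.txt", "a") as foo:
        it = iter(dictionary)
        # izip(it, it) pulls two consecutive lines per step; an odd
        # trailing word is silently dropped.
        for w1, w2 in izip(it, it):
            foo.write(str(word_to_int(w1) ^ word_to_int(w2)) + "\n")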

This code seems to work for me. You're likely running into efficiency issues because you are calling readlines() on the entire file, which pulls it all into memory at once.
This solution loops through the file line by line and, for each line, loops through the file again to compute the XOR with every other word.
def str_to_int(w):
    return int(w.encode('hex'), 16)

pairwise_xors = {}

f = open('/usr/share/dict/words', 'r')
while True:
    line1 = f.readline().strip()
    if not line1:
        break
    # Re-open the dictionary so each word gets paired with every word.
    g = open('/usr/share/dict/words', 'r')
    while True:
        line2 = g.readline().strip()
        if not line2:
            break
        pairwise_xors[(line1, line2)] = str_to_int(line1) ^ str_to_int(line2)
    g.close()
f.close()
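A quick hedged note from me: word.encode('hex') only works on Python 2 byte strings; on Python 3 it raises LookupError because 'hex' is not a text encoding. If you need the same word-to-integer conversion on Python 3, a minimal sketch (my own, not from the answers above) is:

def str_to_int_py3(w):
    # Interpret the word's UTF-8 bytes as one big-endian integer,
    # mirroring int(w.encode('hex'), 16) from Python 2.
    return int.from_bytes(w.encode('utf-8'), 'big')

print(str_to_int_py3('cat'))  # 6513012, same as int('cat'.encode('hex'), 16) on Python 2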

Related

Python Question - How to extract text between {textblock}{/textblock} of a .txt file?

I want to extract the text between {textblock_content} and {/textblock_content}.
With the script below, only the first line of the introtext.txt file gets extracted and written to a newly created text file. I don't know why the script does not also extract the other lines of introtext.txt.
f = open("introtext.txt")
r = open("textcontent.txt", "w")
for l in f.readlines():
if "{textblock_content}" in l:
pos_text_begin = l.find("{textblock_content}") + 19
pos_text_end = l.find("{/textblock_content}")
text = l[pos_text_begin:pos_text_end]
r.write(text)
f.close()
r.close()
How to solve this problem?
Your code is actually working fine, assuming a line contains both the begin and end tags. But I think this is not what you dreamed of: you can't read multiple blocks in one line, and you can't read a block that starts and ends on different lines.
First of all, take a look at the object returned by the open function. You can use its read method to access the whole text. Also take a look at with statements; they make working with files easier and safer. To rewrite your code so it reads everything between {textblock_content} and {/textblock_content}, we could write something like this:
def get_all_tags_content(
    text: str,
    tag_begin: str = "{textblock_content}",
    tag_end: str = "{/textblock_content}"
) -> list[str]:
    useful_text = text
    ans = []
    # Heavy loop, needs some optimization:
    # works in O(len(text) ** 2); we can do better.
    while tag_begin in useful_text:
        useful_text = useful_text.split(tag_begin, 1)[1]
        if tag_end not in useful_text:
            break
        block_content, useful_text = useful_text.split(tag_end, 1)
        ans.append(block_content)
    return ans


with open("introtext.txt", "r") as f:
    with open("textcontent.txt", "w+") as r:
        r.write(str(get_all_tags_content(f.read())))
You could also write this function more efficiently, so it can handle really big files. In this implementation I copy the remaining text every time a content block appears; that's not necessary, and it slows the program down (imagine millions of lines of {textblock_content}"hello world"{/textblock_content}: for every line we copy the whole remaining text to continue). A plain loop over the text with an index avoids the copying. Try to solve it yourself.
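To sketch what that single pass might look like (my own variant, not the answer's code; treat it as a hint under the same assumptions rather than the definitive fix): track an index into the text with str.find instead of repeatedly slicing off the consumed prefix, so no copies of the remaining text are made.

def get_all_tags_content_fast(text,
                              tag_begin="{textblock_content}",
                              tag_end="{/textblock_content}"):
    ans = []
    pos = 0
    while True:
        # Look for the next opening tag starting from the current index.
        start = text.find(tag_begin, pos)
        if start == -1:
            break
        start += len(tag_begin)
        end = text.find(tag_end, start)
        if end == -1:
            break
        ans.append(text[start:end])
        # Continue scanning after the closing tag; nothing is copied
        # except the extracted block itself.
        pos = end + len(tag_end)
    return ans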
When you call file.readlines(), the file pointer reaches the end of the file. Further calls of the same method return an empty list, so if you change your code to something like one of the snippets below, it should work properly:
f = open("introtext.txt")
r = open("textcontent.txt", "w")
f_lines = f.readlines()
for l in f_lines:
if "{textblock_content}" in l:
pos_text_begin = l.find("{textblock_content}") + 19
pos_text_end = l.find("{/textblock_content}")
text = l[pos_text_begin:pos_text_end]
r.write(text)
f.close()
r.close()
Also, you can implement it with a with context manager, as in the snippet below:
with open("textcontent.txt", "w") as r:
with open("introtext.txt") as f:
for line in f:
if "{textblock_content}" in l:
pos_text_begin = l.find("{textblock_content}") + 19
pos_text_end = l.find("{/textblock_content}")
text = l[pos_text_begin:pos_text_end]
r.write(text)
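A hedged alternative (my own suggestion, not from either answer): a single regular-expression pass over the whole file also catches blocks whose opening and closing tags sit on different lines.

import re

with open("introtext.txt") as f:
    # re.DOTALL lets .*? match across newlines, so multi-line blocks work too.
    blocks = re.findall(r"\{textblock_content\}(.*?)\{/textblock_content\}",
                        f.read(), flags=re.DOTALL)
with open("textcontent.txt", "w") as r:
    r.write("".join(blocks))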

Removing extra space from text file

I am currently keeping high scores in a text file called "score.txt". The program works fine, updating the file with the new high scores as normal, except that every time the program updates the file there is always one blank line before the first high score, which creates an error when I try to save the scores the next time. The code:
scores_list = []
score = 10

def take_score():
    # Save old scores into list
    f = open("score.txt", "r")
    lines = f.readlines()
    for line in lines:
        scores_list.append(line)
    print scores_list
    f.close()

take_score()

def save_score():
    # Clear file
    f = open("score.txt", "w")
    print >> f, ""
    f.close()
    # Rewrite scores into text files
    w = open("score.txt", "a")
    for i in range(0, len(scores_list)):
        new_string = scores_list[i].replace("\n", "")
        scores_list[i] = int(new_string)
        if score > scores_list[i]:
            scores_list[i] = score
    for p in range(0, len(scores_list)):
        print >> w, str(scores_list[p])
        print repr(str(scores_list[p]))

save_score()
The problem mentioned happens in the save_score() function. I have tried this related question: Removing spaces and empty lines from a file using Python, but it requires that I open the file in "r" mode. Is there a way to accomplish the same thing when the file is opened in "a" mode (append)?
You are specifically printing an empty line as soon as you create the file.
print >> f, ""
You then append to it, keeping the empty line.
If you just want to clear the contents every time you run this, get rid of this:
# Clear file
f = open("score.txt", "w")
print >> f, ""
f.close()
And modify the opening to this:
w = open("score.txt", "w")
The 'w' mode truncates already, as you were already using. There's no need to truncate, write an empty line, close, then append lines. Just truncate and write what you want to write.
That said, you should use the with construct and file methods for working with files:
with open("score.txt", "w") as output: # here's the with construct
for i in xrange(len(scores_list)):
# int() can handle leading/trailing whitespace
scores_list[i] = int(scores_list[i])
if score > scores_list[i]:
scores_list[i] = score
for p in xrange(len(scores_list)):
output.write(str(scores_list[p]) + '\n') # writing to the file
print repr(str(scores_list[p]))
You will then not need to explicitly close() the file handle, as with takes care of that automatically and more reliably. Also note that you can simply send a single argument to range and it will iterate from 0, inclusive, until that argument, exclusive, so I've removed the redundant starting argument, 0. I've also changed range to the more efficient xrange, as range would only be reasonably useful here if you wanted compatibility with Python 3, and you're using Python 2-style print statements anyway, so there isn't much point.
print appends a newline to what you print. In the line
print >> f, ""
You're writing a newline to the file. This newline still exists when you reopen in append mode.
As #Zizouz212 mentions, you don't need to do all this. Just open in write mode, which'll truncate the file, then write what you need.
You're opening a file and clearing it, but then you open the same file again unnecessarily. When you open the file, you print a newline, even if you don't think so. Here is the offending line:
print >> f, ""
In Python 2, it really does this.
print "" + "\n"
This is because Python adds a newline at the end of the string to each print statement. To stop this, you could add a comma to the end of the statement:
print "",
Or just write directly:
f.write("my data")
However, if you're trying to save a Python data type, and it does not have to be human-readable, you may have luck using pickle. It's really simple to use:
import pickle

def save_score():
    with open('scores.txt', 'w') as f:
        pickle.dump(score_data, f)
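And, assuming the same file and data, reading the scores back is just as short (a hedged companion sketch, not part of the answer; on Python 3 you would open both files in binary mode):

import pickle

def load_score():
    with open('scores.txt') as f:
        # Returns whatever object was dumped, e.g. a list of scores.
        return pickle.load(f)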
This is not really an answer to the question.
It is my version of your code (not tested). And don't be afraid of rewriting everything ;)
# --- functions ---

def take_score():
    '''read values and convert to int'''
    scores = []
    with open("score.txt", "r") as f:
        for line in f:
            value = int(line.strip())
            scores.append(value)
    return scores

def save_score(scores):
    '''save values'''
    with open("score.txt", "w") as f:
        for value in scores:
            f.write(str(value))
            f.write("\n")

def check_scores(scores, min_value):
    results = []
    for value in scores:
        if value < min_value:
            value = min_value
        results.append(value)
    return results

# --- main ---

score = 10

scores_list = take_score()
scores_list = check_scores(scores_list, score)
save_score(scores_list)

python reading file infinite loop

pronunciation_file = open('dictionary.txt')
pronunciation = {}
line = pronunciation_file.readline()
while line != '':
    n_line = line.strip().split(' ', 1)
    pronunciation[n_line[0]] = n_line[1].strip()
    line = pronunciation_file.readline()
print(pronunciation)
The code is meant to turn a file of words and their pronunciations into a dictionary (keys are words, values are pronunciations), for example 'A AH0\n...' into {'A': 'AH0', ...}.
The problem is: if I put the print inside the loop, it prints normally (but it prints all the unfinished dictionaries). However, if I put the print outside the loop as above, the shell returns nothing, and when I close it, it tells me the program is still running (there is probably an infinite loop).
Help please.
I also tried cutting out the first few hundred words and running the program; it works for very short files, but it starts returning nothing at a certain length :|
That is not how to read from a file:
# with will also close your file
with open(your_file) as f:
    # iterate over file object
    for line in f:
        # unpack key/value for your dict and use rstrip
        k, v = line.rstrip().split(' ', 1)
        pronunciation[k] = v
You simply open the file and iterate over the file object. Use .rstrip() if you only want to strip whitespace from the end of the string; there is also no need to call strip twice on the same line.
You can also simplify your code to just use dict and a generator expression:
with open("dictionary.txt") as f:
pronunciation = dict(line.rstrip().split(" ",1) for line in f)
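For instance, on a couple of in-memory lines shaped like the question's 'A AH0\n' example (a hedged micro-example with invented sample data, not run against the real dictionary.txt), the one-liner builds exactly the mapping you want:

lines = ["A AH0\n", "ABACUS AE1 B AH0 K AH0 S\n"]  # invented sample lines
pronunciation = dict(line.rstrip().split(" ", 1) for line in lines)
print(pronunciation["A"])       # AH0
print(pronunciation["ABACUS"])  # AE1 B AH0 K AH0 S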
Not tested, but if you want to use a while loop, the idiom is more like this:
pronunciation = {}
with open(fn) as f:
    while True:
        line = f.readline()
        if not line:
            break
        l, r = line.split(' ', 1)
        pronunciation[l] = r.strip()
But the more modern Python idiom for reading a file line-by-line is to use a for loop as Padraic Cunningham's answer uses. A while loop is more commonly used to read a binary file fixed chunk by fixed chunk in Python.
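To illustrate that last point, here is a hedged sketch of the chunked binary-read idiom, with a made-up file name and handler that are not related to the question's dictionary file:

with open("samples.bin", "rb") as f:   # hypothetical binary file
    while True:
        chunk = f.read(4096)           # read a fixed-size chunk
        if not chunk:                  # empty bytes means end-of-file
            break
        process(chunk)                 # hypothetical handler for each chunk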

Is there a way to read a file in a loop in python using a separator other than newline

I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
    doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious if for a much larger file which is all in one line it would be possible to achieve a similar iteration effect as in the first code snippet but with splitting by comma instead of newline, or by any other character?
The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
    prev = ''
    while True:
        s = f.read(bufsize)
        if not s:
            break
        split = s.split(delim)
        if len(split) > 1:
            yield prev + split[0]
            prev = split[-1]
            for x in split[1:-1]:
                yield x
        else:
            prev += s
    if prev:
        yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
    doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.
Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.
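A bare-bones sketch of that loop (my own illustration; the splitting and carry-over of partial tokens are left out, and the file_split generator above is a complete version):

with open('filename.txt', 'r') as f:
    while True:
        block = f.read(4096)
        if not block:          # empty string signals end-of-file
            break
        handle_block(block)    # hypothetical: you still have to split on ','
                               # and carry partial tokens over to the next block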

The pythonic way of printing a value

This probably measures how pythonic you are. I'm playing around trying to learn Python, so I'm not close to being pythonic enough. The infile is a dummy patriline, and I want a list of father and son pairs.
infile:
haffi jolli dkkdk lkskkk lkslll sdkjl kljdsfl klsdlj sdklja asldjkl
code:
def main():
    infile = open('C:\Users\Notandi\Desktop\patriline.txt', 'r')
    line = infile.readline()
    tmpstr = line.split('\t')
    for i in tmpstr[::2]:
        print i, '\t', i + 1
    infile.close()

main()
The issue is i + 1; I want to print out two strings in every line. Is this clear?
You are getting confused between the words in the split string and their indices. For example, the first word is "haffi" but the first index is 0.
To iterate over both the indices and their corresponding words, use enumerate:
for i, word in enumerate(tmpstr):
    print word, tmpstr[i+1]
Of course, this looks messy. A better way is to just iterate over pairs of strings. There are many ways to do this; here's one.
def pairs(it):
    it = iter(it)
    for element in it:
        yield element, next(it)

for word1, word2 in pairs(tmpstr):
    print word1, word2
I'd use the with statement here; if you're using an older version of Python, you need to import it:
from __future__ import with_statement
For the actual code: if you can afford to load the whole file into memory twice (i.e., it's pretty small), I would do this:
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as f:
        strings = f.read().split('\t')
    for father, son in zip(strings, strings[1:]):
        print "%s \t %s" % (father, son)

main()
That way you skip the childless leaf at the end without too much overhead, which I think is what you were asking for(?)
As a bit of a tangent: if the file is really big, you may not want to load the whole thing into memory, in which case you may need a generator. You probably don't need to do this if you're actually printing everything out, but in case this is some simplified version of the problem, this is how I would approach making a generator to split the file:
class reader_and_split():
    def __init__(self, fname, delim='\t'):
        self.fname = fname
        self.delim = delim

    def __enter__(self):
        self.file = open(self.fname, 'r')
        return self.word_generator()

    def __exit__(self, type, value, traceback):
        self.file.close()

    def word_generator(self):
        current = []
        while True:
            char = self.file.read(1)
            if char == self.delim:
                yield ''.join(current)
                current = []
            elif not char:
                break
            else:
                current.append(char)
The value of a generator is that you don't load the entire contents of the file into memory before running the split on it, which can be expensive for very, very large files. This implementation only allows a single-character delimiter, for simplicity. All you need to do to parse everything out is use the generator; a quick and dirty way to do this is:
with reader_and_split(fileloc) as f:
    previous = f.next()
    for word in f:
        print "%s \t %s" % (previous, word)
        previous = word
You can be more pythonic in both your file reading and printing. Try this:
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as f:
        strings = f.readline().split('\t')
    for i, word in enumerate(strings):
        print "{} \t {}".format(word, strings[i+1:i+2])

main()
Using strings[i+1:i+2] ensures an IndexError isn't thrown (it returns [] instead) when trying to reach the (i+1)th index at the end of the list.
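A quick hedged illustration of that slice-versus-index difference, with made-up values:

words = ['haffi', 'jolli']
print words[1:2]   # ['jolli']
print words[2:3]   # [] -- an out-of-range slice is simply empty
# print words[2]   # this, by contrast, would raise IndexError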
Here's one clean way to do it. It has the benefit of not crashing when fed an odd number of items, but of course you may prefer an exception for that case.
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as infile:
        line = infile.readline()
    previous = None
    for i in line.split('\t'):
        if previous is None:
            previous = i
        else:
            print previous, '\t', i
            previous = None
I won't make any claims that this is pythonic though.
