Read from a file and remove \n and spaces - python

I'm trying to have python read some lines of text from a file and then convert them to an md5 hash to compare to the one the user entered.
I'm using f = open(file, 'r') to open and read the files, and everything is working fine, but when it hashes the word it's not the right hash for that word.
So I need to know how to remove the spaces or the \n at the end that is causing it to hash incorrectly.
If that makes sense. I didn't really know how to word it.
The code: http://pastebin.com/Rdticrbs

I have just rewritten your pastebin code, because it's not good. Why did you write it recursively? (The line sys.setrecursionlimit(10000000) should probably be a clue that you're doing something wrong!)
import md5

hashed = raw_input("Hash:")
with open(raw_input("Wordlist Path: ")) as f:
    for line in f:
        if md5.new(line.strip()).hexdigest() == hashed:
            print(line.strip())
            break
    else:
        print("The hash was not found. Please try a new wordlist.")
raw_input("Press ENTER to close.")
This will obviously be slow, because it hashes every word in the wordlist every time. If you are going to look up more than one word in the wordlist, you should compute (once) a mapping of hashes to words (a reverse lookup table) and use that. You may need a large-scale key-value storage library if your wordlists are large.
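A minimal sketch of that reverse lookup table, sticking with the md5 module and raw_input used above (build_lookup is just a hypothetical helper name):
import md5

def build_lookup(wordlist_path):
    # Hypothetical helper: map each word's md5 hex digest back to the word.
    lookup = {}
    with open(wordlist_path) as f:
        for line in f:
            word = line.strip()
            lookup[md5.new(word).hexdigest()] = word
    return lookup

# Build the table once; each lookup afterwards is a constant-time dict access.
table = build_lookup(raw_input("Wordlist Path: "))
hashed = raw_input("Hash:")
print(table.get(hashed, "The hash was not found. Please try a new wordlist."))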

You can just open the file like this:
with open('file', 'r') as f:
    for line in f:
        do_something_with(line.strip())
From the official documentation, strip() returns a copy of the string with the leading and trailing characters removed (whitespace by default).
Edit: I have corrected my mistake thanks to katrielalex's comment (I don't know why I believed what I posted before). My apologies.

def readStripped(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

dict((line, yourHash(line)) for line in readStripped(path))

str.strip([chars])
Return a copy of the string with the leading and trailing characters
removed. The chars argument is a string specifying the set of
characters to be removed. If omitted or None, the chars argument
defaults to removing whitespace. The chars argument is not a prefix or
suffix; rather, all combinations of its values are stripped:
>>> s = " Hello \n".strip()
>>> print(s)
Hello
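For instance, with an explicit chars argument, any mix of those characters is stripped from both ends:
>>> "www.example.com".strip("cmow.")
'example'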
In your code, add this.
words = lines[num].strip()

Related

Re-formatting a text file

I am fairly new to Python. I have a text file full of common misspellings. The correct spelling of each word is prefixed with a $ character, and all misspelled versions of that word follow it, one on each line.
mispelling.txt:
$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa
I want to create a new text file, based on mispelling.txt, where the format appears as this:
new_mispelling.txt:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The correct spelling of the word is on the right-hand side of its misspelling, separated by ->; on the same line.
Question:
How do I read in the file, treat $ as the start of a new word and thus a new line in my output file, populate the output file, and save it to disk?
The purpose of this is to have my collected data in the same format as this open-source Wikipedia dataset of "all" commonly misspelled words, which doesn't contain my own entries of words and misspellings.
As you process the file line-by-line, if you find a word that starts with $, set that as the "currently active correct spelling". Then each subsequent line is a misspelling for that word, so format that into a string and write it to the output file.
current_word = ""
with open("mispelling.txt") as f_in, open("new_mispelling.txt", "w") as f_out:
    for line in f_in:
        line = line.strip()  # Remove whitespace at start and end
        if line.startswith("$"):
            # If the line starts with $
            # Slice the current line from the second character to the end
            # And save it as current_word
            current_word = line[1:]
        else:
            # If it doesn't start with $, create the string we want
            # And write it.
            f_out.write(f"{line}->{current_word}\n")
With your input file, this gives:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The f"{line}->{current_word}\n" construct is called an f-string and is used for string interpolation in Python 3.6+.
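For example, in a Python 3.6+ interpreter:
>>> line = "eyar"
>>> current_word = "year"
>>> f"{line}->{current_word}"
'eyar->year'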
A regex solution:
You can use the pattern '\$(\w+)(.*?)(?=\$|$)': join each value following a $ word to that word with ->, join those pairs with \n within each captured group, and finally join all the groups with \n. Make sure to use the re.DOTALL flag since it's a multi-line string:
import re
txt='''$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa'''
print('\n'.join('\n'.join('->'.join((v, m.group(1)))
                          for v in m.group(2).strip('\n').split('\n'))
                for m in re.finditer('\$(\w+)(.*?)(?=\$|$)', txt, re.DOTALL)))
OUTPUT:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
I'm leaving the file reading and writing to you, assuming that's not the part you're asking about.
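For completeness, a rough, untested sketch of wiring that same expression up to the files named in the question:
import re

with open('mispelling.txt') as f_in:
    txt = f_in.read()

pairs = '\n'.join('\n'.join('->'.join((v, m.group(1)))
                            for v in m.group(2).strip('\n').split('\n'))
                  for m in re.finditer('\$(\w+)(.*?)(?=\$|$)', txt, re.DOTALL))

with open('new_mispelling.txt', 'w') as f_out:
    f_out.write(pairs + '\n')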

How can I successfully capture all possible cases to create a python list from a text file

This public gist creates a simple scenario where you can turn a text file into a python list line by line.
with open('test.txt', 'r') as listFile:
    lines = listFile.read().split("\n")

out = []
for item in lines:
    if '"' in item:
        out.append('("""' + item + '"""),')
    else:
        out.append('("' + item + '"),')

with open('out.py', 'a') as outFile:
    outFile.write("out = [\n")
    for item in out:
        outFile.write("\t" + item + "\n")
    outFile.write("]")
In text.txt the sixth and seventh lines
'"""'
""
are the ones that produce invalid output. Perhaps you can think of some other examples that would fail to work.
EDIT:
Valid output would look something like this:
out = [
    "line1",
    "line2",
    """ line 3 has """ and "" and " in it """,  # but it is a valid string
    "last line",
]
The ( and ) characters were an oversight on my part; they are not needed or wanted...
EDIT: Oh god I'm getting overwhelmed. I'm going to take 5 minutes and post the question again in a better form.
Using a newline character besides \n would also cause the program to fail. On Windows it's common to use \r\n, and classic Mac OS used \r.
@abarnert's comment shows a better way to read lines.
A text file is already an iterable of lines.
As with any other iterable, you can convert it to a list by just passing it to the list constructor:
with open('text.txt') as f:
    lines = list(f)
Or, if you don't want the newlines on the end of each line:
with open('text.txt') as f:
    lines = [line.rstrip('\n') for line in f]
If you want to handle classic Mac and Windows line endings as well as Unix, open the file in universal-newlines mode:
with open('text.txt', 'rU') as f:
… or use the Python 3-style io classes (but note that this will give you unicode strings, not byte strings, which will repr with u prefixes—they're still valid Python literals that way, but they won't look as pretty):
import io
with io.open('text.txt') as f:
Now, it's hard to tell from code that doesn't work and no explanation of what's wrong with it, but it looks like you're trying to figure out how to write that list out as a Python-source-format list display, wrapping it in brackets, adding quotes, escaping any internal quotes, etc. But there's a much easier way to do that too:
with open('out.py', 'a') as f:
    f.write(repr(lines))
If you're trying to pretty-print it, there's a pprint module in the stdlib for exactly that purpose, and various bigger/better alternatives on PyPI. Here's an example of the output of pprint.pprint(lines, width=60) with (what I think is) the same input you used for your desired output:
['line1',
 'line2',
 ' line 3 has """ and "" and " in it ',
 'last line']
Not exactly the same as your desired output—but, unlike your output, it's a valid Python list display that evaluates to the original input, and it looks pretty readable to me.
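If you wanted that pretty-printed form in a file rather than on screen, a minimal sketch (writing to out.py is just carried over from the question) might be:
import pprint

with open('text.txt') as f:
    lines = [line.rstrip('\n') for line in f]

# pformat returns the same text that pprint.pprint would print.
with open('out.py', 'w') as out:
    out.write('out = ' + pprint.pformat(lines, width=60) + '\n')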

Don't write final new line character to a file

I have looked around StackOverflow and couldn't find an answer to my specific question so forgive me if I have missed something.
import re
target = open('output.txt', 'w')
for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        match_text = match.group()
        target.write(match_text + '\n')
    else:
        continue
target.close()
The file I am parsing is huge so need to process it line by line.
This (of course) leaves an additional newline at the end of the file.
How should I best change this code so that on the final iteration of the 'if match' loop it doesn't put the extra newline character at the end of the file? Should it look through the file again at the end and remove the last line (that seems a bit inefficient, though)?
The existing StackOverflow questions I have found cover removing all new lines from a file.
If there is a more pythonic / efficient way to write this code I would welcome suggestions for my own learning also.
Thanks for the help!
Another thing you can do is truncate the file. .tell() gives us the current byte position in the file. We then subtract one and truncate there to remove the trailing newline.
with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell() - 1)
On Linux and MacOS, the -1 is correct, but on Windows it needs to be -2. A more Pythonic method of determining which is to check os.linesep.
import os

remove_chars = len(os.linesep)
with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell() - remove_chars)
kindall's answer is also valid, with the exception that you said it's a large file. This method will let you handle a terabyte-sized file on a gigabyte of RAM.
Write the newline of each line at the beginning of the next line. To avoid writing a newline at the beginning of the first line, use a variable that is initialized to an empty string and then set to a newline in the loop.
import re
with open('input.txt') as source, open('output.txt', 'w') as target:
    newline = ''
    for line in source:
        match = re.search(r'Stuff', line)
        if match:
            target.write(newline + match.group())
            newline = '\n'
I also restructured your code a bit (the else: continue is not needed, because what else is the loop going to do?) and changed it to use the with statement so the files are automatically closed.
The shortest path from what you have to what you want is probably to store the results in a list, then join the list with newlines and write that to the file.
import re
target = open('output.txt', 'w')
results = []
for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        results.append(match.group())
target.write("\n".join(results))
target.close()
Voilà, no extra newline at the beginning or end. Might not scale very well if the resulting list is huge. (And like kindall, I left out the else.)
Since you're performing the same regex over and over, you'd probably want to compile it beforehand.
import re
prog = re.compile(r'Stuff')
I tend to input from and output to stdin and stdout for simplicity. But that's a matter of taste (and specs).
from sys import stdin, stdout
Ignoring the specific requirement about removing the final EOL[1], and just addressing the bit about your own learning, the whole thing could be written like this:
from itertools import imap
stdout.writelines(match.group() for match in imap(prog.match, stdin) if match)
[1] As others have commented, this is a Bad Thing, and it's extremely annoying when someone does this.

How to iterate over space-separated ASCII file in Python

Strange question here.
I have a .txt file that I want to iterate over. I can get all the words into an array from the file, which is good, but what I want to know is how to iterate over the whole file by words, rather than by individual letters.
I want to be able to go through the array which houses all the text from the file, and basically count all the instances in which a word appears in it.
Only problem is I don't know how to write the code for it.
I tried using a for loop, but that just iterates over every single letter, when I want the whole words.
This code reads the space separated file.txt
f = open("file.txt", "r")
words = f.read().split()
for w in words:
    print w
file = open("test")
for line in file:
    for word in line.split(" "):
        print word
Untested:
def produce_words(file_):
    for line in file_:
        for word in line.split():
            yield word

def main():
    with open('in.txt', 'r') as file_:
        for word in produce_words(file_):
            print word
If you want to loop over an entire file, then the sensible thing to do is to iterate over it, taking the lines and splitting them into words. Working line-by-line is best, as it means we don't read the entire file into memory first (which, for large files, could take a lot of time or cause us to run out of memory):
with open('in.txt') as input:
    for line in input:
        for word in line.split():
            ...
Note that you could use line.split(" ") if you want to preserve more whitespace, as line.split() will remove all excess whitespace.
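To illustrate the difference:
>>> "  two   words  ".split()
['two', 'words']
>>> "  two   words  ".split(" ")
['', '', 'two', '', '', 'words', '', '']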
Also note my use of the with statement to open the file, as it's more readable and handles closing the file, even on exceptions.
While this is a good solution, if you are not doing anything within the first loop, it's also a little inefficient. To reduce this to one loop, we can use itertools.chain.from_iterable and a generator expression:
import itertools

with open('in.txt') as input:
    for word in itertools.chain.from_iterable(line.split() for line in input):
        ...
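Since the question also asks about counting how often each word appears, here is a minimal sketch using collections.Counter (assuming in.txt and whitespace-separated words, as above):
from collections import Counter

counts = Counter()
with open('in.txt') as input:
    for line in input:
        # Counter.update counts each word in the iterable returned by split()
        counts.update(line.split())

print counts.most_common(10)  # the ten most frequent words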

Try Except in Python

I want to take a path for a file, open the file and read the data within it. Upon doing so, I would like to count the number of occurrences of each letter in the alphabet.
From what I have read and heard, using try/except would be best here. I've tried my best at this, but I only managed to count the occurrences of letters in a string within the program, not within the file.
I haven't a clue how to do this now, and my brain is starting to hurt... this is what I have so far:
import sys

print "Enter the file path:"
thefile = raw_input()
f = open(thefile, "r")
chars = {}
for c in f:
    try:
        chars[c] += 1
    except:
        chars[c] = 1
print chars
Any help will be highly appreciated. Thank you.
EDIT: I forgot to say that the result I get at the moment treats the whole file as one character. The file consists of "abcdefghijklmnopqrstuvwxyz" and the resulting output is: {'"abcdefghijklmnopqrstuvwxyz"\n': 1}, which it shouldn't be.
A slightly more elegant approach is this:
from __future__ import with_statement
from collections import defaultdict

print "Enter the file path:"
thefile = raw_input()

with open(thefile, "r") as f:
    chars = defaultdict(int)
    for line in f:
        for c in line:
            chars[c] += 1

print dict(chars)
This uses a defaultdict to simplify the counting process, uses two loops to make sure we read each character separately without needing to read the entire file into memory, and uses a with block to ensure that the file is closed properly.
Edit:
To compute a histogram of the letters, you can use this version:
from __future__ import with_statement
from string import ascii_letters

print "Enter the file path:"
thefile = raw_input()

chars = dict(zip(ascii_letters, [0] * len(ascii_letters)))
with open(thefile, "r") as f:
    for line in f:
        for c in line:
            if c in ascii_letters:
                chars[c] += 1

for c in ascii_letters:
    print "%s: %d" % (c, chars[c])
This uses the handy string.ascii_letters constant, and shows a neat way to build the empty dictionary using zip() as well.
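As a side note (not from the original answer), dict.fromkeys builds the same zero-filled dictionary a little more directly:
>>> from string import ascii_letters
>>> chars = dict.fromkeys(ascii_letters, 0)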
The for c in f: statement is processing your file line by line (that's what the for operation on a file object is designed to do). Since you want to process it character by character, try changing that to:
data = f.read()
for c in data:
The .read() method reads the entire contents of the file into one string, assigns it to data, then the for loop considers each individual character of that string.
You're almost there, actually; the most important thing you're missing is that your c is not a character, instead it's a line: iterating through a Python file gives you a line at a time. You can solve the problem by adding another loop:
print "Enter the file path:"
thefile = raw_input()
f = open(thefile, "r")
chars = {}
for line in f:
    for c in line:
        try:
            chars[c] += 1
        except:
            chars[c] = 1
print chars
(Reading the entire file into a string also works, as another answer mentions, if your file is small enough to fit in memory.)
While it does work in this case, it's not a terribly good idea to use a raw except: unless you're actually trying to catch all possible errors. Instead, use except KeyError:.
What you're trying to do is pretty common, so there's a Python dictionary method and data type that can remove the try/except from your code entirely. Take a look at the setdefault method and the defaultdict type. With either, you can essentially specify that missing values start at 0.
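A small sketch of the three alternatives mentioned, counting characters in a short string:
from collections import defaultdict

text = "hello"

# 1. Catch only the specific exception.
chars = {}
for c in text:
    try:
        chars[c] += 1
    except KeyError:
        chars[c] = 1

# 2. setdefault: missing keys start at 0.
chars = {}
for c in text:
    chars[c] = chars.setdefault(c, 0) + 1

# 3. defaultdict(int): missing keys default to 0 automatically.
chars = defaultdict(int)
for c in text:
    chars[c] += 1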
Let's put it a more Pythonic way, for PEP 8's sake:
import collections
with open(raw_input(), 'rb') as f:
    count = collections.Counter(f.read())
print count
Batteries included! :)
