Reading input files with ASCII 215 as delimiter - python

I am trying to read from a file which has word pairs delimited by ASCII value 215. When I run the following code:
f = open('file.i', 'r')
for line in f.read().split('×'):
    print line
I get a string that looks like garbage. Here is a sample of my input:
abashedness×N
abashment×N
abash×t
abasia×N
abasic×A
abasing×t
Abas×N
abatable×A
abatage×N
abated×V
abatement×N
abater×N
Abate×N
abate×Vti
abating×V
abatis×N
abatjours×p
abatjour×N
abator×N
abattage×N
abattoir×N
abaxial×A
and here is my output after the code above is run:
z?Nlner?N?NANus?A?hion?hk?hhn?he?hanoconiosis?N
My goal is to eventually read this into either a list of tuples or something of that nature, but I'm having trouble just getting the data to print.
Thanks for all help.

Well, two things:
Your source file could be Unicode! Use an escape sequence and be safe.
Read the file in binary mode.
with open("file.i", "rb") as f:
    for line in f.read().split(b"\xd7"):
        print(line)

The character is delimiting the word and the part of speech, but each word is still on its own line:
with open('file.i', 'rb') as handle:
    for line in handle:
        word, pos = line.strip().split('×')
        print word, pos
Your code was splitting the whole file, so you were ending up with words like N\nabatable, N\nAbate, Vti\nabating.
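The same per-line approach in Python 3, sketched on a few bytes of the sample input rather than the real file (the Latin-1 decoding is an assumption; the file's actual encoding may differ):

```python
# A few lines of the sample input, as raw bytes with the 0xD7 delimiter
data = b'abashedness\xd7N\nabashment\xd7N\nabash\xd7t\n'

pairs = []
for line in data.splitlines():
    # Decode each line, then split word from part-of-speech tag
    word, pos = line.decode('latin-1').split('\xd7')
    pairs.append((word, pos))

print(pairs)  # [('abashedness', 'N'), ('abashment', 'N'), ('abash', 't')]
```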

To interpret bytes from a file as text, you need to know its character encoding; There Ain't No Such Thing As Plain Text. You could use the codecs module to read the text:
import codecs

with codecs.open('file.i', 'r', encoding='utf-8') as file:
    for line in file:
        word, sep, suffix = line.partition(u'\u00d7')
        if sep:
            print word
Put the actual character encoding of the file in place of the utf-8 placeholder, e.g., cp1252.
Non-ASCII characters in string literals would require a source character encoding declaration at the top of the script, so I've used the Unicode escape instead: u'\u00d7'.
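str.partition always returns a 3-tuple, so the `if sep:` check above is what skips lines without the delimiter. A quick Python 3 sketch of that behaviour:

```python
line = 'abatable\u00d7A\n'
word, sep, suffix = line.partition('\u00d7')
print(word)          # abatable
print(repr(suffix))  # 'A\n'

# When the delimiter is absent, sep and suffix come back empty
print('no delimiter here'.partition('\u00d7'))  # ('no delimiter here', '', '')
```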

Thanks to both of you for your help; I was able to hack together this bit of code that returns a list of lists holding what I'm looking for.
with open("mobyposi.i", "rb") as f:
    content = f.readlines()

content = content[0].split()
for item in content:
    item.split("\xd7")
It was indeed Unicode! However, the implementation described above discarded the text after the delimiter and before the newline.
EDIT: Able to reduce to:
with open("mobyposi.i", "rb") as f:
    for item in f.read().split():
        item.split("\xd7")
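Note that item.split(...) in the loop above computes a result but never stores it. The list-of-tuples goal from the question can be sketched like this in Python 3 (the Latin-1 decoding is an assumption):

```python
# Whitespace-separated word-and-tag entries, as they come out of f.read().split()
data = b'abatable\xd7A abated\xd7V abatement\xd7N'

pairs = [tuple(item.decode('latin-1').split('\xd7'))
         for item in data.split()]

print(pairs)  # [('abatable', 'A'), ('abated', 'V'), ('abatement', 'N')]
```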


Python: convert strings containing unicode code point back into normal characters

I'm working with the requests module to scrape text from a website and store it into a txt file using a method like below:
r = requests.get(url)
with open("file.txt", "w") as filename:
    filename.write(r.text)
With this method, say if "送分200000" was the only string that requests got from url, it would've been decoded and stored in file.txt like below.
\u9001\u5206200000
When I grab the string from file.txt later on, the string doesn't convert back to "送分200000" and instead remains at "\u9001\u5206200000" when I try to print it out. For example:
with open("file.txt", "r") as filename:
    mystring = filename.readline()

print(mystring)
Output:
"\u9001\u5206200000"
Is there a way for me to convert this string and others like it back to their original strings with unicode characters?
convert this string and others like it back to their original strings with unicode characters?
Yes, let file.txt content be
\u9001\u5206200000
then
with open("file.txt", "rb") as f:
    content = f.read()

text = content.decode("unicode_escape")
print(text)
output
送分200000
If you want to know more, read the Text Encodings section of the codecs module docs.
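The unicode_escape codec interprets literal backslash sequences such as \uXXXX in the bytes, which is exactly what ended up in file.txt. A minimal sketch without the file:

```python
# The raw bytes as they sit in file.txt: backslashes, not real CJK characters
raw = rb"\u9001\u5206200000"

text = raw.decode("unicode_escape")
print(text)  # 送分200000
```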
It's better to use the io module for that. Try and adapt the following code for your problem.
import io

with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()
    # process Unicode text

with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
Taken from https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python
I am guessing you are using Windows. When you open a file without specifying an encoding, you get the platform default, which on Windows is Windows-1252. Specify the encoding when you open the file:
with open("file.txt", "w", encoding="UTF-8") as filename:
    filename.write(r.text)

with open("file.txt", "r", encoding="UTF-8") as filename:
    mystring = filename.readline()

print(mystring)
That works as you expect regardless of platform.

Keep the new line symbols in the string when writing in a text file Python

I have a list of strings, and some of the strings contain '\n' characters. I want to write this list of strings to a text file and later read it back into a list using readlines(). I have to keep the original text, meaning I can't remove the newlines.
If I don't remove all these newlines, then of course readlines() will return more strings than the original list had.
How can I achieve this? Or is there really no way, and should I use another format instead? Thanks.
The following:
from __future__ import print_function

strings = ["asd", "sdf\n", "dfg"]

with open("output.txt", "w") as out_file:
    for string in strings:
        print(repr(string), file=out_file)

with open("output.txt") as in_file:
    for line in in_file:
        print(line.strip())
prints
'asd'
'sdf\n'
'dfg'
To print it normally (without the quotes), you can use ast.literal_eval: print(ast.literal_eval(line.strip()))
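The repr()/ast.literal_eval pair round-trips strings exactly, embedded newlines included; a sketch on an in-memory list rather than a file:

```python
import ast

strings = ["asd", "sdf\n", "dfg"]

# Serialize: repr() turns the embedded newline into the two characters \ and n
lines = [repr(s) for s in strings]

# Deserialize: literal_eval parses each repr back into the original string
restored = [ast.literal_eval(line) for line in lines]

print(restored == strings)  # True
```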

How can I successfully capture all possible cases to create a python list from a text file

This public gist creates a simple scenario where you can turn a text file into a python list line by line.
with open('test.txt', 'r') as listFile:
    lines = listFile.read().split("\n")

out = []
for item in lines:
    if '"' in item:
        out.append('("""' + item + '"""),')
    else:
        out.append('("' + item + '"),')

with open('out.py', 'a') as outFile:
    outFile.write("out = [\n")
    for item in out:
        outFile.write("\t" + item + "\n")
    outFile.write("]")
In test.txt the sixth and seventh lines
'"""'
""
are the ones that produce invalid output. Perhaps you can think of some other examples that would fail to work.
EDIT:
Valid output would look something like this:
out = [
    "line1",
    "line2",
    """ line 3 has """ and "" and " in it """,  # but it is a valid string
    "last line",
]
The ( and ) characters were an oversight on my part; they are not needed or wanted.
EDIT: Oh god I'm getting overwhelmed. I'm going to take 5 minutes and post the question again in a better form.
Using a newline convention other than \n would also cause the program to fail; on Windows it's common to use \r\n, and classic Mac OS used \r.
@abarnert's comment shows a better way to read lines.
A text file is already an iterable of lines.
As with any other iterable, you can convert it to a list by just passing it to the list constructor:
with open('text.txt') as f:
    lines = list(f)
Or, if you don't want the newlines on the end of each line:
with open('text.txt') as f:
    lines = [line.rstrip('\n') for line in f]
If you want to handle classic Mac and Windows line endings as well as Unix, open the file in universal-newlines mode:
with open('text.txt', 'rU') as f:
… or use the Python 3-style io classes (but note that this will give you unicode strings, not byte strings, which will repr with u prefixes—they're still valid Python literals that way, but they won't look as pretty):
import io
with io.open('text.txt') as f:
Now, it's hard to tell from code that doesn't work and no explanation of what's wrong with it, but it looks like you're trying to figure out how to write that list out as a Python-source-format list display, wrapping it in brackets, adding quotes, escaping any internal quotes, etc. But there's a much easier way to do that too:
with open('out.py', 'a') as f:
    f.write(repr(lines))
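Reading such a file back is then a single ast.literal_eval call; sketched here on the repr string directly instead of going through out.py:

```python
import ast

lines = ['line1', 'line2', ' line 3 has """ and "" and " in it ', 'last line']

dumped = repr(lines)              # valid Python source for the whole list
restored = ast.literal_eval(dumped)

print(restored == lines)  # True
```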
If you're trying to pretty-print it, there's a pprint module in the stdlib for exactly that purpose, and various bigger/better alternatives on PyPI. Here's an example of the output of pprint.pprint(lines, width=60) with (what I think is) the same input you used for your desired output:
['line1',
 'line2',
 ' line 3 has """ and "" and " in it ',
 'last line']
Not exactly the same as your desired output—but, unlike your output, it's a valid Python list display that evaluates to the original input, and it looks pretty readable to me.

using txt file as input for python

I have a python program that requires the user to paste texts into it to process them to the various tasks. Like this:
line=(input("Paste text here: ")).lower()
The pasted text comes from a .txt file. To avoid any issues with the code (since the text contains multiple quotation marks), the user has to do the following: type three quotation marks, paste the text, and type three quotation marks again.
Can all of the above be avoided by having Python read the .txt file directly? And if so, how?
Please let me know if the question makes sense.
In Python2, just use raw_input to receive input as a string. No extra quotation marks on the part of the user are necessary.
line=(raw_input("Paste text here: ")).lower()
Note that input is equivalent to
eval(raw_input(prompt))
and applying eval to user input is dangerous, since it allows the user to evaluate arbitrary Python expressions. A malicious user could delete files or even run arbitrary functions so never use input in Python2!
In Python3, input behaves like raw_input, so there your code would have been fine.
If instead you'd like the user to type the name of the file, then
filename = raw_input("Text filename: ")
with open(filename, 'r') as f:
    line = f.read()
Troubleshooting:
Ah, you are using Python3 I see. When you open a file in r mode, Python tries to decode the bytes in the file into a str. If no encoding is specified, it uses locale.getpreferredencoding(False) as the default encoding. Apparently that is not the right encoding for your file. If you know what encoding your file is using, it is best to supply it with the encoding parameter:
open(filename, 'r', encoding=...)
Alternatively, a hackish approach which is not nearly as satisfying is to ignore decoding errors:
open(filename, 'r', errors='ignore')
A third option would be to read the file as bytes:
open(filename, 'rb')
Of course, this has the obvious drawback that you'd then be dealing with bytes like \x9d rather than characters like ·.
Finally, if you'd like some help guessing the right encoding for your file, run
with open(filename, 'rb') as f:
    contents = f.read()

print(repr(contents))
and post the output.
You can use the following:
with open("file.txt") as fl:
    file_contents = [x.rstrip() for x in fl]
This will result in the variable file_contents being a list, where each element of the list is a line of your file with the newline character stripped off the end.
If you want to iterate over each line of the file, you can do this:
with open("file.txt") as fl:
    for line in fl:
        pass  # Do something with line
The rstrip() method gets rid of whitespace at the end of a string, and it is useful for getting rid of the newline character.
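A quick sketch of the difference between rstrip() and strip():

```python
line = "  some text  \n"
print(repr(line.rstrip()))  # '  some text' -- trailing spaces and newline removed
print(repr(line.strip()))   # 'some text'   -- leading whitespace removed as well
```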

Read from a file and remove \n and spaces

I'm trying to have python read some lines of text from a file and then convert them to an md5 hash to compare to the one the user entered.
I'm using f = open(file, 'r') to open and read the files, and everything is working fine, but when it hashes the word, it's not the right hash for that word.
So I need to know how to remove the spaces or the \n at the end that is causing it to hash incorrectly.
If that makes sense. I didn't really know how to word it.
The code: http://pastebin.com/Rdticrbs
I have just rewritten your pastebin code, because it's not good. Why did you write it recursively? (The line sys.setrecursionlimit(10000000) should probably be a clue that you're doing something wrong!)
import md5

hashed = raw_input("Hash:")
with open(raw_input("Wordlist Path: ")) as f:
    for line in f:
        if md5.new(line.strip()).hexdigest() == hashed:
            print(line.strip())
            break
    else:
        print("The hash was not found. Please try a new wordlist.")
raw_input("Press ENTER to close.")
This will obviously be slow, because it hashes every word in the wordlist every time. If you are going to look up more than one word in the wordlist, you should compute (once) a mapping of hashes to words (a reverse lookup table) and use that. You may need a large-scale key-value storage library if your wordlists are large.
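A sketch of that reverse lookup table in Python 3, using hashlib in place of the old md5 module (the word list here is made up for illustration):

```python
import hashlib

def md5_hex(word):
    # md5 operates on bytes, so encode the word first
    return hashlib.md5(word.encode("utf-8")).hexdigest()

# Build the table once: hexdigest -> word
wordlist = ["password", "letmein", "dragon"]
table = {md5_hex(w): w for w in wordlist}

# Each lookup is now a single dict access instead of rehashing the whole list
print(table.get(md5_hex("letmein")))   # letmein
print(table.get("not a real digest"))  # None
```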
You can just open the file like this:
with open('file', 'r') as f:
    for line in f:
        do_something_with(line.strip())
From the official documentation: strip() will return a copy of the string with the leading and trailing characters removed.
Edit: I correct my mistake thanks to the comment of katrielalex (I don't know why I believed what I posted before). My apology.
def readStripped(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

dict((line, yourHash(line)) for line in readStripped(path))
str.strip([chars])
Return a copy of the string with the leading and trailing characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped:
>>> s = " Hello \n".strip()
>>> print(s)
Hello
In your code, add this.
words = lines[num].strip()
