Python writing to a file - English converted to Chinese - python

I open a file in python.
I read it in and process it (separate it into single words).
I write it to an output file.
The picture below shows my code, the shell (where I print each word before appending it to the file), and the output.
Why does it become Chinese characters? The encoding of the file is ANSI.
Edit: I should add that the output file seems to be encoded with UCS-2 LE BOM.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
with open('ALLSentences.txt', 'r') as myfile:
    text = myfile.read()
tokenized = word_tokenize(text)
file = open("output.txt", "a")
for word in tokenized:
    print(word)
    file.write(word + "\n")
file.close()
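One likely explanation, offered as an assumption because the existing output.txt is not shown: the file already existed as UCS-2 LE (UTF-16) with a BOM, and mode "a" appended single-byte ANSI text to it. An editor that trusts the BOM then reads each pair of ASCII bytes as one 16-bit code unit, which lands in the CJK range. The sketch below shows that effect and one way to avoid it by writing with mode "w" and an explicit encoding. The helper name write_words is hypothetical.

```python
import os
import tempfile

def write_words(words, path, encoding="utf-8"):
    """Write one word per line, replacing any existing file content."""
    # Mode "w" truncates the file, so no stale BOM or old bytes survive.
    with open(path, "w", encoding=encoding) as out:
        for word in words:
            out.write(word + "\n")

# Two ASCII bytes read as one UTF-16-LE code unit look like a CJK character.
garbled = "word".encode("ascii").decode("utf-16-le")

# Round-trip check with an explicit encoding on both sides.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_words(["This", "is", "a", "test"], path)
with open(path, encoding="utf-8") as f:
    restored = f.read().split()
```

Deleting the old output.txt before running the original script should have a similar effect, since a fresh file has no BOM for the editor to pick up.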

Related

words stemming and save output to text file

I have this code that stems words from a text file and saves the output to another text file.
The problem is that it only stems the first word and ignores the others.
Any help?
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
with open(r'file path', 'r') as fp:
    tokens = fp.readlines()
for t in tokens:
    s = stemmer.stem(t.strip())
    print(s, file=open("output.txt", "a"))
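A possible cause, assuming each line of the input holds several words: stemmer.stem() receives the whole line and treats it as a single word. A minimal sketch that splits each line into words before stemming follows. Here toy_stem is a hypothetical stand-in so the example runs without NLTK, and stem_lines is also an illustrative name.

```python
def stem_lines(lines, stem):
    """Return one string per input line with every word stemmed."""
    stemmed_lines = []
    for line in lines:
        words = line.split()  # whitespace split; a tokenizer could be used instead
        stemmed_lines.append(" ".join(stem(w) for w in words))
    return stemmed_lines

def toy_stem(word):
    """Hypothetical placeholder: strips a trailing 'ing' or 's' only."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

result = stem_lines(["walking dogs\n", "reads books\n"], toy_stem)
```

With NLTK installed, stem_lines(lines, PorterStemmer().stem) would apply the real stemmer. Opening output.txt once in a with block, rather than inside print(), also avoids leaving a new file handle open for every line.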

pyinstaller executable for simple script is 1.4gb

I built a Python app - it's very straightforward. A file selection box opens, a user chooses a PDF file, and then the text from the PDF is exported to a CSV.
I packaged this as a .exe from within a virtualenv, I only installed the libraries I'm importing (plus PyMuPDF), and the package is still 1.4GB.
The script:
import textract
import csv
import codecs
import fitz
import re
import easygui
from io import open
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
filename = easygui.fileopenbox()
pdfFileObj = fitz.open(filename)
text = ""
for page in pdfFileObj:
    text += page.getText()
re.sub(r'\W+', '', text)
if text != "":
    text = text
else:
    text = textract.process(filename, method='tesseract', language='eng')
tokens = word_tokenize(text)
punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
with open('ar.csv', 'w', newline='', encoding='utf-8') as f:
    write = csv.writer(f)
    for i in keywords:
        write.writerow([i])
Some context:
Within my venv, the entire lib folder is about 400MB. So how do I find out what is being added to the .exe that's making it 1.4GB?
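One way to start narrowing this down, as a rough sketch rather than a PyInstaller feature: total the size of each top-level entry in a directory (for example the venv's site-packages, or the output folder of a one-folder build) and list the largest ones. The function names are hypothetical, and which directory to inspect is up to you.

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

def largest_entries(base, top=10):
    """Return (name, bytes) pairs for the largest entries directly under base."""
    sizes = []
    for entry in os.scandir(base):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size(entry.path)))
        elif entry.is_file(follow_symlinks=False):
            sizes.append((entry.name, entry.stat().st_size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Comparing the result for site-packages with the result for the unpacked build folder may show which packages were pulled in beyond the ones imported directly.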

Tagging a .txt file from Inaugural Address Corpus

I'm having a hard time trying to figure this out. New to coding. I'm trying to read a .txt file, tokenize it, pos tag the words in it.
Here's what I've got so far:
import nltk
from nltk import word_tokenize
import re
file = open('1865-Lincoln.txt', 'r').readlines()
text = word_tokenize(file)
string = str(text)
nltk.pos_tag(string)
My problem is, it keeps giving me the TypeError: expected string or bytes-like object error.
word_tokenize expects a string, but file.readlines() gives you a list.
Converting the list to a string will solve the issue.
import nltk
from nltk import word_tokenize
import re
file = open('test.txt', 'r').readlines()
text = ''
for line in file:
    text += line  # join the lines into one string
text = word_tokenize(text)
# pass the token list directly; str(text) would make pos_tag tag single characters
nltk.pos_tag(text)
I suggest you do the following:
import nltk
# nltk.download('all') # only for the first time when you use nltk
from nltk import word_tokenize
import re
with open('1865-Lincoln.txt') as f:  # with-open is recommended for file reading
    lines = f.readlines()  # first get all the lines from the file and store them
for i in range(0, len(lines)):  # for each line, do the following
    token_text = word_tokenize(lines[i])  # tokenize each line, store in token_text
    print(token_text)  # for debug purposes
    pos_tagged_token = nltk.pos_tag(token_text)  # pass the token_text to pos_tag()
    print(pos_tagged_token)
For a text file containing:
user is here
pass is there
The output was:
['user', 'is', 'here']
[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]
['pass', 'is', 'there']
[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]
It worked for me, I'm on Python 3.6, if that should matter. Hope this helps!
EDIT 1:
So your issue was that you were passing a list of strings (the result of readlines()) to word_tokenize(), which expects a single string, whereas pos_tag() expects a sequence of tokens. The doc says
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word
Hence you needed to tokenize line by line, i.e. string by string, and pass each resulting token list to pos_tag(). That is why you were getting a TypeError: expected string or bytes-like object error.
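The error message itself looks like the one Python's regular-expression functions raise, and tokenizers such as word_tokenize rely on regular expressions internally (an assumption about NLTK's internals; the standard-library re module is used here as a stand-in). A small sketch showing that a list triggers the TypeError while a joined string does not:

```python
import re

lines = ["user is here\n", "pass is there\n"]  # what readlines() returns

try:
    re.findall(r"\w+", lines)  # a list is not a string
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

text = "".join(lines)  # one string for the whole file
words = re.findall(r"\w+", text)
```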
Most probably the 1865-Lincoln.txt refers to the inaugural speech of president Lincoln. It's available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip
The original source of the document comes from the Inaugural Address Corpus
If we check how NLTK reads the file using LazyCorpusLoader, we see that the files are Latin-1 encoded.
inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')
If your default encoding is UTF-8, reading the file without an explicit encoding may also cause decoding problems, although a mismatch of that kind would normally surface as a UnicodeDecodeError rather than the TypeError: expected string or bytes-like object.
You should open the file with an explicit encoding so the text is decoded properly, i.e.
import nltk
from nltk import word_tokenize, pos_tag
tagged_lines = []
with open('test.txt', encoding='latin1') as fin:
    for line in fin:
        tagged_lines.append(pos_tag(word_tokenize(line)))
But technically, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.
>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]
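A minimal sketch of the encoding point, using a made-up sample string and a temporary file rather than the real corpus: Latin-1 bytes outside ASCII fail to decode as UTF-8, while the matching codec reads them back unchanged.

```python
import os
import tempfile

sample = "caf\xe9 na\xefve"  # made-up text with non-ASCII Latin-1 letters
path = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(path, "w", encoding="latin1") as f:
    f.write(sample)

try:
    with open(path, encoding="utf-8") as f:  # wrong codec for this file
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True

with open(path, encoding="latin1") as f:  # matching codec
    decoded = f.read()
```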

how to check if a verb stem is in a file in python?

I have a file containing verb stems. I want to give the code a verb, have it check the file, and return the stem of that verb. For example, if my verb is "going" and the file contains the stem "go", I want the code to return "go".
Here's my code, but it doesn't work. How should I change it?
def stemmer(verb, file):
    with open(file, encoding="utf-8") as f:
        f = f.read().split()
    for i in f:
        if i in verb:
            return i
file = "c:/python342/rootLexicon.txt"
>>> stemmer("خوردن", file)
'خورد'
Try
print(stemmer("going", file))
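The original loop returns the first stem that appears anywhere inside the verb, which may not be the best match. A sketch of one alternative follows, assuming the lexicon holds whitespace-separated stems and that a stem should match the beginning of the verb. The names find_stem and load_stems are hypothetical, and prefix matching will not suit every language or verb form.

```python
def find_stem(verb, stems):
    """Return the longest stem that verb starts with, or None if none match."""
    candidates = [s for s in stems if s and verb.startswith(s)]
    return max(candidates, key=len) if candidates else None

def load_stems(path):
    """Read whitespace-separated stems from a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return f.read().split()

# Small in-memory lexicon for illustration; load_stems(path) would read a real one.
stems = ["g", "go", "eat", "خورد"]
```

Choosing the longest matching prefix keeps a short entry such as "g" from winning over "go" just because it appears first in the file.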

How to tokenize natural English text in an input file in python?

I want to tokenize an input file in Python. Please advise; I am new to Python.
I have read something about regular expressions but am still confused, so please suggest a link or a code overview.
Try something like this:
import nltk
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
The NLTK tutorial is also full of easy to follow examples: https://www.nltk.org/book/ch03.html
Using NLTK
If your file is small:
Open the file with the context manager with open(...) as x,
then do a .read() and tokenize it with word_tokenize()
[code]:
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())
If your file is larger:
Open the file with the context manager with open(...) as x,
read the file line by line with a for-loop
tokenize the line with word_tokenize()
output to your desired format (with the write flag set)
[code]:
from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
Using SpaCy
from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = Tokenizer(nlp.vocab)
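with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:  # 'w' so the output file is writable
    for line in fin:
        tokens = tokenizer(line.strip())  # calling the tokenizer returns a Doc of Token objects
        print(' '.join(token.text for token in tokens), end='\n', file=fout)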
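Since the question mentions regular expressions, here is a rough standard-library sketch for comparison. The pattern is an assumption chosen for illustration; it is far simpler than what NLTK or spaCy do and will mishandle many cases (abbreviations, URLs, hyphenated words). The name simple_tokenize is hypothetical.

```python
import re

# Words (optionally with an internal apostrophe) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Don't panic, it's only a test.")
```

For anything beyond quick experiments, a library tokenizer is usually the safer choice.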
