How to tokenize natural English text in an input file in python? - python

I want to tokenize an input file in Python. Please advise; I am new to Python.
I have read a little about regular expressions but am still confused, so please suggest a link or a code overview.

Try something like this:
import nltk
with open("myfile.txt") as f:
    file_content = f.read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
The NLTK tutorial is also full of easy to follow examples: https://www.nltk.org/book/ch03.html
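Since the question mentions regular expressions, here is a minimal sketch of a regex-only tokenizer that needs nothing beyond the standard library. The name `simple_tokenize` and the pattern are illustrative assumptions, not part of NLTK, and the result is cruder than `word_tokenize` (contractions, for instance, are split differently).

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules): a run of word
# characters, optionally with an internal apostrophe, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens using a regex."""
    return TOKEN_PATTERN.findall(text)

sample = "Hello, world! Don't panic."
tokens = simple_tokenize(sample)
print(tokens)
```

To apply it to a file, read the file's contents first and pass the resulting string to `simple_tokenize`.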

Using NLTK
If your file is small:
Open the file with the context manager with open(...) as x,
then do a .read() and tokenize it with word_tokenize()
[code]:
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())
If your file is larger:
Open the file with the context manager with open(...) as x,
read the file line by line with a for-loop
tokenize the line with word_tokenize()
output to your desired format (with the write flag set)
[code]:
from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
Using SpaCy
from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = Tokenizer(nlp.vocab)
# Open the output file in write mode
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        # Calling the tokenizer returns a Doc; join the text of each token
        tokens = [token.text for token in tokenizer(line.strip())]
        print(' '.join(tokens), end='\n', file=fout)

Related

NER using spaCy & Transformers - different result when running inside and outside of a loop

I am using NER (spaCy & Transformers) to find and anonymize personal information. I noticed that the output I get when passing an input line directly differs from the output I get when the same line is read from a file. Does anyone have suggestions on how to fix this?
Here is my code:
import pandas as pd
import csv
import spacy
from spacy import displacy
from transformers import pipeline
import re
!python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')
sent = nlp('Yesterday I went out with Andrew, johanna and Jonathan Sparow.')
displacy.render(sent, style = 'ent')
with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        sent = nlp(str(row))
        displacy.render(sent, style = 'ent')
You are using the csv module to read your file and then trying to convert each row (aka line) of the file to a string with str(row).
If your file just has one sentence per line, then you do not need the csv module at all. You could just do
with open('Synth_dataset_raw.txt', 'r') as fd:
    for line in fd:
        # Remove the trailing newline
        line = line.rstrip()
        sent = nlp(line)
        displacy.render(sent, style = 'ent')
If you do in fact have a CSV (presumably with multiple columns and a header), you could do
with open('Synth_dataset_raw.txt', 'r') as fd:
    reader = csv.reader(fd)
    header = next(reader)
    text_column_index = 0
    for row in reader:
        sent = nlp(row[text_column_index])
        displacy.render(sent, style = 'ent')
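To see why `str(row)` changes what the model receives, the following sketch reproduces the effect with the standard library only. The in-memory `io.StringIO` object is a stand-in for the real file, used here purely for illustration.

```python
import csv
import io

# In-memory stand-in for a file with one sentence per line
fd = io.StringIO("Yesterday I went out with Andrew, johanna and Jonathan Sparow.\n")

row = next(csv.reader(fd))
as_text = str(row)

print(row)      # a list of fields, split at the comma
print(as_text)  # the list's repr, brackets and quotes included
```

The string passed to `nlp` in the question's loop therefore contains brackets, quotes, and a split at every comma, which is not the same text as the sentence typed in directly.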

words stemming and save output to text file

I have this code that stems words from a text file and saves the output to another text file.
The problem is that it only stems the first word and ignores the others.
Any help?
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
with open(r'file path', 'r') as fp:
    tokens = fp.readlines()
    for t in tokens:
        s = stemmer.stem(t.strip())
        print(s, file=open("output.txt", "a"))
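One plausible cause, assuming each line of the input holds several words, is that `readlines()` yields whole lines, so the stemmer receives an entire line as if it were a single word. The sketch below illustrates the difference with a hypothetical `toy_stem` function standing in for `PorterStemmer().stem`, so that it runs without NLTK; the sample lines are invented.

```python
def toy_stem(word):
    """Hypothetical stand-in for PorterStemmer().stem: strips one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_lines(lines, stem=toy_stem):
    """Stem every whitespace-separated word on every line."""
    return [" ".join(stem(word) for word in line.split()) for line in lines]

lines = ["running dogs jumped\n", "cats sleeping\n"]  # what readlines() returns

# Passing the whole line: the stemmer treats it as one long "word"
whole_line = [toy_stem(line.strip()) for line in lines]

# Splitting first: every word is stemmed on its own
per_word = stem_lines(lines)

print(whole_line)
print(per_word)
```

With NLTK, the same structure would pass `stemmer.stem` as the `stem` argument and could use `word_tokenize` instead of `str.split`. Opening the output file once, outside the loop, also avoids reopening it for every line.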

pyinstaller executable for simple script is 1.4gb

I built a Python app - it's very straightforward. A file selection box opens, a user chooses a PDF file, and then the text from the PDF is exported to a CSV.
I packaged this as a .exe from within a virtualenv, I only installed the libraries I'm importing (plus PyMuPDF), and the package is still 1.4GB.
The script:
import textract
import csv
import codecs
import fitz
import re
import easygui
from io import open
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
filename = easygui.fileopenbox()
pdfFileObj = fitz.open(filename)
text = ""
for page in pdfFileObj:
    text += page.getText()
re.sub(r'\W+', '', text)
if text != "":
    text = text
else:
    text = textract.process(filename, method='tesseract', language='eng')
tokens = word_tokenize(text)
punctuations = ['(',')',';',':','[',']',',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
with open('ar.csv', 'w', newline='', encoding='utf-8') as f:
    write = csv.writer(f)
    for i in keywords:
        write.writerow([i])
Some context:
Within my venv, the entire lib folder is about 400MB. So how do I find out what is being added to the .exe that's making it 1.4GB?

Python writing to a file - English converted to Chinese

I open a file in python.
I read it in and process it (separate it into single words).
I write it to an output file.
The code below prints each word to the shell before appending it to the output file.
Why does it become Chinese characters? The encoding of the file is ANSI.
Edit: I should add that the output file seems to be encoded with UCS-2 LE BOM.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
with open('ALLSentences.txt', 'r') as myfile:
    text = myfile.read()
tokenized = word_tokenize(text)
file = open("output.txt", "a")
for word in tokenized:
    print(word)
    file.write(word + "\n")
file.close()
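One possible explanation, offered as an assumption rather than a diagnosis, is that single-byte text is being interpreted as UCS-2/UTF-16 LE, for example because `output.txt` already started with a UTF-16 byte order mark or because the viewer guessed the encoding wrongly. The sketch below shows how pairs of ASCII bytes turn into CJK characters when read that way; the sample word is arbitrary.

```python
# ASCII bytes as they would be appended to the file
raw = "This".encode("ascii")

# Read two bytes at a time, as a viewer would for a UTF-16 LE file
misread = raw.decode("utf-16-le")

code_points = [hex(ord(ch)) for ch in misread]
print(code_points)
```

Both resulting code points fall in the CJK Unified Ideographs range (U+4E00 to U+9FFF). If this is the cause, writing to a fresh file with an explicit encoding, such as `open("output.txt", "w", encoding="utf-8")`, should avoid the mismatch.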

Tagging a .txt file from Inaugural Address Corpus

I'm having a hard time trying to figure this out. New to coding. I'm trying to read a .txt file, tokenize it, and POS-tag the words in it.
Here's what I've got so far:
import nltk
from nltk import word_tokenize
import re
file = open('1865-Lincoln.txt', 'r').readlines()
text = word_tokenize(file)
string = str(text)
nltk.pos_tag(string)
My problem is, it keeps giving me the TypeError: expected string or bytes-like object error.
word_tokenize is expecting a string but file.readlines() gives you a list.
Converting the list to a string will solve the issue.
import nltk
from nltk import word_tokenize
import re
file = open('test.txt', 'r').readlines()
text = ''
for line in file:
    text += line
text = word_tokenize(text)
# Pass the token list straight to pos_tag; converting it with str() would tag single characters
nltk.pos_tag(text)
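The type mismatch described above can be reproduced without NLTK. The error text is the one Python's `re` module produces when it is handed a non-string, so this sketch assumes that `word_tokenize` applies regular expressions internally; the sample lines are invented.

```python
import re

lines = ["user is here\n", "pass is there\n"]  # what readlines() returns

# A regex function rejects a list, just as the tokenizer did
try:
    re.findall(r"\w+", lines)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Joining the lines into one string resolves the mismatch
text = "".join(lines)
words = re.findall(r"\w+", text)

print(error_message)
print(words)
```

`"".join(lines)` does the same job as the `for` loop that concatenates the lines one by one.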
I suggest you do the following:
import nltk
# nltk.download('all') # only for the first time when you use nltk
from nltk import word_tokenize
import re
with open('1865-Lincoln.txt') as f: # with - open is recommended for file reading
    lines = f.readlines() # first get all the lines from file, store it
    for i in range(0, len(lines)): # for each line, do the following
        token_text = word_tokenize(lines[i]) # tokenize each line, store in token_text
        print (token_text) # for debug purposes
        pos_tagged_token = nltk.pos_tag(token_text) # pass the token_text to pos_tag()
        print (pos_tagged_token)
For a text file containing:
user is here
pass is there
The output was:
['user', 'is', 'here']
[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]
['pass', 'is', 'there']
[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]
It worked for me, I'm on Python 3.6, if that should matter. Hope this helps!
EDIT 1:
So your issue was that you were passing a list of strings to word_tokenize(), which expects a single string. pos_tag(), in turn, expects a list of tokens; as the doc says:
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word
Hence you need to tokenize line by line, i.e. string by string, and pass each token list to pos_tag(). That is why you were getting the TypeError: expected string or bytes-like object error.
Most probably the 1865-Lincoln.txt refers to the inaugural speech of president Lincoln. It's available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip
The original source of the document comes from the Inaugural Address Corpus
If we check how NLTK reads the file using LazyCorpusLoader, we see that the files are Latin-1 encoded.
inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')
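If your default encoding is UTF-8, reading these files without specifying the encoding can fail or garble non-ASCII characters, although a decoding problem normally surfaces as a UnicodeDecodeError rather than the TypeError: expected string or bytes-like object above.
You should open the file with an explicit encoding so the text is decoded properly, i.e.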
import nltk
from nltk import word_tokenize, pos_tag
tagged_lines = []
with open('test.txt', encoding='latin1') as fin:
    for line in fin:
        tagged_lines.append(pos_tag(word_tokenize(line)))
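As a small illustration of why the explicit encoding matters, the sketch below decodes the same Latin-1 bytes two ways. The sample word is an invented example, not taken from the corpus.

```python
# The letter e-acute encoded as Latin-1 is the single byte 0xE9
data = "caf\u00e9".encode("latin1")

# Decoding those bytes as UTF-8 fails, because 0xE9 starts an incomplete sequence
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in recovers the text
text = data.decode("latin1")

print(utf8_ok, text)
```

A file that contains only ASCII characters decodes identically under both encodings, so the difference only shows up when non-ASCII characters are present.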
Alternatively, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.
>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]
