I'm playing around with a stopword filter.
I feed the script a path to the file that contains articles.
However, I get this error:
Traceback (most recent call last):
  File "stop2.py", line 17, in <module>
    print preprocess(sentence)
  File "stop2.py", line 10, in preprocess
    sentence = sentence.lower()
AttributeError: 'file' object has no attribute 'lower'
My code is below.
Any ideas on how to pass a file as an argument?
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
def preprocess(sentence):
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w')
    tokens = tokenizer.tokenize(sentence)
    filtered_words = [w for w in tokens if not w in stopwords.words('english')]
    return " ".join(filtered_words)
sentence = open('pathtofile')
print preprocess(sentence)
sentence = open(...) means that sentence is a file object (returned by the built-in open() function),
whereas it seems you want the entire contents of the file as a string: sentence = open(...).read()
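A minimal sketch of the fix, assuming Python 3: read the file's text before calling string methods on it. To keep the example runnable without NLTK data, `STOPWORDS` is a small hypothetical stand-in for `stopwords.words('english')`, `preprocess_file` is a hypothetical helper, and `articles.txt` is a hypothetical file created only for the demonstration.

```python
import os
import re
import tempfile

# Hypothetical stand-in for stopwords.words('english'), so the sketch runs without NLTK data.
STOPWORDS = {"the", "is", "a", "on", "of", "and"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(w for w in tokens if w not in STOPWORDS)

def preprocess_file(path):
    """Read the whole file into a string first, then preprocess that string."""
    with open(path) as f:      # f is a file object, not a string
        text = f.read()        # .read() returns the contents as a str
    return preprocess(text)

# Demonstration with a temporary file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "articles.txt")
    with open(path, "w") as f:
        f.write("The cat is on the mat")
    result = preprocess_file(path)
    print(result)
```

Note that the pattern here is `\w+` (whole words); a bare `\w` would split the text into single characters.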
Related
I have this code that stems words from a text file and saves the output to another text file.
The problem is that it only stems the first word of each line and ignores the others.
Any help?
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
with open(r'file path', 'r') as fp:
    tokens = fp.readlines()
    for t in tokens:
        s = stemmer.stem(t.strip())
        print(s, file=open("output.txt", "a"))
I open a file in python.
I read it in and process it (separate it into single words).
I write it to an output file.
The below picture shows my code, the shell (where I'm printing out each word before appending it to the file), and the output.
Why does it become Chinese characters? The encoding of the file is ANSI.
Edit: I should add that the output file seems to be encoded with UCS-2 LE BOM.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
with open('ALLSentences.txt', 'r') as myfile:
    text = myfile.read()
tokenized = word_tokenize(text)
file = open("output.txt", "a")
for word in tokenized:
    print(word)
    file.write(word + "\n")
file.close()
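One possible explanation, offered as an assumption rather than a confirmed diagnosis: if output.txt already exists and begins with a UTF-16 LE byte order mark, an editor will decode the whole file as UTF-16, while the script appends single-byte text. Two ASCII letters read as one UTF-16 code unit land in the CJK range, which would look like Chinese characters. The sketch below shows the effect on a hypothetical word.

```python
# Bytes as a script would append them when writing plain single-byte text.
appended = "stop".encode("ascii")          # b'stop'

# What an editor shows if it decodes those same bytes as UTF-16 LE.
misread = appended.decode("utf-16-le")

# Each pair of lowercase ASCII letters becomes one code unit in the
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
in_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread)
print(len(misread), in_cjk)
```

If this is the cause, deleting the existing output.txt so that it is recreated without the byte order mark would be worth trying.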
I'm having a hard time trying to figure this out. New to coding. I'm trying to read a .txt file, tokenize it, pos tag the words in it.
Here's what I've got so far:
import nltk
from nltk import word_tokenize
import re
file = open('1865-Lincoln.txt', 'r').readlines()
text = word_tokenize(file)
string = str(text)
nltk.pos_tag(string)
My problem is, it keeps giving me the TypeError: expected string or bytes-like object error.
word_tokenize is expecting a string but file.readlines() gives you a list.
Converting the list to a single string solves the issue.
import nltk
from nltk import word_tokenize
import re
file = open('test.txt', 'r').readlines()
text = ''
for line in file:
    text += line
text = word_tokenize(text)
# Pass the token list straight to pos_tag; wrapping it in str() would tag single characters.
nltk.pos_tag(text)
I suggest you do the following:
import nltk
# nltk.download('all') # only for the first time when you use nltk
from nltk import word_tokenize
import re
with open('1865-Lincoln.txt') as f: # with - open is recommended for file reading
    lines = f.readlines() # first get all the lines from file, store it
for i in range(0, len(lines)): # for each line, do the following
    token_text = word_tokenize(lines[i]) # tokenize each line, store in token_text
    print (token_text) # for debug purposes
    pos_tagged_token = nltk.pos_tag(token_text) # pass the token_text to pos_tag()
    print (pos_tagged_token)
For a text file containing:
user is here
pass is there
The output was:
['user', 'is', 'here']
[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]
['pass', 'is', 'there']
[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]
It worked for me, I'm on Python 3.6, if that should matter. Hope this helps!
EDIT 1:
So your issue was that you were passing a list of strings (the result of readlines()) to word_tokenize(), which expects a single string. pos_tag(), in turn, takes the resulting list of tokens; as the doc says:
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word
Hence you need to tokenize line by line, i.e. string by string, and pass each token list to pos_tag(). That is why you were getting the TypeError: expected string or bytes-like object error.
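The error can be reproduced without NLTK. The sketch below uses a hypothetical `simple_tokenize` helper built on `re.findall` as a stand-in for `word_tokenize` (an assumption made only to keep the example self-contained): a regex-based tokenizer accepts a string but raises `TypeError` when handed the list that `readlines()` returns.

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in for word_tokenize: regex functions need a str."""
    return re.findall(r"\w+", text)

lines = ["user is here\n", "pass is there\n"]   # what readlines() would return

# Passing the whole list fails, because a list is not a string.
try:
    simple_tokenize(lines)
    raised = False
except TypeError:
    raised = True

# Tokenizing line by line works: each element of the list is a string.
tokens_per_line = [simple_tokenize(line) for line in lines]
print(raised)
print(tokens_per_line)
```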
Most probably the 1865-Lincoln.txt refers to the inaugural speech of president Lincoln. It's available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip
The original source of the document comes from the Inaugural Address Corpus
If we check how NLTK reads the file using LazyCorpusLoader, we see that the files are Latin-1 encoded:
inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')
If your default encoding is set to utf8, that is most probably where the TypeError: expected string or bytes-like object is occurring.
You should open the file with an explicit encoding so that the text is decoded properly, i.e.
import nltk
from nltk import word_tokenize, pos_tag
tagged_lines = []
with open('test.txt', encoding='latin1') as fin:
    for line in fin:
        tagged_lines.append(pos_tag(word_tokenize(line)))
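To illustrate why the explicit encoding matters, here is a small standard-library sketch. The file name and its content are hypothetical; the point is only that bytes written as Latin-1 may fail to decode as UTF-8, while opening the file with `encoding='latin1'` reads them back intact.

```python
import os
import tempfile

# Hypothetical content with non-ASCII characters ("café déjà vu"), written with escapes.
text = "caf\u00e9 d\u00e9j\u00e0 vu"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1.txt")
    with open(path, "w", encoding="latin1") as fout:
        fout.write(text)

    # Reading with the wrong codec: the Latin-1 byte 0xE9 followed by a space
    # is not a valid UTF-8 sequence, so decoding fails.
    try:
        with open(path, encoding="utf8") as fin:
            fin.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Reading with the codec the file was written in.
    with open(path, encoding="latin1") as fin:
        decoded = fin.read()

print(utf8_failed, decoded == text)
```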
But technically, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.
>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]
I wrote code to find the POS tags for Arabic words in Python 2.7, and the output was not correct. I found this solution on Stack Overflow:
Unknown symbol in nltk pos tagging for Arabic
I downloaded all the files needed (stanford-postagger-full-2018-02-27), which are used by the code in the question above.
This is the code from that question, as I wrote it in my shell:
# -*- coding: cp1256 -*-
from nltk.tag import pos_tag
from nltk.tag.stanford import POSTagger
from nltk.data import load
from nltk.tokenize import word_tokenize
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
def pos_tag(tokens):
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)
path_to_model= 'D:\StanfordParser\stanford-postagger-full-2018-02-27\models/arabic.tagger'
path_to_jar = 'D:\StanfordParser\stanford-postagger-full-2018-02-27/stanford-postagger-3.9.1.jar'
artagger = POSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = '/'
tagged_sent = artagger.tag(u"أنا تسلق شجرة")
print(tagged_sent)
and the output was:
Traceback (most recent call last):
  File "C:/Python27/Lib/mo.py", line 4, in <module>
    from nltk.tag.stanford import POSTagger
ImportError: cannot import name POSTagger
How can I solve this error?
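Recent NLTK versions name the class StanfordPOSTagger; POSTagger no longer exists, which is why your import fails. This script works without errors on my PC, but the tagger results do not look very good: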
import nltk
from nltk import *
from nltk.tag.stanford import StanfordPOSTagger
import os
java_path = "Put your local path in here/Java/javapath/java.exe"
os.environ['JAVAHOME'] = java_path
path_to_model= ('Put your local path in here/stanford-postagger-full-2017-06-09/models/arabic.tagger')
path_to_jar = ('Put your local path in here/stanford-postagger-full-2017-06-09/stanford-postagger.jar')
artagger = StanfordPOSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = "/"
tagged_sent = artagger.tag("أنا أتسلق شجرة".split())
print(tagged_sent)
The results:
[('أنا', 'VBD'), ('أتسلق', 'NN'), ('شجرة', 'NN')]
Give it a try and see :-)
I want to tokenize an input file in Python. Please advise; I am new to Python.
I have read a bit about regular expressions but I am still confused, so please suggest a link or a code overview for this.
Try something like this:
import nltk
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
The NLTK tutorial is also full of easy to follow examples: https://www.nltk.org/book/ch03.html
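Since the question mentions regular expressions, here is a minimal sketch of that route using only the standard library. The pattern and the `regex_tokenize` name are assumptions chosen for illustration (runs of word characters, or single punctuation marks); the result is much cruder than `nltk.word_tokenize`.

```python
import re

# Illustrative pattern: a run of word characters, or one non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Split text into word and punctuation tokens with a regular expression."""
    return TOKEN_PATTERN.findall(text)

tokens = regex_tokenize("Hello, world! This is a test.")
print(tokens)
```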
Using NLTK
If your file is small:
Open the file with the context manager with open(...) as x,
then do a .read() and tokenize it with word_tokenize()
[code]:
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())
If your file is larger:
Open the file with the context manager with open(...) as x,
read the file line by line with a for-loop
tokenize the line with word_tokenize()
output to your desired format (with the write flag set)
[code]:
from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
Using SpaCy
from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = Tokenizer(nlp.vocab)
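with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:  # 'w' is needed to write the output
    for line in fin:
        tokens = tokenizer(line.strip())  # the Tokenizer object is called directly and returns a Doc
        print(' '.join(token.text for token in tokens), end='\n', file=fout)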