NLTK POS tagging for Arabic - Python

I wrote some code to find the POS tags for Arabic words in my Python 2.7 shell, and the output was not correct. I found this solution on Stack Overflow:
Unknown symbol in nltk pos tagging for Arabic
I downloaded all the files needed (stanford-postagger-full-2018-02-27); this is the archive used by the code in the question above.
This is the code from that question, which I ran in my shell:
# -*- coding: cp1256 -*-
from nltk.tag import pos_tag
from nltk.tag.stanford import POSTagger
from nltk.data import load
from nltk.tokenize import word_tokenize

_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'

def pos_tag(tokens):
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)

path_to_model = 'D:\StanfordParser\stanford-postagger-full-2018-02-27\models/arabic.tagger'
path_to_jar = 'D:\StanfordParser\stanford-postagger-full-2018-02-27/stanford-postagger-3.9.1.jar'
artagger = POSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = '/'
tagged_sent = artagger.tag(u"أنا تسلق شجرة")
print(tagged_sent)
and the output was:
Traceback (most recent call last):
File "C:/Python27/Lib/mo.py", line 4, in <module>
from nltk.tag.stanford import POSTagger
ImportError: cannot import name POSTagger
How can I solve this error?

The ImportError occurs because POSTagger was renamed to StanfordPOSTagger in NLTK 3. The script below runs without errors on my PC, but the tagger's results do not look very good!
import os
from nltk.tag.stanford import StanfordPOSTagger

java_path = "Put your local path in here/Java/javapath/java.exe"
os.environ['JAVAHOME'] = java_path

path_to_model = 'Put your local path in here/stanford-postagger-full-2017-06-09/models/arabic.tagger'
path_to_jar = 'Put your local path in here/stanford-postagger-full-2017-06-09/stanford-postagger.jar'

artagger = StanfordPOSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = "/"
tagged_sent = artagger.tag("أنا أتسلق شجرة".split())
print(tagged_sent)
The results:
[('أنا', 'VBD'), ('أتسلق', 'NN'), ('شجرة', 'NN')]
Give it a try and see :-)
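If the same script has to run on both old and new NLTK releases, a fallback import helps. This is a minimal sketch; it assumes the class rename is the only relevant difference between versions:

# POSTagger was renamed to StanfordPOSTagger in newer NLTK releases;
# fall back to the old name where only that exists.
try:
    from nltk.tag.stanford import StanfordPOSTagger
except ImportError:
    from nltk.tag.stanford import POSTagger as StanfordPOSTagger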

Related

File not found in Python (WordCloud)

When I try to run this, it says the file is not found. Are there any mistakes I've made?
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import sys
import os
import matplotlib.pyplot as plt
os.chdir(sys.path[0])
text = open('pokemon.txt', mode='r', encoding='utf-8').read()
stop = STOPWORDS
print(stop)
Since your file is in the same folder as the Python program, use ./ before your pokemon.txt like this:
text = open('./pokemon.txt', mode='r', encoding='utf-8').read()
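If the script might be launched from a different working directory, a more defensive variant resolves the path explicitly. This is a sketch; it assumes pokemon.txt sits next to the script:

import os

# Build an absolute path to pokemon.txt relative to this script's folder,
# so the current working directory no longer matters.
here = os.path.dirname(os.path.abspath(__file__))
text = open(os.path.join(here, 'pokemon.txt'), mode='r', encoding='utf-8').read()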

PyInstaller executable for a simple script is 1.4GB

I built a Python app; it's very straightforward. A file selection box opens, the user chooses a PDF file, and the text from the PDF is exported to a CSV.
I packaged this as a .exe from within a virtualenv in which I only installed the libraries I'm importing (plus PyMuPDF), and the package is still 1.4GB.
The script:
import textract
import csv
import codecs
import fitz
import re
import easygui
from io import open
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

filename = easygui.fileopenbox()
pdfFileObj = fitz.open(filename)
text = ""
for page in pdfFileObj:
    text += page.getText()
re.sub(r'\W+', '', text)
if text != "":
    text = text
else:
    text = textract.process(filename, method='tesseract', language='eng')
tokens = word_tokenize(text)
punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
with open('ar.csv', 'w', newline='', encoding='utf-8') as f:
    write = csv.writer(f)
    for i in keywords:
        write.writerow([i])
Some context:
Within my venv, the entire lib folder is about 400MB. So how do I find out what is being added to the .exe that's making it 1.4GB?
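One way to investigate (a sketch, not a definitive answer): rebuild in PyInstaller's one-folder mode (pyinstaller --onedir yourscript.py) so every bundled file is visible on disk, then list the largest ones. The dist/your_app path below is a placeholder for your actual output folder:

import os

# List the 20 largest files PyInstaller collected into the one-folder build.
dist_dir = "dist/your_app"  # placeholder: adjust to your build output
sizes = []
for root, _dirs, files in os.walk(dist_dir):
    for name in files:
        path = os.path.join(root, name)
        sizes.append((os.path.getsize(path), path))
for size, path in sorted(sizes, reverse=True)[:20]:
    print("%8.1f MB  %s" % (size / 1e6, path))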

I can't import a file.txt in py3

I'm trying to write a program in Python 3. I have saved two raw text files in the same directory as "programm.py", but the program can't find the texts.
I'm using emacs, and I wrote:
from __future__ import division
import nltk, sys, matplotlib, numpy, re, pprint, codecs
from os import path
text1 = "/home/giovanni/Scrivania/Giovanni/programmi/Esame/Milton.txt"
text2 = "/home/giovanni/Scrivania/Giovanni/programmi/Esame/Sksp.txt"
from nltk import ngrams
s_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
w_tokenizer = nltk.word_tokenize("text")
print(text1)
but when I run it in Python 3, it doesn't print text1 (I added the print to see if it works):
>>> import programma1
>>>
Instead, if I ask for text1 in the interpreter, it can't find the name:
>>> import programma1
>>> text1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'text1' is not defined
What can I do?
There are a few independent issues going on here. As @Yash Kanojia correctly pointed out, to get the contents of the files you need to read them, rather than just storing their paths.
The reason you can't access text1 is that it lives in the programma1 module's namespace, not in yours. To access it, you need to use programma1.text1.
I've also moved all the import statements to the top of programma1.py as it's seen as good practice :)
Full code:
programma1.py:
from __future__ import division
import nltk, sys, matplotlib, numpy, re, pprint, codecs
from nltk import ngrams
from os import path
with open("/home/giovanni/Scrivania/Giovanni/programmi/Esame/Milton.txt") as file1:
text1 = file1.read()
with open("/home/giovanni/Scrivania/Giovanni/programmi/Esame/Sksp.txt") as file2:
text2 = file2.read()
s_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
w_tokenizer = nltk.word_tokenize("text")
#print(text1)
main.py:
import programma1
print(programma1.text1)
EDIT:
I presume you wanted to load the contents of the files into the tokenizer. If you do, replace this line:
w_tokenizer = nltk.word_tokenize("text")
with this line
w_tokenizer = nltk.word_tokenize(text1 + "\n" + text2)
Hope this helps.
with open('/home/giovanni/Scrivania/Giovanni/programmi/Esame/Milton.txt') as f:
    data = f.read()
print(data)

Tagging a .txt file from Inaugural Address Corpus

I'm having a hard time trying to figure this out. I'm new to coding. I'm trying to read a .txt file, tokenize it, and POS-tag the words in it.
Here's what I've got so far:
import nltk
from nltk import word_tokenize
import re
file = open('1865-Lincoln.txt', 'r').readlines()
text = word_tokenize(file)
string = str(text)
nltk.pos_tag(string)
My problem is that it keeps giving me a TypeError: expected string or bytes-like object.
word_tokenize expects a string, but file.readlines() gives you a list.
Converting the list to a string will solve the issue.
import nltk
from nltk import word_tokenize
import re

file = open('test.txt', 'r').readlines()
text = ''
for line in file:
    text += line
text = word_tokenize(text)
string = str(text)  # remove this if you want to tag word by word and pass text directly to pos_tag :)
nltk.pos_tag(string)
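For reference, the word-by-word variant mentioned in the comment would look like this (a minimal sketch using the same test.txt):

# Tag the token list directly instead of converting it to a string first.
with open('test.txt', 'r') as f:
    tokens = word_tokenize(f.read())
print(nltk.pos_tag(tokens))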
I suggest you do the following:
import nltk
# nltk.download('all')  # only needed the first time you use nltk
from nltk import word_tokenize
import re

with open('1865-Lincoln.txt') as f:  # with-open is recommended for file reading
    lines = f.readlines()  # first get all the lines from the file and store them

for i in range(0, len(lines)):  # for each line, do the following
    token_text = word_tokenize(lines[i])  # tokenize each line, store in token_text
    print(token_text)  # for debug purposes
    pos_tagged_token = nltk.pos_tag(token_text)  # pass token_text to pos_tag()
    print(pos_tagged_token)
For a text file containing:
user is here
pass is there
The output was:
['user', 'is', 'here']
[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]
['pass', 'is', 'there']
[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]
It worked for me, I'm on Python 3.6, if that should matter. Hope this helps!
EDIT 1:
So your issue was that you were passing a list of strings to pos_tag(), whereas the docs say:
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word
Hence you needed to pass it line by line, i.e. string by string. That is why you were getting the TypeError: expected string or bytes-like object error.
Most probably 1865-Lincoln.txt refers to the inaugural speech of President Lincoln. It's available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip
The original source of the document is the Inaugural Address Corpus.
If we check how NLTK reads the file using LazyCorpusLoader, we see that the files are Latin-1 encoded:
inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')
If you have the default encoding set to utf8, most probably that's where the TypeError: expected string or bytes-like object is occurring.
You should open the file with an explicit encoding and decode the string properly, i.e.
import nltk
from nltk import word_tokenize, pos_tag

tagged_lines = []
with open('test.txt', encoding='latin1') as fin:
    for line in fin:
        tagged_lines.append(pos_tag(word_tokenize(line)))
But technically, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.
>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]

Error : 'file' object has no attribute 'lower'

I'm playing around with a stopword filter.
I feed the script a path to a file that contains articles.
However, I get this error:
Traceback (most recent call last):
File "stop2.py", line 17, in <module>
print preprocess(sentence)
File "stop2.py", line 10, in preprocess
sentence = sentence.lower()
AttributeError: 'file' object has no attribute 'lower'
my code is attached below as well
any ideas as to how to pass a file as an argument
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re

def preprocess(sentence):
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w')
    tokens = tokenizer.tokenize(sentence)
    filtered_words = [w for w in tokens if not w in stopwords.words('english')]
    return " ".join(filtered_words)

sentence = open('pathtofile')
print preprocess(sentence)
sentence = open(...) means that sentence is a file object (returned by the open() call),
whereas it seems you want the entire contents of the file: sentence = open(...).read()
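Applied to the script above, the fix is just the last two lines (same placeholder path as in the question):

sentence = open('pathtofile').read()  # read() returns the file's contents as one string
print preprocess(sentence)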
