I am using Ubuntu and, as part of my assignment, I am doing text sentiment analysis. I am building a training set to classify text with a Naive Bayes classifier. I have many files containing sentences, saved as sent1.txt, sent2.txt, ..., and a file called label.txt.
label.txt contains
sent1.txt:pos
sent2.txt:pos
...
sent15.txt:neg
sent16.txt:neg
All the sent files and the label.txt file are stored in /home/abha. I tried this:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('.', r'.*\.txt', cat_file='cats/cats.txt')
What should my third argument be?
I'm having silly issues with where to store the label.txt file and the sent files.
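Here is a fuller version of what I am trying; the cat_delimiter argument is my own guess, based on the ':' separator used in label.txt:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# corpus root where sent1.txt ... sentN.txt and label.txt live
corpus_root = '/home/abha'

# label.txt maps each file to a category, one "filename:category" pair per line,
# so I point cat_file at it and (I assume) set cat_delimiter to ':'
reader = CategorizedPlaintextCorpusReader(
    corpus_root,
    r'sent\d+\.txt',
    cat_file='label.txt',
    cat_delimiter=':')

print(reader.categories())                 # hoping for ['neg', 'pos']
print(reader.fileids(categories='pos'))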
I am trying to figure out why I can't read the contents of the english.pickle file downloaded with the nltk module.
I first downloaded the punkt data using this code:
import nltk
nltk.download('punkt')
I then looked inside the punkt directory in my home directory and found the english.pickle file. I used the following code to read the file in Python:
import pickle
with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
It all seemed fine; however, when I inspect the variable x (which should be storing the pickled data), I am unable to retrieve the data from it as I would from any other pickled file.
Instead, I only get the object name and its id:
<nltk.tokenize.punkt.PunktParameters at 0x7f86cf6c0cd0>
The problem is that I need to access the content of the file, and I can't iterate through it because it is not iterable.
Has anyone encountered the same problem?
You have downloaded the punkt tokenizer, for which the documentation says:
This tokenizer divides a text into a list of sentences by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on a
large collection of plaintext in the target language before it can be
used.
After this:
with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
You should have an nltk.tokenize.punkt.PunktSentenceTokenizer object. You can call methods on that object to perform tokenization. E.g.:
>>> x.tokenize('This is a test. I like apples. The cow is blue.')
['This is a test.', 'I like apples.', 'The cow is blue.']
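As a side note, instead of unpickling the file by hand, the same tokenizer can be obtained through NLTK's data loader, which searches the standard nltk_data directories for you:

import nltk.data

# loads tokenizers/punkt/english.pickle from the nltk_data search path
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize('This is a test. I like apples.'))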
The Watson Language Translator documentation says:
"A TMX file with parallel sentences for source and target language. You can upload multiple parallel_corpus files in one request. All uploaded parallel_corpus files combined, your parallel corpus must contain at least 5,000 parallel sentences to train successfully."
I have a number of corpus files that I would like to use to train my translation model. I've looked for ways to upload them all programmatically, with no success.
The only way I found to do so is by merging them manually into one single file.
Is there any way to send several files as parallel corpus via the API?
Can you provide examples in Python or Curl?
Thanks.
The only thing that has worked so far is aggregating the .TMX files manually and sending just one file. I have not found any way of sending several files as parallel_corpus.
with open("./training/my_corpus_SPA.TMX", "rb") as parallel:
custom_model = language_translation.create_model(
base_model_id = 'en-es',
name = 'en-es-base1xx',
#forced_glossary = glossary,
parallel_corpus = parallel).get_result()
print(json.dumps(custom_model, indent=2))
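If the aggregation step itself needs to be automated, merging the TMX files programmatically before upload is also an option. A rough sketch, assuming a standard TMX layout (a single <body> element holding the <tu> translation units) and made-up file names:

import xml.etree.ElementTree as ET

def merge_tmx(paths, out_path):
    # append the <tu> units of every additional TMX file into the first one
    base_tree = ET.parse(paths[0])
    base_body = base_tree.getroot().find('body')
    for path in paths[1:]:
        extra_body = ET.parse(path).getroot().find('body')
        for tu in extra_body.findall('tu'):
            base_body.append(tu)
    base_tree.write(out_path, encoding='utf-8', xml_declaration=True)

merge_tmx(['corpus1.tmx', 'corpus2.tmx'], 'merged.tmx')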
I think I found a solution here.
I tried this and it seems to work:
with open(corpus_fname1, 'rb') as parallel1, open(corpus_fname2, 'rb') as parallel2:
    custom_model = language_translation.create_model(
        base_model_id=base_model_es_en,
        name=model01_name,
        parallel_corpus=parallel1,
        parallel_corpus_filename2=parallel2,
        forced_glossary=None).get_result()
I have a huge amount of textual data from which I need to create a word cloud. I am using a Python library named word_cloud to create the word cloud, and it is quite configurable. The problem is that my textual data is really huge, so even a high-end computer cannot complete the task in a reasonable number of hours.
The data is initially stored in MongoDB. Due to cursor issues while reading the data into a Python list, I exported all of it to a plain text file, a simple .txt file of 304 MB.
So the question I am looking to answer is: how can I handle this huge amount of text? The word_cloud library expects a single string containing the whole data, separated by spaces, in order to create the word cloud.
P.S. Python version: 3.7.1
P.S. word_cloud is an open-source word cloud generator for Python, available on GitHub: https://github.com/amueller/word_cloud
You don't need to load the whole file into memory.
from wordcloud import WordCloud
from collections import Counter

wc = WordCloud()
counts_all = Counter()
with open('path/to/file.txt', 'r') as f:
    for line in f:  # Here you can also use the Cursor
        counts_line = wc.process_text(line)
        counts_all.update(counts_line)

wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')
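Since the data originally lives in MongoDB, the same streaming approach also works directly off a cursor instead of the exported file. A sketch assuming pymongo and a document field named 'text' (the database, collection, and field names are placeholders to adapt):

from pymongo import MongoClient
from wordcloud import WordCloud
from collections import Counter

client = MongoClient('mongodb://localhost:27017')
collection = client['mydb']['mycollection']     # placeholder names

wc = WordCloud()
counts_all = Counter()
for doc in collection.find({}, {'text': 1}):    # documents are streamed one by one
    counts_all.update(wc.process_text(doc['text']))

wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')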
I'm trying to make a bot in Python which copies some texts from a webpage. In every run it grabs 10k+ texts, so I want to save those texts in different files, with every file keeping 100+ texts.
How can I do this in Python?
Thanks.
Assuming you don't care what the file names are, you could write each batch of messages to a new temp file, thus:
import tempfile

texts = some_function_grabbing_text()
while texts:
    # NamedTemporaryFile with delete=False keeps the file on disk after it is closed
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as fp:
        fp.write(texts)
    texts = some_function_grabbing_text()
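If you want a fixed number of texts per file (the question mentions 100+ per file), here is a simple sketch using plain numbered files instead of temp files; the function and prefix names are made up:

def save_in_batches(texts, batch_size=100, prefix='texts'):
    # write every `batch_size` texts to a new numbered file
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        with open(f'{prefix}_{i // batch_size:04d}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(batch))

# e.g. save_in_batches(list_of_scraped_texts)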
Currently I've got a bunch of .txt files. Within each .txt file, each sentence is on its own line. How do I convert them to the IMS CWB format so that they're readable by CWB? And also to an NLTK-readable format?
Can someone point me to a how-to page for that? Or is there a guide? I've tried reading through the manual but I don't really understand it: www.cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf
Does it mean I create a data and a registry directory and then run the cwb-encode command, and everything will be converted to a .vrt file? Does it convert one file at a time? How do I script it to run through multiple files in a directory?
It's easy to produce CWB's "verticalized" format from an NLTK-readable corpus:
from nltk.corpus import brown

out = open('corpus.vrt', 'w')
for sentence in brown.sents():
    print('<s>', file=out)
    for word in sentence:
        print(word, file=out)
    print('</s>', file=out)
out.close()
From there, you can follow the instructions on the CWB website.
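Your own .txt files (one sentence per line) can be verticalized in much the same way. Here is a sketch that walks a directory and writes one combined .vrt file, assuming plain whitespace tokenization is good enough and using a made-up directory path:

import glob

with open('corpus.vrt', 'w') as out:
    for path in sorted(glob.glob('/path/to/txt/dir/*.txt')):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                print('<s>', file=out)
                for word in line.split():  # naive whitespace tokenization
                    print(word, file=out)
                print('</s>', file=out)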