I am working with a medium-sized text dataset: about 1 GB of a single text column that I have loaded as a pandas Series (dtype object) called textData.
I want to create a Doc for each text row and then tokenize, but I want to use my custom tokenizer.
from joblib import Parallel, delayed
from spacy.en import English
from spacy.symbols import punct, det, agent, prep, aux, auxpass, cc, expl, quantmod

nlp = English()
docs = nlp.pipe([text for text in textData], batch_size=batchSize, n_threads=n_threads)
# This runs without any errors, but results1 is empty
results1 = Parallel(n_jobs=-1)(delayed(clean_tokens)(doc) for doc in docs)
# This runs, and returns expected result
results2 = [clean_tokens(doc) for doc in docs]
def clean_tokens(doc):
    # drop tokens whose dependency label is in the exclusion list, return lemmas of the rest
    exclusions = [token.i for token in doc if token.dep in [punct, det, agent, prep, aux, auxpass, cc, expl, quantmod]]
    tokens = [token.lemma_ for token in doc if token.i not in exclusions]
    return tokens
I am running the above code inside main(), and main() is called from a script.
Any reason this should not work? If there is a pickling problem, no error is raised.
Is there any way to make this work?
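For reference, a minimal sketch of one possible workaround: parallelize over chunks of raw strings instead of Doc objects, so each worker builds its own pipeline and nothing spaCy-specific has to cross a process boundary. The spacy.load call, the chunk sizes, and the simplified punctuation-only filter are illustrative assumptions, not a drop-in replacement for clean_tokens above:

from joblib import Parallel, delayed
import spacy

def chunks(seq, size):
    # split a list into consecutive slices of at most `size` items
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def clean_chunk(texts):
    # each worker loads its own pipeline, so no Doc objects need to be pickled
    nlp = spacy.load('en')
    return [[t.lemma_ for t in doc if not t.is_punct]
            for doc in nlp.pipe(texts, batch_size=1000)]

texts = list(textData)  # textData is the pandas Series from the question
parts = Parallel(n_jobs=-1)(delayed(clean_chunk)(c) for c in chunks(texts, 10000))
results = [tokens for part in parts for tokens in part]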
Related
I am currently working on filtering certain details from a large batch of around 200 log files in a Jupyter Notebook. I am reading and processing these log files as a pandas DataFrame. However, when I run my preprocessing function on the column, I get the following error:
MemoryError: Unable to allocate 7.07 GiB for an array with shape (6590623, 288) and data type float32
The function that I am using tokenizes the filtered data and returns only the tokens that I extracted using spacy. Here is some context for the function:
def preprocess(text):
    doc = nlp(text, disable=['ner', 'parser'])
    lemmas = [token.lemma_ for token in doc]
    commands = get_commands("command-words.txt")
    #tokens_clean = [token.lower() for token in lemmas if token.isalnum() and token.lower() not in months]
    tokens_clean = []
    for token in lemmas:
        if token.isalpha() and token.lower() not in months and token.lower() not in commands:
            tokens_clean.append(token.lower())
    return ' '.join(tokens_clean)
Does anyone happen to know how to fix this issue? When I work with a much smaller batch of around 50 log files it runs fine, but not when I use the larger batch.
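If the memory pressure comes from building up all of the processed documents at once (the (6590623, 288) float32 array in the error may also point to a later vectorization step), one mitigation is to stream the column through nlp.pipe in chunks rather than calling nlp() row by row. A minimal sketch, assuming a df DataFrame with a hypothetical 'message' column and the months/get_commands objects from the question:

import spacy

# load the pipeline once with the unused components disabled
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
commands = get_commands("command-words.txt")  # load the word list once, not per row

def preprocess_batch(texts):
    cleaned = []
    for doc in nlp.pipe(texts, batch_size=256):
        tokens = [t.lemma_.lower() for t in doc
                  if t.lemma_.isalpha()
                  and t.lemma_.lower() not in months
                  and t.lemma_.lower() not in commands]
        cleaned.append(' '.join(tokens))
    return cleaned

CHUNK = 10_000  # illustrative chunk size
results = []
for start in range(0, len(df), CHUNK):
    chunk = df['message'].iloc[start:start + CHUNK].astype(str).tolist()
    results.extend(preprocess_batch(chunk))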
I am trying to figure out why I can't read the contents of the english.pickle file downloaded with the nltk module.
I first downloaded the nltk file using this code:
import nltk
nltk.download('punkt')
I then looked inside the punkt directory in my home directory and found the english.pickle file. I used the following code to read the file in Python:
import pickle

with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
It all seemed fine; however, when I inspect the variable x (which should be storing the unpickled data), I am unable to retrieve the data from it as I would from any other pickled file.
Instead, I only get the object's class name and id:
<nltk.tokenize.punkt.PunktParameters at 0x7f86cf6c0cd0>
The problem is that I need to access the content of the file, and I can't iterate through the object because it is not iterable.
Has anyone encountered the same problem?
You have downloaded the punkt tokenizer, for which the documentation says:
This tokenizer divides a text into a list of sentences by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on a
large collection of plaintext in the target language before it can be
used.
After this:
with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
You should have a nltk.tokenize.punkt.PunktSentenceTokenizer object. You can call methods on that object to perform tokenization. E.g.:
>>> x.tokenize('This is a test. I like apples. The cow is blue.')
['This is a test.', 'I like apples.', 'The cow is blue.']
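If you prefer not to unpickle the file by hand, the same tokenizer can also be loaded through NLTK's resource loader; a small sketch:

import nltk

# load the pre-trained Punkt sentence tokenizer via NLTK's resource path
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize('This is a test. I like apples.'))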
I found this post looking for a way to identify and clean abbreviations within my dataframe. The code works well for my use case.
However, I'm dealing with a large dataset and was wondering whether there is a better or more efficient way to apply this without running into memory issues.
To run the code snippet at all, I sampled 10% of the original dataset, and it runs perfectly. If I run the full dataset, my laptop locks up.
Below is an updated version of the original code:
import spacy
from scispacy.abbreviation import AbbreviationDetector
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 43793966
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
text = [nlp(text, disable = ['ner', 'parser','tagger']) for text in train.text]
text = ' '.join([str(elem) for elem in text])
doc = nlp(text)
# Print each abbreviation and its definition
print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
I'm trying to create my own corpus out of a set of text files. However, I want to do some preprocessing on the text files before they get corpus-ized and I can't figure out how to do that, short of creating a script to run through every single text file first, do the text preprocessing, save a new text file, and then make the corpus on the new, post-processed files. (This seems inefficient now, because I have ~200 mb of files that I would need to read through twice, and is not really scalable if I had a much larger corpus.)
The preprocessing that I want to do is very basic text manipulation:
Make every word as listed in the corpus lower case
Remove any items entirely enclosed in brackets, e.g., [coughing]
Remove the digits at the start of each line (they're line numbers from the original transcriptions and are always the first four characters of the line)
Critically, I want to do this preprocessing BEFORE the words enter the corpus - I don't want, e.g., "[coughing]" or "0001" as an entry in my corpus, and instead of "TREE" I want "tree."
I've got the basic corpus reader code, but the problem is that I can't figure out how to modify pattern matching as it reads in the files and builds the corpus. Is there a good way to do this?
corpusdir = "C:/corpus/"
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
corpus_words = newcorpus.words() # get words in the corpus
fdist = nltk.FreqDist(corpus_words) # make frequency distribution of the words in the corpus
This answer seems sort of on the right track, but the relevant words are already in the corpus and the poster wants to ignore/strip punctuation before tokenizing the corpus. I want to affect which types of words are even entered (i.e., counted) in the corpus at all.
Thanks in advance!
I disagree with your inefficiency comment because once the corpus has been processed, you can analyze the processed corpus multiple times without having to run a cleaning function each time. That being said, if you are going to be running this multiple times, maybe you would want to find a quicker option.
As far as I can understand, PlaintextCorpusReader needs files as an input. I used code from Alvas' answer on another question to build this response. See Alvas' fantastic answer on using PlaintextCorpusReader here.
Here's my workflow:
from glob import glob
import re
import os
from nltk.corpus import PlaintextCorpusReader
from nltk.probability import FreqDist
mycorpusdir = glob('path/to/your/corpus/*')
# captures bracket-ed text
re_brackets = r'(\[.*?\])'
# exactly 4 numbers
re_numbers = r'(\d{4})'
Lowercase everything, remove numbers:
corpus = []
for file in mycorpusdir:
    with open(file) as infile:
        f = infile.read()
    # lowercase everything
    all_lower = f.lower()
    # remove [bracketed] text
    no_brackets = re.sub(re_brackets, '', all_lower)
    # remove #### numbers
    just_words = re.sub(re_numbers, '', no_brackets)
    corpus.append(just_words)
Make new directory for the processed corpus:
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    with open(corpusdir + str(filename) + '.txt', 'w+') as fout:
        print(text, file=fout)
    filename += 1
Call PlaintextCorpusReader:
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')
corpus_words = newcorpus.words()
fdist = FreqDist(corpus_words)
print(fdist)
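As a small follow-up to the efficiency point above, once the cleaned files are on disk the processed corpus can be re-analyzed without rerunning the cleaning loop, for example:

# inspect the cleaned corpus without re-running any cleaning code
print(fdist.most_common(20))  # twenty most frequent cleaned tokens
print(newcorpus.fileids())    # the files that make up the processed corpus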
I want to perform basic preprocessing and tokenization within my input function. My data is contained in CSVs in a Google Cloud Storage bucket location (gs://) that I cannot modify. Further, I want to perform any modifications to the input text within my ml-engine package so that the behavior can be replicated at serving time.
My input function follows the basic structure below:
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader()
_, rows = reader.read_up_to(filename_queue, num_records=batch_size)
text, label = tf.decode_csv(rows, record_defaults = [[""],[""]])
# add logic to filter special characters
# add logic to make all words lowercase
words = tf.string_split(text) # splits based on white space
Are there any options that avoid performing this preprocessing on the entire dataset in advance? This post suggests that tf.py_func() can be used to make these transformations; however, it notes that "The drawback is that as it is not saved in the graph, I cannot restore my saved model," so I am not convinced that this will be useful at serving time. If I define my own tf.py_func() to do preprocessing and it is defined in the trainer package that I am uploading to the cloud, will I run into any issues? Are there any alternative options that I am not considering?
Best practice is to write a function that you call from both the training/eval input_fn and from your serving input_fn.
For example:
def add_engineered(features):
    text = features['text']
    features['words'] = tf.string_split(text)
    return features
Then, in your input_fn, wrap the features you return with a call to add_engineered:
def input_fn():
    features = ...
    label = ...
    return add_engineered(features), label
and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:
def serving_input_fn():
    feature_placeholders = ...
    features = ...
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )
Your model would use 'words'. However, your JSON input at prediction time would only need to contain 'text' i.e. the raw values.
Here's a complete working example:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/taxifare/trainer/model.py#L107
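For completeness, a minimal sketch of what add_engineered could look like with the question's lowercasing and special-character filtering done in-graph. This assumes a TensorFlow release that provides tf.strings.lower and tf.strings.regex_replace, which is newer than the contrib-era InputFnOps API above, so treat it as illustrative rather than a drop-in:

import tensorflow as tf

def add_engineered(features):
    text = features['text']
    # lowercase in-graph so training and serving apply the identical transformation
    text = tf.strings.lower(text)
    # replace anything that is not a lowercase letter, digit, or space
    text = tf.strings.regex_replace(text, r'[^a-z0-9 ]', ' ')
    features['words'] = tf.string_split(text)
    return features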