I have a list of articles and want to apply stanza NLP to them, but when I run the code (in Google Colab), stanza never finishes. Online (https://github.com/stanfordnlp/stanza) I found that separating the documents (in my case, the articles in a list) with double linebreaks helps speed up the process, but my code for that doesn't seem to work.
Code before trying to add linebreaks (without all the import lines):
file = open("texts.csv", mode="r", encoding='utf-8-sig')
data = list(csv.reader(file, delimiter=','))
file.close()
pickle.dump(data, open('List.p', 'wb'))
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma,POS', use_gpu=True)
data_list = pickle.load(open('List.p', 'rb'))
new_list = []
for article in data_list:
    a = nlp(str(article))
    new_list.append(a) ### here the code runs forever and doesn't finish
pickle.dump(new_list, open('Annotated.p', 'wb'))
This code is followed by code for topic modeling. I tried the code above and the topic modeling code with a smaller dataset (327 KB) and had no trouble whatsoever, but the size of this csv file (3.37 MB) seems to be a problem.
So I tried the following lines of code:
data_split = '\n\n'.join(data)
This gives me the error "TypeError: sequence item 0: expected str instance, list found"
Then I tried:
data_split = '\n\n'.join(map(str, data))
Printing the first item of the list (data_split[0]) gives me "[" and nothing else.
I also played around with looping through the articles of the list 'data', creating a new list and appending to it, but that also didn't work.
Maybe there are also other ways of speeding up stanza when using large datasets?
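For reference, a minimal sketch of the double-linebreak join (assuming the article text sits in the first column of each csv row, which is an assumption): csv.reader yields a list of fields per row, which is what causes the TypeError, so each row has to be flattened to a single string first.

# 'data' is the list of rows produced by csv.reader; assume the article text is in the first column
articles = [row[0] for row in data]        # or ' '.join(row) if the text spans several columns

# one string, with articles separated by blank lines as the stanza README suggests
data_split = '\n\n'.join(articles)
annotated = nlp(data_split)

Newer stanza releases also support batching by passing a list of stanza.Document objects to the pipeline (e.g. nlp([stanza.Document([], text=a) for a in articles])), which may be worth trying as well.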
I have recently sourced and curated a lot of reddit data from Google Bigquery.
The dataset looks like this:
Before passing this data to word2vec to create a vocabulary and be trained, it is required that I properly tokenize the 'body_cleaned' column.
I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep it focused on using word_tokenize.
Because my dataset is rather large, close to 12 million rows, it is impossible for me to open and perform functions on the dataset in one go. Pandas tries to load everything into RAM and, as you can understand, it crashes, even on a system with 24 GB of RAM.
I am facing the following issue:
When I tokenize the dataset (using NLTK's word_tokenize), if I perform the function on the dataset as a whole, it correctly tokenizes, and word2vec accepts that input and learns/outputs words correctly in its vocabulary.
When I tokenize the dataset by first batching the dataframe and iterating through it, the resulting token column is not what word2vec expects: although word2vec trains on that data for over 4 hours, the vocabulary it learns consists of single characters in several encodings, as well as emojis, not words.
To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:
Knowing that my computer can handle performing the action on the dataset, I simply did:
reddit_subset = reddit_data[:50]
reddit_subset['tokens'] = reddit_subset['body_cleaned'].apply(lambda x: word_tokenize(x))
This produces the following result:
This in fact works with word2vec and produces a model one can work with. Great so far.
Because of my inability to operate on such a large dataset in one go, I had to get creative with how I handle it. My solution was to batch the dataset and work on it in small iterations using pandas' chunksize argument in read_csv.
I wrote the following function to achieve that:
def reddit_data_cleaning(filepath, batchsize=20000):
    if batchsize:
        df = pd.read_csv(filepath, encoding='utf-8', error_bad_lines=False, chunksize=batchsize, iterator=True, lineterminator='\n')
    print("Beginning the data cleaning process!")
    start_time = time.time()
    flag = 1
    chunk_num = 1
    for chunk in df:
        chunk[u'tokens'] = chunk[u'body_cleaned'].apply(lambda x: word_tokenize(x))
        chunk_num += 1
        if flag == 1:
            chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Beginning writing a new file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='w+', index=None, header=True)
            flag = 0
        else:
            chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Adding a chunk into an already existing file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='a', index=None, header=None)
    end_time = time.time()
    print("Processing has been completed in: ", (end_time - start_time), " seconds.")
Although this piece of code lets me actually work through this huge dataset in chunks and produces results where I would otherwise crash from memory failures, I get a result that doesn't fit my word2vec requirements, and I'm quite baffled as to why.
I used the above function to perform the same operation on the data subset to compare how the result differs between the two functions, and got the following:
The desired result is in the new_tokens column, and the function that chunks the dataframe produces the "tokens" column result.
Can anyone help me understand why the same tokenization function produces a wholly different result depending on how I iterate over the dataframe?
I appreciate it if you read through the whole issue and stuck with it!
First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.
They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:
Python list objects where each word is a separate string: once you've tokenized raw strings into this format, for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues (as with your columns, where the same text might be shown as either ['yessir', 'shit', 'is', 'real'] – which is a true Python list literal – or [yessir, shit, is, real] – which is some other mess likely to break if any tokens have challenging characters; see the short sketch after this list).
the raw word-vectors (or later, text-vectors): these are more compact & natural/efficient to work with in raw Numpy arrays than Dataframes
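To make that list-representation problem concrete, here's a minimal sketch (assuming a pandas DataFrame with a hypothetical 'tokens' column of Python lists that gets round-tripped through CSV, as the chunked function in the question does):

import pandas as pd

df = pd.DataFrame({'tokens': [['yessir', 'shit', 'is', 'real']]})
df.to_csv('roundtrip.csv', index=False)        # the list is written out as its string repr

df2 = pd.read_csv('roundtrip.csv')
print(type(df2['tokens'][0]))                  # <class 'str'> - no longer a list
print(list(df2['tokens'][0])[:6])              # ['[', "'", 'y', 'e', 's', 's'] - what Word2Vec then iterates over

That is exactly the symptom described above: a vocabulary of single characters rather than words.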
So, by all means, if Pandas helps for loading or other non-text fields, use it there. But then use more fundamental Python or Numpy datatypes for tokenized text & vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.
Especially for large text corpuses, it's more typical to get away from CSV and instead use large text files, with one text per newline-separated line, and each line pre-tokenized so that spaces can be fully trusted as token separators.
That is: even if your initial text data has more complicated punctuation-sensitive tokenization, or other preprocessing that combines/changes/splits other tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.
Lots of algorithms, like Gensim's Word2Vec or FastText, can either stream such files directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
For more details on this efficient way to work with large bodies of text, see this article: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
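As a minimal sketch of such a low-overhead iterable wrapper (assuming Gensim 4.x, where the Word2Vec parameter is vector_size, and a hypothetical corpus.txt in the one-text-per-line, space-separated format described above):

from gensim.models import Word2Vec

class CorpusStream:
    """Stream a pre-tokenized corpus file: one text per line, tokens separated by spaces."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()   # spaces are the only token separator

# The corpus is re-read from disk for each training pass; it is never fully in memory.
model = Word2Vec(sentences=CorpusStream('corpus.txt'),
                 vector_size=100, window=5, min_count=5, workers=4)

Gensim also ships gensim.models.word2vec.LineSentence, which wraps exactly this file format.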
After taking gojomo's advice, I simplified my approach to reading the csv file and writing to a text file.
My initial approach using pandas had yielded some pretty bad processing times for a file with around 12 million rows, and memory issues due to how pandas reads data all into memory before writing it out to a file.
What I also realized was that I had a major flaw in my previous code.
I was printing some output (as a sanity check), and because I printed output too often, I overflowed Jupyter and crashed the notebook, not allowing the underlying and most important task to complete.
I got rid of that, simplified reading with the csv module and writing into a txt file, and I processed the reddit database of ~12 million rows in less than 10 seconds.
Maybe not the finest piece of code, but I was scrambling to solve an issue that stood as a roadblock for me for a couple of days (and not realizing that part of my problem was my sanity checks crashing Jupyter was an even bigger frustration).
def generate_corpus_txt(csv_filepath, output_filepath):
    import csv
    import time
    start_time = time.time()
    with open(csv_filepath, encoding='utf-8') as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        header = next(csvfile)
        print(time.asctime(time.localtime()), " ---- Beginning Processing")
        with open(output_filepath, 'w+') as output:
            # Check file as empty
            if header != None:
                for row in datareader:
                    # Iterate over each row after the header in the csv
                    # row variable is a list that represents a row in csv
                    processed_row = str(' '.join(row)) + '\n'
                    output.write(processed_row)
                    count += 1
                    if count == 1000000:
                        print(time.asctime(time.localtime()), " ---- Processed 1,000,000 Rows of data.")
                        count = 0
    print('Processing took:', int((time.time() - start_time) / 60), ' minutes')
    output.close()
    csvfile.close()
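A quick usage sketch (filenames are hypothetical; Gensim 4.x parameter names assumed) of feeding the resulting space-separated text file straight into Word2Vec via Gensim's LineSentence wrapper:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

generate_corpus_txt('reddit_comments.csv', 'reddit_corpus.txt')
model = Word2Vec(LineSentence('reddit_corpus.txt'), vector_size=100, window=5, min_count=5, workers=4)
model.save('reddit_w2v.model')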
Hello and thank you for reading. To put it simply, I want to perform Batch Transform on my XGBoost model that I made using SageMaker Experiments. I trained my model on csv data stored in S3, deployed an endpoint for my model, successfully hit said endpoint with single csv lines and got back expected inferences.
(I followed this tutorial to the letter before starting to work on Batch Transformation)
Now I am attempting to run Batch Transformation using the model created from the above tutorial and I'm running into an error (skip to the bottom to see my error logs). Before I get straight to the error, I want to show my batch transform code.
(imports are done from SageMaker SDK v2.24.4)
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model
region = boto3.Session().region_name
role = get_execution_role()
image = sagemaker.image_uris.retrieve('xgboost', region, '1.2-1')
model_location = '{mys3info}/output/model.tar.gz'
model = Model(image_uri=image,
model_data=model_location,
role=role,
)
transformer = model.transformer(instance_count=1,
instance_type='ml.m5.xlarge',
strategy='MultiRecord',
assemble_with='Line',
output_path='myOutputPath',
accept='text/csv',
max_concurrent_transforms=1,
max_payload=20)
transformer.transform(data='s3://test-s3-prefix/short_test_data.csv',
content_type='text/csv',
split_type='Line',
join_source='Input'
)
transformer.wait()
short_test_data.csv
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown
57,blue-collar,married,primary,no,52,yes,no,unknown,5,may,38,1,-1,0,unknown
32,blue-collar,single,primary,no,23,yes,yes,unknown,5,may,160,1,-1,0,unknown
53,technician,married,secondary,no,-3,no,no,unknown,5,may,1666,1,-1,0,unknown
29,management,single,tertiary,no,0,yes,no,unknown,5,may,363,1,-1,0,unknown
32,management,married,tertiary,no,0,yes,no,unknown,5,may,179,1,-1,0,unknown
38,management,single,tertiary,no,424,yes,no,unknown,5,may,104,1,-1,0,unknown
I made the above csv test data using my original dataset in my command line by running:
head original_training_data.csv > short_test_data.csv
and then I uploaded it to my S3 bucket manually.
Logs
[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=20, BatchStrategy=MULTI_RECORD
[sagemaker logs]: */short_test_data.csv: ClientError: 415
[sagemaker logs]: */short_test_data.csv:
[sagemaker logs]: */short_test_data.csv: Message:
[sagemaker logs]: */short_test_data.csv: Loading csv data failed with Exception, please ensure data is in csv format:
[sagemaker logs]: */short_test_data.csv: <class 'ValueError'>
[sagemaker logs]: */short_test_data.csv: could not convert string to float: 'entrepreneur'
I understand the concept of one-hot encoding and other methods for converting strings to numbers for usage by an algorithm like XGBoost. My problem here is that I was easily able to input the exact same format of data into a deployed endpoint and get results back without doing that level of encoding. I am clearly missing something though, so any help is greatly appreciated!
Your Batch Transform code looks good and does not raise any alarms, but looking at the error message, it is clearly an input format error. As silly as it may sound, I'd advise you to use pandas to save off the test data from your validation set to ensure the formatting is appropriate.
You could do something like this -
data = pd.read_csv("file")
#specify columns to save from the extracted df
data = data[["choose columns"]]
# save the data to csv
data.to_csv("data.csv", sep=',', index=False)
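To expand on the formatting point (this part is an assumption about how the data was prepared, not something confirmed in the question): the built-in SageMaker XGBoost container expects inference CSV to contain only numeric, already-encoded features, with no header and no label column, which is consistent with the 415 / ValueError on 'entrepreneur'. A sketch of writing such a file from an already-processed validation split (filenames and column layout are hypothetical):

import pandas as pd

# 'validation_processed.csv' is assumed to hold the same encoded numeric features used at training time,
# with the target in the first column (the built-in XGBoost training convention).
val = pd.read_csv('validation_processed.csv', header=None)

# Drop the target column and write numeric-only CSV: no header, no index.
val.iloc[:, 1:].to_csv('batch_input.csv', header=False, index=False)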
My data has more than 1 million rows, and while training a gensim similarity model, it creates multiple .sav files (model.sav, model.sav.0, model.sav.1 and so on). The problem is that while loading, it loads only one sub-part instead of all the sub-parts, and hence performs horribly in prediction. Parameters/options are not working as per the gensim documentation.
As per the gensim documentation - https://radimrehurek.com/gensim/similarities/docsim.html
According to the docs, saving with a filename or file handle and the following parameters should have worked:
model.save(fname_or_handle, separately = None)
model.load(filepath, mmap = 'r')
I also tried to:
pickle the .sav files (this pickles the 1st shard only, i.e. model.sav)
compress all sub-parts as a .gz file (this compresses one shard only, not all the sub-parts, and also gives some sort of pickle error)
tf_idf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.Similarity('./models/model.sav',tf_idf[corpus],
num_features=len(dictionary))
sims.save('./models/model.sav')
sims1 = gensim.similarities.Similarity.load('./models/model.sav')
The expected result should include all matching documents from the corpus, but this gives only those from model.sav (the file mentioned while loading). It does NOT even query the other shards; I checked the result from each shard.
Question: How do I use all the sub-files of gensim model to predict similarity of my test document, WITHOUT looping through every sub-file individually and then presenting union of those results.
It's my understanding that 'model.sav' serves as a sort of directory to access all the actual similarity shards.
What's your output from len(sims1)? Running the above code on a corpus of 65,536 entries (creates exactly two shards), I can save and load a corpus and check that it has the 65,536 documents. I can also add documents and further save/load.
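For reference, a minimal sketch of the save/load/query round trip (paths and test_tokens are hypothetical; dictionary, corpus and tf_idf are assumed to be built as in the question); a loaded Similarity index queries all of its shards transparently when you index into it:

from gensim import similarities

index = similarities.Similarity('./models/model.sav', tf_idf[corpus], num_features=len(dictionary))
index.save('./models/model.sav')

# Loading the top-level file restores references to every shard (model.sav.0, model.sav.1, ...),
# as long as the shard files are still at their saved locations.
loaded = similarities.Similarity.load('./models/model.sav')

query_bow = dictionary.doc2bow(test_tokens)   # hypothetical tokenized test document
scores = loaded[tf_idf[query_bow]]            # similarities against the whole corpus, all shards included
print(len(scores))                            # should equal the total number of indexed documents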
I am working with a medium-sized text dataset - about 1 GB of a single text column that I have loaded as a pandas series (of dtype object). It is called textData.
I want to create docs for each text row, and then tokenize. But I want to use my custom tokenizer.
from joblib import Parallel, delayed
from spacy.en import English
# assumed source of the dependency-label constants used in clean_tokens below
from spacy.symbols import punct, det, agent, prep, aux, auxpass, cc, expl, quantmod

nlp = English()
docs = nlp.pipe([text for text in textData], batch_size=batchSize, n_threads=n_threads)

# This runs without any errors, but results1 is empty
results1 = Parallel(n_jobs=-1)(delayed(clean_tokens)(doc) for doc in docs)

# This runs, and returns the expected result
results2 = [clean_tokens(doc) for doc in docs]

def clean_tokens(doc):
    # drop tokens whose dependency label is in the exclusion set, then keep lemmas of the rest
    exclusions = [token.i for token in doc if token.dep in [punct, det, agent, prep, aux, auxpass, cc, expl, quantmod]]
    tokens = [token.lemma_ for token in doc if token.i not in exclusions]
    return tokens
I am running the above functions inside main() with a call to main() using a script.
Any reason this should not work? If there is a pickling problem - it doesn't get raised.
Is there any way to make this work?
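For comparison, a minimal sketch of letting spaCy parallelize the pipeline itself rather than joblib, assuming a modern spaCy (2.2+, where Language.pipe accepts n_process) instead of the spacy.en import above; the model name and worker counts are placeholders:

import spacy

nlp = spacy.load('en_core_web_sm')

EXCLUDED_DEPS = {'punct', 'det', 'agent', 'prep', 'aux', 'auxpass', 'cc', 'expl', 'quantmod'}

def clean_tokens(doc):
    # same filtering as above, but comparing string dependency labels
    return [token.lemma_ for token in doc if token.dep_ not in EXCLUDED_DEPS]

# spaCy forks worker processes internally; results come back as plain lists of strings.
results = [clean_tokens(doc) for doc in nlp.pipe(textData, batch_size=1000, n_process=4)]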