I found this post looking for a way to identify and clean abbreviations within my dataframe. The code works well for my use case.
However, I'm dealing with a large data set and was wondering if there was a better or more efficient way to apply this without running into memory issues.
To run the code snippet, I sampled 10% of the original dataset and it runs perfectly. If I run the full dataset, my laptop locks up.
Below is an updated version of the original code:
import spacy
from scispacy.abbreviation import AbbreviationDetector
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 43793966
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
text = [nlp(t, disable=['ner', 'parser', 'tagger']) for t in train.text]
text = ' '.join([str(elem) for elem in text])
doc = nlp(text)

# Print each abbreviation and its definition
print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
I have a list of articles and want to apply stanza NLP to them, but when I run the code (in Google Colab), stanza never finishes. Online (https://github.com/stanfordnlp/stanza) I found that separating the documents (in my case, the articles in a list) with double line breaks helps speed up processing, but my code for that doesn't seem to work.
Code before trying to add linebreaks (without all the import lines):
file = open("texts.csv", mode="r", encoding='utf-8-sig')
data = list(csv.reader(file, delimiter=','))
file.close
pickle.dump(data, open('List.p', 'wb'))
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma,POS', use_gpu=True)
data_list = pickle.load(open('List.p', 'rb'))
new_list = []
for article in data_list:
a = nlp(str(article))
new_list.append(a) ### here the code runs forever and doesn't finish
pickle.dump(new_list, open('Annotated.p', 'wb'))
This code is followed by code for topic modeling. I tried the code above and the topic modeling code with a smaller dataset (327 KB) and had no trouble whatsoever, but the size of the full csv file (3.37 MB) seems to be a problem...
So I tried the following lines of code:
data_split = '\n\n'.join(data)
This gives me the error "TypeError: sequence item 0: expected str instance, list found"
data_split = '\n\n'.join(map(str, data))
Printing the first character of the result (data_split[0]) gives me "[" and nothing else.
I also played around with looping through the articles of the list 'data', creating a new list and appending to it, but that also didn't work.
Maybe there are also other ways of speeding up stanza when using large datasets?
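For context, csv.reader yields one list per row, which is why the join above raises the TypeError. A minimal sketch of flattening each row into a string before joining (my assumption about the intended fix, reusing the variable names from the code above):

# Each element of `data` is a list of csv fields, so turn it into one string first,
# then separate the articles with blank lines as the stanza documentation suggests.
articles = [' '.join(row) for row in data]
data_split = '\n\n'.join(articles)
doc = nlp(data_split)  # a single pipeline call instead of one call per article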
I have recently sourced and curated a lot of reddit data from Google Bigquery.
The dataset looks like this:
Before passing this data to word2vec to build a vocabulary and train on it, I need to properly tokenize the 'body_cleaned' column.
I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep this focused on using word_tokenize.
Because my dataset is rather large, close to 12 million rows, it is impossible for me to open and process the dataset in one go. Pandas tries to load everything into RAM and, as you can understand, it crashes, even on a system with 24 GB of RAM.
I am facing the following issue:
When I tokenize the dataset as a whole (using NLTK's word_tokenize), it is tokenized correctly, and word2vec accepts that input and learns/outputs words correctly in its vocabulary.
When I tokenize the dataset by first chunking the dataframe and iterating through it, the resulting token column is not what word2vec expects: although word2vec trains on that data for over 4 hours, the vocabulary it learns consists of single characters in several encodings, as well as emojis - not words.
To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:
Knowing that my computer can handle performing the action on the dataset, I simply did:
reddit_subset = reddit_data[:50]
reddit_subset['tokens'] = reddit_subset['body_cleaned'].apply(lambda x: word_tokenize(x))
This produces the following result:
This in fact works with word2vec and produces a model one can work with. Great so far.
Because I can't operate on such a large dataset in one go, I had to get creative with how I handle it. My solution was to batch the dataset and work on it in small iterations, using pandas' own chunksize argument.
I wrote the following function to achieve that:
def reddit_data_cleaning(filepath, batchsize=20000):
    if batchsize:
        df = pd.read_csv(filepath, encoding='utf-8', error_bad_lines=False, chunksize=batchsize, iterator=True, lineterminator='\n')
    print("Beginning the data cleaning process!")
    start_time = time.time()
    flag = 1
    chunk_num = 1
    for chunk in df:
        chunk[u'tokens'] = chunk[u'body_cleaned'].apply(lambda x: word_tokenize(x))
        chunk_num += 1
        if flag == 1:
            chunk = chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Beginning writing a new file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='w+', index=None, header=True)
            flag = 0
        else:
            chunk = chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Adding a chunk into an already existing file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='a', index=None, header=None)
    end_time = time.time()
    print("Processing has been completed in: ", (end_time - start_time), " seconds.")
Although this piece of code lets me actually work through the huge dataset in chunks, and produces results where otherwise I'd crash from memory failures, I get a result which doesn't fit my word2vec requirements, and it leaves me quite baffled as to why.
I used the above function to perform the same operation on the data subset, to compare how the results differ between the two approaches, and got the following:
The desired result is in the new_tokens column, while the function that chunks the dataframe produces the 'tokens' column.
Can anyone help me understand why the same tokenization function produces a wholly different result depending on how I iterate over the dataframe?
I appreciate it if you read through the whole issue and stuck with it!
First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.
They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:
Python list objects where each word is a separate string: once you've tokenized raw strings into this format, for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues. As in your columns, the same text might be shown either as ['yessir', 'shit', 'is', 'real'] (a true Python list literal) or as [yessir, shit, is, real] (some other mess likely to break if any tokens contain challenging characters); a small illustration follows this list.
the raw word-vectors (or later, text-vectors): these are more compact and natural/efficient to work with as raw Numpy arrays than as Dataframes.
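As a small illustration of that list-representation issue (my own example, not from the original posts): writing a column of token lists to CSV and reading it back silently turns each list into a single string, which is exactly the kind of input that makes Word2Vec learn single characters.

import pandas as pd

# Hypothetical one-row dataframe holding a tokenized text as a Python list.
df = pd.DataFrame({'tokens': [['yessir', 'shit', 'is', 'real']]})
df.to_csv('tmp.csv', index=False)

back = pd.read_csv('tmp.csv')
print(type(back['tokens'][0]))  # <class 'str'> - the list came back as one string
print(back['tokens'][0])        # looks like a list, but it is the string "['yessir', 'shit', 'is', 'real']"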
So, by all means, if Pandas helps for loading or for other non-text fields, use it there. But then use more fundamental Python or Numpy datatypes for tokenized text and vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.
Especially for large text corpora, it's more typical to get away from CSV and instead use large plain-text files, with one text per newline-separated line, and each line pre-tokenized so that spaces can be fully trusted as token separators.
That is: even if your initial text data needs more complicated punctuation-sensitive tokenization, or other preprocessing that combines/changes/splits tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.
Lots of algorithms, like Gensim's Word2Vec or FastText, can either stream such files directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
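For example, here is a minimal sketch of that streaming pattern with Gensim (my assumptions: a Gensim 4.x install, where the dimensionality parameter is named vector_size, and a hypothetical corpus_tokenized.txt in the one-text-per-line, space-separated format described above):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence lazily yields one pre-tokenized text (one line) at a time,
# so the corpus is re-read from disk each epoch instead of living in RAM.
sentences = LineSentence('corpus_tokenized.txt')

model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4, epochs=5)
model.save('reddit_word2vec.model')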
For more details on this efficient way to work with large bodies of text, see this article: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
After taking gojomo's advice I simplified my approach to reading the csv file and writing to a text file.
My initial approach using pandas had yielded some pretty bad processing times for a file with around 12 million rows, and memory issues due to how pandas reads all the data into memory before writing it out to a file.
What I also realized was that I had a major flaw in my previous code.
I was printing some output as a sanity check, and because I printed it too often I overflowed Jupyter and crashed the notebook, never letting the underlying and most important task complete.
I got rid of that, simplified the reading with the csv module and the writing into a txt file, and processed the reddit database of ~12 million rows in less than 10 seconds.
Maybe not the finest piece of code, but I was scrambling to solve an issue that had been a roadblock for me for a couple of days (and not realizing that my sanity checks crashing Jupyter were part of the problem was an even bigger frustration).
def generate_corpus_txt(csv_filepath, output_filepath):
    import csv
    import time
    start_time = time.time()
    with open(csv_filepath, encoding='utf-8') as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        header = next(csvfile)  # consume the header line so it is not written to the corpus
        print(time.asctime(time.localtime()), " ---- Beginning Processing")
        with open(output_filepath, 'w+') as output:
            # Only process rows if the file actually had a header line (i.e. was not empty)
            if header is not None:
                for row in datareader:
                    # Each row is a list of the csv fields; join them into one space-separated line
                    processed_row = str(' '.join(row)) + '\n'
                    output.write(processed_row)
                    count += 1
                    if count == 1000000:
                        print(time.asctime(time.localtime()), " ---- Processed 1,000,000 rows of data.")
                        count = 0
        print('Processing took:', int((time.time() - start_time) / 60), ' minutes')
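For reference, a call might look like this (the file names are just placeholders):

# Hypothetical paths; adjust to the actual csv and the desired output location.
generate_corpus_txt('reddit_data.csv', 'reddit_corpus.txt')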
I'm trying to create my own corpus out of a set of text files. However, I want to do some preprocessing on the text files before they get corpus-ized, and I can't figure out how to do that, short of writing a script that runs through every single text file first, does the preprocessing, and saves a new text file, and then building the corpus on the new, post-processed files. (This seems inefficient, because I'd need to read through ~200 MB of files twice, and it isn't really scalable if I had a much larger corpus.)
The preprocessing that I want to do is very basic text manipulation:
Make every word in the corpus lower case
Remove any items entirely enclosed in brackets, e.g., [coughing]
Remove the digits at the start of each line (they're line numbers from the original transcriptions); they are always the first four characters of the line
Critically, I want to do this preprocessing BEFORE the words enter the corpus - I don't want, e.g., "[coughing]" or "0001" as an entry in my corpus, and instead of "TREE" I want "tree."
I've got the basic corpus reader code, but the problem is that I can't figure out how to modify the pattern matching as it reads in the files and builds the corpus. Is there a good way to do this?
corpusdir = "C:/corpus/"
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
corpus_words = newcorpus.words() # get words in the corpus
fdist = nltk.FreqDist(corpus_words) # make frequency distribution of the words in the corpus
This answer seems sort of on the right track, but the relevant words are already in the corpus and the poster wants to ignore/strip punctuation before tokenizing the corpus. I want to affect which types of words are even entered (i.e., counted) in the corpus at all.
Thanks in advance!
I disagree with your inefficiency comment, because once the corpus has been processed you can analyze the processed corpus multiple times without having to run the cleaning function each time. That said, if you are going to be running this many times, maybe you would want to find a quicker option.
As far as I understand, PlaintextCorpusReader needs files as input. I used code from Alvas' answer on another question to build this response. See Alvas' fantastic answer on using PlaintextCorpusReader here.
Here's my workflow:
from glob import glob
import re
import os
from nltk.corpus import PlaintextCorpusReader
from nltk.probability import FreqDist

mycorpusdir = glob('path/to/your/corpus/*')

# matches text enclosed in square brackets, e.g. [coughing]
re_brackets = r'(\[.*?\])'
# matches exactly 4 digits
re_numbers = r'(\d{4})'
Lowercase everything, remove numbers:
corpus = []
for file in mycorpusdir:
    f = open(file).read()
    # lowercase everything
    all_lower = f.lower()
    # remove bracketed text
    no_brackets = re.sub(re_brackets, '', all_lower)
    # remove #### numbers
    just_words = re.sub(re_numbers, '', no_brackets)
    corpus.append(just_words)
Make new directory for the processed corpus:
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    with open(corpusdir + str(filename) + '.txt', 'w+') as fout:
        print(text, file=fout)
    filename += 1
Call PlaintextCorpusReader:
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')
corpus_words = newcorpus.words()
fdist = FreqDist(corpus_words)
print(fdist)
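If it helps, the frequency distribution can then be inspected directly, for example:

# Show the ten most common tokens in the cleaned corpus.
print(fdist.most_common(10))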
I'm starting some text analysis on a csv document. However, my csv document has several sentences with only a few words, which do not interest me, so I want to write Python code that reads this csv document and keeps only the sentences containing more than 5 words for my analysis. I don't know where to start with this code and would like some help.
example:
Input document (screenshot not included)
Output document (screenshot not included)
This should work (with Python 3.5):
import csv

finalLines = []
toRemove = ['a', 'in', 'the']

with open('export.csv') as f:
    lines = f.readlines()

for row in csv.reader(lines):
    sentence = ''
    # rebuild the first column of each row without the filler words
    for word in row[0].split():
        if word not in toRemove:
            sentence = sentence + ' ' + word
    finalLines.append(sentence.strip())

print(finalLines)
You can get your work done efficiently and with ease if you use pandas (a Python library widely used for data manipulation). Here is the link to the official pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/
Note: pandas has built-in functions for reading csv files. You can use the 'skiprows' parameter to skip rows you don't want, or apply a regex to filter the text.
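As a minimal sketch of that pandas approach (my assumptions: the file is named export.csv and the text lives in a column named 'sentence'; adjust both to the actual data):

import pandas as pd

df = pd.read_csv('export.csv')

# Keep only the rows whose sentence contains more than 5 words.
filtered = df[df['sentence'].str.split().str.len() > 5]

filtered.to_csv('export_filtered.csv', index=False)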
I am working with a medium-sized text dataset - about 1 GB of a single text column that I have loaded as a pandas Series (of dtype object). It is called textData.
I want to create a doc for each text row and then tokenize, but I want to use my custom tokenizer.
from joblib import Parallel, delayed
from spacy.en import English
from spacy.symbols import punct, det, agent, prep, aux, auxpass, cc, expl, quantmod

nlp = English()
docs = nlp.pipe([text for text in textData], batch_size=batchSize, n_threads=n_threads)

# This runs without any errors, but results1 is empty
results1 = Parallel(n_jobs=-1)(delayed(clean_tokens)(doc) for doc in docs)

# This runs, and returns the expected result
results2 = [clean_tokens(doc) for doc in docs]

def clean_tokens(doc):
    # drop tokens whose dependency label is in the exclusion list, then lemmatize the rest
    exclusions = [token.i for token in doc if token.dep in [punct, det, agent, prep, aux, auxpass, cc, expl, quantmod]]
    tokens = [token.lemma_ for token in doc if token.i not in exclusions]
    return tokens
I am running the above code inside main(), with a call to main() from a script.
Any reason this should not work? If there is a pickling problem, it doesn't get raised.
Is there any way to make this work?