Speed up Spacy processing - python

I want to pre-process text data using spacy (or something else).
My code below works but is really slow. I only have a 20 MB zipped text file as a demo and it takes more than 10 minutes to process with my code.
The problem is that I'll eventually need to process about 20 GB of zipped text files, so I want to speed up my algorithm before then.
Also, how will I be able to deal with a 20 GB zipped text file? It will blow through my 16 GB of main memory if I run the code below. Can I read it line by line and still get good speed?
Any help would be appreciated.
import string
import zipfile

import spacy

nlp = spacy.load("en_core_web_sm", n_process=4)

with zipfile.ZipFile(filename, 'r') as thezip:
    text = thezip.open(thezip.filelist[0], mode='r').read()
    text = text.decode('utf-8').splitlines()

for doc in nlp.pipe(text, disable=["tok2vec", "parser", "attribute_ruler"], batch_size=2000):
    # Do something with the doc here.
    # First remove punctuation
    tokens = [t for t in doc if t.text not in string.punctuation]
    # then remove stop words, weird unicode characters, words with digits in them,
    # and empty characters
    tokens = [t for t in tokens if not t.is_stop and t.is_ascii and not t.is_digit
              and len(t) > 1 and not any(char.isdigit() for char in t.text)]
    # remove empty lines, make it lower case and put the tokens in sentence form
    if len(tokens):
        sentence = " ".join(token.text.lower() for token in tokens)
        # do something useful with sentence here

It looks like you just want to use the spaCy tokenizer? In that case use nlp = spacy.blank("en") instead of spacy.load, and then you can leave out the disable part in nlp.pipe.
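For example, a minimal sketch of that change, reusing the text list of lines from the question:
import spacy

# a blank English pipeline only runs the tokenizer: there is nothing to
# disable and no statistical model to load, so it is much faster
nlp = spacy.blank("en")

for doc in nlp.pipe(text, batch_size=2000):
    pass  # the same filtering as in the question goes here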
Also to be clear, you're using spaCy v2?
Here's a function that makes your code faster and also cleaner:
def is_ok(tok):
    # this is much faster than `not in string.punctuation`
    if tok.is_punct: return False
    if tok.is_stop: return False
    if not tok.is_ascii: return False
    if tok.is_digit: return False
    if len(tok.text) < 2: return False
    # this gets rid of anything with a number in it
    if 'd' in tok.shape_: return False
    return True

# replace your stuff with this:
toks = [tok for tok in doc if is_ok(tok)]
Reading your zip file one line at a time should be totally fine since you're just using the tokenizer.
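A minimal sketch of that streaming approach (assuming the archive's first member is the text file; iter_lines and demo.zip are just illustrative names):
import io
import zipfile

import spacy

nlp = spacy.blank("en")

def iter_lines(zip_path):
    # yield decoded lines from the first member of the archive
    # without ever holding the whole file in memory
    with zipfile.ZipFile(zip_path) as thezip:
        with thezip.open(thezip.filelist[0], mode='r') as raw:
            for line in io.TextIOWrapper(raw, encoding='utf-8'):
                yield line.rstrip('\n')

# nlp.pipe accepts any iterable, so the generator is consumed lazily
for doc in nlp.pipe(iter_lines("demo.zip"), batch_size=2000):
    toks = [tok for tok in doc if is_ok(tok)]
    # do something useful with toks here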

Related

How to tokenize a block of text as one token in Python?

I am currently working on a genome data set that consists of many blocks of genome sequences. In previous natural language processing work I have used sent_tokenize and word_tokenize from nltk to tokenize sentences and words, but when I use these functions on the genome data set they do not tokenize the genomes correctly. The text below shows part of the data set.
>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
aatgttttatataaattgcagtatgtgtcacccaaaatagcaaaccccat
aaccaaccagattattatgatacataatgcttatatgaaactaagacatt
tcgcaacatttattttaggtatataaatacatttattgaaggaattgata
tatgccagtaaaatggtgtatttttaatttctttcaataaaaacataatt
gacattatataaaaatgaattataaaactctaagcggtggatcactcggc
tcatgggtcgatgaagaacgcagcaaactgtgcgtcatcgtgtgaactgc
aggacacatgaacatcgacattttgaacgcatatcgcagtccatgctgtt
atgtactttaattaattttatagtgctgcttggactacatatggttgagg
gttgtaagactatgctaattaagttgcttataaatttttataagcatatg
gtatattattggataaatataataatttttattcataatattaaaaaata
aatgaaaaacattatctcacatttgaatgt
>NR_004047 1
atattcaggttcatcgggcttaacctctaagcagtttcacgtactgttta
actctctattcagagttcttttcaactttccctcacggtacttgtttact
atcggtctcatggttatatttagtgtttagatggagtttaccacccactt
agtgctgcactatcaagcaacactgactctttggaaacatcatctagtaa
tcattaacgttatacgggcctggcaccctctatgggtaaatggcctcatt
taagaaggacttaaatcgctaatttctcatactagaatattgacgctcca
tacactgcatctcacatttgccatatagacaaagtgacttagtgctgaac
tgtcttctttacggtcgccgctactaagaaaatccttggtagttactttt
cctcccctaattaatatgcttaaattcagggggtagtcccatatgagttg
>NR_004052 1
When the nltk tokenizer is applied to this data set, each line of text (for example tattattatacacaatcccggggcgttctatatagttatgtataatgtat) becomes one token, which is not correct; a whole block of sequences should be treated as one token. For example, in this case the contents between >NR_004049 1 and >NR_004048 1 should be considered as one token:
>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
So each block, starting with a special header such as >NR_004049 1 and running until the next header, should be considered as one token. The problem is tokenizing this kind of data set, and I don't have any idea how to do it correctly.
I would really appreciate answers that help me solve this.
Update:
One way to solve this problem is to concatenate all lines within each block and then use the nltk tokenizer. That is, append all lines between >NR_004049 1 and >NR_004048 1 to make one string out of several lines, so that the nltk tokenizer will treat it as one token. Can anyone help me concatenate the lines within each block?
You just need to concatenate the lines between two ids apparently. There should be no need for nltk or any tokenizer, just a bit of programming ;)
patterns = {}
with open('data', "r") as f:
    id = None
    current = ""
    for line0 in f:
        line = line0.rstrip()
        if line.startswith('>'):  # new pattern (startswith avoids an IndexError on empty lines)
            if len(current) > 0:
                # print("adding " + id + " " + current)
                patterns[id] = current
                current = ""
            # to find the next id:
            tokens = line.split(" ")
            id = tokens[0][1:]
        else:  # continuing pattern
            current = current + line
    if len(current) > 0:
        patterns[id] = current
        # print("adding " + id + " " + current)

# do whatever with the patterns:
for id, pattern in patterns.items():
    print(f"{id}\t{pattern}")

MemoryError: Unable to allocate 1.83 MiB for an array with shape (5004, 96) and data type int32

When I try to process a huge CSV file I get a MemoryError: Unable to allocate 1.83 MiB for an array with shape (5004, 96) and data type int32. The error happens at:
processed_texts = [text for text in nlp.pipe(str(tokenized_texts),
                                             disable=["ner", "parser"])]
Will this be fixed if I use multiple threads? If so, does anybody have some examples in Python? I'm coming from Java.
Whole script:
import pandas as pd
import spacy

df = pd.read_csv('posts_result.csv')
df_sample = df.sample(frac=0.1, replace=False, random_state=1)

""" DATA EXPLORATION """
text_test = df_sample.post.tolist()

# Start the tokenization
def tokenize_hashtag(text):
    punctuations = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~'
    for punctuation in punctuations:
        text = str(text).replace(punctuation, '')
    text = text.lower()
    text = text.split()
    return text

tokenized_texts = [tokenize_hashtag(text) for text in text_test]

nlp = spacy.load("en_core_web_sm")
processed_texts = [text for text in nlp.pipe(str(tokenized_texts),
                                             disable=["ner", "parser"])]

df_sample['processed'] = tokenized_texts
tokenized_texts = [[word.text for word in text
                    if (word.pos_ == 'NOUN' or word.pos_ == 'VERB' or word.pos_ == 'PROPN')
                    and not len(word.text) > 12
                    and not word.is_punct and not word.is_stop
                    and not word.text == 'X' and not word.text == '#Name']
                   for text in processed_texts]
You haven't really provided enough information here, but it looks like you can't hold all the spaCy docs in memory.
A very simple workaround for this would be to split your CSV file up and process it one chunk at a time.
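A minimal sketch of that chunked approach (assuming the same posts_result.csv file and post column; the chunk size is arbitrary):
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

# read_csv with chunksize yields DataFrames of at most 10,000 rows,
# so only one slice of the file is in memory at a time
for chunk in pd.read_csv('posts_result.csv', chunksize=10_000):
    texts = chunk.post.astype(str).tolist()
    for doc in nlp.pipe(texts, disable=["ner", "parser"]):
        pass  # extract what you need from each doc here, then discard it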
Another thing you can do, since it looks like you're just saving some words, is to avoid saving the docs by changing your for loop a bit.
nlp = spacy.load("en_core_web_sm")

def keep_word(word):
    if word.pos_ not in ("NOUN", "VERB", "PROPN"):
        return False
    if word.text == "#Name":
        return False
    return True

out = []
# pass the raw post strings here rather than str(tokenized_texts);
# see the note below about what str() does to a list of tokens
for doc in nlp.pipe(text_test, disable=["ner", "parser"]):
    out.append([ww.text for ww in doc if keep_word(ww)])
This way you'll just keep the strings you want and not the docs, so it should reduce memory usage.
A couple of other comments about your code...
Whatever you're trying to do with the hashtag function, it's not working. If you call str(text.split()), the output is really weird - it turns I like cheese into ['I', 'like', 'cheese'] - and it will cause spaCy to give you nonsense output. I recommend just not using that function; spaCy expects to deal with punctuation.
You also seem to be using spaCy to remove words based on part of speech (for the most part), but that's generally not a good idea - modern text processing doesn't need that kind of prefiltering. It was still common practice about 15 years ago, but you should be able to give whole sentences to any reasonable model, and it will do better than it would with overly filtered text.

Calculate a measure between keywords and each word of a textfile

I have two .txt files: one contains 200,000 words and the second contains 100 keywords (one per line). I want to calculate the cosine similarity between each of the 100 keywords and each of my 200,000 words, and display for every keyword the 50 words with the highest score.
Here's what I did; note that BertClient is what I'm using to extract vectors:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

# Process words
with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    for keyword in keyword_file:
        vector_key = bc.encode([keyword])
        for w in words:
            vector_word = bc.encode([w])
            cosine_lib = cosine_similarity(vector_key, vector_word)
            print(cosine_lib)
This keeps running and never stops. Any idea how I can correct this?
I know nothing of Bert... but there's something fishy with the import and run. I don't think you have it installed correctly. I tried to pip install it and just run this:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print ('done importing')
and it never finished. Take a look at the docs for bert and see if something else needs to be done.
On your code: it is generally better to do ALL of the reading first, then the processing. So import both lists first, separately, and check a few values with something like:
# check first five
print(words[:5])
Also, you need to look at a different way to do your comparisons instead of the nested loops. As written, you re-encode every word in words for EVERY keyword, which is unnecessary and probably really slow. I would recommend you either use a dictionary to pair each word with its encoding, or make a list of (word, encoding) tuples if you are more comfortable with that.
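A hedged sketch of that batched idea (assuming BertClient.encode accepts a list of strings and returns one vector per string, and reusing the file names from the question):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()
with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = [line.strip() for line in keyword_file if line.strip()]

# encode each list once, instead of re-encoding inside nested loops
word_vecs = bc.encode(words)        # shape: (n_words, dim)
keyword_vecs = bc.encode(keywords)  # shape: (n_keywords, dim)

# one call gives the full keyword-by-word similarity matrix
sims = cosine_similarity(keyword_vecs, word_vecs)

# for each keyword, print the 50 most similar words
for keyword, row in zip(keywords, sims):
    top50 = np.argsort(row)[::-1][:50]
    print(keyword, [words[i] for i in top50])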
Comment me back if that doesn't make sense after you get Bert up and running.
--Edit--
Here is a chunk of code that works similar to what you want to do. There are a lot of options for how you can hold results, etc. depending on your needs, but this should get you started with "fake bert"
from operator import itemgetter

# fake bert ... just return something like length
def bert(word):
    return len(word)

# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
    return abs(x - y)

# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = keyword_file.read().split()

# encode the words and put result in dictionary
encoded_words = {}
for word in words:
    encoded_words[word] = bert(word)

encoded_keywords = {}
for word in keywords:
    encoded_keywords[word] = bert(word)

# let's use our bert conversions to find which keyword is most similar in
# length to the word
for word in encoded_words.keys():
    result = []  # make a new result set for each pass
    for kword in encoded_keywords.keys():
        similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
        # stuff the answer into a tuple that can be sorted
        result.append((word, kword, similarity))
    result.sort(key=itemgetter(2))
    print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')

Code removes stopwords but Word2vec still creates wordvector for stopword?

I have code that loads a file, strips each sentence, removes some stopwords, and returns the tokens.
So far so good: if I include a print() statement or run a simple example, I can see that the stopwords are removed. BUT...
when I run the sentences through my word2vec model, the model still creates a word vector for stopwords like 'the'. Is there an error in my code?
import gensim
from gensim.parsing.preprocessing import remove_stopwords

class Raw_Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for file in file_loads:  # list with the according file names, e.g. 'Users/file1.txt'
            with open(file, 'r', buffering=20000000, encoding='utf-8') as t:
                for sentence in tokenizer.tokenize(t.read().replace('\n', ' ').lower()):
                    sent = remove_stopwords(sentence)
                    print(sent)
                    yield gensim.utils.simple_preprocess(sent, deacc=True)
Then I run:
from multiprocessing import cpu_count

sentences = Raw_Sentences(directory)

num_features = 200
min_word_count = 2
num_workers = cpu_count()
context_size = 4
downsampling = 1e-5
seed = 2

model = gensim.models.Word2Vec(sentences,
                               sg=1,  # skip-gram
                               seed=seed,
                               workers=num_workers,
                               size=num_features,
                               min_count=min_word_count,
                               window=context_size,
                               sample=downsampling)
model.most_similar('the')
and it returns similar words. But the word 'the' should have been removed, for crying out loud...
remove_stopwords is a gensim function (from gensim.parsing.preprocessing import remove_stopwords) which takes a set of stopwords (stoplist = set(stop_words)) and removes them:
def remove_stopwords(s):
    s = utils.to_unicode(s)
    return " ".join(w for w in s.split() if w not in stoplist)
Are you sure your corpus doesn't contain any instances of 'thé'? (If it did, that might not be removed by remove_stopwords(), but then when passed through simple_preprocess(..., deacc=True) the accent-removal would convert it to plain 'the'.)
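A quick way to check that hypothesis (a small illustrative sketch, not from the original answer):
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess

sentence = "thé cat sat"
filtered = remove_stopwords(sentence)  # 'thé' is not in the stopword list, so it survives
print(filtered)                                 # -> thé cat sat
print(simple_preprocess(filtered, deacc=True))  # accents stripped -> ['the', 'cat', 'sat']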
Note also that lots of published Word2Vec work doesn't bother to remove stop words. The sample downsampling will already thin out the occurrences of any very-common words, without needing any fixed list of stop-words.
So even if your code is debugged, that entire stop-word-removal step may be an unnecessary source of complication & fragility in your code.

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down but I'm currently just doing the tallying by unique line instead of "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1

for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression
\w+\.?\w*
which basically means "one or more alphanumeric characters (including _), optionally followed by a . and then some more characters".
Note that this would also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards.
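For instance, applied to the example line from the question (a quick illustrative check):
import re

pattern = re.compile(r"\w+\.?\w*")
line = "for w in sorted(strings, key=strings.get, reverse=True):"
print(pattern.findall(line))
# -> ['for', 'w', 'in', 'sorted', 'strings', 'key', 'strings.get', 'reverse', 'True']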
Then you can use collections.Counter to do the actual counting for you:
import collections
import re

pattern = re.compile(r"\w+\.?\w*")

# here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))

for token, count in tokens.most_common(5):  # show only the top 5
    print(token, count)
Running python version 3.6.0a1 the output is this:
self 226
def 173
return 170
self.data 129
if 102
This makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that the pattern captures self.data, which fits the construct you are interested in.
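And if the number literals mentioned above turn out to be noise, they could be dropped afterwards with a small filter (an illustrative sketch, not part of the original answer):
# drop tokens that are just number literals like 42 or 7.6
filtered = {tok: n for tok, n in tokens.items()
            if not tok.replace('.', '', 1).isdigit()}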
