I have some text data in a pandas column; each value in the column is a document, and each document is several sentences long.
I want to split each document into sentences, and then for each sentence get a list of words. So if a document is 5 sentences long, I will end up with a list of lists of words of length 5.
I used a mapper function to do this and get a list of words for each sentence of a text. Here is the mapper code:
import spacy

def text_to_words(x):
    """This function converts sentences in a text to a list of words"""
    nlp = spacy.load('en')
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    return txt_to_words
Then I did this:
%%time
txt_to_words = map(text_to_words, pandas_df.log_text_cleaned)
It finished in 70 microseconds and gave me a map iterator.
Now I want to add each document's list of lists of words as a value in a new column of the same pandas DataFrame.
I can simply do this:
txt_to_words = [*map(text_to_words, pandas_df.log_text_cleaned)]
This expands the map iterator and stores the result in txt_to_words as a list of lists of words.
But this process is very slow.
I even tried looping over the map object:
txt_to_words = map(text_to_words, pandas_df.log_text_cleaned)
txt_to_words_list = []
for sent in txt_to_words:
    txt_to_words_list.append(sent)
But this is similarly slow.
Extracting the output from a map object is very slow, and I only have 67K documents in that pandas DataFrame column.
Is there a way this can be sped up?
Thanks
The direct answer to your question is that the fastest way to convert an iterator to a list is probably by calling list on it, although that may depend on the size of your lists.
However, this is not going to matter, except to an unnoticeable, barely-measurable degree.
The difference between list(m), [*m], or even an explicit for statement is a matter of microseconds at most, but your code is taking seconds. In fact, you could even eliminate almost all the work done by list by using collections.deque(m, maxlen=0) (which just throws away all of the values without allocating anything or storing them), and you still won't see a difference.
Your real problem is that the work done for each element is slow.
Calling map doesn't actually do that work. All it does is construct a lazy iterator that sets up the work to be done later. When is later? When you convert the iterator to a list (or consume it in some other way).
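You can see that laziness directly (a quick illustration, using the names from the question):

m = map(text_to_words, pandas_df.log_text_cleaned)  # returns almost instantly; nothing has run yet
first = next(m)  # only now does text_to_words execute, once, on the first document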
So, it's that text_to_words function that you need to speed up.
And there's at least one obvious candidate for how to do that:
def text_to_words(x):
    """This function converts sentences in a text to a list of words"""
    nlp = spacy.load('en')
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    return txt_to_words
You're loading an entire English tokenizer/dictionary/etc. for each document? Sure, you'll get some benefit from caching after the first time, but I'll bet it's still way too slow to do on every call.
If you were trying to speed things up by making it a local variable rather than a global (which probably won't matter, but it might), that's not the way to do it; this is:
import spacy

nlp = spacy.load('en')

def text_to_words(x, *, _nlp=nlp):
    """This function converts sentences in a text to a list of words"""
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in _nlp(x).sents]
    return txt_to_words
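With the model loaded once at import time, each call only has to run the pipeline over a single document. As a rough sketch of how the results might then be stored back into the DataFrame (the column name txt_to_words here is just a placeholder):

pandas_df['txt_to_words'] = list(map(text_to_words, pandas_df.log_text_cleaned))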
Let's say I have a pretty large text document.
I need to imitate the copy and paste operations of a text editor.
More concretely,
I want to write two functions copy(i,j) and paste(i), where i and j represent indices of characters in the text document.
Now, I understand that normal string slicing creates a new string object every time, so doing something like

copy_text = text[i:j]
self.text = text[:i] + copy_text + text[j:]

will end up creating a lot of overhead from new string objects, given how often copy and paste happen in a text editor.
How do I do it? Is it even possible?
I've looked into memoryview, in which a buffer is used, which may take less time to execute given that it creates a zero-copy view of the original object. However, I want to do it algorithmically and not play around with how strings are stored.
I've been thinking along the lines of using an array to store the string and a B+ tree to store pointers into that string, but I haven't been able to really materialize anything.
Look forward to your comments. Thanks.
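For what it's worth, here is one way the "array plus pointers" idea could be materialized: a piece-table-style sketch (the PieceTable class and its method names are invented for this illustration). The characters live in one immutable buffer, and copy/paste manipulate only (start, end) index pairs, so no string data is copied until the full text is requested:

class PieceTable:
    def __init__(self, text):
        self.buffer = text                 # characters are stored once and never moved
        self.pieces = [(0, len(text))]     # (start, end) spans into self.buffer
        self.clipboard = []

    def _locate(self, i):
        # Return the index of the piece starting at character position i,
        # splitting an existing piece in two if i falls inside it.
        pos = 0
        for idx, (s, e) in enumerate(self.pieces):
            if pos + (e - s) > i:
                off = i - pos
                if off:
                    self.pieces[idx:idx + 1] = [(s, s + off), (s + off, e)]
                    idx += 1
                return idx
            pos += e - s
        return len(self.pieces)

    def copy(self, i, j):
        # Remember the spans covering [i, j); no characters are copied.
        a = self._locate(i)
        b = self._locate(j)
        self.clipboard = self.pieces[a:b]

    def paste(self, i):
        # Splice the remembered spans in at position i; still no copying.
        a = self._locate(i)
        self.pieces[a:a] = list(self.clipboard)

    def text(self):
        # Materialize an actual string only when it is needed.
        return "".join(self.buffer[s:e] for s, e in self.pieces)

pt = PieceTable("hello world")
pt.copy(0, 5)      # remember "hello" as a span
pt.paste(11)       # splice it in at the end
print(pt.text())   # -> "hello worldhello"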
I am trying to implement a function in Python that takes an iterable as input and loops through it to perform some operation. I was confused about how to handle different iterables (for example, lists and dictionaries cannot be looped over in exactly the same way), so I looked in Python's statistics library and found that it handles this situation like this:
def variance(data, xbar=None):
    if iter(data) is data:  # <-----(1)
        data = list(data)
    ...
After that, the code handles data as a list everywhere.
So, my questions are:
1. What is the meaning of (1)?
2. Is this the right approach, given that it makes a new list out of data every time? Couldn't they simply use the iterator to loop through the data?
iter(something) returns an iterator object that yields the elements of something. If something is already an iterator, iter() simply returns it unchanged. So
if iter(data) is data:
is a way of telling whether data is an iterator object. If it is, it converts it to a list of all the elements.
It's doing this because the code after that needs a real list of the elements. There are things you can do with a list that you can't do with an iterator, such as accessing specific elements, inserting or deleting elements, and looping over it multiple times. Iterators can only be processed sequentially, and only once.
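A quick demonstration of what the test distinguishes (illustrative only):

nums = [1, 2, 3]
print(iter(nums) is nums)   # False: a list hands out a fresh iterator object

gen = (n * n for n in nums)
print(iter(gen) is gen)     # True: a generator is its own iterator

Passing gen to a consumer would also exhaust it, after which looping over it again would yield nothing; that is exactly why variance() converts it to a list first.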
I have a .NET structure that has arrays nested within arrays. I want to create a list of members of items from a specific array within a specific array, using a list comprehension in IronPython, if possible.
Here is what I am doing now:
tag_results = [item_result for item_result in results.ItemResults if item_result.ItemId == tag_id][0]
tag_vqts = [vqt for vqt in tag_results.VQTs]
tag_timestamps = [vqt.TimeStamp for vqt in tag_vqts]
So, get the single item result from the results array which matches my condition, then get the vqts arrays from those item results, THEN get all the timestamp members for each VQT in the vqts array.
Is wanting to do this in a single statement overkill? Later on, the timestamps are used in this manner:
vqts_to_write = [vqt for vqt in resampled_vqts if not vqt.TimeStamp in tag_timestamps]
I am not sure if a generator would be appropriate, since I am not really looping through them; I just want a list of all the timestamps for all the item results for this item/tag so that I can test membership in the list.
I have to do this multiple times for different contexts in my script, so I was just wondering if I am doing this in an efficient and pythonic manner. I am refactoring this into a method, which got me thinking about making it easier.
FYI, this is IronPython 2.6, embedded in a fixed environment that does not allow the use of numpy, pandas, etc. It is safe to assume I need a python 2.6 only solution.
My main question is:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Two! My two main questions are:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Is a generator appropriate for testing membership in a list?
Three! My three questions are... Amongst my questions are such diverse queries as...I'll come in again...
(it IS python...)
tag_results = [...][0] builds a whole new list just to get one item. This is what next() on a generator expression is for:
next(item_result for item_result in results.ItemResults if item_result.ItemId == tag_id)
which only iterates just far enough to get the first item.
You can inline that, but I'd keep that as a separate expression for readability.
The remainder is easily put into a couple of expressions:
tag_results = next(item_result for item_result in results.ItemResults
                   if item_result.ItemId == tag_id)
tag_timestamps = [vqt.TimeStamp for vqt in tag_results.VQTs]
I'd make that a set if you only need to do membership testing:
tag_timestamps = set(vqt.TimeStamp for vqt in tag_results.VQTs)
Sets allow for constant time membership tests; testing against a list takes linear time as the whole list could end up being scanned for each such test.
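With tag_timestamps as a set, the membership test from the question keeps the same shape but now runs in constant time per element (not in is also slightly more idiomatic than not ... in):

vqts_to_write = [vqt for vqt in resampled_vqts if vqt.TimeStamp not in tag_timestamps]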
I have boilerplate code that performs my SQL queries and returns the results (this is based on working code written years ago by someone else). Generally, that code returns a list of tuples, which is totally fine for what I need.
However, if there's only one result, the code returns a single tuple instead, and breaks code that expects to loop through a list of tuples.
I need an easy way to convert the tuple into the first item of a list, so I can use it in my code that expects to loop through lists.
What's the most straightforward way to do this in a single line of code?
I figured there must be a straightforward way to do this, and there is. If my result set is called rows:
if not isinstance(rows, list):
    rows = [rows]
I don't know if this is the most Pythonic construction, or if there's a way of combining the isinstance check and rows = [rows] into a single statement.
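For what it's worth, a conditional expression does fold it into one statement; whether that reads better is a matter of taste:

rows = rows if isinstance(rows, list) else [rows]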
I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets with the dict's keys.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range;
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary, if they exist. If an element doesn't exist, you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
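Dropped into the loop from the question, that would look like:

for block in self.blocks:
    count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())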
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.