i'm new to graphlab and python trying to work on an assignment,the question is do a sentiment analysis on selected words from that i'm supposed to create a new column for each of the selected words in the products matrix and the entry is the number of times such word occurs, so I created a function for the word "wordCount_select"
import graphlab
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
function
def wordCount_select(wc,selectedWord):
if selectedWord in wc:
return wc[selectedWord]
else:
return 0
for word in selected_words:
products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
but im getting this error
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-5-494af1bfc2ab> in <module>()
7
8 for word in selected_words:
----> 9 products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
C:\Users\elginelijahsoft\Anaconda2\envs\dato-env\lib\site-packages\graphlab\data_structures\sarray.pyc in apply(self, fn, dtype, skip_undefined, seed)
1699
1700 with cython_context():
-> 1701 return SArray(_proxy=self.__proxy__.transform(fn, dtype, skip_undefined, seed))
1702
1703
C:\Users\elginelijahsoft\Anaconda2\envs\dato-env\lib\site-packages\graphlab\cython\context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Cannot evaluate lambda. Lambda workers cannot not start.
any ideas what im doing wrong and why the lambda workers cannot start
I also had the same issue during the Machine learning course. I have the latest Graphlab
I suspect this is a memory issue (not enough memory to load that data set and apply the lambda function). After closing most of the other stuff (browsers ...) the command ran fine.
Related
I am trying to generate the summary of a large text file using Gensim Summarizer.
I am getting memory error. Have been facing this issue since sometime, any help
would be really appreciated. feel free to ask for more details.
from gensim.summarization.summarizer import summarize
file_read =open("xxxxx.txt",'r')
Content= file_read.read()
def Summary_gen(content):
print(len(Content))
summary_r=summarize(Content,ratio=0.02)
print(summary_r)
Summary_gen(Content)
The length of the document is:
365042
Error messsage:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-6-a91bd71076d1> in <module>()
10
11
---> 12 Summary_gen(Content)
<ipython-input-6-a91bd71076d1> in Summary_gen(content)
6 def Summary_gen(content):
7 print(len(Content))
----> 8 summary_r=summarize(Content,ratio=0.02)
9 print(summary_r)
10
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize(text, ratio, word_count, split)
428 corpus = _build_corpus(sentences)
429
--> 430 most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1)
431
432 # If couldn't get important docs, the algorithm ends.
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize_corpus(corpus, ratio)
367 return []
368
--> 369 pagerank_scores = _pagerank(graph)
370
371 hashable_corpus.sort(key=lambda doc: pagerank_scores.get(doc, 0), reverse=True)
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in pagerank_weighted(graph, damping)
57
58 """
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
60 probability_matrix = build_probability_matrix(graph)
61
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in build_adjacency_matrix(graph)
92 neighbors_sum = sum(graph.edge_weight((current_node, neighbor)) for neighbor in graph.neighbors(current_node))
93 for j in xrange(length):
---> 94 edge_weight = float(graph.edge_weight((current_node, nodes[j])))
95 if i != j and edge_weight != 0.0:
96 row.append(i)
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in edge_weight(self, edge)
255
256 """
--> 257 return self.get_edge_properties(edge).setdefault(self.WEIGHT_ATTRIBUTE_NAME, self.DEFAULT_WEIGHT)
258
259 def neighbors(self, node):
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in get_edge_properties(self, edge)
404
405 """
--> 406 return self.edge_properties.setdefault(edge, {})
407
408 def add_edge_attributes(self, edge, attrs):
MemoryError:
I have tried looking up for this error on the internet, but, couldn't find a workable solution to this.
From the logs, it looks like the code builds an adjacency matrix
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
This probably tries to create a huge adjacency matrix with your 365042 documents which cannot fit in your memory(i.e., RAM).
You could try:
Reducing the document size to fewer files (maybe start with 10000)
and check if it works
Try running it on a system with more RAM
Did you try to use word_count argument instead of ratio?
If the above still doesn't solve the problem, then that's because of gensim's implementation limitations. The only way to use gensim if you still OOM errors is to split documents. That also will speed up your solution (and if the document is really big, it shouldn't be a problem anyway).
What's the problem with summarize:
gensim's summarizer uses TextRank by default, an algorithm that uses PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if your graph is too big.
BTW is the document length measured in words, or characters?
I'm new to python and I'm performing a basic EDA analysis on two similar SFrames. I have a dictionary as two of my columns and I'm trying to find out if the max values of each dictionary are the same or not. In the end I want to sum up the Value_Match column so that I can know how many values match but I'm getting a nasty error and I haven't been able to find the source. The weird thing is I have used the same methodology for both the SFrames and only one of them is giving me this error but not the other one.
I have tried calculating max_func in different ways as given here but the same error has persisted : getting-key-with-maximum-value-in-dictionary
I have checked for any possible NaN values in the column but didn't find any of them.
I have been stuck on this for a while and any help will be much appreciated. Thanks!
Code:
def max_func(d):
v=list(d.values())
k=list(d.keys())
return k[v.index(max(v))]
sf['Max_Dic_1'] = sf['Dic1'].apply(max_func)
sf['Max_Dic_2'] = sf['Dic2'].apply(max_func)
sf['Value_Match'] = sf['Max_Dic_1'] == sf['Max_Dic_2']
sf['Value_Match'].sum()
Error :
RuntimeError Traceback (most recent call last)
<ipython-input-70-f406eb8286b3> in <module>()
----> 1 x = sf['Value_Match'].sum()
2 y = sf.num_rows()
3
4 print x
5 print y
C:\Users\rakesh\Anaconda2\lib\site-
packages\graphlab\data_structures\sarray.pyc in sum(self)
2216 """
2217 with cython_context():
-> 2218 return self.__proxy__.sum()
2219
2220 def mean(self):
C:\Users\rakesh\Anaconda2\lib\site-packages\graphlab\cython\context.pyc in
__exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let
exception propagate
RuntimeError: Runtime Exception. Exception in python callback function
evaluation:
ValueError('max() arg is an empty sequence',):
Traceback (most recent call last):
File "graphlab\cython\cy_pylambda_workers.pyx", line 426, in
graphlab.cython.cy_pylambda_workers._eval_lambda
File "graphlab\cython\cy_pylambda_workers.pyx", line 169, in
graphlab.cython.cy_pylambda_workers.lambda_evaluator.eval_simple
File "<ipython-input-63-b4e3c0e28725>", line 4, in max_func
ValueError: max() arg is an empty sequence
In order to debug this problem, you have to look at the stack trace. On the last line we see:
File "<ipython-input-63-b4e3c0e28725>", line 4, in max_func
ValueError: max() arg is an empty sequence
Python thus says that you aim to calculate the maximum of a list with no elements. This is the case if the dictionary is empty. So in one of your dataframes there is probably an empty dictionary {}.
The question is what to do in case the dictionary is empty. You might decide to return a None into that case.
Nevertheless the code you write is too complicated. A simpler and more efficient algorithm would be:
def max_func(d):
if d:
return max(d,key=d.get)
else:
# or return something if there is no element in the dictionary
return None
I am trying to use a simple apply on s frame full of data. This is for a simple data transform on one of the columns applying a function that takes a text input and splits it into a list. Here is the function and its call/output:
In [1]: def count_words(txt):
count = Counter()
for word in txt.split():
count[word]+=1
return count
In [2]: products.apply(lambda x: count_words(x['review']))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-85338326302c> in <module>()
----> 1 products.apply(lambda x: count_words(x['review']))
C:\Anaconda3\envs\dato-env\lib\site-packages\graphlab\data_structures\sframe.pyc in apply(self, fn, dtype, seed)
2607
2608 with cython_context():
-> 2609 return SArray(_proxy=self.__proxy__.transform(fn, dtype, seed))
2610
2611 def flat_map(self, column_names, fn, column_types='auto', seed=None):
C:\Anaconda3\envs\dato-env\lib\site-packages\graphlab\cython\context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Unable to evaluate lambdas. Lambda workers did not start.
When I run my code I get that error. The s frame (df) is only 10 by 2 so there should be no overload coming from there. I don't know how to fix this issue.
If you're using GraphLab Create, there is actually a built-in tool for doing this, in the "text analytics" toolkit. Let's say I have data like:
import graphlab
products = graphlab.SFrame({'review': ['a portrait of the artist as a young man',
'the sound and the fury']})
The easiest way to count the words in each entry is
products['counts'] = graphlab.text_analytics.count_words(products['review'])
If you're using the sframe package by itself, or if you want to do a custom function like the one you described, I think the key missing piece in your code is that the Counter needs to be converted into a dictionary in order for the SFrame to handle the output.
from collections import Counter
def count_words(txt):
count = Counter()
for word in txt.split():
count[word] += 1
return dict(count)
products['counts'] = products.apply(lambda x: count_words(x['review']))
For anyone who has come across this issue while using graphlab here is the the discussion thread on the issue on dato support:
http://forum.dato.com/discussion/1499/graphlab-create-using-anaconda-ipython-notebook-lambda-workers-did-not-start
Here is the code that can be run to provide a case by case basis for this issue.
After starting ipython or ipython notebook in the Dato/Graphlab environment, but before importing graphlab, copy and run the following code
import ctypes, inspect, os, graphlab
from ctypes import wintypes
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
kernel32.SetDllDirectoryW.argtypes = (wintypes.LPCWSTR,)
src_dir = os.path.split(inspect.getfile(graphlab))[0]
kernel32.SetDllDirectoryW(src_dir)
# Should work
graphlab.SArray(range(1000)).apply(lambda x: x)
If this is run, the the apply function should work fine with sframe.
I have a large list of http user agent strings (taken from a pandas dataframe) that I am trying to parse using the python implementation of ua-parser. I can parse the list fine when only using a single thread, but based on some preliminary speed testing, it'd take me well over 10 hours to run the whole dataset.
I am trying to use pool.map() to decrease processing time but can't quite seem to figure out how to get it to work. I've read about a dozen 'tutorials' that I found online and have searched SO (likely a duplicate of some sort, as there are a lot of similar questions), but none of the dozens of attempts have worked for one reason or another. I'm assuming/hoping it's an easy fix.
Here is what I have so far:
from ua_parser import user_agent_parser
http_str = df['user_agents'].tolist()
def uaparse(http_str):
for i, item in enumerate(http_str):
return user_agent_parser.Parse(http_str[i])
pool = mp.Pool(processes=10)
parsed = pool.map(uaparse, range(0,len(http_str))
Right now I'm seeing the following error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-701fbf58d263> in <module>()
7
8 pool = mp.Pool(processes=10)
----> 9 results = pool.map(uaparse, range(0,len(http_str)))
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
249 '''
250 assert self._state == RUN
--> 251 return self.map_async(func, iterable, chunksize).get()
252
253 def imap(self, func, iterable, chunksize=1):
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
TypeError: 'int' object is not iterable
Thanks in advance for any assistance/direction you can provide.
It seems like all you need is:
http_str = df['user_agents'].tolist()
pool = mp.Pool(processes=10)
parsed = pool.map(user_agent_parser.Parse, http_str)
Using 64-bit Python 3.3.1 and 32GB RAM and this function to generate target expression 1+1/(2+1/(2+1/...)):
def sqrt2Expansion(limit):
term = "1+1/2"
for _ in range(limit):
i = term.rfind('2')
term = term[:i] + '(2+1/2)' + term[i+1:]
return term
I'm getting MemoryError when calling:
simplify(sqrt2Expansion(100))
Shorter expressions work fine, e.g:
simplify(sqrt2Expansion(50))
Is there a way to configure SymPy to complete this calculation? Below is the error message:
MemoryError Traceback (most recent call last)
<ipython-input-90-07c1e2de29d1> in <module>()
----> 1 simplify(sqrt2Expansion(100))
C:\Python33\lib\site-packages\sympy\simplify\simplify.py in simplify(expr, ratio, measure)
2878 from sympy.functions.special.bessel import BesselBase
2879
-> 2880 original_expr = expr = sympify(expr)
2881
2882 expr = signsimp(expr)
C:\Python33\lib\site-packages\sympy\core\sympify.py in sympify(a, locals, convert_xor, strict, rational)
176 try:
177 a = a.replace('\n', '')
--> 178 expr = parse_expr(a, locals or {}, rational, convert_xor)
179 except (TokenError, SyntaxError):
180 raise SympifyError('could not parse %r' % a)
C:\Python33\lib\site-packages\sympy\parsing\sympy_parser.py in parse_expr(s, local_dict, rationalize, convert_xor)
161
162 code = _transform(s.strip(), local_dict, global_dict, rationalize, convert_xor)
--> 163 expr = eval(code, global_dict, local_dict) # take local objects in preference
164
165 if not hit:
MemoryError:
EDIT:
I wrote a version using sympy expressions instead of strings:
def sqrt2Expansion(limit):
x = Symbol('x')
term = 1+1/x
for _ in range(limit):
term = term.subs({x: (2+1/x)})
return term.subs({x: 2})
It runs better: sqrt2Expansion(100) returns valid result, but sqrt2Expansion(200) produces RuntimeError with many pages of traceback and hangs up IPython interpreter with plenty of system memory left unused. I created new question Long expression crashes SymPy with this issue.
SymPy is using eval along the path to turn your string into a SymPy object, and eval uses the built-in Python parser, which has a maximum limit. This isn't really a SymPy issue.
For example, for me:
>>> eval("("*100+'3'+")"*100)
s_push: parser stack overflow
Traceback (most recent call last):
File "<ipython-input-46-1ce3bf24ce9d>", line 1, in <module>
eval("("*100+'3'+")"*100)
MemoryError
Short of modifying MAXSTACK in Parser.h and recompiling Python with a different limit, probably the best way to get where you're headed is to avoid using strings in the first place. [I should mention that the PyPy interpreter can make it up to ~1100 for me.]