I am trying to use a simple apply on s frame full of data. This is for a simple data transform on one of the columns applying a function that takes a text input and splits it into a list. Here is the function and its call/output:
In [1]: def count_words(txt):
count = Counter()
for word in txt.split():
count[word]+=1
return count
In [2]: products.apply(lambda x: count_words(x['review']))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-85338326302c> in <module>()
----> 1 products.apply(lambda x: count_words(x['review']))
C:\Anaconda3\envs\dato-env\lib\site-packages\graphlab\data_structures\sframe.pyc in apply(self, fn, dtype, seed)
2607
2608 with cython_context():
-> 2609 return SArray(_proxy=self.__proxy__.transform(fn, dtype, seed))
2610
2611 def flat_map(self, column_names, fn, column_types='auto', seed=None):
C:\Anaconda3\envs\dato-env\lib\site-packages\graphlab\cython\context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Unable to evaluate lambdas. Lambda workers did not start.
When I run my code I get that error. The s frame (df) is only 10 by 2 so there should be no overload coming from there. I don't know how to fix this issue.
If you're using GraphLab Create, there is actually a built-in tool for doing this, in the "text analytics" toolkit. Let's say I have data like:
import graphlab
products = graphlab.SFrame({'review': ['a portrait of the artist as a young man',
'the sound and the fury']})
The easiest way to count the words in each entry is
products['counts'] = graphlab.text_analytics.count_words(products['review'])
If you're using the sframe package by itself, or if you want to do a custom function like the one you described, I think the key missing piece in your code is that the Counter needs to be converted into a dictionary in order for the SFrame to handle the output.
from collections import Counter
def count_words(txt):
count = Counter()
for word in txt.split():
count[word] += 1
return dict(count)
products['counts'] = products.apply(lambda x: count_words(x['review']))
For anyone who has come across this issue while using graphlab here is the the discussion thread on the issue on dato support:
http://forum.dato.com/discussion/1499/graphlab-create-using-anaconda-ipython-notebook-lambda-workers-did-not-start
Here is the code that can be run to provide a case by case basis for this issue.
After starting ipython or ipython notebook in the Dato/Graphlab environment, but before importing graphlab, copy and run the following code
import ctypes, inspect, os, graphlab
from ctypes import wintypes
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
kernel32.SetDllDirectoryW.argtypes = (wintypes.LPCWSTR,)
src_dir = os.path.split(inspect.getfile(graphlab))[0]
kernel32.SetDllDirectoryW(src_dir)
# Should work
graphlab.SArray(range(1000)).apply(lambda x: x)
If this is run, the the apply function should work fine with sframe.
Related
After creating the clr_default:
clr_default = setup(df_rain_definitivo_one_drop_catboost_norm_fs_dropna,fold_shuffle=True, target='RainTomorrow', session_id=123)
I've tried to use the compare_models() function in Pycaret, using the following call:
best_model = compare_models()
from pycaret.classification import *
However I get the following error message:
ValueError Traceback (most recent call last)
<ipython-input-228-e1d76b68915a> in <module>()
----> 1 best_model = compare_models(n_select = 5, sort='Accuracy')
1 frames
/usr/local/lib/python3.7/dist-packages/pycaret/internal/tabular.py in compare_models(include, exclude, fold, round, cross_validation, sort, n_select, budget_time, turbo, errors, fit_kwargs, groups, verbose, display)
1954 if sort is None:
1955 raise ValueError(
-> 1956 f"Sort method not supported. See docstring for list of available parameters."
1957 )
1958
ValueError: Sort method not supported. See docstring for list of available parameters.
I've tried to call compare_models() with the sort parameter = 'Accuracy' but it didn't do any good.
Also, I'm on Google Colab
I dont get what is n_select = 5? do you want to get the top-5 models? Otherwise;
Im using your code examples:
First import pycaret
from pycaret.classification import *
Then setup,
clr_default = setup(df_rain_definitivo_one_drop_catboost_norm_fs_dropna,fold_shuffle=True, target='RainTomorrow', session_id=123)
Last use compare model method
best_model = compare_models(sort='Accuracy')
After that you can create your models then tune it.
working through a tutorial that is supposed to help students do the assignment, but I'm encountering a problem. I'm using python on a notebook project in IBM. Right now the section is simply data exploration. However this error is occurring and I'm not sure how to fix it, no one else seemed to have this problem in this class and the teacher is rather slow to help so I came here!
I tried just defining the variable before its called, but no dice either way.
All the code prior to this is just importing libraries and then parsing the data
# Infer the data type of each column and convert the data to the inferred data type
from ingest import *
eu = ExtensionUtils(sqlContext)
df_data_1 = eu.convertTypes(df_data_1)
df_data_1.printSchema()
the error I'm getting is
TypeError Traceback (most recent call last)
<ipython-input-14-33250ae79106> in <module>()
2 from ingest import *
3 eu = ExtensionUtils(sqlContext)
----> 4 df_data_1 = eu.convertTypes(df_data_1)
5 df_data_1.printSchema()
/opt/ibm/third-party/libs/python3/ingest/extension_utils.py in convertTypes(self, input_obj, dictVal)
304 """
305
--> 306 checkEnrichType_or_DataFrame("input_obj",input_obj)
307 self.logger = self._jLogger.getLogger(__name__)
308 methodname = str(inspect.stack()[0][3])
/opt/ibm/third-party/libs/python3/ingest/extension_utils.py in checkEnrichType_or_DataFrame(param, paramval)
81 if not isinstance(paramval,(EnrichType ,DataFrame)):
82 raise TypeError("%s should be a EnrichType class object or DataFrame, got type %s"
---> 83 % (str(param), type(paramval)))
84
85
TypeError: input_obj should be a EnrichType class object or DataFrame, got type <class 'NoneType'>
The solution was not with the code itself but rather with the notebook. A code snippet from a built in function needed to be inserted first before this.
I'm new to python and I'm performing a basic EDA analysis on two similar SFrames. I have a dictionary as two of my columns and I'm trying to find out if the max values of each dictionary are the same or not. In the end I want to sum up the Value_Match column so that I can know how many values match but I'm getting a nasty error and I haven't been able to find the source. The weird thing is I have used the same methodology for both the SFrames and only one of them is giving me this error but not the other one.
I have tried calculating max_func in different ways as given here but the same error has persisted : getting-key-with-maximum-value-in-dictionary
I have checked for any possible NaN values in the column but didn't find any of them.
I have been stuck on this for a while and any help will be much appreciated. Thanks!
Code:
def max_func(d):
v=list(d.values())
k=list(d.keys())
return k[v.index(max(v))]
sf['Max_Dic_1'] = sf['Dic1'].apply(max_func)
sf['Max_Dic_2'] = sf['Dic2'].apply(max_func)
sf['Value_Match'] = sf['Max_Dic_1'] == sf['Max_Dic_2']
sf['Value_Match'].sum()
Error :
RuntimeError Traceback (most recent call last)
<ipython-input-70-f406eb8286b3> in <module>()
----> 1 x = sf['Value_Match'].sum()
2 y = sf.num_rows()
3
4 print x
5 print y
C:\Users\rakesh\Anaconda2\lib\site-
packages\graphlab\data_structures\sarray.pyc in sum(self)
2216 """
2217 with cython_context():
-> 2218 return self.__proxy__.sum()
2219
2220 def mean(self):
C:\Users\rakesh\Anaconda2\lib\site-packages\graphlab\cython\context.pyc in
__exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let
exception propagate
RuntimeError: Runtime Exception. Exception in python callback function
evaluation:
ValueError('max() arg is an empty sequence',):
Traceback (most recent call last):
File "graphlab\cython\cy_pylambda_workers.pyx", line 426, in
graphlab.cython.cy_pylambda_workers._eval_lambda
File "graphlab\cython\cy_pylambda_workers.pyx", line 169, in
graphlab.cython.cy_pylambda_workers.lambda_evaluator.eval_simple
File "<ipython-input-63-b4e3c0e28725>", line 4, in max_func
ValueError: max() arg is an empty sequence
In order to debug this problem, you have to look at the stack trace. On the last line we see:
File "<ipython-input-63-b4e3c0e28725>", line 4, in max_func
ValueError: max() arg is an empty sequence
Python thus says that you aim to calculate the maximum of a list with no elements. This is the case if the dictionary is empty. So in one of your dataframes there is probably an empty dictionary {}.
The question is what to do in case the dictionary is empty. You might decide to return a None into that case.
Nevertheless the code you write is too complicated. A simpler and more efficient algorithm would be:
def max_func(d):
if d:
return max(d,key=d.get)
else:
# or return something if there is no element in the dictionary
return None
i'm new to graphlab and python trying to work on an assignment,the question is do a sentiment analysis on selected words from that i'm supposed to create a new column for each of the selected words in the products matrix and the entry is the number of times such word occurs, so I created a function for the word "wordCount_select"
import graphlab
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
function
def wordCount_select(wc,selectedWord):
if selectedWord in wc:
return wc[selectedWord]
else:
return 0
for word in selected_words:
products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
but im getting this error
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-5-494af1bfc2ab> in <module>()
7
8 for word in selected_words:
----> 9 products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
C:\Users\elginelijahsoft\Anaconda2\envs\dato-env\lib\site-packages\graphlab\data_structures\sarray.pyc in apply(self, fn, dtype, skip_undefined, seed)
1699
1700 with cython_context():
-> 1701 return SArray(_proxy=self.__proxy__.transform(fn, dtype, skip_undefined, seed))
1702
1703
C:\Users\elginelijahsoft\Anaconda2\envs\dato-env\lib\site-packages\graphlab\cython\context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Cannot evaluate lambda. Lambda workers cannot not start.
any ideas what im doing wrong and why the lambda workers cannot start
I also had the same issue during the Machine learning course. I have the latest Graphlab
I suspect this is a memory issue (not enough memory to load that data set and apply the lambda function). After closing most of the other stuff (browsers ...) the command ran fine.
I'm trying to write a function to be executed in several IPython engines. The function takes a pandas Series as an argument. Each element of the Series is a string, and the whole Series constitutes a corpus for TF.IDF computation.
After reading IPython parallel documentation and some tutorials, it seems to be quite straightforward to do, and I came up with the following:
import pandas as pd
from IPython.parallel import Client
def calculemus(corpus):
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
return vectorizer.fit_transform(corpus)
review = pd.read_csv('review.csv')['text']
review = review.fillna('')
client = Client()
r = client[-1].apply(calculemus, review).get()
BUT I got this error instead:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)/xxx/site-packages/IPython/zmq/serialize.pyc in unpack_apply_message(bufs, g, copy)
154 sa.data = m.bytes
155
--> 156 args = uncanSequence(map(unserialize, sargs), g)
157 kwargs = {}
158 for k in sorted(skwargs.iterkeys()):
/xxx/site-packages/IPython/utils/newserialized.pyc in unserialize(serialized)
175
176 def unserialize(serialized):
--> 177 return UnSerializeIt(serialized).getObject()
/xxx/site-packages/IPython/utils/newserialized.pyc in getObject(self)
159 buf = self.serialized.getData()
160 if isinstance(buf, (bytes, buffer, memoryview)):
--> 161 result = numpy.frombuffer(buf, dtype = self.serialized.metadata['dtype'])
162 else:
163 raise TypeError("Expected bytes or buffer/memoryview, but got %r"%type(buf))
ValueError: cannot create an OBJECT array from memory buffer
I'm not sure what the problem is, could someone enlighten me on this?
UPDATE
Apparently the error says exactly what it says. If I do this:
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
it kinda works.
So the next question is, is this a feature or a limitation of IPython?
This is a bug in IPython 0.13 that should be fixed in master. There is a special case for serializing numpy arrays that avoids copying data, and this behavior is triggered by an isinstance(numpy.ndarray) check. This was inappropriate, because isinstance catches subclasses, which includes pandas objects, but those pandas objects (and array subclasses in general) should not be treated in the same way, as metadata will be lost, and reconstruction on the other side will often fail.
PS:
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
is equivalent to
r = client[-1].apply_sync(calculemus, np.array(review, dtype=str))