Spark RDD: Using collect() on a range() object - python

I want to convert the numbers 0 to 99 into an RDD.
rd1 = range(0, 100)
test = sc.parallelize(rd1)
When I use the collect() function...
print(test.collect())
...I receive the following error message:
PicklingError: Could not pickle object as excessively deep recursion required.
According to this documentation, it's supposed to work. Can you tell me what I'm doing wrong?
Thank you very much.

In case someone else has the same problem: I solved it by selecting and executing only the lines I wanted to run.
I think other scripts that were running in parallel caused the error.

Related

PySpark: TypeError: 'str' object is not callable in dataframe operations

I am reading files from a folder in a loop and creating dataframes from these.
However, I am getting this weird error TypeError: 'str' object is not callable.
Please find the code here:
for yr in range(2014, 2018):
    cat_bank_yr = sqlCtx.read.csv(cat_bank_path+str(yr)+'_'+h1+'bank.csv000',sep='|',schema=schema)
    cat_bank_yr = cat_bank_yr.withColumn("cat_ledger",trim(lower(col("cat_ledger"))))
    cat_bank_yr = cat_bank_yr.withColumn("category",trim(lower(col("category"))))
The code runs for one iteration and then stops at the line
cat_bank_yr=cat_bank_yr.withColumn("cat_ledger",trim(lower(col("cat_ledger"))))
with the above error.
Can anyone help out?
Your code looks fine. If the error really happens on the line you mention, you probably overwrote one of the PySpark functions with a string by accident.
To check this, put the following line directly above your for loop and see whether the code runs without an error now:
from pyspark.sql.functions import col, trim, lower
Alternatively, double-check whether the code really stops at the line you mentioned, or check whether col, trim, and lower are what you expect them to be by evaluating each name on its own, like this:
col
which should return something like
<function pyspark.sql.functions._create_function.<locals>._(col)>
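Here is a sketch of how such an overwrite can produce exactly this pattern: one successful iteration, then a failure. In it, col is a hypothetical stand-in function rather than the real pyspark.sql.functions.col, and the reassignment at the end of the loop body is an assumed example of the mistake, not something taken from the question's code.

```python
def col(name):
    # Hypothetical stand-in for pyspark.sql.functions.col
    return "Column<" + name + ">"

results = []
errors = []
for yr in range(2014, 2016):
    try:
        # Works only while `col` still refers to the function
        results.append((yr, col("cat_ledger")))
    except TypeError as exc:
        errors.append((yr, str(exc)))
    # Assumed mistake: reusing the name `col` for a plain string
    col = "cat_ledger"

print(results)
print(errors)
```

The first pass succeeds because the name is rebound only at the end of the loop body; every later pass sees a str where the function used to be.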
In the import section use:
from pyspark.sql import functions as F
Then qualify every function with F wherever you use it (F.col, F.trim, F.lower), so your code would be:
# at the top of the file
from pyspark.sql import functions as F
for yr in range(2014, 2018):
    cat_bank_yr = sqlCtx.read.csv(cat_bank_path+str(yr)+'_'+h1+'bank.csv000',sep='|',schema=schema)
    cat_bank_yr = cat_bank_yr.withColumn("cat_ledger",F.trim(F.lower(F.col("cat_ledger"))))
    cat_bank_yr = cat_bank_yr.withColumn("category",F.trim(F.lower(F.col("category"))))
Hope this will work. Good luck.

How to call python function from scala?

Let's say I have an entity X with some fields.
I have a Python function that expects pandas DataFrames of entity X. It does some calculations, updates one of the fields of X using a machine learning algorithm (sklearn), and finally returns the updated DataFrames of X.
On the other side, I have Scala code that has a List of entity X. I also have to calculate the value of that field of X in the same way as the Python counterpart does.
It would be extremely difficult to reproduce the same machine learning logic in Scala.
The easiest way I see would be to reuse or call that Python function somehow and get the updated List of X.
Is there some way to achieve this?
Any insights or suggestions on how to solve this would be extremely helpful.
The following code will help you call the python script from scala itself.
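import sys.process._

def callPython(): Unit = {
  // Buffers that collect the script's output streams
  val stdout = new StringBuilder
  val stderr = new StringBuilder
  // `!` runs the command and returns its exit code
  val result = "python /fullpath/mypythonprogram.py" ! ProcessLogger(stdout append _, stderr append _)
  println(result)
  println("stdout: " + stdout)
  println("stderr: " + stderr)
}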
Try this code instead:
val result = "python /fullpath/mypythonprogram.py" !! ProcessLogger(stdout append _, stderr append _)
This is because "!" only returns the exit code of the process (0 means the script ran without error; any nonzero value means it failed), whereas "!!" returns the script's standard output as a String.
Hope it works for you.
thanks.
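For the Python side, one possible shape for mypythonprogram.py is sketched below. It assumes the Scala caller sends the records as a JSON list on standard input and reads the updated list back from standard output. The field name "score" and the helper update_records are hypothetical, and the increment is only a placeholder for the real sklearn-based calculation.

```python
import json
import sys


def update_records(json_text):
    """Parse a JSON list of records, update one field, return JSON text."""
    records = json.loads(json_text)
    for record in records:
        # Placeholder for the real model-based calculation (assumption)
        record["score"] = record.get("score", 0) + 1
    return json.dumps(records)


def main():
    # Read all of stdin and write the updated records to stdout,
    # so the calling process can capture them as a string
    sys.stdout.write(update_records(sys.stdin.read()))
```

Calling main() under the usual `if __name__ == "__main__":` guard would turn this into a script. Keeping the logic in update_records means it can also be tested without any process plumbing.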

DataFrame.set_index returns 'str' object is not callable

I'm not looking for a solution here as I found a workaround; mostly I'd just like to understand why my original approach didn't work given that the workaround did.
I have a dataframe of 2803 rows with the default numeric key. I want to replace that with the values in column 0, namely TKR.
So I use f.set_index('TKR') and get
f.set_index('TKR')
Traceback (most recent call last):
  File "<ipython-input-4-39232ca70c3d>", line 1, in <module>
    f.set_index('TKR')
TypeError: 'str' object is not callable
So I think maybe there's some noise in my TKR column and rather than scrolling through 2803 rows I try f.head().set_index('TKR')
When that works I try f.head(100).set_index('TKR') which also works. I continue with parameters of 500, 1000, and 1500 all of which work. So do 2800, 2801, 2802 and 2803. Finally I settle on
f.head(len(f)).set_index('TKR')
which works and will handle a different size dataframe next month. I would just like to understand why this works and the original, simpler, and (I thought) by the book method doesn't.
I'm using Python 3.6 (64 bit) and Pandas 0.18.1 on a Windows 10 machine
You might have accidentally assigned a value to set_index on your DataFrame.
An example of this mistake: f.set_index = 'intended_col_name'
As a result, for the rest of your session f.set_index is a str, which is not callable, and that produces this error.
Try restarting your notebook, removing the offending line, and calling f.set_index('TKR') again.
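The effect can be reproduced without pandas. The sketch below uses a hypothetical Frame class as a stand-in for a DataFrame (it is not pandas code) and assumes that assigning to an existing attribute on a DataFrame instance behaves as it does on an ordinary Python object. If that assumption holds, it may also explain why f.head(len(f)).set_index('TKR') worked in the question: head() returns a new object that does not carry the overwritten attribute.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame; not the pandas implementation."""

    def __init__(self, rows):
        self.rows = list(rows)

    def set_index(self, key):
        return "indexed by " + key

    def head(self, n=5):
        # Returns a brand-new object, like DataFrame.head does
        return Frame(self.rows[:n])


f = Frame(range(10))
f.set_index = "TKR"  # the mistake: an instance attribute now hides the method

try:
    f.set_index("TKR")
    message = None
except TypeError as exc:
    message = str(exc)

# A fresh object from head() still has the real method
via_head = f.head(len(f.rows)).set_index("TKR")

# Deleting the instance attribute makes the method visible again
del f.set_index
restored = f.set_index("TKR")
```

Restarting the kernel has the same effect as the del statement here, because the object is rebuilt from scratch.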
I know it's been a long while, but I think some people may need the answer in the future.
What you do with f.set_index('TKR') is totally right as long as 'TKR' is a column of DataFrame f.
In other words, this is an error you should not normally see. It almost always happens because an earlier step redefined a built-in function or method (possibly set_index itself). The fix is to review your code and find where that happened.
If you are using a Jupyter notebook, restarting the kernel and running only this block should fix the problem.
I believe I have a solution for you.
I ran into the same problem and I was constructing my dataframes from a dictionary, like this:
df_beta = df['Beta']
df_returns = df['Returns']
then, trying to do df_beta.set_index(Date) would fail. My workaround was
df_beta = df['Beta'].copy()
df_returns = df['Returns'].copy()
So apparently, if you build your dataframes as a "view" of another existing dataframe, you can't set index and it will raise 'Series not callable' error. If instead you create an explicit new object copying the original dataframes, then you can call reset_index, which is what you kind of do when you compute the head.
Hope this helps, 2 years later :)
I have the same problem here.
import tushare as ts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ts.set_token('*************************************')
tspro = ts.pro_api()
gjyx = tspro.daily(ts_code='000516.SZ', start_date='20190101')
# this doesn't work
# out:'str' object is not callable
gjyx = gjyx.set_index('trade_date')
# this works
gjyx = gjyx.head(len(gjyx)).set_index('trade_date')
jupyter notebook 6.1.6, python 3.9.1, miniconda3, win10
But when I upload this ipynb to ubuntu on AWS, it works.
I once had this same issue.
This simple line of code kept throwing a TypeError: 'Series' object is not callable error again and again.
df = df.set_index('Date')
I had to shut down my kernel and restart the Jupyter notebook to fix it.

Optimising python function with numba

I am trying to speed up a python function using numba, however I cannot seem to make it compile.
The input for my function is a 27x4 array of type np.int32.
My function is:
import numba as nb
import numpy as np

@nb.jit(nopython=True)
def edge_profile(input):
    # Columns 0-2 hold the positions, column 3 holds the value
    pos = input[:,:3]
    val = input[:,3]
    centre = np.mean(pos,axis=0).astype(np.int32)
    diff = np.absolute(pos-centre).sum(axis=1)
    cell_edge = np.zeros(3)
    for i in range(3):
        # Count cells at distance i+1 from the centre whose value is 1
        idx = np.where(diff==i+1)[0]
        idy = np.where(val[idx]==1)[0]
        cell_edge[i] = len(idy)
    return cell_edge.astype(np.int32)
However, this produces an extremely large error message which I have been unable to use to diagnose the problem. I have tried specifying the input types as follows:
@nb.jit(nb.int32[:](nb.int32[:,:]))
def ...
however, this produces an equally large error message.
I feel that I am probably using some function or feature that is not supported in numba, but I do not know enough about it to identify the problem. Any help would be greatly appreciated.
Numba should work OK so long as you stick to basic lists and arrays in the function you want to speed up. It appears that you are already using functions from numpy that are probably already well optimized, so it's unlikely you will see a speedup even if you did get it to work. You haven't mentioned what your OS is. Under Ubuntu 14.04 you can get it to work through some steps outlined here.
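One way to follow the "basic arrays" advice is to replace the NumPy reductions with explicit loops. The sketch below is an assumption about what could work, not a confirmed fix. It assumes the culprit is np.mean with an axis argument (which, to my knowledge, nopython mode does not accept) and that the coordinates are non-negative, so integer floor division matches the truncation in the original. The parameter name cells is mine, and the njit fallback is only there so the example also runs without Numba installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs as plain Python without Numba
    def njit(func):
        return func


@njit
def edge_profile(cells):
    n = cells.shape[0]
    # Integer centre of the positions, computed column by column
    centre = np.zeros(3, dtype=np.int64)
    for j in range(3):
        total = 0
        for i in range(n):
            total += cells[i, j]
        centre[j] = total // n
    # cell_edge[d - 1] counts cells at Manhattan distance d with value 1
    cell_edge = np.zeros(3, dtype=np.int32)
    for i in range(n):
        dist = 0
        for j in range(3):
            dist += abs(cells[i, j] - centre[j])
        if dist >= 1 and dist <= 3 and cells[i, 3] == 1:
            cell_edge[dist - 1] += 1
    return cell_edge
```

For a 27x4 input the loops are tiny, so any gain would come from avoiding temporary arrays rather than from raw arithmetic speed; whether that is noticeable would need measuring.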

Why does this Python code give Runtime Error(NZEC)?

I already read the other questions and answers but couldn't implement any of the solutions to my code. I'm still clueless about the reason why this code gives a runtime error.
I'm trying to submit the code on CodeChef, yet it gives the Runtime Error(NZEC), although the code runs flawlessly on my console for some inputs. Here's my code:
def GetSquares(base):
    if not base or base < 4:
        return 0
    else:
        x = (base - 4) - (base % 2) + 1
        return x + GetSquares(base - 4)

num_test = int(input())
for test in range(num_test):
    base = int(input())
    print (int(GetSquares(base)))
Codechef's explanation for NZEC:
NZEC stands for Non Zero Exit Code. For C users, this will be
generated if your main method does not have a return 0; statement.
Other languages like Java/C++ could generate this error if they throw
an exception.
The problem I'm trying to solve:
https://www.codechef.com/problems/TRISQ
The problem description says that the input is constrained to be < 10^4. That's 10,000! Your code will need to make 10,000/4 = 2500 recursive calls to GetSquares, which is well beyond Python's default recursion limit of 1000. In fact, it's so many that it's going to give you, fittingly, this error:
RuntimeError: maximum recursion depth exceeded
You're going to have to think of a better way to solve the problem that doesn't involve so much recursion! Because you're doing this coding challenge, I'm not going to give a solution in this answer as that would sort of defeat the purpose, but if you'd like some prodding towards an answer, feel free to ask.
The question puts a constraint on the value of 'B', which is 10000 at most. That means a lot of recursive calls, which leads to the runtime error. Try solving it using iteration.
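For reference, the suggestion to use iteration could look like the sketch below. It keeps the same arithmetic as GetSquares from the question and only replaces the recursion with a loop; the name get_squares is mine.

```python
def get_squares(base):
    """Iterative equivalent of the recursive GetSquares from the question."""
    total = 0
    # Each pass mirrors one recursive call: add the term, then shrink by 4
    while base >= 4:
        total += (base - 4) - (base % 2) + 1
        base -= 4
    return total


print(get_squares(8))
print(get_squares(10**4))
```

Because the loop uses no call stack, the largest allowed input runs in 2500 iterations without coming near the recursion limit.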
