Unpickling and encoding a string using rdd.map in PySpark - python

I need to port code from PySpark 1.3 to 2.3 (also on Python 2.7 only), and I have the following map transformation on the RDD:
import cPickle as pickle
import base64
path = "my_filename"
my_rdd = "rdd with data" # pyspark.rdd.PipelinedRDD()
# saving RDD to a file but first encoding everything
my_rdd.map(lambda line: base64.b64encode(pickle.dumps(line))).saveAsTextFile(path)
# another my_rdd.map doing the opposite of the above, fails with the same error
my_rdd = sc.textFile(path).map(lambda line: pickle.loads(base64.b64decode(line)))
When this part is run, I get the following error:
raise pickle.PicklingError(msg)
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
It looks like operations like this are no longer allowed inside the map function. Any suggestions on how to rewrite this part?
UPDATE:
Weirdly enough, just doing:
my_rdd.saveAsTextFile(path)
also fails with the same error.

Bottom line, the problem was somewhere deep in the functions doing the transformations. Easier to rewrite than debug in this case.
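For reference, here is a minimal sketch of the same round trip with the mapped functions kept free of RDD/SparkContext references, which is the usual trigger for SPARK-5063 (the helper names are illustrative):
import base64
import cPickle as pickle  # plain `pickle` on Python 3

def encode_line(line):
    # must not close over the RDD, SparkContext, or any object (e.g. self)
    # that holds a reference to them -- that is what trips SPARK-5063
    return base64.b64encode(pickle.dumps(line))

def decode_line(line):
    return pickle.loads(base64.b64decode(line))

my_rdd.map(encode_line).saveAsTextFile(path)
restored = sc.textFile(path).map(decode_line)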

Related

input file is not getting read from pd.read_csv

I'm trying to read a file stored in Google Cloud Storage from Apache Beam using pandas, but I am getting an error:
def Panda_a(self):
    import pandas as pd
    data = 'gs://tegclorox/Input/merge1.csv'
    df1 = pd.read_csv(data, names = ['first_name', 'last_name', 'age',
                                     'preTestScore', 'postTestScore'])
    return df1
ip2 = p |'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I execute the above code, the error says the path does not exist. Any idea why?
A bunch of things are wrong with this code.
You're trying to get Pandas to read a file from Google Cloud Storage, but Pandas does not support the Google Cloud Storage filesystem (as @Andrew pointed out, the documentation says the supported schemes are http, ftp, s3, and file). However, you can use the Beam FileSystems.open() API to get a file object and give that object to Pandas instead of the file path.
p | ... >> beam.Map(...): beam.Map(f) transforms every element of the input PCollection using the given function f; it can't be applied to the pipeline itself. It seems that in your case you want to simply run the Pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored']).
beam.Map(f) requires f to return a single value (or rather, if it returns a list it will interpret that list as a single value), but your code gives it a function that returns a Pandas dataframe. I strongly doubt that you want to create a PCollection containing a single element where that element is the entire dataframe; more likely, you want one element for every row of the dataframe. For that you need to use beam.FlatMap, together with df.iterrows() or something like it.
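As a rough sketch of the first three points (the field names and bucket paths are taken from the question; treat this as illustrative rather than a tested pipeline):
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
import pandas as pd

def read_rows(_element):
    # open the GCS file through Beam's filesystem layer, hand the file
    # object to pandas, then emit one dict per dataframe row
    with FileSystems.open('gs://tegclorox/Input/merge1.csv') as f:
        df = pd.read_csv(f, names=['first_name', 'last_name', 'age',
                                   'preTestScore', 'postTestScore'])
    for _, row in df.iterrows():
        yield row.to_dict()

with beam.Pipeline() as p:
    (p
     | beam.Create(['ignored'])   # bogus single-element input
     | beam.FlatMap(read_rows)
     | beam.io.WriteToText('gs://tegclorox/Output/merge1234'))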
In general, I am not sure why you would read the CSV file using Pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1 and then parse each line yourself; if you have a large amount of data, this will be a lot more efficient (and if you only have a small amount of data and do not anticipate it ever becoming large enough to exceed the capabilities of a single machine, say, if it will never be above a few GB, then Beam is the wrong tool).
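And a sketch of the ReadFromText route (this assumes the CSV really has a header line to skip; drop skip_header_lines if it does not, and parse_line is just an illustrative helper):
import apache_beam as beam

def parse_line(line):
    first_name, last_name, age, pre, post = line.split(',')
    return {'first_name': first_name, 'last_name': last_name, 'age': age,
            'preTestScore': pre, 'postTestScore': post}

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://tegclorox/Input/merge1.csv',
                            skip_header_lines=1)
     | beam.Map(parse_line)
     | beam.io.WriteToText('gs://tegclorox/Output/merge1234'))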

How to load .mat file into workspace using Matlab Engine API for Python?

I have a .mat workspace file containing 4 character variables. These variables contain paths to various folders I need to be able to cd to and from relatively quickly. Usually, when using only Matlab I can load this workspace as follows (provided the .mat file is in the current directory).
load paths.mat
Currently I am experimenting with the Matlab Engine API for Python. The Matlab help docs recommend using the following Python formula to send variables to the current workspace in the desktop app:
import matlab.engine
eng = matlab.engine.start_matlab()
x = 4.0
eng.workspace['y'] = x
a = eng.eval('sqrt(y)')
print(a)
This works well. However, the whole point of the .mat file is that it can quickly load entire sets of variables the user is comfortable with, so the above is not efficient when trying to load the workspace.
I have also tried two different variations in Python:
eng.load("paths.mat")
eng.eval("load paths.mat")
The first variation successfully loads a dict variable in Python containing all four keys and values but this does not propagate to the workspace in Matlab. The second variation throws an error:
File "", line unknown SyntaxError: Error: Unexpected MATLAB
expression.
How do I load up a workspace through the engine without having to manually do it in Matlab? This is an important part of my workflow....
You didn't specify the number of output arguments from the MATLAB engine, which is a possible reason for the error.
I would expect the error from eng.load("paths.mat") to read something like
TypeError: unsupported data type returned from MATLAB
The difference in error messages may arise from different versions of MATLAB or of the engine API.
In any case, try specifying the number of output arguments like so,
eng.load("paths.mat", nargout=0)
This was giving me fits for a while, so here are a few things to try. I was able to get this working on Matlab 2019a with Python 3.7. I had the most trouble trying to create a string and use it as an argument for load and eval/evalin, so there might be some trickiness with single or double quotes, or with needing an additional set of quotes in the string.
Make sure the MAT file is on the Matlab Path. You can use addpath and rmpath really easily with pathlib objects:
from pathlib import Path
mat_file = Path('local/path/from/cwd/example.mat').resolve()  # get absolute path
eng.addpath(str(mat_file.parent))
# Execute other commands
eng.rmpath(str(mat_file.parent))
Per dML's answer, make sure to specify nargout=0 when there are no outputs from the function, and always when calling a script. If there are one or more outputs you don't have to capture them in Python, and if there is more than one they are returned as a tuple.
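For example (a minimal illustration reusing the eng session from above; MATLAB's size() has two outputs, so requesting both returns a tuple in Python):
import matlab
rows, cols = eng.size(matlab.double([[1.0, 2.0, 3.0]]), nargout=2)
print(rows, cols)  # 1.0 3.0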
You can also turn your script into a function (just won't have access to base workspace without using evalin/assignin):
function load_example_matfile()
    evalin('base','load example.mat')
end
eng.feval('load_example_matfile')
And it does seem to work with plain vanilla eval and load as well, but if you leave off nargout=0 it either errors out or returns the contents of the file to Python directly.
Both of these work.
eng.eval('load example.mat', nargout=0)
eng.load('example.mat', nargout=0)

How do I load JSON into Couchbase Headless Server in Python?

I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json
cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print cb.server_nodes
#tempJson = json.loads(open("myData.json","r"))
try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise
print(cb.get('healthRec').value)
I know that the first commented-out line that loads the JSON is incorrect because it is expecting a string, not an actual JSON file... Can anyone help?
Thanks!
Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)
try:
    result = cb.upsert('healthRec', {'record': data})
I am looking into using cbdocloader, but this was my first step getting this to work. Thanks!
I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as input and decodes the JSON string into a dictionary (or whatever custom object you use based on the object_hook), which is why you were seeing the issue: you are passing it a file handle.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
As Kirk mentioned, though, if you have a large number of JSON documents to insert, it might be worth taking a look at cbdocloader, as it will handle all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.

cannot send pyspark output to a file in the local file system

I'm running a pyspark job on spark (single node, stand-alone) and trying to save the output in a text file in the local file system.
input = sc.textFile(inputfilepath)
words = input.flatMap(lambda x: x.split())
wordCount = words.countByValue()
wordCount.saveAsTextFile("file:///home/username/output.txt")
I get an error saying
AttributeError: 'collections.defaultdict' object has no attribute 'saveAsTextFile'
Basically, whatever I call on the 'wordCount' object, for example collect() or map(), returns the same error. The code works with no problem when the output goes to the terminal (with a for loop), but I can't figure out what is missing to send the output to a file.
The countByValue() method that you're calling returns a dictionary of word counts. This is just a standard Python dictionary, and it doesn't have any Spark methods available to it.
You can use your favorite method to save the dictionary locally.
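For example, two hedged options along those lines (paths reused from the question): either keep the counting inside Spark so that saveAsTextFile is available, or write the plain dict returned by countByValue() with ordinary Python I/O.
# option 1: stay in Spark (note saveAsTextFile writes a directory of part files)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("file:///home/username/output")

# option 2: countByValue() already returns a local dict, so write it directly
wordCount = words.countByValue()
with open("/home/username/output.txt", "w") as f:
    for word, count in wordCount.items():
        f.write("{}\t{}\n".format(word, count))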

PyFITS: hdulist.writeto()

I'm extracting extensions from a multi-extension FITS file, manipulating the data, and saving the data (with the extension's header information) to a new FITS file.
To my knowledge pyfits.writeto() does the task. However, when I give it a data parameter in the form of an array, it gives me the error:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Here is a sample of my code:
file = 'hst_11166_54_wfc3_ir_f110w_drz.fits'
hdulist = pyfits.open(dir + file)
sci = hdulist[1].data # science image data
exp = hdulist[5].data # exposure time data
sci = sci*exp # converts electrons/second to electrons
file = 'test_counts.fits'
hdulist.writeto(file,sci,clobber=True)
hdulist.close()
I appreciate any help with this. Thanks in advance.
You're confusing the HDUList.writeto method, and the writeto function.
What you're calling is a method on the HDUList object that is returned when you call pyfits.open. You can think of this object as something like a file handle to your original drizzled FITS file. You can manipulate this object in place and either write it out to a new file or save updates in place (if you open the file in mode='update').
The writeto function on the other hand is not tied to any existing file. It's just a high-level function for writing an array out to a file. In your example you could write your array of electron counts out like:
pyfits.writeto(filename, data)
This will create a single-HDU FITS file with the array data in the PRIMARY HDU.
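If you also want to keep the extension's header (as the question asks), writeto accepts a header argument; assuming the sci array and hdulist from the question:
pyfits.writeto('test_counts.fits', sci, header=hdulist[1].header, clobber=True)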
Do be aware of the admonishment at the top of this section of the docs: http://docs.astropy.org/en/v1.0.3/io/fits/index.html#convenience-functions
Functions like pyfits.writeto are there for convenience in interactive work, but they are not recommended for use in code that will be run repeatedly, as in a script. Instead, have a look at these instructions to start.
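A sketch of that non-convenience-function route, reusing the names from the question (treat it as illustrative rather than a drop-in script):
import pyfits  # astropy.io.fits in newer installations

hdulist = pyfits.open('hst_11166_54_wfc3_ir_f110w_drz.fits')
sci = hdulist[1].data * hdulist[5].data       # electrons/second -> electrons
new_hdu = pyfits.PrimaryHDU(data=sci, header=hdulist[1].header)
pyfits.HDUList([new_hdu]).writeto('test_counts.fits', clobber=True)
hdulist.close()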
It is probably because you should use hdulist.writeto(file, clobber=True). There is only one required argument:
https://pythonhosted.org/pyfits/api_docs/api_hdulists.html#pyfits.HDUList.writeto
If you give a second argument, it is used for output_verify, which should be a string, not a numpy array. This probably explains your AttributeError.
