cannot send pyspark output to a file in the local file system

I'm running a pyspark job on spark (single node, stand-alone) and trying to save the output in a text file in the local file system.
input = sc.textFile(inputfilepath)
words = input.flatMap(lambda x: x.split())
wordCount = words.countByValue()
wordCount.saveAsTextFile("file:///home/username/output.txt")
I get an error saying
AttributeError: 'collections.defaultdict' object has no attribute 'saveAsTextFile'
Basically, whatever I call on the 'wordCount' object, for example collect() or map(), returns the same error. The code works with no problem when the output goes to the terminal (with a for loop), but I can't figure out what is missing to send the output to a file.

The countByValue() method that you're calling is an action: it returns the word counts to the driver as a dictionary (a collections.defaultdict, as the error message shows), not as an RDD. It is an ordinary Python dictionary, so it has no Spark methods such as saveAsTextFile(), collect() or map().
You can use your favorite method to save the dictionary locally.
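As one possible way to do that, here is a minimal sketch that writes the counts to a tab-separated text file with plain Python. The defaultdict is built by hand as a stand-in for the result of countByValue(), and the output path is a temporary location chosen for illustration, not the path from the question.

```python
import collections
import os
import tempfile

# Stand-in for words.countByValue(): a defaultdict living on the driver.
word_count = collections.defaultdict(int)
for word in "to be or not to be".split():
    word_count[word] += 1

# Hypothetical output location; replace with the path you actually want.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Write one "word<TAB>count" line per entry, sorted by word.
with open(out_path, "w") as out:
    for word, count in sorted(word_count.items()):
        out.write("{}\t{}\n".format(word, count))

# Read the file back to check what was written.
with open(out_path) as f:
    lines = f.read().splitlines()
```

If you would rather let Spark write the output, keep the data as an RDD instead of calling countByValue(), for example words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).saveAsTextFile(...). Note that saveAsTextFile() creates a directory of part files rather than a single text file.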

Related

Read files from hdfs - pyspark

I am new to PySpark. When I execute the code below, I get an attribute error.
I am using Apache Spark 2.4.3.
t=spark.read.format("hdfs:\\test\a.txt")
t.take(1)
I expect the output to be 1, but it throws an error.
AttributeError: dataframereader object has no attribute take

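You're not using the API properly:
format() only specifies the input data source format (for example "text" or "csv"); it returns the DataFrameReader itself and does not load anything, which is why take() is missing.
Here you're reading a text file, so all you have to do is: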
t = spark.read.text("hdfs://test/a.txt")
t.collect()

Unpickling and encoding a string using rdd.map in PySpark

I need to port code from PySpark 1.3 to 2.3 (also on Python 2.7 only) and I have the following map transformation on the RDD:
import cPickle as pickle
import base64
path = "my_filename"
my_rdd = "rdd with data" # pyspark.rdd.PipelinedRDD()
# saving RDD to a file but first encoding everything
my_rdd.map(lambda line: base64.b64encode(pickle.dumps(line))).saveAsTextFile(path)
# another my_rdd.map doing the opposite of the above, fails with the same error
my_rdd = sc.textFile(path).map(lambda line: pickle.loads(base64.b64decode(line)))
When this part is run, I get the following error:
raise pickle.PicklingError(msg)
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
Looks like actions like this are not allowed anymore in the map function. Any suggestion how to potentially rewrite this part?
UPDATE:
weirdly enough, just doing:
my_rdd.saveAsTextFile(path)
also fails with the same error.

Bottom line, the problem was somewhere deep in the functions doing the transformations. Easier to rewrite than debug in this case.
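For anyone rewriting similar code, here is a rough sketch of the encode/decode pair from the question as named, module-level functions that touch nothing but pickle and base64. The function names and sample records are made up for illustration, and it uses Python 3's pickle rather than cPickle. The error message above is raised when the function shipped to the executors references an RDD (or the SparkContext), so keeping the mapped functions this self-contained is one way to rule that out.

```python
import base64
import pickle


def encode_record(record):
    # Pickle the record, then base64-encode it so it fits on one text line.
    return base64.b64encode(pickle.dumps(record)).decode("ascii")


def decode_record(line):
    # Reverse of encode_record: base64-decode, then unpickle.
    return pickle.loads(base64.b64decode(line))


# Hypothetical sample data standing in for the contents of my_rdd.
records = [("a", 1), {"k": [1, 2, 3]}, "text"]

# Plain-Python round trip; no Spark objects are involved.
encoded = [encode_record(r) for r in records]
decoded = [decode_record(line) for line in encoded]
```

With an RDD, the same functions would be passed directly, as in my_rdd.map(encode_record).saveAsTextFile(path) and sc.textFile(path).map(decode_record).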

How do you take a function from an excel spreadsheet and run it with python?

Hello, whoever is reading this.
I am having trouble taking a Python function call from an Excel spreadsheet and then running it in Python.
I'd like to stress that I actually defined my function in Python, and it's not an Excel function.
Please note that this is only my second time asking a question, and I have only ever done coding as a hobby.
Here is an example of what I want to take:
heal(2,c)
I am using xlrd to analyze data on the spreadsheet.
Here is a chunk from my code.
e = worksheet.cell(rowidx,colidx+1)
f = str(e).replace("'","")
g = f.replace("text:","")
This chunk focuses on converting the 'cell object' to a 'string' and making it look like the function required.
The end result is this:
g = heal(2,c)
My problem is that I cannot seem to activate this function.
I have also tried doing "g()" but it came up with the error message:
File "C:\Users\Alexander\Dropbox\Documents\Python\Hearthstone\Hearthstone.py", line 18, in play
g()
TypeError: 'str' object is not callable
I do not mind if you tell me a way to activate "g" or just directly run it from the spreadsheet.
Please let me know if you need any more information.
Thank you for your time.

You can use eval for this. Here, I am assuming that the function is defined in the current file.
Example:
eval('heal(2,c)')
If the function is defined in another file, for example "file", import it and call it using:
Example:
import file
eval('file.heal(2,c)')
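Keep in mind that eval runs whatever text is in the cell, so it is only appropriate for spreadsheets you trust. As a more restrictive alternative (not what the answer above proposes), here is a sketch that parses the cell text with the ast module and only dispatches to functions listed in a whitelist. The heal() body, the ALLOWED table, the run_cell() helper and the value bound to c are all hypothetical.

```python
import ast


def heal(amount, target):
    # Hypothetical stand-in for the asker's heal() function.
    return "heal {} on {}".format(amount, target)


# Only functions listed here may be called from the spreadsheet.
ALLOWED = {"heal": heal}


def run_cell(text, variables):
    """Run a cell such as 'heal(2,c)' without calling eval()."""
    node = ast.parse(text.strip(), mode="eval").body
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError("not a simple function call: %r" % text)
    if node.keywords:
        raise ValueError("keyword arguments are not supported: %r" % text)
    func = ALLOWED[node.func.id]  # KeyError if the name is not whitelisted
    args = []
    for arg in node.args:
        if isinstance(arg, ast.Name):
            # A bare name like c is looked up in the supplied variables.
            args.append(variables[arg.id])
        else:
            # Anything else must be a plain literal such as 2 or "abc".
            args.append(ast.literal_eval(arg))
    return func(*args)


result = run_cell("heal(2,c)", {"c": "player"})
```

With xlrd, the cell text can be read directly as worksheet.cell(rowidx, colidx + 1).value, which avoids stripping "text:" and quotes from str(e).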

Python JSON dictionary key error

I'm trying to collect data from a JSON file using python. I was able to access several chunks of text but when I get to the 3rd object in the JSON file I'm getting a key error. The first three lines work fine but the last line gives me a key error.
response = urllib.urlopen("http://asn.desire2learn.com/resources/D2740436.json")
data = json.loads(response.read())
title = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/dc/elements/1.1/title"][0]["value"]
description = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/dc/terms/description"][0]["value"]
topics = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/gem/qualifiers/hasChild"]
topicDesc = data["http://asn.desire2learn.com/resources/S2743916"]
Here is the JSON file I'm using. http://s3.amazonaws.com/asnstaticd2l/data/rdf/D2742493.json I went through all the braces and can't figure out why I'm getting this error. Anyone know why I might be getting this?

topics = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/gem/qualifiers/hasChild"]
I don't see the key "http://asn.desire2learn.com/resources/D2740436" anywhere in your source file. You didn't include your stack trace, but my first thought would be a typo resulting in a bad key, giving you an error like:
KeyError: "http://asn.desire2learn.com/resources/D2740436"
That means the key does not exist in the data you are referencing.

The link in your code and your AWS link go to very different files. Open up the link in your code in a web browser, and you will find that it's much shorter than the file on AWS. It doesn't actually contain the key you're looking for.

You say that you are using the linked file, in which the key "http://asn.desire2learn.com/resources/S2743916" turns up once.
However, your code is downloading a different file - one in which the key does not appear.
Try loading the file you linked in your code instead, and that key should be found.
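A quick way to confirm which file you actually loaded is to list its top-level keys before indexing into it. The sketch below uses a tiny inline JSON string invented for illustration (it is not the content of either linked file) in place of the downloaded response.

```python
import json

# Invented, trimmed-down document standing in for response.read().
raw = """
{
  "http://asn.desire2learn.com/resources/D2740436": {
    "http://purl.org/dc/elements/1.1/title": [{"value": "Example title"}]
  }
}
"""
data = json.loads(raw)

doc_key = "http://asn.desire2learn.com/resources/D2740436"
child_key = "http://asn.desire2learn.com/resources/S2743916"

# List what is really there before assuming a key exists.
top_level_keys = sorted(data)

# This key is present, so direct indexing works.
title = data[doc_key]["http://purl.org/dc/elements/1.1/title"][0]["value"]

# dict.get() returns None instead of raising KeyError for a missing key.
topic_desc = data.get(child_key)
```

If topic_desc comes back as None, the key is not in the document that was downloaded, which points at the URL rather than at the indexing code.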

Accessing Running Python program from another Python program

I have the following program running
collector.py
data = 0
while True:
    # collects data
    data = data + 1
I have another program cool.py which wants to access the current data value. How can I do this?
Ultimately, something like:
cool.py
getData()
*An Idea would be to use a global variable for data?

You can use memory mapping.
http://docs.python.org/2/library/mmap.html
For example, create a file in a temporary directory, map that file into memory in both programs, and have the collector write its data to it.
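Here is a minimal sketch of that idea, with both sides shown in one script so it can be run as is. The file location, the publish() helper and the fixed 8-byte integer layout are assumptions made for illustration. In practice publish() would live in collector.py and get_data() in cool.py, both pointing at the same agreed path.

```python
import mmap
import os
import struct
import tempfile

FMT = "<q"                    # one signed 64-bit integer, little-endian
SIZE = struct.calcsize(FMT)   # 8 bytes

# Hypothetical shared file; both programs must agree on this path.
path = os.path.join(tempfile.mkdtemp(), "collector.dat")

# The file must exist with the right size before it can be mapped.
with open(path, "wb") as f:
    f.write(b"\x00" * SIZE)


def publish(value):
    # Collector side: write the current value into the mapped file.
    # A long-running collector would keep the mapping open instead.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), SIZE) as shared:
            shared[:SIZE] = struct.pack(FMT, value)


def get_data():
    # Reader side: map the same file read-only and decode the value.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), SIZE, access=mmap.ACCESS_READ) as shared:
            return struct.unpack(FMT, shared[:SIZE])[0]


publish(41)
publish(42)
current = get_data()
```

A global variable would not work here, because each Python process has its own memory; the mapped file is what the two programs share.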
