I am new to PySpark. When I execute the code below, I get an attribute error.
I am using Apache Spark 2.4.3.
t=spark.read.format("hdfs:\\test\a.txt")
t.take(1)
I expect take(1) to return the first record, but it throws an error instead.
AttributeError: 'DataFrameReader' object has no attribute 'take'
You're not using the API properly:
format() is used to specify the input data source format you want (for example "csv", "json", "parquet", or "text"), not the path to the file.
Here, you're reading a text file, so all you have to do is:
t = spark.read.text("hdfs://test/a.txt")
t.collect()
See the related documentation on DataFrameReader.
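For reference, format() names the data source and load() takes the path; a minimal sketch of the equivalent call, assuming the same file path as above:
# format() names the source type, load() takes the path
t = spark.read.format("text").load("hdfs://test/a.txt")
t.show(1)  # now t is a DataFrame, so take(1)/show(1) work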
I am new to Python and am having trouble reading a *.npy file that somebody else saved. If I use the following commands:
import numpy as np
np.load('lat.npy')
I get the following error:
ValueError: Cannot load file containing pickled data when allow_pickle=False
So, I set allow_pickle=True:
np.load('lat.npy',allow_pickle=True)
Then, I get a different error:
OSError: Failed to interpret file 'lat.npy' as a pickle
Maybe it is relevant that I am on a PC, and the other file was written on a Mac.
Am I doing something wrong? (I am sorry if this question has been asked already.) Thank you!
I learned that my colleague's data file was written in Python 2, while I am using Python 3. Using the np.load command with the following options works:
np.load('lat.npy', allow_pickle=True, fix_imports=True, encoding='latin1')
It seems I need to set all of those options, but the 'encoding' argument seems especially important. The doc for numpy.load says about the encoding argument, "Only useful when loading Python 2 generated pickled files in Python 3, which includes npy/npz files containing object arrays."
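If you'll be reading the file repeatedly, one option is to load it once with those flags and re-save it from Python 3; this is just a sketch, and 'lat_py3.npy' is an illustrative name:
import numpy as np
# Load the Python 2 pickle, then write a clean Python 3 copy
lat = np.load('lat.npy', allow_pickle=True, fix_imports=True, encoding='latin1')
np.save('lat_py3.npy', lat)
# Note: if the array holds Python objects, reloading still needs allow_pickle=True,
# but the Python 2 fix_imports/encoding flags are no longer required.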
I am trying to use the BigQuery package to interact with Pandas DataFrames. In my scenario, I query a base table in BigQuery, use .to_dataframe(), then pass that to load_table_from_dataframe() to load it into a new table in BigQuery.
My original problem was that str(uuid.uuid4()) (for random IDs) was automatically being converted to bytes instead of a string, so I am forcing a schema instead of letting it auto-detect what to create.
Now, though, I am passing a job_config dict that contains the schema, and I get this error:
File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 903, in load_table_from_dataframe
job_config.source_format = job.SourceFormat.PARQUET
AttributeError: 'dict' object has no attribute 'source_format'
I already had PyArrow installed, and also tried installing FastParquet, but it didn't help; this didn't happen before I tried to force a schema.
Any ideas?
https://google-cloud-python.readthedocs.io/en/latest/bigquery/usage.html#using-bigquery-with-pandas
https://google-cloud-python.readthedocs.io/en/latest/_modules/google/cloud/bigquery/client.html#Client.load_table_from_dataframe
Looking into the actual package, it seems that it forces Parquet format, but like I said, I had no issue before; it only started when I tried to give a table schema.
EDIT: This only happens when I try to write to BigQuery.
Figured it out. After weeding through Google's documentation, I realized I had forgotten to put:
load_config = bigquery.LoadJobConfig()
load_config.schema = SCHEMA
Oops. I never built the config object from the BigQuery package; I was passing a plain dict instead.
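For anyone hitting the same thing, a minimal sketch of the corrected call; the dataset/table names and schema fields here are placeholders, and df is the DataFrame from .to_dataframe():
from google.cloud import bigquery

client = bigquery.Client()

# Build a LoadJobConfig object instead of passing a plain dict
SCHEMA = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]
load_config = bigquery.LoadJobConfig()
load_config.schema = SCHEMA

table_ref = client.dataset("my_dataset").table("new_table")
job = client.load_table_from_dataframe(df, table_ref, job_config=load_config)
job.result()  # wait for the load job to finish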
Issue: Error being thrown: tableausdk.Exceptions.TableauException: TableauException (40200): The system cannot find the path specified.
- OS::mkdir(CreateDirectory path="C:\PATH\Tableau-SDK\tdetmp2A0E0E5E")
I am attempting to create a Tableau extract from Oracle data using Python and the Tableau SDK.
The code seems to run correctly if the extract already exists (although the produced .tde is unreadable).
According to the Tableau community, I should be able to create an extract from any source data without the extract already existing...
Any idea why this is occurring?
tde_path = r'C:\PATH\test.tde'
tde_file = Extract(path=tde_path) ## ERROR Thrown here
The reason now seems obvious...
The error had the answer:
OS::mkdir(CreateDirectory path="C:\PATH\Tableau-SDK\tdetmp2A0E0E5E")
To solve the issue :
The directory C:\PATH\Tableau-SDK\ did not exist.
Once I created the directory, the code ran without error.
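In case it helps someone else, a small sketch of the fix; the paths are the ones from the question, and the import path assumes the Tableau SDK Python package layout:
import os
from tableausdk.Extract import Extract

tde_dir = r'C:\PATH\Tableau-SDK'
if not os.path.isdir(tde_dir):
    os.makedirs(tde_dir)  # create the directory the SDK wants for its temp files

tde_path = r'C:\PATH\test.tde'
tde_file = Extract(path=tde_path)  # no longer throws once the directory exists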
I'm running a PySpark job on Spark (single node, standalone) and trying to save the output to a text file in the local file system.
input = sc.textFile(inputfilepath)
words = input.flatMap(lambda x: x.split())
wordCount = words.countByValue()
wordCount.saveAsTextFile("file:///home/username/output.txt")
I get an error saying
AttributeError: 'collections.defaultdict' object has no attribute 'saveAsTextFile'
Basically, whatever I call on the wordCount object, for example collect() or map(), returns the same error. The code works with no problem when the output goes to the terminal (with a for loop), but I can't figure out what is missing to send the output to a file.
The countByValue() method that you're calling returns a dictionary of word counts. This is just a standard Python dictionary, and doesn't have any Spark methods available on it.
You can use your favorite method to save the dictionary locally.
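As a sketch of both options, using the variables from the question (note that saveAsTextFile() writes a directory of part files, and the output path must not already exist):
input = sc.textFile(inputfilepath)
words = input.flatMap(lambda x: x.split())

# Option 1: stay in Spark. reduceByKey() returns an RDD, which does have saveAsTextFile().
wordCount = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
wordCount.saveAsTextFile("file:///home/username/output")

# Option 2: keep countByValue() and write the resulting dict with plain Python I/O.
counts = words.countByValue()
with open("/home/username/output.txt", "w") as f:
    for word, count in counts.items():
        f.write("%s\t%d\n" % (word, count))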
I'm trying to read a file using Python and I keep getting this error:
ERROR: Line magic function `%user_vars` not found.
My code is very basic just
names = read_csv('Combined data.csv')
names.head()
I get this any time I try to read or open a file. I tried using this thread for help:
ERROR: Line magic function `%matplotlib` not found
I'm using Enthought Canopy and I have IPython version 2.4.1. I made sure to update using the IPython installation page. I'm not sure what's wrong, because it should be very simple to open/read files. I even get this error when opening text files.
EDIT:
I imported traceback and used
print(traceback.format_exc())
But all I get is None printed. I'm not sure what that means.
Looks like you are using Pandas. Try the following (assuming your CSV file is in the same directory as your script), entering it one line at a time if you are using the IPython shell:
import pandas as pd
names = pd.read_csv('Combined data.csv')
names.head()