I'm trying to read a file stored in Google Cloud Storage from an Apache Beam pipeline using pandas, but I'm getting an error.
def Panda_a(self):
    import pandas as pd
    data = 'gs://tegclorox/Input/merge1.csv'
    df1 = pd.read_csv(data, names=['first_name', 'last_name', 'age',
                                   'preTestScore', 'postTestScore'])
    return df1

ip2 = p | 'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I execute the above code, the error says the path does not exist. Any idea why?
A bunch of things are wrong with this code.
You're trying to get Pandas to read a file from Google Cloud Storage. Pandas does not support the Google Cloud Storage filesystem (as #Andrew pointed out, the documentation says the supported schemes are http, ftp, s3, and file). However, you can use Beam's FileSystems.open() API to get a file object, and give that object to Pandas instead of the file path.
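For example, a minimal sketch of that approach (the helper name is mine):

import pandas as pd
from apache_beam.io.filesystems import FileSystems

def read_csv_from_gcs(path):
    # FileSystems.open() understands gs:// paths and returns a readable
    # file object, which pandas can consume instead of a path.
    f = FileSystems.open(path)
    try:
        return pd.read_csv(f, names=['first_name', 'last_name', 'age',
                                     'preTestScore', 'postTestScore'])
    finally:
        f.close()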
p | ... >> beam.Map(...): beam.Map(f) transforms every element of the input PCollection using the given function f; it can't be applied to the pipeline itself. It seems that in your case you want to simply run the Pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored']).
beam.Map(f) requires f to return a single value (or rather: if it returns a list, that list is treated as a single value), but your code gives it a function that returns a Pandas dataframe. I strongly doubt that you want to create a PCollection containing a single element where that element is the entire dataframe - more likely, you want one element for every row of the dataframe. For that, you need beam.FlatMap, and you need df.iterrows() or something like it.
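Putting the last two points together, a hedged sketch of what the pipeline fragment could look like (it reuses the read_csv_from_gcs helper sketched above; names are illustrative):

import apache_beam as beam

def csv_rows(_ignored):
    df1 = read_csv_from_gcs('gs://tegclorox/Input/merge1.csv')
    # Emit one element per dataframe row instead of one giant dataframe.
    for _, row in df1.iterrows():
        yield ','.join(str(v) for v in row.values)

ip2 = (p
       | 'Bogus input' >> beam.Create(['ignored'])
       | 'Rows from CSV' >> beam.FlatMap(csv_rows))
ip7 = ip2 | 'Write' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')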
In general, I'm not sure why you'd read the CSV file using Pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1 and then parse each line yourself - if you have a large amount of data, this will be a lot more efficient (and if you only have a small amount of data and don't anticipate it ever growing beyond the capabilities of a single machine - say, if it will never exceed a few GB - then Beam is the wrong tool).
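For completeness, a sketch of the ReadFromText route (the parsing is deliberately naive and assumes clean, comma-separated values):

import apache_beam as beam

def parse_line(line):
    # Column order taken from the names=... list in the question.
    first_name, last_name, age, pre, post = line.split(',')
    return {'first_name': first_name, 'last_name': last_name, 'age': int(age),
            'preTestScore': int(pre), 'postTestScore': int(post)}

rows = (p
        | beam.io.ReadFromText('gs://tegclorox/Input/merge1.csv',
                               skip_header_lines=1)
        | beam.Map(parse_line))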
Related
I have a Dataflow pipeline using the Apache Beam dataframe API, and I'd like to write the csv to a GCS bucket. This is my code:
with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())
    df.to_csv(known_args.output, index=False, encoding='utf-8')
However, while I pass a GCS path to known_args.output, the csv written to GCS gets a shard suffix, like gs://path/to/file-00000-of-00001. For my project, I need the file name to be without the shard. I've read the documentation but there seem to be no options to remove the shard. I tried converting the df back to a PCollection and using WriteToText, but it doesn't work either, and it's also not a desirable solution.
It looks like you're right; in Beam 2.40 there's no way to customize the sharding of these dataframe write operations. Instead, you'll have to convert to a PCollection and use WriteToText(..., shard_name_template='').
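A rough sketch of that workaround, building on the code in the question (it writes naive comma-joined lines without a header or csv quoting; pipeline_options, known_args, column, and primary_key are as in the question):

import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill())

    # Convert the deferred dataframe back to a PCollection of rows, format each
    # row as a comma-separated line, and write a single unsharded output file.
    (to_pcollection(df)
     | 'ToCsvLine' >> beam.Map(lambda row: ','.join(str(v) for v in row))
     | 'WriteSingleFile' >> beam.io.WriteToText(known_args.output,
                                                shard_name_template=''))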
I filed BEAM-22923. When the relevant PR is merged, this fix will allow one to pass an explicit file naming parameter (which will allow customization of this as well as windowing information), e.g.
df.to_csv(
    output_dir,
    num_shards=1,
    file_naming=fileio.single_file_naming('out.csv'))
I have a URL from which I download the data (which is in JSON format) using Databricks:
url="https://tortuga-prod-eu.s3-eu-west-1.amazonaws.com/%2FNinetyDays/amzf277698d77514b44"
testfile = urllib.request.URLopener()
testfile.retrieve(url, "file.gz")
with gzip.GzipFile("file.gz", 'r') as fin:
json_bytes = fin.read()
json_str = json_bytes.decode('utf-8')
data = json.loads(json_str)
Now I want to save this data to an Azure container as a .json blob.
I have tried saving the data in a dataframe and writing the df to a mounted location, but the data is huge (GBs) and I get a spark.rpc.message.maxSize (268435456 bytes) error.
I have tried saving the data in a broadcast variable (it saves successfully), but I am not sure how to write the data from the broadcast variable to the mounted location.
Here is how I save the data in a broadcast variable:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
broadcastStates = spark.sparkContext.broadcast(data)
print(broadcastStates.value)
My questions are:
Is there any way I can write data from a broadcast variable to the Azure mounted location?
If not, please guide me on the right/best way to get this job done.
It is not possible to write a broadcast variable directly into mounted Azure Blob Storage. However, there is a way you can write the value of a broadcast variable into a file.
pyspark.broadcast.Broadcast provides two methods, dump() and load_from_path(), with which you can write and read the value of a broadcast variable. Since you have created a broadcast variable using:
broadcastStates = spark.sparkContext.broadcast(data)
Use the following syntax to write the value of the broadcast variable to a file:
<broadcast_variable>.dump(<broadcast_variable>.value, file_object)
Note: dump() pickles the value, so the file object must be opened for writing in binary mode (it needs a write() method that accepts bytes).
To read this data back from the file, you can use load_from_path() as shown below:
<broadcast_variable>.load_from_path(filename)
Note: load_from_path() takes the path of the file and returns the unpickled value.
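For example, a minimal sketch (the /dbfs/mnt/... path is a placeholder for your own mount point):

# dump() pickles the value, so open the target file for writing in binary mode.
# Note: this writes a pickled copy of the value, not a .json file.
output_path = "/dbfs/mnt/mycontainer/broadcast_data.pkl"
with open(output_path, "wb") as f:
    broadcastStates.dump(broadcastStates.value, f)

# load_from_path() opens the file itself and returns the unpickled value.
restored = broadcastStates.load_from_path(output_path)
print(restored == broadcastStates.value)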
There might also be a way to avoid the spark.rpc.message.maxSize (268435456 bytes) error. The default value of spark.rpc.message.maxSize is 128 (MiB). Refer to the following document to learn more about this setting:
https://spark.apache.org/docs/latest/configuration.html#networking
While creating a cluster in Databricks, we can configure and increase this value to avoid the error. The steps to configure the cluster are:
⦁ While creating the cluster, choose Advanced options (at the bottom).
⦁ Under the Spark tab, add the configuration key with a larger value, e.g. spark.rpc.message.maxSize 256.
⦁ Click Create cluster.
This might help in writing the dataframe directly to mounted blob storage without using broadcast variables.
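Outside Databricks (or for a brand-new session), the same setting could be passed programmatically; a sketch with an illustrative value:

from pyspark.sql import SparkSession

# 256 (MiB) is just an example; on Databricks, set this in the cluster's Spark
# config instead, because the session already exists when a notebook starts.
spark = (SparkSession.builder
         .appName('SparkByExamples.com')
         .config('spark.rpc.message.maxSize', '256')
         .getOrCreate())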
You can also try increasing the number of partitions so the dataframe is saved as multiple smaller files, which helps avoid the maxSize error. Refer to the following document about configuring Spark and partitioning:
https://kb.databricks.com/execution/spark-serialized-task-is-too-large.html
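A rough sketch of that idea (paths and the partition count are placeholders):

# Build the dataframe from the downloaded JSON, split it into more (smaller)
# partitions, and write each partition as its own file under the mount.
sdf = spark.read.json("/mnt/mycontainer/input/")
sdf.repartition(64).write.mode("overwrite").json("/mnt/mycontainer/output/")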
I am trying to create a timer-triggered Azure Function that takes data from blob storage, aggregates it, and puts the aggregates in Cosmos DB. I previously tried using the bindings in Azure Functions to use blob as input, which I was informed was incorrect (see this thread: Azure functions python no value for named parameter).
I am now using the SDK and am running into the following problem:
import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myenv/Lib/site-packages')))
import json
import pandas as pd
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accountname', account_key='accountkey')
container_name = ('container')
generator = data.list_blobs(container_name)
for blob in generator:
    print("{}".format(blob.name))
    json = json.loads(data.get_blob_to_text('container', open(blob.name)))
    df = pd.io.json.json_normalize(json)
    print(df)
This results in an error:
IOError: [Errno 2] No such file or directory: 'test.json'
I realize this might be an absolute path issue, but I'm not sure how that works with Azure Storage. Any ideas on how to circumvent this?
Made it "work" by doing the following:
for blob in generator:
    loader = data.get_blob_to_text('kvaedevdystreamanablob', blob.name, if_modified_since=delta)
    json = json.loads(loader.content)
This works for ONE json file, i.e. I only had one in storage, but when more are added I get this error:
ValueError: Expecting object: line 1 column 21907 (char 21906)
This happens even if I add if_modified_since so as to only take in one blob. Will update if I figure something out. Help always welcome.
Another update: my data comes in through Stream Analytics and then down to the blob. I have selected that the data should come in as arrays, and this is why the error occurs: when the stream is terminated, the blob doesn't immediately append ] to the EOF line of the json, so the json file isn't valid. Will try now with line-by-line output in Stream Analytics instead of arrays.
Figured it out. In the end it was quite a simple fix:
I had to make sure each json entry in the blob was less than 1024 characters, or it would create a new line, thus making reading lines problematic.
The code that iterates through each blob file, reads it, and adds it to a list is as follows:
import json
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('collection')
dataloaded = []
for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        if trackerstatusobject.strip():  # skip empty trailing lines
            dataloaded.append(json.loads(trackerstatusobject))
From this you can add it to a dataframe and do whatever you want :)
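For example, something like this (a sketch, not tested against your data):

import pandas as pd

# each element of dataloaded is one parsed status object (a dict)
df = pd.DataFrame(dataloaded)
print(df.head())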
Hope this helps if someone stumbles upon a similar problem.
I have two files on HDFS and I just want to join these two files on a column, say employee id.
I am trying to simply print the files to make sure we are reading them correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried the foreach and println functions as well, and I am not able to display the file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy; just do a collect.
You must be sure, though, that all the data fits in memory on your master.
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you should just take a sample using the take method.
# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
I am trying to import a JSON file for use in a Python editor so that I can perform analysis on the data. I am quite new to Python, so I'm not sure how I am meant to achieve this. My JSON file is full of tweet data; an example is shown here:
{"id":441999105775382528,"score":0.0,"text":"blablabla","user_id":1441694053,"created":"Fri Mar 07 18:09:33 GMT 2014","retweet_id":0,"source":"twitterfeed","geo_long":null,"geo_lat":null,"location":"","screen_name":"SevenPS4","name":"Playstation News","lang":"en","timezone":"Amsterdam","user_created":"2013-05-19","followers":463,"hashtags":"","mentions":"","following":1062,"urls":"http://bit.ly/1lcbBW6","media_urls":"","favourites_count":4514,"reply_status_id":0,"reply_user_id":0,"is_truncated":false,"is_retweet":false,"original_text":null,"status_count":4514,"description":"Tweeting the latest Playstation news!","url":null,"utc_offset":3600}
My questions:
How do I import the JSON file so that I can perform analysis on it in a Python editor?
How do I perform analysis on only a set number of the tweets (i.e. 100/200 of them instead of all of them)?
Is there a way to get rid of some of the fields such as score, user_id, created, etc. without having to go through all of my data manually?
Some of the tweets have invalid/unusable symbols within them; is there any way to get rid of those without going through them manually?
I'd use Pandas for this job, as you will not only load the json but also perform some data analysis tasks on it. Depending on the size of your json file, this should do it:
import pandas as pd
import json

# read a sample json file (replace the filename with your file location)
with open("yourfilename") as f:
    j = json.load(f)

# you might select the relevant keys before constructing the data-frame
df = pd.DataFrame.from_dict({k: [v] for k, v in j.items() if k in ["id", "retweet_count"]})

# select a subset (the first five rows)
df.iloc[:5]

# do some analysis
df.retweet_count.sum()
>>> 200
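If your file actually contains one tweet object per line (a common layout for tweet dumps), pandas can also read it directly; a hedged alternative, with illustrative column names:

import pandas as pd

# read newline-delimited JSON, then keep only a subset of rows and fields
tweets = pd.read_json("yourfilename", lines=True)
subset = tweets[["id", "text", "created"]].head(100)
print(subset)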