I have a huge gzipped csv file (55GB compressed, 660GB expanded) that I am trying to process and load into Hive in a more usable format.
The file has 2.5B records of device events, with each row giving a device_id, event_id, timestamp, and event_name for identification and then about 160 more columns, only a few of which are non-null for a given event_name.
Ultimately, I'd like to reduce the data to the few identification columns plus a JSON column that stores only the fields relevant to a given event_name, and get this into a Hive table partitioned by date, hour, and event_name (plus partitions based on metadata like timezone, but hopefully that is a minor point) so we can query the data easily for further analysis. However, the sheer size of the file is giving me trouble.
I've tried several approaches, without much success:
1. Loading the file directly into Hive and then doing 'INSERT OVERWRITE ... SELECT TRANSFORM(*) USING 'python transform.py' FROM ...' had obvious problems.
2. I split the file into multiple gzip files of 2 million rows each with bash: gzip -cd file.csv.gz | split -l 2000000 -d -a 5 --filter='gzip > split/$FILE.gz', then loaded just one of those into Hive to do the transform, but I'm still running into memory issues, even though we've tried to increase memory limits (I'd have to check which parameters we've changed).
   A. I tried a transform script that uses pandas (because it makes it easy to group by event_name and then drop unneeded columns per event name) and limited pandas to reading 100k rows at a time, but I still had to limit the Hive selection to avoid memory issues (1,000 rows worked fine; 50k rows did not).
   B. I also tried making a secondary temp table (stored as ORC) with the same columns as the csv file but partitioned by event_name, and then selecting the 2 million rows into that temp table, but that had memory issues as well.
3. I've considered starting by splitting the data up by event_name instead. I can do this using awk, but I'm not sure it would be much better than the 2-million-row files (see the rough sketch just after this list).
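For reference, the kind of event_name split I have in mind would look roughly like this in plain (single-threaded) Python; the column position and file names are just placeholders:

import csv
import gzip

# Stream the big gzip once and fan rows out to one gzipped csv per event_name.
handles = {}
with gzip.open('file.csv.gz', 'rt', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        event_name = row[3]  # placeholder position of the event_name column
        if event_name not in handles:
            out = gzip.open('split/%s.csv.gz' % event_name, 'wt', newline='')
            handles[event_name] = (out, csv.writer(out))
            handles[event_name][1].writerow(header)
        handles[event_name][1].writerow(row)

for out, _ in handles.values():
    out.close()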
I'm hoping somebody has some suggestions for how to handle the file. I'm open to pretty much any combination of bash, python/pandas, hadoop/hive (I can consider others as well) to get the job done, so long as it can be made mostly automatic (I'll have several other files like this to process too).
This is the hive transform query I'm using:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=10240;
ADD FILE transform.py;
INSERT OVERWRITE TABLE final_table
PARTITION (date,hour,zone,archetype,event_name)
SELECT TRANSFORM (*)
USING "python transform.py"
AS device_id,timestamp,json_blob,date,hour,zone,archetype,event_name
FROM (
    SELECT *
    FROM event_file_table eft
    JOIN meta_table m ON eft.device_id = m.device_id
    WHERE m.name = "device"
) t
;
And this is the python transform script:
import sys
import json
import pandas as pd

INSEP = '\t'
HEADER = None
NULL = '\\N'
OUTSEP = '\t'

cols = [...]  # the 160+ event columns, elided here
metaCols = ['device_id', 'meta_blob', 'zone', 'archetype', 'name']

# Read the rows streamed in by Hive's TRANSFORM in 100k-row chunks.
df = pd.read_csv(sys.stdin, sep=INSEP, header=HEADER,
                 names=cols + metaCols, na_values=NULL, chunksize=100000,
                 parse_dates=['timestamp'])

for chunk in df:
    chunk = chunk.drop(['name', 'meta_blob'], axis=1)
    chunk['date'] = chunk['timestamp'].dt.date
    chunk['hour'] = chunk['timestamp'].dt.hour
    for name, grp in chunk.groupby('event_name'):
        # Drop columns that are entirely null for this event_name and move
        # the identification/partition columns into the index.
        grp = grp.dropna(axis=1, how='all') \
                 .set_index(['device_id', 'timestamp', 'date', 'hour',
                             'zone', 'archetype', 'event_name'])
        # Collapse the remaining columns into one blob per row.
        grp = pd.Series(grp.to_dict(orient='records'), grp.index) \
                .to_frame().reset_index() \
                .rename(columns={0: 'json_blob'})
        grp = grp[['device_id', 'timestamp', 'json_blob',
                   'date', 'hour', 'zone', 'archetype', 'event_name']]
        grp.to_csv(sys.stdout, sep=OUTSEP, index=False, header=False)
Depending on the YARN configuration, and on what hits its limit first, I've received error messages such as:
Error: Java heap space
GC overhead limit exceeded
over physical memory limit
Update, in case anybody has similar issues.
I did eventually manage to get this done. In our case, what seemed to work best was to use awk to split the file by date and hour. This allowed us to reduce memory overhead for partitions because we could load in a few hours at a time rather than potentially having hundreds of hourly partitions (multiplied by all the other partitions we wanted) to try to keep in memory and load at once. Running the file through awk took several hours, but could be done in parallel with loading another already split file into hive.
Here's the bash/awk I used for splitting the file:
gzip -cd file.csv.gz |
tail -n +2 |
awk -F, -v MINDATE="$MINDATE" -v MAXDATE="$MAXDATE" \
'{
    if ( ($5>=MINDATE) && ($5<MAXDATE) ) {
        split($5, ts, "[: ]");
        print | "gzip > file_"ts[1]"_"ts[2]".csv.gz"
    }
}'
where, obviously, the 5th column of the file was the timestamp, and MINDATE and MAXDATE were used to filter out dates we did not care about. The split broke the timestamp on spaces and colons, so the first part was the date and the second the hour (the third and fourth would be minutes and seconds), and awk used that to direct each line to the appropriate output file.
Once the file was split by hour, I loaded several hours at a time into a temporary Hive table and proceeded with basically the same transform mentioned above.
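The load step itself was just a loop over a few hourly files at a time, roughly like this (the specific hours are placeholders, and the target here is the staging table from the query above):

import subprocess

hours = ['2017-01-01_00', '2017-01-01_01', '2017-01-01_02']  # placeholder hours
for h in hours:
    # Each split file produced by the awk step is named file_<date>_<hour>.csv.gz
    subprocess.check_call([
        'hive', '-e',
        "LOAD DATA LOCAL INPATH 'file_%s.csv.gz' INTO TABLE event_file_table;" % h
    ])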
I have a python script that currently runs on my desktop. It takes a csv file with roughly 25 million lines (maybe 15 or so columns) and performs line-by-line operations.
For each line of input, multiple output lines are produced. The results are then written line by line to a csv file; the output ends up at around 100 million lines.
Code looks something like this:
with open(outputfile,"a") as outputcsv:
with open(inputfile,"r") as input csv:
headerlist=next(csv.reader(csvfile)
for row in csv.reader(csvfile):
variable1 = row[headerlist.index("VAR1")]
variableN = row[headerlist.index("VARN")]
while calculations not complete:
do stuff #Some complex calculations are done at this point
outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object but don't think it can be done.
Is a line by line calculation like this suitable for distributed processing?
If you want to run the script directly, you can do so via spark-submit:
spark-submit --master local[*] [other parameters] path_to_your_script.py
(use --master yarn instead of local[*] to run it on the cluster)
But I would suggest going for the Spark APIs, as they are easy to use and will lower the coding overhead.
First you have to create a Spark session variable so that you can access all the Spark functions:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("SparkSessionZipsExample")
         .config("parameters", "value")
         .getOrCreate())
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like header, inferSchema, etc.:
file = spark.read.option("header", "true").csv("path to your file")
'file' will now be a pyspark dataframe.
You can now write the end output like this:
file.write.csv("output_path")
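Since your current script produces multiple output rows for each input row, that step maps naturally onto flatMap. A rough sketch (calculate_rows and the output column names are placeholders, and it assumes the csv header includes your VAR1 ... VARN columns):

def calculate_rows(row):
    # Placeholder for the complex per-line calculations; takes one input Row
    # and returns a list of output tuples (one tuple per output line).
    results = []
    results.append((row["VAR1"], row["VARN"]))
    return results

# flatMap lets a single input row produce any number of output rows.
output_rdd = file.rdd.flatMap(calculate_rows)
output_df = output_rdd.toDF(["out_col1", "out_col2"])  # placeholder column names
output_df.write.csv("output_path")  # write the results, as shown above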
Please refer to the Spark documentation for transformations and other information.
I'm trying to read a file stored in Google Cloud Storage from Apache Beam using pandas, but I'm getting an error.
def Panda_a(self):
import pandas as pd
data = 'gs://tegclorox/Input/merge1.csv'
df1 = pd.read_csv(data, names = ['first_name', 'last_name', 'age',
'preTestScore', 'postTestScore'])
return df1
ip2 = p |'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I execute the above code, the error says the path does not exist. Any idea why?
A bunch of things are wrong with this code.
1. You're trying to get Pandas to read a file from Google Cloud Storage. Pandas does not support the Google Cloud Storage filesystem (as @Andrew pointed out, the documentation says the supported schemes are http, ftp, s3, and file). However, you can use the Beam FileSystems.open() API to get a file object, and give that object to Pandas instead of the file path.
2. p | ... >> beam.Map(...): beam.Map(f) transforms every element of the input PCollection using the given function f; it can't be applied to the pipeline itself. It seems that in your case you want to simply run the Pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored']).
3. beam.Map(f) requires f to return a single value (or rather: if it returns a list, it will interpret that list as a single value), but your code is giving it a function that returns a Pandas dataframe. I strongly doubt that you want to create a PCollection containing a single element where this element is the entire dataframe; more likely, you're looking to have one element for every row of the dataframe. For that, you need to use beam.FlatMap, and you need df.iterrows() or something like it.
4. In general, I am not sure why you would read the CSV file using Pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1 and then parse each line yourself; if you have a large amount of data, this will be a lot more efficient (and if you have only a small amount of data and do not anticipate it becoming large enough to exceed the capabilities of a single machine, say if it will never be above a few GB, then Beam is the wrong tool).
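For illustration, a rough sketch of that ReadFromText route, reusing the paths from your code (the parse logic and the numeric conversions are assumptions about your data):

import apache_beam as beam

def parse_line(line):
    # Assumed layout: first_name, last_name, age, preTestScore, postTestScore
    first_name, last_name, age, pre, post = line.split(',')
    return first_name, last_name, int(age), int(pre), int(post)

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://tegclorox/Input/merge1.csv',
                                      skip_header_lines=1)  # only if there is a header row
     | 'Parse' >> beam.Map(parse_line)
     | 'Format' >> beam.Map(lambda rec: ','.join(str(f) for f in rec))
     | 'Write' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234'))

# Alternatively, to keep using Pandas, FileSystems.open() gives a readable
# file object for a gs:// path:
#   from apache_beam.io.filesystems import FileSystems
#   df = pd.read_csv(FileSystems.open('gs://tegclorox/Input/merge1.csv'), ...)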
I'm trying to spin up a Spark cluster with Databricks' CSV package so that I can create Parquet files and also do some other stuff with Spark, obviously.
This is being done within AWS EMR, so I don't think I'm putting these options in the correct place.
This is the command I want to send to the cluster as it spins up: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --master yarn --driver-memory 4g --executor-memory 2g. I've tried putting this on a Spark step - is this correct?
If the cluster spun up without that being properly installed, how do I start up PySpark with that package? Is this correct: pyspark --packages com.databricks:spark-csv_2.10:1.4.0? I can't tell whether it was installed properly or not, and I'm not sure what functions to test.
And in regard to actually using the package, is this correct for creating a Parquet file:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
#is it this option1
df.write.parquet("s3n://bucketname/nation_parquet.parquet")
#or this option2
df.select('nation_id', 'name', 'some_int', 'comment').write.parquet('com.databricks.spark.csv').save('s3n://bucketname/nation_parquet.tbl')
I'm not able to find any recent (mid 2015 and later) documentation regarding writing a Parquet file.
Edit:
Okay, now I'm not sure if I'm creating my dataframe correctly. If I try to run some select queries on it and show the result set, I don't get anything back, just an error. Here's what I tried running:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
df.registerTempTable("region2")
tcp_interactions = sqlContext.sql(""" SELECT nation_id, name, comment FROM region2 WHERE nation_id > 1 """)
tcp_interactions.show()
#get some weird Java error:
#Caused by: java.lang.NumberFormatException: For input string: "0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|"
I have two files on HDFS and I just want to join these two files on a column, say employee id.
I am trying to simply print the files to make sure we are reading them correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried the foreach and println functions as well, and I am not able to display the file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy, just do a collect.
You must be sure, however, that all the data fits in the memory of your master:
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you should just take a sample by using the take method:
# I use an exaggerated number to remind you that it is very large and won't
# fit in the memory of your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
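For your specific case, something along these lines (the second file's name, the comma delimiter, and the employee id being the first column are all assumptions):

lines = sc.textFile("hdfs://ip:8020/emp.txt")
# Print a small sample instead of collecting the whole file.
print lines.take(10)

# Join the two files on employee id by turning each into (id, line) pairs.
emp1 = lines.map(lambda l: (l.split(",")[0], l))
emp2 = sc.textFile("hdfs://ip:8020/emp2.txt").map(lambda l: (l.split(",")[0], l))
joined = emp1.join(emp2)
print joined.take(10)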