How to have a single csv file after applying partitionBy in PySpark - python

I have to first partition by "customer_group", but I also want to make sure that I have a single csv file per "customer_group". This is because it is time series data that is needed for inference, and it can't be spread across multiple files.
I tried:
datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').option("compression","gzip").save(destination_path+'/traintestcsvzippartitionocalesce')
but it creates multiple smaller files inside the customer_group/ path, with names like csv.gz0000_part_00.gz, csv.gz0000_part_01.gz, and so on.
I then tried:
datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').coalesce(1).option("compression","gzip").save(destination_path+'/traintestcsvzippartitionocalesce')
but it throws the following error:
AttributeError: 'DataFrameWriter' object has no attribute 'coalesce'
Is there a solution?
I cannot use repartition(1) or coalesce(1) directly without the partitionBy, as that creates only one file, only one worker node works at a time (serially), and it is computationally super expensive.

The repartition function also accepts column names as arguments, not only the number of partitions.
Repartitioning by the write partition column will make Spark save one file per folder.
Please note that if one of your partitions is skewed and one customer group holds the majority of the data, you might run into performance issues.
spark_df1 \
    .repartition("customer_group") \
    .write \
    .partitionBy("customer_group") \
    ...

Related

Processing/loading huge gzip file into hive

I have a huge gzipped csv file (55 GB compressed, 660 GB expanded) that I am trying to process and load into hive in a more usable format.
The file has 2.5B records of device events, with each row giving a device_id, event_id, timestamp, and event_name for identification, and then about 160 more columns, only a few of which are non-null for any given event_name.
Ultimately, I'd like to reduce the data to the few identification columns plus a JSON column that stores only the fields relevant to a given event_name, and get this into a hive table partitioned by date, hour, and event_name (and partitions based on metadata like timezone, but hopefully that is a minor point) so we can query the data easily for further analysis. However, the sheer size of the file is giving me trouble.
I've tried several approaches, without much success:
Loading the file directly into hive and then doing 'INSERT OVERWRITE ... SELECT TRANSFORM(*) USING 'python transform.py' FROM ...' had obvious problems
I split the file into multiple gzip files of 2 million rows each with bash commands: gzip -cd file.csv.gz | split -l 2000000 -d -a 5 --filter='gzip > split/$FILE.gz' and just loading one of those into hive to do the transform, but I'm running into memory issues still, though we've tried to increase memory limits (I'd have to check to see what parameters we've changed).
A. I tried a transform script that uses pandas (because it makes it easy to group by event_name and then remove unneeded columns per event name) and limited pandas to reading in 100k rows at a time, but still needed to limit the hive selection to avoid memory issues (1000 rows worked fine, 50k rows did not).
B. I also tried making a secondary temp table (stored as ORC), with the same columns as in the csv file, but partitioned by event_name and then just selecting the 2 million rows into that temp table, but also had memory issues.
I've considered trying to start by splitting the data up by event_name. I can do this using awk, but I'm not sure that would be much better than just doing the 2 million row files.
I'm hoping somebody has some suggestions for how to handle the file. I'm open to pretty much any combination of bash, python/pandas, hadoop/hive (I can consider others as well) to get the job done, so long as it can be made mostly automatic (I'll have several other files like this to process too).
This is the hive transform query I'm using:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=10240;
ADD FILE transform.py;
INSERT OVERWRITE TABLE final_table
PARTITION (date,hour,zone,archetype,event_name)
SELECT TRANSFORM (*)
USING "python transform.py"
AS device_id,timestamp,json_blob,date,hour,zone,archetype,event_name
FROM (
SELECT *
FROM event_file_table eft
JOIN meta_table m ON eft.device_id=m.device_id
WHERE m.name = "device"
) t
;
And this is the python transform script:
import sys
import json
import pandas as pd

INSEP = '\t'
HEADER = None
NULL = '\\N'
OUTSEP = '\t'

cols = [# 160+ columns here]
metaCols = ['device_id', 'meta_blob', 'zone', 'archetype', 'name']

df = pd.read_csv(sys.stdin, sep=INSEP, header=HEADER,
                 names=cols + metaCols, na_values=NULL, chunksize=100000,
                 parse_dates=['timestamp'])

for chunk in df:
    chunk = chunk.drop(['name', 'meta_blob'], axis=1)
    chunk['date'] = chunk['timestamp'].dt.date
    chunk['hour'] = chunk['timestamp'].dt.hour
    for name, grp in chunk.groupby('event_name'):
        grp = grp.dropna(axis=1, how='all') \
                 .set_index(['device_id', 'timestamp', 'date', 'hour',
                             'zone', 'archetype', 'event_name'])
        grp = pd.Series(grp.to_dict(orient='records'), grp.index) \
                .to_frame().reset_index() \
                .rename(columns={0: 'json_blob'})
        grp = grp[['device_id', 'timestamp', 'json_blob',
                   'date', 'hour', 'zone', 'archetype', 'event_name']]
        grp.to_csv(sys.stdout, sep=OUTSEP, index=False, header=False)
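The trickiest part of the script is the wide-to-blob step inside the groupby. As a sanity check, the same pandas pattern can be run on toy data; the column names and values below are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the event data: colA is only set for 'click'
# events, colB only for 'view' events.
df = pd.DataFrame({
    'event_name': ['click', 'view'],
    'device_id': [1, 2],
    'colA': [10.0, None],
    'colB': [None, 'x'],
})

blobs = {}
for name, grp in df.groupby('event_name'):
    # Drop columns that are entirely null for this event type, then
    # collapse the remaining event-specific columns into one dict per row.
    grp = grp.dropna(axis=1, how='all') \
             .set_index(['device_id', 'event_name'])
    grp = pd.Series(grp.to_dict(orient='records'), grp.index) \
            .to_frame().reset_index() \
            .rename(columns={0: 'json_blob'})
    blobs[name] = grp['json_blob'].tolist()
# blobs['click'] -> [{'colA': 10.0}], blobs['view'] -> [{'colB': 'x'}]
```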
Depending on the YARN configuration, and what hits the limits first, I've received these error messages:
Error: Java heap space
GC overhead limit exceeded
over physical memory limit
Update, in case anybody has similar issues.
I did eventually manage to get this done. In our case, what seemed to work best was to use awk to split the file by date and hour. This allowed us to reduce memory overhead for partitions because we could load in a few hours at a time rather than potentially having hundreds of hourly partitions (multiplied by all the other partitions we wanted) to try to keep in memory and load at once. Running the file through awk took several hours, but could be done in parallel with loading another already split file into hive.
Here's the bash/awk I used for splitting the file:
gzip -cd file.csv.gz |
    tail -n +2 |
    awk -F, -v MINDATE="$MINDATE" -v MAXDATE="$MAXDATE" \
    '{
        if ( ($5 >= MINDATE) && ($5 < MAXDATE) ) {
            split($5, ts, "[: ]");
            print | "gzip > file_" ts[1] "_" ts[2] ".csv.gz"
        }
    }'
where, obviously, the fifth column of the file was the timestamp, and MINDATE and MAXDATE were used to filter out dates we did not care about. The script splits the timestamp on spaces and colons, so the first part is the date and the second is the hour (the third and fourth would be minutes and seconds), and uses those to direct each line to the appropriate output file.
Once the file was split by hour, I loaded the hours several at a time into a temporary hive table and proceeded with basically the same transform mentioned above.

input file is not getting read from pd.read_csv

I'm trying to read a file stored in Google Cloud Storage from Apache Beam using pandas, but I am getting an error.
def Panda_a(self):
    import pandas as pd
    data = 'gs://tegclorox/Input/merge1.csv'
    df1 = pd.read_csv(data, names=['first_name', 'last_name', 'age',
                                   'preTestScore', 'postTestScore'])
    return df1

ip2 = p | 'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I'm executing the above code, the error says the path does not exist. Any idea why?
A bunch of things are wrong with this code.
Trying to get pandas to read a file from Google Cloud Storage: pandas does not support the Google Cloud Storage filesystem (as @Andrew pointed out, the documentation says the supported schemes are http, ftp, s3, and file). However, you can use the Beam FileSystems.open() API to get a file object, and give that object to pandas instead of the file path.
p | ... >> beam.Map(...) - beam.Map(f) transforms every element of the input PCollection using the given function f; it can't be applied to the pipeline itself. It seems that in your case you want to simply run the pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored']).
beam.Map(f) requires f to return a single value (or rather: if it returns a list, it will interpret that list as a single value), but your code gives it a function that returns a pandas dataframe. I strongly doubt that you want to create a PCollection containing a single element where that element is the entire dataframe; more likely, you want one element per row of the dataframe. For that, you need to use beam.FlatMap, and you need df.iterrows() or something like it.
In general, I am not sure why you would read the CSV file using pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1 and then parse each line yourself; if you have a large amount of data, this will be a lot more efficient. (And if you only have a small amount of data and do not anticipate it growing large enough to exceed the capabilities of a single machine - say, if it will never be above a few GB - then Beam is the wrong tool.)
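As a sketch of that last suggestion, a plain parsing function can replace the pandas call entirely. The column names are taken from the question; the Beam wiring in the comment is only indicative:

```python
import csv

def parse_line(line):
    """Turn one CSV line into a dict keyed by the question's column names."""
    fields = next(csv.reader([line]))
    names = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']
    return dict(zip(names, fields))

# Indicative pipeline wiring (not run here):
#   p | beam.io.ReadFromText('gs://tegclorox/Input/merge1.csv')
#     | beam.Map(parse_line)
```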

Join/merge multiple NetCDF files using xarray

I have a folder with NetCDF files from 2006-2100, in ten year blocks (2011-2020, 2021-2030 etc).
I want to create a new NetCDF file which contains all of these files joined together. So far I have read in the files:
ds = xarray.open_dataset('Path/to/file/20062010.nc')
ds1 = xarray.open_dataset('Path/to/file/20112020.nc')
etc.
Then merged these like this:
dsmerged = xarray.merge([ds,ds1])
This works, but is clunky and there must be a simpler way to automate this process, as I will be doing this for many different folders full of files. Is there a more efficient way to do this?
EDIT:
Trying to join these files using glob:
for filename in glob.glob('path/to/file/.*nc'):
    dsmerged = xarray.merge([filename])
Gives the error:
AttributeError: 'str' object has no attribute 'items'
This is reading only the text of the filename, and not the actual file itself, so it can't merge it. How do I open, store as a variable, then merge without doing it bit by bit?
If you are looking for a clean way to get all your datasets merged together, you can use some form of list comprehension and the xarray.merge function to get it done. The following is an illustration:
ds = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')])
In response to the out-of-memory issues you encountered: that is probably because you have more files than the python process can handle. The best fix for that is to use the xarray.open_mfdataset function, which uses the dask library under the hood to break the data into smaller chunks for processing. This is usually more memory efficient and will often allow you to bring your data into python. With this function, you do not need a for-loop; you can just pass it a string glob in the form "path/to/my/files/*.nc". The following is equivalent to the previous solution, but more memory efficient:
ds = xarray.open_mfdataset('path/to/file/*.nc')
I hope this proves useful.

How to print rdd in python in spark

I have two files on HDFS, and I just want to join these two files on a column, say employee id.
I am trying to simply print the files to make sure we are reading that correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried the foreach and println functions as well, and I am not able to display the file data.
I am working in python and am totally new to both python and spark.
This is really easy, just do a collect.
You must be sure that all the data fits in memory on your master.
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you must just take a sample using the take method.
# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)

Adding Relationships to Existing Nodes

I have a very large (~24 million lines) edge list that I'm trying to import into a Neo4j graph that is populated with nodes. The CSV file has three columns: from, to, and the period (relationship property). I've tried this via the REST API, using the following (Python) code:
batch_queue.append({"method":"POST","to":'index/node/people?uniqueness=get_or_create','id':1,'body':{'key':'name','value':row[0]}})
batch_queue.append({"method":"POST","to":'index/node/people?uniqueness=get_or_create','id':2,'body':{'key':'name','value':row[1]}})
batch_queue.append({"method":"POST","to":'{1}/relationships','body':{'to':"{2}","type":"FP%s" % row[2]}})
The third line failed, so I then also tried the Cypher statement:
USING PERIODIC COMMIT
LOAD CSV FROM "file:///file-name.csv" AS line
MATCH (a:Person {name: line[0]}),(b:Person {name:line[1]})
CREATE (a)-[:FOLLOWS {period: line[2]}]->(b)
This worked at small scale, but gave me an "Unknown Error" when run on the whole list (also with smaller periodic-commit values).
Any guidance as to what I'm doing incorrectly would be appreciated.
You might want to look into my batch-importer for that: http://github.com/jexp/batch-import
Otherwise for LOAD CSV, see my blog post here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
use the neo4j-shell for LOAD CSV
Depending on your available memory, you might have to split the data a bit, by moving a window over the file (e.g. 1M rows at a time, as below). Do you have indexes / constraints created for :Person(name)?
USING PERIODIC COMMIT
LOAD CSV FROM "file:///file-name.csv" AS line
WITH line
SKIP 2000000 LIMIT 1000000
MATCH (a:Person {name: line[0]}),(b:Person {name:line[1]})
CREATE (a)-[:FOLLOWS {period: line[2]}]->(b)
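One way to drive those windows is to generate the slice statements programmatically and feed each one to neo4j-shell in turn. A minimal Python sketch; the total row count and window size are illustrative:

```python
# Emit one windowed LOAD CSV statement per 1M-row slice of the edge list.
# TOTAL_ROWS and WINDOW are illustrative values, not taken from the question.
TOTAL_ROWS = 24_000_000
WINDOW = 1_000_000

TEMPLATE = """USING PERIODIC COMMIT
LOAD CSV FROM "file:///file-name.csv" AS line
WITH line SKIP {skip} LIMIT {limit}
MATCH (a:Person {{name: line[0]}}),(b:Person {{name: line[1]}})
CREATE (a)-[:FOLLOWS {{period: line[2]}}]->(b);"""

def windowed_statements(total_rows, window):
    """Yield one Cypher statement per window over the input file."""
    for skip in range(0, total_rows, window):
        yield TEMPLATE.format(skip=skip, limit=window)

statements = list(windowed_statements(TOTAL_ROWS, WINDOW))
# Pipe each statement to neo4j-shell one at a time.
```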
