I have a huge gzipped csv file (55GB compressed, 660GB expanded) that I am trying to to process and load into hive in a more useable format.
The file has 2.5B records of device events, with each row giving a device_id, event_id, timestamp, and event_name for identification and then about 160 more columns, only a few of which are non-null for a given event_name.
Ultimately, I'd like to format the data into the few identification columns and a JSON column that only stores the relevant fields for a given event_name and get this in a hive table partitioned by date, hour, and event_name (and partitions based on metadata like timezone, but hopefully that is a minor point) so we can query the data easily for further analysis. However, the sheer size of the file is giving me trouble.
I've tried several approaches, without much success:
Loading the file directly into hive and then doing 'INSERT OVERWRITE ... SELECT TRANSFORM(*) USING 'python transform.py' FROM ...' had obvious problems
I split the file into multiple gzip files of 2 million rows each with bash commands: gzip -cd file.csv.gz | split -l 2000000 -d -a 5 --filter='gzip > split/$FILE.gz' and just loading one of those into hive to do the transform, but I'm running into memory issues still, though we've tried to increase memory limits (I'd have to check to see what parameters we've changed).
A. Tried a transform script that uses pandas (because it makes it easy to group by event_name and then remove unneeded columns per event name) and limited pandas to reading in 100k rows at a time, but still needed to limit hive selection to not have memory issues (1000 rows worked fine, 50k rows did not).
B. I also tried making a secondary temp table (stored as ORC), with the same columns as in the csv file, but partitioned by event_name and then just selecting the 2 million rows into that temp table, but also had memory issues.
I've considered trying to start with splitting the data up by event_name. I can do this using awk, but I'm not sure that would be much better than just doing the 2 million row files.
I'm hoping somebody has some suggestions for how to handle the file. I'm open to pretty much any combination of bash, python/pandas, hadoop/hive (I can consider others as well) to get the job done, so long as it can be made mostly automatic (I'll have several other files like this to process too).
This is the hive transform query I'm using:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=10240;
ADD FILE transform.py;
INSERT OVERWRITE TABLE final_table
PARTITION (date,hour,zone,archetype,event_name)
SELECT TRANSFORM (*)
USING "python transform.py"
AS device_id,timestamp,json_blob,date,hour,zone,archetype,event_name
FROM (
SELECT *
FROM event_file_table eft
JOIN meta_table m ON eft.device_id=m.device_id
WHERE m.name = "device"
) t
;
And this is the python transform script:
import sys
import json
import pandas as pd
INSEP='\t'
HEADER=None
NULL='\\N'
OUTSEP='\t'
cols = [# 160+ columns here]
metaCols =['device_id','meta_blob','zone','archetype','name']
df = pd.read_csv(sys.stdin, sep=INSEP, header=HEADER,
names=cols+metaCols, na_values=NULL, chunksize=100000,
parse_dates=['timestamp'])
for chunk in df:
chunk = chunk.drop(['name','meta_blob'], axis=1)
chunk['date'] = chunk['timestamp'].dt.date
chunk['hour'] = chunk['timestamp'].dt.hour
for name, grp in chunk.groupby('event_name'):
grp = grp.dropna(axis=1, how='all') \
.set_index(['device_id','timestamp','date','hour',
'zone','archetype','event_name']) \
grp = pd.Series(grp.to_dict(orient='records'),grp.index) \
.to_frame().reset_index() \
.rename(columns={0:'json_blob'})
grp = grp[['device_id',timestamp','blob',
'date','hour','zone','archetype','event_name']]
grp.to_csv(sys.stdout, sep=OUTSEP, index=False, header=False)
Depending on the yarn configurations, and what hits the limits first, I've received error messages:
Error: Java heap space
GC overhead limit exceeded
over physical memory limit
Update, in case anybody has similar issues.
I did eventually manage to get this done. In our case, what seemed to work best was to use awk to split the file by date and hour. This allowed us to reduce memory overhead for partitions because we could load in a few hours at a time rather than potentially having hundreds of hourly partitions (multiplied by all the other partitions we wanted) to try to keep in memory and load at once. Running the file through awk took several hours, but could be done in parallel with loading another already split file into hive.
Here's the bash/awk I used for splitting the file:
gzip -cd file.csv.gz |
tail -n +2 |
awk -F, -v MINDATE="$MINDATE" -v MAXDATE="$MAXDATE" \
'{
if ( ($5>=MINDATE) && ($5<MAXDATE) ) {
split($5, ts, "[: ]");
print | "gzip > file_"ts[1]"_"ts[2]".csv.gz"
}
}'
where, obviously, the 5th column of the file was the timestamp, and MINDATE and MAXDATE were used to filter out dates we did not care about. This split the timestamp by spaces and colons so the first part of the timestamp was the date and the second the hour (third and fourth would be minutes and seconds) and used that to direct the line to the appropriate output file.
Once the the file was split by hour, I loaded the hours several at a time into a temporary hive table and proceeded with basically the same transform mentioned above.
Related
I am trying to read a large database table with polars. Unfortunately, the data is too large to fit into memory and the code below eventually fails.
Is there a way in polars how to define a chunksize, and also write these chunks to parquet, or use the lazy dataframe interface to keep the memory footprint low?
import polars as pl
df = pl.read_sql("SELECT * from TABLENAME", connection_string)
df.write_parquet("output.parquet")
Yes and no.
There's not a predefined method to do it but you can certainly do it yourself. You'd do something like:
rows_at_a_time=1000
curindx=0
while True:
df = pl.read_sql(f"SELECT * from TABLENAME limit {curindx},{rows_at_a_time}", connection_string)
if df.shape[0]==0:
break
df.write_parquet(f"output{curindx}.parquet")
curindx+=rows_at_a_time
ldf=pl.concat([pl.scan_parquet(x) for x in os.listdir(".") if "output" in x and "parquet" in x])
This borrows limit syntax from this answer assuming you're using mysql or a db that has the same syntax which isn't trivial assumption. You may need to do something like this if not using mysql.
Otherwise you just read your table in chunks, saving each chunk to a local file. When the chunk you get back from your query has 0 rows then it stops looping and loads all the files to a lazy df.
You can almost certainly (and should) increase the rows_at_a_time to something greater than 1000 but that's dependent on your data and computer memory.
I'm working with PySpark DataFrame API with Spark version 3.1.1 on a local setup. After reading in data, performing some transformations etc. I save the DataFrame to disk. Output directories get created, along with part-0000* file and there is _SUCCESS file present in the output directory as well. However, my part-0000* is always empty i.e. zero bytes.
I've tried writing it in both parquet as well as csv formats with the same result. Just before writing, I called df.show() to make sure there is data in the DataFrame.
### code.py ###
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import configs
spark = SparkSession.builder.appName('My Spark App').getOrCreate()
data = spark.read.csv(configs.dataset_path, sep=configs.data_delim)
rdd = data.rdd.map(...)
data = spark.createDataFrame(rdd)
data = data.withColumn('col1', F.lit(1))
data.show() # Shows top 20 rows with data
data.write.parquet(save_path + '/dataset_parquet/', mode='overwrite') # Zero Bytes
data.write.csv(save_path + '/dataset_csv/', mode='overwrite') # Zero Bytes
I'm running this code as follows
export PYSPARK_PYTHON=python3
$SPARK_HOME/bin/spark-submit \
--master local[*] \
code.py
So I ran into a similar issue with pyspark and one thing I also noticed is that when I tried to set the mode to overwrite it was also failing. The issue with the overwrite was that it was failing to write while it was in the middle of the write, so it would create the file, fail, retry and the retry would fail with the 'file already exists' because it was past the point in its process of handling the overwrite.
So I added cache to force the evaluation because like your .show() above I was doing a data.cache().count(). The count showed records but any further evaluation using show or write showed the DF as empty.
So try adding .cache() to the first reference of that dataframe and see it it fixes your issue. It did for me.
df_bad = df_cln.filter(F.col('isInvalid')).select(F.concat(F.col('line')\
,F.lit(">> LINE:"),F.col('monotonically_increasing_id'))\
.alias("line"),F.col('monotonically_increasing_id'))
removed_file_cnt = df_bad.cache().count()
print(f"The count of the records still containing udf chars in the file: {removed_file_cnt}")
if removed_file_cnt > 0:
df_bad.coalesce(1)\
.orderBy('monotonically_increasing_id')\
.drop('monotonically_increasing_id')\
.write.option("ignoreTrailingWhiteSpace","false").option("encoding", "UTF-8")\
.format('text').save(s3_error_bucket_path, mode='overwrite')
Alternatively, consider using a .localCheckpoint() on the data column. It is fast and convenient. Since we can always restart the job there essentially no critical need for a checkpoint.
I have to first partition by a "customer group" but I also want to make sure that I have a single csv file per "customer_group" . This is because it is timeseries data that is needed for inference and it cant be spread across multiple files.
i tried: datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').option("compression","gzip").save(destination_path+'/traintestcsvzippartitionocalesce')
but it creates mutilpe smaller files inside customer_group/ path with formats csv.gz0000_part_00.gz , csv.gz0000_part_01.gz ....
i tried to use :datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').coalesce(1).option("compression","gzip").save(destination_path+'/traintestcsvzippartitionocalesce')
but it throws the following error:
AttributeError: 'DataFrameWriter' object has no attribute 'coalesce'
Is there a solution?
I cannot use repartition(1) or coalesce(1) directly without the partition by as it creates only 1 file and only one worker node works at a time(serially)and is computationally super expensive
The repartition function also accepts column names as arguments, not only the number of partitions.
Repartitioning by the write partition column will make spark save one file per folder.
Please note that if one of your partitions are skewed and one customer group has a majority of the data you might get into performance issues.
spark_df1 \
.repartition("customer_group") \
.write \
.partitionBy("customer_group") \
...
Using psycopg2 to export Postgres data to CSV files (not all at once, 100 000 rows at a time). Currently using LIMIT OFFSET but obviously this is slow on a 100M row db. Any faster way to keep track of the offset each iteration?
for i in (0, 100000000, 100000):
"COPY
(SELECT * from users LIMIT %s OFFSET %s)
TO STDOUT DELIMITER ',' CSV HEADER;"
% (100000, i)
Is the code run in a loop, incrementing i
Let me suggest you a different approach.
Copy the whole table and split it afterward. Something like:
COPY users TO STDOUT DELIMITER ',' CSV HEADER
And finally, from bash execute the split command (btw, you could call it inside your python script):
split -l 100000 --numeric-suffixes users.csv users_chunks_
It'll generate a couple of files called users_chunks_1, users_chunks_2, etc.
I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I was able to just take the .mdb file and dump all the tables into .csv. It creates huge mess since there are lots of files that need to be processed.
What I need is to extract particular sorted data from particular table. Like for example, I need the table called "Voltage". The table consists of numerous cycles and each cycle has several rows also. The cycles usually go in chronological order, but in some cases time stamp get recorded with delay. Like cycle's one first row can have later time than cycles 1 first row. I need to extract the latest row of the cycle based on time for the first or last five cycles. For example, in table below, I will need the second row.
Cycle# Time Data
1 100.59 34
1 101.34 54
1 98.78 45
2
2
2 ...........
Here is the script I use. I am using the command python extract.py table_files.mdb. But I would like the script to just be invoked with ./extract.py. The path to filenames should be in the script itself.
import sys, subprocess, os
DATABASE = sys.argv[1]
subprocess.call(["mdb-schema", DATABASE, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()
# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
if table != '':
filename = table.replace(" ","_") + ".csv"
file = open(filename, 'w')
print("Dumping " + table)
contents = subprocess.Popen(["mdb-export", DATABASE, table],
stdout=subprocess.PIPE).communicate()[0]
file.write(contents)
file.close()
Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.
A few suggestions:
Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.
When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.
At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:
SELECT
v1.SourceFile,
v1.Cycle,
v1.Time,
v1.Data
FROM
Voltage v1
INNER JOIN
(
SELECT
SourceFile,
Cycle,
MAX([Time]) AS MaxTime
FROM Voltage
GROUP BY SourceFile, Cycle
) v2
ON v1.SourceFile=v2.SourceFile
AND v1.Cycle=v2.Cycle
AND v1.Time=v2.MaxTime
To bring it directly to Pandas in python3 I wrote this little snippet
import sys, subprocess, os
from io import StringIO
import pandas as pd
VERBOSE = True
def mdb_to_pandas(database_path):
subprocess.call(["mdb-schema", database_path, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
sys.stdout.flush()
# Dump each table as a stringio using "mdb-export",
out_tables = {}
for rtable in tables:
table = rtable.decode()
if VERBOSE: print('running table:',table)
if table != '':
if VERBOSE: print("Dumping " + table)
contents = subprocess.Popen(["mdb-export", database_path, table],
stdout=subprocess.PIPE).communicate()[0]
temp_io = StringIO(contents.decode())
print(table, temp_io)
out_tables[table] = pd.read_csv(temp_io)
return out_tables
There's an alternative to mdbtools for Python: JayDeBeApi with the UcanAccess driver. It uses a Python -> Java bridge which slows things down, but I've been using it with considerable success and comes with decent error handling.
It takes some practice setting it up, but if you have a lot of databases to wrangle, it's well worth it.