Converting a very very large csv to parquet - python

I am trying to convert a csv file to parquet (I don't really care whether it is done in Python, on the command line, or otherwise). This question addresses it, but the answers seem to require reading the whole csv in first, and since in my case the csv is 17GB, that is not really feasible, so I would like some "offline" or streaming approach.

I successfully converted a 7GB+ (2.7 million lines) CSV file into a parquet file, using csv2parquet.
The process is simple:
First I had to clean my CSV with csvclean from csvkit (but you might not need this)
Generate a JSON schema with csv2parquet
Edit the schema by hand, as it might not suit you
Generate the parquet file thanks to csv2parquet
Bonus: use DuckDB to test simple SQL queries directly on the parquet file
You can probably reproduce the process if you download our CSV export at https://world.openfoodfacts.org/data
# Not needed for you, just in case you want to reproduce
wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
csvclean -t en.openfoodfacts.org.products.csv
# Generate the schema
./csv2parquet --header true -p -n en.openfoodfacts.org.products_out.csv products_zstd.pqt > parquet.schema
# It has to be modified because column detection is sometimes wrong.
# From the Open Food Facts CSV, for example, the code column is detected as an Int64, but it's in fact a "Utf8".
nano parquet.schema
# Generate parquet file.
# Using -c for compression is optional.
# -c zstd appears to be the best option regarding speed/compression.
./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt
# Try a query thanks to DuckDB. It's as fast as a database!
time ./duckdb test-duck.db "select * FROM (select count(data_quality_errors_tags) as products_with_issues from read_parquet('products_zstd.pqt') where data_quality_errors_tags != ''), (select count(data_quality_errors_tags) as products_with_issues_but_without_images from read_parquet('products_zstd.pqt') where data_quality_errors_tags != '' and last_image_datetime == '');"
┌──────────────────────┬─────────────────────────────────────────┐
│ products_with_issues │ products_with_issues_but_without_images │
├──────────────────────┼─────────────────────────────────────────┤
│ 29333 │ 4897 │
└──────────────────────┴─────────────────────────────────────────┘
real 0m0,211s
user 0m0,645s
sys 0m0,053s
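If you prefer to stay entirely in Python, a streaming conversion is also possible with pyarrow, reading the CSV in record batches so the 17GB file never has to fit in memory. This is only a rough sketch: the file names, the tab delimiter and the zstd compression are assumptions based on the Open Food Facts export used above.
import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Open the CSV as a stream of record batches instead of one giant table.
# The Open Food Facts export is tab-separated; adjust the delimiter as needed.
reader = pv.open_csv(
    "en.openfoodfacts.org.products.csv",
    parse_options=pv.ParseOptions(delimiter="\t"),
)

writer = None
for batch in reader:
    if writer is None:
        # The schema is taken from the first batch; override columns here if
        # the inferred types are wrong (e.g. force "code" to be a string).
        writer = pq.ParquetWriter("products_zstd.pqt", batch.schema,
                                  compression="zstd")
    writer.write_table(pa.Table.from_batches([batch]))
if writer is not None:
    writer.close()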

Related

PySpark DataFrame writing empty (zero bytes) files

I'm working with the PySpark DataFrame API with Spark version 3.1.1 on a local setup. After reading in data and performing some transformations, I save the DataFrame to disk. The output directories get created, along with a part-0000* file, and a _SUCCESS file is present in the output directory as well. However, my part-0000* file is always empty, i.e. zero bytes.
I've tried writing it in both parquet and csv formats with the same result. Just before writing, I called df.show() to make sure there is data in the DataFrame.
### code.py ###
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import configs
spark = SparkSession.builder.appName('My Spark App').getOrCreate()
data = spark.read.csv(configs.dataset_path, sep=configs.data_delim)
rdd = data.rdd.map(...)
data = spark.createDataFrame(rdd)
data = data.withColumn('col1', F.lit(1))
data.show() # Shows top 20 rows with data
data.write.parquet(save_path + '/dataset_parquet/', mode='overwrite') # Zero Bytes
data.write.csv(save_path + '/dataset_csv/', mode='overwrite') # Zero Bytes
I'm running this code as follows
export PYSPARK_PYTHON=python3
$SPARK_HOME/bin/spark-submit \
--master local[*] \
code.py
So I ran into a similar issue with pyspark, and one thing I also noticed is that when I tried to set the mode to overwrite, it was also failing. The problem with the overwrite was that the write failed partway through: it would create the file, fail, and retry, and the retry would then fail with 'file already exists' because it was already past the point in its process where it handles the overwrite.
So I added cache to force the evaluation, because like your .show() above I was doing a data.cache().count(). The count showed records, but any further evaluation using show or write treated the DataFrame as empty.
So try adding .cache() to the first reference of that dataframe and see if it fixes your issue. It did for me.
df_bad = df_cln.filter(F.col('isInvalid')).select(F.concat(F.col('line'),
        F.lit(">> LINE:"), F.col('monotonically_increasing_id'))
        .alias("line"), F.col('monotonically_increasing_id'))
removed_file_cnt = df_bad.cache().count()
print(f"The count of the records still containing udf chars in the file: {removed_file_cnt}")
if removed_file_cnt > 0:
    df_bad.coalesce(1)\
        .orderBy('monotonically_increasing_id')\
        .drop('monotonically_increasing_id')\
        .write.option("ignoreTrailingWhiteSpace", "false").option("encoding", "UTF-8")\
        .format('text').save(s3_error_bucket_path, mode='overwrite')
Alternatively, consider using a .localCheckpoint() on the data DataFrame. It is fast and convenient, and since we can always restart the job there is essentially no critical need for a full checkpoint.
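Applied to the snippet from the question, the suggestion above would look roughly like this (an untested sketch; spark, rdd, F and save_path come from the original code):
data = spark.createDataFrame(rdd)
data = data.withColumn('col1', F.lit(1))

# Force evaluation once and keep the result, so the later write does not
# re-run the lineage that was coming up empty.
data = data.cache()          # or: data = data.localCheckpoint()
print(data.count())          # materializes the cached DataFrame

data.write.parquet(save_path + '/dataset_parquet/', mode='overwrite')
data.write.csv(save_path + '/dataset_csv/', mode='overwrite')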

filter files by pyspark date

I am trying to load some files with pyspark from a Databricks data lake. To do this, I use "sqlContext" to create the data frame, and that works without problems. Each file is named after its creation date, for example "20211001.csv". The files arrive on a daily basis, and I was using "*.csv" to load them all. But now I need to load only the files from a certain date forward, and I can't find a way to do it, which is why I am turning to you.
The statement style I am using is the following:
df_example = (sqlContext
    .read
    .format("com.databricks.spark.csv")
    .option("delimiter", ";")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "windows-1252")
    .load("/mnt/path/202110*.csv"))
I need to be able to select files from a certain date forward in the ".load" call. Is it possible to do this with pyspark? For example, something like "NameFile.csv >= 202110". Do you have an example, please?
Thank you very much in advance!
I believe you can't do this, at least not the way you intend it.
If the data had been written partitioned by date, that date would be part of the path, and Spark would add it as another column which you could then use to filter with the DataFrame API, as you do with any other column.
So if the files were, let's say:
your_main_df_path
├── date_at=20211001
│   └── file.csv
├── date_at=20211002
│   └── file.csv
├── date_at=20211003
│   └── file.csv
└── ...
You could then do:
df = spark_session.read.format("csv").load("your_main_df_path") # with all your options
df.filter("date_at>=20211002") # or any other date you need
Spark would use the date in the path to do the partition pruning and only read the dates you need. If you can modify how the data is written, this is probably the best option.
If you can't control that, or it is hard to change for all the data you already have there, maybe you can write a little Python function that takes a start date (and maybe an optional end date) and returns a list of files that fall in that range. That list of files can then be passed to the DataFrameReader, as sketched below.
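A rough sketch of such a function (the helper name and the date format are made up; the one real detail is that DataFrameReader.load accepts a list of paths):
from datetime import datetime, timedelta

def csv_paths_from(base_path, start_date, end_date=None, fmt="%Y%m%d"):
    # Build the list of daily file paths between start_date and end_date (inclusive).
    end_date = end_date or datetime.today()
    days = (end_date - start_date).days
    return [f"{base_path}/{(start_date + timedelta(days=d)).strftime(fmt)}.csv"
            for d in range(days + 1)]

paths = csv_paths_from("/mnt/path", datetime(2021, 10, 2))
# Note: paths for days that do not exist will make the read fail,
# so filter the list against the actual directory contents first if days can be missing.
df_example = (sqlContext
              .read
              .format("com.databricks.spark.csv")
              .option("delimiter", ";")
              .option("header", "true")
              .load(paths))  # load() accepts a list of paths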
Get all dates greater than the particular date, build them into a list, and pass those values as a parameter in iteration mode, like this:
%python
table_name ='my_table_name'
survey_curated_delta_path = f"abfss://container@datalake.dfs.core.windows.net/path1/path2/stage/validation/results/{table_name}.csv"
survey_sdf = spark.read.format("csv").load(survey_curated_delta_path)
display(survey_sdf)
The easiest approach is to also import the filename:
from pyspark.sql.functions import input_file_name, regexp_replace
df = df.withColumn("filename", input_file_name())
then strip the non-numeric characters from the filename column and cast it to an integer:
df = df.withColumn("filename", regexp_replace("filename", r"\D+", "").cast("int"))
now we can filter:
df.filter("filename >= 20211002")

Processing/loading huge gzip file into hive

I have a huge gzipped csv file (55GB compressed, 660GB expanded) that I am trying to process and load into hive in a more usable format.
The file has 2.5B records of device events, with each row giving a device_id, event_id, timestamp, and event_name for identification and then about 160 more columns, only a few of which are non-null for a given event_name.
Ultimately, I'd like to format the data into the few identification columns and a JSON column that only stores the relevant fields for a given event_name and get this in a hive table partitioned by date, hour, and event_name (and partitions based on metadata like timezone, but hopefully that is a minor point) so we can query the data easily for further analysis. However, the sheer size of the file is giving me trouble.
I've tried several approaches, without much success:
Loading the file directly into hive and then doing 'INSERT OVERWRITE ... SELECT TRANSFORM(*) USING 'python transform.py' FROM ...' had obvious problems
I split the file into multiple gzip files of 2 million rows each with bash commands: gzip -cd file.csv.gz | split -l 2000000 -d -a 5 --filter='gzip > split/$FILE.gz' and then loaded just one of those into hive to do the transform, but I'm still running into memory issues, even though we've tried to increase memory limits (I'd have to check which parameters we've changed).
A. Tried a transform script that uses pandas (because it makes it easy to group by event_name and then remove unneeded columns per event name) and limited pandas to reading in 100k rows at a time, but still needed to limit hive selection to not have memory issues (1000 rows worked fine, 50k rows did not).
B. I also tried making a secondary temp table (stored as ORC), with the same columns as in the csv file, but partitioned by event_name and then just selecting the 2 million rows into that temp table, but also had memory issues.
I've considered trying to start with splitting the data up by event_name. I can do this using awk, but I'm not sure that it would be much better than just doing the 2-million-row files (a pandas-based version of that split is sketched below).
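For reference, a rough sketch of that per-event_name split done with pandas chunking instead of awk (the file name, default separator and output directory are assumptions; the chunk size mirrors approach A above):
import os
import pandas as pd

os.makedirs("split_by_event", exist_ok=True)

# Stream the gzipped CSV in 100k-row chunks and append each chunk's rows to a
# per-event_name file, so only one chunk is ever held in memory at a time.
reader = pd.read_csv("file.csv.gz", chunksize=100_000)
for chunk in reader:
    for name, grp in chunk.groupby("event_name"):
        grp.to_csv(os.path.join("split_by_event", f"{name}.csv"),
                   mode="a", index=False, header=False)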
I'm hoping somebody has some suggestions for how to handle the file. I'm open to pretty much any combination of bash, python/pandas, hadoop/hive (I can consider others as well) to get the job done, so long as it can be made mostly automatic (I'll have several other files like this to process too).
This is the hive transform query I'm using:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=10240;
ADD FILE transform.py;
INSERT OVERWRITE TABLE final_table
PARTITION (date,hour,zone,archetype,event_name)
SELECT TRANSFORM (*)
USING "python transform.py"
AS device_id,timestamp,json_blob,date,hour,zone,archetype,event_name
FROM (
    SELECT *
    FROM event_file_table eft
    JOIN meta_table m ON eft.device_id=m.device_id
    WHERE m.name = "device"
) t
;
And this is the python transform script:
import sys
import json
import pandas as pd
INSEP='\t'
HEADER=None
NULL='\\N'
OUTSEP='\t'
cols = []  # the 160+ column names go here
metaCols = ['device_id','meta_blob','zone','archetype','name']
df = pd.read_csv(sys.stdin, sep=INSEP, header=HEADER,
                 names=cols+metaCols, na_values=NULL, chunksize=100000,
                 parse_dates=['timestamp'])
for chunk in df:
    chunk = chunk.drop(['name','meta_blob'], axis=1)
    chunk['date'] = chunk['timestamp'].dt.date
    chunk['hour'] = chunk['timestamp'].dt.hour
    for name, grp in chunk.groupby('event_name'):
        grp = grp.dropna(axis=1, how='all') \
                 .set_index(['device_id','timestamp','date','hour',
                             'zone','archetype','event_name'])
        grp = pd.Series(grp.to_dict(orient='records'), grp.index) \
                .to_frame().reset_index() \
                .rename(columns={0: 'json_blob'})
        grp = grp[['device_id','timestamp','json_blob',
                   'date','hour','zone','archetype','event_name']]
        grp.to_csv(sys.stdout, sep=OUTSEP, index=False, header=False)
Depending on the yarn configurations, and what hits the limits first, I've received error messages:
Error: Java heap space
GC overhead limit exceeded
over physical memory limit
Update, in case anybody has similar issues.
I did eventually manage to get this done. In our case, what seemed to work best was to use awk to split the file by date and hour. This allowed us to reduce memory overhead for partitions because we could load in a few hours at a time rather than potentially having hundreds of hourly partitions (multiplied by all the other partitions we wanted) to try to keep in memory and load at once. Running the file through awk took several hours, but could be done in parallel with loading another already split file into hive.
Here's the bash/awk I used for splitting the file:
gzip -cd file.csv.gz |
tail -n +2 |
awk -F, -v MINDATE="$MINDATE" -v MAXDATE="$MAXDATE" \
    '{
        if ( ($5>=MINDATE) && ($5<MAXDATE) ) {
            split($5, ts, "[: ]");
            print | "gzip > file_"ts[1]"_"ts[2]".csv.gz"
        }
    }'
where, obviously, the 5th column of the file was the timestamp, and MINDATE and MAXDATE were used to filter out dates we did not care about. The awk script splits the timestamp on spaces and colons, so the first part is the date and the second the hour (the third and fourth would be minutes and seconds); for example, a timestamp like "2019-06-01 13:45:22" yields ts[1]="2019-06-01" and ts[2]="13". That date and hour are then used to direct each line to the appropriate output file.
Once the file was split by hour, I loaded several hours at a time into a temporary hive table and proceeded with basically the same transform mentioned above.

Mongoimport json data and then big data

I'm stuck on a very simple import operation in MongoDB. I have a file, 200MB in size, in JSON format. It's a feeds dump, formatted as: {"some-headers":"", "dump":[{"item-id":"item-1"},{"item-id":"item-2"},...]}
This json feed also contains words in languages other than English, like Chinese and Japanese characters, etc.
I tried to do a mongoimport as mongoimport --db testdb --collection testcollection --file dump.json but, possibly because the data is a bit complex, it's treating dump as a single field, resulting in an error due to the 4MB size limit.
I further tried and a python script:
import simplejson
import pymongo
conn = pymongo.Connection("localhost",27017)
db = conn.testdb
c = db.testcollection
o = open("dump.json")
s = simplejson.load(o)
for x in s['dump']:
    c.insert(x)
o.close()
Python is killed while running this thing, possibly due to the very limited resources I'm trying to work with.
I reduced the file size by getting a new json dump at 50MB; now Python is giving me trouble again, due to ASCII/encoding issues.
I am looking for options both ways, with mongoimport and with the above python script. Any further solutions shall also be greatly appreciated.
Also, the json dump might some day reach several GBs, so if there is some other solution I should consider at that point, please do highlight it.
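Since loading the whole file with simplejson is what exhausts memory, one option to consider is streaming just the items of the "dump" array and inserting them in batches, so the whole document is never parsed at once. A sketch, assuming the third-party ijson package and a reasonably recent pymongo:
import ijson
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
coll = client.testdb.testcollection

batch = []
with open("dump.json", "rb") as f:
    # ijson yields the items of the "dump" array one by one, so the 200MB
    # (or later multi-GB) document never has to be loaded into memory.
    # Note: ijson may return numbers as Decimal; convert them before inserting
    # if your feed has numeric fields.
    for item in ijson.items(f, "dump.item"):
        batch.append(item)
        if len(batch) == 1000:
            coll.insert_many(batch)
            batch = []
if batch:
    coll.insert_many(batch)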

Reading DBF files with pyodbc

In a project, I need to extract data from a Visual FoxPro database, which is stored in dbf files. I have a data directory with 539 files I need to take into account; each file represents a database table. I've been doing some testing and my code goes like this:
import pyodbc
connection = pyodbc.connect("Driver={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=P:\\Data;Exclusive=No;Collate=Machine;NULL=No;DELETED=Yes")
tables = connection.cursor().tables()
for _ in tables:
    print _
This prints only 15 tables, with no obvious pattern, and always the same 15 tables. I thought this was because the rest of the tables were empty, but I checked and some of the tables (dbf files) on the list are empty too. Then I thought it was a permission issue, but all the files have the same permission structure, so I don't know what's happening here.
Any light??
EDIT:
It is not truncating the output; the tables it lists are not the first 15 or anything like that.
I DID IT!!!!
There were several problems with what I was doing, so here is what I did to solve it (after implementing it the first time with Ethan Furman's solution).
The first thing was a driver problem: it turns out that the Windows DBF drivers are 32-bit programs running on a 64-bit operating system, and I had installed Python-amd64; that was the first problem, so I installed a 32-bit Python.
The second issue was a library/file issue: according to this, dbf files in VFP > 7 are different, so my pyodbc library won't read them correctly. I tried some OLE-DB libraries with no success and decided to do it from scratch.
Googling for a while took me to this post, which finally shed some light on this.
Basically, what I did was the following:
import win32com.client
conn = win32com.client.Dispatch('ADODB.Connection')
db = 'C:\\Profit\\profit_a\\ARMM'
dsn = 'Provider=VFPOLEDB.1;Data Source=%s' % db
conn.Open(dsn)
cmd = win32com.client.Dispatch('ADODB.Command')
cmd.ActiveConnection = conn
cmd.CommandText = "Select * from factura, reng_fac where factura.fact_num = reng_fac.fact_num AND factura.fact_num = 6099;"
rs, total = cmd.Execute() # This returns a tuple: (<RecordSet>, number_of_records)
while total:
    for x in xrange(rs.Fields.Count):
        print '%s --> %s' % (rs.Fields.item(x).Name, rs.Fields.item(x).Value)
    rs.MoveNext()
    total = total - 1
And it gave me 20 records, which I checked with DBFCommander and they were OK.
First, you need to install the pywin32 extensions (32-bit) and the Visual FoxPro OLE-DB Provider (only available for 32-bit), in my case for VFP 9.0.
Also, it's good to read the ADO documentation at the w3c website.
This worked for me. Thank you very much to those who replied
I would use my own dbf package and the code would go something like this:
import dbf
from glob import glob
for dbf_file in glob(r'p:\data\*.dbf'):
    with dbf.Table(dbf_file) as table:
        for record in table:
            do_something_with(record)
A table is list-like, and iteration through it returns records. A record is list-, dict-, and obj-like, and iteration returns the values; besides iteration through the record, individual fields can be accessed either by offset (record[0] for the first field), by field-name using dict-like access (record['some_field']), or by field-name using obj.attr-like access (record.some_field).
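A tiny illustration of those three access styles (the file path and field name below are made up):
import dbf

with dbf.Table(r'p:\data\example.dbf') as table:   # hypothetical table
    for record in table:
        first_value = record[0]            # by offset
        by_name = record['some_field']     # dict-like, by field name
        also_by_name = record.some_field   # obj.attr-like, by field name
        print(first_value, by_name, also_by_name)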
If you just wanted to dump the contents of each dbf file into a csv file you could do:
for dbf_file in glob(r'p:\data\*.dbf'):
    with dbf.Table(dbf_file) as table:
        dbf.export(table, dbf_file)
I know this doesn't directly answer your question, but might still help. I've had lots of issues using ODBC with VFP databases and I've found it's often much easier treating the VFP tables as free tables when possible.
Using Yusdi Santoso's dbf.py and glob, here's some code to open each table in a directory and run through each record.
import glob
import os
import dbf
os.chdir("P:\\data")
for file in glob.glob("*.dbf"):
    table = dbf.readDbf(file)
    for row in table:
        pass  # do stuff with each row
