I am working on a VPN server for a company, and they gave us their data as a large number of parquet files. However, we are having problems loading all the data. Loading a few files works for us, but we have 103 parquet files that we have to load to get the full dataset. When we try this, the server throws a kernel error.
We are working in Python.
Has anyone had this problem before and found a solution?
We have tried concatenating a few parquet files and converting the result to CSV so it is easier to load, but this also gives us a kernel error. The files contain a lot of string columns, so they take up a lot of memory.
We also tried lots of different options, such as low_memory and loading only certain columns, but these give us the same error.
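For anyone reproducing this, a minimal sketch of reading only certain columns and streaming the files in batches with pyarrow (the directory path and column names are placeholders, and pyarrow is assumed to be installed):

import pyarrow.dataset as ds

# Treat all 103 parquet files in the directory as one logical dataset.
dataset = ds.dataset("data_dir", format="parquet")

# Stream record batches instead of materialising everything at once,
# reading only the columns that are actually needed.
for batch in dataset.to_batches(columns=["col_a", "col_b"], batch_size=100_000):
    chunk = batch.to_pandas()
    # ... process / aggregate each chunk here ...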
Related
We are trying to load a largish Excel file from a mounted Azure Data Lake location using PySpark on Databricks.
We have tried loading it with pyspark.pandas and with spark-excel, without much success.
PySpark.Pandas
import pyspark.pandas as ps
df = ps.read_excel("dbfs:/mnt/aadata/ds/data/test.xlsx",engine="openpyxl")
We are getting the following conversion error:
ArrowTypeError: Expected bytes, got a 'int' object
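A commonly suggested workaround for this error is to force every column to be read as a string so that mixed-type columns survive the Arrow conversion, and cast back afterwards; a minimal sketch, assuming ps.read_excel forwards pandas' dtype option:

import pyspark.pandas as ps

# Read every cell as a string so columns mixing ints and strings
# do not break the Arrow conversion; cast columns back later as needed.
df = ps.read_excel(
    "dbfs:/mnt/aadata/ds/data/test.xlsx",
    engine="openpyxl",
    dtype=str,
)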
spark-excel
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .load("dbfs:/mnt/aadata/ds/data/test.xlsx")
We are able to load a smaller file, but a larger file gives the following error
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 185,568,653, but the maximum length for this record type is 100,000,000.
Is there any other way to load excel files in databricks with pyspark?
In your Excel file, there is probably some unusual formatting or a special character that is preventing it from working. Save the Excel file as a CSV file and retry. You should easily be able to load a CSV file, because it is plain text with none of the embedded structure that Excel files carry.
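If converting by hand is not practical, a rough sketch of doing the conversion inside the notebook and then reading the CSV with Spark (it assumes the workbook fits through openpyxl's read-only streaming mode and that the /dbfs/ FUSE path is available; the output file name is a placeholder):

import csv
from openpyxl import load_workbook

# Stream the worksheet row by row instead of loading the whole workbook.
wb = load_workbook("/dbfs/mnt/aadata/ds/data/test.xlsx", read_only=True)
ws = wb.active

with open("/dbfs/mnt/aadata/ds/data/test.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in ws.iter_rows(values_only=True):
        writer.writerow(row)
wb.close()

# Spark handles large CSV files comfortably.
df = spark.read.option("header", "true").csv("dbfs:/mnt/aadata/ds/data/test.csv")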
I got an Excel file from work which I amended using pandas. It has 735719 rows × 31 columns; I made the necessary changes and assigned them to a new dataframe. Now I need to write this dataframe back out in Excel format. I have checked in Jupyter notebooks that ont_dub works and displays as a dataframe, so I use ont_dub.to_excel("ont_dub 2019.xlsx"), as I always do.
Normally this would only take a few seconds, but it has now been 40 minutes and it is still running. Side note: I am working in a OneDrive folder from work, but that hasn't caused issues before. Hopefully someone can spot the problem.
Usually, if you want to save that amount of data in a local folder, you don't use Excel. If I am not mistaken, Excel has a known limit on the number of cells it can display, and it wasn't built to display and query such massive amounts of data (you can use pandas for that). You can either use Feather files (a known fast-save alternative) or CSV files, which are built for exactly this purpose.
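For example, a minimal sketch of writing the same dataframe to Feather or CSV instead (ont_dub is the dataframe from the question; the Feather path assumes pyarrow is installed):

# Feather is a fast binary format; reset_index avoids issues with non-default indexes.
ont_dub.reset_index(drop=True).to_feather("ont_dub 2019.feather")

# CSV is slower than Feather but universally readable.
ont_dub.to_csv("ont_dub 2019.csv", index=False)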
I'm currently working on a project in which I have big TSV files that I need to import into a database. I need a NoSQL database, so I chose ArangoDB. With Arango, I can't import TSV files from Python, only JSON documents, but I can import TSV files with PowerShell.
The files are around 1 GB, and once they are imported I will need to do daily updates to the database from the same TSV files, which may have been modified.
What are the best options?
Convert the TSV files to JSON with pandas in my Python program, then bulk import (I think this causes memory issues; see the sketch below)
Just use with open() and insert documents line by line (again a memory issue, but you may have a solution)
Use the PowerShell route to import the data. The only problem with that is that I use Docker, so I can't simply run a PowerShell script.
Use another database
Also, what would be my best bet for the daily updates, taking into consideration the memory issues?
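For option 1, a minimal sketch of keeping memory bounded by reading the TSV in chunks and bulk-inserting each chunk (it assumes the python-arango client; the connection details, file name, collection name, and chunk size are placeholders):

import pandas as pd
from arango import ArangoClient

# Placeholder connection details.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("mydb", username="root", password="password")
collection = db.collection("documents")

# Read the ~1 GB TSV in chunks so the whole file never sits in memory.
for chunk in pd.read_csv("data.tsv", sep="\t", chunksize=50_000):
    docs = chunk.to_dict(orient="records")
    # import_bulk sends the whole chunk in one request; on_duplicate="update"
    # is one way to handle the daily re-imports, provided each document has a _key.
    collection.import_bulk(docs, on_duplicate="update")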
Good afternoon!
While using pandas to read CSV data files larger than 500 MB from my Google Drive, instead of getting the CSV file I receive the "can't scan large file for viruses" HTML page. I've tried a lot of things but can't find any workaround. Can anyone tell me if it's possible to bypass that?
Sample file:- https://drive.google.com/file/d/1EQbD11iRnbXVJMZNTVExfrRP5WYIcAjk/view
[Screenshot of the error]
PS: can someone also suggest a better (preferably free) service for uploading multiple big CSV files so that I can read the data from them with pandas... I have >40 GB of data to work with.
Thanks :)
I found this and it's working for me as of 14/10/2020, even though it has since been removed from the documentation: http://web.archive.org/web/20190621105530/https://developers.google.com/drive/api/v3/manage-downloads
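The gist of that page is to fetch the file through the Drive API rather than through the sharing link, which skips the virus-scan interstitial. A rough sketch of that route (it assumes google-api-python-client and google-auth are installed and that a service-account JSON with access to the file is available; the file ID is taken from the sample link above):

import io
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# "service_account.json" is a placeholder for your own credential file.
creds = service_account.Credentials.from_service_account_file(
    "service_account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
service = build("drive", "v3", credentials=creds)

file_id = "1EQbD11iRnbXVJMZNTVExfrRP5WYIcAjk"  # ID from the sample link
request = service.files().get_media(fileId=file_id)

# Download the raw bytes in chunks into an in-memory buffer.
buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, request)
done = False
while not done:
    status, done = downloader.next_chunk()

buffer.seek(0)
df = pd.read_csv(buffer)  # the actual CSV, not the virus-scan HTML page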
I have downloaded a subset of the Million Song Dataset, which is about 2 GB. However, the data is broken down into folders and subfolders, and in the subfolders the files are all in H5 format. I understand they can be read using Python, but I do not know how to extract them and load them into HDFS so I can run some data analysis in Pig.
Do I extract them as CSV and load them into HBase or Hive? It would help if someone could point me to the right resource.
If it's already in CSV or any other format on the Linux file system that Pig can understand, just do a hadoop fs -copyFromLocal to HDFS.
If you want to read/process the raw H5 file format using Python on HDFS, look at hadoop-streaming (map/reduce).
Python can handle 2 GB on a decent Linux system; I'm not sure you even need Hadoop for it.
Don't load that many small files into HDFS. Hadoop doesn't handle lots of small files well: each small file incurs overhead because the block size (usually 64 MB) is much bigger.
I want to do this myself, so I'm thinking about solutions. The Million Song Dataset files are no more than 1 MB each. My approach will be to aggregate the data somehow before importing it into HDFS, as sketched below.
The blog post "The Small Files Problem" from Cloudera may shed some light.
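One way to do that aggregation is to flatten every H5 file into rows of one large CSV and copy that single file to HDFS; a rough sketch (it assumes h5py is installed, and the dataset paths and field names are assumptions about the Million Song file layout that should be checked against an actual file):

import csv
import glob
import h5py

# Walk every .h5 file under the extracted subset and append one row per song.
with open("songs.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["title", "artist_name", "duration"])
    for path in glob.glob("MillionSongSubset/**/*.h5", recursive=True):
        with h5py.File(path, "r") as f:
            # Assumed layout: compound tables at /metadata/songs and /analysis/songs.
            meta = f["metadata"]["songs"][0]
            analysis = f["analysis"]["songs"][0]
            writer.writerow([
                meta["title"].decode("utf-8", "replace"),
                meta["artist_name"].decode("utf-8", "replace"),
                float(analysis["duration"]),
            ])

The resulting songs.csv can then be pushed to HDFS with hadoop fs -copyFromLocal and read directly from Pig.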