I am working on a VPN server for a company, and they gave us their data as a large number of parquet files. However, we are having problems loading all the data. Loading a few files works for us, but we have 103 parquet files that we have to load to get the full dataset. When we try this, the server throws a kernel error.
We are working in Python.
Has anyone had this problem before and found a solution?
We have tried concatenating a few parquet files and converting the result to CSV so it is easier to load, but this also gives us a kernel error. The files contain a lot of string columns, so they take up a lot of memory.
We also tried lots of different options, such as low_memory and loading only certain columns, but these give us the same error.
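For anyone reproducing this, a minimal sketch of reading only certain columns and streaming the files in batches with pyarrow (the directory path and column names are placeholders, and pyarrow is assumed to be installed):

import pyarrow.dataset as ds

# Treat all 103 parquet files in the directory as one logical dataset.
dataset = ds.dataset("data_dir", format="parquet")

# Stream record batches instead of materialising everything at once,
# reading only the columns that are actually needed.
for batch in dataset.to_batches(columns=["col_a", "col_b"], batch_size=100_000):
    chunk = batch.to_pandas()
    # ... process / aggregate each chunk here ...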
Related
We are trying to load a largish Excel file from a mounted Azure Data Lake location using PySpark on Databricks.
We have tried loading it with pyspark.pandas and with spark-excel, without much success.
PySpark.Pandas
import pyspark.pandas as ps
df = ps.read_excel("dbfs:/mnt/aadata/ds/data/test.xlsx",engine="openpyxl")
We are getting the following conversion error:
ArrowTypeError: Expected bytes, got a 'int' object
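A commonly suggested workaround for this error is to force every column to be read as a string so that mixed-type columns survive the Arrow conversion, and cast back afterwards; a minimal sketch, assuming ps.read_excel forwards pandas' dtype option:

import pyspark.pandas as ps

# Read every cell as a string so columns mixing ints and strings
# do not break the Arrow conversion; cast columns back later as needed.
df = ps.read_excel(
    "dbfs:/mnt/aadata/ds/data/test.xlsx",
    engine="openpyxl",
    dtype=str,
)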
spark-excel
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .load("dbfs:/mnt/aadata/ds/data/test.xlsx")
We are able to load a smaller file, but a larger file gives the following error
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 185,568,653, but the maximum length for this record type is 100,000,000.
Is there any other way to load excel files in databricks with pyspark?
In your Excel file, there is probably some unusual formatting or a special character that is preventing it from working. Save the Excel file as a CSV file and retry. You should easily be able to load a CSV file, because it is plain text with none of the embedded structure that Excel files carry.
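If converting by hand is not practical, a rough sketch of doing the conversion inside the notebook and then reading the CSV with Spark (it assumes the workbook fits through openpyxl's read-only streaming mode and that the /dbfs/ FUSE path is available; the output file name is a placeholder):

import csv
from openpyxl import load_workbook

# Stream the worksheet row by row instead of loading the whole workbook.
wb = load_workbook("/dbfs/mnt/aadata/ds/data/test.xlsx", read_only=True)
ws = wb.active

with open("/dbfs/mnt/aadata/ds/data/test.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in ws.iter_rows(values_only=True):
        writer.writerow(row)
wb.close()

# Spark handles large CSV files comfortably.
df = spark.read.option("header", "true").csv("dbfs:/mnt/aadata/ds/data/test.csv")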
I got an Excel file from work which I amended using pandas. It has 735719 rows × 31 columns; I made the necessary changes and assigned them to a new dataframe. Now I need to write this dataframe back out in Excel format. I have checked in Jupyter notebooks that ont_dub works and displays as a dataframe, so I use ont_dub.to_excel("ont_dub 2019.xlsx"), as I always do.
Normally this would only take a few seconds, but it has now been 40 minutes and it is still running. Side note: I am working in a OneDrive folder from work, but that hasn't caused issues before. Hopefully someone can spot the problem.
Usually, if you want to save that amount of data in a local folder, you don't use Excel. If I am not mistaken, Excel has a known limit on the number of cells it can display, and it wasn't built to display and query such massive amounts of data (you can use pandas for that). You can either use Feather files (a known fast-save alternative) or CSV files, which are built for exactly this purpose.
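For example, a minimal sketch of writing the same dataframe to Feather or CSV instead (ont_dub is the dataframe from the question; the Feather path assumes pyarrow is installed):

# Feather is a fast binary format; reset_index avoids issues with non-default indexes.
ont_dub.reset_index(drop=True).to_feather("ont_dub 2019.feather")

# CSV is slower than Feather but universally readable.
ont_dub.to_csv("ont_dub 2019.csv", index=False)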
I'm currently working on a project in which I have big TSV files that I need to import into a database. I need a NoSQL database, so I chose ArangoDB. With Arango, I can't import TSV files from Python, only JSON documents, but I can import TSV files with PowerShell.
The files are around 1 GB, and once they are imported I will need to do daily updates to the database from the same TSV files, which may have been modified.
What are the best options?
Convert the TSV files to JSON with pandas in my Python program, then bulk import (I think this causes memory issues; see the sketch below)
Just use with open() and insert documents line by line (again a memory issue, but you may have a solution)
Use the PowerShell route to import the data. The only problem with that is that I use Docker, so I can't simply run a PowerShell script.
Use another database
Also, what would be my best bet for the daily updates, taking into consideration the memory issues?
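For option 1, a minimal sketch of keeping memory bounded by reading the TSV in chunks and bulk-inserting each chunk (it assumes the python-arango client; the connection details, file name, collection name, and chunk size are placeholders):

import pandas as pd
from arango import ArangoClient

# Placeholder connection details.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("mydb", username="root", password="password")
collection = db.collection("documents")

# Read the ~1 GB TSV in chunks so the whole file never sits in memory.
for chunk in pd.read_csv("data.tsv", sep="\t", chunksize=50_000):
    docs = chunk.to_dict(orient="records")
    # import_bulk sends the whole chunk in one request; on_duplicate="update"
    # is one way to handle the daily re-imports, provided each document has a _key.
    collection.import_bulk(docs, on_duplicate="update")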
Good afternoon!
While using pandas to read CSV data files larger than 500 MB from my Google Drive, instead of getting the CSV file I receive the "can't scan large file for viruses" HTML page. I've tried a lot of things but can't find any workaround. Can anyone tell me if it's possible to bypass that?
Sample file:- https://drive.google.com/file/d/1EQbD11iRnbXVJMZNTVExfrRP5WYIcAjk/view
[Screenshot of the error]
PS: can someone also suggest a better (preferably free) service for uploading multiple big CSV files so that I can read the data from them with pandas... I have >40 GB of data to work with.
Thanks :)
I found this and it's working for me as of 14/10/2020, even though it has since been removed from the documentation: http://web.archive.org/web/20190621105530/https://developers.google.com/drive/api/v3/manage-downloads
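The gist of that page is to fetch the file through the Drive API rather than through the sharing link, which skips the virus-scan interstitial. A rough sketch of that route (it assumes google-api-python-client and google-auth are installed and that a service-account JSON with access to the file is available; the file ID is taken from the sample link above):

import io
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# "service_account.json" is a placeholder for your own credential file.
creds = service_account.Credentials.from_service_account_file(
    "service_account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
service = build("drive", "v3", credentials=creds)

file_id = "1EQbD11iRnbXVJMZNTVExfrRP5WYIcAjk"  # ID from the sample link
request = service.files().get_media(fileId=file_id)

# Download the raw bytes in chunks into an in-memory buffer.
buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, request)
done = False
while not done:
    status, done = downloader.next_chunk()

buffer.seek(0)
df = pd.read_csv(buffer)  # the actual CSV, not the virus-scan HTML page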
I have downloaded a subset of the Million Song Dataset, which is about 2 GB. However, the data is broken down into folders and subfolders, and in the subfolders the files are all in H5 format. I understand they can be read using Python, but I do not know how to extract them and load them into HDFS so I can run some data analysis in Pig.
Do I extract them as CSV and load them into HBase or Hive? It would help if someone could point me to the right resource.
If it's already in CSV or any other format on the Linux file system that Pig can understand, just do a hadoop fs -copyFromLocal to HDFS.
If you want to read/process the raw H5 file format using Python on HDFS, look at hadoop-streaming (map/reduce).
Python can handle 2 GB on a decent Linux system; I'm not sure you even need Hadoop for it.
Don't load that many small files into HDFS. Hadoop doesn't handle lots of small files well: each small file incurs overhead because the block size (usually 64 MB) is much bigger.
I want to do this myself, so I'm thinking about solutions. The Million Song Dataset files are no more than 1 MB each. My approach will be to aggregate the data somehow before importing it into HDFS, as sketched below.
The blog post "The Small Files Problem" from Cloudera may shed some light.
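One way to do that aggregation is to flatten every H5 file into rows of one large CSV and copy that single file to HDFS; a rough sketch (it assumes h5py is installed, and the dataset paths and field names are assumptions about the Million Song file layout that should be checked against an actual file):

import csv
import glob
import h5py

# Walk every .h5 file under the extracted subset and append one row per song.
with open("songs.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["title", "artist_name", "duration"])
    for path in glob.glob("MillionSongSubset/**/*.h5", recursive=True):
        with h5py.File(path, "r") as f:
            # Assumed layout: compound tables at /metadata/songs and /analysis/songs.
            meta = f["metadata"]["songs"][0]
            analysis = f["analysis"]["songs"][0]
            writer.writerow([
                meta["title"].decode("utf-8", "replace"),
                meta["artist_name"].decode("utf-8", "replace"),
                float(analysis["duration"]),
            ])

The resulting songs.csv can then be pushed to HDFS with hadoop fs -copyFromLocal and read directly from Pig.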