How to convert CSV to parquet file without RLE_DICTIONARY encoding? - python

I've already tested three ways of converting a CSV file to a parquet file; you can find them below. All three created the parquet file. I tried to view the contents of the parquet file using "APACHE PARQUET VIEWER" on Windows, and I always got the following error message:
"encoding RLE_DICTIONARY is not supported"
Is there any way to avoid this?
Maybe a way to use another type of encoding?
The code is below:
1º Using pandas:
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet")
2º Using pyarrow:
from pyarrow import csv, parquet
table = csv.read_csv("filename.csv")
parquet.write_table(table, "filename.parquet")
3º Using dask:
from dask.dataframe import read_csv
dask_df = read_csv("filename.csv", dtype={'column_xpto': 'float64'})
dask_df.to_parquet("filename.parquet")

You should set use_dictionary to False. With the pyarrow engine, pandas passes this keyword through to the parquet writer:
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet", use_dictionary=False)

Related

I can't read a parquet file with the pandas read_parquet function

When I use pd.read_parquet to read a parquet file, this error is displayed.
My code:
import pandas as pd
df = pd.read_parquet("fhv_tripdata_2018-05.parquet")
error:
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 32094334800000000
I want to convert this file to csv:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2018-05.parquet
Please provide a minimal example, i.e., a small parquet file that generates the error.
It seems there are some open issues with this. pandas traditionally stores timestamps with nanosecond resolution, which only covers roughly the years 1677 to 2262, and the file contains microsecond timestamps outside that range, so the cast overflows. Thus, I propose a solution that uses pyarrow directly:
import pyarrow.parquet as pq
table = pq.read_table('fhv_tripdata_2018-05.parquet')
df = table.to_pandas(timestamp_as_object=True)
# to csv
df.to_csv('data.csv')
Note the extra flag timestamp_as_object, which prevents the overflow you observed by returning the timestamps as Python datetime objects.

Python: Reading tdms files using Python npTDMS and creating a Pandas dataframe

I'm able to read a LabVIEW .tdms file using the Python npTDMS package, and can read metadata and sample data for groups and channels as well.
However, the file has timestamp values with the year '9999', so I get the following error while converting to a pandas dataframe:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp:.
I went through the documentation in:
https://nptdms.readthedocs.io/en/stable/apireference.html#nptdms.TdmsFile.as_dataframe; however, couldn't find an option to deal with this data situation.
Passing errors='coerce' when calling as_dataframe() didn't work either. Any pointers or directions for reading the .tdms file into a pandas dataframe, given this data situation, would be very helpful.
Changing the data at the source is not an option.
Code snippet to read tdms file:
import numpy as np
import pandas as pd
from nptdms import TdmsFile as td
tdms_file = td.read(<tdms file name>)
tdms_file_df = tdms_file.as_dataframe()
Error while creating a pandas dataframe

pandas incorrectly parsing csv

I've created a CSV file from a dataframe with a shape of (1081, 165233). I did this using this command: df2.to_csv("better_matched.csv")
However, whenever I try to load this csv as a pandas dataframe later on, the shape of the dataframe becomes (660, 165234). This is the code that I'm using to load the csv:
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/Authorship/better_matched.csv")
df.shape
I think I need to change some parameters with .to_csv() or .read_csv() functions. What can be done?

Reading in a stata .dta file as a python pandas data frame using pd.read_stata()

I want to read in a .dta file as a pandas data frame.
I've tried using code from https://www.fragilefamilieschallenge.org/using-dta-files-in-python/ but it gives me an error.
Thanks for any help!
import pandas as pd
df_path = "https://zenodo.org/record/3635384/files/B-PROACT1V%20Year%204%20%26%206%20child%20BP%2C%20BMI%20and%20PA%20dataset.dta?download=1"
df = None
with open(df_path, "r") as f:
    df = pd.read_stata(f)
print df.head()
open can be used when you have a file saved locally on your machine. With pd.read_stata this is not necessary however, as you can specify the file path directly as a parameter.
In this case you want to read in a .dta file from a url so this does not apply. The solution is simple though, as pd.read_stata can read in files from urls directly.
import pandas as pd
url = 'https://zenodo.org/record/3635384/files/B-PROACT1V%20Year%204%20%26%206%20child%20BP%2C%20BMI%20and%20PA%20dataset.dta?download=1'
df = pd.read_stata(url)

How to read bz2 files into dataframes using pyspark?

I can read a JSON file into a dataframe in PySpark using
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.json("path to json file")
However, when I try to read a bz2 file (a compressed CSV) into a dataframe, it gives me an error. I am using:
spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.load("path to bz2 file")
Could you please help correct me?
The method spark.read.load() has an optional parameter format, which defaults to 'parquet'.
So, for your code to work, you need to pass the format explicitly. For a compressed JSON file it looks like this:
df = spark.read.load("data.json.bz2", format="json")
Also, spark.read.json works fine with compressed JSON files, e.g.:
df = spark.read.json("data.json.bz2")
Since your file is a bz2-compressed CSV, the same applies with format="csv" or spark.read.csv.
