I'm able to read a LabVIEW .tdms file using the Python npTDMS package, and can read metadata and sample data for groups and channels as well.
However, the file has timestamp values with the year '9999', so I get the following error while converting to a pandas dataframe:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp
I went through the documentation at
https://nptdms.readthedocs.io/en/stable/apireference.html#nptdms.TdmsFile.as_dataframe; however, I couldn't find an option to deal with this data situation.
Passing errors='coerce' when calling as_dataframe() didn't work either. Any pointers or directions for reading the .tdms file into a pandas dataframe, given this data, would be very helpful.
Changing the data at the source is not an option.
Code snippet to read tdms file:
import numpy as np
import pandas as pd
from nptdms import TdmsFile as td
tdms_file = td.read(<tdms file name>)
tdms_file_df = tdms_file.as_dataframe()
Error while creating a pandas dataframe
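For reference, the overflow can be reproduced and neutralised with plain pandas: errors='coerce' is not an option of as_dataframe(), but it does work at the pd.to_datetime level, turning out-of-range values into NaT. A minimal sketch, assuming the timestamp channel can first be read into a plain list or array (e.g. via channel[:]):

```python
import pandas as pd

# Year-9999 sentinel values overflow pandas' nanosecond datetime range.
vals = ["2021-06-01 12:00:00", "9999-12-31 23:59:59"]

# pd.to_datetime(vals) would raise OutOfBoundsDatetime here;
# errors="coerce" replaces the offending value with NaT instead.
converted = pd.to_datetime(vals, errors="coerce")
print(converted)
```

The coerced column can then be assembled into a dataframe alongside the other channels instead of relying on as_dataframe().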
When I use pd.read_parquet to read a parquet file, this error is displayed.
My code:
import pandas as pd
df = pd.read_parquet("fhv_tripdata_2018-05.parquet")
error:
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 32094334800000000
I want to convert this file to csv:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2018-05.parquet
Please provide a minimal example, i.e., a small parquet file that generates the error.
It seems there are some open issues with this: conventions are not compatible, and there are apparently pitfalls in reading/writing dates in parquet via pandas. I therefore propose a solution using pyarrow directly:
import pyarrow.parquet as pq
table = pq.read_table('fhv_tripdata_2018-05.parquet')
df = table.to_pandas(timestamp_as_object=True)
# to csv
df.to_csv('data.csv')
Note the extra flag timestamp_as_object, which prevents the overflow you observed. A pyarrow Table has no to_csv method, so the CSV is written from the resulting pandas dataframe.
I've already tested three ways of converting a CSV file to a parquet file; you can find them below. All three created the parquet file, but whenever I tried to view its contents using "Apache Parquet Viewer" on Windows, I got the following error message:
"encoding RLE_DICTIONARY is not supported"
Is there any way to avoid this?
Maybe by using another type of encoding?
Below the code:
1º Using pandas:
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet")
2º Using pyarrow:
from pyarrow import csv, parquet
table = csv.read_csv("filename.csv")
parquet.write_table(table, "filename.parquet")
3º Using dask:
from dask.dataframe import read_csv
dask_df = read_csv("filename.csv", dtype={'column_xpto': 'float64'})
dask_df.to_parquet("filename.parquet")
You should set use_dictionary to False:
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet", use_dictionary=False)
I've created a CSV file from a dataframe with a shape of (1081, 165233), using the command df2.to_csv("better_matched.csv").
However, whenever I try to load this CSV back into a pandas dataframe later on, its shape becomes (660, 165234). This is the code I'm using to load it:
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/Authorship/better_matched.csv")
df.shape
I think I need to change some parameters of the .to_csv() or .read_csv() functions. What can be done?
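One likely cause of the extra column is the index that to_csv writes by default, and unquoted or multiline text fields can change the row count. A small round-trip sketch (hedged: your actual data may differ):

```python
import pandas as pd

df2 = pd.DataFrame([["line1\nline2", 1], ["x", 2]], columns=["text", "n"])
df2.to_csv("better_matched.csv")  # the index becomes an extra first column

# index_col=0 drops that column again; pandas quotes the embedded
# newline on write and parses it back on read, so the shape round-trips.
back = pd.read_csv("better_matched.csv", index_col=0)
print(back.shape)  # (2, 2), same as df2.shape
```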
I am trying to import a txt file which has around 56 columns of different data types.
A few columns have values prefixed with 000, which I can no longer see once the data has been imported.
I am also getting the warning message "specify dtype option on import or set low_memory=False".
Values in certain columns have changed to "NaN" and "4.40578e+01", which is not correct.
I want the data to be imported and displayed correctly.
This is the code that I am using:
from os import path
import numpy as np
import pandas as pd
df=pd.read_csv(r"C:\Users\abc\desktop\file.txt",sep=",")
df.head()
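To keep the leading zeros, those columns must be read as strings, and low_memory=False addresses the warning. A minimal sketch with inline data standing in for file.txt (the column name "code" is made up):

```python
import pandas as pd
from io import StringIO

csv_text = "code,value\n000123,1.5\n000456,2.5\n"

# dtype=str for the zero-padded column keeps "000123" intact;
# low_memory=False reads the file in one pass, silencing the warning.
df = pd.read_csv(StringIO(csv_text), dtype={"code": str}, low_memory=False)
print(df["code"].tolist())  # ['000123', '000456']
```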
I have a large dataframe in a CSV file, sample1. From it, I have to generate a new CSV file containing only the first 100 rows. I wrote the code below, but I am getting a KeyError: "the label [100] is not in the index".
This is what I have tried; any help would be appreciated:
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv")
data_frame1 = data_frame[:100]
data_frame.to_csv("C:/users/raju/sample.csv")
The correct syntax is with iloc:
data_frame.iloc[:100]
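Applied to the snippet above, with a toy frame standing in for sample1.csv:

```python
import pandas as pd

data_frame = pd.DataFrame({"a": range(500)})
data_frame1 = data_frame.iloc[:100]   # positional slice: rows 0..99
data_frame1.to_csv("sample.csv", index=False)
print(data_frame1.shape)  # (100, 1)
```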
A more efficient way is to use the nrows argument, whose purpose is exactly to read only a portion of a file. This way you avoid wasting resources and time parsing rows you don't need:
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv", nrows=101) # 100+1 for header
data_frame.to_csv("C:/users/raju/sample.csv")