Auto convert strings and float columns using genfromtxt from numpy/python

I have several different data files that I need to import using genfromtxt. Each data file has different content. For example, file 1 may have all floats, file 2 may have all strings, and file 3 may have a combination of floats and strings, etc. Also, the number of columns varies from file to file, and since there are hundreds of files, I don't know which columns are floats and which are strings in each file. However, all the entries in each column are the same data type.
Is there a way to set up a converter for genfromtxt that will detect the type of data in each column and convert it to the right data type?
Thanks!

If you're able to use the Pandas library, pandas.read_csv is much more generally useful than np.genfromtxt and will automatically handle the kind of type inference mentioned in your question. The result will be a dataframe, but you can get a numpy array out of it in one of several ways, e.g.:
import pandas as pd
data = pd.read_csv(filename)
# get a numpy array; this will be an object array if data has mixed/incompatible types
arr = data.values
# get a record array; this is how numpy handles mixed types in a single array
arr = data.to_records()
pd.read_csv has dozens of options for various forms of text inputs; see more in the pandas.read_csv documentation.
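As a quick illustration of that inference (the file name and column layout here are hypothetical):
import pandas as pd

data = pd.read_csv('mixed.csv')  # e.g. one float column, one string column
print(data.dtypes)  # float columns come back as float64, string columns as object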

Related

Pandas DataFrame saved as HDF5 files are extremely large when containing a column with string values

When I create a pandas DataFrame with a column that contains a string and save it as HDF5, the file size seems extremely large. The following code produces a file with a size of 1’063’224 Bytes.
from pathlib import Path
import pandas as pd
data_frame = pd.DataFrame({'foo': ['bar']})
data_frame.to_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz', mode='w')
When I replace the 'bar' with a 1, the file size shrinks down to 7’032 Bytes, which seems (more) reasonable.
Does anyone know where that megabyte of data comes from?
The problem is that the dataframe is of dtype object, since strings have a variable length, which is not permitted in HDF5.
The column can be converted to fixed-length strings with a length parameter:
data_frame['foo'] = data_frame['foo'].astype('|S80')  # max string length set to 80 bytes
Using this conversion, the file is even smaller than the integer example, at 7’024 Bytes?!
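Note that with this fixed-width conversion the column round-trips as bytes rather than str, so a decode step is needed after reading. A minimal sketch, assuming the same file and key as above:
import pandas as pd
from pathlib import Path

df = pd.read_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz')
# the fixed-width column comes back as bytes (e.g. b'bar'); decode it back to str
df['foo'] = df['foo'].str.decode('utf-8')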

Read quoted fields in CSV as strings (pandas)

Context
I have a pandas dataframe which I need to save to disk and re-load later. Because the file saved on disk needs to be human-readable, I'm currently saving the dataframe as a CSV. The data includes values that are integers, booleans, null/None, timestamps, and strings.
Problem
Some of the string values are phone numbers, formatted as "+12025550140", but these are being converted to integers by the round trip (dataframe -> CSV -> dataframe). I need them to stay as strings.
I've changed the CSV writing portion to use quoting=csv.QUOTE_NONNUMERIC, which preserves the format of the phone numbers into the CSV, but when they are read back into a dataframe they are converted to integers. If I tell the CSV reading portion to also use quoting=csv.QUOTE_NONNUMERIC, then they are converted to floats.
How do I enforce that quoted fields are loaded as strings? Or, is there any other way to enforce that the full process is type-safe?
Constraints and non-constraints
The file saved to disk must be easy to manually edit, preferably with a plain text editor. I have full control and ownership over the code which generates the CSV file. A different file format can be used if it is easy to apply manual edits.
Code
Writing to disk:
import csv
df = get_df() # real function has been replaced
df.to_csv(query_file_path, index=False, quoting=csv.QUOTE_NONNUMERIC)
Reading from disk:
import pandas as pd
CSV_NA_VALS = pd._libs.parsers.STR_NA_VALUES
CSV_NA_VALS.remove("")
df = pd.read_csv(query_file_path, na_values=CSV_NA_VALS)
df = df.replace([""], [None])
Versions
Python 3.9.5
pandas==1.4.0
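One way to make the read side type-safe is to declare the dtype explicitly, so read_csv performs no inference on that column. A minimal sketch, assuming the phone-number column is named "phone" (a hypothetical name):
import pandas as pd

# force the quoted phone-number column to stay a string regardless of quoting
df = pd.read_csv(query_file_path, dtype={"phone": str})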

How to Translate CSV Data into TFRecord Files

Currently I am working on a system that can take data from a CSV file and import it into a TFRecord file; however, I have a few questions.
For starters, I need to know what types a TFRecord file can take, since type information is lost when using CSV.
Secondly, how can I convert dtype object into a type that a TFRecord can take?
I have two columns (example posted below) whose object dtype values are strings. How can I convert that data to the correct type for TFRecords?
When importing, I'm hoping to append data one row at a time into the TFRecord file; any advice or documentation would be great. I have been looking at this problem for some time, and it seems only ints and floats can be put into a TFRecord, but what about a list/array of integers?
Thank you for reading!
Quick note: I am using pandas to create a dataframe from the CSV file.
Some example code I'm using:
import pandas as pd
from ast import literal_eval
import numpy as np
import tensorflow as tf

tf.compat.v1.enable_eager_execution()

def Start():
    # raw string so the backslashes in the path are taken literally
    db = pd.read_csv(r"I:\Github\ClubKeno\Keno Project\Database\..\LotteryDatabase.csv")
    print(db['Winning_Numbers'])
    print(db.dtypes)
    training_dataset = (
        tf.data.Dataset.from_tensor_slices(
            (
                tf.cast(db['Draw_Number'].values, tf.int64),
                tf.cast(db['Winning_Numbers'].values, tf.int64),
                tf.cast(db['Extra_Numbers'].values, tf.int64),
                tf.cast(db['Kicker'].values, tf.int64)
            )
        )
    )
    for features_tensor, target_tensor in training_dataset:
        print(f'features:{features_tensor} target:{target_tensor}')
Error Message: (posted as an image in the original question)
CSV Data: (sample posted as an image in the original question)
Update:
Got two columns of data working using the following function...
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,
    column_names=['Draw_Number', 'Kicker'],
    column_defaults=[tf.int64, tf.int64],
)
However, when trying to include my two other object-dtype columns, I get an error. Here is what the data looks like in both of those columns:
"3,9,11,16,25,26,28,29,36,40,41,46,63,66,67,69,72,73,78,80"
And here is the call I tried for that:
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,
    column_names=['Draw_Number', 'Winning_Numbers', 'Extra_Numbers', 'Kicker'],
    column_defaults=[tf.int64, tf.compat.as_bytes, tf.compat.as_bytes, tf.int64],
    header=True,
    batch_size=100,
    field_delim=',',
    na_value='NA'
)
This Error Appears:
TypeError: Failed to convert object of type <class 'function'> to Tensor. Contents: <function as_bytes at 0x000000EA530908C8>. Consider casting elements to a supported type.
Should I try to cast those two types outside the function and combine them later into the TFRecord file alongside the tf.data from the make_csv_dataset function?
For starters, I need to know what types a TFRecord file can take, since type information is lost when using CSV.
A TFRecord accepts the following data types:
string, byte, float32, float64, bool, enum, int32, int64, uint32, uint64
Discussed here.
Secondly, how can I convert dtype object into a type that a TFRecord can take?
Here is an example from TF; it is a bit complicated to digest at once, but if you read it carefully it is easy.
I have two columns (example posted below) whose object dtype values are strings. How can I convert that data to the correct type for TFRecords?
For string-type data, you need tf.train.BytesList, which wraps a string's bytes in a bytes_list.
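As a minimal sketch (the output file name is hypothetical; the sample string comes from the question):
import tensorflow as tf

# one string cell from the CSV
value = "3,9,11,16,25,26,28,29,36,40,41,46,63,66,67,69,72,73,78,80"

# wrap the string's UTF-8 bytes in a BytesList-backed Feature
feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[value.encode('utf-8')])
)
example = tf.train.Example(
    features=tf.train.Features(feature={'Winning_Numbers': feature})
)

# each row becomes one serialized Example written to the TFRecord file
with tf.io.TFRecordWriter('lottery.tfrecord') as writer:
    writer.write(example.SerializeToString())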
When importing, I'm hoping to append data one row at a time into the TFRecord file; any advice or documentation would be great. I have been looking at this problem for some time, and it seems only ints and floats can be put into a TFRecord, but what about a list/array of integers?
Quick note: I am using pandas to create a dataframe from the CSV file.
Instead of reading the csv file using Pandas, I would recommend you use tf.data.experimental.make_csv_dataset, defined here. This will make the conversion process much faster than Pandas and will give you fewer compatibility issues when working with TF classes. If you use this function, you will not need to read the csv file row by row; it reads the file all at once, and map() (which uses eager execution) can be applied to the resulting dataset. This is a good tutorial to get started.
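On the TypeError in your update: column_defaults expects per-column dtypes (or default-value tensors), not conversion functions such as tf.compat.as_bytes. A sketch of the corrected call (untested against your exact file) would pass tf.string for the two string columns:
import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,
    column_names=['Draw_Number', 'Winning_Numbers', 'Extra_Numbers', 'Kicker'],
    # dtypes, not functions: tf.string keeps the number lists as raw strings
    column_defaults=[tf.int64, tf.string, tf.string, tf.int64],
    header=True,
    batch_size=100,
    field_delim=',',
    na_value='NA'
)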

Numpy CSV fromfile()

I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
By using it without a sep parameter, numpy assumes you are reading a binary file, so the CSV's text bytes are reinterpreted as float64 values; the length of 1,352,244 is simply the file size in bytes divided by 8. When you specify a separator, parsing most likely fails at the non-numeric header row ('a,b'), so an empty array is returned.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
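For example, a minimal sketch reading the two-column file generated above (skipping the 'a,b' header row):
import numpy as np

# delimiter handles the commas; skiprows=1 skips the header line
arr = np.loadtxt('my_file.csv', delimiter=',', skiprows=1)
print(arr.shape)  # (1000000, 2)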

How to speed up importing dataframes into pandas

I understand that one of the reasons why pandas can be relatively slow importing csv files is that it needs to scan the entire content of a column before guessing the type (see the discussions around the mostly deprecated low_memory option for pandas.read_csv). Is my understanding correct?
If it is, what would be a good format in which to store a dataframe, and which explicitly specifies data types, so pandas doesn't have to guess (SQL is not an option for now)?
Any option in particular from those listed here?
My dataframes have floats, integers, dates, strings and Y/N, so formats supporting numeric values only won't do.
One option is to use numpy.genfromtxt with delimiter=',', names=True, and then initialize the pandas dataframe from the resulting numpy array. The numpy array will be structured, and the pandas constructor should automatically set the field names.
In my experience this performs well.
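A minimal sketch of that approach (the file name is a placeholder):
import numpy as np
import pandas as pd

# dtype=None makes genfromtxt infer a type per column, yielding a structured
# array; names=True takes the field names from the header row
arr = np.genfromtxt('data.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')
df = pd.DataFrame(arr)
print(df.dtypes)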
You can improve the efficiency of importing from a CSV file by specifying column names and their datatypes to your call to pandas.read_csv. If you have existing column headers in the file, you probably don't have to specify the names and can just use those, but I like to skip the header and specify names for completeness:
import pandas as pd
import numpy as np

col_names = ['a', 'b', 'whatever', 'your', 'names', 'are']
col_types = {k: np.int32 for k in col_names}  # create the type dict
col_types['a'] = 'object'  # can change whichever ones you like
df = pd.read_csv(fname,
                 header=None,     # since we are specifying our own names
                 skiprows=[0],    # if you *do* have a header row, skip it
                 names=col_names,
                 dtype=col_types)
On a large sample dataset comprising mostly integer columns, this was about 20% faster than specifying dtype='object' in the call to pd.read_csv for me.
I would consider either HDF5 format or Feather Format. Both of them are pretty fast (Feather might be faster, but HDF5 is more feature rich - for example reading from disk by index) and both of them store the type of columns, so they don't have to guess dtypes and they don't have to convert data types (for example strings to numerical or strings to datetimes) when loading data.
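As a minimal sketch of the Feather round trip (requires the pyarrow package; the file name is a placeholder):
import pandas as pd

df = pd.DataFrame({'x': [1.5, 2.5], 'flag': ['Y', 'N']})
df.to_feather('data.feather')          # column dtypes are stored in the file
df2 = pd.read_feather('data.feather')  # loads with the same dtypes, no guessing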
Here are some speed comparisons:
which is faster for load: pickle or hdf5 in python
What is the fastest way to upload a big csv file in notebook to work with python pandas?
