How to Translate CSV Data into TFRecord Files - python

Currently I am working on a system that takes data from a CSV file and imports it into a TFRecord file. However, I have a few questions.
For starters, I need to know what types a TFRecord file can accept, since the types are lost when using CSV.
Secondly, how can I convert data of dtype object into a type that a TFRecord can take?
I have two columns (example posted below) whose object dtype is really strings. How can I convert that data to the correct type for TFRecords?
When importing, I am hoping to append data from one row at a time into the TFRecord file. Any advice or documentation would be great. I have been looking at this problem for some time and it seems only ints and floats can go into a TFRecord, but what about a list/array of integers?
Thank you for reading!
Quick note: I am using pandas to create a DataFrame from the CSV file.
Some example code I'm using:
import pandas as pd
from ast import literal_eval
import numpy as np
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
def Start():
    db = pd.read_csv(r"I:\Github\ClubKeno\Keno Project\Database\..\LotteryDatabase.csv")
    print(db['Winning_Numbers'])
    print(db.dtypes)

    training_dataset = (
        tf.data.Dataset.from_tensor_slices(
            (
                tf.cast(db['Draw_Number'].values, tf.int64),
                tf.cast(db['Winning_Numbers'].values, tf.int64),
                tf.cast(db['Extra_Numbers'].values, tf.int64),
                tf.cast(db['Kicker'].values, tf.int64)
            )
        )
    )

    for features_tensor, target_tensor in training_dataset:
        print(f'features:{features_tensor} target:{target_tensor}')
Error message: (posted as a screenshot)
CSV data: (posted as a screenshot)
Update:
Got two columns of data working using the following function...
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,
    column_names=['Draw_Number', 'Kicker'],
    column_defaults=[tf.int64, tf.int64],
)
However, when I try to include my other two object-type columns
(this is what the data looks like in both of those columns):
"3,9,11,16,25,26,28,29,36,40,41,46,63,66,67,69,72,73,78,80"
I get an error. Here is the function call I tried for that:
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,
    column_names=['Draw_Number', 'Winning_Numbers', 'Extra_Numbers', 'Kicker'],
    column_defaults=[tf.int64, tf.compat.as_bytes, tf.compat.as_bytes, tf.int64],
    header=True,
    batch_size=100,
    field_delim=',',
    na_value='NA'
)
This error appears:
TypeError: Failed to convert object of type <class 'function'> to Tensor. Contents: <function as_bytes at 0x000000EA530908C8>. Consider casting elements to a supported type.
Should I try to cast those two types outside the function and combine them later into the TFRecord file alongside the tf.data dataset from the make_csv_dataset function?

For starters, I need to know what types a TFRecord file can accept, since the types are lost when using CSV.
A TFRecord accepts the following data types:
string, byte, float32, float64, bool, enum, int32, int64, uint32, uint64
This is discussed here.
Secondly, how can I convert data of dtype object into a type that a TFRecord can take?
Here is an example from TF; it is a bit much to digest at once, but if you read it carefully it is straightforward.
I have two columns (example posted below) whose object dtype is really strings. How can I convert that data to the correct type for TFRecords?
For string-type data you need tf.train.BytesList, which builds a bytes_list from an encoded Python string.
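As a minimal sketch (not the poster's exact setup): assuming the pandas DataFrame and the four column names from the question, with placeholder file names, each row becomes one tf.train.Example appended to the TFRecord file, and a list/array of integers fits naturally into a tf.train.Int64List:

import pandas as pd
import tensorflow as tf

db = pd.read_csv("LotteryDatabase.csv")  # placeholder path

def _int64_feature(values):
    # Int64List takes a list, so a single int and a list of ints both work.
    if not isinstance(values, (list, tuple)):
        values = [values]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def _bytes_feature(text):
    # BytesList wants bytes, so encode the Python string first.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[text.encode("utf-8")]))

with tf.io.TFRecordWriter("lottery.tfrecord") as writer:  # placeholder output name
    for _, row in db.iterrows():
        example = tf.train.Example(features=tf.train.Features(feature={
            "Draw_Number": _int64_feature(int(row["Draw_Number"])),
            # Option A: keep the comma-separated string as bytes.
            "Winning_Numbers": _bytes_feature(row["Winning_Numbers"]),
            # Option B: parse the string into a variable-length list of ints.
            "Extra_Numbers": _int64_feature([int(n) for n in row["Extra_Numbers"].split(",")]),
            "Kicker": _int64_feature(int(row["Kicker"])),
        }))
        writer.write(example.SerializeToString())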
When importing, I am hoping to append data from one row at a time into the TFRecord file. Any advice or documentation would be great. I have been looking at this problem for some time and it seems only ints and floats can go into a TFRecord, but what about a list/array of integers?
Quick note: I am using pandas to create a DataFrame from the CSV file.
Instead of reading the csv file using Pandas, I would recommend you use tf.data.experimental.make_csv_dataset, defined here. This will make the conversion process much faster than Pandas and give you fewer compatibility issues when working with TF classes. If you use this function, you will not need to read the csv file row by row; you can process it all at once with map(), which uses eager execution. This is a good tutorial to get started.
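As a sketch of that route (again using the column names from the question and a placeholder path): the TypeError above comes from passing the function tf.compat.as_bytes as a column default; column_defaults expects dtypes or default values, so tf.string works for the two text columns, and the comma-separated strings can be split afterwards:

import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="LotteryDatabase.csv",       # placeholder path
    column_names=['Draw_Number', 'Winning_Numbers', 'Extra_Numbers', 'Kicker'],
    # dtypes (or default values), not functions, go in column_defaults
    column_defaults=[tf.int64, tf.string, tf.string, tf.int64],
    label_name='Kicker',
    batch_size=100,
    header=True,
    num_epochs=1,
)

for features, label in dataset.take(1):
    # Split the comma-separated strings and convert them to integers.
    winning = tf.strings.to_number(
        tf.strings.split(features['Winning_Numbers'], ','), out_type=tf.int64)
    print(features['Draw_Number'], winning, label)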

Related

How to convert predicted CSV to XML in Python (Jupyter notebook)

I am predicting the output data using ARIMA.
The output is saved in CSV.
I need the output to be stored in XML format.
import numpy as np
import pandas as pd

prediction = pd.DataFrame(predictions, columns=['sl.no', 'predicted_freq']).to_csv('prediction.csv')
Before calling the to_csv() method, you have a pandas DataFrame. To convert one of these to XML, there are solutions (though not out of the box). See e.g. [here](https://stackabuse.com/reading-and-writing-xml-files-in-python-with-pandas/#writingxmlfileswithlxml).
You may have to ask yourself, though, how exactly your XML needs to be structured.
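A minimal sketch along the lines of the linked lxml approach; the column names come from the question, the element names are made up and should be adjusted to whatever schema you need, and a dummy stand-in is used for the ARIMA output (newer pandas, 1.3+, also ships a built-in DataFrame.to_xml):

import pandas as pd
from lxml import etree

# Stand-in for the ARIMA predictions from the question.
predictions = [[1, 0.25], [2, 0.50]]
df = pd.DataFrame(predictions, columns=['sl.no', 'predicted_freq'])

root = etree.Element('predictions')
for _, row in df.iterrows():
    item = etree.SubElement(root, 'prediction')
    etree.SubElement(item, 'slno').text = str(row['sl.no'])
    etree.SubElement(item, 'predicted_freq').text = str(row['predicted_freq'])

# Write a human-readable XML file next to the CSV output.
etree.ElementTree(root).write('prediction.xml', pretty_print=True,
                              xml_declaration=True, encoding='utf-8')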

Numpy CSV fromfile()

I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
By using it without a sep parameter, numpy assumes you are reading a binary file, hence the unexpected length. When you specify a separator, the text parser stops at the first token it cannot read as a number (the 'a,b' header here), which is why you get a length of 0.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
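A quick sketch against the my_file.csv generated above (header row skipped so the values parse as floats):

import numpy as np

# genfromtxt: skip the 'a,b' header row, split on commas.
arr = np.genfromtxt('my_file.csv', delimiter=',', skip_header=1)
print(arr.shape)   # expected: (1000000, 2)

# loadtxt does the same job here and is typically a bit faster.
arr2 = np.loadtxt('my_file.csv', delimiter=',', skiprows=1)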

exporting dataframe into dataframe format to pass as argument into next program

I have certain computations performed on a dataset and I need the result to be stored in an external file.
Had it been CSV, then to process it further I would have to convert it back to a DataFrame/SFrame, which again increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is an SFrame and can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python script. That script must accept a DataFrame and not a CSV.
I have already checked out place1, place2, place3, place4 and place5.
P.S. I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported by Pandas and by graphlab.SFrame, and besides that HDF5 is very fast.
Alternatively, you can export the Pandas DataFrame to a pickle file and read it back from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
to read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')
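A sketch of the HDF5 route mentioned above, reusing the train_data SFrame from the question (pandas' HDF5 support needs the PyTables package; the file name and key are placeholders):

import pandas as pd

# Script 1: convert the SFrame and write it to HDF5.
df_train = train_data.to_dataframe()              # SFrame -> pandas DataFrame, as in the question
df_train.to_hdf('train_data.h5', key='train', mode='w')

# Script 2: read it back with the structure intact, no CSV round trip.
df_train = pd.read_hdf('train_data.h5', key='train')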

Auto convert strings and float columns using genfromtxt from numpy/python

I have several different data files that I need to import using genfromtxt. Each data file has different content. For example, file 1 may have all floats, file 2 may have all strings, and file 3 may have a combination of floats and strings etc. Also the number of columns vary from file to file, and since there are hundreds of files, I don't know which columns are floats and strings in each file. However, all the entries in each column are the same data type.
Is there a way to set up a converter for genfromtxt that will detect the type of data in each column and convert it to the right data type?
Thanks!
If you're able to use the Pandas library, pandas.read_csv is much more generally useful than np.genfromtxt, and will automatically handle the kind of type inference mentioned in your question. The result will be a dataframe, but you can get out a numpy array in one of several ways. e.g.
import pandas as pd
data = pd.read_csv(filename)
# get a numpy array; this will be an object array if data has mixed/incompatible types
arr = data.values
# get a record array; this is how numpy handles mixed types in a single array
arr = data.to_records()
pd.read_csv has dozens of options for various forms of text inputs; see more in the pandas.read_csv documentation.
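For instance, a tiny illustration of the inference pandas does on a mixed file (hypothetical column names):

import io
import pandas as pd

# One float column and one string column, mirroring the mixed files described above.
csv_text = "height,name\n1.82,alice\n1.65,bob\n"
data = pd.read_csv(io.StringIO(csv_text))

print(data.dtypes)                    # height -> float64, name -> object (string)
arr = data.to_records(index=False)
print(arr.dtype)                      # record dtype with one float field and one string/object field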

How to force a python function to return a particular type of object?

I am using pandas to read a csv file and convert it into a numpy array. Earlier I was loading the whole file and was getting a memory error, so I went through this link and tried to read the file in chunks.
But now I am getting a different error which says:
AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader"
This is the code I am using:
>>> X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
>>> X = pd.concat(X_chunks, ignore_index=True)
The API reference for read_csv says that it returns either a DataFrame or a TextParser. The problem is that the concat function works fine if X_chunks is a DataFrame, but its type is TextParser here.
Is there any way I can force the return type for read_csv, or any workaround to load the whole file as a numpy array?
Since iterator=False is the default, and chunksize forces a TextFileReader object, may I suggest:
X_chunks = pd.read_csv('train_v2.csv')
But you don't want to materialize the list?
Final suggestion:
X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    analyze(chunk)
Where analyze is whatever process you've broken up to analyze the chunks piece by piece, since you apparently can't load the entire dataset into memory.
You can't use concat the way you're trying to; the reason is that it demands the data be fully materialized, which makes sense: you can't concatenate something that isn't there yet.
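As a sketch of that chunk-by-chunk pattern (the file name is from the question; 'target' is a hypothetical column name and the running-mean computation is just a stand-in for whatever analyze needs to do):

import pandas as pd

total = 0.0
count = 0

# Stream the file in 1000-row chunks so the full dataset never sits in memory.
for chunk in pd.read_csv('train_v2.csv', iterator=True, chunksize=1000):
    total += chunk['target'].sum()    # 'target' is a hypothetical column
    count += len(chunk)

print('mean of target:', total / count)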
