I am predicting output data using ARIMA. The output is currently saved to CSV, but I need it stored in XML format.
import numpy as np
import pandas as pd

# predictions comes from the fitted ARIMA model
prediction = pd.DataFrame(predictions, columns=['sl.no', 'predicted_freq']).to_csv('prediction.csv')
Before calling the to_csv() method, you have a pandas DataFrame. There are ways to convert one of these to XML, though not out of the box. See e.g. [here](https://stackabuse.com/reading-and-writing-xml-files-in-python-with-pandas/#writingxmlfileswithlxml).
You may have to ask yourself, though, how exactly your XML needs to be structured.
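For example, here is a minimal sketch using lxml, along the lines of the linked article. The element names and the placeholder predictions list are assumptions; note that XML tag names cannot contain characters like '.', so 'sl.no' is renamed to 'sl_no' here.

import pandas as pd
from lxml import etree

# placeholder data standing in for the ARIMA forecast
predictions = [(1, 0.42), (2, 0.37), (3, 0.51)]
df = pd.DataFrame(predictions, columns=['sl_no', 'predicted_freq'])

root = etree.Element('predictions')
for _, row in df.iterrows():
    item = etree.SubElement(root, 'prediction')
    etree.SubElement(item, 'sl_no').text = str(int(row['sl_no']))
    etree.SubElement(item, 'predicted_freq').text = str(row['predicted_freq'])

etree.ElementTree(root).write('prediction.xml', pretty_print=True,
                              xml_declaration=True, encoding='utf-8')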
I am attempting to read MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is through xlwings and pandas.
When I do this (below code) I get a mostly correct .csv file. The only issue is that some columns are appearing as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd

# open the .dat file in a hidden Excel instance
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')

# grab everything from A1 down to the last used row, columns A to AE
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))

test_dataframe = data.options(pd.DataFrame, header=True).value
test_dataframe.columns = list(schema)  # schema: list of column names defined elsewhere
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = Â?, 2 = #, 3 = #, etc.).
Below is the first part of the 'dictionary' showing how they map:
My question is this:
Is there an encoding that I can use to turn these series of symbols into their correct representation? The floats aren't the only column affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
MapInfo .dat files are dBase files underneath (see https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so you can just use that method.
Then just output the data:
df.to_csv('outpath/filename.csv')
EDIT
If I understand correctly, you are using xlwings to load the .dat file into Excel and then reading it into a pandas DataFrame to export it to a CSV file.
Somewhere along the way it does seem that some binary data is interpreted incorrectly (or not at all) and then written as text to your CSV file.
directly read dBase file
My first suggestion would be to try to read the input file directly into Python without going through an Excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. You can parse these in Python using a library like dbfread.
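A minimal sketch with dbfread might look like this (the file name is taken from your snippet, and I am assuming the library can read the MapInfo variant directly):

import pandas as pd
from dbfread import DBF

# read the MapInfo .dat file as a dBase table; each record comes back as a dict-like object
table = DBF('./cable.dat')
df = pd.DataFrame(iter(table))

print(df.dtypes)                      # numeric columns should come out as numbers, not text
df.to_csv('./output.csv', index=False)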
inspect data before writing to csv
Secondly, I would inspect the 'corrupted' columns in Python instead of immediately writing them to disk. Either something is going wrong in the Excel import and the data in these columns gets imported as text instead of some binary number format, or the data is read correctly into memory as a byte array (instead of a float) and, when you write it to CSV, it just gets dumped byte-wise to disk instead of being interpreted as a number and converted to a text representation.
note
Small remark about your initial question regarding mapping text to numbers:
It will probably not be possible to create a straightforward map of characters to numbers:
- These numbers could have any encoding and might not be stored as the decimal text values you now seem to assume.
- The text representations you see are just a decoding of the raw bytes using some character encoding (UTF-8, UTF-16, ...). In UTF-8, for example, several bytes may map to one character, and the question marks or squares you see may indicate characters that could not be decoded at all.
In any case, you will lose information if you start from the text; you must start from the binary data when decoding.
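To illustrate that last point, here is a tiny sketch of decoding raw bytes as a number rather than as text (the byte string is a made-up example of one little-endian IEEE-754 double):

import struct

raw_bytes = b'\x00\x00\x00\x00\x00\x00\xf0?'   # hypothetical 8 bytes pulled from the file
value, = struct.unpack('<d', raw_bytes)        # interpret as a little-endian double
print(value)                                   # 1.0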
Currently I am working on a system that takes data from a CSV file and imports it into a TFRecord file. However, I have a few questions.
For starters, I need to know what types a TFRecord file can take; when using CSV, the types are removed.
Secondly, how can I convert data of type object into a type that a TFRecord can take?
I have two columns (example posted below) whose object types are strings. How can I convert that data to the correct type for TFRecords?
When importing, I'm hoping to append data from one row at a time into the TFRecord file; any advice or documentation would be great. I have been looking at this problem for some time, and it seems only ints and floats can go into a TFRecord, but what about a list/array of integers?
Thank you for reading!
Quick note: I am using pandas to create a DataFrame from the CSV file.
Some example code I'm using:
import pandas as pd
from ast import literal_eval
import numpy as np
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
def Start():
    db = pd.read_csv(r"I:\Github\ClubKeno\Keno Project\Database\..\LotteryDatabase.csv")

    print(db['Winning_Numbers'])
    print(db.dtypes)

    training_dataset = (
        tf.data.Dataset.from_tensor_slices(
            (
                tf.cast(db['Draw_Number'].values, tf.int64),
                tf.cast(db['Winning_Numbers'].values, tf.int64),
                tf.cast(db['Extra_Numbers'].values, tf.int64),
                tf.cast(db['Kicker'].values, tf.int64)
            )
        )
    )

    for features_tensor, target_tensor in training_dataset:
        print(f'features:{features_tensor} target:{target_tensor}')
Error Message:
CSV Data
Update:
Got two columns of data working using the following function...
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,
    column_names=['Draw_Number', 'Kicker'],
    column_defaults=[tf.int64, tf.int64],
)
However, when trying to include my two other object-type columns (this is what the data looks like in both of those columns):
"3,9,11,16,25,26,28,29,36,40,41,46,63,66,67,69,72,73,78,80"
I get an error. Here is the call I tried for that:
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,
    column_names=['Draw_Number', 'Winning_Numbers', 'Extra_Numbers', 'Kicker'],
    column_defaults=[tf.int64, tf.compat.as_bytes, tf.compat.as_bytes, tf.int64],  # as_bytes is a function, not a dtype
    header=True,
    batch_size=100,
    field_delim=',',
    na_value='NA'
)
This error appears:
TypeError: Failed to convert object of type <class 'function'> to Tensor. Contents: <function as_bytes at 0x000000EA530908C8>. Consider casting elements to a supported type.
Should I try to cast those two types outside the function and combine them later into the TFRecord file, alongside the tf.data from the make_csv_dataset function?
For starters, I need to know what types a TFRecord file can take; when using CSV, the types are removed.
A TFRecord accepts the following data types:
string, byte, float32, float64, bool, enum, int32, int64, uint32, uint64
These are discussed here.
Secondly, how can I convert data of type object into a type that a TFRecord can take?
Here is an example from TF; it is a bit complicated to digest at once, but if you read it carefully it is easy.
I have two columns (example posted below) whose object types are strings. How can I convert that data to the correct type for TFRecords?
For string-type data, you need tf.train.BytesList, which builds a bytes_list from a string.
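As a rough sketch (the feature names and the comma-separated string format are assumptions based on your sample data), one row could be serialized like this; a string column can either be stored as bytes or pre-split into an Int64List:

import tensorflow as tf

def make_example(draw_number, winning_numbers, extra_numbers, kicker):
    # winning_numbers / extra_numbers arrive as strings like "3,9,11,..."
    feature = {
        'Draw_Number': tf.train.Feature(int64_list=tf.train.Int64List(value=[draw_number])),
        # option 1: keep the raw string as bytes
        'Winning_Numbers': tf.train.Feature(bytes_list=tf.train.BytesList(value=[winning_numbers.encode('utf-8')])),
        # option 2: split the string and store a list of integers
        'Extra_Numbers': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(n) for n in extra_numbers.split(',')])),
        'Kicker': tf.train.Feature(int64_list=tf.train.Int64List(value=[kicker])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter('lottery.tfrecord') as writer:
    example = make_example(1, "3,9,11,16,25", "2,4,6", 7)
    writer.write(example.SerializeToString())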
When importing, I'm hoping to append data from one row at a time into the TFRecord file; any advice or documentation would be great. I have been looking at this problem for some time, and it seems only ints and floats can go into a TFRecord, but what about a list/array of integers?
Quick note: I am using pandas to create a DataFrame from the CSV file.
Instead of reading the CSV file with pandas, I would recommend using tf.data.experimental.make_csv_dataset, defined here. This will make the conversion process much faster than pandas and will give you fewer compatibility issues with TF classes. If you use this function, you will not need to read the CSV file row by row; you can process it all at once with map(), which uses eager execution. This is a good tutorial to get started.
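A hedged sketch of that route, using the column names from your update: the string columns are loaded as tf.string and split into integer tensors afterwards, since a function like tf.compat.as_bytes is not a valid column default.

import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=databasefile,            # same variable as in your snippet
    batch_size=100,
    column_names=['Draw_Number', 'Winning_Numbers', 'Extra_Numbers', 'Kicker'],
    column_defaults=[tf.int64, tf.string, tf.string, tf.int64],
    header=True,
    num_epochs=1,
)

def parse_row(features):
    # turn "3,9,11,..." strings into (ragged) int64 tensors
    for col in ('Winning_Numbers', 'Extra_Numbers'):
        features[col] = tf.strings.to_number(
            tf.strings.split(features[col], ','), out_type=tf.int64)
    return features

dataset = dataset.map(parse_row)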
I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
By using it without a sep parameter, numpy assumes you are reading a binary file, hence the different lengths. When you specify a separator, I guess the function just breaks.
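To see what fromfile is actually meant for, here is a quick roundtrip sketch:

import numpy as np

# fromfile reads back exactly what tofile wrote: raw bytes, no header, no shape information
a = np.arange(5, dtype=np.int64)
a.tofile('raw.bin')
b = np.fromfile('raw.bin', dtype=np.int64)
print(b)   # [0 1 2 3 4]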
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
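For the file generated in your test code, something along these lines should work (skiprows=1 skips the header row that to_csv wrote):

import numpy as np

arr = np.loadtxt('my_file.csv', delimiter=',', skiprows=1)
print(arr.shape)   # (1000000, 2) -- a 2D array, one row per CSV line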
I have two datasets, in CSV and ARFF format, which I have been using in classification models in Weka. I was wondering if these formats can be used in scikit-learn to try other classification methods in Python.
This is what my dataset looks like:
ASSAY_CHEMBLID...MDEN.23...MA,TARGET_TYPE...No...MA,TARGET_TYPE...apol...MA,TARGET_TYPE...ATSm5...MA,TARGET_TYPE...SCH.6...MA,TARGET_TYPE...SPC.6...MA,TARGET_TYPE...SP.3...MA,TARGET_TYPE...MDEN.12...MA,TARGET_TYPE...MDEN.22...MA,TARGET_TYPE...MLogP...MA,TARGET_TYPE...R...MA,TARGET_TYPE...G...MA,TARGET_TYPE...I...MA,ORGANISM...No...MA,ORGANISM...C2SP1...MA,ORGANISM...VC.6...MA,ORGANISM...ECCEN...MA,ORGANISM...khs.aasC...MA,ORGANISM...MDEC.12...MA,ORGANISM...MDEC.13...MA,ORGANISM...MDEC.23...MA,ORGANISM...MDEC.33...MA,ORGANISM...MDEO.11...MA,ORGANISM...MDEN.22...MA,ORGANISM...topoShape...MA,ORGANISM...WPATH...MA,ORGANISM...P...MA,Lij
0.202796,0.426972,0.117596,0.143818,0.072542,0.158172,0.136301,0.007245,0.016986,0.488281,0.300438,0.541931,0.644161,0.048149,0.02002,0,0.503415,0.153457,0.288099,0.186024,0.216833,0.184642,0,0.011592,0.00089,0,0.209406,0
where Lij is my class identifier (0 or 1). I was wondering whether a prior transformation with numpy is needed.
To read ARFF files, you'll need to install liac-arff; see the link for details.
Once you have that installed, use the following code to read the ARFF file:
import arff
import numpy as np

# read arff data
with open("file.arff") as f:
    # arff.load reads the ARFF file as a dictionary with
    # the data as a list of lists under the key "data"
    dataDictionary = arff.load(f)

# extract data and convert to numpy array
arffData = np.array(dataDictionary['data'])
There are several ways in which CSV data can be read; I find the easiest is the read_csv function from the pandas module. See the link for details regarding installation.
The code for reading a CSV data file is below:
# read csv data
import pandas as pd
csvData = pd.read_csv("filename.csv",sep=',').values
In either case, you'll have a numpy array with your data. Since the last column represents the classes (target / ground truth / labels), you'll need to separate the data into a features array X and a target vector y, e.g.
X = arffData[:, :-1]
y = arffData[:, -1]
where X contains all the data in arffData except for the last column and y contains the last column of arffData.
Now you can use any supervised learning binary classifier from scikit-learn.
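For example, a minimal sketch with one of scikit-learn's classifiers (logistic regression is an arbitrary choice here):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# the ARFF route can leave you with an object-dtype array, so cast explicitly
X = X.astype(float)
y = y.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out 20%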
I have certain computations performed on a dataset and I need the result to be stored in an external file.
If I exported it to CSV, then to process it further I'd have to convert it back to a DataFrame/SFrame, which again increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is an SFrame and can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python script. That script must accept a DataFrame, not a CSV.
I have already checked place1, place2, place3, place4 and place5.
P.S. I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported both by pandas and by graphlab.SFrame, and besides that, HDF5 is very fast.
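For instance, a minimal sketch of the HDF5 route (the path and key are placeholders, and pandas needs the PyTables package for this):

import pandas as pd

df = sf.to_dataframe()                                  # sf is your SFrame / train_data
df.to_hdf(r'/path/to/train_data.h5', key='train')       # requires PyTables

# later, in another script:
df = pd.read_hdf(r'/path/to/train_data.h5', 'train')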
Alternatively, you can export the pandas DataFrame to a pickle file and read it from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
To read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')