How to load training data in PyBrain?

I am trying to use PyBrain for some simple NN training. What I don't know how to do is load the training data from a file. It is not explained anywhere on their website. I don't care about the format because I can build it now, but I need to load it from a file instead of adding rows manually, because I will have several hundred rows.

Here is how I did it:
from pybrain.datasets import SupervisedDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer

ds = SupervisedDataSet(6, 3)
with open('mycsvfile.csv', 'r') as tf:
    for line in tf.readlines():
        data = [float(x) for x in line.strip().split(',') if x != '']
        indata = tuple(data[:6])
        outdata = tuple(data[6:])
        ds.addSample(indata, outdata)
n = buildNetwork(ds.indim, 8, 8, ds.outdim, recurrent=True)
t = BackpropTrainer(n, learningrate=0.01, momentum=0.5, verbose=True)
t.trainOnDataset(ds, 1000)
t.testOnData(verbose=True)
In this case the neural network has 6 inputs and 3 outputs. The CSV file has 9 comma-separated values on each line: the first 6 values are inputs and the last 3 are outputs.
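For reference, a minimal alternative sketch (not from the original post, and assuming the same 9-column CSV layout with no header row): load the whole file at once with numpy instead of parsing each line by hand.
import numpy as np
from pybrain.datasets import SupervisedDataSet

# each line: 6 input values followed by 3 output values, comma-separated
data = np.loadtxt('mycsvfile.csv', delimiter=',')
ds = SupervisedDataSet(6, 3)
for row in data:
    ds.addSample(tuple(row[:6]), tuple(row[6:]))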

You can just use a pandas DataFrame, this way:
import pandas as pd
from pybrain.datasets import SupervisedDataSet

dataset = SupervisedDataSet(6, 3)
df = pd.read_csv('mycsvfile.csv', header=None)  # header=None because the file has no header row
dataset.setField('input', df.values[:, :6])     # this sets the features
# the target field must be 2-D (a list of lists), even if there is only one output
# column, otherwise you get IndexError: tuple index out of range
y = df.values[:, 6:]                            # the last three columns are the targets
dataset.setField('target', y)                   # this sets the target field
del df, y
and you are good to go.

Related

Compute mean and standard deviation for HDF5 data

I am currently running 100 simulations that compute 1M values per simulation (i.e. there is one value per episode/iteration).
Main Routine
My main file looks like this:
# Defining the test simulation environment
def test_simulation():
    environment = environment(
        periods=1000000,
        parameter_x=...,
        parameter_y=...,
    )
    # Defining the simulation
    environment.simulation()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The simulation procedure is as follows: Within game() I generate a value_history that is continuously appended:
def simulation(self):
    for episode in range(periods):
        value = doSomething()
        self.value_history.append(value)
Hence, as a result, for each episode/iteration, I compute one value that is an array, e.g. [1.4 1.9] (player 1 having 1.4 and player 2 having 1.9 in the current episode/iteration).
Storing of Simulation Data
To store the data, I use the approach proposed in Append simulation data using HDF5, which works perfectly fine.
After running the simulations, I receive the following Keys structure:
Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_100']>
Computing Statistics for Files
Now, the goal is to compute averages and standard deviations for each value in the 100 data files that I run, which means that, in the end, I would have a final_data set consisting of 1M averages and 1M standard deviations (one average and one standard deviation for each row (for each player) across the 100 simulations).
The goal would thus be to get something like the following structure [average_player1, average_player2], [std_player1, std_player2]:
episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3]
I currently use the following code to extract the data storing it into an empty list:
import h5py

def ExtractSimData(name, simulation_runs, length):
    # Create empty list
    result = []
    # Call the simulation run file
    filename = f"runs/{length}/{name}_simulation_runs2.h5"
    with h5py.File(filename, "r") as hf:
        # List all groups
        print("Keys: %s" % hf.keys())
        for i in range(simulation_runs):
            a_group_key = list(hf.keys())[i]
            data = list(hf[a_group_key])
            for element in data:
                result.append(element)
    return result
The data structure of result looks something like this:
[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]
First Attempt to Compute Means
I tried to use the following code to come up with a mean score for the first element (the array consists of two elements since there are two players in the simulation):
mean_result = [np.mean(k) for k in zip(*list(result))]
However, this computes the average of each element in the array across the whole list since I appended each data set to the empty list. My goal, however, would be to compute an average/standard deviation across the 100 data sets defined above (i.e. one value is the average/standard deviation across all 100 data sets).
Is there any way to efficiently accomplish this?
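One compact way (a sketch on my part, assuming every run has the same shape, everything fits in memory, and the file holds only the per-run datasets) is to stack the runs into a 3-D array and reduce along the run axis; the answer below does the same per-episode calculation and also writes the results back into the HDF5 file.
import h5py
import numpy as np

with h5py.File("runs/01/test_simulation_runs2.h5", "r") as hf:   # hypothetical path
    arr = np.stack([hf[key][:] for key in hf.keys()])            # shape: (runs, episodes, players)

means = arr.mean(axis=0)   # per episode: [average_player1, average_player2]
stds = arr.std(axis=0)     # per episode: [std_player1, std_player2]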
This calculates mean and standard deviation of episode/player values across multiple datasets in 1 file. I think it's what you want to do. If not, I can modify as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)
Outline of steps in the procedure summarized below (after opening the file):
1. Get basic size info from the file: dataset count and number of dataset rows.
2. Use the values above to size arrays for player 1 and 2 values (variables p1_arr and p2_arr). shape[0] is the episode (row) count, and shape[1] is the simulation (dataset) count.
3. Loop over all datasets. I used hf.keys() (which iterates over the dataset names). You could also iterate over the names in list ds_names created earlier (I created it to simplify size calculations in step 2). The enumerate() counter i is used to index episode values for each simulation to the correct column in each player array.
4. To get the mean and standard deviation for each row, use the np.mean() and np.std() functions with the axis=1 parameter. That calculates the mean across each row of simulation results.
5. Next, load the data into the result datasets. I created 2 datasets (same data, different dtypes) as described below:
   a. The 'final_data' dataset is a simple float array of shape=(# of episodes, 4), where you need to know what value is in each column. (I suggest adding an attribute to document this.)
   b. The 'final_data_named' dataset uses a NumPy recarray so you can name the fields (columns). It has shape=(# of episodes,). You access each column by name.
A note on statistics: calculations are sensitive to the sum() operator's behavior over the range of values. If your data is well defined, the NumPy functions are appropriate. I investigated this a few years ago. See this discussion for all the details: when to use numpy vs statistics modules
Code to read and calculate statistics below.
import h5py
import numpy as np

def ExtractSimData(name, simulation_runs, length):
    # Call the simulation run file
    filename = f"runs/{length}/{name}simulation_runs2.h5"
    with h5py.File(filename, "a") as hf:
        # List all dataset names
        ds_names = list(hf.keys())
        print(f'Dataset names (keys): {ds_names}')

        # Create empty arrays for player1 and player2 episode values
        sim_cnt = len(ds_names)
        print(f'# of simulation runs (dataset count) = {sim_cnt}')
        ep_cnt = hf[ds_names[0]].shape[0]
        print(f'# of episodes (rows) in each dataset = {ep_cnt}')
        p1_arr = np.empty((ep_cnt, sim_cnt))
        p2_arr = np.empty((ep_cnt, sim_cnt))

        for i, ds in enumerate(hf.keys()):  # each dataset is 1 simulation
            p1_arr[:, i] = hf[ds][:, 0]
            p2_arr[:, i] = hf[ds][:, 1]

        ds1 = hf.create_dataset('final_data', shape=(ep_cnt, 4),
                                compression='gzip', chunks=True)
        ds1[:, 0] = np.mean(p1_arr, axis=1)
        ds1[:, 1] = np.std(p1_arr, axis=1)
        ds1[:, 2] = np.mean(p2_arr, axis=1)
        ds1[:, 3] = np.std(p2_arr, axis=1)

        dt = np.dtype([('average_player1', float), ('average_player2', float),
                       ('std_player1', float), ('std_player2', float)])
        ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt,
                                compression='gzip', chunks=True)
        ds2['average_player1'] = np.mean(p1_arr, axis=1)
        ds2['std_player1'] = np.std(p1_arr, axis=1)
        ds2['average_player2'] = np.mean(p2_arr, axis=1)
        ds2['std_player2'] = np.std(p2_arr, axis=1)

### main ###
simulation_runs = 10
length = '01'
name = 'test_'
ExtractSimData(name, simulation_runs, length)
Code to create pseudo-data HDF5 file below.
import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    players = 2
    periods = 1000
    # Define the simulation with some random data
    val_hist = np.random.random(periods*players).reshape(periods, players)
    if i == 0:
        mode = 'w'
    else:
        mode = 'a'
    # Save simulation data (unique datasets)
    with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)

# Run the simulation N times
simulations = 10
for i in range(simulations):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
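As a quick sanity check (my own addition, assuming the file and dataset names used above, and after running ExtractSimData), the named results can be read back and printed per episode:
import h5py

with h5py.File('runs/01/test_simulation_runs2.h5', 'r') as hf:
    final = hf['final_data_named'][:5]                 # first 5 episodes
    for i, row in enumerate(final):
        print(f"episode {i}: "
              f"[{row['average_player1']:.3f}, {row['average_player2']:.3f}], "
              f"[{row['std_player1']:.3f}, {row['std_player2']:.3f}]")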

Set up a column based on another column and outside list in a Pandas Dataframe

I am trying to create a new column in a Pandas dataframe which takes only one array from a list of 5 arrays (the list is titled cluster_centre) and puts that array into the dataframe. It should take the array at the index that matches the value in the 'labels' column of the same dataframe (which has values of 0, 1, 2, 3 or 4). So, for instance, if the sentence in a row was given a label of 2, i.e. the 'labels' column value for that row is 2, then the value of the 'cluster_centres' column in the df for that row should be cluster_centre[2]. How can I do this? The code I have attempted is pasted below:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd
with open('JWN_Nordstrom_MDNA_overview_2017.txt', 'r') as file:
    initial_corpus = file.read()
corpus = initial_corpus.split('. ')
# Extract sentence embeddings
embedder = SentenceTransformer('bert-base-wikipedia-sections-mean-tokens')
corpus_embeddings = embedder.encode(corpus)
# Perform KMeans clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
cluster_centre = clustering_model.cluster_centers_
# Create dataframe
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
# The line below creates a ValueError
All_data_df['cluster_centres'] = cluster_centre[All_data_df['labels']]
print(All_data_df.head())
I get this error: ValueError: Wrong number of items passed 768, placement implies 1
UPDATE: I did some new stuff and tried this:
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
#All_data_df['cluster_centres'] = 0
for index, row in All_data_df.iterrows():
    iforval = cluster_centre[row['labels']]
    All_data_df.at[index, 'cluster_centres'] = iforval
print(All_data_df.head())
But I get a new error: ValueError: Must have equal len keys and value when setting with an iterable. I printed iforval inside the loop and it does indeed return 29 correct arrays from the cluster_centre list, which matches the 29 rows present in the dataframe. Now I just need to put them into the new column of the dataframe, but .at[] didn't work; I am not sure if I am using it correctly.
EDIT/UPDATE: OK, I found a sort of solution; I don't know why I didn't realise this before. I just created the list beforehand and made that into the new column, which ended up being much simpler.
cluster_centres_list = [cluster_centre[label] for label in cluster_assignment]
all_data_df = pd.DataFrame()
all_data_df['sentences'] = corpus
all_data_df['embeddings'] = corpus_embeddings
all_data_df['labels'] = cluster_assignment
all_data_df['cluster_centres'] = cluster_centres_list
print(all_data_df.head())
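For what it's worth, the loop/list can also be replaced by indexing the centre array with the label array directly; a small self-contained sketch (with hypothetical toy data in place of the sentence embeddings):
import numpy as np
import pandas as pd

# hypothetical toy data: 5 cluster centres of dimension 3, 4 labelled sentences
cluster_centre = np.arange(15, dtype=float).reshape(5, 3)   # stands in for clustering_model.cluster_centers_
cluster_assignment = np.array([2, 0, 2, 4])                  # stands in for clustering_model.labels_

df = pd.DataFrame({'labels': cluster_assignment})
# cluster_centre[cluster_assignment] picks one centre row per label;
# wrapping in list() stores each row as a single object in the column
df['cluster_centres'] = list(cluster_centre[cluster_assignment])
print(df)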

How to iterate on datatype to get associated values?

I am currently discovering the HDF5 library in Python and I have a problem. I have a dataset with this layout:
GROUP "GROUP1" {
   DATASET "DATASET1" {
      DATATYPE H5T_COMPOUND {
         H5T_STD_I64LE "DATATYPE1";
         H5T_STD_I64LE "DATATYPE2";
         H5T_STD_I64LE "DATATYPE3";
      }
      DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): {
            1,
            2,
            3
I am trying to iterate over the dataset to get the values associated with each datatype and copy them into a text file. (For example, "1" is the value associated with "DATATYPE1".) The following script does work:
new_file = open('newfile.txt', 'a')
for i in range(len(dataset[...])):
    new_file.write('Ligne '+ str(i)+" "+":"+" ")
    for j in range(len(dataset[i,...])):
        new_file.write(str(dataset[i][j]) + "\n")
But it is not that clean... So I tried to get the values by calling the datatypes by name. The closest script I found is the following:
for attribute in group.attrs:
    print(group.attrs[attribute])
Unfortunately, despite my tries, it does not work on datatypes:
# Checking datatypes leads to the dataset
for data.dtype in dataset.dtype:
    # then print datatypes
    print(dataset.dtype[data.dtype])
The resulting error message is "'numpy.dtype' object is not iterable".
Do you have any idea how to proceed, please? I hope my question is clear.
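For reference, a minimal sketch of the direct fix (my own addition, not from the answers below, assuming dataset is the h5py dataset shown above): iterate over the compound dtype's field names rather than over the dtype object itself.
with open('newfile.txt', 'a') as new_file:
    for name in dataset.dtype.names:      # 'DATATYPE1', 'DATATYPE2', 'DATATYPE3'
        values = dataset[name]            # all values of that field, as a numpy array
        new_file.write(name + ': ' + str(values.tolist()) + '\n')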
Without your data it's hard to offer specific solutions. Here is a very simple example that mimics your data schema using pytables (& numpy). First it creates the HDF5 file, with table named DATASET1 under group GROUP1. DATASET1 has 3 int values in each row named: DATATYPE1, DATATYPE2, and DATATYPE3. The ds1.append() function adds rows of data to the table (1 row at a time).
After the data is created, walk_nodes() is used to traverse the HDF5 file structure and print node names and dtypes for tables.
import tables as tb
import numpy as np

with tb.open_file("SO_56545586.h5", mode="w") as h5f:
    ds1 = h5f.create_table('/GROUP1', 'DATASET1',
                           description=np.dtype([('DATATYPE1', int), ('DATATYPE2', int), ('DATATYPE3', int)]),
                           createparents=True)
    for row in range(5):
        row_vals = [(row, row+1, row*2), ]
        ds1.append(row_vals)

    ## This section walks the file structure (groups and datasets), printing node names and dtype for tables:
    for this_node in h5f.walk_nodes('/'):
        print(this_node)
        if isinstance(this_node, tb.Table):
            print(this_node.dtype)
Note: do not use mode = "w" when you open an existing file. It will create a new file (overwrite the existing file). Use mode = "a" or mode = "r+" if you need to append data, or mode = "r" if you only need to read the data.
To complete the solution added by kcw78, I also found this script, which also works. Because I can't iterate over the dataset, I copied the dataset into a new array:
import numpy as np

dataset = file['path_to_dataset']   # file is an already-open h5py.File object
data = np.array(dataset)  # Create a new numpy array filled with the dataset values.
print(data)
ls_column = list(data.dtype.names)  # Get a list of the datatype (field) names.
print(ls_column)  # Show the layout of the datatypes associated with the data values.
# Extract an array per datatype rather than per row:
for col in ls_column:
    k = data[col]  # example: k = data['DATATYPE1'], k = data['DATATYPE2']
    print(k)
Arnaud, OK, I see you are using h5py.
I don't understand what you mean by "I can't iterate over dataset". You can iterate over rows, or columns/fields.
Here is an example to demonstrate with h5py.
It shows 4 ways to extract data from the dataset (the last one iterates):
1. Read the entire HDF5 dataset into a NumPy array
2. Then read 1 column from that array into another array
3. Read 1 column from the HDF5 dataset as an array
4. Loop through the HDF5 dataset columns and read them 1 at a time as arrays
Note that the return from .dtype.names is iterable. You don't need to create a list (unless you need it for other purposes). Also, HDF5 supports mixed types in datasets, so you can get a dtype with int, float, and string values (it will be a record array).
import h5py
import numpy as np

with h5py.File("SO_56545586.h5", "w") as h5f:
    # create empty dataset 'DATASET1' in group '/GROUP1'
    # dtype argument defines field names and types
    ds1 = h5f.create_dataset('/GROUP1/DATASET1', (10,),
                             dtype=np.dtype([('DATATYPE1', int), ('DATATYPE2', int), ('DATATYPE3', int)]))
    for row in range(5):  # load some arbitrary data into the dataset
        row_vals = [(row, row+1, row*2), ]
        ds1[row] = row_vals

    # to read the entire dataset as an array
    ds1_arr = h5f['/GROUP1/DATASET1'][:]
    print(ds1_arr.dtype)

    # to read 1 column from ds1_arr as an array
    ds1_col1 = ds1_arr[:]['DATATYPE1']
    print('for DATATYPE1 from ds1_arr, dtype=', ds1_col1.dtype)

    # to read 1 HDF5 dataset column as an array
    ds1_col1 = h5f['/GROUP1/DATASET1'][:, 'DATATYPE1']
    print('for DATATYPE1 from HDF5, dtype=', ds1_col1.dtype)

    # to loop thru HDF5 dataset columns and read 1 at a time as an array
    for col in h5f['/GROUP1/DATASET1'].dtype.names:
        print('for ', col, ', dtype=', h5f['/GROUP1/DATASET1'][col].dtype)
        col_arr = h5f['/GROUP1/DATASET1'][col][:]
        print(col_arr.shape)
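Tying this back to the original goal of copying the values into a text file, a short sketch (my own addition, assuming the SO_56545586.h5 file created above) that writes each row with its field names:
import h5py

with h5py.File("SO_56545586.h5", "r") as h5f, open("newfile.txt", "w") as out:
    ds1 = h5f['/GROUP1/DATASET1']
    for i, row in enumerate(ds1):                      # iterating a dataset yields rows
        vals = ", ".join(f"{name}={row[name]}" for name in ds1.dtype.names)
        out.write(f"Ligne {i} : {vals}\n")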

How to increase multi dimension of array in tensorflow?

I have a txt file which has 8 columns, and I am selecting 1 column for my feature extraction, which gives me 13 feature values, so the shape of the output array is [1 x 13].
Similarly, I have 5 txt files in a folder, and I want to run a loop so that the returned variable contains 5 x 13 data.
def loadinfofromfile(directory, sd, channel):
    # subdir selection and read file names in it for particular crack type.
    subdir, filenames = loadfilenamesindirectory(directory, sd)
    for i in range(5):
        # join the directory, sub directory and the filename
        loadfile = os.path.join(directory, subdir, filenames[i])
        # load the values of that particular file into a tensor
        fileinfo = tf.constant(np.loadtxt(loadfile), tf.float32)
        # select the particular column data (chosen from crack type, channel no)
        fileinfo_trans = tf.transpose(fileinfo)
        fileinfo_back = tf.gather(fileinfo_trans, channel)
        # extracting features from the selected column data gives [1 x 13]
        pool = features.pooldata(fileinfo_back)
        poolfinal = tf.concat_v2([tf.expand_dims(pool, 0)], axis=0)
    return poolfinal
In the above function I am able to get [1 x 13] into the variable 'pool', and I expected the size of the variable poolfinal to be [5 x 13], but I get [1 x 13].
How do I concatenate in the vertical direction?
What is the mistake I made in the loop?
Each loop iteration creates pool and poolfinal from scratch. That's why you see only one file's data in poolfinal.
Instead, please try the following:
pools = []
for ...:
    pools.append(...)
poolfinal = tf.concat_v2(pools, axis=0)
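A fuller sketch of the same fix applied to the question's function (assuming the helper functions loadfilenamesindirectory and features.pooldata from the question, and keeping the deprecated tf.concat_v2 call for consistency; in current TensorFlow it is tf.concat):
import os
import numpy as np
import tensorflow as tf

def loadinfofromfile(directory, sd, channel):
    subdir, filenames = loadfilenamesindirectory(directory, sd)
    pools = []
    for i in range(5):
        loadfile = os.path.join(directory, subdir, filenames[i])
        fileinfo = tf.constant(np.loadtxt(loadfile), tf.float32)
        fileinfo_back = tf.gather(tf.transpose(fileinfo), channel)
        pool = features.pooldata(fileinfo_back)   # per-file features, as in the question
        pools.append(tf.expand_dims(pool, 0))     # collect instead of overwriting
    return tf.concat_v2(pools, axis=0)            # stacks to [5 x 13]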

Python sklearn.datasets.dump_svmlight_file failed to output the right index of column

I want to execute SVM light and SVM rank,
so I need to process my data into the SVM light format.
But I ran into a big problem....
My Python code is below:
import pandas as pd
import numpy as np
from sklearn.datasets import dump_svmlight_file
self.df = pd.DataFrame()
self.df['patent_id'] = patent_id_list
self.df['Target'] = class_list
self.df['backward_citation'] = backward_citation_list
self.df['uspc_originality'] = uspc_originality_list
self.df['science_linkage'] = science_linkage_list
self.df['sim_bc_structure'] = sim_bc_structure_list
self.df['claim_num'] = claim_num_list
self.qid = dataset_list
X = self.df[np.setdiff1d(self.df.columns, ['patent_id','Target'])]
y = self.df.Target
dump_svmlight_file(X,y,'test.dat',zero_based=False, query_id=self.qid,multilabel=False)
The column indices written to the output file "test.dat" do not match the real data; I got the wrong indices.
Taking the first instance for example: the value of column 1 is 7, the values of columns 2 through 4 are zeros, and the value of column 5 is 2.
So my expected result looks like this:
1 qid:1 1:7 5:2
but the column indices in the output file are totally wrong....
and unfortunately I cannot figure out where the problem occurs....
I have not been able to fix this problem for a long time....
Thank you for your help!!
I changed the data structure: I used np.array to produce array-like input.
Finally, I succeeded!
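For reference, a small sketch of that array-based approach (with hypothetical toy data standing in for the patent features). One thing worth checking in the original code: np.setdiff1d returns a sorted array, so the feature columns of X may end up in a different order than the DataFrame columns.
import numpy as np
from sklearn.datasets import dump_svmlight_file

# hypothetical toy data: 2 rows, 5 features, plus labels and query ids
X = np.array([[7, 0, 0, 0, 2],
              [3, 1, 0, 2, 0]], dtype=float)
y = np.array([1, 0])
qid = np.array([1, 1])

# svmlight is a sparse format, so zero-valued features are omitted;
# zero_based=False writes 1-based feature indices, e.g. "1 qid:1 1:7 5:2"
dump_svmlight_file(X, y, 'test.dat', zero_based=False, query_id=qid, multilabel=False)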
If you're interested in loading into a numpy array, try:
X = clicks_train[:,0:2]
y = clicks_train[:,2]
where 2 is the index of the target column
