I have a txt file with 8 columns, and I am selecting 1 column for my feature extraction, which gives me 13 feature values; the shape of the output array is [1x13].
Similarly, I have 5 txt files in a folder. I want to run a loop so that the returned variable holds 5x13 data.
def loadinfofromfile(directory, sd, channel):
    # subdir selection: read the file names in it for a particular crack type
    subdir, filenames = loadfilenamesindirectory(directory, sd)
    for i in range(5):
        # join the directory, sub directory and the filename
        loadfile = os.path.join(directory, subdir, filenames[i])
        # load the values of that particular file into a tensor
        fileinfo = tf.constant(np.loadtxt(loadfile), tf.float32)
        # select the particular column data (chosen from crack type, channel no)
        fileinfo_trans = tf.transpose(fileinfo)
        fileinfo_back = tf.gather(fileinfo_trans, channel)
        # extracting features from the selected column data gives [1x13]
        pool = features.pooldata(fileinfo_back)
        poolfinal = tf.concat_v2([tf.expand_dims(pool, 0)], axis=0)
    return poolfinal
In the above function I am able to get [1x13] into the variable pool, and I expect the size of the variable poolfinal to be [5x13], but I get it as [1x13].
How do I concatenate in the vertical direction?
What is the mistake I made in the loop?
Each loop iteration creates pool and poolfinal from scratch. That's why you only see one row of data in poolfinal.
Instead, please try the following:
pools = []
for ...:
    pools.append(...)
poolfinal = tf.concat_v2(pools, axis=0)
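For reference, a minimal sketch of the corrected function, assuming the same helpers (loadfilenamesindirectory, features.pooldata) and keeping your original calls; note that in newer TensorFlow releases tf.concat_v2 is simply tf.concat:

def loadinfofromfile(directory, sd, channel):
    subdir, filenames = loadfilenamesindirectory(directory, sd)
    pools = []
    for i in range(5):
        loadfile = os.path.join(directory, subdir, filenames[i])
        fileinfo = tf.constant(np.loadtxt(loadfile), tf.float32)
        fileinfo_back = tf.gather(tf.transpose(fileinfo), channel)
        pool = features.pooldata(fileinfo_back)     # features for one file
        pools.append(tf.expand_dims(pool, 0))       # collect instead of overwrite
    poolfinal = tf.concat_v2(pools, axis=0)         # stack vertically -> [5x13]
    return poolfinal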
I wrote a helper function read_xyT to read all .csv files in a directory and output them as a pandas DataFrame. Since there are lots of such directories, I want to save them into individual variables (or maybe there is a better solution?).
What I do now is:
path = r'./data/T-600s'
df33 = read_xyT(path,33)
df34 = read_xyT(path,34)
df35 = read_xyT(path,35)
df36 = read_xyT(path,36)
...
There are 60 folders in total... I wonder if there is a smarter, more efficient way to do it? e.g.
subdirectorynames = np.arange(34, 50)  # the helper function takes 'path' as input
variablenames = alistforfilenames
for v, f in zip(variablenames, subdirectorynames):
    dfxx = read_xyT(path, f)
then I'll have the saved individual variables such as df37, df38, df39, ...
Or is there a better way to do it?
Thank you in advance!
You can use a dict comprehension:
df_dict = {f'df{idx}': read_xyT(path, idx) for idx in subdirectorynames}
This creates a dictionary where you can access the dataframes using e.g. df_dict['df33'].
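Putting it together, a minimal sketch (the 33-92 range below is an assumption standing in for your 60 folder numbers; adjust it to the real ones):

path = r'./data/T-600s'
subdirectorynames = range(33, 93)  # hypothetical: the 60 folder numbers

# build one dict entry per folder instead of 60 separate variables
df_dict = {f'df{idx}': read_xyT(path, idx) for idx in subdirectorynames}

print(df_dict['df33'].head())  # access any individual DataFrame by its key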
I have run the following Python code:
array = ['AEM000', 'AID017']
USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'].isin(array)]
I run a regression model and extract the log-likelihood value for each item of this array with a for loop:
for item in array:
    USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'] == item]
    formula = "WEIGHTED_BASE_MEDIAN_FINAL_MEAN ~ YEAR"
    response, predictors = dmatrices(formula, USA_DATA_1D, return_type='dataframe')
    mod1 = sm.GLM(response, predictors, family=sm.genmod.families.family.Gaussian()).fit()
    LLF_NG = {'model': ['Standard Gaussian'],
              'llf_value': mod1.llf
              }
    df_llf = pd.DataFrame(LLF_NG, columns=['model', 'llf_value'])
Now I would like to rename the dataframe df_llf to df_llf_(name of the item), i.e. df_llf_AEM000 when running the loop on the first item and df_llf_AID017 when running it on the second one.
I need some help to know how to proceed.
If you want to rename the data frame, you need to use the copy method so that the original data frame does not get altered.
df_llf_AEM000 = df_llf.copy()
If you want to iteratively save several different versions of the original data frame, you can do something like this:
allDataframes = []
for i in range(10):
    df = df_original.copy()
    allDataframes.append(df)

print(allDataframes[0])
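Applied to the loop in your question, a minimal sketch that keeps one dataframe per item in a dict keyed by the item name (same variables as in your code; the key names are just a convention, not something pandas requires):

llf_dfs = {}
for item in array:
    USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'] == item]
    response, predictors = dmatrices("WEIGHTED_BASE_MEDIAN_FINAL_MEAN ~ YEAR",
                                     USA_DATA_1D, return_type='dataframe')
    mod1 = sm.GLM(response, predictors,
                  family=sm.genmod.families.family.Gaussian()).fit()
    llf_dfs[f'df_llf_{item}'] = pd.DataFrame({'model': ['Standard Gaussian'],
                                              'llf_value': [mod1.llf]})

print(llf_dfs['df_llf_AEM000'])  # the per-item dataframe, e.g. for AEM000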
I am currently discovering the HDF5 library in Python and I have a problem. I have a dataset with this layout:
GROUP "GROUP1" {
DATASET "DATASET1" {
DATATYPE H5T_COMPOUND {
H5T_STD_I64LE "DATATYPE1";
H5T_STD_I64LE "DATATYPE2";
H5T_STD_I64LE "DATATYPE3";
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): {
1,
2,
3
I am trying to iterate over the dataset to get the values associated with each datatype and copy them into a text file. (For example, "1" is the value associated with "DATATYPE1".) The following script does work:
new_file = open('newfile.txt', 'a')
for i in range(len(dataset[...])):
    new_file.write('Ligne ' + str(i) + " " + ":" + " ")
    for j in range(len(dataset[i, ...])):
        new_file.write(str(dataset[i][j]) + "\n")
But it is not that clean... So I tried to get the values by calling the datatypes by name. The closest script I found is the following:
for attribute in group.attrs:
    print group.attrs[attribute]
Unfortunately, despite my attempts, it does not work on datatypes:
# checking the datatypes in the dataset
for data.dtype in dataset.dtype:
    # then print the datatypes
    print dataset.dtype[data.dtype]
The error message I get back is "'numpy.dtype' object is not iterable".
Do you have any idea how to proceed? I hope my question is clear.
Without your data it's hard to offer specific solutions. Here is a very simple example that mimics your data schema using pytables (& numpy). First it creates the HDF5 file, with a table named DATASET1 under group GROUP1. DATASET1 has 3 int values in each row, named DATATYPE1, DATATYPE2, and DATATYPE3. The ds1.append() function adds rows of data to the table (1 row at a time).
After the data is created, walk_nodes() is used to traverse the HDF5 file structure and print node names and dtypes for tables.
import tables as tb
import numpy as np

with tb.open_file("SO_56545586.h5", mode = "w") as h5f:
    ds1 = h5f.create_table('/GROUP1', 'DATASET1',
                           description=np.dtype([('DATATYPE1', int), ('DATATYPE2', int), ('DATATYPE3', int)]),
                           createparents=True)
    for row in range(5):
        row_vals = [(row, row+1, row*2), ]
        ds1.append(row_vals)

    ## This section walks the file structure (groups and datasets), printing node names and dtype for tables:
    for this_node in h5f.walk_nodes('/'):
        print(this_node)
        if isinstance(this_node, tb.Table):
            print(this_node.dtype)
Note: do not use mode = "w" when you open an existing file. It will create a new file (overwrite the existing file). Use mode = "a" or mode = "r+" if you need to append data, or mode = "r" if you only need to read the data.
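For completeness, a small sketch of reading that file back with pytables and pulling one column by name (assuming the file and table created above; Table.col() returns a single column as a numpy array):

with tb.open_file("SO_56545586.h5", mode = "r") as h5f:
    tbl = h5f.root.GROUP1.DATASET1
    print(tbl.colnames)             # ['DATATYPE1', 'DATATYPE2', 'DATATYPE3']
    col1 = tbl.col('DATATYPE1')     # one column by name, as a numpy array
    print(col1)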
To complete the solution added by kcw78, I also found this script, which also works. Because I can't iterate over the dataset, I copied the dataset into a new array:
dataset = file['path_to_dataset']
data = np.array(dataset)  # create a new numpy array filled with the dataset values
print(data)
ls_column = list(data.dtype.names)  # get a list of the datatype names associated with the data values
print(ls_column)  # show the layout of the datatypes associated with the previous data values

# Create arrays filled by datatype rather than by subcase.
for col in ls_column:
    k = data[col]  # example: k = data['DATATYPE1'], k = data['DATATYPE2']
    print(k)
Arnaud, OK, I see you are using h5py.
I don't understand what you mean by "I can't iterate over dataset". You can iterate over rows, or over columns/fields.
Here is an example to demonstrate with h5py.
It shows 4 ways to extract data from the dataset (the last one iterates):
1. Read the entire HDF5 dataset into a np array
2. Then read 1 column from that array into another array
3. Read 1 column from the HDF5 dataset as an array
4. Loop through the HDF5 dataset columns and read them 1 at a time as arrays
Note that the return from .dtype.names is iterable. You don't need to create a list (unless you need it for other purposes). Also, HDF5 supports mixed types in datasets, so you can get a dtype with int, float, and string values (it will be a record array).
import h5py
import numpy as np

with h5py.File("SO_56545586.h5", "w") as h5f:
    # create empty dataset 'DATASET1' in group '/GROUP1'
    # the dtype argument defines the field names and types
    ds1 = h5f.create_dataset('/GROUP1/DATASET1', (10,),
                             dtype=np.dtype([('DATATYPE1', int), ('DATATYPE2', int), ('DATATYPE3', int)]))
    for row in range(5):  # load some arbitrary data into the dataset
        row_vals = [(row, row+1, row*2), ]
        ds1[row] = row_vals

    # to read the entire dataset as an array
    ds1_arr = h5f['/GROUP1/DATASET1'][:]
    print(ds1_arr.dtype)

    # to read 1 column from ds1_arr as an array
    ds1_col1 = ds1_arr[:]['DATATYPE1']
    print('for DATATYPE1 from ds1_arr, dtype=', ds1_col1.dtype)

    # to read 1 HDF5 dataset column as an array
    ds1_col1 = h5f['/GROUP1/DATASET1'][:, 'DATATYPE1']
    print('for DATATYPE1 from HDF5, dtype=', ds1_col1.dtype)

    # to loop thru HDF5 dataset columns and read 1 at a time as an array
    for col in h5f['/GROUP1/DATASET1'].dtype.names:
        print('for ', col, ', dtype=', h5f['/GROUP1/DATASET1'][col].dtype)
        col_arr = h5f['/GROUP1/DATASET1'][col][:]
        print(col_arr.shape)
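And to tie this back to the original goal of copying the values into a text file, a minimal sketch that loops over the field names and writes one line per datatype (it assumes the file created above; the output format is just an example):

with h5py.File("SO_56545586.h5", "r") as h5f, open('newfile.txt', 'w') as out:
    dset = h5f['/GROUP1/DATASET1']
    for name in dset.dtype.names:
        values = dset[name]  # all values for this field, as a numpy array
        out.write(name + ': ' + ', '.join(str(v) for v in values) + '\n')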
I am working on a script to extract some details from images. I am trying to loop over a dataframe that holds my image names. How can I add a new column to the dataframe that stores the extracted name against the corresponding image name?
for image in df['images']:
    concatenated_name = ''.join(name)
    df.loc[image, df['images']]['names'] = concatenated_name
Expected:
Index images names
0 img_01 TonyStark
1 img_02 Thanos
2 img_03 Thor
Got:
Index images names
0 img_01 Thor
1 img_02 Thor
2 img_03 Thor
Use apply to apply a function on each row:
def get_name(image):
    # Code for getting the name
    return name

df['names'] = df['images'].apply(get_name)
Following your answer that added some more details, it should be possible to shorten it to:
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    .
    .
    .
    data = ''.join(a)
    return data

df['data'] = df['filenames'].apply(get_details)
# save df to csv / excel / other
After multiple trials, I think I have a viable solution to this question.
I was using nested functions for this exercise, such that function 1 loops over a dataframe of files and calls function 2 to extract the text, perform validation, and return a value if the image has the expected field.
First, I created an empty list which would be populated during each run of function 2. At the end, the user can choose to use this list to create a dataframe.
# dataframes to store data
df = pd.DataFrame(os.listdir(), columns=['filenames'])
df = df[df['filenames'].str.contains(".png|.jpg|.jpeg")]
df['filenames'] = '\\' + df['filenames']
df1 = []  # Empty list to record details

# Function 1
def extract_details(df):
    for filename in df['filenames']:
        get_details(filename)

# Function 2
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    .
    .
    .
    data = ''.join(a)
    print(filename, data)
    df1.append([filename, data])

df_data = pd.DataFrame(df1, columns=['filenames', 'data'])  # Container for final output
df_data.to_csv('data_list.csv')     # Write output to a csv file
df_data.to_excel('data_list.xlsx')  # Write output to an excel file
I am trying to use PyBrain for some simple NN training. What I don't know how to do is load the training data from a file. It is not explained anywhere on their website. I don't care about the format because I can build it now, but I need to do it from a file instead of adding rows manually one by one, because I will have several hundred rows.
Here is how I did it:
from pybrain.datasets import SupervisedDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer

ds = SupervisedDataSet(6, 3)
tf = open('mycsvfile.csv', 'r')

for line in tf.readlines():
    data = [float(x) for x in line.strip().split(',') if x != '']
    indata = tuple(data[:6])
    outdata = tuple(data[6:])
    ds.addSample(indata, outdata)

n = buildNetwork(ds.indim, 8, 8, ds.outdim, recurrent=True)
t = BackpropTrainer(n, learningrate=0.01, momentum=0.5, verbose=True)
t.trainOnDataset(ds, 1000)
t.testOnData(verbose=True)
In this case the neural network has 6 inputs and 3 outputs. The csv file has 9 comma-separated values on each line. The first 6 values are inputs and the last three are outputs.
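For illustration, a couple of made-up lines in that format would look like this (6 input values followed by 3 output values per line):

0.12,0.40,0.33,0.91,0.05,0.77,1,0,0
0.08,0.62,0.15,0.44,0.28,0.50,0,1,0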
You can just use a pandas DataFrame this way:
import pandas as pd

dataset = SupervisedDataSet(6, 3)
df = pd.read_csv('mycsvfile.csv')
dataset.setField('input', df.values[:, :6])   # this sets the features
y = df.values[:, 6:]                          # the last three columns are the targets
# The target field should be 2-D (a list of lists). If there is only one output
# column, wrap it, e.g. y = [[x] for x in df.values[:, 6]], to avoid
# "IndexError: tuple index out of range".
dataset.setField('target', y)                 # this sets the target field(s)
del df, y
and you are good to go.
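If it helps, here is how that slots into the training code from the question (a sketch; it assumes the same network setup and that dataset was filled as above):

n = buildNetwork(dataset.indim, 8, 8, dataset.outdim, recurrent=True)
t = BackpropTrainer(n, learningrate=0.01, momentum=0.5, verbose=True)
t.trainOnDataset(dataset, 1000)
t.testOnData(verbose=True)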