I want to create a numpy array by parsing a .txt file. The .txt file consists of features of iris flowers separated by commas. Every line holds one flower example: 5 values separated by 4 commas. The first 4 values are features and the last one is the name. I parse the .txt in a loop and want to append (probably using numpy.append) every line's parsed data to a numpy array called feature_table.
Here's the code:
import numpy as np
iris_data = open("iris_data.txt", "r")
for line in iris_data:
    currentline = line.split(",")
    #iris_data_parsed = (currentline[0] + " , " + currentline[3] + " , " + currentline[4])
    #sepal_length = numpy.array(currentline[0])
    #petal_width = numpy.array(currentline[3])
    #iris_names = numpy.array(currentline[4])
    feature_table = np.array([currentline[0]],[currentline[3]],[currentline[4]])
    print (feature_table)
    print(feature_table.shape)
So I want to create a numpy array using only the first, fourth and fifth values in every line,
but I can't make it work the way I want. I tried reading the numpy docs but couldn't understand them.
While the people in the comments are right in that you are not persisting your data anywhere, your problem, I assume, is incorrect np.array construction. You should enclose all of the arguments in a list like this:
feature_table = np.array([currentline[0],currentline[3],currentline[4]])
And get rid of the redundant [ and ] around the individual arguments.
See the official documentation for more examples. Basically, all of the input data needs to be grouped into a single argument, because Python treats the other arguments as separate positional arguments (the second positional argument of np.array is dtype).
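The comments' point about persisting the data can be combined with the corrected constructor. The following is only a minimal sketch: it assumes each line has five comma-separated fields, and it uses an in-memory sample (`sample`, with made-up rows) in place of iris_data.txt. Rows are collected in a plain Python list and converted to an array once at the end, which is usually simpler than calling numpy.append inside the loop.

```python
import io
import numpy as np

# Hypothetical stand-in for open("iris_data.txt", "r"); the rows are sample values.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

rows = []
for line in sample:
    fields = line.strip().split(",")
    if len(fields) < 5:
        continue  # skip blank or malformed lines
    # keep only the first, fourth and fifth values
    rows.append([fields[0], fields[3], fields[4]])

# One array for the whole file: one row per flower, three columns.
feature_table = np.array(rows)
print(feature_table)
print(feature_table.shape)
```

Because the name column is text, the resulting array holds strings; the numeric columns would need a separate conversion (for example `feature_table[:, :2].astype(float)`) before doing arithmetic.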
My code fetches values from text files and builds a multidimensional array from them. The problem is that it creates an array with more than two dimensions, which I can't manipulate. I need a two-dimensional array. How do I do that?
Explanation of my code's algorithm:
Purpose of the code:
My code fetches values from a specific folder. Each subfolder contains 7 'txt' files generated by one user, so multiple folders hold the data of multiple users.
Step 1: Start the first for loop, which runs once per subfolder of the top-level folder, and store the current folder's path in the variable 'path'.
Step 2: Open the path and fetch the data of the 7 txt files using a second for loop. After fetching, the second loop ends and the rest of the code runs.
Step 3: Concatenate the data of the 7 txt files into one 1D array.
Step 4 (here the problem arises): Store the 1D array of each folder as a row of a 2D array. End the first for loop.
Code:
import numpy as np
from array import *
import os
f_path='Result'
array_control_var=0
# walk the directory tree to get each folder path
for (path,dirs,file) in os.walk(f_path):
    if(path==f_path):
        continue
    f_path_1= path +'\page_1.txt'
    # read page_1 separately because it contains string data
    pgno_1 = np.array(np.loadtxt(f_path_1, dtype='U', delimiter=','))
    # only for page_2.txt
    f_path_2= path +'\page_2.txt'
    with open(f_path_2) as f:
        str_arr = ','.join([l.strip() for l in f])
    pgno_2 = np.asarray(str_arr.split(','), dtype=int)
    # fetch the data from the remaining text files in a loop; data type = int
    for j in range(3,8):
        # store the file path in a variable
        txt_file_path=path+'\page_'+str(j)+'.txt'
        if os.path.exists(txt_file_path)==True:
            # generate a variable name that auto-increments with the for loop
            foo='pgno_'+str(j)
        else:
            break
        # pass the variable name as a string and store the value
        exec(foo + " = np.array(np.loadtxt(txt_file_path, dtype='i', delimiter=','))")
    #z=np.array([pgno_2,pgno_3,pgno_4,pgno_5,pgno_6,pgno_7])
    # merge all arrays from page 2 onwards into a single one-dimensional array
    f_array=np.concatenate((pgno_2,pgno_3,pgno_4,pgno_5,pgno_6,pgno_7), axis=0)
    # on the first pass of the loop, assign this value
    if array_control_var==0:
        main_f_array=f_array
    else:
        # here the problem arises
        main_f_array=np.array([main_f_array,f_array])
    array_control_var+=1
print(main_f_array)
Currently my code generates an array like this (for 3 folders):
[
array([[0,0,0],[0,0,0]]),
array([0,0,0])
]
Note: I don't know how many dimensions it has.
But I want:
[
array(
[0,0,0]
[0,0,0]
[0,0,0])
]
I tried to write recursive code that flattens the list of lists into one list of flat rows. It gives the desired output for your case, but I did not try it on many other inputs (and it is buggy for certain cases, such as list = [0,[[0,0,0],[0,0,0]],[0,0,0]])...
flat = []
def main():
    list =[[[0,0,0],[0,0,0]],[0,0,0]]
    recFlat(list)
    print(flat)
def recFlat(Lists):
    if len(Lists) == 0:
        return Lists
    head, tail = Lists[0], Lists[1:]
    if isinstance(head, (list,)):
        recFlat(head)
        return recFlat(tail)
    else:
        return flat.append(Lists)
if __name__ == '__main__':
    main()
My idea behind the code was to traverse the head of each list, and check whether it is an instance of a list or an element. If the head is an element, this means I have a flat list and I can return the list. Else, I should recursively traverse more.
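A different route from the recursive flattening above, offered only as a sketch: avoid the nesting in the first place by collecting each folder's 1D array in a Python list and stacking once after the loop. It assumes every folder yields a 1D array of the same length; `per_folder_arrays` is a hypothetical stand-in for the f_array values computed inside the os.walk loop.

```python
import numpy as np

# Hypothetical stand-ins for the f_array built for each of 3 folders.
per_folder_arrays = [
    np.array([0, 0, 0]),
    np.array([1, 1, 1]),
    np.array([2, 2, 2]),
]

rows = []
for f_array in per_folder_arrays:
    # inside the real loop this would replace the main_f_array=np.array([...]) step
    rows.append(f_array)

# Stack the equal-length 1D arrays as the rows of one 2D array.
main_f_array = np.vstack(rows)
print(main_f_array)
print(main_f_array.shape)  # (number of folders, values per folder)
```

Wrapping the previous result in np.array([main_f_array, f_array]) on every pass nests the earlier rows one level deeper each time, which is where the extra dimensions come from; a single np.vstack at the end keeps the result two-dimensional.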
I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; after that, the value to the left of each colon is a feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json  # used by json.dump below
import numpy as np
import tables as tb  # PyTables, referred to as tb below
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand
def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets: it takes too long to run.
How can I improve and accelerate it? Maybe using MapReduce? What is wrong in this function that makes it so slow?
IO + conversions (from str, to str, even converting the same variable to str twice, etc.) + splits + explicit loops. By the way, there is a csv Python module that can be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). I also see that you convert element[0] to int and to str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (the array?). You can also try to implement it in another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for anything). And certainly try to eliminate the many conversions. Also, if the input file is yours, you can format it with fixed-length fields; this lets you avoid split/parse entirely (only string indexing).
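One way to act on the "fewer conversions" advice is sketched below. It is an illustration, not a measured optimization: each feature ID is converted to int exactly once, plain str.split and str.partition replace the numpy string helpers, and the results are gathered as (row, column, value) triplets. The names `parse_line`, `build_triplets` and `column_of` are hypothetical; `column_of` plays the role of self.index_selected_feature, keyed by int instead of str.

```python
def parse_line(line):
    """Split one 'film_id feat:score feat:score ...' line into its parts."""
    film_id, *pairs = line.split()
    features = []
    for pair in pairs:
        feat, _, score = pair.partition(":")
        features.append((int(feat), float(score)))  # one conversion per field
    return film_id, features


def build_triplets(lines, column_of):
    """Collect (row, col, value) triplets for the selected features only."""
    film_ids, rows, cols, vals = [], [], [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        film_id, features = parse_line(line)
        row = len(film_ids)
        film_ids.append(film_id)
        for feat, score in features:
            col = column_of.get(feat)  # None when the feature is not selected
            if col is not None:
                rows.append(row)
                cols.append(col)
                vals.append(score)
    return film_ids, rows, cols, vals


sample_lines = ["6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707"]
column_of = {4: 0, 9: 1, 24: 2}  # hypothetical selection of features
film_ids, rows, cols, vals = build_triplets(sample_lines, column_of)
print(film_ids, rows, cols, vals)
```

The three lists are in the coordinate form that scipy.sparse.coo_matrix accepts as (vals, (rows, cols)) together with a shape, so the matrix can be built in one call after the file has been read instead of being assigned cell by cell. Whether this is actually faster on a given dataset needs to be measured.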
So I am writing some code in Python 2.7 to pull some information from a website, extract the relevant data, and then format that data in a more useful way. Specifically, I want to take the information from an HTML <pre> tag, put it into a file, turn the contents of that file into an array (using numpy), and then do my analysis from that. I am stuck on the "put into a file" part. It seems that when I put it into a file, it is a 1x1 matrix or something, so it won't do what I hope it will. On an attempt previous to the code sample below, the error I got was: IndexError: index 5 is out of bounds for axis 0 with size 0. I indexed the array just to test whether it would provide output from what I have so far.
Here is my code so far:
#Pulling data from GFS lamps
from lxml import html
import requests
import numpy as np
ICAO = raw_input("What station would you like GFS lamps data for? ")
page = requests.get('http://www.nws.noaa.gov/cgi-bin/lamp/getlav.pl?sta=' + ICAO)
tree = html.fromstring(page.content)
Lamp = tree.xpath('//pre/text()') #stores class of //pre html element in list Lamp
gfsLamps = open('ICAO', 'w') #stores text of Lamp into a new file
gfsLamps.write(Lamp[0])
array = np.genfromtxt('ICAO') #puts file into an array
array[5]
You can use KOGD as the ICAO to test this. As is, I get ValueError: Some errors were detected, and it lists lines 2-23 (got 26 columns instead of 8). What is the first step that I am doing wrong for what I want to do? Or am I just going about this all wrong?
The problem isn't in putting the data into the file; it's in getting it out using genfromtxt. genfromtxt is a very rigid function: it mostly needs complete, regular data unless you specify lots of options to skip columns and rows. Use this instead:
arrays = [np.array(map(str, line.split())) for line in open('ICAO')]
The arrays variable will contain one array per line, holding the individual whitespace-separated elements of that line. For example, if your line has the following data:
a b cdef 124
the array for this line will be:
['a','b','cdef','124']
arrays will contain array of each line like this, which can be processed as you wish further.
So complete code is:
from lxml import html
import requests
import numpy as np
ICAO = raw_input("What station would you like GFS lamps data for? ")
page = requests.get('http://www.nws.noaa.gov/cgi-bin/lamp/getlav.pl?sta=' + ICAO)
tree = html.fromstring(page.content)
Lamp = tree.xpath('//pre/text()') #stores class of //pre html element in list Lamp
gfsLamps = open('ICAO', 'w') #stores text of Lamp into a new file
gfsLamps.write(Lamp[0])
gfsLamps.close()
array = [np.array(map(str, line.split())) for line in open('ICAO')]
print array
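The code above targets Python 2.7, as in the question. Under Python 3, map returns an iterator, so np.array(map(...)) would not give one element per token; since line.split() already returns a list of strings, the map call can simply be dropped. A small sketch of the same per-line parsing, using a made-up sample string (`sample_text`) instead of the downloaded file:

```python
import numpy as np

# Hypothetical sample text; lines may have different numbers of tokens.
sample_text = "a b cdef 124\nx y\n\n1 2 3\n"


def lines_to_arrays(text):
    """Return one 1D string array per non-empty line of text."""
    return [np.array(line.split()) for line in text.splitlines() if line.strip()]


arrays = lines_to_arrays(sample_text)
for arr in arrays:
    print(arr)
```

The result is a plain list of 1D arrays rather than one 2D array, which is what lets the lines have different lengths.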
I'm writing this question because I'm not sure about the use of "structured arrays". I built a matrix from keyboard input with different types (integer, float, etc.) using the "dtype" argument. Now I want to find repeated elements in the columns "p" and "q", and when I find them, I want to sum the respective elements from column "z". Thanks. This is my Python code:
from numpy import *
from math import *
from cmath import *
from numpy.linalg import *
number_lines_=raw_input("Lines:")
numero_lines=int(number_lines_)
ceros=zeros((numero_lines,1))
dtype=([('p',int),('q',int),('r',float),('x',float),('b',complex),('z',complex),('y',complex)])
#print dtype
leng=len(dtype)
#print leng
yinfo=array(ceros,dtype)
#print shape(yinfo)
if numero_lines>0:
    for i in range(numero_lines):
        p_=raw_input("P: ")
        p=int(p_)
        if p>0:
            yinfo['p'][i]=p
        #print yinfo
        q_=raw_input("Q: ")
        q=int(q_)
        if q>0 and q!=p:
            yinfo['q'][i]=q
        r_=raw_input("R: ")
        r=float(r_)
        yinfo['r'][i]=r
        x_=raw_input("X: ")
        x=float(x_)
        yinfo['x'][i]=x
        b_=raw_input("b:")
        b=complex(b_)
        yinfo['b'][i]=complex(0,b)
        yinfo['z'][i]=complex(yinfo['r'][i],yinfo['x'][i])
        yinfo['y'][i]=1./(yinfo['z'][i])
        # print "\n"
print yinfo
print type(yinfo)
print yinfo.shape
Let me suggest some changes:
import numpy as np # not *
....
numero_lines=int(number_lines_)
...
# don't use numpy function names as variable names
# even the np import
dt=np.dtype([('p',int),('q',int),('r',float),('x',float),('b',complex),('z',complex),('y',complex)])
...
yinfo=np.zeros((numero_lines,),dtype=dt)
# make a zero filled array directly
# also make it 1d, I don't think 2nd dimension helps you
# if numero_lines>0: omit this test; range(0) is empty
# link each of the fields to a variable name
# changing a value of yinfo_p will change a value in yinfo
yinfo_p=yinfo['p']
yinfo_q=yinfo['q']
# etc
for i in range(numero_lines):
    p_=raw_input("P: ")
    p=int(p_)
    if p>0:
        yinfo_p[i]=p
    #print yinfo
    q_=raw_input("Q: ")
    q=int(q_)
    if q>0 and q!=p:
        yinfo_q[i]=q
    r_=raw_input("R: ")
    r=float(r_)
    yinfo_r[i]=r
    x_=raw_input("X: ")
    x=float(x_)
    yinfo_x[i]=x
    b_=raw_input("b:")
    b=complex(b_)
    yinfo_b[i]=complex(0,b)
# dont need to fill in these values row by row
# perform array operations after you are done with the input loop
yinfo_z[:] = yinfo_r + 1j*yinfo_x
yinfo_y[:] = 1./yinfo_z
Alternatively I could have defined
yinfo_p = np.zeros((numero_lines,), dtype=int)
yinfo_q = ...
After filling in all the yinfo_* values I could assemble them into a structured array - if that's what I need for other uses.
yinfo['p'] = yinfo_p
etc.
This isn't a very good use of structured arrays; as I tried to show with the yinfo_b variables, you are using each field as though it was a separate 1d array.
I haven't actually run these changes, so there might be some errors, but hopefully they will give you ideas that can improve your code.
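The suggestions above can be tried on a small hard-coded array, together with one possible way to do the grouping the question asks about (summing z over rows that share the same p and q). This is a sketch under assumptions: the sample values are made up, only some of the fields are included, and np.unique with np.add.at is just one of several ways to group rows.

```python
import numpy as np

# Hypothetical sample data; only the fields needed for the illustration.
dt = np.dtype([('p', int), ('q', int), ('r', float), ('x', float), ('z', complex)])
yinfo = np.zeros((4,), dtype=dt)

# Field views: writing through yinfo_p writes into yinfo itself.
yinfo_p = yinfo['p']
yinfo_q = yinfo['q']
yinfo_p[:] = [1, 1, 2, 1]
yinfo_q[:] = [2, 2, 3, 2]
yinfo['r'] = [1.0, 2.0, 3.0, 4.0]
yinfo['x'] = [0.5, 0.5, 0.5, 0.5]

# Whole-array arithmetic instead of filling z row by row.
yinfo['z'] = yinfo['r'] + 1j * yinfo['x']

# Group rows by their (p, q) pair and sum z within each group.
pairs = np.stack([yinfo['p'], yinfo['q']], axis=1)
unique_pairs, inverse = np.unique(pairs, axis=0, return_inverse=True)
z_sums = np.zeros(len(unique_pairs), dtype=complex)
np.add.at(z_sums, inverse.ravel(), yinfo['z'])  # accumulates repeated indices

for (p, q), total in zip(unique_pairs, z_sums):
    print(p, q, total)
```

Here `inverse` maps every row to the index of its (p, q) pair in `unique_pairs`, and np.add.at is used instead of `z_sums[inverse] += ...` because plain fancy-index assignment would count a repeated index only once.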
I need help formatting my matrix when I write it to a file. I am using the numpy method called tofile().
It takes 3 args: 1) the name of the file, 2) the separator (must be a string), 3) the format (also a string).
I don't know a lot about formatting, but I am trying to format the file so that there is a new line every 9 characters (not including spaces). The output is a 9x9 sudoku game, so I need it to be formatted 9x9.
finished = M.tofile("soduku_solved.txt", " ", "")
where M is a matrix.
My first argument is the name of the file and the second is a space, but I don't know what format argument I need to make it 9x9.
I could be wrong, but I don't think that's possible with the numpy tofile function. I think the format argument just allows you to format how each individual item is formatted, it doesn't consider them in a group.
You could do something like:
M = np.random.randint(1, 9, (9, 9))
each_item_fmt = '{:>3}'
each_row_fmt = ' '.join([each_item_fmt] * 9)
fmt = '\n'.join([each_row_fmt] * 9)
as_string = fmt.format(*M.flatten())
It's not a very nice way to build up the format string and there's bound to be a better way of doing it. You'll see the final result (print(fmt)) is a big block of '{:>3}', which basically says, put a bit of data in here with a fixed width of 3 characters, right aligned.
EDIT: Since you're putting it directly into a file, you could write it line by line:
M = np.random.randint(1, 9, (9, 9))
fmt = ('{:>3} ' * 9).strip()
with open('soduku_solved.txt', 'w') as f:
    for m in M:
        f.write(fmt.format(*m) + '\n')
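Another option, not mentioned above, is np.savetxt, which writes a 2D array one row per line. A minimal sketch, assuming M is a 9x9 integer array; the grid below is only a placeholder (not a valid sudoku), and the file goes into a temporary directory so the example has no side effects on the working directory.

```python
import os
import tempfile

import numpy as np

# Placeholder 9x9 grid with values 1..9; stands in for the solved puzzle M.
M = np.arange(81).reshape(9, 9) % 9 + 1

out_path = os.path.join(tempfile.mkdtemp(), "soduku_solved.txt")

# fmt="%d" writes plain integers; each row of M becomes one line of the file.
np.savetxt(out_path, M, fmt="%d", delimiter=" ")

with open(out_path) as f:
    lines = f.read().splitlines()
print("\n".join(lines))
```

A width can be added to the format (for example fmt="%3d") if aligned columns are wanted, similar to the '{:>3}' format used above.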