Reading a file into a numpy array in Python

For an assignment I need to read a file into a numpy array;
the data consist of a string and two floats:
# naam massa(kg) radius(km)
Venus 4.8685e24 6051.8
Aarde 5.9736e24 6378.1
Mars 6.4185e23 3396.2
Maan 7.349e22 1738.1
Saturnus 5.6846e26 60268
the following was my solution to this problem:
import numpy as np

def dataread(filename):
    temp = np.empty((1, 3), dtype=object)
    x = 0
    f = open(filename, 'r')
    for line in f:
        if line[0] != '#':
            l = line.split('\t')
            temp[0, 0], temp[0, 1], temp[0, 2] = l[0], float(l[1]), float(l[2])
            if x == 0:
                data = temp
            if x > 0:
                data = np.vstack((data, temp))
            x += 1
    f.close()
    return data
somehow this returns the following array:
[['Aarde' 5.9736e+24 6378.1]
['Aarde' 5.9736e+24 6378.1]
['Mars' 6.4185e+23 3396.2]
['Maan' 7.349e+22 1738.1]
['Saturnus' 5.6846e+26 60268.0]]
The first line is read but does not end up in the array, while the second row appears twice.
What am I doing wrong? I'm new to Python, so any comments on efficiency are also very much appreciated.
Thanks in advance

This will read your three columns into a numpy structured array:
import numpy as np

data = np.genfromtxt(
    'data.txt',
    dtype=None,  # determine types automatically
    names=['name', 'mass', 'radius'],
)
print(data['name'])
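As for what went wrong in the posted loop: data = temp binds a second name to the same array rather than copying it, so the first row is silently overwritten on the next pass (np.vstack copies, which is why later rows survive). A minimal demonstration of the aliasing, and of a temp.copy() fix:

```python
import numpy as np

# The aliasing bug: assigning an array does NOT copy it.
temp = np.empty((1, 3), dtype=object)
temp[0, 0], temp[0, 1], temp[0, 2] = 'Venus', 4.8685e24, 6051.8
data = temp                          # data is the SAME array as temp

temp[0, 0], temp[0, 1], temp[0, 2] = 'Aarde', 5.9736e24, 6378.1
print(data[0, 0])                    # 'Aarde' -- the Venus row is gone

# The fix inside the loop: take an explicit copy
data = temp.copy()
temp[0, 0] = 'Mars'
print(data[0, 0])                    # 'Aarde' -- the copy is unaffected
```

That single change (data = temp.copy() in the x == 0 branch) makes the original loop produce all five rows.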

Related

Automatically extracting data from csv file into specific matrix position

I have a rather large csv file that I need the program to read, then input the data into the correct position of a zero matrix. Sample of csv block (also attached file):
Sector,Service,Data_Point
Bio,Electricity NonEmitting,0
NEElectricity,Electricity NonEmitting,0.5
RE,Electricity NonEmitting,0
Electricity,Electricity NonEmitting,-1
Bio,Electricity Bio,0.8
NEElectricity,Electricity Bio,0
RE,Electricity Bio,0.04
Electricity,Electricity Bio,-2
Bio,Electricity BECCS,0.84
NEElectricity,Electricity BECCS,0
RE,Electricity BECCS,0.4
Electricity,Electricity BECCS,-1
Bio,Ammonia HB,0
Electricity,Ammonia HB,2.8
RE,Ammonia HB,0.06
Ammonia,Ammonia HB,-1
Bio,Biofuel TBD,0.30
Electricity,Biofuel TBD,0.02
RE,Biofuel TBD,0.012
Electricity,CarUse BEV,0.5
RE,CarUse BEV,0
CarUse,CarUse BEV,-1
Hydrogen,CarUse HFCEV,0.2
RE,CarUse HFCEV,0
CarUse,CarUse HFCEV,-1
Bio,NET DAC,0
NEElectricity,NET DAC,10.5
RE,NET DAC,-1
The problem is that I need it to be able to sort the data based on the Sector and Service columns. I.e. Sector = rows, Service = columns in the matrix. So if the program reads Sector as Bio: row = 1, and Service as Electricity NonEmitting: column 1, it inputs the corresponding number from Data_Point (in this case Data_Point is '0') into row 1 column 1 of the matrix. Or if it reads Sector as NEElectricity: row = 2, but service as Electricity NonEmitting again: column 1, the corresponding Data_Point '0.5' is inputted into row 2 column 1 of the matrix.
Below I have written code that automatically generates a zero matrix based on the number of unique elements in the Sector and Service columns. I just cannot figure out how to sort the values into the correct matrix position, so any help would be greatly appreciated.
import csv
import numpy as np
import pandas as pd
sector = pd.read_csv('Coeff_Sample.csv', usecols=["Sector"])
matrix_column = int(sector.nunique())
service = pd.read_csv('Coeff_Sample.csv', usecols=["Service"])
matrix_row = int(service.nunique())
coeff_matrix = np.zeros((matrix_row, matrix_column))
Best regards
Is that the kind of matrix you wanted to create?
I created this matrix without pandas, with the following source code:
import csv
import numpy as np

rows = []
columns = []
all_rows = []
with open('test.csv', 'r') as read_obj:
    csv_dict_reader = csv.DictReader(read_obj)
    for row in csv_dict_reader:
        columns.append(row['Sector'])
        rows.append(row['Service'])
        all_rows.append(row)
rows_set = set(rows)
columns_set = set(columns)
coeff_matrix = np.full((len(rows_set) + 1, len(columns_set) + 1), 0).tolist()
row_list = list(rows_set)
columns_list = list(columns_set)
for idx, x in enumerate(columns_list):
    coeff_matrix[0][idx + 1] = x
for idy, y in enumerate(row_list):
    coeff_matrix[idy + 1][0] = y
for e in all_rows:
    sector = e['Sector']
    service = e['Service']
    value = e['Data_Point']
    for row_idx, row in enumerate(coeff_matrix):
        if row[0] == service:
            row_index = row_idx
    for column_idx, column in enumerate(coeff_matrix[0]):
        if column == sector:
            column_index = column_idx
    coeff_matrix[row_index][column_index] = value
np_coeff_matrix = np.asarray(coeff_matrix)
But it has a lot of loops inside. There may be faster ways to do this with pandas or list/np.array functions.
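Indeed, pandas can build this matrix in one step with DataFrame.pivot. A sketch using a few rows of the sample data; for the real file, replace the inline frame with pd.read_csv('Coeff_Sample.csv'):

```python
import pandas as pd

# A few rows of the sample data, inlined so the sketch is self-contained.
df = pd.DataFrame({
    'Sector':     ['Bio', 'NEElectricity', 'Bio', 'RE'],
    'Service':    ['Electricity NonEmitting', 'Electricity NonEmitting',
                   'Electricity Bio', 'Electricity Bio'],
    'Data_Point': [0, 0.5, 0.8, 0.04],
})

# Rows = Service, columns = Sector (as in the loop version);
# (Service, Sector) pairs absent from the file become 0.
coeff = df.pivot(index='Service', columns='Sector',
                 values='Data_Point').fillna(0)
coeff_matrix = coeff.to_numpy()   # plain numeric numpy array, if needed
```

pivot raises if a (Service, Sector) pair occurs twice; use pivot_table with an aggregation function in that case.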

How to solve a formula on Python if one of the variables in the formula is an array of numbers?

I am extracting data from a file and then using this data to calculate some formulas. When I extract the variable "I" (which is all the row values for one column) and I use that variable to solve the formula "T" an error appears:
TypeError: return arrays must be of ArrayType
How can I apply a variable that represents an array to such a formula?
import numpy as np
data = np.genfromtxt('data.txt', delimiter=' ')
Lt = data[1:,3]
print(Lt)
v= 1300
cn = (v-137.55)/10.58
print(cn)
round(cn)
V = 10.58*round(cn)+137.55
print(V)
I = data[1:,116]
print(I)
Ifunc = np.vectorize(I)
print(Ifunc)
x = 10**-12
print(x)
y = V**3
print(y)
T = (1.4387*V)/(np.log(1+1.191*x*(y/I), 10))
I expect to solve the formula for all the numbers in the array, but the TypeError above appears when I run the code.
Use np.log10():
T = (1.4387*V)/(np.log10(1+1.191*x*(y/I)))
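The reason for the error: np.log has no base argument; its second positional parameter is the out array, so np.log(..., 10) tries to use the integer 10 as an output buffer and fails. A quick sketch of the two equivalent base-10 forms:

```python
import numpy as np

I = np.array([1.0, 2.0, 4.0])

# np.log(I, 10) would raise: 10 is interpreted as `out`, not a base.
a = np.log10(I)               # base-10 log, elementwise
b = np.log(I) / np.log(10)    # same result via the change-of-base formula
```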

My program computes values as strings and not as floats even when I change the type

I have a problem with my program and I'm confused. I don't know why it won't change the type of the columns, or maybe it is changing the type and still computing the columns as strings. When I change the type to float and multiply by 8, it gives me, for example with 4, 44444444. Here is my code.
import pandas as pd
import re
import numpy as np

link = "excelfilett.txt"
file = open(link, "r")
frames = []
is_count_frames = False
for line in file:
    if "[Frames]" in line:
        is_count_frames = True
    if is_count_frames == True:
        frames.append(line)
    if "[EthernetRouting]" in line:
        break
number_of_rows = len(frames) - 3
header = re.split(r'\t', frames[1])
number_of_columns = len(header)
frame_array = np.full((number_of_rows, number_of_columns), 0)
df_frame_array = pd.DataFrame(frame_array)
df_frame_array.columns = header
for row in range(number_of_rows):
    frame_row = re.split(r'\t', frames[row + 2])
    for position in range(len(frame_row)):
        df_frame_array.iloc[row, position] = frame_row[position]
df_frame_array['[MinDistance (ms)]'].astype(float)
df_frame_array.loc[:, '[MinDistance (ms)]'] *= 8
print(df_frame_array['[MinDistance (ms)]'])
but it gives me the value repeated 8 times, like (100100...100100). I also tried putting the values in a list:
MinDistList = df_frame_array['[MinDistance (ms)]'].tolist()
product = []
for i in MinDistList:
    product.append(i * 8)
print(product)
but it still won't work, any ideas?
df_frame_array['[MinDistance (ms)]'].astype(float) doesn't change the column in place, but returns a new one.
You had the right idea, so just store it back:
df_frame_array['[MinDistance (ms)]'] = df_frame_array['[MinDistance (ms)]'].astype(float)
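The 44444444 output comes from Python's sequence repetition: multiplying a string by an integer repeats it instead of doing arithmetic, which is also why the list version failed. A minimal illustration:

```python
import pandas as pd

s = pd.Series(['100', '4'])
repeated = s * 8              # still strings: elementwise repetition
fixed = s.astype(float) * 8   # astype returns a new Series; now real math
```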

Extract data from string in python

I have a .csv file with rows of the form [datetime, "(data1, data2)"], and I have managed to import the data into Python as time and temp. The problem I am facing is: how do I separate the temp string into two new_temp columns of floats to use for plotting later on?
My code so far is:
import csv
import matplotlib.dates as dates

def getColumn(filename, column):
    results = csv.reader(open(filename), delimiter=",")
    return [result[column] for result in results]

time = getColumn("logfile.csv", 0)
temp = getColumn("logfile.csv", 1)
new_time = dates.datestr2num(time)
new_temp = [???]
When I print temp I get ['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)', '(62.7148, 64.6278)', '(62.0408, 63.9625)', '(61.456, 63.2638)', '(61.0234, 62.837)', '(60.6823, 62.317)',...etc]
If anyone can help me then thanks in advance.
You may use this code:
import re

string = "['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)', '(62.7148, 64.6278)', '(62.0408, 63.9625)', '(61.456, 63.2638)', '(61.0234, 62.837)', '(60.6823, 62.317)']"
data = re.findall(r'[+-]?\d+\.\d+e?[+-]?\d*', string)
data = list(zip(data[0::2], data[1::2]))
print([float(d[0]) for d in data])
print([float(d[1]) for d in data])
Going from the answer in
Parse a tuple from a string?
from ast import literal_eval as make_tuple
temp = ['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)', '(62.7148, 64.6278)', '(62.0408, 63.9625)', '(61.456, 63.2638)', '(61.0234, 62.837)', '(60.6823, 62.317)']
tuples = [make_tuple(stringtuple) for stringtuple in temp]
and you have a list of tuples of floats.
I undeleted the post and made it a full answer, because apparently it wasn't clear enough to reference the other post.
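Either way, the asker still needs two flat float lists for plotting. A small sketch building on the literal_eval approach (the names temp1/temp2 are just illustrative):

```python
from ast import literal_eval

temp = ['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)']
tuples = [literal_eval(s) for s in temp]   # [(0.0, 0.0), (64.4164, ...), ...]

# "Unzip" the tuples into two columns, e.g. for plt.plot(new_time, temp1)
temp1, temp2 = (list(col) for col in zip(*tuples))
```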

Custom reading CSV files (keyword-accessible / custom structure)

I am trying to do the following:
I downloaded a csv file containing my banking transactions of the last 180 days.
I want to read in this csv file and then do some plots with the data.
For that I set up a program that reads the csv file and makes the data available through keywords.
E.g. in the csv file there is a column "Buchungstag";
I replace that with the date keyword, etc.
import numpy as np
import matplotlib.pylab as mpl
import csv

class finanz():
    def __init__(self):
        path = "/home/***/"
        self.dataFileName = path + "test.csv"
        self.data_read = open(self.dataFileName, 'r')
        self._columns = {}
        self._columns[0] = ["date", "Buchungstag", "", "S15"]
        self._columns[1] = ["value", "Umsatz", "Euro", "f8"]
        self._ident = {"Buchungstag": "date", "Umsatz in {0}": "value"}
        self.base = 1205.30
        self._readData()

    def _readData(self):
        r = csv.DictReader(self.data_read, delimiter=';')
        dtype = map(lambda x: (self._columns[x][0], self._columns[x][3]), range(len(self._columns)))
        self.data = np.recarray((2), dtype=dtype)
        desiredKeys = map(lambda x: x, self._ident.iterkeys())
        for i, x in enumerate(r):
            for k in desiredKeys:
                if k == "Umsatz in {0}":
                    v = np.float(x[k].replace(",", ".")) + self.base
                else:
                    v = x[k]
                self.data[self._ident[k]][i] = v

    def getAllData(self):
        return self.data.copy()

a = finanz()
b = a.getAllData()
print type(b)
print type(b['value']), type(b['date'])
Sample data
"Buchungstag";"Wertstellung (Valuta)";"Vorgang";"Buchungstext";"Umsatz in {0}";
"02.06.2015";"02.06.2015";"Lastschrift/Belast.";"Auftraggeber: abc";"-3,75";
My question now is: why is type(b['date']) a numpy.core.records.recarray while type(b['value']) is a numpy.ndarray?
My second question is how to "save" the date in a format that I can use with matplotlib.
The third and final question is how I can check how many rows the csv file has (for the creation of the empty self.data array).
Thx!
Repeating your array generation without the extra code:
In [230]: dt=np.dtype([('date', 'S15'), ('value', '<f8')])
In [231]: data=np.recarray((2,),dtype=dt)
In [232]: type(data['date'])
Out[232]: numpy.core.records.recarray
In [233]: type(data['value'])
Out[233]: numpy.ndarray
The fact that one field is returned as ndarray, and the other as recarray isn't significant. It's just how the recarray class is setup.
Now we mostly use 'structured arrays', created for example with
data1=np.empty((2,),dtype=dt)
or filled with '0s':
data1 = np.zeros((2,), dtype=dt)
# array([('', 0.0), ('', 0.0)],
#       dtype=[('date', 'S15'), ('value', '<f8')])
With this, both data1['date'] and ['value'] are ndarray. recarray is the old version, and still compatible, but structured arrays are more consistent in their syntax and behavior. There are lots of SO questions about structured arrays, many produced by np.genfromtxt applied to csv files like yours.
I could combine this idea, plus my comment (about list appends):
def _readData(self):
    r = csv.DictReader(self.data_read, delimiter=';')
    if self._columns[0][1].endswith('tag'):
        self._columns[0][3] = 'datetime64[D]'  # the dtype string lives in slot 3
    dtype = map(lambda x: (self._columns[x][0], self._columns[x][3]), range(len(self._columns)))
    desiredKeys = map(lambda x: x, self._ident.iterkeys())
    data = []
    for x in r:
        aline = np.zeros((1,), dtype=dtype)
        for k in desiredKeys:
            if k == "Umsatz in {0}":
                v = np.float(x[k].replace(",", ".")) + self.base
            else:
                v = x[k]
                v1 = v.split('.')
                if len(v1) == 3:  # convert date to yyyy-mm-dd format
                    v = '%s-%s-%s' % (v1[2], v1[1], v1[0])
            aline[self._ident[k]] = v
        data.append(aline)
    self.data = np.concatenate(data)
producing a b like:
array([(datetime.date(2015, 6, 2), 1201.55),
(datetime.date(2015, 6, 2), 1201.55),
(datetime.date(2015, 6, 2), 1201.55)],
dtype=[('date', '<M8[D]'), ('value', '<f8')])
I believe genfromtxt collects each row as a tuple, and creates the array at the end. The docs for structured arrays shows that they can be constructed from
np.array([(item1, item2), (item3, item4),...], dtype=dtype)
I chose to construct an array for each line, and concatenate them at the end because that required fewer changes to your code.
I also changed that function so it converts the 'tag' column to np.datetime64 dtype. There are a number of SO questions about using that dtype. I believe it can be used in matplotlib, though I don't have experience with that.
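The date-conversion step above can be sketched in isolation: np.datetime64 accepts ISO yyyy-mm-dd strings, and an array of day-precision values gets the datetime64[D] dtype that matplotlib's plot can consume directly. The helper name to_datetime64 is just illustrative:

```python
import numpy as np

def to_datetime64(s):
    # '02.06.2015' (dd.mm.yyyy) -> numpy datetime64 day '2015-06-02'
    d, m, y = s.split('.')
    return np.datetime64('%s-%s-%s' % (y, m, d))

dates = np.array([to_datetime64('02.06.2015'), to_datetime64('03.06.2015')])
# dates.dtype is datetime64[D]; plt.plot(dates, values) works on it directly
```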
