I am using software that outputs only the upper triangle of a symmetric matrix in the following format:
2 3 4 5 6 7 8
1: -0.00 0.09 0.03 -0.27 -0.28 0.83 -0.31
2: 0.09 0.03 -0.26 -0.28 0.83 -0.31
3: 0.00 0.11 0.11 0.33 0.10
4: 0.03 0.03 -0.00 0.03
5: -0.02 0.91 -0.04
6: 0.92 -0.03
7: 0.91
I would like to plot this matrix as a heatmap. However, I am having trouble reading this
text file into a data structure. How could I turn this text file into, for example, a numpy array that I could use as a matrix for plotting?
Thank you!
If I read your text file correctly, you can load it with pandas using a whitespace delimiter:
import pandas as pd
import numpy as np
dat = pd.read_csv("test.txt", index_col=0, delimiter=r'\s+').to_numpy()
Looks like this:
array([[-0. , 0.09, 0.03, -0.27, -0.28, 0.83, -0.31],
[ 0.09, 0.03, -0.26, -0.28, 0.83, -0.31, nan],
[ 0. , 0.11, 0.11, 0.33, 0.1 , nan, nan],
[ 0.03, 0.03, -0. , 0.03, nan, nan, nan],
[-0.02, 0.91, -0.04, nan, nan, nan, nan],
[ 0.92, -0.03, nan, nan, nan, nan, nan],
[ 0.91, nan, nan, nan, nan, nan, nan]])
So we just need to shift each row cyclically so that the NaNs end up on the left (row i is rolled right by i):
idx = np.arange(dat.shape[1])
arr = np.empty(dat.shape)
for i in range(dat.shape[1]):
    arr[i] = dat[i][np.concatenate([idx[-i:], idx[:-i]])]
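As an aside (not in the original answer), this cyclic shift per row is exactly what np.roll does, so the loop could equivalently be written as:
for i in range(dat.shape[1]):
    arr[i] = np.roll(dat[i], i)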
And the end result looks like this:
arr
array([[-0. , 0.09, 0.03, -0.27, -0.28, 0.83, -0.31],
[ nan, 0.09, 0.03, -0.26, -0.28, 0.83, -0.31],
[ nan, nan, 0. , 0.11, 0.11, 0.33, 0.1 ],
[ nan, nan, nan, 0.03, 0.03, -0. , 0.03],
[ nan, nan, nan, nan, -0.02, 0.91, -0.04],
[ nan, nan, nan, nan, nan, 0.92, -0.03],
[ nan, nan, nan, nan, nan, nan, 0.91]])
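Since the question is ultimately about a heatmap, here is a minimal sketch of my own (assuming the rolled arr from above, and that the NaN lower triangle should simply mirror the upper half):
import matplotlib.pyplot as plt

sym = arr.copy()
mask = np.isnan(sym)      # NaNs mark the missing lower triangle
sym[mask] = sym.T[mask]   # mirror the upper-triangle values across the diagonal
plt.imshow(sym, cmap='viridis')
plt.colorbar()
plt.show()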
I came up with the following solution:
t = open("test_fit")
long_l = []
for line in t:
    line = line.rstrip().split()
    long_l.append(line[1:])          # drop the leading row label (e.g. '1:')
long_l_new = long_l[1:]              # drop the header row
print(long_l_new)
for index, item in enumerate(long_l_new):
    print(index, item)
    item.insert(0, '0')              # placeholder for the diagonal
long_l_new.append(['0'])             # last row holds only its diagonal
for index, item in enumerate(long_l_new):
    if index == 0:
        to_insert = long_l_new[index][index + 1]
        new_l = long_l_new[index + 1]
        new_l.insert(index, to_insert)
    else:
        if index < len(long_l_new) - 1:
            for i in range(0, index + 1):
                to_insert = long_l_new[i][index + 1]
                new_l = long_l_new[index + 1]
                new_l.insert(i, to_insert)
Output:
[['0', '-0.00', '0.09', '0.03', '-0.27', '-0.28', '0.83', '-0.31'],
['-0.00', '0', '0.09', '0.03', '-0.26', '-0.28', '0.83', '-0.31'],
['0.09', '0.09', '0', '0.00', '0.11', '0.11', '0.33', '0.10'],
['0.03', '0.03', '0.00', '0', '0.03', '0.03', '-0.00', '0.03'],
['-0.27', '-0.26', '0.11', '0.03', '0', '-0.02', '0.91', '-0.04'],
['-0.28', '-0.28', '0.11', '0.03', '-0.02', '0', '0.92', '-0.03'],
['0.83', '0.83', '0.33', '-0.00', '0.91', '0.92', '0', '0.91'],
['-0.31', '-0.31', '0.10', '0.03', '-0.04', '-0.03', '0.91', '0']]
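The rows are still lists of strings, so one extra step (my addition, assuming the long_l_new shown above) turns them into a float matrix ready for plotting:
import numpy as np

mat = np.array(long_l_new, dtype=float)  # the '0' placeholders become 0.0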
I have a large dataset with thousands of rows (though far fewer columns). I have ordered the rows so that each of the 'objects' is grouped together, just like the dataset in Table1 below:
#Table1 :
data = [['ALFA', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['ALFA', 401740.00, 0.43, 0.26, 0.23, 0.16, 0.09],
['ALFA', 892350.00, 0.58, 0.24, 0.05, 0.07, 0.4],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 401740.00, 0.43, 0.26, 0.14, 0.37, 0.06],
['Charlie', 511830.00, 0.52, 0.16, 0.13, 0.22, 0.01],
['Delta', 590030.00, 0.75, 0.2, 0.34, 0.3, 0],
['Delta', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Delta', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['Echo', 892350.00, 0.58, 0.24, 0.23, 0.16, 0.09],
['Echo', 590030.00, 0.75, 0.2, 0.05, 0.07, 0.4],
['Echo', 590030.00, 0.75, 0.2, 0.08, 0.26, 0],
['Echo', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Foxtrot', 401740.00, 0.43, 0.26, 0.27, 0.2, 0.01],
['Foxtrot', 511830.00, 0.52, 0.16, 0.29, 0.11, 0.04],
['Golf', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Golf', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Golf', 351740.00, 0.31, 0.22, 0.13, 0.22, 0.01],
['Hotel', 892350.00, 0.58, 0.24, 0.34, 0.3, 0],
['Hotel', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Hotel', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04]]
df = pd.DataFrame(data, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df
However, I would like to partition the data by these objects and get the averages of all the columns in a separate table, much like Table2 below:
#Table2:
data2 = [['ALFA', 548610.00, 0.44, 0.24, 0.24, 0.14, 0.18],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 545615.00, 0.66, 0.20, 0.21, 0.25, 0.03],
['Delta', 510600.00, 0.60, 0.21, 0.26, 0.26, 0.02],
['Echo', 665610.00, 0.71, 0.21, 0.13, 0.22, 0.14],
['Foxtrot', 456785.00, 0.48, 0.21, 0.28, 0.16, 0.03],
['Golf', 510600.00, 0.60, 0.21, 0.18, 0.26, 0.03],
['Hotel', 690803.33, 0.69, 0.21, 0.21, 0.23, 0.01]]
df2 = pd.DataFrame(data2, columns=['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df2
Please note that the number of rows per object varies across the dataset, so the query should count the rows for each object and use that count to average all the columns, presenting the results in a new table.
For instance, the 548610.00 value for ALFA in Table2 (Column1) is simply the sum of ALFA's Column1 values in Table1 (351740.00 + 401740.00 + 892350.00) divided by the count of ALFA rows, which is 3.
Just use the groupby() function from pandas:
df.groupby('Objects').mean()
Instead of mean(), other aggregation functions such as min() work as well.
It can be done with:
df2 = pd.DataFrame(
data,
columns=['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
).groupby(by='Objects').mean().round(decimals=2).reset_index()
print(df2)
Output:
Objects Column1 Column2 Column3 Column4 Column5 Column6
0 ALFA 548610.00 0.44 0.24 0.24 0.14 0.18
1 Bravo 511830.00 0.52 0.16 0.08 0.26 0.00
2 Charlie 545615.00 0.66 0.20 0.21 0.24 0.03
3 Delta 510600.00 0.60 0.21 0.26 0.26 0.02
4 Echo 665610.00 0.71 0.21 0.12 0.22 0.14
5 Foxtrot 456785.00 0.48 0.21 0.28 0.16 0.02
6 Golf 510600.00 0.60 0.21 0.18 0.26 0.03
7 Hotel 690803.33 0.69 0.21 0.21 0.23 0.01
I have a numpy 2d-array of shape (N, N) representing a NxN correlation matrix. This matrix is symmetric. Say N=5, then an example of this 2d-array would be:
x = np.array([[1.00, 0.46, 0.89, 0.76, 0.65],
[0.46, 1.00, 0.83, 0.88, 0.29],
[0.89, 0.83, 1.00, 0.57, 0.84],
[0.76, 0.88, 0.57, 1.00, 0.39],
[0.65, 0.29, 0.84, 0.39, 1.00]])
I would like to obtain P copies of x where the diagonal remains the same but the upper- and lower-triangular halves of the matrix are permuted in unison.
An example of one of these copies could be:
np.array([[1.00, 0.65, 0.89, 0.84, 0.39],
[0.65, 1.00, 0.76, 0.83, 0.88],
[0.89, 0.76, 1.00, 0.29, 0.57],
[0.84, 0.83, 0.29, 1.00, 0.46],
[0.39, 0.88, 0.57, 0.46, 1.00]])
It would be great if the solution doesn't take too long as the matrix I am using is of shape (100, 100) and I would like to obtain 10,000-100,000 copies.
My intuition would be to somehow obtain the lower or upper half of the matrix as a flattened array, do the permutation, and replace the values in both the upper and lower halves. This, however, would take me a while to figure out, so I would like to know if there is a more straightforward approach. Thanks.
You can try this:
import numpy as np
x = np.array([[1.00, 0.46, 0.89, 0.76, 0.65],
[0.46, 1.00, 0.83, 0.88, 0.29],
[0.89, 0.83, 1.00, 0.57, 0.84],
[0.76, 0.88, 0.57, 1.00, 0.39],
[0.65, 0.29, 0.84, 0.39, 1.00]])
j, i = np.meshgrid(np.arange(x.shape[0]), np.arange(x.shape[0]))
i, j = i.flatten(), j.flatten()
up_i, up_j = i[i < j], j[i < j]   # indices of the strict upper triangle
elems = x[up_i, up_j]             # fancy indexing returns a copy
np.random.shuffle(elems)
x[up_i, up_j] = elems             # write the permutation into the upper half
x[up_j, up_i] = elems             # mirror it into the lower half
x
array([[1. , 0.57, 0.88, 0.46, 0.76],
[0.57, 1. , 0.39, 0.29, 0.83],
[0.88, 0.39, 1. , 0.65, 0.89],
[0.46, 0.29, 0.65, 1. , 0.84],
[0.76, 0.83, 0.89, 0.84, 1. ]])
If all your matrices have the same shape, you only need to call meshgrid and find the upper-triangle indices once.
This uses NumPy fancy indexing to fetch the off-diagonal elements.
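To produce the P copies the question asks for, the same idea can be wrapped in a small helper; this is a sketch of my own (the function name, the seed parameter, and the use of np.triu_indices instead of meshgrid are my choices, not from the answer above):
import numpy as np

def permuted_copies(x, P, seed=None):
    rng = np.random.default_rng(seed)
    i, j = np.triu_indices(x.shape[0], k=1)  # strict upper triangle, computed once
    copies = np.empty((P,) + x.shape)
    for p in range(P):
        c = x.copy()
        vals = rng.permutation(c[i, j])      # permute the off-diagonal values
        c[i, j] = vals                       # write them into the upper half
        c[j, i] = vals                       # mirror them into the lower half
        copies[p] = c
    return copies

copies = permuted_copies(x, P=3, seed=0)     # e.g. three permuted copies of x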
I am trying to create a dictionary where the value for each key is itself a dictionary with two entries.
I have two lists of patient (normal tissue, disease tissue) barcodes that correspond to columns of values in a dataframe. My goal is to match patients that are in both lists and then, for each patient found in both lists, append their normal and disease tissue values to a dictionary. The dictionary key would be the patient barcode and the dictionary value would be another dictionary of the normal tissue: values pulled from the dataframe and disease tissue: values pulled from the dataframe.
So starting with
In [3]: df = pd.DataFrame({'Patient1_Normal':['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan'],
'Patient1_Disease':[0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'Patient2_Disease':['nan', 'nan', 'nan', 1.0, 0.24, 0.67, 0.97, 0.98],
'Patient3_Normal': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9],
'Patient3_Disease':[0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'Patient4_Normal':['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91],
'Patient4_Disease':['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'Patient5_Disease': [0.34, 0.27, 'nan', 0.16, 0.32, 0.27, 0.55, 0.51]})
In [4]: df
Out[4]: Patient1_Normal Patient1_Disease Patient2_Disease Patient3_Normal Patient3_Disease Patient4_Normal Patient4_Disease Patient5_Disease
0 nan 0.12 nan 0.21 0.11 nan nan 0.34
1 0.01 0.06 nan 0.25 0.45 0.35 nan 0.27
2 0.1 0.19 nan 0.63 nan nan 0.56 nan
3 0.16 0.34 1 0.92 0.45 0.22 0.72 0.16
4 0.88 nan 0.24 0.30 0.22 0.45 nan 0.32
5 0.83 nan 0.67 0.56 0.89 0.66 0.97 0.27
6 0.82 0.73 0.97 0.78 0.17 0.21 0.91 0.55
7 nan 0.91 0.98 0.90 0.12 0.91 0.79 0.51
Here is what I have so far:
D_col = [col for col in df if '_Disease' in col]
N_col = [col for col in df if '_Normal' in col]
paired_patients = {}
psi_sets = {}
psi_sets['d'] = []
psi_sets['n'] = []
for patient in N_col:
    patient_id = patient[0:8]
    n_id = patient
    d_id = [i for i in D_col if patient_id in i]
    if len(d_id) > 0:
        psi_sets['n'] = df[n_id].to_list()
        for d in d_id:
            psi_sets['d'] = df[d].to_list()
            paired_patients[patient_id] = psi_sets
However, my paired_patients dictionary values are overwriting instead of appending, so the output for paired_patients looks like this:
{'Patient1': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]},
'Patient3': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
How do I fix the last bit of code so that the paired_patients dictionary values are stored correctly for each patient, such that the paired_patients dictionary looks like:
{'Patient1': {'d': [0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'n': ['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan']},
'Patient3': {'d': [0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'n': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
The problem is that a single psi_sets dictionary is shared across all iterations, so every patient ends up pointing at the same object. Creating a fresh psi_sets inside the loop fixes it:
D_col = [col for col in df if '_Disease' in col]
N_col = [col for col in df if '_Normal' in col]
paired_patients = {}
for patient in N_col:
    psi_sets = {}                     # fresh dictionary for each patient
    patient_id = patient[0:8]
    n_id = patient
    d_id = [i for i in D_col if patient_id in i]
    if len(d_id) > 0:
        psi_sets['n'] = df[n_id].to_list()
        for d in d_id:
            psi_sets['d'] = df[d].to_list()
            paired_patients[patient_id] = psi_sets
You can use df.melt, pd.concat, Series.str.split, df.replace, df.groupby, and df.xs, and then finally df.to_dict.
Please check out the following:
>>> df2 = (pd.concat([
df.melt().variable.str.split('_', expand=True),
df.melt().drop('variable', axis=1)
], axis=1)
.replace({'Normal':'n', 'Disease':'d'})
.groupby([0,1]).agg(list))
>>> paired_patients = {k: v for k, v in
df2.groupby(level=0)
.apply(lambda df: df.xs(df.name).value.to_dict())
.to_dict().items()
if not ({'d', 'n'} ^ v.keys())}
>>> paired_patients
{'Patient1': {'d': [0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'n': ['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan']},
'Patient3': {'d': [0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'n': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
EXPLANATION:
>>> df.melt()
variable value
0 Patient1_Normal NaN
1 Patient1_Normal 0.01
2 Patient1_Normal 0.10
.. ... ...
62 Patient5_Disease 0.55
63 Patient5_Disease 0.51
>>> df.melt().variable.str.split('_', expand=True)
0 1
0 Patient1 Normal
1 Patient1 Normal
2 Patient1 Normal
.. ... ...
62 Patient5 Disease
63 Patient5 Disease
[64 rows x 2 columns]
# then concat these two, replace 'Normal' and 'Disease' with 'n' and 'd' and drop
# the 'variable' column
>>> pd.concat([
df.melt().variable.str.split('_', expand=True),
df.melt().drop('variable', axis=1)
], axis=1).replace({'Normal':'n', 'Disease':'d'})
0 1 value
0 Patient1 n NaN
1 Patient1 n 0.01
2 Patient1 n 0.10
.. ... .. ...
62 Patient5 d 0.55
63 Patient5 d 0.51
[64 rows x 3 columns]
# then groupby column [0, 1] and aggregate into list:
>>> df2 = _.groupby([0,1]).agg(list)
>>> df2
value
0 1
Patient1 d [0.12, 0.06, 0.19, 0.34, nan, nan, 0.73, 0.91]
n [nan, 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, nan]
Patient2 d [nan, nan, nan, 1.0, 0.24, 0.67, 0.97, 0.98]
Patient3 d [0.11, 0.45, nan, 0.45, 0.22, 0.89, 0.17, 0.12]
n [0.21, 0.25, 0.63, 0.92, 0.3, 0.56, 0.78, 0.9]
Patient4 d [nan, nan, 0.56, 0.72, nan, 0.97, 0.91, 0.79]
n [nan, 0.35, nan, 0.22, 0.45, 0.66, 0.21, 0.91]
Patient5 d [0.34, 0.27, nan, 0.16, 0.32, 0.27, 0.55, 0.51]
# Now groupby level=0, convert that into a dict, and finally keep only the
# patients that have both 'n' and 'd' keys (plain membership tests here; the
# symmetric-difference trick on dict_keys used above works just as well)
>>> paired_patients = {k: v for k, v in
df2.groupby(level=0)
.apply(lambda df: df.xs(df.name).value.to_dict())
.to_dict().items()
if ('n' in v) and ('d' in v)}
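As a quick sanity check (my addition, assuming the df from the question), only the patients present in both lists survive the filter:
>>> sorted(paired_patients)
['Patient1', 'Patient3', 'Patient4']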
I want to find the column indices of a numpy array where all the elements of the column are greater than a threshold value.
For example,
X = array([[ 0.16, 0.40, 0.61, 0.48, 0.20],
[ 0.42, 0.79, 0.64, 0.54, 0.52],
[ 0.64, 0.64, 0.24, 0.63, 0.43],
[ 0.33, 0.54, 0.61, 0.43, 0.29],
[ 0.25, 0.56, 0.42, 0.69, 0.62]])
In the above case, if the threshold is 0.4, my result should be 1,3.
You can compare against the min of each column using np.where:
large = np.where(X.min(0) >= 0.4)[0]
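For the example above this yields array([1, 3]). An equivalent formulation of my own (same X and threshold assumed) uses np.all instead of the column minima:
>>> np.where(X.min(0) >= 0.4)[0]
array([1, 3])
>>> np.where((X >= 0.4).all(axis=0))[0]
array([1, 3])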
import numpy
x = numpy.array([[0.16, 0.40, 0.61, 0.48, 0.20],
                 [0.42, 0.79, 0.64, 0.54, 0.52],
                 [0.64, 0.64, 0.24, 0.63, 0.43],
                 [0.33, 0.54, 0.61, 0.43, 0.29],
                 [0.25, 0.56, 0.42, 0.69, 0.62]])
threshold = 0.3
size = numpy.shape(x)[1]          # number of columns
for it in range(size):
    y = x[:, it] > threshold      # compare one column at a time
    print(y.all())                # True where the whole column clears the threshold
Try this, please.
A generic solution using a list comprehension:
threshold = 0.4
rows_nb, col_nb = np.shape(X)
cols_above_threshold = [col for col in range(col_nb)
                        if all(X[row][col] >= threshold for row in range(rows_nb))]
I tried searching for this question but I could not find answers that did not seem too complicated.
I am reading from a file that only has space delimiters. The columns are not fixed width. The first two columns are what are giving me the issue. There are 15 columns, where the first two are strings and everything else is a floating-point number.
I tried using numpy's genfromtxt and specified the dtype. However, some of the string entries are empty or contain numbers, so some lines are misread as having 15 or 17 entries.
Here is an example of a few lines:
NGC 104 47 Tuc 00 24 05.67 -72 04 52.6 305.89 -44.89 4.5 7.4 1.9 -2.6 -3.1
NGC 288 00 52 45.24 -26 34 57.4 152.30 -89.38 8.9 12.0 -0.1 0.0 -8.9
NGC 362 01 03 14.26 -70 50 55.6 301.53 -46.25 8.6 9.4 3.1 -5.1 -6.2
Whiting 1 02 02 57 -03 15 10 161.22 -60.76 30.1 34.5 -13.9 4.7 -26.3
How should I approach this? Should I reformat the text by rereading it and then outputting it as a CSV? Should I parse it with a regex? Can I fix this command:
data = np.genfromtxt('PositionalData.txt', skiprows=0, missing_values=(' '), dtype=['S6','S6', 'f4', 'f4', 'f4', 'f4', 'f4', 'f4', 'f5','f4','f4', 'f4', 'f4', 'f4', 'f4'])
Thanks, help would be much appreciated.
Edit:
Here is some output after using some fixed-width setting:
(' NG', 'C 1', 0.0, 4.0, nan, nan, nan, nan, 4.0, 7.0, nan, nan, nan, nan, nan)
(' NG', 'C 2', 8.0, 8.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' NG', 'C 3', 6.0, 2.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' Wh', 'iti', nan, nan, nan, 1.0, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' NG', 'C 1', 2.0, 6.0, 1.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' Pa', 'l 1', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)
The command is data = np.genfromtxt('PositionalDataTest.txt', skiprows=0, delimiter=(3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), missing_values=(' '), dtype=['S7', 'S7', 'f4', 'f4', 'f4', 'f4', 'f4', 'f4', 'f5', 'f4', 'f4', 'f4', 'f4', 'f4', 'f4'])
The lines are:
NGC 104 47 Tuc 00 24 05.67 -72 04 52.6 305.89 -44.89 4.5 7.4 1.9 -2.6 -3.1
NGC 288 00 52 45.24 -26 34 57.4 152.30 -89.38 8.9 12.0 -0.1 0.0 -8.9
NGC 362 01 03 14.26 -70 50 55.6 301.53 -46.25 8.6 9.4 3.1 -5.1 -6.2
Whiting 1 02 02 57 -03 15 10 161.22 -60.76 30.1 34.5 -13.9 4.7 -26.3
NGC 1261 03 12 16.21 -55 12 58.4 270.54 -52.12 16.3 18.1 0.1 -10.0 -12.9
Pal 1 03 33 20.04 79 34 51.8 130.06 19.03 11.1 17.2 -6.8 8.1 3.6
Consider this portion of the data file:
-72 04 52.6
-26 34 57.4
-70 50 55.6
-03 15 10
-55 12 58.4
79 34 51.8
It can be parsed like this:
In [75]: np.genfromtxt('data2', delimiter=[3,3,5], dtype=None).tolist()
Out[75]:
[(-72, 4, 52.6),
(-26, 34, 57.4),
(-70, 50, 55.6),
(-3, 15, 10.0),
(-55, 12, 58.4),
(79, 34, 51.8)]
The rest of the file could be parsed similarly, the difficulty is in finding the right column widths to use in delimiter.
That's laborious, and I'd rather not do that because this solution is fragile.
It is quite possible your data truly is not parseable using fixed-width columns.
So instead let's shoot for a robust solution. np.genfromtxt can accept any iterable of strings as its first argument.
So we can bring the full power of Python string manipulation to bear on the problem by simply defining a generator function to pre-process the lines from the file.
The price we pay for all this power is that calling a Python function once per line will be much, much slower than the C code np.genfromtxt uses when parsing files with a simple delimiter or fixed-width columns.
import numpy as np

def process(iterable):
    for line in iterable:
        # slice off the two fixed-width name fields, split the rest on whitespace
        parts = [line[:11], line[11:24]] + line[24:].split()
        yield '#'.join(parts)

# note: on Python 3, open the file in text mode ('r') so the lines are str
with open('data', 'rb') as f:
    data = np.genfromtxt(process(f), dtype=None, delimiter='#')
print(repr(data))
print(repr(data))
yields
array([ ('NGC 104 ', '47 Tuc ', 0, 24, 5.67, -72, 4, 52.6, 305.89, -44.89, 4.5, 7.4, 1.9, -2.6, -3.1),
('NGC 288 ', ' ', 0, 52, 45.24, -26, 34, 57.4, 152.3, -89.38, 8.9, 12.0, -0.1, 0.0, -8.9),
('NGC 362 ', ' ', 1, 3, 14.26, -70, 50, 55.6, 301.53, -46.25, 8.6, 9.4, 3.1, -5.1, -6.2),
('Whiting 1 ', ' ', 2, 2, 57.0, -3, 15, 10.0, 161.22, -60.76, 30.1, 34.5, -13.9, 4.7, -26.3),
('NGC 1261 ', ' ', 3, 12, 16.21, -55, 12, 58.4, 270.54, -52.12, 16.3, 18.1, 0.1, -10.0, -12.9),
('Pal 1 ', ' ', 3, 33, 20.04, 79, 34, 51.8, 130.06, 19.03, 11.1, 17.2, -6.8, 8.1, 3.6)],
dtype=[('f0', 'S11'), ('f1', 'S13'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<f8'), ('f5', '<i8'), ('f6', '<i8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<f8'), ('f13', '<f8'), ('f14', '<f8')])
Note that the process function uses '#' as the delimiter between columns. If the data contains '#' you will have to choose some other character for the delimiter.
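If named fields would make downstream access easier, genfromtxt's names argument works with the same generator; a sketch, where the field names are my own placeholders rather than anything from the source catalogue:
names = ['id', 'name', 'ra_h', 'ra_m', 'ra_s', 'dec_d', 'dec_m', 'dec_s',
         'glon', 'glat', 'r_sun', 'r_gc', 'x', 'y', 'z']
with open('data', 'r') as f:
    data = np.genfromtxt(process(f), dtype=None, delimiter='#',
                         names=names, encoding=None)
print(data['glon'])   # columns are now addressable by name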