Python: read a '.dat' file with different columns on each line

I need to extract some data from a .dat file, which I usually do with:
import numpy as np
file = np.loadtxt('blablabla.dat')
Here my data are not separated by a specific delimiter but have predefined field widths (a fixed number of characters per column), and some lines have no value in some columns.
Here is a sample to make this clear:
3 0 36 0 0 0 0 0 0 0 99.
-2 0 0 0 0 0 0 0 0 0 99.
2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
-5 0 0 0 0 0 0 0 0 0 99.
99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5
My little code above gets this error:
# Convert each value according to its column and store
ValueError: Wrong number of columns at line 3
Does anyone have an idea how to read this kind of data?

numpy.genfromtxt seems to be what you want; it lets you specify field widths for each column and treats missing data as NaN.
For this case:
import numpy as np
data = np.genfromtxt('blablabla.dat',delimiter=[2,3,4,3,3,2,3,4,5,3,8,5])
If you want to keep the information in the string part of the file, you can read the file twice and specify the usecols parameter:
import numpy as np
number_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=(0,1,2,3,4,5,6,7,8,9,11))
string_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=(10,), dtype=str)

What you essentially need is to get the list of empty "column" positions that serve as delimiters.
This will get you started:
In [108]: table = ''' 3 0 36 0 0 0 0 0 0 0 99.
.....: -2 0 0 0 0 0 0 0 0 0 99.
.....: 2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
.....: 5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
.....: -5 0 0 0 0 0 0 0 0 0 99.
.....: 99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5'''.split('\n')
In [110]: max_row_len = max(len(row) for row in table)
In [111]: from functools import reduce  # needed on Python 3
In [117]: spaces = reduce(lambda res, row: res.intersection(idx for idx, c in enumerate(row) if c == ' '), table, set(range(max_row_len)))
This code starts from the set of all character positions up to the length of the longest row, and reduce keeps only the positions that hold a space in every row.
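From there, a minimal sketch (assuming the spaces set and max_row_len from the snippet above, and the question's file name) of turning those common space positions into field widths for np.genfromtxt:
import numpy as np

# character positions that carry data in at least one row
data_cols = sorted(set(range(max_row_len)) - spaces)

# the start of each contiguous run of data columns marks a new field
starts = [c for i, c in enumerate(data_cols)
          if i == 0 or c != data_cols[i - 1] + 1]
starts[0] = 0  # fold any leading spaces into the first field

# each field spans from its start to the next start; the last runs to line end
widths = [b - a for a, b in zip(starts, starts[1:] + [max_row_len])]

data = np.genfromtxt('blablabla.dat', delimiter=widths, dtype=None)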

Related

Rename the rows and columns of a DataFrame

I have extracted a NC file using Python and, after processing the data, the final output is an array with the shape (199, 314). I convert the array to a DataFrame, but the row and column names (index) run from zero to 199 and 314, respectively.
from netCDF4 import Dataset
import numpy as np
import pandas as pd
data = Dataset('GolestanM.nc', 'r')
dims = data.dimensions
ndims = len(dims)
vars = data.variables
nvars = len(vars)
attrs = data.ncattrs
lon = data.variables['lon'][:]
lat = data.variables['lat'][:]
t = data.variables['time'][496]
fire = data.variables['FireMask'][496,:,:]
dataset = pd.DataFrame(fire)
However, I want to rename these indexes using the following format:
Columns: start at 53.7042 and step by +0.0083 until the name reaches 56.3208
[0 --> 53.7042, 1 --> 53.7125, ..., 314 --> 56.3208]
Rows: start at 38.1125 and step by -0.0083 until the name reaches 36.4625
[0 --> 38.1125, 1 --> 38.1042, ..., 199 --> 36.4625]
To do this I have the code below:
dataset = dataset.rename(index={0: "38.1125"})
dataset = dataset.rename(columns={0: "53.7042"})
dataset = dataset.rename(index=lambda x: x + 0.0083,
                         columns=lambda x: x + 0.0083)
However, doing this gives me the following error:
TypeError: can only concatenate str (not "float") to str
Can anyone help me with the problem?
The idea is to multiply x (the column or index position) inside the lambda function:
# sample data
dataset = pd.DataFrame(0, index=range(10), columns=range(10))
dataset = dataset.rename(index=lambda x: 38.1125 - 0.0083 * x,
                         columns=lambda x: 53.7042 + 0.0083 * x)
print (dataset)
53.7042 53.7125 53.7208 53.7291 53.7374 53.7457 53.7540 \
38.1125 0 0 0 0 0 0 0
38.1042 0 0 0 0 0 0 0
38.0959 0 0 0 0 0 0 0
38.0876 0 0 0 0 0 0 0
38.0793 0 0 0 0 0 0 0
38.0710 0 0 0 0 0 0 0
38.0627 0 0 0 0 0 0 0
38.0544 0 0 0 0 0 0 0
38.0461 0 0 0 0 0 0 0
38.0378 0 0 0 0 0 0 0
53.7623 53.7706 53.7789
38.1125 0 0 0
38.1042 0 0 0
38.0959 0 0 0
38.0876 0 0 0
38.0793 0 0 0
38.0710 0 0 0
38.0627 0 0 0
38.0544 0 0 0
38.0461 0 0 0
38.0378 0 0 0
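Equivalently, a sketch that skips rename and assigns the coordinate labels directly (assuming the (199, 314) shape from the question):
import numpy as np
import pandas as pd

# stand-in for the (199, 314) FireMask frame from the question
dataset = pd.DataFrame(0, index=range(199), columns=range(314))

dataset.index = np.round(38.1125 - 0.0083 * np.arange(199), 4)
dataset.columns = np.round(53.7042 + 0.0083 * np.arange(314), 4)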

How to split a list using two nested conditions

Basically I have a list of 0s and 1s. Each value in the list represents a data sample from an hour, so 24 values in the list correspond to 24 hours, or a single day. I want to capture the first time the data cycles from 0s to 1s and back to 0s within a span of 24 hours (or vice versa, from 1s to 0s and back to 1s).
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1]
expected output:
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0]
output = [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]
# ^ cycle.1:day.1 |dayline ^cycle.1:day.2
In the output list, a 1 means one cycle is completed at that position of the signal list, and every other position is 0. There should only be one cycle per day; that's why there is only a single 1.
I don't know how to split the list according to that, so can someone please help?
It seems to me that what you are trying to do is split your data into blocks of 24 and then find either the first rising edge or the first falling edge, depending on the first hour in that block.
Below I have tried to distill my understanding of what you are trying to accomplish into the following function. It takes a numpy array containing zeros and ones, as in your example, checks what the first hour in the day is, and decides what type of edge to look for.
It detects an edge using np.diff, which gives us an array containing -1s, 0s, and 1s. We then look for the first index of either a -1 (falling edge) or a 1 (rising edge). The function returns that index, or, if no edge is found, the index of the last element, or nothing.
For more info, see the docs for the numpy features used here: np.diff, numpy.ndarray.nonzero, np.array_split.
import numpy as np

def get_cycle_index(day):
    '''
    Returns the first index of a cycle as defined by nipun vats;
    if no cycle is found, returns nothing.
    '''
    first_hour = day[0]
    if first_hour == 0:
        edgetype = -1
    else:
        edgetype = 1
    # pad with the last value so the diff is as long as `day`
    edges = np.diff(np.r_[day, day[-1]])
    if (edges == edgetype).any():
        return (edges == edgetype).nonzero()[0][0]
    elif (day.sum() == day.size) or day.sum() == 0:
        return
    else:
        return day.size - 1
Below is an example of how you might use this function in your case.
import numpy as np

_data = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
#_data = np.random.randint(0,2,280, dtype='int')
data = np.array(_data, 'int')

# split the data into a set of 'day' blocks
blocks = np.array_split(data, np.arange(24, data.size, 24))

_output = []
for i, day in enumerate(blocks):
    print(f'day {i}')
    buffer = np.zeros(day.size, dtype='int')
    print('\tsignal:', *day, sep=' ')
    cycle_index = get_cycle_index(day)
    if cycle_index is not None:  # an edge at index 0 is still a valid cycle
        buffer[cycle_index] = 1
    print('\toutput:', *buffer, sep=' ')
    _output.append(buffer)

output = np.concatenate(_output)
print('\nfinal output:\n', *output, sep=' ')
This yields the following output:
day 0
signal: 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0
output: 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 1
signal: 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
output: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 2
signal: 0 0 0 0 0 0
output: 0 0 0 0 0 0
final output:
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Create a new row with zeroes in wide Pandas DataFrame

I'm trying to fill an empty DataFrame with one row of zeroes. This is apparently a lot harder than it sounds. This is my attempt:
Array = pd.DataFrame(columns=["curTime", "HC", "AC", "HG", "HF1", "HF2", "HF3", "HF4", "HF5", "HF6",
                              "HF7", "HF8", "HF9", "HF10", "HF11", "HF12", "HD1", "HD2", "HD3", "HD4", "HD5", "HD6",
                              "AG", "AF1", "AF2", "AF3", "AF4", "AF5", "AF6", "AF7", "AF8", "AF9", "AF10", "AF11", "AF12",
                              "AD1", "AD2", "AD3", "AD4", "AD5", "AD6"])
appendArray = [[0] * len(Array.columns)]
Array = Array.append(appendArray, ignore_index=True)
This, however, creates a row that stacks another 41 columns to the right of my existing 41 columns and fills them with zeroes, while the original 41 columns get NaN values.
How do I do this most easily?
You can use a pd.Series within the append; since appendArray above is a nested list, pass its single inner row:
Array.append(pd.Series(appendArray[0], index=Array.columns), ignore_index=True)
Out[780]:
curTime HC AC HG HF1 HF2 HF3 HF4 HF5 HF6 ... AF9 AF10 AF11 \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
AF12 AD1 AD2 AD3 AD4 AD5 AD6
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
[2 rows x 41 columns]
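Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas, a minimal sketch of the same zero-row append (reusing Array from above):
import pandas as pd

# build a one-row frame of zeros and concatenate it
zero_row = pd.DataFrame([[0] * len(Array.columns)], columns=Array.columns)
Array = pd.concat([Array, zero_row], ignore_index=True)

# or, in place, enlarge by label:
Array.loc[len(Array)] = [0] * len(Array.columns)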

Converting this operation from Matlab to Python

I have this line in some Matlab script that I'm trying to convert to Python. Here m = 20 and n = 20, and the dimensions of I_true are [400, 1].
I want to convert the following Matlab code:
A=zeros((2*m*n),(2*m*n)+2);
A(1:m*n,(2*m*n)+1)=-I_true(:);
Am I converting it right?
Converted code in Python:
for i in range(0, m*n):
    for j in range((2*m*n)+1):
        A[i][j] = I_true[i]
Let's look at a small example, with n = 2, m = 2:
In Octave (and presumably Matlab):
octave:50> m = 2; n = 2;
octave:51> I_true = [1;2;3;4];
octave:52> A = zeros((2*m*n),(2*m*n)+2);
octave:53> A(1:m*n,(2*m*n)+1)=-I_true(:)
A =
0 0 0 0 0 0 0 0 -1 0
0 0 0 0 0 0 0 0 -2 0
0 0 0 0 0 0 0 0 -3 0
0 0 0 0 0 0 0 0 -4 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
The equivalent in Python (with n = 20, m = 20) would be:
import numpy as np
n, m = 20, 20
I_true = np.arange(1, n*m + 1)  # just as an example
A = np.zeros((2*m*n, 2*m*n + 2), dtype=I_true.dtype)
A[:m*n, 2*m*n] = -I_true
The reason the last line uses A[:m*n, 2*m*n] and not A[1:m*n, (2*m*n)+1] is that Python uses 0-based indexing, whereas Matlab uses 1-based indexing.
Check this SO question as well.
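To make the index translation concrete, here is the small Octave example above reproduced in numpy (a sketch with m = n = 2 and the same sample data):
import numpy as np

m = n = 2
I_true = np.array([1, 2, 3, 4])
A = np.zeros((2*m*n, 2*m*n + 2), dtype=I_true.dtype)
# Matlab's A(1:m*n, (2*m*n)+1) becomes A[:m*n, 2*m*n]:
# rows 1..m*n -> :m*n (0-based, stop exclusive); column 2*m*n+1 -> index 2*m*n
A[:m*n, 2*m*n] = -I_true
print(A)  # matches the Octave output above: -1..-4 in the ninth column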
You can define a matrix with 2*m*n rows and (2*m*n)+2 columns in Python like this (note the row index belongs in the outer comprehension):
m = 20
n = 20
a = [[0 for j in range((2*m*n) + 2)] for i in range(2*m*n)]
Now that you have your matrix, you can assign values to its elements in different ways. One example would be using for loops to assign values from another matrix of the same size:
for i in range(2*m*n):
    for j in range((2*m*n) + 2):
        a[i][j] = I_true[i][j]
I hope it helps.

Finding a value by looping over multiple files in Python

I am very new to Python, so please bear with me.
I have files with atom coordinates. The files follow a certain pattern, but the coordinates are not necessarily on the same line. The files also contain some text; below is the important part of one file:
<Gold.Protein.RotatedAtoms>
28.5571 85.1121 3.9003 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
27.3346 84.9085 3.2531 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
28.9141 86.4057 4.2554 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
26.4701 85.9748 2.9810 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
28.0456 87.4704 3.9845 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
26.8436 87.2569 3.3417 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
26.1924 88.0932 3.1196 H 0 0 0 0 0 0 0 0 0 0 0 0
27.0510 83.9062 2.9565 H 0 0 0 0 0 0 0 0 0 0 0 0
What I want to do is the following:
Get Python to recognize whether the number on the 5th row in the 6th column (in our case 3.3417) is more or less than 6. Then, if the value is more than 6, write the FILENAME of the file to a text document. Note that the position of this chunk of information changes between files; that is to say, the number 3.3417 is not always on the same row.
Also, all the numbers change all the time.
I was thinking I might loop through the text, scanning for a line with "Gold.Protein.RotatedAtoms", and then take the 3rd entry on the line 5 rows down. But how would one do that?
Thanks for your help!
Split all the lines of the text into a list using splitlines().
Find the index of the line containing "Gold.Protein.RotatedAtoms" using enumerate in a list comprehension, something like this:
index = [index for index, line in enumerate(all_lines) if "Gold.Protein.RotatedAtoms" in line]
Add 5 to that index to get the line you need from all_lines, use the split() method to split it into tokens, and finally take out the 3rd element with the index operator (3rd element = line.split()[2]); see the sketch below.
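Putting those steps together, here is a minimal sketch; the *.dat glob pattern and the hits.txt output name are placeholders, and the > 6 threshold comes from the question:
import glob

for filename in glob.glob('*.dat'):  # hypothetical input pattern
    with open(filename) as fh:
        all_lines = fh.read().splitlines()
    # indices of every line containing the tag (there may be more than one)
    indices = [i for i, line in enumerate(all_lines)
               if "Gold.Protein.RotatedAtoms" in line]
    for i in indices:
        value = float(all_lines[i + 5].split()[2])  # 3rd token, 5 rows down
        if value > 6:
            with open('hits.txt', 'a') as out:
                out.write(filename + '\n')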
As Lanaru stated, you could read from the file and split its output into an array.
Like so:
#!/usr/bin/env python

def s_coord():
    fo = open('Gold.Protein.RotatedAtoms')
    count = 1
    for i in fo.readlines():
        array = i.split()
        # guard against the tag line and blank lines, which have fewer tokens
        if len(array) > 2 and array[2] == "3.3417":
            print("Element 3.3417 is in the {0} row.".format(count))
        count = count + 1

def main():
    s_coord()
    return 0

if __name__ == '__main__':
    main()
It seems to me that the value 3.3417 is in the third column, so I may not understand your question.
I think regular expressions are the cleanest way to do this. I used http://kodos.sourceforge.net/ to create the following regular expression and code.
import re
# common variables
rawstr = r"""^\s*([0-9.]+)\s*([0-9.]+)\s*([0-9.]+)\s*([a-zA-Z.]+)"""
matchstr = """<Gold.Protein.RotatedAtoms>
28.5571 85.1121 3.9003 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
27.3346 84.9085 3.2531 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
28.9141 86.4057 4.2554 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
26.4701 85.9748 2.9810 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
28.0456 87.4704 3.9845 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
26.8436 87.2569 3.3417 C.ar 0 0 0 0 0 0 0 0 0 0 0 0
26.1924 88.0932 3.1196 H 0 0 0 0 0 0 0 0 0 0 0 0
27.0510 83.9062 2.9565 H 0 0 0 0 0 0 0 0 0 0 0 0"""
# build a compiled pattern object
compile_obj = re.compile(rawstr, re.MULTILINE)
match_obj = compile_obj.search(matchstr)

for values in compile_obj.findall(matchstr):
    if values[2] == '3.3417':
        print('found it')
You can modify the conditional in the loop to look for your desired cases and change the print to write to a file, as sketched below.
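For instance, here is a hedged sketch of that change, reusing the same pattern but reading from a file; the input and output file names are placeholders, and the > 6 threshold comes from the question:
import re

rawstr = r"""^\s*([0-9.]+)\s*([0-9.]+)\s*([0-9.]+)\s*([a-zA-Z.]+)"""
compile_obj = re.compile(rawstr, re.MULTILINE)

filename = 'blablabla.dat'  # hypothetical input file
with open(filename) as fh:
    contents = fh.read()

rows = compile_obj.findall(contents)
# 5th matched coordinate row, 3rd number, per the question's description
if len(rows) >= 5 and float(rows[4][2]) > 6:
    with open('results.txt', 'a') as out:
        out.write(filename + '\n')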
