Convert string to multidimensional array in Python

I'm having a problem managing some data that is saved in a really awful format.
I have data for points that correspond to the edges of a polygon. The data for each polygon is separated by the string >, while the x and y values of the points are separated with non-unified criteria: sometimes by a number of spaces, sometimes by some spaces and a tab. I've tried to load such data into an array of arrays with the following code:
f = open('/Path/Data.lb','r')
data = f.read()
splat = data.split('>')
region = []
for number, polygon in enumerate(splat[1:len(splat)], 1):
    region.append(float(polygon))
But I keep getting an error trying to run the float() function (I've cut it as it's much longer):
ValueError: could not convert string to float: '\n -73.311 -48.328\n -73.311 -48.326\n -73.318 -48.321\n ...
... -73.324\t -48.353\n -73.315\t -48.344\n -73.313\t -48.337\n'
Is there a way to convert the data to float without modifying the source file? If not, is there a way to easily modify the source file so that all columns are separated the same way? I guess that way the same code should run smoothly.
Thanks!

Try:
with open("PataIce.lb", "r") as file:
polygons = file.read().strip(">").strip().split(">")
region =[]
for polygon in polygons:
sides = polygon.strip().split("\n")
points = [[float(num) for num in side.split()[:2]] for side in sides]
region.append(points)
Some of the lines contain more than two values, so I've restricted the script to reading only the first two numbers in those cases.

You can use regex to match decimal numbers.
import re

PATH = "<path_to_file>"  # replace with the actual file path

coords = []
with open(PATH) as f:
    for line in f:
        nums = re.findall(r'-?\d+\.\d+', line)
        if len(nums) > 0:
            coords.append(nums)
print(coords)
Note: this solution ignores the trailing 0 at the end of some lines.
Be aware that the results in coords are still strings. You'll need to convert them to float using float().
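For instance, a one-line follow-up converting the collected rows of strings to floats:
coords = [[float(n) for n in row] for row in coords]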

In [79]: astr = '\n -73.311 -48.328\n -73.311 -48.326\n -73.318 -48.321\n -73.324\
...: t -48.353\n -73.315\t -48.344\n -73.313\t -48.337\n'
In [80]: lines = astr.splitlines()
In [81]: lines
Out[81]:
['',
' -73.311 -48.328',
' -73.311 -48.326',
' -73.318 -48.321',
' -73.324\t -48.353',
' -73.315\t -48.344',
' -73.313\t -48.337']
splitlines deals with the \n separator; split() handles the tab and spaces.
In [82]: [line.split() for line in lines]
Out[82]:
[[],
['-73.311', '-48.328'],
['-73.311', '-48.326'],
['-73.318', '-48.321'],
['-73.324', '-48.353'],
['-73.315', '-48.344'],
['-73.313', '-48.337']]
The initial [] needs to be removed one way or another:
In [84]: np.array(Out[82][1:], dtype=float)
Out[84]:
array([[-73.311, -48.328],
[-73.311, -48.326],
[-73.318, -48.321],
[-73.324, -48.353],
[-73.315, -48.344],
[-73.313, -48.337]])
This works only if each line has the same number of elements, here 2. As long as the lists of strings in Out[82] are clean enough, you can let np.array do the conversion from string to float.
Your actual file may require some further handling, but this should give you an idea of the basics.
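Putting the pieces together for the original file, a minimal sketch (using the path from the question and assuming > separates the polygons, as described):
import numpy as np

with open('/Path/Data.lb') as f:
    blocks = f.read().split('>')[1:]  # drop anything before the first '>'

region = [np.array([line.split()[:2] for line in block.splitlines() if line.strip()],
                   dtype=float)
          for block in blocks]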

Related

Error in splitting a string using "re.findall"

I need to split a string into three values (x, y, z). The string is something like (48,25,19).
I used "re.findall" and it works fine, but sometimes it produces this error:
plane_X, plane_Y, plane_Z = re.findall("\d+.\d+", planepos)
ValueError: not enough values to unpack (expected 3, got 0)
this is the code:
def read_data():
    # reading from file
    file = open("D:/Cs/Grad/Tests/airplane test/Reading/Positions/PlanePos.txt", "r")
    planepos = file.readline()
    file.close()
    file = open("D:/Cs/Grad/Tests/airplane test/Reading/Positions/AirportPosition.txt", "r")
    airportpos = file.readline()
    file.close()
    # ==================================================================
    # splitting and getting numbers
    plane_X, plane_Y, plane_Z = re.findall("\d+\.\d+", planepos)
    airport_X, airport_Y, airport_Z = re.findall("\d+\.\d+", airportpos)
    return plane_X, plane_Y, plane_Z, airport_X, airport_Y, airport_Z
What I need is to split the string (48,25,19) into x=48, y=25, z=19, so if someone knows a better way to do this, or how to solve this error, it would be appreciated.
Your regex only works for numbers with a decimal point and not for integers, hence the error. You can instead strip the string of parentheses and whitespace, then split the string by commas, and map the resulting sequence of strings to the float constructor:
x, y, z = map(float, planepos.strip('() \n').split(','))
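For example, with the string from the question:
x, y, z = map(float, '(48,25,19)'.strip('() \n').split(','))
# x => 48.0
# y => 25.0
# z => 19.0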
You can use ast.literal_eval which safely evaluates your string:
import ast
s = '(48,25,19)'
x, y, z = ast.literal_eval(s)
# x => 48
# y => 25
# z => 19
If your numbers are integers, you can use the regex:
re.findall(r"\d+","(48,25,19)")
['48', '25', '19']
If there are mixed numbers:
re.findall(r"\d+(?:\.\d+)?","(48.2,25,19.1)")
['48.2', '25', '19.1']

Python DNA sequence slice gives \n as wrong content in slice result

I am puzzled. I am using Python to slice a long DNA sequence (4,699,673 characters) into substrings of a specific length. It works properly, except for a problem in the result: after 71 good results, \n starts to appear in the result for a few slices, then there are correct slices again, and so on throughout the whole long file.
the code:
import sys

filename = open("out_filePU.txt", 'w')
sys.stdout = filename
my_file = open("GCF_000005845.2_ASM584v2_genomic_edited.fna")
st = my_file.read()
length = len(st)
print('Sequence Length is:', length)
for i in range(0, len(st[:-9])):
    print(st[i:i+9], i)
A figure in the original post shows the stray \n characters in the result file.
Please, I need advice on that.
Your sequence file contains multiple lines, and at the end of each line there is a line break \n. You can remove them with st = my_file.read().replace("\n", "").
Try st = re.sub(r'\s', '', my_file.read()) to replace any newlines or other whitespace (you'll need to add import re at the top of your script).
Then use for i in range(0, len(st[:-9]), 9): to step through your data in increments of nine characters. Otherwise you're only advancing by one character each time; that's why you can see the diagonal patterns in your output.
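A minimal sketch combining both suggestions (strip the whitespace, then step nine characters at a time), using the filenames from the question:
import re

with open("GCF_000005845.2_ASM584v2_genomic_edited.fna") as my_file:
    st = re.sub(r'\s', '', my_file.read())  # remove newlines and other whitespace

with open("out_filePU.txt", 'w') as out:
    print('Sequence Length is:', len(st), file=out)
    for i in range(0, len(st[:-9]), 9):  # non-overlapping 9-character slices
        print(st[i:i+9], i, file=out)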

Parsing numbers in strings from a file

I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
What I want to do is check that the percentages in each "res" field add up to 100%.
def addPoll(pid, party, state, res, filetype):
    with open('Polls.txt', 'a+') as file:  # open file temporarily for writing and reading
        lines = file.readlines()  # get all lines from file
        file.seek(0)
        next(file)  # go to next line --
        # this is supposed to skip the 1st line with pid/party/state/res
        for line in lines:  # loop
            line = line.split(',', 3)[3]
            y = line.split()
            print y
        #else:
            #file.write(pid + "," + party + "," + state + "," + res + "\n")
        #file.close()
    return "pass"
print addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f')
So in my code I manage to split at the last ',' and put the result into a list, but I'm not sure how I can get only the numbers out of that text.
You can use regex to find all the numbers:
import re

for line in lines:
    numbers = re.findall(r'\d+', line)
    numbers = [int(n) for n in numbers]
    print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we cast to a list of ints, then take the sum.
It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do row['res']):
import csv

with open('Polls.txt') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Do something with row['res']
        pass
For the # Do something part, you can either parse the field manually (it appears to be structured): split('-') and then rsplit(' ', 1) each '-'-separated part (the last piece should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but regexes are also a fine solution for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate in row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
    # Handle bad percent (not ending in %)
    pass
else:
    # Throws ValueError if any of the percents aren't integers
    percents = [int(p[:-1]) for p in percents]
    if sum(percents) != 100:
        # Handle bad total
        pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
    # Handle bad total here
    pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file; 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!
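Putting it together, a short sketch (assuming the Polls.txt layout shown above) that flags rows whose percentages don't total 100:
import csv
import re

with open('Polls.txt') as f:
    for row in csv.DictReader(f):
        # pull every integer immediately followed by '%' out of the res field
        percents = [int(m.group(1)) for m in re.finditer(r'(\d+)%', row['res'])]
        if sum(percents) != 100:
            print(row['pid'], 'totals', sum(percents))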

Extracting Data from Multiple TXT Files and Creating a Summary CSV File in Python

I have a folder with about 50 .txt files containing data in the following format.
=== Predictions on test data ===
inst# actual predicted error distribution (OFTd1_OF_Latency)
1 1:S 2:R + 0.125,*0.875 (73.84)
I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (the decimals less than 1.0).
I would like it to look like the following when finished, but preferably as a .csv file.
ID True Pred S R
1 S R 0.125 0.875
2 R R 0.105 0.895
3 S S 0.945 0.055
. . . . .
. . . . .
. . . . .
n S S 0.900 0.100
I'm a beginner and a bit fuzzy on how to get all of that parsed and then concatenated and appended. Here's what I was thinking, but feel free to suggest another direction if that would be easier.
for i in range(1, n):
    s = str(i)
    readin = open('mydata/output/output' + s + 'out', 'r')
    # The files are all named the same but with different numbers associated
    output = open("mydata/summary.csv", "a")
    storage = []
    for line in readin:
        # data extraction/concatenation here
        if line.startswith('1'):
            id = i
            true = # split at the ':' and take the letter after it
            pred = # split at the second ':' and take the letter after it
            # some have error '+'s and some don't so I'm not exactly sure what to do to get the distributions
            ds = # split at the ',' and take the string of 5 digits before it
            if pred == 'R':
                dr = # skip the character after the comma but take the five characters after
            else:
                # take the five characters after the comma
            lineholder = id + ' , ' + true + ' , ' + pred + ' , ' + ds + ' , ' + dr
        else: continue
        output.write(lineholder)
I think using the indexes would be another option, but it might complicate things if the spacing is off in any of the files and I haven't checked this for sure.
Thank you for your help!
Well, first of all, if you want to use CSV, you should use the csv module that comes with Python. More about this module here: https://docs.python.org/2.7/library/csv.html. I won't demonstrate how to use it, because it's pretty simple.
As for reading the input data, here's my suggestion for how to break down every line of the data itself. I assume that lines of data in the input file have their values separated by spaces, and each value cannot contain a space:
def process_line(id_, line):
    pieces = line.split()  # Now we have an array of values
    true = pieces[1].split(':')[1]  # split at the ':' and take the letter after it
    pred = pieces[2].split(':')[1]  # split at the second ':' and take the letter after it
    if len(pieces) == 6:  # There was an error, the + is there
        p4 = pieces[4]
    else:  # There was no '+', only spaces
        p4 = pieces[3]
    ds = p4.split(',')[0]  # split at the ',' and take the digits before it
    if pred == 'R':
        dr = p4.split(',')[1][1:]  # take the part after the comma, skipping the leading '*'
    else:
        dr = p4.split(',')[1]  # take the part after the comma
    return id_ + ' , ' + true + ' , ' + pred + ' , ' + ds + ' , ' + dr
What I mainly used here was the split function of strings: https://docs.python.org/2/library/stdtypes.html#str.split and, in one place, the simple syntax str[1:] to skip the first character of a string (strings are sequences, after all, so we can use this slicing syntax).
Keep in mind that my function won't handle any errors or lines formatted differently from the one you posted as an example. If the values in every line are separated by tabs and not spaces, you should replace the line pieces = line.split() with pieces = line.split('\t').
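For reference, a quick check of process_line with the sample line from the question:
line = '1 1:S 2:R + 0.125,*0.875 (73.84)'
print(process_line('1', line))
# 1 , S , R , 0.125 , 0.875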
I think you can separate the floats and then combine them with the strings with the help of the re module, as follows:
import re

file = open('sample.txt', 'r')
strings = [re.findall(r'\d+\.\d+', line) for line in file.readlines()]
print(strings)
file.close()

file = open('sample.txt', 'r')
nums = [re.findall(r'\w+:\w+', line) for line in file.readlines()]
print(nums)
file.close()

s = nums + strings
print(s)
The comprehensions handle every line of the file, one sublist per line.
Contents of sample.txt:
1 1:S 2:R + 0.125,*0.875 (73.84)
2 1:S 2:R + 0.15,*0.85 (69.4)
When you run the program, the result will be:
[['1:S', '2:R'], ['1:S', '2:R'], ['0.125', '0.875', '73.84'], ['0.15', '0.85', '69.4']]
Simply concatenate them.
This uses regular expressions and the csv module.
import re
import csv

# note: Python's re module has no POSIX class [[:blank:]]; [ \t]* matches leading blanks
matcher = re.compile(r'[ \t]*1.*:(.).*:(.).* ([^ ]*),[^0-9]?(.*) ')
filenametemplate = 'mydata/output/output%iout'
output = csv.writer(open('mydata/summary.csv', 'w'))
for i in range(1, n):
    for line in open(filenametemplate % i):
        m = matcher.match(line)
        if m:
            output.writerow([i] + list(m.groups()))

Speed up a Python function to process files with data segments separated by a blank line

I need to process files with data segments separated by a blank line, for example:
93.18 15.21 36.69 33.85 16.41 16.81 29.17
21.69 23.71 26.38 63.70 66.69 0.89 39.91
86.55 56.34 57.80 98.38 0.24 17.19 75.46
[...]
1.30 73.02 56.79 39.28 96.39 18.77 55.03
99.95 28.88 90.90 26.70 62.37 86.58 65.05
25.16 32.61 17.47 4.23 34.82 26.63 57.24
36.72 83.30 97.29 73.31 31.79 80.03 25.71
[...]
2.74 75.92 40.19 54.57 87.41 75.59 22.79
.
.
.
For this I am using the following function.
On every call I get the necessary data, but I need to speed up the code.
Is there a more efficient way?
EDIT: I will be updating the code with the changes that achieve improvements.
ORIGINAL:
def get_pos_nextvalues(pos_file, indices):
    result = []
    for line in pos_file:
        line = line.strip()
        if not line:
            break
        values = [float(value) for value in line.split()]
        result.append([float(values[i]) for i in indices])
    return np.array(result)
NEW:
def get_pos_nextvalues(pos_file, indices):
    result = ''
    for line in pos_file:
        if len(line) > 1:
            s = line.split()
            result += ' '.join([s[i] for i in indices]) + ' '
        else:
            break
    else:
        return np.array([])
    result = np.fromstring(result, dtype=float, sep=' ')
    result = result.reshape(result.size // len(indices), len(indices))
    return result
pos_file = open(filename, 'r', buffering=1024*10)
[...]
while some_condition:
    vs = get_pos_nextvalues(pos_file, (4, 5, 6))
    [...]
speedup = 2.36
Not converting floats to floats would be the first step. I would suggest, however, first profiling your code and then optimizing the bottleneck parts.
I understand that you've changed your code from the original, but
values = [value for value in line.split()]
is not a good thing either. Just write values = line.split() if this is what you mean.
Seeing how you're using NumPy, I'd suggest some methods of file reading that are demonstrated in their docs.
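For instance, a sketch assuming a single segment has already been isolated: np.loadtxt can parse it and pick out columns in one call:
import io

import numpy as np

# two sample rows from the question, standing in for one isolated segment
segment = io.StringIO('93.18 15.21 36.69 33.85 16.41 16.81 29.17\n'
                      '21.69 23.71 26.38 63.70 66.69 0.89 39.91\n')
vs = np.loadtxt(segment, usecols=(4, 5, 6))  # keep only columns 4, 5 and 6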
You are only reading every character exactly once, so there isn't any real performance to gain.
You could combine strip and split if the empty lines contain a lot of whitespace.
You could also save some time by initializing the numpy array from the start, instead of first creating a Python array and then converting.
Try increasing the read buffer; IO is probably the bottleneck of your code.
open('file.txt', 'r', 1024 * 10)
Also, if the data is fully sequential, you can try to skip the line-by-line code and convert a bunch of lines at once.
Instead of:
if len(line) <= 1:  # only '\n' in «empty» lines
    break
values = line.split()
try this:
values = line.split()
if not values:  # line is wholly whitespace, end of segment
    break
numpy.fromfile doesn't work for you?
arr = np.fromfile('tmp.txt', sep=' ', dtype=float)
Here's a variant that might be faster for few indices. It builds a string of only the desired values so that np.fromstring does less work.
def get_pos_nextvalues_fewindices(pos_file, indices):
    result = ''
    for line in pos_file:
        if len(line) > 1:
            s = line.split()
            for i in indices:
                result += s[i] + ' '
        else:
            break
    else:
        return np.array([])
    result = np.fromstring(result, dtype=float, sep=' ')
    result = result.reshape(result.size // len(indices), len(indices))
    return result
This trades off the overhead of split() and an added loop for less parsing. Or perhaps there's some clever regex trick you can do to extract the desired substrings directly?
Old Answer
np.mat('1.23 2.34 3.45 6; 1.32 2.43 7 3.54') converts the string to a numpy matrix of floating point values (np.mat separates rows with ';'). This might be a faster kernel for you to use. For instance:
import numpy as np

def ReadFileChunk(pos_file):
    chunktxt = ""
    for line in pos_file:
        if len(line) > 1:
            chunktxt = chunktxt + line
        else:
            break
    # np.mat expects ';' between rows, so swap the newlines for ';'
    return np.mat(chunktxt.strip().replace('\n', ';')).tolist()
    # or alternatively
    # return np.array(np.mat(chunktxt.strip().replace('\n', ';')))
Then you can move your indexing stuff to another function. Hopefully having numpy parse the string internally is faster than calling float() repeatedly.
