Using np.genfromtxt to read in data that contains arrays - python

So I am trying to read in some data which looks like this (this is just the first line):
1 14.4132966509 (-1.2936631396696465, 0.0077236319580324952, 0.066687939649724415) (-13.170491147387787, 0.0051387952329040587, 0.0527163312916894)
I'm attempting to read it in with np.genfromtxt using:
skirt_data = np.genfromtxt('skirt_data.dat', names = ['halo', 'IRX', 'beta', 'intercept'], delimiter = ' ', dtype = None)
But it's returning this:
ValueError: size of tuple must match number of fields.
My question is, how exactly do I load in the arrays that are within the data, so that I can pull out the first number in that array? Ultimately, I want to do something like this to look at the first value of the beta column:
skirt_data['beta'][1]
Thanks ahead of time!

If each line has the same structure, I would go with a custom parser.
You can split each line with str.split(sep, maxsplit).
So, something along these lines:
names = ['halo', 'IRX', 'beta', 'intercept']
output = []
with open('skirt_data.dat') as sfd:
    for line in sfd:
        # split off the two scalar columns; 'rest' keeps the parenthesized tuples
        first_col, second_col, rest = line.split(' ', 2)
        output.append({'halo': int(first_col), 'IRX': float(second_col)})
print(output)
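To also get at the parenthesized tuples, so that you can pull out the first number of each beta group, one option is to parse each group with ast.literal_eval. A minimal sketch, assuming every line holds exactly two parenthesized triples separated by ') (':
import ast
import numpy as np

betas, intercepts = [], []
with open('skirt_data.dat') as sfd:
    for line in sfd:
        halo, irx, rest = line.split(' ', 2)
        # the two tuples are separated by ') ('; restore the brackets lost in the split
        beta_str, intercept_str = rest.split(') (')
        betas.append(ast.literal_eval(beta_str + ')'))
        intercepts.append(ast.literal_eval('(' + intercept_str))
betas = np.array(betas)
print(betas[:, 0])  # the first number of every beta tuple
betas ends up with shape (n_lines, 3), so betas[1] is the beta tuple of the second line.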

Remove single quotes around array

I have data that looks like this:
minterms = [['1,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x'], ['x,x,x,x,1,x,x,x,x,x,x,x,x,x,x,x,1,x,x,x,x,x,x']]
and I want to remove the single quotes around each array to get this:
minterms = [[1,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x], [x,x,x,x,1,x,x,x,x,x,x,x,x,x,x,x,1,x,x,x,x,x,x]]
I have tried
mintermNew = minterms.replace("'", "")
and this doesn't work.
What am I doing wrong here?
Edit:
Here is a snippet of my code giving a bit more context.
dontcares = []
mintermAry = []
for mindata in minterms:
    for mindataIdx in mindata:
        mintermAry.append(mindataIdx.split())
print(SOPform(fullsymlst, mintermAry, dontcares))
return
I am using mindataIdx.split() to put the data into an array. mindataIdx is the string that looks like '1,x,x,x,x...'.
Using .split("") as mentioned in the comments throws this error:
mintermAry.append(mindataIdx.split(""))
ValueError: empty separator
Using .split(" ") yields no change either.
Edit 2:
The data is being read into a dataframe from a file. I want to discard the first 4 rows. I am using this method to do it:
df = df.replace('-', 'x', regex=True)
dfstr = df.to_string(header=False, index=False, index_names=False).split('\n')
dfArray = np.array(dfstr)
dfArrayDel = np.delete(dfArray,range(4), 0)
dfArrayData = np.char.lstrip(dfArrayDel)
splitData = np.char.split(dfArrayData)
First of all, you're definitely doing something very wrong, as there is no reason for there to be single quotes around the contents of an array. Is this a string you're working with? Please elaborate.
I'll have to assume you want to split the string in each inner array into separate elements at the commas, in which case you would want this (in JavaScript):
minterms.map(s => s[0].split(","));
I can't tell if you're writing in Python or JS; regardless, your problem is that each inner array of your 2d array contains only a single string, hence why it's all wrapped in quotes. If the string in your inner arrays were split into individual elements, they would look like this:
[[1,'x','x','x','x','x','x','x','x','x','x','x'...], ['x','x','x','x',1,'x'...]]
1 is a number and therefore not wrapped in quotes, while x is a char or string and therefore is. The quotes are there only to visualize the variable's datatype and are not part of the value itself. Since the quotes don't exist in the data, they can't be removed (e.g. by using replace).
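A quick way to see this in Python:
row = "1,x,x".split(",")
print(row)     # ['1', 'x', 'x'] -- the quotes are only part of the list's repr
print(row[1])  # x -- printed without quotes, because they are not in the value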
If your string, before putting it in an array, looks like this:
data = '1,x,x,x,x,x,x,x,x,x,x,x'
you can split it into an array like this:
data_array = data.split(",")
I needed to split mindataIdx by the comma to create individual items, and then it was able to be recognized by SOPform. Thanks!
dontcares = []
mintermAry = []
for mindata in minterms:
    for mindataIdx in mindata:
        mintermAry.append(mindataIdx.split(","))
print(SOPform(fullsymlst, mintermAry, dontcares))

Reading an n-dimensional complex array from a text file to numpy

I am trying to read an N-dimensional complex array from a text file into numpy. The text file is formatted as shown below (including the square brackets, all on a single line):
[[[-0.26905+0.956854i -0.96105+0.319635i -0.306649+0.310259i] [0.27701-0.943866i -0.946656-0.292134i -0.334658+0.988528i] [-0.263606-0.340042i -0.958169+0.867559i 0.349991+0.262645i] [0.32736+0.301014i 0.941918-0.953028i -0.306649+0.310259i]] [[-0.9462-0.932573i 0.968764+0.975044i 0.32826-0.925997i] [-0.306461-0.9455i -0.953932+0.892267i -0.929727-0.331934i] [-0.958728+0.31701i -0.972654+0.309404i -0.985806-0.936901i] [-0.312184-0.977438i -0.974281-0.350167i -0.305869+0.926815i]]]
I would like this to be read into a 2x4x3 complex ndarray.
The file can be quite large (say 2x4x10e6), so any efficiency in the reading would really help.
Here You go:
=^..^=
import numpy as np
import re

# collect raw data
raw_data = []
with open('data.txt', 'r') as data_file:
    for item in data_file.readlines():
        raw_data.append(item.strip('\n'))

data_array = np.array([])
for item in raw_data:
    # split at closing brackets
    split_data = re.split(r'\]', item)
    for string in split_data:
        # clean data: drop opening brackets, use Python's 'j' for the imaginary unit
        clean_data = re.sub(r'\[+', '', string)
        clean_data = re.sub('i', 'j', clean_data)
        # split into individual numbers, dropping empty strings
        parts = re.split(' ', clean_data)
        parts = list(filter(None, parts))
        if len(parts) > 0:
            # collect data (np.complex is deprecated; use the builtin complex)
            data_array = np.hstack((data_array, np.asarray(parts).astype(complex)))

# reshape array: each outer block holds 4 * 3 = 12 values
final_array = np.reshape(data_array, (data_array.shape[0] // 12, 4, 3))
Output:
[[[-0.26905 +0.956854j -0.96105 +0.319635j -0.306649+0.310259j]
[ 0.27701 -0.943866j -0.946656-0.292134j -0.334658+0.988528j]
[-0.263606-0.340042j -0.958169+0.867559j 0.349991+0.262645j]
[ 0.32736 +0.301014j 0.941918-0.953028j -0.306649+0.310259j]]
[[-0.9462 -0.932573j 0.968764+0.975044j 0.32826 -0.925997j]
[-0.306461-0.9455j -0.953932+0.892267j -0.929727-0.331934j]
[-0.958728+0.31701j -0.972654+0.309404j -0.985806-0.936901j]
[-0.312184-0.977438j -0.974281-0.350167j -0.305869+0.926815j]]
[[-0.26905 +0.956854j -0.96105 +0.319635j -0.306649+0.310259j]
[ 0.27701 -0.943866j -0.946656-0.292134j -0.334658+0.988528j]
[-0.263606-0.340042j -0.958169+0.867559j 0.349991+0.262645j]
[ 0.32736 +0.301014j 0.941918-0.953028j -0.306649+0.310259j]]
[[-0.9462 -0.932573j 0.968764+0.975044j 0.32826 -0.925997j]
[-0.306461-0.9455j -0.953932+0.892267j -0.929727-0.331934j]
[-0.958728+0.31701j -0.972654+0.309404j -0.985806-0.936901j]
[-0.312184-0.977438j -0.974281-0.350167j -0.305869+0.926815j]]]
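One caveat for the very large files mentioned in the question: growing data_array with np.hstack on every iteration re-copies the whole array each time, which is quadratic overall. Collecting the pieces in a Python list and concatenating once at the end scales much better; the relevant change would be roughly:
chunks = []
# ...inside the loop, replace the np.hstack line with:
chunks.append(np.asarray(parts).astype(complex))
# ...and after the loop, concatenate once:
data_array = np.concatenate(chunks)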
As it seems, your file is not a "pythonic" list (no commas between objects).
I assume the following:
you cannot change your input (you get it from a 3rd-party source)
the file is not a CSV (no delimiter between rows)
As a result, try to:
convert the strings to Python syntax: after each "[...]" add "," --> [[1+2j, 3+4j], [1+2j, 3+4j]]
add "," between the numbers and change "i" to "j" --> [-0.26905+0.956854j, -0.96105+0.319635j, -0.306649+0.310259j]
(a Python complex number uses the letter j, e.g. 1+2j)
then save it as a CSV
and open it with pandas' read_csv; look at the link: python pandas complex number
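Alternatively, a single regex pass that extracts the complex tokens directly avoids the intermediate CSV; a minimal sketch, assuming every number has the two-part form shown in the question:
import re
import numpy as np

with open('data.txt') as f:
    text = f.read()
# match tokens like -0.26905+0.956854i, then convert i -> j for Python
tokens = re.findall(r'[+-]?\d+\.?\d*[+-]\d+\.?\d*i', text)
values = np.array([complex(t.replace('i', 'j')) for t in tokens])
arr = values.reshape(-1, 4, 3)  # (2, 4, 3) for the sample line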

Parse data from several equally structured blocks of a text file in python

I've got a text file that has several of these blocks of text in it:
Module Resistor_SMD:R_0402_1005Metric (layer B.Cu) (tedit 5B301BBD) (tstamp 5CC0A687)
(at 120.316179 97.92138 90)
(descr "Resistor SMD 0402 (1005 Metric), square (rectangular) end terminal, IPC_7351 nominal, (Body size source: http://www.tortai-tech.com/upload/download/2011102023233369053.pdf), generated with kicad-footprint-generator")
(tags resistor)
(path /610532D4)
(attr smd)
(fp_text reference R59 (at 0 1.17 90) (layer B.SilkS)
I want to pull out the following:
120.316179, 97.92138, 90, and R59
and store it somewhere...
Then, I want to take that collection of line items and throw some away depending on the values of the first two numbers... they're XY coordinates.
Then, write it to a list.
How can I do that with regular expressions?
I'm loading the file and trying to follow along here, but I'm getting lost in the addition of the pandas library.
IMO you don't need re for this task. You can iterate through the lines of your file and, depending on signal strings like '(at ' and 'fp_text reference', fill a list of lists with all your resistor data, e.g.:
with open('textfile.txt') as f:
    data = []
    row = []
    for line in f:
        if row:
            if '(fp_text ref' in line.strip():
                row.append(line.strip().split()[2])
                data.append(row)
                row = []
        else:
            if '(at ' in line.strip():
                # drop the trailing ')' and keep the three coordinates
                row = line.strip()[:-1].split()[1:4]
print(data)
# [['120.316179', '97.92138', '90', 'R59']]
And if you want a pandas dataframe from this data:
import pandas as pd
df = pd.DataFrame(data, columns=['x', 'y', 'z', 'R'])
print(df)
#             x         y   z    R
# 0  120.316179  97.92138  90  R59
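From here, the coordinate-based filtering the question asks about can be done on the dataframe; a sketch with made-up bounds:
# convert the coordinate columns to numbers, then keep only rows in a window
df[['x', 'y']] = df[['x', 'y']].astype(float)
inside = df[(df['x'] < 150) & (df['y'] > 50)]  # example bounds, adjust as needed
print(inside.values.tolist())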
This regex might help you capture your three desired strings:
([\d]+\.[\d]{5,}|R[0-9]+)
It is two simple patterns connected with an | (OR):
the one on the left, [\d]+\.[\d]{5,}, matches your desired float numbers, requiring at least 5 digits after the decimal point, and
the one on the right, R[0-9]+, matches an R followed by digits.
You can adjust these boundaries however you wish and use the captured output via $1.
You can escape language-specific metacharacters such as . with a \, if necessary.
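Applied in Python, that might look like this (file name is hypothetical; the pattern is the one above, written with the equivalent \d shorthand):
import re

pattern = r'(\d+\.\d{5,}|R\d+)'
with open('textfile.txt') as f:
    matches = re.findall(pattern, f.read())
print(matches)  # ['120.316179', '97.92138', 'R59'] for the block in the question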

Write PySpark DF to File of Specialized Format

I'm working with PySpark 2.1 and I need to come up with a way to write my dataframe to a .txt file of a specialized format; so not the typical json or csv, but rather a CTF format (for CNTK).
The file cannot have extra parentheses, commas, etc. It follows the form:
|label val |features val val val ... val
|label val |features val val val ... val
Some code to show this might be as follows:
l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1]))).toDF()
people.show(n=4)

def Convert_to_String(r):
    return '|label ' + r.name + ' ' + '|features ' + str(r.age) + '\n'

m_p = people.rdd.map(lambda r: Row(Convert_to_String(r))).toDF()
m_p.show(n=3)
In the above example, I want to simply append each row's string to a file without any extra characters.
The real data frame is quite large; it is probably OK for it to be split into multiple files, but a single file would be preferable.
Any insight is quite helpful.
Thanks!
Converting my comment to an answer.
Instead of converting each record to a Row and calling toDF(), just map each record to a string. Then call saveAsTextFile().
path = 'path/to/output/file'
# map each record to its formatted string; saveAsTextFile writes one record per line
# (note: flatMap would split each returned string into characters here, so use map)
m_p = people.rdd.map(lambda r: Convert_to_String(r).rstrip('\n'))
m_p.saveAsTextFile(path)
Your data will likely be stored in multiple files, but you can concatenate them together from the command line. The command would look something like this:
hadoop fs -cat path/to/output/file/* > combined.txt
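If the data is small enough to pass through a single executor, you can also ask Spark itself for one part file instead of concatenating afterwards; a minimal sketch:
# collapse to one partition before writing -> a single part-00000 file
# (only safe if the full dataset fits on one executor)
m_p.coalesce(1).saveAsTextFile(path)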

How do I format the output of a list of lists into a text file properly?

I am really new to Python and I am struggling with some problems while working on a student project. Basically, I try to read data from a text file which is formatted in columns. I store the data in a list of lists, sort and manipulate it, and write it back to a file. My problem is aligning the written data in proper columns. I found some approaches like
"%i, %f, %e" % (1000, 1000, 1000)
but I don't know how many columns there will be. So I wonder if there is a way to set all columns to a fixed width.
This is how the input data looks like:
2 232.248E-09 74.6825 2.5 5.00008 499.482
5 10. 74.6825 2.5 -16.4304 -12.3
This is how I store the data in a list of list:
filename = getInput('MyPath', workdir)
lines = []
f = open(filename, 'r')
while 1:
    line = f.readline()
    if line == '':
        break
    splitted = line.split()
    lines.append(splitted)
f.close()
To write the data, I first join all the row elements of each inner list into one string, with a fixed amount of free space between the elements. But what I need instead is a fixed total width including the element, and I don't know the number of columns in the file.
for k in xrange(len(lines)):
    stringlist = ""
    for i in lines[k]:
        stringlist = stringlist + str(i) + ' '
    lines[k] = stringlist + '\n'
f = open(workdir2, 'w')
for i in range(len(lines)):
    f.write(lines[i])
f.close()
This code works basically, but sadly the output isn't formatted properly.
Thank you very much in advance for any help on this issue!
You are absolutely right about being able to format widths with string formatting as you have above. But as you correctly point out, the tricky bit is doing this for a variable-sized output list. Instead, you could build the widths on the fly and use the join() function:
output = ['a', 'b', 'c', 'd', 'e']
# one width of 10 spaces per column (len(output) columns)
width = [10] * len(output)
# write it out, using the join() function
with open('output_example', 'w') as f:
    f.write(''.join('%*s' % i for i in zip(width, output)))
will write out:
'         a         b         c         d         e'
As you can see, the length of the format array width is determined by the length of the output, len(output). This is flexible enough that you can generate it on the fly.
Hope this helps!
String formatting might be the way to go:
>>> print("%10s%9s" % ("test1", "test2"))
     test1    test2
Though you might want to first create strings from those numbers and then format them as I showed above.
I cannot fully follow your writing code, but try working from something like this:
# enumerate is a builtin; no import is needed
with open(workdir2, 'w') as datei:
    for key, item in enumerate(lines):
        line = "%4i %s" % (key, " ".join(item))
        datei.write(line + '\n')
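For completeness, here is a sketch that derives each column's width from the data itself, so it works for any number of columns (assuming all rows have the same length; filename and workdir2 are the names from the question):
with open(filename) as f:
    rows = [line.split() for line in f]
# widest cell per column, found by transposing with zip(*rows)
widths = [max(len(cell) for cell in col) for col in zip(*rows)]
with open(workdir2, 'w') as out:
    for row in rows:
        out.write(' '.join(cell.rjust(w) for w, cell in zip(widths, row)) + '\n')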
