Parsing Data by Event in Python

I'm trying to use Python to parse data from a text file which is formatted like this:
<event>
A 0.8
B 0.4 0.3 -0.5 0.3
</event>
<event>
A 0.2
B 0.3 0.2 -0.5 0.8
C 0.1 0.3 -0.3 0.2
C -0.2 0.4 -0.1 0.9
</event>
<event>
A 0.4
B 0.4 0.3 -0.5 0.3
C 0.3 0.7 0.6 0.5
</event>
Variables A & B are always present in each event, but as you can see, the C variable can occur up to two times in one event and sometimes doesn't occur at all. There are about 10,000+ events in total.
I'd like to format all of this so I can access each piece of data individually (e.g. column 2 for variable B from event 3), as well as in groups (e.g. plotting variable A, column 0 for all the events), but the repeating C variable is tripping me up a bit. Ideally I would have one column of data for C variable #1 and another for C variable #2, where the data is simply 0 when an event has only one or no C variables.
My code is far from elegant at the moment and the output format isn't quite what it needs to be, so I'd love suggestions on how to simplify and improve this.
M = 10000 # number of events
file = open('data.txt')
a_lines = open('a.txt','w')
b_lines = open('b.txt','w')
c1_lines = open('c1.txt','w')
c2_lines = open('c2.txt','w')
c1 = []
c2 = []
for i in range(M):
    for line in file:
        if not line.strip():
            continue
        if line.startswith("</event>"):
            break
        elif line.startswith("<event>"):
            a = file.next()
            print >>a_lines,i,a
for i in range(M):
    for line in file:
        if line.startswith("B"):
            print >>b_lines,i,line.strip()
            nextline=file.next().strip()
            c1.append(nextline)
            nextline2=file.next().strip()
            c2.append(nextline2)
            break
# Parsing the duplicate C columns...
# I've formatted it so the 0 is aligned with the other data
for i in range(M):
    if "C" in c1[i]:
        print >>c1_lines, i, c1[i]
    else:
        print >>c1_lines, i, "C 0"
for i in range(M):
    if "C" in c2[i]:
        print >>c2_lines, i, c2[i]
    else:
        print >>c2_lines, i, "C 0"
# Sample variable formatting attempt:
b_event_num,b_0,b_1,b_2,b_3=loadtxt("b.txt",usecols=(0,1,2,3,4),unpack=True)
b_0=array(b_0)
b_1=array(b_1)
b_2=array(b_2)
b_3=array(b_3)
b_0=b_0.reshape((len(b_0)),1)
b_1=b_1.reshape((len(b_1)),1)
b_2=b_2.reshape((len(b_2)),1)
b_3=b_3.reshape((len(b_3)),1)
b_points=np.hstack((b_0,b_1,b_2,b_3))
The extracted data itself looks okay, but when I try to load in the columns, I'm getting the following error, and I don't know why:
vals = [vals[i] for i in usecols]
IndexError: list index out of range
Any help would be appreciated; thanks!

The IndexError is coming from trying to access vals[0] when vals = []. If you expand your code the error might make more sense:
vals = []
for i in usecols:
vals[i] = i
The error happens on the first pass through the loop because vals[0] does not exist in an empty list. I would suggest a fix, but I'm not sure what you're trying to do. If you just want vals to be the list [0,1,2,3,4], you can use
vals = range(5)
Edit:
On a side note I don't think that saving it in a separate file is necessary. It would be a lot better to just save it directly into the array, like:
M = 10000 # number of events
file = open('data.txt')
a = []
b = []
c1 = []
c2 = []
def parseLine(line, section):
    line = line.split()
    line = line[1:] # To take out the letter at the start
    section.append(line)
next(file) # Skip the first <event> line
for i in range(M):
    parseLine(next(file), a)
    parseLine(next(file), b)
    nextLine = next(file)
    if nextLine.startswith("C"):
        parseLine(nextLine, c1)
        nextLine = next(file)
        if nextLine.startswith("C"):
            parseLine(nextLine, c2)
            next(file) # To get to the end of the event
        else:
            c2.append([0])
    else:
        c1.append([0])
        c2.append([0])
    next(file, None) # Skip the next <event> line (returns None at end of file)
file.close()
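Be careful with the indexing, though: both events and columns are counted from zero, so the 2nd value of the 8th event for b is b[7][1], i.e. b[event-1][column-1]. The values are stored as strings, so convert them with float() before doing arithmetic.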
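The same idea can be sketched in Python 3 so that the result is a set of NumPy arrays ready for indexing and plotting. This is only an illustration, not the answer's method: the function name parse_events is made up here, and the sketch assumes every C row has four columns (as in the sample data), so a missing C row is padded with four zeros.

```python
import io
import numpy as np

def parse_events(lines):
    """Parse <event> blocks into arrays; missing C rows become zeros."""
    a_rows, b_rows, c1_rows, c2_rows = [], [], [], []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "<event>":
            current = {"A": [], "B": [], "C": []}
        elif line == "</event>":
            zeros = [0.0] * 4  # assumption: C rows always have four columns
            c_rows = current["C"] + [zeros, zeros]  # pad to at least two C rows
            a_rows.append(current["A"][0])
            b_rows.append(current["B"][0])
            c1_rows.append(c_rows[0])
            c2_rows.append(c_rows[1])
            current = None
        elif current is not None:
            name, *values = line.split()
            current[name].append([float(v) for v in values])
    return (np.array(a_rows), np.array(b_rows),
            np.array(c1_rows), np.array(c2_rows))

sample = """<event>
A 0.8
B 0.4 0.3 -0.5 0.3
</event>
<event>
A 0.2
B 0.3 0.2 -0.5 0.8
C 0.1 0.3 -0.3 0.2
C -0.2 0.4 -0.1 0.9
</event>
"""
a, b, c1, c2 = parse_events(io.StringIO(sample))
print(b[1, 2])  # column 2 of B in the second event (zero-based indices)
print(a[:, 0])  # column 0 of A for every event
```

For a real file, pass the open file object instead of the io.StringIO sample. Because the file is read until it ends, the number of events does not need to be known in advance.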

Related

Reading a Tuple Assignment (e.g.written as such d1: p, m, h, = 20, 15, 22) from a Text File and Performing Calculations with Each Variable (e.g. p*h)

I'm reading a text file with several hundred lines of data in Python. The text file contains data written as a tuple assignment. For example, the data looks exactly like this in the text file:
d1: p,h,t,m= 74.15 18 6 0.1 ign: 0.0003
d2: p,h,t,m= 54. 378 -0.14 0.1 ign: 0.0009
How can I separate the data as such:
p = 20
t = 15
etc.
Then, how can I perform calculations on the tuple assignment? For example calculate:
p*p = 20*15?
I am not sure if I should convert the tuple assignment to an array. But I was not successful. In addition, I do not know how to get rid of the d1 and d2: which is there to identify which data set I am looking at
I have read the data and picked out the lines that have the data, (ignoring the First Set line and of Data Given as line)
The results that I need would be:
p (from first set of data d1)*p(from first set of data d2) = 20*15 = 300
p (from second set of data d1)*p(from second set of data d2) = 12*5 = 60
I believe I would need to do this over some kind of loop so that I can separate the data in all the lines in the file.
I would appreciate any help on this! I couldn't find anything pertaining to my question. I would only find how to deal with tuples in the simplest manner but nothing on how to extract variables and performing calculations on a tuple assignment contained in a text file.
EDIT:
After looking at the answer to this question given by @JArunMani, I went back to see if I could understand each line of code. I understand that we need to create a dictionary that fills in the respective values for p, q, etc.
When I try to rewrite the code to how I understand it, I have:
with open("d.txt") as fp: # Opens the file
    # The database kinda thing here
    line = fp.readline() # Read the file's first line
    number, _,cont = line.partition(":")#separates m1 from p, m, h, n =..."
    print(cont)
    data, _,ignore = cont.partition("int") #separates int from p, m, h, n =..."
    print(data) #prints tuple assignment needed
    keys, _,values = data.partition("=")
    print(keys) #prints p, m, h, n
    print(values) #prints values (all numbers after =)
    thisdict = {} #creating an empty dictionary to fill with keys and values
    thisdict[keys] = values
    print(thisdict)
    if "m" in thisdict:
        print("Yes")
print(thisdict) gives me the Output: {' p,m,h,n': ' 76 6818 2.2 1 '}
However, if "m" in thisdict: did not print anything. I do not understand why m is not in the dictionary, yet print(thisdict) shows that thisdict = {} has been filled. Also, is it necessary to add the for loop in the answer given below?
Thank you.
EDIT 2
This is my second attempt at the problem. I am combining both answers, using the parts I understand from each:
def DataExtract(self):
    with open("muonsdata.txt") as fp: # Opens the file
        line = fp.readline() # Read the file's first line
        number, _,cont = line.partition(":")#separates m1 from pt, eta, phi, m =..."
        print(cont)
        data, _,ignore = cont.partition("dptinv") #separates dptinv from pt, eta, phi, m =..."
        print(data) #prints tuple assignment needed
        keys, _,values = data.partition("=")
        print(keys) #prints pt, eta, phi, m
        print(values) #prints values (all numbers after =)
        key = [k for k in keys.split(",")]
        value = [v for v in values.strip().split(" ")]
        print(key)
        print(value)
        thisdict = {}
        data = {}
        for k, v in zip(key, value): #creating an empty dictionary to fill with keys and values
            thisdict[k] = v
        print(thisdict)
        if "m" in thisdict:
            print("Yes")
x = DataExtract("C:/Users/username/Desktop/data.txt")
mul_p = x['m1']['p'] * x['d2']['p']
print(mul_p)
However, this gives me the error: Traceback (most recent call last):
File "read.py", line 29, in
mul_p = x['d1']['p'] * x['d2']['p']
TypeError: 'NoneType' object is not subscriptable
EDIT 3
I now have code built from a combination of answers 1 and 2. It runs, but the while loop does not seem to continue until the end of the file: I only get one result, calculated from the first two lines, and nothing for the remaining lines. It also seems that the d2 data lines are not being read (or that line = fp.readline() is not doing anything), because when I try to calculate m I get this error:
Traceback (most recent call last):
File "read.py", line 37, in
m = math.cosh(float(data[" m2"]["eta"])) * float(data["m1"][" pt"])
KeyError: ' m2'
Here is my code that I have:
import math
with open("d.txt") as fp: # Opens the file
    data ={} #final dictionary
    line = fp.readline() # Read the file's first line
    while line: #continues to end of file
        name, _,cont = line.partition(":")#separates d1 from p, m, h, t =..."
        #print(cont)
        numbers, _,ignore = cont.partition("ign") #separates ign from p, m, h, t =..."
        #print(numbers) #prints tuple assignment needed
        keys, _,values = numbers.partition("=")
        #print(keys) #prints p, m, h, t
        #print(values) #prints values (all numbers after =)
        key = [k for k in keys.split(",")]
        value = [v for v in values.strip().split(" ")]
        #print(key) #prints pt, eta, phi, m
        #print(value)
        thisdict = {}
        for k, v in zip(key, value): #creating an empty dictionary to fill with keys and values
            #thisdict[k] = v
            #print(thisdict)
            #data[name]=thisdict
            line = fp.readline()#read next lines, not working I think
            thisdict[k] = v
            data[name]=thisdict
        print(thisdict)
        #if " m2" in thisdict:
            #print("Yes")
        #print(data)
    #mul_p = float(data["d1"][" p"])*float(data["d1"]["m"])
    m = math.cosh(float(data[" d2"]["m"])) * float(data["m1"][" p"])
    #m1 = float(data["d1"][" p"]) * float(2)
    print(m)
    #print(mul_p)
If I replace the d2's with d1 the code runs fine, except it skips the last d1. I do not know what I am doing wrong. Would appreciate any input or guidance.

So the following function returns a dictionary with values of 'p', 'q' and other variables. But I leave it to you to find out how to multiply or perform operations on them ^^
def DataExtract(path): # 'path' is the path to the data file
    fp = open(path) # Opens the file
    data = {} # The database kinda thing here
    line = fp.readline() # Read the file's first line
    while line: # This goes on till we reach end of file (EOF)
        name, _, cont = line.partition(":") # So this gives 'd1', ':', 'p,h,t,m= ...'
        cont, _, _ = cont.partition("ign") # Drop the trailing 'ign: ...' part, if any
        keys, _, values = cont.partition("=") # Now we split the text into LHS and RHS
        keys = keys.split(",") # Split the variables by ',' as separator
        values = values.replace(",", " ").split() # Split the values on commas or spaces
        temp_d = {} # Dict for variables
        for i in range(min(len(keys), len(values))):
            key = keys[i].strip() # Get the item at the index and remove left-right spaces
            val = values[i].strip() # Same
            temp_d[key] = float(val) # Store it in dictionary but as number
        data[name.strip()] = temp_d # Store the temp_d itself in main dict
        line = fp.readline() # Now read next line
    fp.close() # Close the file
    return data # Return the data
I used simple methods, to make it easy for you. Now to access data, you have to do something like this:
x = DataExtract("your_file_path")
mul_p = x['d1']['p'] * x['d2']['p']
print(mul_p) # Tadaaa !
Feel free to comment...

This answer is quite similar to @JArunMani's, but it is a bit shorter and runs successfully on the sample lines.
The idea is to return your data as a dictionary.
lines = "d1: p,h,t,m= 74.15 18 6 0.1 ign: 0.0003\nd2: p,h,t,m= 54. 378 -0.14 0.1 ign: 0.0009".split("\n") # lines=open("d.txt",'r').read().split("\n")
data = {}
for line in lines:
    l = line.split("ign")[0] # remove "ign:.."
    name_dict, vals_dict = l.split(":") #['d1',' p,h,t,m= 74.15 18 6 0.1']
    keys_str, values_str = vals_dict.split("=") #[' p,h,t,m',' 74.15 18 6 0.1']
    keys=[k for k in keys_str.strip().split(',')] #['p','h','t','m']
    values=[float(v) for v in values_str.split()] #[74.15, 18, 6, 0.1]
    sub_dict = {}
    for k,v in zip(keys, values):
        sub_dict[k]=v
    data[name_dict.strip()]=sub_dict # strip() in case the name has a leading space
Result:
>>>data
{'d1': {'p': 74.15, 'h': 18.0, 't': 6.0, 'm': 0.1}, 'd2': {'p': 54.0, 'h': 378.0, 't': -0.14, 'm': 0.1}}
>>>data['d1']['p']*data['d2']['p']
4004.1000000000004
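On the follow-up in the question's first edit, where "m" in thisdict printed nothing: the printed dictionary {' p,m,h,n': ' 76 6818 2.2 1 '} has a single key, the whole unsplit string, so no key equals "m". A minimal sketch of the difference, using the values from that printout (the names whole and split are illustrative only):

```python
keys = " p,m,h,n"
values = " 76 6818 2.2 1 "

# Storing the unsplit strings creates exactly one entry.
whole = {keys: values}
print("m" in whole)   # False: the only key is ' p,m,h,n'

# Splitting both sides and pairing them gives one entry per variable.
split = dict(zip((k.strip() for k in keys.split(",")),
                 (float(v) for v in values.split())))
print("m" in split)   # True
print(split["p"] * split["h"])
```

This is why both answers loop over (or zip) the split keys and values: the loop is what creates one dictionary entry per variable.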

Performing Calculations on Tuples Contained in a Text File

I have a text file that contains data. A snippet of the text file looks like this:
d1: p,h,t,m= 74.15 18 6 0.1 ign: 0.0003
d2: p,h,t,m= 54. 378 -0.14 0.1 ign: 0.0009
d1: p,h,t,m= 715 8 16 0.1 ign: 0.0003
d2: p,h,t,m= 50 78 4 0.1 ign: 0.0009
(where there is a space before d2). The text file contains several hundred lines.
What I am trying to do is extract the data from d1 and d2 like:
p = 74.15
t = 18
etc
I have done this by creating a dictionary.
Then, I want to perform a calculation on the data as such, for example,
p (from d1)* p(d2) + t(from d1)
and repeat the calculation throughout the txt file.
Here is the code I have:
import math
with open("d.txt") as fp: # Opens the file
    data ={} #final dictionary
    line = fp.readline() # Read the file's first line
    while line: #continues to end of file
        name, _,cont = line.partition(":")#separates m1 from pt, eta, phi, m =..."
        #print(cont)
        numbers, _,ignore = cont.partition("dptinv") #separates dptinv from pt, eta, phi, m =..."
        #print(numbers) #prints tuple assignment needed
        keys, _,values = numbers.partition("=")
        #print(keys) #prints pt, eta, phi, m
        #print(values) #prints values (all numbers after =)
        key = [k for k in keys.split(",")]
        value = [v for v in values.strip().split(" ")]
        #print(key) #prints pt, eta, phi, m
        #print(value)
        thisdict = {}
        for k, v in zip(key, value): #creating an empty dictionary to fill with keys and values
            #thisdict[k] = v
            #print(thisdict)
            #data[name]=thisdict
            line = fp.readline()#read next lines
            thisdict[k] = v
            data[name]=thisdict
        print(thisdict)
        #if " m2" in thisdict:
            #print("Yes")
        #print(data)
    #mul_p = float(data["m1"][" pt"])*float(data["m1"]["eta"])
    m = math.cosh(float(data[" m2"]["eta"])) * float(data["m1"][" pt"])
    #m1 = float(data["m1"][" pt"]) * float(2)
    print(m)
I built this code from a combination of answers to my previous question on this, but two problems remain.
The first problem is that the while loop reads through the entire file except the last two lines:
d1:...
d2:...
The second problem is that it does not seem to read the d2 data lines (or the line = fp.readline() call is not doing anything), because when I try to calculate m, I get the error
Traceback (most recent call last):
File "read.py", line 37, in
m = math.cosh(float(data[" m2"]["eta"])) * float(data["m1"][" pt"])
KeyError: ' m2'
I asked about this on another forum and I am still trying to understand what is wrong with how I wrote the code, and what I need to do to fix it. Any help and guidance is much appreciated! Thank you!

You should try reorganizing your reading process and using a more readable data structure.
As far as I can see, the data in your text file is grouped in pairs of lines, so my suggested process would be:
# do your init outside of the loop
# the 4 lists should end up with the same length
d1p = []
d2p = []
d1t = []
d2t = []
with open("muonsdata.txt") as fp: # Opens the file
    while True:
        d1line = fp.readline() # Read one line, expected to hold d1
        d2line = fp.readline() # Read the second line, expected to hold d2
        if not d2line: # Stop when no complete pair is left
            break
        # do more splitting here
        # extract the numbers and append them to the associated lists
for i in range(len(d1p)):
    m = d1p[i]*d2p[i] + d1t[i]
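A fuller sketch of this pairing approach, under stated assumptions: lines strictly alternate d1, d2; each line ends with an "ign: ..." part as in the posted snippet; and the calculation is the p(d1) * p(d2) + t(d1) example from the question. The helper names parse_line and pair_results are hypothetical.

```python
import io

def parse_line(line):
    """Return (name, {variable: value}) for one 'd1: p,h,t,m= ...' line."""
    name, _, rest = line.partition(":")
    rest = rest.partition("ign")[0]  # drop the trailing 'ign: ...'
    keys, _, values = rest.partition("=")
    names = [k.strip() for k in keys.split(",")]
    numbers = [float(v) for v in values.split()]
    return name.strip(), dict(zip(names, numbers))

def pair_results(fp):
    """Yield p(d1) * p(d2) + t(d1) for each consecutive d1/d2 pair."""
    lines = [ln for ln in fp if ln.strip()]  # ignore blank lines
    for first, second in zip(lines[0::2], lines[1::2]):
        (_, d1), (_, d2) = parse_line(first), parse_line(second)
        yield d1["p"] * d2["p"] + d1["t"]

sample = io.StringIO(
    "d1: p,h,t,m= 74.15 18 6 0.1 ign: 0.0003\n"
    " d2: p,h,t,m= 54. 378 -0.14 0.1 ign: 0.0009\n"
    "d1: p,h,t,m= 715 8 16 0.1 ign: 0.0003\n"
    " d2: p,h,t,m= 50 78 4 0.1 ign: 0.0009\n"
)
results = list(pair_results(sample))
print(results)
```

Each line is read exactly once, and stripping the name and the variable names removes the leading-space keys (such as ' m2') that caused the KeyError in the question.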

Python For loop not incrementing

clean_offset = len(malware)
tuple_clean = []
tuple_malware = []
for i in malware:
    tuple_malware.append([malware.index(i), 0])
    print(malware.index(i))
print(tuple_malware)
for j in clean:
    tuple_clean.append([(clean_offset + clean.index(j)), 1])
    print(clean.index(j))
print(tuple_clean)
import pdb; pdb.set_trace()
training_data_size_mal = 0.8 * len(malware)
training_data_size_clean = 0.8 * len(clean)
i increments as normal and produces correct output however j remains at 0 for three loops and then jumps to 3. I don't understand this.

There is a logic error in clean.index(j).
list.index returns the index of the first matching element in the list.
So if the list contains equal values, every duplicate reports the position of its first occurrence.
You can see this with the code below.
malware = [1,2,3,4,5,6,7,8,8,8,8,8,2]
clean = [1,2,3,4,4,4,4,4,4,2,4,4,4,4]
clean_offset = len(malware)
tuple_clean = []
tuple_malware = []
for i in malware:
    tuple_malware.append([malware.index(i), 0])
    print(malware.index(i))
print(tuple_malware)
for j in clean:
    tuple_clean.append([(clean_offset + clean.index(j)), 1])
    print(clean.index(j))
print(tuple_clean)
training_data_size_mal = 0.8 * len(malware)
training_data_size_clean = 0.8 * len(clean)

In
for a in something:
a is an element contained in something, not its index.
For example:
for n in [1, 10, 9, 3]:
    print(n)
gives
1
10
9
3

You want either
for i in range(len(malware)):
or
for i, element in enumerate(malware):
In the second form, i is the index and element is the item at that index, i.e. malware[i].
The second form is considered best practice when you need both the index and the element at that index in the loop.
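Applied to the question's lists, the enumerate form could look like the sketch below. The sample file names are made up for illustration and deliberately contain duplicates, which is the case where .index() reports the wrong position.

```python
malware = ["a.exe", "b.exe", "a.exe"]  # hypothetical samples; note the duplicate
clean = ["x.txt", "x.txt", "y.txt"]

clean_offset = len(malware)
# enumerate supplies the running index, so duplicates no longer matter
tuple_malware = [[i, 0] for i, _ in enumerate(malware)]
tuple_clean = [[clean_offset + j, 1] for j, _ in enumerate(clean)]

print(tuple_malware)  # [[0, 0], [1, 0], [2, 0]]
print(tuple_clean)    # [[3, 1], [4, 1], [5, 1]]
```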

The OP has already figured this out, but in case anyone needs a TL;DR of Barkin's comment, it is just a small correction.
Replace
for i in malware:
for j in clean:
with
for i in range(len(malware)):
for j in range(len(clean)):
and inside the loops use i and j directly in place of the .index() calls.

How do I EXTRACT all values ending in .000 and print them?

OK, so I have a for loop that evaluates an equation, stepping D by 0.005. I need it to print only the "L" values ending in .000 and nothing else. How do I do that?
import numpy as np
import math
for D in np.arange(7, 9, 0.0050):
    N = 28
    n = 11
    A = 7.32
    P = 0.25
    C = float(D)/float(P) #(P/8)*(2*L-N-n+((2*L-N-n)**(2)-0.810*(N-n)**(2))**(0.5)
    L = 2*C+(N+n)/2+A/C
    print("L = ", "%.3f"% float(L), '\n')
Problems I had:
I had to use np.arange as it wouldn't allow a float in a loop. If you can show me how to get around that, that'd be great.
When using np.arange, I would get "D" values like
D = 7.0009999999999994
L = 75.76939122982431
D = 7.001499999999999
L = 75.7733725630222
D = 7.001999999999999
L = 75.77735389888602
D = 7.002499999999999
L = 75.78133523741519
this causes errors when I go to use these numbers later in the code
this loop takes forever to compute. If there's a better way, show me. I have to make this quick or it won't get used.

This post explains why floats are not exact in Python:
numpy arange: how to make "precise" array of floats?
I used the code below, and it gave me D and L values that print cleanly to three decimal places in your calculation:
for i in range(7000, 9000, 5):
    D = i/1000.0 # divide by a float so Python 2 does not truncate
    print(D)
    N = 28
    n = 11
    A = 7.32
    P = 0.25
    C = float(D)/float(P) #(P/8)*(2*L-N-n+((2*L-N-n)**(2)-0.810*(N-n)**(2))**(0.5)
    L = 2*C+(N+n)/2+A/C
    print("L = ", "%.3f"% float(L), '\n')

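L3 is the variable.
"%.3f" % L3 formats it to three decimal places.
% 1 gives the remainder after dividing by 1, i.e. the fractional part, so % 1 == 0 is true when the rounded value is a whole number, which is what I'm looking for.
if float("%.3f" % L3) % 1 == 0: #L3 is the variable
    do_something()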
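Putting the two answers together, a minimal sketch might step D with integers and keep only the L values whose three-decimal form ends in .000. The function names ends_in_000 and matching_lengths are hypothetical; the constants and the formula are the ones from the question.

```python
def ends_in_000(value):
    """True if value, formatted to three decimals, ends in .000."""
    return ("%.3f" % value).endswith(".000")

def matching_lengths(start=7000, stop=9000, step=5):
    """Return (D, L) pairs whose L prints with .000 after the decimal point."""
    N, n, A, P = 28, 11, 7.32, 0.25
    hits = []
    for i in range(start, stop, step):  # integer steps avoid float drift in D
        D = i / 1000.0
        C = D / P
        L = 2 * C + (N + n) / 2.0 + A / C
        if ends_in_000(L):
            hits.append((D, L))
    return hits

for D, L in matching_lengths():
    print("D = %.3f  L = %.3f" % (D, L))
```

Note that L changes by about 0.04 per 0.005 step in D, so on this grid there may be few or no values that land on .000; a finer step gives more candidates at the cost of more iterations.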

An elegant, readable way to read Butcher tableau from a file

I'm trying to read a specifically formatted file (namely, the Butcher tableau) in python 3.5.
The file looks like this(tab separated):
S
a1 b11 b12 ... b1S
a2 b21 b22 ... b2S
...
aS bS1 bS2 ... bSS
0.0 c1 c2 ... cS
[tolerance]
for example, (tab separated)
2
0.0 0.0 0.0
1.0 0.5 0.5
0.0 0.5 0.5
0.0001
My code looks like I'm writing in C. Is there a more Pythonic approach to parsing this file? Maybe there are NumPy methods that could be used here?
#the data from .dat file
S = 0 #method order, first char in .dat file
a = [] #S-dim left column of Butcher tableau
b = [] #S-dim matrix
c = [] #S-dim lower row
tolerance = 0 # for implicit methods
def parse_method(file_name):
    'read the file_name, process lines, produce a Method object'
    try:
        with open('methods\\' + file_name) as file:
            global S
            S = int(next(file))
            temp = []
            for line in file:
                temp.append([float(x) for x in line.replace('\n', '').split('\t')])
            for i in range(S):
                a.append(temp[i].pop(0))
                b.append(temp[i])
            global c
            c = temp[S][1:]
            global tolerance
            tolerance = temp[-1][0] if len(temp)>S+1 else 0
    except OSError as ioerror:
        print('File Error: ' + str(ioerror))

My suggestion using Numpy:
import numpy as np
def read_butcher(filename):
    with open(filename, 'rb') as fh:
        S = int(fh.readline())
        array = np.fromfile(fh, float, (S+1)**2, '\t')
        rest = fh.read().strip()
    array.shape = (S+1, S+1)
    a = array[:-1, 0]
    b = array[:-1, 1:]
    c = array[ -1, 1:]
    tolerance = float(rest) if rest else 0.0
    return a, b, c, tolerance
Although I'm not entirely sure how consistently numpy.fromfile advances the file pointer... There are no guarantees in the documentation.
Handling of file exceptions should probably be done outside of the parsing method.
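One way to sidestep the file-pointer question is to split the lines yourself and hand only the S+1 table rows to NumPy. This is a sketch of that alternative, not the method above: the name read_butcher_lines is hypothetical, and it assumes the layout from the question (S on the first line, S rows of a and b, one row for c, then an optional tolerance).

```python
import io
import numpy as np

def read_butcher_lines(fh):
    """Parse a Butcher tableau from an open text file (or any iterable of lines)."""
    rows = [line.split() for line in fh if line.strip()]
    S = int(rows[0][0])
    # S rows of (a, b) followed by the row that holds c
    table = np.array([[float(x) for x in row] for row in rows[1:S + 2]])
    a = table[:-1, 0]
    b = table[:-1, 1:]
    c = table[-1, 1:]
    tolerance = float(rows[S + 2][0]) if len(rows) > S + 2 else 0.0
    return a, b, c, tolerance

sample = io.StringIO("2\n0.0\t0.0\t0.0\n1.0\t0.5\t0.5\n0.0\t0.5\t0.5\n0.0001\n")
a, b, c, tolerance = read_butcher_lines(sample)
print(a, b, c, tolerance, sep="\n")
```

Because the function accepts any iterable of lines, the caller decides how to open the file and how to handle file errors, in line with the remark about keeping exception handling outside the parser.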

Code -
from collections import namedtuple
ButcherTableau = namedtuple('ButcherTableau', 's a b c tolerance')
def parse_file(file_name):
    with open(file_name, 'r') as f:
        # keep only the non-empty lines
        file_content = [line.strip() for line in f if line.strip()]
    s = int(file_content[0])
    a = [float(file_content[i].split()[0]) for i in range(1, s + 1)]
    b = [list(map(float, file_content[i].split()[1:]))
         for i in range(1, s + 1)]
    # the row after the s stage rows holds c; skip its leading 0.0
    c = list(map(float, file_content[s + 1].split()[1:]))
    # the tolerance line is optional
    tolerance = float(file_content[s + 2]) if len(file_content) > s + 2 else 0.0
    return ButcherTableau(s, a, b, c, tolerance)
p = parse_file('a.txt')
print('S :', p.s)
print('a :', p.a)
print('b :', p.b)
print('c :', p.c)
print('tolerance :', p.tolerance)
Output -
S : 2
a : [0.0, 1.0]
b : [[0.0, 0.0], [0.5, 0.5]]
c : [0.5, 0.5]
tolerance : 0.0001

Here's a bunch of suggestions you should consider:
from collections import namedtuple
import csv
# for convenience, create a namedtuple type to hold the result
ButcherTableau = namedtuple('ButcherTableau', 'a b c order tolerance')
def parse_method(file_name):
    a, b = [], []
    idx, line = -1, None
    # advice ①: do not assume file path in a function, make assumptions as close to your main function as possible (to make it easier to parameterize later on)
    # advice ②: do not call your file "file", so you are not shadowing the built-in "file" type (Python 2)
    with open(file_name, 'r') as f:
        # read the first line alone to set up your "method order" value before reading all the tab-separated values
        order = int(f.readline())
        # create a csv reader with tabs as the cell separator
        # and enumerate it to get an index for each line
        for idx, line in enumerate(csv.reader(f, delimiter='\t')):
            line = [float(x) for x in line]
            # instead of iterating again, you can just check the index
            # and build your a and b values
            if idx < order:
                a.append(line[0])
                b.append(line[1:])
            elif idx == order:
                # the row after the a/b rows holds c (skip its leading 0.0)
                c = line[1:]
    # if line is None (as set before the for), we did not iterate at all, so make it an error
    if line is None or idx < order:
        raise ValueError("File is empty or incomplete. Could not parse {}".format(file_name))
    # the tolerance, when present, is on the last line, which conveniently is still available once the for is finished
    tolerance = line[0] if idx > order else 0
    # avoid the globals, return the namedtuple instead and use the results in the caller function
    return ButcherTableau(a, b, c, order, tolerance)
This code is untested (it is just a rework of your code as I read it), so it might not work as is, but you might want to take the good ideas and make them your own.
