I have a RADIUS capture file that I need to parse. How can I grab individual value pairs and aggregate them? Here is a quick snippet of the file:
Acct-Session-Id = "1234adb"
Acct-Session-Time = 141312
Acct-Input-Octets = 1234123
This repeats on and on with the same structure but different values.
I need to aggregate the octets, which is easy since I can just check if "Acct-Input-Octets" is in the line.
The problem is that the running total must reset if Session-Time goes to 0 (i.e. the client reconnects). If the total doesn't reset at that point, it is an error, because in RADIUS, Input-Octets must reset with a new Session-ID.
Something like this?
totals = 0
for line in fileObj:
    name, value = line.split('=')
    if name.strip() == 'Acct-Session-Time' and value.strip() == '0':
        totals = 0
    elif name.strip() == 'Acct-Input-Octets':
        totals += int(value.strip())
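If the reset-without-reset error case from the question also needs to be caught, the snippet above could be extended along these lines. This is only a sketch (the function name `aggregate_octets` is made up here): it assumes every relevant line has the `name = value` shape shown in the snippet, and that each session's `Acct-Input-Octets` line comes after its `Acct-Session-Time` line.

```python
def aggregate_octets(lines):
    """Sum Acct-Input-Octets, resetting when Acct-Session-Time hits 0.

    Raises ValueError if a session-time reset is not followed by an
    octet-counter reset (the RADIUS invariant from the question).
    """
    total = 0
    expect_reset = False  # set when Acct-Session-Time == 0 is seen
    for line in lines:
        name, _, value = line.partition('=')
        name, value = name.strip(), value.strip().strip('"')
        if name == 'Acct-Session-Time' and value == '0':
            total = 0
            expect_reset = True
        elif name == 'Acct-Input-Octets':
            if expect_reset and value != '0':
                raise ValueError('session time reset but octets did not')
            expect_reset = False
            total += int(value)
    return total
```

Since `partition` never fails on lines without an `=`, unknown lines simply fall through both branches.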
Here is a regex approach:
Steps:
1. read times and octets into two lists
2. go over times and store the last index of a '0' element; meanwhile, check the same index in octets and raise an exception if it is not also '0'
3. add up the values in octets from that last '0' index to the end
import re

log = open('log.txt').read()
times = re.findall(r'Acct-Session-Time\s*=\s*(\d+)', log)
octets = re.findall(r'Acct-Input-Octets\s*=\s*(\d+)', log)

last_zero_index = 0
for i in range(len(times)):
    if times[i] == '0':
        last_zero_index = i
        if octets[i] != '0':
            raise Exception('Session time is reset but the usage is not')

totals = 0
for value in octets[last_zero_index:]:  # note: [last_zero_index:], not [-last_zero_index:]
    totals += int(value)
print(totals)
I want to create a dataframe to which various users (name, phone number, address...) are continuously being added. Now I need a function that automatically generates an ID once a new, non-existing user is added to the dataframe.
The first user should get the ID U000001, the second user the ID U000002 and so on.
What's the best way to do this?
If I'm understanding correctly, the main problem is the leading zeros, i.e. you can't just increment the previous ID, because typecasting '0001' gives 1 instead of 0001. Please correct me if I'm wrong.
Anyway, here's what I came up with. It's far more verbose than you probably need, but I wanted to make sure my logic was clear.
def foo(previous):
    """
    Takes in a string of format 'U#####...'
    Returns the incremented value in the same format.
    Returns None if previous is already maxed out (i.e. 'U9999')
    """
    value_str = previous[1:]    # chop off 'U'
    value_int = int(value_str)  # get integer value
    new_int = value_int + 1     # increment
    new_str = str(new_int)      # turn back into string
    # return None if exceeding character limit on ID
    if len(new_str) > len(value_str):
        print("Past limit")
        return None
    # add leading zeroes
    while len(new_str) < len(value_str):
        new_str = '0' + new_str
    # add 'U' and return
    return 'U' + new_str
Please let me know if I can clarify anything! Here's a script you can use to test it:
# test
current_id = 'U0001'
while True:
    current_id = foo(current_id)
    print(current_id)
    if current_id is None:
        break
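For what it's worth, the zero-padding can also be handled by `str.zfill`, which makes the increment much shorter. A sketch of the same idea (the name `next_id` is made up; it assumes the same 'U' + fixed-width digits format):

```python
def next_id(previous):
    """Increment an ID like 'U0001' -> 'U0002', preserving the width."""
    width = len(previous) - 1           # number of digits after the 'U'
    n = int(previous[1:]) + 1
    if n >= 10 ** width:                # past the limit, e.g. after 'U9999'
        return None
    return 'U' + str(n).zfill(width)    # re-pad with leading zeros
```

`zfill` pads the string with leading zeros up to the requested width, so no explicit loop is needed.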
I have a program that asks the user for a city/county, reads a file to find any lines with the city or county they specified, and at the end should print the date on which the increase in cases was highest.
def main():
    #open the file
    myFile = open("Covid Data.txt")
    #read the first line
    firstLine = myFile.readline()
    #set current, previous, and greatest to 0
    current = 0
    previous = 0
    greatest = 0
    #ask user for a city/county name
    userLocation = input("Please enter a location ").title
    #for each line in the file
    for dataLine in myFile:
        #strip the end of the line
        dataLine = dataLine.rstrip("\n")
        #split the data line by the commas and place the parts into a list
        dataList = dataLine.split(",")
        #if dataList[2] is equal to location
        if dataList[2] == userLocation:
            #subtract previous from current to find the number of cases that the total increased by
            cases = current - previous
            #if cases is higher than what is currently set as the greatest
            if cases > greatest:
                #set the new greatest to amount of cases
                greatest = cases
                #save the date of the current line
                date = str(dataList[0])
    #At the end print the data for the highest number of cases
    print("On",date," ",location," had the highest increase of cases with ",cases," cases.")
    #close file
For some reason, every time I run the code, after I type in what city/county I want to view information for, I keep getting an UnboundLocalError for the variable "date". It tells me that it was referenced before assignment, even though I clearly define it. Why am I getting this error?
You will need to initialize a value for the date variable before entering the loop, for example date = None. Same with cases. The problem is that if no matching data is found, date never gets set inside the loop and thus doesn't exist afterwards.
You also never update the values of current or previous, which is likely the cause of the bug where date never gets set (cases will always be 0 in the loop).
There is also a typo in the print, where you use location instead of the actual variable, userLocation.
My friend, you are running into the behaviour of locals() and globals().
I am quite sure that if you put:
globals()['date'] = str(dataList[0])
you won't have this problem anymore. Check this page; in 5 minutes you will understand:
https://www.geeksforgeeks.org/global-local-variables-python/
Your code has a few more defects.
title is a method, so you have to call it as .title().
You have to define your variables outside of the conditions.
The location variable is undefined in your print call.
I have written a working version of your code.
Code:
def main():
    # open the file
    myFile = open("Covid Data.txt")
    # read the first line
    firstLine = myFile.readline()
    # set current, previous, and greatest to 0
    current = 0
    previous = 0
    greatest = 0
    cases = 0
    date = None
    # ask user for a city/county name
    userLocation = input("Please enter a location ").title()
    # for each line in the file
    for dataLine in myFile:
        # strip the end of the line
        dataLine = dataLine.rstrip("\n")
        # split the data line by the commas and place the parts into a list
        dataList = dataLine.split(",")
        print(dataList)
        # if dataList[2] is equal to location
        if dataList[2] == userLocation:
            # subtract previous from current to find the number of cases that the total increased by
            cases = current - previous
            # if cases is higher than what is currently set as the greatest
            if cases > greatest:
                # set the new greatest to amount of cases
                greatest = cases
                # save the date of the current line
                date = str(dataList[0])
    # At the end print the data for the highest number of cases
    print("On", date, " ", userLocation, " had the highest increase of cases with ", cases, " cases.")
    myFile.close()

main()
Covid Data.txt:
First line
2020.12.04,placeholder,Miami
Test:
$ python3 test.py
Please enter a location Texas
On None Texas had the highest increase of cases with 0 cases.
$ python3 test.py
Please enter a location Miami
On None Miami had the highest increase of cases with 0 cases.
NOTE:
As you can see above, your logic doesn't work even though the script runs. Some of the conditions will always be False; that is why, for example, the date variable never gets a value and stays None.
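For completeness, here is one way the loop could actually track the day-over-day increase. This is only a sketch (the function name `highest_increase` is made up): it assumes the cumulative case count lives in the second CSV column, which the sample data does not confirm (it only shows a placeholder there).

```python
def highest_increase(rows, user_location):
    """Return (date, increase) for the largest day-over-day jump.

    rows: iterable of 'date,total_cases,location' strings.
    Assumes column 1 is a cumulative case count (an assumption --
    the sample data only has a placeholder in that column).
    """
    previous = None
    greatest = 0
    date = None
    for row in rows:
        data = row.rstrip('\n').split(',')
        if data[2] != user_location:
            continue
        current = int(data[1])
        # only compare once we have seen a previous day for this location
        if previous is not None and current - previous > greatest:
            greatest = current - previous
            date = data[0]
        previous = current
    return date, greatest
```

Seeding `previous` from the first matching row is what makes `current - previous` meaningful; in the original code both stayed 0 forever.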
So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program definitely works, but it is ridiculously slow whenever run against the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this causes the 'check50' marking system to time out and return a negative result when testing with the large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + STR_len)]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The code I am running so far is as follows
import os
import math
import statistics

def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()
    index = 0
    while index < len(values):
        values(index) = int(values(index))
        index += 1
    print(values)

main()
The text file contains 41 rows of numbers each entered on a single line like so:
151868
153982
156393
158956
161884
165069
168088
etc.
My task is to create a program which shows the average change in population during the time period, the year with the greatest increase in population, and the year with the smallest increase in population (from the previous year).
The code will print each of the text files entries on a single line, but upon trying to convert to int for use with the statistics package I am getting the following error:
values(index) = int(values(index))
SyntaxError: can't assign to function call
The values(index) = int(values(index)) line was taken from reading as well as resources on stack overflow.
You can change values = infile.read() to values = list(infile.read()) and it will give you a list instead of a string.
One thing that tends to happen when reading a file like this is that at the end of every line there is an invisible '\n' marking a new line in the text file. So an easy way to split by lines and turn them into integers is, instead of values = list(infile.read()), to use values = values.split('\n'), which splits based on lines (as long as values was previously declared).
The while loop you have can easily be replaced with a for loop, using len(values) as the end.
The values(index) = int(values(index)) part is the right idea, but indexing uses square brackets: in the for loop you can use values[i] = int(values[i]) to turn each element into an integer, so that values becomes a list of integers.
How I would personally set it up would be:

import os
import math
import statistics

def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()

    values = values.split('\n')  # Splits based off of lines
    for i in range(0, len(values)):  # loops over values and turns each part into an integer
        values[i] = int(values[i])

    changes = []
    # Use a for loop to get the changes between each number.
    for i in range(0, len(values)-1):  # the -1 avoids an indexing error when accessing i+1 at the end
        changes.append(values[i+1] - values[i])  # the difference between the current value and the next

    print('The max change :', max(changes), 'The minimal change :', min(changes))
    # Each element of changes lines up with the value it starts from, so the index of a
    # change also locates the matching population in values.
    print('A change of :', max(changes), 'Happened at', values[changes.index(max(changes))])  # changes.index(max(changes)) gets the position of the highest change, and the same index in values is the population it started from
    print('A change of :', min(changes), 'Happened at', values[changes.index(min(changes))])  # same as above, just with the minimum
    # If you wanted to print the second number, you would do values[changes.index(min(changes)) + 1]

main()
If you need any clarification on anything I did in the code, just ask.
I personally would use numpy for reading a text file. In your case I would do it like this:
import numpy as np

def main():
    infile = np.loadtxt('USPopulation.txt')
    maxpop = infile.max()  # note: np.argmax would give the index, not the value
    minpop = infile.min()
    print(f'maximum population = {maxpop} and minimum population = {minpop}')

main()
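Since the original task asks about year-over-year change rather than raw population, that part maps nicely onto np.diff, which returns the differences between consecutive elements. A sketch (the function name `population_changes` is made up here):

```python
import numpy as np

def population_changes(pops):
    """Average, largest, and smallest year-over-year population change."""
    changes = np.diff(pops)  # pops[i+1] - pops[i] for each consecutive pair
    return float(changes.mean()), int(changes.max()), int(changes.min())
```

np.argmax(changes) would then give the position of the biggest jump, which indexes back into the original array to recover the year.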
I've written an algorithm that scans through a file of IDs and compares each value with the value of an integer i (I've converted the integer to a string for comparison, and I've trimmed the "\n" suffix from the line). The algorithm compares these values for each line in the file (each ID). If they are equal, it increases i by 1 and recurses with the new value of i. If they aren't equal, it compares i with the next line in the file. It does this until it has a value of i that isn't in the file, then returns that value for use as the ID of the next record.
My issue: I have a file of IDs reading 1,3,2 because I removed the record with ID 2 and then created a new record. This shows the algorithm working correctly, as it gave the new record the ID of 2, which was previously freed. However, when I then create another record, the next ID is 3, leaving my ID list as 1,3,2,3 instead of 1,3,2,4. Below is my algorithm, with the results of the print() command. I can see where it goes wrong but can't work out why. Any ideas?
Algorithm:
def _getAvailableID(iD):
    i = iD
    f = open(IDFileName, "r")
    lines = f.readlines()
    for line in lines:
        print("%s,%s,%s" % ("i=" + str(i), "ID=" + line[:-1], (str(i) == line[:-1])))
        if str(i) == line[:-1]:
            i += 1
            f.close()
            _getAvailableID(i)
    return str(i)
Output:
(The output for when the algorithm was run for finding an appropriate ID for the record that should have ID of 4):
i=1,ID=1,True
i=2,ID=1,False
i=2,ID=3,False
i=2,ID=2,True
i=3,ID=1,False
i=3,ID=3,True
i=4,ID=1,False
i=4,ID=3,False
i=4,ID=2,False
i=4,ID=2,False
i=2,ID=3,False
i=2,ID=2,True
i=3,ID=1,False
i=3,ID=3,True
i=4,ID=1,False
i=4,ID=3,False
i=4,ID=2,False
i=4,ID=2,False
I think your program is failing because you need to change:
_getAvailableID(i)
to
return _getAvailableID(i)
(At the moment the recursive function finds the correct answer which is discarded.)
However, it would probably be better to simply put all the ids you have seen into a set to make the program more efficient.
e.g. in pseudocode:
S = set()
loop over all items and S.add(int(line.rstrip()))
i = 0
while i in S:
i += 1
return i
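That pseudocode translates almost directly into Python. A sketch (the name `get_available_id` is made up; it starts counting at 1 rather than 0 to match the question's IDs, and assumes one integer ID per line as in the question's file):

```python
def get_available_id(lines):
    """Return the smallest ID (starting from 1) not already in use."""
    # collect every used ID into a set for O(1) membership tests
    used = {int(line.strip()) for line in lines if line.strip()}
    i = 1  # the question's IDs start at 1
    while i in used:
        i += 1
    return i
```

Because the set is built once, this is a single pass over the file plus a short scan, with no recursion and no repeated re-reads.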
In case you are simply looking for the max ID in the file and then want to return the next available value:
def _getAvailableID(IDFileName):
    iD = '0'
    with open(IDFileName, "r") as f:
        for line in f:
            print("ID=%s, line=%s" % (iD, line))
            if line > iD:
                iD = line
    return str(int(iD) + 1)

print(_getAvailableID("IDs.txt"))
with an input file containing
1
3
2
it outputs
ID=1, line=1
ID=1
, line=3
ID=3
, line=2
4
However, we can solve it in a more pythonic way:
def _getAvailableID(IDFileName):
    with open(IDFileName, "r") as f:
        mx_id = max(f, key=int)
    return int(mx_id) + 1
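One caveat with the max-based versions: they never reuse gaps, and max() raises ValueError on an empty file. A guard for the empty case (a sketch under the same one-ID-per-line assumption; the name `get_next_id` is made up):

```python
def get_next_id(id_filename):
    """Return the max ID in the file plus one, or 1 if the file is empty."""
    with open(id_filename) as f:
        ids = [int(line) for line in f if line.strip()]
    # max() with a default avoids ValueError when no IDs exist yet
    return max(ids, default=0) + 1
```

int() happily ignores the trailing '\n' on each line, so no explicit stripping is needed for the conversion itself.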