Pandas: Automatically generate incremental ID based on pattern - python

I want to create a dataframe to which various users (name, phone number, address, ...) are continuously being added. Now I need a function that automatically generates an ID once a new, non-existing user is added to the dataframe.
The first user should get the ID U000001, the second user the ID U000002 and so on.
What's the best way to do this?

If I'm understanding correctly, the main problem is the leading zeros, i.e. you can't just increment the previous ID, because converting '0001' to an integer gives 1, and converting back gives '1' rather than '0001'. Please correct me if I'm wrong.
Anyway, here's what I came up with. It's far more verbose than you probably need, but I wanted to make sure my logic was clear.
def foo(previous):
    """
    Takes in a string of format 'U#####...'
    Returns the incremented value in the same format.
    Returns None if previous is already maxed out (i.e. 'U9999').
    """
    value_str = previous[1:]    # chop off 'U'
    value_int = int(value_str)  # get integer value
    new_int = value_int + 1     # increment
    new_str = str(new_int)      # turn back into string
    # return None if exceeding character limit on ID
    if len(new_str) > len(value_str):
        print("Past limit")
        return None
    # add leading zeroes
    while len(new_str) < len(value_str):
        new_str = '0' + new_str
    # add 'U' and return
    return 'U' + new_str
Please let me know if I can clarify anything! Here's a script you can use to test it:
# test
current_id = 'U0001'
while True:
    current_id = foo(current_id)
    print(current_id)
    if current_id is None:
        break
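For completeness, the zero-padding itself can also be produced with a format specifier, and the next ID can be derived from how many users are already in the dataframe. This is only a minimal sketch under the assumption that rows are only ever appended, never deleted (the column and dataframe names here are made up for the example):
import pandas as pd

df = pd.DataFrame(columns=["user_id", "name", "phone", "address"])

def next_user_id(df):
    # Zero-padded, 6-digit ID based on the current number of rows;
    # assumes rows are only ever appended, never deleted.
    return f"U{len(df) + 1:06d}"

print(next_user_id(df))  # U000001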


CS50 'DNA': Ways to speed up my Week 6 'dna.py' program?

So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever run with the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately this causes the 'check50' marking system to time out and return a negative result when testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + STR_len)]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people
                      if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
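One detail worth knowing about the regex step: re.search returns the first match in the sequence, which here is the first run of repeats and not necessarily the longest one. If that distinction matters for your data, a minimal variation using re.finditer to keep the longest run might look like this (the function and variable names are just for illustration):
import re

def longest_run(short_tandem_repeat, sequence):
    # Find every consecutive run of the repeat and keep the longest one.
    runs = re.finditer(f"(?:{short_tandem_repeat})+", sequence)
    return max((len(m.group()) // len(short_tandem_repeat) for m in runs), default=0)

print(longest_run("AGATC", "AGATCAGATCTTAGATCAGATCAGATCAGATC"))  # 4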

Unable to change value of dataframe at specific location

So I'm trying to go through my dataframe in pandas, and if the values of two columns are equal to something, then I change a value at that location. Here is a simplified version of the loop I've been using (I changed the conditions in the if/elif branches because the original used regex and was quite complicated):
pro_cr = ["IgA", "IgG", "IgE"]  # CR's considered productive
rows_changed = 0
prod_to_unk = 0
unk_to_prod = 0
changed_ids = []
for index in df_sample.index:
    if num == 1 and color == "red":
        pass
    elif num == 2 and color == "blue":
        prod_to_unk += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "unknown"
        rows_changed += 1
    elif num == 3 and color == "green":
        unk_to_prod += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "productive"
        rows_changed += 1
    else:
        pass
print("Number of productive columns changed to unknown: {}".format(prod_to_unk))
print("Number of unknown columns changed to productive: {}".format(unk_to_prod))
print("Total number of rows changed: {}".format(rows_changed))
So the main problem is the changing code:
df_sample.at[index, "Functionality"] = "unknown" # or productive
If I run this code without those lines, it works properly: it finds all the correct locations, tells me how many were changed and what their IDs are, which I can validate against the CSV file.
If I use df_sample["Functionality"][index] = "unknown" # or productive the code runs, but checking the rows that have been changed shows that they were not changed at all.
When I use df.at[row, column] = value I get "AttributeError: 'BlockManager' object has no attribute 'T'"
I have no idea why this is showing up. There are no duplicate columns. Hope this was clear (if not let me know and I'll try to clarify it). Thanks!
To be honest, I've never used df.at - but try using df.loc instead:
df_sample.loc[index, "Functionality"] = "unknown"
You can also use iat for integer-position access.
Example: df.iat[i, j], where i is the i-th row and j the j-th column by position.
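If the conditions can be expressed on whole columns, a vectorized assignment with a boolean mask avoids the per-row loop entirely. Here is a minimal sketch on a made-up dataframe (the column values are hypothetical stand-ins for the real conditions):
import pandas as pd

df_sample = pd.DataFrame({
    "num": [1, 2, 3],
    "color": ["red", "blue", "green"],
    "Sequence ID": ["s1", "s2", "s3"],
    "Functionality": ["productive", "productive", "unknown"],
})

# Build boolean masks for each condition, then assign in one step.
to_unknown = (df_sample["num"] == 2) & (df_sample["color"] == "blue")
to_productive = (df_sample["num"] == 3) & (df_sample["color"] == "green")

df_sample.loc[to_unknown, "Functionality"] = "unknown"
df_sample.loc[to_productive, "Functionality"] = "productive"

print("Total number of rows changed:", int(to_unknown.sum() + to_productive.sum()))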

Python capture file parsing

I have a RADIUS capture file, and I need to parse it. How can I grab individual value pairs and aggregate them? Here is a quick snippet of the file:
Acct-Session-Id = "1234adb"
Acct-Session-Time = 141312
Acct-Input-Octets = 1234123
This repeats on and on with the same structure but different values.
I need to aggregate the octets, which is easy since I can just check whether "Acct-Input-Octets" is in the line.
The problem is that the running total needs to reset whenever Session-Time goes back to 0 (i.e. the user reconnects). If it doesn't reset in that case, it is an error (in RADIUS, Input-Octets must reset with a new Session-ID).
Something like this?
totals = 0
for line in fileObj:
    name, value = line.split('=')
    if name.strip() == 'Acct-Session-Time' and value.strip() == '0':
        totals = 0
    elif name.strip() == 'Acct-Input-Octets':
        totals += int(value.strip())
Here is a regex approach:
Steps:
1. Read the times and octets into two lists.
2. Go over the times and store the last index of a '0' element; at the same time, check the same index in octets and make sure it is also '0', otherwise raise an exception.
3. Add up the values in octets from that last '0' index to the end.
import re

log = open('log.txt').read()
times = re.findall(r'Acct-Session-Time\s*=\s*(\d+)\s*', log)
octets = re.findall(r'Acct-Input-Octets\s*=\s*(\d+)\s*', log)

last_zero_index = 0
for i in range(len(times)):
    if times[i] == '0':
        last_zero_index = i
        if octets[i] != '0':
            raise Exception('Session time is reset but the usage is not')

totals = 0
for value in octets[last_zero_index:]:
    totals += int(value)
print(totals)
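As a quick sanity check, running the same two regexes against the snippet from the question (held in a string here instead of 'log.txt') pulls out the expected values:
import re

log = '''Acct-Session-Id = "1234adb"
Acct-Session-Time = 141312
Acct-Input-Octets = 1234123
'''
print(re.findall(r'Acct-Session-Time\s*=\s*(\d+)', log))   # ['141312']
print(re.findall(r'Acct-Input-Octets\s*=\s*(\d+)', log))   # ['1234123']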

Python Min-Max Function - List as argument to return min and max element

Question: write a program that first defines functions minFromList(list) and maxFromList(list). The program should initialize an empty list, then prompt the user for an integer and keep prompting for integers, adding each one to the list, until the user enters a single period character. The program should then call minFromList and maxFromList with the list of integers as an argument and print the results returned by the function calls.
I can't figure out how to get the min and max returned from each function separately. And now I've added extra code so I'm totally lost. Anything helps! Thanks!
What I have so far:
def minFromList(list)
    texts = []
    while (text != -1):
        texts.append(text)
    high = max(texts)
    return texts

def maxFromList(list)
    texts []
    while (text != -1):
        texts.append(text)
    low = min(texts)
    return texts
text = raw_input("Enter an integer (period to end): ")
list = []
while text != '.':
    textInt = int(text)
    list.append(textInt)
    text = raw_input("Enter an integer (period to end): ")

print "The lowest number entered was: ", minFromList(list)
print "The highest number entered was: ", maxFromList(list)
I think the part of the assignment that might have confused you was about initializing an empty list and where to do it. Your main body that collects data is good and does what it should, but you ended up doing too much in your max and min functions. Another misleading part of the assignment is that it suggests you write custom routines for these functions, even though max() and min() already exist in Python and return exactly what you need.
It's another story if you are required to write your own max and min and are not permitted to use the built-in functions. In that case you would need to loop over each value in the list, track the biggest or smallest seen so far, and return that final value.
Without directly giving you too much of the specific answer, here are some individual examples of the parts you may need...
# looping over the items in a list
value = 1
for item in aList:
    if item == value:
        print "value is 1!"

# basic function with arguments and a return value
def aFunc(start):
    end = start + 1
    return end

print aFunc(1)
# result: 2

# some useful comparison operators
print 1 > 2  # False
print 2 > 1  # True
That should hopefully be enough general information for you to piece together your custom min and max functions. While there are some more advanced and efficient ways to do min and max, I think to start out, a simple for loop over the list would be easiest.
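If it helps to see the shape of that kind of loop without giving away the min/max answer itself, here is a minimal sketch of the same "track a running value" pattern applied to a different task, summing a list (sumFromList is a made-up example, not part of the assignment):
def sumFromList(aList):
    # Start with a running value, update it once per item, return it at the end.
    total = 0
    for item in aList:
        total = total + item
    return total

print(sumFromList([1, 2, 3]))  # result: 6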

Binary search of unaccesible data field in ldap from python

I'm interested in reproducing a particular python script.
I have a friend who was accessing an LDAP database without authentication. There was a particular field of interest, which we'll call nin (an integer), and this field wasn't accessible without proper authentication. However, my friend managed to recover it through a sort of binary/prefix search on the data (rather than just looping through integers): he would check the first digit against a starting value, incrementing it until a query returned a true value indicating existence, then add another digit and keep checking until he found the exact value of the integer nin.
Any ideas on how he went about this? I've access to a similarly set up database.
Your best bet would be to get authorization to access that field. You are circumventing the security of the database otherwise.
Figured it out. I just needed to filter on (&(cn=My name)(nin=guess*)) and keep narrowing the filter until it returned the correct result.
Code follows in case anyone else needs to find a field they aren't supposed to access, but can check results for and know the name of.
def lookup(self, username="", guess=0, verbose=0):
    guin = guess
    result_set = []
    varsearch = "(&(name=" + str(username) + ")(" + "nin" + "=" + str(guin) + "*))"
    result_id = self.l.search("", ldap.SCOPE_SUBTREE, varsearch, ["nin"])
    while True:
        try:
            result_type, result_data = self.l.result(result_id, 0, 5.0)
            if (result_data == []):
                break
            else:
                if result_type == ldap.RES_SEARCH_ENTRY:
                    result_set.append(result_data)
        except ldap.TIMEOUT:
            return {"name": username}
    if len(result_set) == 0:
        return self.lookup(username, guin + 1, verbose)
    else:
        if guess < 1000000:
            return self.lookup(username, guess * 10, verbose)
        else:
            if verbose == 1:
                print "Bingo!",
            return str(guess)
