I've written something to calculate reserves, where the already reserved quantity needs to be taken into account in the next iteration. The only problem is that the parameter isn't used in the next iteration. The output does contain the calculation, but the parameter isn't carried over. Any suggestions on how to solve this? Feel free to point out the muppetness of anything in the code; I'm still learning.
import pandas as pd
df=pd.read_csv('test_file.csv')
final_reserve = []
for i in range(len(df)):
    #If it's the first row or a new sku set the reserved to 0
    if i == 0 or df.loc[i]['SKU'] == df.loc[i-1]['SKU']:
        reserved = 0
        #calculate reserve
        to_reserve = df.loc[i]['ON_HAND'] - (df.loc[i]['SALES']*2) - df.loc[i]['ALREADY_RESERVED']
        final_reserve.append(to_reserve)
        #add to the reserved parameter
        reserved += to_reserve
    else:
        #if it's not the first row or a new sku take the already reserved units into account in the calculation
        to_reserve = df.loc[i]['ON_HAND'] - (df.loc[i]['SALES']*2) - df.loc[i]['ALREADY_RESERVED']-reserved
        final_reserve.append(to_reserve)
        reserved += to_reserve
df['to_reserve'] = final_reserve
df.head()
As the output below shows, the 2nd row comes out at 500, where it should deduct the 300 already reserved on the 1st row.
[Image: output reserve]
I just had a quick look at your question, based on the first comment in your for loop:
#If it's the first row or a new sku set the reserved to 0
you want to change the next line to:
if i == 0 or df.loc[i]['SKU'] != df.loc[i-1]['SKU']:
That is, change the second == operator to != (the <> spelling only exists in Python 2).
As it stands now, your code is resetting reserved to 0 when the SKU matches the previous one, not when it is a new SKU.
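A minimal sketch of the corrected loop, using a few made-up rows in plain Python dictionaries instead of the CSV (the sample values are assumptions, not the asker's data). It resets the running total only when the SKU changes and otherwise carries it into the next row:

```python
# Hypothetical rows standing in for test_file.csv
rows = [
    {"SKU": "A", "ON_HAND": 1000, "SALES": 100, "ALREADY_RESERVED": 500},
    {"SKU": "A", "ON_HAND": 1200, "SALES": 100, "ALREADY_RESERVED": 500},
    {"SKU": "B", "ON_HAND": 500, "SALES": 50, "ALREADY_RESERVED": 100},
]

final_reserve = []
reserved = 0
for i, row in enumerate(rows):
    # Reset the running total on the first row or when a new SKU starts
    if i == 0 or row["SKU"] != rows[i - 1]["SKU"]:
        reserved = 0
    # Deduct what earlier rows of the same SKU have already reserved
    to_reserve = row["ON_HAND"] - row["SALES"] * 2 - row["ALREADY_RESERVED"] - reserved
    final_reserve.append(to_reserve)
    reserved += to_reserve

print(final_reserve)  # [300, 200, 300]
```

Because reserved is 0 right after a reset, a single formula covers both branches of the original if/else.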
I have a file here that asks the user for a city/county, reads the file to find any lines with the city or county they specified, and at the end the program should print the date on which the increase in cases was highest.
def main():
    #open the file
    myFile = open("Covid Data.txt")
    #read the first line
    firstLine = myFile.readline()
    #set current, previous, and greatest to 0
    current = 0
    previous = 0
    greatest = 0
    #ask user for a city/county name
    userLocation = input("Please enter a location ").title
    #for each line in the file
    for dataLine in myFile:
        #strip the end of the line
        dataLine = dataLine.rstrip("\n")
        #split the data line by the commas and place the parts into a list
        dataList = dataLine.split(",")
        #if dataList[2] is equal to location
        if dataList[2] == userLocation:
            #subtract previous from current to find the number of cases that the total increased by
            cases = current - previous
            #if cases is higher than what is currently set as the greatest
            if cases > greatest:
                #set the new greatest to amount of cases
                greatest = cases
                #save the date of the current line
                date = str(dataList[0])
    #At the end print the data for the highest number of cases
    print("On",date," ",location," had the highest increase of cases with ",cases," cases.")
    #close file
For some reason, every time I run the code, after I type in what city/county I want to view information for, I keep getting an UnboundLocalError for the variable "date". It tells me that it was referenced before assignment, even though I clearly define it. Why am I getting this error?
You will need to initialize a value for the date variable before entering the loop. For example date = None. Same with cases. The problem is that if there is no valid data available, the date in the loop never gets set and thus doesn't exist.
You also are not altering the values of current or previous, which might be the cause for the bug you're seeing where the date variable never gets set (cases will always get value 0 in the loop).
Also there is a typo in the print, where you try to use location instead of the actual variable called userLocation.
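To illustrate the first point, here is a small sketch (the function names are made up for the example) of how a variable that is only assigned inside a conditional is missing when the condition never fires, and how a default set before the loop avoids that:

```python
def without_default(values):
    for value in values:
        if value > 0:
            found = value
    # If no value was positive, found was never assigned and this line
    # raises UnboundLocalError
    return found


def with_default(values):
    found = None  # always exists, even if the condition never fires
    for value in values:
        if value > 0:
            found = value
    return found


print(with_default([-1, -2]))  # None
```

Python decides that found is a local name because the function assigns to it somewhere, which is why the failure is an UnboundLocalError rather than a lookup of some global.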
My friend, you are having a problem with locals() and globals().
I am quite sure that if you put:
globals()["date"] = str(dataList[0])
you won't have this problem anymore. Check this page; in 5 minutes you will understand:
https://www.geeksforgeeks.org/global-local-variables-python/
Your code has more defects.
title is a method, so you have to call it as .title().
You have to define your variables before the conditions that may skip their assignment.
The location variable is undefined in your print call.
I have written a working version from your code.
Code:
def main():
    # open the file
    myFile = open("Covid Data.txt")
    # read the first line
    firstLine = myFile.readline()
    # set current, previous, and greatest to 0
    current = 0
    previous = 0
    greatest = 0
    cases = 0
    date = None
    # ask user for a city/county name
    userLocation = input("Please enter a location ").title()
    # for each line in the file
    for dataLine in myFile:
        # strip the end of the line
        dataLine = dataLine.rstrip("\n")
        # split the data line by the commas and place the parts into a list
        dataList = dataLine.split(",")
        print(dataList)
        # if dataList[2] is equal to location
        if dataList[2] == userLocation:
            # subtract previous from current to find the number of cases that the total increased by
            cases = current - previous
            # if cases is higher than what is currently set as the greatest
            if cases > greatest:
                # set the new greatest to amount of cases
                greatest = cases
                # save the date of the current line
                date = str(dataList[0])
    # At the end print the data for the highest number of cases
    print("On", date, " ", userLocation, " had the highest increase of cases with ", cases, " cases.")
    myFile.close()
main()
Covid Data.txt:
First line
2020.12.04,placeholder,Miami
Test:
>>> python3 test.py
Please enter a location Texas
On None Texas had the highest increase of cases with 0 cases.
>>> python3 test.py
Please enter a location Miami
On None Miami had the highest increase of cases with 0 cases.
NOTE:
As you can see above, the script now runs, but your logic still doesn't work. Some of the conditions will always be False: current and previous are never updated, so cases is always 0, the cases > greatest check never passes, and date stays None.
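One way the missing update of current and previous could look is sketched below. The source never says which column holds the case numbers, so this assumes that column 1 is a cumulative case total; the function name highest_increase and the sample lines are likewise made up for the example:

```python
def highest_increase(lines, location):
    """Return (date, increase) for the largest rise in cases, or (None, 0)."""
    previous = 0
    greatest = 0
    date = None  # set before the loop so it always exists
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields[2] != location:
            continue
        # ASSUMPTION: column 1 is the cumulative number of cases
        current = int(fields[1])
        increase = current - previous
        if increase > greatest:
            greatest = increase
            date = fields[0]
        # Remember this total for the next matching line
        previous = current
    return date, greatest


sample = [
    "2020.12.01,10,Miami",
    "2020.12.02,25,Miami",
    "2020.12.02,5,Texas",
    "2020.12.03,30,Miami",
]
print(highest_increase(sample, "Miami"))  # ('2020.12.02', 15)
```

Returning greatest instead of the last computed cases value also avoids printing the increase of whichever line happened to be read last.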
So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but is ridiculously slow whenever it is run using the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this is causing the 'check50' marking system to time out and return a negative result when testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
                if (match == STR_array_len):
                    print(name_array[0])
                    exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re
    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()
    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)
    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()
    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        # The lookahead measures the run starting at every position, so the
        # longest run is found even if a shorter run appears earlier
        pattern = f"(?=((?:{short_tandem_repeat})+))"
        longest = max((len(match.group(1)) for match in re.finditer(pattern, sequence)), default=0)
        count[short_tandem_repeat] = longest // len(short_tandem_repeat)
    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
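The run-length arithmetic described above can also be checked in isolation. The helper below is only a sketch (the name longest_run is made up here): it wraps the repeated pattern in a lookahead so that every position in the sequence is tried as the start of a run, and then divides the longest run found by the length of the repeat.

```python
import re


def longest_run(sequence, repeat):
    """Return the largest number of back-to-back copies of repeat in sequence."""
    # A zero-width lookahead consumes nothing, so finditer tries every
    # position; group 1 holds the greedy run that starts at that position
    pattern = f"(?=((?:{re.escape(repeat)})+))"
    longest = max((len(m.group(1)) for m in re.finditer(pattern, sequence)), default=0)
    return longest // len(repeat)


print(longest_run("AGATCAGATCAGATC", "AGATC"))  # 3, i.e. 15 // 5
```

Trying every start position matters for repeats that can overlap themselves, such as TTTTTTCT, where a scan that consumes the first match can step over the start of a longer run.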
So I'm trying to go through my dataframe in pandas and if the value of two columns is equal to something, then I change a value in that location, here is a simplified version of the loop I've been using (I changed the values of the if/else function because the original used regex and stuff and was quite complicated):
pro_cr = ["IgA", "IgG", "IgE"] # CR's considered productive
rows_changed = 0
prod_to_unk = 0
unk_to_prod = 0
changed_ids = []
for index in df_sample.index:
    if num == 1 and color == "red":
        pass
    elif num == 2 and color == "blue":
        prod_to_unk += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "unknown"
        rows_changed += 1
    elif num == 3 and color == "green":
        unk_to_prod += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "productive"
        rows_changed += 1
    else:
        pass
print("Number of productive columns changed to unknown: {}".format(prod_to_unk))
print("Number of unknown columns changed to productive: {}".format(unk_to_prod))
print("Total number of rows changed: {}".format(rows_changed))
So the main problem is the changing code:
df_sample.at[index, "Functionality"] = "unknown" # or productive
If I run this code without these lines of code, it works properly, it finds all the correct locations, tells me how many were changed and what their ID's are, which I can use to validate with the CSV file.
If I use df_sample["Functionality"][index] = "unknown" # or productive the code runs, but checking the rows that have been changed shows that they were not changed at all.
When I use df.at[row, column] = value I get "AttributeError: 'BlockManager' object has no attribute 'T'"
I have no idea why this is showing up. There are no duplicate columns. Hope this was clear (if not let me know and I'll try to clarify it). Thanks!
To be honest, I've never used df.at - but try using df.loc instead:
df_sample.loc[index, "Functionality"] = "unknown"
You can also use iat, which takes integer positions instead of labels.
Example: df.iat[i, j] for the i-th row and j-th column.
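When the conditions can be written as column comparisons, .loc also accepts a boolean mask, which removes the need for the row loop. The sketch below assumes that num and color are columns of the frame (in the question they are placeholders for more complicated checks), and the sample data is invented:

```python
import pandas as pd

# Hypothetical stand-in for df_sample; "num" and "color" are assumed columns
df_sample = pd.DataFrame({
    "Sequence ID": ["s1", "s2", "s3", "s4"],
    "num": [1, 2, 3, 2],
    "color": ["red", "blue", "green", "red"],
    "Functionality": ["productive", "productive", "unknown", "productive"],
})

# Build both masks first, then assign through .loc
to_unknown = (df_sample["num"] == 2) & (df_sample["color"] == "blue")
to_productive = (df_sample["num"] == 3) & (df_sample["color"] == "green")

df_sample.loc[to_unknown, "Functionality"] = "unknown"
df_sample.loc[to_productive, "Functionality"] = "productive"

changed_ids = df_sample.loc[to_unknown | to_productive, "Sequence ID"].tolist()
prod_to_unk = int(to_unknown.sum())
unk_to_prod = int(to_productive.sum())
print(changed_ids)  # ['s2', 's3']
```

Assigning through a single .loc call writes to the frame itself, whereas the chained df_sample["Functionality"][index] = ... form may write to a temporary copy, which is consistent with the rows appearing unchanged.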
I have written a python script in ArcGIS that selects features that intersect. It needs to keep repeating until all relevant features are selected. At this point the selection will stop changing. Is it possible to set a loop to keep repeating until the number of selected features is the same as last time it looped? I can get the selected features using the arcpy.GetCount_management() method.
I've set the number of selected features to be a variable:
selectCount = arcpy.GetCount_management("StreamT_StreamO1")
Then this is the
mylist = []
with arcpy.da.SearchCursor("antiRivStart","ORIG_FID") as mycursor:
    for feat in mycursor:
        mylist.append(feat[0])
liststring = str(mylist)
queryIn1 = liststring.replace('[','(')
queryIn2 = queryIn1.replace(']',')')
arcpy.SelectLayerByAttribute_management('StreamT_StreamO1',"ADD_TO_SELECTION",'OBJECTID IN '+ queryIn2 )
arcpy.SelectLayerByLocation_management("antiRivStart","INTERSECT","StreamT_StreamO1","","ADD_TO_SELECTION")
So what I want to do would effectively be:
while selectcount == previousselectcount:
    do stuff
but I don't know how the while loop is supposed to be constructed
You are pretty close to how you would monitor the change in the number of features. Consider the following.
previousselectcount = -1
# GetCount_management returns a Result object, so convert its output to an int before comparing
selectcount = int(arcpy.GetCount_management("StreamT_StreamO1").getOutput(0))
while selectcount != previousselectcount:
    # do stuff
    # update both counts at the end of what you want to do in the while loop
    previousselectcount = selectcount
    selectcount = int(arcpy.GetCount_management("StreamT_StreamO1").getOutput(0))
Note the not equals operator (!=) in the while loop condition.
python wiki
If selectcount or previousselectcount is of type float, you probably want to test against a range instead,
for example
while selectcount >= previousselectcount+c:
    ....
with c a positive constant very close to zero.
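The same loop shape can be tried without ArcGIS. In the sketch below, repeat_until_stable, step and count are hypothetical names; in the ArcGIS case step would run the two selection calls and count would return the selected feature count as an int. A small graph stands in for the intersecting features:

```python
def repeat_until_stable(step, count):
    """Call step() until count() stops changing; return the number of passes."""
    previous = -1
    current = count()
    passes = 0
    while current != previous:
        step()
        passes += 1
        # Update both counts at the end of each pass
        previous = current
        current = count()
    return passes


# Hypothetical stand-in for the selection: each pass adds direct neighbours
neighbours = {1: [2], 2: [3], 3: []}
selected = {1}


def grow_selection():
    for item in list(selected):
        selected.update(neighbours[item])


passes = repeat_until_stable(grow_selection, lambda: len(selected))
print(passes, sorted(selected))  # 3 [1, 2, 3]
```

The last pass is the one that changes nothing, so the loop always runs one pass more than the number of passes that actually grew the selection.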
I've got a cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.
Ideally I'd like to do something like this using Pycassa (no this isn't the proper way to call get, it's just so you can get the idea):
results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break
    # Merge these results into the main one
    results.update(buffer)
    # Update the offset
    start += len(buffer)
Pycassa (and by extension Cassandra) don't let you do this. Instead you need to specify a column name for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.
So how can I accomplish a buffered read of all the columns in a single row? Thanks.
From the pycassa 1.0.8 documentation
it would appear that you could use something like the following [pseudocode]:
results = {}
startColumn = ""
while True:
    # Fetch blocks of size 100
    buffer = get(key, column_start=startColumn, column_finish="", column_count=100)
    # iterate returned values.
    # set startColumn = name of the last column returned.
Remember that each subsequent call only returns 99 new results, because it also returns startColumn, which you've already seen. I'm not skilled enough in Python yet to iterate on buffer to extract the column names.
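The paging idea can be sketched without a cluster. Here fetch is a hypothetical callable standing in for something like column_family.get(key, column_start=start, column_finish="", column_count=count), assumed to return an ordered mapping of column names to values starting at start inclusive; make_fetch fakes it over an in-memory dictionary so the loop can be run:

```python
def read_all_columns(fetch, page_size=100):
    """Collect a whole wide row by repeatedly fetching pages of columns."""
    if page_size < 2:
        # Each page after the first repeats one column, so 1 would never advance
        raise ValueError("page_size must be at least 2")
    results = {}
    start = ""  # "" means the start of the row
    while True:
        page = fetch(start, page_size)
        results.update(page)
        if len(page) < page_size:
            break  # a short page means the end of the row was reached
        # Restart from the last column seen; it is returned again next time
        start = list(page)[-1]
    return results


def make_fetch(row):
    """Hypothetical stand-in for a column slice query on one row."""
    names = sorted(row)

    def fetch(start, count):
        selected = [name for name in names if name >= start][:count]
        return {name: row[name] for name in selected}

    return fetch


row = {f"col{i:04d}": i for i in range(250)}
columns = read_all_columns(make_fetch(row), page_size=100)
print(len(columns))  # 250
```

Taking the last key of each page as the next start is what makes every page after the first contribute one column fewer than its size.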
In v1.7.1+ of pycassa you can use xget and get a row as wide as 2**63-1 columns.
for col in cf.xget(key, column_count=2**63-1):
    # do something with the column.
    pass