Skip operations on row if it is non numeric in pandas dataframe

I have a dataframe:
import pandas as pd
df = pd.DataFrame({'start' : [5, 10, '$%%', 20], 'stop' : [10, 20, 30, 40]})
df['length_of_region'] = pd.Series([0 for i in range(0, len(df['start']))])
I want to calculate the length of the region only for rows whose start value is a non-zero number, and skip the calculation for any other row, adding an error note to that row instead. Here is what I have so far:
df['Notes'] = pd.Series(["" for i in range(0, len(df['region_name']))])
for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            #print (df['start'][i]).isnumeric()
            start = int(df['start'][i])
            #print start
            #print df['start'][i]
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'][i])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
However, pandas converts df['start'] into a str variable and even if I use int to convert it, I get the following error:
df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'
What am I missing here? Thanks for your time!

You can define a custom function to do the calculation then apply that function to each row.
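def calculate_region_length(x):
    # x is one row of df[['start', 'stop']]
    start_val = x['start']
    stop_val = x['stop']
    try:
        # float() raises ValueError for strings like '$%%' and TypeError for None
        start_val = float(start_val)
        return (stop_val - start_val) + 1.0
    except (ValueError, TypeError):
        return None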
The custom function accepts one row as input. It tests whether the start value can be converted to a float and returns None if it cannot. This way, a value such as '1' stored as a string is still converted and not skipped, whereas '$%%' in your example cannot be converted and gives None.
Next you call the custom function for each row:
df['length_of_region'] = df[['start', 'stop']].apply(lambda x: calculate_region_length(x), axis=1)
This creates your new column, holding (stop - start) + 1.0 for rows where start can be converted to a number and None for rows where it cannot.
You can then update the Notes field based on rows where None is returned to identify the regions where a start value is missing:
df.loc[df['length_of_region'].isnull(), 'Notes'] = df['region_name']
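As a possible alternative to the row-wise apply, the same idea can be sketched without a Python loop by using pd.to_numeric with errors='coerce'. This is only a sketch: it assumes a zero start should count as invalid (as in the question), and the wording of the note is a placeholder, not the asker's exact message.

```python
import pandas as pd

df = pd.DataFrame({'start': [5, 10, '$%%', 20], 'stop': [10, 20, 30, 40]})

# Anything that cannot be parsed as a number becomes NaN instead of raising.
start = pd.to_numeric(df['start'], errors='coerce')

# Assumption taken from the question: a start of 0 is also an error.
valid = start.notna() & (start != 0)

# Compute the length everywhere, then blank it out for the invalid rows.
df['length_of_region'] = (df['stop'] - start + 1.0).where(valid)

# Placeholder note text for the rows that were skipped.
df['Notes'] = ''
df.loc[~valid, 'Notes'] = 'Error: Chromosome start should be a non zero number'

print(df)
```

The rows that fail the check keep NaN in length_of_region, so they are easy to filter out later with isnull().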

After staring at the code for quite some time, I found a simple fix: inside the try block, assign the converted value start back to df['start'][i], as follows:
for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            start = int(df['start'][i])
            # store the converted int back so the subtraction below works
            df['start'][i] = start
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
# loop over all rows (len of the column, not of a single value)
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
Reassigning the converted value stores it as an int, so length_of_region is calculated only for the rows with a numeric start.

Related

Converting 'time values' stored in a tuple into 24 hours format

Suppose you have a tuple (hours, minutes) and you want to convert it into the 24-hour format hh:mm:ss, assuming the seconds are always 0. E.g. (14, 0) becomes 14:00:00 and (15, 0) becomes 15:00:00.
So far this is my sketchy way of coming close to the answer:
start_time = (14, 0)
st = ''
for num in start_time:
    num = str(num)
    if len(num) == 2:
        st += num
    else:
        st += str(num) + '00'
print(st)

The problems with your current approach:
You're not using any : character. After each iteration, check whether it's the last one by using the index in your for loop. If it's not, append a : character.
If len(num) == 1, you're appending two trailing zeros, when it should be one leading zero.
Refactored code:
start_time = (14, 0)
st = ''
# `i` will store the index for each iteration
for i, num in enumerate(start_time):
    num = str(num)
    if len(num) == 2:
        st += num
    else:
        # If the number of digits is not 2, append a leading '0'
        st += '0' + num
    # If it's not the last number in the tuple
    if i < len(start_time) - 1:
        st += ':'
    else:
        st += ':00'
print(st)
This will output the expected value.
An alternative and compact way of doing this:
def time(t):
    # t + (0,)         : create a new tuple with a trailing zero for the seconds
    # '{:02d}'.format  : a function that formats a value as two digits, adding a leading zero if needed
    # map(...)         : apply that function to each value in the new tuple
    # ':'.join(...)    : join the mapped values into a single string using `:`
    return ':'.join(map('{:02d}'.format, t + (0,)))
print(time((14, 0)))
print(time((15, 0)))

I know you're probably just practicing with strings, but if you want to create proper datetime objects, use the datetime module:
from datetime import datetime
def create_time(x: tuple):
    # the date part ("2021-11-03") can be set to whatever you need
    return datetime.strptime(f"2021-11-03 {x[0]}:{x[1]}", "%Y-%m-%d %H:%M")
print(create_time((14, 15)))
Output:
2021-11-03 14:15:00
If you want only time just use the .time() method:
from datetime import datetime
def create_time(x: tuple):
    return datetime.strptime(f"{x[0]}:{x[1]}", "%H:%M").time()
print(create_time((14, 15)))
Output:
14:15:00
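If only the hh:mm:ss string is needed, one more option is to build a datetime.time object directly and skip the parsing step. A minimal sketch, where to_hms is a made-up helper name and the input is assumed to be a tuple of two ints:

```python
from datetime import time

def to_hms(hours_minutes):
    """Format an (hours, minutes) tuple as hh:mm:ss with seconds fixed at 0."""
    hours, minutes = hours_minutes
    # time() checks the ranges itself and raises ValueError for e.g. (25, 0)
    return time(hours, minutes).isoformat()

print(to_hms((14, 0)))
print(to_hms((9, 5)))
```

A side benefit is that out-of-range values are rejected instead of being silently formatted.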

Accessing pyomo variables with two indices

I have started using pyomo to solve optimization problems, and I have an issue accessing variables that use two indices. I can easily print the solution, but I want to store the index-dependent variable values in a pd.DataFrame to analyze the result further. I have written the following code, but it takes forever to store the variables. Is there a faster way?
df_results = pd.DataFrame()
df_variables = pd.DataFrame()
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print ("Variable",v)
    varobject = getattr(instance, str(v))
    frequency = np.empty([len(price_dict)])
    for index in varobject:
        exist = False
        two = False
        if index is not None:
            if type(index) is int:
                #For time index t (0:8760 hours of year)
                exists = True #does a index exist
                frequency[index] = float(varobject[index].value)
            else:
                #For components (names)
                if type(index) is str:
                    print(index)
                    print(varobject[index].value)
                else:
                    #for all index with two indices
                    two = True #is index of two indices
                    if index[1] in df_variables.columns:
                        df_variables[index[0], str(index[1]) + '_' + str(v)] = varobject[index].value
                    else:
                        df_variables[index[1]] = np.nan
                        df_variables[index[0], str(index[1]) + '_' + str(v)] = varobject[index].value
        else:
            # If no index exist, simple print the variable value
            print(varobject.value)
    if not(exists):
        if not(two):
            df_variable = pd.Series(frequency, name=str(v))
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
        else:
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)

With some more work and fewer DataFrame operations, I solved the issue with the following code. Thanks to BlackBear for the comment.
df_results = pd.DataFrame()
df_variables = pd.DataFrame()
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print ("Variable",v)
    varobject = getattr(instance, str(v))
    frequency = np.empty([20,len(price_dict)])
    exist = False
    two = False
    list_index = []
    dict_position = {}
    count = 0
    for index in varobject:
        if index is not None:
            if type(index) is int:
                #For time index t (0:8760 hours of year)
                exist = True #does a index exist
                frequency[0,index] = float(varobject[index].value)
            else:
                #For components (names)
                if type(index) is str:
                    print(index)
                    print(varobject[index].value)
                else:
                    #for all index with two indices
                    exist = True
                    two = True #is index of two indices
                    if index[1] in list_index:
                        position = dict_position[index[1]]
                        frequency[position,index[0]] = varobject[index].value
                    else:
                        dict_position[index[1]] = count
                        list_index.append(index[1])
                        print(list_index)
                        frequency[count,index[0]] = varobject[index].value
                        count += 1
        else:
            # If no index exist, simple print the variable value
            print(varobject.value)
    if exist:
        if not(two):
            frequency = np.transpose(frequency)
            df_variable = pd.Series(frequency[:,0], name=str(v))
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
        else:
            for i in range(count):
                df_variable = pd.Series(frequency[i,:], name=str(v)+ '_' + list_index[i])
                df_results = pd.concat([df_results, df_variable], axis=1)
                df_variable.drop(df_variable.index, inplace=True)
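The reshaping step can probably be shortened further by collecting the values of one two-index variable into a plain dict first and letting pandas pivot it in one go. The sketch below uses no pyomo at all: it assumes the dict was filled beforehand, for example with {idx: varobject[idx].value for idx in varobject}, and both the helper name two_index_values_to_frame and the sample data are made up for illustration.

```python
import pandas as pd

def two_index_values_to_frame(values, var_name):
    """Turn {(t, component): value} into a frame with one column per component."""
    index = pd.MultiIndex.from_tuples(list(values.keys()), names=['t', 'component'])
    series = pd.Series(list(values.values()), index=index)
    # Move the second index level into the columns: rows are t, columns are components.
    frame = series.unstack(level='component')
    # Same naming scheme as above: <variable>_<component>
    frame.columns = [f"{var_name}_{component}" for component in frame.columns]
    return frame

# Made-up sample values standing in for a solved two-index variable.
values = {(0, 'pv'): 1.0, (1, 'pv'): 2.0, (0, 'wind'): 3.0, (1, 'wind'): 4.0}
frame = two_index_values_to_frame(values, 'power')
print(frame)
```

The frames for the individual variables can then be combined with a single pd.concat(list_of_frames, axis=1) at the end, rather than concatenating inside the loop.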

I can not seem to figure out why I can't convert to an int

I am trying to convert a string into an int, but for some reason it doesn't seem to want to work. I am very interested in learning where there is a disconnect in my understanding of the code I have written. Thank you!
def fourdigitnum(num):
    string = str(num)
    while (len(string) < 4):
        string = "0" + string
    return string
def ascend_desend(string,order):
    if(order == "ascend"):
        string = sorted(string,reverse = True)
    elif(order == "descend"):
        string = sorted(string,reverse = False)
    return string
def list_to_num(lst):
    string = ""
    for ele in lst:
        string = string + ele
    return int(string)
def KaprekarsConstant(num):
    num = int(num)
    count = 0
    while num != 6174:
        num = fourdigitnum(num)
        #ascend and descend are lists
        ascend = ascend_desend(num,"ascend")
        descend = ascend_desend(num,"descend")
        #they must be turned into nums
        ascend = list_to_num(ascend)
        descend = list_to_num(descend)
        num = descend - ascend
        count += 1
    return count
print KaprekarsConstant(raw_input())
Error:
Traceback (most recent call last):
File "/tmp/576159006/main.py", line 41, in <module>
print KaprekarsConstant(2111)
File "/tmp/576159006/main.py", line 33, in KaprekarsConstant
ascend = list_to_num(ascend)
File "/tmp/576159006/main.py", line 16, in list_to_num
return int(string)
ValueError: invalid literal for int() with base 10: '999-'

In your code, the line num = descend - ascend produces a negative value, because the generated ascend value is greater than descend. On the next pass the '-' sign becomes part of the string being sorted, which is where '999-' comes from. You can take the absolute value and continue from there, as in the lines below:
num = descend - ascend
num = abs(num)
count += 1
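For comparison, here is how the whole routine might look in Python 3 once the larger number is always the one being subtracted from, so no abs() is needed. This is a sketch rather than a drop-in replacement: the name kaprekar_steps is made up, and the input is assumed to be a non-negative number of at most four digits with at least two distinct digits.

```python
def kaprekar_steps(num):
    """Count the iterations needed to reach 6174 from num."""
    count = 0
    while num != 6174:
        # Pad to four digits, then sort the digit characters in ascending order.
        digits = sorted(str(num).zfill(4))
        smaller = int(''.join(digits))
        larger = int(''.join(reversed(digits)))
        if larger == smaller:
            # All digits equal (e.g. 1111): the difference is 0 and would loop forever.
            raise ValueError('needs at least two distinct digits')
        num = larger - smaller
        count += 1
    return count

print(kaprekar_steps(2111))
```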

You need to check the calculation at line 15.
It seems that the value of string built at this line is '999-', which cannot be converted to an integer because it contains a '-'.

Getting a TypeError message

I am getting the error "TypeError: range() integer end argument expected, got list." Not sure what to do about it. Thanks for the help!
if iput == 1:
    numresistors = [input("Number of resistors?")]
    if numresistors == [2]:
        r1 = raw_input("Enter first resistor:")
        r2 = raw_input("Enter second resistor:")
        R1 = Parsing(r1)
        R2 = Parsing(r2)
        req = R1.valueParsing() + R2.valueParsing()
        req2 = fmtnum(req)
        print "The value of the series resistors is %s." % req2
    else:
        sumr = 0
        for x in range (numresistors):
            sumr = sumr + x
        print "The value of the series resistors is %s." % sumr

numresistors is being stored as a list containing a single value:
numresistors = [input("Number of resistors?")]
The error you are getting says that the range function doesn't know what to do with a list. You could either call range with the only item in the list (range(numresistors[0])) or not store it as a list in the first place:
numresistors = input("Number of resistors?")
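The difference is easy to reproduce in isolation. A small Python 3 sketch (the variable names are made up, and the count is hard-coded instead of being read from input):

```python
count_as_list = [3]  # what wrapping the input in [...] produces

try:
    range(count_as_list)  # passing the list itself
except TypeError as exc:
    message = str(exc)
    print("range() rejected the list:", message)

# Passing the single item inside the list works as intended.
total = sum(range(count_as_list[0]))
print(total)
```

Note that in Python 3, input() returns a string, so the value would also need int(...) around it before being passed to range.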

Python Dynamic Knapsack

Right now I am attempting to code the knapsack problem in Python 3.2. I am trying to do this dynamically with a matrix. The algorithm that I am trying to use is as follows
Implements the memory function method for the knapsack problem
Input: A nonnegative integer i indicating the number of the first
       items being considered and a nonnegative integer j indicating the knapsack's capacity
Output: The value of an optimal feasible subset of the first i items
Note: Uses as global variables input arrays Weights[1..n], Values[1..n]
      and table V[0..n, 0..W] whose entries are initialized with -1's except for
      row 0 and column 0 initialized with 0's
if V[i, j] < 0
    if j < Weights[i]
        value <-- MFKnapsack(i - 1, j)
    else
        value <-- max(MFKnapsack(i - 1, j),
                      Values[i] + MFKnapsack(i - 1, j - Weights[i]))
    V[i, j] <-- value
return V[i, j]
If you run my code below, you can see that it tries to insert the weight into the list. Since this uses recursion, I am having a hard time spotting the problem. I also get an error saying that an integer cannot be added to a list with '+'. I have initialized the matrix with 0's in the first row and first column and -1 everywhere else. Any help will be much appreciated.
#Knapsack Problem
def knapsack(weight,value,capacity):
    weight.insert(0,0)
    value.insert(0,0)
    print("Weights: ",weight)
    print("Values: ",value)
    capacityJ = capacity+1
    ## ------ initialize matrix F ---- ##
    dimension = len(weight)+1
    F = [[-1]*capacityJ]*dimension
    #first column zeroed
    for i in range(dimension):
        F[i][0] = 0
    #first row zeroed
    F[0] = [0]*capacityJ
    #-------------------------------- ##
    d_index = dimension-2
    print(matrixFormat(F))
    return recKnap(F,weight,value,d_index,capacity)
def recKnap(matrix, weight,value,index, capacity):
    print("index:",index,"capacity:",capacity)
    if matrix[index][capacity] < 0:
        if capacity < weight[index]:
            value = recKnap(matrix,weight,value,index-1,capacity)
        else:
            value = max(recKnap(matrix,weight,value,index-1,capacity),
                        value[index] +
                        recKnap(matrix,weight,value,index-1,capacity-(weight[index])))
        matrix[index][capacity] = value
        print("matrix:",matrix)
    return matrix[index][capacity]
def matrixFormat(*doubleLst):
    matrix = str(list(doubleLst)[0])
    length = len(matrix)-1
    temp = '|'
    currChar = ''
    nextChar = ''
    i = 0
    while i < length:
        if matrix[i] == ']':
            temp = temp + '|\n|'
        #double digit
        elif matrix[i].isdigit() and matrix[i+1].isdigit():
            temp = temp + (matrix[i]+matrix[i+1]).center(4)
            i = i+2
            continue
        #negative double digit
        elif matrix[i] == '-' and matrix[i+1].isdigit() and matrix[i+2].isdigit():
            temp = temp + (matrix[i]+matrix[i+1]+matrix[i+2]).center(4)
            i = i + 2
            continue
        #negative single digit
        elif matrix[i] == '-' and matrix[i+1].isdigit():
            temp = temp + (matrix[i]+matrix[i+1]).center(4)
            i = i + 2
            continue
        elif matrix[i].isdigit():
            temp = temp + matrix[i].center(4)
        #updates next round
        currChar = matrix[i]
        nextChar = matrix[i+1]
        i = i + 1
    return temp[:-1]
def main():
    print("Knapsack Program")
    #num = input("Enter the weights you have for objects you would like to have:")
    #weightlst = []
    #valuelst = []
    ## for i in range(int(num)):
    ##     value , weight = eval(input("What is the " + str(i) + " object value, weight you wish to put in the knapsack? ex. 2,3: "))
    ##     weightlst.append(weight)
    ##     valuelst.append(value)
    weightLst = [2,1,3,2]
    valueLst = [12,10,20,15]
    capacity = 5
    value = knapsack(weightLst,valueLst,5)
    print("\n Max Matrix")
    print(matrixFormat(value))
main()

F = [[-1]*capacityJ]*dimension
does not properly initialize the matrix. [-1]*capacityJ is fine, but [...]*dimension creates dimension references to the exact same list. So modifying one list modifies them all.
Try instead
F = [[-1]*capacityJ for _ in range(dimension)]
This is a common Python pitfall. See this post for more explanation.
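The aliasing is easy to see on a tiny matrix. A short sketch with made-up sizes (3 columns, 2 rows) in place of capacityJ and dimension:

```python
capacity_j, dimension = 3, 2

# Outer multiplication: every row is the very same list object.
shared = [[-1] * capacity_j] * dimension
shared[0][0] = 0
print(shared)                  # the change shows up in every row
print(shared[0] is shared[1])  # both names point at one list

# List comprehension: a fresh inner list is built for each row.
independent = [[-1] * capacity_j for _ in range(dimension)]
independent[0][0] = 0
print(independent)             # only the first row changed
```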

For the purpose of cache illustration, I generally use a defaultdict as follows:
from collections import defaultdict
CS = defaultdict(lambda: defaultdict(int)) # if I want the default values to be 0
### or
CACHE_1 = defaultdict(lambda: defaultdict(lambda: -1)) # if I want the default values to be -1 (or something else)
This keeps me from having to build 2D arrays in Python on the fly.
To see an answer to z1knapsack using this approach:
http://ideone.com/fUKZmq
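In case the link goes stale, here is roughly what a memoized knapsack with such a nested defaultdict could look like. It is my own sketch of the idea, not the code behind the link: the function names are made up, items are 0-based Python lists, and -1 marks a cell that has not been computed yet.

```python
from collections import defaultdict

def knapsack(weights, values, capacity):
    """Best total value for the given capacity, using a nested defaultdict as cache."""
    # cache[i][j]: best value using the first i items with capacity j; -1 = unknown
    cache = defaultdict(lambda: defaultdict(lambda: -1))

    def best(i, j):
        if i == 0 or j == 0:
            return 0  # no items or no room left
        if cache[i][j] < 0:
            weight, value = weights[i - 1], values[i - 1]
            without_item = best(i - 1, j)
            if j < weight:
                cache[i][j] = without_item  # item i does not fit
            else:
                cache[i][j] = max(without_item, value + best(i - 1, j - weight))
        return cache[i][j]

    return best(len(weights), capacity)

# Data from the question: weights [2, 1, 3, 2], values [12, 10, 20, 15], capacity 5
print(knapsack([2, 1, 3, 2], [12, 10, 20, 15], 5))
```

Because the cache grows on demand, there is no matrix to initialize and therefore no shared-row problem to run into.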

def zeroes(n,m):
    v=[['-' for i in range(0,n)]for j in range(0,m)]
    return v
value=[0,12,10,20,15]
w=[0,2,1,3,2]
v=zeroes(6,5)
def knap(i,j):
    global v
    if i==0 or j==0:
        v[i][j]= 0
    elif j<w[i] :
        v[i][j]=knap(i-1,j)
    else:
        v[i][j]=max(knap(i-1,j),value[i]+knap(i-1,j-w[i]))
    return v[i][j]
x=knap(4,5)
print (x)
for i in range (0,len(v)):
    for j in range(0,len(v[0])):
        print(v[i][j],end="\t\t")
    print()
print()
#now these calls are for filling all the boxes in the matrix as in the above call only few v[i][j]were called and returned
knap(4,1)
knap(4,2)
knap(4,3)
knap(4,4)
for i in range (0,len(v)):
    for j in range(0,len(v[0])):
        print(v[i][j],end="\t\t")
    print()
print()
