Looping over dataset of strings - python

I'm trying to pick out specific occurrences of a value in a dataset of mine, but keep running into a problem dealing with turning the values into strings and looping over them. My code is below:
data = np.genfromtxt('DurhamAirMass.txt')
spot = data[:,1]
mass = str(data[:,2])
DP = np.array([])
DT = np.array([])
MP = np.array([])
MT = np.array([])
TR = np.array([])
for i in range(1461):
    if mass[i] == '2':
        DP = np.append(DP, str(spot[i]))
    if mass[i] == '3':
        DT = np.append(DT, str(spot[i]))
    if mass[i] == '5':
        MP = np.append(MP, str(spot[i]))
    if mass[i] == '6' or '66' or '67':
        MT = np.append(MT, str(spot[i]))
    if mass[i] == '7':
        TR = np.append(TR, str(spot[i]))
print DP
When I attempt to print out the DP array, I get an error pointing at the first if statement and saying "IndexError: string index out of range". Any ideas?

What is the purpose of converting data[:,2] into a string?
Btw, `or` does not work as you think; you have to repeat `mass[i] ==` for each value.
Why not:
data = np.genfromtxt('DurhamAirMass.txt')
spot = data[:, 1]
mass = data[:, 2]
DP = spot[mass == 2]
DT = spot[mass == 3]
MP = spot[mass == 5]
MT = spot[(mass == 6) | (mass == 66) | (mass == 67)]
TR = spot[mass == 7]
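If the set of accepted values grows, numpy's isin can replace the chained comparisons; a small sketch of the same MT selection (my suggestion, not part of the original answer):
MT = spot[np.isin(mass, [6, 66, 67])]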

As a general rule, you should never hard-code the number of for-loop iterations unless you want input of any other size to be treated as an error (and even then, there are better ways to accomplish that).
Your code should probably look like this:
for i in range(len(data)):
    ...
This will ensure you always loop over only the data you actually have.
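If you also need the row itself, enumerate is the more idiomatic spelling of the same loop (just a sketch):
for i, row in enumerate(data):
    ...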

You are indeed causing an IndexError.
Try checking spot to see how large it is; my guess is that 1461 is larger than its bounds. Perhaps you could try setting your for loop as:
for i in range(len(spot)):
    ...
instead. This will guarantee that you only access valid indexes for spot. If this still causes a problem, try the same for mass:
for i in range(len(mass)):
    ...
You could also add a check to make sure the data is the length you think it is.
print len(mass), len(spot), len(spot) == len(mass)
It's always good practice to double check your assumptions in the case of an error. In this case you are clearly being told there is an IndexError so the next step is to find out what index is causing it.
Maybe more information would help you?
try:
    for i in range(len(spot)):
        # code as usual
except IndexError as e:
    print i
    raise e
This will tell you what index is causing the error.

I just changed all the strings to ints and that solved things. I didn't think that would work at first. Thanks for all of your answers, everyone!

Related

How can I turn the results of a for loop into a list?

How can I alter this for loop to turn it into a list? Thanks in advance.
for x in column_list:
    des, res = rp.ttest(group1=df[x][df['patients'] == 1], group1_name="Patients",
                        group2=df[x][df['patients'] == 0], group2_name="Controls")
    res1 = res.set_index('Independent t-test').T
    y = res1['Two side test p value = '].values
    if y < 0.005:
        print(x, y)
I recall your previous question. I think something like the following may help:
outputList = []
for x in column_list:
    des, res = rp.ttest(group1=df[x][df['patients'] == 1], group1_name="Patients",
                        group2=df[x][df['patients'] == 0], group2_name="Controls")
    res1 = res.set_index('Independent t-test').T
    y = res1['Two side test p value = '].values
    # collect the values into a dictionary and append that dictionary to the outputList
    outputList.append({"x": x, "y": y})
# Now loop the list and only print the ones with an appropriate p-level
for pair in outputList:
    if pair["y"] < 0.005:
        print(pair["x"], pair["y"])
This is moving the print() down into a new section of code after all of the x,y pairs have been collected into a list. Using a dictionary makes this all a bit cleaner.
If you want to see what's going on, you can toss print(outputList) before that second for loop and see what that output looks like. It may help connect the dots.
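A minimal, self-contained illustration of the collect-then-filter pattern (the column names and p-values here are made up, not from the question's data):
pairs = [{"x": "col_a", "y": 0.001}, {"x": "col_b", "y": 0.2}]  # stand-ins for collected results
significant = [p for p in pairs if p["y"] < 0.005]
print(significant)  # [{'x': 'col_a', 'y': 0.001}]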

Python (iteration problem) with an exercise

The code:
import pandas as pd
import numpy as np
import csv

data = pd.read_csv("/content/NYC_temperature.csv", header=None, names=['temperatures'])
np.cumsum(data['temperatures'])
printcounter = 0
list_30 = [15.22]  # first temperature, added by hand since the loop below takes every 30th value but never appends the first one
list_2 = []  # this is for the values of the subtraction (for the second iteration)
for i in data['temperatures']:
    if printcounter == 30:
        list_30.append(i)
        printcounter = 0
    printcounter += 1
for x in list_30:
    substract = list_30[x] - list_30[x + 1]
    list_2.append(substract)
print(max(list_2))
Hey guys! I'm really having trouble with this part:
for x in list_30:
    substract = list_30[x] - list_30[x + 1]
I'm trying to iterate over the elements, taking the difference between element x and the next element (x+1), but the following error pops up: TypeError: 'float' object is not iterable. I have also tried to iterate using x instead of list_30[x], but then when I use next(x) I get another error.
for x in list_30: iterates over list_30 and assigns to x the value of each item in the list, not its index.
For your case you would prefer to loop over your list with indexes:
index = 0
while index < len(list_30):
    substract = list_30[index] - list_30[index + 1]
    index += 1
Edit: you will still have a problem when you reach the last element of list_30, as there is no element list_30[last_index + 1], so you should probably stop before the end with while index < len(list_30) - 1:
In case you want the index and the value, you can do:
for i, v in enumerate(list_30):
    substract = v - list_30[i + 1]
but the first one looks cleaner in my opinion.
If you're trying to find the difference between two adjacent elements of an array (like differentiating it), you should probably use the zip function:
inp = [1, 2, 3, 4, 5]
delta = []
for x0, x1 in zip(inp, inp[1:]):
    delta.append(x1 - x0)
print(delta)
Note that the list of deltas will be one shorter than the input.
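Since the question already imports numpy, the same differences can be computed in one call (a sketch of an alternative, not part of the answer above):
import numpy as np

inp = [1, 2, 3, 4, 5]
delta = np.diff(inp)  # array([1, 1, 1, 1]); also one shorter than the input
print(max(delta))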

How do I write this Python code to use 2+ fewer nested if statements?

I have the following code which I use to loop through row groups in a parquet metadata file to find the maximum values for columns i,j,k across the whole file. As far as I know I have to find the max value in each row group.
I am looking for:
how to write it with at least two fewer levels of nesting
in fewer lines in general
I tried to use a dictionary lambda combo as a switch statement in place of some of the if statements, and eliminate at least two levels of nesting, but I couldn't figure out how to do the greater than evaluation without nesting further.
import pyarrow.parquet as pq

def main():
    metafile = r'D:\my_parquet_meta_file.metadata'
    meta = pq.read_metadata(metafile)
    max_i = 0
    max_j = 0
    max_k = 0
    for grp in range(0, meta.num_row_groups):
        for col in range(0, meta.num_columns):
            # locate columns i,j,k
            if meta.row_group(grp).column(col).path_in_schema in ['i', 'j', 'k']:
                if meta.row_group(grp).column(col).path_in_schema == 'i':
                    if meta.row_group(grp).column(col).statistics.max > max_i:
                        max_i = meta.row_group(grp).column(col).statistics.max
                if meta.row_group(grp).column(col).path_in_schema == 'j':
                    if meta.row_group(grp).column(col).statistics.max > max_j:
                        max_j = meta.row_group(grp).column(col).statistics.max
                if meta.row_group(grp).column(col).path_in_schema == 'k':
                    if meta.row_group(grp).column(col).statistics.max > max_k:
                        max_k = meta.row_group(grp).column(col).statistics.max
    print('max i: ' + str(max_i), 'max j: ' + str(max_j), 'max k: ' + str(max_k))

if __name__ == '__main__':
    main()
I've had someone give me 2 solutions:
The first involves using a list to hold the max values for each of my nominated columns, and then uses the python max function to evaluate the higher value before assigning it back. I must say I'm not a huge fan of using an unnamed positional max value variable, but it does the job in this instance and I can't fault it.
Solution 1:
import pyarrow.parquet as pq

def main():
    metafile = r'D:\my_parquet_meta_file.metadata'
    meta = pq.read_metadata(metafile)
    max_value = [0, 0, 0]
    for grp in range(0, meta.num_row_groups):
        for col in range(0, meta.num_columns):
            column = meta.row_group(grp).column(col)
            for i, name in enumerate(['i', 'j', 'k']):
                if column.path_in_schema == name:
                    max_value[i] = max(max_value[i], column.statistics.max)
    print(dict(zip(['max i', 'max j', 'max k'], max_value)))

if __name__ == '__main__':
    main()
The second uses similar methods, but additionally uses a list comprehension to gather all of the column objects before iterating through them to find each column's max values. This removes one additional level of nesting, but more importantly separates the gathering of column objects into its own step before interrogating them, making the process a little clearer. The downside is that it may require more memory, since the whole column object is retained rather than just the reported max value.
Solution 2:
import pyarrow.parquet as pq

def main():
    metafile = r'D:\my_parquet_meta_file.metadata'
    meta = pq.read_metadata(metafile)
    max_value = [0, 0, 0]
    columns = [meta.row_group(grp).column(col)
               for col in range(0, meta.num_columns)
               for grp in range(0, meta.num_row_groups)]  # in a comprehension the first 'for' clause is the outermost loop
    for column in columns:
        for i, name in enumerate(['i', 'j', 'k']):
            if column.path_in_schema == name:
                max_value[i] = max(max_value[i], column.statistics.max)
    print(dict(zip(['max i', 'max j', 'max k'], max_value)))

if __name__ == '__main__':
    main()
*Update: I've found it can actually use even less memory if the comprehension is written as a generator expression, with parentheses instead of square brackets. (Square brackets build the complete list up front, in Python 2 and Python 3 alike; it's the parenthesised form that is lazy.) A generator won't retrieve each column until it is consumed in the second loop where I iterate through "columns". The downside of using a generator is you can only iterate through it once (it's not reusable) unless you redefine it.
The upside is that if I happen to want to break from the loop once I've found a desired value, I could, and there would be no remaining list taking up memory, nor would every column object need to have been built, making it faster. In my case it doesn't really matter because I do go through the whole list anyway, but with a lower memory footprint.
# A generator expression: columns are produced lazily, one at a time
columns = (meta.row_group(grp).column(col)
           for col in range(0, meta.num_columns)
           for grp in range(0, meta.num_row_groups))
To get a populated list from it, pass it to the list() function, e.g.
columns = list(<generator expression ...>)
You can simulate a switch statement with the following function:
def switch(v):
    yield lambda *c: v in c
It simulates a switch statement using a single-pass for loop with if/elif/else conditions that don't repeat the switching value. For example:
for case in switch(x):
    if case(3):
        # ... do something
    elif case(4, 5, 6):
        # ... do something else
    else:
        # ... do some other thing
It can also be used in a more C-like style:
for case in switch(x):
    if case(3):
        # ... do something
        break
    if case(4, 5, 6):
        # ... do something else
        break
else:
    # ... do some other thing
(Here the final else is the for-else clause: it runs only if the loop finished without hitting a break.)
Here's how to use it with your code:
...
for case in switch(meta.row_group(grp).column(col).path_in_schema):
    if not case('i', 'j', 'k'): break
    statMax = meta.row_group(grp).column(col).statistics.max
    if case('i') and statMax > max_i: max_i = statMax
    elif case('j') and statMax > max_j: max_j = statMax
    elif case('k') and statMax > max_k: max_k = statMax
...
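For comparison, the dictionary approach the question hints at can drop the switch entirely; a minimal sketch, assuming the same meta object and column names as above:
maxima = {'i': 0, 'j': 0, 'k': 0}
for grp in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        column = meta.row_group(grp).column(col)
        name = column.path_in_schema
        if name in maxima:
            maxima[name] = max(maxima[name], column.statistics.max)
print(maxima)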

Extracting Gurobi Solution Index

I have a bunch of gurobi variables
y[0],y[1],...,y[n]
x[0],x[1],...,x[m].
I would like to be able to figure out the indices of the optimal y's that are not zero. In other words, if the optimal solution is y[0]=0, y[1]=5, y[2]=0, y[3]=1, I would like to return [1,3]. So far, I have
F = []
for v in model.getVars():
    if v.varName[0] == 'y' and v.x > 0:
        F.append(v.varName)
This, in the above example, would give me ['y[1]', 'y[3]']. Since this output is a string, I'm not sure how I can get the 1 and 3 out of it. Please help.
Thanks in advance!
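For what it's worth, the indices can be pulled straight out of those name strings with a regular expression; a minimal sketch of that idea (my suggestion, not taken from the answers below):
import re

names = ['y[1]', 'y[3]']  # the strings the question's loop produces
indices = [int(re.search(r'\[(\d+)\]', n).group(1)) for n in names]
print(indices)  # [1, 3]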
I am using the following, which works:
Index_List = []
for v in m.getVars():
    if v.varName[0] == 'y' and v.x > 0:
        Index = int(v.varName[2])
        for j in range(3, 3 + 100):
            BOOL = False
            try:
                IndexNEW = int(v.varName[j])
                Index = 10 * Index + IndexNEW
                BOOL = True
            except ValueError:
                pass
            if not BOOL:
                break
        Index_List.append(Index)
The resulting Index_List is as desired. There must be a better way, though.
Assuming
from gurobipy import *
m = Model()
If you create a gurobi tupledict for your variables with
x = m.addVars(nx, vtype=GRB.INTEGER, name="x")
y = m.addVars(ny, vtype=GRB.INTEGER, name="y")
# ...your constraints and objective here..
then you can directly call the attributes for your variables (in your case the .X attribute for the variable value in the current solution). Using a list comprehension it could be done with:
m.optimize()
if m.status == GRB.OPTIMAL:
    indices = [i for i in range(ny) if y[i].X > 0]
where nx and ny are the number of your variables.
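A variant of the same comprehension using the tupledict's items() view (a sketch; assumes the y from above, with 0.5 as a cutoff to guard against floating-point noise on integer variables):
indices = [i for i, var in y.items() if var.X > 0.5]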

Python Creating a dictionary of dictionaries structure, nested values are the same

I'm attempting to build a data structure that can change in size and be posted to Firebase. The issue I am seeing is during the construction of the data structure. I have the following code written:
for i in range(len(results)):
    designData = {"Design Flag": results[i][5],
                  "performance": results[i][6]}
    for j in range(len(objectiveNameArray)):
        objectives[objectiveNameArray[j]] = results[i][columnHeaders.index(objectiveNameArray[j])]
    designData["objectives"] = copy.copy(objectives)
    for k in range(len(variableNameArray)):
        variables[variableNameArray[k]] = results[i][columnHeaders.index(variableNameArray[k])]
    designData["variables"] = copy.copy(variables)
    for l in range(len(responseNameArray)):
        responses[responseNameArray[l]] = results[i][columnHeaders.index(responseNameArray[l])]
    designData["responses"] = copy.copy(responses)
    for m in range(len(constraintNameArray)):
        constraintViolated = False
        if constraintNameArray[m][1] == "More than":
            if results[i][columnHeaders.index(constraintNameArray[m][0])] > constraintNameArray[m][2]:
                constraintViolated = True
            else:
                constraintViolated = False
        elif constraintNameArray[m][1] == "Less than":
            if results[i][columnHeaders.index(constraintNameArray[m][0])] < constraintNameArray[m][2]:
                constraintViolated = True
            else:
                constraintViolated = False
        if constraintNameArray[m][0] in constraints:
            if constraints[constraintNameArray[m][0]]["violated"] == True:
                constraintViolated = True
        constraints[constraintNameArray[m][0]] = {"value": results[i][columnHeaders.index(constraintNameArray[m][0])], "violated": constraintViolated}
    designData["constraints"] = copy.copy(constraints)
    data[studyName][results[i][4]] = designData
When I include print(designData) inside of the for loop, I see that my results are changing as expected for each loop iteration.
However, if I include print(data) outside of the for loop, I get a data structure where the values added by the results array are all the same values for each iteration of the loop even though the key is different.
Comparing print(data) and print(designData)
I apologize in advance if this isn't enough information this is my first post on Stack so please be patient with me.
It is probably because you put the variables like objectives, variables, and responses directly into designData. Try the following:
import copy
....
designData['objectives'] = copy.copy(objectives)
....
designData['variables'] = copy.copy(variables)
....
designData['responses'] = copy.copy(responses)
For similar questions, see copy a list.
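To see why the copy matters, here is a minimal, self-contained illustration (made-up keys, not the question's data); without the copy, every entry in data would reference the same dictionary and show only the last iteration's values:
import copy

objectives = {}
data = {}
for key in ("a", "b"):
    objectives["score"] = key
    data[key] = {"objectives": copy.copy(objectives)}  # snapshot this iteration's values
print(data)  # {'a': {'objectives': {'score': 'a'}}, 'b': {'objectives': {'score': 'b'}}}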
