Using loops to compare two lists to find matching values - python

I have two lists, pPop and sPop. sPop is pPop after being sorted in ascending numerical order (they're populations of towns/cities).
I also have four other lists, pName, pType, pLat, and pLong, but I'm not really doing anything with them at this point in time.
I need to sort this list of cities by ascending population size, and I basically have been told to do it using what I know currently - which isn't that much. I've tried this using tuples and other stuff, but those fall under things I haven't been taught.
I have to compare sPop to pPop and use the information I get from that to reorder the other four lists, so I can spit them out in a .csv file.
I get the idea, I'm just not sure of the execution. I think I need to run a loop over all of sPop, with a loop inside that runs over all of pPop and checks whether sPop[x] == pPop[y] (x from 0 to len(sPop)), giving some kind of affirmative response when it's true. If it's true, then set pVar[y] equal to sVar[x].
After writing this out it seems fine, I'm just not entirely sure how to loop over every index in Python. When I do, say,
for x in sPop:
I get
x = sPop[i] for each i in range(len(sPop))
when I'd prefer x to refer to the index itself, not the value of the array/list at that index.
Short version: can I loop over the indices of an array with
for x in array
where x refers to the index? If not, what's the best way to refer to the index within a loop? It's late and my brain is fried on this problem after spending about six hours trying to work this out a few days ago using different methods.
EDIT:
Alright, got it. For anyone who is somehow curious (maybe someone'll stumble across this one in five years): you loop over sPop, then over pPop (use
for indexX, varX in enumerate(list)
twice), then use
if varX == varY:
    sortedList.append(initialList[indexY])
Can't put down the actual code or I'd probably get smacked with a plagiarism checker.

To get the index:
for index, x in enumerate(sPop):
    print(index, x)

for x in range(len(sPop)):
    item = sPop[x]
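
Putting the two together, here is a minimal runnable sketch of the nested-loop reordering the question describes, on toy data (sName and used are illustrative names, not from the original post; the used flags stop a town being matched twice when populations tie):
pPop = [300, 100, 200]
pName = ['B town', 'C town', 'A town']
sPop = sorted(pPop)                   # [100, 200, 300]
sName, used = [], [False] * len(pPop)
for sVal in sPop:
    for y, pVal in enumerate(pPop):
        if pVal == sVal and not used[y]:
            used[y] = True            # don't reuse a town when populations tie
            sName.append(pName[y])
            break
print(sName)                          # ['C town', 'A town', 'B town']
The same append can be repeated for pType, pLat and pLong to keep all four lists aligned with the sorted populations.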

Iterating over array and slicing or making changes in Python

I'm about to pull my hair out on this. I'm not sure why the index into my array is not advancing for the second column.
I created this array - project_information:
project_information.append([proj_id, project_text])
When I print this out, I get the rows and columns. It contains about 40 rows.
When I iterate through it to print out the contents, everything comes out fine. I am using this:
for i in range(0, len(project_information)):
    project_id = project_information[i][0]
    project_text = project_information[i][1]
    print(project_id)
    print(project_text)
The project_text column contains text, while the project_id column contains integers. It prints out perfectly, and the index advances for both project_id and project_text.
However, I need to use project_text in a different way, and I am really struggling with this. I need to slice the text down to a shorter string for reuse. To do this, I tried:
for i in range(0, len(project_information)):
    project_id = project_information[i][0]
    project_text = project_information[i][1]
    print(project_id)
    print(project_text)
    if len(project_text) > 5000:
        trunc_proj_text = project_text[:1000]
    else:
        trunc_proj_text = project_text
    print(project_id)
    print(trunc_proj_text)
The problem I'm having here is that although the project_id column is iterated through properly, the project_text is not. What I am getting is just the text from the first row, sliced, and repeated as many times as the array is long.
I have tried different ways, and also a while loop, but it is still not working.
I've also looked at these answers for reference - Slicing, indexing and iterating over 2D Numpy arrays; Efficient iteration over slice in Python; iteration over list slices - and I can't seem to see how they can be applied to my problem.
I'm not well-versed in using Numpy, so is this something that it could help with? I'm well aware this might be simple and I'm missing it because I've been working on various aspects of this project for the past weeks, so I would appreciate a bit of consideration in this.
Thanks in advance.
It turned out the problem was with the input list, so the slicing in this code does in fact work. The code that creates the input array has now been fixed: it was concatenating the strings for each entry, so each project_text looked different at the end but they all shared the same beginning, which was hard to spot on a console.
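For what it's worth, the loop can also be written with tuple unpacking, which avoids the manual indexing entirely; a small sketch assuming the project_information list built above:
for project_id, project_text in project_information:
    # keep long texts to their first 1000 characters
    trunc_proj_text = project_text[:1000] if len(project_text) > 5000 else project_text
    print(project_id)
    print(trunc_proj_text)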

How to transfer and trim data from one multi-dimensional array to another with a condition

I have built a Python program processing the probability of various datasets. I input various mean values and standard deviations 'manually', and that works; however, I need to automate it so that I can upload all my data through a text or CSV file. I've got part of the way, but now have a nested for loop problem, I think with the indices. Some background follows...
My code works for a small dataset where I can manually key in 6-8 parameters, but now I need to automate it and upload various inputs of unknown sizes by CSV/text file. I am copying my existing code and amending it where appropriate, but I have run into a problem.
I have a 2-D numpy array where some probabilities have been reverse sorted. I have a second array which gives me 68.3% of the total of each row, and I want to trim away the low-value 31.7% of the data.
I need a solution which can handle an unspecified number of rows.
My pre-existing code, which worked for a single one-dimensional array, was:
import numpy as np

prob_combine_sum = np.sum(prob_combine)
# Reverse sort the probabilities
prob_combine_sorted = sorted(prob_combine, reverse=True)
# Calculate 1 SD from peak prob by multiplying total prob by 68.3%
sixty_eight_percent = prob_combine_sum * 0.68269
# Loop over the sorted list and append the 1 SD data into a list,
# onesd_prob_combine
onesd_prob_combine = []
for i in prob_combine_sorted:
    onesd_prob_combine.append(i)
    if sum(onesd_prob_combine) > sixty_eight_percent:
        break
That worked. However, now I have a multi-dimensional array, and I want to take the 1 standard deviation data from that multi-dimensional array and stick it in another.
There's probably more than one way of doing this, but I thought I would stick with the for loop approach; it's just more complicated now because of the indices. I need to preserve the data structure, and I need to be able to handle unlimited numbers of rows in the future.
I simulated some data, and if I can get this to work, I should be able to put it in my program.
sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])
target_array = np.zeros(4).reshape(4, 1)
# Task: transfer data from sorted_probabilities to target_array on
# condition that the value in each target row is less than the value
# in the sd_test array.
# Ignore the problem that the data transferred won't add up to 68.3%.
# My real data sample is very big. I just need a way of trimming
# and transferring.
for row in sorted_probabilities:
    for element in row:
        target_array[row].append[i]
        if sum(target[row]) > sd_test[row]:
            break
Error: IndexError: index 9 is out of bounds for axis 0 with size 4
I know it's not a very good attempt. My problem is that I need a solution which will work for any 2D array, not just one with 4 rows.
I'd be really grateful for any help.
Thank you
Edit:
Can someone help me out with this? I am struggling.
I think the reason my loop will not work is that the 'index' I am using is not a number but, in this case, a whole row. I will have a think about this. In the meantime, does anyone have a solution?
Thanks
I tried the following code after reading the comments:
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = sorted_probabilities[counter][element]
        if target_array[counter] > sd_test[counter]:
            break
I get an error: IndexError: index 9 is out of bounds for axis 0 with size 9
I think it's because I am trying to add to a numpy array of pre-determined dimensions, but I am not sure. I am going to try another tack now, as I cannot do this with this approach. It's having to maintain the rows in the target array that is making it difficult. Each row relates to an object, and if I lose the structure it will be pointless.
I recommend you use pandas. You can read the CSV directly into a DataFrame and do multiple operations on columns and such, clean and neat.
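For instance, a minimal sketch (the file name is hypothetical):
import pandas as pd

df = pd.read_csv('probabilities.csv')   # hypothetical input file
print(df.head())                        # quick look at the parsed rows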
You are mixing numpy arrays with Python lists; better to use only one of these (numpy is preferred). Also try to debug your code, because it has both syntax and logic errors: you don't have a variable i, though you're using it as an index, and you're using row as an index even though it is a numpy array, not an integer.
I strongly recommend that you:
0) debug your code (at least with prints);
1) use enumerate to create both of your for loops;
2) replace append with plain assignment, because you've already created an empty vector (target_array) - or initialize target_array as an empty list and append into it;
3) if you want to use your solution for any 2D array, wrap your code in a function.
Try this:
sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])
target_array = np.zeros(4).reshape(4, 1)
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = element  # here I removed the code that produced the error
        if target_array[counter] > sd_test[counter]:
            break
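
If the goal is the same cumulative 68.3% trim as in the one-dimensional version, a per-row sketch along these lines may be closer to what's wanted (trimmed_rows, running and kept are illustrative names; a list of lists is used because the kept rows end up with different lengths):
trimmed_rows = []
for row, limit in zip(sorted_probabilities, sd_test):
    running = 0.0
    kept = []
    for element in row:
        kept.append(element)
        running += element
        if running > limit:           # stop once the running sum passes this row's limit
            break
    trimmed_rows.append(kept)
print(trimmed_rows)
This keeps one entry per input row, so the row-to-object mapping the poster needs is preserved.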

Finding intersections of huge sets with huge dicts

I have a dict with 50,000,000 keys (strings) mapped to a count of that key (this dict is a subset of one with billions of keys).
I also have a series of objects, each with a class set member containing a few thousand strings that may or may not be among the dict keys.
I need the fastest way to find the intersection of each of these sets with the dict keys.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a Python object containing the set in the thousands range;
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary if they exist. If an element doesn't exist you can substitute 0, and there will be no effect on the sum. There's also no need to convert the input of sum to a list:
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
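To illustrate the dict.get pattern on toy data (all names here are stand-ins, not the poster's real variables):
kmerCounts = {'AAA': 5, 'AAT': 2, 'ATG': 7}   # stand-in for the huge dict
block_kmers = {'AAA', 'ATG', 'CCC'}           # stand-in for block.get_kmers()
count = sum(kmerCounts.get(x, 0) for x in block_kmers)
print(count)  # 12 - 'CCC' is absent and contributes 0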

Difficulty with writing a merge function with conditions

I'm trying to write a function but simply can't get it right. This is supposed to be a merge function that merges as follows: the function receives as input a list of lists (m lists, all ints). The function creates a list that contains the indexes of the minimum values in each list of the input (one index per inner list, m indexes overall). Example:
lst_of_lsts = [[3,4,5],[2,0,7]]
min_lst = [0,1]
At each stage, the function chooses the minimum among the values at those indexes and adds it to a new list called merged. Then it removes that index from the index list (min_lst) and adds the index of the next-smallest value in that inner list, which becomes the new minimum.
At the end it returns merged, which is a list ordered from smallest to largest. Example:
merged = [0,2,3,4,5,7]
Another thing is that I'm not allowed to change the original input.
def min_index(lst):
    return min(range(len(lst)), key=lambda n: lst[n])

def min_lists(lstlst):
    return [min_index(lst) for lst in lstlst]
then
min_lists([[3,4,5],[2,0,7]])  # => [0, 1]
Edit:
This site doesn't exist to solve your homework for you. If you work at it and your solution doesn't do what you expect, show us what you've done and we'll try to point out your mistake.
I figure my solution is OK because it's correct, but in such a way that your teacher will never believe you wrote it; however if you can understand this solution it should help you solve it yourself, and along the way teach you some Python-fu.
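For reference, here is one way the described index-based merge could look; a sketch following the procedure in the question (merge and used are illustrative names), tracking consumed positions per inner list so the original input is never modified:
def merge(lstlst):
    # positions already consumed from each inner list
    used = [set() for _ in lstlst]
    merged = []
    total = sum(len(lst) for lst in lstlst)
    while len(merged) < total:
        best_list = best_idx = None
        for li, lst in enumerate(lstlst):
            remaining = [n for n in range(len(lst)) if n not in used[li]]
            if not remaining:
                continue
            idx = min(remaining, key=lambda n: lst[n])  # this list's current minimum
            if best_idx is None or lst[idx] < lstlst[best_list][best_idx]:
                best_list, best_idx = li, idx
        used[best_list].add(best_idx)
        merged.append(lstlst[best_list][best_idx])
    return merged

print(merge([[3,4,5],[2,0,7]]))  # => [0, 2, 3, 4, 5, 7]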

Reordering pairs of values within an array so they are in sequence

I'm trying to build a solution to properly order an array of value pairs so that they end up in the correct sequence. Consider this example in Python:
theArray = [['Dempster St','Main St'],['Dempster St','Church St'],['Emerson St','Church St']]
I need to order the array so that in the end it looks like this:
theArray = [['Emerson St','Church St'],['Church St','Dempster St'],['Dempster St','Main St']]
Some considerations:
There is no guarantee that the order within each pair points in the same direction. Ex: in the example above, the second array element has its pair pointing in the opposite direction from the rest (Dempster to Church instead of Church to Dempster)
The code should be built so that it could be used in both Python and C, so ideally it should be done without any language-specific tricks
At the end, it doesn't matter in which order the final array will be built, as long as the elements follow the correct order. For example, the solution below would also work:
theArray = [['Main St','Dempster St'],['Dempster St','Church St'],['Church St','Emerson St']]
Ideas?
I managed to make it work. Using multiple nested loops, I compared each element of every pair against all the others so that I could count occurrences (incrementing an associated counter whenever an item was found again, like a refcount); at the end, the two elements with the lowest count are the beginning and end of the route. From there it was quite easy to find the remaining connections.
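A minimal sketch of that approach, kept to plain loops and counters so it ports to C without language-specific tricks (order_pairs and the variable names are illustrative):
def order_pairs(pairs):
    # count how often each name appears; names seen only once are
    # the two endpoints of the chain
    counts = {}
    for a, b in pairs:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    endpoints = [name for name, c in counts.items() if c == 1]
    current = endpoints[0]            # either endpoint works as a start
    remaining = list(pairs)           # shallow copy; the input is untouched
    ordered = []
    while remaining:
        for pair in remaining:
            if current in pair:
                nxt = pair[1] if pair[0] == current else pair[0]
                ordered.append([current, nxt])   # normalize the direction
                remaining.remove(pair)
                current = nxt
                break
    return ordered

theArray = [['Dempster St','Main St'],['Dempster St','Church St'],['Emerson St','Church St']]
print(order_pairs(theArray))
# [['Main St', 'Dempster St'], ['Dempster St', 'Church St'], ['Church St', 'Emerson St']]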
