I have several lists of data in Python:
a = [2,45,1,3]
b = [4,6,3,6,7,1,37,48,19]
c = [45,122]
total = [a,b,c]
I want to get n random indexes from them:
n = 7
# some code
result = [[1,3], [2,6,8], [0,1]] # or
result = [[0], [0,2,6,8], [0,1]] # or
result = [[0,1], [0,2,3,6,8], []] # or any other
The idea: randomly take elements (or rather, the indexes of those elements) from any of the arrays, but the total count of them must be n.
So my idea is to generate random indexes:
import random

n = 7
total_len = sum(len(el) for el in total)
inds = random.sample(range(total_len), n)
But how do I then turn those flat indexes into per-list indexes?
I thought about np.cumsum() and shifting the indexes afterwards, but can't find an elegant solution...
P.S.
Actually, I need this for loading data from several csv files using the skiprows option. So my idea is to get indexes for every file, which lets me load only the necessary rows from each file.
So my real task:
I have several csv files of different lengths and need to get n random rows from them in total.
My idea:
lengths = my_func_to_get_lengths_for_every_csv(paths)  # list of lengths
# generate a random subsample of indexes
skip = ...
for ind, fil in enumerate(files):
    pd.read_csv(fil, skiprows=skip[ind])
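A minimal sketch of that plan (hedged: `lengths` is hardcoded here in place of `my_func_to_get_lengths_for_every_csv`, and header rows are ignored):

```python
import random

import numpy as np

lengths = [4, 9, 2]  # stand-in for the per-file row counts
n = 7

offsets = np.cumsum([0] + lengths)                # start of each file in the flat index space
flat = random.sample(range(int(offsets[-1])), n)  # n distinct global row indices

# rows to *keep*, shifted back to file-local indices
keep = [sorted(int(i - offsets[k]) for i in flat
               if offsets[k] <= i < offsets[k + 1])
        for k in range(len(lengths))]

# skiprows wants the rows to *drop*, so invert per file
skip = [[r for r in range(lengths[k]) if r not in keep[k]]
        for k in range(len(lengths))]
```

With real files you would then pass skip[ind] to pd.read_csv, remembering to shift everything by one row if the files have headers.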
You could flatten the list and then sample positions from the flattened index space (sample the indexes rather than the values, since the next step maps them back to the sublists):
total_flat = [item for sublist in total for item in sublist]
inds = random.sample(range(len(total_flat)), k=n)
Is this what you mean?
relative_inds = []
min_bound = 0
for lst in total:
    relative_inds.append([i - min_bound for i in inds
                          if min_bound <= i < min_bound + len(lst)])
    min_bound += len(lst)
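Putting both pieces together as a runnable check, using the example lists from the question:

```python
import random

a = [2, 45, 1, 3]
b = [4, 6, 3, 6, 7, 1, 37, 48, 19]
c = [45, 122]
total = [a, b, c]
n = 7

total_len = sum(len(el) for el in total)
inds = random.sample(range(total_len), n)  # positions in the flattened list

# split the flat positions back into per-list indexes
relative_inds = []
min_bound = 0
for lst in total:
    relative_inds.append([i - min_bound for i in inds
                          if min_bound <= i < min_bound + len(lst)])
    min_bound += len(lst)
```

The sublists together always contain exactly n indexes, each valid for its own list.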
I want to collect the values of the product of each pair in a list of numbers.
This works fine with smaller numbers, but not with larger ones. How can I optimize my solution?
#will work fine with min = 10 & max = 99
#but not with these values under
min = 1000
max = 9999
seq = range(min, max + 1)
products = set()
for i in seq:
    for j in seq:
        p = i * j
        products.add(p)
You can use numpy to take the outer product and then take the unique values.
import numpy as np

min_num = 1000
max_num = 9999
numbers = np.arange(min_num, max_num + 1)
products = np.unique(np.outer(numbers, numbers))
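To see what this does, here is the same idea on a tiny range (keep in mind the intermediate outer product for ~9000 numbers is a ~9000×9000 array, so this approach trades memory for speed):

```python
import numpy as np

numbers = np.arange(2, 5)  # [2, 3, 4]
# outer product: [[4, 6, 8], [6, 9, 12], [8, 12, 16]]
products = np.unique(np.outer(numbers, numbers))  # flattened, sorted, deduplicated
```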
You can build a set directly with a comprehension. To optimize, only compute each product once by multiplying each number with itself and the numbers after it, rather than both orderings of every pair (which only wastes time producing duplicate values):
lo = 1000
hi = 9999
prods = {i*j for i in range(lo,hi+1) for j in range(i,hi+1)}
print(len(prods)) # 20789643
A nested list comprehension should be faster, though note it keeps duplicates; pass it to set() if you need unique products:
products = [i*j for i in seq for j in seq]
I'm trying to create a sequence of jobs and put them in an array.
The code works if I run the lines separately. The one problem is that the while loop does not stop when count equals amountofmachines.
It gives the error:
IndexError: list assignment index out of range
I'm a bit new to Python and am used to Matlab. How can I end this while loop and make the code resume at the line a.sort()?
import random
import numpy as np
from random import randint
MachineNumber = 6 #amount of machines imported from Anylogic
JobNumber = 4 #amount of job sequences
JobSeqList = np.zeros((JobNumber,MachineNumber), dtype=np.int64)
amountofmachines = randint(1, MachineNumber) #dictated how much machines the order goes through
a = [0]*amountofmachines #initialize array of machines sequence
count = 0 #initialize array list of machines
element = [n for n in range(1, MachineNumber+1)]
while count <= amountofmachines:
    a[count] = random.choice(element)
    element.remove(a[count])
    count = count + 1
a.sort() #sorts the randomized sequence
A = np.asarray(a) #make an array of the list
A = np.pad(A, (0,MachineNumber-len(a)), 'constant') #adds zeros to the end of sequence
#add the sequence to the array of all sequences
JobSeqList[0,:] = A[:]
I have tested your code and found the answer!
Matlab indexes start at 1, so the first item in a list is at index 1.
However, Python indexes start at 0, so the first item in a list is at index 0.
Change this line:
while count <= amountofmachines:
To be:
while count < amountofmachines:
Updated Code:
import random
import numpy as np
from random import randint
MachineNumber = 6 #amount of machines imported from Anylogic
JobNumber = 4 #amount of job sequences
JobSeqList = np.zeros((JobNumber,MachineNumber), dtype=np.int64)
amountofmachines = randint(1, MachineNumber) #dictated how much machines the order goes through
a = [0]*amountofmachines #initialize array of machines sequence
count = 0 #initialize array list of machines
element = [n for n in range(1, MachineNumber+1)]
while count < amountofmachines:
    a[count] = random.choice(element)
    element.remove(a[count])
    count = count + 1
a.sort() #sorts the randomized sequence
A = np.asarray(a) #make an array of the list
A = np.pad(A, (0,MachineNumber-len(a)), 'constant') #adds zeros to the end of sequence
#add the sequence to the array of all sequences
JobSeqList[0,:] = A[:]
The problem with your while loop using < vs <= has already been answered, but I'd like to go a bit further and suggest that building a list this way (with a counter you increment or decrement manually) is almost never done in Python in the first place. In the hope that some more "pythonic" tools will help you avoid similar stumbling blocks as you get used to Python: the language has really great tools for iterating over and building data structures that eliminate a lot of opportunities for minor errors like this, by taking all the "busy work" off of your shoulders.
All of this code:
a = [0]*amountofmachines #initialize array of machines sequence
count = 0 #initialize array list of machines
element = [n for n in range(1, MachineNumber+1)]
while count < amountofmachines:
    a[count] = random.choice(element)
    element.remove(a[count])
    count = count + 1
a.sort() #sorts the randomized sequence
amounts to "build a sorted array of amountofmachines unique numbers taken from range(1, MachineNumber+1)", which can be more simply expressed using random.sample and sorted:
a = sorted(random.sample(range(1, MachineNumber + 1), amountofmachines))
Note that a = sorted(a) is the same as a.sort() -- sorted does a sort and returns the result as a list, whereas sort does an in-place sort on an existing list. In the line of code above, random.sample returns a list of random elements taken from the range, and sorted returns a sorted version of that list, which is then assigned to a.
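A quick illustration of the sorted-vs-sort difference:

```python
deck = [3, 1, 2]
new_deck = sorted(deck)  # returns a new sorted list; deck is untouched here
deck.sort()              # sorts deck in place and returns None
```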
If random.sample didn't exist, you could use random.shuffle and a list slice. Think of this as shuffling a deck of cards (element) and then taking amountofmachines cards off the top before re-sorting them:
element = [n for n in range(1, MachineNumber+1)]
random.shuffle(element)
a = sorted(element[:amountofmachines])
If neither of those existed and you had to use random.choice to pick elements one by one, there are still easier ways to build a list through iteration; there's no need to statically pre-allocate the list, and there's no need to track your iteration with a counter you manage yourself, because for does that for you:
a = []
element = [n for n in range(1, MachineNumber+1)]
for i in range(amountofmachines):
    a.append(random.choice(element))
    element.remove(a[i])
a.sort()
To make it simpler yet, it's not necessary to even have the for loop keep track of i for you, because you can access the last item in a list with [-1]:
a = []
element = [n for n in range(1, MachineNumber+1)]
for _ in range(amountofmachines):
    a.append(random.choice(element))
    element.remove(a[-1])
a.sort()
and to make it simpler yet, you can use pop() instead of remove():
a = []
element = [n for n in range(1, MachineNumber+1)]
for _ in range(amountofmachines):
    a.append(element.pop(random.choice(range(len(element)))))
a.sort()
which could also be expressed as a list comprehension:
element = [n for n in range(1, MachineNumber+1)]
a = [
    element.pop(random.choice(range(len(element))))
    for _ in range(amountofmachines)
]
a.sort()
or as a generator expression passed as an argument to sorted:
element = [n for n in range(1, MachineNumber+1)]
a = sorted(
    element.pop(random.choice(range(len(element))))
    for _ in range(amountofmachines)
)
Say we are given the total size of the interval space. Say we are also given an array of tuples giving us the start and end indices of the interval to sum over along with a value. After completing all the sums, we would like to return the maximum element. How would I go about solving this efficiently?
Input format: n = size of the interval space, intervals = array of tuples, each containing a start index, an end index, and a value to add to every element in that range
Eg:
Input: n = 5, intervals = [(1,2,100),(2,5,100),(3,4,100)]
Output: 200
so array is initially [0,0,0,0,0]
At each iteration the following modifications will be made:
1) [100,100,0,0,0]
2) [100,200,100,100,100]
3) [100,200,200,200,100]
Thus the answer is 200.
All I've figured out so far is the brute force solution of slicing the array and adding a value to the sliced portion. How can I do better? Any help is appreciated!
One way is to separate each interval into a beginning event and an end event that record how much is added to or subtracted from a running total when you enter or leave that interval. Once you sort the events by their position on the number line, you traverse them, adding or subtracting the values as intervals open and close. Here is some code to do so:
from collections import defaultdict

def find_max_val(intervals):
    # turn each interval into a +value event at its start
    # and a -value event just past its end
    operations = []
    for i in intervals:
        operations.append([i[0], i[2]])
        operations.append([i[1] + 1, -i[2]])
    unique_ops = defaultdict(int)
    for operation in operations:
        unique_ops[operation[0]] += operation[1]
    sorted_keys = sorted(unique_ops.keys())
    curr_val = unique_ops[sorted_keys[0]]
    max_val = curr_val
    for key in sorted_keys[1:]:
        curr_val += unique_ops[key]
        max_val = max(max_val, curr_val)
    return max_val

intervals = [(1,2,100),(2,5,100),(3,4,100)]
print(find_max_val(intervals))
# Output: 200
Here is the code for 3 intervals.
n = int(input())
x = [0]*n
for i in range(3):
    s = int(input())  #start
    e = int(input())  #end
    v = int(input())  #value
    #add value to every element in the interval
    for j in range(s-1, e):
        x[j] += v
print(max(x))
You can use list comprehension to do a lot of the work.
n = 5
intervals = [(1,2,100),(2,5,100),(3,4,100)]
intlst = [[r[2] if r[0]-1 <= i <= r[1]-1 else 0 for i in range(n)] for r in intervals]
lst = [0]*n  # [0,0,0,0,0]
for ls in intlst:
    lst = [lst[i]+ls[i] for i in range(n)]
print(lst)
print(max(lst))
Output
[100, 200, 200, 200, 100]
200
I get a list index out of range error when I try to split a big list into an array of arrays.
I have no idea why this is happening.
The end result of this code should be an array with arrays in it, so that I can later call, for example, val[5] and get 10 values.
I can print val if the print statement is inside the for loop, and the code works as it should. But if I move the print statement outside the for loop, I get the index out of range error.
import sys
import numpy as np
from numpy import array, random, dot
def vectorizeImages(filename):
    file_object = open(filename)
    lines = file_object.read().split()
    #arrays to save all values that are digits.
    strings = []
    values = []
    images = []
    val = []
    test = []
    #loop that checks if the position in the list contains a digit.
    #If it does it will save the digit to the strings array.
    for i in lines:
        if i.isdigit():
            strings.append(i)
    #Converting all the strings to ints.
    strings = list(map(int, strings))
    #dividing every value in the array by 32
    for i in strings:
        a = i
        values.append(a)
    #splits large list into smaller lists and adds a 1
    for i in range(len(values)):
        a = []
        for j in range(400):
            a.append(values[i*400+j]) #ERROR: list index out of range
        a.append(1)
        val.append(a)
Your error is here: a.append(values[i*400+j]).
When i = 0, you populate a and end up with 400 elements; the full inner loop succeeds as long as values has at least 400 elements.
But at some point you ask for more elements than values contains, and then it fails: because the outer loop runs i over len(values), the largest index you request is len(values) * 400 - 1, which is obviously greater than the list size.
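The shape of the fix is easier to see with a toy chunk size (3 instead of 400): loop over the number of complete chunks, len(values) // chunk, and a slice even removes the manual index arithmetic:

```python
values = list(range(10))  # toy stand-in for the real 400-per-image data
chunk = 3
val = []
for i in range(len(values) // chunk):      # only as many complete chunks as exist
    a = values[i * chunk:(i + 1) * chunk]  # slicing avoids the index math
    a.append(1)                            # trailing 1, as in the original code
    val.append(a)
```

The leftover incomplete tail (here the lone value 9) is simply dropped.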
This is what I did to solve my problem (and it also improved my Python skills):
def readImages(images):
    lines = open(images).read().split()
    images = []
    values = [float(i)/32.0 for i in lines if i.isdigit()]
    for i in range(len(values) // 400):  # integer division so range gets an int
        a = [values[i * 400 + j] for j in range(400)]
        images.append(a)
    return np.array(images)
I am trying to extract rows from a large numpy array. The columns of the array are obs number, group id (j), time id (t), and some data x_jt.
Here is an example:
import numpy as np
N = 100
T = 100
X = np.vstack((np.array(range(1,N*T+1)),np.repeat(np.array(range(1,N+1)),T), np.tile(np.array(range(1,T+1)),N), np.random.randint(100,size=N*T))).T
If I want to extract all rows from X where group id = 2, I would do
X[np.where(X[:,1] == 2)]
And if I wanted all rows where j = 2 or 3, I could extend that code. However, in my case, I have many group ids (j's) to extract. Specifically, I want to extract all rows where j comes from
samples = np.random.randint(N, size=N) + 1
For example, suppose size = 5 instead of N, and samples = (2,4,5,4,7). What I am after is code that goes through X and selects all rows where j = 2, then j = 4, then j = 5, j = 4, and finally j = 7, and creates a new array with the results. Basically this:
result = []
for j in samples:
    result.extend(X[np.where(X[:,1] == j)])
However, this code is slow when N is large. Do you have any suggestions to speed it up? Thanks!
Without replacement
This could be done with vectorized functions:
import numpy

def contains(X, samples):
    return numpy.vectorize(lambda x: x in samples)(X)

result = X[contains(X[:, 1], set(samples)), :]
With replacement
If you want to do this with replacement, just check off one count per sample until there are no samples left (assuming the order does not matter). This way you at least reduce the number of times you need to iterate over the matrix.
import collections
import itertools

result = []
sample_counts = collections.Counter(samples)
while sum(sample_counts.values()):
    # pick up one of each of the remaining samples and chain their rows
    # together in result
    s = set(key for key, value in sample_counts.items() if value)
    result = itertools.chain(result, X[contains(X[:, 1], s), :])
    sample_counts -= collections.Counter(dict.fromkeys(s, 1))
# create a matrix of the final result
result = numpy.array(list(result))
It doesn't do exactly what you are describing, but this type of problem is a good candidate for np.in1d. Something like this should work:
result = X[np.in1d(X[:, 1], samples)]
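One caveat worth checking on a toy array: np.in1d keeps rows in X's original order and selects each matching row once, so repeated values in samples do not repeat rows in the result (unlike the loop version); if repetition matters you still need the per-sample loop. On newer NumPy versions, np.isin is the recommended spelling of the same operation.

```python
import numpy as np

# toy array: column 1 is the group id
X = np.array([[0, 2, 5],
              [1, 3, 6],
              [2, 2, 7],
              [3, 4, 8]])
samples = [2, 4, 2]                       # note the repeated 2
result = X[np.in1d(X[:, 1], samples)]     # rows with group id 2 or 4, in X's order
```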