How to evaluate columns that contain lists in pandas?

How to evaluate columns that contain lists in pandas? - python

Say I have a dataframe which describes the dimensions of hundreds of cardboard boxes:
df = [[17829292, (13, 14, 20)], [17739292, (20, 10, 15)], [17827792, (10, 10, 12)]]
df = pd.DataFrame(df, columns = ['Serial Number', 'Box Dimensions'])
Given that the 'Box Dimensions' column contains a list of numbers, is there a simple way to search through the dataframe for all boxes with dimensions less than or equal to certain values, <= (15, 15, 15)? Or do I need to separate out each box's (x, y, z) values into three separate columns to evaluate them this way?
I've tried df.loc[df['Box Dimensions'] <= (15, 15, 15)], but for some reason this only evaluates the first number in the list. The function will return boxes whose x values are <=15, but whose y and z dimensions exceed these parameters.

You can try:
m=pd.DataFrame(df['Box Dimensions'].tolist()).le(15).all(1)
#OR(If needed to check dimensions seperately above method can be modified as well)
m=df['Box Dimensions'].map(lambda x:x[0]<=15 and x[1]<=15 and x[2]<=15)
#Finally:
df[m]
#OR
df.loc[m]
output of above code:
Serial Number Box Dimensions
2 17827792 (10, 10, 12)

Alternatively if you need to test box dimensions independently
df.loc[(df['Box Dimensions'].str[0] <= 15) & (df['Box Dimensions'].str[1] <= 15) & (df['Box Dimensions'].str[2] <= 15)]
Serial Number Box Dimensions
2 17827792 (10, 10, 12)

Related

Why is my array coming out as shape: (6, 1, 2) when it is made of two (6, ) arrays?

I'm trying to import data from an excel and create an array pos with 6 rows and two columns. Later, when I go to index the array pos[0][1], I get an error: IndexError: index 1 is out of bounds for axis 0 with size 1.
I looked at the shape of my array and it returns (6, 1, 2). I was expecting to get (6, 2). The individual shapes of the arrays which make up pos are (6, ) and (6, ) which I don't really understand, why not (6, 1)? Don't quite understand the difference between the two.
irmadata = pd.read_excel("DangerZone.xlsx")
irma_lats = irmadata["Average Latitude"].tolist()
irma_longs = irmadata["Average Longitude"].tolist()
shipdata = pd.read_excel("ShipPositions.xlsx")
ship_lats = shipdata["Latitude"].to_numpy() ## these are the (6, ) arrays
ship_longs = shipdata["Longitude"].to_numpy()
pos = np.array([[ship_lats], [ship_longs]], dtype = "d").T
extent = [-10, -90, 0, 50]
ax = plot.axes(projection = crs.PlateCarree())
ax.stock_img()
ax.add_feature(cf.COASTLINE)
ax.coastlines(resolution = "50m")
ax.set_title("Base Map")
ax.set_extent(extent)
ax.plot(irma_longs, irma_lats)
for i in range(len(ship_lats)):
lat = pos[i][0]
lon = pos[i][1] ## This is where my error occurs
ax.plot(lon, lat, 'o', label = "Ship " + str(i+1))
plot.show()
Obviously, I could just index pos[0][0][1] however, I'd like to know why I'm getting this issue. I'm coming from MATLAB so I suppose a lot of my issues will stem from differences in how numpy and MATLAB work, and hence any tips would also be appreciated!

I solved it, I didn't realise I could just use single square brackets for combining my two column arrays. So, changing pos = np.array([ship_lats], [ship_longs]], dtype = "d").T to pos = np.array([ship_lats, ship_longs], dtype = "d").T worked.

Recursively trying to find the maximum w/o loops

So I'm given a tuple of ordered pairs in this format:
(x,y) where
x represents the physical weight of the objects, y represents the cost/value of the object.
((5, 20), (10, 70), (40, 200), (20, 80), (10, 100))
Objects may only be used once, but there may be multiples of those objects in the original tuple of ordered pairs.
z is the max weight that can be shipped. It's an integer. z could be 50 or something like that.
Goal: Find the maximum value possible that you can send given the limit Z.
The difficulty is that we can ONLY use recursion and we cannot use loops nor can we use python built-in functions.
I've tried to work out the max value in a list of integers, which I did separately to try to get some sort of idea. I have also tried giving the objects a 'mass' and doing value/weight, but that didn't work very well either.
def maximum_val(objects: ((int,int),) , max_weight : int) -> int:
if max_weight == 0:
return 0
else:
return objects[0][1] + maximum_val(objects[1:], max_weight - gifts[0][0])
((5, 20), (10, 70), (40, 200), (20, 80), (10, 100))
Example: Given the tuple above and the limit Z=40, the best possible value that could be obtained is 250 -> (10, 70), (10, 100), (20, 80)

This is known as knapsack and you are looking for a recursive variant.
At every step, check what is best. Include the first object or skip the first object:
objects = ((5, 20), (10, 70), (40, 200), (20, 80), (10, 100))
def recursive_knapsack(objects, limit ):
if not objects:
return 0
if objects[0][0] > limit:
#first object cant fit
return recursive_knapsack(objects[1:],limit)
include = objects[0][1] + recursive_knapsack(objects[1:], limit-objects[0][0])
exclude = recursive_knapsack(objects[1:],limit)
if include < exclude:
return exclude
else:
return include

How to generate random numbers to satisfy a specific mean and median in python?

I would like to generate n random numbers e.g., n=200, where the range of possible values is between 2 and 40 with a mean of 12 and median is 6.5.
I searched everywhere and i could not find a solution for this. I tried the following script by it works for small numbers such as 20, for big numbers it takes ages and result is returned.
n=200
x = np.random.randint(0,1,size=n) # initalisation only
while True:
if x.mean() == 12 and np.median(x) == 6.5:
break
else:
x=np.random.randint(2,40,size=n)
Could anyone help me by improving this to get a quick result even when n=5000 or so?

One way to get a result really close to what you want is to generate two separate random ranges with length 100 that satisfies your median constraints and includes all the desire range of numbers. Then by concatenating the arrays the mean will be around 12 but not quite equal to 12. But since it's just mean that you're dealing with you can simply generate your expected result by tweaking one of these arrays.
In [162]: arr1 = np.random.randint(2, 7, 100)
In [163]: arr2 = np.random.randint(7, 40, 100)
In [164]: np.mean(np.concatenate((arr1, arr2)))
Out[164]: 12.22
In [166]: np.median(np.concatenate((arr1, arr2)))
Out[166]: 6.5
Following is a vectorized and very much optimized solution against any other solution that uses for loops or python-level code by constraining the random sequence creation:
import numpy as np
import math
def gen_random():
arr1 = np.random.randint(2, 7, 99)
arr2 = np.random.randint(7, 40, 99)
mid = [6, 7]
i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
decm, intg = math.modf(i)
args = np.argsort(arr2)
arr2[args[-41:-1]] -= int(intg)
arr2[args[-1]] -= int(np.round(decm * 40))
return np.concatenate((arr1, mid, arr2))
Demo:
arr = gen_random()
print(np.median(arr))
print(arr.mean())
6.5
12.0
The logic behind the function:
In order for us to have a random array with that criteria we can concatenate 3 arrays together arr1, mid and arr2. arr1 and arr2 each hold 99 items and the mid holds 2 items 6 and 7 so that make the final result to give as 6.5 as the median. Now we an create two random arrays each with length 99. All we need to do to make the result to have a 12 mean is to find the difference between the current sum and 12 * 200 and subtract the result from our N largest numbers which in this case we can choose them from arr2 and use N=50.
Edit:
If it's not a problem to have float numbers in your result you can actually shorten the function as following:
import numpy as np
import math
def gen_random():
arr1 = np.random.randint(2, 7, 99).astype(np.float)
arr2 = np.random.randint(7, 40, 99).astype(np.float)
mid = [6, 7]
i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
args = np.argsort(arr2)
arr2[args[-40:]] -= i
return np.concatenate((arr1, mid, arr2))

Here, you want a median value lesser than the mean value. That means that a uniform distribution is not appropriate: you want many little values and fewer great ones.
Specifically, you want as many value lesser or equal to 6 as the number of values greater or equal to 7.
A simple way to ensure that the median will be 6.5 is to have the same number of values in the range [ 2 - 6 ] as in [ 7 - 40 ]. If you choosed uniform distributions in both ranges, you would have a theorical mean of 13.75, which is not that far from the required 12.
A slight variation on the weights can make the theorical mean even closer: if we use [ 5, 4, 3, 2, 1, 1, ..., 1 ] for the relative weights of the random.choices of the [ 7, 8, ..., 40 ] range, we find a theorical mean of 19.98 for that range, which is close enough to the expected 20.
Example code:
>>> pop1 = list(range(2, 7))
>>> pop2 = list(range(7, 41))
>>> w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)
>>> r1 = random.choices(pop1, k=2500)
>>> r2 = random.choices(pop2, w2, k=2500)
>>> r = r1 + r2
>>> random.shuffle(r)
>>> statistics.mean(r)
12.0358
>>> statistics.median(r)
6.5
>>>
So we now have a 5000 values distribution that has a median of exactly 6.5 and a mean value of 12.0358 (this one is random, and another test will give a slightly different value). If we want an exact mean of 12, we just have to tweak some values. Here sum(r) is 60179 when it should be 60000, so we have to decrease 175 values which were neither 2 (would go out of range) not 7 (would change the median).
In the end, a possible generator function could be:
def gendistrib(n):
if n % 2 != 0 :
raise ValueError("gendistrib needs an even parameter")
n2 = n//2 # n / 2 in Python 2
pop1 = list(range(2, 7)) # lower range
pop2 = list(range(7, 41)) # upper range
w2 = [ 5, 4, 3, 2 ] + ( [1] * 30) # weights for upper range
r1 = random.choices(pop1, k=n2) # lower part of the distrib.
r2 = random.choices(pop2, w2, k=n2) # upper part
r = r1 + r2
random.shuffle(r) # randomize order
# time to force an exact mean
tot = sum(r)
expected = 12 * n
if tot > expected: # too high: decrease some values
for i, val in enumerate(r):
if val != 2 and val != 7:
r[i] = val - 1
tot -= 1
if tot == expected:
random.shuffle(r) # shuffle again the decreased values
break
elif tot < expected: # too low: increase some values
for i, val in enumerate(r):
if val != 6 and val != 40:
r[i] = val + 1
tot += 1
if tot == expected:
random.shuffle(r) # shuffle again the increased values
break
return r
It is really fast: I could timeit gendistrib(10000) at less than 0.02 seconds. But it should not be used for small distributions (less than 1000)

Ok, you're looking at the distribution which has no less than 4 parameters - two of those defining range and two responsible for required mean and median.
I could think about two possibilities from the top of my head:
Truncated normal distribution, look here for details. You have already range defined, and have to recover μ and σ from mean and median. It will require solving couple of nonlinear equation, but quite doable in python. Sampling could be done using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html
4-parameters Beta distribution, see here for details. Again, recovering α and β in Beta distribution from mean and median will require solving couple of non-linear equations. Knowing them sampling would be easy via https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.beta.html
UPDATE
Here how you could do it for truncated normal going from mean to mu: Truncated normal with a given mean

If you have a bunch of smaller arrays with the right median and mean, you can combine them to produce a larger array.
So... you can pre-generate smaller arrays as you are currently doing, and then combine them randomly for larger n. Of course, this will result in a biased random sample, but it sounds like you just want something that's approximately random.
Here's working (py3) code that generates a sample of size 5000 with your desired properties, which it build from smaller samples of size 4, 6, 8, 10, ..., 18.
Note, that I changed how the smaller random samples are built: half of the numbers must be <= 6 and half >= 7 if the median is to be 6.5, so we generate those halves independently. This speeds things up massively.
import collections
import numpy as np
import random
rs = collections.defaultdict(list)
for i in range(50):
n = random.randrange(4, 20, 2)
while True:
x=np.append(np.random.randint(2, 7, size=n//2), np.random.randint(7, 41, size=n//2))
if x.mean() == 12 and np.median(x) == 6.5:
break
rs[len(x)].append(x)
def random_range(n):
if n % 2:
raise AssertionError("%d must be even" % n)
r = []
while n:
i = random.randrange(4, min(20, n+1), 2)
# Don't be left with only 2 slots left.
if n - i == 2: continue
xs = random.choice(rs[i])
r.extend(xs)
n -= i
random.shuffle(r)
return r
xs = np.array(random_range(5000))
print([(i, list(xs).count(i)) for i in range(2, 41)])
print(len(xs))
print(xs.mean())
print(np.median(xs))
Output:
[(2, 620), (3, 525), (4, 440), (5, 512), (6, 403), (7, 345), (8, 126), (9, 111), (10, 78), (11, 25), (12, 48), (13, 61), (14, 117), (15, 61), (16, 62), (17, 116), (18, 49), (19, 73), (20, 88), (21, 48), (22, 68), (23, 46), (24, 75), (25, 77), (26, 49), (27, 83), (28, 61), (29, 28), (30, 59), (31, 73), (32, 51), (33, 113), (34, 72), (35, 33), (36, 51), (37, 44), (38, 25), (39, 38), (40, 46)]
5000
12.0
6.5
The first line of the output shows that there's 620 2's, 52 3's, 440 4's etc. in the final array.

While this post already has an accepted answer, I'd like to contribute a general non integer approach. It does not need loops or testing. The idea is to take a PDF with compact support. Taking the idea of the accepted answer of Kasrâmvd, make two distributions in the left and right interval. Chose shape parameters such that the mean falls to the given value. The interesting opportunity here is that one can create a continuous PDF, i.e. without jumps where the intervals join.
As an example I have chosen the beta distribution. To have finite non-zero values at the border I've chosen beta =1 for the left and alpha = 1 for the right.
Looking at the definition of the PDF and the requirement of the mean the continuity gives two equations:
4.5 / alpha = 33.5 / beta
2 + 6.5 * alpha / ( alpha + 1 ) + 6.5 + 33.5 * 1 / ( 1 + beta ) = 24
This is a quadratic equation rather easy to solve. The just using scipy.stat.beta like
from scipy.stats import beta
import matplotlib.pyplot as plt
import numpy as np
x1 = np.linspace(2, 6.5, 200 )
x2 = np.linspace(6.5, 40, 200 )
# i use s and t not alpha and beta
s = 1./737 *(np.sqrt(294118) - 418 )
t = 1./99 *(np.sqrt(294118) - 418 )
data1 = beta.rvs(s, 1, loc=2, scale=4.5, size=20000)
data2 = beta.rvs(1, t, loc=6.5, scale=33.5, size=20000)
data = np.concatenate( ( data1, data2 ) )
print np.mean( data1 ), 2 + 4.5 * s/(1.+s)
print np.mean( data2 ), 6.5 + 33.5/(1.+t)
print np.mean( data )
print np.median( data )
fig = plt.figure()
ax = fig.add_subplot( 1, 1, 1 )
ax.hist(data1, bins=13, density=True )
ax.hist(data2, bins=67, density=True )
ax.plot( x1, beta.pdf( x1, s, 1, loc=2, scale=4.5 ) )
ax.plot( x2, beta.pdf( x2, 1, t, loc=6.5, scale=33.5 ) )
ax.set_yscale( 'log' )
plt.show()
provides
>> 2.661366939244768 2.6495436216856976
>> 21.297348804473618 21.3504563783143
>> 11.979357871859191
>> 6.5006779033245135
so results are as required and it looks like:

"Sensibly" remove points in a Python list

Suppose I have two arrays indicating the x and y coordinates of a calibration curve.
X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
Y = [2,4,6,8,10,12,14,16,18,20,24,28,32,36,40,60,80,100]
My example arrays above contain 18 points. You'll notice that the x values are not linearly spaced; there are more points at lower values of x.
Let's suppose I need to reduce the number of points in my calibration curve to 13 points. Obviously, I could just remove the first five or the last five points, but that would shorten my overall range of x values. To maintain range and minimise the space between x values I would preferentially remove values x= 2,4,6,8,10. Removing these x points and their respective y values would leave 13 points in the curve as required.
How could I do this point selection and removal automatically in Python? I.e. Is there an algorithm to pick the best x points from a list, where "best" is defined as keeping the points as close as possible while keeping the overall range and adhering to the new number of points.
Please note that the points remaining must be in the original lists, so I can't interpolate the 18 points on to a 13 point grid.

This would maximize the square root distances between the chosen points. It in some sense spreads the points as far as possible.
import itertools
list(max(itertools.combinations(sorted(X), 13), i
key=lambda l: sum((a - b) ** 2 for a, b in zip(l, l[1:]))))
Note that this is only feasible for small problems. The time complexity for selecting k points is O(k * (len(X) choose k)), so basically O(exp(len(X)). So don't even think about using this for, e.g., len(X) == 100 and k == 10.

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 30, 40, 50]
Y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32, 36, 40, 60, 80, 100]
assert len(X) == len(set(X)), "Duplicate X values found"
points = list(zip(X, Y))
points.sort() # sorts by X
while len(points) > 13:
# Find index whose neighbouring X values are closest together
i = min(range(1, len(points) - 1), key=lambda p: points[p + 1][0] - points[p - 1][0])
points.pop(i)
print(points)
Output:
[(1, 2), (3, 6), (5, 10), (7, 14), (10, 20), (12, 24), (14, 28), (16, 32), (18, 36), (20, 40), (30, 60), (40, 80), (50, 100)]
If you want the original series again:
X, Y = zip(*points)

An algorithm that would achieve that:
Convert each number into the sum of the absolute difference to the number to the left and to the right. If a number is missing, first or last cases, then use MAX_INT. For example, 1 would become MAX_INT; 2 would become 2, 10 would become 3.
Remove the first case with the lowest sum.
If you need to remove more numbers, go to 1.
This would remove 2,4,6,8,10,3,...

Here is a recursive approach that repeatedly removes the point which will be the least missed:
def mostRedundantPoint(x):
#returns the index, i, in the range 0 < i < len(x) - 1
#that minimizes x[i+1] - x[i-1]
#assumes len(x) > 2 and that x
#is sorted in ascending order
gaps = [x[i+1] - x[i-1] for i in range(1,len(x)-1)]
i = gaps.index(min(gaps))
return i+1
def reduceList(x,k):
if len(x) <= k:
return x
else:
i = mostRedundantPoint(x)
return reduceList(x[:i]+x[i+1:],k)
X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
print(reduceList(X,13))
#prints [1, 3, 5, 7, 10, 12, 14, 16, 18, 20, 30, 40, 50]
This list essentially agrees with your intended output since 7 vs. 8 have the same net effect. It is reasonably quick in the sense that it is almost instantaneous in reducing sorted([random.randint(1,10**6) for i in range(1000)]) from 1000 elements to 100 elements. The fact that it is recursive implies that it will blow the stack if you try to remove many more points than that, but with what seems to be your intended problem size that shouldn't be an issue. If need be, you could of course replace the recursion by a loop.

Pairing images as np arrays into a specific format

So I have 2 images, X and Y, as numpy arrays, each of shape (3, 30, 30): that is, 3 channels (RGB), each of height and width 30 pixels. I'd like to pair them up into a numpy array to get a specific output shape:
my_pair = pair_up_images(X, Y)
my_pair.shape = (2, 3, 30, 30)
Such that I can get the original images by slicing:
my_pair[0] == X
my_pair[1] == Y
After a few attempts, I keep getting either:
my_pair.shape = (2,) #By converting the images into lists and adding them.
This works as well, but the next step in the pipeline just requires a shape (2, 3, 30, 30)
my_pair.shape = (6, 30, 30) # using np.vstack
my_pair.shape = (3, 60, 30) # using np.hstack
Thanks!

Simply:
Z = np.array([X, Y])
Z.shape
Out[62]: (2, 3, 30, 30)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to evaluate columns that contain lists in pandas? - python

Alternatively if you need to test box dimensions independently df.loc[(df['Box Dimensions'].str[0] <= 15) & (df['Box Dimensions'].str[1] <= 15) & (df['Box Dimensions'].str[2] <= 15)] Serial Number Box Dimensions 2 17827792 (10, 10, 12)

Related

Why is my array coming out as shape: (6, 1, 2) when it is made of two (6, ) arrays?

Recursively trying to find the maximum w/o loops

How to generate random numbers to satisfy a specific mean and median in python?

"Sensibly" remove points in a Python list

Pairing images as np arrays into a specific format

Categories

Resources