I have a vector of values vals, a same-dimension vector of frequencies freqs, and a set of frequency values pins.
I need to find the max values of vals within the corresponding interval around each pin (from pin-1 to pin+1). However, the intervals merge if they overlap (e.g., [1,2] and [0.5,1.5] become [0.5,2]).
I have code that (I think) works, but I feel it is far from optimal:
import numpy as np
np.random.seed(666)
freqs = np.linspace(0, 20, 50)
vals = np.random.randint(100, size=(len(freqs), 1)).flatten()
print(freqs)
print(vals)
pins = [2, 6, 10, 11, 15, 15.2]
# find one interval for every pin and then sum to find final ones
islands = np.zeros((len(freqs), 1)).flatten()
for pin in pins:
    island = np.zeros((len(freqs), 1)).flatten()
    island[(freqs >= pin-1) * (freqs <= pin+1)] = 1
    islands += island
islands = np.array([1 if x>0 else 0 for x in islands])
print(islands)
maxs = []
k = 0
idxs = []
for i,x in enumerate(islands):
    if (x > 0) and (k == 0): # island begins
        k += 1
        idxs.append(i)
    elif (x > 0) and (k > 0): # island continues
        pass
    elif (x == 0) and (k > 0): # island finishes
        idxs.append(i)
        maxs.append(np.max(vals[idxs[0]:idxs[1]]))
        k = 0
        idxs = []
        continue
print(maxs)
Which gives maxs=[73, 97, 79, 77].
Here are some optimizations for your code. There are many NumPy functions that make your life easier; get to know them and use them ;). I tried to comment my code to make it as understandable as possible, but let me know if anything is unclear!
import numpy as np
np.random.seed(666)
freqs = np.linspace(0, 20, 50)
vals = np.random.randint(100, size=(len(freqs), 1)).flatten()
print(freqs)
print(vals)
pins = [2, 6, 10, 11, 15, 15.2]
# find one interval for every pin and then sum to find final ones
islands = np.zeros_like(freqs) # instead of: np.zeros((len(freqs), 1)).flatten()
for pin in pins:
    island = np.zeros_like(freqs) # see above comment
    island[(freqs >= pin-1) & (freqs <= pin+1)] = 1 # "&" makes it more readable
    islands += island
# instead of np.array([1 if x>0 else 0 for x in islands])
islands = np.where(islands > 0, 1, 0) # read as: where "islands > 0" put a '1', else put a '0'
# compare each value with the next to get island/sea transitions (islands are 1's, seas are 0's)
island_edges = islands[:-1] != islands[1:]
# split at the edges (+1 to account for the comparison starting at index 1)
# islands_and_seas is a list of 'seas' and 'islands'
islands_and_seas = np.split(islands, np.where(island_edges)[0]+1)
# do the same as above but on the 'vals' array
islands_and_seas_vals = np.split(vals, np.where(island_edges)[0]+1)
# get the max values for the seas and islands
max_vals = np.array([np.max(arr) for arr in islands_and_seas_vals])
# create an array where the islands -> True, and seas -> False
islands_and_seas_bool = [np.all(arr) for arr in islands_and_seas]
# select only the max values of the islands
maxs = max_vals[islands_and_seas_bool]
print(maxs)
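As a further option, here is a minimal sketch that skips the island mask entirely by merging the pin intervals first. It should give the same result here, assuming freqs is sorted and consecutive merged intervals are separated by at least one frequency sample (as they are in this example):
# merge overlapping [pin-1, pin+1] intervals, then take the max of vals inside each
intervals = sorted((p - 1, p + 1) for p in pins)
merged = [list(intervals[0])]
for lo, hi in intervals[1:]:
    if lo <= merged[-1][1]:  # overlaps the previous interval -> extend it
        merged[-1][1] = max(merged[-1][1], hi)
    else:
        merged.append([lo, hi])
maxs = [vals[(freqs >= lo) & (freqs <= hi)].max() for lo, hi in merged]
print(maxs)  # should again give [73, 97, 79, 77]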
I have a Pandas DataFrame containing 3 categorical grouping variables and 1 numerical outcome variable. Within each group, there is an n = 6, where one of these values may be an outlier (as defined by the distribution within each group: an outlier can either exceed quartile 3 by 1.5 times the inter-quartile range, or be less than quartile 1 by 1.5 times the inter-quartile range).
An example of the DataFrame is shown below:
# Making the df without our outcome variable
import numpy as np
import pandas as pd
G1 = np.repeat(['E', 'F'], 24)
G2 = np.tile(np.repeat(['C', 'D'], 6), 4)
G3 = np.tile(np.repeat(['A', 'B'], 12), 2)
dummy_data = pd.DataFrame({'G1' : G1, 'G2' : G2, 'G3': G3})
# Defining a function to generate a numpy array with n = 6, where one of these values is an outlier by our previous definition
np.random.seed(0)
def outlier_arr(low, high):
    norm_arr = np.random.randint(low, high, 5)
    IQR = np.percentile(norm_arr, 75) - np.percentile(norm_arr, 25)
    upper_fence = np.percentile(norm_arr, 75) + (IQR * 1.5)
    lower_fence = np.percentile(norm_arr, 25) - (IQR * 1.5)
    rand_decision = np.random.randint(0, 2, 1)[0]
    if rand_decision == 1:
        high_outlier = np.round(upper_fence * 3, decimals = 0)
        final_arr = np.hstack([norm_arr, high_outlier])
    else:
        low_outlier = np.round(lower_fence * (1/3), decimals = 0)
        final_arr = np.hstack([norm_arr, low_outlier])
    return final_arr.astype(int)
# Making a list to add into the dataframe to represent our values
abund_arr = []
for i in range(0, 8):
    abund_arr = abund_arr + outlier_arr(700, 800).tolist()
abund_arr = np.array(abund_arr)
# Adding this array as a new column
dummy_data['V1'] = abund_arr
This should generate a DataFrame with 3 grouping variables G1, G2, and G3, and a single outcome variable V1 where each group should have one outlier that needs to be removed. We can look at the first 6 rows (a single group) with dummy_data.head(6) below to see that one of these values (the last row) is an outlier that we would like to filter out.
G1 G2 G3 V1
0 E C A 744
1 E C A 747
2 E C A 764
3 E C A 767
4 E C A 767
5 E C A 2391 <--- outlier
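(As a quick check of the definition on this group: np.percentile over all six values gives Q1 = 751.25 and Q3 = 767, so IQR = 15.75 and the upper fence is 767 + 1.5 * 15.75 = 790.625, which 2391 clearly exceeds.)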
From what I understand, a good approach may be to use df.groupby().filter(): group by variables G1, G2, and G3 and pass filter() a user-defined function that returns T/F based on the outlier criteria discussed above.
I have tried this, where the function for detecting outliers (returns array of True or False) within an array is found below:
def is_outlier(x):
    IQR = np.percentile(x, 75) - np.percentile(x, 25)
    upper_fence = np.percentile(x, 75) + (IQR * 1.5)
    lower_fence = np.percentile(x, 25) - (IQR * 1.5)
    return (x > upper_fence) | (x < lower_fence)
Which correctly detects an outlier as shown below:
test_arr = outlier_arr(300, 500)
is_outlier(test_arr)
# returns an array of [False, False, False, False, False, True]
However, when using the method described above on a pandas object, the following code throws no errors, but also does not filter any of the outliers:
dummy_data.groupby(['G1', 'G2', 'G3']).filter(lambda x: (is_outlier(x['V1'])).any())
NOTE: I actually found a way to do this here, where you use apply() instead of filter().
Running dummy_data[~dummy_data.groupby(['G1', 'G2', 'G3'])['V1'].apply(is_outlier)] produced the desired result.
However, just for the sake of doing it with this method, what needs to be tweaked to get this to work using filter()? If it's possible, which of the two ways is correct/preferred?
Thanks in advance.
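A note on why the filter() attempt behaves that way: DataFrameGroupBy.filter() keeps or drops entire groups based on a single True/False per group, so lambda x: is_outlier(x['V1']).any() keeps every group that contains an outlier; it can never remove individual rows. The closest filter() can get is the opposite, dropping whole groups that contain an outlier. A minimal sketch contrasting the two, assuming the dummy_data and is_outlier defined above:
# row-level: drop only the outlier rows (equivalent to the apply() solution)
mask = dummy_data.groupby(['G1', 'G2', 'G3'])['V1'].transform(is_outlier)
row_filtered = dummy_data[~mask.astype(bool)]
# group-level: filter() can only keep or drop whole groups
# (in this constructed example every group has an outlier, so this returns an empty frame)
group_filtered = dummy_data.groupby(['G1', 'G2', 'G3']).filter(
    lambda x: not is_outlier(x['V1']).any())
So for removing single outlier rows within groups, the apply()/transform() mask approach is the one to prefer.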
Trying to backtest trading logic for fun, but I can't seem to work out how to use numpy to make decisions. For example, I want to set df['position'] to 1 or -1 based on whether the data is below or above the upper and lower lines. If Data <= the lower line, I want to set position = 1 and keep it at 1 until Data >= the upper line. Once Data >= the upper line, I want to set position = -1, keep it at -1, and then repeat.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = np.random.standard_normal((5, 100)).flatten()
data = data.cumsum()
df = pd.DataFrame({'Data': data})
df['std'] = df['Data'].rolling(50).std()
df['SMA'] = df['Data'].rolling(50).mean()
df['upper'] = df['SMA'] + (2 * df['std'])
df['lower'] = df['SMA'] - (2 * df['std'])
df[['Data', 'SMA', 'upper', 'lower']].plot(figsize=(10, 6))
df['position'] = 0
plt.show()
Here I try to do just that but fail because I don't know how to do this properly.
df['islower'] = np.where(df['Data'] < df['lower'], 1, 0)
df['isupper'] = np.where(df['Data'] > df['upper'], 1, 0)
df['position'] = np.where(df['isupper']==1, -1, 0) | np.where(df['islower']==1, 1, 0)
I think what you want to do is mark the position only where a crossing happens, and then forward-fill it until the next crossing:
df['position'] = np.where(df['Data'] <= df['lower'], 1,
                 np.where(df['Data'] >= df['upper'], -1, np.nan))
df['position'] = df['position'].ffill().fillna(0)  # hold the last signal; 0 before the first crossing
I have code that sequentially checks whether every pair of Cartesian coordinates in my DataFrame falls into certain enclosed geometric areas. But it is rather slow, I suspect because it is not vectorized. Here is an example:
import numpy as np
import pandas as pd
from matplotlib.patches import Rectangle
r1 = Rectangle((0,0), 10, 10)
r2 = Rectangle((50,50), 10, 10)
df = pd.DataFrame([[1,2],[-1,5], [51,52]], columns=['x', 'y'])
df['location'] = np.nan
for j in range(df.shape[0]):
    coordinates = df.x.iloc[j], df.y.iloc[j]
    if r1.contains_point(coordinates):
        df.loc[j, 'location'] = 0
    elif r2.contains_point(coordinates):
        df.loc[j, 'location'] = 1
Can someone propose an approach for speed-up?
It's better to extract the extents of the rectangular patches as arrays and test the coordinates against those bounds directly.
def seqcheck_vect(df):
    xy = df[["x", "y"]].values
    e1 = np.asarray(r1.get_extents())
    e2 = np.asarray(r2.get_extents())
    r1m1, r1m2 = np.min(e1), np.max(e1)
    r2m1, r2m2 = np.min(e2), np.max(e2)
    out = np.where(((xy >= r1m1) & (xy <= r1m2)).all(axis=1), 0,
                   np.where(((xy >= r2m1) & (xy <= r2m2)).all(axis=1), 1, np.nan))
    return df.assign(location=out)
For the given sample the function outputs:
    x   y  location
0   1   2       0.0
1  -1   5       NaN
2  51  52       1.0
Benchmarks:
def loopy_version(df):
    for j in range(df.shape[0]):
        coordinates = df.x.iloc[j], df.y.iloc[j]
        if r1.contains_point(coordinates):
            df.loc[j, "location"] = 0
        elif r2.contains_point(coordinates):
            df.loc[j, "location"] = 1
        else:
            pass
    return df
testing on a DF of 10K rows:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, (10000,2)), columns=list("xy"))
# check if both give same outcome
loopy_version(df).equals(seqcheck_vect(df))
True
%timeit loopy_version(df)
1 loop, best of 3: 3.8 s per loop
%timeit seqcheck_vect(df)
1000 loops, best of 3: 1.73 ms per loop
So, the vectorized approach is approximately 2200 times faster compared to the loopy one.
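One caveat: taking the global min and max of each extents array only works here because both patches are squares, so the x and y bounds coincide. A small sketch that checks each axis against its own bounds would generalize to any axis-aligned rectangle (reusing the df and the r1/r2 patches from above):
def in_extent(df, rect):
    # rect.get_extents() gives the corner points [[x0, y0], [x1, y1]]
    (x0, y0), (x1, y1) = np.asarray(rect.get_extents())
    return (df['x'] >= x0) & (df['x'] <= x1) & (df['y'] >= y0) & (df['y'] <= y1)
out = np.where(in_extent(df, r1), 0, np.where(in_extent(df, r2), 1, np.nan))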
I. Am. Stuck.
I have been working on this for over a week now, and I cannot seem to get my code to run correctly. I am fairly new to PIL and Python as a whole. I am trying to make a 2x3 collage of some pictures. My code is listed below. I am trying to get my photos to fit without any excess black space in the newly created collage; however, when I run my code I can only get 2 pictures placed into the collage, instead of the 6 I want. Any suggestions would be helpful.
*CODE EDITED
from PIL import Image
im= Image.open('Tulips.jpg')
out=im.convert("RGB", (
0.412453, 0.357580, 0.180423, 0,
0.212671, 0.715160, 0.072169, 0,
0.019334, 0.119193, 0.950227, 0 ))
out.save("Image2" + ".jpg")
out2=im.convert("RGB", (
0.9756324, 0.154789, 0.180423, 0,
0.212671, 0.715160, 0.254783, 0,
0.123456, 0.119193, 0.950227, 0 ))
out2.save("Image3" + ".jpg")
out3= im.convert("1")
out3.save("Image4"+".jpg")
out4=im.convert("RGB", (
0.986542, 0.154789, 0.756231, 0,
0.212671, 0.715160, 0.254783, 0,
0.123456, 0.119193, 0.112348, 0 ))
out4.save("Image5" + ".jpg")
out5=Image.blend(im, out4, 0.5)
out5.save("Image6" + ".jpg")
listofimages=['Tulips.jpg', 'Image2.jpg', 'Image3.jpg', 'Image4.jpg', 'Image5.jpg', 'Image6.jpg']
def create_collage(width, height, listofimages):
    Picturewidth=width//3
    Pictureheight=height//2
    size=Picturewidth, Pictureheight
    new_im=Image.new('RGB', (450, 300))
    for p in listofimages:
        Image.open(p)
        for col in range(0,width):
            for row in range(0, height):
                image=Image.eval(p, lambda x: x+(col+row)/30)
                new_im.paste(p, (col,row))
    new_im.save("Collage"+".jpg")
create_collage(450,300,listofimages)
Here's some working code.
When you call Image.open(p), that returns an Image object, so you need to store that in a variable: im = Image.open(p).
I'm not sure what image=Image.eval(p, lambda x: x+(col+row)/30) is meant to do so I removed it.
size is the size of the thumbnails, but you're not using that variable. After opening the image, it should be resized to size.
I renamed Picturewidth and Pictureheight to thumbnail_width and thumbnail_height to make it clear what they are and follow Python naming conventions.
I also moved the number of cols and rows to variables so they can be reused without magic numbers.
The first loop opens each image into an im, thumbnails it and puts it in a list of ims.
Before the next loops we initialise i, x, and y variables to keep track of which image we're looking at, and the x and y coordinates at which to paste the thumbnails into the larger canvas. They'll be updated in the next loops.
The outer loop is over columns (cols), not pixels (width). (Also, range(0, thing) does the same as range(thing).)
Similarly, the inner loop is over rows instead of pixels. Inside this loop we paste the current image at ims[i] into the big new_im at x, y. These are pixel positions, not row/column positions.
At the end of the inner loop, increment the i counter, and add thumbnail_height to y.
Similarly, at the end of the outer loop, add thumbnail_width to x and reset y to zero.
You only need to save new_im once, after these loops have finished.
There's no need for concatenating "Image2" + ".jpg" etc., just do "Image2.jpg".
This results in a 3-column by 2-row collage of the six images.
This code could be improved. For example, if you don't need them for anything else, there's no need to save the intermediate ImageX.jpg files. Rather than putting those filenames in listofimages, put the images there directly: listofimages = [im, out, out2, out3, out4, out5], then replace for p in listofimages: with for im in listofimages: and remove im = Image.open(p).
You could also calculate some padding for the images so the black space is even.
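A minimal sketch of that in-memory variant (only the lines that change from the code below; the copy() is my addition so that thumbnail(), which modifies images in place, doesn't shrink the originals):
listofimages = [im, out, out2, out3, out4, out5]
# inside create_collage, the first loop becomes:
ims = []
for im in listofimages:
    thumb = im.copy()  # work on a copy so the original image is untouched
    thumb.thumbnail(size)
    ims.append(thumb)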
from PIL import Image
im= Image.open('Tulips.jpg')
out=im.convert("RGB", (
0.412453, 0.357580, 0.180423, 0,
0.212671, 0.715160, 0.072169, 0,
0.019334, 0.119193, 0.950227, 0 ))
out.save("Image2.jpg")
out2=im.convert("RGB", (
0.9756324, 0.154789, 0.180423, 0,
0.212671, 0.715160, 0.254783, 0,
0.123456, 0.119193, 0.950227, 0 ))
out2.save("Image3.jpg")
out3= im.convert("1")
out3.save("Image4.jpg")
out4=im.convert("RGB", (
0.986542, 0.154789, 0.756231, 0,
0.212671, 0.715160, 0.254783, 0,
0.123456, 0.119193, 0.112348, 0 ))
out4.save("Image5.jpg")
out5=Image.blend(im, out4, 0.5)
out5.save("Image6.jpg")
listofimages=['Tulips.jpg', 'Image2.jpg', 'Image3.jpg', 'Image4.jpg', 'Image5.jpg', 'Image6.jpg']
def create_collage(width, height, listofimages):
    cols = 3
    rows = 2
    thumbnail_width = width//cols
    thumbnail_height = height//rows
    size = thumbnail_width, thumbnail_height
    new_im = Image.new('RGB', (width, height))
    ims = []
    for p in listofimages:
        im = Image.open(p)
        im.thumbnail(size)
        ims.append(im)
    i = 0
    x = 0
    y = 0
    for col in range(cols):
        for row in range(rows):
            print(i, x, y)
            new_im.paste(ims[i], (x, y))
            i += 1
            y += thumbnail_height
        x += thumbnail_width
        y = 0
    new_im.save("Collage.jpg")
create_collage(450, 300, listofimages)
I made a solution inspired by @Hugo's answer which only requires the input list of images. The function automatically creates a grid based on the number of input images.
import os
from typing import List, Tuple
from PIL import Image

def find_multiples(number : int):
    multiples = set()
    for i in range(number - 1, 1, -1):
        mod = number % i
        if mod == 0:
            tup = (i, int(number / i))
            if tup not in multiples and (tup[1], tup[0]) not in multiples:
                multiples.add(tup)
    if len(multiples) == 0:
        mod = number % 2
        div = number // 2
        multiples.add((2, div + mod))
    return list(multiples)
def get_smallest_multiples(number : int, smallest_first = True) -> Tuple[int, int]:
    multiples = find_multiples(number)
    smallest_sum = number
    index = 0
    for i, m in enumerate(multiples):
        pair_sum = m[0] + m[1]
        if pair_sum < smallest_sum:
            smallest_sum = pair_sum
            index = i
    result = list(multiples[index])
    if smallest_first:
        result.sort()
    return result[0], result[1]
def create_collage(listofimages : List[str], n_cols : int = 0, n_rows: int = 0,
                   thumbnail_scale : float = 1.0, thumbnail_width : int = 0, thumbnail_height : int = 0):
    n_cols = n_cols if n_cols >= 0 else abs(n_cols)
    n_rows = n_rows if n_rows >= 0 else abs(n_rows)
    if n_cols == 0 and n_rows != 0:
        n_cols = len(listofimages) // n_rows
    if n_rows == 0 and n_cols != 0:
        n_rows = len(listofimages) // n_cols
    if n_rows == 0 and n_cols == 0:
        n_cols, n_rows = get_smallest_multiples(len(listofimages))
    thumbnail_width = 0 if thumbnail_width == 0 or n_cols == 0 else round(thumbnail_width / n_cols)
    thumbnail_height = 0 if thumbnail_height == 0 or n_rows == 0 else round(thumbnail_height / n_rows)
    all_thumbnails : List[Image.Image] = []
    for p in listofimages:
        thumbnail = Image.open(p)
        if thumbnail_width * thumbnail_scale < thumbnail.width:
            thumbnail_width = round(thumbnail.width * thumbnail_scale)
        if thumbnail_height * thumbnail_scale < thumbnail.height:
            thumbnail_height = round(thumbnail.height * thumbnail_scale)
        thumbnail.thumbnail((thumbnail_width, thumbnail_height))
        all_thumbnails.append(thumbnail)
    new_im = Image.new('RGB', (thumbnail_width * n_cols, thumbnail_height * n_rows), 'white')
    i, x, y = 0, 0, 0
    for col in range(n_cols):
        for row in range(n_rows):
            if i > len(all_thumbnails) - 1:
                continue
            print(i, x, y)
            new_im.paste(all_thumbnails[i], (x, y))
            i += 1
            y += thumbnail_height
        x += thumbnail_width
        y = 0
    extension = os.path.splitext(listofimages[0])[1]
    if extension == "":
        extension = ".jpg"
    destination_file = os.path.join(os.path.dirname(listofimages[0]), f"Collage{extension}")
    new_im.save(destination_file)
Example usage:
listofimages=['Tulips.jpg', 'Image2.jpg', 'Image3.jpg', 'Image4.jpg', 'Image5.jpg', 'Image6.jpg']
create_collage(listofimages)
In this case, because there are 6 input images, the function produces a 3x2 (3 rows, 2 columns) collage.
To do so, the function finds the pair of integer factors of the input list's length with the smallest sum (e.g. for 12 it picks 3 and 4 rather than 2 and 6) and builds the grid from it. The smaller of the two is always taken as the number of columns, i.e. by default the grid gets fewer columns than rows; for 12 images you get a 4x3 grid (4 rows, 3 columns). This can be customized via the smallest_first argument (only exposed in get_smallest_multiples()).
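For instance, under this scheme get_smallest_multiples(12) should return (3, 4) and get_smallest_multiples(6) should return (2, 3).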
Optional arguments also let you force a number of rows/columns.
The final image size is the sum of the sizes of the single images, but an optional thumbnail_scale argument lets you specify a scaling percentage for all the thumbnails (it defaults to 1.0, i.e. 100%, no scaling).
This function works well when the sizes of the images are all roughly the same. I have not covered more complex scenarios.