Break DataFrame into multiple DataFrames depending on monotonically decreasing nature - python

Assume I have a DataFrame column containing alternating runs of increasing and decreasing values (like a sawtooth pattern).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
y = list(np.linspace(0, 1, 3)) + list(np.linspace(1,0, 11)) * 5
x = list(range(len(y)))
df = pd.DataFrame({"x": x, "y":y})
df.plot("x", "y")
Now I would like to extract these down-sliding sections into separate DataFrames. What would be the best way to do this?
What I expect to get is a list of DataFrames like the one below (the image shows the data of the first one):
pd.DataFrame({"x": range(11), "y":list(np.linspace(1,0, 11))}).plot("x", "y")

Use:
s = (df['y']-df['y'].shift(-1)>0)
t = s-s.shift(1)
u = t[t>=0].astype(int).cumsum()
u = u[u>0]
df.loc[u.index].groupby(u)['y'].apply(list)
Output:
y
1 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
2 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
3 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
4 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
5 [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999...
Name: y, dtype: object
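The answer above collects the y values of each run as lists, while the question asked for separate DataFrames. A minimal sketch of one way to get those, assuming a "falling run" means a maximal stretch of strictly decreasing y together with the peak it starts from (the names falling, in_run, run_id and pieces are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

y = list(np.linspace(0, 1, 3)) + list(np.linspace(1, 0, 11)) * 5
df = pd.DataFrame({"x": range(len(y)), "y": y})

# True where the value is lower than in the previous row (a downward step).
falling = df["y"].diff() < 0
# A row is part of a run if it is a downward step or if the next row is one,
# so the peak at the top of each ramp is kept.
in_run = falling | falling.shift(-1, fill_value=False)
# A run starts at each row that is in a run but is not itself a downward step.
run_id = (in_run & ~falling).cumsum()

# One DataFrame per run, each with a fresh 0..n-1 index.
pieces = [g.reset_index(drop=True)
          for _, g in df[in_run].groupby(run_id[in_run])]

print(len(pieces))
print(pieces[0])
```

Each element of pieces keeps the original x values; call pieces[i].plot("x", "y") to plot a single ramp.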

Related

NumPy slicing squares in 2D array

I want to create a heightfield map that consists of squares of random height. Given an array of NxN, I want that every square of size MxM, where M<N, will be at the same random height, with the height sampled from a uniform distribution. For example, if we have N = 6 and M = 2, we would have:
0.2, 0.2, 0.6, 0.6, 0.1, 0.1,
0.2, 0.2, 0.6, 0.6, 0.1, 0.1,
0.5, 0.5, 0.3, 0.3, 0.8, 0.8,
0.5, 0.5, 0.3, 0.3, 0.8, 0.8,
0.6, 0.6, 0.4, 0.4, 0.9, 0.9,
0.6, 0.6, 0.4, 0.4, 0.9, 0.9
For now, I've come up with an inefficient way of doing it with 2 nested for loops. I'm sure there must be an efficient and elegant way to do that with NumPy slicing.
This solution, using the repeat() method, works when N is an integer multiple of M.
import numpy as np
N = 6
M = 2
values = np.random.random( [N//M, N//M] )
y = values.repeat( M, axis=0 ).repeat( M, axis=1 )
print(y)
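As a cross-check, the same block layout can also be built with np.kron. This is only a sketch: the divisibility check, the seeded generator and the variable names are assumptions added for illustration.

```python
import numpy as np

N, M = 6, 2
if N % M != 0:
    raise ValueError("N must be an integer multiple of M")

# Assumption: a seeded generator, only so that repeated runs match.
rng = np.random.default_rng(0)
values = rng.uniform(size=(N // M, N // M))

# Approach from the answer: repeat each entry M times along both axes.
tiled_repeat = values.repeat(M, axis=0).repeat(M, axis=1)
# Alternative: the Kronecker product with an M x M block of ones.
tiled_kron = np.kron(values, np.ones((M, M)))

print(tiled_repeat.shape)
print(np.array_equal(tiled_repeat, tiled_kron))
```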

Open interval (a,b) and half-open interval (a,b] using Python's linspace

A half-open interval of the form [0,0.5) can be created using the following code:
rv = np.linspace(0., 0.5, nr, endpoint=False)
where nr is the number of points in the interval.
Question: How do I use linspace to create an open interval of the form (a,b) or a half-open interval of the form (a,b]?
Probably the simplest way (since this functionality isn't built in to np.linspace()) is to just slice what you want.
Let's say you're interested in the interval [0,1] with a spacing of 0.1.
>>> import numpy as np
>>> np.linspace(0, 1, 11) # [0,1]
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
>>> np.linspace(0, 1, 11-1, endpoint=False) # [0,1)
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
>>> np.linspace(0, 1, 11)[:-1] # [0,1) again
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
>>> np.linspace(0, 1, 11)[1:] # (0,1]
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
>>> np.linspace(0, 1, 11)[1:-1] # (0,1)
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
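If you would rather specify how many points you want inside the interval, the slicing idea can be wrapped in small helpers. The function names below are hypothetical and not part of NumPy; this is a sketch of one possible convention.

```python
import numpy as np

def open_linspace(a, b, n):
    """Hypothetical helper: n evenly spaced points strictly inside (a, b)."""
    # Generate both endpoints as well, then drop them.
    return np.linspace(a, b, n + 2)[1:-1]

def left_open_linspace(a, b, n):
    """Hypothetical helper: n evenly spaced points in (a, b]."""
    # Generate the left endpoint as well, then drop it.
    return np.linspace(a, b, n + 1)[1:]

print(open_linspace(0, 1, 9))
print(left_open_linspace(0, 1, 10))
```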

How do I grab random elements on python from paired lists?

I tried to compare drop height versus rebound height and have some data here:
drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]
I want to select 5 random data points off of these variables, so I tried
from random import randint
smol_drop_heights = []
smol_rebound_heights = []
for each in range(0,5):
    smol_drop_heights.append(drop_heights[randint(0, 9)])
    smol_rebound_heights.append(rebound_heights[randint(0, 9)])
print(smol_drop_heights)
print(smol_rebound_heights)
When I print them, the two lists no longer line up as pairs, and values are sometimes repeated. How do I fix this?
[0.8, 1.6, 0.6, 0.2, 0.12]
[1.02, 1.15, 0.88, 0.88, 0.6]
Here is a sample output, where you can see .88 is repeated.
A simple way to avoid repetitions and keep the data points paired is to sort the pairs in random order and take the first five:
from random import random
drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]
pairs = list(sorted(zip(drop_heights, rebound_heights), key=lambda _: random()))[:5]
smol_drop_heights = [d for d, _ in pairs]
smol_rebound_heights = [r for _, r in pairs]
One way to do it would be:
drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]
indices = [*range(len(drop_heights))]
from random import shuffle
shuffle(indices)
smol_drop_heights = []
smol_rebound_heights = []
for each in indices:
    smol_drop_heights.append(drop_heights[each])
    smol_rebound_heights.append(rebound_heights[each])
print(smol_drop_heights)
print(smol_rebound_heights)
Output:
[1.7, 0.8, 1.6, 1.2, 0.2, 0.4, 1.4, 2.0, 1.0, 0.6]
[1.34, 0.6, 1.15, 0.88, 0.16, 0.3, 1.02, 1.51, 0.74, 0.46]
Or, much shorter:
from random import sample
drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]
paired = [*zip(drop_heights, rebound_heights)]
smol_drop_heights, smol_rebound_heights = zip(*sample(paired,5))
print(smol_drop_heights[:5])
print(smol_rebound_heights[:5])
Here's what I would do.
import random
import numpy as np
k=5
drop_heights = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0])
rebound_heights = np.array([0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51])
idx = random.sample(range(len(drop_heights )), k)
print(drop_heights[idx])
print(rebound_heights [idx])
You could try shuffling and then use the index of the original items like,
>>> drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
>>> rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]
>>>
>>> import random
>>> d = drop_heights[:] # keep a copy to get index for making pairs later
>>> random.shuffle(drop_heights)
>>> # iterate through the new list and get the index of the item
>>> # from the original lists
>>> nd, nr = zip(*[(x,rebound_heights[d.index(x)]) for x in drop_heights])
>>> nd[:5]
(1.4, 0.6, 1.7, 0.2, 1.0)
>>> nr[:5]
(1.02, 0.46, 1.34, 0.16, 0.74)
or just use operator.itemgetter and random.sample like,
>>> drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
>>> rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]
>>>
>>> import random, operator
>>> indexes = random.sample(range(len(drop_heights)), 5)
>>> indexes
[5, 0, 4, 7, 3]
>>> f = operator.itemgetter(*indexes)
>>> f(drop_heights)
(1.2, 0.2, 1.0, 1.6, 0.8)
>>> f(rebound_heights)
(0.88, 0.16, 0.74, 1.15, 0.6)
Your problem is that each call to randint returns a different random number, so the two lists are indexed independently. To fix this, store one random index per loop iteration and use that same index for both lists:
for each in range(0, 5):
    index = randint(0, 9)
    smol_drop_heights.append(drop_heights[index])
    smol_rebound_heights.append(rebound_heights[index])
print(smol_drop_heights)
print(smol_rebound_heights)
To avoid repeats, check whether the lists already contain the value you are about to add. Checking either list is enough, since neither contains duplicates. Because a draw may be rejected, a for loop with a fixed number of iterations is not sufficient; you have to keep drawing until the lists are full.
So my final solution is:
smol_drop_heights = []
smol_rebound_heights = []
while True:
    index = randint(0, 9)
    if drop_heights[index] not in smol_drop_heights:
        smol_drop_heights.append(drop_heights[index])
        smol_rebound_heights.append(rebound_heights[index])
    if len(smol_drop_heights) == 5:
        break
print(smol_drop_heights)
print(smol_rebound_heights)
And since you may want the values in order, you can sort both lists once they are full. Sorting them separately keeps the pairs aligned only because rebound height increases with drop height in this data:
smol_drop_heights = []
smol_rebound_heights = []
while True:
    index = randint(0, 9)
    if drop_heights[index] not in smol_drop_heights:
        smol_drop_heights.append(drop_heights[index])
        smol_rebound_heights.append(rebound_heights[index])
    if len(smol_drop_heights) == 5:
        smol_drop_heights.sort()
        smol_rebound_heights.sort()
        break
print(smol_drop_heights)
print(smol_rebound_heights)
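If the rebound heights did not rise in step with the drop heights, sorting the two lists separately would scramble the pairs. A minimal sketch that sorts the pairs themselves instead; the fixed seed is an assumption added purely so the run is repeatable:

```python
import random

drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]

random.seed(0)  # assumption: fixed seed, only for repeatability

# Draw five distinct (drop, rebound) pairs, then sort the pairs by drop height.
pairs = random.sample(list(zip(drop_heights, rebound_heights)), 5)
pairs.sort()  # each rebound value travels with its drop value

smol_drop_heights = [d for d, _ in pairs]
smol_rebound_heights = [r for _, r in pairs]
print(smol_drop_heights)
print(smol_rebound_heights)
```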
OK, you want to do two things. First, pair your lists. The idiomatic way to do this is to use zip:
drop_heights = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.7, 2.0]
rebound_heights = [0.16, 0.30, 0.46, 0.6, 0.74, 0.88, 1.02, 1.15, 1.34, 1.51]
paired = list(zip(drop_heights, rebound_heights))
Then, you want to sample five pairs from this. So use random.sample:
sampled = random.sample(paired, 5)
Finally, if you need them to be in separate lists (you probably don't, but if you must), you can unpack it like this:
smol_drop_heights, smol_rebound_heights = zip(*sampled)
You can actually do this all at once, although it might become a bit unreadable:
smol_drop_heights, smol_rebound_heights = zip(*random.sample(list(zip(drop_heights, rebound_heights)), 5))

Numpy Arange generating values of inconsistent decimal points

Can someone explain to me what's happening here?
Why are there more decimal places for the 0.3 and 0.7 values?
I just want values with 1 decimal place.
threshold_range = np.arange(0.1,1,0.1)
threshold_range.tolist()
[Output]: [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]
Use np.round, for example:
import numpy as np
threshold_range = np.arange(0.1,1,0.1)
print(threshold_range.tolist())
print(np.round(threshold_range, 2).tolist())
Output:
[0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Solution: you can simply use the array's round method:
threshold_range = np.arange(0.1,1,0.1).round(1)
threshold_range.tolist() # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
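Reason for the error: it comes down to floating-point precision ;) 0.1 has no exact binary representation, and np.arange effectively builds each value as start + i*step, so some results (such as 0.1 + 2*0.1) land just off the nearest one-decimal value.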
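The effect can be reproduced without NumPy, and one way to sidestep it is to build the range from integers and divide once at the end. This is a sketch of that alternative rather than something proposed in the answers above; the variable names are illustrative.

```python
import numpy as np

# The same rounding error in plain Python floats.
print(0.1 + 2 * 0.1)  # 0.30000000000000004

# Build integer steps first, then divide once: each element is a single
# correctly rounded division, so it matches the literals 0.1, 0.2, ...
threshold_range = np.arange(1, 10) / 10
print(threshold_range.tolist())
```

Rounding with np.round, as in the answers above, remains the simpler fix when the array already exists.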

Python, pandas: how to extract values from a symmetric, multi-index dataframe

I have a symmetric, multi-index dataframe from which I want to systematically extract data:
import pandas as pd
df_index = pd.MultiIndex.from_arrays(
[["A", "A", "B", "B"], [1, 2, 3, 4]], names = ["group", "id"])
df = pd.DataFrame(
[[1.0, 0.5, 0.3, -0.4],
[0.5, 1.0, 0.9, -0.8],
[0.3, 0.9, 1.0, 0.1],
[-0.4, -0.8, 0.1, 1.0]],
index=df_index, columns=df_index)
I want a function extract_vals that can return all values related to elements in the same group, EXCEPT for the diagonal AND elements must not be double-counted. Here are two examples of the desired behavior (order does not matter):
A_vals = extract_vals("A", df) # [0.5, 0.3, -0.4, 0.9, -0.8]
B_vals = extract_vals("B", df) # [0.3, 0.9, 0.1, -0.4, -0.8]
My question is similar to this question on SO, but my situation is different because I am using a multi-index dataframe.
Finally, to make things more fun, please consider efficiency because I'll be running this many times on much bigger dataframes. Thanks very much!
EDIT:
Happy001's solution is awesome. I came up with a method myself based on the logic of extracting the elements where target is NOT in BOTH the rows and columns, and then extracting the lower triangle of those elements where target IS in BOTH the rows and columns. However, Happy001's solution is much faster.
First, I created a more complex dataframe to make sure both methods are generalizable:
import pandas as pd
import numpy as np
df_index = pd.MultiIndex.from_arrays(
[["A", "B", "A", "B", "C", "C"], [1, 2, 3, 4, 5, 6]], names=["group", "id"])
df = pd.DataFrame(
[[1.0, 0.5, 1.0, -0.4, 1.1, -0.6],
[0.5, 1.0, 1.2, -0.8, -0.9, 0.4],
[1.0, 1.2, 1.0, 0.1, 0.3, 1.3],
[-0.4, -0.8, 0.1, 1.0, 0.5, -0.2],
[1.1, -0.9, 0.3, 0.5, 1.0, 0.7],
[-0.6, 0.4, 1.3, -0.2, 0.7, 1.0]],
index=df_index, columns=df_index)
Next, I defined both versions of extract_vals (the first is my own):
def extract_vals(target, multi_index_level_name, df):
    # Extract entries where target is in the rows but NOT also in the columns
    target_in_rows_but_not_in_cols_vals = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) != target]
    # Extract entries where target is in the rows AND in the columns
    target_in_rows_and_cols_df = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) == target]
    # np.bool was removed in NumPy 1.24; the built-in bool works everywhere
    mask = np.triu(np.ones(target_in_rows_and_cols_df.shape), k = 1).astype(bool)
    vals_with_nans = target_in_rows_and_cols_df.where(mask).values.flatten()
    target_in_rows_and_cols_vals = vals_with_nans[~np.isnan(vals_with_nans)]
    # Append both arrays of extracted values
    vals = np.append(target_in_rows_but_not_in_cols_vals, target_in_rows_and_cols_vals)
    return vals
def extract_vals2(target, multi_index_level_name, df):
    # Get indices for what you want to extract and then extract all at once
    coord = [[i, j] for i in range(len(df)) for j in range(len(df)) if i < j and (
        df.index.get_level_values(multi_index_level_name)[i] == target or (
            df.columns.get_level_values(multi_index_level_name)[j] == target))]
    return df.values[tuple(np.transpose(coord))]
I checked that both functions returned output as desired:
# Expected values
e_A_vals = np.sort([0.5, 1.0, -0.4, 1.1, -0.6, 1.2, 0.1, 0.3, 1.3])
e_B_vals = np.sort([0.5, 1.2, -0.8, -0.9, 0.4, -0.4, 0.1, 0.5, -0.2])
e_C_vals = np.sort([1.1, -0.9, 0.3, 0.5, 0.7, -0.6, 0.4, 1.3, -0.2])
# Sort because order doesn't matter
assert np.allclose(np.sort(extract_vals("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals("C", "group", df)), e_C_vals)
assert np.allclose(np.sort(extract_vals2("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals2("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals2("C", "group", df)), e_C_vals)
And finally, I checked speed:
## Test speed
import time
# Method 1
start1 = time.time()
for ii in range(10000):
    out = extract_vals("C", "group", df)
elapsed1 = time.time() - start1
print(elapsed1) # 28.5 sec
# Method 2
start2 = time.time()
for ii in range(10000):
    out2 = extract_vals2("C", "group", df)
elapsed2 = time.time() - start2
print(elapsed2) # 10.9 sec
I don't assume df has the same columns and index. (Of course they can be the same).
import numpy as np
def extract_vals(group_label, df):
    # Keep cells above the diagonal whose row or column belongs to the group
    coord = [[i, j] for i in range(len(df)) for j in range(len(df))
             if i < j and (df.index.get_level_values('group')[i] == group_label
                           or df.columns.get_level_values('group')[j] == group_label)]
    return df.values[tuple(np.transpose(coord))]
print(extract_vals('A', df))
print(extract_vals('B', df))
result:
[ 0.5 0.3 -0.4 0.9 -0.8]
[ 0.3 -0.4 0.9 -0.8 0.1]
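The same selection rule (cells above the diagonal whose row or column belongs to the group) can also be written with boolean masks instead of a Python-level double loop. The function name extract_vals_vectorized is hypothetical, and the sketch has not been timed against the versions above, so treat any speed benefit as an assumption.

```python
import numpy as np
import pandas as pd

df_index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], [1, 2, 3, 4]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 0.3, -0.4],
     [0.5, 1.0, 0.9, -0.8],
     [0.3, 0.9, 1.0, 0.1],
     [-0.4, -0.8, 0.1, 1.0]],
    index=df_index, columns=df_index)

def extract_vals_vectorized(target, level, df):
    """Hypothetical mask-based variant of extract_vals."""
    rows = df.index.get_level_values(level).to_numpy() == target
    cols = df.columns.get_level_values(level).to_numpy() == target
    # True where the row OR the column of a cell belongs to the target group.
    touches_target = rows[:, None] | cols[None, :]
    # True strictly above the diagonal, which drops the diagonal and duplicates.
    above_diagonal = np.triu(np.ones(df.shape, dtype=bool), k=1)
    return df.to_numpy()[touches_target & above_diagonal]

A_vals = extract_vals_vectorized("A", "group", df)
B_vals = extract_vals_vectorized("B", "group", df)
print(A_vals)
print(B_vals)
```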
Is this what you want? Note that the slices below rely on the ordering of this particular 4x4, two-group example.
All elements above the diagonal:
In [139]: df.values[np.triu_indices(len(df), 1)]
Out[139]: array([ 0.5, 0.3, -0.4, 0.9, -0.8, 0.1])
A_vals:
In [140]: df.values[np.triu_indices(len(df), 1)][:-1]
Out[140]: array([ 0.5, 0.3, -0.4, 0.9, -0.8])
B_vals:
In [141]: df.values[np.triu_indices(len(df), 1)][1:]
Out[141]: array([ 0.3, -0.4, 0.9, -0.8, 0.1])
Source matrix:
In [142]: df.values
Out[142]:
array([[ 1. , 0.5, 0.3, -0.4],
[ 0.5, 1. , 0.9, -0.8],
[ 0.3, 0.9, 1. , 0.1],
[-0.4, -0.8, 0.1, 1. ]])
