Python sample without replacement and change population

If you have a list of 100 values that you want to split into 3 subsets in the ratio 2:1:1, what's the easiest way to do this in Python?
My current solution is to sample indices for each subset and then remove those values from the original list, i.e.
import random
my_list = [....]
num_A = 50
subset_A = []
num_B = 25
subset_B = []
num_C = 25
subset_C = []
a_indices = random.sample(range(len(my_list)), num_A)
for i in sorted(a_indices, reverse=True):  # pop the highest indices first, otherwise can get index out of range
    subset_A.append(my_list.pop(i))
b_indices = random.sample(range(len(my_list)), num_B)
for i in sorted(b_indices, reverse=True):  # pop the highest indices first, otherwise can get index out of range
    subset_B.append(my_list.pop(i))
subset_C = my_list[:]
assert len(subset_C) == num_C
However, I'm sure there's a much more elegant solution than this.

There's a much easier way. You can just shuffle the list and take slices.
import random
xs = [...]
random.shuffle(xs)
print(xs[:50], xs[50:75], xs[75:])
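The shuffle-and-slice idea can be generalised to any ratio. The following is only a sketch: the function name `split_by_ratio` and its `seed` parameter are made up for illustration, and it shuffles a copy so the original list is left untouched.

```python
import random

def split_by_ratio(items, ratio, seed=None):
    """Shuffle a copy of items and cut it into consecutive parts sized by ratio."""
    rng = random.Random(seed)  # hypothetical seed parameter, only for repeatable runs
    shuffled = list(items)
    rng.shuffle(shuffled)
    total = sum(ratio)
    parts, start = [], 0
    for i, r in enumerate(ratio):
        # The last part takes whatever remains, so rounding never drops an element.
        if i == len(ratio) - 1:
            end = len(shuffled)
        else:
            end = start + len(shuffled) * r // total
        parts.append(shuffled[start:end])
        start = end
    return parts

a, b, c = split_by_ratio(range(100), (2, 1, 1), seed=0)
print(len(a), len(b), len(c))  # 50 25 25
```

Because every element lands in exactly one slice, the subsets are disjoint by construction, which is what sampling without replacement was meant to achieve.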

Related

Finding similar numbers in a list and getting the average

I currently have the numbers above in a list. How would I go about grouping similar numbers (within 850 of each other) and replacing each group with its average, to make the list smaller?
For example, I have the list
l = [2000,2200,5000,2350]
In this list, I want to find the numbers that are similar, i.e. within n+500.
The numbers similar by n+500 are 2000, 2200 and 2350, so they should be added and divided by their count, which is 3, to find the mean. The mean then replaces the three numbers, so the list becomes l = [2183,5000].
The image above shows the actual numbers in the list. There I would like all numbers that are close by n+850 to be selected and their mean to be found.

It seems that you are looking for a clustering algorithm, something like K-means.
This algorithm is implemented in the scikit-learn package.
After you find your K means, you can count how many of your data points were clustered with each mean and do your computations.
However, it's not clear what K is in your case. You can try running the algorithm for several values of K until your constraint is met (the n+500 distance between the means).

You can use:
import numpy as np
l = np.array([2000,2200,5000,2350])
# find similar numbers (values that fall into the same 500-wide bin)
similar = l // 500
# for each bin, take the average and convert it to a plain integer (as in the desired output)
new_list = [int(np.average(l[similar == num])) for num in np.unique(similar)]
print(new_list)
Output:
[2183, 5000]
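One caveat with fixed 500-wide bins is that two close values on either side of a bin boundary (for example 2499 and 2501) end up in different groups. A possible alternative, sketched here under the assumption that "similar" means "no more than a threshold away from the previous value in sorted order", is to sort the values and start a new group whenever the gap is too large. The name `merge_close` is hypothetical.

```python
def merge_close(values, threshold):
    """Group sorted values whose gap to the previous value is <= threshold,
    then return the mean of each group."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= threshold:
            groups[-1].append(v)   # close enough to the previous value: same group
        else:
            groups.append([v])     # gap too large (or first value): start a new group
    return [sum(g) / len(g) for g in groups]

means = merge_close([2000, 2200, 5000, 2350], 500)
print([int(m) for m in means])  # [2183, 5000]
```

Note that this chains neighbours, so a long run of values each within the threshold of the next forms a single group even if its ends are far apart; whether that is acceptable depends on the data.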

Step 1:
list = [5620.77978515625,
7388.43017578125,
7683.580078125,
8296.6513671875,
8320.82421875,
8557.51953125,
8743.5,
9163.220703125,
9804.7939453125,
9913.86328125,
9940.1396484375,
9951.74609375,
10074.23828125,
10947.0419921875,
11048.662109375,
11704.099609375,
11958.5,
11964.8232421875,
12335.70703125,
13103.0,
13129.529296875,
16463.177734375,
16930.900390625,
17712.400390625,
18353.400390625,
19390.96484375,
20089.0,
34592.15625,
36542.109375,
39478.953125,
40782.078125,
41295.26953125,
42541.6796875,
42893.58203125,
44578.27734375,
45077.578125,
48022.2890625,
52535.13671875,
58330.5703125,
61597.91796875,
62757.12890625,
64242.79296875,
64863.09765625,
66930.390625]
Step 2:
seen = []  # to log used index pairs
diff_dic = {}  # to record indices and diff
for i, a in enumerate(list):
    for j, b in enumerate(list):
        if i != j and (i, j)[::-1] not in seen:
            seen.append((i, j))
            diff_dic[(i, j)] = abs(a - b)
keys = []
for ind, diff in diff_dic.items():
    if diff <= 850:
        keys.append(ind)
uniques_k = []  # to record unique indices
for pair in keys:
    for key in pair:
        if key not in uniques_k:
            uniques_k.append(key)
import numpy as np
list_arr = np.array(list)
nearest_avg = np.mean(list_arr[uniques_k])
list_arr = np.delete(list_arr, uniques_k)
list_arr = np.append(list_arr, nearest_avg)
list_arr
output:
array([ 5620.77978516, 34592.15625, 36542.109375, 39478.953125, 48022.2890625, 52535.13671875, 58330.5703125 , 61597.91796875, 62757.12890625, 66930.390625 , 20566.00205365])

You just need a conditional list comprehension like this:
import numpy as np
l = [2000,2200,5000,2350]
n = 2000
a = [x for x in l if n - 250 < x < n + 250]
Then you can average with
np.mean(a)
or whatever method you prefer.

python interpolation of some datapoints in dataset / merging lists

An .xlsx file contains logged machine data in a form that is not suitable for further calculations. The file contains depth data of a cutting tool, and each depth increment comes with further information such as pressure, rotational speed, forces and more.
As you can see, at some data points the resolution of the depth parameter (0.01) is insufficient, because other parameters are updated more often. So I want to interpolate between two consecutive depth data points.
It is important to know that this effect doesn't occur at every depth. When the cutting tool moves fast, everything is fine.
Here is also an example file.
So I just need to interpolate the depth values when the difference between two consecutive depth data points is 0.01.
I've tried the following approach:
1. Open as dataframe, rename, drop NaN, convert to list
2. Count identical depths in the list and transfer them to a dataframe
3. Calculate the delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0"
4. Divide delta depth by the number of time steps if 0.009 < delta depth < 0.011 --> interpolated depth
5. Create an empty list of lists, with the number of elements of each sublist corresponding to the duration
6. Pass values from interpolated depth to the respective sublists --> List 1
7. Transfer elements from delta_depth to sublists --> List 2
8. Merge List 1 and List 2
9. Flatten the lists
10. Replace the original depth values with the interpolated values in the dataframe
It looks like this, but at point 8 (merging) I don't get what I need:
import pandas as pd
from itertools import groupby
from itertools import zip_longest
import matplotlib.pyplot as plt
import numpy as np
#open and rename of some columns
df_raw=pd.read_excel(open('---.xlsx', 'rb'), sheet_name='---')
df_raw=df_raw.rename(columns={"---"})
#drop NaN
df_1=df_raw.dropna(subset=['depth'])
#convert to list
li = df_1['depth'].tolist()
#count identical depths in list and transfer them to dataframe
df_count = pd.DataFrame.from_records([[i, len([*group])] for i, group in groupby(li)])
df_count = df_count.rename(columns={0: "depth", 1: "duration"})
#calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0".
df_count["delta_depth"] = df_count["depth"].diff()
df_count=df_count.fillna(0)
#Divide delta depth by number of time steps if 0.009 < delta depth < 0.011
df_count["inter_depth"] = np.where(np.logical_and(df_count['delta_depth'] > 0.009, df_count['delta_depth'] < 0.011),df_count["delta_depth"] / df_count["duration"],0)
li2=df_count.values.tolist()
li_depth = df_count['depth'].tolist()
li_delta = df_count['delta_depth'].tolist()
li_duration = df_count['duration'].tolist()
li_inter = df_count['inter_depth'].tolist()
#empty list of lists with the number of elements of the sublist corresponding to the duration
out=[]
for number in li_duration:
    out.append(li_inter[:number])
#Pass values from interpolated depth to the respective sublists --> List 1
out = [[i]*j for i, j in zip(li_inter, [len(j) for j in out])]
#Transfer elements from delta_depth to sublists --> List 2
def extractDigits(lst):
    return list(map(lambda el:[el], lst))
lst=extractDigits(li_delta)
#Merge list 1 and list 2
list1 = out
list2 = lst
new_list = []
for l1, l2 in zip_longest(list1, list2, fillvalue=[]):
    new_list.append([y if y else x for x, y in zip_longest(l1, l2)])
new_list
After merging, the first element of each sublist is the original depth value, followed by the interpolated values. But the sublists should contain only interpolated values.
Now I have the following questions:
Is there in general a better approach to this problem?
How could I solve the problem with merging, or...
... find a way to override the wrong first elements in the sublists?
The desired result would look something like this.
Any help would be much appreciated, as I'm very inexperienced in Python and totally stuck.

I am sure someone could write something prettier, but I think this will work just fine:
Edited to some kinda messy scripting. I think this will do what you need it to though
_list_helper1 = df["Depth [m]"].to_list()
_list_helper1.insert(0, 0)
_list_helper1.insert(0, 0)
_list_helper1 = _list_helper1[:-2]
df["helper1"] = _list_helper1
_list = df["Depth [m]"].to_list()  # grab all depth values
_list.insert(0, 0)  # insert a value at the beginning to offset from original col
_list = _list[0:-1]  # delete the very last item
df["helper"] = _list  # add the list to a helper col which is now offset
df["delta depth"] = df["Depth [m]"] - df["helper"]  # subtract helper col from original
_id = 0
for i in range(len(df)):
    if df.loc[i, "Depth [m]"] == df.loc[i, "helper"]:
        break_val = df.loc[i, "Depth [m]"]
        break_val_2 = df.loc[i+1, "Depth [m]"]
        if break_val_2 == break_val:
            df.loc[i, "IDcol"] = _id
            df.loc[i+1, "IDcol"] = _id
        else:
            _id += 1
depth = df["IDcol"].to_list()
depth = list(dict.fromkeys(depth))
depth = [x for x in depth if str(x) != 'nan']
increments = []
for i in depth:
    _df = df.copy()
    _df = _df[_df["IDcol"] == i]
    _df.reset_index(inplace=True, drop=True)
    div_by = len(_df)
    increment = _df.loc[0, "helper"] - _df.loc[0, "helper1"]
    _df["delta depth"] = increment / div_by
    _increment = increment / div_by
    base_value = _df.loc[0, "Depth [m]"]
    for y in range(div_by):
        _df.loc[y, "Depth [m]"] = base_value + ((y + 1) * _increment)
    increments.append(_df)
df["IDcol"] = df["IDcol"].fillna("KEEP")
df = df[df["IDcol"] == "KEEP"]
increments.append(df)
df = pd.concat(increments)
df = df.fillna(0)
df = df[["index", "Depth [m]", "delta depth", "IDcol"]]  # and whatever other cols u want
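For comparison, the counting and dividing steps described in the question can be sketched without any helper columns. This is only an illustration under stated assumptions: a run of identical readings that follows a step of roughly 0.01 is replaced by a linear ramp from the previous depth up to the logged depth, and everything else is kept as logged. The function name and the `step` and `tol` parameters are hypothetical, and the sample depths are invented.

```python
from itertools import groupby

def interpolate_repeated_depths(depths, step=0.01, tol=0.001):
    """Spread each run of repeated depth readings linearly over the preceding step."""
    result = []
    previous = None
    for value, run in groupby(depths):
        n = len(list(run))  # how many consecutive samples share this depth
        delta = None if previous is None else value - previous
        if delta is not None and step - tol < delta < step + tol:
            # Ramp from the previous depth up to this depth across the n samples.
            result.extend(previous + delta * (j + 1) / n for j in range(n))
        else:
            # First run, or a larger jump (fast tool movement): keep readings as logged.
            result.extend([value] * n)
        previous = value
    return result

depths = [1.00, 1.01, 1.01, 1.02, 1.02, 1.02, 1.02, 1.10]  # invented sample data
interpolated = interpolate_repeated_depths(depths)
print(interpolated)
```

The result has the same length as the input, so it could be written back into the depth column of the dataframe directly.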

How do I use a while loop to access all the 2nd elements of lists which are the values stored in a dictionary?

If I have a dictionary like this, filled with similar lists, how can I use a while loop to extract a list of the second elements and print them?
racoona_valence={}
racoona_valence={"rs13283416": ["7:87345874365-839479328749+","BOBB7"],}
I need to print the part that says "BOBB7", i.e. the 2nd element of each list, for a larger dictionary. There are ten key-value pairs in it, so I am starting like this, but I'm unsure what to do next because none of the examples I can find relate to my problem:
n=10
gene_list = []
while n>0:
Any help greatly appreciated.

Well, there's a bunch of ways to do it depending on how well-structured your data is.
racoona_valence={"rs13283416": ["7:87345874365-839479328749+","BOBB7"], "rs13283414": ["7:87345874365-839479328749+","BOBB4"]}
# for loop over the keys
output = []
for key in racoona_valence.keys():
    output.append(racoona_valence[key][1])
print(output)
# for loop over the key/value pairs
other_output = []
for key, value in racoona_valence.items():
    other_output.append(value[1])
print(other_output)
# list comprehension over the values
list_comprehension = [value[1] for value in racoona_valence.values()]
print(list_comprehension)
# while loop: index into a list of the values with the counter
values = list(racoona_valence.values())
counter = 0
gene_list = []
while counter < len(values):
    gene_list.append(values[counter][1])
    counter += 1
print(gene_list)

Here is a list comprehension that does what you want:
second_element = [x[1] for x in racoona_valence.values()]
Here is a for loop that does what you want:
second_element = []
for value in racoona_valence.values():
    second_element.append(value[1])
Here is a while loop that does what you want:
# don't use a while loop to loop over iterables, it's a bad idea
i = 0
second_element = []
dict_values = list(racoona_valence.values())
while i < len(dict_values):
    second_element.append(dict_values[i][1])
    i += 1
Regardless of which approach you use, you can see the results by doing the following:
for item in second_element:
    print(item)
For the example that you gave, this is the output:
BOBB7

efficient way to split temporal Numpy vector automatically

I have a temporal vector as in the following image:
Numpy vector:
https://drive.google.com/file/d/0B4Jac-wNMDxHS3BnUzBoUkdmOGs/view?usp=sharing
I would like to know an efficient way to split the vector in numpy and extract the 5 chunks of the signal where the amplitude drops significantly.
I could separate them by using the amplitude 2.302 as the cut-off: each chunk starts at the index where the signal drops below this value and ends at the index where the signal goes back above it.
Is there an efficient way to do this in numpy?

So I've programmed the solution in pure python and lists:
import numpy as np
import matplotlib.pyplot as plt
vec = np.load('vector_numpy.npy')
# plt.plot(vec)
# plt.show()
print(vec.shape)
temporal_vec = []
flag = 0
flag_start = 0
flag_end = 0
all_vectors = []
all_index = []
count = -1
for element in vec:
    count = count+1
    #print(element)
    if element < 2.302:
        if flag_start == 0:
            all_index.append(count)  # index where the signal drops below the cut-off
            flag_start = 1
        temporal_vec.append(element)
        flag = 1
    if flag == 1:
        if element >= 2.302:
            if flag_start == 1:
                all_index.append(count)  # index where the signal rises above the cut-off again
                flag_start = 0
            all_vectors.append(temporal_vec)
            temporal_vec = []
            flag = 0
print(all_vectors)
for element in all_vectors:
    print(len(element))
    plt.plot(element)
    plt.show()
print(all_index)
Any fancier way in Numpy or better/shorter python code?
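One possible vectorised version, given as a sketch rather than a definitive answer: build a boolean mask of the samples below the cut-off, locate the positions where the mask changes, and slice between them. The function name `split_below` is hypothetical and the demo array is invented, since the linked file is not reproduced here.

```python
import numpy as np

def split_below(vec, threshold):
    """Return the runs of vec that lie below threshold, with (start, end) index pairs.
    The end index is exclusive, i.e. the first sample at or above the threshold."""
    below = vec < threshold
    # Pad with False so a run touching either end still gets both a start and an end.
    padded = np.concatenate(([False], below, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[::2], edges[1::2]
    chunks = [vec[s:e] for s, e in zip(starts, ends)]
    return chunks, list(zip(starts.tolist(), ends.tolist()))

demo = np.array([3.0, 3.0, 1.0, 1.5, 3.0, 2.0, 3.0, 0.5])  # invented sample data
chunks, index_pairs = split_below(demo, 2.302)
print(index_pairs)  # [(2, 4), (5, 6), (7, 8)]
```

Unlike the loop above, this also returns a run that is still below the cut-off when the vector ends.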

Problems with the zip function: lists that seem not iterable

I'm having some trouble using the zip function with four lists.
In particular, I'm getting the following error at line 36:
TypeError: zip argument #3 must support iteration
I've already read that this happens with non-iterable objects, but I'm passing it lists! If I use zip on only the first 2 lists it works perfectly: I have problems only with the last two.
Does anyone have an idea how to solve this? Many thanks!
import numpy
#setting initial values
R = 330
C = 0.1
f_T = 1/(2*numpy.pi*R*C)
w_T = 2*numpy.pi*f_T
n = 10
T = 1
w = (2*numpy.pi)/T
t = numpy.linspace(-2, 2, 100)
#making the lists c_k, w_k, a_k, phi_k
c_karray = []
w_karray = []
A_karray = []
phi_karray = []
#populating the lists
for k in range(1, n, 2):
    c_k = 2/(k*numpy.pi)
    w_k = k*w
    A_k = 1/(numpy.sqrt(1+(w_k)**2))
    phi_k = numpy.arctan(-w_k)
    c_karray.append(c_k)
    w_karray.append(w_k)
    A_karray.append(A_k)
    phi_karray.append(phi_k)
#making the function w(t)
w = []
#doing the sum for each t and populate w(t)
for i in t:
    w_i = ([(A_k*c_k*numpy.sin(w_k*i+phi_k)) for c_k, w_k, A_k, phi_k in zip(c_karray, w_karray, A_k, phi_k)])
    w.append(sum(w_i))

You probably mistyped the last 2 arguments to zip. They should be A_karray and phi_karray, because phi_k and A_k are single values.
My result for w is:
[-0.11741034896740517,
-0.099189027720991918,
-0.073206290274556718,
...
-0.089754003567358978,
-0.10828235682188027,
-0.1174103489674052]
HTH,
Germán.

I believe you want zip(c_karray, w_karray, A_karray, phi_karray). Additionally, you should produce this once, not on each iteration of the for loop.
Furthermore, you are not really making use of numpy. Try this instead of your loops.
d = numpy.arange(1, n, 2)
c_karray = 2/(d*numpy.pi)
w_karray = d*w
A_karray = 1/(numpy.sqrt(1+(w_karray)**2))
phi_karray = numpy.arctan(-w_karray)
w = (A_karray*c_karray*numpy.sin(w_karray*t[:,None]+phi_karray)).sum(axis=-1)
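To see that the broadcast expression computes the same sum as the corrected loop, both can be run side by side. The sketch below is self-contained and reuses the question's values for n and T; the name `omega` replaces the question's `w` for the angular frequency so that it is not later overwritten by the result.

```python
import numpy as np

n, T = 10, 1
omega = 2 * np.pi / T           # angular frequency (called w in the question)
t = np.linspace(-2, 2, 100)

k = np.arange(1, n, 2)          # odd harmonics 1, 3, 5, 7, 9
c_k = 2 / (k * np.pi)
w_k = k * omega
A_k = 1 / np.sqrt(1 + w_k**2)
phi_k = np.arctan(-w_k)

# Loop version: zip over the four coefficient arrays for every time value.
loop_result = [
    sum(A * c * np.sin(wk * ti + p) for c, wk, A, p in zip(c_k, w_k, A_k, phi_k))
    for ti in t
]

# Broadcast version: t[:, None] has shape (100, 1), the coefficients have shape (5,),
# so the product has shape (100, 5) and is summed over the harmonic axis.
vector_result = (A_k * c_k * np.sin(w_k * t[:, None] + phi_k)).sum(axis=-1)

print(np.allclose(loop_result, vector_result))  # True
```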
