I have an amplitude curve from x = 2000 to 5000 in 3000 steps and a data curve from x = 0 to 10000 in 50000 steps. Now I want to normalize the data (multiply it by the amplitude curve), but as you can see the two arrays are of unequal length and have different start points.
Is there any way of doing this without resizing one of the two? (all values outside the amplitude range can be zero)
You can normalize two arrays of unequal size, but you have to make a decision or two about what makes sense for your application.
Example code:
a1 = [1,2,3,4]
a2 = [20,30]
If I want to scale the values in a1 by a2, how should I do it?
pairwise by indices, discarding extra length
make copies of indices in a2 to pad its length
pad values in a2 with fixed values
interpolate values in a2 to create new data points, while adding to its length
Do what makes sense for your data.
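For the specific setup in the question (a shorter amplitude curve living inside the data's x-range, with zeros allowed outside), interpolation is probably the natural choice. A minimal sketch with numpy.interp, assuming both curves are sampled on known x-grids; the amplitude and data values below are placeholders:

import numpy as np

x_amp = np.linspace(2000, 5000, 3000)   # amplitude curve x-values
amp = np.hanning(3000)                  # placeholder amplitude values
x_data = np.linspace(0, 10000, 50000)   # data curve x-values
data = np.random.rand(50000)            # placeholder data values

# Resample the amplitude curve onto the data's x-grid;
# anything outside [2000, 5000] becomes 0, as the question allows
amp_resampled = np.interp(x_data, x_amp, amp, left=0.0, right=0.0)

normalized = data * amp_resampled       # same length as the data curve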
You said you don't want to resize the lists, so you'll probably have to iterate over both lists with a while loop, keeping track of an index for each array. Stop looping when you reach the end of one of the ranges.
You could also use the zip and map functions to do something like
>>> b = [2, 4, 6, 8]
>>> c = [1, 3, 5, 7, 9]
>>> list(map(lambda x: x[0] * x[1], zip(b, c[1:])))
[6, 20, 42, 72]
but I am not sure if that's something you "can" do (in Python 3, map returns an iterator, hence the list() call).
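If you instead want to pad the shorter list with a fixed value (one of the options listed above), itertools.zip_longest is a hedged alternative:

from itertools import zip_longest

b = [2, 4, 6, 8]
c = [1, 3, 5, 7, 9]

# Pad the shorter list with 0 instead of truncating the longer one
padded = [x * y for x, y in zip_longest(b, c, fillvalue=0)]
# [2, 12, 30, 56, 0]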
You can kind of do this with pandas if you're smart about how you define your row and column labels. When you multiply the dataframes, pandas will align the data where the column and row labels match. Values where the labels do not match will be set to NaN. Consider the following example:
import numpy as np
import pandas

# every other step
df1 = pandas.DataFrame(
    data=np.arange(1, 10).reshape(3, 3),
    columns=[1, 3, 5],
    index=[0, 2, 4]
)
print(df1)
   1  3  5
0  1  2  3
2  4  5  6
4  7  8  9
# every step
df2 = pandas.DataFrame(
    data=np.arange(0, 25).reshape(5, 5),
    columns=[1, 2, 3, 4, 5],
    index=[0, 1, 2, 3, 4]
)
print(df2)
    1   2   3   4   5
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
print(df1 * df2)

       1   2      3   4      5
0    0.0 NaN    4.0 NaN   12.0   # <-- labels match
1    NaN NaN    NaN NaN    NaN
2   40.0 NaN   60.0 NaN   84.0   # <-- labels match
3    NaN NaN    NaN NaN    NaN
4  140.0 NaN  176.0 NaN  216.0   # <-- labels match
#      ^        ^          ^
#   columns 1, 3, and 5 are the matching labels
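If, as in the original question, values outside the overlap should be zero rather than NaN, you can fill them afterwards; a small follow-up to the example above:

result = (df1 * df2).fillna(0)  # non-matching labels become 0 instead of NaN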
Related
I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
Which gives:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 1
7 1
8 2
9 2
10 5
11 9
12 1
13 2
14 3
15 3
16 3
17 5
I want to replace duplicate values in the 'ID' column with the lowest, not yet used, value. However, consecutive identical values should be treated as a group and their values should be changed in the same way. For example: the first two values are both 1. These are consecutive, so they form a group, and the second '1' should therefore not be replaced with a '2'. Rows 14-16 are three consecutive threes. The value 3 has already been used to replace values above, so these threes need to be replaced. But they are consecutive, thus a group, and should get the same replacement value. The expected outcome below makes this clearer:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
def fun():
    v, dub = 1, set()
    d = yield
    while True:
        num = d.iloc[0]['ID']
        if num in dub:
            while v in dub:
                v += 1
            d.ID = num = v
        dub.add(num)
        d = yield d
f = fun()
next(f)
df = df.groupby([df['ID'].diff().ne(0).cumsum(), 'ID'], as_index=False).apply(lambda x: f.send(x))
print(df)
Output:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
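For reference, the grouping key df['ID'].diff().ne(0).cumsum() used above labels each run of consecutive identical values with its own number. A quick illustration, computed on the original df before it is overwritten:

key = df['ID'].diff().ne(0).cumsum()
print(key.tolist())
# [1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, 11, 11, 12]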
I came up with a way to get your outcome using for loops and dictionaries. It was more difficult than I expected, to be fair; the code can seem a bit complex at first, but it isn't. There is probably a way to do it with a few logical vectors, but I don't know it.
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
####################
diffs = np.diff(df.ID)           # differences ID(k) - ID(k-1)
uniq = sorted(pd.unique(df.ID))  # unique values in the ID column

# dict with a range of candidate numbers, all marked as unused
a = range(uniq[0], uniq[-1] * int(df.shape[0] / len(uniq)))  # range of values
d = {a[k]: False for k in range(len(a))}  # fill dict
d[df.ID[0]] = True  # mark the first value in the column as used

for m in range(1, df.shape[0]):
    # A value different from the previous one marks the
    # beginning of a new subgroup
    if diffs[m - 1] != 0:
        # Check if the value appeared earlier in the ID column
        if d[df.ID[m]] == True:
            # Get the lowest value which hasn't been used yet
            lowest = [k for k, v in d.items() if v == False][0]
            # Loop over the subgroup (whose internal differences are 0)
            for n in range(m + 1, df.shape[0]):
                if diffs[n - 1] != 0:  # a new subgroup starts
                    break              # so stop looping
            # Replace the subgroup with the lowest value
            df.ID[m:n] = lowest  # n is the final index of the subgroup
            # *Exception: if the last number is a subgroup by itself,
            # the previous for loop doesn't cover it
            if m == df.shape[0] - 1:
                df.ID[m] = lowest
        # Mark the value now in the ID column as used
        d[df.ID[m]] = True

print(df)
So what you want is to think of your ID column as a series of subgroups, or separate arrays, check different conditions on them, and then apply different operations. You can picture the column as a set of multiple arrays:
[1, 1 | 2 | 5, 5 | 6 | 1, 1 | 2, 2 | 5 | 9 | 1 | 2 | 3, 3, 3 | 5]
What you need to do is find the limits of those subgroups and check whether they meet certain conditions (1. not a previously seen number; 2. the lowest number we haven't used yet). We can find the subgroups by calculating the differences between each value and the previous one:
diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
We can track the conditions with a dictionary whose keys are the integers in the array (plus any larger values we might need) and whose values record whether we have used them yet (True or False).
To build it, we need the max value of the ID column. However, the dictionary must contain more numbers than appear in the column (in your example max(input) = 9 but max(output) = 12). You could pick the margin arbitrarily; I chose to estimate the proportion we could need from the number of rows and the number of unique values in the column (the last argument in a = range...).
uniq = sorted(pd.unique(df.ID)) # unique values in ID colums
# dict with a range of candidate numbers, all marked as unused
a = range(uniq[0], uniq[-1] * int(df.shape[0] / len(uniq)))
d = {a[k]: False for k in range(len(a))}
d[df.ID[0]] = True # Set first value in col as True
The last part of the code is a main for loop with some ifs and another for loop inside. It works as follows:
# 1. Loop over the ID column
# 2. Check if ID[m] differs from the previous value (diff != 0)
# 3. Check if the ID[m] value has already appeared in the ID column
# 4. Find the lowest unused value (the first key marked False in the dict)
#    and replace the subgroup in ID with it
# 5. Because of how step 4 is done, a last value that is a subgroup by
#    itself would be missed, so there is a small extra check for that case
# 6. Update the dict every time a new value shows up
I am sure there are many ways to shorten this code, but it works and should also work on larger dataframes under the same conditions.
Following the StackOverflow post Elegantly calculate mean of first three values of a list I have tweaked the code to find the maximum.
However, I also require to know the position/index of the max.
So the code below calculates the max value for the first 3 numbers and then the max value for the next 3 numbers and so on.
For example for a list of values [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]. The code below takes the first 3 values 6,3,7 and outputs the max as 7 and then for the next 3 values 4,6,9 outputs the value 9 and so on.
But I also want to find which position/index they are at, i.e. 7 is at position 2 and 9 at position 5. The final result would be [2, 5, 8, 11, 12, ...]. Any ideas on how to calculate the index? Thanks in advance.
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
output: test_data : [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]
output: [7, 9, 7, 7, 7, 7, 5]
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
index = [(np.argmax(test_data[i: i+3]) + i) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
print(index)
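Since the seed is fixed, the printed result is deterministic; with the question's data this gives (exact formatting may vary slightly across NumPy versions):

test_data: [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]
maxval:    [7, 9, 7, 7, 7, 7, 5]
index:     [2, 5, 8, 11, 12, 17, 18]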
Sample Data:
id cluster
1 3
2 3
3 3
4 3
5 1
6 1
7 2
8 2
9 2
10 4
11 4
12 5
13 6
What I would like to do is replace the largest cluster id with 0 and the second largest with 1 and so on and so forth. Output would be as shown below.
id cluster
1 0
2 0
3 0
4 0
5 2
6 2
7 1
8 1
9 1
10 3
11 3
12 4
13 5
I'm not quite sure where to start with this. Any help would be much appreciated.
The objective is to relabel groups defined in the 'cluster' column by the corresponding rank of that group's total value count within the column. We'll break this down into several steps:
Integer factorization. Find an integer representation where each unique value in the column gets its own integer. We'll start with zero.
We then need the counts of each of these unique values.
We need to rank the unique values by their counts.
We assign the ranks back to the positions of the original column.
Approach 1
Using Numpy's numpy.unique + argsort
TL;DR
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
(-c).argsort()[i]
Turns out, numpy.unique performs the task of integer factorization and counting values in one go. In the process, we get unique values as well, but we don't really need those. Also, the integer factorization isn't obvious. That's because per the numpy.unique function, the return value we're looking for is called the inverse. It's called the inverse because it was intended to act as a way to get back the original array given the array of unique values. So if we let
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
You'll see i looks like:
array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])
And if we do u[i], we get back the original df.cluster.values
array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])
But we are going to use it as integer factorization.
Next, we need the counts c
array([2, 3, 4, 2, 1, 1])
I'm going to propose the use of argsort, but it's confusing, so I'll try to show it:
np.row_stack([c, (-c).argsort()])
array([[2, 3, 4, 2, 1, 1],
[2, 1, 0, 3, 4, 5]])
What argsort does, in general, is put in the top spot (position 0) the position to draw from in the originating array.
# position 2
# is best
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# top spot
# from
# position 2
# position 1
# goes to
# pen-ultimate spot
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# pen-ultimate spot
# from
# position 1
What this allows us to do is slice this argsort result with our integer factorization to arrive at a remapping of the ranks. (Strictly speaking, argsort gives draw-from positions rather than ranks; the rank array in general is (-c).argsort().argsort(). The two coincide here because this particular permutation happens to be its own inverse.)
# i is
# [2 2 2 2 0 0 1 1 1 3 3 4 5]
# (-c).argsort() is
# [2 1 0 3 4 5]
# argsort
# slice
# \ / This is our integer factorization
# a i
# [[0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [3 3] <-- 3 is third position in argsort
# [3 3] <-- 3 is third position in argsort
# [4 4] <-- 4 is fourth position in argsort
# [5 5]] <-- 5 is fifth position in argsort
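A quick check that the slice reproduces the intended labels:

print((-c).argsort()[i])
# [0 0 0 0 2 2 1 1 1 3 3 4 5]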
We can then drop it into the column with pd.DataFrame.assign
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
Approach 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count values. The reason to use this approach is that NumPy's unique actually sorts the values in the midst of factorizing and counting; pandas.factorize does not. For larger data sets, big O is our friend, as this approach remains O(n) while the NumPy approach is O(n log n).
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
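As a side note, you can see the ordering difference between the two factorizations directly; a small hedged check on the same data:

i, u = pd.factorize(df.cluster.values)
print(u)                             # [3 1 2 4 5 6]  first-appearance order
print(np.unique(df.cluster.values))  # [1 2 3 4 5 6]  sorted order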
You can use groupby, transform, and rank:
df['cluster'] = df.groupby('cluster').transform('count')\
                  .rank(ascending=False, method='dense')\
                  .sub(1).astype(int)
Output:
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
By using category and value_counts
df.cluster.map((-df.cluster.value_counts()).astype('category').cat.codes)
Out[151]:
0 0
1 0
2 0
3 0
4 2
5 2
6 1
7 1
8 1
9 3
Name: cluster, dtype: int8
This isn't the cleanest solution but it does work. Feel free to suggest improvements:
valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)

count = 0
for i in valueCounts_sorted.index.values:
    temp = df[df.cluster == i].copy()
    temp["random"] = count
    idx = temp.index.values
    df.loc[idx, "cluster"] = temp.random.values
    count += 1
Let's say I have this array:
np.arange(9)
[0 1 2 3 4 5 6 7 8]
I would like to shuffle the elements with np.random.shuffle but certain numbers have to be in the original order.
I want that 0, 1, 2 have the original order.
I want that 3, 4, 5 have the original order.
And I want that 6, 7, 8 have the original order.
The number of elements in the array would be multiple of 3.
For example, some possible outputs would be:
[ 3 4 5 0 1 2 6 7 8]
[ 0 1 2 6 7 8 3 4 5]
But this one:
[2 1 0 3 4 5 6 7 8]
would not be valid, because 0, 1, 2 are not in the original order.
I think that maybe zip() could be useful here, but I'm not sure.
Short solution using numpy.random.shuffle and numpy.ndarray.flatten functions:
arr = np.arange(9)
arr_reshaped = arr.reshape((3,3)) # reshaping the input array to size 3x3
np.random.shuffle(arr_reshaped)
result = arr_reshaped.flatten()
print(result)
One possible random result:
[3 4 5 0 1 2 6 7 8]
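For any length that is a multiple of 3 (which the question guarantees), a hedged generalization is to let NumPy infer the number of rows:

arr_reshaped = arr.reshape((-1, 3))  # 9, 12, 15, ... elements all work
np.random.shuffle(arr_reshaped)      # shuffles whole rows, i.e. whole triples
result = arr_reshaped.flatten()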
Naive approach:
num_indices = len(array_to_shuffle) // 3 # use normal / in python 2
indices = np.arange(num_indices)
np.random.shuffle(indices)
shuffled_array = np.empty_like(array_to_shuffle)
cur_idx = 0
for idx in indices:
    shuffled_array[cur_idx:cur_idx+3] = array_to_shuffle[idx*3:(idx+1)*3]
    cur_idx += 3
Faster (and cleaner) option:
num_indices = len(array_to_shuffle) // 3 # use normal / in python 2
indices = np.arange(num_indices)
np.random.shuffle(indices)
tmp = array_to_shuffle.reshape([-1, 3])
tmp = tmp[indices, :]
shuffled_array = tmp.reshape([-1])  # flatten back to 1-D and keep the result
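Equivalently, a hedged one-liner that draws the row order with numpy.random.permutation:

shuffled_array = array_to_shuffle.reshape(-1, 3)[np.random.permutation(len(array_to_shuffle) // 3)].reshape(-1)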
I have a very large dataframe
in>> all_data.shape
out>> (228714, 436)
What I would like to do efficiently is multiply many of the columns together. I started with a for loop and a list of columns; the most efficient way I have found is
from itertools import combinations
newcolnames=list(all_data.columns.values)
newcolnames=newcolnames[0:87]
#make cross products (the columns I want to operate on are the first 87)
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1, c2)] = all_data[c1] * all_data[c2]
The problem, as one may guess, is that I have 87 columns, which gives on the order of 3800 new columns (yes, this is what I intended). Both my Jupyter notebook and IPython shell choke on this calculation. I need to figure out a better way to do this multiplication.
Is there a more efficient way to vectorize and/or process this? Perhaps using a NumPy array (my dataframe has been processed and now contains only numbers and NaNs; it started with categorical variables).
As you have mentioned NumPy in the question, that might be a viable option here, especially because you might want to work in the 2D space of NumPy instead of the 1D columnar processing of pandas. To start off, you can convert the dataframe to a NumPy array with a call to np.array, like so -
arr = np.array(df) # df is the input dataframe
Now, you can get the pairwise combinations of the column IDs and then index into the columns and perform column-wise multiplications and all of this would be done in a vectorized manner, like so -
idx = np.array(list(combinations(newcolnames, 2)))
out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
Sample run -
In [117]: arr = np.random.randint(0,9,(4,8))
...: newcolnames = [1,4,5,7]
...: for c1, c2 in combinations(newcolnames, 2):
...: print arr[:,c1] * arr[:,c2]
...:
[16 2 4 56]
[64 2 6 16]
[56 3 0 24]
[16 4 24 14]
[14 6 0 21]
[56 6 0 6]
In [118]: idx = np.array(list(combinations(newcolnames, 2)))
...: out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
...:
In [119]: out.T
Out[119]:
array([[16, 2, 4, 56],
[64, 2, 6, 16],
[56, 3, 0, 24],
[16, 4, 24, 14],
[14, 6, 0, 21],
[56, 6, 0, 6]])
Finally, you can create the output dataframe with proper column headers (if needed), like so -
>>> headers = ['{0}*{1}'.format(idx[i,0],idx[i,1]) for i in range(len(idx))]
>>> out_df = pd.DataFrame(out,columns = headers)
>>> df
0 1 2 3 4 5 6 7
0 6 1 1 6 1 5 6 3
1 6 1 2 6 4 3 8 8
2 5 1 4 1 0 6 5 3
3 7 2 0 3 7 0 5 7
>>> out_df
1*4 1*5 1*7 4*5 4*7 5*7
0 1 5 3 5 3 15
1 4 3 8 12 32 24
2 0 6 3 0 0 18
3 14 0 14 0 49 0
You can try the df.eval() method:
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1, c2)] = all_data.eval('{} * {}'.format(c1, c2))
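One caveat, regardless of which method computes the products: inserting ~3800 columns one at a time forces pandas to reallocate repeatedly. A hedged variant that collects the products in a dict and concatenates once may be gentler on memory (assuming pandas is imported as pd):

# Build all product columns first, then attach them in a single concat
products = {'{0}*{1}'.format(c1, c2): all_data[c1] * all_data[c2]
            for c1, c2 in combinations(newcolnames, 2)}
all_data = pd.concat([all_data, pd.DataFrame(products, index=all_data.index)], axis=1)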