Curve fitting by a sum of Gaussians with scipy - python

I'm doing bioinformatics work in which we map small RNAs onto mRNAs. We have the mapping coordinate of a protein on each mRNA, and we calculate the relative distance between the site where the protein binds the mRNA and the site bound by a small RNA.
I obtain the following dataset:
dist eff
-69 3
-68 2
-67 1
-66 1
-60 1
-59 1
-58 1
-57 2
-56 1
-55 1
-54 1
-52 1
-50 2
-48 3
-47 1
-46 3
-45 1
-43 1
0 1
1 2
2 12
3 18
4 18
5 13
6 9
7 7
8 5
9 3
10 1
13 2
14 3
15 2
16 2
17 2
18 2
19 2
20 2
21 3
22 1
24 1
25 1
26 1
28 2
31 1
38 1
40 2
When I plot the data, I see 3 peaks: one at around 3-4, another around 20, and a last one around -50.
I tried cubic spline interpolation, but it doesn't work very well for my data.
My idea was to do curve fitting with a sum of Gaussians: for example, in my case, estimate 3 Gaussian curves centered near 5, 20 and -50.
How can I do that?
I looked at scipy.optimize.curve_fit(), but how can I fit the curve over precise intervals?
And how can I add the curves together to get one single curve?

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import scipy.optimize

data = np.array([-69, 3, -68, 2, -67, 1, -66, 1, -60, 1, -59, 1,
                 -58, 1, -57, 2, -56, 1, -55, 1, -54, 1, -52, 1,
                 -50, 2, -48, 3, -47, 1, -46, 3, -45, 1, -43, 1,
                 0, 1, 1, 2, 2, 12, 3, 18, 4, 18, 5, 13, 6, 9,
                 7, 7, 8, 5, 9, 3, 10, 1, 13, 2, 14, 3, 15, 2,
                 16, 2, 17, 2, 18, 2, 19, 2, 20, 2, 21, 3, 22, 1,
                 24, 1, 25, 1, 26, 1, 28, 2, 31, 1, 38, 1, 40, 2])
x, y = data.reshape(-1, 2).T

def tri_norm(x, *args):
    # Sum of three scaled normal PDFs: means m1..m3, widths s1..s3, amplitudes k1..k3
    m1, m2, m3, s1, s2, s3, k1, k2, k3 = args
    ret = k1 * scipy.stats.norm.pdf(x, loc=m1, scale=s1)
    ret += k2 * scipy.stats.norm.pdf(x, loc=m2, scale=s2)
    ret += k3 * scipy.stats.norm.pdf(x, loc=m3, scale=s3)
    return ret

params = [-50, 3, 20, 1, 1, 1, 1, 1, 1]  # initial guesses: means, then sigmas, then amplitudes
fitted_params, _ = scipy.optimize.curve_fit(tri_norm, x, y, p0=params)

plt.plot(x, y, 'o')
xx = np.linspace(np.min(x), np.max(x), 1000)
plt.plot(xx, tri_norm(xx, *fitted_params))
plt.show()
>>> fitted_params
array([ -60.46845528,    3.801281  ,   13.66342073,   28.26485602,
          1.63256981,   10.31905367,  110.51392765,   69.11867159,
         63.2545624 ])
So you can see that your idea of a three-peaked function doesn't agree too well with your real data: the fitted means land at roughly -60, 4 and 14 rather than the -50, 3 and 20 you expected.
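If you want to pin each Gaussian to the region of its visually identified peak, one option (my suggestion, not part of the answer above) is curve_fit's bounds argument, which boxes each parameter into an interval. The window edges below are illustrative guesses, not fitted values:
lower = [-60,  0, 10, 0.1, 0.1, 0.1, 0, 0, 0]            # m1, m2, m3, s1..s3, k1..k3
upper = [-40, 10, 30, 20, 20, 20, np.inf, np.inf, np.inf]
p0 = [-50, 3, 20, 5, 2, 5, 10, 100, 20]                  # initial guess must lie inside the bounds
fitted_params, _ = scipy.optimize.curve_fit(tri_norm, x, y, p0=p0, bounds=(lower, upper))
With bounds given, curve_fit switches to a trust-region solver, so the means can no longer wander away from the peaks you picked.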


Populating a 2D array in python

I need to populate a 2D array whose shape is 3xN, where N is initially unknown. The code looks as follows:
import numpy as np
import random

nruns = 5
all_data = [[]]
for run in range(nruns):
    n = random.randint(1, 10)
    d1 = random.sample(range(0, 30), n)
    d2 = random.sample(range(0, 30), n)
    d3 = random.sample(range(0, 30), n)
    data_tmp = [d1, d2, d3]
    all_data = np.concatenate((all_data, data_tmp), axis=0)
This gives the following error:
ValueError Traceback (most recent call last)
<ipython-input-103-22af8f04e7c0> in <module>
10 d3 = random.sample(range(0, 30), n)
11 data_tmp = [d1, d2, d3]
---> 12 all_data = np.concatenate((all_data,data_tmp),axis=0)
13 print(np.shape(data_tmp))
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 4
Is there a way to do this without pre-allocating all_data? Note that in my application, the data will not be random, but generated inside the loop.
Many thanks!
You could store the data generated in each step of the for loop into a list and create the array when you are done.
In [298]: import numpy as np
     ...: import random

In [299]: nruns = 5
     ...: all_data = []

In [300]: for run in range(nruns):
     ...:     n = random.randint(1, 10)
     ...:     d1 = random.sample(range(0, 30), n)
     ...:     d2 = random.sample(range(0, 30), n)
     ...:     d3 = random.sample(range(0, 30), n)
     ...:     all_data.append([d1, d2, d3])

In [301]: all_data = np.hstack(all_data)
In [302]: all_data
Out[302]:
array([[13, 28, 14, 15, 11, 0, 0, 19, 6, 28, 14, 18, 1, 15, 4, 20,
9, 14, 15, 13, 27, 28, 25, 5, 7, 4, 10, 22, 12, 6, 23, 15,
0, 20, 14, 5, 13],
[10, 9, 23, 4, 25, 28, 17, 14, 3, 4, 5, 9, 7, 18, 23, 9,
14, 15, 25, 26, 29, 12, 21, 0, 5, 6, 11, 27, 13, 26, 22, 14,
6, 5, 7, 23, 0],
[13, 0, 7, 14, 29, 26, 12, 16, 13, 3, 9, 6, 11, 2, 19, 17,
28, 14, 25, 24, 3, 12, 22, 7, 23, 18, 5, 14, 0, 14, 15, 8,
3, 2, 26, 21, 16]])
See if this is what you need, i.e. populate along axis 1 instead of 0. Since all_data starts as [[], [], []], which has shape (3, 0), each (3, n) block can be concatenated onto it along axis 1.
import numpy as np
import random

nruns = 5
all_data = [[], [], []]
for run in range(nruns):
    n = random.randint(1, 10)
    d1 = random.sample(range(0, 30), n)
    d2 = random.sample(range(0, 30), n)
    d3 = random.sample(range(0, 30), n)
    data_tmp = [d1, d2, d3]
    all_data = np.concatenate((all_data, data_tmp), axis=1)
How about using np.random only:
nruns = 5
# set seed for repeatability, remove for randomness
np.random.seed(42)
# randomize the lengths for the runs
num_samples = np.random.randint(1,10, nruns)
# sampling with the total length
all_data = np.random.randint(0,30, (3, num_samples.sum()))
# or, if `range(0,30)` represents some population
# all_data = np.random.choice(range(0,30), (3,num_samples.sum()) )
print(all_data)
Output:
[[25 18 22 10 10 23 20 3 7 23 2 21 20 1 23 11 29 5 1 27 20 0 11 25
21 28 11 24 16 26 26]
[ 9 27 27 15 14 29 29 14 29 18 11 22 19 24 2 4 18 6 20 8 6 17 3 24
27 13 17 25 8 25 20]
[ 1 19 27 14 27 6 11 28 7 14 2 13 16 3 17 7 3 1 29 5 21 9 3 21
28 17 25 11 1 9 29]]
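If you later need the per-run (3, n_i) blocks back from that flat array, np.split on the cumulative run lengths recovers them (my addition, not part of the answer above):
# Split the concatenated (3, total) array back into one (3, n_i) block per run
blocks = np.split(all_data, np.cumsum(num_samples)[:-1], axis=1)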

NumPy - fast stable arg-sort of large array by frequency

I have a large 1D NumPy array a of any comparable dtype; some of its elements may be repeated.
How do I find sorting indexes ix that will stable-sort (stability in the sense described here) a by the frequencies of its values, in descending or ascending order?
I want to find the fastest and simplest way to do this. Maybe there is an existing standard numpy function to do it.
There is another related question here, but it asked specifically about removing duplicates, i.e. outputting only unique sorted values; I need all values of the original array, including duplicates.
I've coded my first attempt at the task below, but it is not the fastest (it uses a Python loop) and probably not the shortest/simplest possible form. The Python loop can be very expensive if the repetition of equal elements is low and the array is huge. It would also be nice to have a short function in NumPy that does all of this, if one exists (e.g. an imaginary np.argsort_by_freq()).
Try it online!
import numpy as np

np.random.seed(1)
hi, n, desc = 7, 24, True
a = np.random.choice(np.arange(hi), (n,), p = (
    lambda p = np.random.random((hi,)): p / p.sum()
)())
us, cs = np.unique(a, return_counts = True)
af = np.zeros(n, dtype = np.int64)
for u, c in zip(us, cs):
    af[a == u] = c
if desc:
    ix = np.argsort(-af, kind = 'stable') # Descending sort
else:
    ix = np.argsort(af, kind = 'stable') # Ascending sort

print('rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)')
print(' / sorted_freqs(4) / sorting_ix(5)')
print(np.stack((
    np.arange(n), a, af, a[ix], af[ix], ix,
), 0))
outputs:
rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)
/ sorted_freqs(4) / sorting_ix(5)
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 1 1 1 1 3 0 5 0 3 1 1 0 0 4 6 1 3 5 5 0 0 0 5 0]
[ 7 7 7 7 3 8 4 8 3 7 7 8 8 1 1 7 3 4 4 8 8 8 4 8]
[ 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 5 5 5 5 3 3 3 4 6]
[ 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 4 4 4 4 3 3 3 1 1]
[ 5 7 11 12 19 20 21 23 0 1 2 3 9 10 15 6 17 18 22 4 8 16 13 14]]
I might be missing something, but it seems that with a Counter you can then sort the indexes of each element according to the count of that element's value, using the element value and then the index to break ties. For example:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], v, i) for i, v in enumerate(a)]
t.sort()
print([v[2] for v in t])
t.sort(reverse=True)
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[23, 21, 20, 19, 12, 11, 7, 5, 15, 10, 9, 3, 2, 1, 0, 22, 18, 17, 6, 16, 8, 4, 14, 13]
If you want to maintain ascending order of indexes within groups with equal counts, you can just use a lambda function for the descending sort:
t.sort(key = lambda x:(-x[0],-x[1],x[2]))
print([v[2] for v in t])
Output:
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 14, 13]
If you want to maintain the ordering of elements in the order that they originally appeared in the array if their counts are the same, then rather than sort on the values, sort on the index of their first occurrence in the array:
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
idxs = {}
t = []
for i, v in enumerate(a):
    if v not in idxs:
        idxs[v] = i
    t.append((counts[v], idxs[v], i))
t.sort()
print([v[2] for v in t])
t.sort(key = lambda x:(-x[0], x[1], x[2]))
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 13, 14]
To sort according to count, and then position in the array, you don't need the value or the first index at all:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], i) for i, v in enumerate(a)]
t.sort()
print([v[1] for v in t])
t.sort(key = lambda x:(-x[0],x[1]))
print([v[1] for v in t])
This produces the same output as the prior code for the sample data. For your string array:
a = ['g', 'g', 'c', 'f', 'd', 'd', 'g', 'a', 'a', 'a', 'f', 'f', 'f',
'g', 'f', 'c', 'f', 'a', 'e', 'b', 'g', 'd', 'c', 'b', 'f' ]
This produces the output:
[18, 19, 23, 2, 4, 5, 15, 21, 22, 7, 8, 9, 17, 0, 1, 6, 13, 20, 3, 10, 11, 12, 14, 16, 24]
[3, 10, 11, 12, 14, 16, 24, 0, 1, 6, 13, 20, 7, 8, 9, 17, 2, 4, 5, 15, 21, 22, 19, 23, 18]
I just figured out what is probably a very fast solution for any dtype, using only numpy functions and no Python looping; it works in O(N log N) time. NumPy functions used: np.unique, np.argsort and array indexing.
Although it wasn't asked for in the original question, I implemented an extra flag equal_order_by_val. When it is False, array elements with the same frequency are kept as a stable range, so the output can look like c d d c d c (as in the dumps below), because that is the order in which the elements appear in the original array for equal frequencies. When the flag is True, such elements are additionally sorted by value, giving c c c d d d. In other words, with False we sort stably by the key freq alone; with True we sort by (freq, value) for ascending order and by (-freq, value) for descending order.
Try it online!
import string, math
import numpy as np

np.random.seed(0)
# Generating input data
hi, n, desc = 7, 25, True
letters = np.array(list(string.ascii_letters), dtype = np.object_)[:hi]
a = np.random.choice(letters, (n,), p = (
    lambda p = np.random.random((letters.size,)): p / p.sum()
)())

for equal_order_by_val in [False, True]:
    # Solving task
    us, ui, cs = np.unique(a, return_inverse = True, return_counts = True)
    af = cs[ui]
    sort_key = -af if desc else af
    if equal_order_by_val:
        shift_bits = max(1, math.ceil(math.log(us.size) / math.log(2)))
        sort_key = ((sort_key.astype(np.int64) << shift_bits) +
            np.arange(us.size, dtype = np.int64)[ui])
    ix = np.argsort(sort_key, kind = 'stable') # Do sorting itself
    # Printing results
    print('\nequal_order_by_val:', equal_order_by_val)
    for name, val in [
        ('i_col', np.arange(n)), ('original_a', a),
        ('freqs', af), ('sorted_a', a[ix]),
        ('sorted_freqs', af[ix]), ('sorting_ix', ix),
    ]:
        print(name.rjust(12), ' '.join([str(e).rjust(2) for e in val]))
outputs:
equal_order_by_val: False
i_col 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5 5 3 7 3 3 5 4 4 4 7 7 7 5 7 3 7 4 1 2 5 3 3 2 7
sorted_a f f f f f f f g g g g g a a a a c d d c d c b b e
sorted_freqs 7 7 7 7 7 7 7 5 5 5 5 5 4 4 4 4 3 3 3 3 3 3 2 2 1
sorting_ix 3 10 11 12 14 16 24 0 1 6 13 20 7 8 9 17 2 4 5 15 21 22 19 23 18
equal_order_by_val: True
i_col 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5 5 3 7 3 3 5 4 4 4 7 7 7 5 7 3 7 4 1 2 5 3 3 2 7
sorted_a f f f f f f f g g g g g a a a a c c c d d d b b e
sorted_freqs 7 7 7 7 7 7 7 5 5 5 5 5 4 4 4 4 3 3 3 3 3 3 2 2 1
sorting_ix 3 10 11 12 14 16 24 0 1 6 13 20 7 8 9 17 2 15 22 4 5 21 19 23 18
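The core of this answer fits in a few lines. Here is a minimal helper distilled from the code above; the name argsort_by_freq is a hypothetical stand-in for the imaginary np.argsort_by_freq() from the question:
import numpy as np

def argsort_by_freq(a, desc = True):
    # Map each element to the frequency of its value via the inverse index
    _, ui, cs = np.unique(a, return_inverse = True, return_counts = True)
    key = -cs[ui] if desc else cs[ui]
    return np.argsort(key, kind = 'stable')  # stable: ties keep original order

ix = argsort_by_freq(a)
print(a[ix])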

Find nearest index in one dataframe to another

I am new to Python and its libraries. I searched all the forums but could not find a proper solution. This is the first time posting a question here; sorry if I did something wrong.
So, I have two DataFrames like below containing X Y Z coordinates (UTM) and other features.
In [2]: a = {
...: 'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
...: 'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
...: 'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
...: }
...:
In [3]: b = {
...: 'X': [1, 8, 20, 7, 32],
...: 'Y': [6, 4, 17, 45, 32],
...: 'Z': [52, 12, 6, 8, 31],
...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
X Y Z
0 1 3 12
1 2 4 4
2 5 8 9
3 7 15 16
4 10 20 13
5 5 12 1
6 2 23 8
7 3 22 17
8 24 14 11
9 21 7 19
In [7]: print(df2)
X Y Z
0 1 6 52
1 8 4 12
2 20 17 6
3 7 45 8
4 32 32 31
I need to find the closest point (by distance) in df1 to each point of df2, and create a new DataFrame from the results.
So I wrote the code below, which actually finds the closest point to df2.iloc[0]:
In [8]: x = (
...: np.sqrt(
...: ((df1['X'].sub(df2["X"].iloc[0]))**2)
...: .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
...: .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
...: )
...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
X Y Z
3 7 15 16
So, I guess I need a loop to iterate through df2 and apply the above code to each row. As a result I need a new updated df1 containing all the closest points to each point of df2. But I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
...: (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
...: + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
...: + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
...: )
In [27]: dists
Out[27]:
array([[40.11234224, 7.07106781, 24.35159132, 42.61455151, 46.50806382],
[48.05205511, 10. , 22.29349681, 41.49698784, 49.12229636],
[43.23193264, 5.83095189, 17.74823935, 37.06750599, 42.29657197],
[37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
[42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
[51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
[47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
[38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
[47.68647607, 18.89444363, 7.07106781, 35.48239 , 28.0713377 ],
[38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
X Y Z
3 7 15 16
2 5 8 9
8 24 14 11
6 2 23 8
8 24 14 11
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
Performance Note
Note that if you are just trying to get the closest value, there's no need to take the square root, as this is a costly operation compared to addition, subtraction, and powers, and sorting on dist**2 is still valid.
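A sketch of that shortcut, reusing the broadcasting pattern from above:
# Squared distance is monotone in distance, so argmin is unchanged
# and we skip the costly square root:
d2 = ((df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
      + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
      + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2)
nearest = df1.iloc[d2.argmin(axis=0)]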
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np

a = {
    'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
    'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
    'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
    'X': [1, 8, 20, 7, 32],
    'Y': [6, 4, 17, 45, 32],
    'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)

dist = lambda dx, dy, dz: np.sqrt(dx**2 + dy**2 + dz**2)

def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = np.where(darr == np.amin(darr))[0][0]
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]

df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
X Y Z closest
0 1 6 52 (7, 15, 16)
1 8 4 12 (5, 8, 9)
2 20 17 6 (24, 14, 11)
3 7 45 8 (2, 23, 8)
4 32 32 31 (24, 14, 11)
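As a small variant (my suggestion, not part of the answer above), darr is a pandas Series, so Series.idxmin can replace the np.where lookup:
def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    # idxmin returns the index label of the smallest distance directly
    return tuple(df1.loc[darr.idxmin(), ['X', 'Y', 'Z']])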

ValueError: could not broadcast input array from shape (5) into shape (7)

In this code:
import pandas as pd
myj='{"columns":["tablename","alias_tablename","real_tablename","dbname","finalcost","columns","pri_col"],"index":[0,1],"data":[["b","b","vip_banners","openx","",["id","name","adlink","wap_link","ipad_link","iphone_link","android_link","pictitle","target","starttime","endtime","weight_limit","weight","introduct","isbutton","sex","tag","gomethod","showtype","version","warehouse","areaid","textpic","smallpicture","group","service_provider","channels","chstarttime","chendtime","tzstarttime","tzendtime","status","editetime","shownum","wap_version","ipad_version","iphone_version","android_version","showtime","template_id","app_name","acid","ab_test","ratio","ab_tset_type","acid_type","key_name","phone_models","androidpad_version","is_delete","ugep_group","author","content","rule_id","application_id","is_default","district","racing_id","public_field","editor","usp_expression","usp_group","usp_php_expression","is_pic_category","is_custom_finance","midwhitelist","is_freeshipping","resource_id","usp_property","always_display","pushtime","is_pmc","version_type","is_plan","loop_pic_frame_id","plan_personal_id","personal_id","is_img_auto","banner_type","ext_content"],"id"],["a","a","vip_adzoneassoc","openx","",["id","zone_id","ad_id","weight"],"id"]]}'
df=pd.read_json(myj, orient='split')
bl=['is_delete,status,author', 'endtime', 'banner_type', 'id', 'starttime', 'status,endtime','weight']
al= ['zone_id,ad_id', 'zone_id,ad_id,id', 'ad_id', 'id', 'zone_id']
#
#bl=['add_time', 'allot_time', 'create_time', 'end_pay_time', 'start_pay_time', 'order_status,update_time', 'order_type,order_status,add_time', 'order_type,order_status,end_pay_time', 'wms_flag,order_status,is_first,order_date', 'last_update_time', 'order_code', 'order_date', 'order_sn', 'parent_sn', 'id', 'user_id', 'wms_flag,order_date']
#al=['area_id', 'last_update_time', 'mobile', 'parent_sn', 'id', 'transport_number', 'parent_sn']
def get_index(row):
    print(row)
    if row.tablename == 'b':
        return bl
    else:
        return al
        # return ['is_delete,status,author', 'endtime', 'banner_type', 'id', 'starttime', 'status,endtime', 'weight']

df['index_cols'] = df.apply(get_index, axis=1)
I run into this error:
ValueError: could not broadcast input array from shape (5) into shape (7)
Instead, if I use the commented-out bl and al, everything runs fine.
Also, if I use
bl=['is_delete,status,author', 'endtime', 'banner_type', 'id', 'starttime', 'status,endtime']
it runs fine too. What's the problem?
In pandas-0.22.0, a list coming from the apply method can be used to construct a new dataframe when its length equals the number of columns in the initial dataframe.
For example:
>>> df = pd.DataFrame([range(100),range(100)])
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
1 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
You can return a list in apply and get a dataframe:
>>> df.apply(lambda x:(x+1).values.tolist(), axis=1)
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 1 2 3 4 5 6 7 8 9 10 ... 91 92 93 94 95 96 97 98 99 100
1 1 2 3 4 5 6 7 8 9 10 ... 91 92 93 94 95 96 97 98 99 100
but if the length does not match the number of columns:
>>> df.apply(lambda x:(x+1).values.tolist()[:99], axis=1)
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
we get a Series.
And if you return lists of different lengths, where one matches the number of columns and another does not (as in your case), the result depends on which row matches:
>>> df.apply(lambda x:[1] * 99 if x.name==0 else [0] * 100 , axis=1)
0 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Works ok.
And this one
>>> df.apply(lambda x:[1] * 100 if x.name==0 else [0] * 99 , axis=1)
raises an error.
In pandas-0.23 you get a Series either way:
>>> df.apply(lambda x:(x+1).values.tolist(), axis=1)
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
>>> df.apply(lambda x:(x+1).values.tolist()[:9], axis=1)
0 [1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [1, 2, 3, 4, 5, 6, 7, 8, 9]
This problem does not apply to tuples in pandas-0.22.0:
>>> df.apply(lambda x:(1,) * 9 if x.name==0 else (0,) * 10 , axis=1)
0 (1, 1, 1, 1, 1, 1, 1, 1, 1)
1 (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
You can use this fact in your case:
bl = ('is_delete,status,author', 'endtime', 'banner_type',
'id', 'starttime', 'status,endtime', 'weight')
al = ('zone_id,ad_id', 'zone_id,ad_id,id', 'ad_id', 'id', 'zone_id')
>>> df.apply(get_index, axis=1)
0 (is_delete,status,author, endtime, banner_type...
1 (zone_id,ad_id, zone_id,ad_id,id, ad_id, id, z...
dtype: object
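Applied to the question's code, a minimal sketch of that workaround (same bl and al as in the question, just converted to tuples):
def get_index(row):
    # Tuples are not broadcast into columns the way lists are in pandas-0.22.0
    return tuple(bl) if row.tablename == 'b' else tuple(al)

df['index_cols'] = df.apply(get_index, axis=1)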

Iterate over a matrix, sum over some rows and add the result to another array

Hi there, I have the following matrix:
[[ 47 43 51 81 54 81 52 54 31 46]
[ 35 21 30 16 37 11 35 30 39 37]
[ 8 17 11 2 5 4 11 9 17 10]
[ 5 9 4 0 1 1 0 3 9 3]
[ 2 7 2 0 0 0 0 1 2 1]
[215 149 299 199 159 325 179 249 249 199]
[ 27 49 24 4 21 8 35 15 45 25]
[100 100 100 100 100 100 100 100 100 100]]
I need to iterate over the matrix, summing all elements in rows 0, 1, 2, 3, 4 only.
example: I need
row_0_sum = 47+43+51+81+...+46
Furthermore, I need to store each row's sum in an array like this:
[row0_sum, row1_sum, row2_sum, row3_sum, row4_sum]
So far I have tried this code, but it's not doing the job:
mu = np.zeros(shape=(1, 6))

# get an average
def standardize_ratings(matrix):
    sum = 0
    for i, eli in enumerate(matrix):
        for j, elj in enumerate(eli):
            if(i < 5):
                sum = sum + matrix[i][j]
            if(j == elj.len - 1):
                mu[i] = sum
                sum = 0
                print "mu[i]="
                print mu[i]
This just gives me an error: numpy.int32 object has no attribute 'len'.
So can someone help me? What's the best way to do this, and which type of array in Python should I use to store the result? I'm new to Python but have done programming....
Thanks
Make your data, matrix, a numpy.ndarray object, instead of a list of lists, and then just do matrix.sum(axis=1).
>>> matrix = np.asarray([[ 47, 43, 51, 81, 54, 81, 52, 54, 31, 46],
[ 35, 21, 30, 16, 37, 11, 35, 30, 39, 37],
[ 8, 17, 11, 2, 5, 4, 11, 9, 17, 10],
[ 5, 9, 4, 0, 1, 1, 0, 3, 9, 3],
[ 2, 7, 2, 0, 0, 0, 0, 1, 2, 1],
[215, 149, 299, 199, 159, 325, 179, 249, 249, 199],
[ 27, 49, 24, 4, 21, 8, 35, 15, 45, 25],
[100, 100, 100, 100, 100, 100, 100, 100, 100, 100]])
>>> print matrix.sum(axis=1)
[ 540 291 94 35 15 2222 253 1000]
To get the first five rows from the result, you can just do:
>>> row_sums = matrix.sum(axis=1)
>>> rows_0_through_4_sums = row_sums[:5]
>>> print rows_0_through_4_sums
[540 291 94 35 15]
Or, you can alternatively sub-select only those rows to begin with and only apply the summation to them:
>>> rows_0_through_4 = matrix[:5,:]
>>> print rows_0_through_4.sum(axis=1)
[540 291 94 35 15]
Some helpful links will be:
NumPy for Matlab Users, if you are familiar with these things in Matlab/Octave
Slicing/Indexing in NumPy
