The title may come across as confusing (honestly, I'm not quite sure how to summarize it in a sentence), so here is a fuller explanation:
I'm currently handling a DataFrame A with several attributes, and I used a .groupby().count() on the age column to get a tally of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values from an element-wise division. The division I would like to perform is each element's value divided by the sum of all the elements whose index is greater than or equal to that element's. In other words, for example, for an age of 3, it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation should apply to all the elements. Ideally, the outcome should be in a similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np

ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = np.sum(ages[k:])  # sum of this element and everything after it
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The coefficients initialization at 0 and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
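For completeness, here is the vectorized version assembled into a runnable form, using the same made-up counts as above:
import numpy as np

ages = np.asarray([316, 249, 221, 219, 262])
# reversed cumulative sum: coefficients[k] == ages[k:].sum()
coefficients = np.flip(np.cumsum(np.flip(ages)))
print(ages / coefficients)
# [0.24940805 0.26182965 0.31481481 0.45530146 1.        ]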
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
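As a quick check, here is a minimal sketch with made-up ages (assuming A has an integer age column, as in the question):
import pandas as pd

A = pd.DataFrame({'age': [1, 1, 2, 3, 3, 3, 5]})
A_sub = A['age'].value_counts().sort_index(ascending=False)
print((A_sub / A_sub.cumsum()).iloc[::-1])
# e.g. age 1 -> 2/7, age 2 -> 1/5, age 3 -> 3/4, age 5 -> 1/1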
No reason to use numpy, pandas already includes everything we need.
A_sub seems to return a Series where age is the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but can easily be modified to work with DataFrames.
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")
I have a simple question in pandas.
Let's say I have the following data:
d = {'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]}
df = pd.DataFrame(data=d)
df
How do I count the number of rows that fall between the minimum and maximum values in column a? In this particular case, the rows between 1 and 10, so the answer is 3.
Thanks
You can use this (it assumes a default RangeIndex, so that index distance equals row distance):
import numpy as np

diff = np.abs(df['a'].idxmin() - df['a'].idxmax()) - 1
IIUC, you could get the index of the min and max, and subtract 2:
out = len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2
output: 3
If there is a chance that the max comes before the min:
out = max(len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2, 0)
Alternatively, if the order of the min/max does not matter:
import numpy as np
out = np.ptp(df.reset_index()['a'].agg(['idxmin', 'idxmax']))-1
Update: index of the 2nd largest/smallest:
# second largest
df['a'].nlargest(2).idxmin()
# 2
# second smallest
df['a'].nsmallest(2).idxmax()
# 3
Since they are numbers, we can drop down into numpy and compute the result:
import numpy as np

arr = df.a.to_numpy()
np.abs(np.argmax(arr) - np.argmin(arr)) - 1
3
np.abs(df.loc[df.a == df.a.min()].index - df.loc[df.a == df.a.max()].index) - 1
So I want to multiply each row of a dataframe with a multiplier vector, and I am managing, but it looks ugly. Can this be improved?
import pandas as pd
import numpy as np
# original data
df_a = pd.DataFrame([[1,2,3],[4,5,6]])
print(df_a, '\n')
# multiplier vector
df_b = pd.DataFrame([2,2,1])
print(df_b, '\n')
# multiply by a list - it works
df_c = df_a*[2,2,1]
print(df_c, '\n')
# multiply by the dataframe - it works
df_c = df_a*df_b.T.to_numpy()
print(df_c, '\n')
"It looks ugly" is subjective, that said, if you want to multiply all rows of a dataframe with something else you either need:
a dataframe of a compatible shape (and compatible indices, as those are aligned before operations in pandas, which is why df_a*df_b.T would only work for the common index: 0)
a 1D vector, which in pandas is a Series
Using a Series:
df_a*df_b[0]
output:
0 1 2
0 2 4 3
1 8 10 6
Of course, it's better to define a Series directly if you don't really need a 2D container:
s = pd.Series([2,2,1])
df_a*s
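If you want to be explicit about which axis the Series aligns with, DataFrame.mul takes an axis argument; a small sketch:
import pandas as pd

df_a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
s = pd.Series([2, 2, 1])
# axis=1 aligns the Series index with df_a's columns (also the default for a Series)
print(df_a.mul(s, axis=1))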
Just for the beauty of it, you can use Einstein summation:
>>> np.einsum('ij,ji->ij', df_a, df_b)
array([[ 2, 4, 3],
[ 8, 10, 6]])
So I'm slicing my timeseries data, but for some of the columns, I need to be able to get the sum of the elements that were sliced over. For example, if you had
s = pd.Series([10, 30, 21, 18])
s = s[::-2]
I need to get the sum of a range of elements, so in this situation I would need
3 39
1 40
as the output. I've seen things like .cumsum(), but I can't find anything to sum a range of elements.
I don't quite understand what the first column represents, but the second column seems to be the sum result.
If you have got the correct slice, it's easy to get the sum with sum(), like this:
import numpy as np
import pandas as pd
data = np.arange(0, 10).reshape(-1, 2)
pd.DataFrame(data).iloc[2:].sum(axis=1)
Output is:
2 9
3 13
4 17
dtype: int64
The answer based only on your title would be df[-15:].sum(), but it seems you're looking to perform a calculation per slice.
To address this problem, pandas provides the window utilities. So, you can simply do:
s = pd.Series([10, 30, 21, 18])
s.rolling(2).sum()[::-2].astype(int)
which returns:
3 39
1 40
dtype: int64
Also, it's scalable, since you can replace 2 with any other value, and the .rolling method also works on DataFrame objects.
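For instance, a quick sketch with a step of 3 and made-up values; rolling(k).sum()[::-k] pairs each kept element with the sum of its k-element window:
import pandas as pd

s = pd.Series([10, 30, 21, 18, 5, 7])
k = 3
print(s.rolling(k).sum()[::-k].astype(int))
# 5    30
# 2    61
# dtype: int64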
I have a 2D array which looks like this:
array = [[23, 89, 4, 3, 0], [12, 73, 3, 5, 1], [7, 9, 12, 11, 0]]
where the last column is always 0 or 1 for each row. My aim is to calculate two means for column 0: one over the rows where the last column's value is 0, and one over the rows where it is 1.
e.g. for given sample array above:
mean 1: 15 (mean for 0 column for all the rows where last column is 0)
mean 2: 12 (mean for 0 column for all the rows where last column is 1)
I have tried this (where train is my input array's name):
mean_c1_0 = np.mean(train[:, 0])
variance_c1_0 = np.var(train[:, 0])
This gets me the mean and variance of all of column 0's values.
I can always introduce one more for loop and a couple of if conditions to keep checking the last column and only then add the corresponding values from column 0, but I am looking for a more efficient approach. Since I am new to Python, I was hoping there is a numpy function that can get this done.
Can you point me to any such documentation?
You can use numpy's boolean array filtering (see "How can I slice a numpy array by the value of the ith field?") and just get the mean that way. No loops needed.
import numpy
x = numpy.array([[23, 89, 4, 3, 0],[12, 73, 3, 5, 1],[7, 9, 12, 11, 0]])
numpy.mean(x[x[:, -1] == 1][:, 0])
numpy.mean(x[x[:, -1] == 0][:, 0])
You can try this (selecting column 0 after the row filter; taking np.mean of the filtered rows alone would average over all columns, not just column 0):
mean_of_zeros = np.mean(numpy_array[numpy_array[:, -1] == 0, 0])
mean_of_ones = np.mean(numpy_array[numpy_array[:, -1] == 1, 0])
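As a quick sanity check against the sample array, both expressions reproduce the expected means:
import numpy as np

x = np.array([[23, 89, 4, 3, 0], [12, 73, 3, 5, 1], [7, 9, 12, 11, 0]])
print(np.mean(x[x[:, -1] == 0, 0]))  # 15.0 -> (23 + 7) / 2
print(np.mean(x[x[:, -1] == 1, 0]))  # 12.0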
I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use bincount():
import numpy as np
ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
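Note that bincount expects the ids to be small non-negative integers. If your ids are arbitrary labels, one option (a sketch, not from the original answer) is to map them to positions with np.unique first:
import numpy as np

ids = np.array([10, 10, 10, 42, 42, 42, 7])  # arbitrary integer labels
data = np.array([20, 30, 0, 4, 8, 9, 18])
labels, inv = np.unique(ids, return_inverse=True)
sums = np.bincount(inv, weights=data)
print(labels, sums)  # [ 7 10 42] [18. 50. 21.]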
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default, groupby sorts the group keys, so I use the flag sort=False, which might improve speed for huge dataframes.
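If you only need the score column, you can select it before aggregating, which returns a Series instead of a one-column DataFrame; a small sketch using the df from above:
# returns a Series indexed by id
df.groupby('id', sort=False)['score'].sum()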
You can try using boolean operations (note: ids and data must be numpy arrays, otherwise ids == i is a plain list comparison rather than element-wise):
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more efficient than using np.any, but it will clearly have trouble if you have a very large number of unique ids along with a large overall data table.
If you're looking only for the sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies . It's among the fastest python/numpy grouping implementations around; see the speed comparison there.
Your sum operation there would look like:
res = aggregate(id, score)
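A hedged sketch using the ids/data from the earlier answers; this assumes numpy-groupies' documented aggregate(group_idx, a, func) signature:
import numpy as np
from numpy_groupies import aggregate  # pip install numpy-groupies

ids = np.array([1, 1, 1, 2, 2, 2, 3])
score = np.array([20, 30, 0, 4, 8, 9, 18])
print(aggregate(ids, score, func='sum'))  # expected: [ 0 50 21 18]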
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
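A sketch of the same computation end to end; as far as I can tell from the package docs, group_by(keys).sum(values) returns the unique keys together with the per-key sums:
import numpy as np
import numpy_indexed as npi  # pip install numpy-indexed

ids = np.array([1, 1, 1, 2, 2, 2, 3])
score = np.array([20, 30, 0, 4, 8, 9, 18])
unique_ids, sums = npi.group_by(ids).sum(score)
print(unique_ids, sums)  # expected: [1 2 3] [50 21 18]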
You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0  # integer zeros, keeping the dtype inferred from arange
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the group-by key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
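Since the original question asked for per-id sums, the same itertools approach can reduce each group directly; a minimal sketch summing the third field:
import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
# data must already be sorted by id for groupby to group correctly
for key, group in itertools.groupby(data, lambda x: x[0]):
    print(key, sum(row[2] for row in group))
# 1 50
# 2 4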