How to design an agg function for pandas groupby - Python

My DataFrame looks like this:
user, rating, f1, f2, f3, f4
20, 3, 0.1, 0, 3, 5
20, 4, 0.2, 3, 5, 2
18, 4, 0.6, 8, 7, 2
18, 1, 0.7, 9, 2, 7
I want to compute a profile for each user. For instance, for user 20 it should be
3*[0.1, 0, 3, 5] + 4*[0.2, 3, 5, 2]
i.e. the rating-weighted sum of f1 to f4.
How should I write an agg function to complete this task?
df.groupby('user').agg(....)

You can try this:
df.groupby('user').apply(lambda x : sum(x['rating'] * (x['f1']+x['f2']+x['f3']+x['f4'])))
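If you want the full per-feature profile (one weighted value each for f1 to f4, as in the expected output) rather than a single number, a minimal sketch along these lines should work; the column names simply mirror the sample data above:
import pandas as pd

df = pd.DataFrame({'user': [20, 20, 18, 18],
                   'rating': [3, 4, 4, 1],
                   'f1': [0.1, 0.2, 0.6, 0.7],
                   'f2': [0, 3, 8, 9],
                   'f3': [3, 5, 7, 2],
                   'f4': [5, 2, 2, 7]})

feature_cols = ['f1', 'f2', 'f3', 'f4']

# Weight each feature row by its rating, then sum the weighted rows per user.
profiles = df[feature_cols].mul(df['rating'], axis=0).groupby(df['user']).sum()
print(profiles)
# user 20 -> [1.1, 12.0, 29.0, 23.0], i.e. 3*[0.1, 0, 3, 5] + 4*[0.2, 3, 5, 2]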


Getting a list out of a nested list in Python

I am trying to get flat lists out of a nested list.
list_of_data = [{'id': 99,
                 'rocketship': {'price': [10, 10, 10, 10, 10],
                                'ytd': [1, 1, 1.05, 1.1, 1.18]}},
                {'id': 898,
                 'rocketship': {'price': [10, 10, 10, 10, 10],
                                'ytd': [1, 1, 1.05, 1.1, 1.18]}},
                {'id': 903,
                 'rocketship': {'price': [20, 20, 20, 10, 10],
                                'ytd': [1, 1, 1.05, 1.1, 1.18]}},
                {'id': 999,
                 'rocketship': {'price': [20, 20, 20, 10, 10],
                                'ytd': [1, 3, 4.05, 1.1, 1.18]}},
                ]
price, ytd = map(list, zip(*((list_of_data[i]['rocketship']['price'], list_of_data[i]['rocketship']['ytd']) for i in range(0, len(list_of_data)))))
My expected output is below (but I am getting something different):
price = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 10, 10, 20, 20, 20, 10, 10]
ytd = [1, 1, 1.05, 1.1, 1.18, 1, 1, 1.05, 1.1, 1.18, 1, 1, 1.05, 1.1, 1.18, 1, 3, 4.05, 1.1, 1.18]
But, I am getting this:
price
Out[19]:
[[10, 10, 10, 10, 10],
[10, 10, 10, 10, 10],
[20, 20, 20, 10, 10],
[20, 20, 20, 10, 10]]
Try this:

Update: thanks @shawn caza.

Performance test for 100,000 loops:
shawncaza's answer: 0.10945558547973633 seconds
my answer with the get method: 0.1443953514099121 seconds
my answer with the square bracket method: 0.10936307907104492 seconds
list_of_data = [{'id': 99,
                 'rocketship': {'price': [10, 10, 10, 10, 10],
                                'ytd': [1, 1, 1.05, 1.1, 1.18]}},
                {'id': 898,
                 'rocketship': {'price': [10, 10, 10, 10, 10],
                                'ytd': [1, 1, 1.05, 1.1, 1.18]}},
                {'id': 903,
                 'rocketship': {'price': [20, 20, 20, 10, 10],
                                'ytd': [1, 1, 1.05, 1.1, 1.18]}},
                {'id': 999,
                 'rocketship': {'price': [20, 20, 20, 10, 10],
                                'ytd': [1, 3, 4.05, 1.1, 1.18]}},
                ]
price = []
ytd = []
for i in list_of_data:
    price.extend(i['rocketship']['price'])
    ytd.extend(i['rocketship']['ytd'])
print(price)
print(ytd)
>>> [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 10, 10, 20, 20, 20, 10, 10]
>>> [1, 1, 1.05, 1.1, 1.18, 1, 1, 1.05, 1.1, 1.18, 1, 1, 1.05, 1.1, 1.18, 1, 3, 4.05, 1.1, 1.18]
Using a list comprehension:
price, ytd = ([i for item in list_of_data for i in item["rocketship"]["price"]],
              [i for item in list_of_data for i in item["rocketship"]["ytd"]])
Output
price: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 10, 10, 20, 20, 20, 10, 10]
ytd: [1, 1, 1.05, 1.1, 1.18, 1, 1, 1.05, 1.1, 1.18, 1, 1, 1.05, 1.1, 1.18, 1, 3, 4.05, 1.1, 1.18]
I traded a bit of readability for performance here:
import functools
tuples = ((item['rocketship']['price'], item['rocketship']['ytd']) for item in list_of_data)
price, ytd = functools.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), tuples, ([], []))
I tried to keep things in a single loop and use a generator to optimize memory use. But if the data is big, the resulting price and ytd lists will be big too; hopefully you have already thought about that.
Update:
Thanks to @j1-lee's performance test, I redid the code as follows:
import functools

def extend_list(a, b):
    a.extend(b)
    return a

tuples = ((item['rocketship']['price'], item['rocketship']['ytd'])
          for item in list_of_data)

price, ytd = map(
    list,
    functools.reduce(
        lambda a, b: (extend_list(a[0], b[0]), extend_list(a[1], b[1])),
        tuples,
        ([], [])
    )
)
This reduces the execution time from 45.556 s to 0.096 s. My best guess is that the + operator creates a new list out of the two old lists, which requires copying both into the new one, so it goes like:
list(4) + list(4) = list(8) # 8 copies
list(8) + list(4) = list(12) # 12 copies
list(12) + list(4) = list(16) # 16 copies
...
Using .extend() only needs to copy the new list onto the end of the existing one, so it should be faster:
list(4).extend(list(4)) = list(8) # 4 copies
list(8).extend(list(4)) = list(12) # 4 copies
list(12).extend(list(4)) = list(16) # 4 copies
...
It would be better if someone could point to the specific documentation for this, though.
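As a rough way to check the difference empirically, a small timeit sketch like this (chunk count and sizes are arbitrary) compares repeated + against extend:
import timeit

def concat_plus(chunks):
    # '+' builds a brand-new list each time, re-copying everything accumulated so far.
    out = []
    for chunk in chunks:
        out = out + chunk
    return out

def concat_extend(chunks):
    # extend() copies only the new chunk onto the end of the existing list.
    out = []
    for chunk in chunks:
        out.extend(chunk)
    return out

chunks = [[0] * 5 for _ in range(2000)]
print(timeit.timeit(lambda: concat_plus(chunks), number=10))
print(timeit.timeit(lambda: concat_extend(chunks), number=10))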
Perform a list comprehension and flatten your result.
ytd = sum([d['rocketship']['ytd'] for d in list_of_data], [])
price = sum([d['rocketship']['price'] for d in list_of_data], [])
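Note that sum() with a list start value concatenates with + under the hood, so it re-copies the accumulated list on every step; it is fine for small inputs like this one, but for many sublists an extend()-based or itertools.chain approach scales better.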
Instead of passing the list function to your map, you could pass itertools.chain.from_iterable to merge all the individual lists. Then you can call list() afterwards to turn each resulting iterator into a list.
import itertools
price_gen, ytd_gen = map(itertools.chain.from_iterable,
                         zip(*((i['rocketship']['price'], i['rocketship']['ytd'])
                               for i in list_of_data)))
price = list(price_gen)
ytd = list(ytd_gen)
However, creating separate generators for each dataset actually seems to be much faster, roughly 7x faster in my test.
import itertools
price_gen = itertools.chain.from_iterable(d['rocketship']['price'] for d in list_of_data)
ytd_gen = itertools.chain.from_iterable(d['rocketship']['ytd'] for d in list_of_data)
price = list(price_gen)
ytd = list(ytd_gen)
Maybe it's the zip that slows things down?
cProfile comparison, using the small original dataset and looping the task 99,999 times, of the different solutions presented in this post:
ncalls tottime percall cumtime percall filename:lineno(function)
99999 0.132 0.000 1.344 0.000 (opt_khanh)
99999 0.469 0.000 0.714 0.000 (opt_shawn)
99999 0.142 0.000 0.535 0.000 (opt_Jaeyoon)
99999 0.267 0.000 0.413 0.000 (opt_ramesh)
99999 0.076 0.000 0.399 0.000 (opt_abdo)
I tried using a double comprehension. I don't know whether it's a good idea, as it could hurt code readability.
price = [
    item
    for sublist in [rocket["rocketship"]["price"] for rocket in list_of_data]
    for item in sublist
]
ytd = [
    item
    for sublist in [rocket["rocketship"]["ytd"] for rocket in list_of_data]
    for item in sublist
]
print(price)
print(ytd)

Why do these two numpy.divide operations give such different results?

I would like to correct the values in hyperspectral readings from a camera using the formula described over here:
the dark reference is subtracted from the captured data, and the result is divided by
(white reference minus dark reference).
In the original example the task is rather simple: the white and dark references have the same shape as the main data, so the formula is applied directly:
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr),
np.subtract(white_nparr, dark_nparr))
However, in my case the main data is much larger. The shapes are as follows:
$ white_nparr.shape, dark_nparr.shape, data_nparr.shape
((100, 640, 224), (100, 640, 224), (4300, 640, 224))
That's why I repeat the reference arrays:
white_nparr_rep = white_nparr.repeat(43, axis=0)
dark_nparr_rep = dark_nparr.repeat(43, axis=0)
return np.divide(np.subtract(data_nparr, dark_nparr_rep), np.subtract(white_nparr_rep, dark_nparr_rep))
And it works almost perfectly, as can be seen in the image on the left. But this approach requires an enormous amount of memory, so I decided to traverse the large array and replace the original values with corrected ones on the go instead:
ref_scale = dark_nparr.shape[0]
data_scale = data_nparr.shape[0]
for i in range(int(data_scale / ref_scale)):
    data_nparr[i*ref_scale:(i+1)*ref_scale] = np.divide(
        np.subtract(data_nparr[i*ref_scale:(i+1)*ref_scale], dark_nparr),
        np.subtract(white_nparr, dark_nparr)
    )
But that traversal approach gives me the ugliest of results, as can be seen on the right. I'd appreciate any idea that would help me fix this.
Note: I apply 20-times co-adding (mean of 20 readings) to obtain the images below.
EDIT: dtype of each array is as following:
$ white_nparr.dtype, dark_nparr.dtype, data_nparr.dtype
(dtype('float32'), dtype('float32'), dtype('float32'))
Your two methods don't agree because in the first method you used
white_nparr_rep = white_nparr.repeat(43, axis=0)
but the second method corresponds to using
white_nparr_rep = np.tile(white_nparr, (43, 1, 1))
If the first method is correct, you'll have to adjust the second method to act accordingly. Perhaps
for i in range(int(data_scale / ref_scale)):
    data_nparr[i*ref_scale:(i+1)*ref_scale] = np.divide(
        np.subtract(data_nparr[i*ref_scale:(i+1)*ref_scale], dark_nparr[i]),
        np.subtract(white_nparr[i], dark_nparr[i])
    )
A simple example with 2-d arrays that shows the difference between repeat and tile:
In [146]: z
Out[146]:
array([[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15]])
In [147]: np.repeat(z, 3, axis=0)
Out[147]:
array([[ 1, 2, 3, 4, 5],
[ 1, 2, 3, 4, 5],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[11, 12, 13, 14, 15],
[11, 12, 13, 14, 15]])
In [148]: np.tile(z, (3, 1))
Out[148]:
array([[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15]])
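As a side note on the memory issue: if the repeat-based pairing is indeed the intended one, a reshaped view plus broadcasting avoids materializing the big repeated reference arrays entirely. A sketch, assuming data_nparr is C-contiguous and its first dimension is an exact multiple of the reference's:
import numpy as np

reps = data_nparr.shape[0] // dark_nparr.shape[0]   # 43 in this case
# View the data as (100, 43, 640, 224): block k holds the 43 frames that pair with reference frame k.
data_view = data_nparr.reshape(dark_nparr.shape[0], reps, *data_nparr.shape[1:])
denom = white_nparr - dark_nparr                     # (100, 640, 224)
# Broadcasting over the second axis applies each reference frame to its block, in place.
data_view -= dark_nparr[:, np.newaxis]
data_view /= denom[:, np.newaxis]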
Off topic postscript: I don't know why the author of the page that you linked to writes NumPy expressions as (for example):
corrected_nparr = np.divide(
np.subtract(data_nparr, dark_nparr),
np.subtract(white_nparr, dark_nparr))
NumPy allows you to write that as
corrected_nparr = (data_nparr - dark_nparr) / (white_nparr - dark_nparr)
which looks much nicer to me.

Average a python dataframe column based on another column

I would like to take the average of column b when the corresponding value in column a is > 5.
I get the error message:
TypeError: '>' not supported between instances of 'str' and 'int'
import pandas as pd

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [0.05, 0.05, 0.05, 0.04, 0.03, 0, 0, 0, 0, 0.03]
d = {'col_a': a, 'col_b': b}
df = pd.DataFrame(d)
x = df['col_a' > 5]['col_b'].mean()
print(x)
df['col_a' > 5]
This tries to check whether the string 'col_a' is greater than 5, which is not a valid comparison between a str and an int (hence the TypeError).
You meant: df[df['col_a'] > 5]['col_b'].mean()
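An equivalent .loc-based variant of the same fix, shown on the sample data, selects and averages in a single indexing step:
import pandas as pd

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [0.05, 0.05, 0.05, 0.04, 0.03, 0, 0, 0, 0, 0.03]
df = pd.DataFrame({'col_a': a, 'col_b': b})

# Boolean mask on col_a, then the mean of the matching col_b values.
x = df.loc[df['col_a'] > 5, 'col_b'].mean()
print(x)  # mean of col_b over the rows where col_a > 5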

How to Recurrently Transpose a Series/List/Array

I have an array/list/pandas Series:
np.arange(15)
Out[11]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
What I want is:
[[0,1,2,3,4,5],
[1,2,3,4,5,6],
[2,3,4,5,6,7],
...
[10,11,12,13,14]]
That is, recurrently transpose this column into a 5-column matrix.
The reason is that I am doing feature engineering on a column of temperature data. I want to use the last 5 readings as features and the next one as the target.
What's the most efficient way to do that? My data is large.
If the array is formatted like this:
arr = np.array([1,2,3,4,5,6,7,8,....])
You could try it like this:
recurr_transpose = np.matrix([arr[i:i+5] for i in range(len(arr)-4)])
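If you are on NumPy 1.20 or newer, sliding_window_view builds the same windows without a Python-level loop; it returns a read-only view of overlapping windows, so call .copy() if you need to modify the result. A sketch using the np.arange(15) example, split into features and target:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(15)
windows = sliding_window_view(arr, 5)   # shape (11, 5): rows [0..4], [1..5], ..., [10..14]
X = windows[:-1]                        # each row: the previous 5 readings
y = arr[5:]                             # the next reading for each row of X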

Attempting to make a multi-column graph

I am trying to make a column graph where the y-axis is the mean grain size, the x-axis is the distance along the transect, and each series is a date and/or number value (it doesn't really matter).
I have been trying a few different methods in Excel 2010 but I cannot figure it out. My hope is that, let's say at the first location, 9, there will be three columns, and then at 12 there will be two columns. If it matters at all, let's say the total distance is 50. The result of this data should have 7 sets of columns along the transect/x-axis.
I have tried to do this using Python but my coding knowledge is close to nil. Here is my code so far:
import numpy as np
import matplotlib.pyplot as plt
grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371, 1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]
If someone happens to know of code to use, it would be very helpful. A recommendation for how to do this in Excel would be awesome too.
There's a plotting library called seaborn, built on top of matplotlib, that does this in one line. Your example:
import numpy as np
import seaborn as sns
from matplotlib.pyplot import show
grainsize = [0.7912, 0.513, 0.4644, 1.0852, 1.8515, 1.812, 6.371,
1.602, 1.0251, 5.6884, 0.4166, 24.8669, 0.5223, 37.387, 0.5159, 0.6727]
series = [2, 3, 4, 1, 4, 2, 3, 4, 1, 4, 1, 4, 1, 4, 1, 4]
distance = [9, 9, 9, 12, 12, 15, 15, 15, 17, 17, 25, 25, 32.5, 32.5, 39.5, 39.5]
ax = sns.barplot(x=distance, y=grainsize, hue=series, palette='muted')
ax.set_xlabel('distance')
ax.set_ylabel('grainsize')
show()
You will be able to do a lot even as a total newbie by editing the many examples in the seaborn gallery. Use them as training wheels: edit only one thing at a time and think about what changes.
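If you'd rather stay with plain pandas/matplotlib, a pivot_table plus a bar plot gives a similar grouped-column chart. A sketch, assuming the grainsize, series, and distance lists defined above:
import pandas as pd
from matplotlib.pyplot import show

df = pd.DataFrame({'distance': distance, 'series': series, 'grainsize': grainsize})
# One row per distance, one column per series; combinations that don't occur become gaps.
pivoted = df.pivot_table(index='distance', columns='series', values='grainsize')
ax = pivoted.plot.bar()
ax.set_xlabel('distance')
ax.set_ylabel('grainsize')
show()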
