Pandas groupby error: groupby() takes at least 3 arguments (2 given)

I have the following dataframe:
(cusid means the customer id; product means product id bought by the customer; count means the purchased count of this product.)
cusid product count
1521 30 2
18984 99 1
25094 1 1
2363 36 1
3316 21 1
19249 228 1
13220 78 1
1226 79 4
1117 112 2
I want to calculate the average count of each product that a customer buys.
It seems I need to group by product and cusid, take the mean of count, then group by product and take the mean again.
My expected output:
product mean(count)
30
99
1
36
Here is my code:
(df.groupby(['product','cusid']).mean().groupby('product')['count'].mean())
I got the error:
TypeError Traceback (most recent call last)
<ipython-input-43-0fac990bbd61> in <module>()
----> 1 (df.groupby(['product','cusid']).mean().groupby('product')['count'].mean())
TypeError: groupby() takes at least 3 arguments (2 given)
I have no idea how to fix it.

df.groupby(['cusid', 'product']).mean().reset_index().groupby('product')['count'].mean()
OUTPUT:
product
1 1
21 1
30 2
36 1
78 1
79 4
99 1
112 2
228 1
python version: 3.7.4
pandas version: 0.25.0
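A minimal sketch of the two-step aggregation above, run on the sample rows from the question. It assumes df is a regular pandas DataFrame and uses as_index=False in place of reset_index(), so that product is an ordinary column for the second groupby. In this sample every (cusid, product) pair occurs once, so the per-product mean simply equals count.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "cusid": [1521, 18984, 25094, 2363, 3316, 19249, 13220, 1226, 1117],
    "product": [30, 99, 1, 36, 21, 228, 78, 79, 112],
    "count": [2, 1, 1, 1, 1, 1, 1, 4, 2],
})

# Step 1: mean count per (customer, product); keys stay as ordinary columns
per_customer = df.groupby(["cusid", "product"], as_index=False)["count"].mean()

# Step 2: average those per-customer means for each product
result = per_customer.groupby("product")["count"].mean()
print(result)
```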

Related

Fastest way to access a dataframe cell by column values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function that returns pr_ss for a given pair, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it is slow:
def get_safety_stock(df,bk1,bk2):
    ##a function that returns the safety stock for any given (bk1,bk2)
    for index,row in df.iterrows():
        if (row["bk1_lvl0_id"]==bk1) and (row["bk2_lvl0_id"]==bk2):
            return int(row["pr_ss"])
            break
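If your dataframe has no duplicate rows for a given (bk1_lvl0_id, bk2_lvl0_id) pair, you can write the function as follows:
def get_safety_stock(df,bk1,bk2):
    # The boolean mask selects the matching row(s); .iloc[0] takes the first match by position
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this returns the first value in the resulting Series, which shouldn't be an issue if there are no duplicates in the data. (A plain [0] would look up the index label 0 rather than the first match, so it only works when the matching row happens to carry that label.) If you want all matches, just remove the .iloc[0] from the end and it will give you the whole Series. The function can be called as follows:
get_safety_stock(df, 1000, 3)
>>> 16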
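If the lookup runs many times, one option is to build an index on the two key columns once and then look up by label. This is only a sketch: it assumes the ids are integers and each (bk1_lvl0_id, bk2_lvl0_id) pair is unique, and the names pr_ss_by_key and get_safety_stock_indexed are made up for illustration.

```python
import pandas as pd

# Key columns and pr_ss copied from the sample rows in the question
df = pd.DataFrame({
    "bk1_lvl0_id": [1000, 10043, 1005, 1009, 102],
    "bk2_lvl0_id": [3, 3, 3, 3, 3],
    "pr_ss": [16, 65, 0, 325, 0],
})

# Build the lookup Series once; later lookups do not rebuild a boolean mask
pr_ss_by_key = df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"])["pr_ss"].sort_index()

def get_safety_stock_indexed(lookup, bk1, bk2):
    # Label-based lookup on the (bk1, bk2) MultiIndex; assumes the pair is unique
    return int(lookup.loc[(bk1, bk2)])

print(get_safety_stock_indexed(pr_ss_by_key, 1009, 3))
```

A missing pair raises a KeyError here, whereas the loop in the question returns None.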

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe, and the datetime is off: instead of showing the month, it gives back the last day of each month. Also, the station name appears only once per group rather than on every row, and the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that, to get a normal dataframe back, I should pass
as_index = False
to the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
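A self-contained sketch of the same approach, using only the ten sample rows shown in the question. It assumes Date is an ordinary column (not the index) and converts it to datetime first, since .dt.to_period needs a datetime dtype.

```python
import pandas as pd

# The ten rows visible in the question's sample
df = pd.DataFrame({
    "Date": ["2006-01-03", "2006-01-04", "2006-01-05", "2006-01-06", "2006-01-09",
             "2006-12-23", "2006-12-24", "2006-12-26", "2006-12-27", "2006-12-30"],
    "Value": [18, 12, 11, 10, 22, 47, 46, 35, 35, 28],
    "Station_Name": [2] * 5 + [45] * 5,
})
df["Date"] = pd.to_datetime(df["Date"])

# Group by station and by calendar month (a monthly Period such as 2006-01)
monthly = (
    df.groupby(["Station_Name", df["Date"].dt.to_period("M")])["Value"]
      .mean()
      .reset_index()
)
print(monthly)
```

Only (station, month) combinations that actually occur in the data appear in the result, so stations without observations in a given month produce no row for it.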

How to apply a function to some values in a dataframe column

I have this single line of code that subtracts 1 from month wherever the day column is below a given value.
data.loc[data.day<6, 'month'] -= 1
The above code works fine for the entire dataframe, but I only want to apply it to rows where the key column equals salary.
data
amount day month key
0 111627.94 1 6 salary
474 131794.61 31 10 salary
590 131794.61 29 11 salary
1003 102497.94 11 7 other_income
1245 98597.94 1 8 other_income
2446 5000.00 2 7 other_income
2447 10000.00 2 7 other_income
Expected output:
amount day month key
0 111627.94 1 5 salary
474 131794.61 31 10 salary
590 131794.61 29 11 salary
1003 102497.94 11 7 other_income
1245 98597.94 1 8 other_income
2446 5000.00 2 7 other_income
2447 10000.00 2 7 other_income
I have tried using this filter query:
data[[data.key == 'salary'].day<13, 'month'] -= 1
which resulted in the error below:
AttributeError Traceback (most recent call last)
<ipython-input-773-81b5a31a7b9f> in <module>
----> 1 test_df[[test_df.key == 'salary'].day<13, 'month'] -= 1
AttributeError: 'list' object has no attribute 'day'
I tried this as well:
new = data.loc[data.key == 'salary']
new.loc[new.day<6, 'month'] -= 1
This worked, but I want to do it in a single line rather than assigning a variable new.
You can combine multiple conditions into one Boolean index by using logical operators and surrounding each condition with parentheses:
data.loc[(data.day < 6) & (data.key == "salary"), "month"] -= 1
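As a quick check, here is that line applied to the sample rows from the question. The frame is rebuilt by hand, and the intermediate name mask is only there for readability.

```python
import pandas as pd

# Sample rows from the question, keeping the original index labels
data = pd.DataFrame(
    {
        "amount": [111627.94, 131794.61, 131794.61, 102497.94, 98597.94, 5000.00, 10000.00],
        "day": [1, 31, 29, 11, 1, 2, 2],
        "month": [6, 10, 11, 7, 8, 7, 7],
        "key": ["salary", "salary", "salary",
                "other_income", "other_income", "other_income", "other_income"],
    },
    index=[0, 474, 590, 1003, 1245, 2446, 2447],
)

# Both conditions must hold; each comparison needs its own parentheses
mask = (data["day"] < 6) & (data["key"] == "salary")
data.loc[mask, "month"] -= 1
print(data)
```

Only the first row satisfies both conditions, so its month changes from 6 to 5 and the other_income rows with small day values are left alone.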

Pandas groupby throws: TypeError: unhashable type: 'numpy.ndarray'

I have a dataframe as shown in the picture:
problem dataframe: attdf
I would like to group the data by Source class and Destination class, count the number of rows in each group and sum up Attention values.
While trying to achieve that, I am unable to get past this type error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-100-6f2c8b3de8f2> in <module>()
----> 1 attdf.groupby(['Source Class', 'Destination Class']).count()
8 frames
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
458 table = hash_klass(size_hint or len(values))
459 uniques, labels = table.factorize(values, na_sentinel=na_sentinel,
--> 460 na_value=na_value)
461
462 labels = ensure_platform_int(labels)
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
attdf.groupby(['Source Class', 'Destination Class'])
gives me a <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1e720f2080> which I'm not sure how to use to get what I want.
Dataframe attdf can be imported from : https://drive.google.com/open?id=1t_h4b8FQd9soVgYeiXQasY-EbnhfOEYi
Please advise.
@Adam.Er8 and @jezarael helped me with their inputs. The unhashable type error in my case was because of the datatypes of the columns in my dataframe.
Original df and df imported from csv
It turned out that the original dataframe had two object columns, which I was trying to use in the groupby; hence the unhashable type error. Importing the data into a new dataframe straight from a CSV fixed the datatypes, and I no longer get any type errors.
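The error message suggests that the cells in the grouping columns were NumPy arrays, which cannot be hashed. As an alternative to the CSV round trip, the sketch below converts such cells to plain integers before grouping. It is only an illustration: the data is made up (not the real attdf), and it assumes every cell is a one-element array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for attdf: the class columns hold one-element arrays
attdf = pd.DataFrame({
    "Source Class": [np.array([0]), np.array([0]), np.array([1])],
    "Destination Class": [np.array([1]), np.array([1]), np.array([0])],
    "Attention": [0.5, 0.25, 0.125],
})

# Unwrap each one-element array into a plain Python int so the keys are hashable
for col in ["Source Class", "Destination Class"]:
    attdf[col] = attdf[col].map(lambda a: int(np.asarray(a).item()))

summary = attdf.groupby(["Source Class", "Destination Class"]).agg(
    {"Attention": ["sum", "count"]}
)
print(summary)
```

Checking attdf.dtypes before grouping is a quick way to spot object columns like these.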
Try using .agg as follows:
import pandas as pd
attdf = pd.read_csv("attdf.csv")
print(attdf.groupby(['Source Class', 'Destination Class']).agg({"Attention": ['sum', 'count']}))
Output:
                                Attention
                                      sum count
Source Class Destination Class
0            0                 282.368908  1419
             1                   7.251101    32
             2                   3.361009    23
             3                  22.482438   161
             4                  14.020189    88
             5                  10.138409    75
             6                  11.377947    80
1            0                   6.172269    32
             1                 181.582437  1035
             2                   9.440956    62
             3                  12.007303    67
             4                   3.025752    20
             5                   4.491725    28
             6                   0.279559     2
2            0                   3.349921    23
             1                   8.521828    62
             2                 391.116034  2072
             3                   9.937170    53
             4                   0.412747     2
             5                   4.441985    30
             6                   0.220316     2
3            0                  33.156251   161
             1                  11.944373    67
             2                   9.176584    53
             3                 722.685180  3168
             4                  29.776050   137
             5                   8.827215    54
             6                   2.434347    16
4            0                  17.431855    88
             1                   4.195519    20
             2                   0.457089     2
             3                  20.401789   137
             4                 378.802604  1746
             5                   3.616083    19
             6                   1.095061     6
5            0                  13.525333    75
             1                   4.289306    28
             2                   6.424412    30
             3                  10.911705    54
             4                   3.896328    19
             5                 250.309764  1132
             6                   8.643153    46
6            0                  15.249959    80
             1                   0.150240     2
             2                   0.413639     2
             3                   3.108417    16
             4                   0.850280     6
             5                   8.655959    46
             6                 151.571505   686

How can I extract only numbers from this column?

Suppose you have a column in Excel with values like this. There are only 5500 numbers present, but the length shows as 5602, which means that 102 strings are present.
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? Here is my code in Python using pandas:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}',sle):
        return 1
    else:
        return 0
select['status'] = select['Selection No.'].apply(selection)
Now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy's np.isreal to select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
Result, specific to column SELECTIO:
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, use the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem in how you extract the column. You are using ['Selection No.'], but the name actually has a trailing space, as in ['Selection No. ']; that is the reason you get a KeyError when executing it. Try it and see!
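Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): tests whether the column value sle is contained IN the object returned by re.match. A match object "always has a boolean value of True", but re.match returns None when there's no match, and applying in to None raises the "argument of type 'NoneType' is not iterable" error.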
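A minimal sketch of a corrected version of the function, testing the result of the match directly instead of using in. The sample values are made up, re.fullmatch is used here so that the whole value has to match the pattern, and str(sle) is an added assumption so that values stored as integers are handled as well.

```python
import re
import pandas as pd

# Hypothetical sample mixing strings and integers
select = pd.DataFrame({"Selection No.": ["N NO", 37001, "37002", "Krishan", 42500]})

def selection(sle):
    # re.fullmatch returns a match object (truthy) or None (falsy)
    return 1 if re.fullmatch(r"[3-4][0-9]{4}", str(sle)) else 0

select["status"] = select["Selection No."].apply(selection)
print(select)
```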
I would suggest proceeding with the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required - use pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
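If the column mixes real numbers with strings, another option is pd.to_numeric with errors="coerce", which turns anything unparseable into NaN so it can be dropped. This is a sketch on made-up values; note that it also accepts numeric strings such as "37002".

```python
import pandas as pd

# Hypothetical column mixing header fragments, names, ints and numeric strings
s = pd.Series(["SELECTIO", "N NO", 37001, "37002", "Krishan", 37003],
              name="Selection No.")

# Non-numeric entries become NaN instead of raising an error
as_numbers = pd.to_numeric(s, errors="coerce")

# Keep only the rows that parsed, restored to integers
numbers_only = as_numbers.dropna().astype(int)
print(numbers_only)
```

The original index labels are preserved, so the result can still be aligned with the source rows.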
