How to apply a function to some values in a dataframe column - python

I have this single line of code that subtracts 1 from month wherever the day column is below a given value:
data.loc[data.day<6, 'month'] -= 1
The above code works fine for the entire dataframe, but I only want to apply it to rows where the key column equals salary.
data
amount day month key
0 111627.94 1 6 salary
474 131794.61 31 10 salary
590 131794.61 29 11 salary
1003 102497.94 11 7 other_income
1245 98597.94 1 8 other_income
2446 5000.00 2 7 other_income
2447 10000.00 2 7 other_income
Expected output:
amount day month key
0 111627.94 1 5 salary
474 131794.61 31 10 salary
590 131794.61 29 11 salary
1003 102497.94 11 7 other_income
1245 98597.94 1 8 other_income
2446 5000.00 2 7 other_income
2447 10000.00 2 7 other_income
I have tried using this filter query:
data[[data.key == 'salary'].day<13, 'month'] -= 1
which resulted in the error below:
AttributeError Traceback (most recent call last)
<ipython-input-773-81b5a31a7b9f> in <module>
----> 1 test_df[[test_df.key == 'salary'].day<13, 'month'] -= 1
AttributeError: 'list' object has no attribute 'day'
I tried this as well:
new = data.loc[data.key == 'salary']
new.loc[new.day<6, 'month'] -= 1
This worked, but I want to do it in a single line rather than assigning an intermediate variable new.

You can combine multiple conditions into one Boolean index by using logical operators and surrounding each condition with parentheses:
data.loc[(data.day < 6) & (data.key == "salary"), "month"] -= 1
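As a minimal, self-contained sketch of the one-liner above, here it is run on a few rows copied from the question. The parentheses matter because & binds more tightly than the comparison operators. The name mask is only for illustration.

```python
import pandas as pd

# A few rows copied from the question's dataframe.
data = pd.DataFrame(
    {
        "amount": [111627.94, 131794.61, 102497.94, 98597.94],
        "day": [1, 31, 11, 1],
        "month": [6, 10, 7, 8],
        "key": ["salary", "salary", "other_income", "other_income"],
    },
    index=[0, 474, 1003, 1245],
)

# Both conditions must hold: early in the month AND a salary row.
mask = (data["day"] < 6) & (data["key"] == "salary")
data.loc[mask, "month"] -= 1
print(data)
```

Only row 0 satisfies both conditions, so row 1245 (day 1, but other_income) keeps its month.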

Related

Fastest way to access a dataframe cell by column values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function that returns pr_ss for a given pair, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it is slow:
def get_safety_stock(df,bk1,bk2):
    ## a function that returns the safety stock for any given (bk1,bk2)
    for index,row in df.iterrows():
        if (row["bk1_lvl0_id"]==bk1) and (row["bk2_lvl0_id"]==bk2):
            return int(row["pr_ss"])
            break
If your dataframe has no duplicate rows for a given (bk1_lvl0_id, bk2_lvl0_id) pair, you can write the function as follows:
def get_safety_stock(df,bk1,bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that it takes the first value in the Series by position (.iloc[0]), which shouldn't be an issue if there are no duplicates in the data. A plain [0] would look up the index label 0 instead and fail whenever the matching row's label is not 0. If you want all matches, remove the .iloc[0] from the end and you get the whole Series. This can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
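If the function is called many times, one option is to index the frame once and do label lookups afterwards. This is a sketch under the same no-duplicates assumption. The names ss_lookup and get_safety_stock_indexed are hypothetical, and the rows are copied from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "bk1_lvl0_id": [1000, 10043, 1005, 1009, 102],
        "bk2_lvl0_id": [3, 3, 3, 3, 3],
        "pr_ss": [16, 65, 0, 325, 0],
    }
)

# Build the lookup once; each (bk1, bk2) pair is assumed to be unique.
ss_lookup = df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"])["pr_ss"]


def get_safety_stock_indexed(lookup, bk1, bk2):
    # Hypothetical helper: label lookup on the two-level index.
    return int(lookup.loc[(bk1, bk2)])


print(get_safety_stock_indexed(ss_lookup, 1009, 3))
```

The cost of set_index is paid once, so this only helps when there are many lookups against the same frame.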

function calculation based on condition [duplicate]

This question already has an answer here:
Pandas Rolling Python to create new Columns
(1 answer)
Closed 2 years ago.
I am practising and am new to writing functions in Python with conditions:
Create a function that takes an integer m as input (m is between 2 and n, where n is the maximum number of rows). The function calculates ‘Sum A’ and ‘Sum B’ over the last m days. There will be no value for the first m days.
The original data:
V TP A B Sum A Sum B
3509 47.81
4862 48.406667 235353.2133
1810 49.26 89160.6
3824 49.263333 188382.9867
2209 47.386667 104677.1467
4558 45.573333 207723.2533
3832 44.396667 170128.0267
3778 43.75 165287.5
1005 44.64 44863.2
4047 43.76 177096.72
2201 44.383333 97687.71667 655447.7167 824912.6467
2507 45.156667 113207.7633 533302.2667 824912.6467
4392 44.4333 195151.2 444141.6667 1020063.847
3497 43.296667 151408.4433 255758.68 1171472.29
1181 43.07 50865.67 255758.68 1117660.813
1971 42.89 84536.19 255758.68 994473.75
4994 43.563333 217555.2867 473313.9667 824345.7233
2017 44.816667 90395.21667 563709.1833 659058.2233
2823 44.936667 126856.21 645702.1933 659058.2233
2774 45.13 125190.62 770892.8133 481961.5033
Original data, continued (Day column, one value per row above): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
My attempt so far is below; it raises KeyError: 'A':
curret_period = int(input("enter days: "))
sumA = curret_period * ((df["A"] < df["A"]),'')
sumB = curret_period * ((df["B"] >= df["B"]),'')
print(sumA)
print(sumB)
Is there a better way to create the function? I also wonder whether a skeleton like the one below is what I need:
def function_name():
    print()
Expected result when m= 10:
A B Sum A Sum B
0
1 235353.21333333332
2 89160.59999999999
3 188382.98666666663
4 104677.1466666667
5 207723.25333333333
6 170128.02666666667
7 165287.5
8 44863.200000000004
9 177096.72
10 97687.71666666666 655447.7167 824912.6467
11 113207.76333333334 533302.2667 824912.6467
12 195151.2 444141.6667 1020063.847
13 151408.4433333333 255758.68 1171472.29
14 50865.66999999999 255758.68 1117660.813
15 84536.19000000002 255758.68 994473.75
16 217555.28666666665 473313.9667 824345.7233
17 90395.21666666666 563709.1833 659058.2233
18 126856.21 645702.1933 659058.2233
19 125190.61999999998 770892.8133 481961.5033
Any suggestion? Thank you in advance.
You can use df.tail() to get the last m rows of the dataframe and then simply sum() each column.
The function below also checks that m is not greater than the length of the dataframe; without that check, tail(m) would just return the entire dataframe and the sums would cover all rows.
def sumof(df, m):
    if m <= len(df.index):
        rows = df.tail(m)
        print(rows['A'].sum())
        print(rows['B'].sum())
    else:
        print("'m' can not be greater than length of dataframe")
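The expected output in the question shows a value on every row from row m onward, which suggests a rolling window rather than a single total. Below is a minimal sketch under that assumption, with made-up numbers and a hypothetical helper name add_rolling_sums. It assumes A and B already exist as numeric columns, with 0 where a row contributes nothing and NaN on the first day, as the question's table appears to show. How A and B are derived from V and TP is not stated in the question.

```python
import numpy as np
import pandas as pd


def add_rolling_sums(df, m):
    # Hypothetical helper: rolling sum over the current row and the
    # m - 1 rows before it. Windows with fewer than m valid values
    # yield NaN, which leaves the first m rows empty here.
    out = df.copy()
    out["Sum A"] = out["A"].rolling(m).sum()
    out["Sum B"] = out["B"].rolling(m).sum()
    return out


# Made-up values, not the question's data; row 0 has no A/B value.
toy = pd.DataFrame(
    {
        "A": [np.nan, 1.0, 0.0, 3.0, 0.0, 5.0],
        "B": [np.nan, 0.0, 2.0, 0.0, 4.0, 0.0],
    }
)
result = add_rolling_sums(toy, 2)
print(result)
```

The linked duplicate ("Pandas Rolling Python to create new Columns") points in the same direction; whether the window should include the current row is worth checking against a few hand-computed values.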

Pandas groupby error: groupby() takes at least 3 arguments (2 given)

I have the dataframe as following:
(cusid means the customer id; product means product id bought by the customer; count means the purchased count of this product.)
cusid product count
1521 30 2
18984 99 1
25094 1 1
2363 36 1
3316 21 1
19249 228 1
13220 78 1
1226 79 4
1117 112 2
I want to calculate the average count of each product that each customer buys.
It seems I need to group by product and cusid first, then group by product and take the mean of count.
My expected output:
product mean(count)
30
99
1
36
Here is my code:
(df.groupby(['product','cusid']).mean().groupby('product')['count'].mean())
I got this error:
TypeError Traceback (most recent call last)
<ipython-input-43-0fac990bbd61> in <module>()
----> 1 (df.groupby(['product','cusid']).mean().groupby('product')['count'].mean())
TypeError: groupby() takes at least 3 arguments (2 given)
I have no idea how to fix it.
df.groupby(['cusid', 'product']).mean().reset_index().groupby('product')['count'].mean()
OUTPUT:
product
1 1
21 1
30 2
36 1
78 1
79 4
99 1
112 2
228 1
python version: 3.7.4
pandas version: 0.25.0
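An equivalent way to write the chained call above is sketched here on the question's sample rows. Passing as_index=False to the first groupby keeps cusid and product as ordinary columns, so no reset_index() is needed before the second groupby. Whether this form also avoids the original TypeError in the asker's environment has not been checked. The names per_customer and per_product are illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "cusid": [1521, 18984, 25094, 2363, 3316, 19249, 13220, 1226, 1117],
        "product": [30, 99, 1, 36, 21, 228, 78, 79, 112],
        "count": [2, 1, 1, 1, 1, 1, 1, 4, 2],
    }
)

# Step 1: mean count per (customer, product) pair, keys kept as columns.
per_customer = df.groupby(["cusid", "product"], as_index=False)["count"].mean()

# Step 2: average those per-customer means for each product.
per_product = per_customer.groupby("product")["count"].mean()
print(per_product)
```

In this sample every (cusid, product) pair occurs once, so the result per product equals the original count.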

Pandas groupby throws: TypeError: unhashable type: 'numpy.ndarray'

I have a dataframe as shown in the picture:
problem dataframe: attdf
I would like to group the data by Source class and Destination class, count the number of rows in each group and sum up Attention values.
While trying to achieve that, I am unable to get past this type error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-100-6f2c8b3de8f2> in <module>()
----> 1 attdf.groupby(['Source Class', 'Destination Class']).count()
8 frames
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
458 table = hash_klass(size_hint or len(values))
459 uniques, labels = table.factorize(values, na_sentinel=na_sentinel,
--> 460 na_value=na_value)
461
462 labels = ensure_platform_int(labels)
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
attdf.groupby(['Source Class', 'Destination Class'])
gives me a <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1e720f2080> which I'm not sure how to use to get what I want.
Dataframe attdf can be imported from : https://drive.google.com/open?id=1t_h4b8FQd9soVgYeiXQasY-EbnhfOEYi
Please advise.
@Adam.Er8 and @jezarael helped me with their inputs. The unhashable type error in my case was caused by the datatypes of the columns in my dataframe.
Original df and df imported from csv
It turned out that the original dataframe had two object columns that I was trying to use in the groupby, hence the unhashable type error. Importing the data into a new dataframe straight from a CSV fixed the datatypes, and the type error no longer occurred.
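A minimal sketch of the situation described above, assuming the object columns held NumPy arrays rather than plain scalars. The 'numpy.ndarray' in the traceback suggests this, but the actual contents of attdf are not shown here. The toy values and the conversion via ndarray.item() are illustrative, not the original fix.

```python
import numpy as np
import pandas as pd

# Toy frame: the class columns hold one-element arrays (dtype object).
attdf = pd.DataFrame(
    {
        "Source Class": [np.array([0]), np.array([0]), np.array([1])],
        "Destination Class": [np.array([1]), np.array([1]), np.array([0])],
        "Attention": [0.5, 0.25, 1.0],
    }
)

# Arrays cannot be hashed, so they cannot serve as group keys.
try:
    attdf.groupby(["Source Class", "Destination Class"]).count()
except TypeError as exc:
    print("groupby failed:", exc)

# Unwrap each array into a plain Python scalar so the keys are hashable.
for col in ["Source Class", "Destination Class"]:
    attdf[col] = attdf[col].map(
        lambda v: v.item() if isinstance(v, np.ndarray) else v
    )

summary = attdf.groupby(["Source Class", "Destination Class"])["Attention"].agg(
    ["sum", "count"]
)
print(summary)
```

Checking attdf.dtypes and the type of a single cell is a quick way to see whether a frame is in this state before grouping.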
Try using .agg as follows:
import pandas as pd
attdf = pd.read_csv("attdf.csv")
print(attdf.groupby(['Source Class', 'Destination Class']).agg({"Attention": ['sum', 'count']}))
Output:
Attention
sum count
Source Class Destination Class
0 0 282.368908 1419
1 7.251101 32
2 3.361009 23
3 22.482438 161
4 14.020189 88
5 10.138409 75
6 11.377947 80
1 0 6.172269 32
1 181.582437 1035
2 9.440956 62
3 12.007303 67
4 3.025752 20
5 4.491725 28
6 0.279559 2
2 0 3.349921 23
1 8.521828 62
2 391.116034 2072
3 9.937170 53
4 0.412747 2
5 4.441985 30
6 0.220316 2
3 0 33.156251 161
1 11.944373 67
2 9.176584 53
3 722.685180 3168
4 29.776050 137
5 8.827215 54
6 2.434347 16
4 0 17.431855 88
1 4.195519 20
2 0.457089 2
3 20.401789 137
4 378.802604 1746
5 3.616083 19
6 1.095061 6
5 0 13.525333 75
1 4.289306 28
2 6.424412 30
3 10.911705 54
4 3.896328 19
5 250.309764 1132
6 8.643153 46
6 0 15.249959 80
1 0.150240 2
2 0.413639 2
3 3.108417 16
4 0.850280 6
5 8.655959 46
6 151.571505 686

How can I loop through pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all 3 columns with DataFrame.sort_values, get the difference per group with DataFrameGroupBy.diff, and replace missing values with a zero timedelta using Series.fillna:
# if the values are already strings, astype(str) can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds by adding Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
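The steps above can be sketched end to end on the six rows from the question's expected output. The rows are deliberately shuffled here (an addition for illustration) so that the sort is visibly doing work.

```python
import pandas as pd

# Rows copied from the question's expected output, in shuffled order.
df = pd.DataFrame(
    {
        "ID": [995812, 995812, 995812, 995812, 995820, 995820],
        "Location": [730, 696, 771, 761, 761, 381],
        "Time": ["07:11:41", "07:10:36", "07:20:49",
                 "07:12:30", "07:12:44", "06:55:07"],
    }
)

# Strings like "07:11:41" parse directly into timedeltas.
df["Time"] = pd.to_timedelta(df["Time"])
df = df.sort_values(["ID", "Location", "Time"]).reset_index(drop=True)

# Difference to the previous row within each ID; first row of a group gets 0.
df["Delta"] = df.groupby("ID")["Time"].diff().fillna(pd.Timedelta(0))
df["Delta_sec"] = df["Delta"].dt.total_seconds()
print(df)
```

Note that the sort puts Location before Time, as in the answer. In this sample the two orders agree; if they can differ in the real data, sorting by ID and Time alone may be what is wanted.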
If you just want to iterate over the groupby object, as in your original question title, you can do it like this:
for (x, y) in df.groupby(['ID','Location','Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, this works for 10,000 or 100,000 rows, but not so well for 10^6 rows or more.
