function calculation based on condition [duplicate] - python

This question already has an answer here: Pandas Rolling Python to create new Columns (1 answer). Closed 2 years ago.
I am practising and am new to creating a function in Python with conditions:
Create a function that takes an integer input (for example m, where m is between 2 and n, and n is the maximum number of rows). This function calculates 'Sum A' and 'Sum B' over the last m days. There is no value for the first m days.
The original data:
V TP A B Sum A Sum B
3509 47.81
4862 48.406667 235353.2133
1810 49.26 89160.6
3824 49.263333 188382.9867
2209 47.386667 104677.1467
4558 45.573333 207723.2533
3832 44.396667 170128.0267
3778 43.75 165287.5
1005 44.64 44863.2
4047 43.76 177096.72
2201 44.383333 97687.71667 655447.7167 824912.6467
2507 45.156667 113207.7633 533302.2667 824912.6467
4392 44.4333 195151.2 444141.6667 1020063.847
3497 43.296667 151408.4433 255758.68 1171472.29
1181 43.07 50865.67 255758.68 1117660.813
1971 42.89 84536.19 255758.68 994473.75
4994 43.563333 217555.2867 473313.9667 824345.7233
2017 44.816667 90395.21667 563709.1833 659058.2233
2823 44.936667 126856.21 645702.1933 659058.2233
2774 45.13 125190.62 770892.8133 481961.5033
Continue original data: a Day column numbered 1 through 20, one row per day.
The attempt I have made so far is below, and it shows the error KeyError: 'A':
curret_period = int(input("enter days: "))
sumA = curret_period * ((df["A"] < df["A"]),'')
sumB = curret_period * ((df["B"] >= df["B"]),'')
print(sumA)
print(sumB)
I am wondering whether there is a better way to create the function. I also wonder if the structure below is the one I need:
def function_name():
    print()
Expected result when m= 10:
A B Sum A Sum B
0
1 235353.21333333332
2 89160.59999999999
3 188382.98666666663
4 104677.1466666667
5 207723.25333333333
6 170128.02666666667
7 165287.5
8 44863.200000000004
9 177096.72
10 97687.71666666666 655447.7167 824912.6467
11 113207.76333333334 533302.2667 824912.6467
12 195151.2 444141.6667 1020063.847
13 151408.4433333333 255758.68 1171472.29
14 50865.66999999999 255758.68 1117660.813
15 84536.19000000002 255758.68 994473.75
16 217555.28666666665 473313.9667 824345.7233
17 90395.21666666666 563709.1833 659058.2233
18 126856.21 645702.1933 659058.2233
19 125190.61999999998 770892.8133 481961.5033
Any suggestion? Thank you in advance.

You can utilize df.tail() to get the last m rows of the dataframe and then simply sum() each column.
We can also check that m is not greater than the length of the dataframe; even without this check, df.tail(m) would simply return (and sum) the entire dataframe.
def sumof(df, m):
    if m <= len(df.index):
        rows = df.tail(m)
        print(rows['A'].sum())
        print(rows['B'].sum())
    else:
        print("'m' can not be greater than length of dataframe")

Related

Finding overlap in range based on multiple dataframe column values

I have a TSV that looks as follows:
chr_1 start_1 chr_2 start_2
11 69633786 14 105884873
12 81940993 X 137690551
13 29782093 12 97838049
14 105864244 11 69633799
17 33207000 20 9992701
17 38446991 20 2102271
17 38447482 17 29623333
20 9992701 17 33207000
20 10426599 17 33094167
20 13765533 17 29469669
22 27415959 8 36197094
22 37191634 8 38983042
22 44464751 18 74004141
8 36197054 22 23130534
8 36197054 22 23131537
8 36197054 8 23130539
This will be referred to as transDiffStartEndChr, which is a DataFrame.
I am working on a program that takes this TSV as input and outputs rows that share the same chr_1 and chr_2 and whose start_1 and start_2 values are within +/- 1000 of each other.
Ideal output would look like:
chr_1 start_1 chr_2 start_2
8 36197054 8 23130539
8 36197054 22 23131537
Potentially creating groups for every hit based on chr_1 and chr_2.
My current script/thoughts:
transDiffStartEndChr = pd.read_csv('test-input.tsv', sep='\t')
#I will extract rows first by chr_1, in this case I'm doing a test case for 17.
rowsStartChr17 = transDiffStartEndChr[transDiffStartEndChr.apply(extractChr, chr='17', axis=1)]
#I figure I can do something stupid and using brute force, but I feel like I'm not tackling this problem correctly
for index, row in rowsStartChr17.iterrows():
    for index2, row2 in rowsStartChr17.iterrows():
        if index == index2:
            continue
        elif row['chr_1'] == row2['chr_1'] and row['chr_2'] == row2['chr_2']:
            if proximityCheck(row['start_1'], row2['start_1']) and proximityCheck(row['start_2'], row2['start_2']):
                print(f'Row: {index} Match: {index2}')
Any thoughts are appreciated.
You can play with numpy and pandas to filter out the groups that don't match your requirements.
df.groupby(['chr_1', 'chr_2']).filter(
    lambda s: len(
        np.array(
            np.where(
                np.tril(
                    np.abs(np.subtract.outer(s['start_2'].values,
                                             s['start_2'].values)) < 1500,
                    -1))).flatten()) > 0)
The logic is to group by chr_1 and chr_2 and perform an outer subtraction between the start_2 values to check whether any pair falls below 1500 (the threshold I used).
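To see what that outer subtraction does on its own, here is a small illustration with hypothetical start_2 values from a single group:

import numpy as np

# Hypothetical start_2 values for one (chr_1, chr_2) group.
starts = np.array([23130534, 23131537, 36197094])

# Pairwise absolute differences between every pair of start_2 values.
diffs = np.abs(np.subtract.outer(starts, starts))

# Keep only the strictly lower triangle so each pair is counted once,
# then check whether any pair is closer than the 1500 threshold.
close = np.tril(diffs < 1500, -1)
print(np.where(close))  # non-empty indices -> filter() keeps the group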

fastest way to access dataframe cell by column values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function to get the pr_ss value for, e.g., (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I have tried, but it takes time:
def get_safety_stock(df, bk1, bk2):
    ## a function that returns the safety stock for any given (bk1, bk2)
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])
            break
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, you can make a function as follows:
def get_safety_stock(df, bk1, bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'][0]
Note that it's accessing the first value in the Series, which shouldn't be an issue if there are no duplicates in the data. If you want all of them, just remove the [0] from the end and it will give you the whole Series. This can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
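If you need many such lookups and the (bk1_lvl0_id, bk2_lvl0_id) pairs are unique, one common alternative (a sketch only; get_safety_stock_fast is a hypothetical name) is to set those columns as the index once and then look up by label:

# Build the index once; each subsequent lookup is a fast label access.
indexed = df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"]).sort_index()

def get_safety_stock_fast(bk1, bk2):
    return int(indexed.loc[(bk1, bk2), "pr_ss"])

# get_safety_stock_fast(1000, 3)  # -> 16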

How to apply a function to some values in a dataframe column

I have this single line of code that checks whether a dataframe column falls within a given range and adjusts another column:
data.loc[data.day<6, 'month'] -= 1
The above code works fine across the entire dataframe, but I only want to apply it to rows where the key column equals salary.
data
amount day month key
0 111627.94 1 6 salary
474 131794.61 31 10 salary
590 131794.61 29 11 salary
1003 102497.94 11 7 other_income
1245 98597.94 1 8 other_income
2446 5000.00 2 7 other_income
2447 10000.00 2 7 other_income
Expected output:
amount day month key
0 111627.94 1 5 salary
474 131794.61 31 10 salary
590 131794.61 29 11 salary
1003 102497.94 11 7 other_income
1245 98597.94 1 8 other_income
2446 5000.00 2 7 other_income
2447 10000.00 2 7 other_income
I have tried using this filter query,
data[[data.key == 'salary'].day<13, 'month'] -= 1
which resulted in the error below:
AttributeError Traceback (most recent call last)
<ipython-input-773-81b5a31a7b9f> in <module>
----> 1 test_df[[test_df.key == 'salary'].day<13, 'month'] -= 1
AttributeError: 'list' object has no attribute 'day'
I tried this as well:
new = data.loc[data.key == 'salary']
new.loc[new.day < 6, 'month'] -= 1
This worked, but I want to do it in a single line rather than assigning it to a new variable.
You can combine multiple conditions into one Boolean index by using logical operators and surrounding each condition with parentheses:
data.loc[(data.day < 6) & (data.key == "salary"), "month"] -= 1
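A small self-contained run on a few of the rows from the question shows the effect of the combined mask; only the row that satisfies both conditions is changed:

import pandas as pd

data = pd.DataFrame({
    "amount": [111627.94, 131794.61, 102497.94],
    "day":    [1, 31, 11],
    "month":  [6, 10, 7],
    "key":    ["salary", "salary", "other_income"],
})

# Both conditions must hold for a row to be modified.
data.loc[(data.day < 6) & (data.key == "salary"), "month"] -= 1
print(data)  # only the first row's month drops from 6 to 5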

How to compare value in Pandas DataFrame against a value in the previous row AND the previous column?

I have a dataframe consisting of two columns filled with float values. I need to calculate, for each row, the value of 'h' minus the value of 'c' in the previous row.
So for instance, for 'h' in row 1, I need to calculate 1.17322 - 1.17285 (the value of 'c' in the previous row)
I have tried several different methods to accomplish this, including the use of: .iloc, .shift(), .groupby(), and .diff(), but I cannot get exactly what I'm looking for.
If anybody could help, it would be greatly appreciated
c h
0 1.17285 1.17310
1 1.17287 1.17322
2 1.17298 1.17340
3 1.17346 1.17348
4 1.17478 1.17511
5 1.17595 1.17700
6 1.17508 1.17633
7 1.17474 1.17545
8 1.17463 1.17546
9 1.17224 1.17468
10 1.17437 1.17456
11 1.17552 1.17641
12 1.17750 1.17784
13 1.17694 1.17770
Try this using shift, as an example:
df['c_shift'] = df['c'].shift()
df['diff'] = df['h'] - df['c_shift']
print(df)
Output:
c h c_shift diff
0 1.17285 1.17310 NaN NaN
1 1.17287 1.17322 1.17285 0.00037
2 1.17298 1.17340 1.17287 0.00053
3 1.17346 1.17348 1.17298 0.00050
4 1.17478 1.17511 1.17346 0.00165
5 1.17595 1.17700 1.17478 0.00222
6 1.17508 1.17633 1.17595 0.00038
7 1.17474 1.17545 1.17508 0.00037
8 1.17463 1.17546 1.17474 0.00072
9 1.17224 1.17468 1.17463 0.00005
10 1.17437 1.17456 1.17224 0.00232
11 1.17552 1.17641 1.17437 0.00204
12 1.17750 1.17784 1.17552 0.00232
13 1.17694 1.17770 1.17750 0.00020
Of course, you can do this in one step:
df['diff'] = df['h'] - df['c'].shift()

Not getting 0 index from pandas value_counts()

total_val_count = dataset[attr].value_counts()
for i in range(len(total_val_count.index)):
    print total_val_count[i]
I have written this piece of code which counts occurrences of all distinct values of an attribute in a dataframe. The problem I am facing is that I am unable to access the first value by using index 0. I get a KeyError: 0 error in the first loop run itself.
The total_val_count contains proper values as shown below:
34 2887
4 2708
13 2523
35 2507
33 2407
3 2404
36 2382
26 2378
16 2282
22 2187
21 2141
12 2104
25 2073
5 2052
15 2044
17 2040
14 2027
28 1984
27 1980
23 1979
24 1960
30 1953
29 1936
31 1884
18 1877
7 1858
37 1767
20 1762
11 1740
8 1722
6 1693
32 1692
10 1662
9 1576
19 1308
2 1266
1 175
38 63
dtype: int64
total_val_count is a Series. The index of the Series consists of the values in dataset[attr],
and the values in the Series are the number of times each associated value appears in dataset[attr].
When you index a Series with total_val_count[i], Pandas looks for i in the index and returns the associated value. In other words, total_val_count[i] indexes by index label, not by ordinal position.
Think of a Series as a mapping from the index to the values. When using plain indexing, e.g. total_val_count[i], it behaves more like a dict than a list.
You are getting a KeyError because 0 is not a value in the index.
To index by ordinal, use total_val_count.iloc[i].
Having said that, using for i in range(len(total_val_count.index)) -- or, what amounts to the same thing, for i in range(len(total_val_count)) -- is not recommended. Instead of
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
you could use
for value in total_val_count.values:
    print(value)
This is more readable, and allows you to access the desired value as a variable, value, instead of the more cumbersome total_val_count.iloc[i].
Here is an example which shows how to iterate over the values, the keys, both keys and values together, and by ordinal position:
import pandas as pd

s = pd.Series([1, 2, 3, 2, 2])
total_val_count = s.value_counts()

print(total_val_count)
# 2    3
# 3    1
# 1    1
# dtype: int64

for value in total_val_count.values:
    print(value)
# 3
# 1
# 1

for key in total_val_count.keys():
    print(key)
# 2
# 3
# 1

for key, value in total_val_count.iteritems():
    print(key, value)
# (2, 3)
# (3, 1)
# (1, 1)

for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
# 3
# 1
# 1
