Python Pandas - Get the First Value that Meets Criteria

We have this function:
import numpy as np
import pandas as pd

def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    # Flag rows whose USDamt exceeds the requested amount
    df["getfirst"] = np.where(df["USDamt"] > CustomAmt, 1, 0)
    wantedprice = "??"
    print(df)
    print()
    print("Wanted Price:",wantedprice)
    return wantedprice
Calling it using a custom USDamt like this:
GetPricePerCustomAmt(500)
gets this result:
Price USDamt getfirst
0 281.48 104.84 0
1 281.44 5140.77 1
2 281.42 10072.24 1
3 281.39 15773.83 1
4 281.33 19314.54 1
5 281.27 22255.55 1
6 281.20 23427.64 1
7 281.13 23708.77 1
8 281.10 23738.77 1
9 281.08 24019.88 1
10 281.01 25986.95 1
11 281.00 26127.45 1
Wanted Price: ??
We want to return the Price value from the first row where the "getfirst" column is 1.
Examples:
GetPricePerCustomAmt(500)
Wanted Price: 281.44
GetPricePerCustomAmt(15000)
Wanted Price: 281.39
GetPricePerCustomAmt(24000)
Wanted Price: 281.08
How do we do it?
(If you know a more efficient way to get the wanted price please do tell too)

Use boolean indexing to filter, then next with iter to take the first match; next returns the default value if nothing matched and the filtered Series is empty:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    return next(iter(df.loc[df["USDamt"] > CustomAmt, 'Price']), 'no matched')
print(GetPricePerCustomAmt(500))
281.44
print(GetPricePerCustomAmt(15000))
281.39
print(GetPricePerCustomAmt(24000))
281.08
print(GetPricePerCustomAmt(100000))
no matched
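If the USDamt column is always sorted in ascending order, as it is in the sample data, a binary search avoids scanning the whole column. The following is a minimal sketch under that assumption; first_price_above, PRICES and USD_AMTS are hypothetical names, not part of the original code.

```python
import numpy as np

# Sample data from the question, split into two parallel arrays.
# Assumption: USD_AMTS is sorted ascending; searchsorted requires that.
PRICES = np.array([281.48, 281.44, 281.42, 281.39, 281.33, 281.27,
                   281.2, 281.13, 281.1, 281.08, 281.01, 281.0])
USD_AMTS = np.array([104.84, 5140.77, 10072.24, 15773.83, 19314.54, 22255.55,
                     23427.64, 23708.77, 23738.77, 24019.88, 25986.95, 26127.45])


def first_price_above(custom_amt, default=None):
    """Return the price of the first row whose USDamt exceeds custom_amt."""
    # side="right" yields the first position whose value is strictly greater.
    pos = np.searchsorted(USD_AMTS, custom_amt, side="right")
    if pos == len(USD_AMTS):
        return default  # no row exceeds custom_amt
    return float(PRICES[pos])


print(first_price_above(500))     # 281.44
print(first_price_above(15000))   # 281.39
print(first_price_above(24000))   # 281.08
print(first_price_above(100000))  # None
```

If the column is not guaranteed to be sorted, the boolean-indexing version above is the safer choice.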

Related

Pandas MultiIndex sort by group

I would like to keep the values in the same (descending) order, but I am unable to group the following frame by index level 0. The rows with code 0512 should stay together, with counts in descending order within each code.
code product count
0510 あたたか新潟こしひかり 5kg           1
0511 キッコ−マン 味わいリッチ減塩しょうゆ 450ml 1
7プレミアム 国産果汁使用ゆずぽん酢 200ml  1
0512 キリン 生茶 525ml              1
キリンレモン 450ml              1
コカ・コーラ い・ろ・は・す もも 555ML   1
サントリー なっちゃん オレンジ 425ml    1
サントリー プレミアムボス ブラック 490ml  2
サントリー 天然水南アルプス 2L ケース     1
サントリー 天然水南アルプス 2L ペット     1
サントリー 朝摘みオレンジ&天然水 540ml   1
大塚 ポカリスエット 900ML ペット      1
森永 inゼリー エネルギーレモン 180g    1
綾鷹 525MLペット               2
7プレミアム パイナップルサイダー 500ml   1
7プレミアム フルーツオ・レ 500ml      1
GAクラフトマン ダークモカ 440ml      1
UCC 職人の珈琲 無糖 930ML ペット    1
0513 アサヒ オフ 500ml×6            1
キリン 本麒麟 500ml             1
万上 濃厚熟成本みりん 1L            1
東村山純米酒 720ml              1
0514 ブルボン プチポテトコンソメ味 45g       1
ロッテ ガーナローストミルク 50g        1
ロッテ グリーンガム 9枚             1
My code
data = df.groupby(['code','product']).size().reset_index(name='counts').set_index(['code','product'])
data1 = data.sort_values(by=['counts','code'], ascending=False).groupby(['product','code']).sum()
EDIT:
I can see that the second groupby puts each code together but messes up the descending order of count per code, as we can see for 0512.
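You can pass a list to the ascending argument in the second line to set the direction for each column, like this:
data1 = data.sort_values(by=['counts','code'], ascending=[False,False]).groupby(['product','code']).sum()
Note that a single ascending=False already applies to every column listed in by, so the list only changes the result when the directions differ. The groupby that follows also sorts by its group keys by default (sort=True), which replaces the order produced by sort_values.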
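As a hedged alternative, if the goal is to keep each code together with counts descending inside it, one option is to sort by code first and counts second, and to skip the second groupby. The sketch below uses a small made-up frame; the product names are placeholders, not the data from the question.

```python
import pandas as pd

# Hypothetical raw data: one row per purchased item.
df = pd.DataFrame({
    'code':    ['0512', '0510', '0512', '0512', '0511', '0512'],
    'product': ['tea', 'rice', 'coffee', 'coffee', 'soy sauce', 'water'],
})

# Count items per (code, product) pair, as in the question.
counts = df.groupby(['code', 'product']).size().reset_index(name='counts')

# Sort by code ascending, then by counts descending within each code.
data1 = (counts
         .sort_values(by=['code', 'counts'], ascending=[True, False])
         .set_index(['code', 'product']))
print(data1)
```

Sorting on the flat frame before calling set_index keeps the example independent of how sort_values treats index level names.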

How not to use loop in a df when access previous lines

I use pandas to process transport data. I study attendance of bus lines. I have 2 columns that count people getting on and off the bus at each stop. I want to create one that counts the people currently on board. At the moment, I loop through the df, and for line n it does: current[n]=on[n]-off[n]+current[n-1], as shown in the following example:
for index,row in df.iterrows():
    if index == 0:
        df.loc[index,'current']=df.loc[index,'on']
    else :
        df.loc[index,'current']=df.loc[index,'on']-df.loc[index,'off']+df.loc[index-1,'current']
Is there a way to avoid using a loop ?
Thanks for your time !
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6
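Note that the loop in the question sets the first row to on alone, ignoring off at the first stop. If that behaviour matters, a minimal sketch (with made-up numbers) that reproduces it without a loop is to overwrite the first net change before accumulating:

```python
import pandas as pd

# Made-up boardings and alightings per stop.
df = pd.DataFrame({'on': [4, 4, 2, 5], 'off': [1, 3, 1, 4]})

# Net change in passengers at each stop.
net = df['on'] - df['off']
# Match the question's loop: the first row counts only people getting on.
net.iloc[0] = df['on'].iloc[0]
# The running total of net changes is the number of people on board.
df['current'] = net.cumsum()
print(df)
```

With these numbers the current column comes out as 4, 5, 6, 7, the same values the original loop would produce.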

Better Way to do this in Pandas?

I'm just seeking some guidance on how to do this better. I was just doing some basic research to compare Monday's opening and low. The code returns two lists, one with the returns (Monday's close - open/Monday's open) and a list that's just 1's and 0's to reflect if the return was positive or negative.
Please take a look as I'm sure there's a better way to do it in pandas but I just don't know how.
import datetime
from statistics import mean #assumption: the original does not show where mean comes from

#Monday only
m_list = [] #results list
h_list = [] #hit list (open-low > 0)
n=0 #counter variable
for t in history.index:
    if datetime.datetime.weekday(t[1]) == 0: #t[1] is the timestamp in multi index; weekday() returns 0 for Monday
        x = history.iloc[n]['open']-history.iloc[n]['low'] #.iloc replaces .ix, which newer pandas versions removed
        m_list.append((history.iloc[n]['open']-history.iloc[n]['low'])/history.iloc[n]['open'])
        if x > 0:
            h_list.append(1)
        else:
            h_list.append(0)
        n += 1 #add to index counter
    else:
        n += 1 #add to index counter
print("Mean: ", mean(m_list), "Max: ", max(m_list),"Min: ",
      min(m_list), "Hit Rate: ", sum(h_list)/len(h_list))
You can do that directly:
(history['open']-history['low'])>0
This gives True for rows where open is greater than low and False otherwise.
And if you want 1 and 0 instead, you can multiply the expression above by 1.
((history['open']-history['low'])>0)*1
Example
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.random.random(10),
'b':np.random.random(10)})
Printing the data frame:
print(df)
a b
0 0.675916 0.796333
1 0.044582 0.352145
2 0.053654 0.784185
3 0.189674 0.036730
4 0.329166 0.021920
5 0.163660 0.331089
6 0.042633 0.517015
7 0.544534 0.770192
8 0.542793 0.379054
9 0.712132 0.712552
To make a new column compare that is 1 if a is greater and 0 if b is greater:
df['compare'] = (df['a']-df['b']>0)*1
This adds the new column compare:
a b compare
0 0.675916 0.796333 0
1 0.044582 0.352145 0
2 0.053654 0.784185 0
3 0.189674 0.036730 1
4 0.329166 0.021920 1
5 0.163660 0.331089 0
6 0.042633 0.517015 0
7 0.544534 0.770192 0
8 0.542793 0.379054 1
9 0.712132 0.712552 0
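The comparison above covers every row. To restrict it to Mondays as in the question, one possible sketch is to build a boolean mask from the timestamp level of the index. The data, the symbol "XYZ" and the level names symbol and timestamp are assumptions made for illustration; the question only says that the timestamp is the second index level.

```python
import pandas as pd

# Made-up price history indexed by (symbol, timestamp); 2021-01-04 is a Monday.
dates = pd.date_range('2021-01-04', periods=10, freq='D')
index = pd.MultiIndex.from_product([['XYZ'], dates], names=['symbol', 'timestamp'])
history = pd.DataFrame({
    'open': [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 110.0, 108.0, 109.0],
    'low':  [98.0, 100.0, 101.0, 103.0, 100.0, 104.0, 105.0, 110.0, 107.0, 108.0],
}, index=index)

# Keep only rows whose timestamp falls on a Monday (dayofweek == 0).
is_monday = history.index.get_level_values('timestamp').dayofweek == 0
mondays = history[is_monday]

# Relative distance between open and low, and a 1/0 flag for open > low.
returns = (mondays['open'] - mondays['low']) / mondays['open']
hits = (mondays['open'] > mondays['low']).astype(int)

print("Mean:", returns.mean(), "Max:", returns.max(),
      "Min:", returns.min(), "Hit Rate:", hits.mean())
```

The mean of the 1/0 flags is the hit rate, so no separate lists or counters are needed.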

How do you set a specific column with a specific value to a new value in a Pandas DF?

I imported a CSV file that has two columns ID and Bee_type. The bee_type has two types in it - bumblebee and honey bee. I'm trying to convert them to numbers instead of names; i.e. instead of bumblebee it says 1.
However, my code is setting everything to 1. How can I keep the ID column its original value and only change the bee_type column?
# load the labels using pandas
labels = pd.read_csv("bees/train_labels.csv")
#Set bumble_bee to one
for index in range(len(labels)):
    labels[labels['bee_type'] == 'bumble_bee'] = 1
I believe you need map with a dictionary if only 2 possible values exist:
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
Another solution is to use numpy.where - set values by condition:
labels['bee_type'] = np.where(labels['bee_type'] == 'bumble_bee', 1, 2)
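Your code runs, but the loop is unnecessary, and assigning without a column label overwrites every column in the matching rows (note the ID column in the output below). A single loc call does the same thing without the loop: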
labels.loc[labels['bee_type'] == 'bumble_bee'] = 1
print (labels)
ID bee_type
0 1 1
1 1 honey_bee
2 1 1
3 3 honey_bee
4 1 1
Sample:
labels = pd.DataFrame({
'bee_type': ['bumble_bee','honey_bee','bumble_bee','honey_bee','bumble_bee'],
'ID': list(range(5))
})
print (labels)
ID bee_type
0 0 bumble_bee
1 1 honey_bee
2 2 bumble_bee
3 3 honey_bee
4 4 bumble_bee
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
print (labels)
ID bee_type
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
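The loc assignment in this answer can also be limited to a single column by passing a column label as the second indexer, which leaves ID untouched. A minimal sketch, writing the codes to a new column; bee_code is a hypothetical name chosen here so the original labels stay available:

```python
import pandas as pd

labels = pd.DataFrame({
    'ID': list(range(5)),
    'bee_type': ['bumble_bee', 'honey_bee', 'bumble_bee', 'honey_bee', 'bumble_bee'],
})

# Start with the code for honey_bee everywhere (hypothetical column name).
labels['bee_code'] = 2
# Row mask plus column label: only bee_code changes, and only in matching rows.
labels.loc[labels['bee_type'] == 'bumble_bee', 'bee_code'] = 1
print(labels)
```

The second indexer is what keeps the assignment from spreading across the whole row.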
As far as I can understand, you want to convert names to numbers. If that's the scenario, please try LabelEncoder. Detailed documentation can be found in the scikit-learn documentation for LabelEncoder.

Python dataframe check if a value in a column dataframe is within a range of values reported in another dataframe

Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes and I need to add a column to the first dataframe that is true if a certain value of the first dataframe is between two values of the second dataframe otherwise false.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I need to check whether the value in the code2 column falls within one of the ranges defined by the rows of second_df that have the same code1. For example:
in row 1 of first_df code1=1 and code2=22
checking second_df I have 4 rows with code1=1, rows 0,1,5 and 6, the value code2=22 is in the interval identified by code2_start=20 and code2_end=25 so the function should return True.
Considering an example where the function should return False,
in row 5 of first_df code1=1 and code2=130
but there is no interval containing 130 where code1=1
I have tried to use this function
def check(first_df,second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to find in second_df all the instances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
falls within one of the ranges identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
Since first_df.code2[0] is 10, it lies between 5 and 15, the range in row 0, so my function should return True. For first_df.code1[6] the value is still 1, so the table of ranges is the same as above, but first_df.code2[6] is 2 and no interval contains 2, so the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like: second_df.code2_start <= first_df.code2
You get a boolean Series. If you then perform a logical AND on two of these boolean series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()
# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1 and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1. Then the non matching rows are discarded. This is done in the first 2 lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care that the x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it will return True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
