This is my first time posting a question, so please bear with me if I don't yet know the Stack Overflow norms for asking questions.
Below is a snippet of what I am trying to accomplish in my side project. I want to compare a user input against a database .xlsx file imported with pandas.
I want to compare the user input with the database column ['Component']; if that component is there, the code should grab the properties associated with it.
import pandas as pd

comp_loc = r'C:\Users\ayubi\Documents\Python Files\Chemical_Database.xlsx'
data = pd.read_excel(comp_loc)
print(data)

LK = input('What is the Light Key?: ') #Answer should be Benzene in this case
if LK == data['Component'].any():
    Tcrit = data['TC, (K)']
    Pcrit = data['PC, (bar)']
    A = data['A']
    B = data['B']
    C = data['C']
    D = data['D']
else:
    print('False')
Results
Component TC, (K) PC, (bar) A B C D
0 Benzene 562.2 48.9 -6.983 1.332 -2.629 -3.333
1 Toluene 591.8 41.0 -7.286 1.381 -2.834 -2.792
What is the Light Key?: Benzene
False
Please let me know if you have any questions.
I do appreciate your help!
You can do this by taking advantage of indices and using the df.loc accessor in pandas:
# set index to Component column for convenience
data = data.set_index('Component')

LK = input('What is the Light Key?: ') #Answer should be Benzene in this case

# check if your lookup is in the index
if LK in data.index:
    # grab the row by the index using .loc
    row = data.loc[LK]
    # if the column name has spaces, you need to access as key
    Tcrit = row['TC, (K)']
    Pcrit = row['PC, (bar)']
    # if the column name doesn't have a space, you can access as attribute
    A = row.A
    B = row.B
    C = row.C
    D = row.D
else:
    print('False')
This is a great case for an index. Set 'Component' as the index, and then you can use a very fast .loc call to look up the data. Instead of the if-else, use a try-except: a KeyError tells you that the LK doesn't exist, without the overhead of first checking whether it's in the index.
I also highly suggest you keep the values as a single Series instead of floating around as 6 different variables. It's simple to access each item by the Series index, i.e. Series['A'] (a short usage sketch follows the example below).
df = df.set_index('Component')

def grab_data(df, LK):
    try:
        return df.loc[LK]
    except KeyError:
        return False
grab_data(df, 'Benzene')
#TC, (K) 562.200
#PC, (bar) 48.900
#A -6.983
#B 1.332
#C -2.629
#D -3.333
#Name: Benzene, dtype: float64
grab_data(df, 'foo')
#False
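If you do need the individual property values afterwards, they are easy to pull out of the returned Series; a minimal usage sketch (reusing grab_data from above):

props = grab_data(df, 'Benzene')
if props is not False:
    Tcrit = props['TC, (K)']   # 562.2
    Pcrit = props['PC, (bar)'] # 48.9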
Related
I'm working on a project to monitor my 5k time for my running/jogging activities based on their GPS data. I'm currently exploring my data in a Jupyter notebook and now realize that I will need to exclude some activities.
Each activity is a row in a dataframe. While I do want to exclude some rows, I don't want to drop them from my dataframe as I will also use the df for other calculations.
I've added a column to the df along with a custom function for checking the invalidity reasons of a row. It's possible that a run could be excluded for multiple reasons.
In []:
# add invalidity reasons column & update logic
df['invalidity_reasons'] = ''

def maintain_invalidity_reasons(reason):
    """logic for maintaining ['invalidity reasons']"""
    reasons = []
    if invalidity_reasons == '':
        return list(reason)
    else:
        reasons = invalidity_reasons
        reasons.append(reason)
        return reasons
I filter down to specific rows in my df and pass them to my function. The filter below returns five rows from the df. Here is an example of using the function in my Jupyter notebook.
In []:
columns = ['distance','duration','notes']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns].apply(maintain_invalidity_reasons('short_run'),axis=1)
Out []:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-107-0bd06407ef08> in <module>
2
3 filt = (df['duration'] < pd.Timedelta('5 minutes'))
----> 4 df.loc[filt,columns].apply(maintain_invalidity_reasons(reason='short_run'),axis=1)
<ipython-input-106-60264b9c7b13> in maintain_invalidity_reasons(reason)
5 """logic for maintaining ['invalidity reasons']"""
6 reasons = []
----> 7 if invalidity_reasons == '':
8 return list(reason)
9 else:
NameError: name 'invalidity_reasons' is not defined
Here is an example of the output of my filter if I remove the .apply() call to my function
In []:
columns = ['distance','duration', 'notes','invalidity_reasons']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns]
Out []:
It seems that my issue lies in not knowing how to specify that I want to reference the scalar value in the 'invalidity_reasons' index/column (not sure of the proper term) of the specific row.
I've tried adjusting the if statement with the variants below. I've also tried to apply the function with and without the axis argument. I'm stuck, please help!
if 'invalidity_reasons' == '':
if s['invalidity_reasons'] == '':
This is pretty much a stab in the dark, but I hope it helps. In the following I'm using this simple frame as an example (to have something to work with):
df = pd.DataFrame({'Col': range(5)})
Now if you define
def maintain_invalidity_reasons(current_reasons, new_reason):
    if current_reasons == '':
        return [new_reason]
    if type(current_reasons) == list:
        return current_reasons + [new_reason]
    return [current_reasons] + [new_reason]
add another column invalidity_reasons to df
df['invalidity_reasons'] = ''
populate one cell (for the sake of exemplifying)
df.loc[0, 'invalidity_reasons'] = 'a reason'
Col invalidity_reasons
0 0 a reason
1 1
2 2
3 3
4 4
build a filter
filt = (df.Col < 3)
and then do
df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
                                      .apply(maintain_invalidity_reasons,
                                             args=('another reason',)))
you will get
Col invalidity_reasons
0 0 [a reason, another reason]
1 1 [another reason]
2 2 [another reason]
3 3
4 4
Does that somehow resemble what you are looking for?
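For what it's worth, the args=('another reason',) argument above passes the extra positional argument to the function after each cell value; an equivalent way to write it would be with a lambda, roughly:

df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
                                      .apply(lambda r: maintain_invalidity_reasons(r, 'another reason')))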
Since my last post lacked some information:
Example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the data point was created.
deviceID mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by calculating the vehicle's speed from the timestamps and the mileage, and comparing it to the vehicle's maximum speed (80 km/h). The result should then be written to the original dataset.
What I've done so far is the following:
maxSpeedKMH = 80  #maximum speed of the vehicle in km/h

df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['validPosition'] = 0

for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    #iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        #calculate speed
        if (group.mileage.values[i]/timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1

df_ori.validPosition.value_counts()
It definitely works the way I want it to, but performance is poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate your help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
#create new col and set all values to false
df_ori['valid'] = 0

df_ori = df_ori.sort_values(['position_timestamp_measure'])

# Subtract the preceding value from the current value within each device group
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')

# The operation above produces NaN for the first value in each group,
# so fill 'valid' with 1 there, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1

df_ori['timeHours'] = df_ori['timeGoneSec']/3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1

# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for-loops, especially row-by-row iteration, which can be insanely slow.
Since I don't have the full context of your code, please double-check the logic and make sure it works as desired.
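As a rough, self-contained illustration of the groupby/diff idea with made-up numbers (not your actual data or column values):

import pandas as pd

toy = pd.DataFrame({
    'device_id': [1, 1, 2, 2],
    'position_timestamp_measure': [0, 3600, 0, 1800],
    'mileage': [0, 70, 0, 50],
})
toy = toy.sort_values('position_timestamp_measure')
# seconds elapsed since the previous message of the same device (NaN for the first one)
toy['timeGoneSec'] = toy.groupby('device_id')['position_timestamp_measure'].diff()
# implied speed in km/h; compare against the 80 km/h limit
toy['speed_kmh'] = toy['mileage'] / (toy['timeGoneSec'] / 3600)
toy['valid'] = toy['timeGoneSec'].isna() | (toy['speed_kmh'] <= 80)
print(toy)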
How do you check whether a value exists in a specific row?
For example, I have this file, which contains the following:
ID Name
1 Mark
2 John
3 Mary
If the user inputs 1, it will
print("the value already exist.")
But if the user inputs 4, it will add a new row containing 4 and
name = input('Name')
and update the file like this
ID Name
1 Mark
2 John
3 Mary
4 (userinput)
An easy approach would be:
import pandas as pd

bool_val = False
for i in range(0, df.shape[0]):
    if str(df.iloc[i]['ID']) == str(input_str):
        bool_val = False
        break
    else:
        print("there")
        bool_val = True

if bool_val == True:
    df = df.append(pd.Series([input_str, name], index=['ID', 'Name']), ignore_index=True)
Remember to add the parameter ignore_index to avoid TypeError. I added a bool value to avoid appending a row multiple times.
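If the loop ever becomes a bottleneck, a vectorized membership check avoids iterating the rows entirely; a minimal sketch (assuming input_str and name have already been read from the user, and that the DataFrame has a default 0..n-1 index):

# check whether any existing ID matches the input, without a Python-level loop
if df['ID'].astype(str).eq(str(input_str)).any():
    print("the value already exist.")
else:
    # append the new row at the end (works with a default RangeIndex)
    df.loc[len(df)] = [input_str, name]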
searchid = 20  #use sys.argv[1] if needed to be passed as argument to the program. Or read it as raw_input

if str(searchid) in df.index.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:") #use input() if python version >= 3
    df.loc[searchid] = [str(name)]
If ID is not the index:
if str(searchid) in df.ID.values.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:") #use input() if python version >= 3
    df.loc[searchid] = [str(searchid), str(name)]
Specifying the column headers to update during the df update might avoid mismatch errors:
df.loc[searchid]={'ID': str(searchid), 'Name': str(name)}
This should help
Also see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html, which mentions that append and concat inherently copy the full DataFrame.
df.loc[searchid] will return the row for that ID, assuming the IDs are the index values of the df you are referring to.
If you have a list of IDs and wish to search for them all together then:
assuming:
listofids=['ID1','ID2','ID3']
df.loc[listofids]
will yield the rows containing the above IDs
If IDs are not in the index:
Assuming df['ids'] contains the given IDs:
searchedID in df.ids.values
will return True or False based on presence or absence
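A small, self-contained sketch of both lookups, using made-up data:

import pandas as pd

df = pd.DataFrame({'ids': ['ID1', 'ID2', 'ID3'],
                   'Name': ['Mark', 'John', 'Mary']})

# membership check on a non-index column
print('ID2' in df.ids.values)          # True

# look up several IDs at once after moving them into the index
indexed = df.set_index('ids')
print(indexed.loc[['ID1', 'ID3']])     # rows for ID1 and ID3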
I have a for-loop that looks like this, but it takes a long time to run when a large dataset is passed into it.
for i in range(0, len(data_sim.index)):
    for j in range(1, len(data_sim.columns)):
        user = data_sim.index[i]
        activity = data_sim.columns[j]
        if dt_full.loc[i][j] != 0:
            data_sim.loc[i][j] = 0
        else:
            activity_top_names = data_neighbours.loc[activity][1:dt_length]
            activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
            user_purchases = data_activity.loc[user, activity_top_names]
            data_sim.loc[i][j] = getScore(user_purchases, activity_top_sims)
In the for-loop, data_sim looks like this:
CustomerId A B C D E
1 NAs NAs NAs NAs NAs
2 ..
I tried to reproduce the same process with an apply function, which looks like this:
def test(cell):
    user = cell.index
    activity = cell
    activity_top_names = data_neighbours.loc[activity][1:dt_length]
    activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
    user_purchase = data_activity_index.loc[user, activity_top_names]
    if dt_full.loc[user][activity] != 0:
        return cell.replace(cell, 0)
    else:
        re = getScore(user_purchase, activity_top_sims)
        return cell.replace(cell, re)
For the function, data_sim2 looks like this; I set the 'CustomerId' column as the index and duplicated the activity name in each activity column.
CustomerId(Index) A B C D E
1 A B C D E
2 A B C D E
Inside the function test(cell), if the cell is at data_sim2[1][0]:
cell.index = 1 # userId
cell # activity name
The whole idea of the for-loop is to fill the 'data_sim' table with scores based on the position of each cell. I used the same idea when creating the function, performing the same calculation for each cell, and then applied it to the data table data_sim2:
data_test = data_sim2.apply(lambda x: test(x))
It gave me an error saying
"sort_values() missing 1 required positional argument: 'by'"
which is odd, because this issue did not happen inside the for-loop. It sounds like 'data_corr.loc[activity]' is still a DataFrame instead of a Series.
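That diagnosis fits how .loc behaves: with a single label it returns a Series (so sort_values() needs no argument), but with a list-like of labels, which is effectively what each whole column handed over by apply is, it returns a DataFrame, and DataFrame.sort_values requires the by argument. A minimal sketch with made-up data:

import pandas as pd

data_corr = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['A', 'B'])

print(data_corr.loc['A'])          # single label   -> Series, .sort_values() works
print(data_corr.loc[['A', 'B']])   # list of labels -> DataFrame, .sort_values() needs by=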
I need to use a DataFrame as a lookup table on columns that are not part of the index. For example (this is a simple one just to illustrate):
import pandas as pd
westcoast = pd.DataFrame([['Washington','Olympia'],['Oregon','Salem'],
['California','Sacramento']],
columns=['state','capital'])
print westcoast
state capital
0 Washington Olympia
1 Oregon Salem
2 California Sacramento
It's easy to look up and get a Series as an output:
westcoast[westcoast.state=='Oregon'].capital
1 Salem
Name: capital, dtype: object
but I want to obtain the string 'Salem':
westcoast[westcoast.state=='Oregon'].capital.values[0]
'Salem'
and the .values[0] seems somewhat clunky... is there a better way?
(FWIW: my real data has maybe 50 rows at most, but lots of columns, so if I do set an index column, no matter what column I choose, there will be a lookup operation like this that is not based on an index, and the relatively small number of rows means that I don't care if it's O(n) lookup.)
Yes, you can use Series.item if the lookup always returns exactly one element from the Series:
westcoast.loc[westcoast.state=='Oregon', 'capital'].item()
The edge cases can be handled if the lookup returns nothing, or returns one or more values and you need only the first item:
s = westcoast.loc[westcoast.state=='Oregon', 'capital']
s = np.nan if s.empty else s.iat[0]
print (s) #Salem
s = westcoast.loc[westcoast.state=='New York', 'capital']
s = np.nan if s.empty else s.iat[0]
print (s)
nan
A more general solution handles all three possible output scenarios:
westcoast = pd.DataFrame([['Washington','Olympia'],['Oregon','Salem'],
['California','Sacramento'],['Oregon','Portland']],
columns=['state','capital'])
print (westcoast)
state capital
0 Washington Olympia
1 Oregon Salem
2 California Sacramento
3 Oregon Portland
s = westcoast.loc[westcoast.state=='Oregon', 'capital']
#if no value is returned
if s.empty:
    s = 'no match'
#if only one value is returned
elif len(s) == 1:
    s = s.item()
else:
    # if multiple values are returned, return a list of values
    s = s.tolist()
print (s)
['Salem', 'Portland']
It is possible to create a lookup function:
def look_up(a):
    s = westcoast.loc[westcoast.state==a, 'capital']
    #no match
    if s.empty:
        return np.nan
    #exactly one match
    elif len(s) == 1:
        return s.item()
    else:
        #multiple matches: return a list of values
        return s.tolist()
print (look_up('Oregon'))
['Salem', 'Portland']
print (look_up('California'))
Sacramento
print (look_up('New Yourk'))
nan
If you are going to do frequent lookups of this sort, then it pays to make state the index:
state_capitals = westcoast.set_index('state')['capital']
print(state_capitals['Oregon'])
# Salem
With an index, each lookup is O(1) on average, whereas westcoast['state']=='Oregon' requires O(n) comparisons. Of course, building the index is also O(n), so you would need to do many lookups for this to pay off.
At the same time, once you have state_capitals the syntax is simple and dict-like. That might be reason enough to build state_capitals.
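And because state_capitals behaves like a dict, a missing key can be handled with .get instead of a try/except; a small sketch (using the original three-row westcoast from the question):

state_capitals = westcoast.set_index('state')['capital']
print(state_capitals.get('Oregon'))          # Salem
print(state_capitals.get('Nevada', 'n/a'))   # default returned for a missing state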