Best way to check every data point inside dataframe pandas - python

I have two DataFrames: one with 110,000 rows and 10 columns, and another with 47,000 rows and 8 columns. I use the second DataFrame to validate the first. Whenever a row matches, I want to take that row of the first DataFrame together with the matching row of the second DataFrame into a new DataFrame.
The check works like this: the second DataFrame has a keyword in its keyword column, and I test whether that keyword is contained in a string column of the first DataFrame.
Right now I am using two nested iterrows() loops to do the check, but that would take days to finish. Is there a more efficient way to do this?
My code looks like this:
for index, ebayrow in ebaydata.iterrows():
    make_match = [e_scrubrow for idx, e_scrubrow in etail_scrub_data.iterrows() if e_scrubrow['keyword'] in ebayrow['title']]
    nummatch = len(make_match)
    if nummatch == 0:
        continue
    else:
        model_match = [e_scrubrow for e_scrubrow in make_match if e_scrubrow['keyword2'] in ebayrow['title']]
        nummatch = len(model_match)
        if nummatch == 0:
            continue
        else:
            if nummatch == 1:
                scrubrow = model_match[0]
                ebaychecked.append(scrubrow['keyword'])
                ebaychecked1.append(scrubrow['keyword2'])
                ebaychecked2.append(scrubrow['keyword3'])
                ebaychecked7.append(ebayrow['info'])
                print(len(ebaychecked))
            else:
                year_match = [e_scrubrow for e_scrubrow in model_match if e_scrubrow['keyword3'] in ebayrow['title']]
                nummatch = len(year_match)
                if nummatch == 0:
                    scrubrow = model_match[0]
                    ebaychecked.append(scrubrow['keyword'])
                    ebaychecked1.append(scrubrow['keyword2'])
                    ebaychecked2.append(scrubrow['keyword3'])
                    ebaychecked7.append(ebayrow['info'])
                    print(len(ebaychecked))
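A possible way to speed this up, offered as a rough sketch rather than a drop-in replacement (only the column names ebaydata['title'], etail_scrub_data['keyword'] and the keyword2/keyword3 columns come from the question; the output structure is an assumption of mine), is to keep a single loop over the smaller DataFrame and let pandas run the substring test over all titles at once with str.contains:

import pandas as pd

matches = []
titles = ebaydata['title']
for scrub_idx, scrubrow in etail_scrub_data.iterrows():
    # vectorised substring test over every title at once;
    # regex=False treats the keyword as a literal string
    hit = titles.str.contains(scrubrow['keyword'], regex=False, na=False)
    for ebay_idx in titles[hit].index:
        matches.append((ebay_idx, scrub_idx))

matched_df = pd.DataFrame(matches, columns=['ebay_index', 'scrub_index'])

The same idea extends to keyword2 and keyword3 by combining several str.contains masks with &.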

Related

Matched 3 different column element of 2 different dataframe

I am trying to solve a problem where I have two dataframes, df1 and df2, with the same columns. I want to check whether df1['column1'] == df2['column1'] and df1['column2'] == df2['column2'] and df1['column3'] == df2['column3'], and when all three conditions are true, get the index of both dataframes where the match occurs. I tried the code below, but it takes a long time because the dataframes have around 250,000 rows. Can anyone suggest a more efficient way to do this?
Tried solution :
from datetime import datetime

MS_counter = 0
matched_ws_index = []
start = datetime.now()
for MS_id in Mastersheet_df["Index"]:
    WS_counter = 0
    for WS_id in Weekly_sheet_df["Index"]:
        if (Weekly_sheet_df.loc[WS_counter, "Trial ID"] == Mastersheet_df.loc[MS_counter, "Trial ID"]) and (Mastersheet_df.loc[MS_counter, "Biomarker Type"] == Weekly_sheet_df.loc[WS_counter, "Biomarker Type"]) and (WS_id == MS_id):  # match trial id
            print("Trial id, index and biomarker type are matched")
            print(WS_counter)
            print(MS_counter)
            matched_ws_index.append(WS_counter)
        WS_counter += 1
    MS_counter += 1
end = datetime.now()
print("The time of execution of above program is :",
      str(end - start)[5:])
Expected output:
If all three conditions are true, it should give the index positions in both dataframes, like this:
Matched df1 index is = 170
Matched df2 index is = 658
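One vectorised alternative, offered as a sketch rather than as the asker's own method (the column names 'Index', 'Trial ID' and 'Biomarker Type' come from the code above; the merge-based approach, the ms_index/ws_index helper columns, and the assumption that both frames have a default integer index are mine), is to merge the two frames on all three columns at once and read the matching row positions off the result:

import pandas as pd

ms = Mastersheet_df.reset_index().rename(columns={'index': 'ms_index'})
ws = Weekly_sheet_df.reset_index().rename(columns={'index': 'ws_index'})

# an inner merge keeps only rows where all three columns agree
matched = ms.merge(ws, on=['Index', 'Trial ID', 'Biomarker Type'], how='inner')

for row in matched.itertuples(index=False):
    print("Matched df1 index is =", row.ms_index)
    print("Matched df2 index is =", row.ws_index)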

Pandas for Loop Optimization(Vectorization) when looking at previous row value

I'm looking to optimize the time taken by a function with a for loop. The code below is fine for smaller dataframes, but for larger dataframes it takes too long. The function effectively creates a new column based on calculations using other column values and parameters, and the calculation also considers the value of a previous row for one of the columns. I read that the most efficient approach is Pandas vectorization, but I'm struggling to understand how to implement this when my for loop needs the previous row's value of one column to populate the new column on the current row. I'm a complete novice and have looked around without finding anything that suits this specific problem, though I'm searching from a position of relative ignorance, so I may have missed something.
The function is below, and I've created a test dataframe and random parameters too. It would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
import numpy as np
import pandas as pd

def MODE_Gain(Data, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, len(df.index)):
        print('Computing Status{i}/{r}'.format(i=i, r=len(df.index)))
        if (df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin):
            df['MODE'].loc[i] = 1
        elif ((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600)) | ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = (df['C']/6)
    for i in range(len(df.index)):
        print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600) & (df['B'].loc[i] < NormalLim1):
            df['Nominal'].loc[i] = rated/6
        else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    return df

A = np.random.randint(0, 28, size=(8000))
B = np.random.randint(0, 45, size=(8000))
C = np.random.randint(0, 2300, size=(8000))
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)

MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150

finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1)
Your second loop doesn't evaluate the prior row, so you should be able to use this instead
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif statement evaluates ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)) and sets MODE to 1 regardless of the other condition, so you can pull that part out and vectorize it. I didn't try it, but this should do it:
df.loc[(df['B'] > NormalLim1) & (df['B'] < MODELim1), 'MODE'] = 1
Then you may be able to collapse the other conditions into one statement using |.
Not sure how much all of that will save you, but you should cut the time in half just by getting rid of the 2nd loop.
For vectorizing it, I suggest you first shift your column into another one:
df['MODE_1'] = df['MODE'].shift(1)
and then use:
(df['MODE_1'].loc[i] == 1)
After that you should be able to vectorize.
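For completeness: since MODE on row i depends on MODE on row i-1, that part of the calculation is not straightforward to express as a single vectorised statement. A minimal sketch, assuming the column names and thresholds from the question, is to keep the recurrence as a plain loop over NumPy arrays, which avoids the per-row .loc overhead of the original:

import numpy as np

def compute_mode(df, Normalin, NormalLim600, NormalLim1, MODELim1):
    a = df['A'].to_numpy()
    b = df['B'].to_numpy()
    # this condition does not depend on the previous row, so precompute it
    b_cond = (b > NormalLim1) & (b < MODELim1)
    mode = np.zeros(len(df), dtype=int)
    for i in range(1, len(df)):
        if (mode[i - 1] == 1 and a[i] > Normalin) or \
           (mode[i - 1] == 0 and a[i] > NormalLim600) or b_cond[i]:
            mode[i] = 1
    return mode

# df['MODE'] = compute_mode(df, Normalin, NormalLim600, NormalLim1, MODELim1)

The second loop can still be replaced by the vectorised df.loc assignment shown in the first answer.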

How to access Pandas series value in a custom function

I'm working on a project to monitor my 5k time for my running/jogging activities based on their GPS data. I'm currently exploring my data in a Jupyter notebook & now realize that I will need to exclude some activities.
Each activity is a row in a dataframe. While I do want to exclude some rows, I don't want to drop them from my dataframe as I will also use the df for other calculations.
I've added a column to the df along with a custom function for checking the invalidity reasons of a row. It's possible that a run could be excluded for multiple reasons.
In []:
# add invalidity reasons column & update logic
df['invalidity_reasons'] = ''

def maintain_invalidity_reasons(reason):
    """logic for maintaining ['invalidity reasons']"""
    reasons = []
    if invalidity_reasons == '':
        return list(reason)
    else:
        reasons = invalidity_reasons
        reasons.append(reason)
        return reasons
I filter down to specific rows in my df and pass them to my function. The below example returns a set of five rows from the df. Below is an example of using the function in my Jupyter notebook.
In []:
columns = ['distance','duration','notes']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns].apply(maintain_invalidity_reasons('short_run'),axis=1)
Out []:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-107-0bd06407ef08> in <module>
2
3 filt = (df['duration'] < pd.Timedelta('5 minutes'))
----> 4 df.loc[filt,columns].apply(maintain_invalidity_reasons(reason='short_run'),axis=1)
<ipython-input-106-60264b9c7b13> in maintain_invalidity_reasons(reason)
5 """logic for maintaining ['invalidity reasons']"""
6 reasons = []
----> 7 if invalidity_reasons == '':
8 return list(reason)
9 else:
NameError: name 'invalidity_reasons' is not defined
Here is an example of the output of my filter if I remove the .apply() call to my function
In []:
columns = ['distance','duration', 'notes','invalidity_reasons']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns]
Out []:
It seems that my issue lies in not knowing how to specify that I want to reference the scalar value in the 'invalidity_reasons' index/column (not sure of the proper term) of the specific row.
I've tried adjusting the IF statement with the below variants. I've also tried to apply the function with/out the axis argument. I'm stuck, please help!
if 'invalidity_reasons' == '':
if s['invalidity_reasons'] == '':
This is pretty much a stab in the dark, but I hope it helps. In the following I'm using this simple frame as an example (to have something to work with):
df = pd.DataFrame({'Col': range(5)})
Now if you define
def maintain_invalidity_reasons(current_reasons, new_reason):
    if current_reasons == '':
        return [new_reason]
    if type(current_reasons) == list:
        return current_reasons + [new_reason]
    return [current_reasons] + [new_reason]
add another column invalidity_reasons to df
df['invalidity_reasons'] = ''
populate one cell (for the sake of exemplifying)
df.loc[0, 'invalidity_reasons'] = 'a reason'
Col invalidity_reasons
0 0 a reason
1 1
2 2
3 3
4 4
build a filter
filt = (df.Col < 3)
and then do
df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
                                      .apply(maintain_invalidity_reasons,
                                             args=('another reason',)))
you will get
Col invalidity_reasons
0 0 [a reason, another reason]
1 1 [another reason]
2 2 [another reason]
3 3
4 4
Does that somehow resemble what you are looking for?
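Applied back to the question, a minimal sketch (reusing the filter and the 'short_run' reason from the original post together with the rewritten function above) might look like:

filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
                                      .apply(maintain_invalidity_reasons,
                                             args=('short_run',)))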

Search for column in pandas

How do you check whether a value already exists in a specific column?
For example, I have this file which contains the following:
ID Name
1 Mark
2 John
3 Mary
If the user inputs 1, it should
print("the value already exist.")
But if the user inputs 4, it should add a new row containing 4 and
name = input('Name')
and update the file like this:
ID Name
1 Mark
2 John
3 Mary
4 (userinput)
An easy approach will be:
import pandas as pd

bool_val = False
for i in range(0, df.shape[0]):
    if str(df.iloc[i]['ID']) == str(input_str):
        bool_val = False
        break
    else:
        print("there")
        bool_val = True
if bool_val == True:
    df = df.append(pd.Series([input_str, name], index=['ID', 'Name']), ignore_index=True)
Remember to add the parameter ignore_index to avoid TypeError. I added a bool value to avoid appending a row multiple times.
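Note that DataFrame.append has since been deprecated and was removed in pandas 2.0, so on newer versions a pd.concat-based variant is needed; a minimal sketch, assuming the same df, input_str and name as above:

new_row = pd.DataFrame([{'ID': input_str, 'Name': name}])
df = pd.concat([df, new_row], ignore_index=True)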
searchid = 20  # use sys.argv[1] if needed to be passed as argument to the program. Or read it as raw_input
if str(searchid) in df.index.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:")  # use input() if python version >= 3
    df.loc[searchid] = [str(name)]
If ID is not index:
if str(searchid) in df.ID.values.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:")  # use input() if python version >= 3
    df.loc[searchid] = [str(searchid), str(name)]
Specifying column headers during the df update might avoid mismatch errors:
df.loc[searchid]={'ID': str(searchid), 'Name': str(name)}
This should help.
Also see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html, which mentions the inherent nature of append and concat to copy the full dataframe.
df.loc[searchid] will return the row whose index label matches that ID, assuming the IDs are the index values of the df you are referring to.
If you have a list of IDs and wish to search for them all together then:
assuming:
listofids=['ID1','ID2','ID3']
df.loc[listofids]
will yield the rows containing the above IDs
If the IDs are not in the index:
Assuming df['ids'] contains the given ID list:
'searchedID' in df.ids.values
will return True or False based on presence or absence

Most efficient method to modify values within large dataframes - Python

Overview: I am working with pandas dataframes of census information, while they only have two columns, they are several hundred thousand rows in length. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block ID resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, several place values are blank even though their rows have a census block ID. What I found was that, in several instances, a census block ID that is missing a place value is located within the same city as the surrounding blocks that do have a place value, especially when the bookending place values are the same. As shown above with index 5844 through 5847, the two blank blocks are located within the same general area as the surrounding blocks, but simply seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
import pandas

current_state_blockid_df = pandas.DataFrame({'BLOCKID': [60014099004021, 60014100001000, 60014100001001, 60014100001002, 60014301012019, 60014301013000, 60014301013001, 60014301013002, 60014301013003, 60014301013004, 60014301013005, 60014301013006],
                                             'PLACEFP': [53000, '', '', 53000, 11964, '', '', '', '', '', '', 11964]})

for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        # Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One possibility to speed it up slightly is to eliminate the check for where the gap ends and instead just fill it in with whatever the previous place value was before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only when the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)

criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_SUBS']

df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name).
for row in df.itertuples():
    # logic goes here; the index is available as row.Index
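As a further sketch that is not part of either answer, one way to also cover runs of several consecutive blanks is to forward-fill and back-fill the column separately and accept the fill only where the two agree, which mirrors the "same value on both sides of the gap" rule from the question (this assumes blanks are empty strings, as in the example data):

import numpy as np

# treat empty strings as missing values
placefp = df['PLACEFP'].replace('', np.nan)

forward = placefp.ffill()   # nearest non-blank value above each row
backward = placefp.bfill()  # nearest non-blank value below each row

# fill a blank only when the values on both sides of the gap match
fill_ok = placefp.isna() & (forward == backward)
df.loc[fill_ok, 'PLACEFP'] = forward[fill_ok]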
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']

    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}

    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i-1, 'PLACEFP']

    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])

    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df

df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
