pandas - find first occurrence - python

Suppose I have a structured dataframe as follows:
df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})
The A column has previously been sorted. I wish to find the first row index where df.A != 'a'. The end goal is to use this index to break the data frame into groups based on A.
Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A has been sorted already, it would be faster if I could just find the first index where df.A != 'a'. Therefore it is important that, whatever method you use, the scanning stops once the first element is found.
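For illustration, a minimal sketch of the end goal: once the first index is known, the sorted frame can be split at that position (this uses the argmax approach discussed in the answers below):
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# position of the first row where A != 'a'
split_at = (df.A.values != 'a').argmax()

# slice the sorted frame into the two groups using that position
first_group = df.iloc[:split_at]   # rows where A == 'a'
rest = df.iloc[split_at:]          # remaining rows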

idxmax and argmax will return the position of the maximal value or the first position if the maximal value occurs more than once.
use idxmax on df.A.ne('a')
df.A.ne('a').idxmax()
3
or the numpy equivalent
(df.A.values != 'a').argmax()
3
However, if A has already been sorted, then we can use searchsorted
df.A.searchsorted('a', side='right')
array([3])
Or the numpy equivalent
df.A.values.searchsorted('a', side='right')
3
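Since the column is sorted, the same idea extends to all group boundaries at once: one searchsorted lookup per distinct value yields every split point. A minimal sketch (assuming the distinct labels already appear in sorted order, as in the example):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# one binary search per distinct label gives every group boundary
labels = df.A.unique()                                      # array(['a', 'b'], dtype=object)
bounds = df.A.values.searchsorted(labels[1:], side='left')  # array([3])

# slice the sorted frame at those boundaries
edges = np.r_[0, bounds, len(df)]
groups = [df.iloc[i:j] for i, j in zip(edges[:-1], edges[1:])]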

I found there is a first_valid_index function for pandas DataFrames that will do the job; one could use it as follows:
df[df.A!='a'].first_valid_index()
3
However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:
df.loc[df.A!='a','A'].index[0]
Below I compare the total time (sec) of repeating the calculations 100 times for these two options and all the approaches above:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
argmax numpy: 0.0009 1.29
for loop: 0.0045 6.43
searchsorted pandas: 0.0075 10.71
idxmax pandas: 0.0267 38.14
index[0]: 0.0295 42.14
first_valid_index pandas: 0.1181 168.71
Notice that numpy's searchsorted is the winner and first_valid_index shows the worst performance. Generally, numpy algorithms are faster, and the for loop does not do so badly, but that is only because the dataframe has very few entries.
For a dataframe with 10,000 entries where the desired entries are closer to the end, the results are different, with searchsorted delivering the best performance:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
searchsorted pandas: 0.0076 10.86
argmax numpy: 0.0117 16.71
index[0]: 0.0815 116.43
idxmax pandas: 0.0904 129.14
first_valid_index pandas: 0.1691 241.57
for loop: 9.6504 13786.29
The code to produce these results is below:
import timeit
import numpy as np
import pandas as pd   # np and pd are also used outside the timed snippets

# code snippet to be executed only once
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append('''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append('''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append('''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append('''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append('''df.A.values.searchsorted('a', side='right')''')
message.append("searchsorted numpy: ")

mycode_set.append('''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
''')
message.append("for loop: ")

total_time_in_sec = []
for i in range(len(mycode_set)):
    mycode = mycode_set[i]
    total_time_in_sec.append(np.round(timeit.timeit(setup=mysetup,
                                                    stmt=mycode, number=100), 4))

output = pd.DataFrame(total_time_in_sec, index=message,
                      columns=['total_time_sec'])
output["ratio wrt fastest algo"] = \
    np.round(output.total_time_sec / output["total_time_sec"].min(), 2)
output = output.sort_values(by="total_time_sec")
display(output)   # display() requires IPython/Jupyter; use print(output) in a plain script
For the larger dataframe:
mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''

Use pandas groupby() to group by a column or list of columns, then first() to get the first value in each group.
import pandas as pd
df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

# Group df by column and get the first value in each group
grouped_df = df.groupby("A").first()

# Reset indices to match format
first_values = grouped_df.reset_index()

print(first_values)
   A  B
0  a  1
1  b  1
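If what is needed is the first row index of each group rather than the first values, a small variation (a sketch, not part of the original answer) is to drop duplicate labels, which keeps the first occurrence of each:
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# index of the first row of each group (drop_duplicates keeps the first occurrence)
first_indices = df.A.drop_duplicates().index
print(list(first_indices))   # [0, 3]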

For multiple conditions:
Let's say we have:
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
And we want to find the first item that is neither 'a' nor 'c'. We do:
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
Times:
import numpy as np
import pandas as pd
from datetime import datetime

ITERS = 1000

def pandas_multi_condition(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = s[(s != 'a') & (s != 'c')].index[0]
    print(n)
    print(datetime.now() - ts)

def numpy_bitwise_and(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
    print(n)
    print(datetime.now() - ts)

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)
Output:
pandas_multi_condition():
4
0:00:01.144767
numpy_bitwise_and():
4
0:00:00.019013
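If the list of excluded values grows, one possible variation (not benchmarked above) is np.isin, which keeps the expression short:
import numpy as np
import pandas as pd

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

# first position whose value is neither 'a' nor 'c'
excluded = ['a', 'c']
n = (~np.isin(s.values, excluded)).argmax()
print(n)   # 4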

If you just want to find the first instance without going through the entire dataframe, you can go the for-loop way.
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        print(index)
        break
The printed index is the row number of the first row where df.A != 'a'.
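A small variation on the same early-exit idea (a sketch, not timed in the benchmarks above): a generator with next() over the underlying numpy array keeps the short-circuiting but avoids indexing the DataFrame on every step, which is where most of the loop's cost comes from:
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# stops at the first hit; returns -1 if no such row exists
idx = next((i for i, v in enumerate(df.A.values) if v != 'a'), -1)
print(idx)   # 3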

You can iterate over the dataframe rows (it is slow) and write your own logic to get the value you want, for example the index of the maximum in a column:
def getMaxIndex(df, col):
    max_val = -999999
    rtn_index = 0
    for index, row in df.iterrows():
        if row[col] > max_val:
            max_val = row[col]
            rtn_index = index
    return rtn_index

Generalized Form:
index = df.loc[df.column_name == 'value_you_looking_for'].index[0]
Example:
index_of_interest = df.loc[df.A == 'a'].index[0]
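One caveat: .index[0] raises an IndexError when no row matches. A defensive variant (a sketch) checks for matches first:
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# an empty index means no row has that value
matches = df.index[df.A == 'value_you_looking_for']
index_of_interest = matches[0] if len(matches) else None
print(index_of_interest)   # None, since no row contains that placeholder value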

Related

Python loop multiple positional arguments to function from dataframe or array

I have a .csv file with columns A and B, and rows with values.
A,B
232.65,-57.48
22.69,-5.46
23.67,-7.71
I want to loop over these values with a function.
So let's create a simple function with 2 positional parameters:
def add_numbers(n1, n2):
    answer = n1 + n2
    print(answer)

# let's read the file:
import pandas as pd
df = pd.read_csv(r'myfile.csv')
print(df['A'])  # just checking column A for this example, but column B is also required for n2
0    232.65
1     22.69
2     23.67
I could also transfer those columns to an array, but I still could not loop over it. How would I loop over these with the function that requires two arguments?
arr = np.loadtxt(r'myfile.csv', delimiter=',', skiprows=1)  # skip the header row
arr
array([[232.65, -57.48],
       [ 22.69,  -5.46],
       [ 23.67,  -7.71]])
I have been trying to loop in various ways, with iter and enumerate and apply, but I keep doing something slightly wrong.
Cheers, Joonatan
You can loop and pass the values in each row to add_numbers. Then use iterrows to get the index and row values.
def add_numbers(n1, n2):
    answer = n1 + n2
    print(answer)

import pandas as pd

df = pd.read_csv(r'myfile.csv')

for index, row in df.iterrows():
    add_numbers(row['A'], row['B'])
I hope it works for you. Whenever you have to apply a custom function and want to loop through values to perform an operation, use the apply function.
Code:
import pandas as pd

df = pd.read_csv('./loop_through_2_columns.csv')

def add_numbers(n1, n2):
    answer = n1 + n2
    return answer

df['result'] = df.apply(lambda x: add_numbers(x.A, x.B), axis=1)
df.head()
Output:
        A      B  result
0  232.65 -57.48  175.17
1   22.69  -5.46   17.23
2   23.67  -7.71   15.96
try:
import pandas as pd
df = pd.DataFrame({
'A': [232.65,22.69,23.67],
'B':[-57.48,-5.46,-7.71]
})
def add_numbers(n1, n2):
answer = n1+n2
print(answer)
for i in range(len(df)):
n1 = df.loc[i,'A']
n2 = df.loc[i,'B']
add_numbers(n1,n2)
output:
175.17000000000002
17.23
15.96
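For completeness: when the per-row function reduces to plain arithmetic, as in this example, the loop can be skipped entirely by operating on whole columns (a sketch of the vectorized form, usually the fastest option):
import pandas as pd

df = pd.DataFrame({
    'A': [232.65, 22.69, 23.67],
    'B': [-57.48, -5.46, -7.71]
})

# column-wise addition, no Python-level loop
df['result'] = df['A'] + df['B']
print(df)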

Pandas how to turn column of lists into multiple columns?

I have a very large DataFrame where one column (COL) includes a range (i.e. list) of values. I want to turn this COL into individual columns labeled with the specific number and containing a 1 if the specific number is in COL else 0.
Below is my current approach. However, this is slow for a high number of OBSERVATIONS and a high MAX_VALUE.
import pandas as pd
import numpy as np
OBSERVATIONS = 100000 # number of values 600000
MAX_VALUE = 400 # 400
_ = pd.DataFrame({
    'a': np.random.randint(2, 20, OBSERVATIONS),
    'b': np.random.randint(30, MAX_VALUE, OBSERVATIONS)
})
_['res'] = _.apply(lambda x: range(x['a'], x['b']), axis=1)

for i in range(MAX_VALUE):
    _[f'{i}'] = _['res'].apply(lambda x: 1 if i in x else 0)
You can try to do the calculations in numpy and then insert the numpy array into the dataframe. This is about 5 times faster:
import pandas as pd
import numpy as np
import time

OBSERVATIONS = 100_000  # number of values 600000
MAX_VALUE = 400  # 400

_ = pd.DataFrame({
    'a': np.random.randint(2, 20, OBSERVATIONS),
    'b': np.random.randint(30, MAX_VALUE, OBSERVATIONS)
})
_['res'] = _.apply(lambda x: range(x['a'], x['b']), axis=1)

res1 = _.copy()
start = time.time()
for i in range(MAX_VALUE):
    res1[f'{i}'] = res1['res'].apply(lambda x: 1 if i in x else 0)
print(f'original: {time.time() - start}')

start = time.time()
z = np.zeros((len(_), MAX_VALUE), dtype=np.int64)
for i, r in enumerate(_.res):
    z[i, range(r.start, r.stop)] = 1
res2 = pd.concat([_, pd.DataFrame(z)], axis=1)
res2.columns = list(map(str, res2.columns))
print(f'new : {time.time() - start}')

assert res1.equals(res2)
Output:
original: 23.649751663208008
new : 4.586429595947266
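Going one step further, the inner loop over rows can also be removed with a broadcast comparison between a column index and the start/stop values. A sketch, assuming the same half-open range(a, b) semantics as the original code:
import numpy as np
import pandas as pd

OBSERVATIONS = 100_000
MAX_VALUE = 400

_ = pd.DataFrame({
    'a': np.random.randint(2, 20, OBSERVATIONS),
    'b': np.random.randint(30, MAX_VALUE, OBSERVATIONS)
})

# row i covers the half-open range [a_i, b_i); broadcasting builds the whole 0/1 matrix at once
cols = np.arange(MAX_VALUE)
z = ((cols >= _['a'].values[:, None]) & (cols < _['b'].values[:, None])).astype(np.int64)
res3 = pd.concat([_, pd.DataFrame(z, columns=[str(c) for c in cols])], axis=1)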

Count matching combinations in a pandas dataframe

I need to find a more efficient solution for the following problem:
Given is a dataframe with 4 variables in each row. I need to find the list of 8 elements that includes all the variables of a row, for the maximum possible number of rows.
A working, but very slow, solution is to create a second dataframe containing all possible combinations (basically a permutation without repetition), then loop through every combination and compare it with the initial dataframe. The number of matching rows is counted and added to the second dataframe.
import numpy as np
import pandas as pd
from itertools import combinations

df = pd.DataFrame(np.random.randint(0, 20, size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)

listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))

possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns=['M', 'N', 'O', 'P', 'Q', 'R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = ''

for x, row in dfcombi.iterrows():
    comparelist = row['List'].split(',')
    pointercounter = df.index[df['A'].isin(comparelist) & df['B'].isin(comparelist) & df['C'].isin(comparelist) & df['D'].isin(comparelist)].tolist()
    dfcombi.at[x, 'Count'] = len(pointercounter)  # write back to the frame (assigning to the iterrows row copy would not persist)
I assume there must be a way to avoid the for-loop and replace it with some pointer logic, I just cannot figure out how.
Thanks!
Your code can be rewritten as:
# working with integers is much better than strings
enums, codes = df.stack().factorize()   # enums: integer codes, codes: the unique labels

# encodings of df: one set of 4 codes per row
s = [set(x) for x in enums.reshape(-1, 4)]

# possible combinations
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(codes)), 6)])

# count the combinations with issubset
ret = [0] * len(possiblecombinations)
for a, (i, b) in product(s, enumerate(possiblecombinations)):
    ret[i] += a.issubset(b)

# the combination with maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# in codes: {0, 3, 4, 5, 17, 18}
# and in values:
codes[list(max_combination)]
# Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')
All that took about 2 seconds, as opposed to your code, which took around 1.5 minutes.

Pandas: Check if row exists with certain values

I have a two dimensional (or more) pandas DataFrame like this:
>>> import pandas as pd
>>> df = pd.DataFrame([[0,1],[2,3],[4,5]], columns=['A', 'B'])
>>> df
A B
0 0 1
1 2 3
2 4 5
Now suppose I have a numpy array like np.array([2,3]) and want to check if there is any row in df that matches the contents of my array. Here the answer should obviously be true, but e.g. np.array([1,2]) should return false, as there is no row with both 1 in column A and 2 in column B.
Surely this is easy, but I don't see it right now.
Turns out it is really easy, the following does the job here:
>>> ((df['A'] == 2) & (df['B'] == 3)).any()
True
>>> ((df['A'] == 1) & (df['B'] == 2)).any()
False
Maybe somebody comes up with a better solution which allows directly passing in the array and the list of columns to match.
Note that the parentheses around df['A'] == 2 are not optional, since the & operator binds more tightly than the == operator.
an easier way is:
a = np.array([2,3])
(df == a).all(1).any()
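To address the wish above of passing the array together with a list of columns to match, one possible sketch restricts the frame to those columns before comparing row-wise:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3], [4, 5]], columns=['A', 'B'])
a = np.array([2, 3])
cols = ['A', 'B']   # columns to match, in the same order as the values in a

exists = (df[cols] == a).all(axis=1).any()
print(exists)   # True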
If you also want to return the index where the matches occurred:
index_list = df[(df['A'] == 2)&(df['B'] == 3)].index.tolist()
To find rows where a single column equals a certain value:
df[df['column name'] == value]
To find rows where multiple columns equal different values, note the parentheses around each comparison:
df[(df["Col1"] == Value1) & (df["Col2"] == Value2) & ...]
A simple solution with a dictionary:
def check_existance(dict_of_values, df):
    v = df.iloc[:, 0] == df.iloc[:, 0]   # start from an all-True mask
    for key, value in dict_of_values.items():
        v &= (df[key] == value)
    return v.any()
import pandas as pd
df = pd.DataFrame([[0,1],[2,3],[4,5]], columns=['A', 'B'])
this_row_exists = {'A':2, 'B':3}
check_existance(this_row_exists, df)
# True
this_row_does_not_exist = {'A':2, 'B':5}
check_existance(this_row_does_not_exist, df)
# False
An answer that works with larger dataframes, so you don't need to check each column manually:
import pandas as pd
import numpy as np
#define variables
df = pd.DataFrame([[0,1],[2,3],[4,5]], columns=['A', 'B'])
a = np.array([2,3])
def check_if_np_array_is_in_df(df, a):
    # transform a into a dataframe
    da = pd.DataFrame(np.expand_dims(a, axis=0), columns=['A', 'B'])
    # drop duplicates from df
    ddf = df.drop_duplicates()
    result = pd.concat([ddf, da]).shape[0] - pd.concat([ddf, da]).drop_duplicates().shape[0]
    return result
print(check_if_np_array_is_in_df(df, a))
print(check_if_np_array_is_in_df(df, [1,3]))
If you want to return the row where the matches occurred:
resulting_row = df[(df['A'] == 2)&(df['B'] == 3)].values

Finding overlapping segments in Pandas

I have two pandas DataFrames A and B, with columns ['start', 'end', 'value'] but not the same number of rows. I'd like to set the values for each row in A as follows:
A.iloc(i) = B['value'][B['start'] < A[i,'start'] & B['end'] > A[i,'end']]
Multiple rows of B may satisfy this condition for a given i; in that case the max or sum of the corresponding rows would be the result. If none satisfies it, the value of A.iloc[i] should not be updated, or should be set to a default value of 0 (either way would be fine).
I'm interested to find the most efficient way of doing this.
import numpy as np
import pandas as pd

np.random.seed(1)
lenB = 10
lenA = 20
B_start = np.random.rand(lenB)
B_end = B_start + np.random.rand(lenB)
B_value = np.random.randint(100, 200, lenB)
A_start = np.random.rand(lenA)
A_end = A_start + np.random.rand(lenA)

# if you use dataframes:
# B_start = B["start"].values
# B_end = ...

mask = (A_start[:, None] > B_start) & (A_end[:, None] < B_end)
r, c = np.where(mask)
result = pd.Series(B_value[c]).groupby(r).max()
print(result)
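One detail from the question not handled above: rows of A with no overlapping segment in B simply do not appear in the groupby result. A small follow-up sketch, reusing result and lenA from the snippet above, fills those rows with the default value of 0 mentioned in the question:
# rows of A that matched nothing are absent from `result`; reindex fills them with 0
result = result.reindex(range(lenA), fill_value=0)
print(result)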
