Check pandas column for successive row values - python

I have:
    hi
0    1
1    2
2    4
3    8
4    3
5    3
6    2
7    8
8    3
9    5
10   4
I have a list of lists and single integers like this:
[[2,8,3], 2, [2,8]]
For each item in the main list, I want to find the index at which it first appears in the column.
For a single integer (e.g. 2), I want the index of its first appearance in the hi column (index 1; I am not interested in later appearances, e.g. index 6).
For a list within the list, I want the index of the last element of the first place where the whole list appears in order in that column.
So [2,8,3] appears in order at indexes 6, 7 and 8, and I want 8 to be returned. Note that 2, 8 and 3 also appear earlier, but they are interrupted by a 4, so I am not interested in that occurrence.
I have so far used:
for c in chunks:
    # different method if single note chunk vs. multi
    if type(c) is int:
        # give first occurrence of correct single notes
        single_notes = df1[df1['user_entry_note'] == c]
        single_notes_list.append(single_notes)
    # for multi chunks
    else:
        multi_chunk = df1['user_entry_note'].isin(c)
        multi_chunk_list.append(multi_chunk)

You can do it with np.logical_and.reduce + shift. But there are a lot of edge cases to deal with:
import numpy as np
def find_idx(seq, df, col):
    if not isinstance(seq, list):  # single value, not a list
        s = df[col].eq(seq)
        if s.any():  # if something matched
            idx = s.idxmax()  # label of the first True
        else:
            idx = np.nan
    elif seq:  # a list that isn't empty
        seq = seq[::-1]  # reversed, so the match lands on the last index
        m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])
        s = df.loc[m]
        if not s.empty:  # if something matched
            idx = s.index[0]
        else:
            idx = np.nan
    else:  # empty list
        idx = np.nan
    return idx
l = [[2,8,3], 2, [2,8]]
[find_idx(seq, df, col='hi') for seq in l]
#[8, 1, 7]
l = [[2,8,3], 2, [2,8], [], ['foo'], 'foo', [1,2,4,8,3,3]]
[find_idx(seq, df, col='hi') for seq in l]
#[8, 1, 7, nan, nan, nan, 5]
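As an alternative sketch (not part of the answer above), the same lookup can be written with NumPy's sliding_window_view, assuming NumPy 1.20 or later. The helper name first_match_end is made up for illustration, and it returns positions rather than index labels; the two coincide only for a default RangeIndex like the one in the question.

```python
import numpy as np
import pandas as pd


def first_match_end(values, seq):
    """Position of the last element of the first contiguous, in-order
    occurrence of seq in values, or None if there is no match.

    Hypothetical helper: a scalar seq is treated as a length-1 sequence.
    """
    values = np.asarray(values)
    seq = np.atleast_1d(np.asarray(seq))
    n = len(seq)
    if n == 0 or n > len(values):
        return None
    # One row per window of length n: shape (len(values) - n + 1, n)
    windows = np.lib.stride_tricks.sliding_window_view(values, n)
    # Start positions of the windows that equal seq element by element
    hits = np.flatnonzero((windows == seq).all(axis=1))
    return int(hits[0]) + n - 1 if hits.size else None


df = pd.DataFrame({"hi": [1, 2, 4, 8, 3, 3, 2, 8, 3, 5, 4]})
result = [first_match_end(df["hi"].to_numpy(), s) for s in [[2, 8, 3], 2, [2, 8]]]
print(result)  # [8, 1, 7]
```

If the frame has a non-default index, the returned position can be mapped back to a label with df.index[position].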

Related

Change some values in column if condition is true in Pandas dataframe without loop

I have the following dataframe:
import pandas as pd
d_test = {
    'random_staff' : ['gfda', 'fsd','gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
The cluster_number column contains values from 1 to n. Values may repeat, but no value in that range is missing. In the example above the values are 1, 2, 3, 4.
I want to select a value from the cluster_number column and replace each of its occurrences after the first with a new unique value, so that the range still has no gaps. For example, if we select value 2, the desired outcome for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]. There were three 2s in the column: we kept the first one as 2, changed the next occurrence to 5, and changed the last occurrence to 6.
I wrote code for the logic above and it works fine:
cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change
for index, row in df_test.iterrows():
    if row['cluster_number'] == cluster_number_to_change:
        df_test.loc[index, 'cluster_number'] = i
        if first_iter:
            i = max_cluster + 1
            first_iter = False
        else:
            i += 1
But it is written as a for loop, and I am trying to understand whether it can be rewritten with the pandas .apply method (or any other efficient vectorized solution).
Using boolean indexing:
# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()
# increment duplicates using the max as reference
df_test.loc[m1&m2, 'cluster_number'] = (
    m2.where(m1).cumsum()
      .add(df_test['cluster_number'].max())
      .convert_dtypes()
)
print(df_test)
Output:
  random_staff  cluster_number
0         gfda               1
1          fsd               2
2          gec               3
3          erw               3
4           gd               5
5         kjhk               1
6           fd               4
7          kui               6
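For comparison, here is a minimal sketch of another vectorized route that does not rely on duplicated(): mark every occurrence of the chosen value after the first with a running count, then hand those rows consecutive new numbers. The names target, repeats and start are illustrative only and do not come from the answer above.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    "random_staff": ["gfda", "fsd", "gec", "erw", "gd", "kjhk", "fd", "kui"],
    "cluster_number": [1, 2, 3, 3, 2, 1, 4, 2],
})

target = 2  # the cluster number to split up
mask = df_test["cluster_number"].eq(target)
# cumsum counts occurrences so far; anything past the first is a repeat
repeats = mask & mask.cumsum().gt(1)
# new labels continue right after the current maximum
start = df_test["cluster_number"].max() + 1
df_test.loc[repeats, "cluster_number"] = np.arange(start, start + repeats.sum())

print(df_test["cluster_number"].tolist())  # [1, 2, 3, 3, 5, 1, 4, 6]
```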

How to compare every value in a Pandas dataframe to all the next values?

I am learning Pandas and I am moving my Python code to Pandas. I want to compare every value with the values that follow it, using a function: the first with the second and so on, then the second with the third but not with the first, because that comparison was already done. In plain Python I use two nested loops over a list:
def match_values(a, b):
    ...  # do some stuff
l = ['a', 'b', 'c']
length = len(l)
for i in range(length):
    for j in range(i + 1, length):  # starts after i, not from the start!
        if match_values(l[i], l[j]):
            ...  # do some stuff
How do I do a similar technique in Pandas when my list is a column in a dataframe? Do I simply reference every value like before or is there a clever "vector-style" way to do this fast and efficient?
Thanks in advance,
Jo
Can you please check this? For each row it produces a list holding the result of comparing that row's value with every later value.
>>> import pandas as pd
>>> import numpy as np
>>> val = [16,19,15,19,15]
>>> df = pd.DataFrame({'val': val})
>>> df
   val
0   16
1   19
2   15
3   19
4   15
>>> df['match'] = df.apply(lambda x: [ (1 if (x['val'] == df.loc[idx, 'val']) else 0) for idx in range(x.name+1, len(df)) ], axis=1)
>>> df
   val         match
0   16  [0, 0, 0, 0]
1   19     [0, 1, 0]
2   15        [0, 1]
3   19           [0]
4   15            []
Yes, you can use vectorized comparison, since pandas is built on NumPy:
df['columnname'] > 5
This results in a Boolean Series. If you also want to return the matching part of the dataframe:
df[df['columnname'] > 5]
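To tie the two answers together, here is a sketch of the "each value against all later values" comparison done with NumPy broadcasting, assuming the match is plain equality. An arbitrary Python match function would not vectorize this way. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [16, 19, 15, 19, 15]})
vals = df["val"].to_numpy()

# equal[i, j] is True when vals[i] == vals[j]
equal = vals[:, None] == vals[None, :]
# keep only pairs with i < j, so each pair is visited once
i, j = np.triu_indices(len(vals), k=1)
hit = equal[i, j]
pairs = [(int(a), int(b)) for a, b in zip(i[hit], j[hit])]

print(pairs)  # [(1, 3), (2, 4)]
```

The full comparison matrix needs memory proportional to the square of the column length, so this only suits columns of moderate size.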

Converting a list with no tuples into a data frame

Normally, when you want to turn a set of data into a Data Frame, you make a list for each column, create a dictionary from those lists, then create a data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining the lists one by one isn't going to work. Instead I decided to make a single list and iteratively put a chunk of it into each column of a Data Frame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list; it looks like this: lst = [0, 1, …, 9]
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
   a   b   c   d   e
0  0 NaN NaN NaN NaN
1  1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
df = pd.DataFrame(alst.reshape(2,-1, order='F'), columns = [*'abcde'])
print(df)
Output:
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
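For the 75-column case, the reshape idea can be wrapped in a small helper. This is only a sketch: the function name list_to_frame is made up, and it assumes the list length is an exact multiple of the number of columns.

```python
import numpy as np
import pandas as pd


def list_to_frame(values, columns):
    """Fill the columns one after another with consecutive chunks of values.

    Hypothetical helper; raises if the values do not divide evenly.
    """
    arr = np.asarray(values)
    if arr.size % len(columns) != 0:
        raise ValueError("length of values must be a multiple of the column count")
    # One row per column chunk, then transpose so chunks become columns
    return pd.DataFrame(arr.reshape(len(columns), -1).T, columns=columns)


df = list_to_frame(list(range(10)), list("abcde"))
print(df)
```

Reshaping to (n_columns, -1) and transposing gives the same layout as reshape(2, -1, order='F') above, without having to know the row count in advance.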

How to code for longest progressive sequence

A sequence is said to be progressive if it never decreases over time.
E.g. 1 1 2 2 is a progressive sequence, but 1 2 1 is not.
Let S be a sequence of L space-separated integers Ki (where i = 1, 2, 3, ..., L). The task is to find the longest progressive sequence in S.
a = int(input())  # length of the sequence
b = input()  # sequence S
if int(b[-1]) >= int(b[-3]):
    print(b)
else:
    for i in range(a+2):
        print(b[i], end='')
Output 1:
4
1 1 2 1
1 1 2
Output 2:
4
1 3 2 1
1 3 2 (but the correct answer is 1 3)
I think your code is too short to check for progressive sequences and works only for the one example you provided.
I'll give it a try:
# get some sequence here
seq = [1, 2, 4, 3, 5, 6, 7]
# store the first value
_v = seq[0]
# construct a list of lists
cnts = list()
# and store the first value into this list
cnt = [_v]
# iterate over the values starting from 2nd value
for v in seq[1:]:
    if v < _v:
        # if the new value is smaller, we have to append our current list and restart
        cnts.append(cnt)
        cnt = [v]
    else:
        # else we append to the current list
        cnt.append(v)
    # store the current value as last value
    _v = v
else:
    # append the last list to the results
    cnts.append(cnt)
# get the longest subsequence
print(max(cnts, key=len))
Output:
[3, 5, 6, 7]
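A variant sketch of the same idea that only tracks the start and length of the best run instead of storing every run. The function name longest_progressive is illustrative, and on ties it returns the earliest run.

```python
def longest_progressive(seq):
    """Return the longest contiguous non-decreasing run of seq.

    Hypothetical helper; on ties the earliest run wins.
    """
    if not seq:
        return []
    best_start, best_len = 0, 1
    start = 0  # start of the run currently being extended
    for i in range(1, len(seq)):
        if seq[i] < seq[i - 1]:
            start = i  # a decrease ends the current run
        run_len = i - start + 1
        if run_len > best_len:
            best_start, best_len = start, run_len
    return seq[best_start:best_start + best_len]


print(longest_progressive([1, 2, 4, 3, 5, 6, 7]))  # [3, 5, 6, 7]
print(longest_progressive([1, 3, 2, 1]))           # [1, 3]
```

With the input format from the question, the sequence could be read with something like [int(x) for x in input().split()] before calling the function.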

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. I also have a few lists containing index values sampled according to certain criteria. Now I want to create columns with labels based on whether or not the index of a given row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and since I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need double numpy.where with Index.isin :
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
               np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
               np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
   A        2_name
0  8             A
1  8             A
2  3             B
3  7             B
4  7             A
5  0             B
6  4             B
7  2             A
8  5  not_assigned
9  2  not_assigned
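The nested np.where can also be written with np.select, which may read more easily once there are more than two labels. This is a sketch using the same sample lists as above; as with the nested version, the first matching condition wins.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(10)})
random_m, random_y = [0, 1], [2, 3]
model_m, model_y = [7, 4], [5, 6]

# One boolean mask per label, checked in order
conditions = [
    df.index.isin(random_m + model_m),
    df.index.isin(random_y + model_y),
]
labels = ["A", "B"]
df["2_name"] = np.select(conditions, labels, default="not_assigned")

print(df["2_name"].tolist())
```

Part of the slowdown in the original version likely comes from `ix in some_list`, which scans the list for every row, while Index.isin does the membership test for the whole index in one call.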
