Iterate over two arrays looking for coincidences - python

So I have these two CSVs loaded in Python using Pandas. They both have two columns, where column A is an ID and column B is its value. I need to figure out how to explore CSV 1 looking for any of the values of CSV 2. So let's say the value of ID #5 in CSV 2 is "10", but in CSV 1 "10" is under ID #15, whereas ID #5 has a value of 20.
This is my code as of right now:
def searchValue(array_A, array_B):
    for z, b in array_b:
        for y, a in array_a:
            if(b['value'] in a['value']):
                print('true')
                return True
            else:
                print('fake')
                return False
I'd appreciate any tip on this. I've been trying to wrap array_a and array_b in len(), range(), and others with no luck; I really don't know what I'm doing wrong.
My algorithm simply needs to return True or False depending on whether any value of B is present anywhere among the values of A.
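For that stated goal, the nested loops can be avoided entirely with a vectorized membership test. A minimal sketch, using toy stand-ins for the two CSVs built from the question's example (the column names 'ID' and 'value' are my assumption):
import pandas as pd
# Toy stand-ins for the two CSVs (hypothetical data from the question's example)
array_a = pd.DataFrame({'ID': [15, 5], 'value': [10, 20]})  # CSV 1
array_b = pd.DataFrame({'ID': [5], 'value': [10]})          # CSV 2
def search_value(array_a, array_b):
    # True if any value of B appears anywhere among the values of A
    return array_b['value'].isin(array_a['value']).any()
print(search_value(array_a, array_b))  # True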


remove duplicate values in the next n rows, but keeping first

I would really appreciate help on the below question; I don't really know where to start.
I have a dataframe
pd.DataFrame({'value':[1,1,2,2,1,1,1,1,1,2,1,1]})
I want to write a function that iterates through the values, and remove any duplicates in the next n rows.
For example, if n=5, starting from the first number "1", if there is any "1" in the next 5 rows, it is deleted (marked by "x"). In the next iteration, the second "1" wouldn't be used given it is deleted from the first iteration.
The resulting dataframe would be
pd.DataFrame({'value':[1,'x',2,'x','x','x',1,'x','x',2,'x','x']})
I would want to eventually drop the "x" rows but for the purpose of illustration I've marked it out.
Do you want to actually see the 'x's, or are they just to demonstrate to us which rows are to be deleted?
If the latter you could do something like this:
x1 = pd.DataFrame({'value': [1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1]})
x1['t'] = x1.index // 5
x1.drop_duplicates(subset=['value', 't']).drop(columns='t')
    value
0       1
2       2
5       1
9       2
10      1
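Note that the index // 5 trick dedups within fixed blocks of five rows, which is close to, but not exactly, the rolling behaviour the question describes (it keeps the 1 at row 5, while the question's expected output keeps the one at row 6). A minimal iterative sketch that reproduces the question's expected output; the function name is my own:
import pandas as pd
df = pd.DataFrame({'value': [1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1]})
def drop_dups_within_n(df, col='value', n=5):
    keep = []
    blocked_until = {}  # value -> first row index at which it may be kept again
    for i, v in enumerate(df[col]):
        if i < blocked_until.get(v, 0):
            continue  # duplicate within n rows of a kept value: drop it
        keep.append(i)
        blocked_until[v] = i + n + 1  # block this value for the next n rows
    return df.iloc[keep]
print(drop_dups_within_n(df))  # keeps rows 0, 2, 6, 9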

Performing the equivalent of a vlookup within a merged df in Pandas

I had no pandas/Python experience this time last week, so I have had a steep learning curve trying to transfer a complex, multi-step process that was being done in Excel into pandas. Sorry if the following is unclear.
I merged 2 dataframes. I have a column, let's call it 'new_ID', with new ID names from originaldf1, some of which say 'no match was found'. For the 'no match was found' entries I would like to get the old ID number from originaldf2, which is another column in currentdf, let's call this col 'old_ID'. So, I would like to do something like an excel vlookup where I say: "if there is 'no match was found' in col 'new_ID', give me the ID that is in col 'old_ID', in that same row". The output I would like is just a list of all the old IDs where no match was found.
I've tried a few solutions that I found on here but all just give me blank outputs. I'm assuming this is because they aren't searching each individual instance of "no match found". For example I tried:
deletes = mydf.loc[mydf['new_ID'] == "no match was found", ['old_ID']]
This outputs just the column header, then all blanks.
Is what I'm trying to do possible in pandas? Or maybe I'm stuck in Excel ways of thinking and there is a better/different way?
Welcome to Python. What you are trying to do is a straightforward task in pandas. Each column of a pandas DataFrame is a Series object; basically a list of values. You are trying to find which row numbers (aka indices) satisfy this criterion: new_id == "no match was found". This can be done by pulling the column out of the dataframe and applying a lambda function. I would recommend pasting this code in a new file and playing around to see how it works.
import pandas as pd
# Create test data frame
df = pd.DataFrame(columns=('new_id','old_id'))
df.loc[0] = (1, None)
df.loc[1] = ("no match", 4)
df.loc[2] = (3, None)
df.loc[3] = ("no match", 4)
print("\nHere is our test dataframe:")
print(df)
print("\nShow the values of the 'new_id' that meet our criteria:")
print(df['new_id'][lambda x: x == "no match"])
# Pull the index from these rows
indices = df['new_id'][lambda x: x == "no match"].index.tolist()
print("\nIndices:\n", indices)
print("\nShow only the rows of the data frame that match 'indices':")
print(df.loc[indices]['old_id'])
A couple of notes about this code:
df.loc[] refers to a specific row of a data frame by its index label. df.loc[2] refers to the 3rd row here (since pandas data frames are generally zero-indexed).
The lambda function here is called with the whole Series ('new_id'); the expression x == "no match" compares each value and produces a Series of True/False values. Indexing with the brackets [] then applies this boolean result as a mask, so that only the rows with True are returned.
After the lambda function, .index.tolist() is applied to convert the Series object to a list of its indices.
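For the original question, the same idea can be written as a single boolean-mask lookup without a lambda. A minimal sketch, using the question's column names on hypothetical toy data:
import pandas as pd
# Toy stand-in for the merged dataframe from the question
mydf = pd.DataFrame({
    'new_ID': ['A1', 'no match was found', 'B2', 'no match was found'],
    'old_ID': [101, 102, 103, 104],
})
# Old IDs for every row where no new ID was matched
deletes = mydf.loc[mydf['new_ID'] == "no match was found", 'old_ID'].tolist()
print(deletes)  # [102, 104]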
Working off your example, I'm going to assume all new_ID entries are numbers only unless there is no match.
So if your dataframe looks like this (assuming the 2nd column has some values; I didn't know, so I put 0's):
     new_ID  originaldf2
0         1            0
1         2            0
2         3            0
3  no match            4
Next we can check whether your new_ID column has an ID or not by seeing if it contains a number, using str.isnumeric():
has_id = df1.new_ID.str.isnumeric()
has_id
>>>
0     True
1     True
2     True
3    False
Name: new_ID, dtype: bool
Then finally we'll use where().
What this does is take the first argument, cond, to which we've passed the has_id boolean filter, and check whether each entry is True or False. If True, it keeps the original value; if False, it falls back to the argument given as other, which in this case we assigned to the second column of our dataframe.
df1.where(has_id, df1.iloc[:, 1], axis=0)
>>>
     new_ID  originaldf2
0         1            0
1         2            0
2         3            0
3         4            4
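Putting the pieces together, a self-contained sketch of this isnumeric/where approach (the construction of df1 is my reconstruction from the tables above; note that new_ID must hold strings for .str.isnumeric() to work):
import pandas as pd
# Hypothetical reconstruction of df1 from the answer's tables
df1 = pd.DataFrame({
    'new_ID': ['1', '2', '3', 'no match'],
    'originaldf2': [0, 0, 0, 4],
})
has_id = df1.new_ID.str.isnumeric()
print(df1.where(has_id, df1.iloc[:, 1], axis=0))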

Replace values for each group

I want to replace the values in ['animal'] for each subid/group, based on a condition.
The values in the animal column are numbers (0-3) and vary for each subid, so a (the unique values where cond == 1) might look like [0,3] for one subid, or [2,1], or [0,3], and the same goes for b.
for s in sids:
    a = df[(df['subid'] == s) & (df['cond'] == 1)]['animal'].unique()
    b = df[(df['subid'] == s) & (df['cond'] == 0)]['animal'].unique()
    df["animal"].replace({a[0]: 0, a[1]: 1, b[0]: 2, b[1]: 3})
The thing is, I think the dataframe gets overwritten entirely each time and keeps only the last iteration of the for loop, instead of saving the appropriate values for each group.
I tried specifying the subid at the beginning, like so: df[df['subid']==s]["animal"].replace({a[0]:0,a[1]:1,b[0]:2,b[1]:3}), but it didn't work.
Any pointers are appreciated, thanks!
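One likely culprit: replace() returns a new Series rather than modifying the dataframe in place, so the result of each loop iteration is discarded. A minimal sketch that assigns the recoded values back, and only to the current subid's rows (the toy data, and the assumption that each group has exactly two unique animals per condition, are mine):
import pandas as pd
# Hypothetical toy data with the question's column names
df = pd.DataFrame({
    'subid':  [1, 1, 1, 1, 2, 2, 2, 2],
    'cond':   [1, 1, 0, 0, 1, 1, 0, 0],
    'animal': [0, 3, 2, 1, 2, 1, 0, 3],
})
for s in df['subid'].unique():
    mask = df['subid'] == s
    a = df.loc[mask & (df['cond'] == 1), 'animal'].unique()
    b = df.loc[mask & (df['cond'] == 0), 'animal'].unique()
    # Assign the result back, restricted to this subid's rows
    df.loc[mask, 'animal'] = df.loc[mask, 'animal'].replace(
        {a[0]: 0, a[1]: 1, b[0]: 2, b[1]: 3})
print(df)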

Is there a way to iterate through an Excel column to check that every value is higher than its preceding value by 1? E.g. (1, 2, 3, 4, 5)

I am using the numpy and pandas modules to work with data from an Excel sheet. I want to iterate through a column and make sure each row's value is higher than the previous one's by 1.
For example, cell A1 of the Excel sheet has a value of 1; I would like to make sure cell A2 has a value of 2. And I would like to do this for the entire column of my Excel sheet.
The problem is I'm not sure if this is a good way to go about doing this.
This is the code I've come up with so far:
import numpy as np
import pandas as pd

i = 1
df = pd.read_excel("HR-Employee-Attrition(1).xlsx")
out = df['EmployeeNumber'].to_numpy().tolist()
print(out)
for i in out:
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
        i += 1
        pass
    else:
        print(out[i])
        break
It gives me the error:
IndexError: list index out of range.
Could someone advise me on how to check every row in my excel column?
If I understood the problem correctly, you may need to iterate over the length of the list -1 to avoid the out of range:
for i in range(len(out)-1):
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
        i += 1
        pass
    else:
        print(out[i])
        break
There is an easier way to achieve this, though, which is:
df['EmployeeNumber'].diff()
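A minimal sketch of the diff() approach (the toy column is my stand-in for the Excel data): diff() gives the difference between each row and the previous one, so the column increments by exactly 1 if and only if every difference is 1.
import pandas as pd
# Hypothetical stand-in for the column read from the Excel sheet
df = pd.DataFrame({'EmployeeNumber': [1, 2, 3, 4, 5]})
is_consecutive = (df['EmployeeNumber'].diff().dropna() == 1).all()
print(is_consecutive)  # True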
I don't understand why you are using a for-loop for such a thing:
I've created an Excel-sheet, with two columns, like this:
Index  Name
1      A
2      B
       C
       D
       E
I selected the two numbers (1 and 2) and double-clicked on the right-bottom corner of the selection rectangle, while recording what I was doing, and this macro got recorded:
Selection.AutoFill Destination:=Range("A2:A6")
As you see, Excel does not write a for-loop for this (a for-loop might prove to be a performance hole in the case of large Excel sheets).
The result on my Excel sheet was:
Index  Name
1      A
2      B
3      C
4      D
5      E

Python Pandas df.isin shows inaccurate results

I have a point cloud of 6 million x, y and z points that I need to process. I need to look for specific points within these 6 million xyz points, and I have been using the pandas df.isin() function to do it. I first save the 6 million points into a pandas dataframe (saved under the name point_cloud), and the specific points I need to look for into a dataframe as well (saved under the name specific_point). I only have two specific points I need to look out for, so the output of the df.isin() function should show 2 True values, but it is showing 3 instead.
In order to prove that 3 True values are wrong, I actually iterated through the 6 million point cloud looking for the two specific points using iterrows(). The result was indeed 2 True values. So why is df.isin() showing 3 instead of the correct result of 2?
I have tried this, which results in true_count being 3:
label = (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).astype(int).to_frame()
true_count = 0
for index, t_f in label.iterrows():
    if int(t_f.values) == int(1):
        true_count += 1
print(true_count)
I have tried this as well, also resulting in true_count being 3:
true_count = 0
for t_f in (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).values:
    if t_f == True:
        true_count += 1
Lastly, I tried the most inefficient way of iterating through the 6 million points using iterrows(), but this gives the correct value for true_count, which is 2:
true_count = 0
for index_sp, sp in specific_point.iterrows():
    for index_pc, pc in point_cloud.iterrows():
        if sp['x'] == pc['x'] and sp['y'] == pc['y'] and sp['z'] == pc['z']:
            true_count += 1
print(true_count)
Do anyone know why is df.isin() behaving this way? Or have I seem to overlook something?
The isin function applied to multiple columns and combined with & does not check the dataframe row by row; it effectively checks against the Cartesian product of the per-column value lists.
So what you can do is:
checked = point_cloud.merge(specific_point, on=['x', 'y', 'z'], how='inner')
For example, if one column of specific_point contains l1=[1,2] and another contains l2=[3,4], using isin this way will match any row whose pair is [1,3], [1,4], [2,3] or [2,4].
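A minimal sketch demonstrating the difference, reduced to hypothetical 2-D points for brevity:
import pandas as pd
point_cloud = pd.DataFrame({'x': [1, 2, 1], 'y': [3, 4, 4]})
specific_point = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
# Column-wise isin: the row (1, 4) also matches, because x=1 and y=4
# each appear somewhere in specific_point, even though (1, 4) itself doesn't
mask = (point_cloud['x'].isin(specific_point['x'])
        & point_cloud['y'].isin(specific_point['y']))
print(mask.sum())  # 3
# Row-wise match via merge: only exact (x, y) pairs count
checked = point_cloud.merge(specific_point, on=['x', 'y'], how='inner')
print(len(checked))  # 2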
