Python Pandas df.isin shows inaccurate results - python

I have a point cloud of 6 million x, y, z points to process. I need to find specific points within these 6 million points, and I am using the pandas df.isin() function to do it. I first load the 6 million points into a pandas DataFrame (named point_cloud) and the specific points I am looking for into another DataFrame (named specific_point). There are only two specific points, so the output of df.isin() should contain 2 True values, but it contains 3 instead.
To prove that 3 True values is wrong, I iterated through the 6 million points with iterrows(), looking for the two specific points. The result was indeed 2 True values. So why does df.isin() give 3 instead of the correct result of 2?
I have tried this, which gives a true_count of 3:
# Flag rows whose x, y and z each appear somewhere in specific_point
label = (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).astype(int).to_frame()
true_count = 0
for index, t_f in label.iterrows():
    if t_f.iloc[0] == 1:
        true_count += 1
print(true_count)
I have tried this as well, which also gives a true_count of 3:
true_count = 0
for t_f in (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).values:
    if t_f:
        true_count += 1
Lastly, I tried the most inefficient way, iterating through the 6 million points with iterrows(), and this gives the correct true_count of 2:
true_count = 0
for index_sp, sp in specific_point.iterrows():
    for index_pc, pc in point_cloud.iterrows():
        if sp['x'] == pc['x'] and sp['y'] == pc['y'] and sp['z'] == pc['z']:
            true_count += 1
print(true_count)
Does anyone know why df.isin() behaves this way? Or have I overlooked something?

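Combining isin on several columns with & does not compare the DataFrames row by row. Each column is tested independently, so a row matches whenever its x, y and z each appear somewhere in the corresponding column of specific_point. In effect you are testing against the Cartesian product of the column values.
For example, if you have two lists l1=[1,2] and l2=[3,4], isin will match any row equal to [1,3], [1,4], [2,3] or [2,4].
So what you can do instead is an inner merge on all three columns:
checked = point_cloud.merge(specific_point, on=['x','y','z'], how='inner')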
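A small sketch of the difference, using a made-up four-point cloud (the data values are assumptions for illustration, not taken from the question). The third point is not one of the two specific points, yet each of its coordinates appears in the matching column of specific_point, so the column-wise isin test counts it. A left merge with indicator=True gives a row-wise mask instead:

```python
import pandas as pd

# Hypothetical sample data: the third point (1, 4, 7) mixes coordinates
# taken from the two specific points.
point_cloud = pd.DataFrame({
    "x": [1.0, 2.0, 1.0, 5.0],
    "y": [3.0, 4.0, 4.0, 6.0],
    "z": [7.0, 8.0, 7.0, 9.0],
})
specific_point = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [7.0, 8.0]})

# Column-wise test: each coordinate is checked independently of the others.
column_wise = (
    point_cloud["x"].isin(specific_point["x"])
    & point_cloud["y"].isin(specific_point["y"])
    & point_cloud["z"].isin(specific_point["z"])
)
isin_count = int(column_wise.sum())

# Row-wise test: a left merge keeps one row per point in the cloud, and the
# _merge column says whether the whole (x, y, z) triple was found.
merged = point_cloud.merge(
    specific_point.drop_duplicates(), on=["x", "y", "z"], how="left", indicator=True
)
row_wise = (merged["_merge"] == "both").to_numpy()
merge_count = int(row_wise.sum())

print(isin_count, merge_count)
```

Like the == comparison in the iterrows() loop, the merge only matches coordinates that are exactly equal.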

Related

How to replace two entire columns in a df by adding 5 to the previous value?

I'm new to Python and Stack Overflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3,108,730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively, and the other columns represent different frequencies in Hz.
The df looks like this:
[image: df before adjustment]
I want to plot this df in ArcGIS, but for that I need to replace the existing (mathematical) coordinates with the real-life geographical coordinates.
The trick is that I was only given the first geographical coordinate, which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row's value, so for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, and so on.
I have written two for loops that replace the values accordingly, but they take much too long to run because of the size of the df. I haven't got a result after some hours, because I used the row number as the range like this:
for i in range(0,3108729):
    if i == 0:
        df.at[i,'IDX'] = 1055000
    else:
        df.at[i,'IDX'] = df.at[i-1,'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0,3108729):
    if j == 0:
        df.at[j,'IDY'] = 6315000
    else:
        df.at[j,'IDY'] = df.at[j-1,'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works, but I'm sure there is a way to replace the coordinates more efficiently without having to define a range. I appreciate any help!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5
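One caveat: assigning a pd.Series aligns on the index, so the lines above rely on df having the default 0..n-1 index. A possible variant with a NumPy array, which is assigned by position instead, is sketched here on a made-up four-row frame (the sample index and the name step are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with a non-default index to show that the
# assignment is positional.
df = pd.DataFrame({"IDX": [0, 1, 2, 3], "IDY": [0, 1, 2, 3]}, index=[10, 11, 12, 13])

step = 5
offsets = np.arange(len(df)) * step  # 0, 5, 10, ...

# A NumPy array carries no index, so values are written row by row in order.
df["IDX"] = 1055000 + offsets
df["IDY"] = 6315000 + offsets

print(df)
```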

Is there a way to iterate through an Excel column to check that every value is higher than the preceding value by 1? E.g. (1, 2, 3, 4, 5)

I am using the numpy and pandas modules to work with data from an Excel sheet. I want to iterate through a column and make sure each row's value is higher than the previous row's by 1.
For example, if cell A1 of the Excel sheet has a value of 1, I would like to make sure cell A2 has a value of 2, and so on for the entire column.
The problem is I'm not sure if this is a good way to go about it.
This is the code I've come up with so far:
import numpy as np
import pandas as pd
i = 1
df = pd.read_excel("HR-Employee-Attrition(1).xlsx")
out = df['EmployeeNumber'].to_numpy().tolist()
print(out)
for i in out:
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
        i += 1
        pass
    else:
        print(out[i])
        break
It gives me the error:
IndexError: list index out of range.
Could someone advise me on how to check every row in my excel column?
If I understood the problem correctly, you need to iterate over indices up to the length of the list minus 1, so that out[i+1] never goes out of range:
for i in range(len(out)-1):
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
    else:
        # out[i] is the last value before the sequence breaks
        print(out[i])
        break
There is an easier way to achieve this, though:
df['EmployeeNumber'].diff()
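As a rough sketch of how diff() can be turned into the actual check, here it is on a made-up series with one gap (the sample values are an assumption, not the data from the Excel file). The first difference is always NaN because the first row has no predecessor, so it is skipped or filled:

```python
import pandas as pd

# Hypothetical column with a gap between 3 and 5.
numbers = pd.Series([1, 2, 3, 5, 6], name="EmployeeNumber")

# Difference between each value and the one before it; the first is NaN.
steps = numbers.diff()

# True only if every value is exactly 1 higher than its predecessor.
is_consecutive = bool((steps.iloc[1:] == 1).all())

# Rows where the step from the previous row is not 1.
breaks = numbers[steps.fillna(1) != 1]

print(is_consecutive)
print(breaks)
```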
I don't understand why you are using a for-loop for such a thing:
I've created an Excel sheet with two columns, like this:
Index  Name
1      A
2      B
       C
       D
       E
I selected the two numbers (1 and 2) and double-clicked on the right-bottom corner of the selection rectangle, while recording what I was doing, and this macro got recorded:
Selection.AutoFill Destination:=Range("A2:A6")
As you see, Excel does not write a for-loop for this (a for-loop might prove to be a performance hole for large Excel sheets).
The result on my Excel sheet was:
Index  Name
1      A
2      B
3      C
4      D
5      E

Pandas query is inconsistent

I need to calculate a sum in Python based on a huge JSON file. It is important that the calculation is correct. First I load it into pandas, where it gets a structure like the following, but bigger.
A B
a 1
a 2
b 2
Then I want the sum of column B where A is a. To do this I use the pandas query method. The problem is that sometimes it gives the correct answer, 3, and sometimes just 0. I have tried both of the syntaxes below, which I think are equivalent.
my_sum = df[df["A"] == "a"].sum()["B"]
my_sum = df.query("A == 'a'")['B'].sum()
Could it be the query that is run asynchronously?
How can this sum be calculated without getting any inconsistencies?
Clarifications:
my_sum sometimes equals 3
but most often
my_sum equals 0
It is more often 3 when running in the PyCharm debugger.
Column B consists of floats.
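For reference, here is a minimal check of the two expressions on the three-row example (B written as floats, as stated; the names sum_mask and sum_query are made up for this sketch). Both are ordinary synchronous pandas calls, so on the same frame they return the same value. Selecting column B before summing also avoids summing the string column A along the way:

```python
import pandas as pd

# The small example frame from the question, with B as floats.
df = pd.DataFrame({"A": ["a", "a", "b"], "B": [1.0, 2.0, 2.0]})

# Boolean mask, selecting column B before summing.
sum_mask = df.loc[df["A"] == "a", "B"].sum()

# Same selection expressed with query().
sum_query = df.query("A == 'a'")["B"].sum()

print(sum_mask, sum_query)
```

If the result differs between runs, that suggests the frame itself is not the same each time. The question does not show how it is built, so this cannot be confirmed here.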

Using Boolean Statements and manipulating original dataframe

So, I've got a dataframe that looks like:
with 308 different ORIGIN_CITY_NAME and 12 different UNIQUE_CARRIER.
I am trying to remove the cities where the number of unique carrier airlines is < 5. As such, I performed this function:
Now I'd like to take this result and manipulate my original data, df, so that I remove the rows where the ORIGIN_CITY_NAME corresponds to TRUE.
I had an idea to use the isin() function or apply(lambda) in Python, but I'm not sure how to go about it. Is there a more elegant way to do this? Thank you!
filter was made for this
df.groupby('ORIGIN_CITY_NAME').filter(
    lambda d: d.UNIQUE_CARRIER.nunique() >= 5
)
However, to continue along the vein you were attempting to get results from...
I'd use map
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.nunique() >= 5
df[df.ORIGIN_CITY_NAME.map(mask)]
Or transform
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.transform(
    lambda x: x.nunique() >= 5
)
df[mask]
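To see these approaches side by side, here is a sketch on a made-up six-row frame (the cities, carriers and the lower threshold min_carriers = 2 are assumptions chosen so the tiny example has something to drop; the question uses 5):

```python
import pandas as pd

# Hypothetical sample: city A has 3 distinct carriers, B and C have 1 each.
df = pd.DataFrame({
    "ORIGIN_CITY_NAME": ["A", "A", "A", "B", "B", "C"],
    "UNIQUE_CARRIER": ["AA", "DL", "UA", "AA", "AA", "DL"],
})
min_carriers = 2  # the question uses 5; lowered for this small example

# transform broadcasts each city's distinct-carrier count back onto its rows.
carriers_per_city = df.groupby("ORIGIN_CITY_NAME")["UNIQUE_CARRIER"].transform("nunique")
kept = df[carriers_per_city >= min_carriers]

# filter keeps or drops whole groups based on the same condition.
filtered = df.groupby("ORIGIN_CITY_NAME").filter(
    lambda d: d["UNIQUE_CARRIER"].nunique() >= min_carriers
)

print(kept)
```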

Fastest way to apply function involving multiple dataframe

I'm trying to improve my code, and I can't find any clue for my problem.
I have 2 dataframes (let's say A and B) with the same number of rows and columns.
I want to create a third dataframe C, in which each element A[x,y] is transformed based on the element B[x,y].
Currently I perform the operation with 2 loops, one for the x and one for the y dimension:
import pandas
A=pandas.DataFrame([["dataA1","dataA2","dataA3"],["dataA4","dataA5","dataA6"]])
B=pandas.DataFrame([["dataB1","dataB2","dataB3"],["dataB4","dataB5","dataB6"]])
Result=pandas.DataFrame([["","",""],["","",""]])
def mycomplexfunction(x,y):
    return str.upper(x)+str.lower(y)
for indexLine in range(0,2):
    for indexColumn in range(0,3):
        Result.loc[indexLine,indexColumn]=mycomplexfunction(A.loc[indexLine,indexColumn],B.loc[indexLine,indexColumn])
print(Result)
              0             1             2
0  DATAA1datab1  DATAA2datab2  DATAA3datab3
1  DATAA4datab4  DATAA5datab5  DATAA6datab6
but I'm looking for a more elegant and faster way to do it directly with dataframe functions.
Any ideas?
Thanks,
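One possible approach for this particular function, sketched under the assumption that the real mycomplexfunction can also be written with pandas string methods (the name result is made up here): transform each frame column by column, then combine the two frames element-wise with +, which aligns on row and column labels.

```python
import pandas as pd

A = pd.DataFrame([["dataA1", "dataA2", "dataA3"], ["dataA4", "dataA5", "dataA6"]])
B = pd.DataFrame([["dataB1", "dataB2", "dataB3"], ["dataB4", "dataB5", "dataB6"]])

# Upper-case A and lower-case B one column at a time, then concatenate
# the strings element by element.
result = A.apply(lambda col: col.str.upper()) + B.apply(lambda col: col.str.lower())

print(result)
```

If the function cannot be expressed with vectorized methods, some form of element-wise loop is still needed under the hood.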
