Python Pandas df.isin shows inaccurate results

Python Pandas df.isin shows inaccurate results - python

I have a point cloud of 6 millions x, y and z points I need to process. I need to look for specific points within this 6 millions xyz points and I have using pandas df.isin() function to do it. I first save the 6 millions points into a pandas dataframe (save under the name point_cloud) and for the specific point I need to look for into a dateframe as well (save under the name specific_point). I only have two specific point I need to look out for. So the output of the df.isin() function should show 2 True value but it is showing 3 instead.
In order to prove that 3 True values are wrong. I actually iterate through the 6 millions point clouds looking for the two specific points using iterrows(). The result was indeed 2 True value. So why is df.isin() showing 3 instead of the correct result of 2?
I have tried this, which result true_count to be 3
label = (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).astype(int).to_frame()
true_count = 0
for index, t_f in label.iterrows():
if int(t_f.values) == int(1):
true_count += 1
print(true_count)
I have tried this as well, also resulting in true_count to be 3.
for t_f in (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).values
true_count = 0
if t_f == True:
true_count += 1
Lastly I tried the most inefficient way of iterating through the 6 millions points using iterrows() but this result the correct value for true_count which is 2.
true_count = 0
for index_sp, sp in specific_point.iterrows():
for index_pc, pc in point_cloud.iterrows():
if sp['x'] == pc['x'] and sp['y'] == pc['y'] and sp['z] == pc['z]:
true_count += 1
print(true_count)
Do anyone know why is df.isin() behaving this way? Or have I seem to overlook something?

isin function for multiple columns with and will fail to look the dataframe per row, it is more like check the product the list in dataframe .
So what you can do is
checked=point_cloud.merge(specific_point,on=['x','y','z'],how='inner')
For example, if you have two list l1=[1,2];l2=[3,4], using isin , it will return any row match [1,3],[1,4],[2,3],[2,4]

Related

How to replace two entire columns in a df by adding 5 to the previous value?

I'm new to Python and stackoverflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3 108 730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively and the other columns represent different frequencies in Hz.
The df looks like this:
df before adjustment
I want to plot this df in ArcGIS but for that I need to replace the (mathematical) coordinates that currently exist by the real life geograhical coordinates.
The trick is that I was only given the first geographical coordinate which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row value so for example, for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, .... and so on.
I have written two for loops that replace the values accordingly but my problem is that it takes much too long to run because of the size of the df and I haven't yet got a result after some hours because I used the row number as the range like this:
for i in range(0,3108729):
if i == 0:
df.at[i,'IDX'] = 1055000
else:
df.at[i,'IDX'] = df.at[i-1,'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0,3108729):
if j == 0:
df.at[j,'IDY'] = 6315000
else:
df.at[j,'IDY'] = df.at[j-1,'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works but I'm sure there is a way to replace the coordinates in a more time-efficient manner without having to define a range? I appreciate any help !!

You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5

Is there a way to iterate through an excel column to check that every values' preceding value is higher by 1? E.g (1, 2, 3, 4, 5)

I am using the numpy and pandas modules to work with data from an excel sheet. I want to iterate through a column and make sure each rows' value is higher than the previous ones' by 1.
For example, cell A1 of excel sheet has a value of 1, I would like to make sure cell A2 has a value of 2. And I would like to do this for the entire column of my excel sheet.
The problem is I'm not sure if this is a good way to go about doing this.
This is the code I've come up with so far:
import numpy as np
import pandas as pd
i = 1
df = pd.read_excel("HR-Employee-Attrition(1).xlsx")
out = df['EmployeeNumber'].to_numpy().tolist()
print(out)
for i in out:
if out[i] + 1 == out[i+1]:
if out[i] == 1470:
break
i += 1
pass
else:
print(out[i])
break
It gives me the error:
IndexError: list index out of range.
Could someone advise me on how to check every row in my excel column?

If I understood the problem correctly, you may need to iterate over the length of the list -1 to avoid the out of range:
for i in range(len(out)-1):
if out[i] + 1 == out[i+1]:
if out[i] == 1470:
break
i += 1
pass
else:
print(out[i])
break
but there is an easier way to achieve this though, which is:
df['EmployeeNumber'].diff()

I don't understand why you are using a for-loop for such a thing:
I've created an Excel-sheet, with two columns, like this:
Index Name
1 A
2 B
C
D
E
I selected the two numbers (1 and 2) and double-clicked on the right-bottom corner of the selection rectangle, while recording what I was doing, and this macro got recorded:
Selection.AutoFill Destination:=Range("A2:A6")
As you see, Excel does not write a for-loop for this (the for-loop might prove being a performance whole in case of large Excel sheets).
The result on my Excel sheet was:
Index Name
1 A
2 B
3 C
4 D
5 E

Pandas query is inconsistent

I need to calculate a sum in python based on a huge json file. It is of importance that the calculation is correct. First I add it to pandas and it gets a structure like the following, but bigger.
A B
a 1
a 2
b 2
Then I want the sum of column B where A is a. To do this I use the pandas query method. The problem is that sometimes it gives the correct answer, 3, and sometimes just 0. I have tried both of the code syntaxes below, which I think is equivalent to each other.
my_sum = df[df["A"] == "a"].sum()["B"]
my_sum = df.query("A == 'a'")['B'].sum()
Could it be the query that is run asynchronously?
How can this sum be calculated without getting any inconsistencies?
Clearifications:
my_sum sometimes equals 3
but most often
my_sum equals 0
It is more often 3 when running in the pycharm debugger.
Column B consists of floats.

Using Boolean Statements and manipulating original dataframe

So, I've got a dataframe that looks like:
with 308 different ORIGIN_CITY_NAME and 12 different UNIQUE_CARRIER.
I am trying to remove the cities where the number of unique carrier airline is < 5 As such, I performed this function:
Now, I want i'd like to take this result and manipulate my original data, df in such a way that I can remove the rows where the ORIGIN_CITY_NAME corresponds to being TRUE.
I had an idea in mind which is to use the isin() function or the apply(lambda) function in Python but I'm not familiar how to go about it. Is there a more elegant way to go about this? Thank you!

filter was made for this
df.groubpy('ORIGIN_CITY_NAME').filter(
lambda d: d.UNIQUE_CARRIER.nunique() >= 5
)
However, to continue along the vein you were attempting to get results from...
I'd use map
mask = df.groubpy('ORIGIN_CITY_NAME').UNIQUE_CARRIER.nunique() >= 5
df[df.ORIGIN_CITY_NAME.map(mask)]
Or transform
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.transform(
lambda x: x.nunique() >= 5
)
df[mask]

Fastest way to apply function involving multiple dataframe

I'm searching to improve my code, and I don't find any clue for my problem.
I have 2 dataframes (let's say A and B) with same number of row & columns.
I want to create a third dataframe C, which will transformed each A[x,y] element based on B[x,y] element.
Actually I perform the operation with 2 loop, one for x and one for y dimension :
import pandas
A=pandas.DataFrame([["dataA1","dataA2","dataA3"],["dataA4","dataA5","dataA6"]])
B=pandas.DataFrame([["dataB1","dataB2","dataB3"],["dataB4","dataB5","dataB6"]])
Result=pandas.DataFrame([["","",""],["","",""]])
def mycomplexfunction(x,y):
return str.upper(x)+str.lower(y)
for indexLine in range(0,2):
for indexColumn in range(0,3):
Result.loc[indexLine,indexColumn]=mycomplexfunction(A.loc[indexLine,indexColumn],B.loc[indexLine,indexColumn])
print(Result)
0 1 2
0 DATAA1datab1 DATAA2datab2 DATAA3datab3
1 DATAA4datab4 DATAA5datab5 DATAA6datab6
but I'm searching for a more elegant and fastway to do it directly by using dataframe functions.
Any idea ?
Thanks,

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Pandas df.isin shows inaccurate results - python

Related

How to replace two entire columns in a df by adding 5 to the previous value?

Is there a way to iterate through an excel column to check that every values' preceding value is higher by 1? E.g (1, 2, 3, 4, 5)

Pandas query is inconsistent

Using Boolean Statements and manipulating original dataframe

Fastest way to apply function involving multiple dataframe

Categories

Resources