Pandas query is inconsistent - python

I need to calculate a sum in python based on a huge json file. It is of importance that the calculation is correct. First I add it to pandas and it gets a structure like the following, but bigger.
A B
a 1
a 2
b 2
Then I want the sum of column B where A is a. To do this I use the pandas query method. The problem is that it sometimes gives the correct answer, 3, and sometimes just 0. I have tried both of the code variants below, which I think are equivalent to each other.
my_sum = df[df["A"] == "a"].sum()["B"]
my_sum = df.query("A == 'a'")['B'].sum()
Could it be that the query is run asynchronously?
How can this sum be calculated without getting any inconsistencies?
Clarifications:
my_sum sometimes equals 3
but most often
my_sum equals 0
It is more often 3 when running in the pycharm debugger.
Column B consists of floats.
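For reference, a minimal sketch of the calculation on the sample data above. Pandas evaluates both filters synchronously, so on a fixed dataframe the two forms should return the same value on every run. The checks at the end are only an assumption about what might be worth inspecting on the real data (exact key values and the dtype of B), not a diagnosis.

```python
import pandas as pd

# Sample frame from the question; B holds floats, as stated there.
df = pd.DataFrame({"A": ["a", "a", "b"], "B": [1.0, 2.0, 2.0]})

# Select column B before summing so only the numeric column is aggregated.
mask_sum = df.loc[df["A"] == "a", "B"].sum()
query_sum = df.query("A == 'a'")["B"].sum()
print(mask_sum, query_sum)

# Hypothetical checks for the real data: the exact key values
# (repr exposes stray whitespace) and the dtype of column B.
key_values = [repr(v) for v in df["A"].unique()]
b_dtype = df["B"].dtype
print(key_values, b_dtype)
```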

Related

Find Sign when Sign Changes in Pandas Column while Ignoring Zeros using Vectorization

I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as true. I'm open to different solutions including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is [3, 5], my positive column gets a +1 on that same row; all other rows are 0 in that column. Likewise, when the third column is between [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. Zeros should be ignored. I tried a few things, but they seem to create new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the runs of 1 and -1 correctly (no flip), but it flags the zeros as changes. I'm not sure whether I should change how the combined column is made, change the logic that creates the component columns, or both. The main requirement is that the code is vectorized, for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg. Then you can try something like the following:
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0)*1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) < 0)*(-1)
You can then combine your two switch columns.
Explanations
np.diff gives you the difference row by row. For the pos column it yields 1 for a 0 to 1 transition and -1 for a 1 to 0 transition. Considering your desired output, you only want the 0 to 1 transitions, which is why you keep only the differences greater than zero. For the neg column a 0 to -1 transition yields -1, so there you keep only the differences less than zero.
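A minimal sketch of the diff idea applied to the sample lists from the question. Note that the negative column is tested with < 0 here, because a switch from 0 to -1 produces a negative difference; the column name cor is taken from the question's desired output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "neg": [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})

# A 0 -> 1 switch in pos gives a positive difference: mark it with +1.
switch_pos = (np.diff(df["pos"].to_numpy(), prepend=0) > 0) * 1
# A 0 -> -1 switch in neg gives a negative difference: mark it with -1.
switch_neg = (np.diff(df["neg"].to_numpy(), prepend=0) < 0) * -1

# The two markers never fire on the same row in this data, so they can be added.
df["cor"] = switch_pos + switch_neg
print(df["cor"].tolist())
```

On the sample data this reproduces the cor list given in the question.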

How Do I Create a Dataframe from 1 to 100,000?

I am sure this is not hard, but I can't figure it out!
I want to create a dataframe that starts at 1 for the first row and ends at 100,000 in increments of 1, 2, 4, 5, or whatever. I could do this in my sleep in Excel, but is there a slick way to do this without importing a .csv or .txt file?
I have needed to do this in variations many times and just settled on importing a .csv, but I am tired of that.
Example in Excel
Generating numbers
Generating numbers is not something specific to pandas; the numpy module or the built-in range function (as mentioned by @Grismer) can do the trick. Let's say you want to generate a series of numbers and assign these numbers to a dataframe. As I said before, there are multiple approaches, two of which I personally prefer.
range function
Take range(1,1000,1) as an example. This function takes three arguments, two of which are optional. The first argument defines the start number, the second one defines the end number, and the last one sets the step of the range. So the above example will result in the numbers 1 to 999 (note that this range is a half-open interval, which is closed at the start and open at the end).
numpy.arange function
To have the same results as the previous example, take numpy.arange(1,1000,1) as an example. The arguments are completely the same as the range's arguments.
Assigning to dataframe
Now, if you want to assign these numbers to a dataframe, you can easily do this by using the pandas module. Code below is an example of how to generate a dataframe:
import numpy as np
import pandas as pd
myRange = np.arange(1,1001,1) # Could be something like myRange = range(1,1001,1)
df = pd.DataFrame({"numbers": myRange})
df.head(5)
which results in a dataframe like the following (note that only the first five rows are shown):
   numbers
0        1
1        2
2        3
3        4
4        5
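Applied to the range asked about in the question, the same approach could look like the sketch below. The helper name make_sequence_frame and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

def make_sequence_frame(start=1, stop=100_000, step=1):
    """Return a one-column dataframe counting from start up to at most stop."""
    # np.arange excludes its end point, so pass stop + 1 to allow stop itself.
    return pd.DataFrame({"numbers": np.arange(start, stop + 1, step)})

by_one = make_sequence_frame()          # 1, 2, 3, ...
by_five = make_sequence_frame(step=5)   # 1, 6, 11, ...
print(by_one.tail(1))
print(by_five.tail(1))
```

With a step other than 1 the last value is the largest one that does not exceed stop, so the sequence may end before 100,000.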
Difference of numpy.arange and range
To keep this answer short, I'd rather refer to this answer by @hpaulj

I am not creating another variable as I expect to

I am trying to create a running (moving) total of a value called var1.
For example, if var1 = 5, 4, 3, 12 for the first four values respectively, I want
9 (5+4), 7 (4+3), 15 (3+12) for the TOTAL values etc.
Instead, it is just taking 2 TIMES var1, so that the first four values of total are:
10, 8, 6, 24 etc.
This is the code I am trying. It seems to work (no errors)
import datetime
import pandas as pd
data=pd.read_csv("C:/Users/ameri/tempjohn.csv")
data.total=0
i=1
while i < 3:
    data.total+=data.var1
    i+=1
print(data.total)
can anybody help?
thanks
John
A Pandas dataframe is not a simple Python variable even if you do computations with it: it behaves more or less as a vectorized 2D array.
What happens in your code:
you set the column total of the dataframe to 0: data.total becomes a Series of the same length as the dataframe containing only 0 values
you execute (for i == 1) data.total += data.var1: as it previously contained 0 values, data.total is now a copy of (the Series) data.var1
you execute (for i == 2) data.total += data.var1: ok, data.total now contains twice the values of data.var1
end of loop because 3 < 3 is false...
What to do next:
read a Pandas tutorial if you want to go that way, but please remember that Pandas is not Python and some Pandas objects have different semantics than standard Python ones... or forget about Pandas if you only want to learn Python
If you really want to do it the Pandas way, the magic word is shift: data["total"] = data.var1 + data.var1.shift() (bracket assignment is used here because attribute assignment does not create a new column)
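A minimal sketch of the shift approach, using the four sample values from the question in place of the CSV file. The rolling variant is an equivalent alternative added for comparison, and the column name total_rolling is hypothetical.

```python
import pandas as pd

# Stand-in for the CSV: the first four var1 values from the question.
data = pd.DataFrame({"var1": [5, 4, 3, 12]})

# shift() moves every value down one row, so each row is added to the previous one.
data["total"] = data["var1"] + data["var1"].shift()

# The same two-row moving total, written with a rolling window.
data["total_rolling"] = data["var1"].rolling(window=2).sum()
print(data)
```

The first row has no predecessor, so its total is NaN in both columns; the following rows hold 9, 7 and 15.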

Converting for-loop to windowed function

We have a large (spark) dataframe, and we need to compute a new column. Each row is calculated from the value in the previous row in the same column (the first row in the new column is simply 1). This is trivial in a for-loop, but due to the extremely large number of rows we wish to do this in a window function. Because the input to the current row is the previous row, it is not obvious to us how we can achieve this, if it's possible.
We have a large dataframe with a column containing one of three values: A, B and C. Each of these 3 options represents a formula to compute a new value in a new column in the same row.
If it is A, then the new value is 1.
If it is B, then the new value is the same as the previous row.
if it is C, then the new value is the same as the previous row + 1.
For example:
A
B
B
C
B
A
C
B
C
A
Should become:
A 1
B 1
B 1
C 2
B 2
A 1
C 2
B 2
C 3
A 1
We can achieve this behavior as follows using a for loop (pseudocode):
for index in range(len(my_df)):
    if index == 0:
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'A':
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'B':
        my_df[new_column][index] = my_df[new_column][index-1]
    elif my_df[letter_column][index] == 'C':
        my_df[new_column][index] = my_df[new_column][index-1] + 1
We wish to replace the for loop with a window function. We tried using the 'lag' keyword, but the previous row's value depends on previous calculations. Is there a way to do this or is it fundamentally impossible to do this with a window (or map) function? And if it's impossible, is there an alternative that would be faster than the for loop? (A reduce function would have similar performance?)
Again, due to the extremely large number of rows, this is about performance. We should have enough RAM to hold everything in memory, but we wish the processing to be as quick as possible (and to learn how to solve analogues of this problem more generally: applying window functions that require data calculated in previous rows of that window function). Any help would be much appreciated!!
Kind regards,
Mick
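One way to read the three rules is as two running counts: every A (and the first row) starts a new block at 1, and every C inside a block adds 1. The sketch below expresses that in pandas rather than Spark, as an illustration only; the column names letter and value are hypothetical. In Spark the same two cumulative sums would presumably need an explicit ordering column, which the question does not describe.

```python
import pandas as pd

df = pd.DataFrame({"letter": list("ABBCBACBCA")})

# A row resets the count if it is an 'A' or if it is the very first row.
reset = df["letter"].eq("A")
reset.iloc[0] = True

# Each reset starts a new block; the cumulative sum labels the blocks.
block = reset.cumsum()

# A 'C' adds one, unless it sits on the first row (which is forced to 1).
increment = (df["letter"].eq("C") & ~reset).astype(int)

# Value = 1 at the start of the block plus the number of 'C' rows so far in it.
df["value"] = 1 + increment.groupby(block).cumsum()
print(df)
```

Because the value only depends on counts since the last reset, no row needs the computed value of the previous row, which is what makes the loop unnecessary.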

Python Pandas df.isin shows inaccurate results

I have a point cloud of 6 million x, y and z points that I need to process. I need to look for specific points within these 6 million xyz points, and I have been using the pandas df.isin() function to do it. I first save the 6 million points into a pandas dataframe (named point_cloud) and the specific points I need to look for into a dataframe as well (named specific_point). I only have two specific points to look for, so the output of the df.isin() function should show 2 True values, but it is showing 3 instead.
To prove that 3 True values is wrong, I iterated through the 6 million points looking for the two specific points using iterrows(). The result was indeed 2 True values. So why is df.isin() showing 3 instead of the correct result of 2?
I have tried this, which results in true_count being 3:
label = (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).astype(int).to_frame()
true_count = 0
for index, t_f in label.iterrows():
    if int(t_f.values) == int(1):
        true_count += 1
print(true_count)
I have tried this as well, also resulting in true_count being 3.
true_count = 0
for t_f in (point_cloud['x'].isin(specific_point['x']) & point_cloud['y'].isin(specific_point['y']) & point_cloud['z'].isin(specific_point['z'])).values:
    if t_f == True:
        true_count += 1
Lastly, I tried the most inefficient way, iterating through the 6 million points using iterrows(), but this gives the correct value for true_count, which is 2.
true_count = 0
for index_sp, sp in specific_point.iterrows():
    for index_pc, pc in point_cloud.iterrows():
        if sp['x'] == pc['x'] and sp['y'] == pc['y'] and sp['z'] == pc['z']:
            true_count += 1
print(true_count)
Does anyone know why df.isin() behaves this way? Or have I overlooked something?
Combining isin on several columns with & does not compare the dataframes row by row. Each column is checked independently, so the result matches any combination (the Cartesian product) of the listed values.
For example, if you have two lists l1=[1,2];l2=[3,4], using isin will match any row equal to [1,3],[1,4],[2,3],[2,4]
So what you can do is
checked=point_cloud.merge(specific_point,on=['x','y','z'],how='inner')
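A small sketch of the difference, using made-up coordinates (the values below are hypothetical and only chosen so that one extra row matches column by column). The per-column isin check counts that extra row, while the inner merge only keeps rows where x, y and z match together.

```python
import pandas as pd

# Hypothetical miniature point cloud; row 1 mixes x from one target with y from the other.
point_cloud = pd.DataFrame({
    "x": [1.0, 1.0, 2.0, 5.0],
    "y": [3.0, 4.0, 4.0, 5.0],
    "z": [0.0, 0.0, 0.0, 0.0],
})
specific_point = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [0.0, 0.0]})

# Column-by-column membership: each coordinate is tested on its own.
per_column = (
    point_cloud["x"].isin(specific_point["x"])
    & point_cloud["y"].isin(specific_point["y"])
    & point_cloud["z"].isin(specific_point["z"])
)
isin_count = int(per_column.sum())

# Row-wise match: all three coordinates must agree with the same target row.
checked = point_cloud.merge(specific_point, on=["x", "y", "z"], how="inner")
merge_count = len(checked)

print(isin_count, merge_count)
```

Here the isin count is 3 and the merge count is 2, mirroring the discrepancy described in the question.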
