Looping through a Pandas dataframe as a two dimensional array - python

At some point in a running process I have two dataframes containing standard deviation values from various sampling distributions.
dfbest keeps the smallest deviations seen so far as the best; dftmp records the current values.
So, the first time through, dftmp looks like this:
           0         1          2         3         4
0  22.552408  7.299163  15.114379  5.214829  9.124144
with dftmp.shape equal to (1, 5).
Ignoring the Pythonic constructs for a moment and treating the dataframes as 2D arrays, like a spreadsheet in Excel VBA, I write...
A:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[R][C] > dftmp[R][C]:
                dfbest[R][C] = dftmp[R][C]
B:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[C][R] > dftmp[C][R]:
                dfbest[C][R] = dftmp[C][R]
Code A fails while B works. I'd expect the opposite, but I'm new to Python, so I may be missing something. Any suggestions? I suspect there is a more appropriate .iloc solution to this.

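When you access a dataframe like that (dfbest[x][y]), x selects a column by label and y then selects a row by label within that column. That's why code B works. Code A fails because, with shape (1, 5), dfbest[0][1] asks for row label 1 in column 0, and the only row label is 0.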
Here is more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics
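As a rough illustration of the .iloc route the question asks about, here is a minimal sketch. The dftmp values below are made up for the example (only the dfbest row comes from the question), and the vectorized `where` call is one possible alternative to the loop, not the only one.

```python
import pandas as pd

# dfbest uses the row shown in the question; dftmp values are hypothetical.
dfbest = pd.DataFrame([[22.552408, 7.299163, 15.114379, 5.214829, 9.124144]])
dftmp = pd.DataFrame([[20.0, 8.0, 15.0, 6.0, 9.0]])

# Chained brackets go column first, then row ...
col_then_row = dfbest[1][0]
# ... while .iloc takes (row position, column position), like a 2D array.
row_then_col = dfbest.iloc[0, 1]

# The loop from the question, written with .iloc on a copy.
looped = dfbest.copy()
for r in range(looped.shape[0]):
    for c in range(looped.shape[1]):
        if looped.iloc[r, c] > dftmp.iloc[r, c]:
            looped.iloc[r, c] = dftmp.iloc[r, c]

# Same result without a loop: keep dfbest where it is already the
# smaller value, otherwise take the value from dftmp.
best = dfbest.where(dfbest <= dftmp, dftmp)
print(best)
```

Both versions assume the two dataframes have the same shape and labels, as in the question.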

Related

How to replace two entire columns in a df by adding 5 to the previous value?

I'm new to Python and Stack Overflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3,108,730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively, and the other columns represent different frequencies in Hz.
The df looks like this:
[image: df before adjustment]
I want to plot this df in ArcGIS, but for that I need to replace the existing (mathematical) coordinates with the real-life geographical coordinates.
The catch is that I was only given the first geographical coordinate, which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be filled by adding 5 to the previous row's value, so for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, and so on.
I have written two for loops that replace the values accordingly, but they take far too long to run because of the size of the df; after several hours I still have no result. I used the row number as the range, like this:
for i in range(0,3108729):
    if i == 0:
        df.at[i,'IDX'] = 1055000
    else:
        df.at[i,'IDX'] = df.at[i-1,'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0,3108729):
    if j == 0:
        df.at[j,'IDY'] = 6315000
    else:
        df.at[j,'IDY'] = df.at[j-1,'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works, but I'm sure there is a more time-efficient way to replace the coordinates without having to define a range. I appreciate any help!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5
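One caveat worth checking: assigning a Series aligns on the index, so the lines above rely on df having the default 0..n-1 index. A minimal sketch of a variant that ignores the index by using a plain NumPy array follows; the small dataframe and its `f1` column are stand-ins invented for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data, deliberately with a non-default index.
df = pd.DataFrame(
    {"IDX": [0.0] * 5, "IDY": [0.0] * 5, "f1": range(5)},
    index=[10, 11, 12, 13, 14],
)

n = len(df)
# np.arange(n) is 0, 1, 2, ...; scaling by 5 and adding the start value
# gives start, start + 5, start + 10, ... for every row at once.
df["IDX"] = 1055000 + 5 * np.arange(n)
df["IDY"] = 6315000 + 5 * np.arange(n)
print(df.head())
```

Because an array carries no index, the values are written purely by position, whatever the row labels are.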

Adding a smaller array to a larger array at a specified location (using variables)

Suppose I have a 6x6 matrix I want to add into a 9x9 matrix, but I also want to add it at a specified location and not necessarily in a contiguous 6x6 block.
The code below summarizes what I want to accomplish; the only difference is that I want to use variables instead of the rows 0:6 and the columns 3:9.
import numpy as np
a = np.zeros((9,9))
b = np.ones((6,6))
a[0:6,3:9] += b #Inserts the 6x6 ones matrix into the top right corner of the 9x9 zeros
Now using variables:
rows = np.array([0,1,2,3,4,5])
cols = np.array([3,4,5,6,7,8])
a[rows,3:9] += b #This works fine
a[0:6,cols] += b #This also works fine
a[rows,cols] += b #But this gives me the following error: ValueError: shape mismatch: value array of shape (6,6) could not be broadcast to indexing result of shape (6,)
I have spent hours reading through forums and trying different solutions, but nothing has worked. I need to use variables because the rows and columns are input by the user and could be any combination. This notation worked perfectly in MATLAB, where I could add b into a with any combination of rows and columns.
Explanation:
rows = np.array([0,1,2,3,4,5])
cols = np.array([3,4,5,6,7,8])
a[rows,cols] += b
You could roughly translate the last line into the following code:
for x, y, z in zip(rows, cols, b):
    a[x, y] += z
That means: rows contains the row index and cols the column index of each field you want to manipulate. Both arrays contain 6 values, so you are effectively addressing 6 elements, (0,3), (1,4), ..., (5,8), and b must therefore also contain exactly 6 values. But your b contains 6x6 values, hence the "shape mismatch". This site should contain all you need about indexing of np.arrays.
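To get the MATLAB-like behavior, where every row index is combined with every column index, one option is `np.ix_`, which builds an open mesh from the two index arrays. A minimal sketch using the arrays from the question:

```python
import numpy as np

a = np.zeros((9, 9))
b = np.ones((6, 6))
rows = np.array([0, 1, 2, 3, 4, 5])
cols = np.array([3, 4, 5, 6, 7, 8])

# np.ix_(rows, cols) reshapes rows to (6, 1) and cols to (1, 6), so the
# indexing result has shape (6, 6) and matches b.
a[np.ix_(rows, cols)] += b
print(a)
```

This assumes the user-supplied indices contain no repeats; with repeated indices, `+=` on a fancy-indexed selection applies each target only once.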

Pandas Apply/Lambda returning dataframe and not single row

New to Python and Pandas, so please bear with me here.
I have created a dataframe with 10 rows and a column called 'Distance', and I want to calculate a new column (TotalCost) with apply and a lambda function that I have created. A snippet of the function is below:
def TotalCost(Distance, m, c):
    return m * df.Distance + c
where Distance is the column in the dataframe df, while m and c are just constants that I declare earlier in the main code.
I then try to apply it in the following manner:
df = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
but when running this, I keep getting a dataframe as an output, instead of a single row.
EDIT: Adding in an example of input and desired output,
Input: df = {Distance: '1','2','3'}
if we assume m and c equal 10,
then the output of applying the function should be
df['TotalCost'] = 20,30,40
I will post the error below this, but what am I missing here? As far as I understand, my syntax is correct. Any assistance would be greatly appreciated :)
The error message:
ValueError: Wrong number of items passed 10, placement implies 1
Your lambda in apply should process only one row, so TotalCost must use its Distance argument rather than the whole df.Distance column. BTW, apply returns only the calculated column, not the whole dataframe:
def TotalCost(Distance,m,c): return m * Distance + c
df['TotalCost'] = df.apply(lambda row: TotalCost(row['Distance'],m,c),axis=1)
apply passes one row at a time to your lambda function and collects the values the lambda returns.
It returns that result as a new object (a Series when the lambda returns a single value per row) instead of altering the original dataframe, so assign it to a new column rather than back to df.
Have a look at this link; it should give you more insight:
https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
import numpy as np
import pandas as pd
def star(x,m,c):
    return x*m+c
vals=[(1,2,4),
      (3,4,5),
      (5,6,6)]
df=pd.DataFrame(vals,columns=('one','two','three'))
res=df.apply(star,axis=0,args=[2,3])
Initial DataFrame
   one  two  three
0    1    2      4
1    3    4      5
2    5    6      6
After applying the function you should get this stored in res
   one  two  three
0    5    7     11
1    9   11     13
2   13   15     15
This is a more memory-efficient and cleaner way:
df.eval('total_cost = @m * Distance + @c', inplace=True)
Update: I also sometimes stick to assign,
df = df.assign(total_cost=lambda x: TotalCost(x['Distance'], m, c))
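Since the formula is plain arithmetic, it can also be applied to the whole column at once, without apply. A minimal sketch using the example from the question (m and c both 10); the Distance values are assumed to be numbers here, not the strings shown in the question's input:

```python
import pandas as pd

m = 10
c = 10
df = pd.DataFrame({"Distance": [1, 2, 3]})

def TotalCost(distance, m, c):
    # Works for a single number and for a whole column alike.
    return m * distance + c

# Vectorized: pass the whole column, get a whole column back.
df["TotalCost"] = TotalCost(df["Distance"], m, c)

# Row-by-row version from the answer above, for comparison.
via_apply = df.apply(lambda row: TotalCost(row["Distance"], m, c), axis=1)
print(df)
```

The two approaches give the same values; the vectorized form avoids calling a Python function once per row.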

Converting for-loop to windowed function

We have a large (spark) dataframe, and we need to compute a new column. Each row is calculated from the value in the previous row in the same column (the first row in the new column is simply 1). This is trivial in a for-loop, but due to the extremely large number of rows we wish to do this in a window function. Because the input to the current row is the previous row, it is not obvious to us how we can achieve this, if it's possible.
We have a large dataframe with a column containing one of three values: A, B and C. Each of these 3 options represents a formula to compute a new value in a new column in the same row.
If it is A, then the new value is 1.
If it is B, then the new value is the same as the previous row.
If it is C, then the new value is the same as the previous row + 1.
For example:
A
B
B
C
B
A
C
B
C
A
Should become:
A 1
B 1
B 1
C 2
B 2
A 1
C 2
B 2
C 3
A 1
We can achieve this behavior as follows using a for loop (pseudocode):
for index in range(len(my_df)):
    if index == 0:
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'A':
        my_df[new_column][index] = 1
    elif my_df[letter_column][index] == 'B':
        my_df[new_column][index] = my_df[new_column][index-1]
    elif my_df[letter_column][index] == 'C':
        my_df[new_column][index] = my_df[new_column][index-1] + 1
We wish to replace the for loop with a window function. We tried using the 'lag' keyword, but the previous row's value depends on previous calculations. Is there a way to do this or is it fundamentally impossible to do this with a window (or map) function? And if it's impossible, is there an alternative that would be faster than the for loop? (A reduce function would have similar performance?)
Again, due to the extremely large number of rows, this is about performance. We should have enough RAM to hold everything in memory, but we wish the processing to be as quick as possible (and to learn how to solve analogues of this problem more generally: applying window functions that require data calculated in previous rows of that window function). Any help would be much appreciated!!
Kind regards,
Mick
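One way to look at this particular recurrence: every A starts a new block, and inside a block the value is 1 plus the number of Cs seen so far. That removes the dependence on the previously computed value and leaves only cumulative sums. The sketch below shows the idea in pandas on the example column; it is an illustration of the reformulation, not Spark code. In Spark the same two cumulative sums could presumably be expressed with window functions, assuming the dataframe has an explicit column that defines the row order.

```python
import pandas as pd

letters = pd.Series(list("ABBCBACBCA"))

# A row resets the counter if it is an A; the first row always resets.
is_reset = letters.eq("A")
is_reset.iloc[0] = True

# Running count of resets labels the block each row belongs to.
block = is_reset.cumsum()

# Within each block, count the Cs seen so far (a reset row never counts).
is_c = letters.eq("C") & ~is_reset
new_value = 1 + is_c.astype(int).groupby(block).cumsum()

print(pd.DataFrame({"letter": letters, "value": new_value}))
```

This trick works because the three rules only ever reset or increment; a recurrence that cannot be rewritten as a cumulative operation would need a different approach.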

Fastest way to apply function involving multiple dataframe

I'm trying to improve my code, and I can't find any clue for my problem.
I have 2 dataframes (let's say A and B) with the same number of rows and columns.
I want to create a third dataframe C, in which each A[x,y] element is transformed based on the B[x,y] element.
Currently I perform the operation with 2 loops, one for the x dimension and one for the y dimension:
import pandas
A=pandas.DataFrame([["dataA1","dataA2","dataA3"],["dataA4","dataA5","dataA6"]])
B=pandas.DataFrame([["dataB1","dataB2","dataB3"],["dataB4","dataB5","dataB6"]])
Result=pandas.DataFrame([["","",""],["","",""]])
def mycomplexfunction(x,y):
    return str.upper(x)+str.lower(y)
for indexLine in range(0,2):
    for indexColumn in range(0,3):
        Result.loc[indexLine,indexColumn]=mycomplexfunction(A.loc[indexLine,indexColumn],B.loc[indexLine,indexColumn])
print(Result)
              0             1             2
0  DATAA1datab1  DATAA2datab2  DATAA3datab3
1  DATAA4datab4  DATAA5datab5  DATAA6datab6
but I'm searching for a more elegant and faster way to do it directly with dataframe functions.
Any ideas?
Thanks,
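Two possible directions, sketched under the assumption that A and B share the same shape and labels. If the function can be written with column-level operations (as the upper/lower example can), whole-dataframe expressions avoid the explicit loops. For an arbitrary scalar function, `np.vectorize` at least removes the index bookkeeping, although it still calls the function once per element and is mainly a convenience.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([["dataA1", "dataA2", "dataA3"], ["dataA4", "dataA5", "dataA6"]])
B = pd.DataFrame([["dataB1", "dataB2", "dataB3"], ["dataB4", "dataB5", "dataB6"]])

def mycomplexfunction(x, y):
    return x.upper() + y.lower()

# Option 1: express the function with column-wise string methods.
result_str = A.apply(lambda col: col.str.upper()) + B.apply(lambda col: col.str.lower())

# Option 2: wrap the scalar function so it accepts whole arrays.
# otypes=[object] keeps the results as Python strings.
vectorized = np.vectorize(mycomplexfunction, otypes=[object])
result_generic = pd.DataFrame(
    vectorized(A.to_numpy(), B.to_numpy()), index=A.index, columns=A.columns
)

print(result_generic)
```

Option 1 is usually the faster of the two when it applies, because the work happens per column rather than per element.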
