Python numpy vectorization for heat dispersion

I'm supposed to write a code to represent heat dispersion using the finite difference formula given below.
๐‘ข(๐‘ก)๐‘–๐‘—=(๐‘ข(๐‘กโˆ’1)[๐‘–+1,๐‘—] + ๐‘ข(๐‘กโˆ’1) [๐‘–โˆ’1,๐‘—] +๐‘ข(๐‘กโˆ’1)[๐‘–,๐‘—+1] + ๐‘ข(๐‘กโˆ’1)[๐‘–,๐‘—โˆ’1])/4
The formula is supposed to produce the result only for a time step of 1. So, if an array like this was given:
100 100 100 100 100
100 0 0 0 100
100 0 0 0 100
100 0 0 0 100
100 100 100 100 100
The resulting array at time step 1 would be:
100 100 100 100 100
100 50 25 50 100
100 25 0 25 100
100 50 25 50 100
100 100 100 100 100
I know the representation using for loops would be as follows, where the array would have a minimum of 2 rows and 2 columns as a precondition:
h = np.copy(u)
for i in range(1, h.shape[0]-1):
    for j in range(1, h.shape[1]-1):
        num = u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1]
        h[i][j] = num / 4
But I cannot figure out how to vectorize the code to represent heat dispersion. I am supposed to use numpy arrays and vectorization and am not allowed to use for loops of any kind. I think I am supposed to rely on slicing, but I cannot figure out how to write it and have only started with:
r, c = h.shape
if c == 2 or r == 2:
    return h
I'm sure that if rows == 2 or columns == 2 then the array is returned as is, but correct me if I'm wrong. Any help would be greatly appreciated. Thank you!

Try:
h[1:-1,1:-1] = (h[2:,1:-1] + h[:-2,1:-1] + h[1:-1,2:] + h[1:-1,:-2]) / 4
This solution uses slicing, where:
1:-1 stands for indices 1, 2, ..., LAST - 1
2: stands for 2, 3, ..., LAST
:-2 stands for 0, 1, ..., LAST - 2
During each step only the inner elements (indices 1..LAST-1) are updated; the whole right-hand side is evaluated before the assignment, so every cell is computed from the previous time step's values, just like in the double loop.
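Putting the slice update and the early-return check together, here is a minimal runnable sketch (the function name disperse and the float copy are additions for illustration, not part of the original answer):
import numpy as np

def disperse(u):
    # One time step of the finite-difference update; assumes u is a 2-D numeric array.
    h = np.copy(u).astype(float)   # float copy avoids integer truncation of the averages
    r, c = h.shape
    # Nothing to update when there is no interior (the 2-rows / 2-columns precondition).
    if r == 2 or c == 2:
        return h
    # Average of the four neighbours, written with the slices explained above.
    h[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]) / 4
    return h

u = np.array([[100, 100, 100, 100, 100],
              [100,   0,   0,   0, 100],
              [100,   0,   0,   0, 100],
              [100,   0,   0,   0, 100],
              [100, 100, 100, 100, 100]])
print(disperse(u))   # reproduces the time-step-1 grid from the question (as floats)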

Related

Optimizing Memory Allocations of Pandas Code to Process Rows Using Explicit Loops with Numba Optimization

Assume I have data in the form (As a Pandas' Data Frame):
Index  ID  Value  Div Factor  Weighted Sum
1      1   2      1
2      1   3      2
3      2   6      1
4      1   1      3
5      2   3      2
6      2   9      3
7      2   8      4
8      3   5      1
9      3   6      2
10     1   8      4
11     3   2      3
12     3   7      4
I want to calculate the column Weighted Sum as follows (for the $i$-th row; a reference sketch follows the worked examples below):
1. Look at all values from row 1 to i.
2. Sum the values by groups according to each row's ID. This gives k sums, where k is the number of unique ID values in rows 1 to i.
3. Divide each of the k sums by the number of elements in its group.
4. Sum those k values and divide by k (the average of the averages).
For example, let's do rows 1, 7 and 12:
Row 1
For i = 1 we have a single value, hence the sum is 2, the average of the single group is 2, and the average over all groups is 2.
Row 7
For i = 7 we have only 2 unique ID values in rows 1 to 7: 1 and 2.
For the group of ID = 1 we have: (1 + 3 + 2) / 3 = 2.
For the group of ID = 2 we have: (8 + 9 + 3 + 6) / 4 = 6.5.
Then the average of averages is (2 + 6.5) / 2 = 4.25.
Row 12
For i = 12 we have 3 unique ID values in rows 1 to 12.
For the group of ID = 1 we have: (8 + 1 + 3 + 2) / 4 = 3.5.
For the group of ID = 2 we have: (8 + 9 + 3 + 6) / 4 = 6.5.
For the group of ID = 3 we have: (7 + 2 + 6 + 5) / 4 = 5.
Then the average of averages is (3.5 + 6.5 + 5) / 3 = 5.
Remark: The method should be feasible for the case of ~1e7 rows and ~1e6 unique ID's.
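To make the steps above concrete, here is a deliberately slow reference implementation (my own sketch, not part of the original post) that recomputes the group means from scratch for every row. It is O(n^2) and only meant for checking the faster solutions against the worked examples:
import pandas as pd

dfSample = pd.DataFrame({
    'ID':    [1, 1, 2, 1, 2, 2, 2, 3, 3, 1, 3, 3],
    'Value': [2, 3, 6, 1, 3, 9, 8, 5, 6, 8, 2, 7],
})

def weighted_sum_reference(df):
    # For row i: mean of Value per ID over rows 1..i, then the mean of those means.
    out = []
    for i in range(1, len(df) + 1):
        group_means = df.iloc[:i].groupby('ID')['Value'].mean()
        out.append(group_means.mean())
    return pd.Series(out, index=df.index)

print(weighted_sum_reference(dfSample))
# Entries for Index 1, 7 and 12 come out as 2.0, 4.25 and 5.0, matching the examples above.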
As a follow-up to Apply a Function per Row of Sub Groups of the Data **Above** the Current Row: I got a good answer, yet it allocates a lot of memory when the number of rows and the number of unique IDs are high.
I was wondering if there is a way to create a much smaller auxiliary data using explicit loops while accelerating them using Numba.
The idea is to have similar or better performance while reducing the memory footprint considerably.
Here's a benchmark showing the performance of pandas, numpy and numba/numpy solutions at various row counts (12 to 36,000) and unique ID counts (3 to 9,000):
rows 12, unique ID values: 3
Timeit results:
foo_1 (pandas) ran in 0.003095399937592447 seconds using 1 iterations
foo_2 (numpy) ran in 0.0003358999965712428 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00018770003225654364 seconds using 1 iterations
rows 120, unique ID values: 30
Timeit results:
foo_1 (pandas) ran in 0.0024368000449612737 seconds using 1 iterations
foo_2 (numpy) ran in 0.001127400086261332 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00029390002600848675 seconds using 1 iterations
rows 1200, unique ID values: 300
Timeit results:
foo_1 (pandas) ran in 0.01624089991673827 seconds using 1 iterations
foo_2 (numpy) ran in 0.009926999919116497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.002144100028090179 seconds using 1 iterations
rows 12000, unique ID values: 3000
Timeit results:
foo_1 (pandas) ran in 2.391147599904798 seconds using 1 iterations
foo_2 (numpy) ran in 0.2884287000633776 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.1226186000276357 seconds using 1 iterations
rows 36000, unique ID values: 9000
Timeit results:
foo_1 (pandas) ran in 44.33448620000854 seconds using 1 iterations
foo_2 (numpy) ran in 3.0259654000401497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 1.6273660999722779 seconds using 1 iterations
The pandas solution creates an intermediate dataframe that is num IDs x num rows in size. The numpy and numpy/numba solutions calculate results column by column, so they create a handful of intermediate 1D arrays of length num rows. The numpy/numba solution is consistently 2-5 times faster than numpy, and pandas is 2-10 times slower than numpy.
Upping the size a bit more gives the following result (where the pandas solution is commented out):
rows 120000, unique ID values: 30000
Timeit results:
foo_1 (pandas) ran in 6.00004568696022e-06 seconds using 1 iterations
foo_2 (numpy) ran in 28.882483799941838 seconds using 1 iterations
foo_3 (numpy_numba) ran in 38.77682559995446 seconds using 1 iterations
So it appears that there is a threshold above which the numpy/numba solution loses ground to regular numpy.
Full test code:
import pandas as pd
import numpy as np

# dfInit reconstructed from the frame printed below.
dfInit = pd.DataFrame({
    'Index':      range(1, 13),
    'ID':         [1, 1, 2, 1, 2, 2, 2, 3, 3, 1, 3, 3],
    'Value':      [2, 3, 6, 1, 3, 9, 8, 5, 6, 8, 2, 7],
    'Div Factor': [1, 2, 1, 3, 2, 3, 4, 1, 2, 4, 3, 4],
})
print(dfInit)
'''
Index  ID  Value  Div Factor
1      1   2      1
2      1   3      2
3      2   6      1
4      1   1      3
5      2   3      2
6      2   9      3
7      2   8      4
8      3   5      1
9      3   6      2
10     1   8      4
11     3   2      3
12     3   7      4
'''

def initDf(colMult=1):
    # Tile the base frame colMult times with shifted ID and Index values.
    df = dfInit.copy()
    dfMult = pd.concat([df.assign(ID=dfInit.ID + 3*i, Index=df.Index + len(df)*i)
                        for i in range(colMult)], axis=0).reset_index(drop=True)
    print(f'\nrows {len(dfMult)}, unique ID values: {len(dfMult.ID.unique())}')
    return dfMult

df = initDf()

def pd_foo_1(df):
    # Pivot rows x IDs, take running per-ID means, then average across IDs.
    df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
    df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
    df['Weighted Sum'] = df2.mean(axis=1)
    return df

def foo_1(df):
    #return None
    try:
        return pd_foo_1(df)
    except ValueError:
        print('overflow encountered')
        return None

def foo_2(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    uniqIds = df.ID.unique()
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)
    for id in uniqIds:
        # Running mean of this ID's values at every row.
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    df['Weighted Sum'] = weightedSum
    return df

from numba import njit

@njit  # JIT-compile the inner loop (the "numpy_numba" variant)
def np_foo_3(values, ids):
    uniqIds = np.unique(ids)
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)
    for id in uniqIds:
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    return weightedSum

def foo_3(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    weightedSum = np_foo_3(values, ids)
    df['Weighted Sum'] = weightedSum
    return df

foo_count = 3
foo_names = ['foo_' + str(i + 1) for i in range(foo_count)]
foo_labels = ['pandas', 'numpy', 'numpy_numba']
exec("foo_funcs=[" + ','.join(f"foo_{str(i + 1)}" for i in range(foo_count)) + "]")

for foo in foo_names:
    print(f'{foo} output:')
    #print(eval(f"{foo}(df)"))
    eval(f"{foo}(df)"); print("... output suppressed.")

# ===================== BENCHMARK with timeit:
from timeit import timeit
n = 1
for colMult in [1, 10, 100, 1000, 3000, 10000]:
    df = initDf(colMult)
    print('Timeit results:')
    for i, foo in enumerate(foo_names):
        t = timeit(f"{foo}(df)", setup=f"from __main__ import df, {foo}", number=n) / n
        print(f'{foo} ({foo_labels[i]}) ran in {t} seconds using {n} iterations')
# ===================== ... END BENCHMARK with timeit.
Space used:
The memory of a pandas solution which pivots IDs is proportional to num rows x num unique IDs. By comparison, a solution that loops over IDs processing one copy of the Value column at a time uses memory proportional to num rows.
This means that 10^7 rows x 10^6 unique IDs is about 10^13 values of 4-8 bytes each (call it 10^14 bytes, or 100,000 GB), so storing the pivot table in program memory with pandas is not feasible.
However, 10^7 rows times at most 10 1-D arrays of 8-byte values is on the order of 10^9 bytes, or 1 GB, of program memory in the looping solutions (numpy or numpy/numba above).
Note that adding a nested loop over chunks of a fixed number of rows would let us cap the memory usage of most of the roughly 10 1-D arrays mentioned above, but we would still need to hold at least 1 full-length array (the result), so this will never give us more than about a factor-of-10 reduction in memory footprint.
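As an illustration of how far the footprint could be pushed (my own sketch, not one of the benchmarked solutions): the per-ID running sums and counts can be kept in dictionaries and the sum of group means updated incrementally, so the working state is O(number of unique IDs) plus the result array. The same single loop would be a candidate for acceleration with Numba's typed dictionaries.
import numpy as np

def weighted_sum_streaming(values, ids):
    # One pass over the rows; state is two dicts keyed by ID plus the running
    # sum of per-ID means, so memory is O(unique IDs) besides the result.
    sums, counts = {}, {}
    total_of_means = 0.0
    result = np.empty(len(values))
    for i, (v, id_) in enumerate(zip(values, ids)):
        old_mean = sums[id_] / counts[id_] if id_ in counts else 0.0
        sums[id_] = sums.get(id_, 0.0) + v
        counts[id_] = counts.get(id_, 0) + 1
        total_of_means += sums[id_] / counts[id_] - old_mean
        result[i] = total_of_means / len(counts)
    return result

# On the 12-row sample this reproduces 2.0, 4.25 and 5.0 for Index 1, 7 and 12.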

How to get multiples of 100 rounded to the nearest thousand

I'm learning Python and I'm trying to come up with a for loop (or any other method) that can return multiples of 100 but rounded to the nearest thousand, here's what I have right now:
huneds = [h * 100 for h in range(1,50)]
for r in huneds:
    if r % 3 == float:
        print(r)
    else:
        break
The built-in round() function will accept a negative number that you can use to round to thousands:
for r in huneds:
    print(round(r, -3))
Which prints:
0
0
0
0
0
1000
1000
1000
1000
1000
1000
1000
1000
1000
2000
...
4000
4000
5000
5000
5000
5000
You can also use:
for n in range(0, 2500, 100):
    print(n, ' -> ', 1000 * round(n / 1000))
For any numbers n and m, n is a multiple of m if the remainder of n / m is 0, i.e. n % m == 0, or in your case r % 100 == 0, as the modulo operator (%) returns the remainder of a division. Use:
for r in huneds:
    if r % 100 == 0:
        print(r)
But every number is already a multiple of 100, as you multiplied all of them by 100.
You may be after something like:
# Range uses (start, stop, step) params
# print(list(range(0, 200, 10))) --> [0, 10, 20, ... 170, 180, 190]
for r in range(0, 200, 10):
    if r % 100 == 0 and r != 0:
        print(r)
Outputs
100
But you would like to round to the nearest 1000. The round() function can do that.
for r in range(0, 2000, 10):
    if r % 100 == 0 and r != 0:
        print(f"{r} | {round(r, -3)}")
100 | 0
200 | 0
300 | 0
400 | 0
500 | 0
600 | 1000
700 | 1000
800 | 1000
...
The f-string produces the same output as printing str(r) + ' | ' + str(round(r, -3)).
This shows the number that is a multiple of 100 (which is r), followed by that number rounded to the nearest 1000.
round()'s second argument is the number of digits to round to; since we are rounding to the nearest 1000, we use -3 because we are rounding to the left of the decimal point.
I suggest having a read of:
https://www.w3schools.com/python/ref_func_range.asp
And I highly recommend this site (Python Principles) for learning Python. Pro membership is currently free.
Simply do this,
list(map(lambda x: round(x/1000) * 1000, huneds))
It'll return you a list of rounded values for all the items of the list huneds.
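For reference, a quick check of what that produces (note that round() uses round-half-to-even, so 500 rounds down to 0):
huneds = [h * 100 for h in range(1, 50)]
rounded = list(map(lambda x: round(x / 1000) * 1000, huneds))
print(rounded[:8])   # [0, 0, 0, 0, 0, 1000, 1000, 1000]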

Calculating DataFrame columns based on other columns

Having a DataFrame like so:
# comments are the equations that have to be done to calculate the given column
df = pd.DataFrame({
    'item_tolerance': [230, 115, 155],
    'item_intake': [250, 100, 100],
    'open_items_previous_day': 0,  # df.item_intake.shift() + df.open_items_previous_day.shift() - df.items_shipped.shift() + df.items_over_under_sla.shift()
    'total_items_to_process': 0,   # df.item_intake + df.open_items_previous_day
    'sla_relevant': 0,             # df.item_tolerance if df.open_items_previous_day + df.item_intake > df.item_tolerance else df.open_items_previous_day + df.item_intake
    'items_shipped': [230, 115, 50],
    'items_over_under_sla': 0      # df.items_shipped - df.sla_relevant
})
   item_tolerance  item_intake  open_items_previous_day  total_items_to_process  sla_relevant  items_shipped  items_over_under_sla
0             230          250                        0                       0             0            230                     0
1             115          100                        0                       0             0            115                     0
2             155          100                        0                       0             0             50                     0
I'd like to calculate all the columns that have comments in them. I've tried using df.apply(some_method, axis=1) to perform row-wise calculations, but the problem is that I don't have access to the previous row inside some_method(row).
To give a little more explanation, what I'm trying to achieve is, for example, df.items_over_under_sla = df.items_shipped - df.sla_relevant, but df.sla_relevant is based on an equation which needs df.open_items_previous_day, which in turn needs the previous row to be calculated. This is the problem: I need to calculate rows based on the values from the current row and the previous one.
What is the correct approach to such problem?
If you are calculating each column with a different operation I suggest obtaining them individually:
df['open_items_previous_day'] = df['item_intake'].shift(fill_value=0) + df['open_items_previous_day'].shift(fill_value=0) - df['items_shipped'].shift(fill_value=0) + df['items_over_under_sla'].shift(fill_value=0)
df['total_items_to_process'] = df['item_intake'] + df['open_items_previous_day']
df = df.assign(sla_relevant=np.where(df['open_items_previous_day'] + df['item_intake'] > df['item_tolerance'], df['item_tolerance'], df['open_items_previous_day'] + df['item_intake']))
df['items_over_under_sla'] = df['items_shipped'] - df['sla_relevant']
df
Out[1]:
item_tolerance item_intake open_items_previous_day total_items_to_process sla_relevant items_shipped items_over_under_sla
0 230 250 0 250 230 230 0
1 115 100 20 120 115 115 0
2 155 100 -15 85 85 50 -35
The problem that you are facing is not about having to use the previous row (you are working around that just fine using the shift function). The real problem here is that all columns that you are trying to get (except for total_items_to_process) depend on each other, therefore you can't get the rest of the columns without having one of them first (or assuming it is zero initially).
That's why you are going to get different results depending on which column you've calculated first.
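If you do want the columns to feed back into each other row by row (rather than using the original zero placeholders, as the shift-based answer above does), one option is an explicit loop that carries the previous row's results forward. This is my own sketch of that idea under the equations given in the question, assuming the first row starts with open_items_previous_day = 0; because it is genuinely recursive, its numbers differ from the table above from the second row on, which is exactly the order-dependence described here.
import pandas as pd

df = pd.DataFrame({
    'item_tolerance': [230, 115, 155],
    'item_intake': [250, 100, 100],
    'items_shipped': [230, 115, 50],
})

open_prev, total, sla, over_under = [], [], [], []
for i in range(len(df)):
    if i == 0:
        open_prev.append(0)  # assumed starting condition
    else:
        # Previous intake + previous open items - previous shipments + previous over/under.
        open_prev.append(df['item_intake'].iloc[i - 1] + open_prev[-1]
                         - df['items_shipped'].iloc[i - 1] + over_under[-1])
    total.append(df['item_intake'].iloc[i] + open_prev[-1])
    sla.append(min(df['item_tolerance'].iloc[i], total[-1]))   # the if/else from the question
    over_under.append(df['items_shipped'].iloc[i] - sla[-1])

df = df.assign(open_items_previous_day=open_prev,
               total_items_to_process=total,
               sla_relevant=sla,
               items_over_under_sla=over_under)
print(df)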

How to give certain rows 'points' depending on how much larger that row's column 1 is compared to that row's column 2

I'm looking at creating an algorithm where if the views_per_hour is 2x larger than the average_views_per_hour, I give the channel 5 points; if it is 3x larger I give the row 10 points and if it is 4x larger, I give the row 20 points. I'm not really sure how to go about this and would really appreciate some help.
df = pd.DataFrame({'channel':['channel1','channel2','channel3','channel4'], 'views_per_hour_today':[300,500,2000,100], 'average_views_per_hour':[100,200,200,50],'points': [0,0,0,0] })
df.loc[:, 'average_views_per_hour'] *= 2
df['n=2'] = np.where((df['views_per_hour'] >= df['average_views_per_hour']) , 5, 0)
df.loc[:, 'average_views_per_hour'] *= 3
df['n=3'] = np.where((df['views_per_hour'] >= df['average_views_per_hour']) , 5, 0)
df.loc[:, 'average_views_per_hour'] *= 4
df['n=4'] = np.where((df['views_per_hour'] >= df['average_views_per_hour']) , 10, 0)
I expected to be able to add up the results from columns n=2, n=3, n=4 for each row in the 'Points' column but the columns are always showing either 5 or 10 and never 0 (the code thinks that the views_per_hour is always greater than the average_views_per_hour, even when the average_views_per_hour is multiplied by a large integer.)
There are multiple ways of solving this kind of problem. You can use numpy's select, which has concise syntax, or you can define a function and apply it to the data frame (a sketch of that alternative follows the output below).
div = df['views_per_hour_today']/df['average_views_per_hour']
cond = [(div >= 2) & (div < 3), (div >= 3) & (div < 4), (div >= 4) ]
choice = [5, 10, 20]
df['points'] = np.select(cond, choice)
channel views_per_hour_today average_views_per_hour points
0 channel1 300 100 10
1 channel2 500 200 5
2 channel3 2000 200 20
3 channel4 100 50 5
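For completeness, the apply-based alternative mentioned above could look like this (my own sketch; slower than np.select on large frames, but easy to read):
import pandas as pd

df = pd.DataFrame({
    'channel': ['channel1', 'channel2', 'channel3', 'channel4'],
    'views_per_hour_today': [300, 500, 2000, 100],
    'average_views_per_hour': [100, 200, 200, 50],
})

def score(row):
    # Ratio of today's views to the channel's average decides the points.
    ratio = row['views_per_hour_today'] / row['average_views_per_hour']
    if ratio >= 4:
        return 20
    if ratio >= 3:
        return 10
    if ratio >= 2:
        return 5
    return 0

df['points'] = df.apply(score, axis=1)
# Same points column as above: 10, 5, 20, 5.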

Classifying Data in a New Column

I have the following df:
Column 1
1
2435
3345
104
505
6005
10000
80000
100000
4000000
4440
520
...
This structure is not the best for plotting a histogram, which is the main purpose. Bins don't really solve the problem either, at least from what I've tested so far. That's why I'd like to create my own bins in a new column:
I basically want to assign every value within a certain range in column 1 a bucket in column2, so that it look like this:
Column 1 Column2
1 < 10000
2435 < 10000
3345 < 10000
104 < 10000
505 < 10000
6005 < 10000
10000 < 50000
80000 < 150000
100000 < 150000
4000000 < 250000
4440 < 10000
520 < 10000
...
Once I get there, creating a plot will be much easier.
Thanks!
There is a pandas method for this, cut; there is a section describing it here. cut returns the open-closed intervals for each value:
In [29]:
df['bin'] = pd.cut(df['Column 1'], bins = [0,10000, 50000, 150000, 25000000])
df
Out[29]:
Column 1 bin
0 1 (0, 10000]
1 2435 (0, 10000]
2 3345 (0, 10000]
3 104 (0, 10000]
4 505 (0, 10000]
5 6005 (0, 10000]
6 10000 (0, 10000]
7 80000 (50000, 150000]
8 100000 (50000, 150000]
9 4000000 (150000, 25000000]
10 4440 (0, 10000]
11 520 (0, 10000]
The dtype of the column is a Category and can be used for filtering, counting, plotting etc.
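As a small illustration of the counting and plotting step (my own sketch, assuming matplotlib is available):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Column 1': [1, 2435, 3345, 104, 505, 6005, 10000,
                                80000, 100000, 4000000, 4440, 520]})
df['bin'] = pd.cut(df['Column 1'], bins=[0, 10000, 50000, 150000, 25000000])

# The categorical keeps the bins in order, so value_counts(sort=False) lines up with them.
counts = df['bin'].value_counts(sort=False)
print(counts)
counts.plot(kind='bar')   # bar chart of how many values fall into each bin
plt.show()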
numpy.histogram takes a bins parameter which can be an array of bin edges, and returns an array of the counts within those bins. So, if you run
import numpy as np
counts, _ = np.histogram(df['Column 1'].values, [10000, 50000, 150000, 250000])
you will have the counts for the bins you want. From here, you can do whatever you want, including plotting the number of counts within each bin:
import matplotlib.pyplot as plt
plt.plot(counts)
