Can the performance of this SQL inner join be improved? - python

I am using Python and sqlite3, and I was wondering if the performance of this query can be improved.
main table with ~100,000 rows
0 1 2 3 4 Amount
0 0 9 12 6 60 40800.0
1 0 9 12 6 61 40100.0
2 0 9 12 6 65 39900.0
3 0 9 12 6 74 40300.0
4 0 9 12 7 60 40600.0
util table ~75,000 rows
0 1 2 Amount
0 78 75 65 9900.0
1 80 75 65 9900.0
2 80 72 65 10000.0
3 78 72 65 10000.0
4 79 75 65 10000.0
The query currently takes the Cartesian product of the two tables, keeps rows where the sum of the two amounts is between 49,700 and 50,000, and returns the first 200,000 matches, if my understanding is correct.
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
df.to_sql(name='main', con=con)
df1.to_sql(name='util', con=con)
query = '''
SELECT *
FROM main AS m
INNER JOIN
util AS u
ON
50000 >= m.Amount + u.Amount
AND
49700 <= m.Amount + u.Amount
LIMIT
200000;
'''
final_df = pd.read_sql_query(query, con)

Since you're not matching on a column value but on the expression m.Amount + u.Amount, it has to be computed for every possible combination of rows between the two tables (100k * 75k = 7.5 billion combinations). What you've effectively got is a CROSS JOIN, since you're not matching on any column between the two tables.
1. You can make sure the expression is evaluated only once, rather than once for each side of the AND clause (50000 >= m.Amount + u.Amount AND 49700 <= m.Amount + u.Amount), by using the BETWEEN operator. I would just use the standard 'FROM table1, table2' with WHERE for clarity:
SELECT * FROM main AS m, util AS u
WHERE
m.Amount + u.Amount BETWEEN 49700 AND 50000
;
2. You'll have to use other methods to reduce the number of rows that are checked. For example, when Amount in either table is more than 50,000 on its own, it can't be a match, so it gets excluded earlier in the check and saves time by never computing m.Amount + u.Amount for those rows:
SELECT * FROM main AS m, util AS u
WHERE
m.Amount <= 50000
AND
u.Amount <= 50000
AND
m.Amount + u.Amount BETWEEN 49700 AND 50000
;
If the amounts cannot be 0, then change the <= 50000 to < 50000.
3. You can do other things, like find the minimum Amount in each table and then require that the other table's Amount is at most 50000 minus that minimum.
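A minimal sketch of that idea with scalar subqueries, which SQLite should only need to evaluate once since they don't depend on the outer rows:
SELECT * FROM main AS m, util AS u
WHERE
m.Amount <= 50000 - (SELECT MIN(Amount) FROM util)
AND
u.Amount <= 50000 - (SELECT MIN(Amount) FROM main)
AND
m.Amount + u.Amount BETWEEN 49700 AND 50000
;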
4. Borrowing from the classic "sum of two numbers" problem, you can do a one-time calculation of a minimum and maximum matching Amount (two new columns) for one of the tables, and then do a single BETWEEN check against the Amount from the other table. It still has to do a cross join, but the CPU time to evaluate each candidate pair is reduced:
ALTER TABLE main ADD COLUMN min_match INT default 0;
ALTER TABLE main ADD COLUMN max_match INT default 0;
UPDATE main SET min_match = 49700 - Amount,
                max_match = 50000 - Amount;
SELECT * FROM main AS m, util AS u
WHERE
u.Amount BETWEEN m.min_match AND m.max_match
;
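Run from Python against the in-memory connection in the question, idea 4 would look something like this (a sketch; executescript simply batches the ALTER/UPDATE statements):
con.executescript('''
    ALTER TABLE main ADD COLUMN min_match INT DEFAULT 0;
    ALTER TABLE main ADD COLUMN max_match INT DEFAULT 0;
    UPDATE main SET min_match = 49700 - Amount,
                    max_match = 50000 - Amount;
''')
final_df = pd.read_sql_query('''
    SELECT * FROM main AS m, util AS u
    WHERE u.Amount BETWEEN m.min_match AND m.max_match
    LIMIT 200000;
''', con)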

Pronic Integers in String/Number

The program must accept an integer N as the input. The program must print all the pronic integers formed by series of continuously occurring digits (in the same order) in N as the output.
The pronic integers can be represented as n*(n+1).
Note: The pronic integers must be printed in the order of their occurrence.
Boundary Condition(s):
1 <= N <= 10^20
Max execution time: 4000 milliseconds
Input Format:
The first line contains N.
Output Format:
The first line contains the pronic integers separated by a space.
Example Input/Output 1:
Input:
93042861
Output:
930 30 0 42 2 6
Explanation:
30 * 31 = 930
5 * 6 = 30
0 * 1 = 0
6 * 7 = 42
1 * 2 = 2
2 * 3 = 6
Example Input/Output 2:
Input:
247025123524
Output:
2 702 0 2 12 2 2352 2
Explanation:
1 * 2 = 2
26 * 27 = 702
0 * 1 = 0
1 * 2 = 2
3 * 4 = 12
1 * 2 = 2
48 * 49 = 2352
1 * 2 = 2
def ispro(n):
    for i in range(n + 1):
        if i * (i + 1) == n:
            return 1
    return 0

def pro(a):
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n + 1):  # j must reach n so substrings ending at the last digit are included
            if a[i:j] == str(int(a[i:j])):  # skip substrings with leading zeros
                if ispro(int(a[i:j])):
                    print(a[i:j], end=" ")

a = input().strip()
pro(a)
With this code, the time limit is exceeded for string lengths greater than 10.
You may edit this code or create your own code to solve this problem.
N <= 10^20 means N has at most 20 digits, so there are at most 20 * 21 / 2 = 210 possible substrings to check for being pronic. Let us denote one of these candidate numbers as PRN.
PRN can itself have up to 20 digits, so we obviously cannot brute force each candidate the way ispro does (that would take up to 10^20 iterations). Instead, we can binary search for the n where n * (n + 1) = PRN.
You can read up on binary search to learn more about it, if you don't already know it. Essentially, if our guess n makes n * (n + 1) too big, we shrink n; if too small, we grow it. We do this until n * (n + 1) = PRN (meaning PRN is pronic), or the search range is exhausted, and we move on.
Each binary search takes log(PRN) time, at most about 67 iterations for a 20-digit number. Over the ~210 possible substrings, that is roughly 14,000 checks, which easily fits in the time constraints. There are probably more mathematically beautiful solutions, but this will do.
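A sketch of that approach in Python, keeping the substring enumeration and the leading-zero filter from the original code but swapping the linear scan for a binary search (function names are my own):
def is_pronic(value):
    # Binary search for n with n * (n + 1) == value
    lo, hi = 0, value
    while lo <= hi:
        mid = (lo + hi) // 2
        prod = mid * (mid + 1)
        if prod == value:
            return True
        if prod < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

def pro(a):
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = a[i:j]
            if sub == str(int(sub)) and is_pronic(int(sub)):
                print(sub, end=" ")

pro(input().strip())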

Create a master data set comprised of multiple data frames

I have been stuck on this problem for a while now! Included below is a very simplified version of my program, along with some context. Essentially, what I want is one large dataframe that has all of my desired permutations based on my input variables. This is in the context of scenario analysis, and it will help me avoid doing on-demand calculations through my BI tool when the user wants to change variables to visualise the output.
I have tried:
Creating a function out of my code and trying to apply the function with each of the step-size changes of my input variables (no idea what I am doing there).
Literally manually changing the input variables myself (as a noob I realise this is not the way to go, but I had to first see my code was working to append df's).
Essentially what I want to achieve is as follows:
use the variables "date_offset" and "cost" and vary each of them by the defined step size and number of steps
As an example, if there are 2 values for date_offset (step size 1) and two values for cost (step size 1), there are 4 possible combinations, so the data set will be 4 times the size of the df in my code below.
Now I have all of the permutations of the input variable and the corresponding data frame to go with each of those permutations, I would like to append each one of the data frames together.
I should be left with one data frame for all of the possible scenarios which I can then visualise with a BI tool.
I hope you guys can help :)
Here is my code.....
import pandas as pd
import numpy as np
#want to iterate through starting at a date_offset of 0 with a total of 5 steps and a step size of 1
date_offset = 0
steps_1 = 5
stepsize_1 = 1
#want to iterate though starting at a cost of 5 with a total number of steps of 5 and a step size of 1
cost = 5
steps_2 = 4
step_size = 1
df = {'id':['1a', '2a', '3a', '4a'],'run_life':[10,20,30,40]}
df = pd.DataFrame(df)
df['date_offset'] = date_offset
df['cost'] = cost
df['calc_col1'] = df['run_life']*cost
Are you trying to do something like this:
import pandas as pd
from itertools import product

data = {'id': ['1a', '2a', '3a', '4a'], 'run_life': [10, 20, 30, 40]}
df = pd.DataFrame(data)

date_offset = 0
steps_1 = 5
stepsize_1 = 1

cost = 5
steps_2 = 4
stepsize_2 = 1

# Build every (offset, cost) permutation, then cross-join it onto df
df2 = pd.DataFrame(
    product(
        range(date_offset, date_offset + steps_1 * stepsize_1 + 1, stepsize_1),
        range(cost, cost + steps_2 * stepsize_2 + 1, stepsize_2)
    ),
    columns=['offset', 'cost']
)
result = df.merge(df2, how='cross')
result['calc_col1'] = result['run_life'] * result['cost']
Output:
id run_life offset cost calc_col1
0 1a 10 0 5 50
1 1a 10 0 6 60
2 1a 10 0 7 70
3 1a 10 0 8 80
4 1a 10 0 9 90
.. .. ... ... ... ...
115 4a 40 5 5 200
116 4a 40 5 6 240
117 4a 40 5 7 280
118 4a 40 5 8 320
119 4a 40 5 9 360
[120 rows x 5 columns]
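Note that how='cross' was added in pandas 1.2; on older versions the same cross join can be emulated with a constant join key (a sketch, with _k as an arbitrary temporary column name):
result = (df.assign(_k=1)
            .merge(df2.assign(_k=1), on='_k')
            .drop(columns='_k'))
result['calc_col1'] = result['run_life'] * result['cost']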

Cumulative difference of numbers starting from an initial value

I have a Pandas dataframe containing a series of numbers:
df = pd.DataFrame({'deduction':[10,60,70,50,60,10,10,60,60,20,50,20,10,90,60,70,30,50,40,60]})
deduction
0 10
1 60
2 70
3 50
4 60
5 10
6 10
7 60
8 60
9 20
10 50
11 20
12 10
13 90
14 60
15 70
16 30
17 50
18 40
19 60
I would like to compute the cumulative difference of these numbers, starting from a larger number (i.e. <base_number> - 10 - 60 - 70 - 50 - ...).
My current solution is to negate all the numbers, prepend the (positive) larger number to the dataframe, and then call cumsum():
# Compact:
(-df['deduction'][::-1]).append(pd.Series([start_value], index=[-1]))[::-1].cumsum().reset_index(drop=True)
# Expanded:
total_series = (
    # Negate
    (-df['deduction']
    # Reverse
    [::-1])
    # Add the base value to the end
    .append(pd.Series([start_value]))
    # Reverse again (to put the base value at the beginning)
    [::-1]
    # Calculate cumulative sum (all the values except the first are negative, so this will work)
    .cumsum()
    # Clean up
    .reset_index(drop=True)
)
But I was wondering if there's a shorter solution that doesn't append to the series (I hear that's bad practice).
(It doesn't need to be put in a dataframe; a series, like I've used above, is fine.)
df['total'] = start_value - df["deduction"].cumsum()
If you need the start value at the beginning of the series, then shift and insert (there are a few ways to do it, and this is one of them):
df['total'] = -df["deduction"].shift(1, fill_value=-start_value).cumsum()
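A quick check of both one-liners, assuming start_value = 1000 (an example value of my own):
import pandas as pd

df = pd.DataFrame({'deduction': [10, 60, 70, 50]})
start_value = 1000  # assumed for the example

# Running total after each deduction: 990, 930, 860, 810
df['total'] = start_value - df['deduction'].cumsum()

# Shifted variant, so the series starts at the base value: 1000, 990, 930, 860
df['total_shifted'] = -df['deduction'].shift(1, fill_value=-start_value).cumsum()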

Using df.apply on a function with multiple inputs to generate multiple outputs

I have a dataframe that looks like this
initial year0 year1
0 0 12
1 1 13
2 2 14
3 3 15
Note that the number of year columns year0, year1... (year_count) is completely variable but will be constant throughout this code
I first wanted to apply a function to each of the 'year' columns to generate 'mod' columns like so
def mod(year, scalar):
    return year * scalar
s = 5
year_count = 2
# Generate new columns
df[[f"mod{y}" for y in range (year_count)]] = df[[f"year{y}" for y in range(year_count)]].apply(mod, scalar=s)
initial year0 year1 mod0 mod1
0 0 12 0 60
1 1 13 5 65
2 2 14 10 70
3 3 15 15 75
All good so far. The problem is that I now want to apply another function to both the year column and its corresponding mod column to generate another set of val columns, so something like
def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale
Then I apply this to each of the columns (year0, mod0), (year1, mod1) etc to generate the next tranche of columns.
With scale = 10 I should end up with
initial year0 year1 mod0 mod1 val0 val1
0 0 12 0 60 0 720
1 1 13 5 65 60 780
2 2 14 10 70 120 840
3 3 15 15 75 180 900
This is where I'm stuck - I don't know how to put two existing df columns together in a function with the same structure as in the first example, and if I do something like
df[['val0', 'val1']] = df['col1', 'col2'].apply(lambda x: sum_and_scale('mod0', 'mod1', scale=10))
I don't know how to generalise this to have arbitrary inputs and outputs and also apply the constant scale parameter. (I know the last piece of code won't work, but it's the other avenue to a solution I've seen.)
The reason I'm asking is because I believe the loop that I currently have working is creating performance issues with the number of columns and the length of each column.
Thanks
IMHO, it's better with a simple for loop; each iteration is still a vectorized column operation, so the loop over columns isn't the bottleneck:
for i in range(year_count):
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
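Put together with the example frame from the question (a self-contained sketch, using s = 5 and scale = 10 from above):
import pandas as pd

df = pd.DataFrame({'initial': [0, 1, 2, 3],
                   'year0': [0, 1, 2, 3],
                   'year1': [12, 13, 14, 15]})

def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale

s = 5
year_count = 2
for i in range(year_count):
    df[f'mod{i}'] = df[f'year{i}'] * s  # the mod step
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)  # the val step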

I want to multiply two columns in a pandas DataFrame and add the result into a new column

I'm trying to multiply two existing columns in a pandas Dataframe (orders_df): Prices (stock close price) and Amount (stock quantities) and add the calculation to a new column called Value. For some reason when I run this code, all the rows under the Value column are positive numbers, while some of the rows should be negative. Under the Action column in the DataFrame there are seven rows with the 'Sell' string and seven with the 'Buy' string.
for i in orders_df.Action:
    if i == 'Sell':
        orders_df['Value'] = orders_df.Prices * orders_df.Amount
    elif i == 'Buy':
        orders_df['Value'] = -orders_df.Prices * orders_df.Amount
Please let me know what I'm doing wrong!
The problem with the loop is that each pass assigns the entire Value column, so after the loop finishes the whole column reflects only the last Action value you iterated over. I think an elegant solution is to use the where method (also see the API docs):
In [37]: values = df.Prices * df.Amount
In [38]: df['Values'] = values.where(df.Action == 'Sell', other=-values)
In [39]: df
Out[39]:
Prices Amount Action Values
0 3 57 Sell 171
1 89 42 Sell 3738
2 45 70 Buy -3150
3 6 43 Sell 258
4 60 47 Sell 2820
5 19 16 Buy -304
6 56 89 Sell 4984
7 3 28 Buy -84
8 56 69 Sell 3864
9 90 49 Buy -4410
Furthermore, this should be the fastest solution.
You can use the DataFrame apply method:
order_df['Value'] = order_df.apply(lambda row: (row['Prices'] * row['Amount']
                                                if row['Action'] == 'Sell'
                                                else -row['Prices'] * row['Amount']),
                                   axis=1)
It is usually faster to use these methods rather than plain for loops.
If we're willing to sacrifice the succinctness of Hayden's solution, one could also do something like this:
In [22]: orders_df['C'] = orders_df.Action.apply(
             lambda x: (1 if x == 'Sell' else -1))
In [23]: orders_df # New column C represents the sign of the transaction
Out[23]:
Prices Amount Action C
0 3 57 Sell 1
1 89 42 Sell 1
2 45 70 Buy -1
3 6 43 Sell 1
4 60 47 Sell 1
5 19 16 Buy -1
6 56 89 Sell 1
7 3 28 Buy -1
8 56 69 Sell 1
9 90 49 Buy -1
Now we have eliminated the need for the if statement. Using DataFrame.apply(), we also do away with the for loop. As Hayden noted, vectorized operations are generally faster.
In [24]: orders_df['Value'] = orders_df.Prices * orders_df.Amount * orders_df.C
In [25]: orders_df # The resulting dataframe
Out[25]:
Prices Amount Action C Value
0 3 57 Sell 1 171
1 89 42 Sell 1 3738
2 45 70 Buy -1 -3150
3 6 43 Sell 1 258
4 60 47 Sell 1 2820
5 19 16 Buy -1 -304
6 56 89 Sell 1 4984
7 3 28 Buy -1 -84
8 56 69 Sell 1 3864
9 90 49 Buy -1 -4410
This solution takes two lines of code instead of one, but is a bit easier to read. I suspect that the computational costs are similar as well.
Since this question came up again, I think a good clean approach is using assign.
The code is quite expressive and self-describing:
df = df.assign(Value=lambda x: x.Prices * x.Amount * x.Action.replace({'Buy': -1, 'Sell': 1}))
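The same thing works with Series.map, which does a plain dictionary lookup (unlike replace, any Action value other than 'Buy'/'Sell' would become NaN instead of passing through):
df = df.assign(Value=lambda x: x.Prices * x.Amount * x.Action.map({'Buy': -1, 'Sell': 1}))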
To make things neat, I take Hayden's solution but make a small function out of it.
def create_value(row):
    if row['Action'] == 'Sell':
        return row['Prices'] * row['Amount']
    else:
        return -row['Prices'] * row['Amount']
so that when we want to apply the function to our dataframe, we can do..
df['Value'] = df.apply(create_value, axis=1)
...and any modifications only need to occur in the small function itself.
Concise, Readable, and Neat!
Good solution from bmu. I think it's more readable to put the values inside the parentheses vs outside.
import numpy as np

df['Values'] = np.where(df.Action == 'Sell',
                        df.Prices * df.Amount,
                        -df.Prices * df.Amount)
Using some pandas built in functions.
df['Values'] = np.where(df.Action.eq('Sell'),
                        df.Prices.mul(df.Amount),
                        -df.Prices.mul(df.Amount))
For me, this is the clearest and most intuitive:
for action in ['Sell', 'Buy']:
    mask = orders_df['Action'] == action
    amounts = orders_df['Amount'][mask].values
    if action == 'Sell':
        prices = orders_df['Prices'][mask].values
    else:
        prices = -1 * orders_df['Prices'][mask].values
    orders_df.loc[mask, 'Values'] = amounts * prices
The .values method returns a numpy array, so you can easily multiply element-wise; assigning through the same boolean mask keeps each result aligned with its original row (collecting the per-action results into one flat list and assigning that back would scramble rows unless the frame happened to be sorted by Action).
First, multiply the columns Prices and Amount. Afterwards use mask to negate the values if the condition is True:
df.assign(
    Values=(df["Prices"] * df["Amount"]).mask(df["Action"] == "Buy", lambda x: -x)
)
