I have a Pandas dataframe containing a series of numbers:
df = pd.DataFrame({'deduction':[10,60,70,50,60,10,10,60,60,20,50,20,10,90,60,70,30,50,40,60]})
deduction
0 10
1 60
2 70
3 50
4 60
5 10
6 10
7 60
8 60
9 20
10 50
11 20
12 10
13 90
14 60
15 70
16 30
17 50
18 40
19 60
I would like to compute the cumulative difference of these numbers, starting from a larger number (i.e. <base_number> - 10 - 60 - 70 - 50 - ...).
My current solution is to negate all the numbers, prepend the (positive) larger number to the dataframe, and then call cumsum():
# Compact:
(-df['deduction'][::-1]).append(pd.Series([start_value], index=[-1]))[::-1].cumsum().reset_index(drop=True)
# Expanded:
total_series = (
    # Negate
    (-df['deduction']
    # Reverse
    [::-1])
    # Add the base value to the end
    .append(pd.Series([start_value]))
    # Reverse again (to put the base value at the beginning)
    [::-1]
    # Calculate cumulative sum (all the values except the first are negative, so this will work)
    .cumsum()
    # Clean up
    .reset_index(drop=True)
)
But I was wondering if a shorter solution was possible, one that didn't append to the series (I hear that's bad practice).
(It doesn't need to be put in a dataframe; a series, like I've done above, will be alright.)
df['total'] = start_value - df["deduction"].cumsum()
If you need the start value at the beginning of the series, then shift and insert (there are a few ways to do it, and this is one of them):
df['total'] = -df["deduction"].shift(1, fill_value=-start_value).cumsum()
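A minimal runnable sketch of both variants, assuming a hypothetical start_value of 500 and a shortened version of the example data (the column name total_from_start is just illustrative):

import pandas as pd

df = pd.DataFrame({'deduction': [10, 60, 70, 50, 60]})
start_value = 500  # hypothetical base number

# Running total after each deduction: 490, 430, 360, 310, 250
df['total'] = start_value - df['deduction'].cumsum()

# Running total with the start value first: 500, 490, 430, 360, 310
df['total_from_start'] = -df['deduction'].shift(1, fill_value=-start_value).cumsum()
print(df)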
I am using Python and sqlite3, and I was wondering if the performance of this query can be improved?
main table with ~100,000 rows
0 1 2 3 4 Amount
0 0 9 12 6 60 40800.0
1 0 9 12 6 61 40100.0
2 0 9 12 6 65 39900.0
3 0 9 12 6 74 40300.0
4 0 9 12 7 60 40600.0
util table ~75,000 rows
0 1 2 Amount
0 78 75 65 9900.0
1 80 75 65 9900.0
2 80 72 65 10000.0
3 78 72 65 10000.0
4 79 75 65 10000.0
The query currently takes the Cartesian product of the two tables, keeps the rows where the sum of the amounts is between 49,700 and 50,000, and returns the first 200,000 matches, if my understanding is correct.
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
df.to_sql(name='main', con=con)
df1.to_sql(name='util', con=con)
query = '''
SELECT *
FROM main AS m
INNER JOIN
util AS u
ON
50000 >= m.Amount + u.Amount
AND
49700 <= m.Amount + u.Amount
LIMIT
200000;
'''
final_df = pd.read_sql_query(query, con)
Since you're not matching on a column value, but on the expression m.Amount + u.Amount, it has to be computed for every possible combination of rows between the two tables (100k × 75k = 7.5 billion combinations). What you've effectively got is a CROSS JOIN, since no column of one table is matched against a column of the other.
1. You can make sure the expression is evaluated only once, rather than once for each part of the AND clause (50000 >= m.Amount + u.Amount AND 49700 <= m.Amount + u.Amount), by using the BETWEEN operator. I would also just use the standard 'FROM table1, table2' with WHERE for clarity, as in the later examples:
SELECT * FROM main AS m
INNER JOIN
util AS u
ON
m.Amount + u.Amount BETWEEN 49700 AND 50000
;
2. You'll have to use other methods to reduce the number of rows that are checked. For example, when Amount in either table is more than 50,000 it can't be a match, so it gets excluded earlier in the evaluation and saves time by not computing m.Amount + u.Amount even once:
SELECT * FROM main AS m, util AS u
WHERE
m.Amount <= 50000
AND
u.Amount <= 50000
AND
m.Amount + u.Amount BETWEEN 49700 AND 50000
;
If the amounts cannot be 0, then change the <= 50000 to < 50000.
3. You can do other things, like find the minimum Amount in each table and then make sure that the other table's Amount is less than 50,000 minus that minimum amount; a sketch of this idea follows.
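A rough sketch of idea 3 (not part of the original answer and not tested against the real data), reusing the in-memory connection from the question; query3 and pruned_df are just illustrative names:

# Sketch of idea 3: precompute each table's minimum Amount once and use it
# to prune rows from the other table before the sum is ever evaluated.
query3 = '''
SELECT * FROM main AS m, util AS u
WHERE
    m.Amount <= 50000 - (SELECT MIN(Amount) FROM util)
    AND
    u.Amount <= 50000 - (SELECT MIN(Amount) FROM main)
    AND
    m.Amount + u.Amount BETWEEN 49700 AND 50000
LIMIT
    200000;
'''
pruned_df = pd.read_sql_query(query3, con)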
4. Using the "sum of 2 numbers" idea, you can do a one-time calculation of a minimum match Amount and a maximum match Amount (add two new columns) for one of the tables, and then use the BETWEEN check against the Amount from the other table. It still needs to do a cross join, but the CPU time to evaluate each match is reduced.
ALTER TABLE main ADD COLUMN min_match INT default 0;
ALTER TABLE main ADD COLUMN max_match INT default 0;
UPDATE main SET min_match = 49700 - Amount,
max_match = 50000 - Amount;
SELECT * FROM main AS m, util AS u
WHERE
u.Amount BETWEEN m.min_match AND m.max_match
;
I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate the dataframe when going through a loop and there is a difference greater than 4 in row.hour.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate the rows when, iterating through all the rows, there is a difference greater than 4 in row.hour.
row.hour[0] = 1 and row.hour[1] = 2: here the difference is 1. But between row.hour[2] = 4 and row.hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (difference greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it with an if statement.
I tried the code below but nothing is happening.
for i in range(0, len(df)-1):
    if (df.iloc[i,0] - df.iloc[i+1,0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4 you want to add the previous row to the DF.
If that is what you want try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d)-2):
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4-10 instead of 10-4, so you check -6 > 4 instead of 6 > 4.
You have to swap the operands:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger) then you would see it.
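A quick sketch of the diagnostic this answer suggests, assuming the dataframe j built in the earlier answer; printing both orderings of the subtraction makes the sign problem obvious:

# Print both orderings of the difference to see why the original
# condition (df.iloc[i,0] - df.iloc[i+1,0] > 4) is never True.
for i in range(len(j) - 1):
    original = j.iloc[i, 0] - j.iloc[i + 1, 0]   # e.g. 4 - 10 = -6
    swapped = j.iloc[i + 1, 0] - j.iloc[i, 0]    # e.g. 10 - 4 = 6
    print(i, original, swapped, swapped > 4)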
I have a dataframe that looks like this
initial year0 year1
0 0 12
1 1 13
2 2 14
3 3 15
Note that the number of year columns year0, year1... (year_count) is completely variable but will be constant throughout this code
I first wanted to apply a function to each of the 'year' columns to generate 'mod' columns like so
def mod(year, scalar):
    return year * scalar
s = 5
year_count = 2
# Generate new columns
df[[f"mod{y}" for y in range (year_count)]] = df[[f"year{y}" for y in range(year_count)]].apply(mod, scalar=s)
initial year0 year1 mod0 mod1
0 0 12 0 60
1 1 13 5 65
2 2 14 10 70
3 3 15 15 75
All good so far. The problem is that I now want to apply another function to both the year column and its corresponding mod column to generate another set of val columns, so something like
def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale
Then I apply this to each of the columns (year0, mod0), (year1, mod1) etc to generate the next tranche of columns.
With scale = 10 I should end up with
initial year0 year1 mod0 mod1 val0 val1
0 0 12 0 60 0 720
1 1 13 5 65 60 780
2 2 14 10 70 120 840
3 3 15 15 75 180 900
This is where I'm stuck - I don't know how to put two existing df columns together in a function with the same structure as in the first example, and if I do something like
df[['val0', 'val1']] = df['col1', 'col2'].apply(lambda x: sum_and_scale('mod0', 'mod1', scale=10))
I don't know how to generalise this to have arbitrary inputs and outputs and also apply the constant scale parameter. (I know the last piece of code won't work, but it's the other avenue to a solution I've seen.)
The reason I'm asking is that I believe the loop I currently have working is creating performance issues, given the number of columns and the length of each column.
Thanks
IMHO, it's better with a simple for loop:
for i in range(2):
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
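A self-contained sketch of that loop on the example data, using range(year_count) so it generalises to any number of year columns (the values are taken from the question; column order in the printed result may differ from the expected output):

import pandas as pd

df = pd.DataFrame({'initial': [0, 1, 2, 3],
                   'year0':   [0, 1, 2, 3],
                   'year1':   [12, 13, 14, 15]})

def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale

s = 5
year_count = 2

for i in range(year_count):
    df[f'mod{i}'] = df[f'year{i}'] * s                                      # mod0, mod1
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)  # val0, val1

print(df)  # produces the mod/val values shown in the expected output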
I am trying to fill in the missing fields of a 3x3 matrix (square) in order to form a magic square (the rows, columns, and both diagonals sum to the same value, filled with non-repeating positive integers).
an example of such square will be like :
[_ _ _]
[_ _ 18]
[_ 28 _]
Since it doesn't follow the basic rules of the normal magic square, where the integers are limited to 1-9 (from 1 to n^2), the magic constant (sum) is not equal to 15 (n(n^2+1)/2); rather, it's unknown and has many possible values.
I tried a very naïve approach where I generated random numbers in the empty fields with an arbitrary maximum of 99, then passed the whole square into a function that checks whether it's a valid magic square.
It basically keeps going until it finds the combination of numbers in the right places.
Needless to say, this solution was dumb; it keeps going for hours before it finds the answer, if ever.
I also thought about doing an exhaustive number generation (basically trying every combination of numbers) until I find the right one, but this faces the same issue.
So I need help figuring out an algorithm, or some way to limit the range of random numbers generated.
3 by 3 magic squares are a vector space with these three basis elements:
1 1 1     0  1 -1    -1  1  0
1 1 1    -1  0  1     1  0 -1
1 1 1     1 -1  0     0 -1  1
You can introduce 3 variables a, b, c that represent the contribution from each of the 3 basis elements, and write equations for them given your partial solution.
For example, given your example grid you'll have:
a + b - c = 18
a - b - c = 28
Which immediately gives 2b = -10, or b = -5. And a - c = 23, or c = a - 23.
The space of solutions looks like this:
23 2a-28 a+5
2a-18 a 18
a-5 28 2a-23
You can see each row/column/diagonal adds up to 3a.
Now you just need to find integer solutions for a and c that satisfy your positive and non-repeating constraints.
For example, a=100, b=-5, c=77 gives:
23 172 105
182 100 18
95 28 177
The minimal sum magic square with positive integer elements occurs for a=15, and the sum is 3a=45.
23 2 20
12 15 18
10 28 7
It happens that there are no repeats here. If there were, we'd simply try the next larger value of a and so on.
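A small sketch of that search in Python: build the parametric square derived above for increasing integer a, and keep the first one whose entries are all positive and distinct (the helper names square and is_valid are just illustrative):

def square(a):
    # Parametric family of solutions derived above (b = -5, c = a - 23)
    return [[23, 2*a - 28, a + 5],
            [2*a - 18, a, 18],
            [a - 5, 28, 2*a - 23]]

def is_valid(sq):
    flat = [x for row in sq for x in row]
    return all(x > 0 for x in flat) and len(set(flat)) == 9

a = 15  # smallest integer a making every entry positive (2a - 28 > 0)
while not is_valid(square(a)):
    a += 1

print(square(a))                 # [[23, 2, 20], [12, 15, 18], [10, 28, 7]]
print("magic constant:", 3 * a)  # 45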
A possible approach is translating the given numbers to other values. A simple division is not possible, but you can translate with (N-13)/5. Then you have a partially filled square:
- - -
- - 1
- 3 -
for which there is a solution:
2 7 6
9 5 1
4 3 8
When you translate these numbers back with (N*5)+13, you obtain:
23 48 43
58 38 18
33 28 53
which sums up to 114 in all directions ((5 * 15) + (3 * 13)).
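A quick sketch of the back-translation step, starting from that known 1-9 solution (the variable names lo_shu and translated are mine):

# Known 1-9 magic square whose 1 and 3 sit where 18 and 28 belong
lo_shu = [[2, 7, 6],
          [9, 5, 1],
          [4, 3, 8]]

# Translate back with N*5 + 13
translated = [[n * 5 + 13 for n in row] for row in lo_shu]
print(translated)          # [[23, 48, 43], [58, 38, 18], [33, 28, 53]]
print(sum(translated[0]))  # 114 = 5*15 + 3*13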
I have a CSV file like the one below (after sorting the dataframe by iy):
iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25
I'm trying to compute the mean and the fluctuation when iy are equal. For example, for the CSV above, what I want is something like this:
iy,u,U,u'
1,80,70,10
1,90,70,20
1,70,70,0
1,50,70,-20
1,60,70,-10
2,20,25,-5
2,30,25,5
2,35,25,10
2,15,25,-10
2,25,25,0
Where U is the average of u when iy are equal, and u' is simply u-U, the fluctuation. I know that there's a function called groupby.mean() in pandas, but I don't want to group the dataframe, just take the mean, put the values in a new column, and then calculate the fluctuation.
How can I proceed?
Use groupby with transform to calculate the mean for each group and assign that value to a new column 'U', then subtract the two columns:
df['U'] = df.groupby('iy')['u'].transform('mean')
df["u'"] = df['u'] - df['U']
df
Output:
iy u U u'
0 1 80 70 10
1 1 90 70 20
2 1 70 70 0
3 1 50 70 -20
4 1 60 70 -10
5 2 20 25 -5
6 2 30 25 5
7 2 35 25 10
8 2 15 25 -10
9 2 25 25 0
You could get fancy and do it in one line:
df.assign(U=df.groupby('iy')['u'].transform('mean')).eval("u_prime = u - U")
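One caveat with the one-liner: eval can't create a column literally named u', so it uses u_prime; if the original label matters, a rename afterwards (a sketch, with out as an illustrative name) gets back to the question's exact column names:

out = (df.assign(U=df.groupby('iy')['u'].transform('mean'))
         .eval("u_prime = u - U")
         .rename(columns={'u_prime': "u'"}))
print(out)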