Find the maximum output over a variable-length period - python

I have a hypothetical set of data with 3 columns that has monthly profit data for a set of widget machines. I am trying to figure out the maximum profit period within a 2-year span.
The 3 columns are as follows:
name: identifier of widget machine (there are maybe 100 of these)
date: month/year over a 2 year period
profit: dollars made from widgets that month (can be negative if costs exceed revenue)
The maximum profit period is a consecutive run of months at least 3 months long (and it could encompass all of the data).
Obviously I could brute force this and simply test every combination: Jan-Mar, Jan-Apr, Jan-May, Feb-Apr, etc., but I am looking for a better solution than creating all of these by hand. The data seems a bit too big to transpose and turn months into columns, so I would like to be able to operate on a stacked dataset as described.
I'd prefer a SAS data step, but a SQL query that works in PROC SQL would be fine as well (the sets of subqueries that might be required are beyond my ability).
Example Data:
data max(drop=dt);
length name dt $50;
infile datalines delimiter=',';
input name $ dt profit;
date=input(dt,mmddyy10.);
format date mmddyy10.;
datalines;
Widget1,01/01/2011,1000
Widget1,02/01/2011,2000
Widget1,03/01/2011,500
Widget2,01/01/2011,100
Widget2,02/01/2011,200
Widget2,03/01/2011,-50
Widget2,04/01/2011,250
Widget2,05/01/2011,-150
Widget2,06/01/2011,-250
Widget2,07/01/2011,400
Widget2,08/01/2011,0
Widget2,03/01/2011,-200
;
Maybe a better phrasing of the question would be "How do I come up with all possible consecutive combinations of values?" From a query like that, I could then take the max of the combinations where the number of values is >= 3.
The query would build up every combination of sequential rows in the table, drop those with fewer than 3 rows, and then return the max value (grouped by Widget#, of course). I suppose it would also be helpful to know the starting and ending row for each combination. I'm trying to work out how this would be done in an SQL query (it doesn't sound like a SAS data step to my mind).
Python Sample:
Here is a sample with some made-up data that I wrote in Python. It is not the most efficient thing, but it gets the sort of result I am looking for--I just can't figure out how to replicate it in SQL or SAS:
from itertools import groupby

data = []
data.append(['Widget1','Jan',5])
data.append(['Widget1','Feb',1])
data.append(['Widget1','Mar',-2])
data.append(['Widget1','Apr',0])
data.append(['Widget1','May',-3])
data.append(['Widget1','Jun',8])
data.append(['Widget1','Jul',-2])
data.append(['Widget1','Aug',1])
data.append(['Widget2','Jan',-1])
data.append(['Widget2','Feb',1])
data.append(['Widget2','Mar',-3])
data.append(['Widget2','Apr',1])
data.append(['Widget2','May',-60])
data.append(['Widget2','Jun',9])
data.append(['Widget2','Jul',-2])
data.append(['Widget2','Aug',20])

results = []
for key, group in groupby(data, lambda g: g[0]):
    best = -999999                      # best total found so far for this widget
    for i, v in enumerate(data):
        if key != v[0]:                 # only consider rows for the current widget
            continue
        runningtotal = 0
        for j, w in enumerate(data):
            if key != w[0]:
                continue
            if i <= j:                  # accumulate the window starting at row i
                runningtotal = runningtotal + w[2]
                if i + 2 <= j and runningtotal > best:   # at least 3 months long
                    best = runningtotal
                    maxstart = v[1]
                    maxend = w[1]
    results.append([key, maxstart, maxend, best])
print(results)
This gives me the result of
[['Widget1', 'Jan', 'Jun', 9],
['Widget2', 'Jun', 'Aug', 27]]
for the fake python data I made.

Your core problem appears to be that you see combinatorially many periods, but you want a solution that doesn't require a combinatorial amount of work.
Luckily for you, if you have N months, you can solve this problem in O(N^2) time with O(N) space.
The trick is that you don't actually need to save all the period's values; you only need the maximal one. So let's break the big problem down into smaller chunks.
First, create two arrays of length N and fill them with zeroes. Now, you read in the first month and (if it earned a profit) put it in the first cell of each - it's the "best run of length 1" and also the "current run of length 1". If it's negative, leave the "best" at zero, but fill the "current" cell anyhow.
Next you read in the second month. If the second month earned more profit than the first, you replace the first cell of each array with the second month's value (otherwise just replace the "current" one but leave the "best" alone). Then, if the first month plus the second month nets positive, put that sum in the second cell of each array - that's the "best run of length 2" and the "current run of length 2" (if it nets negative, again fill only the "current" cell).
When you read in the third month, things start to get interesting. First you check the first cell: if the third month is greater than the value currently there, replace it. Next you check the second cell: if adding the third month and subtracting the first would make that value greater, do it (otherwise update only the "current" array, not the "best" array). Finally, populate the third cell of each array with the old "current run of length 2" value plus the third month's profit.
Continue on in this fashion. When you reach month i, you have the current runs of length 1..i stored, along with the best run of each length seen so far.
When you reach the end of the data, you can discard the "current" values and just take the max of the "best" array (restricted to lengths of at least 3, per your minimum-period requirement)!
Because this requires 1+2+3+...+N operations, it's O(N^2). Only one pass through the input data is necessary, and the storage is 2N, which is O(N). If you wish to know which period was most profitable, just store the cell that begins the run as well as the run's sum.
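For concreteness, a minimal Python sketch of that bookkeeping for a single machine might look like the following. The function name best_period, the min_len parameter, and the assumption that profits is one machine's monthly values in date order are illustrative choices, not from the original post:

def best_period(profits, min_len=3):
    """Best (total, start_index, end_index) over all runs of at least min_len months."""
    n = len(profits)
    current = [0] * n       # current[k-1] = sum of the last k months read so far
    best = [None] * n       # best[k-1]    = largest sum of any k consecutive months
    best_start = [0] * n    # index where the run behind best[k-1] begins
    for i, p in enumerate(profits):
        # Extend each run ending at the previous month by this month's profit.
        # Go from longest to shortest so current[k-2] still holds the old value.
        for k in range(i + 1, 1, -1):
            current[k - 1] = current[k - 2] + p
        current[0] = p
        # Record the best run of each length seen so far, plus where it starts.
        for k in range(1, i + 2):
            if best[k - 1] is None or current[k - 1] > best[k - 1]:
                best[k - 1] = current[k - 1]
                best_start[k - 1] = i - k + 1
    # Only runs of at least min_len months qualify for the final answer.
    return max((best[k - 1], best_start[k - 1], best_start[k - 1] + k - 1)
               for k in range(min_len, n + 1))

# On the Widget1 numbers from the question this prints (9, 0, 5): Jan through Jun, total 9.
print(best_period([5, 1, -2, 0, -3, 8, -2, 1]))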

I think I have a working method here. A combination of a PROC SQL cross join and a quick data step seems to give everything I want (though it could probably be done in a single big SQL query). Here it is on the sample data from my Python example.
data one;
length name $50;
infile datalines delimiter=',';
input name $ dt profit;
datalines;
Widget1,1,5
Widget1,2,1
Widget1,3,-2
Widget1,4,0
Widget1,5,-3
Widget1,6,8
Widget1,7,-2
Widget1,8,1
Widget2,1,-1
Widget2,2,1
Widget2,3,-3
Widget2,4,1
Widget2,5,-60
Widget2,6,9
Widget2,7,-2
Widget2,8,20
;
proc sql;
    create table two as
    select a.name, a.dt as start, b.dt as end, b.profit
    from one as a cross join one as b
    where start <= end and a.name = b.name
    order by name, start, end;
quit;
data two;
    set two;
    by name start;
    if first.start then sum = 0;
    sum + profit;
    months = (end - start) + 1;
run;
proc means noprint data=two(where=(months >= 3));
    by name;
    output out=three(drop=_:) maxid(sum(start) sum(end))=start end max(sum)=;
run;
Making it operate on date constructs instead of numbered months would be trivial (just change the 'months' variable to be based on actual dates).

The data step (and PROC SQL) will read the data sequentially; however, to mimic some of the functionality of the array solution from other programming languages, you can use the LAG function to look at previous observations.
If you sort your data by NAME and DATE, then you can use the BY statement in your data step and have access to FIRST. and LAST. to know when the NAME has changed.
Once you work out the algorithm, which is obviously the hardest part and which I don't have yet, you can OUTPUT each profit total per date sequence, then sort the new data set by name and profit total, which should put the highest total at the beginning of the BY group (First.Name).
Maybe this will spur some additional ideas from you or other SAS programmers.

Related

DAX formula that adds a decimal number to the next row till the end

Please, I desperately need help.
Below is my table (the first image shows the table without the expected result).
The Trips column was derived by dividing the Load column by 11,000.
What I am looking to achieve with DAX is: return the whole number of trips, then take the decimal part, multiply it by 11,000, and add it to the next row of the Load column; divide that row's Load by 11,000 again, return only the whole number, and repeat the process until the end of the rows.
These calculations need to be grouped by the Village column, and the Date column provides the ordering.
The leftover from the last row can be returned in a new row with the next date (which could be a day, month or quarter).
At the end, below is the expected result (the second image, with the expected result).
Please notice the additional rows below that retained their decimal without a whole number; those are the remainders from the last row in each particular village.
If this can also be solved with Python, I am happy to use it, since Python scripts can be used in Power BI.
You can access the sample dataset here.
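Since the question allows a Python solution, here is a rough pandas sketch of the carry-over calculation described above. The column names Village, Date and Load, the helper name carry_over_trips, and the extra Leftover column are assumptions, since the linked sample is not reproduced here:

import pandas as pd

CAPACITY = 11_000  # the divisor from the question

def carry_over_trips(group: pd.DataFrame) -> pd.DataFrame:
    """Whole-number trips per row, carrying each row's remainder into the next row's Load."""
    group = group.sort_values('Date').copy()   # the Date column provides the ordering
    trips, carry = [], 0
    for load in group['Load']:
        total = load + carry
        trips.append(total // CAPACITY)        # keep only the whole number of trips
        carry = total % CAPACITY               # leftover load is added to the next row
    group['Trips'] = trips
    # The remainder left after the village's last row; it could instead be appended
    # as an extra row with the next date, as the question asks.
    group['Leftover'] = [0] * (len(group) - 1) + [carry]
    return group

# Usage, assuming df holds the sample data:
# result = df.groupby('Village', group_keys=False).apply(carry_over_trips)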

Need to do calculation in dataframe with previous row value

I have this data frame with two columns. The condition I need is: when the 'Balance Created' column is empty, I need to take the last filled value of 'Balance Created' and add the next row's 'Amount' value to it.
Original data frame:
After the calculation, my desired result should be:
You can try using the cumulative sum in pandas to achieve this:
df['Amount'].cumsum()
# Edit-1
condition = df['Balance Created'].isnull()
df.loc[condition, 'Balance Created'] = df['Amount'].loc[condition]
You can also apply it based on groups, like deposit and withdraw:
df.groupby('transaction')['Amount'].cumsum()
I assume your question is mostly "How do I solve this using pandas", which is a good question that others have given you pandas-specific answers for.
But in case this question is more along the lines of "How do I solve this using an algorithm", which is a common problem for people just starting to write code, then this little paragraph might push you in the right direction.
for i in frame do
    if frame.balance[i] is empty do
        if i equals 0 do    // Edge case where the first balance is missing
            frame.balance[i] = frame.amount[i]
        else do
            frame.balance[i] = frame.amount[i] + frame.balance[i-1]
        end
    end
end
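A runnable pandas translation of that pseudocode might look like the sketch below; the function name fill_balance and the assumption that the columns are called 'Amount' and 'Balance Created' are mine:

import pandas as pd

def fill_balance(df: pd.DataFrame) -> pd.DataFrame:
    """Fill each missing 'Balance Created' as the previous balance plus this row's Amount."""
    df = df.copy()
    bal = df.columns.get_loc('Balance Created')
    amt = df.columns.get_loc('Amount')
    for i in range(len(df)):
        if pd.isna(df.iat[i, bal]):
            previous = df.iat[i - 1, bal] if i > 0 else 0   # edge case: first row missing
            df.iat[i, bal] = previous + df.iat[i, amt]
    return df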

Divide a number to date in Excel or Python

Probably a naive question, but I'm new to this:
I have a column with 100,000 entries having dates from Jan 1, 2018 to August 1, 2019 (with repeated entries as well). I want to create a new column in which a number, let's say 3500, is divided up in such a way that sum(new_column) for a particular day is less than or equal to 3500.
For example, let's say 01-01-2018 has 40 entries in the dataset; then 3500 is to be distributed randomly among those 40 entries in such a way that the total of these 40 rows is less than or equal to 3500, and this needs to be done for all the dates in the dataset.
Can anyone advise me on how to achieve that?
EDIT: The Excel file is here.
Thanks
My answer is not the best, but it may work for you. Because you have 100,000 entries it will probably slow down performance, so use it and then paste values: the solution uses the RANDBETWEEN function, which recalculates every time you make a change in a cell.
So I made a test dataset like this:
The first column, ID, would be the dates, and the second column would be random numbers.
The bottom right corner shows totals; as you can see, the totals for each ID sum up to 3500.
The formula I've used is:
=IF(COUNTIF($A$2:$A$7;A2)=1;3500;IF(COUNTIF($A$2:A2;A2)=COUNTIF($A$2:$A$7;A2);3500-SUMIF($A$1:A1;A2;$B$1:B1);IF(COUNTIF($A$2:A2;A2)=1;RANDBETWEEN(1;3500);RANDBETWEEN(1;3500-SUMIF($A$1:A1;A2;$B$1:B1)))))
And it works pretty well. Just pressing F9 to recalculate the worksheet gives new random numbers, but all of them sum up to 3500 every time.
Hope you can adapt this to your needs.
UPDATE: You need to know that my solution will always force the numbers to sum to exactly 3500; it never produces a total below 3500. If you need the sum to sometimes be less than 3500, you'll have to adapt that part. As I said, not my best answer...
UPDATE 2: I uploaded a sample file to my Google Drive in case you want to check how it works. https://drive.google.com/open?id=1ivW2b0b05WV32HxcLc11gP2JWvdYTa84
You will need 2 columns:
one to count the number of dates, and then one for the values.
Formula in B2 is =COUNTIF($A$2:$A$51,A2)
Formula in C2 is =RANDBETWEEN(1,3500/B2)
Column B gives the count of repetitions for each date.
Column C gives a random number whose sum per date will be at most 3500.
The range in the column B formula is $A$2:$A$51, which you can change according to your data.
EDIT
For each date in your list you can apply a formula like below
The formula in D2 is =SUMIF(B:B,B2,C:C)
For the difference value for each unique date, you can use a pivot and apply the formula to the sum of each date, like below.
Formula in J2 is =3500-I2
Sorry - a little late to the party but this looked like a fun challenge!
The simplest way I could think of is to add a rand() column (then hard code, if required) and then another column which calculates the 3500 split per date, based on the rand() column.
Here's the function:
=ROUNDDOWN(3500*B2/SUMIF($A$2:$A$100000,A2,$B$2:$B$100000),0)
Illustrated here:
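Since the question also allows Python, a short pandas/NumPy sketch of the same idea is given below; the column name 'date', the function name split_budget, and the floor-based scaling are assumptions of mine:

import numpy as np
import pandas as pd

def split_budget(df: pd.DataFrame, date_col: str = 'date', total: int = 3500, seed=None) -> pd.Series:
    """Random allocation per row such that each date's allocations sum to at most `total`."""
    rng = np.random.default_rng(seed)
    out = pd.Series(0, index=df.index, dtype=int)
    for _, idx in df.groupby(date_col).groups.items():
        weights = rng.random(len(idx))
        # Scale the weights to the budget and round down, so the per-date sum stays <= total.
        out.loc[idx] = np.floor(total * weights / weights.sum()).astype(int)
    return out

# Usage, assuming the dates live in a 'date' column:
# df['allocation'] = split_budget(df)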

Merging large data in Python in local machine

I have 140 CSV files. Each file has 3 variables and is about 750 GB. The number of observations varies from 60 to 90 million.
I also have another small file, treatment_data, with 138,000 rows (one for each unique ID) and 21 columns (1 column for the ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not).
The variables are,
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
(So in total, I have 138,000 * 138,000 (19+ billion) rows - one for every possible bilateral combination of IDs - divided across these 140 files.)
Research Question: Given a distance, how many neighbors (of each treatment type) does an ID have?
So I need help with a system (preferably in Pandas) where:
the researcher will input a distance;
the program will look over all the files and filter to the rows where the DISTANCE between ID_FROM and ID_TO is less than the given distance;
it outputs a single dataframe (DISTANCE can be dropped at this point);
it merges the dataframe with treatment_data by matching ID_TO with ID (ID_TO can be dropped at this point);
it collapses the data by ID_FROM (group by and sum the 1s across the 20 treatment variables).
(In the final output dataset, I will have 138,000 rows and 21 columns: 1 column for the ID and 20 columns, one for each different treatment type. So, for example, I will be able to answer the question, "Within 2000 meters, how many neighbors of ID 500 are in the 'treatment_media' category?")
IMPORTANT SIDE NOTE:
The DISTANCE variable ranges from 0 to roughly the radius of an average-sized US state (in meters). The researcher is mostly interested in what happens within 5000 meters, which usually drops 98% of observations. But sometimes he/she will check longer distance measures too, so I have to keep all the observations available. Otherwise, I could have simply filtered out the DISTANCEs greater than 5000 from the raw input files and made my life easier. The reason I think this is important is that the data are sorted on ID_FROM across the 140 files. If I could somehow rearrange these 19+ billion rows based on DISTANCE and associate them with some kind of dictionary system, then the program would not need to go over all 140 files. Most of the time, the researcher will be looking at only the bottom 2 percent of the DISTANCE range. It seems like a colossal waste of time that I have to loop over 140 files. But this is a secondary thought. Please do provide an answer even if you can't use this additional side note.
I tried looping over the 140 files for a particular distance in Stata; it takes 11+ hours to complete the task, which is not acceptable as the researcher will want to vary the distance within the 0 to 5000 range. Most of the computation time is wasted on reading each dataset into memory (that is how Stata does it). That is why I am seeking help in Python.
Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
TABLE Distances {
    Integer PrimaryKey,
    String IdFrom,
    String IdTo,
    Integer Distance
}
INDEX ON Distances(IdFrom, Distance);

TABLE TreatmentData {
    Integer PrimaryKey,
    String Id,
    String TreatmentType
}
INDEX ON TreatmentData(Id, TreatmentType);

-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the "treatment_media" treatment?
SELECT
    d.IdFrom AS Id,
    td.TreatmentType,
    COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = "500"
  AND d.Distance <= 2000
  AND td.TreatmentType = "treatment_media"
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.
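If the pipeline needs to stay in Python, the same outline could be driven through the built-in sqlite3 module. The sketch below follows the table and column names above, while the file name neighbors.db, the chunked CSV loading, and the parameterised distance are assumptions of mine:

import sqlite3
import pandas as pd

conn = sqlite3.connect("neighbors.db")

# Create the two tables from the outline above (types simplified) plus the index.
conn.executescript("""
    CREATE TABLE IF NOT EXISTS Distances     (IdFrom TEXT, IdTo TEXT, Distance INTEGER);
    CREATE TABLE IF NOT EXISTS TreatmentData (Id TEXT, TreatmentType TEXT);
    CREATE INDEX IF NOT EXISTS idx_dist ON Distances(IdFrom, Distance);
""")

# One-time load (sketch): append each of the 140 distance files in chunks, and load
# the treatment data reshaped to one row per (Id, TreatmentType) pair.
# for path in distance_files:
#     for chunk in pd.read_csv(path, chunksize=1_000_000):
#         chunk.to_sql("Distances", conn, if_exists="append", index=False)

# Neighbor counts per ID and treatment type within a researcher-supplied distance.
query = """
    SELECT d.IdFrom AS Id, td.TreatmentType, COUNT(*) AS Total
    FROM Distances d
    JOIN TreatmentData td ON d.IdTo = td.Id
    WHERE d.Distance <= ?
    GROUP BY 1, 2
"""
result = pd.read_sql_query(query, conn, params=(2000,))
print(result.head())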

Python/Pandas: sort by date and compute two week (rolling?) average

So far I've read in 2 CSVs and merged them based on a common element. I take the output of the merged CSV and iterate through the unique element they've been merged on. While I have them separated, I want to generate a daily count line and a two-week rolling average from the current date going backward. I cannot index based off the 'Date Opened' field, but I still need my outputs organized by this, with the most recent first. Once these are sorted by date, my daily count plotting issue will be rectified. My remaining task would be to compute a two-week rolling average for the count within the week. I've looked into the Pandas documentation and I think rolling_mean will work, but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28) but that doesn't seem to work. I know there is an easier way to do this, but I think I've hit a roadblock with the documentation available. The end result should look something like this graph. Right now my daily count graph isn't sorted (even though I think I've instructed it to be) and is unusable in line form.
def data_sort():
    data_merge = data_extract()
    domains = data_merge.groupby('PWx Domain')
    for domain in domains.groups.items():
        dsort = (data_merge.loc[domain[1]])
        print(dsort.head())
        open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
        #open_dt.to_csv('output\''+str(domain)+'_out.csv', sep = ',')
        open_ct = open_dt.value_counts(sort=False)
        biwk_avg = pd.rolling_mean(open_ct, 28)
        plt.plot(open_ct, 'bo')
        plt.show()

data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group the data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like this:
for _, domain in data_merge.groupby('PWx Domain'):
    # Convert date to the index
    domain.index = pd.to_datetime(domain['Date Opened'])
    # Sort by dates
    domain.sort_index(inplace=True)
    # Do the averaging
    rolling = pd.rolling_mean(domain.resample('1D').mean(), 14)
    plt.plot(rolling, 'bo')
    plt.show()
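Note that pd.rolling_mean was removed in later pandas releases (0.20+). With the modern .rolling accessor, the same daily-count-plus-two-week-average step might look like the sketch below; the function name is mine, and the 'Date Opened' column default simply mirrors the question:

import pandas as pd
import matplotlib.pyplot as plt

def plot_two_week_average(df, date_col='Date Opened'):
    """Plot daily counts and a 14-day rolling mean using the modern pandas API."""
    dates = pd.to_datetime(df[date_col])
    daily = df.set_index(dates).resample('1D').size()          # rows opened per day
    rolling = daily.rolling(window=14, min_periods=1).mean()   # two-week rolling average
    plt.plot(daily.index, daily.values, 'bo')
    plt.plot(rolling.index, rolling.values, 'r-')
    plt.show()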
