Create Table with pay grades - python

Having trouble coding for the following question:
"Create a table that has 15 pay grades (rows) and within each pay grade are 10 steps (columns). Grade 1 step 1 starts at $21,885. Each step in a pay grade increases by 1.4 percent from the previous step. Each pay grade increases by 4.3 percent from step 1 in the previous grade. Label each row and column appropriately. Print the table and write to a file. Use integer values only."
Any help is greatly appreciated!

I'm not going to do your homework for you, but I'll give you some ideas to point you in the right direction. I assume you can use numpy so you can create and use arrays (perfect for this application).
Create a numpy ndarray, dimensions: 15 rows (pay grades) by 10 columns (steps)
Assign the starting pay for Grade 1, Step 1 to cell [0,0]
Step/column values increase by 1.4%, so the next column value is col_(i+1) = 1.014 * col_i
Grade/row values increase by 4.3%, so the next row value is row_(i+1) = 1.043 * row_i
These can be calculated with 2 loops over the row/column indices.
If you're clever, you can create the values for one row (or column), then calculate each row/column in one shot.
ndarray won't handle mixed data types for titles, but printing should be simple enough with formatted strings.
"Use integer values only" leads to an interesting question:
Do you use integer math, or retain accuracy with floats, then print integer values?
Also, you need to decide if you want to truncate or round.
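To illustrate the "one shot" idea above (a minimal sketch, not a full worked answer — the label formatting and the choice to keep floats and truncate only for display are my own):

```python
import numpy as np

BASE = 21885        # Grade 1, Step 1 (from the question)
STEP_RATE = 1.014   # each step is 1.4% above the previous step
GRADE_RATE = 1.043  # each grade's step 1 is 4.3% above the previous grade's

# Vectorized version: the table is the outer product of two growth series.
grades = GRADE_RATE ** np.arange(15)    # growth factor per grade
steps = STEP_RATE ** np.arange(10)      # growth factor per step
table = BASE * np.outer(grades, steps)  # float table, shape (15, 10)

# Retain full float accuracy, then truncate to integers for output.
int_table = table.astype(int)

# Labels live outside the ndarray; formatted strings handle the printing.
header = "Grade " + "".join(f"{'Step ' + str(s + 1):>10}" for s in range(10))
print(header)
for g in range(15):
    print(f"{g + 1:>5} " + "".join(f"{int_table[g, s]:>10}" for s in range(10)))
```

Writing the table to a file is then a single call, e.g. `np.savetxt('paytable.txt', int_table, fmt='%8d')` (file name is your choice). Note that `astype(int)` truncates; use `np.round` first if you decide rounding is more appropriate.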

Allocate total amount to a column using cumulative sum, up to another column's limit

Background: I have a list of several hundred departments to which I would like to allocate budget as follows:
Each DEPT has an AMT_TOTAL budget within given number of months. They also have a monthly limit LIMIT_MONTH that they cannot exceed.
As each DEPT plans to spend their budget as fast as possible, we assume they will spend up to their monthly limit until AMT_TOTAL runs out. The amount we forecast they will spend, given this assumption, is in AMT_ALLOC_MONTH.
My objective is to calculate the AMT_ALLOC_MONTH column, given the LIMIT_MONTH and AMT_TOTAL columns. Based on what I've read and searched, I believe a combination of fillna and cumsum() can do the job. So far, the Python dataframe I've managed to generate is as follows:
I planned to fill the NaN using the following line:
table['AMT_ALLOC_MONTH'] = min((table['AMT_TOTAL'] - table.groupby('DEPT')['AMT_ALLOC_MONTH'].cumsum()).ffill, table['LIMIT_MONTH'])
My objective is to have the AMT_TOTAL minus the cumulative sum of AMT_ALLOC_MONTH (excluding the NaN values), grouped by DEPT; the result is then compared with the value in column LIMIT_MONTH, and the smaller value is filled into the NaN cells. The process is repeated until all NaN cells of each DEPT are filled.
Needless to say, the result did not come out as I expected; the code line only works for the first NaN after a cell with a value; subsequent NaN cells just copy the value above them. If there is a way to fix the issue, or a new and more intuitive way to do this, please help. Truly appreciated!
Try this:
for department in table['DEPT'].unique():
    subset = table[table['DEPT'] == department]
    for index, row in subset.iterrows():
        # re-read the subset so the cumulative sum sees the values
        # filled in on earlier iterations
        subset = table[table['DEPT'] == department]
        cumsum = subset.loc[:index - 1, 'AMT_ALLOC_MONTH'].sum()
        limit = row['LIMIT_MONTH']
        remaining = row['AMT_TOTAL'] - cumsum
        table.at[index, 'AMT_ALLOC_MONTH'] = min(remaining, limit)
It's not very elegant, I guess, but it seems to work.
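Here is the loop run end-to-end on a tiny made-up example (one department, a 250 total budget, and a 100-per-month limit — all numbers are invented for illustration), so you can see the "spend up to the limit until the total runs out" behaviour:

```python
import pandas as pd

# Hypothetical data: the expected forecast is 100, 100, 50, 0.
table = pd.DataFrame({
    'DEPT':            ['A', 'A', 'A', 'A'],
    'AMT_TOTAL':       [250, 250, 250, 250],
    'LIMIT_MONTH':     [100, 100, 100, 100],
    'AMT_ALLOC_MONTH': [float('nan')] * 4,
})

for department in table['DEPT'].unique():
    subset = table[table['DEPT'] == department]
    for index, row in subset.iterrows():
        # re-read the subset so the cumulative sum sees earlier fills
        subset = table[table['DEPT'] == department]
        cumsum = subset.loc[:index - 1, 'AMT_ALLOC_MONTH'].sum()
        remaining = row['AMT_TOTAL'] - cumsum
        table.at[index, 'AMT_ALLOC_MONTH'] = min(remaining, row['LIMIT_MONTH'])

print(table['AMT_ALLOC_MONTH'].tolist())  # [100.0, 100.0, 50.0, 0.0]
```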

how to cluster values of continuous time series

In the picture I plot the values from an array of shape (400,8)
I wish to reorganize the points in order to get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown, and I am trying to recover them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the high indices of the array.
For instance at time t=136 I have only 4 values that are valid. Then array[t,i] > 0 for i <=3 and array[t,i] = 0 for i > 3
How can I cluster the points in a way that I get "continuous" time series i.e. at time t=136, array[136,0] should go into d, array[136,1] should go into e, array[136,2] should go into f and array[136,3] should go into g
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually, then beginning in a place where there are several adjacent columns with full spans of real data, and working away from that location first to the left and then to the right, one column at a time: If the column contains no zeros, it is OK. If it contains zeros, then compute local row averages of the immediately adjacent columns, using only non-zero values (the number of columns depends on the density of missing data and the resolution between the signals), and then put each valid value in the current column into the row with the closest 'local row average' value, and put zeros in the remaining rows. How to code that depends on what you have done so far. If you are using numpy, then it would be convenient to first convert the zeros to NaN's, because numpy.nanmean() will ignore the NaN's.
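A rough numpy sketch of that hunch, on synthetic stand-in data (the linked Google Drive file isn't reproduced here, so the 8 non-crossing series, the zero pattern, and the ±5-step window are all my own assumptions; collisions, where two values land on the same series, are not handled):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 8 non-crossing series over
# 400 time steps, with some values zeroed and pushed to high indices.
t = np.linspace(0, 4 * np.pi, 400)
levels = np.sort(rng.uniform(1.0, 2.0, 8))            # separates the series
data = levels[None, :] * (2.0 + np.sin(t))[:, None]   # shape (400, 8)

for i in rng.choice(400, 50, replace=False):          # knock out some points
    k = int(rng.integers(1, 4))                       # lose k of the 8 values
    keep = np.sort(rng.choice(8, 8 - k, replace=False))
    data[i] = np.concatenate([data[i, keep], np.zeros(k)])

# Step 1: zeros -> NaN so they can be ignored, then sort each time step.
work = np.where(data == 0.0, np.nan, data)
work.sort(axis=1)                                     # NaNs sort to the end

# Step 2: full rows are taken as-is; for rows with missing values,
# assign each valid value to the series whose local average (over
# complete neighbours within +/- 5 steps) is closest.
out = np.full_like(work, np.nan)
for i in range(work.shape[0]):
    row = work[i]
    valid = row[~np.isnan(row)]
    if valid.size == work.shape[1]:
        out[i] = row
        continue
    window = work[max(0, i - 5):i + 6]
    complete = window[~np.isnan(window).any(axis=1)]  # assumes some exist
    ref = complete.mean(axis=0)                       # local series averages
    for v in valid:
        out[i, np.argmin(np.abs(ref - v))] = v
```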

Need to create bins having equal population. Also need to generate a report that contains cross tab between bins and cut

I'm using the diamonds dataset, below are the columns
Question: to create bins having equal population. Also need to generate a report that contains cross tab between bins and cut. Represent the number under each cell as a percentage of total
That is the question I have. As a beginner, I created the Volume column and tried to create bins with equal population using qcut, but I'm not able to proceed further. Could someone help me out with an approach to solve the question?
pd.qcut(diamond['Volume'], q=4)
You are on the right path: pd.qcut() attempts to break the data you provide into q equal-sized bins (though it may have to adjust a little, depending on the shape of your data).
pd.qcut() also lets you specify labels=False as an argument, which will give you back the number of the bin into which the observation falls. This is a little confusing, so here's a quick explanation: you could pass labels=['A','B','C','D'] (given your request for 4 bins), which would return the label of the bin into which each row falls. By telling pd.qcut that you don't have labels to give the bins, the function returns a bin number instead, just without a specific label. Otherwise, the function gives back the interval (range of values) into which each observation (row) fell.
The reason you want the bin number is because of your next request: a cross-tab for the bin-indicator column and cut. First, create a column with the bin numbering:
diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)
Next, use the pd.crosstab() method to get your table:
pd.crosstab(diamond['binned_volume'], diamond['cut'], normalize=True)
The normalize=True argument divides each entry by the grand total, which covers the last part of your question, I believe.
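Putting the two steps together on stand-in data (a synthetic Volume column and cut categories, since the diamonds dataset itself isn't included here — column names are taken from the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-in for the diamonds data.
n = 1000
diamond = pd.DataFrame({
    'Volume': rng.lognormal(mean=4, sigma=0.5, size=n),
    'cut': rng.choice(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
                      size=n),
})

# Equal-population bins; labels=False returns the bin number (0-3).
diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)

# Each bin holds ~25% of the rows.
print(diamond['binned_volume'].value_counts().sort_index())

# Cross-tab of bin vs cut, each cell as a percentage of the grand total.
report = pd.crosstab(diamond['binned_volume'], diamond['cut'],
                     normalize=True) * 100
print(report.round(2))
```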

Divide a number to date in Excel or Python

Probably a naive question but new to this :
I have a column with 100000 entries having dates from Jan 1, 2018 to August 1, 2019 (repeated entries as well). I want to create a new column wherein I want to distribute a number, let's say 3500, in such a way that sum(new_column) for a particular day is less than or equal to 3500.
For example, let's say 01-01-2018 has 40 entries in the dataset; then 3500 is to be distributed randomly between the 40 entries in such a way that the total of these 40 rows is less than or equal to 3500, and it needs to be done for all the dates in the dataset.
Can anyone advise me as to how to achieve that?
EDIT : The excel file is Here
Thanks
My answer is not the best, but it may work for you. Because you have 100000 entries it will probably slow down performance, so use it and then paste values: the solution uses the RANDBETWEEN function, which recalculates every time you make a change in a cell.
So I made a data test like this:
First column ID would be the dates, and second column would be random numbers.
And the bottom right corner shows totals; as you can see, the totals for each date sum to 3500.
The formula I've used is:
=IF(COUNTIF($A$2:$A$7;A2)=1;3500;IF(COUNTIF($A$2:A2;A2)=COUNTIF($A$2:$A$7;A2);3500-SUMIF($A$1:A1;A2;$B$1:B1);IF(COUNTIF($A$2:A2;A2)=1;RANDBETWEEN(1;3500);RANDBETWEEN(1;3500-SUMIF($A$1:A1;A2;$B$1:B1)))))
And it works pretty well. Just pressing F9 to recalculate the worksheet gives new random numbers, but they always sum to 3500.
Hope you can adapt this to your needs.
UPDATE: You need to know that my solution will always force the numbers to sum to exactly 3500. In your case the sum of all values only needs to be less than or equal to 3500, so you'll need to adapt that part. As I said, not my best answer...
UPDATE 2: Uploaded a sample file to my Gdrive in case you want to check how it works. https://drive.google.com/open?id=1ivW2b0b05WV32HxcLc11gP2JWvdYTa84
You will need 2 columns: one to count the number of dates and one for the values.
Formula in B2 is =COUNTIF($A$2:$A$51,A2)
Formula in C2 is =RANDBETWEEN(1,3500/B2)
Column B is giving the count of repetition for each date
Column C is giving a random number whose sum will be at maximum 3500 for each count
The range in formula in B column is $A$2:$A$51, which you can change according to your data
EDIT
For each date in your list you can apply a formula like below
The formula in D2 is =SUMIF(B:B,B2,C:C)
For the difference value for each unique date you can use a pivot and apply the formula on sum of each date like below
Formula in J2 is =3500-I2
Sorry - a little late to the party but this looked like a fun challenge!
The simplest way I could think of is to add a rand() column (then hard code, if required) and then another column which calculates the 3500 split per date, based on the rand() column.
Here's the function:
=ROUNDDOWN(3500*B2/SUMIF($A$2:$A$100000,A2,$B$2:$B$100000),0)
Illustrated here:
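Since the question also allows Python, here is one way to do the same split with pandas (a sketch on invented dates; the column names, the floor-based rounding, and the random weighting are my own choices — flooring keeps each day's sum at or below 3500 rather than exactly 3500):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real file: repeated dates.
df = pd.DataFrame({'date': pd.to_datetime(
    ['2018-01-01'] * 4 + ['2018-01-02'] * 3 + ['2018-01-03'] * 5)})

TOTAL = 3500

def split_total(group: pd.Series) -> pd.Series:
    """Split TOTAL randomly across one date's rows; flooring the parts
    means each day's sum is at most TOTAL."""
    w = rng.random(len(group))
    parts = np.floor(TOTAL * w / w.sum()).astype(int)
    return pd.Series(parts, index=group.index)

df['allocated'] = df.groupby('date')['date'].transform(split_total)

# Each date's allocation now sums to at most 3500.
print(df.groupby('date')['allocated'].sum())
```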

Find the maximum output over a variable-length period

I have a hypothetical set of data with 3 columns that has monthly profit data for a set of widget machines. I am trying to figure out the maximum profit period within a 2-year span.
The 3 columns are as follows:
name: identifier of widget machine (there are maybe 100 of these)
date: month/year over a 2 year period
profit: dollars made from widgets that month (can be negative if costs exceed revenue)
The maximum profit period is a concurrent set of months at least 3 months long (and could encompass all of the data).
Obviously I could brute force this and simply test every combination: Jan-Mar, Jan-Apr, Jan-May, Feb-Apr, etc. but I am looking for a better solution than creating all of these by hand. It seems like the data is a bit too big to want to transpose across and turn months into columns so I would like to be able to operate on a stacked dataset as described.
I'd prefer a SAS data step, but an SQL query that works in proc SQL would be fine as well (the sets of subqueries that might be required are beyond my ability).
Example Data:
data max(drop=dt);
length name dt $50;
infile datalines delimiter=',';
input name $ dt profit;
date=input(dt,mmddyy10.);
format date mmddyy10.;
datalines;
Widget1,01/01/2011,1000
Widget1,02/01/2011,2000
Widget1,03/01/2011,500
Widget2,01/01/2011,100
Widget2,02/01/2011,200
Widget2,03/01/2011,-50
Widget2,04/01/2011,250
Widget2,05/01/2011,-150
Widget2,06/01/2011,-250
Widget2,07/01/2011,400
Widget2,08/01/2011,0
Widget2,03/01/2011,-200
;
Maybe a better phrasing of the question would be "How do I come up with all possible consecutive combinations of values?" From a query like that, I could then take the max of combinations where # of values >= 3.
The query would build up every combination of sequential rows in the table, drop those where there are less than 3 rows, and then return the max value (grouped by Widget# of course). I suppose it would also helpful to know the starting and ending row for each combination. I'm trying to work out how this would be done in an SQL query (doesn't sound like a sas datastep to my mind)
Python Sample:
Here is a sample with some made up data that I wrote in Python. It is not the most efficient thing but it gets the sort of result I am looking for--I just can't figure out how to replicate it in SQL or SAS:
from itertools import groupby
data = []
data.append(['Widget1','Jan',5])
data.append(['Widget1','Feb',1])
data.append(['Widget1','Mar',-2])
data.append(['Widget1','Apr',0])
data.append(['Widget1','May',-3])
data.append(['Widget1','Jun',8])
data.append(['Widget1','Jul',-2])
data.append(['Widget1','Aug',1])
data.append(['Widget2','Jan',-1])
data.append(['Widget2','Feb',1])
data.append(['Widget2','Mar',-3])
data.append(['Widget2','Apr',1])
data.append(['Widget2','May',-60])
data.append(['Widget2','Jun',9])
data.append(['Widget2','Jul',-2])
data.append(['Widget2','Aug',20])
results = []
for key, group in groupby(data, lambda g: g[0]):
    max = -999999
    for i, v in enumerate(data):
        if key != v[0]:
            continue
        runningtotal = 0
        for j, w in enumerate(data):
            if key != w[0]:
                continue
            if i <= j:
                runningtotal = runningtotal + w[2]
                if i + 2 <= j and runningtotal > max:
                    max = runningtotal
                    maxstart = v[1]
                    maxend = w[1]
    results.append([key, maxstart, maxend, max])
print(results)
This gives me the result of
[['Widget1', 'Jan', 'Jun', 9],
['Widget2', 'Jun', 'Aug', 27]]
for the fake python data I made.
Your core problem appears to be that you see combinatorially many periods, but you want a solution that doesn't require a combinatorial amount of work.
Luckily for you, if you have N months, you can solve this problem in O(N^2) time with O(N) space.
The trick is that you don't actually need to save all the period's values; you only need the maximal one. So let's break the big problem down into smaller chunks.
First, create two arrays of length N and fill them with zeroes. Now, you read in the first month and (if it earned a profit) put it in the first cell of each - it's the "best run of length 1" and also the "current run of length 1". If it's negative, leave the "best" at zero, but fill the "current" cell anyhow.
Next you read in the second month. If the second month earned more profit than the first, you replace the first cell of each array with the value of the second month (otherwise just replace the "current" one but leave the "best" alone). Then, if the first month plus the second month nets positive, put that value in the second cell of each - that's the "longest run of length 2" and the "current run of length 2" - the current-run-of-two plus the most recent cell is the current-run-of-three.
When you read in the third month, things start to get interesting. First you check the first cell - if the third month is greater than the value currently there, replace it. Next you check the second cell. If adding the third month and subtracting the first would make that value greater, do it. Otherwise just put it in the "current" array, but not the "best" array. Finally, populate the third cells with the "current run of length 2" value plus the third cell.
Continue on in this fashion. When you reach row i, you have the current runs having length 1..i stored, along with the best of each length so far.
When you reach the end of the array, you can discard the "current" values and just take the max of the "best" array!
Because this requires 1+2+3+...+N operations, it's O(N^2). Only one pass through the input data is necessary, and the storage is 2N, which is O(N). If you wish to know which period was most profitable, just store the cell that begins the run as well as the run's sum.
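The same O(N^2)-time, O(N)-space bound can be reached with simpler bookkeeping via prefix sums, which may be easier to translate into SQL or a data step (a sketch, not the two-array version described above; function and variable names are my own):

```python
def best_period(profits, min_len=3):
    """Return (best_sum, start, end) of the maximum-sum run of at least
    min_len consecutive months. O(N^2) time, O(N) extra space."""
    n = len(profits)
    prefix = [0] * (n + 1)                # prefix[i] = sum of first i months
    for i, p in enumerate(profits):
        prefix[i + 1] = prefix[i] + p
    best = None
    for start in range(n):
        for end in range(start + min_len - 1, n):
            total = prefix[end + 1] - prefix[start]
            if best is None or total > best[0]:
                best = (total, start, end)
    return best

# The two widget series from the Python sample in the question.
print(best_period([5, 1, -2, 0, -3, 8, -2, 1]))     # (9, 0, 5): Jan-Jun
print(best_period([-1, 1, -3, 1, -60, 9, -2, 20]))  # (27, 5, 7): Jun-Aug
```

This reproduces the 9 and 27 results from the earlier Python sample, with start/end given as month indices.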
I think I have a working method here. A combination of a proc SQL cross join and a quick data step seems to give everything I want (though it could probably be done in a single big SQL query). Here it is on the sample data from my python example.
data one;
length name $50;
infile datalines delimiter=',';
input name $ dt profit;
datalines;
Widget1,1,5
Widget1,2,1
Widget1,3,-2
Widget1,4,0
Widget1,5,-3
Widget1,6,8
Widget1,7,-2
Widget1,8,1
Widget2,1,-1
Widget2,2,1
Widget2,3,-3
Widget2,4,1
Widget2,5,-60
Widget2,6,9
Widget2,7,-2
Widget2,8,20
;
proc sql;
create table two as
select a.name, a.dt as start, b.dt as end, b.profit
from one as a cross join one as b
where start <= end and a.name = b.name
order by name, start, end;
quit;
run;
data two; set two;
by name start;
if first.start then sum=0;
sum+profit;
months = (end-start)+1;
run;
proc means noprint data=two(where=(months>=3));
by name;
output out=three(drop=_:) maxid(sum(start) sum(end))=start end max(sum)=;
run;
Making it operate on date constructs instead of numbered months would be trivial (just change the 'months' variable to be based on actual dates).
The data step (and Proc SQL) will read the data sequentially, however to mimic some of the functionality of the array solution from other programming languages, you can use the LAG function to look at previous observations.
If you sort your data by NAME and DATE, then you can use the BY statement in your data step and have access to FIRST. and LAST. to know when the NAME has changed.
Once you work out the algorithm, which obviously is the hardest part and which I don't have yet, you can OUTPUT each profit total per date sequence, then sort the new data set by name and profit total which should put the highest total at the beginning of the BY group (First.Name).
Maybe this will spur some additional ideas from you or other SAS programmers.