Probably a naive question, but I'm new to this:
I have a column with 100,000 entries containing dates from Jan 1, 2018 to August 1, 2019 (with repeated entries). I want to create a new column in which a number, say 3500, is divided up so that sum(new_column) for any particular day is less than or equal to 3500.
For example, say 01-01-2018 has 40 entries in the dataset; then 3500 should be distributed randomly across those 40 entries so that the total of these 40 rows is less than or equal to 3500. This needs to be done for every date in the dataset.
Can anyone advise me on how to achieve this?
EDIT: The Excel file is here.
Thanks
My answer is not the best, but it may work for you. Because you have 100,000 entries, it will probably slow down performance, so apply it and then paste as values: the solution uses the RANDBETWEEN function, which recalculates every time you change a cell.
So I made a test dataset like this:
The first column (ID) would be the dates, and the second column would be the random numbers.
The bottom-right corner shows the totals; as you can see, the totals for each ID sum to 3500.
The formula I've used is:
=IF(COUNTIF($A$2:$A$7;A2)=1;3500;IF(COUNTIF($A$2:A2;A2)=COUNTIF($A$2:$A$7;A2);3500-SUMIF($A$1:A1;A2;$B$1:B1);IF(COUNTIF($A$2:A2;A2)=1;RANDBETWEEN(1;3500);RANDBETWEEN(1;3500-SUMIF($A$1:A1;A2;$B$1:B1)))))
And it works pretty well. Pressing F9 to recalculate the worksheet gives new random numbers, but they always sum to 3500.
Hope you can adapt this to your needs.
UPDATE: Be aware that my solution will always force the numbers to sum to exactly 3500, never less than 3500. You'll need to adapt that part. As I said, not my best answer...
UPDATE 2: I uploaded a sample file to my Google Drive in case you want to check how it works. https://drive.google.com/open?id=1ivW2b0b05WV32HxcLc11gP2JWvdYTa84
You will need 2 columns: one to count the number of dates and one for the values.
Formula in B2 is =COUNTIF($A$2:$A$51,A2)
Formula in C2 is =RANDBETWEEN(1,3500/B2)
Column B gives the repetition count for each date.
Column C gives a random number; for each date, these numbers will sum to at most 3500.
The range in the column B formula is $A$2:$A$51, which you can change to match your data.
EDIT
For each date in your list, you can apply a formula like the one below.
The formula in D2 is =SUMIF(B:B,B2,C:C)
For the difference from 3500 for each unique date, you can use a pivot table and apply the formula to each date's sum, like below.
Formula in J2 is =3500-I2
Sorry, a little late to the party, but this looked like a fun challenge!
The simplest way I could think of is to add a RAND() column (then hard-code the values, if required) and then another column that calculates the 3500 split per date, based on the RAND() column.
Here's the function:
=ROUNDDOWN(3500*B2/SUMIF($A$2:$A$100000,A2,$B$2:$B$100000),0)
Illustrated here:
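Separately, if you'd rather do this in pandas than Excel, here is a sketch of the same idea: random weights scaled per date, with np.floor standing in for ROUNDDOWN. The 'date' column name is an assumption.
import numpy as np
import pandas as pd

# Assumes df has a 'date' column; draw a random weight per row, then scale
# each day's weights so the floored allocations sum to at most 3500
rng = np.random.default_rng()
df['rand'] = rng.random(len(df))
day_total = df.groupby('date')['rand'].transform('sum')
df['allocation'] = np.floor(3500 * df['rand'] / day_total).astype(int)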
Related
I want to check how many values are lower than 2500.
1) Using .count():
df[df.price<2500]["price"].count()
2) Using .value_counts():
df[df.price<2500]["price"].value_counts()
The first one returns 27540 and the second 2050. Which count is correct?
Definitely not 2050; look at your histogram.
The value_counts method returns only one row per distinct value, together with the number of times it occurs. So there are 2050 different prices, but once you count the duplicates there are many more rows.
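A quick way to see the difference on a toy example (prices made up here):
import pandas as pd

df = pd.DataFrame({'price': [100, 100, 200, 300, 300, 300]})

# count(): number of rows below the threshold
print(df[df.price < 2500]['price'].count())         # 6

# value_counts(): one row per distinct price, with its frequency
print(df[df.price < 2500]['price'].value_counts())  # 3 rows

# nunique(): the number of distinct prices
print(df[df.price < 2500]['price'].nunique())       # 3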
I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with 34. When there is a duplicate uniqueid with a different rma_created_date, it means some failure occurred. The 34 needs to be changed to the number of weeks between the new, most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct
    number of weeks when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
    # TODO: adjust time_in_weeks for repeated uniqueids
    return df
Now I need to perform what I described above, but I am a little confused about how to do this, especially considering there could be multiple failures, not just two.
I should get something like this returned as output
As you can see, the 34 got changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the suggested expression:
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(
        df.rma_processed_date.dt.isocalendar().week.shift(1)
    ),
    df.time_in_weeks,
)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear; happy to correct this if I've interpreted it wrongly.
Try np.where(condition, choice if condition is True, choice if condition is False).
import numpy as np
import pandas as pd

# Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])

# Solution: where the uniqueid is a repeat, recompute time_in_weeks as the gap
# between this row's rma_created_date and the previous row's rma_processed_date
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    (df.rma_created_date - df.rma_processed_date.shift(1)).dt.days / 7,
    df.time_in_weeks,
)
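Regarding the final note about 1/1/1900: one way (a sketch, untested against the real data) is to fold that check into the condition so those rows are left untouched:
# Only recompute where the uniqueid is a repeat and neither date involved
# is the 1/1/1900 placeholder
placeholder = pd.Timestamp('1900-01-01')
mask = (
    df.uniqueid.duplicated(keep='first')
    & (df.rma_created_date != placeholder)
    & (df.rma_processed_date.shift(1) != placeholder)
)
df['time_in_weeks'] = np.where(
    mask,
    (df.rma_created_date - df.rma_processed_date.shift(1)).dt.days / 7,
    df.time_in_weeks,
)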
Here's my data:
https://docs.google.com/spreadsheets/d/1Nyvx2GXUFLxrJdRTIKNAqIVvGP7FyiQ9NrjKiHoX3kE/edit?usp=sharing
Dataset
It's a small part of a dataset with hundreds of order_ids.
I want to find the duration in the #timestamp column for each order_id. For example, for order_id 3300400, the duration runs from index 6 to index 0; similarly for all other order_ids.
I also want the sum of items.quantity and items.price for each order_id. For example, for order_id 3300400, the sum of items.quantity = 2 and the sum of items.price = 499 + 549 = 1048; similarly for the other order_ids.
I am new to Python, but I think this will need the use of loops. Any help will be highly appreciated.
Thanks and Regards,
Shantanu Jain
You have figured out how to use the groupby() method, which is good. Working out the diff in timestamps is a little more work.
# Function to get the first and last timestamps within each group
def get_index(df):
    return df.iloc[[0, -1]]

# Apply the function, then use the diff method on ['#timestamp']
df['time_diff'] = df.groupby('order_id').apply(get_index)['#timestamp'].diff()
I haven't tested any of this code, and it will only work if your timestamps are pd.Timestamps. It should at least give you an idea of where to start.
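For the sums of items.quantity and items.price per order_id, no loops are needed either; a plain groupby aggregation should do it (column names taken from the sheet):
# Sum quantity and price within each order
order_totals = df.groupby('order_id')[['items.quantity', 'items.price']].sum()
print(order_totals.head())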
The database I'm using shows the total number of debtors for every town for every quarter.
Since there are 43 towns listed, there are 43 'total debtors' entries per quarter (30-Sep-17, etc.).
My goal is to find the total number of debtors for every quarter (so, theoretically, the sum of all 43 'total debtors' values per quarter), but I'm not quite sure how.
I've tried using the sum() function, but I'm not sure how to make it add the totals quarter by quarter.
Here's what the database looks like and my attempt (I printed the first 50 rows just to provide an idea of what it looks like)
https://i.imgur.com/h1y43j8.png
Sorry in advance if the explanation was a bit unclear.
You should use groupby. It's a pandas function that does exactly what you're trying to do: it groups the DataFrame by whatever column you pick.
total_debtors_pq = df.groupby('Quarter end date')['Total number of debtors'].sum()
You can then extract the total for each quarter from total_debtors_pq.
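For example, assuming the quarter labels look like the screenshot, you could then look up a single quarter:
print(total_debtors_pq.loc['30-Sep-17'])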
Having trouble coding for the following question:
"Create a table that has 15 pay grades (rows) and within each pay grade are 10 steps (columns). Grade 1 step 1 starts at $21,885. Each step in a pay grade increases by 1.4 percent from the previous step. Each pay grade increases by 4.3 percent from step 1 in the previous grade. Label each row and column appropriately. Print the table and write to a file. Use integer values only."
Any help is greatly appreciated!
I'm not going to do your homework for you, but I'll give you some ideas to point you in the right direction. I assume you can use numpy so you can create and use arrays (perfect for this application).
Create a numpy ndarray with dimensions 15 rows (pay grades) by 10 columns (steps).
Assign the starting pay for Grade 1, Step 1 to cell [0,0].
Step/column values increase by 1.4%, so the next column value is col_i+1 = 1.014 * col_i.
Grade/row values increase by 4.3%, so the next row value is row_i+1 = 1.043 * row_i.
These can be calculated with 2 loops over the row/column indices.
If you're clever, you can create the values for one row (or column) and then calculate each row/column in one shot.
ndarray won't handle mixed data types for titles, but printing should be simple enough with formatted strings.
"Use integer values only" leads to an interesting question:
Do you use integer math, or retain accuracy with floats, then print integer values?
Also, you need to decide if you want to truncate or round.
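Not a full solution to hand in, but a minimal sketch of the vectorized approach outlined above (the file name, column widths, and the choice of rounding are my assumptions):
import numpy as np

grades, steps = 15, 10
base = 21885.0

# Step 1 of each grade: 4.3% above step 1 of the previous grade
step1 = base * 1.043 ** np.arange(grades)             # shape (15,)

# Within a grade, each step is 1.4% above the previous step
table = step1[:, None] * 1.014 ** np.arange(steps)    # shape (15, 10)

# "Integer values only": round here, or truncate with table.astype(int)
table = np.rint(table).astype(int)

# Print the labeled table and write the same lines to a file
with open('pay_table.txt', 'w') as f:
    header = 'Grade' + ''.join(f'  Step {s + 1:>2}' for s in range(steps))
    print(header)
    f.write(header + '\n')
    for g in range(grades):
        row = f'{g + 1:>5}' + ''.join(f'{v:>9}' for v in table[g])
        print(row)
        f.write(row + '\n')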