creating groups based on multiple criteria in pandas - python

I have 5 columns of state-level demographics (GDP, per-capita income, DTI, unemployment rate and HPI), and a column with states. I would like to create a variable that assigns each state to one of 4 groups based on these demographic variables, so for example:
Group 1 would have values in the first quartile (worst demographics),
group 2 would be in the second quartile,
group 3 would be in the third quartile,
group 4 would be in the fourth quartile.
Here I just put a random number from 1 to 4 to indicate what the outcome should be. In the dataset I of course deal with more states, but this is roughly what it should look like.
So, in the end, every state would belong to a certain group, based on its demographics.
For all of the variables, a lower value is worse, except for the unemployment rate, where of course a lower value is better.
Any help would be greatly appreciated.
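A minimal sketch of one way to do this, on toy data with hypothetical column names (gdp, income, dti, unemployment, hpi): rank each variable so that a higher rank means better demographics (inverting the unemployment rate), average the ranks into a composite score, and cut that score into quartiles with pd.qcut:

```python
import pandas as pd

# Toy data with hypothetical column names; the real dataset would
# have all the states and actual demographic values.
df = pd.DataFrame({
    "state": ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE"],
    "gdp":          [1, 5, 3, 2, 8, 6, 7, 4],
    "income":       [2, 6, 3, 1, 8, 7, 5, 4],
    "dti":          [3, 5, 2, 1, 7, 8, 6, 4],
    "unemployment": [8, 3, 6, 7, 1, 2, 4, 5],
    "hpi":          [1, 4, 2, 3, 8, 7, 6, 5],
})

# Rank each variable so that a higher rank means better demographics;
# the unemployment rate is ranked descending because lower is better.
ranks = df[["gdp", "income", "dti", "hpi"]].rank()
ranks["unemployment"] = df["unemployment"].rank(ascending=False)

# Average the ranks into one composite score per state, then cut the
# scores into quartile groups 1 (worst) through 4 (best).
df["group"] = pd.qcut(ranks.mean(axis=1), q=4, labels=[1, 2, 3, 4])
```

With real data you might standardize the variables (z-scores) instead of ranking, but ranks conveniently sidestep the very different scales of GDP, income, and the rest.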

Related

Split a pandas DataFrame into X groups, each with an average value of Y

Really wish I would've paid more attention to statistics in college, maybe I'd remember something for this.
Regardless, I have a dataset with approximately 4800 rows, and I want to split it into 8 groups; however, I'd like the final group to have only half the rows that the others do (hence 7.5 groups). I would like each of these groups to contain approximately the same number of rows (except for group 8, which will have half), and the sums of their values to be approximately equal.
Total rows: 4792
Sum of values: 33367
Average value per row: 33367/4792=6.963
Median: 2
Std Dev: 13.644
Total rows per intended group: 638.93
Total rows for 8th group: 319.46
Total value per intended group: 4448.933
Total value for 8th group: 2224.4965
I've calculated the above for this specific dataset and have the formulas ready to compute them on the fly, but I'm not quite sure how to use Python to split the dataset up. Cursory research into the issue suggests the knapsack algorithm? Another thought: subtract the average value from each row's value, then take one value from the top and as many values from the bottom as needed until the running total reaches zero, close out that group, and repeat until all rows are processed.
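The subtract-the-average idea can be simplified into a greedy balancing pass: sort the rows largest-first and drop each one into the open group that is furthest below its target sum, with group 8 given half the row capacity and half the sum target. A sketch on stand-in data (the real ~4800-row column would replace values):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
values = rng.poisson(7, size=100)  # stand-in for the real column

n_groups = 8
# The dataset is split into 7.5 "shares": groups 1-7 get one share of
# rows each, group 8 gets half a share (and half the target sum).
share = len(values) / (n_groups - 0.5)
capacities = [math.ceil(share)] * (n_groups - 1) + [math.ceil(share / 2)]
weights = [1.0] * (n_groups - 1) + [0.5]  # target-sum proportions

# Greedy balancing: walk the rows largest-first and put each one into
# the open group that is furthest below its (weighted) target sum.
order = np.argsort(values)[::-1]
sums = [0.0] * n_groups
counts = [0] * n_groups
assignment = np.empty(len(values), dtype=int)
for i in order:
    open_groups = [g for g in range(n_groups) if counts[g] < capacities[g]]
    g = min(open_groups, key=lambda g: sums[g] / weights[g])
    assignment[i] = g
    sums[g] += values[i]
    counts[g] += 1
```

This is a heuristic, not an exact knapsack solution, but largest-first greedy balancing typically leaves the group sums within one item's value of each other, which matches the "approximately equal" requirement.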

How to create stock index that alternates between two groups of stocks in Python?

So I have two groups of stocks, group A and group B. I have 2 data frames for each group's tickers and 2 for their daily stock performance over the last 10 years. For a given year, I want to be able to change which group of stocks becomes my index (based on some metric, e.g. interest rates).
I know how to analyze and to visualize an index of equal-weighted stocks, however, how would I go about constructing my data frame to alternate which group of stocks are in the index on a yearly basis?
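One way to sketch this, on toy daily returns with a hypothetical year-to-group mapping (group_by_year stands in for whatever interest-rate rule you apply): slice each year out of the active group's frame, take the equal-weighted mean across its tickers, and concatenate the yearly pieces into one return series:

```python
import numpy as np
import pandas as pd

# Toy daily returns for the two groups; in practice these would come
# from the performance data frames described above.
dates = pd.date_range("2020-01-01", "2021-12-31", freq="B")
rng = np.random.default_rng(1)
returns_a = pd.DataFrame(rng.normal(0, 0.01, (len(dates), 3)),
                         index=dates, columns=["A1", "A2", "A3"])
returns_b = pd.DataFrame(rng.normal(0, 0.01, (len(dates), 3)),
                         index=dates, columns=["B1", "B2", "B3"])

# Hypothetical rule: which group is "in" the index each year.
group_by_year = {2020: "A", 2021: "B"}

# Equal-weighted daily return of whichever group is active that year.
parts = []
for year, grp in group_by_year.items():
    src = returns_a if grp == "A" else returns_b
    parts.append(src.loc[str(year)].mean(axis=1))
index_returns = pd.concat(parts).sort_index()

# Compound the daily returns into an index level starting near 1.
index_level = (1 + index_returns).cumprod()
```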

randomisation in python within a group for A/B test

I have a Python list of size 67 with three unique values, with the following distribution:
A - 20 occurrences randomly spread in the list
B - 36 occurrences randomly spread in the list
C - 11 occurrences randomly spread in the list
I want to perform random selection at 10% within each group to perform a special treatment on the values selected from randomisation.
Based on the occurrences in the list shown above, 2 treatments for group A, 3 treatments for B and 1 treatment for C should be done.
Selection for treatment need not be done exactly on the 10th occurrence of a value but the ratio of treatment to values should be maintained at approximately 10%.
Right now, I have this code:
import random
if random.random() <= 0.1:
    do_something()  # the special treatment
Using this code doesn't get me the above treatment counts at a group level; instead, it randomly picks treatments across all groups. I want to constrain the random selection at a group level. How do I do that?
Also, if this list were dynamic and keeps getting bigger and bigger and populated with more values of A,B,C at run time albeit with different distributions, how can I still maintain the randomisation at a group (unique value in the list) level.
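A sketch of group-level sampling: collect the positions of each unique value first, then draw 10% of the positions within each group with random.sample, so the per-group ratio is enforced rather than left to chance. For the dynamic case, the same code can simply be re-run over the full list whenever it grows, since the positions are rebuilt from scratch each time:

```python
import random
from collections import defaultdict

# Stand-in list with the stated distribution: 20 A, 36 B, 11 C.
labels = ["A"] * 20 + ["B"] * 36 + ["C"] * 11
random.seed(42)
random.shuffle(labels)

# Collect the positions of each unique value.
positions = defaultdict(list)
for i, label in enumerate(labels):
    positions[label].append(i)

# Sample 10% (rounded down, at least one) of the positions per group;
# these are the elements that receive the special treatment.
treated = {
    label: random.sample(idxs, max(1, int(0.1 * len(idxs))))
    for label, idxs in positions.items()
}
```

With 20/36/11 occurrences this yields exactly 2, 3 and 1 treated elements, matching the expected counts above.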

Cumulated total highest bill of a dataset

I have a huge dataset with a lot of different client names, bills etc.
Now I want to show the 4 clients with the highest cumulative total bill.
So far I have used the groupby function:
data.groupby(by = ["CustomerName","Bill"], as_index=False).sum()
I tried grouping by the customer name and the bill, but that doesn't give me the total sum of each customer's orders, only each individual order.
Can someone help and tell me how to get, in the first position, customer X (with the highest accumulated bill) and the sum of all their orders, in second position the customer with the second highest accumulated bill, and so on?
Big thanks!
Since I don't know the full structure of your data frame, I recommend subsetting the relevant columns first:
data = data[["CustomerName", "Bill"]]
Then, group by CustomerName only and sum over the remaining columns (Bill in this case):
data = data.groupby(by=["CustomerName"]).sum()
Finally, sort by the Bill column in descending order and take the top four. Note that each step's result has to be assigned back, since these methods return new data frames rather than modifying data in place:
data = data.sort_values(by="Bill", ascending=False)
print(data.head(4))

Dataset with lots of zero value as missing value. What should I do?

I am currently working with the IMDB 5000 movie dataset for a class project. The budget variable has a lot of zero values.
They are missing entries. I cannot drop them because they make up 22% of my entire dataset.
What should I do in Python? Some suggested binning; could you provide more details?
Well, there are a few options.
Take the average of the non-zero values and fill all the zeros with it. This yields crude results and is not best practice, because a few outliers can throw off the whole average.
Use the median of the non-zero values instead. Still not a great option, but it is less likely to be thrown off by outliers.
Binning would mean splitting the movies into a certain number of budget intervals (say, equal-width bins, or simply over/under a million) and replacing each budget with its bin index: movies in the first interval get a 0, the next interval a 1, and so on. The zero budgets then become their own category instead of fake values.
I think finding the actual budgets for the movies and replacing the missing entries with the real figures would be the best option, depending on the analysis you are doing. Alternatively, if the budget is itemized, you could compute the median share each item takes of the total budget and fill a zeroed item with that share: if the median non-zero actor_pay is 60% of the budget, fill a zeroed actor_pay with 60% of that movie's budget.
The hard option is to write a function that takes the non-zero values of a movie's budget and interpolates the missing budget from the other movies' data in the table, i.e. a regression-style imputation. That is more like its own project; the options above should really be tried first.
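A minimal sketch of the median option on toy data: treat the zeros as missing, fill them with the median of the known budgets, and (last line) apply the equal-width binning idea with pd.cut:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the IMDB data; only the budget column matters here.
df = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "budget": [0, 1_000_000, 0, 3_000_000, 5_000_000],
})

# Treat zeros as missing so they don't distort the statistics.
df["budget"] = df["budget"].replace(0, np.nan)

# Fill the missing budgets with the median of the known ones.
df["budget"] = df["budget"].fillna(df["budget"].median())

# The binning option: replace budgets with equal-width bin indices.
df["budget_bin"] = pd.cut(df["budget"], bins=3, labels=False)
```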
