I have a problem that I cannot seem to solve.
I have a dataset with two categorical variables: Gender (Male vs Female) and Smoking status (Smokers vs Non-smokers). The dataset contains 60% males and 50% smokers.
import pandas as pd

df = pd.DataFrame()
df['Gender'] = ['M','M','M','M','M','M','F','F','F','F']
df['Smoking_status'] = ['S','S','S','S','S','NS','NS','NS','NS','NS']
Is there a way to create a subset such that the new dataset will have 50% males and 30% smokers? (The exact percentages are just an example, since I do not have that information for the final dataset.)
I am implementing this in Python, but I would be happy with just the idea of a solution.
Thank you all!
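One way to approach this (a sketch, not the only possible solution): treat each (Gender, Smoking_status) combination as a cell, pick target cell proportions consistent with the desired margins (below I simply assume the two variables are independent in the subset, which is one convenient choice), work out the largest subset size the data can support, and then sample each cell. The ten-row example above is too small to hit those targets exactly, so this sketch uses a larger synthetic frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Larger synthetic data with the proportions from the question:
# 60% male, 50% smokers
n = 1000
df = pd.DataFrame({
    'Gender': rng.choice(['M', 'F'], size=n, p=[0.6, 0.4]),
    'Smoking_status': rng.choice(['S', 'NS'], size=n, p=[0.5, 0.5]),
})

# Desired margins in the subset
p_male, p_smoker = 0.5, 0.3

# Assume independence of the two variables in the subset, so each
# cell's target proportion is the product of its margins
targets = {
    ('M', 'S'):  p_male * p_smoker,
    ('M', 'NS'): p_male * (1 - p_smoker),
    ('F', 'S'):  (1 - p_male) * p_smoker,
    ('F', 'NS'): (1 - p_male) * (1 - p_smoker),
}

# How many rows each cell actually has
avail = df.groupby(['Gender', 'Smoking_status']).size()

# Largest subset size for which every cell has enough rows
n_sub = int(min(avail[cell] / frac for cell, frac in targets.items()))

# Sample each cell at its target size (floored, so it never
# exceeds what is available)
parts = [
    df[(df.Gender == g) & (df.Smoking_status == s)].sample(
        n=int(frac * n_sub), random_state=0)
    for (g, s), frac in targets.items()
]
subset = pd.concat(parts)
```

The resulting subset has approximately 50% males and 30% smokers, up to integer rounding of the cell sizes.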
Here is a dummy example of the DF I'm working with. It effectively comprises binned data, where the first column gives a category and the second column the number of individuals in that category.
df = pd.DataFrame(data={'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                        'Count': [1000, 200, 850, 350, 4000, 20, 35, 4585, 2]})
I want to take a random sample, say of 100 individuals, from these data. So for example my random sample could be:
sample1 = pd.DataFrame(data={'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                             'Count': [15, 2, 4, 4, 35, 0, 15, 25, 0]})
I.e. the sample cannot contain more individuals than are actually in any of the categories. Sampling 0 individuals from a category is possible (and more likely for categories with a lower Count).
How could I go about doing this? I feel like there must be a simple answer but I can't think of it!
Thank you in advance!
You can sample with replacement, weighting each row by its Count, and then count how many times each category was drawn:
df.sample(n=100, replace=True, weights=df.Count).groupby(by='Category').count()
Note that because this draws with replacement, a small category can in principle be picked more times than its Count.
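If the sample must be drawn exactly without replacement, so that no category can ever contribute more individuals than its Count, NumPy's multivariate hypergeometric generator models exactly that; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': list('ABCDEFGHI'),
                   'Count': [1000, 200, 850, 350, 4000, 20, 35, 4585, 2]})

rng = np.random.default_rng(0)

# Draw 100 individuals across the bins without replacement;
# each category's draw can never exceed its Count
sample_counts = rng.multivariate_hypergeometric(df['Count'].to_numpy(), 100)
sample1 = pd.DataFrame({'Category': df['Category'], 'Count': sample_counts})
```

Categories with small counts will often receive 0, matching the behaviour described in the question.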
I have a large DataFrame with 2 million observations. For my further analysis, I intend to use a relatively smaller sample (around 15-20% of the original DataFrame) drawn from the original DataFrame. While sampling, I also intend to keep the proportion of categorical values present in one of the columns intact.
For example: if one column has 5 categories as its values: red (20% of total observations), green (10%), blue (15%), white (25%), yellow (30%), I would like the column in the sample dataset to show the same proportion of categories.
Please assist!
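One possible approach (a sketch on synthetic data; the column name color is made up for illustration): pandas' DataFrameGroupBy.sample (available since pandas 1.1) draws the same fraction from every category, which keeps the category proportions intact:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the 2-million-row frame
df = pd.DataFrame({
    'color': rng.choice(['red', 'green', 'blue', 'white', 'yellow'],
                        size=200_000, p=[0.20, 0.10, 0.15, 0.25, 0.30]),
})

# Take 15% of each category, so the proportions carry over to the sample
sample = df.groupby('color').sample(frac=0.15, random_state=0)
```

If you are already splitting data for modelling, scikit-learn's train_test_split with its stratify argument achieves the same effect.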
I'm using some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places, events, or animals. The dataframe includes the years of both birth and death for those famous people, and I have added a new column that displays the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values for the columns 'lead_subject_died_in' and 'lead_subject_born_in', some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
--and I got some crazy stuff-- negative numbers, 600+, etc.-- how do I make it so that it only counts the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use dropna to remove the rows with NaN values in those two columns:
# remove nan values from these 2 columns
plaques = plaques.dropna(subset = ['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
# get the most frequent age
plaques['subject_age'].value_counts().idxmax()
# get the five most common ages
plaques['subject_age'].value_counts().head()
I have a dataframe which is made up of rows that represent every play from several football games and I am simply calculating and aggregating the data so that I can do logistic regression on 2 data features... and a final feature that is just a W/L column. Makes sense right? Nothing too snazzy for you experienced developers.
I already have all three of my aggregated columns available to view, but I can't seem to get them into ONE smaller dataframe with just three columns; I don't need the rest of the columns for this exercise.
Total_Score = (NFLDataFrame.groupby(["gameid","off"]).offscore.max()) - (NFLDataFrame.groupby(["gameid","off"]).defscore.max())
Total_Score_boolean = Total_Score > 0
Win_Loss = Total_Score_boolean.replace({True:"Win", False:"Loss"})
How do I get the Win_Loss into its own NEW dataframe?
Thanks!
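One way to do it (a sketch with made-up scores; only the column names come from your snippet): the groupby aggregation returns a Series indexed by a (gameid, off) MultiIndex, so giving the Series a name and calling reset_index produces a fresh three-column dataframe:

```python
import pandas as pd

# Hypothetical stand-in for NFLDataFrame
NFLDataFrame = pd.DataFrame({
    'gameid':   [1, 1, 2, 2],
    'off':      ['A', 'B', 'A', 'C'],
    'offscore': [21, 17, 14, 28],
    'defscore': [17, 21, 28, 14],
})

# Group once and reuse it for both aggregations
grouped = NFLDataFrame.groupby(['gameid', 'off'])
Total_Score = grouped.offscore.max() - grouped.defscore.max()
Win_Loss = (Total_Score > 0).map({True: 'Win', False: 'Loss'})

# Name the Series, then promote the MultiIndex levels to columns
wl_df = Win_Loss.rename('Win_Loss').reset_index()
```

wl_df is a new dataframe with the columns gameid, off, and Win_Loss, independent of the original frame.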
I'm conducting some linear regressions using Python. I have a fairly large data file I'm working with, and one of the columns I'm looking at is titled "male", which indicates the gender of a subject. Column values can be 1 = male, 0 = female. "rgroupx" is the treatment variable (0 = control, 6 = high status treatment) and "log_mm" is the outcome variable.
One of the questions I need to answer is: How much does the high status treatment affect the number of traffic violations post intervention for male drivers? Is there a significant treatment effect for female drivers?
I have below my current Python statement. My problem is for both questions, how would I specify a column value to include in the regression? If the question is asking for male drivers, how do I tell Python to include only 1s? Thanks in advance!
model3 = smf.ols('log_mm ~ rgroupx + male', data=Traffic).fit()
If your data is in a dataframe, then a combination of indexing and dropping rows while assigning the result to a new variable will work. Note that in your data the column is named male (1 = male, 0 = female):
Example:
males_df = Traffic.drop(Traffic[Traffic.male != 1].index)
You can then pass males_df instead of the full frame as the data argument:
model3_male = smf.ols('log_mm ~ rgroupx', data=males_df).fit()
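For completeness, a runnable sketch with made-up data (only the column names come from the question), showing boolean masks as an alternative way to build one subset per gender:

```python
import pandas as pd

# Hypothetical stand-in for the Traffic frame described in the question
Traffic = pd.DataFrame({
    'male':    [1, 1, 0, 0, 1, 0],
    'rgroupx': [0, 6, 0, 6, 6, 0],
    'log_mm':  [1.2, 0.8, 1.0, 0.9, 0.7, 1.1],
})

# Boolean masks select the rows for each gender
males_df = Traffic[Traffic.male == 1]
females_df = Traffic[Traffic.male == 0]

# Each subset can then be fitted separately, e.g.
#   smf.ols('log_mm ~ rgroupx', data=males_df).fit()
#   smf.ols('log_mm ~ rgroupx', data=females_df).fit()
# (male is dropped from the formula since it is constant within a subset)
```

Comparing the rgroupx coefficient and its p-value between the two fits answers whether the treatment effect differs for male and female drivers.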