Python Pandas - Sample certain number of individuals from binned data

Here is a dummy example of the DF I'm working with. It effectively comprises binned data, where the first column gives a category and the second column the number of individuals in that category.
df = pd.DataFrame(data={'Category':['A','B','C','D','E','F','G','H','I'],
'Count':[1000,200,850,350,4000,20,35,4585,2],})
I want to take a random sample, say of 100 individuals, from these data. So for example my random sample could be:
sample1 = pd.DataFrame(data={'Category':['A','B','C','D','E','F','G','H','I'],
'Count':[15,2,4,4,35,0,15,25,0],})
I.e. the sample cannot contain more individuals than are actually in any of the categories. Sampling 0 individuals from a category is possible (and more likely for categories with a lower Count).
How could I go about doing this? I feel like there must be a simple answer but I can't think of it!
Thank you in advance!

You can try sample with replacement, using the counts as weights:
df.sample(n=100, replace=True, weights=df.Count).groupby(by='Category').count()
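If the sample must never exceed the number of individuals actually present in a bin, an alternative is to expand the bins into one row per individual and sample those rows without replacement. A minimal sketch, reusing the df from the question (variable names are my own):
# One row per individual, then a draw of 100 without replacement,
# so no category can be sampled more often than its Count.
individuals = df.loc[df.index.repeat(df['Count']), 'Category']
sample_counts = individuals.sample(n=100, replace=False).value_counts()
sample_counts = sample_counts.reindex(df['Category'].values, fill_value=0)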

Related

How do I assign 'other' to low frequency categories? (pandas)

I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)
I tried df['city'].value_counts()/len(df) to get their frequencies. It returned a table. The first few values were 0.12,.4,.4,.3.....
I'm a complete beginner so I'm not sure how to use this information to assign everything in, say, the last 10 percentile to 'other'.
I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.
Let's go through the logic of the expected steps:
Count the frequency of every city
Calculate the bottom 10% threshold
Find the cities with frequencies below that threshold
Change them to 'other'
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
We want to find the bottom 10%. We use pandas' quantile to do it:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile is a float that separates the bottom 10% from the rest. Cities with a frequency at or below that threshold:
less_freq_cities = city_freq[city_freq <= bottom_decile]
less_freq_cities will hold the entries for those cities. If you want to change their value in the "city" column of df to "other":
df.loc[df["city"].isin(less_freq_cities.index), "city"] = "other"
Complete code:
city_freq = df['city'].value_counts()/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index), "city"] = "other"
This is how you replace the bottom 10% (or whatever you want, just change the q parameter in quantile) with a value of your choice.
EDIT:
As suggested in the comments, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True)
instead of dividing by the shape. But actually, we don't need normalized frequencies: pandas' quantile works even if they are not normalized, so
city_freq = df['city'].value_counts()
will still work.
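For completeness, a short end-to-end sketch on made-up data, finishing with the get_dummies step mentioned in the question (the city values here are invented and stored as strings for simplicity):
import pandas as pd

# Made-up data: a few common city codes and some rare ones.
df = pd.DataFrame({'city': ['101', '101', '101', '102', '102', '103', '104', '105']})

city_freq = df['city'].value_counts(normalize=True)
bottom_decile = city_freq.quantile(q=0.1)
rare_cities = city_freq[city_freq <= bottom_decile].index

# Collapse the rare cities into a single 'other' label, then one-hot encode.
df.loc[df['city'].isin(rare_cities), 'city'] = 'other'
dummies = pd.get_dummies(df['city'], prefix='city')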

Python: groupby and aggregate > adding to original df

I have a data frame, with a categorical variable where the group sizes vary.
Within every group of the categorical variable, I want to assign a random number between 1 and 10. I create as many random numbers between 1 and 10 as entries in a specific group.
To assign a random number I made a simple function called createrandomnum.
Then I used this line of code:
grouped_vales = data.groupby("categories").categories.agg(newnumber = createrandomnum)
Then the output is a data frame, where every row represents a category. The column named 'newnumber' contains lists with numbers between 1 and 10. The length of the list corresponds to the group sizes in the original data frame.
I would like to add these numbers to my original data frame. Which number is allocated to which entry is not that important, as long as the category is the same.
I figured I probably have to sort my original data frame:
data.sort_values("categories")
But then...
Anyone that could help me? Thanks in advance!
P.S. I just started learning Python, so maybe the code I provided here is not the most efficient. Tips are welcome of course :)
I believe you can use the GroupBy.transform function to return a new column (Series) with the same size as the original DataFrame:
data['new'] = data.groupby("categories").categories.transform(createrandomnum)
A way to add a random number per group (the same number is broadcast to every row of the group):
import random
data['new'] = data.groupby('categories')['categories'].transform(lambda group: random.randint(1,10))
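If instead every row should get its own random number (as many draws as entries per group, as described in the question), a minimal sketch, assuming the same data and column names:
import numpy as np

# One independent draw between 1 and 10 for every row within each group.
data['new'] = (data.groupby('categories')['categories']
                   .transform(lambda g: np.random.randint(1, 11, size=len(g))))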

How can we extract duplicate values from multiple columns?

I have a dataset regarding Big Mart sales.
(You can find it here)
https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data
In the dataset there are columns like 'Outlet_Location_Type' and 'Outlet_Size'.
I want to find how many Tier 1 locations have a Medium 'Outlet_Size' and want to visualize this using a grouped bar chart. I need a pythonic solution to this.
Any help is appreciated.
You need to use the groupby method:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Test.csv')
df = df[df['Outlet_Location_Type']=='Tier 1'].groupby(['Outlet_Size']).count()
After the count, every column holds the same value (the number of rows in each group), so you can pick any one of them to plot the count:
df['Item_Identifier'].plot(kind='bar', stacked=True)
plt.show()
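If the goal is a grouped bar chart across all location types and outlet sizes rather than Tier 1 only, one possible sketch, assuming the column names from the Kaggle file:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Test.csv')

# Row counts per (location type, outlet size) pair; plotting the result
# draws one group of bars per location type, one bar per outlet size.
counts = pd.crosstab(df['Outlet_Location_Type'], df['Outlet_Size'])
counts.plot(kind='bar')
plt.show()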

Detecting bad information (python/pandas)

I am new to python and pandas and I was wondering whether pandas can filter out information within a dataframe that is otherwise inconsistent. For example, imagine that I have a dataframe with 2 columns: (1) product code and (2) unit of measurement. The same product code in column 1 may repeat several times and there would be several different product codes. I would like to filter out the product codes for which there is more than 1 unit of measurement. Ideally, when this happens the filter would bring all instances of such a product code, not just the instance in which the unit of measurement is different. To put more color to my request, the real objective here is to identify the product codes which have inconsistent units of measurement, as the same product code should always have the same unit of measurement in all instances.
Thanks in advance!!
First you want some mapping of product code -> unit of measurement, i.e. the ground truth. You can either upload this, or try to be clever and derive it from the data, assuming that the most frequently used unit of measurement for each product code is the correct one. You could get this by doing:
truth_mapping = df.groupby(['product_code'])['unit_of_measurement'].agg(lambda x:x.value_counts().index[0]).to_dict()
Then you can get a column that holds the 'correct' unit of measurement:
df['correct_unit'] = df['product_code'].apply(truth_mapping.get)
Then you can filter to rows that do not have the correct mapping:
df[df['correct_unit'] != df['unit_of_measurement']]
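Since the question asks for all rows of an inconsistent product code (not only the rows that disagree with the majority unit), a small alternative sketch, assuming the same column names:
# Number of distinct units per product code, broadcast back to every row.
n_units = df.groupby('product_code')['unit_of_measurement'].transform('nunique')

# Every row belonging to a product code that uses more than one unit.
inconsistent_rows = df[n_units > 1]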
Try this:
Sample df:
df12= pd.DataFrame({'Product Code':['A','A','A','A','B','B','C','C','D','E'],
'Unit of Measurement':['x','x','y','z','w','w','q','r','a','c']})
Group by both columns to get the count of each (Product Code, Unit of Measurement) pair:
new = df12.groupby(['Product Code','Unit of Measurement']).size().reset_index().rename(columns={0:'count'})
Keep only the rows whose Product Code appears more than once; these are the codes with more than one unit of measurement:
new[new.duplicated(subset=['Product Code'], keep=False)]

How can I build a confidence interval calculator in Python?

I need some help with calculating the confidence interval for a range of sample sizes and corresponding population sizes. I have a data frame with 3 columns: one column has the country name, one column the sample size of a survey that was done in that country, and one column the population size. I want to iterate through those sample sizes and population sizes and calculate the confidence interval for each sample. Only thing is, I have no idea where to start.
Basically, I want to build something like the 'find confidence interval' calculator (the 2nd one) on this page: http://www.surveysystem.com/sscalc.htm, only thing is I want to pass a list of sample sizes and population sizes. I hope you guys can help! Thank you in advance.
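For reference, a minimal sketch of the kind of calculation that calculator performs (95% confidence level, worst-case proportion p = 0.5, with a finite population correction); the DataFrame and its column names here are assumptions:
import numpy as np
import pandas as pd

# Hypothetical input: one row per country with its survey sample size
# and population size.
df = pd.DataFrame({'country': ['A', 'B'],
                   'sample_size': [400, 1000],
                   'population_size': [10000, 50000]})

z = 1.96   # z-score for a 95% confidence level
p = 0.5    # worst-case assumed proportion

n = df['sample_size']
N = df['population_size']

# Margin of error (half-width of the confidence interval) with a
# finite population correction.
df['margin_of_error'] = z * np.sqrt(p * (1 - p) / n) * np.sqrt((N - n) / (N - 1))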
