I have a huge df (~1 million rows) with a bunch of columns. One of these columns contains categorical data, like Name:
   Code   Regione  CodeProvOrigin      Name
0     1  Piemonte               1    Torino
1     1  Piemonte               2  Vercelli
2     1  Piemonte               2  Vercelli
What I want to do is get a random sample of rows, say 10k, but these rows should contain at least 20 unique values of the Name column; it doesn't matter whether each unique category has the same number of rows.
If your number of names is >> 20 and the distribution of names is not concentrated amongst fewer than 20 names, then don't overcomplicate it and just do this:
number_of_unique_names_in_sample = 0
while number_of_unique_names_in_sample < 20:
    df_sample = df.sample(n=10_000)
    number_of_unique_names_in_sample = df_sample["Name"].nunique()
And maybe add a counter to limit the number of iterations in case your distribution changes (in a small test sample, for example).
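For example, a capped version of the loop might look like this (a minimal sketch; the max_iterations cap and the RuntimeError are my own additions, not part of the answer above):

max_iterations = 100  # safety cap, adjust to taste

for _ in range(max_iterations):
    df_sample = df.sample(n=10_000)
    if df_sample["Name"].nunique() >= 20:
        break
else:
    raise RuntimeError(f"no sample with 20 unique names found in {max_iterations} tries")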
This might be what you're asking for:
name_cols = [list_of_names]  # placeholder: the 20 (or more) names you want to keep
samples_per_name = 500
df[df['Name'].isin(name_cols)].groupby('Name').apply(lambda x: x.sample(samples_per_name))
The result will be 10,000 rows: one group per entry of name_cols (20 in your example), each containing 500 rows.
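As a side note, on pandas >= 1.1 the per-group sampling can be done directly with GroupBy.sample, which avoids the extra index level that groupby().apply() adds; a small sketch using the same assumed name_cols and samples_per_name as above:

# Equivalent per-group sampling without apply (requires pandas >= 1.1).
df_sample = (
    df[df['Name'].isin(name_cols)]
    .groupby('Name')
    .sample(n=samples_per_name, random_state=1)  # random_state only for reproducibility
)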
I have a very large df, lots of rows and columns. I want to rename a category of the categorical variable to "other" if its count is less than 0.5% of the count of the mode.
I know df[colname].value_counts(normalize=True) gives me the distribution of all categories. How do I extract the ones less than 0.5% of the mode, and how do I rename them as other?
apple
large     100
medium     50
small       3

desired output

apple
large     100
medium     50
other       3
First, find the values whose frequency is smaller than 0.5% using value_counts and take their index. Second, build a dictionary whose keys are that index and whose value is "other". Third, use replace with the dictionary to change those values to "other".
Here is an example.
import pandas as pd

df = pd.DataFrame({"apple": ["large"] * 1000 + ["medium"] * 500 + ["small"] * 1})

cond = df['apple'].value_counts(normalize=True) < 0.005  # frequencies below 0.5%
others = cond[cond].index                                # the rare categories
others_dict = {k: "other" for k in others}               # map each one to "other"
df['apple'] = df['apple'].replace(others_dict)
Use Series.map with Series.value_counts and compare with Series.lt to get a boolean mask the same size as the original column, then set the new values with Series.mask:
m = df['apple'].map(df['apple'].value_counts(normalize=True).lt(0.005))  # True for rare categories
df['apple'] = df['apple'].mask(m, 'other')
To check the counts:
s = df['apple'].value_counts()
print(s)
large     100
medium     50
other       3
Name: apple, dtype: int64
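Note that the question literally asks for 0.5% of the mode's count rather than 0.5% of the total; if that is what you need, a rough adaptation of the approach above (my own sketch, not part of either answer) would be:

counts = df['apple'].value_counts()
threshold = counts.max() * 0.005            # 0.5% of the most frequent category's count
m = df['apple'].map(counts.lt(threshold))   # True for categories below that threshold
df['apple'] = df['apple'].mask(m, 'other')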
I have a dataframe and I want to sample it. However, while sampling it randomly I want to have at least 1 sample from every element in the column. I also want the distribution to have an effect (e.g. values with more rows in the original should have more in the sampled df).
Similar to this and this question, but with minimum sample size per group.
Let's say this is my df:
df = pd.DataFrame(columns=['class'])
df['class'] = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,2]
df_sample = df.sample(n=4)
And when I sample this I want the df_sample to look like:
class
0
0
1
2
Thank you.
As suggested by @YukiShioriii you could:
1 - sample one row of each group of values
2 - randomly sample over the remaining rows regardless of the values
Following YukiShioriii's and mprouveur's suggestion
sample_size = 4  # total number of rows wanted

# random_state for reproducibility, remove in production code
sample = df.groupby('class').sample(1, random_state=1)   # one row per class
sample = pd.concat([                                      # pd.concat replaces the removed DataFrame.append
    sample,
    df[~df.index.isin(sample.index)]                      # only rows that have not been selected
      .sample(n=sample_size - sample.shape[0])            # sample as many more rows as needed
]).sort_index()
Output
    class
2       0
4       0
13      1
14      2
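If you want the per-class sample sizes to follow the original distribution more explicitly (rather than relying on the second, unconstrained draw), here is a rough sketch of proportional allocation with a floor of one row per class; the stratified_sample helper is my own invention, not part of the answer above:

import numpy as np
import pandas as pd

def stratified_sample(df, col, n, random_state=None):
    # rows per value: proportional to the value's frequency, but at least 1
    counts = df[col].value_counts()
    alloc = np.maximum(1, (counts / counts.sum() * n).round().astype(int))
    parts = [
        df[df[col] == value].sample(n=min(int(k), counts[value]),
                                    random_state=random_state)
        for value, k in alloc.items()
    ]
    return pd.concat(parts).sort_index()

# e.g. stratified_sample(df, 'class', 4) -- the total can differ slightly from n
# because of rounding and the floor of one row per value.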
I have a Pandas DataFrame of the form
df = pd.DataFrame({'1':['a','b','c'], '2':['b','a','d'], '3':['0.7','0.6','0.1']}).
I'd like to add a column to this DataFrame which contains the number of times a specific row is present, without considering the order (since the first two columns are the nodes of an undirected graph). Moreover, I'd like to merge those rows that differ only in the order of the first two columns, and take the mean of the numbers in the third one. In this case, it should be
df = pd.DataFrame({'1':['a','c'], '2':['b','d'], '3':['0.65','0.1'], '4':['2','1']}).
Consider also that the DataFrame contains more than 100,000 rows.
Use -
a = df[['1', '2']].values          # sort each row's pair so ('a','b') and ('b','a') match
a.sort(axis=1)
df[['1', '2']] = a
df['3'] = df['3'].astype(float)    # column '3' holds strings in the question's df
df.groupby(['1', '2'])['3'].agg(['count', 'mean']).reset_index()
Output
   1  2  count  mean
0  a  b      2  0.65
1  c  d      1  0.10
or
import numpy as np

df[['1', '2']] = np.sort(df[['1', '2']].to_numpy(), axis=1)   # per-row sort of the pair
df.groupby(['1', '2'])['3'].agg(['count', 'mean']).reset_index()
Here is a sample df:
data = {"Brand":{"0":"BrandA","1":"BrandA","2":"BrandB","3":"BrandB","4":"BrandC","5":"BrandC"},"Cost":{"0":18.5,"1":19.5,"2":6,"3":6,"4":17.69,"5":18.19},"IN STOCK":{"0":10,"1":15,"2":5,"3":1,"4":12,"5":12},"Inventory Number":{"0":1,"1":1,"2":2,"3":2,"4":3,"5":3},"Labels":{"0":"Black","1":"Black","2":"White","3":"White","4":"Blue","5":"Blue"},"Maximum Price":{"0":30.0,"1":35.0,"2":50,"3":45.12,"4":76.78,"5":76.78},"Minimum Price":{"0":23.96,"1":25.96,"2":12.12,"3":17.54,"4":33.12,"5":28.29},"Product Name":{"0":"Product A","1":"Product A","2":"ProductB","3":"ProductB","4":"ProductC","5":"ProductC"}}
df = pd.DataFrame(data=data)
My actual data set is much larger, but maintains the same pattern of there being 2 rows that share the same Inventory Number throughout.
My goal is to create a new data frame that contains only the inventory numbers where a cell value is not duplicated across both rows, and, for those inventory numbers, only the data from the lower-index row that differs from the other row.
For this example the resulting data frame would need to look like:
data = {"Inventory Number":{"0":1,"1":2,"2":3},"Cost":{"0":18.50,"1":"","2":17.69},"IN STOCK":{"0":10,"1":5,"2":""},"Maximum Price":{"0":30,"1":50,"2":""},"Minimum Price":{"0":23.96,"1":12.12,"2":33.12}}
df = pd.DataFrame(data=data)
The next time this would run, perhaps nothing changed in the "Maximum Price", so that column would need to not be included at all.
I was hoping someone would have a clean solution using groupby, but if not, I imagine the solution would include dropping all duplicates, then looping through all of the remaining inventory numbers, evaluating each column for duplicates.
icol = 'Inventory Number'
d0 = df.drop_duplicates(keep=False)           # drop pairs whose rows are fully identical
i = d0.groupby(icol).cumcount()               # 0 for the first row of each pair, 1 for the second
d1 = d0.set_index([icol, i]).unstack(icol).T  # one row per (column, inventory number), columns 0 and 1
d1[1][d1[1] != d1[0]].unstack(0)              # keep the second row's value where the pair differs
                   Cost IN STOCK Maximum Price Minimum Price
Inventory Number
1                  19.5       15            35         25.96
2                  None        1         45.12         17.54
3                 18.19     None          None         28.29
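For comparison, a plainer sketch along the same lines, assuming each Inventory Number appears exactly twice as in the question (this is my own variant, not the answer's code); like the output above, it keeps the second row's values where the pair differs:

g = df.groupby('Inventory Number')
first, last = g.agg('first'), g.agg('last')   # the two rows of each pair, aligned by Inventory Number
changed = last.where(last.ne(first))          # NaN wherever the two rows agree
changed = changed.dropna(axis=1, how='all')   # drop columns in which nothing changed at all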
Try this:
In [68]: cols = ['Cost','IN STOCK','Inventory Number','Maximum Price','Minimum Price']
In [69]: df[cols].drop_duplicates(subset=['Inventory Number'])
Out[69]:
Cost IN STOCK Inventory Number Maximum Price Minimum Price
0 18.5 10 100566 30.0 23.96
Basically, how would I create a pivot table that consolidates data, where one of the columns it represents (say, a likelihood percentage from 0.0 to 1.0) is aggregated by taking the mean, and another (say, a number ordered) is aggregated by summing?
Right now I can specify values=... to indicate what should make up one of the two, but then when I specify the aggfunc=... I don't know how the two interoperate.
In my head I'd specify two values for values=... (likelihood percentage and number ordered) and two values for aggfunc=..., but this does not seem to be working.
You could supply aggfunc with a dictionary of column:function (key:value) pairs:
df = pd.DataFrame({'a':['a','a','a'],'m':[1,2,3],'s':[1,2,3]})
print(df)
   a  m  s
0  a  1  1
1  a  2  2
2  a  3  3
df.pivot_table(index='a', values=['m','s'], aggfunc={'m':pd.Series.mean,'s':sum})
   m  s
a
a  2  6
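As a small aside (not part of the answer above), the same pivot can be written with the string names of the aggregations, which some find more readable:

# Same result, using the aggregation names pandas recognises as strings.
df.pivot_table(index='a', values=['m', 's'], aggfunc={'m': 'mean', 's': 'sum'})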