I have a data frame with a categorical variable where the group sizes vary.
Within every group of the categorical variable, I want to assign random numbers between 1 and 10: I create as many random numbers between 1 and 10 as there are entries in that group.
To assign a random number I made a simple function called createrandomnum.
Then I used this line of code:
grouped_values = data.groupby("categories").categories.agg(newnumber=createrandomnum)
The output is a data frame where every row represents a category. The column named 'newnumber' contains lists of numbers between 1 and 10; the length of each list corresponds to the size of that group in the original data frame.
I would like to add these numbers to my original data frame. Which number is allocated to which entry is not that important, as long as the category is the same.
I figured I probably have to sort my original data frame;
data.sort_values("categories")
But then...
Can anyone help me? Thanks in advance!
P.S. I just started learning Python, so maybe the code I provided here is not the most efficient. Tips are welcome of course :)
I believe you can use the GroupBy.transform function to return a new column (a Series) with the same size as the original DataFrame:
data['new'] = data.groupby("categories").categories.transform(createrandomnum)
A method for adding the random numbers:
import random
data['new'] = data.groupby('categories')['categories'].transform(lambda group: random.randint(1,10))
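Note that a scalar returned from the lambda is broadcast to every row of the group, so all rows in a category share one number. If a different number per entry is wanted, the function passed to transform can return an array of the group's length instead; a minimal sketch, assuming the column is named 'categories' as above:
import numpy as np

# Draw one integer in [1, 10] for every row of each group. The returned
# array has the same length as the group, so transform aligns it row by row.
data['new'] = data.groupby('categories')['categories'].transform(
    lambda g: np.random.randint(1, 11, size=len(g)))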
Here is a dummy example of the DF I'm working with. It effectively comprises binned data, where the first column gives a category and the second column the number of individuals in that category.
df = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                        'Count': [1000,200,850,350,4000,20,35,4585,2]})
I want to take a random sample, say of 100 individuals, from these data. So for example my random sample could be:
sample1 = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                             'Count': [15,2,4,4,35,0,15,25,0]})
I.e. the sample cannot contain more individuals than are actually in any of the categories. Sampling 0 individuals from a category is possible (and more likely for categories with a lower Count).
How could I go about doing this? I feel like there must be a simple answer but I can't think of it!
Thank you in advance!
You can try sampling with replacement:
df.sample(n=100, replace=True, weights=df.Count).groupby(by='Category').count()
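The groupby/count only reports categories that were actually drawn. A minimal sketch of reindexing the result so undrawn categories show up with a count of 0, matching the desired output above (column names as in the example df):

# Sample 100 rows weighted by Count, count how many times each category
# was drawn, then reindex so categories that were never drawn get 0.
sample1 = (df.sample(n=100, replace=True, weights=df.Count)
             .groupby('Category').size()
             .reindex(df['Category'], fill_value=0)
             .reset_index(name='Count'))

One caveat: because the draw is with replacement, a very small category could in principle be drawn more times than its Count, so this is an approximation of the stated constraint.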
I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)
I tried df['city'].value_counts()/len(df) to get their frequencies. It returned a table; the first few values were 0.12, .4, .4, .3, ...
I'm a complete beginner, so I'm not sure how to use this information to assign everything in, say, the bottom 10th percentile to 'other'.
I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.
Let's go through the logic of expected actions:
Count frequencies for every city
Calculate the bottom 10% cutoff
Find the cities with frequencies below that cutoff
Change them to "other"
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
We want to find the bottom 10%. We use pandas' quantile to do it:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile is a float marking the cutoff that separates the bottom 10% from the rest. Cities with frequencies at or below that cutoff:
less_freq_cities = city_freq[city_freq <= bottom_decile]
less_freq_cities holds the entries for those cities. To change their value in df's 'city' column to "other":
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
complete code:
city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
This is how you replace the bottom 10% (or whatever you want; just change the q param in quantile) with a value of your choice.
EDIT:
As suggested in the comments, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True) instead of dividing by the shape. But actually, we don't need normalized frequencies here; pandas' quantile works even if they are not normalized. We can use
city_freq = df['city'].value_counts() and it will still work.
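Once the rare cities are folded into "other", the follow-up step mentioned in the question is straightforward. A short sketch (column name 'city' as above):
import pandas as pd

# One indicator column per remaining city value, including "other".
dummies = pd.get_dummies(df['city'], prefix='city')
df = df.drop(columns='city').join(dummies)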
I have highly unbalanced data (with binary labels: zeros are 96% of the data, while ones are just 4%). To balance it I have decided to delete some rows with label zero. However, iterating over the whole dataframe and deleting rows with the pandas.DataFrame.drop() method would take several hours. What is the most time-efficient way to delete the data?
I have tried sorting the data and then just clearing out a bunch of rows with label 0, but unfortunately I must not change the order of the data.
I have selected the indexes of rows with label 0 and chosen random indexes from that list to delete, like so:
drops = random.sample(zero_indexes, X) (where X is the number of rows I want to delete), but I am not sure how to delete rows with those indexes in acceptable time. Any help would be appreciated.
Get a list of indices you want to chuck
bad_labels = df[df['label'] == 0].sample(500).index
Then filter df down to the rows not in that list
df1 = df[~df.index.isin(bad_labels)]
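If the goal is a roughly 50/50 split rather than dropping an arbitrary 500 rows, the sample size can be computed from the class counts first. A sketch, assuming the label column is called 'label' as above; since rows are only removed, the surviving rows keep their original order:

# Drop exactly enough randomly chosen zero-label rows to match the
# number of one-label rows.
n_ones = (df['label'] == 1).sum()
n_zeros = (df['label'] == 0).sum()
bad_labels = df[df['label'] == 0].sample(n_zeros - n_ones).index
df1 = df[~df.index.isin(bad_labels)]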
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find, for each row, their lowest value in the speed column up until that date.
So I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the min over all values, not just from the first row up to the current one.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem boils down to the following: you need to select the range of data you want to work with (rows for the date range and columns for the user/speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['users', 'speed']]
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a while, but you're looking for how to slice DataFrames.
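If what's needed is a running minimum per user (the lowest speed from each user's first row up to the current row), groupby plus cummin computes that directly. A sketch, assuming the frame has a 'date' column to sort by and 'user'/'speed' columns as described:

# Running minimum of 'speed' within each user, in date order.
df = df.sort_values('date')
df['min_so_far'] = df.groupby('user')['speed'].cummin()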
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. Checking new_data.tail() shows Zimbabwe listed last; there are 80336 rows, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the corresponding 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[(new_data['country'] == country)]
    df[country] = pink.trust
Output here
As you can see, the data does not get included for the rest of the columns after the first. I believe this is because the number of rows of 'trust' data varies by country: while the first column has 1000 rows, some countries have as many as 2500 data points and some as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact data structure for the template data, which is why I'm attempting to put it in a dataframe. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.rename(country).to_frame().combine_first(df)
which ensures that your index is always the union of the indexes of all added columns.
I think in this case df.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function turns the values of a particular column into column names. See the documentation for additional info. I hope that helps.
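Applied to the data above, that would look roughly like the sketch below: a within-group counter becomes the row index, so country columns of different lengths simply end with NaN (frame and column names assumed from the question):

# Number the observations within each country, then pivot so each
# country becomes its own column of 'trust' values.
new_data['obs'] = new_data.groupby('country').cumcount()
df = new_data.pivot(index='obs', columns='country', values='trust')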