I have a dataframe with multiple scores and multiple dates. My goal is to bin each day into equal-sized buckets (let's say 5 buckets) based on whatever score I choose. The problem is that some scores have an abundance of ties, so I need to first compute a rank to introduce a tie-breaker criterion, and only then can qcut be applied.
The simple solution is to create a field for the rank and then do groupby('date')['rank'].transform(pd.qcut). However, since efficiency is key, this implies doing two expensive groupbys and I was wondering if it is possible to "chain" the two operations into one sweep.
This is the closest I got; my goal is to create 5 buckets, but the qcut call seems to be wrong since it is asking me to provide hundreds of labels:
df_main.groupby('date')['score'].\
    apply(lambda x: pd.qcut(x.rank(method='first'),
                            5,
                            duplicates='drop',
                            labels=lbls)
          )
Thanks
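For reference, a minimal single-pass sketch (assuming df_main has 'date' and 'score' columns as in the question; labels=False is swapped in so qcut returns bucket numbers instead of requiring an explicit label list):

import pandas as pd

# Rank within each date to break ties, then cut the ranks into 5 equal buckets.
# labels=False makes qcut return the bucket number (0-4), so no label list is needed.
df_main['bucket'] = (
    df_main.groupby('date')['score']
           .transform(lambda s: pd.qcut(s.rank(method='first'), 5, labels=False))
)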
I have a big dataset, 10,000 or so rows, as a pandas DataFrame with columns [['Date', 'TAMETR']].
The float values under 'TAMETR' increase and decrease over time.
I wish to loop through the 'TAMETR' column and check if there are consecutive instances where values are greater than let's say 1. Ultimately I'd like to get the average duration length and the distribution of the instances.
I've played around a little with what is written here: How to count consecutive ordered values on pandas data frame
I doubt I fully understand the code, but I can't make it work. I don't understand how to tweak it to check for greater than or less than (>/<).
The preferred output would be a dataframe, or array, with all the instances (greater than 1).
I can calculate the average and plot the distribution.
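A rough sketch of one way to collect the run lengths (the column name 'TAMETR', the threshold of 1, and the DataFrame name df are taken from the question):

import pandas as pd

above = df['TAMETR'] > 1                    # True wherever the value exceeds the threshold
run_id = (above != above.shift()).cumsum()  # new id every time the condition flips
run_lengths = above.groupby(run_id).sum()   # number of True values in each run
run_lengths = run_lengths[run_lengths > 0]  # keep only the runs where TAMETR > 1

print(run_lengths.mean())                   # average duration
run_lengths.hist()                          # distribution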
I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)
I tried df['city'].value_counts()/len(df) to get their frequencies. It returned a table. The first few values were 0.12, .4, .4, .3.....
I'm a complete beginner so I'm not sure how to use this information to assign everything in, say, the last 10 percentile to 'other'.
I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.
Let's go through the logic of expected actions:
Count frequencies for every city
Calculate the bottom 10% threshold (the 10th percentile of those frequencies)
Find the cities with frequencies at or below that threshold
Change them to 'other'
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
We want to find the bottom 10%. We use pandas' quantile to do it:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile is a float that separates the bottom 10% of frequencies from the rest. Cities with frequency at or below that threshold:
less_freq_cities = city_freq[city_freq <= bottom_decile]
less_freq_cities will hold the entries for those cities. If you want to change their value in 'df' to "other" (assigning only to the 'city' column, so the other columns are left untouched):
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
Complete code:
city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
This is how you replace the bottom 10% (or whatever fraction you want; just change the q parameter in quantile) with a value of your choice.
EDIT:
As suggested in a comment, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True)
instead of dividing by the shape. But actually, we don't need normalized frequencies; pandas' quantile will work even if they are not normalized. We can use
city_freq = df['city'].value_counts()
and it will still work.
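For completeness, the same pipeline with raw counts would look like this (a sketch under the same assumptions as above):

city_freq = df['city'].value_counts()
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"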
I'm using the diamonds dataset; below are the columns.
Question: create bins having equal population. Also, generate a report that contains a cross tab between the bins and cut, representing the number under each cell as a percentage of the total.
Being a beginner, I created the Volume column and tried to create bins with equal population using qcut, but I'm not able to proceed further. Could someone help me out with the approach to solve this?
pd.qcut(diamond['Volume'], q=4)
You are on the right path: pd.qcut() attempts to break the data you provide into q equal-sized bins (though it may have to adjust a little, depending on the shape of your data).
pd.qcut() also lets you specify labels=False as an argument, which will give you back the number of the bin into which the observation falls. This is a little confusing, so here's a quick explanation: you could pass labels=['A','B','C','D'] (given your request for 4 bins), which would return the label of the bin into which each row falls. By telling pd.qcut that you don't have labels to give the bins, the function returns a bin number, just without a specific label. Otherwise (the default), the function gives back the interval, i.e. the range of values into which the observation (row) fell.
The reason you want the bin number is because of your next request: a cross-tab for the bin-indicator column and cut. First, create a column with the bin numbering:
diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)
Next, use the pd.crosstab() method to get your table:
pd.crosstab(diamond['binned_volume'], diamond['cut'], normalize=True)
The normalize=True argument will have the table express each entry as a fraction of the grand total, which is the last part of your question, I believe.
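Putting the two steps together (a sketch under the question's setup; multiplying by 100 is just one way to turn the fractions into percentages):

import pandas as pd

# Assumes the diamond DataFrame with a 'Volume' column already exists, as in the question
diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)
report = pd.crosstab(diamond['binned_volume'], diamond['cut'], normalize=True) * 100
print(report.round(2))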
I have a two-column dataframe named limitData, where the first column is CcyPair and the second is the trade notional:
CcyPair,TradeNotional
USDCAD,1000000
USDCAD,7600
USDCAD,40000
GBPUSD,100000
GBPUSD,345000
etc
with a large number of CcyPairs and many TradeNotional values per CcyPair. From here I generate summary statistics as follows:
limitDataStats = limitData.groupby(['CcyPair']).describe()
This is easy enough. However, I would like to add a column to limitDataStats that contains the count of TradeNotional values greater than that CcyPair's 75% quantile as determined by .describe(). I've searched a great deal and tried a number of variations but can't figure it out. I think it should be something along the lines of the below (I thought I could reference the index of the groupby as mentioned here, but that gives me the actual integer index):
limitData.groupby(['CcyPair'])['AbsBaseTrade'].apply(lambda x: x[x > limitDataStats.loc[x.index , '75%']].count())
Any ideas? Thanks, Colin
You can compare each value to its group's 75th percentile and count how many are greater than or equal to it (.sum() works because ge() returns a boolean Series):
limitData.groupby('CcyPair')['AbsBaseTrade'].apply(
    lambda x: x.ge(x.quantile(.75)).sum())
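If you then want that count as an extra column next to the describe() output, one possible way (assuming limitDataStats was built with groupby('CcyPair').describe() as in the question, so its columns are a MultiIndex of (column, statistic); the new column name is just illustrative):

over_75 = limitData.groupby('CcyPair')['AbsBaseTrade'].apply(
    lambda x: x.ge(x.quantile(.75)).sum())
# A tuple key files the new statistic under the 'AbsBaseTrade' column group
limitDataStats[('AbsBaseTrade', 'count_over_75%')] = over_75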
I want to produce an aggregation along a certain criterion, but also need a row with the same aggregation applied to the non-aggregated dataframe.
When using customers.groupby('year').size(), is there a way to keep the total among the groups, in order to output something like the following?
year customers
2011 3
2012 5
total 8
The only thing I could come up with so far is the following:
n_customers_per_year.loc['total'] = len(customers)
(n_customers_per_year is the dataframe aggregated by year. While this method is fairly straightforward for a single index, it seems to get messy when it has to be done on a multi-indexed aggregation.)
I believe the pivot_table method has a boolean margins argument for totals. Have a look:
margins : boolean, default False
    Add all row / columns (e.g. for subtotal / grand totals)
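For example, counting customers per year with a grand-total row might look something like this (a sketch; 'customer_id' is a hypothetical column used only for counting rows):

# Per-year counts plus a 'total' row added by margins=True
customers.pivot_table(index='year', values='customer_id', aggfunc='count',
                      margins=True, margins_name='total')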
I agree that this would be a desirable feature, but I don't believe it is currently implemented. Ideally, one would like to display an aggregation (e.g. sum) along one or more axes and/or levels.
A workaround is to create a series that is the sum and then concatenate it to your DataFrame when delivering the data.
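A minimal sketch of that workaround for the groupby example above (assuming a customers DataFrame with a 'year' column):

import pandas as pd

n_customers_per_year = customers.groupby('year').size()  # counts per year
total = pd.Series({'total': n_customers_per_year.sum()}) # single-entry Series for the total
result = pd.concat([n_customers_per_year, total])        # per-year counts plus a 'total' row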