I am new to Pandas.
My DataFrame looks like this:
I am having problems adding 1st, 2nd, and 3rd quartile columns to my DataFrame.
I am trying to get the quartiles of column CTR within each group determined by column Cat.
In total, I have about 40 groups.
What I've tried:
df_final['1st quartile'] = round(
    df_final.groupby('Cat')['CTR'].quantile(0.25), 2)
df_final['2nd quartile'] = round(
    df_final.groupby('Cat')['CTR'].quantile(0.5), 2)
df_final['3rd quartile'] = round(
    df_final.groupby('Cat')['CTR'].quantile(0.75), 2)
But the values get added in a way I cannot explain: they start in the second row and are not aligned per row the way the last column, CTR Average Difference vs category, is.
My desired output would look the same as the last column, CTR Average Difference vs category, one line per category.
Any suggestions what might be wrong? Thank you.
If you want a new column filled with aggregated values such as mean, sum, or quantile, use GroupBy.transform:
# similar for the 2nd and 3rd quartiles
df_final['1st quartile'] = (df_final.groupby('Cat')['CTR']
                                    .transform(lambda x: x.quantile(0.25))
                                    .round(2))
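For completeness, a small sketch (assuming the same df_final, Cat, and CTR names as above) that fills all three quartile columns in one loop over the desired quantiles:

for q, name in [(0.25, '1st quartile'), (0.5, '2nd quartile'), (0.75, '3rd quartile')]:
    df_final[name] = (df_final.groupby('Cat')['CTR']
                              .transform(lambda x: x.quantile(q))
                              .round(2))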
Or you can use DataFrameGroupBy.quantile and then DataFrame.join on the Cat column:
df = (df_final.groupby('Cat')['CTR']
              .quantile([0.25, 0.5, 0.75])
              .unstack()
              .round(2))
df.columns = ['1st quartile', '2nd quartile', '3rd quartile']
df_final = df_final.join(df, on='Cat')
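For illustration, here is a minimal, self-contained sketch of the join approach; the toy data below is made up purely to show how the per-category quartiles align back onto the original rows:

import pandas as pd

# hypothetical toy data with the same column names as in the question
df_final = pd.DataFrame({'Cat': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                         'CTR': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})

q = (df_final.groupby('Cat')['CTR']
             .quantile([0.25, 0.5, 0.75])
             .unstack()
             .round(2))
q.columns = ['1st quartile', '2nd quartile', '3rd quartile']
df_final = df_final.join(q, on='Cat')   # each row gets its category's three quartiles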
I am trying to create 2 new columns in a DataFrame in pandas. The first column, aa, which shows the average temperature, is correct; nevertheless, the second column, bb, which should show the temperature in a City minus the average temperature across all cities, displays the value 0.
Where is the problem? Did I correctly use lambda? Could you give me the solution? Thank you very much!
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.groupby(['City'])["Temperature"].transform(lambda x: x - np.mean(x))
display(file.head(10))
EDIT: Updated according to gereleth's comment. You can simplify it even more!
file['bb'] = file.Temperature - file.aa
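Equivalently, the grouping and the subtraction can be folded into a single line (a small variation on the above, assuming the same column names):

file['bb'] = file['Temperature'] - file.groupby('City')['Temperature'].transform('mean')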
Since we've already calculated the mean value in the aa column, we can simply reuse it and compute the difference between the Temperature and aa columns for each row with pandas' apply method, like below:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.apply(lambda row: row['Temperature'] - row['aa'], axis=1)
display(file.sample(10))
If you are looking to subtract the average temperature of all cities, you can take the mean of the column aa instead:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
avg_all_cities = file['aa'].mean()
file["bb"] = file.apply(lambda row: row['Temperature'] - avg_all_cities, axis=1)
display(file.sample(10))
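If you only need the overall average, the apply call can be skipped entirely, since subtracting a scalar from a column is already vectorized. A minimal sketch, assuming the same column names (note that file['aa'].mean() equals file['Temperature'].mean() here, because aa just repeats each city's mean on every row):

avg_all_cities = file['Temperature'].mean()
file['bb'] = file['Temperature'] - avg_all_cities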
I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represents the values of a generic stock.
I want to calculate the difference in percentage between two rows of the column "Close" (in fact, I want to know how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is a terrible approach with pandas on a big-data problem), and it produces the right results, but in a different DataFrame:
rows_number = df_stock.shape[0]

# The first row is set to 1 because the change is expressed as a percentage:
# if there is no "yesterday", the value must be 1
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)

# For each day, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index - 1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this to take advantage of the DataFrame API, removing the for loop and creating the new column in place?
df['Change'] = df['Close'].pct_change()
or, if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
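Note that pct_change() returns a fraction (e.g. 0.05 for a 5% increase) and the first row is NaN, since there is no previous day. If you want the change expressed as a percentage, as in the question, just scale it:

df['Change'] = df['Close'].pct_change() * 100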
I would suggest first setting the Date column as a DatetimeIndex; for this you can use:
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then you can access any row and column using datetime indexing and perform whatever operations you want. For example, to calculate the difference in percentage between two rows of the column "Close":
df_stock['percentage'] = ((df_stock.loc['2019-07-15', 'Close'] - df_stock.loc['2019-07-14', 'Close'])
                          / df_stock.loc['2019-07-14', 'Close']) * 100
You can also use a for loop to perform the operation for each date or row:
for Dt in df_stock.index:
    ...  # operate on the row df_stock.loc[Dt] here
Using diff and shift:
(-df['Close'].diff())/df['Close'].shift()
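Note that the leading minus sign gives the change with the opposite sign to pct_change(); dropping it makes the diff/shift expression match pct_change() exactly (assuming no missing values in Close):

df['Change'] = df['Close'].diff() / df['Close'].shift()   # same as df['Close'].pct_change()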
I'm trying to understand how to go about this in Python with pandas. My objective is to fill the column "RESULT" with the initial investment and apply the profit on top of the previous result.
So if I would use an excel spreadsheet I would do this:
Ask what the initial_investment is (in this example, $350)
Compute the first row as profit / 100 * initial_investment + initial_investment
Compute the 2nd row and so forth the same way, with the exception that "initial_investment" is taken from the row above.
My initial Python code is this:
import pandas as pd

df = pd.DataFrame({"DATE": [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016],
                   "PROFIT": [10, 4, 5, 7, -10, 5, -5, 3],
                   "RESULT": [350, 350, 350, 350, 350, 350, 350, 350]})
print(df)
You can use the cumulative product function cumprod():
df['RESULT'] = ((df.PROFIT + 100) / 100.).cumprod() * 350
First you transform df.PROFIT into a growth factor relative to the previous value ((PROFIT + 100) / 100). Then cumprod() multiplies each row by all the rows before it. You can then just multiply this by whatever your initial value is.
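To see that this matches the spreadsheet logic, here is a small sketch (using the question's data and an initial investment of 350) that spells out the first few compounded values:

initial_investment = 350
growth = (df.PROFIT + 100) / 100.              # 1.10, 1.04, 1.05, ...
df['RESULT'] = growth.cumprod() * initial_investment
# first rows: 350 * 1.10 = 385.0, then 385.0 * 1.04 = 400.4, then 400.4 * 1.05 = 420.42, ...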
Suppose I have a DataFrame with columns person_id and mean_act, where every row is a numerical value for a specific person. I want to calculate the zscore for all the values at a person level. That is, I want a new column mean_act_person_zscore that is computed as the zscore of mean_act using the mean and std of the zscores for that person only (and not the whole dataset).
My first approach is something like this:
person_ids = df['person_id'].unique()
for pid in person_ids:
    person_df = df[df['person_id'] == pid]
    person_df = (person_df['mean_act'] - person_df['mean_act'].mean()) / person_df['mean_act'].std()
At every iteration it computes the right zscore output series, but the problem is that the result is only assigned to the temporary person_df selection, so the original df never ends up with a mean_act_person_zscore column.
Thoughts as to how to do this?
Should be straightforward:
df['mean_act_person_zscore'] = df.groupby('person_id').mean_act.transform(lambda x: (x - x.mean()) / x.std())
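If you want to sanity-check the result, each person's z-scores should come out with mean approximately 0 and standard deviation approximately 1 (a quick check using the same column names as above):

check = df.groupby('person_id')['mean_act_person_zscore'].agg(['mean', 'std'])
print(check)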
On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column of the dataframe. I want to remove values that are too infrequent (say, occurring less than 5% of the time) or too frequent. As an example, consider a dataframe with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ['city of origin', 'city of destination', 'distance, type of transport (air/car/foot)', 'time of day, price-interval']]
df = pd.DataFrame(dict(vals))
>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to the one provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), [1000000, 4], replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).to_numpy()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where the condition holds across all columns.
df.apply(...) applies the function to each column; to_numpy() converts each boolean result into an array, so np.all ends up operating on a matrix of the results.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
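A variant of Option A that some may find more readable is to build one boolean mask per column against the original DataFrame and combine them, which avoids the matrix conversion and the repeated rebuilding of df; this is only a sketch of the same idea, not a third behaviour:

m = 0.03 * len(df)
masks = []
for c in df.columns:
    counts = df[c].value_counts()
    masks.append(df[c].isin(counts[counts > m].index))

# keep only the rows that pass the threshold in every column
keep = pd.concat(masks, axis=1).all(axis=1)
df_filtered = df[keep]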
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the code below to filter out all infrequent values. Just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'

# Set the frequency to filter out. Currently set to 5%
bin_freq = float(5) / float(100)

DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col] == i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts) / float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col] == i], DF_Filtered])

print(DF_Filtered)
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None) (replaced in newer pandas by clip(lower=...) and clip(upper=...)), which cap, rather than remove, all values below or above (respectively) a certain threshold.
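For completeness, a minimal sketch of the current clip API; note that clipping caps out-of-range numeric values instead of dropping rows, and the column name and bounds below are hypothetical:

low, high = 1.0, 99.0                                         # hypothetical bounds
df['distance'] = df['distance'].clip(lower=low, upper=high)   # values outside [low, high] are capped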
We can also replace all the rare categories with one label, say "Rare", and remove them later if this doesn't add value to prediction.
# find the labels whose relative frequency is above a given percentage/threshold
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index

vars_cat = [val for val in data.columns if data[val].dtype == 'O']

for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories with the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')
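As a quick follow-up check (assuming the data and vars_cat names from the snippet above), you can look at what share of rows ended up relabelled as "Rare" in each categorical column:

for var in vars_cat:
    print(var, (data[var] == 'Rare').mean())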