I am trying to create 2 new columns in a pandas DataFrame. The first column, aa, which shows the average temperature, is correct; however, the second column, bb, which should show the temperature in each City minus the average temperature across all cities, displays the value 0.
Where is the problem? Did I use the lambda correctly? Could you give me a solution? Thank you very much!
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.groupby(['City'])["Temperature"].transform(lambda x: x - np.mean(x))
display(file.head(10))
EDIT: Updated according to gereleth's comment. You can simplify it even more!
file['bb'] = file.Temperature - file.aa
Since we've already calculated the mean value in the aa column, we can simply reuse it and compute, for each row, the difference between the Temperature and aa columns using pandas' apply method, like below:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.apply(lambda row: row['Temperature'] - row['aa'], axis=1)
display(file.sample(10))
If you are looking to subtract the average temperature of all cities, you can take the mean of the aa column instead:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
avg_all_cities = file['aa'].mean()
file["bb"] = file.apply(lambda row: row['Temperature'] - avg_all_cities, axis=1)
display(file.sample(10))
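As with the edit above, the apply call can be replaced by a plain vectorized subtraction. A minimal sketch, assuming the same file DataFrame with the aa column already computed:

# All-cities average; since aa repeats each city's mean once per row,
# its mean equals the overall Temperature mean
avg_all_cities = file['aa'].mean()
file['bb'] = file['Temperature'] - avg_all_cities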
I am new to Pandas.
My DataFrame looks like this:
I am having problems with adding 1st, 2nd, 3rd quartiles to my DataFrame.
I am trying to get quartiles for column CTR if they are on the same group determined by column Cat.
In total, I have about 40 groups.
What I've tried:
df_final['1st quartile'] = round(
    df_final.groupby('Cat')['CTR'].quantile(0.25), 2)
df_final['2nd quartile'] = round(
    df_final.groupby('Cat')['CTR'].quantile(0.5), 2)
df_final['3rd quartile'] = round(
    df_final.groupby('Cat')['CTR'].quantile(0.75), 2)
But the values get added in a way I cannot explain: they start in the second row and are not aligned the way the last column, CTR Average Difference vs category, is.
My desired output would look the same as the last column, CTR Average Difference vs category, one line per category.
Any suggestions what might be wrong? Thank you.
If you want a new column filled by aggregated values like mean, sum or quantile, use GroupBy.transform:
# similar for the 2nd and 3rd quartiles
df_final['1st quartile'] = (df_final.groupby('Cat')['CTR']
                                    .transform(lambda x: x.quantile(0.25))
                                    .round(2))
Or you can use DataFrameGroupBy.quantile and then DataFrame.join by Cat column:
df = (df_final.groupby('Cat')['CTR']
              .quantile([0.25, 0.5, 0.75])
              .round(2)
              .unstack())
df.columns = ['1st quartile', '2nd quartile', '3rd quartile']
df_final = df_final.join(df, on='Cat')
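A tiny made-up frame (data assumed purely for illustration) showing what the transform approach produces:

import pandas as pd

df_final = pd.DataFrame({'Cat': ['a', 'a', 'a', 'b', 'b', 'b'],
                         'CTR': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})

df_final['1st quartile'] = (df_final.groupby('Cat')['CTR']
                                    .transform(lambda x: x.quantile(0.25))
                                    .round(2))
# Every row of group 'a' gets 1.5 and every row of group 'b' gets 15.0,
# so the new column stays aligned with the original rows
print(df_final)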
I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represent the values of a generic stock market.
I want to calculate the percentage difference between two rows of the column "Close" (in effect, how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is a terrible approach with pandas on a big data problem), and it produces the right results, but in a different DataFrame:
rows_number = df_stock.shape[0]
percentage_df = pd.DataFrame(columns=['Date', 'Percentage'])

# The first row is set to 1 because the change is expressed as a percentage:
# when there is no "yesterday", the value must be 1
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)

# For each day, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index - 1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this to take advantage of the DataFrame API, removing the for loop and creating the new column in place?
df['Change'] = df['Close'].pct_change()
or, if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
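Note that pct_change returns a fraction and leaves the first row as NaN. If you want the same scale as the loop in the question (values in percent, first day filled with 1), a possible variant, assuming the same df:

# Day-over-day change in percent, with the first row filled as in the original loop
df['Percentage'] = (df['Close'].pct_change() * 100).fillna(1)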
I would suggest first making the Date column a DatetimeIndex; for this you can use:
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then you can simply access any row and column by datetime indexing and do whatever operations you want. For example, to calculate the percentage difference between two rows of the column "Close":
df_stock['percentage'] = ((df_stock.loc['15-07-2019', 'Close'] - df_stock.loc['14-07-2019', 'Close'])
                          / df_stock.loc['14-07-2019', 'Close']) * 100
You can also use a for loop to perform the operation for each date or row:
for Dt in df_stock.index:
    close_on_day = df_stock.loc[Dt, 'Close']  # placeholder: do the per-date work you need here
Using diff
(-df['Close'].diff())/df['Close'].shift()
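For what it's worth, the leading minus sign flips the sign of the usual day-over-day change; without it the expression matches pct_change. A quick sketch, assuming the same df:

# (today - yesterday) / yesterday, identical to df['Close'].pct_change()
change = df['Close'].diff() / df['Close'].shift()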
I want to check whether one date is between two other dates (everything in the same row). If it is, a new column should be filled with the sales value from the same row. If not, the row should be dropped.
The code shall iterate over the entire dataframe.
This is my code:
for row in final:
    x = 0
    if pd.to_datetime(final['start_date'].iloc[x]) < pd.to_datetime(final['purchase_date'].iloc[x]) < pd.to_datetime(final['end_date'].iloc[x]):
        final['new_col'].iloc[x] = final['sales'].iloc[x]
    else:
        final.drop(final.iloc[x])
    x = x + 1

print(final['new_col'])
Instead of the values of final['sales'], I just get 0 back.
Does anyone know where the mistake is or any other efficient way to tackle this?
The DataFrame looks like this:
I would do something like this:
First, creating the new column:
import numpy as np
final['new_col'] = np.where(
    (pd.to_datetime(final['start_date']) < pd.to_datetime(final['purchase_date'])) &
    (pd.to_datetime(final['purchase_date']) < pd.to_datetime(final['end_date'])),
    final['sales'],
    np.nan
)
Then, you just drop the Na's:
final.dropna(inplace=True)
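A small made-up example (column names taken from the question, data assumed) to show the two steps end to end:

import pandas as pd
import numpy as np

final = pd.DataFrame({'start_date':    ['2020-01-01', '2020-02-01'],
                      'purchase_date': ['2020-01-15', '2020-01-15'],
                      'end_date':      ['2020-01-31', '2020-02-28'],
                      'sales':         [100, 200]})

between = (pd.to_datetime(final['start_date']) < pd.to_datetime(final['purchase_date'])) & \
          (pd.to_datetime(final['purchase_date']) < pd.to_datetime(final['end_date']))
final['new_col'] = np.where(between, final['sales'], np.nan)
final.dropna(inplace=True)   # only the first row survives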
I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A'][current] / df['A'][current - 1 min]   # pseudo-notation
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as index in order to use get_loc and create a new dataframe new_df starting from 1 minute after df. In this way I'm sure I have all the data when I go look 1 minute earlier within the first minute of data.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta]  # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
Simple example below:
import pandas as pd
import numpy as np

n_vals = 50

# Create a DataFrame with random values and 'unusual' times
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])

# Demonstrate how to use .asof() to get the value that was the 'state'
# one minute before each index entry. Note the .values call
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values

# Note that there will be some NaNs to deal with; consider .fillna()
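From there, the ratio the question asks for (the current value divided by the value roughly one minute earlier) is just an element-wise division; the column name B is taken from the question:

# Current value divided by the value as of one minute earlier
df['B'] = df['value'] / df['value_one_min_ago']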
Suppose I have a DataFrame with columns person_id and mean_act, where every row is a numerical value for a specific person. I want to calculate the zscore for all the values at a person level. That is, I want a new column mean_act_person_zscore that is computed as the zscore of mean_act using the mean and std of the zscores for that person only (and not the whole dataset).
My first approach is something like this:
person_ids = df['person_id'].unique()

for pid in person_ids:
    person_df = df[df['person_id'] == pid]
    person_df = (person_df['mean_act'] - person_df['mean_act'].mean()) / person_df['mean_act'].std()
At every iteration, it computes the right zscore output series, but the problem is that since the selection is by reference, not by value, the original df ends up without having the mean_act_person_zscore column.
Thoughts as to how to do this?
Should be straightforward:
df['mean_act_person_zscore'] = df.groupby('person_id').mean_act.transform(lambda x: (x - x.mean()) / x.std())
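A quick made-up example (data assumed) to show that the z-scores are computed per person rather than over the whole column:

import pandas as pd

df = pd.DataFrame({'person_id': [1, 1, 1, 2, 2, 2],
                   'mean_act':  [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})

df['mean_act_person_zscore'] = (df.groupby('person_id').mean_act
                                  .transform(lambda x: (x - x.mean()) / x.std()))
# Both groups come out as [-1.0, 0.0, 1.0] despite very different scales
print(df)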