Below I define two dataframes, an input and an expected output.
df_input:
Column A is the column to be grouped on.
Column B is a sort of index (an enumeration) within each group of A.
Column C contains the elements to be averaged.
df_output:
Column D holds the calculated average.
Within each group of A, take the average over the elements of C, but only the first half of them (ordered by B). If the number of elements is odd, round the half up (ceil).
This is a simplified version of a much larger dataset. Here is how I have solved it so far:
import pandas as pd
import numpy as np

df_input = pd.DataFrame({"A": [2, 2, 3, 3, 3, 4, 4, 4, 4],
                         "B": [2, 1, 1, 3, 2, 4, 2, 1, 3],
                         "C": [1, 1, 2, 2, 2, 4, 4, 4, 4]})
df_output = pd.DataFrame({"A": [2, 2, 3, 3, 3, 4, 4, 4, 4],
                          "B": [2, 1, 1, 3, 2, 4, 2, 1, 3],
                          "C": [1, 1, 2, 2, 2, 4, 4, 4, 4],
                          "D": [1, 1, 2, 2, 2, 4, 4, 4, 4]})
df = df_input.copy()
df.sort_values(by=['A', 'B'], inplace=True)
df['E'] = np.ceil(df['A'] / 2)  # number of elements to average within each group of 'A' (in this data the value of A happens to equal the group size)
df['G'] = df['C'] / df['E']  # each element pre-divided by that count
df['H'] = np.where(df['E'] >= df['B'], df['G'], 0)  # keep only the first half (by 'B'), zero out the rest
df['D'] = df.groupby('A')['H'].transform('sum')  # the sum of the kept fractions is the average
I'm hoping to get this done in a neater, one-liner type of way... please :)
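For what it's worth, a more compact version of the same idea (a sketch; it assumes, as above, that rows within each group are ordered by B before taking the first half) can use a single groupby/transform:

```python
import numpy as np
import pandas as pd

df_input = pd.DataFrame({"A": [2, 2, 3, 3, 3, 4, 4, 4, 4],
                         "B": [2, 1, 1, 3, 2, 4, 2, 1, 3],
                         "C": [1, 1, 2, 2, 2, 4, 4, 4, 4]})

# Sort so "first half" is well defined, then average the first ceil(n/2) elements per group;
# transform broadcasts the scalar mean back to every row of the group.
df = df_input.sort_values(['A', 'B'])
df['D'] = df.groupby('A')['C'].transform(lambda s: s.iloc[:int(np.ceil(len(s) / 2))].mean())
```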
I am trying to add a row to a multi-index level and perform calculations that are constructed from the individual rows of the ungrouped dataframe. The results are then added to the grouped dataframe.
import numpy as np
import pandas as pd
import random
years = [2000, 2001, 2002]
products = ["A", "B", "C"]
num_combos = 10
years = [random.choice(years) for i in range(num_combos)]
products = [random.choice(products) for i in range(num_combos)]
sum_values = list(range(0, num_combos))
random.shuffle(sum_values)
av_values = [random.randrange(0, num_combos, 1) for i in range(num_combos)]
cols = {"years": years,
        "products": products,
        "sum_col": sum_values,
        "av_col": av_values}
df = pd.DataFrame(cols)
The above dataframe is randomly generated. My real df has a number of columns that I want to either sum within each group (account) or average within each group. I can achieve this using the following:
gdf = df.groupby(["products", "years"]).agg(s=("sum_col", "sum"),
                                            a=("av_col", "mean"))
However, I now want to add a row to this multi-index level, denoted "Total/Avg", whose value in each column is determined from the individual rows: for the summed columns I could just sum over that level, but for the other columns I need the average of the individual rows. One solution is provided below:
def addTotalAvgMultiindex(df):
    num_indexes = len(list(df.index.levels))
    if num_indexes == 3:
        a, b, c = df.index.levels
        df = df.reindex(pd.MultiIndex.from_product([a, b, [*c, 'Total/Avg']]))
    elif num_indexes == 4:
        a, b, c, d = df.index.levels
        df = df.reindex(pd.MultiIndex.from_product([a, b, c, [*d, 'Total/Avg']]))
    elif num_indexes == 2:
        a, b = df.index.levels
        df = df.reindex(pd.MultiIndex.from_product([a, [*b, 'Total/Avg']]))
    return df
gdf = addTotalAvgMultiindex(gdf)
gdf.index = gdf.index.set_names(["products", "years"])
for col in gdf.columns:
    if col == "s":
        total = df.groupby(["products"]).agg(total=("sum_col", "sum"))
    elif col == "a":
        total = df.groupby(["products"]).agg(total=("av_col", "mean"))
    total_values = [x for xs in total.values for x in xs]
    gdf.loc[gdf.index.get_level_values("years") == "Total/Avg", col] = total_values
This seems very tedious, especially with many columns (currently I only need sum or average, but additional measures such as the median could be added).
Is there a smarter way to add the row to the multi-index and to calculate the results from the individual rows of df, without reindexing and renaming the levels, looping over the columns, and filling in the values one at a time? Assume there are several columns that need to be summed and several others that need to be averaged.
You can aggregate on the same multi-index after setting the years column to a constant value, and then concatenate that result with the previous aggregation:
total_gdf = df.assign(years="Total/Avg").groupby(
    ["products", "years"]).agg(s=("sum_col", "sum"), a=("av_col", "mean"))
pd.concat([gdf, total_gdf]).sort_index()
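If many columns are involved, the named-aggregation spec can be defined once and reused for both groupings — a minimal sketch, assuming the same sum/mean split as above (years are kept as strings here so the "Total/Avg" label can be sorted alongside them):

```python
import random
import pandas as pd

random.seed(0)
n = 10
df = pd.DataFrame({
    "years": [str(random.choice([2000, 2001, 2002])) for _ in range(n)],
    "products": [random.choice(["A", "B", "C"]) for _ in range(n)],
    "sum_col": range(n),
    "av_col": range(n),
})

# One aggregation spec, reused for the per-group result and for the Total/Avg rows.
agg_spec = {"s": ("sum_col", "sum"), "a": ("av_col", "mean")}
gdf = df.groupby(["products", "years"]).agg(**agg_spec)
total_gdf = df.assign(years="Total/Avg").groupby(["products", "years"]).agg(**agg_spec)
out = pd.concat([gdf, total_gdf]).sort_index()
```

Adding a new measure (e.g. a median column) then means touching only agg_spec.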
I'm trying to create a new column on my DataFrame by combining two existing columns
import pandas as pd
import numpy as np
DATA = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
DATA['index'] = np.arange(5)
DATA.set_index('index', inplace=True)
The output is something like this
              A         B
index
0     -0.003635 -0.644897
1     -0.617104 -0.343998
2      1.270503 -0.514588
3     -0.053097 -0.404073
4     -0.056717  1.870671
I would like an extra column 'C' containing an np.array with the elements of 'A' and 'B' from the corresponding row. In the real case, 'A' and 'B' already hold 1D np.arrays, but of different lengths, and I would like one longer array with all the elements stacked/concatenated.
Thanks
If columns a and b contain numpy arrays, you can apply np.hstack across rows:
import pandas as pd
import numpy as np
num_rows = 10
max_arr_size = 3
df = pd.DataFrame({
    "a": [np.random.rand(max_arr_size) for _ in range(num_rows)],
    "b": [np.random.rand(max_arr_size) for _ in range(num_rows)],
})
df["c"] = df.apply(np.hstack, axis=1)
assert all(row.a.size + row.b.size == row.c.size for _, row in df.iterrows())
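Since in the real case the arrays in a and b have different lengths, it may be worth noting that the same apply works for ragged rows too — a quick sketch:

```python
import numpy as np
import pandas as pd

# Arrays whose lengths differ per row and per column.
df2 = pd.DataFrame({
    "a": [np.random.rand(2), np.random.rand(5)],
    "b": [np.random.rand(3), np.random.rand(1)],
})
# np.hstack concatenates each row's arrays end to end.
df2["c"] = df2.apply(np.hstack, axis=1)
```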
DATA['C'] = DATA.apply(lambda x: np.array([x.A, x.B]), axis=1)
Since each cell stores its own array object, the arrays in 'C' do not all need to be the same length, so uneven series are not a problem.
The objective is to create a new multiindex column (stat) based on conditions on the columns A and B.
Condition for A
CONDITION_A = 'N' if A < 0 else 'P'
and
Condition for B
CONDITION_B = 'l' if B < 0 else 'g'
Currently, the idea is to analyse conditions A and B separately, combine the two analyses to obtain the column stat as below, and finally append it back to the main dataframe.
However, I wonder whether there is a way to minimise the code needed to achieve the same objective.
The expected output
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]) , ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_1 = df.loc[:, idx[:, 'A']] < 0
appenddf = mask_1.replace({True: 'N', False: 'P'}).rename(columns={'A': 'iii'}, level=1)
mask_2 = df.loc[:, idx[:, 'B']] < 0
appenddf_2 = mask_2.replace({True: 'l', False: 'g'}).rename(columns={'B': 'iv'}, level=1)
# combine the multiindex
stat_comparison = ['_'.join(i) for i in zip(appenddf["iii"], appenddf_2["iv"])]
You can try concatenating both df's:
s = pd.concat([appenddf, appenddf_2], axis=1)
cols = pd.MultiIndex.from_product([s.columns.get_level_values(0).unique(), ['stat']])
out = pd.concat([s.loc[:, (x, slice(None))].agg('_'.join, axis=1)
                 for x in s.columns.get_level_values(0).unique()],
                axis=1, keys=cols)
output of out:
One Two
stat stat
0 P_g P_l
1 N_l N_l
2 N_l N_g
3 P_g P_l
4 N_l P_l
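As a side note, the same stat columns can be built without the intermediate mask frames by pairing two np.where calls with element-wise string concatenation — a sketch, assuming the same 'One'/'Two' top level as above:

```python
import numpy as np
import pandas as pd

np.random.seed(3)
columns = pd.MultiIndex.from_arrays([['One', 'One', 'Two', 'Two'], ['A', 'B', 'A', 'B']])
df = pd.DataFrame(np.random.randn(5, 4), columns=columns)

idx = pd.IndexSlice
# Encode the sign conditions directly as string arrays.
a_part = np.where(df.loc[:, idx[:, 'A']].to_numpy() < 0, 'N', 'P')
b_part = np.where(df.loc[:, idx[:, 'B']].to_numpy() < 0, 'l', 'g')
# Join them element-wise into "X_y" labels.
stat = pd.DataFrame(np.char.add(np.char.add(a_part, '_'), b_part),
                    index=df.index,
                    columns=pd.MultiIndex.from_product([['One', 'Two'], ['stat']]))
```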
I am trying to make a script that loops through the rows of a dataframe and builds a new column by appending values from column A or B depending on a condition in column C. However, something seems to go wrong in the appending, as each entry of my new column contains several values.
import pandas as pd
import numpy as np
#Loading in the csv file
filename = '35180_TRA_data.csv'
df1 = pd.read_csv(filename, sep=',', nrows=1300, skiprows=25, index_col=False, header=0)
#Calculating the B concentration using column A and a factor
B_calc = df1['A']*137.818
#The measured B concentration
B_measured = df1['B']
#Looping through the dataset, and append the B_calc values where the C column is 2, while appending the B_measured values where the C column is 1.
calculations = []
for row in df1['C']:
    if row == 2:
        calculations.append(B_calc)
    if row == 1:
        calculations.append(B_measured)
df1['B_new'] = calculations
The values of my new column (B_new) are all wrong. For example, the first row should contain just 0.00, but instead it holds numerous values. So something is going wrong in the appending. Can anyone spot the problem?
B_calc and B_measured are whole Series. As such, you have to specify which value you want to append, otherwise you append the entire Series each time. Here is how you could do it:
df1 = pd.DataFrame({"A": [1, 3, 5, 7, 9], "B": [9, 7, 5, 3, 1], "C": [1, 2, 1, 2, 1]})
#Calculating the B concentration using column A and a factor
B_calc = df1['A']*137.818
#The measured B concentration
B_measured = df1['B']
#Looping through the dataset, and append the B_calc values where the C column is 2, while appending the B_measured values where the C column is 1.
calculations = []
for index, row in df1.iterrows():
    if row['C'] == 2:
        calculations.append(B_calc[index])
    if row['C'] == 1:
        calculations.append(B_measured[index])
df1['B_new'] = calculations
But iterating over rows is bad practice because it is slow. A better way is to use boolean masks; here is how that works:
mask_1 = df1['C'] == 1
mask_2 = df1['C'] == 2
df1.loc[mask_1, 'B_new'] = df1.loc[mask_1, 'B']
df1.loc[mask_2, 'B_new'] = df1.loc[mask_2, 'A'] * 137.818
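If C only ever takes the values 1 and 2 (an assumption), the two masks collapse into a single np.where:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 3, 5, 7, 9], "B": [9, 7, 5, 3, 1], "C": [1, 2, 1, 2, 1]})
# Where C == 2 take the calculated value, otherwise (C == 1) the measured one.
df1['B_new'] = np.where(df1['C'] == 2, df1['A'] * 137.818, df1['B'])
```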
I'd like to create an empty column in an existing DataFrame, set the first value of that column to 100, and then iterate to fill the rest of the column with a formula like row[C][t-1] * (1 + row[B][t]).
This is very similar to:
Creating an empty Pandas DataFrame, then filling it?
But the difference is fixing the first value of column 'C' to 100, rather than filling it entirely with formulas.
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B','C']
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0)
data = np.array([np.arange(10)]*3).T
df = pd.DataFrame(data, index=index, columns=columns)
df['B'] = df['A'].pct_change()
df['C'] = df['C'].shift() * (1 + df['B'])
## how do I set 2016-10-03 in column 'C' to equal 100 and then calc consecutively from there?
df
Try this. Unfortunately, something like a for loop is needed here, because each row is calculated from the prior row's value, which has to be carried along as the loop moves down the rows (c_column in my example):
c_column = []
c_column.append(100)
for x, i in enumerate(df['B']):
    if x > 0:
        c_column.append(c_column[x - 1] * (1 + i))
df['C'] = c_column
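As a side note, this recurrence is just a running product of (1 + B), so it can also be vectorised with cumprod — a sketch, assuming B holds per-period returns with NaN in the first row, as pct_change produces (note that in the example above A starts at 0, so pct_change would yield an inf that needs cleaning first):

```python
import numpy as np
import pandas as pd

b = pd.Series([np.nan, 0.10, -0.20, 0.05])  # example returns; first entry NaN as pct_change gives
# C[0] = 100 and C[t] = C[t-1] * (1 + B[t])  is equivalent to  100 * cumprod(1 + B)
c = 100 * (1 + b.fillna(0)).cumprod()
```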