TL;DR: I understand that .apply() is slow in pandas. However, I have a function that acts on indexes, which I cannot figure out how to vectorize. I want this function to act on two sets of parameters (500,000 and 1,500 long, respectively). I want to produce a dataframe with the first parameter as the row index, the second parameter as column names, and the cells containing the function's output for that particular row and column. As it stands it looks like the code will take several days to run. More details and minimal reproducible example below:
INPUT DATA:
I have a series of unique student IDs, which is 500,000 students long. I have a df (exam_score_df) that is indexed by these student IDs, containing each student's corresponding scores in math, language, history, and science.
I also have a series of school codes (each school code corresponds to a school), which is 1,500 schools long. I have a df (school_weight_df) that is indexed by school codes, containing each school's weights in math, language, history, and science that it uses to calculate a student's score. Each row also has an 'Alternative_Score' column containing 'Y' or 'N', because some schools let you use the better of the history and science scores when calculating your overall score.
FUNCTION I WROTE TO BE VECTORIZED:
def calc_score(student_ID, program_code):
    '''
    For a given student and program, returns the student's score for that program.
    '''
    if school_weight_df.loc[program_code]['Alternative_Score'] == 'N':
        return np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST', 'SCI']]),
                      np.array(school_weight_df.loc[program_code][['%LANG', '%MAT', '%HIST', '%SCI']]))
    elif school_weight_df.loc[program_code]['Alternative_Score'] == 'Y':
        history_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST']]),
                               np.array(school_weight_df.loc[program_code][['%LANG', '%MAT', '%HIST']]))
        science_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'SCI']]),
                               np.array(school_weight_df.loc[program_code][['%LANG', '%MAT', '%SCI']]))
        return max(history_score, science_score)
EXAMPLE DFs:
Here are example dfs for exam_score_df and school_weight_df:
student_data = [[3, 620, 688, 638, 688], [5, 534, 602, 606, 700], [9, 487, 611, 477, 578]]
exam_score_df = pd.DataFrame(student_data, columns=['student_ID', 'LANG', 'MAT', 'HIST', 'SCI'])
exam_score_df.set_index('student_ID', inplace=True)
program_data = [[101, 20, 30, 25, 25, 'N'], [102, 40, 10, 50, 50, 'Y']]
school_weight_df = pd.DataFrame(program_data, columns=['program_code', '%LANG', '%MAT', '%HIST', '%SCI', 'Alternative_Score'])
school_weight_df.set_index('program_code', inplace=True)
Here are the series used to index the code below:
series_student_IDs = pd.Series(exam_score_df.index)
series_program_codes = pd.Series(school_weight_df.index)
CODE TO CREATE DF USING FUNCTION:
To create the df of all of the students' scores at each program, I used nested .apply()'s:
new_df = pd.DataFrame(series_student_IDs.apply(lambda x: series_program_codes.apply(lambda y: calc_score(x, y))))
I've already read several primers on optimizing code in pandas, including the very well-written Guide by Sofia Heisler. My primary concern, and reason for why I can't figure out how to vectorize this code, is that my function needs to act on indexes. I also have a secondary concern that, even if I do vectorize, there is this problem with np.dot on large matrices for which I would want to loop anyway.
Thanks for all the help! I have only been coding for a few months, so all the helpful comments are really appreciated.
Apply = bad, Double Apply = very bad
If you are going NumPy inside the function, why not go NumPy all the way? You would still prefer a batch-wise approach, since the full matrix would take a huge amount of memory. Check the following approach.
Each iteration took me about 2.05 seconds for a batch of 5,000 students on a low-end MacBook Pro, so for 500,000 students you can expect roughly 200 seconds, which is not half bad.
I ran the following on 100,000 students and 1,500 schools, which took me roughly 30-40 seconds in total.
First I created a dummy data set: exam scores (100,000 students × 4 subjects), school weights (1,500 schools × 4 weights), and a boolean flag marking which schools have Alternative == 'Y' (True) or 'N' (False).
Next, for a batch of 5,000 students, I simply calculate the element-wise product of each of the 4 subjects between the two matrices using np.einsum. This gives me (5000, 4) * (1500, 4) -> (1500, 5000, 4). Consider this the first part of the dot product, without the sum.
The reason I do this is that it is a necessary step for both of your conditions, 'N' and 'Y'.
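As a quick aside on that einsum step (a minimal sketch with made-up toy sizes, not part of the original answer): each output cell is one subject-wise product, and summing over the last axis recovers the ordinary dot product.

```python
import numpy as np

# Toy shapes: 2 students, 3 schools, 4 subjects (sizes chosen only for illustration)
scores = np.arange(8).reshape(2, 4)       # (students, subjects)
weights = np.arange(12).reshape(3, 4)     # (schools, subjects)

prod = np.einsum('ij,kj->kij', scores, weights)   # (schools, students, subjects)
assert prod.shape == (3, 2, 4)
assert prod[1, 0, 2] == scores[0, 2] * weights[1, 2]   # per-subject product, no sum yet

# Summing over the subject axis gives the plain dot product scores @ weights.T
assert np.allclose(prod.sum(-1), (scores @ weights.T).T)
```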
Next, FOR 'N': I simply filter the above matrix on alt_flag, reduce it (sum) over the last axis, and transpose to get (5000, 766), where 766 is the number of schools with Alternative == 'N'.
FOR 'Y': I filter on alt_flag, sum the first two subjects (because they are common to both options), add that to the 3rd and 4th subject separately, take the maximum of the two, and return that as the final score, followed by a transpose. This gives me (5000, 734).
I do this for each batch of 5,000 until I have appended all the batches, and then simply np.vstack them to get the final tables, (100000, 766) and (100000, 734).
Now I could simply stack these along axis=1 to get (100000, 1500), but if I want to map them back to the IDs (students, schools), it is easier to do it separately using pd.DataFrame(data, columns=list_of_schools_alt_Y, index=list_of_student_ids) and then combine them. See the last step below.
The last step is for you to perform, since I don't have the complete dataset. Because the order of the indexes is retained through batch-wise vectorization, you can simply map the 766 school IDs with 'N', the 734 school IDs with 'Y', and the 100,000 student IDs, in the order they occur in your main dataset, then append the two data frames to create a final (massive) dataframe (a rough sketch of this mapping step is given after the code below).
NOTE: you will have to change the 100000 to 500000 in the for loop, don't forget!!
import numpy as np
import pandas as pd
from tqdm import notebook

exam_scores = np.random.randint(300, 800, (100000, 4))
school_weights = np.random.randint(10, 50, (1500, 4))
alt_flag = np.random.randint(0, 2, (1500,), dtype=bool)  # False for 'N', True for 'Y'

batch = 5000
n_alts = []
y_alts = []

for i in notebook.tqdm(range(0, 100000, batch)):
    scores = np.einsum('ij,kj->kij', exam_scores[i:i+batch], school_weights)  # (1500, 5000, 4)

    # Alternative == 'N'
    n_alt = scores[~alt_flag].sum(-1).T       # (766, 5000, 4) -> (5000, 766)

    # Alternative == 'Y'
    lm = scores[alt_flag, :, :2].sum(-1)      # (734, 5000, 2) -> (734, 5000); lang + math
    h = scores[alt_flag, :, 2]                # (734, 5000); history
    s = scores[alt_flag, :, 3]                # (734, 5000); science
    y_alt = np.maximum(lm + h, lm + s).T      # (5000, 734)

    n_alts.append(n_alt)
    y_alts.append(y_alt)

final_n_alts = np.vstack(n_alts)
final_y_alts = np.vstack(y_alts)

print(final_n_alts.shape)
print(final_y_alts.shape)
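Here is a hedged sketch of that final mapping step, assuming the rows of exam_scores follow the order of series_student_IDs and the rows of school_weights / alt_flag follow the order of school_weight_df.index (assumptions on my part, since the code above uses random dummy data):

```python
# Sketch only: relies on the ordering assumptions stated above.
n_school_codes = school_weight_df.index[~alt_flag]   # programs with Alternative_Score == 'N'
y_school_codes = school_weight_df.index[alt_flag]    # programs with Alternative_Score == 'Y'

n_df = pd.DataFrame(final_n_alts, index=series_student_IDs, columns=n_school_codes)
y_df = pd.DataFrame(final_y_alts, index=series_student_IDs, columns=y_school_codes)

# One wide frame: students as rows, all program codes as columns
final_df = pd.concat([n_df, y_df], axis=1)
```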
Related
PROBLEM: I have a dataframe showing which assignments students chose to do and what grades they got on them. I am trying to determine which subsets of assignments were done by the most students and the total points earned on them. The method I'm using is very slow, so I'm wondering what the fastest way is.
My data has this structure:
| STUDENT | ASSIGNMENT1 | ASSIGNMENT2 | ASSIGNMENT3 | ... | ASSIGNMENT20 |
| --- | --- | --- | --- | --- | --- |
| Student1 | 50 | 75 | 100 | ... | 50 |
| Student2 | 75 | 25 | NaN | ... | NaN |
| ... | | | | | |
| Student2000 | 100 | 50 | NaN | ... | 50 |
TARGET OUTPUT:
For every possible combination of assignments, I'm trying to get the number of completions and the sum of total points earned on each individual assignment by the subset of students who completed that exact assignment combo:
| ASSIGNMENT_COMBO | NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO | ASSIGNMENT1 TOTAL POINTS | ASSIGNMENT2 TOTAL POINTS | ASSIGNMENT3 TOTAL POINTS | ... | ASSIGNMENT20 TOTAL POINTS |
| --- | --- | --- | --- | --- | --- | --- |
| Assignment 1, Assignment 2 | 900 | 5000 | 400 | NaN | ... | NaN |
| Assignment 1, Assignment 2, Assignment 3 | 100 | 3000 | 500 | ... | ... | NaN |
| Assignment 2, Assignment 3 | 750 | NaN | 7000 | 750 | ... | NaN |
| ... (all possible combos, including any number of assignments) | | | | | | |
WHAT I'VE TRIED: First, I'm using itertools to make my assignment combos and then iterating through the dataframe to classify each student by what combos of assignments they completed:
for combo in itertools.product(list_of_assignment_names, repeat=20):
    for i, row in starting_data.iterrows():
        ifor = str(combo)
        ifor_val = 'no'
        for item in combo:
            if row[str(item)] > 0:
                ifor_val = 'yes'
        starting_data.at[i, ifor] = ifor_val
Then, I make a second dataframe (assignmentcombostats) that has each combo as a row to count up the number of students who did each combo:
numberofstudents = []
for combo in assignmentcombostats['combo']:
    column = str(combo)
    number = len(starting_data[starting_data[column] == 'yes'])
    numberofstudents.append(number)
assignmentcombostats['numberofstudents'] = numberofstudents
This works, but it is very slow.
RESOURCES: I've looked at a few resources:
- This post is what I based my current method on
- This page has ideas for faster iterating, but I'm not sure of the best way to solve my problem using vectorization
One approach to speed up your code is to avoid using for loops and instead use pandas built-in functions to apply transformations on your data. Here's an example implementation that should accomplish your desired output:
import itertools
import pandas as pd

# sample data
data = {
    'STUDENT': ['Student1', 'Student2', 'Student3', 'Student4'],
    'ASSIGNMENT1': [50, 75, 100, 100],
    'ASSIGNMENT2': [75, 25, 50, 75],
    'ASSIGNMENT3': [100, None, 75, 50],
    'ASSIGNMENT4': [50, None, None, 100]
}
df = pd.DataFrame(data)

# create a list of all possible assignment combinations
assignments = df.columns[1:].tolist()
combinations = []
for r in range(1, len(assignments) + 1):
    combinations += list(itertools.combinations(assignments, r))

# create a dictionary to hold the results
results = {'ASSIGNMENT_COMBO': [],
           'NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO': [],
           'ASSIGNMENT_TOTAL_POINTS': []}

# iterate over the combinations and compute the results
for combo in combinations:
    # filter the dataframe for students who have completed this combo
    combo_df = df.loc[df[list(combo)].notnull().all(axis=1)]
    num_students = len(combo_df)
    # compute the total points for each assignment in the combo
    points = combo_df[list(combo)].sum()
    # append the results to the dictionary
    results['ASSIGNMENT_COMBO'].append(combo)
    results['NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO'].append(num_students)
    results['ASSIGNMENT_TOTAL_POINTS'].append(points.tolist())

# create a new dataframe from the results dictionary
combo_stats_df = pd.DataFrame(results)

# explode the ASSIGNMENT_COMBO column into separate rows for each assignment in the combo
combo_stats_df = combo_stats_df.explode('ASSIGNMENT_COMBO')

# create separate columns for each assignment in the combo
for i, assignment in enumerate(assignments):
    combo_stats_df[f'{assignment} TOTAL POINTS'] = combo_stats_df['ASSIGNMENT_TOTAL_POINTS'].apply(lambda x: x[i])

# drop the ASSIGNMENT_TOTAL_POINTS column
combo_stats_df = combo_stats_df.drop('ASSIGNMENT_TOTAL_POINTS', axis=1)

print(combo_stats_df)
This code first creates a list of all possible assignment combinations using itertools.combinations. Then, it iterates over each combo and filters the dataframe to include only students who have completed the combo. It computes the number of students and the total points for each assignment in the combo using built-in pandas functions like notnull, all, and sum. Finally, it creates a new dataframe from the results dictionary and explodes the ASSIGNMENT_COMBO column into separate rows for each assignment in the combo. It then creates separate columns for each assignment and drops the ASSIGNMENT_TOTAL_POINTS column. This approach should be much faster than using for loops, especially for large dataframes.
I had a go at tidying up Bryan's answer:
- Make a list of all possible combinations
- Iterate over each combination to find the totals and number of students
- Combine the results into a dataframe
Setup: (Makes a dataset of 20,000 students and 10 assignments)
import itertools
import pandas as pd
import numpy as np

# Bigger random sample data
def make_data(rows, cols, nans, non_nans):
    df = pd.DataFrame()
    df["student"] = list(range(rows))
    for i in range(1, cols + 1):
        a = np.random.randint(low=1-nans, high=non_nans, size=(rows)).clip(0).astype(float)
        a[a <= 0] = np.nan
        df[f"a{i:02}"] = a
    return df

rows = 20000
cols = 10
df = make_data(rows, cols, 50, 50)

# dummy columns, makes aggregates easier
df["students"] = 1
df["combo"] = ""
Transformation:
# create a list of all possible assignment combinations (ignore the first column and the last two dummy columns)
assignments = df.columns[1:-2].tolist()
combos = []
for r in range(1, len(assignments) + 1):
    new_combos = list(itertools.combinations(assignments, r))
    combos += new_combos

# create a list to hold the results
results = list(range(len(combos)))

# ignore the student identifier column
df_source = df.iloc[:, 1:]

# iterate over the combinations and compute the results
for ix, combo in enumerate(combos):
    # filter the dataframe for students who have completed this combo
    df_filter = df.loc[df[list(combo)].notnull().all(axis=1)]
    # aggregate the results to a single row (sum of the dummy students column counts the rows)
    df_agg = df_filter.groupby("combo", as_index=False).sum().reset_index(drop=True)
    # store the assignment combination in the results
    df_agg["combo"] = ",".join(combo)
    # add the results to the list
    results[ix] = df_agg

# create a new dataframe from the results list
combo_stats_df = pd.concat(results).reset_index(drop=True)
In this demo it takes ~6 seconds to return ~1000 rows of results.
For 20 assignments that's ~1,000,000 rows of results, so ~6000 seconds (over 1.5 hours).
Even on my desktop it takes ~2 seconds to process 1,000 combinations, so ~0.5 hours for ~1,000,000 combinations from 20 assignments.
I initially tried to write it without the loop, but the process was killed for using too much memory. I like the puzzle, it helps me learn, so I'll ponder if there's a way to avoid the loop while staying within memory.
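One idea for that, sketched here under the assumption that the question's "exact assignment combo" means the exact set of assignments each student completed (rather than every subset they cover), and reusing the setup names above: label each row with its own completion pattern and aggregate once, so the work scales with the number of students instead of the number of combinations.

```python
# Hedged sketch, reusing df / assignments / the dummy "students" column from the setup above.
# "exact_combo" is a new helper column introduced here only for illustration.
pattern = df[assignments].notnull()

# Label each student with the comma-separated names of the assignments they completed
df["exact_combo"] = pattern.apply(lambda row: ",".join(a for a in assignments if row[a]), axis=1)

# A single groupby then yields the student count and per-assignment totals per exact pattern
exact_combo_stats = (
    df.groupby("exact_combo")
      .agg(students=("students", "sum"), **{a: (a, "sum") for a in assignments})
      .reset_index()
)
```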
Here is an example of the data I am dealing with:
This example of data is a shortened version of each Run. Here the runs are about 4 rows long. In a typical data set they are anywhere between 50-100 rows long. There are also 44 different runs.
So my goal is to get the average of the last 4 rows of a given column within Stage 2. Right now I am achieving that, but it takes the average under these conditions across the whole spreadsheet. I want to get these average values for each and every 'Run'.
df["Run"] = pd.DataFrame({
"Run": ["Run1.1", "Run1.2", "Run1.3", "Run2.1", "Run2.2", "Run2.3", "Run3.1", "Run3.2", "Run3.3", "Run4.1",
"Run4.2", "Run4.3", "Run5.1", "Run5.2", "Run5.3", "Run6.1", "Run6.2", "Run6.3", "Run7.1", "Run7.2",
"Run7.3", "Run8.1", "Run8.2", "Run8.3", "Run9.1", "Run9.2", "Run9.3", "Run10.1", "Run10.2", "Run10.3",
"Run11.1", "Run11.2", "Run11.3"],
})
av = df.loc[df['Stage'].eq(2),'Vout'].groupby("Run").tail(4).mean()
print(av)
I want to be able to get these averages for a given column that is in Stage 2, based on each and every 'Run'. As you can see before each data set there is a corresponding 'Run' e.g the second data set has 'Run1.2' before it.
Also, each file I am dealing with, the amount of rows per Run is different/not always the same.
So, it is important to note that this is not achievable with np.repeat, as with each new sheet of data, the rows can be any length, not just the same as the example above.
Expected output:
Run1.1 1841 (example value)
Run1.2 1703 (example value)
Run1.3 1390 (example value)
... and so on
Any help would be greatly appreciated.
What does your pandas df look like after you import the CSV?
I would say you can just group by the run column, like so:
import pandas as pd
df = pd.DataFrame({
"run": ["run1.1", "run1.2", "run1.1", "run1.2"],
"data": [1, 2, 3, 4],
})
df.groupby("run").agg({"data": ["sum"]}).head()
Out[4]:
          data
           sum
run
run1.1       4
run1.2       6
This will do the trick:
av = df.loc[df["Stage"].eq(2)]
av = av.groupby("Run").tail(4).groupby("Run")["Vout"].mean()
Now df.groupby("a").tail(n) will return a dataframe with only the last n rows for each value of a. The second groupby then aggregates these and returns the average per group.
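As a small illustration with invented numbers (column names taken from the question), the two-step pattern yields one mean per Run:

```python
import pandas as pd

df = pd.DataFrame({
    "Run":   ["Run1.1"] * 6 + ["Run1.2"] * 6,
    "Stage": [1, 1, 2, 2, 2, 2] * 2,
    "Vout":  [10, 11, 1, 2, 3, 4, 20, 21, 5, 6, 7, 8],
})

stage2 = df.loc[df["Stage"].eq(2)]
av = stage2.groupby("Run").tail(4).groupby("Run")["Vout"].mean()
print(av)
# Run1.1    2.5
# Run1.2    6.5
```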
I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.
Here is an example dataset:
data = {'date': ['01', '01', '01', '02', '02', '03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns = ['date','action'])
I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.
# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)
# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()
# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)
I've also tried:
getting the dummies in non-sparse format;
using groupby for aggregation;
going to sparse format before matrix multiplication.
But I fail in step 1, since there is not enough RAM to create such a large matrix.
I would greatly appreciate your help.
I came up with an answer using only sparse matrices based on this post. The code is fast, taking about 10 seconds for 10 million rows (my previous code took 6 minutes for 5000 rows and was not scalable).
The time and memory savings come from working with sparse matrices until the very last step when it is necessary to unravel the (already small) co-occurrence matrix before export.
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

## Get unique values for date and action
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

## Add an auxiliary variable
df['count'] = 1

## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                           shape=(date_c.categories.size, action_c.categories.size))

## Compute the dot product of the sparse matrix with its transpose
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)

## Unravel the co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(),
                    index=action_c.categories, columns=action_c.categories)
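As a quick sanity check (a sketch, not part of the original answer), running the same steps on the question's toy frame gives a 4x4 matrix whose diagonal counts the days each action occurred and whose off-diagonal entries count shared days:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# The question's toy data
data = {'date': ['01', '01', '01', '02', '02', '03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns=['date', 'action'])

date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)
df['count'] = 1

row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                           shape=(date_c.categories.size, action_c.categories.size))

cooc = pd.DataFrame(sparse_matrix.T.dot(sparse_matrix).todense(),
                    index=action_c.categories, columns=action_c.categories)
print(cooc)
# Expected values:
#            100  101  777  989855552
# 100          2    1    0          2
# 101          1    1    0          1
# 777          0    0    1          0
# 989855552    2    1    0          2
```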
There are a couple of fairly straightforward simplifications you can consider.
One of them is that you can call max() directly on the GroupBy object; you don't need the fancy indexing on all columns, since that's what it returns by default:
df = df.groupby('date').max()
Second is that you can disable sorting of the GroupBy. As the Pandas reference for groupby() says:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
So try that as well:
df = df.groupby('date', sort=False).max()
Third, you can also use a simple pivot_table() to produce the same result:
df = df.pivot_table(index='date', aggfunc='max')
Yet another approach is going back to your "actions" DataFrame, turning that into a MultiIndex and using it for a simple Series, then calling unstack() on it. That should get you the same result without the get_dummies() step (though I'm not sure whether this will drop some of the sparseness properties you're currently relying on).
actions_df = pd.DataFrame(data, columns = ['date', 'action'])
actions_index = pd.MultiIndex.from_frame(actions_df, names=['date', ''])
actions_series = pd.Series(1, index=actions_index)
df = actions_series.unstack(fill_value=0)
Your supplied sample DataFrame is quite useful for checking that these are all equivalent and produce the same result, but unfortunately not that great for benchmarking it... I suggest you take a larger dataset (but still smaller than your real data, like 10x smaller or perhaps 40-50x smaller) and then benchmark the operations to check how long they take.
If you're using Jupyter (or another IPython shell), you can use the %timeit command to benchmark an expression.
So you can enter:
%timeit df.groupby('date').max()
%timeit df.groupby('date', sort=False).max()
%timeit df.pivot_table(index='date', aggfunc='max')
%timeit actions_series.unstack(fill_value=0)
And compare results, then scale up and check whether the whole run will complete in an acceptable amount of time.
I have the following code right now:
import pandas as pd
df_area=pd.DataFrame({"area":["Coesfeld","Recklinghausen"],"factor":[2,5]})
df_timeseries=pd.DataFrame({"Coesfeld":[1000,2000,3000,4000],"Recklinghausen":[2000,5000,6000,7000]})
columns_in_timeseries=list(df_timeseries)
columns_to_iterate=columns_in_timeseries[0:]
newlist=[]
for i, k in enumerate(columns_to_iterate):
    new = df_area.loc[i, "factor"] * df_timeseries[k]
    newlist.append(new)
newframe=pd.DataFrame(newlist)
df1_transposed = newframe.T
The code multiplies each factor from an area with the time series from that area. In this example the loop steps through the rows of df_area and the columns of df_timeseries in lockstep, multiplying as it goes. In the next step I want to expand the df_area dataframe like the following:
df_area=pd.DataFrame({"area":["Coesfeld","Coesfeld","Recklinghausen","Recklinghausen"],"factor":[2,3,5,6]})
As you can see, I now have different factors for the same area. The goal is to advance to the next column in df_timeseries only when the area in df_area changes. My first idea was to use an if statement, but right now I have no idea how to realize that within the for loop.
I can't shake off the suspicion that there is something wrong about your whole approach. A first red flag is your use of wide format instead of long format – in my experience, that's probably going to cause you unnecessary trouble.
Be it as it may, here's a function that takes a data frame with time series data and a second data frame with multiplier values and area names as arguments. The two data frames use the same structure as your examples df_timeseries (area names as columns, time series values as cell values) and df_area (area name as values in column area, multiplier as value in column factor). I'm pretty sure that this is not a good way to organize your data, but that's up to you to decide.
What the function does is iterate through the rows of the second data frame (the df_area-like one). It uses the area value to select the correct series from the first data frame (the df_timeseries-like one) and multiplies this series by the factor value from that row. The results are collected in a list comprehension.
def do_magic(df1, df2):
return [df1[area] * factor for area, factor in zip(df2.area, df2.factor)]
You can insert this directly into your code to replace your loop:
df_area = pd.DataFrame({"area": ["Coesfeld", "Recklinghausen"],
                        "factor": [2, 5]})
df_timeseries = pd.DataFrame({"Coesfeld": [1000, 2000, 3000, 4000],
                              "Recklinghausen": [2000, 5000, 6000, 7000]})
newlist = do_magic(df_timeseries, df_area)
newframe = pd.DataFrame(newlist)
df1_transposed = newframe.T
It also works with your expanded df_area. The resulting list will consist of four series (two for Coesfeld, two for Recklinghausen).
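For concreteness, here is a small sketch of the same call with the expanded df_area from the question, reusing the names defined above:

```python
# Expanded df_area: two factors per area, as in the question
df_area_expanded = pd.DataFrame({
    "area": ["Coesfeld", "Coesfeld", "Recklinghausen", "Recklinghausen"],
    "factor": [2, 3, 5, 6],
})

newlist = do_magic(df_timeseries, df_area_expanded)   # four series: two per area
newframe = pd.DataFrame(newlist)
df1_transposed = newframe.T   # time steps as rows, four columns (area names repeat)
print(df1_transposed)
```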
My database structure is such that I have units that belong to several groups and have different variables (I focus on one, X, for this question). Then we have year-based records, so the database looks like this:
   unitid  groupid  year   X
0       1        1  1990   5
1       2        1  1990   2
2       2        1  1991   3
3       3        2  1990  10
etc. Now what I would like to do is measure some "intensity" variable, that is going to be the number of units per group and year, and I would like to put it back into the database.
So far, I am doing
asd = df.drop_duplicates(subset=['unitid', 'year'])
groups = asd.groupby(['year', 'groupid'])
intensity = groups.size()
And intensity then looks like
year  groupid
1961  2000       4
      2030       3
      2040       1
      2221       1
      2300       2
However, I don't know how to put them back into the old dataframe. I can access them through intensity[0], but intensity.loc() gives a LocIndexer not callable error.
Secondly, it would be very nice if I could scale intensity. Instead of "units per group-year", it would be "units per group-year, scaled by the average/max units per group-year in that year". If (t, g) denotes a year-group cell, that would be:
relativeIntensity(t, g) = intensity(t, g) / mean(intensity(t, g') over all groups g' in year t)
That is, if my simple intensity variable (for time and group) is called intensity(t, g), I would like to divide it by the mean intensity across all groups within the same year, if this pseudo-code helps at all in making myself clear.
Thanks!
Update
Just putting the answer here (explicitly) for readability. The first part was solved by
intensity = intensity.reset_index()
df['intensity'] = intensity[0]
It's a multi-index. You can reset the index by calling .reset_index() on your resulting dataframe. Or you can disable it when you compute the group-by operation, by specifying as_index=False in the groupby(), like:
intensity = asd.groupby(["year", "groupid"], as_index=False).size()
As to your second question, I'm not sure what you mean in Instead of "units per group-year", it would be "units per group-year, scaled by average/max units per group-year in that year".. If you want to compute "intensity" by intensity / mean(intensity), you can use the transform method, like:
asd.groupby(["year", "groupid"])["X"].transform(lambda x: x / x.mean())
Is this what you're looking for?
Update
If you want to compute intensity / mean(intensity), where mean(intensity) is based only on the year and not year/groupid subsets, then you first have to create the mean(intensity) based on the year only, like:
intensity["mean_intensity_only_by_year"] = intensity.groupby(["year"])["X"].transform(mean)
And then compute the intensity / mean(intensity) for all year/groupid subset, where the mean(intensity) is derived only from year subset:
intensity["relativeIntensity"] = intensity.groupby(["year", "groupid"]).apply(lambda x: pd.DataFrame(
{"relativeIntensity": x["X"] / x["mean_intensity_only_by_year"] }
))
Maybe this is what you're looking for, right?
Actually, days later, I found out that the first answer to this double question was wrong. Perhaps someone can elaborate on what .size() actually does, but this is here just in case someone who googles this question would otherwise follow my wrong path.
It turned out that .size() had far fewer rows than the original object (even when I used reset_index()), and however I tried to stack the sizes back into the original object, a lot of rows were left with NaN. The following, however, works:
groups = asd.groupby(['year', 'groupid'])
intensity = groups.apply(lambda x: len(x))
asd.set_index(['year', 'groupid'], inplace=True)
asd['intensity'] = intensity
Alternatively, one can do
groups = asd.groupby(['fyearq' , 'sic'])
# change index to save groupby-results
asd= asd.set_index(['fyearq', 'sic'])
asd['competition'] = groups.size()
And the second part of my question is answered through
# relativeSize
def computeMeanInt(group):
    group = group.reset_index()
    # every group has exactly one weight in the mean:
    sectors = group.drop_duplicates(subset=['group'])
    n = len(sectors)
    val = sum(sectors.competition)
    return float(val) / n
result = asd.groupby(level=0).apply(computeMeanInt)
asd= asd.reset_index().set_index('fyearq')
asd['meanIntensity'] = result
# if you don't reset index, everything crashes (too intensive, bug, whatever)
asd.reset_index(inplace=True)
asd['relativeIntensity'] = asd['intensity']/asd['meanIntensity']