I want to apply .agg pandas operations to a huge dataset
As an example, I have this code:
from tqdm import tqdm
import pandas as pd
df = pd.DataFrame({"A":[1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
"B":[1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
"C":[1.0, 1.5, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0],
"D":[2.0, 5.0, 3.0, 6.0, 4.0, 2.0, 5.0, 1.0, 2.0],
"E":['a', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'a']})
df2 = df.groupby('B').agg({
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: x.mode()
})
print(df2)
The problem is that my original dataset has 2,000,000 rows. Aggregating it down to 130,000 rows takes a few minutes, and I would like to see a progress bar.
I've tried with tqdm but I don't know how to apply it here. Is there any function similar to .progress_apply() but for .agg()?
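For .apply I know the pattern below (a minimal sketch, assuming tqdm's pandas integration), but I don't see an equivalent hook for .agg:
from tqdm import tqdm
tqdm.pandas()  # registers .progress_apply on pandas objects, including groupby
# this shows a bar for apply, but there is no .progress_agg counterpart
df.groupby('B').progress_apply(lambda g: g['C'].mean())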
This will print the progress as you go, where progress is measured by the fraction of the groups for which statistics are computed. But I'm not sure how much the loop will slow down your computations.
import numpy as np

agger = {
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: x.mode()}

gcols = ['B']  # columns defining the groups
groupby = df.groupby(gcols)
ngroups = len(groupby)
gfrac = 0.3  # fraction of groups after which you want to print progress
gfrac_size = max((1, int(ngroups * gfrac)))
groups = []
rows = []
for i, g in enumerate(groupby):
    if (i + 1) % gfrac_size == 0:
        print('{:.0f}% complete'.format(100 * (i + 1) / ngroups))
    gstats = g[1].agg(agger)
    if i == 0:
        if gstats.ndim == 2:
            newcols = gstats.columns.tolist()
        else:
            newcols = gstats.index.tolist()
    groups.append(g[0])
    rows.append(gstats.values.flat)

df3 = pd.DataFrame(np.vstack(rows), columns=newcols)
if len(gcols) == 1:
    df3.index = groups
else:
    df3.index = pd.MultiIndex.from_tuples(groups, names=gcols)
df3 = df3.astype(df[newcols].dtypes)
df3
       C     D  E
1.0  1.5  10.0  a
2.0  3.0  12.0  b
3.0  7.0   8.0  a
An alternative (though somewhat hacky) way would be to take advantage of the fact that you use your own function, lambda x: x.mode(). Since you're already compromising on speed by using this function, you can write a class that stores information about progress. For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
"B":[1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
"C":[1.0, 1.5, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0],
"D":[2.0, 5.0, 3.0, 6.0, 4.0, 2.0, 5.0, 1.0, 2.0],
"E":['a', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'a']})
class ModeHack:
    def __init__(self, size=5, N=10):
        self.ix = 0       # rows processed so far
        self.K = 1        # next progress milestone
        self.size = size  # rows between progress printouts
        self.N = N        # total number of rows

    def mode(self, x):
        self.ix = self.ix + x.shape[0]
        if self.K * self.size <= self.ix:
            print('{:.0f}% complete'.format(100 * self.ix / self.N))
            self.K += 1
        return x.mode()

    def reset(self):
        self.ix = 0
        self.K = 1
mymode = ModeHack(size=int(.1*df.shape[0]), N=df.shape[0])
mymode.reset()
agger = {
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: mymode.mode(x)}
df3 = df.groupby('B').agg(agger)
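In the same spirit, the custom aggregator can drive a real tqdm bar instead of print statements, one tick per group. A sketch (the name mode_with_progress is mine, and I'm assuming df.groupby('B').ngroups gives the right total):
from tqdm import tqdm

pbar = tqdm(total=df.groupby('B').ngroups)  # one tick per group

def mode_with_progress(x):
    pbar.update(1)  # called once per group, for column E only
    return x.mode()

df3 = df.groupby('B').agg({'C': 'mean', 'D': 'sum', 'E': mode_with_progress})
pbar.close()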
Related
I have a Pandas DataFrame:
import pandas as pd
df = pd.DataFrame([[0.0, 2.0, 0.0, 0.0, 5.0, 6.0, 7.0],
                   [1.0, 0.0, 1.0, 3.0, 0.0, 0.0, 7.0],
                   [0.0, 0.0, 13.0, 14.0, 0.0, 16.0, 0.0]],
                  columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'])
     A    B     C     D    E     F    G
0  0.0  2.0   0.0   0.0  5.0   6.0  7.0
1  1.0  0.0   1.0   3.0  0.0   0.0  7.0
2  0.0  0.0  13.0  14.0  0.0  16.0  0.0
And I would like to save it as an .xlsx file, with the first and last non-zero values in each row marked in color. (In my example output I removed the index column, i.e. the first column.)
# import dependencies
import pandas as pd
import openpyxl
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter
# data
df = pd.DataFrame([[0.0, 2.0, 0.0, 0.0, 5.0, 6.0, 7.0],
                   [1.0, 0.0, 1.0, 3.0, 0.0, 0.0, 7.0],
                   [0.0, 0.0, 13.0, 14.0, 0.0, 16.0, 0.0]],
                  columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

first_and_last_non_zeroes_index = []
for index, row in df.iterrows():
    # all non-zero indices in a row (assumes each row has at least one non-zero)
    non_zeroes_index = [i for i, x in enumerate(row) if x > 0]
    # append the first and last non-zero index in the row to the list
    first_and_last_non_zeroes_index.append([non_zeroes_index[0], non_zeroes_index[-1]])
# output to excel
df.to_excel('output.xlsx', index=False)
# open excel
wb = openpyxl.load_workbook("output.xlsx")
ws = wb['Sheet1']
# set the color
fill_cell = PatternFill(patternType='solid', fgColor='ffff00')

# color the appropriate cells (openpyxl is 1-indexed; +2 skips the header row)
for index, row in enumerate(first_and_last_non_zeroes_index):
    for col in row:
        ws[f'{get_column_letter(col + 1)}{index + 2}'].fill = fill_cell

# save output
wb.save("output.xlsx")
I have a bivariate distribution below that is generated from the xy points in 'Int_1', 'Int_2' for each Group. The aim is to use these points to return a multivariate distribution between the Groups. I then want to normalise the distribution value via Norm so the z-value ranges between 0 and 1. Looking at the z-values now via the colorbar, they vary between 0.24 and 0.72.
In a previous question, it was mentioned that I'm not actually returning a multivariate distribution, but rather a ratio of probabilities between the two groups.
import pandas as pd
import numpy as np
from scipy.stats import multivariate_normal as mvn
import matplotlib.pyplot as plt
from scipy.interpolate import RectBivariateSpline
df = pd.DataFrame({'Int_1': [1.0, 2.0, 1.0, 3.0, 1.0, 2.0, 3.0, 2.0],
                   'Int_2': [1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0],
                   'Item_X': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
                   'Item_Y': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
                   'Period': [1, 1, 1, 1, 2, 2, 2, 2],
                   'Group': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'Item': ['Y', 'Y', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'id': ['1', '2', '3', '4', '1', '2', '3', '4']})
Group_A = [df[df['Group'] == 'A'][['Int_1','Int_2']].to_numpy()]
Group_B = [df[df['Group'] == 'B'][['Int_1','Int_2']].to_numpy()]
Item = [df[['Item_X','Item_Y']].to_numpy()]
period = df['Period'].drop_duplicates().reset_index(drop = True)
def bivart_func(member_no, location, time_index, group):
    if group == 'A':
        data = Group_A.copy()
    elif group == 'B':
        data = Group_B.copy()
    else:
        return

    if np.all(np.isfinite(data[member_no][[time_index, time_index + 1], :])) \
            & np.all(np.isfinite(Item[0][time_index, :])):
        sxy = (data[member_no][time_index + 1, :] - data[member_no][time_index, :]) \
              / (period[time_index + 1] - period[time_index])
        mu = data[member_no][time_index, :] + sxy * 0.5
        out = mvn.pdf(location, mu) / mvn.pdf(data[member_no][time_index, :], mu)
    else:
        out = np.zeros(location.shape[0])
    return out
xx,yy = np.meshgrid(np.linspace(-10,10,200),np.linspace(-10,10,200))
Z_GA = np.zeros(40000)
Z_GB = np.zeros(40000)
for k in range(1):
    Z_GA += bivart_func(k, np.c_[xx.flatten(), yy.flatten()], 0, 'A')
    Z_GB += bivart_func(k, np.c_[xx.flatten(), yy.flatten()], 0, 'B')
fig, ax = plt.subplots(figsize=(8,8))
ax.set_xlim(-10,10)
ax.set_ylim(-10,10)
Z_GA = Z_GA.reshape((200,200))
Z_GB = Z_GB.reshape((200,200))
Norm = xx,yy, 1 / (1 + np.exp(Z_GB - Z_GA))
cfs = ax.contourf(*Norm, cmap = 'magma')
ax.scatter(Item[0][1,0],Item[0][1,1], color = 'white', edgecolor = 'black')
fig.colorbar(cfs, ax = ax)
#f = RectBivariateSpline(xx[0, :], yy[:, 0], Norm)
#z = f(df['Item_X'], df['Item_Y'], grid = False)
Is this what you expect? Min-max scaling maps the z-values onto the full [0, 1] range:
Z = Z_GB - Z_GA
Norm = xx,yy, (Z - np.min(Z)) / (np.max(Z) - np.min(Z))
>>> np.min(Norm[2])
0.0
>>> np.max(Norm[2])
1.0
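Note the design difference between the two transforms: min-max scaling guarantees the extremes land exactly at 0 and 1, whereas the logistic transform 1 / (1 + np.exp(Z_GB - Z_GA)) compresses values toward the middle, which is presumably why your colorbar only spanned 0.24-0.72.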
A sample of a much larger DataFrame I'm working on is below.
import pandas as pd
data = {"Trial": ['Trial_1', 'Trial_2', 'Trial 3', 'Trial 4'], "Results" : [[['a', 11.0, 1, 1.0], ['b', 12.0, 0, 6.0], ['c', 2.6, 0, 3.0]], [['d', 7.3, 1, 8.0], ['e', 13.0, 0, 5.0], ['f', 8.6, 0, 3.0]],
[['g', 9.1, 1, 1.0], ['h', 10.0, 0, 7.0], ['i', 95.6, 0, 5.0]], [['j', 6.6, 1, 1.0], ['k', 14.0, 0, 3.0], ['l', 8.1, 0, 9.0]]]}
df = pd.DataFrame(data)
2 Queries
I want to filter df to display only the rows where the Results column's list of lists contains an item with the value 1 at index 2 and a value other than 1 at index 3. In the example, this filter would show only Trial_2, since the item ['d', 7.3, 1, 8.0] has 1 at index 2 but 8.0 at index 3.
The desired output after filtering is below:
Index Trial Results
1 Trial_2 [[d, 7.3, 1, 8.0], [e, 13.0, 0, 5.0], [f, 8.6, 0, 3.0]]
How would I then drop the rows where the condition stated in Query 1 is True? The DataFrame would then have Trial_2 dropped, and the output would be:
Index Trial Results
0 Trial_1 [[a, 11.0, 1, 1.0], [b, 12.0, 0, 6.0], [c, 2.6, 0, 3.0]]
2 Trial 3 [[g, 9.1, 1, 1.0], [h, 10.0, 0, 7.0], [i, 95.6, 0, 5.0]]
3 Trial 4 [[j, 6.6, 1, 1.0], [k, 14.0, 0, 3.0], [l, 8.1, 0, 9.0]]
I have a list comprehension below that outputs the individual items where the condition is True, but I'm not sure how to apply it as a filter on df, nor how to use it as a condition for dropping rows.
[place for places in df['Results'] for place in places if place[2] == 1 and place[3] != 1]
The function below collects the indices of the rows that meet your conditions; the list of indices then gives you either a DataFrame that matches the criteria or a DataFrame with those rows removed. It uses apply() on each row and iterates through the list of lists, breaking out as soon as one inner list matches so the remaining lists aren't checked needlessly.
idxs = []  # for collecting indices

def loop_results(x):
    for res in x['Results']:
        if res[2] == 1 and res[3] != 1:
            idxs.append(x.name)  # here, .name is the row's index value
            break                # one match is enough; skip the remaining lists

df_temp = df.apply(loop_results, axis=1)  # apply the function to each row
idxs = list(set(idxs))                    # set() would remove duplicates, if any
df_match = df.loc[idxs]                   # rows matching the criteria
df_unmatched = df.drop(idxs, axis=0)      # drops rows matching the criteria
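A more compact variant of the same idea, as a sketch (the mask-based phrasing is mine, not part of the original answer): build a boolean mask with any() over the inner lists instead of collecting indices.
mask = df['Results'].apply(lambda places: any(p[2] == 1 and p[3] != 1 for p in places))
df_match = df[mask]      # Query 1: rows meeting the condition
df_dropped = df[~mask]   # Query 2: rows with the condition dropped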
You can use apply on Results with the query_check function below, which you can further modify if the filtering logic changes.
import pandas as pd
data = {"Trial": ['Trial_1', 'Trial_2', 'Trial 3', 'Trial 4'], "Results" : [[['a', 11.0, 1, 1.0], ['b', 12.0, 0, 6.0], ['c', 2.6, 0, 3.0]], [['d', 7.3, 1, 8.0], ['e', 13.0, 0, 5.0], ['f', 8.6, 0, 3.0]],
[['g', 9.1, 1, 1.0], ['h', 10.0, 0, 7.0], ['i', 95.6, 0, 5.0]], [['j', 6.6, 1, 1.0], ['k', 14.0, 0, 3.0], ['l', 8.1, 0, 9.0]]]}
df = pd.DataFrame(data)
def query_check(inp):
    # inp is a one-column row, so inp.values[0] is the list of lists
    for lst in inp.values[0]:
        if isinstance(lst, list) and lst[2] == 1 and lst[3] != 1:
            return True
    return False
df['Flag'] = df[['Results']].apply(query_check,axis=1)
Once you have the Flag column created, you can filter further:
Query - 1
>>> df[df['Flag'] == True]
Trial Results Flag
1 Trial_2 [[d, 7.3, 1, 8.0], [e, 13.0, 0, 5.0], [f, 8.6,... True
Query - 2
>>> df[df['Flag'] != True]
Trial Results Flag
0 Trial_1 [[a, 11.0, 1, 1.0], [b, 12.0, 0, 6.0], [c, 2.6... False
2 Trial 3 [[g, 9.1, 1, 1.0], [h, 10.0, 0, 7.0], [i, 95.6... False
3 Trial 4 [[j, 6.6, 1, 1.0], [k, 14.0, 0, 3.0], [l, 8.1,... False
I am really new to Python and I am trying to find the average of a list of lists.
I have a list of lists of float numbers that indicate the grades of courses per semester and looks like this:
mylist = [[[2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0]], [[2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0], [2.67, 2.67, 2.0, 2.0]]]
What I want to do is find the average of each sublist and place it in a sublist again so it is easier to access. For example, I want the following:
myaverage= [[[2.335],[2.335],[2.335],...]]]
The repeated numbers are not on purpose; it just happens in the part of the list I am showing you. I tried to do this:
for s in mylist:  # for each list
    gpa = sum(s) / len(s)
    allGPA.append(gpa)
    for x in s:  # for each sublist
        x_ = x if type(x) is list else [x]
        myaverage.append(sum(x_) / float(len(x_)))
but I am getting this error:
gpa = sum(s) / len(s)
TypeError: unsupported operand type(s) for +: 'int' and 'list'
I can't tell whether my approach is completely wrong or whether I am looping through the list incorrectly.
Give this a try:
from statistics import mean
avg = [[mean(sub_list) for sub_list in inner] for inner in mylist]
If the syntax looks a little confusing, have a look at list comprehensions.
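If you want each average wrapped in its own sublist, as in your myaverage example, a small tweak does it (a sketch, following the shape you described):
avg = [[[mean(sub_list)] for sub_list in inner] for inner in mylist]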
I think it would be prudent to hold your data in some sort of collection; let's use a dictionary and create a readable function to parse your data.
Function
from collections import defaultdict

def return_averages(gpa_lists):
    """Take a list of lists and return a dictionary of averages,
    keyed by the position of each sublist."""
    gpa_dict = {number_of_list: outer_list for number_of_list, outer_list in enumerate(gpa_lists)}
    gpa_averages = defaultdict(list)
    for list_number, lists in gpa_dict.items():
        for each_list in lists:
            gpa_averages[list_number].append(sum(each_list) / len(each_list))
    return gpa_averages
Usage:
return_averages(mylist)
defaultdict(list,
{0: [2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335],
1: [2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335,
2.335]})
Check this out, I have updated my answer; the output is as you want it.
allGPA = []
# build a fresh structure rather than aliasing mylist
myaverage = [[None] * len(semester) for semester in mylist]
for c, semester in enumerate(mylist):
    for i, grades in enumerate(semester):
        gpa = sum(grades) / len(grades)  # a new value each iteration, not a shared list
        allGPA.append(gpa)
        myaverage[c][i] = [gpa]          # wrap in a list to match the desired shape
        print(myaverage[c][i])
print(myaverage)
Let's say I have the following list of dict
t = [{'a': 1.0, 'b': 2.0},
{'a': 3.0, 'b': 4.0},
{'a': 5.0, 'b': 6.0},
{'a': 7.0, 'b': 9.0},
{'a': 9.0, 'b': 0.0}]
Is there an efficient way to extract all the values stored in the dictionaries under the key 'a'?
So far I have come up with the following solution
x = []
for j in t:
    x.append(j['a'])
However, I don't like looping over items explicitly and was looking for a nicer way to achieve this goal.
You can use list comprehension:
t = [{'a': 1.0, 'b': 2.0},
{'a': 3.0, 'b': 4.0},
{'a': 5.0, 'b': 6.0},
{'a': 7.0, 'b': 9.0},
{'a': 9.0, 'b': 0.0}]
new_list = [i["a"] for i in t]
Output:
[1.0, 3.0, 5.0, 7.0, 9.0]
If you'd rather not use a for-loop at all, you can use map instead:
x = list(map(lambda x: x["a"], t))
Output:
[1.0, 3.0, 5.0, 7.0, 9.0]
Performance-wise, though, you should prefer the list-comprehension solution to the map one:
>>> timeit('new_list = [i["a"] for i in t]', setup='from __main__ import t', number=10000000)
4.318223718035199
>>> timeit('x = list(map(lambda x: x["a"], t))', setup='from __main__ import t', number=10000000)
16.243124993163093
def temp(p):
    return p['a']
>>> timeit('x = list(map(temp, t))', setup='from __main__ import t, temp', number=10000000)
16.048683850689343
There is only a slight difference between using a lambda and a regular function; either way, the comprehension executes in about a quarter of the time.
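The per-element function call is the likely culprit: the comprehension inlines the subscription i["a"], while map has to invoke a Python-level callable for every item, and that call overhead dominates at this size.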
You can use itemgetter:
from operator import itemgetter
t = [{'a': 1.0, 'b': 2.0},
{'a': 3.0, 'b': 4.0},
{'a': 5.0, 'b': 6.0},
{'a': 7.0, 'b': 9.0},
{'a': 9.0, 'b': 0.0}]
print(list(map(itemgetter('a'), t)))
result:
[1.0, 3.0, 5.0, 7.0, 9.0]
Use a list comprehension as suggested in Ajax1234's answer, or even a generator expression if that would benefit your use case:
t = [{'a': 1.0, 'b': 2.0}, {'a': 3.0, 'b': 4.0}, {'a': 5.0, 'b': 6.0}, {'a': 7.0, 'b': 9.0}, {'a': 9.0, 'b': 0.0}]
x = (item["a"] for item in t)
print(x)
Output:
<generator object <genexpr> at 0x7f0027def550>
The generator has the advantage of not executing or consuming memory until a value is needed. Use next() to take the next item from the generator, or iterate over it with a for loop.
>>> next(x)
1.0
>>> next(x)
3.0
>>> for n in x:
...     print(n)
5.0
7.0
9.0
An alternative, albeit an expensive one, is to use pandas:
import pandas as pd
x = pd.DataFrame(t)['a'].tolist()
print(x)
Output:
[1.0, 3.0, 5.0, 7.0, 9.0]