Average a python dataframe column based on another column - python

I would like to take the average of col_b over the rows where the corresponding value in col_a is greater than 5.
With the code below, I get the error message:
TypeError: '>' not supported between instances of 'str' and 'int'
import pandas as pd
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [0.05, 0.05, 0.05, 0.04, 0.03, 0, 0, 0, 0, 0.03]
d = {'col_a': a, 'col_b': b}
df = pd.DataFrame(d)
x = df['col_a' > 5]['col_b'].mean()
print(x)

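df['col_a' > 5]
Python evaluates 'col_a' > 5 first, so this compares the string 'col_a' with the integer 5, which raises the TypeError before pandas ever looks up a column.
You meant to build a boolean mask from the column itself: df[df['col_a'] > 5]['col_b'].mean()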
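A minimal runnable sketch of the corrected filter, using the data from the question. The name mask is introduced here for illustration, and .loc is used as one common way to select rows and a column in a single step:

```python
import pandas as pd

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [0.05, 0.05, 0.05, 0.04, 0.03, 0, 0, 0, 0, 0.03]
df = pd.DataFrame({'col_a': a, 'col_b': b})

# Boolean Series: True for the rows where col_a is greater than 5
mask = df['col_a'] > 5

# Select col_b on those rows only, then average it
x = df.loc[mask, 'col_b'].mean()
print(x)  # roughly 0.006 (0.03 spread over five rows)
```

Chained indexing, df[mask]['col_b'].mean(), gives the same number here; .loc simply avoids building the intermediate frame.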

Related

What is the most efficient way to choose from a list of variables and generate a number combination that falls between a specific range with python?

I have a set of variables, each containing a list of chosen numbers:
var_1 = [0.5, 1, 2]
var_2 = [0.5, 1, 4, 7.5]
var_3 = [1, 1.5, 3.5, 4, 5.5, 10]
I would like to pick one number from each of the variables above and add them together, stopping at the first combination whose total falls within a specified winning range, such as:
winning_range = [15-20]
So the winner would be the first combination whose total falls between 15 and 20.
I would like to print the winning combination as a dictionary, with one key per variable plus a key holding the total of the chosen numbers:
{var_1 = 2, var_2= 7.5, var_3= 10, total= 19.5}
What would be the most efficient way to obtain this through python?
You can use a recursive generator function:
r = [15, 20]
var_1 = [0.5, 1, 2]
var_2 = [0.5, 1, 4, 7.5]
var_3 = [1, 1.5, 3.5, 4, 5.5, 10]
def combos(d, c = [], s = 0):
    if not d and r[0] <= s and r[-1] >= s:
        yield (c, s)
    elif d:
        for i in filter(lambda x:r[-1] >= x+s, d[0]):
            yield from combos(d[1:], c=c+[i], s=s+i)
print(list(combos([var_1, var_2, var_3])))
Output:
[([0.5, 7.5, 10], 18.0), ([1, 4, 10], 15), ([1, 7.5, 10], 18.5), ([2, 4, 10], 16), ([2, 7.5, 5.5], 15.0), ([2, 7.5, 10], 19.5)]
Here, at each recursive call, potential values from a var list are only included if the running sum plus the value does not exceed the maximum threshold, thus minimizing the total number of recursive calls needed. While list(combos([var_1, var_2, var_3])) loads all the possibilities into memory, you can use next to grab only the first result:
vals, total = next(combos([var_1, var_2, var_3]))
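Maybe not the most efficient, but pretty human readable:
var_1 = [0.5, 1, 2]
var_2 = [0.5, 1, 4, 7.5]
var_3 = [1, 1.5, 3.5, 4, 5.5, 10]
winning_range = [15, 20] # [min, max], both inclusive
result = {}
for v1 in var_1:
    for v2 in var_2:
        for v3 in var_3:
            total = v1 + v2 + v3
            # Keep only the first combination whose total is in range
            if not result and winning_range[0] <= total <= winning_range[1]:
                result = {"var_1": v1, "var_2": v2, "var_3": v3, "total": total}
print(result)
# Output
# {'var_1': 0.5, 'var_2': 7.5, 'var_3': 10, 'total': 18.0}
Since you asked for the first combination, this solution tries the combinations in order and keeps the first one whose total lands in the range. The bounds are compared with <= because a test such as total in range(15, 20) only matches whole numbers and would miss totals like 19.5.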
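If stopping as soon as a match is found matters, the same sequential search can be sketched with itertools.product, which visits combinations in the same order as nested loops. The function name first_winning_combo and the variables dictionary are made up for this illustration, and the range is assumed to be inclusive at both ends:

```python
from itertools import product

def first_winning_combo(lists, low, high):
    """Return the first combination (nested-loop order) whose sum is in [low, high]."""
    for combo in product(*lists.values()):
        total = sum(combo)
        if low <= total <= high:
            # Pair each variable name with its chosen value and add the total
            return {**dict(zip(lists, combo)), "total": total}
    return None  # no combination falls inside the range

variables = {
    "var_1": [0.5, 1, 2],
    "var_2": [0.5, 1, 4, 7.5],
    "var_3": [1, 1.5, 3.5, 4, 5.5, 10],
}
result = first_winning_combo(variables, 15, 20)
print(result)  # {'var_1': 0.5, 'var_2': 7.5, 'var_3': 10, 'total': 18.0}
```

Returning from the function ends the search immediately, and passing a dictionary lets the same code handle any number of variables.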

How to create a third vector from two existing vectors with looping and indexing?

I have two vectors (a, b) and want to create a third one (c) from them. All three should have the same length. Wherever a contains a zero (0), the value of b at that index should be added to the following values of b, up to and including the index where a is eleven (11), and that sum should be stored in c at that index. All other values of c should be zero (0).
a=[0, 11, 0, 11, 0, 0, 0, 11]
b=[1.1, 1.1, 9, 9, 9, 6.6, 6.6, 9]
So the vector c should look like:
c=[0, 2.2, 0, 18, 0, 0, 0, 31.2]
I wrote the following code, and it almost works for this case, except for the last value (the output there is 15.6). I also need something more efficient and more general, since a could contain more than three zeros in a row.
for w in range(0,len(a)):
    if a[w]==0 and a[w+1]!=0:
        c[w+1]=b[w]+b[w+1]
    elif a[w]==0 and a[w+1]==0 and a[w+2]!=0:
        c[w+2]=b[w]+b[w+1]+b[w+2]
    elif a[w]==0 and a[w+1]==0 and a[w+2]==0:
        c[w+3]=b[w]+b[w+1]+b[w+2]+b[w+3]
We can loop over a and b at the same time using zip.
We then keep a running sum of the values from b. When the value in a is 11, we write the running sum to the result and reset it; otherwise we write a 0.
a = [0, 11, 0, 11, 0, 0, 0, 11]
b = [1.1, 1.1, 9, 9, 9, 6.6, 6.6, 9]
culm = 0
result = []
for a_val, b_val in zip(a, b):
    culm += b_val
    if a_val == 11:
        result.append(culm)
        culm = 0
    else:
        result.append(0)
print(result) # [0, 2.2, 0, 18, 0, 0, 0, 31.2]
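For long vectors, a vectorized variant may be worth trying. The following is only a sketch using NumPy, which the original answer does not use: it takes the cumulative sum of b, reads it at the positions where a is 11, and subtracts consecutive readings to get the sum of each run. It assumes, like the loop above, that values after the last 11 are ignored.

```python
import numpy as np

a = np.array([0, 11, 0, 11, 0, 0, 0, 11])
b = np.array([1.1, 1.1, 9, 9, 9, 6.6, 6.6, 9])

csum = np.cumsum(b)             # running total of b over the whole vector
idx = np.flatnonzero(a == 11)   # positions where a run ends

c = np.zeros(len(b))
# Sum of each run = running total at its end minus running total at the previous end
c[idx] = np.diff(csum[idx], prepend=0.0)
print(c)
```

Because of floating-point rounding, the printed values may differ from 2.2, 18 and 31.2 in the last decimal places.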

Filter a pandas Dataframe based on specific month values and conditional on another column

I have a large dataframe with the following columns:
import pandas as pd
f = pd.DataFrame(columns=['month', 'family_id', 'house_value'])
Months go from 0 to 239, family_ids up to 10900, and house values vary, so the dataframe has more than two and a half million rows.
I want to keep only the families whose final house value differs from their initial one.
Some sample data would look like this:
f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})
And from that sample, the resulting dataframe would be:
g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})
So I thought of code along these lines:
ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]
Also tried this:
g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]
The code above raises KeyError: False and a ValueError. Any ideas? Thanks.
Use groupby.filter:
(f.sort_values('month')
  .groupby('family_id')
  .filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))
# family_id house_value month
#1 1 10 0
#6 1 11 1
#8 1 11 239
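As commented by @Bharath, your approach fails because boolean filtering expects the boolean series to have the same length as the original data frame, which is not true in either of your attempts, since each side of the comparison was already filtered down before being compared. In the first attempt, 'month'==239 also compares a string with a number and evaluates to the plain value False, which is where KeyError: False comes from.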
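Here is a self-contained version of the groupby.filter approach on the sample data from the question, followed by a possible alternative based on transform. The names changed, sorted_f, by_family and changed_fast are introduced for this sketch only, and the transform variant is a suggestion rather than part of the original answer:

```python
import pandas as pd

f = pd.DataFrame({
    'month': [0, 0, 0, 0, 0, 1, 1, 239, 239],
    'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1],
    'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11],
})

# Keep every row of a family whose last house value differs from its first
changed = (
    f.sort_values('month')
     .groupby('family_id')
     .filter(lambda g: g['house_value'].iat[-1] != g['house_value'].iat[0])
)
print(changed)

# Alternative: broadcast each family's first and last value to all of its rows,
# then compare them as two full-length columns
sorted_f = f.sort_values('month')
by_family = sorted_f.groupby('family_id')['house_value']
changed_fast = sorted_f[by_family.transform('first') != by_family.transform('last')]
```

On a frame with millions of rows and thousands of families, the transform version avoids calling a Python lambda once per group, which may make it faster; it is worth timing both on real data.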

Getting the indexes of a Dataframe after a numpy array function

I have a function that implements the k-means algorithm, and I want to use it with DataFrames so that the row indexes are taken into account. For the moment I pass DataFrame.values and it works, but I don't get the indexes back in the output.
import random
import numpy as np
def cluster_points(X, mu):
    # Assign each point to the key of its nearest center
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t:t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters
def reevaluate_centers(mu, clusters):
    # The new center of each cluster is the mean of its points
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis = 0))
    return newmu
def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))
def find_centers(X, K):
    # Initialize to K random centers
    # (list(X) because random.sample needs a sequence, which a NumPy array is not)
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return(mu, clusters)
For instance, with this minimal example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(10, 5)), index = list(range(10)), columns=list(range(5)))
df.index.name = 'subscriber_id'
df.columns.name = 'ad_id'
I get:
find_centers(df.values, 2)
([array([ 3.8, 3. , 3.6, 2. , 3.6]),
array([ 6.8, 3.6, 5.6, 6.8, 6.8])],
{0: [array([2, 0, 5, 6, 4]),
array([1, 1, 2, 3, 3]),
array([6, 0, 4, 0, 3]),
array([7, 9, 4, 1, 7]),
array([3, 5, 3, 0, 1])],
1: [array([6, 2, 5, 9, 6]),
array([8, 9, 7, 2, 8]),
array([7, 5, 3, 7, 8]),
array([7, 1, 5, 7, 6]),
array([6, 1, 8, 9, 6])]})
I have the values but don't have the indexes.
If you want to get the array of values including the index, you can simply add the index to the columns with reset_index():
values_with_index = df.reset_index().values
Update
If what you want is to have the index on the output, but not use it during the actual clustering, you can do the following. First, pass the actual data frame object to find_centers:
find_centers(df, 2)
Then change cluster_points as follows:
def cluster_points(X, mu):
    clusters = {}
    for _, x in X.iterrows():
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t:t[1])[0]
        # You can replace this try/except block with
        # clusters.setdefault(bestmukey, []).append(x)
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters
The centers in the output will still be arrays, but the clusters will contain series objects with each row. The name property of each of these series is the index value in the data frame.
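To illustrate how the name property recovers the index, here is a small sketch that stores only the index labels per cluster rather than the full rows. The function cluster_index_labels, the fixed centers and the toy data are all made up for this example; it covers only the assignment step, not the full k-means loop:

```python
import numpy as np
import pandas as pd

def cluster_index_labels(df, mu):
    """Map each center's position in mu to the index labels of its nearest rows."""
    labels = {}
    for _, row in df.iterrows():
        # Euclidean distance from this row to every center
        dists = [np.linalg.norm(row.to_numpy() - np.asarray(m)) for m in mu]
        best = int(np.argmin(dists))
        # row.name is the row's index label in the data frame
        labels.setdefault(best, []).append(row.name)
    return labels

df = pd.DataFrame([[0, 0], [1, 0], [9, 9], [10, 10]], index=[101, 102, 103, 104])
df.index.name = 'subscriber_id'

members = cluster_index_labels(df, mu=[[0, 0], [10, 10]])
print(members)  # {0: [101, 102], 1: [103, 104]}
```

With the labels in hand, df.loc[members[0]] retrieves the rows of the first cluster whenever the values are needed.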

How to design an agg function for pandas groupby

My DataFrame looks like this:
user,rating, f1,f2,f3,f4
20, 3, 0.1, 0, 3, 5
20, 4, 0.2, 3, 5, 2
18, 4, 0.6, 8, 7, 2
18, 1, 0.7, 9, 2, 7
I want to compute a profile for each user. For instance,
for user 20 it should be 3*[0.1,0,3,5]+4*[0.2,3,5,2],
which is a sum of f1 to f4 weighted by rating.
How should I write an agg function to complete this task?
df.groupby('user').agg(....)
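You can try this:
df.groupby('user').apply(lambda x : sum(x['rating'] * (x['f1']+x['f2']+x['f3']+x['f4'])))
Note that this adds f1 to f4 together before weighting, so it returns one number per user rather than one weighted value per feature.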
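If the profile should keep f1 to f4 separate, as in the 3*[0.1,0,3,5]+4*[0.2,3,5,2] example from the question, one possible sketch is to multiply the feature columns by rating first and then sum per user. The names features, weighted and profile are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'user':   [20, 20, 18, 18],
    'rating': [3, 4, 4, 1],
    'f1':     [0.1, 0.2, 0.6, 0.7],
    'f2':     [0, 3, 8, 9],
    'f3':     [3, 5, 7, 2],
    'f4':     [5, 2, 2, 7],
})

features = ['f1', 'f2', 'f3', 'f4']

# Scale every feature in a row by that row's rating
weighted = df[features].mul(df['rating'], axis=0)

# Add up the scaled rows for each user: one row per user, one column per feature
profile = weighted.groupby(df['user']).sum()
print(profile)
```

For user 20 this gives roughly [1.1, 12, 29, 23], which matches 3*[0.1,0,3,5]+4*[0.2,3,5,2] up to floating-point rounding.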
