GroupBy + Condition + Mean() - python

Suppose we have three columns, A, B, and C. I need to group by "A", but only for rows where B is in the range B > 0 and B < 20, and then calculate the mean of C for that subset.
Can you help me?
Thanks very much!

Try this:
import pandas as pd
data = pd.read_csv('rows.csv')
temp = []
for val in data['PPV']:
    if val < 20:
        temp.append(1)
    elif val < 40:  # covers 20 <= val < 40; the original `20 < val` dropped val == 20 into the last bucket
        temp.append(2)
    else:
        temp.append(3)
data['temp'] = temp
output = data.groupby(['Responsable', 'temp'])['Yield'].mean()
print(output)
You should customize it to your own column names. You can also do this more elegantly with numpy.digitize.
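For the question as originally asked (group by A, restrict B to 0 < B < 20, then take the mean of C), no binning column is needed at all: filter first, then group. A minimal sketch with made-up data using the A/B/C column names from the question:

```python
import pandas as pd

# Hypothetical data with the column names from the question.
df = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y'],
    'B': [5, 25, 10, 15],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Filter rows where 0 < B < 20, then group by A and take the mean of C.
result = df[(df['B'] > 0) & (df['B'] < 20)].groupby('A')['C'].mean()
print(result)
```

Here the row with B = 25 is excluded before grouping, so group 'x' averages only its remaining row.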

Related

Python- I have a list of 9 DataFrames I want concatenate each 3 DataFrames

Input
mydfs= [df1,df2,df3,df4,df5,df6,df7,df8,df9]
My Code
import pandas as pd
df_1 = pd.concat([mydfs[0],mydfs[1],mydfs[2]])
df_m = df_1.merge(mydfs[2])
df_2 = pd.concat([mydfs[3],mydfs[4],mydfs[5]])
df_m1 = df_2.merge(mydfs[5])
df_3 = pd.concat([mydfs[6],mydfs[7],mydfs[8]])
df_m2 = df_3.merge(mydfs[8])
But I want to write this dynamically instead of doing it manually. Is it possible with a for loop? The list of data frames may grow in the future.
You can use a dictionary comprehension:
N = 3
out_dfs = {f'df_{i//N+1}': pd.concat(mydfs[i:i+N])
           for i in range(0, len(mydfs), N)}
output:
{'df_1': <concatenation result of ['df1', 'df2', 'df3']>,
'df_2': <concatenation result of ['df4', 'df5', 'df6']>,
'df_3': <concatenation result of ['df7', 'df8', 'df9']>,
}
You can use a loop with globals() to iterate through mydfs and define two "kth" variables on each round:
i = 0
k = 1
while i < len(mydfs):
    globals()["df_{}".format(k)] = pd.concat([mydfs[i], mydfs[i+1], mydfs[i+2]])
    globals()["df_m{}".format(k)] = globals()["df_{}".format(k)].merge(mydfs[i+2])
    i = i + 3
    k = k + 1

Count matching combinations in a pandas dataframe

I need to find a more efficient solution for the following problem:
Given a dataframe with 4 variables in each row, I need to find the list of 8 elements that covers all four variables of a row for the maximum number of rows.
A working, but very slow, solution is to create a second dataframe containing all possible combinations (basically permutations without repetition), then loop through every combination and compare it with the initial dataframe. The number of matches is counted and added to the second dataframe.
import numpy as np
import pandas as pd
from itertools import combinations
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)
listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))
possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns = ['M','N','O','P','Q','R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = ''
for x, row in dfcombi.iterrows():
    comparelist = row['List'].split(',')
    pointercounter = df.index[(df['A'].isin(comparelist)) & (df['B'].isin(comparelist)) & (df['C'].isin(comparelist)) & (df['D'].isin(comparelist))].tolist()
    dfcombi.loc[x, 'Count'] = len(pointercounter)  # assigning to `row` would not write back into dfcombi
I assume there must be a way to avoid the for - loop and replace it with some pointer, i just can not figure out how.
Thanks!
Your code can be rewritten as:
# working with integers is much faster than strings
codes, uniques = df.stack().factorize()

# each row of df encoded as a set of integer codes
s = [set(x) for x in codes.reshape(-1, 4)]

# possible combinations
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(uniques)), 6)])

# count the combinations with issubset
ret = [0] * len(possiblecombinations)
for a, (i, b) in product(s, enumerate(possiblecombinations)):
    ret[i] += a.issubset(b)

# the combination with the maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# in codes: {0, 3, 4, 5, 17, 18}
# and in values:
uniques[list(max_combination)]
# Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')
All that took about 2 seconds, as opposed to your code, which took around 1.5 minutes.
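The core trick above is that factorize maps each distinct string to a small integer, so rows become sets of integers and "row is covered by candidate" becomes a cheap issubset check. A tiny sketch of just that encoding step, on hypothetical two-column data:

```python
import pandas as pd

# Hypothetical data: two rows, two columns of string labels.
df = pd.DataFrame({'A': ['x1', 'x2'], 'B': ['x2', 'x3']})

# factorize maps each distinct string to a small integer;
# `uniques` lets you translate codes back into the original labels.
codes, uniques = df.stack().factorize()

# one set of integer codes per original row
rows = [set(r) for r in codes.reshape(-1, 2)]

# row 0 is ('x1', 'x2') -> codes {0, 1}, a subset of the candidate {0, 1, 2}
print(rows[0].issubset({0, 1, 2}))
```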

Using previous row value by looping through index conditioning

If I have a dataframe with column x, I want to make a new column x_new, but with the first row of this new column set to a specific number (say -2).
Then, from the 2nd row on, use the previous row's value in the cx function:
data = {'x': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1, 'x_new'] == 0:
        df.loc[1, 'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = cx(df['x'])
The final dataframe
I am not sure on how to do this.
Thank you for your help
This is what I have so far:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3*d + 1  # d must be previous row
    return z

depth_cal = depth_cal(df['depth'])  # how to set d as previous row
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # does not put list in a column

df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. it does not put the depth_cal list properly in a column
2. in the depth_cal function, I want d to be the previous row
Thank you
I would do this by just using a loop to generate your new data. It might not be ideal if the data is particularly large, but it's a quick operation. Let me know how you get on with this:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

res = data['depth'].copy()  # copy so the original list is not mutated
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1

df['new_depth'] = res
print(df)
To get
depth new_depth
0 1 -5.63
1 2 17.89
2 3 -52.67
3 4 159.01
4 5 -476.03
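The same recurrence can also be written with itertools.accumulate (Python 3.8+ for the `initial` argument), which feeds each computed value back in as the "previous row". A sketch using the seed and formula from the answer above:

```python
import pandas as pd
from itertools import accumulate

df = pd.DataFrame({'depth': [1, 2, 3, 4, 5]})

# accumulate yields the initial seed, then repeatedly applies the function;
# the column values themselves are ignored because the recurrence only
# uses the previous result, so we drop the extra trailing value with [:-1].
df['new_depth'] = list(accumulate(df['depth'],
                                  lambda prev, _: -3 * prev + 1,
                                  initial=-5.63))[:-1]
print(df)
```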

Translating to R which function and cl classification to pandas library

I am trying to translate this:
n <- NROW(train)
s <-which(train$cl[-n] == state)
I know that which is just a comparison so I believe in pandas I could just do:
n = train.count()
s = train['-n'] == state
I am really not sure how to translate cl in R to pandas
thanks!
If you need the size of the DataFrame, use:
n = len(train)
Or:
n = len(train.index)
Or:
n = train.shape[0]
The second line is not an equivalent, though: train['-n'] would look up a column literally named '-n'. In R, train$cl[-n] drops the n-th (here the last) element of the cl column, and which returns the matching positions, so in pandas/numpy it would be:
s = np.where(train['cl'].iloc[:-1] == state)[0]
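A runnable sketch of the translation, using a hypothetical stand-in for the R data frame `train` with a `cl` column (note that R's which is 1-based while numpy positions are 0-based):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the R data frame `train`.
train = pd.DataFrame({'cl': ['a', 'b', 'a', 'b']})
state = 'a'

n = len(train)                                   # n <- NROW(train)
# which(train$cl[-n] == state): drop the last element, find matching positions
s = np.where(train['cl'].iloc[:-1] == state)[0]
print(n, s)
```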

Creating a sequence of dataframes

A quick question: I want to know if there is a way to create a sequence of data frames by setting a variable inside the name of a data frame. For example:
df_0 = pd.read_csv(file1, sep=',')
b = 0
x = 1
while (b == 0):
    df_+str(x) = pd.merge(df_+str(x-1), Source, left_on='R_Key', right_on='S_Key', how='inner')
    if Final_+str(x).empty != 'True':
        x = x + 1
    else:
        b = b + 1
Now when executed, this returns "can't assign to operator" for df_+str(x). Any idea how to fix this?
This is the right time to use a list (a sequence type in Python), so you can refer to exactly as many data frames as you need.
dfs = []
dfs.append(pd.read_csv(file1, sep=','))  # it is now dfs[0]
b = 0
x = 1
while (b == 0):
    dfs.append(pd.merge(dfs[x-1],
                        Source, left_on='R_Key',
                        right_on='S_Key', how='inner'))
    if Final[x].empty != 'True':
        x = x + 1
    else:
        b = b + 1
Now, you never define Final. You'll need to use the same trick there.
Not sure why you want to do this, but I think a clearer and more logical way is to create a dictionary with dataframe name strings as keys and your generated dataframes as values.
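A sketch of that dictionary idea, with small hypothetical stand-ins for file1 and Source (the real stopping condition would be the empty-merge check from the question; here the loop is capped at three rounds so the toy example terminates):

```python
import pandas as pd

# Hypothetical inputs standing in for file1 and Source in the question.
Source = pd.DataFrame({'S_Key': [1, 2], 'info': ['y', 'z']})
frames = {'df_0': pd.DataFrame({'R_Key': [1, 2, 3], 'val': list('abc')})}

# Build df_1, df_2, ... by name, stopping when a merge comes back empty.
for x in range(1, 4):
    merged = pd.merge(frames[f'df_{x-1}'], Source,
                      left_on='R_Key', right_on='S_Key', how='inner')
    if merged.empty:
        break
    frames[f'df_{x}'] = merged

print(sorted(frames))
```

Each "named" dataframe is then reachable as frames['df_1'], frames['df_2'], and so on, without touching globals().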
