I have a dataframe for which I would like to compute averages of one column grouped by another. I have the following dataframe:
Column 'A' contains repeated values but column 'B' does not. I would like to compute the average of the values in column 'B' for each repeated value in column 'A'. For example, the first value in column 'A' is 1 and its value in 'B' is 3; the next row where 'A' is 1 has 9 in 'B', the next 4, and so on. Then continue with 2, 3, etc.
I was thinking that if I could move those values into separate columns, then computing the average across columns would be easier, but I can't find a way to copy the values there. Maybe there is an easier way?
This is what I would like:
You can use groupby and mean():
df.groupby('A').B.mean()
As @fuglede mentioned,
df.groupby('A').mean()
would work as well, since B is the only column left to aggregate.
Either way you get
A
1 6.25
2 6.50
3 4.75
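Since the original dataframe isn't shown above, here is a minimal, self-contained sketch with made-up numbers illustrating the same pattern:

```python
import pandas as pd

# Hypothetical data: 'A' carries the repeated keys, 'B' the values to average
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [3, 9, 4, 9, 5, 6]})

# One mean per distinct value of 'A'
means = df.groupby('A')['B'].mean()
print(means)
# A
# 1    6.0
# 2    6.5
# 3    5.5
# Name: B, dtype: float64
```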
Related
I have 2 dataframes with numeric values. I want to compare each column (they have the same column names), and if they are all equal, execute a condition (add 10 points to a score). I have done it "manually", and it works, but I don't like it that way. Let me show you:
score=0
DATAFRAME 1
   Column A  Column B  Column C  Column D
0         1         2         0         1

DATAFRAME 2
   Column A  Column B  Column C  Column D
0         1         2         0         1
So, if the values of each column are equal, the score becomes score = score + 10:
score = 0
if (df1['Column A'][0] == df2['Column A'][0]) & (df1['Column B'][0] == df2['Column B'][0]) & (df1['Column C'][0] == df2['Column C'][0]) & (df1['Column D'][0] == df2['Column D'][0]):
    score = score + 10
I want to do this but optimized, like with a for loop or something similar. How could it be done? Thanks a lot.
Use pandas' equals function. Here is a simple implementation:
if df1.equals(df2):
    score += 10
else:
    print("Columns are not equal")
Compare the values element by element, then check whether every value in a row is True, sum to find how many rows satisfy the condition, and finally multiply by 10:
>>> df1.eq(df2).all(axis=1).sum() * 10
10
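Putting the single-row frames from the question together, the element-wise approach can be sketched end to end like this:

```python
import pandas as pd

# The two single-row frames from the question
df1 = pd.DataFrame({'Column A': [1], 'Column B': [2],
                    'Column C': [0], 'Column D': [1]})
df2 = pd.DataFrame({'Column A': [1], 'Column B': [2],
                    'Column C': [0], 'Column D': [1]})

# Element-wise equality -> rows where every cell matches -> 10 points per row
score = df1.eq(df2).all(axis=1).sum() * 10
print(score)  # 10
```

With multi-row frames this awards 10 points per fully matching row, generalizing the original single-row check.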
I have created a dataframe in Python, let's say:
testingdf = pd.DataFrame({'A':[1,2,1,2,1,2],
'B':[1,2,1,2,3,3],
'C':[9,8,7,6,5,6]})
Now I want to get the count of column 'C' grouped by 'A' and 'B'. For that I am performing:
testingdf.groupby(['A','B']).count()
to get:
C
A B
1 1 2
3 1
2 2 2
3 1
Now I want to get the sum of this count of 'C' with respect to 'A', like:
A C
1 3
2 3
After grouping by 'A' and 'B' I can select the resulting column and apply the sum aggregate function on it, so I wanted to know what the efficient way of doing this is.
Note: this sum is just an example; I also want to perform other aggregations, like getting the max and min of the count of 'C' with respect to 'A' after grouping by 'A' and 'B'.
P.S.: Sorry, I should have mentioned this earlier, but I don't want to use groupby twice. I want the most efficient way to get the results, even if that means not using groupby at all.
You can use the sum() method with the level parameter after groupby() + count() (note that sum(level=...) is deprecated in newer pandas versions in favour of the groupby form below):
out=testingdf.groupby(['A','B']).count().sum(level=0).reset_index()
OR
other way is groupby twice:
out=testingdf.groupby(['A','B']).count().groupby(level=0).sum().reset_index()
output for your given data:
   A  C
0  1  3
1  2  3
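As a runnable check with the question's data, using the version-safe double-groupby form:

```python
import pandas as pd

testingdf = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2],
                          'B': [1, 2, 1, 2, 3, 3],
                          'C': [9, 8, 7, 6, 5, 6]})

# Count C per (A, B) pair, then sum those counts per A
counts = testingdf.groupby(['A', 'B']).count()
out = counts.groupby(level=0).sum().reset_index()
print(out)
#    A  C
# 0  1  3
# 1  2  3
```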
I have a huge dataframe with 40 columns (10 groups of 4 columns), with values in some groups and NaN in others. I want all the values in each row left-shifted, so that wherever values are present in a row, the final dataframe is filled in the order Group 1 -> Group 2 -> Group 3 and so on.
Here is a sample dataframe:
And here is the required output:
I have used the code below to shift the values left. However, if a value is missing within an available group, e.g. Item 2 type-1 or Item 3 cat-2, this code ignores that and replaces it with the value to its right, and so on.
v = df1.values
# row index for every cell, so fancy indexing keeps rows aligned
a = [[n] * v.shape[1] for n in range(v.shape[0])]
# stable argsort of the NaN mask pushes NaNs to the right of each row
b = pd.isnull(v).argsort(axis=1, kind='mergesort')
df2 = pd.DataFrame(v[a, b], df1.index, df1.columns)
How to achieve this?
Thanks.
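For reference, the quoted shifting code runs as-is on a small made-up frame (column names are hypothetical). Row 1 shows exactly the problem described: the gap inside group 1 is filled from group 2.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-group frame with gaps
df1 = pd.DataFrame({'g1_a': [np.nan, 1.0], 'g1_b': [np.nan, np.nan],
                    'g2_a': [5.0, 3.0], 'g2_b': [6.0, 4.0]})

v = df1.values
# row index for every cell, so fancy indexing keeps rows aligned
a = [[n] * v.shape[1] for n in range(v.shape[0])]
# stable argsort of the NaN mask pushes NaNs to the right of each row
b = pd.isnull(v).argsort(axis=1, kind='mergesort')
df2 = pd.DataFrame(v[a, b], df1.index, df1.columns)
print(df2)
```

In the result, row 1 becomes [1.0, 3.0, 4.0, NaN]: the value 3.0 from g2_a slides into g1_b's slot, which is the behaviour the question wants to avoid.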
I have two dataframes, one with expressions and another with values. The Criteria column of DataFrame 1 references column names of the other dataframe. I need to take each row of values from DataFrame 2 and substitute them into the DataFrame 1 criteria, without a loop.
How should I do this in an optimized way?
DataFrame 1:
           Criteria  point
0       chgdsl='10'      1
1  chgdt='01022007'      2
3        chgdsl='9'      3
DataFrame 2:
chgdsl chgdt chgname
0 10 01022007 namrr
1 9 02022007 chard
2 9 01022007 exprr
I expect that when I take the first row of DataFrame 2, the output for DataFrame 1 will be 10='10', 01022007='01022007', 10='9'.
I need to take one row at a time from DataFrame 2 and substitute it into all rows of DataFrame 1.
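One possible loop-free sketch per row of DataFrame 2 (assuming the criteria embed column names as plain tokens): build a {column: value} mapping from the row and let Series.replace substitute all of them at once with regex=True. The data below is taken from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Criteria': ["chgdsl='10'", "chgdt='01022007'", "chgdsl='9'"],
                    'point': [1, 2, 3]}, index=[0, 1, 3])
df2 = pd.DataFrame({'chgdsl': ['10', '9', '9'],
                    'chgdt': ['01022007', '02022007', '01022007'],
                    'chgname': ['namrr', 'chard', 'exprr']})

# Mapping from one row of df2: {'chgdsl': '10', 'chgdt': '01022007', ...}
mapping = df2.iloc[0].to_dict()

# Substitute every column-name token in every Criteria row at once
out = df1['Criteria'].replace(mapping, regex=True)
print(out.tolist())
# ["10='10'", "01022007='01022007'", "10='9'"]
```

Repeating this for each row of df2 (e.g. iterating over df2.index) gives one substituted version of DataFrame 1 per row of values.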
I searched the archive but did not find what I wanted (probably because I don't really know what keywords to use).
Here is my problem: I have a bunch of dataframes that need to be merged; I also want to update the values of a subset of columns with the sum across the dataframes.
For example, I have two dataframes, df1 and df2:
df1 = pd.DataFrame([[1, 2], [1, 3], [0, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 5], [0, 6]], columns=["a", "b"], index=[0, 2])
df1:          df2:
   a  b          a  b
0  1  2       0  1  5
1  1  3       2  0  6
2  0  4
After merging, I'd like column 'b' updated with the sum of matched records, while column 'a' stays as in df1 (or df2, I don't really care):
a b
0 1 7
1 1 3
2 0 10
Now, expand this to merging three or more dataframes.
Are there straightforward, built-in tricks for this, or do I need to process them one by one, line by line?
===== Edit / Clarification =====
In the real-world example, each dataframe may contain indexes that are not in the other dataframes. In this case, the merged dataframe should have all of them, with the shared entries/indexes updated with the sum (or some other operation).
Only a partial solution so far, but the main point is solved:
df3 = pd.concat([df1, df2], join = "outer", axis=1)
df4 = df3.b.sum(axis=1)
df3 has two 'a' columns and two 'b' columns. sum(axis=1) on df3.b adds the two 'b' columns and ignores NaNs. Now df4 holds the sum of df1's and df2's 'b' columns over all the indexes.
This does not solve column 'a', though. In my real case there are quite a few NaNs in df3.a, while the remaining values in df3.a should all be the same. I haven't found a straightforward way to build a column 'a' in df4 filled with the non-NaN values. I'm now searching for a "count"-like function to get the occurrences of elements in the rows of df3.a (imagine it has a few dozen 'a' columns).
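A possible sketch for the whole problem (assuming 'a' agrees wherever indexes overlap): sum 'b' with add and a fill_value, and take the first non-NaN 'a' per index with combine_first. For three or more frames, fold the same two operations across the list:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [1, 3], [0, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 5], [0, 6]], columns=["a", "b"], index=[0, 2])

# 'b': align on the union of indexes and sum; fill_value=0 keeps
# rows that exist in only one frame
b = df1["b"].add(df2["b"], fill_value=0)

# 'a': take df1's value, falling back to df2's where df1 has no entry
a = df1["a"].combine_first(df2["a"])

out = pd.DataFrame({"a": a, "b": b})
print(out["b"].tolist())  # [7.0, 3.0, 10.0]
print(out["a"].tolist())
```

With a list of frames, the same pattern generalizes via functools.reduce, e.g. reduce(lambda s, t: s.add(t, fill_value=0), (df["b"] for df in frames)).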