I want to aggregate a single-column DataFrame and count the number of elements per group. However, I always end up with an empty DataFrame:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[46]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]
If I add a second column, I get the desired result:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5], "B":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[45]:
B
A
1 1
2 1
3 1
4 1
5 3
Can you explain the reason for this?
Give this a shot. When you group by the only column, A becomes the group index and no columns are left to count, which is why you get an empty DataFrame. Selecting the column explicitly fixes it:
import pandas as pd
print(pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A")["A"].count())
This prints:
A
1 1
2 1
3 1
4 1
5 3
You have to select the grouped-by column explicitly in your result:
import pandas as pd
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").A.count()
Output:
A
1 1
2 1
3 1
4 1
5 3
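As an aside, GroupBy.size() counts rows per group without selecting any column at all, which sidesteps the issue entirely; a minimal sketch:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 5, 5]})

# size() counts all rows per group (including rows with NaN,
# which count() would skip)
print(df.groupby("A").size())
This prints the same counts as the column-selection approach above.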
Related
Suppose I have a Series
s = pd.Series(1, index=[1,2,3,5,6,9,10])
But I need the full index [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], with the values at the missing labels 4, 7, and 8 set to zero.
So I expect the updated Series to be
s = pd.Series([1,1,1,0,1,1,0,0,1,1], index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
How should I update the series?
Thank you in advance!
Try this:
s.reindex(range(1, s.index.max() + 1), fill_value=0)
Output:
1 1
2 1
3 1
4 0
5 1
6 1
7 0
8 0
9 1
10 1
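For reference, a self-contained sketch of the same approach, writing the target index out explicitly:
import pandas as pd

s = pd.Series(1, index=[1, 2, 3, 5, 6, 9, 10])

# reindex aligns the values to the new index; labels missing from
# the original index (4, 7, 8) get fill_value
full = s.reindex(range(1, 11), fill_value=0)
print(full)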
I have an array of test data like array = [[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]] and an array of predictions like array_pred = [10, 3, 4], both of equal length. I want to append each prediction to the corresponding row, like res_array = [[5, 6, 7, 1, 10], [5, 6, 7, 4, 3], [5, 6, 7, 3, 4]]. I then have to store the result in a DataFrame and generate an Excel file from it. Is this possible?
Use numpy.hstack (or numpy.column_stack) to join the arrays, convert to a Series, and then write to Excel:
import numpy as np
import pandas as pd

a = np.hstack((array, np.array(array_pred)[:, None]))
# or, equivalently (thanks to Ch3steR):
a = np.column_stack([array, array_pred])
print(a)
[[ 5  6  7  1 10]
 [ 5  6  7  4  3]
 [ 5  6  7  3  4]]
s = pd.Series(a.tolist())
print(s)
0 [5, 6, 7, 1, 10]
1 [5, 6, 7, 4, 3]
2 [5, 6, 7, 3, 4]
dtype: object
s.to_excel(file, index=False)
Or, if you need flattened (scalar) values, convert to a DataFrame and a Series and use concat:
df = pd.concat([pd.DataFrame(array), pd.Series(array_pred)], axis=1, ignore_index=True)
print(df)
0 1 2 3 4
0 5 6 7 1 10
1 5 6 7 4 3
2 5 6 7 3 4
And then:
df.to_excel(file, index=False)
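Putting the flattened variant together end to end (a sketch; the file name result.xlsx is a placeholder, and writing .xlsx requires an Excel engine such as openpyxl to be installed):
import numpy as np
import pandas as pd

array = [[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]]
array_pred = [10, 3, 4]

# append each prediction as a new last column
a = np.column_stack([array, array_pred])

# flatten into one row per sample and write to Excel
df = pd.DataFrame(a)
df.to_excel("result.xlsx", index=False)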
I have a DataFrame defined as
df=pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
'B':[0, 2, 3, 4, 5, 6, 7],
'C':[7, 2, 2, 5, 7, 2, 2]})
I would like to drop duplicated values from columns A and C, but only partially: a row should be dropped only when it repeats the (A, C) values of the row immediately before it.
If I use
df.drop_duplicates(subset=['A','C'], keep='first')
It will drop rows 2, 5, and 6. However, I only want to drop rows 2 and 6, since row 5 does not repeat the row directly before it. The desired result is:
df=pd.DataFrame({'A':[1, 3, 4, 5, 3],
'B':[0, 2, 4, 5, 6],
'C':[7, 2, 5, 7, 2]})
Here's how you can do this, using shift:
df.loc[(df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)].reset_index(drop=True)
Output:
A B C
0 1 0 7
1 3 2 2
2 4 4 5
3 5 5 7
4 3 6 2
This question is a nice reference.
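To see why this works, inspect the intermediate boolean mask: shift() moves A and C down by one row, so each row is compared with its predecessor, and any(axis=1) keeps a row whenever A or C changed. A minimal sketch with the question's data:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [7, 2, 2, 5, 7, 2, 2]})

# True where the (A, C) pair differs from the previous row
changed = (df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)
print(changed.tolist())  # [True, True, False, True, True, True, False]
Rows 2 and 6 are the only False entries, so exactly those rows are dropped.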
Alternatively, you can keep every other occurrence of each (A, C) pair (the 1st, 3rd, 5th, ...):
df = df.loc[df.groupby(["A", "C"]).cumcount() % 2 == 0]
Output:
A B C
0 1 0 7
1 3 2 2
3 4 4 5
4 5 5 7
5 3 6 2
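Note that the two answers agree on this data but can diverge when equal pairs are not adjacent: the shift approach only drops consecutive repeats, while the cumcount approach drops every second occurrence globally. A small sketch of the difference:
import pandas as pd

df2 = pd.DataFrame({'A': [3, 4, 3], 'B': [0, 1, 2], 'C': [2, 5, 2]})

# shift keeps all three rows (the repeated (3, 2) pair is not consecutive)
print(df2.loc[(df2[["A", "C"]].shift() != df2[["A", "C"]]).any(axis=1)])

# cumcount % 2 drops the second occurrence of (3, 2) regardless of position
print(df2.loc[df2.groupby(["A", "C"]).cumcount() % 2 == 0])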
I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B' and then change the 2 to 4, so the result is:
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.
You can use append to add the rows where B == 2 (extracted with boolean indexing) after reassigning B to 4 with assign. If order matters, you can then sort by C to reproduce your desired frame:
>>> df.append(df[df.B.eq(2)].assign(B=4)).sort_values('C')
B C Date
0 1 A 1
1 2 B 2
1 4 B 2
2 3 C 3
3 2 D 4
3 4 D 4
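Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea is written with pd.concat; a sketch:
import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A', 'B', 'C', 'D']})

# duplicate the B == 2 rows with B reassigned to 4, then restore order
out = pd.concat([df, df[df.B.eq(2)].assign(B=4)]).sort_values('C')
print(out)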
I have a dataframe df1 like this:
import pandas as pd
dic = {'A': [0, 0, 2, 2, 2, 1, 5, 5],
       'B': [[1, 5, 3, 8], [1, 8, 7, 5], [7, 8, 9, 5], [3],
             [1, 5, 9, 3], [0, 3, 5], [], [4, 2, 3, 1]],
       'C': ['a', 'b', 'c', 'c', 'd', 'e', 'f', 'f'],
       'D': ['0', '8', '7', '6', '4', '5', '2', '2']}
df1 = pd.DataFrame(dic)
and looks like this:
#Initial dataframe
A B C D
0 0 [1, 5, 3, 8] a 0
1 0 [1, 8, 7, 5] b 8
2 2 [7, 8, 9, 5] c 7
3 2 [3] c 6
4 2 [1, 5, 9, 3] d 4
5 1 [0, 3, 5] e 5
6 5 [] f 2
7 5 [4, 2, 3, 1] f 2
My goal is to group rows that have the same values in columns A and C and merge the content of column B, so that the result looks like this:
#My GOAL
A B C
0 0 [1, 5, 3, 8] a
1 0 [1, 8, 7, 5] b
2 2 [3, 7, 8, 9, 5] c
3 2 [1, 5, 9, 3] d
4 1 [0, 3, 5] e
5 5 [4, 2, 3, 1] f
As you can see, rows having the same items in columns A and C are merged, while rows that differ in either column are left as they are.
My idea was to use the groupby and sum functions like this:
df1.groupby(by=['A', 'C'], as_index=False, sort=True).sum()
but pandas returns an error message: Function does not reduce.
Could you please tell me what is wrong with my line of code? What should I write in order to achieve my goal?
Note: I do not care about what happens to column D, which can be discarded.
One possibility is to flatten each group's list of lists with the help of itertools.chain(*iterables):
import itertools
df1.groupby(['A', 'C'])['B'].apply(lambda x: list(itertools.chain(*x))).reset_index()
Or use sum via apply (summing a Series of lists concatenates them):
df1.groupby(by=['A','C'])['B'].apply(lambda x: x.sum()).reset_index()
Both yield:
   A  C                B
0  0  a     [1, 5, 3, 8]
1  0  b     [1, 8, 7, 5]
2  1  e        [0, 3, 5]
3  2  c  [7, 8, 9, 5, 3]
4  2  d     [1, 5, 9, 3]
5  5  f     [4, 2, 3, 1]
As for the original error: by default, groupby().sum() expects numeric (scalar) values to aggregate, not a collection of elements such as a list.
Another possibility:
df1.groupby(by=['A', 'C'], as_index=False, sort=True).agg({'B': lambda x: tuple(sum(x, []))})
Result:
A C B
0 0 a (1, 5, 3, 8)
1 0 b (1, 8, 7, 5)
2 1 e (0, 3, 5)
3 2 c (7, 8, 9, 5, 3)
4 2 d (1, 5, 9, 3)
5 5 f (4, 2, 3, 1)
Based on this answer (it seems that lists do not work too well with aggregation).
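For reference, a self-contained sketch of the chain-based approach, which keeps B as lists (matching the stated goal):
import itertools
import pandas as pd

dic = {'A': [0, 0, 2, 2, 2, 1, 5, 5],
       'B': [[1, 5, 3, 8], [1, 8, 7, 5], [7, 8, 9, 5], [3],
             [1, 5, 9, 3], [0, 3, 5], [], [4, 2, 3, 1]],
       'C': ['a', 'b', 'c', 'c', 'd', 'e', 'f', 'f']}
df1 = pd.DataFrame(dic)

# chain concatenates the per-group lists; empty lists disappear naturally
out = (df1.groupby(['A', 'C'])['B']
          .apply(lambda x: list(itertools.chain(*x)))
          .reset_index())
print(out)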