How to get the last group in Pandas' groupby? - python

I wish to get the last group of my groupby:
df.groupby(pd.TimeGrouper(freq='M')).groups[-1]
but that gives the error:
KeyError: -1
Using get_group is useless as I don't know the last group's value (unless there's a specific way to get that value?). I might also want to get the last two groups, and so on.
How do I do this?
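A minimal sketch of one route, assuming the frame has a DatetimeIndex (pd.TimeGrouper was later deprecated in favour of pd.Grouper, and recent pandas may prefer freq='ME' over 'M'): .groups is an ordered mapping, so its last key identifies the last group.
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': np.arange(6)},
                  index=pd.date_range('2015-01-15', periods=6, freq='20D'))

g = df.groupby(pd.Grouper(freq='M'))

# .groups maps each month-end key to its row labels, in group order,
# so the last key belongs to the last group.
last_key = list(g.groups)[-1]
print(g.get_group(last_key))   # the last group
print(list(g.groups)[-2:])     # keys of the last two groups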

Using Ed's example
You can slice out the last group. The groups iterate in a well-defined order (the given order, or sorted, depending on the groupby options).
In [12]: df = pd.DataFrame({'a':['1','2','2','4','5','2'], 'b':np.random.randn(6)})
In [13]: g = df.groupby('a')
In [14]: g.groups
Out[14]: {'1': [0], '2': [1, 2, 5], '4': [3], '5': [4]}
In [15]: import itertools
In [16]: list(itertools.islice(g,len(g)-1,len(g)))
Out[16]:
[('5',    a         b
  4  5 -0.644857)]
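If the laziness of islice isn't needed, materialising the iterator is shorter. A sketch over the same kind of data; the last two groups are just a [-2:] slice:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '2', '4', '5', '2'],
                   'b': np.random.randn(6)})
g = df.groupby('a')

# Each element of list(g) is a (key, sub-frame) pair, in group order.
last_key, last_frame = list(g)[-1]
print(last_key)    # '5'
print(last_frame)

# The last two groups are just a slice away.
for key, frame in list(g)[-2:]:
    print(key, len(frame))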

You can call last, which computes the last values for each group, then use iloc to get the last row and read the group key from its name attribute. There is probably a better way, but I haven't been able to figure it out yet:
In [170]:
# dummy data
df = pd.DataFrame({'a':['1','2','2','4','5','2'], 'b':np.random.randn(6)})
df
Out[170]:
   a         b
0  1  0.097176
1  2 -1.400536
2  2  0.352093
3  4 -0.696436
4  5 -0.308680
In [179]:
gp = df.groupby('a', sort=False)
gp.get_group(df.groupby('a').last().iloc[-1].name)
Out[179]:
   a         b
4  5 -0.308680
In [180]:
df.groupby('a').last().iloc[-2:]
Out[180]:
          b
a
4 -0.696436
5 -0.308680
In [181]:
mult_groups = gp.last().iloc[-2:].index
In [182]:
for gp_val in mult_groups:
    print(gp.get_group(gp_val))
   a         b
3  4 -0.696436
   a         b
4  5 -0.308680
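The same idea in a slightly shorter form, as a sketch (it assumes the index of last() is all you need, since that index is the group keys in group order):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '2', '4', '5', '2'],
                   'b': np.random.randn(6)})
gp = df.groupby('a', sort=False)

# last() is indexed by the group keys, so its index *is* the key order.
print(gp.get_group(gp.last().index[-1]))   # last group
for key in gp.last().index[-2:]:           # last two groups
    print(gp.get_group(key))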

Easiest is to convert the groups to a DataFrame and index it as you would a DataFrame. The resulting DataFrame has a row for each group, where the first column is the group index and the second column is the DataFrame from that group. The one-liner for the last group's DataFrame is:
last_dataframe = pd.DataFrame(df.groupby('whatever')).iloc[-1, 1]
If you want the index and group:
last_group = pd.DataFrame(df.groupby('whatever')).iloc[-1, :]
last_group[0] is the index of the last group, and
last_group[1] is the DataFrame of the last group
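A concrete run of that one-liner, as a sketch with 'a' standing in for the 'whatever' column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '2', '4', '5', '2'],
                   'b': np.random.randn(6)})

# One row per group: column 0 holds the group key, column 1 the sub-frame.
groups = pd.DataFrame(df.groupby('a'))
last_key = groups.iloc[-1, 0]      # '5'
last_frame = groups.iloc[-1, 1]    # the DataFrame for that group
print(last_key)
print(last_frame)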

Related

Splitting dataframe with a specific rule at a specific row, on loop

Given any df with only 3 columns and n rows, I'm trying to split it horizontally, in a loop, at the position where a column's value is max.
Something close to what np.array_split() does, but not necessarily into equal sizes. The cut would have to be at the row holding the value picked by the max rule at that moment in the loop. I imagine the over- or under-cutting bit is not necessarily the harder part.
An example (sorry, it's my first time actually asking a question; I don't know how to format code here yet):
df = pd.DataFrame({'a': [3,1,5,5,4,4], 'b': [1,7,1,2,5,5], 'c': [2,4,1,3,2,2]})
This df, with the max-value condition applied to column b (7), would be cut into a 2-row df and another with 4 rows.
Perhaps this might help you. Assume our n-by-3 dataframe is as follows:
df = pd.DataFrame({'a': [1,2,3,4], 'b': [4,3,2,1], 'c': [2,4,4,3]})
>>> df
   a  b  c
0  1  4  2
1  2  3  4
2  3  2  4
3  4  1  3
We can create a list of rows where max values occur for each column.
rows = [df[df[i] == max(df[i])] for i in df.columns]
>>> rows[0]
   a  b  c
3  4  1  3
>>> rows[2]
   a  b  c
1  2  3  4
2  3  2  4
This can also be written as a list of indexes if preferred.
indexes = [i.index for i in rows]
>>> indexes
[Int64Index([3], dtype='int64'), Int64Index([0], dtype='int64'), Int64Index([1, 2], dtype='int64')]
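To actually perform the cut the question describes, a minimal sketch (assuming the split should fall just after the row holding the max of column 'b'):
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 5, 5, 4, 4],
                   'b': [1, 7, 1, 2, 5, 5],
                   'c': [2, 4, 1, 3, 2, 2]})

# Integer position of the max of 'b'; cut just below that row.
cut = df['b'].to_numpy().argmax() + 1

top, bottom = df.iloc[:cut], df.iloc[cut:]
print(top)     # the 2-row piece, ending at the max of 'b'
print(bottom)  # the remaining 4 rows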

How to group by one column and sort the values of another column?

Here is my dataframe
import pandas as pd
df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'],
                   'B': ['Ar', 'Br', 'Cr', 'Ar', 'Ar'],
                   'C': ['12/15/2011', '11/11/2001', '08/30/2015', '07/3/1999', '03/03/2000'],
                   'D': [1, 7, 3, 4, 5]})
My goal is to group by column A and sort within grouped results by column B.
Here is what I came up with:
sort_group = df.sort_values('B').groupby('A')
I was hoping the grouping operation would not disturb the order, but it doesn't work, and it returns a groupby object rather than a dataframe:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000008B190B8>
Any suggestions?
You cannot call sort_values directly on a groupby object; you need an apply:
df.groupby('A').apply(lambda x: x.sort_values('B'))
gives you the desired output:
           A   B           C  D
A
one 0    one  Ar  12/15/2011  1
    4    one  Ar  03/03/2000  5
    1    one  Br  11/11/2001  7
two 3    two  Ar   07/3/1999  4
    2    two  Cr  08/30/2015  3
I usually just use sort_values to group values indirectly by column A and sort within the groups by column B:
sort_group = df.sort_values(['A', 'B'])
which will give you this:
     A   B           C  D
0  one  Ar  12/15/2011  1
4  one  Ar  03/03/2000  5
1  one  Br  11/11/2001  7
3  two  Ar   07/3/1999  4
2  two  Cr  08/30/2015  3
This returns a normal DataFrame on which you can continue your analysis.
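Combining the two points above: groupby preserves the row order within each group, so sorting first and grouping afterwards yields groups whose rows are already ordered by B. A sketch:
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'],
                   'B': ['Ar', 'Br', 'Cr', 'Ar', 'Ar'],
                   'C': ['12/15/2011', '11/11/2001', '08/30/2015',
                         '07/3/1999', '03/03/2000'],
                   'D': [1, 7, 3, 4, 5]})

# groupby keeps the within-group row order, so each group's rows
# come out already sorted by B.
for key, grp in df.sort_values(['A', 'B']).groupby('A'):
    print(key)
    print(grp)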

Replace a column in Pandas dataframe with another that has same index but in a different order

I'm trying to re-insert into a pandas dataframe a column that I extracted and whose order I changed by sorting it.
Very simply, I have extracted a column from a pandas df:
col1 = df.col1
This column contains integers. I used the .sort() method to order it from smallest to largest, and did some operations on the data.
col1.sort()
# do stuff that changes the values of col1
Now the indexes of col1 are the same as the indexes of the overall df, but in a different order.
I was wondering how I can insert the column back into the original dataframe (replacing the col1 that is there at the moment)
I have tried both of the following methods:
1)
df.col1 = col1
2)
df.insert(column_index_of_col1, "col1", col1)
but both methods give me the following error:
ValueError: cannot reindex from a duplicate axis
Any help will be greatly appreciated.
Thank you.
Consider this DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df
Out:
   A  B
0  1  6
0  2  5
1  3  4
Assign the second column to b, sort it, and take the square, for example:
b = df['B']
b = b.sort_values()
b = b**2
Now b is:
b
Out:
1    16
0    25
0    36
Name: B, dtype: int64
Without knowing the exact operation you've done on the column, there is no way to tell whether 25 corresponds to the first row of the original DataFrame or the second. You could invert the operation (take the square root and match, for example), but that would be unnecessary, I think. If you start with an index that has unique elements (df = df.reset_index()), it becomes much easier. In that case,
df['B'] = b
should work just fine.
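A minimal sketch of that reset_index route (assuming the duplicate index labels themselves are not needed afterwards):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])

# Make the index unique so assignment can align unambiguously.
df = df.reset_index(drop=True)

b = df['B'].sort_values() ** 2

# Assignment aligns on the now-unique index, so every squared value
# lands back on the row it came from.
df['B'] = b
print(df)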

Added column to existing dataframe but entered all numbers as NaN

So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried using append, merge and concat, none of which worked, and then tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut properly, but all the values were entered as NaN instead of the numerical values they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If the row counts are the same and you just want to tack it on the end, the indexes either need to match or you need to pass just the underlying values. In the example below, columns 3 and 5 are the index-matching and raw-values versions respectively, and column 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
          0         1         2
0  0.670812  0.500688  0.136661
1  0.185841  0.239175  0.542369
2  0.351280  0.451193  0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
          0
0  0.638216
1  0.477159
2  0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
          0         1         2         3   4         5
a  0.670812  0.500688  0.136661  0.638216 NaN  0.638216
b  0.185841  0.239175  0.542369  0.477159 NaN  0.477159
c  0.351280  0.451193  0.436108  0.205981 NaN  0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
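A sketch of that merge route for mismatched row counts; the 'id' key column here is made up, so substitute whatever columns identify matching rows in your frames:
import pandas as pd

left = pd.DataFrame({'id': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'id': ['b', 'c', 'd'], 'power': [10, 20, 30]})

# Rows are matched on 'id'; left rows with no partner get NaN.
merged = left.merge(right, on='id', how='left')
print(merged)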

Keeping the N first occurrences of

The following code will (of course) keep only the first occurrence of 'Item1' in rows sorted by 'Date'. Any suggestions as to how I could get it to keep, say, the first 5 occurrences?
## Sort the dataframe by Date and keep only the earliest appearance of 'Item1'
## drop_duplicates considers the column 'Item1' and keeps only the first occurrence
coocdates = data.sort('Date').drop_duplicates(cols=['Item1'])
You want to use head, either on the dataframe itself or on the groupby:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [1, 6], [2, 8]], columns=['A', 'B'])
In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  1  6
3  2  8
In [13]: df.head(2) # the first two rows
Out[13]:
   A  B
0  1  2
1  1  4
In [14]: df.groupby('A').head(2) # the first two rows in each group
Out[14]:
   A  B
0  1  2
1  1  4
3  2  8
Note: the behaviour of groupby's head changed in 0.14 (previously it didn't act like a filter; it modified the index), so you will have to reset the index if using an earlier version.
Use groupby() and nth(). According to the Pandas docs, nth() will:
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
Therefore all you need is:
data.sort('Date').groupby('Item1').nth([0, 1, 2, 3, 4]).reset_index(drop=False)
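Putting either answer back into the question's own terms, a sketch (the data below is invented; swap head(2) for head(5) to keep the first five occurrences):
import pandas as pd

data = pd.DataFrame({
    'Item1': ['x', 'x', 'y', 'x', 'y', 'x'],
    'Date': pd.to_datetime(['2015-01-03', '2015-01-01', '2015-01-05',
                            '2015-01-02', '2015-01-04', '2015-01-06']),
})

# Earliest rows first, then the first 2 occurrences of each Item1 value.
coocdates = data.sort_values('Date').groupby('Item1').head(2)
print(coocdates)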
