Keeping 'key' column when using groupby with transform in pandas - python

Normalizing a dataframe with groupby/transform removes the column being used to group by, so it can't be used in subsequent groupby operations. For example (edit: updated):
df = pd.DataFrame({'a':[1, 1, 2, 3, 2, 3], 'b':[0, 1, 2, 3, 4, 5]})
a b
0 1 0
1 1 1
2 2 2
3 3 3
4 2 4
5 3 5
df.groupby('a').transform(lambda x: x)
b
0 0
1 1
2 2
3 3
4 4
5 5
Now, with most operations on groups, the 'missing' column becomes a new index (which can then be adjusted using reset_index, or avoided up front by setting as_index=False), but when using transform it just disappears, leaving the original index and a new dataset without the key.
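A quick illustration with the df above (a minimal sketch): an aggregation keeps 'a' around as the index, while transform drops it entirely:
df.groupby('a').mean()                   # 'a' becomes the index; reset_index() brings it back
df.groupby('a').transform(lambda x: x)   # 'a' vanishes; only 'b' remains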
Edit: here's a one-liner of what I would like to be able to do
df.groupby('a').transform(lambda x: x+1).groupby('a').mean()
KeyError: 'a'
In the example from the pandas docs a function is used to split based on the index, which appears to sidestep this issue entirely. Alternatively, it would always be possible just to add the column back after the groupby/transform, but surely there's a better way?
Update:
It looks like reset_index/as_index are intended only for functions that reduce each group to a single row. There seem to be a couple of options, from the answers below.

The issue is also discussed here.
The returned object has the same indices as the original df, therefore you can do
pd.concat([
    df['a'],
    df.groupby('a').transform(lambda x: x)
], axis=1)
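With the key reattached this way, the one-liner from the question works; a minimal sketch:
out = pd.concat([df['a'], df.groupby('a').transform(lambda x: x + 1)], axis=1)
out.groupby('a').mean()   # no more KeyError: 'a' travelled along with the data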

That is bizarre!
I tricked it like this:
df.groupby(df.a.values).transform(lambda x: x)
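This works because grouping by the underlying array rather than the column name means 'a' is no longer consumed as a key column, so it survives into the output; a quick check (a sketch):
out = df.groupby(df.a.values).transform(lambda x: x)
out.columns.tolist()   # ['a', 'b'] -- the key column is still there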

Another way to achieve something similar to what Pepacz suggested:
df.loc[:, df.columns.drop('a')] = df.groupby('a').transform(lambda x: x)

Try this:
df['b'] = df.groupby('a').transform(lambda x: x)   # write the transformed values back, leaving 'a' in place
df.drop_duplicates()

Related

Pandas, how to combine multiple columns into an array column

I need a combined column that is the concatenation of all the values in the row.
Source:
pd.DataFrame(data={
    'a' : [1,2,3],
    'b' : [2,3,4]
})
Target:
pd.DataFrame(data={
    'a' : [1,2,3],
    'b' : [2,3,4],
    'combine' : [[1,2],[2,3],[3,4]]
})
Current solution:
test['combine'] = test[['a','b']].apply(lambda x: pd.Series([x.values]), axis=1)
Issues:
I actually have many columns, and it seems to take too long to run. Is there a better way?
df
a b
0 1 2
1 2 3
2 3 4
If you want to add a column of lists as a single column, you'll need to call the .values attribute, convert it to a nested list, and assign it back -
df['combine'] = df.values.tolist()
# or,
df['combine'] = df[['a', 'b']].values.tolist()
df
a b combine
0 1 2 [1, 2]
1 2 3 [2, 3]
2 3 4 [3, 4]
Note that just assigning the .values result directly does not work, as pandas special-cases numpy arrays, leading to undesirable outcomes:
df['combine'] = df[['a', 'b']].values
ValueError: Wrong number of items passed 2, placement implies 1
A couple of notes -
try not to use apply/transform as much as possible. It is only a convenience function meant to hide the application of a loop, and is slow, offering no performance/vectorization benefits whatsoever
keeping columns of object dtype offers no performance gains as far as pandas is concerned, so unless the goal is to display the data, try to avoid it.
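For the many-columns case from the question, the same idea generalizes without a per-row apply; a minimal sketch (the column list is assumed):
cols = ['a', 'b']                           # extend with however many columns you need
df['combine'] = df[cols].values.tolist()    # one bulk conversion instead of a Python-level loop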

Selecting multiple columns R vs python pandas

I am an R user who is currently learning Python and I am trying to replicate a method of selecting columns used in R into Python.
In R, I could select multiple columns like so:
df[,c(2,4:10)]
In Python, I know how iloc works, but I couldn't mix a single column number with a consecutive range of them.
This wouldn't work
df.iloc[:,[1,3:10]]
So, I'll have to drop the second column like so:
df.iloc[:,1:10].drop(df.iloc[:,1:10].columns[1] , axis=1)
Is there a more efficient way of replicating the method from R in Python?
You can use np.r_, which accepts mixed slice notation and scalar indices and concatenates them into a 1-d array:
import numpy as np
df.iloc[:,np.r_[1, 3:10]]
df = pd.DataFrame([[1,2,3,4,5,6]])
df
# 0 1 2 3 4 5
#0 1 2 3 4 5 6
df.iloc[:, np.r_[1, 3:6]]
# 1 3 4 5
#0 2 4 5 6
As np.r_ produces:
np.r_[1, 3:6]
# array([1, 3, 4, 5])
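np.r_ also takes any mix of scalars and slices, which is handy for several disjoint ranges; a sketch:
np.r_[0, 2:4, 6]
# array([0, 2, 3, 6]) -- scalars and slices combine freely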
Assuming one wants to select multiple columns of a DataFrame by their name, consider the DataFrame df
df = pandas.DataFrame({'A' : ['X', 'Y'],
                       'B' : 1,
                       'C' : [2, 3]})
If one wants the columns A and C, simply use
df[['A', 'C']]
>>> A C
0 X 2
1 Y 3
Note that if one wants to use the selection later on, one should assign it to a variable.
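A minimal sketch of that last point:
subset = df[['A', 'C']]   # a new object (a copy), so later edits to subset won't touch df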

Python filling string column "forward" and groupby attaching groupby result to dataframe

I have a dataframe generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to fill column 1 "forward" with test and test1. I believe one approach would be to replace the whitespace-only entries with np.nan, but that is tricky since the real words contain whitespace as well. I could also group by column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives? I can't get either one coded.
Additionally, I would like to add a column that contains the group means; the final dataframe should look like this:
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advise how to accomplish something like this?
Many thanks, and please let me know in case you need further information.
IIUC, you could use str.strip and then check whether the stripped string is empty.
Then perform the groupby operations, filling the NaNs with the ffill method and calculating the means using the groupby.transform function, as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.NaN if len(x) == 0 else x)   # empty strings -> NaN
df[1] = df.groupby(0)[1].fillna(method='ffill')            # forward-fill within each group
df[3] = df.groupby(0)[2].transform(lambda x: x.mean())     # broadcast the group mean to every row
df
Note: If you must forward fill NaN values with first element of that group, then you must do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only to strings, we first drop all the NaN values; otherwise we would get a TypeError, because the column contains both floats and strings and float has no len method.
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present (empty strings included)
2 test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step, as it only computes on the indices where strings are present.
Also, the reset_index(drop=True) part was indeed unwanted, as the groupby object returns a series after fillna, which can be assigned back to column 1.
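As a side note: since each group here contains exactly one real string (after the blanks become NaN), a hedged alternative is to broadcast the group's first non-null value instead of forward-filling; a sketch:
df[1] = df.groupby(0)[1].transform('first')   # 'first' skips NaN within each group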

How to set a value in a pandas DataFrame by mixed iloc and loc

Say I want a function that changes the value of a named column in a given row number of a DataFrame.
One option is to find the column's location and use iloc, like that:
def ChangeValue(df, rowNumber, fieldName, newValue):
    columnNumber = df.columns.get_loc(fieldName)
    df.iloc[rowNumber, columnNumber] = newValue
But I wonder if there is a way to use the magic of iloc and loc in one go, and skip the manual conversion.
Any ideas?
I suggest just using iloc combined with the Index.get_loc method, e.g.:
df.iloc[0:10, df.columns.get_loc('column_name')]
A bit clumsy, but simple enough.
MultiIndex has both get_loc and get_locs, the latter of which takes a sequence; unfortunately plain Index seems to have only the former.
Using loc
One has to resort to either employing integer location iloc all the way (as suggested in this answer), or using label-based location loc all the way, as shown here:
df.loc[df.index[[0, 7, 13]], 'column_name']
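The same pattern supports assignment, which is what the question is after; a sketch:
df.loc[df.index[[0, 7, 13]], 'column_name'] = 0   # positional rows, label-based column, single indexer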
According to this answer,
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
So you should especially be able to use df.ix[rowNumber, fieldName] in case type(df.index) != type(rowNumber). (Note that ix is deprecated in modern pandas.)
Even though it does not hold for every case, I'd like to add an easy one, if you are looking for top or bottom entries:
df.head(1)['column_name'] # first entry in 'column_name'
df.tail(5)['column_name'] # last 5 entries in 'column_name'
Edit: doing the following is not a good idea. I leave the answer as a counter example.
You can do this:
df.iloc[rowNumber].loc[fieldName] = newValue
Example
import pandas as pd
def ChangeValue(df, rowNumber, fieldName, newValue):
    df.iloc[rowNumber].loc[fieldName] = newValue

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
print(df)
A B C
4 0 2 3
5 0 4 1
6 10 20 30
ChangeValue(df, 1, "B", 999)
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
But be careful: if newValue does not have the same type, this does not work and fails silently:
ChangeValue(df, 1, "B", "Oops")
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
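The silent failure comes from chained indexing: df.iloc[rowNumber] returns an intermediate Series, and when assigning "Oops" forces a dtype change, that Series is built as a copy, so the write never reaches df. A non-chained single indexing call avoids the problem; a sketch:
df.iloc[1, df.columns.get_loc("B")] = "Oops"   # one indexer; the column is upcast to object
                                               # (newer pandas versions may warn about the dtype change)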
There is some good info about working with columns data types here: Change column type in pandas

What is the equivalent of SQL "GROUP BY HAVING" on Pandas?

What would be the most efficient way to use groupby and, in parallel, apply a filter in pandas?
Basically I am asking for the equivalent in SQL of
select *
...
group by col_name
having condition
I think there are many use cases, ranging from conditional means and sums to conditional probabilities, which would make such a command very powerful.
I need very good performance, so ideally such a command would not be the result of several layered operations in python.
As mentioned in unutbu's comment, groupby's filter is the equivalent of SQL's HAVING:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 3
2 5 6
In [13]: g = df.groupby('A') # GROUP BY A
In [14]: g.filter(lambda x: len(x) > 1) # HAVING COUNT(*) > 1
Out[14]:
A B
0 1 2
1 1 3
You can write more complicated functions (these are applied to each group), provided they return a plain ol' bool:
In [15]: g.filter(lambda x: x['B'].sum() == 5)
Out[15]:
A B
0 1 2
1 1 3
Note: there is potentially a bug where you can't write your function to act on the columns you've used to group by... a workaround is to group by the columns manually, i.e. g = df.groupby(df['A']).
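Since the question stresses performance: filter calls a Python function once per group, while for simple conditions a boolean mask built with transform stays vectorized and is often faster. A hedged sketch of the same HAVING COUNT(*) > 1:
df[df.groupby('A')['A'].transform('size') > 1]   # same rows as g.filter(lambda x: len(x) > 1)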
I group by state and county, keep the groups where the max of field1 is greater than 20, then select the True values from the result using the dataframe's loc:
counties = df.groupby(['state', 'county'])['field1'].max() > 20
counties = counties.loc[counties.values == True]
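The same selection reads a little more directly as a plain boolean mask; a sketch:
mask = df.groupby(['state', 'county'])['field1'].max() > 20
counties = mask[mask].index   # the (state, county) pairs that satisfy the HAVING condition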
