Filtering grouped df in Dask - python

Related to this similar question for Pandas: filtering grouped df in pandas
Action
To eliminate groups based on an expression applied to a different column than the groupby column.
Problem
Filter is not implemented for grouped dataframes.
Tried
Groupby and apply to eliminate certain groups, which raises an error, presumably because the apply function is expected to always return something?
In [16]:
def filter_empty(df):
    if not df.label.values.all(4):
        return df

df_nonempty = df_norm.groupby('hash').apply(filter_empty, meta=meta)
In [17]:
len(df_nonempty.hash.unique())
...
<ipython-input-16-6da6d9b6c069> in filter_empty()
1 def filter_empty(df):
----> 2 if not df.label.values.all(4):
3 return df
4
5 df_nonempty = df_norm.groupby('hash').apply(filter_empty, meta=meta)
/opt/conda/lib/python3.5/site-packages/numpy/core/_methods.py in _all()
39
40 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 41 return umr_all(a, axis, dtype, out, keepdims)
42
43 def _count_reduce_items(arr, axis):
ValueError: 'axis' entry is out of bounds
Question
Is there another way to achieve the Dask equivalent of Pandas' grouped.filter(lambda x: len(x) > 1)? Or is my groupby-apply simply implemented wrongly?
Example
import numpy as np
import pandas as pd
import dask.dataframe as dd
In [3]:
df = pd.DataFrame({'A': list('aacaaa'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbc')})
df = dd.from_pandas(df, npartitions=1)
In [8]:
df.A.unique().compute()
Out[8]:
0 a
1 c
Name: A, dtype: object
In [6]:
def filter_4(df):
    if not df.B.values.all(4):
        return df

df_notalla = df.groupby('A').apply(filter_4, meta=df)
In [10]:
df_notalla.A.unique().compute()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-894a491faa57> in <module>()
----> 1 df_notalla.A.unique().compute()
...
<ipython-input-6-ef10326ae42a> in filter_4(df)
1 def filter_4(df):
----> 2 if not df.B.values.all(4):
3 return df
4
5 df_notalla = df.groupby('A').apply(filter_4, meta=df)
/opt/conda/lib/python3.5/site-packages/numpy/core/_methods.py in _all(a, axis, dtype, out, keepdims)
39
40 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 41 return umr_all(a, axis, dtype, out, keepdims)
42
43 def _count_reduce_items(arr, axis):
ValueError: 'axis' entry is out of bounds

I think you can use groupby + size first, then map the sizes back onto the Series (this works like transform, which is also not implemented in dask), and finally filter by boolean indexing:
df = pd.DataFrame({'A': list('aacaaa'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbc')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 a 5 8 3 3 a
2 c 4 9 5 6 a
3 a 5 4 7 9 b
4 a 5 2 1 2 b
5 a 4 3 0 4 c
a = df.groupby('F')['A'].size()
print (a)
F
a 3
b 2
c 1
Name: A, dtype: int64
s = df['F'].map(a)
print (s)
0 3
1 3
2 3
3 2
4 2
5 1
Name: F, dtype: int64
df = df[s > 1]
print (df)
A B C D E F
0 a 4 7 1 5 a
1 a 5 8 3 3 a
2 c 4 9 5 6 a
3 a 5 4 7 9 b
4 a 5 2 1 2 b
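For the Dask dataframe from the question, a rough sketch of the same idea (assuming the per-group sizes are small enough to compute into an in-memory pandas Series):
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=1)

# group sizes as a small in-memory pandas Series
sizes = ddf.groupby('F').size().compute()

# map the sizes back onto each row and filter by boolean indexing
mask = ddf['F'].map(sizes, meta=('F', 'i8')) > 1
print(ddf[mask].compute())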
EDIT:
I think groupby is not necessary here:
df_notall4 = df[df.C != 4].drop_duplicates(subset=['A','D'])['D'].compute()
But if you really need it:
def filter_4(x):
    return x[x.C != 4]
df_notall4 = df.groupby('A').apply(filter_4, meta=df).D.unique().compute()
print (df_notall4)
0 1
1 3
2 0
3 5
Name: D, dtype: int64

Thanks to @jezrael I reviewed my implementation and created the following solution (see my provided example).
df_notall4 = []
for d in list(df[df.C != 4].D.unique().compute()):
    df_notall4.append(df.groupby('D').get_group(d))
df_notall4 = dd.concat(df_notall4, interleave_partitions=True)
Which results in
In [8]:
df_notall4.D.unique().compute()
Out[8]:
0 1
1 3
2 5
3 0
Name: D, dtype: object
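A possibly simpler alternative (just a sketch, not tested against the original data) avoids the Python loop and keeps the matching rows with an inner merge on the surviving D values:
# distinct D values from groups that are not all 4
keep = df[df.C != 4][['D']].drop_duplicates()
# semi-join: keep only rows whose D appears in `keep`
df_notall4 = df.merge(keep, on='D', how='inner')
df_notall4.D.unique().compute()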

Related

Pandas - How to swap column contents leaving label sequence intact?

I am using pandas v0.25.3 and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns leaving the columns labels and sequence intact.
df = pd.DataFrame ({"A": [(1),(2),(3),(4)],
'B': [(5),(6),(7),(8)],
'C': [(9),(10),(11),(12)]})
This yields a dataframe,
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want to swap column contents B and C to get
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I have tried looking at pd.DataFrame.values, which sent me to numpy arrays and advanced slicing, and I got lost.
What's the simplest way to do this?
You can assign a numpy array:
# pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
# older pandas versions
df[['B','C']] = df[['C','B']].values
Or use DataFrame.assign:
df = df.assign(B = df.C, C = df.B)
print (df)
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe this would require less from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)
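Putting those two steps together, a minimal sketch of the label-swap approach:
df.columns = ['A', 'C', 'B']               # relabel: the data itself is not moved
df = df.reindex(['A', 'B', 'C'], axis=1)   # restore the original column order
print(df)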

Create two new columns from two existing columns using a function which takes two parameters

I have a similar problem to this one, but that answer couldn't solve my problem:
Pandas: create two new columns in a dataframe with values calculated from a pre-existing column.
I have a function which takes two parameters and returns two different float values. The original dataframe columns are also float:
def FunctionName(a, b):
    # some calculations ---
    return x, y
I have df and I want to use FunctionName so that I get two new Series from the existing Series df['A'] and df['B'].
df['A_new'], df['B_new'] = df[['A'], df['B']].apply(FunctionName)
gives me an error
TypeError: unhashable type: 'list'
I also tried
df['A_new'], df['B_new'] = FunctionName(df['A'], df['B'])
which gives me an error
TypeError: cannot convert the series to <class 'float'>
I want to assign the returned x values to df['A_new'] and the y values to df['B_new'].
Can someone please tell me what I am missing here?
I believe you need the parameter axis=1 in apply to process row by row, with a lambda function that selects the columns by name; the function should return a Series for the new columns, which are then added by join:
df = pd.DataFrame({'A': [4, 5, 4, 5, 5, 4],
                   'B': [7, 8, 9, 4, 2, 3],
                   'C': [1, 3, 5, 7, 1, 0]})
print (df)
A B C
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
def FunctionName(a, b):
    x = a * 5
    y = b * 7
    return pd.Series([x, y], index=['A_new', 'B_new'])
df = df.join(df.apply(lambda x: FunctionName(x['A'], x['B']), axis=1))
print (df)
A B C A_new B_new
0 4 7 1 20 49
1 5 8 3 25 56
2 4 9 5 20 63
3 5 4 7 25 28
4 5 2 1 25 14
5 4 3 0 20 21
Alternatively, return a positional Series and assign directly to both new columns:
def FunctionName(a, b):
    x = a * 5
    y = b * 7
    return pd.Series([x, y])
df[['A_new', 'B_new']] = df.apply(lambda x: FunctionName(x['A'], x['B']), axis=1)
print (df)
A B C A_new B_new
0 4 7 1 20 49
1 5 8 3 25 56
2 4 9 5 20 63
3 5 4 7 25 28
4 5 2 1 25 14
5 4 3 0 20 21
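If you prefer to keep the original tuple-returning FunctionName (no pd.Series wrapper), a sketch using result_type='expand' (available since pandas 0.23) mirrors the assignment above:
def FunctionName(a, b):
    # toy stand-in for the real calculation
    return a * 5, b * 7

# expand the returned tuples into two columns and assign them
df[['A_new', 'B_new']] = df.apply(
    lambda r: FunctionName(r['A'], r['B']), axis=1, result_type='expand')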

Pandas - df.size() error: 'numpy.int64' object is not callable

Using pandas, I'm trying to plot some data using:
df.size().unstack().plot(kind=barh)
but I got this error:
TypeError: 'numpy.int64' object is not callable
then I tried df.size() only and got the same error. Now I'm not sure what causes this because according to the documentation, DataFrame.size() should work fine. Any idea?
There is one problem: you need to omit the () because DataFrame.size is a property, not a method. But its output is a scalar, so it is impossible to call unstack on it:
df.size
Sample:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
a = df.size
print (a)
36
Maybe you need groupby + GroupBy.size:
df1 = df.groupby(['F', 'B']).size().unstack()
print (df1)
B 4 5
F
a 2 1
b 1 2
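Putting this back into the plotting call from the question (a sketch; note that 'barh' must be passed as a string):
df.groupby(['F', 'B']).size().unstack().plot(kind='barh')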

Pandas join: Does not recognize joining column

I have no idea what's happening, the title is just a first-order approximation. I'm trying to join two data frames:
>>> df_sum.head()
TUCASEID t070101 t070102 t070103 t070104 t070105 t070199 \
0 20030100013280 0 0 0 0 0 0
1 20030100013344 0 0 0 0 0 0
2 20030100013352 60 0 0 0 0 0
3 20030100013848 0 0 0 0 0 0
4 20030100014165 0 0 0 0 0 0
t070201 t070299 shopping year
0 0 0 0 2003
1 0 0 0 2003
2 0 0 60 2003
3 0 0 0 2003
4 0 0 0 2003
>>> emp.head()
TUCASEID status
0 20030100013280 emp
1 20030100013344 emp
2 20030100013352 emp
4 20030100014165 emp
5 20030100014169 emp
That's the data frames, I want to join them over the common column TUCASEID, of which there are intersections:
>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID)
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462,
20131212132469, 20131212132475])
Now...
>>> df_sum.join(emp, on='TUCASEID', how='inner')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join
rsuffix=rsuffix, sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result
rdata.items, rsuf)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix
to_rename)
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object')
Well, that's weird: the only column that appears in both data frames is the one to join over, but fine, let's comply[1]:
>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r')
Empty DataFrame
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status]
Index: []
Despite there being a huge intersection. What's going on here?
>>> pd.__version__
'0.15.0'
[1]: I actually enforced an integer dtype for the joining column because it said "object" there; it made no difference:
>>> emp.dtypes
TUCASEID int64
status object
dtype: object
>>> df_sum.dtypes
TUCASEID int64
(...)
shopping int64
year int64
dtype: object
df.join generally calls pd.merge (except in a special case when it calls concat). Therefore, anything join can do, merge can do also. Although perhaps not strictly correct, I tend to use df.join only when joining on the index, and use pd.merge for joining on columns.
Thus, I can reproduce the problem you describe:
import numpy as np
import pandas as pd
df_sum = pd.DataFrame(np.arange(6*2).reshape((6, 2)),
                      index=list('ABCDEF'), columns=list('XY'))
emp = pd.DataFrame(np.arange(6*2).reshape((6, 2)),
                   index=list('ABCDEF'), columns=list('XZ'))
print(df_sum.join(emp, on='X', rsuffix='_r', how='inner'))
# Empty DataFrame
# Columns: [X, Y, X_r, Z]
# Index: []
but pd.merge works as expected -- and without having to supply rsuffix:
print(pd.merge(df_sum, emp, on='X'))
yields
X Y Z
0 0 1 1
1 2 3 3
2 4 5 5
3 6 7 7
4 8 9 9
5 10 11 11
Under the hood, df_sum.join calls merge this way:
if isinstance(other, DataFrame):
    return merge(self, other, left_on=on, how=how,
                 left_index=on is None, right_index=True,
                 suffixes=(lsuffix, rsuffix), sort=sort)
So, even though you use df_sum.join(emp, on='...'), under the hood, Pandas converts this to pd.merge(df_sum, emp, left_on='...').
Furthermore, the merge is empty when called this way:
In [228]: pd.merge(df_sum, emp, left_on='X', left_index=False, right_index=True)
Out[228]:
Empty DataFrame
Columns: [X, X_x, Y, X_y, Z]
Index: []
because the left_on='X' needs to be on='X' for the merge to succeed as desired:
In [233]: pd.merge(df_sum, emp, on='X', left_index=False, right_index=True)
Out[233]:
X Y Z
A 0 1 1
B 2 3 3
C 4 5 5
D 6 7 7
E 8 9 9
F 10 11 11
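Applied to the frames from the question, the equivalent fix (an untested sketch) is an inner merge on the shared column instead of join:
result = pd.merge(df_sum, emp, on='TUCASEID', how='inner')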

How do I get pandas' "update" function to overwrite numbers in one column but not another?

Currently, I'm using:
csvdata.update(data, overwrite=True)
How can I make it update and overwrite a specific column but not another? Small but simple question: is there a simple answer?
Rather than update with the entire DataFrame, just update with the sub-DataFrame of the columns you are interested in. For example:
In [11]: df1
Out[11]:
A B
0 1 99
1 3 99
2 5 6
In [12]: df2
Out[12]:
A B
0 a 2
1 b 4
2 c 6
In [13]: df1.update(df2[['B']]) # subset of cols = ['B']
In [14]: df1
Out[14]:
A B
0 1 2
1 3 4
2 5 6
If you want to do it for a single column:
import pandas
import numpy

csvdata = pandas.DataFrame({"a": range(12), "b": range(12)})
other = pandas.Series(list("abcdefghijk") + [numpy.nan])
csvdata["a"].update(other)
print(csvdata)
a b
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
6 g 6
7 h 7
8 i 8
9 j 9
10 k 10
11 11 11
or, as long as the column names match, you can do this:
other = pandas.DataFrame({"a":list("abcdefghijk")+[numpy.nan], "b":list("abcdefghijk")+[numpy.nan]})
csvdata.update(other["a"])
