pandas group filter issue - python

I cannot for the life of me figure out why the filter method refuses to work on my dataframes in pandas.
Here is an example showing my issue:
In [99]: dff4
Out[99]: <pandas.core.groupby.DataFrameGroupBy object at 0x1143cbf90>
In [100]: dff3
Out[100]: <pandas.core.groupby.DataFrameGroupBy object at 0x11439a810>
In [101]: dff3.groups
Out[101]:
{'iphone': [85373, 85374],
 'remote_api_created': [85363, 85364, 85365, 85412]}
In [102]: dff4.groups
Out[102]: {'bye': [3], 'bye bye': [4], 'hello': [0, 1, 2]}
In [103]: dff4.filter(lambda x: len(x) >2)
Out[103]:
A B
0 0 hello
1 1 hello
2 2 hello
In [104]: dff3.filter(lambda x: len(x) >2)
Out[104]:
Empty DataFrame
Columns: [source]
Index: []
Notice how filter refuses to work on dff3.
Any help appreciated.

If you group by a column name, that column is moved into the index, so the dataframe passed to filter becomes empty if no other columns are present. See:
>>> def report(x):
...     print x
...     return True
>>> df
source
85363 remote_api_created
85364 remote_api_created
85365 remote_api_created
85373 iphone
85374 iphone
85412 remote_api_created
>>> df.groupby('source').filter(report)
Series([], dtype: float64)
Empty DataFrame
Columns: []
Index: [85373, 85374]
Series([], dtype: float64)
Empty DataFrame
Columns: [source]
Index: []
Instead, you can group by the column's values:
>>> df.groupby(df['source']).filter(lambda x: len(x)>2)
source
85363 remote_api_created
85364 remote_api_created
85365 remote_api_created
85412 remote_api_created
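If you'd rather sidestep the groupby quirk entirely, the same filter can be written with value_counts. A minimal sketch, assuming the single-column df from above:
>>> counts = df['source'].value_counts()  # size of each group
>>> df[df['source'].map(counts) > 2]      # keeps the same rows as the filter above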

Sum columns in a pandas dataframe which contain a string

I am trying to do something relatively simple: sum all the columns in a pandas dataframe that contain a certain string, and then make a new column in the dataframe from that sum. These columns are all numeric float values.
I can get the list of columns that contain the string I want:
StmCol = [col for col in cdf.columns if 'Stm_Rate' in col]
But when I try to sum them using:
cdf['PadStm'] = cdf[StmCol].sum()
I get a new column full of "nan" values.
You need to pass axis=1 to .sum; by default (axis=0) it sums over each column:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
A B
0 1 2
1 3 4
In [13]: df[["A"]].sum() # Here I'm passing the list of columns ["A"]
Out[13]:
A 4
dtype: int64
In [14]: df[["A"]].sum(axis=1)
Out[14]:
0 1
1 3
dtype: int64
Only the latter matches the index of df:
In [15]: df["C"] = df[["A"]].sum()
In [16]: df["D"] = df[["A"]].sum(axis=1)
In [17]: df
Out[17]:
A B C D
0 1 2 NaN 1
1 3 4 NaN 3
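Applied back to the original question, the fix is just the axis argument. A sketch reusing the question's cdf and StmCol names:
StmCol = [col for col in cdf.columns if 'Stm_Rate' in col]
cdf['PadStm'] = cdf[StmCol].sum(axis=1)  # row-wise sum, aligned with cdf's index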

How can I get the intersection of two pandas series text columns?

I have two pandas series of text columns. How can I get the intersection of those?
print(df)
0 {this, is, good}
1 {this, is, not, good}
print(df1)
0 {this, is}
1 {good, bad}
I'm looking for an output something like below.
print(df2)
0 {this, is}
1 {good}
I've tried this:
df.apply(lambda x: x.intersection(df1))
but it returns:
TypeError: unhashable type: 'set'
This looks like simple set logic:
s1 = pd.Series([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
s2 = pd.Series([{'this', 'is'}, {'good', 'bad'}])
s1 - (s1 - s2)
#Out[122]:
#0 {this, is}
#1 {good}
#dtype: object
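If the double subtraction reads as too clever, Series.combine applies a function element-wise. A sketch assuming s1 and s2 share the same index:
s1.combine(s2, lambda a, b: a & b)
#Out:
#0 {this, is}
#1 {good}
#dtype: object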
This approach works for me:
import pandas as pd
import numpy as np
data = np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}])
data1 = np.array([{'this', 'is'},{'good', 'bad'}])
df = pd.Series(data)
df1 = pd.Series(data1)
df2 = pd.Series([df[i] & df1[i] for i in range(df.size)])
print(df2)
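A slightly more idiomatic variant of the same element-wise intersection uses zip instead of positional indexing (this assumes the two Series are positionally aligned):
df2 = pd.Series([a & b for a, b in zip(df, df1)])
print(df2)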
I appreciate the above answers. Here is a simple example solving the same problem if you have a DataFrame (judging from your variable names like df & df1, I guess you asked this about a DataFrame).
This df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1) will do it. Let's see how I reached that solution.
The answer at https://stackoverflow.com/questions/266582... was helpful for me.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({
... "set": [{"this", "is", "good"}, {"this", "is", "not", "good"}]
... })
>>>
>>> df
set
0 {this, is, good}
1 {not, this, is, good}
>>>
>>> df1 = pd.DataFrame({
... "set": [{"this", "is"}, {"good", "bad"}]
... })
>>>
>>> df1
set
0 {this, is}
1 {bad, good}
>>>
>>> df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1)
0 {this, is}
1 {good}
dtype: object
>>>
How did I reach the above solution?
>>> df.apply(lambda x: print(x.name), axis=1)
0
1
0 None
1 None
dtype: object
>>>
>>> df.loc[0]
set {this, is, good}
Name: 0, dtype: object
>>>
>>> df.apply(lambda row: print(row[0]), axis=1)
{'this', 'is', 'good'}
{'not', 'this', 'is', 'good'}
0 None
1 None
dtype: object
>>>
>>> df.apply(lambda row: print(type(row[0])), axis=1)
<class 'set'>
<class 'set'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), df1.loc[row.name]), axis=1)
<class 'set'> set {this, is}
Name: 0, dtype: object
<class 'set'> set {good}
Name: 1, dtype: object
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name])), axis=1)
<class 'set'> <class 'pandas.core.series.Series'>
<class 'set'> <class 'pandas.core.series.Series'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name][0])), axis=1)
<class 'set'> <class 'set'>
<class 'set'> <class 'set'>
0 None
1 None
dtype: object
>>>
Similar to the above, except here everything is kept in one dataframe.
Current df:
df = pd.DataFrame({0: np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}]), 1: np.array([{'this', 'is'},{'good', 'bad'}])})
Intersection of series 0 & 1:
df[2] = df.apply(lambda x: x[0] & x[1], axis=1)

pandas ValueError: Can only compare identically-labeled Series objects python

I have two CSV files that I'm comparing, returning side by side only the columns that have different values. If one value is empty in one of the columns, the code throws an error:
ValueError: Can only compare identically-labeled Series objects
import pandas as pd
df1=pd.read_csv('csv1.csv')
df2=pd.read_csv('csv2.csv')
def process_df(df):
    res = df.set_index('Country').stack()
    res.index.rename('Column', level=1, inplace=True)
    return res
df1 = process_df(df1)
df2 = process_df(df2)
mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
df3 = pd.concat([df1[mask], df2[mask]], axis=1).rename({0:'From', 1:'To'}, axis=1)
print(df3)
My current output without missing values:
From To
Country Column
Bermuda 1980 0.00793 0.00093
1981 0.00687 0.00680
1986 0.00700 1.00700
Mexico 1980 3.72819 3.92819
If some values are missing I just want an empty cell, like the example below:
From To
Country Column
Bermuda 1980 0.00793 0.00093
1981 0.00687 <--- Missing value
1986 0.00700 1.00700
Mexico 1980 3.72819 3.92819
The issue is that the indexes don't match... As a simplified example (note that if you pass an empty element ('') into df1 instead of, say, the [4] element it produces the same result):
In [21]: df1 = pd.DataFrame([[1], [4]])
In [22]: df1
Out[22]:
0
0 1
1 4
Using the same DF structure but changing the index...
In [23]: df2 = pd.DataFrame([[3], [2]], index=[1, 0])
In [24]: df2
Out[24]:
0
1 3
0 2
Now to compare...
In [25]: df1[0] == df2[0]
ValueError: Can only compare identically-labeled Series objects
To prove out the index issue - recast df2 without the reverse index...
In [26]: df3 = pd.DataFrame([[3], [2]])
In [27]: df3
Out[27]:
0
0 3
1 2
And the resulting comparison:
In [28]: df1[0] == df3[0]
Out[28]:
0 False
1 False
Name: 0, dtype: bool
The Fix
You'll have to reindex one of the dataframes, like so (this uses a "sortable" index, so it will be more challenging for a more complex multi-index):
In [44]: df2.sort_index(inplace=True)
In [45]: df1[0] == df2[0]
Out[45]:
0 False
1 False
Name: 0, dtype: bool
If you can provide the CSV data, we could give it a try with a multi index...
Multi-Index
The .sort_index() method has a level= parameter. You can pass an int, a level name, or a list of ints or level names. So you could do something like:
df2.sort_index(level='level_name', inplace=True)
# as a list of levels... it will all depend on your original df index
levels = ['level_name1', 'level_name2']
df2.sort_index(level=levels, inplace=True)
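As an alternative sketch, you can align explicitly instead of sorting. This assumes the two stacked Series returned by process_df share the same index labels:
df2_aligned = df2.reindex(df1.index)  # force identical label order
mask = (df1 != df2_aligned) & ~(df1.isnull() & df2_aligned.isnull())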

How to iterate over column values for unique rows of a data frame with sorted, numerical index with duplicates in pandas?

I have a pandas DataFrame with a sorted, numerical index that contains duplicates, and the column values are identical for rows with the same index value. I would like to iterate through the values of a given column for the unique values of the index.
Example
df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])
a b
1 3 4
1 3 6
2 5 8
I want to iterate through the values in column a for the unique entries in the index - [3,5].
When I iterate using the default index and print the type for column a, I get Series objects for the duplicated index entries.
for i in df.index:
    cell_value = df['a'].loc[i]
    print(type(cell_value))
Output:
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'numpy.int64'>
First remove the duplicated index entries with a mask and assign positions with arange, then select with iloc:
arr = np.arange(len(df.index))
a = arr[~df.index.duplicated()]
print (a)
[0 2]
for i in a:
    cell_value = df['a'].iloc[i]
    print(type(cell_value))
<class 'numpy.int64'>
<class 'numpy.int64'>
A no-loop solution: use boolean indexing with duplicated and invert the mask with ~:
a = df.loc[~df.index.duplicated(), 'a']
print (a)
1 3
2 5
Name: a, dtype: int64
b = df.loc[~df.index.duplicated(), 'a'].tolist()
print (b)
[3, 5]
print (~df.index.duplicated())
[ True False True]
Try np.unique:
_, i = np.unique(df.index, return_index=True)
df.iloc[i, df.columns.get_loc('a')].tolist()
[3, 5]
This seems like an XY problem if, as per your comment, the same index means the same data.
You also don't need a loop for this.
Assuming you want to remove duplicate rows and extract the first column only (i.e. 3, 5), the below should suffice.
res = df.drop_duplicates().loc[:, 'a']
# 1 3
# 2 5
# Name: a, dtype: int64
To return types:
types = list(map(type, res))
print(types)
# [<class 'numpy.int64'>, <class 'numpy.int64'>]
Another solution using groupby and apply:
df.groupby(level=0).apply(lambda x: type(x.a.iloc[0]))
Out[330]:
1 <class 'numpy.int64'>
2 <class 'numpy.int64'>
dtype: object
To make your loop solution work, create a temporary df:
df_new = df.groupby(level=0).first()
for i in df_new.index:
    cell_value = df_new['a'].loc[i]
    print(type(cell_value))
<class 'numpy.int64'>
<class 'numpy.int64'>
Or use drop_duplicates():
for i in df.drop_duplicates().index:
    cell_value = df.drop_duplicates()['a'].loc[i]
    print(type(cell_value))
<class 'numpy.int64'>
<class 'numpy.int64'>

Pandas DataFrame list strange behavior

When I assign a new column to one dataframe in the list, it appears in all the other dataframes as well. Example:
In [219]: a = [pd.DataFrame()]*2
In [220]: a[0]['a'] = [1,2,3]
In [221]: a[1]
Out[221]:
a
0 1
1 2
2 3
Is this a bug? And what can I do to prevent it?
Thanks!
This happens because when you define a list with the syntax
x = [something]*n
you end up with a list where each item is THE SAME something. It doesn't create copies; every element references the SAME object:
>>> import pandas as pd
>>> a=pd.DataFrame()
>>> g=[a]*2
>>> g
1: [Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>> id(g[0])
4: 129264216L
>>> id(g[1])
5: 129264216L
The comments point to some useful examples, which you should read through and grok.
To avoid it in your situation, just use another way of instantiating the list:
>>> map(lambda x: pd.DataFrame(), range(2))
6: [Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>> [pd.DataFrame() for i in range(2)]
7: [Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>>
EDIT: I now see that there is an explanation for this in the replies^
I don't yet understand what causes this, but you can get around it by defining your dataframes separately before putting them in a list.
In [2]: df1 = pd.DataFrame()
In [3]: df2 = pd.DataFrame()
In [4]: a = [df1, df2]
In [5]: a[0]['a'] = [1,2,3]
In [6]: a[0]
Out[6]:
a
0 1
1 2
2 3
In [7]: a[1]
Out[7]:
Empty DataFrame
Columns: []
Index: []
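For completeness, here is the list-comprehension fix from the answer above applied to the original example; each DataFrame is now a distinct object:
a = [pd.DataFrame() for _ in range(2)]  # two independent DataFrames
a[0]['a'] = [1, 2, 3]
a[1]  # still an empty DataFrame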
