I have two pandas Series whose elements are sets of words. How can I get the element-wise intersection of the two?
print(df)
0 {this, is, good}
1 {this, is, not, good}
print(df1)
0 {this, is}
1 {good, bad}
I'm looking for an output something like the one below.
print(df2)
0 {this, is}
1 {good}
I've tried the following, but it raises an error:
df.apply(lambda x: x.intersection(df1))
TypeError: unhashable type: 'set'
It comes down to simple set arithmetic:
s1 = pd.Series([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
s2 = pd.Series([{'this', 'is'}, {'good', 'bad'}])
s1 - (s1 - s2)
#Out[122]:
#0 {this, is}
#1 {good}
#dtype: object
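An equivalent and perhaps more explicit option is Series.combine, which applies a binary function element-wise (assuming both Series share the same index):
s1.combine(s2, set.intersection)
#0 {this, is}
#1 {good}
#dtype: object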
This approach works for me:
import pandas as pd
import numpy as np
data = np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}])
data1 = np.array([{'this', 'is'},{'good', 'bad'}])
df = pd.Series(data)
df1 = pd.Series(data1)
df2 = pd.Series([df[i] & df1[i] for i in range(df.size)])
print(df2)
I appreciate the answers above. Here is a simple example that solves the same problem if you have a DataFrame (judging by your variable names df and df1, I suspect that is what you are working with).
This df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1) will do it. Let's see how I arrived at the solution.
The answer at https://stackoverflow.com/questions/266582... was helpful for me.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({
... "set": [{"this", "is", "good"}, {"this", "is", "not", "good"}]
... })
>>>
>>> df
set
0 {this, is, good}
1 {not, this, is, good}
>>>
>>> df1 = pd.DataFrame({
... "set": [{"this", "is"}, {"good", "bad"}]
... })
>>>
>>> df1
set
0 {this, is}
1 {bad, good}
>>>
>>> df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1)
0 {this, is}
1 {good}
dtype: object
>>>
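One small caveat: row[0] relies on a positional fallback of Series indexing, which newer pandas versions warn about. Using the column label is equivalent here and more future-proof (a sketch with the same frames):
>>> df.apply(lambda row: row["set"].intersection(df1.loc[row.name, "set"]), axis=1)
0 {this, is}
1 {good}
dtype: object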
How did I arrive at the above solution?
>>> df.apply(lambda x: print(x.name), axis=1)
0
1
0 None
1 None
dtype: object
>>>
>>> df.loc[0]
set {this, is, good}
Name: 0, dtype: object
>>>
>>> df.apply(lambda row: print(row[0]), axis=1)
{'this', 'is', 'good'}
{'not', 'this', 'is', 'good'}
0 None
1 None
dtype: object
>>>
>>> df.apply(lambda row: print(type(row[0])), axis=1)
<class 'set'>
<class 'set'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), df1.loc[row.name]), axis=1)
<class 'set'> set {this, is}
Name: 0, dtype: object
<class 'set'> set {bad, good}
Name: 1, dtype: object
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name])), axis=1)
<class 'set'> <class 'pandas.core.series.Series'>
<class 'set'> <class 'pandas.core.series.Series'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name][0])), axis=1)
<class 'set'> <class 'set'>
<class 'set'> <class 'set'>
0 None
1 None
dtype: object
>>>
Similar to the above, except everything is kept in one DataFrame.
Current df:
df = pd.DataFrame({0: np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}]), 1: np.array([{'this', 'is'},{'good', 'bad'}])})
Intersection of series 0 & 1
df[2] = df.apply(lambda x: x[0] & x[1], axis=1)
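If you prefer to avoid apply altogether, a plain comprehension over the two columns gives the same result:
df[2] = [a & b for a, b in zip(df[0], df[1])]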
Related
I am trying to get the datatype of every element in a DataFrame into a new DataFrame, and was wondering whether there is a native pandas function for this.
Example:
import pandas as pd
d = {'col1': [1, '2', 2.0, []], 'col2': [3, '4', 4.0, []]}
df = pd.DataFrame(data=d)
  col1 col2
0    1    3
1    2    4
2  2.0  4.0
3   []   []
Expected result:
col1 col2
0 int int
1 str str
2 float float
3 list list
Use DataFrame.applymap with type:
df = df.applymap(type)
print (df)
col1 col2
0 <class 'int'> <class 'int'>
1 <class 'str'> <class 'str'>
2 <class 'float'> <class 'float'>
3 <class 'list'> <class 'list'>
If you need to remove the class wrapper, use a lambda with __name__:
df = df.applymap(lambda x: type(x).__name__)
print (df)
col1 col2
0 int int
1 str str
2 float float
3 list list
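A side note: on pandas 2.1 or newer, DataFrame.applymap is deprecated in favour of DataFrame.map, which takes the same function:
df = df.map(lambda x: type(x).__name__)  # pandas >= 2.1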
I want to import a CSV with first column as str, and second as set. This works:
import pandas as pd, io
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval})
print(df)
print(type(df.iloc[0,0]), type(df.iloc[0,1])) # OK: str and set
But when doing it with index_col=0 to force to use column 0 as index, it does not work anymore:
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}, index_col=0)
print(df)
for a, b in df[1].items():  # iterate over the series df[1]
    print(a, b)
    print(type(a), type(b))  # <class 'int'> <class 'set'> instead of str and set!
Output:
1
0
12 {hello}
34 {bar, foo}
12 {'hello'}
<class 'int'> <class 'set'>
34 {'bar', 'foo'}
<class 'int'> <class 'set'>
Why is the str conversion missing here?
The reason is that you have set column 0 as the index; you need to change the datatype of the index column:
s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}, index_col=0)
df.index = df.index.astype(str)
for a, b in df[1].items():  # iterate over the series df[1]
    print(a, b)
    print(type(a), type(b))  # now <class 'str'> <class 'set'>
12 {'hello'}
<class 'str'> <class 'set'>
34 {'foo', 'bar'}
<class 'str'> <class 'set'>
You can load the dataframe as it is and then convert the index to str with:
df.index = df.index.astype(str)
As mentioned by @SergeBallesta in a comment, this is a shorter solution:
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}).set_index(0)
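For completeness, the full round trip with set_index(0) should now print str and set for every row:
import io
import pandas as pd

s = io.StringIO("""12,{'hello'}
34,"{'foo', 'bar'}"
""")
df = pd.read_csv(s, header=None, converters={0: str, 1: eval}).set_index(0)
for a, b in df[1].items():
    print(type(a), type(b))  # <class 'str'> <class 'set'>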
I have the following operation which takes about 1s to perform on a pandas dataframe with 200 columns:
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col_name].dropna().tolist()
              if (_item is not None) and str(_item)]
Is there a more optimal way to do this? It seems perhaps the tolist operation is a bit slow?
What I'm trying to do here is convert something like:
field field2
'2014-01-01' 1.0000000
'2015-01-01' nan
Into something like this:
values_of_field_1 = ['2014-01-01', '2015-01-01']
values_of_field_2 = [1.00000,]
So I can then infer the type of the columns. For example, the end product I'd want would be to get:
type_of_field_1 = DATE # %Y-%m-%d
type_of_field_2 = INTEGER #
It looks like you're trying to cast entire Series columns within a DataFrame to a certain type. Taking this DataFrame as an example:
>>> import pandas as pd
>>> import numpy as np
Create a DataFrame with columns with mixed types:
>>> df = pd.DataFrame({'a': [1, np.nan, 2, 'a', None, 'b'], 'b': [1, 2, 3, 4, 5, 6], 'c': [np.nan, np.nan, 2, 2, 'a', 'a']})
>>> df
a b c
0 1 1 NaN
1 NaN 2 NaN
2 2 3 2
3 a 4 2
4 None 5 a
5 b 6 a
>>> df.dtypes
a object
b int64
c object
dtype: object
>>> for col in df.select_dtypes('object'):
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'int'>
nan: <class 'float'>
2: <class 'int'>
a: <class 'str'>
None: <class 'NoneType'>
b: <class 'str'>
c
nan: <class 'float'>
nan: <class 'float'>
2: <class 'int'>
2: <class 'int'>
a: <class 'str'>
a: <class 'str'>
Use pd.Series.astype to cast object dtypes to str:
>>> for col in df.select_dtypes('object'):
...     df[col] = df[col].astype(str)
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
a: <class 'str'>
None: <class 'str'>
b: <class 'str'>
c
nan: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
2: <class 'str'>
a: <class 'str'>
a: <class 'str'>
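One caveat worth knowing: astype(str) also turns NaN and None into the literal strings 'nan' and 'None', as the output above shows. If you want missing values to stay missing, one option is to convert only the non-null entries (a sketch):
>>> for col in df.select_dtypes('object'):
...     df[col] = df[col].where(df[col].isna(), df[col].astype(str))
...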
If you think tolist() is making your code slow, you can simply remove it; there is no need for tolist() at all. The code below gives the same output.
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col_name].dropna()
              if (_item is not None) and str(_item)]
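Since the end goal is to infer each column's type, another option is to let pandas attempt the conversion on the whole column at once: pd.to_numeric and pd.to_datetime with errors='coerce' return NaN/NaT for values that do not parse. A rough sketch (the date format and the "every non-null value must parse" rule are my own assumptions):
def guess_type(s):
    # classify a column as NUMERIC, DATE or TEXT based on whether its
    # non-null values survive a vectorised conversion
    non_null = s.dropna()
    if pd.to_numeric(non_null, errors='coerce').notna().all():
        return 'NUMERIC'
    if pd.to_datetime(non_null, format='%Y-%m-%d', errors='coerce').notna().all():
        return 'DATE'
    return 'TEXT'

col_types = {col: guess_type(df[col]) for col in mycols}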
This might be a fundamental misunderstanding on my part, but I would expect pandas.Series.str to convert the pandas.Series values into strings.
However, when I do the following, numeric values in the series are converted to np.nan:
df = pd.DataFrame({'a': ['foo ', 'bar', 42]})
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
print(df)
Out:
a
0 foo
1 bar
2 NaN
If I apply the str function to each column first, numeric values are converted to strings instead of np.nan:
df = pd.DataFrame({'a': ['foo ', 'bar', 42]})
df = df.apply(lambda x: x.apply(str) if x.dtype == 'object' else x)
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
print(df)
Out:
a
0 foo
1 bar
2 42
The documentation is fairly scant on this topic. What am I missing?
In this line:
df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
The x.dtype is looking at the entire Series (column). The column is not numeric, so the entire column is operated on as strings.
In your second example, the number is not preserved, it is a string '42'.
The difference in output comes down to the difference between pandas' .str and Python's str.
In the case of pandas' .str, this is not a conversion; it is an accessor that lets you apply .strip() to each element. This means you end up applying .strip() to an integer, which fails, and pandas falls back to NaN for that element.
In the case of .apply(str), you are actually converting the values to strings. When you later apply .strip(), it succeeds, because each value is already a string and can therefore be stripped.
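You can see the .str behaviour in isolation with a tiny example:
>>> pd.Series(['foo ', 42]).str.strip()
0    foo
1    NaN
dtype: object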
The way you are using .apply is by columns, so note while:
>>> df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
a
0 foo
1 bar
2 NaN
It acted on the column; x.dtype was always object.
>>> df.apply(lambda x:x.dtype)
a object
dtype: object
If you did go by row, using axis=1, you'd still see the same behavior:
>>> df.apply(lambda x:x.dtype, axis=1)
0 object
1 object
2 object
dtype: object
Lo and behold:
>>> df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x, axis=1)
a
0 foo
1 bar
2 NaN
>>>
So, when it says object dtype, it means Python object. So consider a non-object numeric column:
>>> S = pd.Series([1,2,3])
>>> S.dtype
dtype('int64')
>>> S[0]
1
>>> S[0].dtype
dtype('int64')
>>> isinstance(S[0], int)
False
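The element is a NumPy scalar rather than a built-in int, which you can confirm directly:
>>> import numpy as np
>>> type(S[0])
<class 'numpy.int64'>
>>> isinstance(S[0], np.integer)
True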
Whereas with this object dtype column:
>>> df
a
0 foo
1 bar
2 42
>>> df['a'][2]
42
>>> isinstance(df['a'][2], int)
True
>>>
You are effectively doing this:
>>> s = df.a.astype(str).str.strip()
>>> s
0 foo
1 bar
2 42
Name: a, dtype: object
>>> s[2]
'42'
Note:
>>> df.apply(lambda x: x.apply(str) if x.dtype == 'object' else x).a[2]
'42'
I cannot for the life of me figure out why the filter method refuses to work on my dataframes in pandas.
Here is an example showing my issue:
In [99]: dff4
Out[99]: <pandas.core.groupby.DataFrameGroupBy object at 0x1143cbf90>
In [100]: dff3
Out[100]: <pandas.core.groupby.DataFrameGroupBy object at 0x11439a810>
In [101]: dff3.groups
Out[101]:
{'iphone': [85373, 85374],
'remote_api_created': [85363,
85364,
85365,
85412]}
In [102]: dff4.groups
Out[102]: {'bye': [3], 'bye bye': [4], 'hello': [0, 1, 2]}
In [103]: dff4.filter(lambda x: len(x) >2)
Out[103]:
A B
0 0 hello
1 1 hello
2 2 hello
In [104]: dff3.filter(lambda x: len(x) >2)
Out[104]:
Empty DataFrame
Columns: [source]
Index: []
Notice how filter refuses to work on dff3.
Any help appreciated.
If you group by column name, the column is moved to the index, so your DataFrame becomes empty if no other columns are present; see:
>>> def report(x):
...     print(x)
...     return True
>>> df
source
85363 remote_api_created
85364 remote_api_created
85365 remote_api_created
85373 iphone
85374 iphone
85412 remote_api_created
>>> df.groupby('source').filter(report)
Series([], dtype: float64)
Empty DataFrame
Columns: []
Index: [85373, 85374]
Series([], dtype: float64)
Empty DataFrame
Columns: [source]
Index: []
You can group by column values:
>>> df.groupby(df['source']).filter(lambda x: len(x)>2)
source
85363 remote_api_created
85364 remote_api_created
85365 remote_api_created
85412 remote_api_created
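For what it's worth, this behaviour comes from older pandas; in recent versions the sub-frames passed to filter include the grouping column, so grouping by the column name should give the same result (output may vary by version):
>>> df.groupby('source').filter(lambda x: len(x) > 2)
source
85363 remote_api_created
85364 remote_api_created
85365 remote_api_created
85412 remote_api_created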