I'm looking for the fastest and most idiomatic analog of the SQL MINUS (a.k.a. EXCEPT) operator.
Here is what I mean - given two Pandas DataFrames as follows:
In [77]: d1
Out[77]:
a b c
0 0 0 1
1 0 1 2
2 1 0 3
3 1 1 4
4 0 0 5
5 1 1 6
6 2 2 7
In [78]: d2
Out[78]:
a b c
0 1 1 10
1 0 0 11
2 1 1 12
How can I find the result of d1 MINUS d2, taking into account only columns "a" and "b", in order to get the following result:
In [62]: res
Out[62]:
a b c
1 0 1 2
2 1 0 3
6 2 2 7
MVCE:
import pandas as pd

d1 = pd.DataFrame({
'a': [0, 0, 1, 1, 0, 1, 2],
'b': [0, 1, 0, 1, 0, 1, 2],
'c': [1, 2, 3, 4, 5, 6, 7]
})
d2 = pd.DataFrame({
'a': [1, 0, 1],
'b': [1, 0, 1],
'c': [10, 11, 12]
})
What I have tried:
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
In [68]: res
Out[68]:
a b c
1 0 1 2
2 1 0 3
6 2 2 7
It gives me the correct result, but I have a feeling that there must be a more idiomatic and cleaner way to achieve this.
PS: the DataFrame.isin() method won't help in this case, as it produces a wrong result set.
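A minimal sketch of why per-column isin() goes wrong: "a" and "b" are matched independently, so a pair like (0, 1) in d1 is flagged even though it never occurs as a pair in d2:
mask = d1['a'].isin(d2['a']) & d1['b'].isin(d2['b'])
print(d1[~mask])   # keeps only row 6 instead of rows 1, 2 and 6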
Execution time comparison for larger data sets:
In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)
In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)
In [102]: df1.shape
Out[102]: (700000, 3)
In [103]: df2.shape
Out[103]: (300000, 3)
pd.concat().drop_duplicates() approach:
In [10]: %%timeit
...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
...:
...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
multi-index "not isin" approach:
In [11]: %%timeit
...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
...:
...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
multi-index difference approach:
In [12]: %%timeit
...: tmp1 = df1.reset_index().set_index(["a", "b"])
...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
...: res = df1.loc[tmp1.loc[idx, "index"]]
...:
...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
merge(how="outer") approach - gives me a MemoryError:
In [106]: %%timeit
...: res = (df1.reset_index()
...: .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
...: .query('_merge == "left_only"')
...: .set_index('index')
...: .rename_axis(None)
...: .reindex(df1.columns, axis=1))
...:
...:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
compare concatenated strings approach:
In [13]: %%timeit
...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
...:
...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am thinking a little bit like Excel here:
d1[~d1[['a','b']].astype(str).sum(axis=1).isin(d2[['a','b']].astype(str).sum(axis=1))]
a b c
1 0 1 2
2 1 0 3
6 2 2 7
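One caveat for this string-concatenation trick (a small sketch): distinct pairs can collide once concatenated, e.g. (1, 11) and (11, 1) both become "111", so in general a separator between the columns is needed:
demo = pd.DataFrame({'a': [1, 11], 'b': [11, 1]})
print(demo.astype(str).sum(axis=1))              # both rows -> "111"
print(demo.astype(str).apply('|'.join, axis=1))  # "1|11" vs "11|1"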
One possible solution with merge and indicator=True:
df = (d1.reset_index()
.merge(d2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
.query('_merge == "left_only"')
.set_index('index')
.rename_axis(None)
.reindex(d1.columns, axis=1))
print (df)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
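To see what the indicator column looks like before filtering (a quick sketch): shared (a, b) pairs are tagged "both", while rows present only in d1 are "left_only":
merged = d1.reset_index().merge(d2, on=['a', 'b'], how='outer', indicator=True, suffixes=('', '_'))
print(merged[['a', 'b', '_merge']])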
Solution with isin:
df = d1[~d1.set_index(["a", "b"]).index.isin(d2.set_index(["a","b"]).index)]
print (df)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
We can use pandas.concat with drop_duplicates here, passing keep=False so that all duplicates are dropped:
pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
Edit after a comment by the OP
If you want to make sure that rows unique to d2 aren't kept in the result, we can duplicate that DataFrame:
pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
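A small sketch (with a hypothetical extra row added to d2) of why the duplication matters: with keep=False alone, a pair that exists only in d2 survives the de-duplication and wrongly shows up in the result:
d2_extra = pd.concat([d2, pd.DataFrame({'a': [9], 'b': [9], 'c': [13]})], ignore_index=True)
print(pd.concat([d1, d2_extra]).drop_duplicates(['a', 'b'], keep=False))                    # the (9, 9) row leaks in
print(pd.concat([d1, pd.concat([d2_extra] * 2)]).drop_duplicates(['a', 'b'], keep=False))   # only rows from d1 remain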
I had a similar question and tried your approach
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
as a test, and it works.
However, when I used it on two SQLite databases with the same structure (the same tables and the same table columns), I ran into some errors: it seems the two DataFrames don't have the same shape.
If you are happy to give me a hand and want more details, we can have a further conversation. Thanks a lot.
I have a dataframe whose first column has 11 rows. I want to create a second column that counts from 1 to 4, then resets and counts from 1 to 4 again, stopping when it reaches the last row.
For instance, I have df['item'] and the code should create df['new column']:
df['item'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
df['new column'] = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3]
Use modulo with 4 and add 1:
import numpy as np
import pandas as pd

df = pd.DataFrame({'item': list('abcdefghijk')})
# default index solution
df['new column'] = df.index % 4 + 1
# general solution
# df['new column'] = np.arange(len(df)) % 4 + 1
print(df)
Output:
item new column
0 a 1
1 b 2
2 c 3
3 d 4
4 e 1
5 f 2
6 g 3
7 h 4
8 i 1
9 j 2
10 k 3
For a large DataFrame, the performance of the solutions differs:
df = pd.DataFrame({'a':range(1000000)})
In [307]: %timeit df['new column'] = (len(df)*[1, 2, 3, 4])[:len(df)]
363 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [308]: %timeit df['new column1'] = df.index % 4 + 1
35.1 ms ± 416 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [309]: %timeit df['new column2'] = np.arange(len(df)) % 4 + 1
14.4 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can repeat the list [1, 2, 3, 4] n times simply by doing n * [1, 2, 3, 4]. Thus your new column is created with:
df['new column'] = (len(df)*[1, 2, 3, 4])[:len(df)]
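If you prefer to build only as many repetitions as needed, a small sketch using numpy.tile (assuming numpy is imported as np):
df['new column'] = np.tile([1, 2, 3, 4], len(df) // 4 + 1)[:len(df)]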
I'm trying to create a new column in my DataFrame that is a list of aggregated column names. Here's a sample DataFrame:
In [1]: df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
In [2]: df
Out[2]:
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
I'd like to create a new column containing a list of column names where a certain condition is met. Say that I'm interested in columns where value > 3 -- I would want an output that looks like this:
In [3]: df
Out[3]:
A B C D E F Flag
0 1 4 7 1 5 7 ['B', 'C', 'E', 'F']
1 2 5 8 3 3 4 ['B', 'C', 'F']
2 3 6 9 5 6 3 ['B', 'C', 'D', 'E']
Currently, I'm using apply:
df['Flag'] = df.apply(lambda row: [list(df)[i] for i, j in enumerate(row) if j > 3], axis = 1)
This gets the job done, but feels clunky and I'm wondering if there is a more elegant solution.
Thanks!
Use df.dot() here:
df['Flag']=(df>3).dot(df.columns).apply(list)
print(df)
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
I still like a for loop here:
df['Flag']=[df.columns[x].tolist() for x in df.gt(3).values]
df
Out[968]:
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
One option is to create a dataframe of booleans by checking which values are above a certain threshold with DataFrame.gt, and take the dot product with the column names. Finally use apply(list) to obtain lists from the resulting strings:
df['Flag'] = df.gt(3).dot(df.columns).apply(list)
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
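To see what the dot product produces before apply(list), here is a short sketch; note that splitting the string back into a list only works cleanly because the column names are single characters:
print(df.gt(3).dot(df.columns))
0    BCEF
1     BCF
2    BCDE
dtype: object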
Another way:
df['Flag'] = df.T.apply(lambda x: list(x[x>3].index))
Edit: adding timings for all the solutions to this question.
I prefer a solution without apply:
df['Flag'] = df.reset_index().melt(id_vars='index', value_name='val', var_name='col').query('val > 3').groupby('index')['col'].agg(list)
Or
df['Flag'] = df.stack().rename('val').reset_index(level=1).query('val > 3').groupby(level=0)['level_1'].agg(list)
Out[2576]:
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
Test data:
a = [
[1, 4, 7, 1, 5, 7],
[2, 5, 8, 3, 3, 4],
[3, 6, 9, 5, 6, 3],
] * 10000
df = pd.DataFrame(a, columns = list('ABCDEF'))
Timing with %timeit:
In [79]: %timeit (df>3).dot(df.columns).apply(list)
40.8 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit [df.columns[x].tolist() for x in df.gt(3).values]
1.23 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [81]: %timeit df.gt(3).dot(df.columns).apply(list)
37.6 ms ± 644 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [82]: %timeit df.T.apply(lambda x: list(x[x>3].index))
16.4 s ± 99.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [83]: %timeit df.stack().rename('val').reset_index(level=1).query('val > 3')
...: .groupby(level=0)['level_1'].agg(list)
4.05 s ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [84]: %timeit df.apply(lambda x: df.columns[np.argwhere(x>3).ravel()].values
...: , 1)
c:\program files\python37\lib\site-packages\numpy\core\fromnumeric.py:56: FutureWarning: Series.nonzero() is deprecated and will be removed in a future version. Use Series.to_numpy().nonzero() instead
  return getattr(obj, method)(*args, **kwds)
12 s ± 45.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The fastest solutions are the ones using .dot.
Using numpy.argwhere and ravel():
df.apply(lambda x: df.columns[np.argwhere(x>3).ravel()].values, 1)
We can also use the @ (matrix multiplication) operator:
df['Flag'] = ((df > 3) @ df.columns).map(list)
I have a dataframe df:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([('x', 'y')] + [('y', 'x')] +
                           [0, np.nan] * 2, dtype=object), columns=['Col'])
df
How can df be split into two columns as follows?:
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
Use a list comprehension, converting scalars to tuples:
df1 = pd.DataFrame([x if isinstance(x, tuple) else (x,x) for x in df['Col']],
columns=['Col1','Col2'])
print (df1)
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
More general solution:
lens = int(df['Col'].str.len().max())
df1 = pd.DataFrame([x if isinstance(x, tuple) else [x] * lens for x in df['Col']])
Another solution, slower on large data:
df1 = df['Col'].apply(pd.Series).ffill(axis=1)
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [51]: %%timeit
...: df1 = pd.DataFrame([x if isinstance(x, tuple) else (x,x) for x in df['Col']],
...: columns=['Col1','Col2'])
...:
2.42 ms ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %%timeit
...: df['Col'].apply(pd.Series).ffill(axis=1)
...:
1 s ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#coldspeed solution
In [53]: %%timeit
...: v = pd.to_numeric(df.Col, errors='coerce')
...: pd.DataFrame({
...: 'Col1': v.fillna(df.Col.str[0]),
...: 'Col2': v.fillna(df.Col.str[-1])})
...:
15.8 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A nice, concise solution is to use pd.to_numeric to convert non-numeric data to NaN, and then fillna.
v = pd.to_numeric(df.Col, errors='coerce')
pd.DataFrame({
'Col1': v.fillna(df.Col.str[0]),
'Col2': v.fillna(df.Col.str[-1])})
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
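Why .str[0] works here (a sketch): the .str accessor indexes into each element, so the tuples yield their first item while the non-indexable values become NaN, which is exactly where v already holds the numeric data:
print(df.Col.str[0])
0      x
1      y
2    NaN
3    NaN
4    NaN
5    NaN
Name: Col, dtype: object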
Solution for multiple possible columns:
pd.DataFrame({
f'Col{i+1}': v.fillna(df.Col.str[i])
for i in range(int(df.Col.str.len().max()))})
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
My DataFrame has around 9K columns, and I want to remove the . from every column name, see example column names below:
`traffic.seas1`
`traffic.seas2`
`traffic.seas3`
These are just three; I have 9K columns, some without a dot but many with one. How can I remove the dots efficiently? Using rename is too manual.
You can use str.replace:
df.columns = df.columns.str.replace('.','')
Or list comprehension with replace:
df.columns = [x.replace('.','') for x in df.columns]
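Note that depending on your pandas version, a bare '.' passed to str.replace may be interpreted as a regular expression, where it matches every character; passing regex=False makes the literal replacement explicit:
df.columns = df.columns.str.replace('.', '', regex=False)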
Sample:
df = pd.DataFrame({'traffic.seas1':list('abcdef'),
'traffic.seas2':[4,5,4,5,5,4],
'traffic.seas3':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
D E F traffic.seas1 traffic.seas2 traffic.seas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
df.columns = df.columns.str.replace('.','')
print (df)
D E F trafficseas1 trafficseas2 trafficseas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
Timings:
N = 9000
df = pd.DataFrame(np.random.randint(10, size=(3, N))).add_prefix('traffic.seas')
print (df)
In [161]: %timeit df.columns = df.columns.str.replace('.','')
4.4 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [162]: %timeit df.columns = [x.replace('.','') for x in df.columns]
2.53 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use list comprehension on df.columns like this:
df.columns = [c.replace('.', '') for c in df.columns]
For example:
df = pd.DataFrame({'foo': [1], 'bar.z': [2]})
>>> df.columns
Index(['bar.z', 'foo'], dtype='object')
df.columns = [c.replace('.', '') for c in df.columns]
>>> df
barz foo
0 2 1
For each row of a pandas dataframe, I'm trying to find the names of the columns whose values are greater than the value in another column.
For example, if I have the following dataframe:
A B C D threshold
0 1 3 3 1 2
1 2 3 6 1 5
2 9 5 0 2 4
For each row I would like to return the names of the columns where the values are greater than the threshold, so I would have:
0: B, C
1: C
2: A, B
Any help would be much appreciated!
If you want a large increase in speed, you can use NumPy's vectorized where function.
s = np.where(df.gt(df['threshold'],0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
pd.Series([''.join(x).strip(', ') for x in s])
0 B, C
1 C
2 A, B
dtype: object
This is more than an order of magnitude faster than the solutions from jezrael and MaxU when using a dataframe of 100,000 rows. Here I create the test DataFrame first.
n = 100000
df = pd.DataFrame(np.random.randint(0, 10, (n, 5)),
columns=['A', 'B', 'C', 'D', 'threshold'])
Timings
%%timeit
>>> s = np.where(df.gt(df['threshold'],0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
>>> pd.Series([''.join(x).strip(', ') for x in s])
280 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
>>> df1 = df.drop('threshold', 1).gt(df['threshold'], 0)
>>> df1 = df1.apply(lambda x: ', '.join(x.index[x]),axis=1)
3.15 s ± 82.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
>>> x = df.drop('threshold',1)
>>> x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))
3.28 s ± 145 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
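The hard-coded label list above assumes the exact column order A, B, C, D, threshold; here is a sketch that derives the labels from the columns instead (same np.where idea):
cols = df.columns.drop('threshold')
labels = np.where(df[cols].gt(df['threshold'], axis=0), [c + ', ' for c in cols], '')
res = pd.Series([''.join(row).rstrip(', ') for row in labels], index=df.index)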
You can use:
df1 = df.drop('threshold', 1).gt(df['threshold'], 0)
df1 = df1.apply(lambda x: ', '.join(x.index[x]),axis=1)
print (df1)
0 B, C
1 C
2 A, B
dtype: object
Similar solution:
df1 = (df.drop('threshold', 1).gt(df['threshold'], 0)
         .stack()
         .rename_axis(('a','b'))
         .reset_index(name='boolean'))
a = df1[df1['boolean']].groupby('a')['b'].apply(', '.join).reset_index()
print (a)
a b
0 0 B, C
1 1 C
2 2 A, B
you can do it this way:
In [99]: x = df.drop('threshold',1)
In [100]: x
Out[100]:
A B C D
0 1 3 3 1
1 2 3 6 1
2 9 5 0 2
In [102]: x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))
Out[102]:
0 B, C
1 C
2 A, B
dtype: object
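For completeness, the dot-product trick from the Flag-column question above works here too (a sketch, assuming numpy is imported as np):
x = df.drop(columns='threshold')
labels = np.array([c + ', ' for c in x.columns], dtype=object)
df['cols'] = x.gt(df['threshold'], axis=0).dot(labels).str.rstrip(', ')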