I have a df as follows.
TimeStamp,Value
t1,akak
t2,bb
t3,vvv
t4,ff
t5,ff
t6,44
t7,99
t8,kfkkf
t9,ff
t10,oo
I want to split the df into chunks of 2 rows and assign a class label based on the group number.
TimeStamp,Value,class
t1,akak,c1
t2,bb,c1
t3,vvv,c2
t4,ff,c2
t5,ff,c3
t6,44,c3
t7,99,c4
t8,kfkkf,c4
t9,ff,c5
t10,oo,c5
One approach is to iterate and assign the labels one row at a time. Is there a built-in way in pandas to do this?
Another possible solution:
df['class'] = ['c' + str(1+x) for x in np.repeat(range(int(len(df)/2)), 2)]
Output:
TimeStamp Value class
0 t1 akak c1
1 t2 bb c1
2 t3 vvv c2
3 t4 ff c2
4 t5 ff c3
5 t6 44 c3
6 t7 99 c4
7 t8 kfkkf c4
8 t9 ff c5
9 t10 oo c5
Try this:
df.assign(Class=(df.index//2+1).map('c{}'.format))
Output:
TimeStamp Value Class
0 t1 akak c1
1 t2 bb c1
2 t3 vvv c2
3 t4 ff c2
4 t5 ff c3
5 t6 44 c3
6 t7 99 c4
7 t8 kfkkf c4
8 t9 ff c5
9 t10 oo c5
You could do:
df['class'] = [i//2 for i in range(len(df))]
This is a fairly limited answer, though: you might want to derive the group ID from values in other columns, or you may have a specific label in mind for the class column. In that case you can follow up with a map on the resulting series to turn the group numbers into something else.
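As a sketch of that follow-up (the sample frame and the c-style labels below are just illustrative), you could map the integer group IDs to string labels like this:
import pandas as pd

# Hypothetical frame standing in for the question's df.
df = pd.DataFrame({'TimeStamp': [f't{i}' for i in range(1, 11)],
                   'Value': list('abcdefghij')})

# Integer group IDs: 0, 0, 1, 1, 2, 2, ...
df['class'] = [i // 2 for i in range(len(df))]

# Follow-up: map the group numbers to whatever labels you actually want.
df['class'] = df['class'].map(lambda g: f'c{g + 1}')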
You can use this to achieve what you want:
df["class"] = [f"c{(i // 2) + 1}" for i in range(df.shape[0])]
You can vectorize the operation with numpy:
import numpy as np
df['class'] = np.core.defchararray.add('c', (np.arange(len(df))//2+1).astype(str))
Or, with a Series:
df['class'] = pd.Series(np.arange(len(df))//2+1, index=df.index, dtype='string').radd('c')
Output:
TimeStamp Value class
0 t1 akak c1
1 t2 bb c1
2 t3 vvv c2
3 t4 ff c2
4 t5 ff c3
5 t6 44 c3
6 t7 99 c4
7 t8 kfkkf c4
8 t9 ff c5
9 t10 oo c5
I have a dataframe like this, and I can group it by the library and sample columns to create new columns:
df = pd.DataFrame({'barcode': ['b1', 'b2','b1','b2','b1',
'b2','b1','b2'],
'library': ['l1', 'l1','l1','l1','l2', 'l2','l2','l2'],
'sample': ['s1','s1','s2','s2','s1','s1','s2','s2'],
'category': ['c1', 'c2','c1','c2','c1', 'c2','c1','c2'],
'count': [10,21,13,54,51,16,67,88]})
df
barcode library sample category count
0 b1 l1 s1 c1 10
1 b2 l1 s1 c2 21
2 b1 l1 s2 c1 13
3 b2 l1 s2 c2 54
4 b1 l2 s1 c1 51
5 b2 l2 s1 c2 16
6 b1 l2 s2 c1 67
7 b2 l2 s2 c2 88
I used grouping to reduce the dimensions of the df:
grp=df.groupby(['library','sample'])
df=grp.get_group(('l1','s1')).rename(columns={"count":
"l1_s1_count"}).reset_index(drop=True)
df['l1_s2_count']=grp.get_group(('l1','s2'))[['count']].values
df['l2_s1_count']=grp.get_group(('l2','s1'))[['count']].values
df['l2_s2_count']=grp.get_group(('l2','s2'))[['count']].values
df=df.drop(['sample','library'],axis=1)
Result:
barcode category l1_s1_count l1_s2_count l2_s1_count l2_s2_count
0 b1 c1 10 13 51 67
1 b2 c2 21 54 16 88
I think there should be a neater way to do this transformation, for example with a pivot table, which I could not get to work. Could you please suggest how this could be done with pivot_table? Thanks.
Try the pivot_table function as below. It will produce a MultiIndex result, which will need to be flattened.
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library'], values='count').reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
Output:
barcode_ category_ s1_l1 s1_l2 s2_l1 s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Or even without values='count':
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library']).reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
Output:
barcode__ category__ count_s1_l1 count_s1_l2 count_s2_l1 count_s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Use whichever you prefer.
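If the trailing underscores in the flattened names (barcode_, category__, etc.) are unwanted, a small tweak (not part of the original answer) is to strip them while joining:
df2.columns = ["_".join(a).rstrip("_") for a in df2.columns.to_flat_index()]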
I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big, so I cannot use the solutions available on the internet; they are very slow and memory-inefficient.
Here is what my data looks like:
df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
A B C D E
0 x1 [v1, v2] [c1, c2] [d1, d2] [e1, e2]
1 x2 [v3, v4] [c3, c4] [d3, d4] [e3, e4]
2 x3 [v5, v6] [c5, c6] [d5, d6] [e5, e6]
3 x4 [v7, v8] [c7, c8] [d7, d8] [e7, e8]
And this is the shape of my data: (441079, 12)
My desired output is:
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
.....
EDIT: After being marked as a duplicate, I would like to stress that in this question I was looking for an efficient method of exploding multiple columns. The accepted answer can explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).
pandas >= 0.25
Assuming all columns contain lists of the same length, you can call Series.explode on each column.
df.set_index(['A']).apply(pd.Series.explode).reset_index()
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
The idea is to first set as the index all columns that must NOT be exploded, then reset the index afterwards.
It's also faster.
%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
%%timeit
(df.set_index('A')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop('level_1', 1))
2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
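As a small sketch of that idea (the extra key column here is hypothetical, not from the question), any columns that should stay untouched simply join A in set_index:
import pandas as pd

# Hypothetical frame with two columns that must not be exploded.
other = pd.DataFrame({'A': ['x1', 'x2'],
                      'key': ['k1', 'k2'],
                      'B': [['v1', 'v2'], ['v3', 'v4']],
                      'C': [['c1', 'c2'], ['c3', 'c4']]})

# Both A and key are preserved; B and C are exploded element-wise.
out = other.set_index(['A', 'key']).apply(pd.Series.explode).reset_index()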
Use set_index on A, then apply and stack the values on the remaining columns. All of this is condensed into a one-liner.
In [1253]: (df.set_index('A')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop('level_1', 1))
Out[1253]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
def explode(df, lst_cols, fill_value=''):
# make sure `lst_cols` is a list
if lst_cols and not isinstance(lst_cols, list):
lst_cols = [lst_cols]
# all columns except `lst_cols`
idx_cols = df.columns.difference(lst_cols)
# calculate lengths of lists
lens = df[lst_cols[0]].str.len()
if (lens > 0).all():
# ALL lists in cells aren't empty
return pd.DataFrame({
col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
for col in idx_cols
}).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
.loc[:, df.columns]
else:
# at least one list in cells is empty
return pd.DataFrame({
col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
for col in idx_cols
}).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
.append(df.loc[lens==0, idx_cols]).fillna(fill_value) \
.loc[:, df.columns]
Usage:
In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
Building on @cs95's answer, we can use an if clause in the lambda function instead of setting all the other columns as the index. This has the following advantages:
Preserves column order.
Lets you easily specify columns either as the set you want to modify (x.name in [...]) or the set you want to leave alone (x.name not in [...]).
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
As of pandas 1.3.0 (What’s new in 1.3.0 (July 2, 2021)):
DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)
So now this operation is as simple as:
df.explode(['B', 'C', 'D', 'E'])
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
Or, if you want a unique index:
df.explode(['B', 'C', 'D', 'E'], ignore_index=True)
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
Gathering all of the responses on this and other threads, here is how I do it for comma-delimited rows:
from collections.abc import Sequence
import pandas as pd
import numpy as np
def explode_by_delimiter(
df: pd.DataFrame,
columns: str | Sequence[str],
delimiter: str = ",",
reindex: bool = True
) -> pd.DataFrame:
"""Convert dataframe with columns separated by a delimiter into an
ordinary dataframe. Requires pandas 1.3.0+."""
if isinstance(columns, str):
columns = [columns]
col_dict = {
col: df[col]
.str.split(delimiter)
# Without .fillna(), .explode() will fail on empty values
.fillna({i: [np.nan] for i in df.index})
for col in columns
}
df = df.assign(**col_dict).explode(columns)
return df.reset_index(drop=True) if reindex else df
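A quick usage sketch (the sample frame below is made up for illustration; the all-None row survives as a single NaN row):
sample = pd.DataFrame({'A': ['x1', 'x2'],
                       'B': ['v1,v2', None],
                       'C': ['c1,c2', None]})

# Split B and C on the delimiter and explode both together.
result = explode_by_delimiter(sample, columns=['B', 'C'])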
Here is my solution using the apply function. Main features/differences:
offers the option to explode selected columns or all columns
offers options for the value to fill into the 'missing' positions (via the parameter fill_mode = 'external', 'internal', or 'trim'; the full explanation would be long, so see the examples below and try changing the option to check the result)
Note: the 'trim' option was developed for my own needs and is out of scope for this question
def lenx(x):
return len(x) if isinstance(x,(list, tuple, np.ndarray, pd.Series)) else 1
def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
jcols = [j for j,v in enumerate(row.index) if v in cols]
if len(jcols)<1:
jcols = range(len(row.index))
Ls = [lenx(x) for x in row.values]
if not Ls[:-1]==Ls[1:]:
vals = [v if isinstance(v,list) else [v] for v in row.values]
if fill_mode=='external':
vals = [[e] + [fill_value]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
else e + [fill_value]*(max(Ls)-lenx(e))
for j,e in enumerate(vals)]
elif fill_mode == 'internal':
vals = [[e]+[e]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
else e+[e[-1]]*(max(Ls)-lenx(e))
for j,e in enumerate(vals)]
else:
vals = [e[0:min(Ls)] for e in vals]
row = pd.Series(vals,index=row.index.tolist())
return row
Examples:
df=pd.DataFrame({
'a':[[1],2,3],
'b':[[4,5,7],[5,4],4],
'c':[[4,5],5,[6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)
Output:
a b c
0 [1] [4, 5, 7] [4, 5]
1 2 [5, 4] 5
2 3 4 [6]
fill_mode='external', all columns, fill_value = 'OK'
a b c
0 1 4 4
0 OK 5 5
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
a b c
0 1 4 [4, 5]
0 OK 5 OK
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='internal', cols = ['a', 'b']
a b c
0 1 4 [4, 5]
0 1 5 [4, 5]
0 1 7 [4, 5]
1 2 5 5
1 2 4 5
2 3 4 6
fill_mode='trim', all columns
a b c
0 1 4 4
1 2 5 5
2 3 4 6
I need to get the difference between 2 CSV files and drop duplicates and NaN fields.
I am trying this, but it adds the dataframes together instead of subtracting:
df1 = pd.concat([df,cite_id]).drop_duplicates(keep=False)[['id','website']]
df is the main dataframe; cite_id is the dataframe that has to be subtracted.
You can do this efficiently using isin. Note that dropna and drop_duplicates return new frames, so assign the results back:
df = df.dropna().drop_duplicates()
cite_id = cite_id.dropna().drop_duplicates()
df[~df.id.isin(cite_id.id.values)]
Or you can merge them and keep only the rows that end up with a NaN:
df[pd.merge(cite_id, df, how='outer').isnull().any(axis=1)]
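A related sketch (not part of the original answer) that avoids relying on NaNs is a left merge with indicator=True, assuming id is the key shared by both frames:
# Keep only the rows of df whose id does not appear in cite_id.
merged = df.merge(cite_id[['id']], on='id', how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')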
import pandas as pd
df1 = pd.read_csv("1.csv")
df2 = pd.read_csv("2.csv")
df1 = df1.dropna().drop_duplicates()
df2 = df2.dropna().drop_duplicates()
df = df2.loc[~df2.id.isin(df1.id)]
You can concatenate the two dataframes into one, and after that remove all duplicates:
df1
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
cite_id
ID B C D
4 A2 B4 C4 D4
5 A3 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
pd.concat([df1,cite_id]).drop_duplicates(subset=['ID'], keep=False)
Output:
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
6 A6 B6 C6 D6
7 A7 B7 C7 D7