Splitting columns of a mixed dataframe - python

I have a dataframe df:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([('x', 'y')] + [('y', 'x')] +
                            [0, np.nan] * 2, dtype=object), columns=['Col'])
df
How can df be split into two columns, as follows?
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN

Use a list comprehension, converting scalars to tuples:
df1 = pd.DataFrame([x if isinstance(x, tuple) else (x,x) for x in df['Col']],
columns=['Col1','Col2'])
print (df1)
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
A more general solution:
lens = int(df['Col'].str.len().max())
df1 = pd.DataFrame([x if isinstance(x, tuple) else [x] * lens for x in df['Col']])
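To give the generically built columns the same Col1/Col2-style names, a small follow-up sketch (assuming the lens variable computed above):
# name the columns Col1..ColN, where N is the maximum tuple length
df1.columns = [f'Col{i+1}' for i in range(lens)]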
Another solution, slower on large data:
df1 = df['Col'].apply(pd.Series).ffill(axis=1)
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [51]: %%timeit
...: df1 = pd.DataFrame([x if isinstance(x, tuple) else (x,x) for x in df['Col']],
...: columns=['Col1','Col2'])
...:
2.42 ms ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %%timeit
...: df['Col'].apply(pd.Series).ffill(axis=1)
...:
1 s ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#coldspeed solution
In [53]: %%timeit
...: v = pd.to_numeric(df.Col, errors='coerce')
...: pd.DataFrame({
...: 'Col1': v.fillna(df.Col.str[0]),
...: 'Col2': v.fillna(df.Col.str[-1])})
...:
15.8 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A nice, concise solution is to use pd.to_numeric to convert non-numeric data to NaN, and then fillna.
v = pd.to_numeric(df.Col, errors='coerce')
pd.DataFrame({
'Col1': v.fillna(df.Col.str[0]),
'Col2': v.fillna(df.Col.str[-1])})
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
A solution for multiple possible columns:
pd.DataFrame({
f'Col{i+1}': v.fillna(df.Col.str[i])
for i in range(int(df.Col.str.len().max()))})
Col1 Col2
0 x y
1 y x
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN

Related

Repeating pandas Series with specific order

Repeating pandas Series with repeat() function:
s = pd.Series(['a', 'b', 'c'])
s.repeat(2)
0 a
0 a
1 b
1 b
2 c
2 c
dtype: object
Need to get output like this:
0 a
1 b
2 c
0 a
1 b
2 c
dtype: object
Use np.tile with Series.loc if performance is important:
a = s.loc[np.tile(s.index, 2)]
print (a)
0 a
1 b
2 c
0 a
1 b
2 c
dtype: object
s = pd.Series(['a', 'b', 'c'])
In [25]: %timeit (s.loc[np.tile(s.index, 2000)])
612 µs ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: %timeit (pd.concat([s] * 2000))
22.2 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
EDIT:
s = pd.Series(['a', 'b', 'c'], index = pd.date_range('2015-01-01', periods=3))
print (s)
a = s.loc[np.tile(s.index, 2)]
print (a)
2015-01-01 a
2015-01-02 b
2015-01-03 c
2015-01-01 a
2015-01-02 b
2015-01-03 c
dtype: object
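One caveat worth adding (my note, not from the original answer): .loc selects by label, so an index with duplicate labels would over-select. A positional variant with iloc, as a minimal sketch:
import numpy as np
import pandas as pd
s = pd.Series(['a', 'b', 'c'], index=[0, 0, 1])   # duplicate labels on purpose
# s.loc[np.tile(s.index, 2)] would return 10 rows here, because .loc matches
# every occurrence of each repeated label
a = s.iloc[np.tile(np.arange(len(s)), 2)]          # positions, not labels
print (a)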
You can use the pandas concat function as follows:
pd.concat([s] * 2)

Create a sequence of numbers and reset itself when certain number is reached

I have a dataframe whose first column has 11 rows. I want to create a second column that counts from 1 to 4, resets, counts from 1 to 4 again, and stops when it reaches the last row.
For instance, given df['item'], the code should create df['new column']:
df['item']= [a b c d e f g h i j k]
df['new column'] = [1 2 3 4 1 2 3 4 1 2 3]
Use modulo with 4 and add 1:
import pandas as pd
df = pd.DataFrame({'item': list('abcdefghijk')})
#default index solution
df['new column'] = df.index % 4 + 1
#general solution
#df['new column'] = np.arange(len(df)) % 4 + 1
print(df)
Output:
item new column
0 a 1
1 b 2
2 c 3
3 d 4
4 e 1
5 f 2
6 g 3
7 h 4
8 i 1
9 j 2
10 k 3
For a large DataFrame, the performance of each solution differs:
df = pd.DataFrame({'a':range(1000000)})
In [307]: %timeit df['new column'] = (len(df)*[1, 2, 3, 4])[:len(df)]
363 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [308]: %timeit df['new column1'] = df.index % 4 + 1
35.1 ms ± 416 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [309]: %timeit df['new column2'] = np.arange(len(df)) % 4 + 1
14.4 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can repeat the list [1, 2, 3, 4] n times simply by doing n * [1, 2, 3, 4], then slice the result to the length of the DataFrame. Thus your new column is created with:
df['new column'] = (len(df)*[1, 2, 3, 4])[:len(df)]
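If building a list far longer than needed feels wasteful, a memory-friendlier variant (my sketch, not from the answer above) cycles through the values with itertools and takes exactly len(df) of them:
from itertools import cycle, islice
# cycle endlessly over 1..4 and take exactly one value per row
df['new column'] = list(islice(cycle([1, 2, 3, 4]), len(df)))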

Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns

I'm looking for the fastest and most idiomatic analogue of the SQL MINUS (a.k.a. EXCEPT) operator.
Here is what I mean - given two Pandas DataFrames as follows:
In [77]: d1
Out[77]:
a b c
0 0 0 1
1 0 1 2
2 1 0 3
3 1 1 4
4 0 0 5
5 1 1 6
6 2 2 7
In [78]: d2
Out[78]:
a b c
0 1 1 10
1 0 0 11
2 1 1 12
How can I compute d1 MINUS d2, taking into account only columns "a" and "b", in order to get the following result:
In [62]: res
Out[62]:
a b c
1 0 1 2
2 1 0 3
6 2 2 7
MVCE:
d1 = pd.DataFrame({
'a': [0, 0, 1, 1, 0, 1, 2],
'b': [0, 1, 0, 1, 0, 1, 2],
'c': [1, 2, 3, 4, 5, 6, 7]
})
d2 = pd.DataFrame({
'a': [1, 0, 1],
'b': [1, 0, 1],
'c': [10, 11, 12]
})
What have I tried:
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
In [68]: res
Out[68]:
a b c
1 0 1 2
2 1 0 3
6 2 2 7
It gives me the correct result, but I have a feeling that there must be a more idiomatic and nicer / cleaner way to achieve it.
PS: the DataFrame.isin() method won't help in this case, as it produces a wrong result set.
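To show why (my own demonstration, using the MVCE above): isin matches values per column rather than rows as (a, b) pairs, so a pair such as (0, 1), which never occurs together in d2, still passes the check:
# 0 occurs somewhere in d2['a'] and 1 occurs somewhere in d2['b'],
# so rows 1 and 2 of d1 are wrongly treated as present in d2
mask = d1[['a', 'b']].isin({'a': list(d2['a']), 'b': list(d2['b'])}).all(axis=1)
print (d1[~mask])   # returns only row 6 instead of rows 1, 2 and 6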
Execution time comparison for larger data sets:
In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)
In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)
In [102]: df1.shape
Out[102]: (700000, 3)
In [103]: df2.shape
Out[103]: (300000, 3)
pd.concat().drop_duplicates() approach:
In [10]: %%timeit
...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
...:
...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
multi-index NOT IS IN approach:
In [11]: %%timeit
...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
...:
...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
multi-index difference approach:
In [12]: %%timeit
...: tmp1 = df1.reset_index().set_index(["a", "b"])
...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
...: res = df1.loc[tmp1.loc[idx, "index"]]
...:
...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
merge(how="outer") approach - gives me a MemoryError:
In [106]: %%timeit
...: res = (df1.reset_index()
...: .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
...: .query('_merge == "left_only"')
...: .set_index('index')
...: .rename_axis(None)
...: .reindex(df1.columns, axis=1))
...:
...:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
compare concatenated strings approach:
In [13]: %%timeit
...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
...:
...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am thinking a little bit like excel here:
d1[~d1[['a','b']].astype(str).sum(axis=1).isin(d2[['a','b']].astype(str).sum(axis=1))]
a b c
1 0 1 2
2 1 0 3
6 2 2 7
One possible solution with merge and indicator=True:
df = (d1.reset_index()
.merge(d2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
.query('_merge == "left_only"')
.set_index('index')
.rename_axis(None)
.reindex(d1.columns, axis=1))
print (df)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
Solution with isin:
df = d1[~d1.set_index(["a", "b"]).index.isin(d2.set_index(["a","b"]).index)]
print (df)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
We can use pandas.concat with drop_duplicates here and pass it the argument to drop all duplicates with keep=False:
pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
Edit after a comment by the OP
If you want to make sure that rows which appear only in d2 aren't carried into the result, we can duplicate that DataFrame before concatenating:
pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
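To make the reason concrete, here is a hypothetical illustration (the extra (9, 9) row is my addition, not part of the original data): a pair that exists only in d2 appears exactly once in a single concatenation, so keep=False keeps it and it leaks into the result, while duplicating d2 guarantees it appears at least twice and is dropped.
d2x = pd.concat([d2, pd.DataFrame({'a': [9], 'b': [9], 'c': [99]})], ignore_index=True)
wrong = pd.concat([d1, d2x]).drop_duplicates(['a', 'b'], keep=False)                     # (9, 9) leaks in
right = pd.concat([d1, pd.concat([d2x] * 2)]).drop_duplicates(['a', 'b'], keep=False)    # (9, 9) is dropped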
I had a similar question and tried your idea
(
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
)
as a test, and it works.
However, when I applied the same approach to two SQLite databases with the same structure (their tables and columns are identical), I ran into errors indicating that the two DataFrames don't seem to have the same shape.
If you're happy to give me a hand and want more details, we can have a further conversation. Thanks a lot.

Remove '.' from Thousands of Column Heads [python]

My DataFrame has around 9K columns, and I want to remove the . from every column name, see example column names below:
`traffic.seas1`
`traffic.seas2`
`traffic.seas3`
These are just three examples; I have 9K columns, some without a . but many with one. How can I remove the dots efficiently? Building a rename mapping by hand is too manual.
You can use str.replace:
df.columns = df.columns.str.replace('.', '', regex=False)  # regex=False so '.' is treated literally
Or list comprehension with replace:
df.columns = [x.replace('.','') for x in df.columns]
Sample:
df = pd.DataFrame({'traffic.seas1':list('abcdef'),
'traffic.seas2':[4,5,4,5,5,4],
'traffic.seas3':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
D E F traffic.seas1 traffic.seas2 traffic.seas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
df.columns = df.columns.str.replace('.', '', regex=False)
print (df)
D E F trafficseas1 trafficseas2 trafficseas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
Timings:
N = 9000
df = pd.DataFrame(np.random.randint(10, size=(3, N))).add_prefix('traffic.seas')
print (df)
In [161]: %timeit df.columns = df.columns.str.replace('.','')
4.4 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [162]: %timeit df.columns = [x.replace('.','') for x in df.columns]
2.53 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use list comprehension on df.columns like this:
df.columns = [c.replace('.', '') for c in df.columns]
For example:
df = pd.DataFrame({'foo': [1], 'bar.z': [2]})
>>> df.columns
Index(['bar.z', 'foo'], dtype='object')
df.columns = [c.replace('.', '') for c in df.columns]
>>> df
barz foo
0 2 1
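A rename-based alternative need not be manual either, since rename accepts a callable that is applied to every label; a small sketch, equivalent in effect to the answers above:
# rename applies the function to each column label
df = df.rename(columns=lambda c: c.replace('.', ''))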

Selecting all column names where value is greater than another column in pandas

I'm trying to find, for each row of a pandas dataframe, the names of the columns whose values are greater than the value in another column.
For example, if I have the following dataframe:
A B C D threshold
0 1 3 3 1 2
1 2 3 6 1 5
2 9 5 0 2 4
For each row I would like to return the names of the columns where the values are greater than the threshold, so I would have:
0: B, C
1: C
2: A, B
Any help would be much appreciated!
If you want a large increase in speed you can use NumPy's vectorized where function.
s = np.where(df.gt(df['threshold'],0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
pd.Series([''.join(x).strip(', ') for x in s])
0 B, C
1 C
2 A, B
dtype: object
There is more than an order of magnitude speedup over jezrael's and MaxU's solutions when using a dataframe of 100,000 rows. Here I create the test DataFrame first.
n = 100000
df = pd.DataFrame(np.random.randint(0, 10, (n, 5)),
columns=['A', 'B', 'C', 'D', 'threshold'])
Timings
%%timeit
>>> s = np.where(df.gt(df['threshold'],0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
>>> pd.Series([''.join(x).strip(', ') for x in s])
280 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
>>> df1 = df.drop('threshold', 1).gt(df['threshold'], 0)
>>> df1 = df1.apply(lambda x: ', '.join(x.index[x]),axis=1)
3.15 s ± 82.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
>>> x = df.drop('threshold',1)
>>> x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))
3.28 s ± 145 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use:
df1 = df.drop('threshold', axis=1).gt(df['threshold'], axis=0)
df1 = df1.apply(lambda x: ', '.join(x.index[x]), axis=1)
print (df1)
0 B, C
1 C
2 A, B
dtype: object
Similar solution:
df1 = (df.drop('threshold', axis=1).gt(df['threshold'], axis=0).stack()
         .rename_axis(('a','b')).reset_index(name='boolean'))
a = df1[df1['boolean']].groupby('a')['b'].apply(', '.join).reset_index()
print (a)
a b
0 0 B, C
1 1 C
2 2 A, B
you can do it this way:
In [99]: x = df.drop('threshold',1)
In [100]: x
Out[100]:
A B C D
0 1 3 3 1
1 2 3 6 1
2 9 5 0 2
In [102]: x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))
Out[102]:
0 B, C
1 C
2 A, B
dtype: object
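Another vectorized idea (my own sketch, not from the answers above) is the boolean matrix "dot" trick: multiplying the boolean mask by the column labels lets string concatenation do the joining.
cols = df.columns.drop('threshold')
mask = df[cols].gt(df['threshold'], axis=0)
# True * 'A, ' == 'A, ' and False * 'A, ' == '', so dot concatenates the matches
out = mask.dot(cols + ', ').str.rstrip(', ')
print (out)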
