I have a Pandas DataFrame with a column containing list objects
A
0 [1,2]
1 [3,4]
2 [8,9]
3 [2,6]
How can I access the first element of each list and save it into a new column of the DataFrame? To get a result like this:
A new_col
0 [1,2] 1
1 [3,4] 3
2 [8,9] 8
3 [2,6] 2
I know this could be done via iterating over each row, but is there any "pythonic" way?
As always, remember that storing non-scalar objects in frames is generally disfavoured, and should really only be used as a temporary intermediate step.
That said, you can use the .str accessor even though it's not a column of strings:
>>> df = pd.DataFrame({"A": [[1,2],[3,4],[8,9],[2,6]]})
>>> df["new_col"] = df["A"].str[0]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
>>> df["new_col"]
0 1
1 3
2 8
3 2
Name: new_col, dtype: int64
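A minimal sketch of the same accessor on slightly messier data. The extra empty list in the last row is an assumption added for illustration; it is not part of the question's data. It shows that the lookup yields a missing value rather than raising when a list has no first element.

```python
import pandas as pd

# Sample data from the question, plus one empty list (an assumption added
# here to show how a missing first element is handled).
df = pd.DataFrame({"A": [[1, 2], [3, 4], [8, 9], [2, 6], []]})

# .str[0] and .str.get(0) are two spellings of the same element lookup.
first_by_index = df["A"].str[0]
first_by_get = df["A"].str.get(0)

# The empty list produces a missing value (NaN) instead of an IndexError.
print(first_by_index.tolist())
```

Because of the missing value, the resulting dtype may differ from the all-integer case shown above.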
You can use map and a lambda function:
df.loc[:, 'new_col'] = df.A.map(lambda x: x[0])
Use apply with x[0]:
df['new_col'] = df.A.apply(lambda x: x[0])
print(df)
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
You can use the method str.get:
df['A'].str.get(0)
You can just use a conditional list comprehension which takes the first value of any iterable or else uses None for that item. List comprehensions are very Pythonic.
df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
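One caveat with the hasattr check: in Python 3 strings also have __iter__, and an empty list would still raise IndexError on val[0]. A slightly more defensive sketch is below. The helper name first_or_none and the extra rows (empty list, None, a string) are hypothetical and only serve to illustrate the edge cases.

```python
import pandas as pd


def first_or_none(val):
    """Return val[0] for a non-empty list or tuple, otherwise None."""
    if isinstance(val, (list, tuple)) and len(val) > 0:
        return val[0]
    return None


# Hypothetical data: two normal rows plus three edge cases.
df = pd.DataFrame({"A": [[1, 2], [3, 4], [], None, "text"]})

# Build the values with a plain list comprehension, then assign the column.
firsts = [first_or_none(val) for val in df["A"]]
df["new_col"] = firsts
```

The isinstance test is stricter than the original check, so it trades a little generality (other sequence types are ignored) for predictable behaviour.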
Timings
df = pd.concat([df] * 10000)
%timeit df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
100 loops, best of 3: 13.2 ms per loop
%timeit df["new_col"] = df["A"].str[0]
100 loops, best of 3: 15.3 ms per loop
%timeit df['new_col'] = df.A.apply(lambda x: x[0])
100 loops, best of 3: 12.1 ms per loop
%timeit df.A.map(lambda x: x[0])
100 loops, best of 3: 11.1 ms per loop
Removing the safety check that ensures each value is an iterable:
%timeit df['new_col'] = [val[0] for val in df["A"]]
100 loops, best of 3: 7.38 ms per loop
Related
What's the most pythonic way to add a column (of weights) to an existing Pandas DataFrame "df" by a condition on one of df's columns?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
A B
0 1 4
1 2 5
2 3 6
I'd like to add a "weight" column where if df['B'] >= 6 then df['weight'] = 20, else df['weight'] = 1
So my output will be:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Approach #1
Here's one with type-conversion and scaling -
df['weight'] = (df['B'] >= 6)*19+1
Approach #2
Another possibly faster one with using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multi-cores with numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (as mentioned by the OP), using random data in the range [0,9), for the vectorized methods posted thus far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one ...
With the output being 1 or 20, we could safely use lower precision : uint8 for a turbo speedup over already discussed ones, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
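As a quick check, here is a small sketch on the question's sample frame showing that numpy.where and the boolean-arithmetic trick from the other answer produce the same weights. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Explicit form: pick 20 where the condition holds, otherwise 1.
w_where = np.where(df["B"] >= 6, 20, 1)

# Arithmetic form: True/False act as 1/0, so the result is 0*19+1 or 1*19+1.
w_arith = (df["B"].values >= 6) * 19 + 1

df["weight"] = w_where
print(df)
```

The numpy.where form states the two outcomes directly, which tends to be easier to read when the values are not a simple scale and offset.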
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
I have a pandas.core.series.Series with data
0 [00115840, 00110005, 001000033, 00116000...
1 [00267285, 00263627, 00267010, 0026513...
2 [00335595, 00350750]
I want to remove leading zeros from the series. I tried
x.astype('int64')
But got error message
ValueError: setting an array element with a sequence.
Can you suggest how to do this in Python 3.x?
s=pd.Series(s.apply(pd.Series).astype(int).values.tolist())
s
Out[282]:
0 [1, 2]
1 [3, 4]
dtype: object
Data input
s=pd.Series([['001','002'],['003','004']])
Update: thanks to Jez and cold for pointing it out :-)
pd.Series(s.apply(pd.Series).stack().astype(int).groupby(level=0).apply(list))
Out[317]:
0 [115840, 110005, 1000033, 116000]
1 [267285, 263627, 267010, 26513]
2 [335595, 350750]
dtype: object
If you want to convert lists of strings to lists of integers, use a list comprehension:
s = pd.Series([[int(y) for y in x] for x in s], index=s.index)
s = s.apply(lambda x: [int(y) for y in x])
Sample:
a = [['00115840', '00110005', '001000033', '00116000'],
['00267285', '00263627', '00267010', '0026513'],
['00335595', '00350750']]
s = pd.Series(a)
print (s)
0 [00115840, 00110005, 001000033, 00116000]
1 [00267285, 00263627, 00267010, 0026513]
2 [00335595, 00350750]
dtype: object
s = s.apply(lambda x: [int(y) for y in x])
print (s)
0 [115840, 110005, 1000033, 116000]
1 [267285, 263627, 267010, 26513]
2 [335595, 350750]
dtype: object
EDIT:
If you want integers only, you can flatten the values and cast to ints:
s = pd.Series([item for sublist in s for item in sublist]).astype(int)
Alternative solution:
import itertools
s = pd.Series(list(itertools.chain(*s))).astype(int)
print (s)
0 115840
1 110005
2 1000033
3 116000
4 267285
5 263627
6 267010
7 26513
8 335595
9 350750
dtype: int32
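A small variation on the itertools approach, sketched on a shortened copy of the sample data: itertools.chain.from_iterable takes the outer iterable directly, so the Series does not need to be unpacked with *.

```python
import itertools

import pandas as pd

# Shortened sample data (same shape as the example above).
a = [["00115840", "00110005"], ["00335595"]]
s = pd.Series(a)

# Flatten the nested lists lazily, then cast the strings to integers.
flat = pd.Series(list(itertools.chain.from_iterable(s))).astype(int)
print(flat)
```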
Timings:
a = [['00115840', '00110005', '001000033', '00116000'],
['00267285', '00263627', '00267010', '0026513'],
['00335595', '00350750']]
s = pd.Series(a)
s = pd.concat([s]*1000).reset_index(drop=True)
In [203]: %timeit pd.Series([[int(y) for y in x] for x in s], index=s.index)
100 loops, best of 3: 4.66 ms per loop
In [204]: %timeit s.apply(lambda x: [int(y) for y in x])
100 loops, best of 3: 5.13 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's solution
In [205]: %%timeit
...: v = pd.Series(np.concatenate(s.values.tolist()))
...: v.astype(int).groupby(s.index.repeat(s.str.len())).agg(pd.Series.tolist)
...:
1 loop, best of 3: 226 ms per loop
# @Wen's solution
In [211]: %timeit pd.Series(s.apply(pd.Series).stack().astype(int).groupby(level=0).apply(list))
1 loop, best of 3: 1.12 s per loop
Solutions with flattening (idea of @cᴏʟᴅsᴘᴇᴇᴅ):
In [208]: %timeit pd.Series([item for sublist in s for item in sublist]).astype(int)
100 loops, best of 3: 2.55 ms per loop
In [209]: %timeit pd.Series(list(itertools.chain(*s))).astype(int)
100 loops, best of 3: 2.2 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's solution
In [210]: %timeit pd.Series(np.concatenate(s.values.tolist()))
100 loops, best of 3: 7.71 ms per loop
Flatten your data with np.concatenate -
s
0 [00115840, 36869, 262171, 39936]
1 [00267285, 92055, 93704, 11595]
2 [00335595, 119272]
Name: 1, dtype: object
v = pd.Series(np.concatenate(s.tolist()))
Or (thanks to jezrael for the suggestion), using .values.tolist which is faster -
v = pd.Series(np.concatenate(s.values.tolist()))
v
0 00115840
1 36869
2 262171
3 39936
4 00267285
5 92055
6 93704
7 11595
8 00335595
9 119272
dtype: object
Now, what you're doing with astype should work -
v.astype(int)
0 115840
1 36869
2 262171
3 39936
4 267285
5 92055
6 93704
7 11595
8 335595
9 119272
dtype: int64
If you have data as floats, use astype(float) instead.
If you want to, you could reshape the result back to its original format using groupby + agg -
v.astype(int).groupby(s.index.repeat(s.str.len())).agg(pd.Series.tolist)
0 [115840, 36869, 262171, 39936]
1 [267285, 92055, 93704, 11595]
2 [335595, 119272]
dtype: object
If you want a crisper solution, you could try the following:
Assuming a is the original series.
b = a.explode().astype(int)
a = b.groupby(b.index).agg(list)
Admittedly, this is slower than the solutions posted by @cs95 and @jezrael.
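A self-contained sketch of the explode route on a shortened sample, assuming a pandas version that provides Series.explode and that no list is empty (an empty list would explode to a missing value, which astype(int) cannot convert).

```python
import pandas as pd

s = pd.Series([["00115840", "00110005", "001000033"],
               ["00335595", "00350750"]])

# One row per list item; the original index label is repeated for each item.
flat = s.explode().astype(int)

# Group on the repeated index labels to rebuild one list per original row.
restored = flat.groupby(level=0).agg(list)
print(restored)
```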
#where x is a series
x = x.str.lstrip('0')
The line below should work if the column has mixed dtypes:
df['col'] = df['col'].apply(lambda x: x.lstrip('0') if isinstance(x, str) else x)
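The lstrip calls above work on individual strings, so for a Series whose elements are lists of strings (as in the question) the stripping has to be applied per item. A minimal sketch follows; the "000" entry and the fallback to "0" are assumptions added to show what happens when a value consists only of zeros.

```python
import pandas as pd

s = pd.Series([["00115840", "0026513"], ["000"]])

# Strip leading zeros from every string in every list, keeping strings.
# lstrip('0') turns an all-zero string into '', so fall back to '0'.
stripped = s.apply(lambda items: [item.lstrip("0") or "0" for item in items])
print(stripped)
```

This keeps the values as strings; use int(item) instead if integers are wanted.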
I have a list of 4 dataframes, called df.
I'd like to add a "number" column to each dataframe (df[i]['number']) that represent the dataframe number.
I tried to use list comprehension for that:
df=[df['number']=(x+1) for x in range(0,4)]
Which resulted in
File "<ipython-input-52-0b708f543fbb>", line 1
df=[df['number']=(x+1) for x in range(0,4)]
^
SyntaxError: invalid syntax
I also tried:
df=[x['number']=(y+1) for x,y in enumerate(df)]
With the same result, pointing at the '=' sign.
What am I doing wrong?
Use enumerate, starting from 1 and assign to each dataframe in your list.
for i, d in enumerate(df, 1):
    d['number'] = i
In-place assignment is much cheaper than assignment in a list comprehension.
df[0]
id marks
0 1 100
1 2 200
2 3 300
df[1]
name score flag
0 'abc' 100 T
1 'zxc' 300 F
for i, d in enumerate(df, 1):
    d['number'] = i
df[0]
id marks number
0 1 100 1
1 2 200 1
2 3 300 1
df[1]
name score flag number
0 'abc' 100 T 2
1 'zxc' 300 F 2
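For reference, a runnable sketch of the loop on two small frames shaped like the ones above. The list comprehension in the question fails because assignment is a statement in Python, and a comprehension can only contain expressions. The name frames is used here instead of df purely for readability.

```python
import pandas as pd

# Two small frames resembling the example output above.
frames = [
    pd.DataFrame({"id": [1, 2, 3], "marks": [100, 200, 300]}),
    pd.DataFrame({"name": ["abc", "zxc"], "score": [100, 300], "flag": ["T", "F"]}),
]

# enumerate(..., 1) starts counting at 1; each frame is modified in place.
for i, frame in enumerate(frames, 1):
    frame["number"] = i

print(frames[0])
print(frames[1])
```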
Performance
Small
1000 loops, best of 3: 278 µs per loop # mine
vs
1000 loops, best of 3: 567 µs per loop # John Galt
Large (df * 10000)
1000 loops, best of 3: 607 µs per loop # mine
vs
1000 loops, best of 3: 1.16 ms per loop # John Galt - assign
1 loop, best of 1: 1.42 ms per loop # John Galt - side effects
Note that the loop-based assignment is also space efficient.
Use
1)
In [454]: df = [x.assign(number=i) for i, x in enumerate(df, 1)]
In [455]: df[0]
Out[455]:
0 1 number
0 0.068330 0.708835 1
1 0.877747 0.586654 1
In [456]: df[1]
Out[456]:
0 1 number
0 0.430418 0.477923 2
1 0.049980 0.018981 2
A nice side effect is that you can assign the result to a new variable without altering the old list, like
dff = [x.assign(number=i) for i, x in enumerate(df, 1)]
2)
If you want an in-place operation inside a list comprehension:
In [474]: [x.insert(x.shape[1] ,'number', i) for i, x in enumerate(df, 1)]
Out[474]: [None, None, None, None]
In [475]: df[0]
Out[475]:
0 1 number
0 0.207806 0.315701 1
1 0.464864 0.976156 1
I'm trying to compute the Hamming distance between all strings in a column of a large DataFrame. The column has over 100,000 rows, so all pairwise combinations come to roughly 10x10^9 comparisons. These strings are short DNA sequences. I would like to quickly convert every string in the column to a list of integers, where a unique integer represents each character in the string. E.g.
"ACGTACA" -> [0, 1, 2, 3, 0, 1, 0]
then I use scipy.spatial.distance.pdist to quickly and efficiently compute the hamming distance between all of these. Is there a fast way to do this in Pandas?
I have tried using apply but it is pretty slow:
mapping = {"A":0, "C":1, "G":2, "T":3}
df.apply(lambda x: np.array([mapping[char] for char in x]))
get_dummies and other Categorical operations don't apply because they operate at the per-row level, not within the row.
Since Hamming distance doesn't care about magnitude differences, I can get about a 40-60% speedup just replacing df.apply(lambda x: np.array([mapping[char] for char in x])) with df.apply(lambda x: map(ord, x)) on made-up datasets.
Create your test data
In [39]: pd.options.display.max_rows=12
In [40]: N = 100000
In [41]: chars = np.array(list('ABCDEF'))
In [42]: s = pd.Series(np.random.choice(chars, size=4 * np.prod(N)).view('S4'))
In [45]: s
Out[45]:
0 BEBC
1 BEEC
2 FEFA
3 BBDA
4 CCBB
5 CABE
...
99994 EEBC
99995 FFBD
99996 ACFB
99997 FDBE
99998 BDAB
99999 CCFD
dtype: object
These don't actually have to be the same length the way we are doing it.
In [43]: maxlen = s.str.len().max()
In [44]: result = pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
In [47]: result
Out[47]:
0 1 2 3
0 1 4 1 2
1 1 4 4 2
2 5 4 5 0
3 1 1 3 0
4 2 2 1 1
5 2 0 1 4
... .. .. .. ..
99994 4 4 1 2
99995 5 5 1 3
99996 0 2 5 1
99997 5 3 1 4
99998 1 3 0 1
99999 2 2 5 3
[100000 rows x 4 columns]
So you get a factorization according to the same categories (i.e. the codes are meaningful)
And pretty fast
In [46]: %timeit pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
10 loops, best of 3: 118 ms per loop
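As far as I know, newer pandas releases no longer accept the categories= keyword in astype, so here is a sketch of the same per-position factorization using pd.Categorical instead. It uses three short strings taken from the output above rather than the random test data, and assumes plain Python strings.

```python
import pandas as pd

chars = list("ABCDEF")
s = pd.Series(["BEBC", "BEEC", "FEFA"])

maxlen = s.str.len().max()

# For each character position, encode the characters against one fixed
# category list so that the integer codes mean the same in every column.
result = pd.concat(
    [
        pd.Series(pd.Categorical(s.str[i], categories=chars).codes, index=s.index)
        for i in range(maxlen)
    ],
    axis=1,
)
print(result)
```

Strings shorter than maxlen would get the code -1 at the missing positions, since a missing value is not in the category list.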
I didn't test the performance of this, but you could also try something like
atest = "ACGTACA"
alist = atest.replace('A', '3.').replace('C', '2.').replace('G', '1.').replace('T', '0.').split('.')
anumlist = [int(x) for x in alist if x.isdigit()]
results in:
[3, 2, 1, 0, 3, 2, 3]
Edit: Ok, so testing it with atest = "ACTACA"*100000 takes a while :/
Maybe not the best idea...
Edit 5:
Another improvement:
import datetime
import numpy as np
class Test(object):
    def __init__(self):
        self.mapping = {'A' : 0, 'C' : 1, 'G' : 2, 'T' : 3}

    def char2num(self, astring):
        return [self.mapping[c] for c in astring]

def main():
    now = datetime.datetime.now()
    atest = "AGTCAGTCATG"*10000000
    t = Test()
    alist = t.char2num(atest)
    testme = np.array(alist)
    print(testme, len(testme))
    print(datetime.datetime.now() - now)

if __name__ == "__main__":
    main()
Takes about 16 seconds for 110,000,000 characters and keeps your processor busy instead of your RAM:
[0 2 3 ..., 0 3 2] 110000000
0:00:16.866659
There doesn't seem to be much difference between using ord or a dictionary-based lookup that exactly maps A->0, C->1 etc:
import pandas as pd
import numpy as np
bases = ['A', 'C', 'T', 'G']
rowlen = 4
nrows = 1000000
dna = pd.Series(np.random.choice(bases, nrows * rowlen).view('S%i' % rowlen))
lookup = dict(zip(bases, range(4)))
%timeit dna.apply(lambda row: map(lookup.get, row))
# 1 loops, best of 3: 785 ms per loop
%timeit dna.apply(lambda row: map(ord, row))
# 1 loops, best of 3: 713 ms per loop
Jeff's solution is also not far off in terms of performance:
%timeit pd.concat([dna.str[i].astype('category', categories=bases).cat.codes for i in range(rowlen)], axis=1)
# 1 loops, best of 3: 1.03 s per loop
A major advantage of this approach over mapping the rows to lists of ints is that the categories can then be viewed as a single (nrows, rowlen) uint8 array via the .values attribute, which could then be passed directly to pdist.
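The single (nrows, rowlen) uint8 array can also be built directly in NumPy with a byte lookup table. This is only a sketch under two assumptions that are not in the answers above: all sequences have the same length, and they contain only ASCII characters. The names lut and codes are illustrative.

```python
import numpy as np
import pandas as pd

mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
s = pd.Series(["ACGT", "TTGA", "CCCA"])  # equal-length sequences (assumption)

# Lookup table indexed by byte value; 255 marks characters without a code.
lut = np.full(256, 255, dtype=np.uint8)
for char, code in mapping.items():
    lut[ord(char)] = code

# Join all sequences into one byte buffer, translate every byte through the
# table, and reshape to one row per sequence.
row_len = len(s.iloc[0])
raw = np.frombuffer("".join(s).encode("ascii"), dtype=np.uint8)
codes = lut[raw].reshape(len(s), row_len)
print(codes)
```

The resulting 2-D array has the shape described above, so it could be handed to a pairwise distance routine without further conversion.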
If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).
Back-fill the NaNs along each row with fillna (method='bfill', axis=1), so that each row's first non-null value ends up in the leftmost column, then take that column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
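Recent pandas releases deprecate the method= argument of fillna, so here is the same idea sketched with DataFrame.bfill on the question's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Back-fill along each row: every NaN takes the next valid value to its
# right, so column 0 ends up holding the first non-null value of the row.
first_valid = df.bfill(axis=1).iloc[:, 0]
print(first_valid)
```

A row that is entirely NaN stays NaN in the result.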
This is a really messy way to do this, first use first_valid_index to get the valid columns, convert the returned series to a dataframe so we can call apply row-wise and use this to index back to original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values but the look up is very quick:
def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
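To see why an all-null row yields null, here is a short walk-through on the question's frame. For a row where np.isnan is True everywhere, argmin returns 0 (the first position of the minimum), and the value at column 0 of that row is itself NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

a = df.values
mask = np.isnan(a)

# Index of the first False (non-null) entry per row; 0 for an all-True row.
col_index = mask.argmin(axis=1)

# Pick one element per row using paired row/column indices.
first = a[np.arange(a.shape[0]), col_index]
print(col_index, first)
```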
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
This is nothing new, but it's a combination of the best bits of @yangie's approach with a list comprehension, and @EdChum's df.apply approach that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
   ....:  for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
   ....:  for k, v in pick_cols.items()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
groupby in axis=1
If we pass a callable that returns the same value, we group all columns together. This allows us to use groupby.agg which gives us the first method that makes this easy
df.groupby(lambda x: 'Z', 1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a dataframe with the column name of the thing I was returning in my callable
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
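DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of an equivalent using plain NumPy indexing is below; it assumes unique column labels so that get_indexer can translate them to positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Label of the first non-null column per row (the first column label when a
# row has no non-null value, because idxmax of all-False picks the first).
first_cols = df.notna().idxmax(axis=1)

# Translate labels to integer positions and index the underlying array.
col_pos = df.columns.get_indexer(first_cols)
values = df.to_numpy()[np.arange(len(df)), col_pos]
print(values)
```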
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
Here is a one line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over rows of df. row.first_valid_index() returns label for first non-NA/null value, which will be used as index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() would be None and thus cannot be used as an index, so I need an if-else expression.
I packed everything into a list comprehension for brevity.
JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]