I have a pandas.core.series.Series with data
0 [00115840, 00110005, 001000033, 00116000...
1 [00267285, 00263627, 00267010, 0026513...
2 [00335595, 00350750]
I want to remove the leading zeros from the series. I tried
x.astype('int64')
But got error message
ValueError: setting an array element with a sequence.
Can you suggest how to do this in Python 3.x?
s=pd.Series(s.apply(pd.Series).astype(int).values.tolist())
s
Out[282]:
0 [1, 2]
1 [3, 4]
dtype: object
Data input
s=pd.Series([['001','002'],['003','004']])
Update: Thanks to Jez and cold for pointing it out :-)
pd.Series(s.apply(pd.Series).stack().astype(int).groupby(level=0).apply(list))
Out[317]:
0 [115840, 110005, 1000033, 116000]
1 [267285, 263627, 267010, 26513]
2 [335595, 350750]
dtype: object
If you want to convert each list of strings to a list of integers, use a list comprehension:
s = pd.Series([[int(y) for y in x] for x in s], index=s.index)
s = s.apply(lambda x: [int(y) for y in x])
Sample:
a = [['00115840', '00110005', '001000033', '00116000'],
['00267285', '00263627', '00267010', '0026513'],
['00335595', '00350750']]
s = pd.Series(a)
print (s)
0 [00115840, 00110005, 001000033, 00116000]
1 [00267285, 00263627, 00267010, 0026513]
2 [00335595, 00350750]
dtype: object
s = s.apply(lambda x: [int(y) for y in x])
print (s)
0 [115840, 110005, 1000033, 116000]
1 [267285, 263627, 267010, 26513]
2 [335595, 350750]
dtype: object
EDIT:
If you want integers only, you can flatten the values and cast them to int:
s = pd.Series([item for sublist in s for item in sublist]).astype(int)
Alternative solution:
import itertools
s = pd.Series(list(itertools.chain(*s))).astype(int)
print (s)
0 115840
1 110005
2 1000033
3 116000
4 267285
5 263627
6 267010
7 26513
8 335595
9 350750
dtype: int32
Timings:
a = [['00115840', '00110005', '001000033', '00116000'],
['00267285', '00263627', '00267010', '0026513'],
['00335595', '00350750']]
s = pd.Series(a)
s = pd.concat([s]*1000).reset_index(drop=True)
In [203]: %timeit pd.Series([[int(y) for y in x] for x in s], index=s.index)
100 loops, best of 3: 4.66 ms per loop
In [204]: %timeit s.apply(lambda x: [int(y) for y in x])
100 loops, best of 3: 5.13 ms per loop
# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [205]: %%timeit
...: v = pd.Series(np.concatenate(s.values.tolist()))
...: v.astype(int).groupby(s.index.repeat(s.str.len())).agg(pd.Series.tolist)
...:
1 loop, best of 3: 226 ms per loop
# Wen's solution
In [211]: %timeit pd.Series(s.apply(pd.Series).stack().astype(int).groupby(level=0).apply(list))
1 loop, best of 3: 1.12 s per loop
Solutions with flattening (idea of #cᴏʟᴅsᴘᴇᴇᴅ):
In [208]: %timeit pd.Series([item for sublist in s for item in sublist]).astype(int)
100 loops, best of 3: 2.55 ms per loop
In [209]: %timeit pd.Series(list(itertools.chain(*s))).astype(int)
100 loops, best of 3: 2.2 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ sol
In [210]: %timeit pd.Series(np.concatenate(s.values.tolist()))
100 loops, best of 3: 7.71 ms per loop
Flatten your data with np.concatenate -
s
0 [00115840, 36869, 262171, 39936]
1 [00267285, 92055, 93704, 11595]
2 [00335595, 119272]
Name: 1, dtype: object
v = pd.Series(np.concatenate(s.tolist()))
Or (thanks to jezrael for the suggestion), using .values.tolist(), which is faster -
v = pd.Series(np.concatenate(s.values.tolist()))
v
0 00115840
1 36869
2 262171
3 39936
4 00267285
5 92055
6 93704
7 11595
8 00335595
9 119272
dtype: object
Now, what you're doing with astype should work -
v.astype(int)
0 115840
1 36869
2 262171
3 39936
4 267285
5 92055
6 93704
7 11595
8 335595
9 119272
dtype: int64
If you have data as floats, use astype(float) instead.
If you want to, you could reshape the result back to its original format using groupby + agg -
v.astype(int).groupby(s.index.repeat(s.str.len())).agg(pd.Series.tolist)
0 [115840, 36869, 262171, 39936]
1 [267285, 92055, 93704, 11595]
2 [335595, 119272]
dtype: object
If you want a more concise solution, you could try the following:
Assuming a is the original series.
b = a.explode().astype(int)
a = b.groupby(b.index).agg(list)
This is, however, slower than the solutions posted by #cs95 and #jezrael. Note that Series.explode requires pandas 0.25 or newer.
#where x is a series
x = x.str.lstrip('0')
The line below should work if you have a column with mixed dtypes:
df['col'] = df['col'].apply(lambda x:x.lstrip('0') if type(x) == str else x)
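Since the series in the original question holds lists of strings rather than plain strings, here is a minimal sketch of how the same lstrip idea could be applied element-wise (my adaptation, not from the original answer). Note that this keeps the values as strings, and a string of all zeros would become an empty string, unlike the int-based solutions above:
x = x.apply(lambda lst: [y.lstrip('0') for y in lst])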
Related
I have a list of 4 dataframes, called df.
I'd like to add a "number" column to each dataframe (df[i]['number']) that represents the dataframe number.
I tried to use list comprehension for that:
df=[df['number']=(x+1) for x in range(0,4)]
Which resulted in
File "<ipython-input-52-0b708f543fbb>", line 1
df=[df['number']=(x+1) for x in range(0,4)]
^
SyntaxError: invalid syntax
I also tried:
df=[x['number']=(y+1) for x,y in enumerate(df)]
With the same result, pointing at the '=' sign.
What am I doing wrong?
Use enumerate, starting from 1, and assign to each dataframe in your list:
for i, d in enumerate(df, 1):
d['number'] = i
In-place assignment is much cheaper than assignment in a list comprehension.
df[0]
id marks
0 1 100
1 2 200
2 3 300
df[1]
name score flag
0 'abc' 100 T
1 'zxc' 300 F
for i, d in enumerate(df, 1):
d['number'] = i
df[0]
id marks number
0 1 100 1
1 2 200 1
2 3 300 1
df[1]
name score flag number
0 'abc' 100 T 2
1 'zxc' 300 F 2
Performance
Small
1000 loops, best of 3: 278 µs per loop # mine
vs
1000 loops, best of 3: 567 µs per loop # John Galt
Large (df * 10000)
1000 loops, best of 3: 607 µs per loop # mine
vs
1000 loops, best of 3: 1.16 ms per loop # John Galt - assign
1 loop, best of 1: 1.42 ms per loop # John Galt - side effects
Note that the loop-based assignment is also space efficient.
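The frames used for these timings aren't shown; a rough sketch of how such a comparison could be reproduced (the dataframes below are purely illustrative):
import pandas as pd

df = [pd.DataFrame({'a': range(1000)}) for _ in range(4)]

%timeit for i, d in enumerate(df, 1): d['number'] = i
%timeit dff = [x.assign(number=i) for i, x in enumerate(df, 1)]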
Use
1)
In [454]: df = [x.assign(number=i) for i, x in enumerate(df, 1)]
In [455]: df[0]
Out[455]:
0 1 number
0 0.068330 0.708835 1
1 0.877747 0.586654 1
In [456]: df[1]
Out[456]:
0 1 number
0 0.430418 0.477923 2
1 0.049980 0.018981 2
A nice side effect is that you can assign the result to a new variable without altering the old list, like so:
dff = [x.assign(number=i) for i, x in enumerate(df, 1)]
2)
If you want an in-place modification via a list comprehension (DataFrame.insert works in place and returns None, hence the list of Nones below):
In [474]: [x.insert(x.shape[1] ,'number', i) for i, x in enumerate(df, 1)]
Out[474]: [None, None, None, None]
In [475]: df[0]
Out[475]:
0 1 number
0 0.207806 0.315701 1
1 0.464864 0.976156 1
I have a problem getting the rolling function of Pandas to do what I wish. I want, for each row, to calculate the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
id value
0 1 3
1 1 6
2 1 3
3 2 2
4 2 1
Now I wish to obtain the following DataFrame:
id value
0 1 3
1 1 6
2 1 6
3 2 2
4 2 2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
id value new
0 1 3 3
1 1 6 6
2 1 3 6
3 2 2 2
4 2 1 2
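For comparison (not part of the original answer), an expanding window expresses the same "maximum so far" idea, though cummax is simpler and faster:
df['new'] = df.groupby('id')['value'].expanding().max().reset_index(level=0, drop=True)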
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from Pandas 0.20.0 what's new
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop
I have a data frame which looks like this,
df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
df
and I have two lists,
lis1=['A']
lis2=['S','O']
I need to replace the values in col2 based on lis1 and lis2.
So I used np.where to do so,
like this:
df['col2'] = np.where(df.col2.isin(lis1),'PC',df.col2.isin(lis2),'Ln','others')
But it's throwing me the following error:
TypeError: function takes at most 3 arguments (5 given)
Any suggestion is much appreciated!
In the end, I am aiming to have the values in col2 of my data frame replaced as:
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Use double numpy.where:
lis1=['A']
lis2=['S','O']
df['col2'] = np.where(df.col2.isin(lis1),'PC',
np.where(df.col2.isin(lis2),'Ln','others'))
print (df)
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
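As an aside (not from the original answer), numpy.select can express the same two conditions without nesting, which some find more readable:
df['col2'] = np.select([df.col2.isin(lis1), df.col2.isin(lis2)], ['PC', 'Ln'], default='others')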
Timings:
#[60000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
In [257]: %timeit np.where(df.col2.isin(lis1),'PC',np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 8.15 ms per loop
In [258]: %timeit in1d_based(df, lis1, lis2)
100 loops, best of 3: 4.98 ms per loop
Here's one approach -
a = df.col2.values
df.col2 = np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
Sample step-by-step run -
# Input dataframe
In [206]: df
Out[206]:
col1 col2
0 1 A
1 2 A
2 3 S
3 4 O
4 5 S
5 6 P
# Extract out col2 values
In [207]: a = df.col2.values
# Form an indexing array based on where we have matches in lis1 or lis2 or neither
In [208]: idx = np.in1d(a,lis1) + 2*np.in1d(a,lis2)
In [209]: idx
Out[209]: array([1, 1, 2, 2, 2, 0])
# Index into a list of new strings with those indices
In [210]: newvals = np.take(['others','PC','Ln'], idx)
In [211]: newvals
Out[211]:
array(['PC', 'PC', 'Ln', 'Ln', 'Ln', 'others'],
dtype='|S6')
# Finally assign those into col2
In [212]: df.col2 = newvals
In [213]: df
Out[213]:
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Runtime test -
In [251]: df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
In [252]: df = pd.concat([df]*10000).reset_index(drop=True)
In [253]: lis1
Out[253]: ['A']
In [254]: lis2
Out[254]: ['S', 'O']
In [255]: def in1d_based(df, lis1, lis2):
...: a = df.col2.values
...: return np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
...:
# #jezrael's soln
In [256]: %timeit np.where(df.col2.isin(lis1),'PC', np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 3.78 ms per loop
In [257]: %timeit in1d_based(df, lis1, lis2)
1000 loops, best of 3: 1.89 ms per loop
I have a Pandas DataFrame with a column containing list objects
A
0 [1,2]
1 [3,4]
2 [8,9]
3 [2,6]
How can I access the first element of each list and save it into a new column of the DataFrame? To get a result like this:
A new_col
0 [1,2] 1
1 [3,4] 3
2 [8,9] 8
3 [2,6] 2
I know this could be done via iterating over each row, but is there any "pythonic" way?
As always, remember that storing non-scalar objects in frames is generally disfavoured, and should really only be used as a temporary intermediate step.
That said, you can use the .str accessor even though it's not a column of strings:
>>> df = pd.DataFrame({"A": [[1,2],[3,4],[8,9],[2,6]]})
>>> df["new_col"] = df["A"].str[0]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
>>> df["new_col"]
0 1
1 3
2 8
3 2
Name: new_col, dtype: int64
You can use map and a lambda function
df.loc[:, 'new_col'] = df.A.map(lambda x: x[0])
Use apply with x[0]:
df['new_col'] = df.A.apply(lambda x: x[0])
print (df)
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
You can use the method str.get:
df['A'].str.get(0)
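For example, to create the new column directly (a small sketch, assuming the same df as above):
df['new_col'] = df['A'].str.get(0)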
You can just use a conditional list comprehension which takes the first value of any iterable or else uses None for that item. List comprehensions are very Pythonic.
df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
Timings
df = pd.concat([df] * 10000)
%timeit df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
100 loops, best of 3: 13.2 ms per loop
%timeit df["new_col"] = df["A"].str[0]
100 loops, best of 3: 15.3 ms per loop
%timeit df['new_col'] = df.A.apply(lambda x: x[0])
100 loops, best of 3: 12.1 ms per loop
%timeit df.A.map(lambda x: x[0])
100 loops, best of 3: 11.1 ms per loop
Removing the safety check ensuring an iterable:
%timeit df['new_col'] = [val[0] for val in df["A"]]
100 loops, best of 3: 7.38 ms per loop
If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).
Fill the nans from the left with fillna, then get the leftmost column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
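On the example frame this should yield [1.0, 3.0, 4.0, NaN]. Newer pandas versions deprecate fillna(method=...); a minimal sketch of the equivalent call there (a hedged note, not part of the original answer):
df.bfill(axis=1).iloc[:, 0]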
This is a really messy way to do this: first use first_valid_index to get the first valid column for each row, convert the returned series to a dataframe so we can call apply row-wise, and use that to index back into the original df:
In [160]:
def func(x):
if x.values[0] is None:
return None
else:
return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
if x.first_valid_index() is None:
return None
else:
return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:
def get_first_non_null(df):
a = df.values
col_index = np.isnan(a).argmin(axis=1)
return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
a = df.values
n_rows, n_cols = a.shape
col_index = np.isnan(a).argmin(axis=1)
flat_index = n_cols * np.arange(n_rows) + col_index
return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
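Applied to the small frame from the question, both helpers should return something like the following (a sketch, assuming the df defined in the question):
get_first_non_null_vec(df)
# roughly: array([ 1.,  3.,  4., nan])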
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to
include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
This is nothing new, but it's a combination of the best bits of #yangie's approach with a list comprehension, and #EdChum's df.apply approach that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
groupby on axis=1
If we pass a callable that returns the same value for every column, we group all the columns together. This allows us to use the groupby first method, which makes this easy:
df.groupby(lambda x: 'Z', 1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a dataframe whose single column name is the value returned by the callable.
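To get a plain Series rather than a one-column frame, you could select that column afterwards (a small sketch, not from the original answer):
df.groupby(lambda x: 'Z', 1).first()['Z']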
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
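DataFrame.lookup has since been deprecated and removed in newer pandas; a minimal sketch of an equivalent using positional indexing (my adaptation, not from the original answer, assuming numpy is imported as np):
idx = df.columns.get_indexer(df.notna().idxmax(1))
df.to_numpy()[np.arange(len(df)), idx]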
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
Here is a one-line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is used as the index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() returns None, which cannot be used as an index, so I need an if-else statement.
I packed everything into a list comprehension for brevity.
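The same logic written as an explicit loop, for readability (an equivalent sketch):
result = []
for _, row in df.iterrows():
    idx = row.first_valid_index()
    result.append(row[idx] if idx is not None else None)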
JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
a = df.values
col_index = np.isnan(a).argmin(axis=1)
return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively wide:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]