How to replace non-string values with NaN in a pandas Series? - python

Tried to do this with the pandas.Series.apply function, but it is slow on a large amount of data. Is there any quicker way to replace the values?
Here is what I've tried, but it's slow on a big Series (with a million items, for example):
s = pd.Series([1, 2, 3, 'str1', 'str2', 3])
s.apply(lambda x: x if type(x) == str else np.nan)

Use to_numeric with errors='coerce' (numeric values are kept; anything that cannot be parsed as a number becomes NaN):
pd.to_numeric(s, errors='coerce')
If you also need integers, convert to the nullable Int64 dtype:
pd.to_numeric(s, errors='coerce').astype('Int64')
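For the sample Series from the question, this is roughly what the coercion produces (a minimal sketch; exact formatting may differ between pandas versions):

import pandas as pd

s = pd.Series([1, 2, 3, 'str1', 'str2', 3])

# Strings cannot be parsed as numbers, so they become NaN; the numeric
# values are kept (as floats, because NaN forces a float dtype).
print(pd.to_numeric(s, errors='coerce'))
# 0    1.0
# 1    2.0
# 2    3.0
# 3    NaN
# 4    NaN
# 5    3.0
# dtype: float64

# The nullable Int64 dtype keeps the integers intact and uses <NA> instead.
print(pd.to_numeric(s, errors='coerce').astype('Int64'))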
EDIT: To keep the strings (as in the question), you can use isinstance with map, or Series.where:
#test 600k
N = 100000
s = pd.Series([1, 2, 3, 'str1', 'str2', 3] * N)
In [152]: %timeit s.apply(lambda x: x if type(x) == str else np.nan)
196 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [153]: %timeit s.map(lambda x: x if isinstance(x, str) else np.nan)
174 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [154]: %timeit s.where(s.map(lambda x: isinstance(x, str)))
168 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [155]: %timeit s.where(pd.to_numeric(s, errors='coerce').isna())
366 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
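For completeness, a small sketch of the where-based variant from the timings above, which keeps the strings and masks everything else with NaN; the helper name keep_strings is made up here:

import pandas as pd

def keep_strings(s):
    # The mask is True where the element is a str; all other rows become NaN.
    return s.where(s.map(lambda x: isinstance(x, str)))

s = pd.Series([1, 2, 3, 'str1', 'str2', 3])
print(keep_strings(s))
# 0     NaN
# 1     NaN
# 2     NaN
# 3    str1
# 4    str2
# 5     NaN
# dtype: object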

Is it possible to multiply a string to an integer using numpy multiplication?

I'm trying to element-wise multiply two arrays to form a single string.
Can anyone advise?
import numpy as np
def array_translate(array):
    intlist = [x for x in array if isinstance(x, int)]
    strlist = [x for x in array if isinstance(x, str)]
    joinedlist = np.multiply(intlist, strlist)
    return "".join(joinedlist)
print(array_translate(["Cat", 2, "Dog", 3, "Mouse", 1])) # => "CatCatDogDogDogMouse"
I receive this error:
File "/Users/peteryoon/PycharmProjects/Test3/Test3.py", line 8, in array_translate
joinedlist = np.multiply(intlist, strlist)
numpy.core._exceptions.UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('<U21')
I was able to solve it with the list comprehension below, but I'm curious how to make this work with numpy.
def array_translate(array):
    intlist = [x for x in array if isinstance(x, int)]
    strlist = [x for x in array if isinstance(x, str)]
    return "".join(intlist*strlist for intlist, strlist in zip(intlist, strlist))
print(array_translate(["Cat", 2, "Dog", 3, "Mouse", 1])) # => "CatCatDogDogDogMouse"
In [79]: arr = np.array(['Cat','Dog','Mouse'])
In [80]: cnt = np.array([2,3,1])
Timings for various alternatives. The relative ranking may vary with the size of the arrays (and whether you start with lists or arrays), so do your own testing:
In [93]: timeit ''.join(np.repeat(arr,cnt))
7.98 µs ± 57.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [94]: timeit ''.join([str(wd)*i for wd,i in zip(arr,cnt)])
5.96 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [95]: timeit ''.join(arr.astype(object)*cnt)
13.3 µs ± 50.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [96]: timeit ''.join(np.char.multiply(arr,cnt))
27.4 µs ± 307 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [100]: timeit ''.join(np.frompyfunc(lambda w,i: w*i,2,1)(arr,cnt))
10.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [101]: %%timeit f = np.frompyfunc(lambda w,i: w*i,2,1)
...: ''.join(f(arr,cnt))
7.95 µs ± 93.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [102]: %%timeit x=arr.tolist(); y=cnt.tolist()
...: ''.join([str(wd)*i for wd,i in zip(x,y)])
1.36 µs ± 39.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
np.repeat works for all kinds of arrays.
The list comprehension uses Python's string multiplication and shouldn't be dismissed out of hand; it is often the fastest, especially if you start with lists.
Object dtype converts the elements to Python strings and then delegates the multiplication to them.
np.char applies string methods to the elements of an array. While convenient, it is seldom fast.
edit
In [104]: timeit ''.join(np.repeat(arr,cnt).tolist())
4.04 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Perhaps using repeat:
z = np.array(['Cat', 'Dog', 'Mouse'], dtype='<U5')
"".join(np.repeat(z, (2, 3, 1)))
'CatCatDogDogDogMouse'
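Pulling that together for the mixed input from the question, a minimal sketch of a repeat-based version (the function name array_translate_repeat is made up here):

import numpy as np

def array_translate_repeat(array):
    # Split the mixed list into words and repeat counts, as in the question.
    words = [x for x in array if isinstance(x, str)]
    counts = [x for x in array if isinstance(x, int)]
    # np.repeat repeats each word counts[i] times; join flattens to one string.
    return "".join(np.repeat(words, counts))

print(array_translate_repeat(["Cat", 2, "Dog", 3, "Mouse", 1]))
# CatCatDogDogDogMouse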

pandas and tuple check

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
where num_legs, num_wings and num_specimen_seen are columns.
Now I have a tuple like ('num_wings', 'num_legs') and want to check whether all of its values are columns of df: if yes, return True, otherwise False.
('num_wings', 'num_legs') -> this should return True
('abc', 'num_legs') -> False
You can use get_indexer here.
idxr = df.columns.get_indexer(tup)
all(idxr>-1)
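To see why this works, a quick sketch on the frame from the question: get_indexer returns the position of each label in df.columns and -1 for labels that are not there, so requiring every entry to be greater than -1 means every tuple element is a column (the helper name has_all is made up here):

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

print(df.columns.get_indexer(('num_wings', 'num_legs')))  # [1 0]
print(df.columns.get_indexer(('abc', 'num_legs')))        # [-1  0]

def has_all(tup):
    # -1 marks a label that is not a column
    return bool((df.columns.get_indexer(tup) > -1).all())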
Performance
cols = pd.Index(np.arange(10_000))
tup = tuple(np.arange(10_001))
%timeit all(cols.get_indexer(tup)>-1)
3.86 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit all(e in cols for e in tup)
5.96 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You simply have to check if all elements of the tuple are contained in df.columns:
df = ...
def check(tup):
    return all((e in df.columns) for e in tup)
Performance comparison
@user3483203 proposed an alternative, quite succinct solution using get_indexer, so I ran a timeit comparison of both solutions.
import random
import string
import pandas as pd
def rnd_str(l):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(l))
unique_strings = set(rnd_str(3) for _ in range(20000))
cols = pd.Index(unique_strings)
tup = tuple(rnd_str(3) for _ in range(5000))
%timeit all(cols.get_indexer(tup)>-1)
# 714 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all(e in cols for e in tup)
# 639 ns ± 0.988 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
###
tup = tuple(rnd_str(3) for _ in range(10000))
%timeit all(cols.get_indexer(tup)>-1)
# 1.29 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all(e in cols for e in tup)
# 1.23 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
It turns out the solution proposed in this post is significantly faster. The key advantage of this approach is that all() exits early as soon as it encounters an element of the tuple that is not in df.columns.
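A tiny illustration of that short-circuit behaviour (the helper seen is only there to show how far the generator gets before all() stops):

import pandas as pd

cols = pd.Index(['num_legs', 'num_wings', 'num_specimen_seen'])

def seen(e):
    print('checking', e)
    return e in cols

print(all(seen(e) for e in ('abc', 'num_legs')))
# checking abc
# False          <- 'num_legs' is never checked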
Why can't you iterate over each value in the tuple and check individually whether it is present in the dataframe's columns?
>>> def check_presence(tuple):
...     for x in tuple:
...         if x not in df.columns:
...             return False
...     return True
check_presence(('num_wings', 'num_legs')) # returns True
check_presence(('abc', 'num_legs')) # returns False

Efficient way to check dtype of each row in a series

Say I have mixed ts/other data:
ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4
type(ser.loc[0])
> pandas._libs.tslibs.timestamps.Timestamp
I would like to filter for all timestamps. For instance, this gives me what I want:
ser.apply(lambda x: isinstance(x, pd.Timestamp))
0 True
1 True
2 True
3 False
4 True
...
But I assume it would be faster to use a vectorized solution and avoid apply. I thought I should be able to use where:
ser.where(isinstance(ser, pd.Timestamp))
But I get
ValueError: Array conditional must be same shape as self
Is there a way to do this? Also, am I correct in my assumption that it would be faster/more 'Pandasic'?
It depends on the length of the data, but here, for small data (365 rows), a list comprehension is faster:
In [108]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
434 µs ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [109]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
140 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [110]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
1.01 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But if you test a larger Series, to_datetime combined with checking for non-missing values via Series.notna is faster:
ser = pd.Series(pd.date_range('1980/01/05', '2020/01/05'))
ser.loc[3] = 4
print (len(ser))
14611
In [116]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
6.42 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [117]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
4.9 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [118]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
4.22 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
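To actually filter with the list-comprehension mask, a small sketch that keeps the original objects rather than converting them:

import pandas as pd

ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4  # one non-timestamp value, as in the question

# Build a boolean mask element by element and index with it.
mask = [isinstance(x, pd.Timestamp) for x in ser]
only_timestamps = ser[mask]  # row 3 is filtered out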
To address your question of filtering, you can convert to datetime and drop the NaT values.
ser[pd.to_datetime(ser, errors='coerce').notna()]
Or, if you don't mind the result being datetime,
pd.to_datetime(ser, errors='coerce').dropna()

Create a new Pandas df column with boolean values that depend on another column

I need to add a new column to a Pandas dataframe.
If the column "Inducing" contains text (not null and not "") I need to add a 1, otherwise a 0.
I tried with
df['newColumn'] = np.where(df['INDUCING']!="", 1, 0)
This command only works for values that are strings initialized as "", but does not work when the value is null.
Any idea on how to add this column correctly?
By De Morgan's laws, NOT(cond1 OR cond2) is equivalent to NOT(cond1) AND NOT(cond2).
You can combine conditions via the bitwise "and" (&) / "or" (|) operators as appropriate. This gives a Boolean series, which you can then cast to int:
df['newColumn'] = (df['INDUCING'].ne('') & df['INDUCING'].notnull()).astype(int)
The easiest way would be to .fillna('') first. Corrected version:
df['newColumn'] = np.where(df['INDUCING'].fillna('') != "", 1, 0)
or apply .astype(int) directly to the mask; this converts True to 1 and False to 0:
df['newcol'] = (df['INDUCING'].fillna('') != '').astype(int)
As the built-in bool produces True for a string exactly when it is non-empty (and False for None), you can achieve this simply with
df['newColumn'] = df['INDUCING'].astype(bool).astype(int)
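A quick sanity check of the bool-cast on a small column containing a non-empty string, None and '' (the same data used in the timings below):

import pandas as pd

df = pd.DataFrame({'INDUCING': ['test', None, '', 'more test']})

# Non-empty strings are truthy; '' and None are falsy, so both map to 0.
print(df['INDUCING'].astype(bool).astype(int).tolist())
# [1, 0, 0, 1]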
Some performance comparisons:
In [61]: df = pd.DataFrame({'INDUCING': ['test', None, '', 'more test']*10000})
In [63]: %timeit np.where(df['INDUCING'].fillna('') != "", 1, 0)
5.68 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: %timeit (df['INDUCING'].ne('') & df['INDUCING'].notnull()).astype(int)
5.1 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [64]: %timeit np.where(df['INDUCING'], 1, 0)
667 µs ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [65]: %timeit df['INDUCING'].astype(bool).astype(int)
655 µs ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [99]: %timeit df['INDUCING'].values.astype(bool).astype(int)
553 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Get total values_count from a dataframe with Python Pandas

I have a Python pandas DataFrame with several columns. Now I want to copy all values into one single column to get a value_counts result that includes all values. At the end I need the total count of string1, string2, etc. What is the best way to do it?
index  row 1    row 2    ...
0      string1  string3
1      string1  string1
2      string2  string2
...
If performance is an issue try:
from collections import Counter
Counter(df.values.ravel())
#Counter({'string1': 3, 'string2': 2, 'string3': 1})
Or stack it into one Series then use value_counts
df.stack().value_counts()
#string1 3
#string2 2
#string3 1
#dtype: int64
For larger (long) DataFrames with a small number of columns, looping may be faster than stacking:
s = pd.Series()
for col in df.columns:
    s = s.add(df[col].value_counts(), fill_value=0)
#string1 3.0
#string2 2.0
#string3 1.0
#dtype: float64
Also, there's a numpy solution:
import numpy as np
np.unique(df.to_numpy(), return_counts=True)
#(array(['string1', 'string2', 'string3'], dtype=object),
# array([3, 2, 1], dtype=int64))
df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})
def vc_from_loop(df):
    s = pd.Series()
    for col in df.columns:
        s = s.add(df[col].value_counts(), fill_value=0)
    return s
Small DataFrame
%timeit Counter(df.values.ravel())
#11.1 µs ± 56.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.stack().value_counts()
#835 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vc_from_loop(df)
#2.15 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#23.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Long DataFrame
df = pd.concat([df]*300000, ignore_index=True)
%timeit Counter(df.values.ravel())
#124 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.stack().value_counts()
#337 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vc_from_loop(df)
#182 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#1.16 s ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
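If what you ultimately need is the total count per string as a single Series, a small sketch building on the Counter approach above:

from collections import Counter
import pandas as pd

df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})

# Flatten every cell into one 1-D array, count, and wrap as a Series.
total = pd.Series(Counter(df.values.ravel())).sort_values(ascending=False)
print(total)
# string1    3
# string2    2
# string3    1
# dtype: int64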
