df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
where num_legs, num_wings and num_specimen_seen are columns.
Now I have a tuple like ('num_wings', 'num_legs') and I want to check whether all of its values are columns of df: return True if they are, and False otherwise.
('num_wings', 'num_legs') -> this will return True
('abc', 'num_legs') -> False
You can use get_indexer here.
idxr = df.columns.get_indexer(tup)  # -1 marks labels not found in df.columns
all(idxr > -1)
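For reference, here is a minimal self-contained sketch of this approach (the has_all_columns wrapper name is mine, not from the original answer):

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'spider', 'fish'])

def has_all_columns(df, tup):
    # get_indexer returns each label's position in df.columns, or -1 if absent
    return all(df.columns.get_indexer(tup) > -1)

print(has_all_columns(df, ('num_wings', 'num_legs')))  # True
print(has_all_columns(df, ('abc', 'num_legs')))        # False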
Performance
cols = pd.Index(np.arange(10_000))
tup = tuple(np.arange(10_001))
%timeit all(cols.get_indexer(tup)>-1)
3.86 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit all(e in cols for e in tup)
5.96 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You simply have to check if all elements of the tuple are contained in df.columns:
df = ...
def check(tup):
    return all((e in df.columns) for e in tup)
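For example, with the DataFrame from the question:

check(('num_wings', 'num_legs'))  # True
check(('abc', 'num_legs'))        # False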
Performance comparison
@user3483203 proposed an alternative, quite succinct solution using get_indexer, so I performed a timeit comparison of both our solutions.
import random
import string
import pandas as pd
def rnd_str(l):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(l))
unique_strings = set(rnd_str(3) for _ in range(20000))
cols = pd.Index(unique_strings)
tup = tuple(rnd_str(3) for _ in range(5000))
%timeit all(cols.get_indexer(tup)>-1)
# 714 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all(e in cols for e in tup)
# 639 ns ± 0.988 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
###
tup = tuple(rnd_str(3) for _ in range(10000))
%timeit all(cols.get_indexer(tup)>-1)
# 1.29 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all(e in cols for e in tup)
# 1.23 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
It turns out the solution proposed in this post is significantly faster. The key advantage of this approach is that all() exits early as soon as an element of the tuple that is not in df.columns has been spotted.
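A small sketch that makes the short-circuit visible (the probe helper and the counting list are illustrative only, not part of either answer):

import pandas as pd

cols = pd.Index(['num_legs', 'num_wings', 'num_specimen_seen'])
checked = []

def probe(e):
    checked.append(e)  # record which elements were actually tested
    return e in cols

print(all(probe(e) for e in ('abc', 'num_legs')))  # False
print(checked)  # ['abc'] -- all() stopped at the first miss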
You can also simply iterate over each value in the tuple and check individually whether it is present in the DataFrame's columns:
>>> def check_presence(tup):
...     for x in tup:
...         if x not in df.columns:
...             return False
...     return True
check_presence(('num_wings', 'num_legs')) # returns True
check_presence(('abc', 'num_legs')) # returns False
Related
I am unsure about the cost of transforming a matrix of tuples into a list form, which is easier to manipulate. The main priority is being able to change a column of the matrix as fast as possible.
I have a matrix in the form of
[(a,b,c),(d,e,f),(g,h,i)]
which can be any size n x m, but for this example we'll take a 3x3 matrix.
My main goal is to be able to change the values of any one column of the matrix at a time (e.g. (b, e, h)).
My initial attempt was to transform the matrix into a list, i.e.
[[a,b,c],[d,e,f],[g,h,i]]
which would be easier to manipulate, but I feel that transforming every tuple into a list and back into a tuple would be costly.
My main question is: how can this be optimized as much as possible?
In [37]: def change_column_list_comp(old_m, col, value):
    ...:     return [
    ...:         tuple(list(row[:col]) + [value] + list(row[col + 1:]))
    ...:         for row in old_m
    ...:     ]
    ...:

In [38]: def change_column_list_convert(old_m, col, value):
    ...:     list_m = list(map(list, old_m))
    ...:     for row in list_m:
    ...:         row[col] = value
    ...:     return list(map(tuple, list_m))
    ...:
In [39]: m = [tuple('abc'), tuple('def'), tuple('ghi')]
In [40]: %timeit change_column_list_comp(m, 1, 2)
2.05 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [41]: %timeit change_column_list_convert(m, 1, 2)
1.28 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like converting to a list, modifying the values, and converting back to tuple is faster. Note that this may not be the most efficient way of writing these functions.
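For example, on the 3x3 matrix above:

m = [tuple('abc'), tuple('def'), tuple('ghi')]
change_column_list_convert(m, 1, 2)
# [('a', 2, 'c'), ('d', 2, 'f'), ('g', 2, 'i')]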
However, these functions seem to start to converge as we scale up our matrix.
In [6]: m_100k = [tuple(string.printable)] * 100_000
In [7]: %timeit change_column_list_comp(m_100k, 1, 2)
163 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit change_column_list_convert(m_100k, 1, 2)
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: m_1m = [tuple(string.printable)] * 1_000_000
In [43]: %timeit change_column_list_comp(m_1m, 1, 2)
1.72 s ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %timeit change_column_list_convert(m_1m, 1, 2)
1.24 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At the end of the day you should use the right tool for the job. While it's not really in the scope of the OP, it's worth mentioning that numpy is simply the better way to go here.
In [13]: m_np = np.array([list('abc'), list('def'), list('ghi')])
In [17]: %timeit m_np[:, 1] = 2; m_np
610 ns ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [20]: m_np_100k = np.array([[string.printable] * 100_000])
In [21]: %timeit m_np_100k[:, 1] = 2; m_np_100k
545 ns ± 63.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: m_np_1m = np.array([[string.printable] * 1_000_000])
# This might be using cached data
In [23]: %timeit m_np_1m[:, 1] = 2; m_np_1m
515 ns ± 31.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# Avoiding cache
In [24]: %timeit m_np_1m[:, 4] = 9; m_np_1m
557 ns ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This might not be the fairest comparison as we're manually returning the matrix, but you can see there is significant improvement.
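If the list-of-tuples structure is still needed afterwards, here is a small sketch of converting back (note that the array's string dtype coerces assigned values, so 2 is stored as '2'):

import numpy as np

m_np = np.array([list('abc'), list('def'), list('ghi')])
m_np[:, 1] = 2  # coerced to '2' by the array's '<U1' dtype
back = [tuple(row) for row in m_np.tolist()]
# [('a', '2', 'c'), ('d', '2', 'f'), ('g', '2', 'i')]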
I am trying to keep the string values in a Series and replace everything else with NaN. I tried the pandas.Series.apply function, but it is slow on a large amount of data. Is there any quicker way to replace the values?
Here is what I've tried; it's slow on a big Series (with a million items, for example):
s = pd.Series([1, 2, 3, 'str1', 'str2', 3])
s.apply(lambda x: x if type(x) == str else np.nan)
Use to_numeric with errors='coerce':
pd.to_numeric(s, errors='coerce')
If you also need integers, add Int64:
pd.to_numeric(s, errors='coerce').astype('Int64')
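A quick illustration of what the coercion produces on the sample Series (expected output shown as comments):

import pandas as pd

s = pd.Series([1, 2, 3, 'str1', 'str2', 3])

pd.to_numeric(s, errors='coerce')
# 0    1.0
# 1    2.0
# 2    3.0
# 3    NaN
# 4    NaN
# 5    3.0
# dtype: float64

pd.to_numeric(s, errors='coerce').astype('Int64')
# 0       1
# 1       2
# 2       3
# 3    <NA>
# 4    <NA>
# 5       3
# dtype: Int64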
EDIT: You can use isinstance with map, and also Series.where:
# test on 600k elements
N = 100000
s = pd.Series([1, 2, 3, 'str1', 'str2', 3] * N)
In [152]: %timeit s.apply(lambda x: x if type(x) == str else np.nan)
196 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [153]: %timeit s.map(lambda x: x if isinstance(x, str) else np.nan)
174 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [154]: %timeit s.where(s.map(lambda x: isinstance(x, str)))
168 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [155]: %timeit s.where(pd.to_numeric(s, errors='coerce').isna())
366 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
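For reference, a sketch of what the where/map variant returns on the original six-element Series:

s = pd.Series([1, 2, 3, 'str1', 'str2', 3])
s.where(s.map(lambda x: isinstance(x, str)))
# 0     NaN
# 1     NaN
# 2     NaN
# 3    str1
# 4    str2
# 5     NaN
# dtype: object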
I have a pandas dataframe df from which I need to create a list Row_list.
import pandas as pd
df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
                   [2, 572558.153, 166442.134, -11.768, -11.983],
                   [3, 572124.999, 166423.478, -11.861, -11.512],
                   [4, 572534.264, 166414.417, -11.123, -11.993]],
                  columns=['PointNo', 'easting', 'northing', 't_20080729', 't_20090808'])
I am able to create the list in the required format with the code below, but my dataframe has up to 8 million rows and the list creation is very slow.
def test_get_value_iterrows(df):
    Row_list = []
    for index, rows in df.iterrows():
        entirerow = df.values[index].tolist()
        entirerow.append((df.iloc[index, 1], df.iloc[index, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value_iterrows(df)
436 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not using df.iterrows() is a little bit faster:
def test_get_value(df):
    Row_list = []
    for i in df.index:
        entirerow = df.values[i].tolist()
        entirerow.append((df.iloc[i, 1], df.iloc[i, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value(df)
270 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am wondering if there is a faster solution to this?
Use a list comprehension:
df = pd.concat([df] * 10000, ignore_index=True)
In [123]: %timeit [[*x, (x[1], x[2])] for x in df.values.tolist()]
27.8 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [124]: %timeit [x + [(x[1], x[2])] for x in df.values.tolist()]
26.6 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [125]: %timeit (test_get_value(df))
41.2 s ± 1.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
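Putting it together on the original (unscaled) frame, a small self-contained sketch:

import pandas as pd

df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
                   [2, 572558.153, 166442.134, -11.768, -11.983]],
                  columns=['PointNo', 'easting', 'northing', 't_20080729', 't_20090808'])

# Each row's values plus an appended (easting, northing) tuple
Row_list = [[*x, (x[1], x[2])] for x in df.values.tolist()]
Row_list[0]
# [1.0, 572548.283, 166424.411, -11.849, -11.512, (572548.283, 166424.411)]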
I'm trying to element-wise multiply two arrays to form a single string.
Can anyone advise?
import numpy as np
def array_translate(array):
    intlist = [x for x in array if isinstance(x, int)]
    strlist = [x for x in array if isinstance(x, str)]
    joinedlist = np.multiply(intlist, strlist)
    return "".join(joinedlist)
print(array_translate(["Cat", 2, "Dog", 3, "Mouse", 1])) # => "CatCatDogDogDogMouse"
I receive this error:
File "/Users/peteryoon/PycharmProjects/Test3/Test3.py", line 8, in array_translate
joinedlist = np.multiply(intlist, strlist)
numpy.core._exceptions.UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('<U21')
I was able to solve this using the list comprehension below, but I'm curious to see how it could be done with numpy.
def array_translate(array):
    intlist = [x for x in array if isinstance(x, int)]
    strlist = [x for x in array if isinstance(x, str)]
    # int * str repeats the string, e.g. 2 * 'Cat' -> 'CatCat'
    return "".join(s * n for n, s in zip(intlist, strlist))
print(array_translate(["Cat", 2, "Dog", 3, "Mouse", 1])) # => "CatCatDogDogDogMouse"
In [79]: arr = np.array(['Cat','Dog','Mouse'])
In [80]: cnt = np.array([2,3,1])
Timings for various alternatives. The relative placement may vary with the size of the arrays (and whether you start with lists or arrays). So do your own testing:
In [93]: timeit ''.join(np.repeat(arr,cnt))
7.98 µs ± 57.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [94]: timeit ''.join([str(wd)*i for wd,i in zip(arr,cnt)])
5.96 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [95]: timeit ''.join(arr.astype(object)*cnt)
13.3 µs ± 50.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [96]: timeit ''.join(np.char.multiply(arr,cnt))
27.4 µs ± 307 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [100]: timeit ''.join(np.frompyfunc(lambda w,i: w*i,2,1)(arr,cnt))
10.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [101]: %%timeit f = np.frompyfunc(lambda w,i: w*i,2,1)
...: ''.join(f(arr,cnt))
7.95 µs ± 93.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [102]: %%timeit x=arr.tolist(); y=cnt.tolist()
...: ''.join([str(wd)*i for wd,i in zip(x,y)])
1.36 µs ± 39.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
np.repeat works for all kinds of arrays.
List comprehension uses the string multiply, and shouldn't be dismissed out of hand. Often it is fastest, especially if starting with lists.
Object dtype converts the string dtype to Python strings, and then delegates the action to the string multiply.
np.char applies string methods to the elements of an array. While convenient, it is seldom fast.
edit
In [104]: timeit ''.join(np.repeat(arr,cnt).tolist())
4.04 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Perhaps use np.repeat:
z = np.array(['Cat', 'Dog', 'Mouse'], dtype='<U5')
"".join(np.repeat(z, (2, 3, 1)))
'CatCatDogDogDogMouse'
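To make the mechanics explicit, np.repeat expands each element by its corresponding count before the join:

np.repeat(z, (2, 3, 1))
# array(['Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Mouse'], dtype='<U5')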
I have a Python pandas DataFrame with several columns. Now I want to copy all values into one single column so I can run value_counts over all values at once; at the end I need the total count of string1, string2, etc. What is the best way to do this?
index    row 1      row 2    ...
0        string1    string3
1        string1    string1
2        string2    string2
...
If performance is an issue try:
from collections import Counter
Counter(df.values.ravel())
#Counter({'string1': 3, 'string2': 2, 'string3': 1})
Or stack it into one Series then use value_counts
df.stack().value_counts()
#string1 3
#string2 2
#string3 1
#dtype: int64
For larger (long) DataFrames with a small number of columns, looping may be faster than stacking:
s = pd.Series()
for col in df.columns:
    s = s.add(df[col].value_counts(), fill_value=0)
#string1 3.0
#string2 2.0
#string3 1.0
#dtype: float64
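The float dtype is a side effect of Series.add with fill_value; if integer counts are needed, a final cast restores them:

s.astype(int)
# string1    3
# string2    2
# string3    1
# dtype: int64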
Also, there's a numpy solution:
import numpy as np
np.unique(df.to_numpy(), return_counts=True)
#(array(['string1', 'string2', 'string3'], dtype=object),
# array([3, 2, 1], dtype=int64))
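If a value_counts-like Series is preferred over the raw arrays, the np.unique output can be wrapped (a convenience sketch, not part of the original answer):

vals, counts = np.unique(df.to_numpy(), return_counts=True)
pd.Series(counts, index=vals)
# string1    3
# string2    2
# string3    1
# dtype: int64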
df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})

def vc_from_loop(df):
    s = pd.Series()
    for col in df.columns:
        s = s.add(df[col].value_counts(), fill_value=0)
    return s
Small DataFrame
%timeit Counter(df.values.ravel())
#11.1 µs ± 56.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.stack().value_counts()
#835 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vc_from_loop(df)
#2.15 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#23.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Long DataFrame
df = pd.concat([df]*300000, ignore_index=True)
%timeit Counter(df.values.ravel())
#124 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.stack().value_counts()
#337 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vc_from_loop(df)
#182 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#1.16 s ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)