Related
I have a dataframe containing 15K+ strings in the format of xxxx-yyyyy-zzz. The yyyyy is a random 5 digit number generated. Given that I have xxxx as 1000 and zzz as 200, how can I generate the random yyyyy and add it to the dataframe so that the string is unique?
number
0 1000-12345-100
1 1000-82045-200
2 1000-93035-200
import pandas as pd
data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)
I'd generate a new column with just the middle values and generate random numbers until you find one that's not in the column.
from random import randint
df["excl"] = df.number.apply(lambda x:int(x.split("-")[1]))
num = randint(10000, 99999)
while num in df.excl.values:
num = randint(10000, 99999)
I tried to come up with a generic approach, you can use this for lists:
import random
number_series = ["1000-12345-100", "1000-82045-200", "1000-93035-200"]
def rnd_nums(n_numbers: int, number_series: list, max_length: int=5, prefix: int=1000, suffix: int=100):
# ignore following numbers
blacklist = [int(x.split('-')[1]) for x in number_series]
# define space with allowed numbers
rng = range(0, 10**max_length)
# get unique sample of length "n_numbers"
lst = random.sample([i for i in rng if i not in blacklist], n_numbers)
# return sample as string with pre- and suffix
return ['{}-{:05d}-{}'.format(prefix, mid, suffix) for mid in lst]
rnd_nums(5, number_series)
Out[69]:
['1000-79396-100',
'1000-30032-100',
'1000-09188-100',
'1000-18726-100',
'1000-12139-100']
Or use it to generate new rows in a dataframe Dataframe:
import pandas as pd
data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)
df.append(pd.DataFrame({'number': rnd_nums(5, number_series)}), ignore_index=True)
Out[72]:
number
0 1000-12345-100
1 1000-82045-200
2 1000-93035-200
3 1000-00439-100
4 1000-36284-100
5 1000-64592-100
6 1000-50471-100
7 1000-02005-100
In addition to the other suggestions, you could also write a function that takes your df and the amount of new numbers you would like to add as arguments, appends it with the new numbers and returns the updated df. The function could look like this:
import pandas as pd
import random
def add_number(df, num):
lst = []
for n in df["number"]:
n = n.split("-")[1]
lst.append(int(n))
for i in range(num):
check = False
while check == False:
new_number = random.randint(10000, 99999)
if new_number not in lst:
lst.append(new_number)
l = len(df["number"])
df.at[l+1,"number"] = "1000-%i-200" % new_number
check = True
df = df.reset_index(drop=True)
return df
This would have the advantage that you could use the function every time you want to add new numbers.
try:
import random
df['number'] = [f"1000-{x}-200" for x in random.sample(range(10000, 99999), len(df))]
output:
number
0 1000-24744-200
1 1000-28991-200
2 1000-98322-200
...
One option is to use sample from the random module:
import random
num_digits = 5
col_length = 15000
rand_nums = random.sample(range(10**num_digits),col_length)
data["number"]=['-'.join(
'1000',str(num).zfill(num_digits),'200')
for num in rand_nums]
It took my computer about 30 ms to generate the numbers. For numbers with more digits, it may become infeasible.
Another option is to just take sequential integers, then encrypt them. This will result in a sequence in which each element is unique. They will be pseudo-random, rather than truly random, but then Python's random module is producing pseudo-random numbers as well.
I have a for loop which deals with more than 9 million combinations (for this, I've used itertools library), must perform the code below faster, it's taking too long to loop over all combinations. Appreciate any suggestions
wb = xw.books('FX VEGA BT.xlsm')
sht = wb.sheets['Sheet1']
#retrieving data from excel
df = pd.DataFrame(sht.range('PY_PNL').value, columns=['10','20','25','40','50','60','70','75','80','90'])
#df has shape of 3115 rows × 10 columns
def sharpe(x):
s = round(np.average(x)/np.std(x)*np.sqrt(252),2)
return s
shrps = []
outlist = []
mult = (-1,-2.5,0,1,2.5)
perm = itertools.product(mult,repeat = 10)
for p in perm:
c = df*p
c = c.sum(axis='columns')
outlist.append(p)
shrps.append(sharpe(c))
You can use a list comprehension and it'll be a bit more faster:
shrps = [sharpe((df*p).sum(axis='columns')) for p in perms]
If you really need a copy of perm named as outlist, you can use deepcopy from copy package:
import copy
outlist = copy.deepcopy(perm)
To speed up the process more, you can change something (I don't know how it looks like) in sharpe() function.
Simply put, I want to change the following code into a funtion that doesn't use apply or progress_apply, so that the performance doesn't take 4+ hours to execute on 20 million+ rows.
d2['B'] = d2['C'].progress_apply(lambda x: [z for y in d1['B'] for z in y if x.startswith(z)])
d2['B'] = d2['B'].progress_apply(max)
Full question below:
I have two dataframes. The first dataframe has a column with 4 categories (A,B,C,D) with four different lists of numbers that I want to compare against a column in the second dataframe, which is not a list like in the first dataframe but instead just a single value that will start with one or more values from the first dataframe. As such, after executing some list comprehension to return a list of matching values in a new column in the second dataframe, the final step is to get the max of those values per list per row:
d1 = pd.DataFrame({'A' : ['A', 'B', 'C', 'D'],
'B' : [['84'], ['8420', '8421', '8422', '8423', '8424', '8425', '8426'], ['847', '8475'], ['8470', '8471']]})
A B
0 A [84]
1 B [8420, 8421, 8422, 8423, 8424, 8425, 8426]
2 C [847, 8475]
3 D [8470, 8471]
d2 = pd.DataFrame({'C' : [8420513, 8421513, 8426513, 8427513, 8470513, 8470000, 8475000]})
C
0 8420513
1 8421513
2 8426513
3 8427513
4 8470513
5 8470000
6 8475000
My current code is this:
from tqdm import tqdm, tqdm_notebook
tqdm_notebook().pandas()
d1 = pd.DataFrame({'A' : ['A', 'B', 'C', 'D'], 'B' : [['84'], ['8420', '8421', '8422', '8423', '8424', '8425', '8426'], ['847', '8475'], ['8470', '8471']]})
d2 = pd.DataFrame({'C' : [8420513, 8421513, 8426513, 8427513, 8470513, 8470000, 8475000]})
d2['C'] = d2['C'].astype(str)
d2['B'] = d2['C'].progress_apply(lambda x: [z for y in d1['B'] for z in y if x.startswith(z)])
d2['B'] = d2['B'].progress_apply(max)
d2
and successfully returns this output:
C B
0 8420513 8420
1 8421513 8421
2 8426513 8426
3 8427513 84
4 8470513 8470
5 8470000 8470
6 8475000 8475
The problem lies with the fact that the tqdm progress bar is estimating the code will take 4-5 hours to run on my actual DataFrame with 20 million plus rows. I know that .apply should be avoided and that a custom function can be much faster, so that I don't have to go row-by-row. I can usually change apply to a function, but I am struggling with this particular one. I think I am far away, but I will share what I have tried:
def func1(df, d2C, d1B):
return df[[z for y in d1B for z in y if z in d2C]]
d2['B'] = func1(d2, d2['C'], d1['B'])
d2
With this code, I am receiving ValueError: Wrong number of items passed 0, placement implies 1 and also still need to include code to get the max of each list per row.
Let's try, using explode and regex with extract:
d1e = d1['B'].explode()
regstr = '('+'|'.join(sorted(d1e)[::-1])+')'
d2['B'] = d2['C'].astype('str').str.extract(regstr)
Output:
C B
0 8420513 8420
1 8421513 8421
2 8426513 8426
3 8427513 84
4 8470513 8470
5 8470000 8470
6 8475000 8475
Since, .str access is slower than list comprehension
import re
regstr = '|'.join(sorted(d1e)[::-1])
d2['B'] = [re.match(regstr, i).group() for i in d2['C'].astype('str')]
Timings:
from timeit import timeit
import re
d1 = pd.DataFrame({'A' : ['A', 'B', 'C', 'D'], 'B' : [['84'], ['8420', '8421', '8422', '8423', '8424', '8425', '8426'], ['847', '8475'], ['8470', '8471']]})
d2 = pd.DataFrame({'C' : [8420513, 8421513, 8426513, 8427513, 8470513, 8470000, 8475000]})
d2['C'] = d2['C'].astype(str)
def orig(d):
d['B'] = d['C'].apply(lambda x: [z for y in d1['B'] for z in y if x.startswith(z)])
d['B'] = d['B'].apply(max)
return d
def comtorecords(d):
d['B']=[max([z for y in d1.B for z in y if str(row[1]) .startswith(z)]) for row in d.to_records()]
return d
def regxstracc(d):
d1e = d1['B'].explode()
regstr = '('+'|'.join(sorted(d1e)[::-1])+')'
d['B'] = d['C'].astype('str').str.extract(regstr)
return d
def regxcompre(d):
regstr = '|'.join(sorted(d1e)[::-1])
d['B'] = [re.match(regstr, i).group() for i in d['C'].astype('str')]
return d
res = pd.DataFrame(
index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
columns='orig comtorecords regxstracc regxcompre'.split(),
dtype=float
)
for i in res.index:
d = pd.concat([d2]*i)
for j in res.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
print(stmt, d.shape)
res.at[i, j] = timeit(stmt, setp, number=100)
# res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);
res.plot(loglog=True);
Output:
I have found yet faster solution, compared to both propositions of Scott.
def vect(d):
def extr(txt):
mtch = pat.match(txt)
return mtch.group() if mtch else ''
d1e = d1.B.explode()
pat = re.compile('|'.join(sorted(d1e)[::-1]))
d['B'] = np.vectorize(extr)(d.C)
return d
One speed gain is from prior compilation of the regex.
The second gain is due to use of a Numpy vectorization, instead of
a list comprehension.
Running a testing loop like the one employed by Scott, I received
the following result:
So "my" execution time (the red line), especially for larger data volumes,
is about 60 % of both regxstracc and regxcompre.
Try explode the d1 so we reduce one for loop
[max([z for z in d1.B.explode() if x.startswith(z)]) for x in d2.C.astype(str) ]
['8420', '8421', '8426', '84', '8470', '8470', '8475']
Using re.compile() to compile the pattern prior to execution (i.e. outside of the ‘loop’) helps a lot when solving these big data row-by-row ‘requires a custom solution that isn’t a NumPy vector function already‘ problems.
Using the correct flags when compiling a pattern is work checking too.
Not to sound like a broken record but with NLP/text processing it’s definitely significant.
Another suggestion: Also use efficient chunking.
Each computing project has some X memory threshold and some Y processing threshold; balancing these by chunking helps improve execution speed.
Pandas chunking with its read/write functions are useful from my experience.
Finally, if you are executing I/O operations you might squeeze out some extra performance mapping the function within a ThreadPoolExecutor (A quick google can give you some good template code, and the documentation explains the utility well; I found the template code from documentation hard to implement at first so I’d check other resources).
Best of luck!
I need to find a more efficient solution for the following problem:
Given is a dataframe with 4 variables in each row. I need to find the list of 8 elements that includes all the variables per row in a maximum amount of rows.
A working, but very slow, solution is to create a second dataframe containing all possible combinations (basically a permutation without repetation). Then loop through every combination and compare it wit the inital dataframe. The amount of solutions is counted and added to the second dataframe.
import numpy as np
import pandas as pd
from itertools import combinations
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)
listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))
possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns = ['M','N','O','P','Q','R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = ''
for x, row in dfcombi.iterrows():
comparelist = row['List'].split(',')
pointercounter = df.index[(df['A'].isin(comparelist) == True) & (df['B'].isin(comparelist) == True) & (df['C'].isin(comparelist) == True) & (df['D'].isin(comparelist) == True)].tolist()
row['Count'] = len(pointercounter)
I assume there must be a way to avoid the for - loop and replace it with some pointer, i just can not figure out how.
Thanks!
Your code can be rewritten as:
# working with integers are much better than strings
enums, codes = df.stack().factorize()
# encodings of df
s = [set(x) for x in enums.reshape(-1,4)]
# possible combinations
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(codes)), 6)])
# count the combination with issubset
ret = [0]*len(possiblecombinations)
for a, (i,b) in product(s, enumerate(possiblecombinations)):
ret[i] += a.issubset(b)
# the combination with maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# in code {0, 3, 4, 5, 17, 18}
# and in values:
codes[list(max_combination)]
# Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')
All that took about 2 seconds as oppose to your code that took around 1.5 mins.
I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on stack overflow the last few days. The following code is very similar to the implementation of numpy.unique and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.
import numpy as np
def group_min(groups, data):
# sort with major key groups, minor key data
order = np.lexsort((data, groups))
groups = groups[order] # this is only needed if groups is unsorted
data = data[order]
# construct an index which marks borders between groups
index = np.empty(len(groups), 'bool')
index[0] = True
index[1:] = groups[1:] != groups[:-1]
return data[index]
#max is very similar
def group_max(groups, data):
order = np.lexsort((data, groups))
groups = groups[order] #this is only needed if groups is unsorted
data = data[order]
index = np.empty(len(groups), 'bool')
index[-1] = True
index[:-1] = groups[1:] != groups[:-1]
return data[index]
In pure Python:
from itertools import groupby, imap, izip
from operator import itemgetter as ig
print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
# -> [7, 10, 1]
A variation:
print [data[id==i].max() for i, _ in groupby(id)]
# -> [7, 10, 1]
Based on #Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print data[np.r_[np.diff(id), True].astype(np.bool)]
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print df.groupby('id')['data'].max()
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and Numpy but, it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.empty(data_id[-1]+1) # might want to use max(data_id) and zeros instead
np.maximum.at(ans,data_id,data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.98969952, 0.84044947, 0.63460516, 0.92042078, 0.75738113,
0.37976055])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge...presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
with only numpy and without loops:
id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE
Ive packaged a version of my previous answer in the numpy_indexed package; its nice to have this all wrapped up and tested in a neat interface; plus it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster though, considering the max/min isn't explicitly computed. The ability to ignore nans of the accepted solution is neat; but one may also simply assign nan values a dummy key.
import numpy as np
def group(key, value, operator=np.add):
"""
group the values by key
any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.substract, and so on)
returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
"""
#upcast to numpy arrays
key = np.asarray(key)
value = np.asarray(value)
#first, sort by key
I = np.argsort(key)
key = key[I]
value = value[I]
#the slicing points of the bins to sum over
slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1))
#first entry of each bin is a unique key
unique_keys = key[slices]
#reduce over the slices specified by index
per_key_sum = operator.reduceat(value, slices)
#number of counts per key is the difference of our slice points. cap off with number of keys for last bin
key_count = np.diff(np.append(slices, len(key)))
return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print 'per group mean'
print reduced_values / key_count
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print 'per group min'
print reduced_values
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print 'per group max'
print reduced_values
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: enumerate(data) returns both the index and value from data, if id[val] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r then r[o] = x will fill r with the latest value x for each value of o, such that r[[0, 0]] = [1, 2] will return r[0] = 2. It requires that your groups are integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
n_groups = np.max(groups) + 1
result = np.empty(n_groups)
order = np.argsort(data)[::-1]
result[groups.take(order)] = data.take(order)
return result
def group_max(groups, data):
n_groups = np.max(groups) + 1
result = np.empty(n_groups)
order = np.argsort(data)
result[groups.take(order)] = data.take(order)
return result