I have a dataframe df containing 40 million rows. There is a column named group_id that specifies the group identifier of a row; there are 2000 groups in total.
I would like to randomly label the elements in each group and store this labeling in a column batch of df. For example, if group 1 contains rows 1, 2, 3, 4, and 5, then I choose a permutation of (1, 2, 3, 4, 5), say (5, 3, 4, 2, 1), and assign the values [5, 3, 4, 2, 1] to the column batch of these rows.
I defined a function func and parallelized it with dummy.Pool, but the speed is very slow. Could you suggest a faster way to do this?
import pandas as pd
import numpy as np
import random
import os
from multiprocessing import dummy
import itertools
core = os.cpu_count()
P = dummy.Pool(processes = core)
N = int(4e7)
M = int(2e3) + 1
col_1 = np.random.randint(1, M, N)
col_2 = np.random.uniform(low = 1, high = 5, size = N)
df = pd.DataFrame({'group_id': col_1, 'value': col_2})
df.sort_values(by = 'group_id', inplace = True)
df.reset_index(inplace = True, drop = True)
id_ = np.unique(df.group_id)
def func(i):
    idx = df.group_id == i
    m = sum(idx)  # count the number of rows in this group
    r = list(range(1, m + 1))  # create an enumeration 1..m
    random.shuffle(r)  # randomly permute the enumeration
    return r
order_list = P.map(func, id_)
# merge the list containing permutations
order = list(itertools.chain.from_iterable(order_list))
df['batch'] = order
Perhaps this could solve your problem: take a random permutation of each group's size.
import numpy as np
import pandas as pd
l = np.repeat(np.arange(2000), 20000)
df = pd.DataFrame(l, columns=['group'])
df['batch'] = df.groupby('group')['group'].transform(lambda x: np.random.permutation(np.arange(x.size)))
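If the transform above is still too slow at 40 million rows, here is a fully vectorized sketch (my own suggestion, not from the question): ranking random draws within each group yields a uniformly random permutation of 1..m per group, with no Python-level loop over the 2000 groups.
import numpy as np
import pandas as pd
rng = np.random.default_rng()
# The ranks 1..m of i.i.d. uniform draws form a uniformly random
# permutation of 1..m within every group.
df['batch'] = (
    df.assign(_r=rng.random(len(df)))
      .groupby('group')['_r']
      .rank(method='first')
      .astype(int)
)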
So I have the following data, which comes from two different pandas dataframes:
lis = []
for index, rows in full.iterrows():
    my_list = [rows.ARIEL, rows.LBHD, rows.LFHD, rows.RFHD, rows.RBHD]
    lis.append(my_list)

lis2 = []
for index, rows in reduced.iterrows():
    my_list = rows.bar_head
    lis2.append(my_list)
For example, parts of lis and lis2 are shown below:
lis = [[[-205.981, 1638.787, 1145.274], [-264.941, 1482.371, 1168.693], [-263.454, 1579.4370000000001, 1016.279], [-148.062, 1592.005, 1016.75], [-134.313, 1479.1429999999998, 1167.109]], ...
lis2 = [[-203.3502, 1554.3486, 1102.821], [-203.428, 1554.3492, 1103.0592], [-203.4954, 1554.3234, 1103.2794], [-203.5022, 1554.2974, 1103.4522], ...
What I want is to use lis and lis2 with the following apply method (where mdf is another empty dataframe of the same length as the other two, and md is a function I've created):
mdf['head_md'] = mdf['head_md'].apply(md, args=(5, lis, lis2))
But the way it works now, it outputs the same result to all rows of mdf.
What I want is for it to loop through lis and lis2 and based on the indexes, to output the corresponding result to the corresponding row of mdf. All dataframes and variables have length 7446.
For example, I tried this, but it doesn't work:
for i in range(len(mdf)):
    for j in range(0, 5):
        mdf['head_md'] = mdf['head_md'].apply(md, args=(5, lis[i][j], lis2[i]))
Let me know if you need any more information from the code, and thanks in advance!
EDIT: Examples of the dataframes (reduced, then full):
bar_head
0 [-203.3502, 1554.3486, 1102.821]
1 [-203.428, 1554.3492, 1103.0592]
2 [-203.4954, 1554.3234, 1103.2794]
3 [-203.5022, 1554.2974, 1103.4522]
4 [-203.5014, 1554.2948, 1103.6594]
ARIEL LBHD LFHD RBHD RFHD
0 [-205.981, 1638.787, 1145.274] [-264.941, 1482.371, 1168.693] [-263.454, 1579.4370000000001, 1016.279] [-134.313, 1479.1429999999998, 1167.109] [-148.062, 1592.005, 1016.75]
1 [-206.203, 1638.649, 1145.734] [-264.85400000000004, 1482.069, 1168.776] [-263.587, 1579.6129999999998, 1016.627] [-134.286, 1479.0839999999998, 1167.076] [-148.21, 1592.3310000000001, 1017.0830000000001]
2 [-206.37599999999998, 1638.531, 1146.135] [-264.803, 1481.8210000000001, 1168.8519999999... [-263.695, 1579.711, 1016.922] [-134.265, 1478.981, 1167.104] [-148.338, 1592.5729999999999, 1017.3839999999...
3 [-206.493, 1638.405, 1146.519] [-264.703, 1481.5439999999999, 1168.95] [-263.742, 1579.8139999999999, 1017.207] [-134.15200000000002, 1478.922, 1167.112] [-148.421, 1592.8020000000001, 1017.4730000000...
4 [-206.56900000000002, 1638.33, 1146.828] [-264.606, 1481.271, 1169.0330000000001] [-263.788, 1579.934, 1017.467] [-134.036, 1478.888, 1167.289] [-148.50799999999998, 1593.0510000000002, 1017...
If the items in the columns of full and reduced are lists, convert them to numpy ndarrays first.
ariel = np.array(full.ARIEL.to_list())
lbhd = np.array(full.LBHD.to_list())
lfhd = np.array(full.LFHD.to_list())
rfhd = np.array(full.RFHD.to_list())
rbhd = np.array(full.RBHD.to_list())
barhead = np.array(reduced.bar_head.to_list())
Subtract barhead from ariel using broadcasting, square the results and sum along the last axis (assuming I understood the comment about your function).
a = np.sum(np.square(ariel-barhead[:,None,:]),-1)
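A quick shape check (continuing the snippet above, with the 5-row full and 4-row reduced from the setup below):
print(barhead.shape)              # (4, 3): one 3-D point per row of reduced
print(ariel.shape)                # (5, 3): one 3-D point per row of full
print(barhead[:, None, :].shape)  # (4, 1, 3): broadcasts against (5, 3)
print(a.shape)                    # (4, 5): squared-distance sums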
Using the setup below the result is a (4,5) array of values (rounded to two places).
>>> a # a[0] a[1] a[2] a[3] a[4]
array([[8939.02, 8956.22, 8971.93, 8984.87, 8999.85], # b[0]
[8918.35, 8935.3 , 8950.79, 8963.53, 8978.35], # b[1]
[8903.82, 8920.53, 8935.82, 8948.36, 8963.04], # b[2]
[8893.7 , 8910.24, 8925.38, 8937.78, 8952.34]]) # b[3]
It seemed that you wanted a 1-d sequence for the result: a.ravel() produces a 1-d array like:
[(a[0]:b[0]),(a[1]:b[0]),(a[2]:b[0]),...,(a[0]:b[1]),(a[1]:b[1]),...,(a[0]:b[2]),...]
Do the same for the other four columns of full.
lb = np.sum(np.square(lbhd-barhead[:,None,:]),-1)
lf = np.sum(np.square(lfhd-barhead[:,None,:]),-1)
rf = np.sum(np.square(rfhd-barhead[:,None,:]),-1)
rb = np.sum(np.square(rbhd-barhead[:,None,:]),-1)
Again, assuming I understood your process, the result would be 100 values (using the setup below): (rows * columns of full) * (rows of reduced) = (5 * 5) * 4 = 100.
x = np.concatenate([a.ravel(),lb.ravel(),lf.ravel(),rf.ravel(),rb.ravel()])
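If it helps, the five distance arrays can also be computed in one shot by stacking the columns first; this sketch reuses the arrays built above and produces exactly the same 100 values, in the same order as the concatenation:
pts = np.stack([ariel, lbhd, lfhd, rfhd, rbhd])  # (5, n_full, 3)
# (5, 1, n_full, 3) - (1, n_reduced, 1, 3) -> (5, n_reduced, n_full, 3)
dists = np.sum(np.square(pts[:, None, :, :] - barhead[None, :, None, :]), -1)
x = dists.ravel()  # identical to concatenating the five .ravel() results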
Setup
import numpy as np
import pandas as pd
lis = [[[-205.981, 1638.787, 1145.274],[-264.941, 1482.371, 1168.693],[-263.454, 1579.437, 1016.279],[-134.313, 1479.1429, 1167.109],[-148.062, 1592.005, 1016.75]],
[[-206.203, 1638.649, 1145.734],[-264.854, 1482.069, 1168.776],[-263.587, 1579.6129, 1016.627],[-134.286, 1479.0839, 1167.076],[-148.21, 1592.331, 1017.083]],
[[-206.3759, 1638.531, 1146.135],[-264.803, 1481.821, 1168.85199],[-263.695, 1579.711, 1016.922],[-134.265, 1478.981, 1167.104],[-148.338, 1592.5729, 1017.3839]],
[[-206.493, 1638.405, 1146.519],[-264.703, 1481.5439, 1168.95],[-263.742, 1579.8139, 1017.207],[-134.152, 1478.922, 1167.112],[-148.421, 1592.802, 1017.473]],
[[-206.569, 1638.33, 1146.828],[-264.606, 1481.271, 1169.033],[-263.788, 1579.934, 1017.467],[-134.036, 1478.888, 1167.289],[-148.5079, 1593.051, 1017.666]]]
barhd = [[[-203.3502, 1554.3486, 1102.821]],
[[-203.428, 1554.3492, 1103.0592]],
[[-203.4954, 1554.3234, 1103.2794]],
[[-203.5022, 1554.2974, 1103.4522]]]
full = pd.DataFrame(lis, columns=['ARIEL', 'LBHD', 'LFHD', 'RFHD', 'RBHD'])
reduced = pd.DataFrame(barhd,columns=['bar_head'])
I hope I understood you correctly; is this what you want?
Here v is lis and v2 is lis2.
This applies an arithmetic function between a 3-D and a 2-D array.
import numpy as np
na = np.array
v=na([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9],[ 10, 11, 1]]])
v2=na([[1, 2, 3], [4, 5, 6], [7, 8, 9],[ 10, 11, 12]])
lst = []
for a in v:
    for b in a:
        for a2 in v2:
            lst.append(b + a2)  # you can use any arithmetic function here
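For what it's worth, the same pairing can be done with broadcasting instead of the triple loop; a minimal sketch that produces the same 16 row sums in the same order:
# Flatten v's leading axes, add a broadcast axis, and let numpy pair
# every row of v with every row of v2: (4, 1, 3) + (4, 3) -> (4, 4, 3).
out = (v.reshape(-1, 1, v.shape[-1]) + v2).reshape(-1, v.shape[-1])
# np.array(lst) from the loop above should match out row for row.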
I have a table as input, produced by the following code:
import pandas as pd
dates = ['3-Apr-2018', '4-Apr-2018', '15-Apr-2018', '5-May-2018', '3-Jun-2018']
prices = [300, 200, 100, 900, 200]
list_of_tuples = list(zip(dates, prices))
df = pd.DataFrame(list_of_tuples, columns=['dates', 'prices'])
I need all sets of date indexes that fall within a one-month (31-day) range. The output should be:
[0, 1, 2], [2, 3], [3, 4]
Below is a solution; I added a few extra steps just to make it easier to follow:
1. Always make sure to convert date strings to dates
2. Add a start_date column
3. Add an end_date column that projects each date forward 31 days
4. Write a for-loop that compares the entire dates column against each start_date/end_date pair
5. Store the results in a temporary list that gets appended to a master list of results
6. Add the master list of results as a new column
# Step 1: make sure to convert the dates
df['dates'] = pd.to_datetime(df['dates'])

# Step 2: create start_date
df['start_date'] = pd.to_datetime(df['dates'])

# Step 3: create end_date column that projects each date forward 31 days
df['end_date'] = df['dates'] + pd.Timedelta(days=31)

# create master list to store results of the search
list_column_index = []

# loop through each row's start_date and end_date
for each_start, each_end in zip(df['start_date'], df['end_date']):
    # compare the entire 'dates' column to this row's start_date and end_date
    mask_range = df['dates'].between(each_start, each_end)
    # create a new temporary dataframe with the dates in this range
    temp_df = df.loc[mask_range]
    # convert the index of the temporary dataframe into a temp list
    temp_list_index = list(temp_df.index)
    # add the temp list to the master list
    list_column_index.append(temp_list_index)

# add a new column with the master list
df['column_index'] = list_column_index
print(df)
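As an aside, here is a minimal vectorized sketch of the same windowing logic (assuming the df built above), comparing every date against every 31-day window at once instead of looping:
import numpy as np

d = df['dates'].to_numpy()  # datetime64 array
# in_window[i, j] is True when date j falls in [date_i, date_i + 31 days]
in_window = (d >= d[:, None]) & (d <= d[:, None] + np.timedelta64(31, 'D'))
df['column_index'] = [list(np.flatnonzero(row)) for row in in_window]
On the sample data, either version should yield [0, 1, 2], [1, 2, 3], [2, 3], [3, 4], [4].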
There is surely a more optimal solution, but here is my proposal:
import pandas as pd
dates = ['3-Apr-2018', '4-Apr-2018', '15-Apr-2018', '5-May-2018', '3-Jun-2018']
prices = [300, 200, 100, 900, 200]
list_of_tuples = list(zip(dates, prices))
df = pd.DataFrame(list_of_tuples, columns=['dates', 'prices'])
#solution:
df['dates'] = pd.to_datetime(df['dates'])
for index, r in df.iterrows():
    df['c_' + str(index)] = (df['dates'] - r['dates']).apply(lambda x: 1 if pd.Timedelta(0, unit='d') < x < pd.Timedelta(32, unit='d') else 0)
df['m'] = df.groupby(df['dates'].dt.month).ngroup()
d31 = [df.index[df[col] == 1].tolist() for col in df if col.startswith('c_') and df[col].sum() > 1]
months = [*(df.groupby(df['dates'].dt.month).groups.values())]
months = [m.to_list() for m in months]
d31_months = d31 + months
The output is slightly different from yours, but I don't understand why you don't include [3] and [4] for the months:
[[1, 2], [2, 3], [0, 1, 2], [3], [4]]
I've managed to refactor it a bit:
months = [list(m) for m in df.groupby(df['dates'].dt.month).indices.values()]
diff = lambda r: (df['dates'] - r['dates']).apply(lambda x: 1 if pd.Timedelta(0, unit='d') < x < pd.Timedelta(32, unit='d') else 0)
d31 = [df.index[diff(r) == 1].tolist() for i, r in df.iterrows() if diff(r).sum() > 1]
d31_months = d31 + months
I have a multiindexed pandas dataframe sort of like this:
data = np.random.random((1800,9))
col = pd.MultiIndex.from_product([('A','B','C'),('a','b','c')])
year = range(2006,2011)
month = range(1,13)
day = range(1,31)
idx = pd.MultiIndex.from_product([year,month,day], names=['Year','Month','Day'])
df1 = pd.DataFrame(data, idx, col)
It has multiindexed rows of Year, Month, and Day. I want to be able to select rows from this DataFrame as if it had a DatetimeIndex.
The equivalent DataFrame with a DatetimeIndex would be:
idx = pd.date_range(start='2006-01-01', end='2010-12-31', freq='D')
timeidx = [ix for ix in idx if ix.day < 29]
df2 = pd.DataFrame(data, timeidx, col)
What I would like is this:
all(df2.loc['2006-06-06':'2008-10-11'] == df1[<insert expression here>])
to equal True.
I know I can select cross-sections via df1.xs('2006', level='Year'), but I basically need an easy way to replicate what was done for df2 as I am forced to use this index as opposed to the DatetimeIndex.
One issue you'll immediately have by storing these as strings is '2' > '10', which is almost certainly not what you want, so I recommend using ints. That is:
year = range(2006,2011)
month = range(1,13)
day = range(1,31)
I thought that you ought to be able to use pd.IndexSlice here; my first thought was to use it as follows:
In [11]: idx = pd.IndexSlice
In [12]: df1.loc[idx[2006:2008, 6:10, 6:11], :]
...
but this selects the cross product of years 2006-2008, months June-October, and days 6-11 (i.e. 3 * 5 * 6 = 90 days), not the contiguous date range.
So here's a non-vectorized way, just compare the tuples:
In [21]: df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))
Out[21]: array([False, False, False, ..., False, False, False], dtype=bool)
In [22]: df1[df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))]
# just the (844) rows you want
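As a further aside (my assumption, not part of the original question): on a lexsorted MultiIndex, more recent pandas versions can slice with tuples directly, which may give the DatetimeIndex-style selection, though .loc slicing includes both endpoints:
df1s = df1.sort_index()  # tuple slicing needs a lexsorted MultiIndex
subset = df1s.loc[(2006, 6, 6):(2008, 10, 11)]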
If this was unbearably slow, a trick (to vectorize) would be to use some float representation, for example:
In [31]: df1.index.get_level_values(0).values + df1.index.get_level_values(1).values * 1e-3 + df1.index.get_level_values(2).values * 1e-6
Out[31]:
array([ 2006.001001, 2006.001002, 2006.001003, ..., 2010.012028,
2010.012029, 2010.01203 ])
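To complete the trick, a sketch of how the float key could be used for the selection (the bounds mirror the strict tuple comparison above):
key = (df1.index.get_level_values(0).values
       + df1.index.get_level_values(1).values * 1e-3
       + df1.index.get_level_values(2).values * 1e-6)
lo = 2006 + 6 * 1e-3 + 6 * 1e-6    # 2006-06-06
hi = 2008 + 10 * 1e-3 + 11 * 1e-6  # 2008-10-11
subset = df1[(key > lo) & (key < hi)]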