Quicker, Pandas-friendly way to complete this algorithm? - python

I have a very large DataFrame where each element is populated with a 1-5 integer, or else 0 if there is no data for that element. I would like to create two adjusted copies of it:
train will be a copy where a random 20% of non-zero elements per row are set to 0
test will be a copy where all but these same 20% of elements are set to 0
Here is a sample:
ORIGINAL
0 1 2 3 4 5 6 7 8 9
0 3 0 1 1 3 5 3 5 4 2
1 4 2 3 2 3 3 4 4 1 2
2 2 4 2 5 4 4 0 0 4 2
TRAIN
0 1 2 3 4 5 6 7 8 9
0 3 0 0 1 3 5 3 5 4 2
1 4 2 3 0 3 3 4 4 0 2
2 2 4 2 5 4 4 0 0 4 0
TEST
0 1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 2 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 2
Here is my current brute-force algorithm that gets the job done, but is far too slow:
import numpy as np

train, test = original.copy(), original.copy()
for i in range(original.shape[0]):
    print("{} / {}".format(i + 1, original.shape[0]))
    row = original.iloc[i]                 # Select row
    nonZeroIndices = np.where(row > 0)[0]  # Find all non-zero indices
    numTest = int(len(nonZeroIndices) * 0.2)  # Calculate 20% of this amount
    rand = np.random.choice(nonZeroIndices, numTest, replace=False)  # Select a random 20% of non-zero indices
    for j in range(original.shape[1]):
        if j in rand:
            train.iloc[i, j] = 0
        else:
            test.iloc[i, j] = 0
Is there a quicker way to achieve this using Pandas or Numpy?

One approach would be:
def make_train_test(df):
    train, test = df.copy(), df.copy()
    for i, row in df.iterrows():
        non_zero = np.where(row > 0)[0]            # positions of the non-zero entries
        num_test = int(len(non_zero) * 0.2)        # 20% of them
        rand = np.random.choice(non_zero, num_test, replace=False)
        row_train = train.iloc[i, :]
        row_test = test.iloc[i, :]
        row_train[rand] = 0                        # columns are labelled 0..9, so positions double as labels here
        row_test[~row_test.index.isin(rand)] = 0   # zero everything outside the sampled 20% in test
    return train, test
In my testing, this runs in about 4.85 ms, your original solution in about 9.07 ms, and andrew_reece's (otherwise elegant) solution in 15.6 ms.
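For completeness, a small usage-and-timing sketch, reusing make_train_test from above; the 3x10 frame is just the sample data from the question rebuilt here so the snippet runs on its own, and the exact numbers will of course differ per machine:
import timeit
import numpy as np
import pandas as pd

# Sample frame from the question, rebuilt for a self-contained run.
original = pd.DataFrame([[3, 0, 1, 1, 3, 5, 3, 5, 4, 2],
                         [4, 2, 3, 2, 3, 3, 4, 4, 1, 2],
                         [2, 4, 2, 5, 4, 4, 0, 0, 4, 2]])

train, test = make_train_test(original)
# Rough per-call timing over 100 runs.
print(timeit.timeit(lambda: make_train_test(original), number=100) / 100)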

First, create the 20% subset of non-zero values with sample():
subset = df.apply(lambda x: x[x.ne(0)].sample(frac=.2, random_state=42), axis=1)
subset
1 2 5 8
0 NaN 1.0 NaN 4.0
1 2.0 NaN NaN 1.0
2 4.0 NaN 4.0 NaN
Now train and test can be set by multiplying subset against the original df, and either using 1s or 0s as fill_value:
train = df.apply(lambda x: x.multiply(subset.iloc[x.name].isnull(), fill_value=1), axis=1)
train
0 1 2 3 4 5 6 7 8 9
0 3 0 0 1 3 5 3 5 0 2
1 4 0 3 2 3 3 4 4 0 2
2 2 0 2 5 4 0 0 0 4 2
test = df.apply(lambda x: x.multiply(subset.iloc[x.name].notnull(), fill_value=0), axis=1)
test
0 1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 4 0
1 0 2 0 0 0 0 0 0 1 0
2 0 4 0 0 0 4 0 0 0 0
Data:
df
0 1 2 3 4 5 6 7 8 9
0 3 0 1 1 3 5 3 5 4 2
1 4 2 3 2 3 3 4 4 1 2
2 2 4 2 5 4 4 0 0 4 2
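For much larger frames, a fully vectorised sketch is also possible. This is only an illustration, assuming original is the full integer DataFrame from the question; the function name split_train_test and the seed handling are made up here:
import numpy as np
import pandas as pd

def split_train_test(original, frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    values = original.values
    # Random key per cell; push zero cells to the end of each row's ordering.
    keys = rng.random(values.shape)
    keys[values == 0] = np.inf
    # Rank cells within each row by their random key.
    order = keys.argsort(axis=1).argsort(axis=1)
    # Number of cells to move into the test set per row (20% of non-zero entries).
    n_test = (np.count_nonzero(values, axis=1) * frac).astype(int)
    test_mask = order < n_test[:, None]
    train = original.where(~test_mask, 0)
    test = original.where(test_mask, 0)
    return train, test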

Related

How to add a DataFrame to some columns of another DataFrame

I want to add a DataFrame a (containing a load profile) to some of the columns of another DataFrame b (which also contains one load profile per column), so some columns (load profiles) of b should be overlaid with the load profile of a.
So let's say my DataFrames look like:
a:
P[kW]
0 0
1 0
2 0
3 8
4 8
5 0
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 4 4
4 2 2 2
5 2 2 2
Now I want to overlay some columns of b:
b.iloc[:, [1]] += a.iloc[:, 0]
I would expect this:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
but this is what I actually get:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 nan 2
1 3 nan 3
2 3 nan 3
3 4 nan 4
4 2 nan 2
5 2 nan 2
That's not exactly what my code and data look like, but the principle is the same as in this abstract example.
Any guesses, what could be the problem?
Many thanks for any help in advance!
EDIT:
I actually have to overlay more than one column. Another example:
load = [0, 0, 0, 0, 0, 0, 0]
data = pd.DataFrame(load)
for i in range(1, 10):
    data[i] = data[0]
data
overlay = pd.DataFrame([0, 0, 0, 0, 6, 6, 0])
overlay
data.iloc[:, [1, 2, 4, 5, 7, 8]] += overlay.iloc[:, 0]
data
WHAT??! The result is completely crazy. Columns 1 and 2 aren't changed at all. Columns 4 and 5 are changed, but in every row. Columns 7 and 8 are NaN. What am I missing?
That is what I would expect the result to look like:
Please do not pass the column index '1' of DataFrame 'b' as a list; pass it as a scalar element.
Code
b.iloc[:, 1] += a.iloc[:, 0]
b
Output
P1[kW] P2[kW] Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
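As a rough illustration of why the list-based attempt produced NaN (a sketch rebuilt from the sample frames above): b.iloc[:, [1]] is a one-column DataFrame, and adding a Series to a DataFrame aligns the Series index against the DataFrame's columns, so nothing matches; selecting with a scalar gives a Series, which aligns on the row index instead.
import pandas as pd

a = pd.DataFrame({'P[kW]': [0, 0, 0, 8, 8, 0]})
b = pd.DataFrame({'P1[kW]': [2, 3, 3, 4, 2, 2],
                  'P2[kW]': [2, 3, 3, 4, 2, 2],
                  'Pn[kW]': [2, 3, 3, 4, 2, 2]})

print(b.iloc[:, [1]] + a.iloc[:, 0])  # DataFrame + Series: aligned on columns -> all NaN
print(b.iloc[:, 1] + a.iloc[:, 0])    # Series + Series: aligned on the row index -> 2, 3, 3, 12, 10, 2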
Edit
It seems this is what we are looking for, i.e. summing certain columns of the data df with the overlay df.
Two Options
Option 1
cols=[1,2,4,5,7,8]
data[cols] = data[cols] + overlay.values
data
Option 2, if we want to use iloc
cols=[1,2,4,5,7,8]
data[cols] = data.iloc[:,cols] + overlay.iloc[:].values
data
Output
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 6 6 0 6 6 0 6 6 0
5 0 6 6 0 6 6 0 6 6 0
6 0 0 0 0 0 0 0 0 0 0
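If you prefer to stay within pandas alignment instead of dropping to .values, one possible variant (a sketch, assuming data and overlay as defined in the edit) is to add the overlay column along the row axis:
cols = [1, 2, 4, 5, 7, 8]
# axis=0 aligns the overlay Series with the row index and broadcasts it across the selected columns
data[cols] = data[cols].add(overlay[0], axis=0)
data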

Padding and reshaping pandas dataframe

I have a dataframe with the following form:
data = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 2, 3, 3],
                     'Time': [0, 1, 2, 0, 1, 2, 3, 0, 1],
                     'sig': [2, 3, 1, 4, 2, 0, 2, 3, 5],
                     'sig2': [9, 2, 8, 0, 4, 5, 1, 1, 0],
                     'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad such that each 'ID' has the same number of Time values, the sig1, sig2 columns are padded with zeros (or the mean value within each ID), and the group column carries the same letter value. The output after re-padding would be:
data_pad = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                         'Time': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                         'sig1': [2, 3, 1, 0, 4, 2, 0, 2, 3, 5, 0, 0],
                         'sig2': [9, 2, 8, 0, 0, 4, 5, 1, 1, 0, 0, 0],
                         'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A']})
print(data_pad)
ID Time sig1 sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to ultimately reshape this into something with shape (number of ID, number of time points, number of sequences {2 here}).
It seems that if I pivot data, it fills in with nan values, which is fine for the signal values, but not the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.
Here's one approach creating the new index with pd.MultiIndex.from_product and using it to reindex on the Time column:
df = data.set_index(['ID', 'Time'])
# define the new index
ix = pd.MultiIndex.from_product([df.index.levels[0],
                                 df.index.levels[1]],
                                names=['ID', 'Time'])
# reindex using the above MultiIndex
df = df.reindex(ix, fill_value=0)
# forward-fill the missing values in group
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
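The question also mentions padding with the per-ID mean instead of zeros; a possible sketch of that variant, reusing data and ix from above (the variable name filled is made up here, and it relies on the missing Time values sitting at the end of each ID, as in the sample):
# Reindex without a fill value so the padded rows show up as NaN first.
filled = data.set_index(['ID', 'Time']).reindex(ix)
sig_cols = ['sig', 'sig2']
# Replace the NaNs with the per-ID mean of each signal column.
filled[sig_cols] = filled[sig_cols].fillna(
    filled.groupby(level='ID')[sig_cols].transform('mean'))
# Carry the group letter forward within each ID.
filled['group'] = filled.groupby(level='ID')['group'].ffill()
print(filled.reset_index())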
IIUC:
(data.pivot_table(columns='Time', index=['ID', 'group'], fill_value=0)
     .stack('Time')
     .sort_index(level=['ID', 'Time'])
     .reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0
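As for the end goal of a (number of IDs, number of time points, number of signals) array: once every ID has the same set of Time values and the rows are sorted by ID then Time, a reshape sketch could look like this (out is a placeholder name for either padded result above, after reset_index):
import numpy as np

n_id = out['ID'].nunique()
n_time = out['Time'].nunique()
# Rows must be ordered by ID, then Time, for this reshape to line up.
arr = out[['sig', 'sig2']].values.reshape(n_id, n_time, 2)
print(arr.shape)   # (3, 4, 2) for the sample data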

Merge multiple group ids to form a single consolidated group id?

I have the following dataset in a pandas DataFrame.
group_id sub_group_id
0 0
0 1
1 0
2 0
2 1
2 2
3 0
3 0
But then I want to merge those group ids and form a consolidated group id:
group_id sub_group_id consolidated_group_id
0 0 0
0 1 1
1 0 2
2 0 3
2 1 4
2 2 5
2 2 5
3 0 6
3 0 6
Is there any generic or mathematical way to do it?
cols = ['group_id', 'sub_group_id']
df.assign(
    consolidated_group_id=pd.factorize(
        pd.Series(list(zip(*df[cols].values.T.tolist())))
    )[0]
)
group_id sub_group_id consolidated_group_id
0 0 0 0
1 0 1 1
2 1 0 2
3 2 0 3
4 2 1 4
5 2 2 5
6 3 0 6
7 3 0 6
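A possible alternative sketch, staying entirely in pandas, is groupby(...).ngroup(); with sort=False the (group_id, sub_group_id) pairs are numbered in order of first appearance:
df['consolidated_group_id'] = (
    df.groupby(['group_id', 'sub_group_id'], sort=False).ngroup()
)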
You need to convert the values to tuples and then use factorize:
df['consolidated_group_id'] = pd.factorize(df.apply(tuple,axis=1))[0]
print (df)
group_id sub_group_id consolidated_group_id
0 0 0 0
1 0 1 1
2 1 0 2
3 2 0 3
4 2 1 4
5 2 2 5
6 3 0 6
7 3 0 6
The NumPy solutions slightly modify this answer - the tuple returned by numpy.unique is reversed with [::-1] and the inverse array is selected with [0]:
a = df.values

def unique_return_inverse_2D(a):  # a is array
    a1D = a.dot(np.append((a.max(0) + 1)[:0:-1].cumprod()[::-1], 1))
    return np.unique(a1D, return_inverse=1)[::-1][0]

def unique_return_inverse_2D_viewbased(a):  # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    return np.unique(a.view(void_dt).ravel(), return_inverse=1)[::-1][0]

df['consolidated_group_id'] = unique_return_inverse_2D(a)
df['consolidated_group_id1'] = unique_return_inverse_2D_viewbased(a)
print(df)
group_id sub_group_id consolidated_group_id consolidated_group_id1
0 0 0 0 0
1 0 1 1 1
2 1 0 2 2
3 2 0 3 3
4 2 1 4 4
5 2 2 5 5
6 3 0 6 6
7 3 0 6 6
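With NumPy >= 1.13, np.unique also accepts axis=0 directly, which gives a simpler sketch of the same idea; note the inverse labels then follow the lexicographically sorted unique rows, which happens to coincide with first-appearance order for this data but is not guaranteed in general:
# axis=0 treats each row as one item; return_inverse gives its position among the sorted unique rows
_, inv = np.unique(a, axis=0, return_inverse=True)
df['consolidated_group_id2'] = inv
print(df)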

Transform dataframe to have a row for every observation at every time point

I have the following short dataframe:
A B C
1 1 3
2 1 3
3 2 3
4 2 3
5 0 0
I want the output to look like this:
A B C
1 1 3
2 1 3
3 0 0
4 0 0
5 0 0
1 1 3
2 1 3
3 2 3
4 2 3
5 0 0
Use pd.MultiIndex.from_product with the unique As and Bs, then reindex:
cols = list('AB')
mux = pd.MultiIndex.from_product([df.A.unique(), df.B.unique()], names=cols)
df.set_index(cols).reindex(mux, fill_value=0).reset_index()
A B C
0 1 1 3
1 1 2 0
2 1 0 0
3 2 1 3
4 2 2 0
5 2 0 0
6 3 1 0
7 3 2 3
8 3 0 0
9 4 1 0
10 4 2 3
11 4 0 0
12 5 1 0
13 5 2 0
14 5 0 0

Efficiently finding pandas rows (or parts of rows) with unique values

Given a pandas dataframe with a row per individual/record, each row includes a property value and its evolution across time (0 to N).
In the following example, a schedule includes the estimated values of a variable 'property' for a number of entities from day 1 to day 10.
I want to filter entities with unique values for a given period and get those values.
csv=',property,1,2,3,4,5,6,7,8,9,10\n0,100011,0,0,0,0,3,3,3,3,3,0\n1,100012,0,0,0,0,2,2,2,8,8,0\n2, \
100012,0,0,0,0,2,2,2,2,2,0\n3,100012,0,0,0,0,0,0,0,0,0,0\n4,100011,0,0,0,0,2,2,2,2,2,0\n5, \
180011,0,0,0,0,2,2,2,2,2,0\n6,110012,0,0,0,0,0,0,0,0,0,0\n7,110011,0,0,0,0,3,3,3,3,3,0\n8, \
110012,0,0,0,0,3,3,3,3,3,0\n9,110013,0,0,0,0,0,0,0,0,0,0\n10,100011,0,0,0,0,3,3,3,3,4,0'
from StringIO import StringIO
import numpy as np
import pandas as pd
schedule = pd.read_csv(StringIO(csv), index_col=0)
print schedule
property 1 2 3 4 5 6 7 8 9 10
0 100011 0 0 0 0 3 3 3 3 3 0
1 100012 0 0 0 0 2 2 2 8 8 0
2 100012 0 0 0 0 2 2 2 2 2 0
3 100012 0 0 0 0 0 0 0 0 0 0
4 100011 0 0 0 0 2 2 2 2 2 0
5 180011 0 0 0 0 2 2 2 2 2 0
6 110012 0 0 0 0 0 0 0 0 0 0
7 110011 0 0 0 0 3 3 3 3 3 0
8 110012 0 0 0 0 3 3 3 3 3 0
9 110013 0 0 0 0 0 0 0 0 0 0
10 100011 0 0 0 0 3 3 3 3 4 0
I want to find records/individuals for whom the property has not changed during a given period, and the corresponding unique values.
Here is what I came up with: I want to locate individuals with property in [100011, 100012, 1100012] between days 7 and 10.
props = [100011, 100012, 1100012]
begin = 7
end = 10
res = schedule['property'].isin(props)
df = schedule.ix[res, begin:end]
print "df \n%s " %df
We have:
df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
res = df.apply(lambda x: np.unique(x).size == 1, axis=1)
print "res : %s\n" %res
df_f = df.ix[res,]
print "df filtered %s \n" % df_f
res = pd.Series(df_f.values.ravel()).unique().tolist()
print "unique values : %s " %res
Giving:
res :
0 True
1 False
2 True
3 True
4 True
10 False
dtype: bool
df filtered
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
unique values : [3, 2, 0]
As those operations need to be run many times (in the millions) on a million-row dataframe, I need to be able to run it as quickly as possible.
(@MaxU): schedule can be seen as a database/repository that is updated many times. The repository is then also queried many times for unique values.
Would you have some ideas for improvements/ alternate ways ?
Given your df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
You can simplify your code to:
df_f = df[df.apply(pd.Series.nunique, axis=1) == 1]
print(df_f)
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
And the final step to:
res = df_f.iloc[:,0].unique().tolist()
print(res)
[3, 2, 0]
It's not fully vectorised, but maybe this clarifies things a bit towards that?
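And a fully vectorised sketch of the same filter, assuming df is the day-7-to-9 slice shown above: a row has a single unique value exactly when every element equals its first element.
import numpy as np
import pandas as pd

vals = df.values
mask = (vals == vals[:, [0]]).all(axis=1)  # True where the whole row equals its first element
df_f = df[mask]
res = pd.unique(df_f.iloc[:, 0]).tolist()  # the single value per kept row, deduplicated
print(res)   # [3, 2, 0]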
