Faster Way to Translate DataFrame Column to Feature and Target Matrix - python

I have a column (binary) in a dataframe (df) of the form:
Vector
0
1
0
1
0
.
.
.
I am using this in a binary classification model. My objective is to take these 0's and 1's and move them into two separate lists, which then get translated into numpy arrays. As an example, I would like to move the first 5 items from Vector into X, then the 6th item into Y. Then the next 5 items into X, and the following 6th item into Y, and so on until the end of the df (currently 200k rows).
My first instinct is to write a for loop for this (but I know this is hugely inefficient):
X, Y = [], []
i_cnt = 1
for i in range(0, df.shape[0] - 6):
    # as we iterate through the df
    # we will use a step of 5
    if i_cnt > 5:
        y = df['Vector'].iloc[i]
        Y.append(y)
        i_cnt = 1
    else:
        x = df['Vector'].iloc[i]
        X.append(x)
        i_cnt += 1
There is definitely a faster way to do this and hoping someone knows how I can do that?

Build an array of positions as long as the DataFrame, take it modulo 6, and compare the result to select X and Y:
import numpy as np
import pandas as pd
# sample data for easy verification
df = pd.DataFrame({'Vector': range(20)})
idx = np.arange(len(df)) % 6
X = df.loc[idx < 5, 'Vector']
print (X)
0 0
1 1
2 2
3 3
4 4
6 6
7 7
8 8
9 9
10 10
12 12
13 13
14 14
15 15
16 16
18 18
19 19
Y = df.loc[idx == 5, 'Vector']
print (Y)
5 5
11 11
17 17
If the output format should be different, with X as a 2D array, use reshape with -1 so the number of rows is inferred from a row length of 6, then select by indexing (this requires the length to be a multiple of 6):
df = pd.DataFrame({'Vector': range(18)})
arr = df['Vector'].to_numpy().reshape(-1, 6)
X = arr[:, :-1]
Y = arr[:, -1]
print (X)
[[ 0 1 2 3 4]
[ 6 7 8 9 10]
[12 13 14 15 16]]
print (Y)
[ 5 11 17]

For k = 5 + 1 = 6,
k = 6
n_rows = len(df.index)
n_samples = n_rows // k
X_and_y = df.Vector.to_numpy().reshape(n_samples, k)
X = X_and_y[:, :-1]
y = X_and_y[:, -1]
We reshape the column into an (n_samples, 5 + 1) array, where n_samples = n_rows // 6, then take all but the last column as X and the last column as y. The reshape requires n_rows to be a multiple of 6.
e.g.
>>> df = pd.DataFrame(np.random.randint(0, 2, size=18), columns=["Vector"])
>>> df
Vector
0 0
1 0
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 1
12 0
13 0
14 1
15 0
16 0
17 1
>>> # after
>>> X
array([[0, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])
>>> y
array([0, 1, 1])
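If the number of rows is not a multiple of 6 (with 200k rows it is not), the reshape raises an error. A minimal sketch of one way around that, assuming it is acceptable to drop the trailing incomplete group; the sample data and the name `values` are illustrative only:

```python
import numpy as np
import pandas as pd

# 20 rows: three complete groups of 6, plus 2 leftover rows
df = pd.DataFrame({"Vector": np.arange(20) % 2})

k = 6                                    # 5 features + 1 target
values = df["Vector"].to_numpy()
n_samples = len(values) // k             # number of complete groups

# Trim the leftover rows so the reshape is always valid
X_and_y = values[: n_samples * k].reshape(n_samples, k)
X = X_and_y[:, :-1]                      # first 5 items of each group
y = X_and_y[:, -1]                       # 6th item of each group
```

Whether dropping the remainder is right depends on the model; padding or keeping the partial group separately are other options.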

You can try
X = list(df[df.index % 6 < 5]["Vector"])
y = list(df[df.index % 6 == 5]["Vector"])

Related

pandas for loop for running average does not work

I tried to make a kind of running average - out of 90 rows, every 3 in column A should make an average that would be the same as those rows in column B.
For example:
From this:
df = pd.DataFrame( A B
2 0
3 0
4 0
7 0
9 0
8 0)
to this:
df = pd.DataFrame( A B
2 3
3 3
4 3
7 8
9 8
8 8)
I tried running this code:
x=0
for i in df['A']:
    if x<90:
        y = (df['A'][x]+ df['A'][(x +1)]+df['A'][(x +2)])/3
        df['B'][x] = y
        df['B'][(x+1)] = y
        df['B'][(x+2)] = y
        x=x+3
        print(y)
It does print the correct Y
But does not change B
I know there is a better way to do it, and if anyone knows - it would be great if they shared it. But the more important thing for me is to understand why what I wrote down doesn't have an effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
i = 0
batch_size = 3
df = pd.DataFrame({'A':[2,3,4,7,9,8,9,10],'B':[-1] * 8})
while i < len(df):
    j = min(i+batch_size-1,len(df)-1)
    avg = sum(df.loc[i:j,'A']) / (j-i+1)
    df.loc[i:j,'B'] = [avg] * (j-i+1)
    i+=batch_size
df
In the corner case where len(df) % batch_size != 0, this takes the average of the leftover rows.
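On the "why" part of the question: `df['B'][x] = y` is chained assignment. It first builds `df['B']` and then assigns into that intermediate object, so whether the write reaches `df` depends on the pandas version and settings; with Copy-on-Write enabled it never does. A minimal sketch of the same loop with a single `.loc` call per batch, assuming a default 0..n-1 index (`start` and `rows` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 4, 7, 9, 8], "B": [0.0] * 6})

for start in range(0, len(df), 3):
    rows = df.index[start:start + 3]           # labels of the current batch
    # One .loc call assigns directly on df, not on an intermediate object
    df.loc[rows, "B"] = df.loc[rows, "A"].mean()
```

B is created as float here so that assigning the means does not need a dtype change.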

Compare two pandas DataFrames in the most efficient way

Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
I want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1] then we assign to df 1, and 0 otherwise
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2] then we assign to df 1, and 0 otherwise
We apply the same algorithm to end of DataFrame
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
0
0 -1
1 1
2 -1
3 1
4 1
5 -1
6 3
7 6
8 7
Could you please advise me whether it is possible to do this without a loop?
IIUC, this is easily done with a rolling.max:
N = 3  # window size
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 -1
8 7 -1
to keep the last items as is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]),
1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 6
8 7 7
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison could be performed by typical operators, i.e. >, < or ==. If at least one comparison holds, the object would return a pre-defined value (given in list returns_tf, where the first element would be returned if the comparison is true, and the second if it's false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window
    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option=="smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option=="greater" else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)
    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
0 rolling_smaller_checking rolling_greater_checking rolling_equals_checking
0 3 -1 1 1
1 2 1 -1 1
2 5 -1 1 1
3 4 1 1 1
4 3 1 -1 1
5 6 -1 1 1
6 4 3 3 3
7 2 6 6 6
8 1 7 7 7
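For comparison, the same windowed check can be sketched with `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later), which builds the windows without hand-computed strides. This is an illustrative alternative, not part of either answer; the names `windows`, `n`, and `out` are assumptions:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
N = 3

values = df[0].to_numpy()
windows = sliding_window_view(values, N)   # row r holds values[r:r+N]
n = len(values) - N                        # rows handled by the question's loop

# Row r is flagged if any value in its window exceeds check_df[0][r]
mask = (windows[:n] > check_df[0].to_numpy()[:n, None]).any(axis=1)

out = values.copy()                        # the last N rows stay as they are
out[:n] = np.where(mask, 1, -1)
```

The view is read-only, which guards against the accidental out-of-bounds writes that `as_strided` allows.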

how to print ones and zeros in columns with their indexes in python?

I have a list of zeros and ones, I want to print them in two different columns with headings and index numbers. Something like this.
list = [1,0,1,1,1,0,1,0,1,0,0]
ones zeros
1 1 2 0
3 1 6 0
4 1 8 0
5 1 10 0
7 1 11 0
9 1
This is the desired output.
I tried this:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    if ele==1:
        print(index,ele,end=" ")
    elif ele==0:
        print(" ")
        print(index,ele,end=" ")
    else:
        print()
But this gives output like this:
ones zeros
1 1
2 0 3 1 4 1 5 1
6 0 7 1
8 0 9 1
10 0
11 0
How do I get the desired output?
Any help is appreciated.
You can use itertools.zip_longest, str.ljust, f-strings (for formatting), and some calculations for the printing part, and use two lists to hold the indices of both zeros and ones:
from itertools import zip_longest
l = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
ones, zeros = [], []
max_len_zeros = max_len_ones = 0
for index, num in enumerate(l, 1):
    if num == 0:
        zeros.append(index)
        max_len_zeros = max(max_len_zeros, len(str(index)))
    else:
        ones.append(index)
        max_len_ones = max(max_len_ones, len(str(index)))
print('ones' + ' ' * (max_len_ones + 2) + 'zeros')
for ones_index, zeros_index in zip_longest(ones, zeros, fillvalue = ''):
    one = '1' if ones_index else ' '
    this_one_index = str(ones_index).ljust(max_len_ones)
    zero = '0' if zeros_index else ''
    this_zero_index = str(zeros_index).ljust(max_len_zeros)
    print(f'{this_one_index} {one} {this_zero_index} {zero}')
Output:
ones zeros
1 1 2 0
3 1 6 0
4 1 8 0
5 1 10 0
7 1 11 0
9 1
List with more zeros than ones:
In: l = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]
Out:
ones zeros
1 1 2 0
4 1 3 0
7 1 5 0
9 1 6 0
10 1 8 0
14 1 11 0
12 0
13 0
15 0
List with equal number of zeros and ones:
In: l = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
Out:
ones zeros
1 1 2 0
3 1 4 0
5 1 6 0
8 1 7 0
9 1 10 0
11 1 13 0
12 1 14 0
15 1 16 0
18 1 17 0
20 1 19 0
It's hard to do what you need in an iterative way. I have kind of a "broken" solution that both shows how you could better do what you are trying to do and why an iterative approach is limited in this case.
I updated your code as following:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    # First check if extra space OR new lines OR both are needed
    if index > 1:
        if ele==1:
            print()
        elif ele==0:
            if list[index-2]==1:
                print('', end=' \t')
            else:
                print('', end='\n\t\t')
    # THEN, write your desired output without any end
    if ele==1:
        print(index,ele,end="")
    elif ele==0:
        print(index,ele,end="")
# Finally an empty line
print()
It gives the following output:
ones zeros
1 1 2 0
3 1
4 1
5 1 6 0
7 1 8 0
9 1 10 0
11 0
As you can see, its limitation is that you can't go "up" and rewrite in old lines.
However, if you need to display it EXACTLY as you've shown, you need to construct an intermediate data structure (for example a dict) and then display it using zip.
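A minimal sketch of that idea, assuming a dict keyed by value and `itertools.zip_longest` instead of plain `zip`, so the longer column is not cut off. The names `positions` and `lines` are illustrative, and tab-separated columns are an arbitrary layout choice:

```python
from itertools import zip_longest

values = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]

# Intermediate structure: the 1-based positions of each value
positions = {1: [], 0: []}
for index, value in enumerate(values, start=1):
    positions[value].append(index)

lines = ["ones\tzeros"]
# zip_longest pads the shorter column with None
for one_idx, zero_idx in zip_longest(positions[1], positions[0]):
    left = f"{one_idx} 1" if one_idx is not None else ""
    right = f"{zero_idx} 0" if zero_idx is not None else ""
    lines.append(f"{left}\t{right}".rstrip())

print("\n".join(lines))
```

Collecting everything first removes the need to go back "up" to earlier lines, because each row is printed only once both columns are known.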

How to recursively split a 2D array into a tensor?

I have turned a DataFrame whose index consists of tuples of length 2
1 2 -1
(0, 1) 0 1 0
(0, 2) 1 0 0
(0, -1) 0 0 0
(1, 1) 1 0 0
(1, 2) 0 1 0
(1, -1) 1 1 1
into a 2D NumPy array and managed to split it into a 3D array (by the first value of the tuple) with the array_split function:
arr = np.array(np.array_split(arr,2))
with result
[[[0 1 0]
[1 0 0]
[0 0 0]]
[[1 0 0]
[0 1 0]
[1 1 1]]]
I want to make a function to do the split even further, for example, to create 5D tensor from (0,0,0,0) (length 4) indices.
Any idea on how to do this recursively?
Use the following code to generate sample data:
import pandas as pd
import numpy as np
import itertools
def create_fake_data_frame(nlevels = 2, ncols = 3):
    result = pd.DataFrame(
        index=itertools.product(*(nlevels * [[0, 1]])),
        data=np.arange(ncols*2**nlevels).reshape(2**nlevels, ncols)
    )
    result = convert_index_of_tuples_to_multiindex(result)
    return result
def convert_index_of_tuples_to_multiindex(df):
    return df.set_index(pd.MultiIndex.from_tuples(df.index))
# Increase nlevels to get dataframes with more levels in their MultiIndex
df = create_fake_data_frame(nlevels=3)
print(df)
This is the result:
0 1 2
0 0 0 0 1 2
1 3 4 5
1 0 6 7 8
1 9 10 11
1 0 0 12 13 14
1 15 16 17
1 0 18 19 20
1 21 22 23
Then, modify the dataframe in such a way that each row contains a single
column, whose value is a list of the values in the corresponding row of
the original dataframe:
def data_frame_with_single_column_of_lists(df):
    if len(df.columns) <= 1:
        return df
    result = df.apply(collapse_columns_into_lists, axis=1)
    return result
def collapse_columns_into_lists(s):
    result = s.copy()
    result['lists'] = result.values.tolist()
    result = result[['lists']]
    return result
df = data_frame_with_single_column_of_lists(df)
print(df)
The output will be like this:
lists
0 0 0 [0, 1, 2]
1 [3, 4, 5]
1 0 [6, 7, 8]
1 [9, 10, 11]
1 0 0 [12, 13, 14]
1 [15, 16, 17]
1 0 [18, 19, 20]
1 [21, 22, 23]
Finally, use the following code to get a tensor
def increase_list_nesting_by_removing_an_index_level(df):
    def list_of_lists(series):
        result = series.to_frame().set_index(series.index.droplevel(-1))
        result = result.apply(lambda x: x['lists'], axis=1).to_frame()
        result = [x[0] for x in result.values.tolist()]
        return result
    grouped = df.groupby(df.index.droplevel(-1))
    result = grouped.agg(list_of_lists)
    if type(result.index[0]) == tuple:
        result = convert_index_of_tuples_to_multiindex(result)
    return result
def tensor_from_data_frame(df):
    if df.index.nlevels <= 1:
        return np.array([i[0] for i in df.values])
    result = increase_list_nesting_by_removing_an_index_level(df)
    result = tensor_from_data_frame(result)
    return result
tensor = tensor_from_data_frame(df)
print(tensor)
The result will be like this:
[[[[ 0 1 2]
[ 3 4 5]]
[[ 6 7 8]
[ 9 10 11]]]
[[[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]]]]
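When every combination of index values is present (a full Cartesian product, as in the sample data), the same tensor can probably be obtained without recursion by sorting the index and reshaping. A minimal sketch under that assumption; it does not handle missing index combinations, and the name `shape` is illustrative:

```python
import numpy as np
import pandas as pd

nlevels, ncols = 3, 3
index = pd.MultiIndex.from_product(nlevels * [[0, 1]])
df = pd.DataFrame(
    np.arange(ncols * 2**nlevels).reshape(2**nlevels, ncols),
    index=index,
)

# One axis per index level, plus a final axis for the columns
shape = tuple(len(level) for level in df.index.levels) + (df.shape[1],)

# Sorting makes the row order match the row-major layout of the tensor
tensor = df.sort_index().to_numpy().reshape(shape)
```

Here `tensor[1, 0, 1]` holds the row whose index is (1, 0, 1).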

Ranking groups based on size

Sample Data:
id cluster
1 3
2 3
3 3
4 3
5 1
6 1
7 2
8 2
9 2
10 4
11 4
12 5
13 6
What I would like to do is replace the largest cluster id with 0 and the second largest with 1 and so on and so forth. Output would be as shown below.
id cluster
1 0
2 0
3 0
4 0
5 2
6 2
7 1
8 1
9 1
10 3
11 3
12 4
13 5
I'm not quite sure where to start with this. Any help would be much appreciated.
The objective is to relabel groups defined in the 'cluster' column by the corresponding rank of that group's total value count within the column. We'll break this down into several steps:
Integer factorization. Find an integer representation where each unique value in the column gets its own integer. We'll start with zero.
We then need the counts of each of these unique values.
We need to rank the unique values by their counts.
We assign the ranks back to the positions of the original column.
Approach 1
Using Numpy's numpy.unique + argsort
TL;DR
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
(-c).argsort()[i]
Turns out, numpy.unique performs the task of integer factorization and counting values in one go. In the process, we get unique values as well, but we don't really need those. Also, the integer factorization isn't obvious. That's because per the numpy.unique function, the return value we're looking for is called the inverse. It's called the inverse because it was intended to act as a way to get back the original array given the array of unique values. So if we let
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
You'll see i looks like:
array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])
And if we did u[i] we get back the original df.cluster.values
array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])
But we are going to use it as integer factorization.
Next, we need the counts c
array([2, 3, 4, 2, 1, 1])
I'm going to propose the use of argsort but it's confusing. So I'll try to show it:
np.vstack([c, (-c).argsort()])
array([[2, 3, 4, 2, 1, 1],
[2, 1, 0, 3, 4, 5]])
In general, argsort puts into each spot the position to draw from in the originating array; the top spot (position 0) holds the position of the best element.
# position 2
# is best
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# top spot
# from
# position 2
# position 1
# goes to
# second spot
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# second spot
# from
# position 1
What this allows us to do is to slice this argsort result with our integer factorization to arrive at a remapping of the ranks.
# i is
# [2 2 2 2 0 0 1 1 1 3 3 4 5]
# (-c).argsort() is
# [2 1 0 3 4 5]
# argsort
# slice
# \ / This is our integer factorization
# a i
# [[0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [3 3] <-- 3 is third position in argsort
# [3 3] <-- 3 is third position in argsort
# [4 4] <-- 4 is fourth position in argsort
# [5 5]] <-- 5 is fifth position in argsort
We can then drop it into the column with pd.DataFrame.assign
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
Approach 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count the values. The reason to use this approach is that numpy.unique sorts the values while factorizing and counting, whereas pandas.factorize does not. For larger data sets, big O is our friend: factorizing and counting remain O(n), while the numpy.unique approach is O(n log n).
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
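One caveat worth checking on other data: `(-c).argsort()` lists which factorized value sits at each rank, so indexing it with `i` yields the rank only when that permutation is its own inverse, as it happens to be for the sample data. A minimal sketch that inverts the permutation explicitly, using hypothetical cluster labels where the two results differ (`order`, `rank`, and `relabeled` are illustrative names):

```python
import numpy as np

# Hypothetical labels: 7 appears once, 8 three times, 9 twice
cluster = np.array([7, 8, 8, 8, 9, 9])

u, i, c = np.unique(cluster, return_inverse=True, return_counts=True)

# order[r] = which unique value holds rank r (largest count first)
order = np.argsort(-c, kind="stable")

# Invert the permutation: rank[k] = rank of unique value k
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

relabeled = rank[i]           # rank of each row's cluster
naive = order[i]              # the un-inverted lookup, for comparison
```

With these labels `relabeled` gives 0 to the largest cluster (8), while `naive` gives it 2. The stable sort keeps ties in the order of the unique values.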
You can use groupby, transform, and rank:
df['cluster'] = df.groupby('cluster')['cluster'].transform('count')\
    .rank(ascending=False, method='dense')\
    .sub(1).astype(int)
Output:
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
By using category and value_counts
df.cluster.map((-df.cluster.value_counts()).astype('category').cat.codes)
Out[151]:
0 0
1 0
2 0
3 0
4 2
5 2
6 1
7 1
8 1
9 3
Name: cluster, dtype: int8
This isn't the cleanest solution but it does work. Feel free to suggest improvements:
valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)
# compare against the original labels so new labels never collide with old ones
original = df['cluster'].copy()
count = 0
for i in valueCounts_sorted.index.values:
    print (i)
    df.loc[original == i, "cluster"] = count
    count += 1
