Basically, I have a list of 0s and 1s. Each value in the list represents a data sample from one hour, so 24 values in the list make up 24 hours, i.e. a single day. I want to capture the first time the data cycles from 0s to 1s and back to 0s within a span of 24 hours (or vice versa, from 1s to 0s and back to 1s).
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1]
expected output:
# D
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0]
output = [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]
# ^ cycle.1:day.1 |dayline ^cycle.1:day.2
In the output list, a 1 means that one cycle is completed at that position of the signal list; every other position is 0. There should be only one cycle per day, which is why there is only a single 1 per day.
I don't know how to split this list accordingly, so can someone please help?
It seems to me that what you are trying to do is first split your data into blocks of 24, and then find either the first rising edge or the first falling edge, depending on the first hour in that block.
Below I have tried to distill my understanding of what you are trying to accomplish into the following function. It takes a numpy.array containing zeros and ones, as in your example, checks what the first hour in the day is, and decides what type of edge to look for.
It detects an edge by using np.diff, which gives us an array containing -1s, 0s, and 1s. We then look for the first index of either a -1 (falling edge) or a 1 (rising edge). The function returns that index; if no such edge was found, it returns the index of the last element, or nothing if the whole day is constant.
For more info, see the docs for the numpy features used here: np.diff, np.ndarray.nonzero, np.array_split.
import numpy as np

def get_cycle_index(day):
    '''
    returns the first index of a cycle defined by nipun vats
    if no cycle is found returns nothing
    '''
    first_hour = day[0]
    if first_hour == 0:
        edgetype = -1
    else:
        edgetype = 1
    edges = np.diff(np.r_[day, day[-1]])
    if (edges == edgetype).any():
        return (edges == edgetype).nonzero()[0][0]
    elif (day.sum() == day.size) or day.sum() == 0:
        return
    else:
        return day.size - 1
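To make the edge detection concrete, here is a minimal sketch of what np.diff returns for a short block and how the rising and falling edges can be located. The sample array and the names rising and falling are made up for illustration; they are not part of the function above.

```python
import numpy as np

# Hypothetical 7-hour block, just to show the mechanics.
day = np.array([1, 1, 0, 0, 1, 1, 0])

# diff[i] = day[i + 1] - day[i]: -1 marks a falling edge, 1 a rising edge.
edges = np.diff(day)

# Indices of the last sample before each edge.
rising = np.flatnonzero(edges == 1)
falling = np.flatnonzero(edges == -1)

print(edges)    # [ 0 -1  0  1  0 -1]
print(rising)   # [3]
print(falling)  # [1 5]
```

Note that np.diff returns one element fewer than its input, which is why the function above pads the block with a copy of its last value before differencing.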
Below is an example of how you might use this function in your case.
import numpy as np
_data = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
#_data = np.random.randint(0,2,280, dtype='int')
data = np.array(_data, 'int')
# split the data into a set of 'day' blocks
blocks = np.array_split(data, np.arange(24, data.size, 24))
_output = []
for i, day in enumerate(blocks):
    print(f'day {i}')
    buffer = np.zeros(day.size, dtype='int')
    print('\tsignal:', *day, sep=' ')
    cycle_index = get_cycle_index(day)
    # compare against None so that a valid index of 0 would not be skipped
    if cycle_index is not None:
        buffer[cycle_index] = 1
    print('\toutput:', *buffer, sep=' ')
    _output.append(buffer)
output = np.concatenate(_output)
print('\nfinal output:\n', *output, sep=' ')
This yields the following output:
day 0
signal: 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0
output: 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 1
signal: 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
output: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 2
signal: 0 0 0 0 0 0
output: 0 0 0 0 0 0
final output:
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I have a dataset called "data" with categorical values I'd like to encode with mean (likelihood/target) encoding rather than label encoding.
My dataset looks like:
data.head()
ID X0 X1 X10 X100 X101 X102 X103 X104 X105 ... X90 X91 X92 X93 X94 X95 X96 X97 X98 X99
0 0 k v 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6 k t 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
2 7 az w 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
3 9 az t 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
4 13 az v 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
5 rows × 377 columns
I've tried:
# Select categorical features
cat_features = data.dtypes == 'object'
# Define function
def mean_encoding(df, cols, target):
    for c in cols:
        means = df.groupby(c)[target].mean()
        df[c].map(means)
    return df
# Encode
data = mean_encoding(data, cat_features, target)
which raises:
KeyError: False
I've also tried:
# Define function
def mean_encoding(df, target):
    for c in df.columns:
        if df[c].dtype == 'object':
            means = df.groupby(c)[target].mean()
            df[c].map(means)
    return df
which raises:
KeyError: 'Columns not found: 87.68, 87.43, 94.38, 72.11, 73.7, 74.0,
74.28, 76.26,...
I've concatenated the train and test datasets into one called "data", and saved the train target before dropping it from the dataset:
target = train.y
split = len(train)
data = pd.concat(objs=[train, test])
data = data.drop('y', axis=1)
data.shape
Help would be appreciated. Thanks.
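I think you are not selecting the categorical columns correctly. With cat_features = data.dtypes == 'object' you do not get column names; you get a boolean Series indicating whether each column has object dtype. Iterating over it yields True/False values, which results in KeyError: False.
You can select the categorical columns with
mycolumns = data.columns
numerical_columns = data._get_numeric_data().columns
cat_features = list(set(mycolumns) - set(numerical_columns))
or
cat_features = data.select_dtypes(['object']).columns
The rest of your code stays the same, except that the result of map has to be assigned back to the column; otherwise df is returned unchanged. Note also that df.groupby(c)[target] expects target to be the name of a column in df; passing a Series of target values there is what produces the "Columns not found" KeyError.
# Define function
def mean_encoding(df, cols, target):
    for c in cols:
        means = df.groupby(c)[target].mean()
        # assign the encoded values back to the column
        df[c] = df[c].map(means)
    return df
# Encode
data = mean_encoding(data, cat_features, target)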
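As a rough, self-contained sketch of the whole idea, the following computes the category means on the training rows only (where the target is known) and maps them onto both train and test. The toy frames, the function name mean_encode, and the choice to fill unseen categories with the global training mean are all assumptions made for illustration, not something taken from the question.

```python
import pandas as pd

# Toy data standing in for the real train/test sets (hypothetical values).
train = pd.DataFrame({"X0": ["k", "k", "az", "az"],
                      "y": [10.0, 20.0, 30.0, 50.0]})
test = pd.DataFrame({"X0": ["az", "k", "zz"]})

def mean_encode(train_df, test_df, cols, target_name):
    """Replace each category by the mean target observed for it in train_df."""
    train_out, test_out = train_df.copy(), test_df.copy()
    global_mean = train_df[target_name].mean()
    for c in cols:
        # target_name is a column *name*, so groupby(...)[target_name] works
        means = train_df.groupby(c)[target_name].mean()
        train_out[c] = train_df[c].map(means)
        # categories never seen in train map to NaN; fall back to the global mean
        test_out[c] = test_df[c].map(means).fillna(global_mean)
    return train_out, test_out

train_enc, test_enc = mean_encode(train, test, ["X0"], "y")
print(train_enc["X0"].tolist())  # [15.0, 15.0, 40.0, 40.0]
print(test_enc["X0"].tolist())   # [40.0, 15.0, 27.5]
```

Computing the means before concatenating train and test avoids having to align a separately stored target Series with the combined frame.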
I'm trying to multiply two pandas dataframes with each other. Specifically, I want to multiply every column with every column of the other df.
The dataframes are one-hot encoded, so they look like this:
col_1, col_2, col_3, ...
0 1 0
1 0 0
0 0 1
...
I could just iterate through each of the columns using a for loop, but in python that is computationally expensive, and I'm hoping there's an easier way.
One of the dataframes has 500 columns, the other has 100 columns.
This is the fastest version that I've been able to write so far:
interact_pd = pd.DataFrame(index=df_1.index)
df1_columns = [column for column in df_1]
for column in df_2:
    col_pd = df_1[df1_columns].multiply(df_2[column], axis="index")
    interact_pd = interact_pd.join(col_pd, lsuffix='_' + column)
I iterate over each column in df_2 and multiply all of df_1 by that column, then I append the result to interact_pd. I would rather not do it using a for loop however, as this is very computationally costly. Is there a faster way of doing it?
EDIT: example
df_1:
1col_1, 1col_2, 1col_3
0 1 0
1 0 0
0 0 1
df_2:
2col_1, 2col_2
0 1
1 0
0 0
interact_pd:
1col_1_2col_1, 1col_2_2col_1,1col_3_2col_1, 1col_1_2col_2, 1col_2_2col_2,1col_3_2col_2
0 0 0 0 1 0
1 0 0 0 0 0
0 0 0 0 0 0
# use numpy to get a pair of indices that map out every
# combination of columns from df_1 and columns of df_2
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
# use pandas MultiIndex to create a nice MultiIndex for
# the final output
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                  names=[df_1.columns.name, df_2.columns.name])
# df_1.values[:, pidx[0]] slices df_1 values for every combination
# likewise with df_2.values[:, pidx[1]]
# finally, I marry up the product of arrays with the MultiIndex
pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
             columns=lcol)
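For a feel of what the index pairs look like, here is a small sketch that applies the same steps to the 3-column and 2-column example frames from the question. The variable name interact is an assumption, and to_numpy() is used in place of .values purely as a matter of taste.

```python
import numpy as np
import pandas as pd

# The small example frames from the question.
df_1 = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 0, 1]],
                    columns=["1col_1", "1col_2", "1col_3"])
df_2 = pd.DataFrame([[0, 1], [1, 0], [0, 0]],
                    columns=["2col_1", "2col_2"])

# Row 0 holds df_1 column positions, row 1 holds df_2 column positions.
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
print(pidx[0])  # [0 0 1 1 2 2]
print(pidx[1])  # [0 1 0 1 0 1]

lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns])
interact = pd.DataFrame(
    df_1.to_numpy()[:, pidx[0]] * df_2.to_numpy()[:, pidx[1]],
    columns=lcol,
)
print(interact)
```

The columns come out grouped by df_1 column first, so the ordering differs from the question's expected output (which is grouped by df_2 column), although the set of products is the same.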
Timing
code
import numpy as np
import pandas as pd
from string import ascii_letters
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 26)), columns=list(ascii_letters[:26]))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 52)), columns=list(ascii_letters))
def pir1(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)
def Test2(DA,DB):
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
            Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
    return pd.DataFrame(MM,dtype=int,columns=Col)
results
You can multiply your first df along the index axis with each column of the second df; this is the fastest method for big datasets (see below):
df = pd.concat([df_1.mul(col[1], axis="index") for col in df_2.items()], axis=1)
# Change the name of the columns
df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
df
1col_1_2col_1 1col_2_2col_1 1col_3_2col_1 1col_1_2col_2 \
0 0 0 0 0
1 1 0 0 0
2 0 0 0 0
1col_2_2col_2 1col_3_2col_2
0 1 0
1 0 0
2 0 0
--> See benchmark for comparisons with other answers to choose the best option for your dataset.
Benchmark
Functions:
def Test2(DA,DB):
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
            Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
    return pd.DataFrame(MM,dtype=int,columns=Col)
def Test3(df_1, df_2):
    # iteritems() was removed in pandas 2.0; items() replaces it
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1)
    df.columns = ["_".join([i,j]) for j in df_2.columns for i in df_1.columns]
    return df
def Test4(df_1,df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)
def jeanrjc_imp(df_1, df_2):
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1, keys=df_2.columns)
    return df
Code:
Sorry for the ugly code; the plot at the end is what matters:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_1.columns = ["1col_"+str(i) for i in range(len(df_1.columns))]
df_2.columns = ["2col_"+str(i) for i in range(len(df_2.columns))]
resa = {}
resb = {}
resc = {}
for f, r in zip([Test2, Test3, Test4, jeanrjc_imp], ["T2", "T3", "T4", "T3bis"]):
    resa[r] = []
    resb[r] = []
    resc[r] = []
    for i in [5, 10, 30, 50, 150, 200]:
        a = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :10])
        b = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :50])
        c = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :200])
        resa[r].append(a.best)
        resb[r].append(b.best)
        resc[r].append(c.best)
X = [5, 10, 30, 50, 150, 200]
fig, ax = plt.subplots(1, 3, figsize=[16,5])
for j, (a, r) in enumerate(zip(ax, [resa, resb, resc])):
    for i in r:
        a.plot(X, r[i], label=i)
    a.set_xlabel("df_1 columns #")
    a.set_title("df_2 columns # = {}".format(["10", "50", "200"][j]))
ax[0].set_ylabel("time(s)")
plt.legend(loc=0)
plt.tight_layout()
With T3bis <=> jeanrjc_imp, which is a bit faster than Test3.
Conclusion:
Depending on your dataset size, pick the right function, between Test4 and Test3(b). Given the OP's dataset, Test3 or jeanrjc_imp should be the fastest, and also the shortest to write!
HTH
You can use numpy.
Consider this example code. I modified the variable names, but Test1() is essentially your code. I didn't bother to create the correct column names in that function, though:
import pandas as pd
import numpy as np
A = [[1,0,1,1],[0,1,1,0],[0,1,0,1]]
B = [[0,0,1,0],[1,0,1,0],[1,1,0,0],[1,0,0,1],[1,0,0,0]]
DA = pd.DataFrame(A).T
DB = pd.DataFrame(B).T
def Test1(DA,DB):
    E = pd.DataFrame(index=DA.index)
    DAC = [column for column in DA]
    for column in DB:
        C = DA[DAC].multiply(DB[column], axis="index")
        E = E.join(C, lsuffix='_' + str(column))
    return E
def Test2(DA,DB):
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
            Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
    return pd.DataFrame(MM,dtype=int,columns=Col)
print(Test1(DA,DB))
print(Test2(DA,DB))
Output:
0_1 1_1 2_1 0 1 2 0_3 1_3 2_3 0 1 2 0 1 2
0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
2 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
1col_1_2col_1 1col_1_2col_2 1col_1_2col_3 1col_2_2col_1 1col_2_2col_2 \
0 0 0 0 1 0
1 0 0 0 0 0
2 1 1 0 1 1
3 0 0 0 0 0
1col_2_2col_3 1col_3_2col_1 1col_3_2col_2 1col_3_2col_3 1col_4_2col_1 \
0 0 1 0 0 1
1 0 0 1 1 0
2 0 0 0 0 0
3 0 0 0 0 1
1col_4_2col_2 1col_4_2col_3 1col_5_2col_1 1col_5_2col_2 1col_5_2col_3
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 1 0 0 0
Performance of your function:
%timeit(Test1(DA,DB))
100 loops, best of 3: 11.1 ms per loop
Performance of my function:
%timeit(Test2(DA,DB))
1000 loops, best of 3: 464 µs per loop
It's not beautiful, but it's efficient.
I have these lines in a Matlab script that I'm trying to convert to Python. Here m=20 and n=20, and the dimensions of I_true are [400,1].
I want to convert the following Matlab code:
A=zeros((2*m*n),(2*m*n)+2);
A(1:m*n,(2*m*n)+1)=-I_true(:);
Am I converting it right?
Converted code in Python:
for i in range(0,m*n):
    for j in range((2*m*n)+1):
        A[i][j] = I_true[i]
Let's look at a small example, with n = 2, m = 2:
In Octave (and presumably Matlab):
octave:50> m = 2; n = 2;
octave:51> I_true = [1;2;3;4];
octave:52> A = zeros((2*m*n),(2*m*n)+2);
octave:53> A(1:m*n,(2*m*n)+1)=-I_true(:)
A =
0 0 0 0 0 0 0 0 -1 0
0 0 0 0 0 0 0 0 -2 0
0 0 0 0 0 0 0 0 -3 0
0 0 0 0 0 0 0 0 -4 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
The equivalent in Python (with n = 20, m = 20) would be
import numpy as np
n, m = 20, 20
I_true = np.arange(1, n*m+1) # just as an example
A = np.zeros((2*m*n, 2*(n*m+1)), dtype=I_true.dtype)
A[:m*n, 2*m*n] = -I_true
The reason why the last line uses A[:m*n, 2*m*n] and not A[1:m*n, (2*m*n)+1] is
because Python uses 0-based indexing whereas Matlab uses 1-based indexing.
Check this SO question as well.
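As a quick sanity check, the small Octave example above (m = 2, n = 2) can be reproduced in NumPy. This sketch assumes I_true is stored as a column vector of shape (m*n, 1), as described in the question, so it is flattened first; ravel(order="F") is used here as a stand-in for Matlab's column-major I_true(:).

```python
import numpy as np

m, n = 2, 2
# Column vector, mirroring the [m*n, 1] shape mentioned in the question.
I_true = np.array([[1], [2], [3], [4]])

A = np.zeros((2 * m * n, 2 * m * n + 2), dtype=I_true.dtype)
# Matlab rows 1:m*n       -> NumPy rows 0 .. m*n-1 (slice :m*n)
# Matlab column 2*m*n + 1 -> NumPy column index 2*m*n
A[:m * n, 2 * m * n] = -I_true.ravel(order="F")

print(A)
```

Only the second-to-last column of the first m*n rows is filled, so no explicit loops are needed.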
You can define a matrix with 2*m*n rows and 2*m*n+2 columns in Python like this:
m = 20
n = 20
# the outer comprehension builds the rows, the inner one the columns
a = [[0 for j in range((2*m*n)+2)] for i in range(2*m*n)]
Now that you have your matrix, you can assign values to its elements in different ways. One example would be using for loops to assign values from another matrix of the same size:
for i in range(2*m*n):
    for j in range((2*m*n)+2):
        a[i][j] = I_true[i][j]
I hope it helps.
I need to extract some data from .dat file which I usually do with
import numpy as np
file = np.loadtxt('blablabla.dat')
Here my data are not separated by a specific delimiter but have predefined widths (in characters), and some lines don't have any values for some columns.
Here is a sample to be clear:
3 0 36 0 0 0 0 0 0 0 99.
-2 0 0 0 0 0 0 0 0 0 99.
2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
-5 0 0 0 0 0 0 0 0 0 99.
99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5
My little code above raises the error:
# Convert each value according to its column and store
ValueError: Wrong number of columns at line 3
Does someone have an idea about how to collect this kind of data?
numpy.genfromtxt seems to be what you want; you can specify field widths for each column, and it treats missing data as NaNs.
For this case:
import numpy as np
data = np.genfromtxt('blablabla.dat',delimiter=[2,3,4,3,3,2,3,4,5,3,8,5])
If you want to keep information in the string part of the file, you could read twice and specify the usecols parameter:
import numpy as np
number_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=(0,1,2,3,4,5,6,7,8,9,11))
string_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=(10), dtype=str)
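To see the fixed-width behaviour in isolation, here is a minimal sketch that reads a tiny in-memory table instead of the real file. The three-column layout and the widths [2, 3, 5] are invented for the example and do not correspond to the widths of blablabla.dat.

```python
from io import StringIO

import numpy as np

# Hypothetical fixed-width table: fields are 2, 3 and 5 characters wide.
# The last field of the second row is blank (a missing value).
text = (
    " 3  0 12.5\n"
    "-2  7     \n"
    "99  1  3.5\n"
)

# A list of integers as delimiter is interpreted as field widths.
data = np.genfromtxt(StringIO(text), delimiter=[2, 3, 5])

print(data.shape)  # (3, 3)
print(data)        # the blank field shows up as nan
```

Because the fields are cut by position rather than by separator, a row with an empty field still produces the same number of columns as every other row.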
What you essentially need is the list of empty "column" positions that serve as delimiters.
This will get you started:
In [108]: table = ''' 3 0 36 0 0 0 0 0 0 0 99.
.....: -2 0 0 0 0 0 0 0 0 0 99.
.....: 2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
.....: 5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
.....: -5 0 0 0 0 0 0 0 0 0 99.
.....: 99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5'''.split('\n')
In [110]: max_row_len = max(len(row) for row in table)
In [117]: spaces = reduce(lambda res, row: res.intersection(idx for idx, c in enumerate(row) if c == ' '), table, set(range(max_row_len)))
This code builds the set of character positions in the longest row, and reduce leaves only the positions that contain a space in every row. (On Python 3, reduce has to be imported from functools first.)
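A self-contained Python 3 version of the same idea might look like the sketch below. The three-row table is a shortened, made-up stand-in for the real data, chosen so that all rows have the same length; with rows of different lengths you would probably want to pad them (for example with str.ljust) first.

```python
from functools import reduce

# Shortened, hypothetical sample; every row is 14 characters long.
table = [
    " 3  0      99.",
    "-2  0 .LA.  3.",
    "99  0 .S.. 3.5",
]

max_row_len = max(len(row) for row in table)

# Start from all positions and keep only those that are a space in every row.
spaces = reduce(
    lambda res, row: res.intersection(
        idx for idx, c in enumerate(row) if c == " "
    ),
    table,
    set(range(max_row_len)),
)

print(sorted(spaces))  # [2, 3, 5, 10]
```

Runs of consecutive positions in the result (such as 2 and 3 here) mark a single gap between two columns, so the field boundaries can be derived from where those runs start and end.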