I want to reshape my 24x20 matrices 'A','B','C' which are extracted from text file and are saved before and after normalizing by def normalize() in for-loop through cycles in such way that each cycles would be a row with all elements of 3 matrices side by side like below:
[[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)] #cycle1
[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)] #cycle2
[A(1,1),B(1,1),C(1,1),A(1,2),B(1,2),C(1,2),...,A(24,20),B(24,20),C(24,20)]] #cycle3
So far based on #odyse suggestion I used following snippet in the end of for-loop:
for cycle in range(cycles):
dff = pd.DataFrame({'A_norm':A_norm[cycle] , 'B_norm': B_norm[cycle] , 'C_norm': C_norm[cycle] } , index=[0])
D = dff.as_matrix().ravel()
if cycle == 0:
Results = np.array(D)
else:
Results = np.vstack((Results, D2))
np.savetxt("Results.csv", Results, delimiter=",")
but there is a problem when I use after def normalize() in for-loop in spite of its error (ValueError) it also has warning FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead for D = dff.as_matrix().ravel() which is not important but right now since it is FutureWarning nevertheless I checked the shape of output was correct for 3 cycles by using print(data1.shape) and it was (3, 1440) which is 3 rows as 3 cycles and number of columns should be 3 times 480= 1440 but all in all wasn't stable solution.
the complete scripts are following:
import numpy as np
import pandas as pd
import os
def normalize(value, min_value, max_value, min_norm, max_norm):
new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
return new_value
#the size of matrices are (24,20)
df1 = np.zeros((24,20))
df2 = np.zeros((24,20))
df3 = np.zeros((24,20))
#next iteration create all plots, change the number of cycles
cycles = int(len(df)/480)
print(cycles)
for cycle in range(3):
count = '{:04}'.format(cycle)
j = cycle * 480
new_value1 = df['A'].iloc[j:j+480]
new_value2 = df['B'].iloc[j:j+480]
new_value3 = df['C'].iloc[j:j+480]
df1 = print_df(mkdf(new_value1))
df2 = print_df(mkdf(new_value2))
df3 = print_df(mkdf(new_value3))
for i in df:
try:
os.mkdir(i)
except:
pass
min_val = df[i].min()
min_nor = -1
max_val = df[i].max()
max_nor = 1
ordered_data = mkdf(df.iloc[j:j+480][i])
csv = print_df(ordered_data)
#Print .csv files contains matrix of each parameters by name of cycles respectively
csv.to_csv(f'{i}/{i}{count}.csv', header=None, index=None)
if 'C' in i:
min_nor = -40
max_nor = 150
#Applying normalization for C between [-40,+150]
new_value3 = normalize(df['C'].iloc[j:j+480], min_val, max_val, -40, 150)
C_norm = print_df(mkdf(new_value3))
C_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
else:
#Applying normalization for A,B between [-1,+1]
new_value1 = normalize(df['A'].iloc[j:j+480], min_val, max_val, -1, 1)
new_value2 = normalize(df['B'].iloc[j:j+480], min_val, max_val, -1, 1)
A_norm = print_df(mkdf(new_value1))
B_norm = print_df(mkdf(new_value2))
A_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
B_norm.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
dff = pd.DataFrame({'A_norm':A_norm[cycle] , 'B_norm': B_norm[cycle] , 'C_norm': C_norm[cycle] } , index=[0])
D = dff.as_matrix().ravel()
if cycle == 0:
Results = np.array(D)
else:
Results = np.vstack((Results, D))
np.savetxt("Results.csv", Results , delimiter=',', encoding='utf-8')
#Check output shape whether is (3, 1440) or not
data1 = np.loadtxt('Results.csv', delimiter=',')
print(data1.shape)
Note1: my data is txt file is following:
id_set: 000
A: -2.46882615679
B: -2.26408246559
C: -325.004619528
Note2: I provided a dataset in text file for 3 cycles:
Text dataset
Note3: for mapping A, B, C parameters into matrices in right order I used print_df() mkdf() functions but I didn't mention due to reduce it to the core problem and just leave a minimal example in start of this post. Let me know if you need that.
Expected result should be done by completing for-loop on 'A_norm','B_norm','C_norm' which are represented normalized versions of 'A','B','C' respectively and output let's call it "Results.csv" should be reversible to regenerate 'A','B','C' matrices through cycles again save them in csv. files for controlling , therefore if you have any ideas about reverse part please mention that separately otherwise just control it by using print(data.shape) and it should be (3, 1440).
Have a nice day and thanks in advance!
Related
I have two matrices. One is of size (CxK) and another is of size (SxK) (where S,C, and K all have the potential to be very large). I want to combine these an output matrix using the cosine similarity function (would be of size [CxS]). When I run my code, it takes a very long time to produce an output, and I was wondering if there is any way to optimize what I currently have. [Note, the two input matrices are often very sparse]
I was previously traversing each matrix using two for index,row loops, but I have since switched to the while loops, which improved my run time significantly.
A #this is one of my input matrices (pandas dataframe)
B #this is my second input matrix (pandas dataframe)
C = pd.DataFrame(columns = ['col_1' ,'col_2' ,'col_3'])
i=0
k=0
while i <= 5:
col_1 = A.iloc[i].get('label_A')
while k < 5:
col_2 = B.iloc[k].get('label_B')
propensity = cosine_similarity([A.drop('label_A', axis=1)\
.iloc[i]], [B.drop('label_B',axis=1).iloc[k]])
d = {'col_1':[col_1], 'col_2':[col_2], 'col_3':[propensity[0][0]]}
to_append = pd.DataFrame(data=d)
C = C.append(to_append)
k += 1
k = 0
i += 1
Right now I have the loops to run on only 5 items from each matrix, producing a 5x5 matrix, but I would obviously like this to work for very large inputs. This is the first time I have done anything like this so please let me know if any facet of code can be improved (data types used to hold matrices, how to traverse them, updating the output matrix, etc.).
Thank you in advance.
This can be done much more easyly and way faster by passing the whole arrays to cosine_similarity after you move the labels to the index:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import time
c = 50
s = 50
k = 100
A = pd.DataFrame( np.random.rand(c,k))
B = pd.DataFrame( np.random.rand(s,k))
A['label_A'] = [f'A{i}' for i in range(c)]
B['label_B'] = [f'B{i}' for i in range(s)]
C = pd.DataFrame()
# your program
start = time.time()
i=0
k=0
while i < c:
col_1 = A.iloc[i].get('label_A')
while k < s:
col_2 = B.iloc[k].get('label_B')
propensity = cosine_similarity([A.drop('label_A', axis=1)\
.iloc[i]], [B.drop('label_B',axis=1).iloc[k]])
d = {'col_1':[col_1], 'col_2':[col_2], 'col_3':[propensity[0][0]]}
to_append = pd.DataFrame(data=d)
C = C.append(to_append)
k += 1
k = 0
i += 1
print(f'elementwise: {time.time() - start:7.3f} s')
# my solution
start = time.time()
A = A.set_index('label_A')
B = B.set_index('label_B')
C1 = pd.DataFrame(cosine_similarity(A, B), index=A.index, columns=B.index).stack().rename('col_3')
C1.index.rename(['col_1','col_2'], inplace=True)
C1 = C1.reset_index()
print(f'whole array: {time.time() - start:7.3f} s')
# verification
assert(C[['col_1','col_2']].to_numpy()==C1[['col_1','col_2']].to_numpy()).all()\
and np.allclose(C.col_3.to_numpy(), C1.col_3.to_numpy())
I have a data frame that contains information about the positions of various objects, and a unique index for each object (index in this case is not related to the data frame). Here is some example data:
ind pos
x y z
-1.0 7.0 0.0 21 [-2.76788330078, 217.786453247, 26.6822681427]
0.0 22 [-7.23852539062, 217.274139404, 26.6758270264]
0.0 1.0 152 [-0.868591308594, 2.48404550552, 48.4036369324]
6.0 2.0 427 [-0.304443359375, 182.772140503, 79.4475860596]
The actual data frame is quite long. I have written a function that takes two vectors as inputs and outputs the distance between them:
def dist(a, b):
diff = N.array(a)-N.array(b)
d = N.sqrt(N.dot(diff, diff))
return d
and a function that, given two arrays, will output all the unique combinations of elements between these arrays:
def getPairs(a, b):
if N.array_equal(a, b):
pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(i+1,
len(b))]
else:
pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
return pairs
I want to take my data frame and find all the pairs of elements whose distance between them is less than some value, say 30. For the pairs that meet this requirement, I also need to store the distance I calculated in some other data frame. Here is my attempt at solving this, but this turned out to be extremely slow.
pairs = [getPairs(list(group.ind), list(boxes.get_group((name[0]+i, name[1]+j, name[2]+k)).ind)) \
for i in [0,1] for j in [0,1] for k in [0,1] if name[0]+i != 34 and name[1]+j != 34 and name[2]+k != 34]
pairs = list(itertools.chain(*pairs))
subInfo = pandas.DataFrame()
subInfo['pairs'] = pairs
subInfo['r'] = subInfo.pairs.apply(lambda x: dist(df_yz.query('ind == #x[0]').pos[0], df_yz.query('ind == #x[1]').pos[0]))
Don't worry about what I'm iterating over in this for loop, it works for the system I'm dealing with and isn't where I'm getting slowed down. The step I use .query() is where the major jam happens.
The output I am looking for is something like:
pair distance
(21, 22) 22.59
(21, 152) 15.01
(22, 427) 19.22
I made the distances up, and the pair list would be much longer, but that's the basic idea.
Took me a while, but here are thee possible solution. Hope they are self explanatory. Written in Python 3.x in Jupyter Notebook. One remark: if your coordinates are world coordinates, you may think of using the Haversine distance (circular distance) instead of the Euclidean distance which is a straight line.
First, create your data
import pandas as pd
import numpy as np
values = [
{ 'x':-1.0, 'y':7.0, 'z':0.0, 'ind':21, 'pos':[-2.76788330078, 217.786453247, 26.6822681427] },
{ 'z':0.0, 'ind':22, 'pos':[-7.23852539062, 217.274139404, 26.6758270264] },
{ 'y':0.0, 'z':1.0, 'ind':152, 'pos':[-0.868591308594, 2.48404550552, 48.4036369324] },
{ 'y':6.0, 'z':2.0, 'ind':427, 'pos':[-0.304443359375, 182.772140503, 79.4475860596] }
]
def dist(a, b):
"""
Calculates the Euclidean distance between two 3D-vectors.
"""
diff = np.array(a) - np.array(b)
d = np.sqrt(np.dot(diff, diff))
return d
df_initial = pd.DataFrame(values)
The following three solutions will generate this output:
pairs distance
1 (21, 22) 4.499905
3 (21, 427) 63.373886
7 (22, 427) 63.429709
First solution is based on a full join of the data with itself. Downside is that it may exceed your memory if the dataset is huge. Advantages are the easy readability of the code and the usage of Pandas only:
#%%time
df = df_initial.copy()
# join data with itself, each line will contain two geo-positions
df['tmp'] = 1
df = df.merge(df, on='tmp', suffixes=['1', '2']).drop('tmp', axis=1)
# remove rows with similar index
df = df[df['ind1'] != df['ind2']]
# calculate distance for all
df['distance'] = df.apply(lambda row: dist(row['pos1'], row['pos2']), axis=1)
# filter only those within a specific distance
df = df[df['distance'] < 70]
# combine original indices into a tuple
df['pairs'] = list(zip(df['ind1'], df['ind2']))
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
x, y = idx
if y < x:
return y, x
return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
The second solution tries to avoid the memory issue of the first version by iterating over the original data line by line and calculating the distance between the current line and the original data while keeping only values that satisfy the minimum distance constraint. I was expecting a bad performance, but wasn't bad at all (see summary at the end).
#%%time
df = df_initial.copy()
results = list()
for index, row1 in df.iterrows():
# calculate distance between current coordinate and all original rows in the data
df['distance'] = df.apply(lambda row2: dist(row1['pos'], row2['pos']), axis=1)
# filter only those within a specific distance and drop rows with same index as current coordinate
df_tmp = df[(df['distance'] < 70) & (df['ind'] != row1['ind'])].copy()
# prepare final data
df_tmp['ind2'] = row1['ind']
df_tmp['pairs'] = list(zip(df_tmp['ind'], df_tmp['ind2']))
# remember data
results.append(df_tmp)
# combine all into one dataframe
df = pd.concat(results)
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
x, y = idx
if y < x:
return y, x
return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
The third solution is based on spatial operations using the KDTree from Scipy.
#%%time
from scipy import spatial
tree = spatial.KDTree(list(df_initial['pos']))
# calculate distances (returns a sparse matrix)
distances = tree.sparse_distance_matrix(tree, max_distance=70)
# convert to a Coordinate (coo) representation of the Compresses-Sparse-Column (csc) matrix.
coo = distances.tocoo(copy=False)
def get_cell_value(idx: int, column: str = 'ind'):
return df_initial.iloc[idx][column]
def extract_indices(row):
distance, idx1, idx2 = row
return get_cell_value(int(idx1)), get_cell_value(int(idx2))
df = pd.DataFrame({'idx1': coo.row, 'idx2': coo.col, 'distance': coo.data})
df['pairs'] = df.apply(extract_indices, axis=1)
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
x, y = idx
if y < x:
return y, x
return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
So what about performance. If you just want to know which row of your original data is within the desired distance, then the KDTree version (third version) is super fast. It took just 4ms to generate the sparse matrix. But since I then used the indices from that matrix to extract the data from the original data, the performance dropped. Of course this should be tested on your full dataset.
version 1: 93.4 ms
version 2: 42.2 ms
version 3: 52.3 ms (4 ms)
I need to split dataframe into 10 parts then use one part as the testset and remaining 9 (merged to use as training set) , I have come up to the following code where I am able to split the dataset , and m trying to merge the remaining sets after picking one of those 10.
The first iteration goes fine , but I get following error in second iteration.
df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))
for x in range(3):
dfList = np.array_split(df, 3)
testdf = dfList[x]
dfList.remove(dfList[x])
print testdf
traindf = pd.concat(dfList)
print traindf
print "================================================"
I don't think you have to split the dataframe in 10 but just in 2.
I use this code for splitting a dataframe in training set and validation set:
test_index = np.random.choice(df.index, int(len(df.index)/10), replace=False)
test_df = df.loc[test_index]
train_df = df.loc[~df.index.isin(test_index)]
okay I got it working this way :
df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))
dfList = np.array_split(df, 3)
for x in range(3):
trainList = []
for y in range(3):
if y == x :
testdf = dfList[y]
else:
trainList.append(dfList[y])
traindf = pd.concat(trainList)
print testdf
print traindf
print "================================================"
But better approach is welcome.
You can use the permutation function from numpy.random
import numpy as np
import pandas as pd
import math as mt
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df = pd.DataFrame({'a': l, 'b': l})
shuffle the dataframe index
shuffled_idx = np.random.permutation(df.index)
divide the shuffled_index into N equal(ish) parts
for this example, let N = 4
N = 4
n = len(shuffled_idx) / N
parts = []
for j in range(N):
parts.append(shuffled_idx[mt.ceil(j*n): mt.ceil(j*n+n)])
# to show each shuffled part of the data frame
for k in parts:
print(df.iloc[k])
I wrote a piece of script find / fork it on github for the purpose of splitting a Pandas dataframe randomly. Here's a link to Pandas - Merge, join, and concatenate functionality!
Same code for your reference:
import pandas as pd
import numpy as np
from xlwings import Sheet, Range, Workbook
#path to file
df = pd.read_excel(r"//PATH TO FILE//")
df.columns = [c.replace(' ',"_") for c in df.columns]
x = df.columns[0].encode("utf-8")
#number of parts the data frame or the list needs to be split into
n = 7
seq = list(df[x])
np.random.shuffle(seq)
lists1 = [seq[i:i+n] for i in range(0, len(seq), n)]
listsdf = pd.DataFrame(lists1).reset_index()
dataframesDict = dict()
# calling xlwings workbook function
Workbook()
for i in range(0,n):
if Sheet.count() < n:
Sheet.add()
doubles[i] =
df.loc[df.Column_Name.isin(list(listsdf[listsdf.columns[i+1]]))]
Range(i,"A1").value = doubles[i]
Looks like you are trying to do a k-fold type thing, rather than a one-off. This code should help. You may also find the SKLearn k-fold functionality works in your case, that's also worth checking out.
# Split dataframe by rows into n roughly equal portions and return list of
# them.
def splitDf(df, n) :
splitPoints = list(map( lambda x: int(x*len(df)/n), (list(range(1,n)))))
splits = list(np.split(df.sample(frac=1), splitPoints))
return splits
# Take splits from splitDf, and return into test set (splits[index]) and training set (the rest)
def makeTrainAndTest(splits, index) :
# index is zero based, so range 0-9 for 10 fold split
test = splits[index]
leftLst = splits[:index]
rightLst = splits[index+1:]
train = pd.concat(leftLst+rightLst)
return train, test
You can then use these functions to make the folds
df = <my_total_data>
n = 10
splits = splitDf(df, n)
trainTest = []
for i in range(0,n) :
trainTest.append(makeTrainAndTest(splits, i))
# Get test set 2
test2 = trainTest[2][1].shape
# Get training set zero
train0 = trainTest[0][0]
I have a bunch of data (10M + records) that breaks down to an identifier, a location and a date. I want to find the number of times that any identifier moved from some locationA to some other locationB over the entire set of dates. Any identifier may not have a location for all possible dates. When an identifier does not have a location recorded, that should be treated as an actual 'unknown' location for that date.
Here is some reproducible fake data...
import numpy as np
import pandas as pd
import datetime
base = datetime.date.today()
num_days = 50
dates = np.array([base - datetime.timedelta(days=x) for x in range(num_days-1, -1, -1)])
ids = np.arange(50)
mi = pd.MultiIndex.from_product([ids, dates])
locations = np.array([chr(x) for x in 97 + np.random.randint(26, size=len(mi))])
s = pd.Series(locations, index=mi)
mask = np.random.rand(len(mi)) > .5
s[mask] = np.nan
s = s.dropna()
My initial thought was to create a dataframe and use boolean masking/vectorized operations to solve this
df = s.unstack(0).fillna('unknown')
Apparently my data is sparse enough to cause a MemoryError (from all the extra entries resulting from unstacking).
My current working solution is the following
def series_fn(s):
s = s.reindex(pd.date_range(s.index.levels[1].min(), s.index.levels[1].max()), level=-1).fillna('unknown')
mask_prev = (s != s.shift(-1))[:-1]
mask_next = (s != s.shift())[1:]
s_prev = s[:-1][mask_prev]
s_next = s[1:][mask_next]
s_tup = pd.Series(list(zip(s_prev, s_next)))
return s_tup.value_counts()
result_per_id = s.groupby(level=0).apply(series_fn)
result = result_per_id.sum(level=-1)
result looks like
(a, b) 1
(a, c) 5
(a, e) 3
(a, f) 3
(a, g) 3
(a, h) 3
(a, i) 1
(a, j) 1
(a, k) 2
(a, l) 2
...
This is going to take ~5 hours for all my data. Does anyone know any faster ways of doing this?
Thanks!
Hmmm, I guess I should have transposed the data... well that was a relatively simple fix. Instead of using groupby and apply,
s = s.reorder_levels(['date', 'id'])
s = s.sortlevel(0)
results = []
for i in range(len(s.index.levels[0])-1):
t = time.time()
s0 = s.loc[s.index.levels[0][i]]
s1 = s.loc[s.index.levels[0][i+1]]
df = pd.concat((s0, s1), axis=1)
# Note: this is slower than the line above
# df = s.loc[s.index.levels[0][0:2], :].unstack(0)
df = df.fillna('unknown')
mi = pd.MultiIndex.from_arrays((df.iloc[:, 0], df.iloc[:, 1]))
s2 = pd.Series(1, mi)
res = s2.groupby(level=[0, 1]).apply(np.sum)
results.append(res)
print(time.time() - t)
results = pd.concat(results, axis=1)
Still unclear on why the commented out section takes about three times as long as the three lines above it.
I wrote a small script to assign values to a numpy array by knowing their row and column coordinates:
gridarray = np.zeros([3,3])
gridarray_counts = np.zeros([3,3])
cols = np.random.random_integers(0,2,15)
rows = np.random.random_integers(0,2,15)
data = np.random.random_integers(0,9,15)
for nn in np.arange(len(data)):
gridarray[rows[nn],cols[nn]] += data[nn]
gridarray_counts[rows[nn],cols[nn]] += 1
In fact, then I know how many values are stored in the same grid cell and what the sum is of them. However, performing this on arrays of lengths 100000+ it is getting quite slow. Is there another way without using a for-loop?
Is an approach similar to this possible? I know this is not working yet.
gridarray[rows,cols] += data
gridarray_counts[rows,cols] += 1
I would use bincount for this, but for now bincount only takes 1darrays so you'll need to write your own ndbincout, something like:
def ndbincount(x, weights=None, shape=None):
if shape is None:
shape = x.max(1) + 1
x = np.ravel_multi_index(x, shape)
out = np.bincount(x, weights, minlength=np.prod(shape))
out.shape = shape
return out
Then you can do:
gridarray = np.zeros([3,3])
cols = np.random.random_integers(0,2,15)
rows = np.random.random_integers(0,2,15)
data = np.random.random_integers(0,9,15)
x = np.vstack([rows, cols])
temp = ndbincount(x, data, gridarray.shape)
gridarray = gridarray + temp
gridarray_counts = ndbincount(x, shape=gridarray.shape)
You can do this directly:
gridarray[(rows,cols)]+=data
gridarray_counts[(rows,cols)]+=1