Create matrix from indices and value points - python

I want to read a text file containing the values of a matrix. Say you have a .txt file that looks like this:
0 0 4.0
0 1 5.2
0 2 2.1
1 0 2.1
1 1 2.9
1 2 3.1
Here, the first column gives the indices of the matrix on the x-axis and the second column gives the indices on the y-axis. The third column is the value at this position in the matrix. Where values are missing, the entry is simply zero.
I am well aware that data formats like the .mtx format exist, but I would like to create a scipy sparse matrix or numpy array from this txt file alone instead of adjusting it to the .mtx format. Is there a Python function I am missing that does this for me?

import numpy

with open('filename.txt', 'r') as f:
    lines = f.readlines()      # the with-block closes the file automatically

data = [line.split() for line in lines]   # split on whitespace
z = list(zip(*data))                      # transpose into three columns
row_indices = list(map(int, z[0]))
column_indices = list(map(int, z[1]))
values = list(map(float, z[2]))

m = max(row_indices) + 1
n = max(column_indices) + 1
p = max(m, n)
A = numpy.zeros((p, p))
A[row_indices, column_indices] = values
print(A)
If you want a matrix with the maximum of column 1 as the number of rows and the maximum of column 2 as the number of columns, you can remove p = max(m, n) and replace A = numpy.zeros((p,p)) with A = numpy.zeros((m,n)).
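Since the question explicitly asks about scipy sparse matrices, here is a minimal sketch of the same construction with scipy.sparse.coo_matrix (assuming the same whitespace-separated filename.txt as above):

import numpy as np
from scipy import sparse

# load the three whitespace-separated columns in one call
rows, cols, vals = np.loadtxt('filename.txt', unpack=True)

# COO format takes (values, (row_indices, column_indices)); entries
# absent from the file are implicitly zero, matching the file's convention
A = sparse.coo_matrix((vals, (rows.astype(int), cols.astype(int))))
print(A.toarray())

The shape is inferred from the largest indices; pass shape=(p, p) explicitly if you need the square zero-padding of the numpy version.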

Starting from the array (a) sorted on the first column (major) and second (minor) as in your example, you can reshape:
# a = np.loadtxt('filename')
x = len(np.unique(a[:,0]))
y = len(np.unique(a[:,1]))
a[:,2].reshape(x,y).T
Output:
array([[4. , 2.1],
       [5.2, 2.9],
       [2.1, 3.1]])

Related

pandas dataframe and external list interaction

I have a pandas dataframe df which looks like this:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.225660 0.083903
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.029690 0.188627 0.200235 0.224703 0.081434
3 0.009938 0.059595 0.109310 0.069609 0.009970 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009
Then I have a vector dk that looks like this:
[0.18,0.35,0.71,1.41,2.83,5.66,11.31,22.63,45.25,90.51,181.02]
What I need to do is:
1. calculate a new vector: psik = [np.log2(dki/1e3) for dki in dk]
2. calculate the sum of each row multiplied by the psik vector (just like Excel's SUMPRODUCT function)
3. calculate the log2 of each psig value
The expected output should be:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10 psig dg
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083 -5.848002631 0.017361042
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.22566 0.083903 -5.903532822 0.016705502
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.02969 0.188627 0.200235 0.224703 0.081434 -5.908820802 0.016644383
3 0.009938 0.059595 0.10931 0.069609 0.00997 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249 -5.930608559 0.016394906
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009 -5.924408689 0.016465513
I would do that with a for loop cycling over the rows, like this:
psig = []
dg = []
for r in rows:
    psig_i = sum([psik[i] * ri for i, ri in enumerate(r)])
    psig.append(psig_i)
    dg.append(np.log2(psig_i))
df['psig'] = psig
df['dg'] = dg
Is there any other way to update the df without iterating through its rows?
EDIT: I found the solution and I am ashamed of how simple it is:
df['psig']=df.mul(psik).sum(axis=1)
df['dg'] = df['psig'].apply(lambda x: np.log2(x))
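For anyone puzzled by why the one-liner works: DataFrame.mul with a list aligns the list against the columns, so each column is scaled by the matching psik entry before the row-wise sum. A tiny self-contained sketch (toy values, two columns instead of eleven):

import pandas as pd

df = pd.DataFrame({'betasub0': [0.2, 0.3],
                   'betasub1': [0.8, 0.7]})
psik = [2.0, 3.0]

# each column is multiplied by the matching psik entry, then summed
# across axis=1, which is exactly Excel's SUMPRODUCT per row
df['psig'] = df[['betasub0', 'betasub1']].mul(psik).sum(axis=1)
print(df['psig'])  # row 0: 0.2*2 + 0.8*3 = 2.8, row 1: 0.3*2 + 0.7*3 = 2.7

Note that the beta columns are selected explicitly here; once extra columns like psig exist, a bare df.mul(psik) would no longer line up.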
EDIT2: now my df has more entries, so I have to filter it with a regex to keep only the columns whose names start with "betasub".
I have my array psik and a new column psig in the df. I would like to calculate, for each row (i.e. each value of psig):
sum(((psik-psig)**2)*betasub[0...n])
I did it like this, but maybe there's a better way?
PsimPsig2 = [[(psik_i-psig_i)**2 for psik_i in psik] for psig_i in list(df['psig'])]
psikmpsigname = ['psikmpsig'+str(i) for i in range(len(psik))]
dfPsimPsig2 = pd.DataFrame(data=PsimPsig2,columns=psikmpsigname)
siggAL = np.power(2,(np.power(pd.DataFrame(df.filter(regex=r'^betasub[0-9]',axis=1).values*dfPsimPsig2.values).sum(axis=1),0.5)))
df['siggAL'] = siggAL
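One possibly simpler alternative is to let numpy broadcasting build the squared-difference matrix directly, instead of going through a temporary DataFrame; a sketch assuming psik has one entry per betasub column and df already holds the psig column from above:

import numpy as np

psik_arr = np.asarray(psik)
betas = df.filter(regex=r'^betasub[0-9]', axis=1).to_numpy()
psig = df['psig'].to_numpy()

# psig[:, None] has shape (rows, 1) and psik_arr has shape (n,), so the
# subtraction broadcasts to a (rows, n) array of squared differences
diff2 = (psik_arr - psig[:, None]) ** 2
df['siggAL'] = 2 ** np.sqrt((betas * diff2).sum(axis=1))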

How to perform calculations on csv rows and cols and create new cols using numpy and pandas?

I have a csv that contains 12 cols and 4 rows of data, as seen in the image in the original post.
I would like to divide each of those values by its area (I have created an array of the areas, also pictured in the original post), then multiply by 100 to get a percentage, and have these values in a new column.
So, for example, A2, A3 and A4 will all be divided by 52,600 and then multiplied by 100.
My current df looks like the dataframe shown in the original post.
I interpreted your request for a new column to be a new column for each Sub_* in your spreadsheet, since there were 12 values in your numpy array.
Code edit: I see you wanted to do the math to the 'Baseline' column as well, so I step through each column except the first (which is "Label", at index 0):
import numpy as np
import pandas as pd

df = pd.read_excel(r"d:\stack67477476.xlsx")   # raw string so the backslash is literal
area_arr = np.array([[52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538]])
for ii, col in enumerate(df.columns):
    if ii == 0:        # skip the "Label" column
        continue
    df[col + "_Area"] = round(df[col] / area_arr[0][ii - 1] * 100, 2)
This is vectorized in one dimension (the 4 rows dimension) but loops through the 12 columns dimension. The output is as follows (don't quote me on this, I may have typed your inputs incorrectly):
df
Label Baseline Sub_A Sub_B Sub_C Sub_D Sub_E Sub_F Sub_G Sub_H Sub_I ... Sub_A_Area Sub_B_Area Sub_C_Area Sub_D_Area Sub_E_Area Sub_F_Area Sub_G_Area Sub_H_Area Sub_I_Area Sub_J_Area Sub_K_Area
0 0 0 15535 5128 8847 10784 5679 20481 8398 10012 5162 ... 103801.95 66580.11 212209.16 290673.85 100548.87 301857.04 449812.53 190053.15 103467.63 275527.43 380177.42
1 1 159506 149454 157456 155680 154327 154671 146863 150761 150446 155335 ... 998623.55 2044352.12 3734228.83 4159757.41 2738509.21 2164524.69 8075040.17 2855846.62 3113549.81 9387040.39 1963949.22
2 2 129087 111918 121515 122066 119557 123813 114746 123140 122156 125480 ... 747815.05 1577707.09 2927944.35 3222560.65 2192156.52 1691171.70 6595607.93 2318830.68 2515133.29 7608679.93 1653533.19
3 3 137562 102318 114509 124641 127442 130324 123331 130392 130715 134528 ... 683669.65 1486743.70 2989709.76 3435094.34 2307436.26 1817700.81 6984038.56 2481302.20 2696492.28 8123206.75 1881890.49
4 4 35901 26488 30836 33756 34549 34000 33269 34071 34151 35149 ... 176987.84 400363.54 809690.57 931239.89 601983.00 490331.61 1824906.27 648272.59 704529.97 2146473.78 531691.65
[5 rows x 25 columns]
Note that it's unclear why your numpy array is 2D; one assumes there is something deeper to that in the rest of your code. It would be clearer to avoid the extra set of brackets:
area_arr = np.array([52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538])
And simplify the divisor to just:
area_arr[ii - 1]  # instead of area_arr[0][ii - 1]
or for that matter, a simple list would be ok, since numpy isn't needed here.
Apologies if we have miscommunicated on commas and decimal points, but the code still works if you change the numbers.
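For completeness, the column loop can be removed entirely by letting pandas align a Series of areas against the column labels; a sketch assuming the original df (before the loop above added columns) and the same area_arr:

import pandas as pd

# one area per data column; df.columns[1:] skips the "Label" column
areas = pd.Series(area_arr.ravel(), index=df.columns[1:])
pct = df.iloc[:, 1:].div(areas).mul(100).round(2)

# attach the results with the same "_Area" suffix used above
df = df.join(pct.add_suffix("_Area"))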

User-Item rating matrix : IndexError

My dataframe urm has a shape of (96438, 3):
user_id anime_id user_rating
0 1 20 7.808497
1 3 20 8.000000
2 5 20 6.000000
3 6 20 7.808497
4 10 20 7.808497
I'm trying to build a user-item rating matrix:
X = urm[["user_id", "anime_id"]].as_matrix()
y = urm["user_rating"].values
n_u = len(urm["user_id"].unique())
n_m = len(urm["anime_id"].unique())
R = np.zeros((n_u, n_m))
for idx, row in enumerate(X):
    R[row[0] - 1, row[1] - 1] = y[idx]
If the code succeeds, the matrix should look like this (I filled NaN with 0): user_id in the index, anime_id in the columns and the rating as the value (I got this matrix from pivot_table).
In some tutorials this works, but here I get an
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-278-0e06bd0f3133> in <module>()
15 R = np.zeros((n_u, n_m))
16 for idx, row in enumerate(X):
---> 17 R[row[0]-1, row[1]-1] = y[idx]
IndexError: index 5276 is out of bounds for axis 1 with size 5143
I tried the second suggestion of dennlinger and it worked for me.
This was the code I wrote:
def id_to_index(df):
    """
    maps the values to the lowest consecutive values
    :param df: pandas DataFrame with columns user, item, rating
    :return: pandas DataFrame with the extra columns index_item and index_user
    """
    index_item = np.arange(0, len(df.item.unique()))
    index_user = np.arange(0, len(df.user.unique()))
    df_item_index = pd.DataFrame(df.item.unique(), columns=["item"])
    df_item_index["new_index"] = index_item
    df_user_index = pd.DataFrame(df.user.unique(), columns=["user"])
    df_user_index["new_index"] = index_user
    df["index_item"] = df["item"].map(df_item_index.set_index('item')["new_index"]).fillna(0)
    df["index_user"] = df["user"].map(df_user_index.set_index('user')["new_index"]).fillna(0)
    return df
I am assuming you have non-consecutive user IDs (or movie IDs), which means that there exist indices that either have
no rating, or
no movie
In your case, you are setting up your matrix dimensions with the assumption that every value will be consecutive (since you are defining the dimension with the amount of unique values), which causes some non-consecutive values to reach out of bounds.
In that case, you have two options:
You can define your matrix to be of size urm["user_id"].max() by urm["anime_id"].max()
Create a dictionary that maps your values to the lowest consecutive values.
The disadvantage of the first approach is obviously that it requires you to store a bigger matrix. Also, you can use scipy.sparse to create a matrix from the data format you have (commonly referred to as the coordinate matrix format).
Potentially, you can do something like this:
from scipy import sparse
# scipy expects the data as (values, (row_indices, column_indices))
mat = sparse.coo_matrix((urm["user_rating"], (urm["user_id"], urm["anime_id"])))
# if you want it as a dense matrix
dense_mat = mat.todense()
You can then also work your way to the second suggestion, as I have previously asked here
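As a side note, pandas can produce the same consecutive remapping as the id_to_index helper above in one call per column; a minimal sketch using pd.factorize (which returns the integer codes and the unique values):

import numpy as np
import pandas as pd

# factorize assigns each unique ID the lowest consecutive integer
urm["index_user"], _ = pd.factorize(urm["user_id"])
urm["index_item"], _ = pd.factorize(urm["anime_id"])

R = np.zeros((urm["index_user"].max() + 1, urm["index_item"].max() + 1))
R[urm["index_user"], urm["index_item"]] = urm["user_rating"]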

How to pre-process a very large data in python

I have a couple of files 100 MB each. The format for those files looks like this:
0 1 2 5 8 67 9 122
1 4 5 2 5 8
0 2 1 5 6
.....
(Note: the actual file does not have the alignment spaces added in; only one space separates each element. Alignment added for aesthetic effect.)
The first element in each row is its binary classification, and the rest of the row lists the indices of the features whose value is 1. For instance, the third row says the row's second, first, fifth and sixth features are 1; the rest are zeros.
I tried to read each line from each file and use sparse.coo_matrix to create a sparse matrix, like this:
for train in train_files:
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row + [index] * (len(record) - 4)
            col = col + record[4:]
        row = np.array(row)
        col = np.array(col)
        data = np.array([1] * len(row))
        mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
        mmwrite(train + 'trans', mtx)
but this took forever to finish. I started reading the data at night, let the computer run after I went to sleep, and when I woke up, it still hadn't finished the first file!
What is a better way to process this kind of data?
I think this will be a bit faster than your method because it does not read the file line by line. You can try this code with a small portion of one file and compare it with your code.
Note that this code requires knowing the number of features in advance; if you don't know it, use the commented-out line below instead.
import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial

def writeMx(result, row):
    # zero-based matrix requires the feature number minus 1;
    # astype(int) because the NaN padding makes the dtype float
    col_ind = row.dropna().values.astype(int) - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1

def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0, col_n + 2)), sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number with the line below,
    # but it would not be the same across different files
    # col_n = df.max().max()
    # Number of rows
    row_n = len(label)
    # Generate the feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save the features in the matrix; axis=1 passes one row at a time,
    # and DataFrame.apply() is usually faster than normal looping
    df.apply(partial(writeMx, result), axis=1)
    return result
for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # The shape of the matrix and the number of nonzero values
    # ((420, 136), 15)
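It is also worth noting that much of the slowness in the original loop likely comes from row = row + [...] and col = col + ..., which copy the whole list on every line (quadratic overall); extending in place keeps it linear. A sketch of that fix alone, reusing the train, n_row and max_feature names from the question (the question's code skipped four elements per line; the format description suggests only the leading label needs skipping, which is what this sketch assumes):

import numpy as np
from scipy import sparse

row, col = [], []
with open(train) as f:
    for index, line in enumerate(f):
        record = line.rstrip().split(' ')
        # extend() appends in place instead of rebuilding the list
        row.extend([index] * (len(record) - 1))
        col.extend(int(r) for r in record[1:])   # skip the leading label
data = np.ones(len(row), dtype=np.int8)
mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))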

Python: read timesteps from csv to arrays: post-processing model data with numpy

I am still trying to get to grips with Python, but this problem exceeds my knowledge:
Topic: hydrodynamic postprocessing:
csv output of hydraulic software to array, split by timesteps
Here is the data and how far I got with working code:
Input file (see below):
First row: number of result nodes
Second row: header
Third row: timestep marker (# Output at t = ...)
Following: all results of this timestep (in this file: 13541 nodes, variable)
....the same again for the next timestep.
# Number of Nodes: 13541
#X Y Z depth wse
# Output at t = 0
5603.7598 4474.4902 37.470001 0 37.470001
5610.5 4461.6001 36.020001 0 36.020001
5617.25 4448.71 35.130001 0 35.130001
5623.9902 4435.8198 35.07 0 35.07
5630.7402 4422.9199 35.07 0 35.07
5761.5801 4402.79 35.369999 0 35.369999
COMMENT:....................and so on for all 13541 nodes of this timestep...........
# Output at t = 120.04446
5603.7598 4474.4902 37.470001 3.6977223 41.167724
5610.5 4461.6001 36.020001 4.1377293 40.15773
5617.25 4448.71 35.130001 3.9119012 39.041902
5623.9902 4435.8198 35.07 3.7923947 38.862394
5630.7402 4422.9199 35.07 3.998436 39.068436
5761.5801 4402.79 35.369999 3.9750571 39.345056
COMMENT:....................and so on for all 13541 nodes of this timestep...........
# Output at t = 240.06036
5603.7598 4474.4902 37.470001 11.131587 48.601588
5610.5 4461.6001 36.020001 12.564266 48.584266
5617.25 4448.71 35.130001 13.498463 48.628464
5623.9902 4435.8198 35.07 13.443041 48.513041
5630.7402 4422.9199 35.07 11.625824 46.695824
5761.5801 4402.79 35.369999 19.49551 54.865508
PROBLEM:
I need a loop which reads n timesteps into arrays.
The result should be one array per timestep, in this case 27 timesteps with 13541 elements each:
timestep_1 = [all elements of this timestep: shape = (13541, 5)]
timestep_2 = []
timestep_3 = []
........
timestep_n = []
My code so far:
import csv
import numpy as np

# read file into one big array
array = np.array([row for row in csv.reader(open("ascii-full.csv", "rb"), delimiter='\t')])
firstRow = array[0]
secondRow = array[1]
# find out how many nodes there are
strfirstRow = ' '.join(map(str, firstRow))
first = strfirstRow.split()
print first[4]
nodes = int(first[4])
# count timesteps
timesteps = int((len(array) - 3) / nodes) + 1
# split array into timesteps:
# X Y Z h(t1) h(t2) h(tn)
ts1 = array[3:nodes + 3]    # 13541 rows
#print ts1
ts2 = array[nodes + 4:nodes * 2 + 4]
#print ts2
.......
read ts3 to the last timestep into arrays with a loop....
Maybe someone can help me, thanks!!!
You can use np.genfromtxt() to get a 3-D array like:
import numpy as np
# skip header lines starting with '#' and the COMMENT lines starting with 'C'
gen = (a for a in open('test.txt') if not a[0] in ['#', 'C'])
a = np.genfromtxt(gen).reshape(-1, 6, 5)
where a[i] will represent the output at timestep i.
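(The 6 in the reshape is the number of nodes per timestep in the sample above; for the full file it would be 13541.) For instance, with the column order X, Y, Z, depth, wse from the header:

# depth column of the first timestep
print(a[0][:, 3])
# water-surface elevation of the last timestep
print(a[-1][:, 4])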
My take on your problem is: instead of reading the whole file into an array and processing the array, read it line by line, creating the arrays as the data is read.
I read the number of rows and columns per timestep as described in the file, then create a new array for each timestep read (adding it to a list), then populate it with the read data.
import numpy as np

timesteps = []
timestep_results = []
f = open("ascii-full.csv", "rb")
# First line is the number of rows (not counting the initial #)
rows = int(f.readline().strip()[1:].split()[-1])
counter = 0
# Second line is the number of columns
columns = len(f.readline().strip().split())
# Next lines
for line in f:
    if line.startswith("#"):
        # it's a header: add the time to the timestep list, begin a new array
        timesteps.append(float(line.strip().split("=")[1]))
        timestep_results.append(np.zeros((rows, columns)))
        counter = 0
    else:
        # it's data: add it to the current array in the appropriate row
        timestep_results[-1][counter] = map(float, line.strip().split())
        counter += 1
f.close()
Hope it helps!
