How to rewrite the node IDs stored in a .mtx file - python

I have a .mtx file that looks like below:
0 435 1
0 544 1
1 344 1
2 410 1
2 471 1
This matrix has a shape of (1000, 1000).
As you can see, the node IDs start at 0. I want them to start at 1 instead of 0.
In other words, I need to add 1 to all the numbers in the first and second columns, which represent the node IDs.
So I converted the .mtx file to a .txt file and tried to add 1 to the first and second columns.
I simply added 1 to each row like below:
import numpy as np

data_path = "my_data_path"
data = np.loadtxt(data_path, delimiter=' ', dtype='int')
for i in data:
    print(data[i]+1)
and the result was:
[ 1 436 2]
[ 1 545 2]
[ 2 345 2]
[ 3 411 2]
[ 3 472 2]
Now I need to subtract 1 from the third column, but I have no idea how to implement that.
Can someone help me to do that?
Or if there's an easier way to accomplish my goal, please tell me. Thank you in advance.

Why wouldn't you increment only the first two columns?
data[:, :2] += 1
You may want to have a look at indexing in NumPy.
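For completeness, a minimal end-to-end sketch (assuming a plain whitespace-separated edge list like the one shown; the file names here are placeholders):

import numpy as np

# load the edge list: two node-id columns and a weight column
data = np.loadtxt("my_data.txt", delimiter=' ', dtype='int')

# shift only the two node-id columns from 0-based to 1-based
data[:, :2] += 1

# write the result back out as integers
np.savetxt("my_data_1based.txt", data, fmt='%d', delimiter=' ')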
Additionally, I don't think the loop in your code ever worked:
for i in data:
    print(data[i]+1)
You are indexing with values from the array, which is generally wrong and is certainly wrong in this case:
IndexError: index 435 is out of bounds for axis 0 with size 5
You could correct it to print the whole matrix:
print(data + 1)
Giving:
[[ 1 436 2]
[ 1 545 2]
[ 2 345 2]
[ 3 411 2]
[ 3 472 2]]

Related

replace repeated values with counting up values in Numpy (vectorized)

I have an array of repeated values that are used to match datapoints to some ID.
How can I replace the IDs with counting up index values in a vectorized manner?
Consider the following minimal example:
import numpy as np
n_samples = 10
ids = np.random.randint(0,500, n_samples)
lengths = np.random.randint(1,5, n_samples)
x = np.repeat(ids, lengths)
print(x)
Output:
[129 129 129 129 173 173 173 207 207 5 430 147 143 256 256 256 256 230 230 68]
Desired solution:
indices = np.arange(n_samples)
y = np.repeat(indices, lengths)
print(y)
Output:
[0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]
However, in the real code, I do not have access to variables like ids and lengths, but only x.
It does not matter what the values in x are, I just want an array with counting up integers which are repeated the same amount as in x.
I can come up with solutions using for-loops or np.unique, but both are too slow for my use case.
Does anyone have an idea for a fast algorithm that takes an array like x and returns an array like y?
You can do:
y = np.r_[False, x[1:] != x[:-1]].cumsum()
Or with one less temporary array:
y = np.empty(len(x), int)
y[0] = 0
np.cumsum(x[1:] != x[:-1], out=y[1:])
print(y)
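To see why this works: x[1:] != x[:-1] is True exactly where a value differs from its predecessor, i.e. where a new run starts, and the cumulative sum counts those run starts. A quick check against the example output from the question (a sketch, assuming no two adjacent runs share the same ID):

import numpy as np

x = np.array([129, 129, 129, 129, 173, 173, 173, 207, 207, 5,
              430, 147, 143, 256, 256, 256, 256, 230, 230, 68])

# prepend False so the first run gets index 0, then count run starts
y = np.r_[False, x[1:] != x[:-1]].cumsum()
print(y)  # [0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]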

Row organized raster file to Column organized raster file

I have a (2x3) raster file with the following values:
-5 -6
-3 -4
-1 -2
Normally the .xyz GIS file format would be column-organized, represented by the following numpy array (coordinates are the lower-left corner):
col = numpy.array([[0,0,-1],[1,0,-2],[0,1,-3],[1,1,-4],[0,2,-5],[1,2,-6]])
Unfortunately I have a row-organized structure (this data comes from https://www.opengeodata.nrw.de/). It can be represented by the following numpy array:
row = numpy.array([[0,0,-1],[0,1,-3],[0,2,-5],[1,0,-2],[1,1,-4],[1,2,-6]])
print (row)
[[ 0 0 -1]
[ 0 1 -3]
[ 0 2 -5]
[ 1 0 -2]
[ 1 1 -4]
[ 1 2 -6]]
I need to rearrange this row array into a col array. I am currently using this code:
rr = row.reshape(2,3,3)
stack = numpy.column_stack(rr[:,:,:])
new_col =(stack.reshape(-1,3))
print (new_col)
[[ 0 0 -1]
[ 1 0 -2]
[ 0 1 -3]
[ 1 1 -4]
[ 0 2 -5]
[ 1 2 -6]]
This works, but my question is: is this the best way to tackle this array transformation? I have little experience manipulating numpy arrays.
Thanks
Nicolas
You can use the transpose method to rearrange the axes.
import numpy
col = numpy.array([[0,0,-1],[1,0,-2],[0,1,-3],[1,1,-4],[0,2,-5],[1,2,-6]])
row = numpy.array([[0,0,-1],[0,1,-3],[0,2,-5],[1,0,-2],[1,1,-4],[1,2,-6]])
# New solution
new_col = row.reshape(2,3,3).transpose(1,0,2).reshape(-1,3)
print(numpy.array_equal(col, new_col))
It runs faster than using column_stack or hstack.
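If you want to verify the speed claim on your own data, a quick timeit sketch (numbers will vary with machine and array size):

import numpy
import timeit

row = numpy.array([[0,0,-1],[0,1,-3],[0,2,-5],[1,0,-2],[1,1,-4],[1,2,-6]])
rr = row.reshape(2, 3, 3)

print(timeit.timeit(lambda: rr.transpose(1, 0, 2).reshape(-1, 3), number=100000))
print(timeit.timeit(lambda: numpy.column_stack(rr).reshape(-1, 3), number=100000))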
I think what you're doing is fine, but for readability I would use
stack = numpy.hstack(rr)
instead of
stack = numpy.column_stack(rr[:,:,:])
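For reference, the full readability-oriented version (a sketch reusing the row array from the question):

import numpy

row = numpy.array([[0,0,-1],[0,1,-3],[0,2,-5],[1,0,-2],[1,1,-4],[1,2,-6]])

# split into the two columns of the raster, then interleave them row by row
rr = row.reshape(2, 3, 3)
stack = numpy.hstack(rr)        # shape (3, 6): both raster columns side by side
new_col = stack.reshape(-1, 3)  # back to one (x, y, z) triple per line
print(new_col)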

Creating dummy variables using pandas or statsmodels for the interaction of two columns

I have a data frame like this:
Index ID Industry years_spend asset
6646 892 4 4 144.977037
2347 315 10 8 137.749138
7342 985 1 5 104.310217
137 18 5 5 156.593396
2840 381 11 2 229.538828
6579 883 11 1 171.380125
1776 235 4 7 217.734377
2691 361 1 2 148.865341
815 110 15 4 233.309491
2932 393 17 5 187.281724
I want to create dummy variables for Industry X years_spend, which creates len(df.Industry.value_counts()) * len(df.years_spend.value_counts()) variables; for example, d_11_4 = 1 for all rows that have Industry==11 and years_spend==4, and d_11_4 = 0 otherwise. Then I can use these variables for some regression work.
I know I can make the groups I want using df.groupby(['Industry','years_spend']), and I know I can create such a variable for one column using patsy syntax in statsmodels:
import statsmodels.formula.api as smf
mod = smf.ols("income ~ C(Industry)", data=df).fit()
but if I want to do it with 2 columns, I get the error:
IndexError: tuple index out of range
How can I do that with pandas or using some function inside statsmodels?
Using patsy syntax it's just:
import statsmodels.formula.api as smf
mod = smf.ols("income ~ C(Industry):C(years_spend)", data=df).fit()
The : character means "interaction"; you can also generalize this to interactions of more than two items (C(a):C(b):C(c)), interactions between numerical and categorical values, etc. You might find the patsy docs useful.
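A minimal runnable sketch (the income values here are made up, since the question's frame doesn't show the response variable):

import pandas as pd
import statsmodels.formula.api as smf

# toy frame; income is a hypothetical response variable
df = pd.DataFrame({
    'Industry':    [1, 1, 2, 2, 1, 1, 2, 2],
    'years_spend': [1, 2, 1, 2, 1, 2, 1, 2],
    'income':      [1.2, 3.4, 2.1, 2.8, 0.9, 1.5, 2.2, 1.1],
})

# one dummy per observed (Industry, years_spend) combination
mod = smf.ols("income ~ C(Industry):C(years_spend)", data=df).fit()
print(mod.params)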
You could do something like this, where you first create a calculated field that combines Industry and years_spend:
df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})
df['industry_years'] = df['Industry'].astype('str') + '_' + df['years_spend'].astype('str') # this is the calculated field
Here's what the df looks like:
Industry years_spend industry_years
0 4 4 4_4
1 3 5 3_5
2 11 8 11_8
3 4 4 4_4
4 1 4 1_4
5 1 1 1_1
Now you can apply get_dummies:
df = pd.get_dummies(df, columns=['industry_years'])
That'll get you what you want :)

np.loadtxt for a file with many matrices

I have a file that looks something like this:
some text
the grids are
3 x 3
more text
matrix marker 1 1
3 2 4
7 4 2
9 1 1
new matrix 2 4
9 4 1
1 3 4
4 3 1
new matrix 3 3
7 2 1
1 3 4
2 3 2
... the file continues, with several 3x3 matrices appearing in the same fashion. Each matrix is prefaced by text with a unique ID, though the IDs aren't particularly important to me. I want to create an array of these matrices. Can I use loadtxt to do that?
Here is my best attempt. The 6 in this code could be replaced with an iterating variable starting at 6 and incrementing by the number of rows in the matrix. I thought that skiprows would accept a list, but apparently it only accepts integers.
np.loadtxt(fl, skiprows = [x for x in range(nlines) if x not in (np.array([1,2,3])+ 6)])
TypeError Traceback (most recent call last)
<ipython-input-23-7d82fb7ef14a> in <module>()
----> 1 np.loadtxt(fl, skiprows = [x for x in range(nlines) if x not in (np.array([1,2,3])+ 6)])
/usr/local/lib/python2.7/site-packages/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
932
933 # Skip the first `skiprows` lines
--> 934 for i in range(skiprows):
935 next(fh)
936
Maybe I misunderstand, but if you can match the lines preceding the 3x3 matrices, then you can create a generator to feed to loadtxt:
import numpy as np

def get_matrices(fs):
    # iterating the file directly ends the generator cleanly at EOF
    for line in fs:
        if 'matrix' in line:  # or whatever matches the line before a matrix
            yield next(fs)
            yield next(fs)
            yield next(fs)

with open('matrices.dat') as fs:
    g = get_matrices(fs)
    M = np.loadtxt(g)

M = M.reshape((M.size//9, 3, 3))
print(M)
If you feed it:
some text
the grids are
3 x 3
more text
matrix marker 1 1
3 2 4
7 4 2
9 1 1
new matrix 2 4
9 4 1
1 3 4
4 3 1
new matrix 3 3
7 2 1
1 3 4
2 3 2
new matrix 7 6
1 0 1
2 0 3
0 1 2
You get an array of the matrices:
[[[ 3. 2. 4.]
[ 7. 4. 2.]
[ 9. 1. 1.]]
[[ 9. 4. 1.]
[ 1. 3. 4.]
[ 4. 3. 1.]]
[[ 7. 2. 1.]
[ 1. 3. 4.]
[ 2. 3. 2.]]
[[ 1. 0. 1.]
[ 2. 0. 3.]
[ 0. 1. 2.]]]
Alternatively, if you just want to yield all lines that look like they might be rows of a 3x3 matrix of integers, match them against a regular expression:
import re

def get_matrices(fs):
    for line in fs:
        if re.match(r'\d+\s+\d+\s+\d+', line):
            yield line
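This is used the same way as the first version (a sketch; the pattern assumes the matrix entries are non-negative integers):

import numpy as np

with open('matrices.dat') as fs:
    M = np.loadtxt(get_matrices(fs))
M = M.reshape((M.size//9, 3, 3))
print(M)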
You need to change your processing workflow to use steps: first, extract substrings corresponding to your desired matrices, then call numpy.loadtxt. To do this, a great way would be:
Find matrix start and end with re.
Load matrix within that range
Reset your range and continue.
Your matrix markers seem to vary, so you could use a regular expression like this:
start = re.compile(r"\w+\s+matrix\s+(\d+)\s+(\d+)\n")
end = re.compile(r"\n\n")
Then, you can find start/end pairs and then load the text for each matrix:
import io
import numpy as np

# read our data
data = open("/path/to/file.txt").read()

def load_matrix(data, *args):
    # find start and end bounds
    s = start.search(data)
    if not s:
        # no matrix left over, signal the caller to stop
        return None, data
    e = end.search(data, s.end())
    e_index = e.end() if e else len(data)
    # load the text between the marker and the end of the matrix
    buf = io.StringIO(data[s.end():e_index])
    matrix = np.loadtxt(buf, *args)  # add other args here
    # return the matrix together with the unconsumed remainder of the text,
    # since reassigning data inside the function would not affect the caller
    return matrix, data[e_index:]
Idea
In this case, my regular expression marker for the start of the matrix has capturing groups (\d+) for the matrix dimensions, so you can get the MxN representation of the matrix if you wish. It matches lines containing the word "matrix", with arbitrary leading text and two numbers separated by whitespace at the end.
My match for the end is "\n\n", i.e. two consecutive newlines forming a blank line (if you have Windows line endings, you may need to consider "\r\n" too).
Automating This
Now that we have a way to find a single case, all you need to do is iterate this and populate a list of matrices while you still get matches.
matrices = []
# read our data
data = open("/path/to/file.txt").read()
while True:
    matrix, data = load_matrix(data)  # pass other loadtxt arguments here
    if matrix is None:
        break
    matrices.append(matrix)
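If all the matrices have the same shape, np.stack(matrices) will then combine the list into a single 3-D array, matching the reshape used in the generator-based answer above.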

How to one-hot encode categorical features with pandas or tensorflow?

The data is like this:
features1 features2 labels
1 1 563 1
2 1 254 1
3 missing 145 1
4 0 126 1
5 0 145 0
6 1 124 0
7 0 456 0
I am going to apply this data to a TensorFlow training process, so I want to one-hot encode feature1's values.
The matrix of the data above is:
[[1,563,1],
[2,254,1],
[missing,145,1],
[0,126,1],
[0,145,0],
[1,124,0],
[0,456,0]]
So I think it can be one-hot encoded like this:
> [1,0,0] represents 1
> [0,1,0] represents 0
> [0,0,1] represents 'missing'
and the output I want is like:
[[1,0,0,563,1],
[1,0,0,254,1],
[0,0,1,145,1],
[0,1,0,126,1],
[0,1,0,145,0],
[1,0,0,124,0],
[0,1,0,456,0]]
I've tried pd.get_dummies, but I couldn't make it work.
I'm not sure how you used pd.get_dummies, but note that this function generates a new data frame or array for you, so if you want to apply one-hot encoding to the first column of your array and keep the other columns as-is, you need to reassign your array like so:
newArrayWithOneHotEncoding = pd.get_dummies(arrayThatYouWantToTransform, columns=['firstColumnHeader'])
Update:
I forgot to mention that you need to give the missing entries a distinct placeholder value first, for example -1, and then apply the one-hot encoding; this way you will get three new columns.
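Putting both points together, a sketch of what that could look like on the question's data (the column names and the -1 placeholder are just illustrative choices):

import pandas as pd

df = pd.DataFrame({
    'features1': [1, 1, 'missing', 0, 0, 1, 0],
    'features2': [563, 254, 145, 126, 145, 124, 456],
    'labels':    [1, 1, 1, 1, 0, 0, 0],
})

# give the missing entries a distinct placeholder value first
df['features1'] = df['features1'].replace('missing', -1)

# one-hot encode features1, keeping the other columns as-is
df = pd.get_dummies(df, columns=['features1'])
print(df)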
