I am trying to create a 100x100 matrix in which each row holds the next ordinal number, as shown below.
I created a vector from 1 to 100 and then, using a for loop, copied this vector 100 times. I received an array with the correct data, so I tried to sort the arrays using np.argsort, but it didn't work the way I wanted (I don't even know why there are zeros after sorting).
Is there any way to get this matrix using other functions? I tried many approaches, but the final layout was never what I expected.
import numpy as np

max_x = 101
z = np.arange(1, 101)
print(z)
x = []
for i in range(1, max_x):
    x.append(z.copy())
print(x)
y = np.argsort(x)
y
argsort returns the indices that would sort the array, which is why you get zeros. You don't need that; what you want is to transpose the array.
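A quick sketch of why the zeros appear: each row is already sorted, so argsort returns the positions 0, 1, 2, ... in every row.

import numpy as np

x = np.tile(np.arange(1, 5), (3, 1))  # every row is [1 2 3 4], already sorted
print(np.argsort(x))
# [[0 1 2 3]
#  [0 1 2 3]
#  [0 1 2 3]]   <- positions, not values, hence the zeros in column 0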
Make x a NumPy array and use .T:
y = np.array(x).T
Output
[[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
[ 3 3 3 ... 3 3 3]
...
[ 98 98 98 ... 98 98 98]
[ 99 99 99 ... 99 99 99]
[100 100 100 ... 100 100 100]]
You also don't need a loop to copy the array; use np.tile instead:
z = np.arange(1, 101)
x = np.tile(z, (100, 1))
y = x.T
# or one liner
y = np.tile(np.arange(1, 101), (100, 1)).T
A list-comprehension alternative (note that np.ones yields floats by default, so pass dtype=int if you need integers):

import numpy as np
np.asarray([(k + 1) * np.ones(100, dtype=int) for k in range(100)])
Or simply
np.tile(np.arange(1,101),(100,1)).T
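If memory matters, a broadcasting alternative is also possible (a sketch; np.broadcast_to returns a read-only view, so copy it if you need to write):

import numpy as np

# Each row i of y is the value i + 1 repeated 100 times; no data is copied.
y = np.broadcast_to(np.arange(1, 101)[:, None], (100, 100))
# call y.copy() if you need a writable array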
I created two random variables (x and y) with certain properties. Now I want to create a dataframe from scratch out of these two variables. Unfortunately, what I typed seems to be wrong. How can I do this correctly?
# creating variable x with Bernoulli distribution
from scipy.stats import bernoulli, norm
import pandas as pd
x = bernoulli.rvs(size=100, p=0.6)
# form a column vector (n, 1)
x = x.reshape(-1, 1)
print(x)
# creating variable y with normal distribution
y = norm.rvs(size=100, loc=0, scale=1)
# form a column vector (n, 1)
y = y.reshape(-1, 1)
print(y)
# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame()
df.assign(y=y, x=x)
df
There are a lot of ways to go about this.
According to the documentation, pd.DataFrame accepts an ndarray (structured or homogeneous), Iterable, dict, or DataFrame. Your issue is that x and y are 2d numpy arrays:
>>> x.shape
(100, 1)
where it expects either one 1d array per column or a single 2d array.
One way would be to stack the arrays into one before calling the DataFrame constructor:
>>> pd.DataFrame(np.hstack([x,y]))
0 1
0 0.0 0.764109
1 1.0 0.204747
2 1.0 -0.706516
3 1.0 -1.359307
4 1.0 0.789217
.. ... ...
95 1.0 0.227911
96 0.0 -0.238646
97 0.0 -1.468681
98 0.0 1.202132
99 0.0 0.348248
The alternatives mostly revolve around calling ndarray.flatten(), e.g. to construct a dict:
>>> pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})
x y
0 0 0.764109
1 1 0.204747
2 1 -0.706516
3 1 -1.359307
4 1 0.789217
.. .. ...
95 1 0.227911
96 0 -0.238646
97 0 -1.468681
98 0 1.202132
99 0 0.348248
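A small side note (my addition, not part of the original answer): ravel() usually returns a view rather than the copy that flatten() makes, which can matter for large arrays:

pd.DataFrame({'x': x.ravel(), 'y': y.ravel()})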
I have an array of repeated values that are used to match datapoints to some ID.
How can I replace the IDs with counting up index values in a vectorized manner?
Consider the following minimal example:
import numpy as np
n_samples = 10
ids = np.random.randint(0,500, n_samples)
lengths = np.random.randint(1,5, n_samples)
x = np.repeat(ids, lengths)
print(x)
Output:
[129 129 129 129 173 173 173 207 207 5 430 147 143 256 256 256 256 230 230 68]
Desired solution:
indices = np.arange(n_samples)
y = np.repeat(indices, lengths)
print(y)
Output:
[0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]
However, in the real code I do not have access to variables like ids and lengths, but only to x.
It does not matter what the values in x are; I just want an array of counting-up integers whose runs have the same lengths as the runs in x.
I can come up with solutions using for loops or np.unique, but both are too slow for my use case.
Does anyone have an idea for a fast algorithm that takes an array like x and returns an array like y?
You can compare consecutive elements and take the cumulative sum of the run boundaries:
y = np.r_[False, x[1:] != x[:-1]].cumsum()
Or with one less temporary array:
y = np.empty(len(x), int)
y[0] = 0
np.cumsum(x[1:] != x[:-1], out=y[1:])
print(y)
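To see why this works, here is the intermediate boolean array on a small example (my own toy data):

import numpy as np

x = np.array([5, 5, 5, 9, 9, 2, 7, 7])
starts = np.r_[False, x[1:] != x[:-1]]  # True where a new run of values begins
print(starts.astype(int))  # [0 0 0 1 0 1 1 0]
print(starts.cumsum())     # [0 0 0 1 1 2 3 3]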
I would like to know if there is a similar way of doing the following (written in Mathematica; the original snippet was shown as an image) in Python:
[Mathematica code image: assigning the elements of a matrix b into the rows and columns of a selected by an index list p]
I have tried it in Python and it does not work. I have also tried it with numpy.put() and with two simple for loops. These two ways work properly, but I find them very time-consuming with larger matrices (3000×3000 elements, for example).
The problem described in Python:
import numpy as np
a = np.arange(0, 25, 1).reshape(5, 5)
b = np.arange(100, 500, 100).reshape(2, 2)
p = np.array([0, 3])
a[p][:, p] = b
which outputs an unchanged matrix a.
Perhaps you are looking for this:
a[p[...,None], p] = b
Array a after the above assignment looks like this:
[[100 1 2 200 4]
[ 5 6 7 8 9]
[ 10 11 12 13 14]
[300 16 17 400 19]
[ 20 21 22 23 24]]
As documented in Integer Array Indexing, the two integer index arrays are broadcast together and iterated together, which effectively indexes the locations a[0,0], a[0,3], a[3,0], and a[3,3]. The assignment then performs an element-wise assignment at these locations of a, using the respective element values from the RHS.
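Equivalently, np.ix_ constructs the same open mesh for you, which some find more readable (same result, using the arrays from the question):

a[np.ix_(p, p)] = b  # indexes the cross product of rows p and columns p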
I have a pandas DataFrame with 100,000 rows and want to split it into 100 sections with 1000 rows in each of them.
How do I draw a random sample of a certain size (e.g. 50 rows) from just one of the 100 sections? The df is already ordered such that the first 1000 rows are from the first section, the next 1000 rows from the second, and so on.
You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
A B
0 1 2
2 5 6
In [13]: df.sample(2)
Out[13]:
A B
3 7 8
0 1 2
*On one of the section DataFrames.
Note: If you have a larger sample size than the size of the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
A B
0 1 2
1 3 4
2 5 6
3 7 8
1 3 4
One solution is to use the choice function from numpy.
Say you want 50 entries out of 1000, you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This is of course not considering your block structure. If you want a 50 item sample from block i for example, you can do:
import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
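Wrapped into a small helper for reuse (my own naming; it assumes fixed-size contiguous blocks):

import numpy as np

def sample_block(df, i, block_size=1000, n=50):
    # pick n distinct positions within one block, then offset to block i
    chosen_idx = np.random.choice(block_size, replace=False, size=n)
    return df.iloc[block_size * i + chosen_idx]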
You could add a "section" column to your data then perform a groupby and sample:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
# x section
# 0 0 0
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 0
# ... ... ...
# 99995 99995 99
# 99996 99996 99
# 99997 99997 99
# 99998 99998 99
# 99999 99999 99
#
# [100000 rows x 2 columns]
sample = df.groupby("section").sample(50)
# >>> sample
# x section
# 907 907 0
# 494 494 0
# 775 775 0
# 20 20 0
# 230 230 0
# ... ... ...
# 99740 99740 99
# 99272 99272 99
# 99863 99863 99
# 99198 99198 99
# 99555 99555 99
#
# [5000 rows x 2 columns]
with an additional .query("section == 42") or similar if you are interested in only a particular section.
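For example (one possible form; filtering to the section before sampling avoids sampling the other 99 groups):

section_sample = df.query("section == 42").sample(50)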
Note this requires pandas >= 1.1.0; see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
For older versions, see the answer by #msh5678
Thank you, Jeff, but I received an error:
AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method
So instead of sample = df.groupby("section").sample(50), I suggest using the command below:
df.groupby('section').apply(lambda grp: grp.sample(50))
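One caveat with the apply version: it prepends the group key to the index. Passing group_keys=False keeps the original index (an optional tweak):

df.groupby('section', group_keys=False).apply(lambda grp: grp.sample(50))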
This is a nice place for recursion.
import numpy as np

def main2():
    rows = 8  # say you have 8 rows; with real data use the actual row count
    rands = []
    for i in range(rows):
        gen = fun(rands)
        rands.append(gen)
    print(rands)  # now iterate through the random values

def fun(rands):
    # draw until we hit a value that has not been used yet
    gen = np.random.randint(0, 8)
    if gen in rands:
        return fun(rands)
    else:
        return gen

if __name__ == "__main__":
    main2()
output: [6, 0, 7, 1, 3, 5, 4, 2]
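For what it's worth, NumPy can produce the same kind of non-repeating random ordering in a single call, which avoids the recursion entirely:

import numpy as np

print(np.random.permutation(8))  # e.g. [6 0 7 1 3 5 4 2]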
Given two tensors, A (m x n x q) and B (m x n x 1), how do you create a function which loops through the rows of A, treating each element of B (n x 1) as a scalar and applying it to the vectors (q x 1) of the sub-matrices of A (n x q)?
e.g., A has shape (6000, 1000, 300) and B has shape (6000, 1000, 1). Loop through the 6000 "slices" of A and, for each vector of the 1000 sub-matrices of A (, 1000, 300), apply scalar multiplication by the corresponding element from the sub-matrices of B (, 1000, 1).
My wording may be absolutely terrible. I will adjust the phrasing accordingly as issues arise.
Sidenote: I am working with Python, so Theano is probably the best tool to do this in?
Use tf.mul (renamed tf.multiply in later TensorFlow versions), which broadcasts b across the last dimension of a:
import tensorflow as tf

a = tf.constant([[[1, 2, 1, 2], [3, 4, 1, 2], [5, 6, 10, 12]],
                 [[7, 8, 1, 2], [9, 10, 1, 1], [11, 12, 0, 3]]])
b = tf.constant([[[7], [8], [9]], [[1], [2], [3]]])
res = tf.mul(a, b)   # tf.multiply in TensorFlow >= 1.0
sess = tf.Session()  # TF1-style session; TF2 runs eagerly instead
print(sess.run(res))
which prints:
[[[ 7 14 7 14]
[ 24 32 8 16]
[ 45 54 90 108]]
[[ 7 8 1 2]
[ 18 20 2 2]
[ 33 36 0 9]]]
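The same elementwise broadcast works in plain NumPy, including at the shapes from the question (a sketch with random data):

import numpy as np

A = np.random.rand(6000, 1000, 300)
B = np.random.rand(6000, 1000, 1)
C = A * B  # B's trailing axis of size 1 broadcasts across A's last dimension
print(C.shape)  # (6000, 1000, 300)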