Data:
I have an array of shape (2200, 1000, 12). The first axis (2200) is the index; each index holds 1000 records.
I have another array of shape (2200,) for the class. Each value is the label for the 1000 records at the corresponding index.
I want:
How can I flatten the first array from 3 dimensions to 2 dimensions, putting everything together?
And how can I attach each class value to its 1000 records?
Desired result:
A DataFrame of shape (2200000, 13).
The 2200000 rows are the 1000 records from each of the 2200 indices combined, and the 13th column joins in the class, with each class value repeated a thousand times so the row counts match.
Let us first import the necessary modules and generate mock data:
import numpy as np
import pandas as pd
M = 2200  # number of indices
N = 1000  # records per index
P = 12    # features per record
data = np.random.rand(M, N, P)
classes = np.arange(M)  # one label per index
How can I flatten from 3 dimensions to 2 dimensions?
data.reshape(M*N, P)  # -> shape (2200000, 12)
How can I attach each class value to its 1000 records?
np.repeat(classes, N)  # each label repeated N times -> shape (2200000,)
Desired result: DataFrame of shape (2200000, 13)
arr = np.hstack([data.reshape(M*N, P), np.repeat(classes, N)[:, None]])
df = pd.DataFrame(arr)
print(df)
The code above outputs:
0 0.371495 0.598211 0.038224 ... 0.777405 0.193472 0.0
1 0.356371 0.636690 0.841467 ... 0.403570 0.330145 0.0
2 0.793879 0.008617 0.701122 ... 0.021139 0.514559 0.0
3 0.318618 0.798823 0.844345 ... 0.931606 0.467469 0.0
4 0.307109 0.076505 0.865164 ... 0.809495 0.914563 0.0
... ... ... ... ... ... ... ...
2199995 0.215133 0.239560 0.477092 ... 0.050997 0.727986 2199.0
2199996 0.249206 0.881694 0.985973 ... 0.897410 0.564516 2199.0
2199997 0.378455 0.697581 0.016306 ... 0.985966 0.638413 2199.0
2199998 0.233829 0.158274 0.478611 ... 0.825343 0.215944 2199.0
2199999 0.351320 0.980258 0.677298 ... 0.791046 0.736788 2199.0
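If you want named columns instead of the default 0-12, you can relabel the frame afterwards (the names below are just placeholders):
df.columns = [f'feat_{i}' for i in range(P)] + ['class']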
Does this help you:
arr2d = data.reshape(2200000, 12)
Note that reshaping straight to (2200000, 13) is impossible: the array holds 2200*1000*12 elements, so the class column must be appended separately, as shown above.
I created two random variables (x and y) with certain properties. Now, I want to create a dataframe from scratch out of these two variables. Unfortunately, what I type seems to be wrong. How can I do this correctly?
# creating variable x with Bernoulli distribution
from scipy.stats import bernoulli, norm
import pandas as pd
x = bernoulli.rvs(size=100, p=0.6)
# form a column vector (n, 1)
x = x.reshape(-1, 1)
print(x)
# creating variable y with normal distribution
y = norm.rvs(size=100, loc=0, scale=1)
# form a column vector (n, 1)
y = y.reshape(-1, 1)
print(y)
# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame()
df.assign(y = y, x = x)
df
There are a lot of ways to go about this.
According to the documentation, pd.DataFrame accepts an ndarray (structured or homogeneous), Iterable, dict, or DataFrame. Your issue is that x and y are 2d numpy arrays:
>>> x.shape
(100, 1)
where it expects either one 1d array per column or a single 2d array.
One way would be to stack the array into one before calling the DataFrame constructor
>>> pd.DataFrame(np.hstack([x,y]))
0 1
0 0.0 0.764109
1 1.0 0.204747
2 1.0 -0.706516
3 1.0 -1.359307
4 1.0 0.789217
.. ... ...
95 1.0 0.227911
96 0.0 -0.238646
97 0.0 -1.468681
98 0.0 1.202132
99 0.0 0.348248
The alternatives mostly revolve around calling numpy.ndarray.flatten(), e.g. to construct a dict:
>>> pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})
x y
0 0 0.764109
1 1 0.204747
2 1 -0.706516
3 1 -1.359307
4 1 0.789217
.. .. ...
95 1 0.227911
96 0 -0.238646
97 0 -1.468681
98 0 1.202132
99 0 0.348248
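As a side note, the df.assign call in the question also fails to stick because assign returns a new DataFrame instead of modifying df in place. A minimal sketch of that fix, assuming x and y are flattened first:
df = pd.DataFrame()
df = df.assign(x=x.flatten(), y=y.flatten())  # assign returns the new frame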
I'm trying to compare two string columns of two different pandas DataFrames (A and B), and if part of the string matches, I would like to assign the value of one column in DataFrame A to DataFrame B.
This is my code:
import numpy as np
import pandas as pd
A = ['DF-PI-05', 'DF-PI-09', 'DF-PI-10', 'DF-PI-15', 'DF-PI-16',
'DF-PI-19', 'DF-PI-89', 'DF-PI-92', 'DF-PI-93', 'DF-PI-94',
'DF-PI-95', 'DF-PI-96', 'DF-PI-25', 'DF-PI-29', 'DF-PI-30',
'DF-PI-34', 'DF-PI-84']
B = ['PI-05', 'PI-10', 'PI-89', 'PI-90', 'PI-93', 'PI-94', 'PI-95',
'PI-96', 'PI-09', 'PI-15', 'PI-16', 'PI-19', 'PI-91A', 'PI-91b',
'PI-92', 'PI-25-CU', 'PI-29', 'PI-30', 'PI-34', 'PI-84-CU-S1',
'PI-84-CU-S2']
import random
sample_size = len(A)
Group = [random.randint(0,1) for _ in range(sample_size)]
A = pd.DataFrame(list(zip(A,Group)),columns=['ID','Group'])
B = pd.DataFrame(B,columns=['Name'])
clus_tx = np.array([])
for date, row in B.iterrows():
    for date2, row2 in A.iterrows():
        if row2['ID'] in row['Name']:
            clus = row['Group']
        else:
            clus = 999
        clus_tx = np.append(clus_tx, clus)
B['Group'] = clus_tx
What I would like is an np.array clus_tx with the length of B, where if an element of B matches a string in A ('PI-xx'), I take the value of the 'Group' column from A and assign it to B; if there is no match, I assign 999 to the 'Group' column in B.
I think I'm doing the loop wrong, because the size of clus_tx is not what I expected... My real dataset is huge, so I can't do this manually.
First, the reason the size of clus_tx is not what you want is that you put clus_tx = np.append(clus_tx, clus) in the innermost loop, which has no break, so the length of clus_tx will always be len(A) x len(B).
Second, the logic of the if block is not what you want either.
I've changed the code a bit, hope it helps:
import numpy as np
import pandas as pd
A = ['DF-PI-05', 'DF-PI-09', 'DF-PI-10', 'DF-PI-15', 'DF-PI-16',
'DF-PI-19', 'DF-PI-89', 'DF-PI-92', 'DF-PI-93', 'DF-PI-94',
'DF-PI-95', 'DF-PI-96', 'DF-PI-25', 'DF-PI-29', 'DF-PI-30',
'DF-PI-34', 'DF-PI-84']
B = ['PI-05', 'PI-10', 'PI-89', 'PI-90', 'PI-93', 'PI-94', 'PI-95',
'PI-96', 'PI-09', 'PI-15', 'PI-16', 'PI-19', 'PI-91A', 'PI-91b',
'PI-92', 'PI-25-CU', 'PI-29', 'PI-30', 'PI-34', 'PI-84-CU-S1',
'PI-84-CU-S2']
import random
sample_size = len(A)
Group = [random.randint(0,1) for _ in range(sample_size)]
A = pd.DataFrame(list(zip(A,Group)),columns=['ID','Group'])
B = pd.DataFrame(B,columns=['Name'])
clus_tx = np.array([])
for date, row_B in B.iterrows():
    clus = 999
    for date2, row_A in A.iterrows():
        if row_B['Name'] in row_A['ID']:
            clus = row_A['Group']
            break
    clus_tx = np.append(clus_tx, clus)
B['Group'] = clus_tx
print(B)
The print output of B looks like:
Name Group
0 PI-05 0.0
1 PI-10 0.0
2 PI-89 1.0
3 PI-90 999.0
4 PI-93 0.0
5 PI-94 1.0
6 PI-95 1.0
7 PI-96 0.0
8 PI-09 1.0
9 PI-15 0.0
10 PI-16 1.0
11 PI-19 1.0
12 PI-91A 999.0
13 PI-91b 999.0
14 PI-92 1.0
15 PI-25-CU 999.0
16 PI-29 0.0
17 PI-30 1.0
18 PI-34 0.0
19 PI-84-CU-S1 999.0
20 PI-84-CU-S2 999.0
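For a huge dataset, iterrows will be slow. A vectorized sketch, under the assumption that every ID in A is simply 'DF-' prepended to a Name in B:
keys = A['ID'].str.replace('DF-', '', regex=False)  # 'DF-PI-05' -> 'PI-05'
lookup = pd.Series(A['Group'].values, index=keys)
B['Group'] = B['Name'].map(lookup).fillna(999)  # unmatched names get 999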
I have a pandas DataFrame with 100,000 rows and want to split it into 100 sections with 1,000 rows in each of them.
How do I draw a random sample of a certain size (e.g. 50 rows) from just one of the 100 sections? The df is already ordered such that the first 1000 rows are from the first section, the next 1000 rows from another, and so on.
You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
A B
0 1 2
2 5 6
In [13]: df.sample(2)
Out[13]:
A B
3 7 8
0 1 2
*On one of the section DataFrames.
Note: If you have a larger sample size than the size of the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
A B
0 1 2
1 3 4
2 5 6
3 7 8
1 3 4
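For example, to sample 50 rows from just the first section of the original 100,000-row frame (a sketch, assuming the first 1000 rows form that section):
section_one = df.iloc[:1000]   # rows 0-999 are the first section
sample = section_one.sample(50)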
One solution is to use the choice function from numpy.
Say you want 50 entries out of 1000, you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This is of course not considering your block structure. If you want a 50 item sample from block i for example, you can do:
import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
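As a usage example, drawing the 50-row sample from the eighth block (i = 7, zero-based):
i = 7
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[1000 * i + chosen_idx]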
You could add a "section" column to your data then perform a groupby and sample:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
# x section
# 0 0 0
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 0
# ... ... ...
# 99995 99995 99
# 99996 99996 99
# 99997 99997 99
# 99998 99998 99
# 99999 99999 99
#
# [100000 rows x 2 columns]
sample = df.groupby("section").sample(50)
# >>> sample
# x section
# 907 907 0
# 494 494 0
# 775 775 0
# 20 20 0
# 230 230 0
# ... ... ...
# 99740 99740 99
# 99272 99272 99
# 99863 99863 99
# 99198 99198 99
# 99555 99555 99
#
# [5000 rows x 2 columns]
with additional .query("section == 42") or whatever if you are interested in only a particular section.
Note this requires pandas 1.1.0, see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
For older versions, see the answer by #msh5678
Thank you, Jeff, but I received an error:
AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method
So instead of sample = df.groupby("section").sample(50) I suggest using the command below:
df.groupby('section').apply(lambda grp: grp.sample(50))
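Note that the apply version keeps the section label as an extra index level; a small sketch to flatten it back if you want a plain RangeIndex:
sample = df.groupby('section').apply(lambda grp: grp.sample(50))
sample = sample.reset_index(drop=True)  # drop the (section, row) MultiIndex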
This is a nice place for recursion.
import numpy as np

def main2():
    rows = 8  # say you have 8 rows; with real data use the frame's length
    rands = []
    for i in range(rows):
        gen = fun(rands)
        rands.append(gen)
    print(rands)  # rands now holds unique random indices

def fun(rands):
    gen = np.random.randint(0, 8)
    if gen in rands:  # value already drawn, recurse for a fresh one
        return fun(rands)
    else:
        return gen

if __name__ == "__main__":
    main2()
output: [6, 0, 7, 1, 3, 5, 4, 2]
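For what it's worth, NumPy can produce the same set of unique random indices directly:
import numpy as np
print(np.random.permutation(8).tolist())  # e.g. [6, 0, 7, 1, 3, 5, 4, 2]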
I have a column with 100 rows and I want to generate multiple columns (say 100) from this column. These new columns should be generated by multiplying the first column by a random value. Is there a way to do it using Python? I have tried it in Excel, but that is tedious, since for every column I have to multiply the column by a randomly generated number (RANDBETWEEN(a,b)).
Let's assume you have a column of numeric data:
import numpy as np
import pandas as pd
import random
# random.randint(a, b) will choose a random integer between a and b
# this will create a column that is 96 elements long
col = [random.randint(0, 500) for i in range(96)]
Now, let's create more columns by leveraging a numpy.array which supports scalar multiplication of vectors:
arr = np.array(col)
# our dataframe has one column in it
df = pd.DataFrame(arr, columns=['x'])
a, b = 100, 5000 # set what interval to select random numbers from
Now, you can loop through to add in new columns
num_cols = 100  # 100 new columns, matching the output below
for i in range(num_cols):  # or however many columns you want to add
    df[i] = df.x * random.randint(a, b)
df.head()
x 0 1 2 3 4 5 6 ... 92 93 94 95 96 97 98 99
0 68 257040 214268 107576 266152 229568 309468 319668 ... 74460 25024 85952 320620 331840 175712 87788 254864
1 286 1081080 901186 452452 1119404 965536 1301586 1344486 ... 313170 105248 361504 1348490 1395680 739024 369226 1071928
2 421 1591380 1326571 666022 1647794 1421296 1915971 1979121 ... 460995 154928 532144 1985015 2054480 1087864 543511 1577908
3 13 49140 40963 20566 50882 43888 59163 61113 ... 14235 4784 16432 61295 63440 33592 16783 48724
4 344 1300320 1083944 544208 1346416 1161344 1565544 1617144 ... 376680 126592 434816 1621960 1678720 888896 444104 1289312
[5 rows x 101 columns]
You can use NumPy broadcasting to multiply the column by random numbers; reshaping the column to shape (n, 1) lets each random scalar generate one new column:
a, b = 10, 20
df = pd.DataFrame({'col': np.random.randint(0, 500, 100)})
df['col'].values.reshape(-1, 1) * np.random.randint(a, b, 100)
To get the result in a DataFrame:
pd.DataFrame(df['col'].values.reshape(-1, 1) * np.random.randint(a, b, 100))
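If you want to keep the original column alongside the 100 generated ones, one sketch is:
out = pd.concat(
    [df, pd.DataFrame(df['col'].values.reshape(-1, 1) * np.random.randint(a, b, 100))],
    axis=1)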
I have read in some data with Pandas and I want to add a column after the last column. The problem is that after I did, the values start from zero, and I want them to start from one.
I have 12800 rows and want the added column to go from 1 to 100 and then start over from 1 to 100 again, repeating this pattern for all rows. So basically I want this cycle of 1 to 100 to repeat 128 times. Can anyone tell me how I can do this?
import numpy as np
import pandas as pd
df = pd.read_csv('...csv')
df1=pd.DataFrame(df.values.reshape(12800, -1))
df1['10'] = df1.index
The included picture is not correct: I want the last column, number 10, to start from one and follow the pattern I described above.
To repeat a pattern of 1..100 and assign to a column you can do:
df['1_to_100'] = np.tile(
    np.arange(1, 101), int(len(df) * 0.01) + 1)[:len(df)]
To add a pattern that is 100 per step you can do:
df['by_100'] = np.floor(df.index / 100)
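An integer-division variant avoids the float column that np.floor produces:
df['by_100'] = df.index // 100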
Test Code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(2002, 3))
df['1_to_100'] = np.tile(
    np.arange(1, 101), int(len(df) * 0.01) + 1)[:len(df)]
df['by_100'] = np.floor(df.index / 100)
print(df.head())
print(df.tail())
Results:
0 1 2 1_to_100 by_100
0 0.301862 0.824019 0.267810 1 0.0
1 0.568186 0.040328 0.799634 2 0.0
2 0.887218 0.407702 0.351990 3 0.0
3 0.871072 0.583761 0.498725 4 0.0
4 0.169657 0.026824 0.446667 5 0.0
0 1 2 1_to_100 by_100
1997 0.370640 0.662019 0.541747 98 19.0
1998 0.545908 0.682259 0.970764 99 19.0
1999 0.416177 0.665771 0.926145 100 19.0
2000 0.207109 0.762653 0.813754 1 20.0
2001 0.711998 0.236817 0.025387 2 20.0
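For a default RangeIndex, the same 1-to-100 cycle also falls out of modular arithmetic, without tiling:
df['1_to_100'] = df.index % 100 + 1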