Here is code I wrote to generate a dataframe that contains 4 columns:
import pandas as pd
from random import randint
num_rows = 10
df = pd.DataFrame({ 'id_col' : [x+1 for x in range(num_rows)] , 'c1': [randint(0, 9) for x in range(num_rows)], 'c2': [randint(0, 9) for x in range(num_rows)], 'c3': [randint(0, 9) for x in range(num_rows)] })
df
print(df) renders:
id_col c1 c2 c3
0 1 3 1 5
1 2 0 2 4
2 3 1 2 5
3 4 0 5 6
4 5 0 0 1
5 6 6 5 8
6 7 1 6 8
7 8 5 8 8
8 9 1 5 2
9 10 2 9 2
I've set the number of rows to be dynamically generated via the num_rows variable.
How can I dynamically generate 1000 columns, where each column name is prefixed with 'c'? That is, columns c1, c2, c3, ..., c1000 are generated and each column contains 10 rows.
For better performance, I suggest creating the DataFrame with the NumPy function numpy.random.randint, then renaming the columns with a list comprehension. To add the new column at a given position, use DataFrame.insert:
import numpy as np
import pandas as pd
np.random.seed(458)
N = 15
M = 10
df = pd.DataFrame(np.random.randint(10, size=(M, N)))
df.columns = ['c{}'.format(x+1) for x in df.columns]
df.insert(0, 'idcol', np.arange(M))
print (df)
idcol c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15
0 0 8 2 1 6 2 1 0 9 7 8 0 5 5 6 0
1 1 0 2 5 0 0 2 5 2 9 2 1 0 0 5 0
2 2 5 1 3 5 4 5 3 0 2 1 7 8 9 5 4
3 3 8 7 7 0 1 3 6 7 5 8 8 9 8 5 5
4 4 2 8 1 7 3 7 4 6 0 7 0 9 4 0 4
5 5 9 2 1 6 1 9 5 6 7 4 6 1 7 3 7
6 6 1 9 3 9 7 7 2 7 9 8 2 7 2 5 5
7 7 7 6 6 6 4 2 9 0 6 5 7 0 0 4 9
8 8 6 4 2 1 3 1 7 0 4 3 0 5 4 7 7
9 9 1 3 5 7 2 2 1 5 6 1 9 5 9 6 3
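The question asked for 1000 columns and a 1-based id column, so here is a minimal sketch of the same approach at that size. The names wide and rng, the seed value, and the use of np.random.default_rng instead of np.random.seed are my own assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

num_rows, num_cols = 10, 1000

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)

# One vectorised call produces the whole (rows x cols) block of integers 0..9.
data = rng.integers(0, 10, size=(num_rows, num_cols))

# Name the columns c1..c1000 directly in the constructor.
wide = pd.DataFrame(data, columns=[f"c{i}" for i in range(1, num_cols + 1)])

# Put a 1-based id column at the front, as in the question.
wide.insert(0, "id_col", np.arange(1, num_rows + 1))

print(wide.shape)  # (10, 1001)
```

Building the array in one call avoids creating 1000 separate Python lists, which is where the dict-of-lists version spends most of its time.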
Another solution uses numpy.hstack to stack the id column onto the front of the 2d array:
np.random.seed(458)
arr = np.hstack([np.arange(M)[:, None], np.random.randint(10, size=(M, N))])
df = pd.DataFrame(arr)
df.columns = ['idcol'] + ['c{}'.format(x) for x in df.columns[1:]]
print (df)
idcol c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15
0 0 8 2 1 6 2 1 0 9 7 8 0 5 5 6 0
1 1 0 2 5 0 0 2 5 2 9 2 1 0 0 5 0
2 2 5 1 3 5 4 5 3 0 2 1 7 8 9 5 4
3 3 8 7 7 0 1 3 6 7 5 8 8 9 8 5 5
4 4 2 8 1 7 3 7 4 6 0 7 0 9 4 0 4
5 5 9 2 1 6 1 9 5 6 7 4 6 1 7 3 7
6 6 1 9 3 9 7 7 2 7 9 8 2 7 2 5 5
7 7 7 6 6 6 4 2 9 0 6 5 7 0 0 4 9
8 8 6 4 2 1 3 1 7 0 4 3 0 5 4 7 7
9 9 1 3 5 7 2 2 1 5 6 1 9 5 9 6 3
IIUC, you can use str.format and a dict comprehension:
num_rows = 10
num_cols = 10
df = pd.DataFrame({'c{}'.format(n): [randint(0, 9) for x in range(num_rows)] for n in range(num_cols)},
                  index=[x+1 for x in range(num_rows)])
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
1 1 6 2 1 3 1 8 8 2 0
2 2 6 2 2 5 7 4 1 6 2
3 1 2 6 8 7 5 5 7 2 2
4 5 5 3 3 4 7 8 1 8 6
5 7 2 8 6 5 6 2 0 0 4
6 8 2 4 4 6 3 0 1 0 2
7 5 6 8 5 1 0 4 8 4 7
8 1 5 4 5 2 4 4 6 2 7
9 5 7 7 8 5 0 2 7 3 2
10 4 8 5 3 3 7 5 1 5 1
You can use np.random.randint to create a full array of random values, f-strings (Python 3.6+) with a list comprehension for the column names, and pd.DataFrame.assign with np.arange to define "id_col":
import pandas as pd, numpy as np
rows = 10
cols = 5
minval, maxval = 0, 10
df = pd.DataFrame(np.random.randint(minval, maxval, (rows, cols)),
                  columns=[f'c{i}' for i in range(1, cols+1)])\
       .assign(id_col=np.arange(1, rows+1))
print(df)
c1 c2 c3 c4 c5 id_col
0 8 4 6 0 8 1
1 8 3 5 9 0 2
2 1 3 3 6 2 3
3 6 4 1 1 7 4
4 3 7 0 9 5 5
5 4 6 8 8 6 6
6 0 3 9 9 7 7
7 0 6 1 2 4 8
8 3 7 1 2 0 9
9 6 6 0 5 8 10
Hi, I have a DataFrame in which I want to combine several columns into one, while duplicating the remaining columns alongside it. An example DataFrame:
df = pd.DataFrame(np.random.randint(10, size=60).reshape(6, 10))
df.columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'y1', 'y2', 'y3', 'y4', 'y5']
x1 x2 x3 x4 x5 y1 y2 y3 y4 y5
0 2 6 9 4 3 8 6 1 0 7
1 1 4 8 7 3 0 5 7 3 1
2 6 7 4 8 1 5 7 7 8 5
3 6 3 4 8 0 8 7 2 3 8
4 8 5 6 1 6 3 2 1 1 4
5 1 3 7 5 1 6 5 3 8 5
I would like a nice way to produce the following DataFrame:
x1 x2 x3 x4 x5 y
0 2 6 9 4 3 8
1 1 4 8 7 3 0
2 6 7 4 8 1 5
3 6 3 4 8 0 8
4 8 5 6 1 6 3
5 1 3 7 5 1 6
6 2 6 9 4 3 6
7 1 4 8 7 3 5
8 6 7 4 8 1 7
9 6 3 4 8 0 7
10 8 5 6 1 6 2
11 1 3 7 5 1 5
12 2 6 9 4 3 1
13 1 4 8 7 3 7
14 6 7 4 8 1 7
15 6 3 4 8 0 2
16 8 5 6 1 6 1
17 1 3 7 5 1 3
18 2 6 9 4 3 0
19 1 4 8 7 3 3
20 6 7 4 8 1 8
21 6 3 4 8 0 3
22 8 5 6 1 6 1
23 1 3 7 5 1 8
24 2 6 9 4 3 7
25 1 4 8 7 3 1
26 6 7 4 8 1 5
27 6 3 4 8 0 8
28 8 5 6 1 6 4
29 1 3 7 5 1 5
Is there a nice way to produce this DataFrame with Pandas functions or is it more complicated?
Thanks
You can do this with df.melt().
df.melt(
id_vars = ['x1','x2','x3','x4','x5'],
value_vars = ['y1','y2','y3','y4','y5'],
value_name = 'y'
).drop(columns='variable')
df.melt() adds a column called variable that records which original column each row came from (y1, y2, etc.), so you drop it as shown above.
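A self-contained sketch of the melt approach follows, assuming a randomly generated frame shaped like the one in the question. The names rng, x_cols, y_cols and long_df are illustrative and not from the original posts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame(
    rng.integers(0, 10, size=(6, 10)),
    columns=[f"x{i}" for i in range(1, 6)] + [f"y{i}" for i in range(1, 6)],
)

# Derive the two column groups from their prefixes instead of typing them out.
x_cols = [c for c in df.columns if c.startswith("x")]
y_cols = [c for c in df.columns if c.startswith("y")]

# melt repeats the x columns once per y column and stacks the y values.
long_df = (
    df.melt(id_vars=x_cols, value_vars=y_cols, value_name="y")
      .drop(columns="variable")
)

print(long_df.shape)  # (30, 6)
```

The rows come out grouped by source column: the first six rows hold y1, the next six hold y2, and so on, which matches the layout requested in the question.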
I loaded the data without header.
train = pd.read_csv('caravan.train', delimiter ='\t', header=None)
train.index = np.arange(1,len(train)+1)
train
0 1 2 3 4 5 6 7 8 9
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
but the header starts from 0, and I want the header to start from 1 instead of 0.
How can I do this?
In your case, add 1 to the integer column labels (df here is the train frame from the question):
df.columns = df.columns.astype(int)+1
df
Out[99]:
1 2 3 4 5 6 7 8 9 10
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
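As an alternative sketch, both axes can be replaced with 1-based ranges in one step each. The zero-filled frame below is only a stand-in for the result of read_csv with header=None, since the caravan.train file is not available here.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame read with header=None: columns are labelled 0..9.
train = pd.DataFrame(np.zeros((5, 10), dtype=int))

# Replace the row and column labels with ranges that start at 1.
train.index = pd.RangeIndex(1, len(train) + 1)
train.columns = pd.RangeIndex(1, train.shape[1] + 1)

print(train.columns.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

This does not depend on the existing labels being integers, so it also works if the columns were read in as strings.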
I have pasted part of df below, but the actual df has more than 400 columns.
>>> df_final
c d name e f g h g h
0 0 0 aa 0 0 0 0 0 0
1 1 2 bb 1 2 1 2 1 2
2 2 4 cc 2 4 2 4 2 4
3 3 6 dd 3 6 3 6 3 6
4 4 8 ee 4 8 4 8 4 8
5 5 10 ff 5 10 5 10 5 10
6 6 12 gg 6 12 6 12 6 12
I want 'name' and 'c' at the first and second positions, but the order of the other columns doesn't matter. I would like to use
cols = ['name' , 'c']
col_position = [1 , 2]
How can I re-order the data frame using the lists cols and col_position?
How can I set the datatype to str for cols and float for the other columns?
Thanks in advance
I think you need:
df1 = df[cols + np.setdiff1d(df.columns, cols).tolist()]
print (df1)
name c d e f g g.1 h h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 1 2 2
2 cc 2 4 2 4 2 2 4 4
3 dd 3 6 3 6 3 3 6 6
4 ee 4 8 4 8 4 4 8 8
5 ff 5 10 5 10 5 5 10 10
6 gg 6 12 6 12 6 6 12 12
And:
c1 = df.columns[col_position].tolist()
df1 = df[c1 + np.setdiff1d(df.columns, c1).tolist()]
print (df1)
d name c e f g g.1 h h.1
0 0 aa 0 0 0 0 0 0 0
1 2 bb 1 1 2 1 1 2 2
2 4 cc 2 2 4 2 2 4 4
3 6 dd 3 3 6 3 3 6 6
4 8 ee 4 4 8 4 4 8 8
5 10 ff 5 5 10 5 5 10 10
6 12 gg 6 6 12 6 6 12 12
An alternative that selects by position:
c1 = np.arange(len(df.columns))
df1 = df.iloc[:, col_position + np.setdiff1d(c1, col_position).tolist()]
print (df1)
d name c e f g h g.1 h.1
0 0 aa 0 0 0 0 0 0 0
1 2 bb 1 1 2 1 2 1 2
2 4 cc 2 2 4 2 4 2 4
3 6 dd 3 3 6 3 6 3 6
4 8 ee 4 4 8 4 8 4 8
5 10 ff 5 5 10 5 10 5 10
6 12 gg 6 6 12 6 12 6 12
Construct a list to slice by
cols = ['name', 'c']
df[cols + df.columns.difference(cols).tolist()]
name c d e f g g.1 h h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 1 2 2
2 cc 2 4 2 4 2 2 4 4
3 dd 3 6 3 6 3 3 6 6
4 ee 4 8 4 8 4 4 8 8
5 ff 5 10 5 10 5 5 10 10
6 gg 6 12 6 12 6 6 12 12
Slice, drop, and join
cols = ['name', 'c']
df[cols].join(df.drop(columns=cols))
name c d e f g h g.1 h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 2 1 2
2 cc 2 4 2 4 2 4 2 4
3 dd 3 6 3 6 3 6 3 6
4 ee 4 8 4 8 4 8 4 8
5 ff 5 10 5 10 5 10 5 10
6 gg 6 12 6 12 6 12 6 12
Slice, drop, and concat
cols = ['name', 'c']
pd.concat([df[cols], df.drop(columns=cols)], axis=1)
name c d e f g h g.1 h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 2 1 2
2 cc 2 4 2 4 2 4 2 4
3 dd 3 6 3 6 3 6 3 6
4 ee 4 8 4 8 4 8 4 8
5 ff 5 10 5 10 5 10 5 10
6 gg 6 12 6 12 6 12 6 12
By position with iloc
positions = df.columns.map({'name': 0, 'c': 1}.get).argsort()
df.iloc[:, positions]
name c d e f g h g.1 h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 2 1 2
2 cc 2 4 2 4 2 4 2 4
3 dd 3 6 3 6 3 6 3 6
4 ee 4 8 4 8 4 8 4 8
5 ff 5 10 5 10 5 10 5 10
6 gg 6 12 6 12 6 12 6 12
Or with a focus on OP's vars
cols = ['name' , 'c']
col_position = [1 , 2]
m = dict(zip(cols, col_position))
positions = df.columns.map(m.get).argsort()
df.iloc[:, positions]
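The position-based idea can also be written without relying on how missing keys sort. This is a minimal sketch with a small made-up frame: each listed column gets its rank, every other column gets one shared larger key, and a stable argsort keeps the unlisted columns in their original order. The names rank, keys and reordered are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"c": [0, 1], "d": [0, 2], "name": ["aa", "bb"], "e": [0, 1], "f": [0, 2]}
)

cols = ["name", "c"]

# Listed columns get keys 0, 1, ...; all other columns share the key len(cols).
rank = {col: i for i, col in enumerate(cols)}
keys = [rank.get(col, len(cols)) for col in df.columns]

# A stable sort preserves the original order among columns with equal keys.
positions = np.argsort(keys, kind="stable")
reordered = df.iloc[:, positions]

print(reordered.columns.tolist())  # ['name', 'c', 'd', 'e', 'f']
```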
I tried this:
l = df.columns.tolist()  # a plain list; the ndarray from .values has no remove/insert
cols = ['name' , 'c']
col_position = [1 , 2]
for u in zip(cols, col_position):
    l.remove(u[0])
    l.insert(u[1], u[0])
df = df[l]
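The second part of the question, setting str for cols and float for the rest, can be sketched with a dict passed to astype. This assumes the column names are unique and that the remaining columns hold values convertible to float; the small frame and the names others, dtypes and typed are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"c": [0, 1, 2], "d": [0, 2, 4], "name": ["aa", "bb", "cc"], "e": [0, 1, 2]}
)

cols = ["name", "c"]
others = [col for col in df.columns if col not in cols]

# Map every column to its target type: str for cols, float for everything else.
dtypes = {**{col: str for col in cols}, **{col: float for col in others}}

# Reorder first, then convert all columns in one astype call.
typed = df[cols + others].astype(dtypes)

print(typed.dtypes)
```

How the string columns are reported (object or a dedicated string dtype) depends on the pandas version, so the sketch does not show an expected output.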
I have a Pandas DataFrame question. I have a df whose index is the same as its columns. It looks like the one below.
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 812 62 174 0 4 46 46 7 2 15
B 62 427 27 0 0 12 61 2 4 11
C 174 27 174 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 130 10 57 33 4 5
F 46 12 13 0 10 187 4 5 0 0
......
In other words, df equals df.transpose(). All I want is a pandas function (or a numpy one applied to df.values) that zeroes the entries where the index equals the column. My ideal output would be as below.
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15
B 62 0 27 0 0 12 61 2 4 11
C 174 27 0 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5
F 46 12 13 0 10 0 4 5 0 0
......
Is there a Python function that makes this step very fast? I tried a for loop with df.iloc[i,i]=0, but since my dataset is very big, it takes a long time to finish. Thanks in advance!
Setup
np.random.seed([3,1415])
i = pd.MultiIndex.from_product(
[['Cat2'], list('ABCDEFGHIJ')],
names=['DNA', 'Item']
)
a = np.random.randint(5, size=(10, 10))
df = pd.DataFrame(a + a.T + 1, i, i)
df
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 1 6 6 7 7 7 4 4 8 2
B 6 1 3 6 1 6 6 4 8 5
C 6 3 9 8 9 6 7 8 4 9
D 7 6 8 1 6 9 4 5 4 3
E 7 1 9 6 9 7 3 7 2 6
F 7 6 6 9 7 9 3 4 6 6
G 4 6 7 4 3 3 9 4 5 5
H 4 4 8 5 7 4 4 5 4 5
I 8 8 4 4 2 6 5 4 9 7
J 2 5 9 3 6 6 5 5 7 3
Option 1
The simplest way is to multiply by one minus the identity matrix
df * (1 - np.eye(len(df), dtype=int))
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
Option 2
However, we can also use pd.DataFrame.mask with np.eye. Masking is nice because it doesn't have to be numeric and it will still work.
df.mask(np.eye(len(df), dtype=bool), 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
Option 3
In the event the columns and indices are not identical, or they are out of order, we can use equality to tell us where to mask.
d = df.iloc[::-1]
d.mask(d.index == d.columns.values[:, None], 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 J 2 5 9 3 6 6 5 5 7 0
I 8 8 4 4 2 6 5 4 0 7
H 4 4 8 5 7 4 4 0 4 5
G 4 6 7 4 3 3 0 4 5 5
F 7 6 6 9 7 0 3 4 6 6
E 7 1 9 6 0 7 3 7 2 6
D 7 6 8 0 6 9 4 5 4 3
C 6 3 0 8 9 6 7 8 4 9
B 6 0 3 6 1 6 6 4 8 5
A 0 6 6 7 7 7 4 4 8 2
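Another possibility, not taken from the options above, is np.fill_diagonal on a copy of the underlying array. This is a sketch under the assumption that the frame is square, has a single numeric dtype, and has rows and columns in the same order; the small symmetric frame and the names arr and result are illustrative.

```python
import numpy as np
import pandas as pd

labels = list("ABCDE")
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
a = rng.integers(1, 10, size=(5, 5))
df = pd.DataFrame(a + a.T, index=labels, columns=labels)

# Work on an explicit copy so the original frame is left untouched.
arr = df.to_numpy(copy=True)

# fill_diagonal modifies the array in place and returns None.
np.fill_diagonal(arr, 0)

# Rebuild the frame with the original row and column labels.
result = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(result)
```

Copying first avoids depending on whether df.values hands back a writable view, which varies with the pandas version and the frame's dtypes.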
Suppose I have the following pandas dataframe:
Wafer_Id v1 v2
0 0 9 6
1 0 7 8
2 0 1 5
3 1 6 6
4 1 0 8
5 1 5 0
6 2 8 8
7 2 2 6
8 2 3 5
9 3 5 1
10 3 5 6
11 3 9 8
I want to group it according to Wafer_Id and I would like to get something like
w
Out[60]:
Wafer_Id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3
0 0 9 7 1 6 ... ...
1 1 6 0 5 6
2 2 8 2 3 8
3 3 5 5 9 1
I think that I can obtain the result with the pivot function, but I am not sure how to do it.
Possible solution
oes = pd.DataFrame()
oes['Wafer_Id'] = [0,0,0,1,1,1,2,2,2,3,3,3]
oes['v1'] = np.random.randint(0, 10, 12)
oes['v2'] = np.random.randint(0, 10, 12)
oes['id'] = [0, 1, 2] * 4
oes.pivot(index='Wafer_Id', columns='id')
oes
Out[74]:
Wafer_Id v1 v2 id
0 0 8 7 0
1 0 3 3 1
2 0 8 0 2
3 1 2 5 0
4 1 4 1 1
5 1 8 8 2
6 2 8 6 0
7 2 4 7 1
8 2 4 3 2
9 3 4 6 0
10 3 9 2 1
11 3 7 1 2
oes.pivot(index='Wafer_Id', columns='id')
Out[75]:
v1 v2
id 0 1 2 0 1 2
Wafer_Id
0 8 3 8 7 3 0
1 2 4 8 5 1 8
2 8 4 4 6 7 3
3 4 9 7 6 2 1
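To get the flat v1_1, v1_2, ... headers from the question, the per-wafer counter can be derived with groupby().cumcount() and the two-level columns joined afterwards. This is a minimal sketch using the v1 and v2 values from the question's first table; the 1-based counter and the name w follow the question's desired output, while the flattening step is my own addition.

```python
import pandas as pd

oes = pd.DataFrame({
    "Wafer_Id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "v1": [9, 7, 1, 6, 0, 5, 8, 2, 3, 5, 5, 9],
    "v2": [6, 8, 5, 6, 8, 0, 8, 6, 5, 1, 6, 8],
})

# Number the rows within each wafer 1, 2, 3 instead of hard-coding the pattern.
oes["id"] = oes.groupby("Wafer_Id").cumcount() + 1

# Pivot to one row per wafer; the columns become a (value name, id) MultiIndex.
w = oes.pivot(index="Wafer_Id", columns="id")

# Join the two column levels into flat names such as v1_1.
w.columns = [f"{name}_{i}" for name, i in w.columns]
w = w.reset_index()

print(w)
```

Using cumcount means the sketch does not assume every wafer has exactly three rows; wafers with fewer rows would get NaN in the missing positions.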