I have a sample input file with many rows (one per variant); the columns represent the components.
A01_01 A01_02 A01_03 A01_04 A01_05 A01_06 A01_07 A01_08 A01_09 A01_10 A01_11 A01_12 A01_13 A01_14 A01_15 A01_16 A01_17 A01_18 A01_19 A01_20 A01_21 A01_22 A01_23 A01_24 A01_25 A01_26 A01_27 A01_28 A01_29 A01_30 A01_31 A01_32 A01_33 A01_34 A01_35 A01_36 A01_37 A01_38 A01_39 A01_40 A01_41 A01_42 A01_43 A01_44 A01_45 A01_46 A01_47 A01_48 A01_49 A01_50 A01_51 A01_52 A01_53 A01_54 A01_55 A01_56 A01_57 A01_58 A01_59 A01_60 A01_61 A01_62 A01_63 A01_64 A01_65 A01_66 A01_67 A01_69 A01_70 A01_71
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
I first import this .txt file as:
#!/usr/bin/env python
from sklearn.decomposition import PCA
inputfile = open('sample_input_file', 'r')
I would like to perform principal component analysis and plot the first two components (meaning the first two columns).
After reading about sklearn, I am not sure if this is the way to go about it.
PCA for two components:
pca = PCA(n_components=2)
pca.fit(inputfile)  # not sure how this reads in the file
Therefore, I need help importing my input file as a dataframe so that I can perform PCA on it.
sklearn works with numpy arrays.
So you want to use numpy.loadtxt:
import numpy
from sklearn.decomposition import PCA

data = numpy.loadtxt('sample_input_file', skiprows=1)  # skip the header row
pca = PCA(n_components=2)
pca.fit(data)
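To plot the first two principal components, project the data onto the fitted components and scatter the two resulting columns; a minimal sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt

coords = pca.transform(data)          # shape (n_rows, 2): the first two components
plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()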
I have two columns in a data frame (Startpoint and endpoint).
I would like to generate an array with 1 for the duration between the two points and 0 otherwise.
For example, with a total of 200 increments:
df = pd.DataFrame({'Startpoint': [100, 50, 40, 75, 52, 43, 90, 48, 56, 20], 'endpoint': [150, 70, 80, 90, 140, 160, 170, 120, 135, 170]})
df
I want to generate a (200, 1) array of 0s and 1s: 1 if the increment falls in the range between the start and end points (for example between 50 and 100) and 0 otherwise.
Thank you,
You can use numpy broadcasting to create the desired array in a vectorized way:
import numpy as np
rng = np.arange(200)  # one slot per increment
out = ((df['Startpoint'].to_numpy()[:, None] <= rng) & (rng < df['endpoint'].to_numpy()[:, None])).astype(int)
Output:
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
To see that this is indeed the desired output, we check the shape and the number of 1s in each row:
>>> out.shape
(10, 200)
>>> out.sum(axis=1)
array([ 50, 20, 40, 15, 88, 117, 80, 72, 79, 150])
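If what you actually need is a single (200, 1) column vector that is 1 wherever any of the intervals covers that increment, you could collapse the rows; a small sketch under that reading of the question:
out_single = out.any(axis=0).astype(int).reshape(-1, 1)   # shape (200, 1)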
Try this:
import numpy as np
import pandas as pd
increment = 200
array = np.zeros((len(df), increment), dtype=int)  # one row per (Startpoint, endpoint) pair
for i in range(len(df)):
    array[i, df['Startpoint'][i]:df['endpoint'][i]] = 1
Output (note: each row is printed in full below; this is a single array of shape (10, 200), not 10 separate arrays):
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
I have the following data frame:
Company Firm
125911 1
125911 2
32679 3
32679 5
32679 5
32679 8
32679 10
32679 12
43805 14
67734 8
67734 9
67734 10
67734 10
67734 11
67734 12
67734 13
74240 4
74240 6
74240 7
Basically, each firm makes an investment into a company in a specific year, which in this case is the same year for all companies. What I want to do in Python is create a simple adjacency matrix with only 0s and 1s: a 1 if two firms have invested in the same company. So even if, for example, firms 10 and 8 have invested in two of the same companies, it will still just be a 1.
The resulting matrix I am looking for looks like:
Firm 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I have seen similar questions where you can use crosstab; however, in those cases each company only has one row, with all the firms in different columns. So I am wondering what the best and most efficient way to tackle this specific problem is. Any help is greatly appreciated.
import numpy as np
import pandas as pd

dfs = []
# For each company, build a small all-ones block over the firms that invested in it
for s in df.groupby("Company").agg(list).values:
    dfs.append(pd.DataFrame(index=set(s[0]), columns=set(s[0])).fillna(1))
# Stack the per-company blocks, collapse duplicates per firm, and binarize
out = pd.concat(dfs).groupby(level=0).sum().gt(0).astype(int)
np.fill_diagonal(out.values, 0)
print(out)
Prints:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0
An alternative is a self-merge on Company followed by crosstab. Firm 14 never co-invests with anyone, so it has to be reindexed back in, and the pair counts are reduced to 0/1:
dfm = df.merge(df, on="Company").query("Firm_x != Firm_y")
firms = sorted(df['Firm'].unique())
out = (pd.crosstab(dfm['Firm_x'], dfm['Firm_y'])
         .reindex(index=firms, columns=firms, fill_value=0)
         .gt(0).astype(int))
>>> out
Firm_y  1  2  3  4  5  6  7  8  9  10  11  12  13  14
Firm_x
1       0  1  0  0  0  0  0  0  0   0   0   0   0   0
2       1  0  0  0  0  0  0  0  0   0   0   0   0   0
3       0  0  0  0  1  0  0  1  0   1   0   1   0   0
4       0  0  0  0  0  1  1  0  0   0   0   0   0   0
5       0  0  1  0  0  0  0  1  0   1   0   1   0   0
6       0  0  0  1  0  0  1  0  0   0   0   0   0   0
7       0  0  0  1  0  1  0  0  0   0   0   0   0   0
8       0  0  1  0  1  0  0  0  1   1   1   1   1   0
9       0  0  0  0  0  0  0  1  0   1   1   1   1   0
10      0  0  1  0  1  0  0  1  1   0   1   1   1   0
11      0  0  0  0  0  0  0  1  1   1   0   1   1   0
12      0  0  1  0  1  0  0  1  1   1   1   0   1   0
13      0  0  0  0  0  0  0  1  1   1   1   1   0   0
14      0  0  0  0  0  0  0  0  0   0   0   0   0   0
The documentation seems to be bare-bones, and the example given in the standard TF tutorial does not highlight a behavior I am seeing. Let's say you have an imbalanced dataset of 1s and 0s (pos and neg), and you want to sample at weights [0.5, 0.5], so that you see the positives more frequently. You would do this:
import numpy as np
import tensorflow as tf

pos_ds = tf.data.Dataset.from_tensor_slices(np.ones(shape=(16, 1)))
neg_ds = tf.data.Dataset.from_tensor_slices(np.zeros(shape=(128, 1)))
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
And if I want to see how the pos and neg are distributed as I go through the dataset:
xs = []
for x in resampled_ds:
xs.append(int(x.numpy()[0]))
xs = np.array(xs)
print(xs)
np.bincount(xs)
I see this:
[1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1
0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
array([128, 16])
There are 128 negatives and 16 positives. If I use this as my train_ds, it is equivalent to doing no sampling at all, and worse, the negatives are no longer uniformly distributed across the steps of an epoch. I am guessing that the 0.5 sampling happens at the beginning, and once it "runs out" of 1s it just keeps sampling the zeros. It clearly doesn't sample the 1s with replacement. I think the 1s and 0s will only be 0.5/0.5 if you stop after all the 1s have been sampled.
It looks like this is the behavior, but it isn't the only sensible one. I want to sample the positives multiple times (i.e. sampling with replacement) in one epoch, with approximately equal amounts of pos and neg. Is there any option for this in the API? Also, I have data augmentation, so the positives are not actually identical when trained.
You can do something like this for the replacement issue:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds.repeat(128 // 16), neg_ds], weights=[0.5, 0.5])  # repeat the positives 128 // 16 = 8 times so both datasets hold 128 examples
And the result is:
[1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1
1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 1 1
1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0
0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Out[2]: array([128, 128], dtype=int64)
Actually, I also found that the solution is right there in the TF tutorial imbalanced_data.ipynb (I totally missed this one in my own notebook):
pos_ds = pos_ds.shuffle(BUFFER_SIZE).repeat()
neg_ds = neg_ds.shuffle(BUFFER_SIZE).repeat()
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
The tutorial further suggests a heuristic for setting resampled_steps_per_epoch.
However, shuffle + repeat is still not equivalent to true sampling with replacement for the minority class. A repeat() followed by a shuffle() may do it; I can update this after trying both ways.
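For reference, here is a rough sketch of that heuristic; the batch size and counts are illustrative assumptions, not values from the tutorial. With 50/50 resampling, one "epoch" is roughly the number of batches needed to see each negative example once:
import numpy as np

BATCH_SIZE = 32    # illustrative
neg_count = 128    # negatives in the toy example above
# About half of every batch is negative, so cycling through the negatives once
# takes roughly 2 * neg_count / BATCH_SIZE steps.
resampled_steps_per_epoch = int(np.ceil(2.0 * neg_count / BATCH_SIZE))
resampled_ds = resampled_ds.batch(BATCH_SIZE)
# model.fit(resampled_ds, steps_per_epoch=resampled_steps_per_epoch, ...)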
I have a dataframe:
DOW
0 0
1 1
2 2
3 3
4 4
5 5
6 6
This corresponds to the day of the week. Now I want to create this dataframe:
DOW MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0
2 2 0 1 0 0 0 0
3 3 0 0 1 0 0 0
4 4 0 0 0 1 0 0
5 5 0 0 0 0 1 0
6 6 0 0 0 0 0 1
7 0 0 0 0 0 0 0
8 1 1 0 0 0 0 0
The flags depend on the DOW column: for example, if it is 1 then MON_FLAG should be 1, if it is 2 then TUE_FLAG should be 1, and so on. I have kept Sunday as 0, which is why all the flag columns are zero in that case.
Use get_dummies and rename the columns with a dictionary:
d = {0: 'SUN_FLAG', 1: 'MON_FLAG', 2: 'TUE_FLAG',
     3: 'WED_FLAG', 4: 'THUR_FLAG', 5: 'FRI_FLAG', 6: 'SAT_FLAG'}
df = df.join(pd.get_dummies(df['DOW']).rename(columns=d))
print (df)
DOW SUN_FLAG MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 0 1 0 0 0 0
3 3 0 0 0 1 0 0 0
4 4 0 0 0 0 1 0 0
5 5 0 0 0 0 0 1 0
6 6 0 0 0 0 0 0 1
7 0 1 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0
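If some day values could be missing from DOW, declaring the column as a Categorical keeps all seven flag columns anyway; a small sketch under that assumption, not part of the answer above:
import pandas as pd

d = {0: 'SUN_FLAG', 1: 'MON_FLAG', 2: 'TUE_FLAG', 3: 'WED_FLAG',
     4: 'THUR_FLAG', 5: 'FRI_FLAG', 6: 'SAT_FLAG'}
# get_dummies emits one column per category, so listing all seven categories
# guarantees every flag column exists even when a day never occurs in the data.
dow = pd.Categorical(df['DOW'], categories=range(7))
df = df.join(pd.get_dummies(dow).rename(columns=d))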
Here is the .csv file:
0 0 1 1 1 0 1 1 0 1 1 1 1
0 1 1 0 1 0 1 1 0 1 0 0 1
0 0 1 1 0 0 1 1 1 0 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1 2
0 1 1 1 0 1 1 1 1 1 1 1 1
0 0 0 1 1 1 0 1 0 0 0 1 1
0 0 0 0 1 1 0 0 1 0 1 0 2
0 1 1 0 1 1 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0 0 1 1 0 1
0 1 1 1 0 1 1 0 0 0 0 1 1
where the first column should contain indices like (0, 1, 2, 3, 4, ...), but for some reason they are all zeros. Is there any way to fix this when reading the csv file with pandas.read_csv?
I use:
df = pd.read_csv(file,delimiter='\t',header=None,names=[1,2,3,4,5,6,7,8,9,10,11,12])
and get something like:
1 2 3 4 5 6 7 8 9 10 11 12
0 0 1 1 1 0 1 1 0 1 1 1 1
0 1 1 0 1 0 1 1 0 1 0 0 1
0 0 1 1 0 0 1 1 1 0 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1 2
0 1 1 1 0 1 1 1 1 1 1 1 1
0 0 0 1 1 1 0 1 0 0 0 1 1
0 0 0 0 1 1 0 0 1 0 1 0 2
0 1 1 0 1 1 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0 0 1 1 0 1
0 1 1 1 0 1 1 0 0 0 0 1 1
and it's nearly what I need, but the first column (the indices) is still all zeros. Can pandas, for example, ignore this first column of zeros and automatically generate new indices, to get this:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0 1 0 1 1 0 0 0 1 1 1 0 1
1 0 1 0 1 1 0 0 0 1 1 1 1 2
2 0 1 1 1 0 0 1 1 1 1 1 1 2
You might want index_col=False:
df = pd.read_csv(file, delimiter='\t',
                 header=None,
                 index_col=False)
From the Docs,
If you have a malformed file with delimiters at the end of each line,
you might consider index_col=False to force pandas to not use the
first column as the index
Why fuss over read_csv? Use np.loadtxt:
pd.DataFrame(np.loadtxt(file, dtype=int))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0 0 1 1 1 0 1 1 0 1 1 1 1
1 0 1 1 0 1 0 1 1 0 1 0 0 1
2 0 0 1 1 0 0 1 1 1 0 1 1 1
3 0 1 1 1 1 1 1 1 1 1 1 1 2
4 0 1 1 1 0 1 1 1 1 1 1 1 1
5 0 0 0 1 1 1 0 1 0 0 0 1 1
6 0 0 0 0 1 1 0 0 1 0 1 0 2
7 0 1 1 0 1 1 1 1 0 1 1 1 1
8 0 0 1 0 0 0 0 0 0 1 1 0 1
9 0 1 1 1 0 1 1 0 0 0 0 1 1
np.loadtxt splits on whitespace by default and does not read any header row or index column. Column types are not inferred either, since the dtype is specified explicitly as int. All in all, this is a very succinct and powerful alternative.