Pandas read data without header or index - python

Here is the .csv file :
0 0 1 1 1 0 1 1 0 1 1 1 1
0 1 1 0 1 0 1 1 0 1 0 0 1
0 0 1 1 0 0 1 1 1 0 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1 2
0 1 1 1 0 1 1 1 1 1 1 1 1
0 0 0 1 1 1 0 1 0 0 0 1 1
0 0 0 0 1 1 0 0 1 0 1 0 2
0 1 1 0 1 1 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0 0 1 1 0 1
0 1 1 1 0 1 1 0 0 0 0 1 1
where the first column should contain indices like (0, 1, 2, 3, 4, ...), but for some reason it is all zeros. Is there any way to fix this when reading the csv file with pandas.read_csv?
I use
df = pd.read_csv(file, delimiter='\t', header=None, names=[1,2,3,4,5,6,7,8,9,10,11,12])
and getting something like:
1 2 3 4 5 6 7 8 9 10 11 12
0 0 1 1 1 0 1 1 0 1 1 1 1
0 1 1 0 1 0 1 1 0 1 0 0 1
0 0 1 1 0 0 1 1 1 0 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1 2
0 1 1 1 0 1 1 1 1 1 1 1 1
0 0 0 1 1 1 0 1 0 0 0 1 1
0 0 0 0 1 1 0 0 1 0 1 0 2
0 1 1 0 1 1 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0 0 1 1 0 1
0 1 1 1 0 1 1 0 0 0 0 1 1
which is nearly what I need, but the first column (the index) is still all zeros. Can pandas, for example, ignore this first column of zeros and automatically generate a new index, to get this:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0 1 0 1 1 0 0 0 1 1 1 0 1
1 0 1 0 1 1 0 0 0 1 1 1 1 2
2 0 1 1 1 0 0 1 1 1 1 1 1 2

You might want index_col=False
df = pd.read_csv(file, delimiter='\t',
                 header=None,
                 index_col=False)
From the docs:
If you have a malformed file with delimiters at the end of each line,
you might consider index_col=False to force pandas to not use the
first column as the index
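Putting it together, a minimal sketch (here file stands in for the path to the tab-separated file from the question): dropping the explicit names list lets pandas label the columns 0-12 itself, which matches the desired output, while index_col=False keeps the zeros as an ordinary data column and a default RangeIndex (0, 1, 2, ...) is generated:
import pandas as pd

# header=None: the file has no header row
# index_col=False: never treat the first column as the index
df = pd.read_csv(file, delimiter='\t', header=None, index_col=False)
print(df.head())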

Why fuss over read_csv? Use np.loadtxt:
pd.DataFrame(np.loadtxt(file, dtype=int))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0 0 1 1 1 0 1 1 0 1 1 1 1
1 0 1 1 0 1 0 1 1 0 1 0 0 1
2 0 0 1 1 0 0 1 1 1 0 1 1 1
3 0 1 1 1 1 1 1 1 1 1 1 1 2
4 0 1 1 1 0 1 1 1 1 1 1 1 1
5 0 0 0 1 1 1 0 1 0 0 0 1 1
6 0 0 0 0 1 1 0 0 1 0 1 0 2
7 0 1 1 0 1 1 1 1 0 1 1 1 1
8 0 0 1 0 0 0 0 0 0 1 1 0 1
9 0 1 1 1 0 1 1 0 0 0 0 1 1
The default delimiter is whitespace, and no headers/indexes are read in by default. Column types are also not inferred, since the dtype is specified to be int. All in all, this is a very succinct and powerful alternative.


Is there any way to convert the columns in a Pandas DataFrame using its mirror image DataFrame structure?

The df I have is:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I want to obtain a DataFrame with the columns reversed (a mirror image):
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Is there any way to do that?
You can check:
df[:] = df.iloc[:,::-1]
df
Out[959]:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Here is a slightly more verbose, but likely more efficient solution, as it doesn't require rewriting the data; it only renames and reorders the columns:
cols = df.columns
df.columns = df.columns[::-1]
df = df.loc[:,cols]
Or a shorter variant:
df = df.iloc[:,::-1].set_axis(df.columns, axis=1)
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
There are other ways, but here's one solution:
df[df.columns] = df[reversed(df.columns)]
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
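For completeness, a self-contained sketch of the label-preserving variant, built from the 0/1 frame shown in the question:
import pandas as pd

# Each row is one 3-bit pattern, as in the question
df = pd.DataFrame([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
                   [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])

# Reverse the column order positionally, then put the original labels back
mirrored = df.iloc[:, ::-1].set_axis(df.columns, axis=1)
print(mirrored)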

Count how many cells are between the last value in the dataframe and the end of the row

I'm using the pandas library in Python.
I have a data frame:
0 1 2 3 4
0 0 0 0 1 0
1 0 0 0 0 1
2 0 0 1 0 0
3 1 0 0 0 0
4 0 0 1 0 0
5 0 1 0 0 0
6 1 0 0 1 1
Is it possible to create a new column that counts the number of cells that are zero between the last non-zero value and the end of the row? Example data frame below:
0 1 2 3 4 Value
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Using argmax:
df['Value'] = df.apply(lambda x: (x.iloc[::-1] == 1).argmax(), axis=1)
Or using np.where:
df['Value'] = np.where(df.iloc[:, ::-1] == 1, True, False).argmax(1)
0 1 2 3 4 Value
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
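One caveat, not triggered by the question's data (every row there contains at least one 1): argmax also returns 0 for an all-zero row, where the count should arguably be the full row length. A hedged guard for that case, assuming the original data columns are labeled 0-4 as in the question and that an all-zero row should count every cell:
import numpy as np

cols = [0, 1, 2, 3, 4]                 # the original data columns, before 'Value' is added
vals = df[cols]
# Fall back to the full row width when a row contains no 1 at all
df['Value'] = np.where(vals.eq(1).any(axis=1),
                       np.where(vals.iloc[:, ::-1] == 1, True, False).argmax(1),
                       len(cols))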
Use:
df['new'] = df.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1)
print (df)
0 1 2 3 4 new
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Details:
First change the order of the columns with DataFrame.iloc and slicing:
print (df.iloc[:, ::-1])
4 3 2 1 0
0 0 1 0 0 0
1 1 0 0 0 0
2 0 0 1 0 0
3 0 0 0 0 1
4 0 0 1 0 0
5 0 0 0 1 0
6 1 1 0 0 1
Then take the cumulative sum along each row with DataFrame.cumsum:
print (df.iloc[:, ::-1].cumsum(axis=1))
4 3 2 1 0
0 0 1 1 1 1
1 1 1 1 1 1
2 0 0 1 1 1
3 0 0 0 0 1
4 0 0 1 1 1
5 0 0 0 1 1
6 1 2 2 2 3
Then compare with 0 using DataFrame.eq to mark the positions where the running sum is still zero:
print (df.iloc[:, ::-1].cumsum(axis=1).eq(0))
4 3 2 1 0
0 True False False False False
1 False False False False False
2 True True False False False
3 True True True True False
4 True True False False False
5 True True True False False
6 False False False False False
And finally count the True values per row with sum:
print (df.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1))
0 1
1 0
2 2
3 4
4 2
5 3
6 0
dtype: int64
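Putting the explanation together, a runnable sketch built from the frame in the question:
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0],
                   [1, 0, 0, 1, 1]])

# Reverse the columns, take the running row-wise sum, and count the leading
# positions where that sum is still 0, i.e. the zeros after the last 1
df['new'] = df.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1)
print(df)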

How to concatenate all values of a pandas dataframe into an integer in python?

I have the following dataframe:
1 2 3 4 5 6 7 8 9 10
dog cat 1 1 0 1 1 1 0 0 1 0
dog 1 1 1 1 1 1 0 0 1 1
fox 1 1 1 1 1 1 0 0 1 1
jumps 1 1 1 1 1 1 0 1 1 1
over 1 1 1 1 1 1 0 0 1 1
the 1 1 1 1 1 1 1 0 1 1
I want to first drop all labels from both rows and columns so the df becomes:
1 1 0 1 1 1 0 0 1 0
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 0 1 1 1
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 1 0 1 1
And then concatenate the values into one long integer so it becomes:
110111001011111100111111110011111111011111111100111111111011
Does anyone know a way of doing it in the shortest snippet of code possible? I appreciate the suggestions. Thank you.
Option 1
apply(str.join) + str.cat:
df.astype(str).apply(''.join, 1).str.cat(sep='')
'110111001011111100111111110011111111011111111100111111111011'
Option 2
apply + np.sum, proposed by Wen:
np.sum(df.astype(str).apply(np.sum, 1))
'110111001011111100111111110011111111011111111100111111111011'
IIUC
''.join(str(x) for x in sum(df.values.tolist(),[]))
Out[344]: '110111001011111100111111110011111111011111111100111111111011'
Or
''.join(map(str,sum(df.values.tolist(),[])))
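Note that all of the snippets above produce a string. If an actual Python integer is needed, as the title suggests, wrapping the result in int() works; a small sketch, with df being the 0/1 frame from the question (the row and column labels are ignored by .values anyway):
# Flatten row by row, join the digits, then convert; note that int() drops
# any leading zeros, which is harmless here because the first value is 1
number = int(''.join(map(str, df.values.ravel())))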

Drop columns with more than 70% zeros

I would like to know if there is a command that drops columns that have more than 70% (or X%) zeros, similar to this one for NaN:
df = df.loc[:, df.isnull().mean() < .7]
Thank you!
Just change df.isnull().mean() to (df==0).mean():
df = df.loc[:, (df==0).mean() < .7]
Here's a demo:
df
Out:
0 1 2 3 4
0 1 1 1 1 0
1 1 0 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 1 1 1 1
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 1 0 0
8 1 0 0 1 0
9 0 0 0 1 0
(df==0).mean()
Out:
0 0.4
1 0.5
2 0.6
3 0.5
4 0.8
dtype: float64
df.loc[:, (df==0).mean() < .7]
Out:
0 1 2 3
0 1 1 1 1
1 1 0 0 0
2 0 1 1 0
3 1 0 0 1
4 1 1 1 1
5 1 0 0 0
6 0 1 0 0
7 0 1 1 0
8 1 0 0 1
9 0 0 0 1
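If the check comes up repeatedly, it can be wrapped in a small helper; a sketch, where the function name, the default threshold and the treat_nan_as_zero flag are illustrative rather than from the question:
import pandas as pd

def drop_mostly_zero(df, threshold=0.7, treat_nan_as_zero=False):
    """Drop columns whose fraction of zeros is at least `threshold`."""
    values = df.fillna(0) if treat_nan_as_zero else df
    zero_frac = (values == 0).mean()          # per-column fraction of zeros
    return df.loc[:, zero_frac < threshold]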

Performing PCA on a dataframe with Python with sklearn

I have a sample input file with many rows (one per variant), where the columns represent the components.
A01_01 A01_02 A01_03 A01_04 A01_05 A01_06 A01_07 A01_08 A01_09 A01_10 A01_11 A01_12 A01_13 A01_14 A01_15 A01_16 A01_17 A01_18 A01_19 A01_20 A01_21 A01_22 A01_23 A01_24 A01_25 A01_26 A01_27 A01_28 A01_29 A01_30 A01_31 A01_32 A01_33 A01_34 A01_35 A01_36 A01_37 A01_38 A01_39 A01_40 A01_41 A01_42 A01_43 A01_44 A01_45 A01_46 A01_47 A01_48 A01_49 A01_50 A01_51 A01_52 A01_53 A01_54 A01_55 A01_56 A01_57 A01_58 A01_59 A01_60 A01_61 A01_62 A01_63 A01_64 A01_65 A01_66 A01_67 A01_69 A01_70 A01_71
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
I first import this .txt file as:
#!/usr/bin/env python
from sklearn.decomposition import PCA
inputfile=vcf=open('sample_input_file', 'r')
I would like to perform principal component analysis and plot the first two components (meaning the first two columns).
I am not sure if this is the way to go about it, after reading about sklearn. PCA for two components:
pca = PCA(n_components=2)
pca.fit(inputfile)  # not sure how this reads in the file
Therefore, I need help importing my input file as a dataframe for Python to perform PCA on it.
sklearn works with numpy arrays.
So you want to use numpy.loadtxt:
import numpy
data = numpy.loadtxt('sample_input_file', skiprows=1)  # skiprows=1 skips the header row
pca = PCA(n_components=2)
pca.fit(data)
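To also plot the first two components, which is what the question asks for, fit_transform returns the projected coordinates; a minimal sketch, assuming matplotlib is available:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

data = np.loadtxt('sample_input_file', skiprows=1)  # skip the header row of column names

pca = PCA(n_components=2)
coords = pca.fit_transform(data)                    # rows projected onto the first two components

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()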
