Read text file of protein sequences in Python

I am trying to read DNA sequences into a pandas DataFrame, but I am not getting the whole sequence in the DataFrame column.
I have tried the open() method and a simple read_csv call; neither helped much.
pd.read_csv('../input/data 1/non-cpp.txt', index_col=0, header=None)
Output:
0
>
GNNRPVYIPQPRPPHPRI
>
HGVSGHGQHGVHG
>
myfile = open("../input/data 1/non-cpp.txt")
for line in myfile:
    print(line)
myfile.close()
>
GNNRPVYIPQPRPPHPRI
>
HGVSGHGQHGVHG
>
QRFSQPTFKLPQGRLTLSRKF
>
FLPVLAGIAAKVVPALFCKITKKC
Dataset format: a label line followed by a long sequence (string).
I need the labels in one column and the whole sequence in a second column, e.g.:
Label
Sequence

This is rough and not a one-liner, but it will give you what you need: a Series with the DNA sequences.
import pandas as pd
data = pd.read_csv('cpp.txt', sep=">",header=None)
data[0].dropna()
I hope it helps

Let's say your file is something like:
>a1|b1|c1
a111
>a2|b2|c2
a222
>a3|b3|c3
a333
Note that here we have 6 lines.
Then, you can read the file, and store the data:
import pandas as pd

with open('filename.txt', 'r') as f:
    content = f.readlines()

n = len(content)
label = [content[i].strip() for i in range(0, n, 2)]
seq = [content[i].strip() for i in range(1, n, 2)]
df = pd.DataFrame({'label': label,
                   'sequence': seq})
and you get a pandas dataframe:
label sequence
0 >a1|b1|c1 a111
1 >a2|b2|c2 a222
2 >a3|b3|c3 a333
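Note that the pairing above assumes every sequence fits on a single line. If a sequence may wrap across several lines, as is common in FASTA files, a small accumulator loop is safer. A sketch, assuming the same `>`-prefixed label format (`read_fasta` is a hypothetical helper name):

```python
import pandas as pd

def read_fasta(path):
    """Parse a FASTA-style file: '>' lines are labels, any other
    lines are (possibly wrapped) sequence data."""
    labels, seqs = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                labels.append(line.lstrip('>'))
                seqs.append('')          # start a new sequence
            else:
                seqs[-1] += line         # append continuation lines
    return pd.DataFrame({'label': labels, 'sequence': seqs})
```

This handles both one-line and multi-line sequences without changing the output shape.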

Related

Read a txt file whose separator is a space

I want to read a text file. The file is like this:
17430147 17277121 17767569 17352501 17567841 17650342 17572001
I want the result:
17430147
17277121
17767569
17352501
17567841
17650342
17572001
So, I tried some code:
data = pd.read_csv('train.txt', header=None, delimiter=r"\s+")
or
data = pd.read_csv('train.txt', header=None, delim_whitespace=True)
Those attempts give an error like this:
ParserError: Too many columns specified: expected 75262 and found 154
Then I tried this code:
file = open("train.txt", "r")
data = []
for i in file:
    i = i.replace("\n", "")
    data.append(i.split(" "))
But I think there are missing values in the txt file:
'2847',
'2848',
'2849',
'1947',
'2850',
'2851',
'2729',
''],
['2852',
'2853',
'2036',
Thank you!
The first step would be to read the text file as a string of values.
with open('train.txt', 'r') as f:
    lines = f.readlines()
list_of_values = lines[0].split(' ')
Here, list_of_values looks like:
['17430147',
'17277121',
'17767569',
'17352501',
'17567841',
'17650342',
'17572001']
Now, to create a DataFrame out of this list, simply execute:
import pandas as pd
pd.DataFrame(list_of_values)
This will give a pandas DataFrame with a single column with values read from the text file.
If you just need the values that exist in the text file, the list list_of_values can be used directly.
You can use the .T method to transpose your DataFrame.
data = pd.read_csv("train.txt", delim_whitespace=True).T
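If the file actually spans several lines of unequal length, which would explain the ParserError above, another option is to read every line, split on whitespace, and flatten everything into a single column. A rough sketch (the function name and file path are illustrative):

```python
import pandas as pd

def read_all_values(path):
    # split() with no argument collapses runs of whitespace and drops
    # the empty strings that split(" ") leaves behind
    with open(path) as f:
        values = [v for line in f for v in line.split()]
    return pd.DataFrame(values, columns=['value'])
```

This sidesteps the column-count check entirely, because pandas never sees the ragged rows.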

Wrong row count for CSV file in python

I am processing a csv file and before that I am getting the row count using the below code.
total_rows=sum(1 for row in open(csv_file,"r",encoding="utf-8"))
The code has been written with the help given in this link.
However, total_rows doesn't match the actual number of rows in the CSV file. I have found an alternative way to do it, but I would like to know why this approach is not working correctly.
In the CSV file, there are cells with huge text and I have to use the encoding to avoid errors reading the csv file.
Any help is appreciated!
Let's assume you have a CSV file in which some cells contain multi-line text.
$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"
which, by the look of it, has three lines, and wc -l agrees:
$ wc -l example.csv
3 example.csv
And so does open with sum:
sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3
But if you read it with a CSV parser such as pandas.read_csv:
import pandas as pd
df = pd.read_csv('./example.csv')
df
colA colB
0 1 Hi. This is Line 1.\nAnd this is Line2
An alternative way to fetch the correct number of rows is given below:
import csv

with open(csv_file, "r", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
row_count = len(data)
Excluding the header, the csv contains 1 line, which I believe is what you expect.
This is because colB's first cell (a.k.a. huge text block) is now properly handled with the quotes wrapping the entire text.
I think the problem here is that you are not counting rows but newlines (either \r\n on Windows or \n on Linux). The problem arises when a cell contains text with a newline character, for example:
1, "my huge text\n with many lines\n"
2, "other text"
Your method will return 4 for the data above when there are actually only 2 rows.
Try pandas or another library for reading CSV files. Example:
import pandas as pd
data = pd.read_csv(pathToCsv, sep=',', header=None)
number_of_rows = len(data.index)  # or data[0].count()
Note that len(data.index) and data[0].count() are not interchangeable, as count() excludes NaNs.
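A minimal illustration of that difference:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1.0, np.nan, 3.0])

len(df.index)   # 3: counts every row, NaN included
df[0].count()   # 2: count() skips NaN
```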

convert .dat into .csv in python

I want to convert a dataset in a .dat file into a CSV file. The data format looks like this:
Each row begins with the sentiment score followed by the text associated with that rating.
I want the sentiment value (-1 or 1) in one column and the corresponding review text in another column.
WHAT I TRIED SO FAR
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import csv

# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("train.dat").readlines()]

# write it as a new CSV file
with open("train.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)

def your_func(row):
    return row['Sentiments'] / row['Review']

columns_to_keep = ['Sentiments', 'Review']
dataframe = pd.read_csv("train.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
print dataframe
Sample screenshot of the resulting train.csv: it has a comma after every word in the review.
If all your rows follow that consistent format, you can use pd.read_fwf. This is a little safer than using read_csv, in the event that your second column also contains the delimiter you are attempting to split on.
df = pd.read_fwf('data.txt', header=None,
                 widths=[2, int(1e5)], names=['label', 'text'])
print(df)
label text
0 -1 ieafxf rjzy xfxk ymi wuy
1 1 lqqm ceegjnbjpxnidygr
2 -1 zss awoj anxb rfw kgbvnl
data.txt
-1 ieafxf rjzy xfxk ymi wuy
+1 lqqm ceegjnbjpxnidygr
-1 zss awoj anxb rfw kgbvnl
As mentioned in the comments, read_csv would be appropriate here.
df = pd.read_csv('train_csv.csv', sep='\t', names=['Sentiments', 'Review'])
Sentiments Review
0 -1 alskjdf
1 1 asdfa
2 1 afsd
3 -1 sdf

How to deal with # sign smartly in the start of the csv file in python

This question seems to have been answered before, but in a different manner. I want to skip the first two lines because they are just a description, and in the third line I want to drop the # sign but not the data, because I want to read and compare that line as column names.
# some description here
# 1 is for good , 2 is bad and 3 for worse
# 0 temp_data 1 temp_flow 2 temp_record 3 temp_all
To skip lines, I know I can do something like this:
with open('kami.txt') as f:
    lines_after_2 = f.readlines()[2:]
and to read every line of a file:
def read_data(data):
    with open(data, 'rb') as f:
        data = [row for row in csv.reader(f.readlines())]
    return data
and to unit test the column names I do this:
def test_csv_read_data_headers(self):
    self.assertEqual(
        read_data(self.data)[0],
        ['temp_data 1 temp_flow 2 temp_record 3 temp_all']
    )
Since I am doing unit testing, I want to drop the # sign in the third line but keep the rest of the data, which is this:
temp_data 1 temp_flow 2 temp_record 3 temp_all
Any help will be really appreciated. Thanks a lot.
Did you try pandas?
import pandas as pd
# skip the two description lines plus the commented header line,
# and supply the column names yourself (quoted, as strings)
df = pd.read_csv("kami.txt", header=None, skiprows=3,
                 names=['temp_data', 'temp_flow', 'temp_record', 'temp_all'])
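If you would rather not hardcode the names, one option is to read the third line yourself, strip the leading #, and pass the tokens as column names. A sketch under the assumption that the file is whitespace-separated (`read_with_commented_header` is a hypothetical helper):

```python
import pandas as pd

def read_with_commented_header(path, skip=2):
    """Read a whitespace-separated file whose header line starts with '#'
    and sits after `skip` description lines."""
    with open(path) as f:
        for _ in range(skip):
            f.readline()                          # skip the description lines
        names = f.readline().lstrip('#').split()  # '# a b c' -> ['a', 'b', 'c']
    return pd.read_csv(path, skiprows=skip + 1, sep=r'\s+',
                       header=None, names=names)
```

This way the unit test can compare against whatever the file declares, instead of a hardcoded list.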

Python: extracting data values from one file with IDs from a second file

I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv with whitespace delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from file 1 and read the IDs from File 2. The IDS from file 2 can be checked in the dictionary and only the matching ones can be written to your output file. Something like this could work :
with open('data.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we want the IDs present in both data.csv and id.csv; there might be some IDs in data.csv that are not in id.csv and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
data.ID = data['ID']
id2.ID = id2['IDs']

d = []
for row in data.ID:
    d.append(row)

f = []
for row in id2.ID:
    f.append(row)

g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
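For reference, the loops above can be condensed with pandas' isin, which keeps only the rows of data whose ID appears in the ID list. A sketch assuming comma-separated files with the headers shown in the question (`filter_by_ids` is a hypothetical helper):

```python
import pandas as pd

def filter_by_ids(data_path, id_path, out_path):
    data = pd.read_csv(data_path)   # columns: ID, Values
    id2 = pd.read_csv(id_path)      # column: IDs
    # keep only the rows of data whose ID appears in the ID list
    matched = data[data['ID'].isin(id2['IDs'])]
    matched.to_csv(out_path, index=False)
    return matched
```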
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
