How to split one column into two columns in python?

I have a contig file loaded in pandas like this:
>NODE_1_length_4014_cov_1.97676
1 AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
2 CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
3 CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
4 GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
5 CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
6 GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
7 GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
8 TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
9 GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
10 AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
11 GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
12 CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
13 TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
14 AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
15 ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
16 CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
17 TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
18 ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
19 TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
20 GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
21 GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
22 CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
23 ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
24 ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
25 CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
26 AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
27 TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
28 GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
29 CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
...
8540 >NODE_2518_length_56_cov_219
8541 CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
8542 >NODE_2519_length_56_cov_174
8543 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8544 >NODE_2520_length_56_cov_131
8545 CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
8546 >NODE_2521_length_56_cov_118
8547 GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
8548 >NODE_2522_length_56_cov_96
8549 CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
8550 >NODE_2523_length_56_cov_74
8551 AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
8552 >NODE_2524_length_56_cov_70
8553 TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
8554 >NODE_2525_length_56_cov_59
8555 GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
8556 >NODE_2526_length_56_cov_48
8557 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8558 >NODE_2527_length_56_cov_44
8559 CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
8560 >NODE_2528_length_56_cov_42
8561 GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
8562 >NODE_2529_length_56_cov_38
8563 GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
8564 >NODE_2530_length_56_cov_29
8565 GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
8566 >NODE_2531_length_56_cov_26
8567 AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
8568 >NODE_2532_length_56_cov_25
8569 GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...
How can I split this one column into two, putting the >NODE_...... identifiers in one column and the corresponding sequence in another? Another issue is that each sequence spans multiple lines; how do I join them into one string? The expected result looks like this:
contig sequence
NODE_1_length_4014_cov_1.97676 AAAAAAAAAAAAAAA
NODE_........ TTTTTTTTTTTTTTT
Thank you very much.

I can't reproduce your example, but my guess is that you are loading a file with pandas that is not in a tabular format. From your example it looks like your file is formatted like this:
>Identifier
sequence
>Identifier
sequence
You would have to parse the file before you can put the information into a pandas DataFrame. The logic is to loop through each line of the file: if the line starts with '>', store it as an identifier; otherwise, concatenate it onto the current sequence value. Something like this:
import pandas as pd

testfile = '>NODE_1_length_4014_cov_1.97676\nAAAAAAAATTTTTTCCCCCCCGGGGGG\n>NODE_2518_length_56_cov_219\nAAAAAAAAGCCCTTTTT'.split('\n')

identifiers = []
sequences = []
current_sequence = ''
for line in testfile:
    if line.startswith('>'):
        identifiers.append(line)
        if current_sequence:  # flush the previous record's sequence
            sequences.append(current_sequence)
        current_sequence = ''
    else:
        current_sequence += line.strip()
sequences.append(current_sequence)  # don't drop the last record

df = pd.DataFrame({'identifiers': identifiers,
                   'sequences': sequences})
Whether this code works depends on the details of your input, which you didn't provide in full, but it should get you started.
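For what it's worth, if Biopython happens to be installed, its FASTA parser already does the header/sequence grouping for you. A minimal sketch, assuming the contigs are saved in a plain FASTA file (the name 'contigs.fasta' is made up):
from Bio import SeqIO
import pandas as pd

# SeqIO strips the leading '>' and joins multi-line sequences;
# 'contigs.fasta' is a hypothetical file name
records = SeqIO.parse('contigs.fasta', 'fasta')
df = pd.DataFrame(
    [{'contig': rec.id, 'sequence': str(rec.seq)} for rec in records]
)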

Related

How to add column numbers to each column in a large text file

I would like to add column numbers to 128 columns in a text file
E.g.
My file
12 13 14 15
20 21 23 14
34 56 67 89
Required output
1:12 2:13 3:14 4:15
1:20 2:21 3:23 4:14
1:34 2:56 3:67 4:89
Can this be done using awk / python?
I tried the paste command to join two files: one with the values, the other with column numbers typed manually. Since the file is very large, manual typing didn't work.
As far as I could find, existing answers only cover adding one column to the end of a text file.
Thanks for the suggestions
awk to the rescue!
$ awk '{for(i=1;i<=NF;i++) $i=i":"$i}1' file
should do.
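Since the question mentions python as well, here is a rough equivalent of the awk one-liner, assuming whitespace-separated columns (the file name 'file.txt' is a placeholder):
# prefix each whitespace-separated field with its 1-based column number
with open('file.txt') as f:
    for line in f:
        fields = line.split()
        print(' '.join('{0}:{1}'.format(i, v)
                       for i, v in enumerate(fields, start=1)))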

How to parse data from .TX0 file into dataframe

Hi, I'm trying to convert a .TX0 file produced by a chromatogram. The file is just a bunch of results, including retention times etc. I eventually want to pick certain pieces of data from multiple files and do some analysis. So far I have:
filename = 'filepath'
f = open(filename, 'r')
lines = f.readlines()
print lines
My output is:
Out[29]:
[....................................................
'"condensate analysis (HP4890 Optic - FID)"\n',
'"Peak","Component","Time","Area","Height","BL"\n',
'"#","Name","[min]","[uV*sec]","[uV]",""\n',
'------,------,------,------,------,------\n',
'1,"Methane",0.689,5187666.22,994337.57,*BB\n',
'2,"Ethane",1.061,1453339.93,729285.09,*BB\n',
'3,"Propane",1.715,193334.09,63398.74,*BB\n',
'4,"i-Butane",2.792,157630.92,29233.56,*BV\n',
'5,"n-Butane",3.240,98943.96,15822.72,*VB\n',
'"","","",------,------,""\n',
'"","","",7090915.11,1.83e+06,""\n',
'"Missing Component Report"\n',
'"Component","Expected Retention (Calibration File)"\n',
'------,------\n',
'"All components were found"\n',
'"Report stored in ASCII file :","...
"\n'.......................]
Now, the problem I'm having: I can't get this output into a structured dataframe using pandas. I've tried, and it just gives me two columns...
pd.DataFrame(filename)
out:
Out[26]:
0
0 "=============================================...
1 "Software Version:",6.3.2.0646,"Date:","08/06/...
2 "Reprocess Number:","vma2: ......................
.......................
10 ""\n
11 ""\n
12 "condensate analysis (HP4890 Optic - FID)"\n
13 "Peak","Component","Time","Area","Height","BL"\n
14 "#","Name","[min]","[uV*sec]","[uV]",""\n
15 ------,------,------,------,------,------\n
16 1,"Methane",0.689,5187666.22,994337.57,*BB\n
17 2,"Ethane",1.061,1453339.93,729285.09,*BB\n
18 3,"Propane",1.715,193334.09,63398.74,*BB\n
19 4,"i-Butane",2.792,157630.92,29233.56,*BV\n
20 5,"n-Butane",3.240,98943.96,15822.72,*VB\n
21 "","","",------,------,""\n
22 "","","",7090915.11,1.83e+06,""\n
23 "Missing Component Report"\n
24 "Component","Expected Retention (Calibration F...
25 ------,------\n
26 "All components were found"\n
27 "Report stored in ASCII file :","C:\Shared Fol...

How to combine header files with data files with python?

I have separate files. One set contains only header info, like the examples shown below:
~content of "header1.txt"~
a 3
b 2
c 4
~content of "header2.txt"~
a 4
b 3
c 5
~content of "header3.txt"~
a 1
b 7
c 6
And another set contains only data, as shown below:
~content of "data1.txt"~
10 20 30 40
20 14 22 33
~content of "data2.txt"~
11 21 31 41
21 24 12 23
~content of "data3.txt"~
21 22 11 31
10 26 14 33
After combining the corresponding files, the results should look like the examples below:
~content of "asc1.txt"~
a 3
b 2
c 4
10 20 30 40
20 14 22 33
~content of "asc2.txt"~
a 4
b 3
c 5
11 21 31 41
21 24 12 23
~content of "asc3.txt"~
a 1
b 7
c 6
21 22 11 31
10 26 14 33
Can anyone give me some help with writing this in Python? Thanks!
If you really want it in Python, here is one way to do it:
for i in range(1, 4):
    h = open('header{0}.txt'.format(i), 'r')
    d = open('data{0}.txt'.format(i), 'r')
    a = open('asc{0}.txt'.format(i), 'a')
    hdata = h.readlines()
    ddata = d.readlines()
    a.writelines(hdata + ddata)
    h.close()
    d.close()
    a.close()
This assumes, of course, that there are three of each kind of file and that they all follow the naming convention you used.
Try this (written in Python 3.4 IDLE). It's pretty long, but should be easier to understand:
# can start by creating a function to read the contents of
# each file and return the contents as a string
def readFile(file):
    contentsStr = ''
    for line in file:
        contentsStr += line
    return contentsStr

# Read all the header files header1, header2, header3
header1 = open('header1.txt', 'r')
header2 = open('header2.txt', 'r')
header3 = open('header3.txt', 'r')

# Read all the data files data1, data2, data3
data1 = open('data1.txt', 'r')
data2 = open('data2.txt', 'r')
data3 = open('data3.txt', 'r')

# Open/create output files asc1, asc2, asc3
asc1_outFile = open('asc1.txt', 'w')
asc2_outFile = open('asc2.txt', 'w')
asc3_outFile = open('asc3.txt', 'w')

# read the contents of each header file and data file into string variables
header1_contents = readFile(header1)
header2_contents = readFile(header2)
header3_contents = readFile(header3)
data1_contents = readFile(data1)
data2_contents = readFile(data2)
data3_contents = readFile(data3)

# Append the contents of each data file to its corresponding header file
asc1_contents = header1_contents + '\n' + data1_contents
asc2_contents = header2_contents + '\n' + data2_contents
asc3_contents = header3_contents + '\n' + data3_contents

# now write the results to the asc1.txt, asc2.txt, and
# asc3.txt output files respectively
asc1_outFile.write(asc1_contents)
asc2_outFile.write(asc2_contents)
asc3_outFile.write(asc3_contents)

# close the file streams
header1.close()
header2.close()
header3.close()
data1.close()
data2.close()
data3.close()
asc1_outFile.close()
asc2_outFile.close()
asc3_outFile.close()
# done!
By the way, ensure that the header files and data files are in the same directory as the Python script. Otherwise, simply edit the file paths in the code above accordingly. The output files asc1.txt, asc2.txt, and asc3.txt will be created in the same directory as your Python source file.
This works if the number of header files is equal to the number of data files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []
# Traverse the files and collect their contents
# (sorted so header N lines up with data N; glob order is not guaranteed)
for files1 in sorted(glob.glob("directory/header*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory/data*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)
# Write the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()
Edit
This method will only work if the header files and data files are in two different folders and there are no files other than the header or data files in those folders:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []
# Traverse the files and collect their contents
# (again sorted, so the Nth header pairs with the Nth data file)
for files1 in sorted(glob.glob("directory1/*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory2/*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)
# Write the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()
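For what it's worth, zipping the two sorted path lists avoids the index arithmetic entirely; a sketch under the same two-folder assumption:
import glob

headers = sorted(glob.glob("directory1/*.txt"))
datas = sorted(glob.glob("directory2/*.txt"))
# pair the Nth header with the Nth data file and write asc1.txt, asc2.txt, ...
for i, (hpath, dpath) in enumerate(zip(headers, datas), start=1):
    with open(hpath) as h, open(dpath) as d, \
         open("directory/asc{0}.txt".format(i), "w") as out:
        out.write(h.read() + "\n\n" + d.read())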

compare if a value exists in a csv file

I have a 200 MB CSV file and a 4 GB JSON file that is compressed (300 MB in compressed form). Now I need to check whether a particular field in the JSON has a value that matches any of the values in column 0 of the CSV file. How can this be done quickly, given that I have to do this for multiple JSON files while the CSV file stays the same? I hope using pandas would speed things up.
After reading the CSV file, the following data structure is formed:
Empty DataFrame
Columns: []
Index: [1335063, 1339033, 1344453, 1392603, 1520033, 5342858, 5361498, 5534501, 5542881, 5552665, 5618397, 5824472, 5867442, 5908134, 5908134, 6203501, 6208411, 6209921, 6211681, 6212831, 6213691, 6287061, 6293811, 6387151, 6415771, 6508691, 6649281, 6673261, 6716441, 6782181, 6821631, 7710551, 9413871, 11280941, 11285381, 11762751, 11769381, 11854271, 11964831, 11995871, 12240091, 12541201, 12553471, 12633891, 12648021, 12834201, 12899581, 13177041, 13282401, 13290581, 13292951, 13297681, 14536901, 14592891, 14665721, 14843571, 15120821, 15127231, 15531511, 15969981, 16648561, 16808911, 16809381, 17019781, 17021721, 17224241, 17234921, 17327321, 17923721, 17930901, 18577181, 18606681, 19448911, 19557541, 20272801, 20286621, 20295001, 20351761, 21052471, 21062651, 21106501, 21578741, 22279401, 22312931, 23078211, 23164911, 24937351, 24988721, 26171811, 26188561, 26224001, 26379241, 26380531, 26383571, 26386251, 26388621, 27509171, 27825771, 28282901, 28998561, ...]
Now the data to be read from the gzip file will be a JSON string, and I can convert it with read_json. But I don't see how to check whether the field 'id' in the JSON is present in the list shown here.
This should get you started:
import numpy as np
import pandas

magic_value = 11
# random_integers is deprecated/removed in modern NumPy;
# randint's upper bound is exclusive, so use 13 to include 12
df = pandas.DataFrame(np.random.randint(0, 13, size=(10, 2)))
#     0   1
# 0   1   1
# 1   5   3
# 2  12  12
# 3  12   8
# 4  11   4
# 5  11  12
# 6   9   7
# 7   7   1
# 8   0  11
# 9   2   1
magic_value in df[0].values
# True
So just read in the JSON data with pandas.read_json, get the value you want (pandas indexing docs), and go to town.
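To sketch the whole workflow under the question's setup (the file names 'values.csv' and 'records.json.gz' are placeholders, and the JSON is assumed to parse into a frame with an 'id' column):
import gzip
import pandas as pd

# column 0 of the CSV holds the ids; a set gives O(1) membership tests
ids = set(pd.read_csv('values.csv', usecols=[0], header=None)[0])

with gzip.open('records.json.gz', 'rt') as f:
    records = pd.read_json(f)

# keep only the records whose 'id' appears in the CSV's first column
matches = records[records['id'].isin(ids)]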

How do I match a specific number against a number set efficiently?

I have a number set which contains 2375013 unique numbers in a txt file. The data looks like this:
11009
900221
2
3
4930568
293
102
I want to match a number from each line of another data set against this number set, in order to extract the data I need. So I coded it like this:
def get_US_users_IDs(filepath, mode):
    IDs = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.strip()
            for id in sp:
                IDs.append(id.lower())
    return IDs

# ... later in the script:
IDs = "|".join(get_US_users_IDs('/nas/USAuserlist.txt', 'r'))
matcher = re.compile(IDs)
if matcher.match(user_id):
    number_of_US_user += 1
    text = tweet.split('\t')[3]
But it takes a lot of time to run. Are there any ideas for reducing the run time?
What I understand is that you have a huge number of ids in a file and you want to know whether a specific user_id is in that file.
You can use a Python set.
fd = open(filepath, mode)
IDs = set(int(id) for id in fd)
...
if user_id in IDs:
    number_of_US_user += 1
...
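A self-contained version of that idea, with the incoming user ids made up for illustration:
# build the lookup set once; membership tests are then O(1) on average
with open('/nas/USAuserlist.txt') as fd:
    IDs = set(int(line) for line in fd if line.strip())

number_of_US_user = 0
for user_id in (11009, 42, 293):  # hypothetical incoming ids
    if user_id in IDs:
        number_of_US_user += 1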
