How to reconstruct and change structure of a dataset using python?

I have a dataset and need to convert its data to a new layout.
The dataset looks like this (stored in a file named train1.txt):
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
I need to convert it to the format below and store the result in a new file named train.txt:
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And so on for the remaining numbers.
My Python version is 2.7.13.
My operating system is Ubuntu 14.04 LTS.
I would appreciate any help.
Thank you so much.

I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.
import re

def return_no_commas(string):
    # \d+ matches each run of one or more digits
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)

numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""
return_no_commas(numbers)
Let me explain what everything does.
import re
just imports the regular expression module. The regular expression is
regex = r'\d+'
The "r" prefix makes it a raw string, so the backslash reaches the regex engine unchanged. "\d" matches any digit and "+" says it must occur one or more times. (With "*" instead of "+", the pattern would also match the empty string between numbers, and findall would return many empty matches.) Then we print out all the matches.
I saved your numbers in a string called numbers, but you could just as easily read in a file and work with its contents.
You'll get something like:
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
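
The answer mentions that the numbers could be read from a file instead of a string. A minimal sketch of that, assuming the file names from the question (train1.txt and train.txt); the helper names extract_numbers and convert_file are made up for this illustration:

```python
import re


def extract_numbers(text):
    """Return every run of digits in text, in order of appearance."""
    return re.findall(r"\d+", text)


def convert_file(src_path, dst_path):
    """Read comma-separated numbers from src_path and write one per line."""
    with open(src_path) as src:
        numbers = extract_numbers(src.read())
    with open(dst_path, "w") as dst:
        dst.write("\n".join(numbers) + "\n")
    return len(numbers)


# Example call (assumes train1.txt exists in the current directory):
# convert_file("train1.txt", "train.txt")
```

This works the same way on Python 2.7 and Python 3, since it only relies on re.findall and plain file objects.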

It sounds to me like your original data is separated by commas. However, you want the data separated by new-line characters (\n) instead. This is very easy to do.
def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert rfilename != wfilename
    # open two files, one in read-mode
    # the other in write-mode
    rfile = open(rfilename, "r")
    wfile = open(wfilename, "w")
    # read the file into a string
    rstryng = rfile.read()
    lyst = rstryng.split(",")
    # EXAMPLE:
    #     rstryng == "1,2,3,4"
    #     lyst == ["1", "2", "3", "4"]
    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]
    wstryng = "\n".join(lyst)
    # write() is the right call for a single string
    wfile.write(wstryng + "\n")
    rfile.close()
    wfile.close()

convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`

Since others have added answers, I will include one using numpy.
If you are OK with using numpy, it is as simple as:
import numpy as np
data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')
If you want a list instead of a numpy array,
data.tolist()
[2342728,
2414939,
2397722,
2386848,
2398737,
2367906,
2384003,
2399896,
....
]
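
The numpy answer loads the data but does not write the new file the question asks for. One possible way to finish the job is np.savetxt with an integer format; this is a sketch that assumes the input holds a single comma-separated row, and the function name convert_with_numpy is hypothetical:

```python
import numpy as np


def convert_with_numpy(src_path, dst_path):
    """Load comma-separated integers and save them one per line."""
    data = np.genfromtxt(src_path, dtype=int, delimiter=",")
    # A file with a single value yields a 0-d array; make it 1-d
    data = np.atleast_1d(data)
    # fmt="%d" writes plain integers; a 1-d array gives one value per line
    np.savetxt(dst_path, data, fmt="%d")
    return data


# Example call (assumes train1.txt exists in the current directory):
# convert_with_numpy("train1.txt", "train.txt")
```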

Related

python alternative for awk?

I have two fasta files, and I want to search for sequence IDs and assign only the sequence corresponding to the ID to a string in Python.
I currently have:
import os
#use awk on the command line to search reference file and cut the reference sequence
os.system("awk '/LOC_OS05G45410.1/{getline;print}' Ref_seqs.fasta > sangerRef")
#use awk on the command line to cut the aligned sequence
os.system("awk '/seq1/{getline;print}' Sanger_seq_1.fasta > sangerAlign")
Ref_seq = open('sangerRef', 'r').read()
Sanger_seq = open('sangerAlign', 'r').read()
When I print these variables, everything looks fine:
TGGTGAGGCTTTTGACAGGGTTGAGCTGAGCCTGGTCTCCCTGGAGAAACTCTTCCAGAGAGCAAATGATGCTTGCACAGCTGCTGAAGAAATGTACTCCCATGGTCATGGTGGTACTGAACCCAG
CTGCTGCCCAAGTACTTCAAGCACAACAACTTCTCCAGCTTCATCAGGCAGCTCAACGCCTACGGTTTCCGAAAAATCGATCCTGAGAGATGGGAGTTCGCAAACGAGGATTTCATAAGAGGGCACACGCACCTT
However, when I try to read these variables into another function, it doesn't work:
from Bio import pairwise2
from Bio.Align import substitution_matrices
#load sequences
s1=Ref_seq
s2=Sanger_seq
matrix = substitution_matrices.load("NUC.4.4")
gap_open = -10
gap_extend = -0.5
align = pairwise2.align.globalds(s1, s2, matrix, gap_open, gap_extend)
align
I'm thinking it might be better to replace the awk command with a Python command?

I think it's because you haven't parsed the sequences. I don't know if I am using the word 'Parse' right, though.
I think this should work
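from Bio import SeqIO, pairwise2
from Bio.Align import substitution_matrices
# SeqIO.read expects exactly one record in each file
s1 = SeqIO.read('filepath/filename.fasta','fasta')
s2 = SeqIO.read('filepath/file.fasta','fasta')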
matrix = substitution_matrices.load("NUC.4.4")
gap_open = -10
gap_extend = -0.5
align = pairwise2.align.globalds(s1.seq, s2.seq, matrix, gap_open, gap_extend)
align

The immediate problem is that read() returns all the lines with a newline at the end of each.
But indeed, your Awk commands should be trivial to replace with native Python.
def getseq(filename, search):
    with open(filename) as reffile:
        for line in reffile:
            if search in line:
                # the sequence is on the line right after the matching header
                return next(reffile).rstrip('\n')
s1 = getseq("Ref_seqs.fasta", "LOC_OS05G45410.1")
s2 = getseq("Sanger_seq_1.fasta", "seq1")
Probably BioPython already contains a better function for doing this. In particular, your Awk script (and hence this blind reimplementation) assumes that each sequence only occupies one line in the file.
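
As noted, the Awk one-liner and its Python port only handle sequences that fit on one line. A plain-Python sketch that also copes with sequences wrapped over several lines, assuming the usual FASTA convention that header lines start with ">"; the function name find_sequence is made up for this example and it takes any iterable of lines, such as an open file:

```python
def find_sequence(lines, search):
    """Return the sequence of the first record whose header contains search.

    Returns None when no header matches.
    """
    collecting = False
    parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if collecting:
                break  # reached the next record, so the sequence is complete
            collecting = search in line
        elif collecting and line:
            parts.append(line)
    return "".join(parts) if collecting else None


# Example call (assumes the file from the question exists):
# with open("Ref_seqs.fasta") as handle:
#     s1 = find_sequence(handle, "LOC_OS05G45410.1")
```

Because the pieces are joined without newlines, the result can be passed to the alignment call directly.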

Python DNA sequence slice gives \N as wrong content in slice result

I am surprised by this: I am using Python to slice a long DNA sequence (4699673 characters) into substrings of a specific length. It works, but there is a problem in the result: after 71 good results, \n starts to appear in the result for a few slices, then the slices are correct again, and so on through the whole file.
The code:
import sys
filename = open("out_filePU.txt",'w')
sys.stdout = filename
my_file = open("GCF_000005845.2_ASM584v2_genomic_edited.fna")
st = my_file.read()
length = len(st)
print ( 'Sequence Length is, :' ,length)
for i in range(0,len(st[:-9])):
    print(st[i:i+9], i)
A figure showed the error in the result file.
Please, I need advice on this.

Your sequence file contains multiple lines, and at the end of each line there is a line break \n. You can remove them with st = my_file.read().replace("\n", "").

Try st = re.sub('\\s', '', my_file.read()) to replace any newlines or other whitespace (you'll need to add import re at the top of your script).
Then for i in range(0,len(st[:-9]),9): to step through your data in increments of nine characters. Otherwise you're only advancing by one character each time: that's why you can see the diagonal patterns in your output.
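
Both suggestions can be combined in one small helper. This is only a sketch: the name kmers and its parameters are made up here, and unlike the loop in the question it also returns the final full-length window.

```python
def kmers(sequence, k=9, step=1):
    """Return the length-k slices of sequence after removing all whitespace.

    step=1 gives overlapping windows; step=k gives consecutive chunks.
    """
    cleaned = "".join(sequence.split())  # drops newlines, spaces and tabs
    return [cleaned[i:i + k] for i in range(0, len(cleaned) - k + 1, step)]


# Example call (assumes the file from the question exists):
# with open("GCF_000005845.2_ASM584v2_genomic_edited.fna") as my_file:
#     for i, piece in enumerate(kmers(my_file.read())):
#         print(piece, i)
```

Using the step argument makes the choice between a sliding window and fixed chunks explicit instead of leaving it buried in the range() call.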

How to remove brackets and the contents inside from a file

I have a file named sample.txt which looks like below
ServiceProfile.SharediFCList[1].DefaultHandling=1
ServiceProfile.SharediFCList[1].ServiceInformation=
ServiceProfile.SharediFCList[1].IncludeRegisterRequest=n
ServiceProfile.SharediFCList[1].IncludeRegisterResponse=n
Here my requirement is to remove the brackets and the integer inside them, and then build OS commands from the result:
ServiceProfile.SharediFCList.DefaultHandling=1
ServiceProfile.SharediFCList.ServiceInformation=
ServiceProfile.SharediFCList.IncludeRegisterRequest=n
ServiceProfile.SharediFCList.IncludeRegisterResponse=n
I am quite a newbie in Python. This is my first attempt. I have used these codes to remove the brackets:
#!/usr/bin/python
import re
import os
import sys
f = os.open("sample.txt", os.O_RDWR)
ret = os.read(f, 10000)
os.close(f)
print ret
var1 = re.sub(r"[\(\[].*?[\)\]]", "", ret)
print var1
f = open("removed.cfg", "w+")
f.write(var1)
f.close()
After this using the file as input I want to form application specific commands which looks like this:
cmcli INS "DefaultHandling=1 ServiceInformation="
and the next set as
cmcli INS "IncludeRegisterRequest=n IncludeRegisterRequest=y"
So basically I want all the output to be grouped into sets of two, so that I can execute the commands on the operating system.
Is there any way to group them in pairs?

Reading 10,000 bytes of text into a string is really not necessary when your file is line-oriented text, and isn't scalable either. And you need a very good reason to be using os.open() instead of open().
So, treat your data as the lines of text that it is, and every two lines, compose a single line of output.
from __future__ import print_function
import re
command = [None,None]
cmd_id = 1
bracket_re = re.compile(r".+\[\d\]\.(.+)")
# This doesn't just remove the brackets: what you actually seem to want is
# to pick out everything after [1]. and ignore the rest.
with open("removed_cfg","w") as outfile:
    with open("sample.txt") as infile:
        for line in infile:
            m = bracket_re.match(line)
            cmd_id = 1 - cmd_id # gives 0, 1, 0, 1
            command[cmd_id] = m.group(1)
            if cmd_id == 1: # we have a pair
                output_line = """cmcli INS "{0} {1}" """.format(*command)
                print (output_line, file=outfile)
This gives the output
cmcli INS "DefaultHandling=1 ServiceInformation="
cmcli INS "IncludeRegisterRequest=n IncludeRegisterResponse=n"
The second line doesn't correspond to your sample output. I don't know how the input IncludeRegisterResponse=n is supposed to become the output IncludeRegisterRequest=y. I assume that's a mistake.
Note that this code depends on your input data being precisely as you describe it and has no error checking whatsoever. So if the format of the input is in reality more variable than that, then you will need to add some validation.
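
One way the validation mentioned above might look, as a sketch only. The function name build_commands is hypothetical, the pattern accepts multi-digit indexes (\d+), and raising ValueError on bad input is an assumption about what the caller wants:

```python
import re

# Same idea as the pattern above, but allowing indexes such as [12]
BRACKET_RE = re.compile(r".+\[\d+\]\.(.+)")


def build_commands(lines):
    """Pair up the settings found in lines and return the command strings."""
    settings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        m = BRACKET_RE.match(line)
        if m is None:
            raise ValueError("unexpected line: %r" % line)
        settings.append(m.group(1))
    if len(settings) % 2:
        raise ValueError("odd number of settings; the last one has no partner")
    pairs = zip(settings[0::2], settings[1::2])
    return ['cmcli INS "%s %s"' % pair for pair in pairs]


# Example call (assumes sample.txt exists):
# with open("sample.txt") as infile:
#     for command in build_commands(infile):
#         print(command)
```

Collecting the settings first and pairing them afterwards keeps the check for an unpaired last line in one place.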

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file = open("outputfile.txt", "w")
    for line in a:
        pass  # the attempt stops here

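Without regex (Python 3, because of the starred unpacking):
with open('RawDataFile_445.txt') as file:
    for line in file:
        keep = []
        line = line.strip()
        # skip the header line and any blank lines
        if not line or line.startswith('FRAME'):
            continue
        first, second, *_ = line.split()
        keep.append(first)
        first, second = second.split('=')
        keep.extend(second.split(','))
        print(' '.join(keep))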

My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()  # skip the FRAME header line
    for line in a.readlines():
        result = r.match(line)
        if result:
            print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.

Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
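
The pattern described above can be tried out as follows. This is a sketch: the leading \s* is added to tolerate indentation, and the function name convert_frame_lines is made up for the illustration.

```python
import re

# Three capture groups: the row number and the two values after "J="
FRAME_RE = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")


def convert_frame_lines(lines):
    """Return 'n a b' strings for every line that matches the pattern."""
    rows = []
    for line in lines:
        m = FRAME_RE.match(line)
        if m:  # lines such as the FRAME header simply do not match
            rows.append(" ".join(m.groups()))
    return rows


# Example call (assumes the input file from the question exists):
# with open("RawDataFile_445.txt") as infile:
#     print("\n".join(convert_frame_lines(infile)))
```

Since match() only anchors at the start of the line, the SEC, NSEG and ANG fields after the captured part are ignored without having to be spelled out in the pattern.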

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns, column 4 for entity 2 would be "column\"this is, a test\"4b", column 2 for entity 3 would be "column2c". Each column begins with a quote and closes with a quote, however you must be careful because some columns have escaped quotes. Thanks in advance!

You could do it like this:
Read the whole file.
Split the input on each newline character that is not preceded by a comma.
Iterate over the split elements and split each one again on the comma (plus an optional following newline character) that is preceded and followed by double quotes.
Code:
import re
with open("f") as f:  # "f" is the input file name
    fil = f.read()
m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
    print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
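Here f is the input file name and f.py is the file that contains the Python script. Note that the check file has no line break in the middle of a column; with a break inside a column, as in the question's sample, the first split would treat that break as an entity boundary.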

Your problem is terribly similar to what I have to deal with three times every month :) I don't use Python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column (one character or one escape at a
# time, which avoids a nested quantifier)
check = re.compile(r'"(?:[^"\\]|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print, if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print(matches)
        else:
            # use line.strip() if you need to trim whitespaces
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespaces
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete and print
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print(matches)
            buffer = ""
        # Optional; always good to have a safety backdoor though
        # If there is a problem with the csv itself like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print("Error: cannot parse line:\n" + buffer)
            buffer = ""
