How to remove the redundant data from text file

How to remove the redundant data from text file - python

I have calculate distance between two atoms and save in out.txt file. The generated out file is like this.
N_TYR_A0002 O_CYS_A0037 6.12
O_CYS_A0037 N_TYR_A0002 6.12
N_ALA_A0001 O_TYR_A0002 5.34
O_TYR_A0002 N_ALA_A0001 5.34
My outfile has repeats, means same atoms and same distance.
How i can remove redundant line.
i used this program for distance calculation (all to all atoms)
from __future__ import division
from string import *
from numpy import *
def eudistance(c1,c2):
x_dist = (c1[0] - c2[0])**2
y_dist = (c1[1] - c2[1])**2
z_dist = (c1[2] - c2[2])**2
return math.sqrt (x_dist + y_dist + z_dist)
infile = open('file.pdb', 'r')
text = infile.read().split('\n')
infile.close()
text.remove('')
pdbid = []
#define the pdbid
spfcord = []
for g in pdbid:
ratom = g[0]
ratm1 = ratom.split('_')
ratm2 = ratm1[0]
if ratm2 in allatoms:
spfcord.append(g)
#print spfcord[:10]
outfile1 = open('pairdistance.txt', 'w')
for m in spfcord:
name1 = m[0]
cord1 = m[1]
for n in spfcord:
if n != '':
name2 = n[0]
cord2 = n[1]
dist = euDist(cord1, cord2)
if 7 > dist > 2:
#print name1, '\t', name2, '\t', dist
distances = name1 + '\t ' + name2 + '\t ' + str(dist)
#print distances
outfile1.write(distances)
outfile1.write('\n')
outfile1.close()

If you don't care about order:
def remove_duplicates(input_file):
with open(input_file) as fr:
unique = {'\t'.join(sorted([a1, a2] + [d]))
for a1, a2, d in [line.strip().split() for line in fr]
}
for item in unique:
yield item
if __name__ == '__main__':
for line in remove_duplicates('out.txt'):
print line
But simple check if name1 < name2 in your script before computing distance and writing data would be probably better.

Okay, I have an idea. Not pretending it is the best or cleanest way but it was fun so..
import numpy as np
from StringIO import StringIO
data_in_file = """
N_TYR_A0002, O_CYS_A0037, 6.12
N_ALA_A0001, O_TYR_A0002, 5.34
P_CUC_A0001, N_TYR_A0002, 9.56
O_TYR_A0002, N_ALA_A0001, 5.34
O_CYS_A0037, N_TYR_A0002, 6.12
N_TYR_A0002, P_CUC_A0001, 9.56
"""
# Import data using numpy, any method is okay really as we don't really on data being array's
data_in_array = np.genfromtxt(StringIO(data_in_file), delimiter=",", autostrip=True,
dtype=[('atom_1', 'S12'), ('atom_2', 'S12'), ('distance', '<f8')])
N = len(data_in_array['distance'])
pairs = []
# For each item find the repeated index
for index, a1, a2 in zip(range(N), data_in_array['atom_1'], data_in_array['atom_2']):
repeat_index = list((data_in_array['atom_2'] == a1) * (data_in_array['atom_1'] == a2)).index(True)
pairs.append(sorted([index, repeat_index]))
# Each item is repeated, so sort and remove every other one
unique_indexs = [item[0] for item in sorted(pairs)[0:N:2]]
atom_1 = data_in_array['atom_1'][unique_indexs]
atom_2 = data_in_array['atom_2'][unique_indexs]
distance = data_in_array['distance'][unique_indexs]
for i in range(N/2):
print atom_1[i], atom_2[i], distance[i]
#Prints
N_TYR_A0002 O_CYS_A0037 6.12
N_ALA_A0001 O_TYR_A0002 5.34
P_CUC_A0001 N_TYR_A0002 9.56
I should add that this assumes that every pair is repeated exactly once, and no item exists without a pair, this will break the code but could be handled by an error exception.
Note I also made changes to your input data file, using "," deliminators and added another pair to be sure the ordering wouldn't break the code.

Lets try to avoid generating the duplicates in the first place. Change this part of the code -
outfile1 = open('pairdistance.txt', 'w')
length = len(spfcord)
for i,m in enumerate(spfcord):
name1 = m[0]
cord1 = m[1]
for n in islice(spfcord,i+1,length):
Add the import :
from itertools import islice

Related

Problem with reading data from file in Python

EDIT:
Thanks for fixing it! Unfortunatelly, it messed up the logic. I'll explain what this program does. It's a solution to a task about playing cards trick. There are N cards on the table. First and Second are numbers on the front and back of the cards. The trick can only be done, if the visible numbers are in non-decreasing order. Someone from audience can come and swap places of cards. M represents how many cards will be swapped places. A and B represent which cards will be swapped. Magician can flip any number of cards to see the other side. The program must tell, if the magician can do the trick.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
n = data.readline()
n = int(n)
for _ in range(n):
first, second = (int(x) for x in data.readline().split(':'))
first, second = sorted((first, second))
pairs.append(Pair(first, second)) # add to the list by appending
m = data.readline()
m = int(m)
for _ in range(m):
a, b = (int(x) for x in data.readline().split('-'))
a -= 1
b -= 1
temp = pairs[a]
pairs[a] = pairs[b]
pairs[b] = temp
p = -1e-9
ok = True
for k in range(0, n):
if pairs[k].first >= p:
p = pairs[k].first
elif pairs[k].second >= p:
p = pairs[k].second
else:
ok = False
break
if ok:
results.write("YES\n")
else:
results.write("NO\n")
data:
4
2:5
3:4
6:3
2:7
2
3-4
1-3
results:
YES
YES
YES
YES
YES
YES
YES
What should be in results:
NO
YES

The code is full of bugs: you should write and test it incrementally instead of all at once. It seems that you started using readlines (which is a good way of managing this kind of work) but you kept the rest of the code in a reading one by one style. If you used readlines, the line for i, line in enumerate(data): should be changed to for i, line in enumerate(lines):.
Anyway, here is a corrected version with some explanation. I hope I did not mess with the logic.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
# The following line created a huge list of "Pairs" types, not instances
# pairs = [Pair] * (2*200*1000+1)
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
n = data.readline()
n = int(n)
# removing the reading of all data...
# lines = data.readlines()
# m = lines[n]
# removed bad for: for i, line in enumerate(data):
for _ in range(n): # you don't need the index
first, second = (int(x) for x in data.readline().split(':'))
# removed unnecessary recasting to int
# first = int(first)
# second = int(second)
# changed the swapping to a more elegant way
first, second = sorted((first, second))
pairs.append(Pair(first, second)) # we add to the list by appending
# removed unnecessary for: once you read all the first and seconds,
# you reached M
m = data.readline()
m = int(m)
# you don't need the index... indeed you don't need to count (you can read
# to the end of file, unless it is malformed)
for _ in range(m):
a, b = (int(x) for x in data.readline().split('-'))
# removed unnecessary recasting to int
# a = int(a)
# b = int(b)
a -= 1
b -= 1
temp = pairs[a]
pairs[a] = pairs[b]
pairs[b] = temp
p = -1e-9
ok = True
for k in range(0, n):
if pairs[k].first >= p:
p = pairs[k].first
elif pairs[k].second >= p:
p = pairs[k].second
else:
ok = False
break
if ok:
results.write("YES\n")
else:
results.write("NO\n")
Response previous to edition
range(1, 1) is empty, so this part of the code:
for i in range (1, 1):
n = data.readline()
n = int(n)
does not define n, at when execution gets to line 12 you get an error.
You can remove the for statement, changing those three lines to:
n = data.readline()
n = int(n)

how to create a list and then print it in ascending order

def list():
list_name = []
list_name_second = []
with open('CoinCount.txt', 'r', encoding='utf-8') as csvfile:
num_lines = 0
for line in csvfile:
num_lines = num_lines + 1
i = 0
while i < num_lines:
for x in volunteers[i].name:
if x not in list_name: # l
f = 0
while f < num_lines:
addition = []
if volunteers[f].true_count == "Y":
addition.append(1)
else:
addition.append(0)
f = f + 1
if f == num_lines:
decimal = sum(addition) / len(addition)
d = decimal * 100
percentage = float("{0:.2f}".format(d))
list_name_second.append({'Name': x , 'percentage': str(percentage)})
list_name.append(x)
i = i + 1
if i == num_lines:
def sort_percentages(list_name_second):
return list_name_second.get('percentage')
print(list_name_second, end='\n\n')
above is a segment of my code, it essentially means:
If the string in nth line of names hasn't been listed already, find the percentage of accurate coins counted and then add that all to a list, then print that list.
the issue is that when I output this, the program is stuck on a while loop continuously on addition.append(1), I'm not sure why so please can you (using the code displayed) let me know how to update the code to make it run as intended, also if it helps, the first two lines of code within the txt file read:
Abena,5p,325.00,Y
Malcolm,1p,3356.00,N
this doesn't matter much but just incase you need it, I suspect that the reason it is stuck looping addition.append(1) is because the first line has a "Y" as its true_count

File data binding with column names

I have files with hundreds and thousands rows of data but they are without any column.
I am trying to go to every file and make them row by row and store them in list after that I want to assign values by columns. But here I am confused what to do because values are around 60 in every row and some extra columns with value assigned and they should be added in every row.
Code so for:
import re
import glob
filenames = glob.glob("/home/ashfaque/Desktop/filetocsvsample/inputfiles/*.txt")
columns = []
with open("/home/ashfaque/Downloads/coulmn names.txt",encoding = "ISO-8859-1") as f:
file_data = f.read()
lines = file_data.splitlines()
for l in lines:
columns.append(l.rstrip())
total = {}
for name in filenames:
modified_data = []
with open(name,encoding = "ISO-8859-1") as f:
file_data = f.read()
lines = file_data.splitlines()
for l in lines:
if len(l) >= 1:
modified_data.append(re.split(': |,',l))
rows = []
i = len(modified_data)
x = 0
while i > 60:
r = lines[x:x+59]
x = x + 60
i = i - 60
rows.append(r)
z = len(modified_data)
while z >= 60:
z = z - 60
if z > 1:
last_columns = modified_data[-z:]
x = []
for l in last_columns:
if len(l) > 1:
del l[0]
x.append(l)
elif len(l) == 1:
x.append(l)
for row in rows:
for vl in x:
row.append(vl)
for r in rows:
for i in range(0,len(r)):
if len(r) >= 60:
total.setdefault(columns[i],[]).append(r[i])
In other script I have separated both row with 60 values and last 5 to 15 columns which should be added with row are separate but again I am confused how to bind all the data.
Data Should look like this after binding.
outputdata.xlsx
Data Input file:
inputdata.txt
What Am I missing here? any tool ?

I believe that your issue can be resolved by taking the input file and turning it into a CSV file which you can then import into whatever program you like.
I wrote a small generator that would read a file a line at a time and return a row after a certain number of lines, in this case 60. In that generator, you can make whatever modifications to the data as you need.
Then with each generated row, I write it directly to the csv. This should keep the memory requirements for this process pretty low.
I didn't understand what you were doing with the regex split, but it would be simple enough to add it to the generator.
import csv
OUTPUT_FILE = "/home/ashfaque/Desktop/File handling/outputfile.csv"
INPUT_FILE = "/home/ashfaque/Desktop/File handling/inputfile.txt"
# This is a generator that will pull only num number of items into
# memory at a time, before it yields the row.
def get_rows(path, num):
row = []
with open(path, "r", encoding="ISO-8859-1") as f:
for n, l in enumerate(f):
# apply whatever transformations that you need to here.
row.append(l.rstrip())
if (n + 1) % num == 0:
# if rows need padding then do it here.
yield row
row = []
with open(OUTPUT_FILE, "w") as output:
csv_writer = csv.writer(output)
for r in get_rows(INPUT_FILE, 60):
csv_writer.writerow(r)

Levenshtein distance in a file

The statement says:
Modify the above program so that given the GGCCTTGCCATTGG pattern, each of the first 10 lines of the previous file indicates:
· The distance of edition that finds the substring more similar of that line.
· The substrings of that line that finds to minimum distance of edition
The above program is this:
import time
def levenshtein_distance (first, second):
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(fist)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0]*second_length for x in range(first_length)]
for i in range(first_length): distance_matrix[i][0] = i
for j in range(second_length): distance_matrix[0][j] = j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 2
substitution = distance_matrix[i-1][j-1] + 1
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
def dna(patro):
t1 = time.clock()
f = open("HUMAN-DNA.txt")
text = f.readlines()
f.close()
distanciaMin = 100000000
distanciaPosicion = 0
distanciaLinea = 0
distanciaSubstring = ""
numeroLinea = 0
for line in text:
numeroLinea = numeroLinea + 1
for i in range(len(line)-len(patro)):
cadena = line[i:i+len(patro)]
distancia = levenshtein_distance(cadena, patro)
if distancia < distanciaMin:
distanciaMin = distancia
distanciaPosicion = 1
distanciaLinea = numeroLinea
distanciaSubstring = cadena
t2 = time.clock()
Now i put the new pattern
dna("GGCCTTGCCATTGG")
I have the distance of edition that is distanciaMin and I'm not sure about result of distanciaSubstring that is the substrings of that line(second point of statement), my question is How can i count the first ten lines in the text?
A part of the file is:
CCCATCTCTTTCTCATTCCTTGGTTGAGAACACGAACTTCAGGACTTGCCTCACACTAGGGCCCATTCTT
TGTTTCCCAGAAAGAAGAGGCTCTCCACACAGAGTCCCATGTACACCAGGCTGTCAACAAACATGAATTG
AATGAAGGAGTGGATGGTTGGGTGGAAGTGATTTAAGAAATCCTAACTGGGGAATTTCACTGGAAACTTA
GGAAATTCAATTTATATAAAGTCTATGAATCGTCCATTTTTGTGTCCGCACATTCAAATGCTGTAGCTAA
TTTCCTGCTAAACAGTAGAAATTCAGTAAGTGTTCATGTTGAAAGGATGAAATTTGAGTGCTCTTGCATC
CTCAAAGAACTCTAGTAAAATAGAAATAAAGCTTTATTTGGAAGATTAAGTCATGAGCATAATTATGAGA
AGGCGGTCATTCTAATAATAGTGTCTTCACAAGTAGATGCTACATGCTGTGTAATATTTTGACTAAAAAA
AGTTCCTCTCAACATTTCTGAAGTGAGATAATGTACAACGATCCATGTTTTTAGCTACCTTGATAAGTTT
AGTGCATCCAGGGCTCCTTTCTTACCTGCTAACCGCCGAGTTTCAAATGCTAAGAAATTCTTCATTTCCT
AACACAAATATTCAATATAATTGCTGGTTGTTTGGGAGAAGAAAAATTTAGAATTCAGAAAGAAATACAG
AATGAAATGTTCTAATCAATCGAAAAAGGATTCTATAGACTTCGACGTTGTCTGGTTTACAAAGCAGTCT

I couldn't understand your full question. But I am trying to solve How can i count the first ten lines in the text?. You can use filehandler.readlines(). It will load files in memory as a list where each row is separated by new line character.
Then you can read 10 lines from the list. You can try something like this,
>>> a = [0,1,2,3,4,5,6,7,8,9] # read file as a list of lines (a)
>>> def line(a, jump=2): # keep jump = 10 for your requirement.
lines = len(a)
i = 0
while i < lines+1:
yield a[i:i+jump]
i += jump
>>> foo = line(a)
>>> foo.next()
[0, 1]
>>> foo.next()
[2, 3]
>>> foo.next()
[4, 5]
For your code it will be,
foo = line(text, 10)
foo.next() # should return you 10 elements in each call

Python: keep top Nth results for csv.reader

I am doing some filtering on csv file where for every title there are many duplicate IDs with different prediction values, so the column 2 (pythoniac) is different. I would like to keep only 30 lowest values but with unique ID. I came to this code, but I don't know how to keep lowest 30 entries.
Can you please help with suggestions how to obtain 30 unique by ID entries?
# title1 id1 100 7.78E-25 # example of the line
with open("test.txt") as fi:
cmp = {}
for R in csv.reader(fi, delimiter='\t'):
for L in ligands:
newR = R[0], R[1]
if R[0] == L:
if (int(R[2]) <= int(1000) and int(R[2]) != int(0) and float(R[3]) < float("1.0e-10")):
if newR in cmp:
if float(cmp[newR][3]) > float(R[3]):
cmp[newR] = R[:-2]
else:
cmp[newR] = R[:-2]

Maybe try something along this line...
from bisect import insort
nth_lowest = [very_high_value] * 30
for x in my_loop:
do_stuff()
...
if x < nth_lowest[-1]:
insort(nth_lowest, x)
nth_lowest.pop() # remove the highest element

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to remove the redundant data from text file - python

Lets try to avoid generating the duplicates in the first place. Change this part of the code - outfile1 = open('pairdistance.txt', 'w') length = len(spfcord) for i,m in enumerate(spfcord): name1 = m[0] cord1 = m[1] for n in islice(spfcord,i+1,length): Add the import : from itertools import islice

Related

Problem with reading data from file in Python

how to create a list and then print it in ascending order

File data binding with column names

Levenshtein distance in a file

Python: keep top Nth results for csv.reader

Categories

Resources