Find overlapping rows and keep longest

I want to filter a table based on overlapping values.
Column 2 holds a group name, column 3 a start position, and column 4 an end position.
Within each group (e.g. CE170_HUMAN), I want to keep only the rows whose range (start to end) is not contained within the range of another row of the same group.
For example, CE170_HUMAN has 6 rows, and some of them overlap: the range 165-523 (358 positions) is contained within the range 1-523, so I want to keep only the row with 1-523 because it covers a longer range (523 positions). Then do the same for the next group, PURA2_PIG, and so on.
Input:
RAEG_00037367-RA CE170_HUMAN 557 1584
RAEG_00037368-RB CE170_HUMAN 165 523
RAEG_00037368-RA CE170_HUMAN 326 523
RAEG_00037368-RD CE170_HUMAN 165 370
RAEG_00037368-RC CE170_HUMAN 1 523
RAEG_00037368-RE CE170_HUMAN 1 370
RAEG_00037388-RB PURA2_PIG 61 456
RAEG_00037388-RC PURA2_PIG 61 357
RAEG_00037388-RA PURA2_PIG 181 456
RAEG_00037400-RA KI26B_HUMAN 454 545
RAEG_00037401-RA KI26B_HUMAN 753 2108
RAEG_00037415-RA CNST_HUMAN 137 613
RAEG_00037416-RA CNST_HUMAN 637 725
RAEG_00037420-RE ELYS_HUMAN 1 2266
RAEG_00037420-RG ELYS_HUMAN 1080 2266
RAEG_00037420-RF ELYS_HUMAN 1 2266
RAEG_00037420-RD ELYS_HUMAN 1080 2266
RAEG_00037420-RC ELYS_HUMAN 205 2266
RAEG_00037420-RB ELYS_HUMAN 1080 2266
Desired output:
RAEG_00037367-RA CE170_HUMAN 557 1584
RAEG_00037368-RB CE170_HUMAN 1 523
RAEG_00037388-RC PURA2_PIG 61 357
RAEG_00037400-RA KI26B_HUMAN 454 545
RAEG_00037401-RA KI26B_HUMAN 753 2108
RAEG_00037415-RA CNST_HUMAN 137 613
RAEG_00037416-RA CNST_HUMAN 637 725
RAEG_00037420-RE ELYS_HUMAN 1 2266
I am looking for a solution either on bash, perl or python.
I appreciate your help!

I don't understand your format, but I am sure you can adapt this:
rows = [
    "Hello",
    "World",
    "Hello World"
]
solution = []
found = False
for i in range(len(rows)):
    for j in range(len(rows)):
        if i == j:
            # Comparing equal things (will result in false positive)
            continue
        if str(rows[i]) in str(rows[j]):
            # Not a solution
            found = True
            break
    if not found:
        # We have found a solution!
        solution.append(rows[i])
    else:
        # Not a solution. Resetting
        found = False
for i in solution:
    print(i)
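The snippet above tests string containment, whereas the question asks about numeric ranges. A minimal sketch of the range version follows, assuming whitespace-separated columns as in the question (ID, group, start, end). The function names parse_rows and keep_uncontained are hypothetical. Within each group the rows are sorted by start ascending and end descending, so any row that contains another one is visited first. A row is then kept only if its end exceeds the largest end seen so far.

```python
from collections import defaultdict


def parse_rows(text):
    """Parse lines of the form: id group start end (whitespace-separated)."""
    rows = []
    for line in text.strip().splitlines():
        name, group, start, end = line.split()
        rows.append((name, group, int(start), int(end)))
    return rows


def keep_uncontained(rows):
    """Keep rows whose [start, end] range is not contained in another row
    of the same group. Exact duplicates keep the first occurrence."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)

    kept = []
    for group_rows in groups.values():
        # Start ascending, end descending: a containing range sorts first.
        ordered = sorted(group_rows, key=lambda r: (r[2], -r[3]))
        max_end = None
        for row in ordered:
            # Every earlier row starts at or before this one, so this row
            # is contained exactly when its end does not exceed max_end.
            if max_end is None or row[3] > max_end:
                kept.append(row)
                max_end = row[3]
    return kept


if __name__ == "__main__":
    sample = """
    RAEG_00037388-RB PURA2_PIG 61 456
    RAEG_00037388-RC PURA2_PIG 61 357
    RAEG_00037388-RA PURA2_PIG 181 456
    """
    for name, group, start, end in keep_uncontained(parse_rows(sample)):
        print(name, group, start, end)
```

Following the containment rule as stated in the question, this keeps RAEG_00037368-RC (1-523) for CE170_HUMAN and RAEG_00037388-RB (61-456) for PURA2_PIG; the IDs in the listed desired output differ for those two rows. Rows come out grouped and sorted by start position rather than in input order.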

Related

math operation in list of list

I have created 1000 lists containing some values. I want to apply a math operation to the elements of each list and save the result in another list that I then plot. Each list looks like the one below, and I want to subtract the (i-1)th element from the ith element (the distance between two consecutive elements).
[[ 4 29 73 111 130 140 167 231 248 267 284 298 320 333
379 404 421 433 475 510 523 534 544 558 575 602 617 630
661 672 685 698 711 731 742 764 780 828 842 854 874 885
903 916 944 961 985 996 1013 1032 1054 1064 1077 1109 1122 1138
1205 1233 1249 1282 1299 1311 1326 1337 1372 1409 1426 1437 1511 1549
1578 1591 1604 1646]]
I have written the code below, but it does not work: it fails with an "index out of range" error.
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
plt.rcParams.update({'font.size': 14})
# Collect the input files (configuration_0.out, configuration_1.out, ...)
cases = [f for f in sorted(os.listdir('.')) if f.startswith('config')]
maxnum = np.max([int(os.path.splitext(f)[0].split('_')[1]) for f in cases])
cases = ['configuration_%d.out' % i for i in range(maxnum)]
my_l_h = []
for d in cases:
    a = np.loadtxt(d).T
    x = a[3]
    peaks, _ = find_peaks(x, distance=10)
    # Distance between consecutive peaks: peaks[k] - peaks[k-1].
    # Indexing with the file counter i instead of a position inside
    # peaks is what raised the index error.
    my_l_h.append(np.diff(peaks))
print(my_l_h)
# The difference arrays can have different lengths, so plot one line per file
for jp in my_l_h:
    plt.plot(np.arange(len(jp)), jp)
plt.show()
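A minimal sketch of the consecutive-difference step on its own, using the first few values of the sample list from the question. The function name consecutive_gaps is hypothetical. Pairing each element with its successor avoids explicit index arithmetic, which is where an out-of-range index tends to creep in. For NumPy arrays, np.diff computes the same result.

```python
def consecutive_gaps(values):
    """Return values[k] - values[k-1] for every adjacent pair."""
    # zip stops at the shorter input, so no index can run out of range.
    return [later - earlier for earlier, later in zip(values, values[1:])]


if __name__ == "__main__":
    peaks = [4, 29, 73, 111, 130]
    print(consecutive_gaps(peaks))  # [25, 44, 38, 19]
```

A list of n values yields n - 1 gaps, so an x-axis for plotting needs n - 1 points as well.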

Python print string alignment

I am printing some values in a loop in Python. My current output is as follows:
0 Data Count: 249 7348 249 4469 2768 261 20 126
1 Data Count: 288 11 288 48 2284 598 137 408
2 Data Count: 808 999 808 2896 32739 138 202 678
3 Data Count: 140 26 140 2688 8054 884 433 987
What I'd like is for all values in each column to align, despite differing character/number counts in some, to make it easier to read.
The pseudo code behind this is as follows:
for i in range(0,3):
    print i, " Data Count: ", Count_A, " ", Count_B, " ", Count_C, " ", Count_D, " ", Count_E, " ", Count_F, " ", Count_G, " ", Count_H
Thanks in advance everyone!

You could use format string justification:
from random import randint
for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("{:5} {:5} {:5} {:5}".format(*data))
output:
   92   460    72   630
  837   214   118   677
  906   328   102   320
  895   998   177   922
  651   742   215   938
According to the format specification from Python docs

With the % string formatting operator, the minimum width of output is specified in a placeholder as a number before the data type (the full format of a placeholder is %[key][flags][width][.precision][length type]conversion type). If the result is shorter, it will be left-padded to the specified length:
from random import randint
for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("%5d %5d %5d %5d %5d" % tuple(data))
gives:
  946   937   544   636   871
  232   860   704   877   716
  868   849   851   488   739
  419   381   695   909   518
  570   756   467   351   537
(code adapted from @andreihondrari's answer)
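Applied to the row layout from the question, the same width idea might look like the sketch below. The function name format_row and the default width of 6 are assumptions; the width only needs to be at least as large as the longest number. The sample counts are taken from the question's output.

```python
def format_row(index, counts, width=6):
    """Format one output line with every count right-aligned to `width`."""
    cells = " ".join(f"{count:>{width}}" for count in counts)
    return f"{index} Data Count: {cells}"


if __name__ == "__main__":
    table = [
        [249, 7348, 249, 4469, 2768, 261, 20, 126],
        [288, 11, 288, 48, 2284, 598, 137, 408],
        [808, 999, 808, 2896, 32739, 138, 202, 678],
        [140, 26, 140, 2688, 8054, 884, 433, 987],
    ]
    for i, counts in enumerate(table):
        print(format_row(i, counts))
```

Because every cell has the same width, all lines end up the same length as long as the row index has the same number of digits.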

How to display a sequence of numbers in column-major order?

Program description:
Find all the prime numbers between 1 and 4,027 and print them in a table which
"reads down", using as few rows as possible, and using as few sheets of paper
as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height
of the columns should all be the same, except for perhaps the last column,
which might have a few blank entries towards its bottom row.
The plan for my first function is to find all prime numbers between the range above and put them in a list. Then I want my second function to display the list in a table that reads up to down.
 2  23  59
 3  29  61
 5  31  67
 7  37  71
11  41  73
13  43  79
17  47  83
19  53  89
etc.
This is all I've been able to come up with myself:
def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list
def displayPrimes():
    pass
print(findPrimes(4027))
I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop I believe. Do I have to do something similar to that?

Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours—mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.
I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating them.
For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated as shown in the output at the end.
from itertools import zip_longest
import locale
import math
locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting
def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
            for iterable in zip_longest(*iterables, fillvalue=_NULL)]
def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]
def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap
    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table
    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                           for num in row)))
def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    # The range starts at 1, so 1 appears in the output below even though
    # it is not prime; use range(2, n+1) to exclude it.
    for number in range(1, n+1):
        # Only divisors up to the square root need to be checked.
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            # The inner loop finished without a break: no divisor found.
            prime_list.append(number)
    return prime_list
primes = find_primes(4027)
tabularize(80, 15, primes)
Output:
---- Page 1 ----
1 47 113 197 281 379 463 571 659 761 863
2 53 127 199 283 383 467 577 661 769 877
3 59 131 211 293 389 479 587 673 773 881
5 61 137 223 307 397 487 593 677 787 883
7 67 139 227 311 401 491 599 683 797 887
11 71 149 229 313 409 499 601 691 809 907
13 73 151 233 317 419 503 607 701 811 911
17 79 157 239 331 421 509 613 709 821 919
19 83 163 241 337 431 521 617 719 823 929
23 89 167 251 347 433 523 619 727 827 937
29 97 173 257 349 439 541 631 733 829 941
31 101 179 263 353 443 547 641 739 839 947
37 103 181 269 359 449 557 643 743 853 953
41 107 191 271 367 457 563 647 751 857 967
43 109 193 277 373 461 569 653 757 859 971
---- Page 2 ----
977 1,069 1,187 1,291 1,427 1,511 1,613 1,733 1,867 1,987 2,087
983 1,087 1,193 1,297 1,429 1,523 1,619 1,741 1,871 1,993 2,089
991 1,091 1,201 1,301 1,433 1,531 1,621 1,747 1,873 1,997 2,099
997 1,093 1,213 1,303 1,439 1,543 1,627 1,753 1,877 1,999 2,111
1,009 1,097 1,217 1,307 1,447 1,549 1,637 1,759 1,879 2,003 2,113
1,013 1,103 1,223 1,319 1,451 1,553 1,657 1,777 1,889 2,011 2,129
1,019 1,109 1,229 1,321 1,453 1,559 1,663 1,783 1,901 2,017 2,131
1,021 1,117 1,231 1,327 1,459 1,567 1,667 1,787 1,907 2,027 2,137
1,031 1,123 1,237 1,361 1,471 1,571 1,669 1,789 1,913 2,029 2,141
1,033 1,129 1,249 1,367 1,481 1,579 1,693 1,801 1,931 2,039 2,143
1,039 1,151 1,259 1,373 1,483 1,583 1,697 1,811 1,933 2,053 2,153
1,049 1,153 1,277 1,381 1,487 1,597 1,699 1,823 1,949 2,063 2,161
1,051 1,163 1,279 1,399 1,489 1,601 1,709 1,831 1,951 2,069 2,179
1,061 1,171 1,283 1,409 1,493 1,607 1,721 1,847 1,973 2,081 2,203
1,063 1,181 1,289 1,423 1,499 1,609 1,723 1,861 1,979 2,083 2,207
---- Page 3 ----
2,213 2,333 2,423 2,557 2,687 2,789 2,903 3,037 3,181 3,307 3,413
2,221 2,339 2,437 2,579 2,689 2,791 2,909 3,041 3,187 3,313 3,433
2,237 2,341 2,441 2,591 2,693 2,797 2,917 3,049 3,191 3,319 3,449
2,239 2,347 2,447 2,593 2,699 2,801 2,927 3,061 3,203 3,323 3,457
2,243 2,351 2,459 2,609 2,707 2,803 2,939 3,067 3,209 3,329 3,461
2,251 2,357 2,467 2,617 2,711 2,819 2,953 3,079 3,217 3,331 3,463
2,267 2,371 2,473 2,621 2,713 2,833 2,957 3,083 3,221 3,343 3,467
2,269 2,377 2,477 2,633 2,719 2,837 2,963 3,089 3,229 3,347 3,469
2,273 2,381 2,503 2,647 2,729 2,843 2,969 3,109 3,251 3,359 3,491
2,281 2,383 2,521 2,657 2,731 2,851 2,971 3,119 3,253 3,361 3,499
2,287 2,389 2,531 2,659 2,741 2,857 2,999 3,121 3,257 3,371 3,511
2,293 2,393 2,539 2,663 2,749 2,861 3,001 3,137 3,259 3,373 3,517
2,297 2,399 2,543 2,671 2,753 2,879 3,011 3,163 3,271 3,389 3,527
2,309 2,411 2,549 2,677 2,767 2,887 3,019 3,167 3,299 3,391 3,529
2,311 2,417 2,551 2,683 2,777 2,897 3,023 3,169 3,301 3,407 3,533
---- Page 4 ----
3,539 3,581 3,623 3,673 3,719 3,769 3,823 3,877 3,919 3,967 4,019
3,541 3,583 3,631 3,677 3,727 3,779 3,833 3,881 3,923 3,989 4,021
3,547 3,593 3,637 3,691 3,733 3,793 3,847 3,889 3,929 4,001 4,027
3,557 3,607 3,643 3,697 3,739 3,797 3,851 3,907 3,931 4,003
3,559 3,613 3,659 3,701 3,761 3,803 3,853 3,911 3,943 4,007
3,571 3,617 3,671 3,709 3,767 3,821 3,863 3,917 3,947 4,013
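On the question of whether a loop inside a loop is needed: the core of a column-major layout can be sketched with one loop over rows and one over columns. This is a simplified illustration without paging or locale formatting, and the function name column_major_rows is hypothetical. The item shown at row r and column c is the one at index c * num_rows + r in the flat list.

```python
import math


def column_major_rows(items, num_cols):
    """Arrange items so that reading down each column follows list order."""
    num_rows = math.ceil(len(items) / num_cols)
    rows = []
    for r in range(num_rows):
        row = []
        for c in range(num_cols):
            index = c * num_rows + r
            if index < len(items):  # the last column may be shorter
                row.append(items[index])
        rows.append(row)
    return rows


if __name__ == "__main__":
    primes = [2, 3, 5, 7, 11, 13, 17]
    for row in column_major_rows(primes, 3):
        # Right-justify every number in a field of 4 characters.
        print("".join(f"{n:>4}" for n in row))
```

Computing the row count first and deriving each index from it keeps all columns the same height except possibly the last one.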

Pandas: select by bigger than a value

My dataframe has a column called dir that takes several distinct values. I want to know how many of those values occur more than a certain number of times. For example:
df['dir'].value_counts().sort_index()
It returns a Series
0 855
20 881
40 2786
70 3777
90 3964
100 4
110 2115
130 3040
140 1
160 1697
180 1734
190 3
200 618
210 3
220 1451
250 895
270 2167
280 1
290 1643
300 1
310 1894
330 1
340 965
350 1
Name: dir, dtype: int64
Here, I want to know how many values have a count above 500. In this case, that is all of them except 100, 140, 190, 210, 280, 300, 330 and 350.
How can I do that?
I can get away with df['dir'].value_counts()[df['dir'].value_counts() > 500]

(df['dir'].value_counts() > 500).sum()
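value_counts() returns a Series of counts, and comparing it with > 500 turns that into a Series of boolean values. The parentheses make .sum() apply to the result of the comparison rather than to the number 500. Summing counts each True as 1 and each False as 0, so the result is the number of values that occur more than 500 times.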
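A small self-contained sketch of both expressions, using a made-up dataframe and a threshold of 2 instead of 500 so the result is easy to check by hand. The data and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy data: 0 occurs three times, 20 twice, 100 once.
df = pd.DataFrame({"dir": [0, 0, 0, 20, 20, 100]})
threshold = 2

counts = df["dir"].value_counts()

# Number of distinct values whose count exceeds the threshold.
n_frequent = int((counts > threshold).sum())

# The values themselves, with their counts (boolean-mask selection).
frequent = counts[counts > threshold]

print(n_frequent)
print(frequent)
```

Storing the counts in a variable avoids calling value_counts() twice, as the expression in the question does.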

How can I count the number of numbers in this line-separated string literal in python?

I want to find the number of numbers in the following string literal 'a'. What am I doing wrong in this code? Is there any way I can find 'count' without manually counting through the string?
I thought of adding commas after each number to make it an array but I am sure there has to be a better way to scrape individual numbers when text is given in such a way.
a = """
1004
1003
1003
1002
1001
1000
996
994
992
989
987
984
977
970
963
958
954
951
948
943
939
935
929
917
911
905
903
897
885
878
877
872
857
838
815
796
779
757
725
684
632
578
528
460
258
66
49
42
41
39
39
38
38
38
38
41
53
"""
count = 0
while a:
    if a == '\n':
        count+=1
print count

This gives you the number of lines excluding empty lines:
print(len([line for line in a.splitlines() if line.strip()]))
A solution without a list comprehension:
counter = 0
for line in a.splitlines():
    if line.strip():
        counter += 1

Why not use the built-in string method count()? The string starts with one newline before the first number, so subtract 1:
print(a.count('\n') - 1)
I wrote this code, but I'm not sure whether it helps you.

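You can do it in the following way. Note that this counts how many times each distinct number occurs, not the total number of numbers: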
l = a.strip().split('\n')
my_list = list(map(lambda x: l.count(x), set(l)))
print(my_list)
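As for what goes wrong in the question's loop: a is never modified, so while a: never ends, and a == '\n' compares the whole string with a single newline, which is never true here. A minimal sketch of another approach follows, with count_numbers as a hypothetical name. Calling str.split() without arguments splits on any run of whitespace and drops empty pieces, so leading, trailing and blank lines need no special handling.

```python
def count_numbers(text):
    """Count whitespace-separated tokens in a block of text."""
    return len(text.split())


if __name__ == "__main__":
    a = """
    1004
    1003
    1003
    53
    """
    print(count_numbers(a))  # 4
```

If the numeric values are needed as well, [int(token) for token in a.split()] gives them as a list.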
