Find overlapping rows and keep the longest - python
I want to filter a table based on overlapping values.
Column 1 holds a row ID, column 2 a group name, column 3 a start position, and column 4 an end position.
I want to keep only rows whose positions (start and end) are not contained within the range of another row of the same group (e.g. CE170_HUMAN).
For example, CE170_HUMAN has 6 rows, some with overlapping values: the 165-523 range (358 positions) is contained within the 1-523 range, so I want to keep only the 1-523 row, since it covers a longer range (523 positions). Then do the same for the next group, PURA2_PIG, and so on.
Input:
RAEG_00037367-RA CE170_HUMAN 557 1584
RAEG_00037368-RB CE170_HUMAN 165 523
RAEG_00037368-RA CE170_HUMAN 326 523
RAEG_00037368-RD CE170_HUMAN 165 370
RAEG_00037368-RC CE170_HUMAN 1 523
RAEG_00037368-RE CE170_HUMAN 1 370
RAEG_00037388-RB PURA2_PIG 61 456
RAEG_00037388-RC PURA2_PIG 61 357
RAEG_00037388-RA PURA2_PIG 181 456
RAEG_00037400-RA KI26B_HUMAN 454 545
RAEG_00037401-RA KI26B_HUMAN 753 2108
RAEG_00037415-RA CNST_HUMAN 137 613
RAEG_00037416-RA CNST_HUMAN 637 725
RAEG_00037420-RE ELYS_HUMAN 1 2266
RAEG_00037420-RG ELYS_HUMAN 1080 2266
RAEG_00037420-RF ELYS_HUMAN 1 2266
RAEG_00037420-RD ELYS_HUMAN 1080 2266
RAEG_00037420-RC ELYS_HUMAN 205 2266
RAEG_00037420-RB ELYS_HUMAN 1080 2266
Desired output:
RAEG_00037367-RA CE170_HUMAN 557 1584
RAEG_00037368-RB CE170_HUMAN 1 523
RAEG_00037388-RC PURA2_PIG 61 357
RAEG_00037400-RA KI26B_HUMAN 454 545
RAEG_00037401-RA KI26B_HUMAN 753 2108
RAEG_00037415-RA CNST_HUMAN 137 613
RAEG_00037416-RA CNST_HUMAN 637 725
RAEG_00037420-RE ELYS_HUMAN 1 2266
I am looking for a solution in Bash, Perl, or Python.
I appreciate your help!
I don't understand your format, but I am sure you can adapt this:
rows = [
    "Hello",
    "World",
    "Hello World"
]
solution = []
found = False
for i in range(len(rows)):
    for j in range(len(rows)):
        if i == j:
            # Comparing equal things (will result in false positive)
            continue
        if str(rows[i]) in str(rows[j]):
            # Not a solution
            found = True
            break
    if not found:
        # We have found a solution!
        solution.append(rows[i])
    else:
        # Not a solution. Resetting
        found = False
for i in solution:
    print(i)
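The substring test above compares text, not numeric ranges, so here is a rough sketch of how the containment rule from the question could be applied to the four-column format directly. It assumes whitespace-separated columns (ID, group, start, end); `parse_rows` and `keep_uncontained` are hypothetical helper names, and exact duplicates are treated as contained, so only the first copy survives.

```python
from collections import defaultdict


def parse_rows(text):
    """Parse whitespace-separated lines of: id, group, start, end."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            row_id, group, start, end = line.split()
            rows.append((row_id, group, int(start), int(end)))
    return rows


def keep_uncontained(rows):
    """Drop rows whose [start, end] lies inside another row of the same group."""
    groups = defaultdict(list)  # group name -> rows, in first-seen order
    for row in rows:
        groups[row[1]].append(row)

    kept = []
    for members in groups.values():
        # Sort by start ascending, then end descending, so any row that could
        # contain the current one has already been visited.
        members.sort(key=lambda r: (r[2], -r[3]))
        max_end = None
        for row in members:
            # Contained (or an exact duplicate) if an earlier row reaches as far.
            if max_end is None or row[3] > max_end:
                kept.append(row)
                max_end = row[3]
    return kept


sample = """\
RAEG_00037367-RA CE170_HUMAN 557 1584
RAEG_00037368-RB CE170_HUMAN 165 523
RAEG_00037368-RC CE170_HUMAN 1 523
RAEG_00037388-RB PURA2_PIG 61 456
RAEG_00037388-RC PURA2_PIG 61 357
"""

for row in keep_uncontained(parse_rows(sample)):
    print(*row)
```

Rows come out grouped in first-seen group order and sorted by start within each group. Ranges that overlap only partially (say 1-10 and 5-15) are both kept, since neither contains the other. Each kept row carries the ID that has those coordinates in the input, so 1-523 is reported as RAEG_00037368-RC and PURA2_PIG keeps RAEG_00037388-RB (61-456).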
Related
math operation in list of list
I have created 1000 lists containing some values. I want to apply a math operation to the elements of each list and save the result in another list that I want to plot. Each list has the shape shown below, and I want the difference between the ith and the (i-1)th element (the distance between two consecutive elements):

[[ 4 29 73 111 130 140 167 231 248 267 284 298 320 333 379 404 421 433 475 510 523 534 544 558 575 602 617 630 661 672 685 698 711 731 742 764 780 828 842 854 874 885 903 916 944 961 985 996 1013 1032 1054 1064 1077 1109 1122 1138 1205 1233 1249 1282 1299 1311 1326 1337 1372 1409 1426 1437 1511 1549 1578 1591 1604 1646]]

I have written the code below, but it does not work: I get an "index out of range" error.

import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import find_peaks

CASES = [f for f in sorted(os.listdir('.')) if f.startswith('config')]
plt.rcParams.update({'font.size': 14})
maxnum = np.max([int(os.path.splitext(f)[0].split('_')[1]) for f in CASES])
CASES = ['configuration_%d.out' % i for i in range(maxnum)]
gg = []
my_l_h = []
for i, d in enumerate(CASES):
    a = np.loadtxt(d).T
    x = a[3]
    peaks, _ = find_peaks(x, distance=10)
    gg = [peaks]
    L_h = np.array(gg)
    for numbers in gg:
        jp = L_h[:,i]-L_h[:,i-1]
        my_l_h.append(jp)
print(my_l_h)
t = np.arange(0,len(my_l_h))
plt.plot(t,my_l_h)
plt.show()
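As a small illustration of the "distance between two consecutive elements" idea, here is a sketch that pairs each element with its successor rather than indexing by the file counter. It assumes one flat sequence of peak positions; `consecutive_gaps` is a hypothetical helper name, and for NumPy arrays `np.diff` computes the same differences.

```python
def consecutive_gaps(values):
    """Return values[k+1] - values[k] for each pair of neighbours."""
    return [b - a for a, b in zip(values, values[1:])]


peaks = [4, 29, 73, 111, 130, 140]  # first few values from the list above
print(consecutive_gaps(peaks))  # → [25, 44, 38, 19, 10]
```

The result has one element fewer than the input, so any x-axis used for plotting needs the same shortened length.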
Python print string alignment
I am printing some values in a loop in Python. My current output is as follows:

0 Data Count: 249 7348 249 4469 2768 261 20 126
1 Data Count: 288 11 288 48 2284 598 137 408
2 Data Count: 808 999 808 2896 32739 138 202 678
3 Data Count: 140 26 140 2688 8054 884 433 987

What I'd like is for all values in each column to align, despite differing character/number counts in some, to make it easier to read. The pseudo code behind this is as follows:

for i in range(0,3):
    print i, " Data Count: ", Count_A, " ", Count_B, " ", Count_C, " ", Count_D, " ", Count_E, " ", Count_F, " ", Count_G, " ", Count_H

Thanks in advance everyone!
You could use format string justification:

from random import randint

for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("{:5} {:5} {:5} {:5}".format(*data))

output:

   92   460    72   630
  837   214   118   677
  906   328   102   320
  895   998   177   922
  651   742   215   938

According to the format specification from Python docs
With the % string formatting operator, the minimum width of output is specified in a placeholder as a number before the data type (the full format of a placeholder is %[key][flags][width][.precision][length type]conversion type). If the result is shorter, it will be left-padded to the specified length:

from random import randint

for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("%5d %5d %5d %5d %5d" % tuple(data))

gives:

  946   937   544   636   871
  232   860   704   877   716
  868   849   851   488   739
  419   381   695   909   518
  570   756   467   351   537

(code adapted from @andreihondrari's answer)
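The same fixed-width idea can be applied to the "Data Count" rows from the question using f-strings (Python 3.6+). This is only a sketch: `format_row` is a hypothetical helper, and the width of 6 is an assumption chosen to fit the five-digit value in the sample output.

```python
def format_row(index, counts, width=6):
    """Right-align every count in a fixed-width column."""
    cells = "".join(f"{count:>{width}}" for count in counts)
    return f"{index} Data Count:{cells}"


rows = [
    [249, 7348, 249, 4469],
    [288, 11, 288, 48],
]
for i, counts in enumerate(rows):
    print(format_row(i, counts))
```

Building the row from a list avoids writing one placeholder per column, so the same helper works however many counts there are.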
How to display a sequence of numbers in column-major order?
Program description: Find all the prime numbers between 1 and 4,027 and print them in a table which "reads down", using as few rows as possible, and using as few sheets of paper as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height of the columns should all be the same, except for perhaps the last column, which might have a few blank entries towards its bottom row.

The plan for my first function is to find all prime numbers in the range above and put them in a list. Then I want my second function to display the list in a table that reads top to bottom:

 2 23 59
 3 29 61
 5 31 67
 7 37 71
11 41 73
13 43 79
17 47 83
19 53 89
etc.

This is all I've been able to come up with myself:

def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list

def displayPrimes():
    pass

print(findPrimes(4027))

I'm not sure how to make a row/column display in Python. I remember using Java in my previous class, and I believe we had to use a for loop inside a for loop. Do I have to do something similar to that?
Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours, mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.

I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating them.

For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated, as shown in the output at the end.

from itertools import zip_longest
import locale
import math

locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting

def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
                for iterable in zip_longest(*iterables, fillvalue=_NULL)]

def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]

def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap

    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table

    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                              for num in row)))

def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            prime_list.append(number)
    return prime_list

primes = find_primes(4027)
tabularize(80, 15, primes)

Output:

---- Page 1 ----
      1     47    113    197    281    379    463    571    659    761    863
      2     53    127    199    283    383    467    577    661    769    877
      3     59    131    211    293    389    479    587    673    773    881
      5     61    137    223    307    397    487    593    677    787    883
      7     67    139    227    311    401    491    599    683    797    887
     11     71    149    229    313    409    499    601    691    809    907
     13     73    151    233    317    419    503    607    701    811    911
     17     79    157    239    331    421    509    613    709    821    919
     19     83    163    241    337    431    521    617    719    823    929
     23     89    167    251    347    433    523    619    727    827    937
     29     97    173    257    349    439    541    631    733    829    941
     31    101    179    263    353    443    547    641    739    839    947
     37    103    181    269    359    449    557    643    743    853    953
     41    107    191    271    367    457    563    647    751    857    967
     43    109    193    277    373    461    569    653    757    859    971
---- Page 2 ----
    977  1,069  1,187  1,291  1,427  1,511  1,613  1,733  1,867  1,987  2,087
    983  1,087  1,193  1,297  1,429  1,523  1,619  1,741  1,871  1,993  2,089
    991  1,091  1,201  1,301  1,433  1,531  1,621  1,747  1,873  1,997  2,099
    997  1,093  1,213  1,303  1,439  1,543  1,627  1,753  1,877  1,999  2,111
  1,009  1,097  1,217  1,307  1,447  1,549  1,637  1,759  1,879  2,003  2,113
  1,013  1,103  1,223  1,319  1,451  1,553  1,657  1,777  1,889  2,011  2,129
  1,019  1,109  1,229  1,321  1,453  1,559  1,663  1,783  1,901  2,017  2,131
  1,021  1,117  1,231  1,327  1,459  1,567  1,667  1,787  1,907  2,027  2,137
  1,031  1,123  1,237  1,361  1,471  1,571  1,669  1,789  1,913  2,029  2,141
  1,033  1,129  1,249  1,367  1,481  1,579  1,693  1,801  1,931  2,039  2,143
  1,039  1,151  1,259  1,373  1,483  1,583  1,697  1,811  1,933  2,053  2,153
  1,049  1,153  1,277  1,381  1,487  1,597  1,699  1,823  1,949  2,063  2,161
  1,051  1,163  1,279  1,399  1,489  1,601  1,709  1,831  1,951  2,069  2,179
  1,061  1,171  1,283  1,409  1,493  1,607  1,721  1,847  1,973  2,081  2,203
  1,063  1,181  1,289  1,423  1,499  1,609  1,723  1,861  1,979  2,083  2,207
---- Page 3 ----
  2,213  2,333  2,423  2,557  2,687  2,789  2,903  3,037  3,181  3,307  3,413
  2,221  2,339  2,437  2,579  2,689  2,791  2,909  3,041  3,187  3,313  3,433
  2,237  2,341  2,441  2,591  2,693  2,797  2,917  3,049  3,191  3,319  3,449
  2,239  2,347  2,447  2,593  2,699  2,801  2,927  3,061  3,203  3,323  3,457
  2,243  2,351  2,459  2,609  2,707  2,803  2,939  3,067  3,209  3,329  3,461
  2,251  2,357  2,467  2,617  2,711  2,819  2,953  3,079  3,217  3,331  3,463
  2,267  2,371  2,473  2,621  2,713  2,833  2,957  3,083  3,221  3,343  3,467
  2,269  2,377  2,477  2,633  2,719  2,837  2,963  3,089  3,229  3,347  3,469
  2,273  2,381  2,503  2,647  2,729  2,843  2,969  3,109  3,251  3,359  3,491
  2,281  2,383  2,521  2,657  2,731  2,851  2,971  3,119  3,253  3,361  3,499
  2,287  2,389  2,531  2,659  2,741  2,857  2,999  3,121  3,257  3,371  3,511
  2,293  2,393  2,539  2,663  2,749  2,861  3,001  3,137  3,259  3,373  3,517
  2,297  2,399  2,543  2,671  2,753  2,879  3,011  3,163  3,271  3,389  3,527
  2,309  2,411  2,549  2,677  2,767  2,887  3,019  3,167  3,299  3,391  3,529
  2,311  2,417  2,551  2,683  2,777  2,897  3,023  3,169  3,301  3,407  3,533
---- Page 4 ----
  3,539  3,581  3,623  3,673  3,719  3,769  3,823  3,877  3,919  3,967  4,019
  3,541  3,583  3,631  3,677  3,727  3,779  3,833  3,881  3,923  3,989  4,021
  3,547  3,593  3,637  3,691  3,733  3,793  3,847  3,889  3,929  4,001  4,027
  3,557  3,607  3,643  3,697  3,739  3,797  3,851  3,907  3,931  4,003
  3,559  3,613  3,659  3,701  3,761  3,803  3,853  3,911  3,943  4,007
  3,571  3,617  3,671  3,709  3,767  3,821  3,863  3,917  3,947  4,013
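The core of the "reads down" layout can also be sketched with plain slicing, as a simplified alternative to the zip-based grouping above. It assumes a fixed column height and ignores paging and page width; `column_major_rows` is a hypothetical helper name.

```python
def column_major_rows(items, height):
    """Arrange items so they read down each column, `height` rows per column."""
    # Row r holds items r, r + height, r + 2*height, ...
    return [items[r::height] for r in range(min(height, len(items)))]


primes = [2, 3, 5, 7, 11, 13, 17]
for row in column_major_rows(primes, 3):
    print(" ".join(f"{n:>3}" for n in row))
```

With a height of 3 this prints 2, 3, 5 down the first column, 7, 11, 13 down the second, and 17 alone in the third, so only the last column is short.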
Pandas: select by bigger than a value
My dataframe has a column called dir that takes several values. I want to know how many of those values occur more than a certain number of times. For example:

df['dir'].value_counts().sort_index()

It returns a Series:

0       855
20      881
40     2786
70     3777
90     3964
100       4
110    2115
130    3040
140       1
160    1697
180    1734
190       3
200     618
210       3
220    1451
250     895
270    2167
280       1
290    1643
300       1
310    1894
330       1
340     965
350       1
Name: dir, dtype: int64

Here, I want to know the number of values whose count exceeds 500. In this case, that's all except 100, 140, 190, 210, 280, 300, 330, 350. How can I do that? I can get away with

df['dir'].value_counts()[df['dir'].value_counts() > 500]
(df['dir'].value_counts() > 500).sum()

This gets the value counts and compares each one with 500, which returns a Series of boolean values. The parentheses make .sum() apply to that whole boolean Series. .sum() counts each True as 1 and each False as 0.
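To make the counting logic concrete without a dataframe, here is a plain-Python sketch of the same idea using `collections.Counter`. It is an analogue for illustration, not the pandas call itself; `count_frequent_values` is a hypothetical helper name and the sample data is made up.

```python
from collections import Counter


def count_frequent_values(values, threshold):
    """Count distinct values that occur more than `threshold` times."""
    counts = Counter(values)  # value -> number of occurrences
    # Each True adds 1 and each False adds 0, as with .sum() on booleans.
    return sum(n > threshold for n in counts.values())


directions = [0, 0, 0, 20, 20, 40]  # made-up sample data
print(count_frequent_values(directions, 1))  # → 2
```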
How can I count the number of numbers in this line-separated string literal in python?
I want to find the number of numbers in the following string literal 'a'. What am I doing wrong in this code? Is there any way I can find 'count' without manually counting through the string? I thought of adding commas after each number to make it an array, but I am sure there has to be a better way to scrape individual numbers when text is given in such a way.

a = """
1004
1003
1003
1002
1001
1000
996
994
992
989
987
984
977
970
963
958
954
951
948
943
939
935
929
917
911
905
903
897
885
878
877
872
857
838
815
796
779
757
725
684
632
578
528
460
258
66
49
42
41
39
39
38
38
38
38
41
53
"""
count = 0
while a:
    if a == '\n':
        count+=1
print count
This gives you the number of lines excluding empty lines:

print(len([line for line in a.splitlines() if line.strip()]))

A solution without a list comprehension:

counter = 0
for line in a.splitlines():
    if line.strip():
        counter += 1
Why not use the built-in count method?

print(a.count('\n') - 1)

I wrote this code, but I'm not sure whether it helps you.
You can do it in the following way.

l = a.strip().split('\n')
my_list = list(map(lambda x: l.count(x), set(l)))
print(my_list)
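Another option, sketched here under the assumption that the numbers are separated by any whitespace, is to split the string into tokens and count those that parse as integers. `count_numbers` is a hypothetical helper name, and the sample string is a shortened stand-in for the one in the question.

```python
def count_numbers(text):
    """Count whitespace-separated tokens that parse as integers."""
    count = 0
    for token in text.split():
        try:
            int(token)
        except ValueError:
            continue  # skip anything that is not an integer
        count += 1
    return count


a = """
1004
1003
1003
996
"""
print(count_numbers(a))  # → 4
```

Because `split()` with no arguments ignores leading, trailing, and repeated whitespace, the result does not depend on how many blank lines surround the numbers.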