How to Identify Missing Indices

I have a tab-delimited text file with millions of index points, all of which are read as strings. However, some index points could be missing. Here is an example of my text file:
1 0 4 0d 07:00:37.0400009155273
2 0 4 0d 07:00:37.0400009155273
3 0 4 0d 07:00:37.0400009155273
5 0 4 0d 07:00:37.0400009155273
7 0 4 0d 07:00:37.0400009155273
9 0 4 0d 07:00:37.0400009155273
Notice that rows 4, 6 and 8 are missing. My goal is to create a function that can parse through the text file, identify possible missing index points and return a list that has all the missing index points (if any) or return nothing.
I'm using Python 3.7 in the Spyder IDE on Windows 10. I am relatively new to Python and Stack Overflow.
This is what I've got so far. It works to identify one missing index but fails if there are several missing index points.
The error starts after the first else line. I'm not sure how to keep the observed index in the file (1, 2, 3, 5...) in step with the for loop's index (0, 1, 2, 3...), since the missing index points compound over time.
Note that the first 4 rows of the text file contain header info, which I ignore during parsing. That's why I use data = f.readlines()[4:]
def check_sorted_file(fileName):
    missing_idx = []
    count = 1
    with open(fileName, 'r') as f:
        data = f.readlines()[4:]
        for x, line in enumerate(data):
            idx = int(line.split()[0])
            if idx == (count + x):
                pass
            else:
                missing_idx.append(count + x)
                count += 1
    if missing_idx != []:
        print('\nThe following indices are missing: ')
        print(*missing_idx, sep=", ")
    else:
        print('\nAll indices are accounted for. ')
    return missing_idx
...
Thanks for any and all help!

The other answers give you much better overall solutions, but I want to guide your own attempt in the right direction so you can see how to change it to work:
def check_sorted_file(fileName):
    missing_idx = []
    last_index = 0
    with open(fileName, 'r') as f:
        data = f.readlines()[4:]
        for line in data:
            idx = int(line.split()[0])
            if idx == last_index+1:
                pass
            else:
                # Record every index skipped between the previous line and this one
                missing_idx.extend(list(range(last_index+1, idx)))
            last_index = idx
    if missing_idx:
        print('\nThe following indices are missing: ')
        print(*missing_idx, sep=", ")
    else:
        print('\nAll indices are accounted for. ')
    return missing_idx
So instead of using enumerate, we use the index read from each line as our guide to where we are.
To handle several missing indices in a row, we use range to get all the numbers between the last index and the current one, and extend our list with those numbers.

You can do this with Python alone:
with open(filename) as f:
    # Skip the four header lines and any blank lines; keep only the first field
    indices = [int(row.split('\t')[0])
               for row in f.read().splitlines()[4:] if row.strip()]
missing_indices = [index
                   for index in range(1, max(indices) + 1)
                   if index not in indices]
This converts your data into a list of integers: each row is split on tabs, and since we only care about the indices, we keep the first element and ignore the others.
Then, since the indices are in running order starting from 1, we construct a range spanning the expected indices (1 up to the largest index in the file) and keep the values that exist in that range but not in the file. Testing membership in a list scans the whole list each time, so this gets slow for millions of rows.
Assuming the indices are unique (which seems reasonable), we can also use DYZ's suggestion to use sets:
missing_indices = set(range(1, max(indices) + 1)) - set(indices)
pandas works fine too:
import pandas as pd
# Skip the four header lines and read the remaining rows without a header row
df = pd.read_csv(filename, sep='\t', skiprows=4, header=None)
range_index = pd.RangeIndex(1, int(df.iloc[:, 0].max()) + 1)
print(range_index[~range_index.isin(df.iloc[:, 0])])
This creates a pandas DataFrame from your data, skipping the first four lines. Following the same principle as above, it creates an index with all expected values and takes the subset of it that does not exist in the first column of the DataFrame.

Since you have a large number of rows, you might want to do this lazily, without building large lists or using in to test whether every value is in a million-item list. You can combine a few itertools functions to do this as an iterator and only build the list at the end (if you even need it then).
Basically, you tee a map into two iterators over the indexes, advance one of them by one value with next(), then zip them, checking the difference as you go:
from itertools import chain, tee
lines = ["1 0 4 0d 07:00:37.0400009155273",
"2 0 4 0d 07:00:37.0400009155273",
"3 0 4 0d 07:00:37.0400009155273",
"5 0 4 0d 07:00:37.0400009155273",
"7 0 4 0d 07:00:37.0400009155273",
"9 0 4 0d 07:00:37.0400009155273"
]
#two iterators going over indexes
i1, i2 = tee(map(lambda x: int(x.split()[0]), lines), 2)
# move one forward
next(i2)
# chain.from_iterable will be an iterator producing missing indexes:
list(chain.from_iterable(range(i+1, j) for i, j in zip(i1, i2) if j-i!=1))
Result:
[4, 6, 8]
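The same lazy idea can also be sketched as a generator that works directly on an open file, so the lines never need to be held in memory. This is only a sketch under assumptions: the name iter_missing and its skip parameter are made up for illustration, and the indexes are assumed to be sorted in ascending order.

```python
from itertools import islice


def iter_missing(lines, skip=0):
    """Lazily yield the indexes missing between consecutive lines.

    `lines` is any iterable of text lines (for example an open file);
    `skip` is the number of header lines to ignore. Both names are
    hypothetical and only used for this illustration.
    """
    previous = None
    for line in islice(lines, skip, None):
        if not line.strip():
            continue  # ignore blank lines such as a trailing newline
        current = int(line.split()[0])
        if previous is not None:
            # Everything strictly between two neighbours is missing
            yield from range(previous + 1, current)
        previous = current


sample = [
    "1 0 4 0d 07:00:37.0400009155273",
    "2 0 4 0d 07:00:37.0400009155273",
    "3 0 4 0d 07:00:37.0400009155273",
    "5 0 4 0d 07:00:37.0400009155273",
    "7 0 4 0d 07:00:37.0400009155273",
    "9 0 4 0d 07:00:37.0400009155273",
]
print(list(iter_missing(sample)))  # [4, 6, 8]
```

With a real file you would pass the file object itself, for example iter_missing(f, skip=4) inside a with open(...) block.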

Here's a compact, robust, set-based, core Python-only solution. Read the file, split each line into fields, convert the first field into an int, and build a set of actual indexes:
skip = 4 # Skip that many lines
with open(yourfile) as f:
    for _ in range(skip):
        next(f)
    actual = {int(line.split()[0]) for line in f}
Create a set of expected indexes and take set difference:
expected = set(range(min(actual), max(actual) + 1))
sorted(expected - actual)
#[4, 6, 8]
The solution works even when the indexes do not start at 1.

Related

Python: Iterate through Big Array with BigInts, find first duplicate and printout the indexes of the duplicate Values

For a cryptography course at my university, I need to compare a lot of SHA hashes that are stored in an array. I need to compare the values at the array indexes.
There are duplicates in the array. I already checked that by comparing the length of a set of the array with the length of the array itself.
Now I need the indexes of the duplicated values. I found a lot of solutions for checking for duplicates, but only for short arrays. My array has a length of 3 million, and the value at each index is around this size: 864205495604807476120572616017955259175325408501.
I have written a nested loop (I'm coming from Java and trying to learn Python). Here is my code:
counter_outer = 0
while counter_outer < len(hash_value_array):
    counter_inner = counter_outer + 1
    while counter_inner < len(hash_value_array):
        if hash_value_array[counter_outer] == hash_value_array[counter_inner]:
            print(f"*****FOUND MATCH *****")
            print(f"Message [{counter_outer}] Hashvalue has same Value as Message [{counter_inner}]")
            safe_index1 = counter_outer
            safe_index2 = counter_inner
            counter_outer = len(hash_value_array)
            break
        else:
            print("------NO Match-----")
        counter_inner += 1
    counter_outer += 1
As you can imagine ... this takes ages.
What matters to me is the indexes where the duplicates are, not the values. So for example, if there is an 898 at index 100 and an 898 at index 1000001, the only output I need is: 100, 1000001
Any suggestions?

You can do something along these lines in Python:
Assume this list of 5 signatures (they could be ints or strings, but I have strings):
li=['864205495604807476120572616017955259175325408501',
'864205495604807476120572616017955259175325408502',
'864205495604807476120572616017955259175325408503',
'864205495604807476120572616017955259175325408501',
'864205495604807476120572616017955259175325408502']
You can make a dict of lists, where each list holds the indexes at which a given signature occurs:
idx={}
for i, sig in enumerate(li):
    idx.setdefault(sig, []).append(i)
If you make li 3,000,000 entries, that runs in about 550 milliseconds on my computer and likely would be similar on yours.
You can then find the duplicates like so:
for sig, v in idx.items():
    if len(v)>1:
        print(f'{sig}: {v}')
Prints:
864205495604807476120572616017955259175325408501: [0, 3]
864205495604807476120572616017955259175325408502: [1, 4]
If you just want the FIRST duplicate, you can modify the first loop like so:
idx={}
for i, sig in enumerate(li):
    if sig in idx:
        print(f'Duplicate {sig} at {idx[sig]} and {i}')
        break
    else:
        idx[sig]=i
Prints:
Duplicate 864205495604807476120572616017955259175325408501 at 0 and 3
"But to be honest, I don't understand why this is so much faster."
Yours is very slow because the nested while loops give it O(n**2) complexity: for each element, you loop over the rest of the array. The method I showed here loops over the list only once, not 3 million times, because each dict lookup takes roughly constant time.
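If only the two indexes are needed (as in the question's 100, 1000001 example), the single-pass idea can be wrapped in a small function. This is a sketch, not a definitive implementation; the name first_duplicate_indices is made up for illustration.

```python
def first_duplicate_indices(values):
    """Return (first_index, second_index) of the first repeated value, or None.

    One pass over the data: each value is looked up in a dict, which takes
    roughly constant time, so the total work grows linearly with len(values).
    """
    seen = {}  # value -> index where it was first seen
    for i, value in enumerate(values):
        if value in seen:
            return seen[value], i
        seen[value] = i
    return None


hashes = [898, 12, 57, 898, 12]
print(first_duplicate_indices(hashes))  # (0, 3)
```

The values can be ints or strings, as long as they are hashable.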

Mistake in Python code

There is an array of integers. There are also disjoint sets, A and B, each containing integers. You like all the integers in set A and dislike all the integers in set B. Your initial happiness is 0. For each integer in the array, if i in A, you add 1 to your happiness. If i in B, you add -1 to your happiness. Otherwise, your happiness does not change. Output your final happiness at the end.
Input Format
The first line contains integers n and m separated by a space.
The second line contains n integers, the elements of the array.
The third and fourth lines contain m integers, A and B respectively.
Output Format
Output a single integer, your total happiness.
Sample Input
3 2
1 5 3
3 1
5 7
Sample Output
1
Can someone please explain what is wrong with this solution? It passes some tests but fails on others.
input()
array = set(input().split())
set1 = set(input().split())
set2 = set(input().split())
res = len(set1 & array) - len(set2 & array)
print(res)

The problem is that you're converting your inputs to sets, which removes the duplicates. If the input has repeated values, the set version only adds or subtracts 1 for each distinct value. If that is the intended behavior, your code is fine. If not, you should keep the array as a list rather than a set.
The code could be something like this:
# The first part should stay the same, without the set() call on array
input()
array = input().split()
list1 = set(input().split())
list2 = set(input().split())
# Now we use some list comprehension to get the happiness result
res = sum([1 for elem in array if elem in list1]) - sum([1 for elem in array if elem in list2])
The first sum accumulates the positive points, and the second one the negative points. It works with multiple occurrences, adding or subtracting one point for each.
EDIT
A clearer approach, to make the for loop easier to follow:
# The first part should stay the same, without the set() call on array
input()
array = input().split()
list1 = set(input().split())
list2 = set(input().split())
# We create a variable res which will store the resulting happiness, initially 0
res = 0
# Now we iterate through the elements in array and check whether they should add or subtract
for elem in array:
    # If the element is in list1, we add 1 to res
    if elem in list1:
        res += 1
    # If the element is in list2, we subtract 1 from res
    elif elem in list2:
        res -= 1
# Output the final happiness
print(res)
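To check the counting logic without typing input, the same loop can be sketched as a function and run on the sample from the question. The function name happiness and its parameters are assumptions made for this illustration.

```python
def happiness(array, liked, disliked):
    """Add 1 for every array element in `liked`, subtract 1 for every one in `disliked`.

    `array` stays a list so repeated values count once per occurrence;
    `liked` and `disliked` are turned into sets for fast membership tests.
    """
    liked, disliked = set(liked), set(disliked)
    total = 0
    for elem in array:
        if elem in liked:
            total += 1
        elif elem in disliked:
            total -= 1
    return total


# Sample input from the question: array "1 5 3", A = "3 1", B = "5 7"
print(happiness("1 5 3".split(), "3 1".split(), "5 7".split()))  # 1
# Repeated values are counted every time they occur
print(happiness("1 1 1 5".split(), "3 1".split(), "5 7".split()))  # 2
```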

I read the inputs for A and B, and the array itself, as plain lists. I wanted to compute the happiness in one line using generator expressions, as below, so I merged the print and the "happiness =" assignment into a single line. Note that membership tests on lists are linear, so converting listA and listB to sets should be faster on large inputs.
input()
my_array = input().split()
listA=list(input().split())
listB=list(input().split())
print (sum(1 for data in my_array if data in listA)+sum(-1 for data in my_array if data in listB))

python sum columns in matrix

Here is a piece of my script. It should open a matrix (in the file matrix_seeds_to_all_targets) and sum all the elements in each column (at the end I should get a 1xN array). What I get instead is an error: AttributeError: 'list' object has no attribute 'sum'. Could you please give me any insight on this?
def collapse_probtrack_results(waytotal_file, matrix_file):
    with open(waytotal_file) as f:
        waytotal = int(f.read())
    f = open(wayfile_template + roi + "/matrix_seeds_to_all_targets")
    l = [map(int, line.split(',')) for line in f if line.strip() != ""]
    collapsed = l.sum(axis=0) / waytotal * 100.
    return collapsed
    print (collapsed)

As the message says: lists don't have a method named sum. It isn't clear what you are trying to do on that line, so I can't be more helpful than that.

You could just use numpy instead of trying to sum over lists:
import numpy as np
matrix = np.random.randint(0, 100, (3, 6))  # read in your matrix file here
newMatrix = np.sum(matrix, axis=0)
print(newMatrix)
which will give you something like:
[168 51 23 115 208 54]
Without numpy, you would have to use something like a list comprehension to go over the "columns" in your lists and sum them. Python's built-in sum adds up the items of a single flat list, which isn't what you need when you have a matrix and want to sum over its columns.
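As a minimal sketch of the no-numpy route, zip(*rows) regroups a list of rows into columns, and each column can then be summed. The names rows and column_sums are made up for this example, and every row is assumed to have the same length.

```python
def column_sums(rows):
    """Sum each column of a list of equal-length rows without numpy."""
    # zip(*rows) yields one tuple per column: (row0[0], row1[0], ...), ...
    return [sum(column) for column in zip(*rows)]


rows = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(column_sums(rows))  # [12, 15, 18]
```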

I think that the instruction l.sum() is wrong. The function used to sum over a list is sum and must be used as in this sample:
myList = [1, 2, 3]
sum(myList) # will return 6
myList.sum() # will throw an error
If you want to select a given column, you can use this list comprehension: [row[columnID] for row in A]
So, for instance, this code sums each column of a 2D list named l:
numCols = len(l[0])
result = []
for i in range(numCols):
    result.append(sum([row[i] for row in l]))
print(result)
Also, your code has a print after a return. It will never execute ;)

Python: How to traverse a list of lists by columns, as if it was a regular 2D array (matrix)?

I have a list of lists like this:
list=[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12,13,14,15]]
I want to use this list in a function that "slices" it into 9 groups of four values each, which are stored in a dict of lists. If the list holds n values in total, the function should create [sqrt(n)-1]*[sqrt(n)-1] slices. In this case, with n=16 values, slices=[4-1]*[4-1]=9.
This is the function:
def dictionarize1(array):
    dict1 = {}
    count = 0
    for x in range(len(array[0]) - 1):
        for y in range(len(array[0]) - 1):
            dict1[count] = [array[x][y], array[x][y+1], array[x+1][y], array[x + 1][y+1]]
            count = count + 1
    return dict1
dictionarize1(list) #Calling the function
I think of my list as a 2D array (a 4x4 matrix in this case) that must be traversed like a matrix: the row is fixed and the column is incremented:
          Columns ->
Rows v    0   1   2   3
          4   5   6   7
          8   9  10  11
         12  13  14  15
Therefore, my slices must be [0,1,4,5], [1,2,5,6],[2,3,6,7],[4,5,8,9] etc., without wrapping around the edges of the matrix.
The final result should be dict={'0':[0,1,4,5],'1':[1,2,5,6],'2':[2,3,6,7],'3':[4,5,8,9],'4':[5,6,9,10],'5':[6,7,10,11],'6':[8,9,12,13],'7':[9,10,13,14],'8':[10,11,14,15]}.
My question: is this the correct way for my for loops to traverse the list by columns (meaning the row is fixed and the column is incremented)? I cannot tell, because when I print my list I only get a regular list, which confuses me. Thanks!

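I'm not sure I understand what you want, but your loops are fine. The function returns the dict rather than changing the list, so to see the result, store what it returns and print that:
result = dictionarize1(list) #Calling the function
print(result) #your result is ok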
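To see the traversal order explicitly, here is a small sketch of the same loops that prints every 2x2 window with its key. The names dictionarize and grid are assumptions for this example (grid avoids shadowing the built-in list), and the keys are ints, as in the question's function, rather than the strings shown in the expected dict.

```python
def dictionarize(grid):
    """Collect every 2x2 window of a rectangular grid, row by row."""
    windows = {}
    count = 0
    for x in range(len(grid) - 1):          # x is the (fixed) row of the window
        for y in range(len(grid[0]) - 1):   # y moves along the columns
            windows[count] = [grid[x][y], grid[x][y + 1],
                              grid[x + 1][y], grid[x + 1][y + 1]]
            count += 1
    return windows


grid = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
result = dictionarize(grid)
for key, window in result.items():
    print(key, window)
```

The first printed windows are [0, 1, 4, 5], [1, 2, 5, 6] and [2, 3, 6, 7], so the inner loop does move along the columns while the row stays fixed.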

Is there a better way to write the following method in python?

I am writing a small program, in python, which will find a lone missing element from an arithmetic progression (where the starting element could be both positive and negative and the series could be ascending or descending).
For example, if the input is 1 3 5 9 11, the function should return 7, as this is the lone missing element in the above AP series.
The input format: the input elements are separated by 1 white space and not commas as is commonly done.
Here is the code:
def find_missing_elm_ap_series(n, series):
    ap = series
    ap = ap.split(' ')
    ap = [int(i) for i in ap]
    cd = []
    for i in range(n-1):
        cd.append(ap[i+1]-ap[i])
    common_diff = 0
    if len(set(cd)) == 1:
        print('The series is complete')
        return series
    else:
        cd = [abs(i) for i in cd]
        common_diff = min(cd)
    if ap[0] > ap[1]:
        common_diff = (-1)*common_diff
    new_ap = []
    for i in range(n+1):
        new_ap.append(ap[0] + i*common_diff)
    missing_element = set(new_ap).difference(set(ap))
    return missing_element
Here n is the length of the series provided (the series with the missing element: 5 in the above example).
I am sure there are shorter and more elegant ways of writing this code in Python. Can anybody help?
Thanks
BTW: I am learning Python by myself, hence the question.

This is based on the fact that, if an element is missing, it is exactly the expected sum of the series minus the actual sum of the series. The expected sum for a series with n elements starting at a and ending at b is (a+b)*n/2. The rest is Python:
def find_missing(series):
    A = [int(x) for x in series.split()]
    a, b, n, sumA = A[0], A[-1], len(A), sum(A)
    if (a+b)*n == 2*sumA:
        return None #no element missing
    # The complete series has n+1 elements, so its sum is (a+b)*(n+1)/2
    return (a+b)*(n+1)//2 - sumA
print(find_missing("1 3 5 9")) #7
print(find_missing("-1 1 3 5 9")) #7
print(find_missing("9 6 0")) #3
print(find_missing("1 2 3")) #None
print(find_missing("-3 1 3 5")) #-1
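The arithmetic behind this answer can be written out as follows. This is a sketch that assumes the first and last terms are present (exactly one interior term is missing), with S denoting the sum of the n given terms and m the missing term; both symbols are introduced here only for illustration.

```latex
% The complete progression has n+1 terms, running from a to b
S_{\text{expected}} = \frac{(a+b)(n+1)}{2}
\qquad\Longrightarrow\qquad
m = S_{\text{expected}} - S = \frac{(a+b)(n+1)}{2} - S

% Worked example for the input 1 3 5 9: a = 1, b = 9, n = 4, S = 18
m = \frac{(1+9)\cdot 5}{2} - 18 = 25 - 18 = 7
```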

Well... You can do simpler, but it would completely change your algorithm.
First, you can prove that the step for the arithmetic progression is ap[1] - ap[0], unless ap[2] - ap[1] is lower in magnitude than it, in which case the missing element is between terms 0 and 1. (This is true as there is a single missing element.)
Then you can just take ap[0] + n * step and print the first one that doesn't match.
Here is the source code (also implementing some minor shortcuts, such as grouping your first three lines into one):
def find_missing_elm_ap_series(n, series):
    ap = [int(i) for i in series.split(' ')]
    step = ap[1] - ap[0]
    if abs(ap[2] - ap[1]) < abs(step): # The gap is between terms 0 and 1
        return ap[0] + ap[2] - ap[1]
    for i, val in enumerate(ap): # Otherwise find the position of the missing element
        if ap[0] + i * step != val:
            return ap[0] + i * step
    return series # missing element not found

The code appears to be working. There is perhaps a slightly easier way to get it done, because you don't have to look through all of the values to get the common difference. The following code simply looks at the difference between the first and second values, as well as between the last and second-to-last.
This works when only a single value is missing (and the length of the list is at least 3), because the smaller of those two differences gives you the common difference.
def find_missing(prog):
    # First we cast them to numbers.
    items = [int(x) for x in prog.split()]
    # Then we compare the first and second
    first_to_second = items[1] - items[0]
    # Then we compare the last to the second-to-last
    last_to_second_last = items[-1] - items[-2]
    # Now we have to care about which one is closest
    # to zero
    if abs(first_to_second) < abs(last_to_second_last):
        change = first_to_second
    else:
        change = last_to_second_last
    # Iterate through the list. As soon as we find a gap
    # that is larger than change, we fill in and return
    for i in range(1, len(items)):
        comp = items[i] - items[i-1]
        if comp != change:
            return items[i-1] + change
    # There was no gap
    return None
print(find_missing("1 3 5 9")) #7
print(find_missing("-1 1 3 5 9")) #7
print(find_missing("9 6 0")) #3
print(find_missing("1 2 3")) #None
The code above illustrates this: it first works out the common difference from the two ends of the list, then iterates until a gap does not match that difference and returns the value that was expected there.

Here's the way I thought about it: find the position of the maximum difference between the elements of the array; then regenerate the expected number in the sequence from the other differences (which should be all the same and the minimum number in the differences list):
def find_missing(a):
    d = [a[i+1] - a[i] for i in range(len(a)-1)]
    i = d.index(max(d))
    x = min(d)
    return a[0] + (i+1)*x
print(find_missing([1,3,5,9,11]))
7
print(find_missing([1,5,7,9,11]))
3

Here are some ideas:
Passing the length of the series seems like a bad idea. The function can more easily calculate the length
There is no reason to assign series to ap, just do a function using series and assign the result to ap
When splitting the string, don't give the sep argument. If you don't give the argument, then consecutive white space will also be removed and leading and trailing white space will also be ignored. This is more friendly on the format of the data.
I've combined a few operations. For example the split and the list comprehension converting to integer make sense to group together. There is also no need to create cd as a list and then convert that to a set. Just build it as a set to start with.
I don't like that the function returns the original series in the case of no missing element. The value None would be more in keeping with the name of the function.
Your original function returned a one item set as the result. That seems odd, so I've used pop() to extract that item and return just the missing element.
The last item was more of an experiment with combining all of the code at the bottom into a single statement. Don't know if it is better, but it's something to think about. I built a set with all the correct numbers and a set with the given numbers and then subtracted them and returned the number that was missing.
Here's the code that I came up with:
def find_missing_elm_ap_series(series):
    ap = [int(i) for i in series.split()]
    n = len(ap)
    cd = {ap[i+1]-ap[i] for i in range(n-1)}
    if len(cd) == 1:
        print('The series is complete')
        return None
    else:
        common_diff = min([abs(i) for i in cd])
        if ap[0] > ap[1]:
            common_diff = (-1)*common_diff
        return set(range(ap[0],ap[0]+common_diff*n,common_diff)).difference(set(ap)).pop()

Assuming the first and last items are not missing, we can also use range() with the common difference as its step, which gets rid of n altogether. It can also return more than one missing item (although not reliably, depending on how many items are missing):
In [13]: def find_missing_elm(series):
    ...:     ap = list(map(int, series.split()))
    ...:     cd = [b - a for a, b in zip(ap[:-1], ap[1:])]
    ...:     if len(set(cd)) == 1:
    ...:         print('complete series')
    ...:         return ap
    ...:     mcd = min(cd) if ap[0] < ap[1] else max(cd)
    ...:     sap = set(ap)
    ...:     return [x for x in range(ap[0], ap[-1], mcd) if x not in sap]
    ...:
In [14]: find_missing_elm('1 3 5 9 11 15')
Out[14]: [7, 13]
In [15]: find_missing_elm('15 11 9 5 3 1')
Out[15]: [13, 7]
