index a list in a Python for loop - python

I'm making a for loop within a for loop. I'm looping through a list and finding a specific string that contains a regular expression pattern. Once I find the line, I need to search to find the next line of a certain pattern. I need to store both lines to be able to parse out the time for them. I've created a counter to keep track of the index number of the list as the outer for loop works. Can I use a construction like this to find the second line I need?
index = 0
for lineString in summaryList:
    match10secExp = re.search('taking 10 sec. exposure', lineString)
    if match10secExp:
        startPlate = lineString
        for line in summaryList[index:index+10]:
            matchExposure = re.search('taking \d\d\d sec. exposure', line)
            if matchExposure:
                endPlate = line
                break
    index = index + 1
The code runs, but I'm not getting the result I'm looking for.
Thanks.

matchExposure = re.search('taking \d\d\d sec. exposure', lineString)
should probably be
matchExposure = re.search('taking \d\d\d sec. exposure', line)

Depending on your exact needs, you can just use an iterator on the list, or two of them as made by itertools.tee. I.e., if you want to search lines following the first pattern only for the second pattern, a single iterator will do:
theiter = iter(thelist)
for aline in theiter:
    if re.search(somestart, aline):
        for another in theiter:
            if re.search(someend, another):
                yield aline, another # or print, whatever
                break
This will not search lines from aline to the ending another for somestart, only for someend. If you need to search them for both purposes, i.e., leave theiter itself intact for the outer loop, that's where tee can help:
for aline in theiter:
    if re.search(somestart, aline):
        _, anotheriter = itertools.tee(iter(thelist))
        for another in anotheriter:
            if re.search(someend, another):
                yield aline, another # or print, whatever
                break
This is an exception to the general rule about tee which the docs give:
Once tee() has made a split, the
original iterable should not be used
anywhere else; otherwise, the iterable
could get advanced without the tee
objects being informed.
because the advancing of theiter and that of anotheriter occur in disjoint parts of the code, and anotheriter is always rebuilt afresh when needed (so the advancement of theiter in the meantime is not relevant).
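For the bounded look-ahead in the question (the next 10 lines after a start line), a minimal sketch using enumerate instead of a manual counter might look like this. The patterns come from the question; the helper name find_plate_pairs and the sample log lines are assumptions made up for illustration.

```python
import re

START = re.compile(r'taking 10 sec\. exposure')
END = re.compile(r'taking \d\d\d sec\. exposure')

def find_plate_pairs(lines, window=10):
    """Pair each start line with the first end line in the next `window` lines."""
    pairs = []
    for index, line in enumerate(lines):
        if START.search(line):
            # Look only at the lines after the start line, up to `window` of them.
            for candidate in lines[index + 1:index + 1 + window]:
                if END.search(candidate):
                    pairs.append((line, candidate))
                    break
    return pairs

# Hypothetical sample data, not taken from the original log.
summary_list = [
    '12:00:01 taking 10 sec. exposure',
    '12:00:15 moving telescope',
    '12:00:30 taking 300 sec. exposure',
    '12:06:00 taking 10 sec. exposure',
    '12:06:20 taking 120 sec. exposure',
]
pairs = find_plate_pairs(summary_list)
for start_plate, end_plate in pairs:
    print(start_plate, '->', end_plate)
```

Because enumerate supplies the index, there is no separate counter to keep in step with the loop, and the slice starts one line past the start line so the start line is never tested against the second pattern.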

Related

Comparing all contents of two files

I am trying to compare two files. One file has a list of stores. The other list has the same list of stores, except it is missing a few from a filter I had run against it from another script. I would like to compare these two files, if the store in file 1 is not anywhere to be located in file 2, I want to print it out, or append to a list, not too picky on that part. Below are examples of partials in both files:
file 1:
Store: 00377
Main number: 8033056238
Store: 00525
Main number: 4075624470
Store: 00840
Main number: 4782736996
Store: 00920
Main number: 4783337031
Store: 00998
Main number: 9135631751
Store: 02226
Main number: 3107501983
Store: 02328
Main number: 8642148700
Store: 02391
Main number: 7272645342
Store: 02392
Main number: 9417026237
Store: 02393
Main number: 4057942724
File 2:
00377
00525
00840
00920
00998
02203
02226
02328
02391
02392
02393
02394
02395
02396
02397
02406
02414
02425
02431
02433
02442
Here is what I built to try and make this work, but it just keeps spewing all stores in the file.
def comparesitestest():
    with open("file_1.txt", "r") as pairsin:
        pairs = pairsin.readlines()
        pairsin.close
    with open("file_2.txt", "r") as storesin:
        stores = storesin.readlines()
        storesin.close
    for pair in pairs:
        for store in stores:
            if store not in pair:
                print(store)
When you read your first file, add the store number to a set.
store_nums_1 = set()
with open("file_1.txt") as f:
    for line in f:
        line = line.strip() # Remove trailing whitespace
        if line.startswith("Store"):
            store_nums_1.add(line[7:]) # Add only store number to set
Next, read the other file and add those numbers to another set
store_nums_2 = set()
with open("file_2.txt") as f:
    for line in f:
        line = line.strip() # Remove trailing whitespace
        store_nums_2.add(line) # The entire line is the store number, so no need to slice.
Finally, find the set difference between the two sets.
file1_extras = store_nums_1 - store_nums_2
Which gives a set containing only the store numbers in file 1 but not in file 2. (I changed your file_2 to have only the first three lines, because the file you've shown actually contains more store numbers than file_1, so the result file1_extras was empty using your input)
{'00920', '00998', '02226', '02328', '02391', '02392', '02393'}
This is more efficient than using lists, because checking if something exists in a list is an O(N) operation. When you do it once for each of the M items in your first list, you end up with an O(N*M) operation. On the other hand, membership checks in a set are O(1), so the entire set-difference operation is O(M) instead of O(N*M)
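The set-based approach above can be tried without any files. The following is a small sketch that uses in-memory lists in place of file_1.txt and file_2.txt; the helper name store_numbers and the shortened sample data are assumptions for illustration.

```python
# Stand-ins for the lines of file_1.txt and file_2.txt (shortened sample).
file_1_lines = [
    "Store: 00377", "Main number: 8033056238",
    "Store: 00525", "Main number: 4075624470",
    "Store: 00840", "Main number: 4782736996",
    "Store: 00920", "Main number: 4783337031",
]
file_2_lines = ["00377", "00525", "00840"]

def store_numbers(lines):
    """Collect store numbers from 'Store: NNNNN' lines, ignoring everything else."""
    prefix = "Store: "
    return {line.strip()[len(prefix):] for line in lines if line.startswith(prefix)}

store_nums_1 = store_numbers(file_1_lines)
store_nums_2 = {line.strip() for line in file_2_lines}

# Set difference: store numbers present in file 1 but missing from file 2.
file1_extras = store_nums_1 - store_nums_2
print(sorted(file1_extras))
```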
You are getting the output you get because your check is not checking what you want. Try changing your for loop to something like this:
for pairline in pairs:
    if pairline:
        name, number = pairline.split(': ')
        if name == "Store":
            if number not in stores:
                print(number)
Explanation is as follows:
You start with a File 1 of pairs, and a File 2 of stores (store numbers, really). Your file 2 is in decent shape. After you read it in, you've got a list of store numbers. You don't need to put that through a second loop. In fact, it's wasteful and unnecessary.
Your File 1 is a little more complicated. Although you refer to the info as pairs, it's a little more complicated than that, because the lines have a store number and what I assume is a phone number. So, for each line in File 1, I would check if the line starts with "Store:", knowing I can ignore all the other lines. If the line starts with "Store:", the next part of the line is the store number I actually want to check for in the list from File 2.
So, the program above does a little more checking to see if it's reading in a line it needs to act on, and then it acts on it if necessary by checking whether the store number is in the store number list.
Also, as a side note, it's great to use the with structure. It's good coding practice. But when you do that, you do not need to explicitly close the file. That happens automatically with that context structure. Once you leave the context, the close happens automatically.
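The automatic close can be checked directly. This is a small sketch that writes a throwaway temporary file (the file contents and variable names are assumptions for illustration) and inspects the file object's closed attribute inside and after the with block.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("00377\n00525\n")
    path = tmp.name

with open(path) as stores_in:
    stores = stores_in.readlines()
    closed_inside = stores_in.closed   # still open inside the block

closed_after = stores_in.closed        # closed automatically on leaving the block
os.remove(path)
print(closed_inside, closed_after)
```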
As another side note, there are usually multiple good ways and bad ways to solve a problem. Another possible reasonable solution/version is:
for pairline in pairs:
    if pairline and pairline.startswith("Store:"):
        store = pairline.split()[1]
        if store not in stores:  # assumes the entries in stores have had their newlines stripped
            print(store)
It's different. Not necessarily better or worse, just different.

Deleting/Removing element from a list when comparing to another list Python

So I have a good one. I'm trying to build two lists (ku_coins and bin_coins) of crypto tickers from two different exchanges, but I don't want to double up, so if it appears on both exchanges I want to remove it from ku_coins.
A slight complication occurs as Kucoin symbols come in as AION-BTC, while Binance symbols come in as AIONBTC, but it's no problem.
So firstly, I create the two lists of symbols, which runs fine, no problem. What I then try and do is loop through the Kucoin symbols and convert them to the Binance style symbol, so AIONBTC instead of AION-BTC. Then if it appears in the Binance list I want to remove it from the Kucoin list. However, it appears to randomly refuse to remove a handful of symbols that match the requirement. For example AION.
It removes the majority of doubled up symbols but in AIONs case for example it just won't delete it.
If I just do print(i) after this loop:
for i in ku_coins:
    if str(i[:-4] + 'BTC') in bin_coins:
It will happily print AION-BTC as one of the symbols, as it fits the requirement perfectly. However, when I stick the ku_coins.remove(i) command in before printing, it suddenly decides not to print AION, suggesting it doesn't match the requirements. And it's doing my head in. Obviously the remove command is causing the problem, but I can't for the life of me figure out why. Any help really appreciated.
import requests
import json
ku_dict = json.loads(requests.get('https://api.kucoin.com/api/v1/market/allTickers').text)
ku_syms = ku_dict['data']['ticker']
ku_coins = []
for x in range(0, len(ku_syms)):
    if ku_syms[x]['symbol'][-3:] == 'BTC':
        ku_coins.append(ku_syms[x]['symbol'])
bin_syms = json.loads(requests.get('https://www.binance.com/api/v3/ticker/bookTicker').text)
bin_coins = []
for i in bin_syms:
    if i['symbol'][-3:] == 'BTC':
        bin_coins.append(i['symbol'])
ku_coins.sort()
bin_coins.sort()
for i in ku_coins:
    if str(i[:-4] + 'BTC') in bin_coins:
        ku_coins.remove(i)
@top bantz, @Fourier has already mentioned that you shouldn't modify a list you're iterating over. What you can do in this case is create a copy of ku_coins first, iterate over the copy, and remove from the original ku_coins each element that matches your if condition. See below:
ku_coins.sort()
bin_coins.sort()
# Create a copy
ku_coins_ = ku_coins[:]
# Then iterate over that copy
for i in ku_coins_:
    if str(i[:-4] + 'BTC') in bin_coins:
        ku_coins.remove(i)
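To see why removing inside the loop skips symbols such as AION, here is a small sketch with made-up symbol lists (the symbols and the helper name on_binance are assumptions, no API is called). It also shows a list comprehension as one alternative that builds a new list instead of mutating the one being iterated.

```python
bin_coins = ['ADABTC', 'AIONBTC', 'ETHBTC']                  # hypothetical Binance symbols
ku_symbols = ['ADA-BTC', 'AION-BTC', 'ETH-BTC', 'KCS-BTC']   # hypothetical Kucoin symbols

def on_binance(symbol):
    # 'AION-BTC' -> 'AIONBTC', same conversion as in the question
    return symbol[:-4] + 'BTC' in bin_coins

# Removing while iterating: after 'ADA-BTC' is removed, 'AION-BTC' slides
# into the position the loop has already visited, so it is never checked.
buggy = list(ku_symbols)
for sym in buggy:
    if on_binance(sym):
        buggy.remove(sym)

# Building a new list avoids mutating the list being iterated.
ku_coins = [sym for sym in ku_symbols if not on_binance(sym)]

print(buggy)
print(ku_coins)
```

The first list still contains AION-BTC even though it is on both exchanges, which matches the behaviour described in the question; the comprehension keeps only the Kucoin-only symbol.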
How about modifying the code to:
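while ku_coins:
    i = ku_coins.pop()
    if str(i[:-4] + 'BTC') in bin_coins:
        pass
    else:
        pass  # do something with i
The pop() method removes the last item from the ku_coins list and returns it, so the list is never modified while a for loop is iterating over it.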

Splitting a list into a file without duplicates

Large data file like this:
133621 652.4 496.7 1993.0 ...
END SAMPLES EVENTS RES 271.0 2215.0 ...
ESACC 935.6 270.6 2215.0 ...
115133 936.7 270.3 2216.0 ...
115137 936.4 270.4 2219.0 ...
115141 936.1 271.0 2220.0 ...
ESACC L 114837 115141 308 938.5 273.3 2200
115145 936.3 271.8 2220.0 ...
END 115146 SAMPLES EVENTS RES 44.11 44.09
SFIX L 133477
133477 650.8 500.0 2013.0 ...
133481 650.2 499.9 2012.0 ...
ESACC 650.0 500.0 2009.0 ...
Want to grab only the ESACC data into trials. When END appears, preceding ESACC data is aggregated into a trial. Right now, I can get the first chunk of ESACC data into a file but because the loop restarts from the beginning of the data, it keeps grabbing only the first chunk so I have 80 trials with the exact same data.
for i in range(num_trials):
    with open(fid) as testFile:
        for tline in testFile:
            if 'END' in tline:
                fid_temp_start.close()
                fid_temp_end.close() #Close the files
                break
            elif 'ESACC' in tline:
                tline_snap = tline.split()
                sac_x_start = tline_snap[4]
                sac_y_start = tline_snap[5]
                sac_x_end = tline_snap[7]
                sac_y_end = tline_snap[8]
My question: How to iterate to the next chunk of data without grabbing the previous chunks?
Try rewriting your code something like this:
def data_parse(filepath): #Make it a function
    try:
        with open(filepath) as testFile:
            tline = '' #Initialize tline
            while True: #Switch to an infinite while loop (I'll explain why)
                while 'ESACC' not in tline: #Skip lines until one containing 'ESACC' is found
                    tline = next(testFile) #(since it seems like you're doing that anyway)
                tline_snap = tline.split()
                trial = [tline_snap[4],'','',''] #Initialize list and assign first value
                trial[1] = tline_snap[5]
                trial[2] = tline_snap[7]
                trial[3] = tline_snap[8]
                while 'END' not in tline: #Again, seems like you're skipping lines
                    tline = next(testFile) #so I'll do the same
                yield trial #Output list, save function state
    except StopIteration:
        fid_temp_start.close() #I don't know where these enter the picture
        fid_temp_end.close() #but you closed them so I will too
        #testFile itself is already closed by the with block
#Now, initialize a new list and call the function:
trials = list()
for trial in data_parse(fid):
    trials.append(trial) #Creates a list of lists
What this creates is a generator function. By using yield instead of return, the function hands back a value AND saves its state. Each time the for loop at the end asks the generator for its next value, it picks up where it left off: execution resumes at the line after the most recently executed yield statement (which in this case restarts the while loop) and, importantly, it remembers the values of any variables (like the value of tline and the point it stopped at in the data file).
When you reach the end of the file (and have thus recorded all of your trials), the next execution of tline = next(testFile) raises a StopIteration error. The try - except structure catches that error and uses it to exit the while loop and close your files. This is why we use an infinite loop; we want to continue looping until that error forces us out.
At the end of the whole thing, your data is stored in trials as a list of lists, where each item equals [sac_x_start, sac_y_start, sac_x_end, sac_y_end], as you defined them in your code, for one trial.
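The same single-pass idea can be sketched without a file. The version below is an illustration only: the name iter_trials is made up, it treats a line whose first field is END as the end of a trial (a stricter test than 'END' in tline), and the second ESACC row in the sample is invented data.

```python
def iter_trials(lines):
    """Yield one list of ESACC rows per trial; a trial ends at a line starting with END."""
    current = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == 'END':
            if current:
                yield current      # hand back the finished trial
            current = []           # start collecting the next one
        elif fields[0] == 'ESACC':
            current.append(fields[1:])

sample = [
    '133621 652.4 496.7 1993.0',
    'ESACC L 114837 115141 308 938.5 273.3 2200',
    '115145 936.3 271.8 2220.0',
    'END 115146 SAMPLES EVENTS RES 44.11 44.09',
    'SFIX L 133477',
    'ESACC L 133477 133600 120 650.0 500.0 2009',   # hypothetical row
    'END 133700 SAMPLES EVENTS RES 44.11 44.09',    # hypothetical row
]
trials = list(iter_trials(sample))
print(len(trials))
```

Since the data is walked exactly once, each chunk of ESACC rows ends up in its own trial rather than the first chunk being read again for every trial.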
Note: it does seem to me like your code is skipping lines entirely when they don't contain ESACC or END. I've replicated that, but I'm not sure if that's what you want. If you want to get the lines in between, you can rewrite this fairly simply by adding to the 'END' loop as below:
while 'END' not in tline:
    tline = next(testFile)
    #(put assignment operations to be applied to each line here)
Of course, you'll have to adjust the variable you're using to store this data accordingly.
Edit: Oh dear lord, I just now noticed how old this question is.

Why re is not compiling 'if' when there is 'else'?

Hello, I'm facing a problem and I don't know how to fix it. All I know is that when I add an else statement to my if statement, execution always goes to the else branch, even when the if condition is true for some line and the if branch could be entered.
Here is the script, without the else statement:
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
    Wspl=line.split()
    for line2 in Mlines:
        Mspl=line2.replace('\n','').split("\t")
        if ((Mspl[0]).lower()==(Wspl[0])):
            Wspl.append(Mspl[1])
            if(len(Mspl)>=3):
                Wspl.append(Mspl[2])
            s="\t".join(Wspl)+"\n"
            if s not in filtred:
                filtred.append(s)
            break
for x in filtred:
    w.write(x)
f.close()
d.close()
w.close()
With the else statement (I want the else to belong to the if ((Mspl[0]).lower()==(Wspl[0])): test):
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
    Wspl=line.split()
    for line2 in Mlines:
        Mspl=line2.replace('\n','').split("\t")
        if ((Mspl[0]).lower()==(Wspl[0])):
            Wspl.append(Mspl[1])
            if(len(Mspl)>=3):
                Wspl.append(Mspl[2])
            s="\t".join(Wspl)+"\n"
            if s not in filtred:
                filtred.append(s)
            break
        else:
            b="\t".join(Wspl)+"\n"
            if b not in filtred:
                filtred.append(b)
            break
for x in filtred:
    w.write(x)
f.close()
d.close()
w.close()
First of all, you're not using "re" at all in your code besides importing it (maybe in some later part?), so the title is a bit misleading.
Secondly, you are doing a lot of work for what is basically a filtering operation on two files. Remember, simple is better than complex, so for starters, you want to clean your code a bit:
You should use more descriptive names than 'd' or 'w'. This goes for 'Wspl', 's' and 'av' as well. Those names don't mean anything and are hard to understand (why is the result of d.readlines() named Wlines when there's another file named 'w'? It's really confusing).
If you choose to use single letters, they should still make sense (if you iterate over a list named 'results' it makes sense to use 'r'; 'line' and 'line2', however, are not recommended for anything).
You don't need parentheses around conditions.
You want to use as few variables as you can so as not to get confused. There are too many different variables in your code, and it's easy to get lost. You don't even use some of them.
You want to use strip rather than replace, and you want the whole 'cleaning' process to come first, followed by code that deals only with the filtering logic on the two lists. If you split each line according to some logic, and you don't use the original line anywhere in the iteration, then you can do the whole thing at the beginning.
Now, I'm really confused about what you're trying to achieve here, and while I don't understand why you're doing it that way, I can say that looking at your logic you are repeating yourself a lot. The action of checking against the filtered list should only happen once, and since it happens regardless of whether the 'if' checks out or not, I see absolutely no reason to use an 'else' clause at all.
Cleaning up like I mentioned, and re-building the logic, the script looks something like this:
# PART I - read and analyze the lines
# (raw strings keep the backslashes in the Windows paths literal)
Wappresults = open(r'C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
Mikrofull = open(r'C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
# lists rather than map(), so Mikro can be iterated once per Wapp line
Wapp = [x.strip().split() for x in Wappresults.readlines()]
Mikro = [x.strip().split('\t') for x in Mikrofull.readlines()]
Wappresults.close()
Mikrofull.close()
# PART II - filter using some logic
filtred = []
for w in Wapp:
    res = w[:] # So as to copy the list instead of point to it
    for m in Mikro:
        if m[0].lower() == w[0]:
            res.append(m[1])
            if len(m) >= 3:
                res.append(m[2])
    string = '\t'.join(res)+'\n' # this happens regardless of whether the 'if' statement changed 'res' or not
    if string not in filtred:
        filtred.append(string)
# PART III - write the filtered results into a file
combination = open(r'C:\Users\Ziad\Desktop\Combination\combination.txt','w')
for comb in filtred:
    combination.write(comb)
combination.close()
I can't promise it will work (because again, like I said, I don't know what you're trying to achieve) but this should be a lot easier to work with.
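As for why the second script always seems to take the else branch: the else there belongs to the if, so the first Mikro line that does not match already runs the else and its break, before any later line gets a chance to match. If the intent is "do this only when no line matched", one possible sketch is Python's for ... else, whose else clause runs only when the loop ends without a break. The function name combine and the sample rows below are assumptions for illustration, not the asker's data.

```python
def combine(wapp_rows, mikro_rows):
    """Append matching Mikro fields to each Wapp row and note the rows with no match."""
    combined = []
    unmatched = []
    for w in wapp_rows:
        row = list(w)
        for m in mikro_rows:
            if m[0].lower() == w[0]:
                row.extend(m[1:3])   # up to two extra fields, if present
                break
        else:
            # Runs only when the inner loop finished without a break,
            # i.e. no Mikro row matched this Wapp row.
            unmatched.append(w[0])
        line = '\t'.join(row) + '\n'
        if line not in combined:
            combined.append(line)
    return combined, unmatched

wapp = [['alice', '111'], ['bob', '222']]           # hypothetical sample rows
mikro = [['Carol', 'x'], ['Alice', 'a1', 'a2']]     # hypothetical sample rows
combined, unmatched = combine(wapp, mikro)
print(combined)
print(unmatched)
```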

Schwartzian sort example in "Text Processing in Python"

I was browsing through "Text Processing in Python" and tried its example about Schwartzian sort.
I used following structure for sample data which also contains empty lines. I sorted this data by fifth column:
383230 -49 -78 1 100034 '06 text' 9562 'text' 720 'text' 867
335067 -152 -18 3 100030 'text' 2400 'text' 2342 'text' 696
136592 21 230 3 100035 '03. text' 10368 'text' 1838 'text' 977
Code used for Schwartzian sorting:
for n in range(len(lines)): # Create the transform
    lst = string.split(lines[n])
    if len(lst) >= 4: # Tuple w/ sort info first
        lines[n] = (lst[4], lines[n])
    else: # Short lines to end
        lines[n] = (['\377'], lines[n])
lines.sort() # Native sort
for n in range(len(lines)): # Restore original lines
    lines[n] = lines[n][1]
open('tmp.schwartzian','w').writelines(lines)
I don't get how the author intended that short or empty lines should go to end of file by using this code. Lines are sorted after the if-else structure, thus raising empty lines to top of file. Short lines of course work as supposed with the custom sort (fourth_word function) as implemented in the example.
This is now bugging me, so any ideas? If I'm correct about this then how would you ensure that short lines actually stay at end of file?
EDIT: I noticed the square brackets around '\377'. This messed up sort() so I removed those brackets and output started working.
else: # Short lines to end
    lines[n] = (['\377'], lines[n])
print type(lines[n][0])
>>> <type 'list'>
I accepted nosklo's answer for good clarification about the meaning of '\377' and for his improved algorithm. Many thanks for the other answers also!
If curious, I used 2 MB sample file which took 0.95 secs with the custom sort and 0.09 with the Schwartzian sort while creating identical output files. It works!
Not directly related to the question, but note that in recent versions of Python (since 2.4, when the key argument was added), the transform and untransform can be performed automatically using the key argument to sort() or sorted(), e.g.:
def key_func(line):
    lst = string.split(line)
    if len(lst) >= 4:
        return lst[4]
    else:
        return '\377'
lines.sort(key=key_func)
I don't know what is the question, so I'll try to clarify things in a general way.
This algorithm sorts lines by taking the 5th field (lst[4]) and placing it in front of each line. The built-in sort() then uses this field to order the lines. Afterwards the original line is restored.
Lines that are empty or have fewer than 5 fields are meant to fall into the else part of this structure:
if len(lst) >= 4: # Tuple w/ sort info first
    lines[n] = (lst[4], lines[n])
else: # Short lines to end
    lines[n] = (['\377'], lines[n])
It puts ['\377'] into the first position of the tuple to sort. The algorithm does that in the hope that '\377' (octal 377, i.e. byte value 255, the highest 8-bit character) will compare greater than any string found in the 5th field. So the original line should go to the bottom when sorting.
I hope that clarifies the question. If not, perhaps you should indicate exactly what it is that you want to know.
A better, generic version of the same algorithm:
def sort_by_field(list_of_str, field_number, separator=' ', defaultvalue='\xFF'):
    # decorates each value:
    for i, line in enumerate(list_of_str):
        fields = line.split(separator)
        try:
            # places original line as second item:
            list_of_str[i] = (fields[field_number], line)
        except IndexError:
            list_of_str[i] = (defaultvalue, line)
    list_of_str.sort() # sorts list, in place
    # undecorates values:
    for i, group in enumerate(list_of_str):
        list_of_str[i] = group[1] # the second item is original line
The algorithm you provided is equivalent to this one.
An empty line won't pass the test
if len(lst) >= 4:
so it will have ['\377'] as its sort key, not the 5th column of your data, which is lst[4] ( lst[0] is the first column).
Well, it will sort short lines almost at the end, but not quite always.
Actually, both the "naive" and the schwartzian version are flawed (in different ways). Nosklo and wbg already explained the algorithm, and you probably learn more if you try to find the error in the schwartzian version yourself, therefore I will give you only a hint for now:
Long lines that contain certain text
in the fourth column will sort later
than short lines.
Add a comment if you need more help.
Although the use of the Schwartzian transform is pretty outdated for Python, it is worth mentioning that you could have written the code this way to avoid the possibility of a line with lst[4] starting with \377 being sorted into the wrong place:
for n in range(len(lines)):
    lst = lines[n].split()
    if len(lst)>4:
        lines[n] = ((0, lst[4]), lines[n])
    else:
        lines[n] = ((1,), lines[n])
Since tuples are compared elementwise, the tuples starting with 1 will always be sorted to the bottom.
Also note that the test should be len(lst) > 4 instead of >= 4, because lst[4] only exists when the line has at least five fields.
The same logic applies when using the modern equivalent, the key= function:
def key_func(line):
    lst = line.split()
    if len(lst)>4:
        return 0, lst[4]
    else:
        return 1,
lines.sort(key=key_func)
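A runnable sketch of the tuple-key idea, using lines modelled on the sample data in the question plus one empty and one short line (the function name field_key and the shortened sample lines are assumptions for illustration):

```python
def field_key(line, index=4):
    """Sort key: lines with enough fields sort by that field, short lines go last."""
    fields = line.split()
    if len(fields) > index:
        return (0, fields[index])
    return (1,)

lines = [
    "383230 -49 -78 1 100034 '06 text' 9562\n",
    "\n",
    "335067 -152 -18 3 100030 'text' 2400\n",
    "short line\n",
    "136592 21 230 3 100035 '03. text' 10368\n",
]
# sorted() is stable, so the short lines keep their relative order at the end.
ordered = sorted(lines, key=field_key)
print(''.join(ordered), end='')
```

Because the first tuple element decides between "has the field" (0) and "too short" (1), the sentinel never has to compare against the field's text, so no special character such as '\377' is needed.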
