Why is re not removing some values from my list? - python

I'm asking more out of curiosity at this point since I found a work-around, but it's still bothering me.
I have a list of dataframes (x) that all have the same column names. I'm trying to use pandas and re to make a list of the subset of column names that have the format
"D(number) S(number)"
so I wrote the following function:
def extract_sensor_columns(x):
    sensor_name = list(x[0].columns)
    for j in sensor_name:
        if bool(re.match('D(\d+)S(\d+)', j))==False:
            sensor_name.remove(j)
    return sensor_name
The list that I'm generating has 103 items (98 wanted items and 5 unwanted items). This function removes three of the five columns that I want to get rid of, but keeps the columns labeled 'Pos' and 'RH'. I generated the sensor_name list outside of the function and tested the truth value of
bool(re.match('D(\d+)S(\d+)', sensor_name[j]))
for all five of the items that I wanted to get rid of and they all gave the False value. The other thing I tried is changing the conditional to ==True, which even more strangely gave me 54 items (all of the unwanted column names and half of the wanted column names).
If I rewrite the function to add the column names that have a given format (rather than remove column names that don't follow the format), I get the list I want.
def extract_sensor_columns(x):
    sensor_name = []
    for j in list(x[0].columns):
        if bool(re.match('D(\d+)S(\d+)', j))==True:
            sensor_name.append(j)
    return sensor_name
Why is the first block of code acting so strangely?

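In general, do not modify a list while iterating over it. In the first (wrong) version you remove elements from the very list you are looping over, so each removal shifts the remaining elements one position to the left and the loop skips the element that moved into the freed slot. In the second (correct) version you only append matching elements to a separate, initially empty list.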
Consider this:
arr = list(range(10))
for el in arr:
    print(el)
for i, el in enumerate(arr):
    print(el)
    arr.remove(arr[i+1])
The second loop prints only the even numbers, because each iteration removes the element that follows the current one.
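To tie this back to the question, here is a minimal sketch that reproduces the effect. The column names are made up and stand in for the real x[0].columns; the pattern is the one from the question, written as a raw string. Removing while iterating leaves 'Pos' and 'RH' behind because each one sits directly after a removed item, while building a new list avoids the problem.

```python
import re

# Hypothetical column names standing in for x[0].columns
columns = ["D1S1", "Time", "Pos", "D1S2", "Temp", "RH", "D2S1"]
pattern = re.compile(r"D(\d+)S(\d+)")

# Buggy: removing from the list being iterated shifts later items left,
# so the item right after each removed one is never examined.
buggy = list(columns)
for name in buggy:
    if not pattern.match(name):
        buggy.remove(name)

# Safe: build a new list instead of mutating the one being iterated.
fixed = [name for name in columns if pattern.match(name)]

print(buggy)  # ['D1S1', 'Pos', 'D1S2', 'RH', 'D2S1']
print(fixed)  # ['D1S1', 'D1S2', 'D2S1']
```

Iterating over a copy (for name in list(buggy)) would also work, but the list comprehension is usually easier to read.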

Related

How could I iterate over the elements of a list of tuples and get the elements as strings as a result?

I'm trying to write a program that loads a workbook, iterates over the values of all the cells in two columns, formats a string with the values read from those cells, and writes the resulting formatted string to the cells of a column in a new worksheet.
My problem is that I cannot get the original values as strings, so I cannot format them with string operations.
The other problem is that even if I could format the string successfully, I don't know how to write the resulting string values to column A of the new worksheet.
This only appends the last element of fiok to fiokstr as a string.
Any idea why only the last element is appended?
for x in fiok:
    fiokstr = []
    for y in x:
        fiokstr.append(y)
Move fiokstr outside the loop:
fiokstr = []
for x in fiok:
    for y in x:
        fiokstr.append(y)
As it is, you're reassigning an empty list to it for every run through the loop.
I might be reading this wrong, but it looks like every time the loop iterates through fiok, fiokstr gets set back to an empty list. Try defining fiokstr outside of the outer for loop, e.g.:
fiokstr = []
for x in fiok:
    newl = []
    for y in x:
        newl.append(y)
    fiokstr.append(newl)
Thank you all. That populated fiokstr appropriately, along with another list, ipstr (same number of entries in the two lists).
Why does the following happen, then?
When I run:
txt = []
for nev in fiokstr:
    text = "H_"+nev.capitalize()
    txt.append(text)
or
for cim in ipstr:
    text = "_UPS_"+cim.replace("/24", "")
    txt.append(text)
txt gets populated appropriately, but when I run:
for nev in fiokstr, cim in ipstr:
    text = "H_"+nev.capitalize()+"_UPS_"+cim.replace("/24", "")
    txt.append(text)
I get "NameError: name 'nev' is not defined", although each loop worked on its own as shown above.
Thank you.
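The combined loop is not read as two parallel loops: Python treats everything after the first in as a single tuple expression, (fiokstr, cim in ipstr), so the loop never walks the two lists side by side. Stepping through two lists together is usually written with the built-in zip. A minimal sketch, with made-up sample values in place of the data read from the workbook:

```python
# Hypothetical sample data standing in for the values read from the workbook
fiokstr = ["budapest", "szeged"]
ipstr = ["10.0.0.1/24", "10.0.1.1/24"]

txt = []
# zip pairs the i-th name with the i-th address
for nev, cim in zip(fiokstr, ipstr):
    txt.append("H_" + nev.capitalize() + "_UPS_" + cim.replace("/24", ""))

print(txt)  # ['H_Budapest_UPS_10.0.0.1', 'H_Szeged_UPS_10.0.1.1']
```

zip stops at the shorter of the two lists, which is fine here since both lists have the same number of entries.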

What does this anonymous split function do?

narcoticsCrimeTuples = narcoticsCrimes.map(lambda x:(x.split(",")[0], x))
I have a CSV I am trying to parse by splitting on commas and the first entry in each array of strings is the primary key.
I would like to get the key on a separate line (or just separate) from the value when calling narcoticsCrimeTuples.first()[1]
My current understanding is 'split x by commas, take the first part of each split [0], and return that as the new x', but I'm pretty sure that middle part is not right because the number inside the [] can be anything and returns the same result.
Your variable is named narcoticsCrimeTuples, so you presumably expect to get a tuple.
The two values of the tuple are the first column of the CSV, x.split(",")[0], and the entire line, x.
"I would like to get the key on a separate line"
It's not really clear why you want that...
"(or just separate) from the value when calling narcoticsCrimeTuples.first()[1]"
Well, when you call .first(), you get the entire tuple. [0] is the first column, and [1] is the corresponding line of the CSV, which also contains the [0] value.
If you call narcoticsCrimes.flatMap(lambda x: x.split(",")), then all the values will be separated.
For example, in the word count example...
textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
Judging by the syntax, it seems you are using PySpark. If so, you're mapping over your RDD and, for each row, creating a (key, row) tuple, the key being the first element in a comma-separated list of items. Calling narcoticsCrimeTuples.first() will just give you the first record.
See an example here:
https://gist.github.com/amirziai/5db698ea613c6857d72e9ce6189c1193
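The same lambda can be tried without Spark. The sketch below uses plain Python lists and the built-in map as a stand-in for the RDD (an assumption for illustration only), with made-up CSV lines. It also shows why changing the number inside [] seemed to make no difference: that index only changes the key, while element [1] of the tuple is always the whole line x.

```python
# Hypothetical CSV lines; the first field is the primary key
lines = ["1001,NARCOTICS,2015-03-01", "1002,NARCOTICS,2015-03-02"]

# Same lambda as in the question, applied with the built-in map
pairs = list(map(lambda x: (x.split(",")[0], x), lines))
key, row = pairs[0]  # analogous to narcoticsCrimeTuples.first()

# To keep the key apart from the remaining fields, split only once
separated = [tuple(x.split(",", 1)) for x in lines]

print(key)           # 1001
print(row)           # 1001,NARCOTICS,2015-03-01
print(separated[0])  # ('1001', 'NARCOTICS,2015-03-01')
```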

Find index of a sublist in a list

I'm trying to find the index of the sublist that contains a given element. I'm not sure how to specify the problem exactly (which may be why I've overlooked it in a manual), but my problem is this:
list1 = [[1,2],[3,4],[7,8,9]]
I want to find the first sub-list in list1 where 7 appears (in this case the index is 2, but the list, called lll in the snippets below, could be very long). Each number will appear in only one sub-list, or not at all, and these are lists of integers only.
I.e. a function like
spam = My_find(list1, 7)
would give spam = 2
I could try looping to make a Boolean index
[7 in x for x in lll]
and then .index to find the 'true' (as per "Most efficient way to get indexposition of a sublist in a nested list").
However, surely having to build a new boolean list is really inefficient.
My code starts with list1 being relatively small, but it keeps building up (eventually there will be 1 million numbers arranged in approx. 5000 sub-lists of list1).
Any thoughts?
I could try looping to make a Boolean index
[7 in x for x in lll]
and then .index to find the 'true' … However surely having to build a new boolean list is really inefficient
You're pretty close here.
First, to avoid building the list, use a generator expression instead of a list comprehension, by just replacing the [] with ().
sevens = (7 in x for x in lll)
But how do you do the equivalent of .index when you have an arbitrary iterable, instead of a list? You can use enumerate to associate each value with its index, then just filter out the non-sevens with filter or dropwhile or another generator expression, then next will give you the index and value of the first True.
For example:
indexed_sevens = enumerate(sevens)
seven_indexes = (index for index, value in indexed_sevens if value)
first_seven_index = next(seven_indexes)
You can of course collapse all of this into one big expression if you want.
And, if you think about it, you don't really need that initial expression at all; you can do that within the later filtering step:
first_seven_index = next(index for index, value in enumerate(lll) if 7 in value)
Of course, this will raise a StopIteration exception instead of a ValueError exception if there are no sevens, but otherwise it does the same thing as your original code, without building the list and without continuing to test values after the first match.
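Wrapped up as the function the question asks for, this could look like the sketch below. The name my_find and the default of -1 for "not found" are illustrative choices, not part of the original answer; passing a second argument to next is what avoids the StopIteration.

```python
def my_find(nested, target, default=-1):
    """Return the index of the first sub-list containing target, or default."""
    return next(
        (index for index, sub in enumerate(nested) if target in sub),
        default,
    )

list1 = [[1, 2], [3, 4], [7, 8, 9]]
print(my_find(list1, 7))  # 2
print(my_find(list1, 5))  # -1
```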

Bi-dimensional list, keep only lines with unique values on column

I have a list like:
[['A','B','1'],
['A','D','2'],
['F','B','1'],
['K','B','1'],
['M','D','2'],
['G','H','3']
]
I would like to keep only the lines where 'column' 2 contains unique values.
More specifically, the new 'matrix' should only have the last two columns.
result:
[
['B','1'],
['D','2'],
['H','3']
]
There are more than 1.000.000 lines, and column 2 contains strings of 48 digits, so a fast way to do it is preferable.
Thank you,
Tom
I tried:
matrixData=[['A','B','1'],['A','D','2'],['F','B','1'],['K','B','1'],['M','D','2'],['G','H','3']]
uniqueCol2=[]
uniqueCol3=[]
for line in matrixData:
    if line[1] not in uniqueCol2:
        uniqueCol2.append(line[1])
        uniqueCol3.append(line[2])
print(uniqueCol2)
print(uniqueCol3)
result
['B','D','H']
['1','2','3']
This gives me two lists. In the end I need the sum of uniqueCol3, but since there are more than 1.000.000 lines, and probably because the strings contain 48 digits, it takes a lot of time to check if line[1] not in uniqueCol2.
You could try something along the lines of:
def crop(input_matrix):
    output_matrix = []
    unique = set()  # Tracks 2nd-column entries seen so far
    for row in input_matrix:
        if row[1] not in unique:  # First time this 2nd-column value appears
            output_matrix.append(row[1:3])  # Keep only the last two columns
            unique.add(row[1])  # Record the value as seen
    return output_matrix
Membership tests on a set are O(1) on average, so that part is as efficient as it is going to get. The total complexity is therefore O(n) in the number of rows of the input matrix, which I think is the best you can do unless there's some information you can use to predict which rows will be non-unique.
You can absolutely code-golf this down to two lines of code with a list comprehension, but I didn't, for clarity's sake.
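Since the question ultimately needs the sum of the third column over the unique rows, a dict can serve as both the "seen" set and the result. This is a sketch of an alternative, not the answer's own method; the names unique_rows and matrix_data are made up, and it relies on dicts preserving insertion order (guaranteed from Python 3.7).

```python
def unique_rows(matrix):
    """Keep the last two columns of the first row seen for each column-2 value."""
    first_seen = {}
    for _, key, value in matrix:
        # setdefault stores value only if key has not been seen yet
        first_seen.setdefault(key, value)
    return [[key, value] for key, value in first_seen.items()]

matrix_data = [['A', 'B', '1'], ['A', 'D', '2'], ['F', 'B', '1'],
               ['K', 'B', '1'], ['M', 'D', '2'], ['G', 'H', '3']]

result = unique_rows(matrix_data)
total = sum(int(value) for _, value in result)

print(result)  # [['B', '1'], ['D', '2'], ['H', '3']]
print(total)   # 6
```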

Nested for loop index out of range

I'm coming up with a rather trivial problem, but since I'm quite new to python, I'm smashing my head to my desk for a while. (Hurts). Though I believe that's more a logical thing to solve...
First I have to say that I'm using the Python SDK for Cinema 4D so I had to change the following code a bit. But here is what I was trying to do and struggling with:
I'm trying to group some polygon selections, which are dynamically generated (based on some rules, not that important).
Here's how it works the mathematical way:
Those selections are based on islands (means, that there are several polygons connected).
Then, those selections have to be grouped and put into a list that I can work with.
Any polygon has its own index, so this one should be rather simple, but like I said before, I'm quite struggling there.
The main problem is easy to explain: I'm trying to access a non existent index in the first loop, resulting in an index out of range error. I tried evaluating the validity first, but no luck. For those who are familiar with Cinema 4D + Python, I will provide some of the original code if anybody wants that. So far, so bad. Here's the simplified and adapted code.
edit: I forgot to mention that the check which causes the error should actually only check for duplicates, so that the currently selected number is skipped if it has already been processed. This is necessary because the calculations are computationally heavy.
Really hope, anybody can bump me in the right direction and this code makes sense so far. :)
def myFunc():
    sel = [0,1,5,12] # changes with every call of "myFunc", for example to [2,8,4,10,9,1], etc. - list always differs in count of elements, can even be empty, groups are being built from these values
    all = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] # the whole set
    groups = [] # list to store indices-lists into
    indices = [] # list to store selected indices
    count = 0 # number of groups
    tmp = [] # temporary list to copy the indices list into before resetting
    for i in range(len(all)): # loop through values
        if i not in groups[count]: # that's the problematic one; this one actually should check whether "i" is already inside of any list inside the group list, error is simply that I'm trying to check a non existent value
            for index, selected in enumerate(sel): # loop through "sel" and return actual indices. "selected" determines, if "index" is selected. boolean.
                if not selected: continue # pretty much self-explanatory
                indices.append(index) # push selected indices to the list
            tmp = indices[:] # clone list
            groups.append(tmp) # push the previous generated list to another list to store groups into
            indices = [] # empty/reset indices-list
            count += 1 # increment count
    print(groups) # debug
myFunc()
edit:
After adding a second list that acts as a counter and is filled with extend rather than append, everything worked as expected! That list is a basic flat list, pretty simple ;)
groups[count]
When you first call this, groups is an empty list and count is 0. You can't access the thing at spot 0 in groups, because there is nothing there!
Try changing
groups = [] to groups = [[]] (i.e. instead of an empty list, a list that contains a single empty list).
I'm not sure why you'd want to add the empty list to groups. Perhaps this is better
if i not in groups[count]:
to
if not groups or i not in groups[count]:
You also don't need to copy the list if you're not going to use it for anything else. So you can replace
tmp = indices[:] # clone list
groups.append(tmp) # push the previous generated list to another list to store groups into
indices = [] # empty/reset indices-list
with
groups.append(indices) # push the previous generated list to another list to store groups into
indices = [] # empty/reset indices-list
You may even be able to drop count altogether (you can always use len(groups)). You can also replace the inner loop with a list comprehension:
def myFunc():
    sel = [0,1,5,12] # changes with every call of "myFunc", for example to [2,8,4,10,9,1], etc. - list always differs in count of elements, can even be empty, groups are being built from these values
    all = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] # the whole set
    groups = [] # list to store indices-lists into
    for i in range(len(all)): # loop through values
        if not groups or i not in groups[-1]: # look in the latest group
            indices = [idx for idx, selected in enumerate(sel) if selected]
            groups.append(indices) # push the newly generated list onto the list of groups
    print(groups) # debug
correct line 11 from:
if i not in groups[count]
to:
if i not in groups:
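The asker's own fix (a second, flat list filled with extend) can be sketched with a set, which makes the "already processed" check fast and avoids indexing into groups at all. Everything here is illustrative: build_groups, find_island and demo_island are hypothetical names, and demo_island is only a stand-in for the expensive Cinema 4D island lookup, which is not shown in the question.

```python
def build_groups(all_indices, find_island):
    """Group indices into islands, skipping indices that are already grouped."""
    groups = []
    seen = set()                 # flat record of every index already placed
    for i in all_indices:
        if i in seen:
            continue             # already part of an earlier group
        island = find_island(i)  # hypothetical, expensive lookup
        groups.append(island)
        seen.update(island)      # same role as extend on the flat list
    return groups

# Hypothetical stand-in: islands are fixed blocks of four consecutive indices
def demo_island(i):
    start = (i // 4) * 4
    return list(range(start, start + 4))

print(build_groups(range(16), demo_island))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
```

With this layout the lookup runs once per island instead of once per index.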
