How can I modify a pandas dataframe I'm iterating over?

I know - this is verboten.
But when optimize.curve_fit hits a row of (maybe 5) identical values, it quits and returns a straight line.
I don't want to remove ALL duplicates, but I thought I might remove the middle member of any identical triplets, without doing too much damage to the fit.
So I wanted to use iterrows, and drop rows as I go, but I understand I may be working on a copy, not the original.
Or, I could just do an old-fashioned loop with an index.
How can I do this safely, and in such a way that the end parameter of the loop is updated each time I do a deletion?
Here's an example:
i = 1
while i < len(oneDate.index)-1:
    print("triple=",oneDate.at[i-1,"Nprem"],oneDate.at[i,"Nprem"],oneDate.at[i+1,"Nprem"])
    if oneDate.at[i,"Nprem"]==oneDate.at[i-1,"Nprem"] and oneDate.at[i,"Nprem"]==oneDate.at[i+1,"Nprem"]:
        print("dropping i=",i,oneDate.at[i,"Nprem"])
        oneDate.drop([i])
        oneDate = oneDate.reset_index(drop=True)
        pause()
    else: i = i +1
I assumed that when I dropped and reset, the next item would move into the deleted slot, so I wouldn't have to increment the index. But it didn't, so I got an infinite loop.

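OK, I found the inplace=True option for drop, and it now works fine. Without it, oneDate.drop([i]) returns a new dataframe and leaves oneDate itself untouched unless the result is assigned back.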
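For reference, here is a loop-free sketch of the same idea. It assumes the column is called Nprem as in the question; the helper name drop_triplet_middles and the sample values are made up for illustration. A row counts as a "middle" when its value equals both of its neighbours, so any run of three or more identical values is reduced to its first and last member, which is also what the working loop ends up doing.

```python
import pandas as pd


def drop_triplet_middles(df, col):
    """Return a copy of df without rows whose value in `col` equals both neighbours."""
    values = df[col]
    # shift(1) is the previous row, shift(-1) the next one. The first and last
    # rows are compared against NaN and are therefore never flagged.
    is_middle = values.eq(values.shift(1)) & values.eq(values.shift(-1))
    return df.loc[~is_middle].reset_index(drop=True)


# Hypothetical sample data: a run of five identical values and a pair.
one_date = pd.DataFrame({"Nprem": [1, 2, 2, 2, 2, 2, 3, 3, 4]})
thinned = drop_triplet_middles(one_date, "Nprem")
print(thinned["Nprem"].tolist())  # [1, 2, 2, 3, 3, 4]
```

Because the function returns a new dataframe instead of deleting rows during iteration, there is no loop bound to keep in sync with the deletions.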

Related

Python Pandas. Endless cycle

Why does this part of the code seem to run forever? It can't really be an infinite loop, because when I interrupt it (in Jupyter Notebook), all the 99999 values have already been changed to oil_mean_by_year[data.loc[i]['year']]
for i in data.index:
    if data.loc[i]['dcoilwtico'] == 99999:
        data.loc[i, 'dcoilwtico'] = oil_mean_by_year[data.loc[i]['year']]

Use merge to align the oil mean of a year with the given row:
# Merge on data['year'] vs oil_mean_by_year's index
data_with_oil_mean = pd.merge(data, oil_mean_by_year.rename("oil_mean"),
                              left_on="year", right_index=True, how="left")
data_with_oil_mean['dcoilwtico'] = data_with_oil_mean['dcoilwtico'].mask(lambda xs: xs.eq(99999), data_with_oil_mean['oil_mean'])
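A small, self-contained illustration of replacing the placeholder without a Python-level loop. The sample values are invented, and oil_mean_by_year is assumed to be a Series indexed by year, which is what the merge above implies. This variant uses a boolean mask together with Series.map rather than merge, so treat it as a sketch of the same idea and not as the answerer's code.

```python
import pandas as pd

# Hypothetical sample data; 99999 marks a missing price.
data = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "dcoilwtico": [50.0, 99999.0, 99999.0, 40.0],
})
oil_mean_by_year = pd.Series({2015: 48.0, 2016: 43.0})

# Select the placeholder rows once, then look up each row's yearly mean.
is_placeholder = data["dcoilwtico"].eq(99999)
data.loc[is_placeholder, "dcoilwtico"] = data.loc[is_placeholder, "year"].map(oil_mean_by_year)

print(data["dcoilwtico"].tolist())  # [50.0, 48.0, 43.0, 40.0]
```

A year that is missing from oil_mean_by_year would produce NaN here, so the lookup table needs to cover every year that has placeholder rows.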

This is a common stumbling block with pandas, and it usually comes from treating a DataFrame like a Python list. Let's take a look at what actually happens here.
We are trying to replace the dcoilwtico value in every row where it equals the placeholder 99999 with the mean for that row's year. The loop visits each label in data.index exactly once, and assigning to existing cells does not change the index, so the loop is not infinite: it ends after len(data) iterations. What it can be, though, is slow. Every data.loc[i] builds a whole row as a new Series just to read one value, and that happens for every row whether or not it holds 99999, so on a large frame the cell can appear to hang even after the last 99999 has been replaced.
If you want to keep the loop, read and write single cells directly. Note that loc is indexed with square brackets, not called like a function, and that get returns a fallback (0 here) when a year has no mean:
for i in data.index:
    if data.loc[i, 'dcoilwtico'] == 99999:
        data.loc[i, 'dcoilwtico'] = oil_mean_by_year.get(data.loc[i, 'year'], 0)
That should be faster than building a row Series on every access, although a vectorized replacement will usually be much faster still.

The code that works individually breaks in the loop on 3rd-4th iteration, no matter what the input is

I wrote a script (I can't publish all of it here, it is big) that downloads a CSV file, checks the ranges and creates a new CSV file that has all the "out of range" info.
The script was checked on all existing CSV files and works without errors.
Now I am trying to loop through all of them to generate the "out of range" data but it errors after the 3rd or 4th iteration no matter what the input file is.
I tried changing the order of the files, and the ones that errored before are processed just fine, but the error still appears on the 3rd or 4th iteration.
What may be the issue with this?
The error I get is the ValueError: cannot reindex on an axis with duplicate labels
when I run the line assigning the out of range values to the column
dataframe.loc[dataframe['Flagged_measure'] == flags[i][0], ['Flagged_measure']] = dataframe[dataframe['Flagged_measure'] == flags[i][0]]['Flagged_measure'].astype(str) + ' , ' + csv_report_df.loc[flags[i][1], flags[i][0]].astype(str)

The ValueError you mentioned occurs when you join/assign to a column that has duplicate index values. From what I can infer from the single line of code you posted, I'll break it down and maybe it could be clear whether your assignment makes sense:
dataframe.loc[dataframe['Flagged_measure'] == flags[i][0], ['Flagged_measure']]
This selects the rows of column Flagged_measure in dataframe that match flags[i][0], so that they can be assigned some RHS value, preferably a single value per iteration.
dataframe[dataframe['Flagged_measure'] == flags[i][0]]['Flagged_measure'].astype(str) + ' , ' + csv_report_df.loc[flags[i][1], flags[i][0]].astype(str)
This way of assignment makes no sense whatsoever. You perform a grouped operation but at the same time, use a single-value assignment for changing values in dataframe.
Might I suggest you try this?
dataframe['Flagged_measure'] = dataframe['Flagged_measure'].apply(lambda row: " , ".join([str(row), str(csv_report_df.iloc[flags[i][1], flags[i][0]])]) if row == flags[i][0] else row)
If it still doesn't work, maybe you need to look into csv_report_df as well. As far as I know, loc is for label-based indexing and iloc is for position-based indexing, which I think is what you're looking to achieve here. Note that iloc needs integer positions for both the row and the column, so keep loc if flags[i][0] is a column name.
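One plausible source of the duplicate labels, sketched under the assumption that frames are stacked with pd.concat somewhere in the loop (the question does not show that part). The frame contents and the " , out of range" suffix are invented for illustration. Concatenation keeps each frame's original row labels, so the combined index repeats 0, 1, and so on; resetting the index before the masked assignment makes the labels unique again.

```python
import pandas as pd

# Two hypothetical per-file frames, each with its own default 0..n-1 index.
part_a = pd.DataFrame({"Flagged_measure": ["temp", "ph"]})
part_b = pd.DataFrame({"Flagged_measure": ["temp", "flow"]})

stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())    # [0, 1, 0, 1]
print(stacked.index.is_unique)   # False

# Give every row its own label before doing label-aligned assignments.
clean = stacked.reset_index(drop=True)
mask = clean["Flagged_measure"] == "temp"
clean.loc[mask, "Flagged_measure"] = (
    clean.loc[mask, "Flagged_measure"].astype(str) + " , " + "out of range"
)
print(clean["Flagged_measure"].tolist())
```

Passing ignore_index=True to pd.concat has the same effect as the reset_index call. Checking dataframe.index.is_unique right before the failing line is a quick way to confirm or rule out this cause.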

Bubble sort: understanding recursion

I'm new to programming and data structures and algorithms. I understand the program, what I don't understand is why does it recursively do what I want? I guess what I'm asking is why doesn't it stop after it goes through the list once? It keeps going until the whole list is sorted.
def bubblesort(numbers):
    for i in range(len(numbers)):
        for j in range(len(numbers) - 1):
            if(numbers[j] > numbers[j+1]):
                temp = numbers[j]
                numbers[j] = numbers[j+1]
                numbers[j+1] = temp

There are two loops. The inner loop traverses the whole list for every iteration of the outer loop.
Now notice this:
The inner loop will guarantee that the greatest value "bubbles up" -- with every swap in which it participates -- to the far right. So after the first time that this inner loop completes, the greatest value will have arrived in its final spot.
The second time this inner loop restarts, we could imagine the list to be one element shorter: just imagine that the right most (i.e. greatest) value is not there. Then we have a similar situation: the now greatest value is guaranteed to be shifted to the far right. The inner loop will also compare this "greatest" value with the one we ignored (that really sits at the far right), but obviously they will not be swapped. So after the second time this inner loop does its traversal we have the two greatest values at the far right, in their final position.
So, there is a pattern here. If the inner loop is executed (in its totality) 10 times, then at the end we will have the greatest 10 values at the far right. That is why the outer loop makes as many iterations as there are values in the list. This way it is guaranteed that we will have sorted the whole list.
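The pattern described above can be observed directly. The sketch below runs the same two loops on a copy of the list and records a snapshot after each complete inner pass; the function name bubble_passes and the sample list are made up for illustration.

```python
def bubble_passes(numbers):
    """Return a snapshot of the list after each full pass of the inner loop."""
    items = list(numbers)  # work on a copy so the caller's list is untouched
    snapshots = []
    for _ in range(len(items)):
        for j in range(len(items) - 1):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
        snapshots.append(list(items))
    return snapshots


sample = [5, 1, 4, 2, 3]
snapshots = bubble_passes(sample)
for k, snapshot in enumerate(snapshots, start=1):
    # After pass k, the k greatest values occupy the last k positions.
    print(k, snapshot, snapshot[-k:])
```

The first pass turns [5, 1, 4, 2, 3] into [1, 4, 2, 3, 5], with 5 already in its final spot, and the second pass finishes the job for this particular input.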

Well, if I've understood your question correctly, you're asking why your code runs through the list multiple times instead of only once, right?
You've got one for loop inside another. On line 2 you start a loop that walks through the numbers list, and for EVERY repetition of that first loop you run another loop on line 3. So every step of the loop on line 2 runs the whole loop on line 3 again.
That's one of the main issues with bubble sort: this version keeps passing over the list even after it is sorted.
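A common refinement, shown here as a sketch and not as part of the asker's code, stops as soon as a full pass makes no swap, and shortens each pass because the tail is already in place. The function name and the returned pass count are additions for illustration.

```python
def bubblesort_early_exit(numbers):
    """Sort `numbers` in place and return how many passes were needed."""
    n = len(numbers)
    passes = 0
    for i in range(n):
        swapped = False
        # The last i items are already in their final positions.
        for j in range(n - 1 - i):
            if numbers[j] > numbers[j + 1]:
                numbers[j], numbers[j + 1] = numbers[j + 1], numbers[j]
                swapped = True
        passes += 1
        if not swapped:
            break  # a pass without swaps means the list is sorted
    return passes


unsorted_values = [5, 1, 4, 2, 3]
passes_needed = bubblesort_early_exit(unsorted_values)
print(unsorted_values, passes_needed)  # [1, 2, 3, 4, 5] 3

already_sorted = [1, 2, 3, 4, 5]
print(bubblesort_early_exit(already_sorted))  # 1
```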

I have a list with 7 elements in Python, but the len operator returns a length of 1

New to Python, this is my first application. I've been staring at this a while, and I'm sure I have some fundamental misunderstanding about what's going on.
In this example I have a list of 7 str (entries), and an assignment statement:
listLen = len(entries)
Followed by a breakpoint, and below is a screen capture showing the debugger where listLen is assigned a value of 1, and entries is a {list: 7}
I'd expect len(entries) to return a value of 7, but I can't seem to get the expected behavior. What am I missing?
UPDATE: I thought the answer was in the for loop modifying the list but apparently not.
If I set a breakpoint before assigning entries and single step through with the debugger including the for loop everything looks good and works.
If I set a breakpoint ON the for loop and single step once, entries again appears to be a {list: 7} but the len(entries) appears to be 1. The for loop executes one loop and exits.
The deep copy entriesCopy I made for debug is used nowhere else, and gets changed to [''], but I assume that since it's not used it gets optimized out or garbage collected, though it doesn't when single-stepping from an earlier breakpoint.
After breaking on the 'for' loop and single stepping once to the beginning of the 'while' loop:
Why would single stepping through the code work fine, but breaking at the for loop cause len(entries) to be wrong?
Single stepping from earlier breakpoint works fine, and the program returns the correct result:
I'm still struggling to get a minimum reproducible sample of code.
Here's more of the code:
entries = self.userQuery.getEntries()
entriesCopy = copy.deepcopy(self.userQuery.getEntries())
entryList = list()
listLen = len(entries)
for ii in range(0,listLen):
    while ("\n\n") in entries[ii]: entries[ii]=entries[ii].replace("\n\n","\n") #strip double newlines
    while ("\t") in entries[ii]: entries[ii] = entries[ii].replace("\t", "") # strip tabs
    entryList=entries[ii].split("\n")
    while("" in entryList): entryList.remove('')
    self.SCPIDictionary[self.instructions[ii][1].replace("\n","")]=entryList;

Look a little higher in your debug output: on line 42 you can see entries: ['']
I can't read the code in your for loop, so I don't know what's happening, but you seem to be modifying the list in there. If you use the "hover" to look at the value, you get the current value of that variable. You set the breakpoint on the "for" part of the loop; try setting one on the line before the loop and one on the first line inside the loop, and watch for that entries list to get mutated.
--- edit ---
You provided more code. It's... kind of insane. Why are you modifying the "entries" object repeatedly in while loops? Then you copy the entry into another object, and then replace a value in some dictionary with the entry you just copied (with the key determined after running string transformations on a matrix dictionary?)
Two things:
First, to debug this, I am concerned about the types. Does "getEntries" actually return a list of strings, or is it a result proxy or something similar? SQLAlchemy, for example, does not actually return a list. The Python debugger is great, but you're doing so much mutation here that print statements will serve you better: do print(entries) after every line. That will let you see when things are changing, and at least how many times your loop is executing. If it is something like a result proxy, then after you have finished iterating over it there may just not be anything left in there when you look at it in the debugger.
Second, consider this: instead of modifying all these mutable objects, pull out the values and modify those. As a rough draft:
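for ii, entry in enumerate(entries):
    values = []
    for val in entry.replace("\n\n", "\n").replace("\t", "").split("\n"):
        if val:  # skip empty strings left over from the split
            values.append(val)
    # same key as in the question's loop
    self.SCPIDictionary[self.instructions[ii][1].replace("\n", "")] = values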
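The same clean-up can be tried in isolation, outside the class. In this sketch the helper name clean_entry and the sample strings are made up; the point is only that the input list keeps its length and contents, because nothing is modified in place.

```python
def clean_entry(entry):
    """Return the non-empty lines of `entry` with tabs removed."""
    # Dropping empty strings after the split also takes care of doubled
    # newlines, so no separate "\n\n" replacement is needed.
    return [line for line in entry.replace("\t", "").split("\n") if line]


# Hypothetical sample data standing in for getEntries().
entries = ["*IDN?\n\n\tQuery identity\n", "*RST\n\tReset\n\n"]
cleaned = [clean_entry(entry) for entry in entries]

print(cleaned)       # [['*IDN?', 'Query identity'], ['*RST', 'Reset']]
print(len(entries))  # 2, the original list is untouched
```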

Python List Help

I have a list of lists that looks like:
floodfillque = [[1,1,'e'],[1,2,'w'], [1,3,'e'], [2,1,'e'], [2,2,'e'], [2,3,'w']]
for each in floodfillque:
    if each[2] == 'w':
        floodfillque.remove(each)
    else:
        tempfloodfill.append(floodfillque[each[0+1][1]])
That is a simplified, but I think relevant, part of the code.
Does the floodfillque[each[0+1]] part do what I think it is doing and taking the value at that location and adding one to it or no? The reason why I ask is I get this error:
TypeError: 'int' object is unsubscriptable
And I think I am misunderstanding what that code is actually doing or doing it wrong.

In addition to the bug in your code that other answers have already spotted, you have at least one more:
for each in floodfillque:
    if each[2] == 'w':
        floodfillque.remove(each)
Don't add or remove items from the very container you're looping on. While such a bug will usually be diagnosed only for certain types of containers (not including lists), it's just as terrible for lists -- it will end up altering your intended semantics by skipping some items or seeing some items twice.
If you can't substantially alter and enhance your logic (generally by building a new, separate container rather than mucking with the one you're looping on), the simplest workaround is usually to loop on a copy of the container you must alter:
for each in list(floodfillque):
Now, your additions and removals won't alter what you're actually looping on (because what you're looping on is a copy, a "snapshot", made once and for all at loop's start) so your semantics will work as intended.
Your specific approach to altering floodfillque also has a performance bug -- it behaves quadratically, while sound logic (building a new container rather than altering the original one) would behave linearly. But, that bug is harder to fix without refactoring your code from the current not-so-nice logic to the new, well-founded one.
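The skipping is easy to reproduce when two matching items sit next to each other. In this sketch the function names and the sample cells are made up; the first function repeats the question's pattern on purpose, and the second builds a new list instead.

```python
def remove_while_iterating(items):
    """Buggy on purpose: removes from the list it is looping over."""
    for item in items:
        if item[2] == 'w':
            items.remove(item)
    return items


def keep_non_walls(items):
    """Build a new list and leave the original alone."""
    return [item for item in items if item[2] != 'w']


cells = [[1, 1, 'e'], [1, 2, 'w'], [1, 3, 'w'], [2, 1, 'e']]

buggy = remove_while_iterating(list(cells))  # pass a copy so `cells` survives
fixed = keep_non_walls(cells)

print(buggy)  # [[1, 1, 'e'], [1, 3, 'w'], [2, 1, 'e']]  <- one 'w' was skipped
print(fixed)  # [[1, 1, 'e'], [2, 1, 'e']]
```

After [1, 2, 'w'] is removed, [1, 3, 'w'] slides into the slot the loop has just visited, so the loop moves on without ever looking at it.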

Here's what's happening:
On the first iteration of the loop, each is [1, 1, 'e']. Since each[2] != 'w', the else is executed.
In the else, you take each[0+1][1], which is the same as (each[0+1])[1]. each[0+1] is 1, and so you are doing (1)[1]. int objects can't be indexed, which is what's raising the error.

"Does the floodfillque[each[0+1]] part do what I think it is doing and taking the value at that location and adding one to it or no?"
No, it sounds like you want each[0] + 1.
Either way, the error you're getting is because you're trying to take the second item of an integer... each[0+1][1] resolves to each[1][1] which might be something like 3[1], which doesn't make any sense.
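A short check of the two readings, using the first sublist from the question. The variable names are made up for illustration, and the exact wording of the error message depends on the Python version.

```python
each = [1, 1, 'e']

second_item = each[0 + 1]     # 0 + 1 is evaluated first, so this is each[1]
first_plus_one = each[0] + 1  # the first item, with 1 added to its value
print(second_item, first_plus_one)  # 1 2

try:
    each[0 + 1][1]            # each[1] is the int 1, and ints cannot be indexed
    error_raised = False
except TypeError as exc:
    error_raised = True
    print("TypeError:", exc)
```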

The other posters are correct. However, there is another bug in this code, which is that you are modifying floodfillque as you are iterating over it. This will cause problems, because Python internally maintains a counter to handle the loop, and deleting elements does not modify the counter.
The safe way to do this is to iterate over a copy of the list:
for each in floodfillque[ : ]:
([ : ] is Python's slice notation for the whole list, which produces a shallow copy.)

Here is how I understand NoahClark's intentions:
Remove those sublists whose third element is 'w'
For the remaining sublist, add 1 to the second item
If this is the case, the following will do:
# Here is the original list
floodfillque = [[1,1,'e'], [1,2,'w'], [1,3,'e'], [2,1,'e'], [2,2,'e'], [2,3,'w']]
# Remove those sublists which have 'w' as the third element
# For the rest, add one to the second element
floodfillque = [[a,b+1,c] for a,b,c in floodfillque if c != 'w']
This solution works fine and runs in a single pass, but it is not the most memory-efficient: it creates a new list instead of patching up the original one.
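If other parts of the program hold a reference to the original list, slice assignment keeps the same list object while replacing its contents. This is a sketch of that variation, with the name alias made up to stand for such a reference; the comprehension still builds a temporary list before the contents are swapped in.

```python
floodfillque = [[1, 1, 'e'], [1, 2, 'w'], [1, 3, 'e'], [2, 1, 'e'], [2, 2, 'e'], [2, 3, 'w']]
alias = floodfillque  # another name for the same list object

# Replace the contents of the existing list instead of rebinding the name.
floodfillque[:] = [[a, b + 1, c] for a, b, c in floodfillque if c != 'w']

print(alias is floodfillque)  # True
print(alias)  # [[1, 2, 'e'], [1, 4, 'e'], [2, 2, 'e'], [2, 3, 'e']]
```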
