Python Pandas. Endless cycle - python

Why does this part of the code seem to loop forever? It can't really be infinite, because when I stop it (in Jupyter Notebook), all the 99999 values have already been changed to oil_mean_by_year[data.loc[i]['year']].
for i in data.index:
    if data.loc[i]['dcoilwtico'] == 99999:
        data.loc[i, 'dcoilwtico'] = oil_mean_by_year[data.loc[i]['year']]

Use merge to align the oil mean of a year with the given row:
# Merge on data['year'] vs oil_mean_by_year's index
data_with_oil_mean = pd.merge(data, oil_mean_by_year.rename("oil_mean"),
                              left_on="year", right_index=True, how="left")
data_with_oil_mean['dcoilwtico'] = data_with_oil_mean['dcoilwtico'].mask(lambda xs: xs.eq(99999), data_with_oil_mean['oil_mean'])

The loop is most likely not infinite; it is just slow. That fits what you describe: when you interrupt it, every 99999 has already been replaced, which suggests the loop had passed those rows and was still working through the rest of the index.
We are trying to change the dcoilwtico value for each row where dcoilwtico equals 99999, and the loop checks every single row to do that. The expression data.loc[i]['dcoilwtico'] first builds the whole row as a Series and then picks one value out of it, and this happens once per row in a Python-level loop. On a large DataFrame that adds up quickly.
If you want to keep the loop, index the cell directly with data.loc[i, 'dcoilwtico']. Note the square brackets: data.loc(i, 'dcoilwtico') with parentheses is not an indexing operation and cannot be assigned to. Using .get lets a year that has no mean fall back to a default (0 here):
for i in data.index:
    if data.loc[i, 'dcoilwtico'] == 99999:
        data.loc[i, 'dcoilwtico'] = oil_mean_by_year.get(data.loc[i, 'year'], 0)
This still visits every row one at a time, so a vectorized replacement such as the merge above should be much faster.
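The same replacement can also be written without any Python-level loop, using a boolean mask and Series.map. This is a minimal sketch, not the asker's actual data: the small DataFrame is made up, and it assumes oil_mean_by_year is a Series indexed by year that was computed from the non-placeholder rows.

```python
import pandas as pd

# Made-up sample data; 99999 is the placeholder for a missing oil price.
data = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014],
    "dcoilwtico": [93.1, 99999.0, 99999.0, 53.3],
})

# Assumption: the yearly mean is computed from the valid rows only.
valid = data[data["dcoilwtico"] != 99999]
oil_mean_by_year = valid.groupby("year")["dcoilwtico"].mean()

# Select the placeholder rows once, then look up each row's year in the means.
is_placeholder = data["dcoilwtico"].eq(99999)
data.loc[is_placeholder, "dcoilwtico"] = (
    data.loc[is_placeholder, "year"].map(oil_mean_by_year)
)

print(data)
```

A year that is missing from oil_mean_by_year would come out as NaN here rather than 0, which may or may not be what you want.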

Related

How to handle IndexError inside if statement without using try

I had been having this problem in my code, solved it, and think the solution might help other people.
I have a list of objects and need to check whether the objects' variables are less than, greater than, or equal to other variables. The problem is that I don't always have an object in the list, so Python raises an IndexError because the list is empty. I tried using try to catch the error, but it made the code messy and hard to read.
To simplify, all I wanted was to say: if list[0].value < value2: do this, and if the list index does not exist or the condition is not True, go to the else branch.
I will give my solution below, and anyone else is free to give other solutions too.
The way I found to do this was to append NaN values when creating the list. Now the list has values to be checked, and if the values are NaN the comparison is always False, which leads to the else branch.
I was actually using objects, so I just created a new object with all variables set to NaN.
I used numpy.nan.
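The NaN placeholder idea described above can be sketched as follows. The Reading class and check_first function are hypothetical names made up for illustration, and math.nan is used in place of numpy.nan (both are an ordinary float NaN).

```python
import math


class Reading:
    """Hypothetical object with a single numeric attribute."""

    def __init__(self, value):
        self.value = value


def check_first(readings, threshold):
    # Ordering comparisons with NaN are always False, so a NaN
    # placeholder falls through to the else branch.
    if readings[0].value < threshold:
        return "below threshold"
    else:
        return "else branch"


with_data = check_first([Reading(1.5)], 2.0)
with_placeholder = check_first([Reading(math.nan)], 2.0)
print(with_data, "|", with_placeholder)
```

One caveat with this approach: != is the exception, since NaN != x evaluates to True, so a condition written with != would not fall through to else.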
Couldn't you just check that the list length is greater than the index you want to check, and then check the condition? Since and short-circuits, the index is never accessed if the list isn't long enough. This means you don't have to create dummy data, especially for very long lists, as in this case where I want to check index one million. It also saves you from having to install and use other modules such as numpy.
mylist = [1, 3]
index = 1_000_000
if len(mylist) > index and mylist[index] < 2:
    print("yay")
else:
    print("nay")

The code that works individually breaks in the loop on 3rd-4th iteration, no matter what the input is

I wrote a script (I can't publish all of it here, it is big) that downloads a CSV file, checks the ranges, and creates a new CSV file that has all the "out of range" info.
The script was checked on all existing CSV files and works without errors.
Now I am trying to loop through all of them to generate the "out of range" data, but it errors after the 3rd or 4th iteration no matter what the input file is.
I tried reordering the files, and the ones that errored before are processed just fine, but the error still appears on the 3rd or 4th iteration.
What may be the issue with this?
The error I get is ValueError: cannot reindex on an axis with duplicate labels
when I run the line that assigns the out-of-range values to the column:
dataframe.loc[dataframe['Flagged_measure'] == flags[i][0], ['Flagged_measure']] = dataframe[dataframe['Flagged_measure'] == flags[i][0]]['Flagged_measure'].astype(str) + ' , ' + csv_report_df.loc[flags[i][1], flags[i][0]].astype(str)
The ValueError you mentioned occurs when you join/assign to a column that has duplicate index values. From what I can infer from the single line of code you posted, I'll break it down and maybe it could be clear whether your assignment makes sense:
dataframe.loc[dataframe['Flagged_measure'] == flags[i][0], ['Flagged_measure']]
This selects the rows of the Flagged_measure column in dataframe that match flags[i][0], so that they can be assigned some right-hand-side value, preferably a single value per iteration.
dataframe[dataframe['Flagged_measure'] == flags[i][0]]['Flagged_measure'].astype(str) + ' , ' + csv_report_df.loc[flags[i][1], flags[i][0]].astype(str)
This way of assignment makes no sense whatsoever. You perform a grouped operation but at the same time, use a single-value assignment for changing values in dataframe.
Might I suggest you try this?
dataframe['Flagged_measure'] = dataframe['Flagged_measure'].apply(lambda row: " , ".join([str(row), str(csv_report_df.iloc[flags[i][1], flags[i][0]])]) if row == flags[i][0] else row)
If it still doesn't work, you may need to look into csv_report_df as well. As far as I know, loc is for label-based indexing and iloc is for integer positions, which is what I think you're trying to do here.
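Since the error message names duplicate labels, it may also be worth checking whether one of the frames ends up with a non-unique index after a few iterations, for example if pieces are concatenated without resetting the index. That cause is only a guess from the message, and the frames below are made up rather than taken from the question. The sketch shows how to detect duplicate labels and that reset_index(drop=True) removes them.

```python
import pandas as pd

# Made-up pieces, each with its own default 0..n-1 index.
part_a = pd.DataFrame({"Flagged_measure": ["temp", "ph"]})
part_b = pd.DataFrame({"Flagged_measure": ["temp"]})

# Concatenating keeps the original labels, so label 0 now appears twice.
combined = pd.concat([part_a, part_b])
has_duplicates = not combined.index.is_unique

# Any operation that has to reindex such an axis raises ValueError.
try:
    combined["Flagged_measure"].reindex([0, 1, 2])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Giving the frame a fresh unique index avoids the problem.
fixed = combined.reset_index(drop=True)
print(has_duplicates, reindex_failed, fixed.index.is_unique)
```

Printing dataframe.index.is_unique at the top of each iteration would show whether this is what happens on the 3rd or 4th pass.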

How can I modify a pandas dataframe I'm iterating over?

I know - this is verboten.
But when optimize.curve_fit hits a row of (maybe 5) identical values, it quits and returns a straight line.
I don't want to remove ALL duplicates, but I thought I might remove the middle member of any identical triplets, without doing too much damage to the fit.
So I wanted to use iterrows, and drop rows as I go, but I understand I may be working on a copy, not the original.
Or, I could just do an old-fashioned loop with an index.
How can I do this safely, and in such a way that the end parameter of the loop is updated each time I do a deletion?
Here's an example:
i = 1
while i < len(oneDate.index)-1:
    print("triple=",oneDate.at[i-1,"Nprem"],oneDate.at[i,"Nprem"],oneDate.at[i+1,"Nprem"])
    if oneDate.at[i,"Nprem"]==oneDate.at[i-1,"Nprem"] and oneDate.at[i,"Nprem"]==oneDate.at[i+1,"Nprem"]:
        print("dropping i=",i,oneDate.at[i,"Nprem"])
        oneDate.drop([i])
        oneDate = oneDate.reset_index(drop=True)
        pause()
    else: i = i +1
I assumed that when I dropped and reset, the next item would move into the deleted slot, so I wouldn't have to increment the index. But it didn't, so I got an infinite loop.
OK, I found the inplace=True option for drop (without it, drop returns a new DataFrame and leaves oneDate unchanged), and it now works fine.
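An alternative that avoids modifying the frame while looping is to mark the middle rows with shift and drop them all at once. This is a sketch on made-up data, assuming a single Nprem column as in the question; it is not the asker's code. For a run of three or more identical values it keeps only the first and last, which is also what the loop ends up doing.

```python
import pandas as pd

# Made-up data with one triple of identical values.
oneDate = pd.DataFrame({"Nprem": [1, 2, 2, 2, 3, 3, 4]})

s = oneDate["Nprem"]
# A row is "in the middle" if it equals both its predecessor and successor.
is_middle = s.eq(s.shift(1)) & s.eq(s.shift(-1))

# Keep everything else and renumber the index.
thinned = oneDate[~is_middle].reset_index(drop=True)
print(thinned["Nprem"].tolist())
```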

Using loops to compare two lists to find matching values

I have two lists, pPop and sPop. sPop is pPop after being sorted in ascending numerical order (they're populations of towns/cities).
I also have four other lists, pName, pType, pLat, and pLong, but I'm not really doing anything with them at this point in time.
I need to sort this list of cities by ascending population size, and I basically have been told to do it using what I know currently - which isn't that much. I've tried this using tuples and other stuff, but those fall under things I haven't been taught.
I have to compare sPop to pPop and use the information I get from that to reorder the other four lists, so I can spit them out in a .csv file.
I get the idea, I'm just not sure of the execution. I think I need to run a loop over all of sPop, with a loop inside that running for all pPop, which checks if sPop[x] = pPop[y], (x from 0 to len(sPop)) giving some kind of affirmative response when it's true. If it's true, then set pVar[y] equal to sVar[x].
After writing this out it seems fine, I'm just not entirely sure how to loop for every index in python. When I do, say,
for x in sPop
it's
x = sPop[i] i=range(0:len(sPop))
when I'd prefer x to refer to the index itself, not the value of the array/list at that index.
Short version: loop over indices in an array with
for x in array
where x refers to the index. How do? If you can't, best way to refer to the index within a loop? It's late and my brain is fried on this problem after spending about six hours trying to work this out a few days ago using different methods.
EDIT:
Alright, got it. For anyone who is somehow curious (maybe someone'll stumble across this one in five years), you loop over sPop, then over pPop (use
for indexX, varX in enumerate(list)
twice, once per list), then use
if varX == varY:
    sortedList.append(initialList[indexY])
Can't put down the actual code or I'd probably get smacked with a plagiarism checker.
To get the index:
for index, x in enumerate(sPop):
    print(index, x)
Or loop over the indices directly:
for x in range(len(sPop)):
    item = sPop[x]
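Once you can loop over indices, reordering the parallel lists can be sketched like this. The town data is made up, and sorting a list of indices with a key is a swapped-in technique rather than the nested-loop comparison described in the question; it has the advantage of still working when two towns have the same population.

```python
# Made-up parallel lists: position i in each list describes the same town.
pPop = [5000, 120, 43000]
pName = ["Bigton", "Smallville", "Hugeburg"]

# Sort the indices by population instead of sorting the values themselves.
order = sorted(range(len(pPop)), key=lambda i: pPop[i])

# Apply the same ordering to every parallel list.
sPop = [pPop[i] for i in order]
sName = [pName[i] for i in order]

print(sPop)
print(sName)
```

The same order list can be applied to pType, pLat, and pLong before writing the .csv file.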

python excel reading last empty cell

I am trying to read till the last empty cell for the specified number of rows.
Here is my code:
for j in range(REPEAT_CONST):
    if r_sheet.cell_type(row+j,0)== xlrd.XL_CELL_EMPTY:
        break
This code only works if something was written to the cell earlier and then deleted; it does not work if the cell has never been edited. I'm not sure how to handle this.
Could you please help me to do this.
I will be grateful for your support.
Regards,
Pavan
The error you're experiencing suggests the rowx or colx arguments are out of bounds for this sheet.
In the cell access functions, "rowx" is a row index, counting from zero, and "colx" is a column index, counting from zero. Negative values for row/column indexes and slice positions are supported in the expected fashion.
I just found this out (never used XLRD before) but if you query the r_sheet.nrows I bet you'll find that the value is less than row+j. It appears that xlrd only reads part of the worksheet, essentially the UsedRange from Excel.
So you can use some exception handling (a try/except block), or you could do this. Note that this short-circuits on the first part of the boolean expression whenever row+j is beyond the last row of that sheet (nrows - 1), so the cell is never accessed.
if (row+j <= r_sheet.nrows-1) and (r_sheet.cell_value(row+j,0) == ''):
    break
Or, perhaps even with your original method:
if (row+j <= r_sheet.nrows-1) and (r_sheet.cell_type(row+j,0)== xlrd.XL_CELL_EMPTY):
    break
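Note that with the and form, a row past nrows skips the break, so the loop keeps going. If rows beyond the used range should count as empty too, the bounds check can be combined with or instead. The sketch below is one way to write that; first_empty_offset and FakeSheet are hypothetical names, and FakeSheet is only a stand-in that exposes nrows and cell_type so the example runs without a workbook. With a real sheet you would pass r_sheet and xlrd.XL_CELL_EMPTY.

```python
def first_empty_offset(sheet, start_row, max_rows, empty_type):
    """Return the first offset j in [0, max_rows) whose column-0 cell is
    empty or lies past the sheet's last row; return max_rows if none."""
    for j in range(max_rows):
        rowx = start_row + j
        # The bounds check comes first, so cell_type is never called
        # with an out-of-range row index.
        if rowx >= sheet.nrows or sheet.cell_type(rowx, 0) == empty_type:
            return j
    return max_rows


class FakeSheet:
    """Hypothetical stand-in for a sheet: one cell type per row in column 0."""

    def __init__(self, cell_types):
        self._types = cell_types
        self.nrows = len(cell_types)

    def cell_type(self, rowx, colx):
        return self._types[rowx]


EMPTY, TEXT = 0, 1  # placeholder type codes for this demo only

never_edited = first_empty_offset(FakeSheet([TEXT, TEXT, TEXT]), 1, 10, EMPTY)
cleared_cell = first_empty_offset(FakeSheet([TEXT, EMPTY, TEXT]), 0, 10, EMPTY)
print(never_edited, cleared_cell)
```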
