List index out of range, with split() - python

I am learning Python, and am trying to learn data.split(). I found the following in another StackOverflow question (link here), discussing appending a file in Python.
I have created biki.txt per the above link. Here's my code:
import re
import os
import sys
with open("biki.txt","r") as myfile:
    mydata = myfile.read()
data = mydata.replace("http","%http")
for m in range (1,1000):
    dat1 = data.split("%")[m]
    f = open ("new.txt", "a")
    f.write(dat1)
    f.close()
But when I run the above, I get the error:
dat1 = data.split("%")[m]
IndexError: list index out of range
How come? I can't find documentation on what that [m] does, but removing it doesn't fix the issue. (If I remove [m], the error changes and says that the argument to f.write(dat1) must be a string or read-only character buffer.)
Thank you for any help or ideas!

First, you need to understand what is happening with m in your code. Assuming:
for m in range(1,1000):
    print(m)
In the first iteration, the value of m will be 1.
In each following iteration the value of m increases by 1 (2 in the second iteration, 3 in the third, and so on) until it reaches 999, the last value that range(1,1000) produces.
Second, you need to understand that the expression data.split('%') will split a string where it finds a '%' character, returning a list.
For example, assuming:
data = "one%two%three%four%five"
numbers = data.split('%')
numbers will be a list with five elements like this:
numbers = ['one','two','three','four','five']
To get an element of a list, you subscript the list: you use the [] operator with an index number (you can do a lot more with it, such as slicing):
numbers[0] # will return 'one'
numbers[1] # will return 'two'
...
numbers[4] # will return 'five'
Note that the first element of a list has index 0.
The list numbers has 5 elements and indexing starts at 0, so the last element has index 4. If you try to subscript with an index higher than 4, the Python interpreter will raise an IndexError, since there is no element at that index.
Your code is generating a list with fewer elements than the range you created, so the valid list indexes run out before the for loop is done. For example, if data.split("%") returns 500 elements, the last valid index is 499 (don't forget that list indexes start at 0), and an IndexError is raised when m reaches 500.
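To make the out-of-range case concrete, here is a minimal sketch that reuses the five-element example above. The helper name element_or_none is made up for illustration and is not part of the original code.

```python
data = "one%two%three%four%five"
numbers = data.split("%")


def element_or_none(items, index):
    """Return items[index], or None when the index is past the end of the list."""
    try:
        return items[index]
    except IndexError:
        # Raised whenever index >= len(items), which is what happens in the
        # question once m grows beyond the number of pieces.
        return None


last_valid_index = len(numbers) - 1            # 4 for a five-element list
in_range = element_or_none(numbers, last_valid_index)
out_of_range = element_or_none(numbers, len(numbers))
```

Catching the exception is only meant to show where it comes from; iterating over the list directly avoids the problem altogether.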
If I understood what you want to do, you can achieve it with this code:
with open("input.txt","r") as file_input:
    raw_text = file_input.read()
formatted_text = raw_text.replace("http","%http")
data_list = formatted_text.split("%")
with open("output.txt","w") as file_output:
    for data in data_list:
        file_output.write(data+'\n') # writing one URL per line ;)

You should just iterate over data.split():
for dat1 in data.split("%"):
Now you only split once (rather than on every iteration), the result doesn't have to contain 1000+ items (the lack of which was the cause of the IndexError), and each dat1 is a string that can be passed to f.write() rather than a list (the source of the other error).
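As a rough sketch of the split-once approach, the function below works on a string instead of a file so that it can be run as-is. The name split_urls and the sample URLs are assumptions for illustration. Note that using "%" as the marker would also split URLs that contain percent-encoded characters.

```python
def split_urls(text):
    """Split a run-together string of URLs into a list with one URL per item."""
    marked = text.replace("http", "%http")
    # When the text starts with "http", the piece before the first marker
    # is an empty string, so empty pieces are dropped.
    return [piece for piece in marked.split("%") if piece]


sample = "http://a.example/xhttps://b.example/y"
urls = split_urls(sample)

# Iterating over the list never needs an index, so it cannot run past the end.
lines_to_write = [url + "\n" for url in urls]
```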

Related

Why is re not removing some values from my list?

I'm asking more out of curiosity at this point since I found a work-around, but it's still bothering me.
I have a list of dataframes (x) that all have the same column names. I'm trying to use pandas and re to make a list of the subset of column names that have the format
"D(number) S(number)"
so I wrote the following function:
def extract_sensor_columns(x):
    sensor_name = list(x[0].columns)
    for j in sensor_name:
        if bool(re.match('D(\d+)S(\d+)', j))==False:
            sensor_name.remove(j)
    return sensor_name
The list that I'm generating has 103 items (98 wanted items, 5 unwanted items). This function removes three of the five columns that I want to get rid of, but keeps the columns labeled 'Pos' and 'RH'. I generated the sensor_name list outside of the function and tested the truth value of
bool(re.match('D(\d+)S(\d+)', sensor_name[j]))
for all five of the items that I wanted to get rid of and they all gave the False value. The other thing I tried is changing the conditional to ==True, which even more strangely gave me 54 items (all of the unwanted column names and half of the wanted column names).
If I rewrite the function to add the column names that have a given format (rather than remove column names that don't follow the format), I get the list I want.
def extract_sensor_columns(x):
    sensor_name = []
    for j in list(x[0].columns):
        if bool(re.match('D(\d+)S(\d+)', j))==True:
            sensor_name.append(j)
    return sensor_name
Why is the first block of code acting so strangely?
In general, do not modify a list while iterating over it. The problem is that in the first (wrong) case you remove elements from the very list you are looping over, so the loop skips the element that slides into the removed element's position. In the second (correct) case, you only append matching elements to a new, initially empty list.
Consider this:
arr = list(range(10))
for el in arr:
    print(el)
for i, el in enumerate(arr):
    print(el)
    arr.remove(arr[i+1])
The second loop only prints the even numbers, because each iteration removes the element that would have come next.
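The same effect can be reproduced with the regular expression from the question. This is a sketch with made-up column names and plain strings instead of DataFrame columns. It shows how removing during iteration lets 'RH' slip through, while building a new list does not.

```python
import re

SENSOR_PATTERN = re.compile(r"D(\d+)S(\d+)")


def remove_while_iterating(names):
    """Buggy variant: removes from the list that is being looped over."""
    names = list(names)  # work on a copy of the caller's list
    for name in names:
        if not SENSOR_PATTERN.match(name):
            # Everything after this position shifts left by one, so the
            # element right after the removed one is never visited.
            names.remove(name)
    return names


def keep_matching(names):
    """Safe variant: builds a new list and leaves the input untouched."""
    return [name for name in names if SENSOR_PATTERN.match(name)]


columns = ["D1S1", "Pos", "RH", "D2S10", "Time"]
buggy = remove_while_iterating(columns)
correct = keep_matching(columns)
```

Here 'RH' directly follows 'Pos', so it is skipped after 'Pos' is removed, which matches the symptom described in the question.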

Index error while iterating through list and pop()-ing elements [duplicate]

This question already has answers here:
How to test multiple variables for equality against a single value?
(31 answers)
Closed 6 years ago.
import os
os.chdir('G:\\f5_automation')
r = open('G:\\f5_automation\\uat.list.cmd.txt')
#print(r.read().replace('\n', ''))
t = r.read().split('\n')
for i in range(len(t)):
    if ('inherited' or 'device-group' or 'partition' or 'template' or 'traffic-group') in t[i]:
        t.pop(i)
    print(i,t[i])
In the above code, I get an index error at line 9: 'if ('inherited' or 'device-group'...' etc.
I really don't understand why. How can my index be out of range if it's the perfect length by using len(t) as my range?
The goal is to pop any indexes from my list that contain any of those substrings. Thank you for any assistance!
This happens because you are editing the list while looping through it.
You first get the length, which is 10 for example, and then you loop 10 times. But as soon as you've deleted one item, the list is only 9 long, so the last index no longer exists.
A way around this is to create a new list of the things you want to keep and use that one instead.
I've slightly edited your code to do something similar.
t = ['inherited', 'cookies', 'device-group']
interesting_things = []
for i in t:
    if i not in ['inherited', 'device-group', 'partition', 'template', 'traffic-group']:
        interesting_things.append(i)
        print(i)
Let's say len(t) == 5 and, for the sake of the example, that every step pops a value.
We'll process i taking the values [0,1,2,3,4].
After we process i = 0, we pop one value from t, so len(t) == 4 now. This means an error if we get to i = 4. However, we're still going to try to go up to 4, because the range was already created to go up to 4.
The next step (i = 1) ensures an error on i = 3.
The next step (i = 2) ensures an error on i = 2, but that index has already been processed.
The next step (i = 3) raises the error.
Instead, you should do something like this:
while t:
    element = t.pop()
    print(element)
On a side note, you should keep the substrings you are looking for in a set and test each one against the element:
qualities_we_need = {'inherited', 'device-group', 'partition'} # put all your qualities here
And then in the loop:
if any(quality in element for quality in qualities_we_need):
    print(element)
If you need indexes, you could either use one more variable to keep track of the index of the value currently being processed, or use enumerate().
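If the list really has to be modified in place by index, one common alternative (not taken from the answers here) is to walk the indexes from the end towards the start. A minimal sketch, with a hypothetical helper name and sample lines:

```python
def pop_matching(lines, keywords):
    """Remove, in place, every line that contains any of the keywords."""
    # Going from the last index down to 0 means a pop only shifts elements
    # that were already examined, so no index ever falls off the end.
    for i in range(len(lines) - 1, -1, -1):
        if any(keyword in lines[i] for keyword in keywords):
            lines.pop(i)
    return lines


sample = ["keep me", "inherited yes", "partition Common", "also keep"]
result = pop_matching(sample, ("inherited", "partition"))
```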
As many people said in the comments, there are several problems with your code.
The or operator evaluates its operands from left to right and returns the first one that is truthy. So your parenthesized expression evaluates to 'inherited', since any non-empty string is truthy. As a result, even if your for loop were working, you would only be popping elements that contain 'inherited'.
The for loop is not working, though. The size of the list you are iterating over changes as you loop, so you get an index-out-of-range error as soon as an element that contains 'inherited' gets popped.
So, take a look at this:
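import os
os.chdir('G:\\f5_automation')
r = open('G:\\f5_automation\\uat.list.cmd.txt')
# print(r.read().replace('\n', ''))  # reading here would leave nothing for the next read()
t = r.read().split('\n')
r.close()
keywords = ['inherited', 'device-group', 'partition', 'template', 'traffic-group']
t_dupl = t[:]  # iterate over a copy so that removing from t is safe
for i, items in enumerate(t_dupl):
    if any(keyword in items for keyword in keywords):
        print(i, items)
        t.remove(items)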
By duplicating the original list, we can use its items as a "pool" of items to pick from and modify the list we are actually interested in.
Finally, know that the pop() method returns the item it removes from the list and this is something you do not need in your example. remove() works just fine for you.
As a side note, you can probably replace your first 5 lines of code with this:
with open('G:\\f5_automation\\uat.list.cmd.txt', 'r') as r:
    t = r.readlines()
The advantage of using the with statement is that it automatically closes the file when the block is done. Also, instead of reading the whole file and splitting it on line breaks, you can use the built-in readlines() method, which does almost the same thing; the one difference is that each line keeps its trailing '\n'.
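The behaviour of or described in this answer can be checked directly. The sketch below uses made-up sample text in place of the file and a hypothetical helper contains_keyword, then filters with any() so that every keyword is actually tested.

```python
KEYWORDS = ("inherited", "device-group", "partition", "template", "traffic-group")

# `or` returns its first truthy operand, so the whole expression is just 'inherited'.
collapsed = ("inherited" or "device-group" or "partition" or "template" or "traffic-group")


def contains_keyword(line, keywords=KEYWORDS):
    """Return True when the line contains at least one of the keywords."""
    return any(keyword in line for keyword in keywords)


sample_text = "ltm pool a\n    inherited yes\n    traffic-group tg1\nltm node b"
lines = sample_text.splitlines()

# The collapsed expression misses the 'traffic-group' line entirely.
missed_by_or = collapsed in lines[2]

# Building a new list avoids changing `lines` while looping over it.
kept = [line for line in lines if not contains_keyword(line)]
```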

remove all elements in list that do not end in a number in python

I'm trying to sort an array that I have by removing lines that do not finish in a 1, 2, or 3.
So far I've not been very successful and the code I've come up with looks like this:
these lines set the variables to be used in the function
import numpy as np
A=[]
B=[]
C=[]
file = open('glycine_30c.data', 'r')
bondsfile = open('glycine_30c.bonds', 'r')
these lines read the .data and .bonds files into arrays
for lines in file:
    eq = lines.split()
    A.append(str(eq))
for x in bondsfile:
    bon = x.split()
    B.append(str(bon))
these lines are here to (hopefully) delete all elements in the list "B" that don't end in a 1,2, or 3 and then append them to a new list "C" although that's not really necessary
for n in range(len(B)):
    if B[n].endswith(1,2,3) == True:
        C.append (B[n])
print C
Any help would be really appreciated, thank you
The documentation for endswith() explains:
endswith(...)
S.endswith(suffix[, start[, end]]) -> bool
Return True if S ends with the specified suffix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
suffix can also be a tuple of strings to try.
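That is, calling
B[n].endswith(1,2,3)
won't check whether B[n] ends with any of 1, 2, 3. It treats 1 as the suffix and 2 and 3 as the start and end positions, and because the suffix is an integer rather than a string, it raises a TypeError. What you probably want instead is something along the lines of
B[n].endswith(("1", "2", "3"))
that is, passing a single tuple of strings as the argument.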
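One further point, not raised in the answer: the question stores str(x.split()), which is the text of a list and therefore always ends with "]". A sketch that tests the original line text instead, with made-up sample lines standing in for the .bonds file and a hypothetical function name:

```python
def lines_ending_in(lines, suffixes=("1", "2", "3")):
    """Keep only the lines whose last non-whitespace character is one of the suffixes."""
    # rstrip() drops the trailing newline before the suffix test.
    return [line for line in lines if line.rstrip().endswith(suffixes)]


sample = ["1 2 1\n", "3 4 7\n", "5 6 3\n", "\n"]
kept = lines_ending_in(sample)

# What the question's code appends to B: the repr of a list, not the line itself.
as_list_text = str("1 2 1".split())
```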

Python - List not converting to Tuple inorder to Sort

def mkEntry(file1):
    for line in file1:
        lst = (line.rstrip().split(","))
        print("Old", lst)
        print(type(lst))
        tuple(lst)
        print(type(lst)) #still showing type='list'
        sorted(lst, key=operator.itemgetter(1, 2))
def main():
    openFile = 'yob' + input("Enter the year <Do NOT include 'yob' or .'txt' : ") + '.txt'
    file1 = open(openFile)
    mkEntry(file1)
main()
TextFile:
Emma,F,20791
Tom,M,1658
Anthony,M,985
Lisa,F,88976
Ben,M,6989
Shelly,F,8975
and I get this output:
IndexError: string index out of range
I am trying to convert lst from a list to a tuple, so that I will be able to order the entries from F to M and from the smallest number to the largest. Around line 7, it's still printing type list instead of type tuple. I don't know why it's doing that.
print(type(lst))
tuple(lst)
print(type(lst)) #still showing type='list'
You're not changing what lst refers to. You create a new tuple with tuple(lst) and immediately throw it away because you don't assign it to anything. You can do:
lst = tuple(lst)
Note that this will not fix your program. Notice that your sort operation is happening once per line of your file, which is not what you want. Try collecting each line into one sequence of tuples and then doing the sort.
Firstly, you are not saving the tuple you created anywhere:
tup = tuple(lst)
Secondly, there is no point in making it a tuple before sorting it - in fact, a list could be sorted in place as it's mutable, while a tuple would need another copy (although that's fairly cheap, the items it contains aren't copied).
Thirdly, the IndexError has nothing to do with whether it's a list or a tuple, nor with whether it is sorted. It most likely comes from the itemgetter: sorted() applies itemgetter(1, 2) to each item of lst, and those items are strings, some of which are shorter than three characters, for instance "F" or "M".
Fourthly, the sort you're doing, but not saving anywhere, is done on each individual line, not the table of data. Considering this means you're comparing a name, a number, and a gender, I rather doubt it's what you intended.
It's completely unclear why you're trying to convert data types, and the code doesn't match the structure of the data. How about moving back to the overview plan and sorting out what you want done? It could well be that something like Python's csv module would help considerably.
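A possible shape for the "collect first, then sort once" approach suggested in both answers, using the csv module and the sample rows from the question. The function name read_entries is hypothetical, io.StringIO stands in for the opened file, and the code assumes Python 3.

```python
import csv
import io
from operator import itemgetter


def read_entries(lines):
    """Parse 'name,gender,count' lines into a list of (name, gender, count) tuples."""
    entries = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        name, gender, count = row
        # Convert the count so it sorts numerically rather than as text.
        entries.append((name, gender, int(count)))
    return entries


sample_file = io.StringIO(
    "Emma,F,20791\nTom,M,1658\nAnthony,M,985\nLisa,F,88976\nBen,M,6989\nShelly,F,8975\n"
)

# Sort the whole table once: by gender first, then by count.
entries = sorted(read_entries(sample_file), key=itemgetter(1, 2))
```

Each record is a tuple with three fields here, so itemgetter(1, 2) picks the gender and the count instead of indexing into a single string.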

Multiple values to one key in python

So my main goal is simple, I want multiple values to be returned when using a single key. However I'm getting errors and confusing behavior. I am new to Python, so I fully expect there to be a simple reason for this issue.
I have a list of objects, list which only contains the index of the object. i.e.
1
2
3
4
etc..
and a file containing the groups that each of the objects belong to, listed in the same order. The file is a single value for n lines (n being the length of the list of objects as well.) i.e. the file looks like this:
2
5
2
4
etc..
meaning the first object belongs in group 2, the second object in group 5, the third in group 2 and fourth in group 4. This file will change depending on my input files. I have attempted the two following suggestions (that I could find).
EDIT: my end goal: to have a dictionary with group numbers as the keys and the objects in the groups as the values.
I looked to this StackOverflow question first for help since it is so similar, and ended up with this code:
def createdDict(list, file):
    f = open(file, 'r')
    d={}
    i=0
    for line in f:
        groupIndex = int(line)
        if groupIndex in d:
            d[groupIndex].append(list[i])
        else:
            d[groupIndex] = list[i]
        i +=1
    print d
    f.close()
And this error:
AttributeError: 'Element' object has no attribute 'append'
d[groupIndex] is just a dictionary and its key and groupIndex should also just be an integer.. not an object from a class I created earlier in the script. (Why is this error showing up?)
I then revised my code after coming upon this other question to be like the following, since I thought this was an alternative way to accomplish my task. My code then looked like this:
def createdDict(list, file):
    f = open(file, 'r')
    d={}
    i=0
    for line in f:
        groupIndex = int(line)
        if groupIndex in d:
            d.setdefault('groupIndex', []).append(list[i])
        else:
            d[groupIndex] = list[i]
        i +=1
    print d
    f.close()
This code snippet doesn't end in an error or what I want, but rather (what I believe) are the last objects in the groups... so print d gives me the key and the last object placed in the group (instead of the desired: ALL of the objects in that group) and then terminal randomly spits out groupIndex followed by all of the objects in list.
My question: what exactly am I missing here? People upvoted the answers to the questions I linked, so they are most likely correct and I am likely implementing them incorrectly. I don't need a correction to both procedures, but the most efficient answer to my problem of getting multiple values attached to one key. What would be the most pythonic way to accomplish this task?
EDIT 2: if this helps at all, here is the class that the error from the first method refers to. I have no idea how it treated any part of this code as a part of this class. I haven't really developed it yet, but I'm all for an answer, so if this helps in locating the error:
class Element(object):
    def __init__(self, globalIndex):
        self.globalIndex = globalIndex
    def GetGlobalIndex (self):
        return self.globalIndex
globalIndex is a separate index of objects (Elements). With my current problem, I am taking a list of these Elements (this is the list mentioned earlier) and grouping them into smaller groups based upon my file (also mentioned earlier). I thought it shouldn't matter because the list is essentially a counting up of integers... How would it mess with my code?
The base of your problem is in this line:
d[groupIndex] = list[i]
In other words, when a key is not in the dictionary, you add a single value (Element object) under that key. The next time you see that key, you try to append to that single value. But you can't append to single values. You need a container type, such as a list, to append. Python doesn't magically turn your Element object into a list!
The solution is simple. If you want your dictionary's values to be lists, then make them lists. When you add the first item, store it as a one-element list:
d[groupIndex] = [list[i]]
Alternatively, you can take a one-item slice of the original list. This will be a list already.
d[groupIndex] = list[i:i+1]
Now the dictionary's values are always lists, and you can append the second and subsequent values to them without error.
As ecatmur points out, you can further simplify this (eliminating the if statement) using d.setdefault(groupIndex, []).append(list[i]). If the key doesn't exist, then the value is taken to be an empty list, and you simply always append the new item. You could also use collections.defaultdict(list).
Just use
d.setdefault(groupIndex, []).append(list[i])
This will check whether groupIndex is in d, so you don't need the if groupIndex in d: line.
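A small sketch of the setdefault approach, written for Python 3 with made-up items and group ids in place of the Element list and the file. The second function reproduces the quoted 'groupIndex' key from the question's second attempt, to show why everything ended up under a single key there. Both function names are hypothetical.

```python
def group_items(items, group_ids):
    """Map each group id to the list of items that belong to it."""
    groups = {}
    for item, group_id in zip(items, group_ids):
        # setdefault returns the existing list, or stores and returns a new empty one.
        groups.setdefault(group_id, []).append(item)
    return groups


def group_items_quoted_key(items, group_ids):
    """Same loop, but keyed by the literal string 'groupIndex' instead of the variable."""
    groups = {}
    for item, group_id in zip(items, group_ids):
        groups.setdefault('groupIndex', []).append(item)
    return groups


items = ["a", "b", "c", "d"]
group_ids = [2, 5, 2, 4]
grouped = group_items(items, group_ids)
quoted = group_items_quoted_key(items, group_ids)
```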
from itertools import izip  # Python 2; on Python 3 use the built-in zip instead
from collections import defaultdict
dd = defaultdict(list)
with open('filename.txt') as fin:
    for idx, line in izip(my_list, fin):
        num = int(line)
        dd[num].append(idx)
This creates a defaultdict with a default type of list, so you can append without using setdefault. It then reads each element of my_list together with the corresponding line from the file, converts the line to an integer, and appends the element to the group represented by num.
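For reference, a Python 3 version of the same idea might look like the sketch below. io.StringIO stands in for the opened file, and the object names and the function name group_by_file are placeholders rather than anything from the question.

```python
import io
from collections import defaultdict


def group_by_file(objects, group_file):
    """Pair each object with the group number on the matching line of group_file."""
    groups = defaultdict(list)
    # zip stops at the shorter input, so extra lines or extra objects are ignored.
    for obj, line in zip(objects, group_file):
        groups[int(line)].append(obj)  # int() tolerates the trailing newline
    return dict(groups)


objects = ["obj1", "obj2", "obj3", "obj4"]
group_file = io.StringIO("2\n5\n2\n4\n")
grouped = group_by_file(objects, group_file)
```

Converting back to a plain dict at the end is optional; it only prevents later lookups of missing keys from silently creating empty lists.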
In your first try, you seem to correctly understand that adding the first element to the dictionary item is a special case, and you cannot append yet, since the dictionary item has no value yet.
In your case you set it to list[i]. However, list[i] is not a list, so you cannot run append on it in later iterations.
I would do something like:
for line in f:
    groupIndex = int(line)
    try:
        blah = d[groupIndex] # to check if it exists
    except KeyError:
        d[groupIndex] = [] # if not, assign empty list
    d[groupIndex].append(list[i])
    i += 1 # move on to the next object
print d
f.close()
