I am reading from a file x, which contains individual data items. These items are separated from each other by a blank line. I want to calculate tf_idf_vectorizer() for each individual data item. So I need to remove all members of tweets whenever the code finds a blank line (\n). I get an error on the bold line in my code.
def load_text():
    file=open('x.txt', 'r')
    tweets = []
    all_matrix = []
    for line in file:
        if line in ['\n', '\r\n']:
            all_matrix.append(tf_idf_vectorizer(tweets))
            **for i in tweets: tweets.remove(i)**
        else:
            tweets.append(line)
    file.close()
    return all_matrix
You can make tweets an empty list again with a simple assignment.
tweets = []
If you actually need to empty out the list in-place, the way you do it is either:
del tweets[:]
… or …
tweets[:] = []
In general, you can delete or replace any subslice of a list in this way; [:] is just the subslice that means "the whole list".
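A small illustration of slice deletion and replacement, written for Python 3. The list contents and the `alias` name are made up for the example; `alias` is only there to show that the same list object is being changed in place:

```python
tweets = ["a", "b", "c", "d", "e"]
alias = tweets                 # second reference to the same list object

del tweets[1:3]                # delete a subslice in place
after_del = list(tweets)       # snapshot of the list at this point

tweets[0:1] = ["x", "y"]       # replace a one-element subslice with two items
after_replace = list(tweets)   # snapshot of the list at this point

del tweets[:]                  # empty the whole list in place
print(after_del, after_replace, alias)
```

Because every step mutates the existing object, `alias` ends up empty as well.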
However, since nobody else has a reference to tweets, there's really no reason to empty out the list; just create a new empty list, and bind tweets to that, and let the old list become garbage to be cleaned up:
tweets = []
Anyway, there are two big problems with this:
for i in tweets: tweets.remove(i)
First, when you want to remove a specific element, you should never use remove. That has to search the list to find a matching element—which is wasteful (since you already know which one you wanted), and also incorrect if you have any duplicates (there could be multiple matches for the same element). Instead, use the index. For example, del tweets[index]. You can use the enumerate function to get the indices. The same thing is true for lots of other list, string, etc. functions—don't use index, find, etc. with a value when you could get the index directly.
Second, if you remove the first element, everything else shifts up by one. So, first you remove element #0. Then, when you remove element #1, that's not the original element #1, but the original #2, which has shifted up one space. The result is that you skip every other element, and once you're half-way through, the loop's position is past the (new) end of the list, so it quietly stops with half the elements still in place. In general, avoid mutating a list while iterating over it; if you must mutate it, it's only safe to do so from the right, not the left (and it's still tricky to get right).
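To see the skipping for yourself, here is a minimal Python 3 sketch. The six placeholder strings and the `removed` list are assumptions added purely to record what the loop actually touches:

```python
tweets = ["t0", "t1", "t2", "t3", "t4", "t5"]
removed = []

for i in tweets:
    tweets.remove(i)      # mutates the list that is being iterated
    removed.append(i)     # record which element the loop really saw

print(removed)   # only every other element was visited
print(tweets)    # the skipped elements are still in the list
```

No exception is raised; the loop simply ends early and leaves half of the list behind.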
The right way to remove elements one by one from the left is:
while tweets:
    del tweets[0]
However, this will be pretty slow, because you keep having to re-adjust the list after each removal. So it's still better to go from the right:
while tweets:
    del tweets[-1]
But again, there's no need to go one by one when you can just do the whole thing at once, or not even do it, as explained above.
You should never try to remove items from a list while iterating over that list. If you want a fresh, empty list, just create one.
tweets = []
Otherwise you may not actually remove all the elements of the list, as I suspect you noticed.
You could also re-work the code to be:
from itertools import groupby

def load_tweet(filename):
    with open(filename) as fin:
        tweet_blocks = (g for k, g in groupby(fin, lambda line: bool(line.strip())) if k)
        return [tf_idf_vectorizer(list(tweets)) for tweets in tweet_blocks]
This groups the file into runs of non-blank lines and blank lines. Where the lines aren't blank, we build a list from them to pass to the vectorizer inside a list-comp. This means that we're not having references to lists hanging about, nor are we appending one-at-a-time to lists.
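Since tf_idf_vectorizer isn't shown in the question, here is a self-contained Python 3 sketch of just the grouping step. The helper name `split_blocks` and the sample lines are assumptions for illustration; in the real code each block would be handed to the vectorizer instead of being returned:

```python
from itertools import groupby

def split_blocks(lines):
    """Group consecutive non-blank lines into lists; blank lines act as separators."""
    return [
        [line.strip() for line in group]
        for is_text, group in groupby(lines, key=lambda line: bool(line.strip()))
        if is_text
    ]

sample = ["tweet a\n", "tweet b\n", "\n", "tweet c\n", "\n", "\n", "tweet d\n"]
blocks = split_blocks(sample)
print(blocks)
```

One difference from the loop in the question is worth noting: the final block is kept even when the input does not end with a blank line, whereas the original loop only vectorizes a block once it reaches the blank line after it.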
Related
I am trying to clean a text field of its duplicate items (each item is on a new line in the text field). My logic: call get() on the text field, insert into a list, and then run an admittedly slow series of nested loops to check for duplicates and then repopulate the text field.
Could someone please help evaluate my logic and tell me why this isn't working?
def checkDup(self):
    clean = []
    dirty = O1.get("1.0", END+'-1c').split("\n")
    for i in dirty[1:]:
        if i not in clean:
            clean.append(i)
            clean.append("\n")
    O1.delete("1.0", END)
    O1.insert(END, clean)
I would have used the same logic with for loops for checking duplicates.. maybe there is something better out there to do that but for now at our level I think it's a good start.
reviewing your code:
for i in dirty[1:]:
Why do you start after the first item in your list here? Does it need to be excluded? If so, you are deleting it anyway with:
O1.delete('1.0', END)
You may need to change your code to O1.delete('2.0', END) if you need to keep the first line.
if i not in clean:
    clean.append(i)
    clean.append('\n')
Here, you are creating a longer list with a bunch of newlines that are each considered a member of your list, interesting.. I messed up with this portion.. after testing I see your results are only half as weird as what I did.
Last line: you are pushing your corrected list directly into your widget, which causes a weird result.
O1.insert(END, clean)
Fix that one this way: O1.insert(END, ''.join(clean)). This joins your list into a single string containing your previously inserted newlines, putting all the text in the right place.
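If it helps, the de-duplication itself can be separated from the widget. The following is only a sketch in Python 3: `dedupe_lines` is a hypothetical helper name, it works on a plain string, and the O1.get / O1.delete / O1.insert calls from the question would wrap it. It uses a set for the membership test and joins with "\n" at the end, so no newline entries end up in the list:

```python
def dedupe_lines(text):
    """Return text with repeated lines removed, keeping the first occurrence of each."""
    seen = set()
    clean = []
    for line in text.split("\n"):
        if line not in seen:
            seen.add(line)
            clean.append(line)
    return "\n".join(clean)

result = dedupe_lines("apple\nbanana\napple\ncherry\nbanana")
print(result)
```

Note that this version also considers the first line, unlike the dirty[1:] slice in the question.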
If I have a list of strings and want to eliminate leading and trailing whitespaces from it, how can I use .strip() effectively to accomplish this?
Here is my code (python 2.7):
for item in myList:
    item = item.strip()
    print item
for item in myList:
    print item
The changes don't persist from one loop to the next. I tried using map as suggested here (https://stackoverflow.com/a/7984192) but it did not work for me. Please help.
Note, this question is useful:
An answer does not exist already
It covers a mistake someone new to programming / python might make
Its title covers search cases both general (how to update values in a list) and specific (how to do this with .strip()).
It addresses previous work, in particular the map solution, which would not work for me.
I'm guessing you tried:
map(str.strip, myList)
That creates a new list and returns it, leaving the original list unchanged. If you want to interact with the new list, you need to assign it to something. You could overwrite the old value if you want.
myList = map(str.strip, myList)
You could also use a list comprehension:
myList = [item.strip() for item in myList]
Which many consider a more "pythonic" style, compared to map.
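For readers on Python 3 (the question is about 2.7), one detail changes: map returns a lazy iterator there rather than a list, so it has to be wrapped in list() to get the same result. A short sketch with made-up sample data:

```python
my_list = ["  alpha ", "beta\n", "\tgamma"]

# List comprehension: builds a new list of stripped strings.
stripped_comp = [item.strip() for item in my_list]

# map() is lazy in Python 3, so materialise it with list().
stripped_map = list(map(str.strip, my_list))

print(stripped_comp, stripped_map)
```

Either way the original list is left untouched until the result is assigned back to its name.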
I'm answering my own question here in the hopes that it saves someone from the couple hours of searching and experimentation it took me.
As it turns out the solution is fairly simple:
index = 0
for item in myList:
    myList[index] = item.strip()
    index += 1
for item in myList:
    print "'"+item+"'"
Single quotes are concatenated at the beginning/end of each list item to aid detection of trailing/leading whitespace in the terminal. As you can see, the strings will now be properly stripped.
To update the values in the list we need to actually access the element in the list via its index and commit that change. The reason is that the loop variable "item" is just a name bound to each element in turn: item = item.strip() rebinds that name to a new string (strings are immutable, so strip() cannot change the original) and leaves the slot in the list pointing at the old, unstripped string.
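The manual counter can be replaced with enumerate. This is a Python 3 sketch with invented sample strings; the `alias` name is an assumption added only to show that the list is updated in place rather than replaced:

```python
my_list = ["  alpha ", "beta\n"]
alias = my_list                        # second name for the same list object

for index, item in enumerate(my_list):
    my_list[index] = item.strip()      # assign through the index, not the loop variable

for item in my_list:
    print("'" + item + "'")
```

Assigning to an existing index does not change the length of the list, so doing it while iterating is safe here.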
I'm trying to strip subdomains off of a large list of domains in a text file. The script works but only for the last domain in the list. I know the problem is in the loop but can't pinpoint the exact issue. Thanks for any assistance :)
with open ("domainlist.txt", "r") as datafile:
    s = datafile.read()
    for x in s:
        t = '.'.join(s.split('.')[-2:])
    print t
This will take "example.test.com" and return "test.com". The only problem is that it won't do this for every domain in the list, only the last one.
What you want is to build up a new list by modifying the elements of an old one. Fortunately, Python has the list comprehension, which is perfect for this job.
with open("domainlist.txt", "r") as datafile:
    # strip() drops the trailing newline that each line read from a file carries
    modified = ['.'.join(x.strip().split('.')[-2:]) for x in datafile]
This behaves exactly like creating a list and adding items to it in a for loop, except faster and nicer to read. I recommend watching the video linked above for more information on how to use them.
Note that file.read() reads the entire file in as one big string. What you probably wanted was to loop over the lines of the file, which is done just by looping over the file itself. Your current loop iterates over the individual characters of the file rather than its lines.
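Put together as a small Python 3 sketch: `strip_subdomains` is a hypothetical helper name, and io.StringIO stands in for the opened domainlist.txt so the example runs on its own. Keeping the last two labels is the same naive rule as in the question, so it is an assumption that this is good enough; suffixes such as "co.uk" would need something smarter.

```python
import io

def strip_subdomains(lines):
    """Keep only the last two dot-separated labels of each non-empty line."""
    result = []
    for line in lines:
        domain = line.strip()          # drop the trailing newline and stray spaces
        if domain:
            result.append(".".join(domain.split(".")[-2:]))
    return result

# Stand-in for: with open("domainlist.txt") as datafile: ...
datafile = io.StringIO("example.test.com\nwww.python.org\nlocalhost\n")
domains = strip_subdomains(datafile)
print(domains)
```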
You are overwriting t in each loop iteration, so naturally only the value from the last iteration stays in t. Instead, collect each result in a list with list.append.
Try this out. Better readability.
with open ("domainlist.txt", "r") as datafile:
    s = datafile.readlines()
t = []
for x in s:
    # strip() removes the newline that readlines() keeps on each line
    t.append('.'.join(x.strip().split('.')[-2:]))
print t
I am reading a file and looking for a particular string in it like this :
template = open('/temp/template.txt','r')
new_elements = ["movie1","movies2"]
for i in template.readlines():
    if "movie" in i:
        print "replace me"
This is all good but I would like to replace the lines that are found with the elements from "new_elements" . I make the assumption that the number of found strings will always match the number of elements in the "new_elements" list . I just don't know how to iterate over the new_elements whilst looking for lines to replace .
Cheers
One way is to make new_elements an iterator:
template = open('/temp/template.txt','r')
new_elements = iter(["movie1","movies2"])
for i in template.readlines():
    if "movie" in i:
        print "replace line with", new_elements.next()
You haven't said how you want to do the replacement (whether you want to write it to a new file, for example), but this will fit into whatever code you use.
What you're looking for is pretty simple: the .pop() method of lists. This will get the next entry in the list, and remove it; your next call to that function will return the next item. No need to do any iteration at all.
Without a parameter, it will pop the last element. If you want to pop the first, you can use new_elements.pop(0), although this will be slower than popping from the end.
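As a rough Python 3 sketch of the iterator approach (where the built-in next(iterator) replaces the Python 2 iterator.next() method): `replace_matching` is a hypothetical helper, the sample lines are invented, and the result is collected in a list because the question doesn't say where the output should go.

```python
def replace_matching(lines, marker, replacements):
    """Replace each line containing marker with the next unused replacement."""
    pending = iter(replacements)
    out = []
    for line in lines:
        if marker in line:
            # next() raises StopIteration if there are more matches than replacements
            out.append(next(pending) + "\n")
        else:
            out.append(line)
    return out

template_lines = ["title\n", "movie placeholder\n", "other\n", "movie again\n"]
new_lines = replace_matching(template_lines, "movie", ["movie1", "movies2"])
print(new_lines)
```

With a plain list, new_elements.pop(0) in place of next(pending) would consume the replacements in the same order.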
I have a list
myList=[1,2,3,4]
I want to access '1' (i.e the first element of myList).
myList is an instance of class/type list & has its own datamembers/attributes [1,2,3,4]
so some way I must be able to access '1' with reference to myList (instance)
>>>EG: myList.__datamemeber__.iteritems()
Note: I know I can do
>>>for i in myList:
Edit
Okay my doubt was
1) is myList an instance of class/type list ?
2) if so what is '1' ,'2' ... of myList ?
EDIT
Okay so I was trying to read a CSV with DictReader
>>>reader = csv.DictReader(ifile)
now I want to know if 'reader' is empty
so I thought if I could get a element of DictReader & see if its null
I can get all the lines
>>>for line in reader:
I want to know if 'reader' is empty. Also, if I get an instance (like reader), can I get its elements?
You seem to operate under the (wrong) assumption that CSV readers are lists or at least somewhat like lists (or, more generally, sequences). With lists, checking for emptiness is easy: generally, compare the length to 0; in Python the idiom is for collections to be falsy if they are empty (i.e. if not items: # empty collection).
But CSV readers aren't lists or even collections, they're iterators. Whether or not there is an element in an iterator is unknown (in fact, it doesn't matter - read on), you can only ask for the next element and either get that or an exception if there is no next element. The items may not even exist until you ask for them. Iterators don't have a notion of length.
If you absolutely positively need special logic to handle empty files, there is a (rather ugly) way to do it, although it forces you to have that empty file-logic and the processing in the same place.
Make a reader
try to get its first element and store it somewhere (say, first_element).
except StopIteration: (i.e. when you get the exception that indicates "no more elements"), the file is empty (or rather, the reader considers it empty).
If the except didn't trigger (that's an else clause on the try, see documentation), you got the first element, can process it, then proceed to process everything else the reader yields in your usual for loop.
As an alternative to the code duplication implied in the last step, just make the loop go for element in itertools.chain([first_element], reader) (which first yields first_element and then passes on each element reader yields). Note that this has some overhead though (probably negligible compared to the reader's parsing efforts and your own processing, but for the record...).
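The steps above can be sketched as follows in Python 3. The function name `read_rows`, the choice to return None for "no data rows", and the io.StringIO sample inputs are all assumptions made so the example is self-contained; a real file object would be passed in the same way.

```python
import csv
import io
from itertools import chain

def read_rows(ifile):
    """Return the data rows as a list of dicts, or None if the reader yields nothing."""
    reader = csv.DictReader(ifile)
    try:
        first_row = next(reader)       # ask for the first element
    except StopIteration:
        return None                    # the reader had no data rows at all
    else:
        # Put the first row back in front of the remaining ones.
        return [dict(row) for row in chain([first_row], reader)]

empty = read_rows(io.StringIO(""))
header_only = read_rows(io.StringIO("name,year\n"))
rows = read_rows(io.StringIO("name,year\nAlien,1979\nHeat,1995\n"))
print(empty, header_only, rows)
```

A file containing only the header line counts as empty here, because DictReader consumes the header as field names and then has no rows left to yield.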
A list is indexed by an integer. For example, to get to its first (in Python, at index 0) item:
>> myList[0]
1
You can use list's index method to find the index of the given value:
>> myList.index(1)
0
And you can iterate a list with enumerate to get both the index and the item:
for index, item in enumerate(myList):
    print "myList[%s] is %s" % (index, item)
I know that after 'for line in reader' it will be empty, and before 'for line in reader' it will not be empty. But what if ifile is an empty file object? Will reader be empty then? One can always use 'for line in reader:' and then check whether line is empty, but I don't want that. I want to know if 'reader' is empty just after it's created using csv.DictReader().
haverows = False
reader = csv.DictReader(ifile)
# reader may or may not have rows
for row in reader:
    haverows = True
    # reader definitely had a row. here it is
    print row
if not haverows:
    # reader never had any rows. oh well.
    print 'done!'
# reader is now exhausted of all its rows.
'1', '2' etc, are items of the list, as opposed to attributes (that's python jargon, I would have called them elements). As pointed out above, you usually just grab them with
mylist[0] # ... or whatever instead of 0
For other fancy stuff, you might find mylist.index handy, or even itemgetter from the operator module. It really depends on what you want.
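A quick Python 3 sketch of those options, using the list from the question (the variable names on the left are just for illustration):

```python
from operator import itemgetter

my_list = [1, 2, 3, 4]

is_list = isinstance(my_list, list)            # my_list is an instance of type list
first = my_list[0]                             # item at index 0
position = my_list.index(3)                    # index of the first item equal to 3
first_and_last = itemgetter(0, -1)(my_list)    # several items at once, as a tuple

print(is_list, first, position, first_and_last)
```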