I have a .txt file of words I want to 'clean' of swear words, so I have written a program which checks each position, one-by-one, of the word list, and if the word appears anywhere within the list of censorable words, it removes it with var.remove(arg). It's worked fine, and the list is clean, but it can't be written to any file.
wordlist is the clean list.
newlist = open("lists.txt", "w")
newlist.write(wordlist)
newlist.close()
This returns this error:
newlist.write(wordlist)
TypeError: expected a string or other character buffer object
I'm guessing this is because I'm trying to write a list variable to the file, but there really is no alternative; there are 3526 items in the list.
Any ideas why it can't write to a file with a list variable?
Note: lists.txt does not exist, it is created by the write mode.
write writes a string. You cannot write a list variable because, even though it is clear to a human that the words should be separated by spaces or semicolons, the computer cannot make that assumption for you; it must be given the exact data (byte-wise) that you want to write.
So you need to convert this list to a string explicitly and then write it into the file. For that goal,
newlist.write('\n'.join(wordlist))
would suffice (and provide a file where every line contains a single word).
For certain tasks, converting the list with str(wordlist) (which returns something like ['hi', 'there']) and writing that would work, and would allow reading it back with eval or, more safely, ast.literal_eval. For long lists, however, this wastes space (it adds about 4 bytes per word) and would probably take more time.
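As a rough sketch of the full round trip, assuming Python 3 and that no word contains a newline, one word per line can be written and read back as follows. The helper names save_words and load_words are made up for this illustration.

```python
def save_words(words, path):
    """Write one word per line (assumes no word contains a newline)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words))


def load_words(path):
    """Read the file back into a list of words, without the newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()
```

Calling save_words(wordlist, "lists.txt") and later load_words("lists.txt") should give back the same list.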
If you want better formatting for structured data, you can use the built-in json module.
text_file.write(json.dumps(list_data, separators=(',\n', ':')))
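For a list of strings, the JSON output is also a valid Python list literal, so you can write it as a variable assignment and import it later.
This could look something like this:
import json

var_name = 'newlist'
# "w" creates the file if it does not exist ("r+" fails on a missing file)
with open(path, "w", encoding='utf-8') as text_file:
    # json.dumps already emits the enclosing [ and ], so only the assignment is added
    text_file.write(f"{var_name} = ")
    text_file.write(json.dumps(list_data, separators=(',\n', ':')))
    text_file.write("\n")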
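If the file only needs to be read back as data rather than imported as a module, a plain JSON round trip may be simpler. This is a minimal sketch assuming Python 3; save_list and load_list are hypothetical helper names, not part of the json module.

```python
import json


def save_list(items, path):
    """Serialize a list to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f)


def load_list(path):
    """Load the list back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Unlike the one-word-per-line format, this also survives words that contain newlines or other special characters.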
I know there are similar threads to this question (having looked at them already) but I cannot, as a noob, work out how to translate those answers across to adjust my script to make it work (4+ days of trying).
So.. I have a python script to randomly select a subset of items from a file and components of those items. I want to create two new txt files as output. One with the subset of items and one with just a list of components (Ingredients) for those items.
To do this I have done write-lines to the first txt (MenuOutput.txt)file and then want to use regex (re.sub) to strip out the first part of the string from each line in the second file (ShoppingOutput.txt).
Now the issue: the TypeError: 'list' object cannot be interpreted as an integer. I understand (I think) the problem is the re.sub outputs a list object. But I don't know another way to strip the first part of each line from a text file. Is there a way of tweaking the re.sub to make it work, or do I need another function I am unaware of?
Menu_choices = random.sample(sample_list, k=6)
MenuOutput = open('MenuOutput.txt', 'w')
for element in Menu_choices:
    MenuOutput.write(element)
MenuOutput.close()
MyFile = open('ShoppingOutput.txt', 'w')
ShoppingOutput = re.sub(r'.*?', 'I', Menu_choices)
for element in ShoppingOutput:
    MyFile.write(element)
MyFile.close
Just like you loop over the list of strings to write them, you have to loop over them to perform other string manipulations on them.
with open('ShoppingOutput.txt', 'w') as my_file:
    for element in Menu_choices:
        my_file.write(re.sub(r'.*?', 'I', element))
Notice also the upgrade to a with statement, and using snake_case for regular variables.
Your regex seems both inexact and inefficient, though. The lazy pattern .*? can match the empty string, so it does not remove a prefix. It is probably better to just use my_file.write('I' + element) and get rid of the re.sub, or perhaps replace it with a simple substring operation if the intent was to remove a prefix but you hadn't worked out the correct regex for that yet.
my_file.write('I' + element[element.index(' ')+1:])
would write everything after the first space.
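One caveat: element.index(' ') raises ValueError when a line contains no space. A sketch of a more forgiving variant uses str.partition instead. The helper name drop_first_word is hypothetical, and the exact layout of the menu lines is an assumption, since the question does not show them.

```python
def drop_first_word(line):
    """Return everything after the first space, or the line unchanged if it has no space."""
    head, sep, tail = line.partition(" ")
    # sep is the empty string when no space was found
    return tail if sep else line
```

It could be used inside the loop as my_file.write('I' + drop_first_word(element)).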
I am very new to Python and am looking for assistance to where I am going wrong with an assignment. I have attempted different ways to approach the problem but keep getting stuck at the same point(s):
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
Problem 2: When I try and combine the lists I keep receiving "None" for my result or Nonetype errors [which I think means I have added the None's together(?)].
The assignment is:
#8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.You can download the sample data at http://www.py4e.com/code3/romeo.txt
My current code which is giving me a Nonetype error is:
poem = input("enter file:")
play = open(poem)
lst= list()
for line in play:
    line=line.rstrip()
    word=line.split()
    if not word in lst:
        lst= lst.append(word)
print(lst.sort())
If someone could just talk me through where I am going wrong that will be greatly appreciated!
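Your problem is lst = lst.append(word): append() modifies the list in place and returns None, so after the first assignment lst is None. The same applies to print(lst.sort()). Also, line.split() returns a list of words, so you need an inner loop over the words:
with open(poem) as f:
    lines = f.read().split('\n')  # you can also use readlines()
lst = []
for line in lines:
    words = line.split()
    for word in words:
        if word not in lst:  # skip words that were already collected
            lst.append(word)  # do not assign the result, append() returns None
lst.sort()  # sort() also works in place and returns None
print(lst)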
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
You are doing play = open(poem) and then for line in play:, which processes the file line by line. If you want to process the whole content at once, do:
play = open(poem)
content = play.read()
words = content.split()
Always remember to close the file after you have used it, i.e. do
play.close()
unless you use a context manager (i.e. with open(poem) as f:), which closes it for you.
Just to help you get into Python a little more:
You can:
1. Read the whole file at once (if it fits comfortably in RAM; if not, read it in reasonably sized chunks, one after another)
2. Split the data you read into words and
3. Use set() or dict() to remove duplicates
Along the way, don't forget to pay attention to upper and lower case if you want to treat the same word in different cases as one word, rather than just collecting distinct strings.
This will work in Py2 and Py3 as long as you do something about the input() function in Py2 (use raw_input(), or type quotes around the path when entering it), so:
path = input("Filename: ")
f = open(path)
c = f.read()
f.close()
words = set(x.lower() for x in c.split()) # This will make a set of lower case words, with no repetitions
# This equals to:
#words = set()
#for x in c.split():
# words.add(x.lower())
# set() is an unordered datatype that ignores duplicate items
# and it mimics mathematical sets and their properties (unions, intersections, ...)
# it is also fast as hell.
# Checking a list() each time for the existence of the word is slow as hell
#----
# OK, you need a sorted list, so:
words = sorted(words)
# Or step-by-step:
#words = list(words)
#words.sort()
# Now words is your list
As for your errors, do not worry, they are common at the beginning in almost any object-oriented language.
Others explained them well in their comments. But so that this answer is not lacking:
Always pay attention to which functions or methods operate on the object in place (list.sort(), list.append(), list.insert(), set.add()...) and which ones return a new object (sorted(), str.lower()...).
If you run into a similar situation again, use help() in the interactive shell to see exactly what a function does.
>>> help(list.append)
>>> help(list.sort)
>>> help(str.lower)
>>> # Or any short documentation you need
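To make the in-place versus new-object distinction concrete, here is a small sketch with a made-up word list:

```python
words = ["pear", "Apple", "fig"]

new_list = sorted(words)     # returns a new sorted list; words itself is unchanged
returned = words.sort()      # sorts words in place and returns None
lowered = words[0].lower()   # str.lower() returns a new string

print(new_list)   # ['Apple', 'fig', 'pear']
print(returned)   # None
print(lowered)    # apple
```

Assigning the result of an in-place method, as in lst = lst.append(word), therefore replaces the list with None.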
Python, especially Python 3.x, is strict about operations between different types, but some combinations have a defined meaning and work, sometimes in unexpected ways.
E.g. you can do:
print(40*"x")
It will print 40 'x' characters, because multiplying a string by an integer repeats it.
But:
print([1, 2, 3]+None)
will, logically, not work, which is what is happening somewhere in the rest of your code.
In some languages like javascript (terrible stuff) this will work perfectly well:
v = "abc "+123+" def";
Inserting the 123 seamlessly into the string, which is useful, but from another angle a programming nightmare.
Also, Py3 no longer allows the Py2 assumption that you can mix unicode and byte strings and have an automatic conversion performed.
I.e. this is a TypeError:
print(b"abc"+"def")
because b"abc" is bytes and "def" (or u"def") is str in Py3 (what was unicode in Py2).
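A short sketch of the explicit conversions that Python 3 expects instead (UTF-8 is assumed here as the encoding):

```python
raw = b"abc"
text = "def"

# Convert one side explicitly so both operands have the same type.
combined_text = raw.decode("utf-8") + text     # str + str
combined_bytes = raw + text.encode("utf-8")    # bytes + bytes

print(combined_text)
print(combined_bytes)
```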
Enjoy Python, it is the best!
I am trying to find a fast way of searching strings in a file. First of all, I don't have only one string to find. I have a list of 1900 strings to find in a file which is 150MB. So basically I am opening a file, looping for 1900 times to find all occurrences of that string in that file. Here are some of the attributes of my search.
Size of the file to be searched is 150mb – it’s text file.
I need to find all occurrences of 1900 strings in a file. Means I am looping 1900 times entire file to search for all occurrences.
It’s not simple search, I have to use regex to search the string.
In a few cases, I need the line above and the line below the line where I found the search string. So I need to use file.readlines(), not file.read().
In few cases I also have to replace the searched string with new string.
First I am trying to find a best way to search in the file. My code is taking too long. I am not sure if this is best way to do it:
#searchstrings is list of 1900 strings
file = open("mytextfile.txt", "r")
for line in file:
    for i in range(len(searchstrings)):
        if searchstrings[i] in line:
            print(line)
file.close()
This code does the job but it’s extremely slow. Also it does not give me option to choose the line above or below where the searchstring is found.
Another code I am using to replace the string is like below. This code is also extremely slow. Here I am using regex.
file = open("mytextfile.txt", "r")
file_data = file.read()
#searchstrings is list of 1900 strings
#replacestrings is list of 1900 strings that needs to be replaced
for i in range(len(searchstrings)):
    src_str = re.compile(searchstrings[i], re.IGNORECASE)
    file_data = src_str.sub(replacestrings[i], file_data)
file.close()
I know the performance of the code depends on the computing power as well, however, I just want to know what is the best way to write this code that will work at optimum speed for given hardware. Also I would like to know how to time the program execution.
I like Unix commands, they are fun, fast and efficient.
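import re, sys
# grep-like filter: prints the lines from stdin that match the regex given as first argument
# (map() is lazy in Python 3, so the lines are written with writelines() instead)
sys.stdout.writelines(line for line in sys.stdin if re.search(sys.argv[1], line))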
A few observations.
For idiomatic Python, you usually want
for string in searchstrings:
    ...
instead of
for i in range(len(searchstrings)):
    searchstrings[i]
and with open(filename) as f: ... instead of open()/close(). The with statement will close the file automatically.
When you want to replace any of several strings with a regex, you can do
re.sub('|'.join(YOUR_STRINGS), replacement, text)
because | is the regex symbol for "or", instead of looping over them all individually. If the strings are meant literally rather than as patterns, pass each one through re.escape() first.
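When each search string has its own replacement, as in the question, the alternation can be combined with a lookup in a replacement callback. The following is only a sketch under stated assumptions: the search strings are literal text (they are escaped), matching is case-insensitive, the mapping is not empty, and replace_all is a hypothetical helper name.

```python
import re


def replace_all(text, replacements):
    """Replace every key of `replacements` found in `text` with its value, in one pass.

    Assumes a non-empty dict whose keys are literal strings, matched case-insensitively.
    """
    # Put longer keys first so that "foobar" wins over "foo" in the alternation.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)
    # Lower-cased lookup table so that a match like "Foo" finds the key "foo".
    lookup = {k.lower(): v for k, v in replacements.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

This scans the text once instead of once per search string, which is where most of the time in the original loop goes. Whether a single 1900-way alternation is fast enough for a 150 MB file would still need to be measured.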
For performance, I might try switching from CPython to PyPy. PyPy is another implementation of the same language but often much faster.
On the other hand, if that's really all your program is supposed to do, you might want to use a dedicated tool like Ag or RipGrep, which have already been optimized for this job, possibly through the subprocess.run() function if you're working in Python.
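For the two remaining points in the question, the neighbouring lines and timing, a possible sketch keeps the lines in a list so that neighbours can be reached by index, and measures elapsed time with time.perf_counter(). The function name find_with_context and the sample data are invented for illustration.

```python
import re
import time


def find_with_context(lines, pattern):
    """Yield (previous, current, next) for each line matching the compiled pattern."""
    for i, line in enumerate(lines):
        if pattern.search(line):
            previous = lines[i - 1] if i > 0 else None
            following = lines[i + 1] if i + 1 < len(lines) else None
            yield previous, line, following


sample = ["alpha\n", "beta needle\n", "gamma\n"]

start = time.perf_counter()
hits = list(find_with_context(sample, re.compile("needle")))
elapsed = time.perf_counter() - start

print(hits)
print(f"search took {elapsed:.6f} seconds")
```

With a real file, sample would be replaced by the result of readlines(), which for 150 MB of text needs a corresponding amount of memory.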
I have a file which currently stores a string eeb39d3e-dd4f-11e8-acf7-a6389e8e7978
which I am trying to pass into as a variable to my subprocess command.
My current code looks like this
with open(logfilnavn, 'r') as t:
    test = t.readlines()
print(test)
But this prints ['eeb39d3e-dd4f-11e8-acf7-a6389e8e7978\n'] and I don't want the [' and \n'] parts to be passed into my command, so I'm trying to remove them by using replace.
with open(logfilnavn, 'r') as t:
    test = t.readlines()
removestrings = test.replace('[', '').replace('[', '').replace('\\', '').replace("'", '').replace('n', '')
print(removestrings)
I get an exception saying this:
'list' object has no attribute 'replace'
So how can I replace these with nothing and store the result as a string for my subprocess command?
readlines() returns a list. Try print(test[0].strip())
You can read the whole file and split lines using str.splitlines:
test = t.read().splitlines()
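Your test variable is a list, because readlines() returns a list of all lines read.
Since you said the file only contains this one line, you probably want to work on only the first line that you read. Note that the brackets, quotes and \n in the printed output are just how Python displays a list; the string itself only contains a trailing newline character, which replace('\\', '') and replace('n', '') would not remove. Stripping that newline is enough:
removestrings = test[0].rstrip('\n')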
Where you went wrong...
file.readlines() in Python returns a list of the lines in the file (a list is what many other languages call an array). Here you are treating the list as a string. You must first select the string inside it, then apply the string-only method.
Even then, your replace calls would not work, because you are trying to remove characters that only belong to the way the Python interpreter displays the list, not to the string itself.
Further information...
In code it is not a string; what was printed is only a readable representation of the list. The example below works for any number of lines, but it only prints the first element, so you will need to change that.
this may be useful...
You could perhaps make the variables globally available, so that other parts of the program can read them before they go out of scope.
more useless stuff
Scope is the region of the program in which a variable is visible. When a variable goes out of scope, the interpreter (what runs the program) can remove it from memory. Scopes also keep names local: a name like i is used in a lot of loops, and without scopes every variable in the whole project would need a different name. An easy way to picture this is to think of scopes as branches of a tree: the tips of different branches don't touch, and neither do their variables.
solution?
e.g:
with open(logfilnavn, 'r') as file:
    lines = file.readlines()  # creates a list
# a list comprehension that goes through each line and strips the trailing newline
# (rstrip is safer than line[:-1], which would cut a real character from a last line without a newline)
# this will work with any number of lines
strippedLines = [line.rstrip('\n') for line in lines]
# or
strippedLines = [line.replace('\n', '') for line in lines]
# you can now print the string stored within the list
print(strippedLines[0])  # this prints the first element in the list
I hope this helped!
You get the error because readlines returns a list object. Since you mentioned in the comment that there is just one line in the file, it's better to use readline() instead:
with open(logfilnavn, 'r') as t:
    line = t.readline().rstrip('\n')  # readline() keeps the trailing newline, so strip it
# line is still available here; a with block does not create a new scope
print(line)
# output:
eeb39d3e-dd4f-11e8-acf7-a6389e8e7978
readlines will return a list of lines, and you can't use replace with a list.
If you really want to use readlines, you should know that it doesn't remove the newline character from the end of each line; you'll have to do it yourself.
lines = [line.rstrip('\n') for line in t.readlines()]
Even after removing the newline character from the end of each line, you'll still have a list of lines. From the question it looks like you only have one line, so you can just access the first one with lines[0].
Or you can leave out readlines and use read, which reads all of the contents of the file, and then call rstrip on the result.
contents = t.read().rstrip('\n')
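To tie this back to the subprocess use case, here is a hedged sketch of passing the cleaned string as a single argument. The question does not show the actual command, so the child process below is just Python echoing its argument back; read_token and run_with_token are hypothetical names, and capture_output/text assume Python 3.7 or newer.

```python
import subprocess
import sys


def read_token(path):
    """Return the file's content without surrounding whitespace or the trailing newline."""
    with open(path) as f:
        return f.read().strip()


def run_with_token(token):
    """Pass the token as one argument to a child process (here: Python echoing it back)."""
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", token],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()
```

Passing the arguments as a list, rather than building one shell string, avoids any quoting problems with the value read from the file.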
I'm using python 2.6 on linux.
I have two text files
first.txt has a single string of text on each line. So it looks like
lorem
ipus
asfd
The second file doesn't quite have the same format.
it would look more like this
1231 lorem
1311 assss 31 1
etc
I want to take each line of text from first.txt and determine if there's a match in the second text. If there isn't a match then I would like to save the missing text to a third file. I would like to ignore case but not completely necessary. This is why I was looking at regex but didn't have much luck.
So I'm opening the files, using readlines() to create a list.
Iterating through the lists and printing out the matches.
Here's my code
first_file=open('first.txt', "r")
first=first_file.readlines()
first_file.close()
second_file=open('second.txt',"r")
second=second_file.readlines()
second_file.close()
while i < len(first):
j=search[i]
while k < len(second):
m=compare[k]
if not j.find(m):
print m
i=i+1
k=k+1
exit()
It's definitely not elegant. Anyone have suggestions how to fix this or a better solution?
My approach is this: Read the second file, convert it into lowercase and then create a list of the words it contains. Then convert this list into a set, for better performance with large files.
Then go through each line in the first file, and if it (also converted to lowercase, and with extra whitespace removed) is not in the set we created, write it to the third file.
with open("second.txt") as second_file:
    second_values = set(second_file.read().lower().split())
with open("first.txt") as first_file:
    with open("third.txt", "wt") as third_file:
        for line in first_file:
            if line.lower().strip() not in second_values:
                # line already ends with a newline, so strip it before adding one
                third_file.write(line.strip() + "\n")
set objects are a simple container type that is unordered and cannot contain duplicate values. They are designed to let you quickly add or remove items, or tell if an item is already in the set.
with statements are a convenient way to ensure that a file is closed, even if an exception occurs. They are enabled by default from Python 2.6 onwards; in Python 2.5 they require that you put the line from __future__ import with_statement at the top of your file.
The in operator does what it sounds like: tell you if a value can be found in a collection. When used with a list it just iterates through, like your code does, but when used with a set object it uses hashes to perform much faster. not in does the opposite. (Possible point of confusion: in is also used when defining a for loop (for x in [1, 2, 3]), but this is unrelated.)
Assuming that you're looking for the entire line in the second file:
second_file=open('second.txt',"r")
second=second_file.readlines()
second_file.close()
first_file=open('first.txt', "r")
for line in first_file:
    if line not in second:
        print line
first_file.close()