I appreciate that this may be an issue with my computer/software, but I want to double check that my code isn't causing the problem before ruling it out.
I have written a fairly simple program. It reads a short list of strings from one text file, then, with a second text file open, iterates over each word in that second file, checking whether the first two letters of the word are contained in the list of strings.
If that condition is fulfilled, I use string interpolation to insert the word into a string of HTML code. Finally, I append that string to an existing, empty .html file. When the iteration is finished, I close the HTML file.
with open("strings.txt", "r") as f:
strings = f.read().splitlines()
urlfile = open("links.html", "a")
with open("words.txt", "r") as f:
text = f.read().splitlines()
for word in text:
if word[:2] in strings:
html = '<a href="[URL]/{}">'.format(word)
urlfile.write(html)
urlfile.close()
So far there don't actually seem to be any issues with my code doing what I want: I am generating the right HTML code, and if I print it to the console it does so quickly. It is being appended to the HTML file.
The problem I have is that something I am doing must be computationally expensive or problematic, because Notepad++ freezes every time I try to check links.html for the results. I have managed to see that it looks correct, but Notepad++ then becomes unusable, and my computer is clearly straining. The only solution I have is to close anything related to the html file.
None of the lists used are long and all the operations should in theory be quite simple, so I feel as though I must be doing something wrong. Am I writing to files in an unsafe way? Am I doing something wildly expensive that I'm just missing? I am using Notepad++ v7.9.5, Python 3, and Anaconda prompt.
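For what it's worth, the same logic with the output file also managed by a with block would look like this (just a sketch reusing the filenames above; it isn't expected to change the freezing behaviour):

with open("strings.txt", "r") as f:
    strings = f.read().splitlines()

with open("words.txt", "r") as f:
    text = f.read().splitlines()

# Keeping the output file inside a with block guarantees it is
# flushed and closed even if an exception occurs mid-loop.
with open("links.html", "a") as urlfile:
    for word in text:
        if word[:2] in strings:
            urlfile.write('<a href="[URL]/{}">'.format(word))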
EDIT: I am now able to access the html file on my browser and on Notepad++ without issue. I think the source of the problem was some laptop software updating in the background without me noticing. I'll check that first next time!
I want to edit a few lines in an uncompressed PDF.
I found a similar problem, but since I need to scan the file a few times to get the exact line positions I want to change, that approach doesn't really suit me (and the sheer number of regex matches is more than desired).
The PDF contains UTF-8 encodable lines (a few of which I want to edit, bookmark target ids in particular) and a lot of blobs (images and so on, I guess).
When I edit the file with Notepad it works fine, but when I do it programmatically (reading it in, changing a few lines, writing it back), images and some formatting are missing (since they are not read in in the first place because of the ignore option):
with codecs.open("merged-uncompressed.pdf", "r", encoding='ascii', errors='ignore') as f:
I can read the file in with errors="surrogateescape" and wanted to map the lines onto the import above, but I don't know if this approach can work.
Does anyone know a way how to deal with this?
Best, Lukas
I was able to solve this:
read the file as binary
marked the lines which couldn't be decoded as UTF-8
copied the list line by line to a temporary list (non-decodable lines were copied with a placeholder 'None\n')
then went back to do the searching on the copied list, so I got the lines I wanted to replace
replaced those lines in the original binary list (same indices!)
wrote it back to the file
The resulting PDF was a bit corrupted because of whitespace before the target ids of the bookmarks, but qpdf fixed it by recompressing :)
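A rough sketch of those steps (illustrative only, not the code mentioned below; the output filename merged-edited.pdf is made up):

with open("merged-uncompressed.pdf", "rb") as f:
    raw_lines = f.readlines()          # read the PDF as raw binary lines

# Build a parallel, searchable list: decodable lines as text,
# everything else replaced by the placeholder 'None\n'.
searchable = []
for line in raw_lines:
    try:
        searchable.append(line.decode("utf-8"))
    except UnicodeDecodeError:
        searchable.append("None\n")

# ...search `searchable` for the bookmark target ids, then assign the
# edited lines (encoded back to bytes) to the same indices in raw_lines...

with open("merged-edited.pdf", "wb") as f:
    f.writelines(raw_lines)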
The code is very messy at the moment and so I don't want to publish it right now.
But I want to add it at github within the next few weeks.
If anyone needs it: just comment and it will have more priority.
Thanks to anyone who wanted to help:)
Lukas
I have collected a large set of text (from an online newspaper website) by scraping with the Scrapy framework, and I have stored it in a 'nahidd.txt' file. The txt file size is almost 240 MB.
Now, in this txt file I have a lot of word redundancy. For example, the word 'love' may appear in multiple lines of that txt file. However, I need only one occurrence of the word 'love'.
I have used the following code to remove redundancy from my large 'nahidd.txt' file.
file_object = open("nahidd.txt", "r", encoding='utf-8-sig')
file_object_all_text = file_object.read().split()
file_object_redundancy_removed = " ".join(sorted(set(file_object_all_text), key=file_object_all_text.index))
file_object = open("nahidd_pure.txt", "w", encoding='utf-8-sig')
file_object.write(file_object_redundancy_removed)
But the problem is that whenever I run the command in cmd
scrapy runspider nahidBot.py
it works perfectly fine, but it takes forever (since the file is large) and I just see a single cursor blinking for hours. It's difficult to tell whether my command is still working or has hung. I just need it to show some kind of text in cmd, like 'line 1 processed', 'line 2 processed', or a percentage of the background work done, so that anyone can tell how much work is left or that the command is still running.
Thanks in advance.
Nahid
This line performs a sort:
file_object_redundancy_removed = " ".join(sorted(set(file_object_all_text), key=file_object_all_text.index))
but it uses a linear search (list.index) in the key, which is very bad for performance: the whole word list is scanned once for every unique word.
If you don't need to preserve order, just do:
file_object_redundancy_removed = " ".join(sorted(set(file_object_all_text)))
If you need to preserve order "as occurring", which you're trying to emulate with your sort, a faster way would be to store the words you already encountered in an auxiliary set:
marker = set()
file_object_redundancy_removed = []
for w in file_object_all_text:
    if w not in marker:
        marker.add(w)
        file_object_redundancy_removed.append(w)
You now have a list with redundancy removed and the order of first occurrences preserved; join it with " ".join(...) before writing it out.
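For the progress-indicator part of the question, printing a counter every so many words keeps the console alive during a long run. A minimal sketch built on the loop above (the interval of 100000 is arbitrary):

marker = set()
file_object_redundancy_removed = []
for i, w in enumerate(file_object_all_text, start=1):
    if w not in marker:
        marker.add(w)
        file_object_redundancy_removed.append(w)
    if i % 100000 == 0:
        print("{} words processed".format(i))   # rough progress indicator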
I'm having a problem understanding why my Python program does what it does when reading (first) lines from files and adding the lines to a list. For some reason the first line needs to be empty or it won't read the first line correctly, and even when the first line is empty, it is not empty (at least not according to Python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of each file that gets added to a list in my program is "text" and "text_file.txt" respectively, but any code that, for example, tries to say
if something == "text":
    ...
will not get executed, even though "something" is the same as "text".
So I'm assuming that my problem is that somewhere along the way (in the machine code or something) my computer writes some invisible characters at the beginning of the text file, and that makes the first line not what it appears to be. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find the real reason why my program behaves this way. Also, I would prefer not to use this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))
and this does not work either. The problem only affects the first character of the first line of the files.
Edit2:
It seems the problem is a byte order mark (BOM) at the beginning of my text files. After some quick googling I didn't find a solution for how to remove it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently Notepad is not a real text editor. I guess I'll just swap over from Notepad to Notepad++ to avoid this problem. However, just in case I still have to handle my files in Notepad: if I open a text file in Notepad and add some text to it, will it add a BOM, or does it only do that when the file is first created?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python by decoding with the "utf-8-sig" codec (this form applies to bytes read in binary mode):
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
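In Python 3, which the question's open(..., encoding=...) calls suggest, the simplest change is probably to open the file with that codec directly, e.g. (reusing filename.txt from the question):

filelist = []
with open("filename.txt", "r", encoding="utf-8-sig") as f:
    # utf-8-sig silently skips a leading BOM if present,
    # and behaves like plain UTF-8 otherwise.
    for line in f:
        filelist.append(line.rstrip("\n"))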
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: Python readlines() reported invalid characters at the start of the first line, something like . I tried all the suggestions I could google, with no luck.
I came up with a simple trick to skip that line:
add a blank line as the first line in the text file, then
if len(line[i]) > len(line[0]):
    # do things
else:
    # skip the line
In my case len(line[0]) == 4, and all other lines are longer than 4.
I'm very new to Python (and coding in general, if I'm honest) and decided to learn by dipping into the Twitter API to make a weird Twitterbot that scrambles the words in a tweet and reposts them, _ebooks style.
Anyway, the way I have it currently set up, it pulls the latest tweet and then compares it to a .txt file with the previous tweet. If the tweet and the .txt file match (i.e., not a new tweet), it does nothing. If they don't, it replaces the .txt file with the current tweet, then scrambles and posts it. I feel like there's got to be a better way to do this than what I'm doing. Here's the relevant code:
words = hank[0]['text']
target = open("hank.txt", "r")
if words == "STOP":
    print "Sam says stop :'("
    return
else:
    if words == target.read():
        print "Nothing New."
    else:
        target.close()
        target = open("hank.txt", "w")
        target.write(words)
        target.close()
Obviously, opening as 'r' just to check it against the tweet, closing, and re-opening as 'w' is not very efficient. However, if I open as 'w+' it deletes all the contents of the file when I read it, and if I open it as 'r+', it adds the new tweet either to the beginning or the end of the file (dependent on where I set the pointer, obviously). I am 100% sure I am missing something TOTALLY obvious, but after hours of googling and dredging through Python documentation, I haven't found anything simpler. Any help would be more than welcome haha. :)
with open(filename, "r+") as f:
data = f.read()// Redaing the data
//any comparison of tweets etc..
f.truncate()//here basically it clears the file.
f.seek(0)// setting the pointer
f.write("most recent tweet")// writing to the file
No need to close the file instance, it automatically closes.
Just read python docs on these methods used for a more clear picture.
I suggest you use yield to compare hank.txt and the new tweet line by line, so that memory is saved, if you are that focused on efficiency.
As for the file operation, I don't think there is a better way to overwrite a file. If you are using Linux, maybe 'cat > hank.txt' could be faster. Just a guess.
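A rough illustration of the line-by-line idea (the helper names stored_lines and same_tweet are made up for this sketch):

def stored_lines(path):
    # Lazily yield lines from the stored tweet, one at a time.
    with open(path, "r") as f:
        for line in f:
            yield line

def same_tweet(path, words):
    new_lines = words.splitlines(True)
    gen = stored_lines(path)
    for expected in new_lines:
        try:
            if next(gen) != expected:
                return False
        except StopIteration:
            return False                 # the stored tweet is shorter
    return next(gen, None) is None       # and has no extra lines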
From what I've researched, csv.writer's writerow should take a list and write it to the given CSV file. Here's what I tried:
from csv import writer
with open('Test.csv', 'wb') as file:
    csvFile, count = writer(file), 0
    titles = ["Hello", "World", "My", "Name", "Is", "Simon"]
    csvFile.writerow(titles)
I'm just trying to write it so that each word is in a different column.
When I open the file that it creates, however, I get the following message:
After pressing to continue anyway, I get a message saying that the file is either corrupted or is a SYLK file. I can then open the file, but only after going through two error messages every time I open it.
Why is this?
Thanks!
It's a documented issue that Excel will assume a csv file is SYLK if the first two characters are 'ID'.
Venturing into the realm of opinion - it shouldn't, but Excel thinks it knows better than the extension. To be fair, people expect it to be able to figure out cases where the extension really is wrong, but in a case like this assuming the extension is wrong, and then further assuming the file is corrupt when it doesn't appear corrupt if interpreted according to the extension is just mind-boggling.
@John Y points out:
One thing to watch out for: The "workaround" given by the Microsoft issue linked to by @PeterDeGlopper is to (manually) prepend an apostrophe into the file. (This is also advice commonly found on the Web, including Stack Overflow, to try to force CSV digits to be treated as strings rather than numbers.) This is not what I'd call good advice, as that injects a literal apostrophe into your data.
@DSM suggests using quoting=csv.QUOTE_NONNUMERIC on the writer. Excel is not confused by a file beginning with "ID" rather than ID, so if the other tools that are going to work with the CSV accept that quoting level, this is probably the best solution other than just ignoring Excel's confusion.
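A sketch of that suggestion (Python 3 syntax, reusing the Test.csv example from the question; the leading "ID" field is just to show the problem case):

from csv import writer, QUOTE_NONNUMERIC

with open('Test.csv', 'w', newline='') as f:
    csvFile = writer(f, quoting=QUOTE_NONNUMERIC)
    # Non-numeric fields are quoted, so the file starts with "ID" in quotes
    # and Excel no longer mistakes it for a SYLK file.
    csvFile.writerow(["ID", "Hello", "World", "My", "Name", "Is", "Simon"])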