I am parsing XML files on an Ubuntu Linux machine using a Python script and the cElementTree package. After a while (at the same point every time) it fails with the error
Segmentation fault (core dumped)
This seems to be a C error, so I think it's connected to the C library I am using (cElementTree). However, I am a bit stuck on how to debug this. If I run the same program on my local MacBook, it works without any problem; it only crashes on the Linux server.
How can I debug this? Does anybody know of problems with cElementTree on Linux?
Here is my code:
import gzip
import xml.etree.cElementTree as ET

def fill_pubmed_papers_table(list_of_files):
    for f in list_of_files:
        print "read file %s" % f
        inF = gzip.open(f, 'rb')
        tree = ET.parse(inF)
        inF.close()
        root = tree.getroot()
        papers = root.findall('PubmedArticle')
        root.clear()
        for i, citation in enumerate(papers):
            write_to_db(citation)
    return
The parsing function write_to_db() is fairly long, but I can make it available if anybody is interested.
OK, I'm not sure whether it will help anyone, but I found the cause of the segfault. It was not actually connected to cElementTree, but to the file being read in. I do not completely understand why this happened, but my code works fine if I delete the papers list at the end of each loop iteration, meaning I changed the code to
import gc

def fill_pubmed_papers_table(list_of_files):
    for i, f in enumerate(list_of_files):
        print "read file %d names %s" % (i, f)
        inF = gzip.open(f, 'rb')
        tree = ET.parse(inF)
        inF.close()
        root = tree.getroot()
        papers = root.findall('PubmedArticle')
        print "number of papers = ", len(papers)
        # we don't need anything from root anymore
        root.clear()
        for citation in papers:
            write_to_db(citation)
        # If I do not release memory here I get segfault on the linux server
        del papers
        gc.collect()
    return
I also added the garbage collector call just in case, but it's not actually needed. Deleting the papers list is what solved the problem. I guess it has to do with memory(?)
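If memory really is the issue, one alternative to building the whole tree and deleting the list afterwards is to stream the file. The following is a minimal sketch of that different technique, not the fix described above. It assumes Python 3 and the standard xml.etree.ElementTree module; the helper name iter_articles and the sample XML are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_articles(fileobj, tag="PubmedArticle"):
    """Yield each <tag> element as soon as it has been fully parsed."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs when the consumer asks for the next article:
            # drop the children and text of the one just processed.
            elem.clear()

# Tiny in-memory stand-in for one of the gzipped PubMed files.
sample = (b"<PubmedArticleSet>"
          b"<PubmedArticle><PMID>1</PMID></PubmedArticle>"
          b"<PubmedArticle><PMID>2</PMID></PubmedArticle>"
          b"</PubmedArticleSet>")

for article in iter_articles(io.BytesIO(sample)):
    print(article.findtext("PMID"))
```

The same generator should accept the file object returned by gzip.open(f, 'rb'), so write_to_db(citation) could be called once per yielded element without keeping a list of all papers.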
I'm fairly new to Python and I am having real trouble because I keep running into this Segmentation fault: 11 error.
Here is a simple code example that produces this error every time:
import grequests
class Url(object):
    pass
a = Url()
a.url = 'http://www.heroku.com'
a.result = 0
b = Url()
b.url = 'http://www.google.com'
b.result = 0
c = Url()
c.url = 'http://www.wordpress.com'
c.result = 0
urls = [a, b, c]
rs = (grequests.get(u.url) for i, u in enumerate(urls))
grequests.map(rs)
What is absolutely bizarre is that if I replace the urls = ... line with this:
urls = [a, b]
Then I get no error, and the script runs fine.
If I change that to just
urls = [c]
Then I also get no error, and the script runs fine.
If I change c.url = ... to
c.url = "http://yahoo.com"
And revert urls = ... back to
urls = [a, b, c]
Then I do get the segmentation fault: 11 error.
Being a memory issue sounds like a possibility though I'm not sure how to fix it.
I've been stuck on this for a number of days, so any help, no matter how small, is greatly appreciated.
For reference, I'm using macOS High Sierra (10.13.5) and installed Python 3.7.0 using Brew.
A segmentation fault (access violation) is caused by an invalid memory reference: an attempt to access an address that should not be accessible to the current process (it could also be a buffer overrun, or an entirely bogus or uninitialized pointer). Usually it indicates a bug in the underlying code or a problem during the binary build (linking).
The problem does not lie in your Python script, even though you may be able to trigger it by modifying your Python code. Even if you, for instance, exhausted buffers used in a module or by the interpreter itself, it should still handle that situation gracefully.
Given your script, either gevent (a dependency of grequests) or your Python (and/or bits of its standard library) are the likely places where the segfault occurred (or a library they use causes it). Perhaps try rebuilding them? Were there any substantial changes around them on your system since you built them? Perhaps they are running against libraries other than the ones they were originally built against?
You can also allow your system to dump cores (I presume macOS, being essentially BSD, can do that) and load the core dump into a debugger such as gdb to see what exactly crashed and what was going on at the time.
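A lighter-weight first step than a core dump is the standard-library faulthandler module (Python 3.3 and later, so available on the Python 3.7 mentioned in the question). It prints the Python-level traceback when the process receives a fatal signal such as SIGSEGV. The sketch below is only an illustration; the helper name enable_crash_traceback and the log file name are assumptions.

```python
import faulthandler

def enable_crash_traceback(log_path):
    """Write Python tracebacks for all threads to log_path on a fatal signal."""
    # The file must stay open for as long as faulthandler is enabled,
    # so the caller keeps the returned file object alive.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file

# Call this once near the top of the script, before the code that crashes.
crash_log = enable_crash_traceback("crash_traceback.log")
```

The same handler can be switched on without editing the script by running it as python -X faulthandler script.py, in which case the traceback goes to stderr. This shows which Python line was executing when the crash happened, not the C-level cause, so a debugger may still be needed afterwards.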
I have a program that runs for a while and outputs "Killed". I can't imagine that it's a memory thing because the file it's loading is under a gig. I've been trying to Google what other things can cause a python script to be killed but all I can find are articles about people being eaten by snakes... Here is my code:
import neo
from neo.io import BlackrockIO
dir = '/PHShome/gcw8/Ephys_Test/MG79_d4_Sat.ns3'
reader = BlackrockIO(filename=dir)
blks = reader.read(lazy=False, cascade=True)
for blk in blks:
    for seg in blk.segments:
        print 'Sampling Rate = %s' %seg.analogsignals[0].sampling_rate
    print 'Number of Channels = %d' %len(blk.recordingchannelgroups[0].recordingchannels)
A little background: the file I'm working on is an electrophysiology data file that consists of
1.) a header containing metadata (small)
2.) data (large)
The lazy option of reader.read() loads only the header when set to True and loads the entire file (including the data) when set to False. The code is not killed when lazy = True but is killed when lazy = False. While lazy = False causes much, much more of the file to be read, the whole file is still under a gigabyte:
[gcw8#database_dev Ephys_Test]$ du -h ./MG79_d4_Sat.ns3
719M ./MG79_d4_Sat.ns3
So I have trouble believing that it is a memory issue. Can anyone think of another reason this is being killed, or a workaround? I'm running Python 2.7 on CentOS.
That BlackrockIO library appears to parse the data and do all sorts of things with it. It could be that you actually ARE running out of memory. You could try monitoring memory usage using e.g. htop.
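On Linux, a bare "Killed" usually means the process received SIGKILL, and the kernel's out-of-memory killer is a common sender. Parsed data can also take considerably more memory than the file does on disk. Besides watching htop, the script can report its own peak memory use. The sketch below uses the standard resource module (Unix only) and is written for Python 3; the helper name peak_rss_mb is made up, and the list allocation is only a stand-in for the real reader.read() call.

```python
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor

before = peak_rss_mb()
data = [0.0] * 5000000  # stand-in for loading the large file
after = peak_rss_mb()
print("peak RSS before: %.1f MB, after: %.1f MB" % (before, after))
```

Printing this value right before the expensive call, and comparing it with the machine's free memory, gives a rough idea of whether the load could plausibly exhaust RAM. It cannot report anything after the process has been killed.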
Following the answer to this similar stackoverflow question, I tried running this code
import gtk.gdk
w = gtk.gdk.get_default_root_window()
sz = w.get_size()
print "The size of the window is %d x %d" % sz
pb = gtk.gdk.Pixbuf(gtk.gdk.COLORSPACE_RGB,False,8,sz[0],sz[1])
pb = pb.get_from_drawable(w,w.get_colormap(),0,0,0,0,sz[0],sz[1])
if pb is not None:
    pb.save("screenshot.png","png")
    print "Screenshot saved to screenshot.png."
else:
    print "Unable to get the screenshot."
However, the resulting image is completely black. My environment is Linux Mint 16 running in VirtualBox on a Mac. Is the black image the result of running in a Linux VM, or is there another reason?
I'm pretty sure the VM is the issue here. At least it was for me.
There are several steps to figure out which part is misbehaving, but most of them deal with the display on VMs having issues with colormaps. So first use:
gtk.gdk.colormap_get_system()
to get the colormap, and use it in place of the colormap in the original code. See if there's a change.
If that is a dead end, I would suggest the following:
TURN OFF YOUR VIDEO ACCELERATION <==== the huge majority of issues are fixed here
Roll back, then update your video/graphics/3D drivers <== this is almost never the problem
Be sure you've got the newest release of pyscreenshot and retry the screenshot
Let me know if it still isn't working and I'll send you the full step-by-step (it's quite long and jumps around a lot, but it covers just about everything with this issue).
I'm using App Engine with Python. In order to store the images of my users, I write them directly to the blobstore as indicated in Google documentation.
My code is below:
# Image insertion in the blobstore
file_name = files.blobstore.create(mime_type='image/jpeg')
with files.open(file_name, 'a') as f:
    f.write(self.imageContent)
files.finalize(file_name)
self.blobKey = files.blobstore.get_blob_key(file_name)
logging.info("Blobkey: "+str(self.blobKey))
The problem is erratic. I haven't changed anything, and since yesterday it sometimes works and sometimes doesn't. Why? Since I print the blob key (last line of my code), I can see whether the image has been saved into the blobstore or not.
When it works, I have the following line displayed:
Blobkey: AMIfv94p1cFdqkZa3AhZUF2Tf76szVEwpGgwOpN...
When it doesn't work, I have this in my logs:
Blobkey: None
Last detail: images (self.imageContent) are preprocessed and converted into .JPEG before each write.
EDIT:
Every time, the images are stored in the blobstore (I can see them in the blob viewer in the Administration console). So it's the get_blob_key function that is malfunctioning...
What should I do in such a situation? Am I doing something wrong that makes App Engine behave erratically? How can I solve this?
I finally managed to solve this problem by making the thread sleep in 50 ms intervals until the blob key is available.
This is the code I added:
# Sometimes blobKey is None
self.blobKey = files.blobstore.get_blob_key(file_name)
# We have to make it wait until it works!
for i in range(1,3):
    if self.blobKey:
        break
    else:
        logging.info("blobKey is still None")
        time.sleep(0.05)
        self.blobKey = files.blobstore.get_blob_key(file_name)
logging.info("Blobkey: "+str(self.blobKey))
Of course, you have to import the time module to make it work.
import time
I pretty much did the same as the person in the Issue 4872 that systempuntoout mentioned.
Thanks. Please feel free to add any suggestion.
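The sleep-and-retry workaround in the blobstore answer above can be factored into a small generic helper. This is only a sketch of the pattern: poll_until and fake_get_blob_key are hypothetical names, and the fake function stands in for files.blobstore.get_blob_key so the example runs outside App Engine.

```python
import time

def poll_until(fetch, attempts=3, delay=0.05):
    """Call fetch() until it returns a truthy value or attempts run out."""
    result = fetch()
    for _ in range(attempts - 1):
        if result:
            break
        time.sleep(delay)
        result = fetch()
    return result

# Stand-in for get_blob_key: returns None twice, then a key.
calls = []
def fake_get_blob_key():
    calls.append(1)
    return "some-blob-key" if len(calls) >= 3 else None

key = poll_until(fake_get_blob_key, attempts=5, delay=0.05)
print(key)
```

Bounding the number of attempts keeps a request from hanging forever if the key never appears; the caller still has to handle a None result.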
In my efforts to resolve Python issue 1578269, I've been working on trying to resolve the target of a symlink in a robust way. I started by using GetFinalPathNameByHandle as recommended here on stackoverflow and by Microsoft, but it turns out that technique fails when the target is in use (such as with pagefile.sys).
So, I've written a new routine to accomplish this using CreateFile and DeviceIoControl (as it appears this is what Explorer does). The relevant code from jaraco.windows.filesystem is included below.
The question is, is there a better technique for reliably resolving symlinks in Windows? Can you identify any issues with this implementation?
def relpath(path, start=os.path.curdir):
    """
    Like os.path.relpath, but actually honors the start path
    if supplied. See http://bugs.python.org/issue7195
    """
    return os.path.normpath(os.path.join(start, path))

def trace_symlink_target(link):
    """
    Given a file that is known to be a symlink, trace it to its ultimate
    target.
    Raises TargetNotPresent when the target cannot be determined.
    Raises ValueError when the specified link is not a symlink.
    """
    if not is_symlink(link):
        raise ValueError("link must point to a symlink on the system")
    while is_symlink(link):
        orig = os.path.dirname(link)
        link = _trace_symlink_immediate_target(link)
        link = relpath(link, orig)
    return link

def _trace_symlink_immediate_target(link):
    handle = CreateFile(
        link,
        0,
        FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
        None,
        OPEN_EXISTING,
        FILE_FLAG_OPEN_REPARSE_POINT|FILE_FLAG_BACKUP_SEMANTICS,
        None,
    )
    res = DeviceIoControl(handle, FSCTL_GET_REPARSE_POINT, None, 10240)
    bytes = create_string_buffer(res)
    p_rdb = cast(bytes, POINTER(REPARSE_DATA_BUFFER))
    rdb = p_rdb.contents
    if not rdb.tag == IO_REPARSE_TAG_SYMLINK:
        raise RuntimeError("Expected IO_REPARSE_TAG_SYMLINK, but got %d" % rdb.tag)
    return rdb.get_print_name()
Unfortunately I can't test with Vista until next week, but GetFinalPathNameByHandle should work, even for files in use - what's the problem you noticed?
In your code above, you forgot to close the file handle.
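For comparison, the trace-to-ultimate-target loop from the question can be sketched with only os.readlink and os.path.islink from the standard library. This is an illustration of the loop structure, not a replacement for the CreateFile/DeviceIoControl approach, and it does not address the in-use-file problem on Windows. The max_hops guard against circular links is an addition that is not in the original code.

```python
import os

def trace_symlink_target(link, max_hops=40):
    """Follow a chain of symlinks and return the final, non-link path."""
    if not os.path.islink(link):
        raise ValueError("link must point to a symlink on the system")
    hops = 0
    while os.path.islink(link):
        if hops >= max_hops:
            raise RuntimeError("too many levels of symbolic links")
        # A relative target is relative to the directory holding the link;
        # os.path.join returns the target unchanged if it is absolute.
        parent = os.path.dirname(link)
        link = os.path.normpath(os.path.join(parent, os.readlink(link)))
        hops += 1
    return link
```

Because os.readlink opens no handle that the caller has to manage, the leaked-handle issue raised in the answer does not arise in this variant.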