January 30, 2010

The strangest crash I've seen in a while

Last week, I wrote about a new logging infrastructure I built to track down a bug. In the original post, I warned that logging wasn't a panacea - that it could cause more trouble than it solved. I was more right than I knew, as I added a bit of logging which led inexorably to a crash by JS_ASSERT failure. (For those who don't know what a JS_ASSERT is, it's an assertion inside the JavaScript Engine that all is well. Since JavaScript is one of the primary languages of the World Wide Web, these assertions are really important. If JavaScript is misbehaving, it can't be trusted. Fail one such assertion, and the program aborts, forcing the developer to take notice.)

At first, the crash didn't make a damned bit of sense. I prefer to write JavaScript, not just because I entered into software engineering (and Mozilla development) through JavaScript, but also because it's super-stable. In fact, the only new code I was introducing was in JavaScript. It shouldn't crash. Through an unlikely chain of events, it does.

This article details how I was able to analyze the crash itself, how logging set up the conditions for the crash, and how other logging ultimately proved that the crash resulted from a logging call. I've filed this as bug 543304. Ultimately, reducing this test case to a usable bug report will not be easy - but worth it. More details in the extended entry. (Warning: technobabble content ratio is high. If you do not want to see how this developer thinks and debugs Mozilla code, do not read on.)

Where this all began

First, I'll show you the code I wrote, and what I was trying to catch:

const serial = C_c["@mozilla.org/xmlextras/xmlserializer;1"]
                  .createInstance(C_i.nsIDOMSerializer);
function serializeNode(aNode)
{
  var node = aNode.cloneNode(true);
  return serial.serializeToString(node, "application/xml");
}

/* Much later... */

    // xeIMarkupTemplate
    showTemplate: function showTemplate(aNode)
    {
      dump("Entering showTemplate\n\n");
      // ...
      var panel = this.getUserInterfacePanel();

      getLoggingService().log("Templates", C_i.xeILoggingService.INFO,
                              "Panel retrieved", serializeNode(panel));

      // This should force <markup:children/> elements to take XBL bindings.
      aNode.ownerDocument.documentElement.appendChild(panel);
      // ...
    },

Basically, I built a <xul:panel/> element, and I wanted to see what its contents were. The panel had as its only child one of my <markup:ui-section/> elements. Then I would append the panel to the document, and go on with other tasks. The logging statement was so I could see what the panel's contents were.

That was the plan. Unfortunately, it didn't frakking happen.

Stack trace (read from the bottom upwards)

Notes Stack line
*crunch* KernelBase.dll!_DebugBreak@0() + 0x2 bytes
Wait, you're in garbage collection? js3250.dll!JS_Assert(const char * s=0x6f3e6478, const char * file=0x6f3e6444, int ln=110) Line 65C++
js3250.dll!js_FillPropertyCache(JSContext * cx=0x04120970, JSObject * obj=0x05143120, unsigned int scopeIndex=0, unsigned int protoIndex=0, JSObject * pobj=0x05143120, JSScopeProperty * sprop=0x06149a58, int adding=0) Line 110 + 0x24 bytesC++
js3250.dll!js_GetPropertyHelper(JSContext * cx=0x04120970, JSObject * obj=0x05143120, int id=59513756, int cacheResult=1, int * vp=0x002fa900) Line 4274 + 0x1d bytesC++
js3250.dll!js_Interpret(JSContext * cx=0x04120970) Line 1520 + 0x23 bytesC++
js3250.dll!js_Invoke(JSContext * cx=0x04120970, unsigned int argc=1, int * vp=0x0613f3c8, unsigned int flags=0) Line 1368 + 0x9 bytesC++
Execute JS for the XTF element xpc3250.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS * wrapper=0x060cdc28, unsigned short methodIndex=9, const XPTMethodDescriptor * info=0x04421ce0, nsXPTCMiniVariant * nativeParams=0x002fadc8) Line 1696 + 0x1b bytesC++
xpc3250.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex=9, const XPTMethodDescriptor * info=0x04421ce0, nsXPTCMiniVariant * params=0x002fadc8) Line 571C++
xpcom_core.dll!PrepareAndDispatch(nsXPTCStubBase * self=0x081f29e0, unsigned int methodIndex=9, unsigned int * args=0x002fae88, unsigned int * stackBytesToPop=0x002fae78) Line 114 + 0x21 bytesC++
xpcom_core.dll!SharedStub() Line 142C++
XTF element is an orphan... gklayout.dll!nsXTFElementWrapper::UnbindFromTree(int aDeep=0, int aNullParent=1) Line 231C++
gklayout.dll!nsAttrAndChildArray::Clear() Line 664C++
gklayout.dll!nsAttrAndChildArray::~nsAttrAndChildArray() Line 135C++
gklayout.dll!nsGenericElement::~nsGenericElement() Line 1792 + 0xb bytesC++
gklayout.dll!nsXULElement::~nsXULElement() + 0x1a bytesC++
gklayout.dll!nsXULElement::`scalar deleting destructor'() + 0xf bytesC++
No more XUL element! gklayout.dll!nsNodeUtils::LastRelease(nsINode * aNode=0x06701b98) Line 272 + 0x21 bytesC++
gklayout.dll!nsGenericElement::Release() Line 4155 + 0xc8 bytesC++
XUL element refcount-- gklayout.dll!nsXULElement::Release() Line 369 + 0xd bytesC++
xpc3250.dll!XPC_SWN_Finalize(JSContext * cx=0x04120970, JSObject * obj=0x05143240) Line 1535 + 0xe bytesC++
js3250.dll!FinalizeObject(JSContext * cx=0x04120970, JSObject * obj=0x05143240) Line 3190 + 0x10 bytesC++
js3250.dll!js_GC(JSContext * cx=0x04120970, JSGCInvocationKind gckind=GC_NORMAL) Line 3622 + 0xd bytesC++
JS Garbage Collection js3250.dll!JS_GC(JSContext * cx=0x04120970) Line 2439 + 0xb bytesC++
xpc3250.dll!nsXPConnect::Collect() Line 477 + 0xa bytesC++
xpcom_core.dll!nsCycleCollector::Collect(unsigned int aTryCollections=1) Line 2434 + 0x19 bytesC++
xpcom_core.dll!nsCycleCollector_collect() Line 3129 + 0x16 bytesC++
gklayout.dll!nsJSContext::CC() Line 3578 + 0x6 bytesC++
Cycle collection starts gklayout.dll!nsJSContext::IntervalCC() Line 3667C++
gklayout.dll!nsUserActivityObserver::Observe(nsISupports * aSubject=0x00000000, const char * aTopic=0x6e50fcb0, const wchar_t * aData=0x00000000) Line 269C++
xpcom_core.dll!nsObserverList::NotifyObservers(nsISupports * aSubject=0x00000000, const char * aTopic=0x6e50fcb0, const wchar_t * someData=0x00000000) Line 131C++
xpcom_core.dll!nsObserverService::NotifyObservers(nsISupports * aSubject=0x00000000, const char * aTopic=0x6e50fcb0, const wchar_t * someData=0x00000000) Line 185C++
gklayout.dll!nsUITimerCallback::Notify(nsITimer * aTimer=0x02f34d80) Line 277C++
xpcom_core.dll!nsTimerImpl::Fire() Line 431C++
xpcom_core.dll!nsTimerEvent::Run() Line 521C++
xpcom_core.dll!nsThread::ProcessNextEvent(int mayWait=1, int * result=0x002ff17c) Line 527 + 0x19 bytesC++
xpcom_core.dll!NS_ProcessNextEvent_P(nsIThread * thread=0x008d4e88, int mayWait=1) Line 250 + 0x16 bytesC++
gkwidget.dll!nsBaseAppShell::Run() Line 170 + 0xc bytesC++
tkitcmps.dll!nsAppStartup::Run() Line 182 + 0x1c bytesC++
xul.dll!XRE_main(int argc=5, char * * argv=0x00a6c890, const nsXREAppData * aAppData=0x008d0118) Line 3505 + 0x25 bytesC++
firefox.exe!NS_internal_main(int argc=5, char * * argv=0x00a6c890) Line 158 + 0x12 bytesC++
firefox.exe!wmain(int argc=5, wchar_t * * argv=0x00a6f590) Line 120 + 0xd bytesC++
firefox.exe!__tmainCRTStartup() Line 583 + 0x19 bytesC
firefox.exe!wmainCRTStartup() Line 403C
kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes
ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes
ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes

I had added a bunch of logging at the same time, and I couldn't figure out what was causing it. I kept looking at my XTF element, because I think bugs are most likely my fault. Several days later, I looked a few levels deeper and spotted the nsXULElement::~nsXULElement line... which was my first clue.

That line indicates a destructor, which JavaScript authors rarely have to think about. For C++ authors, this is routine: you have a function that potentially cleans up after anything you left hanging around in memory as your object goes away. But why would the destructor be executing? That means nobody's holding any references to it: the "reference count" (which most XPCOM objects depend on) was zero, indicating no one needed the object. Hence why the "cycle collection" and garbage collection was destroying it.

Now, having the debugger told me the XTF element was one of my <markup:ui-section/> elements. In my current implementation of the markup language, these elements can be children of only two elements: a <markup:template/> element, and a <xul:panel/> element. From the stack, it obviously was a XUL element, so it had to be a panel.

Except as far as I could remember, the XUL panel was always appended to the document's root element. Always. (I'd forgotten about the cloneNode part above.) After all, in the showTemplate method which retrieved the panel, the next thing it did was append that panel to the root element. So somebody - namely, the parent element of the panel - should've held a strong reference on it. The panel element ultimately should have had a reference count of at least one. It should not have been garbage-collected.

Now, this is an equally bizarre thought. Just as JavaScript shouldn't crash, XUL elements (in fact, pretty much anything in the DOM) should have its reference counting accurate. You might have extra reference counts (that's one reason the cycle collector is around), but you never find yourself short one. The destructor call, though, was telling me that's exactly what happened.

The depths of reference counting

Well, at this point, I didn't see I had much of a choice: I had to track down that one element and see who was "addref'ing" it (holding strong references) and "releasing" it (letting those strong references go). There's two difficulties with this choice: there are a lot of XUL elements, far more than just panel elements, and I had to write out stack traces.

First, I tried using my logging service (implemented in JavaScript) to capture reference counts.That didn't work - there was so much static from additional addrefs/releases for XPConnect that it started getting crazy - and then crashing on another garbage-collection assertion while calling my logger. I gave up and invented another new interface:

[uuid(...)]
interface xeIRefCntLogger : nsISupports
{
  void logRefCnt(in nsISupports component,
                 in PRUint32 count,
                 in short offset);
};

This had to be C++ to reduce the noise. Then I tried another new interface, this one for XUL elements:

[scriptable, uuid(...)]
interface xeILogRefCnt : nsISupports
{
  void LogAllRefCnts(in PRBool shouldLog);
};

I figured I'd call that on the XUL panel I knew about. As for the actual logging, I had to do some dirty hacks. (Don't try this at home.) In nsXULElement, it ended up looking like this:

diff --git a/content/xul/content/src/nsXULElement.cpp b/content/xul/content/src/nsXULElement.cpp
--- a/content/xul/content/src/nsXULElement.cpp
+++ b/content/xul/content/src/nsXULElement.cpp
@@ -234,16 +234,17 @@ NS_INTERFACE_MAP_BEGIN_CYCLE_COLLECTION(
 NS_INTERFACE_MAP_END_AGGREGATED(mElement)
 
 //----------------------------------------------------------------------
 // nsXULElement
 //
 
 nsXULElement::nsXULElement(nsINodeInfo* aNodeInfo)
     : nsGenericElement(aNodeInfo),
+      mLogAllRefCnts(PR_FALSE),
       mBindingParent(nsnull)
 {
     XUL_PROTOTYPE_ATTRIBUTE_METER(gNumElements);
 }
 
 nsXULElement::nsXULSlots::nsXULSlots(PtrBits aFlags)
     : nsXULElement::nsDOMSlots(aFlags)
 {
@@ -258,16 +259,24 @@ nsXULElement::nsXULSlots::~nsXULSlots()
 }
 
 nsINode::nsSlots*
 nsXULElement::CreateSlots()
 {
     return new nsXULSlots(mFlagsOrSlots);
 }
 
+NS_IMETHODIMP
+nsXULElement::LogAllRefCnts(PRBool shouldLog)
+{
+  RefCntToLoggingService(mRefCnt.get(), 0);
+  mLogAllRefCnts = shouldLog;
+  return NS_OK;
+}
+
 /* static */
 already_AddRefed<nsXULElement>
 nsXULElement::Create(nsXULPrototypeElement* aPrototype, nsINodeInfo *aNodeInfo,
                      PRBool aIsScriptable)
 {
     nsXULElement *element = new nsXULElement(aNodeInfo);
     if (element) {
         NS_ADDREF(element);
@@ -360,24 +369,50 @@ NS_NewXULElement(nsIContent** aResult, n
 
 NS_IMPL_CYCLE_COLLECTION_CLASS(nsXULElement)
 NS_IMPL_CYCLE_COLLECTION_TRAVERSE_BEGIN_INHERITED(nsXULElement,
                                                   nsGenericElement)
     NS_IMPL_CYCLE_COLLECTION_TRAVERSE_NATIVE_MEMBER(mPrototype,
                                                     nsXULPrototypeElement)
 NS_IMPL_CYCLE_COLLECTION_TRAVERSE_END
 
-NS_IMPL_ADDREF_INHERITED(nsXULElement, nsGenericElement)
-NS_IMPL_RELEASE_INHERITED(nsXULElement, nsGenericElement)
+NS_IMETHODIMP_(nsrefcnt) nsXULElement::AddRef(void)
+{
+  nsrefcnt r = nsGenericElement::AddRef();
+  //NS_LOG_ADDREF(this, r, nsXULElement, sizeof(*this));
+  RefCntToLoggingService(r, +1);
+  return r;
+}
+NS_IMETHODIMP_(nsrefcnt) nsXULElement::Release(void)
+{
+  RefCntToLoggingService(mRefCnt.get() - 1, -1);
+  nsrefcnt r = nsGenericElement::Release();
+  //NS_LOG_RELEASE(this, r, nsXULElement);
+  return r;
+}
+
+void
+nsXULElement::RefCntToLoggingService(PRUint32 count, PRInt16 offset)
+{
+  if (!mLogAllRefCnts)
+    return;
+
+  nsCOMPtr<xeIRefCntLogger> logger
+    = do_GetService(XE_REFCNTLOGGER_CONTRACTID);
+  if (!logger)
+    return;
+  logger->LogRefCnt((nsIContent*) this, mRefCnt.get(), offset);
+};
 
 NS_INTERFACE_TABLE_HEAD_CYCLE_COLLECTION_INHERITED(nsXULElement)
     NS_NODE_OFFSET_AND_INTERFACE_TABLE_BEGIN(nsXULElement)
         NS_INTERFACE_TABLE_ENTRY(nsXULElement, nsIDOMNode)
         NS_INTERFACE_TABLE_ENTRY(nsXULElement, nsIDOMElement)
         NS_INTERFACE_TABLE_ENTRY(nsXULElement, nsIDOMXULElement)
+        NS_INTERFACE_TABLE_ENTRY(nsXULElement, xeILogRefCnt)
     NS_OFFSET_AND_INTERFACE_TABLE_END
     NS_ELEMENT_INTERFACE_TABLE_TO_MAP_SEGUE
     NS_INTERFACE_MAP_ENTRY_TEAROFF(nsIScriptEventHandlerOwner,
                                    new nsScriptEventHandlerOwnerTearoff(this))
     NS_INTERFACE_MAP_ENTRY_TEAROFF(nsIDOMElementCSSInlineStyle,
                                    new nsXULElementTearoff(this))
     NS_INTERFACE_MAP_ENTRY_TEAROFF(nsIFrameLoaderOwner,
                                    new nsXULElementTearoff(this))
diff --git a/content/xul/content/src/nsXULElement.h b/content/xul/content/src/nsXULElement.h
--- a/content/xul/content/src/nsXULElement.h
+++ b/content/xul/content/src/nsXULElement.h
@@ -73,16 +73,18 @@
 #include "nsLayoutCID.h"
 #include "nsAttrAndChildArray.h"
 #include "nsGkAtoms.h"
 #include "nsAutoPtr.h"
 #include "nsGenericElement.h"
 #include "nsDOMScriptObjectHolder.h"
 #include "nsIFrameLoader.h"
 
+#include "xeILoggingService.h"
+
 class nsIDocument;
 class nsString;
 class nsIDocShell;
 class nsICSSStyleRule;
 
 class nsIObjectInputStream;
 class nsIObjectOutputStream;
 class nsIScriptGlobalObjectOwner;
@@ -464,17 +466,18 @@ public:
   The XUL element.
 
  */
 
 #define XUL_ELEMENT_TEMPLATE_GENERATED 1 << NODE_TYPE_SPECIFIC_BITS_OFFSET
 
 class nsScriptEventHandlerOwnerTearoff;
 
-class nsXULElement : public nsGenericElement, public nsIDOMXULElement
+class nsXULElement : public nsGenericElement, public nsIDOMXULElement,
+                     public xeILogRefCnt
 {
 public:
 
     /** Typesafe, non-refcounting cast from nsIContent.  Cheaper than QI. **/
     static nsXULElement* FromContent(nsIContent *aContent)
     {
         if (aContent->IsNodeOfType(eXUL))
             return static_cast<nsXULElement*>(aContent);
@@ -571,16 +574,24 @@ public:
     NS_FORWARD_NSIDOMNODE(nsGenericElement::)
 
     // nsIDOMElement
     NS_FORWARD_NSIDOMELEMENT(nsGenericElement::)
 
     // nsIDOMXULElement
     NS_DECL_NSIDOMXULELEMENT
 
+    // xeILogRefCnt
+    NS_DECL_XEILOGREFCNT
+private:
+  PRBool mLogAllRefCnts;
+  void RefCntToLoggingService(PRUint32 count, PRInt16 offset);
+
+public:
+
     virtual nsresult Clone(nsINodeInfo *aNodeInfo, nsINode **aResult) const;
     virtual PRInt32 IntrinsicState() const;
 
     nsresult GetStyle(nsIDOMCSSStyleDeclaration** aStyle);
 
     nsresult GetFrameLoader(nsIFrameLoader** aFrameLoader);
     nsresult SwapFrameLoaders(nsIFrameLoaderOwner* aOtherOwner);

I only recommend replacing NS_IMPL_ADDREF and NS_IMPL_RELEASE if you really, really have to. As for the logging code itself,

#ifndef __xeRefCountLogger_h__
#define __xeRefCountLogger_h__

#include "nsCOMPtr.h"
#include "nsAutoPtr.h"
#include "nsIOutputStream.h"
#include "nsIStreamListener.h"
#include "nsIException.h"
#include "xeILoggingService.h"

class xeRefCountLogger : public xeIRefCntLogger
{
public:
  NS_DECL_ISUPPORTS
  NS_DECL_XEIREFCNTLOGGER

  xeRefCountLogger();
  ~xeRefCountLogger();

  nsresult Init();

  void LogStackFrame(void *aPC);

private:
  nsCOMPtr<nsIOutputStream> mLogStream;
  PRPackedBool mIsInitialized;

  nsCOMPtr<nsIStackFrame> mCurrentJSFrame;

protected:
  /* additional members */
};

/* 8383d82e-4046-444a-aa73-5b420fb1a6f2 */
#define XE_REFCNTLOGGER_CID \
{ \
    0x8383d82e, \
    0x4046, \
    0x444a, \
    { 0xaa, 0x73, 0x5b, 0x42, 0x0f, 0xb1, 0xa6, 0xf2 } \
}

#endif // __xeRefCountLogger_h__

(C++ here)

#include "xeRefCountLogger.h"
#include "nsIFileStreams.h"
#include "nsServiceManagerUtils.h"
#include "nsComponentManagerUtils.h"
#include "nsIProperties.h"
#include "nsILocalFile.h"
#include "nsIStreamListener.h"
#include "nsIInputStreamPump.h"
#include "nsStringGlue.h"
#include "nsStackWalk.h"

NS_IMPL_ISUPPORTS1(xeRefCountLogger, xeIRefCntLogger)

xeRefCountLogger::xeRefCountLogger() :
  mIsInitialized(PR_FALSE)
{
  /* member initializers and constructor code */
}

xeRefCountLogger::~xeRefCountLogger()
{
  /* destructor code */
}

nsresult
xeRefCountLogger::Init()
{
  if (mIsInitialized)
    return NS_OK;

  nsresult rv;
  nsCOMPtr<nsIProperties> propertiesService
    = do_GetService("@mozilla.org/file/directory_service;1", &rv);
  NS_ENSURE_SUCCESS(rv, rv);

  nsCOMPtr<nsILocalFile> profDFile;
  rv = propertiesService->Get("ProfD", NS_GET_IID(nsILocalFile),
                              getter_AddRefs(profDFile));
  NS_ENSURE_SUCCESS(rv, rv);

  nsCOMPtr<nsIFile> refCntLogFile;
  rv = profDFile->Clone(getter_AddRefs(refCntLogFile));
  NS_ENSURE_SUCCESS(rv, rv);

  rv = refCntLogFile->Append(NS_LITERAL_STRING("refcnt.log"));
  NS_ENSURE_SUCCESS(rv, rv);

  rv = refCntLogFile->CreateUnique(nsIFile::NORMAL_FILE_TYPE, 777);
  NS_ENSURE_SUCCESS(rv, rv);

  nsCOMPtr<nsIFileOutputStream> fileOut
    = do_CreateInstance("@mozilla.org/network/file-output-stream;1", &rv);
  NS_ENSURE_SUCCESS(rv, rv);

  rv = fileOut->Init(refCntLogFile, -1, -1, 0);
  NS_ENSURE_SUCCESS(rv, rv);

  mLogStream = do_QueryInterface(fileOut, &rv);
  NS_ENSURE_SUCCESS(rv, rv);

  mIsInitialized = PR_TRUE;
  return NS_OK;
}

extern "C" {
static void PrintStackFrame(void *aPC, void *aClosure)
{
  xeRefCountLogger* logger = (xeRefCountLogger*) aClosure;
  logger->LogStackFrame(aPC);
}

}

void
xeRefCountLogger::LogStackFrame(void *aPC)
{
  nsCodeAddressDetails details;
  char buf[1024];

  NS_DescribeCodeAddress(aPC, &details);
  NS_FormatCodeAddressDetails(aPC, &details, buf, sizeof(buf));

  nsCString data(buf);
  PRUint32 writeCount;
  nsresult rv = mLogStream->Write(data.get(), data.Length(), &writeCount);
  NS_WARN_IF_FALSE(NS_SUCCEEDED(rv), "Oops");
}


/* void logRefCnt (in nsISupports component, in PRUint32 count); */
NS_IMETHODIMP
xeRefCountLogger::LogRefCnt(nsISupports *component,
                            PRUint32 count,
                            PRInt16 offset)
{
  nsresult rv = Init();
  if (NS_FAILED(rv))
    return rv;
  
  nsCString msg;
  msg.Truncate();
  msg.AppendInt(count);
  msg.Append(" (");
  if (offset == 0)
    msg.Append("---");
  else if (offset == +1)
    msg.Append("Add");
  else if (offset == -1)
    msg.Append("Rel");
  else
    return NS_ERROR_ILLEGAL_VALUE;

  msg.Append(") ");

  PRUint32 addr = (unsigned int) &(*component);
  msg.AppendInt(addr, 16);

  msg.Append('\n');

  // C++ Stack
  PRUint32 writeCount;
  rv = mLogStream->Write(msg.get(), msg.Length(), &writeCount);
  NS_ENSURE_SUCCESS(rv, rv);

  NS_StackWalk(PrintStackFrame, 2, this);

  rv = mLogStream->Write("\n\n", 2, &writeCount);
  NS_ENSURE_SUCCESS(rv, rv);
  return mLogStream->Flush();
}

Here I found something which was actually pretty cool - the NS_StackWalk function. I'd seen other cases where Firefox code could generate a stack, and I said to myself, "I want that." It turned out to be a pretty good idea for figuring out why something is adding or releasing references. The stacks would (on Windows) look like this:

2 (Add) 7bb0998
xpcom_core!NS_TableDrivenQI+0x0000000000000040 (c:\edit-template\fx-debug\xpcom\build\nsisupportsimpl.cpp, line 50)
gklayout!nsXULElement::QueryInterface+0x00000000000000AD (c:\edit-template\mozilla\content\xul\content\src\nsxulelement.cpp, line 410)
xpc3250!XPCConvert::JSObject2NativeInterface+0x0000000000000153 (c:\edit-template\mozilla\js\src\xpconnect\src\xpcconvert.cpp, line 1484)
xpc3250!XPCConvert::JSData2Native+0x000000000000104F (c:\edit-template\mozilla\js\src\xpconnect\src\xpcconvert.cpp, line 1017)
xpc3250!XPCWrappedNative::CallMethod+0x0000000000000C7E (c:\edit-template\mozilla\js\src\xpconnect\src\xpcwrappednative.cpp, line 2564)
xpc3250!XPC_WN_CallMethod+0x00000000000001C6 (c:\edit-template\mozilla\js\src\xpconnect\src\xpcwrappednativejsops.cpp, line 1740)
js3250!js_Invoke+0x000000000000087D (c:\edit-template\mozilla\js\src\jsinterp.cpp, line 1360)
 ... // and so on for another fifty or so lines.

By the way, if you're in a debug build and trying to match reference counts, don't forget this patch either:

diff --git a/xpcom/glue/nsCOMPtr.h b/xpcom/glue/nsCOMPtr.h
--- a/xpcom/glue/nsCOMPtr.h
+++ b/xpcom/glue/nsCOMPtr.h
@@ -87,17 +87,17 @@
     //  about (unused) inline functions being removed.  This is to be expected with
     //  templates, so we disable the warning.
   #pragma warning( disable: 4514 )
 #endif

 #define NSCAP_FEATURE_USE_BASE

 #ifdef NS_DEBUG
-  #define NSCAP_FEATURE_TEST_DONTQUERY_CASES
+//#define NSCAP_FEATURE_TEST_DONTQUERY_CASES
   #undef NSCAP_FEATURE_USE_BASE
 //#define NSCAP_FEATURE_TEST_NONNULL_QUERY_SUCCEEDS
 #endif

   /*
     |...TEST_DONTQUERY_CASES| and |...DEBUG_PTR_TYPES| introduce some code that is
     problematic on a select few of our platforms, e.g., QNX.  Therefore, I'm providing
     a mechanism by which these features can be explicitly disabled from the command-line.

As I found out the hard way, that DONTQUERY_CASES thing generates a new nsCOMPtr for every nsCOMPtr you are using in a QueryInterface call. That means an extra addref and release pair. Very, very noisy when you're trying to track down a possible extra release or missing addref. This one change cut my log file sizes in half.

So once I cleared out all that clutter, I found that the panel element I was following didn't look like the one being destroyed. At the end, it had a reference count of 12, which didn't make sense for something being deleted. So I tried something nastier, and added this code to both nsXULElement::AddRef and nsXULElement::Release:

+    if (mNodeInfo->Equals(nsGkAtoms::panel))
+    {
+        mLogAllRefCnts = PR_TRUE;
+    }
+

Here comes the horde. This time I did find the panel element with a refcount approaching zero... and several other panels. Fortunately, my debugger offered an easy way to tell which panel was dying: in its stack it had a pointer value for the dead panel. That same pointer value (the address) matched the address line for several of my addref/release stack traces. Also, in filtering for just that panel's memory address, I found the ref count going all the way down to 1, right before the crash. Yippie-kay-yay.

This is where the investigation began in earnest. I noticed the first appearance of the dead panel (while it was being born) had an interesting signature:

2 (Add) 7bb0998
gklayout!nsXULElement::Clone+0x0000000000000182 (c:\edit-template\mozilla\content\xul\content\src\nsxulelement.cpp, line 465)
gklayout!nsNodeUtils::CloneAndAdopt+0x00000000000002AF (c:\edit-template\mozilla\content\base\src\nsnodeutils.cpp, line 574)
gklayout!nsNodeUtils::Clone+0x0000000000000026 (c:\edit-template\mozilla\content\base\src\nsnodeutils.h, line 171)
gklayout!nsNodeUtils::CloneNodeImpl+0x000000000000004A (c:\edit-template\mozilla\content\base\src\nsnodeutils.cpp, line 450)
gklayout!nsGenericElement::CloneNode+0x0000000000000018 (c:\edit-template\mozilla\content\base\src\nsgenericelement.h, line 537)
gklayout!nsXULElement::CloneNode+0x0000000000000016 (c:\edit-template\mozilla\content\xul\content\src\nsxulelement.h, line 574)
xpc3250!nsIDOMNode_CloneNode+0x0000000000000198 (c:\edit-template\fx-debug\js\src\xpconnect\src\dom_quickstubs.cpp, line 3868)
js3250!js_Interpret+0x000000000000FE71 (c:\edit-template\mozilla\js\src\jsops.cpp, line 2208)
js3250!js_Invoke+0x00000000000008C2 (c:\edit-template\mozilla\js\src\jsinterp.cpp, line 1368)

The panel was being born of a call to node.cloneNode(). I don't know many XUL panels being cloned, so why not stick the flag for turning on refcount logging there instead?

@@ -407,16 +442,21 @@ nsXULElement::Clone(nsINodeInfo *aNodeIn
         element->SetScriptTypeID(GetScriptTypeID());
         }
     }
 
     if (!element) {
         return NS_ERROR_OUT_OF_MEMORY;
     }
 
+    if (aNodeInfo->Equals(nsGkAtoms::panel))
+    {
+        element->mLogAllRefCnts = PR_TRUE;
+    }
+
     // XXX TODO: set up RDF generic builder n' stuff if there is a
     // 'datasources' attribute? This is really kind of tricky,
     // because then we'd need to -selectively- copy children that
     // -weren't- generated from RDF. Ugh. Forget it.
 
     // Note that we're _not_ copying mControllers.
 
     nsresult rv = CopyInnerTo(element);
 

Bingo. That gave me only one panel in the refcount log. I wanted to see where this panel was being inserted in a document. Scanning through the log, I was surprised: it was never inserted into any document. It would be cloned, go through a DOM serializer, and then be caught in cycle collection. That's all that ever happened to it.

By now I had more than enough to place a breakpoint and see who was creating this dead panel in the first place. The JavaScript frames pointed right to my own serializeNode() function... at which point I remembered where and why I was serializing a panel. To diagnose another bug. Whoops.

So there you have it... how one piece of logging code blamed another logging fragment for crashing my Firefox builds.

But wait, THERE'S MORE!!!

At this point, I was pretty much convinced that I was dead, unable to make further progress on this project until someone could solve that JS_ASSERT trickiness. I decided as almost the first step in reducing this monster testcase to move the offending logging call after the appendChild call.

      // This should force <markup:children/> elements to take XBL bindings.
      aNode.ownerDocument.documentElement.appendChild(panel);

      getLoggingService().log("Templates", C_i.xeILoggingService.INFO,
                              "Panel retrieved", serializeNode(panel));

I don't remember why I did that, but this was where I got another big surprise. It stopped crashing. I swapped it back, and it crashed. I moved the logging call after the appendChild call again, and it didn't crash.

So let me try to get this straight, because it's still mind-boggling to me:

  • XTF + CloneNode + SerializeToString + XUL, followed by an appendChild, leads to garbage collection and a JS_ASSERT() failure...
  • AppendChild, followed by XTF + CloneNode + SerializeToString + XUL, leads to normal (well, buggy), not-crashing behavior.

What the hell is going on here? I really can't understand this, how just the order of operations in JavaScript is the difference between a death spiral, and a shiny day. I know this isn't a Heisenbug, since it's 100% reproducible in one case and 0% reproducible in the other... but this just goes so far outside my understanding of "How Things Work" that it blows my mind this is even possible.

That's three times over the course of a month trying to figure this sick, twisted little freak out, where something that makes absolutely no sense appeared to be real:

  1. A JS_ASSERT being triggered where I had written no native-level code, just JavaScript, XUL, XBL, CSS, and IDL.
  2. XUL reference counting leading to garbage collection when the item being collected looked like 12 consumers had a hold on it. (Fortunately, this turned out to be false.)
  3. The program does not crash if I insert the panel into the document before trying to serialize the panel.

All this in a pet project, found by one guy working alone in his apartment for years.

So now I have to take some 300KB of extension + Mochitest apart (again!), to a reasonably small reproducible test case. This I have to do before someone else will find the bug painless enough to actually look at and fix. I still don't know why it failed, and my confidence in my ability to understand - and in the Mozilla platform - have been badly shaken.

Worst of all, I fear that when I reduce the testcase and submit it, someone competent will decide that the cloned node - after its original has been appended to the document - should have been garbage-collected just as if the original hadn't been appended. That means I will universally crash again, and there will be nothing I can do to stop it. Except wait for that ever-moving target, XBL 2.0, to replace XTF.

For those of you boldly going where no one has gone before:

If you really want to build this test and see for yourself, here's my build instructions:

  1. hg clone http://hg.mozilla.org/releases/mozilla-1.9.2 mozilla
  2. hg clone http://hg.mozdev.org/verbosio verbosio
  3. cd verbosio
  4. hg update mozilla-bug534412
  5. cd ..
  6. Copy verbosio/src/experimental/edit-template to mozilla/extensions/edit-template
  7. In your .mozconfig, add ac_add_options --enable=extensions=default,edit-template
  8. Also add ac_add_options --enable-tests
  9. Build Mozilla Firefox.
  10. cd fx-debug
  11. python _tests/testing/mochitest/runtests.py --chrome --test-path=extensions/edit/tests/test_templates.xul
  12. When the page appears, click on the "Run Test" button, and buckle your seatbelt.

As I said, I have to try to reduce this to a manageable testcase. If you want to try debugging it now, go ahead. I've reserved this branch specifically for the crash bug I saw here.

Posted by WeirdAl at January 30, 2010 11:33 PM
Comments
Post a comment









Remember personal info?