A Tale of Two Leaks: An Incremental Leak

In part 1, we saw a memory leak that was in some sense static: The memory use was unintentional/undesired, but it was only allocated once, and didn’t get worse as the process continued. On the other hand, the second leak was a true leak in the classic sense: the memory use appeared to be unbounded.

Adam, the lead sustaining engineer at edX, and Ed, our Devops lead, were able to identify that the leak happened during our bulk-grading operations. During grading, we loop through every student in a single and then loop through every gradable XBlock to identify whether we’ve already scored that XBlock, and if not, we score the student on that block. Then we aggregate all of those grades based on the course’s grading policy.

Adam was able to narrow down the problem by creating a test case that graded a single student repeatedly. That test showed the same unbounded memory growth we observed in the overall process. Using objgraph, he was able to identify that each time the student was graded, a constant number of CombinedSystems were created and not released. This was suspicious, as those objects were intended to be ephemeral objects intended only to combine the attributes of DescriptorSystem and ModuleSystem into a single object.

Adam was also to dump the processes memory with meliae, after it had leaked memory, so we were able to dig more into the particulars of the CombinedSystems that were still held in memory.

Investigating the Memory Dump

Once we had a dump, I was able to begin investigating with memsee. My first attempt was to use the path command that pointed to the errant pointer in part 1. However, all of those attempts timed out before they found any path from the root to an CombinedSystem.

::> select * from obj where type = 'CombinedSystem';
#                 address type                 name   value    size    len mark       repr
-------- ---------------- -------------------- ------ ------ ------ ------ ---------- ----------
#1.0            270528592 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy
#1.1            256545552 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy
#1.2            239994960 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy
<snip>
#1.1021         210278800 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy
#1.1022         196178640 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy
#1.1023         179498384 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy
#1.1024         166197904 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy
#1.1025         149917968 CombinedSystem       ∘      ∘          64      ∘ ∘          CombinedSy

Picking an arbitrary CombinedSystem, we can find out what’s pointing to it.

::> parents 149917968
(151461744, u'dict', None, None, 3352, 45)
::> parents 151461744
(149920464, u'LmsModuleSystem', None, None, 64, None)

The first dict is just __dict__ from the the LmsModuleSystem. Given that the CombinedSystems are just supposed to be pointing to a DescriptorSystem and ModuleSystem, it’s suspicious that a CombinedSystem is being held in memory in turn by an LmsModuleSystem.

We can keep tracking the parents upwards:

::> parents 149920464
(166197904, u'CombinedSystem', None, None, 64, None)
(151176912, u'dict', None, None, 280, 1)
(150230496, u'dict', None, None, 1048, 9)
::> parents 166197904
(166753712, u'dict', None, None, 3352, 45)
::> parents 166753712
(166196048, u'LmsModuleSystem', None, None, 64, None)

It looks like we have a chain of LmsModuleSystem -> CombinedSystem -> LmsModuleSystem. (A side note about python memory management: CombinedSystem appears as a direct parent of LmsModuleSystem, with no intervening dict because CombinedSystem defines its attributes using __slots__. This is a good strategy for ephemeral objects, as it saves you from allocating additional dictionaries.)

Designing a Fix

Looking at the relevant edx-platform code, there’s only one place where CombinedSystems are constructed:

# edx-platform/common/lib/xmodule/xmodule/x_module.py

class XModuleMixin(XBlockMixin):
    """
    Fields and methods used by XModules internally.

    Adding this Mixin to an :class:`XBlock` allows it to cooperate with old-style :class:`XModules`
    """

    ...

    @property
    def runtime(self):
        return CombinedSystem(self.xmodule_runtime, self._runtime)

XModuleMixin is used added as a base class for all XBlocks used in edx-platform. So, whatever is capturing CombinedSystem in LmsModuleSystem, it’s coming from a call to .runtime.

The constructor for ModuleSystem (the base-class for LmsModuleSystem) takes a descriptor_runtime argument:

# edx-platform/common/lib/xmodule/xmodule/x_module.py

class ModuleSystem(MetricsMixin, ConfigurableFragmentWrapper, Runtime):
    def __init__(self, ..., descriptor_runtime, ...):
        ...
        self.descriptor_runtime = descriptor_runtime

This seems like a good candidate for something that would be storing a CombinedSystem. In fact, looking at module_render.py, which is the primary entry point in the LMS for working with XBlocks, we see that we’re passing descriptor.runtime in to that argument:

# edx-platform/lms/djangoapps/courseware/module_render.py

def get_module_system_for_user(...):
    system = LmsModuleSystem(
        ...
        descriptor_runtime=descriptor.runtime,
        ...
    )

We find the last piece of the puzzle by looking at what happens to the LmsModuleSystem once it’s constructed.

# edx-platform/lms/djangoapps/courseware/module_render.py

def get_module_for_descriptor_internal(user, descriptor, ...):
    (system, field_data) = get_module_system_for_user(
        user=user,
        descriptor=descriptor,
        ...
    )

    descriptor.bind_for_student(system, field_data)  # pylint: disable=protected-access

# edx-platform/common/lib/xmodule/xmodule/x_module.py

    def bind_for_student(self, xmodule_runtime, field_data):
        self.xmodule_runtime = xmodule_runtime

So, now consider what happens during grading, when we have the same XBlock being bound to different users (or re-bound to the same user) over the course of the grading session:

# Before grading step

        +----------+
        |descriptor|
        +----------+

# LmsModuleSystem(descriptor_runtime=descriptor.runtime)  ## get_module_system_for_user

        +----------+   +---------------+
        |descriptor|   |CombinedSystem |
        +----------+   +-^-------------+
                         |
                         |descriptor_runtime
                         |
                       +-+-------------+
                       |LmsModuleSystem|
                       +---------------+

# self.xmodule_runtime = xmodule_runtime   ## bind_for_student

        +----------+   +---------------+
        |descriptor|   |CombinedSystem |
        +------+---+   +-^-------------+
               |         |
xmodule_runtime|         |descriptor_runtime
               |         |
               |       +-+-------------+
               +------->LmsModuleSystem|
                       +---------------+

# LmsModuleSystem(descriptor_runtime=descriptor.runtime)  ## get_module_system_for_user                                                              

        +----------+   +---------------+
        |descriptor|   |CombinedSystem |
        +------+---+   +-^-------------+
               |         |
xmodule_runtime|         |descriptor_runtime
               |         |
               |       +-+-------------+
               +------->LmsModuleSystem|
                       +-^-------------+
                         |
                         |_runtime
                         |
                       +-+-------------+
                       |CombinedSystem |
                       +-^-------------+
                         |
                         |descriptor_runtime
                         |
                       +-+-------------+
                       |LmsModuleSystem|
                       +---------------+

# self.xmodule_runtime = xmodule_runtime   ## bind_for_student

        +----------+   +---------------+
        |descriptor|   |CombinedSystem |
        +------+---+   +-^-------------+
               |         |
xmodule_runtime|         |descriptor_runtime
               |         |
               |       +-+-------------+
               |       |LmsModuleSystem|
               |       +-^-------------+
               |         |
               |         |_runtime
               |         |
               |       +-+-------------+
               |       |CombinedSystem |
               |       +-^-------------+
               |         |
               |         |descriptor_runtime
               |         |
               |       +-+-------------+
               +------->LmsModuleSystem|
                       +---------------+

In the end, we’re building up a chain of LmsModuleSystems and CombinedSystems, and never releasing them.

The initial fix to this was to extract actual DescriptorSystem from the CombinedSystem, and passing that to the LmsModuleSystem. That ensures that the references to the previous LmsModuleSystem is released when xmodule_runtime is re-bound. The code to make that change is in this pr.

A more robust fix would be for ModuleSystem to expect a pointer to a descriptor, rather than the descriptor runtime, so that it can extract the DescriptorSystem from the CombinedSystem itself.