
Conversation

@honzaflash

  • apply lru_cache manually in the constructor instead of as a decorator
    • pretty transparent/minimal change
    • the methods should work just like they did, since the `@` decorator syntax is just syntactic sugar for what the change does (except here it happens inside the constructor, per instance, and `self` is already bound to the method)
  • this way we avoid holding a self reference in the cache on the class level (which caused a memory leak previously)

Fixes #1983

- apply `lru_cache` manually in the constructor instead of as a decorator
- this way we avoid holding a `self` reference in the cache on the class level
# apply `lru_cache` decorator manually per instance
# This way `self` reference is not held on class level
self.listdir = lru_cache()(self.listdir)
self._key_to_record = lru_cache(maxsize=4096)(self._key_to_record)
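To illustrate the lifetime difference the PR description talks about, here is a minimal sketch (hypothetical class names, not the actual fsspec code) comparing a class-level `@lru_cache` with the per-instance wrapper, observed through `weakref`:

```python
import gc
import weakref
from functools import lru_cache

class Leaky:
    @lru_cache()            # cache stored on the class; cache keys include self
    def listdir(self):
        return ["a"]

class Fixed:
    def __init__(self):
        # per-instance wrapper, as in the change above: the cache
        # lives on the instance and dies with it
        self.listdir = lru_cache()(self.listdir)

    def listdir(self):
        return ["a"]

leaky, fixed = Leaky(), Fixed()
leaky.listdir()
fixed.listdir()
r1, r2 = weakref.ref(leaky), weakref.ref(fixed)
del leaky, fixed
gc.collect()                # break the instance -> wrapper -> bound-method cycle
print(r1() is not None)     # True: the class-level cache still holds the instance
print(r2() is None)         # True: the instance and its cache were collected
```

Note that the per-instance wrapper creates a reference cycle (instance → wrapper → bound method → instance), so the instance is reclaimed by the cycle collector rather than by plain refcounting.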
Contributor

This, unfortunately, is not correct.

Author

Could you elaborate, please? What is incorrect?

Contributor

from functools import lru_cache


class A:
    def __init__(self, value):
        self.volatile_value = value
        self.calc = lru_cache()(self.calc)

    def calc(self):
        return self.volatile_value

    def calc2(self):
        @lru_cache
        def _calc():
            return self.volatile_value

        return _calc()


instance = A('a')
print(instance.calc())
print(instance.calc2())
instance.volatile_value = 'b'
print(instance.calc())
print(instance.calc2())

This prints a a a b; the expected output would be a a b b.
The cache set up by self.calc = lru_cache()(self.calc) does not account for changes to the relevant instance attributes, so it would only work if all of them were immutable. We cannot guarantee that in Python, and it may lead to mysterious bugs that are hard to debug.

Contributor

Besides, could you please also remove this line so that ruff can pick up issues like this.

Author

:O Oh no! You're totally right, self and its attributes are not considered for the cache key in this solution.

Thank you for catching that!

Author

@honzaflash honzaflash Feb 3, 2026

So really the only viable way is to cache a standalone free function that resides at the module level.

Options that I see:

  1. cached module level function that is then called from the method
  2. cached staticmethod/classmethod that is then called from the actual method
  3. a cached function defined in the constructor or setup just like open_refs
  4. assume instances are immutable and accept current fix

Member

No, the function/method is not static - it depends on the specific instance, but those details are not going to change. So, it would need to be func(instance, args), which is exactly what we have.

Author

@honzaflash honzaflash Feb 3, 2026

Sorry, see the earlier comment above with a snippet using a @staticmethod and then a regular method that just wraps it, passing in the necessary parameters from self. That is what I meant by a static method.
It's not that different from a module level function - really only a matter of namespace I guess.

Author

Also to add to the "immutable assumption", the class does inherit from collections.abc.MutableMapping - could this be changed to just collections.abc.Mapping (does not have __setitem__)?

Member

No. It is a mutable mapping, you can write to keys. These two functions do not depend on those keys, though.

@TLCFEM
Contributor

TLCFEM commented Feb 3, 2026

Furthermore, even the second approach does not work.

from functools import lru_cache


class A:
    def __init__(self, value):
        self.volatile_value = value
        self.calc = lru_cache()(self.calc)

    def calc(self):
        return self.volatile_value

    def calc2(self):
        @lru_cache
        def _calc():
            print("Am I cached?")
            return self.volatile_value

        return _calc()


instance = A('a')
print(instance.calc())
print(instance.calc2())
instance.volatile_value = 'b'
print(instance.calc())
print(instance.calc2())
print(instance.calc())
print(instance.calc2())

This prints the following.

a
Am I cached?
a
a
Am I cached?
b
a
Am I cached?
b

Decorating an inner function does not actually apply a cache, because the function (and its cache) is recreated on every call, so nothing persists between calls.
So really the only viable way is to cache a standalone free function that resides at the module level.

@martindurant
Member

How about something like

@@ -229,7 +232,7 @@ class LazyReferenceMapper(collections.abc.MutableMapping):
         """Shortcut file listings"""
         path = path.rstrip("/")
         pathdash = path + "/" if path else ""
-        dirnames = self.listdir()
+        dirnames = self.listdir
         dirs = [
             d
             for d in dirnames
:
-            path = self.url.format(field=field, record=record)
-            data = io.BytesIO(self.fs.cat_file(path))
+        def load_refs(url, fs, engine):
+            import pandas as pd
+            data = io.BytesIO(fs.cat_file(url))
             try:
-                df = self.pd.read_parquet(data, engine=self.engine)
+                df = pd.read_parquet(data, engine=engine)
                 refs = {c: df[c].to_numpy() for c in df.columns}
             except OSError:
                 refs = None
             return refs

-        self.open_refs = open_refs
+        self.load_refs = lru_cache(self.cache_size)(load_refs)
+
+    def open_refs(self, field, record):
+        """cached parquet file loader"""
+        path = self.url.format(field=field, record=record)
+        return self.load_refs(path, self.fs, self.engine)

     @staticmethod
     def create(root, storage_options=None, fs=None, record_size=10000, **kwargs):
@@ -219,7 +222,7 @@ class LazyReferenceMapper(collections.abc.MutableMapping):
         fs.pipe("/".join([root, ".zmetadata"]), json.dumps(met).encode())
         return LazyReferenceMapper(root, fs, **kwargs)

-    @lru_cache
+    @cached_property
     def listdir(self):
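As a minimal sketch of how the @cached_property suggestion behaves (a hypothetical simplified class, not the real code): the value is computed once per instance, stored in the instance __dict__, and no class-level structure holds a reference to self. Note that call sites drop the parentheses, as in the diff above.

```python
from functools import cached_property

class Mapper:  # simplified stand-in, not the real LazyReferenceMapper
    def __init__(self, zmetadata):
        self.zmetadata = zmetadata

    @cached_property
    def listdir(self):  # accessed as an attribute, not called
        return {k.rsplit("/", 1)[0] for k in self.zmetadata if not k.startswith(".z")}

m = Mapper({"x/0": 1, ".zattrs": 2})
print(m.listdir)          # {'x'} - computed once, stored in m.__dict__
m.zmetadata["y/0"] = 3
print(m.listdir)          # still {'x'} - the same immutability assumption applies
```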

@honzaflash
Author

@martindurant

How about something like

Yes, @cached_property is also a possibility, though it still assumes immutability. Also, I am not sure how that would work for _key_to_record.

My thought was something like this:

    @lru_cache()
    @staticmethod
    def _listdir(zmetadata):
        """Cached static helper that lists top-level directories"""
        dirs = (p.rsplit("/", 1)[0] for p in zmetadata if not p.startswith(".z"))
        return set(dirs)

    def listdir(self):
        """List top-level directories"""
        return LazyReferenceMapper._listdir(self.zmetadata)

Doing this for _key_to_record also requires making a static version of _get_chunk_sizes, I think. A bit more involved, and perhaps a little messy, but it wouldn't make assumptions about mutability. It looks like _key_to_record's return value depends on self.chunk_sizes, self.zmetadata, self.record_size, and the key.

- warns about lru_cache on methods
- instead of using a `lru_cache` on bound methods (assumes immutability of object members) define static methods that can be cached and wrapped by the original methods
@honzaflash
Author

I had most of the code ready, so I just pushed it.

I can revert or whatever if we don't want to go this route.

And I will clean up with some squashing and a force-push once we are done.

def _key_to_record(self, key):
    """Details needed to construct a reference for one key"""

@staticmethod
def _key_to_record_static(chunk_sizes, zmetadata, record_size, key):
Author

Ah, silly me, dictionaries are unhashable
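Concretely (a tiny repro of the problem just mentioned): lru_cache builds its key from the arguments and hashes it, so passing a dict raises TypeError, while a hashable snapshot such as a frozenset works.

```python
from functools import lru_cache

@lru_cache()
def f(meta):
    return len(meta)

try:
    f({"a": 1})               # dict arguments cannot be hashed into a cache key
except TypeError as err:
    print(type(err).__name__)  # TypeError

print(f(frozenset({"a": 1})))  # a hashable snapshot of the keys works: prints 1
```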

Author

The more I look at this, the more I think the original simple solution might be the best.

Apparently, you can use frozendict, which is hashable, but that might be a problem because _get_chunk_sizes actually mutates self.chunk_sizes (more specifically, it adds a key-value pair; it should never mutate an existing key).

I think assuming that the object attributes won't change once a cache entry has been saved might be the easiest way forward. Of course, we should note that in a comment. And go back to

        self.listdir = lru_cache()(self.listdir)
        self._key_to_record = lru_cache(maxsize=4096)(self._key_to_record)

@martindurant
Member

And I will cleanup with some squashing and a force-push once we are done.

Don't worry about that, we squash anyway.

@honzaflash
Author

This is a can of worms, thank you for walking through it with me.

I propose reverting the last commit and going back to the originally suggested solution with the immutability assumption: #1985 (comment)

Unless you want to rethink which parts of the computation are cached.

@martindurant
Member

I propose reverting the last commit and going back to the originally suggested solution

Yes, I think we are agreed.

honzaflash and others added 2 commits February 5, 2026 09:56
- to simplify caching of `listdir` and `_key_to_record` we assume that the members of self that the return values depend on won't mutate once a return value is computed and cached
- note the assumption to hopefully warn or aid future developers
@honzaflash
Author

honzaflash commented Feb 5, 2026

Side note, I just had this thought about how the cache key is computed in the old memory-leaky scenario: I am pretty sure lru_cache builds its key from the arguments and uses __hash__ (and __eq__) on them.
This means that in LazyReferenceMapper's case the key would be based on the default, object-ID-based hash. Therefore, the assumption about immutability has been there all along, because self.__hash__() doesn't change when the object members do.

I wrote a little minimal example:

from functools import lru_cache

# default __hash__ is based on object ID - like LazyReferenceMapper
class A:
    def __init__(self, x):
        self.x = x

    @lru_cache()
    def add(self, y):
        return self.x + y

a = A(1)
print(a.add(1))  # 2
a.x = 5
print(a.add(1))  # 2 - stale cached value

# custom __hash__ that reflects the relevant member
class B:
    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return self.x

    @lru_cache()
    def add(self, y):
        return self.x + y

b = B(1)
print(b.add(1))  # 2
b.x = 6
print(b.add(1))  # 7 - the hash changed, so the value is recomputed

Successfully merging this pull request may close these issues:

LazyReferenceMapper cache leaking memory