
Conversation

@honzaflash

  • apply lru_cache manually in the constructor instead of as a decorator
    • pretty transparent/minimal change
    • the methods should work just like they did, since the `@` decorator syntax is just syntactic sugar for what the change does (except here it happens inside the constructor, per instance, and `self` is already bound to the method)
  • this way we avoid holding a self reference in the cache on the class level (which caused a memory leak previously)

Fixes #1983

- apply `lru_cache` manually in the constructor instead of as a decorator
- this way we avoid holding a `self` reference in the cache on the class level
# apply `lru_cache` decorator manually per instance
# This way `self` reference is not held on class level
self.listdir = lru_cache()(self.listdir)
self._key_to_record = lru_cache(maxsize=4096)(self._key_to_record)
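To illustrate the lifetime difference the PR description talks about, here is a minimal sketch (hypothetical class names, not the actual fsspec code) comparing a class-level `@lru_cache` with the per-instance wrapper, observed through `weakref`:

```python
import gc
import weakref
from functools import lru_cache

class Leaky:
    @lru_cache()            # cache stored on the class; cache keys include self
    def listdir(self):
        return ["a"]

class Fixed:
    def __init__(self):
        # per-instance wrapper, as in the change above: the cache
        # lives on the instance and dies with it
        self.listdir = lru_cache()(self.listdir)

    def listdir(self):
        return ["a"]

leaky, fixed = Leaky(), Fixed()
leaky.listdir()
fixed.listdir()
r1, r2 = weakref.ref(leaky), weakref.ref(fixed)
del leaky, fixed
gc.collect()                # break the instance -> wrapper -> bound-method cycle
print(r1() is not None)     # True: the class-level cache still holds the instance
print(r2() is None)         # True: the instance and its cache were collected
```

Note that the per-instance wrapper creates a reference cycle (instance → wrapper → bound method → instance), so the instance is reclaimed by the cycle collector rather than by plain refcounting.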
Contributor

This, unfortunately, is not correct.

Author

Could you elaborate, please? What is incorrect?

Contributor

from functools import lru_cache


class A:
    def __init__(self, value):
        self.volatile_value = value
        self.calc = lru_cache()(self.calc)

    def calc(self):
        return self.volatile_value

    def calc2(self):
        @lru_cache
        def _calc():
            return self.volatile_value

        return _calc()


instance = A('a')
print(instance.calc())
print(instance.calc2())
instance.volatile_value = 'b'
print(instance.calc())
print(instance.calc2())

This prints a a a b; the expected output would be a a b b.
The cache set up by self.calc = lru_cache()(self.calc) does not account for changes to the relevant instance attributes, so it would only work if all of them were immutable. We cannot guarantee that in Python, and it may lead to mysterious bugs that are hard to debug.

Contributor

Besides, could you please also remove this line so that ruff can pick up issues like this.

Author

:O Oh no! You're totally right, self and its attributes are not considered for the cache key in this solution.

Thank you for catching that!

Author

@honzaflash honzaflash Feb 3, 2026

So really the only viable way is to cache a standalone free function that resides at the module level.

Options that I see:

  1. cached module level function that is then called from the method
  2. cached staticmethod/classmethod that is then called from the actual method
  3. a cached function defined in the constructor or setup just like open_refs
  4. assume instances are immutable and accept current fix

Member

No, the function/method is not static - it depends on the specific instance, but those details are not going to change. So, it would need to be func(instance, args), which is exactly what we have.

Author

@honzaflash honzaflash Feb 3, 2026

Sorry, see the earlier comment above with a snippet using a @staticmethod and then a regular method that just wraps it, passing in the necessary parameters from self. That is what I meant by a static method.
It's not that different from a module level function - really only a matter of namespace I guess.

Author

Also to add to the "immutable assumption", the class does inherit from collections.abc.MutableMapping - could this be changed to just collections.abc.Mapping (does not have __setitem__)?

Member

No. It is a mutable mapping, you can write to keys. These two functions do not depend on those keys, though.

@TLCFEM
Contributor

TLCFEM commented Feb 3, 2026

Furthermore, even the second approach does not work.

from functools import lru_cache


class A:
    def __init__(self, value):
        self.volatile_value = value
        self.calc = lru_cache()(self.calc)

    def calc(self):
        return self.volatile_value

    def calc2(self):
        @lru_cache
        def _calc():
            print("Am I cached?")
            return self.volatile_value

        return _calc()


instance = A('a')
print(instance.calc())
print(instance.calc2())
instance.volatile_value = 'b'
print(instance.calc())
print(instance.calc2())
print(instance.calc())
print(instance.calc2())

This prints the following.

a
Am I cached?
a
a
Am I cached?
b
a
Am I cached?
b

Decorating an inner function does not actually apply a cache, because the function (and its cache) is recreated on every call, so nothing persists between calls.
So really the only viable way is to cache a standalone free function that resides at the module level.

@martindurant
Member

How about something like

@@ -229,7 +232,7 @@ class LazyReferenceMapper(collections.abc.MutableMapping):
         """Shortcut file listings"""
         path = path.rstrip("/")
         pathdash = path + "/" if path else ""
-        dirnames = self.listdir()
+        dirnames = self.listdir
         dirs = [
             d
             for d in dirnames
:
-            path = self.url.format(field=field, record=record)
-            data = io.BytesIO(self.fs.cat_file(path))
+        def load_refs(url, fs, engine):
+            import pandas as pd
+            data = io.BytesIO(fs.cat_file(url))
             try:
-                df = self.pd.read_parquet(data, engine=self.engine)
+                df = pd.read_parquet(data, engine=engine)
                 refs = {c: df[c].to_numpy() for c in df.columns}
             except OSError:
                 refs = None
             return refs

-        self.open_refs = open_refs
+        self.load_refs = lru_cache(self.cache_size)(load_refs)
+
+    def open_refs(self, field, record):
+        """cached parquet file loader"""
+        path = self.url.format(field=field, record=record)
+        return self.load_refs(path, self.fs, self.engine)

     @staticmethod
     def create(root, storage_options=None, fs=None, record_size=10000, **kwargs):
@@ -219,7 +222,7 @@ class LazyReferenceMapper(collections.abc.MutableMapping):
         fs.pipe("/".join([root, ".zmetadata"]), json.dumps(met).encode())
         return LazyReferenceMapper(root, fs, **kwargs)

-    @lru_cache
+    @cached_property
     def listdir(self):
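As a minimal sketch of how the @cached_property suggestion behaves (a hypothetical simplified class, not the real code): the value is computed once per instance, stored in the instance __dict__, and no class-level structure holds a reference to self. Note that call sites drop the parentheses, as in the diff above.

```python
from functools import cached_property

class Mapper:  # simplified stand-in, not the real LazyReferenceMapper
    def __init__(self, zmetadata):
        self.zmetadata = zmetadata

    @cached_property
    def listdir(self):  # accessed as an attribute, not called
        return {k.rsplit("/", 1)[0] for k in self.zmetadata if not k.startswith(".z")}

m = Mapper({"x/0": 1, ".zattrs": 2})
print(m.listdir)          # {'x'} - computed once, stored in m.__dict__
m.zmetadata["y/0"] = 3
print(m.listdir)          # still {'x'} - the same immutability assumption applies
```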

@honzaflash
Author

@martindurant

How about something like

Yes, @cached_property is also a possibility, though it still assumes immutability. Also, I am not sure how that would work for _key_to_record.

My thought was something like this:

    @lru_cache()
    @staticmethod
    def _listdir(zmetadata):
        """Cached static helper that lists top-level directories"""
        dirs = (p.rsplit("/", 1)[0] for p in zmetadata if not p.startswith(".z"))
        return set(dirs)

    def listdir(self):
        """List top-level directories"""
        return LazyReferenceMapper._listdir(self.zmetadata)

Doing this for _key_to_record also requires making a static version of _get_chunk_sizes, I think. A bit more involved, and perhaps a little messy, but it wouldn't make assumptions about mutability. It looks like _key_to_record's return value depends on self.chunk_sizes, self.zmetadata, self.record_size, and the key.

- warns about lru_cache on methods
- instead of using a `lru_cache` on bound methods (assumes immutability of object members) define static methods that can be cached and wrapped by the original methods
@honzaflash
Author

I had most of the code ready, so I just pushed it.

I can revert or whatever if we don't want to go this route.

And I will clean up with some squashing and a force-push once we are done.

def _key_to_record(self, key):
    """Details needed to construct a reference for one key"""

@staticmethod
def _key_to_record_static(chunk_sizes, zmetadata, record_size, key):
Author

Ah, silly me, dictionaries are unhashable
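Concretely (a tiny repro of the problem just mentioned): lru_cache builds its key from the arguments and hashes it, so passing a dict raises TypeError, while a hashable snapshot such as a frozenset works.

```python
from functools import lru_cache

@lru_cache()
def f(meta):
    return len(meta)

try:
    f({"a": 1})               # dict arguments cannot be hashed into a cache key
except TypeError as err:
    print(type(err).__name__)  # TypeError

print(f(frozenset({"a": 1})))  # a hashable snapshot of the keys works: prints 1
```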

Author

The more I look at this, the more I think the original simple solution might be the best.

Apparently, you can use frozendict, which is hashable, but that might be a problem because _get_chunk_sizes actually mutates self.chunk_sizes (more specifically, it adds a key-value pair; it should never mutate an existing key).

I think assuming that the object attributes won't change once a cache entry has been saved might be the easiest way forward. Of course, we should note that in a comment. And go back to

        self.listdir = lru_cache()(self.listdir)
        self._key_to_record = lru_cache(maxsize=4096)(self._key_to_record)

@martindurant
Member

And I will cleanup with some squashing and a force-push once we are done.

Don't worry about that, we squash anyway.

@honzaflash
Author

This is a can of worms, thank you for walking through it with me.

I propose reverting the last commit and going back to the originally suggested solution with the immutability assumption: #1985 (comment)

Unless you want to rethink which parts of the computation are cached.

@martindurant
Member

I propose reverting the last commit and going back to the originally suggested solution

Yes, I think we are agreed.

honzaflash and others added 2 commits February 5, 2026 09:56
- to simplify caching of `listdir` and `_key_to_record` we assume that the members of self that the return values depend on won't mutate once a return value is computed and cached
- note the assumption to hopefully warn or aid future developers
@honzaflash
Author

honzaflash commented Feb 5, 2026

Side note, I just had this thought about how the cache key is computed in the old memory-leaky scenario: I am pretty sure lru_cache builds its key from the arguments and uses __hash__ (and __eq__) on them.
This means that in LazyReferenceMapper's case the key would be based on the default, object-ID-based hash. Therefore, the assumption about immutability has been there all along, because self.__hash__() doesn't change when the object members do.

I wrote a little minimal example:

from functools import lru_cache

# default __hash__ is based on object ID - like LazyReferenceMapper
class A:
    def __init__(self, x):
        self.x = x

    @lru_cache()
    def add(self, y):
        return self.x + y

a = A(1)
print(a.add(1))  # 2
a.x = 5
print(a.add(1))  # 2 - stale cached value

# custom __hash__ that reflects the relevant member
class B:
    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return self.x

    @lru_cache()
    def add(self, y):
        return self.x + y

b = B(1)
print(b.add(1))  # 2
b.x = 6
print(b.add(1))  # 7 - the hash changed, so the value is recomputed

Successfully merging this pull request may close these issues:

LazyReferenceMapper cache leaking memory