Description
Some applications need to canonicalize strings to avoid consuming a lot of memory through duplicated strings. This is usually done with a map that strongly holds on to the canonicalized strings, the simplest being Map<String, String>.
However, such a cache may hold on to strings longer than the application needs them. The application therefore has to decide manually when it's a good time to clear or prune the cache, based on imperfect heuristics. Once the cache is cleared, looking up new strings may leave duplicate strings in the heap.
=> The current options are somewhat unsatisfactory.
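A minimal sketch of the strong cache described above, to make the trade-off concrete. The class name `StrongStringCache` and its members are hypothetical and only illustrate the pattern; they are not an existing library API.

```dart
/// Hypothetical helper for illustration; not an existing core-library API.
class StrongStringCache {
  final Map<String, String> _cache = {};

  /// Returns the first instance seen whose contents equal [s].
  String canonicalize(String s) => _cache.putIfAbsent(s, () => s);

  /// Number of strings the cache currently keeps alive.
  int get length => _cache.length;

  /// Drops every entry. Strings the application still references stay
  /// alive, so canonicalizing equal contents afterwards may keep a second copy.
  void clear() => _cache.clear();
}

void main() {
  final cache = StrongStringCache();
  // Build the strings at runtime so they are not compile-time constants.
  final a = String.fromCharCodes([100, 97, 114, 116]); // "dart"
  final b = String.fromCharCodes([100, 97, 114, 116]); // "dart"
  final canonA = cache.canonicalize(a);
  final canonB = cache.canonicalize(b);
  print(identical(canonA, canonB)); // true: both are the instance stored first
  print(cache.length); // 1
}
```

Every entry stays reachable through the map until `clear()` is called, which is exactly the lifetime problem described above.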
Weakness
One would want a canonicalization mechanism where the cache holds on to the strings only weakly, thereby allowing the implementation to reclaim memory once the application no longer needs them. Unfortunately, such a cache cannot currently be built with e.g. WeakReferences, since they cannot point to Strings.
(Even if one could build a cache that holds on to the strings weakly, all components of an application would have to use the same cache for canonicalization. Independent components of an app may not know about each other and may not share a common, standardized cache library.)
API
Ideally we'd provide a string canonicalization/interning API as part of the core libraries. Java, for example, has the String.intern() method for this purpose.
The specific use cases that could benefit here in our own tooling are e.g. CFE/Analyzer/...
Optimizations
As an optimization one may want the API to allow getting a canonicalized string for
- an existing String
- a substring of an existing string (String, int start, int length)
- a to-be utf-8 decoded string from (Uint8List, int start, int length)
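The three entry points listed above could be sketched roughly as follows. The interface name `StringInterner`, the method names, and the naive implementation are all assumptions made for illustration; no such API exists in the core libraries today. The naive version allocates a temporary string before the lookup, which is precisely what a real implementation would want to avoid by hashing and comparing the substring or byte range in place.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical interface; the names and signatures are illustrative only.
abstract class StringInterner {
  String intern(String s);
  String internSubstring(String s, int start, int length);
  String internUtf8(Uint8List bytes, int start, int length);
}

/// Naive reference implementation backed by a strong map.
/// It creates temporary strings, so it shows the API shape only,
/// not the allocation-avoiding optimization.
class NaiveStringInterner implements StringInterner {
  final Map<String, String> _table = {};

  @override
  String intern(String s) => _table.putIfAbsent(s, () => s);

  @override
  String internSubstring(String s, int start, int length) =>
      intern(s.substring(start, start + length));

  @override
  String internUtf8(Uint8List bytes, int start, int length) =>
      intern(utf8.decode(Uint8List.sublistView(bytes, start, start + length)));
}

void main() {
  final interner = NaiveStringInterner();
  final bytes = Uint8List.fromList(utf8.encode('hello world'));
  final fromString = interner.internSubstring('hello world', 6, 5);
  final fromBytes = interner.internUtf8(bytes, 6, 5);
  print(fromString); // world
  print(identical(fromString, fromBytes)); // true
}
```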
The CFE cache currently supports the above use cases (e.g. in some modes it avoids utf-8 decoding into temporary string objects, since the probability of finding a matching entry is high).
VM-specifics
In the VM specifically we'd prefer if the implementation ensures such strings are interned/canonicalized across isolates within the same isolate group.
/cc @lrhn @leafpetersen for library/language
/cc @rakudrama @askeksa-google for web & wasm capabilities
/cc @rmacnak-google for GC
(/cc @jensjoha for CFE)