Reduce memory consumption of TURF parser from tags created from handles.

Fixed

Description

The TURF parser currently uses much more memory than the URF CSV parser. Investigation shows that one of the main culprits is the repeated creation of equivalent URI instances for the same handles. The URF CSV parser creates a tag URI only once for each column. The TURF parser, however, creates a new tag each time it parses a handle, e.g. when parsing an object property.

This duplication can be reduced significantly by caching frequently-used tags so that they are reused if possible. Initially we can just cache the non-ID tags in the ad-hoc namespace. This can be done in the URF.Handle class, because this is where most ad-hoc tags get created from handles.
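As a rough illustration of the caching idea described above, here is a minimal sketch of a handle-to-tag cache. It is not the actual `URF.Handle` code: the class name `AdHocTagCache`, the method `tagForHandle`, and the placeholder namespace URI are all hypothetical, and the sketch assumes the caller only passes simple, non-ID handles.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper for illustration; not the actual URF.Handle implementation. */
final class AdHocTagCache {

    // Placeholder namespace (assumption); the real URF ad-hoc namespace may differ.
    static final URI AD_HOC_NAMESPACE = URI.create("https://example.com/adhoc/");

    // Maps each handle to the single URI instance shared by all callers.
    private static final Map<String, URI> CACHE = new ConcurrentHashMap<>();

    private AdHocTagCache() {
    }

    /**
     * Returns the shared tag for a simple handle, creating the URI only on first use.
     * Assumes the handle is a non-ID handle that is valid as a relative URI reference.
     */
    static URI tagForHandle(String handle) {
        return CACHE.computeIfAbsent(handle, h -> AD_HOC_NAMESPACE.resolve(h));
    }

    /** Returns the number of distinct tags currently cached. */
    static int size() {
        return CACHE.size();
    }
}
```

With a cache like this, repeated calls such as `AdHocTagCache.tagForHandle("name")` return the same `URI` instance rather than a fresh copy. The map in this sketch is unbounded, which is only reasonable if the number of distinct handles stays small.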

In the future we might implement a more complex caching strategy, or perhaps even introduce Tag and Handle classes that would fully control the caching of tags while hiding it from callers.

Environment

None

Activity

Garret Wilson 
February 14, 2019 at 8:17 PM
(edited)

These extra URI copies may not seem like a lot, but in a recent real-world example data set we had around 286,000 root elements. Each of them had various properties, and some property values were objects with their own properties. After caching we wound up with 58 cached properties. So before caching, those 58 properties amounted to roughly 286,000 * 50 copies (ignoring child objects and assuming each root resource had all properties, two assumptions that roughly cancel out to give a very rough estimate), or over 14 million copies instead of 58. Each URI contains various strings, so you have the string contents, the references, etc., and suddenly you're up to over 286MB (assuming each URI has at least 20 characters). That's probably a severe underestimate.
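The back-of-the-envelope figures in the comment above can be reproduced with a few lines of arithmetic. This is only a sketch: the class name `UriCopyEstimate` is hypothetical, and the one-byte-per-character figure is an assumption chosen to match the 286MB number in the comment. Real per-URI overhead in a JVM would be higher.

```java
/** Hypothetical helper reproducing the rough estimate from the comment. */
final class UriCopyEstimate {

    static final long ROOT_ELEMENTS = 286_000L;   // from the comment
    static final long PROPERTIES_PER_ROOT = 50L;  // rounded figure used in the comment
    static final long CHARS_PER_URI = 20L;        // lower bound assumed in the comment
    static final long BYTES_PER_CHAR = 1L;        // assumption: one byte per character

    private UriCopyEstimate() {
    }

    /** Estimated number of URI instances created without caching. */
    static long uriCopies() {
        return ROOT_ELEMENTS * PROPERTIES_PER_ROOT;
    }

    /** Estimated bytes of character data alone, ignoring object headers and references. */
    static long estimatedBytes() {
        return uriCopies() * CHARS_PER_URI * BYTES_PER_CHAR;
    }
}
```

Here `UriCopyEstimate.uriCopies()` gives 14,300,000 copies and `UriCopyEstimate.estimatedBytes()` gives 286,000,000 bytes, in line with the "over 14 million" and "286MB" figures.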

Details

Assignee

Reporter

Fix versions

Affects versions

Priority

Created February 14, 2019 at 2:38 PM
Updated February 15, 2019 at 2:59 PM
Resolved February 15, 2019 at 2:59 PM