#`.iterator(chunk_size)` performance issue

4 messages · Page 1 of 1 (latest)

chrome wren
#

I'm encountering a significant issue with RAM usage and a progressive degradation of performance while fetching data from PostgreSQL using Django ORM.

Example of the dummy code:

with transaction.atomic():
    queryset = get_queryset(...)
    total_items = queryset.count()

    items_counter = 0
    chunk_actions = [] # list of dicts
    init_time = start_time = time.time()
    for item in queryset.iterator(chunk_size=500000):
        chunk_actions.append(instance_to_dict(item))
        items_counter += 1
        # Chunk completed or last chunk
        if len(chunk_actions) >= block_size or items_counter == total_items:
            # Dummy code....
            # Reset chunk
            chunk_actions = []

The dummy queryset:
NOTE: The performance issue only appears when 'embeddings' field is added to the select_related (a OneToOneField of MessageItem)

def get_queryset():
  ...  
  return MessageItem.objects.select_related('room', 'embeddings').prefetch_related(annotated_entities_prefetch).distinct("pk")

The EmbeddingsItem model (each embedding is 1024 length of 32 bits floats)::

class EmbeddingsItem(BaseModel):
    model = models.CharField(max_length=256)
    instruct = models.TextField(blank=True, null=True)
    embeddings = ArrayField(models.FloatField())

Output (over 16M of items, block_size = 500.000):

[Extracted 500000 items (iterator: 79.474secs)]
[Extracted 500000 items (iterator: 54.127secs)]
[Extracted 500000 items (iterator: 50.736secs)]
[Extracted 500000 items (iterator: 52.507secs)]
[Extracted 500000 items (iterator: 51.457secs)]
[Extracted 500000 items (iterator: 51.614secs)]
[Extracted 500000 items (iterator: 93.181secs)]
[Extracted 500000 items (iterator: 458.092secs)] # ---> **Perf issue + huge RAM usage.**
[Extracted 500000 items (iterator: 485.098secs)]
[Extracted 500000 items (iterator: 498.268secs)]
[Extracted 500000 items (iterator: 517.507secs)]
...
chrome wren
#

I have tested again and these are my conclusions:

  • This always occurs after 3.5M items (during the 500K block for 4M items).
  • I think it is mainly due to a progressive increase in RAM usage that triggers a high degradation after 3.5M items.
  • I also tested with chunk_size=10000 an the results are the same:
[Extracted 500000 items (iterator: 181.696secs).]
[Extracted 500000 items (iterator: 147.379secs).]
[Extracted 500000 items (iterator: 142.043secs).]
[Extracted 500000 items (iterator: 141.764secs).]
[Extracted 500000 items (iterator: 141.603secs).]
[Extracted 500000 items (iterator: 142.320secs).]
[Extracted 500000 items (iterator: 186.708secs).]
[Extracted 500000 items (iterator: 582.356secs).]
left prawn
#

I don't know what's going on, but I'd see if I could avoid appending each chunk to a list. Those chunks are big -- 500,000 somethings -- and you seem to have a lot of them. Can you not just process each item as soon as you get it from the iterator, and not save anything in that list?

chrome wren
#

@left prawn The list only stores a dict with some info from each item. Then, I send chunks of 500K dicts to an Elasticsearch cluster.
I assume that the RAM usage should be N MB per 500k items, what happens is that it keeps growing, i.e. it does not free up. Resulting in a bottleneck.