I'm encountering a significant issue with RAM usage and a progressive degradation of performance while fetching data from PostgreSQL using Django ORM.
Example of the dummy code:
with transaction.atomic():
queryset = get_queryset(...)
total_items = queryset.count()
items_counter = 0
chunk_actions = [] # list of dicts
init_time = start_time = time.time()
for item in queryset.iterator(chunk_size=500000):
chunk_actions.append(instance_to_dict(item))
items_counter += 1
# Chunk completed or last chunk
if len(chunk_actions) >= block_size or items_counter == total_items:
# Dummy code....
# Reset chunk
chunk_actions = []
The dummy queryset:
NOTE: The performance issue only appears when 'embeddings' field is added to the select_related (a OneToOneField of MessageItem)
def get_queryset():
...
return MessageItem.objects.select_related('room', 'embeddings').prefetch_related(annotated_entities_prefetch).distinct("pk")
The EmbeddingsItem model (each embedding is 1024 length of 32 bits floats)::
class EmbeddingsItem(BaseModel):
model = models.CharField(max_length=256)
instruct = models.TextField(blank=True, null=True)
embeddings = ArrayField(models.FloatField())
Output (over 16M of items, block_size = 500.000):
[Extracted 500000 items (iterator: 79.474secs)]
[Extracted 500000 items (iterator: 54.127secs)]
[Extracted 500000 items (iterator: 50.736secs)]
[Extracted 500000 items (iterator: 52.507secs)]
[Extracted 500000 items (iterator: 51.457secs)]
[Extracted 500000 items (iterator: 51.614secs)]
[Extracted 500000 items (iterator: 93.181secs)]
[Extracted 500000 items (iterator: 458.092secs)] # ---> **Perf issue + huge RAM usage.**
[Extracted 500000 items (iterator: 485.098secs)]
[Extracted 500000 items (iterator: 498.268secs)]
[Extracted 500000 items (iterator: 517.507secs)]
...