# Apparent memory leak


wooden lion
#

We noticed Payload was struggling to serve images from a media collection. It would take 1-2 minutes for an image to load. The AWS metrics for the Fargate cluster suggest there could be a memory leak somewhere. Restarting the tasks in Fargate resolved the issue.

Versions:

  • payload@1.6.6
  • @payloadcms/plugin-cloud-storage@1.0.12
    • @aws-sdk/client-s3@3.266.1
    • @aws-sdk/lib-storage@3.267.0
tawny eagle
#

Interesting - we have never come across this ourselves but it's definitely something we need to look into

#

is this the only time you've seen it?

#

and can you trace it back to one action? or be able to reproduce it? like, did someone upload a large file?

wooden lion
#

This is the only time we have seen it, yeah, and we can't identify anything in particular that caused it. This is from our production instance, and no one can actually access the admin dashboard there directly. Everything is edited in our staging environment, then the db gets promoted to production through mongodump and mongorestore, and the S3 objects get cloned from the staging bucket to the prod bucket.

wooden lion
#

@tawny eagle we've seen this again, and nothing happened on the Payload instances for ~2 days prior

#

The start of the slope increase is at ~9:30PM on Sunday, and the last action we took that affected the instances was Friday ~5PM, where we promoted our staging db to prod using mongodump/mongorestore

#

Everything else works fine, and this seems to only impact the loading of images from our media collection
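
One way to narrow down what is growing would be to log the Node.js memory counters from inside the Payload process and compare them with the Fargate graph. The following is a minimal sketch, not something taken from this thread: the function names and the one-minute interval are assumptions. It relies only on `process.memoryUsage()`, which reports `heapUsed` (JavaScript objects) separately from `external` and `arrayBuffers` (memory held outside the V8 heap, such as Buffers).

```typescript
// Convert bytes to megabytes, rounded to one decimal place.
const toMb = (bytes: number): number => Math.round((bytes / 1024 / 1024) * 10) / 10;

// Returns the current process memory counters in MB.
// (sampleMemory is a hypothetical helper name, not a Payload API.)
function sampleMemory(): Record<string, number> {
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  return {
    rssMb: toMb(rss),
    heapTotalMb: toMb(heapTotal),
    heapUsedMb: toMb(heapUsed),
    externalMb: toMb(external),
    arrayBuffersMb: toMb(arrayBuffers),
  };
}

// Logs one JSON line per interval and returns a function that stops the logging.
// The default interval of 60 seconds is an arbitrary choice.
function startMemoryLogging(intervalMs: number = 60_000): () => void {
  const timer = setInterval(() => {
    console.log(JSON.stringify({ at: new Date().toISOString(), ...sampleMemory() }));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling `startMemoryLogging()` once at server startup would be enough. If `rssMb` climbs while `heapUsedMb` stays flat, the growth is likely outside the JavaScript heap, which would point towards buffers or streams rather than retained objects.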

tawny eagle
#

Hmmm, this is very interesting to me. Can you try and create a reproduction locally? We will get on this immediately but we'll likely need more in terms of reproduction to be able to assist

#

are you using any type of afterRead hooks or something that would run in production?

wooden lion
#

Not for media

tawny eagle
#

what about for anything else?

#

we have some big projects in production on digitalocean droplets, but we have never seen this before

wooden lion
#

Not for our currently deployed instances; we're developing something that uses one but it's not on prod or staging yet

wooden lion
#

The really strange thing is that it affects nothing but the retrieval of media from the cms. Data returns are just as fast, but trying to fetch the media slows down dramatically

tawny eagle
#

ok, well, at least we can narrow it down to media

tawny eagle
#

I just found that this package (which we rely on) may be causing a memory leak

#

i released a fix in 1.6.28

#

can you try that version to see if this solves your issue?

#

and followup question for you: are you using the useTempFiles: true option?

wooden lion
#

Awesome news, thanks for looking into it! We have a ticket for upgrades next week, so we will report back then. There is another issue slowing down our upgrade process, though: it appears that some version of payload / @payloadcms/plugin-cloud-storage newer than the one we currently use no longer handles spaces in the filenames of media

#

@tawny eagle no we aren't using useTempFiles: true
Our media collection is very simple:

import { CollectionConfig } from "payload/types";

const Media: CollectionConfig = {
  slug: "media",
  access: {
    read: () => true,
  },
  admin: {
    useAsTitle: "alt",
  },
  fields: [
    {
      name: "alt",
      type: "text",
      required: true,
    },
  ],
  upload: {
    staticURL: "/media",
    staticDir: "media",
    adminThumbnail: "thumbnail",
  },
};

export default Media;

And the plugin config:

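import { fromContainerMetadata } from "@aws-sdk/credential-providers";
import { cloudStorage } from "@payloadcms/plugin-cloud-storage";
import { s3Adapter as payloadS3Adapter } from "@payloadcms/plugin-cloud-storage/s3";
import { buildConfig } from "payload/config";
import Media from "./collections/Media"; // path assumed; not shown in the original message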

const s3Adapter = payloadS3Adapter({
  config: {
    credentialDefaultProvider: fromContainerMetadata,
  },
  bucket: process.env.PAYLOAD_CMS_S3_BUCKET,
});

export default buildConfig({
  ...
  plugins: [
    cloudStorage({
      collections: {
        [Media.slug]: {
          adapter: process.env.PAYLOAD_CMS_S3_BUCKET ? s3Adapter : null,
        },
      },
    }),
  ],
});
wooden lion
#

Hmm, unfortunately it seems we're still having this issue

#

Versions:

"dependencies": {
    "@aws-sdk/client-s3": "^3.305.0",
    "@aws-sdk/credential-providers": "^3.303.0",
    "@aws-sdk/lib-storage": "^3.305.0",
    "@payloadcms/plugin-cloud-storage": "^1.0.14",
    "express": "^4.18.2",
    "payload": "^1.6.30"
  },
wooden lion
#

@tawny eagle this still happens and it really is quite bizarre. It is only the image loading that degrades, and media is the only thing that actually "hits" the API of the running Payload instances. For everything else, we use a local instance of Payload inside our own API container to connect directly to the database for fetching data. This also seems to be exclusive to our production environment. The differences between our envs:

  • Staging

    • 1 Fargate container instance
    • 0.25 cpu units
    • 0.5 mem units
  • Prod

    • minimum 2, maximum 20 Fargate container instances
    • 1 cpu units
    • 4 mem units
charred galleon
#

You may already know this - so feel free to ignore, but there are ways to trigger heap snapshots for node.js in production. You can then bring them down for analysis in the standalone version of dev tools. Here's an excellent presentation from Matteo and Kent.... https://www.youtube.com/watch?v=vkys6Wk-jYk https://kentcdodds.com/blog/fixing-a-memory-leak-in-a-production-node-js-app

I paired with Kent C. Dodds to solve a memory leak in his website. In this video, we used the Chrome Inspector to visualize heap dumps and allocation timelines.

Here is the link where Kent implemented the fixes: https://www.youtube.com/watch?v=NLMeqH4-Hjc


How I found and fixed a memory leak on kentcdodds.com

#

In Chrome...

#

@wooden lion @tawny eagle I think what Kent did was create a custom route that he could 'hit' that would trigger heap snapshots and then he downloaded them and the two of them went through the complete analysis process (you can compare snapshots as well - a baseline, against the increased memory version)

#

Again - ignore all of this if I'm 'preaching to the choir' ;-)

#

Also again - I'm sure you know this - but where the snapshot gets written will depend on your Fargate instance config. We use EFS.
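
As a rough illustration of the approach described above, a heap snapshot can be written from inside the running process with Node's built-in `v8.writeHeapSnapshot()`. This is only a sketch under assumptions: `takeHeapSnapshot` and the `HEAP_SNAPSHOT_DIR` environment variable are hypothetical names, and the directory would need to point at a writable mount that outlives the task (for example an EFS volume).

```typescript
import { existsSync, mkdirSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { writeHeapSnapshot } from "node:v8";

// HEAP_SNAPSHOT_DIR is a hypothetical env var; it falls back to the OS temp dir.
const SNAPSHOT_DIR = process.env.HEAP_SNAPSHOT_DIR ?? join(tmpdir(), "heap-snapshots");

// Writes a V8 heap snapshot into `dir` and returns the path of the written file.
// Generating a snapshot is synchronous: it blocks the event loop while it runs
// and needs extra memory, so it should be triggered sparingly in production.
function takeHeapSnapshot(dir: string = SNAPSHOT_DIR): string {
  if (!existsSync(dir)) {
    mkdirSync(dir, { recursive: true });
  }
  const file = join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  return writeHeapSnapshot(file);
}
```

In an Express-based server this could be called from an access-restricted route, for instance `app.get("/debug/heap-snapshot", (req, res) => res.json({ file: takeHeapSnapshot() }))`, where the route path is again just an assumption. Taking one snapshot shortly after startup and another once memory has climbed gives the two files to compare in Chrome DevTools.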

wooden lion
#

Awesome! I didn't actually know about the heap stuff, will look into how we can incorporate this so we can investigate the issue

#

Thank you

charred galleon
#

Matteo is a member of the Node.js Technical Steering Committee. He really knows his stuff.