#Aggregate over multiple tables

1 messages ยท Page 1 of 1 (latest)

humble zephyr
#

Is it possible to create an aggregate over multiple tables? Specifically, I have tables videos and categories. Each video document has a categoryId. I would like to aggregate over categories to e.g. count the number of videos in each category. I also have a third table, watchSessions which basically just store a videoId and the duration of the watch event. I would also like to relate these through videos to categories to maintain statistics about which categories are particularly popular, and the evolution of this over time etc...

I have a hard time setting this up elegantly with an aggregate - can it be done? And does it sound like a good use case for aggregates? Thanks ๐Ÿ™‚

#

Aggregate over multiple tables

quartz sentinel
#

I'm interested in implementing something similar but haven't gotten there yet. Currently working on a media app. In your example i'm assuming you would have to make an aggregate on the videos table called videosbyCategories using the sort key categoryId and from that you can count the number of videos in each category. And then for the watchSessions you could create an aggregate on watchSessions table using multiple sort keys like [videoId, categoryId, watchTime...etc] from there you can probably get some of the stats your looking for. But i'm curious to what convex team suggests.

elfin echo
#

It sounds like you want an aggregate with a key like
[categoryId, videoId, duration] with a sumValue of duration, so you can look up count by [categoryId], [categoryId, videoId], and also get a sum of the duration over those ranges so you can find average watch time, and find 90th percentile watch time, etc.

I'd say you have a few options:

  1. Use the TableAggregate on watchSessions and denormalize the categoryId there too, assuming that videos don't change categories (or when they do the sessions are all updated too)
  2. Use the lower-level DirectAggregate where you do inserts like .insert(ctx, [category._id, video._id, watch.duration], watch._id, watch.duration) whenever you add a new session, and keep it up to date as it changes (e.g. updating all of them when a video category changes)
  3. Roll your own rollups: generate these statistics periodically - e.g. once a day in a cron job, over that day's sessions - and store that info into a dedicated table, one entry per day. Then you can query over the days to get stats over custom time periods.
  4. Run a periodic stats job that paginates over all the data and generates a custom report. If you just make a monthly report, you might like this for being able to define new stats and not have to worry about backfilling / maintaining data
  5. Set up streaming export to mirror your data into an analytics DB, where you can run all sorts of analytical queries. It won't be transactional or up to date, but is the right tool for the job for deep analysis.
  6. Set up log streaming and emit these events as logs, and roll them up in metrics / dashboards in Axiom / DataDog - provided you don't care about being able to migrate old data / get stats over old data.
humble zephyr
#

Thanks for the suggestions! I'll try option 2 and report back how it goes ๐Ÿ˜Œ

humble zephyr
# elfin echo It sounds like you want an aggregate with a key like `[categoryId, videoId, dura...

Thanks Ian! DirectAggregate seems to work wonderfully!

I ran into one problem though: I can't seem to get correct values for quantiles or min/max. I use this syntax:
const max = (await categoryWatchTimeAggregate.max(ctx, {prefix: [categoryId]}))?.key[1]
I guess the issue is that entries are ordered by videoId instead of watchPercentage when I prefix by categoryId.

I don't understand complete what you mean by this: and also get a sum of the duration over those ranges so you can find average watch time, and find 90th percentile watch time, etc.? ๐Ÿ˜Š thanks!

elfin echo
#

Yeah the min/max/etc API is realtive to your sorting. so if you want the max per category you would need a [categoryId, duration] aggregate. With that you could do:

  • min/max duratiion
  • use the sum and count to calculate an average duration
  • use the count and at to get the value of the element at the Xth percentile

and more things outlined in the README

humble zephyr
# elfin echo Yeah the min/max/etc API is realtive to your sorting. so if you want the max per...

Thanks @elfin echo , I went with that approach and it works well for that situation.

I am battling creating time-series data using aggregates at the moment. I am trying to recreate Youtube studio stats where the creator can see trends of counts of watches, subscribers, etc.

I created an aggregate on [videoId, creationTime, watchPercentage] with a sortkey in the same order. I thought I would somehow be able to aggregate by first a videoId and then a range of creationTimes, such that I could show watchPercentage trends for the last 24h, 1 week, 1 month, etc.. At the same time, I would like to get quantiles for watchPercentages for the given time period. I tried doing something like this, but it seemed to just count all documents:

        lower: { key: [videoId, beginDate, quantile[0], inclusive: false },
        upper: { key: [videoId, Date.now(), quantile[1], inclusive: true },
      })```

Maybe I am overloading these aggregates, and they should instead be split into smaller ones to handle each use case? Would just be so coooool to be able to do all of this with a single aggregate! ๐Ÿ˜Š
elfin echo
#

sum is what you're probably looking for instead of count - assuming you've set sumValue to watchPercentage

#

but also, the sorting is by date, then watch percentage, so I don't think the quantile argument here is doing what you think. Here's a good way to think about index sorting / sort keys: https://stack.convex.dev/databases-are-spreadsheets

I want to share my mental model of databases:

  • Databases are just big spreadsheets
  • An index is just a view of the spreadsheet sorted by one or mor...
elfin echo
#

I will say, once you go too far down the anaylitical query route, once your queries can't efficiently operate over a range of an index, you'll probably end up reaching for an OLAP tool, like streaming stats out to an analytics tool, and occasionally calculate metrics and push the results back into convex. That's the route youtube / others take. Eventually you want to isolate your "live" database from your large "offline" queries so the former stays snappy and any overloading happens on a non-critical read replica (that can be column-based for really efficient querying)

humble zephyr
#

Thanks @elfin echo for your thoughts, makes good sense! We're definitely not that large yet, honestly we could still just do all of this analytics with simple queries on our tables, but I can't help trying to optimize a bit and prepare for our hopeful scaling!!

I've decided to go with a modification of one of your original suggestions above, and will maintain aggregates that I update with a cron job just once daily during nighttime (which can be done very neatly, since each cron job will only have to look at documents created within the last 24 hours). As you also point out, it makes good sense to separate these more "offline" analytics data intended for admin staff and our creators from our live database that serves our actual end users. I began by implementing triggers to keep these analytics aggregates healthy, but realized that this was ridiculous: analytics will be used to glance retrospectively at data on the timescale of days, weeks, or even months, so triggers were completely overkill and unnecessary.

Thanks for the help, was great to discuss these architectural considerations and get your input ๐Ÿ™‚