#Is ~150 MB per user query size on BigQuery reasonable?

1 messages · Page 1 of 1 (latest)

sudden crow
#

I’ve been optimizing #🔹|hubble BigQuery query on the crypto-stellar.crypto_stellar.history_transactions table to generate a yearly wallet review. For starters, I’m calculating:
• Total successful transactions per account.
• Most active transaction day for each account.
• And transaction count of the most active day.

What I’ve Tried So Far:

1️⃣ Snapshot Table:
• Queried the entire table (~450+ GB processed) to create a snapshot with just three columns (account, batch_run_date, successful).
• Using this snapshot, queries now cost ~100-150 MB per wallet.

2️⃣ Partition Filtering:
• Leveraged partition pruning using partition_id (in YYYYMM format) for yearly queries.
• For example, querying 2024 data for all accounts costs ~99 GB.
• Subsequent queries on the partitioned snapshot also cost ~100-150 MB per wallet, which is fantastic!

3️⃣ Current Working Query:
• Added a WITH clause to pre-aggregate daily transaction counts before calculating the most active day and total transactions.
• This works excellently and is way more optimized compared to earlier!

Questions:

❓ 2. Is there an even better way to structure this query for further optimization?
❓ 3. Do you have any tips on reducing costs for large datasets like this? Any help, insights, or tips would be super appreciated! 🙌

#

Here’s the current working query I’m using:

CREATE TABLE `blip-444620.stellar_wrapped.clustered_transactions`
CLUSTER BY account
OPTIONS (
    expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
) AS
WITH daily_counts AS (
    SELECT
        account,
        DATE(batch_run_date) AS txn_date,
        COUNT(*) AS daily_transaction_count
    FROM `crypto-stellar.crypto_stellar.history_transactions`
    WHERE
        successful = TRUE
        AND batch_run_date >= '2024-01-01'
        AND batch_run_date < CURRENT_DATE()
    GROUP BY account, txn_date
)
SELECT
    account,
    SUM(daily_transaction_count) AS total_transactions,
    ARRAY_AGG(STRUCT(txn_date, daily_transaction_count)
              ORDER BY daily_transaction_count DESC LIMIT 1)[SAFE_OFFSET(0)].txn_date AS highest_transaction_count_date,
    ARRAY_AGG(STRUCT(txn_date, daily_transaction_count)
              ORDER BY daily_transaction_count DESC LIMIT 1)[SAFE_OFFSET(0)].daily_transaction_count AS highest_transaction_count
FROM daily_counts
GROUP BY account;
lime olive
#

If you need large datasets on a consistent basis, you could probably get started with the composable data platform and set up your own dataset.

(e.g. Received message: {"account_id":"GBUVCM6XH3ZXP66L6YFF7OVIGXSQ5JB2SLNC45Q2GBFCGDIGVBMZ5HQI","total_transactions":502,"highest_transaction_count_date":"2024-09-18T00:00:00Z","highest_transaction_count":288})

fringe pewter
#

Unfortunately that's about as cheap as you're gonna get from the history_transactions table. I think doing some napkin math like

  • A timestamp (batch_run_date) + string (account) columns is probably like 100 bytes(?) for each transaction (1 row)
  • There are like 1-2 million transactions a day
  • Minus whatever magic compression BigQuery does to store data

I could see all be ~= 100 GB like BigQuery estimates for 2024

sudden crow
#

Yeah. That makes total sense. Thanks for the insight!

graceful raven
#
  1. Do you have any tips on reducing costs for large datasets like this? Any help, insights, or tips would be super appreciated! 🙌

great work on initial query optimization, @sudden crow 👏

Do you plan on rebuilding or refreshing your snapshot table on a regular cadence? My only other suggestion is if you do, consider writing two queries:

  1. Your initial query to build the snapshot table with full history of whatever you need. This would execute on an ad hoc basis only when you need to initially build your table,
  2. An incremental query that appends only the latest day to your pre-built snapshot table (ie, if the query runs daily, it would append data for the last full day). This way you would only be scanning the data you are missing in your snapshot and nothing more.

If you are rebuilding your snapshot table daily, running an incremental query would reduce your query cost from ~100GB to ~10GB since you would only need to scan the current month partition.

sudden crow
#

Hi Syd, thank you for the thoughtful suggestions! 🙌 We’re working on generating a 2024 year wrapped for Stellar wallets.

Our metrics have expanded significantly to provide a detailed view of user activity. Here’s the latest query we’re using:

  account,
  total_transactions,
  net_sent AS total_sent_amount,
  net_received AS total_received_amount,
  unique_wallet_transfers,
  FORMAT_DATE('%Y-%m-%d', most_active_day) AS most_active_day, 
  highest_daily_transaction_count AS most_active_day_count,
  most_active_month,
  highest_monthly_transaction_count AS monthly_transaction_count,
  top_interaction_wallet,
  total_interaction_count,
  top_5_transactions_by_category,
  total_selling_amount_in_xlm,
  total_buying_amount_in_xlm,
  net_pnl_in_xlm,
  token_balance,
  FORMAT_DATE('%Y-%m-%d', first_transaction_date) AS first_transaction_date,
  FORMAT_DATE('%Y-%m-%d', last_transaction_date) AS last_transaction_date,
  time_on_chain_days
FROM
  `blip-444620.stellar_wrapped.annual_account_summary_2024`
WHERE
  account = @account;```

This query costs ~731.84 MB per wallet, which is reasonable considering the increased metrics. We’ve also completed a one-time query to create the snapshot table by combining multiple tables (optimized with selected columns), processing ~2TB cumulatively. The final table has ~2.8 million rows, 760.48 MB of logical bytes, and 234.08 MB of compressed bytes. The cutoff date for this snapshot is December 18, though we plan to append more data before the public launch. Per your suggestion, if we decide to go live on the 20th, I will add two days of data scanning for only one month. 

Even if 10,000 users interact with Stellar Wrapped, the estimated cost would be ~7.5 TB of processed data, or ~$50—a manageable expense. Additionally, once the query results are generated, they’re stored in MongoDB. Any repeat requests fetch data directly from MongoDB instead of querying BigQuery again, minimizing recurring costs. And as you mentioned, we can consider buying capacity reservations for pre-purchase slot time to reduce this further. This will allow us to base costs on execution time (currently only 2 seconds, but the minimum is one minute) rather than bytes processed. This approach should make 10,000 expected queries cost around ~$7, a minimal expense.

Thanks again for sharing your insights! They’ve been invaluable as we work on scaling and optimizing the backend. 😊
calm bay