#tpu-research-cloud | Google Developer Community | Page 2

strange quartz Sep 2, 2025, 1:57 PM

#

Used the web console, could that be it? Also, isn’t --worker=all needed only for multi-host?

bitter nebula Sep 3, 2025, 9:49 AM

#

yeah in general we recommend that folks use gcloud, give that a try: https://cloud.google.com/tpu/docs/v6e-intro#set_up_jax_using_queued_resources

carmine bay Sep 13, 2025, 7:30 PM

#

Hi everyone — hope you’re doing well. I’m a recent TRC grantee setting up workloads and wanted to sanity-check pricing to avoid unexpected spend. I’d really appreciate any firsthand experiences or pointers to official docs:

TPU VM pricing (Spot/Preemptible)
• Are TPU VM instances (e.g., v4/v5e) fully covered under TRC when launched as Spot/Preemptible?
• If there’s a separate “host VM” component (vCPU/RAM) in TPU VM, is that covered too, or billed separately?
What still costs money outside the TPU allocation?
• GCS buckets (storage, API ops, lifecycle)
• Network egress (inter-region / to Internet)
• Persistent Disk attached to TPU VM
• External/static IPs, NAT, Load Balancers
• Cloud Logging/Monitoring, Artifact Registry, Pub/Sub, Cloud Build, etc.

Context: Region = [X], TPU type = [v5e-8 / v4-8].
Thank you in advance! And if this has been answered before, I’d be grateful for a link to the thread.

bitter nebula Sep 15, 2025, 7:39 AM

#

carmine bay Hi everyone — hope you’re doing well. I’m a recent TRC grantee setting up worklo...

Are TPU VM instances (e.g., v4/v5e) fully covered under TRC when launched as Spot/Preemptible?

When you register a project you'll get a notification with the quotas that you have access to - those are covered.

If there’s a separate “host VM” component (vCPU/RAM) in TPU VM, is that covered too, or billed separately?

The host VM architecture has been deprecated for a long time (it might not even be usable anymore but I'm not sure) - you should be using TPU VM via Queued Resource w/ TRC quota so this wouldn't be an issue.

What still costs money outside the TPU allocation?

TRC covers the Cloud TPU service so anything that you use otherwise is subject to regular billing.

vivid plank Sep 15, 2025, 7:38 PM

#

What do you guys usually do to use larger models, like 100B+, since the disk is like 90 GB per host?

lofty yarrow Sep 23, 2025, 6:57 AM

#

Hi! I've recently been training some LLMs on Google v5e and v4 TPUs. However, I have encountered NaN loss problems on v5litepod-64 in zone us-central1-a. Specifically, the model trains successfully for ~4000 steps and then the gradient and loss suddenly explode.

I tried to run training with exactly the same setup on v4-128 and did not encounter the same problem. I wonder:

if anyone else has encountered this problem before? Is this likely a hardware issue?
if there is a way to exclude certain TPU cores when training? (like the --exclude flag we typically have in Slurm systems)

Thank you for your help in advance! Please let me know if more details are needed:)

lofty yarrow Sep 23, 2025, 8:06 AM

#

vivid plank What do you guys usually do to use larger models like 100b + since the disk is l...

I believe most people use buckets to store large data files and ckpts

slate pond Sep 26, 2025, 4:12 AM

#

Where/how should I be contacting Google Research TRC regarding the sharing of research work through open source releases, blog posts, program feedback, or anything else as mentioned in the onboarding email?

bitter nebula Sep 26, 2025, 8:00 AM

#

slate pond Where/how should I be contacting Google Research TRC regarding the sharing of re...

The easiest way is probably just to reply to the email but there's also a form linked at the bottom of sites.research.google/trc/publications

#

For the record, if it's been indexed by Google Scholar and it mentions the program by name we may have already picked it up - probably searching for the paper name or the first author's surname on that page is the easiest way to tell

slate pond Sep 26, 2025, 8:18 AM

#

Ahhhh, thank you for the information. I missed the form you mentioned. New to the program and I've never shared any of my work before, so trying to figure out the best way to go about all that.

bitter nebula Sep 26, 2025, 8:19 AM

#

Yeah that link is fairly hard to find - the publication page was a lot shorter when we added it. 🙂

slate pond Sep 26, 2025, 8:22 AM

#

Haha, that's a good sign at least. I hope I can contribute something novel to the program.

eternal tide Sep 29, 2025, 2:22 PM

#

Hello, TRC team.

The trc email said free access to those resources

1 on-demand Cloud TPU v5e chips in zone europe-west4-a
1 on-demand Cloud TPU v5lite chips in zone europe-west4-b

may I know whether the v5lite-32/64 can be covered in this case? Thanks for your help!

bitter nebula Sep 29, 2025, 2:51 PM

#

eternal tide Hello, TRC team. The trc email said free access to those resources 1 on-demand...

you'd need to take that up with trc-support@google.com

eternal tide Sep 29, 2025, 2:59 PM

#

Ok

rustic ginkgo Oct 12, 2025, 2:23 PM

#

https://docs.google.com/presentation/d/1TZMmXumbaCf4PEIHcNVOuhwniLFJyOvLjPoZrnanb78/edit?usp=sharing

Google Docs

Coma: Boost ML Research with TPU Research Cloud

Manila Alpha Romer Coma Associate Engineer, Kollab Boost ML Research with TPU Research Cloud 2025

#

Excited to share here my slides at Google's Devfest in Manila for a talk about TPUs and the TRC program

rustic ginkgo Oct 17, 2025, 10:55 AM

#

The TPU talk actually got accepted to two events - last week (10m lightning) and this week (30m). These slides are a bit more detailed:

https://docs.google.com/presentation/d/1C6ccqrJz--90Po2eo1G8F4Uko0TOejT6SNwInZswiJ4/edit?usp=sharing

Google Docs

Coma: Supercharge your ML Research with Google’s TPU Research Cloud

Supercharge your ML Research with Google’s TPU Research Cloud Alpha Romer Coma Associate Engineer, Kollab

rich frigate Oct 21, 2025, 12:09 PM

#

Hello!
I submitted an application to TRC two weeks ago and didn't get any response, neither approval nor denial, so I decided to post here to check. (I checked my spam folder too; nothing's there.)

bitter nebula Oct 21, 2025, 1:11 PM

#

Not sure that's something that can be covered in detail on discord but generally speaking no email == not approved, unfortunately

rich frigate Oct 21, 2025, 1:28 PM

#

so it doesn't mean that my application was lost or ignored, just denied. correct?

bitter nebula Oct 21, 2025, 1:59 PM

#

I suppose anything is possible but it's not likely the case - you could try emailing trc-support@google.com to see if they can verify

rich frigate Oct 21, 2025, 2:12 PM

#

Will do, thanks

junior storm Oct 22, 2025, 11:16 AM

#

Hello, I have a question. I created a TPU in the zone I was granted through TRC, following the usual steps, but my code cannot find the TPU on the VM. What could be the cause?

#

Thanks in advance for answering me.

bitter nebula Oct 22, 2025, 11:57 AM

#

junior storm Hello, I have a question. I created a TPU in the zone I was granted through TRC,...

There's not really enough info here to be able to establish a root cause - if it's throwing a particular error the best bet is to search for that to see if anyone else has mentioned it / how they resolved it. Failing that you might provide the same info here so anyone who has any insight can chime in.

junior storm Oct 22, 2025, 11:58 AM

#

I'm starting over again, if I find anything else I'll let you know, thanks.

bitter nebula Oct 22, 2025, 11:59 AM

#

at a super-high level a lot of times that ends up being because the code is bugged (eg using an api incorrectly) or the tpu is unreachable (preempted, maintenance event etc)

prime bear Nov 6, 2025, 11:59 AM

#

Hey! I am a bit confused with global batch size calculation on v6e-8 (for example). So jax.local_device_count("tpu") shows 8 for a v6e-8. I am using MaxText that has a per_device_batch_size parameter. So when I use 64 as per_device_batch_size does it mean, that my global batch size is 64 x 8 = 512? If I would use a v6e-64, is it then 64 x 64 = 4096?

hoary swallow Nov 6, 2025, 7:43 PM

#

Hi, I am trying to create a tpu v4 on demand instance but when I run this, I get an error, what am I doing wrong?

gcloud compute tpus queued-resources create tpuv4 \
    --node-id v4-32-ond \
    --zone us-central2-b \
    --accelerator-type v4-32 \
    --runtime-version tpu-ubuntu2204-base
ERROR: (gcloud.compute.tpus.queued-resources.create) NOT_FOUND: Cloud TPU was unable to complete the operation.

weak geyser Nov 6, 2025, 7:52 PM

#

Hey all – excited to get working with the TPUs. I was trying some of the hello world stuff on Colab but unfortunately couldn't get anything to run. I think the tutorials may be based on a v2 TPU hardware that isn't live anymore, as I can only select v5. Any tips on getting started here? Also happy to move to just running the gcloud API to try this. https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb#scrollTo=Zo-Yk6LFGfSf

junior storm Nov 7, 2025, 3:12 AM

#

Hello, I have a question about GCP credits. I received a trial credit for GenAI App Builder.
What are 32K credits? And how do I check

rich musk Nov 7, 2025, 4:24 AM

#

junior storm Hello, I have a question about GCP credits. I received a trial credit for GenAI ...

On ur profile section

bitter nebula Nov 7, 2025, 9:22 AM

#

hoary swallow Hi, I am trying to create a tpu v4 on demand instance but when I run this, I get...

it's not possible to answer this without details about your project that shouldn't be shared on discord, recommend contacting trc-support@google.com

bitter nebula Nov 7, 2025, 9:24 AM

#

weak geyser Hey all – excited to get working with the TPUs. I was trying some of the hello w...

we recently pulled the colab links out of TRC's welcome documentation as, to your point, they referenced no-longer-supported generations of hardware - using gcloud is probably the easiest path unless someone knows of some other official notebook that could replace it (unfortunately I don't)

bitter nebula Nov 7, 2025, 9:26 AM

#

junior storm Hello, I have a question about GCP credits. I received a trial credit for GenAI ...

just as an fyi to future readers: TRC has nothing to do with GCP credits, please don't worry if you've been approved for the program but don't have any credits issued in the UI 😅

bitter nebula Nov 7, 2025, 9:31 AM

#

prime bear Hey! I am a bit confused with global batch size calculation on v6e-8 (for exampl...

Did you ever figure this out? I'm not sure what MaxText considers to be a "device" but I'd assume it's either a chip (in which case 512 sounds right) or a host (8 chips/host -> 64)...

#

(https://docs.cloud.google.com/tpu/docs/v6e#configurations for context)

prime bear Nov 7, 2025, 9:55 AM

#

bitter nebula Did you ever figure this out? I'm not sure what MaxText considers to be a "devic...

I found one relevant information in this readme ( https://github.com/AI-Hypercomputer/maxtext/blob/69ed0c5d29aa25c61fd4c31a666ef35cf345d30e/docs/reference/architecture_overview.md?plain=1#L63) it says "Sets the local batch size per accelerator chip." But I will ask as well on GitHub 🙂
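
For future readers, a minimal sketch of the arithmetic being discussed, assuming (per the readme line quoted above) that per_device_batch_size is counted per accelerator chip and that nothing else, such as gradient accumulation, multiplies the batch. The function name is hypothetical and is not part of MaxText.

```python
def global_batch_size(per_device_batch_size: int, num_chips: int) -> int:
    """Global batch size when the per-device batch is counted per chip.

    Assumption: no gradient accumulation or other multiplier is applied.
    """
    if per_device_batch_size <= 0 or num_chips <= 0:
        raise ValueError("batch size and chip count must be positive")
    return per_device_batch_size * num_chips


# The two slice sizes from the question: v6e-8 (8 chips) and v6e-64 (64 chips).
for slice_name, chips in (("v6e-8", 8), ("v6e-64", 64)):
    print(slice_name, global_batch_size(64, chips))
```

Under that assumption the numbers in the original question (512 and 4096) follow directly; if MaxText counted a host as the device instead, the result would be smaller by the chips-per-host factor.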

weak geyser Nov 7, 2025, 4:31 PM

#

bitter nebula just as an fyi to future readers: TRC has nothing to do with GCP credits, please...

This is also useful lol. I signed up for free trial credits then somehow used all of them in 24hrs

fiery siren Nov 20, 2025, 2:33 PM

#

any idea why I can't create the TPU? (also I checked, all of the quota usage is at 0)

gcloud compute tpus tpu-vm create node-tpu --zone=us-central1-a --accelerator-type=v5litepod-64 --version=v2-tpuv5-litepod --preemptible
Create request issued for: [node-tpu]
Waiting for operation [projects/tpu-proj-ai-detection/locations/us-central1-a/operations/operation-1763648567926-644076e665ecc-2a7ceae1-85b1a8b4] to complete...failed.
ERROR: (gcloud.compute.tpus.tpu-vm.create) {
"code": 8,
"message": "You have reached IN_USE_ADDRESSES limit. [EID: 0xbf338a209239294b]"
}

bitter nebula Nov 20, 2025, 4:36 PM

#

sounds like a quota issue, I think you can request an increase via the GCP console

fiery siren Nov 20, 2025, 5:27 PM

#

i dont know which quota it is because all quotas i found have usage on 0 and limit at 8

tulip kestrel Nov 21, 2025, 12:04 PM

#

Got a problem. One of my Tpus is stuck and I cannot ssh to it. And it's a pod so idk if it's even possible to restart it. And I got stuff saved there. Is it possible to somehow just take the stuff from there or unstick it? It happened when I was SSHed into it and my PC crashed; when I started the PC again and tried to SSH, it said the connection timed out.

bitter nebula Nov 21, 2025, 2:13 PM

#

to the best of my knowledge no, but maybe someone else on the server has found a way...

tulip kestrel Nov 21, 2025, 3:07 PM

#

bitter nebula to the best of my knowledge no, but maybe someone else on the server has found a...

Ouch :((

vague osprey Nov 21, 2025, 8:45 PM

#

Hello everyone, I am new here, what is TPU about?

tulip kestrel Nov 22, 2025, 10:10 AM

#

vague osprey Hello everyone, I am new here, what is TPU about?

TPUs are like GPUs or CPUs, but they work a lot differently. TPU Research Cloud provides those TPUs to people so they can train or run experiments on them. very cool stuff ^^
you can learn how TPUs work here > https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm

vague osprey Nov 22, 2025, 10:15 AM

#

tulip kestrel TPUs are like GPUs or CPUs, but they work a lot differently. TPU Research Cloud...

Thank you

tulip kestrel Nov 22, 2025, 10:17 AM

#

vague osprey Thank you

np!

tulip kestrel Nov 23, 2025, 11:38 AM

#

tulip kestrel Got a problem. One of my Tpus is stuck and I cannot ssh to it. And it's a pod so...

Even gemini was not able to help. So I'm just gonna delete the tpu and start fresh. billy_sad

twin zenith Nov 29, 2025, 6:02 AM

#

Soo, I don't know if I posted in the right spot, but, I think I stumbled across an architecture that might, like, nuke GPU architecture and TPU's for LLM design. I really want someone to tear this idea down, cause the math keeps saying it amplifies LLM speed by 200% and shrinks it exponentially, while allowing for rapid iteration. I can't have been the only person to have stumbled across this, I feel like.

#

https://github.com/Kaleaon/Pentary is the repo I posted the preliminary ideas, research, etc, on.

bitter nebula Dec 1, 2025, 12:28 PM

#

twin zenith Soo, I don't know if I posted in the right spot, but, I think I stumbled across ...

yeah, probably not the best channel for this, I see you already posted in eg #ai-general which I would expect to be more likely to generate some interest in the idea though

novel oyster Dec 3, 2025, 1:17 AM

#

Can anyone tell me what configuration should I use for running JAX on v6e-64?

tidal garnet Dec 3, 2025, 1:21 AM

#

Good evening.

novel oyster Dec 3, 2025, 7:37 AM

#

Retrying at least 10 times for v6e-64 vm

#

😭

uneven seal Dec 3, 2025, 8:00 PM

#

Hey! Glad to meet all of you guys who're in the same boat as testing out TPUs for research. I've been looking at a ton of the guides/documentation, and wanted to ask about some of the cool stuff you've been doing that you haven't written up/posted about. Any cool challenges you had to work through? etc.

uneven seal Dec 3, 2025, 8:03 PM

#

novel oyster Can anyone tell me what configuration should I use for running JAX on v6e-64?

#

If you mean by the type of VM, you should be using v2-alpha-tpuv6e for the best compatibility, at least from the documentation I've been reading for running some test workloads

novel oyster Dec 4, 2025, 6:59 AM

#

Also, is it normal to get internal error while creating tpu vm even with spot + queue?
I've used this command to queue 64 v6e resources but it just keeps failing.

gcloud compute tpus queued-resources create qr-v6e64-ew4a-spot \
  --node-id node-64 \
  --zone europe-west4-a \
  --accelerator-type v6e-64 \
  --runtime-version v2-alpha-tpuv6e \
  --valid-after-duration 1m \
  --spot \
  --internal-ips

bitter nebula Dec 4, 2025, 9:04 AM

#

how is it failing / are you getting an error code back?

novel oyster Dec 4, 2025, 9:48 AM

#

bitter nebula how is it failing / are you getting an error code back?

I am only getting error code 13, internal error.
No description, just says error without much explanation

#

In every region where I have 64 v6e spot quota, any setup with more than 16 TPUs results in similar problems.

#

With console I do get something like this if I retry soon
Cloud TPU was unable to complete the operation. Please try again, or contact support if the problem persists. [EID: 0x369a052af37e209f]

bitter nebula Dec 4, 2025, 10:04 AM

#

❤️ opaque errors
sounds like it could be quota related though
(if you have access to them) can you try an older gen TPU to see if it works / fails in the same way?

#

if it is specific to v6e I'd guess that you might not have enough Hyperdisk quota, in which case you'd need to request an increase

keen crescent Dec 5, 2025, 1:19 AM

#

i think the trc people forgot to give me quota for us-central2-b

#

the quota page doesn't show i have quota for us-central2-b, the dashboard for creating the tpu doesn't even list the region, and the cli here doesn't work

bitter nebula Dec 5, 2025, 9:27 AM

#

keen crescent the quota page doesn't show i have quota for us-central2-b, the dashboard for cre...

if you received an email saying that you have the quota then you should have the quota...

#

could be a timing thing though, are you still having an issue?

keen crescent Dec 5, 2025, 3:47 PM

#

yes

#

i did "gcloud compute tpus tpu-vm create will-2 --zone=us-central2-b --accelerator-type=v4-32 --version=tpu-ubuntu2204-base --project=trctpu-123"

bitter nebula Dec 8, 2025, 8:30 AM

#

Probably worth contacting trc-support@google.com

keen crescent Dec 8, 2025, 7:53 PM

#

yeah i did

uneven seal Dec 9, 2025, 8:42 PM

#

are we able to use GKE with our quota of TPUs?

past flax Dec 10, 2025, 8:22 AM

#

past flax Dec 10, 2025, 8:25 AM

#

uneven seal are we able to use GKE with our quota of TPUs?

not right now, but eventually yes

hazy panther Dec 15, 2025, 9:55 PM

#

Hi, I received access to
32 on-demand Cloud TPU v4 chips in zone us-central2-b
and some other ones

but

trying to create a TPU VM using zone us-central2-b returns Permission denied on 'locations/us-central2-b' (or it may not exist).
Is this location deprecated or something?
and for the other zones I simply get Insufficient capacity
Is there a place to check zone capacity?

bitter nebula Dec 16, 2025, 8:07 AM

#

Is this location deprecated or something?

no, if you still can't access it you might want to email trc-support@google.com

Is there a place to check zone capacity?

also no, but you can use the Queued Resource API to queue until there is availability
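
As a rough illustration of the queued-resource route, here is a sketch that assembles a create command from Python using only flags that appear in commands posted earlier in this channel. The helper name and the example resource names are hypothetical, and the sketch only builds and prints the argument list; it does not run gcloud.

```python
import shlex
from typing import List


def queued_resource_create_cmd(
    name: str,
    node_id: str,
    zone: str,
    accelerator_type: str,
    runtime_version: str,
    spot: bool = False,
) -> List[str]:
    """Build the argv for `gcloud compute tpus queued-resources create`.

    Only flags already seen in this channel are used; nothing is executed.
    """
    cmd = [
        "gcloud", "compute", "tpus", "queued-resources", "create", name,
        "--node-id", node_id,
        "--zone", zone,
        "--accelerator-type", accelerator_type,
        "--runtime-version", runtime_version,
    ]
    if spot:
        cmd.append("--spot")  # request spot capacity instead of on-demand
    return cmd


# Hypothetical names; zone, type, and runtime are taken from earlier posts.
cmd = queued_resource_create_cmd(
    "my-qr", "my-node", "us-central2-b", "v4-32", "tpu-ubuntu2204-base"
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell; the request then waits in the queue until capacity is available rather than failing immediately.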

hazy panther Dec 17, 2025, 9:11 AM

#

Any recommendation on library for logging and train-run monitoring on TRC?

hazy panther Dec 17, 2025, 10:13 AM

#

I found a rly good repo called tpux. is it widely used?

#

I'm considering using it to help iteration speed. I want to edit my training loop on one VM but run the test between all of them. Doesn't seem like there's a simple way to do that using gcloud

uneven seal Dec 18, 2025, 7:08 PM

#

hazy panther I found a rly good repo called tpux. is it widely used?

let me know how it goes, i'd love to see more monitoring tools

hazy panther Dec 20, 2025, 7:43 PM

#

yes I got it working. tpux with Wandb

sinful spindle Jan 2, 2026, 11:15 AM

#

Hey, so i got access to the cloud tpus, but when i try to create a tpu vm it starts creating for a few minutes and then it shuts down, and when i see the logs it says this:

btw im on the free trial, and also i tried using both tpuv6e and v5, and i tried creating through the web interface and cli but its still shutting down

I already sent an email yesterday, but no response

sinful spindle Jan 2, 2026, 1:46 PM

#

just realized i sent them the wrong project number (in the email)💀

sinful spindle Jan 2, 2026, 2:05 PM

#

ok it works with v4-32

obtuse pilot Jan 3, 2026, 3:02 PM

#

Sorry @tropic topaz Google does not provide official support on this server
Please check #general message for the official support links

bitter nebula Jan 5, 2026, 8:58 AM

#

obtuse pilot Sorry @tropic topaz Google does not provide official support on this se...

to be sure, there are members of the TRC team that are on the server & contribute to this channel specifically, myself included.

trc-support@google.com is the official way to get support, though - in our view the channel is intended to be driven by the community.

quaint silo Jan 6, 2026, 1:29 AM

#

Hi, can anyone recommend online communities for getting help on PyTorch XLA related issues? I've spent a couple of days porting our existing CUDA code to work on the TPU cloud. It works, however it is not as efficient due to the lower memory bandwidth of TPUs. I don't want to waste too much time before the TRC timeline ends. Thanks in advance!

sturdy valley Jan 13, 2026, 1:54 PM

#

Can anyone help please? Am I doing it wrong, or are the TPUs out of stock for now?

bitter nebula Jan 14, 2026, 7:54 AM

#

Unfortunately no one is going to be able to offer much help based on the screenshot as it doesn't give any information as to why the failures are happening.

As a first step I'd recommend using the gcloud CLI as it will typically give more detailed error messages - if it's not something obvious (like out of capacity errors) then share the error message and someone might be able to help.

river nexus Jan 31, 2026, 12:33 PM

#

twin zenith Soo, I don't know if I posted in the right spot, but, I think I stumbled across ...

Any gains you get will be absolutely annihilated by a binary-pentary conversion layer.

Because the rest of the world operates on that level I'll finish absolutely ripping it to pieces in a minute

#

You also critically rely on experimental memristor parts that are prone to bit flips and other issues, especially at low voltage levels.

#

C=2.717ish

#

| Radix (b) | Efficiency constant (b / ln b) | Information per digit (log2 b) | Relative cost (vs. ternary) |
| --- | --- | --- | --- |
| 2 (binary) | 2.885 | 1.00 bits | 1.056 |
| 3 (ternary) | 2.731 | 1.58 bits | 1.000 |
| 4 (quaternary) | 2.885 | 2.00 bits | 1.056 |
| 5 (pentary) | 3.106 | 2.32 bits | 1.137 |
| 10 (decimal) | 4.343 | 3.32 bits | |

Good shot, but I'm afraid it's just not going to work. I've tried similar things myself.
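
For anyone who wants to check the figures, a small sketch that recomputes the columns above, assuming the efficiency constant is b / ln b and the relative cost is that constant divided by the ternary value. The function name is hypothetical.

```python
import math


def radix_cost(b: int) -> float:
    """Efficiency constant b / ln(b) for radix b (lower means cheaper)."""
    if b < 2:
        raise ValueError("radix must be at least 2")
    return b / math.log(b)


ternary = radix_cost(3)
for b in (2, 3, 4, 5, 10):
    cost = radix_cost(b)
    # Columns: radix, b / ln b, bits per digit, cost relative to ternary.
    print(f"{b:>2}  {cost:.3f}  {math.log2(b):.2f} bits  {cost / ternary:.3f}")
```

The last printed digit can differ from the table by one, since the table appears to truncate rather than round. Binary and quaternary come out identical because 4 / ln 4 = 2 / ln 2.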

#

I will spare you the rest of the problems as I don't want to make you cry, but suffice to say, the idea is not tenable, although it is novel. Keep at it.

naive wagon Jan 31, 2026, 12:44 PM

#

Has anyone used TPUs for inference? I would like to know the gains compared to standard GPUs in Colab.

placid scarab Feb 2, 2026, 11:02 AM

#

Hello, is anyone familiar with torch-xla able to confirm whether this is an appropriate way to measure inference runs on a TPU:

```python
for _ in range(num_warmup_runs):
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits

latencies_ms = []

for i in range(args.num_iterations):
    start_time = time.time()
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
        torch_xla.sync()

    xm.wait_device_ops()
    end_time = time.time()

    latencies_ms.append((end_time - start_time) * 1000)
```

#

xm.wait_device_ops() seems to be blocking the run indefinitely

marsh fractal Feb 2, 2026, 1:28 PM

#

placid scarab Hello, Is anyone familiar with torch-xla able to confirm whether this is an appr...

I can help! The issue is likely that xm.wait_device_ops() is deprecated/broken in newer torch-xla versions and can cause hangs.
The torch_xla.sync() call should already be sufficient to wait for TPU operations to complete - you probably don't need the wait_device_ops() at all.
Question: What version of torch-xla are you using? If it's recent (2.0+), just remove the xm.wait_device_ops() line and rely on torch_xla.sync() alone.

placid scarab Feb 2, 2026, 2:50 PM

#

marsh fractal I can help! The issue is likely that xm.wait_device_ops() is deprecated/broken i...

Hi Joshua, thanks for the help. I've left only torch_xla.sync() but I'm now observing another issue where the program doesn't terminate. Plus, I'm observing similar performance results with or without using torch_xla.sync(), which makes me wonder whether I even need it. Do you know why this might be?

marsh fractal Feb 2, 2026, 5:02 PM

#

placid scarab Hi Joshua, thanks for the help. I've left only torch_xla.sync() but I'm now obse...

Both issues suggest the sync isn't working properly. The program hanging and identical performance with/without sync usually means your tensors or model aren't actually on the TPU device, so there's nothing to synchronize.
I can either walk you through debugging the device placement and fixing the sync points, or if you share your full code snippet I can just rewrite the benchmarking section with proper TPU synchronization for you.
Which approach works better for you?

placid scarab Feb 2, 2026, 5:09 PM

#

Ah thanks! Here's a code snippet:

```python
device = torch_xla.device()

torch_model = torch_model.to(device)
torch_inputs = torch_inputs.to(device)

print(f"Running inference on {model}")
print(f"Warming up {model} with {num_warmup_runs} runs")
for _ in range(num_warmup_runs):
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
        torch_xla.sync(wait=True)

latencies_ms = []

for i in range(args.num_iterations):
    start_time = time.time()
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
        torch_xla.sync(wait=True)

    end_time = time.time()
    latencies_ms.append((end_time - start_time) * 1000)
```

Can you figure out what I'm doing wrong?

full arch Feb 2, 2026, 5:25 PM

#

placid scarab Ah thanks! Here's a code snippet: ```python device = torch_xla.device() ...

I can see the issue: you're calling sync(wait=True), but that parameter doesn't exist in torch_xla.sync(). The sync is likely failing silently, which explains both the hanging and the inconsistent timing.
Here's the thing: this needs a proper rewrite with correct sync semantics and proper timing placement. Rather than going back-and-forth in the thread, would you be open to discussing this privately? I can fix this properly for you and make sure your TPU benchmarking is accurate.
Want to move this to DMs so I can help you sort this out?

placid scarab Feb 2, 2026, 5:41 PM

#

The parameter does exist for the sync method: https://docs.pytorch.org/xla/release/r2.8/learn/api-guide.html#torch_xla.sync

#

Anyone able to point out where this is going wrong (without charging a fee) ?
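
One way to structure the measurement itself, offered as a framework-agnostic sketch and not as a diagnosis of the hang: run a few warmup iterations first (so compilation is not timed), then stop the clock only after a blocking barrier. Here `run_step` and `block_until_done` are hypothetical callables; on a TPU, `block_until_done` would wrap whatever blocking sync the installed torch-xla version documents (the snippet above uses `torch_xla.sync(wait=True)`).

```python
import statistics
import time
from typing import Callable, List


def measure_latencies_ms(
    run_step: Callable[[], object],
    block_until_done: Callable[[], None],
    num_warmup: int = 3,
    num_iterations: int = 10,
) -> List[float]:
    """Time run_step, stopping the clock only after block_until_done returns."""
    # Warmup: absorbs one-time costs such as compilation and caching.
    for _ in range(num_warmup):
        run_step()
        block_until_done()

    latencies_ms = []
    for _ in range(num_iterations):
        start = time.perf_counter()  # monotonic, higher resolution than time.time()
        run_step()
        block_until_done()  # without a barrier, only dispatch time is measured
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


# CPU stand-in so the sketch runs anywhere: the "step" is a plain sum and
# the barrier is a no-op because the work is already synchronous.
latencies = measure_latencies_ms(lambda: sum(range(10_000)), lambda: None)
print(f"median: {statistics.median(latencies):.3f} ms over {len(latencies)} runs")
```

If timings look the same with and without the barrier, one thing worth checking is whether something else in the loop (for example reading a value back to the host) is already forcing execution.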

tribal stag Feb 3, 2026, 4:16 AM

#

I have been "waiting for resources" for days on v4 tpus. sigh. Anyone else feel the same?

tired sluice Feb 3, 2026, 10:34 AM

#

tribal stag I have been "waiting for resources" for days on v4 tpus. sigh. Anyone else feel...

try smaller slices, v6e-8 works instantly for me in some regions

tribal stag Feb 4, 2026, 5:48 AM

#

I dont have quota for on demand v6e-8 in us-central2-b. I only have preemptible v6e in us-central1-a, but it got preempted instantly. lol.

quaint silo Feb 5, 2026, 4:07 AM

#

I am not sure if this helps anyone, but I've shared my experiences on migrating an LM training pipeline to run on TPUs here: https://dogac.dev/blog/2026/migrating-to-tpu/, maybe it helps you to migrate your existing pipelines and debug performance issues.

naive wagon Feb 7, 2026, 9:19 AM

#

quaint silo I am not sure if this helps anyone, but I've shared my experiences on migrating ...

Can we collaborate I do have some questions

quaint silo Feb 7, 2026, 4:36 PM

#

naive wagon Can we collaborate I do have some questions

Feel free to ask questions here directly

naive wagon Feb 7, 2026, 5:05 PM

#

quaint silo Feel free to ask questions here directly

I did read your blog on running TPUs in the cloud. However, the initialisation takes a lot of time, and I'm not sure how to utilise the ones provided by Colab. I would like a way to initialise them properly.

naive wagon Feb 8, 2026, 5:05 AM

#

naive wagon I did read your blog on running TPUs in the cloud. However, the initialisation ta...

Any suggestions @quaint silo

quaint silo Feb 8, 2026, 5:06 AM

#

I am not sure what you mean. I create a queued resource as the TPU documentation recommends and I usually get access to it within 5-10 minutes. If it is preempted, I retry.

waxen pilot Feb 9, 2026, 7:26 PM

#

This is great content, Dogacel! Thanks for sharing

chrome bough Feb 11, 2026, 2:33 PM

#

yo guys can you make a free tpu with infinite compute power and storage

#

that would be useful ty

#

||jk||

restive notch Feb 11, 2026, 6:28 PM

#

Does anybody know how the TRC team reacts to asking for more TPUs (or more advanced TPUs like the v7x)? I have a couple of projects that could benefit from the 3d topology (and optical switches). Plus, my v6e-16 cluster has been getting preempted a lot. I have a couple of proofs of concept already. Also, has anybody ever gotten a v7x or not? As far as I am aware, it should be available, but only in public preview. Would it be possible, or is the capacity being used for internal LLM training?

tropic falcon Feb 11, 2026, 6:59 PM

#

chrome bough yo guys can you make a free tpu with infinite compute power and storage

https://tenor.com/view/ohhh-duh-why-didnt-i-think-of-that-gif-21849807

#

:p

sullen rivet Feb 11, 2026, 8:10 PM

#

Hi all,
As part of TPU research cloud, I have been provisioned free access to v4-32 on-demand TPUs in us-central2-b. However, I am getting an issue when I try to create TPUs with that configuration. The error message is not that informative so I was wondering if I did something wrong.

Because I want the 32 chip allocation, am I supposed to create multiple TPU hosts myself instead? I can try to create 4 v4-8 VMs and maybe try to network them somehow but the cost display on the side of the screen is making me cautious. My understanding was that my original approach itself was supposed to create multiple VMs.

Any help would be appreciated here. Thanks!

vapid siren Feb 11, 2026, 10:26 PM

#

sullen rivet Hi all, As part of TPU research cloud, I have been provisioned free access to v...

I would try to allocate the same machine from the CLI with gcloud; for some reason the CLI error messages are often much more helpful than the web UI ones (you can ask an LLM to translate the screenshots into a gcloud command).

#

Because I want the 32 chip allocation, am I supposed to create multiple TPU hosts myself instead?
No, that should never be necessary, the TPU deployment thing also sets the correct env variables for JAX so that the hosts can find each other

sullen rivet Feb 11, 2026, 10:33 PM

#

vapid siren I would try to allocate the same machine from the CLI with gcloud; for some reas...

Thanks! I seem to be able to allocate spot TPUs through the UI but when I try on-demand it seems to have this issue even though I do have access to on-demand TPU chips based on the original email. I will try the cli though

chrome bough Feb 11, 2026, 11:14 PM

#

tropic falcon https://tenor.com/view/ohhh-duh-why-didnt-i-think-of-that-gif-21849807

yayy 😄 i believed in yall

bitter nebula Feb 12, 2026, 8:24 AM

#

restive notch Does anybody know how the TRC team reacts to asking for more TPUs (or more advan...

If you're able to provide a compelling plan for your research then an increase and/or extension might be possible, but TRC doesn't currently offer v7 FYI.

bitter nebula Feb 12, 2026, 8:28 AM

#

sullen rivet Thanks! I seem to be able to allocate spot TPUs through the UI but when I try on...

Hard to know exactly what's happening without the error response but if you're able to create spot v4 and not od in the same zone then it's most likely a stockout - spot devices often have better availability

sullen rivet Feb 12, 2026, 10:51 PM

#

bitter nebula Hard to know exactly what's happening without the error response but if you're a...

Thanks! I finally managed to get something on-demand provisioned through the UI, and I’m sure it’s part of the TRC grant I’ve been allotted (v4-32 in us-central2-b). I can’t exactly stop the TPU when it’s not in use because it’s not a single-device TPU, so I’m trying to see if there’s some mechanism to verify that I’m not being charged for this before I randomly see a really big charge tomorrow.

restive notch Feb 13, 2026, 1:17 AM

#

bitter nebula If you're able to provide a compelling plan for your research then an increase a...

I see. Are v5p-64 TPUs available then? I had a couple of ideas for the OCS and 3d topologies. Hopefully, I can test out, see if they work, and then publish a paper about it. If anybody is knowledgeable on TPU specifics, could they check these ideas?

The first one is using topology-aware Manhattan distance for routing in MoE models: trying to penalize/reward the model for using TPUs that are physically closer to each other. Likely, I would just anneal the penalty to try not to kill model performance, plus add some load balancing between experts. The big issue I can see is if the communication overhead between experts isn't large enough to justify the work. That being said, it could be a (mostly) free performance gain.
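
A minimal sketch of what such a distance term could look like, assuming chips are addressed by (x, y, z) coordinates on a torus with wraparound links. The 4x4x4 shape, the expert placement, and the linear annealing schedule are all hypothetical choices for illustration, not anything taken from TPU documentation.

```python
def torus_distance(a, b, dims):
    """Manhattan hop count between two chips on a torus with wraparound links."""
    total = 0
    for x, y, size in zip(a, b, dims):
        direct = abs(x - y)
        # Going "the other way round" the ring may be shorter.
        total += min(direct, size - direct)
    return total


def annealed_weight(step, total_steps, start):
    """Linearly decay the penalty weight from `start` to 0 over training."""
    return start * max(0.0, 1.0 - step / total_steps)


def routing_penalties(token_chip, expert_chips, dims, weight):
    """Per-expert penalty that could be subtracted from the router logits."""
    return [weight * torus_distance(token_chip, e, dims) for e in expert_chips]


dims = (4, 4, 4)                              # hypothetical torus shape
experts = [(0, 0, 0), (3, 0, 0), (2, 2, 2)]   # hypothetical expert placement
weight = annealed_weight(step=0, total_steps=1000, start=0.1)
print(routing_penalties((0, 0, 0), experts, dims, weight))
```

Note that (3, 0, 0) is only one hop from (0, 0, 0) because of the wraparound link, which is the main difference from plain Manhattan distance on a grid.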

The second one is using the 3D torus for helical ring attention. Theoretically, having a Hamiltonian path or multiple interleaved helices should make it possible to transfer the next KV block while attention is being computed on the current one, effectively increasing the usable bandwidth tremendously and thus the context length. Personally, I suspect the Gemini family of LLMs does something similar, and that's why they can have such large context windows.

The third one is arguably the most difficult. The hard part is the SparseCores. I cannot for the life of me find any good documentation on the SparseCores, even though it would be amazing for specific tasks. If any Google insiders familiar with the SparseCore functions, documentation, or details could help me out, it would be greatly appreciated. Anyways, the general idea is to use the SparseCore as a Locality Sensitive Hashing engine to retrieve the attention (Leaving the MXU for computation). The benefit of it is that you can use the SparseCore for Top-K relevant keys, then compute them on the MXU. While it could be incredible, I have no idea if the SparseCore would support this, so this is why it's the least likely out of all 3.

bitter nebula Feb 13, 2026, 10:42 AM

#

sullen rivet Thanks! I finally managed to get something on-demand provisioned through the UI ...

For TRC-enrolled projects it should generally not be possible to start a device for which you lack the necessary quotas but, yeah, probably a good idea to keep an eye on the billing if you're concerned.

fallen salmon Feb 14, 2026, 11:34 PM

#

Hi all, I'm running into an issue where I spin up a v6e-8 spot instance and it gets provisioned successfully, but when I login I can see that it has a placeholder container (fake_tensorflow:latest) and no access to actual TPU resources (TPU libraries show that only the CPU is available, /dev folder has no accel subfolders, etc).

Wondering if anyone else has encountered this. I've tried a variety of settings to boot up the instance (runtime: v2-alpha-tpuv6e or tpu-ubuntu2204-base, CLI: alpha or general, direct tpu-vm or queued resource, etc).

vapid siren Feb 15, 2026, 2:34 AM

#

I have seen the fake_tensorflow before as well, but was still able to communicate with the TPU. I'm also not sure if the chips are still exposed as device files under /dev. (I actually think I also had that problem before that I couldn't find the chip there). Have you tried just running jax.devices() to see if the accelerator is still found? What library calls did you use to try to find it?

fallen salmon Feb 15, 2026, 3:42 AM

#

vapid siren I have seen the fake_tensorflow before as well, but was still able to communicat...

Huh, I see, that's promising.
Yeah, I ran jax.devices() and it gave me the warning

```
Devices: [CpuDevice(id=0)]
Num devices: 1
```
`libtpu` is installed. I'm pretty much just installing vLLM for TPU and running sanity checks, and it's failing those because it can't find the TPU. Have you gotten a message like this before?

vapid siren Feb 15, 2026, 3:58 AM

#

Last time I had that problem I had selected the wrong software, but it looks like you already tried selecting a different image

#

Are you sure you have installed jax[tpu] and libtpu?
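
A small sanity check along those lines, as a sketch: it only reports whether the `jax` and `libtpu` Python modules can be found and which platforms `jax.devices()` returns. The function name `describe_accelerators` is made up for this example, and the module name `libtpu` is assumed to match the pip package of the same name.

```python
import importlib.util


def describe_accelerators() -> dict:
    """Report whether JAX and libtpu are importable and which platforms JAX sees."""
    report = {
        "jax_installed": importlib.util.find_spec("jax") is not None,
        # Assumption: the libtpu wheel installs a top-level module named "libtpu".
        "libtpu_installed": importlib.util.find_spec("libtpu") is not None,
        "platforms": [],
        "error": None,
    }
    if report["jax_installed"]:
        try:
            import jax
            report["platforms"] = sorted({d.platform for d in jax.devices()})
        except Exception as exc:  # backend initialisation can fail in several ways
            report["error"] = str(exc)
    report["tpu_visible"] = "tpu" in report["platforms"]
    return report


print(describe_accelerators())
```

If `tpu_visible` is false while `jax_installed` is true, the install (for example a plain `jax` wheel instead of `jax[tpu]`) is a more likely culprit than the VM image.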

fallen salmon Feb 15, 2026, 4:34 AM

#

vapid siren Last time I had that problem I had selected the wrong software, but it looks lik...

Well, I just spun up another spot instance to replicate the issue, and now JAX is finding the TPU 🫠
I think you're right, it must just have been a package installation issue on my end. I double-checked and the fake_tensorflow container is on this one too. Thanks for letting me know that it was a red herring!

vapid siren Feb 15, 2026, 4:36 AM

#

fallen salmon Well, I just spun up another spot instance to replicate the issue, and now Jax i...

I hope to one day find out what the fake_tensorflow is about, still sometimes confuses me when I look at ps aux.

fallen salmon Feb 15, 2026, 4:39 AM

#

Really appreciate your help! It's also good to know that the /dev/accel thing was bad/outdated information and those folders don't exist anymore

vapid siren Feb 15, 2026, 4:43 AM

#

Always happy to help! Can always DM me if I’m not looking at the channel

sullen rivet Feb 20, 2026, 12:05 PM

#

Hi, Is there someplace we can reach out to about possibly extending the free cloud TPUs that we receive? I have about 6 days left before my allocation expires and have emailed TRC support but I guess I’m generally seeking advice on how to go about this/best way to mitigate costs after the allocation expires. Even a more limited set of configurations/zones would be helpful here. Thanks!

fallen salmon Mar 4, 2026, 3:43 PM

#

~~Has anyone else run into this issue, usage of quota shown as 100% despite not having any TPUs or queued requests in the console?~~
Leaving for reference. It resolved to 25% around 10 min after deleting old suspended queued requests.
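
A sketch of how leftover requests like these could be spotted from the CLI output. It assumes the JSON printed by `gcloud compute tpus queued-resources list --zone=ZONE --format=json` has a `name` and a nested `state.state` field per entry; the sample data, project name, and request names below are hypothetical, and the delete commands are only printed, not executed.

```python
import json

# Hypothetical sample of the list output; the field layout is an assumption.
SAMPLE_LIST_OUTPUT = json.dumps([
    {"name": "projects/my-project/locations/europe-west4-a/queuedResources/old-run-1",
     "state": {"state": "SUSPENDED"}},
    {"name": "projects/my-project/locations/europe-west4-a/queuedResources/current-run",
     "state": {"state": "ACTIVE"}},
    {"name": "projects/my-project/locations/europe-west4-a/queuedResources/old-run-2",
     "state": {"state": "FAILED"}},
])


def stale_queued_resources(list_json, stale_states=("SUSPENDED", "FAILED")):
    """Return the full names of queued resources in a state that may still hold quota."""
    stale = []
    for item in json.loads(list_json):
        state = item.get("state", {}).get("state", "UNKNOWN")
        if state in stale_states:
            stale.append(item["name"])
    return stale


def delete_command(full_name):
    """Build the gcloud command that deletes one queued resource."""
    parts = full_name.split("/")
    zone, resource_id = parts[3], parts[5]
    return ["gcloud", "compute", "tpus", "queued-resources", "delete",
            resource_id, f"--zone={zone}", "--force"]


for name in stale_queued_resources(SAMPLE_LIST_OUTPUT):
    print(" ".join(delete_command(name)))
```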

Screenshot_2026-03-04_at_10.42.27_AM.png

marsh fractal Mar 9, 2026, 7:33 AM

#

Hey, I can help with this. Code 13 usually happens because of quota limits, region capacity issues, or network config (like external IP restrictions) when using free credits. What region are you trying to create the TPU in, and are you attaching it to a VM with an external or internal IP?

marsh fractal Mar 10, 2026, 7:36 AM

#

Hey! I’m actually not an AI lol, just someone who works with this stuff and tries to help when I can.
Since you’re using europe-west4-a with internal IPs, the code 13 error could be coming from TPU quota limits on free credits or capacity issues for larger TPU types in that zone. I’ve helped troubleshoot similar setups before.
If you want, you can DM me the exact config you’re using and the TPU type you’re trying to create. I can take a quick look and help you get it working. If it ends up being something more involved, I can also help you set it up properly.

restive notch Mar 14, 2026, 12:50 AM

#

I would actually love to hear if you guys ended up fixing this

obtuse pilot Mar 15, 2026, 4:24 AM

#

Hey @gleaming sierra! Just a friendly reminder to keep the chat relevant and easy to read. We need to avoid spam, nonsensical messages, and excessive emoji use to keep the conversation flowing for everyone. Your recent messages that violated this rule have been deleted. Thanks for understanding!

knotty mirage Mar 18, 2026, 11:38 PM

#

Hi, I am trying to test-drive the TPUs, but I have not yet been able to get an on-demand TPU (queued for half a day) - and spot instances (tried multiple zones) seem to get preempted within minutes. Is that expected behaviour? Because if it is, I would probably not invest any more time into this. Any advice?

vestal coral Mar 19, 2026, 1:10 AM

#

Hello, I am a developer of a large language model based on SNN. I am interested in what the token rate per second will be on TPU and how to properly optimize it. My model has 618 million parameters.

#

I'm just getting very slow speeds on server GPUs: 0.3 tokens/s.

hazy halo Mar 19, 2026, 8:21 AM

#

knotty mirage Hi, I am trying to test-drive the TPUs, but I have not yet been able to get an o...

I'm not sure if my experience applies here, as it was quite some time ago. From what I know, preemptible tpu being preempted often is a thing.

#

Maybe not that bad, but preemptible were not worth the effort for training back then

knotty mirage Mar 19, 2026, 9:34 AM

#

Hm. Though why do they even bother with the TRC program then? If this is a normal experience, I cannot imagine that many people are likely to use or recommend TPUs going forward.

bitter nebula Mar 19, 2026, 3:08 PM

#

knotty mirage Hm. Though why do they even bother with the TRC program then? If this is a norma...

We do it because we think the mission is important.

Over the years TPUs have become increasingly popular, and with modern coding agents the technical barriers to using them have practically vanished. So, unfortunately, what you have today is a time of huge demand and relatively constrained supply (and that isn't unique to us, obviously). We understand that things could be better, and we try to make positive changes to that end whenever the opportunity arises, but, yeah, it's hard out there right now.

#

And on that note, thanks to anyone who sees this that has stuck with us through the rough times - we'd be nothing without you! ♥️

bitter nebula Mar 19, 2026, 3:16 PM

#

hazy halo I'm not sure if my experience applies here, as it was quite some time ago. From ...

Anecdotally it's usually easier to get a spot (=preemptible) instance than an on-demand one but it's harder to keep unless you're targeting lower-utilization devices/zones/times of day and also have a decent amount of luck

vestal coral Mar 19, 2026, 3:16 PM

#

bitter nebula We do it because we think the mission is important. Over the years TPUs have be...

How well do your chips work with SNN models?

bitter nebula Mar 19, 2026, 3:24 PM

#

vestal coral How well do your chips work with SNN models?

Sorry, I don't have any data on that in particular, if you don't get a response here I'd probably look at e.g. sites.research.google/trc/publications (or ArXiv directly) to see if you can spot some relevant papers and reach out to the authors

vestal coral Mar 19, 2026, 3:25 PM

#

bitter nebula Sorry, I don't have any data on that in particular, if you don't get a response ...

Yes, thank you, I just have my server with 4x 3090s, they can't handle the load, but I heard about your chips and thought maybe they would perform better.

vestal coral Mar 19, 2026, 3:26 PM

#

bitter nebula Sorry, I don't have any data on that in particular, if you don't get a response ...

It just turns out that my model is a bit demanding on hardware, even though it only has 700 million parameters.

river adder Mar 23, 2026, 5:33 AM

#

yo guys where's the v5e slot i cant see it anywhere

bitter nebula Mar 23, 2026, 7:50 AM

#

I've removed your initial post, please don't share information specific to your project on Discord.

This is a community-led channel, not an official outlet for contacting TRC support - if you have questions about something specific to your project please email trc-support@google.com, or if it is something more general feel free to repost here without the screenshots etc.

grave orchid Mar 23, 2026, 7:51 AM

#

Thank you! I didn't know.
I'll be careful next time.

rotund scroll Mar 24, 2026, 1:38 AM

#

Hello. I am new here and earlier I thought you could use Colab for using your tpu grant but is that not true? Do you have to use Google Cloud Console?

grave orchid Mar 24, 2026, 8:01 AM

#

rotund scroll Hello. I am new here and earlier I thought you could use Colab for using your tp...

Yes

bitter nebula Mar 24, 2026, 8:08 AM

#

rotund scroll Hello. I am new here and earlier I thought you could use Colab for using your tp...

To the best of my knowledge this is still possible but requires some setup that isn't particularly well-documented; it might be worth asking your LLM of choice to investigate if you haven't already.

tame grotto Mar 24, 2026, 10:07 PM

#

Thanks for the ping!

outer totem Mar 28, 2026, 3:14 PM

#

Hi everyone, I tried spinning up every combination of region and Spot VM type available in my allocation, however they all end up failing with the error code 13 after being provisioned. I see earlier messages about how this could be related to TPU quota limits and while this has happened to me before within specific regions, it's never been across the entire allocation. Are other people running into this today as well?

bitter nebula Mar 30, 2026, 10:36 AM

#

outer totem Hi everyone, I tried spinning up every combination of region and Spot VM type av...

I had a chat with the support folks about this and it is actively being investigated, seems to be affecting multiple projects

opaque berry Mar 30, 2026, 1:27 PM

#

bitter nebula I had a chat with the support folks about this and it is actively being investig...

@outer totem and I found a fix by passing a labels argument (set to any value) when creating a TPU, for example:

gcloud compute tpus queued-resources create \
  tpu-v6e-8-0 \
  --node-id=tpu-v6e-8-0 \
  --zone=europe-west4-a \
  --accelerator-type=v6e-8 \
  --runtime-version=v2-alpha-tpuv6e \
  --labels='a=b' \
  --spot

This seems like a new bug.

rotund scroll Mar 30, 2026, 2:55 PM

#

I had the same problem! But I was trying on demand and it failed.

fallen salmon Mar 30, 2026, 3:05 PM

#

opaque berry @outer totem and I found a fix by passing a `labels` argument (set to ...

Thank you, passing the labels argument fixed it for me as well

jaunty vine Apr 1, 2026, 8:18 PM

#

talk to me nice !!

compact vine Apr 11, 2026, 1:56 PM

#

opaque berry @outer totem and I found a fix by passing a `labels` argument (set to ...

Not working for me. Is this issue persisting for anybody else too? Has anybody found a workaround?

compact vine Apr 11, 2026, 1:56 PM

#

bitter nebula I had a chat with the support folks about this and it is actively being investig...

any updates ?

bitter nebula Apr 11, 2026, 4:35 PM

#

resolved since the 31st, but also if the flag didn't work then it isn't the same issue afaiaa

trim nexus Apr 19, 2026, 9:56 PM

#

I am not able to spin up TPU instances past a certain size despite having the quota for it as my Hyperdisk Balanced Capacity quota is too low. However, requests to increase it have been auto-denied, has anyone else experienced this issue?

full arch Apr 20, 2026, 11:29 PM

#

trim nexus I am not able to spin up TPU instances past a certain size despite having the qu...

It sounds like your TPU quota may be available, but the blocking issue is the Hyperdisk Balanced Capacity quota, which TPU instances also depend on for attached storage. Auto-denials usually happen when the request doesn’t match recent usage history, region limits, or project billing/activity signals.
Which GCP region are you trying to create the TPU in, and what TPU size/type are you requesting?

tulip kestrel Apr 22, 2026, 4:56 PM

#

Seen new v8s tpus? They are amazing!!! Hope someday on TRC too. (ik not soon. Cus we recently got the v5s generation. But still mmm gona be good haha)

stiff grail Apr 25, 2026, 6:08 PM

#

compact vine Not working for me. Is this issue persisting for anybody else too? Has anybody f...

Same for me, a queued spot v6e >16 chip slice is waiting -> provisioning -> suspending -> failed with err code 13 in europe-west4-a and us-east1-d

stiff grail Apr 25, 2026, 6:44 PM

#

Is there a workaround?

stiff grail Apr 25, 2026, 11:46 PM

#

stiff grail Is there a workaround?

Multislice fails with same error (4xv6e-16)

tulip kestrel Apr 27, 2026, 8:27 AM

#

Is the TRC team on vacation or taking a week off? (idk if there is a holiday in the USA) I sent a message 4 days ago and haven't got any response back. 😔

bitter nebula Apr 27, 2026, 10:07 AM

#

It's not likely that anyone is going to discuss their working schedules on an open Discord server, but worth noting that 2 of the last 4 days were the weekend...

#

Anyway I'll mention it to them but would generally recommend a bit of patience as it is a very small group of folks and sometimes the amount of inbound emails can be intense

tulip kestrel Apr 27, 2026, 10:53 AM

#

bitter nebula It's not likely that anyone is going to discuss their working schedules on an op...

Oh. Okay. I didn't mean to discuss working schedules, sorry. Was just asking if there was a weekend or something. Sorry 😅

tulip kestrel Apr 27, 2026, 10:53 AM

#

bitter nebula Anyway I'll mention it to them but would generally recommend a bit of patience a...

Ye. Just thought that my message was in spam cus they were usually very quick in responses. ^^

#

But thanksss for info :))

jaunty vine Apr 27, 2026, 4:44 PM

#

hello everyone !

tulip kestrel Apr 27, 2026, 5:20 PM

#

jaunty vine hello everyone !

Ello

tulip kestrel Apr 27, 2026, 6:40 PM

#

bitter nebula Anyway I'll mention it to them but would generally recommend a bit of patience a...

Also, many thanks for mentioning it to them. I didn't want to disturb their work, I just panicked a bit because I got a message that my TPUs are ending soon (in a week) and my training run hasn't finished yet 😅 hehe.

spice willow May 3, 2026, 2:12 AM

#

Would a digital twin that does real-time nv scanning be useful? I'm working on something like that currently.

jaunty vine May 3, 2026, 12:31 PM

#

let me get credits for Huawei's research page im gettin them in my bag 🎒

#

👿 🦹‍♂️

jaunty vine May 3, 2026, 12:54 PM

#

bitter nebula Anyway I'll mention it to them but would generally recommend a bit of patience a...

set up boss im readyyy

spice willow May 3, 2026, 3:48 PM

#

bitter nebula Anyway I'll mention it to them but would generally recommend a bit of patience a...

What if you possibly have a legitimate pre digital twin thats very unique

#

I have a roughly 208-page paper detailing a potential framework for in-situ LCVD at 10 mK.

meager fog May 4, 2026, 4:34 AM

#

Hiii everyone

nimble stream May 11, 2026, 9:13 AM

#

Hi, my queued resource has been waiting 44h. Can I DM details?

spice willow May 19, 2026, 4:47 AM

#

bitter nebula It's not likely that anyone is going to discuss their working schedules on an op...

Question: is there any place to get a technical look at a particular project/theoretical idea?

bitter nebula May 19, 2026, 7:29 AM

#

TRC / GDM doesn't provide anything like that but there are other discord servers etc that might be able to help

#

I'd say you could post here and ask for feedback from the community but traffic is pretty low so you might not get anything useful from it

spice willow May 19, 2026, 10:03 AM

#

bitter nebula I'd say you could post here and ask for feedback from the community but traffic ...

I built a public reproducible simulation/package for a staged cryogenic + in-situ LCVD concept called QTA.

Repo:
https://github.com/cakeisalie89/qta-submission-package

Why I’m posting here:
The package includes a Python simulation, gate logic, Monte Carlo outputs, CSV artifacts, manifest/hash checks, and a manuscript-style technical report. I’m looking for feedback on whether this is structured clearly enough as a research-compute / reproducibility package.

Important boundary:
This is NOT claiming working hardware or a validated breakthrough. It is explicitly blocked/conditional and separates assumed parameters from measured requirements.

Looking for quick sanity-check feedback on:

Is the repo structure understandable?
Is the simulation reproducible from the README?
Are the Monte Carlo / gate outputs presented clearly?
Are assumptions vs measured claims separated enough?
Would this need TPU/GPU/cloud compute at all, or is CPU fine for the current scope?

tulip kestrel May 20, 2026, 5:59 PM

#

spice willow I built a public reproducible simulation/package for a staged cryogenic + in-sit...

can only answer 5th. probably GPU will be fine. don't see anything ML/AI related a lot where TPUs would really shine. so ig GPU/CPU would be fine.

feral oak May 22, 2026, 10:14 AM

#

What's a TPU, and what's the difference between a TPU, a GPU, and a CPU? Also, which one does Google Colab choose, and why?

swift junco May 25, 2026, 3:13 PM

#

Hi TRC community
We’ve been running large-scale training workloads on TRC, but around 60% of our allocation has effectively been unusable for the past 4 weeks. I’ve emailed trc-support twice and haven’t heard back yet.
v5e (128 chips, currently unusable):
v5litepod-8 requires 8 chips, but the quota TPUV5sPreemptibleLitepodServingPerProjectPerZoneForTPUAPI is capped at 4. It looks like the allocation landed in a serving quota bucket instead of training. Seeing this in both us-central1-a and europe-west4-b. Has anyone run into this before?
v4 spot (32 chips, ~5% usable):
In us-central2-b, jobs sit in WAITING_FOR_RESOURCES indefinitely. I had one finally provision after ~4 hours, then it got preempted shortly after.
v4 on-demand (32 chips, unusable):
Using the v2-alpha-tpuv6e runtime, JAX falls back to CPU and eventually OOMs. Is there a known-good runtime image for v4 + JAX right now?
Would really appreciate any guidance from anyone who’s dealt with similar issues.
Kim, Modulith Research CIC
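
On the runtime question, a hedged sketch of a lookup that guards against pairing an accelerator type with another generation's image. The mapping reflects my understanding of the Cloud TPU software versions page at the time of writing and should be treated as an assumption to verify there; `suggested_runtime` is a made-up helper name.

```python
# Assumed mapping from accelerator-type prefix to TPU VM runtime version.
# Verify against the current Cloud TPU "software versions" documentation.
RUNTIME_BY_PREFIX = {
    "v6e": "v2-alpha-tpuv6e",
    "v5p": "v2-alpha-tpuv5",
    "v5litepod": "v2-alpha-tpuv5-lite",
    "v4": "tpu-ubuntu2204-base",
}


def suggested_runtime(accelerator_type: str) -> str:
    """Return the assumed runtime version for a type such as 'v4-32'."""
    prefix = accelerator_type.split("-", 1)[0]
    try:
        return RUNTIME_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f"no runtime known for {accelerator_type!r}") from None


def check_pairing(accelerator_type: str, runtime_version: str) -> bool:
    """True if the requested runtime matches the assumed one for this type."""
    return suggested_runtime(accelerator_type) == runtime_version


print(suggested_runtime("v4-32"))
print(check_pairing("v4-32", "v2-alpha-tpuv6e"))
```

If the mapping holds, `v2-alpha-tpuv6e` on a v4 slice is a mismatch, which would be consistent with JAX only finding the CPU.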

#

Separately, a couple of findings from our v6e training runs that may be useful for the Pallas/Splash teams:

jax.nn.dot_product_attention appears to materialize the full S×S attention matrix in fp32 on TPU regardless of input dtype, which caused OOMs for us at seq_len=2048. We worked around it by switching to splash_attention_kernel and adding remat.
Splash attention seems to default sm_scale=1.0 instead of the expected 1/√d_head. In our case this caused training to stall around ~PPL 12. Setting sm_scale=1/√128 fixed convergence immediately. Worth flagging because it’s a silent failure mode — training keeps running, but converges to garbage.

Happy to put together proper bug reports for either if helpful.
Kim, Modulith Research CIC
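
To make the two findings concrete without depending on any particular kernel's API, here is a plain-Python sketch. It computes the size of one S×S fp32 score matrix and compares softmax attention weights with scale 1.0 against 1/√d_head for random vectors. The 16 keys, the seed, and the helper names are arbitrary choices for illustration.

```python
import math
import random


def score_matrix_bytes(seq_len: int, bytes_per_element: int = 4) -> int:
    """Memory for one materialized seq_len x seq_len score matrix (fp32 by default)."""
    return seq_len * seq_len * bytes_per_element


def softmax(xs):
    m = max(xs)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, scale):
    """Softmax over scaled dot products of one query with every key."""
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(logits)


random.seed(0)
d_head = 128
q = [random.gauss(0, 1) for _ in range(d_head)]
keys = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(16)]

unscaled = attention_weights(q, keys, 1.0)
scaled = attention_weights(q, keys, 1 / math.sqrt(d_head))

# One matrix per head and per sequence in the batch.
print(score_matrix_bytes(2048) / 2**20, "MiB per head per sequence")
print("largest weight, scale=1.0:        ", max(unscaled))
print("largest weight, scale=1/sqrt(128):", max(scaled))
```

With unit-variance inputs the unscaled logits have a standard deviation of about √128 ≈ 11, so the softmax is close to one-hot and gradients through the other keys nearly vanish, which is one plausible mechanism for the stall described above.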

tulip kestrel May 27, 2026, 11:35 AM

#

Is TRC overbooked rn? Many people on Twitter have complained about it. I wonder if there is any official info.

bitter nebula May 27, 2026, 1:17 PM

#

That's correct, TRC is not currently accepting applications - there's a new form linked on the site that you can use to express interest in placement if/when it becomes available.

tulip kestrel May 27, 2026, 1:20 PM

#

bitter nebula That's correct, TRC is not currently accepting applications - there's a new for...

Ooo thanks!! ❤️ Hope TRC gona have more tpus in future cus the program is so cool!

spice willow May 29, 2026, 6:43 AM

#

bitter nebula That's correct, TRC is not currently accepting applications - there's a new for...

When do you think applications will be accepted again?

bitter nebula May 29, 2026, 6:54 AM

#

spice willow When do you think applications will be accepted again

I'm not sure if the old application process will be restored, or if so when that might be