#tpu-research-cloud
yeah in general we recommend that folks use gcloud, give that a try: https://cloud.google.com/tpu/docs/v6e-intro#set_up_jax_using_queued_resources
Hi everyone — hope you’re doing well. I’m a recent TRC grantee setting up workloads and wanted to sanity-check pricing to avoid unexpected spend. I’d really appreciate any firsthand experiences or pointers to official docs:
- TPU VM pricing (Spot/Preemptible)
• Are TPU VM instances (e.g., v4/v5e) fully covered under TRC when launched as Spot/Preemptible?
• If there’s a separate “host VM” component (vCPU/RAM) in TPU VM, is that covered too, or billed separately?
- What still costs money outside the TPU allocation?
• GCS buckets (storage, API ops, lifecycle)
• Network egress (inter-region / to Internet)
• Persistent Disk attached to TPU VM
• External/static IPs, NAT, Load Balancers
• Cloud Logging/Monitoring, Artifact Registry, Pub/Sub, Cloud Build, etc.
Context: Region = [X], TPU type = [v5e-8 / v4-8].
Thank you in advance! And if this has been answered before, I’d be grateful for a link to the thread.
Are TPU VM instances (e.g., v4/v5e) fully covered under TRC when launched as Spot/Preemptible?
When you register a project you'll get a notification with the quotas that you have access to - those are covered.
If there’s a separate “host VM” component (vCPU/RAM) in TPU VM, is that covered too, or billed separately?
The host VM architecture has been deprecated for a long time (it might not even be usable anymore but I'm not sure) - you should be using TPU VM via Queued Resource w/ TRC quota so this wouldn't be an issue.
What still costs money outside the TPU allocation?
TRC covers the Cloud TPU service so anything that you use otherwise is subject to regular billing.
What do you guys usually do to use larger models, like 100B+, since the disk is like 90 GB per host?
Hi! Recently I'm training some LLMs on Google v5e and v4 TPUs. However, I have encountered NaN loss problems on v5litepod-64 in zone us-central1-a. Specifically, the model trains successfully for ~4000 steps and suddenly the gradient and loss exploded.
I tried to run training with exactly the same setup on v4-128 and did not encounter the same problem. I wonder:
- if anyone else has encountered this problem before? Is this likely a hardware issue?
- if there is a way to exclude certain TPU cores when training? (like we typically have a --exclude flag in slurm systems)
Thank you for your help in advance! Please let me know if more details are needed:)
I believe most people use buckets to store large data files and ckpts
Where/how should I contact Google Research TRC regarding the sharing of research work through open source releases, blog posts, program feedback, or anything else mentioned in the onboarding email?
The easiest way is probably just to reply to the email but there's also a form linked at the bottom of sites.research.google/trc/publications
For the record, if it's been indexed by Google Scholar and it mentions the program by name we may have already picked it up - probably searching for the paper name or the first author's surname on that page is the easiest way to tell
Ahhhh, thank you for the information. I missed the form you mentioned. New to the program and I've never shared any of my work before, so trying to figure out the best way to go about all that.
Yeah that link is fairly hard to find - the publication page was a lot shorter when we added it. 🙂
Haha, that's a good sign at least. I hope I can contribute something novel to the program.
Hello, TRC team.
The trc email said free access to those resources
1 on-demand Cloud TPU v5e chips in zone europe-west4-a
1 on-demand Cloud TPU v5lite chips in zone europe-west4-b
may I know whether the v5lite-32/64 can be covered in this case? Thanks for your help!
you'd need to take that up with trc-support@google.com
Ok
https://docs.google.com/presentation/d/1TZMmXumbaCf4PEIHcNVOuhwniLFJyOvLjPoZrnanb78/edit?usp=sharing
Excited to share here my slides at Google's Devfest in Manila for a talk about TPUs and the TRC program
The TPU talk actually got accepted to two events - last week (10m lightning) and this week (30m). These slides are a bit more detailed:
https://docs.google.com/presentation/d/1C6ccqrJz--90Po2eo1G8F4Uko0TOejT6SNwInZswiJ4/edit?usp=sharing
Hello!
I submitted an application to TRC two weeks ago and didn't get any response, neither approval nor denial, so I decided to post here to check. (I checked the spam folder too, nothing's there.)
Not sure that's something that can be covered in detail on discord but generally speaking no email == not approved, unfortunately
so it doesn't mean that my application was lost or ignored, just denied. correct?
I suppose anything is possible but it's not likely the case - you could try emailing trc-support@google.com to see if they can verify
Will do, thanks
Hello, I have a question. I created a TPU in the zone I was granted through TRC, following the usual steps. Then I found that my code cannot find the TPU hardware. What is the cause?
Thanks in advance for answering me.
There's not really enough info here to be able to establish a root cause - if it's throwing a particular error the best bet is to search for that to see if anyone else has mentioned it / how they resolved it. Failing that you might provide the same info here so anyone who has any insight can chime in.
I'm starting over again, if I find anything else I'll let you know, thanks.
at a super-high level a lot of times that ends up being because the code is bugged (eg using an api incorrectly) or the tpu is unreachable (preempted, maintenance event etc)
Hey! I am a bit confused with global batch size calculation on v6e-8 (for example). So jax.local_device_count("tpu") shows 8 for a v6e-8. I am using MaxText that has a per_device_batch_size parameter. So when I use 64 as per_device_batch_size does it mean, that my global batch size is 64 x 8 = 512? If I would use a v6e-64, is it then 64 x 64 = 4096?
Hi, I am trying to create a tpu v4 on demand instance but when I run this, I get an error, what am I doing wrong?
gcloud compute tpus queued-resources create tpuv4 \
--node-id v4-32-ond \
--zone us-central2-b \
--accelerator-type v4-32 \
--runtime-version tpu-ubuntu2204-base
ERROR: (gcloud.compute.tpus.queued-resources.create) NOT_FOUND: Cloud TPU was unable to complete the operation.
Hey all – excited to get working with the TPUs. I was trying some of the hello world stuff on Colab but unfortunately couldn't get anything to run. I think the tutorials may be based on a v2 TPU hardware that isn't live anymore, as I can only select v5. Any tips on getting started here? Also happy to move to just running the gcloud API to try this. https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb#scrollTo=Zo-Yk6LFGfSf
Hello, I have a question about GCP credits. I received a trial credit for GenAI App Builder.
What are 32K credits? And how do I check
On ur profile section
it's not possible to answer this without details about your project that shouldn't be shared on discord, recommend contact trc-support@
we recently pulled the colab links out of TRC's welcome documentation as, to your point, they referenced no-longer-supported generations of hardware - using gcloud is probably the easiest path unless someone knows of some other official notebook that could replace it (unfortunately I don't)
just as an fyi to future readers: TRC has nothing to do with GCP credits, please don't worry if you've been approved for the program but don't have any credits issued in the UI 😅
Did you ever figure this out? I'm not sure what MaxText considers to be a "device" but I'd assume it's either a chip (in which case 512 sounds right) or a host (8 chips/host -> 64)...
I found one relevant piece of information in this readme (https://github.com/AI-Hypercomputer/maxtext/blob/69ed0c5d29aa25c61fd4c31a666ef35cf345d30e/docs/reference/architecture_overview.md?plain=1#L63): it says "Sets the local batch size per accelerator chip." But I will ask on GitHub as well 🙂
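For reference, the arithmetic from the question can be sketched as below. It assumes the readme's reading (batch size is per chip) and that nothing else, such as gradient accumulation, changes the effective batch; `global_batch_size` is a hypothetical helper, not a MaxText function.

```python
def global_batch_size(per_device_batch_size: int, num_devices: int) -> int:
    """Global batch size when every device processes the same per-device batch."""
    if per_device_batch_size <= 0 or num_devices <= 0:
        raise ValueError("both arguments must be positive")
    return per_device_batch_size * num_devices


# On a real slice the device count would come from JAX, e.g.
#   import jax; num_devices = jax.device_count()
# jax.device_count() counts devices across all hosts, whereas
# jax.local_device_count() only counts devices attached to the current host,
# so the two differ on multi-host slices.
for name, chips in [("v6e-8", 8), ("v6e-64", 64)]:
    print(name, global_batch_size(64, chips))
```

Under those assumptions the numbers in the question (512 and 4096) follow directly from the chip counts.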
This is also useful lol. I signed up for free trial credits then somehow used all of them in 24hrs
any idea why I can't create the TPU? (also I checked. all of the quota usage is at 0). gcloud compute tpus tpu-vm create node-tpu --zone=us-central1-a --accelerator-type=v5litepod-64 --version=v2-tpuv5-litepod --preemptible
Create request issued for: [node-tpu]
Waiting for operation [projects/tpu-proj-ai-detection/locations/us-central1-a/operations/operation-1763648567926-644076e665ecc-2a7ceae1-85b1a8b4] to complete...failed.
ERROR: (gcloud.compute.tpus.tpu-vm.create) {
"code": 8,
"message": "You have reached IN_USE_ADDRESSES limit. [EID: 0xbf338a209239294b]"
}
sounds like a quota issue, I think you can request an increase via the GCP console
i dont know which quota it is because all quotas i found have usage on 0 and limit at 8
Got a problem. One of my TPUs is stuck and I cannot ssh to it. It's a pod, so idk if it's even possible to restart it, and I've got stuff saved there. Is it possible to somehow just take the stuff from there or unstick it? It happened when I was connected over ssh and my PC crashed; when I restarted the PC and tried ssh again, it said the connection timed out.
to the best of my knowledge no, but maybe someone else on the server has found a way...
Ouch :((
Hello everyone, I am new here, what is TPU about?
TPUs are like GPUs or CPUs, but they work quite differently. The TPU Research Cloud provides TPUs to people so they can train models or run experiments on them. Very cool stuff ^^
you can learn how TPUs work here > https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm
Thank you
np!
Even Gemini was not able to help. So I'm just gonna delete the TPU and start fresh.
Soo, I don't know if I posted in the right spot, but, I think I stumbled across an architecture that might, like, nuke GPU architecture and TPU's for LLM design. I really want someone to tear this idea down, cause the math keeps saying it amplifies LLM speed by 200% and shrinks it exponentially, while allowing for rapid iteration. I can't have been the only person to have stumbled across this, I feel like.
https://github.com/Kaleaon/Pentary is the repo I posted the preliminary ideas, research, etc, on.
yeah, probably not the best channel for this, I see you already posted in eg #ai-general which I would expect to be more likely to generate some interest in the idea though
Can anyone tell me what configuration should I use for running JAX on v6e-64?
Good evening.
Hey! Glad to meet all of you guys who're in the same boat, testing out TPUs for research. I've been looking at a ton of the guides/documentation, and wanted to ask about some of the cool stuff you've been doing that you haven't written up/posted about. Any cool challenges you had to work through? etc.
If you mean by the type of VM, you should be using v2-alpha-tpuv6e for the best compatibility, at least from the documentation I've been reading for running some test workloads
Also, is it normal to get internal error while creating tpu vm even with spot + queue?
I've used this command to queue 64 v6e resources but it just keeps failing.
gcloud compute tpus queued-resources create qr-v6e64-ew4a-spot \
--node-id node-64 \
--zone europe-west4-a \
--accelerator-type v6e-64 \
--runtime-version v2-alpha-tpuv6e \
--valid-after-duration 1m \
--spot \
--internal-ips
how is it failing / are you getting an error code back?
I am only getting error code 13, internal error.
No description, just says error without much explanation
In every region I have a spot quota of 64 v6e chips, but any setup with more than 16 TPUs results in similar problems.
With console I do get something like this if I retry soon
Cloud TPU was unable to complete the operation. Please try again, or contact support if the problem persists. [EID: 0x369a052af37e209f]
❤️ opaque errors
sounds like it could be quota related though
(if you have access to them) can you try an older gen TPU to see if it works / fails in the same way?
if it is specific to v6e I'd guess that you might not have enough Hyperdisk quota, in which case you'd need to request an increase
i think the trc people forgot to give me quota for us-central2-b
the quota page doesn't show that i have quota for us-central2-b, the dashboard for creating the tpu doesn't even list the region, and the cli here doesn't work
if you received an email saying that you have the quota then you should have the quota...
could be a timing thing though, are you still having an issue?
yes
i did "gcloud compute tpus tpu-vm create will-2 --zone=us-central2-b --accelerator-type=v4-32 --version=tpu-ubuntu2204-base --project=trctpu-123"
Probably worth contacting trc-support@google.com
yeah i did
are we able to use GKE with our quota of TPUs?
not right now, but eventually yes
Hi, I received access to
32 on-demand Cloud TPU v4 chips in zone us-central2-b
and some other ones
but
- trying to create a TPU VM using zone us-central2-b returns "Permission denied on 'locations/us-central2-b' (or it may not exist)." Is this location deprecated or something?
- and for the other zones I simply get "Insufficient capacity"
Is there a place to check zone capacity?
Is this location deprecated or something?
no, if you still can't access it you might want to email trc-support@google.com
Is there a place to check zone capacity?
also no, but you can use the Queued Resource API to queue until there is availability
Any recommendation on library for logging and train-run monitoring on TRC?
I found a rly good repo called tpux. is it widely used?
I'm considering using it to help iteration speed. I want to edit my training loop on one VM but run the test between all of them. Doesn't seem like there's a simple way to do that using gcloud
let me know how it goes, i'd love to see more monitoring tools
yes I got it working. tpux with Wandb
Hey, so i got access to the cloud tpus, but when i try to create a tpu vm it starts creating for a few minutes and then it shuts down, and when i see the logs it says this:
btw im on the free trial, and also i tried using both tpuv6e and v5, and i tried creating through the web interface and cli but its still shutting down
I already sent an email yesterday, but no response
just realized i sent them the wrong project number (in the email)💀
ok it works with v4-32
Sorry @tropic topaz Google does not provide official support on this server
Please check #general message for the official support links
to be sure, there are members of the TRC team that are on the server & contribute to this channel specifically, myself included.
trc-support@google.com is the official way to get support, though - in our view the channel is intended to be driven by the community.
Hi, can anyone recommend online communities for getting help on PyTorch XLA related issues? I've spent a couple of days porting our existing CUDA code to work on the TPU cloud. It works; however, it is not as efficient due to the lower memory bandwidth of TPUs. I don't want to waste too much time before the TRC timeline ends. Thanks in advance!
Can anyone help please? Am I doing it wrong or the TPUs are out of stock for now?
Unfortunately no one is going to be able to offer much help based on the screenshot as it doesn't give any information as to why the failures are happening.
As a first step I'd recommend using the gcloud CLI as it will typically give more detailed error messages - if it's not something obvious (like out of capacity errors) then share the error message and someone might be able to help.
Any gains you get will be absolutely annihilated by a binary-pentary conversion layer.
Because the rest of the world operates on that level I'll finish absolutely ripping it to pieces in a minute
You also critically rely on experimental memristor parts that are prone to bit flips and other issues, especially at low voltage levels.
C=2.717ish
| Radix (b) | Efficiency constant (b / ln b) | Information per digit (log2 b) | Relative cost (vs. ternary) |
|---|---|---|---|
| 2 (Binary) | 2.885 | 1.00 bits | 1.056 |
| 3 (Ternary) | 2.731 | 1.58 bits | 1.000 |
| 4 (Quaternary) | 2.885 | 2.00 bits | 1.056 |
| 5 (Pentary) | 3.106 | 2.32 bits | 1.137 |
| 10 (Decimal) | 4.343 | 3.32 bits | |
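The columns above can be recomputed with a few lines of Python. This is a minimal sketch that takes the column definitions at face value (efficiency constant b / ln b, information log2 b, cost relative to radix 3); the function names are made up for illustration, and because the script rounds, the last printed digit may differ slightly from the table.

```python
import math


def radix_efficiency(b: int) -> float:
    """Efficiency constant b / ln(b); lower is better."""
    return b / math.log(b)


def bits_per_digit(b: int) -> float:
    """Information carried by one base-b digit, in bits."""
    return math.log2(b)


def relative_cost(b: int, reference: int = 3) -> float:
    """Efficiency constant of base b divided by that of the reference base."""
    return radix_efficiency(b) / radix_efficiency(reference)


for b in (2, 3, 4, 5, 10):
    print(f"{b:>2}  {radix_efficiency(b):.3f}  "
          f"{bits_per_digit(b):.2f} bits  {relative_cost(b):.3f}")
```

Among integer radices the constant is smallest at 3, which is why the table uses ternary as the reference.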
Good shot, but I'm afraid it's just not going to work. I've tried similar things myself.
I will spare you the rest of the problems as I don't want to make you cry, but suffice to say, the idea is not tenable, although it is novel. Keep at it.
Has anyone used TPUs for inference? I would like to know the gains compared to standard GPUs in Colab.
Hello, Is anyone familiar with torch-xla able to confirm whether this is an appropriate way to measure inference runs on a TPU:
```python
for _ in range(num_warmup_runs):
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
latencies_ms = []
for i in range(args.num_iterations):
    start_time = time.time()
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
    torch_xla.sync()
    xm.wait_device_ops()
    end_time = time.time()
    latencies_ms.append((end_time - start_time) * 1000)
```
xm.wait_device_ops() seems to be blocking the run indefinitely
I can help! The issue is likely that xm.wait_device_ops() is deprecated/broken in newer torch-xla versions and can cause hangs.
The torch_xla.sync() call should already be sufficient to wait for TPU operations to complete - you probably don't need the wait_device_ops() at all.
Question: What version of torch-xla are you using? If it's recent (2.0+), just remove the xm.wait_device_ops() line and rely on torch_xla.sync() alone.
Hi Joshua, thanks for the help. I've left only torch_xla.sync() but I'm now observing another issue where the program doesn't terminate. Plus, I'm observing similar performance results with or without using torch_xla.sync(), which makes me wonder whether I even need it. Do you know why this might be?
Both issues suggest the sync isn't working properly. The program hanging and identical performance with/without sync usually means your tensors or model aren't actually on the TPU device, so there's nothing to synchronize.
I can either walk you through debugging the device placement and fixing the sync points, or if you share your full code snippet I can just rewrite the benchmarking section with proper TPU synchronization for you.
Which approach works better for you?
Ah thanks! Here's a code snippet:
```python
device = torch_xla.device()
torch_model = torch_model.to(device)
torch_inputs = torch_inputs.to(device)
print(f"Running inference on {model}")
print(f"Warming up {model} with {num_warmup_runs} runs")
for _ in range(num_warmup_runs):
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
    torch_xla.sync(wait=True)
latencies_ms = []
for i in range(args.num_iterations):
    start_time = time.time()
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
    torch_xla.sync(wait=True)
    end_time = time.time()
    latencies_ms.append((end_time - start_time) * 1000)
```
Can you figure out what I'm doing wrong?
I can see the issue: you're calling sync(wait=True), but that parameter doesn't exist in torch_xla.sync(). The sync is likely failing silently, which explains both the hanging and the inconsistent timing.
Here's the thing: this needs a proper rewrite with correct sync semantics and proper timing placement. Rather than going back-and-forth in the thread, would you be open to discussing this privately? I can fix this properly for you and make sure your TPU benchmarking is accurate.
Want to move this to DMs so I can help you sort this out?
The parameter does exist for the sync method: https://docs.pytorch.org/xla/release/r2.8/learn/api-guide.html#torch_xla.sync
Anyone able to point out where this is going wrong (without charging a fee) ?
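For anyone comparing notes, here is a minimal, framework-agnostic sketch of the timing structure used in the snippets above: warm-up runs with a sync, then timed runs that block on a sync before the clock is read. It does not diagnose the hang. `benchmark`, `step` and `sync` are hypothetical names, and the demo uses CPU-only stand-ins so it runs anywhere.

```python
import statistics
import time
from typing import Callable, List


def benchmark(step: Callable[[], object],
              sync: Callable[[], None],
              num_warmup_runs: int = 3,
              num_iterations: int = 10) -> List[float]:
    """Return per-iteration latencies in milliseconds."""
    # Warm-up: early calls may include one-off costs such as compilation,
    # so they are executed and synced but not measured.
    for _ in range(num_warmup_runs):
        step()
        sync()

    latencies_ms = []
    for _ in range(num_iterations):
        start = time.perf_counter()  # monotonic clock, better suited than time.time()
        step()
        sync()  # block until queued work has finished before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


# CPU-only stand-ins: a cheap computation and a no-op sync.
latencies = benchmark(step=lambda: sum(range(10_000)), sync=lambda: None)
print(f"median latency: {statistics.median(latencies):.3f} ms")
```

On a TPU, `step` would wrap the `torch.no_grad()` forward pass and `sync` would be `lambda: torch_xla.sync(wait=True)`, the call documented at the link above.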
I have been "waiting for resources" for days on v4 TPUs. sigh. Anyone else feel the same?
try smaller slices, v6e-8 works instantly for me in some region
I dont have quota for on demand v6e-8 in us-central2-b. I only have preemptible v6e in us-central1-a, but it got preempted instantly. lol.
I am not sure if this helps anyone, but I've shared my experiences on migrating an LM training pipeline to run on TPUs here: https://dogac.dev/blog/2026/migrating-to-tpu/, maybe it helps you to migrate your existing pipelines and debug performance issues.
Can we collaborate? I do have some questions.
Feel free to ask questions here directly
I did read your blog on running TPUs in the cloud. However, the initialisation takes a lot of time, and I'm not sure how to utilise the ones provided by Colab. I would like a way to initialise them properly.
Any suggestions @quaint silo
I am not sure what you mean. I create a queued resource as the TPU documentation recommends and I usually get access to it within 5-10 minutes. If it is preempted, I retry.
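The "retry when preempted" loop described above can be sketched roughly as follows. This is only an illustration: `retry_until_success` and its parameters are hypothetical, and `attempt` stands in for whatever creates the queued resource and runs the job, returning `False` when the slice was preempted.

```python
import time
from typing import Callable


def retry_until_success(attempt: Callable[[], bool],
                        max_attempts: int = 5,
                        base_delay_s: float = 60.0,
                        max_delay_s: float = 900.0,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `attempt` until it returns True; return the number of attempts used."""
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
        if n < max_attempts:
            # Exponential backoff, capped so waits do not grow without bound.
            sleep(min(base_delay_s * 2 ** (n - 1), max_delay_s))
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Demo: a fake attempt that is "preempted" twice before succeeding.
# Delays are recorded in a list instead of actually sleeping.
outcomes = iter([False, False, True])
delays = []
attempts_used = retry_until_success(lambda: next(outcomes), sleep=delays.append)
print(attempts_used, delays)
```

Saving checkpoints to a bucket between attempts, as suggested earlier in the channel, is what makes such a loop useful for training runs.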
This is great content, Dogacel! Thanks for sharing
yo guys can you make a free tpu with infinite compute power and storage
that would be useful ty
||jk||
Does anybody know how the TRC team reacts to asking for more TPUs (or more advanced TPUs like the v7x)? I have a couple of projects that could benefit from the 3d topology (and optical switches). Plus, my v6e-16 cluster has been getting preempted a lot. I have a couple of proofs of concept already. Also, has anybody ever gotten a v7x or not? As far as I am aware, it should be available, but only in public preview. Would it be possible, or is the capacity being used for internal LLM training?
:p
Hi all,
As part of TPU research cloud, I have been provisioned free access to v4-32 on-demand TPUs in us-central2-b. However, I am getting an issue when I try to create TPUs with that configuration. The error message is not that informative so I was wondering if I did something wrong.
Because I want the 32 chip allocation, am I supposed to create multiple TPU hosts myself instead? I can try to create 4 v4-8 VMs and maybe try to network them somehow but the cost display on the side of the screen is making me cautious. My understanding was that my original approach itself was supposed to create multiple VMs .
Any help would be appreciated here. Thanks!
I would try to allocate the same machine from the CLI with gcloud; for some reason the CLI error messages are often much more helpful than the web UI ones (you can ask an LLM to translate the screenshots into a gcloud command).
Because I want the 32 chip allocation, am I supposed to create multiple TPU hosts myself instead?
No, that should never be necessary, the TPU deployment thing also sets the correct env variables for JAX so that the hosts can find each other
Thanks! I seem to be able to allocate spot TPUs through the UI but when I try on-demand it seems to have this issue even though I do have access to on-demand TPU chips based on the original email. I will try the cli though
yayy 😄 i believed in yall
If you're able to provide a compelling plan for your research then an increase and/or extension might be possible, but TRC doesn't currently offer v7 FYI.
Hard to know exactly what's happening without the error response but if you're able to create spot v4 and not od in the same zone then it's most likely a stockout - spot devices often have better availability
Thanks! I finally managed to get something on-demand provisioned through the UI and I’m sure it’s part of the TRC grant I’ve been allotted (v4-32/us-central2-b). I can’t exactly stop the TPU when not in use because it’s not a single device TPU so I’m trying to see if there’d be some mechanism to verify that I’m not being charged for this before I randomly see a really big charge tomorrow
I see. Are v5p-64 TPUs available then? I had a couple of ideas for the OCS and 3d topologies. Hopefully, I can test out, see if they work, and then publish a paper about it. If anybody is knowledgeable on TPU specifics, could they check these ideas?
The first one is using topology-aware Manhattan distance for routing in MoE models: trying to penalize/reward the model for using TPUs that are physically closer to each other. Likely, I would just anneal the penalty to try not to kill model performance. Plus, some load balancing between experts. The big issue I can see is if the communication overhead between experts isn't large enough to justify the work. That being said, it could be a (mostly) free performance gain.
The second one is using the 3D torus for helical ring attention. Theoretically, having a Hamiltonian path or multiple interleaved helices should make it so that you can transfer the KV-block while it is computing attention. So increasing the memory bandwidth tremendously and thus the context. Personally, I suspect the Gemini family of LLMs does something similar, and that's why they can have such large context windows.
The third one is arguably the most difficult. The hard part is the SparseCores. I cannot for the life of me find any good documentation on the SparseCores, even though it would be amazing for specific tasks. If any Google insiders familiar with the SparseCore functions, documentation, or details could help me out, it would be greatly appreciated. Anyways, the general idea is to use the SparseCore as a Locality Sensitive Hashing engine to retrieve the attention (Leaving the MXU for computation). The benefit of it is that you can use the SparseCore for Top-K relevant keys, then compute them on the MXU. While it could be incredible, I have no idea if the SparseCore would support this, so this is why it's the least likely out of all 3.
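For the first idea above, the distance term could look something like this. It is a minimal sketch that assumes every axis of the slice has wraparound links and that chips can be addressed by integer (x, y, z) coordinates; whether that holds depends on the actual slice topology, and `torus_manhattan_distance` is a hypothetical helper.

```python
from typing import Sequence


def torus_manhattan_distance(a: Sequence[int],
                             b: Sequence[int],
                             dims: Sequence[int]) -> int:
    """Hop count between two coordinates on a torus with the given axis sizes."""
    if not (len(a) == len(b) == len(dims)):
        raise ValueError("coordinates and dims must have the same length")
    total = 0
    for x, y, size in zip(a, b, dims):
        direct = abs(x - y) % size
        # With wraparound links, going "the other way round" may be shorter.
        total += min(direct, size - direct)
    return total


shape = (4, 4, 4)  # hypothetical 4x4x4 slice
print(torus_manhattan_distance((0, 0, 0), (3, 0, 0), shape))
print(torus_manhattan_distance((0, 0, 0), (2, 2, 2), shape))
```

On a 4x4x4 torus the first pair is a single hop apart because of the wraparound link, which a plain Manhattan distance would count as three.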
For TRC-enrolled projects it should generally not be possible to start a device for which you lack the necessary quotas but, yeah, probably a good idea to keep an eye on the billing if you're concerned
Hi all, I'm running into an issue where I spin up a v6e-8 spot instance and it gets provisioned successfully, but when I login I can see that it has a placeholder container (fake_tensorflow:latest) and no access to actual TPU resources (TPU libraries show that only the CPU is available, /dev folder has no accel subfolders, etc).
Wondering if anyone else has encountered this. I've tried a variety of settings to boot up the instance (runtime: v2-alpha-tpuv6e or tpu-ubuntu2204-base, CLI: alpha or general, direct tpu-vm or queued resource, etc).
I have seen the fake_tensorflow before as well, but was still able to communicate with the TPU. I'm also not sure if the chips are still exposed as device files under /dev. (I actually think I also had that problem before that I couldn't find the chip there). Have you tried just running jax.devices() to see if the accelerator is still found? What library calls did you use to try to find it?
Huh, I see, that's promising.
Yeah, I ran jax.devices() and it gave me the warning
```
Devices: [CpuDevice(id=0)]
Num devices: 1
```
`libtpu` is installed. I'm pretty much just installing vLLM for TPU and running sanity checks, and it's failing those because it can't find the TPU. Have you gotten a message like this before?
Last time I had that problem I had selected the wrong software, but it looks like you already tried selecting a different image
Are you sure you have installed jax[tpu] and libtpu?
Well, I just spun up another spot instance to replicate the issue, and now Jax is finding the TPU 🫠
I think you're right it must just have been a package installation issue on my end. I double checked and the fake_tensorflow container is on this one too. Thanks for letting me know that it was a red herring!
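A quick sanity check along the lines of the `jax.devices()` suggestion above might look like this. It is a sketch only: `tpu_devices` is a hypothetical helper, and the demo uses stand-in objects so it runs without JAX. Real JAX device objects expose a `.platform` attribute.

```python
from collections import namedtuple
from typing import Iterable, List


def tpu_devices(devices: Iterable) -> List:
    """Keep only the devices whose platform is 'tpu'."""
    return [d for d in devices if getattr(d, "platform", None) == "tpu"]


# Stand-ins for what jax.devices() returns on a CPU-only setup vs. a v6e-8.
FakeDevice = namedtuple("FakeDevice", ["id", "platform"])
cpu_only = [FakeDevice(0, "cpu")]
with_tpu = [FakeDevice(i, "tpu") for i in range(8)]

print(len(tpu_devices(cpu_only)), len(tpu_devices(with_tpu)))
```

On the VM itself this would be called as `tpu_devices(jax.devices())`; an empty result matches the `[CpuDevice(id=0)]` output shown earlier.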
I hope to one day find out what the fake_tensorflow is about, still sometimes confuses me when I look at ps aux.
Really appreciate your help! It's also good to know that the /dev/accel thing was bad/outdated information and those folders don't exist anymore
Always happy to help! Can always DM me if I’m not looking at the channel
Hi, Is there someplace we can reach out to about possibly extending the free cloud TPUs that we receive? I have about 6 days left before my allocation expires and have emailed TRC support but I guess I’m generally seeking advice on how to go about this/best way to mitigate costs after the allocation expires. Even a more limited set of configurations/zones would be helpful here. Thanks!
Has anyone else run into this issue, usage of quota shown as 100% despite not having any TPUs or queued requests in the console?
Leaving for reference. It resolved to 25% around 10 min after deleting old suspended queued requests.
Hey, I can help with this. Code 13 usually happens because of quota limits, region capacity issues, or network config (like external IP restrictions) when using free credits. What region are you trying to create the TPU in, and are you attaching it to a VM with an external or internal IP?
Hey! I’m actually not an AI lol, just someone who works with this stuff and tries to help when I can.
Since you’re using europe-west4-a with internal IPs, the code 13 error could be coming from TPU quota limits on free credits or capacity issues for larger TPU types in that zone. I’ve helped troubleshoot similar setups before.
If you want, you can DM me the exact config you’re using and the TPU type you’re trying to create. I can take a quick look and help you get it working. If it ends up being something more involved, I can also help you set it up properly.
I would actually love to hear if you guys ended up fixing this
Hey @gleaming sierra! Just a friendly reminder to keep the chat relevant and easy to read. We need to avoid spam, nonsensical messages, and excessive emoji use to keep the conversation flowing for everyone. Your recent messages that violated this rule have been deleted. Thanks for understanding!
Hi, I am trying to test-drive the TPUs, but I have not yet been able to get an on-demand TPU (queued for half a day) - and spot instances (tried multiple zones) seem to get preempted within minutes. Is that expected behaviour? Because if it is, I would probably not invest any more time into this. Any advice?
Hello, I am a developer of a large language model based on SNN. I am interested in what the token rate per second will be on TPU and how to properly optimize it. My model has 618 million parameters.
I'm just getting very slow speeds on server GPUs: 0.3 tokens/s.
I'm not sure if my experience applies here, as it was quite some time ago. From what I know, preemptible tpu being preempted often is a thing.
Maybe not that bad, but preemptible were not worth the effort for training back then
Hm. Though why do they even bother with the TRC program then? If this is a normal experience, I cannot imagine that many people are likely to use or recommend TPUs going forward.
We do it because we think the mission is important.
Over the years TPUs have become increasingly popular, and with modern coding agents the technical barriers to using them have practically vanished. So, unfortunately, what you have today is a time of huge demand and relatively constrained supply (and that isn't unique to us, obviously). We understand that things could be better, and we try to make positive changes to that end whenever the opportunity arises, but, yeah, it's hard out there right now.
And on that note, thanks to anyone who sees this that has stuck with us through the rough times - we'd be nothing without you! ♥️
Anecdotally it's usually easier to get a spot (=preemptible) instance than an on-demand one but it's harder to keep unless you're targeting lower-utilization devices/zones/times of day and also have a decent amount of luck
How well do your chips work with SNN models?
Sorry, I don't have any data on that in particular, if you don't get a response here I'd probably look at e.g. sites.research.google/trc/publications (or ArXiv directly) to see if you can spot some relevant papers and reach out to the authors
Yes, thank you, I just have my server with 4x 3090s, they can't handle the load, but I heard about your chips and thought maybe they would perform better.
It just turns out that my model is a bit demanding on hardware, even though it only has 700 million parameters.
yo guys where's the v5e slot i cant see it anywhere
I've removed your initial post, please don't share information specific to your project on Discord.
This is a community-led channel, not an official outlet for contacting TRC support - if you have questions about something specific to your project please email trc-support@google.com, or if it is something more general feel free to repost here without the screenshots etc.
Thank you! I didn't know.
I'll be careful next time.
Hello. I am new here. I previously thought you could use your TPU grant from Colab, but is that not true? Do you have to use the Google Cloud Console?
Yes
To the best of my knowledge this is still possible but requires some setup that isn't particularly well-documented; it might be worth asking your LLM of choice to investigate if you haven't already.
Thanks for the ping!
Hi everyone, I tried spinning up every combination of region and Spot VM type available in my allocation, however they all end up failing with the error code 13 after being provisioned. I see earlier messages about how this could be related to TPU quota limits and while this has happened to me before within specific regions, it's never been across the entire allocation. Are other people running into this today as well?
I had a chat with the support folks about this and it is actively being investigated, seems to be affecting multiple projects
@outer totem and I found a fix by passing a labels argument (set to any value) when creating a TPU, for example:
gcloud compute tpus queued-resources create \
tpu-v6e-8-0 \
--node-id=tpu-v6e-8-0 \
--zone=europe-west4-a \
--accelerator-type=v6e-8 \
--runtime-version=v2-alpha-tpuv6e \
--labels='a=b' \
--spot
This seems like a new bug.
I had the same problem! But I was trying on demand and it failed.
Thank you, passing the labels argument fixed it for me as well
talk to me nice !!
Not working for me. Is this issue persisting for anybody else too? Has anybody found a workaround?
any updates ?
resolved since the 31st, but also if the flag didn't work then it isn't the same issue afaiaa
I am not able to spin up TPU instances past a certain size despite having the quota for it, because my Hyperdisk Balanced Capacity quota is too low. However, requests to increase it have been auto-denied. Has anyone else experienced this issue?
It sounds like your TPU quota may be available, but the blocking issue is the Hyperdisk Balanced Capacity quota, which TPU instances also depend on for attached storage. Auto-denials usually happen when the request doesn’t match recent usage history, region limits, or project billing/activity signals.
Which GCP region are you trying to create the TPU in, and what TPU size/type are you requesting?
Seen new v8s tpus? They are amazing!!! Hope someday on TRC too. (ik not soon. Cus we recently got the v5s generation. But still mmm gona be good haha)
Same for me, a queued spot v6e >16 chip slice is waiting -> provisioning -> suspending -> failed with err code 13 in europe-west4-a and us-east1-d
Is there a workaround?
Multislice fails with same error (4xv6e-16)
Is the TRC team on vacation or taking a week off? (idk if there is any holiday in the USA) I sent a message 4 days ago and didn't get any response back. 😔
It's not likely that anyone is going to discuss their working schedules on an open Discord server, but worth noting that 2 of the last 4 days were the weekend...
Anyway I'll mention it to them but would generally recommend a bit of patience as it is a very small group of folks and sometimes the amount of inbound emails can be intense
Oh, okay. I didn't mean to discuss working schedules, sorry. I was just asking if there was a weekend or something. Sorry 😅
Ye. Just thought that my message was in spam cus they were usually very quick in responses. ^^
But thanksss for info :))
hello everyone !
Ello
Also, many thanks for mentioning it to them. I didn't want to disturb their work, I just panicked a bit because I got a message that my TPUs are ending soon (in a week) and my training run hasn't finished yet 😅 hehe.
Would a digital twin that does real-time nv scanning be useful? I'm working on something like that currently
set up boss im readyyy
What if you possibly have a legitimate pre digital twin thats very unique
I have a roughly 208-page paper detailing a potential framework for in-situ LCVD at 10 mK
Hiii everyone
Hi, my queued resource has been waiting 44h. Can I DM details?
Question: is there any place to get a technical look at a particular project/theoretical idea?
TRC / GDM doesn't provide anything like that but there are other discord servers etc that might be able to help
I'd say you could post here and ask for feedback from the community but traffic is pretty low so you might not get anything useful from it
I built a public reproducible simulation/package for a staged cryogenic + in-situ LCVD concept called QTA.
Repo:
https://github.com/cakeisalie89/qta-submission-package
Why I’m posting here:
The package includes a Python simulation, gate logic, Monte Carlo outputs, CSV artifacts, manifest/hash checks, and a manuscript-style technical report. I’m looking for feedback on whether this is structured clearly enough as a research-compute / reproducibility package.
Important boundary:
This is NOT claiming working hardware or a validated breakthrough. It is explicitly blocked/conditional and separates assumed parameters from measured requirements.
Looking for quick sanity-check feedback on:
- Is the repo structure understandable?
- Is the simulation reproducible from the README?
- Are the Monte Carlo / gate outputs presented clearly?
- Are assumptions vs measured claims separated enough?
- Would this need TPU/GPU/cloud compute at all, or is CPU fine for the current scope?
I can only answer the 5th: CPU/GPU will probably be fine. I don't see much ML/AI-related work in there where TPUs would really shine.
What's a TPU, and what's the difference between a TPU, a GPU and a CPU? Also, which one does Google Colab choose, and why?
Hi TRC community
We’ve been running large-scale training workloads on TRC, but around 60% of our allocation has effectively been unusable for the past 4 weeks. I’ve emailed trc-support twice and haven’t heard back yet.
v5e (128 chips, currently unusable):
v5litepod-8 requires 8 chips, but the quota TPUV5sPreemptibleLitepodServingPerProjectPerZoneForTPUAPI is capped at 4. It looks like the allocation landed in a serving quota bucket instead of training. Seeing this in both us-central1-a and europe-west4-b. Has anyone run into this before?
v4 spot (32 chips, ~5% usable):
In us-central2-b, jobs sit in WAITING_FOR_RESOURCES indefinitely. I had one finally provision after ~4 hours, then it got preempted shortly after.
v4 on-demand (32 chips, unusable):
Using the v2-alpha-tpuv6e runtime, JAX falls back to CPU and eventually OOMs. Is there a known-good runtime image for v4 + JAX right now?
Would really appreciate any guidance from anyone who’s dealt with similar issues.
Kim, Modulith Research CIC
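For the "JAX falls back to CPU" symptom above, a fail-fast check at the start of a job can surface the problem before it turns into an OOM. This is only a minimal sketch: require_accelerator is a hypothetical helper name, and it assumes the platform strings come from the platform attribute of the devices returned by jax.devices() on the TPU VM. It does not identify which runtime image is correct for v4.

```python
def require_accelerator(platforms, expected="tpu"):
    """Raise if none of the visible device platforms matches `expected`.

    `platforms` is a list of platform strings such as "tpu" or "cpu".
    Hypothetical helper for illustration, not part of JAX.
    """
    found = sorted(set(platforms))
    if expected not in found:
        raise RuntimeError(
            f"expected a '{expected}' backend, found {found}; "
            "check the runtime image before starting the job"
        )
    return found


# On a TPU VM with JAX installed, the list would come from (assumption):
#   import jax
#   require_accelerator([d.platform for d in jax.devices()])
print(require_accelerator(["tpu", "tpu", "tpu", "tpu"]))
```

Calling it with a CPU-only device list raises RuntimeError immediately, which is easier to diagnose than a late out-of-memory error on the host.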
Separately, a couple of findings from our v6e training runs that may be useful for the Pallas/Splash teams:
- jax.nn.dot_product_attention appears to materialize the full S×S attention matrix in fp32 on TPU regardless of input dtype, which caused OOMs for us at seq_len=2048. We worked around it by switching to splash_attention_kernel and adding remat.
- Splash attention seems to default to sm_scale=1.0 instead of the expected 1/√d_head. In our case this caused training to stall around ~PPL 12. Setting sm_scale=1/√128 fixed convergence immediately. Worth flagging because it’s a silent failure mode: training keeps running, but converges to garbage.
Happy to put together proper bug reports for either if helpful.
Kim, Modulith Research CIC
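To put rough numbers on the two findings above, here is a back-of-the-envelope sketch in plain Python. The batch size and head count are hypothetical (the post only gives seq_len=2048 and d_head=128), and the helper names are made up for illustration. It does not call any JAX or Pallas API.

```python
import math


def attention_scores_bytes(batch, heads, seq_len, bytes_per_elem=4):
    """Bytes for one fully materialized (batch, heads, S, S) score tensor.

    bytes_per_elem=4 corresponds to fp32.
    """
    return batch * heads * seq_len * seq_len * bytes_per_elem


def softmax_scale(d_head):
    """Conventional attention scale 1/sqrt(d_head)."""
    return 1.0 / math.sqrt(d_head)


# seq_len and d_head are from the post; batch and heads are assumptions.
batch, heads, seq_len, d_head = 8, 16, 2048, 128

per_head_mib = attention_scores_bytes(1, 1, seq_len) / 2**20
total_gib = attention_scores_bytes(batch, heads, seq_len) / 2**30
print(f"one S x S fp32 matrix: {per_head_mib} MiB")
print(f"batch={batch}, heads={heads}: {total_gib} GiB per layer")
print(f"1/sqrt({d_head}) = {softmax_scale(d_head):.6f}")
```

With a scale of 1.0 instead of 1/√128, the logits going into the softmax are about 11.3 times larger than intended, which is consistent with training that runs but converges poorly.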
Is TRC overbooked rn? Many people on Twitter have complained about it. I wonder if there is any official info.
That's correct, TRC is not currently accepting applications - there's a new form linked on the site that you can use to express interest in placement if/when it becomes available.
Ooo thanks!! ❤️ Hope TRC gona have more tpus in future cus the program is so cool!
When do you think applications will be accepted again
I'm not sure if the old application process will be restored, or if so when that might be
Thanks for the info. My project is a weird subject, so hopefully it gets seen lol
The subject I've been looking into is atomic-scale manufacturing