#tpu-research-cloud


strange quartz
#

I used the web console, could that be it? Also, isn't --worker=all needed only for multi-host?

carmine bay
#

Hi everyone — hope you’re doing well. I’m a recent TRC grantee setting up workloads and wanted to sanity-check pricing to avoid unexpected spend. I’d really appreciate any firsthand experiences or pointers to official docs:

  1. TPU VM pricing (Spot/Preemptible)
    • Are TPU VM instances (e.g., v4/v5e) fully covered under TRC when launched as Spot/Preemptible?
    • If there’s a separate “host VM” component (vCPU/RAM) in TPU VM, is that covered too, or billed separately?

  2. What still costs money outside the TPU allocation?
    • GCS buckets (storage, API ops, lifecycle)
    • Network egress (inter-region / to Internet)
    • Persistent Disk attached to TPU VM
    • External/static IPs, NAT, Load Balancers
    • Cloud Logging/Monitoring, Artifact Registry, Pub/Sub, Cloud Build, etc.

Context: Region = [X], TPU type = [v5e-8 / v4-8].
Thank you in advance! And if this has been answered before, I’d be grateful for a link to the thread.

bitter nebula
# carmine bay Hi everyone — hope you’re doing well. I’m a recent TRC grantee setting up worklo...

Are TPU VM instances (e.g., v4/v5e) fully covered under TRC when launched as Spot/Preemptible?

When you register a project you'll get a notification with the quotas that you have access to - those are covered.

If there’s a separate “host VM” component (vCPU/RAM) in TPU VM, is that covered too, or billed separately?

The host VM architecture has been deprecated for a long time (it might not even be usable anymore but I'm not sure) - you should be using TPU VM via Queued Resource w/ TRC quota so this wouldn't be an issue.

What still costs money outside the TPU allocation?

TRC covers the Cloud TPU service so anything that you use otherwise is subject to regular billing.

vivid plank
#

What do you guys usually do to use larger models, like 100B+, since the disk is like 90 GB per host?

lofty yarrow
#

Hi! Recently I'm training some LLMs on Google v5e and v4 TPUs. However, I have encountered NaN loss problems on v5litepod-64 in zone us-central1-a. Specifically, the model trains successfully for ~4000 steps and suddenly the gradient and loss exploded.

I tried to run training with exactly the same setup on v4-128 and did not encounter the same problem. I wonder:

  1. Has anyone else encountered this problem before? Is this likely a hardware issue?
  2. Is there a way to exclude certain TPU cores when training? (like the --exclude flag we typically have in Slurm systems)

Thank you for your help in advance! Please let me know if more details are needed:)

slate pond
#

Where/how should I contact Google Research TRC about sharing research work through open source releases, blog posts, program feedback, or anything else mentioned in the onboarding email?

bitter nebula
#

For the record, if it's been indexed by Google Scholar and it mentions the program by name we may have already picked it up - probably searching for the paper name or the first author's surname on that page is the easiest way to tell

slate pond
#

Ahhhh, thank you for the information. I missed the form you mentioned. New to the program and I've never shared any of my work before, so trying to figure out the best way to go about all that.

bitter nebula
#

Yeah that link is fairly hard to find - the publication page was a lot shorter when we added it. 🙂

slate pond
#

Haha, that's a good sign at least. I hope I can contribute something novel to the program.

eternal tide
#

Hello, TRC team.

The trc email said free access to those resources

1 on-demand Cloud TPU v5e chips in zone europe-west4-a
1 on-demand Cloud TPU v5lite chips in zone europe-west4-b

may I know whether the v5lite-32/64 can be covered in this case? Thanks for your help!

eternal tide
#

Ok

rustic ginkgo
#

Excited to share here my slides at Google's Devfest in Manila for a talk about TPUs and the TRC program

rich frigate
#

Hello!
I submitted an application to TRC two weeks ago and didn't get any response, neither approval nor denial, so I decided to post here to check. (I checked my spam folder too, nothing's there.)

bitter nebula
#

Not sure that's something that can be covered in detail on discord but generally speaking no email == not approved, unfortunately

rich frigate
#

so it doesn't mean that my application was lost or ignored, just denied. correct?

bitter nebula
#

I suppose anything is possible but it's not likely the case - you could try emailing trc-support@google.com to see if they can verify

rich frigate
#

Will do, thanks

junior storm
#

Hello, I have a question. I created a TPU in the zone I received through TRC, following the normal procedure. Then I found that my code cannot find the TPU on the hardware. What is the cause?

#

Thanks in advance for answering me.

junior storm
#

I'm starting over again, if I find anything else I'll let you know, thanks.

bitter nebula
#

at a super-high level a lot of times that ends up being because the code is bugged (eg using an api incorrectly) or the tpu is unreachable (preempted, maintenance event etc)

prime bear
#

Hey! I am a bit confused with global batch size calculation on v6e-8 (for example). So jax.local_device_count("tpu") shows 8 for a v6e-8. I am using MaxText that has a per_device_batch_size parameter. So when I use 64 as per_device_batch_size does it mean, that my global batch size is 64 x 8 = 512? If I would use a v6e-64, is it then 64 x 64 = 4096?
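For the batch-size arithmetic in the question above, here is a minimal sketch. It assumes no gradient accumulation and that each chip shows up as exactly one JAX device (neither is confirmed in this thread). `global_batch_size` is a hypothetical helper, not a MaxText function. Under those assumptions the numbers in the question check out: 64 x 8 = 512 and 64 x 64 = 4096. One caveat: on a multi-host slice `jax.local_device_count()` only counts the devices attached to the current host, so `jax.device_count()` is the value to multiply by.

```python
def global_batch_size(per_device_batch_size: int, num_devices: int) -> int:
    """Hypothetical helper: examples consumed per step across the whole slice.

    Assumes no gradient accumulation and one JAX device per chip.
    """
    if per_device_batch_size <= 0 or num_devices <= 0:
        raise ValueError("both arguments must be positive")
    return per_device_batch_size * num_devices


# Device counts taken from the accelerator names in the question.
print(global_batch_size(64, 8))   # v6e-8
print(global_batch_size(64, 64))  # v6e-64
```

On a real slice you would pass `jax.device_count()` as `num_devices` rather than hard-coding it.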

hoary swallow
#

Hi, I am trying to create a tpu v4 on demand instance but when I run this, I get an error, what am I doing wrong?

gcloud compute tpus queued-resources create tpuv4 \
    --node-id v4-32-ond \
    --zone us-central2-b \
    --accelerator-type v4-32 \
    --runtime-version tpu-ubuntu2204-base
ERROR: (gcloud.compute.tpus.queued-resources.create) NOT_FOUND: Cloud TPU was unable to complete the operation.
junior storm
#

Hello, I have a question about GCP credits. I received a trial credit for GenAI App Builder.
What are 32K credits? And how do I check

fiery siren
#

any idea why I can't create the TPU? (also I checked. all of the quota usage is at 0). gcloud compute tpus tpu-vm create node-tpu --zone=us-central1-a --accelerator-type=v5litepod-64 --version=v2-tpuv5-litepod --preemptible
Create request issued for: [node-tpu]
Waiting for operation [projects/tpu-proj-ai-detection/locations/us-central1-a/operations/operation-1763648567926-644076e665ecc-2a7ceae1-85b1a8b4] to complete...failed.
ERROR: (gcloud.compute.tpus.tpu-vm.create) {
"code": 8,
"message": "You have reached IN_USE_ADDRESSES limit. [EID: 0xbf338a209239294b]"
}

bitter nebula
#

sounds like a quota issue, I think you can request an increase via the GCP console

fiery siren
#

i dont know which quota it is because all quotas i found have usage on 0 and limit at 8

tulip kestrel
#

Got a problem. One of my TPUs is stuck and I cannot ssh to it. And it's a pod, so idk if it's even possible to restart it. And I've got stuff saved there. Is it possible to somehow just take the stuff from there or unstick it? It happened when I was ssh'd into it and my PC crashed; then I opened the PC again, tried ssh, and it said that the connection timed out.

bitter nebula
#

to the best of my knowledge no, but maybe someone else on the server has found a way...

vague osprey
#

Hello everyone, I am new here, what is TPU about?

twin zenith
#

Soo, I don't know if I posted in the right spot, but, I think I stumbled across an architecture that might, like, nuke GPU architecture and TPU's for LLM design. I really want someone to tear this idea down, cause the math keeps saying it amplifies LLM speed by 200% and shrinks it exponentially, while allowing for rapid iteration. I can't have been the only person to have stumbled across this, I feel like.

novel oyster
#

Can anyone tell me what configuration should I use for running JAX on v6e-64?

tidal garnet
#

Good evening.

novel oyster
#

Retrying at least 10 times for v6e-64 vm

#

😭

uneven seal
#

Hey! Glad to meet all of you guys who're in the same boat of testing out TPUs for research. I've been looking at a ton of the guides/documentation, and wanted to ask about some of the cool stuff you've been doing that you haven't written up/posted about. Any cool challenges you had to work through? etc.

uneven seal
#

If you mean by the type of VM, you should be using v2-alpha-tpuv6e for the best compatibility, at least from the documentation I've been reading for running some test workloads

novel oyster
#

Also, is it normal to get internal error while creating tpu vm even with spot + queue?
I've used this command to queue 64 v6e resources but it just keeps failing.

gcloud compute tpus queued-resources create qr-v6e64-ew4a-spot \
  --node-id node-64 \
  --zone europe-west4-a \
  --accelerator-type v6e-64 \
  --runtime-version v2-alpha-tpuv6e \
  --valid-after-duration 1m \
  --spot \
  --internal-ips
bitter nebula
#

how is it failing / are you getting an error code back?

novel oyster
#

In every region I have quota for 64 v6e spot chips, but any setup with more than 16 TPUs results in similar problems.

#

With console I do get something like this if I retry soon
Cloud TPU was unable to complete the operation. Please try again, or contact support if the problem persists. [EID: 0x369a052af37e209f]

bitter nebula
#

❤️ opaque errors
sounds like it could be quota related though
(if you have access to them) can you try an older gen TPU to see if it works / fails in the same way?

#

if it is specific to v6e I'd guess that you might not have enough Hyperdisk quota, in which case you'd need to request an increase

keen crescent
#

i think the trc people forgot to give me quota for us-central2-b

#

the quota page doesn't show i have quota for us-central2-b, the dashboard for creating the tpu doesn't even list the region, and the cli here doesn't work

bitter nebula
#

could be a timing thing though, are you still having an issue?

keen crescent
#

yes

#

i did "gcloud compute tpus tpu-vm create will-2 --zone=us-central2-b --accelerator-type=v4-32 --version=tpu-ubuntu2204-base --project=trctpu-123"

keen crescent
#

yeah i did

uneven seal
#

are we able to use GKE with our quota of TPUs?

hazy panther
#

Hi, I received access to
32 on-demand Cloud TPU v4 chips in zone us-central2-b
and some other ones

but

  1. trying to create a TPU VM using zone us-central2-b returns Permission denied on 'locations/us-central2-b' (or it may not exist).
    Is this location deprecated or something?

  2. and for the other zones I simply get Insufficient capacity
    Is there a place to check zone capacity?

bitter nebula
#

Is this location deprecated or something?

no, if you still can't access it you might want to email trc-support@google.com

Is there a place to check zone capacity?

also no, but you can use the Queued Resource API to queue until there is availability
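As a rough illustration of queuing through the Queued Resource API, the sketch below assembles a `gcloud compute tpus queued-resources create` command using only the flags that appear in commands elsewhere in this channel (`--node-id`, `--zone`, `--accelerator-type`, `--runtime-version`, `--spot`). The helper name and the `my-qr` / `my-node` identifiers are placeholders made up for the example.

```python
import shlex
from typing import List


def queued_resource_create_cmd(
    qr_id: str,
    node_id: str,
    zone: str,
    accelerator_type: str,
    runtime_version: str,
    spot: bool = False,
) -> List[str]:
    """Hypothetical helper: build the argv for a queued-resource request."""
    cmd = [
        "gcloud", "compute", "tpus", "queued-resources", "create", qr_id,
        "--node-id", node_id,
        "--zone", zone,
        "--accelerator-type", accelerator_type,
        "--runtime-version", runtime_version,
    ]
    if spot:
        # Only request spot capacity when the quota you were granted is spot.
        cmd.append("--spot")
    return cmd


# Placeholder IDs; zone, type and runtime are the ones quoted in this channel.
cmd = queued_resource_create_cmd(
    "my-qr", "my-node", "us-central2-b", "v4-32", "tpu-ubuntu2204-base"
)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; the request then waits in the queue until capacity is available instead of failing immediately.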

hazy panther
#

Any recommendation on library for logging and train-run monitoring on TRC?

hazy panther
#

I found a rly good repo called tpux. is it widely used?

#

I'm considering using it to help iteration speed. I want to edit my training loop on one VM but run the test between all of them. Doesn't seem like there's a simple way to do that using gcloud
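One way to get the "edit once, run everywhere" workflow with plain gcloud is sketched below. It assumes the `--worker=all` flag on `gcloud compute tpus tpu-vm scp` and `gcloud compute tpus tpu-vm ssh` is available in your gcloud version. The helper names, the TPU name `my-tpu`, and the file paths are placeholders for illustration only.

```python
import shlex
from typing import List


def scp_to_all_workers_cmd(local_path: str, tpu_name: str, remote_path: str, zone: str) -> List[str]:
    """Hypothetical helper: copy one local file to every host of the slice."""
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "scp",
        local_path, f"{tpu_name}:{remote_path}",
        f"--zone={zone}", "--worker=all",
    ]


def run_on_all_workers_cmd(tpu_name: str, zone: str, command: str) -> List[str]:
    """Hypothetical helper: run the same shell command on every host."""
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        f"--zone={zone}", "--worker=all", f"--command={command}",
    ]


# Placeholder names and paths.
copy_cmd = scp_to_all_workers_cmd("train.py", "my-tpu", "~/train.py", "us-central2-b")
run_cmd = run_on_all_workers_cmd("my-tpu", "us-central2-b", "python3 ~/train.py")
print(shlex.join(copy_cmd))
print(shlex.join(run_cmd))
```

Both lists can be passed to `subprocess.run`; copying first and running second keeps every host on the same version of the training loop.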

hazy panther
#

yes I got it working. tpux with Wandb

sinful spindle
#

Hey, so i got access to the cloud tpus, but when i try to create a tpu vm it starts creating for a few minutes and then it shuts down, and when i see the logs it says this:

btw im on the free trial, and also i tried using both tpuv6e and v5, and i tried creating through the web interface and cli but its still shutting down

I already sent an email yesterday, but no response

sinful spindle
#

just realized i sent them the wrong project number (in the email)💀

sinful spindle
#

ok it works with v4-32

obtuse pilot
#

Sorry @tropic topaz Google does not provide official support on this server
Please check #general message for the official support links

quaint silo
#

Hi, can anyone recommend online communities for getting help on PyTorch XLA related issues? I've spent a couple of days porting our existing CUDA code to work on the TPU cloud. It works; however, it is not as efficient due to the lower memory bandwidth of TPUs. I don't want to waste too much time before the TRC timeline ends. Thanks in advance!

sturdy valley
#

Can anyone help please? Am I doing it wrong or the TPUs are out of stock for now?

bitter nebula
#

Unfortunately no one is going to be able to offer much help based on the screenshot as it doesn't give any information as to why the failures are happening.

As a first step I'd recommend using the gcloud CLI as it will typically give more detailed error messages - if it's not something obvious (like out of capacity errors) then share the error message and someone might be able to help.

river nexus
#

You also critically rely on experimental memristor parts that are prone to bit flips and other issues, especially at low voltage levels.

#

C=2.717ish

#

| Radix (b) | Efficiency constant (b / ln b) | Information per digit (log_2 b) | Relative cost (vs. ternary) |
| --- | --- | --- | --- |
| 2 (Binary) | 2.885 | 1.00 bits | 1.056 |
| 3 (Ternary) | 2.731 | 1.58 bits | 1.000 |
| 4 (Quaternary) | 2.885 | 2.00 bits | 1.056 |
| 5 (Pentary) | 3.106 | 2.32 bits | 1.137 |
| 10 (Decimal) | 4.343 | 3.32 bits | |

Good shot, but I'm afraid it's just not going to work. I've tried similar things myself.

#

I will spare you the rest of the problems as I don't want to make you cry, but suffice to say, the idea is not tenable, although it is novel. Keep at it.
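The figures in the radix table above can be recomputed with a few lines of Python. This is only a sketch of the b / ln b formula named in the table header, and `radix_economy` is an illustrative name. The continuous minimum of b / ln b sits at b = e (about 2.718), which is presumably what the "C=2.717ish" message refers to.

```python
import math


def radix_economy(base: int) -> float:
    """b / ln(b): the 'efficiency constant' column of the table."""
    return base / math.log(base)


ternary = radix_economy(3)
for base in (2, 3, 4, 5, 10):
    print(
        base,
        round(radix_economy(base), 3),            # efficiency constant
        round(math.log2(base), 2),                # bits of information per digit
        round(radix_economy(base) / ternary, 3),  # cost relative to ternary
    )
```

Small differences in the last digit against the table come down to rounding.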

naive wagon
#

Has anyone used TPUs for inference? I would like to know the gains compared to the standard GPUs in Colab.

placid scarab
#

Hello, Is anyone familiar with torch-xla able to confirm whether this is an appropriate way to measure inference runs on a TPU:

```python
for _ in range(num_warmup_runs):
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits

latencies_ms = []

for i in range(args.num_iterations):
    start_time = time.time()
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
        torch_xla.sync()

    xm.wait_device_ops()
    end_time = time.time()

    latencies_ms.append((end_time - start_time) * 1000)
```
#

xm.wait_device_ops() seems to be blocking the run indefinitely

marsh fractal
# placid scarab Hello, Is anyone familiar with torch-xla able to confirm whether this is an appr...

I can help! The issue is likely that xm.wait_device_ops() is deprecated/broken in newer torch-xla versions and can cause hangs.
The torch_xla.sync() call should already be sufficient to wait for TPU operations to complete - you probably don't need the wait_device_ops() at all.
Question: What version of torch-xla are you using? If it's recent (2.0+), just remove the xm.wait_device_ops() line and rely on torch_xla.sync() alone.

marsh fractal
# placid scarab Hi Joshua, thanks for the help. I've left only torch_xla.sync() but I'm now obse...

Both issues suggest the sync isn't working properly. The program hanging and identical performance with/without sync usually means your tensors or model aren't actually on the TPU device, so there's nothing to synchronize.
I can either walk you through debugging the device placement and fixing the sync points, or if you share your full code snippet I can just rewrite the benchmarking section with proper TPU synchronization for you.
Which approach works better for you?

placid scarab
#

Ah thanks! Here's a code snippet:

```python
device = torch_xla.device()

torch_model = torch_model.to(device)
torch_inputs = torch_inputs.to(device)

print(f"Running inference on {model}")
print(f"Warming up {model} with {num_warmup_runs} runs")
for _ in range(num_warmup_runs):
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
        torch_xla.sync(wait=True)

latencies_ms = []

for i in range(args.num_iterations):
    start_time = time.time()
    with torch.no_grad():
        logits = torch_model(torch_inputs).logits
        torch_xla.sync(wait=True)

    end_time = time.time()
    latencies_ms.append((end_time - start_time) * 1000)
```

Can you figure out what I'm doing wrong?
full arch
# placid scarab Ah thanks! Here's a code snippet: ⁨```python device = torch_xla.device() ...

I can see the issue: you're calling sync(wait=True), but that parameter doesn't exist in torch_xla.sync(). The sync is likely failing silently, which explains both the hanging and the inconsistent timing.
Here's the thing: this needs a proper rewrite with correct sync semantics and proper timing placement. Rather than going back-and-forth in the thread, would you be open to discussing this privately? I can fix this properly for you and make sure your TPU benchmarking is accurate.
Want to move this to DMs so I can help you sort this out?

placid scarab
#

Anyone able to point out where this is going wrong (without charging a fee) ?
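For reference, the timing pattern both snippets in this thread are aiming for can be written framework-agnostically: warm up first so compilation is not timed, then time each step together with a blocking sync. The sketch below is not torch-xla code; `benchmark`, `step_fn` and `sync_fn` are hypothetical names. On torch-xla, `sync_fn` would wrap whichever blocking call your installed version provides (for example `torch_xla.sync()` followed by `xm.wait_device_ops()`). The replies above disagree about that signature, so check it against your installed version.

```python
import time
from typing import Callable, List


def benchmark(
    step_fn: Callable[[], object],
    sync_fn: Callable[[], object],
    num_warmup: int,
    num_iterations: int,
) -> List[float]:
    """Return per-iteration latencies in milliseconds."""
    # Warm-up: triggers tracing/compilation so it is excluded from the timings.
    for _ in range(num_warmup):
        step_fn()
        sync_fn()

    latencies_ms = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        step_fn()
        sync_fn()  # must block until the device has actually finished
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# CPU stand-ins so the sketch runs anywhere.
latencies = benchmark(lambda: sum(range(10_000)), lambda: None, 2, 5)
print(len(latencies), min(latencies))
```

If timings look identical with and without the sync call, that usually means `sync_fn` is not actually blocking, so the loop is only measuring how fast work is enqueued.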

tribal stag
#

I have been "waiting for resources" for days on v4 TPUs. sigh. Anyone else feel the same?

tribal stag
#

I dont have quota for on demand v6e-8 in us-central2-b. I only have preemptible v6e in us-central1-a, but it got preempted instantly. lol.

quaint silo
#

I am not sure if this helps anyone, but I've shared my experiences on migrating an LM training pipeline to run on TPUs here: https://dogac.dev/blog/2026/migrating-to-tpu/, maybe it helps you to migrate your existing pipelines and debug performance issues.

quaint silo
#

I am not sure what you mean. I create a queued resource as the TPU documentation recommends and I usually get access to it within 5-10 minutes. If it is preempted, I retry.

waxen pilot
#

This is great content, Dogacel! Thanks for sharing

chrome bough
#

yo guys can you make a free tpu with infinite compute power and storage

#

that would be useful ty

#

||jk||

restive notch
#

Does anybody know how the TRC team reacts to asking for more TPUs (or more advanced TPUs like the v7x)? I have a couple of projects that could benefit from the 3d topology (and optical switches). Plus, my v6e-16 cluster has been getting preempted a lot. I have a couple of proofs of concept already. Also, has anybody ever gotten a v7x or not? As far as I am aware, it should be available, but only in public preview. Would it be possible, or is the capacity being used for internal LLM training?

sullen rivet
#

Hi all,
As part of TPU research cloud, I have been provisioned free access to v4-32 on-demand TPUs in us-central2-b. However, I am getting an issue when I try to create TPUs with that configuration. The error message is not that informative so I was wondering if I did something wrong.

Because I want the 32 chip allocation, am I supposed to create multiple TPU hosts myself instead? I can try to create 4 v4-8 VMs and maybe try to network them somehow but the cost display on the side of the screen is making me cautious. My understanding was that my original approach itself was supposed to create multiple VMs .

Any help would be appreciated here. Thanks!

vapid siren
#

Because I want the 32 chip allocation, am I supposed to create multiple TPU hosts myself instead?
No, that should never be necessary, the TPU deployment thing also sets the correct env variables for JAX so that the hosts can find each other

sullen rivet
# bitter nebula Hard to know exactly what's happening without the error response but if you're a...

Thanks! I finally managed to get something on-demand provisioned through the UI and I’m sure it’s part of the TRC grant I’ve been allotted (v4-32/us-central2-b). I can’t exactly stop the TPU when not in use because it’s not a single-device TPU, so I’m trying to see if there’d be some mechanism to verify that I’m not being charged for this before I randomly see a really big charge tomorrow

restive notch
# bitter nebula If you're able to provide a compelling plan for your research then an increase a...

I see. Are v5p-64 TPUs available then? I had a couple of ideas for the OCS and 3d topologies. Hopefully, I can test out, see if they work, and then publish a paper about it. If anybody is knowledgeable on TPU specifics, could they check these ideas?

The first one is using topology-aware Manhattan distance for routing in MoE models: trying to penalize/reward the model for using TPUs that are physically closer to each other. Likely, I would just anneal the penalty to try not to kill model performance. Plus, some load balancing between experts. The big issue I can see is if the communication overhead between experts isn't large enough to justify the work. That being said, it could be a (mostly) free performance gain.

The second one is using the 3D torus for helical ring attention. Theoretically, having a Hamiltonian path or multiple interleaved helices should make it so that you can transfer the KV-block while it is computing attention. So increasing the memory bandwidth tremendously and thus the context. Personally, I suspect the Gemini family of LLMs does something similar, and that's why they can have such large context windows.

The third one is arguably the most difficult. The hard part is the SparseCores. I cannot for the life of me find any good documentation on the SparseCores, even though it would be amazing for specific tasks. If any Google insiders familiar with the SparseCore functions, documentation, or details could help me out, it would be greatly appreciated. Anyways, the general idea is to use the SparseCore as a Locality Sensitive Hashing engine to retrieve the attention (Leaving the MXU for computation). The benefit of it is that you can use the SparseCore for Top-K relevant keys, then compute them on the MXU. While it could be incredible, I have no idea if the SparseCore would support this, so this is why it's the least likely out of all 3.

fallen salmon
#

Hi all, I'm running into an issue where I spin up a v6e-8 spot instance and it gets provisioned successfully, but when I login I can see that it has a placeholder container (fake_tensorflow:latest) and no access to actual TPU resources (TPU libraries show that only the CPU is available, /dev folder has no accel subfolders, etc).

Wondering if anyone else has encountered this. I've tried a variety of settings to boot up the instance (runtime: v2-alpha-tpuv6e or tpu-ubuntu2204-base, CLI: alpha or general, direct tpu-vm or queued resource, etc).

vapid siren
#

I have seen the fake_tensorflow before as well, but was still able to communicate with the TPU. I'm also not sure if the chips are still exposed as device files under /dev. (I actually think I also had that problem before that I couldn't find the chip there). Have you tried just running jax.devices() to see if the accelerator is still found? What library calls did you use to try to find it?

vapid siren
#

Last time I had that problem I had selected the wrong software, but it looks like you already tried selecting a different image

#

Are you sure you have installed jax[tpu] and libtpu?
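A quick way to run the `jax.devices()` check suggested above is sketched here. `tpu_visible` and `current_platforms` are hypothetical helper names. The only JAX calls used are `jax.devices()` and the `platform` attribute of each returned device, and JAX is imported lazily so the snippet also runs on a machine without it.

```python
from typing import Iterable, List


def tpu_visible(platforms: Iterable[str]) -> bool:
    """True if at least one of the given platform names is a TPU."""
    return any(p.lower() == "tpu" for p in platforms)


def current_platforms() -> List[str]:
    """Platform name of every device JAX can see; [] if JAX is not installed."""
    try:
        import jax  # imported lazily so the helper is safe to define anywhere
    except ImportError:
        return []
    return [device.platform for device in jax.devices()]


# Pure-Python illustration of the two outcomes discussed in the thread.
print(tpu_visible(["tpu"] * 8))  # accelerator found
print(tpu_visible(["cpu"]))      # JAX fell back to CPU
```

On the TPU VM itself, `tpu_visible(current_platforms())` returning `False` would be consistent with the TPU-enabled JAX install being missing, which is the point of the `jax[tpu]` question above.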

fallen salmon
#

Really appreciate your help! It's also good to know that the /dev/accel thing was bad/outdated information and those folders don't exist anymore

vapid siren
#

Always happy to help! Can always DM me if I’m not looking at the channel

sullen rivet
#

Hi, Is there someplace we can reach out to about possibly extending the free cloud TPUs that we receive? I have about 6 days left before my allocation expires and have emailed TRC support but I guess I’m generally seeking advice on how to go about this/best way to mitigate costs after the allocation expires. Even a more limited set of configurations/zones would be helpful here. Thanks!

fallen salmon
#

Has anyone else run into this issue, usage of quota shown as 100% despite not having any TPUs or queued requests in the console?
Leaving for reference. It resolved to 25% around 10 min after deleting old suspended queued requests.

marsh fractal
#

Hey, I can help with this. Code 13 usually happens because of quota limits, region capacity issues, or network config (like external IP restrictions) when using free credits. What region are you trying to create the TPU in, and are you attaching it to a VM with an external or internal IP?

marsh fractal
#

Hey! I’m actually not an AI lol, just someone who works with this stuff and tries to help when I can.
Since you’re using europe-west4-a with internal IPs, the code 13 error could be coming from TPU quota limits on free credits or capacity issues for larger TPU types in that zone. I’ve helped troubleshoot similar setups before.
If you want, you can DM me the exact config you’re using and the TPU type you’re trying to create. I can take a quick look and help you get it working. If it ends up being something more involved, I can also help you set it up properly.

restive notch
#

I would actually love to hear if you guys ended up fixing this

obtuse pilot
#

Hey @gleaming sierra! Just a friendly reminder to keep the chat relevant and easy to read. We need to avoid spam, nonsensical messages, and excessive emoji use to keep the conversation flowing for everyone. Your recent messages that violated this rule have been deleted. Thanks for understanding!

knotty mirage
#

Hi, I am trying to test-drive the TPUs, but I have not yet been able to get an on-demand TPU (queued for half a day) - and spot instances (tried multiple zones) seem to get preempted within minutes. Is that expected behaviour? Because if it is, I would probably not invest any more time into this. Any advice?

vestal coral
#

Hello, I am a developer of a large language model based on SNN. I am interested in what the token rate per second will be on TPU and how to properly optimize it. My model has 618 million parameters.

#

I just have very slow speeds on server video cards. 0.3Ts

hazy halo
#

Maybe not that bad, but preemptible were not worth the effort for training back then

knotty mirage
#

Hm. Though why do they even bother with the TRC program then? If this is a normal experience, I cannot imagine that many people are likely to use or recommend TPUs going forward.

bitter nebula
# knotty mirage Hm. Though why do they even bother with the TRC program then? If this is a norma...

We do it because we think the mission is important.

Over the years TPUs have become increasingly popular, and with modern coding agents the technical barriers to using them have practically vanished. So, unfortunately, what you have today is a time of huge demand and relatively constrained supply (and that isn't unique to us, obviously). We understand that things could be better, and we try to make positive changes to that end whenever the opportunity arises, but, yeah, it's hard out there right now.

#

And on that note, thanks to anyone who sees this that has stuck with us through the rough times - we'd be nothing without you! ♥️

river adder
#

yo guys where's the v5e slot i cant see it anywhere

bitter nebula
#

I've removed your initial post, please don't share information specific to your project on Discord.

This is a community-led channel, not an official outlet for contacting TRC support - if you have questions about something specific to your project please email trc-support@google.com, or if it is something more general feel free to repost here without the screenshots etc.

grave orchid
#

Thank you! I didn't know.
I'll be careful next time.

rotund scroll
#

Hello. I am new here and earlier I thought you could use Colab for using your tpu grant but is that not true? Do you have to use Google Cloud Console?

tame grotto
#

Thanks for the ping!

outer totem
#

Hi everyone, I tried spinning up every combination of region and Spot VM type available in my allocation, however they all end up failing with the error code 13 after being provisioned. I see earlier messages about how this could be related to TPU quota limits and while this has happened to me before within specific regions, it's never been across the entire allocation. Are other people running into this today as well?

rotund scroll
#

I had the same problem! But I was trying on demand and it failed.

jaunty vine
#

talk to me nice !!

bitter nebula
#

resolved since the 31st, but also if the flag didn't work then it isn't the same issue afaiaa

trim nexus
#

I am not able to spin up TPU instances past a certain size despite having the quota for it as my Hyperdisk Balanced Capacity quota is too low. However, requests to increase it have been auto-denied, has anyone else experienced this issue?

full arch
# trim nexus I am not able to spin up TPU instances past a certain size despite having the qu...

It sounds like your TPU quota may be available, but the blocking issue is the Hyperdisk Balanced Capacity quota, which TPU instances also depend on for attached storage. Auto-denials usually happen when the request doesn’t match recent usage history, region limits, or project billing/activity signals.
Which GCP region are you trying to create the TPU in, and what TPU size/type are you requesting?

tulip kestrel
#

Seen new v8s tpus? They are amazing!!! Hope someday on TRC too. (ik not soon. Cus we recently got the v5s generation. But still mmm gona be good haha)

stiff grail
#

Is there a workaround?

tulip kestrel
#

Is the TRC team on vacation or off this week? (idk if there is any holiday in the USA) Because 4 days ago I sent a message and didn't get any response back. 😔

bitter nebula
#

It's not likely that anyone is going to discuss their working schedules on an open Discord server, but worth noting that 2 of the last 4 days were the weekend...

#

Anyway I'll mention it to them but would generally recommend a bit of patience as it is a very small group of folks and sometimes the amount of inbound emails can be intense

tulip kestrel
#

But thanksss for info :))

jaunty vine
#

hello everyone !

spice willow
#

Would a digital twin that does real-time nv scanning be useful? I'm working on something like that currently.

jaunty vine
#

let me get credits for Huawei's research page im gettin them in my bag 🎒

#

👿 🦹‍♂️

spice willow
#

Have about a 208 page paper detailing a potential framework for LCVD in-situ at 10mk

meager fog
#

Hiii everyone

nimble stream
#

Hi, my queued resource has been waiting 44h. Can I DM details?

bitter nebula
#

TRC / GDM doesn't provide anything like that but there are other discord servers etc that might be able to help

#

I'd say you could post here and ask for feedback from the community but traffic is pretty low so you might not get anything useful from it

spice willow
# bitter nebula I'd say you could post here and ask for feedback from the community but traffic ...

I built a public reproducible simulation/package for a staged cryogenic + in-situ LCVD concept called QTA.

Repo:
https://github.com/cakeisalie89/qta-submission-package

Why I’m posting here:
The package includes a Python simulation, gate logic, Monte Carlo outputs, CSV artifacts, manifest/hash checks, and a manuscript-style technical report. I’m looking for feedback on whether this is structured clearly enough as a research-compute / reproducibility package.

Important boundary:
This is NOT claiming working hardware or a validated breakthrough. It is explicitly blocked/conditional and separates assumed parameters from measured requirements.

Looking for quick sanity-check feedback on:

  1. Is the repo structure understandable?
  2. Is the simulation reproducible from the README?
  3. Are the Monte Carlo / gate outputs presented clearly?
  4. Are assumptions vs measured claims separated enough?
  5. Would this need TPU/GPU/cloud compute at all, or is CPU fine for the current scope?
feral oak
#

What's a TPU, and what's the difference between a TPU, a GPU, and a CPU? Also, which one does Google Colab choose, and why?

swift junco
#

Hi TRC community
We’ve been running large-scale training workloads on TRC, but around 60% of our allocation has effectively been unusable for the past 4 weeks. I’ve emailed trc-support twice and haven’t heard back yet.
v5e (128 chips, currently unusable):
v5litepod-8 requires 8 chips, but the quota TPUV5sPreemptibleLitepodServingPerProjectPerZoneForTPUAPI is capped at 4. It looks like the allocation landed in a serving quota bucket instead of training. Seeing this in both us-central1-a and europe-west4-b. Has anyone run into this before?
v4 spot (32 chips, ~5% usable):
In us-central2-b, jobs sit in WAITING_FOR_RESOURCES indefinitely. I had one finally provision after ~4 hours, then it got preempted shortly after.
v4 on-demand (32 chips, unusable):
Using the v2-alpha-tpuv6e runtime, JAX falls back to CPU and eventually OOMs. Is there a known-good runtime image for v4 + JAX right now?
Would really appreciate any guidance from anyone who’s dealt with similar issues.
Kim, Modulith Research CIC

#

Separately, a couple of findings from our v6e training runs that may be useful for the Pallas/Splash teams:

  1. jax.nn.dot_product_attention appears to materialize the full S×S attention matrix in fp32 on TPU regardless of input dtype, which caused OOMs for us at seq_len=2048. We worked around it by switching to splash_attention_kernel and adding remat.

  2. Splash attention seems to default sm_scale=1.0 instead of the expected 1/√d_head. In our case this caused training to stall around ~PPL 12. Setting sm_scale=1/√128 fixed convergence immediately. Worth flagging because it’s a silent failure mode — training keeps running, but converges to garbage.

Happy to put together proper bug reports for either if helpful.
Kim, Modulith Research CIC
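Some back-of-the-envelope numbers for the two findings above, as a sketch only. The function names are illustrative and the batch and head counts are hypothetical parameters. Nothing here confirms how `jax.nn.dot_product_attention` or the splash kernel behave internally; it just shows the size of a fully materialized fp32 S×S score matrix and the conventional 1/√d_head softmax scale mentioned in the report.

```python
import math


def attention_scores_bytes(batch: int, num_heads: int, seq_len: int, bytes_per_element: int = 4) -> int:
    """Memory for a fully materialized [batch, heads, S, S] score tensor (fp32 by default)."""
    return batch * num_heads * seq_len * seq_len * bytes_per_element


def default_sm_scale(d_head: int) -> float:
    """Conventional softmax scale 1 / sqrt(d_head) for scaled dot-product attention."""
    return 1.0 / math.sqrt(d_head)


# One head of one sequence at seq_len=2048, reported in MiB.
print(attention_scores_bytes(1, 1, 2048) / 2**20)
# Hypothetical batch of 8 with 16 heads, reported in GiB.
print(attention_scores_bytes(8, 16, 2048) / 2**30)
# The scale quoted in the report for d_head=128.
print(default_sm_scale(128))
```

With `sm_scale=1.0` and d_head = 128 the logits are roughly 11 times larger than with the conventional scale, which would be consistent with the stalled convergence described above.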

tulip kestrel
#

Is TRC overbooked rn? Many people on Twitter complained about it. Wonder if there is any official info.

bitter nebula
#

That's correct, TRC is not currently accepting applications - there's a new form linked on the site that you can use to express interest in placement if/when it becomes available.

spice willow
#

Subject ive been looking into is atomic scale manufacturing