#Deploy on Kubernetes Cluster - Permify D...
Hey @untold gull, that performance gap is pretty big. We don't see results like the ones you mentioned in our load tests, so I'd like to figure out what's going on. I have a few quick questions:
- Are you using the latest version?
- Can you share your Permify schema?
- Can you share your deployment YAML too?
Hey @lone yew, thanks for the help.
Yes, we are using the latest image, ghcr.io/permify/permify:latest
The schema is a placeholder for testing and won't represent our final use case, but the one used in the load test is:
entity user {}

entity appgroup {
    relation member @user
}

entity person {
    relation user_of @user

    permission user = user_of
}

entity organization {
    relation admin_of @appgroup
    relation employee_of @person

    permission admin = admin_of.member
    permission manage = employee_of.user or admin
}

entity order {
    relation guardian @person
    relation organization @organization

    permission view = guardian.user or organization.manage
    permission edit = guardian.user or organization.manage
    permission delete = organization.admin
}

entity charge {
    relation order @order

    permission view = order.view
    permission edit = order.edit
    permission delete = order.delete
}

entity receivable {
    relation charge @charge
    relation origin @receivable

    permission view = charge.view or origin.view
    permission edit = charge.edit or origin.edit
    permission delete = charge.delete or origin.delete
}

entity invoice {
    relation receivable @receivable

    permission view = receivable.view
    permission edit = receivable.edit
    permission delete = receivable.delete
}
The deployment YAML is
The permission check we are running looks like this:
checkPermissionRequest := &permifyTypes.PermissionCheckRequest{
    TenantId: "1",
    Metadata: &permifyTypes.PermissionCheckRequestMetadata{
        SchemaVersion: "",
        Depth:         20,
    },
    Entity: &permifyTypes.Entity{
        Type: "order",
        Id:   "Order_1",
    },
    Permission: "edit",
    Subject: &permifyTypes.Subject{
        Type: "user",
        Id:   "User_1",
    },
}
An interesting thing I noticed today is that if I scale down to only 1 pod, the performance at 500 req/s improves to a 180ms p50.
That's still not ideal, but it is a 10x improvement over 10 pods (which is really weird).
Another thing of note is that the k8s Service being used is of type ClusterIP and not LoadBalancer
Hello @untold gull, we are investigating your use case. I will let you know when we get the results.
Hey @untold gull,
I checked your deployment and didn't notice anything wrong. When I tested your use case in our cloud environment, I got a median of 52.66ms and a p90 of 193.88ms.
Is it possible that in your test scenario, you are writing the same data repeatedly? Are you seeing any serialization errors in the application logs or any connection drops from the database?
Could you try testing checks separately to isolate the issue?
If you’d like to discuss this further, we’d be happy to jump on a call with you.
Hey @celest walrus,
We are indeed seeing a lot of serialization errors.
The Postgres logs show:
ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on commit attempt with conflict in from prepared pivot.
However, we are not trying to write the same data repeatedly. As I stated, the data is similar to:
{
    Entity: &permifyTypes.Entity{
        Type: "person",
        Id:   "Person_1",
    },
    Relation: "user_of",
    Subject: &permifyTypes.Subject{
        Type: "user",
        Id:   "User_1",
    },
},
The _1 suffix comes from a shared index, but we are protecting it with a mutex, and I have checked that each request correctly gets a different number.
These errors only occur on writes, though, and I'm mostly worried about Check performance. In our load test the writes are segregated from the reads, and we only start reading once every write has completed. If we modify our code to not clear the data on startup and to skip the writes, so that we test only the reads against preexisting data, the performance does improve a bit but still has a mean of 800ms.
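For readers who want to reproduce that check, here is a minimal Go sketch of a mutex-guarded index producing IDs such as Person_1, plus a helper that verifies uniqueness under concurrency. The names idGenerator, nextID, and allUnique are hypothetical and are not taken from the load test code discussed in this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// idGenerator hands out sequential suffixes for test IDs such as "Person_1".
type idGenerator struct {
	mu   sync.Mutex
	next int
}

// nextID increments the shared index under the mutex and formats the ID.
func (g *idGenerator) nextID(prefix string) string {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.next++
	return fmt.Sprintf("%s_%d", prefix, g.next)
}

// allUnique calls nextID from several goroutines and reports whether
// every generated ID was distinct.
func allUnique(workers, perWorker int) bool {
	gen := &idGenerator{}
	seen := make(map[string]struct{})
	duplicate := false

	var seenMu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				id := gen.nextID("Person")
				seenMu.Lock()
				if _, ok := seen[id]; ok {
					duplicate = true
				}
				seen[id] = struct{}{}
				seenMu.Unlock()
			}
		}()
	}
	wg.Wait()
	return !duplicate && len(seen) == workers*perWorker
}

func main() {
	fmt.Println("all IDs unique:", allUnique(8, 1000))
}
```

Distinct IDs on the client side only rule out duplicate tuples; they do not by themselves explain or prevent serialization conflicts inside the database.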
Hello @untold gull,
Can you check whether all pods are under load while you are testing? Maybe there is a problem with ClusterIP.
Even with 0 cache hits, the maximum response time I see for your schema is 200ms (p90). We can prepare a deployment document to help identify any unseen misconfigurations, or we can set up a cloud environment for POC purposes.
Hello @untold gull,
We have identified the root cause of the write errors you encountered.
Could you test using version v1.3.0 and share the results with us? If possible, using the k6 tool would help us match our results.
I’m attaching the results from our test along with the script I prepared for your use case.
Thank you for your collaboration. I look forward to your response.