#ai-village-capture-the-flag-defcon31
CIFAR had been solved or not? The prompt said "Simple counting".
.
I got so many heatmaps like this
eventually ended up brute forcing the top few chars in each position + all chars in 4/5/7 but still missed it i guess
I think I didn't see that the first class could be l
Model Inversion as black box really depends on data you use, if you're too far from the real one, it goes nowhere
did anyone get a usable heatmap for 4/5/7?
i think the "ascii" prompt led me down the wrong path a lot
At the beginning I thought it was an AI (LLM), so we had to one-hot encode ASCII chars with max length = 32
At the end, no AI, no Ouija, just letters
Also for pixelated...
I tried so many more and more complex attacks... eventually turns out the issue was that the detected text field had to be non-empty
Pixelated needed to crash the OCR and then with (old) XML skills and a single letter it was good
but crashing the OCR was not straightforward, and the prompt pointed to the Jenny song for no purpose
Btw @abstract rose , I think previously you had a conversation with moo regarding pickle. Would you mind sharing it if there is something useful there?
Sure, let me find it
It was about the pickle protocol, the same input gives the flag (or not) depending on pickle protocol:
Will we ever know the solution to CIFAR and Granny-Pixel?
BTW: I've tried all outliers counting, percentile, stats on CIFAR100 (1 row per class), whatever full pixels or RGB, ... nothing just 'try again'.
for CIFAR I tried many things, 100 for many statistics on CIFAR100, 100 for the top-100 pixel frequency, I also tried many statistics of the 100 images in the official webpages https://www.cs.toronto.edu/~kriz/cifar.html
but nothing...it's a nightmare
I hope it's not CIFAR100 to have no regret
Or maybe that's just the datasets we've all downloaded that were not the good one
People solved hush/passphrase but not CIFAR, that's crazy
I think that the dataset (pytorch or tf) read the official dataset, with the same order and pixels
I hope so
"SIMPLE counting problem"
Most prompts were here to troll us
One thing I noticed for CIFAR in my many, many attempts is that when you count the labels in the first 10,000 images, ex:
import numpy as np
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data()
np.unique(y_train[:10000], return_counts=True)[1]
the max ends up being 125, which corresponds to the first value in the input data hint (input_data = [125, 245, 0, 10000]), similar to how 256 is the max in MNIST, but no idea if that's anything or just a coincidence
You can get 125, 245 also with P5 and P95
interesting
One idea I thought about is that maybe the numbers given were ranges. The first 2 numbers represent a range from 125 to 245, and the last two represent from 0 to 10000
The 2nd one maybe represent the images in the dataset, but couldn't figure out the first.
just out of curiosity, does anyone know of other ctf style competitions outside of the purely cyber security realm?
Like some of the most low-level ones have general computer science / algorithmic problems, so one that would be mostly, if not all, around that area
I would like to see more ML/AI CTFs as well, I think there was a CTF with some ML flags a while ago but I haven't seen that many
Cyber Apocalypse iirc
but it was like 8 tasks
I totally agree, when I discovered that "everything is equivalent" was related only to the original score I solved it in 2 minutes...it was robbery
ML/AI would be the ideal, but even if just more general comp sci
general comp sci is just competitive programming no?
hum i see your point
exactly, not sure how ascii leads to solution
There is something that is bothering me, and I would like to ask all of you since you are more experienced than me.
Before discovering that the weights of the PyTorch network were wrong, I tried to create a sort of black-box attack (aka copycat) in the following way:
- I took 1000 variations of the wolf image
- 1000 variations of an apple
- I labeled this dataset with the output of the server. The labels were [probability_wolf, probability_apple, probability_other].
- Locally created a MobileNetV2 model with pretrained weights (not the default ones, I discovered that 2 weeks later damn me)
- Fine-tuned using Mean Squared Error (MSE) as the loss. If my model said 0.87 for an apple when there was an apple, but the server said 0.34, I adjusted the parameters to reflect that prediction.
The moral of the story is, I couldn't match the server's output.
The thing I wonder is: why?
There are many reasons I suppose, but I can't identify why in the end I couldn't land on a model that approximated the server model well:
- Deciding to have a three classes prediction may not approximate the problem well. I chose to do that because it was messy to predict 1000 classes
- MSE is not a suitable metric. I also tried KLDivLoss but without results.
- Other reasons.
I also thought at some point that matching the server's output is a sort of knowledge distillation, so I tried this implementation https://pytorch.org/tutorials/beginner/knowledge_distillation_tutorial.html#knowledge-distillation-run without solid results.
Any suggestions on this matter? At this point I think my approach was naive, underestimating what it really takes to mimic a model given another one.
However, if you think that my approach could/should have worked (for instance, consider the link above, with the teacher being the server and the student my local model), I'd be more than happy to revisit the code as probably it was just a bug somewhere in my implementation
solutions to granny3 and cifar still unknown?
also, lots of solutions to pickle but any guesses to the server evaluation function?
did they just turn off the servers?
I tried drawing letters using different types of ascii art haha
I was saving my notebook π
I also tried fine-tuning mobilenetv2 in the same way but using keras model
going through all the notebooks from the top echelons of the leaderboard and none of them solved cifar lmao
didnt work at all
I used a naive substitution approach, but wasnt even close to get a match. Though I only used about 200 images (timber wolf + granny smith) to start, and then kept perturbing the 200 images with random noise and retrained
ya I wonder what I'm missing
probably distillation is fine for classification with hard labels
I was convinced people were doing substitute models to match the model, but it seems it was possible to match just using the pre-trained torch model
idk, I have to study more I guess
with correct preprocessing
ya eventually I got that and solved 1 and 2 right away
I still cant comprehend how I didnt even manage to solve granny1 considering the approaches I tried haha
I applied methods from at least 5 different papers
I think my mistake was trying methods that were written for CIFAR10 dataset which is much lower resolution
ascii art like this https://www.kaggle.com/competitions/ai-village-capture-the-flag-defcon31/discussion/454370 ?
Collect flags by evading, poisoning, stealing, and fooling AI/ML
hey, i still don't get why hush was returning variable length of output, coz i got all the values in the range of 2 to 12.... like how was it even processed with whisper in action
Sequence generation stops after predicting <eos> token. 2 means that start token is followed by <eos>
Oh... but the <sos> and <eos> tokens will always be there, whatever audio input we give. so why does their value vary?
it makes sense because we never get fewer than two tokens
up to 12 (i guess the correct answer is 12)
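To make the length argument above concrete, here is a minimal sketch of greedy decoding that stops at <eos>. The token names, the toy "models" and the cap of 12 are assumptions for illustration only; nothing here is Whisper's actual API or the server's code.

```python
from typing import Callable, List

SOS, EOS = "<sos>", "<eos>"  # hypothetical token names for this sketch


def greedy_decode(next_token: Callable[[List[str]], str], max_len: int = 12) -> List[str]:
    """Append predicted tokens until <eos> is produced or max_len is reached."""
    tokens = [SOS]
    while len(tokens) < max_len:
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens


def silent_model(prefix: List[str]) -> str:
    """Toy model that hears nothing: predicts <eos> right after <sos>."""
    return EOS


def talkative_model(prefix: List[str]) -> str:
    """Toy model that emits three words, then <eos>."""
    words = ["a", "b", "c"]
    i = len(prefix) - 1
    return words[i] if i < len(words) else EOS


short = greedy_decode(silent_model)      # start token followed directly by <eos>
longer = greedy_decode(talkative_model)  # start token, three words, <eos>
print(len(short), len(longer))
```

The shortest possible output is two tokens, which would fit the observed lower bound of 2; the upper bound depends on what the model predicts for the given audio.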
what were the actual numbers? similarity? p-value of that token in that position?
have the hints about cifar and granny3 been released?
a daily question
My solution for Granny 1, random pixel attack. 1782 different pixels from the original image. Maybe one of these 1782 is the one for Granny 3...
I see more than one pixel difference.
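A quick way to check how many pixels an attack actually touched is to count the differing positions. This is a minimal sketch on nested lists of RGB tuples; the function name and the toy 4x4 image are made up for illustration.

```python
def count_changed_pixels(original, modified):
    """Count positions where two same-sized images (rows of RGB tuples) differ."""
    if len(original) != len(modified) or any(
        len(a) != len(b) for a, b in zip(original, modified)
    ):
        raise ValueError("images must have the same dimensions")
    return sum(
        pa != pb
        for row_a, row_b in zip(original, modified)
        for pa, pb in zip(row_a, row_b)
    )


# Toy 4x4 black image with a single pixel changed to red.
original = [[(0, 0, 0)] * 4 for _ in range(4)]
one_pixel = [row[:] for row in original]
one_pixel[2][1] = (255, 0, 0)

print(count_changed_pixels(original, one_pixel))
```

A one-pixel submission should report exactly 1 here; a random pixel attack like the one described above would report many more.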
I don't think you can distill model this way. The original model was trained on a huge dataset with proper input distribution. Your way to approximate/distill the model is by fine-tuning on a very limited dataset (wolf and apples), you will just end up overfitting to your input.
the other aspect is that, the class prob is result of a softmax. a softmax over 3 class and a softmax over 1000 class are different things. I think at least you should inverse the class prob and use the pre-softmax value as target label
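A small sketch of the softmax point, using a toy 5-class "server" as a stand-in for the 1000 ImageNet classes (the logit values are made up). It shows that a softmax over a subset of classes gives different probabilities, and that taking log(p) recovers the pre-softmax values up to a shared additive constant.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits (assumption: 5 classes instead of 1000).
server_logits = [2.0, 1.0, 0.5, 0.1, -1.0]
server_probs = softmax(server_logits)

# Softmax over only the first two classes is a different distribution.
two_class_probs = softmax(server_logits[:2])

# log(p_i) = z_i - logsumexp(z), so logs give the logits up to one constant.
log_probs = [math.log(p) for p in server_probs]
shift = server_logits[0] - log_probs[0]
recovered_logits = [lp + shift for lp in log_probs]

print(server_probs[0], two_class_probs[0])
print(recovered_logits)
```

In practice the constant is unknown, but it cancels inside any softmax, so log-probabilities can serve directly as regression targets for a student's logits.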
and perhaps more importantly. The difference is not even the model weights, it's the preprocessing, which the model is not designed to learn
Any hints or solution on cifar
yes, big hint. Don't tell anyone though: ||input_data = [125, 245, 0, 10000]||
This hint definitely should be in your YouTube recap for this year haha
When trying to match the local model with the server for Grannys I found that
pytorch model created as
model = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V2)
is very different from
model = torch.hub.load('pytorch/vision:v0.10.0', 'mobilenet_v2', pretrained=True)
i.e. 0.28 vs 0.85 predictions for timber wolf. Do these models assume different kind of preprocessing, do they use different weights? Can anyone explain this huge difference?
The image is half-attacked already with specific weights in mind
Half attacked meaning attacked, but not too much
so these mobilenetv2 models are based on different weights?
apparently so
no hints so far for CIFAR ? the server will be down soon (if not down already) ... if CIFAR was not solved ... it may come next year hunting us :/
servers seem to be already down, tried to restart my solution's notebook and got query errors
DEFCON32 - Challenge 13 - CIFAR - Did you miss me ? , Description : really ? .,,, Simple Count ++ [125, 245, 0, 10000, ????]
maybe @olive ledge can provide the hash value for the CIFAR solution, if the server cost is an issue?
Hunting us, hating us and haunting us.
Given the nature of the problem, isn't enough to overfit on just two images (and their variations)? I mean, if the wolf is predicted as wolf exactly like the server (same for an apple), then I can craft an adv example and who cares about the other classes.
The values are probabilities of the tokens of the target sequence and depend on the input sequence
Yeah, the trick was to try different combinations of weights and preprocessing
those 100 images were also my idea, it also maps to the last year's crop2 challenge with its "arbitrary" palette as a solution
The thing that I keep thinking about with CIFAR is that moo wrote something about not overthinking it which suggests that it should be something fairly obvious, something like the MNIST count.
i imagine the obvious solution is to leave this flag out and solve all the other ones
in my experiments the best match was the model that you mentioned+ resize(256)+crop(224)+ToTensor()+normalize(mean imnet, std imnet). I obtained different results (after 6th digit) if device is cuda or cpu. If you change resize method you also obtain different results (e.g. PIL vs cv2 vs PIL-SIMD) because of different implementations.
also tried ToTensor+resize(256)+crop(224) + normalize
which can accept plain array as input
in that case torchvision will use a different resize if you see the code, because input is not PIL image but tensor
yes, it seems like resize is different on tensor, which was a shame for granny 3 because we had to resize on cpu
another approach that I explored is to find the equivalent convolution of bilinear resize, so it's like a mobilenetv2 with a pre-layer, and understand propagation maps from there
" input is not PIL image but tensor
from the pytorch documentation, PIL resize and F.interpolate can give exactly the same results
Yes, I successfully matched my local model to the server, otherwise couldn't solve Granny 1-2, but I'm still puzzled by the pytorch mobilenetv2 zoo of models and don't understand how to identify them
in my case too, if not matched my attacks didn't work
ok, maybe for antialias
I tried it and all interpolation modes
It did not match as well
Yeah, I saw this page. I've landed here first https://pytorch.org/hub/pytorch_vision_mobilenet_v2/ it has the correct preprocessing, but the model won't match... then experimenting with the models I got mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V2). Don't understand the difference between the ways creating models in pytorch...
"I've landed here first https://pytorch.org/hub/pytorch_vision_mobilenet_v2/ it has the correct preprocessing, but the model won't match...
preprocessing in the granny server is not the same as the pytorch default preprocessing
the granny server uses resize 256, crop 224
std and mean are the same
to test mean and std, one can pass a zero image, a red image (255,0,0), a blue image
this is a trick
the issue was the model itself
for red input, you should see prediction = matchstick, for blue, the prediction should be all shark (blue ocean). this is how you can guess if the std, mean is 0.5,0.5,0.5 (tf values) or 0.4x,0.4x,0.4x (pytorch values)
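For reference, this is what the solid-colour probes look like after normalization under the two hypotheses. The ImageNet mean/std are the torchvision values quoted later in this thread; treating 0.5/0.5 as the "tf" alternative follows the message above. The helper name is made up for this sketch.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
HALF_MEAN = (0.5, 0.5, 0.5)
HALF_STD = (0.5, 0.5, 0.5)

# Solid-colour probe images, represented by a single RGB pixel each.
PROBES = {"zero": (0, 0, 0), "red": (255, 0, 0), "blue": (0, 0, 255)}


def normalize_pixel(rgb, mean, std):
    """Rescale 0..255 to 0..1, then apply per-channel (x - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


for name, rgb in PROBES.items():
    print(
        name,
        normalize_pixel(rgb, IMAGENET_MEAN, IMAGENET_STD),
        normalize_pixel(rgb, HALF_MEAN, HALF_STD),
    )

red_imagenet = normalize_pixel(PROBES["red"], IMAGENET_MEAN, IMAGENET_STD)
red_half = normalize_pixel(PROBES["red"], HALF_MEAN, HALF_STD)
```

Feeding both variants to a local model and comparing with the server's answer for the same solid-colour image is one way to tell the two normalizations apart.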
nice, luckily normalization was not issue in my case, but I had to supply it to torchattacks so that it properly designs the attack
my issue is trying to understand the difference between the pytorch models with the same name
now I understand they may use different weights
"my issue is trying to understand the difference between the pytorch models "
once you can get the preprocessing correct (resize, crop, std, mean), it is then just trial and error over all available weight files
right, but without the proper model how can I know I've matched the preprocessing? I assume the blue ocean prediction won't depend on crop/resize by a lot
the steps are:
mean and std yes, but they're already known for imagenet
by reading source code
I meant preprocessing on the Granny server
- use zero images of various sizes. if the returned prob is always the same, we confirm there is a resize
- send image with image[y=0]=0. if the returned results are the same, we confirm there is a crop
then send image[y=1]=0, image[y=2]=0, ... to find the crop size
send a chessboard image with blocksize=1, you will get this
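The row-zeroing probe above can be sketched against a mock black box. Everything here is an assumption for illustration: `mock_server` is a stand-in that only center-crops and returns a toy score, the probe image is assumed to already be at the post-resize size of 256, and the crop is assumed symmetric.

```python
def mock_server(image, crop=224):
    """Hypothetical stand-in for the remote API: center-crop, then a toy score."""
    h, w = len(image), len(image[0])
    top, left = (h - crop) // 2, (w - crop) // 2
    return sum(sum(row[left:left + crop]) for row in image[top:top + crop])


def find_crop_margin(query, size=256):
    """Zero out row y = 0, 1, 2, ... and return the first y that changes the response."""
    base = [[1] * size for _ in range(size)]
    baseline = query(base)
    for y in range(size):
        probe = [row[:] for row in base]
        probe[y] = [0] * size
        if query(probe) != baseline:
            return y
    return None


margin = find_crop_margin(mock_server)
crop_size = 256 - 2 * margin  # valid only if the crop is centered
print(margin, crop_size)
```

Against a real API the comparison would need a tolerance instead of `!=`, since the returned probabilities are floats.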
nice, thanks for the info, maybe I will use these tricks for next year's Grannys
resize is symmetrical (because of resize artifacts)
luckily this year preprocessing was simple
from the chart, you know results is either 256 or 512
this is my best match with the server model, the device is "cpu", jit optimization improves the gap
By trying all possible combinations of weights and standard preprocessing.
Also you can add interpolation to the mix.
But this is all based on the hypothesis that they used something standard and just tweak a bit
You kinda have to test basic hypothesis first
I have exactly the same code https://www.kaggle.com/code/kononenko/ctf-a-silver-medal-journey-22-flags?scriptVersionId=150145838&cellId=56
yep, I tried some custom preprocessing with no luck, then found some standard stuff on pytorch website that worked
That is why I think CIFAR is pure bs. The search space is too huge with binary (no) feedback
Well, there is no guarantee that each problem must have a solution
It could be that CIFAR backend is simply print("Try again!")
This is the 'best' I achieved for inversion...next step I flipped images (but lost the output), and there you can decipher an "e" on position 4
I only tried the second one
I finally managed to post my write-up (3rd place, 25 points). Yesterday I got a "Too many requests" error every time I tried to post it for some strange reason. https://www.kaggle.com/competitions/ai-village-capture-the-flag-defcon31/discussion/454720
I saw it yesterday, thought you deleted and reposted it for some reason. Creating pictures manually in paint requires a strong determination. Congrats on your 3rd place!
Thanks! Yes, it was hard work with Paint. The reason I deleted it was that it didn't get attached to the leaderboard and it doesn't seem like you can change that with edit?
nope, can't reattach, the only option is to delete/repost
haha I also did paint, worked great
Not sure if it was mentioned in the meantime, but i think the torch hub weights correspond to the imagenet1k_v1 weights
Me : what is Count CIFAR ?
GPT-4: Try again
We need hints for CIFAR to stop it from coming back next year
π’
the funny thing is...CIFAR has been solved by some people. That means: easy upvotes to the first loading the notebook with the solution
cooome on
i hope moo was trolling hahaha
So I come back some days after to discover the solution of passphrase that haunted me so hard. I expected an incredible insight, you know, this "eureka" you get having a shower. I read the different solutions and I couldn't find two persons solving it with the same approach, explaining it with the same constraint ... A huge deception !!!
i asked it on kaggle in case these people arent using the discord channel
lets hope we get some clarification
passphrase ... I should have kept my code running for days (even if it found many many same predictions like benchmark....) this could have got me the flag (I should get lucky once ...)
I found the preprocessing steps buried in the torchvision MobileNet_V2_Weights.IMAGENET1K_V2 docs. For me it matched almost exactly with the API. The only difference from what you did is that it resizes to 232 instead of 256. https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html
The inference transforms are available at MobileNet_V2_Weights.IMAGENET1K_V2.transforms and perform the following preprocessing operations: Accepts PIL.Image, batched (B, C, H, W) and single (C, H, W) image torch.Tensor objects. The images are resized to resize_size=[232] using interpolation=InterpolationMode.BILINEAR, followed by a central crop of crop_size=[224]. Finally the values are first rescaled to [0.0, 1.0] and then normalized using mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225].
Ye I saw, and I also tried with 232 instead of 256 but 256 is closest to the api....I don't know why, I tried all combinations of resize,crop in a double for loop and the best match is (256,224)...
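The double loop over (resize, crop) can be sketched like this. `toy_local_predict` is a made-up stand-in for "run the local model with this preprocessing and return the wolf probability"; its formula, the server value 0.9 and the candidate sizes are assumptions chosen only so the example runs.

```python
from itertools import product


def best_preprocessing(local_predict, server_prob, resizes, crops):
    """Return (gap, resize, crop) for the combination closest to the server's output."""
    best = None
    for resize, crop in product(resizes, crops):
        if crop > resize:
            continue  # cannot center-crop more than the resized image
        gap = abs(local_predict(resize, crop) - server_prob)
        if best is None or gap < best[0]:
            best = (gap, resize, crop)
    return best


def toy_local_predict(resize, crop):
    """Hypothetical local pipeline whose output drifts away from (256, 224)."""
    return 0.9 - 0.001 * abs(resize - 256) - 0.002 * abs(crop - 224)


gap, resize, crop = best_preprocessing(
    toy_local_predict,
    server_prob=0.9,
    resizes=[224, 232, 256, 288],
    crops=[192, 224, 256],
)
print(gap, resize, crop)
```

With a real model it helps to average the gap over several probe images, since a single image can match by accident.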
Ah, sorry. I'm seeing now you guys already discussed this.
And I just double checked and you're right, I did use 256 resizing.
oh no problem, btw the api model is still a mystery, maybe it also depends on cuda/torch versions
I found the reverse preprocessing took some time to implement too.
btw there was some modification in granny3...the first version was with arrays, the second one with base64 png. I also discovered that in a time window (2 days maybe) they also accepted an image with 2 swapped pixels, so I was so happy because a swapped-pixel attack is in some cases stronger than a one-pixel attack, but they fixed it, and after that only a 1-pixel change was legal! I swear, I didn't dream it
In IP1 & IP2, the task is to redirect email to 172.0.0.1, but the message sent in the event of success says "Email sent to 127.0.0.1". Is the typo (127 for 172) deliberate?
lesson learned from IP1&2: LLM is a joke...
IP1&2 responses and solutions don't makes sense as far as I know.
IP1&2 weren't LLM, they were caveman NLP
That explains the huge difference, thanks
Miss youuuu
CIFAR please :)...
btw, Moo claimed that 'a decent amount of folks have solved CIFAR.' Maybe it's just a joke
my writeup for CIFAR
I see you are still discussing the model matching for the Granny challenges... I think many of you guys may have overthought it... I literally copied the code in the model usage description from PyTorch and then tried with the two versions of weights available... it turned out that V2 was the weights version being used... you didn't really need to tinker with the preprocessing...
Looking forward to next year's edition!!
Yes, that's what I used as well. The problem with that code is that resizing is done on CPU which becomes a performance bottleneck on Granny 3. PyTorch resize works differently on PIL image and tensor and we need it to work exactly like the API. @random minnow mentioned above that F.interpolate(x, size=(256, 256), mode='bilinear', align_corners=False, antialias=True) should work, but this does not seem to produce the exact same score for me. Now that I think about it more, I don't think my local model matched the API down to the last decimal. What if the Granny 3 solution is so specific that we just haven't matched the online model closely enough? For example any difference in interpolation method could cause our changed pixel to become something completely different.
mmm, personally, I don't think that matching to all decimal places was necessary... to be honest I do believe that for solving Granny 3 there was something else going on that we missed... as others pointed out, I think that if there was a single pixel that by changing it would lead to a different classification result, someone would have found it... it's hard to tell, for many challenges I focused too much on the description to later find out that they were sort of misleading... for example, in pixelated, the description was talking about passwords and in passphrase, it kept talking about bytes and bits... maybe for G3 it was the same? although that would have been quite awful since even the challenge endpoint was pointing to a single pixel... or maybe it wasn't? come to think of it, it just said pixel... if there was something specific going on on the server side, I guess we'll never know now...
Yes, I guess it could be something completely different that we all missed
maybe deciphering what they actually meant by "ancient incantation" was the key to solving it
I could not match scores with torch resize. Also I measured the difference between the local and remote model by querying all pixels in the image with step 3, changing one of the colors to (c + 128) % 256. There were some differences but not that big
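The probing scheme just described can be sketched as a generator of one-pixel perturbations. The image is a nested list of RGB tuples; `local_score` and `remote_score` in `max_gap` are hypothetical callables standing in for the local model and the API.

```python
def probe_pixels(image, step=3, channel=0):
    """Yield (x, y, probe) where one channel of one pixel is set to (c + 128) % 256."""
    height, width = len(image), len(image[0])
    for y in range(0, height, step):
        for x in range(0, width, step):
            probe = [row[:] for row in image]  # copy rows; pixel tuples are immutable
            px = list(probe[y][x])
            px[channel] = (px[channel] + 128) % 256
            probe[y][x] = tuple(px)
            yield x, y, probe


def max_gap(image, local_score, remote_score, step=3):
    """Largest absolute disagreement between two scorers over all probes."""
    return max(
        abs(local_score(p) - remote_score(p)) for _, _, p in probe_pixels(image, step)
    )


# Toy 6x6 image: with step 3 this yields probes at (0,0), (3,0), (0,3), (3,3).
img = [[(200, 10, 10)] * 6 for _ in range(6)]
probes = list(probe_pixels(img))
print(len(probes), probes[0][2][0][0])
```

Each probe costs one API query, so the step size trades coverage against the request budget.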
I was able to exactly match the API predictions. I used
model = torch.hub.load('pytorch/vision:v0.10.0', 'mobilenet_v2', weights="MobileNet_V2_Weights.DEFAULT")
And to reverse the resize to crop operation I used
from PIL import Image, ImageOps

def reverse_transforms(cropped_image, original_size):
    """
    Reverse the center crop and resize operations.
    :param cropped_image: The PIL Image that has been cropped to 224x224.
    :param original_size: The original size of the image before any transformations.
    :return: A PIL Image that is approximately the original size.
    """
    # Check that the cropped image is 224x224
    assert cropped_image.size == (224, 224), "cropped_image must be 224x224"
    # Reverse the center crop by padding
    # Calculate padding amounts
    left_pad = (256 - 224) // 2
    top_pad = (256 - 224) // 2
    right_pad = 256 - 224 - left_pad
    bottom_pad = 256 - 224 - top_pad
    # Apply padding to all sides
    padded_image = ImageOps.expand(cropped_image, border=(left_pad, top_pad, right_pad, bottom_pad), fill=0)
    # Reverse the resize by scaling back to the original size
    resized_image = padded_image.resize(original_size, Image.LANCZOS)
    return resized_image
Once everything matched I was able to push the apple prediction to 100% using PGD attack.
Is the website for submitting the json temporarily off or permanently off? I hope I can try running the notebooks shared by other great participants
So how to solve CIFAR
@olive ledge will we have today the solution for CIFAR and granny 3? π
Where are the people who solved CIFAR?
Does anyone here had a different output from "try again"?
"invalid input, format should be (100,4)" or something like this 
Has it been off since the end of the competition ? I wanted to retry a few things as well :/
it was open for just a day after the competition ended
when are we gonna get our medals, it is my first medal, kind of excited
Delays in verification of results do happen, sometimes related to trying to identify and remove cheaters without wrongly disqualifying the innocent.
inb4 it is revealed that the cifar solution was in fact most common pixel per class but the validation alg was off by one picture
if that is the case, i'm gonna literally explode
@fallow valve did you actually solve cifar or just trolling? 
but the api are down?
I just read @fallow valve write up and came here to see what the CIFAR part was all about. π must be a troll
I want my colour changed for once

Technically I didn't say anything that wasn't true
Posted it kinda late so I felt I had to, sorry
cifar will continue to remain a mystery 
Some relief from me here in that I thought that either I was failing on the Count Flags challenge or else Horea failed Test (Horea seems to mention another 24 as explicitly solved if we include Cifar).
also idk if anybody mentioned it
but i'm pretty sure the pickle task was to cause a false-positive of an LLM that detected bad pickles
aka make verdict = bad when in actuality it was not bad
Wait you were supposed to send the test flag??
i like this image from @final path solution notebook
Pull the other one, it's got bells on - as they say in English.
I got baited and spent the last 5 days on cifar instead of hush lol
Same! Next year I'll be more careful about psyops
We need to save some memes for next year
cifar is simpler than mnist still haunts me
I don't know why are you all so obsessed about it. MNIST is ugly af it's simply stupid... I don't even wonder what CIFAR is knowing that, it could be anything without any meaning to it. Granny 3 is much more interesting on the other hand but CIFAR will bring you no extra value in terms of knowledge or skills...
I was thinking if it is actually a transposed version of a (4,100) array...
I just saw a job posting yesterday that had "Solved CIFAR count" as a required skill. This field moves fast, stay up to date or get left behind.
TBH, I don't think granny3 is that interesting either. Feels like an impossible problem that was added so that people wouldn't give up if all others were solved quickly.
or all kagglers are frauds that should go back to the benches of school :p (including myself of course)
and i followed you
same
So, do we have solutions for CIFAR and Granny 3 now... even if we cannot retry using the API I still want to have a look at it
Are we really not going to get the official solutions?
Just going to say - in my defense. Asking for hints the first week of the challenge will always get you this answer. As hosts, we need to let the thing play out a bit.
We have challenges for everyone - beginners to experts.
We also set each challenge to 1 point for this reason. It's honestly really hard to gauge difficulty for you all. It did play out that way last year as well, people solved the majority of challenges quickly, with some holdouts
Thought being, those who are more experienced get more time on the harder challenges. Newer folks still get to be competitive.
and both get to lobotomize themselves over cifar 
π I joined two weeks late, but felt like it was possible to catch up. (Until I was brought to tears by CIF@R)
It wasn't impossible though π
It's too early to be reflecting too much - but 27 was a lot of challenges. I think we'll cap at 20 next year.
imo having more challenges was fun, because that requires being flexible and adaptive, which is not the case for many kaggle contests. also, it broadens your scope significantly, from text to audio to picture to tabular to weights visualization and that was very cool and unique. but count challenges barely contribute to that, i'd replace them with something more cybersecurity themed or make it clear but difficult to calculate (ex. probabilities of something)
We need some clues about CIFAR .... or atleast is it pixel based or evaluation based (TP, TN, FP, FN) on some model ? (I tried both directions ..)
if you had an objective of "make them use a lot of numpy/torch broadcasting" you'd be better off making something like picrelated where instead of digits there are some arrays/strings and you had some web api to interact with the black box that you'll have to eventually copy and get some flag with good enough copy
answer is 410 btw
Maybe next year allowing teams up to 2 persons would be nice.
(a-b)(a+b)
nooo
having many challenges was great, at least for me
The more accessible challenges there are, the higher the chance that the person gets hooked before hitting the difficulty wall
IMO, having many problems is good for late joiner. It's less likely that someone has solved all the problems. Facing multiple challenges is better than getting stuck on just one. Imagine if the competition only had Cifar.
F
people were teaming up anyway for sure so I dont think this is too bad of an idea
I dont think many did, thats why we were all so active here
Clearly didn't troll enough, I got 0
there is probably an underlying function like cheat_request(troll) = 1 / troll
why do you think so? teaming up was against the rules and we will see how many will be removed from lb upon finalization, however, I don't see a good way to track it as it was not a code comp and there is no problem to generate a new flag based on the shared idea
Any chance to see solutions for cifar and granny 3?
society now
People were teaming up anyway, that was quite obvious, given the number of anonymous requests to exchange hints. What are the chances that two anons connected and started sharing solutions? I would say high.
Some were even quite obvious
As you'll observe, people are slowly disappearing from the LB. I'm up 6 places since the close.
Some poisonous thoughts inspired by @minor falcon , for next year's comp. If feasible, people who got a real flag can also receive a poisonous solution, one that is unique enough that it should not be obtained by an innocent participant. Feel free to distribute those poisonous solutions when requested by an anon. Then we can tag all the malicious anons when the comp ends...
In any case, as I understand it someone else's flag won't give me a point - hence the zero scores for submissions of the test flag from the public starter notebook.
All the flags were unique in this comp already, i.e. each flag can be used only once, so you can easily trace who submitted it first and who was the second... unfortunately this can't stop sharing solutions behind the scenes
I am not sure that the flags are unique.
I managed to get one or a few flags in WTF challenges just chatting with LLM. The response was not in the {"flag": ...} format, just LLM mentioning the flag casually.
I even remember, that one flag didn't fit, but then I realized it was missing '=' sign at the end. Fixed that and moved up on the leaderboard.
So I assume I got the flag that was part of the prompt and probably I was not alone.
I talked through some of my solutions last night. Video is here ->
https://youtube.com/live/WWPfo-ZLLFg?feature=share
Hacking AI! I placed in the top 1% in the latest hackathon for AI. Solutions to the Kaggle competition with the AI Village DEFCon31 CTF competition.Competiti...
who poisoned kaggle notebooks ?
so how does it follow from that that the flags are not unique? it was mentioned several times by the host that each solve gets you a new unique flag and the same flag won't work if submitted twice
Flags are unique.
yeah, but it is a pretty strange finalization because: https://www.kaggle.com/competitions/ai-village-capture-the-flag-defcon31/discussion/455664
kaggle rank halved :) its good to be back
i was also surprised the deleted account remained, was that the one from that trio of (potential) cheaters?
I've seen plenty of evidence of people asking for hints as to how solve problems, but not anyone asking to borrow someone else's flag.
I am talking only about llms
If you trigger the "flag giving code" - it will generate a unique flag, but there is hardcoded flag in llm's prompt and you can get one without triggering the "flag giving code"
I've been around this one with Kaggle in the past. They don't routinely remove deleted accounts from the LB unless there's evidence of cheating. I think the 'lost' medals should be passed on to people who actually want them, but that's not the current practice.
nope, probably related to this: #ai-village-capture-the-flag-defcon31 message
Anokas officially unpickled 
ahaha
oh yeah bro i did so much for pickle
wrote a bunch of pickles by hand
then learnt that anything pickled in protocol 1/2 gets insta rejected
eventually just succeeded with sending sys.exit lol? (after maybe 100 attempts over a few days with increasingly elaborate stuff)
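For anyone wondering what "protocol" changes on the wire: a rough sketch using only the stdlib (`pickle`, `pickletools`). It pickles a reference to `sys.exit` in protocol 0 and protocol 2 and lists the opcodes. This is just an illustration of the format difference, not necessarily the exact payload or the server-side check used in the challenge; `opcode_names` is a hypothetical helper.

```python
import pickle
import pickletools
import sys


def opcode_names(data: bytes) -> list[str]:
    """Return the opcode names of a pickle without executing it."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)]


# Pickling a function stores only a reference (module + name), not code.
p0 = pickle.dumps(sys.exit, protocol=0)  # text-based, no header
p2 = pickle.dumps(sys.exit, protocol=2)  # starts with PROTO opcode: b"\x80\x02"

print(opcode_names(p0))
print(opcode_names(p2))

# Loading only resolves the reference; it does not call sys.exit.
restored = pickle.loads(p0)
```

Protocol 2 and later start with the `\x80` PROTO byte followed by the version number, so a filter can reject them by looking at the first two bytes, while protocol 0 has no such header.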
https://ctftime.org/writeup/16723 this was cool tho
CTF writeups, pyshv1
you tried too hard
I just asked gpt4 to write a pickle
Then told it that it was too dangerous
Until I got flag
Amateur numbers
My gpt usage is so bad, I can't afford to plant enough trees to offset the emissions
one of them also said he would select his first submission and said goodbye to everyone. I guess he didn't follow through on that.
finally, really happy with it. now it's road to gm
I just saw a job posting yesterday that had "Solved CIFAR count"
can you provide the link? thanks
while trying to solve Granny 3, i came across this paper and was really really fascinated by it. i never imagined i could do something like this. I wanted to try this with Granny 3, but later got to know that it requires modifying more than one pixel... still it was interesting
Glad to see that deleted accounts have now been removed from the LB.
Does anyone know what word embeddings were used for Semantle?
OpenAI text-embedding-ada-002
Wow, real thanks! It helps a lot!
By any chance, do you know how the host used ada-002 to build the semantle game?
If we call the OpenAI API, I guess it should give a non-deterministic similarity score, but it seems the output is always the same?
Semantle website says it uses word2vec
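One possible explanation for the identical scores (an assumption, not something the host confirmed): if the embeddings are computed once and cached, the score is just a cosine similarity between fixed vectors, which is fully deterministic. A toy sketch with made-up 3-dimensional vectors standing in for real embeddings; `CACHED` and `score` are hypothetical names:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical cached embeddings (real ones have far more dimensions).
CACHED = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}


def score(guess: str, target: str = "cat") -> float:
    """Similarity of a guess to the secret word, from cached vectors only."""
    return cosine_similarity(CACHED[guess], CACHED[target])


print(score("dog"), score("car"))
```

With cached vectors the same guess always returns the same number, regardless of how the embedding API behaves between calls.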
yes, I guess it is just a matter of bringing it to the host/kaggle team attention
@olive ledge What is the usual career path in adv ML? Is it a pure researcher role? Because there are a lot of toy examples, membership attacks have very high false positive rates, extracting big models using a (paid) API seems unfeasible, data poisoning kinda works but it is countered by testing against clean models. Or am I missing something?
It's part researcher part security. You exist at the edge of research and implementation. All of the blackbox stuff you did for this CTF counts as valid. Suppose an organization doesn't log requests, or has no rate limiting, or is giving away probabilities in their API, or is using a fine-tuned version of a public model.
As an attacker you might not care about false positive rates.
How do you know you have a "clean" model to test against: https://arxiv.org/pdf/2302.10149.pdf
Maybe you don't need to extract the whole model to achieve your goal: https://github.com/moohax/Proof-Pudding
It's a new space.
What venues do people typically publish at?
IEEE ICDCS is a venue that sees a lot of ML security related publications
Clean is a hand-curated old dataset from the times when it wasn't cool to poison data, of reasonable size for anomaly detection.
I have just realized by checking presentations in your repositories that I have watched your presentation on youtube. Great stuff for understanding basics of advML
anyone still waiting for granny3 solution like me?
Does it even exist? It could be just meta stuff
? i thought the organizer said there is a solution for every problem posted.
Maybe some meta meta stuff 
Or they will keep ideas for next year.
i mean there definitely IS a solution for all problems
whether someone who isnt the author of the challenge can find the solution, thats a different story
Is this for global or Americans only?
I'm starting to miss how active this channel was during the competition, now I'm in another competition, and its channel is almost dead
you should make a post in the other competition entitled "competition hints and black magic will be disclosed at discord"
SatML 2024 has TWO LLM competitions. here is the other one
Either people wanting to discuss the competition migrate to Discord leaving not much happening in the Kaggle forum, as with the CTF, or the other way around. I'm currently in Optiver, OP2 and RNA folding competitions. All of these have reasonably active Kaggle fora, but their Discords are almost inactive.
for anyone else who hates logging into twitter: this post does not contain or link to a solution for cifar
Sometimes I still think about CIFAR.. awake at nights
I literally thought of that when I decided to revisit this channel right now...
You guys doing the Santa 2023?
tempted, but I'm afraid that it may be too demanding. both mentally and hardware-wise
looks like a NP-hard problem, that would potentially translate into a competition of cpu cores and RAM...
did not have a look at it yet, but if it has to be solved on a kaggle kernel, it's more a code optimization problem no ?
ah no, just had a look, seems like it's just a submission file to provide, so you can actually code in any language - in which case python is the worst choice
hello
can i ask if the infra is shut down or not
cuz im trying to redo the challenge and cant send a request to the server :(((
if its one of those challenges that have a single solution you could use a hash of the solution and redo it
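A minimal sketch of that hash idea, assuming the challenge has exactly one accepted answer string: keep only a SHA-256 digest of the known solution and compare candidates against it offline, without the server. The answer string and the names `solution_digest` / `check` are made up for illustration.

```python
import hashlib
import hmac


def solution_digest(solution: str) -> str:
    """SHA-256 hex digest of the answer, ignoring surrounding whitespace."""
    return hashlib.sha256(solution.strip().encode("utf-8")).hexdigest()


def check(candidate: str, expected_digest: str) -> bool:
    """True if the candidate hashes to the stored digest."""
    return hmac.compare_digest(solution_digest(candidate), expected_digest)


# Hypothetical: in practice someone who solved it would publish only this digest.
EXPECTED = solution_digest("hypothetical-answer")

print(check("hypothetical-answer", EXPECTED))
print(check("wrong guess", EXPECTED))
```

This only works for challenges with a single deterministic answer; anything scored by a live model or with many valid inputs can't be replayed this way.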
