-10

AI agents outperform human teams in hacking competitions (the-decoder.com)

submitted 2 months ago by Pro@programming.dev to c/cybersecurity@sh.itjust.works

10 comments fedilink hide all child comments

In two hacker competitions run by Palisade Research, autonomous AI systems matched or outperformed human professionals in demanding security challenges.

In the first contest, four out of seven AI teams scored 19 out of 20 points, ranking among the top five percent of all participants, while in the second competition, the leading AI team reached the top ten percent despite facing structural disadvantages.

According to Palisade Research, these outcomes suggest that the abilities of AI agents in cybersecurity have been underestimated, largely due to shortcomings in earlier evaluation methods.

top 10 comments

sorted by: hot top controversial new old

[-] taladar@sh.itjust.works 10 points 2 months ago

The event's puzzles were designed so they could be solved locally, making them accessible even to AI models with technical constraints.

Want to bet that those puzzles (or some very similar ones) were part of the training data of some of the agents?

[-] redsand@lemmy.dbzer0.com 7 points 2 months ago* (last edited 2 months ago)

Train AI on demo
Show AI rip through demo
???(rip off investors)
Profit

[-] Speiser0@feddit.org 9 points 2 months ago

Title is misleading. It's only outperforming some of the other participants. Also note that obviously not everyone is participating full try-hard.

In the first ctf, the top teams finish all 20 challenges in under an hour. Apparently it were simple challenges that could be solved with standard techniques:

We were impressed the humans could match AI speeds, and reached out to the human teams for comments. Participants attributed their ability to solve the challenges quickly to their extensive experience as professional CTF players, noting that they were familiar with the standard techniques commonly used to solve such problems.

They obviously also used tools. And so did the AI teams:

Most prompt tweaks were about:
[...]
• recommending particular tools that were easier for the LLM to use.

In the 2nd ctf (the bigger one with hard challenges), the AI teams only solved the easier ones, it looks like.

I haven't looked at the actual challenges. Would be too much effort. And the paper doesn't speak about the kind of challenges that were solved.

The 50% completion time looks to me like it's flawed. If I understand it right, it's assuming that each team is doing every task in parallel and starts directly, which is not possible if you don't have enough (equally good) team members.

Don't get me wrong, making an AIs that is able to solve such challenges autonomously at all is impressive. But I hate over-interpretation of results.

(Why did I waste my time again?)

[-] Tar_alcaran@sh.itjust.works 1 points 2 months ago

making an AIs that is able to solve such challenges autonomously at all is impressive.

I doubt that's the case. I find it exceptionally unlikely they said "Hack this system" and then sat back with their feet up while the computer crunched numbers.

[-] Speiser0@feddit.org 2 points 2 months ago

The paper didn't include the exact details of this (which made me mad). But if there's a person actively making parts of the work, and just using an AI chatbot as help, it's not an AI agent, right, right? So I assumed it's autonomous.

[-] Tar_alcaran@sh.itjust.works 1 points 2 months ago

They make frequent comments about using prompts and "AI teams" using "one or more agents".

Also, AI agents don't actually exist, so that's a pretty clear giveaway.

[-] Speiser0@feddit.org 2 points 2 months ago

An AI agent is just an intelligent agent, see https://en.wikipedia.org/wiki/Intelligent_agent.

Or do you mean that the things they call AI agents aren't actually AI agents?

[-] Tar_alcaran@sh.itjust.works 3 points 2 months ago

I mean, technically, you can call any controlling sensor an "agent". Any if-then loop can be an "agent".

But AI bros mean "A piece of software that can autonomously perform any broadly stated task", and those don't exist in real life. An "AI Agent" is software you can tell to "Order me a pizza", and it will do it to your satisfaction.

An AI agent is software you can tell "Hack that system and retrieve the flag". And it's not that.

[-] Tar_alcaran@sh.itjust.works 6 points 2 months ago

From the paper:

For the pilot event, we wanted to make it as easy as possible for the AI teams to compete. To that end, we used cryptography and reverse engineering challenges which could be completed locally, without the need for dynamic interactions with external machines. We calibrated the challenge difficulty based on preliminary evaluations of our React&Plan agent (Turtayev et al. 2024) on older Hack The Box-style tasks such that the AI could solve ~50% of tasks.

The conclusions that AI ranked in the "top XX percent" is also fucking bullshit. It was an open signup, you didn't need any skills compete. Saying you beat 12.000 teams is easy when those all suck. My grandmother could beat three quarters of the people on her building in a race, simply because she can walk 10 steps and 75% of the people there are in wheelchairs.

It's also pretty critically important these "AI Teams" are very much NOT autonomous. They being actively run by humans, and skilled humans at that.

[-] technocrit@lemmy.dbzer0.com 2 points 2 months ago

Programs doing what they were programmed to do. But there are no "AI agents" smh. No need for the bullshit hype.

this post was submitted on 01 Jun 2025

-10 points (32.1% liked)

Cybersecurity

8083 readers

17 users here now

c/cybersecurity is a community centered on the cybersecurity and information security profession. You can come here to discuss news, post something interesting, or just chat with others.

THE RULES

Instance Rules

Be respectful. Everyone should feel welcome here.
No bigotry - including racism, sexism, ableism, homophobia, transphobia, or xenophobia.
No Ads / Spamming.
No pornography.

Community Rules

Idk, keep it semi-professional?
Nothing illegal. We're all ethical here.
Rules will be added/redefined as necessary.

If you ask someone to hack your "friends" socials you're just going to get banned so don't do that.

Learn about hacking

Hack the Box

Try Hack Me

Pico Capture the flag

Other security-related communities !databreaches@lemmy.zip !netsec@lemmy.world !securitynews@infosec.pub !cybersecurity@infosec.pub !pulse_of_truth@infosec.pub

Notable mention to !cybersecuritymemes@lemmy.world

founded 2 years ago

MODERATORS

kid@sh.itjust.works

Lanky_Pomegranate530@midwest.social