Fucking nerd.
Hm ok. So one writes source code in a coding language, it gets turned into 1s and 0s. Why can't you go back? Source code gets compiled into a specific order of 1s and 0s, but the same set of 1s and 0s could be made from different types of source code?
it's pretty hard to un-bake a cake
It's like trying to figure out the exact tools used to build a house by looking at the finished house. You can figure out some tools (a hammer, a paintbrush, etc) but it's hard to know exactly. Programs are so interdependent on the components that make them up, guessing isn't a good solution.
Like others said, you sort of can. But I also want to add that things like function names, or comments explaining how a function works, are not needed by your computer when running the program, and thus they get lost when compiling. After running a program designed to reverse engineer a compiled program, you'll be able to see a very dumbed down version: no meaningful function or variable names, and no comments explaining the code. You have to figure those out all by yourself.
On top of that, some companies/programmers make parts of the program difficult to read on purpose (obfuscation), so you have even more guesswork to do when reverse engineering. Put together, you've got a giant task ahead of you reverse engineering even small games.
On a side note, the original source code can also just be interesting or funny to read. Valve's source code comments come to mind.
You sort of can. There are decompilers like Ghidra that can help with this, but it usually takes a lot of manual effort to properly decode.
Yeah, basically. Companies will also take extra steps to make it so people can't get source code from software, since it's their proprietary IP and whatever.
You can go back but it's very difficult. Only the biggest nerds can do it, with great dedication and time. That process is called reverse engineering.
For a very simple example, suppose I wrote some code to add how many apples Jack and Jill have together. The source code might look like
jackApples = 3
jillApples = 4
numApples = jackApples + jillApples
But the computer doesn't care about Jack, or Jill, or apples for that matter. It only cares about numbers. So when the compiler puts it into ones and zeros all those useful names get dropped. And when I decompile the binary (what we call those ones and zeros) what I get back might look more like
var1 = 3
var2 = 4
var3 = var1 + var2
And if I want to change how many apples Jill has it's a whole process of trial and error to figure out which variable is Jill's number of apples.
Now expand that to thousands or millions of lines of code and you begin to see why nerds want source code instead of binaries.
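If you want to see the name loss for yourself, here's a rough sketch in Python (the apple snippet above happens to be valid Python already). To be clear, this is not a real compiler or decompiler, just a toy that throws the names away the way a compiler would. `NameStripper`, `strip_names` and the `varN` naming scheme are all made up for illustration.

```python
import ast

SOURCE = """
jackApples = 3
jillApples = 4
numApples = jackApples + jillApples
"""

class NameStripper(ast.NodeTransformer):
    """Replace every identifier with var1, var2, ... in order of first appearance."""

    def __init__(self):
        self.mapping = {}  # original name -> meaningless name

    def visit_Name(self, node):
        new_name = self.mapping.setdefault(node.id, f"var{len(self.mapping) + 1}")
        return ast.copy_location(ast.Name(id=new_name, ctx=node.ctx), node)

def strip_names(source):
    """Return the same program with all the helpful names removed."""
    tree = NameStripper().visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse needs Python 3.9 or newer

stripped = strip_names(SOURCE)
print(stripped)
```

Both versions compute the same thing, but only the first one tells you that `var2` is Jill. Going from the second back to the first means guessing the names, which is the manual part of reverse engineering.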
The compiler will see that `var3` is just two numbers added together and replace it with 7, which saves having to do an addition every time you run through that code, and is therefore faster. `var1` and `var2` may be removed from the output as well; shorter code runs faster since you can fit more in the cache. In fact, since `var3` is just a number, you can replace every place that it's used with a 7 as well; if you have some functions: ... then the compiler will look at all that, delete the lot, and just use `1.4f` wherever the `appleWeight()` function was called. Comment is gone, the decision making is gone, it's impossible to go backwards any more.

I'm not a professional programmer and just a hobbyist, but if you also had a set function that changes jackApples to an input integer, what happens at compilation?
That disables a whole pile of the potential optimisations, of course. You could define `jackApples` as a "static variable" (as opposed to making it e.g. a field in a class or struct).

The most obvious consequence of this is that `jackApples` now has an address in memory, which you could find out with `&jackApples`. Executable programs are arranged into a sequence of blocks when they're compiled, which have some historical names based on what they used to be for:

- `text` section, which contains all of the executable code, and which might be made read-only by the OS.
- `data` section, which contains variables that have a known value at startup.
- `bss` section, which contains variables that we know will exist but don't have a value. Normally zero'd out by the OS or startup code before the program runs; on some bare-metal setups it might contain unknown leftover values.

Because it's statically allocated, `jackApples` will be in the `data` section; if you opened up the executable with a hex editor, you'd see a 3 there.

`getTheNumberOfApples()` will be optimised by the compiler to return the contents of the memory address plus 4. That still counts as a very simple and short function, and it's quite likely that the compiler would inline it and remove the initial function. Actually calling a function is a multi-step process. That takes a while, and worse - modern CPUs will try to "pipeline" all the instructions that they know are coming so that it all runs faster. Jumping to a function might break that pipeline, causing a "stall", which slows things down enormously. Much better to inline short functions - the fact that the value is "number in memory address plus four" might be optimised away a little wherever it's used, too.
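You can watch a small-scale version of "constants get folded, variables don't" in CPython, as a rough analogy. Big caveat: CPython's bytecode compiler is far weaker than a C compiler's optimiser (it never folds through variables and never inlines functions), so this only shows the basic idea. The helper `has_runtime_binary_op` is a made-up name, and the check assumes CPython, where arithmetic shows up as `BINARY_*` instructions.

```python
import dis

def has_runtime_binary_op(source):
    """True if the compiled bytecode still contains a BINARY_* instruction."""
    code = compile(source, "<demo>", "exec")
    return any(ins.opname.startswith("BINARY_") for ins in dis.get_instructions(code))

# Both operands are literals, so the compiler can do the addition itself
# and store only the result.
all_constants = has_runtime_binary_op("numApples = 3 + 4")

# jackApples is only known at run time (think of the setter from the
# question), so the addition has to stay in the program.
with_variable = has_runtime_binary_op("numApples = jackApples + 4")

print(all_constants, with_variable)
```

On CPython the first check should come back false (the addition was done at compile time) and the second true (the addition is still there to run). A C compiler does the same kind of reasoning, just much more aggressively, which is why the setter in the question costs you those optimisations.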
To add on to what the others have said, the compiler will also optimise your code (which is why professional coders write in common patterns as much as possible, so the compiler can recognise them and optimise).
So many times, you literally won't even have the same program.
Also, machine-understandable code (assembly or 1s and 0s) is different depending on the processor used. You could give me machine code made for a RISC-V processor and I could reconstruct a C program that made it. But if I had the same program compiled for an x86 processor ...