This is the fourth article in Operation Bellflower’s weekly tech posts series. Check out the previous ones here!
We’re finally done with all that frankly unnecessary rambling on the engine’s archive system, and now we’re ready to dive into the nitty-gritty of how to translate a BGI game. The last couple of articles explained how to extract scenario files, which contain all of the game’s narration. So all we need to do next is replace the original Japanese text with our translation. This isn’t as simple as it sounds, as the format of those scenario files is anything but easy to work with. It’s nothing insurmountable though. I’ll first walk you through the steps to do it manually, then talk about some tools that make the job easier. So without further ado, let’s get started!
Overview of a scenario file
First things first, open up the scenario file you want to translate (I’ll use Senmomo’s first one, hat010010, as an example). Once again, you’ll need your favorite hexadecimal editor.
Yes, that’s the format we’ll have to work with. And no, this seemingly garbled mess after “function effect” isn’t corrupted data or unrecognized Japanese characters. It’s actual instructions that describe a scene in the game, like displaying messages in the main text box, playing voice lines, displaying sprites and backgrounds… Of course, (almost) no programmer in the world understands, let alone writes, “code” like this directly. It’s been generated from a much more sensible scenario file format. However, BGI does not recognize that original script format: it needs to be compiled into that abomination in order to be usable, and so that’s what the developers included in the game’s files rather than the original scripts. Legit localization companies typically strike deals with publishers to get access to those original files before translating anything. Check out, for instance, this super interesting article from MangaGamer’s staff blog where they show what the original script of a BGI game (Higurashi in this case) looks like (here’s also a link to an archived version with the missing images). But for humble translation groups like ours, there’s no other choice but to work with the raw bytes of those compiled script files. We’re translating on hard mode for sure, but that certainly won’t be enough to stop us from trying.
So no need to despair (yet). While at first glance this whole thing might look like an incomprehensible mishmash of random bytes, look closely and you will see a semblance of structure:
- The first four lines (40 bytes in hexadecimal, i.e. 64 in decimal, since the hex editor shows 16 bytes per line) have some text that’s actually readable: that’s our header. It starts with the magic message BurikoCompiledScriptVer1.00, followed by four bytes to indicate the header’s size, then a couple of words (“function”, “effect”) that refer to other files that need to be imported. “effect”, for instance, allows scenario files to use various, well, effects, such as rain animations or the date display screens.
- Then follows a huge list of instructions, each 4 bytes long (reminder that a byte is the equivalent of two hex characters). What’s going to be interesting for us here is to look for the ones that trigger a message in the text box.
- Scroll towards the end and you’ll see some recognizable words again. That’s a list of all the strings used throughout this script, which naturally include the content of the text shown on the screen that we want to translate.
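To make the header layout above a bit more concrete, here is a minimal Python sketch that checks the magic message and works out where the header ends. It rests on two assumptions that the article does not spell out but that are consistent with the 0x40-byte header of hat010010: the magic string is followed by a single 00 byte, and the four-byte size field counts from its own position to the end of the header. The function name `header_length` is made up for this example.

```python
import struct

# 27 characters plus an assumed 00 terminator = 28 (0x1C) bytes
MAGIC = b"BurikoCompiledScriptVer1.00\x00"

def header_length(script: bytes) -> int:
    """Return the total header length in bytes."""
    if not script.startswith(MAGIC):
        raise ValueError("missing BurikoCompiledScriptVer1.00 magic")
    # Assumption: the size field is a little-endian 32-bit integer that
    # counts from the field itself up to the end of the header.
    (size_field,) = struct.unpack_from("<I", script, len(MAGIC))
    return len(MAGIC) + size_field
```

Under those assumptions, a size field of 0x24 gives 0x1C + 0x24 = 0x40, which matches the four readable lines seen in the hex editor.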
A primer on assembly
While there is absolutely no need to be an expert on the subject to perform simple changes like modifying a string, I thought it’d be good to throw a few pointers on how assembly languages like the one we’re dealing with usually work.
CPUs are versatile little machines that are capable of performing many kinds of operations on the zeroes and ones that are fed to them through electrical current. That includes arithmetic (addition, multiplication…), logic (bitwise AND, bitwise OR…), reading and saving values… In fact, everything your computer does can be expressed as a combination of those simple operations. Assembly language is often loosely called “machine code”, because it follows exactly this concept: it’s just a big list of instructions that your computer’s processor (or, for script files, the engine’s interpreter) can directly understand. The BGI engine uses no fewer than three different variants of assembly:
- The one in the game’s main executable, BGI.exe, which is assembly code generated from C++.
- The system file’s assembly language, which is the one used by all the files with the extension ._bp. This code is read and interpreted by BGI.exe. Note that it’s slightly different from the one shown here, but that’s a story for another time.
- The scenario file’s language, which is the focus of today’s article. It’s interpreted by the system files that start with scr. You can differentiate it from the one above by looking at the file’s extension: scenario files simply don’t have one.
All the simple instructions supported by a given assembly language are associated with an operation code, or opcode. These codes are written directly inside the file to execute, and when the CPU (or the interpreter, in the case of scenario files) encounters one, it’s a signal to execute the associated operation. There are hundreds of them in the scenario file language, but for today we’ll only need to know two of them:
- Opcode “140” (a hexadecimal value) triggers the display of a new line of narration or dialog. It’s written as “40 01” in the file, because values are stored in little-endian order: the bytes need to be read from right to left. Try searching for it in your file, and you should get one hit for every line of text defined in that file.
- Opcode “3” does a memory fetch operation that we’ll call “push string”. It fetches one of the strings from the end of the file and adds it to the stack. All you need to know about the stack is that it’s a memory structure used to hold intermediate data. Other functions can read from it, like the one triggered by opcode 140: it looks at what’s at the top of the stack to determine what text needs to be shown on screen. The specific string that opcode 3 loads is indicated by the four bytes directly following it. These represent the address of the target string, that is, its position in the file counted from the end of the header.
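The interplay between those two opcodes and the stack can be illustrated with a toy interpreter. This is only a sketch of the idea, not how the engine is actually implemented: it knows just the two opcodes described above, skips every other 4-byte word, and takes the string table as a plain dictionary (the names `run`, `OP_PUSH_STRING` and `OP_MESSAGE` are invented for this example).

```python
import struct

OP_PUSH_STRING = 0x003  # stored in the file as 03 00 00 00
OP_MESSAGE = 0x140      # stored in the file as 40 01 00 00

def run(code: bytes, strings: dict[int, str]) -> list[str]:
    """Toy interpreter: return the messages that would be displayed."""
    # Decode the byte stream into little-endian 32-bit words.
    words = [w for (w,) in struct.iter_unpack("<I", code)]
    stack, shown, i = [], [], 0
    while i < len(words):
        op = words[i]
        if op == OP_PUSH_STRING:
            # The next word is the address of the string to push.
            stack.append(strings[words[i + 1]])
            i += 2
        elif op == OP_MESSAGE:
            # Display whatever sits at the top of the stack.
            shown.append(stack.pop())
            i += 1
        else:
            i += 1  # every other opcode is ignored in this sketch
    return shown

# "push string at 4B2CD, then show a message"
code = bytes.fromhex("03000000 cdb20400 40010000")
messages = run(code, {0x4B2CD: "first line"})
```

Here `messages` ends up holding the single string stored at address 4B2CD, which is the same “load, then display” pattern we will be hunting for in the real file.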
Patching a string in a scenario file
With the lengthy technical introduction out of the way, let’s get our hands dirty. Our goal for now is to change the very first line of dialog. So what we need to do is find the first occurrence of the message instruction, opcode 140, then find the opcode 3 that feeds the text to the message instruction, and edit it to make it point to a new string that we’ll add to the end of the file. Let’s first search for the hex sequence “4001” and jump to the first result:
What this highlighted section boils down to is: “Load the text at address 4B2CD (remember, this is little-endian format, so it needs to be read from right to left), then display it in the message box”.
Just to make sure we’re about to modify the right thing, let’s have a look at what’s at address 4B2CD and compare with the screenshot at the top of this article. An important thing to note is that this is a position counted from the end of the header section, not from the beginning of the file. The header here being 40 bytes long (in hexadecimal), we’ll want to add 40 to that value to find out where we really need to jump to. Using the online hex calculator of your choice or the Windows calculator in programmer mode, we get 4B2CD + 40 = 4B30D. Jumping to this address, we see…
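If you would rather check that address from a script than from the hex editor, the lookup can be sketched in a few lines of Python. Two things here are assumptions rather than facts from this article: the helper name `read_string`, and the text encoding, which is taken to be Shift-JIS (Python’s `cp932` codec), as is common for Japanese Windows games.

```python
HEADER_SIZE = 0x40  # header length of the example file, in bytes

def read_string(script: bytes, address: int,
                header_size: int = HEADER_SIZE,
                encoding: str = "cp932") -> str:
    """Read the 00-terminated string stored at a header-relative address."""
    start = header_size + address         # e.g. 0x4B2CD + 0x40 = 0x4B30D
    end = script.index(b"\x00", start)    # strings end at the first 00 byte
    return script[start:end].decode(encoding)
```

Called as `read_string(open("hat010010", "rb").read(), 0x4B2CD)`, it should return the text found at file offset 4B30D, provided the encoding assumption holds.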
We could directly overwrite these bytes with our new text, but we’d be limited by the length of the original string, so instead let’s do things the proper way and add a new one at the bottom of our file. Don’t forget to leave a 00 between the previous string and the new one, because that’s how the engine determines where a string ends.
Don’t forget to write down the address you added the new string at, as we’ll need it in a second. We can now jump back to 6D0 and edit the address of the string that’s loaded right before the first opcode 140 call. The address of our new string here is 61091; minus the 40 bytes of header, that’s 61051, which is 51 10 06 00 when written as four bytes in little-endian notation. Here’s what the final modified code looks like:
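The address arithmetic above is easy to get wrong by hand, so here is a small Python check of it. The function name `operand_bytes` is invented for this example; the numbers are the ones from this walkthrough.

```python
import struct

def operand_bytes(file_offset: int, header_size: int = 0x40) -> bytes:
    """Convert a file offset into the 4 bytes to write after opcode 3."""
    # Addresses are relative to the end of the header, and "<I" packs
    # them as a little-endian unsigned 32-bit integer.
    return struct.pack("<I", file_offset - header_size)

print(operand_bytes(0x61091).hex(" "))  # 51 10 06 00
```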
We’re almost done! All that’s left to do is to feed that modified script file to the engine. Thankfully this is the easy part: all you need to do is drag and drop it into the game’s folder. Now run the game and cross your fingers…
Automating the process
You have now successfully translated one line of text. All you need to do next is repeat the process a couple tens of thousands of times. Obviously it would be completely unreasonable to keep doing this by hand, so here are a few tools from the community that will alleviate the pain:
- ArcusMaximus’s VNTextPatch: an extremely handy program. It can extract all the lines from a scenario file and dump them into a spreadsheet that translators just need to fill out, then it can be run again to regenerate the patched game files. Note that it has not had an official release, so you’ll need to compile it yourself using Visual Studio (in GitHub, click the green “Code” button and pick “Open with Visual Studio”). Also, you will need to append the .bgi extension to the end of your files, because VNTextPatch supports multiple engines and that’s how it determines which decompiler to use.
- Marcussacana’s EthornellEditor: this tool might look a bit less user-friendly than the one above, but it’s also more versatile. It opens up a window that lists all the strings in your scenario file, not just the dialog lines, and allows you to directly edit them.
Or better yet, write your own! All things considered, the steps for patching a string aren’t too complicated: dump your new strings at the end of the scenario file, find all the occurrences of opcode 140 (represented by bytes 40 01), then change the 4-byte addresses right before them to point to your new strings. You could write a simple program to do that for you. If you do, make sure to share!
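As a starting point, here is a rough Python sketch of those steps for a single line. It is deliberately naive and built only on what this article describes: it assumes the header length is known (0x40 for our example file), that instructions are 4-byte words starting right after the header, and that the push string instruction (opcode 3 plus its address) sits immediately before the first opcode 140. Real scripts may not always be that tidy, so treat `patch_first_message` as a hypothetical helper to build on rather than a finished tool.

```python
import struct

OP_PUSH_STRING = 0x003
OP_MESSAGE = 0x140

def patch_first_message(script: bytes, header_size: int, new_text: bytes) -> bytes:
    """Append new_text and point the first message's push string at it."""
    data = bytearray(script)
    if data[-1] != 0:
        data.append(0)              # keep a 00 after the previous string
    new_offset = len(data)          # file offset of the new string
    data += new_text + b"\x00"      # new string, 00-terminated

    # Walk the file in 4-byte steps, looking for the first opcode 140.
    for pos in range(header_size + 8, new_offset - 3, 4):
        (word,) = struct.unpack_from("<I", data, pos)
        if word != OP_MESSAGE:
            continue
        # Assumption: "03 00 00 00 <address>" sits right before it.
        (prev_op,) = struct.unpack_from("<I", data, pos - 8)
        if prev_op == OP_PUSH_STRING:
            # Addresses are relative to the end of the header.
            struct.pack_into("<I", data, pos - 4, new_offset - header_size)
            return bytes(data)
    raise ValueError("no push string + message pair found")
```

The new text must already be encoded in whatever encoding the game expects. Extending this to every line mostly means looping over all the matches instead of returning after the first one, and reading the replacement strings from a file or spreadsheet.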
That’s it for what’s probably the most important article in the entire series, but do be assured that we have a lot of other fun stuff to talk about. (Wait until you hear about proxy DLLs…) Next week we’ll keep messing around with that assembly code to see if there’s something we can do about those letters being all squished together and other font shenanigans. Like always, have fun!
Any questions, comments, concerns, or recommendations for a good modern hex editor (why do they all look like they were made in the 2000s?)? Hit me up on Reddit or Discord at Perturbed_pangolin#3792, or shoot me an email.