This is the second article in a series of tech posts documenting some of the hacking work done for Senmomo. Make sure to check out the first installment for an introduction!


So you want to translate a visual novel. Now arises the question: where to even start? In the next couple of articles, I’ll guide you towards changing your first line of text in a BGI visual novel, both by using the tools introduced previously and by writing our own. (Because doing things by ourselves can be fun sometimes! Sometimes.) So let’s get our hands dirty.

We won’t get very far without finding the original Japanese text. It’s in there somewhere, among the gigabytes of zeroes and ones that make up the whole game. Looking at the game files, you’ll notice that the bulk of it is in those many files under the .arc format. These contain absolutely all of the resources that the game needs: sprites, backgrounds, music, voice lines, videos, UI elements, fonts, scripts, and of course the original text that we’re looking for, in what I like to call scenario files. Let’s see how we can extract them…

Extracting resources the easy way, with GARbro

The .arc format might be a bit unconventional, but thankfully we can let GARbro do the heavy lifting for us. If you haven’t already, go download its latest version here. Open up GARbro.GUI.exe, select File → Open... from the top menu and navigate to your game’s directory. In Senmomo, scenario files are stored in the archive named data01000.arc, so that’s the one we want to open (your experience might differ if you’re dealing with a different BGI game). But I do recommend snooping around the other .arc files to see what they contain and how they are organized, because you might need to extract more of those assets later. Don’t be shy, poke around!

Anyways, sticking to data01000.arc for now, you should see a view similar to the below screenshot. Here we’re going to right-click hat010010, the very first scenario file in Senmomo, and choose “Extract”. Confirm the destination folder and you’re good to go.

Congratulations! You have successfully extracted your first scenario file. That’s it for this article. If you liked the content, don’t forget to leave a like and subscribe…

I’m kidding, of course. We can’t end here and let a third-party tool do all the work for us, when you’re probably burning with curiosity about what’s going on behind the scenes and dying to start writing some code already. If you’re not, that’s fine: feel free to skip the rest. We won’t be creating anything more than what GARbro already does, but it’s still a good exercise to learn more about the inner workings of the engine. It’s also a great warmup for all the programming that’s to come, and as you will see, it won’t be anything too complicated. In fact, the .arc format is the simplest compression format possible. I probably shouldn’t even call it that, since there isn’t any actual compression going on. So let’s get to it!

Extracting resources the hard fun way, with a script

If you want to follow along, you will need your text editor of choice, a program capable of displaying files in hexadecimal, and your favorite programming language. For these I recommend, respectively, Visual Studio Code, its Hex Editor extension, and Python. Let us begin by opening up data01000.arc (or any .arc file) to see exactly what we’re dealing with.

This sort of warning when you attempt to load a file is never a good omen. It basically means that the computer can’t format the data in a way that would make sense for our puny human brains. The usual reasons for this are:

  • the data is compressed
  • the file is encrypted
  • the content is executable bytecode

In any case, it’s the sign of much pain to come. Triply so for us, because as fate would have it, it turns out that here it’s all three at once. No need to fret though, we can pull out all these layers one by one like some sort of weird digital onion, even if it might take us multiple articles to finally reach our underlying data. The .arc format is the first of these abstraction layers, which is what we’ll focus on for today.

.arc files follow a specific structure, but rather than describe it in too much detail, I’ll let this diagram do the talking. Make sure you see something similar when you open your archive in your hex viewer. If it seems overwhelming now, I hope it’ll get clearer as we tackle each of the different sections in the code.

Anatomy of an .arc file
data01000.arc as seen in a hex viewer, with the various sections color-coded for your convenience. Notice that the name of our target file, hat010010, is here near the bottom
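
If you would rather peek at the bytes without leaving Python, a hex viewer can be roughly imitated in a few lines. This is only a sketch: `hexdump` is a helper name made up for this example, and the sample bytes are hand-built rather than read from a real archive.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset / hex / ASCII columns, like a hex viewer."""
    hex_width = width * 3 - 1  # two hex digits per byte plus separating spaces
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is, everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start:08x}  {hex_part:<{hex_width}}  {text_part}")
    return "\n".join(lines)


# Made-up sample: the 12-byte header text followed by a file count of 3.
sample = b"BURIKO ARC20" + (3).to_bytes(4, "little")
print(hexdump(sample))
# 00000000  42 55 52 49 4b 4f 20 41 52 43 32 30 03 00 00 00  BURIKO ARC20....
```

To try it on the real thing, read the first few dozen bytes of your archive in "rb" mode and pass them to `hexdump` instead of `sample`.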

Without further ado, let’s create a Python script in the same folder as the target .arc file (data01000.arc for me). What we’re trying to do is build a program that can open up that file, then extract all the sub-files contained within the archive into the same directory. For this we’ll have to decipher the sections outlined above, one by one. The first two are pretty straightforward. We’ve got:

  • 12 bytes (or 24 hex characters) that form the letters “BURIKO ARC20”. This is a sort of signature (often called a “magic number”) certifying that the rest of the file follows the .arc convention.
  • 4 bytes that encode an integer representing the number of files contained in this archive

Our extraction script should start by opening the target file, then reading those two sections:

with open("data01000.arc", "rb") as arc_file: # "rb" means "read-only binary mode"
    assert arc_file.read(12) == b"BURIKO ARC20" # verifies that the file has the right header

    # multi-byte numbers can be stored in 2 byte orders; "little" (little-endian) means the least significant byte comes first
    files_count = int.from_bytes(arc_file.read(4), "little")
    print(files_count) # should display the number of files that the archive contains
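
As a side note on byte order, here is a tiny standalone sketch showing how the same four bytes turn into two very different numbers depending on the order you pick. The bytes are made up for the example and do not come from a real archive.

```python
raw = bytes([0x2A, 0x01, 0x00, 0x00])  # four made-up bytes, as they might sit in a file

# Little-endian: the first byte is the least significant one.
little = int.from_bytes(raw, "little")  # 0x0000012A
# Big-endian: the first byte is the most significant one.
big = int.from_bytes(raw, "big")        # 0x2A010000

print(little, hex(big))  # 298 0x2a010000

# Going the other way: turn the number back into 4 little-endian bytes.
assert little.to_bytes(4, "little") == raw
```

If the file count you read from an archive looks absurdly large, a wrong byte order is one of the first things to check.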

Make sure that the number you get when you run the script so far matches the number of files you see when opening the archive in GARbro. That’s it for the file’s introductory section. The next one is a set of blocks of metadata, one for each file. Think of it as some sort of address book, with each entry having the name of a sub-file and the information needed to find it inside the .arc file, namely: the offset (“How many bytes do I have to skip to reach the beginning of the file?”) and the size (“How many bytes does this file contain?”). Let’s loop through each of these blocks and just display the information for now:

    for i in range(files_count):
        # keeps only the bytes before the first null byte (the rest is padding), then decode() converts them to a string
        name = arc_file.read(96).split(b"\x00", 1)[0].decode()
        offset = int.from_bytes(arc_file.read(4), "little")
        size = int.from_bytes(arc_file.read(4), "little")
        arc_file.read(24) # discards remaining empty bytes

        print(f"File: {name}, offset: {offset}, size: {size}")
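
For the curious, the same 128-byte entry can also be decoded in one go with Python’s standard `struct` module. Treat this as a sketch rather than the one true way: `ENTRY` and `parse_entry` are names invented for this example, the layout is the one described in this article (96-byte name, 4-byte offset, 4-byte size, 24 bytes of padding), and the sample entry is built by hand.

```python
import struct

# "<" selects little-endian with no alignment; 96s = 96 raw bytes,
# I = 4-byte unsigned integer, 24x = 24 padding bytes that are skipped.
ENTRY = struct.Struct("<96sII24x")


def parse_entry(raw: bytes) -> tuple[str, int, int]:
    """Decode one 128-byte index entry into (name, offset, size)."""
    raw_name, offset, size = ENTRY.unpack(raw)
    # Keep only what comes before the first null byte, then decode it.
    name = raw_name.split(b"\x00", 1)[0].decode()
    return name, offset, size


# Made-up entry, packed by hand for illustration.
fake = ENTRY.pack(b"hat010010", 0, 1234)
print(ENTRY.size, parse_entry(fake))  # 128 ('hat010010', 0, 1234)
```

With a real archive you would call `parse_entry(arc_file.read(ENTRY.size))` once per file instead of issuing four separate reads.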

At this point, our little program has read the entire “Index” section and has reached the data part. Nothing too complicated here: it’s just the raw bytes of the sub-files concatenated together. All that’s left to do is use the data we collected from the index to find which blobs of bytes correspond to which sub-file, and save each one as a new file on disk. Here’s what our final extraction script looks like:

with open("data01000.arc", "rb") as arc_file: # "rb" means "read-only binary mode"
    # Header section: read "BURIKO ARC20" + files count
    assert arc_file.read(12) == b"BURIKO ARC20" # verifies that the file has the right header

    # multi-byte numbers can be stored in 2 byte orders; "little" (little-endian) means the least significant byte comes first
    files_count = int.from_bytes(arc_file.read(4), "little")
    print(files_count) # should display the number of files that the archive contains

    # Index section: read each file's metadata
    names = []
    offsets = []
    sizes = []
    for i in range(files_count):
        # keeps only the bytes before the first null byte (the rest is padding), then decode() converts them to a string
        name = arc_file.read(96).split(b"\x00", 1)[0].decode()
        names.append(name)
        offset = int.from_bytes(arc_file.read(4), "little")
        offsets.append(offset)
        size = int.from_bytes(arc_file.read(4), "little")
        sizes.append(size)
        arc_file.read(24) # discards remaining empty bytes

        print(f"File: {name}, offset: {offset}, size: {size}")

    data_start = arc_file.tell() # tell() returns our current position in the file, i.e. the number of bytes read so far

    # Data section: write each sub-file to disk, using the index's information to find its content in the data section
    # zip() is a function that allows iterating over multiple lists at once
    for name, offset, size in zip(names, offsets, sizes):
        arc_file.seek(data_start + offset) # offsets are counted from the start of the data section, not of the file
        data = arc_file.read(size) # reads "size" bytes from our input
        with open(name, "wb") as output_file: # "wb" for "write binary"
            output_file.write(data)

If all went well, running this code should generate each of the sub-files into the same directory as your Python script.
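
If dumping everything next to your script gets messy, one possible variation is to wrap the same logic in functions and write into a separate folder. The sketch below assumes the layout described above; `read_arc` and `extract_to` are names invented here, and the whole archive is loaded into memory at once, which is fine for a scenario archive but probably not for the multi-gigabyte ones.

```python
from pathlib import Path


def read_arc(blob: bytes) -> dict[str, bytes]:
    """Parse the bytes of an ARC20 archive into a {name: content} mapping."""
    if blob[:12] != b"BURIKO ARC20":
        raise ValueError("not an ARC20 archive")
    files_count = int.from_bytes(blob[12:16], "little")
    data_start = 16 + 128 * files_count  # 16-byte header + one 128-byte entry per file

    files = {}
    for i in range(files_count):
        entry = blob[16 + 128 * i : 16 + 128 * (i + 1)]
        name = entry[:96].split(b"\x00", 1)[0].decode()
        offset = int.from_bytes(entry[96:100], "little")
        size = int.from_bytes(entry[100:104], "little")
        files[name] = blob[data_start + offset : data_start + offset + size]
    return files


def extract_to(arc_path: str, out_dir: str) -> None:
    """Write every sub-file of arc_path into out_dir (created if missing)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name, content in read_arc(Path(arc_path).read_bytes()).items():
        # Path(name).name drops any directory part of the stored name.
        (target / Path(name).name).write_bytes(content)
```

Calling `extract_to("data01000.arc", "extracted")` would then place the sub-files in an `extracted` folder (a name picked arbitrarily for the example).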

How about compressing back to .arc?

Now that you’ve successfully extracted an .arc file, you might be wondering how to go the other way around. In other words, compressing multiple files back into a .arc archive. I’ll stop you right there: very fortunately for us, we don’t need to compress patched files back into their archives to have them replace the original resources. The engine will happily pick up any file you drop into the game’s folder and use that one as opposed to its counterpart stored in the original .arc, as long as it has the right name. It’s quite a peculiar behavior of the BGI engine, but the fact is that not having to repack our modified assets will make our life much easier. Maybe the developers did it because they took pity on the would-be Ethornell hackers that we are?

In any case, once you understand how .arc decompression works, which I hope I made clear enough in this article, compression should be pretty straightforward. You would have to rebuild the .arc file by first adding the header, then constructing the index piece by piece from the metadata of each of the files you want to compress. All that’s left to do after that is dump the sub-files’ content in the right order, and you’re good to go. Do try this out as an exercise if you’re interested, and see if you can generate the exact same .arc file that you started with.
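
If you’d like something to compare your attempt against, here is a minimal sketch of the packing direction. It is built on assumptions: `pack_arc` is a name invented for this example, the 24 padding bytes are written as zeroes, and sub-files are stored back to back in index order with no alignment. A real archive may differ on any of these points, so a byte-identical result is not guaranteed.

```python
def pack_arc(files: dict[str, bytes]) -> bytes:
    """Build the bytes of an ARC20 archive from a {name: content} mapping."""
    index = b""
    data = b""
    for name, content in files.items():
        encoded = name.encode()
        if len(encoded) >= 96:
            raise ValueError(f"name too long for a 96-byte field: {name}")
        index += encoded.ljust(96, b"\x00")          # null-padded file name
        index += len(data).to_bytes(4, "little")     # offset within the data section
        index += len(content).to_bytes(4, "little")  # size in bytes
        index += bytes(24)                           # padding, assumed to be zeroes
        data += content
    header = b"BURIKO ARC20" + len(files).to_bytes(4, "little")
    return header + index + data


# Two made-up sub-files: 16 header bytes + 2 * 128 index bytes + 11 data bytes.
archive = pack_arc({"a.txt": b"hello", "b.txt": b"world!"})
print(len(archive))  # 283
```

A quick way to check your own packer is to extract an archive, pack the results again in the same order, and compare the output with the original file byte for byte.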

If you’ve made it this far, you should have been able to obtain at least one uncompressed scenario file. However, you might have noticed that its content is different from what GARbro spat out. That’s because there is still one last layer of abstraction we need to decipher, which will be the focus of the next article. So keep your extracted file handy, we’ll need it for next time. Until then, have fun!


Got any question, comment, concern, or opinion about my choice of recommended IDE (VSCode > all, fight me)? Hit me up on Reddit or Discord at Perturbed_pangolin#3792, or shoot me an email.
