Encoding Protobuf By Hand
Most engineers use Protobuf every day without knowing what actually travels over the wire.
In this article, we step below that abstraction and examine how a Protobuf message is encoded at the wire level.
Before We Start
This article assumes you are comfortable with:
- Protocol Buffers and basic .proto definitions
- Bitwise operations (<<, >>, &, |)
- Reading small hex dumps
The Goal
By the end of this article, we will manually construct the raw bytes for the following Protobuf message, without using any generated code or runtime library.
message Car {
int32 id = 1;
string brand = 2;
}
Assume the message contains the following data:
id = 5
brand = "BMW"
Our objective is to encode this into the exact byte sequence that Protobuf would produce.
How Protobuf Thinks About Data
Unlike JSON, which represents data as "name": "value" pairs, Protobuf does not encode field names at all. Instead, it only encodes the field number and its value.
Each field is encoded as a "tag" followed by its value, repeated for every field in the message.
The tag is not just the field number; it encodes two things:
- The field number from your .proto definition, and
- A wire type, which tells the decoder how to read the bytes that follow. (We'll get into wire types in the next section.)
[tag][value][tag][value]...
For our Car message, this means the string "brand" never appears in the wire format. Only its tag does.
Unpacking The "Tag"
The tag is computed by shifting the field number left by 3 bits and OR-ing it with the wire type ID:
tag = (field_number << 3) | wire_type
For field numbers 1 through 15, this packs both values into a single byte; larger field numbers need a multi-byte tag.
To compute the tag, we first need to know the wire type of our field.
Wire Type
A wire type tells the decoder how to read the bytes that follow, and in particular how to work out where the value ends.
There are 6 wire types, though 2 are deprecated and rarely seen in practice.
| ID | Name | Used For |
|---|---|---|
| 0 | VARINT | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
| 1 | I64 | fixed64, sfixed64, double |
| 2 | LEN | string, bytes, embedded messages, packed repeated fields |
| 3 | SGROUP | group start (deprecated) |
| 4 | EGROUP | group end (deprecated) |
| 5 | I32 | fixed32, sfixed32, float |
(For the complete wire type reference, see the official encoding guide.)
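The table above can be captured as a small lookup structure. This is a minimal sketch in Python; the names WireType and WIRE_TYPE_FOR are our own for illustration and are not part of any Protobuf library. Only scalar types from the table are listed.

```python
from enum import IntEnum


class WireType(IntEnum):
    """Wire type IDs, copied from the table above."""
    VARINT = 0
    I64 = 1
    LEN = 2
    SGROUP = 3  # deprecated
    EGROUP = 4  # deprecated
    I32 = 5


# Hypothetical helper table: scalar .proto types grouped by wire type.
_TYPES_BY_WIRE_TYPE = {
    WireType.VARINT: ["int32", "int64", "uint32", "uint64",
                      "sint32", "sint64", "bool", "enum"],
    WireType.I64: ["fixed64", "sfixed64", "double"],
    WireType.LEN: ["string", "bytes"],
    WireType.I32: ["fixed32", "sfixed32", "float"],
}

# Invert the grouping so we can look up a wire type by .proto type name.
WIRE_TYPE_FOR = {
    name: wire_type
    for wire_type, names in _TYPES_BY_WIRE_TYPE.items()
    for name in names
}

print(WIRE_TYPE_FOR["int32"].name, int(WIRE_TYPE_FOR["int32"]))    # VARINT 0
print(WIRE_TYPE_FOR["string"].name, int(WIRE_TYPE_FOR["string"]))  # LEN 2
```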
Computing The Tag
id field
Let's compute the tag for the id field.
In the proto message, id is an int32. According to the wire type table above, the wire type for int32 is 0.
Plugging these into the encoding formula:
= (field_number << 3) | wire_type
= (1 << 3) | 0
In bit representation:
= (0000 0001 << 3) | 0000 0000
= 0000 1000 | 0000 0000
= 0000 1000
Hence the resultant value is decimal 8 or hex 0x08.
brand field
Following the same logic, brand is a string, which has a wire type of 2 (LEN):
= (field_number << 3) | wire_type
= (2 << 3) | 2
In bit representation:
= (0000 0010 << 3) | 0000 0010
= 0001 0000 | 0000 0010
= 0001 0010
= 0001 0010
Hence the resultant value is decimal 18 or hex 0x12.
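Both tag computations can be checked in a few lines of Python. This is a minimal sketch; make_tag and split_tag are illustrative names of our own, not functions from a Protobuf library. split_tag shows the reverse direction, which is what a decoder does when it reads a tag.

```python
def make_tag(field_number, wire_type):
    """Pack a field number and a wire type ID into a tag value."""
    if field_number < 1:
        raise ValueError("field numbers start at 1")
    if not 0 <= wire_type <= 5:
        raise ValueError("wire type must be between 0 and 5")
    return (field_number << 3) | wire_type


def split_tag(tag):
    """Recover (field_number, wire_type) from a tag value."""
    return tag >> 3, tag & 0b111  # the low 3 bits hold the wire type


print(f"{make_tag(1, 0):#04x}")  # 0x08 -> id (field 1, VARINT)
print(f"{make_tag(2, 2):#04x}")  # 0x12 -> brand (field 2, LEN)
print(split_tag(0x12))           # (2, 2)
```

Note that make_tag(16, 0) returns 128, which is the first tag value that no longer fits in a single byte on the wire.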
Encoding Values
5 In, One Byte Out
id = 5 is an int32, which uses the VARINT wire type.
5 in binary is 0000 0101, which in hex is 0x05.
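That's it. Values from 0 to 127 encode directly to their binary representation in a single byte. Larger values are split into 7-bit groups across several bytes, with the high bit of each byte signaling whether another byte follows.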
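The value 5 is the simplest case of the VARINT scheme. A varint stores 7 bits of the value per byte, least significant group first, and sets the high bit of every byte except the last. Here is a minimal Python sketch of that rule; encode_varint is our own illustrative name, and the 64-bit masking reflects how negative int32 and int64 values are sign-extended on the wire.

```python
def encode_varint(value):
    """Encode an integer as a Protobuf varint (7 payload bits per byte)."""
    # Negative int32/int64 values are sign-extended to 64 bits on the wire.
    value &= 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)


print(encode_varint(5).hex())    # 05
print(encode_varint(150).hex())  # 9601
```

With this encoder, -1 takes 10 bytes, which is why sint32 exists for fields that are often negative.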
Three Letters, Four Bytes
brand = "BMW" is a string, which uses the LEN wire type.
LEN stands for "length-delimited".
Length-delimited encoding works as follows:
- Take the length of the value in bytes
- Prepend that length, encoded as a varint, to the value's bytes
Protobuf strings are UTF-8, and "BMW" is pure ASCII, so it occupies 3 bytes and the length prefix is 0x03.
Each character is then encoded as its ASCII byte value:
| Character | Hex |
|---|---|
| B | 0x42 |
| M | 0x4D |
| W | 0x57 |
So brand = "BMW" encodes to:
03 42 4D 57
Length prefix followed by the three character bytes.
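The same steps can be sketched in Python. The helper names encode_varint and encode_len are our own, chosen for illustration. The length prefix is written as a varint, so payloads of 128 bytes or more get a multi-byte prefix.

```python
def encode_varint(value):
    """Encode a non-negative integer as a Protobuf varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte
            return bytes(out)


def encode_len(payload):
    """Length-delimited encoding: varint byte length, then the raw bytes."""
    return encode_varint(len(payload)) + payload


print(encode_len("BMW".encode("utf-8")).hex(" "))  # 03 42 4d 57
```

The length counts bytes, not characters: encode_len("é".encode("utf-8")) yields the prefix 02 followed by c3 a9, because that single character takes two bytes in UTF-8.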
Bringing Everything Together
Now we bring everything together in the format:
[tag][value][tag][value]...
Our message therefore looks like:
08 -> Tag for id (field 1, VARINT)
05 -> Value of id (5)
12 -> Tag for brand (field 2, LEN)
03 -> Length of "BMW"
42 4D 57 -> B M W
The final byte sequence would look like:
08 05 12 03 42 4D 57
That's the exact byte sequence Protobuf produces for this message, whether traveling over the wire or written to disk.
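To double-check the result, here is a minimal Python sketch that builds the Car bytes and then walks them back into fields. It uses no Protobuf library. All function names (encode_varint, decode_varint, encode_car, walk) are our own, and the sketch makes simplifying assumptions: it handles only the VARINT and LEN wire types and always writes both fields, whereas a real runtime may skip fields set to their default values.

```python
VARINT, LEN = 0, 2  # wire type IDs used by the Car message


def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte
            return bytes(out)


def decode_varint(buf, pos):
    """Read one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_car(car_id, brand):
    """Hand-encode Car { id = 1 (int32), brand = 2 (string) }."""
    brand_bytes = brand.encode("utf-8")
    return (
        encode_varint((1 << 3) | VARINT) + encode_varint(car_id)
        + encode_varint((2 << 3) | LEN)
        + encode_varint(len(brand_bytes)) + brand_bytes
    )


def walk(buf):
    """Split a message into (field_number, wire_type, value) tuples."""
    pos, fields = 0, []
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0b111
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LEN:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


encoded = encode_car(5, "BMW")
print(encoded.hex(" "))  # 08 05 12 03 42 4d 57
print(walk(encoded))     # [(1, 0, 5), (2, 2, b'BMW')]
```

Notice that walk never needs the .proto file to find field boundaries: the wire type alone tells it where each value ends. The schema is only needed to give the fields names and types.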