
I noticed that if I deserialize the example shown in the "Try!" functionality on https://msgpack.org ({"compact":true,"schema":0}) using MessagePackSerializer.Deserialize<Dictionary<string, object>> (or simply MessagePackSerializer.Deserialize<object>), and then serialize it again using MessagePackSerializer.Serialize, the output has an extra byte.

After running some tests, I realized that the difference is that a byte declared as an object is encoded as a uint 8 zero (0xCC00), even though a plain byte is correctly encoded as a positive fixint zero (0x00).
Similar results occur if we use an int instead of a byte, except of course it is encoded as an int 32 zero (0xD200000000).

Is there a specific reason for this? Is it easily fixable, or is it something more architectural in how boxing/unboxing is handled?

Example test code:

using System;
using System.Collections.Generic;
using MessagePack;

int i = 0;
object o = i;                        // boxing: runtime type remains System.Int32
object li = new List<int> { 0 };
object lo = new List<object> { 0 };  // the element is a boxed System.Int32
Console.WriteLine(
    $"{i.GetType()}: {Convert.ToHexString(MessagePackSerializer.Serialize(i))}\n" +
    $"{o.GetType()}: {Convert.ToHexString(MessagePackSerializer.Serialize(o))}\n" +
    $"{li.GetType()}: {Convert.ToHexString(MessagePackSerializer.Serialize(li))}\n" +
    $"{lo.GetType()}: {Convert.ToHexString(MessagePackSerializer.Serialize(lo))}");

And corresponding output:

System.Int32: 00
System.Int32: D200000000
System.Collections.Generic.List`1[System.Int32]: 9100
System.Collections.Generic.List`1[System.Object]: 91D200000000
@AArnott
Thanks for the detailed report. I'm pretty sure it has to do with boxing in .NET. Yes, we could fix it, but that would sacrifice knowing the original runtime type.

If you had deserialized the original msgpack using a dictionary typed with byte as the value type, it would be optimal. But since you used object, .NET boxed the deserialized value. The original msgpack encoding was a fixint, which fits in a byte, so byte is the struct type that MessagePack used. Now that the byte has been boxed, serializing it back means MessagePack will encode it in a way that records the boxed type, which in this case is u8.
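
For instance, the round trip can be observed directly (a minimal sketch; the hex literal is the msgpack encoding of the msgpack.org example):

using System;
using System.Collections.Generic;
using MessagePack;

// msgpack encoding of {"compact":true,"schema":0} from the msgpack.org "Try!" box
byte[] original = Convert.FromHexString("82A7636F6D70616374C3A6736368656D6100");

var map = MessagePackSerializer.Deserialize<Dictionary<string, object>>(original);
Console.WriteLine(map["schema"].GetType()); // System.Byte: fixint 0 fits in a byte

// Re-serializing records the boxed type: the trailing 00 becomes CC00 (uint 8)
Console.WriteLine(Convert.ToHexString(MessagePackSerializer.Serialize(map)));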

Maintaining the underlying value type is important because we have .NET programs that have object with boxed int, uint, byte, short, or whatever. They expect serializing and then deserializing will produce the original object graph, and allow them to unbox these values using their original runtime types. If we reduced an int32 with a value of 0 to fixint, we wouldn't know when deserializing that the program expects an int32, and when they unbox object to int the CLR would throw an exception.
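
A small sketch of that round-trip guarantee (behavior as described above; the boxed int survives because its int32 width is recorded on the wire):

using System;
using MessagePack;

object boxed = 42;                                        // boxed System.Int32
byte[] bytes = MessagePackSerializer.Serialize(boxed);    // D2 0000002A (int 32)
object back = MessagePackSerializer.Deserialize<object>(bytes);
int unboxed = (int)back;                                  // succeeds: runtime type is still System.Int32
Console.WriteLine(unboxed);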

@gapspt

> serializing it back means MessagePack will encode it in a way that records the boxed type, which in this case is u8.

> Maintaining the underlying value type is important because we have .NET programs that (...) expect serializing and then deserializing will produce the original object graph, and allow them to unbox these values using their original runtime types.

This is an interesting point.
I would like to argue that embedding the original type is not necessary and is in fact counter-productive.
I hope that, with version 3 not yet released, you could consider this an acceptable breaking change from v2 - or at least I hope you will give it a second thought, even if you end up taking a conscious decision to dismiss the idea.

MessagePack's slogan is "It's like JSON. but fast and small.", and this implementation for unknown types seems to go against both the "like JSON" and the "small" claims.
Including information about the original type is not only unlike JSON (it carries extra information), it also seems to yield diminishing returns on the "small" part. As an extreme example, encoding new Dictionary<string, object>() {{ "a", 0 }} will result in 8 bytes, whereas its JSON representation is only 7 bytes: {"a":0}.
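
A minimal sketch reproducing those byte counts:

using System;
using System.Collections.Generic;
using MessagePack;

var map = new Dictionary<string, object> { { "a", 0 } };
// 81 (fixmap 1) + A1 61 ("a") + D2 00000000 (boxed int 0 as int 32) = 8 bytes
Console.WriteLine(Convert.ToHexString(MessagePackSerializer.Serialize(map)));
// The JSON equivalent {"a":0} is 7 bytes.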

I understand there are advantages and disadvantages to both approaches (keeping a hint for the type vs. shortening the data to the optimal size), and I would like to understand what is tipping the scale in one direction rather than the other.

I would focus on two points/questions:

  1. Why should a library that is meant to provide JSON-like functionality need to have a guarantee (undocumented AFAIK) that serializing data of an unknown type (object) will result in deserializing the same data into the same type, if that is neither necessary nor possible in JSON?
  2. Why should a library degrade data of an unknown type (object) by making it larger, when serializing it back after deserialization, instead of doing the opposite (improving the serialized data by making it smaller where possible)?

Focusing on point 1 for now:

I would argue that it should not. Its importance seems minimal to me, but I'd like to hear the counterarguments.
I find it difficult to believe that programs rely on the underlying type of deserialized data (data that they themselves, or other programs using the same library and assumptions, serialized) in order to detect the type they should use for it, and that they do so in a useful way.
In other words, I fail to understand how an int32 0 that is being read as a uint8 0 could cause any issue, since:

  • The actual values would still fit their original type, upon a cast
  • A cast would have to be made, regardless, since the declared type is object
  • Using an unknown type (object) strongly implies that the object's type is not known anyway, and should be treated as such
  • Extra work would have to be done using reflection to actively detect the type of the deserialized value

Is there an actual use case where this functionality is necessary or at least useful?
(Taking into account that such a use case would have to rely on the data being encoded with this exact library, and on its undocumented behaviour, it still seems likely to me that such a case would inevitably be an incorrect program.)

@AArnott

> Why should a library that is meant to provide JSON-like functionality need to have a guarantee (undocumented AFAIK) that serializing data of an unknown type (object) will result in deserializing the same data into the same type, if that is neither necessary nor possible in JSON?

Because this library brings msgpack to .NET, and .NET has a strong type system.
JSON, on the other hand, is native to JavaScript, which does not have as precise a type system. In fact JSON and msgpack do retain type information: integers, floats, and strings are distinct, which provides matching type precision for JavaScript.
In .NET's more precise type system, we have a similar need for primitive types to be preserved.

> Why should a library degrade data of an unknown type (object) by making it larger, when serializing it back after deserialization

The "back after deserialization" is irrelevant. There is no way the serializer can know that a boxed byte came from an even more compact representation before in order to recreate that serialized form.

And to answer the question, it is so that this doesn't fail in .NET:

record Wrapper(object o);

int v = (int)MessagePackSerializer.Deserialize<Wrapper>(MessagePackSerializer.Serialize(new Wrapper(3))).o;

If we did what you're asking, that would fail with an InvalidCastException because the boxed byte cannot be unboxed as an int. This is counter-intuitive to a lot of .NET developers. The failure would be because we're leaking details about the msgpack protocol to the .NET layer, which most folks don't want to think about.
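
The failure mode in isolation (a minimal sketch, assuming a compact policy that hands back the smallest fitting type):

using System;

object boxed = (byte)0;              // what a compact policy would return for msgpack fixint 0
try
{
    int v = (int)boxed;              // throws: unboxing requires the exact runtime type
}
catch (InvalidCastException)
{
    Console.WriteLine("boxed byte cannot be unboxed as int");
}
int ok = Convert.ToInt32(boxed);     // callers would need explicit conversions like this instead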

> I hope you will give it a second thought, even if you end up taking a conscious decision to dismiss the idea.

Consider it consciously dismissed. Many folks depend on the existing behavior and would be broken if we changed it.
On the other hand, you're the very first I've heard from that even noticed and cared that we serialize boxed integers according to their .NET native size. So we'd be pleasing a small crowd and upsetting a large one to change this behavior.

But all is not lost for you. There is absolutely something you can do to have it your way too, via our extensibility system.

This is the code that implements our policy. Notice how it expressly calls MessagePackWriter.WriteUInt8 and such instead of simply MessagePackWriter.Write (which would write the most compact representation).
If you write your own such formatter with your preferred policy, and then your own PrimitiveObjectResolver to match (the names don't matter), and arrange for your resolver to be earlier in the resolver list than the built-in one, you can have the policy you want, without us breaking everyone else.
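
A minimal sketch of that approach (the names are mine, not the library's; note that values nested inside collections still flow through the built-in formatter here, so a complete implementation would reimplement those cases as well):

using System;
using MessagePack;
using MessagePack.Formatters;
using MessagePack.Resolvers;

// Register the custom formatter ahead of the standard resolver so it wins for typeof(object).
var resolver = CompositeResolver.Create(
    new IMessagePackFormatter[] { CompactPrimitiveObjectFormatter.Instance },
    new IFormatterResolver[] { StandardResolver.Instance });
var options = MessagePackSerializerOptions.Standard.WithResolver(resolver);

object zero = 0;
Console.WriteLine(Convert.ToHexString(MessagePackSerializer.Serialize(zero, options))); // "00", not "D200000000"

// Hypothetical formatter: writes boxed integers with MessagePackWriter.Write, which
// picks the smallest msgpack representation, instead of the type-preserving
// WriteUInt8/WriteInt32/... calls the built-in formatter uses.
public sealed class CompactPrimitiveObjectFormatter : IMessagePackFormatter<object>
{
    public static readonly CompactPrimitiveObjectFormatter Instance = new();

    public void Serialize(ref MessagePackWriter writer, object value, MessagePackSerializerOptions options)
    {
        switch (value)
        {
            case byte v: writer.Write(v); break;   // boxed byte 0 -> 00, not CC00
            case sbyte v: writer.Write(v); break;
            case short v: writer.Write(v); break;
            case ushort v: writer.Write(v); break;
            case int v: writer.Write(v); break;    // boxed int 0 -> 00, not D200000000
            case uint v: writer.Write(v); break;
            case long v: writer.Write(v); break;
            case ulong v: writer.Write(v); break;
            default:
                // Everything else keeps the built-in behavior.
                PrimitiveObjectFormatter.Instance.Serialize(ref writer, value, options);
                break;
        }
    }

    public object Deserialize(ref MessagePackReader reader, MessagePackSerializerOptions options)
        => PrimitiveObjectFormatter.Instance.Deserialize(ref reader, options);
}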

I hope this helps.
