Kategorien
General Open Source

Array Thickening: More can be less

When turning flexible file formats into data structures, the consensus seems to be to avoid using arrays whenever possible. However, more arrays may reduce code and coding errors. A plea against the extinction of arrays.

The problem

Parsing flexible data structures—such as electronic business cards (vCards), but also spreadsheets, log files, and many, many more—results in many fields with different cardinalities, the most common being:

  • At most once: 0…1 time
  • Exactly once: 1 time
  • At least once: 1…∞ times
  • Any number of times: 0…∞ times

The first two are probably easiest to represent in a data structure: Exactly-once is a fixed field of the required data type, whereas at-most-once is an optional field of the required type, often expressed as an alternative of the actual type and some form of NULLish type (null, undefined, None, nil, …). For example, in TypeScript, two fields (first: at-most-once/optional, second: exactly-once/required) might be represented as:

type ParsedData = {
  uniqueId?: T; //or: `T | null` or …
  fullName: T;
}

Optimize the common case?

Often, the fields which can occur up to a (theoretically) infinite number of times, occur most frequently only maybe once, maybe twice. For example, most spreadsheets only require one table, most business cards specify only one address, and often there is only one, maybe two phone numbers on the card.

So it seems to be only natural to „optimize the common case“ and—in dynamically-typed languages—just put that only instance (spreadsheet, address, phone number, whatever) directly at the top level instead of adding an extra layer of indirection and complexity.

Looking at the JavaScript vCard parsers out there, this seems to be the general thinking: Only interpose an array, if there really are multiple values, otherwise, store the only value directly in that field.

And until a rather futile (and ill-fated) attempt to add a TypeScript type definition to such a library, I might have walked this beaten path.

But now, I beg to disagree.

Encoding (maybe) multiple values

What are our options to store potentially multiple values?

1. Flattening

Store or return the value, if it is just one; use an array of values for more than one value only. In TypeScript, this would look simple enough:

fieldOccuringAtLeastOnce: T | T[];

Pros: This approach uses fewer objects while creating the data structure, thus improving speed and memory consumption.

Cons: Every access needs to distinguish between the single- and multi-value case:

if (Array.isArray(fieldOccuringAtLeastOnce)) {
  // Code for the array case
} else {
  // Similar code again, but for the single-value case
}

Anyone aware of the Don’t Repeat Yourself principle will probably roll their eyes. Fine, you’re in good company.

2. Extending

One rarely seen option would be to store the first element in a direct variable and store the „overflow“ objects somewhere else (some might be reminded of handling hash collisions):

requiredEntry: T;
optionalAdditionalEntries?: T[];

Pros: Consumers of this data structure that are only interested in the required entry can always access it and never need to touch the array.

Cons: The data structure is now more complicated. Also, any code dealing with all entries will have to deal with two data structures simultaneously.

Yikes! But actually, we’re on a good way.

3. Thickening

Even though the most common case may be the single value, we always deal with the most general case: The array.

ourEntry: T[];

Pros: Simple data structure; always the same code.

Cons: Requires slightly more memory and creation time.

Using Array Thickening

In the at-least-once case, any consumer of this data structure can always assume that the element at index zero will always be present. This can even be expressed in TypeScript to help the development environment and compiler:

// First element always there:
type NonEmptyArray<T> = [T, ...T[]];

// Further down, in the data structure definition:
  ourEntry: NonEmptyArray<T>;

This might be three characters more for every access, but it is guided by the IDE, which can perform autocompletion and direct verification. Also, the compiler will be able to tell what is going on, if the programmer did not follow the IDE support.

A concrete example can be found in vCard4-ts.

Conclusions

Always and consistently using arrays when there may be multiple values has the following advantages:

  1. Consistent mental model
  2. Never forgetting to handle the less common case
  3. Less code (no partial duplication of code)
  4. Type definitions is easier and clearer (and therefore more powerful, i.e., helps coding and avoid bugs)
  5. Better code coverage analysis with fewer tests

Yes, the allocation time and memory consumption might be slightly higher. If this is the critical data structure for you, which is kept in memory millions of times in your application, you should consider optimizing it (and also consider flattening in this specific case).

But if this is just one of the many functions your application is implementing, Array Thickening saves you development time, and—if a bug actually sneaked in—debugging time and customer service time, explaining the problem to those users who experienced it (and might even have lost their data).

So, similar to Database normalization, use Array Thickening unless you have a specific reason not to. The next developer having to deal with your code will thank you.

And remember, as always, this just might be you a few months down the road.

Eine Antwort auf „Array Thickening: More can be less“

Schreibe einen Kommentar

Diese Website verwendet Akismet, um Spam zu reduzieren. Erfahre mehr darüber, wie deine Kommentardaten verarbeitet werden.