My friends and family are under attack in Ukraine. Donate to protect them directly or help international organizations.

Legacy Tales: Arrays Initialized Using Strings

March 29th, 2024

My team was recently tasked to upgrade a suite of applications from PHP 5 to PHP 7, as an intermediate step before going to 8. There are many compatibility issues, but one of them was too fun not to share with the world.

Consider this write operation:

$array = array();
$array['a']['b'] = 1;

You could alternatively write it like this (executable example):

$array = '';
$array['a']['b'] = 1;

Until PHP 7.0 inclusively, this will convert the empty string to an empty array internally. This doesn't work with non-empty strings though. Starting with PHP 7.1, this produces a fatal error. Do people actually write code like this? Yes, they do. Some of the codebases that we upgrade are littered with reliance on weird PHP type juggling behavior.

How do we fix it? Well, we can't change = '' to = array(). Many parts of the application rely on it being a string until the write operation. We can't afford to debug the logic of every single expression in the codebase, or our upgrades would take a decade instead of just a few weeks or months. The solution we came up with is actually pretty straightforward (executable example):

$array = '';
initArray($array);
initArray($array['a']);
$array['a']['b'] = 1;

function initArray(&$node) {
  if ($node === '') {
    $node = array();
  }
}

What we did here is write a static analysis rule (PHPStorm inspection in Kotlin) to find all write contexts that involve array keys. We call a function for every node except for the last one (['b']), which would be unnecessary. The above function will check, at runtime, whether the node is an empty string and initialize an array instead. We can't detect the empty string initialization itself using static analysis, because the code has no autoloading, no type declarations and copious amounts of type juggling, not to mention that the initialization could be too far removed from the scope where the write occurs. This is why we rely on a function to make this determination at runtime, which is much more reliable.

We wouldn't use this function in every project though. There could be additional cases with $node edge-case values that behave differently between the two PHP versions. There could also be pitfalls where the key is a function call and is not idempotent, such as $array[$results->readNext('column')].

What about read contexts, such as $read = $array['a']['b'] (executable example)? We discovered that this produces a fatal error in PHP 5.3, but not in PHP 5.4 through 7.4, only to become a fatal error again starting in PHP 8.0. Because the original PHP 5.3 behavior is already a fatal error, we assumed that there shouldn't be any occurrences of this in the code, and if there were, they would be easy to spot. We decided not to address the read incompatibility preemptively, and wait to see whether it's a real issue first.

We always write only what is needed for the purpose of the project. A one-size-fits-all solution might be too complex and consume too many CPU cycles to account for every use case that just doesn't happen in a given project.

I hope this helps the next legacy archaeologist. Happy coding!

Previous: Progressively Improving a Ball of Mud