
Technology report

HHVM is the greatest thing since sliced bread

Legoktm is a platform engineer for the Wikimedia Foundation. He wrote this in his volunteer capacity with members of the MediaWiki Core team.

No, seriously, it is.

HHVM stands for "HipHop Virtual Machine". It is an alternative PHP runtime, developed by Facebook and other open-source contributors to improve the performance of PHP code. It stemmed from HipHop for PHP, an earlier Facebook project that compiled PHP into C++ code. Compared to the default PHP runtime, the Zend Engine, it offers a significant speedup for many operations.

In March 2014, a group of MediaWiki developers started working on ensuring that the codebase, along with the various PHP extensions used on Wikimedia servers, was compatible with HHVM. This involved making changes to MediaWiki, filing bugs with the HHVM project, and often submitting patches for those bugs as well.

Users will see performance improvements in many places, especially when editing extremely large articles. If you're interested in helping the development team find bugs, or just want your editing experience to be faster, you can enable the "HHVM" beta feature in your preferences.

We caught up with longtime MediaWiki developer and lead platform architect Tim Starling and asked him a few questions about the HHVM migration:

What is HHVM?

HHVM is a new implementation of the PHP language, written in C++. It has a rewritten runtime; that is to say, most of the functions exposed to PHP code have new implementations. It has a just-in-time (JIT) compiler which can translate small snippets of PHP code into machine code.
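To make the JIT's effect concrete, here is a hypothetical micro-benchmark — a sketch, not code from the MediaWiki codebase. A tight numeric loop like this is exactly the kind of hot path a JIT can translate into machine code, so the gap between runtimes shows up clearly:

    <?php
    // Hypothetical micro-benchmark: a tight numeric loop is the kind of
    // hot code path that a JIT compiler can turn into machine code.
    function sumOfSquares( $n ) {
        $total = 0;
        for ( $i = 0; $i < $n; $i++ ) {
            $total += $i * $i;
        }
        return $total;
    }

    $start = microtime( true );
    sumOfSquares( 10000000 );
    printf( "Elapsed: %.3f seconds\n", microtime( true ) - $start );

Running the same file under both runtimes (php bench.php versus hhvm bench.php) illustrates the difference directly; real MediaWiki requests are far less JIT-friendly than a synthetic loop, which is one reason real-world gains are smaller than synthetic ones.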

What have the performance gains of HHVM been so far? Are they expected to increase over time?

At the moment, it is faster by roughly a factor of two. This is somewhat disappointing, since the old HipHop compiler gave us a speedup of 3–5 times depending on workload, although at the cost of an hour-long compile time. The performance gain is expected to increase over time, due to:
  • Deployment of RepoAuthoritative mode (see the sketch after this list). This involves analysing the PHP code for a few minutes prior to deployment, in order to generate more efficient HHVM bytecode. It is said to give a speedup of about 30%.
  • Optimisation of HHVM for MediaWiki's workload. Brett Simmers of Facebook has offered to spend some time on this after we have fully deployed HHVM.
  • Profiling and optimisation of MediaWiki running under HHVM. We expect big gains here, since not much profiling work has been done on MediaWiki while Ori and I have been working on HHVM. Even the Zend performance of MediaWiki is sub-optimal at present.
  • Ongoing performance work on HHVM by Facebook, which has a team dedicated to continually improving it.
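For the curious, the sketch below shows roughly what enabling RepoAuthoritative mode involves; the exact flag and setting names have varied across HHVM releases, so treat it as illustrative rather than a recipe. The idea is to pre-compile the whole codebase into a bytecode repository before deployment, then tell the server to trust that repository instead of re-checking source files at runtime:

    # Pre-compile the codebase into a bytecode repository
    # (flag names are approximate and version-dependent).
    hhvm --hphp --target hhbc --input-list file-list.txt --output-dir /srv/hhvm

    # server.ini: serve exclusively from the pre-built repository
    hhvm.repo.authoritative = true
    hhvm.repo.central.path = /srv/hhvm/hhvm.hhbc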

What sort of impact can users expect from the deployment of HHVM? What sort of issues might users run into?

We expect to see crashes and other fatal errors, and also more subtle bugs such as incorrect HTML generated by Lua or wikitext. Note that as we are rolling out HHVM, we are also upgrading from Ubuntu 12.04 to Ubuntu 14.04, which means different versions of many system libraries and utilities. When we move the image scalers to Ubuntu 14.04, there may be changes to SVG rendering.

What effort has gone into ensuring that HHVM performs well and is reliable, especially at Wikipedia's scale?

We have tested HHVM's performance by benchmarking, and also under real load by diverting a proportion of Wikipedia's traffic to a single HHVM backend server. We have some assurance from the fact that Facebook has been running HHVM in production for years, at a scale significantly larger than ours. Facebook's experience also gives us some confidence as to reliability.
With any open source project, it is difficult to give assurances as to reliability. PHP has not been uniformly reliable for us, and has presented all sorts of challenges over the years as we have scaled up. HHVM's architecture is built with much more awareness of the challenges of scaling than PHP's was, so we have reason to think that HHVM will prove to be a more stable platform for a busy website in the long term.

What was the biggest challenge to rolling out HHVM?

Integration of Lua. I don't think anyone has integrated another language with HHVM in the same address space before. We did it by rewriting HHVM's Zend compatibility layer, allowing our existing Lua extension for Zend to also be compiled against HHVM, with only a few instances of cheating (#ifdef).

What is Hack and do you think it will affect Wikimedia development?

Hack is the new name for the language extensions that Facebook has progressively added to the syntax of PHP over the last four years, first in the HipHop compiler and now in HHVM. It also refers to the static type checker that they have recently introduced. Hack allows types to be specified for function return values, and extends the existing support for type declarations on function parameters. The net effect is to allow an existing PHP codebase to be progressively migrated to strong typing, with many type checks being done pre-commit instead of at runtime. This is beneficial for developers of large applications, and helps to prevent errors from reaching users at runtime.
For now, we are committed to compatibility with PHP, and so it is difficult for us to make use of Hack's language extensions, except in WMF-specific services and tools. I would love to see a move towards language harmonisation by PHP towards Hack – for example, return type hints could easily be added. I'm not sure of the reason for the split, since PHP does not strike me as an especially conservative community.
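As a concrete illustration of the annotations described above (a sketch, not code from any Wikimedia project): a Hack file opens with <?hh rather than <?php, and can declare types for both parameters and return values. Standard PHP at the time had no return-type syntax at all, which is the gap referred to here.

    <?hh
    // Hack syntax sketch: typed parameters and a return type annotation.
    // The static type checker can flag a call such as add( '1', 2 )
    // before the code is committed, rather than letting the mistake
    // surface at runtime.
    function add( int $a, int $b ): int {
        return $a + $b;
    }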

Currently logged-out users have a significantly faster experience than logged-in users. Is it realistic to expect that logged-in users will one day have the same experience as logged-out users? If so, when?

HHVM by itself will not provide performance parity for most users. It will help to reduce parser cache hit times, and for many articles, for users near our main US data centre, the difference between logged-in and logged-out page view times may be imperceptible. But for users outside the US, we will continue to serve logged-out page views from the nearest cache, whereas logged-in page view requests are forwarded to Virginia, which will add 100–300ms of speed-of-light delay. This is not easy to fix.
Parser cache misses could be reduced or eliminated, but page save times are somewhat more difficult, in my opinion, especially if we continue to support pre-edit spam and vandalism detection. Some amount of processing is needed to detect vandalism – is it appropriate to pretend that we have saved the page when such processing is still going on, and to send a message later if the edit is rejected? And if we do that, do we show the updated site to the user while processing is in progress?

After HHVM is fully deployed, what are the next big projects to improve performance?

We are planning to work on editor performance, especially VisualEditor. Also, as previously noted, there will be ongoing profiling and optimisation work which will cumulatively improve performance.


More information is available at mw:HHVM/About, and information about the current development process can be found at mw:HHVM.

P.S.: If you too find HHVM to be awesome, you can leave your thanks to the developers here.