Wayfair Tech Blog

Custom Module Loading in a Node.js Environment


Early in 2017, in the course of deciding to convert to React as our frontend framework, we realized we needed to render our React components on the server. We adapted Airbnb’s open source Hypernova renderer for our purposes. In particular, we implemented custom module resolution and modified the Hypernova module loader in order to have complete sandboxing between requests, with much less chance of accidental data leaks.

In the course of this work we learned a lot about module loading, and came up with ideas about how we might go about it better in the future. Are you embarking on a similar project in your engineering organization, or interested in custom module loading in general? Then this article is for you.

How Our Module Format Is Different

Hypernova runs on Node.js, which uses CommonJS modules and relative paths. Our codebase is a mix of ES modules (new code) and AMD modules (legacy code). ES module code is transformed to AMD modules before being delivered to the browser. Rather than the standard relative paths for imports, our modules use a flat namespace and are referred to by their unique file names.

Node.js imports look like this:

const foo = require('../some_folder/foo');

While Wayfair code imports look like this:

define('some_module', ['foo'], function(foo){ /* module source */ })

We had to roll up our sleeves and figure out how to get Node to load these modules. Before doing so, we first had to understand how a Node application loads modules by default. Here are the important things you need to know to turn a module name ('../some_folder/foo', or 'foo') into loaded and running code.

How Module Loading Works


Module Resolution

Before an application can load code it has to find it on disk. Node uses a series of rules to search for code based on the argument to require. If the argument looks like a path, Node looks for the file at that path, relative to the current file. If it doesn't look like a path, it's assumed to live in a node_modules folder, and Node will look for the file in a series of possible node_modules locations. There are a few complicated bits hidden in here. The first is the building of relative paths: given only a relative path as an argument, how does require know which directory to resolve relative to? The second is the series of potential node_modules folders to look in: Node will look for node_modules folders all the way up the directory tree from the current file.

Module._resolveFilename is the function that's ultimately responsible for turning a string like '../foo/bar.js' into an absolute path.
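You can poke at this machinery from ordinary application code: require.resolve runs the same lookup and returns the absolute path, and module.paths lists the node_modules directories Node will search, in order. (The file paths below are our own illustration.)

// From a file at /home/app/src/routes/product.js:
require.resolve('../some_folder/foo');
// '/home/app/src/some_folder/foo.js'

console.log(module.paths);
// [ '/home/app/src/routes/node_modules',
//   '/home/app/src/node_modules',
//   '/home/app/node_modules',
//   '/home/node_modules',
//   '/node_modules' ]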

Module Compiling and Execution

Once the source code is found, it needs to be loaded and run. The value returned by require needs to be created somehow. This happens inside Node's Module.js. The compilation stage is handled by Node's vm module, which, according to the documentation, "provides APIs for compiling and running code within V8 Virtual Machine contexts". The vm module is like JavaScript's built-in eval function, but offers much more control over the scope and execution environment.
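As a minimal illustration (the snippet is our own, not from Node's source), runInThisContext turns a string of code into a live value in the current context:

const vm = require('vm');

// Compile a string of JavaScript; the function expression it evaluates to
// comes back as a real, callable function.
const double = vm.runInThisContext('(function (n) { return n * 2; })', {
  filename: 'double.js'
});

double(21); // 42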

First, the code is actually loaded from disk. This code isn't executed directly, though. It's modified slightly so that a closure is created into which Node can inject module-specific variables. Without understanding this step, it's easy to think of the implicit __dirname and __filename variables as mysterious and magical. In fact, they're just arguments to a function that Node wraps around your code. Module.wrap is responsible for building the final JavaScript string which will be parsed and executed.

var wrapper = Module.wrap(content);

The wrapped module looks like this:

(function (exports, require, module, __filename, __dirname) {
  /* module body */
});

Next, the wrapped code is parsed and compiled:

var compiledWrapper = vm.runInThisContext(wrapper, {
  filename: filename,
  lineOffset: 0,
  displayErrors: true
});

This gives us a function that, when called, will initialize or execute a module. This is the function Node calls to actually populate the module's exports:

compiledWrapper.call(this.exports, this.exports, require, this, filename, dirname);

So where did we go from here?

How We Approached the Challenge

Module Resolution: Monkey Patching Module._resolveFilename

In order to get our application code working with Hypernova, we had to solve two problems. First, our modules look like npm modules (which would be found in a node_modules folder), even though they aren't. Second, instead of the standard relative paths ('../foo/bar'), we refer to modules by a single name ('bar'). We realized that all module resolution eventually went through a single function, Module._resolveFilename, which we could override.

const nodeResolveFilename = Module._resolveFilename;
Module._resolveFilename = (filename, ...args) => {
  if (isWayfairModule(filename)) {
    return resolveWayfairFilename(filename);
  } else {
    // fall back to Node's resolver, forwarding its remaining arguments
    return nodeResolveFilename(filename, ...args);
  }
};

This works, but has a few drawbacks. Monkey patching makes code harder to follow and can make applications much more difficult to reason about. On top of that, since the Hypernova rendering application and the customer-facing code it renders share a module loader, it's now possible for our developers to accidentally break Hypernova. If a developer creates a module called 'banana', and Hypernova relies on a published npm module called 'banana', our application code could be loaded instead of the npm module.
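One way to reduce that collision risk is to have isWayfairModule consult an explicit manifest rather than guess from the name alone. Here's a minimal sketch, assuming our flat-namespace modules all live under a single directory (WAYFAIR_MODULE_DIR is a hypothetical constant, not part of our actual setup):

const {readdirSync} = require('fs');

// Build the set of known Wayfair module names once, at startup.
const wayfairModuleManifest = new Set(
  readdirSync(WAYFAIR_MODULE_DIR).map(file => file.replace(/\.js$/, ''))
);

// Only divert names we actually ship; everything else falls through
// to Node's normal node_modules resolution.
const isWayfairModule = moduleName => wayfairModuleManifest.has(moduleName);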

How We Got AMD Module Loading to Work

AMD modules are defined by calling an implicit (meaning not imported and non-local) define function.

define('some_module', ['foo'], function(foo){ /* module source */ })

This function isn't automatically provided by Node, so our AMD modules fail when loaded using the default Node machinery with the following error: "ReferenceError: define is not defined". What we need is an implicit define function which will automatically call require('foo') and pass the result of that call to the callback function, which is the third argument to define. Another bit of monkey patching solves this problem for us.

We take the original Module.wrap and replace it with a version that provides this define function.

Original (simplified for readability):

Module.wrap = moduleBody => `(function (exports, require, module, __filename, __dirname) {
${moduleBody}
})`;

Enhanced:

Module.wrap = moduleBody => `(function (exports, require, module, __filename, __dirname) {

const define = (moduleName, dependencies, callback) => {
  // load all the dependencies
  const resolvedDependencies = dependencies.map(require);
  // then call the callback, passing in those dependencies;
  // its return value becomes the module's exports
  module.exports = callback.apply(this, resolvedDependencies);
};

${moduleBody}

})`;

Aside from the slightly queasy feeling that comes with any monkey patching, we've discovered no major downsides to this approach. This is interesting because it shows how much flexibility we have when loading modules: We can give any module any implicit data we want. It's useful for things like running AMD modules, and potentially useful for exposing ways to communicate between our application code and our rendering service. For example, what if we wanted to make a request object available to all modules, containing things like a requestID and a means to access a WebSocket session directly? We'll come back to this idea below.

How We Sandboxed Modules Between Requests

When React code runs on the client side, it runs in the browser's context, on the end user's own machine. When it runs on the server side, thousands of different users are served by code running on the same application instance. Unlike in PHP, our primary server-side technology, where each request starts with a clean slate, Node modules are by default singletons which persist across requests for the lifetime of the application. Because of the complexity of our application code, and our large developer population (hundreds of people writing React code which runs on our customer-facing websites), we were uncomfortable with the risk of data leaking, via these singleton modules, from one request to another.
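To make the singleton behavior concrete, here's a contrived example of our own:

// counter.js
let count = 0;
module.exports = () => ++count;

// elsewhere.js -- both requires hand back the very same function
const counterA = require('./counter');
const counterB = require('./counter');
counterA(); // 1
counterB(); // 2 -- state persists across every require in the process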

Let's imagine one feature in our codebase uses the singleton nature of modules to serve as a global data bus that different legacy components can hook into. In a server-rendered application, we could accidentally leak data from one request to another, displaying incorrect information like this:

myGlobalDataBusModule.someGlobalMessage =
  myGlobalDataBusModule.someGlobalMessage ||
  getMessageInLocationSpecificLanguage();

We chose to look into aggressively sandboxing requests so that it would be difficult, if not impossible, for developers to accidentally allow request-specific data to persist from one user to another.

In order to build a sandbox, we need to change the default behavior of our application so that the modules imported in the course of serving a request aren't singletons, but request-specific instances. You might recall from earlier in this article that loading modules involves both a compile and an execute step.

Compile:

const compiledWrapper = vm.runInThisContext(...args);

Execute:

compiledWrapper.call(exports, exports, require, module, __filename, __dirname);

Compiling code is relatively expensive, so we only do it once per file. We then execute the compiled code (compiledWrapper.call(...args)) on every request. This still adds a non-trivial performance penalty that we would rather not pay — roughly 50 milliseconds on a complex page — but it ensures that the exports of every module are unique to every request.
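In outline, the split looks like this (a simplified sketch, not our production code; it leans on the ambient require for brevity):

const fs = require('fs');
const vm = require('vm');
const Module = require('module');
const {dirname} = require('path');

const compiledWrapperCache = new Map();

// Compile each file at most once, on first touch.
const getCompiledWrapper = filename => {
  let wrapper = compiledWrapperCache.get(filename);
  if (!wrapper) {
    const content = fs.readFileSync(filename, 'utf-8');
    wrapper = vm.runInThisContext(Module.wrap(content), {filename});
    compiledWrapperCache.set(filename, wrapper);
  }
  return wrapper;
};

// Re-execute the cached wrapper on every request, so each request
// gets its own fresh exports object.
const executeForRequest = filename => {
  const module = {exports: {}};
  getCompiledWrapper(filename).call(
    module.exports,
    module.exports,
    require,
    module,
    filename,
    dirname(filename)
  );
  return module.exports;
};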

In A Nutshell

Through a few bits of strategic monkey patching (Module.wrap and Module._resolveFilename), we were able to use our particular module format (named AMD modules, flat namespace) in a Node application. By breaking apart the compile and execute steps, compiling only at application startup but re-executing the compiled wrapper on every request, we achieve a good amount of security with significantly less of a performance penalty than if we completely re-ran the entire module loading process on every request.

Thinking Ahead and Further Improvements

The modifications we've made have served us well, but they entail both technical debt and some degree of performance-versus-security trade-off. We've hijacked the built-in Node module loading mechanism, which results in code that's difficult to understand and exposes us to potential bugs. We also re-run our module initializer functions (compiledWrapper.call(...args)), incurring a non-trivial performance penalty on every single page. Let's explore some ways we could avoid these downsides.

In order to avoid monkey patching Node's module loading code, and in order to provide an implicit request object to all modules, we'd need to build a parallel module loader which is much less intertwined with Node's. This means we'll need to create our own require function, since require calls Module._resolveFilename, which is the function we're trying to avoid monkey patching in the first place.

We’ll import this module loader just as we’d import any other module:

// module_which_needs_to_load_unique_instance.js
import loadUniqueInstance from 'load-unique-instance';

// run the following on every request
const uniqueInstance = loadUniqueInstance('root_component', __filename, {});

Our custom module loader will then supply its own version of require to all of root_component's dependencies.

The following is the simplest possible module loader. There are a few essential features missing from this one, but it gets the job done for the most basic cases:

import {readFileSync} from 'fs';
import vm from 'vm';
import {_findPath} from 'module';
import {dirname} from 'path';

const loadUniqueInstance = (moduleName, from, request) => {
  const makeRequire = pathToParent => {
    return moduleName => {
      const parentDir = dirname(pathToParent);

      // use Node's built-in module resolver here, but we could easily pass in our own
      const filename = _findPath(moduleName, [parentDir]);
      const source = readFileSync(filename, 'utf-8');

      // The default wrapper function takes five arguments: exports, require, module, __filename, __dirname
      // If we control module loading, we can provide whatever additional implicit variables we want
      const wrapped = `(function (exports, require, module, __filename, __dirname, __request) {
${source}
})`;

      // todo -- cache this so we only compile every file once
      const compiled = vm.runInNewContext(wrapped, {}, {});

      const exports = {};
      // this is OUR require, completely independent of the built-in Node require
      const require = makeRequire(filename);
      const module = {exports};

      // todo -- cache exports so we reuse them inside every request,
      // but don't share them across requests
      compiled(exports, require, module, filename, dirname(filename), request);
      return module.exports;
    };
  };

  return makeRequire(from)(moduleName);
};

export default loadUniqueInstance;
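As a sketch of how those two TODO caches might slot in (reusing the imports above; the names are our own invention): a process-wide compile cache shared by all requests, plus an exports cache created fresh per call, so a module is a singleton within a request but never across requests.

// Shared across all requests: each file is compiled at most once.
const compileCache = new Map();

const loadUniqueInstanceCached = (moduleName, from, request) => {
  // Fresh per request: modules are shared *within* this request only.
  const exportsCache = new Map();

  const makeRequire = pathToParent => moduleName => {
    const filename = _findPath(moduleName, [dirname(pathToParent)]);

    if (exportsCache.has(filename)) {
      return exportsCache.get(filename);
    }

    let compiled = compileCache.get(filename);
    if (!compiled) {
      const wrapped = `(function (exports, require, module, __filename, __dirname, __request) {
${readFileSync(filename, 'utf-8')}
})`;
      compiled = vm.runInNewContext(wrapped, {}, {});
      compileCache.set(filename, compiled);
    }

    const module = {exports: {}};
    // Cache before executing so circular dependencies see a partial
    // exports object, just as Node's own loader does.
    exportsCache.set(filename, module.exports);
    compiled(module.exports, makeRequire(filename), module, filename, dirname(filename), request);
    // Pick up any reassignment of module.exports inside the module body.
    exportsCache.set(filename, module.exports);
    return module.exports;
  };

  return makeRequire(from)(moduleName);
};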

We like the security offered by sandboxing modules, but we want to squeeze as much performance as possible out of our server infrastructure. It would be great if we didn't have to pay the 50 millisecond penalty for calling compiled. There's no getting around the fact that initializing all those JavaScript modules takes time, but there's no requirement about when that work happens. It's possible to complete that 50 milliseconds of work before the initialized modules are needed, and have them ready and waiting when a request comes in.

To illustrate this idea in a little more detail, imagine we have a React component called ProductDetail. This is a complex component which depends on many hundreds of JavaScript modules, each of which needs to be initialized on every request if we want to avoid sharing modules between requests. In our imaginary future module loader, after the initial request to ProductDetail comes in, the system would say: "It's likely that another request similar to this one will arrive. I should always have some instances of ProductDetail, plus dependencies, waiting and ready for a corresponding request to come in."

Once we have a system that's pre-initializing all of these modules, we still have 50 milliseconds of work happening, and often a request will come in and have to wait for that work to finish. We can further improve performance by chunking the module initialization work. Rather than a single, blocking 50 millisecond function call, we can make every individual module initialization asynchronous, possibly even waiting until the application is idle (meaning there are no in-flight requests currently being processed) before undertaking this work; a sketch of that wait-for-idle piece follows the code below.

const modulePool = {};
const loadModuleQuickly = (moduleName, parentDir) => {
  // Create an entry in modulePool for moduleName, if none exists.
  const thisModulePool = (modulePool[moduleName] =
    modulePool[moduleName] || []);
  // If we have an already-initialized module waiting, go ahead and
  // use that one.
  const results = thisModulePool.length
    ? thisModulePool.pop()
    : // otherwise, create one right now.
      customRequire(moduleName, parentDir);
  // Wait a tick
  Promise.resolve().then(() => {
    // then pre-initialize the module that was just requested.
    thisModulePool.push(customRequire(moduleName, parentDir));
  });
  return results;
};
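The wait-for-idle piece might look something like this (a sketch; the request lifecycle hooks are hypothetical and would need to be wired into the server):

let inFlightRequests = 0;
const idleQueue = [];

// Called from the server's request lifecycle hooks.
const onRequestStart = () => {
  inFlightRequests++;
};

const onRequestEnd = () => {
  inFlightRequests--;
  if (inFlightRequests === 0) {
    // The server is idle: drain any deferred module initializations.
    while (idleQueue.length > 0) {
      idleQueue.shift()();
    }
  }
};

// Run the task now if we're idle; otherwise defer it until we are.
const whenIdle = task => {
  if (inFlightRequests === 0) {
    task();
  } else {
    idleQueue.push(task);
  }
};

loadModuleQuickly would then refill the pool with whenIdle(() => thisModulePool.push(customRequire(moduleName, parentDir))) instead of scheduling the work on the next tick.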

Using this approach, it’s likely that we’ll be able to have secure sandboxing without any performance penalties. The price we’ll pay here comes by way of application complexity. This will be especially true once we add asynchronous dependency graph initialization and a wait-for-idle feature, meaning there will be a lot of intricate moving parts.

Regardless of how many of these features we choose to implement in future versions of our server-side rendering system, we have a lot of options. Module loading is not magic, and by taking it into our own hands we can potentially have improved security, better performance, and simpler, more digestible code.

A Team Effort

Mastering server-side rendering was very much a team effort, with the author generally acting as an interested observer as other software engineers came up with the solutions described above. Most of the engineering credit goes to Arty Buldauskas, Ben Roberts, and Matt DeGennaro. Additionally, we’re grateful to Airbnb for open-sourcing Hypernova.

Comments? Feedback? Let us know in the comments below!
