John Bennett's blog

Business services, EDA and asynchronous messaging

Sunday, April 12, 2009

Business services are not the same thing as web services.  A business service is a high-level concept: a large-grained, cohesive set of business functionality.  Everywhere you see the word “service” in this post you should think “big chunk of business functionality”.  Each business service is likely implemented as a set of smaller-grained components (some of which might be web services).  Business service is very macro.  Web service is very micro.

In an event-driven architecture (EDA), business services communicate with one another through asynchronous messaging, in order to reduce the coupling between them.  By communicating asynchronously, a business service can continue to operate (for some period of time) without any contact with other business services.  That’s where their “autonomous” quality comes from.

For example, a retailer might have an Inventory service, an Order service, a Billing service, and a Shipping service.  The Order service can take orders even if the other services are unavailable.  When those services become available, queued messages can be exchanged — inventory gets allocated, orders get shipped and customers get billed.

Suppose an order is placed over the web.  The Order service might publish an OrderReceived message.  The Inventory service would then publish either an InventoryAssignedToOrder or an ItemOutOfStock message, depending on what was actually available.  The Order service would subscribe to those types of messages.  If it received the first, it would then publish an OrderApproved message.  If it received the second, it might take other action, such as sending an email to the customer.
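The flow above can be sketched with a toy in-memory publish/subscribe bus.  This is an illustration only — a real system would use a durable message queue — and the `MessageBus` class and handler wiring are hypothetical, though the message names follow the post:

```python
from collections import defaultdict

class MessageBus:
    """Minimal in-memory publish/subscribe bus (illustration only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self.subscribers[message_type].append(handler)

    def publish(self, message_type, payload):
        for handler in self.subscribers[message_type]:
            handler(payload)

bus = MessageBus()
log = []

# Inventory service: on OrderReceived, assign stock and announce it.
bus.subscribe("OrderReceived",
              lambda order: bus.publish("InventoryAssignedToOrder", order))
# Order service: on InventoryAssignedToOrder, approve the order.
bus.subscribe("InventoryAssignedToOrder",
              lambda order: log.append(("OrderApproved", order["id"])))

bus.publish("OrderReceived", {"id": 42, "sku": "ABC", "qty": 1})
```

Neither service calls the other directly; each only publishes and subscribes, which is exactly what keeps them decoupled.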

You may have noticed that we just created a messaging protocol.  That is, the Order service must correlate multiple messages related to the same business process, and keep track of state related to those messages.  Within the Order service, the logic would be something like this:

  • When an order is placed (in an inbound web request), save it, publish an OrderReceived message, and send the customer a response message that the order was successfully received.
  • When an InventoryAssignedToOrder message is received, look up the order and record the assigned inventory quantity next to the quantity ordered.
  • When inventory has been assigned for all quantities of all products in the order, publish an OrderApproved message.
  • When an ItemOutOfStock message is received, send an email to the customer.
  • When an hour passes after sending an OrderReceived message, if not all quantities in the order have been assigned inventory and no ItemOutOfStock message has been received, publish an OrderDelayed message.

This is a saga — a long-running process within a single service that requires the service to correlate messages and track state across those messages.  Note that this is just one side of the equation.  The Inventory service might have its own saga. 

Notice how each of the saga’s business rules start with “When”.  These rules are what make the architecture event-driven.  Notice also how simple the rules are.  Especially interesting is that the technical implementation maps almost directly to the business rules.  We would create an OrderProcessingSaga class, which has methods for Handle<OrderReceived>(), Handle<InventoryAssignedToOrder>(), etc.  The simple logic of each method is described in one of the bullets above.  (The final bullet is implemented in Handle<Timeout>().)
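A rough Python sketch of that saga follows.  The `Handle<T>()` generic methods become plain methods here, and the state-tracking details (per-SKU quantities, the out-of-stock flag) are my assumptions about one reasonable implementation, not the author's actual code:

```python
class OrderProcessingSaga:
    """Sketch of the saga's correlated state and message handlers."""

    def __init__(self, order_lines):
        # order_lines: {sku: quantity ordered}; nothing assigned yet.
        self.ordered = dict(order_lines)
        self.assigned = {sku: 0 for sku in order_lines}
        self.out_of_stock_seen = False
        self.published = []  # messages this saga publishes

    def _fully_assigned(self):
        return all(self.assigned[s] >= q for s, q in self.ordered.items())

    def handle_inventory_assigned(self, sku, qty):
        # Record assigned inventory next to the quantity ordered.
        self.assigned[sku] += qty
        if self._fully_assigned():
            self.published.append("OrderApproved")

    def handle_item_out_of_stock(self, sku):
        self.out_of_stock_seen = True
        self.published.append("SendEmail")  # notify the customer

    def handle_timeout(self):
        # An hour after OrderReceived: delayed if neither fully
        # assigned nor already flagged out of stock.
        if not self._fully_assigned() and not self.out_of_stock_seen:
            self.published.append("OrderDelayed")
```

Each method body is one bullet from the list above; the saga's only job is to hold the correlation state between messages.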

How would we accomplish the same process in a synchronous RPC (remote procedure call) architecture?  In RPC (like typical HTTP-based web services), a request is issued and the requestor waits for the response.

  • When an order is placed (in an inbound web request), save it.  The inbound connection remains open.
  • The Order service opens a new connection to the Inventory service.  The connection remains open while the Inventory service determines whether inventory is available for all the quantities in the order.  The response to the Order service includes data on any product SKUs that are out of stock.  The response is sent and the connection from Order service to Inventory service is closed.
  • The Order service looks for any out of stock SKUs in the response.  Based on that, it creates the appropriate message (“order confirmed” or “items out of stock”), sends the customer a response message.

At first glance this might seem no different than the message-based process above.  But there are a number of problems with the RPC approach that reach across most of the -ilities:

Reliability.  Even with all the redundant hardware in the world, at some point the Inventory service will be down.  During that time, no orders can be accepted.  If each of the two services has 99% uptime, the total uptime for the system from the customer’s perspective is 99% x 99% = 98.01%.  The system as a whole is less reliable than its least reliable component.  To get the system to 99% uptime, both services need uptime of roughly 99.5%.  That means a greater investment in hardware, maintenance, etc.

In the message-based approach, the Order service can happily process orders without the Inventory service for up to an hour (or whatever rule is applied in the saga).  The services are not temporally coupled — they don’t always have to be running at the same time.  From the customer’s perspective, the uptime is 99%. 

The Inventory service can be taken down for maintenance.  Perhaps it could be deployed to two cheap servers instead of to an expensive active-passive cluster.  All without affecting the uptime of the Order service.

Performance.  In the RPC approach, the customer submits an order and then waits: while the Order service records the order, while the Order service contacts the Inventory service, while the Inventory service responds, and while the Order service figures out what message to return.  What if the Inventory service takes 5 seconds to determine whether inventory is available?  The customer sees a spinning ball.  In the message-based approach, the customer waits only for the Order service to record the order.  A response is then immediately sent to the customer.

Now add the latency of communication between the services.  If they are in different data centers, that could add half a second to the customer’s wait time in the RPC approach.  It adds zero time in the message-based approach.

Scalability.  In the RPC approach, connections between the customer and the Order service remain open for longer (as described above).  Let’s say the Order service receives 10 orders/second.  If the connection remains open for 2 seconds, the Order service has to support 20 simultaneous connections.  If the message-based approach reduced the connection time to 1 second, the Order service only has to support 10 simultaneous connections — for the same volume of orders.  This disparity can become much worse under real-world traffic peaks.  Customer requests may wind up queuing (decreasing performance further) or even timing out.

The same problem occurs between the Order service and the Inventory service, as connections are consumed there in the same proportion.

Transparency.  Take another look at the logic for the RPC and message-based approaches.  In which one is the business logic more readily apparent?  Which one do you think a sales rep would more easily understand?  I believe that the message-based approach wins handily on both counts.  As mentioned above, the business logic maps almost directly to the actual implementation classes and methods.  While a good RPC design can also have this quality, transparency is very natural and very easy in the message-based approach.

Maintainability.  Let’s add shipping to our business process.  In the RPC approach, during the order request, just after we assign the inventory to the order, we would open a connection to the Shipping service, ask it to ship the order, wait for its response, and then send the response to the customer.  With this third service, we’ve exacerbated all of the reliability, performance and scalability issues described above. 

All three systems have to be running.  At 99% uptime for each service, our overall system is now down to 97.03% uptime.  We don’t want to ship the order unless we actually have all products in stock, so we have to wait for the request to the Inventory service to return before we talk to the Shipping service.  Our connections from the customer are now up to 3 seconds each, and our Order service has to support 30 simultaneous connections.

What happens when we’ve assigned inventory and then the call to the shipping service fails?  We have to free the inventory somehow, so we place the whole order request in a transaction.  That transaction flows across each service call, so that if shipping fails the inventory assignment is rolled back.  The transaction management adds half a second to the customer’s wait time, and we’re up to needing 35 simultaneous connections on our Order service.

In the message-based approach, we create a new saga within the Shipping service.  It has a rule saying: When an OrderApproved message is received, try to ship the order.  If successful, publish an OrderShipped message; otherwise publish an OrderFailed message.

We add rules to the OrderProcessingSaga:  When an OrderShipped message is received, update the order status; and when an OrderFailed message is received, update the order status and send an email to the customer. 

And we add a rule to the Inventory service’s saga:  When an OrderFailed message is received, release any inventory assigned to the order.
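Those three new rules can be sketched as small additions to the sagas.  Only the new handlers are shown; class shapes, field names, and the hypothetical order id are my assumptions, while the message names follow the post:

```python
class ShippingSaga:
    """New saga in the Shipping service."""
    def __init__(self, carrier_available=True):
        self.carrier_available = carrier_available

    def handle_order_approved(self, order_id):
        # Try to ship; publish OrderShipped or OrderFailed accordingly.
        return "OrderShipped" if self.carrier_available else "OrderFailed"

class OrderProcessingSaga:
    """Only the new shipping-related handlers are shown here."""
    def __init__(self):
        self.status = "approved"
        self.actions = []

    def handle_order_shipped(self, order_id):
        self.status = "shipped"

    def handle_order_failed(self, order_id):
        self.status = "failed"
        self.actions.append(("email_customer", order_id))

class InventorySaga:
    def __init__(self, assigned):
        self.assigned = dict(assigned)  # order_id -> units held

    def handle_order_failed(self, order_id):
        # Release any inventory assigned to the failed order.
        self.assigned.pop(order_id, None)
```

Each service compensates for a failed shipment by reacting to the same OrderFailed message, which is what makes the distributed transaction unnecessary.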

We haven’t had to make a single change to the processing logic for the customer’s initial request to place the order.  We haven’t had to create distributed transactions, because the business rules in the various sagas handle failures correctly.  We haven’t reduced our system uptime.  We haven’t lengthened the time a customer waits for a response, and therefore we don’t have to support more simultaneous connections from customers.

Conclusions

The business process described here is pretty simple.  Real world processes get much more complicated:  Add a credit check before assigning inventory.  Bill the customer after the order ships, after getting sales tax computation from an external service.  Substitute an equivalent product if the one ordered has been discontinued.  Yikes! 

If you read “service-oriented” as “synchronous RPC-style web services,” you’ll quickly be in big trouble with a system of any significant size.  However, if you read “service-oriented” as “event-driven, autonomous business services communicating with asynchronous messages across service boundaries,” you’ll have a much better chance of actually enjoying the maintenance of your system.  I’ll take the latter, thanks.