Trouble Shooting Wiki

Handling errors in SOA based systems

Source Book
ISBN 978-1-847193-55-1
Publisher Packt Publishing
Author(s) Antony Reynolds, Matt Wright

Handling errors or faults is an important consideration for SOA based applications, especially if you consider that a solution is likely to be a loose assembly of independent components each with varying levels of resilience, throughput, and availability.

How a fault is handled depends on a number of factors: whether it is a business fault or a system fault, whether the service in which the fault originated was called synchronously or asynchronously, and whether the interaction between the client and the component detecting the fault is synchronous or asynchronous.

A business fault is loosely defined as one we know about in advance, in that it is defined as part of the service contract and so represents a legitimate state within the business process. How we handle a fault of this type is largely driven by business requirements, and so it makes sense to handle these within the context of the process.

A system fault, conversely, is one that is unexpected in nature, and could occur to any component in the solution. Such faults are often caused by infrastructure problems, such as a network outage or a service being unavailable. Often these are temporary and can be handled by retrying the service at a later time.

The interaction between the client and the component detecting the fault also influences how we handle it. If the interaction is asynchronous, the component has time to resolve the problem; for example, if the fault occurred because a service was unavailable, the component can retry the service later.

With synchronous interactions, we only have a small window in which to resolve the fault before the client times out waiting for our component and raises its own fault. With this style of interaction, often all we can do is catch the fault, undo any partially completed activities so that we leave the system in a consistent state, and then return a fault to the client.



Business faults

A business fault is one that is defined in the WSDL of the service; how we define the fault depends on whether the service is synchronous or asynchronous.

Defining faults in synchronous services

Synchronous services signal faults by returning a fault element in place of the defined output message for the service. These faults are defined in the WSDL of the service, and are denoted by the <fault> element.

For example, the oBay application implements a dummy CreditCard service, which includes the operation verifyCreditCard; the definition of the operation is as follows:

<portType name="CardServices">
  <operation name="verifyCreditCard">
    <input message="tns:verifyCreditCard"/>
    <output message="tns:verifyCreditCardResponse"/>
    <fault name="invalidCreditCard"
           message="tns:invalidCreditCardFault"/>
  </operation>
</portType>

As well as defining the standard input and output messages for the operation, it lists a fault message (the fault element above) that could be returned in place of the defined output message. An operation can define zero, one, or many faults; they are similar in construct to an output message, except that each must also be named so that the client can distinguish which fault has been returned.

When a soap:Fault is generated, the faultcode will contain the fault name (tns:invalidCreditCard in the above example) and the detail element will contain the content of the fault message.
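As a rough illustration, a fault response for verifyCreditCard might look like the sketch below. The namespace bound to tns is the one used for the CardServices faults in the fault policy examples in this article; the faultstring text and the element inside detail are hypothetical, since the schema of the invalidCreditCardFault message is not shown here.

```xml
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:tns="http://xmlns.packtpub.com/obay/CardServices">
  <soap:Body>
    <soap:Fault>
      <!-- faultcode carries the name of the fault declared in the WSDL -->
      <faultcode>tns:invalidCreditCard</faultcode>
      <!-- hypothetical human-readable description -->
      <faultstring>Credit card failed verification</faultstring>
      <detail>
        <!-- content of the tns:invalidCreditCardFault message;
             the element name and its content are assumptions -->
        <tns:invalidCreditCardFault>...</tns:invalidCreditCardFault>
      </detail>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>
```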

Defining faults in asynchronous services

Asynchronous services don't explicitly support the concept of faults; this is because the result of an asynchronous service is returned in a separate callback operation. So to signal a fault, the service will need to define additional callbacks, typically one extra callback per fault. If we take our credit card example and rewrite it as an asynchronous service, we get the corresponding WSDL:

<portType name="CardServices">
  <operation name="verifyCreditCard">
    <input message="tns:verifyCreditCard"/>
  </operation>
</portType>
<portType name="CardServicesCallback">
  <operation name="creditCardVerified">
    <input message="tns:creditCardVerified"/>
  </operation>
  <operation name="invalidCreditCard">
    <input message="tns:invalidCreditCard"/>
  </operation>
</portType>

Here we can see that we've defined a second callback operation, invalidCreditCard, which corresponds to the fault we defined in the synchronous operation. We've used the fault name as the operation name in the callback, and while we have two different messages, in reality they are identical; we have just used different names to stick to our naming conventions.

It is still possible for the invocation of an asynchronous service to return a fault. This can occur when the system is unable to successfully deliver the invocation message to the asynchronous service, for example the network connection is down. This type of fault we would treat as a system fault as opposed to a business fault.

Handling business faults in BPEL

Within a BPEL process, any call to a PartnerLink could result in a fault being raised. Other activities within a process can also result in a fault being thrown (for example due to a selection failure within an assign activity), and in addition the process itself may need to signal a fault.

When a fault occurs in a BPEL process, the process must first catch the fault; otherwise the process will terminate with a state of closed.faulted. Once the fault is caught, the next step is to decide whether it can be handled locally within the process or whether it needs to be returned to the client.

If the interaction between the client and the process is synchronous, it provides limited opportunity to correct the cause of the fault and retry the activity. For example if the fault occurred due to a service not being available, we can retry the service in the hope that its outage was very temporary, but if we wait for the service to come back up then the client of our BPEL process is likely to time out and raise its own fault.

So with synchronous interactions, all we can really do is catch the fault, undo any partially completed activities so that we leave the system in a consistent state, and then return a fault to the client.

The client itself may be a BPEL process or another SOA component. If the interaction between this component and its own client is also synchronous, it will typically need to return its own fault, and so on up the chain until the interaction between a client and a component is asynchronous in nature.

With asynchronous interactions, we have a lot more flexibility to handle the fault within the context of the process, as the client is unlikely to time out (though we still need to allow for the fact that the client may only wait so long).

If the fault is temporary in nature, such as a service not being available, we can wait for the issue to be resolved and retry the activity later. However, this type of fault should be handled using the Fault Management Framework (which we will cover later in this tutorial). This allows us to focus on handling business faults within our BPEL process, which keeps our process simpler and easier to maintain.

Handling business faults is just a natural extension to the process, in that we need to model the process to cater for these types of scenarios. For example if a fault occurred due to invalid data, then in a synchronous interaction we would just return details of the fault to the client, while in an asynchronous interaction we could create a Human Workflow Task for someone to capture the correct data in order that the process can resume.

Catching faults

The first step in handling a fault is to catch it. Within BPEL we do this using a <catch> branch, which can either be attached to a scope or the process. With a catch branch, we specify the name of the fault to be caught and the series of activities to be carried out in that event.

Once the catch branch has completed, processing will continue with the next activity following the scope in which the fault was caught, assuming of course another fault hasn't been thrown.

We can define as many catch branches as we want for a scope, and, in addition, we can also attach a <catchAll> branch which will catch any fault not caught by any of the specific <catch> activities.

When a fault is raised, the BPEL engine will first check the current scope to determine a suitable catch or catch all branch. If the fault is not caught, the BPEL engine will then check the containing scope for an appropriate fault handler, and so on up to process level.

If the fault is not caught at this level, then the process will terminate with a status of closed.faulted. If the interaction between the client and the process is synchronous, then the fault will be automatically returned to the client. However, if the interaction is asynchronous then the fault will not be returned, with the potential result being that the client may hang waiting for a response that is never sent.
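The catch and catchAll branches described above can be sketched in BPEL source as follows. This is a minimal outline assuming BPEL 1.1 syntax; the scope name matches the example used later in this article, while the variable name cardFault and the empty placeholder activities are hypothetical.

```xml
<scope name="VerifyCreditCard"
       xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/"
       xmlns:cc="http://xmlns.packtpub.com/obay/CardServices">
  <faultHandlers>
    <!-- handles only the named business fault;
         cardFault is a hypothetical variable name -->
    <catch faultName="cc:invalidCreditCard" faultVariable="cardFault">
      <empty name="HandleInvalidCard"/>
    </catch>
    <!-- handles any fault not matched by a specific catch branch -->
    <catchAll>
      <empty name="HandleAnyOtherFault"/>
    </catchAll>
  </faultHandlers>
  <sequence>
    <!-- placeholder for the normal activities of the scope -->
    <empty name="NormalProcessing"/>
  </sequence>
</scope>
```

A fault raised inside the sequence that matches neither branch of an inner scope would propagate to the enclosing scope, and finally to the fault handlers defined on the process itself.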

Adding a catch branch

To demonstrate this we will look at the UserRegistration process, which needs to carry out a number of checks; for example, that the requested userId isn't already in use and that the supplied credit card is valid. Should one of these checks fail, we need to catch the fault and then return a reply to the client to indicate that an error has occurred.

To achieve this we will place each validation step in its own scope, and define a fault handler for each one. To add a catch branch to a scope, click on the Add Catch Branch icon for the scope; this will add an empty catch branch to the scope as shown in the following figure:

The next step is to specify the type of fault that you want to catch; to do this double-click on the catch branch icon (circled in the previous screenshot). This will bring up the Catch dialogue as shown in the following figure:

Click on the flashlight for the Fault QName (circled in the previous screenshot) and this will launch the Fault Chooser dialogue. From here you can browse to the fault that we want to catch; in our case this is the invalidCreditCard fault defined in the WSDL file of the CreditCardServices Partner Link.

There is also the option to specify a fault variable to hold details of the fault returned; this should be of type Message and match the message type defined for the fault, that is client:invalidCreditCardFault in the case of the fault invalidCreditCard (as defined in the WSDL file for this service).

Once we have caught the fault, we need to specify the activities to perform in order to handle the fault. In our case we need to undo any activities completed in previous scopes, using the compensate activity before we return the fault invalidUserDetails to the caller of this process.

However, the current scope is not the correct context for triggering the required compensation (we will see why in a moment), so our fault handler needs to capture the reason for the fault and throw a new fault that can be handled at the appropriate place within our process.

Throwing faults

To do this expand the catch branch for the Fault Handler by clicking on the + symbol, and drag a Throw activity into it.

To specify the fault we wish to throw, double-click the Throw activity to bring up the dialogue to configure it, as shown in the following screenshot:

Next click the flashlight to bring up the Fault Chooser. This time we want to browse to the fault we wish to throw, which is the invalidUserDetails fault defined in the WSDL file for the UserRegistration process.

We also want to record the reason for the invalid user details, so we need to define a fault variable to hold this. The simplest way to do this is by clicking on the magic wand icon to create a variable of the right type, though you should specify that the variable is local to the scope as opposed to global.

Finally, we've added a simple assign activity before our throw activity to populate our fault variable, so our final catch branch looks as follows:
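In BPEL source form, a catch branch of this kind might look roughly like the sketch below. It assumes BPEL 1.1 syntax. The namespace bound to client, the variable names, the part name payload, and the reason element are all hypothetical placeholders, since the UserRegistration WSDL is not reproduced here.

```xml
<catch faultName="cc:invalidCreditCard" faultVariable="cardFault"
       xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/"
       xmlns:cc="http://xmlns.packtpub.com/obay/CardServices"
       xmlns:client="http://xmlns.packtpub.com/obay/UserRegistration">
  <sequence>
    <!-- record why the user details were rejected;
         invalidUserDetailsFault is assumed to be declared
         as a variable on the enclosing scope -->
    <assign name="SetFaultReason">
      <copy>
        <from expression="string('Invalid credit card')"/>
        <to variable="invalidUserDetailsFault" part="payload"
            query="/client:invalidUserDetails/client:reason"/>
      </copy>
    </assign>
    <!-- raise the process-level fault, carrying the reason with it -->
    <throw name="ThrowInvalidUserDetails"
           faultName="client:invalidUserDetails"
           faultVariable="invalidUserDetailsFault"/>
  </sequence>
</catch>
```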

Compensation

As part of the user registration process we need to check that the requested userId is not already in use. We do this by attempting to insert a record into the obay_user table (where userId is the primary key).

If this succeeds, we know the userId is unique, and at the same time we prevent anyone else from acquiring it (on the off chance that two requests with the same userId are submitted at the same time).

We do this before verifying the credit card, the result being that if the credit card fails verification we end up with a user record for the specified userId in the obay_user table. This will cause the next request to fail when the user resubmits their request with corrected credit card details.

An alternative approach would be to verify the credit card first before validating the userId; however, with this approach if the user chooses multiple userIds that are already taken, their credit card would be validated several times, which could cause issues with the card company.

To prevent resubmission of user registrations from failing, we need to undo the creation of the user record; one way of achieving this is to use the compensation model provided by BPEL.

This allows us to break a BPEL process up into logical components using scopes. For each scope we can define a compensation handler that will contain a sequence of one or more activities to reverse the effects of the activities contained within that scope.

In our case we need to define a compensation handler on the CreateUser scope, which deletes the user record created by the scope.

Defining compensation

To define the compensation activities for a scope, click on the Add Compensation Handler icon for the scope; this will add an empty compensation branch on the scope as shown in the following figure:

Once you've created your compensation handler, simply add the activities that need to be carried out to undo the effect of the scope; in our case we just need to call the deleteUser operation on the UserManagement service.
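In BPEL source, the result might resemble the following sketch, assuming BPEL 1.1 syntax. The deleteUser operation and the UserManagement service come from the text above; the namespace bound to um, the port type name, the createUser operation, and the variable names are hypothetical.

```xml
<scope name="CreateUser"
       xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/"
       xmlns:um="http://xmlns.packtpub.com/obay/UserManagement">
  <compensationHandler>
    <!-- reverses the effect of this scope by deleting the user record -->
    <invoke name="DeleteUser" partnerLink="UserManagement"
            portType="um:UserManagement" operation="deleteUser"
            inputVariable="deleteUserInput"/>
  </compensationHandler>
  <sequence>
    <!-- hypothetical operation that inserts the row into obay_user -->
    <invoke name="CreateUserRecord" partnerLink="UserManagement"
            portType="um:UserManagement" operation="createUser"
            inputVariable="createUserInput"/>
  </sequence>
</scope>
```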

Triggering a compensation handler

Compensation handlers aren't triggered automatically; they need to be explicitly invoked using the Compensate activity, which can only be invoked from within a fault handler or another compensation handler.

When the compensate activity is executed, it will only invoke the compensation handlers for those scopes directly contained within the scope for which the fault handler is defined. Or if invoked in a fault handler at the process level (as in our example), it will only execute the compensation handlers for the top level scopes.

The compensation handlers will only be invoked for those scopes which have completed successfully, and will be invoked in the reverse of the order in which the scopes completed. That is, the compensation handler for the most recently completed scope will be invoked first, then the next most recent, and so on.

If a scope whose compensation handler has been invoked contains scopes for which compensation needs to be performed, then it will need to call the compensate activity within its own compensation handler.

If a scope doesn't have an explicit compensation handler defined for it, then it will have a default compensation handler which just invokes the compensate activity.
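In other words, a scope with no explicit handler behaves as if it had been given the following one (a sketch in BPEL 1.1 syntax):

```xml
<compensationHandler
    xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/">
  <!-- no scope attribute: compensate every completed child scope,
       most recently completed first -->
  <compensate/>
</compensationHandler>
```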

Adding a compensate activity

For our purposes we need to trigger the compensate activity at the process level, so to do this we have defined a fault handler on the process to catch the invalidUserDetails fault thrown by our previous fault handler.

Once done, we will add a Compensate activity as the first activity within our fault handler. To configure it, double-click the Compensate activity to bring up the dialogue as shown:

Here we have the option of specifying a Scope name to restrict it to invoking the compensation handler for that scope. For our purposes we want to invoke the compensation handler for all top level scopes, so we have left it blank.
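A process-level fault handler of this shape could be sketched in BPEL 1.1 source as below. The namespace bound to client and the variable name are hypothetical placeholders.

```xml
<faultHandlers
    xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/"
    xmlns:client="http://xmlns.packtpub.com/obay/UserRegistration">
  <catch faultName="client:invalidUserDetails"
         faultVariable="invalidUserDetailsFault">
    <sequence>
      <!-- scope attribute omitted: compensate all completed
           top level scopes -->
      <compensate name="UndoCompletedScopes"/>
      <!-- to restrict compensation to a single scope, the attribute
           would be set instead, for example scope="CreateUser" -->
    </sequence>
  </catch>
</faultHandlers>
```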

Returning faults

If at run time the verifyCreditCard operation returns a fault of type invalidCreditCard then this will be caught by the catch branch we defined on the VerifyCreditCard scope.

This fault handler will throw an invalidUserDetails fault, which will get caught by the catch branch defined against our process. This will execute the compensate activity, triggering the compensation handler on the CreateUser scope, which will delete the previously inserted user record.

The final step is to return an invalidUserDetails fault to the caller of the BPEL process. To return a fault within BPEL we use the Reply activity, just as for a normal response; the difference is that we configure it to return a fault as opposed to a standard output message, as shown in the following screenshot:

Here we have configured the Partner Link and Operation as you would for a standard reply.

However, for the Variable we need to specify a variable that contains the content of the fault to be returned. In our case this is the content of the fault caught by our process level fault handler (and populated by the fault handler for the VerifyCreditCard scope).

Finally we need to specify that an invalidUserDetails fault should be returned. Specify this by clicking on the flashlight in the Fault QName panel to launch the now familiar Fault Chooser. After returning the fault, the process will complete.
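In BPEL 1.1 source, a reply configured this way differs from a normal reply only in its faultName attribute. In the sketch below the partner link name, port type, operation name, variable name, and the namespace bound to client are hypothetical.

```xml
<reply name="ReplyInvalidUserDetails"
       xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/"
       xmlns:client="http://xmlns.packtpub.com/obay/UserRegistration"
       partnerLink="client"
       portType="client:UserRegistration"
       operation="registerUser"
       variable="invalidUserDetailsFault"
       faultName="client:invalidUserDetails"/>
<!-- faultName marks this reply as a fault; the variable must be of
     the message type declared for that fault in the WSDL -->
```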

If a fault had been triggered at the step of creating the user record (for example because the userId was already in use), then an invalidUserDetails fault would have been thrown in the fault handler for this scope. The process would follow the same flow as outlined above, apart from the fact that the compensation handler for the CreateUser scope would not have been triggered as the scope never completed.

Asynchronous considerations

As we pointed out earlier, asynchronous services don't explicitly support the concept of faults, so it's worth examining how we would manage the above scenario if all the messaging interactions were asynchronous.

As we covered earlier, an asynchronous version of the CreditCard service would require two callbacks, creditCardVerified and invalidCreditCard, which would be the equivalent of our fault in the synchronous example.

Within our VerifyCreditCard scope, after the invoke activity, instead of having a receive activity to receive the callback we would need a pick activity with two onMessage branches, one for each callback. The branch for invalidCreditCard would be the equivalent of our synchronous fault handler above, and would contain the same activities as its synchronous equivalent.

We would still have the fault handler defined for our process, which would catch the fault thrown by our onMessage branch for invalidCreditCard.

The activities of this fault handler would be similar to the fault handler in our synchronous version. We would still call the compensate activity, but rather than use the reply activity to return a fault, we would now use the invoke activity to invoke the appropriate callback to signal invalid user details.
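The invoke and pick described above might be laid out as in the following BPEL 1.1 sketch. The port types and operations are those of the asynchronous WSDL shown earlier; the partner link name, the variable names, and the namespace bound to client are hypothetical.

```xml
<sequence xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/"
          xmlns:cc="http://xmlns.packtpub.com/obay/CardServices"
          xmlns:client="http://xmlns.packtpub.com/obay/UserRegistration">
  <!-- one-way request; the result arrives later as a callback -->
  <invoke name="VerifyCreditCard" partnerLink="CreditCardService"
          portType="cc:CardServices" operation="verifyCreditCard"
          inputVariable="verifyCreditCardInput"/>
  <pick name="AwaitCardResult">
    <!-- success callback: carry on with the process -->
    <onMessage partnerLink="CreditCardService"
               portType="cc:CardServicesCallback"
               operation="creditCardVerified" variable="cardVerified">
      <empty name="CardAccepted"/>
    </onMessage>
    <!-- callback standing in for the synchronous fault -->
    <onMessage partnerLink="CreditCardService"
               portType="cc:CardServicesCallback"
               operation="invalidCreditCard" variable="invalidCard">
      <throw name="ThrowInvalidUserDetails"
             faultName="client:invalidUserDetails"
             faultVariable="invalidUserDetailsFault"/>
    </onMessage>
  </pick>
</sequence>
```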

Using the fault management framework

The Oracle SOA Suite provides a sophisticated framework for handling invocation faults within BPEL PM. Without the framework, when a BPEL process invokes a service, if an error occurs the fault is returned to the BPEL process to handle.

This is fine for a business fault as we need to handle it in a way appropriate to the business process as covered above.

But for system faults, such as network problems resulting in a service becoming temporarily unavailable, implementing the handling of this at the process level can be protracted, often requiring the same fragments of BPEL to be implemented in every process.

For managing these types of faults, BPEL PM provides a fault handling framework which allows us to define policies for handling faults which occur when a BPEL process executes an invoke activity.

When a fault occurs the framework intercepts the fault before it is returned to the BPEL process. It then attempts to identify an appropriate fault policy to handle the fault. If it finds one the policy is executed and, assuming the fault is resolved, the BPEL process continues as if nothing happened.

A policy consists of two basic components, the faults that you wish to catch and, once caught, the actions you wish to take, such as retry the service or perform manual recovery actions from the BPEL Console.

Defining a fault policy

Re-examine the UserRegistration process at the point that it invokes the credit card service to verify the user's card details. Apart from the business faults that could be returned, it could also return a system fault such as the following:

<soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
           xmlns:tns="http://schema.packtpub.com/obay/flt">
  <soap:Fault>
    <faultcode>tns:TransportFault</faultcode>
    <faultstring>Transport Run Time Error</faultstring>
    <detail>
      <tns:fault>
        <tns:code>380002</tns:code>
        <tns:summary>Connection Error</tns:summary>
        <tns:detail>
          …
        </tns:detail>
      </tns:fault>
    </detail>
  </soap:Fault>
</soap:Body>

This fault indicates that the service could not be called because of a transport problem; the code 380002 suggests that this is probably due to a temporary problem. For this kind of scenario, we can define a fault policy to catch this error and retry the service.

The outline of the fault policy for our credit card service is shown as follows:

<?xml version="1.0" encoding="UTF-8"?>
<faultPolicy version="2.0.1" id="ValidateCreditCardPolicy" 
xmlns:env="http://schemas.xmlsoap.org/soap/envelope/" 
xmlns="http://schemas.oracle.com/bpel/faultpolicy">
 <Conditions>
 …
 </Conditions>
 <Actions>
 …
 </Actions>
</faultPolicy>

From this we can see that a fault policy consists of the top level element faultPolicy, with the attribute id, which is used to uniquely identify our policy (as we can define multiple policies per BPEL domain).

The policy itself is divided into two sections—the Conditions section, which defines the faults we wish to handle, and the Actions section, which defines the actions to take in order to recover from the fault.

Each fault policy should be defined in a separate XML file, which should be placed in the directory:

<SOA_HOME>\bpel\domains\<domain_name>\config\fault-policies

Here, <domain_name> corresponds to the domain to which the policy applies. The file can have any name, but for clarity we recommend naming it after the fault policy ID, so ValidateCreditCardPolicy.xml in our scenario.

Defining fault policy conditions

The first section of a fault policy defines the conditions that we wish to handle and contains a list of one or more faultName elements that we want our policy to handle. For the above example, we could define the following:

<Conditions>
  <faultName xmlns:tns="http://schema.packtpub.com/obay/faults"
             name="tns:TransportFault">
    <condition>
      <test>$fault.payload/tns:fault/tns:code="380002"</test>
      <action ref="ora-retry"/>
    </condition>
    <condition>
      <action ref="ora-human-intervention"/>
    </condition>
  </faultName>
  <faultName>
    <condition>
      <action ref="ora-human-intervention"/>
    </condition>
  </faultName>
  …
</Conditions>

Specifying the <faultName>

A faultName element is used to define a specific fault which we wish to handle; it contains a single attribute name, which specifies the fault code (i.e. tns:TransportFault in the previous example) of the fault to handle.

Note, a faultcode is defined as a QName type, which has the format prefix:faultName.

Here, prefix maps to a namespace. So within the faultName element we need to define the namespace to which the prefix is mapped, otherwise we won't get a match.

We can also specify a faultName element without a name attribute, which will match all faults. This allows us to define a generic catch-all policy for any fault not handled by a more specific policy.

Specifying the <condition>

The faultName element defines one or more conditions; each condition consists of an optional test element and an action reference.

The test element allows us to specify an XPath expression which is evaluated against the content of the fault; if the XPath expression evaluates to true, the condition is considered a match and the action referenced within the action element will be executed.

Otherwise the fault management framework will look to evaluate the next condition, and so on until it finds a match. A condition without a test element will always return a match.

In the above policy, we have defined the following test for our first condition:

<test>$fault.payload/tns:fault/tns:code="380002"</test>

Here the variable $fault.payload maps to the root element within the payload part of our SOAP fault, so the above test is the equivalent of:

/soap:Fault/detail/tns:fault/tns:code="380002"

For the fault in our example this will evaluate to true, so the Fault Management Framework would execute the action ora-retry. If tns:code contained some other value, the framework would move on to the next condition; as this doesn't include a test element, it will always match, and the ora-human-intervention action would be executed.

The message element for some faults, including the extension faults defined by BPEL PM, contains multiple parts, for example code, summary, and detail. To evaluate the content of any of these parts, just append the part name to $fault., so to check the content of the code part you would specify $fault.code.
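As an illustration, a condition that tests the code part of a multi-part extension fault might be sketched as follows. The choice of bpelx:remoteFault as the fault and 380002 as the value is an assumption made for this example, and the exact expression needed to reach the value inside a part may vary between releases, so treat the test expression as a starting point rather than a definitive form.

```xml
<faultName xmlns:bpelx="http://schemas.oracle.com/bpel/extension"
           name="bpelx:remoteFault">
  <condition>
    <!-- $fault.code addresses the code part of the fault message -->
    <test>$fault.code="380002"</test>
    <action ref="ora-retry"/>
  </condition>
  <condition>
    <!-- no test element: matches any other remoteFault -->
    <action ref="ora-human-intervention"/>
  </condition>
</faultName>
```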

Defining fault policy actions

The second part of our fault policy defines the actions referenced in the Conditions section; this consists of an Actions element, which contains one or more Action elements.

Each Action element contains an id attribute, which is the value referenced by the action ref attribute within the conditions. For the conditions defined in the above policy, we have defined two actions: ora-retry and ora-human-intervention as shown here:

<Actions>
  <Action id="ora-retry">
    <retry>
      <retryCount>5</retryCount>
      <retryInterval>15</retryInterval>
      <exponentialBackoff/>
      <retryFailureAction ref="ora-human-intervention"/>
    </retry>
  </Action>
  <Action id="ora-human-intervention">
    <humanIntervention/>
  </Action>
</Actions>

The content of the Action element is used to specify and configure the actual action to be executed by the fault management framework, which can be one of retry, humanIntervention, rethrowFault, abort, replayScope, or javaAction.

Retry action

The retry action instructs the fault management framework to retry a failed service invocation until it is successful or it reaches a specified limit. In the above example we have specified that we will retry the service five times; if after this the invoke is still failing we have specified that we want to invoke the ora-human-intervention action.

If we didn't specify this, the fault management framework would just re-throw the fault to the BPEL process for it to handle.

The retry action takes a number of parameters that allow us to configure how it behaves; these are defined as follows:

  • retryCount: Specifies the maximum number of retries before this action completes with a failure status.
  • retryInterval: Specifies the period in seconds between retries.
  • exponentialBackoff: An optional element which takes no parameters. When specified, each time a retry fails the interval before the next retry is double the previous interval. In the above example the first retry would occur after 15 seconds, the second 30 seconds after that, the third 60 seconds after that, and so on.
  • retrySuccessAction: An optional element with a single attribute ref. This references another action to be taken upon successful retry of a service. This should only be used to reference a Java action (see below), which we can use to generate an alert.
  • retryFailureAction: An optional element with a single attribute ref, that allows you to define the action to be carried out should all retries fail.

For scenarios where the interaction between a BPEL process and its client is synchronous, we should only use small retry periods. This is because we are suspending the BPEL process between retries; thus if the retry period is too long then the client which invoked the BPEL process could time out while waiting for a response.
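For a synchronous caller, a retry action along the following lines might be more appropriate. The action id and the specific values are hypothetical; the idea is simply to keep the total wait (here two retries, two seconds apart) well inside the client's timeout and then hand the fault back to the process.

```xml
<Action id="ora-retry-sync">
  <retry>
    <!-- only a couple of quick attempts, with no exponential backoff -->
    <retryCount>2</retryCount>
    <retryInterval>2</retryInterval>
    <!-- give the fault back to the BPEL process so that it can
         return a fault to its client -->
    <retryFailureAction ref="ora-rethrow-fault"/>
  </retry>
</Action>
```

This sketch assumes that an action with the id ora-rethrow-fault, wrapping a rethrowFault element, is defined in the same policy.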

Human intervention action

For errors which are more permanent, the humanIntervention action gives us the ability to suspend a process; once suspended we can log into the BPEL console to manually handle the fault.

From within the console we can perform a number of actions. These include manually retrying the service, with the option of modifying the input payload in case this is causing the error, or in the event that the service can't be called, we can get the process to skip the invoke activity and manually create the output that should have been returned by the service. We look at how we can do this as well as other possible options in the next section.

Because we are pausing the BPEL process, we should only use this action if the interaction between the BPEL process and its client is asynchronous; otherwise the client will time out while waiting for the problem to be resolved.

Re-throw action

For errors that we don't want handled by the fault management framework, we can use the rethrowFault action to re-throw the fault to our BPEL Process.

This is often useful when we have defined a generic fault handler to catch all faults, but want to exclude certain faults. For example, if we look at the fault policy defined above, the final handler within our conditions section is defined as follows:

<faultName>
 <condition>
 <action ref="ora-human-intervention"/>
 </condition>
</faultName>

This will catch all faults that have not yet been handled. This is exactly what we want for any unknown system faults; however, we want business faults to be explicitly handled by our BPEL process.

The re-throw action allows us to do just this; we can define a fault handler that catches our business faults, such as the following:

<faultName xmlns:tns="http://xmlns.packtpub.com/obay/CardServices"
           name="tns:invalidCreditCard">
  <condition>
    <action ref="ora-rethrow-fault"/>
  </condition>
</faultName>

This will then invoke the following action:

<Action id="ora-rethrow-fault">
 <rethrowFault/>
</Action>

This will re-throw the fault to our BPEL process.

Abort action

This action causes the BPEL process to terminate. It's the equivalent of executing a terminate activity directly within the BPEL process. Note, in both cases no response is returned to the client of the BPEL Process.

An abort action takes no parameters and is defined as follows:

<Action id="ora-terminate">
 <abort/>
</Action>

Replay scope action

This action causes the fault management framework to return a replay fault to the BPEL process. This fault will be automatically caught by the scope in which the fault is thrown and trigger the BPEL engine to re-execute the scope from the beginning.

A replay scope action takes no parameters and is defined as follows:

<Action id="ora-replay-scope">
 <replayScope/>
</Action>

Java action

This enables us to call out to a custom Java class as part of the process of handling the fault. The class must implement the interface IFaultRecoveryJavaClass, which defines two methods:

public void handleRetrySuccess(IFaultRecoveryContext ctx);
public String handleBPELFault(IFaultRecoveryContext ctx);

The method handleRetrySuccess is called after a successful retry of an invocation; otherwise handleBPELFault is called.

The class is not intended to handle a fault, but more for generating alerts, and so on. For example, you could use invocation of the method handleBPELFault to generate a notification that there is a problem with a particular endpoint, and likewise use the invocation of the method handleRetrySuccess to generate a notification that the problem with the endpoint has now been resolved.

The method handleBPELFault also returns a string value, which can be mapped to the next action to be invoked by the framework. For example, we could define the following javaAction:

<Action id="ora-java">
 <javaAction className="mypackage.myClass"
   defaultAction="ora-human-intervention">
  <returnValue value="RETRY" ref="ora-retry"/>
  <returnValue value="MANUAL" ref="ora-human-intervention"/>
 </javaAction>
</Action>

The javaAction element takes two attributes: className which specifies the Java class to be invoked and defaultAction which specifies the default action to be executed upon completion of the Java action.

Within the javaAction element we can specify zero, one or more returnValue elements, each of which maps a value returned by handleBPELFault to a corresponding follow-up action to be executed by the fault management framework.

In the previous example, we have specified that for a return value of RETRY the framework should execute the ora-retry action, and that if a value of MANUAL is returned, then it should execute the ora-human-intervention action.

If no mapping is found for the return value, then the defaultAction specified as part of the javaAction is executed. This gives us the flexibility to calculate how we wish to handle a particular fault at run time.
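The interplay between the Java class and the returnValue mappings can be sketched in plain Java. This is only an illustration under stated assumptions: the two interfaces are local stand-ins declared so that the example compiles on its own (in a real project they come from the fault policy library), EndpointAlertHandler, JavaActionMapping and the retry counter are hypothetical, and the sketch assumes the same handler instance sees successive faults, which may not hold in the real framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local stand-ins for the framework types (assumption: declared here only
// so the sketch is self-contained; no context methods are modelled).
interface IFaultRecoveryContext { }

interface IFaultRecoveryJavaClass {
    void handleRetrySuccess(IFaultRecoveryContext ctx);
    String handleBPELFault(IFaultRecoveryContext ctx);
}

// Hypothetical handler: raises a notification for each fault and asks for a
// retry until maxRetries faults have been seen, then asks for manual handling.
class EndpointAlertHandler implements IFaultRecoveryJavaClass {
    private final int maxRetries;
    private int faultsSeen = 0;

    EndpointAlertHandler(int maxRetries) { this.maxRetries = maxRetries; }

    @Override
    public void handleRetrySuccess(IFaultRecoveryContext ctx) {
        System.out.println("Notification: endpoint problem resolved");
        faultsSeen = 0;
    }

    @Override
    public String handleBPELFault(IFaultRecoveryContext ctx) {
        faultsSeen++;
        System.out.println("Notification: endpoint fault number " + faultsSeen);
        return faultsSeen <= maxRetries ? "RETRY" : "MANUAL";
    }
}

// Models how returnValue elements map a returned string to a follow-up
// action, falling back to defaultAction when no mapping matches.
class JavaActionMapping {
    private final Map<String, String> returnValues = new LinkedHashMap<>();
    private final String defaultAction;

    JavaActionMapping(String defaultAction) { this.defaultAction = defaultAction; }

    JavaActionMapping map(String value, String actionRef) {
        returnValues.put(value, actionRef);
        return this;
    }

    String nextAction(String returned) {
        return returnValues.getOrDefault(returned, defaultAction);
    }
}

class JavaActionDemo {
    public static void main(String[] args) {
        JavaActionMapping mapping = new JavaActionMapping("ora-human-intervention")
                .map("RETRY", "ora-retry")
                .map("MANUAL", "ora-human-intervention");
        EndpointAlertHandler handler = new EndpointAlertHandler(1);
        System.out.println(mapping.nextAction(handler.handleBPELFault(null)));
        System.out.println(mapping.nextAction(handler.handleBPELFault(null)));
    }
}
```

Keeping the decision logic in the class and the action names in the policy file means the follow-up actions can be changed without recompiling the Java code.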

Binding fault policies

To put a fault policy into operation, we need to specify the invoke activities within a BPEL process to which the fault policy is to be applied; this is known as binding.

We can bind fault policies to partner links, port types, or processes, and we can define these bindings either for a specific process or for an entire domain.

Binding fault policies at the process level

Bindings are configured for an individual process by including a faultPolicyBindings section in its bpel.xml file. This is defined after the partnerLinkBindings section, as shown:

<?xml version = '1.0' encoding = 'UTF-8'?>
<BPELSuitcase>
 <BPELProcess id="UserRegistration" src="UserRegistration.bpel">
 <partnerLinkBindings>
 …
 <partnerLinkBinding name="CreditCardServices">
 <property name="wsdlLocation">CardServices.wsdl</property>
 </partnerLinkBinding>
 </partnerLinkBindings>
 <faultPolicyBindings>
 <process faultPolicy="UserRegPolicy"/>
 <partnerLink faultPolicy="ValidateCreditCardPolicy"
 xmlns:ccd="http://xmlns.packtpub.com/obay/CardServices">
 <name>CreditCardService</name>
 <portType>ccd:CardServices</portType>
 </partnerLink>
 </faultPolicyBindings>
 </BPELProcess>
</BPELSuitcase>

Defining bindings on the process

The process element is an optional element, which allows us to specify the default fault policy for a BPEL process; it contains a single attribute faultPolicy, which contains the id of the fault policy to be used for the process.

In the previous example we have specified that the UserRegistration process should use UserRegPolicy as its default fault policy.

Defining bindings on the PartnerLink

After the process binding, we can specify one or more partnerLink elements, each of which allows us to bind a fault policy to a specific partner link or port type. Each element contains a single attribute faultPolicy, which contains the id of the fault policy to be used for this binding.

Within the partnerLink elements, we specify zero or more name elements and zero or more portType elements.

The name element should contain the name of a PartnerLink within the process that we wish to bind the fault policy to, while the portType element should contain the name of the service port type as defined in the WSDL file that we wish to bind the fault policy to.

At first glance it may appear that there is little difference between binding the policy to either the partner link name or port type; however, it is quite possible to use the same port type within multiple partner links, each with their own name.

Binding fault policies at the domain level

In addition to specifying our fault policy bindings at the process level, we can also define them at the domain level; we do this by modifying the fault-bindings.xml file for the domain as appropriate. This file can be found at:

<SOA_HOME>\bpel\domains\<domain_name>\config\fault-bindings.xml

Here, <domain_name> corresponds to the domain to which the binding applies.

The structure of this file is the same as the faultPolicyBindings section of the bpel.xml file (and defined in the schema fault-policy-binding.xsd).

When defining bindings for partner links at the domain level, it's less error prone to define them for the port type than for the partner link name. Developers are unlikely to name partner links consistently, whereas the port type will always be the same as long as the same WSDL definition is used to access a particular service.

Binding resolution

At run time when a fault occurs, the fault management framework will attempt to find a condition with a corresponding action that matches the fault.

It does this by first attempting to locate an appropriate fault policy binding, by looking for a binding in the following order:

  • Partner Link name binding in bpel.xml
  • Port type binding in bpel.xml
  • Process binding in bpel.xml
  • Partner Link name binding for the domain
  • Port type binding for the domain
  • Process binding for the domain

Once it finds a binding, it will check the fault policy to find a matching condition, and then execute its corresponding action. If no matching condition is found, it will then move to the next binding level; it will continue this process until either a matching condition is found or all binding levels have been checked.
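The lookup order and the fall-through behaviour described above can be modelled with a short Java sketch. It is a simplified model, not the framework's implementation: a policy is reduced to a map from fault name to action id (real conditions can carry additional tests), and BindingLevel, BindingResolver and the remoteFault entry are names introduced here for illustration.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Binding levels, declared in the lookup order listed above.
enum BindingLevel {
    PROCESS_PARTNER_LINK_NAME,   // partner link name binding in bpel.xml
    PROCESS_PORT_TYPE,           // port type binding in bpel.xml
    PROCESS_DEFAULT,             // process binding in bpel.xml
    DOMAIN_PARTNER_LINK_NAME,    // partner link name binding for the domain
    DOMAIN_PORT_TYPE,            // port type binding for the domain
    DOMAIN_DEFAULT               // process binding for the domain
}

class BindingResolver {
    // Walks the levels in order and returns the action of the first policy
    // that has a condition for the fault; levels without a binding, or whose
    // policy has no matching condition, are skipped.
    static Optional<String> resolve(Map<BindingLevel, Map<String, String>> policies,
                                    String faultName) {
        for (BindingLevel level : BindingLevel.values()) {
            Map<String, String> conditions = policies.get(level);
            if (conditions == null) {
                continue; // no binding at this level
            }
            String action = conditions.get(faultName);
            if (action != null) {
                return Optional.of(action);
            }
        }
        return Optional.empty(); // all levels checked, nothing matched
    }
}

class BindingResolverDemo {
    public static void main(String[] args) {
        Map<BindingLevel, Map<String, String>> policies = new EnumMap<>(BindingLevel.class);

        Map<String, String> partnerLinkPolicy = new HashMap<>();
        partnerLinkPolicy.put("invalidCreditCard", "ora-rethrow-fault");
        policies.put(BindingLevel.PROCESS_PARTNER_LINK_NAME, partnerLinkPolicy);

        Map<String, String> domainPolicy = new HashMap<>();
        domainPolicy.put("remoteFault", "ora-retry");
        policies.put(BindingLevel.DOMAIN_DEFAULT, domainPolicy);

        // The partner link policy has no condition for remoteFault, so the
        // lookup falls through to the domain level.
        System.out.println(BindingResolver.resolve(policies, "remoteFault"));
    }
}
```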

Human intervention in BPEL Console

To manage processes suspended pending human intervention, we need to log into the BPEL Console. Once logged on, click on the Activities tab. This by default will list all activities with a state of pending (that is, awaiting human intervention) as shown in the following screenshot:

If you click on the activity label of a suspended process (for example, verifyCreditCard in the previous screenshot), then the Console will bring up the recovery screen for that activity, as shown:

From here you can carry out any of the standard recovery actions available in the fault management framework, such as retrying the service, re-throwing the exception, aborting the process, or replaying the scope. It also provides the ability to skip the failed invoke by selecting the Continue action.

In addition, we can get the value of any process variable, as in the screenshot, where we've fetched the variable verifyCreditCardInput that contains the message submitted to the failed invoke activity. From here we can also update the content of this or any other process variable.

This gives us a number of options for managing the fault, including changing the input variable and retrying the service or setting the output variable from a service and skipping the invoke activity.

Change the input variable contents and retry

A common cause of failure is invalid input to the service; in the above example we may have provided an invalid security number for the credit card. One way to correct this is to modify our input and retry the service.

To do this, select the variable to update from the Available Variables drop down (verifyCreditCardInput in the previous example) and click Get. This will return the current content of the variable.

Next we can update securityNo to contain the correct value and click Set to update the content of the variable within the BPEL process.

Finally from the Actions available drop down select Retry and then click Recover; this will cause the BPEL process to retry the failed invoke activity with the updated variable and, if successful, to then continue with the remainder of the process.

Set the output variable and continue

Another scenario is that, for whatever reason, we are unable to call the service. For example, in the above scenario the verifyCreditCard service may not be available.

One way around this is to fall back to manual processing, with a service rep phoning the credit card company to verify the card and get an authorisation code. We can then manually update the appropriate variables within our process and continue processing.

To do this, select the variable we want to update from the Available Variables drop down (verifyCreditCardOutput in the previous example) and click Get. This will return the current content of the variable, which in our case will be empty as it has yet to be initialized.

Next we need to enter the XML fragment that it should have contained if the service had been successfully invoked. A simple way to do this is to click Skeleton Value; the console will then generate an empty XML skeleton of the variable into which we can enter the values.

Once we have specified the content of the variable, click Set and this will update the content of the variable within the BPEL process.

Finally from the Actions available drop down select Continue and then click Recover. This will cause the BPEL process to skip the failed invoke activity and continue with the remainder of the process.

This is only useful for synchronous or one-way invocations. This is because if we skip the invoke activity for an asynchronous two-way invocation, then when the process reaches the corresponding receive activity it will end up waiting forever.

Handling faults within the service bus

Before we look at how to handle faults inside a proxy service, it's worth taking a step back to revisit our SOA architecture and the purpose of the Virtual Service layer.

Essentially this layer provides a proxy service based on our canonical model, which is responsible for routing requests to the appropriate application service. In the process, it will validate and transform the input message into the one expected by the application service, and vice versa for the response.

Within our proxy service an error can occur at the validate stage (as discussed in the previous tutorial) in which case the proxy service needs to generate and return an appropriate fault to the client.

In addition when we call out to an external service, either to enrich the input message as part of the transformation or at the route stage, a fault could occur. This could either be a business or a system fault.

A business fault, by our definition, is just another valid response that can be returned by our application service, so the role of the proxy service is to transform that fault from an application specific one, to one defined in the WSDL of the proxy service which it can then return to its client.

In the case of a system fault, one option for the proxy service is to return the fault without modification directly to the client, and let it work out how to handle it.

However, it makes sense to define a standard set of system faults within our architecture that we map all other system faults to. This will simplify the implementation of standardized error handling for such faults across our applications.

With system faults that are temporary in nature, it may be tempting to build in the functionality to retry them; however, as we've already established, we only have a small window in which to resolve the fault before the client times out.

So we need to follow a strategy that avoids multiple layers in our composite application retrying temporary errors, because the role of the virtual service layer is to provide a standardized representation of the underlying service, including faults. As a guideline we will not attempt to retry transient faults within this layer.

One scenario where it makes sense to retry a business service is where it has multiple endpoints. In this scenario, if a call to one endpoint fails, the service bus can be configured to retry an alternate endpoint for the same business service.

Handling faults in synchronous proxy services

The basic strategy for handling faults within the service bus is essentially the same regardless of whether it is a business or system fault. That is to catch the fault, undo any partially completed activities so that we leave the system in a consistent state and map the underlying fault to a "standard" fault, which is then returned to the client.

If we examine the CreditCard service used by the above BPEL process, this is actually a proxy service implemented on the service bus. oBay accepts MasterCard and Visa, and in our scenario each of these card providers supplies its own service for card verification and payment processing.

The role of the CreditCard proxy is to provide a standardized service, independent of card type. It will then route requests to the appropriate service, based on the card being used.

As part of this process, the proxy service will transform the request from the oBay canonical form into the specific format required by the card provider and vice versa for the response.

If during execution of the proxy service an error occurs, the role of the proxy service is to intercept the fault and then map it to a specific type of fault, either a business fault defined by the proxy service or a standard system fault.

Raising an error

When an error occurs, the service bus performs a number of steps. First it will populate the $fault variable with details of the error. Next, if the error was caused by the external service returning a fault, it will update the $body variable to hold the actual fault returned.

For example, if the verifyMasterCard operation returned the following fault:

<env:Body xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
<env:Fault xmlns:mcd="http://xmlns.packtpub.com/MasterCard">
<faultcode>mcd:invalidMasterCard</faultcode>
<faultstring>business exception</faultstring>
<faultactor>cx-fault-actor</faultactor>
<detail>
<invalidCard xmlns="http://xmlns.packtpub.com/MasterCard">
<code>DECLINED</code>
<desc>MasterCard Declined</desc>
</invalidCard>
</detail>
</env:Fault>
</env:Body>

This would be intercepted by the service bus, which would then populate $fault with the following:

<con:fault xmlns:con="http://www.bea.com/wli/sb/context">
<con:errorCode>BEA-380001</con:errorCode>
<con:reason>Internal Server Error</con:reason>
<con:location>
<con:node>RouteToVerifyMasterCard</con:node>
<con:path>response-pipeline</con:path>
</con:location>
</con:fault>

Here errorCode and its corresponding reason provide an indication of the type of error that occurred. Common error codes include:

  • BEA-380001: Indicates an internal server error, including the return of a fault by a SOAP service.
  • BEA-380002: Indicates a connection error, such as the SOAP service not being reachable or available.
  • BEA-382500: Indicates that a service callout returned a SOAP Fault.

We can also see from the content of the location element that the error occurred in the response pipeline of the RouteToVerifyMasterCard node. This information can be useful if we are implementing a more generic error handler at either the pipeline or service level.
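To make the structure of $fault concrete, the following Java sketch parses a fault document of the shape shown above and extracts the error code and location. It is an offline illustration using the JDK's DOM parser, not something that runs inside the service bus; the FaultContext class and its mayCarrySoapFault helper are hypothetical names, and the helper simply encodes the two codes from the list above that are described as involving a SOAP Fault.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical holder for the fields of a $fault document.
class FaultContext {
    static final String CTX_NS = "http://www.bea.com/wli/sb/context";

    final String errorCode;
    final String reason;
    final String node;
    final String path;

    private FaultContext(String errorCode, String reason, String node, String path) {
        this.errorCode = errorCode;
        this.reason = reason;
        this.node = node;
        this.path = path;
    }

    // Parses the XML text of a fault document; unchecked exception on bad input.
    static FaultContext parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            return new FaultContext(text(root, "errorCode"), text(root, "reason"),
                    text(root, "node"), text(root, "path"));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed fault document", e);
        }
    }

    // Returns the trimmed text of the first matching element, or null if absent.
    private static String text(Element root, String localName) {
        NodeList matches = root.getElementsByTagNameNS(CTX_NS, localName);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent().trim();
    }

    // True for the two codes that the list above associates with a SOAP Fault.
    boolean mayCarrySoapFault() {
        return "BEA-380001".equals(errorCode) || "BEA-382500".equals(errorCode);
    }
}

class FaultContextDemo {
    public static void main(String[] args) {
        String xml = "<con:fault xmlns:con=\"http://www.bea.com/wli/sb/context\">"
                + "<con:errorCode>BEA-380001</con:errorCode>"
                + "<con:reason>Internal Server Error</con:reason>"
                + "<con:location><con:node>RouteToVerifyMasterCard</con:node>"
                + "<con:path>response-pipeline</con:path></con:location></con:fault>";
        FaultContext fault = FaultContext.parse(xml);
        System.out.println(fault.errorCode + " at " + fault.node + " / " + fault.path);
    }
}
```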

In addition to populating the $fault variable the $body variable will now contain the actual SOAP Fault returned by the external service.

Finally the service bus will raise an error, which if not handled by the proxy service will result in the service bus returning its own fault to the client of the proxy service.

Defining an error handler

The first step in handling an error is to catch it. Within a proxy service we do this using an error handler, which can be defined at the route, stage, pipeline, or service level.

When the service bus raises an error, it will first look to invoke the error handler on the route node or stage in which the error occurred.

If one isn't defined or the error handler does not handle the error, then the service bus will invoke the error handler for the corresponding pipeline. Again if the error isn't handled at the pipeline level, it will invoke the service level error handler and if not handled at this level then the service bus will return a soapenv:Server fault, with the detail element containing the content of $fault.

A fault is only considered handled if the error handler invokes either a reply or resume action. The reply action will immediately send the content of $body as a response to the client of the proxy service and complete the processing of the proxy, while a resume action will cause the proxy service to continue, with processing resuming at the node following the one on which the error handler is defined.
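The escalation rule can be summarised in a small Java model. This is a sketch of the rule as described here, not of the service bus internals: each handler is reduced to a function from error code to an outcome, a null handler stands for a level with no handler defined, and the Outcome and ErrorHandlerChain names are introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// What an error handler ends up doing with an error.
enum Outcome { REPLY, RESUME, NOT_HANDLED }

class ErrorHandlerChain {
    private static final String[] LEVELS = {"node", "pipeline", "service"};

    // Tries the route node or stage handler, then the pipeline handler, then
    // the service handler. Only REPLY or RESUME counts as handling the error.
    // If no level handles it, the bus returns its own soapenv:Server fault.
    static String dispatch(String errorCode,
                           Function<String, Outcome> nodeHandler,
                           Function<String, Outcome> pipelineHandler,
                           Function<String, Outcome> serviceHandler) {
        List<Function<String, Outcome>> handlers =
                Arrays.asList(nodeHandler, pipelineHandler, serviceHandler);
        for (int i = 0; i < handlers.size(); i++) {
            Function<String, Outcome> handler = handlers.get(i);
            if (handler == null) {
                continue; // no handler defined at this level
            }
            Outcome outcome = handler.apply(errorCode);
            if (outcome != Outcome.NOT_HANDLED) {
                return LEVELS[i] + ":" + outcome;
            }
        }
        return "soapenv:Server";
    }
}

class ErrorHandlerChainDemo {
    public static void main(String[] args) {
        // Route handler only deals with errors that may carry a SOAP Fault.
        Function<String, Outcome> routeHandler = code ->
                (code.equals("BEA-382500") || code.equals("BEA-380001"))
                        ? Outcome.REPLY : Outcome.NOT_HANDLED;
        // Generic service level handler replies to everything else.
        Function<String, Outcome> serviceHandler = code -> Outcome.REPLY;

        System.out.println(ErrorHandlerChain.dispatch("BEA-380001", routeHandler, null, serviceHandler));
        System.out.println(ErrorHandlerChain.dispatch("BEA-380002", routeHandler, null, serviceHandler));
    }
}
```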

For faults returned by external services it makes sense to define our error handler as close to the error as possible, that is, on the route node, as we can handle the error in the context in which it occurred, which simplifies the logic of our error handler.

For more generic errors, such as a connection error (for example, BEA-380002), we can define a higher level error handler at either the pipeline or service level.

In the case of our CreditCard service, this means defining an error handler on the route nodes for each endpoint, to handle errors specific to each service callout, and defining a generic error handler on the service itself.

Adding a route error handler

To define an error handler on a route node, click on it, and select the option Add Route Error Handler as shown in the following screenshot:

This will open the Edit ErrorHandler; Route Node window, where we can configure the error handler. An error handler consists of one or more stages, so first we need to add a stage and name it accordingly (for example, HandleVerfifyMasterCardFault) as in the following example:

The first step within our error handler is to check whether we have received a SOAP Fault or something more generic. To do this we just need to add an If… Then… action, which checks if the value of $fault/ctx:errorCode is either BEA-382500 or BEA-380001.

While the service bus reserves the error BEA-382500 for SOAP Faults, we find that when we return a custom SOAP Fault the service bus raises an error of type BEA-380001. So we check for both error codes to be safe.

Checking the type of SOAP Faults

Next we need to check the SOAP Fault returned (which will be in $body) so that we can handle it appropriately.

If we examine the WSDL for our verifyMasterCard operation, we can see that it could potentially return one of two faults, mcd:declined and mcd:invalid, each of which needs to be mapped to a fault returned by our proxy service.

At first glance this all looks pretty straightforward. We just need to define an If… Then… action, with a branch to test for each type of fault returned and generate the appropriate fault to return.

For example, to test for a fault of type mcd:declined, we could define a branch with a condition such as the following:

$body/soap-env:Fault/faultcode = 'mcd:declined'

However, if we look at faultcode more closely, we can see its type is QName, with a format of prefix:faultName (for example, mcd:declined), where prefix is mapped to a namespace in the soap:Fault element (for example, http://xmlns.packtpub.com/MasterCard).

The issue here is that there is no guarantee that the same prefix will always be used, which could cause our condition to be incorrectly evaluated.

Getting the qualified fault name

To ensure that our test condition is correctly evaluated we need to fully resolve the QName. We can do this using the XQuery function resolve-QName. This takes two parameters: the first contains the QName that we wish to resolve (that is, faultcode), and the second contains an element in which the namespace prefix is defined (that is, soap:Fault). This gives us a function call that looks like the following:

fn:resolve-QName($body/soap:Fault/faultcode, $body/soap:Fault)

As we will need to test this value multiple times, rather than embed this within our if condition, we can use an Assign action to assign it to a variable (for example, $faultcode).

Our modified condition to test for a fault of type mcd:declined, would now look like the following:

$faultcode = '{http://xmlns.packtpub.com/MasterCard}declined'
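The effect of resolving the prefix can be demonstrated outside the service bus with a short Java sketch. It is an analogy rather than the XQuery implementation: it uses the JDK's DOM method lookupNamespaceURI to resolve the prefix against the Fault element, and FaultCodeResolver is a hypothetical helper name. The two sample messages use different prefixes for the same namespace and resolve to the same qualified name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class FaultCodeResolver {
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Resolves the prefix:local text of faultcode against the namespaces in
    // scope on the Fault element, similar in spirit to fn:resolve-QName.
    static QName resolveFaultCode(String soapBodyXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(soapBodyXml.getBytes(StandardCharsets.UTF_8)));
            Element fault = (Element) doc.getElementsByTagNameNS(SOAP_NS, "Fault").item(0);
            Element faultcode = (Element) fault.getElementsByTagName("faultcode").item(0);

            String lexical = faultcode.getTextContent().trim();
            int colon = lexical.indexOf(':');
            String prefix = colon < 0 ? null : lexical.substring(0, colon);
            String localPart = lexical.substring(colon + 1);

            // A null prefix looks up the default namespace, if any.
            String namespace = fault.lookupNamespaceURI(prefix);
            return new QName(namespace == null ? "" : namespace, localPart);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not resolve faultcode", e);
        }
    }

    // Builds a minimal SOAP Body holding a Fault whose faultcode uses the given prefix.
    static String sampleFault(String prefix) {
        return "<env:Body xmlns:env=\"" + SOAP_NS + "\">"
                + "<env:Fault xmlns:" + prefix + "=\"http://xmlns.packtpub.com/MasterCard\">"
                + "<faultcode>" + prefix + ":declined</faultcode>"
                + "<faultstring>business exception</faultstring>"
                + "</env:Fault></env:Body>";
    }

    public static void main(String[] args) {
        // Different prefixes, same namespace: both resolve to the same QName.
        System.out.println(resolveFaultCode(sampleFault("mcd")));
        System.out.println(resolveFaultCode(sampleFault("ns1")));
    }
}
```

Comparing resolved names rather than raw faultcode text is what makes the condition independent of whichever prefix the card provider's service happens to emit.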

We can now define an If… Then… action, with one branch for each fault we want to test for, plus an else branch to cover any unexpected faults.

Creating a SOAP Fault

Once we know the fault returned by the external service, we can generate the appropriate fault to be returned by the proxy service and assign this to the $body variable.

The simplest way to do this is create an Assign action, and for the XQuery Text, directly specify the actual SOAP Fault to be returned, as shown in the following screenshot:

Handling unexpected faults

In the case of unexpected faults, we have two choices: one is to return the fault as is and let the client figure out how to handle it; the other is to return a generic fault indicating that an unexpected error occurred. Typically we would recommend the latter approach, as this will simplify error handling for the client.

It is often prudent to record details of the fault that occurred. For example if it's occurring frequently we may wish to add a specific branch to our error handler to manage a fault of this type, especially if it allows our client to make a more informed choice on how to handle the error.

One way of achieving this is to use the Report action. This takes two parameters: the first is the message we want to report; the second is zero or more name-value pairs, which we can use to search for specific reports.

In the case of our error handler, we have configured it to capture details of the actual fault message, with a single key of the format BusinessService=$outbound/@name (which will evaluate to BusinessService=VerifyMasterCard) as shown in the following screenshot:

At run time this will cause a record containing the specified information as well as additional metadata to be written to the Service Bus Reporting Data Stream. The metadata includes information such as the error code, inbound service name, URI and operation and the outbound service, URI and operation.

By default the service bus is configured to write this data to a reporting data store which can then be queried from the service bus console. To view the report data click on the Operations tab and then click on Message Reports (under Reporting).

This will bring up the Summary of Message Reports, where you can search for report entries against a number of criteria, including date range, inbound service name, error code, and the report key (defined in the Report action). From here you can click on a report entry to view its metadata and the actual message.

The Reporting Stream can be configured to write data to a number of targets including JMS Queues, database, file, and so on.

Returning a SOAP Fault

Once we have populated our $body variable with the appropriate SOAP Fault, the final step is for our proxy service to return it.

We do this by using a Reply action. The key here is to configure it to Reply with Failure, as shown in the following screenshot. This will cause the service bus to generate an HTTP 500 status, indicating a fault.

Once the reply has been sent the processing of the request is completed and no further processing will be done.

This completes the definition of our error handler for our RouteToVerifyMasterCard node, which looks as follows:

If an error other than a SOAP Fault occurs, then this handler will still be invoked, but because we don't handle it (that is, we don't execute a Reply or Resume action), the service bus will look to invoke an error handler at a higher level.

Adding a Service Error Handler

For handling errors other than those caused by SOAP Faults, we typically want to define a generic error handler at the service level. To do this click the Proxy Service icon and select Add Service Error Handler as shown in the following screenshot:

In here we need to create a stage in which to define our error handling logic, as we did for our route node error handler.

For errors which have been raised for a reason other than a SOAP Fault being returned by the external service, we just need to check the error code in $fault so that we can map it to an appropriate system fault.

When generating a system fault, rather than try to map a specific service bus error to a corresponding SOAP Fault, we need to think about how the client may handle the fault. Typically this will be driven by whether it is a permanent or transient fault.

Handling permanent faults

Permanent faults are ones whereby the same submission will continue to cause an error. This could be due to a number of reasons, including invalid security credentials, erroneous data contained within the message payload, or an error within the actual service itself (that is, the request is valid, but for whatever reason the service is unable to process it).

For each type of error, a corresponding error code is defined by the service bus which can be accessed in the $fault variable at run time. These error codes are categorized into the following subsystems: transport, message flow, action, security, and UDDI.

Within our generic service level error handler, we typically want to use an If… Then… action to check which error category the error code falls into and then map it to a corresponding SOAP Fault. This follows a similar approach to the one we used for mapping business services faults to corresponding faults defined by the proxy service.
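The category check can be sketched in Java as a lookup on the numeric part of the error code. Two things here are assumptions and should be verified before use: the numeric ranges per subsystem are taken from memory of the product's error code reference rather than from this text, and the system fault names returned by systemFault (serviceUnavailable, securityFault, internalError) are hypothetical placeholders for whatever standard faults the architecture defines.

```java
class ErrorCategory {
    // Maps an error code such as BEA-380002 to its subsystem.
    // Assumption: the ranges below should be checked against the
    // error code reference for the service bus version in use.
    static String subsystem(String errorCode) {
        if (errorCode == null || !errorCode.startsWith("BEA-")) {
            throw new IllegalArgumentException("Unexpected error code: " + errorCode);
        }
        int n = Integer.parseInt(errorCode.substring("BEA-".length()));
        if (n >= 380000 && n <= 380999) return "transport";
        if (n >= 382000 && n <= 382499) return "message flow";
        if (n >= 382500 && n <= 382999) return "action";
        if (n >= 386000 && n <= 386999) return "security";
        if (n >= 394500 && n <= 394999) return "UDDI";
        return "unknown";
    }

    // Maps a subsystem to one of a handful of standard system faults.
    // The fault names are hypothetical, introduced only for this sketch.
    static String systemFault(String errorCode) {
        switch (subsystem(errorCode)) {
            case "transport":
                return "serviceUnavailable";
            case "security":
                return "securityFault";
            default:
                return "internalError";
        }
    }

    public static void main(String[] args) {
        System.out.println(subsystem("BEA-380002") + " -> " + systemFault("BEA-380002"));
        System.out.println(subsystem("BEA-382500") + " -> " + systemFault("BEA-382500"));
    }
}
```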

Once we have populated our $body variable with the appropriate SOAP Fault, we would then use a Reply action as before to return it to the client.

This ensures that any client of the proxy service will only have to deal with the business faults defined in the WSDL of the service and a handful of pre-defined system faults that any of the proxy services could return.

If we look at a BPEL process, this approach makes it very simple to write a fault policy for managing a small, well-defined set of system faults and within the BPEL process define fault handlers for the known business faults.

Generating alerts

When a permanent fault occurs it may indicate that we have an underlying problem in the system. So in addition to returning a SOAP Fault to the client, we may wish to notify someone of the problem.

One way to do this would be through the Report action we looked at earlier, but in some cases we may have an issue that requires more immediate attention. For example, there may have been an attempted security violation, or there may be an error in the actual logic of a recently deployed proxy service.

For these situations we can use the Alert action to publish an alert to an appropriate destination, which could be a JMS Queue, Email, SNMP Trap, or Reporting Data Stream.

To add an alert, click Add an Action | Reporting | Alert; this will insert an Alert action into our error handler, like the one shown as follows:

To specify the content of the alert, click on <Expression>. This will launch the XQuery Expression Editor, where we can define the alert body as required. We can also specify an optional alert-summary, which is presented according to the destination. For example it will form the subject line for an email notification. If this is left blank, then it defaults to ALSB Alert.

The severity level can take a value of Normal, Warning, Minor, Major, Critical, or Fatal. These don't have specific meanings, so you can attach your own definitions to each of these values; however, when we configure alerting for the proxy service (see below) we can opt to filter out alerts based on their severity level.

To specify the recipient of the alert click on <Destination>. This will launch the Select Alert Destination window, where we can search for and select any previously defined destination. If we don't specify a destination then the alert will be sent to the console.

Destinations are created and configured in the service bus console; this gives us the flexibility to change the actual recipient of the alert at a later point in time, just by re-configuring the destination appropriately.

Enabling alerts

For pipeline alerts to be generated, you must first enable them, otherwise Alert actions will just be skipped during the execution of the proxy service. Alerts need to be enabled in two places, first at the server level and then at the proxy service level.

To enable them globally, click on the Operations tab within the service bus console and then select Global Settings. This will display the Global Settings window. From here ensure the option Enable Pipeline Alerting is checked.

Once enabled globally, we can then specify settings for a proxy service. Select the proxy service and then click on the Operational Settings tab as shown in the following screenshot:

Select the check box for Pipeline Alerting and then, from the Enabling Alerting at drop down, select the level of alerting required. This will suppress the generation of any alerts with a lower severity. So in this example we have enabled alerting at Warning level or above, so any alert actions in the proxy service with a severity level of Normal will be skipped.
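The combined effect of the two switches and the severity threshold can be expressed as a single predicate. The following Java sketch is a model of the rule described above, assuming the severities are ordered from lowest to highest as listed (Normal, Warning, Minor, Major, Critical, Fatal); the Severity and AlertFilter names are introduced for illustration.

```java
// Severity levels, assumed to be in ascending order as listed in the text.
enum Severity { NORMAL, WARNING, MINOR, MAJOR, CRITICAL, FATAL }

class AlertFilter {
    // An Alert action only fires when alerting is enabled both globally and
    // for the proxy service, and its severity is at or above the threshold.
    static boolean isRaised(boolean globallyEnabled, boolean serviceEnabled,
                            Severity threshold, Severity alertSeverity) {
        return globallyEnabled
                && serviceEnabled
                && alertSeverity.compareTo(threshold) >= 0;
    }

    public static void main(String[] args) {
        // With alerting enabled at Warning or above, Normal alerts are skipped.
        for (Severity s : Severity.values()) {
            System.out.println(s + " raised: " + isRaised(true, true, Severity.WARNING, s));
        }
    }
}
```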

Handling transient faults

Transient faults typically manifest themselves as non-responsive URIs (that is, no response is being received from a particular service endpoint), which the service bus indicates with the error code BEA-380002.

In this scenario, we have already established that for a synchronous proxy service there is limited scope to take any corrective action. However, for services which provide multiple endpoints, one option is to retry an alternate endpoint.

Retrying non-responsive business service

A business service allows you to configure multiple endpoints for a service, which it can load balance requests across (using a variety of algorithms). This can be useful when a particular endpoint becomes non-responsive, as we can configure the business service to automatically retry an alternative endpoint.

When we have multiple URIs specified for an endpoint, if the initial call to an endpoint fails, the business service will immediately attempt to invoke an alternate URI. It will continue to do this until either successful, the Retry Count is reached or all online URIs have been tried.

If at this point the retry count has not been reached, the business service will wait for the duration specified by the Retry Iteration Interval before iterating over the endpoints again.

Finally, we need to ensure that we set Retry Application Errors to No; otherwise any SOAP fault returned by the business service will be treated as a failure and will prompt the service bus to retry.

In this example, where we have defined two URIs, if the first call fails then the service bus will immediately call the second URI. If this fails then it will have reached the retry limit and the underlying error will be returned to the proxy service. If the retry count was two, then it would wait 30 seconds before attempting one final retry.
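The retry behaviour in this example can be traced with a small Java simulation. It is a sketch of the rules as described here, not of the service bus itself: it records the sequence of calls and waits instead of performing them, it ignores load balancing and offline URIs, and RetryPlanner and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class RetryPlanner {
    // Returns the sequence of events for one request: the initial call plus
    // up to retryCount retries, moving to the next URI immediately and
    // waiting intervalSeconds only after every URI has been tried once.
    static List<String> plan(List<String> uris, int retryCount, int intervalSeconds,
                             Predicate<String> callSucceeds) {
        if (uris.isEmpty()) {
            throw new IllegalArgumentException("At least one URI is required");
        }
        List<String> events = new ArrayList<>();
        int attemptsLeft = 1 + Math.max(retryCount, 0);
        while (attemptsLeft > 0) {
            for (String uri : uris) {
                if (attemptsLeft == 0) {
                    break; // retry limit reached part way through the list
                }
                events.add("call " + uri);
                attemptsLeft--;
                if (callSucceeds.test(uri)) {
                    events.add("success");
                    return events;
                }
            }
            if (attemptsLeft > 0) {
                events.add("wait " + intervalSeconds + "s");
            }
        }
        events.add("error returned to proxy");
        return events;
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList("uri1", "uri2");
        // Both endpoints down: compare a retry count of one with a count of two.
        System.out.println(plan(uris, 1, 30, uri -> false));
        System.out.println(plan(uris, 2, 30, uri -> false));
    }
}
```

With a retry count of one the trace ends after the second URI has been called; with a count of two it includes a 30 second wait followed by one final call, matching the description above.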

Handling faults in one-way proxy services

The service bus also allows you to define one-way proxy services, where the client issues a request to the service bus and then continues processing without ever receiving a response. This is often referred to as fire and forget.

The approach for handling errors for one-way proxy services is quite different from that of synchronous services. For transient errors, it makes sense to retry the business service until we are successful, as no one is going to time out waiting for a response.

For permanent errors, we can't return a fault to the client and let them resolve it. Rather we need to alert a third party so that they can take some corrective action to resolve the error, and then re-run the request.

One way to do this is to publish an alert notification to a JMS Queue. We could do this directly or go via the alerting mechanism described above. The content of the alert will typically need to contain details of the actual error so that we know what corrective action to perform, the proxy service invoked and its payload, so that we can re-invoke the proxy with the original payload once the issue has been resolved.

Once we've published the alert, we also need to implement something on the other end of the JMS Queue to process it. One approach would be to implement this as a BPEL process, containing a human workflow task to correct the error. Once corrected then the BPEL process could re-invoke the proxy service.

Source

The source of this content is Chapter 13: Handling errors in SOA based systems of Oracle SOA Suite Developer's Guide by Antony Reynolds, Matt Wright (Packt Publishing, 2009).
