I have been working on rounding out the BPMN Interoperability (BPMN-I) spec and tool in the area of data flow, and I am puzzled by a fundamental concept on which the BPMN 2.0 spec and the non-normative “BPMN by Example” document disagree.  I wrote to the experts on the BPMN 2.0 committee but have not heard back, so let me just put it out there and maybe BPMS Watch readers will help sort it out.

The issue has to do with dataInput and dataOutput, elements which are defined for a process, task, or event (but not subProcess).  For a process or task, they are part of the ioSpecification, which defines one or more inputSets and outputSets.  Each inputSet and outputSet references zero or more dataInputs and dataOutputs.  The operational semantics of ioSpecification and inputSet say that a process or task cannot start until the inputSet data is available (unless marked “can start without me”).  In that sense, ioSpecification defines the interface or signature of the process or task, i.e. the data requirements.
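To make the "signature" idea concrete, here is a minimal sketch in Python (my choice of language, not anything the spec prescribes) that reads a task's ioSpecification and lists which dataInputs and dataOutputs each inputSet and outputSet references. The task XML, its ids, and the helper names signature and refs are hypothetical, written only for this illustration; the element names and namespace are those of the BPMN 2.0 schema.

```python
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
NS = {"bpmn": BPMN_NS}

# Hypothetical task, written for this sketch only (not taken from any example file).
TASK_XML = f"""
<task xmlns="{BPMN_NS}" id="reviewTask" name="Review Issue List">
  <ioSpecification>
    <dataInput id="in1" name="Issue list"/>
    <dataOutput id="out1" name="Reviewed list"/>
    <inputSet>
      <dataInputRefs>in1</dataInputRefs>
    </inputSet>
    <outputSet>
      <dataOutputRefs>out1</dataOutputRefs>
    </outputSet>
  </ioSpecification>
</task>
""".strip()


def refs(io_spec, set_tag, ref_tag):
    """Return one list of referenced ids per inputSet/outputSet element."""
    return [
        [(r.text or "").strip() for r in s.findall(f"bpmn:{ref_tag}", NS)]
        for s in io_spec.findall(f"bpmn:{set_tag}", NS)
    ]


def signature(activity):
    """Summarize the data requirements declared by an ioSpecification."""
    io_spec = activity.find("bpmn:ioSpecification", NS)
    if io_spec is None:
        return {"inputSets": [], "outputSets": []}
    return {
        "inputSets": refs(io_spec, "inputSet", "dataInputRefs"),
        "outputSets": refs(io_spec, "outputSet", "dataOutputRefs"),
    }


sig = signature(ET.fromstring(TASK_XML))
print(sig)
```

Nothing here touches instance data; the result is just the declared interface, which is the sense of "signature" used in this post.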

But for a task (including callActivity) or event, dataInput and dataOutput are also used to model the data flow.  A dataInput is the target of a dataInputAssociation, which maps data from a variable (dataObject) or external record (dataStore), and a dataOutput is similarly the source of a dataOutputAssociation.  The spec is explicit that this applies to tasks and events, but is mysterious when it comes to dataInput and dataOutput of a process.  The key question is this:  Can a process dataInput be the source of a dataInputAssociation, and can a process dataOutput be the target of a dataOutputAssociation?  The spec (mostly) says No, but BPMN by Example says Yes.  And what does a process dataInput or dataOutput mean, anyway?  Is it just a signature, like a WSDL portType, or is it actual instance data, like a variable?  Let’s look at both sides of the argument.

1. Just a signature, not actual instance data

The evidence for this comes mainly from the spec.

  • p213.  “Data Inputs MAY have incoming Data Associations.”  [It does not say may have outgoing.]
  • p213.  “If the Data Input is directly contained by the top-level Process, it MUST not be the target of Data Associations within the underlying model. Only Data Inputs that are contained by Activities or Events MAY be the target of Data Associations in the model.”  [This would imply NO data associations can connect to a data input for a process.]
  • p215.  “Data Outputs MAY have outgoing Data Associations.”  [It does not say may have incoming.]
  • p215.  “If the Data Output is directly contained by the top-level Process, it MUST not be the source of Data Associations within the underlying model. Only Data Outputs that are contained by Activities or Events MAY be the target of Data Associations in the model.”  [Again this would imply NO data associations can connect to a data output for a process.]

Subsequent discussion explicitly refers to data inputs and data outputs of activities and events, not processes, except for this:

  • p225.  “In the case of a Start Event, the Data Inputs of the enclosing process are available as targets to the DataOutputAssociations of the Event. This way the Process Data Inputs can be filled using the elements that triggered the Start Event.  In the case of an End Event, the Data Outputs of the enclosing process are available as sources to the DataInputAssociations of the Event. This way the resulting elements of the End Event can use the Process Data Outputs as sources.”  [In other words, a process data input can have incoming data association from a start event, and a process data output can have outgoing data association to an end event.  The purpose of this – it seems to me – would be to allow transformation between request/response message data and the signature defined by the process ioSpecification.]

The spec (p213, 215) only mentions displaying the dataInput or dataOutput shape for a process, not a task or event.  From the above discussion, it would appear that no data association shape should connect to that dataInput or dataOutput shape, with the possible exception of incoming from a startEvent shape and outgoing to an endEvent shape.
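For tool builders, this strict reading could be turned into a simple check. The sketch below is one way to do it in Python, under my interpretation of the passages quoted above: it flags any data association that touches a process-level dataInput or dataOutput, except a startEvent's dataOutputAssociation targeting a process dataInput and an endEvent's dataInputAssociation sourced from a process dataOutput. The process XML, its ids, and the function name violations are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
NS = {"bpmn": BPMN_NS}
ASSOCIATIONS = ("dataInputAssociation", "dataOutputAssociation")

# Hypothetical process, written for this sketch only.
PROCESS_XML = f"""
<process xmlns="{BPMN_NS}" id="sketchProcess">
  <ioSpecification>
    <dataInput id="procIn"/>
    <dataOutput id="procOut"/>
    <inputSet><dataInputRefs>procIn</dataInputRefs></inputSet>
    <outputSet><dataOutputRefs>procOut</dataOutputRefs></outputSet>
  </ioSpecification>
  <startEvent id="start">
    <dataOutput id="startOut"/>
    <dataOutputAssociation>
      <sourceRef>startOut</sourceRef>
      <targetRef>procIn</targetRef>
    </dataOutputAssociation>
    <outputSet><dataOutputRefs>startOut</dataOutputRefs></outputSet>
  </startEvent>
  <task id="reviewTask">
    <ioSpecification>
      <dataInput id="taskIn"/>
      <inputSet><dataInputRefs>taskIn</dataInputRefs></inputSet>
      <outputSet/>
    </ioSpecification>
    <dataInputAssociation>
      <sourceRef>procIn</sourceRef>
      <targetRef>taskIn</targetRef>
    </dataInputAssociation>
  </task>
  <endEvent id="end">
    <dataInput id="endIn"/>
    <dataInputAssociation>
      <sourceRef>procOut</sourceRef>
      <targetRef>endIn</targetRef>
    </dataInputAssociation>
    <inputSet><dataInputRefs>endIn</dataInputRefs></inputSet>
  </endEvent>
</process>
""".strip()


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def violations(process):
    """List (owner id, association kind, ref) triples that the strict reading rejects."""
    io_spec = process.find("bpmn:ioSpecification", NS)
    proc_in, proc_out = set(), set()
    if io_spec is not None:
        proc_in = {e.get("id") for e in io_spec.findall("bpmn:dataInput", NS)}
        proc_out = {e.get("id") for e in io_spec.findall("bpmn:dataOutput", NS)}

    found = []
    for owner in process.iter():
        for assoc in owner:
            kind = local(assoc.tag)
            if kind not in ASSOCIATIONS:
                continue
            sources = [(s.text or "").strip() for s in assoc.findall("bpmn:sourceRef", NS)]
            target = assoc.findtext("bpmn:targetRef", default="", namespaces=NS).strip()
            # The two exceptions described on p225 of the spec.
            start_ok = local(owner.tag) == "startEvent" and kind == "dataOutputAssociation"
            end_ok = local(owner.tag) == "endEvent" and kind == "dataInputAssociation"
            for ref in sources:
                if ref in proc_in or (ref in proc_out and not end_ok):
                    found.append((owner.get("id"), kind, ref))
            if target in proc_out or (target in proc_in and not start_ok):
                found.append((owner.get("id"), kind, target))
    return found


result = violations(ET.fromstring(PROCESS_XML))
print(result)
```

In this made-up model only the task's dataInputAssociation is flagged, because it uses the process dataInput as its source; the start and end event associations fall under the p225 exceptions.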

2. Actual instance data, not just a signature

This alternative interpretation is used in more than one diagram and serialization in BPMN by Example.  It is characterized by using the process dataInput as the source of a dataInputAssociation to a task or event.  Here, for instance, is a clip from the Email Voting example:

You might say that from this diagram, the dataInput belongs to the task Review Issue List not the process (and I would agree with you!), but the serialization provided shows the dataInput defined for the process, and mapped from there to the task dataInput by direct dataInputAssociation (i.e. no intervening dataObject).
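To show what "direct" means here, the following Python sketch contrasts two hypothetical tasks: one whose dataInputAssociation takes a process dataInput as its source, and one that goes through a dataObject. The XML is my own illustration of the two patterns, not a copy of the BPMN by Example serialization, and the ids and the function name source_kinds are made up.

```python
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
NS = {"bpmn": BPMN_NS}

# Hypothetical process, written for this sketch only.
PROCESS_XML = f"""
<process xmlns="{BPMN_NS}" id="votingSketch">
  <ioSpecification>
    <dataInput id="procIssueList" name="Issue List"/>
    <inputSet><dataInputRefs>procIssueList</dataInputRefs></inputSet>
    <outputSet/>
  </ioSpecification>
  <dataObject id="issueListVar" name="Issue List"/>
  <task id="directTask" name="Source is the process dataInput">
    <ioSpecification>
      <dataInput id="directIn"/>
      <inputSet><dataInputRefs>directIn</dataInputRefs></inputSet>
      <outputSet/>
    </ioSpecification>
    <dataInputAssociation>
      <sourceRef>procIssueList</sourceRef>
      <targetRef>directIn</targetRef>
    </dataInputAssociation>
  </task>
  <task id="viaObjectTask" name="Source is a dataObject">
    <ioSpecification>
      <dataInput id="viaIn"/>
      <inputSet><dataInputRefs>viaIn</dataInputRefs></inputSet>
      <outputSet/>
    </ioSpecification>
    <dataInputAssociation>
      <sourceRef>issueListVar</sourceRef>
      <targetRef>viaIn</targetRef>
    </dataInputAssociation>
  </task>
</process>
""".strip()


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def source_kinds(process):
    """For each task dataInputAssociation, report what kind of element its source is."""
    by_id = {e.get("id"): e for e in process.iter() if e.get("id")}
    parent = {child: p for p in process.iter() for child in p}
    rows = []
    for task in process.findall("bpmn:task", NS):
        for assoc in task.findall("bpmn:dataInputAssociation", NS):
            for src in assoc.findall("bpmn:sourceRef", NS):
                ref = (src.text or "").strip()
                element = by_id.get(ref)
                if element is None:
                    kind = "unresolved"
                elif local(element.tag) == "dataInput":
                    # dataInput -> ioSpecification -> owning process or task
                    owner = parent[parent[element]]
                    kind = f"{local(owner.tag)} dataInput"
                else:
                    kind = local(element.tag)
                rows.append((task.get("id"), ref, kind))
    return rows


rows = source_kinds(ET.fromstring(PROCESS_XML))
for row in rows:
    print(row)
```

The first row is the disputed pattern; the second is the form that nobody argues about.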

This would appear to be illegal, according to the statements from the spec quoted above.  The one puzzling statement from the spec in its favor is this one:

  • p221.  “The purpose of retrieving data from Data Objects or Process Data Inputs is to fill the Activities inputs and later push the output values from the execution of the Activity back into Data Objects or Process Data Outputs.”

This is a strong statement, but it is the ONLY mention of using a process dataInput as the source of a dataInputAssociation, while there are numerous statements that suggest this is not allowed.

Why does it matter?

Is it worth quibbling over these details?  If your purpose is to make the diagram understandable to the viewer, maybe we should just agree to live and let live.  The email voting diagram conveys the information clearly.  But if your goal is interoperation between modeling tools, then resolving the issue is important.  The serialization in BPMN by Example is either correct or it is not (and even if legal per the spec, it could be declared interoperable or not by BPMN-I.)  And this is not just an issue for executable processes.  Even the Descriptive subclass (non-executable) contains data objects and data associations, so this serialization issue affects even the most basic model interchange.

So… what do you think?  And why?  Please comment on this post.

By the way, I think a better way to model the start of Email Voting (with data flow) would be something like this:

It shows the source and internal flow of the data more clearly (defining a data input for a process triggered by a timer is inherently confusing, I think), and maps to execution more cleanly as well.  But that’s just me.