OpenCDISC Validation Framework

1. Introduction
2. Configuration File Overview
3. Validation Rules
   3.1 Match Rule
   3.2 Unique Rule
   3.3 Regular Expression Rule
   3.4 Conditional Rule
   3.5 Required-When Rule
   3.6 Lookup Rule
   3.7 Metadata Rule

1. Introduction

The OpenCDISC Validator decouples the definition of validation rules from the application logic. This provides ultimate flexibility to create and maintain any number of validation rule definitions to meet the diverse needs of sponsors, CROs, laboratories, and anyone else involved in collection, storage, and exchange of clinical data. The validation rules are defined using an XML-based validation framework with the following goals:
• Allow for the validation of any CDISC compliant dataset, including SDTM, ADaM, ODM, and LAB
• Support near-SDTM, SDTM+, ODM extensions, and custom datasets
• Provide a portable format for sharing data validation rules between partners
The following sections will describe in more detail how these goals were achieved and show you how to use the OpenCDISC Validation Framework to define validation rules to meet the needs of your organization.

2. Configuration File Overview

The structure of an OpenCDISC Validator configuration file is designed to be simple, yet powerful. Designed as an extension to the define.xml specification, it allows users to combine the metadata definitions included in their study definition with a rich set of validation rules. At a bare minimum, a configuration file resembles the following structure:
<ODM xmlns=""
  xsi:schemaLocation=" resources/schema/validator1-0-0.xsd"
  CreationDateTime="2009-06-09T19:06:27-04:00" ODMVersion="1.2">
  <Study OID="org.opencdisc.validator">
    <GlobalVariables>
      <StudyName>OpenCDISC Validator</StudyName>
      <StudyDescription>OpenCDISC Validator Configuration</StudyDescription>
    </GlobalVariables>
    <MetaDataVersion OID="CDISC.SDTM.3.1.1"
      Name="OpenCDISC Validator Configuration for CDISC SDTM v3.1.1 (Native)"
      Description="OpenCDISC Validator Configuration for CDISC SDTM v3.1.1 (Native)"
      def:StandardName="CDISC SDTM">
      <ItemGroupDef OID="domain">
        <ItemRef ItemOID="variable" Mandatory="mandatory" val:Core="corestatus"/>
        <val:ValidationRuleRef RuleID="ruleid" Active="active"/>
        <def:leaf ID="location" xlink:href="file"/>
      </ItemGroupDef>
      <ItemDef OID="variable" Name="variable" DataType="datatype" Length="length" def:Label="label"/>
      <val:ValidationRules>
        <val:RuleType Rule Attributes.../>
      </val:ValidationRules>
    </MetaDataVersion>
  </Study>
</ODM>
Each ItemGroupDef definition is contained within the MetaDataVersion tag, in accordance with the define.xml and ODM specifications. Each ItemGroupDef is defined with a Name attribute, which the Validator matches against the file name of your domain dataset, without its extension. For example, the ItemGroupDef with Name="DM" is associated with your Demographics dataset, which (if you are using SAS) should be named DM.xpt as specified by the SDTM guidelines. The attributes def:Label and def:Class provide additional information used by the Validator. See the define.xml specification for more information about what values these attributes may take.
Each ItemGroupDef also contains a list of ItemRef (variable reference) and ValidationRuleRef (validation rule reference) elements. Each ItemRef references an ItemDef later in the document, and the two together fully describe the use of the variable in that domain. On the ItemRef element, Mandatory="Yes" indicates that the variable is Required. The Validator then reports an error both if the variable is missing from the source dataset and if it is present but contains null values. As part of the Validator configuration extension, the ItemRef element can also have the val:Core attribute, which takes the values Permissible, Expected, and Required. This provides finer control over how variables are validated, based on the three levels defined in the SDTM implementation guide.
ItemDef elements supplement this information by providing the Name of the variable as it appears in your dataset, along with the DataType, which may be one of the ODM-defined data types float, text, integer, date, or datetime.
Note: As the list of data types is expanded in the future, the Validator will support the new types via this attribute. If you need stricter control than the current set provides, additional type checks may be defined through existing validation rules.
The standard form of the ItemRef and ItemDef tags:
<ItemRef ItemOID="variable" OrderNumber="ordernumber" Mandatory="mandatory" Role="role" val:Core="corestatus"/>
<ItemDef OID="variable" Name="variable" DataType="text" Length="length" def:Label="label"/>
Configuration authors may also reference validation rules in their ItemGroupDef using the validation rule reference tag, ValidationRuleRef. The standard form of the validation rule reference tag:
<val:ValidationRuleRef RuleID="ruleid" Active="active"/>
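As an illustration of how these three tags fit together, the following fragment sketches a Demographics domain with a single Required variable. The OIDs, the label text, the def:Class value, the rule ID EX0001, and the value Active="Yes" are assumptions chosen for this example, not values taken from a shipped configuration.

```xml
<!-- Hypothetical fragment: one domain, one variable, one rule reference -->
<ItemGroupDef OID="DM" Name="DM" def:Label="Demographics" def:Class="Special Purpose">
  <!-- Mandatory="Yes" marks the variable as Required -->
  <ItemRef ItemOID="USUBJID" OrderNumber="1" Mandatory="Yes" val:Core="Required"/>
  <!-- EX0001 is an illustrative rule ID; Active="Yes" is an assumed value -->
  <val:ValidationRuleRef RuleID="EX0001" Active="Yes"/>
</ItemGroupDef>
<ItemDef OID="USUBJID" Name="USUBJID" DataType="text" Length="40"
  def:Label="Unique Subject Identifier"/>
```

The ItemRef points at the ItemDef through the shared OID, so the same ItemDef can be referenced from more than one ItemGroupDef.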
Next, the ValidationRules tag contains all of the validation rule definitions. All rules must be placed inside this tag, and they can then be referenced from any ItemGroupDef using the ValidationRuleRef tag described above. To give a better idea of what is possible with validation rule definitions, each rule type is described below.

3. Validation Rules

The Validation Rule is the cornerstone of the OpenCDISC Validator’s capabilities. To allow for a rich set of validations, there are several different XML tags that are used to invoke particular checks by the validation engine. When defining a Validation Rule, the configuration author may choose from one of the following types:
• Match Rule
• Unique Rule
• Regular Expression Rule
• Conditional Rule
• Required-When Rule
• Lookup Rule
• Metadata Rule
Each type has a distinct tag name and its own set of Attributes (data within the XML tag that provides information about the rule). All rules also share the following common Attributes, which help with report generation and general validation:
• ID – Specifies the rule ID, and must be defined. It allows the rule to be referenced in other areas of the configuration (discussed later in this document) and identifies the rule in the report output.
• Variable – Indicates which variables this rule should apply to. What data is acceptable for this Attribute is rule-specific, so read the related rule explanation for more detail. This Attribute must be defined for most rules.
• Type – The type of issue indicated by data that fails this rule. Acceptable values include Information, Warning, and Error. This Attribute must be defined.
• Severity – The relative importance of a failure against this rule. Acceptable values include Low, Medium, and High. This Attribute must be defined.
• Message – A short sentence summarizing what a failure of this rule indicates. Each rule has a built-in default value, but the defaults are very generic, so it is recommended that you always write your own Message for each new rule you define.
• Description – A more detailed explanation of what was wrong with data that failed this rule. As with the Message Attribute, you should always define your own Description.
• Warn – Indicates whether a message should be printed to the report log if this Validation Rule has to be removed due to irresolvable incompatibilities between the rule and your source data. The default value is No.
Combining these attributes gives the basic form of a Validation Rule XML tag:
<val:RuleTagName ID="id" Variable="variable" Rule-Specific Attributes
	Type="type" Severity="severity" Warn="warn" />

Now we’ll take a look at how to work with the various specific rule types.

3.1 Match Rule

The Match Rule allows a variable’s data to be checked against a set of acceptable terms defined within the rule. It introduces the Terms and Delimiter Attributes which provide the list of acceptable variable values and the character used to separate them in that list, respectively. The default value for Delimiter is a comma (,) if one is not specified. The Terms Attribute is required. The structure of the Match Rule:
<val:Match ID="id" Variable="variable" Terms="terms" Delimiter="delimiter"
	Type="type" Severity="severity" Warn="warn"/>
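For illustration, a Match Rule restricting a variable to a fixed list of codes might look like the sketch below. The rule ID, the term list, and the Message and Description text are assumptions made for this example.

```xml
<!-- Hypothetical rule: SEX must be one of the listed terms -->
<val:Match ID="EX0001" Variable="SEX" Terms="M,F,U" Delimiter=","
    Message="Invalid value for SEX"
    Description="SEX must be one of M, F, or U."
    Type="Error" Severity="High" Warn="No"/>
```

Because the default Delimiter is a comma, the Delimiter attribute could be left out here. It becomes useful when the terms themselves contain commas, in which case another character such as a pipe can be chosen.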

3.2 Unique Rule

The Unique Rule checks for the uniqueness of a given variable’s value across all records in the current data source, optionally grouping by one or more other variable values. For example, the CDISC specification states that the Sequence variable must be unique for a given Subject ID. In this case, the variable analyzed for uniqueness would be the Sequence, grouped by the Subject ID (since different subjects may have the same Sequence number). This is done with the addition of the GroupBy Attribute, which takes a comma-separated list of variable names on which to compare the uniqueness of the data in the variable specified in the Variable Attribute. The structure of the Unique Rule:
<val:Unique ID="id" Variable="variable" GroupBy="variables"
	Type="type" Severity="severity" Warn="warn"/>
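The Sequence example above could be sketched as follows for an Adverse Events dataset. The rule ID and the message text are assumptions made for this example.

```xml
<!-- Hypothetical rule: AESEQ must be unique within each subject -->
<val:Unique ID="EX0002" Variable="AESEQ" GroupBy="USUBJID"
    Message="Duplicate AESEQ"
    Description="AESEQ must be unique for each USUBJID."
    Type="Error" Severity="High" Warn="No"/>
```

Two records with the same AESEQ fail this rule only when they also share the same USUBJID.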

3.3 Regular Expression Rule

The Regular Expression, or Regex, Rule allows a variable’s value to be validated against a pre-defined pattern. For example, you can enforce the length of a variable’s value or specify that it must be alphanumeric. Regular expressions can be quite complex, and how to write them is beyond the scope of this document, but many online resources explain the syntax. The Regular Expression Rule introduces the Test attribute, which contains the regular expression string used to validate the data. If the data does not match the regular expression, then the record will fail that Validation Rule. The structure of the Regex Rule:
<val:Regex ID="id" Variable="variable" Test="regular expression"
	Type="type" Severity="severity" Warn="warn"/>
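As a sketch of the length and alphanumeric checks mentioned above, the rule below accepts values of one to eight letters or digits. The rule ID, the variable, and the message text are assumptions made for this example. This document does not state whether the pattern must match the whole value or only part of it, so the pattern is anchored explicitly with ^ and $ to cover both cases.

```xml
<!-- Hypothetical rule: value must be 1 to 8 alphanumeric characters -->
<val:Regex ID="EX0003" Variable="ARMCD" Test="^[A-Za-z0-9]{1,8}$"
    Message="Invalid ARMCD format"
    Description="ARMCD must be 1 to 8 alphanumeric characters."
    Type="Error" Severity="Medium" Warn="No"/>
```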

3.4 Conditional Rule

The Conditional Rule allows comparisons to be made between variable values in the current record, optionally only if a particular precondition is satisfied. In more general terms, this rule makes sure that one or more variables have their expected values whenever another set of variable values indicates that they should. This is accomplished using an expression language to establish the variable relationships. Mastering this expression language may take some time, but this introduction should get you started.
An expression consists of one or more variables and the conditions that the values of those variables must meet. Conditions may be grouped using parentheses, and are linked logically by either @and or @or. With @and, the conditions on both the left and the right side must be true. With @or, only one of the sides must be true. An expression always states what must be true for the rule to succeed. If the conditions are not met, the current record fails the rule and a message is printed to the log.
To make a condition, you use a variable name, an operator, and either another variable name or a constant value. This takes the following form:
variable operator variable/value
A value may be a word, which must be surrounded by single quotation marks, or a number, which may be written by itself. To specify a null, or blank, value, use an empty pair of single quotation marks (''). If two variables are used, their values from the current record are compared against one another. The comparison is done using the operator, which may be one of the following:
== equal to; the value on the left must match the value on the right for the condition to be true
!= not equal to; the value on the left must not match the value on the right for the condition to be true
@eqic equal to, ignore case; the value on the left must be the same text as the value on the right for the condition to be true, and the case of each letter in that text is not important
@neqic not equal to, ignore case; the value on the left must not be the same text as the value on the right for the condition to be true, and the case of each letter in that text is not important
@gt greater than; the value on the left must be greater than the value on the right for the condition to be true
@gteq greater than or equal to; the value on the left must be greater than or equal to the value on the right for the condition to be true
@lt less than; the value on the left must be less than the value on the right for the condition to be true
@lteq less than or equal to; the value on the left must be less than or equal to the value on the right for the condition to be true
!( ) negation; generates the opposite result (true » false, false » true) of the combined result of the conditions contained in the parentheses
These expressions are applied using the Test and When Attributes that the Conditional Rule introduces. Both Attributes accept an expression string as their value, but only the Test Attribute is required. If the When Attribute is used, that expression will be evaluated first. If the result is true, then the Test expression will be evaluated and its result will determine if the current record passes or fails the Validation Rule. If the When expression evaluates to false, then the record passes. Likewise, if the When Attribute is not used, the result of the Test expression will determine the record’s validity. The structure of the Conditional Rule:
<val:Condition ID="id" Test="test expression" When="when expression"
	Type="type" Severity="severity" Warn="warn"/>

Note: The Conditional Rule is the only rule that does not have the Variable Attribute.
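Two short sketches may help show how Test and When work together. The rule IDs, variable choices, thresholds, and message text are assumptions made for this example.

```xml
<!-- Hypothetical rule: if the event resulted in death, it must be flagged as serious -->
<val:Condition ID="EX0004" Test="AESER == 'Y'" When="AESDTH == 'Y'"
    Message="Fatal event not marked serious"
    Description="AESER must be 'Y' when AESDTH is 'Y'."
    Type="Error" Severity="High" Warn="No"/>

<!-- Hypothetical rule: when AGE is populated, it must fall within a plausible range -->
<val:Condition ID="EX0005" Test="AGE @gteq 0 @and AGE @lteq 120" When="AGE != ''"
    Message="AGE out of range"
    Description="AGE must be between 0 and 120 when it is populated."
    Type="Warning" Severity="Medium" Warn="No"/>
```

In the first rule, records where AESDTH is not 'Y' pass without the Test being evaluated. In the second, the When expression keeps blank AGE values from being reported by this rule.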

3.5 Required-When Rule

The Required-When Rule ensures that a variable has a value when a specific condition is met. The condition is specified in the When Attribute, which uses the same expression language as described in the Conditional Rule documentation. The structure of the Required-When Rule:
<val:Required ID="id" Variable="variable" When="when expression"
	Type="type" Severity="severity" Warn="warn"/>
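For instance, a Required-When Rule could be sketched as follows. The rule ID, the variables, and the message text are assumptions made for this example.

```xml
<!-- Hypothetical rule: a start date is required once a disposition term is recorded -->
<val:Required ID="EX0006" Variable="DSSTDTC" When="DSDECOD != ''"
    Message="Missing DSSTDTC"
    Description="DSSTDTC must be populated when DSDECOD is populated."
    Type="Error" Severity="High" Warn="No"/>
```

Records where DSDECOD is blank do not satisfy the When expression, so this rule does not require a value for DSSTDTC on those records.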

3.6 Lookup Rule

The Lookup Rule allows values from particular variables to be validated against a set of values from an alternative source. This source may be another data source in the validation set, or an external file that can be read by one of the provided data sources. The rule tag is set up slightly differently for each of these kinds of sources, so they are covered individually below.

Cross-Data Source Lookup

The most direct of the available Lookups, the cross-data source lookup compares records across multiple files in your source list. For instance, the CDISC specification contains rules where records in one Domain must have links on certain key variables to another Domain. This is the setup used to satisfy those kinds of conditions.
To set up this lookup, the From Attribute is used to specify the name of the domain that values in the current domain will be compared against. Where possible, it is best to do the lookup against the smaller of the Domains, for performance reasons. That is, the value of From should be the dataset with the smaller number of records. Also note that the Validator attempts to guess a reasonable order for processing the datasets, so that as much of the lookup information as possible can be cached while validating the relevant source.
The Variable Attribute is used to specify which variables should be used for comparison between the two datasets, as a comma-separated list of remote / local variable pairs. Each pair takes the form REMOTE-NAME = LOCAL-NAME, and must be in this form even if the remote and local names are identical. If REMOTE-NAME is defined by a variable in the local dataset, you can have the rule resolve it automatically using the syntax [LOCAL-NAME-REF], where LOCAL-NAME-REF is a variable that defines the name of the variable in the remote dataset (e.g. the IDVAR variable of any given Supplemental Qualifiers domain in the SDTM definition). This [LOCAL-NAME-REF] syntax may also be used in the From attribute. Additionally, the From attribute may take one of the class identifiers, in which case the rule is applied to all datasets of that class. These classes are Findings, Suppqual, Events, and Interventions.
Finally, the Where Attribute behaves exactly like the Variable Attribute, except that it has a literal value on the right-hand side instead of a local variable name. This allows you to narrow the scope of the lookup. Currently, multiple Where conditions are combined only in the manner of the database "AND" syntax. Future implementations will expand the acceptable syntax for this attribute to allow for more complete control. The structure of the cross-data source Lookup Rule:
<val:Lookup ID="id" Variable="remote-variable = local-variable" From="dataset name"
        Where="remote-variable = literal-value" When="when expression"
	Type="type" Severity="severity" Warn="warn" />
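The two sketches below illustrate a plain lookup and one that uses the [LOCAL-NAME-REF] syntax. The rule IDs and message text are assumptions made for this example, as is the assumption that Where and When may be omitted when they are not needed.

```xml
<!-- Hypothetical rule: every subject in the current domain must exist in DM -->
<val:Lookup ID="EX0007" Variable="USUBJID = USUBJID" From="DM"
    Message="Subject not found in DM"
    Description="USUBJID must match a record in the DM domain."
    Type="Error" Severity="High" Warn="No"/>

<!-- Hypothetical rule for a Supplemental Qualifiers dataset: the remote dataset
     is named by RDOMAIN and the remote key variable is named by IDVAR -->
<val:Lookup ID="EX0008" Variable="USUBJID = USUBJID, [IDVAR] = IDVARVAL"
    From="[RDOMAIN]" When="IDVAR != ''"
    Message="No matching parent record"
    Description="Each qualifier record must link to a record in its parent domain."
    Type="Error" Severity="High" Warn="No"/>
```

In the first rule, DM is used as the From dataset because it normally holds one record per subject and is therefore the smaller of the two domains.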

External File Lookup

The format of the external file lookup is nearly identical to that of the cross-data source one. The only difference is that instead of using the From Attribute to specify the dataset name, a file path is used instead with a prefix that tells the Validator how to parse it. This path may be relative to the directory where the OpenCDISC Validator is running, or it may be an absolute path. The prefix takes the form FILE:Type:Path, where Type may be CSV, TAB (tab-delimited text), PIPE (pipe-delimited text), or XPT (SAS). The structure of the external file Lookup Rule:
<val:Lookup ID="id" Variable="remote-variable = local-variable" From="FILE:Type:Path"
        Where="where-expression" When="when expression"
	Type="type" Severity="severity" Warn="warn"/>
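As a sketch, an external lookup against a comma-separated file might be written as follows. The file path, the CODE column name, the rule ID, and the message text are all hypothetical, and the example assumes that Where and When may be omitted.

```xml
<!-- Hypothetical rule: COUNTRY must appear in the CODE column of a CSV file.
     The path is relative to the directory where the Validator is running. -->
<val:Lookup ID="EX0009" Variable="CODE = COUNTRY"
    From="FILE:CSV:config/data/countries.csv"
    Message="Unknown COUNTRY code"
    Description="COUNTRY must match a code listed in the external country file."
    Type="Warning" Severity="Medium" Warn="Yes"/>
```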

3.7 Metadata Rule

The Metadata Rule checks the metadata of other datasets based on the data in the local dataset. It functions similarly to the Lookup Rule, but only checks the definition of the dataset, not its contents. Apart from the Conditional Rule, which has no Variable Attribute at all, this is the only Validation Rule that does not require the Variable attribute in all cases. For instance, you may only want to supply the From attribute, as in the case of SDTM validation rule IR4510. This rule naturally supports the [LOCAL-NAME-REF] syntax, as this is the most common use for this kind of validation. The structure of the Metadata Rule:
<val:Metadata ID="id" Variable="remote-variable" From="FILE:Type:Path"
        When="when expression"
	Type="type" Severity="severity" Warn="warn"/>
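A possible use of the [LOCAL-NAME-REF] syntax with this rule is sketched below for a Supplemental Qualifiers dataset. The rule ID and message text are assumptions made for this example, and the exact resolution behavior should be confirmed against your Validator version.

```xml
<!-- Hypothetical rule: the variable named in IDVAR must be defined
     in the dataset named in RDOMAIN -->
<val:Metadata ID="EX0010" Variable="[IDVAR]" From="[RDOMAIN]"
    When="IDVAR != ''"
    Message="IDVAR not found in parent domain"
    Description="The variable named in IDVAR must exist in the domain named in RDOMAIN."
    Type="Error" Severity="High" Warn="No"/>
```

Only the structure of the remote dataset is inspected here. Whether a matching record exists is a question for the Lookup Rule.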

There are some shortcuts that configuration authors may use while writing rules as well. For instance, in Attributes that accept variable names, you can insert a double-underscore (__) to automatically insert the domain code of the current dataset being validated, or the remote dataset for Attributes that only deal with remote variable names.
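For example, the double-underscore shortcut lets a single rule definition serve several domains. In the sketch below, the rule ID and message text are assumptions made for this example.

```xml
<!-- Hypothetical rule: __SEQ becomes AESEQ when validating AE, CMSEQ when validating CM -->
<val:Unique ID="EX0011" Variable="__SEQ" GroupBy="USUBJID"
    Message="Duplicate sequence number"
    Description="The sequence variable must be unique for each USUBJID."
    Type="Error" Severity="High" Warn="No"/>
```

The rule is written once inside ValidationRules and then referenced from each ItemGroupDef that needs it.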