Content migration

Migrating content from legacy XML environments to standards such as DITA and S1000D can be a challenging experience!

Mekon provide a range of customised conversion solutions to support customer migration of:

  • HTML to XML, XML to DITA XML using XSLT.
  • Unstructured source content to DocBook, S1000D and DITA.
  • Unstructured MS Word, FrameMaker, MadCap Flare to other forms.
  • SGML to XML.

To achieve this, Mekon follow a customised route to meet each client’s needs, for example when migrating from a legacy XML format such as AuthorIT or DocBook to DITA. This differs from using standard automated conversion tools, which often produce incomplete content that requires lengthy and costly rework and testing.

The Mekon approach ensures that we fully understand the customer’s needs before conversion commences, and that we deliver converted content of consistent quality. We achieve this by mapping structures in the legacy content to equivalent structures in the new DITA content, using rules defined with the customer and encapsulated in a client information model. Clients can start authoring with the converted DITA content immediately, and avoid costly rework to add back structure and rules that an automated tool is unable to model.

This approach also enables Mekon to identify reusable content structures in the legacy content and build them into the new DITA content. This saves clients a great deal of time and effort: new DITA content can be created from the existing reusable elements as soon as the migration is completed, improving the efficiency of content development and delivery as well as the authoring experience.

Mekon plan the migration by first evaluating how content will be restructured to use the client’s defined information architecture. We recommend a detailed upfront analysis of the customer content so that the conversion and migration methodology can be carefully assessed.

 

 

Typical Phases in a Content Migration Project

A successful migration of content from a legacy environment to structured content involves several phases.  These are typically not undertaken sequentially but tend to be done iteratively.  Only the final phase, the migration itself, is undertaken after all the other steps have been completed.

Phase 1 – Building and testing an information model

The first phase involves Mekon specialists building an information model with the customer team. This enables clear target structure mappings to be defined between the customer’s existing legacy content and the equivalent target DITA structures. Major areas to cover include:

  • Map structure: Structuring your DITA (book)maps consistently so that the stylesheets work and other authors can reuse your work efficiently.
  • Topic types: The types of information chunk that have been defined to provide consistent, clear information for users as well as easier processing and management.
  • Element usage: Using important types of element in a consistent and easy-to-process way.
  • Object names and metadata: Naming and tagging content appropriately.
  • Image formats and usage: The file formats and expected dimensions of images to be used.
  • Reusing content: Using reusable modules and snippets of content for consistency, accuracy, and maintainability.
  • Managing variability: Using advanced techniques to manage variation with minimal impact on users or authors.

 

From the information model, we help the client develop a definitive set of test content. A mocked-up, marked-up target content model is produced, and a specification is defined for mapping between the legacy content structures and the new structures. The key output from this phase is a specification for mapping legacy source structures to target DITA in a way that maintains reuse patterns and creates idiomatic content with appropriate metadata.

Phase 2 – Set up and test publishing process

Customisation of the publishing toolchain is carried out after the draft information model has been created but before the model is finally locked down, as publishing considerations may highlight changes needed to the model. The target is a comprehensive and representative set of test content that includes all relevant mapped structures, i.e. every topic type, element and image format, and the different sequences of those items where the sequence could make a difference to the output. This makes repeatable, consistent testing possible for the migration script(s) developed later in the conversion process.
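One way to check that a test set really is comprehensive is to compare the element names it contains against the structures listed in the mapping specification. The following is a minimal sketch of that idea, not a description of Mekon's tooling: the function name `uncovered_structures` and the sample element list are assumptions made for illustration, and it checks element names only, not sequences.

```python
import xml.etree.ElementTree as ET


def uncovered_structures(required_elements, test_documents):
    """Return the required element names that no test document contains.

    required_elements: iterable of target element names taken from the
        mapping specification (hypothetical input for this sketch).
    test_documents: iterable of XML strings making up the test content.
    """
    seen = set()
    for xml_text in test_documents:
        # iter() walks the root element and every descendant.
        for element in ET.fromstring(xml_text).iter():
            seen.add(element.tag)
    return sorted(set(required_elements) - seen)


# Illustrative usage with a single, very small test topic.
required = ["task", "steps", "note", "image"]
test_set = [
    "<task><title>T</title><taskbody><steps>"
    "<step><cmd>Do it.</cmd></step></steps></taskbody></task>"
]
missing = uncovered_structures(required, test_set)
```

Here `missing` lists the structures that still need a test case, so the test set can be extended before the migration scripts are developed against it.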

Phase 3 – Add contextual cues where required and clean up legacy content

Where a legacy schema is generally less constrained than the target schema for migration, it is sometimes helpful to add extra semantic information and clean up the legacy content, so it maps more easily and systematically to the target structures.

For example, to convert a DocBook section to several DITA task topics, you would first split the DocBook section into several nested section files, structure them in a way that corresponds to DITA tasks, and name them with a “t-” prefix so that the conversion script can convert them to task topics.

This process can only be undertaken once a draft information model has been defined and once authors have access to the legacy content to be able to make such changes.
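The “t-” prefix cue from the DocBook example above could drive a conversion script along the following lines. This is a simplified sketch under stated assumptions, not an actual migration script: it assumes non-namespaced DocBook, handles only titles, procedure steps and paragraphs, and the function names are hypothetical. A production script would more likely be XSLT-based and far more complete.

```python
import xml.etree.ElementTree as ET


def topic_type_for(filename):
    """Choose the DITA topic type from the file-name cue ("t-" means task)."""
    return "task" if filename.startswith("t-") else "topic"


def convert_section(filename, docbook_xml):
    """Convert one DocBook <section> file to a minimal DITA topic element."""
    section = ET.fromstring(docbook_xml)
    topic_type = topic_type_for(filename)
    # Use the file name without its extension as the topic ID (an assumption).
    topic = ET.Element(topic_type, id=filename.rsplit(".", 1)[0])
    ET.SubElement(topic, "title").text = section.findtext("title", default="")

    if topic_type == "task":
        body = ET.SubElement(topic, "taskbody")
        docbook_steps = list(section.iter("step"))
        if docbook_steps:
            steps = ET.SubElement(body, "steps")
            for docbook_step in docbook_steps:
                step = ET.SubElement(steps, "step")
                # Flatten the step's text content into the DITA <cmd>.
                ET.SubElement(step, "cmd").text = "".join(
                    docbook_step.itertext()
                ).strip()
    else:
        body = ET.SubElement(topic, "body")
        for para in section.findall("para"):
            ET.SubElement(body, "p").text = "".join(para.itertext()).strip()
    return topic


source = (
    "<section><title>Install the driver</title>"
    "<procedure><step><para>Run the installer.</para></step></procedure>"
    "</section>"
)
task = convert_section("t-install-driver.xml", source)
```

Because the cue lives in the file name, authors can prepare the legacy content without touching the script, and files without the prefix fall back to a generic topic.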

Phase 4 – Develop and test migration script(s)

The last development phase of the conversion process involves building a customised, automated migration script, often XSLT-based, using the mappings and information model developed in the earlier phases. This is a highly iterative process, and the script makes use of the previously defined sample test input, along with the sample marked-up output for comparison. The output may be the same dedicated DITA maps that were used to test publishing, in which case the input will be the legacy structures that are expected to map to those target DITA structures, including any necessary extra contextual cues.

When Mekon develop these scripts for clients, the initial content sample helps to get development started, but it is vital to test the process with real content as well. There will always be new structures or sequences that have not previously been accounted for, and the script will need to be adjusted to handle them. Custom script development of this kind leads to a high-quality conversion that preserves both the semantics of the original content and the reuse within it. A good conversion script converts each reused topic only once and maps legacy reuse patterns to idiomatic target architectural patterns.
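The "convert each reused topic once" behaviour can be illustrated with a simple lookup of topics already converted. This is a sketch only: `convert_documents`, the `convert_topic` callback and the sample IDs and paths are hypothetical names chosen for illustration.

```python
def convert_documents(documents, convert_topic):
    """Convert every document, converting each reused topic only once.

    documents: dict mapping a document name to its ordered legacy topic IDs.
    convert_topic: callable that converts one legacy topic and returns the
        path of the resulting target file (hypothetical callback).
    Returns a dict mapping each document name to its target topic paths.
    """
    converted = {}  # legacy topic ID -> target path, shared by all documents
    maps = {}
    for doc_name, topic_ids in documents.items():
        hrefs = []
        for topic_id in topic_ids:
            if topic_id not in converted:
                converted[topic_id] = convert_topic(topic_id)
            # Reused topics resolve to the one file converted earlier.
            hrefs.append(converted[topic_id])
        maps[doc_name] = hrefs
    return maps


# Illustrative usage: "safety" is reused by both guides.
calls = []


def fake_convert(topic_id):
    calls.append(topic_id)
    return f"topics/{topic_id}.dita"


result = convert_documents(
    {"guide-a": ["intro", "safety"], "guide-b": ["safety", "setup"]},
    fake_convert,
)
```

Both guides end up referencing the same converted "safety" file, so the reuse in the legacy content survives as reuse in the target content rather than becoming duplicated topics.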

Phase 5 – Schedule conversion according to clusters of reuse

It is rare for an organisation to be able to convert all content at the same time, and it is unlikely that every group of users producing content will be ready to move to the new architecture together. However, to maximise reuse and simplify script development, it is helpful to convert content in batches according to the clusters of reuse within that content. For example, it might make sense to convert all installation manuals at one time, due to some reuse between them, and to convert web services docs in a subsequent batch. It is possible to migrate individual documents from a set like this and others from the same set later, but this requires a more complex conversion tool that can store the paths or IDs of files already converted. With this approach there is also a risk that reused content may be modified between the first and subsequent imports.
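One possible way to find such clusters is to treat documents as connected whenever they share a reused topic, and then group the connected documents into batches. The sketch below assumes the reuse information is already available as a mapping from each document to the topic IDs it uses; the function name and sample data are illustrative only.

```python
from collections import defaultdict


def reuse_clusters(doc_topics):
    """Group documents into batches linked, directly or indirectly, by reuse.

    doc_topics: dict mapping a document name to the set of topic IDs it uses.
    Returns a list of clusters, each a sorted list of document names.
    """
    # Index: topic ID -> documents that use it.
    docs_by_topic = defaultdict(set)
    for doc, topics in doc_topics.items():
        for topic in topics:
            docs_by_topic[topic].add(doc)

    seen = set()
    clusters = []
    for start in doc_topics:
        if start in seen:
            continue
        # Walk outwards from this document through shared topics.
        cluster = set()
        stack = [start]
        seen.add(start)
        while stack:
            doc = stack.pop()
            cluster.add(doc)
            for topic in doc_topics[doc]:
                for neighbour in docs_by_topic[topic]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        clusters.append(sorted(cluster))
    return clusters


batches = reuse_clusters({
    "install-a": {"safety", "setup"},
    "install-b": {"setup"},
    "ws-api": {"auth"},
    "ws-guide": {"auth", "intro"},
})
# batches == [["install-a", "install-b"], ["ws-api", "ws-guide"]]
```

Converting each cluster as a single batch means every reused topic is migrated together with all of the documents that depend on it.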

If you would like to find out more about how Mekon can help with your content migration, email info@mekon.com or call +44 (0)20 8722 8400.