OTGS Segmenter

Introduction

WPML provides an XLIFF file in which each trans-unit wraps all of its content in CDATA sections. These sections contain everything, including markup. Before such files can be translated, they need further processing to extract the translatable text and remove the HTML markup. Translation Proxy (TP) / Translation Hub (TH) can now provide XLIFF files in the standard XLIFF 1.2 format, in which non-translatable content and markers are replaced with XLIFF tags and attributes.
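
To illustrate why the CDATA-wrapped form needs extra processing, here is a minimal Python sketch. The sample trans-unit is hypothetical (it is not copied from a real WPML export), and the helper names `raw_sources` and `strip_markup` are made up for this example. It only shows that a standard XML parser returns the markup as plain text, which a consumer then has to strip; it is not the processing that TP/TH performs.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical trans-unit: the HTML markup travels inside CDATA as plain text.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="post-1" source-language="en" target-language="de" datatype="html">
    <body>
      <trans-unit id="title-1">
        <source><![CDATA[<p>Hello <strong>world</strong>!</p>]]></source>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


class TextOnly(HTMLParser):
    """Collects text nodes and drops the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def raw_sources(xliff_text):
    """Return {trans-unit id: raw source string, markup included}."""
    root = ET.fromstring(xliff_text)
    return {
        unit.get("id"): unit.findtext("x:source", default="", namespaces=NS)
        for unit in root.iterfind(".//x:trans-unit", NS)
    }


def strip_markup(raw):
    """Return only the text content of an HTML fragment."""
    parser = TextOnly()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    for unit_id, raw in raw_sources(SAMPLE).items():
        print(unit_id, "|", raw, "|", strip_markup(raw))
```

The parser sees the whole source as one opaque string, so a translation tool cannot tell text from markup without this extra step.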

Opt-in

By default, WPML-generated XLIFF files are used when translations are sent from TP. To receive the new, improved XLIFF files, you need to opt in by sending a formal request to the Integrations Team by email. Several options are available to tune the output XLIFF for different systems; specify them in the request. Any option you do not specify is set to its default value.
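
The way unspecified options fall back to defaults can be sketched in Python as follows. This is only an illustration: the request itself is an email, so the dictionary representation and the function name `effective_options` are assumptions. The default values are copied from the Options table in this document. Modelling an "ignored" option as switched off is also an assumption made for the sketch.

```python
# Defaults copied from the Options table in this document.
DEFAULTS = {
    "ignore_targets": False,
    "remove_targets": False,
    "force_xliff_standard": False,
    "keep_target_attrs": False,
    "keep_url": False,
    "keep_notes_in_parsed": False,
    "convert_invalid_tags": True,
    "mrk_status": True,
    "segmentation": True,
    "perfect_words_limit": 50,
    "perfect_markers_limit": 20,
    "perfect_tolerant_markers_limit": 15,
    "perfect_heavy_markers_limit": 0,
    "perfect_other_markers_limit": 5,
    "tolerant_tags": "a b span bold strong i em p abbr small sub sup text br "
                     "wpml_linebreak wpml_nbsp",
    "heavy_tags": "p h1 h2 h3 h4 h5 h6 ol ul li div form table fieldset "
                  "wpml_separator",
    "untranslatable_recognizers": [
        "whitespace", "numbers", "plugin_settings", "css_keywords", "urls",
        "json", "uuid", "too_many_nodes", "big_text", "css_snippet",
    ],
}


def effective_options(requested):
    """Merge requested options over the defaults (hypothetical helper)."""
    unknown = set(requested) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")

    options = {**DEFAULTS, **requested}

    # Documented interaction: strict mode ignores mrk_status and keep_url.
    # Assumption: "ignored" is modelled here as switched off.
    if options["force_xliff_standard"]:
        options["mrk_status"] = False
        options["keep_url"] = False

    # Documented interaction: keep_target_attrs only works when both
    # ignore_targets and remove_targets are false.
    if options["ignore_targets"] or options["remove_targets"]:
        options["keep_target_attrs"] = False

    return options


if __name__ == "__main__":
    print(effective_options({"perfect_words_limit": 30, "keep_url": True}))
```

Rejecting unknown option names is a design choice of this sketch, intended to catch typos in a request early.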

Options

Option	Description	Default value	Value type	Accepted values
`ignore_targets`	Ignore the targets from the original XLIFF and use the source as the target.	`false`	Boolean	true, false
`remove_targets`	Remove targets from the output XLIFF.	`false`	Boolean	true, false
`force_xliff_standard`	Enforce strict XLIFF 1.2 validation. When set to true, the mrk_status and keep_url options are ignored, and the external-file tag from the original XLIFF is not added to the output XLIFF.	`false`	Boolean	true, false
`keep_target_attrs`	Keep state="needs-review-translation" and state-qualifier="tm-suggestion" if they are present in the original XLIFF. For this to work, both the ignore_targets and remove_targets options have to be set to false.	`false`	Boolean	true, false
`keep_url`	The output XLIFF includes the URL from “<header reference external-file>#href” in “<header phase-group phase[name=’wpml-url’] note>#text”.	`false`	Boolean	true, false
`keep_notes_in_parsed`	Copy note blocks from the original XLIFF to the output XLIFF. In WPML, the client can set these as instructions for the translator.	`false`	Boolean	true, false
`convert_invalid_tags`	WP blocks sometimes contain tags that are split across blocks (a tag is opened in one block and closed in another), which makes those blocks invalid in raw form. This option transforms non-matching tag pairs into `wpml_invalid_tag` tags so that the produced segments are valid as raw HTML.	`true`	Boolean	true, false
`mrk_status`	Add the mrk_status attribute to the output XLIFF. This option is ignored if force_xliff_standard is set to true.	`true`	Boolean	true, false
`segmentation`	Segment the original trans-units into new, smaller trans-units whose markers and sentences stay within the configured limits.	`true`	Boolean	true, false
`perfect_words_limit`	Maximum word count per segment. A segment always has at most this many words, with a few exceptions where the content can't be split (e.g. JSON).	`50`	Integer	Any positive integer value.
`perfect_markers_limit`	Maximum number of inline markers per segment. A segment always has at most this many markers, except for parsing errors and other exceptional situations.	`20`	Integer	Any positive integer value.
`perfect_tolerant_markers_limit`	Tolerant tags are usually part of a sentence and do not contain separate blocks, so a segment is allowed to have up to this number of them.	`15`	Integer	Any positive integer value.
`perfect_heavy_markers_limit`	Heavy tags usually contain separate sentences, so it doesn’t make sense to keep several of them in the same segment. A limit of 0 means that a segment cannot contain any heavy tags; a limit of 1 means that a segment can contain at most 1 heavy tag.	`0`	Integer	Any non-negative integer value.
`perfect_other_markers_limit`	Limit for normal tags, which can either contain separate sentences or wrap words inside a bigger segment. A segment can have up to this number of them.	`5`	Integer	Any positive integer value.
`tolerant_tags`	List of tags to apply the perfect_tolerant_markers_limit to.	`a b span bold strong i em p abbr small sub sup text br wpml_linebreak wpml_nbsp`	String	Space-separated string naming the desired tags.
`heavy_tags`	List of tags to apply the perfect_heavy_markers_limit to.	`p h1 h2 h3 h4 h5 h6 ol ul li div form table fieldset wpml_separator`	String	Space-separated string naming the desired tags.
`untranslatable_recognizers`	List of algorithms that detect untranslatable content and automatically skip those trans-units in the output XLIFF. Use an empty list to disable this option.	`[:whitespace, :numbers, :plugin_settings, :css_keywords, :urls, :json, :uuid, :too_many_nodes, :big_text, :css_snippet]`	List	Comma-separated list of algorithm names. Supported algorithms: :whitespace, :numbers, :plugin_settings, :css_keywords, :urls, :json, :uuid, :too_many_nodes, :big_text, :css_snippet.
Options for `untranslatable_recognizers`
`whitespace`		Skip trans-units and segments that contain only whitespace (spaces, tabs) in different encodings.
`numbers`		Skip trans-units and segments that contain only digits and numbers in different forms (100%, 100.22, +2, -0.6) in different encodings.
`plugin_settings`		Skip trans-units whose id contains specific markers (css, fonts, fields...) related to plugin settings.
`css_keywords`		Skip known CSS keywords, hex colors, font values, Font Awesome icons, etc.
`urls`		Skip well-formed URLs that start with the http or https protocol.
`json`		Skip valid JSON.
`uuid`		Skip UUID codes.
`too_many_nodes`		Don’t segment a trans-unit if it contains more than 2000 nodes and the ratio of the number of words to the number of tags is lower than 0.1%.
`big_text`		Don’t segment a trans-unit if it contains more than 10000 characters of raw text.
`css_snippet`		Skip a trans-unit if it contains only valid CSS.
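
A few of the recognizers above can be approximated in Python to show the general idea. This is a rough sketch under stated assumptions, not the service's implementation: the actual detection rules are not described in this document, the regular expressions are simplified guesses, and the names `RECOGNIZERS` and `first_match` are hypothetical. Only `whitespace`, `numbers`, `urls`, `json` and `uuid` are covered.

```python
import json
import re

# Simplified patterns; the real recognizers may accept more forms.
NUMBER_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*%?")
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_whitespace(text):
    return text.strip() == ""


def is_number(text):
    return NUMBER_RE.fullmatch(text.strip()) is not None


def is_url(text):
    return URL_RE.fullmatch(text.strip()) is not None


def is_json(text):
    stripped = text.strip()
    # Assumption: only objects and arrays count, so a bare number or a
    # quoted word is not classified as JSON by this sketch.
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
    except ValueError:
        return False
    return True


def is_uuid(text):
    return UUID_RE.fullmatch(text.strip()) is not None


RECOGNIZERS = {
    "whitespace": is_whitespace,
    "numbers": is_number,
    "urls": is_url,
    "json": is_json,
    "uuid": is_uuid,
}


def first_match(text, enabled):
    """Return the name of the first enabled recognizer that matches, or None."""
    for name in enabled:
        if RECOGNIZERS[name](text):
            return name
    return None


if __name__ == "__main__":
    enabled = ["whitespace", "numbers", "urls", "json", "uuid"]
    for sample in ["  \t", "100%", "https://example.com/a", '{"a": 1}', "Hello"]:
        print(repr(sample), "->", first_match(sample, enabled))
```

Passing an empty `enabled` list makes `first_match` return `None` for every input, which mirrors the documented way to disable the option.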