Multimedia_text objects
The first object we meet is called multimedia_text. It is our magic Swiss Army knife, our container for everything. It provides a unified interface that allows the system to process textual and multimedia content in much the same way, while delegating the storage of multimedia files and text to specialized sub-containers (see the two child pages at the bottom).
This is an important feature, because it means that content can be in textual, pictorial, audio or video format and the system will still treat it for what it is: a piece of content. It frees the system from format-dependent problems, since these are delegated to the sub-containers.
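The delegation described above can be sketched as follows. This is only an illustration under stated assumptions: the class names `TextStore`, `MediaStore` and `MultimediaText`, and the methods `describe` and `size_in_bytes`, are hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass
from typing import Protocol


class SubContainer(Protocol):
    """Hypothetical interface shared by the specialized sub-containers."""

    def size_in_bytes(self) -> int: ...

    def describe(self) -> str: ...


@dataclass
class TextStore:
    """Hypothetical sub-container that holds plain text."""

    text: str

    def size_in_bytes(self) -> int:
        return len(self.text.encode("utf-8"))

    def describe(self) -> str:
        return f"text, {len(self.text)} characters"


@dataclass
class MediaStore:
    """Hypothetical sub-container that holds a multimedia file."""

    mime_type: str
    payload: bytes

    def size_in_bytes(self) -> int:
        return len(self.payload)

    def describe(self) -> str:
        return f"{self.mime_type} file, {len(self.payload)} bytes"


@dataclass
class MultimediaText:
    """Illustrative stand-in for multimedia_text: one interface, any format."""

    store: SubContainer

    def size_in_bytes(self) -> int:
        # Format-dependent work is delegated to the sub-container.
        return self.store.size_in_bytes()

    def describe(self) -> str:
        return self.store.describe()


pieces = [
    MultimediaText(TextStore("hello")),
    MultimediaText(MediaStore("audio/ogg", b"\x00" * 16)),
]

# The caller never branches on the format of the content.
summaries = [piece.describe() for piece in pieces]
```

The point of the sketch is that code working with `pieces` never needs to know whether a piece is text or audio; only the sub-container does.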
[Image: a Swiss Army knife]
A multimedia_text object contains:
A word about language classification
Language is stated in terms of communicative_system and medium (we shall return to this issue in more depth when we deal with special classes). This terminology is dictated by the polymorphic nature of content. A textual expression in English is classified as:
The same content, when stored as an audio file, is classified as:
Such a classification also allows us to distinguish properly between Serbian Latin and Serbian Cyrillic, or Traditional and Simplified Chinese, and to catalog the various scripts used in Japanese. This information is important, for example, when generating orthographic correctors (spell checkers). Proper tracking of the medium also allows us to catalog text across the orthographic reforms of a given language.
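As a rough illustration of why the medium matters when generating orthographic correctors, the sketch below groups words by their communicative_system/medium pair, so that Serbian Latin and Serbian Cyrillic end up in separate word lists. The function name `word_lists_by_coupling` and the labels `"Serbian"`, `"Latin"` and `"Cyrillic"` are assumptions made for the example; the values the system actually stores may look different.

```python
from collections import defaultdict


def word_lists_by_coupling(entries):
    """Group words by their (communicative_system, medium) pair.

    `entries` is an iterable of (communicative_system, medium, text) tuples.
    The labels are hypothetical placeholders, not the system's real values.
    """
    lists = defaultdict(set)
    for communicative_system, medium, text in entries:
        lists[(communicative_system, medium)].update(text.lower().split())
    return dict(lists)


entries = [
    ("Serbian", "Latin", "dobar dan"),
    ("Serbian", "Cyrillic", "добар дан"),
    ("Serbian", "Latin", "dobro jutro"),
]

lists = word_lists_by_coupling(entries)
```

Keyed by the language name alone, the two scripts would be merged into one list, which is of no use to a corrector that must accept only one of them.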
In most cases the classification applies to the content piece as a whole. Yet languages like Japanese use different scripts much in the way Western languages use italic and bold typefaces, so it is quite common for Japanese phrases to mix scripts. A phrase in any language can also contain quotations in other languages.
Therefore we have introduced the notion of prevalent language. This is the communicative_system/medium coupling that applies to most of the content piece. Any number of segments that use a different coupling can be specified for each piece of content, but this information is not held by the multimedia_text object (and is therefore not indexed).
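To make "applies to most of the content piece" concrete, here is a minimal sketch that picks the coupling covering the most characters. It is an assumption about how prevalence could be measured, not a description of how the system determines it; `Coupling`, `Segment` and `prevalent_coupling` are hypothetical names, and so are the label values.

```python
from collections import Counter
from typing import NamedTuple


class Coupling(NamedTuple):
    """A communicative_system/medium pair (label values are hypothetical)."""

    communicative_system: str
    medium: str


class Segment(NamedTuple):
    """A stretch of content together with the coupling it is expressed in."""

    text: str
    coupling: Coupling


def prevalent_coupling(segments):
    """Return the coupling that covers the largest number of characters."""
    coverage = Counter()
    for segment in segments:
        coverage[segment.coupling] += len(segment.text)
    if not coverage:
        raise ValueError("at least one segment is required")
    return coverage.most_common(1)[0][0]


SR_CYRILLIC = Coupling("Serbian", "Cyrillic")
SR_LATIN = Coupling("Serbian", "Latin")

segments = [
    Segment("Добар дан, ", SR_CYRILLIC),
    Segment("dobar dan", SR_LATIN),
    Segment(" и хвала", SR_CYRILLIC),
]

prevalent = prevalent_coupling(segments)
```

In this example the Cyrillic segments cover more characters than the Latin one, so the Cyrillic coupling is the prevalent one, while the Latin segment would be recorded separately as a deviation.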
The term communicative_system is used in place of "language" because not all content is expressed in a language, and not all languages are human (no, this is not about Klingon, as we shall see). For example, you see a picture of a Swiss Army knife on this page. It is not expressed in any language, yet other pictures may contain text and actually be language-dependent (i.e. they may need translation).
Moreover, if you build a corpus of, say, chemical elements, you may want to store instructions for a molecular renderer, so that it can generate a graphic representation of the molecule. What you store is by all means content, and it is text (a sequence of commands for the renderer), yet the "language" in which it is expressed is definitely not human. The same may be said of a MIDI file, for example.
As you can see, a multimedia_text object is really an incredibly powerful Swiss Army knife...