Ambaradan is an open source storage system for language based information, whether textual, spoken or gestured. It allows both semantic annotation and a simple management of individual and co-operative translation processes. It aims to serve third parties, that can use it to store, share and manage information through its API. It shall eventually develop into a distributed repository, so that users can become direct donors of server space.
Nowadays you can find dedicated storage systems for most kinds of data, like PostGIS for geographic objects. It is, however, quite uncommon to hear that someone is developing a dedicated storage system for language. The reasons behind this are manifold, but mostly it all boils down to the fact that even 8-bit boxes already did text, so dealing with it is considered sort of retro.
Yet, as a matter of fact, coping with language diversity is one of the largest problems facing the Internet and the international economy. As a result of the current refactoring of the world's economics, the English language's dominant position is going to be further eroded, and open source tools that can manage language diversity are bound to become a strategic asset.
Separating content from its rendering technology
Having a centralized repository for content and its semantic annotation means you have content that is not embedded into some application. It means your application may age and get substituted, but the data it shaped will stay. If you come to think about it, it's the same thing that happened when we stopped writing accounting applications and started to use relational databases, so that we could drop our UIs without losing our data.
This aspect cannot be overstated, as the pace of innovation gets quicker and quicker, and the life expectancy of rendering technologies gets accordingly shorter. So moving to a semantic repository is not just about languages, it's about your capability to keep your knowledge safe while changing your technology arsenal.
Your own private cloud
Cloud computing has caused a lot of noise and doubt, as we all moved data to a destination unknown. It proved a good idea (all success stories are, and clouds are definitely a success story), but it also raised questions about privacy and ultimate information control. If we all live with clouds, the main power switch is in the hands of those who control the weather.
So the ultimate goal of ambaradan is to let end users choose their own clouds, be it a totally private system or a large co-operative environment made by volunteers. This is currently no more than a strategic direction. We know what we want to achieve, but we have not committed to any tactical choice of technology for our clouding.
A word about less resourced cultures
You may be surprised to find that a charity assisting less resourced cultures ends up developing a general purpose industrial tool. You should not be, as this is a political stance. We believe that the road out of the ghetto is called dignity, and respect is something to be earned.
When solutions to the problems of less resourced languages become a general utility, these solutions may become robust, they can be financed and they can grow and spread. We would never be able to assess the complexity of linguistic dependent content if we did not study this complexity at large. So this complexity and the needs of less resourced cultures are the engine behind the development of this tool.
The need for tremendous individual productivity is much higher where you have but a handful of native speakers that can volunteer for work. It is by addressing this need for a dramatic surge in productivity that we become useful to everyone. Our message is that minorities' problems are everyone's problem: when we solve them, we have generated a positive outcome for everyone.
A spoonful of history
This storage engine was originally born as an upgrade for Omegawiki, but in time it has evolved to such a conceptual distance from Omegawiki that at this stage they can only be presented as remote relatives.
Ambaradan has already undergone 6 alpha releases by now, and it has chosen PostgreSQL as its relational backend. After the KDE sprint in Randa it has made a strategic choice towards a REST interface that makes it an online resource for third parties, an objective to be achieved by the upcoming 0.7.0 (codenamed Elastica) release.
The choice to produce data for KDE is strategic. The development of any storage system aiming for success needs to be driven by the needs of large data consumers, rather than by the vision of a handful of developers. We are aware that many things are going to change as people start to complain about things, and we are happy with this source of quality feedback.
Ambaradan in objects
Everything in Ambaradan is an object, and as in all decent programming families (hopefully you got the self-ironic tone here) our objects are made of hierarchies, inherited values and behaviours.
To present such uncanny pieces of black computing magic, the usual geek convention is to start from the root element, write loads of pages on how powerful and generic it is, and then slowly get to things a simple user can understand. Too bad that most simple users have long quit reading, by the time things get to be interesting for them.
We obviously cannot build a house starting from the roof, yet we did try our best to build this guide with things that non-programmers can immediately grasp, without any need to get deeply involved in object oriented theology. You may want to read the following pages to get a good grasp of how this engine works.
Multimedia_text objects
The first object we meet is called multimedia_text. It is our magic Swiss knife, our container for everything. It provides a unified interface that allows the system to process textual and multimedia content in pretty much the same way, while delegating the storage of multimedia files and text to specialized sub-containers (see the two child pages at the bottom).
This is an important feature, because it means that content can indifferently be in textual, pictorial, audio or video format and the system will treat it anyway for what it is: a piece of content. It frees the system from format dependent problems, since the latter are delegated.
A multimedia_text object contains:
A word about language classification
Language is stated in terms of communicative_system and medium (we shall return to this issue in more depth when we deal with special classes). This terminology is dictated by content's polymorphic nature. A textual expression in English is classified as:
The same content, when stored as an audio file, is classified as:
Such a classification also allows us to make a proper distinction between Serbian Latin and Serbian Cyrillic, Traditional and Simplified Chinese, and to catalog the various scripts used in Japanese. This information is important, for example, in order to generate orthographic correctors. A proper tracking of the medium also allows us to catalog text across orthographic reforms for a given language.
In most cases the classification applies to the content piece as a whole. Yet languages like Japanese use different scripts much in the way western languages use italic and bold typefaces, so it's pretty common to have phrases that mix scripts in Japanese. You can also have quotations from other languages inside a phrase of any language.
Therefore we have introduced the notion of prevalent language. This is the communicative_system/medium coupling that applies to most of the content piece. Any number of segments that use a different coupling can be specified for each piece of content, but this information is not held by the multimedia_text object (and is thus not indexed).
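To make the idea of a prevalent coupling with per-segment exceptions more concrete, here is a minimal Python sketch. The class names (Coupling, Segment, ContentPiece), the field names and the label strings are all hypothetical and are not taken from the Ambaradan API; only the communicative_system/medium terminology comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Coupling:
    """A communicative_system/medium pair (hypothetical representation)."""
    communicative_system: str
    medium: str


@dataclass(frozen=True)
class Segment:
    """A span of the content that uses a coupling other than the prevalent one."""
    start: int
    end: int
    coupling: Coupling


@dataclass
class ContentPiece:
    text: str
    prevalent: Coupling  # the only coupling that would be indexed
    segments: list[Segment] = field(default_factory=list)  # kept outside the index

    def coupling_at(self, position: int) -> Coupling:
        """Return the coupling in force at a given character position."""
        for segment in self.segments:
            if segment.start <= position < segment.end:
                return segment.coupling
        return self.prevalent


# Placeholder labels: the real catalogue values are not shown in this guide.
SERBIAN_LATIN = Coupling("Serbian", "Latin script")
SERBIAN_CYRILLIC = Coupling("Serbian", "Cyrillic script")

text = "On je rekao: здраво"
quote_start = text.index("здраво")
piece = ContentPiece(
    text=text,
    prevalent=SERBIAN_LATIN,
    segments=[Segment(quote_start, len(text), SERBIAN_CYRILLIC)],
)
```

In this sketch a search index would only ever see piece.prevalent, while a renderer or spell checker could call coupling_at to find the exception segments.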
The term communicative_system is used in place of "language" because not all content is expressed in a language, and not all languages are human (no, this is not about Klingon, as we shall see). For example, you see a picture of a swiss knife in this page. It's not expressed in any language, and yet other pictures may contain text and actually be language dependent (i.e. they may need translation).
Moreover, if you build a corpus of, say, chemical elements, you may want to store instructions for a molecular renderer, so that it can generate a graphic representation of the molecule. What you store is by all means content, it is text (a sequence of commands for the renderer), yet the "language" in which it is expressed is definitely not human. The same may be said of a MIDI file, for example.
As you see, a multimedia_text object is really an incredibly powerful swiss knife...
A file_descriptor object is a specialized container for the information that allows finding the sort of files computers use to store multimedia content. It can contain references to basically anything, be it an audio file, a video, a picture etc.
It is important to understand that we are NOT storing the actual file: we simply store a how-to that allows the system to find the file, and that holds basic information about its type, size and location. This is an important feature, as multimedia files are large resource consumers and Ambaradan aims to develop into a distributed repository.
Most of the nodes in the distributed repository will not need a copy of all the available multimedia content. Corporate users will probably be happy with a single file server on their intranets, and single nodes will often be willing to host only the part of content they actually use. In any case, we shall have a network on which nodes will be able to publish content for other nodes to import it on demand. So our strategy is to build a set of multimedia yellow pages, by means of our file_descriptor objects.
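The "yellow pages" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the FileDescriptor fields mirror the type, size and location mentioned above, but their names, the YellowPages class and the example key and location are hypothetical, and no network technology is implied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileDescriptor:
    """Pointer to a multimedia file; the file itself is never stored here."""
    media_type: str   # hypothetical field name
    size_bytes: int   # hypothetical field name
    location: str     # where a node could fetch the file from


class YellowPages:
    """Hypothetical in-memory directory of descriptors published by nodes."""

    def __init__(self) -> None:
        self._entries: dict[str, FileDescriptor] = {}

    def publish(self, key: str, descriptor: FileDescriptor) -> None:
        """Advertise a file without shipping its bytes."""
        self._entries[key] = descriptor

    def lookup(self, key: str) -> Optional[FileDescriptor]:
        """Find out where a file lives, so it can be imported on demand."""
        return self._entries.get(key)


pages = YellowPages()
pages.publish(
    "cat-pronunciation",  # hypothetical key
    FileDescriptor("audio/ogg", 48_213, "node-a.example/audio/cat.ogg"),
)
found = pages.lookup("cat-pronunciation")
missing = pages.lookup("dog-pronunciation")
```

A node that never asks for "cat-pronunciation" never pays for the storage of the audio file; it only holds the small descriptor.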
The state of these objects is highly evolutionary, as no final choice has been made regarding the network technology to be used in implementing the distributed repository. The design implemented in version 0.7.0 is derived from the current candidate technologies, Freenet and GNUnet. Yet you should be aware that both these candidates could be superseded by some form of Nepomuk integration, if this choice proves to offer better access to our data for KDE applications. A final decision is expected in the first part of 2011.
A file_descriptor object currently stores:
A textual_content object is a specialized container for text. It can contain text of any length, so that when building a dictionary you can actually store the whole Bible as an expression, and write something like 'collections of sacred scripture of Judaism and Christianity'.
Such an extreme use would make little practical sense, if any. As we shall see later, entire documents are better managed by breaking them into single sentences, and storing the sequence that allows their reconstruction. Yet, it is nice to be free from size limits when you manage, say, a collection of system messages that need to be localized.
Along with its content, a textual_content object stores basic information that allows efficient full-text searching. In all, an instance of this class contains:
Object objects
Every multimedia_text, textual_content and file_descriptor is a descendant of object. For those of you who aren't familiar with object-oriented programming, this is probably going to sound mysterious. Now, think of it in terms of you and me. We are both humans, right? Well, it's the same here. Jack is Jack and Jill is Jill, you can surely tell who is who, yet they are both human, so you know they have a heart, lungs etc.
When we say that everything is an object, we make a similar assumption. Whatever we can say about object will apply to its descendants as well. Although this is a bit too general (and there are actually exceptions), this is as much programming as you need to understand this documentation. So, what is this whatever we can say about object? At first sight, this object thing seems to do very little, and in fact it is a very general being.
It is what we can call a low-level DNA: it contains a set of flags that tell the descendant how to behave. You don't know what a flag is in Information Technology? Well, it's simply a good old switch. It can be true or false, i.e., it can be on or off. You used switches for most of your life, so there's no secret about them.
Currently our DNA is fairly simple, as it's got just three switches:
The first switch says whether the descendant object can be licensed: the concept of 'drink' cannot be copyrighted, but a drink's name can (if used as a trademark). Copyright-able objects need licence information to be usable, so this switch tells the descendant object (no matter what it is) that it should carry a licence before being accepted by the system. Obviously the licence can be public domain, which is by all means a licence.
If the networked switch is on, it means that the descendant object will be broadcast on the distributed repository. Not all individual objects in the system need broadcasting, as many are generated from rules. To remain on legal grounds, a copyright-able object assumes the licence of the region (we shall see what a region is later on) to which it belongs, unless it states an exception. In both cases the need for a licence is satisfied, but only exceptions to regional rules need to be broadcast individually.
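The licence rule just described (regional default unless an exception is stated) can be sketched as two tiny Python functions. The function names and the plain-string representation of a licence are assumptions made for illustration only; they do not describe how Ambaradan actually stores licences.

```python
from typing import Optional


def effective_licence(region_licence: str, exception: Optional[str] = None) -> str:
    """Return the stated exception if there is one, otherwise the regional licence."""
    return region_licence if exception is None else exception


def must_broadcast_licence(exception: Optional[str]) -> bool:
    """Only exceptions to the regional rule need to travel with the individual object."""
    return exception is not None


inherited = effective_licence("CC-BY")
overridden = effective_licence("CC-BY", "public domain")
```

Either way the object ends up with a licence, which is the point of the first switch.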
Last, we have a flag called mediated. When mediated is set to TRUE, it means that the content is language dependent, and in order to be meaningful for a user it must be put into the most appropriate linguistic phase. I can see a lot of eyebrows moving. What sort of content can be language independent? Well, you can easily find out in what language the word wine is expressed, and in what language вино is, but what's the language of the drink itself?

As we said, object is not only included in multimedia_text and specialized container objects. Other entities that descend from it always carry language independent information (i.e. not the name of the drink in a given language, but the drink itself, before language). The border between language dependent and language independent entities is fixed, as an object's flags can be set only on creation, and never be modified afterwards.
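One way to picture switches that are set at creation and never modified is a frozen record. The Python sketch below is an assumption-laden illustration: networked and mediated are the flag names used above, while licensable is a hypothetical name for the first switch, and the example values are made up.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ObjectFlags:
    licensable: bool  # hypothetical name for the first switch
    networked: bool   # broadcast on the distributed repository
    mediated: bool    # language dependent content


def try_to_flip_mediated(flags: ObjectFlags) -> bool:
    """Attempt to change a flag after creation; report whether it worked."""
    try:
        flags.mediated = not flags.mediated
    except FrozenInstanceError:
        return False
    return True


# Illustrative values only: a word is language dependent, the drink is not.
the_word_wine = ObjectFlags(licensable=True, networked=True, mediated=True)
the_drink_itself = ObjectFlags(licensable=False, networked=True, mediated=False)

flipped = try_to_flip_mediated(the_word_wine)
```

Because the record is frozen, an object cannot drift across the border between language dependent and language independent entities after it has been created.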
A further service given by object is that it can be classified. We shall see in detail how classification works later on. At this stage it is important to understand that you can attribute semantic qualities to any object, and since everything is built on top of an object... you can give semantic classification to any element in the system.
There is a final service performed by object: it is our main catalog of everything that is in the system. Networked objects carry UUIDs (if you don't understand what they are, think of the bar-code that identifies a product in a large network of supermarkets) that identify them all over the distributed repository, but the larger set of local objects has only numeric identifiers, which are supplied by object. Since all tables that contain data of the descendant entities cascade on this numeric id, if you delete an object you are automatically deleting whatever entity had been derived from it.
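The cascading delete can be demonstrated with a small self-contained Python sketch. It uses SQLite from the standard library purely as a stand-in (Ambaradan itself uses PostgreSQL), and the table and column names (base_object, profile, id) are hypothetical, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and thus cascades) when asked to.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE base_object (
        id INTEGER PRIMARY KEY
    );
    -- A descendant table reuses the numeric id issued by base_object.
    CREATE TABLE profile (
        id INTEGER PRIMARY KEY
            REFERENCES base_object(id) ON DELETE CASCADE
    );
    """
)

conn.execute("INSERT INTO base_object (id) VALUES (1)")
conn.execute("INSERT INTO profile (id) VALUES (1)")
profiles_before = conn.execute("SELECT COUNT(*) FROM profile").fetchone()[0]

# Deleting the base row removes the derived row as well.
conn.execute("DELETE FROM base_object WHERE id = 1")
profiles_after = conn.execute("SELECT COUNT(*) FROM profile").fetchone()[0]
```

The same ON DELETE CASCADE clause exists in PostgreSQL, so the principle carries over even though the details of the real schema may differ.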
The next step introduces us to our before language container, the profile object.
Profile objects
NOTE! You will want to read this to fully understand what we mean by "profile". In short, by profile we mean the abstract concept that manifests itself in language. The idea of a cat, that keeps together all possible ways to say "cat" in all possible languages. 
If you look at its database definition, a profile object appears to be one of the most abstract and impotent objects we have in our catalogue. In fact it doesn't do anything, it simply provides a reference that content can use to be semantically grouped.
By itself it doesn't contain any information whatsoever, but since it inherits object it can be classified, licensed etc. It is by definition language independent information, as things can exist long before anyone gives them a name.
If you look at the database table that maps profile objects in the database you are probably going to think it's a waste of space. It's simply repeating the same identifying number issued by object. So if you are thinking that we could do without this table, by just adding an is_profile switch to object... you are right. So far we have kept it, as it makes documentation more readable to have it, but it may disappear in the future, to become simply a flag.
As we have seen, a multimedia_text object is assigned to a given profile, so we can collect all that means the same thing. We still do not know how to tell an expression from a definition, in a dictionary structure, but we shall see that in the next chapter.
Hierarchical data and translation trees
Our understanding of reality is often built in frames containing each other. A cat is a feline, which is a mammal, which is an animal, which is a living entity, etc. When we say "Fritz the cat" we imply a whole lot of information in that animal codename.
When dealing with content cataloguing we need a lot of such frames, and we need to make sure we correctly map what is included in what. But we need frames even to track the process by which a multimedia_text came to be assigned to a given profile.
It is important for us to know that someone created a piece of content as an original, or as a translation. If it was a translation we must know from what it was translated in the first place. This serves two purposes:
So our content gets assigned to a profile by a so-called translation tree. This tree marks the way in which multimedia_text objects were produced and lets us immediately see that translation 1-4 is pretty likely to contain a huge semantic drift (do you remember we met this concept with multimedia_text already?).
This tree is not an object, but simply a structure that can be contained by proper objects. So while looking at its database definition you see nothing at all in the profile table, ambaradan can track the whole process that added content to it and, most important, it can track translation processes that happen within the system.
Yet, wait! Aren't we making dictionary-like entries? So what is this content we are talking about? The lemma itself or its definition? The answer is in the kind of tree we use. There are actually many, ambaradan uses them to map all possible kinds of taxonomic relations with just one dedicated set of routines that manages them all. There are trees to order
They all work in the same way, but before we proceed to explain their general structure it is probably better to make an example using translation trees. A tree node knows:
Tree nodes do not need to refer directly to a multimedia_text object, for the simple reason that they can refer to their included object instead. This allows trees to order literally anything in the system, not just language dependent content. So you can easily reformulate the following example to understand how the other tree types work.
You will probably have noticed that the first three elements do not tell us anything about the hierarchical position of a node in the tree. They do not say that translation 1-4 came from translation 1-3. Let's see why.
One of the most difficult challenges for the coder is to find a way to efficiently represent hierarchical data in a relational database. It may seem weird, but there is no immediate way to retrieve a taxonomic tree from relational tables by a single efficient query. So we all resort to tricks.
All tree elements have a left and right value. They work as frames, so we immediately see that element 0-7 includes all the others, element 1-6 is included in 0-7 and includes both 2-3 and 4-5. This makes it trivial to arrange queries that retrieve a full taxonomic mapping in a single shot and can compute an element's depth on the fly. It also makes it trivial to move around parts of a tree. Here we have the basic structure that allows merging and splitting things without much fuss and without any risk of losing bits and pieces in the process.
Such an ingenious solution is obviously no invention of the Ambaradan team. All credit goes to Mike Hillyer for a very clear explanation of this method, along with basic code snippets that helped us build what was needed.
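The left/right frames can be tried out with a few lines of Python. This is a minimal in-memory sketch of the technique (usually known as the nested set model), using the 0-7, 1-6, 2-3 and 4-5 elements from the example above; the Node class and the function names are hypothetical, and the real engine performs the equivalent comparisons in SQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    left: int
    right: int


NODES = [
    Node("0-7", 0, 7),
    Node("1-6", 1, 6),
    Node("2-3", 2, 3),
    Node("4-5", 4, 5),
]


def descendants(nodes: list[Node], parent: Node) -> list[Node]:
    """Everything whose frame lies strictly inside the parent's frame."""
    return [n for n in nodes if parent.left < n.left and n.right < parent.right]


def depth(nodes: list[Node], node: Node) -> int:
    """Depth is simply the number of frames that enclose the node."""
    return sum(1 for n in nodes if n.left < node.left and node.right < n.right)


root, middle, first_leaf, second_leaf = NODES
under_root = [n.name for n in descendants(NODES, root)]
under_middle = [n.name for n in descendants(NODES, middle)]
```

Both functions are a single pass of range comparisons, which is why the same idea translates into one relational query with no recursion.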
So, now that we have explained the basic technology we use to map and move relational data, let's move on to the way in which "dictionary entries" are built.
What exactly is a profile?
As we saw, a profile seems to be quite a poor, powerless thing. But once Ambaradan has collected and assembled all the language dependent content assigned to a given profile, we get a totally different picture.
What we get is a dictionary entry, in the usual sense of an expression = definition equation. This happens because during the linguistic content assembly process, two families of hierarchical tree structures are used:
By now it should be clear that there is no structural difference whatsoever between the two. They are made by assembling the same multimedia_text objects by means of the same tree structures. A single unified set of routines serves them both. The difference between them is all in the role they play for human users.
A basic point is that since a profile is a language independent object (it is created as an empty set, as far as language dependent content is concerned), neither of the two tree families is mandatory. You can use a profile to collect expressions only, definitions only, or any mix of the two.
The second important point is that this structure is not limited to dictionary management. Basically everything can be represented as an equation of content elements.
A news service or a CMS may use the expression tree to store articles, and the definition tree to store those small teasers that usually get published in the front-page. A software localization service may store a message codename in one of the two trees and its localized versions in another. A textual repository may store a full book in the expression, and a short description of the text in the definition. A Corpus Juris may have the text of its Laws in the expression, while serving commentaries in the definition tree, etc.
Obviously, no such service will be happy to call the two members expression and definition. But this is really not a problem, because we are in a multilayered architecture. All they need is to properly rename these entities in their GUIs; the logic remains unchanged anyway.
The members of the equation are fully interchangeable exactly because, as we have seen in the very beginning, a multimedia_text object is absolutely unaware of how it's going to be used. So our search engine can analyse the whole of our content base without needing to jump here and there. And it can do it really quickly, even if we get to host billions of entries.
But the most important point is that a profile object can now emerge anywhere in its linguistic phase space. That is, we have an entity that while being absolutely language independent is also fully capable of materializing itself within just any linguistic space. And the more original and/or translated linguistic content we assign to this profile object, the more capable of adapting to different users and cultures it gets.
This also means that we can assemble into a unique profile information originating from any number of sources. There is in fact no need for expressions and definitions to be unique. We can set as many equivalent pieces on both sides of the equation, and all of them can have a translation chain of their own. This not only addresses synonyms and alternate definitions, it also allows comparing information from different sources.
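A rough Python sketch may help to picture a profile as an equation with any number of pieces on each side, each carrying its own translation chain. Everything here is hypothetical (the Entry and Profile classes, the render method, the plain language strings); the real engine assembles multimedia_text objects through tree structures rather than nested lists.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """A piece of content plus the translations derived from it."""
    content: str
    language: str
    translations: list["Entry"] = field(default_factory=list)


@dataclass
class Profile:
    """Language independent anchor; both sides of the equation are optional."""
    expressions: list[Entry] = field(default_factory=list)
    definitions: list[Entry] = field(default_factory=list)

    def render(self, language: str) -> dict[str, list[str]]:
        """Collect whatever is available on each side for one language."""

        def collect(roots: list[Entry]) -> list[str]:
            found, stack = [], list(roots)
            while stack:
                entry = stack.pop()
                if entry.language == language:
                    found.append(entry.content)
                stack.extend(entry.translations)
            return sorted(found)

        return {
            "expressions": collect(self.expressions),
            "definitions": collect(self.definitions),
        }


cat = Profile(
    expressions=[
        Entry("cat", "English", [Entry("gatto", "Italian")]),
        Entry("kitty", "English"),  # a synonym sits beside the first expression
    ],
    definitions=[Entry("a small domesticated feline", "English")],
)
in_english = cat.render("English")
in_italian = cat.render("Italian")
```

The Italian rendering has an expression but no definition, which is perfectly legal since neither side of the equation is mandatory.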
We shall build a lot upon this, as you'll see.
Linguistic independent classes
Being able to collect an immense amount of linguistic polymorphism is surely nice, yet what makes a cataloguing system strong is its capability to classify things. The capability to record that a cat is a feline, is a mammal, is an animal.
In the real world any concept can be used to classify any other. In order to use a profile to classify another in the OWm2 storage engine, a user must say that he/she is willing to use the first as a class object. This is meant to limit the number of classes, to avoid turning GUI widgets into a nightmare.

When you say that a given profile is also a class you may choose to set it abstract. If you do so it means that this class object cannot be used to directly classify simple profile objects, but only to classify other class objects in a taxonomical tree.
This may sound quite obscure, so we had better give a clear example of the implications. When classifying animals, among many other things you may want to say that "tiger" is a "feline", is a "mammal", is an "animal". In just the same way you may be interested in stating that German is an Indo-European language.
Yet you might not want to spend ages tagging every single German expression as Indo-European, or all mammals as animals, and you might not be satisfied if anyone said that "tiger" is simply an "animal". You may want to state once and for all that "feline" is "mammal", is "animal" and be done with it. And you may want your GUI widgets to propose only analytical class objects, while keeping the more generic conceptual layers in an invisible implied background.
The answer is in defining these class objects as abstract. Abstract class objects can build up tremendously powerful semantic frames, while keeping the number of choices in widgets to a bare minimum. Since class objects are classified by tree structures, like anything else, you can always change and move your classification structure later on.
Using such implied background information can make your data extremely powerful, because while nobody will be offered a chance to use an abstract class object to classify a given profile, anyone will be able to use these class objects as search terms. So a comparative linguist, for example, will be able to extract in one go all Indo-European expressions for "cat".
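The difference between what a widget offers and what a search can use may be easier to see in code. The Python sketch below is deliberately simplified: it gives each class a single parent (the real structure allows more), and the data layout and function names are hypothetical.

```python
from typing import Optional

# class name -> (parent class or None, is_abstract); hypothetical layout
CLASSES: dict[str, tuple[Optional[str], bool]] = {
    "animal": (None, True),
    "mammal": ("animal", True),
    "feline": ("mammal", False),
}

# profile -> class object it was directly assigned to
ASSIGNED = {"tiger": "feline"}


def widget_choices() -> list[str]:
    """Only non-abstract classes are offered for direct classification."""
    return sorted(name for name, (_, abstract) in CLASSES.items() if not abstract)


def ancestors(class_name: str) -> list[str]:
    """Walk up the implied background, nearest class first."""
    chain = []
    parent = CLASSES[class_name][0]
    while parent is not None:
        chain.append(parent)
        parent = CLASSES[parent][0]
    return chain


def search(class_name: str) -> list[str]:
    """Any class, abstract or not, can be used as a search term."""
    return sorted(
        profile
        for profile, direct in ASSIGNED.items()
        if direct == class_name or class_name in ancestors(direct)
    )
```

Nobody ever tagged "tiger" as an "animal", yet search("animal") still finds it, because the abstract layers are stated once in the class structure.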
Once again, all class objects include a profile object that is responsible for their linguistic manifestation. So you can translate them and define them like any other concept, because concepts are what they are.
What we build is more than a traditional taxonomy. Many relations can be expressed as a directed graph, as shown in the attached picture. Such a structure allows for inclusion of a class into any number of traditional taxonomy trees.
In the example we see how the class object medicinal herbs can be included in the wider class object medicine by a number of different paths.
In fact both the following taxonomies are expressed:
This gives our classificatory structure a better degree of flexibility and allows for complex systems to be efficiently represented. But it is not enough, yet, as we shall see. Let's consider one more example, shown in the third attached picture. Here we deal with agriculture and soil classification.
A particular kind of soil, the Alfisol, is so called because of the presence in it of Aluminium and Iron. The profile objects "aluminium" and "iron" are surely included in the conceptual "base" that can lead to fully grasp what an Alfisol profile really is. But is this inclusion hierarchical? Can we say that "everything about iron" should be labelled as "belonging into an Alfisol class object"? Most people will say that no, we cannot.
A hierarchical categorization is not the best tool to express the relation; here we need something more like a WWW link. Something that suggests you may also have a look at the information about Iron, but that does not automatically fit the Iron profile into the list of components of the "Alfisol set". Good candidates for this set are, instead, landmarks where Alfisol is common, or specialized agricultural techniques for this peculiar kind of soil.
So what we have is a non-directed link, which we expressed as a dotted line in our last example. In mathematical terms, we shall say that we use both a directed and a non-directed graph to map semantic values. The resulting table of combinations (to remain in mathematical terms we would say "the resulting incidence matrix") is not simply composed of Yes/No values, but rather of any pick of empty/directed/non-directed. And what we have, in practical terms, is the capability of expressing a relation between two profiles that either:
To optimize space consumption we obviously do not store empty values, so all we have is the cells of the matrix that contain a relational value. At design time we decided that this is "enough". This was simply the designer's decision, and certainly not a law of physics, but for all practical purposes this structure really seems to do all a dictionary needs.
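Storing only the non-empty cells of such a matrix can be sketched in Python as follows. The RelationStore class, its method names and the "alfisol farming techniques" example are hypothetical; the sketch only illustrates the empty/directed/non-directed distinction and says nothing about the real table layout.

```python
from typing import Optional


class RelationStore:
    """Sparse store: a pair that is not recorded is simply an empty cell."""

    def __init__(self) -> None:
        self._directed: set[tuple[str, str]] = set()    # (member, container)
        self._non_directed: set[frozenset[str]] = set() # unordered pairs

    def include(self, member: str, container: str) -> None:
        """Directed relation: member belongs under container."""
        self._directed.add((member, container))

    def link(self, a: str, b: str) -> None:
        """Non-directed relation: 'you may also want to look at'."""
        self._non_directed.add(frozenset((a, b)))

    def relation(self, a: str, b: str) -> Optional[str]:
        """Return 'directed', 'non-directed' or None for an empty cell."""
        if (a, b) in self._directed:
            return "directed"
        if frozenset((a, b)) in self._non_directed:
            return "non-directed"
        return None


store = RelationStore()
store.link("alfisol", "iron")
store.link("alfisol", "aluminium")
store.include("alfisol farming techniques", "alfisol")  # hypothetical profile
```

A non-directed link reads the same from both ends, while a directed inclusion is only found when asked in the direction in which it was recorded.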
Next we shall see that some classes are more classes than others.
System vs user classification
Some classes are heavily used by the system for its own internal jobs. There was no point in developing two separated classification systems, one for humans and one for the engine, because basically they do the same thing, only with a different meaning.
When the system assigns a given content to the "English" class object, it means that it is partially expressed in English (if you remember, the multimedia_text object only indexes the prevalent language). Yet a Hindi native speaker may classify (in Hindi) the profile "Saxon genitive" with the very same "English" class object, because it is related to the English language. And he/she may do it while remaining totally immersed in the Hindi linguistic phase.
Both uses are obviously correct, and it is important that the system can tell the difference between the two. In this case we could use the fact that only a multimedia_text object is classified as linguistically dependent, while whatever is said of a profile is obviously a "normal human classification", yet there are more subtle logical traps ahead.
All object objects in Ambaradan bear licence information. Most of them simply say they use a default regional licence, yet potentially all can state an exception. So what happens when you assign a given profile to the class object "CC-BY"? Are you saying that this profile is related to this particular licence, or are you rather stating licensing information proper?
You cannot tell, unless you state clearly which classification activity was made for what. So any time the system records the assignment of an object to a given class object, it also states whether it is doing so for its own internal purposes or as a result of human interaction. In the API you'll find these purposes named "user" classification vs "copyright" and "linguistics" classification.
For all practical purposes we maintain three parallel classificatory layers in the system. Two are service oriented, and classify objects by language, script and licensing information; the other is the semantic machine by which a human user may state that "this is related to that". All layers share exactly the same software.
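The "CC-BY" ambiguity disappears once every assignment carries its purpose, as the following Python sketch shows. The three purpose labels come from the text above; the Purpose enum, the Classification record, the object ids and the helper function are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    USER = "user"
    COPYRIGHT = "copyright"
    LINGUISTICS = "linguistics"


@dataclass(frozen=True)
class Classification:
    object_id: int
    class_name: str
    purpose: Purpose


RECORDS = [
    # A human says: this profile is about the CC-BY licence.
    Classification(42, "CC-BY", Purpose.USER),
    # The system says: this object is licensed under CC-BY.
    Classification(43, "CC-BY", Purpose.COPYRIGHT),
    # The system says: this content is expressed in English.
    Classification(43, "English", Purpose.LINGUISTICS),
]


def classes_of(records: list[Classification], object_id: int, purpose: Purpose) -> list[str]:
    """Classes assigned to one object within a single classificatory layer."""
    return sorted(
        r.class_name
        for r in records
        if r.object_id == object_id and r.purpose is purpose
    )
```

The same class object and the same record shape serve all three layers; only the purpose value tells them apart.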
Special classes
Special class objects are created for the system's sake only, as they include the data needed by the system to perform special internal operations. They are made by wrapping a normal class object into a larger container that is basically used only by the system. Users still see them and use them as ordinary class objects.
In particular, two special classes map the "legal input" for all linguistic content in the system. They do so by building a table that says which language/script couplings are allowed. There is indeed no point in storing English content in Cyrillic transliteration, but Serbian, for example, must map the possibility of both Serbian/Latin and Serbian/Cyrillic. Japanese has up to 4 possible variants and they must be kept well ordered and distinguishable from each other.
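A table of allowed couplings boils down to a membership test, as in this Python sketch. The table contents are illustrative only (the English/Latin row in particular is an assumption, as are the label strings and the function name), and the real system builds the table from special class objects rather than a literal set.

```python
# Hypothetical "legal input" table: (language, script) couplings that are allowed.
ALLOWED_COUPLINGS = {
    ("Serbian", "Latin"),
    ("Serbian", "Cyrillic"),
    ("English", "Latin"),  # assumed row, for illustration
}


def is_legal_input(language: str, script: str) -> bool:
    """Content is accepted only if its coupling appears in the table."""
    return (language, script) in ALLOWED_COUPLINGS


def scripts_for(language: str) -> list[str]:
    """List the scripts a given language may be stored in."""
    return sorted(s for lang, s in ALLOWED_COUPLINGS if lang == language)
```

With such a table, English content in Cyrillic transliteration is rejected while both Serbian variants are accepted.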
When reading the API documentation you'll never see expressions like "Language" or "Script", though. This happens because the system is ready to accept non-human languages, too. We are not speaking about Martian invaders, obviously, but rather about software.
There are a number of protocols that allow representing mathematical expressions, molecular 3D rendering etc. Calling these "languages" or "scripts" would have been (at the very best) improper, so we decided to use more generic labels. This is the reason why in the API you meet
So now the circle is closed, and we finally came to see how a multimedia_text object knows in which language it is expressed.
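The "legal input" table described above can be pictured as a set of allowed couplings. This is only a sketch: the names ALLOWED and is_legal_input are hypothetical, the words "language" and "script" are used for readability even though the real API uses more generic labels, and the codes are standard ISO 639-3 and ISO 15924 codes chosen for the illustration.

```python
# Hypothetical table of allowed language/script couplings.
# The Serbian and English entries follow the examples given in the text.
ALLOWED = {
    ("srp", "Latn"),  # Serbian in Latin script
    ("srp", "Cyrl"),  # Serbian in Cyrillic script
    ("eng", "Latn"),  # English in Latin script
}


def is_legal_input(language, script):
    """Tell whether content in this language/script coupling may be stored."""
    return (language, script) in ALLOWED
```

Under these assumptions is_legal_input("eng", "Cyrl") is rejected, while both Serbian couplings are accepted.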
Original vs sourced material
You might have noticed that we spoke about "sources" in the previous pages. It is time to explain what we meant. When users insert content in the storage engine, they may either mark it as an original content they made by themselves or make it a "quote".
A quote is content that is not supposed to be edited, unless improperly quoted. It can be translated, obviously, but the original should not be altered. This is especially efficient when importing public "collections", because the collection as such is usually copyrighted and the related licensing information must be properly stored and shown.
This is also an efficient way to mark that a given definition is "authoritative", for example when issued by a public Board that is in charge of defining some kind of standards. An authority object is by all means a usual class object, but defining it as an authority tells the system that it can be used to source objects.
Users that feel that an "authoritative" definition is not the best tool to define a profile object may still freely add other original definitions as they feel fit.
A peculiar use for this mechanism consisted in marking the import of all the content originally stored in the Omegawiki database.
The network layer
This is where true geeks would have started their explanations from. We rather chose to present it here, when at least the subject of such extended logorrhoea is finally clear. If you are not interested in the co-operative mechanism included in the OWm2 storage engine you may happily skip this page.
We said that the object object is included in everything we store, like an engine that performs lots of useful tasks while many drivers simply ignore most of its details. Well, inside an object object there is yet another hidden engine, the network_object object. This really is the basement layer, there is nowhere deeper than here.
A network_object knows everything that is needed to move objects along different nodes in the extended network. It carries identifiers that have a universal validity, it identifies where and by whom an object was created and it is responsible for keeping a history of the larger object built on top of it, any time this larger entity gets altered in any of its parts.
Since it is the ambassador of any item towards the network at large, a network_object has much more awareness of the total entity in which it's included than any other part of it. It must have it, because the network daemon needs to know which is which, to correctly organize the sequence of its operations.
It is in fact relevant to introduce new elements in such an order that will ensure that any new object will find in place all the elements it needs to work. For example, when a new language object is created and used to classify content, it is vital to import this object before we import the content that it will classify.
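The ordering requirement above is a dependency ordering, and can be sketched with a topological sort. The object names and the depends_on map are invented for the example; how the network daemon actually computes its sequence is not described here.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each object lists what must already be in place.
depends_on = {
    "content:hello": {"language:eng"},  # content classified by a language object
    "language:eng": set(),              # the language object needs nothing else
}

# static_order() yields every object after all the objects it depends on.
import_order = list(TopologicalSorter(depends_on).static_order())
```

In this sketch the language object always comes out before the content it classifies.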
The network_object is also the place in which a user stores his/her commands to the network daemon. A user may want to keep some content private, or he/she may be unhappy with what other editors or administrators out there are doing with a given object.
The network_object is the engine that can perform an unlink operation, thus ensuring that a particular object is disconnected from the global network and that no matter what other people do with it, the user will never see it changed again (but obviously the user can still edit it locally). Since literally everything contains a network_object, users can unlink from the network whatever they please.
Unlinked objects are always returned at the top of local searches. This is no ideological position, but a simple practical consideration: people usually protect those things they most care for. It would be very annoying if those very things were buried at the bottom of their search results.
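The ranking rule for unlinked objects can be sketched as a stable sort. The dictionary layout and the "unlinked" flag are assumptions made for the example, not the stored format.

```python
def order_results(results):
    """Put unlinked objects first, keeping the original order otherwise."""
    # sorted() is stable, so the existing ranking survives inside each group;
    # False sorts before True, hence the negation.
    return sorted(results, key=lambda r: not r["unlinked"])


results = [
    {"id": "a", "unlinked": False},
    {"id": "b", "unlinked": True},
    {"id": "c", "unlinked": False},
]
ordered_ids = [r["id"] for r in order_results(results)]
```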
We built this software with several goals and guidelines in mind. They all came from joining a large number of "books of dreams" and initially people reacted quite sceptically to this list. Yet, all that we are using today as everyday tools was "crazy dreams", just some 5-10 years ago.
Here is the list of requirements we started from:
No digital barrier, so no stoppers for those who cannot connect to Internet. People shall be able to fully work off-line. When no Internet connection is available it must be possible to send and receive content exchanging supports like RAM keys. This is especially vital for countries in which expensive public infrastructure cannot be expected to grow out of thin air overnight.
We assembled our pick of technologies based on these requirements. In the next chapters we shall discuss what layers are responsible for co-operative content management and what pre-existing tools have been chosen to perform the related tasks.
The first thing we wanted was a safety exit: the chance for us to "change components" with (almost) zero impact. On assembling different pieces of pre-existing software one often stumbles into surprises, so we wanted things to integrate loosely enough for us to swap one piece of software for another without actually compromising the general architecture. We also wanted high redundancy, which means that we wanted to have many copies of the data, so that if a node or part of it gets damaged there are still good chances to get the data back.
The final result is shown in the attached graphic. Rising up from basement level what you have is:
Any GUI that can manage front-end relations with the user. There can be any number of them, and they can be desktop applications, macros for Open Office or local intraweb applications, you name it. They all exchange messages with Amabaradaemon, which in turn delivers them in proper format to the package that executes them.
We shall now spend a few words on some of the components, to see more in detail what they actually do.
Questions about our choice of Freenet as a network transport may range from Freenet's "pirate" reputation to more technical aspects, like the time it takes for a node to become fully functional. We welcome the broadest possible discussion, because it will surely help in making sounder decisions.
Buccaneers, le Carré plots, 5th columns and all that jazz
We use Freenet technology, but do not host ANY of Freenet's content. OWm2 is by all means an independent network. We simply use the software and get all the updates and bug fixes official Freenet gets. Technology and the use people make of it are two different things. Cellular phones are sometimes used to detonate bombs in public places. You haven't thrown away your mobile phone because of this, have you?
It's written "strategy", it reads "money"
A geek's view
Freenet has been actively developed and maintained since 2002, it is quite stable and it is frequently attacked by well organized spammers, so a lot of development has gone into making a spammer's life difficult. There are many other networks around, but this one seems to have undergone a good deal of bug-fixing, plus, its developers are extremely friendly and co-operative. Using their product takes an enormous load off our backs.
What exactly do we use Freenet for?
Freenet is mostly a cache for us. Multimedia files are published onto the net and keep circulating among nodes, where chunks of them are locally cached. This means that most popular files are more widely cached and will be retrieved more quickly, while "niche" content may need to be regularly reposted by the daemons, to make sure it's still cached. A good strategy for efficiency is downloading content you want to cache. Future releases will study a system to make this an automated procedure.
Using a distributed repository means that you don't need to be connected to a central machine to write your updates. Instead, you write them locally, and once in a while you can commit them to a global container, from which they will diffuse to all other users. None of your updates gets lost when you are not connected to the network at large, you simply cannot deliver them for a while, and that's it.
All our objects are stored as XML representations. Putting them all into just one repository would quite defeat one of our main goals: making sure a user downloads just what is needed. As a matter of fact no user needs the whole networked knowledge base, most of them need but some 5% of it, many will need much less. So instead of one big fat repository we use many smaller specialized containers that are kept on different nodes, thus minimizing the traffic load a single node bears.
As previously said, we offer the chance to have the big bulk of the content in a few languages, while a user may decide to get the semantic classifiers in a much wider linguistic range. So we need to build independent containers for class objects and simple profile objects. None of them has any linguistic content, as we already know, but they build the logical structure we use to decide what linguistic content is needed.
Once we have the full logical structure in place it is trivial to match it against the node's linguistic subscriptions to build our download list. As we see in the attached picture, there is one independent content repository per language. The whole set of repositories forms what we call a "region", that is, a coherent data subset. As we shall see in the relative documentation, regions can be combined with each other to form larger datasets.
Let's see how many independent repositories a "region" implies. The number of languages registered in ISO 639-3 amounts to 7,622. Each of them will need 2 repositories (one for class objects, the other for profile objects), plus 2 repositories for the linguistic independent information. This makes a theoretical total amount of 15,246 repositories. It would be fantastic to need them all, yet actual numbers aren't even remotely close to this mark. Content produced by the Omegawiki project in years of work covers 183 languages in all, for a grand total of 368 physical repositories.
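The arithmetic above can be checked in a couple of lines. The function name and its parameters are made up for the example; the figures are the ones quoted in the text.

```python
def repositories_per_region(languages, per_language=2, language_independent=2):
    """Two repositories per language plus two for language-independent data."""
    return languages * per_language + language_independent


theoretical = repositories_per_region(7622)  # all languages quoted for ISO 639-3
omegawiki = repositories_per_region(183)     # languages covered by Omegawiki content
```

Both results match the totals given in the text, 15,246 and 368.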
How multiple datasets can combine
In the documentation about regions you will find how users can choose to use several regions, combining them into a single database. This happens at MySQL level, while at the level of Mercurial repositories data do not mix at all. All objects are identified by a UUID, which is also the name of the file that identifies them in the repository, and a given UUID belongs to one and only one region (that is, it is contained in only one set of repositories).
Regions can specify a set of "included regions", just like a software package has dependencies. Regions like "Biochemistry", "Physical chemistry" and "Neurochemistry" may all benefit from including a more general "Chemistry" region. Quite obviously, such a complex data organization cannot be achieved overnight. Chances are that the need to move and organize objects will come only once a given critical mass is reached. Administrators will need to split regions that become too big or simply too confused to be useful.
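Resolving "included regions" works much like resolving package dependencies. The sketch below is only an illustration: the includes map and the resolve_regions helper are hypothetical, and the region names are the examples from the text.

```python
def resolve_regions(region, includes, seen=None):
    """Return the region plus everything it includes, directly or indirectly."""
    seen = set() if seen is None else seen
    if region in seen:  # already visited, which also guards against cycles
        return seen
    seen.add(region)
    for included in includes.get(region, ()):
        resolve_regions(included, includes, seen)
    return seen


includes = {
    "Biochemistry": ["Chemistry"],
    "Physical chemistry": ["Chemistry"],
    "Neurochemistry": ["Chemistry"],
    "Chemistry": [],
}
needed = resolve_regions("Biochemistry", includes)
```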

What if a given source wants to use this channel to "sell data" or if corporate users want to restrict access to their data? This will be possible in a next release. Private content is to be distributed and cached along the net, so the situation cannot be compared to accessing some well identified server. It is rather like receiving satellite TV broadcasts. The content goes around to each and everyone, but only some users can make any use of it. Part of the content may be public (the advertising messages that many channels send out), while the rest can be restricted.
Only authorized receivers will be able to use encrypted content. Keys may be sold, issued/revoked on the fly and/or given a validity in time. Such content may be efficiently included in the published set of repositories without compromising the seller's interests. Users obviously download only the encrypted repositories they have a key for. Nothing keeps them from downloading the others, too. Yet, they would only waste bandwidth if they did.
We also need to explain how a region can add semantic classification to profile objects that belong to another. This will be a constant need. If you build a region called "animals", the region "sports" will have a lot to say about it in terms of classification and relations. How can it do it? The answer is in using "headless" information.
An object is created by an XML generator that includes all data about the object, thus including semantic classification. "Headless" information includes only information about semantic classification and content. The object itself remains a property of another region, but the current region can "expand it", provided that the user already has the head information from another source (another region, that is). Whether new information should be added to the owner region or kept in another as "headless" is the user's decision.
Considering that information is constantly scanned and retrieved, even circular calls are solved. On the first scan, headless information that has no object to which it can be attached will not be processed, but the second scan will find the heads in place and will add the information.
A similar process can be used to treat encrypted information. The overall number of our repositories slightly grows, but we gained a lot of functionality by such a small growth in size. Now specialized regions can add specialized expressions and definitions, they can use other regions' simple profiles as classes, and they can add specialized classification to just any general object.
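The repeated-scan handling of headless information can be sketched as follows. All names (scan, heads, pending, incoming_heads) and the data layout are assumptions made for the illustration, not the engine's actual code.

```python
def scan(heads, pending, incoming_heads=()):
    """Run one scan and return the headless items that still have no head.

    heads:   dict mapping an object UUID to the list of information attached to it
    pending: list of (uuid, info) pairs of headless information
    """
    for uuid in incoming_heads:  # heads that arrived since the last scan
        heads.setdefault(uuid, [])
    still_pending = []
    for uuid, info in pending:
        if uuid in heads:
            heads[uuid].append(info)            # head in place: attach it
        else:
            still_pending.append((uuid, info))  # keep it for a later scan
    return still_pending


heads = {}
pending = [("uuid-tiger", "classified by: mascots")]
pending = scan(heads, pending)  # first scan: no head yet
after_first = list(pending)
pending = scan(heads, pending, incoming_heads=["uuid-tiger"])  # second scan
```

Nothing is lost on the first scan; the information simply waits until its head shows up.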
An important consequence of this approach is that some of your objects may vanish because someone out there moved them onto a region you do not include in your subscriptions. When such is the case each user (either individual or corporate) may either add the new region or simply unlink the original object and keep it. When this happens at corporate level it may generate duplicates, and admins in this case should establish a relation between the two profile objects, rather than forcing a merge they cannot force on people anyway. In an OWm2 storage engine admins propose a solution, but they cannot impose it.
If users gather that a given choice is sensible they will join it. So no planned economy here, as it will always be the users to eventually determine which is which, rather than the admins. And no admin or mob rule can force a point of view on individuals and minorities. At least, not as long as people take the time to review what others are doing. But that's life, we all have but the rights we are prepared to defend.
In an OWm2 storage engine admins propose a solution, but they cannot impose it... hmmm, it sounds like a dangerous thing, doesn't it? It actually only means that there is an endless number of absolute powers that cannot overlap each other. It means good fences, not chaos.
So this is no security leak, if you come to think about it. It's true that a malicious attacker can start a new independent region and include all sort of weird content in it. So what? He is not damaging any of the existing content of other regions, and it's only up to end-users to decide whether this new region is worth downloading or not.
Spammers may be annoying, but they are not stupid. Even the least sophisticated spammer will not want to spam his own self to death in the absence of a public. He will want to spam an existing region, something people read and use. This happens to be quite difficult, though. All new nodes are born greylisted, which means that what they post never gets automatically pulled into the distribution engine, until someone has verified that "this is human". So yes, there are billions of ways in which a malicious hacker can inject content into his OWN node, but this content is not going to reach a public repository on its own.
If the region admins so decide, the node will become whitelisted; if the situation is unclear it may remain greylisted, or it can be blacklisted for good. Once it is blacklisted, everything it sends to the region will be automatically trashed. How the admins decide to keep or trash things is only up to them. They manage their own region and set policies as they please. One may well decide to open a read-only region, with one and only one authorized editor.
So, before fooling anyone, a spammer has to work for a while to produce content that will please the local human testers. After that he will get the greenlight to insert without checking. Yet the system will greylist him again automatically if he suddenly exceeds his usual input rate. One cannot just do a bit of casual work and happily flood a whole nation right after that. If he wants to send out like 100 spam profiles a day he first has to regularly produce an equivalent frequency in good profiles, and he must keep his useful activity up for days.
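The automatic return to the greylist can be pictured as a check of today's input against the node's track record. The text only says that a node is greylisted again when it "suddenly exceeds his usual input rate"; the use of a plain average, the tolerance factor and every name below are assumptions made for this sketch.

```python
from statistics import mean


def next_status(status, daily_history, today, tolerance=2.0):
    """Return the node status after today's input.

    daily_history: numbers of profiles posted on previous days
    tolerance:     hypothetical multiplier applied to the usual daily rate
    """
    if status != "whitelisted" or not daily_history:
        return status  # only whitelisted nodes with a track record are re-checked
    usual = mean(daily_history)
    return "greylisted" if today > usual * tolerance else status
```

With a history of about five profiles a day, a sudden batch of 100 sends the node back to human review, while seven profiles pass unnoticed.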
Now, isn't this censorship, isn't this an open road to "mob rule", to local communities deciding whatever they want and forcing their decisions on minorities and individuals??? The answer is a big capital YES. As we already said, we use Freenet technology but we are not Freenet. There is NO common policy about what is spam and what is not. Regions are 100% free to set their own standards, and anyone who doesn't like the standard is free to open another region with a totally opposite set of rules. Nobody needs anyone's approval for this.
To have good chances to remain unseen a malicious attacker needs to hide an amount of litter in the middle of a majority of good material. Such an attack vector forces admins to carefully examine things. The result is worth the effort anyway, because when they trash the litter they are still left with the good content the spammer used as a shield. So many regions may welcome at least smart spammers, after all.
Does this mean that no anonymous contributions are allowed? No. It means we don't need to deal with floating IPs etc. Nobody can tell who you are when you run an OWm2 node, but your node UUID is fixed. If you add content twice it will twice bear the same signature. You can reinstall the node from scratch and get a new UUID, but then you start from being greylisted again...
Power and how we cook it
Admins have decisional power over THEIR region, and that's all they rule. The OWm2 storage engine has no central server to which one must be "admitted" in order to be hosted. There cannot be any "overall community", council, foundation or company in charge of deciding whether to host a project or not. Even if such an entity appeared out of thin air our software would not allow its decisions. So if one doesn't like the way a region is administered, all he has to do is to open his own. Nobody here gets more censorship than he is willing to accept.
Needless to say, freedom has a price. A new region will need content, users and anti-spam administration, i.e. hard work, marketing and the infinite boredom of revising incoming stuff. One could default whitelist access to all newcomers, but spammers never sleep, and his content would soon be so littered that most people wouldn't be interested in it. Let's be very clear: we offer freedom and independence, we don't serve free lunches. Users don't pay any money for hosting (at least not to us), but it's fully up to them to provide and organize all the labour it takes to keep their stuff running.
A small case study
Now let's abandon the theoretical field to take a dive into the real world. Say user X joined region Y and found out that an entry in the dictionary is (according to his point of view) downright wrong. He immediately corrects it and drops in "some basic common sense". He feels a happy man.
Too bad that after just a few hours his work is gone and the entry bears exactly the same wording that offended him in the first place. Together with this unpleasant refusal he finds out he has been blacklisted. Pretty rude, sure, but it may well happen. Admins are people, and people sometimes are downright rude.
At first user X refuses the admins' decision and keeps his content on his node. Later on he realizes that he is simply speaking aloud in an empty room while others carry on what can only be labeled as a blatant case of mass disinformation. So he bravely decides to open a new region.
He imports region Y's contents into his new creature and starts editing all he does not agree with. Will his edits reach the masses? Mind you, no, they certainly won't. His new region doesn't own that content, so whatever he does is still trashed as blacklisted. His edits still happen in region Y!
Blind alley? No, a simple filter against stupid edit wars. What his new region can do is to add headless information to region Y's objects. Other expressions, definitions, classifications and relations. Alternative stuff, that is. This new content will belong to HIS region, in which he and his friends rule.
Did he reach the masses this time? Well... Reaching the masses is hard work, once the content is in place one still needs to market it. He will need to tell people about this new region, and only his success in the process will determine the size of the masses he will reach.
Do we mean there is no such thing as "the truth"?
The reader may wonder what's the point in having a knowledge base that says everything and the opposite of everything. How can we publish things that simply exclude each other, like Darwinism and Creationism? Shouldn't we guarantee some form of "objective information"? No, we shouldn't. No big philosophical issues here, our strategy is purely marketing driven:
Translating things does not require sympathizing with the text. A faithful translation is based on a deep understanding of the source, though. So the more different (even clashing) points of view are represented and translated, the better for a dictionary. We repeat it and underline it: we make a DICTIONARY here, not an encyclopaedia, so no POV/NPOV for our system.
Besides, users get as many regions as they choose. If one doesn't want to read material inspired by sect XY all he has to do is to avoid subscribing to the Church of XY region. Yet, pretty often one may have good professional reasons to need stuff he actually hates.
Think about it: a scientist may question a definition in chemistry that stems from some weird alchemy books. Yet, editors who publish alchemy books exist, and they all pay translators who need specialized dictionaries that give them "the proper language for the subject". Sometimes they need it with a serious tinge for expensive editions, sometimes they work at an instant book... There is no limit to what can be useful for a translator. This is the main "commercial" reason for us to favour the birth of openly "confessional" regions. They may be born as "confessional", but they are eventually useful as "specialized".
In short
Everyone is welcome to start a region that accepts only what they rate to be the truth, the best information level, the one and only sane approach to reality. Actually most regions will do exactly that, because this is what our system has been designed to do. It is up to each individual end-user to decide who is sane and who is not, the system will never hint at a ready-made solution to such a question.
Everyone is also welcome to open a region that will finally state "the truth about OWm2", to properly inform the audience about all the things we deliberately (or not) hid from their attention. When this happens we will rate it as our best success.
Let us start from individual users, before we move to the corporate level. As a note on the potentially endless debate on what a point of view is, we can only recommend that you reflect once more on the concept of linguistic phase space. On treating "the right to hold one's point of view" we use a very similar concept called semantic phase space.
What we mean is that the profile "water" is not only emerging as a different entity depending on its current linguistic phase, but also depending on its current semantic phase. There is no particular warranty that two semantic phases will remain mutually intelligible within the same linguistic phase.
By all practical means, a user may well need a translation from English to English, in order to make any use of what was originally born in deeply specialized terms. It's indeed not many people who will immediately understand the scientific version of the common phrase "liquid water is wet". And even if they could understand it, chances are this is not what they are looking for. Both the individual and the corporate user may want to choose one or more preferred semantic phases.
What we aim to build is an adaptive content network. Something an end user may tune both in terms of linguistic and semantic phases. This implies some tool by which a user can tell the system what is relevant for him/her, and any such tool needs a well identified set of parameters on which it can operate.
We are not replicating the way human brains work. Human brains work as fractally complex parallel systems that simultaneously process a combination of analogue and digital information at an incredible amount of different levels. We have no such resource at hand because we are using a Von Neumann machine, so we simply try and make a practical use of the data we already have.
Now let's see what we can use to make a filter. We have linguistically dependent information, that is, the extended classification by which our system tracks any content's linguistic phase. We also have a linguistically independent vector, the thing saying that a tiger is a feline is a mammal is an animal. We shall call this a semantic vector.
A profile object, as we see on the graphic, has no direct knowledge of either of the two vectors. So a profile object is portable anyway, no matter what happens to the semantic and linguistic values that were attached to it. If there is nothing stopping us from sending it out, the next step is to define how a user will say whether he/she wants to receive the object or not.
Suppose a user speaks wonderful Thai and Khmer, but he/she has no notion whatsoever of English. This person may well decide to ignore whatever semantic classification other people made, as long as the involved class objects cannot manifest themselves in a linguistic phase he/she can use. This implies that such a person's semantic content will be full of holes, holes that the user can fill by making his/her own new profile objects used as class objects for semantic classification. In short, since no แมว class object appears on the screen and emersions like cat, кот and ネコ were ignored as useless, the user proceeds to make a new แมว class object on his/her own.
Now wait a minute, isn't this a whole lot of silly dupes?? Well... yes, absolutely. Yet, we must live with this problem anyway, no matter whether we broadcast all content or not. Even if we sent this guy terabytes of classification just to make sure that "no holes are there" those holes would vanish only from his/her local database. The user's linguistic phase space would not extend a bit in the process (just think of the use most English speakers can make of emersions like кот, ネコ, قط ,חתול or შინაური კატა) and thus duplicate profile objects would still be created in huge lots.
This is something that only multilingual human personnel can solve, by patiently marking candidates for merging whenever they meet them. The process will always be long and difficult, since no person on earth speaks ALL the possible linguistic phases we can store.
We also have the case of someone who can use one or more languages that he/she doesn't want to download. Say this person is a translator from Bulgarian to Hindi, looking for a dictionary on construction engineering. This user will mostly ignore Dutch terminology, but maybe he/she can speak Dutch, even if just enough to make use of a class object emerging like Spoorweg (Rail transport). This user may well want to ignore all profile objects that cannot emerge in the Bulgarian or Hindi linguistic phase, but decide to accept class objects that can emerge in Dutch.
We encourage people to use the widest linguistic phase surface for semantic classification, as this dramatically accelerates the process leading to a compact set of profile objects. Profiles that have the same kind of detailed semantic classification and no overlapping linguistic phase are automatically sent for evaluation to the admins. Maybe they are not "exactly the same thing", but they are somehow closely related anyway, so it's worth giving them a closer look.
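The automatic selection of profiles for evaluation can be sketched as a pairwise comparison. The text speaks of "the same kind of detailed semantic classification"; this sketch simplifies that to identical sets of classes, and the profile layout and all names are assumptions made for the example.

```python
from itertools import combinations


def merge_candidates(profiles):
    """Return pairs with identical semantic classes and no language in common."""
    pairs = []
    for a, b in combinations(profiles, 2):
        same_classes = bool(a["classes"]) and a["classes"] == b["classes"]
        if same_classes and a["languages"].isdisjoint(b["languages"]):
            pairs.append((a["id"], b["id"]))
    return pairs


profiles = [
    {"id": "p1", "classes": {"feline", "mammal"}, "languages": {"tha"}},
    {"id": "p2", "classes": {"feline", "mammal"}, "languages": {"eng", "rus"}},
    {"id": "p3", "classes": {"feline", "mammal"}, "languages": {"eng"}},
]
candidates = merge_candidates(profiles)
```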
With this goal in mind we offer our users an extended linguistic phase surface for the network synchronisation of class objects only. Users do not get any "plain content" on extension phases, so the process eats up very little bandwidth and space while providing them with the best linguistic support for their semantic vector. Hmmm.... extended linguistic phase surface!? Now what the heck is that?? Okay, okay... Let's translate it from English to English: you can state a number of additional languages the system can use to send you class objects. These languages' usual content is of no interest to you, but they can be successfully used to give you classificatory information when such information is not available in any of your usual languages of choice.
So much for the linguistic vector. Now if you apply exactly the same filter to the semantic vector, you find yourself telling the system that you want to receive those profile objects that are classified by the construction engineering class AND by the Spoorweg class, maybe NOT those classified by the Stone rails included class (say you are not interested in archaeological stuff).
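The two filters can be put together in a small sketch of what a subscription might hold. The Subscription class, its field names and its methods are hypothetical and do not reflect the actual OWm2 API; the language codes are ISO 639-3 codes for Bulgarian, Hindi and Dutch, and the class names follow the translator example above.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    content_languages: set                                  # plain content wanted
    extension_languages: set = field(default_factory=set)   # class objects only
    require_classes: set = field(default_factory=set)       # AND
    exclude_classes: set = field(default_factory=set)       # NOT

    def wants_profile(self, languages, classes):
        """Profiles need a content language, all required classes, no excluded one."""
        if not (languages & self.content_languages):
            return False
        if not self.require_classes <= classes:
            return False
        return not (classes & self.exclude_classes)

    def wants_class_object(self, languages):
        """Class objects may also arrive in an extension language."""
        return bool(languages & (self.content_languages | self.extension_languages))


sub = Subscription(
    content_languages={"bul", "hin"},
    extension_languages={"nld"},
    require_classes={"construction engineering", "Spoorweg"},
    exclude_classes={"Stone rails"},
)
```

A Dutch-only class object is accepted here, while a Dutch-only profile is not, which mirrors the split between content languages and extension languages.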
When the network daemon connects to other nodes, it will know what to request based on such specifications. This is what OWm2 calls its subscription mechanism, which is the basic mechanism allowing individuals to pick their choice of available content. In the next chapter we shall finally see how this applies to corporate level, thus giving birth to our region objects.