The OWm2 storage engine is the place where most Ambaradan magic happens. It's the real brain of the system, as everything else limits its functions to data display and interaction with both the user and other nodes in the network. Whatever you do in Ambaradan eventually boils down to a sequence of commands that are executed here.
A spoonful of history
This storage engine was originally born as an upgrade for Omegawiki (hence the OWm2 codename), but in time it has evolved to such a conceptual distance from Omegawiki that at this stage they can only be presented as remote relatives.
The most radical changes happened in the late autumn of 2008, when it became clear that most administrative problems originated from inconsistencies in the first data design. This forced us to produce two complete rewrites of the codebase, and it basically meant losing all the work that had been done up to then.
It was quite an unpleasant decision, as we were already on a very tight schedule and there was no available funding, so the work had to be fully self-financed. At this point the only thing that remains of the original Omegawiki framework is the content that was stored in it; everything else was remade from scratch, including the most basic concepts behind the system.
In particular:
OWm2 in objects
Everything in OWm2 is an object, and as in all decent programming families (hopefully you got the self-ironic tone here) our objects are made of hierarchies, inherited values and behaviours.
To present such uncanny pieces of black computing magic, the usual geek convention is to start from the root element, write loads of pages on how powerful and generic it is, and then slowly get to things a simple user can understand. Too bad that most simple users have long quit reading, by the time things get to be interesting for them.
We obviously cannot build a house starting from the roof, yet we did try our best to build this guide with things that non-programmers can immediately grasp, without any need to get deeply involved in object oriented theology. You may want to read the following pages to get a good grasp of how this engine works.
Multimedia_text objects
The first object we meet is called multimedia_text. It is our main storage unit. It can contain any length of UTF-8 text (yes, any really means any) but it can also have no text at all and instead point to an executable file object.
This is an important feature, because it means that content can indifferently be in textual, pictorial, audio or video format and the system will treat it anyway for what it is: a piece of content.
What we do with textual content is quite ordinary: we store it as compressed text, so that it won't eat up your disk when you are hosting millions of content parts, and we decompress it when we serve it.
Life with files is more complicated, as there is no guarantee that a user who receives a video content part from the network is also willing to spend bandwidth downloading what potentially amounts to gigabytes.
So we do not store the file in this object. We simply keep a pointer to a file list that gives us a recipe for obtaining the actual file, if the user really wants to have it.
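The two storage modes can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name, the field names and the use of zlib are hypothetical and say nothing about how the engine actually compresses or stores text.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimediaText:
    """Hypothetical stand-in for a multimedia_text object."""
    compressed_text: Optional[bytes] = None  # compressed UTF-8 text (zlib is an assumption)
    file_list_id: Optional[str] = None       # pointer to a file list, never the file itself

    @classmethod
    def from_text(cls, text: str) -> "MultimediaText":
        return cls(compressed_text=zlib.compress(text.encode("utf-8")))

    @classmethod
    def from_file_list(cls, file_list_id: str) -> "MultimediaText":
        return cls(file_list_id=file_list_id)

    def text(self) -> Optional[str]:
        """Decompress and return the text, or None for file-based content."""
        if self.compressed_text is None:
            return None
        return zlib.decompress(self.compressed_text).decode("utf-8")


entry = MultimediaText.from_text("gatto " * 1000)
video = MultimediaText.from_file_list("hypothetical-file-list-id")
```

Here entry holds only the compressed bytes, while video holds nothing but the identifier of a file list, so receiving it costs almost no bandwidth.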
Wait a minute! How can we tell in what language this content part is? Or in what script? Well, the multimedia_text doesn't know anything about it. Its mission is to store things and this is what it does. So how do we do it?
Here is where inheritance comes in handy. A multimedia_text is built on top of an object called object. You can think of it as something encapsulated in multimedia_text, like the engine in a car. You don't actually see it, but it's there and it does loads of useful things. All the answers on how multimedia_text gets to know more about itself lie in what object does, as we shall see in the next chapter.
Object objects
At first sight, this object thing seems to do very little, apart from one very important thing: it contains a flag called mediated. When mediated is set to TRUE, it means that the content is language dependent, and in order to be meaningful for a user it must be put into the most appropriate linguistic phase.
I can see a lot of eyebrows moving. This is a dictionary and we are to use it to translate things, right? So what sort of content can be language independent? Well, suppose you decide to have a picture that has no text and explains a concept. In what language is your picture?
Besides, as we shall see later, object is not only included in multimedia_text objects. All the other entities that use it carry exclusively language independent information, so it is very important that we have a switch to tell an object when its linguistic services are needed.
So okay, to return to our doubts about content, now we know that a given content part is expressed in "some" language, but how do we get to know in which one?
The answer is again in what object does, because object is the unified support for classification. Anything that includes object can be classified. So there is a classification telling us that a given object is expressed in some language, by using a given script, and we can retrieve it.
Exactly in the same way we can retrieve copyright information for this object, we can retrieve the source from where it came and the full list of classes to which this content was assigned. It would be very nice if we could immediately explain how classification works, but we still need to introduce some basic concepts, before we can do it.
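To make the idea concrete, here is a minimal Python sketch of a multimedia_text that inherits the mediated flag and the classification support from object. All names (BaseObject, classify, the facet labels "language" and "script") are our own illustrative assumptions, not the engine's API.

```python
from dataclasses import dataclass, field


@dataclass
class BaseObject:
    """Hypothetical stand-in for the 'object' object."""
    mediated: bool = False                       # True when content is language dependent
    classes: dict = field(default_factory=dict)  # illustrative facet name -> class label

    def classify(self, facet: str, label: str) -> None:
        self.classes[facet] = label

    def classification(self, facet: str):
        """Return the class recorded for a facet, or None if there is none."""
        return self.classes.get(facet)


@dataclass
class MultimediaText(BaseObject):
    """Stores content only; everything it knows about itself comes from BaseObject."""
    text: str = ""


cat = MultimediaText(mediated=True, text="cat")
cat.classify("language", "English")
cat.classify("script", "Latin")

picture = MultimediaText(mediated=False)  # a textless picture: no language needed
```

The multimedia_text part never looks at the language; the question is always answered by the inherited part.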
The next step introduces us to our main semantic catalogue, the profile object.
Profile objects
NOTE! You will want to read this to fully understand what we mean by "profile". In short, by profile we mean the abstract concept that manifests itself in language. The idea of a cat, that keeps together all possible ways to say "cat" in all possible languages. 
If you look at its database definition, a profile object appears to be one of the most abstract and impotent objects we have in our catalogue. In fact it doesn't do anything, it simply provides a reference that content can use to be semantically grouped.
By itself it doesn't contain any information whatsoever, but since it inherits object it can be classified, licensed etc. It is by definition language independent information.
So, how do multimedia_text objects get assigned to a profile, if a profile doesn't know anything about content?
The answers are in the way the system manages hierarchical data, as we shall see in the next chapter.
Hierarchical data and translation trees
Our understanding of reality is often built in frames containing each other. A cat is a feline, which is a mammal, which is an animal, which is a living entity, etc. When we say "Fritz the cat" we imply a whole lot of information in that animal codename.
When dealing with content cataloguing we need a lot of such frames, and we need to make sure we correctly map what is included in what. But we need frames even to express how a multimedia_text came to be appended to a given profile.
It is important for us to know that someone created a piece of content as an original, or as a translation. If it was a translation we must know from what it was translated in the first place. This serves two purposes:
So our content gets assigned to a profile by a so-called translation tree. This tree marks the way in which multimedia_text objects were produced and lets us immediately see that translation 1-4 is pretty likely to contain a huge semantic drift.
This tree is not an object, but simply a structure that can be contained by proper objects. So while looking at its database definition you see nothing at all in the profile table, the OWm2 engine can assemble the whole relational structure that is built on top of it.
Yet, wait! Aren't we making dictionary entries? So what is this content we are talking about? The lemma itself or its definition? The answer is in the kind of tree we use. There are actually many: OWm2 uses them to map all possible kinds of taxonomic relations with just one dedicated set of routines that manages them all. There are trees that order class taxonomy, others that order network topology, and others that mark the way an "entry" is built. But before we proceed to this there are a few things to explain.
A tree structure knows many things:
You will probably have noticed that these three elements do not tell us anything about the hierarchical position of a given element within the tree. They do not say that translation 1-4 came from translation 1-3. Let's see why.
One of the most difficult challenges for the coder is to find a way to efficiently represent hierarchical data in a relational database. It may seem weird, but there is no immediate way to retrieve a taxonomic tree from relational tables by a single efficient query. So we all resort to tricks.
All tree elements have a left and right value. They work as frames, so we immediately see that element 0-7 includes all the others, element 1-6 is included in 0-7 and includes both 2-3 and 4-5. This makes it trivial to arrange queries that retrieve a full taxonomic mapping in a single shot and can compute an element depth on the fly. It also makes it trivial to move around parts of a tree. Here we have the basic structure that allows merging and splitting things without much fuss and without any risk of losing bits and pieces in the process.
Such an ingenious solution is obviously no invention of the OWm2 team. All credit for it goes to Mike Hillyer for a very clear explanation of this method, along with basic code snippets that made it trivial to build what OWm2 needs.
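This left/right technique is commonly known as the nested set model. The following Python sketch replays the four frames mentioned above (0-7, 1-6, 2-3, 4-5) in an in-memory SQLite table. The table name, the column names and the element names are our own assumptions and do not reflect the actual OWm2 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (name TEXT, lft INTEGER, rgt INTEGER)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?)",
    [("root", 0, 7), ("branch", 1, 6), ("leaf_a", 2, 3), ("leaf_b", 4, 5)],
)

# One query returns every element with its depth: the depth of an element is
# the number of frames that enclose it (itself included) minus one.
rows = conn.execute(
    """
    SELECT node.name, COUNT(parent.name) - 1 AS depth
    FROM tree AS node
    JOIN tree AS parent ON node.lft BETWEEN parent.lft AND parent.rgt
    GROUP BY node.name, node.lft
    ORDER BY node.lft
    """
).fetchall()

# Everything contained in the frame 1-6, again with a single query.
descendants = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM tree WHERE lft > ? AND rgt < ? ORDER BY lft", (1, 6)
    )
]
```

No recursion and no repeated queries are needed: containment is decided by comparing two integers per element.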
So, now that we have explained the basic technology we use to map and move relational data, let's move to the way in which "dictionary entries" are built.
What exactly is a profile?
As we saw, a profile seems to be quite a poor, powerless thing. But once the OWm2 engine has collected and assembled all the language dependent content assigned to a given profile, we get a totally different picture.
What we get is a dictionary entry, in the usual sense of an expression = definition equation. This happens because during the linguistic content assembly process, two families of hierarchical tree structures are used:
By now it should be clear that there is no structural difference whatsoever between the two. They are made by assembling the same multimedia_text objects by means of the same tree structures. A single unified set of routines serves them both. The difference between them is all in the role they play for human users.
A basic point is that since a profile is a language independent object (it is created as an empty set, as far as language dependent content is concerned) neither of the two tree families is mandatory. You can use a profile to collect expressions only or definitions only or any mix of the two.
The second important point is that this structure is not limited to dictionary management. Basically everything can be represented as an equation of content elements.
A news service or a CMS may use the expression tree to store articles, and the definition tree to store those small teasers that usually get published in the front-page. A software localization service may store a message codename in one of the two trees and its localized versions in another. A textual repository may store a full book in the expression, and a short description of the text in the definition. A Corpus Juris may have the text of its Laws in the expression, while serving commentaries in the definition tree, etc.
Obviously, no such service will be happy to call the two members expression and definition. But this is really not a problem, because we are in a multilayered architecture. All they need is to properly rename these entities in their GUIs; the logic remains unchanged anyway.
The members of the equation are fully interchangeable exactly because, as we have seen in the very beginning, a multimedia_text object is absolutely unaware of how it's going to be used. So our search engine can analyse the whole of our content base without needing to jump here and there. And it can do it really quickly, even if we get to host billions of entries.
But the most important point is that a profile object can now emerge anywhere in its linguistic phase space. That is, we have an entity that while being absolutely language independent is also fully capable of materializing itself within just any linguistic space. And the more original and/or translated linguistic content we assign to this profile object, the more capable of adapting to different users and cultures it gets.
We shall build a lot upon this, as you'll see.
Linguistic independent classes
Being able to collect an immense amount of linguistic polymorphism is surely nice, yet what makes a cataloguing system strong is its capability to classify things. The capability to record that a cat is a feline, is a mammal, is an animal.
In the real world any concept can be used to classify any other. In order to use a profile to classify another in the OWm2 storage engine, a user must say that he/she is willing to use the first as a class object. This is meant to limit the number of classes, to avoid turning GUI widgets into a nightmare.

When you say that a given profile is also a class you may choose to set it abstract. If you do so it means that this class object cannot be used to directly classify simple profile objects, but only to classify other class objects in a taxonomical tree.
This may sound quite obscure, so we had better give a clear example of the implications. When classifying animals, among many other things you may want to say that "tiger" is a "feline", is a "mammal", is an "animal". In just the same way you may be interested in stating that German is an Indo-European language.
Yet you might not want to spend ages tagging every single German expression as Indo-European, or all mammals as animals, and you might not be satisfied if anyone said that "tiger" is simply an "animal". You may want to state once and for all that "feline" is "mammal", is "animal" and be done with it. And you may want your GUI widgets to propose only analytical class objects, while keeping the more generic conceptual layers in an invisible implied background.
The answer is in defining these class objects as abstract. Abstract class objects can build up tremendously powerful semantic frames, while keeping the number of choices in widgets to a bare minimum. Since class objects are classified by tree structures, like anything else, you can always change and move your classification structure later on.
Using such implied background information can make your data extremely powerful, because while nobody will be offered a chance to use an abstract class object to classify a given profile, anyone will be able to use these class objects as search terms. So a comparative linguist, for example, will be able to extract all Indo-European expressions for "cat" in one go.
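A small Python sketch may help to see both sides of an abstract class: hidden from the widgets, yet usable as a search term. It is deliberately simplified. It uses plain parent pointers and a single parent per class instead of the engine's tree structures, and every name in it is an illustrative assumption.

```python
# class name -> (parent class or None, abstract flag); all names are illustrative
CLASSES = {
    "animal": (None, True),
    "mammal": ("animal", True),
    "feline": ("mammal", False),
}

# direct, human-made classification of simple profiles
PROFILE_CLASSES = {"tiger": "feline", "cat": "feline"}


def selectable_classes():
    """Classes a GUI widget may offer: abstract ones stay hidden."""
    return sorted(name for name, (_, abstract) in CLASSES.items() if not abstract)


def ancestors(cls):
    """Walk up the taxonomy from a class, the class itself included."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = CLASSES[cls][0]
    return chain


def search(cls):
    """Profiles classified by cls or by any class below it."""
    return sorted(p for p, c in PROFILE_CLASSES.items() if cls in ancestors(c))
```

With this data, selectable_classes() offers only "feline", yet search("animal") still finds both "cat" and "tiger", although nobody ever tagged them as animals.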
Once again, all class objects include a profile object that is responsible for their linguistic manifestation. So you can translate them and define them as any other concept, because concepts is what they are.
What we build is more than a traditional taxonomy. Many relations can be expressed as a directed graph, as shown in the attached picture. Such a structure allows for inclusion of a class into an infinite number of traditional taxonomy trees.
In the example we see how the class object medicinal herbs can be included in the wider class object medicine by a number of different paths.
In fact both the following taxonomies are expressed:
This gives our classificatory structure a better degree of flexibility and allows for complex systems to be efficiently represented. But it is not enough, yet, as we shall see. Let's consider one more example, shown in the third attached picture. Here we deal with agriculture and soil classification.
A particular kind of soil, the Alfisol, is so called because of the presence in it of Aluminium and Iron. The profile objects "aluminium" and "iron" are surely included in the conceptual "base" that can lead to fully grasping what an Alfisol profile really is. But is this inclusion hierarchical? Can we say that "everything about iron" should be labelled as "belonging into an Alfisol class object"? Most people will say that no, we cannot.
A hierarchical categorization is not the best tool to express the relation; here we need something more like a WWW link. Something that suggests you may also have a look at the information about Iron, but that does not automatically fit the Iron profile into the list of components of the "Alfisol set". Good candidates for this set are, instead, landmarks where Alfisol is common, or specialized agricultural techniques for this peculiar kind of soil.
So what we have is a non-directed link, which we expressed as a dotted line in our last example. In mathematical terms, we shall say that we use both a directed and a non-directed graph to map semantic values. The resulting table of combinations (to remain in mathematical terms we would say "the resulting incidence matrix") is not simply composed of Yes/No values, but rather of any pick of empty/directed/non-directed. And what we have, in practical terms, is the capability of expressing a relation between two profiles that either:
To optimize space consumption we obviously do not store empty values, so all we have is the cells of the matrix that contain a relational value. At design time we decided that this is "enough". This was simply the designer's decision, and certainly not a law of physics, but for all practical purposes this structure really seems to do all a dictionary needs.
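A sparse matrix of this kind can be sketched in Python with a plain dictionary that holds only the non-empty cells. The names below, and the convention that a directed cell (a, b) means "a is included in b", are our own assumptions for the sake of illustration.

```python
from enum import Enum


class Relation(Enum):
    DIRECTED = "directed"          # hierarchical inclusion (assumed: first is included in second)
    NON_DIRECTED = "non-directed"  # a "see also" link, with no inclusion implied


# Sparse matrix: only the cells that contain a relation are stored.
relations = {}


def relate(a, b, kind):
    if kind is Relation.NON_DIRECTED:
        # the order of the two profiles does not matter for a non-directed link
        relations[frozenset((a, b))] = kind
    else:
        relations[(a, b)] = kind


def relation(a, b):
    """Return the relation stored from a to b, or None for an empty cell."""
    if (a, b) in relations:
        return relations[(a, b)]
    return relations.get(frozenset((a, b)))


relate("medicinal herbs", "medicine", Relation.DIRECTED)
relate("Alfisol", "iron", Relation.NON_DIRECTED)
```

The non-directed link answers in both directions, the directed one only in the direction it was stated, and every other pair of profiles costs no storage at all.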
Next we shall see that some classes are more classes than others.
Special classes and extended vs normal classification
Some classes are heavily used by the system for its own internal jobs. There was no point in developing two separated classification systems, one for humans and one for the engine, because basically they do the same thing, only with a different meaning.
When the system assigns a given linguistic content to the "English" class object, it means that it is expressed in English. Yet a Hindi native speaker may classify (in Hindi) the profile "Saxon genitive" with the very same "English" class object, because it is related to the English language. And he/she may do it while remaining totally immersed in the Hindi linguistic phase.
Both uses are obviously correct, and it is important that the system can tell the difference between the two. In this case we could use the fact that only a multimedia_text object is classified as language dependent, while whatever is said of a profile is obviously a "normal human classification", yet there are more subtle logical traps ahead.
All object objects in the OWm2 storage engine bear license information. Most of them simply say they are not subject to copyright, yet potentially all are. So what happens when you assign a given profile to the class object "CC-BY"? Are you saying that this profile is related to this particular license, or are you rather stating licensing information proper?
You cannot tell, unless you state clearly which classification activity was made for what. So any time the system records the assignment of an object to a given class object, it also states whether it is doing so for its own internal purposes or as a result of human interaction. In the API you'll find the internal purposes named as "extended" vs "normal" (human) classification.
For all practical purposes we maintain two parallel classificatory layers in the system. One is service oriented, and it classifies objects by language, script, licensing information, file format and source. The other is the free associative machine by means of which a human user may state that "this is related to that". Once again, both layers share exactly the same software.
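The two layers can be pictured as one list of assignments carrying one extra marker. The Python sketch below is only an illustration: the record layout, the object identifiers and the function name are hypothetical, and only the labels "extended" and "normal" come from the text above.

```python
from collections import namedtuple

# kind is "extended" (recorded by the system) or "normal" (recorded by a human)
Assignment = namedtuple("Assignment", "object_id class_name kind")

assignments = [
    Assignment("text-1", "English", "extended"),        # the text is written in English
    Assignment("saxon-genitive", "English", "normal"),  # the profile is related to English
    Assignment("text-1", "CC-BY", "extended"),          # licensing information proper
    Assignment("licensing-essay", "CC-BY", "normal"),   # a profile about that license
]


def classified_as(class_name, kind):
    """Objects assigned to a class, restricted to one classification layer."""
    return [a.object_id for a in assignments
            if a.class_name == class_name and a.kind == kind]
```

The same class object appears in both layers, and the marker alone tells "is licensed under CC-BY" apart from "talks about CC-BY".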
Special class objects are created for the system's sake only, as they include the data needed by the system to perform special internal operations. They are made by wrapping a normal class object into a larger container that is basically used only by the system. Users still see them and use them as ordinary class objects.
In particular, two special classes map the "legal input" for all linguistic content in the system. They do so by building a table that says which language/script couplings are allowed. There is indeed no point in storing English content in Cyrillic transliteration, but Serbian, for example, must map the possibility of both Serbian/Latin and Serbian/Cyrillic. Japanese has up to 4 possible variants and they must be kept well ordered and identifiable from each other.
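As a rough Python illustration, such a table boils down to a set of allowed pairs. The entries and the function name below are our own assumptions; the real table is built by the two special classes.

```python
# Illustrative table of allowed couplings (not the engine's real data).
ALLOWED_COUPLINGS = {
    ("English", "Latin"),
    ("Serbian", "Latin"),
    ("Serbian", "Cyrillic"),
}


def is_legal_input(first, second):
    """True when the coupling is listed as allowed."""
    return (first, second) in ALLOWED_COUPLINGS
```

With this table, Serbian content is accepted in both scripts, while English content in Cyrillic is rejected.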
When reading the API documentation you'll never see expressions like "Language" or "Script", though. This happens because the system is ready to accept non-human languages, too. We are not speaking about Martian invaders, obviously, but rather about software.
There are a number of protocols that allow representing mathematical expressions, molecular 3D rendering etc. Calling these "languages" or "scripts" would have been (at the very best) improper, so we decided to use more generic labels. This is the reason why in the API you meet
So now the circle is closed, and we finally came to see how a multimedia_text object knows in which language it is expressed.
Original vs sourced material
You might have noticed that we spoke about "sources" in the previous pages. It is time to explain what we meant. When users insert content in the storage engine, they may either mark it as original content they made by themselves or make it a "quote".
A quote is content that is not supposed to be edited, unless improperly quoted. It can be translated, obviously, but the original should not be altered. This is especially efficient when importing public "collections", because the collection as such is usually copyrighted and the related licensing information must be properly stored and shown.
This is also an efficient way to mark that a given definition is "authoritative", for example when issued by a public Board that is in charge of defining some kind of standards. An authority object is by all means a usual class object, but defining it as an authority tells the system that it can be used to source objects.
Users who feel that an "authoritative" definition is not the best tool to define a profile object may still freely add other original definitions as they see fit.
A peculiar use for this mechanism consisted in marking the import of all the content originally stored in the Omegawiki database.
The network layer
This is where true geeks would have started their explanations from. We rather chose to present it here, when at least the subject of such extended logorrhoea is finally clear. If you are not interested in the co-operative mechanism included in the OWm2 storage engine you may happily skip this page.
We said that the object object is included in everything we store, like an engine that performs lots of useful tasks while many drivers simply ignore most of its details. Well, inside an object object there is yet another hidden engine, the network_object object. This really is the basement layer, there is nowhere deeper than here.
A network_object knows everything that is needed to move objects along different nodes in the extended network. It carries identifiers that have a universal validity, it identifies where and by whom an object was created and it is responsible for keeping a history of the larger object built on top of it, any time this larger entity gets altered in any of its parts.
Since it is the ambassador of any item towards the network at large, a network_object has much more awareness of the total entity in which it's included than any other part of it. It must have it, because the network daemon needs to know which is which, to correctly organize the sequence of its operations.
It is in fact important to introduce new elements in an order that ensures that any new object finds in place all the elements it needs to work. For example, when a new language object is created and used to classify content, it is vital to import this object before we import the content that it will classify.
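Ordering imports so that prerequisites always come first is a topological sort. The following Python sketch uses the standard graphlib module (Python 3.9 or later) purely as an illustration. We do not claim that the network daemon works this way, and all object names are made up.

```python
from graphlib import TopologicalSorter

# object -> objects that must already be in place before it is imported
needs = {
    "content:cat-definition": {"class:new-language", "profile:cat"},
    "class:new-language": set(),
    "profile:cat": set(),
}

# static_order() yields every object after all of its prerequisites
import_order = list(TopologicalSorter(needs).static_order())
```

Whatever the exact order returned, the new language class and the profile always precede the content that depends on them.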
The network_object is also the place in which a user stores his/her commands to the network daemon. A user may want to keep some content private, or he/she may be unhappy with what other editors or administrators out there are doing with a given object.
The network_object is the engine that can perform an unlink operation, thus ensuring that a particular object is disconnected from the global network and that no matter what other people do with it, the user will never see it changed again (but obviously the user can still edit it locally). Since literally everything contains a network_object, users can unlink from the network whatever they please.
Unlinked objects are always returned at the top of local searches. This is no ideological position, but a simple practical consideration: people usually protect those things they most care for. It would be very annoying if those very things were buried at the bottom of their search results.
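In code, this rule is nothing more than a sort key. The Python sketch below assumes that every search hit carries an unlinked flag and some relevance score; both field names are hypothetical.

```python
results = [
    {"id": "a", "score": 0.9, "unlinked": False},
    {"id": "b", "score": 0.4, "unlinked": True},
    {"id": "c", "score": 0.7, "unlinked": False},
    {"id": "d", "score": 0.6, "unlinked": True},
]

# Unlinked objects first, then by descending score within each group.
ranked = sorted(results, key=lambda r: (not r["unlinked"], -r["score"]))
ranked_ids = [r["id"] for r in ranked]
```

Here the two unlinked objects, d and b, come before a and c even though their scores are lower.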
We built this software with several goals and guidelines in mind. They all came from joining a large number of "books of dreams" and initially people reacted quite sceptically to this list. Yet, all that we are using today as everyday tools was "crazy dreams", just some 5-10 years ago.
Here is the list of requirements we started from:
No digital barrier, so no stoppers for those who cannot connect to Internet. People shall be able to fully work off-line. When no Internet connection is available it must be possible to send and receive content by exchanging physical media such as RAM keys. This is especially vital for countries in which expensive public infrastructure cannot be expected to grow out of thin air overnight.
We assembled our pick of technologies based on these requirements. In the next chapters we shall discuss what layers are responsible for co-operative content management and what pre-existing tools have been chosen to perform the related tasks.
The first thing we wanted was a safety exit: the chance for us to "change components" with (almost) zero impact. When assembling different pieces of pre-existing software one often stumbles into surprises, so we wanted things to integrate loosely enough for us to swap one piece of software for another without actually compromising the general architecture. We also wanted high redundancy, which means that we wanted to have many copies of the data, so that if a node or part of it gets damaged there are still good chances to get data back.
The final result is shown in the attached graphic. Rising up from basement level what you have is:
Any GUI that can manage front-end relations with the user. There can be any number of them: they can be desktop applications, macros for Open Office or local intraweb applications, you name it. They all exchange messages with Amabaradaemon, which in turn delivers them in the proper format to the package that executes them.
We shall now spend a few words on some of the components, to see more in detail what they actually do.
Questions about our choice of Freenet as a network transport may range from Freenet's "pirate" reputation to more technical aspects, like the time it takes for a node to become fully functional. We welcome the broadest possible discussion, because it will surely help in making sounder decisions.
Buccaneers, le Carré plots, 5th columns and all that jazz
We use Freenet technology, but do not host ANY of Freenet's content. OWm2 is by all means an independent network. We simply use the software and get all the updates and bug fixes official Freenet gets. Technology and the use people make of it are two different things. Cellular phones are sometimes used to detonate bombs in public places. You haven't thrown away your mobile phone because of this, have you?
It's written "strategy", it reads "money"
A geek's view
Freenet has been actively developed and maintained since 2002, it is quite stable and it is frequently attacked by well organized spammers, so a lot of development has gone into making a spammer's life difficult. There are many other networks around, but this one seems to have undergone a good deal of bug-fixing. Plus, its developers are extremely friendly and co-operative. Using their product takes an enormous load off our backs.
What exactly do we use Freenet for?
Freenet is mostly a cache for us. Multimedia files are published onto the net and keep circulating among nodes, where chunks of them are locally cached. This means that most popular files are more widely cached and will be retrieved more quickly, while "niche" content may need to be regularly reposted by the daemons, to make sure it's still cached. A good strategy for efficiency is downloading content you want to cache. Future releases will study a system to make this an automated procedure.
Using a distributed repository means that you don't need to be connected to a central machine to write your updates. Instead, you write them locally, and once in a while you can commit them to a global container, from which they will diffuse to all other users. None of your updates gets lost when you are not connected to the network at large; you simply cannot deliver them for a while, and that's it.
All our objects are stored as XML representations. Putting them all into just one repository would quite defeat one of our main goals: making sure a user downloads just what is needed. As a matter of fact no user needs the whole networked knowledge base, most of them need but some 5% of it, many will need much less. So instead of one big fat repository we use many smaller specialized containers that are kept on different nodes, thus minimizing the traffic load a single node bears.
As previously said, we offer the chance to have the big bulk of the content in a few languages, while a user may decide to get the semantic classifiers in a much wider linguistic range. So we need to build independent containers for class objects and simple profile objects. None of them has any linguistic content, as we already know, but they build the logical structure we use to decide what linguistic content is needed.
Once we have the full logical structure in place it is trivial to match it against the node's linguistic subscriptions to build our download list. As we see in the attached picture, there is one independent content repository per language. The whole set of repositories forms what we call a "region", that is, a coherent data subset. As we shall see in the related documentation, regions can be combined with each other to form larger datasets.
Let's see how many independent repositories a "region" implies. The number of languages registered in ISO 639-3 amounts to 7,622. Each of them will need 2 repositories (one for class objects, the other for profile objects), plus 2 repositories for the linguistic independent information. This makes a theoretical total amount of 15,246 repositories. It would be fantastic to need them all, yet actual numbers aren't even remotely close to this mark. Content produced by the Omegawiki project in years of work covers 183 languages in all, for a grand total of 368 physical repositories.
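The arithmetic above can be checked with a tiny Python function. The function name is ours; the counting rule is the one just described (2 repositories per language, plus 2 for the language independent information).

```python
def repositories_per_region(languages: int) -> int:
    per_language = 2          # one for class objects, one for profile objects
    language_independent = 2  # repositories for language independent information
    return languages * per_language + language_independent


theoretical = repositories_per_region(7622)  # all languages quoted from ISO 639-3
omegawiki = repositories_per_region(183)     # languages covered by Omegawiki content
```

The two results are 15,246 and 368, matching the figures quoted in the text.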
How multiple datasets can combine
In the documentation about region you will find how users can choose to use several regions, combining them into a single database. This happens at MySQL level, while at Mercurial repositories level data do not mix at all. All objects are identified by a UUID, which is also the name of the file that identifies them in the repository, and a given UUID belongs to one and only one region (that is, it is contained in only one set of repositories).
Regions can specify a set of "included regions", just like a software package has dependencies. Regions like "Biochemistry", "Physical chemistry" and "Neurochemistry" may all benefit from including a more general "Chemistry" region. Quite obviously, such a complex data organization cannot be achieved overnight. Chances are that the need to move and organize objects will come only once a given critical mass is reached. Administrators will need to split regions that become too big or simply too confused to be useful.

What if a given source wants to use this channel to "sell data" or if corporate users want to restrict access to their data? This will be possible in a future release. Private content is to be distributed and cached along the net, so the situation cannot be compared to accessing some well identified server. It is rather like receiving satellite TV broadcasts. The content goes around to each and every one, but only some users can make any use of it. Part of the content may be public (the advertising messages that many channels send out), while the rest can be restricted.
Only authorized receivers will be able to use encrypted content. Keys may be sold, issued/revoked on the fly and/or given a validity in time. Such content may be efficiently included in the published set of repositories without compromising the seller's interests. Users obviously download only the encrypted repositories they have a key for. Nothing keeps them from downloading the others, too. Yet, they would only waste bandwidth if they did.
We also need to explain how a region can add semantic classification to profile objects that belong to another. This will be a constant need. If you build a region called "animals", the region "sports" will have a lot to say about it in terms of classification and relations. How can it do it? The answer is in using "headless" information.
An object is created by an XML generator that includes all data about the object, thus including semantic classification. "Headless" information includes only information about semantic classification and content. The object itself remains a property of another region, but the current region can "expand it", provided that the user already has the head information from another source (another region, that is). Whether new information should be added to the owner region or kept in another as "headless" is the user's decision.
Considering that information is constantly scanned and retrieved, even circular calls are resolved. On the first scan, headless information that has no object to which it can be attached will not be processed, but the second scan will find the heads in place and will add the information.
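The two-scan behaviour can be sketched as follows. Everything here (the record layout, the `scan` function, the sample UUID) is a hypothetical illustration of the idea, not the actual OWm2 code.

```python
def scan(items, store):
    """Run one scan over incoming items and return the ones that must wait.

    A "head" item creates the object; a "headless" item is attached only if
    its object is already in the store, otherwise it is deferred.
    """
    deferred = []
    for item in items:
        if item["kind"] == "head":
            store.setdefault(item["uuid"], {"owner": item["region"], "extras": []})
        elif item["uuid"] in store:
            store[item["uuid"]]["extras"].append((item["region"], item["data"]))
        else:
            deferred.append(item)  # no head yet: leave it for the next scan
    return deferred


store = {}
incoming = [
    # The headless part from "sports" arrives before the head from "animals".
    {"kind": "headless", "uuid": "u-1", "region": "sports", "data": "class: mascots"},
    {"kind": "head", "uuid": "u-1", "region": "animals"},
]

pending = scan(incoming, store)  # first scan: the headless item is deferred
pending = scan(pending, store)   # second scan: the head is now in place
print(pending)                   # []
print(store["u-1"]["extras"])    # [('sports', 'class: mascots')]
```

The object keeps its owner region ("animals" here), while the extra classification stays tagged with the region that supplied it.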
A similar process can be used to treat encrypted information. The overall number of our repositories slightly grows, but we gained a lot of functionality by such a small growth in size. Now specialized regions can add specialized expressions and definitions, they can use other regions' simple profiles as classes, and they can add specialized classification to just any general object.
An important consequence of this approach is that some of your objects may vanish because someone out there moved them onto a region you do not include in your subscriptions. When such is the case each user (either individual or corporate) may either add the new region or simply unlink the original object and keep it. When this happens at corporate level it may generate duplicates, and admins in this case should establish a relation between the two profile objects, rather than forcing a merge they cannot force on people anyway. In an OWm2 storage engine admins propose a solution, but they cannot impose it.
If users gather that a given choice is sensible they will join it. So no planned economy here, as it will always be the users to eventually determine which is which, rather than the admins. And no admin or mob rule can force a point of view on individuals and minorities. At least, not as long as people take the time to review what others are doing. But that's life, we all have but the rights we are prepared to defend.
In an OWm2 storage engine admins propose a solution, but they cannot impose it... hmmm, it sounds like a dangerous thing, doesn't it? It actually only means that there is an endless number of absolute powers that cannot overlap each other. It means good fences, not chaos.
So this is no security leak, if you come to think about it. It's true that a malicious attacker can start a new independent region and include all sorts of weird content in it. So what? He is not damaging any of the existing content of other regions, and it's only up to end-users to decide whether this new region is worth downloading or not.
Spammers may be annoying, but they are not stupid. Even the least sophisticated spammer will not want to spam his own self to death in the absence of an audience. He will want to spam an existing region, something people read and use. This happens to be quite difficult, though. All new nodes are born greylisted, which means that what they post never gets automatically pulled into the distribution engine, until someone has verified that "this is human". So yes, there are billions of ways in which a malicious hacker can inject content into his OWN node, but this content is not going to reach a public repository until a human has let it through.
If the region admins so decide, the node will become whitelisted; if the situation is unclear it may remain greylisted, or it can be blacklisted for good. Once blacklisted, everything it sends to the region will be automatically trashed. How the admins decide to keep or trash things is only up to them. They manage their own region and set policies as they please. One can pretty much decide to open a read-only region, with one and only one authorized editor.
So, before fooling anyone, a spammer has to work for a while to produce content that will please the local human testers. After that he will get the greenlight to insert without checking. Yet the system will greylist him again automatically if he suddenly exceeds his usual input rate. One cannot just do a bit of casual work and happily flood a whole nation right after that. If he wants to send out like 100 spam profiles a day he first has to regularly produce an equivalent frequency in good profiles, and he must keep his useful activity up for days.
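A minimal sketch of the listing rules described above. The text does not say how the "usual input rate" is measured, so the daily counts and the `tolerance` multiplier below are pure assumptions, as are all the names.

```python
from enum import Enum


class Status(Enum):
    GREYLISTED = "grey"    # default for new nodes: posts wait for human review
    WHITELISTED = "white"  # posts are pulled in without checking
    BLACKLISTED = "black"  # posts are trashed automatically


def accepted_automatically(status):
    """Only whitelisted nodes get their posts distributed without review."""
    return status is Status.WHITELISTED


def review_status(status, todays_posts, usual_daily_posts, tolerance=2.0):
    """Return the node's status after today's input.

    A whitelisted node that suddenly exceeds its usual rate falls back to
    greylisted. `tolerance` is a hypothetical threshold multiplier.
    """
    if status is Status.WHITELISTED and todays_posts > usual_daily_posts * tolerance:
        return Status.GREYLISTED
    return status


# A node that usually sends 5 profiles a day and suddenly sends 100.
print(review_status(Status.WHITELISTED, 100, 5))  # Status.GREYLISTED
# The same node on an ordinary day.
print(review_status(Status.WHITELISTED, 6, 5))    # Status.WHITELISTED
```

Under this rule, sending 100 profiles a day unchecked requires a history of roughly that many good profiles a day, which is the point made in the text.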
Now, isn't this censorship, isn't this an open road to "mob rule", to local communities deciding whatever they want and forcing their decisions on minorities and individuals??? The answer is a big capital YES. As we already said, we use Freenet technology but we are not Freenet. There is NO common policy about what is spam and what is not. Regions are 100% free to set their own standards, and anyone who doesn't like the standard is free to open another region with a totally opposite set of rules. Nobody needs anyone's approval for this.
To have good chances to remain unseen a malicious attacker needs to hide an amount of litter in the middle of a majority of good material. Such an attack vector forces admins to carefully examine things. The result is worth the effort anyway, because when they trash the litter they are still left with the good content the spammer used as a shield. So many regions may welcome at least smart spammers, after all.
Does this mean that no anonymous contributions are allowed? No. It means we don't need to deal with floating IPs etc. Nobody can tell who you are when you run an OWm2 node, but your node UUID is fixed. If you add content twice it will twice bear the same signature. You can reinstall the node from scratch and get a new UUID, but then you start from being greylisted again...
Power and how we cook it
Admins have decisional power over THEIR region, and that's all they rule. The OWm2 storage engine has no central server to which one must be "admitted" in order to be hosted. There cannot be any "overall community", council, foundation or company in charge to decide whether to host a project or not. Even if such an entity appeared out of thin air, our software would give it no way to enforce its decisions. So if one doesn't like the way a region is administered, all he has to do is to open his own. Nobody here gets more censorship than he is willing to accept.
Needless to say, freedom has a price. A new region will need content, users and anti-spam administration, i.e. hard work, marketing and the infinite boredom of revising incoming stuff. One could whitelist all newcomers by default, but spammers never sleep, and his content would soon be so littered that most people wouldn't be interested in it. Let's be very clear: we offer freedom and independence, we don't serve free lunches. Users don't pay any money for hosting (at least not to us), but it's fully up to them to provide and organize all the labour it takes to keep their stuff running.
A small case study
Now let's abandon the theoretical field to take a dive into the real world. Say user X joined region Y and found out that an entry in the dictionary is (according to his point of view) downright wrong. He immediately corrects it and drops in "some basic common sense". He feels a happy man.
Too bad that after just a few hours his work is gone and the entry bears exactly the same wording that offended him in the first place. Together with this unpleasant refusal he finds out he has been blacklisted. Pretty rude, sure, but it may well happen. Admins are people, and people sometimes are downright rude.
At first user X refuses the admins' decision and keeps his content on his node. Later on he realizes that he is simply speaking aloud in an empty room while others carry on what can only be labeled as a blatant case of mass disinformation. So he bravely decides to open a new region.
He imports region Y's contents into his new creature and starts editing all he does not agree with. Will his edits reach the masses? Mind you, no, they certainly won't. His new region doesn't own that content, so whatever he does is still trashed as blacklisted. His edits still happen in region Y!
Blind alley? No, a simple filter against stupid edit wars. What his new region can do is to add headless information to region Y's objects. Other expressions, definitions, classifications and relations. Alternative stuff, that is. This new content will belong to HIS region, in which he and his friends rule.
Did he reach the masses this time? Well... Reaching the masses is hard work: once the content is in place one still needs to market it. He will need to tell people about this new region, and only his success in the process will determine the size of the masses he will reach.
Do we mean there is no such thing as "the truth"?
The reader may wonder what's the point in having a knowledge base that says everything and the opposite of everything. How can we publish things that simply exclude each other, like Darwinism and Creationism? Shouldn't we guarantee some form of "objective information"? No, we shouldn't. No big philosophical issues here, our strategy is purely marketing driven:
Translating things does not require sympathizing with the text. A faithful translation is based on a deep understanding of the source, though. So the more different (even clashing) points of view are represented and translated, the better for a dictionary. We repeat it and underline it: we make a DICTIONARY here, not an encyclopaedia, so no POV/NPOV for our system.
Besides, users get as many regions as they choose. If one doesn't want to read material inspired by sect XY all he has to do is to avoid subscribing to the Church of XY region. Yet, pretty often one may have good professional reasons to need stuff he actually hates.
Think about it: a scientist may question a definition in chemistry that stems from some weird alchemy books. Yet, editors who publish alchemy books exist, and they all pay translators who need specialized dictionaries that give them "the proper language for the subject". Sometimes they need it with a serious tinge for expensive editions, sometimes they work on an instant book... There is no limit to what can be useful for a translator. This is the main "commercial" reason for us to favour the birth of openly "confessional" regions. They may be born as "confessional", but they are eventually useful as "specialized".
In short
Everyone is welcome to start a region that accepts only what they rate to be the truth, the best information level, the one and only sane approach to reality. Actually most regions will do exactly that, because this is what our system has been designed to do. It is up to each individual end-user to decide who is sane and who is not; the system will never hint at a ready-made solution to such a question.
Everyone is also welcome to open a region that will finally state "the truth about OWm2", to properly inform the audience about all the things we deliberately (or not) hid from their attention. When this happens we will rate it as our best success.
Let us start from individual users, before we move to the corporate level. As a note on the potentially endless debate on what a point of view is, we can only recommend that you reflect once more on the concept of linguistic phase space. In treating "the right to hold one's point of view" we use a very similar concept called semantic phase space.
What we mean is that the profile "water" is not only emerging as a different entity depending on its current linguistic phase, but also depending on its current semantic phase. There is no particular warranty that two semantic phases will remain mutually intelligible within the same linguistic phase.
By all practical means, a user may well need a translation from English to English, in order to make any use of what was originally born in deeply specialized terms. It's indeed not many people who will immediately understand the scientific version of the common phrase "liquid water is wet". And even if they could understand it, chances are this is not what they are looking for. Both the individual and the corporate user may want to choose one or more preferred semantic phases.
What we aim to build is an adaptive content network. Something an end user may tune both in terms of linguistic and semantic phases. This implies some tool by which a user can tell the system what is relevant for him/her, and any such tool needs a well identified set of parameters on which it can operate.
We are not replicating the way human brains work. Human brains work as fractally complex parallel systems that simultaneously process a combination of analogue and digital information at an incredible amount of different levels. We have no such resource at hand because we are using a Von Neumann machine, so we simply try and make a practical use of the data we already have.
Now let's see what we can use to make a filter. We have linguistically dependent information, that is, the extended classification by which our system tracks any content's linguistic phase. We also have a linguistically independent vector, the thing saying that a tiger is a feline is a mammal is an animal. We shall call this a semantic vector.
A profile object, as we see on the graphic, has no direct knowledge of either of the two vectors. So a profile object is portable anyway, no matter what happens to the semantic and linguistic values that were attached to it. If there is nothing stopping us from sending it out, the next step is to define how a user will say whether he/she wants to receive the object or not.
Suppose a user speaks wonderful Thai and Khmer, but he/she has no notion whatsoever of English. This person may well decide to ignore whatever semantic classification other people made, as long as the involved class objects cannot manifest themselves in a linguistic phase he/she can use. This implies that such person's semantic content will be full of holes, holes that the user can fill by making his/her own new profile objects used as class objects for semantic classification. In short, since no แมว class object appears on the screen and emersions like cat, кот and ネコ were ignored as useless, the user proceeds to make a new แมว class object on his/her own.
Now wait a minute, isn't this a whole lot of silly dupes?? Well... yes, absolutely. Yet, we must live with this problem anyway, no matter whether we broadcast all content or not. Even if we sent this guy terabytes of classification just to make sure that "no holes are there" those holes would vanish only from his/her local database. The user's linguistic phase space would not extend a bit in the process (just think of the use most English speakers can make of emersions like кот, ネコ, قط ,חתול or შინაური კატა) and thus duplicate profile objects would still be created in huge lots.
This is something that only multilingual human personnel can solve, by patiently marking candidates for merging whenever they meet them. The process will always be long and difficult, since no person on earth speaks ALL the possible linguistic phases we can store.
We also have the case of someone who can use one or more languages that he/she doesn't want to download. Say this person is a translator from Bulgarian to Hindi, looking for a dictionary on construction engineering. This user will mostly ignore Dutch terminology, but maybe he/she can speak Dutch, if even just enough to make use of a class object emerging like Spoorweg (Rail transport). This user may well want to ignore all profile objects that cannot emerge in the Bulgarian or Hindi linguistic phase, but decide to accept class objects that can emerge in Dutch.
We encourage people to use the widest linguistic phase surface for semantic classification, as this dramatically accelerates the process leading to a compact set of profile objects. Profiles that have the same kind of detailed semantic classification and no overlapping linguistic phase are automatically sent for evaluation to the admins. Maybe they are not "exactly the same thing", but they are somehow closely related anyway, so it's worth giving them a closer look.
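The automatic flagging rule can be sketched like this. We read "the same kind of detailed semantic classification" as "exactly the same set of classes", which is a simplification of ours; the data layout and names are likewise hypothetical.

```python
from itertools import combinations


def merge_candidates(profiles):
    """Return pairs of profiles worth an admin's closer look.

    A pair qualifies when both profiles carry the same set of classes
    (our simplified reading of the rule) and share no language at all.
    """
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        pa, pb = profiles[a], profiles[b]
        same_classes = pa["classes"] == pb["classes"]
        no_shared_language = pa["languages"].isdisjoint(pb["languages"])
        if same_classes and no_shared_language:
            pairs.append((a, b))
    return pairs


profiles = {
    "p1": {"classes": {"feline", "pet"}, "languages": {"tha"}},
    "p2": {"classes": {"feline", "pet"}, "languages": {"eng", "rus"}},
    "p3": {"classes": {"feline", "pet"}, "languages": {"eng"}},
}

print(merge_candidates(profiles))  # [('p1', 'p2'), ('p1', 'p3')]
```

Profiles p2 and p3 share English, so whoever reads English can already compare them directly; only the pairs with no common language are sent to the admins.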
With this goal in mind we offer our users an extended linguistic phase surface for the network synchronisation of class objects only. Users do not get any "plain content" on extension phases, so the process eats up very little bandwidth and space while providing them with the best linguistic support for their semantic vector. Hmmm.... extended linguistic phase surface!? Now what the heck is that?? Okay, okay... Let's translate it from English to English: you can state a number of additional languages the system can use to send you class objects. These languages' usual content is of no interest to you, but they can be successfully used to give you classificatory information when such information is not available in any of your usual languages of choice.
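In code, the fallback from usual languages to extension languages might look like the sketch below. The function, the label dictionary and the use of ISO 639-3 codes as keys are assumptions made for the example.

```python
def pick_label(labels, usual, extended):
    """Choose the language in which a class object is shown to the user.

    Usual languages are tried first, extension languages second. Returns a
    (language, label) pair, or None if the class object cannot emerge in
    any language the user accepts.
    """
    for lang in list(usual) + list(extended):
        if lang in labels:
            return lang, labels[lang]
    return None


# The Bulgarian-to-Hindi translator who also reads some Dutch.
usual = ["bul", "hin"]
extended = ["nld"]

print(pick_label({"nld": "Spoorweg", "eng": "Rail transport"}, usual, extended))
# ('nld', 'Spoorweg')
print(pick_label({"jpn": "ネコ"}, usual, extended))
# None
```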
So much for the linguistic vector. Now if you apply exactly the same filter to the semantic vector, you find yourself telling the system that you want to receive those profile objects that are classified by the construction engineering class AND by the Spoorweg class, maybe NOT those classified by the Stone rails included class (say you are not interested in archaeological stuff).
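The AND / NOT filter on the semantic vector can be written as a couple of set operations. This is a sketch only: the function name and the way a subscription is expressed are our assumptions, and classes are identified here by their display names for readability.

```python
def wanted(profile_classes, require_all, exclude_any):
    """Decide whether a profile object matches a semantic subscription.

    The profile must carry every class in `require_all` (AND) and none of
    the classes in `exclude_any` (NOT).
    """
    classes = set(profile_classes)
    return set(require_all) <= classes and classes.isdisjoint(exclude_any)


REQUIRE = {"construction engineering", "Spoorweg"}
EXCLUDE = {"Stone rails"}

print(wanted({"construction engineering", "Spoorweg"}, REQUIRE, EXCLUDE))
# True
print(wanted({"construction engineering", "Spoorweg", "Stone rails"}, REQUIRE, EXCLUDE))
# False
print(wanted({"construction engineering"}, REQUIRE, EXCLUDE))
# False
```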
When the network daemon connects to other nodes, it will know what to request based on such specifications. This is what OWm2 calls its subscription mechanism, which is the basic mechanism allowing individuals to pick their choice of available content. In the next chapter we shall finally see how this applies to corporate level, thus giving birth to our region objects.