Ambaradan: the OWm2 storage engine

The OWm2 storage engine is the place where most Ambaradan magic happens. It's the real brain of the system, as everything else limits its functions to data display and interaction with both the user and other nodes in the network. Whatever you do in Ambaradan eventually boils down to a sequence of commands that are executed here.

A spoonful of history

This storage engine was originally born as an upgrade for Omegawiki (hence the OWm2 codename), but in time it has evolved to such a conceptual distance from Omegawiki that at this stage they can only be presented as remote relatives.

The most radical changes happened in the late autumn of 2008, when it became clear that most administrative problems originated from inconsistencies in the first data design. This forced us to produce two complete rewrites of the codebase, and it basically meant losing all the work that had been done up to then.

It was quite an unpleasant decision, as we were already on a very tight schedule and there was no available funding, so the work had to be fully self-financed. At this point in time the only thing that remains of the original Omegawiki framework is the content that was stored in it. Everything else was remade from scratch, including the most basic concepts behind the system.

In particular:

  1. OWm2 has dropped the very idea of a defined meaning, which was central to the old design. It was a nicely intuitive concept but it generated a tremendous problem as it defined as language independent something that was actually 100% language dependent.
  2. The original function of the engine was to store dictionary entries. It has evolved to a point in which it is simply a machine for storing linguistic content and for establishing relations among content parts.
  3. Linguistic content is multimedia oriented from the roots. We can freely use languages that have no script and must be recorded on audio and/or video.
  4. OWm2 implements a total insulation between the storage system and the GUIs. No GUI has any idea of how data is actually stored. GUIs simply use a public XML-based API to communicate with an interposed daemon, and need not know what kind of storage engine is behind it.
  5. The optional co-operative structure is a peer to peer system, not a wiki. It does not require an Internet connection to work, it can be distributed on DVDs and RAM keys, it has no discussion structure and it's not community based in any way.
  6. OWm2 is focused on distributed knowledge, whereas a wiki is a centralized server.
  7. The users network may edit the data you shared, but if you like your version better you can keep it, while the rest of the network keeps its evolving version.
  8. While it can still serve content from a web-server, OWm2 is mainly meant to serve local offline content with asynchronous occasional connectivity, so there is no use for a mediawiki oriented storage in the project.

OWm2 in objects

Everything in OWm2 is an object, and as in all decent programming families (hopefully you got the self-ironic tone here) our objects are made of hierarchies, inherited values and behaviours.

To present such uncanny pieces of black computing magic, the usual geek convention is to start from the root element, write loads of pages on how powerful and generic it is, and then slowly get to things a simple user can understand. Too bad that most simple users have long quit reading, by the time things get to be interesting for them.

We obviously cannot build a house starting from the roof, yet we did try our best to build this guide with things that non-programmers can immediately grasp, without any need to get deeply involved in object oriented theology. You may want to read the following pages to get a good grasp of how this engine works.

 

Multimedia_text objects

The first object we meet is called multimedia_text. It is our main storage unit. It can contain any length of UTF-8 text (yes, any really means any) but it can also have no text at all and instead point to an executable file object.

This is an important feature, because it means that content can indifferently be in textual, pictorial, audio or video format and the system will treat it anyway for what it is: a piece of content.

What we do with textual content is quite ordinary: we store it as compressed text, so it won't eat up your disk when you are hosting millions of content parts, and we decompress it when we serve it.

Life with files is more complicated, as there is no guarantee that a user who receives a video content part from the network is also willing to spend bandwidth to download what potentially amounts to gigabytes.

So we do not store the file in this object. We simply keep a pointer to a file list that will give us a recipe to obtain the actual file, if the user really wants to have it.
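To make the two storage modes concrete, here is a minimal sketch in Python. It is not the actual OWm2 schema: the class name, the field names and the choice of zlib as compressor are all assumptions made for illustration only.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimediaText:
    """Hypothetical stand-in for a multimedia_text record."""
    compressed_text: Optional[bytes] = None  # compressed UTF-8 text, or None
    file_list_id: Optional[int] = None       # pointer to a file list, or None

    @classmethod
    def from_text(cls, text: str) -> "MultimediaText":
        # Textual content is stored compressed (zlib is an assumption).
        return cls(compressed_text=zlib.compress(text.encode("utf-8")))

    @classmethod
    def from_file_list(cls, file_list_id: int) -> "MultimediaText":
        # File content is never stored here, only a pointer to its file list.
        return cls(file_list_id=file_list_id)

    def text(self) -> Optional[str]:
        # Decompress on the way out; file-only objects have no text.
        if self.compressed_text is None:
            return None
        return zlib.decompress(self.compressed_text).decode("utf-8")


entry = MultimediaText.from_text("gatto, chat, Katze, cat")
video = MultimediaText.from_file_list(42)
```

Either way the rest of the system handles the same kind of thing: a piece of content. The potentially huge file behind `video` is fetched only if the user asks for it.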

Wait a minute! How can we tell in what language this content part is? Or in what script? Well, the multimedia_text doesn't know anything about it. Its mission is to store things and this is what it does. So how do we do it?

Here is where inheritance comes in handy. A multimedia_text is built on top of an object called object. You can think of it as something encapsulated into multimedia_text, like the engine in a car. You don't actually see it, but it's there and it does loads of useful things. All answers on how multimedia_text gets to know more about itself lie in what object does, as we shall see in the next chapter.

 

 

Object objects

At first sight, this object thing seems to do very little, apart from one very important thing: it contains a flag called mediated. When mediated is set to TRUE, it means that the content is language dependent, and in order to be meaningful for a user it must be put into the most appropriate linguistic phase.

I can see a lot of eyebrows moving. This is a dictionary and we are to use it to translate things, right? So what sort of content can be language independent? Well, suppose you decide to have a picture that has no text and explains a concept. In what language is your picture?

Besides, as we shall later see, object is not only included into multimedia_text objects. All the other entities that are using it carry exclusively language independent information, so it is very important that we have a switch to tell an object when its linguistic services are needed.

So okay, to return to our doubts about content, now we know that a given content part is expressed in "some" language, but how do we get to know in which one?

The answer is again in what object does, because object is the unified support for classification. Anything that includes object can be classified. So there is a classification telling us that a given object is expressed in some language, by using a given script, and we can retrieve it.

Exactly in the same way we can retrieve copyright information for this object, we can retrieve the source from where it came and the full list of classes to which this content was assigned. It would be very nice if we could immediately explain how classification works, but we still need to introduce some basic concepts, before we can do it.
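As a rough illustration of what has been said so far, the sketch below models an object with its mediated flag and a very naive classification lookup, then embeds it in a multimedia_text. All names (`BaseObject`, `classify`, `lookup`, the "facet" keys) are hypothetical, and the real classification mechanism is tree based, as later chapters explain.

```python
from dataclasses import dataclass, field


@dataclass
class BaseObject:
    """Hypothetical stand-in for the OWm2 'object' entity."""
    mediated: bool = False                       # TRUE: language dependent
    classes: dict = field(default_factory=dict)  # facet -> class name

    def classify(self, facet: str, class_name: str) -> None:
        self.classes[facet] = class_name

    def lookup(self, facet: str):
        # Returns None when the object was never classified on this facet.
        return self.classes.get(facet)


@dataclass
class MultimediaText:
    """Content part that carries an embedded BaseObject."""
    text: str
    obj: BaseObject = field(default_factory=BaseObject)


# A word is language dependent, so its object is mediated and classified.
word = MultimediaText("mačka", BaseObject(mediated=True))
word.obj.classify("language", "Serbian")
word.obj.classify("script", "Latin")

# A picture without text is language independent: no linguistic services.
picture = MultimediaText("")
```

The multimedia_text itself still knows nothing about languages. It simply asks the object it carries.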

The next step introduces us to our main semantic catalogue, the profile object.

 

Profile objects

NOTE! You will want to read this to fully understand what we mean by "profile". In short, by profile we mean the abstract concept that manifests itself in language. The idea of a cat, that keeps together all possible ways to say "cat" in all possible languages.

If you look at its database definition, a profile object appears to be one of the most abstract and impotent objects we have in our catalogue. In fact it doesn't do anything, it simply provides a reference that content can use to be semantically grouped.

By itself it doesn't contain any information whatsoever, but since it inherits object it can be classified, licensed etc. It is by definition language-independent information.

So, how do multimedia_text objects get assigned to a profile, if a profile doesn't know anything about content?

The answers are in the way the system manages hierarchical data, as we shall see in the next chapter.

Hierarchical data and translation trees

Our understanding of reality is often built in frames containing each other. A cat is a feline, which is a mammal, which is an animal, which is a living entity, etc. When we say "Fritz the cat" we imply a whole lot of information in that animal codename.

When dealing with content cataloguing we need a lot of such frames, and we need to make sure we correctly map what is included in what. But we need frames even to express how a multimedia_text came to be appended to a given profile.

It is important for us to know that someone created a piece of content as an original, or as a translation. If it was a translation we must know from what it was translated in the first place. This serves two purposes:

  1. if the source material gets updated, users must be informed that the translation might not be correct any more, so that it will be possible to verify it
  2. if the translation was made on top of another translation we have to state the depth of the tree, so that users realize that semantic drift for this translation is probably higher than it would have been by making a direct translation from the original.

So our content gets assigned to a profile by a so-called translation tree. This tree marks the way in which multimedia_text objects were produced and lets us immediately see that translation 1-4 is pretty likely to contain a huge semantic drift.

This tree is not an object, but simply a structure that can be contained by proper objects. So while looking at its database definition you see nothing at all in the profile table, the OWm2 engine can assemble the whole relational structure that is built on top of it.

Yet, wait! Aren't we making dictionary entries? So what is this content we are talking about? The lemma itself or its definition? The answer is in the kind of tree we use. There are actually many: OWm2 uses them to map all possible kinds of taxonomic relations with just one dedicated set of routines that manages them all. There are trees that order class taxonomy, others that order network topology, others that mark the way an "entry" is built. But before we proceed to this there are a few things to explain.

A tree structure knows many things:

  1. a context, which is the profile object to which the tree belongs
  2. a root, because as we see in the picture we can have any number of original multimedia_text objects that are ripe for translation
  3. an object pointer, that tells the profile object what object is referred to by a given element of the tree. Trees do not need to refer directly to a multimedia_text object, for the simple reason that they can refer to their included object instead. This allows trees to order literally anything in the system, not just language-dependent content.

You will probably have noticed that these three elements do not tell us anything about the hierarchical position of a given element within the tree. They do not say that translation 1-4 came from translation 1-3. Let's see why.

One of the most difficult challenges for the coder is to find a way to efficiently represent hierarchical data in a relational database. It may seem weird, but there is no immediate way to retrieve a taxonomic tree from relational tables by a single efficient query. So we all resort to tricks.

All tree elements have a left and right value. They work as frames, so we immediately see that element 0-7 includes all the others, element 1-6 is included in 0-7 and includes both 2-3 and 4-5. This makes it trivial to arrange queries that retrieve a full taxonomic mapping in a single shot and can compute an element depth on the fly. It also makes it trivial to move around parts of a tree. Here we have the basic structure that allows merging and splitting things without much fuss and without any risk of losing bits and pieces in the process.
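The left/right framing can be tried out in a few lines. The sketch below uses Python with an in-memory SQLite database purely for convenience. The table and column names are invented for this example and are not the OWm2 schema, while the four elements are the 0-7, 1-6, 2-3 and 4-5 frames mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (name TEXT, lft INTEGER, rgt INTEGER)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?)",
    [
        ("original", 0, 7),       # frame 0-7 includes all the others
        ("translation A", 1, 6),  # inside 0-7, includes 2-3 and 4-5
        ("translation B", 2, 3),
        ("translation C", 4, 5),
    ],
)

# Full tree in one query. The depth of a node is the number of frames
# that enclose it, not counting the node itself.
rows = conn.execute(
    """
    SELECT node.name, COUNT(parent.name) - 1 AS depth
    FROM tree AS node
    JOIN tree AS parent ON node.lft BETWEEN parent.lft AND parent.rgt
    GROUP BY node.name, node.lft
    ORDER BY node.lft
    """
).fetchall()

# Everything derived from "translation A", e.g. the translations that
# need to be verified again when it gets updated.
descendants = [
    name
    for (name,) in conn.execute(
        """
        SELECT node.name
        FROM tree AS node, tree AS parent
        WHERE parent.name = ?
          AND node.lft > parent.lft AND node.rgt < parent.rgt
        ORDER BY node.lft
        """,
        ("translation A",),
    )
]
```

Here `rows` lists each element with its depth (0 for the original, 1 for translation A, 2 for B and C), and `descendants` holds translations B and C. No recursion is needed for either query.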

Such an ingenious solution is obviously no invention of the OWm2 team. All credit for it goes to Mike Hillyer for a very clear explanation of this method, along with basic code snippets that made it trivial to build what OWm2 needs.

So, once we explained the basic technology we use to map and move relational data, let's move to the way in which "dictionary entries" are built.

 

What exactly is a profile?

As we saw, a profile seems to be quite a poor, powerless thing. But once the OWm2 engine has collected and assembled all the language-dependent content assigned to a given profile, we get a totally different picture.

What we get is a dictionary entry, in the usual sense of an expression = definition equation. This happens because during the linguistic content assembly process, two families of hierarchical tree structures are used:

  1. expressions
  2. definitions

By now it should be clear that there is no structural difference whatsoever between the two. They are made by assembling the same multimedia_text objects by means of the same tree structures. A single unified set of routines serves them both. The difference between them is all in the role they play for human users.

A basic point is that since a profile is a language independent object (it is created as an empty set, as far as language-dependent content is concerned) neither of the two tree families is mandatory. You can use a profile to collect expressions only or definitions only or any mix of the two.

The second important point is that this structure is not limited to dictionary management. Basically everything can be represented as an equation of content elements.

A news service or a CMS may use the expression tree to store articles, and the definition tree to store those small teasers that usually get published in the front-page. A software localization service may store a message codename in one of the two trees and its localized versions in another. A textual repository may store a full book in the expression, and a short description of the text in the definition. A Corpus Juris may have the text of its Laws in the expression, while serving commentaries in the definition tree, etc.

Obviously, no such service will be happy to call the two members expression and definition. But this is really not a problem, because we are in a multilayered architecture. All they need is to properly rename these entities in their GUIs. The logic remains unchanged anyway.

The members of the equation are fully interchangeable exactly because, as we have seen in the very beginning, a multimedia_text object is absolutely unaware of how it's going to be used. So our search engine can analyse the whole of our content base without needing to jump here and there. And it can do it really quickly, even if we get to host billions of entries.

But the most important point is that a profile object can now emerge anywhere in its linguistic phase space. That is, we have an entity that while being absolutely language independent is also fully capable of materializing itself within just any linguistic space. And the more original and/or translated linguistic content we assign to this profile object, the more capable of adapting to different users and cultures it gets.

We shall build a lot upon this, as you'll see.

Linguistic independent classes

Being able to collect an immense amount of linguistic polymorphism is surely nice, yet what makes a cataloguing system strong is its capability to classify things. The capability to record that a cat is a feline, is a mammal, is an animal.

In the real world any concept can be used to classify any other. In order to use a profile to classify another in the OWm2 storage engine, a user must say that he/she is willing to use the first as a class object. This is meant to limit the number of classes, to avoid turning GUI widgets into a nightmare.

When you say that a given profile is also a class you may choose to set it abstract. If you do so it means that this class object cannot be used to directly classify simple profile objects, but only to classify other class objects in a taxonomical tree.

This may sound quite obscure, so we had better give a clear example of the implications. When classifying animals, among many other things you may want to say that "tiger" is a "feline", is a "mammal", is an "animal". Just in the same way you may be interested in stating that German is an Indo-European language.

Yet you might not want to spend ages in tagging all single German expressions as Indo-European, or all mammals as animals, and you might not be satisfied if anyone said that "tiger" is simply an "animal". You may want to state once and for all that "feline" is "mammal", is "animal" and be done with it. And you may want your GUI widgets to propose only analytical class objects, while keeping the more generic conceptual layers in an invisible implied background.

The answer is in defining these class objects as abstract. Abstract class objects can build up tremendously powerful semantic frames, while keeping the number of choices in widgets to a bare minimum. Since class objects are classified by tree structures, as anything else, you can always change and move your classification structure later on.

Using such implied background information can make your data extremely powerful, because while nobody will be offered a chance to use an abstract class object to classify a given profile, anyone will be able to use these class objects as search terms. So a comparative linguist, for example, will be able to extract in one go all Indo-European expressions for "cat".
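The behaviour of abstract classes can be sketched as follows. This is only an illustration under simplifying assumptions: the taxonomy is kept as a plain parent map instead of the left/right trees the engine really uses, and all names (`PARENT`, `ABSTRACT`, `assignable_classes`, `search`) are made up for the example.

```python
# Class taxonomy: each class points to the class that includes it.
PARENT = {"feline": "mammal", "mammal": "animal", "animal": None}

# Abstract classes classify other classes only, never simple profiles.
ABSTRACT = {"mammal", "animal"}

# Profiles are tagged once, with the most analytical class available.
profile_classes = {"tiger": "feline", "cat": "feline"}


def assignable_classes():
    """Classes a GUI widget would propose for classifying a profile."""
    return sorted(c for c in PARENT if c not in ABSTRACT)


def implied_classes(cls):
    """A class plus every class that includes it, walking up the taxonomy."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def search(cls):
    """Profiles that belong to cls, directly or by implication."""
    return sorted(
        profile
        for profile, own_class in profile_classes.items()
        if cls in implied_classes(own_class)
    )
```

With this data, `assignable_classes()` proposes only "feline", yet `search("animal")` still finds both "cat" and "tiger", although nobody ever tagged them as animals.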

Once again, all class objects include a profile object that is responsible for their linguistic manifestation. So you can translate them and define them as any other concept, because concepts is what they are.

What we build is more than a traditional taxonomy. Many relations can be expressed as a directed graph, as shown in the attached picture. Such structure allows for inclusion of a class into an infinite number of traditional taxonomy trees.

In the example we see how the class object medicinal herbs can be included in the wider class object medicine by a number of different paths.

In fact both the following taxonomies are expressed:

This gives our classificatory structure a better degree of flexibility and allows for complex systems to be efficiently represented. But it is not enough, yet, as we shall see. Let's consider one more example, shown in the third attached picture. Here we deal with agriculture and soil classification.

A particular kind of soil, the Alfisol, is so called because of the presence in it of Aluminium and Iron. The profile objects "aluminium" and "iron" are surely included in the conceptual "base" that can lead to fully grasp what an Alfisol profile really is. But is this inclusion hierarchical? Can we say that "everything about iron" should be labelled as "belonging into an Alfisol class object"? Most people will say that no, we cannot.

A hierarchical categorization is not the best tool to express the relation. Here we need something more like a WWW link: something that suggests you may also have a look at the information about Iron, but that does not automatically fit the Iron profile into the list of components of the "Alfisol set". Good candidates for this set are, instead, landmarks where Alfisol is common, or specialized agricultural techniques for this peculiar kind of soil.

So what we have is a non-directed link, which we drew as a dotted line in our last example. In mathematical terms, we shall say that we use both a directed and a non-directed graph to map semantic values. The resulting table of combinations (to remain in mathematical terms we would say "the resulting incidence matrix") is not simply composed of Yes/No values, but rather of any pick of empty/directed/non-directed. And what we have, in practical terms, is the capability of expressing a relation between two profiles that either:

  1. puts one value INTO another (and thus has a "direction")
  2. simply links the two elements to each other (and so has no predefined direction, it can be walked both ways with the same result)

To optimize space consumption we obviously do not store empty values, so all we have is the cells of the matrix that contain a relational value. At design time we decided that this is "enough". This was simply the designer's decision, and certainly not a law of physics, but for all practical purposes this structure really seems to do all a dictionary needs.
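A sparse store of this kind fits in a few lines of Python. The sketch below is an assumption about how such a matrix could be kept, not a description of the actual tables: only non-empty cells are stored, and the names `Relation`, `relate` and `relation_between` are hypothetical.

```python
from enum import Enum


class Relation(Enum):
    DIRECTED = "directed"          # puts one profile INTO another
    NON_DIRECTED = "non-directed"  # simply links two profiles


# Only cells holding a relation are stored; empty cells do not exist.
relations = {}


def relate(a, b, kind):
    """Record a relation from profile a to profile b."""
    relations[(a, b)] = kind


def relation_between(a, b):
    """Relation seen when walking from a to b, or None for an empty cell."""
    kind = relations.get((a, b))
    if kind is not None:
        return kind
    # A non-directed link can be walked both ways with the same result.
    if relations.get((b, a)) is Relation.NON_DIRECTED:
        return Relation.NON_DIRECTED
    return None


relate("medicinal herbs", "medicine", Relation.DIRECTED)
relate("Alfisol", "iron", Relation.NON_DIRECTED)
```

Walking from "iron" to "Alfisol" gives the same non-directed link as the other way round, while the directed inclusion of "medicinal herbs" into "medicine" is visible in one direction only.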

Next we shall see that some classes are more classes than others.

Special classes and extended vs normal classification

Some classes are heavily used by the system for its own internal jobs. There was no point in developing two separated classification systems, one for humans and one for the engine, because basically they do the same thing, only with a different meaning.

When the system assigns a given linguistic content to the "English" class object, it means that it is expressed in English. Yet a Hindi native speaker may classify (in Hindi) the profile "Saxon genitive" with the very same "English" class object, because it is related to the English language. And he/she may do it while remaining totally immersed in the Hindi linguistic phase.

Both uses are obviously correct, and it is important that the system can tell the difference between the two. In this case we could use the fact that only a multimedia_text object is classified as linguistically dependent, while whatever is said of a profile is obviously a "normal human classification", yet there are more subtle logical traps ahead.

All object objects in the OWm2 storage engine bear license information. Most of them simply say they are not subject to copyright, yet potentially all are. So what happens when you assign a given profile to the class object "CC-BY"? Are you saying that this profile is related to this particular license, or are you rather stating licensing information proper?

You cannot tell, unless you state clearly which classification activity was made for what. So any time the system records the assignment of an object to a given class object, it also states whether it is doing so for its own internal purposes or as a result of human interaction. In the API you'll find the internal purposes named as "extended" vs "normal" (human) classification.

For all practical purposes we maintain in the system two parallel classificatory layers. One is service oriented, and it classifies objects by language, script, licensing information, file format and source. The other is the free associative machine by means of which a human user may state that "this is related to that". Once again, both layers share exactly the same software.
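The two layers can be pictured as one list of classification records that carry a layer marker. The sketch below is illustrative only: the record layout and the object identifiers are invented, and just the "extended" and "normal" labels come from the API naming mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    object_id: str
    class_name: str
    layer: str  # "extended" (system purposes) or "normal" (human)


records = [
    # The system states that this content IS in English and IS under CC-BY.
    Classification("text:cat-en", "English", "extended"),
    Classification("text:cat-en", "CC-BY", "extended"),
    # Humans state that these profiles are ABOUT English and ABOUT CC-BY.
    Classification("profile:saxon-genitive", "English", "normal"),
    Classification("profile:copyleft", "CC-BY", "normal"),
]


def classified_as(class_name, layer):
    """Objects assigned to a class within one classificatory layer."""
    return [
        r.object_id
        for r in records
        if r.class_name == class_name and r.layer == layer
    ]
```

The same "CC-BY" class object answers two different questions: `classified_as("CC-BY", "extended")` returns what is licensed under it, `classified_as("CC-BY", "normal")` returns what is related to it.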

Special class objects are created for the system's sake only, as they include the data needed by the system to perform special internal operations. They are made by wrapping a normal class object into a larger container, that is basically used only by the system. Users still see them and use them as ordinary class objects.

In particular, two special classes map the "legal input" for all linguistic content in the system. They do so by building a table that says what language/script couplings are allowed. There is indeed no point in storing English content in Cyrillic transliteration, but Serbian, for example, must map the possibility of both Serbian/Latin and Serbian/Cyrillic. Japanese has up to 4 possible variants and they must be kept well ordered and identifiable from each other.
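A table of legal couplings boils down to a set of pairs. The following Python fragment is a toy version with invented names (`ALLOWED`, `is_legal`, `scripts_for`). Only the English and Serbian pairs named above are listed, and nothing here reflects the real table layout.

```python
# Each allowed (language, script) coupling is one entry in the table.
ALLOWED = {
    ("English", "Latin"),
    ("Serbian", "Latin"),
    ("Serbian", "Cyrillic"),
}


def is_legal(language, script):
    """True when content in this language may be stored in this script."""
    return (language, script) in ALLOWED


def scripts_for(language):
    """All scripts a GUI could offer for a given language."""
    return sorted(s for (lang, s) in ALLOWED if lang == language)
```

So `is_legal("English", "Cyrillic")` is refused, while `scripts_for("Serbian")` offers both Cyrillic and Latin.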

When reading the API documentation you'll never see expressions like "Language" or "Script", though. This happens because the system is ready to accept non-human languages, too. We are not speaking about Martian invaders, obviously, but rather about software.

There are a number of protocols that allow representing mathematical expressions, molecular 3D rendering etc. Calling these "languages" or "scripts" would have been (at the very best) improper, so we decided to use more generic labels. This is the reason why in the API you meet

  1. Communicative systems (which include human languages and rendering software)
  2. Mediums (which are scripts and protocol versions for software)

So now the circle is closed, and we finally came to see how a multimedia_text object knows in which language it is expressed.

 

Original vs sourced material

You might have noticed that we spoke about "sources" in the previous pages. It is time to explain what we meant. When users insert content in the storage engine, they may either mark it as original content they made by themselves or make it a "quote".

A quote is content that is not supposed to be edited, unless improperly quoted. It can be translated, obviously, but the original should not be altered. This is especially efficient when importing public "collections", because the collection as such is usually copyrighted and the related licensing information must be properly stored and shown.

This is also an efficient way to mark that a given definition is "authoritative", for example when issued by a public Board that is in charge of defining some kind of standards. An authority object is by all means a usual class object, but defining it as an authority tells the system that it can be used to source objects.

Users that feel that an "authoritative" definition is not the best tool to define a profile object may still freely add other original definitions as they feel fit.

A peculiar use for this mechanism consisted in marking the import of all the content originally stored in the Omegawiki database.

The network layer

This is where true geeks would have started their explanations from. We rather chose to present it here, when at least the subject of such extended logorrhoea is finally clear. If you are not interested in the co-operative mechanism included in the OWm2 storage engine you may happily skip this page.

We said that the object object is included in everything we store, like an engine that performs lots of useful tasks while many drivers simply ignore most of its details. Well, inside an object object there is yet another hidden engine, the network_object object. This really is the basement layer, there is nowhere deeper than here.

A network_object knows everything that is needed to move objects along different nodes in the extended network. It carries identifiers that have a universal validity, it identifies where and by whom an object was created and it is responsible for keeping a history of the larger object built on top of it, any time this larger entity gets altered in any of its parts.

Since it is the ambassador of any item towards the network at large, a network_object has much more awareness of the total entity in which it's included than any other part of it. It must have it, because the network daemon needs to know which is which, to correctly organize the sequence of its operations.

It is in fact relevant to introduce new elements in such an order that will ensure that any new object will find in place all the elements it needs to work. For example, when a new language object is created and used to classify content, it is vital to import this object before we import the content that it will classify.
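Ordering imports so that every object finds its prerequisites in place is a topological sort. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9 or later). The object identifiers and the `depends_on` map are invented, and nothing here claims to reproduce how the network daemon actually schedules its work.

```python
from graphlib import TopologicalSorter

# Each incoming object lists the objects that must already be in place.
depends_on = {
    "text:mačka": {"class:Serbian", "class:Latin"},  # content needs its classes
    "class:Serbian": set(),
    "class:Latin": set(),
}

# static_order() yields every object after all of its prerequisites.
import_order = list(TopologicalSorter(depends_on).static_order())
```

Whatever order the two class objects come out in, the content part "text:mačka" is always imported last.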

The network_object is also the place in which a user stores his/her commands to the network daemon. A user may want to keep some content private, or he/she may be unhappy with what other editors or administrators out there are doing with a given object.

The network_object is the engine that can perform an unlink operation, thus ensuring that a particular object is disconnected from the global network and that no matter what other people do with it, the user will never see it changed again (but obviously the user can still edit it locally). Since literally everything contains a network_object users can unlink from the network whatever they please.

Unlinked objects are always returned at the top of local searches. This is no ideological position, but a simple practical consideration: people usually protect those things they most care for. It would be very annoying if those very things were buried at the bottom of their search results.
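The effect of unlinking on updates and on local searches can be mimicked like this. It is a toy model with hypothetical names (`Item`, `apply_network_update`, `local_search`), not the engine's real behaviour in any detail.

```python
from dataclasses import dataclass


@dataclass
class Item:
    uid: str
    content: str
    linked: bool = True  # True while the object follows the network version

    def unlink(self):
        self.linked = False


def apply_network_update(item, new_content):
    """Accept a change coming from the network only for linked items."""
    if item.linked:
        item.content = new_content
        return True
    return False


def local_search(items, term):
    """Matching items, with unlinked ones returned first."""
    hits = [i for i in items if term in i.content]
    # False sorts before True, and sorted() keeps the original order of ties.
    return sorted(hits, key=lambda i: i.linked)


items = [Item("a", "cat (network)"), Item("b", "my cat"), Item("c", "dog")]
items[1].unlink()
```

After the unlink, network updates to item "b" are ignored, and a local search for "cat" returns "b" before "a".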

Understanding the guiding lines for our development process

We built this software with several goals and guidelines in mind. They all came from joining a large number of "books of dreams" and initially people reacted quite sceptically to this list. Yet, all that we are using today as everyday tools was "crazy dreams", just some 5-10 years ago.

Here is the list of requirements we started from:

  1. Write as little software as possible. Break the system in layers and use whatever Open Source tool can perform the layer's job.
  2. Ensure co-operative content production. We want people to be able to freely add and edit things.
  3. Low spam. We need to make any spam-bot simply useless.
  4. No server/bandwidth expenses. Our system must be able to serve content on the net if anyone wants to use it for that, but it shall never force anyone to start large fund-raisers to pay for servers and bandwidth.
  5. No digital barrier, so no stoppers for those who cannot connect to the Internet. People shall be able to fully work off-line. When no Internet connection is available it must be possible to send and receive content by exchanging media like RAM keys. This is especially vital for countries in which expensive public infrastructure cannot be expected to grow out of thin air overnight.
  6. No central servers, no central control. So no group of people that has larger powers than anyone else. Instead, all users shall be able to "build a fence" and live as they please behind it. This must be true both at individual and corporate level. We aim to be a general purpose repository like the WWW, where everyone is free to publish what they please, and people are free to choose who and what is of interest to them.
  7. Possibility to store corporate private info. A group of people or a company may well want to keep their data secret, or to sell access to their data.
  8. Content usability. A network that can potentially group billions of nodes, each of which can store specialized content, builds up to a pretty complex puzzle for anyone who wants to become part of it. We needed a good compass for users.

We assembled our pick of technologies based on these requirements. In the next chapters we shall discuss what layers are responsible for co-operative content management and what pre-existing tools have been chosen to perform the related tasks.

The service layers

The first thing we wanted was a safety exit: the chance for us to "change components" with (almost) zero impact. When assembling different pieces of pre-existing software one often stumbles into surprises, so we wanted things to integrate loosely enough for us to swap one piece of software for another without actually compromising the general architecture. We also wanted a high redundancy, which means that we wanted to have many copies of the data, so that if a node or part of it gets damaged there are still good chances to get data back.

The final result is shown in the attached graphic. Rising up from basement level what you have is:

  1. Freenet: an open source peer to peer network that has no central server and can serve pretty well the following layer. It can efficiently cache the multimedia files on a large number of nodes, so that if you have any such file you can "publish it" and the network will take care of caching it on a number of nodes proportional to its popularity
  2. Mercurial: an open source distributed version control system. It manages a repository of the XML representations of our objects. The node writes its activity to this repository, which is pushed and pulled to/from the general repository in terms of minimal incremental changes to reduce the amount of traffic. It uses Freenet as a transport when an Internet connection is available, but it can also use just anything, from usual HTTP syncs to hand transported flash drives (which frees our users from the need to have a permanent or even occasional Internet connection).
  3. MySQL: an open source relational database into which objects are assembled to be efficiently served to the GUIs.
  4. Sphinx: a free open-source SQL full-text search engine. It is deeply integrated with MySQL, so we put it in parallel. We use it as "contained" in MySQL, as a plugin storage engine, but if needed it can be accessed as an independent entity.
  5. Ambaradaemon: this is what all external applications use to deal with our system. They have no notion of what software packages are doing what. They connect to this daemon and use a public API to send it commands and retrieve answers. So it is... a translator, if you please.

On top of these sits any GUI that can manage front-end relations with the user. There can be any number of them: desktop applications, macros for Open Office, local intraweb applications, you name it. They all exchange messages with Ambaradaemon, which in turn delivers them in the proper format to the package that executes them.

We shall now spend a few words on some of the components, to see more in detail what they actually do.

Why Freenet and how do we use it?

Questions about our choice of Freenet as a network transport may range from Freenet's "pirate" reputation to more technical aspects, like the time it takes for a node to become fully functional. We welcome the broadest possible discussion, because it will surely help in making sounder decisions.

Buccaneers, le Carré plots, 5th columns and all that jazz

  1. We use Freenet technology, but do not host ANY of Freenet's content. OWm2 is by all means an independent network. We simply use the software and get all the updates and bug fixes official Freenet gets. Technology and the use people make of it are two different things. Cellular phones are sometimes used to detonate bombs in public places. You haven't thrown away your mobile phone because of this, have you?
  2. Disconnecting our traffic from the official Freenet is in the interests of both sides. Private users and institutions may well object to the kind of content the official Freenet is often believed to store, and Freenet users have no peculiar interest in crowding their caches with tons of XML stuff that is meaningful only to an Ambaradaemon. Individuals who want to run both networks on the same machine can easily do it anyway.
  3. Our default values are totally performance oriented and provide no special security level (technically speaking we run an OpenNet). Single users may activate all the free masonry bits in their Freenet software, if so they please. Yet, those who face real danger of political repression should be well aware that no encryption will save them, in the long run. To infiltrate their trusted human relations all it takes is a small amount of money, no IT skills and no network monitoring at all. Most dictatorships are very good at this, even when they lack super-computers.

It's written "strategy", it reads "money"

  1. We chose Freenet mainly because it has no "centre". No main server, no network administrators, no nothing. So no expenses to maintain a vital part of the system and no need for fund-raisers even just to stay alive. We live in 2009, and lack of funding is a serious issue for just about any project. If anyone feels like spending money to add (optional) alternative transports they are absolutely welcome. If for any reason they stop doing so, Freenet will still be there, for free.
  2. Freenet is a mechanism of "implicit donation", or electronic barter, if you please. Users donate shared disk space and bandwidth to the system. They buy these resources (their computer and connection) where they please and simply use a part of them to run OWm2. Each user volunteers roughly 4 Gbytes to the network, i.e. 256 users produce a 1 Tbyte disk for free. In exchange they get content, while no actual money comes to us.
  3. We are no idealists. Big corporate users may eventually want to buy consulting services from us once our system has proven solid, at least that's our bet. To prove we are solid we need a large user base, and we cannot get it if we charge a fee (even as a "free donation") to single end users who live in deep economic recession. Ours is a start-up, and with not much cash available to potential investors we need to allow them to invest something else, like time and resources. That's it.

A geek's view

  1. Freenet has been actively developed and maintained since 2002, it is quite stable, and because it is frequently attacked by well organized spammers a lot of development has gone into making a spammer's life difficult. There are many other networks around, but this one seems to have undergone a good deal of bug-fixing; plus, its developers are extremely friendly and co-operative. Using their product takes an enormous load off our backs.
  2. Freenet has FEC (Forward Error Correction). When large files are inserted into Freenet, they are split into many small blocks, making a so-called splitfile. FEC adds redundant check blocks to a splitfile, so that if some of the blocks fall out of the network or can't be found, you might still be able to retrieve enough of the file to reconstruct it.
  3. In Freenet, people have no idea whatsoever of what is cached on their nodes. Our users generally have no concern about policemen and/or government officers reading their files (it's but a dictionary, after all), yet the idea of a malicious hacker bringing havoc to their content worries absolutely all of them. This detail makes forgery really hard.
  4. Freenet performance greatly benefits from permanent uptime, but not all machines need to be connected. If you have more than one machine you should think of making one node available to all other members of your LAN. It's important, because it means less money to spend for your electricity bill! If you cannot use/afford Internet at all you will have to use the alternate transport systems (flash disks etc) to deliver/retrieve updates to a connected node.
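
To give a feeling for what "redundant check blocks" buy us, here is a deliberately tiny sketch. It is NOT Freenet's actual FEC code: we swap in the simplest possible scheme, a single XOR parity block, which can rebuild exactly one lost block. The function names (split_with_parity, recover) and the block size are our own assumptions, made up for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, block_size: int) -> List[bytes]:
    """Split data into fixed-size blocks and append one XOR parity block.

    Illustrative only: a real splitfile uses a stronger code that
    survives the loss of many blocks, not just one.
    """
    if not data:
        raise ValueError("nothing to split")
    # Pad with zero bytes so that every block has the same length.
    padded = data + b"\x00" * (-len(data) % block_size)
    blocks = [padded[i:i + block_size] for i in range(0, len(padded), block_size)]
    return blocks + [reduce(xor_blocks, blocks)]


def recover(blocks: List[Optional[bytes]]) -> bytes:
    """Rebuild the (padded) data when at most one block is missing (None)."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one lost block")
    repaired = list(blocks)
    if missing:
        # XOR of all surviving blocks (parity included) equals the lost one.
        present = [block for block in blocks if block is not None]
        repaired[missing[0]] = reduce(xor_blocks, present)
    return b"".join(repaired[:-1])  # drop the parity block
```

If one block "falls out of the network", recover() still returns the original bytes (plus padding); lose two and this toy scheme gives up, which is exactly why real networks use more check blocks.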

What exactly do we use Freenet for?

  1. Freenet is mostly a cache for us. Multimedia files are published onto the net and keep circulating among nodes, where chunks of them are locally cached. This means that most popular files are more widely cached and will be retrieved more quickly, while "niche" content may need to be regularly reposted by the daemons, to make sure it's still cached. A good strategy for efficiency is downloading content you want to cache. Future releases will study a system to make this an automated procedure.
  2. We also use Freenet as a data transport, thanks to FreenetHG (you need to run the official Freenet to see the FreenetHG page, so we point you to a related page on the usual WWW instead). FreenetHG synchronises the node's local Mercurial repository with the shared edition. It all boils down to an svn-like repository of XML object representations. Since most objects never change we can squeeze the number of incremental changes to a bare minimum. In practice objects are saved twice on each node, once in the XML shared repository and once in MySQL. Since this data is constantly requested by all nodes, it is extremely well cached and it takes a very small number of hops to get it, even for a new node.
  3. Speed is not an issue for anyone using this system, so even if Freenet is surely slow compared to the usual WWW ways, this simply does not matter. The daemon pushes updates to the network only once every 6 hours, and also pulls the network updates every 6 hours. In practice, your node reads updates from the network (say) at 9am, uploads your changes (if any) at 12 noon, and gets the next refresh at 3pm. This limits global network traffic (and individual bandwidth bills) to a reasonable minimum and it makes sure it does not matter whether it takes 2 minutes or 2 hours to download an update. All the data you use is resident on your machine (served by MySQL), so this usage of Freenet will never show you any "Hourglass icon". Things will appear in your local DB as soon as they are downloaded, and while the process goes on your client GUI will work just as if nothing happened.
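
The push/pull rhythm described in the last point can be sketched in a few lines. This is only our reading of the example (pull at 9:00, push three hours later, next pull at 15:00); the 3-hour offset between pull and push and the function name sync_schedule are assumptions, not the daemon's actual code.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

PERIOD = timedelta(hours=6)       # each direction runs once every 6 hours
PUSH_OFFSET = timedelta(hours=3)  # assumed: pushes sit halfway between pulls


def sync_schedule(first_pull: datetime, cycles: int) -> List[Tuple[str, datetime]]:
    """Return the chronological list of pull/push events for some cycles."""
    events = []
    for i in range(cycles):
        pull_time = first_pull + i * PERIOD
        events.append(("pull", pull_time))
        events.append(("push", pull_time + PUSH_OFFSET))
    return events


for kind, when in sync_schedule(datetime(2009, 1, 1, 9, 0), 2):
    print(kind, when.strftime("%H:%M"))
```

Whatever the exact offsets turn out to be, the point stays the same: the node talks to the network at a handful of fixed moments per day, and everything in between is served locally.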

Mercurial, or what does XML repository mean?

Using a distributed repository means that you don't need to be connected to a central machine to write your updates. Instead, you write them locally, and once in a while you can commit them to a global container, from which they will diffuse to all other users. None of your updates gets lost when you are not connected to the network at large; you simply cannot deliver them for a while, and that's it.

All our objects are stored as XML representations. Putting them all into just one repository would quite defeat one of our main goals: making sure a user downloads just what is needed. As a matter of fact no user needs the whole networked knowledge base, most of them need but some 5% of it, many will need much less. So instead of one big fat repository we use many smaller specialized containers that are kept on different nodes, thus minimizing the traffic load a single node bears.

As previously said, we offer the chance to have the big bulk of the content in a few languages, while a user may decide to get the semantic classifiers in a much wider linguistic range. So we need to build independent containers for class objects and simple profile objects. None of them has any linguistic content, as we already know, but they build the logical structure we use to decide what linguistic content is needed.

Once we have the full logical structure in place it is trivial to match it against the node's linguistic subscriptions to build our download list. As we see in the attached picture, there is one independent content repository per language. The whole set of repositories forms what we call a "region", that is, a coherent data subset. As we shall see in the related documentation, regions can be combined with each other to form larger datasets.

Let's see how many independent repositories a "region" implies. The number of languages registered in ISO 639-3 amounts to 7,622. Each of them will need 2 repositories (one for class objects, the other for profile objects), plus 2 repositories for the linguistic independent information. This makes a theoretical total amount of 15,246 repositories. It would be fantastic to need them all, yet actual numbers aren't even remotely close to this mark. Content produced by the Omegawiki project in years of work covers 183 languages in all, for a grand total of 368 physical repositories.
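
The count above is plain arithmetic, and it can be checked with a one-line function. The function name and its parameters are ours, introduced only to make the formula explicit: two repositories per language, plus two for the language-independent information.

```python
def repositories_per_region(languages: int,
                            per_language: int = 2,
                            language_independent: int = 2) -> int:
    """Number of physical repositories a region needs for a given language count."""
    return languages * per_language + language_independent


print(repositories_per_region(7622))  # theoretical maximum quoted in the text
print(repositories_per_region(183))   # languages covered by the Omegawiki content
```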

How multiple datasets can combine

In the documentation about regions you will find how users can choose to use several regions, combining them into a single database. This happens at MySQL level, while at the level of the Mercurial repositories data do not mix at all. All objects are identified by a UUID, which is also the name of the file that identifies them in the repository, and a given UUID belongs to one and only one region (that is, it is contained in only one set of repositories).

Regions can specify a set of "included regions", just like a software package has dependencies. Regions like "Biochemistry", "Physical chemistry" and "Neurochemistry" may all benefit from including a more general "Chemistry" region. Quite obviously, such a complex data organization cannot be achieved overnight. Chances are that the need to move and organize objects will come only once a given critical mass is reached. Administrators will need to split regions that become too big or simply too confused to be useful.
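
Since "included regions" behave like package dependencies, working out what a node must actually download is the usual dependency walk. The sketch below is a guess at how this could look, not the engine's real code: the function name resolve_regions and the plain dictionary describing the inclusions are assumptions.

```python
from typing import Dict, Iterable, List


def resolve_regions(subscribed: Iterable[str],
                    includes: Dict[str, List[str]]) -> List[str]:
    """Return every region to download, included regions first.

    Each region appears once, and inclusion loops are tolerated
    because a region is never visited twice.
    """
    resolved: List[str] = []
    seen = set()

    def visit(region: str) -> None:
        if region in seen:
            return
        seen.add(region)
        for included in includes.get(region, []):
            visit(included)
        resolved.append(region)

    for region in subscribed:
        visit(region)
    return resolved


includes = {
    "Biochemistry": ["Chemistry"],
    "Neurochemistry": ["Chemistry"],
    "Chemistry": [],
}
print(resolve_regions(["Biochemistry", "Neurochemistry"], includes))
```

A user who subscribes to "Biochemistry" and "Neurochemistry" ends up with "Chemistry" as well, and gets it only once.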

What if a given source wants to use this channel to "sell data" or if corporate users want to restrict access to their data? This will be possible in a next release. Private content is to be distributed and cached along the net, so the situation cannot be compared to accessing some well identified server. It is rather like receiving satellite TV broadcasts. The content goes around to each and everyone, but only some users can make any use of it. Part of the content may be public (the advertising messages that many channels send out), while the rest can be restricted.

Only authorized receivers will be able to use encrypted content. Keys may be sold, issued/revoked on the fly and/or given a validity in time. Such content may be efficiently included in the published set of repositories without compromising the seller's interests. Users obviously download only the encrypted repositories they have a key for. Nothing keeps them from downloading the others, too. Yet, they would only waste bandwidth if they did.

We also need to explain how a region can add semantic classification to profile objects that belong to another. This will be a constant need. If you build a region called "animals", the region "sports" will have a lot to say about it in terms of classification and relations. How can it do it? The answer is in using "headless" information.

An object is created by an XML generator that includes all data about the object, thus including semantic classification. "Headless" information includes only information about semantic classification and content. The object itself remains a property of another region, but the current region can "expand it", provided that the user already has the head information from another source (another region, that is). Whether new information should be added to the owner region or kept in another as "headless" is the users' decision.

Considering that information is constantly scanned and retrieved, even circular calls are resolved. On the first scan, headless information that has no object to which it can be attached will not be processed, but the second scan will find the heads in place and will add the information.
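
The repeated-scan idea can be sketched as follows. This is a simplified model, not the daemon's code: objects are reduced to a UUID with a set of classifications, and the function name scan_headless is invented for the example.

```python
from typing import Dict, List, Set, Tuple

Headless = Tuple[str, str]  # (UUID of the target object, classification to add)


def scan_headless(heads: Dict[str, Set[str]],
                  pending: List[Headless]) -> List[Headless]:
    """Attach headless information to known heads; return what must wait."""
    still_pending = []
    for uuid, classification in pending:
        if uuid in heads:
            heads[uuid].add(classification)
        else:
            # The head has not arrived yet: keep the item for the next scan.
            still_pending.append((uuid, classification))
    return still_pending


heads: Dict[str, Set[str]] = {}
pending = [("uuid-of-horse", "sports")]

pending = scan_headless(heads, pending)  # first scan: no head yet, nothing attached
heads["uuid-of-horse"] = {"animals"}     # the owner region's object gets downloaded
pending = scan_headless(heads, pending)  # second scan: the information lands
print(heads, pending)
```

Nothing is lost while the head is missing; the headless item simply waits in line until a later scan finds something to attach it to.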

A similar process can be used to treat encrypted information. The overall number of our repositories slightly grows, but we gained a lot of functionality by such a small growth in size. Now specialized regions can add specialized expressions and definitions, they can use other regions' simple profiles as classes, and they can add specialized classification to just any general object.

An important consequence of this approach is that some of your objects may vanish because someone out there moved them into a region you do not include in your subscriptions. When such is the case each user (either individual or corporate) may either add the new region or simply unlink the original object and keep it. When this happens at corporate level it may generate duplicates, and admins in this case should establish a relation between the two profile objects, rather than attempting a merge they cannot force on people anyway. In an OWm2 storage engine admins propose a solution, but they cannot impose it.

If users gather that a given choice is sensible they will join it. So no planned economy here, as it will always be the users to eventually determine which is which, rather than the admins. And no admin or mob rule can force a point of view on individuals and minorities. At least, not as long as people take the time to review what others are doing. But that's life, we all have but the rights we are prepared to defend.

From recycled spam to an Escher-like world

In an OWm2 storage engine admins propose a solution, but they cannot impose it... hmmm, it sounds like a dangerous thing, doesn't it? It actually only means that there is an endless number of absolute powers that cannot overlap each other. It means good fences, not chaos.

So this is no security leak, if you come to think about it. It's true that a malicious attacker can start a new independent region and include all sort of weird content in it. So what? He is not damaging any of the existing content of other regions, and it's only up to end-users to decide whether this new region is worth downloading or not.

Spammers may be annoying, but they are not stupid. Even the least sophisticated spammer will not want to spam himself to death in the absence of an audience. He will want to spam an existing region, something people read and use. This happens to be quite difficult, though. All new nodes are born greylisted, which means that what they post never gets automatically pulled into the distribution engine until someone has verified that "this is human". So yes, there are billions of ways in which a malicious hacker can inject content into his OWN node, but this content is only going to reach a public repository once a human has approved it.

If the region admins so decide, the node will become whitelisted; if the situation is unclear it may remain greylisted, or it can be blacklisted for good. Once it is blacklisted, everything it sends to the region will be automatically trashed. How the admins decide to keep or trash things is entirely up to them. They manage their own region and set policies as they please. One can very well decide to open a read-only region, with one and only one authorized editor.

So, before fooling anyone, a spammer has to work for a while to produce content that will please the local human testers. After that he will get the green light to insert without checking. Yet the system will greylist him again automatically if he suddenly exceeds his usual input rate. One cannot just do a bit of casual work and happily flood a whole nation right after that. If he wants to send out like 100 spam profiles a day he first has to regularly produce an equivalent frequency in good profiles, and he must keep his useful activity up for days.
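
To make the grey/white/black mechanics a bit more tangible, here is a minimal sketch of the decision a region could take for each incoming batch. Everything specific in it is an assumption: the status names, the way the "usual input rate" is computed (a plain average of recent daily counts) and the tolerance factor are ours, and each region would set its own policy anyway.

```python
from statistics import mean
from typing import List, Tuple


def judge_batch(status: str, daily_history: List[int], today: int,
                tolerance: float = 1.5) -> Tuple[str, str]:
    """Return (new_status, action) for a node's posts of the day.

    status is "white", "grey" or "black"; daily_history holds the number
    of accepted posts on previous days; tolerance is a made-up policy knob.
    """
    if status == "black":
        return "black", "trash"
    if status == "grey":
        return "grey", "hold_for_review"
    # Whitelisted node: fall back to greylist on a sudden burst of activity.
    usual_rate = mean(daily_history) if daily_history else 0
    if today > usual_rate * tolerance:
        return "grey", "hold_for_review"
    return "white", "publish"


print(judge_batch("white", [10, 12, 11], 100))   # sudden flood
print(judge_batch("white", [100, 95, 105], 100)) # steady contributor
```

A node that usually posts about ten profiles a day and suddenly sends a hundred goes straight back to human review, while one that has kept up a hundred a day for a while passes untouched.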

Now, isn't this censorship, isn't this an open road to "mob rule", to local communities deciding whatever they want and forcing their decisions on minorities and individuals??? The answer is a big capital YES. As we already said, we use Freenet technology but we are not Freenet. There is NO common policy about what is spam and what is not. Regions are 100% free to set their own standards, and anyone who doesn't like the standard is free to open another region with a totally opposite set of rules. Nobody needs anyone's approval for this.

To have a good chance of remaining unseen a malicious attacker needs to hide an amount of litter in the middle of a majority of good material. Such an attack vector forces admins to carefully examine things. The result is worth the effort anyway, because when they trash the litter they are still left with the good content the spammer used as a shield. So many regions may welcome at least smart spammers, after all.

Does this mean that no anonymous contributions are allowed? No. It means we don't need to deal with floating IPs etc. Nobody can tell who you are when you run an OWm2 node, but your node UUID is fixed. If you add content twice it will bear the same signature both times. You can reinstall the node from scratch and get a new UUID, but then you start from being greylisted again...

Power and how we cook it

Admins have decisional power over THEIR region, and that's all they rule. The OWm2 storage engine has no central server to which one must be "admitted" in order to be hosted. There cannot be any "overall community", council, foundation or company in charge of deciding whether to host a project or not. Even if such an entity appeared out of thin air our software would not allow its decisions. So if one doesn't like the way a region is administered, all he has to do is to open his own. Nobody here gets more censorship than he is willing to accept.

Needless to say, freedom has a price. A new region will need content, users and anti-spam administration, i.e. hard work, marketing and the infinite boredom of revising incoming stuff. One could whitelist all newcomers by default, but spammers never sleep, and the region's content would soon be so littered that most people wouldn't be interested in it. Let's be very clear: we offer freedom and independence, we don't serve free lunches. Users don't pay any money for hosting (at least not to us), but it's fully up to them to provide and organize all the labour it takes to keep their stuff running.

A small case study

Now let's abandon the theoretical field to take a dive into the real world. Say user X joined region Y and found out that an entry in the dictionary is (according to his point of view) downright wrong. He immediately corrects it and drops in "some basic common sense". He feels a happy man.

Too bad that after just a few hours his work is gone and the entry bears exactly the same wording that offended him in the first place. Together with this unpleasant refusal he finds out he has been blacklisted. Pretty rude, sure, but it may well happen. Admins are people, and people sometimes are downright rude.

At first user X refuses the admins' decision and keeps his content on his node. Later on he realizes that he is simply speaking aloud in an empty room while others carry on what can only be labeled as a blatant case of mass disinformation. So he bravely decides to open a new region.

He imports region Y's contents into his new creature and starts editing everything he does not agree with. Will his edits reach the masses? Mind you, no, they certainly won't. His new region doesn't own that content, so whatever he does is still trashed as blacklisted. His edits still happen in region Y!

Blind alley? No, a simple filter against stupid edit wars. What his new region can do is to add headless information to region Y's objects. Other expressions, definitions, classifications and relations. Alternative stuff, that is. This new content will belong to HIS region, in which he and his friends rule.

Did he reach the masses this time? Well... Reaching the masses is hard work, once the content is in place one still needs to market it. He will need to tell people about this new region, and only his success in the process will determine the size of the masses he will reach.

Do we mean there is no such thing as "the truth"?

The reader may wonder what's the point in having a knowledge base that says everything and the opposite of everything. How can we publish things that simply exclude each other, like Darwinism and Creationism? Shouldn't we guarantee some form of "objective information"? No, we shouldn't. No big philosophical issues here, our strategy is purely marketing driven:

  1. we guarantee freedom and an adaptive approach that can offer content to just ANY culture, religion or point of view,
  2. we offer to professional translators the widest possible pick of definitions.

Translating things does not require sympathizing with the text. A faithful translation is based on a deep understanding of the source, though. So the more different (even clashing) points of view are represented and translated, the better for a dictionary. We repeat it and underline it: we make a DICTIONARY here, not an encyclopaedia, so no POV/NPOV for our system.

Besides, users get as many regions as they choose. If one doesn't want to read material inspired by sect XY all he has to do is to avoid subscribing to the Church of XY region. Yet, pretty often one may have good professional reasons to need stuff he actually hates.

Think about it: a scientist may question a definition in chemistry that stems from some weird alchemy books. Yet, editors who publish alchemy books exist, and they all pay translators who need specialized dictionaries that give them "the proper language for the subject". Sometimes they need it with a serious tinge for expensive editions, sometimes they work at an instant book... There is no limit to what can be useful for a translator. This is the main "commercial" reason for us to favour the birth of openly "confessional" regions. They may be born as "confessional", but they are eventually useful as "specialized".

In short

Everyone is welcome to start a region that accepts only what they rate to be the truth, the best information level, the one and only sane approach to reality. Actually most regions will do exactly that, because this is what our system has been designed to do. It is up to each individual end-user to decide who is sane and who is not; the system will never hint at a ready-made solution to such a question.

Everyone is also welcome to open a region that will finally state "the truth about OWm2", to properly inform the audience about all the things we deliberately (or not) hid from their attention. When this happens we will rate it as our best success.

How do I tell the network what is relevant for me?

Let us start from individual users, before we move to the corporate level. As a note on the potentially endless debate on what a point of view is, we can only recommend that you reflect once more on the concept of linguistic phase space. On treating "the right to hold one's point of view" we use a very similar concept called semantic phase space.

What we mean is that the profile "water" is not only emerging as a different entity depending on its current linguistic phase, but also depending on its current semantic phase. There is no particular warranty that two semantic phases will remain mutually intelligible within the same linguistic phase.

By all practical means, a user may well need a translation from English to English, in order to make any use of what was originally born in deeply specialized terms. It's indeed not many people who will immediately understand the scientific version of the common phrase "liquid water is wet". And even if they could understand it, chances are this is not what they are looking for. Both the individual and the corporate user may want to choose one or more preferred semantic phases.

What we aim to build is an adaptive content network. Something an end user may tune both in terms of linguistic and semantic phases. This implies some tool by which a user can tell the system what is relevant for him/her, and any such tool needs a well identified set of parameters on which it can operate.

We are not replicating the way human brains work. Human brains work as fractally complex parallel systems that simultaneously process a combination of analogue and digital information at an incredible amount of different levels. We have no such resource at hand because we are using a Von Neumann machine, so we simply try and make a practical use of the data we already have.

Now let's see what we can use to make a filter. We have linguistically dependent information, that is, the extended classification by which our system tracks any content's linguistic phase. We also have a linguistically independent vector, the thing saying that a tiger is a feline is a mammal is an animal. We shall call this a semantic vector.

A profile object, as we see on the graphic, has no direct knowledge of any of the two vectors. So a profile object is portable anyway, no matter what happens to the semantic and linguistic values that were attached to it. If there is nothing stopping us from sending it out, the next step is to define how a user will say whether he/she wants to receive the object or not.

Suppose a user speaks wonderful Thai and Khmer, but he/she has no notion whatsoever of English. This person may well decide to ignore whatever semantic classification other people made, as long as the involved class objects cannot manifest themselves in a linguistic phase he/she can use. This implies that such person's semantic content will be full of holes, holes that the user can fill by making his/her own new profile objects used as class objects for semantic classification. In short, since no แมว class object appears on the screen and emersions like cat, кот and ネコ were ignored as useless, the user proceeds to make a new แมว class object on his/her own.

Now wait a minute, isn't this a whole lot of silly dupes?? Well... yes, absolutely. Yet, we must live with this problem anyway, no matter whether we broadcast all content or not. Even if we sent this guy terabytes of classification just to make sure that "no holes are there" those holes would vanish only from his/her local database. The user's linguistic phase space would not extend a bit in the process (just think of the use most English speakers can make of emersions like кот, ネコ, قط ,חתול or შინაური კატა) and thus duplicate profile objects would still be created in huge lots.

This is something that only multilingual human personnel can solve, by patiently marking candidates for merging whenever they meet them. The process will always be long and difficult, since no person on earth speaks ALL the possible linguistic phases we can store.

We also have the case of someone who can use one or more languages that he/she doesn't want to download. Say this person is a translator from Bulgarian to Hindi, looking for a dictionary on construction engineering. This user will mostly ignore Dutch terminology, but maybe he/she can speak Dutch, if even just enough to make use of a class object emerging like Spoorweg (Rail transport). This user may well want to ignore all profile objects that cannot emerge in the Bulgarian or Hindi linguistic phase, but decide to accept class objects that can emerge in Dutch.

We encourage people to use the widest linguistic phase surface for semantic classification, as this dramatically accelerates the process leading to a compact set of profile objects. Profiles that have the same kind of detailed semantic classification and no overlapping linguistic phase are automatically sent for evaluation to the admins. Maybe they are not "exactly the same thing", but they are somehow closely related anyway, so it's worth giving them a closer look.

With this goal in mind we offer our users an extended linguistic phase surface for the network synchronisation of class objects only. Users do not get any "plain content" on extension phases, so the process eats up very little bandwidth and space while providing them with the best linguistic support for their semantic vector. Hmmm.... extended linguistic phase surface!? Now what the heck is that?? Okay, okay... Let's translate it from English to English: you can state a number of additional languages the system can use to send you class objects. These languages' usual content is of no interest to you, but they can be successfully used to give you classificatory information when such information is not available in any of your usual languages of choice.

So much for the linguistic vector. Now if you apply exactly the same filter to the semantic vector, you find yourself telling the system that you want to receive those profile objects that are classified by the construction engineering class AND by the Spoorweg class, maybe NOT those classified by the Stone rails included class (say you are not interested in archaeological stuff).

When the network daemon connects to other nodes, it will know what to request based on such specifications. This is what OWm2 calls its subscription mechanism, which is the basic mechanism allowing individuals to pick their choice of available content. In the next chapter we shall finally see how this applies to corporate level, thus giving birth to our region objects.
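
Putting the linguistic and the semantic filter together, a subscription could be modelled roughly like this. It is only a sketch of the idea, not the OWm2 data model: the Subscription fields, the wants function, the ISO 639-3 style codes and the short class name "Stone rails" are all assumptions made for the Bulgarian/Hindi translator example.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Subscription:
    content_languages: Set[str]                                  # full content wanted
    extension_languages: Set[str] = field(default_factory=set)   # class objects only
    required_classes: Set[str] = field(default_factory=set)      # AND
    excluded_classes: Set[str] = field(default_factory=set)      # NOT


def wants(sub: Subscription, kind: str, languages: Set[str], classes: Set[str]) -> bool:
    """Decide whether a node should request an object.

    kind is "class" or "profile"; languages are the linguistic phases in
    which the object can emerge; classes are its semantic classifications.
    """
    if kind == "class":
        # Class objects may also arrive through the extension languages.
        usable = sub.content_languages | sub.extension_languages
        return bool(usable & languages)
    if not (sub.content_languages & languages):
        return False
    if not sub.required_classes <= classes:
        return False
    return not (sub.excluded_classes & classes)


translator = Subscription(
    content_languages={"bul", "hin"},
    extension_languages={"nld"},
    required_classes={"construction engineering", "Spoorweg"},
    excluded_classes={"Stone rails"},
)
print(wants(translator, "class", {"nld"}, set()))
print(wants(translator, "profile", {"bul"}, {"construction engineering", "Spoorweg"}))
```

The Dutch class object gets through thanks to the extension language, Dutch-only profile content does not, and anything tagged with the excluded class stays out regardless of language.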