Data processing has reached a high level of complexity: enough that much of our data gathering, and even creative work such as brainstorming, is done digitally. However, few tools are able to share complex data among themselves. Current efforts, such as Java and XML, try to unify processing and data storage, and data exchange in particular. The industry is thus seeing the need for, and the benefits of, semantically rich content and standardized forms of, and access to, this content.
By semantically rich content, I mean data that is categorized and related to other data. The simplest types of editors, such as plain-text and graphics editors, manipulate artifacts (words, shapes, etc.) with only simple categorization and interrelation. Headers and footers may be identified, and style information may modify some of the artifacts. For this type of data, it is easy to create proprietary data formats, and very little reuse of, or reference to, the data is supported.
These simpler tools are working hard to add richer content: hyperlinks, references, automatically updated content that resides elsewhere, and so on. This is a difficult task, and more importantly, as tools try to reuse data, exchange between different tools becomes increasingly important. As the complexity of the data increases, and it is reused more, it becomes important to deal with multiple versions of the data over time, even more so when many people cooperate in creating and manipulating it. To make the problem more acute still, the structure of the data itself evolves over time. Vendor-defined data formats change to accommodate changes and additions to the data and its interrelations, and user-created structure evolves in similar ways.
Users evolve their categorizations over time. For example, sorting 100 emails into 2-3 categories is probably appropriate, but when the number grows to 10,000, searching for emails becomes very difficult and time-consuming unless more, perhaps nested, categories are used. When a user moves from one project to the next, he may want to use a different categorization to reflect organizational changes, while still applying that one categorization to both old and new material.
Most classification schemes tend toward rigid classification. You classify your computer files in a folder hierarchy, and most complex tools have one or more such hierarchies. Even highly complex knowledge-management tools, such as CASE tools, tend to favor one hierarchy over any others they may present to the user. Changes to the structure of this one hierarchy tend to be very time-consuming to deal with, but they occur, for the same reason any classification changes over time: conditions change, or our understanding of the conditions changes.
The classical example of a rigid hierarchy is the so-called natural hierarchy of biology and botany. All living entities are classified in a tree structure, a hierarchy. It is great to have such a standard, unified categorization; it represents the accumulation of knowledge of the material world over millennia. But it is not the only way to classify living entities. As our knowledge increases, we update the natural hierarchy to explain similarities and differences in as fundamental, clear, and powerful a way as possible. For example, as we achieve greater understanding of DNA, it is likely that we will come to use DNA-based methods of classification exclusively. Historically, however, we had to start out much as any child starts classifying the things around him: perhaps first differentiating stationary things from moving things, then talking, moving things from other moving things, and so on.
Even an adult who is not a biologist would probably not find great use in the biological grouping of humans with other primates, such as orangutans. To him, there are monkeys and humans, and among monkeys there are big apes and smaller monkeys. Most children, and many adults, couldn't care less that dolphins are mammals and sharks are not. They swim, therefore they're fish.
Multiple cross-categorizations are the norm, not the exception. Philosophically, categorization is an issue of knowledge, and in the case of human beings, it is a question of their human context. In fact, there is no such thing as a natural hierarchy. "Natural," in this context, presupposes answers to the questions "to whom?" and "for what?" Even if there is a purpose all human beings can agree on, which is pretty much the case with the botanical and biological hierarchy of Western science, it still hinges on them, i.e., on their perspective and on their chosen purpose.
XML-based standards largely fall into this trap of working with the assumption of one unified categorization. This isn't wrong as such, but it is extremely difficult to create such unified categorizations, and it is extremely difficult to change them once people have started using them. Worse, it is nearly impossible to grow such hierarchies gradually from scratch; they typically involve committees of many industry players and incredible amounts of negotiation.
Imagine for a minute that you could separate the issue of categorization from the representation (persistence or transmission) of the data. On the one hand we have data, let's call them quarks; on the other hand categorizations, let's call them views. Let's take the dolphin example. Dolphins swim in water, and they are mammals. This knowledge can be represented almost directly. In Quark, it would look something like this:
<dolphins-class> <swim-in> <water>
<dolphin> <is-subtype-of> <mammal>
Now if a mammal-centric view or categorization is desired, the <is-subtype-of> and <mammal> quarks are searched for. If a different structure is desired, such as one based on being a land animal vs. a water animal, it is searched for or asserted, as the case may be.
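To make the idea concrete, here is a minimal sketch of this separation in Python. It is purely illustrative, not the actual QuarkSpace implementation: quarks are modeled as flat triples, and a "view" is nothing more than a query over them. The shark facts are borrowed from the earlier dolphins-vs-sharks example.

```python
# Quarks as flat (subject, predicate, object) triples. The data carries
# no hierarchy of its own; categorizations are imposed by queries.
quarks = [
    ("dolphin-class", "swim-in", "water"),
    ("dolphin", "is-subtype-of", "mammal"),
    ("shark", "swim-in", "water"),
    ("shark", "is-subtype-of", "fish"),
]

def view(quarks, predicate=None, obj=None):
    """Select the quarks matching the given predicate and/or object.

    Passing None for a field means "match anything".
    """
    return [
        (s, p, o)
        for (s, p, o) in quarks
        if (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# A mammal-centric view: search for <is-subtype-of> ... <mammal> quarks.
mammals = view(quarks, predicate="is-subtype-of", obj="mammal")

# A child's "fish" view: everything that swims in water.
swimmers = view(quarks, predicate="swim-in", obj="water")
```

Under this sketch, `mammals` contains only the dolphin, while `swimmers` contains both the dolphin class and the shark: two different categorizations of the same underlying quarks, neither privileged over the other.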
How is this accomplished, and what are the characteristics of QuarkSpace? See the rest of this site. Introduction to Quarks is a good place to start.