Update: MetaPipe has been renamed GLAMpipe.
I have previously made a tool called Flickr2GWToolset. It is a simple tool for editing the metadata of Flickr images and exporting this data to an XML file for the GLAM-Wiki Toolset. The tool was aimed mainly at GLAM collection metadata. As you can see below, the user interface of Flickr2GWToolset is rather complicated.
The lesson learned from that project was that the hard part of designing this kind of tool is making all the functionality available without scaring the user. The user interface becomes complicated, and adding new features makes it even more complicated.
To me, it seems obvious that extending user interfaces like the one seen above to include more and more functionality is a dead end (or it would require a super-talented designer). Still, even if there were such a designer, one fundamental problem remains.
The remaining problem is that, after the metadata is processed, there is no trace of what was done with the data. The history of actions is not there. If someone asks “what did you do with the metadata?”, then one can only try to remember what the actual steps were. What if I could just show what I did? Or even better, re-run my process with a different dataset?
At this point programmers raise their hands and say: “Just write scripts and then you can process any number of datasets”. That is true. Still, this approach has some problems.
The first one is obvious. How do you write scripts if you are not a programmer? The second problem is re-usability. When someone writes scripts, for example for processing metadata for a Wikimedia Commons upload, the results are often “hack until it works” type of scripts. This means awkward hacks, hardcoded values and no documentation (at least this is what my scripts look like). This makes re-using other programmers’ scripts very difficult, and people keep re-inventing the wheel over and over again.
The third problem is more profound. It is related to the origin of the data, and I’ll deal with it in the next section.
Collection metadata vs. machine data
When speaking of tools for (meta)data manipulation, it is important to define what kind of data we are dealing with. Here I make a distinction between machine data and collection data.
A server log file (a file that can tell which web pages were viewed and when, for example) is machine data. It is produced by a computer, and it has consistent structure *and* content. You can rely on that structure when you are manipulating the data. If there is a date field, then it contains a date in a certain format, with no exceptions. When processing is needed, a script is created, tested, and finally executed. The data is now processed. There is no need to edit this data by hand at any point. In fact, hand editing would endanger the reliability of the data.
In contrast, collection data has “human nature”. It is produced by humans over some period of time. During that period, the way the data was produced and structured may have changed several times. When this data is made publicly accessible, it usually has a consistent structure, but there might be various inconsistencies in the structure and semantics of the content.
For example, “author” can contain a name or several names, but it can also contain dates of birth or death, or even descriptions of the authors. Or the “description” field can contain just a few words, or it can include all the possible information about the target that did not fit anywhere else in the data structure (and that should be placed somewhere else in the upload).
This kind of data can be an algorithmic nightmare. There are special cases and special cases of special cases, and it would be almost impossible to make an algorithm that could deal with every one of them. Often you can deal with 90 or 99 percent of cases. For the rest, it might be easiest to just edit the data manually.
When working with this kind of data, it is important that one can make manual edits during the process, which is difficult when the data is processed with scripts only.
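The “90 percent” idea can be sketched in a few lines of Python. This is only an illustration: the pattern for the “author” field, the sample values and the function name `split_author` are assumptions made up for this example, not something GLAMpipe defines. The point is that the script handles the values it recognises and sets the rest aside for manual editing instead of guessing.

```python
import re

# Assumed pattern for illustration only: "Name" or "Name (born-died)".
AUTHOR_RE = re.compile(
    r"^(?P<name>[^()\d;]+?)\s*(?:\((?P<born>\d{4})\s*-\s*(?P<died>\d{4})?\))?$"
)

def split_author(raw):
    """Return the recognised parts, or None when the value needs a human."""
    match = AUTHOR_RE.match(raw.strip())
    if match is None:
        return None
    # Keep only the parts that were actually present in the value.
    return {key: value for key, value in match.groupdict().items() if value is not None}

# Made-up sample values for the "author" field.
authors = [
    "Jane Example (1900-1975)",
    "Jane Example",
    "unknown, possibly a student; photo ca. 1930s",
]

parsed = {}
needs_manual_edit = []
for raw in authors:
    result = split_author(raw)
    if result is None:
        needs_manual_edit.append(raw)  # leave the special cases for hand editing
    else:
        parsed[raw] = result

print(parsed)
print(needs_manual_edit)
```

With scripts only, the values in `needs_manual_edit` have nowhere to go; a tool that keeps the data in an editable collection lets someone fix them by hand and continue.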
GLAMpipe by WMFI
GLAMpipe (working title, we are searching for a better name) relies on the concepts of visual programming and node-based editing in its user interface. Both are based on visual blocks that can be added, removed and re-arranged by the user. The result is both a visual presentation and an executable program. A node-based user interface is not a new idea, but for some cases it is a good idea. There is a good analysis of node-based (actually flow-based, which is a slightly different thing) user interfaces here: http://bergie.iki.fi/blog/inspiration-for-fbp-ui/. Below you can see an example of visual programming with Scratch. Can you find out what happens when you click the cat?
GLAMpipe combines visual programming and node-based editing very loosely. The result of using nodes in GLAMpipe is a program that processes data in a certain way. You can think of nodes as modifiers or scripts that are applied to the current dataset.
Above you can see a screenshot of GLAMpipe showing a simple project. There is a Flickr source node (blue), which brings data into the collection (black). Then there are two transform nodes (brownish). The first one extracts the year from the “datetaken” field of the data and puts the result in a field called “datetaken_year”. The second one combines “title” and the extracted year with a comma and saves the result to a new field called “commons_title”.
Here is one record after the transform nodes have been executed:
ORIGINAL title: “Famous architect feeding a cat”
ORIGINAL datetaken: 1965-02-01 00:00:00
NEW datetaken_year: 1965
NEW commons_title: “Famous architect feeding a cat, 1965”
Note that the original data remains intact. This means one can re-run transform nodes any number of times. Let’s say that you want to have the year in the commons title inside brackets like this: “Famous architect feeding a cat (1965)”. You can add brackets around “datetaken_year” in the transform node’s settings. Then just re-run the node, and the new values are written to the “commons_title” field.
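In plain Python, the two transforms could be sketched roughly like this. The function names and the template syntax are assumptions made for illustration; they are not how GLAMpipe’s transform nodes are actually implemented. What the sketch shows is the principle: each step reads existing fields and writes only to a new field, so it can be re-run with different settings.

```python
def extract_year(record, source="datetaken", target="datetaken_year"):
    """Copy the four-digit year from `source` into a new field `target`."""
    record[target] = record[source][:4]
    return record

def combine_fields(record, template, target):
    """Fill `template` from the record's fields and store the result in `target`."""
    record[target] = template.format(**record)
    return record

record = {
    "title": "Famous architect feeding a cat",
    "datetaken": "1965-02-01 00:00:00",
}

extract_year(record)
combine_fields(record, "{title}, {datetaken_year}", "commons_title")
print(record["commons_title"])  # Famous architect feeding a cat, 1965

# Change the setting and re-run: only the derived field is overwritten.
combine_fields(record, "{title} ({datetaken_year})", "commons_title")
print(record["commons_title"])  # Famous architect feeding a cat (1965)
```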
The nodes have their own parameters and settings. This means that all information about the editing process is packed into the project’s node setup. Now, if someone asks “How did you create commons title names for your data?”, I can share this setup. What is more, one can change the dataset and re-run the nodes by replacing the source node with a new source node that has a similar data structure. So if one wants to process some other Flickr image album, this can be done by replacing the source node with one that points to a different album.
However, we are still experimenting with different options for how the UI should work.
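The idea of a shareable, re-runnable setup can be sketched as follows. Everything here is hypothetical: the node type names, the settings keys and the second sample album are made up for this example and do not reflect GLAMpipe’s real project format. The sketch only shows why keeping the steps as plain data makes them easy to share and to apply to another source.

```python
import json

def node_extract_year(records, settings):
    for record in records:
        record[settings["target"]] = record[settings["source"]][:4]

def node_combine(records, settings):
    for record in records:
        record[settings["target"]] = settings["template"].format(**record)

# Hypothetical registry of node types.
NODE_TYPES = {"extract_year": node_extract_year, "combine": node_combine}

# The setup is plain data: it documents what was done and in which order.
setup = [
    {"type": "extract_year",
     "settings": {"source": "datetaken", "target": "datetaken_year"}},
    {"type": "combine",
     "settings": {"template": "{title}, {datetaken_year}", "target": "commons_title"}},
]

def run(setup, records):
    """Apply every node in the setup, in order, to the given records."""
    for node in setup:
        NODE_TYPES[node["type"]](records, node["settings"])
    return records

# Two made-up "albums" with the same data structure.
album_a = [{"title": "Famous architect feeding a cat", "datetaken": "1965-02-01 00:00:00"}]
album_b = [{"title": "Harbour view", "datetaken": "1972-06-15 00:00:00"}]

shared = json.dumps(setup)        # what one could hand to a colleague
run(json.loads(shared), album_a)
run(json.loads(shared), album_b)  # same steps, different source data
print(album_b[0]["commons_title"])
```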
Nodes are the building blocks of data manipulation in GLAMpipe. Nodes can import data, transform data, download files, upload content or export content to a certain format.
Some examples of currently available nodes:
- Flickr source node. It needs a Flickr API key (which is free) and the album id. When executed, it imports data into the collection.
- File source node. This node reads data from a file and imports it into the collection. Currently it accepts data in CSV or TSV format.
- Wikitext transform node. This maps your data fields to the Photograph or Map template and writes the wikitext to a new field in the collection.
- Flickr image url lookup node. This fetches urls for different image sizes from Flickr and writes the info to the collection.
- Search and replace transform node. This node searches for a string, replaces it and writes the result back to the collection by creating a new field.
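As a rough illustration of what a file source node followed by a search and replace node does, here is a small Python sketch. The function names, the field name “title_replaced” and the sample data are assumptions for this example only; the real nodes are configured through GLAMpipe’s user interface.

```python
import csv
import io

def read_delimited(text, delimiter=","):
    """Parse CSV text (or TSV, when given a tab delimiter) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def search_and_replace(records, field, search, replace, target):
    """Write the replaced value to a new field `target`, leaving `field` untouched."""
    for record in records:
        record[target] = record[field].replace(search, replace)
    return records

# Made-up TSV content standing in for an imported file.
tsv_text = "title\tdatetaken\nFamous architect feeding a cat\t1965-02-01 00:00:00\n"

collection = read_delimited(tsv_text, delimiter="\t")
search_and_replace(collection, "title", "cat", "kitten", "title_replaced")
print(collection[0])
```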
Below is a screencast of using the georeferencer node. The georeferencer is a view node. A view node is basically a web page that can fetch and alter data via the GLAMpipe API.
Technically, nodes are JSON files that include several scripts. You can find more in-depth information here: https://github.com/artturimatias/metapipe-nodes
How to test?
GLAMpipe is a work in progress. Do not expect things to just work.
Installing GLAMpipe currently requires some technical skills. GLAMpipe is server software, but the only way to test it now is to install it on your own computer. There are directions for installation on Linux. Installation on other operating systems should also be possible, but that has not been tested.
My (Ari’s) assignment in the Wikimaps 2.0 project is to improve the map upload process. This is done by adapting a separate tool development project by WMFI, called MetaPipe (working title), so that it helps with map handling. The development of the tool is funded by the Finnish Ministry of Education and Culture.