Thursday, November 13, 2008

OpenStreetMap - show me some data

Project Colbolt requires some understanding of space (actually another project does as well, but one thing at a time!) so I am going to use OpenStreetMap, as it gives me access to the underlying data. I'm not sure I can go the whole way using this - I will have to read the copyright and terms.
 
However, the first part of the plan is to take the OSM data and start to understand it. What I already like is that this is a real-sized data set: the UK alone is ~1.2 GB of XML, and the world is 5.3 GB as a zip file. So I'm looking forward to some proper coding. Since I worked in direct marketing a few years ago, large geographical datasets are great fun for me (sad git).
 
The stats say that the dataset for the world contains :
  • nodes 278 million;
  • ways 23 million;
  • relations 41 thousand.
Node = GPS point (long/lat)
Way = a list of nodes describing a map feature (streets, fields etc)
Relation = a more complex grouping, normally of ways
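
As a rough sketch of how those three element types hang together, they could be modelled in C# along these lines. The type and field names (OsmNode, OsmWay, OsmRelation, NodeIds, MemberIds) are made up for illustration and are not part of the OSM format. Tags and editing information are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names for illustration; not part of the OSM schema.
public class OsmNode
{
    public int Id;
    public float Lat;
    public float Lon;
}

public class OsmWay
{
    public int Id;
    // ordered ids of the nodes that trace out the feature
    public List<int> NodeIds = new List<int>();
}

public class OsmRelation
{
    public int Id;
    // ids of the grouped members (normally ways)
    public List<int> MemberIds = new List<int>();
}

public static class OsmModelDemo
{
    // Builds a two-node way and returns how many nodes it references.
    public static int BuildSampleWay()
    {
        OsmNode a = new OsmNode { Id = 1, Lat = 51.5f, Lon = -0.1f };
        OsmNode b = new OsmNode { Id = 2, Lat = 51.6f, Lon = -0.2f };

        OsmWay street = new OsmWay { Id = 10 };
        street.NodeIds.Add(a.Id);
        street.NodeIds.Add(b.Id);
        return street.NodeIds.Count;
    }
}
```

A way only stores node ids, so drawing it means looking each id up again, which is why the node lookup structure matters so much.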
 
My first step was to play with just the UK data (always start with a small dataset where you know roughly what it should look like).
 
I wrote a little C# application to read the UK file, which gave me back some stats:
  • nodes 6,441,456;
  • ways 873,054;
  • relations 1,362
So, as a rough estimate, anything measured on the UK data can be scaled up to the world by multiplying it by 42 (thanks for all the fish!).
Knowing there was no sensible way to load the whole XML file into memory (especially when we come back to the world data), I used an XmlReader to process the data. A quick scan that processed all the tags and added up these counts takes less than 30 seconds for the UK, so the rule of 42 says it would take about 20 minutes to read the whole world.
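
The arithmetic behind the rule of 42 can be sketched as below. The class and method names are hypothetical, and the only inputs are the node counts and the 30 second scan time quoted above. It is a back-of-envelope check, nothing more.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class RuleOf42
{
    // Node counts quoted in the post.
    const double WorldNodes = 278e6;
    const double UkNodes = 6441456;

    // Ratio of world nodes to UK nodes (a little over 43).
    public static double NodeRatio()
    {
        return WorldNodes / UkNodes;
    }

    // Scales a UK measurement up to a world estimate.
    public static double ScaleToWorld(double ukValue, double factor)
    {
        return ukValue * factor;
    }

    // UK scan time in seconds, scaled by 42 and converted to minutes.
    public static double WorldScanMinutes(double ukSeconds)
    {
        return ScaleToWorld(ukSeconds, 42) / 60;
    }
}
```

RuleOf42.WorldScanMinutes(30) comes out at 21, which is where the "about 20 minutes" comes from. The ratio is based on nodes only. Ways and relations scale by smaller factors, so 42 is on the pessimistic side for those.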
 
A stripped down node (removing editing information) looks like:
 
<node id="int" lat="float" lon="float">
 
In terms of trying to hold it in memory: stuffing the UK nodes into a List<float[2]> as pairs of floats required 150 MB of memory, a sparse array fell over after 1 GB, and a Dictionary<int, float[2]> used 379 MB.
 
The List<float[2]> approach loses the node id, so it is not really a fair comparison, though I could just use a BitArray to hold the ids (as big as the largest id, with a 1 indicating that the id is in the List<float[2]>, and the number of 1s before it giving its offset). This would require another 1 MB of memory, but would be slower than the Dictionary approach (I have not validated this). In either case this is all moot, as the rule of 42 shows we would need 7 GB just to hold the nodes for the world, and at this stage I'm not doing that! So we are going to need a storage system...
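
A minimal sketch of that BitArray idea, to make the "count the 1s" lookup concrete. The class name and its members are hypothetical. It assumes nodes are added in ascending id order, and it stores the coordinates in one flat List<float> rather than a list of two-element arrays. The lookup is a plain linear scan, so this shows the idea rather than a tuned implementation.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Sketch only: assumes nodes are added in ascending id order.
public class BitIndexedNodeStore
{
    // bit i is set when node id i is stored
    private readonly BitArray present;
    // lat, lon pairs, in the same order as the set bits
    private readonly List<float> coords = new List<float>();
    private int lastId = -1;

    public BitIndexedNodeStore(int maxId)
    {
        present = new BitArray(maxId + 1);
    }

    public void Add(int id, float lat, float lon)
    {
        if (id <= lastId)
            throw new ArgumentException("ids must be added in ascending order");

        present[id] = true;
        coords.Add(lat);
        coords.Add(lon);
        lastId = id;
    }

    // Returns false if the id is not stored.
    public bool TryGet(int id, out float lat, out float lon)
    {
        lat = 0f;
        lon = 0f;
        if (id < 0 || id >= present.Length || !present[id])
            return false;

        // offset = number of set bits before this id (linear scan, hence slow)
        int offset = 0;
        for (int i = 0; i < id; i++)
        {
            if (present[i]) offset++;
        }

        lat = coords[2 * offset];
        lon = coords[2 * offset + 1];
        return true;
    }
}
```

Each lookup walks every bit below the id, which is why I would expect it to lose to the Dictionary on speed. Caching a running count every few thousand bits would cut that scan down at the cost of a little more memory.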
 
The code so far is skeletal, but it shows what we have to do to process the OSM format. Note that I could have skipped most of the data, but I wanted to get a fair idea of what reading the data takes:
 
 


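// needs: using System; using System.Xml;
static void Main(string[] args)
{
    // XmlReader streams the file, so it never has to fit in memory
    using (XmlReader reader = XmlReader.Create(@"k:\geo\uk-081015.osm"))
    {
        Importer importer = new Importer(reader);
    }
}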

and the Importer class:


public class Importer
{
    int nodes = 0;
    int ways = 0;
    int relations = 0;

    public Importer(XmlReader reader)
    {
        // skip boring stuff (XML declaration, leading whitespace)
        reader.MoveToContent();

        reader.ReadStartElement("osm");

        // IsStartElement() skips whitespace, so this visits each child of <osm>
        while (reader.IsStartElement())
        {
            string name = reader.Name;

            switch (name)
            {
                case "bounds":
                    ReadBound(reader);
                    break;
                case "node":
                    ReadNode(reader);
                    break;
                case "way":
                    ReadWay(reader);
                    break;
                case "relation":
                    ReadRelation(reader);
                    break;
                default:
                    throw new Exception("Unknown element encountered " + name);
            }
        }

        reader.ReadEndElement(); // osm

        Console.WriteLine("nodes {0} ways {1} relations {2}", nodes, ways, relations);
    }

    private void ReadBound(XmlReader reader)
    {
        reader.Skip(); // skip the whole bounds element
    }

    private void ReadNode(XmlReader reader)
    {
        // get attributes (not used yet, but read to get a fair timing)
        string id = reader.GetAttribute("id");
        string latitude = reader.GetAttribute("lat");
        string longitude = reader.GetAttribute("lon");
        nodes++;

        if (ReadOpenTag(reader))
        {
            SkipChildren(reader, "tag");
            reader.ReadEndElement(); // node
        }
    }

    private void ReadWay(XmlReader reader)
    {
        // get attributes
        string id = reader.GetAttribute("id");
        ways++;

        if (ReadOpenTag(reader))
        {
            SkipChildren(reader, "nd");
            SkipChildren(reader, "tag");
            reader.ReadEndElement(); // way
        }
    }

    private void ReadRelation(XmlReader reader)
    {
        // get attributes
        string id = reader.GetAttribute("id");
        relations++;

        if (ReadOpenTag(reader))
        {
            SkipChildren(reader, "member");
            SkipChildren(reader, "tag");
            reader.ReadEndElement(); // relation
        }
    }

    // Moves past the current start tag. Returns false for an empty
    // element (<x ... />), which has no children and no end tag.
    private static bool ReadOpenTag(XmlReader reader)
    {
        bool hasContent = !reader.IsEmptyElement;
        reader.Read();
        return hasContent;
    }

    // Skips consecutive child elements with the given name.
    private static void SkipChildren(XmlReader reader, string name)
    {
        while (reader.IsStartElement(name))
        {
            reader.Skip();
        }
    }
}

So as a first pass I'm going to stick it into SQL Server 2008 and see what the spatial features can do...
