When writing unit tests we mostly focus on business correctness. We do our best to exercise the happy path and all edge cases. We sometimes microbenchmark and measure throughput. But one aspect that is often missed is how our code behaves when the input is excessively large. We test how we handle normal input files, malformed files, empty files, missing files… but what about insanely large input files?
Let’s start with a real-life use case. You were given a task: implement a GPX (GPS Exchange Format, essentially XML) to JSON transformation. I chose GPX for no particular reason; it’s just another XML format that you might have come across, e.g. when recording a hike or bicycle ride with a GPS receiver. I also thought it would be nice to use a real standard rather than yet another “people database” in XML. Inside a GPX file there are hundreds of flat
<wpt/> entries, each one representing a point in space-time:
<gpx>
    <wpt lat="42.438878" lon="-71.119277">
        <ele>44.586548</ele>
        <time>2001-11-28T21:05:28Z</time>
        <name>5066</name>
        <desc><![CDATA[]]></desc>
        <sym>Crossing</sym>
        <type><![CDATA[Crossing]]></type>
    </wpt>
    <wpt lat="42.439227" lon="-71.119689">
        <ele>57.607200</ele>
        <time>2001-06-02T03:26:55Z</time>
        <name>5067</name>
        <desc><![CDATA[]]></desc>
        <sym>Dot</sym>
        <type><![CDATA[Intersection]]></type>
    </wpt>
    <!-- ...more... -->
</gpx>
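Such flat, repetitive entries lend themselves well to streaming parsing, where the whole document never has to fit in memory at once. Here is a minimal sketch; it assumes Python and its standard-library `xml.etree.ElementTree.iterparse` (the original toolchain may differ), and the JSON keys shown are illustrative, not a target format:

```python
import io
import json
import xml.etree.ElementTree as ET

# Tiny inline sample; in practice this would be a file opened in binary mode.
SAMPLE = """<gpx>
  <wpt lat="42.438878" lon="-71.119277">
    <ele>44.586548</ele>
    <name>5066</name>
  </wpt>
  <wpt lat="42.439227">
    <ele>57.607200</ele>
    <name>5067</name>
  </wpt>
</gpx>"""

def waypoints(stream):
    # iterparse emits each element as soon as its closing tag is read,
    # so we can process <wpt/> entries one at a time.
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "wpt":
            # skip waypoints missing the lon attribute
            if "lon" in elem.attrib:
                yield {
                    "lat": float(elem.get("lat")),
                    "lon": float(elem.get("lon")),
                    "name": elem.findtext("name"),
                }
            elem.clear()  # release the subtree we no longer need

print(json.dumps(list(waypoints(io.StringIO(SAMPLE)))))
# the second waypoint has no lon attribute, so only one entry survives
```

The call to `elem.clear()` is what keeps memory flat: without it, `iterparse` would still accumulate the full tree behind the scenes.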
You can grab a sample file from www.topografix.com/fells_loop.gpx. Our task is to extract each individual
<wpt/> element, discard those without a
lon attribute, and store the result back as JSON in the following format: