2016-12-31

Solving an XML Entity Deserialization Issue

blogentry, programming, c, deserialization

I've recently released a new version of MyAnimeListSharp and I'd like to talk about a challenge I faced while implementing it.

The MAL (MyAnimeList.net) API returns search responses in XML rather than JSON. To make library users' lives easier, I decided to deserialize the XML response into an object (either AnimeSearchResponse or MangaSearchResponse) for easier processing. Then, alas, I ran into a problem: the XML could not be deserialized because it contained undeclared XML entities such as &mdash; (—). XML predefines only five entities (&lt;, &gt;, &amp;, &quot; and &apos;), so an HTML entity like &mdash; is rejected unless a DTD declares it.
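To make the failure concrete, here is a minimal sketch that reads a snippet with the same reader settings used later in this post and reports the parser error instead of letting it propagate. The class and method names (UndeclaredEntityDemo, TryReadAll) are made up for illustration and are not part of MyAnimeListSharp.

```csharp
using System.IO;
using System.Xml;

public static class UndeclaredEntityDemo
{
    // Reads the whole document. Returns null on success,
    // or the XmlException message if the parser rejects the input.
    public static string TryReadAll(string xml)
    {
        try
        {
            using (var stringReader = new StringReader(xml))
            using (var xmlReader = XmlReader.Create(stringReader,
                new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore }))
            {
                while (xmlReader.Read())
                {
                    // Nothing to do: we only care whether parsing succeeds.
                }
                return null;
            }
        }
        catch (XmlException e)
        {
            return e.Message;
        }
    }
}
```

Calling TryReadAll with a snippet such as `<synopsis>a &mdash; b</synopsis>` should return an error message, while a snippet that uses only predefined entities such as `&lt;` should return null.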

Here is an edited sample response from the MAL API for an anime search (the "synopsis" element is where undeclared XML entities usually appear):

    <?xml version="1.0" encoding="utf-8" ?>
    <anime>
      <entry>
        <id>71</id>
        ...
        <synopsis>Sousuke Sagara, ... on the battlefield.&lt;br /&gt; &lt;br /&gt;(Source: ANN, edited)</synopsis>
        <image>https://cdn.myanimelist.net/images/anime/2/75259.jpg</image>
      </entry>
      <entry>
        <id>72</id>
        ...
        <synopsis>It's ... Kaname's classmate.&lt;br /&gt;&lt;br /&gt;(Source: ANN)</synopsis>
        <image>https://cdn.myanimelist.net/images/anime/4/75260.jpg</image>
      </entry>
    </anime>
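For orientation, a response of this shape could map onto classes like the ones below. This is only a sketch: the type and property names (AnimeSearchResponseSketch, AnimeEntrySketch, Entries and so on) are hypothetical and are not the actual AnimeSearchResponse type from MyAnimeListSharp, and only the elements visible in the sample are modeled.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for the library's AnimeSearchResponse.
[XmlRoot("anime")]
public class AnimeSearchResponseSketch
{
    // Each <entry> child of <anime> becomes one list item.
    [XmlElement("entry")]
    public List<AnimeEntrySketch> Entries { get; set; } = new List<AnimeEntrySketch>();
}

public class AnimeEntrySketch
{
    [XmlElement("id")]
    public int Id { get; set; }

    [XmlElement("synopsis")]
    public string Synopsis { get; set; }

    [XmlElement("image")]
    public string Image { get; set; }
}

public static class SearchResponseSketchParser
{
    // Plain XmlSerializer round trip; works as long as the XML
    // contains no undeclared entities.
    public static AnimeSearchResponseSketch Parse(string xml)
    {
        using (var stringReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(stringReader))
        {
            var serializer = new XmlSerializer(typeof(AnimeSearchResponseSketch));
            return (AnimeSearchResponseSketch)serializer.Deserialize(xmlReader);
        }
    }
}
```

Predefined entities are decoded during deserialization, so a synopsis written as `a&lt;br /&gt;b` in the XML arrives in the Synopsis property as the string `a<br />b`.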

Hacking begins...

    using System.IO;
    using System.Xml;
    using System.Xml.Serialization;

    public class SearchResponseDeserializer<T> where T : class
    {
        public T Deserialize(string responseString)
        {
            using (var stringReader = new StringReader(responseString))
            using (
                var xmlReader = XmlReader.Create(stringReader,
                    new XmlReaderSettings {DtdProcessing = DtdProcessing.Ignore}))
            {
                // Must run before the first read, or undeclared entities still throw.
                DisableUndeclaredEntityCheck(xmlReader);

                var xmlSerializer = new XmlSerializer(typeof(T));
                var result = xmlSerializer.Deserialize(xmlReader) as T;
                return result;
            }
        }

        private static void DisableUndeclaredEntityCheck(XmlReader xmlReader)
        {
            ...
        }
    }

Here is the run-down of SearchResponseDeserializer.Deserialize.

  1. Take the response string in XML format.
  2. Disable the undeclared entity check.
  3. Deserialize.

The part I had trouble figuring out was step 2, disabling the undeclared entity check. Replacing every known entity with an empty string only goes so far, and it is not a good solution because one never knows when the XML response will start returning other, unknown entities.
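A minimal sketch of that replacement approach shows where it falls short. The entity list here is a made-up example, as is the KnownEntityStripper name; the point is only that any entity missing from the list passes straight through.

```csharp
public static class KnownEntityStripper
{
    // Hypothetical list: only the entities someone has noticed so far.
    private static readonly string[] KnownEntities = { "&mdash;", "&rsquo;" };

    // Removes every listed entity from the raw XML text before parsing.
    public static string Strip(string xml)
    {
        foreach (string entity in KnownEntities)
        {
            xml = xml.Replace(entity, string.Empty);
        }
        return xml;
    }
}
```

With this list, `Strip("a&mdash;b")` gives `ab`, but `Strip("a&hellip;b")` returns its input unchanged, so the parser would still reject the document the first time the API sends an entity nobody anticipated.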

I looked for an alternative in the .NET documentation, but found no property to set or method to call that disables the entity check. I did find a way in a Stack Overflow answer (by Sam Harwell, a Microsoft MVP in .NET), which discusses how to use reflection to set an internal property that bypasses the entity check.

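    // Requires: using System.Reflection;
    private static void DisableUndeclaredEntityCheck(XmlReader xmlReader)
    {
        // The property is not public, so look it up on the reader's runtime type
        // with both Public and NonPublic binding flags.
        PropertyInfo propertyInfo = xmlReader.GetType().GetProperty(
            "DisableUndeclaredEntityCheck",
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
        propertyInfo.SetValue(xmlReader, true);
    }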

XmlReader does not expose the DisableUndeclaredEntityCheck property publicly, so it has to be turned on using reflection. The property is aptly named: you can guess what it does from the name.
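Because the property is an implementation detail, a more defensive variant might be worth considering. The sketch below is not from the Stack Overflow answer; it assumes the property could be missing on some reader implementations or runtime versions, and the names XmlReaderHacks and TryDisableUndeclaredEntityCheck are made up. GetProperty returns null when no matching property exists, so the helper reports failure instead of throwing a NullReferenceException.

```csharp
using System.Reflection;
using System.Xml;

public static class XmlReaderHacks
{
    // Returns true if the non-public property was found and set,
    // false if this reader type does not have it.
    public static bool TryDisableUndeclaredEntityCheck(XmlReader xmlReader)
    {
        PropertyInfo propertyInfo = xmlReader.GetType().GetProperty(
            "DisableUndeclaredEntityCheck",
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);

        if (propertyInfo == null || !propertyInfo.CanWrite)
        {
            return false;
        }

        propertyInfo.SetValue(xmlReader, true);
        return true;
    }
}
```

A caller can then decide what to do when the method returns false, for example falling back to a string replacement or surfacing a clear error.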

I've never hacked my code this badly, setting an internal property of a .NET library. What I learned from this challenge is that knowing the internals of a framework can be useful in certain scenarios, even though messing around with internal details is not a good idea most of the time.