Steve Michelotti: LINQ to XML with Hierarchical XML, Optional Elements, and Namespaces

Recently I had an interesting task to consume a particular XML document and populate a C# object from it. The structure of the XML document looked roughly like this:

<root xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>1</id>
    <title>abc</title>
    <content>
      <div xmlns="http://www.w3.org/1999/xhtml">
        <table>
          <tr>
            <td>Item1</td>
            <td>111</td>
          </tr>
          <tr>
            <td>Item2</td>
            <td>222</td>
          </tr>
          <tr>
            <td>Item3</td>
            <td>333</td>
          </tr>
        </table>
      </div>
    </content>
  </entry>
  <entry>
    <id>2</id>
    <title>xyz</title>
    <content>
      <div xmlns="http://www.w3.org/1999/xhtml">
        <table>
          <tr>
            <td>Item1</td>
            <td>444</td>
          </tr>
          <tr>
            <td>Item2</td>
            <td>555</td>
          </tr>
        </table>
      </div>
    </content>
  </entry>
</root>

The C# object to be populated looked like this:

public class Entry
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Item1 { get; set; }
    public string Item2 { get; set; }
    public string Item3 { get; set; }
}

My initial inclination was that LINQ to XML was the correct approach to do this. I had a few interesting challenges:

The XML was hierarchical and needed to be flattened
The XML had multiple namespaces
Some of the elements were optional (e.g., Item3)
Each value had to be read from a sibling element (e.g., in the first entry, the name "Item1" has a value of "111" on lines 9-10 above)

If you're already experienced with LINQ to XML, this might be a relatively straightforward query to write. However, if you're not an expert, it's often the approach you take that most determines your success.

The first thing I did was to write a unit test so that, once I started to code, I could do a quick red/green check to see whether I was getting closer. Although writing a unit test first is typically my default mindset, I realized pretty quickly that I needed an even tighter feedback loop (i.e., instantaneous feedback) since I was really just working on a single query. So my next step was to fire up LINQPad. This is truly an awesome tool and if you're not already using it, go download it right now (it's free). You can work with database connections (LINQ to SQL, EF), LINQ to Objects, WCF Data Services, and LINQ to XML. For LINQ to XML, put LINQPad in "C# Statements" mode rather than the default "C# Expressions."
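The queries in this post read from a variable named xmlList that is never defined in the text. As a minimal sketch, assuming xmlList is an XElement parsed from the document text (it could just as well be an XDocument, or loaded from a file with XElement.Load), a scratch setup might look like the following. The class name, the LoadXmlList helper, and the trimmed, namespace-free sample are illustrative assumptions, not part of the original code.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ScratchSetup
{
    // Assumption: a namespace-free, single-entry cut of the sample document.
    public const string SampleXml = @"<root>
  <entry>
    <id>1</id>
    <title>abc</title>
    <content><div><table>
      <tr><td>Item1</td><td>111</td></tr>
      <tr><td>Item2</td><td>222</td></tr>
    </table></div></content>
  </entry>
</root>";

    // Hypothetical helper: parses the text into the element the queries call xmlList.
    public static XElement LoadXmlList(string xml)
    {
        return XElement.Parse(xml);
    }

    public static void Main()
    {
        XElement xmlList = LoadXmlList(SampleXml);

        // Quick sanity checks before writing the real query.
        Console.WriteLine(xmlList.Descendants("entry").Count());    // 1
        Console.WriteLine(xmlList.Descendants("td").First().Value); // Item1
    }
}
```

With something like this in place, each iteration of the query can be pasted in and checked immediately.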

My first step was to remove the XML namespaces from the XML. If my query didn't return any results, I wanted to make sure it was because something was wrong with the query and not a silly namespace issue. For my first iteration of the query, I didn't want to have to worry about optional elements, so I left Item3 out. Also, I knew that I could get the sibling element with LINQ to XML by using the ElementsAfterSelf() method:

var entries = from item in xmlList.Descendants("entry")
              select new
              {
                  Id = item.Element("id").Value,
                  Title = item.Element("title").Value,
                  Item1 = item.Element("content").Element("div").Element("table").Elements("tr").Elements("td").First(x => x.Value == "Item1").ElementsAfterSelf().First().Value,
                  Item2 = item.Element("content").Element("div").Element("table").Elements("tr").Elements("td").First(x => x.Value == "Item2").ElementsAfterSelf().First().Value
              };

While that does work, it certainly was not very pretty (or efficient, since the deeply nested xhtml is walked once per item). A better approach would be to capture each entry's table cells in a variable via the "let" keyword (I called it "rows", although it really holds the td elements):

var entries = from item in xmlList.Descendants("entry")
              let rows = item.Element("content").Element("div").Element("table").Elements("tr").Elements("td")
              select new
              {
                  Id = item.Element("id").Value,
                  Title = item.Element("title").Value,
                  Item1 = rows.First(x => x.Value == "Item1").ElementsAfterSelf().First().Value,
                  Item2 = rows.First(x => x.Value == "Item2").ElementsAfterSelf().First().Value
              };

That's a definite improvement since I now have a "rows" variable that encapsulates all of the elements for just that entry. Now that we're confident the structure is in good order, we can put the namespaces back in:

XNamespace atomNs = "http://www.w3.org/2005/Atom";
XNamespace xhtmlNs = "http://www.w3.org/1999/xhtml";
 
var entries = from item in xmlList.Descendants(atomNs + "entry")
              let rows = item.Element(atomNs + "content").Element(xhtmlNs + "div").Element(xhtmlNs + "table").Elements(xhtmlNs + "tr").Elements(xhtmlNs + "td")
              select new
              {
                  Id = item.Element(atomNs + "id").Value,
                  Title = item.Element(atomNs + "title").Value,
                  Item1 = rows.First(x => x.Value == "Item1").ElementsAfterSelf().First().Value,
                  Item2 = rows.First(x => x.Value == "Item2").ElementsAfterSelf().First().Value
              };
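The reason both namespaces are needed is that a default xmlns declaration applies to the element it appears on and to all of its descendants until another declaration overrides it. The sketch below illustrates this on a trimmed copy of the sample: an unqualified name matches nothing, and the td elements belong to the XHTML namespace rather than the Atom one. The class and the CountDescendants helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NamespaceDemo
{
    // Assumption: a trimmed, single-entry cut of the sample document.
    public const string SampleXml = @"<root xmlns=""http://www.w3.org/2005/Atom"">
  <entry>
    <id>1</id>
    <content>
      <div xmlns=""http://www.w3.org/1999/xhtml""><table><tr><td>Item1</td><td>111</td></tr></table></div>
    </content>
  </entry>
</root>";

    // Hypothetical helper: counts descendants that match a (possibly qualified) name.
    public static int CountDescendants(XElement root, XName name)
    {
        return root.Descendants(name).Count();
    }

    public static void Main()
    {
        XNamespace atomNs = "http://www.w3.org/2005/Atom";
        XNamespace xhtmlNs = "http://www.w3.org/1999/xhtml";
        XElement root = XElement.Parse(SampleXml);

        Console.WriteLine(CountDescendants(root, "entry"));          // 0: name has no namespace
        Console.WriteLine(CountDescendants(root, atomNs + "entry")); // 1
        Console.WriteLine(CountDescendants(root, atomNs + "td"));    // 0: td sits inside the XHTML div
        Console.WriteLine(CountDescendants(root, xhtmlNs + "td"));   // 2
    }
}
```

This is also why it paid off to get the query working without namespaces first: an empty result here says nothing about whether the shape of the query is right.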

At this point, the only thing left is dealing with the optional XML elements. Following this tip by John Papa, we can add the optional check for any item (Item2 and Item3 shown on lines 11-12 below) like this:

XNamespace atomNs = "http://www.w3.org/2005/Atom";
XNamespace xhtmlNs = "http://www.w3.org/1999/xhtml";
 
var entries = from item in xmlList.Descendants(atomNs + "entry")
              let rows = item.Element(atomNs + "content").Element(xhtmlNs + "div").Element(xhtmlNs + "table").Elements(xhtmlNs + "tr").Elements(xhtmlNs + "td")
              select new
              {
                  Id = item.Element(atomNs + "id").Value,
                  Title = item.Element(atomNs + "title").Value,
                  Item1 = rows.First(x => x.Value == "Item1").ElementsAfterSelf().First().Value,
                  Item2 = (rows.FirstOrDefault(x => x.Value == "Item2") == null ? null : rows.First(x => x.Value == "Item2").ElementsAfterSelf().First().Value),
                  Item3 = (rows.FirstOrDefault(x => x.Value == "Item3") == null ? null : rows.First(x => x.Value == "Item3").ElementsAfterSelf().First().Value)
              };
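The existence check matters because First() throws an InvalidOperationException when nothing matches the predicate, whereas FirstOrDefault() returns null for reference types such as XElement. Here is a small sketch of that difference, using a single namespace-free row as an assumed stand-in for an entry that has no Item3. The class and helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class OptionalLookupDemo
{
    // Cells for an entry that contains Item1 but not Item3 (namespaces omitted for brevity).
    public static List<XElement> SampleCells()
    {
        return XElement.Parse("<tr><td>Item1</td><td>111</td></tr>").Elements("td").ToList();
    }

    // Returns true when First() throws because no cell has the given text.
    public static bool FirstThrowsWhenMissing(IEnumerable<XElement> cells, string labelText)
    {
        try
        {
            cells.First(x => x.Value == labelText);
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var cells = SampleCells();

        Console.WriteLine(cells.FirstOrDefault(x => x.Value == "Item3") == null); // True
        Console.WriteLine(FirstThrowsWhenMissing(cells, "Item3"));                // True
        Console.WriteLine(FirstThrowsWhenMissing(cells, "Item1"));                // False
    }
}
```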

At this point, things are working pretty well, so we can put this back into Visual Studio and run the original unit test, which is now green. But we're not quite done. While this code does what we want, it's not particularly succinct or DRY. It's also not very efficient, given that we search for each optional item twice (first to check for existence and then to get the value). When you hit situations like this, you can write your own extension methods quite easily. So we can write an extension method that gets a sibling value but only if the item exists (so it can be an optional element):

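public static class XElementExtensions
{
    // Finds the element whose text equals labelText and returns its next sibling's value, or null if there is no such element.
    public static string FindSiblingValue(this IEnumerable<XElement> elements, string labelText)
    {
        var label = elements.FirstOrDefault(x => x.Value == labelText);
        return (label == null ? null : label.ElementsAfterSelf().First().Value);
    }
}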

Once we have that in place, we can now refactor our LINQ to XML query:

var entries = from item in xmlList.Descendants(atomNs + "entry")
              let rows = item.Element(atomNs + "content").Element(xhtmlNs + "div").Element(xhtmlNs + "table").Elements(xhtmlNs + "tr").Elements(xhtmlNs + "td")
              select new
              {
                  Id = item.Element(atomNs + "id").Value,
                  Title = item.Element(atomNs + "title").Value,
                  Item1 = rows.FindSiblingValue("Item1"),
                  Item2 = rows.FindSiblingValue("Item2"),
                  Item3 = rows.FindSiblingValue("Item3")
              };

Now the query is succinct, and each item is looked up only once instead of twice.
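The queries above project into an anonymous type with a string Id, while the Entry class declares Id as an int. To show the pieces working together, here is a minimal self-contained sketch that projects straight into Entry and converts the id with XElement's explicit conversion to int. The EntryReader class, the ReadEntries method, and the compacted sample XML are assumptions made for this illustration, not the original project code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public class Entry
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Item1 { get; set; }
    public string Item2 { get; set; }
    public string Item3 { get; set; }
}

public static class EntryReader
{
    static readonly XNamespace atomNs = "http://www.w3.org/2005/Atom";
    static readonly XNamespace xhtmlNs = "http://www.w3.org/1999/xhtml";

    // Assumption: the sample document from the top of the post, with rows compacted.
    public const string SampleXml = @"<root xmlns=""http://www.w3.org/2005/Atom"">
  <entry>
    <id>1</id>
    <title>abc</title>
    <content>
      <div xmlns=""http://www.w3.org/1999/xhtml"">
        <table>
          <tr><td>Item1</td><td>111</td></tr>
          <tr><td>Item2</td><td>222</td></tr>
          <tr><td>Item3</td><td>333</td></tr>
        </table>
      </div>
    </content>
  </entry>
  <entry>
    <id>2</id>
    <title>xyz</title>
    <content>
      <div xmlns=""http://www.w3.org/1999/xhtml"">
        <table>
          <tr><td>Item1</td><td>444</td></tr>
          <tr><td>Item2</td><td>555</td></tr>
        </table>
      </div>
    </content>
  </entry>
</root>";

    // Returns the value of the sibling that follows the matching label cell, or null if the label is absent.
    public static string FindSiblingValue(this IEnumerable<XElement> elements, string labelText)
    {
        var label = elements.FirstOrDefault(x => x.Value == labelText);
        return (label == null ? null : label.ElementsAfterSelf().First().Value);
    }

    // Hypothetical wrapper around the final query that yields Entry objects.
    public static List<Entry> ReadEntries(XElement xmlList)
    {
        var entries = from item in xmlList.Descendants(atomNs + "entry")
                      let cells = item.Element(atomNs + "content").Element(xhtmlNs + "div").Element(xhtmlNs + "table").Elements(xhtmlNs + "tr").Elements(xhtmlNs + "td")
                      select new Entry
                      {
                          Id = (int)item.Element(atomNs + "id"), // explicit XElement-to-int conversion
                          Title = item.Element(atomNs + "title").Value,
                          Item1 = cells.FindSiblingValue("Item1"),
                          Item2 = cells.FindSiblingValue("Item2"),
                          Item3 = cells.FindSiblingValue("Item3")
                      };
        return entries.ToList();
    }

    public static void Main()
    {
        foreach (var e in ReadEntries(XElement.Parse(SampleXml)))
        {
            Console.WriteLine(e.Id + " " + e.Title + " " + (e.Item3 ?? "(none)"));
        }
        // prints "1 abc 333" then "2 xyz (none)"
    }
}
```

Note that the helper still calls First() on the siblings, so a label cell with no value cell after it would throw. That is fine for this document shape but worth keeping in mind for messier input.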

When looking at the final result, it really just boils down to 10 lines of C# code. However, in situations like these, where you have to address multiple things at once, the best approach is typically to break the problem down into smaller pieces and attack each one individually. It's also key to have a good testing strategy up front with unit tests and "scratch pads" like LINQPad, since the last thing you want to be doing is wasting your time hitting F5 all the time to see if your code runs correctly.