HTML Parsing with the DOMDocument

DOMDocument is a class built into PHP that helps developers navigate an HTML document tree and provides methods for interacting with the document.

Recently our development team needed to find a way to manipulate the body of an article and return JSON objects of all the body content. This was because of the constraints of the Apple News Publishing Format, which Outside recently joined. We needed to separate almost all HTML elements into their own individual component/object. As you can imagine, trying to write custom code to parse the body would’ve taken a long time and would’ve never captured all the permutations. After doing some research, we learned we were able to use PHP's DOMDocument to manipulate our body HTML content to solve the separation-of-HTML-elements problem.

What Is DOMDocument, and When Is It Used?

DOMDocument represents an HTML document as a tree of nodes and provides methods for navigating and changing that tree. If you ever need to parse or manipulate HTML content using PHP, DOMDocument can help you quickly and easily access nodes.
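To make the idea of "accessing nodes" concrete, here is a minimal sketch of splitting the top-level elements of a body into separate JSON objects. The sample markup and the `tag` and `text` keys are hypothetical and chosen only for illustration; they are not the Apple News schema.

```php
<?php
// Hypothetical stand-in for an article body.
$body = '<h2>Tent</h2><p>Light and <a href="https://example.com/tent">fast</a>.</p><hr />';

$dom = new DOMDocument();
$dom->loadHTML($body);

// loadHTML() wraps the fragment in <html><body>, so walk the body's children.
$components = [];
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    // Skip whitespace-only text nodes and comments between elements.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        continue;
    }
    // One array per top-level element; the keys are illustrative only.
    $components[] = [
        'tag'  => $node->nodeName,
        'text' => trim($node->textContent),
    ];
}

print(json_encode($components, JSON_PRETTY_PRINT) . "\n");
```

Each top-level element becomes its own entry, which is roughly the kind of separation described above; a real converter would also need to handle nested elements and attributes.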

Getting Started

At Outside, one thing we pride ourselves on is finding and sharing the best gear available. Today, we’re going to take a gear article and do a simple count of how many links are inside the body. DOMDocument is fairly easy to set up, and from there, you can manipulate it to your specific scenario.

View the article here: Upgrade Your Gear Closet with These 10 Great Deals

A copy of the article's HTML content is included under "Body Copy" at the end of this post for your own testing purposes.

Loading the Document

1. Initialize the DOMDocument().

$dom = new DOMDocument();

2. Load our HTML into the $dom object.

$dom->loadHTML($body);
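One thing to watch for when loading: loadHTML() relies on libxml, which may emit PHP warnings for markup it does not recognize, such as some HTML5 tags. The sketch below, which uses a hypothetical $body, shows one common way to collect those messages instead of letting them surface as warnings. How many messages you get depends on the installed libxml version.

```php
<?php
// Hypothetical stand-in for the article body; <section> is an HTML5 tag.
$body = '<section><p>First paragraph</p><p>Second paragraph</p></section>';

// Collect libxml parser messages internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($body);

// Read whatever the parser reported, then clear the buffer and
// restore the previous error-handling setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

printf("Loaded: %s, parser messages: %d\n", $loaded ? 'yes' : 'no', count($errors));
print($dom->getElementsByTagName('p')->length . "\n");
```

The document is still usable after the parser complains, so both paragraphs remain reachable through the DOM.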

Retrieving Elements by Tag

1. With our HTML now loaded into the DOMDocument object, we can use the getElementsByTagName() method of the DOMDocument class to get all of the link (<a>) elements.

$links = $dom->getElementsByTagName('a');

2. For this specific example, all we need is the number of links. The getElementsByTagName() method returns a DOMNodeList, so we read the length property of the DOMNodeList to get the count.

$body = HTML_CODE_HERE; // placeholder: substitute the Body Copy HTML as a string

$dom = new DOMDocument();

$dom->loadHTML($body);

$links = $dom->getElementsByTagName('a');

$num_links = $links->length;

print($num_links); // 21
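If you want to try the counting steps without the full article, here is a self-contained sketch on a hypothetical two-product excerpt shaped like the article body (the example.com URLs are placeholders). It also shows that each item in the DOMNodeList can be inspected, here to count distinct destinations.

```php
<?php
// Hypothetical excerpt in the same shape as the article body.
$body = <<<HTML
<p>The <a href="https://example.com/jacket">jacket</a> packs small.</p>
<p><a class="btn" href="https://example.com/jacket">Buy Now</a></p>
<p>The <a href="https://example.com/tent">tent</a> sets up fast.</p>
<p><a class="btn" href="https://example.com/tent">Buy Now</a></p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($body);

$links = $dom->getElementsByTagName('a');

// DOMNodeList exposes its size through the length property.
print($links->length . "\n"); // 4

// Collect each link's href to see how many distinct pages are linked.
$destinations = [];
foreach ($links as $link) {
    $destinations[] = $link->getAttribute('href');
}
$unique = array_unique($destinations);

print(count($unique) . "\n"); // 2
```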

Excluding Certain Elements in a Tag

3. If you take a look at the article and the HTML, you will see that we have two types of links: regular links within the text, and links with a class of btn, which are styled as buttons.

4. Next, we’re going to loop through all of the links so we can inspect each one. Simple enough:

foreach ($links as $link) {

}

5. Each item in the list is a DOMElement, which has a getAttribute() method we can use to get the class attribute:

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

}

6. Our next step is to check if the class of btn exists on the link. Note the argument order of strpos(): the string to search in comes first, then the string to search for.

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

  if (strpos($link_class, 'btn') !== FALSE) {

    $num_btns++;

  }

}

7. The above code looks correct, but if you look at the HTML, you’ll notice that some links don't have a class on them. For those links, getAttribute() returns an empty string. Let's skip them explicitly so an empty value never reaches strpos().

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

  if (!empty($link_class) && strpos($link_class, 'btn') !== FALSE) {

    $num_btns++;

  }

}

8. The last thing we haven't done is initialize $num_btns. Without this, PHP reports an undefined variable the first time we increment it:

$num_btns = 0;

foreach ($links as $link) {

  $link_class = $link->getAttribute('class');

  if (!empty($link_class) && strpos($link_class, 'btn') !== FALSE) {

    $num_btns++;

  }

}

print($num_btns); // 10

9. Great work! As you can see, manipulating HTML can be fairly easy with DOMDocument.
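A substring check is enough for this article, but it would also match a class such as btn-group. As a hedged alternative, the sketch below matches btn as a whole class token, first with a small helper and then with an equivalent XPath query. The sample markup and the hasClass() helper are hypothetical and not part of the original code.

```php
<?php
// Hypothetical sample: a plain link, two btn links, and one near miss.
$body = '<p><a href="/a">Plain</a> '
      . '<a class="btn" href="/b">Buy Now</a> '
      . '<a class="btn-group wide" href="/c">Not a button</a> '
      . '<a class="wide btn" href="/d">Buy Now</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($body);

// Hypothetical helper: true when $class is one of the element's class tokens.
function hasClass(DOMElement $element, string $class): bool
{
    // Split the class attribute on whitespace; a missing attribute yields [].
    $tokens = preg_split('/\s+/', trim($element->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($class, $tokens, true);
}

$num_btns = 0;
foreach ($dom->getElementsByTagName('a') as $link) {
    if (hasClass($link, 'btn')) {
        $num_btns++;
    }
}
print($num_btns . "\n"); // 2

// The same filter as an XPath query: pad the class list with spaces
// and look for " btn " so that only whole tokens match.
$xpath = new DOMXPath($dom);
$btns = $xpath->query("//a[contains(concat(' ', normalize-space(@class), ' '), ' btn ')]");
print($btns->length . "\n"); // 2
```

The loop keeps the logic in PHP, while the XPath version pushes the filtering into the query. Either should give the same count on well-formed class attributes.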

Adding Elements

10. DOMDocument can be used for more than document traversal. You can also create new elements and append them to the current HTML.

11. Let's say we want to add a link to the bottom of this page that points to all of our gear articles. We can create a link element using the createElement method!

$gear = $dom->createElement('a', "Check out our Gear Channel");

$gear->setAttribute('href', "https://www.outsideonline.com/outdoor-gear");

12. After we've created our element, all we need to do now is add it to the $dom. The createElement() method creates a new DOMElement, in this case a link, but it will not show up in the document until it is inserted. To make it appear, we use the appendChild() method. See the documentation for reference.

$dom->appendChild($gear);

13. Here is the full code for adding a link to our HTML:

$gear = $dom->createElement('a', "Check out our Gear Channel");

$gear->setAttribute('href', 'https://www.outsideonline.com/outdoor-gear');

$dom->appendChild($gear);

print($dom->textContent); 
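The code above appends the link to the document node itself and prints only the text. As a variation, and not the approach used above, the sketch below appends the link inside the <body> element and serializes the markup with saveHTML() so that the new element and its href are visible. The one-paragraph $body is a hypothetical stand-in for the article.

```php
<?php
// Hypothetical stand-in for the article body.
$body = '<p>Last paragraph of the article.</p>';

$dom = new DOMDocument();
$dom->loadHTML($body);

$gear = $dom->createElement('a', 'Check out our Gear Channel');
$gear->setAttribute('href', 'https://www.outsideonline.com/outdoor-gear');

// loadHTML() wraps the fragment in <html><body>, so a <body> element exists here.
$bodyNode = $dom->getElementsByTagName('body')->item(0);
$bodyNode->appendChild($gear);

// Serialize only the body's children, leaving out the doctype and wrappers.
$html = '';
foreach ($bodyNode->childNodes as $child) {
    $html .= $dom->saveHTML($child);
}

print($html . "\n");
```

Appending to the body keeps the new link inside the page content, which matters if the result is later embedded in a template.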

Recap

PHP's DOMDocument class makes it very easy for developers to traverse and manipulate any HTML content. There are many other methods in the class that can prove useful to you: getElementsByTagName, createAttribute, createTextNode, and createCDATASection, just to name a few. No need for any extra libraries or modules; it's all built right in!

To learn more, visit the official PHP documentation for DOMDocument.


Body Copy:

 

<p>Moosejaw's Almost Everything sale starts Tuesday and goes until April 8. Most products are at least 25 percent off, or you can use the code YAY20 to get 20 percent off a full-price item. Here are a few sale highlights our editors have their eyes on.</p>

<h2>Patagonia Women's Nano Puff Hoody ($175; 30 percent off)</h2>

<p>Although it packs down to the size of an orange, the <a href="https://goo.gl/idjz9B" target="_blank">Nano Puff hoody</a> has kept our testers warm when temps drop to the 30s. Filled with high-loft synthetic insulation, the ripstop face fabric is treated with DWR to repel water.</p>

<p><a class="btn" href="https://goo.gl/idjz9B" target="_blank">Buy Now</a></p>

<hr />

<h2>Arcteryx Mens Covert Cardigan ($134; 25 percent off)</h2>

<p>Perfect for the office or the crag, the merino wool <a href="https://goo.gl/VCgtri" target="_blank">Covert cardigan</a> is style-oriented but with technical chops. Stash your credit card or chapstick in the zipper arm pocket. </p>

<p><a class="btn" href="https://goo.gl/VCgtri" target="_blank">Buy Now</a></p>

<hr />

<h2>Gregory Men's Baltoro Backpack ($191; 40 percent off)</h2>

<p>One of <a href="https://www.outsideonline.com/1974976/best-packs-2015">our favorite</a> backpacking packs year in and year out, the 75-liter <a href="http://goo.gl/cP7Z99" target="_blank">Baltoro</a> has the all the space you need to carry gear for a week in the backcountry. Plus, the removable internal hydration sleeve transforms into a daypack for summit bids.</p>

<p><a class="btn" href="https://goo.gl/cP7Z99" target="_blank">Buy Now</a> </p>

<hr />

<h2>CamelBak Franconia LR 24 Hydration Pack ($120; 25 percent off)</h2>

<p>With plenty of room for extra layers, a first aid kit, and lunch, <a href="http://goo.gl/gA9jBB" target="_blank">the Franconia</a> also features a lumbar style hydration pack which helps center the weight on the hip and prevents water sloshing.</p>

<p><a class="btn" href="https://goo.gl/gA9jBB" target="_blank">Buy Now</a></p>

<hr />

<h2>Hydro Flask 32 Ounce Wide Mouth Bottle ($34; 15 percent off)</h2>

<p>Don't settle for warm water or cold coffee, invest in an insulated bottle and never look back. The extra-wide mouth of <a href="https://goo.gl/9CJ1qN" target="_blank">this Hydro Flask</a> allows for easy filling and cleaning. </p>

<p><a class="btn" href="https://goo.gl/9CJ1qN" target="_blank">Buy Now</a></p>

<hr />

<h2>MSR Hubba Hubba NX 2-Person Tent ($300; 25 percent off)</h2>

<p>One of the most iconic tents ever made, the Hubba Hubba was redesigned in 2014 to make the <a href="https://goo.gl/KJBgFF" target="_blank">lightest offering of the series</a> yet. The designers also included color-coded stakeouts for easy setup. </p>

<p><a class="btn" href="https://goo.gl/KJBgFF" target="_blank">Buy Now</a></p>

<hr />

<h2>Therm-a-Rest Neoair Dream Sleeping Pad ($152; 44 percent off)</h2>

<p>This may just be the ultimate sleeping pad. <a href="https://goo.gl/5mgpB8" target="_blank">The Dream</a>'s unique design combines an air mattress and a foam topper. It's hands down the most comfortable pad we've ever slept on.</p>

<p><a class="btn" href="https://goo.gl/5mgpB8" target="_blank">Buy Now</a></p>

<hr />

<h2>Helinox Chair One Camp Chair ($75; 25 percent off)</h2>

<p>Weighing just 1.6 pounds, <a href="https://goo.gl/oasg8V" target="_blank">this chair</a> can hold up to 320 pounds. The secret is a pairing of strong but light aluminum poles and tough 600 denier polyester fabric which creates a package that packs to the size of a Nalgene.</p>

<p><a class="btn" href="https://goo.gl/oasg8V" target="_blank">Buy Now</a></p>

<hr />

<h2>Osprey Women's Ariel AG 65 Backpack ($248; 20 percent off)</h2>

<p>Set yourself up for a summer full of adventures with the <a href="http://goo.gl/yydRvh" target="_blank">Ariel 65 backpack</a>. It features women's specific touches, like extra padded S-shaped shoulder straps and a wide hip belt.</p>

<p><a class="btn" href="https://goo.gl/yydRvh" target="_blank">Buy Now</a></p>

<hr />

<h2>Yeti Roadie 20 Cooler ($160; 20 percent off)</h2>

<p>Designed for life on the move, the 20-liter <a href="http://goo.gl/bEQ6ZE" target="_blank">Roadie</a> has a sturdy aluminum handle for easy transport. It has room for 16 cans inside, plus ice. </p>

<p><a class="btn" href="https://goo.gl/bEQ6ZE" target="_blank">Buy Now</a></p>

	 
