JSoup Tip How to get raw element text with newlines in Java - Parsing HTML and XML with JSoup

TL;DR: with JSoup, either switch off document pretty printing or use textNodes to pull the raw text from an element.

A quick tip for JSoup.

I wanted to pull out the raw text from an HTML element and retain the \n newline characters. But HTML doesn't care about those, so JSoup normally normalises them away when it outputs text.

I found two ways to access them.

switching off pretty printing
using the textNodes

Switching off Pretty Printing

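When you parse a document in JSoup you can switch off the prettyPrint output setting:

// assumes filename is a String path; this parse overload takes a java.io.File
Document doc = Jsoup.parse(new File(filename), "UTF-8", "http://example.com/");
doc.outputSettings().prettyPrint(false);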

Then when you access the html of an element, the \n characters from the source are still in the text. Note that element.text() normalises whitespace whatever the setting, so use html() here.

String textA = element.html();
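To make that concrete, here is a minimal sketch that parses an in-memory string rather than a file. It assumes JSoup is on the classpath; the class name `PrettyPrintDemo`, the helper `rawInnerHtml`, and the `#passage` markup are made up for illustration and are not from my code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PrettyPrintDemo {

    // Hypothetical helper: returns the inner HTML of the first element
    // matching cssQuery, with pretty printing switched off so that the
    // whitespace from the source is kept in the output.
    static String rawInnerHtml(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false);
        Element element = doc.selectFirst(cssQuery);
        return element == null ? "" : element.html();
    }

    public static void main(String[] args) {
        String html = "<div id=\"passage\">first line\nsecond line\nthird line</div>";
        String raw = rawInnerHtml(html, "#passage");
        // Check whether the newline characters survived
        System.out.println(raw.contains("\n"));
    }
}
```

Because html() returns markup, any child tags and escaped entities come back too, so this suits elements that mostly contain plain text.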

Use the `textNodes`

This approach works regardless of whether you have prettyPrint on or off:

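StringBuilder builder = new StringBuilder();
for (TextNode node : element.textNodes()) {
    // getWholeText() returns the text as parsed, newlines and spaces included
    builder.append(node.getWholeText()).append("\n\n");
}
String text = builder.toString();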
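Here is a small self-contained sketch of the text node approach, leaving prettyPrint at its default. It assumes JSoup is on the classpath and reads each node with `getWholeText()`, which returns the unencoded text including the original newlines. The class name `TextNodesDemo`, the helper `rawText`, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class TextNodesDemo {

    // Hypothetical helper: concatenates the raw text of the element's
    // direct child text nodes. textNodes() does not descend into child
    // elements, so text nested inside other tags is not included.
    static String rawText(Element element) {
        StringBuilder text = new StringBuilder();
        for (TextNode node : element.textNodes()) {
            text.append(node.getWholeText());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // prettyPrint is left switched on (the default)
        Document doc = Jsoup.parse("<div id=\"passage\">one\ntwo<br>three\nfour</div>");
        Element passage = doc.selectFirst("#passage");
        System.out.println(rawText(passage));
    }
}
```

This version adds no separator between nodes, so text on either side of the `<br>` is joined directly; append your own separator if you need one.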

If you accidentally use both methods at once then you might get confused about which one is preserving the newlines.

I think I prefer the second approach because it works regardless of the prettyPrint setting.

You can find code that illustrates this on GitHub in the TwineSugarCubeReader.java file.

See also the accompanying YouTube video.
