TL;DR: with JSoup, either switch off document pretty printing or use textNodes
to pull the raw text from an element.
A quick tip for JSoup.
I wanted to pull out the raw text from an HTML element and retain the \n
newline characters. But HTML doesn’t care about those, so JSoup normally normalises them away when it outputs the text.
I found two ways to access them.
- switching off pretty printing
- using the textNodes
Switching off Pretty Printing
When you parse a document in JSoup you can switch off prettyPrint in the document's output settings:
Document doc = Jsoup.parse(filename, "UTF-8", "http://example.com/");
doc.outputSettings().prettyPrint(false);
Then when you access the html or other text in an element, you can find all the \n characters in the text.
String textA = element.html();
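As a minimal, self-contained sketch of this approach: the class name, method name, selector and sample HTML below are made up for illustration, and it parses an inline string rather than a file so it can run on its own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PrettyPrintOffDemo {

    // Hypothetical helper: parse the HTML, switch off pretty printing,
    // and return the inner HTML of the first element matching the selector.
    static String innerHtml(String html, String selector) {
        Document doc = Jsoup.parse(html);
        // Must be set before html() is called; it affects output, not parsing.
        doc.outputSettings().prettyPrint(false);
        Element element = doc.select(selector).first();
        return element.html();
    }

    public static void main(String[] args) {
        // Sample input assumed for this sketch.
        String html = "<div id=\"passage\">line one\nline two\nline three</div>";
        String textA = innerHtml(html, "#passage");
        // With pretty printing off, the \n characters should still be present.
        System.out.println(textA.contains("\n"));
        System.out.println(textA);
    }
}
```

Note that prettyPrint is an output setting on the document, so it changes what html() returns rather than what the parser stores.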
Use the textNodes
This approach works regardless of whether you have prettyPrint on or off:
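StringBuilder text = new StringBuilder();
for (TextNode node : element.textNodes()) {
    // getWholeText() returns the text as parsed, with no whitespace normalisation or HTML escaping
    text.append(node.getWholeText()).append("\n\n");
}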
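Here is a runnable sketch of the text node approach, assuming you want the unmodified text of each node. The class name, helper name and sample HTML are hypothetical, and unlike the snippet above it adds no separator between nodes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class TextNodesDemo {

    // Hypothetical helper: join the raw text of the element's text nodes.
    static String rawText(Element element) {
        StringBuilder text = new StringBuilder();
        for (TextNode node : element.textNodes()) {
            // getWholeText() keeps the original whitespace, including \n.
            text.append(node.getWholeText());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample input assumed for this sketch; prettyPrint is left at its default.
        Document doc = Jsoup.parse("<div id=\"passage\">line one\nline two</div>");
        Element element = doc.select("#passage").first();
        String text = rawText(element);
        System.out.println(text.contains("\n"));
        System.out.println(text);
    }
}
```

One thing to keep in mind: textNodes() only returns the text nodes that are direct children of the element, so text inside nested tags is not included.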
If you accidentally use both methods at the same time then you might get confused by the results.
I think I prefer the second approach because it works regardless of the prettyPrint setting.
You can find code that illustrates this on GitHub in the TwineSugarCubeReader.java file.
See also the accompanying YouTube Video: