score:3

Accepted answer

promoting a comment to an answer...

It can't be done at the plain-text level, because the semantic information about slide boundaries has already been thrown away by that point.

What you need to do is extract the PowerPoint file as XHTML; the Tika website has examples of how to do that from Java. Once you've got that, you'll see that the HTML has a structure like:

<body>
<div class="slideshow">
    <div class="slide">
       <div class="slide-master-content">
       </div>
       <div class="slide-content">
       </div>
    </div>
    <div class="slide">
       <div class="slide-content">
       </div>
    </div>
</div>
<div class="slide-notes">
</div>
</body>

So, you'll find a div for each slide, and within it you can tell what belongs to the slide itself and what came from the slide master (if any). Split the document on the slide divs, grab the text out of each one, and you're there!
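For the splitting step, here is a rough sketch using only the JDK's built-in DOM parser. It assumes you already have the XHTML as a string (for example from Tika, by parsing with an XML-producing content handler) and that it is well-formed and uses the class names shown above. `SlideSplitter` and `splitSlides` are made-up names for illustration, not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: splits Tika-style slideshow XHTML into one text string per slide.
public class SlideSplitter {

    /** Returns the text of each <div class="slide">, ignoring slide-master content and notes. */
    public static List<String> splitSlides(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            List<String> slides = new ArrayList<>();
            NodeList divs = doc.getElementsByTagName("div");
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if (!"slide".equals(div.getAttribute("class"))) {
                    continue; // not a slide container (slideshow, slide-notes, ...)
                }
                // Only keep text from the slide's own content, not the master's.
                StringBuilder text = new StringBuilder();
                NodeList inner = div.getElementsByTagName("div");
                for (int j = 0; j < inner.getLength(); j++) {
                    Element child = (Element) inner.item(j);
                    if ("slide-content".equals(child.getAttribute("class"))) {
                        collectText(child, text);
                    }
                }
                // Collapse runs of whitespace so each slide is a single clean line.
                slides.add(text.toString().replaceAll("\\s+", " ").trim());
            }
            return slides;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse slideshow XHTML", e);
        }
    }

    // Appends every text node under 'node', separated by spaces.
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }
}
```

Calling `SlideSplitter.splitSlides(xhtml)` gives you a list with one entry per slide, in document order. If you also want the master text, collect from the `slide-master-content` divs the same way.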

