1 vote

How to get the highlighted text in a PDF with server-side JavaScript?

I am trying to get the highlights (or underlines) from a PDF. I am using PDFJS to read the PDF data on the server side. So far I can get the notes (the post-it style comments) attached to the highlighted text, but I can't get the text that the user highlighted. This is the code I wrote to test it, and it "works":

var fs = require('fs');
// PDFJS is assumed to be loaded elsewhere; the original post does not show how.

var data = new Uint8Array(fs.readFileSync('./pdftests/test1.pdf'));
PDFJS.getDocument(data).then(function (pdfDoc) {
    pdfDoc.getPage(1).then(function (page) {
        // Text runs of the page (strings plus their positions)
        page.getTextContent().then(function (textContent) {
            console.log("textContent: " + JSON.stringify(textContent));
        });
        // Annotations of the page (highlights, popups, ...)
        page.getAnnotations().then(function (annotations) {
            console.log("getAnnotations: " + JSON.stringify(annotations));
        });
    }).catch(function (err) {
        console.log('Error data');
        console.log(err);
    });
}).catch(function (err) {
    console.log(err);
});

The console.log in getAnnotations prints the following:

[ { id: '18R',
    subtype: 'Highlight',
    annotationFlags: 4,
    rect: [ 126, 617.856, 237.3855, 631.26 ],
    color: Uint8Array [ 247, 220, 0 ],
    borderStyle: 
     { width: 0,
       style: 1,
       dashArray: [Object],
       horizontalCornerRadius: 0,
       verticalCornerRadius: 0 },
    hasAppearance: true,
    annotationType: 9,
    hasPopup: true,
    title: '',
    contents: 'PRUEBA\n' },
  { id: '19R',
    subtype: 'Popup',
    annotationFlags: 0,
    rect: [ 241.3855, 617.856, 369.3855, 689.856 ],
    color: Uint8Array [ 0, 0, 0 ],
    borderStyle: 
     { width: 0,
       style: 1,
       dashArray: [Object],
       horizontalCornerRadius: 0,
       verticalCornerRadius: 0 },
    hasAppearance: false,
    annotationType: 16 } ]

I can't find a way to tell which text is highlighted.

While searching I found a JSFiddle that does essentially what I want: http://jsfiddle.net/seikichi/RuDvz/2/

However, when I import that code into my controller, line 38 throws an error saying that there is no fromData method:

var annotation = PDFJS.Annotation.fromData(data); 

The PDFJS documentation has not helped me much.

1 vote

yms Points 475

I don't know pdfjs well enough to give you a complete answer, but I can give you some good leads:

  • A highlight in a PDF (a Highlight annotation) does nothing more than draw a colored rectangle on a region of the page. The page content itself is drawn as it is, independently of the annotation.

  • A Highlight annotation object gives you no direct way to know which text it highlights. You have to take the coordinates of its rectangle and use them to find the matching text in the object that holds the page content.
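To illustrate the second point, here is a minimal sketch of the matching step. It assumes text items shaped like the ones getTextContent() returns (str, transform, width, height), with transform[4] and transform[5] taken as the x and y of each item's origin, and unrotated text. The helper names rectsOverlap, itemRect and textUnderAnnotation are made up for this example, and the sample items are invented data, not output from a real PDF.

```javascript
// Sketch only: find the text items whose box overlaps an annotation's rect.
// Rectangles are [x1, y1, x2, y2] in PDF units, origin at the bottom-left.
function rectsOverlap(a, b) {
  return a[0] < b[2] && a[2] > b[0] && a[1] < b[3] && a[3] > b[1];
}

// Approximate bounding box of a text item (assumes unrotated text).
function itemRect(item) {
  var x = item.transform[4];
  var y = item.transform[5];
  return [x, y, x + item.width, y + item.height];
}

// Join the strings of every item that overlaps the annotation rectangle.
function textUnderAnnotation(annotation, items) {
  return items
    .filter(function (item) { return rectsOverlap(annotation.rect, itemRect(item)); })
    .map(function (item) { return item.str; })
    .join(' ');
}

// Invented sample data: the rect is the one from the question's output.
var annotation = { subtype: 'Highlight', rect: [126, 617.856, 237.3855, 631.26] };
var items = [
  { str: 'Highlighted words', transform: [12, 0, 0, 12, 128, 620], width: 100, height: 12 },
  { str: 'Other line', transform: [12, 0, 0, 12, 128, 590], width: 60, height: 12 }
];
console.log(textUnderAnnotation(annotation, items)); // prints: Highlighted words
```

Keep in mind this works at the granularity of whole text items: if one item covers a full line and only part of it is highlighted, you get the whole line back.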

This answer on the English Stack Overflow has sample code for extracting the text of a full page. You could use it as a base and limit the output to the pieces of text that fall within your rectangle:

Extract text from a PDF from Javascript

Important details to keep in mind:

  • Not every PDF file lets you extract human-readable text, even if the text looks fine on screen. Sometimes the only information inside the PDF is "draw a stroke from here to there, and now a curve to nowhere", with nothing anywhere indicating that the end result is text.

  • Text inside a PDF does not have to be stored the way it appears on screen. It can be stored letter by letter, word by word, or line by line, positioned by coordinates and in no particular order.
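Because of that last point, you may need to put the fragments back in reading order yourself. A rough sketch, again assuming items with a str and a transform whose entries 4 and 5 are x and y, and simple horizontal text; toLines and the tolerance value are hypothetical choices for this example, and the fragments are invented data:

```javascript
// Sketch only: group unordered text fragments into lines, top to bottom,
// and order each line left to right.
function toLines(items, tolerance) {
  // PDF y grows upward, so a larger y means higher on the page.
  var byY = items.slice().sort(function (a, b) {
    return b.transform[5] - a.transform[5];
  });
  var lines = [];
  byY.forEach(function (item) {
    var last = lines[lines.length - 1];
    // Fragments whose y is within the tolerance share a line.
    if (last && Math.abs(last.y - item.transform[5]) <= tolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.transform[5], items: [item] });
    }
  });
  return lines.map(function (line) {
    return line.items
      .sort(function (a, b) { return a.transform[4] - b.transform[4]; })
      .map(function (item) { return item.str; })
      .join('');
  });
}

// Invented fragments, deliberately out of order.
var fragments = [
  { str: 'world', transform: [1, 0, 0, 1, 160, 700] },
  { str: 'Second line', transform: [1, 0, 0, 1, 100, 686] },
  { str: 'Hello ', transform: [1, 0, 0, 1, 100, 700.2] }
];
var orderedLines = toLines(fragments, 2);
console.log(orderedLines); // [ 'Hello world', 'Second line' ]
```

This will not handle rotated text, multiple columns, or PDFs that contain no real text at all.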

Good luck.
