PDF tags (pdfinfo.exe)

den4b · 2016-04-28 14:27

To help you with your homework...

Pascal Script documentation:
http://www.den4b.com/wiki/ReNamer:Pascal_Script

Function reference:
http://www.den4b.com/wiki/ReNamer:Pasca … :Functions

Example scripts:
http://www.den4b.com/wiki/ReNamer:Scripts

kunkel321 · 2016-05-01 19:17

Thanks for the resources Dennis. Are there any examples of how to insert the content (the PDF tag, in this case) as a 'prefix' rather than a suffix? I thought maybe the "move_to" command from here http://www.den4b.com/wiki/ReNamer:Scrip … me_portion, but I don't know... This also makes me think of a feature idea, but I'll put that on a separate thread...

Andrew · 2016-05-02 19:17

FileName := WideExtractBaseName(FileName) + ' ' + Matches[0] + WideExtractFileExt(FileName);
    ↓                 ↓                      ↓          ↓                    ↓
new filename    base part of old filename   space    metatag      extension (last part) of old filename

This is just simple logic and common sense. All you need to do is move the metatag and space before the filename if you want to prepend rather than append it.

Of course if you're writing WideExtractBaseName(FileName) + WideExtractFileExt(FileName), then you might as well shorten it to simply FileName.
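
For anyone who wants the prepend variant spelled out, here is a minimal sketch based on the Title script used in this thread. Only the assignment line differs from the original. It assumes "pdfinfo.exe" sits in ReNamer's folder, and it does not handle an empty Title, which would leave a leading space in the new name.

```pascal
const
  EXE = 'pdfinfo.exe';
  TAG = 'Title\s*\:\s*(.*?)[\r\n]';

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE + ' "' + FilePath + '"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    // Prefix: metatag first, then a space, then the unchanged old name.
    if Length(Matches) > 0 then
      FileName := Matches[0] + ' ' + FileName;
  end;
end.
```

Because FileName already contains the extension, there is no need to split it with WideExtractBaseName and WideExtractFileExt here.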

kunkel321 · 2016-05-03 13:55

Hmm.. Yes, I guess that was a no-brainer! Thanks Andrew

EDIT:
Just an extra note here, in case anyone else finds this thread via search, as I did.
My whole purpose was to rename a large number of ebooks in PDF format and give them a consistent naming structure. It turns out that most of the PDFs are missing the metadata needed.
Also, the metadata tag is likely to have "invalid characters" that will get added to the file name.
I looked up which common characters are invalid or best avoided in Windows file/folder names. Here they are in a form that you can paste (and save) as a custom Transliteration list that changes them all to hyphens:

#=-
%=-
&=-
{=-
}=-
\=-
<=-
>=-
*=-
?=-
/=-
$=-
!=-
'=-
"=-
:=-
+=-
`=-
==-

Interestingly, the @ symbol and the space character " " were also on the list I found. I use both of those in file and folder names though, so I left them out here.
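
The same clean-up can also be done inside the script, before the tag ever reaches the file name. The sketch below is an untested illustration: it replaces only the nine characters that Windows itself rejects in file names (\ / : * ? " < > |), which includes the vertical bar that is not in the list above. The variable name Title is just an illustrative choice.

```pascal
const
  EXE = 'pdfinfo.exe -enc UTF-8';
  TAG = 'Title\s*\:\s*(.*?)[\r\n]';

var
  Command, Output: String;
  Matches: TWideStringArray;
  Title: WideString;

begin
  Command := EXE + ' "' + FilePath + '"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    if Length(Matches) > 0 then
    begin
      Title := Matches[0];
      // Swap each character that Windows forbids for a hyphen.
      Title := WideReplaceStr(Title, '\', '-');
      Title := WideReplaceStr(Title, '/', '-');
      Title := WideReplaceStr(Title, ':', '-');
      Title := WideReplaceStr(Title, '*', '-');
      Title := WideReplaceStr(Title, '?', '-');
      Title := WideReplaceStr(Title, '"', '-');
      Title := WideReplaceStr(Title, '<', '-');
      Title := WideReplaceStr(Title, '>', '-');
      Title := WideReplaceStr(Title, '|', '-');
      FileName := Title + WideExtractFileExt(FileName);
    end;
  end;
end.
```

More characters from the list above can be added the same way, one WideReplaceStr line each.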

EDIT AGAIN: Actually I'm going to post this list as a separate thread... It just occurred to me that maybe tilde "~" should be there...

Last edited by kunkel321 (2016-05-04 20:49)

jeffli · 2019-09-06 02:59

Hi,
The default encoding of pdfinfo.exe is Latin1, which doesn't support Chinese characters. If the metadata contains Chinese words, they will not be displayed.
It should display:

ReNamer_Pro_7.1>pdfinfo.exe 3.pdf
Title:          标题abc1
Subject:        主题abc
Keywords:       关键字abc2

But it actually displays:

ReNamer_Pro_7.1>pdfinfo.exe 3.pdf
Title:          abc1
Subject:        abc
Keywords:       abc2

Try running this, it works well:

pdfinfo.exe -enc UTF-8 1.pdf

If you want to support Chinese characters, the Pascal script should be modified to:

const
  EXE = 'pdfinfo.exe -enc UTF-8';
  TAG = 'Title\s*\:\s*(.*?)[\r\n]';

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE+' "'+FilePath+'"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    if Length(Matches) > 0 then
      FileName := Matches[0] + WideExtractFileExt(FileName);
  end;
end.

Last edited by jeffli (2019-09-07 02:30)

den4b · 2019-09-07 20:47

jeffli wrote:

Try running this, it works well:
pdfinfo.exe -enc UTF-8 1.pdf

Thanks for pointing this out. The script on the wiki has been updated accordingly.

http://www.den4b.com/wiki/ReNamer:Scripts:Xpdf

jeffli · 2019-09-08 00:18

den4b wrote:

jeffli wrote:
Try running this, it works well:
pdfinfo.exe -enc UTF-8 1.pdf
Thanks for pointing this out. The script on the wiki has been updated accordingly.
http://www.den4b.com/wiki/ReNamer:Scripts:Xpdf

Hi,
You have helped us a lot. I really appreciate it and wanted to give some positive feedback.

Cleoss · 2019-09-24 01:33

den4b wrote:

and there are no open source libraries for Delphi to parse PDFs

Oh, is ReNamer written in Delphi? In RAD Studio? Doesn't it encounter any utf issues?

Stefan · 2019-10-04 11:46

I have now tested this myself; here are my explanations...

Extracting metadata from a PDF file.

You can use the [Insert Meta Tag] button to insert metadata from files, like ":File_DateCreated:".
But there is no metadata extraction for PDFs by default.

Read den4b's post above: http://www.den4b.com/forum/viewtopic.php?id=349

den4b wrote:

- The problem is that extracting tags from PDF is no easy task,
you'll nearly need to write an entire PDF parser to get that information.
- Anyway, there is possibility of using a 3-rd party executable tool to extract the tags from PDF,
- and then with a help of PascalScript, use them within ReNamer.

- - -

The package that we need to extract PDF meta data is called "Xpdf".

The Xpdf open source project includes a PDF viewer along with a collection
of command line tools which perform various functions on PDF files.
Xpdf was first released in 1995. It was written, and is still developed, by Derek Noonburg.

Download and extract it.
>> Browse to http://www.xpdfreader.com/ (((formerly www.foolabs.com/xpdf)))
>> CLICK "Download the open source Xpdf tools"
>> CLICK "Download the Xpdf command line tools:" > Windows 32/64-bit: download

You will get a file named something like "xpdf-tools-win-4.02.zip".
Extract that ZIP file.

- - -

It has a command line tool called "pdfinfo.exe"
which we will use to print information from a PDF file.

Copy the 32-bit version of "pdfinfo.exe" and place it into ReNamer's folder.
--- ...\xpdf-tools-win-4.02\bin32\pdfinfo.exe

You may also want to read the documentation:
--- ...\xpdf-tools-win-4.02\doc\pdfinfo.txt
(((or read it online: http://www.xpdfreader.com/pdfinfo-man.html)))

- - -

With "pdfinfo.exe" in the same folder as "ReNamer.exe", copy a sample PDF file into this folder too:
pdfinfo.exe
pdfinfo.txt
PDFTEST.pdf
ReNamer.exe
ReNamer.ini

Test it out:
Open a command prompt window. (((cmd.exe, Win+R, type cmd, press Enter)))
Navigate to the ReNamer folder.

In the command prompt window, enter the following command:
pdfinfo.exe PDFTEST.pdf

View the output in the command prompt.
You can also save that output to a text file:
pdfinfo.exe PDFTEST.pdf >PDFTESToutput.txt

Example Outputs (for reference):
Please note: not every tag has a value; some are simply empty.
The date format may depend on your Windows system settings.

English date format:

Title:          PDFTEST.pdf
Author:         name removed
Creator:        PScript5.dll Version 6.0.1
Producer:       Acrobat Distiller 9.1.6 (Windows)
CreationDate:   04/06/17 19:46:57
ModDate:        04/06/17 19:46:57
Tagged:         yes
Form:           none
Pages:          3
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

German date format (note the missing leading zero on days less than 10):

Title:          VBScript FileSystemObject
Author:         name removed
Creator:        PDFCreator Version 0.8.1
Producer:       AFPL Ghostscript 8.51
CreationDate:   Mon Oct 16 11:58:34 2006
ModDate:        Mon Oct  9 10:23:45 2007
Tagged:         no
Form:           none
Pages:          57
Encrypted:      no
Page size:      612 x 792 pts (letter) (rotated 0 degrees)
File size:      313778 bytes
Optimized:      no
PDF version:    1.3

TIP:
Google for "Experts Exchange 6. Run the PDFinfo utility on the sample PDF file"
to see an example output of PDF metadata.

- - -

Next we use a PascalScript to execute "pdfinfo.exe" and read its output,
just as we did manually in the command window above.

Pseudo code:
-- myEXE = 'pdfinfo.exe -enc UTF-8';
(The default encoding of pdfinfo.exe is Latin1, which doesn't support Chinese characters.)
-- myCommand := myEXE+' "'+FilePath+'"';
-- ExecConsoleApp(myCommand, strOutput)
-- ShowMessage(strOutput); //should show something like the "Example Outputs" above.

-- TAG = 'Title\s*\:\s*(.*?)[\r\n]';
-- Use a regular expression ('TAG') to pick out the wanted line,
and then use PascalScript functions to bring the result into a nice format.
(see http://www.den4b.com/wiki/ReNamer:Pasca … :Functions )

Modify the TAG constant to specify which tag (line) you want to extract,
and use PascalScript functions to convert the result to the wanted format (if it is not already).
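
The first half of the pseudo code can be turned into a small diagnostic script. This is only a sketch for checking what pdfinfo.exe returns inside ReNamer; it does not rename anything. The error message text is an illustrative choice.

```pascal
var
  Command, Output: String;

begin
  Command := 'pdfinfo.exe -enc UTF-8 "' + FilePath + '"';
  // Exit code 0 means pdfinfo.exe ran successfully.
  if ExecConsoleApp(Command, Output) = 0 then
    ShowMessage(Output)
  else
    ShowMessage('pdfinfo.exe failed for: ' + FilePath);
end.
```

Try it with a single PDF in the file list, because a message box pops up for every file that is previewed.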

### ### ### ### ### ### ### ###

Working code to get the "TITLE"-line from PDF meta data:

From den4b's post above: http://www.den4b.com/forum/viewtopic.php?id=349

//Author: Denis Kozlov. Date: 2013-04-01.
const
  EXE = 'pdfinfo.exe -enc UTF-8';
  //Find a line in the output, starting with "Title" and ending at the EOL sequence.
  TAG = 'Title\s*\:\s*(.*?)[\r\n]';

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE+' "'+FilePath+'"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    if Length(Matches) > 0 then
      FileName := Matches[0] + WideExtractFileExt(FileName);
  end;
end.

The example script just replaces the current name, leaving only the original extension untouched.
To append the meta tag to the end of the file name, find the following line:
FileName := Matches[0] + WideExtractFileExt(FileName);
And replace it with:
FileName := WideExtractBaseName(FileName) + ' ' + Matches[0] + WideExtractFileExt(FileName);

The same den4b script as above, but with an "else" branch that marks files for which nothing was found:

//Author: Denis Kozlov. Date: 2013-04-01.
const
  EXE = 'pdfinfo.exe -enc UTF-8';
  TAG = 'Title\s*\:\s*(.*?)[\r\n]';

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE+' "'+FilePath+'"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    if Length(Matches) > 0 then
      FileName := Matches[0] + WideExtractFileExt(FileName)
    else
      FileName := '__No_Matches__' + FileName;
  end;
end.

### ### ### ### ### ### ### ###

Working code to get the "CreationDate"-line from PDF meta data:

English date format (the actual format may depend on your Windows system settings):

EXAMPLE LINE English: CreationDate: 04/06/17 19:46:57

-------------------------------------------------------
TEST it with a "Regular Expressions" rule:

// Matches are counted from the left: (1) (2) (3) (4) (5) (6)
// Use those parts to compose the wanted new name: $1 $2 $3...

Expression "CreationDate\s*\:\s*(\d\d).(\d\d).(\d\d).\s*(\d\d):(\d\d):(\d\d)"
Replace "20$3-$2-$1 $4$5$6" (skip extension)

Try that "Regular Expressions-rule" with the Analyze tool (Shift+A)
http://www.den4b.com/wiki/ReNamer:Analyze

Original:
CreationDate: 04/06/17 19:46:57

Replaced:
2017-06-04 194657
-------------------------------------------------------

Working code to get the CreationDate in English format:

//Working code to get the CreationDate in English format:
//Author: Denis Kozlov. Date: 2013-04-01. Stefan 2019-10-04
const
  EXE = 'pdfinfo.exe -enc UTF-8';
  //Find a line in the output, starting with "CreationDate".
  //EXAMPLE LINE English: CreationDate:   04/06/17 19:46:57
  //The dots match whatever separator stands between the date parts.
  TAG = 'CreationDate\s*\:\s*(\d\d).(\d\d).(\d\d).\s*(\d\d):(\d\d):(\d\d)[\r\n]';
  //Matches are counted from the left, starting at 0 (assuming DD/MM/YY order):
  //  [0]=day  [1]=month  [2]=year  [3]=hour  [4]=minute  [5]=second

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE+' "'+FilePath+'"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    if Length(Matches) = 6 then
      FileName := '20' + Matches[2] + Matches[1]
        + Matches[0] + WideExtractFileExt(FileName)
    else
      //pdfinfo.exe ran, but no CreationDate line matched:
      FileName := '__NOTHING_FOUND___'+FileName;
  end;
end.

- - -

German date format (the actual format may depend on your Windows system settings):

EXAMPLE LINE German: CreationDate: Mon Oct 16 11:58:34 2006
EXAMPLE LINE German: CreationDate: Sun Oct 4 18:11:21 2009 //no leading zero on days less than 10!

-------------------------------------------------------
TEST it with a "Regular Expressions" rule:

// Matches are counted from the left: (1) (2) (3) (4) (5) (6) (7)
// Use those parts to compose the wanted new name: $1 $2 $3...

Expression "CreationDate\s*\:\s*(\w\w\w)\s*(\w\w\w)\s*(\d+)\s*(\d\d):(\d\d):(\d\d)\s*(\d\d\d\d)"
Replace "$7-$2-$3 $4$5$6" (skip extension)

Original:
CreationDate: Wed Dec 17 15:00:42 2008
CreationDate: Sun Oct 4 18:11:21 2009

Replaced:
2008-Dec-17 150042
2009-Oct-4 181121
-------------------------------------------------------

Working code to get the CreationDate in German format:

//Working code to get the CreationDate in German format:
//Author: Denis Kozlov. Date: 2013-04-01. Stefan 2019-10-04
const
  EXE = 'pdfinfo.exe -enc UTF-8';
  //Find a line in the output, starting with "CreationDate".
  //EXAMPLE LINE German: CreationDate:   Sun Oct  4 18:11:21 2009
  //EXAMPLE LINE German: CreationDate:   WeekDay Month Day 18:11:21 Year
  TAG = 'CreationDate:\s+(\w\w\w)\s+(\w\w\w)\s+(\d+)\s+(\d\d):(\d\d):(\d\d)\s+(\d\d\d\d)';
  //Matches are counted from the left, starting at 0:
  //  [0]=weekday [1]=month [2]=day [3]=hour [4]=minute [5]=second [6]=year

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE+' "'+FilePath+'"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    //ShowMessage(IntToStr(Length(Matches)));
    if Length(Matches) = 7 then
    begin
      //ShowMessage(Output);
      //replace the MONTH word by the month number:
      Matches[1] := WideReplaceStr(Matches[1], 'Jan', '01');
      Matches[1] := WideReplaceStr(Matches[1], 'Feb', '02');
      Matches[1] := WideReplaceStr(Matches[1], 'Mar', '03');
      Matches[1] := WideReplaceStr(Matches[1], 'Apr', '04');
      Matches[1] := WideReplaceStr(Matches[1], 'May', '05');
      Matches[1] := WideReplaceStr(Matches[1], 'Jun', '06');
      Matches[1] := WideReplaceStr(Matches[1], 'Jul', '07');
      Matches[1] := WideReplaceStr(Matches[1], 'Aug', '08');
      Matches[1] := WideReplaceStr(Matches[1], 'Sep', '09');
      Matches[1] := WideReplaceStr(Matches[1], 'Oct', '10');
      Matches[1] := WideReplaceStr(Matches[1], 'Nov', '11');
      Matches[1] := WideReplaceStr(Matches[1], 'Dec', '12');
      //pad a DAY less than 10 with a zero:
      if Length(Matches[2]) < 2 then Matches[2] := '0'+Matches[2];
      FileName := Matches[6]+'-'+Matches[1]+'-'+Matches[2]
        +'_'+Matches[3]+Matches[4]+Matches[5]+'_'+FileName;
    end
    else
      //pdfinfo.exe ran, but no CreationDate line matched:
      FileName := '__NOTHING_FOUND___'+FileName;
  end;
end.

HTH?

den4b · 2019-10-05 10:51

The pdfinfo tool has a command line option "-rawdates" which might simplify date parsing.

By default, I get the following date format:

CreationDate:   Thu Jan 31 23:33:00 2019

But with the "-rawdates" option I get this:

CreationDate:   D:20190131233300+01'00'
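
With the raw format, the month-name replacement and zero padding are no longer needed. The following is a minimal, untested sketch of how the CreationDate script could look with "-rawdates". It assumes the date is always written out in full down to the seconds, as in the example line above, and it ignores the time zone part.

```pascal
const
  EXE = 'pdfinfo.exe -enc UTF-8 -rawdates';
  //EXAMPLE LINE: CreationDate:   D:20190131233300+01'00'
  TAG = 'CreationDate\s*\:\s*D\:(\d\d\d\d)(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)';
  //Matches: [0]=year [1]=month [2]=day [3]=hour [4]=minute [5]=second

var
  Command, Output: String;
  Matches: TWideStringArray;

begin
  Command := EXE + ' "' + FilePath + '"';
  if ExecConsoleApp(Command, Output) = 0 then
  begin
    Matches := SubMatchesRegEx(Output, TAG, False);
    // Only rename when all six date and time parts were found.
    if Length(Matches) = 6 then
      FileName := Matches[0] + '-' + Matches[1] + '-' + Matches[2] + '_'
        + Matches[3] + Matches[4] + Matches[5] + '_' + FileName;
  end;
end.
```

For the example line above, this puts "2019-01-31_233300_" in front of the existing file name.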
