Header format for polyglot books.

Discussions about Winboard/Xboard. News about engines or programs to use with these GUIs (e.g. tournament managers or adapters) belong in this sub forum.

Moderator: Andres Valverde

Postby Michel » 20 Sep 2012, 19:32

I have now released version 1.0 of the polyglot header specification as well as version 1.0
of a utility to handle such headers. Both can be found here:

http://hardy.uhasselt.be/Toga/pgheader-release/

(There is a Windows binary.)

I have also upgraded the last version of Polyglot I have been maintaining (1.4.67b) to one that is aware of the new header (1.4.70b).

polyglot make-book

will write a default header and

polyglot merge-book

will merge the variant lists in the books it is merging.

Version 1.4.70b can be found here:

http://hardy.uhasselt.be/Toga/polyglot-release/
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

Re: Header format for polyglot books.

Postby matematiko » 21 Sep 2012, 01:45

Thank you very much Michel.

Regards,
One that does not live to serve, does not deserve to live.
matematiko
 
Posts: 219
Joined: 07 Dec 2008, 17:11
Location: Texas

Re: Header format for polyglot books.

Postby crystalclear » 24 Sep 2012, 21:32

I am starting to write some header generation and recognition software for my chess engine.

I have one query. I think I read the acronym ASCII in the header description. That means you intentionally or unintentionally exclude various characters from the comments, e.g. you forbid a comment saying "If you like and use this opening book, please donate £1, $1 or €1 to a charity of your choice." since the comment contains a pound symbol (£) and a euro symbol (€). You also exclude comments in Arabic, Japanese, Russian, Chinese, etc.

Why not have the comment in Unicode, coded in UTF-8 format?
Then various chess symbols could be included in the comment too.

I believe that UTF-8 allows ASCII strings to be encoded with no change to the strings, only to the software that handles them. If we specify Unicode right at the start, that might be easier than changing it later or having problems with "foreign" people using their own clashing 8-bit code pages.
crystalclear
 
Posts: 91
Joined: 22 Sep 2011, 14:19

Re: Header format for polyglot books.

Postby Michel » 25 Sep 2012, 12:42

crystalclear wrote:I think I read the acronym ASCII in the header description. That means you intentionally or unintentionally exclude various characters from the comments, e.g. you forbid a comment saying "If you like and use this opening book, please donate £1, $1 or €1 to a charity of your choice." since the comment contains a pound symbol (£) and a euro symbol (€). You also exclude comments in Arabic, Japanese, Russian, Chinese, etc.


I think it makes sense to require the predefined fields to be printable ASCII. In programming, printable ASCII is still the norm.

The comment fields do not have to be printable ASCII, I think. If I require this somewhere, please tell me and I will clarify.

crystalclear wrote:Why not have the comment in Unicode, coded in UTF-8 format?
Then various chess symbols could be included in the comment too.


This seems like a good idea.
So the predefined fields would be printable ASCII and the comment field would start at the first \n character after the
predefined fields. Everything after that would be one UTF-8 encoded field.
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

Re: Header format for polyglot books.

Postby Michel » 25 Sep 2012, 13:08

UTF-8 really seems to be a very good option since it does not collide with the use of \n as a field separator (which I was afraid of).

So I could just require upfront that the logical header is a UTF-8 encoded character string. That would not break anything.
Thanks.
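The claim that UTF-8 cannot collide with \n can be checked mechanically. Below is a minimal Python sketch (not part of pgheader; the function name is made up for illustration). It verifies that every byte produced by a non-ASCII code point has its high bit set, so the byte 0x0A inside the header can only ever be a real newline.

```python
def utf8_never_hides_newline(code_points) -> bool:
    """Return True if no given code point encodes to a byte below 0x80.

    Hypothetical helper for illustration only; it is not taken from pgheader.
    """
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        if any(byte < 0x80 for byte in chr(cp).encode("utf-8")):
            return False
    return True


# Check every Unicode code point above the ASCII range.
print(utf8_never_hides_newline(range(0x80, 0x110000)))
```

This is why splitting the raw header bytes on 0x0A before decoding is safe: a lead or continuation byte of a multi-byte sequence is never in the ASCII range.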

EDIT:

Actually I just verified that my utility pgheader ( http://hardy.uhasselt.be/Toga/pgheader-release/ ) already seems to handle
UTF-8 comments just fine.... (I only checked the Linux binary, not the Windows binary. I think Linux handles Unicode transparently, but I am not sure about Windows.)

EDIT2

I put the UTF-8 requirement in the specs. See http://hardy.uhasselt.be/Toga/pgheader- ... eader.html under
"The logical header".
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

Re: Header format for polyglot books.

Postby H.G.Muller » 25 Sep 2012, 19:48

Well, that sort of excludes that I will ever implement it in WinBoard, as that does not support unicode...
H.G.Muller
 
Posts: 3453
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: Header format for polyglot books.

Postby Michel » 25 Sep 2012, 20:11

H.G.Muller wrote:Well, that sort of excludes that I will ever implement it in WinBoard, as that does not support unicode...


Why? 7-bit ASCII is a perfect subset of UTF-8. So if the comment section is in 7-bit ASCII (as it would be in the western world)
then it would display just fine.

Furthermore the predefined fields (which contain the important metadata necessary for the correct
use of the book) are required to be printable ASCII characters...

The only issue would be that there _could_ be multibyte characters in the comment section (say for a book file originating in China).
If WinBoard does not understand those, the comment would be displayed as garbage...

But I would not be so sure that WinBoard would not display it correctly. I was quite surprised that my utility handled
UTF-8 comments just fine, without me having done anything special about it.

EDIT: BTW, if WinBoard really is unable to display UTF-8 encoded Unicode then it can look for bytes in the comment section with their highest bit set.
Such a byte signals a multi-byte sequence, and WinBoard can just decline to display the comment with a polite message (instead of displaying garbage). The free-format comment
is just informative and not essential for the functioning of the book.
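The high-bit check suggested in the EDIT could look roughly like this. It is a Python sketch rather than WinBoard code; the function name and the wording of the notice are invented for the example.

```python
NOTICE = "[comment contains non-ASCII characters and is not shown]"


def displayable_comment(raw: bytes) -> str:
    """Return the comment if it is pure 7-bit ASCII, else a polite notice."""
    if any(byte & 0x80 for byte in raw):  # high bit set: not 7-bit ASCII
        return NOTICE
    return raw.decode("ascii")


print(displayable_comment(b"A plain comment"))
print(displayable_comment("Costs 1 \u20ac".encode("utf-8")))
```

The first call prints the comment unchanged; the second prints the notice, because the euro sign is encoded as three bytes that all have the high bit set.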
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

Re: Header format for polyglot books.

Postby Michel » 26 Sep 2012, 09:53

It seems Linux and the internet use UTF-8 as the default encoding (which is sane since UTF-8 is a backward-compatible extension of 7-bit ASCII).

Windows, however, uses UTF-16 as default.

So to display a UTF-8 encoded string which is not 7-bit ASCII on Windows, I assume you first have to convert it to UTF-16 using MultiByteToWideChar:

http://msdn.microsoft.com/en-us/library ... 85%29.aspx
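To show what such a conversion does to the bytes, here is a portable Python sketch. It does not call the Win32 API; it only performs the same UTF-8 to UTF-16 (little-endian) re-encoding that a Windows wide-character display path would need. The function name is an assumption made for the example.

```python
def utf8_to_utf16le(raw: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 little-endian, without a BOM."""
    text = raw.decode("utf-8")       # raises UnicodeDecodeError on invalid input
    return text.encode("utf-16-le")


euro_utf8 = bytes([0xE2, 0x82, 0xAC])       # the euro sign as stored in the book
print(utf8_to_utf16le(euro_utf8).hex(" "))  # prints: ac 20
```

Pure 7-bit ASCII text goes through the same path; each character simply becomes its ASCII byte followed by a zero byte.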
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

Re: Header format for polyglot books.

Postby H.G.Muller » 26 Sep 2012, 13:19

The Windows I have all use the Latin-1 code page as default, and so does the WinBoard UI. WB is not a unicode app.
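For illustration, this is what a UTF-8 comment looks like when its bytes are interpreted through a single-byte Western code page. The Python sketch assumes the code page is Windows-1252 (the usual Windows variant of Latin-1); that choice and the function name are assumptions, not something stated in this thread.

```python
def misread_as_cp1252(raw: bytes) -> str:
    """Decode bytes as Windows-1252, as a non-Unicode application would."""
    return raw.decode("cp1252", errors="replace")


euro_utf8 = "\u20ac".encode("utf-8")  # bytes E2 82 AC
garbled = misread_as_cp1252(euro_utf8)
print(len(garbled), [hex(ord(ch)) for ch in garbled])
```

One euro sign turns into three unrelated characters, which is the "garbage" effect discussed above.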
H.G.Muller
 
Posts: 3453
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: Header format for polyglot books.

Postby Michel » 26 Sep 2012, 14:43

Well, that would mean that if a comment contains characters outside 7-bit ASCII, WinBoard would not be able to display it....

But to be honest I find it actually very hard to believe that something trivial like displaying a Unicode string would be hard
to do on Windows....

EDIT: Deleted link, since it did not contain what I claimed it did.

Are you suggesting requiring that the comment field be printable ASCII? I think this is unreasonable in this
day and age....

Note that this is not the fault of the format I designed. Whenever your header contains free text (which
is desirable since the information about opening books will usually be unstructured) you will have the
problem of specifying the encoding. UTF-8 is the best choice since it is a widely accepted standard
and moreover it is a superset of 7-bit ASCII.
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

Re: Header format for polyglot books.

Postby crystalclear » 28 Sep 2012, 02:59

EDIT: In the text below, I should perhaps have typed fischerandom instead of chess960. I was playing with the header utility, and not following the specification!




I tested the EXE header program on Windows with a dozen random Unicode characters and it worked just fine.

What didn't work was the Microsoft Windows console and command shell. I tried initially with a console opened by clicking on a BAT file.

When I called the header program from a TCL shell using the SOURCE command it worked fine.
I read the characters into a comment in the polyglot file with one call to header, hexdumped the file, and printed them back out with another call to the header program and saved the characters to a file.

With an editor I could see that I got back what I put in. I checked the hex for the € symbol in the hexdump and that was correct.

I tried to post the whole lot on here, but the mix of Japanese, Arabic, maths etc. was too complicated for the website. It seems the header program handles UTF-8 since, for the header program, it is just numbers in bytes. The complication with Unicode characters comes when you get to character counting and displaying fonts, I suppose.


My editor gives the byte codes for the € symbol as
E2 82 AC

Code:
00000000: 00 00 00 00 00 00 00 00 - 40 50 47 40 0A 31 2E 30 |        @PG@ 1.0|
00000010: 00 00 00 00 00 00 00 00 - 0A 33 0A 32 0A 73 75 69 |         3 2 sui|
00000020: 00 00 00 00 00 00 00 00 - 63 69 64 65 0A 63 68 65 |        cide che|
00000030: 00 00 00 00 00 00 00 00 - 73 73 39 36 30 0A 54 68 |        ss960 Th|
00000040: 00 00 00 00 00 00 00 00 - 69 73 20 69 73 20 61 20 |        is is a |
00000050: 00 00 00 00 00 00 00 00 - 6D 79 20 E2 82 AC 20 E8 |        my      |
00000060: 00 00 00 00 00 00 00 00 - BF B7 EB 99 97 F0 9D 94 |                |
00000070: 00 00 00 00 00 00 00 00 - 89 EF B7 B5 EF B8 97 EF |                |
00000080: 00 00 00 00 00 00 00 00 - AD BA ED 9C 98 EC A3 BC |                |
00000090: 00 00 00 00 00 00 00 00 - EA 99 AC EA 97 A4 EA 97 |                |
000000a0: 00 00 00 00 00 00 00 00 - B6 EA 97 BA EA 97 BC EA |                |
000000b0: 00 00 00 00 00 00 00 00 - 94 9A EA 94 99 EA 91 84 |                |
000000c0: 00 00 00 00 00 00 00 00 - EA 92 93 EA 92 96 EA 91 |                |
000000d0: 00 00 00 00 00 00 00 00 - 98 EA 8D 94 20 63 6F 6D |             com|
000000e0: 00 00 00 00 00 00 00 00 - 6D 65 6E 74 21 0A 00 00 |        ment!   |
000000f0: 00 00 96 8B 7F CB 18 68 - 0E 39 00 05 00 00 00 00 |       h 9      |
00000100: 00 00 DA 48 99 75 03 D0 - 0F 74 00 75 00 00 00 00 |   H u   t u    |
00000110: 00 00 DA 48 99 75 03 D0 - 0E AC 00 0E 00 00 00 00 |   H u          |
00000120: 00 01 A6 F6 D7 E6 3F 5F - 02 93 00 06 00 00 00 00 |      ?_        |
00000130: 00 01 BA 75 2B FB 0B 80 - 02 10 00 07 00 00 00 00 |   u+           |
00000140: 00 01 BF 34 2C BC 43 E4 - 0F AD 00 0D 00 00 00 00 |   4, C         |
00000150;
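Reading the dump: every 16-byte row whose first 8 bytes are zero carries 8 bytes of header text in its second half, and the first row with a non-zero key (offset f0) is an ordinary book entry. A minimal Python sketch for pulling the header text back out follows. The layout is an assumption read off this dump, not checked against the specification, and the function name is hypothetical.

```python
import struct

RECORD_SIZE = 16  # one Polyglot entry: 8-byte key followed by 8 bytes of data


def extract_logical_header(book: bytes) -> bytes:
    """Join the data halves of the leading records whose key is zero.

    Assumption taken from the dump above: the header occupies the
    zero-key records at the very start of the file.
    """
    payload = bytearray()
    for offset in range(0, len(book) - RECORD_SIZE + 1, RECORD_SIZE):
        key, data = struct.unpack_from(">Q8s", book, offset)
        if key != 0:
            break  # first ordinary book entry: the header has ended
        payload += data
    return bytes(payload).rstrip(b"\x00")  # drop the NUL padding


# Rows 00 and 10 of the dump, followed by the first real entry (row f0).
sample = bytes.fromhex(
    "0000000000000000 405047400A312E30"
    "0000000000000000 0A330A320A737569"
    "0000968B7FCB1868 0E39000500000000"
)
print(extract_logical_header(sample))  # prints: b'@PG@\n1.0\n3\n2\nsui'
```

The result is a plain byte string; decoding it with UTF-8 afterwards gives the text of the header, including the comment.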
Last edited by crystalclear on 29 Sep 2012, 16:30, edited 2 times in total.
crystalclear
 
Posts: 91
Joined: 22 Sep 2011, 14:19

Re: Header format for polyglot books.

Postby crystalclear » 28 Sep 2012, 03:02

EDIT: In the text below, I should perhaps have typed fischerandom instead of chess960. I was playing with the header utility, and not following the specification! Here's some of what I'd written in the posting that the site wouldn't take due to the unicode characters. ...........


I am on Windows 7.
I ran this ....

Code:
exec C:\\Users\\crystalclear\\chess\\polyglotHeader\\pgheader-1.0.exe -d test.bin > temp.txt
exec C:\\Users\\crystalclear\\chess\\polyglotHeader\\pgheader-1.0.exe -v suicide,chess960 -c "This is a my € ................. comment!\n" test.bin
exec C:\\Users\\crystalclear\\chess\\polyglotHeader\\pgheader-1.0.exe -s test.bin >> temp.txt


and then printed the file temp.txt to look at its contents.

Code:
Variants supported:
suicide
chess960
Comment:
This is a my €.................... comment!




So it seems that the Windows version of the header utility can put UTF-8 characters in a polyglot header and recover them just fine on my Windows computer.
I tried the same thing earlier using a Windows BAT file and a Windows console. That didn't work. However, I expected that the problem lay (as usual) with the Microsoft software, so I used an alternative shell to launch the polyglot header program and recover its output.

I know the comment looks like garbage, but I wanted to test a fairly random selection of Unicode characters, to see if I could put them in the opening book and get them back out. I think the Polyglot opening books will be fairly transparent to the whole thing and it's only a question of how the bytes are interpreted and displayed. With the tree structure of the header there is no reason that a later version of the specification couldn't have an ASCII-only comment and a Unicode comment for programs that can handle it.

We could compromise at the moment by saying that books should initially use ASCII only; programs may refuse to display comments with non-ASCII characters; and the preferred interpretation of characters above 127 is UTF-8 with whatever byte order Michel happens to have.

For UTF-8 I think byte order needs to be specified, so the fact that Michel and I can display our Unicode characters correctly doesn't necessarily mean that he would display things correctly if I emailed him my opening book. Integers in Polyglot opening books are big-endian I think, and most computers are little-endian. Confusion is avoided by things being well specified and read a byte at a time in the polyglot software, thus making it machine-architecture independent. If we specify UTF-8 I guess we need to specify a byte order.

My editor gives the byte codes for the € symbol as
E2 82 AC


and they are visible in a hexdump of the opening book in that order too. Previous versions of the editor have allowed UTF-8 text files to be saved as little or big-endian, with or without a byte order marker. Now the option is "unicode transformation format" (whatever that means) and you cannot choose the byte order, although the byte order marker is still optional.
Last edited by crystalclear on 28 Sep 2012, 19:39, edited 1 time in total.
crystalclear
 
Posts: 91
Joined: 22 Sep 2011, 14:19

Re: Header format for polyglot books.

Postby Michel » 28 Sep 2012, 08:12

Thanks for the testing on Windows. It is good to know that things work there as well!

About endianness: my information comes from

http://unicode.org/faq/utf_bom.html

It seems that endianness is not an issue for UTF-8 since it is a byte stream
(unlike UTF-16). So the byte order mark (BOM) is purely a comment (and optional).
So for the header one should require that the BOM is _not_
present since it breaks compatibility with 7-bit ASCII.

But the issue is moot since the BOM would be at the beginning
of the header data, at the place where the magic is. So a header
containing a BOM would be flagged as non-compliant anyway.
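Both points can be seen in a few lines of Python. This is a sketch only: the magic value "@PG@" is taken from the hexdump earlier in the thread, and the helper name is made up for the example.

```python
import codecs

MAGIC = b"@PG@"  # first bytes of the logical header, as seen in the hexdump


def starts_with_magic(header: bytes) -> bool:
    """Return True if the header data begins with the magic string."""
    return header.startswith(MAGIC)


euro = "\u20ac"
utf8 = euro.encode("utf-8")          # e2 82 ac, one fixed byte sequence
utf16_le = euro.encode("utf-16-le")  # ac 20
utf16_be = euro.encode("utf-16-be")  # 20 ac, same code units in swapped byte order

with_bom = codecs.BOM_UTF8 + MAGIC + b"\n1.0\n"  # ef bb bf placed before the magic
print(utf8.hex(" "), "|", utf16_le.hex(" "), "|", utf16_be.hex(" "))
print(starts_with_magic(MAGIC + b"\n1.0\n"), starts_with_magic(with_bom))
```

UTF-8 has exactly one byte sequence per character, so there is nothing to swap, while a leading BOM pushes the magic three bytes to the right and the check fails.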
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

Re: Header format for polyglot books.

Postby crystalclear » 28 Sep 2012, 12:42

Ah yes, my mistake!

If I try to save a file in UTF-16 then the "endianness" radio buttons in my editor become active. That seems as good a reason as any to avoid UTF-16.

The problems I had using the Polyglot header utility on Windows seemed to come from the Microsoft Windows shell command line interpreter and/or console.
I don't think other people will have an alternative shell and console available on Windows. I played with the Qt GUI a short while ago and I think I can knock up a GUI version of your header program if I may use your source as the workhorse. I don't have anywhere to host it, but I believe you do (Michel), so I could send back the GUI program source code and executable.

Your thoughts are welcome.
Should I write a GUI?
Can you host it?
crystalclear
 
Posts: 91
Joined: 22 Sep 2011, 14:19

Re: Header format for polyglot books.

Postby Michel » 29 Sep 2012, 08:34

A GUI would indeed be nice. I was thinking of writing one myself using only win32 calls, but I probably won't get around to it as creating win32 dialogs is so cumbersome (unless, I guess, you use some GUI designer).

Qt could indeed be a good option. If you have something please let me know.

I updated the specification to include the recommendation that the header of books destined for distribution should for now only contain printable ASCII.

http://hardy.uhasselt.be/Toga/pgheader- ... note_utf-8
Michel
 
Posts: 513
Joined: 01 Oct 2008, 12:15

