XML::Edifact - an approach towards XML/EDI as a prototype in perl
release 0.45 - UNOC
Michael Koehne, ( kraehe@copyleft.de )
Tue Jun 12 15:40:06 CEST 2001

XML::Edifact is a set of perl scripts for translating EDIFACT into
XML. Version 0.45 improved UNOC handling. Perl 5.6.1 dropped the 'tr'
function to convert between ISO-8859-1 and UTF8, and introduced a new
way. Thanks to Jarkko Hietaniemi for his regexp to produce a version
compatible from Perl 5.6.0 up.
______________________________________________________________________

Table of Contents

1. Introduction

2. Release Notes:
   2.1 Edi2SGML-0.1: About the beauty of plain text
   2.2 XML-Edifact-0.2: It's hard work to cook up a second version.
   2.3 XML-Edifact-0.3x: About normalisation, namespaces and xml2edi
   2.4 XML-Edifact-0.4x: the portability track.

3. Installation

4. Known Bugs
   4.1 Double namespace declarations
   4.2 Stating level in Syntax identifier.
   4.3 Explicit Indication of Nesting
   4.4 XML::Edifact is slow!

5. Roadmap

6. Legal stuff

7. Download
______________________________________________________________________

1. Introduction

Show a programmer the standard draft, and EDIFACT is soon called "the
nightmare of the paperless office". Those 2700 pages of horror-filled
advisory-board English have given many programmers headaches.

EDIFACT is trying the impossible: a single form for the real world.
Orders, invoices, freight papers, etc., always look different when
they come from different companies. EDIFACT tries to fulfill all needs
of commercial messages, regardless of type and origin. Of course the
real world is neither simple nor complete. Nevertheless, it's
important for the top companies and their suppliers - you know, those
who have been in business for years and can pay for a mainframe and a
pack of gurus.

XML/EDI is meant to provide a simpler (KISS) format that can be
translated to and from EDI, to allow smaller companies to avoid
slashing down forests and retyping into a computer keyboard stupid
lines printed by other computers.

This is NOT XML/EDI, and it's certainly not KISS. The edifact03.dtd
reflects the original words of the EDIFACT standard as closely as
possible, on a segment, composite and element level.

This DTD simplifies EDI inasmuch as it doesn't distinguish between
e.g. INVOICE or PRICAT, but only defines a generic message type called
edifact:message. The benefit is of course that it's possible to
convert any EDI message into edifact. The drawback is that the DTD is
really relaxed. Validation of EDIFACT message design can therefore not
be done by a validating XML parser.

Message designers will still need knowledge about EDIFACT message
design and EDIFACT tools. But once the message is designed, it's
simpler to read it with XML.

2. Release Notes:

2.1. Edi2SGML-0.1: About the beauty of plain text

Standards should be based on standards. EDIFACT is based on ASCII and
documentation is available from WWW.Premenos.Com as plain text. Well,
the original contains some PCDOS characters. I took the liberty of
replacing them with ASCII in this distribution to improve readability.

I'm not talking about human readability here. A friend at SAP joked
that plain paper is the only platform-independent format in that case.
But I dislike retyping them. And plain text is more flexible, as I'm a
programmer.

Unlike the 0.1 distribution, later distributions will only contain
those documents that the scripts need to parse. Download the 0.1 for a
complete set, or surf at Premenos.

Note: Premenos was the old URL - better start surfing now at
www.unece.org

2.2. XML-Edifact-0.2: It's hard work to cook up a second version.

As usual, second versions claim to be better documented and tested,
but the truth is that they contain more features.

So let's talk about features:

First of all: It looks like a module. "use strict" and the package
concept are useful things. But it'll take a lot of RTFM for me to
understand the perl way of doing it. The XML/Edifact.pm doesn't export
anything, and it's not even necessary to "perl Makefile.PL; make
install". The 0.2 version is not intended to be installed; it's a test
case.

So let's talk about the test case: Run ./bin/make_test.sh from here,
and everything should be fine. Still, it will take some RTFM for me to
understand the perl way of regression testing. But the
./bin/make_test.sh is the one this version offers ,-)

I'm now using a tied hash to speed up startup. I've decided to use
SDBM, as this DBM comes with any perl and a small DBM is better in
this case.

I've provided a document type definition. And it's now possible to use
a validating parser like SP from James Clark.

You may also notice the renaming of Edi2SGML to XML::Edifact. This
name change reflects that my script is now producing XML and not SGML,
and the name should point to the place in the CPAN hierarchy where
this package belongs.

2.3. XML-Edifact-0.3x: About normalisation, namespaces and xml2edi

You may notice the major change in the DBM design. While the old DBM
files were modeled closely on the batch directory, this version has
been partly normalised to improve coding. It's also denormalised for
some perlish reasons. Unloading this DBM into a relational database
would be possible with varchars, but the semantics of the 2nd element
in segments and composites could only be expressed with some weird
object relational databases like PostgreSQL.

Also the DTD was changed for namespace reasons. The 0.2 needed to add
the word literal where element names clashed with segment names of the
standard. And it dropped the composite information. Now
trsd:party.name means the segment, while tred:party.name points to the
element.
This allows parsing the XML message to produce an EDI message without
a backtracking parser.

The event-based parser used for xml2edi is quite new, and certainly
contains some bugs. Please dig around in your real-life messages,
translate them with edi2xml, then back with xml2edi, and compare the
original with the double translation. I've aimed for a robust solution
that, I hope, doesn't croak on codes from an unknown namespace.

Version 0.30 and 0.31 used edicooked:message as namespace; versions
0.32 and up will use edifact:message for the main namespace. The
technical reason is quite simple. The namespace prefix of a message
does not mean anything. It's only a shorthand for the URI provided in
the xmlns attribute. So any distinct XML message can claim to be in
the edifact: namespace, if the URI is distinct. So if other projects
start to be implemented, they can claim to be in the edifact:
namespace by the same right.

Version 0.33 first of all solves a bug which showed up with xml2edi
and a TeleOrdering message translated by edi2xml. I just forgot to
encode less than and ampersand if they occurred as translation in a
code list. So NAD+OB+0091987:160:16' will now be translated using
Dun & Bradstreet, which is right.

There are two other major improvements. Version 005.60 contains a
profiler, and finding the hot spots and optimising the SDBM by further
denormalisation improved performance of edi2xml by a factor of 12. I
hope nobody has used the SDBM internals so far.

The last major improvement is that I'm getting familiar with
ExtUtils::MakeMaker, File::Spec and friends. Version 0.33 is the first
that installed - at least on my Linux box :-)

Version 0.34 introduced coding of UN/EDIFACT code list extensions by
XML-Edifact namespace migration.

Version 0.34 fixed a bug concerning the release indicator. As a minor
improvement, the edi2xml and xml2edi scripts now have pod
documentation.

Version 0.35 was a bug fix, thanks to Detlef Lammermann from Dr.
Materna GmbH, who found that ??' was misinterpreted.

2.4. XML-Edifact-0.4x: the portability track.

The intention is to have a version running under as many operating
systems as possible. Bug fixes may still merge into this version, but
new features will be implemented in the 0.50 track.

Version 0.40 started with a minor bugfix ( thanks to Werner F.C.
Bruns ) and questions for a W32 port at a DIN meeting in Frankfurt.
John Cope made the first PPM/PPD that was known to run on W32. But as
I don't have any W32 system, I was unable to test it.

Version 0.41 was the first version known to build and to pass its
regression test under Windows NT, thanks to Arend R. Braun. The only
change was in Makefile.PL.

Version 0.42 requires Perl 5.6, and implements interpretation of the
Stating Level. Now UNOC (Latin1) is translated to UTF8.

Version 0.43 improved in grammar and spelling - thanks to Julian
Olson.

Version 0.44 improved in memory consumption - thanks to Carlos De
Matos, who confronted me with DELJIT messages of megabyte size.

Version 0.45 improved UNOC handling. Perl 5.6.1 dropped the 'tr'
function to convert between ISO-8859-1 and UTF8, and introduced a new
way. Thanks to Jarkko Hietaniemi for his regexp to produce a version
compatible from Perl 5.6.0 up.

3. Installation

I've included my modified documents, so others will be able to rebuild
the DBM files. You may need a Unix-like system because of newline
conventions.

    $ perl Makefile.PL
    I know I should check for those 99 possible places,
    but I prefer to ask :-)
    URL for public documents [http://www.xml-edifact.org]
    Directory on this system [/tmp/xml-edifact]
    Writing Makefile for XML::Edifact
    $ make perl

perl Makefile.PL will first ask two questions. The reason is that
XML::Edifact wants to install its document type definition on a web
server, so that validating XML parsers can fetch the DTD. Do not
change this setting the first time, as changes cause XML::Edifact to
fail its regression test.
You may change those decisions later by rerunning perl Makefile.PL, or
by editing the XML::Edifact::Config module in your SITE_PERL.

Make will take a while and then you may hope to have a working
database. This database covers the 96b version of the UN/EDIFACT batch
directory and will be installed as XML::Edifact::d96b later.

    $ make test

The regression test will translate any .edi file found in the examples
directory to XML and translate the XML back to EDIFACT. The result
should not change.

    $ make install

This will install the XML::Edifact module, the D96B batch directory,
various files for the URL and two scripts: edi2xml and xml2edi

You can now try your own UN/EDIFACT files. I really want to know what
your EDI messages look like: do they break anything, what about your
code list extensions, ... ?

Testing different real examples should show some bugs I haven't
thought of. Think about the O'Reilly invoice or the Dubbel:Test and
you should get the idea.

I've tried to implement the UNA correctly, but this may need some
additional debugging. Take a look at the difference between the
edi.tst files from Frankfurt and the Springer message. The latter uses
a newline as the 9th character in UNA, so it's nearly human-readable.

One last word - I hope this complex installation will work on most
Unix look-alikes, but I'm quite sure that it'll break on Windows and
Mac. If you have such a system, and have problems during installation,
drop me a mail. You can count on my help, as I need your help to make
the installation portable across different platforms.

4. Known Bugs

4.1. Double namespace declarations

Namespace declaration was redefined in January 1999. XML::Edifact 0.30
produced both the old and the new declarations. XML::Edifact 0.31
dropped the deprecated declarations! If you have an old browser, you
may have to download XML::Edifact 0.30 and edit the current
XML::Edifact. Search for HERE_ and adapt the headers to your browser's
preferences.

4.2.
Stating level in Syntax identifier.

The stating level in EDIFACT speak is called charset encoding in XML
speak, and it's of course important if you think about non US/UK
products. Currently only UNOA, UNOB and UNOC are translated correctly.
Character encodings other than Latin1 are not yet supported.

4.3. Explicit Indication of Nesting

This has not been coded yet, as no example messages are available to
me.

4.4. XML::Edifact is slow!

The 0.50 will be times faster ;-)

5. Roadmap

I'm using even and odd numbering to distinguish between stable and
experimental versions. Well, 0.2 was not as stable as an even number
suggests. And I hope this 0.3x is stable enough, as it's often said
that a third version will be the first useful one.

Both the 0.4x track and the 0.5x track are currently active. The 0.35
was quite stable, and there is a need for portability, while the
version under development is far from being usable.

I had to realise that the roadmap is far too large, so I had to drop
the steps 0.7x to 0.9x. The functionality will become unbundled into
other CPAN modules if necessary.

0.4x
    This version focuses on portability, of the EdiCooked style.
    While Perl ensures portability across the Unixes, MacOS and Win32
    will cause some problems. The 0.4 version will also be the first
    one intended to be installed. As installation also means
    configuration of non Perlish paths, e.g. for webserver,
    mime.types, mailcap, dtds and databases, XML::Config.pm will be
    discussed in the perlxml list.

0.5x
    This is the unstable version track. XML::Edifact now provides
    PerlSAX objects as drivers and handlers to UN/EDIFACT, making
    usage more flexible.

0.6x
    Stabilisation by discussion and consensus about features
    introduced with 0.5.

1.0
    I hope that a consensus has been found in this direction, so the
    DTDs won't change in further releases. Those versions may focus
    on using XML::Edifact in real life applications.
    I can imagine an SQL interface, a Cobol interface, a message
    designer, a DOM/CORBA wrapper, and much more.

    Once I think XML::Edifact is complete, I have to think about
    speed. Perl is a perfect language for prototyping, but profiling
    and using a low level language like C for hot spots will be
    necessary to handle large batches.

6. Legal stuff

Programs provided with this copy called XML-Edifact-0.32.tgz may be
used, distributed and modified under terms of the GNU General Public
License.

Files in the ./examples directory are from various sources and free of
claims as far as I know.

Files in the ./un_edifact_d96b directory are based on EDI batch
directories and are therefore copyrighted by the United Nations. See
un_edifact_d96b/LICENAGR.TXT.

Files that are produced during the bootstrap process and placed in
XML::Edifact::d96b are based on the original UN/EDIFACT standard and
therefore not covered by GPL, but likely copyrighted by the United
Nations. The same applies to the text tables produced during
Bootstrap.PL.

Besides the GPLed Edition, a Custom Edition exists, if you dislike
GPL. Drop me an e-mail and ask for price and conditions.

7. Download

I just got a message from PAUSE that I can upload it to:

    $CPAN/authors/id/K/KR/KRAEHE

XML::Edifact requires XML::Parser, so to download and install, type:

    $ perl -MCPAN -e shell
    cpan> install XML::Parser
    cpan> install XML::Edifact

or fetch it directly by FTP from:

    ftp://ftp.cpan.org/pub/perl/CPAN/modules/by-module/XML/XML-Parser-*.tar.gz
    ftp://ftp.cpan.org/pub/perl/CPAN/modules/by-module/XML/XML-Edifact-*.tar.gz

The canonical source of the XML::Edifact project is now:

    http://www.xml-edifact.org/

This site contains various example files, research papers, a complete
set of UN/EDIFACT batch directories and, most important, current
versions from the unstable track.