Ticket #8584 (closed defect: fixed)

Opened 8 years ago

Last modified 7 years ago

PDF converting some character and white space differently

Reported by: Ken Busbee <ken.busbee@…> Owned by: jccooper
Priority: high Milestone: Hot Fixes
Component: Collection Printing Version: Live
Severity: major Keywords: ReportABug
Cc: sk8, reedstrm, kef System Area: Content Printing
Primary Skill: TeX/LaTeX
Site URL: http://cnx.org/content/col10621/latest/
Suppress email to reporter: yes

Description (last modified by reedstrm) (diff)

My PDF download of col10621 dated Mar 8th at 7:53am is fine and shows a double dash or other white space as appropriate. My PDF download of col10621 dated today shows the following, as reported by one of my students:

Here is a cut and paste from page 113, pdf page 119.

12.5.6.3 Problem 12c â[U+0080][U+0093] Instructions Write C++ source code for the following pseudocode: If age equal to 24 A A Display a message â[U+0080][U+009C]Youâ[U+0080][U+0099]re the same age as Melinda.â[U+0080][U+009D] Else A A If age equal to 27 A A A A Display a message â[U+0080][U+009C]Youâ[U+0080][U+0099]re the same age as Ruth.â[U+0080][U+009D] A A Else A A A A If age equal to 34 A A A A A A Display a message â[U+0080][U+009C]Youâ[U+0080][U+0099]re the same age as Ben.â[U+0080][U+009D] A A A A Else A A A A A A Display a message â[U+0080][U+009C]Youâ[U+0080][U+0099]re age is un-important.â[U+0080][U+009D] A A A A Endif A A Endif Endif

Attachments

test.xsl (1.1 KB) - added by cbearden 8 years ago.
XSLT test ruling out the simple case of the encoding problem
test_index.xsl (1.1 KB) - added by cbearden 8 years ago.
Stylesheet that, when run with itself as input, retrieves m19049/latest/index.cnxml
test_mxt.xsl (1.2 KB) - added by cbearden 8 years ago.
Stylesheet that, when run with itself as input, retrieves m19049/latest/module_export_template

Change History

Changed 8 years ago by je2

  • status changed from new to assigned
  • severity changed from severe to major
  • cc sk8 added
  • suppressreporteremail unset
  • component changed from Unknown to Collection Printing
  • skills changed from Unspecified to TeX/LaTeX
  • milestone changed from future to Print Future
  • owner changed from je2 to cbearden

See #8587. Seems to be similar problem with smart quotes.

Changed 8 years ago by je2

Changed font to Palatino on wakizashi to see if that made a difference, but saw same garbage characters. Dates indicate possible regression. May be duplicate ticket, but will leave open for now pending further review by SMEs.

Changed 8 years ago by sk8

This is definitely related to #8587, but the source of the problem is unclear. For some reason, the module export template file seems to be correct (i.e. the Unicode stays the same), but after the collection is assembled (col10261, for example), the Unicode is different in the .tmp1 file.

Changed 8 years ago by je2

OK . . . I can't figure out what is happening here.

Viewing page 119 of the PDF:

http://cnx.org/content/col10621/latest/pdf This version does not show any problems with section 12.5 (though this appears to be a cached version, and may change after this posting). Other modules, including 12.4, do have character issues.

http://cnx.org/content/col10621/1.9/pdf This version, which I just refreshed via triggerPrint, does have problems with 12.5 characters.

http://cnx.org/content/col10621/1.8/pdf This version does not contain the module in question.

What I conclude (and correct me if I'm wrong) based on the above is that it's not a problem with the collection having been modified, as it hadn't been modified between the last working version (the cached version at 'latest') and the actual current version ('1.9'), and the 12.5 wasn't even present prior to the last collection publish. That would point to a problem with the module being published . . . but it's unclear why.

Also, I checked the actual characters being used in the module; these are standard Unicode space characters (U+0020), so whatever is being introduced into the PDF isn't coming from the module, it would seem. I suspect that's what Scott addressed in his previous comment, but I wanted to be sure it wasn't some bizarre alternate whitespace element so we could rule out a content issue.

Spoke briefly with Ed; this seems to coincide more or less with the CNXML 0.6 hot fix rollout from March 6, which apparently included several print system changes. Not sure if that's a good lead or just a red herring. Was going to try to do a few tests on code prior to that rollout, but Ross doesn't think we have a test instance with code that old to play with.

Changed 8 years ago by sk8

It appears that what's causing the difference between the module and collection versions is that the module uses the wget command, whereas the collection calls the XSLT document() function on the URI of each included module, and somehow the document function is messing with the Unicode. It may be interpreting the Unicode as Latin-1 instead of UTF8, which would explain the leading characters in the collection version. For now, this ticket will be the primary one to work on, but #8587 could be a separate problem.

Changed 8 years ago by cbearden

There is some kind of encoding problem here caused by a difference between the module and collection PDF generation systems. It isn't merely caused by the XSLT document() function used to get the modules in the collection PDF system--I have ruled that out by retrieving a module with the problem strings using an identity transform. The problem is present at the first XML stage of the PDF transformation, so LaTeX is not implicated. Since the simple case does not manifest the problem, I suspect some combination of a libxslt bug, the set of included and imported stylesheets used to assemble the collection content, and the content itself.

Consider:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40) 
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to rlcompleter2 0.96
for nice experiences hit <tab> multiple times
>>> s1 = '“'
>>> print s1
“
>>> # paste in bad string from collection XML (copied from gvim)
... s1 = '“'
>>> s1
'\xc3\xa2\xc2\x80\xc2\x9c'
>>> # paste in good string from module XML (copied from gvim)
... s2 = '“'
>>> s2
'\xe2\x80\x9c'
>>> # convert s1 to Latin-1 from UTF-8
... s1_fixed = s1.decode('utf-8').encode('latin1')
>>> s1_fixed
'\xe2\x80\x9c'
>>> s1_fixed == s2
True
>>> 

If we take the bad byte sequence in the collection (what should be the opening smart quotes) and convert it from UTF-8 to Latin-1 (ISO 8859-1), we have a byte sequence that is identical to the wanted byte sequence in UTF-8. So it appears to me that libxslt is interpreting the input file as Latin-1, and converting these bytes to their UTF-8 equivalents.
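
The same round trip can be reproduced in both directions. This is a minimal Python 3 sketch (the session above is Python 2) that simulates the suspected fault by misreading UTF-8 bytes as Latin-1; it does not involve libxslt itself, and the variable names are only illustrative.

```python
# Python 3 sketch: simulate the suspected fault, then undo it.
good = "\u201c".encode("utf-8")  # left double quotation mark as UTF-8 bytes

# Suspected fault: the UTF-8 bytes are read as Latin-1, then written out as UTF-8.
bad = good.decode("latin-1").encode("utf-8")

# Repair: decode the doubly encoded bytes as UTF-8, then re-encode as Latin-1.
repaired = bad.decode("utf-8").encode("latin-1")

print(good)              # b'\xe2\x80\x9c'
print(bad)               # b'\xc3\xa2\xc2\x80\xc2\x9c'
print(repaired == good)  # True
```

The `bad` bytes match the string pasted from the collection XML, and `good` matches the string from the module XML.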

I think the way to proceed is to run the first collection PDF in variations, omitting different included/imported stylesheets, to see under what circumstances the problem is manifested.

I'm attaching a simple stylesheet to show that the simple case of retrieving and copying one of the problem modules to the result tree does not manifest the problem. You can run it like this:

xsltproc test.xsl test.xsl

It will function as its own input XML file, and it retrieves the preface module (m19049) and copies it to the result tree.

One other variation would be to try xsl:copy-of instead of xsl:copy.

Changed 8 years ago by cbearden

XSLT test ruling out the simple case of the encoding problem

Changed 8 years ago by cbearden

One further note: when I run the same stylesheet with saxon (SAXON 6.5.5 from Michael Kay, Java version 1.6.0_03) instead of libxslt, the smart quotes are handled correctly. So it is evident that libxslt is part of the problem interaction.

Changed 8 years ago by cbearden

I must emend what I wrote above: my test stylesheet was retrieving 'index.cnxml' and not 'module_export_template', so it did not correctly replicate collection PDF generation, which retrieves 'module_export_template' for each module. And there's the rub: when I modify the test stylesheet to retrieve m19049/latest/module_export_template, the encoding is mangled. So the problem does not depend on our stylesheets after all--it is manifested even in the simplest case.

I recorded the HTTP requests and responses with wireshark in order to get a look at the HTTP headers. Here are the headers for the index.cnxml request:

GET /content/m19049/latest/index.cnxml HTTP/1.0
Host: cnx.org
Accept-Encoding: gzip


HTTP/1.0 200 OK
Server: Zope/(Zope 2.9.8-final, python 2.4.4, linux2) ZServer/1.1 Plone/2.5.4-2
Date: Mon, 23 Mar 2009 16:10:25 GMT
Content-Length: 14389
Content-Type: text/html; charset=iso-8859-15
Accept-Ranges: bytes
X-Cache: MISS from tachi.cnx.rice.edu
X-Cache-Lookup: MISS from tachi.cnx.rice.edu:80
Via: 1.0 tachi.cnx.rice.edu:80 (squid/2.6.STABLE20)
Connection: close

Here are the headers for the module_export_template request:

GET /content/m19049/latest/module_export_template HTTP/1.0
Host: cnx.org
Accept-Encoding: gzip


HTTP/1.0 200 OK
Server: Zope/(Zope 2.9.8-final, python 2.4.4, linux2) ZServer/1.1 Plone/2.5.4-2
Date: Mon, 23 Mar 2009 16:10:26 GMT
Content-Length: 17315
Content-Type: text/xml; charset=iso-8859-15
Set-Cookie: viewed_mods="m19049+1.10+"; Path=/
X-Cache: MISS from tachi.cnx.rice.edu
X-Cache-Lookup: MISS from tachi.cnx.rice.edu:80
Via: 1.0 tachi.cnx.rice.edu:80 (squid/2.6.STABLE20)
Connection: close

In both cases, Zope is incorrectly giving ISO-8859-15 (Latin-9) as the encoding, but it is giving two different MIME types: "text/html" in the case of the request for index.cnxml, and "text/xml" in the case of module_export_template. Possibly libxslt pays attention to the charset in the HTTP header when the type is 'text/xml' but ignores it when the type is 'text/html'. When I park the untransformed module_export_template for m19049 in my Owlnet web space and run the same transformation against it, I get the correct results, and the http headers have no charset property in the Content-Type header. So, I suspect the incorrect charset spec in the Zope Content-Type header. If I'm right, Cameron is probably the right person to evaluate a possible modification to the Zope config to work around this bug.

To summarize: it looks to me as if libxslt is retrieving module_export_template encoded in UTF-8, but it thinks it is in Latin-9, so it applies the Latin-9-to-UTF-8 conversion to the doc, which gives the results we see. If libxslt is paying attention to the charset declaration in the HTTP header and ignoring the fact that the XML lacks a charset declaration and is hence to be taken as being in the default UTF-8 encoding, then this is ultimately a bug in libxslt.
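
The earlier Python session used Latin-1, while the header says Latin-9. For the bytes involved here the two give the same result, since ISO-8859-15 differs from ISO-8859-1 in only eight positions and none of them occur in the UTF-8 encoding of the smart quotes. A rough Python 3 sketch of that check, using one of the problem strings as sample input (it simulates the misreading and does not call libxslt):

```python
# UTF-8 bytes for one of the affected strings (curly quotes and apostrophe).
utf8_bytes = "\u201cYou\u2019re the same age as Melinda.\u201d".encode("utf-8")

# Misread the same bytes under each single-byte encoding.
as_latin9 = utf8_bytes.decode("iso-8859-15")
as_latin1 = utf8_bytes.decode("iso-8859-1")

print(as_latin9 == as_latin1)         # True for this input
print(as_latin9.encode("utf-8")[:6])  # b'\xc3\xa2\xc2\x80\xc2\x9c'

# One of the positions where the two encodings do differ: 0xA4 is the euro sign in Latin-9.
print(b"\xa4".decode("iso-8859-15") == "\u20ac")  # True
```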

Changed 8 years ago by cbearden

Stylesheet that, when run with itself as input, retrieves m19049/latest/index.cnxml

Changed 8 years ago by cbearden

Stylesheet that, when run with itself as input, retrieves m19049/latest/module_export_template

Changed 8 years ago by reedstrm

  • owner changed from cbearden to reedstrm
  • status changed from assigned to working
  • description modified (diff)

Changed 8 years ago by cbearden

See Bug #576485 for the libxslt problem.

Changed 8 years ago by cbearden

Update on libxslt: I was incorrect in my beliefs about how libxslt should respond to charset declarations sent in the Content-Type header--see Björn Höhrmann's response on the xml@gnome list, and Daniel Veillard's emphatic comment on my bug report. So my bug report was bogus.

For the relevant standards, see F.2 Priorities in the Presence of External Encoding Information and RFC 3023 XML Media Types, sections 3.1 and 3.2.

We were working on fixing the charset declaration in our Zope instances anyway, so we will continue with that effort. Evidently, Saxon ignores the XML rec and the RFC in this regard.

Changed 8 years ago by cbearden

index.cnxml files seem to be served up as 'text/html' in all cases; when 'source' or 'module_export_template' of a module is retrieved, it seems to get the MIME type it has in the postgresql database (e.g. m1048 gets 'application/xml' and m19049 gets 'text/xml').

'text/xml' appears to be deprecated, and in my view we should avoid it like the plague: the RFC states that, if no charset parameter is present, receiving XML processors are required to use 'us-ascii' as the encoding: they must ignore the encoding declaration if any in the XML doc. We should serve up all well-formed non-XHTML XML as 'application/xml'. That includes at least 'index.cnxml', 'source', 'module_export_template', and the RDF/XML expression of collection structures.
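
To make the precedence concrete, here is a simplified Python 3 sketch of how I read RFC 3023 sections 3.1 and 3.2 for these two media types only. The function name `transport_encoding` is made up for illustration, and the sketch ignores the other XML media types the RFC covers.

```python
from email.message import Message


def transport_encoding(content_type_header):
    """Encoding implied by the Content-Type header alone, per RFC 3023.

    Returns None when the header leaves the decision to the XML document
    itself (byte order mark, encoding declaration, or the UTF-8 default).
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    mime_type = msg.get_content_type()
    charset = msg.get_content_charset()
    if charset is not None:
        return charset      # an explicit charset parameter is authoritative
    if mime_type == "text/xml":
        return "us-ascii"   # section 3.1: default when charset is omitted
    return None             # e.g. application/xml without charset (section 3.2)


print(transport_encoding("text/xml; charset=iso-8859-15"))  # iso-8859-15
print(transport_encoding("text/xml"))                       # us-ascii
print(transport_encoding("application/xml"))                # None
```

The first call corresponds to what Zope sent for module_export_template; the last is the case where the processor falls back to the document's own encoding.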

My preference would be to omit the charset parameter from the Content-Type header, but others might not agree, and in any case it may not be easy to do in Zope.

Changed 8 years ago by ew2

  • owner changed from reedstrm to jccooper
  • status changed from working to assigned
  • milestone changed from Print Future to Hot Fixes

Changed 8 years ago by ew2

  • priority changed from unprioritized to high

Changed 8 years ago by jccooper

  • cc cbearden, reedstrm added
  • status changed from assigned to working

My essay about mimetypes and math in CNXPloneSite/skins/cnx_overrides/cnxml_transform.py::

# About mimetypes and browsers:
# Pretty much all browsers say they accept */*.
#   Safari only says */*
#   Mozilla explicitly includes application/xhtml+xml whereas IE doesn't
# Mozilla's perfectly happy with application/xhtml+xml
# IE will not treat an application/xhtml+xml mimetype as XHTML, though if
#   MathPlayer recognizes it, it will
# MathPlayer recognizes only the following strings... exactly!
#   http://www.dessci.com/en/products/mathplayer/author/creatingsites.htm
#   'application/xhtml+xml' 'text/xml' 'text/xml; charset=utf-8' 'text/xml; charset=iso-8859-1'
#   ...but everything but the first is deprecated
# IE doesn't pay attention to our several in-page hints to encoding, so we need
# to say it in the content-type header, but Mathplayer doesn't accept
# 'application/xhtml+xml; charset=utf-8'. It does take 'text/xml' with charset,
# so we use that, even though it's deprecated.
# If we don't put a charset in the Content-Type, IE will make it "Western European",
# and utf-8 characters will look like some sort of accented/gibberish letter.
# So, Mathplayer has to get text/xml so we can append a charset.

Changed 8 years ago by jccooper

I'll note that cnxml_transform isn't really up-to-date with CNXML 0.6; it still looks for math in the doc's namespaces, which 0.6 won't have. So I'm kind of surprised it works. Maybe MathPlayer is better now?

Changed 8 years ago by jccooper

We have two header sets; one in the module_export_template, and one in ModuleView.getFile(). That one happens later, and so wins. It comes from ModuleFile.content_type(), which is a result from the database. This is becoming Ross's area again.

Changed 8 years ago by jccooper

Yes, we are storing type as 'text/xml' in this case, and over-riding the MET setting, since the getFile for index.cnxml happens inside the MET.

ZPublisher then wants to apply its default encoding, which is Latin-9 for some reason.

This is to do with the recent MIME types storage fix.

We should probably have getFile not set a header if it already exists, and may also want to store encoding with the type.

Changed 8 years ago by jccooper

(In [27277]) respect already-set response headers, since getFile is not always called by itself. this was over-riding the setting in module_export_template, which was causing print problems. see #8584
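
The idea behind that change, as a rough Python 3 sketch rather than the committed code: `FakeResponse` and `set_content_type_if_unset` are hypothetical stand-ins, and the header value attributed to the template is an assumption for illustration.

```python
class FakeResponse:
    """Hypothetical stand-in for the publisher's response object."""

    def __init__(self):
        self._headers = {}

    def getHeader(self, name):
        # Header names are case-insensitive, so store them lowercased.
        return self._headers.get(name.lower())

    def setHeader(self, name, value):
        self._headers[name.lower()] = value


def set_content_type_if_unset(response, stored_type):
    """Apply the stored type only when nothing upstream has set Content-Type."""
    if response.getHeader("Content-Type") is None:
        response.setHeader("Content-Type", stored_type)


response = FakeResponse()
# Set first by the template (assumed value, not taken from the real template).
response.setHeader("Content-Type", "application/xml; charset=utf-8")
# The later getFile-style call no longer overrides it.
set_content_type_if_unset(response, "text/xml")
print(response.getHeader("Content-Type"))  # application/xml; charset=utf-8
```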

Changed 8 years ago by jccooper

Okay, change on trunk that makes the content type what we want it to be. Testing printing to make sure the ultimate problem is solved.

Changed 8 years ago by jccooper

  • status changed from working to testing

col10621 print tested on my instance, and it is good.

Changed 8 years ago by jenn

  • cc kef added; cbearden removed

Testing strategy:

  • flicker-test all standard test modules online and in print
  • confirm the fix in the PDFs of five or ten of the modules identified by Chuck as being strongly affected by this problem; also, flicker-test them in online view
  • thoroughly check the PDFs of two or three collections containing the high-risk modules above; don't use col10621 or any of the others you've already used in characterizing the bug
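
For the PDF checks, a small script could flag the telltale sequences in text copied or extracted from a PDF. This is only a sketch in Python 3: how the text is extracted is left out, `find_mojibake` is an illustrative name, and the pattern only catches three-byte UTF-8 sequences beginning with 0xE2 (which covers the dashes and smart quotes reported here).

```python
import re

# U+00E2 followed by two characters in U+0080..U+00BF is what a three-byte
# UTF-8 sequence starting with 0xE2 looks like after being misread as Latin-1.
MOJIBAKE = re.compile("\u00e2[\u0080-\u00bf]{2}")


def find_mojibake(text):
    """Return (offset, matched_text) pairs for suspected double-encoded punctuation."""
    return [(m.start(), m.group()) for m in MOJIBAKE.finditer(text)]


bad_sample = "Display a message \u00e2\u0080\u009cYou\u00e2\u0080\u0099re the same age as Ben."
good_sample = "Display a message \u201cYou\u2019re the same age as Ben.\u201d"

print(len(find_mojibake(bad_sample)))  # 2
print(find_mojibake(good_sample))      # []
```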

How does that sound? Chuck, if I do the first bullet, can you coordinate handling the other two with your Printing Squad? Making sure Kathi and Ed are copied so they can throw extra people at the problem if they want to.

Changed 8 years ago by kef

This testing strategy sounds good to me.

Changed 8 years ago by je2

Fix has been rolled out, and tests of col10676 (from #8587) and col10621 both suggest that the problem has been resolved.

Changed 8 years ago by je2

  • suppressreporteremail set

Hi Ken -

I know you've been receiving the Trac notices over the past two weeks, but I wanted to follow up with you directly to let you know that we appear to have isolated and corrected the problem with your collections. The underlying issue had to do with the way the index.cnxml document types were being stored and reported internally by the server, rather than a problem with the PDF printing system itself as we originally thought. We have rolled out the fix for this issue, and tests of known problem areas in both collection PDFs you reported suggest that the issue has been resolved.

Please let me know if you notice any additional problems with these or any other collections, and we will be more than happy to reopen the ticket as necessary, though we seem to have this issue resolved.

My apologies for the inconvenience and frustration this has caused. As always, please let me know if you have any questions or concerns and I will be happy to assist in any way I can.

Regards,

Jonathan

Changed 8 years ago by je2

(Email above sent to reporter directly.)

Changed 8 years ago by jenn

  • status changed from testing to closed
  • resolution set to fixed

Was waiting on PDF regeneration to close the ticket, but the collection revved at 9:45 this morning, so the new PDF should be good.
