Wednesday, September 7, 2011

Safe ANSI Encoding (Better than Base64)

To take advantage of the Data URI feature for embedding images inline, almost everyone uses Base64 encoding.

Base64 encodes an array of bytes (8 bits each) into its radix-64 representation (hence the "64" in the name). It transforms every 3 bytes (24 bits) of data into a 4-character encoded string, with 6 bits of the original data in each encoded character. Each 6-bit value (000000b-111111b = 0-63) is an index into a table of 64 characters that are considered safe for textual transfer (e.g., embedding binary data within an HTML page).
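To make the 3-bytes-to-4-characters step concrete, here is a minimal JavaScript sketch. The helper name encodeTriple is made up for illustration, and padding of inputs shorter than three bytes is left out on purpose.

```javascript
// The standard Base64 alphabet: index 0-63 maps to one text-safe character.
const B64_TABLE =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode exactly three bytes (24 bits) as four Base64 characters (4 x 6 bits).
// encodeTriple is a hypothetical helper; '=' padding is deliberately omitted.
function encodeTriple(b0, b1, b2) {
  const bits = (b0 << 16) | (b1 << 8) | b2; // pack into one 24-bit integer
  return (
    B64_TABLE[(bits >> 18) & 63] +
    B64_TABLE[(bits >> 12) & 63] +
    B64_TABLE[(bits >> 6) & 63] +
    B64_TABLE[bits & 63]
  );
}

console.log(encodeTriple(0x4d, 0x61, 0x6e)); // the bytes of "Man" give "TWFu"
```

Every group of three input bytes costs four output characters, which is where the 4/3 size factor comes from.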

The disadvantage of this scheme is that it inflates the host page: a Base64-encoded string is 4/3 the size of the original data stream (33.3% larger). However, if the page is sent gzipped, the inflation is greatly reduced (in many cases the result is even smaller than the original data). In conclusion, Base64 embedding is a very good solution if you can serve the page gzipped (e.g., using mod_gzip in Apache or PHP's compression libraries).
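If you want to check this on your own data, a rough Node.js sketch like the one below compares the raw size, the Base64 size, and the gzipped Base64 size. The names sampleBytes and base64Sizes are made up for illustration, and the pseudo-random bytes are only a stand-in for image data (an assumption; real images will give different numbers), so swap in the contents of an actual file.

```javascript
const zlib = require("zlib"); // Node.js built-in compression module

// Build deterministic pseudo-random bytes as a stand-in for image data
// (an assumption: real images will compress differently).
function sampleBytes(n) {
  const out = Buffer.alloc(n);
  let seed = 12345;
  for (let i = 0; i < n; i++) {
    seed = (Math.imul(seed, 1664525) + 1013904223) >>> 0; // simple LCG step
    out[i] = seed >>> 24; // keep only the high byte
  }
  return out;
}

// Report the raw size, the Base64 size, and the gzipped Base64 size in bytes.
function base64Sizes(bytes) {
  const b64 = Buffer.from(bytes.toString("base64"), "latin1");
  return {
    raw: bytes.length,
    base64: b64.length,
    base64Gzipped: zlib.gzipSync(b64).length,
  };
}

console.log(base64Sizes(sampleBytes(30000)));
```

The base64 figure is always 4 characters per 3 input bytes; how much gzip wins back depends on the data.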

If for any reason you can't gzip your page, you can consider an alternative way to embed binary data. One of the solutions packed in the LF JavaScript Library is SAE (Safe ANSI Encoding). Every byte is first XORed with 128; binary data contains a lot of low ASCII bytes, especially the null byte (ASCII 0), so this XOR reduces size inflation by up to 5 percent on average. If the result falls in the control character range (ASCII 0 to 31), it is written as a leading escape character (the copyright character, ASCII 169) followed by the value modified to (x+64); everything else is embedded as is. Every SAE data string is prefixed with the two characters $Æ as an encoded-data indicator.

The result of SAE encoding is an ANSI string that is safe for textual transfer (with charset iso-8859-1, to allow the full byte range minus the first 32 control characters), with a size inflation of about 9-11% on average for common images (JPEG/GIF/PNG). That is about one third of the size overhead of its Base64 counterpart. If the page is gzipped, the average size inflation is only about 1-3% of the original size.
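To see where the overhead comes from, here is a small JavaScript sketch of the encoding rules. It is not part of the LF library: the names saeEncodeBytes and saeOverhead are made up, and the escapes for '<' and for the escape character itself are taken from what the decoder shown later in this post accepts.

```javascript
const ESC = 169; // the copyright sign in ISO-8859-1, used as escape character

// Encode an array of byte values into an SAE string (one char per code 0-255).
// saeEncodeBytes is a hypothetical name used only in this sketch.
function saeEncodeBytes(bytes) {
  let out = "$\u00C6"; // the "$Æ" prefix
  for (const b of bytes) {
    const a = b ^ 128; // flip the high bit first
    if (a === ESC) out += "\u00A9\u00A9"; // the escape character itself
    else if (a === 60) out += "\u00A9{"; // '<' would start a tag
    else if (a < 32) out += "\u00A9" + String.fromCharCode(a | 64); // control
    else out += String.fromCharCode(a); // everything else goes through as is
  }
  return out;
}

// Relative size increase of the encoded payload, prefix excluded.
function saeOverhead(bytes) {
  return (saeEncodeBytes(bytes).length - 2) / bytes.length - 1;
}

// One of each byte value: 34 of the 256 values need an escape character.
const allBytes = Array.from({ length: 256 }, (_, i) => i);
console.log(saeEncodeBytes(allBytes).length - 2); // 290
```

Only the escaped values cost an extra character, so the real overhead depends on how often the input bytes land on one of those 34 values after the XOR.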

Below is a PHP function that encodes an SAE data string:

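function saeEncode($s){
  $e=chr(169);                        // escape character (copyright sign in ISO-8859-1)
  $r='$'.chr(198);                    // "$Æ" prefix marks the string as SAE-encoded
  for($i=0,$n=strlen($s);$i<$n;$i++){
    $a=ord($s[$i])^128;               // flip the high bit first
    if($a==169)$r.=$e.$e;             // the escape character itself is doubled
    elseif($a==60)$r.=$e.'{';         // '<' would start a tag
    elseif($a<32)$r.=$e.chr($a|64);   // control characters become escape + (x+64)
    else $r.=chr($a);                 // everything else is embedded as is
  }
  return $r;
}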

Implementation example:
$f=file_get_contents('myImg.jpg');
$encoded=saeEncode($f);
header('Content-type: text/plain; charset=iso-8859-1');
echo $encoded;

As you can see from the code above, the encoder takes special care with the '<' character, since the browser's parser interprets it as the beginning of a tag. This is necessary because an SAE-encoded string is intended to be embedded within element text (so JavaScript code can easily retrieve the data using the element.innerText property before performing SAE decoding).

This is the portion of the lfw.js code that performs SAE decoding:
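String.prototype.sae=function(){
  //s.$(), s.o() and chr() are lfw.js shorthands (apparently substring, charCodeAt and String.fromCharCode)
  var s=this,r='',l,i=-1,x=-1,c,a;
  if(!s)return '';
  if(s.$(0,2)=='$Æ'){ //decode
    s=s.$(2);a=[];l=s.length-1; //l is the index of the last character
    for(i=0;i<=l;i++){
      c=s.o(i);
      if((c==169)&&(i<l)){ //escape character: the next character holds the value
        i++;c=s.o(i);
        if(c==123)c=60; //'{' stands for '<'
        else if(c!==169)c=c&31; //control character; a doubled escape stays 169
      }
      r+=chr(c^128); //undo the XOR
    }
  }
  //else encode (see the script for complete reference)
  return r;
}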
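function SAE(s){
  //_(), c(), L() and hx() are lfw.js helpers (apparently element lookup by id, "contains", length and byte-to-hex)
  s=_(s).innerText.sae(); //read the element text and decode it
  var r='',i=0,e,h=s.$(0,32);
  //guess the image type from a signature in the first 32 bytes
  e=h.c('JFIF')?'jpeg':h.c('GIF')?'gif':h.c('PNG')?'png':'unknown';
  for(;i<s.L();i++)r+='%'+hx(s.o(i)); //percent-encode every byte
  return 'data:image/'+e+';charset=oem,'+r;
}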
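
If you don't use lfw.js, the same two steps can be sketched in plain JavaScript. This is only an illustration under a few assumptions: the names saeDecodeToBytes and toDataUri are made up, the input string is expected to hold one character per byte (codes 0-255), and the charset parameter of the original URI is left out.

```javascript
// Decode an SAE string (one character per byte, codes 0-255) into bytes.
// saeDecodeToBytes and toDataUri are hypothetical names, not part of lfw.js.
function saeDecodeToBytes(s) {
  if (s.slice(0, 2) !== "$\u00C6") return null; // no "$Æ" prefix: not SAE data
  const bytes = [];
  for (let i = 2; i < s.length; i++) {
    let c = s.charCodeAt(i);
    if (c === 169 && i + 1 < s.length) { // escape character: read next char
      c = s.charCodeAt(++i);
      if (c === 123) c = 60; // escape + '{' stands for '<'
      else if (c !== 169) c &= 31; // escape + (x+64) stands for control code x
    }
    bytes.push((c ^ 128) & 255); // undo the XOR
  }
  return bytes;
}

// Build a percent-encoded data URI, guessing the type from the first 32 bytes.
function toDataUri(bytes) {
  const head = String.fromCharCode(...bytes.slice(0, 32));
  const type = head.includes("JFIF") ? "jpeg"
    : head.includes("GIF") ? "gif"
    : head.includes("PNG") ? "png"
    : "unknown";
  const body = bytes
    .map((b) => "%" + b.toString(16).padStart(2, "0").toUpperCase())
    .join("");
  return "data:image/" + type + "," + body;
}

// The three bytes of "GIF", SAE-encoded by hand (each byte XOR 128).
console.log(toDataUri(saeDecodeToBytes("$\u00C6\u00C7\u00C9\u00C6")));
// prints data:image/gif,%47%49%46
```

In a browser, the string passed to saeDecodeToBytes would come from the innerText of the element that holds the embedded data.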

Using SAE-embedded data is easy: simply call SAE('element_id') to compose the URI string. For example, to create an image (320x240 px) from data embedded in an element with id myImg (any element that supports the innerText property will do), just type:
__img(SAE('myImg'),320,240);
