Saturday, July 24, 2010

url encoding using sed (take 2)

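As far as I can tell, the output matches that of PHP's urlencode() function, at least for ASCII input: the look-up table stops at 0x7F, so anything above that passes through unencoded.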


#!/usr/bin/sed -f

# Brad Forschinger, 2010-07-25

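# escape literal % first, so the % in later %XX output is never re-encoded
s/%/%25/g
:loop
# while the line still holds a character that needs encoding...
/[^A-Za-z0-9%_.-]/ {
# ...append a look-up table: a marker, then each character followed by its hex code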
s/$/%%BNJF\x0000\x0101\x0202\x0303\x0404\x0505\x0606\x0707\x0808\x0909/
s/$/\x0A0A\x0B0B\x0C0C\x0D0D\x0E0E\x0F0F\x1010\x1111\x1212\x1313\x1414/
s/$/\x1515\x1616\x1717\x1818\x1919\x1A1A\x1B1B\x1C1C\x1D1D\x1E1E\x1F1F/
s/$/ 20!21"22#23$24\&26'27(28)29*2A+2B,2C\/2F:3A;3B<3C=3D>3E?3F@40[5B/
s/$/\\5C]5D^5E`60{7B|7C}7D~7E\x7F7F/
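# clear the "substitution succeeded" flag left set by the appends above
t jump;:jump
# look-up: replace the first character found in the table with % and its hex code,
# dropping the table in the same step
s/\(.\)\(.*\)%%BNJF.*\1\([0-9A-F][0-9A-F]\).*/%\3\2/g
# a hit means there may be more to encode
t loop
# a miss (character not in the table): strip the table and stop
s/%%BNJF.*//
}
# urlencode() writes spaces as +
/%20/s/%20/+/g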


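The "t jump" to nowhere resets the flag that records whether an s/// has succeeded. The appends that build the table always succeed, so without the reset the later "t loop" would branch every time. The only substitution that should count is the look-up: if it fails, "t loop" falls through and the table is stripped off.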
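
To see the flag in action, here's a minimal sketch you can paste into a shell. It isn't part of the script above: the labels (reset, end), the variable names and the sample input are all made up for the demo, and it uses one -e per command so that it should also run on non-GNU seds.

```shell
# Without a reset: the append always succeeds and sets the flag, so "t end"
# branches even though the s/zzz/Z/ right before it matched nothing.
without_reset=$(printf 'abc\n' | sed -e 's/$/!/' -e 's/zzz/Z/' \
    -e 't end' -e 's/$/ (fell through)/' -e ':end')

# With a reset: "t reset" jumps to the very next line and clears the flag,
# so "t end" now reflects only the s/zzz/Z/ that follows the reset.
with_reset=$(printf 'abc\n' | sed -e 's/$/!/' -e 't reset' -e ':reset' \
    -e 's/zzz/Z/' -e 't end' -e 's/$/ (fell through)/' -e ':end')

echo "$without_reset"   # abc!
echo "$with_reset"      # abc! (fell through)
```

In the real script the roles are the same: the table appends play the part of s/$/!/, and the look-up plays the part of s/zzz/Z/.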

I wrote this with GNU sed (gsed). It won't work on Solaris, or with any other non-GNU sed, because it relies on the \x escapes. Here's a more portable but less capable version: it can't encode characters below 0x20, so it simply drops any line containing a character outside 0x20-0x7E.


#!/usr/bin/sed -f

# drop line if it has chars outside of 0x20-0x7E
/[^ -~]/d
s/%/%25/g
:loop
/[^A-Za-z0-9%_.-]/ {
s/$/%% 20!21"22#23$24\&26'27(28)29*2A+2B,2C\/2F:3A/
s/$/;3B<3C=3D>3E?3F@40[5B\\5C]5D^5E`60{7B|7C}7D~7E/
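# every character that can reach this point is in the table, so the look-up
# always hits and an unconditional branch is enough
s/\(.\)\(.*\)%%.*\1\([0-9A-F][0-9A-F]\).*/%\3\2/g
b loop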
}
/%20/s/%20/+/g
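
If the look-up line is hard to read, it may help to shrink it. Below is a cut-down sketch of the same trick with a two-entry table (space and & only). encode_mini and mini_out are hypothetical names for this demo, not part of either script, and unlike the real thing it doesn't escape % first.

```shell
# Append "%%" plus a tiny table, then let the back-reference \1 find the
# first character of the line that also appears in the table, followed by
# its two hex digits. The replacement keeps the rest of the line (\2) and
# drops the table, so each pass encodes exactly one character.
encode_mini() {
  sed -e ':loop' \
      -e '/[ &]/{' \
      -e 's/$/%% 20\&26/' \
      -e 's/\(.\)\(.*\)%%.*\1\([0-9A-F][0-9A-F]\).*/%\3\2/' \
      -e 'b loop' \
      -e '}'
}

mini_out=$(printf 'a b&c\n' | encode_mini)
echo "$mini_out"   # a%20b%26c
```

Letters and digits never match by accident, because no table entry starts with one: every third character in the table is a key, and the keys are all punctuation or control characters.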


Enjoy.
