Performance
This post addresses Bash performance by demonstrating code to accomplish three things:
- stripping HTML
- decoding POST data
- simple post markup
I think you will find it enlightening (well, I did).
Stripping HTML Tags
Stripping tags is straightforward and uses a single state flag:
# strip tags
st() {
  n=${#data}
  t=0
  r=
  for (( i=0; i<$n; i++ )); do
    if [[ ${data:$i:1} == ">" ]]; then t=0; continue; fi
    if [[ $t == 1 ]]; then continue; fi
    if [[ ${data:$i:1} == "<" ]]; then t=1; continue; fi
    r+=${data:$i:1}
  done
  data=$r
}
That, though, only demonstrates that expanding a string one character at a time is slow. Sometimes case is used for parsing data, so let's give it a try:
# strip tags version 2
st() {
  n=${#data}
  t=0; i=0
  r=
  while [ $i -lt $n ]; do
    case ${data:$i:1} in
      \>) t=0;;
      \<) t=1;;
      *) if [[ $t -eq 0 ]]; then r+=${data:$i:1}; fi;;
    esac
    (( i++ ))
  done
  data=$r
}
That is faster, but it is still not fast enough (and I dislike misapplying a construct like that). There is something that is fast enough:
# strip tags version 3
st() {
  t=0
  r=
  # IFS= stops read from stripping a space down to an empty string
  while IFS= read -r -n 1 c; do
    if [[ $c == ">" ]]; then t=0; continue; fi
    if [[ $t == 1 ]]; then continue; fi
    if [[ $c == "<" ]]; then t=1; continue; fi
    # read returns an empty string when it hits a newline
    if [[ -z $c ]]; then c=$'\n'; fi
    r+=$c
  done <<< "$data"
  data=$r
}
Note that these and the other examples here use a global for the data; very easy to manage with code as small and as structured as WordBash.
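To compare the versions on your own machine, something like the following could be used. It is only a sketch: make_html and the paragraph counts are made up for illustration, the function shown is version 3 with its variables made local and IFS= added so that spaces survive, and the numbers will vary with the Bash version and the machine.

```shell
#!/usr/bin/env bash
# Hypothetical benchmark helper; make_html and the sizes are illustrative only.

# strip tags, version 3 (with IFS= so spaces are not lost)
st() {
  local c t=0 r=
  while IFS= read -r -n 1 c; do
    if [[ $c == ">" ]]; then t=0; continue; fi
    if [[ $t == 1 ]]; then continue; fi
    if [[ $c == "<" ]]; then t=1; continue; fi
    if [[ -z $c ]]; then c=$'\n'; fi
    r+=$c
  done <<< "$data"
  data=$r
}

# build $1 paragraphs of synthetic markup
make_html() {
  local i out=
  for (( i = 0; i < $1; i++ )); do
    out+="<p>line $i of <em>sample</em> text</p>"$'\n'
  done
  printf '%s' "$out"
}

# time st on two input sizes; the time keyword reports on stderr
TIMEFORMAT='%R s real, %U s user'
for n in 100 400; do
  data=$(make_html "$n")
  echo "paragraphs: $n, characters: ${#data}"
  time st
done
```

Swapping in one of the other st versions and rerunning shows how each one scales as the input grows.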
Converting POST Data
POST data is encoded in a simple manner, and decoding using Bash could be simple string replacements:
POST_STRING=${POST_STRING//+/ }
POST_STRING=${POST_STRING//%0D/}
POST_STRING=${POST_STRING//%/\\x}
Culminating with something like this to expand the \x escapes (which would actually be done on the data after splitting it on &):
POST_STRING=$(echo -e "$POST_STRING")
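Here is a minimal sketch of what "done on the data split by &" might look like. The names urldecode, parse_post and POST are hypothetical (they are not WordBash functions), the associative array assumes Bash 4 or later, and printf '%b' is swapped in for echo -e to expand the \x escapes.

```shell
#!/usr/bin/env bash
# Sketch only: urldecode, parse_post and the POST array are hypothetical names.
# Requires Bash 4+ for the associative array.

declare -A POST

# decode one URL-encoded field with the three replacements from the text
urldecode() {
  local s
  s=${1//+/ }
  s=${s//%0D/}
  s=${s//%/\\x}
  printf '%b' "$s"
}

# split "a=1&b=2" on & and =, decoding each name and value
parse_post() {
  local pair name value pairs
  IFS='&' read -r -a pairs <<< "$1"
  for pair in "${pairs[@]}"; do
    [[ $pair == *=* ]] || continue      # skip fields without a value
    name=$(urldecode "${pair%%=*}")
    value=$(urldecode "${pair#*=}")
    if [[ -n $name ]]; then
      POST[$name]=$value
    fi
  done
}

parse_post 'name=John+Doe&msg=a%26b%3Dc'
printf '%s\n' "${POST[name]}" "${POST[msg]}"
```

Splitting before decoding matters because an encoded & or = (%26, %3D) inside a value must not be mistaken for a separator. Note that command substitution drops trailing newlines from each decoded value.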
But even with a fairly small amount of data, those string replacements can take a long time, and the time climbs steeply as the data grows. A much faster Bash way, with time that grows roughly linearly, is:
p=
while read -r -n 1 t; do
  if [[ $t == '%' ]]; then
    read -r -n 2 t
    if [[ $t == '0D' ]]; then
      continue
    fi
    t="\\x$t"
  elif [[ $t == '+' ]]; then
    t=" "
  fi
  p+=$t
done <<< "$POST_STRING"
POST_STRING=$p
But that is still unacceptably slow with large data (because the data has no newlines). So Bash needs external help.
This is more than acceptably fast even for very large data:
POST_STRING=`echo "$POST_STRING" | sed -e '
s/+/ /g
s/%0D//g
s/%/\\\\x/g
'`
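As a sketch, the sed step and the final escape expansion could be wrapped in one function. The name decode_post is hypothetical, and printf stands in for echo here. The one thing to watch is the backslash count: $(...) does not unescape backslashes the way backticks do, so the script needs \\x rather than \\\\x.

```shell
#!/usr/bin/env bash
# Sketch only: decode_post is a hypothetical wrapper name.
decode_post() {
  local s
  # same three substitutions as above; \\x in single quotes reaches sed
  # unchanged, and sed turns it into a literal \x
  s=$(printf '%s\n' "$1" | sed -e 's/+/ /g' -e 's/%0D//g' -e 's/%/\\x/g')
  # expand the \xHH escapes into characters
  printf '%b' "$s"
}

decode_post 'a+b%21'; echo
```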
The external program (which could be Perl or any of the many others available) need not be called directly; it could be a shell script. The data need not be in-line either; it could be in the shell script or in a file.
Indeed, other shell scripts (as standalone programs of their own) could do much of what WordBash libraries do. Making many small programs work together is what scripting languages are generally used for. (I just chose to do something completely different.)
Post Translation Data
Another string replacement is what I call post translations, allowing for some Admin laziness. There is an array of data:
# post translate data; FROM = TO
dp=(
'–-' '—'
'BASH' '<strong>Bash</strong>'
)
What is odd about it is that it is a simple array that must hold an even number of elements (FROM/TO pairs), so it is applied two elements at a time:
# post translate
pt() {
  for (( i=0; i<${#dp[@]}; i+=2 )); do
    body=${body//${dp[i]}/${dp[i+1]}}
  done
}
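Since the array must stay "even", a small guard can catch a missing TO entry before it shifts every pair that follows. This is only a sketch: the 'teh' entry and the error message are made up for illustration, and the loop variable is made local here.

```shell
#!/usr/bin/env bash
# Sketch only: the 'teh' entry and the error message are illustrative.
dp=(
  'teh'  'the'
  'BASH' '<strong>Bash</strong>'
)

# post translate, refusing to run on an odd-length array
pt() {
  local i
  if (( ${#dp[@]} % 2 )); then
    echo "pt: dp must hold FROM/TO pairs" >&2
    return 1
  fi
  for (( i = 0; i < ${#dp[@]}; i += 2 )); do
    body=${body//${dp[i]}/${dp[i+1]}}
  done
}

body='teh BASH shell'
pt && printf '%s\n' "$body"
```

One thing to be aware of: in Bash 5.2 and later, with the default patsub_replacement option, an unquoted & in the TO text stands for the matched text, which matters if a TO value is an HTML entity.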
This works quite well, except that it slows down quickly as the data grows, to the point of being unacceptably slow. It could instead be:
# post translate SED version
dp='
s/–-/\—/g
s/BASH/<strong>Bash<\/strong>/g
'
body=`echo "$body" | sed -e "$dp"`
That would be noticeably faster.
But Not So Fast
SED is not used here, though. The Bash code is kept as a demonstration.
This is the largest post here (about 17,700 characters), and on my 1.5GHz Pentium M, Windows XP/SP2, Cygwin test computer it displays in about two seconds (user time; about 300 milliseconds of system time). Not particularly fast, but it could be that computer capacity is growing...
But there is one important thing about Bash string replacement: Bash string replacement is not really slow.
Wait Just a Second
The times output at the bottom of the display is there for this reason: Refresh this page a few times, noting the time it takes. Then edit library D (libs/D.sh) to comment out line 137 and uncomment line 138. Refresh this page.
Interestingly, if you look at the function (br in B.sh) you will see that it is... Bash string replacement:
# string replace
br() {
  local l r
  r=
  # IFS= preserves leading and trailing whitespace on each line
  while IFS= read -r l; do
    r+=${l//$1/$2}$'\n'
  done <<< "$3"
  echo "$r"
}
Bash string replacement can be said to be flawed, as its performance plummets past a certain size and becomes unacceptably slow, yet line-by-line string replacement on the same data is fast.
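The difference can be checked without editing any library by timing both approaches on the same synthetic text. This is a sketch under assumptions: the 2,000-line sample is arbitrary, br is the function above with IFS= added, and how large the gap is (if any) depends heavily on the Bash version.

```shell
#!/usr/bin/env bash
# line-by-line replace, as in br above
br() {
  local l r
  r=
  while IFS= read -r l; do
    r+=${l//$1/$2}$'\n'
  done <<< "$3"
  echo "$r"
}

# synthetic body: 2000 short lines (the size is an arbitrary assumption)
body=
for (( i = 0; i < 2000; i++ )); do
  body+="line $i mentions BASH once"$'\n'
done
body=${body%$'\n'}

TIMEFORMAT='%R s'
echo "whole string:"
time { whole=${body//BASH/Bash}; }
echo "line by line:"
time { lines=$(br BASH Bash "$body"); }

# both approaches should produce the same text
[[ $whole == "$lines" ]] && echo "results match"
```

Raising the line count is the interesting experiment: it shows where, on a given Bash, the whole-string replacement starts to fall behind.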
The Final Word
WordBash demonstrates that Bash-only code works well and that a web application like WordPress can be implemented in only a few thousand lines of shell script.
Notes:
1. A drawback to this type of substitution is that it will occur everywhere, even when not wanted, as in these examples. So steps must be taken: I simply used character substitution in the posts, but the data and/or code can be made "smarter" by checking word boundaries, etc.