Golden Rule for Super Fast Shell Scripts!

Shell scripts are very powerful and can be written in very little time, especially when it comes to processing text files. Most often these scripts are written as a stop-gap solution until a more permanent, efficient C program can replace them. However, once the script is installed in production, people realize it’s working just fine and there is no need to spend more effort writing a C program from scratch. Over time the script gets used more and more (a phenomenon one of my good friends labels “if you build it, they will come”, a reference to the movie “Field of Dreams”; more on this in some other post) and it soon becomes a performance bottleneck in the system.

This is partly because there are so many ways of accomplishing a single task in a shell script that it’s difficult for most developers to figure out which one is the most efficient. In this post I’ll cover a single “Golden Rule” that I have discovered, which helps me write really efficient Korn shell scripts. Here it is:

Never launch a child process in a processing loop!

A processing loop, as referred to here, is a loop that iterates over every record in the input file.

The Golden Rule can be expressed in mathematical notation as:

performance = A/(# of child processes launched per input record)

where ‘A’ is a system-dependent constant.

Note: Even though I have used the Korn shell for all the examples here, the basic principle should hold true for any shell.

What is a Child Process?

The crux of the Golden Rule is to make sure that we do not launch a child process over and over again during a script execution, because creating a child process carries a lot of overhead: the shell has to fork a copy of itself and, for an external utility, load and execute a new program. Here is what causes a child process to be created:

  • Any utility that’s not a shell built-in, like cut, sed, grep etc.
  • Every time you use a pipeline, it causes child processes to be created.
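If you are not sure which category a command falls into, the shell can tell you. The sketch below leans on the type command and assumes its output contains the word "builtin" for built-ins, which is what ksh and bash print as far as I know. The helper name classify_cmd is made up for this illustration.

```shell
# classify_cmd: hypothetical helper that reports whether each named
# command is a shell built-in, based on what `type` says about it.
# Assumption: `type` describes built-ins with the word "builtin".
classify_cmd() {
  for cmd in "$@"; do
    case "$(type "$cmd" 2>/dev/null)" in
      *builtin*) echo "$cmd: built-in" ;;
      *)         echo "$cmd: not a built-in" ;;
    esac
  done
}

# Check a few of the commands mentioned in this post
classify_cmd echo read cut sed grep
```

Anything reported as not a built-in costs a child process every time it runs, so it is a candidate for moving out of the processing loop. The helper itself uses a command substitution, which is fine for a one-off check but is exactly what you would avoid inside the loop.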

It will be easier to demo the Golden Rule than to talk about it. So, let’s dive into some samples.

Examples

Let’s take a very simple example, where you have a CSV input file with a first name, a last name and an email address on each line. The job of the script is to parse the input file, verify that each email address contains an ‘@’ and a ‘.’, and split out any invalid records to an error file.

I’ll show the same script written in three different styles, each with fewer child processes per record and greater efficiency than the last.

Sample 1

#!/bin/ksh

> valid
> invalid

cat "$1" | while read line
do
  ## Three pipelines used here to parse each record
  ## (variables are quoted so whitespace and wildcards in a record survive)
  fname=$(echo "$line" | cut -d, -f1)
  lname=$(echo "$line" | cut -d, -f2)
  email=$(echo "$line" | cut -d, -f3)
  ## Another pipeline to check validity of the email
  if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
    echo "$fname,$lname,$email" >> valid
  else
    echo "$fname,$lname,$email" >> invalid
  fi
done

Sample 2

#!/bin/ksh

> valid
> invalid

cat $1 | while IFS=, read fname lname email
do
  ## Eliminated the need for three pipelines by using IFS with read
  if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
    echo "$fname,$lname,$email" >> valid
  else
    echo "$fname,$lname,$email" >> invalid
  fi
done
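The field splitting that Sample 2 relies on can be seen in isolation with a single record. This is only a sketch: the record is made-up data, and the -r flag (not used in the samples above) is my addition to stop read from treating backslashes specially.

```shell
# A single sample record (made-up data for illustration)
line='John,Smith,john@example.com'

# IFS=, applies only to this one read. The here-document feeds the record
# to read without a pipeline, so the variables are set in the current shell.
IFS=, read -r fname lname email <<EOF
$line
EOF

echo "first=$fname last=$lname email=$email"
# prints: first=John last=Smith email=john@example.com
```

If a record has more than three fields, the last variable (email here) receives everything that is left over, commas included.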

Sample 3

#!/bin/ksh

cat $1 | while IFS=, read fname lname email
do
  ## Eliminated the need for egrep by using KSH's built-in regular expression capability
  if [[ $email = (.)+@(.)+\.(.)+ ]]; then
    echo "$fname,$lname,$email"
  else
    echo "$fname,$lname,$email" >&2
  fi
done > valid 2> invalid
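For shells that lack ksh's extended pattern syntax, the same idea can be sketched with a plain case statement, which is also a built-in construct and launches no child process. This is a variant under my own assumptions, not the script that was benchmarked here: split_records is a hypothetical function name, and the glob ?*@?*.?* stands in for the regular expression (at least one character, an ‘@’, at least one character, a ‘.’, at least one character).

```shell
# split_records: hypothetical helper. Reads CSV records (fname,lname,email)
# on stdin, writes valid records to stdout and invalid ones to stderr.
split_records() {
  while IFS=, read -r fname lname email
  do
    case $email in
      ?*@?*.?*) echo "$fname,$lname,$email" ;;      # looks like an email
      *)        echo "$fname,$lname,$email" >&2 ;;  # anything else
    esac
  done
}

# Usage, with input.csv as a placeholder file name:
#   split_records < input.csv > valid 2> invalid
```

As in Sample 3, the redirections sit outside the loop, so the output files are opened once rather than once per record.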

Performance Matrix

Input Records  Time  Sample 1     Sample 2    Sample 3
1,000          Real  0m9.380s     0m2.533s    0m0.085s
               User  0m2.744s     0m0.781s    0m0.045s
               Sys   0m7.960s     0m2.037s    0m0.039s
10,000         Real  1m25.238s    0m22.515s   0m0.970s
               User  0m24.663s    0m7.001s    0m0.379s
               Sys   1m11.544s    0m17.786s   0m0.299s
100,000        Real  14m42.842s   4m6.492s    0m6.527s
               User  4m8.237s     1m15.282s   0m3.667s
               Sys   12m12.653s   3m13.174s   0m2.862s
1,000,000      Real  145m58.457s  41m17.773s  1m11.483s
               User  41m15.294s   12m28.498s  0m39.565s
               Sys   121m8.872s   32m21.106s  0m30.701s

Conclusion

As you can see from the performance matrix, performance improves drastically as the number of child processes launched per record drops: at 1,000,000 records, Sample 3 finishes in about 71 seconds, while Sample 1 needs almost 146 minutes.
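To put a number on the difference, the Real timings for the 1,000,000-record run can be converted to milliseconds and divided. The sketch below is just arithmetic on the figures from the matrix. The helper name to_ms is made up, and it assumes the time format used here: minutes, an ‘m’, seconds with exactly three decimal places, and a trailing ‘s’.

```shell
# to_ms: hypothetical helper that converts a value such as 145m58.457s
# to milliseconds using only parameter expansion and shell arithmetic.
# Assumes exactly three digits after the decimal point.
to_ms() {
  t=${1%s}          # strip the trailing "s"   -> 145m58.457
  min=${t%%m*}      # minutes                  -> 145
  sec=${t#*m}       # seconds with fraction    -> 58.457
  whole=${sec%.*}   # whole seconds            -> 58
  frac=${sec#*.}    # milliseconds             -> 457
  # Prefixing 1 and subtracting 1000 keeps a value like 085 from
  # being read as an octal number.
  echo $(( (min * 60 + whole) * 1000 + 1$frac - 1000 ))
}

s1=$(to_ms 145m58.457s)   # Sample 1, Real, 1,000,000 records
s2=$(to_ms 41m17.773s)    # Sample 2
s3=$(to_ms 1m11.483s)     # Sample 3

# Integer division, so the ratios are rounded down
echo "Sample 1 vs Sample 3: about $(( s1 / s3 ))x"   # prints: about 122x
echo "Sample 2 vs Sample 3: about $(( s2 / s3 ))x"   # prints: about 34x
```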

There are situations, however, when it seems impossible to avoid running a child process for every record (especially when third-party utilities are involved, like having to update a database for every record). Rest assured, there is a way around that (read: KSH co-processes)! I’ll cover that in another post sometime.

Hope this post helps you write faster running scripts!