Golden Rule for Super Fast Shell Scripts!

Shell scripts are very powerful, and they can be written in a very short amount of time, especially when it comes to processing text files. Most often these scripts are written as a stop-gap solution until a more permanent, efficient C program can replace them. However, once the script is installed in production, people realize it’s working just fine and there is no need to spend more effort writing a C program from scratch. Over time the script gets used more and more (a phenomenon one of my good friends labels “if you build it, they will come”, a reference to the movie “Field of Dreams”… more on this in some other post), and it soon becomes a performance bottleneck in the system.

This is partly because there are so many ways of accomplishing a single task in a shell script that it’s difficult for most developers to figure out which one is the most efficient. In this post I’ll cover a single “Golden Rule” that I have discovered, which helps me write really efficient Korn shell scripts. Here it is:

Never launch a child process in a processing loop!

A processing loop, as referred to here, is a loop that iterates over every record in the input file.

The Golden Rule can be expressed in mathematical notation as:

performance = A/(# of child processes launched per input record)

where ‘A’ is a system-dependent constant. In other words, halving the number of child processes launched per record roughly doubles the script’s throughput.

Note: Even though I have used the Korn shell for all the examples here, the basic principle should hold true for any shell.

What is a Child Process?

The crux of the Golden Rule is to make sure that we do not launch child processes over and over within a single script execution, because there is a lot of overhead involved in creating each child process. Here are the main things that cause a child process to be created:

  • Any utility that’s not a shell built-in, like cut, sed, grep, etc. (a quick way to check is shown right after this list).
  • Every time you use a pipeline, it causes child processes to be created.
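
If you are not sure whether a command is a shell built-in or an external program (and therefore a fork and exec), ksh’s whence command can tell you. For example:

whence -v print     # reports that print is a shell builtin
whence -v cut       # reports a path such as /usr/bin/cut, i.e. an external program

(The exact wording of the output varies between ksh versions and systems.)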

It is easier to demonstrate the Golden Rule than to talk about it, so let’s dive into some samples.

Examples

Let’s take a very simple example, where you have a CSV input file containing a first name, a last name, and an email address. The job of the script is to parse the input file, verify that each email address contains an ‘@’ and a ‘.’, and split any invalid records out to an error file.
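
For illustration only, here is a tiny, made-up input file in that format; the second record has no ‘.’ in the email address, so it should end up in the error file:

## Create a small, hypothetical test file (all names and addresses are made up)
cat > input.csv <<'EOF'
John,Doe,john.doe@example.com
Jane,Smith,jane.smith@example
Bill,Jones,bill@example.org
EOF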

I’ll show the same script written in three different styles, with a decreasing number of child processes per record and an increasing degree of efficiency.

Sample 1

#!/bin/ksh

> valid
> invalid

cat "$1" | while read line
do
  ## Three pipelines used here to parse each record
  fname=$(echo "$line" | cut -d, -f1)
  lname=$(echo "$line" | cut -d, -f2)
  email=$(echo "$line" | cut -d, -f3)
  ## Another pipeline to check validity of the email
  if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
    echo "$fname,$lname,$email" >> valid
  else
    echo "$fname,$lname,$email" >> invalid
  fi
done
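
As an aside (this is not one of the timed samples), ksh can also split a record into fields without launching any child process at all, by using built-in parameter expansion; Sample 2 below achieves the same effect more simply with IFS:

## Sketch only: splitting one hypothetical record with parameter expansion,
## which is handled entirely by the shell (no fork, no exec)
line="John,Doe,john.doe@example.com"
fname=${line%%,*}       # everything up to the first comma
rest=${line#*,}         # everything after the first comma
lname=${rest%%,*}
email=${rest#*,}
print "$fname $lname $email"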

Sample 2

#!/bin/ksh

> valid
> invalid

cat "$1" | while IFS=, read fname lname email
do
  ## Eliminated the need for three pipelines by using IFS with read
  if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
    echo "$fname,$lname,$email" >> valid
  else
    echo "$fname,$lname,$email" >> invalid
  fi
done

Sample 3

#!/bin/ksh

cat "$1" | while IFS=, read fname lname email
do
  ## Eliminated the need for egrep by using ksh's built-in pattern-matching capability
  if [[ $email == +(?)@+(?).+(?) ]]; then
    echo "$fname,$lname,$email"
  else
    echo "$fname,$lname,$email" >&2
  fi
done > valid 2> invalid
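
For completeness (this tweak is not part of the timed samples), even the single cat in Sample 3 can be removed by redirecting the input file straight into the loop, leaving no child process at all:

#!/bin/ksh

## Sketch only: same logic as Sample 3, with the cat pipeline replaced by
## an input redirection on the loop
while IFS=, read fname lname email
do
  if [[ $email == +(?)@+(?).+(?) ]]; then
    echo "$fname,$lname,$email"
  else
    echo "$fname,$lname,$email" >&2
  fi
done < "$1" > valid 2> invalid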

Performance Matrix
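
The Real/User/Sys rows below are the familiar output of the shell’s time keyword; the numbers are from my test runs, and you can time any of the samples the same way (script and input file names here are just placeholders):

time ./sample1.ksh input.csv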

Input Records          Sample 1        Sample 2        Sample 3
1,000        Real      0m9.380s        0m2.533s        0m0.085s
             User      0m2.744s        0m0.781s        0m0.045s
             Sys       0m7.960s        0m2.037s        0m0.039s
10,000       Real      1m25.238s       0m22.515s       0m0.970s
             User      0m24.663s       0m7.001s        0m0.379s
             Sys       1m11.544s       0m17.786s       0m0.299s
100,000      Real      14m42.842s      4m6.492s        0m6.527s
             User      4m8.237s        1m15.282s       0m3.667s
             Sys       12m12.653s      3m13.174s       0m2.862s
1,000,000    Real      145m58.457s     41m17.773s      1m11.483s
             User      41m15.294s      12m28.498s      0m39.565s
             Sys       121m8.872s      32m21.106s      0m30.701s

Conclusion

As the performance matrix shows, performance improves drastically as the number of child processes launched per record goes down.

There are situations, however, where it seems impossible to avoid launching a child process for every record, especially when third-party utilities are involved (like having to update a database for every record). Rest assured, there is a way around that too (read: ksh co-processes)! I’ll cover it in detail in another post sometime.
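
In the meantime, here is a minimal sketch of the mechanism, with bc standing in for the per-record utility (this is just the syntax, not the database example): the external command is started once with |&, and the loop then talks to it with print -p and read -p instead of forking a new process for each record.

#!/bin/ksh

## Minimal co-process sketch: bc is started once and reused for every iteration
bc |&                          # launch bc as a co-process (one child in total)

for n in 2 3 4 5
do
  print -p "$n * $n"           # write a request to the co-process
  read -p square               # read its reply
  echo "$n squared is $square"
done

print -p "quit"                # ask bc to exit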

Hope this post helps you write faster-running scripts!

Manuj Bhatia
