Golden Rule for Super Fast Shell Scripts!
Shell scripts are very powerful in what you can make them do, and they can be written in a very short amount of time, especially when it comes to processing text files. Most often these scripts are written as a stop-gap solution until a more permanent, efficient C program can replace them. However, once the script is installed in production, people realize it's working just fine and there is no need to spend more effort writing a C program from scratch. Over time the script starts getting used more and more (a phenomenon one of my good friends labels “if you build it, they will come”, a reference to the movie “Field of Dreams”… more on this in some other post), and it soon becomes a performance bottleneck in the system.
This is partly because there are so many ways of accomplishing a single task in shell scripts that it's difficult for most developers to figure out which one is the most efficient. In this post I'll cover a single “Golden Rule” that I have discovered, which helps me write really efficient Korn shell scripts. Here it is:
Never launch a child process in a processing loop!
A processing loop, as referred to here, is a loop that iterates over every record in the input file.
The Golden Rule can be expressed roughly in mathematical notation as:
performance = A / (# of child processes launched per input record)
where ‘A’ is a system-dependent constant.
Note: Even though I have used the Korn shell for all the examples here, the basic principle should hold true for any shell.
What is a Child Process?
The crux of the Golden Rule is to make sure that we do not launch child processes over and over again within a single script execution, because there is a lot of overhead involved in creating each child process. Here are the main things that cause a child process to be created (a quick way to check whether a command is a built-in is shown after the list):
- Any utility that's not a shell built-in, such as cut, sed, grep, etc.
- Every time you use a pipeline, it causes child processes to be created.
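If you are unsure whether a particular command is a shell built-in or an external program, ksh's whence built-in can tell you. The exact wording of the output varies between systems and ksh versions, but it looks roughly like this:

$ whence -v print
print is a shell builtin
$ whence -v cut
cut is /usr/bin/cut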
It will be easier to demonstrate the Golden Rule than to talk about it, so let's dive into some samples.
Examples
Let's take a very simple example, where you have a CSV input file with a first name, a last name and an email address. The job of the script is to parse the input file, verify that each email address contains an ‘@’ and a ‘.’, and split out any invalid records into an error file.
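For illustration, the input might look something like this (made-up records; the last one is invalid because there is no ‘.’ after the ‘@’):

John,Doe,john.doe@example.com
Jane,Smith,jane.smith@example.org
Bad,Record,bad-record@nowhere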
I'll show the same script written in three different styles, each launching fewer child processes per record than the last and therefore running more efficiently.
Sample 1
#!/bin/ksh
> valid
> invalid
cat "$1" | while read line
do
    ## Three pipelines used here to parse each record
    fname=$(echo "$line" | cut -d, -f1)
    lname=$(echo "$line" | cut -d, -f2)
    email=$(echo "$line" | cut -d, -f3)
    ## Another pipeline to check the validity of the email
    if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
        echo "$fname,$lname,$email" >> valid
    else
        echo "$fname,$lname,$email" >> invalid
    fi
done
Sample 2
#!/bin/ksh
> valid
> invalid
cat "$1" | while IFS=, read fname lname email
do
    ## Eliminated the need for three pipelines by using IFS with read
    if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
        echo "$fname,$lname,$email" >> valid
    else
        echo "$fname,$lname,$email" >> invalid
    fi
done
Sample 3
#!/bin/ksh
cat "$1" | while IFS=, read fname lname email
do
    ## Eliminated the need for egrep by using ksh's built-in pattern matching:
    ## ?*@?*.?* requires at least one character before the '@', between the '@' and the '.', and after the '.'
    if [[ $email = ?*@?*.?* ]]; then
        echo "$fname,$lname,$email"
    else
        echo "$fname,$lname,$email" >&2
    fi
done > valid 2> invalid
Performance Matrix
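The real/user/sys figures below are in the format reported by the time command. If you want to run a similar comparison yourself, a rough sketch would be to generate a throw-away CSV and time each sample against it. The file name, record count and script names (sample1.ksh, sample2.ksh, sample3.ksh) here are just placeholders:

#!/bin/ksh
## Generate a throw-away test CSV; adjust the record count as needed
integer i=0
while (( i < 10000 ))
do
    (( i += 1 ))
    print "First$i,Last$i,user$i@example.com"
done > test.csv

## Time each version of the script against the same input
time ./sample1.ksh test.csv
time ./sample2.ksh test.csv
time ./sample3.ksh test.csv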
| Input Records | Time | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|---|
| 1,000 | Real | 0m9.380s | 0m2.533s | 0m0.085s |
| | User | 0m2.744s | 0m0.781s | 0m0.045s |
| | Sys | 0m7.960s | 0m2.037s | 0m0.039s |
| 10,000 | Real | 1m25.238s | 0m22.515s | 0m0.970s |
| | User | 0m24.663s | 0m7.001s | 0m0.379s |
| | Sys | 1m11.544s | 0m17.786s | 0m0.299s |
| 100,000 | Real | 14m42.842s | 4m6.492s | 0m6.527s |
| | User | 4m8.237s | 1m15.282s | 0m3.667s |
| | Sys | 12m12.653s | 3m13.174s | 0m2.862s |
| 1,000,000 | Real | 145m58.457s | 41m17.773s | 1m11.483s |
| | User | 41m15.294s | 12m28.498s | 0m39.565s |
| | Sys | 121m8.872s | 32m21.106s | 0m30.701s |
Conclusion
As you can see from the performance matrix, as the number of child processes launched per record goes down, performance improves drastically.
There are situations, however, where it seems impossible to avoid running a child process for every record (especially when third-party utilities are involved, like having to update a database for every record). Rest assured, there is a way around that (read up on ksh co-processes)! I'll cover that in another post sometime.
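As a quick teaser of the general idea: a co-process lets you start the external utility once and keep talking to it from inside the loop, instead of re-launching it for every record. Here is a minimal sketch of the idea, using bc as a stand-in for the external utility and a hypothetical prices.csv of item,quantity,unit-price records:

#!/bin/ksh
## Start bc once as a co-process: one child process for the whole run
bc |&

while IFS=, read item qty price
do
    print -p "$qty * $price"   ## send one expression to the co-process
    read -p total              ## read its answer back
    print "$item,$total"
done < prices.csv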
Hope this post helps you write faster-running scripts!