Shell scripts are very powerful and can be written in a very short amount of time, especially when it comes to processing text files. Most often these scripts are written as a stop-gap solution until a more permanent, efficient C program can replace them. However, once the script is installed in production, people realize it's working just fine and there is no need to spend more effort writing a C program from scratch. Over time the script starts getting used more and more (a phenomenon one of my good friends labels “if you build it, they will come”, a reference to the movie “Field of Dreams”… more on this in some other post) and it soon becomes a performance bottleneck in the system.
This is partly because there are so many ways of accomplishing a single task in shell scripts that it's difficult for most developers to figure out which one is the most efficient. In this post I'll cover a single “Golden Rule” that I have discovered, which helps me write really efficient Korn shell scripts. Here it is:
Never launch a child process in a processing loop!
A processing loop, as referred to here, is a loop that iterates over every record in the input file.
The Golden Rule can be expressed in mathematical notation as:
performance = A/(# of child processes launched per input record)
where ‘A’ is a system-dependent constant.
Note: Even though I have used the Korn shell for all the examples here, the basic principle should hold true for any shell.
What is a Child Process?
The crux of the Golden Rule is to make sure that we do not launch any child process multiple times within a script execution, because there is a lot of overhead involved in creating a child process. Here are the most common things that cause a child process to be created (see the quick check after this list):
- Any utility that is not a shell built-in, such as cut, sed, grep, etc.
- Every time you use a pipeline, it causes child processes to be created.
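A quick way to tell the two apart is ksh's whence built-in (the exact output wording varies by system):

## Check whether a command is a built-in (no fork) or an external utility
whence -v read     ## built-in: handled by the shell itself, no child process
whence -v cut      ## external utility, e.g. /usr/bin/cut, a new process per call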
It will be easier to demonstrate the Golden Rule than to talk about it, so let's dive into some samples.
Let's take a very simple example, where you have a CSV input file with a first name, last name and email address in each record. The job of the script is to parse the input file, verify that each email address contains an ‘@’ and a ‘.’, and split out any invalid records to an error file.
I'll show the same script written in three different styles, with a decreasing number of child processes per record and a correspondingly increasing degree of efficiency.
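For illustration, a hypothetical two-record input (the second record should land in the error file) might look like:

John,Doe,john.doe@example.com
Jane,Smith,jane.smith-at-example.com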
Sample 1:

#!/bin/ksh
> valid
> invalid
cat $1 | while read line
do
    ## Three pipelines used here to parse each record
    fname=$(echo $line | cut -d, -f1)
    lname=$(echo $line | cut -d, -f2)
    email=$(echo $line | cut -d, -f3)

    ## Another pipeline to check the validity of the email
    if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
        echo "$fname,$lname,$email" >> valid
    else
        echo "$fname,$lname,$email" >> invalid
    fi
done
Sample 2:

#!/bin/ksh
> valid
> invalid
cat $1 | while IFS=, read fname lname email
do
    ## Eliminated the need for three pipelines by using IFS with read
    if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
        echo "$fname,$lname,$email" >> valid
    else
        echo "$fname,$lname,$email" >> invalid
    fi
done
Sample 3:

#!/bin/ksh
cat $1 | while IFS=, read fname lname email
do
    ## Eliminated the need for egrep by using KSH's built-in pattern matching capability
    if [[ $email = +(?)@+(?).+(?) ]]; then
        echo "$fname,$lname,$email"
    else
        echo "$fname,$lname,$email" >&2
    fi
done > valid 2> invalid
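For what it's worth, the one remaining child process in Sample 3 (the cat) can be dropped as well by redirecting the input file straight into the loop; a minimal variant:

#!/bin/ksh
## Same logic as Sample 3, but reading the file directly removes the last child process (cat)
while IFS=, read fname lname email
do
    if [[ $email = +(?)@+(?).+(?) ]]; then
        echo "$fname,$lname,$email"
    else
        echo "$fname,$lname,$email" >&2
    fi
done < "$1" > valid 2> invalid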
| Input Records | Sample 1 | Sample 2 | Sample 3 |
As you can see from the performance matrix, as the number of child processes launched per record decreases, performance improves drastically.
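If you want to build a similar matrix yourself, one way (script and file names here are made up for illustration) is to generate a test file and time each sample against it:

## Hypothetical harness: generate a 100,000-record test file and time each sample
integer i
for ((i = 1; i <= 100000; i++)); do
    print "first$i,last$i,user$i@example.com"
done > input.csv

time ./sample1.ksh input.csv
time ./sample2.ksh input.csv
time ./sample3.ksh input.csv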
There are situations, however, when it seems impossible to avoid running a child process for every record (especially when third-party utilities are involved, like having to update a database for every record). Rest assured, there is a way around that (read: KSH co-processes)! I'll cover that in another post sometime.
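As a teaser, here is a minimal sketch of the idea, with bc standing in for the per-record utility and a made-up quantity/price input format: the child is launched once as a co-process, and every record is fed to it through the co-process pipe instead of forking a new process per record.

#!/bin/ksh
## Launch the external utility once as a co-process (|&) instead of once per record;
## bc stands in here for something like a database client
bc |&

while IFS=, read qty price
do
    print -p "$qty * $price"   ## send a request down the co-process pipe
    read -p total              ## read its reply back, no new fork per record
    echo "$qty,$price,$total"
done < "$1" > priced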
Hope this post helps you write faster running scripts!