In this series of posts, I'll make the case that shell scripts are underused in mainstream data processing pipelines, examine the reasons behind that, and argue that we should develop a new shell, based on Go, to address those issues.

Data Processing Pipelines

I have always been a big fan of Unix shell pipelines. I think they are a fantastic way to write rapid data processing pipelines.

Of course, there are many design patterns to present processing pipelines in mainstream languages. Look at this example of a simple pipeline from Akka Streams:

  Flow<ScoopOfBatter, HalfCookedPancake, NotUsed> fryingPan1 =
      Flow.of(ScoopOfBatter.class).map(batter -> new HalfCookedPancake());

  Flow<HalfCookedPancake, Pancake, NotUsed> fryingPan2 =
      Flow.of(HalfCookedPancake.class).map(halfCooked -> new Pancake());

  public void demonstratePipelining() {
    // With the two frying pans we can fully cook pancakes
    Flow<ScoopOfBatter, Pancake, NotUsed> pancakeChef = fryingPan1.async().via(fryingPan2.async());
  }
Akka Stream Pipeline Example

Or, an example from Apache Beam data processing:

var wordRE = regexp.MustCompile(`[a-zA-Z]+('[a-z])?`)

func main() {
	// beam.Init() is an initialization hook that must be called on startup.
	beam.Init()

	// Create the Pipeline object and root scope.
	p := beam.NewPipeline()
	s := p.Root()

	// Apply the pipeline's transforms.

	// This example reads a public data set consisting of the complete works
	// of Shakespeare.
	lines := textio.Read(s, "gs://apache-beam-samples/shakespeare/*")

	words := beam.ParDo(s, func(line string, emit func(string)) {
		for _, word := range wordRE.FindAllString(line, -1) {
			emit(word)
		}
	}, lines)

	counted := stats.Count(s, words)

	formatted := beam.ParDo(s, func(w string, c int) string {
		return fmt.Sprintf("%s: %v", w, c)
	}, counted)

	textio.Write(s, "wordcounts.txt", formatted)

	// Run the pipeline on the direct runner.
	direct.Execute(context.Background(), p)
}
Apache Beam Minimal Wordcount Example

Something similar in a shell script would look like this:

scoop < batter | cook_pancake_side | flip_to_other_pan | cook_pancake_side > plate
Cook a Pancake in a Shell Script
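The pancake commands above are of course fictional. To show the same four-stage shape with real utilities (the file names and substitutions here are invented purely for illustration), each stage is an independent process that reads standard input and writes standard output:

```shell
# Stand-in input; "batter" is just a text file here.
printf 'scoop of batter\n' > /tmp/batter

# Three transforming stages chained with pipes, then redirected to a file.
tr 'a-z' 'A-Z' < /tmp/batter | sed 's/SCOOP OF BATTER/HALF-COOKED/' \
    | sed 's/HALF-COOKED/PANCAKE/' > /tmp/plate

cat /tmp/plate
```

Nothing in the pipeline knows what the neighboring stages are; each one only sees a byte stream on stdin and stdout, which is both the charm and, as we'll see below, the limitation.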

Or, this for the word count:

# Split text into words (zsh: read -A fills an array, print writes a line)
function words {
    while IFS=$' \n\t' read -A line; do
        for word in ${line[*]}; do
            print "$word"
        done
    done
}

gsutil cat "gs://apache-beam-samples/shakespeare/*" | words | sort | uniq -c
Shell Word Count

As you can see, the shell pipeline is more concise and expressive about what the inputs and outputs are and what happens at each processing step.

Limitations of Shell Pipelines

Even though shell pipelines are very expressive and easy to write, there are several limitations when it comes to using them for complex data pipelines:

  1. Lack of structured data. All data in a pipeline is a blob. If your data source is a SELECT query, you have to flatten all fields of a row into a program-specific blob before passing it to the next step in the pipeline. To work effectively with structured data, each program must be aware of which program it pipes to/from, which violates the Unix principle of small, independent utilities combining to produce more sophisticated programs.
  2. No type-safety. All data in a shell pipeline is treated as a byte stream, typically interpreted as a string. This forces programs to continually parse and re-serialize values, and increases the risk of a run-time type error.
  3. No compatibility check. echo Hello | ls is a syntactically valid pipeline even though it makes no semantic sense: ls ignores its standard input entirely.
  4. No fan-in, fan-out. Much of data processing can be represented linearly, but there are always crucial use cases where you need to either fan out an input or fan in an output. Shell pipelines lack this capability.
  5. Lack of multi-threading. Each step of the pipeline runs in an independent OS process, which makes invoking a pipeline an expensive operation. Although this overhead is usually not a problem for long-running pipelines, it can make short-lived pipelines prohibitively expensive. Also, in a fan-in/fan-out scenario, a process-per-step model is much harder to scale than a thread-, coroutine-, or actor-based model.
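Point 1 is worth making concrete. In this sketch (the users file and its name,age column layout are invented for illustration), the consumer must hard-code the producer's field positions, because the pipeline itself carries no schema:

```shell
# Hypothetical producer: a query result flattened to comma-separated text.
printf 'alice,30\nbob,25\n' > /tmp/users.csv

# The consumer must know that age is field 2 of this particular layout;
# reorder the producer's columns and this silently computes nonsense.
cut -d, -f2 /tmp/users.csv | sort -n | head -n1
```

The coupling lives entirely in that -f2: nothing in the pipeline can check that field 2 is still an age, which is exactly the structured-data and compatibility gap described above.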

In the next post, we will focus on how we can enhance the syntax of a shell pipeline to handle some of these issues.