In this series of posts, I'll make the case that shell scripts are underused in mainstream data processing pipelines, look at the reasons behind that, and argue that we should develop a new shell, based on Golang, to address those issues.
Data Processing Pipelines
I have always been a big fan of Unix shell pipelines. I think they are a fantastic way to write rapid data processing pipelines.
Of course, mainstream languages offer many design patterns for expressing processing pipelines. Look at this example of a simple pipeline from Akka Streams:
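A minimal sketch using the Akka Streams Java DSL, which keeps the even numbers in a range, doubles them, and prints them. This assumes the `akka-stream` artifact (Akka 2.6+, where an `ActorSystem` can be passed directly as the materializer); the class and system names are illustrative:

```java
import java.util.concurrent.CompletionStage;

import akka.Done;
import akka.actor.ActorSystem;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class NumbersPipeline {
  public static void main(String[] args) {
    final ActorSystem system = ActorSystem.create("pipeline");

    final CompletionStage<Done> done =
        Source.range(1, 100)           // emit 1..100
            .filter(n -> n % 2 == 0)   // keep the even numbers
            .map(n -> n * 2)           // double them
            .runWith(Sink.foreach(System.out::println), system);

    done.thenRun(system::terminate);   // shut down once the stream completes
  }
}
```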
Or, an example from Apache Beam data processing:
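A sketch along the lines of Beam's canonical word-count example, using the Beam Java SDK. It assumes the `beam-sdks-java-core` artifact and a runner on the classpath; `input.txt` and `wordcounts` are placeholder file names:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("input.txt"))                     // read lines
        .apply(FlatMapElements.into(TypeDescriptors.strings())   // split into words
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))      // drop empty tokens
        .apply(Count.perElement())                               // count each word
        .apply(MapElements.into(TypeDescriptors.strings())       // format the results
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("wordcounts"));                 // write output shards

    p.run().waitUntilFinish();
  }
}
```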
Something similar in a shell script would look like this:
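For example, a sketch that filters a stream of numbers down to the even ones and doubles them, using the standard `seq`, `grep`, and `awk` utilities:

```shell
# Emit 1..100, keep the even numbers, double them.
seq 1 100 | grep '[02468]$' | awk '{ print $1 * 2 }'
```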
Or, this for the word count:
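A classic word-count sketch (`input.txt` is a placeholder for any text file):

```shell
# Split input into one word per line, count occurrences,
# and list the ten most frequent words.
tr -s '[:space:]' '\n' < input.txt \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -10
```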
As you can see, the shell pipeline is more precise and expressive about what the inputs and outputs are and about what happens at each step of the processing.
Limitations of Shell Pipelines
Even though shell pipelines are very expressive and easy to write, there are several limitations when it comes to using them for complex data pipelines:
- Lack of structured data. All data in a pipeline is an opaque blob. If your data source is a `SELECT` query, you have to combine all the fields of a row into a blob in a format specific to your program before passing it to the next step in the pipeline. To work effectively with structured data, each program has to know which program it pipes to and from, which violates the Unix principle of independent utilities combining to produce more sophisticated programs.
- No type-safety. All data in a shell pipeline is treated as a byte stream, typically interpreted as strings. This forces programs to continually parse and cast their input and output values, increasing the risk of run-time type errors.
- No compatibility check. `echo Hello | ls` is a syntactically valid pipeline even though it makes no semantic sense: `ls` ignores its standard input.
- No fan-in, fan-out. Much of data processing can be expressed linearly, but there are always crucial use cases where you need to fan an input out to several consumers or fan several outputs in to a single consumer. Shell pipelines lack this capability.
- Lack of multi-threading. Each step of the pipeline runs in a separate OS process, which makes invoking a pipeline an expensive operation. This overhead is usually negligible for long-running pipelines, but it can make short-lived pipelines prohibitively costly. In fan-in/fan-out scenarios, a process model is also much harder to scale than a thread-, coroutine-, or actor-based model.
In the next post, we will focus on how we can enhance the syntax of a shell pipeline to handle some of these issues.