Integrated DB Access from the Shell - Part 2

As discussed in part 1 of this series, we are trying to build an idiomatic way of accessing the database from the shell. Here is a quick snippet of how we want the code to look:

## Setting the DSN opens a DB connection
DSN="host=localhost port=5432 user=postgres password=123456"
SELECT id, email from guest | while read id email
do
    valid=$(is_email_valid "$email")
    echo "$id" "$valid"
done | UPDATE guest set is_valid = $2 where id = $1
## Unset the DSN to close the connection
unset DSN
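
The is_email_valid helper above is not part of the DB layer; it stands in for whatever validation logic you need. A naive, purely illustrative sketch that prints t or f (matching the boolean values used in the table) could be:

## Hypothetical helper -- a very rough check: something@something.something
function is_email_valid
{
   if [[ "$1" == *?@?*.?* ]]; then
      print "t"
   else
      print "f"
   fi
}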

Ideally, we would write custom shell built-ins in C, so we get full control over each DB API invocation, but to keep things simpler, let's look at doing this purely within the shell using co-processes. All the samples here are written for ksh93, but should be adaptable to other shells that support co-processes (bash's coproc, for example).

To achieve this, we will be tackling the following:

  1. Implement discipline functions for the DSN variable, so the DB connection is opened and closed when this variable is set and unset.
  2. Launch the psql utility as a co-process from these discipline functions, so we can communicate with it using print -p and read -p.
  3. Define SELECT and INSERT functions to use the co-process to execute the database operations. You can model DELETE and UPDATE based on these examples.

function DSN.set
{
   typeset db="${.sh.value}"

   ## launch psql as a co-process
   psql -d "$db" -t |&
   typeset pid=$!

   .sh.value=( db="$db" pid="$pid" )
}

function DSN.unset
{
   ## close both ends of the co-process pipe
   exec p>&- p<&-
}
DSN Discipline Functions

The |& operator creates a co-process. A co-process is just like any other background job, except that it gets a pair of I/O pipes that the current process can access using read -p and print -p.

To stop the background co-process, we simply close both ends of its pipe using the p>&- and p<&- redirections; once its input pipe is closed, psql reads EOF and exits.
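
Before wiring this up to psql, it may help to see the co-process plumbing in isolation. In this small, illustrative example a trivial shell loop stands in for psql:

## A toy co-process: echoes back each line it receives
while read line; do print -r -- "echo: $line"; done |&

print -p "hello"      ## write a line to the co-process
read -p reply         ## read one line back from it
print "$reply"        ## prints: echo: hello

## Close both ends of the pipe; the loop sees EOF and exits
exec p>&- p<&-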

function SELECT
{
   typeset q="SELECT $@;"

   if [[ -t 1 ]]; then
      ## If outputting to the terminal, use the expanded output and page by default
      _dbq '\x on
      \pset format aligned
      \pset tuples_only off' >/dev/null
      _dbq "$q" | ${PAGER:-less}
   else
      ## For output to a file/pipe, use the unload syntax
      _dbq '\x off
      \pset format unaligned
      \pset tuples_only on' >/dev/null
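      ## terminate each record with '|' so an empty last column survives a later split on '|'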
      _dbq "$q" | while read -r x; do print -r "$x|"; done
   fi
}

## Execute the given DB query on the co-process and print the results
function _dbq
{
   typeset __DSNSEP="--DONE--"
   print -p "${@}"
   print -p "\echo ${__DSNSEP}"

   while read -p row
   do
      [[ "$row" == "$__DSNSEP" ]] && break
      print -r -- "$row"
   done
}
SELECT

The key with the SELECT function is to customize the behavior depending on where the output is going. For terminal output, we want to make it easy for the user to read. For output to a file or a pipe, however, we want the data to be machine-readable.

The key point in the _dbq function is to avoid getting stuck in a read from the co-process. For each print -p, we need to make sure we read -p all of the output, but without waiting for anything extra, which would deadlock.

Since each query can return an unknown number of records, we need a pre-defined marker to determine when the output of the query is complete. That is the purpose of the __DSNSEP marker: we ask the psql co-process to \echo it after each query, and in the read loop we break as soon as we see $__DSNSEP. This guarantees that we always read exactly as much data from the co-process as each query produces.
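
Since _dbq simply relays whatever it is given to psql, it can also be called directly for ad-hoc statements or psql meta-commands once the DSN co-process is up, for example:

_dbq 'SELECT version();'       ## any SQL statement
_dbq '\conninfo'               ## psql meta-commands work the same way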

function INSERT
{
   typeset q="INSERT $@"

   if [[ -t 0 ]]; then
      ## If input is the terminal, execute the query as-is
      _dbq "$q;"
   else
      ## For input from a file/pipe, use the load syntax
      _load "$q"
   fi
}

function _load
{
   typeset q="$@ VALUES"
   typeset -i commit_limit=${COMMIT_LIMIT:-100}
   typeset sql="$q"
   typeset -i cnt=0
   typeset -a vals
   while read -A vals    ## split each input record into fields on IFS
   do
      sql+="("
      for val in "${vals[@]}"    ## quoted so empty fields are preserved and become NULL
      do
         if [[ -z $val ]]; then
            sql+="NULL,"
         else
            sql+="'$val',"
         fi
      done
      sql="${sql%,}"    ## trim the trailing ','
      sql+="),"
      cnt=$((cnt+1))
      if ((cnt%commit_limit == 0)); then
         sql="${sql%,}"
         _dbq "$sql;"
         ## Reset the query
         sql="$q"
      fi
   done

   ## Catch anything left over
   if [[ "$sql" != "$q" ]]; then
      sql="${sql%,}"
      _dbq "$sql;"
   fi
}
INSERT

Similar to the SELECT function, we check whether the input comes from the terminal or from a file/pipe. For terminal input, we run the query as-is. For input from a file or a pipe, we read the values from the input and insert the records in batches of COMMIT_LIMIT rows (100 by default); the batch size can be tuned through the COMMIT_LIMIT variable.
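
The transcript below also calls a DELETE function, which, as noted in step 3, you can model on these examples. A minimal sketch (not part of the original code) that mirrors the terminal branch of INSERT might look like this; driving UPDATE from a pipe, as in the opening snippet, would additionally need the terminal/pipe branching and per-record handling that INSERT's _load performs.

function DELETE
{
   ## Run the statement as given, e.g.: DELETE from guest where id = 1
   _dbq "DELETE $@;"
}

function UPDATE
{
   ## Run the statement as given, e.g.: UPDATE guest set is_valid = 't' where id = 1
   _dbq "UPDATE $@;"
}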

Here is how this all works from the command-line.

$ DSN="host=localhost port=5432 user=postgres password=123456"
$ echo $DSN
( db=host\='localhost port=5432 user=postgres password=123456' pid=47325 )

$ SELECT \* from guest
-[ RECORD 1 ]------------
id       | 10
email    | xxx
is_valid | t
-[ RECORD 2 ]------------
id       | 1
email    | test@test.com
is_valid |
-[ RECORD 3 ]------------
id       | 2
email    | test@test.com
is_valid | f
(END)

$ SELECT \* from guest > guest.dat
$ cat guest.dat
10|xxx|t|
1|test@test.com||
2|test@test.com|f|

$ DELETE from guest

$ IFS=\| INSERT INTO guest < guest.dat
INSERT 0 3

$ SELECT count\(\*\) from guest
-[ RECORD 1 ]
count | 3

(END)

In the next post, we will see how to implement these functions as C built-ins, which enables much more sophisticated functionality.