<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[A Programmer's Journal]]></title><description><![CDATA[Thoughts, recipes and ideas about programming and life]]></description><link>https://manujbhatia.com/</link><image><url>http://manujbhatia.com/favicon.png</url><title>A Programmer&apos;s Journal</title><link>https://manujbhatia.com/</link></image><generator>Ghost 4.48</generator><lastBuildDate>Mon, 10 Nov 2025 11:08:56 GMT</lastBuildDate><atom:link href="https://manujbhatia.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Integrated DB Access from the Shell - Part 2]]></title><description><![CDATA[Seamlessly access Postgres (or, any other DB) from the shell as if the database is an integrated part of the shell.]]></description><link>https://manujbhatia.com/2020/04/13/db-access-part-2/</link><guid isPermaLink="false">5e8e1b08911dbd00014b8aaa</guid><category><![CDATA[ksh93]]></category><category><![CDATA[database]]></category><category><![CDATA[Data Processing]]></category><category><![CDATA[ksh]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Mon, 13 Apr 2020 08:31:01 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1533750204176-3b0d38e9ac1e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1533750204176-3b0d38e9ac1e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Integrated DB Access from the Shell - Part 2"><p>As discussed in <a href="http://manujbhatia.com/2020/04/08/integrate-db-access-into-the-shell/">part 1</a> of this series, we 
are trying to build an idiomatic way of accessing the database from the shell. Here is a quick snippet of what we want the code to look like:</p><pre><code class="language-bash">## Setting the DSN opens a DB connection
DSN=&quot;host=localhost port=5432 user=postgres password=123456&quot;
SELECT id, email from guest | while read id email
do
    valid=$(is_email_valid &quot;$email&quot;)
    echo &quot;$id&quot; &quot;$valid&quot;
done | UPDATE guest set is_valid = $2 where id = $1
## Unset the DSN to close the connection
unset DSN</code></pre><p>Ideally, we would write custom shell built-ins in C, so we get full control over each db api invocation, but to keep things simpler, let&apos;s look at doing this purely within the shell using <a href="https://docstore.mik.ua/orelly/unix/ksh/ch08_05.htm">co-processes</a>. All the samples here are written for <code>ksh93</code>, but should be easily adaptable to any other shell.</p><p>To achieve this, we will be tackling the following:</p><ol><li>Implement <a href="http://www.mtxia.com/js/Downloads/Scripts/Korn/discipline.shtml">discipline functions</a> for the <code>DSN</code> variable, so the DB connection is opened and closed when this variable is set and unset.</li><li>Launch the <code>psql</code> utility as a co-process from these discipline functions, so we can communicate with it using <code>print -p</code> and <code>read -p</code>.</li><li>Define <code>SELECT</code> and <code>INSERT</code> functions to use the co-process to execute the database operations. You can model <code>DELETE</code> and <code>UPDATE</code> based on these examples.</li></ol><figure class="kg-card kg-code-card"><pre><code class="language-bash">function DSN.set
{
   typeset db=&quot;${.sh.value}&quot;

   ## launch psql as a co-process
   psql -d &quot;$db&quot; -t |&amp;
   typeset pid=$!

   .sh.value=( db=&quot;$db&quot; pid=&quot;$pid&quot; )
}

function DSN.unset
{
   ## close the co-process pipe for writing
   exec p&gt;&amp;- p&lt;&amp;-
}
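
## A quick sketch of how these disciplines fire (illustrative, not run here):
##   DSN=&apos;host=localhost user=postgres&apos;   # assignment fires DSN.set, launches psql
##   print -r -- &quot;${DSN.pid}&quot;             # compound field stored by DSN.set
##   unset DSN                            # fires DSN.unset, closing the co-process pipes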
</code></pre><figcaption>DSN Discipline Functions</figcaption></figure><p><code>|&amp;</code> is the way to create a co-process. A co-process is just like any other background job, except it gets a set of i/o pipes that can be accessed from the current process using <code>read -p</code> and <code>print -p</code>.</p><p>To close the background co-process, we simply close the pipe <code>p</code> for i/o using the <code>&gt;&amp;-</code> and <code>&lt;&amp;-</code> commands.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">function SELECT
{
   typeset q=&quot;SELECT $@;&quot;

   if [[ -t 1 ]]; then
      ## If outputting to the terminal, use the expanded output and page by default
      _dbq &apos;\x on
      \pset format aligned
      \pset tuples_only off&apos; &gt;/dev/null
      _dbq &quot;$q&quot; | ${PAGER:-less}
   else
      ## For output to a file/pipe, use the unload syntax
      _dbq &apos;\x off
      \pset format unaligned
      \pset tuples_only on&apos; &gt;/dev/null
      _dbq &quot;$q&quot; | while read -r x; do print -r &quot;$x|&quot;; done
   fi
}

## Execute the given DB query on the co-process and print the results
function _dbq
{
   typeset __DSNSEP=&quot;--DONE--&quot;
   print -p &quot;${@}&quot;
   print -p &quot;\echo ${__DSNSEP}&quot;

   while read -p row
   do
      [[ &quot;$row&quot; == &quot;$__DSNSEP&quot; ]] &amp;&amp; break
      print -r -- &quot;$row&quot;
   done
}</code></pre><figcaption>SELECT</figcaption></figure><p>The key with the <code>SELECT</code> function is to customize the behavior depending on where the output is going. For terminal output, we want to make it easier for the user to read. For output to a file or a pipe, however, we want the data to be more machine-readable.</p><p>The key to note in the <code>_dbq</code> function is that you do not want to get stuck in a <code>read</code> from the co-process. For each <code>print -p</code>, we need to make sure that we <code>read -p</code> all the output, but not keep waiting for anything extra, to avoid a deadlock.</p><p>Since each query can return an unknown number of records, we need a pre-defined marker to determine when the output of the query is complete. That is the purpose of the <strong><code>__DSNSEP</code></strong> marker. We ask the <code>psql</code> co-process to echo it after each query. Then, in the read loop, we break when we find the <code>$__DSNSEP</code>. This guarantees that we always read exactly the amount of data the co-process produces.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">function INSERT
{
   typeset q=&quot;INSERT $@&quot;

   if [[ -t 0 ]]; then
      ## If input is the terminal, execute the query as-is
      _dbq &quot;$q;&quot;
   else
      ## For input from a file/pipe, use the load syntax
      _load &quot;$q&quot;
   fi
}

function _load
{
   typeset q=&quot;$@ VALUES&quot;
   typeset -i commit_limit=${COMMIT_LIMIT:-100}
   typeset sql=&quot;$q&quot;
   typeset -i cnt=0
   typeset -a vals
   while read -A vals
   do
      sql+=&quot;(&quot;
      for val in &quot;${vals[@]}&quot;
      do
         if [[ -z $val ]]; then
            sql+=&quot;NULL,&quot;
         else
            sql+=&quot;&apos;$val&apos;,&quot;
         fi
      done
      sql=&quot;${sql%,}&quot;    ## trim the trailing &apos;,&apos;
      sql+=&quot;),&quot;
      cnt=$((cnt+1))
      if ((cnt%commit_limit == 0)); then
         sql=&quot;${sql%,}&quot;
         _dbq &quot;$sql;&quot;
         ## Reset the query
         sql=&quot;$q&quot;
      fi
   done

   ## Catch anything left over
   if [[ $sql != $q ]]; then
      sql=&quot;${sql%,}&quot;
      _dbq &quot;$sql;&quot;
   fi
}
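
## For illustration, with COMMIT_LIMIT=2 the input lines &quot;1 a@x.com&quot; and
## &quot;2 b@y.com&quot; make _load send a single batched statement to _dbq:
##   INSERT INTO guest VALUES(&apos;1&apos;,&apos;a@x.com&apos;),(&apos;2&apos;,&apos;b@y.com&apos;);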
</code></pre><figcaption>INSERT</figcaption></figure><p>Similar to the <code>SELECT</code> function, we check if the input is from the terminal or a file/pipe. For terminal input, run the query as-is. For a file/pipe input, we want to read the values in from the input and process the records in batches using the <code>COMMIT_LIMIT</code>.</p><p>Here is how this all works from the command-line.</p><pre><code class="language-shell-session">$ DSN=&quot;host=localhost port=5432 user=postgres password=123456&quot;
$ echo $DSN
( db=host\=&apos;localhost port=5432 user=postgres password=123456&apos; pid=47325 )

$ SELECT \* from guest
-[ RECORD 1 ]------------
id       | 10
email    | xxx
is_valid | t
-[ RECORD 2 ]------------
id       | 1
email    | test@test.com
is_valid |
-[ RECORD 3 ]------------
id       | 2
email    | test@test.com
is_valid | f
(END)

$ SELECT \* from guest &gt; guest.dat
$ cat guest.dat
10|xxx|t|
1|test@test.com||
2|test@test.com|f|

$ DELETE from guest

$ IFS=\| INSERT INTO guest &lt; guest.dat
INSERT 0 3

$ SELECT count\(\*\) from guest
-[ RECORD 1 ]
count | 3

(END)
</code></pre><p>In the next post, we will see how we can implement these functions as <code>C</code> builtins and enable a lot more sophisticated functionality.</p>]]></content:encoded></item><item><title><![CDATA[Reference Counting in C]]></title><description><![CDATA[A simple and light-weight way to implement Reference Counting in a C program.]]></description><link>https://manujbhatia.com/2020/04/11/reference-counting-in-c/</link><guid isPermaLink="false">59af9d66dac8930001351cd5</guid><category><![CDATA[C]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Sat, 11 Apr 2020 06:10:53 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1548175551-1edaea7bbf0d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1548175551-1edaea7bbf0d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Reference Counting in C"><p>Ever since computer programming has been around, the problem of dynamic memory management has plagued many a program.</p><p>Nowadays, most programming languages rely on either <a href="https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)">Garbage Collection</a> or <a href="https://en.wikipedia.org/wiki/Reference_counting">Reference Counting</a>, but what is a C programmer to do?</p><p>It turns out there is no magic to reference counting; it is literally what it sounds like. Each piece of memory carries a count of how many references to it exist. 
Each function that wants to keep the value beyond the given scope should <code>retain</code> the value and, once it&apos;s done with it, <code>release</code> it.</p><p>You can easily create thin wrappers around your allocation modules to build this capability.</p><p>Of course, without help from the compiler, you cannot do <a href="https://en.wikipedia.org/wiki/Automatic_Reference_Counting">Automatic Reference Counting (ARC)</a>; you need to make sure your code properly calls <code>retain</code> and <code>release</code>, and also avoids <a href="https://en.wikipedia.org/wiki/Reference_counting#reference_cycle">retention cycles</a>.</p><pre><code class="language-c">// A header for each memory block
typedef struct MEMHDR {
	int retain_cnt;
} MEMHDR;

#define MEMHDR_SIZ	     ALIGN(sizeof(MEMHDR))
#define MEMHDR_PTR(ptr)  ((MEMHDR *)((char *)(ptr) - MEMHDR_SIZ))

// round the size up to the next word boundary
#define ALIGN(siz)       (((siz) + sizeof(int) - 1) &amp; ~(sizeof(int) - 1))
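// e.g. with a 4-byte int: ALIGN(5) == 8, ALIGN(7) == 8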

// Allocate an extra header at the head of the memory block,
// but return the offset pointer, with retain_cnt starting at 1
// for the caller&apos;s reference.
// This gives you an internal header to maintain any attributes
// for this block of memory.
static inline void *allocate(size_t siz)
{
	MEMHDR *hdr = malloc(MEMHDR_SIZ + ALIGN(siz));
	hdr-&gt;retain_cnt = 1;
	return (char *)hdr + MEMHDR_SIZ;
}

// Simply increment the retention counter and return the original pointer
// Allows the use like this:
//   str1 = retain(str)
#define retain(ptr)   (MEMHDR_PTR(ptr)-&gt;retain_cnt++, ptr)

// Release is a little more involved:
// decrement the retention counter and, when retain_cnt drops
// to zero, free the whole block (header included)
#define release(ptr)  ( \
	--(MEMHDR_PTR(ptr)-&gt;retain_cnt) \
    	?(ptr) \
        :(free(MEMHDR_PTR(ptr)),NULL) \
)
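
/* Illustrative walk-through of the intended counting rules:
 *   char *s = allocate(16);   // retain_cnt == 1 (the caller&apos;s reference)
 *   s = retain(s);            // retain_cnt == 2
 *   release(s);               // retain_cnt == 1, s is still valid
 *   release(s);               // retain_cnt == 0, block is freed
 */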
</code></pre><p>Essentially, all we are doing is prepending a header to each piece of allocated memory and tracking a <code>retain_cnt</code> variable in it. When the <code>retain_cnt</code> hits zero, we will <code>free</code> the block.</p><p>Let&apos;s see how you would use this with an example. We will build an API to read a file.</p><figure class="kg-card kg-code-card"><pre><code class="language-c">typedef struct FILER {
    FILE *fd;
    char *buf;
} FILER;

FILER * filer_open(const char *path) {
    FILE *fd = fopen(path, &quot;r&quot;);
    FILER *filer = allocate(sizeof(FILER));
    filer-&gt;fd = fd;
    filer-&gt;buf = NULL;
    return filer;
}

void filer_close(FILER *filer) {
    fclose(filer-&gt;fd);
    if (filer-&gt;buf)
        release(filer-&gt;buf);
    release(filer);
}

char * filer_read(FILER *filer) {
    if (!filer-&gt;buf)
        filer-&gt;buf = allocate(BUFSIZ);
    fgets(filer-&gt;buf, BUFSIZ, filer-&gt;fd);
    return filer-&gt;buf;
}

</code></pre><figcaption>filer.c</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-c">int main(void) {
    FILER *f = filer_open(&quot;/home/user/.profile&quot;);

    char *line = filer_read(f);
    char *first_line = retain(line);

    filer_close(f);

    // first_line is still valid here

    ...

    // clean up
    release(first_line);
}</code></pre><figcaption>main.c</figcaption></figure><p>As you can see in this example, a major side-benefit of using reference counting is the locality of memory operations. Both <code>filer.c</code> and <code>main.c</code> have equal numbers of <code>allocate/retain</code> and <code>release</code>, making reviewing the code for memory leaks very easy.</p><p>That&apos;s all there is to it!</p>]]></content:encoded></item><item><title><![CDATA[Integrated DB Access From the Shell]]></title><description><![CDATA[Dealing with databases from Shell scripts has always been clunky. Using custom shell built-ins we can develop an idiomatic, safe , and performant way of accessing the db from the Shell.]]></description><link>https://manujbhatia.com/2020/04/08/integrate-db-access-into-the-shell/</link><guid isPermaLink="false">5e8d03e2911dbd00014b892c</guid><category><![CDATA[database]]></category><category><![CDATA[Data Processing]]></category><category><![CDATA[shell scripting]]></category><category><![CDATA[ksh93]]></category><category><![CDATA[ksh]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Wed, 08 Apr 2020 19:08:35 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1468070454955-c5b6932bd08d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1468070454955-c5b6932bd08d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Integrated DB Access From the Shell"><p>One of the main things I do in my scripts regularly is work with databases using their cli.</p><pre><code class="language-bash">psql -d &quot;$DSN&quot; -t -c &quot;select id, email from guest&quot; | while IFS=\| read id email
do
    valid=$(is_email_valid &quot;$email&quot;)
    psql -d &quot;$DSN&quot; -t -c &quot;update guest set is_valid=&apos;$valid&apos; where id=$id&quot;
done</code></pre><p>Even though this is clunky, for a one-off script iterating over a few records, this approach works. It does not scale, however, because it violates the <a href="http://manujbhatia.com/2010/06/15/golden-rule-for-super-fast-shell-scripts/">golden rule</a>.</p><p>Some of the problems with this approach are:</p><ol><li><em>Ripe for a <a href="https://en.wikipedia.org/wiki/SQL_injection">SQL-injection</a> attack</em>. We are essentially building SQL queries by string concatenation, like the early 2000s, and are prone to SQL-injection attacks as a result.</li><li><em>Poor DB Performance</em>. The lack of prepared queries and persistent DB connections means each update has to establish a new DB connection and prepare/execute the query.</li><li><em>No transaction control</em>. Since each update is forced to establish a new DB connection, you effectively cannot batch updates.</li><li><em>Poor Scaling</em>. Since each update has to launch a child process, the overhead for a large input set piles up quickly and becomes almost unworkable during a production issue.</li><li><em>Database Driver is Hard-Coded</em>. Try changing the database driver from <code>psql</code> to <code><a href="http://www.unixodbc.org/">isql</a></code> and you end up doing a lot of manual refactoring of the code base.</li><li><em>Not easily usable interactively</em>. Using the native db utilities in this way is not very comfortable for a terminal user. For example, I might want the headers and labels if I am outputting to my terminal, but would want a CSV when redirecting to a file. As a user, I would need to remember the flags for each of those operations.</li></ol><p>This approach also does not lend itself to idiomatic shell scripting. Here is my attempt to do this in an idiomatic way.</p><pre><code class="language-shell">DSN=&quot;postgres&quot;
SELECT id, email from guest | while read id email
do
    valid=$(is_email_valid &quot;$email&quot;)
    echo &quot;$id&quot; &quot;$valid&quot;
done | UPDATE guest set is_valid = $2 where id = $1</code></pre><p>This relies on the capability of modern shells to define custom <a href="https://docs.oracle.com/cd/E36784_01/html/E36870/builtin-1.html#scrolltoc">built-ins</a>. Combine this with <a href="https://docs.oracle.com/cd/E36784_01/html/E36870/trap-1.html#scrolltoc">KEYBD traps</a> and you can do some fun things interactively too.</p><pre><code class="language-shell-session"># Download the results to a csv file
$ OFS=, SELECT \* from guest &gt; guest.csv
2 rows selected

$ cat guest.csv
1,test@test.com,Y
2,test1@test.com,N

# Delete the data
DELETE from guest
2 rows deleted

# Bulk load data from a csv file
$ IFS=, INSERT into guest &lt; guest.csv
2 rows inserted

# Dump data on the terminal from a query
# (will automatically use PAGER to give a sane user-experience)
$ SELECT \* from guest

id          1
email       test@test.com
is_valid    Y

id          2
email       test1@test.com
is_valid    N

2 rows selected
</code></pre><p>Here is the keyboard trap I use with <code>ksh</code> to make typing these commands on the keyboard convenient and auto-escaping some of the special characters.</p><pre><code class="language-bash">## The trap to use for keyboard bindings (KEYBD) to properly escape SQL queries
function db_keybd
{
   ##############################################
   ## In the code below type Ctrl-V+Esc for ^[ ##
   ##############################################   
   if [[ ${.sh.edchar} == &apos; &apos; ]]; then
      ## Make the keyword consistent case,
      ## so you are not forced to type all upper-case for the keywords
      if [[ ${.sh.edtext} == [sS][eE][lL][eE][cC][tT] ]]; then
         .sh.edchar=&quot;^[0cwSELECT&quot;
      elif [[ ${.sh.edtext} == [uU][pP][dD][aA][tT][eE] ]]; then
         .sh.edchar=&quot;^[0cwUPDATE&quot;
      elif [[ ${.sh.edtext} == [dD][eE][lL][eE][tT][eE] ]]; then
         .sh.edchar=&quot;^[0cwDELETE&quot;
      elif [[ ${.sh.edtext} == [iI][nN][sS][eE][rR][tT] ]]; then
         .sh.edchar=&quot;^[0cwINSERT&quot;
      fi
   fi

   ## Escape the special characters, unless it is already escaped
   [[ ${.sh.edchar} == &apos;*&apos; || ${.sh.edchar} == $&apos;\&apos;&apos; || ${.sh.edchar} == &apos;&quot;&apos; || ${.sh.edchar} == &apos;(&apos; || ${.sh.edchar} == &apos;)&apos; ]] \
      &amp;&amp; [[ ${.sh.edtext} == @(SELECT|UPDATE|DELETE|INSERT)\ * ]] \
      &amp;&amp; [[ ${.sh.edtext:$((.sh.edcol-1)):1} != \\ ]] \
         &amp;&amp; .sh.edchar=\\${.sh.edchar}
}

trap db_keybd KEYBD</code></pre><p>In part 2 of this post we will look at how to develop the plugins to implement the actual functions.</p>]]></content:encoded></item><item><title><![CDATA[Let's Write a New Shell - Part 1]]></title><description><![CDATA[<p>In this series of posts, I&apos;ll make a case that Shell scripts are not used enough in mainstream data processing pipelines, what are the reasons behind that, and how we should develop a new shell based on Golang to address those issues.</p><h1 id="data-processing-pipelines">Data Processing Pipelines</h1><p>I have always</p>]]></description><link>https://manujbhatia.com/2020/04/07/lets-write-a-new-shell/</link><guid isPermaLink="false">5e8cd1e3911dbd00014b87ba</guid><category><![CDATA[shell scripting]]></category><category><![CDATA[utility]]></category><category><![CDATA[Data Processing]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Tue, 07 Apr 2020 21:13:03 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1493147380900-672b878fe69e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1493147380900-672b878fe69e?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Let&apos;s Write a New Shell - Part 1"><p>In this series of posts, I&apos;ll make a case that Shell scripts are not used enough in mainstream data processing pipelines, what are the reasons behind that, and how we should develop a new shell based on Golang to address those issues.</p><h1 id="data-processing-pipelines">Data Processing Pipelines</h1><p>I have always been a big fan of Unix shell pipelines. I think they are a fantastic way to write rapid data processing pipelines.</p><p>Of course, there are many design patterns to present processing pipelines in mainstream languages. 
Look at this example of a simple pipeline from <a href="https://doc.akka.io/docs/akka/current/stream/stream-parallelism.html#pipelining">Akka Streams:</a></p><figure class="kg-card kg-code-card"><pre><code class="language-java">  // #pipelining
  Flow&lt;ScoopOfBatter, HalfCookedPancake, NotUsed&gt; fryingPan1 =
      Flow.of(ScoopOfBatter.class).map(batter -&gt; new HalfCookedPancake());

  Flow&lt;HalfCookedPancake, Pancake, NotUsed&gt; fryingPan2 =
      Flow.of(HalfCookedPancake.class).map(halfCooked -&gt; new Pancake());
  // #pipelining

  @Test
  public void demonstratePipelining() {
    // #pipelining

    // With the two frying pans we can fully cook pancakes
    Flow&lt;ScoopOfBatter, Pancake, NotUsed&gt; pancakeChef = fryingPan1.async().via(fryingPan2.async());
    // #pipelining
  }</code></pre><figcaption>Akka Stream Pipeline Example</figcaption></figure><p>Or, an example from <a href="https://github.com/apache/beam/blob/master/sdks/go/examples/minimal_wordcount/minimal_wordcount.go">Apache Beam</a> data processing:</p><figure class="kg-card kg-code-card"><pre><code class="language-go">var wordRE = regexp.MustCompile(`[a-zA-Z]+(&apos;[a-z])?`)

func main() {
	// beam.Init() is an initialization hook that must be called on startup.
	beam.Init()

	// Create the Pipeline object and root scope.
	p := beam.NewPipeline()
	s := p.Root()

	// Apply the pipeline&apos;s transforms.

	// This example reads a public data set consisting of the complete works
	// of Shakespeare.
	lines := textio.Read(s, &quot;gs://apache-beam-samples/shakespeare/*&quot;)

	words := beam.ParDo(s, func(line string, emit func(string)) {
		for _, word := range wordRE.FindAllString(line, -1) {
			emit(word)
		}
	}, lines)

	counted := stats.Count(s, words)

	formatted := beam.ParDo(s, func(w string, c int) string {
		return fmt.Sprintf(&quot;%s: %v&quot;, w, c)
	}, counted)

	textio.Write(s, &quot;wordcounts.txt&quot;, formatted)

	// Run the pipeline on the direct runner.
	direct.Execute(context.Background(), p)
}</code></pre><figcaption>Apache Beam Minimal Wordcount Example</figcaption></figure><p>Something similar in a shell script would look like this:</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">scoop &lt; batter | cook_pancake_side | flip_to_other_pan | cook_pancake_side &gt; plate</code></pre><figcaption>Cook a Pancake in a Shell Script</figcaption></figure><p>Or, this for the word count:</p><figure class="kg-card kg-code-card"><pre><code class="language-bash"># Split text into words
function words
{
    while IFS=$&apos; \n\t&apos; read -A line
    do
      for word in ${line[*]}
      do
         print &quot;$word&quot;
      done
    done
}
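
## e.g. (illustrative):
##   print &quot;to be or not&quot; | words
## emits one word per line, ready for sort | uniq -c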

gsutil cat &quot;gs://apache-beam-samples/shakespeare/*&quot; | words | sort | uniq -c</code></pre><figcaption>Shell Word Count</figcaption></figure><p>As you can see, the shell pipeline is more precise and expressive on what the inputs &amp; outputs are, and what is happening in the processing.</p><h1 id="limitations-of-shell-pipelines">Limitations of Shell Pipelines</h1><p>Even though shell pipelines are very expressive and easy to write, there are several limitations when it comes to using them for complex data pipelines:</p><ol><li><em>Lack of structured data</em>. All data in a pipeline is a blob. If your data source is a <code>SELECT</code> query, you have to combine all fields of a row into a blob that is specific to your program before passing it to the next step in the pipeline. To work effectively with structured data, each program is required to be aware of which program it pipes to/from, which violates the universal principle of independent utilities combining to produce more sophisticated programs in Unix.</li><li><em>No type-safety</em>. All data in a shell pipeline is treated as a byte stream, typically a string. This again causes the programs to continually cast the input and output values and increases the risk of a run-time type error.</li><li><em>No compatibility check</em>. <code>echo Hello | ls</code> is a perfectly valid stream syntactically even though semantically it makes no sense.</li><li><em>No fan-in, fan-out</em>. You can linearly represent much of data processing, but there are always crucial use-cases where you need to either fan-out an input or fan-in an output. The shell pipelines lack this capability.</li><li><em>Lack of multi-threading</em>. Each step of the pipeline is run in an independent OS process, making invoking a pipeline an expensive operation. Although this overhead is usually not a problem for long-running pipelines, it can make the use of short-lived pipelines prohibitive. 
Also, in a fan-in/fan-out scenario, it is challenging to scale a process-model compared to a thread/co-routine/actor-based model.</li></ol><p>In the next post, we will focus on how we can enhance the syntax of a shell pipeline to handle some of these issues.</p>]]></content:encoded></item><item><title><![CDATA[When Preparation Meets Adversity]]></title><description><![CDATA["Luck is when preparation meets opportunity." All of us have heard this phrase at some point in our life. But what happens when preparation meets adversity??]]></description><link>https://manujbhatia.com/2020/04/07/when-preparation-meets-adversity-2/</link><guid isPermaLink="false">5e8cbf73911dbd00014b87a9</guid><category><![CDATA[philosophy]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Tue, 07 Apr 2020 18:38:34 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1495900158145-fa1e1786861b?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<blockquote>Luck is when preparation meets opportunity.</blockquote><img src="https://images.unsplash.com/photo-1495900158145-fa1e1786861b?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="When Preparation Meets Adversity"><p>All of us have heard this phrase at some point in our life. But what happens when preparation meets adversity??</p><p>I had a taste of this a couple of years ago when I was on my annual working vacation stint at my parents&apos; home in India. I had just been rejected from a job I was confident I would get. I was bored and struggling to complete even the most basic tasks at my current job. My stress level was off the charts, and even though I was physically with my family, I was never there mentally. 
It was one of the lowest points of my adult life.</p><p>Usually, my biggest weapons against stress used to be Diet Coke, ice cream, and Food.</p><p>Now the year before this incident, I had realized how bad my dependence on Diet Coke was and had stopped using it. And, at this time, I was also Icecream-free for a few months for similar reasons. Both those things were not easy, but I was very determined to stick to my guns on staying off of them.</p><p>Not being at my own house, I did not have any of my other coping mechanisms available to me either. I could not go for a drive, go to the gym, read my favorite books, or watch my American TV shows.<br>And even though I don&apos;t drink to deal with stress, I could not have a single drink to unwind, owing to some health issues at the time.</p><p>That left food as the only alternative. However, being at my parents&apos; house meant that I was at a more structured regimen of 3-square meals a day and no access to my usual comfort foods.</p><p>All this forced me to think hard about my stress and tackle the real sources of that stress. For example, I squared with my project manager about my productivity and took some actual time off to enjoy with my family.</p><p>Cut to a couple of months after I came back home, and I realized I had been losing weight without any change to my non-existent exercise schedule! Upon some closer analysis, I discovered that my appetite had drastically reduced. I was not indulging with food, and would usually stop eating if I was not enjoying it (which was a polar opposite of my earlier self, who never cared how the food tasted).</p><p>That experience altered my relationship with food for the better. As a result, I lost 30 pounds in the following 6-months in a very healthy and sustainable manner (getting to a healthy weight has been a life long goal of mine).</p><p>In that moment of suffering, I could not see any silver lining. 
Reflecting on that time, however, I can see that had I not worked hard to stop my Diet Coke and ice cream habits before that incident, this change in my food habits would not have happened.</p><p>As I type this blog post, the whole world is dealing with probably the most significant adversity we ever faced caused by the Coronavirus. We are afraid, stuck away from family, losing jobs, and maybe the worst, losing hope.</p><p>I promise you that you are prepared to handle this adversity in some way that you don&apos;t yet realize. At some point in your life, you will look back and find out that this incident caused you to change your life for the better.</p><p>Keep working hard on yourself and your goals, and stay safe!</p>]]></content:encoded></item><item><title><![CDATA[Behavioral Impact of Tracking Metrics: Loading vs. Unloading a Dishwasher]]></title><description><![CDATA[<blockquote>TL;DR</blockquote><blockquote>Don&apos;t look at a metric just at it&apos;s face value; dig deeper and question the behavior promoted by tracking that metric.</blockquote><p>Metrics. 
The be-all and end-all of any IT ops team and something that most IT executive leaders track on a weekly, if not</p>]]></description><link>https://manujbhatia.com/2018/12/04/tracking-the-right-metrics-loading-or-unloading-a-dishwasher/</link><guid isPermaLink="false">5c060bec9835c80001df1229</guid><category><![CDATA[philosophy]]></category><category><![CDATA[DevOps]]></category><category><![CDATA[optimization]]></category><category><![CDATA[metrics]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Tue, 04 Dec 2018 07:46:18 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1535968541517-44f420ac7c6c?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<blockquote>TL;DR</blockquote><blockquote>Don&apos;t look at a metric just at its face value; dig deeper and question the behavior promoted by tracking that metric.</blockquote><img src="https://images.unsplash.com/photo-1535968541517-44f420ac7c6c?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Behavioral Impact of Tracking Metrics: Loading vs. Unloading a Dishwasher"><p>Metrics. The be-all and end-all of any IT ops team and something that most IT executive leaders track on a weekly, if not daily, basis.</p><p>I like that, lately, a lot more traditionally non-tech companies are focusing on metrics to increase efficiencies throughout the software life cycle. 
However, one thing I notice missing is an informed discussion not only on what metrics are important, but on <em>what metrics promote the desired behavior in the teams</em>.</p><h1 id="to-load-or-unload-that-is-the-question-">To Load or Unload, that is the Question!</h1><p>To illustrate the importance of this point, let&apos;s look at a very simple everyday example: dishwashing.</p><p>Let&apos;s say you live with a roommate, and you need to divvy up the responsibility of dishwashing. A very reasonable thing to do is to alternate the responsibility between the two of you. Now the question is: how do you track who did the dishes last? Most people will probably go by <em>&quot;Who loaded and ran the dishwasher last?&quot;</em>. I argue that <em>it is the wrong metric</em>. Not because it does not track the effort properly, but because it promotes the wrong behavior. The correct metric to track should be <em>&quot;Who unloaded the dishwasher last?&quot;</em>.</p><h2 id="who-loaded-the-dishwasher-last">Who loaded the dishwasher last?</h2><p>First, let&apos;s look at the behavior it promotes:</p><ol><li>Once you have loaded and run the dishwasher, you have no incentive to unload it.</li><li>Since you are not unloading it, you usually end up having clean utensils in the dishwasher and just pulling out what&apos;s necessary.</li><li>This leads to the dirty dishes sitting in the kitchen sink till the dishwasher is organically emptied.</li><li>Also, it&apos;s in your self-interest to load and run the dishwasher before it&apos;s completely full, so you effectively have to rinse and load fewer dishes.</li></ol><p>So, the net effect is two-fold:</p><ol><li>Your kitchen sink is usually full of dirty dishes.</li><li>You pay higher energy costs, since the dishwasher is not always run completely full.</li></ol><h2 id="who-unloaded-the-dishwasher-last">Who unloaded the dishwasher last?</h2><p>Again, looking at the behavior this promotes:</p><ol><li>The dishwasher almost always has dirty dishes in it.</li><li>Since 
the dishwasher mostly holds dirty dishes, new dishes get rinsed and put in right away, instead of sitting in the kitchen sink.</li><li>There is no incentive to run the dishwasher early, so it gets run when it&apos;s needed.</li></ol><p>As you can see, if you only wanted to track one of the two metrics, tracking unloads promotes better behavior!</p><h1 id="tracking-software">Tracking Software</h1><p>Now let&apos;s take this example into the IT field. A couple of the most commonly tracked and cited metrics are:</p><ol><li>Number of Priority 1 (P1) tickets.</li><li>Number of Failed Change Requests (CR).</li></ol><h2 id="p1s">P1s</h2><p>Tracking the number of P1s is not inherently bad, but when it becomes the primary factor in determining the stability and performance of a team, it crosses a line into promoting bad behavior.</p><p>Here is what happens when the executive leadership of an IT department starts monitoring P1s closely:</p><ol><li>The ops teams start down-grading P1s to P2s or even P3s.</li><li>Issues reported directly by business users sometimes don&apos;t even make it to the official ticket-tracking system and are just resolved off the books.</li></ol><h2 id="failed-crs">Failed CRs</h2><p>Tracking failed CRs sounds like a good metric, but in reality it is probably worse than tracking P1s. Let&apos;s analyze what behavior we promote by doing this:</p><ol><li>The overall amount of paperwork done for CRs goes through the roof, because if any CR fails, the &quot;remediation&quot; meeting will usually dissect the paperwork and not the true root cause of the failure, so everyone tries to cover their rear-ends.</li><li>There is an unreasonable amount of testing that starts happening, for the exact same reason as above. If this CR fails, you don&apos;t want to be the guy who didn&apos;t do <em>enough</em> testing of your change. 
So, even a simple change to reduce the amount of data retained by a purge process has to go through multiple rounds of testing by a separate QA team.</li><li>The worst effect of all this is that delivery cycles just get longer, as all your dev and ops teams are worried about bureaucracy and distracted from the real things that matter, like <em>taking calculated risks to deliver business value as fast as possible</em>.</li></ol><h2 id="the-correct-metric-system-uptime">The Correct Metric: System Uptime</h2><p>I would argue that tracking System Uptime, accounting for both planned and unplanned downtime, is a far more useful metric than the number of P1s and failed CRs.</p><p>First, unplanned downtime is a good indicator of the P1s that matter. Not all P1s are created equal: a crashed load balancer that takes your whole website offline is a lot worse than a rogue service instance causing 5% of your call-center agents to fail to book a room on the first try. Tracking system downtime gives you a more normalized view of the problem than the number of P1s.</p><p>Second, it takes the pressure off the devs to have the paperwork down to a tee; it focuses them on making sure their CR will not cause any downtime. Whether or not the actual CR needs to be rolled back, as long as the dev took proper precautions to ensure no downtime, everything is good. 
It also encourages your team to build more resilient architectures: hot-swappable service instances, rolling upgrades, automated deployments/rollbacks, etc.</p><h1 id="conclusion">Conclusion</h1><p>As I have argued here, tracking the wrong metrics can have devastating effects on your organization&apos;s speed and effectiveness.</p><p>It not only reduces the effectiveness of your data-driven decision making, it actively undermines the productivity and efficiency of your teams.</p><p>Think long and hard the next time you want to tie your team&apos;s performance bonus to a reduction in the number of P1s, unless you want your organization to become like the Japanese police department<sup>[<a href="#japanese-police-1">1</a>][<a href="#japanese-police-2">2</a>]</sup>.</p><p></p><!--kg-card-begin: html--><p>
    <a name="japanese-police-1">1.</a> <a href="http://articles.latimes.com/2007/nov/09/world/fg-autopsy9">http://articles.latimes.com/2007/nov/09/world/fg-autopsy9</a>
    <br>
    <a name="japanese-police-2">2.</a> <a href="https://www.vox.com/world/2015/12/13/9989250/japan-crime-conviction-rate">https://www.vox.com/world/2015/12/13/9989250/japan-crime-conviction-rate</a>
</p><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Rockets and the Human Condition]]></title><description><![CDATA[What can humans learn from rockets on how to overcome adversity and achieve new heights?]]></description><link>https://manujbhatia.com/2018/02/24/you-have-to-go-through-shit-to-improve-your-life/</link><guid isPermaLink="false">5a91a6c480f7880001f85016</guid><category><![CDATA[philosophy]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Sat, 24 Feb 2018 23:56:30 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1517976487492-5750f3195933?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=9e99add764dc1e7ca9db187ec0fa82cb" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1517976487492-5750f3195933?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=9e99add764dc1e7ca9db187ec0fa82cb" alt="Rockets and the Human Condition"><p>I have been watching a lot of SpaceX launches<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> lately and one thing that amazes me is how easy it looks from the outside for the Rocket to break through the Earth&apos;s gravity and get to the orbit. In reality, it goes through immense pressure and stress to reach the orbit. Once in the orbit though, it&apos;s almost effortless for the satellite to stay up there.</p>
<p>I could not help but see a parallel between this behavior and the human struggle to better ourselves.<br>
Think of the satellite as a human being who is trying to change her<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> circumstances (professional or personal). Earth is her current reality and the orbit is her dream.</p>
<p>To make her dream a reality, she has to work through similar stresses and pressures to break through the gravity of her current situation: an unsupportive spouse or parents, social pressures, risking a secure life, to name just a few.</p>
<p>However, if she makes it through that high-stress period without <a href="https://youtu.be/WTVkhp0MxMc">imploding or exploding</a>, after a while all the stresses and pressures disappear. What remains is the freedom to navigate life with minimal effort.</p>
<blockquote>
<p>You cannot jump from one stable state of your life to a higher stable state, without going through a lot of pressure and stress.</p>
</blockquote>
<p>If you want that new job that you do not yet qualify for, but cannot find time to learn the required skill because you already have a full-time job, you have to give up sleep and other leisurely activities for some time while you transition to that new job.</p>
<p>Want to lose weight, but are always tired and barely scraping by on your daily duties? You will have to dig deep and find the will-power to power through an exercise schedule for a few weeks, before all that exercising starts making you feel more energetic and more efficient in your daily tasks.</p>
<p>The other key factor in a successful launch is the <a href="https://en.wikipedia.org/wiki/Booster_(rocketry)">booster</a>. It helps the satellite power through the early stages of the flight when the external factors are the strongest.<br>
Similarly, we humans need a good support system of family and friends if we ever hope to break through our current realities and achieve new heights.</p>
<p>In short, success does not come easy. Keep your friends and family close and keep fighting the negative forces; they will eventually disappear and life will become easier again.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>If you have not seen any of the Falcon launches from SpaceX, I highly recommend you check out the <a href="https://youtu.be/wbSwFU6tY1c?t=21m27s">Falcon Heavy test flight</a> and watch through to the landing. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>I am not a feminist in any practical sense, and I just got tired of typing &quot;his/her&quot; everywhere. Considering that &quot;him&quot; got to represent mankind for so long in our literature, I think it&apos;s time for us to switch gears and let &quot;her&quot; be a representative of all mankind for a while. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[What is Something Worth?]]></title><description><![CDATA[Would you ever pay someone more than they make for you?]]></description><link>https://manujbhatia.com/2018/02/21/what-is-something-worth/</link><guid isPermaLink="false">5a8cdad080f7880001f8500b</guid><category><![CDATA[philosophy]]></category><category><![CDATA[economics]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Wed, 21 Feb 2018 02:54:43 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1518183214770-9cffbec72538?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=59b4ffaaec7683aa907e09bd20e72f5b" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><blockquote>
<img src="https://images.unsplash.com/photo-1518183214770-9cffbec72538?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=59b4ffaaec7683aa907e09bd20e72f5b" alt="What is Something Worth?"><p>Stay tuned to find out which of these Bobbleheads is worth five thousand dollars.</p>
</blockquote>
<p>This was the teaser of a news report I saw this morning about Bobbleheads, distributed at a football game, showing up on eBay.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>It got me thinking. What is something worth? Is it what someone wants for it or what someone is willing to pay for it?</p>
<p>I happen to be in the latter camp. The iPhone is pricier than almost any other phone because people are willing to pay for that luxury.</p>
<blockquote>
<p>The worth of something is what someone is willing to pay for it.</p>
</blockquote>
<p>Now, there is a corollary to this rule when it comes to humans though.</p>
<blockquote>
<p>You are always worth more than what someone is willing to pay for you.</p>
</blockquote>
<p>Why is that you ask? It really comes down to basic economics.</p>
<p>Whenever you transact, whether to buy an object or time from another human, you always pay a little less than it is really worth to you, so you can derive some value from that transaction for yourself.</p>
<p>You spend that money on an iPhone because you think it will enrich your life more than what you pay for it in dollars.</p>
<p>In case of employing someone, you will always pay someone a little less than what they are worth to you, so you can pocket the rest as a profit.</p>
<p>So next time you go out for a job interview or ask for a raise, remember:</p>
<blockquote>
<p>You are worth more!</p>
</blockquote>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p><a href="https://www.ebay.com/i/112825450022">eBay Listing</a> <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Bitcoin vs. Gold: Store of Value]]></title><description><![CDATA[This is not a fact-based essay; it is my philosophical argument against Bitcoin being a true store of value. I argue for your Bitcoin to hold its value, how someone has to keep doing work after its mined.]]></description><link>https://manujbhatia.com/2018/02/18/bitcoin-vs-gold-store-of-value/</link><guid isPermaLink="false">5a885bda80f7880001f84ff5</guid><category><![CDATA[bitcoin]]></category><category><![CDATA[gold]]></category><category><![CDATA[economics]]></category><category><![CDATA[philosophy]]></category><category><![CDATA[crypto]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Sun, 18 Feb 2018 00:29:05 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1518544801976-3e159e50e5bb?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=786153ed12a108f00708b727d0092a90" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="tldr">TL;DR</h1>
<img src="https://images.unsplash.com/photo-1518544801976-3e159e50e5bb?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=786153ed12a108f00708b727d0092a90" alt="Bitcoin vs. Gold: Store of Value"><p>This is not a fact-based essay; it is my philosophical argument against Bitcoin being a true store of value. I argue that, for your Bitcoin to hold its value, someone has to keep doing work after it is mined.</p>
<p><strong>Disclaimer</strong>: As I write this, I have no significant investments in either Bitcoin/Crypto or Gold/Precious Metals.</p>
<h1 id="firstthingsfirst">First Things First</h1>
<p>Let me get this out of the way first: I wish I had bought Bitcoin in 2013, or 2014, or the start of 2017! It has probably made more millionaires than anything else in history and I wish I had invested at the right time.<br>
I am not approaching this subject as a sore loser, though, and am more than willing to listen to any criticism of the arguments made here.</p>
<p>Also, this argument is specifically against Bitcoin and other Crypto-Currencies being a true store of value; not against their other uses and the revolutionary nature of the underlying technologies of <a href="https://en.wikipedia.org/wiki/Blockchain">Blockchain</a> and other Crypto Distributed Ledgers.</p>
<h1 id="value">Value</h1>
<p>When I hear the term <a href="https://en.wikipedia.org/wiki/Store_of_value">Store of Value</a>, I see long-term value; not a few months or years; the term should be measured in decades or even generations.</p>
<p>To understand how something can store value, you first need to understand how value is created. In the case of both Bitcoin and Gold, value is created by someone doing work to produce the asset. In the case of Gold, it&apos;s the mining company doing physical work; for Bitcoin, it&apos;s a mining computer solving a mathematical problem to produce a fixed amount of Bitcoin every 10 minutes.</p>
<h1 id="econ101">Econ 101</h1>
<p>To generate value, both Gold and Bitcoin miners have to invest <a href="https://en.wikipedia.org/wiki/Capital_(economics)">capital</a> to buy the equipment (computers, mining machines etc.) and incur <a href="https://www.investopedia.com/terms/p/production-cost.asp">production costs</a> and <a href="https://en.wikipedia.org/wiki/Operating_expense">operating expenses</a> to maintain and run that equipment (cooling, electricity, manual labor etc.). For simplicity, I am going to fold operating expenses into production costs in the remainder of this article.</p>
<p>For a miner to keep producing the asset, the market price of the asset must at least cover the production cost (assuming all capital is already paid off).</p>
<p>For example, if the expenses to produce an ounce of Gold are $1000, but the price of Gold is $900/ounce, the miners will probably shut down the mine.<br>
Similarly, the Bitcoin miners will only keep mining if the <a href="https://en.wikipedia.org/wiki/Bitcoin#Supply">transaction fee + block reward</a> are more valuable than the production costs of the mining rig.</p>
<h1 id="scenario1gold">Scenario 1: Gold</h1>
<p>Consider that you bought some Gold at $1000/ounce and the current cost of mining Gold is $500/ounce. A couple of years down the road, the current mines are drying up and the cost of mining shoots up to $1500/ounce, while the market price holds at $1000/ounce; all of the miners shut down their operations<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. What is the value of <strong>your</strong> Gold now? Still $1000/ounce!<br>
In essence, the value of your Gold is independent of the continuing effort of others or even you. There is no action required on anybody&apos;s part for your Gold to retain its value.</p>
<h1 id="scenario2bitcoin">Scenario 2: Bitcoin</h1>
<p>Now, let&apos;s look at a similar example with Bitcoin. You bought your Bitcoin at $1000/coin. The cost of mining shoots up to more than $1000/coin and all the miners shut down their operations.<br>
What value does your Bitcoin have? A big fat zero! Why?<br>
In contrast to Gold, where the value is stored in the physical asset, Bitcoin&apos;s value is based on a distributed ledger hosted on the network. And, the network requires constant upkeep for you to maintain the value of <strong>your</strong> Bitcoin.</p>
<h1 id="conclusion">Conclusion</h1>
<blockquote>
<p>Unlike Gold, Bitcoin requires a constant amount of work by the network to maintain its value. Hence, it is not a true store of value.</p>
</blockquote>
<p>In some respects, Bitcoin is more like a security<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>, where you are betting that value will be maintained by someone&apos;s future work, and less like a <a href="https://en.wikipedia.org/wiki/Commodity">commodity</a>, where value has already been generated and is maintained by demand for the commodity.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p><a href="http://www.mining.com/sibanye-stillwater-shuts-cooke-gold-mine-7000-laid-off/">http://www.mining.com/sibanye-stillwater-shuts-cooke-gold-mine-7000-laid-off/</a> <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p><a href="http://clintonlawfirm.blogspot.com/2011/01/what-is-security-under-federal.html">Definition of a Security</a> as interpreted by the Supreme Court... &quot;[a]n investment contract for purposes of the Securities Act means a contract, transaction or scheme whereby a person invests his money in a common enterprise and is led to expect profits solely from the efforts of the promoter or a third party. 66 S.Ct. at 1103.&quot; <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Automation 201: Processes and Frameworks]]></title><description><![CDATA[<!--kg-card-begin: markdown--><blockquote>
<p>We need more [better] processes to make this operation repeatable!</p>
</blockquote>
<p>Working in a corporate IT development environment, I hear that statement a lot. Every time something goes wrong with a deployment, a project is running late or over budget, a team is not meeting its objectives, the root cause meeting</p>]]></description><link>https://manujbhatia.com/2018/02/08/automation-201-processes-and-frameworks/</link><guid isPermaLink="false">59af9d66dac8930001351cd9</guid><category><![CDATA[philosophy]]></category><category><![CDATA[automation]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Thu, 08 Feb 2018 07:02:38 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1503551723145-6c040742065b?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjEzNDA2fQ&amp;s=dd172a15dd030cf60e8b9c317eda7d41" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><blockquote>
<img src="https://images.unsplash.com/photo-1503551723145-6c040742065b?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjEzNDA2fQ&amp;s=dd172a15dd030cf60e8b9c317eda7d41" alt="Automation 201: Processes and Frameworks"><p>We need more [better] processes to make this operation repeatable!</p>
</blockquote>
<p>Working in a corporate IT development environment, I hear that statement a lot. Every time something goes wrong with a deployment, a project runs late or over budget, or a team misses its objectives, the root-cause meeting always ends with this conclusion. Well, at least it seems that way!</p>
<p>Do I disagree with this conclusion? Not necessarily!<br>
I think there are better ways of handling these issues through tools, automation and organizational structure, but processes do have a critical role to play in a stable production environment.</p>
<p>Now, there is nothing inherently wrong with processes. However, processes are not free! While they make operations repeatable, they do not make them reusable.</p>
<blockquote>
<p>Repeatability adds consistency to an operation, reusability makes it faster.</p>
</blockquote>
<p>I recently came to the realization that, contrary to popular belief, processes do not add reusability. They only add repeatability and predictability.<br>
Think about it: a checklist to deploy an application is repeatable and provides consistency of the output (as long as the checklist is complete). However, it&apos;s not reusable! Every new person executing the checklist for the first time suffers from the same learning curve and pitfalls. Even on the millionth execution, the time taken to follow the process is still the same.</p>
<blockquote>
<p>Processes are repeatable, frameworks are reusable.</p>
</blockquote>
<p>And the main reason is that a process aims to replace human intellect by &quot;documenting&quot; all the tasks that need to be accomplished for repeatability, whereas a framework insists on automation based on conventions that need to be followed.</p>
<p>Take, for example, memory-management &quot;best practices&quot; in the <code>C</code> programming language.<br>
One way to address memory safety is by mandating that memory management be a line item on a code-review checklist. This is a linear-complexity operation: the more memory operations in the code, the more time it takes to verify during the code review.<br>
Conversely, you can come up with a memory-management convention, like <code>Objective-C</code>&apos;s <a href="https://developer.apple.com/library/content/documentation/General/Conceptual/DevPedia-CocoaCore/MemoryManagement.html">memory management guidelines</a>.<br>
Now during the code review you only need to verify that the guidelines around memory allocation/deallocation have been followed, no matter the size of the application. In fact, Apple was able to take these conventions and codify them as <a href="https://en.wikipedia.org/wiki/Automatic_Reference_Counting">ARC</a>, completely automating the memory-safety reviews.</p>
<blockquote>
<p>Process is a set of tasks to complete, framework is a convention to follow.</p>
</blockquote>
<p>And, this is the key difference. Next time you have a problem to solve, see if you can build a convention-based framework, instead of a task-based process.</p>
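<p>As a concrete (if toy) illustration of the difference, here is a sketch of a convention-based &quot;framework&quot; in shell. All names here are hypothetical, invented for this post: instead of a checklist of deployment tasks, the framework enforces a single convention, that an app directory may ship executable <code>pre_deploy</code>/<code>deploy</code>/<code>post_deploy</code> hooks, and it runs whichever ones exist.</p>

```bash
## Hypothetical convention-based deploy framework: it knows nothing about
## any particular application; it only runs the hooks an app chooses to
## provide, in a fixed order.
deploy() {
    app_dir="$1"
    for hook in pre_deploy deploy post_deploy; do
        ## Convention: an executable file named after the hook gets run.
        if [ -x "$app_dir/$hook" ]; then
            "$app_dir/$hook"
        fi
    done
}

## Demo: a throwaway app that follows the convention (only two hooks).
appdir=$(mktemp -d)
printf '#!/bin/sh\necho "pre: backing up"\n' > "$appdir/pre_deploy"
printf '#!/bin/sh\necho "deploy: copying build"\n' > "$appdir/deploy"
chmod +x "$appdir/pre_deploy" "$appdir/deploy"
deploy "$appdir"
## prints:
## pre: backing up
## deploy: copying build
```

<p>Note that adding a new step for one app is that app&apos;s job (drop in a hook); the framework itself never grows with the number of applications, which is exactly what a checklist cannot offer.</p>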
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Pretty Emails from Shell Scripts]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>One of the key uses of shell scripts is to monitor systems and jobs and send out notification alerts to email addresses. With the advent of smartphones and always-connected devices, these emails are not limited to 140 characters of plain text and can provide a rich layout and format to make</p>]]></description><link>https://manujbhatia.com/2016/08/14/pretty-emails-from-shell-scripts/</link><guid isPermaLink="false">59af9d66dac8930001351cd6</guid><category><![CDATA[ksh]]></category><category><![CDATA[utility]]></category><category><![CDATA[Unix]]></category><category><![CDATA[shell scripting]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Sun, 14 Aug 2016 06:18:54 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1483058712412-4245e9b90334?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=c5ac6d2dec8b86901f460790fbfbe79e" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1483058712412-4245e9b90334?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=c5ac6d2dec8b86901f460790fbfbe79e" alt="Pretty Emails from Shell Scripts"><p>One of the key uses of shell scripts is to monitor systems and jobs and send out notification alerts to email addresses. With the advent of smartphones and always-connected devices, these emails are not limited to 140 characters of plain text and can provide a rich layout and format to make it easier to understand the issue at hand and provide useful recovery information for support personnel.</p>
<p>You can combine <code>uuencode</code> and <code>sendmail</code> to send formatted HTML notification emails with attachments. Add some shell-scripting glue and you can put everything in a simple script, as shown below.</p>
<h3 id="synopsis">Synopsis</h3>
<pre><code class="language-bash">notify_html &lt;email&gt; &lt;subject&gt; [&quot;&lt;attach1&gt; [name1]&quot; ... &quot;&lt;attachN&gt; [nameN]&quot;] &lt; body.txt
</code></pre>
<p>Example,</p>
<pre><code class="language-bash">gen_error_html | notify_html notify@example.org &quot;Issue with job one&quot; \
       &quot;one.3212345.log one.log&quot;
</code></pre>
<h3 id="script">Script</h3>
<pre><code class="language-bash">
typeset -r UTILS_UUENCODE=uuencode
typeset -r UTILS_SENDMAIL=sendmail

## Takes a list of file names and encodes them properly to be attached to an email
function process_attachments
{

    ## loop over all the remaining parameters and create uuencode commands to encode attachments
    for arg in &quot;$@&quot;
    do
        ## The remote name for the attachment is optional, so append the same name if it is missing
        if [[ $arg != *\ * ]]; then
            $UTILS_UUENCODE $arg $arg
        else
            $UTILS_UUENCODE $arg
        fi
    done
}

function notify_html 
{
    typeset -r system=$(uname -s)   ## lowercase to match its uses below
 
    typeset email=&quot;$1&quot;
    typeset subject=&quot;$2&quot;
    shift 2
    
    (
        ## pass-through email body content
        cat
        ## Add a footer
        echo &quot;\nMessage generated from $system at $(date)&quot;
        ## Add any attachments
        process_attachments &quot;$@&quot;
    ) | sendmail_html &quot;$email&quot; &quot;$system: $subject&quot;

    return $?
}

function sendmail_html
{
    typeset email=&quot;$1&quot;
    typeset subject=&quot;$2&quot;

    (
        echo &quot;Subject: $subject&quot;
        echo &quot;MIME-Version: 1.0&quot;
        echo &quot;Content-Type: text/html&quot;
        echo &quot;Content-Disposition: inline&quot;
        ## A blank line is required to separate the headers from the body
        echo &quot;&quot;
        ## pass-through email body content, including attachments
        cat
    ) | $UTILS_SENDMAIL $email
}

notify_html &quot;$@&quot;
</code></pre>
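<p>To try the script end-to-end, you need something that emits an HTML body. The helper below is a hypothetical companion (the name <code>gen_status_html</code> and its layout are invented for illustration): it turns <code>label value</code> pairs from stdin into a small HTML table you can pipe into <code>notify_html</code> above.</p>

```bash
## Hypothetical helper: turn "label value" pairs on stdin into a small
## HTML table, suitable as the body of an email sent via notify_html.
gen_status_html() {
    echo '<table border="1" cellpadding="4">'
    while read -r label value; do
        echo "<tr><td>$label</td><td>$value</td></tr>"
    done
    echo '</table>'
}

## Example invocation (address and subject are placeholders):
##   printf 'Job nightly_load\nRecords 1042\n' | gen_status_html |
##       notify_html ops@example.org "Nightly load status"
printf 'Records 1042\n' | gen_status_html
## emits the table wrapper with one row:
## <tr><td>Records</td><td>1042</td></tr>
```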
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[K-shawk!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Ok. I started writing this when I was standing in line to pick up an iPhone 4 (that&#x2019;s right, this post has been that long in the making!), so this post was planned to be short with few examples!</p>
<p>I just wanted to introduce <code>ksh93</code> to everyone (or</p>]]></description><link>https://manujbhatia.com/2016/08/14/kshawk/</link><guid isPermaLink="false">59af9d66dac8930001351cc4</guid><category><![CDATA[ksh]]></category><category><![CDATA[ksh93]]></category><category><![CDATA[Unix]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Sun, 14 Aug 2016 03:48:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1475018608413-f84fe2a42b7a?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=75234a0e4ebd154936b57f56ccb76b22" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1475018608413-f84fe2a42b7a?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=75234a0e4ebd154936b57f56ccb76b22" alt="K-shawk!"><p>Ok. I started writing this when I was standing in line to pick up an iPhone 4 (that&#x2019;s right, this post has been that long in the making!), so this post was planned to be short with few examples!</p>
<p>I just wanted to introduce <code>ksh93</code> to everyone (or, as my very good friend calls it, K-shawk!).</p>
<p>When it comes to shells, most people know about <code>sh</code>, <code>csh</code>, <code>ksh</code>, <code>bash</code>, etc., and the strengths and weaknesses of each. However, very few people know that there are two versions of ksh available out there.</p>
<ol>
<li><code>ksh</code> &#x2013; the most popular version packaged with pretty much all UNIX OSes. This is really <code>ksh88</code>. In other words, it&#x2019;s the version of <code>ksh</code> that was implemented in 1988.</li>
<li><code>ksh93</code> &#x2013; this is not as popular as <code>ksh88</code>, but in my experience, it&#x2019;s packaged in every commercial OS.</li>
</ol>
<p>I highly recommend upgrading to <code>ksh93</code> for all your shell scripts and interactive shell. Why? Simply because it has a lot of features that were available to you only if you mixed <code>awk</code> within your shell scripts (and hence the nickname K-shawk!). Here are some of the main ones:</p>
<ul>
<li>
<p><strong>Associative arrays</strong>. This is a big one. It helps you create associations (&#x2018;hashes&#x2019; in many languages) between strings. Traditional arrays can only use integers as indices; with associative arrays, you can use strings as indices.</p>
<pre><code>  $ typeset -A x=( [A]=1 [B]=2 )
  $ echo ${x[A]},${x[B]}
  1,2
</code></pre>
</li>
<li>
<p><strong>Compound Variables</strong>. These are C-style structures that group multiple variables together and can be passed to functions as parameters.</p>
<pre><code>  $ typeset -C y=( A=1 B=2 )
  $ echo ${y.A},${y.B}
  1,2
</code></pre>
</li>
<li>
<p><strong>Advanced Variable Substitution</strong>. You can now extract substrings and perform string substitution directly within the variable substitution construct.</p>
<pre><code>  $ x=Hello
  $ echo ${x//Hell/Problem}
  Problemo
  $ echo ${x:0:4}
  Hell
</code></pre>
</li>
<li>
<p><strong>New Date Capabilities</strong>. <code>printf</code> can now handle date/time formatting, enabling a lot of date calculations that previously required mixing some <code>perl</code> into your shell scripts.</p>
<pre><code>  $ printf &quot;%T\n&quot; now
  Sat Aug 13 20:41:54 CDT 2016
  $ printf &quot;%(%Y-%m-%d)T\n&quot; now
  2016-08-13
</code></pre>
</li>
</ul>
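<p>As a quick taste of how the associative-array syntax combines with ordinary loops, here is a word-count sketch that needs no <code>awk</code> at all. It runs in <code>ksh93</code> and also in <code>bash</code> 4+, which adopted the same syntax; the word list is invented for illustration:</p>

```shell
# Count word occurrences with an associative array, then walk the keys.
typeset -A count
for word in apple banana apple cherry banana apple; do
    (( count[$word]++ ))
done

# "${!count[@]}" expands to the list of keys in the array
for key in "${!count[@]}"; do
    printf '%s=%d\n' "$key" "${count[$key]}"
done
```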
<p>There is a lot of depth to <code>ksh93</code>, and more recent versions of the shell extend the compound variable syntax to provide object-oriented programming constructs and a lot more.<br>
You can check which version comes with your OS distribution by running <code>ksh93 --version</code> or <code>ksh --version</code> (most Linux distributions ship only <code>ksh93</code> and link it as <code>ksh</code>).</p>
<p>Visit David Korn&apos;s website <a href="http://KornShell.com">KornShell.com</a> for more details.<br>
Also, <a href="http://blog.fpmurphy.com/tag/ksh93">Musings of an OS Plumber</a> has great tutorials on some advanced <code>ksh93</code> features.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Fun with Pipes: Copy Directories Across Servers With SSH]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Many UNIX experts would swear by the use of pipes in the day-to-day operations, but very few know to combine it with <code>ssh</code> to perform operations across different servers.</p>
<p>The following command would copy the files contained in <code>$SOURCE_FILES</code> to <code>$TARGET_DIR</code> (any directory structure in <code>$SOURCE_FILES</code> will</p>]]></description><link>https://manujbhatia.com/2016/08/09/fun-with-pipes-copy-directories-across-servers-with-ssh/</link><guid isPermaLink="false">59af9d66dac8930001351ccf</guid><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Tue, 09 Aug 2016 05:43:15 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1512375305577-4dc349fcef0b?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=96ead22ddde76818816e40e42b6df314" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1512375305577-4dc349fcef0b?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=96ead22ddde76818816e40e42b6df314" alt="Fun with Pipes: Copy Directories Across Servers With SSH"><p>Many UNIX experts would swear by the use of pipes in their day-to-day operations, but very few know how to combine them with <code>ssh</code> to perform operations across different servers.</p>
<p>The following command would copy the files contained in <code>$SOURCE_FILES</code> to <code>$TARGET_DIR</code> (any directory structure in <code>$SOURCE_FILES</code> will be created under <code>$TARGET_DIR</code> on the remote machine):</p>
<pre><code class="language-bash">export SOURCE_FILES=file1
export TARGET_DIR=/copy/file/here
export REMOTE_MACHINE=remote.server.net

tar -cvf - $SOURCE_FILES | gzip -9 -c \
    | ssh $REMOTE_MACHINE &quot;(cd $TARGET_DIR; gunzip -c | tar -xvf -)&quot;
</code></pre>
<p>You might wonder why you should not just use <code>sftp</code> or <code>scp</code> to perform the same function... a few reasons:</p>
<ol>
<li>Not every server allows <code>scp</code>, since it requires interactive login, and some service accounts are not allowed to perform interactive logins for security reasons. <code>sftp</code> is a subsystem built into <code>sshd</code>, so it's a lot more restrictive and secure.</li>
<li><code>sftp</code> does not preserve file permissions; it behaves like <code>ftp</code> and marks all files as non-executable.</li>
<li>If you are behind multiple servers, which again is a network segmentation/security technique, you would have an <em>edge server</em> between your laptop and the destination server. This means you need enough disk space on the edge server to stage the files before pushing them to the final destination. The following is an example of how you can extend this pattern to <em>hop</em> across multiple servers without requiring any disk space on the intermediate servers.</li>
</ol>
<pre><code class="language-bash">export SOURCE_FILES=file1
export TARGET_DIR=/copy/file/here
export HOP_MACHINE=edge.server.net
export REMOTE_MACHINE=remote.server.net

tar -cvf - $SOURCE_FILES | gzip -9 -c \
    | ssh $HOP_MACHINE \
        &quot;ssh $REMOTE_MACHINE &apos;(cd $TARGET_DIR; gunzip -c | tar -xvf -)&apos;&quot;
</code></pre>
<p>This same pattern can be used to execute arbitrary code that reads from <code>stdin</code> and processes it. For example, I use this approach to dump data from an application server to a <code>hadoop</code> cluster via an edge server, without having to stage the files on the edge server.</p>
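<p>If you want to experiment with the pipe mechanics without access to two servers, each <code>ssh</code> hop can be stood in for by a plain subshell; the data flow through the pipeline is identical. This is only a local sketch with made-up directory names:</p>

```shell
# Stage a small source tree in a temporary directory
demo=$(mktemp -d)
mkdir -p "$demo/src/sub" "$demo/dst"
echo "hello" > "$demo/src/file1"
echo "world" > "$demo/src/sub/file2"

# The second subshell plays the role of: ssh $REMOTE_MACHINE "(cd ...; ...)"
( cd "$demo/src" && tar -cf - file1 sub/file2 ) | gzip -c \
    | ( cd "$demo/dst" && gunzip -c | tar -xf - )
```

Note that the directory structure (<code>sub/</code>) is recreated on the "remote" side, exactly as in the ssh version.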
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[You Are Not Fully Utilizing Your VCS!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><blockquote>
<p>The primary purpose of a VCS is not to put stuff in it, but to take stuff out as needed.</p>
</blockquote>
<p>Ponder over that statement before you read on.</p>
<p>There is a fundamental difference between how big Corporate IT treats version control vs the Silicon Valley and Startups.</p>
<p>Most Corporate IT</p>]]></description><link>https://manujbhatia.com/2016/03/16/you-are-not-fully-utilizing-your-vcs/</link><guid isPermaLink="false">59af9d66dac8930001351cca</guid><category><![CDATA[VCS]]></category><category><![CDATA[DevOps]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Wed, 16 Mar 2016 08:03:21 GMT</pubDate><media:content url="http://manujbhatia.com/content/images/2018/02/dreamstime_xxl_61526938_1920.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><blockquote>
<img src="http://manujbhatia.com/content/images/2018/02/dreamstime_xxl_61526938_1920.jpg" alt="You Are Not Fully Utilizing Your VCS!"><p>The primary purpose of a VCS is not to put stuff in it, but to take stuff out as needed.</p>
</blockquote>
<p>Ponder over that statement before you read on.</p>
<p>There is a fundamental difference between how big Corporate IT treats version control versus how Silicon Valley and startups do.</p>
<p>Most Corporate IT teams use version control as cavemen used fire... just stick raw food in it (just check your code into a VCS)!<br>
<img src="http://manujbhatia.com/content/images/2016/03/dreamstime_xxl_61526938.jpg" alt="You Are Not Fully Utilizing Your VCS!" loading="lazy"></p>
<p>Silicon Valley understands that fire drives cars, airplanes, rockets and, yes, jetpacks :-) (VCS is the very foundation of DevOps)!<br>
<img src="http://manujbhatia.com/content/images/2016/03/dreamstime_xxl_38690327.jpg" alt="You Are Not Fully Utilizing Your VCS!" loading="lazy"></p>
<p>Now, this comparison is a little exaggerated, but the underlying point remains. You can use your VCS for basic code check-ins and collaboration, or you can make it the powerhouse that drives all of your development and operations.</p>
<p>Here are some things you are (probably) not using your VCS for!</p>
<h1 id="codereviews">Code Reviews</h1>
<blockquote>
<p>If you are not using a pre-commit code review system, you are not really doing code reviews!</p>
</blockquote>
<p>Developers are by definition lazy people (we write programs so we don&apos;t have to do things manually!). An honor- or convention-based code review, conducted at the end of a coding cycle via emails or code walkthroughs, is mostly skipped or given lip service. The only effective way of doing a proper code review is to stop the developer from finishing the commit unless the code passes all quality checks and code reviews.</p>
<p>Now, don&apos;t confuse code reviews with just manual peer reviews. This is where all the magic happens: automated static code analysis, syntax checks, build verification, unit test verification, and so on.</p>
<p>Using a modern version control system like Git with the help of tools like <a href="https://www.gerritcodereview.com/">Gerrit</a>, your version control becomes a code review hub with very little overhead for the developers.<br>
Developers use the same workflow that they are used to. For example, here is what a regular commit and a code review request look like for a developer using Git + Gerrit.</p>
<h6 id="regularcommit">Regular Commit</h6>
<pre><code class="language-bash">$ vim file.c ## Make your changes
$ git commit -a -m &apos;Changed something&apos; file.c
$ git push origin master:master  ## Push directly to the main branch
</code></pre>
<h6 id="codereview">Code Review</h6>
<pre><code class="language-bash">$ vim file.c ## Make your changes
$ git commit -a -m &apos;Changed something&apos; file.c
$ git push origin master:refs/for/master   ## Push to a different destination to trigger a code review
</code></pre>
<p>As you can see, the developer did not have to do anything extra to create a code review request.</p>
<h1 id="debugging">Debugging</h1>
<p>Clean, atomic commits make it very easy to pinpoint when a bug was introduced into the code base.<br>
If your VCS has maintained a clean history, you can use tools like <code>git bisect</code> to quickly identify when a particular regression was introduced into your codebase.</p>
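<p>To sketch the workflow, <code>git bisect run</code> can drive the binary search automatically when given a test command. The throwaway repository, file name, and "bug" below are invented for illustration:</p>

```shell
# Build a throwaway repo with 10 commits; commit 7 introduces a "bug"
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name Demo

for i in 1 2 3 4 5 6 7 8 9 10; do
    if [ "$i" -ge 7 ]; then echo "buggy" > app.txt; else echo "ok" > app.txt; fi
    git add app.txt
    git commit -qm "commit $i"
done

# Mark the known-bad (HEAD) and known-good (HEAD~9) endpoints, then let
# git binary-search; the test command exits non-zero on buggy commits.
git bisect start HEAD HEAD~9
git bisect run sh -c 'grep -q ok app.txt'
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset
git log -1 --format=%s "$first_bad"   # the commit that introduced the bug
```

With clean history, the search takes only log2(N) test runs, which is why atomic commits pay off here.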
<h1 id="trackdeveloperproductivity">Track Developer Productivity</h1>
<p>Using Gerrit&apos;s changesets allows you to have your developers check in work-in-progress (WIP) at the end of every day. Then you can report on the activity of each developer across days and months. This is especially useful when you have contractors working on your codebase. Not only can you oversee their productivity, you also get all the work into a central repository at the end of each day.<br>
Here is a default dashboard for a member&apos;s contribution from <a href="https://github.com">Github</a>.</p>
<h6 id="githubdevelopercontribution"><a href="https://github.com/salivian">Github Developer Contribution</a></h6>
<p><img src="http://manujbhatia.com/content/images/2016/02/Screen-Shot-2016-02-24-at-11-40-44-PM.png" alt="You Are Not Fully Utilizing Your VCS!" loading="lazy"></p>
<h1 id="identifyproductsatrisk">Identify Products at Risk</h1>
<p>The pulse of commits and committers on a repository is a very good indicator of how actively a product in your company is maintained. You can write reports based on the activity in a product&apos;s source code repository to determine whether it is actively supported. If a critical product in the company falls below a certain number of active contributors or commits per day, that could be a trigger to look into augmenting staff for that product.<br>
Again, here are in-built dashboards from github.com, indicating current product health.</p>
<h6 id="dailycommitactivity"><a href="https://github.com/FreeCodeCamp/FreeCodeCamp/graphs/punch-card">Daily Commit Activity</a></h6>
<p><img src="http://manujbhatia.com/content/images/2016/02/Screen-Shot-2016-02-24-at-10-27-04-PM.png" alt="You Are Not Fully Utilizing Your VCS!" loading="lazy"></p>
<h6 id="contributionhistoryandtopcommitters"><a href="https://github.com/FreeCodeCamp/FreeCodeCamp/graphs/contributors">Contribution History and Top Committers</a></h6>
<p><img src="http://manujbhatia.com/content/images/2016/02/Screen-Shot-2016-02-24-at-10-27-21-PM.png" alt="You Are Not Fully Utilizing Your VCS!" loading="lazy"></p>
<h6 id="commitsperweek"><a href="https://github.com/FreeCodeCamp/FreeCodeCamp/graphs/commit-activity">Commits per week</a></h6>
<p><img src="http://manujbhatia.com/content/images/2016/02/Screen-Shot-2016-02-24-at-10-28-03-PM.png" alt="You Are Not Fully Utilizing Your VCS!" loading="lazy"></p>
<h1 id="truecollaboration">True Collaboration</h1>
<p>Maintaining a VCS that allows a maintainer vs contributor model has tremendous benefits for the IT organization.<br>
Allowing everyone in the organization to access the code and submit a patch for review as a contributor, which the maintainers can accept or reject, creates a level playing field across your organization.<br>
Projects benefit from a wide range of people reviewing and contributing to the product and developers benefit by having the ability to expand their horizons into newer products and technologies.</p>
<h1 id="releases">Releases</h1>
<p>Instead of developers/build managers doing builds for stage/production, the VCS tags drive your build servers.<br>
If a particular version of your product is ready, you simply tag the version in your VCS as <code>release/3.0.1-beta1</code> and it triggers your build server to build that version and push it to your artifact repository.<br>
This eliminates developer mistakes such as building the wrong version or picking up local changes in the build.<br>
Using immutable tags also makes tracing builds back to source control versions very easy, providing very good auditability of releases.<br>
Using digitally-signed tags, verified by the build server, can ensure that only authorized developers can actually release software.<br>
Also, maintaining clean commits allows the build server to automatically generate release notes for each build.</p>
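<p>The tagging step itself is a one-liner. Here is a minimal sketch in a throwaway repository, assuming the <code>release/*</code> naming convention described above; the version number and file are made up:</p>

```shell
# Set up a throwaway repo with one commit to tag
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name Demo
echo '3.0.1' > VERSION
git add VERSION
git commit -qm 'Prepare 3.0.1-beta1'

# Annotated tag: this is what a build server would watch for.
git tag -a release/3.0.1-beta1 -m 'Release 3.0.1 beta 1'
# For a digitally-signed tag (requires a GPG key), use -s instead of -a.

git tag -l 'release/*'      # list release tags
git describe --tags         # map the current commit back to its tag
```

In a real setup, pushing the tag (<code>git push origin release/3.0.1-beta1</code>) is the event that triggers the build server.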
<h1 id="deployments">Deployments</h1>
<p>Further using digitally-signed tags and a carefully crafted build/artifact server, developers can actually perform no-button deployments by appropriately tagging a particular version in the VCS for release.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Why Switch to Git?]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h1 id="overview">Overview<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></h1>
<p><a href="http://git-scm.com/about">Git</a> is a widely used source code management system for software development. It is a <a href="https://en.wikipedia.org/wiki/Distributed_version_control">distributed revision control</a> system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows. Git was initially designed and developed in 2005 by Linux kernel developers (including Linus Torvalds) for</p>]]></description><link>https://manujbhatia.com/2016/03/13/why-switch-to-git/</link><guid isPermaLink="false">59af9d66dac8930001351ccd</guid><category><![CDATA[Git]]></category><category><![CDATA[VCS]]></category><dc:creator><![CDATA[Manuj Bhatia]]></dc:creator><pubDate>Sun, 13 Mar 2016 01:02:12 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1465447142348-e9952c393450?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=30c2837564064f98a3a3b79633cfcaed" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="overview">Overview<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></h1>
<img src="https://images.unsplash.com/photo-1465447142348-e9952c393450?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=30c2837564064f98a3a3b79633cfcaed" alt="Why Switch to Git?"><p><a href="http://git-scm.com/about">Git</a> is a widely used source code management system for software development. It is a <a href="https://en.wikipedia.org/wiki/Distributed_version_control">distributed revision control</a> system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows. Git was initially designed and developed in 2005 by Linux kernel developers (including Linus Torvalds) for Linux kernel development.</p>
<p>As companies are adopting DevOps practices, they are realizing that their version control system is not only a place to track revisions of their code, but is the backbone of their whole DevOps practice! I highly recommend going through the recent blog post I wrote about <a href="http://manujbhatia.com/you-are-not-fully-utilizing-your-vcs">how you are not fully utilizing your VCS</a> for some background.</p>
<h1 id="disclaimer">Disclaimer</h1>
<p>While I would recommend any and all projects and teams to adopt Git as their primary VCS, please do not try to migrate your teams to Git without proper initial training.</p>
<p>Git (in general, any Distributed VCS) is fundamentally a very different system, compared to traditional VCSes like SVN. This can really confuse developers used to a central VCS. Take your time and develop a migration/training strategy before switching over.</p>
<p>Also, SVN does have very narrow advantages over Git in some edge use-cases. A good comparison of Git and SVN can be found at the <a href="https://git.wiki.kernel.org/index.php/GitSvnComparison">Linux Kernel Wiki</a> and at <a href="https://www.wikivs.com/wiki/Git_vs_Subversion">WikiVS</a>. You should consider whether one of these use-cases applies to your team.</p>
<h1 id="whygit">Why Git?</h1>
<h2 id="industryleader">Industry Leader</h2>
<p>Git&apos;s popularity has been rapidly increasing year-over-year. A testament to Git&apos;s popularity is that the founders of Subversion, <a href="http://www.collab.net/products/enterprise-git">CollabNet</a>, have moved Subversion to an <a href="http://subversion.apache.org/">Apache Software Foundation</a> project and are now providing Enterprise Git solutions<sup class="footnote-ref"><a href="#fn1" id="fnref1:1">[1:1]</a></sup>.</p>
<p>In the last published Eclipse Community Survey of developers (2014), Git overtook SVN as the most widely used version control system.<sup class="footnote-ref"><a href="#fn1" id="fnref1:2">[1:2]</a></sup><br>
<img src="http://manujbhatia.com/content/images/2016/03/Screen-Shot-2016-03-03-at-7-11-37-PM.png" alt="Why Switch to Git?" loading="lazy"></p>
<p>Although not very scientific, Google search trends also show interest in Git rising since 2009 and SVN declining<sup class="footnote-ref"><a href="#fn1" id="fnref1:3">[1:3]</a></sup>:<br>
<img src="http://manujbhatia.com/content/images/2016/03/Screen-Shot-2016-03-09-at-9-00-18-AM.png" alt="Why Switch to Git?" loading="lazy"></p>
<p>As a developer, you will also increasingly see Git as a required job skill<sup class="footnote-ref"><a href="#fn1" id="fnref1:4">[1:4]</a></sup>:<br>
<img src="http://manujbhatia.com/content/images/2016/03/Screen-Shot-2016-03-09-at-9-26-00-AM.png" alt="Why Switch to Git?" loading="lazy"></p>
<h2 id="speed">Speed</h2>
<p>Due to its distributed nature, Git is an order of magnitude faster than SVN: most day-to-day operations involve no network communication and take only a few milliseconds. Even switching a branch takes under a second on repositories as big as the Linux kernel.</p>
<p>Here are some benchmarks<sup class="footnote-ref"><a href="#fn1" id="fnref1:5">[1:5]</a></sup><br>
<img src="http://manujbhatia.com/content/images/2016/03/Screen-Shot-2016-03-03-at-7-41-35-PM.png" alt="Why Switch to Git?" loading="lazy"></p>
<h2 id="integrations">Integrations</h2>
<p>Due to its industry leader status, almost every development tool out there supports Git. In fact, most new tools support Git first and add SVN support later, or not at all (Microsoft Visual Studio 2015 has built-in support for Git and not SVN<sup class="footnote-ref"><a href="#fn1" id="fnref1:6">[1:6]</a></sup>). Plugins for integration with <a href="https://help.rallydev.com/git">CA Rally</a>, <a href="https://marketplace.atlassian.com/plugins/com.xiplink.jira.git.jira_git_plugin/server/overview">Atlassian Jira</a>, and <a href="https://hpln.hpe.com/page/source-code-management-integration-ali">HP ALI</a> are available.</p>
<h2 id="defactocloudstandard">De facto Cloud Standard</h2>
<p>Almost every cloud-based service supports or requires Git out of the box; many don&apos;t support SVN at all.</p>
<ul>
<li><a href="https://developer.ibm.com/bluemix/2016/02/22/github-enterprise-service/">IBM&apos;s BlueMix&apos;s DevOps</a> solution integrates with Github as the source code repository.</li>
<li><a href="https://cloud.google.com/source-repositories/docs/">Google&apos;s Cloud Platform</a> provides Git repository hosting and <a href="https://cloud.google.com/appengine/">AppEngine</a> integrates with Git.</li>
<li><a href="https://azure.microsoft.com/en-us/documentation/articles/web-sites-publish-source-control/">Microsoft Azure</a> integrates with Git to publish websites.</li>
<li><a href="https://aws.amazon.com/">Amazon AWS</a> provides a Git compatible repository hosting service with <a href="https://aws.amazon.com/codecommit/">CodeCommit</a> and has built-in integration with Github for their Continuous Delivery offering, <a href="https://aws.amazon.com/codepipeline/product-integrations/">CodePipeline</a>.</li>
</ul>
<h2 id="reliability">Reliability</h2>
<p>Git is very reliable when it comes to managing your code. The commands that you would use on a day-to-day basis never leave your working copy in an inconsistent state, and even if you do get stuck somewhere mid-operation, Git gives you a clean way to abort and recover.</p>
<p>SVN, on the other hand, has a tendency to leave your working copy in an intermediate state on failed merges, branch switches, etc., making you spend hours manually cleaning out your working copy. This is one of the big reasons that SVN users rarely switch branches in the same working copy; they always use and recommend a separate working copy for each branch.</p>
<h2 id="developerproductivity">Developer Productivity</h2>
<p>Git has great branching and merging features that allow developers to perform multiple merges per day, encouraging smaller units of work to be merged. This leads to a much more manageable merge structure instead of monolithic merges.<br>
Here is an example of merges done on the Git repository itself by one of the maintainers<sup class="footnote-ref"><a href="#fn1" id="fnref1:7">[1:7]</a></sup>:<br>
<img src="http://manujbhatia.com/content/images/2016/03/Screen-Shot-2016-03-03-at-7-47-10-PM.png" alt="Why Switch to Git?" loading="lazy"></p>
<p>Also, due to the fast speed and some git-specific features like stashes, it becomes super easy for a developer to switch context.</p>
<h2 id="sourcecodeintegritysecurity">Source Code Integrity &amp; Security</h2>
<p>The underlying data model of a Git repository makes it tamper-evident. A malicious user who has admin access to your source code repo cannot alter your repo&apos;s history without raising a red flag.<sup class="footnote-ref"><a href="#fn1" id="fnref1:8">[1:8]</a></sup></p>
<p>SVN offers no such guarantees; a malicious user with admin access to the SVN repo can easily inject arbitrary code into your repository&apos;s history without detection.</p>
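<p>This tamper-evidence follows from Git's content-addressed object model: every commit records the hash of its tree and of its parent commit, so history forms a hash chain and rewriting any old commit changes every descendant commit id. A minimal sketch that inspects the chain (repository contents are invented for illustration):</p>

```shell
# Throwaway repo with two commits
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name Demo
echo 'v1' > file.txt; git add file.txt; git commit -qm 'first'
echo 'v2' > file.txt; git add file.txt; git commit -qm 'second'

# The second commit embeds the id of the first; altering 'first'
# would change this parent line and hence the id of 'second'.
parent=$(git rev-parse HEAD~1)
git cat-file -p HEAD    # prints the tree, parent, author, committer lines
```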
<h1 id="references">References</h1>
<p><a href="http://www.itjobswatch.co.uk/default.aspx?page=1&amp;sortby=0&amp;orderby=0&amp;q=git+svn&amp;id=0&amp;lid=2618">http://www.itjobswatch.co.uk/default.aspx?page=1&amp;sortby=0&amp;orderby=0&amp;q=git+svn&amp;id=0&amp;lid=2618</a></p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p><a href="http://git-scm.com/about/info-assurance">http://git-scm.com/about/info-assurance</a> <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:1" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:2" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:3" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:4" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:5" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:6" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:7" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:8" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>