Add an exponentially rolling garbage collection script

git-exproll is a git garbage collection script aimed
specifically at reducing exccessive garbage collection and
particularly large packfile churn for Gerrit installations.

Excessive garbage collection on "dormant" repos is wasteful
of both CPU and disk IO.  Large packfile churn can lead to
heavy RAM and FS usage on Gerrit servers when the Gerrit
process continues to hold open the old delete packfiles.
This situation is most detrimental when jgit is configured
with large caching parameters.  Aside from these downsides,
running git gc often can be very beneficial to performance
on servers.  This script attempts to implement a git gc
policy which avoids the downsides mentioned above so that
git gc can be comfortably run very regularly.

git-exproll.sh uses keep files to manage which files will
get repacked.  It also uses timestamps on the repos to
detect dormant repos to avoid repacking them at all.  The
primary packfile objective is to keep around a series of
packfiles with sizes spaced out exponentially from each
other, and to roll smaller packfiles into larger ones once
the smaller ones have grown.  This strategy attempts to
balance disk space usage with avoiding rewriting large
packfiles most of the time.

The exponential packing objective above does not save a
large amount of time or CPU, but it does prevent the
packfile churn.  Depending on repo usage, however the
dormant repo detection and avoidance can result in a very
large time savings.

Change-Id: I5e8c3cc02f6ca382d34f62af2cf12a9cb8ffd21e
diff --git a/contrib/git-exproll.sh b/contrib/git-exproll.sh
new file mode 100644
index 0000000..9526d9f
--- /dev/null
+++ b/contrib/git-exproll.sh
@@ -0,0 +1,566 @@
+#!/bin/bash
+# Copyright (c) 2012, Code Aurora Forum. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#    # Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#    # Redistributions in binary form must reproduce the above
+#       copyright notice, this list of conditions and the following
+#       disclaimer in the documentation and/or other materials provided
+#       with the distribution.
+#    # Neither the name of Code Aurora Forum, Inc. nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT
+# ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+# BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+# OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
+# IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+usage() { # error_message
+
+    cat <<-EOF
+		usage: $(basename $0) [-unvt] [--noref] [--nolosse] [-r|--ratio number]
+		                      [git gc option...] git.repo
+
+		-u|-h                usage/help
+		-v verbose
+		-n dry-run           don't actually repack anything
+		-t touch             treat repo as if it had been touched
+		--noref              avoid extra ref packing timestamp checking
+		--noloose            do not run just because there are loose object dirs
+		                     (repacking may still run if they are referenced)
+		-r ratio <number>    packfile ratio to aim for (default 10)
+
+		git gc option        will be passed as args to git gc
+
+		git.repo             to run gc against
+
+		Garbage collect using a pseudo logarithmic packfile maintenance
+		approach.  This approach attempts to minimize packfile churn
+		by keeping several generations of varying sized packfiles around
+		and only consolidating packfiles (or loose objects) which are
+		either new packfiles, or packfiles close to the same size as
+		another packfile.
+
+		An estimate is used to predict when rollups (one consolidation
+		would cause another consolidation) would occur so that this
+		rollup can be done all at once via a single repack.  This reduces
+		both the runtime and the pack file churn in rollup cases.
+
+		Approach: plan each consolidation by creating a table like this:
+
+		Id Keep Size           Sha1(or consolidation list)      Actions(repack down up note)
+		1     - 11356          9052edfb7392646cd4e5f362b953675985f01f96 y - - New
+		2     - 429088         010904d5c11cd26a79fda91b01ab454d1001b402 y - - New
+		c1    - 440444         [1,2]                                    - - -
+
+		Id:    numbers preceded by a c are estimated "c pack" files
+		Keep:  - none, k private keep, o our keep
+		Size:  in disk blocks (default du output)
+		Sha1:  of packfile, or consolidation list of packfile ids
+		Actions
+		repack: - n no, y yes
+		down:   - noop, ^ consolidate with a file above
+		up:     - noop, v consolidate with a file below
+		note:   Human description of script decisions:
+		         New (file is a new packfile)
+		         Consolidate with:<list of packfile ids>
+		         (too far from:<list of packfile ids>)
+
+		On the first pass, always consolidate any new packfiles along
+		with loose objects and along with any packfiles which are within
+		the ratio size of their predecessors (note, the list is ordered
+		by increasing size).  After each consolidation, insert a fake
+		consolidation, or "c pack", to naively represent the size and
+		ordered positioning of the anticipated new consolidated pack.
+		Every time a new pack is planned, rescan the list in case the
+		new "c pack" would cause more consolidation...
+
+		Once the packfiles which need consolidation are determined, the
+		packfiles which will not be consolidated are marked with a .keep
+		file, and those which will be consolidated will have their .keep
+		removed if they have one.  Thus, the packfiles with a .keep will
+		not get repacked.
+
+		Packfile consolidation is determined by the --ratio parameter
+		(default is 10).  This ratio is somewhat of a tradeoff.  The
+		smaller the number, the more packfiles will be kept on average;
+		this increases disk utilization somewhat.  However, a larger
+		ratio causes greater churn and may increase disk utilization due
+		to deleted packfiles not being reclaimed since they may still be
+		kept open by long running applications such as Gerrit.  Sane
+		ratio values are probably between 2 and 10.  Since most
+		consolidations actually end up smaller than the estimated
+		consolidated packfile size (due to compression), the true ratio
+		achieved will likely be 1 to 2 greater than the target ratio.
+		The smaller the target ratio, the greater this discrepancy.
+
+		Finally, attempt to skip garbage collection entirely on untouched
+		repos.  In order to determine if a repo has been touched, use the
+		timestamp on the script's keep files, if any relevant file/dir
+		is newer than a keep marker file, assume that the repo has been
+		touched and gc needs to run.  Also assume gc needs to run whenever
+		there are loose object dirs since they may contain untouched
+		unreferenced loose objects which need to be pruned (once they
+		expire).
+
+		In order to allow the keep files to be an effective timestamp
+		marker to detect relevant changes in a repo since the last run,
+		all relevant files and directories which may be modified during a
+		gc run (even during a noop gc run), must have their timestamps
+		reset to the same time as the keep files or gc will always run
+		even on untouched repos.  The relevant files/dirs are all those
+		files and directories which garbage collection, object packing,
+		ref packing and pruning might change during noop actions.
+EOF
+
+    [ -n "$1" ] && info "ERROR $1"
+
+    exit
+}
+
+debug() { [ -n "$SW_V" ] && info "$1" ; }
+info() { echo "$1" >&2 ; }
+
+array_copy() { #v2 # array_src array_dst
+    local src=$1 dst=$2
+    local s i=0
+    eval s=\${#$src[@]}
+    while [ $i -lt $s ] ; do
+        eval $dst[$i]=\"\${$src[$i]}\"
+        i=$(($i + 1))
+    done
+}
+
+array_equals() { #v2 # array_name [vals...]
+    local a=$1 ; shift
+    local s=0 t=() val
+    array_copy "$a" t
+    for s in "${!t[@]}" ; do s=$((s+1)) ; done
+    [ "$s" -ne "$#" ] && return 1
+    for val in "${t[@]}" ; do
+        [ "$val" = "$1" ] || return 2
+        shift
+    done
+    return 0
+}
+
+packs_sizes() { # git.repo > "size pack"...
+    du -s "$1"/objects/pack/pack-$SHA1.pack | sort -n 2> /dev/null
+}
+
+is_ourkeep() { grep -q "$KEEP" "$1" 2> /dev/null ; } # keep
+has_ourkeep() { is_ourkeep "$(keep_for "$1")" ; } # pack
+has_keep() { [ -f "$(keep_for "$1")" ] ; } # pack
+is_repo() { [ -d "$1/objects" ] && [ -d "$1/refs/heads" ] ; } # git.repo
+
+keep() { # pack   # returns true if we added our keep
+    keep=$(keep_for "$1")
+    [ -f "$keep" ] && return 1
+    echo "$KEEP" > "$keep"
+    return 0
+}
+
+keep_for() { # packfile > keepfile
+    local keep=$(echo "$1" | sed -es'/\.pack$/.keep/')
+    [ "${keep/.keep}" = "$keep" ] && return 1
+    echo "$keep"
+}
+
+idx_for() { # packfile > idxfile
+    local idx=$(echo "$1" | sed -es'/\.pack$/.idx/')
+    [ "${idx/.idx}" = "$idx" ] && return 1
+    echo "$idx"
+}
+
+# pack_or_keep_file > sha
+sha_for() { echo "$1" | sed -es'|\(.*/\)*pack-\([^.]*\)\..*$|\2|' ; }
+
+private_keeps() { # git.repo -> sets pkeeps
+    local repo=$1 ary=$2
+    local keep keeps=("$repo"/objects/pack/pack-$SHA1.keep)
+    pkeeps=()
+    for keep in "${keeps[@]}" ; do
+        is_ourkeep "$keep" || pkeeps=("${pkeeps[@]}" "$keep")
+    done
+}
+
+is_tooclose() { [ "$(($1 * $RATIO))" -gt "$2" ] ; } # smaller larger
+
+unique() { # [args...] > unique_words
+    local lines=$(while [ $# -gt 0 ] ; do echo "$1" ; shift ; done)
+    lines=$(echo "$lines" | sort -u)
+    echo $lines  # as words
+}
+
+outfs() { # fs [args...] > argfs...
+    local fs=$1 ; shift
+    [ $# -gt 0 ] && echo -n "$1" ; shift
+    while [ $# -gt 0 ] ; do echo -n "$fs$1" ; shift ; done
+}
+
+sort_list() { # < list > formatted_list
+    # n has_keep size sha repack down up note
+    awk '{ note=$8; for(i=8;i<NF;i++) note=note " "$(i+1)
+           printf("%-5s %s %-14s %-40s %s %s %s %s\n", \
+                     $1,$2,   $3,  $4, $5,$6,$7,note)}' |\
+        sort -k 3,3n -k 1,1n
+}
+
+is_touched() { # git.repo
+    local repo=$1
+    local loose keep ours newer
+    [ -n "$SW_T" ] && { debug "$SW_T -> treat as touched" ; return 0 ; }
+
+    if [ -z "$SW_LOOSE" ] ; then
+        # If there are loose objects, they may need to be pruned,
+        # run even if nothing has really been touched.
+        loose=$(find "$repo/objects" -type d \
+                      -wholename "$repo/objects/[0-9][0-9]"
+                      -print -quit 2>/dev/null)
+        [ -n "$loose" ] && { info "There are loose object directories" ; return 0 ; }
+    fi
+
+    # If we don't have a keep, the current packfiles may not have been
+    # compressed with the current gc policy (gc may never have been run),
+    # so run at least once to repack everything.  Also, we need a marker
+    # file for timestamp tracking (a dir needs to detect changes within
+    # it, so it cannot be a marker) and our keeps are something we control,
+    # use them.
+    for keep in "$repo"/objects/pack/pack-$SHA1.keep ; do
+        is_ourkeep "$keep" && { ours=$keep ; break ; }
+    done
+    [ -z "$ours" ] && { info 'We have no keep (we have never run?): run' ; return 0 ; }
+
+    debug "Our timestamp keep: $ours"
+    # The wholename stuff seems to get touched by a noop git gc
+    newer=$(find "$repo/objects" "$repo/refs" "$repo/packed-refs" \
+                  '!' -wholename "$repo/objects/info" \
+                  '!' -wholename "$repo/objects/info/*" \
+                  -newer "$ours" \
+                  -print -quit 2>/dev/null)
+    [ -z "$newer" ] && return 1
+
+    info "Touched since last run: $newer"
+    return 0
+}
+
+touch_refs() { # git.repo start_date refs
+    local repo=$1 start_date=$2 refs=$3
+    (
+        debug "Setting start date($start_date) on unpacked refs:"
+        debug "$refs"
+        cd "$repo/refs" || return
+        # safe to assume no newlines in a ref name
+        echo "$refs" | xargs -d '\n' -n 1 touch -c -d "$start_date"
+    )
+}
+
+set_start_date() { # git.repo start_date refs refdirs packedrefs [packs]
+    local repo=$1 start_date=$2 refs=$3 refdirs=$4 packedrefs=$5 ; shift 5
+    local pack keep idx repacked
+
+    # This stuff is touched during object packs
+    while [ $# -gt 0 ] ; do
+        pack=$1 ; shift
+        keep="$(keep_for "$pack")"
+        idx="$(idx_for "$pack")"
+        touch -c -d "$start_date" "$pack" "$keep" "$idx"
+        debug "Setting start date on: $pack $keep $idx"
+    done
+    # This will prevent us from detecting any deletes in the pack dir
+    # since gc ran, except for private keeps which we are checking
+    # manually.  But there really shouldn't be any other relevant deletes
+    # in this dir which should cause us to rerun next time, deleting a
+    # pack or index file by anything but gc would be bad!
+    debug "Setting start date on pack dir: $start_date"
+    touch -c -d "$start_date" "$repo/objects/pack"
+
+
+    if [ -z "$SW_REFS" ] ; then
+        repacked=$(find "$repo/packed-refs" -newer "$repo/objects/pack"
+                      -print -quit 2>/dev/null)
+        if [ -n "$repacked" ] ; then
+            # The ref dirs and packed-ref files seem to get touched even on
+            # a noop refpacking
+            debug "Setting start date on packed-refs"
+            touch -c -d "$start_date" "$repo/packed-refs"
+            touch_refs "$repo" "$start_date" "$refdirs"
+
+            # A ref repack does not imply a ref change, but since it is
+            # hard to tell, simply assume so
+            if [ "$refs" != "$(cd "$repo/refs" ; find -depth)" ] || \
+               [ "$packedrefs" != "$(<"$repo/packed-refs")" ] ; then
+                # We retouch if needed (instead of simply checking then
+                # touching) to avoid a race between the check and the set.
+                debug "  but refs actually got packed, so retouch packed-refs"
+                touch -c "$repo/packed-refs"
+            fi
+        fi
+    fi
+}
+
+note_consolidate() { # note entry > note (no duplicated consolidated entries)
+    local note=$1 entry=$2
+    local entries=() ifs=$IFS
+    if  echo "$note" | grep -q 'Consolidate with:[0-9,c]' ; then
+        IFS=,
+        entries=( $(echo "$note" | sed -es'/^.*Consolidate with:\([0-9,c]*\).*$/\1/') )
+        note=( $(echo "$note" | sed -es'/Consolidate with:[0-9,c]*//') )
+        IFS=$ifs
+    fi
+    entries=( $(unique "${entries[@]}" "$entry") )
+    echo "$note Consolidate with:$(outfs , "${entries[@]}")"
+}
+
+note_toofar() { # note entry > note (no duplicated "too far" entries)
+    local note=$1 entry=$2
+    local entries=() ifs=$IFS
+    if  echo "$note" | grep -q '(too far from:[0-9,c]*)' ; then
+        IFS=,
+        entries=( $(echo "$note" | sed -es'/^.*(too far from:\([0-9,c]*\)).*$/\1/') )
+        note=( $(echo "$note" | sed -es'/(too far from:[0-9,c]*)//') )
+        IFS=$ifs
+    fi
+    entries=( $(unique "${entries[@]}" "$entry") )
+    echo "$note (too far from:$(outfs , "${entries[@]}"))"
+}
+
+last_entry() { # isRepack pline repackline > last_rows_entry
+    local size_hit=$1 pline=$2 repackline=$3
+    if [ -n "$pline" ] ; then
+        if [ -n "$size_hit" ] ; then
+            echo "$repack_line"
+        else
+            echo "$pline"
+        fi
+    fi
+}
+
+init_list() { # git.repo > shortlist
+    local repo=$1
+    local file
+    local n has_keep size sha repack
+
+    packs_sizes "$1" | {
+        while read size file ; do
+            n=$((n+1))
+            repack=n
+            has_keep=-
+            if has_keep "$file" ; then
+                has_keep=k
+                has_ourkeep "$file" && has_keep=o
+            fi
+            sha=$(sha_for "$file")
+            echo "$n $has_keep $size $sha $repack"
+        done
+    } | sort_list
+}
+
+consolidate_list() { # run < list > list
+    local run=$1
+    local sum=0 psize=0 sum_size=0 size_hit pn clist pline repackline
+    local n has_keep size sha repack down up note
+
+    {
+        while read n has_keep size sha repack down up note; do
+            [ -z "$up" ] && up='-'
+            [ -z "$down" ] && down="-"
+
+            if [ "$has_keep" = "k" ] ; then
+                echo "$n $has_keep $size $sha $repack - - Private"
+                continue
+            fi
+
+            if [ "$repack" = "n" ] ; then
+                if is_tooclose $psize $size ; then
+                    size_hit=y
+                    repack=y
+                    sum=$(($sum + $sum_size + $size))
+                    sum_size=0 # Prevents double summing this entry
+                    clist=($(unique "${clist[@]}" $pn $n))
+                    down="^"
+                    [ "$has_keep" = "-" ] && note="$note New +"
+                    note=$(note_consolidate "$note" "$pn")
+                elif [ "$has_keep" = "-" ] ; then
+                    repack=y
+                    sum=$(($sum + $size))
+                    sum_size=0 # Prevents double summing this entry
+                    clist=($(unique "${clist[@]}" $n))
+                    note="$note New"
+                elif [ $psize -ne 0 ] ; then
+                    sum_size=$size
+                    down="!"
+                    note=$(note_toofar "$note" "$pn")
+                else
+                    sum_size=$size
+                fi
+            else
+                sum_size=$size
+            fi
+
+            # By preventing "c files" (consolidated) from being marked
+            # "repack" they won't get keeps
+            repack2=y
+            [ "${n/c}" != "$n" ] && { repack=- ; repack2=- ; }
+
+            last_entry "$size_hit" "$pline" "$repack_line"
+            # Delay the printout until we know whether we are
+            # being consolidated with the entry following us
+            # (we won't know until the next iteration).
+            # size_hit is used to determine which of the lines
+            # below will actually get printed above on the next
+            # iteration.
+            pline="$n $has_keep $size $sha $repack $down $up $note"
+            repack_line="$n $has_keep $size $sha $repack2 $down v $note"
+
+            pn=$n ; psize=$size # previous entry data
+            size_hit='' # will not be consolidated up
+
+        done
+        last_entry "$size_hit" "$pline" "$repack_line"
+
+        [ $sum -gt 0 ] && echo "c$run - $sum [$(outfs , "${clist[@]}")] - - -"
+
+    } | sort_list
+}
+
+process_list() { # git.repo > list
+    local list=$(init_list "$1")  plist run=0
+
+    while true ; do
+        plist=$list
+        run=$((run +1))
+        list=$(echo "$list" | consolidate_list "$run")
+        if [ "$plist" != "$list" ] ; then
+            debug "------------------------------------------------------------------------------------"
+            debug "$HEADER"
+            debug "$list"
+        else
+            break
+        fi
+    done
+    debug "------------------------------------------------------------------------------------"
+    echo "$list"
+}
+
+repack_list() { # git.repo < list
+    local repo=$1
+    local start_date newpacks=0 pkeeps keeps=1 refs refdirs rtn
+    local packedrefs=$(<"$repo/packed-refs")
+
+    # so they don't appear touched after a noop refpacking
+    if [ -z "$SW_REFS" ] ; then
+        refs=$(cd "$repo/refs" ; find -depth)
+        refdirs=$(cd "$repo/refs" ; find -type d -depth)
+        debug "Before refs:"
+        debug "$refs"
+    fi
+
+    # Find a private keep snapshot which has not changed from
+    # before our start_date so private keep deletions during gc
+    # can be detected
+    while ! array_equals pkeeps "${keeps[@]}" ; do
+       debug "Getting a private keep snapshot"
+       private_keeps "$repo"
+       keeps=("${pkeeps[@]}")
+       debug "before keeps: ${keeps[*]}"
+       start_date=$(date)
+       private_keeps "$repo"
+       debug "after keeps: ${pkeeps[*]}"
+    done
+
+    while read n has_keep size sha repack down up note; do
+        if [ "$repack" = "y" ] ; then
+            keep="$repo/objects/pack/pack-$sha.keep"
+            info "Repacking $repo/objects/pack/pack-$sha.pack"
+            [ -f "$keep" ] && rm -f "$keep"
+        fi
+    done
+
+    ( cd "$repo" && git gc "${GC_OPTS[@]}" ) ; rtn=$?
+
+    # Mark any files withoug a .keep with our .keep
+    packs=("$repo"/objects/pack/pack-$SHA1.pack)
+    for pack in "${packs[@]}" ; do
+        if keep "$pack" ; then
+            info "New pack: $pack"
+            newpacks=$((newpacks+1))
+        fi
+    done
+
+    # Record start_time.  If there is more than 1 new packfile, we
+    # don't want to risk touching it with an older date since that
+    # would prevent consolidation on the next run.  If the private
+    # keeps have changed, then we should run next time no matter what.
+    if [ $newpacks -le 1 ] || ! array_equals pkeeps "${keeps[@]}" ; then
+        set_start_date "$repo" "$start_date" "$refs" "$refdirs" "$packedrefs" "${packs[@]}"
+    fi
+
+    return $rtn # we really only care about the gc error code
+}
+
+git_gc() { # git.repo
+    local list=$(process_list "$1")
+    if [ -z "$SW_V" ] ; then
+        info "Running $PROG on $1.  git gc options: ${GC_OPTS[@]}"
+        echo "$HEADER" >&2
+        echo "$list" >&2 ;
+    fi
+    echo "$list" | repack_list "$1"
+}
+
+
+PROG=$(basename "$0")
+HEADER="Id Keep Size           Sha1(or consolidation list)      Actions(repack down up note)"
+KEEP=git-exproll
+HEX='[0-9a-f]'
+HEX10=$HEX$HEX$HEX$HEX$HEX$HEX$HEX$HEX$HEX$HEX
+SHA1=$HEX10$HEX10$HEX10$HEX10
+
+RATIO=10
+SW_N='' ; SW_V='' ; SW_T='' ; SW_REFS='' ; SW_LOOSE='' ; GC_OPTS=()
+while [ $# -gt 0 ] ; do
+    case "$1" in
+        -u|-h)  usage ;;
+        -n)  SW_N="$1" ;;
+        -v)  SW_V="$1" ;;
+
+        -t)  SW_T="$1" ;;
+        --norefs)  SW_REFS="$1" ;;
+        --noloose) SW_LOOSE="$1" ;;
+
+        -r|--ratio)  shift ; RATIO="$1" ;;
+
+        *)  [ $# -le 1 ] && break
+            GC_OPTS=( "${GC_OPTS[@]}" "$1" )
+            ;;
+    esac
+    shift
+done
+
+
+REPO="$1"
+if ! is_repo "$REPO" ; then
+    REPO=$REPO/.git
+    is_repo "$REPO" || usage "($1) is not likely a git repo"
+fi
+
+
+if [ -z "$SW_N" ] ; then
+    is_touched "$REPO" || { info "Repo untouched since last run" ; exit ; }
+    git_gc "$REPO"
+else
+    is_touched "$REPO" || info "Repo untouched since last run, analyze anyway."
+    process_list "$REPO" >&2
+fi