This file produces a two-level randomization following the general ideas from Ichino and Schuendeln (2012) which in turn draws from Sinclair, McConnell, Green (2012).1

.libPaths("libraries")
library(RItools)
source("code/two-level-sampler.R")

Imagine that we have 10 large upper-level units (like counties or states) and each of these upper-level units have 50 lower-level units within it. Here, for an example, half of the upper-level units will be assigned to contain treated lower-level units. And have of the individuals within the treatment eligible lower-level units will be assigned treatment.

First, set up the values in the experiment. The units:

set.seed(20151204)

numStates<-10 ## like states
countiesPerState<-4
stateSizes <- rep( countiesPerState , numStates)

countySizesMat<-sapply(stateSizes,function(ncounties){
              trunc(runif(ncounties,min=10,max=50))
})
colnames(countySizesMat)<-paste("state",1:numStates,sep="")
rownames(countySizesMat)<-paste("county",1:countiesPerState,sep="")

countySizes<-as.vector(countySizesMat)

N<-sum(countySizes)

farmerid<-1:N
countyid<-rep(rep(1:countiesPerState,numStates),countySizes)
stateid<-rep(1:numStates,colSums(countySizesMat))
statecountyid<-paste(stateid,countyid,sep=".")
## Make a fake dataset:
dat<-data.frame(farmerid=farmerid,
        countyid=countyid,
        stateid=stateid)
dat$statecountyid<-factor(statecountyid,
              levels=unique(statecountyid))

## Make sure that we can re-create the list of county sizes (number of farmers within counties)
stopifnot(all.equal(as.vector(table(dat$statecountyid)),countySizes))
## How many people within each county (first, second, third ... )
table(dat$countyid)

  1   2   3   4 
223 240 285 289 
## How many people within each state
table(dat$stateid)

  1   2   3   4   5   6   7   8   9  10 
117 123  85 108 119 121  88  56 117 103 

Add some fake covariates to demonstrate randomization testing below. So far we are not blocking, but I imagine that we would just run the assignment separately within each block.

## A variable with constant variance but different means by country
countymeans<-rnorm(length(countySizes))
dat$x1<-unsplit(lapply(split(dat,dat$statecountyid),function(d){
             rnorm(nrow(d),mean=countymeans) }),dat$statecountyid)

## A binary variable with no relationship to county per se
dat$x2<-rbinom(nrow(dat),size=1,prob=.33)

## The size of the county as a covariate
dat$countySize<-unsplit(table(dat$statecountyid),statecountyid)

Define the number of counties eligible for any treatment assignment within each state. Here, with 4 counties in the experiment per state, we assign only one of them to receive treatments.

numTrtedState <- rep(1,length(stateSizes)) 

Define the number of farmers within each county eligible for treatment assignment. Here, for now, assign half of the farmers in each treatment-eligible county to treatment.

numTrtedCounty <- floor(countySizes/2)

Second, define the “sampler” (i.e. the randomizer, the function which does the random assignment).

twoLevelSampler <- upperLowerSampler(upperBlockSize = stateSizes,
                     treatedUpper   = numTrtedState,
                     lowerBlockSize = countySizes,
                     treatedLower   = numTrtedCounty)

Third, try it out. The command twoLevelSampler(1) generates one random assignment following that design. If we wanted lots of random assignments because we were using re-randomization for covariate balance or to do statistical testing or assessment of our statistical procedures, we would use numbers other than 1 in the parenthesis.

set.seed(20151204)
Z1<-twoLevelSampler(1)
str(Z1)
List of 2
 $ weight : num 1
 $ samples: num [1:1037, 1] 0 0 0 0 0 0 0 0 0 0 ...
dat$Z1<-Z1$samples

Fourth, test the code. What should be true if this randomization followed the design?

One county and only one county per state should have any treated units:

test1tabs<-lapply(split(dat,dat$stateid),function(d){
     table(d$Z1,d$countyid) })

## Each state should have only 1 county with any treated units
test1<-sapply(test1tabs,function(tab){ sum(tab[2,]!=0) })
stopifnot(all(test1==1))

Within treated counties, roughly half of the farmers should be assigned to treatment. The following table shows only one county per state assigned to any treatment, and roughly half of the farmers in that county assigned to treatment.

tmp<-with(dat,table(Z1,statecountyid))
tmp
   statecountyid
Z1  1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 4.1 4.2 4.3 4.4 5.1
  0  30  15  16  41  18  37  10  41  10  20  15  25  33  31  33   6  22
  1   0   0  15   0  17   0   0   0   0   0  15   0   0   0   0   5   0
   statecountyid
Z1  5.2 5.3 5.4 6.1 6.2 6.3 6.4 7.1 7.2 7.3 7.4 8.1 8.2 8.3 8.4 9.1 9.2
  0  34  19  26  24  20  18  41   8  40  13  19  11  15  14   8   5  17
  1   0  18   0   0   0  18   0   8   0   0   0   0   0   0   8   5   0
   statecountyid
Z1  9.3 9.4 10.1 10.2 10.3 10.4
  0  42  48   32    6   39   21
  1   0   0    0    5    0    0
test2tab<-tmp[,tmp["1",]!=0]
test2<-test2tab[2,]/colSums(test2tab)
test2
      1.3       2.1       3.3       4.4       5.3       6.3       7.1 
0.4838710 0.4857143 0.5000000 0.4545455 0.4864865 0.5000000 0.5000000 
      8.4       9.1      10.2 
0.5000000 0.5000000 0.4545455 
## We cannot guarantee exactly half, so just make sure it is around .5
stopifnot(all(test2>.4 & test2<.6))

Randomization Balance Assessment

Assess whether or not the randomization is balanced (it really should be, and should pass the other tests, too, but it is reasonable to check.) This is a bit complicated because of the two-level randomization. However, we can get some rough ideas in a couple of ways. First, we can look at certain covariates alone using the randomization distribution arising directly from re-assigning treatment following the two level design above 1000 times.

ritx1<-RItest(y=dat$x1, z=dat$Z1, test.stat=mean.difference, sampler=twoLevelSampler, samples=1000)
ritx2<-RItest(y=dat$x2, z=dat$Z1, test.stat=mean.difference, sampler=twoLevelSampler, samples=1000)

## P-values from the randomization distribution 
c(ritx1, ritx2)
[1] 0.798 0.119

Second, we can use a Normal theory/Large-sample approximation if we pretended that we didn’t have two level randomization but one level blocked by county within state:

xb1<-xBalance(Z1~x1+x2,strata=list(statecounty=~statecountyid),
          data=dat,report="all")

First, look at the omnibus test: across all covariates, how much information do we have against the hypothesis that this design was randomized. (The design, is that farmers were randomly assigned treatment within counties and states with equal probability.)

xb1$overall
            chisquare df   p.value
statecounty 0.2031501  2 0.9034134

Next look for some more descriptive information about whether the covariates were balanced. Not looking at the one-by-one p-values except as descriptive information (since, often, we have many covariates and don’t want to look at many many tests).

xb1$results
, , strata = statecounty

    stat
vars        Z1=0       Z1=1     adj.diff adj.diff.null.sd    std.diff
  x1 -0.04301666 -0.1215795 -0.078562854       0.17776857 -0.05974106
  x2  0.35405900  0.3592074  0.005148437       0.06358109  0.01111412
    stat
vars           z         p
  x1 -0.44193894 0.6585334
  x2  0.08097435 0.9354624

The treated-vs-control mean differences between x1 and x2, adjusting for the fact that treatment was assigned within counties and states, is fairly small in both cases. The approximate method is pretty close to the simulated-but-exact method.


  1. To make a html document from this file, do library(rmarkdown);render("twolevelrand.Rmd") or click on the Knit HTML in RStudio — see the help file for render to make a pdf document.