A randomized saturation design is like the simple two-level design that we describe in twolevelrand.Rmd, but it involves randomly assigning the proportion treated, or "saturation" of the treatment, to counties.1 For example, we might randomly assign counties to receive between 0% treated (i.e., the "pure control" counties in the two-level design) and 100% treated, with some saturations in between (like 33%, 50%, 66%, etc.).
This design raises a couple of new questions for us. If our total number of treatments is fixed at \(n_t\), then, to maximize power and enhance interpretability: How many saturation categories should we use? How many counties should we choose?
We will block first on state because agricultural outcomes and farmer behavior will vary by state. We expect that the relationships between microloan applications, spillover, and other outcomes might also vary by state (in part as a proxy for ecosystem/type of agriculture and in part as a proxy for other influences that states have on agricultural production). Also, since counties tend to have similar land areas within a state but vary greatly in land area across states, we expect different patterns in spillover by state: assigning treatment to 50% of the farmers in one large county might incur different spillover than assigning treatment to 50% of the farmers in a smaller county.
If we are randomly assigning counties to saturations, we should probably also assign saturation within sets of counties that contain similar numbers of farmers. Across many counties, the distribution of county size should be balanced across saturation treatments in expectation, but in our finite sample we think that we will enhance power and make comparisons easier to understand if, within states, we randomly assign saturation within blocks of similar counties (with the number of farmers in the county being a key covariate, though perhaps there are others that matter as well).
Within blocks, we know that we want at least one county in the 0% saturation (i.e., "pure control") condition.
We have two main effects: the ITT (whereby we compare treated individuals to pure-control individuals) and the Spillover on the Non-Treated (SNT) (whereby we compare non-treated individuals in a county with some saturation \(\pi>0\) to pure-control individuals, i.e., people in counties with \(\pi=0\)).
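To fix ideas, here is a sketch of the two comparisons in potential-outcomes notation (in the spirit of Baird et al. 2015; the specific notation is ours, not theirs). Write \(Y_i(z,\pi)\) for farmer \(i\)'s outcome when individually assigned \(z \in \{0,1\}\) in a county with saturation \(\pi\). Then, roughly,

\[ \text{ITT}(\pi) = E[Y_i(1,\pi)] - E[Y_i(0,0)], \qquad \text{SNT}(\pi) = E[Y_i(0,\pi)] - E[Y_i(0,0)]. \]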
The power of the design will depend on the overall experimental pool \(N\) (roughly 3 million); the number treated \(n_t\); the variation in the outcome, \(Y\); the size of the direct effect of treatment; the size of the spillover effect; the intra-county dependence (measured by the intra-cluster correlation coefficient, ICC); and the correlation between treatment statuses within counties. The Baird et al. (2015) piece provides formulas relating these quantities to power.
The Baird et al. (2015) piece uses a random-effects/model-based mode of statistical inference. They claim that this very much simplifies the work required for power analysis. Although, in general, I tend to advocate randomization inference for randomized experiments, I suspect that we might gain from following their lead here, especially since our total pool of units is so large: the concerns about consistency and the CLT that one sidesteps with randomization inference in its permutation form might not be that important. That said, we might still want to check our results using a more direct randomization inference approach if we have few saturation categories.
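To make that concrete, here is a minimal sketch of the kind of simulation-based power check we have in mind. Everything in it is illustrative: the two-saturation design (0% and 50%), the effect size, the ICC, and the naive t-test (which ignores the county-level clustering that a real check would have to respect) are placeholder assumptions, not our design.

## Minimal simulation-based power sketch. All parameter values are made up;
## the t.test below ignores intra-county dependence, so it overstates power.
## nCounties should be even so that the two saturations split evenly.
simPower<-function(nCounties=100,nPerCounty=50,icc=0.05,directEffect=0.2,nSims=200){
  pvals<-replicate(nSims,{
    county<-rep(seq_len(nCounties),each=nPerCounty)
    sat<-sample(rep(c(0,.5),nCounties/2))[county] ## county saturation: 0% or 50%
    z<-rbinom(length(county),1,sat) ## individual assignment within each county
    u<-rnorm(nCounties,sd=sqrt(icc))[county] ## county random effect induces the ICC
    y<-u+directEffect*z+rnorm(length(county),sd=sqrt(1-icc))
    keep<- z==1 | sat==0 ## the ITT comparison: treated vs. pure-control individuals
    t.test(y[keep]~z[keep])$p.value
  })
  mean(pvals<.05) ## estimated power at alpha = .05
}
## simPower()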
The following reads the data from Google Drive, assuming that there is a Google Sheet called "countyCountsFIPS.csv" in your sheets directory.
library(googlesheets)
countyCountsGS<-gs_title("countyCountsFIPS.csv") ## get info about the data
countyCounts<-as.data.frame(gs_read_csv(countyCountsGS,as.is=TRUE),stringsAsFactors=FALSE)
## countyCounts<-countyCounts[countyCounts$state!="PR",]
str(countyCounts)
'data.frame': 3389 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ state : chr "AK" "AK" "AK" "AK" ...
$ state.fips : int 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr "Aleutians East" "Aleutians West" "Anchorage" "Bethel" ...
$ county.fips: int 13 16 20 50 60 68 70 90 100 105 ...
$ n : int 8 7 930 17 6 25 21 422 23 12 ...
countyCounts$STATEFIPS<-with(countyCounts,ifelse(state.fips<10,paste("0",state.fips,sep=""),as.character(state.fips)))
countyCounts$COUNTYNAME<-countyCounts$county
countyCounts$COUNTYFIPS<-with(countyCounts,ifelse(county.fips<10,
paste("00",county.fips,sep=""),
ifelse(county.fips<100,paste("0",county.fips,sep=""),
as.character(county.fips))))
countyCounts$combifips<-with(countyCounts,paste(STATEFIPS,COUNTYFIPS,sep=""))
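As an aside, the nested ifelse padding above can be written more compactly with sprintf; a commented-out sketch of the equivalent:

## Equivalent zero-padding with sprintf ("%02d" pads to two digits, "%03d" to three):
## countyCounts$STATEFIPS<-sprintf("%02d",countyCounts$state.fips)
## countyCounts$COUNTYFIPS<-sprintf("%03d",countyCounts$county.fips)
## countyCounts$combifips<-paste0(countyCounts$STATEFIPS,countyCounts$COUNTYFIPS)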
str(countyCounts)
'data.frame': 3389 obs. of 10 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ state : chr "AK" "AK" "AK" "AK" ...
$ state.fips : int 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr "Aleutians East" "Aleutians West" "Anchorage" "Bethel" ...
$ county.fips: int 13 16 20 50 60 68 70 90 100 105 ...
$ n : int 8 7 930 17 6 25 21 422 23 12 ...
$ STATEFIPS : chr "02" "02" "02" "02" ...
$ COUNTYNAME : chr "Aleutians East" "Aleutians West" "Anchorage" "Bethel" ...
$ COUNTYFIPS : chr "013" "016" "020" "050" ...
$ combifips : chr "02013" "02016" "02020" "02050" ...
sapply(countyCounts,function(x){ sum(is.na(x) | x=="" ) })
id state state.fips county county.fips n
0 0 0 0 0 0
STATEFIPS COUNTYNAME COUNTYFIPS combifips
0 0 0 0
with(countyCounts,tapply(n,state,sum)) ## N by state
AK AL AR AZ CA CO CT DE FL GA
3330 69979 56139 24649 71932 49230 4643 6254 60993 75985
HI IA ID IL IN KS KY LA MA MD
3479 137252 27353 145738 108542 104116 134013 55057 8262 25891
ME MI MN MO MS MT NC ND NE NH
9333 85599 117255 109653 69699 42221 156017 51032 86328 3565
NJ NM NV NY OH OK OR PA RI SD
20480 19483 6065 43250 126956 68605 36829 60822 1271 50685
TN TX UT VA VT WA WI WV WY
121912 293867 13889 86890 7611 49918 156940 18243 11811
with(countyCounts,tapply(county,state,function(x){ length(unique(x) ) })) ## number of counties by state
AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA
36 73 89 26 70 72 9 3 79 166 5 103 51 113 99 110 122 70
MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
18 28 22 86 91 122 92 62 115 58 96 11 25 37 20 68 92 81
OR PA RI SD TN TX UT VA VT WA WI WV WY
37 69 5 70 106 263 31 138 16 47 76 58 27
summary(with(countyCounts,tapply(n,county,sum))) ## farmers by county name (same-named counties in different states are pooled here)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 440 922 1697 1777 37860
sapply(split(countyCounts$n,countyCounts$state),function(x){ summary(x) }) ## dist of county sizes by state
AK AL AR AZ CA CO CT DE FL GA
Min. 1.00 1.0 1.0 1.0 1 1.0 1.0 990 1.0 1.0
1st Qu. 3.75 462.8 292.2 1.0 185 124.5 282.0 1092 266.0 189.0
Median 13.50 827.0 568.5 143.5 716 296.5 578.0 1195 562.0 362.0
Mean 92.50 945.7 623.8 948.0 1028 683.8 515.9 2085 772.1 449.6
3rd Qu. 46.75 1242.0 801.5 794.8 1474 897.8 675.0 2632 1102.0 572.0
Max. 930.00 2848.0 3187.0 9382.0 6876 3606.0 887.0 4069 3436.0 2811.0
HI IA ID IL IN KS KY LA MA MD
Min. 1.0 1 1.0 1 1 1.0 1 1.0 1.0 1.0
1st Qu. 400.0 864 161.8 647 674 396.5 570 260.2 19.5 554.5
Median 700.0 1195 391.5 1060 1096 641.0 918 445.5 440.0 750.0
Mean 695.8 1333 526.0 1256 1075 938.0 1072 786.5 459.0 924.7
3rd Qu. 980.0 1586 759.2 1684 1392 1130.0 1340 1181.0 661.0 1122.0
Max. 1398.0 5986 2132.0 8248 3133 8574.0 4443 3767.0 1270.0 2425.0
ME MI MN MO MS MT NC ND NE NH
Min. 1.0 1.0 1 1.0 1.0 1.0 1 1.0 1.0 1.0
1st Qu. 1.0 232.5 701 473.5 441.0 200.5 437 333.2 372.5 200.5
Median 456.0 591.0 1120 777.0 757.0 385.0 1214 550.5 709.0 315.0
Mean 405.8 983.9 1261 891.5 757.6 681.0 1357 879.9 899.2 324.1
3rd Qu. 649.5 1742.0 1651 1142.0 1061.0 797.5 1844 1057.0 1189.0 483.5
Max. 1153.0 3213.0 7576 5109.0 2743.0 3648.0 8486 4519.0 7268.0 566.0
NJ NM NV NY OH OK OR PA RI SD
Min. 1.0 1.0 1.00 1.0 1 1 1.0 1.0 57.0 1.0
1st Qu. 341.0 248.0 55.75 254.0 978 498 332.0 422.0 140.0 321.5
Median 783.0 365.0 125.00 572.0 1325 698 655.0 777.0 250.0 572.0
Mean 819.2 526.6 303.20 626.8 1365 847 995.4 881.5 254.2 724.1
3rd Qu. 1234.0 566.0 291.50 912.0 1770 1025 1480.0 1246.0 349.0 938.8
Max. 2842.0 2144.0 2247.00 1897.0 3952 5339 3252.0 2938.0 475.0 4580.0
TN TX UT VA VT WA WI WV WY
Min. 1 1.0 1 1.0 1.0 1.0 1 1.0 1.0
1st Qu. 428 307.0 170 234.2 247.8 263.5 982 130.0 205.5
Median 924 655.5 304 435.5 430.0 699.0 1880 296.5 371.0
Mean 1118 1113.0 448 629.6 475.7 1062.0 2065 314.5 437.4
3rd Qu. 1576 1390.0 729 782.5 707.8 1429.0 2935 438.5 618.5
Max. 6104 16420.0 1713 2901.0 1105.0 5830.0 7122 863.0 1523.0
Remove counties with two or fewer farmers (these seem to be duplicates in the file anyway):
countyCounts<-countyCounts[countyCounts$n>2,]
names(table(countyCounts$combifips))[table(countyCounts$combifips)>1]
[1] "02020"
Anchorage is still duplicated, so we remove it by hand. Eventually the input file should have only one row per county.
countyCounts<-countyCounts[!(countyCounts$n==4 & countyCounts$combifips=="02020"),]
stopifnot(all(table(countyCounts$combifips)==1))
We can have three measures of distance between counties: Euclidean distance between geographic centers, Euclidean distance between centers of population, and an indicator (1 or 0) for whether two counties are adjacent to each other.2
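For the two centroid-based measures, here is a sketch (commented out because the centroid columns are only merged in below; also, planar Euclidean distance on longitude/latitude is a rough approximation, and something like fields::rdist.earth would give great-circle distances instead):

## Sketch: distance matrices from centroids (the LONGITUDE/LATITUDE and
## CENTRDLON/CENTRDLAT columns are constructed below).
## centroidDist<-function(lon,lat){ as.matrix(dist(cbind(lon,lat))) }
## popDist<-centroidDist(wrkdat$LONGITUDE,wrkdat$LATITUDE) ## centers of population
## geoDist<-centroidDist(wrkdat$CENTRDLON,wrkdat$CENTRDLAT) ## geographic centers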
This file required a bit of reformatting work. The script that does the cleanup is gis/setupAdjData.sh. It should be run once at the command line. The countyAdjClean.txt file will be in the repository. The following lines just document the contents of setupAdjData.sh.
## These are bash commands.
## They are currently in gis/setupAdjData.sh which can be run at the command line.
curl -O http://www2.census.gov/geo/docs/reference/county_adjacency.txt
cp county_adjacency.txt tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/"\t/";\t/g' tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/\t\t"/\t\t;;"/g' tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/\t"/\t;"/g' tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/\t//g' tmp1.txt
## Add "Watonwan County, MN" to line 9625
LC_CTYPE=C LANG=C sed -i.bak '9629s/.*/"Watonwan County, MN";27165;"Blue Earth County, MN";27013/' tmp1.txt
## Change the old FIPS code of 02195 for Petersburgh Alaska to the new one of 02280
LC_CTYPE=C LANG=C sed -i.bak 's/02195/02280/g' tmp1.txt
LC_CTYPE=C LANG=C tr '\n' ':' < tmp1.txt > tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/:;;/;/g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/:/\\\n/g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ County//g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ Census Area//g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ and Borough//g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ Borough//g' tmp2.txt
cp tmp2.txt countyAdjClean.txt
## https://stackoverflow.com/questions/5411979/state-name-to-abbreviation-in-r
Sys.setlocale('LC_ALL','C')
[1] "C/C/C/C/C/en_US.UTF-8"
cntyAdjList1<-scan("gis/countyAdjClean.txt",what="",sep="\n",quote='\"')
cntyAdjList2<-strsplit(cntyAdjList1,";")
centerCounties1<-sapply(cntyAdjList2,function(obj){obj[1]})
centerCountiesFIPS1<-sapply(cntyAdjList2,function(obj){obj[2]})
centerCounties1[centerCounties1=="Prince of Wales-Hyder Census Area"]<-"Prince Wales Hyder"
names(cntyAdjList2)<-centerCounties1
## Remove Puerto Rico and the other territories (VI, GU, AS, MP)
centerCounties2<-centerCounties1[grep("PR$|VI$|GU$|AS$|MP$",centerCounties1,invert=TRUE)]
centerCountiesFIPS2<-centerCountiesFIPS1[grep("PR$|VI$|GU$|AS$|MP$",centerCounties1,invert=TRUE)]
cntyAdjList3<-cntyAdjList2[centerCounties2]
adjacentCounties<-lapply(cntyAdjList3,function(x){
if(length(x)<5){
return("None")
} else {
grep(",",x[3:length(x)],value=TRUE)
}})
isolatedCounties<-sapply(adjacentCounties,function(x){ all(x=="None") })
cntyAdjList3[isolatedCounties]
$`Hawaii, HI`
[1] "Hawaii, HI" "15001" "Hawaii, HI" "15001"
$`Honolulu, HI`
[1] "Honolulu, HI" "15003" "Honolulu, HI" "15003"
$`Kauai, HI`
[1] "Kauai, HI" "15007" "Kauai, HI" "15007"
adjacentCountiesFIPS<-lapply(cntyAdjList3,function(x){ grep('[0-9]$',x[3:length(x)],value=TRUE) })
names(adjacentCountiesFIPS)<-centerCountiesFIPS2
states<-read.csv("gis/StateNamesAndAbbreviations.csv",as.is=TRUE)
usaCountiesPopCentroids<-read.csv(url("http://www2.census.gov/geo/docs/reference/cenpop2010/county/CenPop2010_Mean_CO.txt"),as.is=TRUE)
usaCountiesPopCentroids$state<-states$Postal.Code[match(usaCountiesPopCentroids$STNAME,states$State)]
usaCountiesPopCentroids$STATEFIPS<-with(usaCountiesPopCentroids,ifelse(STATEFP<10,paste("0",STATEFP,sep=""),as.character(STATEFP)))
usaCountiesPopCentroids$COUNTYNAME<-usaCountiesPopCentroids$COUNAME
str(usaCountiesPopCentroids)
'data.frame': 3221 obs. of 10 variables:
$ STATEFP : int 1 1 1 1 1 1 1 1 1 1 ...
$ COUNTYFP : int 1 3 5 7 9 11 13 15 17 19 ...
$ COUNAME : chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
$ STNAME : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ POPULATION: int 54571 182265 27457 22915 57322 10914 20947 118572 34215 25989 ...
$ LATITUDE : num 32.5 30.5 31.8 33 34 ...
$ LONGITUDE : num -86.5 -87.8 -85.3 -87.1 -86.6 ...
$ state : chr "AL" "AL" "AL" "AL" ...
$ STATEFIPS : chr "01" "01" "01" "01" ...
$ COUNTYNAME: chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
usaCountiesPopCentroids$COUNTYFIPS<-with(usaCountiesPopCentroids,ifelse(COUNTYFP<10,
paste("00",COUNTYFP,sep=""),
ifelse(COUNTYFP<100,paste("0",COUNTYFP,sep=""),
as.character(COUNTYFP))))
library(foreign)
usaCountiesGeoCentroids<-read.dbf("gis/AddCombiFips/USACountyCentroids.dbf",as.is=TRUE)
tmp<-t(sapply(split(usaCountiesGeoCentroids,usaCountiesGeoCentroids$COMBIFIPS),function(dat){
sapply(dat,function(x){
if(is.character(x)){
return(unique(x))
} else {
return(mean(x))
}})}))
usaCountiesGeoCentroidsAvg<-as.data.frame(tmp,stringsAsFactors=FALSE)
usaCountiesGeoCentroidsAvg$CENTRDLON<-as.numeric(usaCountiesGeoCentroidsAvg$CENTRDLON)
usaCountiesGeoCentroidsAvg$CENTRDLAT<-as.numeric(usaCountiesGeoCentroidsAvg$CENTRDLAT)
str(usaCountiesGeoCentroidsAvg)
'data.frame': 3219 obs. of 6 variables:
$ COUNTYNAME: chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
$ STATEFIPS : chr "01" "01" "01" "01" ...
$ COUNTYFIPS: chr "001" "003" "005" "007" ...
$ COMBIFIPS : chr "01001" "01003" "01005" "01007" ...
$ CENTRDLON : num -86.6 -87.7 -85.4 -87.1 -86.6 ...
$ CENTRDLAT : num 32.5 30.7 31.9 33 34 ...
## Dumping repeated counties --- basically, multi-island counties with multiple geographic centroids
censusCounties<-merge(usaCountiesPopCentroids,usaCountiesGeoCentroidsAvg,all.x=TRUE,all.y=FALSE,by=c("COUNTYFIPS","STATEFIPS"))
## Clifton Forge, VA is a town not a county as of 2010, but only in usaCountiesGeo so not in this file
## censusCounties<-censusCounties[censusCounties$COUNTYNAME.x!="Clifton Forge",]
summary(censusCounties)
COUNTYFIPS STATEFIPS STATEFP COUNTYFP
Length:3221 Length:3221 Min. : 1.0 Min. : 1.0
Class :character Class :character 1st Qu.:19.0 1st Qu.: 35.0
Mode :character Mode :character Median :30.0 Median : 79.0
Mean :31.3 Mean :103.1
3rd Qu.:46.0 3rd Qu.:133.0
Max. :72.0 Max. :840.0
COUNAME STNAME POPULATION LATITUDE
Length:3221 Length:3221 Min. : 82 Min. :17.98
Class :character Class :character 1st Qu.: 11310 1st Qu.:34.34
Mode :character Mode :character Median : 26076 Median :38.23
Mean : 97011 Mean :37.96
3rd Qu.: 65880 3rd Qu.:41.69
Max. :9818605 Max. :70.52
LONGITUDE state COUNTYNAME.x
Min. :-168.29 Length:3221 Length:3221
1st Qu.: -98.06 Class :character Class :character
Median : -89.96 Mode :character Mode :character
Mean : -91.66
3rd Qu.: -82.99
Max. : -65.30
COUNTYNAME.y COMBIFIPS CENTRDLON CENTRDLAT
Length:3221 Length:3221 Min. :-169.50 Min. :17.98
Class :character Class :character 1st Qu.: -98.05 1st Qu.:34.34
Mode :character Mode :character Median : -89.93 Median :38.20
Mean : -91.56 Mean :37.93
3rd Qu.: -82.95 3rd Qu.:41.69
Max. : -43.46 Max. :69.31
NA's :6 NA's :6
## There are a few places with population centroids but no geographic centroids. This is ok. I think that we'd prefer population centroids anyway.
blah<-censusCounties[is.na(censusCounties$CENTRDLAT),"COUNTYNAME.x"]
countyCounts$county[countyCounts$county %in% blah]
[1] "Hoonah-Angoon" "Petersburg" "Skagway" "Wrangell"
[5] "Broomfield"
grep("Prince",countyCounts$county,value=TRUE)
[1] "Prince Wales Hyder" "Prince Georges" "Prince Edward"
[4] "Prince George" "Prince William"
censusCounties$COUNTYNAME.x[censusCounties$COUNTYNAME.x=="Prince of Wales-Hyder"]<-"Prince Wales Hyder"
## Remove Puerto Rico
censusCounties<-censusCounties[!(censusCounties$STNAME=="Puerto Rico" | censusCounties$state=="PR" | censusCounties$STATEFIPS=="72"),]
stopifnot(all(is.na(unlist(censusCounties[is.na(censusCounties$state),]))))
censusCounties<-censusCounties[!is.na(censusCounties$state),]
## Merge onto existing data.
censusCounties$county<-censusCounties$COUNTYNAME.x
usaCountiesPopCentroids$COUNTYNAME[usaCountiesPopCentroids$COUNTYNAME=="Prince of Wales-Hyder"]<-"Prince Wales Hyder"
usaCountiesPopCentroids$county<-usaCountiesPopCentroids$COUNTYNAME
tmpdat<-merge(countyCounts,censusCounties,by=c("county","state"),all.x=TRUE,all.y=FALSE)
##tmpdat<-merge(countyCounts,usaCountiesPopCentroids,by=c("county","state"),all.x=TRUE,all.y=FALSE)
tmpdat$ids<-paste(tmpdat$county,tmpdat$state,sep=", ")
## We have duplicated records because, I think, some names refer to both an independent city and a county (mostly in Virginia).
## https://en.wikipedia.org/wiki/List_of_counties_in_Virginia
with(tmpdat,table(ids)[table(ids)!=1])
ids
Baltimore, MD Bedford, VA Fairfax, VA Franklin, VA Richmond, VA
2 2 2 2 2
Roanoke, VA
2
blah<-names(with(tmpdat,table(ids)[table(ids)!=1]))
blahdat<-tmpdat[tmpdat$ids %in% blah,]
blahdat<-blahdat[order(blahdat$county,blahdat$POPULATION,decreasing=TRUE),]
## The correct county FIPS codes (from the TR-65 FIPS Code Chart) let us drop the duplicate city records:
bad<-with(tmpdat,{(county=="Richmond" & state=="VA" & COUNTYFP!="159") |
(county=="Bedford" & state=="VA" & COUNTYFP!="019") |
(county=="Fairfax" & state=="VA" & COUNTYFP!="059") |
(county=="Roanoke" & state=="VA" & COUNTYFP!="161")|
(county=="Franklin" & state=="VA" & COUNTYFP!="067") |
(county=="Baltimore" & state=="MD" & COUNTYFP!="005")})
tmpdat<-tmpdat[!bad,]
stopifnot(all(table(tmpdat$ids)==1))
row.names(tmpdat)<-tmpdat$ids
Remove counties with only 1 farmer:
tmpdat<-tmpdat[tmpdat$n>1,]
Remove counties with no geographic information:
tmpdat<-tmpdat[!is.na(tmpdat$LATITUDE),]
Also trim the top and bottom 1 percent of counties (in terms of size) within each state.
stateCountyFarmDist<-tapply(tmpdat$n,tmpdat$state,function(x){ quantile(x,c(0,.01,.1,.25,.5,.75,.9,.99,1)) })
table(tmpdat$state)
AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA
25 66 74 15 58 64 8 3 65 159 4 98 44 98 89 105 120 55
MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
14 19 16 81 85 109 82 56 100 52 93 10 21 32 16 61 88 77
OR PA RI SD TN TX UT VA VT WA WI WV WY
36 67 5 66 95 253 29 92 14 39 71 55 23
datList<-lapply(split(tmpdat,tmpdat$state),function(dat){
qs<-quantile(dat$n,c(.01,.99))
return(dat[dat$n >= qs[[1]] & dat$n <= qs[[2]],])
})
sapply(datList,nrow)
AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA
23 64 72 13 56 62 6 1 63 155 2 96 42 96 87 101 116 53
MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
12 17 14 79 83 105 80 54 98 50 91 8 19 30 14 59 86 75
OR PA RI SD TN TX UT VA VT WA WI WV WY
34 65 3 64 94 247 27 90 12 37 69 53 21
sum(sapply(datList,nrow))
[1] 2898
wrkdat<-do.call("rbind",datList)
## Exclude Delaware, Rhode Island, and Hawaii as having too few counties and people.
wrkdat<-wrkdat[!(wrkdat$state %in% c("DE","RI","HI")),]
row.names(wrkdat)<-wrkdat$combifips
The input to the pairing function is a matrix of distances. So first we make distance matrices.
library(nbpMatching)
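To fix ideas, here is the nbpMatching workflow that we use below, in miniature, on a made-up 4-by-4 distance matrix:

## Toy nonbipartite matching demo (made-up distances): a and b are close,
## c and d are close, so the optimal match pairs them that way.
demoD<-matrix(c(0,1,8,9, 1,0,7,8, 8,7,0,2, 9,8,2,0),nrow=4,
              dimnames=list(letters[1:4],letters[1:4]))
demoMatch<-nonbimatch(distancematrix(demoD))
get.sets(demoMatch$matches,remove.unpaired=TRUE)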
Make an adjacency matrix of the counties:
## Make a matrix with 1 indicating that two counties are touching and 0 otherwise.
#### Not all counties in the adjacency matrix are represented in the working data from the USDA
#### blah<-sapply(centerCountiesFIPS2,function(x){
#### all(adjacentCountiesFIPS[[x]] %in% wrkdat$combifips ) })
#### table(blah)
adjList<-lapply(wrkdat$combifips,function(x){
wrkdat$combifips %in% adjacentCountiesFIPS[[x]]
})
names(adjList)<-wrkdat$combifips
adjMat<-do.call("rbind",adjList)
stopifnot(all(diag(adjMat))) ## all diagonals should be true
adjMat<-adjMat*1 # convert to numbers
dimnames(adjMat)<-list(wrkdat$combifips,wrkdat$combifips)
diag(adjMat)<-0 ## set self-distances to zero for ease in penalty creation below
wrkdat[which(rowMeans(adjMat)==0 ),] ## which counties have no adjacent counties (none remain, now that Hawaii is excluded)
[1] county state id state.fips county.fips
[6] n STATEFIPS.x COUNTYNAME COUNTYFIPS.x combifips
[11] COUNTYFIPS.y STATEFIPS.y STATEFP COUNTYFP COUNAME
[16] STNAME POPULATION LATITUDE LONGITUDE COUNTYNAME.x
[21] COUNTYNAME.y COMBIFIPS CENTRDLON CENTRDLAT ids
<0 rows> (or 0-length row.names)
## Some counties in Minnesota are causing a problem with the adjacency matrix, which should be symmetric
## That is, for example, county 27005 is counted as being adjacent to county 27153 but 27153 is not recorded as being adjacent to county 27005. Since adjacency is a symmetric property (by definition here) I need to fix this.
problemCountiesBig<-colSums(adjMat)!=rowSums(adjMat)
asymCounties<-problemCountiesBig[problemCountiesBig]
asymAdjMat<-adjMat[names(asymCounties),names(asymCounties)]
cbind(colSums(asymAdjMat),rowSums(asymAdjMat))
[,1] [,2]
27005 2 1
27021 1 2
27027 3 2
27051 2 1
27097 2 3
27145 1 2
27153 3 4
27167 3 2
adjacentCountiesFIPS[names(asymCounties)]
$`27005`
[1] "27005" "27027" "27029" "27057" "27087" "27107" "27111" "27159"
$`27021`
[1] "27001" "27007" "27021" "27035" "27057" "27061" "27097" "27153" "27159"
$`27027`
[1] "27005" "27027" "27107" "27111" "27167" "38017" "38077"
$`27051`
[1] "27041" "27051" "27111" "27121" "27149" "27155" "27167"
$`27097`
[1] "27009" "27021" "27035" "27095" "27097" "27145" "27153"
$`27145`
[1] "27009" "27041" "27067" "27093" "27097" "27121" "27141" "27145"
[9] "27153" "27171"
$`27153`
[1] "27005" "27027" "27041" "27051" "27111" "27153" "27159" "27167"
$`27167`
[1] "27027" "27051" "27111" "27155" "27167" "38077"
## wrkdat[names(asymCounties),]
## Just to check: These counties are not in some strange geographic situation.
### with(wrkdat[wrkdat$state=='MN',],plot(CENTRDLAT,CENTRDLON,pch=1))
### with(wrkdat[names(asymCounties),],points(CENTRDLAT,CENTRDLON,pch=19,col="red"))
## Fix the matrix itself: if any county is called adjacent to another county, make
## sure that this is reflected in both counties.
## Since the matrix is 0 when two counties are not adjacent and 1 when they are
## adjacent, I can add the transpose of the matrix and turn missing 0's into 1's
## and existing 1's into 2's. So I then convert that submatrix into TRUE/FALSE
## and then back to 1/0. At least, this is my one-line fix.
adjMat[names(asymCounties),names(asymCounties)]<-as.numeric(
(adjMat[names(asymCounties),names(asymCounties)]+
t(adjMat[names(asymCounties),names(asymCounties)]))>0
)
stopifnot(isSymmetric(adjMat))
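The same one-line idea would also work on the whole matrix at once; a commented-out sketch of that more general fix:

## Sketch: symmetrize the entire adjacency matrix in one step
## (equivalent to the submatrix fix above, applied everywhere).
## adjMat<-1*((adjMat+t(adjMat))>0)
## stopifnot(isSymmetric(adjMat))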
Make a matrix recording absolute differences in number of farmers between all pairs of counties.
scalarDist<-function(var,scalefactor=1){
## Utility function to make n x n abs dist matrices
## Scalefactor helps us turn fractions into integers since nbpMatching needs
## integers
outer(var,var,FUN=function(x,y){
as.integer(abs(x-y)*scalefactor)
})
}
## Create distance matrix for differences in number of farmers
numfarmers<-wrkdat$n
names(numfarmers)<-row.names(wrkdat)
sizeDist<-scalarDist(numfarmers)
## This will be an ingredient in the penalty that, we hope, forces counties that are not adjacent, yet similar in size and within the same state, into pairs.
maxSizeDist<-max(as.vector(sizeDist))
getStatePairs<-function(fips){
adjD<-adjMat[fips,fips]
sizeD<-sizeDist[fips,fips]
penMat<- sizeD + adjD*maxSizeDist ## add the maximum distance to any entry where two counties are adjacent
if( (nrow(penMat) %% 2)!=0){
penMat <- make.phantoms(penMat,1) ## delete one county where we have an odd number --- it will be the county least like the others, greatest distance on size and adjacency.
}
penMatD<-distancematrix(penMat)
nbpm<-nonbimatch(penMatD)
thepairs<-get.sets(nbpm$matches,remove.unpaired=TRUE)
thepairsA<-as.character(thepairs) ## return as a character rather than a factor to make the state-by-state matching easier
names(thepairsA)<-names(thepairs)
return(thepairsA)
}
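To see what the penalty inside getStatePairs does, consider a toy example (made-up sizes, with A adjacent to B): the adjacency term adds the maximum size distance, so the adjacent pair becomes costlier than either non-adjacent alternative. Here we use the toy matrix's own maximum rather than maxSizeDist.

## Toy penalty check: sizes 100, 120, 500, with A adjacent to B.
## A-B: 20+400=420 > A-C: 400 > B-C: 380, so the match avoids the adjacent pair.
toySize<-c(A=100,B=120,C=500)
toyAdj<-matrix(c(0,1,0, 1,0,0, 0,0,0),nrow=3,dimnames=list(names(toySize),names(toySize)))
scalarDist(toySize)+toyAdj*max(scalarDist(toySize))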
## Get the fips codes by state so that we can do state-by-state matching
fipsByState<-split(wrkdat$combifips,wrkdat$state)
## This is inelegant but intelligible and not slow because the matching problems are small in each state
pairsByState<-lapply(fipsByState,function(fips){ getStatePairs(fips)})
## An example of the output. Each county has a categorical variable (a character var) with a label for the pair it is in.
str(pairsByState[1:2])
List of 2
$ AK: Named chr [1:22] "02013-02275" "02016-02185" "02050-02280" "02060-02188" ...
..- attr(*, "names")= chr [1:22] "02013" "02016" "02050" "02060" ...
$ AL: Named chr [1:64] "01001-01013" "01003-01077" "01005-01021" "01007-01131" ...
..- attr(*, "names")= chr [1:64] "01001" "01003" "01005" "01007" ...
## Because some states have an odd number of counties, one is excluded.
sum(sapply(pairsByState,length))
[1] 2870
sapply(pairsByState,length)
AK AL AR AZ CA CO CT FL GA IA ID IL IN KS KY LA MA MD
22 64 72 12 56 62 6 62 154 96 42 96 86 100 116 52 12 16
ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA
14 78 82 104 80 54 98 50 90 8 18 30 14 58 86 74 34 64
SD TN TX UT VA VT WA WI WV WY
64 94 246 26 90 12 36 68 52 20
## Compare to:
table(wrkdat$state)
AK AL AR AZ CA CO CT FL GA IA ID IL IN KS KY LA MA MD
23 64 72 13 56 62 6 63 155 96 42 96 87 101 116 53 12 17
ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA
14 79 83 105 80 54 98 50 91 8 19 30 14 59 86 75 34 65
SD TN TX UT VA VT WA WI WV WY
64 94 247 27 90 12 37 69 53 21
sum(table(wrkdat$state))
[1] 2892
nrow(wrkdat)
[1] 2892
## Add the pair variable to the data
for(i in 1:length(pairsByState)){
wrkdat[names(pairsByState[[i]]),"pm1"]<-pairsByState[[i]]
}
## Every non-missing pair label should appear exactly twice (two counties per pair).
stopifnot(all(table(wrkdat$pm1)==2))
How well did the matching do? Here are descriptive statistics of the pairs in terms of size and adjacency. If any pair involved adjacent counties, we would avoid it (the check below shows that none do). Also, since we did not ask the algorithm to exclude bad matches, we have some pairs that are very different in number of farmers. We will also avoid them when choosing the one (or a few more) pairs for the design.
pairedDat<-wrkdat[!is.na(wrkdat$pm1),]
pairDiffsN<-with(pairedDat,tapply(n,pm1,function(x){ abs(diff(x)) }) )
summary(pairDiffsN)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 5.00 16.00 43.52 40.00 1443.00
quantile(pairDiffsN,seq(0,1,.1))
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
0.0 2.0 4.0 7.0 11.0 16.0 23.0 33.8 51.0 99.0
100%
1443.0
How many pairs involve adjacent counties?
pairDiffsAdj<-with(pairedDat,tapply(combifips,pm1,function(x){ all(as.vector(adjMat[x,x]==0)) })) ## TRUE means the two counties in the pair are NOT adjacent
table(pairDiffsAdj)
pairDiffsAdj
TRUE
1435
We want to assign about 1/4 of the counties in each state to treatment. If this seems like too much, we might decide to assign, say, 6 counties to treatment or 1/4 of the total counties, whichever is greater.
pairsByState<-sapply(split(pairedDat$pm1,pairedDat$state),unique) ## otherwise we'll have two entries for each pair
pairAvgN<-with(pairedDat,tapply(n,pm1,mean))
penalty1 <- function(x1,x2){
## here, x1 and x2 are the vectors of sizes (in numbers of farmers) in each county within a pair
ifelse( (x1-x2) != 0,
log(x1) + log(x2) - log(abs(x1 - x2)),
log(x1) + log(x2) )
}
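A quick numeric check of why this ranking prefers pairs that are both large and close in size (illustrative values; chooseBestPairs below sorts these in decreasing order, so larger is better):

## penalty1 rewards large, close pairs:
penalty1(1000,900) ## log(1000)+log(900)-log(100), about 9.10
penalty1(10,9)     ## log(10)+log(9)-log(1), about 4.50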
chooseBestPairs<-function(pm,numpairs){
## pm is a vector of pair-match names
## numpairs is the number of pairs to choose
## Given a vector of pair match labels, choose the best pairs (closest matches in size that are not adjacent, with the largest samples)
### First, exclude pairs where the two counties are adjacent (should be very rare given our pairing algorithm)
okpairs<-names(pairDiffsN[pm][pairDiffsAdj[pm]]) ## exclude pairs where the two counties are adjacent
sizeDiffs<-pairDiffsN[okpairs]
avgN<-pairAvgN[okpairs]
### Second, sort the pairs by size difference as a proportion of the average size of a county within the pair.
### The idea here is that a 1000-900 pair is better than a 10-9 pair even
### though the size difference is 100 in the first place and only 1 in the second
### place.
## pairsInOrder<-cbind(sizeDiffs,avgN)[order(sizeDiffs,-1*avgN,decreasing=FALSE),] ## sort by diff in size and then by avgN
## pairsInOrder<-cbind(sizeDiffs,avgN,sizeDiffs/avgN)[order(sizeDiffs/avgN,-1*avgN,decreasing=FALSE),] ## sort by diff in size/avgN and then by avgN
## Rank the pairs by a combination of closeness in size and absolute size, breaking ties by size
ordering<-order(penalty1(avgN-(sizeDiffs/2),avgN+(sizeDiffs/2)),avgN,decreasing=TRUE)
pairsInOrder<-cbind(sizeDiffs,avgN,avgN-(sizeDiffs/2),avgN+(sizeDiffs/2),penalty1(avgN-(sizeDiffs/2),avgN+(sizeDiffs/2)))[ordering,] ## ranked by penalty1 (size and closeness), breaking ties by avgN
### Third, start choosing pairs in order from best to worst, requiring that already-chosen counties not be adjacent to any other counties subsequently chosen
## Decided not to focus on control-only adjacency since then the pairs are
## no longer fixed, pre-treatment features, but depend on the preceding
## randomizations. So, it would then be difficult to look for heterogeneous
## effects by type of pair (big counties, small counties, etc.).
nAvail<-numpairs
##nFarmers<-0
newpairsInOrder<-pairsInOrder
res<-list() ## very inefficient but clear and clean versus res<-vector(mode="list",length=nrow(pairsInOrder))
k<-0
while(nAvail > 0 & nAvail <= numpairs){
k<-k+1
## message(k)
#### Choose the pair that is on top of the list, the best one according to our criteria above
pair1<-strsplit(row.names(newpairsInOrder)[1],"-")[[1]]
names(pair1)<-rep(row.names(newpairsInOrder)[1],2)
## pair1Z<-sample(c(0,1)) ## assign one county to treatment
## res[[k]]<-data.frame(county=pair1,Z=pair1Z,pair=names(pair1))
res[[k]]<-data.frame(county=pair1,pair=names(pair1),stringsAsFactors=FALSE)
##nFarmers1<-pairedDat[pair1[pair1Z==1],"n"]/2
##nFarmers<-nFarmers1+nFarmers
## message(nFarmers)
#### Trim the pairsInOrder data to exclude the pairs already chosen and any pairs adjacent to the preceding pair.
remainingPairs<-row.names(newpairsInOrder)[!(row.names(newpairsInOrder) %in% names(pair1)[1])]
if(length(remainingPairs)==0){
return(res)
} else {
## Find pairs where at least one member is adjacent to one of the pair members and exclude that pair from consideration.
## adjM<-adjMat[pair1,##[pair1Z==0], unlist(strsplit(remainingPairs,"-"))]
adjM<-adjMat[pair1,unlist(strsplit(remainingPairs,"-"))]
adjCounties<-colnames(adjM)[colSums(adjM)>=1]
adjCountiesLocs<-unique(sapply(c(pair1[1],adjCounties),function(cnty){ grep(cnty,row.names(newpairsInOrder)) }))
includeCounties<-row.names(newpairsInOrder)[-c(adjCountiesLocs)]
if(length(includeCounties)==0){
return(res)
} else {
newpairsInOrder<-newpairsInOrder[includeCounties,,drop=FALSE]
nAvail<-min(numpairs,nrow(newpairsInOrder))
## message("Available pairs:",nAvail)
}
}
}
}
targetPairsByState<-floor(sapply(pairsByState,length)/2) ## no more than 1/4 of the counties in a state in treatment
selectedPairs<-lapply(names(pairsByState),function(nm){
pairList<-chooseBestPairs(pairsByState[[nm]],numpairs=targetPairsByState[[nm]])
if(length(pairList)==0){
return(NA)
} else {
results<-data.frame(do.call("rbind",pairList),stringsAsFactors=FALSE)
results$state<-nm
return(results)
}
})
names(selectedPairs)<-names(pairsByState)
## How many counties did the algorithm return per state?
sapply(selectedPairs,function(x){ if(is.data.frame(x)){ return(nrow(x)) } else { return(NA) }})
AK AL AR AZ CA CO CT FL GA IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC
8 12 14 2 12 12 2 16 30 18 10 18 16 18 24 14 4 6 4 16 14 20 16 10 20
ND NE NH NJ NM NV NY OH OK OR PA SD TN TX UT VA VT WA WI WV WY
8 18 2 4 8 2 12 16 14 6 12 14 18 48 6 18 4 10 12 12 4
prelimExperimentDat<-data.frame(do.call("rbind",selectedPairs),stringsAsFactors = FALSE)
row.names(prelimExperimentDat)<-prelimExperimentDat$county
prelimExperimentDat$n<-pairedDat[row.names(prelimExperimentDat),"n"]
str(prelimExperimentDat)
'data.frame': 584 obs. of 4 variables:
$ county: chr "02090" "02122" "02100" "02180" ...
$ pair : chr "02090-02122" "02090-02122" "02100-02180" "02100-02180" ...
$ state : chr "AK" "AK" "AK" "AK" ...
$ n : int 422 678 23 23 55 44 7 5 1105 1100 ...
head(prelimExperimentDat)
county pair state n
02090 02090 02090-02122 AK 422
02122 02122 02090-02122 AK 678
02100 02100 02100-02180 AK 23
02180 02180 02100-02180 AK 23
02130 02130 02130-02220 AK 55
02220 02220 02130-02220 AK 44
summary(prelimExperimentDat$n) ## sizes of counties in the experiment
Min. 1st Qu. Median Mean 3rd Qu. Max.
5 491 793 1000 1344 4531
set.seed(20150101)
prelimExperimentDat$Z<-unsplit(lapply(split(prelimExperimentDat$county,prelimExperimentDat$pair),function(x){ sample(c(0,1)) }),prelimExperimentDat$pair)
head(prelimExperimentDat)
county pair state n Z
02090 02090 02090-02122 AK 422 0
02122 02122 02090-02122 AK 678 1
02100 02100 02100-02180 AK 23 1
02180 02180 02100-02180 AK 23 0
02130 02130 02130-02220 AK 55 1
02220 02220 02130-02220 AK 44 0
stopifnot(all(tapply(prelimExperimentDat$Z,prelimExperimentDat$pair,sum)==1)) ## exactly 1 treated county per pair
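If we later do the randomization-inference check mentioned earlier, the same split/unsplit pattern gives us a re-randomization function; a sketch (newZ is a hypothetical helper, not used elsewhere):

## Sketch: re-randomize treatment within pairs, for permutation-based inference.
newZ<-function(dat){ unsplit(lapply(split(dat$Z,dat$pair),sample),dat$pair) }
## e.g., permZs<-replicate(1000,newZ(prelimExperimentDat))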
A map of our counties. The gray-filled counties are those assigned to treatment. I think that this is very useful since we may want to go back and refine the algorithm a bit.
png(file="map.png",width=1920,height=1080)
par(oma=rep(0,4))
library(maps)
data(countyMapEnv)
data(usaMapEnv)
data(county.fips)
map('state',regions=states$State[states$Postal.Code %in% prelimExperimentDat$state],exact=FALSE,mar=c(1,1,1,1))
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(prelimExperimentDat$county)],add=TRUE)
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(prelimExperimentDat$county)[prelimExperimentDat$Z==1]],col="gray",fill=TRUE,add=TRUE)
dev.off()
quartz_off_screen
2
The map makes us realize that we could contaminate the controls because of adjacency across state lines. This is rare, so we just dump those pairs where this happens — keeping the pairs with more farmers.
newAdjMat<-adjMat[prelimExperimentDat$county[prelimExperimentDat$Z==0],prelimExperimentDat$county[prelimExperimentDat$Z==1]]
## Total count of control-to-treated adjacencies (a control county can abut more than one treated county)
sum(rowSums(newAdjMat))
[1] 32
## Identify the pairs with this problem. Some counties have multiple adjacencies.
table(rowSums(newAdjMat))
0 1 2
264 24 4
table(colSums(newAdjMat))
0 1 2
262 28 2
ctrls<-rownames(newAdjMat)[rowSums(newAdjMat)>0]
crossStateAdjList<-apply(newAdjMat[ctrls,],1,function(x){ return(names(x[x>0,drop=FALSE]))})
dumpPairs<-lapply(names(crossStateAdjList),function(nm){
crossStateAdj<-prelimExperimentDat[c(nm,crossStateAdjList[[nm]]),]
## Keep the pair with the most farmers; mark the rest for dropping
crossStateAdj$pair[order(crossStateAdj$n,decreasing=TRUE)][-1]
})
names(dumpPairs)<-names(crossStateAdjList)
experimentDat<-prelimExperimentDat[!(prelimExperimentDat$pair %in% unlist(dumpPairs)),]
nextAdjMat<-adjMat[experimentDat$county[experimentDat$Z==0],experimentDat$county[experimentDat$Z==1]]
stopifnot(sum(nextAdjMat)==0)
png(file="map2.png",width=1920,height=1080)
par(oma=rep(0,4))
library(maps)
data(countyMapEnv)
data(usaMapEnv)
data(county.fips)
map('state',regions=states$State[states$Postal.Code %in% experimentDat$state],exact=FALSE,mar=c(1,1,1,1))
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(experimentDat$county)],add=TRUE)
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(experimentDat$county)[experimentDat$Z==1]],col="gray",fill=TRUE,add=TRUE)
dev.off()
quartz_off_screen
2
If we treat 50% of the farmers within each treated county, then we send the following total number of letters.
proportionTreated <- 1/2
sum(experimentDat$n[experimentDat$Z==1] * proportionTreated)
[1] 137363
More generally, here are the total numbers of letters for different choices of the proportion treated.
totalTreated<-sapply(seq(.5,.75,.05),function(x){ sum(experimentDat$n[experimentDat$Z==1] * x) })
rbind(totalTreated,seq(.5,.75,.05))
[,1] [,2] [,3] [,4] [,5] [,6]
totalTreated 137363.0 151099.30 164835.6 178571.90 192308.2 206044.50
0.5 0.55 0.6 0.65 0.7 0.75
Or we can just solve for the proportion that sends out about 150,000 letters:
fn<-function(x){
150000 - sum(experimentDat$n[experimentDat$Z==1] * x)
}
sol<-uniroot(fn,lower=.1,upper=.6)
sol$root
[1] 0.5459986
fn(sol$root)
[1] 2.910383e-11
sum(experimentDat$n[experimentDat$Z==1] * sol$root)
[1] 150000
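Since fn is linear in the proportion, uniroot is more machinery than we need; simple division gives the same root:

## Direct algebraic solution (fn is linear in x); this should equal sol$root.
150000/sum(experimentDat$n[experimentDat$Z==1])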
write.csv(experimentDat,file="experimentDat.csv")
Baird, Sarah, J. Aislinn Bohren, Craig McIntosh, and Berk Ozler. 2015. "Designing Experiments to Measure Spillover Effects."
Sinclair, Betsy, Margaret McConnell, and Donald P. Green. 2012. "Detecting Spillover Effects: Design and Analysis of Multilevel Experiments." American Journal of Political Science 56 (4): 1055–69.
We follow Baird et al. (2015), who in turn build on Sinclair, McConnell, and Green (2012), among others.↩
Some of the sources. Centers of population: http://www2.census.gov/geo/docs/reference/cenpop2010/county/CenPop2010_Mean_CO.txt and https://www.census.gov/geo/reference/centersofpop.html. County adjacency: https://www.census.gov/geo/reference/county-adjacency.html↩