A randomized saturation design is like the simple two-level design that we describe in twolevelrand.Rmd, but it involves randomly assigning the proportion treated, or "saturation" of the treatment, to counties.1 For example, we might randomly assign counties to receive between 0% treated (i.e., the "pure control" counties in the two-level design) and 100% treated, with some saturations in between (like 33%, 50%, 66%, etc.).
This design raises a couple of new questions for us. If our total number of treatments is fixed at \(n_t\), then, to maximize power and enhance interpretability: How many saturation categories should we use? How many counties should we choose?
We will block first on state because agricultural outcomes and farmer behavior will vary by state. We expect that the relationships between microloan applications, spillover, and other outcomes might also vary by state (in part as a proxy for ecosystem/type of agriculture and in part as a proxy for other influences that states have on agricultural production). Also, since counties tend to have similar land areas within a state but vary greatly in land area across states, we expect different patterns in spillover by state: assigning treatment to 50% of the farmers in one large county might incur different spillover than assigning treatment to 50% of the farmers in a smaller county.
If we are randomly assigning counties to saturations, we should probably also assign saturation within sets of counties that contain similar numbers of farmers. Across many counties, the distribution of county size should be balanced across saturation treatments in expectation, but in our finite sample we think that we will enhance power and make comparisons easier to understand if, within states, we randomly assign saturation within blocks of similar counties (with the number of farmers in the county being a key covariate, though perhaps there are others that matter as well).
Within blocks, we know that we want at least one county in the 0% saturation (i.e., "pure control") condition.
We have two main effects: the ITT (whereby we compare treated individuals to pure-control individuals) and the Spillover on the Non-Treated (SNT) (whereby we compare non-treated individuals in a county with some saturation \(\pi>0\) to pure-control individuals, i.e., people in counties with \(\pi=0\)).
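To fix ideas, here is a sketch of the two comparisons in potential-outcomes notation (in the spirit of Baird et al. 2015; the specific notation is ours, not theirs). Write \(Y_i(z,\pi)\) for farmer \(i\)'s outcome when individually assigned \(z \in \{0,1\}\) in a county with saturation \(\pi\). Then, roughly,

\[ \text{ITT}(\pi) = E[Y_i(1,\pi)] - E[Y_i(0,0)], \qquad \text{SNT}(\pi) = E[Y_i(0,\pi)] - E[Y_i(0,0)]. \]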
The power of the design will depend on the overall experimental pool \(N\) (roughly 3 million); the number treated \(n_t\); the variation in the outcome, \(Y\); the size of the direct effect of treatment; the size of the spillover effect; the intra-county dependence (measured by the intra-cluster correlation coefficient, ICC); and the correlation between treatment statuses within counties. The Baird et al. (2015) piece provides formulas relating these quantities to power.
The Baird et al. (2015) piece uses a random-effects/model-based mode of statistical inference. They claim that this very much simplifies the work required for power analysis. Although, in general, I tend to advocate randomization inference for randomized experiments, I suspect that we might gain from following their lead here, especially since our total pool of units is so large: the concerns about consistency and the CLT that one sidesteps with randomization inference in its permutation form might not be that important. That said, we might still want to check our results using a more direct randomization inference approach if we have few saturation categories.
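To make that concrete, here is a minimal sketch of the kind of simulation-based power check we have in mind. Everything in it is illustrative: the two-saturation design (0% and 50%), the effect size, the ICC, and the naive t-test (which ignores the county-level clustering that a real check would have to respect) are placeholder assumptions, not our design.

## Minimal simulation-based power sketch. All parameter values are made up;
## the t.test below ignores intra-county dependence, so it overstates power.
## nCounties should be even so that the two saturations split evenly.
simPower<-function(nCounties=100,nPerCounty=50,icc=0.05,directEffect=0.2,nSims=200){
  pvals<-replicate(nSims,{
    county<-rep(seq_len(nCounties),each=nPerCounty)
    sat<-sample(rep(c(0,.5),nCounties/2))[county] ## county saturation: 0% or 50%
    z<-rbinom(length(county),1,sat) ## individual assignment within each county
    u<-rnorm(nCounties,sd=sqrt(icc))[county] ## county random effect induces the ICC
    y<-u+directEffect*z+rnorm(length(county),sd=sqrt(1-icc))
    keep<- z==1 | sat==0 ## the ITT comparison: treated vs. pure-control individuals
    t.test(y[keep]~z[keep])$p.value
  })
  mean(pvals<.05) ## estimated power at alpha = .05
}
## simPower()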
The following reads the data from Google Drive, assuming that there is a Google Sheet called "countyCountsFIPS.csv" in your sheets directory.
library(googlesheets)
countyCountsGS<-gs_title("countyCountsFIPS.csv") ## get info about the data
countyCounts<-as.data.frame(gs_read_csv(countyCountsGS,as.is=TRUE),stringsAsFactors=FALSE)
## countyCounts<-countyCounts[countyCounts$state!="PR",]
str(countyCounts)
'data.frame': 3389 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ state : chr "AK" "AK" "AK" "AK" ...
$ state.fips : int 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr "Aleutians East" "Aleutians West" "Anchorage" "Bethel" ...
$ county.fips: int 13 16 20 50 60 68 70 90 100 105 ...
$ n : int 8 7 930 17 6 25 21 422 23 12 ...
countyCounts$STATEFIPS<-with(countyCounts,ifelse(state.fips<10,paste("0",state.fips,sep=""),as.character(state.fips)))
countyCounts$COUNTYNAME<-countyCounts$county
countyCounts$COUNTYFIPS<-with(countyCounts,ifelse(county.fips<10,
paste("00",county.fips,sep=""),
ifelse(county.fips<100,paste("0",county.fips,sep=""),
as.character(county.fips))))
countyCounts$combifips<-with(countyCounts,paste(STATEFIPS,COUNTYFIPS,sep=""))
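As an aside, the nested ifelse padding above can be written more compactly with sprintf; a commented-out sketch of the equivalent:

## Equivalent zero-padding with sprintf ("%02d" pads to two digits, "%03d" to three):
## countyCounts$STATEFIPS<-sprintf("%02d",countyCounts$state.fips)
## countyCounts$COUNTYFIPS<-sprintf("%03d",countyCounts$county.fips)
## countyCounts$combifips<-paste0(countyCounts$STATEFIPS,countyCounts$COUNTYFIPS)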
str(countyCounts)
'data.frame': 3389 obs. of 10 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ state : chr "AK" "AK" "AK" "AK" ...
$ state.fips : int 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr "Aleutians East" "Aleutians West" "Anchorage" "Bethel" ...
$ county.fips: int 13 16 20 50 60 68 70 90 100 105 ...
$ n : int 8 7 930 17 6 25 21 422 23 12 ...
$ STATEFIPS : chr "02" "02" "02" "02" ...
$ COUNTYNAME : chr "Aleutians East" "Aleutians West" "Anchorage" "Bethel" ...
$ COUNTYFIPS : chr "013" "016" "020" "050" ...
$ combifips : chr "02013" "02016" "02020" "02050" ...
sapply(countyCounts,function(x){ sum(is.na(x) | x=="" ) })
id state state.fips county county.fips n
0 0 0 0 0 0
STATEFIPS COUNTYNAME COUNTYFIPS combifips
0 0 0 0
with(countyCounts,tapply(n,state,sum)) ## N by state
AK AL AR AZ CA CO CT DE FL GA
3330 69979 56139 24649 71932 49230 4643 6254 60993 75985
HI IA ID IL IN KS KY LA MA MD
3479 137252 27353 145738 108542 104116 134013 55057 8262 25891
ME MI MN MO MS MT NC ND NE NH
9333 85599 117255 109653 69699 42221 156017 51032 86328 3565
NJ NM NV NY OH OK OR PA RI SD
20480 19483 6065 43250 126956 68605 36829 60822 1271 50685
TN TX UT VA VT WA WI WV WY
121912 293867 13889 86890 7611 49918 156940 18243 11811
with(countyCounts,tapply(county,state,function(x){ length(unique(x) ) })) ## number of counties by state
AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA
36 73 89 26 70 72 9 3 79 166 5 103 51 113 99 110 122 70
MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
18 28 22 86 91 122 92 62 115 58 96 11 25 37 20 68 92 81
OR PA RI SD TN TX UT VA VT WA WI WV WY
37 69 5 70 106 263 31 138 16 47 76 58 27
summary(with(countyCounts,tapply(n,county,sum))) ## farmers by county name (same-named counties in different states are pooled here)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 440 922 1697 1777 37860
sapply(split(countyCounts$n,countyCounts$state),function(x){ summary(x) }) ## dist of county sizes by state
AK AL AR AZ CA CO CT DE FL GA
Min. 1.00 1.0 1.0 1.0 1 1.0 1.0 990 1.0 1.0
1st Qu. 3.75 462.8 292.2 1.0 185 124.5 282.0 1092 266.0 189.0
Median 13.50 827.0 568.5 143.5 716 296.5 578.0 1195 562.0 362.0
Mean 92.50 945.7 623.8 948.0 1028 683.8 515.9 2085 772.1 449.6
3rd Qu. 46.75 1242.0 801.5 794.8 1474 897.8 675.0 2632 1102.0 572.0
Max. 930.00 2848.0 3187.0 9382.0 6876 3606.0 887.0 4069 3436.0 2811.0
HI IA ID IL IN KS KY LA MA MD
Min. 1.0 1 1.0 1 1 1.0 1 1.0 1.0 1.0
1st Qu. 400.0 864 161.8 647 674 396.5 570 260.2 19.5 554.5
Median 700.0 1195 391.5 1060 1096 641.0 918 445.5 440.0 750.0
Mean 695.8 1333 526.0 1256 1075 938.0 1072 786.5 459.0 924.7
3rd Qu. 980.0 1586 759.2 1684 1392 1130.0 1340 1181.0 661.0 1122.0
Max. 1398.0 5986 2132.0 8248 3133 8574.0 4443 3767.0 1270.0 2425.0
ME MI MN MO MS MT NC ND NE NH
Min. 1.0 1.0 1 1.0 1.0 1.0 1 1.0 1.0 1.0
1st Qu. 1.0 232.5 701 473.5 441.0 200.5 437 333.2 372.5 200.5
Median 456.0 591.0 1120 777.0 757.0 385.0 1214 550.5 709.0 315.0
Mean 405.8 983.9 1261 891.5 757.6 681.0 1357 879.9 899.2 324.1
3rd Qu. 649.5 1742.0 1651 1142.0 1061.0 797.5 1844 1057.0 1189.0 483.5
Max. 1153.0 3213.0 7576 5109.0 2743.0 3648.0 8486 4519.0 7268.0 566.0
NJ NM NV NY OH OK OR PA RI SD
Min. 1.0 1.0 1.00 1.0 1 1 1.0 1.0 57.0 1.0
1st Qu. 341.0 248.0 55.75 254.0 978 498 332.0 422.0 140.0 321.5
Median 783.0 365.0 125.00 572.0 1325 698 655.0 777.0 250.0 572.0
Mean 819.2 526.6 303.20 626.8 1365 847 995.4 881.5 254.2 724.1
3rd Qu. 1234.0 566.0 291.50 912.0 1770 1025 1480.0 1246.0 349.0 938.8
Max. 2842.0 2144.0 2247.00 1897.0 3952 5339 3252.0 2938.0 475.0 4580.0
TN TX UT VA VT WA WI WV WY
Min. 1 1.0 1 1.0 1.0 1.0 1 1.0 1.0
1st Qu. 428 307.0 170 234.2 247.8 263.5 982 130.0 205.5
Median 924 655.5 304 435.5 430.0 699.0 1880 296.5 371.0
Mean 1118 1113.0 448 629.6 475.7 1062.0 2065 314.5 437.4
3rd Qu. 1576 1390.0 729 782.5 707.8 1429.0 2935 438.5 618.5
Max. 6104 16420.0 1713 2901.0 1105.0 5830.0 7122 863.0 1523.0
Remove counties with two or fewer farmers (these seem to be duplicates in the file anyway):
countyCounts<-countyCounts[countyCounts$n>2,]
names(table(countyCounts$combifips))[table(countyCounts$combifips)>1]
[1] "02020"
Anchorage is still duplicated, so we remove it by hand. Eventually the input file should have only one row per county.
countyCounts<-countyCounts[!(countyCounts$n==4 & countyCounts$combifips=="02020"),]
stopifnot(all(table(countyCounts$combifips)==1))
We can have three measures of distance between counties: Euclidean distance between geographic centers, Euclidean distance between centers of population, and an indicator (1 or 0) for whether two counties are adjacent to each other.2
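For the two centroid-based measures, here is a sketch (commented out because the centroid columns are only merged in below; also, planar Euclidean distance on longitude/latitude is a rough approximation, and something like fields::rdist.earth would give great-circle distances instead):

## Sketch: distance matrices from centroids (the LONGITUDE/LATITUDE and
## CENTRDLON/CENTRDLAT columns are constructed below).
## centroidDist<-function(lon,lat){ as.matrix(dist(cbind(lon,lat))) }
## popDist<-centroidDist(wrkdat$LONGITUDE,wrkdat$LATITUDE) ## centers of population
## geoDist<-centroidDist(wrkdat$CENTRDLON,wrkdat$CENTRDLAT) ## geographic centers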
This file required a bit of reformatting work. The script that does the cleanup is gis/setupAdjData.sh. It should be run once at the command line. The countyAdjClean.txt file will be in the repository. The following lines just document the contents of setupAdjData.sh.
## These are bash commands.
## They are currently in gis/setupAdjData.sh which can be run at the command line.
curl -O http://www2.census.gov/geo/docs/reference/county_adjacency.txt
cp county_adjacency.txt tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/"\t/";\t/g' tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/\t\t"/\t\t;;"/g' tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/\t"/\t;"/g' tmp1.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/\t//g' tmp1.txt
## Add "Watonwan County, MN" to line 9625
LC_CTYPE=C LANG=C sed -i.bak '9629s/.*/"Watonwan County, MN";27165;"Blue Earth County, MN";27013/' tmp1.txt
## Change the old FIPS code of 02195 for Petersburgh Alaska to the new one of 02280
LC_CTYPE=C LANG=C sed -i.bak 's/02195/02280/g' tmp1.txt
LC_CTYPE=C LANG=C tr '\n' ':' < tmp1.txt > tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/:;;/;/g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/:/\\\n/g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ County//g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ Census Area//g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ and Borough//g' tmp2.txt
LC_CTYPE=C LANG=C sed -i.bak $'s/ Borough//g' tmp2.txt
cp tmp2.txt countyAdjClean.txt
## https://stackoverflow.com/questions/5411979/state-name-to-abbreviation-in-r
Sys.setlocale('LC_ALL','C')
[1] "C/C/C/C/C/en_US.UTF-8"
cntyAdjList1<-scan("gis/countyAdjClean.txt",what="",sep="\n",quote='\"')
cntyAdjList2<-strsplit(cntyAdjList1,";")
centerCounties1<-sapply(cntyAdjList2,function(obj){obj[1]})
centerCountiesFIPS1<-sapply(cntyAdjList2,function(obj){obj[2]})
centerCounties1[centerCounties1=="Prince of Wales-Hyder Census Area"]<-"Prince Wales Hyder"
names(cntyAdjList2)<-centerCounties1
## Remove Puerto Rico and the other territories (VI, GU, AS, MP)
centerCounties2<-centerCounties1[grep("PR$|VI$|GU$|AS$|MP$",centerCounties1,invert=TRUE)]
centerCountiesFIPS2<-centerCountiesFIPS1[grep("PR$|VI$|GU$|AS$|MP$",centerCounties1,invert=TRUE)]
cntyAdjList3<-cntyAdjList2[centerCounties2]
adjacentCounties<-lapply(cntyAdjList3,function(x){
if(length(x)<5){
return("None")
} else {
grep(",",x[3:length(x)],value=TRUE)
}})
isolatedCounties<-sapply(adjacentCounties,function(x){ all(x=="None") })
cntyAdjList3[isolatedCounties]
$`Hawaii, HI`
[1] "Hawaii, HI" "15001" "Hawaii, HI" "15001"
$`Honolulu, HI`
[1] "Honolulu, HI" "15003" "Honolulu, HI" "15003"
$`Kauai, HI`
[1] "Kauai, HI" "15007" "Kauai, HI" "15007"
adjacentCountiesFIPS<-lapply(cntyAdjList3,function(x){ grep('[0-9]$',x[3:length(x)],value=TRUE) })
names(adjacentCountiesFIPS)<-centerCountiesFIPS2
states<-read.csv("gis/StateNamesAndAbbreviations.csv",as.is=TRUE)
usaCountiesPopCentroids<-read.csv(url("http://www2.census.gov/geo/docs/reference/cenpop2010/county/CenPop2010_Mean_CO.txt"),as.is=TRUE)
usaCountiesPopCentroids$state<-states$Postal.Code[match(usaCountiesPopCentroids$STNAME,states$State)]
usaCountiesPopCentroids$STATEFIPS<-with(usaCountiesPopCentroids,ifelse(STATEFP<10,paste("0",STATEFP,sep=""),as.character(STATEFP)))
usaCountiesPopCentroids$COUNTYNAME<-usaCountiesPopCentroids$COUNAME
str(usaCountiesPopCentroids)
'data.frame': 3221 obs. of 10 variables:
$ STATEFP : int 1 1 1 1 1 1 1 1 1 1 ...
$ COUNTYFP : int 1 3 5 7 9 11 13 15 17 19 ...
$ COUNAME : chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
$ STNAME : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ POPULATION: int 54571 182265 27457 22915 57322 10914 20947 118572 34215 25989 ...
$ LATITUDE : num 32.5 30.5 31.8 33 34 ...
$ LONGITUDE : num -86.5 -87.8 -85.3 -87.1 -86.6 ...
$ state : chr "AL" "AL" "AL" "AL" ...
$ STATEFIPS : chr "01" "01" "01" "01" ...
$ COUNTYNAME: chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
usaCountiesPopCentroids$COUNTYFIPS<-with(usaCountiesPopCentroids,ifelse(COUNTYFP<10,
paste("00",COUNTYFP,sep=""),
ifelse(COUNTYFP<100,paste("0",COUNTYFP,sep=""),
as.character(COUNTYFP))))
library(foreign)
usaCountiesGeoCentroids<-read.dbf("gis/AddCombiFips/USACountyCentroids.dbf",as.is=TRUE)
tmp<-t(sapply(split(usaCountiesGeoCentroids,usaCountiesGeoCentroids$COMBIFIPS),function(dat){
sapply(dat,function(x){
if(is.character(x)){
return(unique(x))
} else {
return(mean(x))
}})}))
usaCountiesGeoCentroidsAvg<-as.data.frame(tmp,stringsAsFactors=FALSE)
usaCountiesGeoCentroidsAvg$CENTRDLON<-as.numeric(usaCountiesGeoCentroidsAvg$CENTRDLON)
usaCountiesGeoCentroidsAvg$CENTRDLAT<-as.numeric(usaCountiesGeoCentroidsAvg$CENTRDLAT)
str(usaCountiesGeoCentroidsAvg)
'data.frame': 3219 obs. of 6 variables:
$ COUNTYNAME: chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
$ STATEFIPS : chr "01" "01" "01" "01" ...
$ COUNTYFIPS: chr "001" "003" "005" "007" ...
$ COMBIFIPS : chr "01001" "01003" "01005" "01007" ...
$ CENTRDLON : num -86.6 -87.7 -85.4 -87.1 -86.6 ...
$ CENTRDLAT : num 32.5 30.7 31.9 33 34 ...
## Dumping repeated counties --- basically, multi-island counties with multiple geographic centroids
censusCounties<-merge(usaCountiesPopCentroids,usaCountiesGeoCentroidsAvg,all.x=TRUE,all.y=FALSE,by=c("COUNTYFIPS","STATEFIPS"))
## Clifton Forge, VA is a town not a county as of 2010, but only in usaCountiesGeo so not in this file
## censusCounties<-censusCounties[censusCounties$COUNTYNAME.x!="Clifton Forge",]
summary(censusCounties)
COUNTYFIPS STATEFIPS STATEFP COUNTYFP
Length:3221 Length:3221 Min. : 1.0 Min. : 1.0
Class :character Class :character 1st Qu.:19.0 1st Qu.: 35.0
Mode :character Mode :character Median :30.0 Median : 79.0
Mean :31.3 Mean :103.1
3rd Qu.:46.0 3rd Qu.:133.0
Max. :72.0 Max. :840.0
COUNAME STNAME POPULATION LATITUDE
Length:3221 Length:3221 Min. : 82 Min. :17.98
Class :character Class :character 1st Qu.: 11310 1st Qu.:34.34
Mode :character Mode :character Median : 26076 Median :38.23
Mean : 97011 Mean :37.96
3rd Qu.: 65880 3rd Qu.:41.69
Max. :9818605 Max. :70.52
LONGITUDE state COUNTYNAME.x
Min. :-168.29 Length:3221 Length:3221
1st Qu.: -98.06 Class :character Class :character
Median : -89.96 Mode :character Mode :character
Mean : -91.66
3rd Qu.: -82.99
Max. : -65.30
COUNTYNAME.y COMBIFIPS CENTRDLON CENTRDLAT
Length:3221 Length:3221 Min. :-169.50 Min. :17.98
Class :character Class :character 1st Qu.: -98.05 1st Qu.:34.34
Mode :character Mode :character Median : -89.93 Median :38.20
Mean : -91.56 Mean :37.93
3rd Qu.: -82.95 3rd Qu.:41.69
Max. : -43.46 Max. :69.31
NA's :6 NA's :6
## There are a few places with population centroids but no geographic centroids. This is ok. I think that we'd prefer population centroids anyway.
blah<-censusCounties[is.na(censusCounties$CENTRDLAT),"COUNTYNAME.x"]
countyCounts$county[countyCounts$county %in% blah]
[1] "Hoonah-Angoon" "Petersburg" "Skagway" "Wrangell"
[5] "Broomfield"
grep("Prince",countyCounts$county,value=TRUE)
[1] "Prince Wales Hyder" "Prince Georges" "Prince Edward"
[4] "Prince George" "Prince William"
censusCounties$COUNTYNAME.x[censusCounties$COUNTYNAME.x=="Prince of Wales-Hyder"]<-"Prince Wales Hyder"
## Remove Puerto Rico
censusCounties<-censusCounties[!(censusCounties$STNAME=="Puerto Rico" | censusCounties$state=="PR" | censusCounties$STATEFIPS=="72"),]
stopifnot(all(is.na(unlist(censusCounties[is.na(censusCounties$state),]))))
censusCounties<-censusCounties[!is.na(censusCounties$state),]
## Merge onto existing data.
censusCounties$county<-censusCounties$COUNTYNAME.x
usaCountiesPopCentroids$COUNTYNAME[usaCountiesPopCentroids$COUNTYNAME=="Prince of Wales-Hyder"]<-"Prince Wales Hyder"
usaCountiesPopCentroids$county<-usaCountiesPopCentroids$COUNTYNAME
tmpdat<-merge(countyCounts,censusCounties,by=c("county","state"),all.x=TRUE,all.y=FALSE)
##tmpdat<-merge(countyCounts,usaCountiesPopCentroids,by=c("county","state"),all.x=TRUE,all.y=FALSE)
tmpdat$ids<-paste(tmpdat$county,tmpdat$state,sep=", ")
## We have duplicated records because, I think, some names refer to both an independent city and a county (mostly in Virginia).
## https://en.wikipedia.org/wiki/List_of_counties_in_Virginia
with(tmpdat,table(ids)[table(ids)!=1])
ids
Baltimore, MD Bedford, VA Fairfax, VA Franklin, VA Richmond, VA
2 2 2 2 2
Roanoke, VA
2
blah<-names(with(tmpdat,table(ids)[table(ids)!=1]))
blahdat<-tmpdat[tmpdat$ids %in% blah,]
blahdat<-blahdat[order(blahdat$county,blahdat$POPULATION,decreasing=TRUE),]
## The correct county FIPS codes (from the TR-65 FIPS Code Chart) let us drop the duplicate city records:
bad<-with(tmpdat,{(county=="Richmond" & state=="VA" & COUNTYFP!="159") |
(county=="Bedford" & state=="VA" & COUNTYFP!="019") |
(county=="Fairfax" & state=="VA" & COUNTYFP!="059") |
(county=="Roanoke" & state=="VA" & COUNTYFP!="161")|
(county=="Franklin" & state=="VA" & COUNTYFP!="067") |
(county=="Baltimore" & state=="MD" & COUNTYFP!="005")})
tmpdat<-tmpdat[!bad,]
stopifnot(all(table(tmpdat$ids)==1))
row.names(tmpdat)<-tmpdat$ids
Remove counties with only 1 farmer:
tmpdat<-tmpdat[tmpdat$n>1,]
Remove counties with no geographic information:
tmpdat<-tmpdat[!is.na(tmpdat$LATITUDE),]
Also trim the top and bottom 1 percent of counties (in terms of size) within each state.
stateCountyFarmDist<-tapply(tmpdat$n,tmpdat$state,function(x){ quantile(x,c(0,.01,.1,.25,.5,.75,.9,.99,1)) })
table(tmpdat$state)
AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA
25 66 74 15 58 64 8 3 65 159 4 98 44 98 89 105 120 55
MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
14 19 16 81 85 109 82 56 100 52 93 10 21 32 16 61 88 77
OR PA RI SD TN TX UT VA VT WA WI WV WY
36 67 5 66 95 253 29 92 14 39 71 55 23
datList<-lapply(split(tmpdat,tmpdat$state),function(dat){
qs<-quantile(dat$n,c(.01,.99))
return(dat[dat$n >= qs[[1]] & dat$n <= qs[[2]],])
})
sapply(datList,nrow)
AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA
23 64 72 13 56 62 6 1 63 155 2 96 42 96 87 101 116 53
MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
12 17 14 79 83 105 80 54 98 50 91 8 19 30 14 59 86 75
OR PA RI SD TN TX UT VA VT WA WI WV WY
34 65 3 64 94 247 27 90 12 37 69 53 21
sum(sapply(datList,nrow))
[1] 2898
wrkdat<-do.call("rbind",datList)
## Exclude Delaware, Rhode Island, and Hawaii as having too few counties and people.
wrkdat<-wrkdat[!(wrkdat$state %in% c("DE","RI","HI")),]
row.names(wrkdat)<-wrkdat$combifips
The input to the pairing function is a matrix of distances. So first we make distance matrices.
library(nbpMatching)
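To fix ideas, here is the nbpMatching workflow that we use below, in miniature, on a made-up 4-by-4 distance matrix:

## Toy nonbipartite matching demo (made-up distances): a and b are close,
## c and d are close, so the optimal match pairs them that way.
demoD<-matrix(c(0,1,8,9, 1,0,7,8, 8,7,0,2, 9,8,2,0),nrow=4,
              dimnames=list(letters[1:4],letters[1:4]))
demoMatch<-nonbimatch(distancematrix(demoD))
get.sets(demoMatch$matches,remove.unpaired=TRUE)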
Make an adjacency matrix of the counties:
## Make a matrix with 1 indicating that two counties are touching and 0 otherwise.
#### Not all counties in the adjacency matrix are represented in the working data from the USDA
#### blah<-sapply(centerCountiesFIPS2,function(x){
#### all(adjacentCountiesFIPS[[x]] %in% wrkdat$combifips ) })
#### table(blah)
adjList<-lapply(wrkdat$combifips,function(x){
wrkdat$combifips %in% adjacentCountiesFIPS[[x]]
})
names(adjList)<-wrkdat$combifips
adjMat<-do.call("rbind",adjList)
stopifnot(all(diag(adjMat))) ## all diagonals should be true
adjMat<-adjMat*1 # convert to numbers
dimnames(adjMat)<-list(wrkdat$combifips,wrkdat$combifips)
diag(adjMat)<-0 ## set self-distances to zero for ease in penalty creation below
wrkdat[which(rowMeans(adjMat)==0 ),] ## which counties have no adjacent counties (none remain, now that Hawaii is excluded)
[1] county state id state.fips county.fips
[6] n STATEFIPS.x COUNTYNAME COUNTYFIPS.x combifips
[11] COUNTYFIPS.y STATEFIPS.y STATEFP COUNTYFP COUNAME
[16] STNAME POPULATION LATITUDE LONGITUDE COUNTYNAME.x
[21] COUNTYNAME.y COMBIFIPS CENTRDLON CENTRDLAT ids
<0 rows> (or 0-length row.names)
## Some counties in Minnesota are causing a problem with the adjacency matrix, which should be symmetric
## That is, for example, county 27005 is counted as being adjacent to county 27153 but 27153 is not recorded as being adjacent to county 27005. Since adjacency is a symmetric property (by definition here) I need to fix this.
problemCountiesBig<-colSums(adjMat)!=rowSums(adjMat)
asymCounties<-problemCountiesBig[problemCountiesBig]
asymAdjMat<-adjMat[names(asymCounties),names(asymCounties)]
cbind(colSums(asymAdjMat),rowSums(asymAdjMat))
[,1] [,2]
27005 2 1
27021 1 2
27027 3 2
27051 2 1
27097 2 3
27145 1 2
27153 3 4
27167 3 2
adjacentCountiesFIPS[names(asymCounties)]
$`27005`
[1] "27005" "27027" "27029" "27057" "27087" "27107" "27111" "27159"
$`27021`
[1] "27001" "27007" "27021" "27035" "27057" "27061" "27097" "27153" "27159"
$`27027`
[1] "27005" "27027" "27107" "27111" "27167" "38017" "38077"
$`27051`
[1] "27041" "27051" "27111" "27121" "27149" "27155" "27167"
$`27097`
[1] "27009" "27021" "27035" "27095" "27097" "27145" "27153"
$`27145`
[1] "27009" "27041" "27067" "27093" "27097" "27121" "27141" "27145"
[9] "27153" "27171"
$`27153`
[1] "27005" "27027" "27041" "27051" "27111" "27153" "27159" "27167"
$`27167`
[1] "27027" "27051" "27111" "27155" "27167" "38077"
## wrkdat[names(asymCounties),]
## Just to check: These counties are not in some strange geographic situation.
### with(wrkdat[wrkdat$state=='MN',],plot(CENTRDLAT,CENTRDLON,pch=1))
### with(wrkdat[names(asymCounties),],points(CENTRDLAT,CENTRDLON,pch=19,col="red"))
## Fix the matrix itself: if any county is called adjacent to another county, make
## sure that this is reflected in both counties.
## Since the matrix is 0 when two counties are not adjacent and 1 when they are
## adjacent, I can add the transpose of the matrix and turn missing 0's into 1's
## and existing 1's into 2's. So I then convert that submatrix into TRUE/FALSE
## and then back to 1/0. At least, this is my one-line fix.
adjMat[names(asymCounties),names(asymCounties)]<-as.numeric(
(adjMat[names(asymCounties),names(asymCounties)]+
t(adjMat[names(asymCounties),names(asymCounties)]))>0
)
stopifnot(isSymmetric(adjMat))
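The same one-line idea would also work on the whole matrix at once; a commented-out sketch of that more general fix:

## Sketch: symmetrize the entire adjacency matrix in one step
## (equivalent to the submatrix fix above, applied everywhere).
## adjMat<-1*((adjMat+t(adjMat))>0)
## stopifnot(isSymmetric(adjMat))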
Make a matrix recording absolute differences in number of farmers between all pairs of counties.
scalarDist<-function(var,scalefactor=1){
## Utility function to make n x n abs dist matrices
## Scalefactor helps us turn fractions into integers since nbpMatching needs
## integers
outer(var,var,FUN=function(x,y){
as.integer(abs(x-y)*scalefactor)
})
}
## Create distance matrix for differences in number of farmers
numfarmers<-wrkdat$n
names(numfarmers)<-row.names(wrkdat)
sizeDist<-scalarDist(numfarmers)
## This will be an ingredient in the penalty that, we hope, forces counties that are not adjacent, yet similar in size and within the same state, into pairs.
maxSizeDist<-max(as.vector(sizeDist))
getStatePairs<-function(fips){
adjD<-adjMat[fips,fips]
sizeD<-sizeDist[fips,fips]
penMat<- sizeD + adjD*maxSizeDist ## add the maximum distance to any entry where two counties are adjacent
if( (nrow(penMat) %% 2)!=0){
penMat <- make.phantoms(penMat,1) ## delete one county where we have an odd number --- it will be the county least like the others, greatest distance on size and adjacency.
}
penMatD<-distancematrix(penMat)
nbpm<-nonbimatch(penMatD)
thepairs<-get.sets(nbpm$matches,remove.unpaired=TRUE)
thepairsA<-as.character(thepairs) ## return as a character rather than a factor to make the state-by-state matching easier
names(thepairsA)<-names(thepairs)
return(thepairsA)
}
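To see what the penalty inside getStatePairs does, consider a toy example (made-up sizes, with A adjacent to B): the adjacency term adds the maximum size distance, so the adjacent pair becomes costlier than either non-adjacent alternative. Here we use the toy matrix's own maximum rather than maxSizeDist.

## Toy penalty check: sizes 100, 120, 500, with A adjacent to B.
## A-B: 20+400=420 > A-C: 400 > B-C: 380, so the match avoids the adjacent pair.
toySize<-c(A=100,B=120,C=500)
toyAdj<-matrix(c(0,1,0, 1,0,0, 0,0,0),nrow=3,dimnames=list(names(toySize),names(toySize)))
scalarDist(toySize)+toyAdj*max(scalarDist(toySize))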
## Get the fips codes by state so that we can do state-by-state matching
fipsByState<-split(wrkdat$combifips,wrkdat$state)
## This is inelegant but intelligible and not slow because the matching problems are small in each state
pairsByState<-lapply(fipsByState,function(fips){ getStatePairs(fips)})
## An example of the output. Each county has a categorical variable (a character var) with a label for the pair it is in.
str(pairsByState[1:2])
List of 2
$ AK: Named chr [1:22] "02013-02275" "02016-02185" "02050-02280" "02060-02188" ...
..- attr(*, "names")= chr [1:22] "02013" "02016" "02050" "02060" ...
$ AL: Named chr [1:64] "01001-01013" "01003-01077" "01005-01021" "01007-01131" ...
..- attr(*, "names")= chr [1:64] "01001" "01003" "01005" "01007" ...
## Because some states have an odd number of counties, one is excluded.
sum(sapply(pairsByState,length))
[1] 2870
sapply(pairsByState,length)
AK AL AR AZ CA CO CT FL GA IA ID IL IN KS KY LA MA MD
22 64 72 12 56 62 6 62 154 96 42 96 86 100 116 52 12 16
ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA
14 78 82 104 80 54 98 50 90 8 18 30 14 58 86 74 34 64
SD TN TX UT VA VT WA WI WV WY
64 94 246 26 90 12 36 68 52 20
## Compare to:
table(wrkdat$state)
AK AL AR AZ CA CO CT FL GA IA ID IL IN KS KY LA MA MD
23 64 72 13 56 62 6 63 155 96 42 96 87 101 116 53 12 17
ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA
14 79 83 105 80 54 98 50 91 8 19 30 14 59 86 75 34 65
SD TN TX UT VA VT WA WI WV WY
64 94 247 27 90 12 37 69 53 21
sum(table(wrkdat$state))
[1] 2892
nrow(wrkdat)
[1] 2892
## Add the pair variable to the data
for(i in 1:length(pairsByState)){
wrkdat[names(pairsByState[[i]]),"pm1"]<-pairsByState[[i]]
}
## Every non-missing pair label should appear exactly twice (two counties per pair).
stopifnot(all(table(wrkdat$pm1)==2))
How well did the matching do? Here are descriptive statistics of the pairs in terms of size and adjacency. If any pair involved adjacent counties, we would avoid it (the check below shows that none do). Also, since we did not ask the algorithm to exclude bad matches, we have some pairs that are very different in number of farmers. We will also avoid them when choosing the one (or a few more) pairs for the design.
pairedDat<-wrkdat[!is.na(wrkdat$pm1),]
pairDiffsN<-with(pairedDat,tapply(n,pm1,function(x){ abs(diff(x)) }) )
summary(pairDiffsN)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 5.00 16.00 43.52 40.00 1443.00
quantile(pairDiffsN,seq(0,1,.1))
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
0.0 2.0 4.0 7.0 11.0 16.0 23.0 33.8 51.0 99.0
100%
1443.0
How many pairs involve adjacent counties?
pairDiffsAdj<-with(pairedDat,tapply(combifips,pm1,function(x){ all(as.vector(adjMat[x,x]==0)) })) ## TRUE means the two counties in the pair are NOT adjacent
table(pairDiffsAdj)
pairDiffsAdj
TRUE
1435
We want to assign about 1/4 of the counties in each state to treatment. If this seems like too much, we might decide to assign, say, 6 counties to treatment or 1/4 of the total counties, whichever is greater.
pairsByState<-sapply(split(pairedDat$pm1,pairedDat$state),unique) ## otherwise we'll have two entries for each pair
pairAvgN<-with(pairedDat,tapply(n,pm1,mean))
penalty1 <- function(x1,x2){
## here, x1 and x2 are the vectors of sizes (in numbers of farmers) in each county within a pair
ifelse( (x1-x2) != 0,
log(x1) + log(x2) - log(abs(x1 - x2)),
log(x1) + log(x2) )
}
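A quick numeric check of why this ranking prefers pairs that are both large and close in size (illustrative values; chooseBestPairs below sorts these in decreasing order, so larger is better):

## penalty1 rewards large, close pairs:
penalty1(1000,900) ## log(1000)+log(900)-log(100), about 9.10
penalty1(10,9)     ## log(10)+log(9)-log(1), about 4.50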
chooseBestPairs<-function(pm,numpairs){
## pm is a vector of pair-match names
## numpairs is the number of pairs to choose
## Given a vector of pair match labels, choose the best pairs (closest matches in size that are not adjacent, with the largest samples)
### First, exclude pairs where the two counties are adjacent (should be very rare given our pairing algorithm)
okpairs<-names(pairDiffsN[pm][pairDiffsAdj[pm]]) ## exclude pairs where the two counties are adjacent
sizeDiffs<-pairDiffsN[okpairs]
avgN<-pairAvgN[okpairs]
### Second, sort the pairs by size difference as a proportion of the average size of a county within the pair.
### The idea here is that a 1000-900 pair is better than a 10-9 pair even
### though the size difference is 100 in the first place and only 1 in the second
### place.
## pairsInOrder<-cbind(sizeDiffs,avgN)[order(sizeDiffs,-1*avgN,decreasing=FALSE),] ## sort by diff in size and then by avgN
## pairsInOrder<-cbind(sizeDiffs,avgN,sizeDiffs/avgN)[order(sizeDiffs/avgN,-1*avgN,decreasing=FALSE),] ## sort by diff in size/avgN and then by avgN
## Rank the pairs by a combination of closeness in size and absolute size, breaking ties by size
ordering<-order(penalty1(avgN-(sizeDiffs/2),avgN+(sizeDiffs/2)),avgN,decreasing=TRUE)
pairsInOrder<-cbind(sizeDiffs,avgN,avgN-(sizeDiffs/2),avgN+(sizeDiffs/2),penalty1(avgN-(sizeDiffs/2),avgN+(sizeDiffs/2)))[ordering,] ## ranked by penalty1 (size and closeness), breaking ties by avgN
### Third, start choosing pairs in order from best to worst, requiring that already-chosen counties not be adjacent to any other counties subsequently chosen
## Decided not to focus on control-only adjacency since then the pairs are
## no longer fixed, pre-treatment features, but depend on the preceding
## randomizations. So, it would then be difficult to look for heterogeneous
## effects by type of pair (big counties, small counties, etc.).
nAvail<-numpairs
##nFarmers<-0
newpairsInOrder<-pairsInOrder
res<-list() ## very inefficient but clear and clean versus res<-vector(mode="list",length=nrow(pairsInOrder))
k<-0
while(nAvail > 0 & nAvail <= numpairs){
k<-k+1
## message(k)
#### Choose the pair that is on top of the list, the best one according to our criteria above
pair1<-strsplit(row.names(newpairsInOrder)[1],"-")[[1]]
names(pair1)<-rep(row.names(newpairsInOrder)[1],2)
## pair1Z<-sample(c(0,1)) ## assign one county to treatment
## res[[k]]<-data.frame(county=pair1,Z=pair1Z,pair=names(pair1))
res[[k]]<-data.frame(county=pair1,pair=names(pair1),stringsAsFactors=FALSE)
##nFarmers1<-pairedDat[pair1[pair1Z==1],"n"]/2
##nFarmers<-nFarmers1+nFarmers
## message(nFarmers)
#### Trim the pairsInOrder data to exclude the pairs already chosen and any pairs adjacent to the preceding pair.
remainingPairs<-row.names(newpairsInOrder)[!(row.names(newpairsInOrder) %in% names(pair1)[1])]
if(length(remainingPairs)==0){
return(res)
} else {
## Find pairs where at least one member is adjacent to one of the pair members and exclude that pair from consideration.
## adjM<-adjMat[pair1,##[pair1Z==0], unlist(strsplit(remainingPairs,"-"))]
adjM<-adjMat[pair1,unlist(strsplit(remainingPairs,"-"))]
adjCounties<-colnames(adjM)[colSums(adjM)>=1]
adjCountiesLocs<-unique(sapply(c(pair1[1],adjCounties),function(cnty){ grep(cnty,row.names(newpairsInOrder)) }))
includeCounties<-row.names(newpairsInOrder)[-c(adjCountiesLocs)]
if(length(includeCounties)==0){
return(res)
} else {
newpairsInOrder<-newpairsInOrder[includeCounties,,drop=FALSE]
nAvail<-min(numpairs,nrow(newpairsInOrder))
## message("Available pairs:",nAvail)
}
}
}
}
targetPairsByState<-floor(sapply(pairsByState,length)/2) ## no more than 1/4 of the counties in a state in treatment
selectedPairs<-lapply(names(pairsByState),function(nm){
pairList<-chooseBestPairs(pairsByState[[nm]],numpairs=targetPairsByState[[nm]])
if(length(pairList)==0){
return(NA)
} else {
results<-data.frame(do.call("rbind",pairList),stringsAsFactors=FALSE)
results$state<-nm
return(results)
}
})
names(selectedPairs)<-names(pairsByState)
## How many counties did the algorithm return per state?
sapply(selectedPairs,function(x){ if(is.data.frame(x)){ return(nrow(x)) } else { return(NA) }})
AK AL AR AZ CA CO CT FL GA IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC
8 12 14 2 12 12 2 16 30 18 10 18 16 18 24 14 4 6 4 16 14 20 16 10 20
ND NE NH NJ NM NV NY OH OK OR PA SD TN TX UT VA VT WA WI WV WY
8 18 2 4 8 2 12 16 14 6 12 14 18 48 6 18 4 10 12 12 4
prelimExperimentDat<-data.frame(do.call("rbind",selectedPairs),stringsAsFactors = FALSE)
row.names(prelimExperimentDat)<-prelimExperimentDat$county
prelimExperimentDat$n<-pairedDat[row.names(prelimExperimentDat),"n"]
str(prelimExperimentDat)
'data.frame': 584 obs. of 4 variables:
$ county: chr "02090" "02122" "02100" "02180" ...
$ pair : chr "02090-02122" "02090-02122" "02100-02180" "02100-02180" ...
$ state : chr "AK" "AK" "AK" "AK" ...
$ n : int 422 678 23 23 55 44 7 5 1105 1100 ...
head(prelimExperimentDat)
county pair state n
02090 02090 02090-02122 AK 422
02122 02122 02090-02122 AK 678
02100 02100 02100-02180 AK 23
02180 02180 02100-02180 AK 23
02130 02130 02130-02220 AK 55
02220 02220 02130-02220 AK 44
summary(prelimExperimentDat$n) ## sizes of counties in the experiment
Min. 1st Qu. Median Mean 3rd Qu. Max.
5 491 793 1000 1344 4531
set.seed(20150101)
prelimExperimentDat$Z<-unsplit(lapply(split(prelimExperimentDat$county,prelimExperimentDat$pair),function(x){ sample(c(0,1)) }),prelimExperimentDat$pair)
head(prelimExperimentDat)
county pair state n Z
02090 02090 02090-02122 AK 422 0
02122 02122 02090-02122 AK 678 1
02100 02100 02100-02180 AK 23 1
02180 02180 02100-02180 AK 23 0
02130 02130 02130-02220 AK 55 1
02220 02220 02130-02220 AK 44 0
stopifnot(all(tapply(prelimExperimentDat$Z,prelimExperimentDat$pair,sum)==1)) ## exactly 1 treated county per pair
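If we later do the randomization-inference check mentioned earlier, the same split/unsplit pattern gives us a re-randomization function; a sketch (newZ is a hypothetical helper, not used elsewhere):

## Sketch: re-randomize treatment within pairs, for permutation-based inference.
newZ<-function(dat){ unsplit(lapply(split(dat$Z,dat$pair),sample),dat$pair) }
## e.g., permZs<-replicate(1000,newZ(prelimExperimentDat))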
A map of our counties. The gray-filled counties are those assigned to treatment. I think that this is very useful since we may want to go back and refine the algorithm a bit.
png(file="map.png",width=1920,height=1080)
par(oma=rep(0,4))
library(maps)
data(countyMapEnv)
data(usaMapEnv)
data(county.fips)
map('state',regions=states$State[states$Postal.Code %in% prelimExperimentDat$state],exact=FALSE,mar=c(1,1,1,1))
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(prelimExperimentDat$county)],add=TRUE)
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(prelimExperimentDat$county)[prelimExperimentDat$Z==1]],col="gray",fill=TRUE,add=TRUE)
dev.off()
quartz_off_screen
2
The map makes us realize that we could contaminate the controls because of adjacency across state lines. This is rare, so we just dump those pairs where this happens — keeping the pairs with more farmers.
newAdjMat<-adjMat[prelimExperimentDat$county[prelimExperimentDat$Z==0],prelimExperimentDat$county[prelimExperimentDat$Z==1]]
## Total count of control-to-treated adjacencies (a control county can abut more than one treated county)
sum(rowSums(newAdjMat))
[1] 32
## Identify the pairs with this problem. Some counties have multiple adjacencies.
table(rowSums(newAdjMat))
0 1 2
264 24 4
table(colSums(newAdjMat))
0 1 2
262 28 2
ctrls<-rownames(newAdjMat)[rowSums(newAdjMat)>0]
crossStateAdjList<-apply(newAdjMat[ctrls,],1,function(x){ return(names(x[x>0,drop=FALSE]))})
dumpPairs<-lapply(names(crossStateAdjList),function(nm){
crossStateAdj<-prelimExperimentDat[c(nm,crossStateAdjList[[nm]]),]
## Keep the pair with the most farmers; mark the rest for dropping
crossStateAdj$pair[order(crossStateAdj$n,decreasing=TRUE)][-1]
})
names(dumpPairs)<-names(crossStateAdjList)
experimentDat<-prelimExperimentDat[!(prelimExperimentDat$pair %in% unlist(dumpPairs)),]
nextAdjMat<-adjMat[experimentDat$county[experimentDat$Z==0],experimentDat$county[experimentDat$Z==1]]
stopifnot(sum(nextAdjMat)==0)
png(file="map2.png",width=1920,height=1080)
par(oma=rep(0,4))
library(maps)
data(countyMapEnv)
data(usaMapEnv)
data(county.fips)
map('state',regions=states$State[states$Postal.Code %in% experimentDat$state],exact=FALSE,mar=c(1,1,1,1))
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(experimentDat$county)],add=TRUE)
map("county",regions=county.fips$polyname[county.fips$fips %in% as.numeric(experimentDat$county)[experimentDat$Z==1]],col="gray",fill=TRUE,add=TRUE)
dev.off()
quartz_off_screen
2
If we treat 50% of the farmers within each treated county, then we send the following total number of letters.
proportionTreated <- 1/2
sum(experimentDat$n[experimentDat$Z==1] * proportionTreated)
[1] 137363
More generally, here are the total numbers of letters for different choices of the proportion treated.
totalTreated<-sapply(seq(.5,.75,.05),function(x){ sum(experimentDat$n[experimentDat$Z==1] * x) })
rbind(totalTreated,seq(.5,.75,.05))
[,1] [,2] [,3] [,4] [,5] [,6]
totalTreated 137363.0 151099.30 164835.6 178571.90 192308.2 206044.50
0.5 0.55 0.6 0.65 0.7 0.75
Or we can just solve for the proportion that sends out about 150,000 letters:
fn<-function(x){
150000 - sum(experimentDat$n[experimentDat$Z==1] * x)
}
sol<-uniroot(fn,lower=.1,upper=.6)
sol$root
[1] 0.5459986
fn(sol$root)
[1] 2.910383e-11
sum(experimentDat$n[experimentDat$Z==1] * sol$root)
[1] 150000
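Since fn is linear in the proportion, uniroot is more machinery than we need; simple division gives the same root:

## Direct algebraic solution (fn is linear in x); this should equal sol$root.
150000/sum(experimentDat$n[experimentDat$Z==1])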
write.csv(experimentDat,file="experimentDat.csv")
Baird, Sarah, J. Aislinn Bohren, Craig McIntosh, and Berk Ozler. 2015. "Designing Experiments to Measure Spillover Effects."
Sinclair, Betsy, Margaret McConnell, and Donald P. Green. 2012. "Detecting Spillover Effects: Design and Analysis of Multilevel Experiments." American Journal of Political Science 56 (4): 1055–69.
We follow Baird et al. (2015), who in turn build on Sinclair, McConnell, and Green (2012), among others.↩
Some of the sources. Centers of population: http://www2.census.gov/geo/docs/reference/cenpop2010/county/CenPop2010_Mean_CO.txt and https://www.census.gov/geo/reference/centersofpop.html. County adjacency: https://www.census.gov/geo/reference/county-adjacency.html↩