Calculates a p-score for the null hypothesis h0 that we have missed our recall target, `recall_target`.

Usage

calculate_h0(df, recall_target = 0.95, bias = 1, seen_docs = NULL)

Arguments

df

A data.frame that contains the columns `relevant` and `seen`. It should have one row per document, ordered as ranked by the ML prioritisation algorithm. `relevant` should contain 1 for relevant documents, 0 for irrelevant documents, and NA for documents that have not yet been screened. `seen` should contain 1 where a document has been screened by a human and 0 where it has not yet been screened.

recall_target

The recall target (default = 0.95). Must be between 0 and 1.

bias

A number representing our estimate of how much more likely we were to select a random relevant document than a random irrelevant document (default = 1, i.e. equally likely). The higher this is, the better we think the machine learning went.

seen_docs

An integer that overrides the `seen` column by stating how many documents, counted from the top of the ranking, have been screened (default = NULL).
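
As an illustration of the layout described for `df`, the following sketch builds a tiny data frame by hand. The labels are made up, and `n_seen` is a hypothetical variable name for the count that `seen_docs` would correspond to, assuming screening followed the ranking without gaps.

```r
# Hypothetical toy data: 8 ranked documents, the first 5 screened by a human
labels <- c(1, 0, 1, 0, 0)  # screening decisions, in ranked order
n_docs <- 8
n_unseen <- n_docs - length(labels)

df <- data.frame(
  relevant = c(labels, rep(NA, n_unseen)),              # NA = not yet screened
  seen     = c(rep(1, length(labels)), rep(0, n_unseen))
)

# Number of screened documents at the top of the ranking
n_seen <- sum(df$seen == 1)

# Sanity checks: screened documents form an unbroken prefix of the ranking,
# and every unscreened document has a missing relevance label
stopifnot(all(df$seen[seq_len(n_seen)] == 1))
stopifnot(all(is.na(df$relevant[df$seen == 0])))
```

Under these assumptions, `seen_docs = n_seen` would describe the same screening state as the `seen` column.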

Value

p, a p-score for our null hypothesis. We can reject the null hypothesis (and stop screening) if p is below 1 minus our confidence level.
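
The stopping rule above can be sketched as a small comparison. `should_stop` is a hypothetical helper written for illustration, not a function provided by the package, and the 0.95 confidence level is an assumed choice.

```r
# Hypothetical helper (not part of the package): apply the stopping rule
# "reject h0 when p is below 1 minus the confidence level"
should_stop <- function(p, confidence = 0.95) {
  stopifnot(p >= 0, p <= 1, confidence > 0, confidence < 1)
  p < 1 - confidence
}

should_stop(0.9996611)
#> [1] FALSE
should_stop(0.01)
#> [1] TRUE
```

A p-score close to 1 therefore means the null hypothesis cannot be rejected and screening should continue.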

Examples

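# Simulate a ranking in which relevant documents tend to appear early
N <- 60000           # number of documents
prevalence <- 0.01   # prevalence of relevant documents
r <- N * prevalence  # number of relevant documents
bias <- 10           # relevant documents are 10 times more likely to be drawn
docs <- rep(0, N)
docs[1:r] <- 1
weights <- rep(1, N)
weights[1:r] <- bias
set.seed(2023)
# Weighted sampling without replacement produces the ranked order
docs <- sample(docs, prob = weights, replace = FALSE)
df <- data.frame(relevant = docs)
df$seen <- 0
df$seen[1:1000] <- 1  # the first 1000 documents have been screened
calculate_h0(df)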
#> [1] 0.9996611
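
The weighted sampling in the example is one way to picture what the `bias` argument describes. The sketch below, under the assumption that this simulation is a fair stand-in for a prioritised ranking, compares how many relevant documents land near the top with and without such an advantage. `relevant_in_top` and its parameters are hypothetical names introduced for illustration only.

```r
# Hypothetical helper: simulate a ranking with a given bias and count the
# relevant documents among its first `n_top` positions
relevant_in_top <- function(bias, N = 5000, r = 50, n_top = 500) {
  docs <- c(rep(1, r), rep(0, N - r))          # 1 = relevant, 0 = irrelevant
  weights <- c(rep(bias, r), rep(1, N - r))    # relevant docs weighted by bias
  ranked <- sample(docs, prob = weights, replace = FALSE)
  sum(ranked[seq_len(n_top)])
}

set.seed(1)
unbiased <- relevant_in_top(bias = 1)   # ranking no better than random order
biased   <- relevant_in_top(bias = 10)  # relevant documents favoured tenfold
c(unbiased = unbiased, biased = biased)
```

One would expect noticeably more relevant documents near the top of the biased ranking; the exact counts depend on the seed.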