Skip to contents

Calculates p scores across different recall targets

Usage

recall_frontier(df, bias = 1)

Arguments

df

A data.frame that contains the columns `relevant` and `seen` The dataframe should have as many rows as there are documents, and be ordered in the order dictated by the ML prioritisation algorithm. relevant should contain 1s and 0s for relevant and irrelevant documents, and NAs for documents that have not yet been screened. Seen should contain 1s where documents have been screened by a human, and 0s where documents have not yet been screened

bias

a number which represents our estimate of how much more likely we were to select a random relevant document than a random irrelevant document. The higher this is, the better we think the machine learning went.

plot

Boolean describing whether to plot a graph (default=True).

Value

A dataframe with a column `p` showing the p score for h0 calculated given each recall target `target`

Examples

N <- 60000 # number of documents
prevalence <- 0.01 # prevalence of relevant documents
r <- N*0.01 # number of relevant documents
bias <- 10
docs <- rep(0,N)
docs[1:r] <- 1
weights = rep(1,N)
weights[1:r] <- bias
set.seed(2023)
docs <- sample(
  docs, prob=weights, replace=F
)
df <- data.frame(relevant=docs)
df$seen <- 0
df$seen[1:20000] <- 1
recall_df <- recall_frontier(df)
#> Warning: NaNs produced