BACKGROUND

Effective assessment frameworks provide credible results and a coherent set of data that support programmatic goals.1–3 In the case of Entrustable Professional Activity (EPA) assessments, narrative comments provided by assessors must align with assigned supervision ratings in order for the information to be meaningful to learners and their advisors, mentors, and coaches, and to inform institutional decisions to grant autonomy to learners.4–6 With the expanding use of EPA-based entrustment-supervision scales to document a learner’s readiness to assume patient care responsibilities has come the call to evaluate implementation of this framework as a measure of clinical performance.7–11

A core component of clinical performance assessment is the incorporation of narrative comments that not only illuminate and support quantitative ratings but also provide data to guide a learner to improve.5,6,12 Previous studies have explored assessors’ and learners’ interpretation of narrative comments and the importance of providing information to learners to promote their ongoing development.18–22 While much has been written about narrative assessment,13–16 the literature examining alignment of the qualitative and quantitative aspects of EPA assessment is limited.17 Narrative comments provide context for supervision ratings assigned by assessors, justify the “number” selected, and illuminate strengths and areas for improvement related to the observed performance.17,23 Alignment of narrative comments and quantitative scores integrates evidence, thereby facilitating and informing learner self-assessment.24 The availability of clinical faculty who can coach learners to make meaning of the data and set goals for continued development lays the foundation to promote assessment as learning.25–27

In an EPA-based program of assessment, summative entrustment decisions require synthesis of data from ad hoc EPA assessments. Those charged to make these decisions must have a shared understanding about the approach used to aggregate, interpret, and synthesize information from assessments.4,27–32 To facilitate evidence-based, sound decisions, entrustment committees need data that are coherent, that is, narrative comments that are correlated with assessors’ numeric ratings.33–37 In addition, the purpose, process, and consequence of decisions must be clear to the members of the committee and to assessors, learners, and their faculty advisors.38–40 The ability of the committee members to analyze data within and across assessments explicitly contributes to the program’s “fitness” in achieving an educational and catalytic effect.2,3,41

Assessors in our EPA program attend professional development training sessions during which they learn how to use direct observation and apply performance expectations (Fig. 1) to assign a supervision rating and provide narrative comments that justify the selected rating.42,43 Residents, fellows, attendings (faculty), and master assessors (MAs; expert assessors who conduct assessments across clinical contexts) use an adapted, prospective supervision scale during ad hoc assessments. Data from these assessments are immediately available to students and their longitudinal faculty coaches. In regularly scheduled meetings, learners and their coaches co-create individualized learning goals and action plans to achieve those goals.26 The Entrustment Committee (EC) comprises the MAs and is facilitated by two members of the leadership team for the EPA program.43,44 Members of the EC have first-hand knowledge about workplace assessment, the application of performance expectations for EPA tasks, and the use of observation to collect data predictive of a learner’s need for supervision the next time they perform the clinical task.21,43 EC members review the results of assessments done by residents and fellows, attending faculty, and their peer MAs throughout the academic year. At the end of the clerkship phase, a collective summative entrustment decision is made about each learner’s readiness to enroll in an advanced clinical course in which they are expected to assume patient care responsibilities as an acting intern.45

Figure 1. Oral presentation (EPA 6) performance expectations to apply in assigning supervision ratings

In this study, we explore the concordance of narrative comments with supervision ratings provided for two EPA tasks by three different assessor types in ad hoc assessments in three distinct clinical disciplines. Specifically, we seek to examine whether the mean supervision ratings assigned by an expert panel, based on the narrative comments provided at the time of assessment, correlate with the ratings assigned by the original assessor.

METHODS

EPA assessment data are collected in a web-enabled tool, iCAN, within our institutional learning management system, VMED. The iCAN tool includes general information about the patient encounter that is entered by the student, a drop-down menu to select the supervision rating, and two open text boxes in which the assessor describes what the student did well during performance of the EPA and areas for improvement. Data from assessments completed for students enrolled in the clerkship phase over two successive academic years, February 2018–February 2020, were used to extract a stratified random sample of 100 comments. The authors determined that 100 comments would provide a representative sample and would also be feasible for the expert panel to review.
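For illustration, a single iCAN record as described above could be modeled as the following data type. This is a sketch only; the field names are hypothetical, as the tool’s actual schema is not published in this study.

```python
from dataclasses import dataclass

@dataclass
class EpaAssessment:
    """One completed ad hoc EPA assessment, mirroring the iCAN fields above."""
    encounter_info: str      # patient-encounter details entered by the student
    supervision_rating: int  # 1-4, selected from the drop-down by the assessor
    strengths: str           # open text: what the student did well
    improvements: str        # open text: areas for improvement
```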

EPA assessments include a supervision rating using a 4-point modified entrustment-supervision scale categorizing a student’s need for direct or indirect supervision. Assessors select one of the following to indicate their recommendation for the next time the student performs the task (i.e., the student is ready to perform the task): jointly with a supervisor (level 1); with a supervisor in the room, ready to step in as needed (level 2); with a supervisor available to double check all elements of the performed activity (level 3); or with a supervisor available to double check key elements of the performed activity (level 4). All assessments contain narrative comments about observed strengths and areas for development.
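Because the scale is ordinal, it maps naturally onto an integer enumeration. A minimal sketch (with our own, hypothetical names) that could type the supervision_rating field sketched above:

```python
from enum import IntEnum

class SupervisionLevel(IntEnum):
    """4-point modified entrustment-supervision scale; higher = less supervision."""
    JOINTLY_WITH_SUPERVISOR = 1  # performs the task jointly with a supervisor
    SUPERVISOR_IN_ROOM = 2       # supervisor in the room, ready to step in as needed
    DOUBLE_CHECK_ALL = 3         # supervisor double-checks all elements
    DOUBLE_CHECK_KEY = 4         # supervisor double-checks key elements
```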

To extract the sample of data used in the analysis, data for the two EPA tasks with the largest number of completed assessments at the time of the study were used: EPA 1 (history taking and physical examination) and EPA 6 (performing an oral presentation based on a clinical encounter). In our program, EPA 1 is assessed through observations of students completing four aspects of this task: EPA 1.1—obtaining a complete history; EPA 1.2—gathering a focused history; EPA 1.3—performing a complete physical examination; and EPA 1.4—conducting a focused physical examination. A stratified sample of assessments was selected to include each level of supervision rating (level 1 through level 4); each type of assessor in the program (residents/fellows, attendings, and MAs); and distinct types of patient encounters experienced during the clerkship phase (inpatient and outpatient settings, a procedural-based specialty, and adult and pediatric patients). More specifically, assessments from the Internal Medicine (inpatient setting), Pediatrics (inpatient and outpatient settings), and Surgery clerkships were used. Narrative comments were separated from the supervision rating provided by the original assessor by one member of the research team (JM). Narrative comments were further de-identified before review by the expert panel through the removal of information that would identify a student, an assessor, or the clerkship during which the assessment was completed.
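As a rough illustration of this stratified extraction, the pandas sketch below balances draws across supervision rating, assessor type, and clerkship. The file and column names are hypothetical, and this is not the procedure actually used in the study.

```python
import pandas as pd

# Hypothetical export of completed assessments; one row per assessment.
df = pd.read_csv("epa_assessments.csv")

strata = ["supervision_rating", "assessor_type", "clerkship"]

# Aim for a roughly equal number of comments per stratum, capped at 100 total.
per_stratum = max(1, 100 // df.groupby(strata).ngroups)
sample = (
    df.groupby(strata, group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), per_stratum), random_state=42))
)
sample = sample.sample(n=min(len(sample), 100), random_state=42)

# De-identify before panel review: keep only the free-text fields, dropping
# anything that could identify the student, assessor, or clerkship.
comments = sample[["strengths", "improvements"]].reset_index(drop=True)
```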

The expert panel was composed of an MA, a faculty coach, and a member of the EPA leadership team. Members of the expert panel are all clinical faculty members who complete EPA assessments in their role as a clinical supervisor or MA. Despite their familiarity with the process of completing assessments, the panel met as a group to discuss how they would use the established performance expectations for each EPA task (Fig. 1) during their review of narrative comments to frame their decisions about the level of supervision suggested by the qualitative information provided. The de-identified narrative comments were provided electronically to the expert panel after this conversation. Each member of the expert panel independently reviewed the narrative information related to strengths and areas for improvement in each comment and assigned a corresponding supervision rating. Interrater reliability (IRR) among the panel members’ ratings, and the correlation between the mean supervision rating assigned by the expert panel and the rating provided with the narrative comments by the original assessor, were both measured using Kendall’s coefficient of concordance (Kendall’s W). Kendall’s W is a measure of concordance commonly used to assess agreement among a group of raters; as a non-parametric statistic, it is particularly well suited to ordinal outcome measures, as is the case with this dataset.46 A coefficient of 1 represents perfect agreement. A supervision rating assigned by the original assessor and by a member of the expert panel represents a discrete decision related to a specific clinical encounter.4 Means were calculated across ratings of all comments related to each EPA task.
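For readers who wish to reproduce this type of analysis, below is a minimal sketch of a tie-corrected Kendall’s W computation on synthetic ratings. This is our own illustration, not the study’s analysis code, and the function and variable names are ours.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance (W) with tie correction.

    ratings: array of shape (m_raters, n_items); each row holds one rater's
    ordinal scores (here, 1-4 supervision ratings) for the same items.
    Returns W in [0, 1]; 1 indicates perfect agreement among raters.
    Assumes ratings vary across items (a constant matrix zeroes the denominator).
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    # Rank each rater's scores; tied scores receive average ranks.
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    # Sum of ranks per item, and squared deviations from the mean rank sum.
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    # Tie correction: sum of (t^3 - t) over each rater's groups of tied ranks.
    ties = 0.0
    for row in ranks:
        _, counts = np.unique(row, return_counts=True)
        ties += ((counts ** 3) - counts).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)

# Example: three panel members rate ten narrative comments on the 1-4 scale.
rng = np.random.default_rng(0)
panel = rng.integers(1, 5, size=(3, 10))
print(round(kendalls_w(panel), 3))
```

The tie correction matters in this setting because a 4-point scale applied to many comments guarantees tied ratings within each rater.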

The UVA Institutional Review Board reviewed this project and determined that it met criteria for exempt review (ref no. 3791).

RESULTS

One hundred narrative comments were extracted from assessments completed for 305 clerkship phase students (149 in 2018–2019; 156 in 2019–2020). These 100 narrative comments represented 100 unique students and 80 unique assessors. The sample included 32 narrative comments for assessments of history taking (EPA 1.1 + EPA 1.2), 32 comments related to physical examination skills (EPA 1.3 + EPA 1.4), and 36 comments for oral presentation (EPA 6); 37 comments were originally provided by residents/fellows, 27 by attendings, and 36 by MAs; 37 comments were from the internal medicine clerkship, 36 from the pediatrics clerkship, and 27 from the surgery clerkship.

IRR among supervision ratings assigned by members of the expert panel in their independent review ranged from .536 for comments associated with focused history taking to .833 for complete physical exam. Kendall W (KW) test correlation coefficients (CC) for panel members’ assignment of supervision ratings for history taking (complete + focused), physical examination (complete + focused), and oral presentation comments were .668, .697, and .735 respectively (Table 1).

Table 1 Interrater Reliability Among Mean Supervision Ratings Assigned by Each Member of the Expert Panel

CC between the mean supervision rating of the expert panel and the mean rating for the task provided at the time of the assessment were .327 for history taking, .697 for physical examination, and .735 for oral presentation. The mean supervision rating assigned for each task includes assessments completed by all assessor types (residents/fellows, attendings, and MAs) and in all of the clerkships included in the study (internal medicine, pediatrics, and surgery). The mean supervision rating of the original assessors and the mean supervision rating determined through the expert panel members’ review of the narrative comments are included in Table 2. Representative narrative comments with high and low levels of correlation between the supervision ratings provided by the expert panel and by the original assessor during assessments of oral presentations (EPA 6) are illustrated in Table 3.

Table 2 Correlation Between Mean Supervision Rating Provided by the Original Assessor and the Mean Rating Assigned by the Expert Panel
Table 3 Representative Narrative Comments from Assessments of Oral Presentations (EPA 6) Representing High and Low Levels of Correlation Between the Supervision Ratings of the Expert Panel and the Original Assessor

Correlation between the mean supervision ratings assigned by the expert panel and the supervision rating provided with the narrative comments at the time of the observation varied by assessor type (Table 4). For history taking, the CC was .525, .540, and .941 for supervision ratings originally determined by residents/fellows, attendings, and MAs, respectively; for physical examination, the CC ranged from .403 for ratings from residents/fellows to .790 for ratings from MAs; and for oral presentation, the CC ranged from .309 for residents’/fellows’ ratings to .854 for MAs’ ratings.

Table 4 Correlation Between Mean Supervision Rating Provided by Each Original Assessor Type and the Mean Rating Assigned by the Expert Panel

Table 5 contains CC for data from assessments completed by all assessor types in a variety of clinical settings (clerkships). CC between the mean supervision ratings provided at the time of the observation and the mean expert panel ratings ranged, for assessments from the internal medicine clerkship, from .596 for history taking to .942 for oral presentation; for assessments on the surgery clerkship, the CC ranged from .301 for history taking to .873 for oral presentation. CC for mean ratings of comments provided on the pediatrics clerkship were .738 for history taking, .862 for physical examination, and .663 for oral presentation.

Table 5 Correlation Between the Mean Supervision Ratings Provided by Original Assessors in Each Clinical Discipline/Setting (Clerkship) and the Mean Supervision Rating of the Expert Panel

DISCUSSION

In this study, we explored the alignment of narrative comments with the supervision ratings provided by assessors during EPA assessments. The expert panel’s assignment of supervision ratings served as a “gold standard” for comparison with the supervision ratings assigned by original assessors. A higher degree of correlation between the “gold standard” supervision rating and the original rating suggests that the narrative comments provided by the original assessor more closely align with, support, and justify the supervision ratings. We found, however, that supervision ratings given by the expert panel had variable levels of concordance with the ratings given by original assessors at the time of assessment.

Narrative comments contain critical information about learners’ performance not fully captured by a quantitative entrustment-supervision scale score.5,6,47 For narrative comments provided in EPA assessment to be useful to learners48 and also to summative decision-making committees, all stakeholders who provide the comments must be clear about the importance of meaningful, high-quality, performance-based narrative that substantiates quantitative ratings.17 Our findings support the call not only to evaluate the fidelity of implementation but also to measure outcomes that provide meaningful information to learners, their coaches, and institutional decision-makers.7,9,11,47

Supervision ratings of the expert panel had the highest degree of correlation with ratings provided by MAs. As noted, MAs are experienced clinicians, selected and trained to perform assessments across various clinical settings. In our program, all assessors (residents/fellows/attendings) are required to attend an EPA training session. All sessions are interactive and structured to promote skill building and hands-on practice in applying performance expectations, translating observations into decisions about the level of supervision a student needs the next time they perform the task, and providing narrative comments to justify the level of supervision selected.43 MAs are “frequent observers” with designated effort for this role. They are neither simultaneously supervising students nor providing clinical care for the patient at the time of the assessment, and they participate in additional professional development to enable them to complete assessments outside of their clinical specialty.

While authors49 have noted enhanced generalizability of ad hoc entrustment decisions when provided by clinical supervisors who assess students frequently, decisions in the workplace require an assessor to weigh the risk of granting autonomy to a learner.4,50,51 The relative lack of concordance between narrative comments and supervision ratings provided by residents/fellows and attendings may be explained by the challenges inherent in serving concurrently as a teacher, assessor, and clinical supervisor.28,52,53 The quality of and focus on assessment can vary when any one role is emphasized. Assessors may also struggle with assigning a supervision rating indicating what level of supervision a student will need in future clinical encounters.52 Prospective decisions about a learner’s adaptive competence based on a discrete observation of clinical performance require a different mindset than traditional end-of-clerkship/rotation evaluation.4,52,54,55

MAs constitute the Entrustment Committee (EC) and have participated in additional training to facilitate group decision-making. The Committee meets regularly to review and analyze data from ad hoc assessments, further developing their expertise and fortifying their shared mental model about the criteria for assessment, specifically how the performance expectations outline behaviors that can be translated into the assignment of a supervision rating. To make summative entrustment decisions, the members of the committee integrate and synthesize quantitative (supervision ratings) and qualitative (narrative comments) data from assessments completed across clinical contexts to predict students’ readiness to meet expectations for future performance.4,29–31,33

Our findings suggest that despite efforts to establish a shared understanding and application of established performance expectations, clinical supervisors may define what constitutes a focused history differently based on their clinical discipline and, likely, the context of the encounter. In contrast to focused history taking, approaches to the hypothesis-driven, evidence-based physical examination have been well described, and with the availability of published resources, assessors are perhaps less likely to rely on personal opinion in judging a learner’s performance. Correlation between supervision ratings provided with the narrative comments at the time of observation and supervision ratings assigned by the expert panel differed by clerkship and may reflect the value placed on the skill, and perhaps the corresponding comfort level with assessment of the task, in a given specialty.8 Supervisors in procedure-based specialties spend less time with learners in settings in which a history and physical examination would be performed, leading to a reliance on simulation-based assessment for these skills.56 In contrast, supervision ratings and corresponding narrative comments for EPA 6 (oral presentation) provided to students in assessments on the internal medicine clerkship were highly correlated with supervision ratings assigned by the expert panel. This likely reflects comfort with this traditional approach used to assess learners on the internal medicine clerkship.57 These findings suggest the need to consider the existing teaching and assessment practices of various clinical disciplines when defining opportunities to incorporate EPA assessments in each setting.51,58,59

Stakes, whether low or high, influence all stakeholders.50,27,51,53 Dual purposing of assessment data for both formative feedback and summative decisions may also raise concerns for ad hoc assessors.5,17 In our program, the data from ad hoc assessments do not contribute to the student’s evaluation/grade on the clinical clerkship, and the results are visible only to the student, their faculty coach, and their student affairs dean. The student and faculty coach use the data to co-create individualized learning plans to promote continued clinical development.26 Supervisors may be particularly concerned about disadvantaging learners through assessment, highlighting the importance of training for assessors to ensure they understand the goal of the program and how the data from assessment are used.28,43,51,54 A dedicated group of “external” assessors, who do not contribute to a student’s formal end-of-clerkship evaluation, does not experience this tension.

This study has limitations. First, the narrative comments analyzed represent a stratified random sample of the total assessments completed during the study period. The assessments were done during observation of a subset of students by a subset of the total assessors in the program and so may not be representative. Second, this study did not examine the accuracy of the data (supervision ratings or narrative comments). Although our web-enabled assessment tool allows assessors to use voice dictation to capture verbal feedback, it is not known whether the narrative comments were consistent with the verbal feedback given to learners at the time of the assessment. Assessment data must be submitted within a specified period after the observation; if not entered immediately, they may be subject to limited recollection and/or recall bias.

CONCLUSIONS

EPA assessments communicate information about a learner through both entrustment-supervision ratings and narrative comments about observed performance. Concordance between these two components is critical to making the data meaningful to learners and to those who help them use this information for their continued development.2,3,53,60 Committees charged with analyzing and integrating data from EPA assessments to make high-stakes decisions must be able to use the information to support summative entrustment.6,29,30,36,61 Our findings underscore the need for high-quality narrative comments aligned with performance criteria so that the educational and catalytic effect of an EPA-based program of assessment can be fully realized.2,3,43,62,63