Glossary of Assessment Terms



by Barry Sweeny, 1994




Aggregated Scores -- The combined scores for a population of students (such as all third graders), often expressed as an average. Aggregating scores requires that all the scores be based on the same or equivalent assessments administered in uniform ways.
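
As a minimal sketch of aggregation, assuming the aggregate is a simple average, the following Python computes one combined score for a group. The function name and the sample scores are hypothetical and serve only as illustration.

```python
from statistics import mean

def aggregate_scores(scores):
    """Return the average score for a population of students.

    Assumes every score comes from the same (or an equivalent)
    assessment administered in a uniform way.
    """
    if not scores:
        raise ValueError("no scores to aggregate")
    return mean(scores)

# Hypothetical scores for all third graders in one school.
third_grade_scores = [72, 85, 90, 64, 78]
print(aggregate_scores(third_grade_scores))
```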

Alignment --The process of assuring that State Goals for Learning, local learning outcomes, local curriculum & instruction and the system of assessment all support and match each other.

Analytic Scoring --The use of specific criteria or features to evaluate and assign points to each essential part of a product or performance. Analytic scoring is diagnostic, allowing planning for specific remediation. (See holistic scoring for the alternative approach.)

Annotated Rubric -- The notes from an assessment development group, often after a field test and initial scoring, which explain the meaning of criteria or distinctions between the criteria on a rubric. Annotation is an important tool to increase scoring reliability and to train others to score consistently.

Anchor(s) --Actual samples of student work which illustrate the essential characteristics of work typical for each scoring level on a scoring rubric. Anchors can also be captured on video or audio tapes of performances or may be video or photographic images of a larger product. The top anchor is often called an "exemplar" as it is the example of exemplary work.

Assessment --The process of obtaining information that is used to make educational decisions about students, to give feedback about their progress/strengths/weaknesses, and to judge instructional effectiveness/curricular adequacy.

Authentic --A characteristic of assessments that have a high degree of similarity to tasks performed in the real world. The more authentic the assessment, the less inference required to predict student success after graduation.



Benchmark Group --A grade level or instructional group designated by a school district as one in which students' performance relative to one or more learning outcomes will be assessed.


Complex-Generated Response -- Assessment which asks a student to perform or produce to demonstrate knowledge and skills. Such assessments will not have one right answer, but instead will result in student work which falls across a range of quality. The assessment requires that the student engage in a task of multiple parts or steps. Scoring of the assessment involves teacher judgment based on stated criteria for performance. (See performance-based assessments.)

Comprehensive --All dimensions of a State Goal for Learning with regard to scope, content, specificity, skills, and types of thinking required are addressed.

Consultative --Conducted in a manner that solicits input from staff, students, parents and community. (Constituents from ALL groups must be involved) A consultative process must be documented in written form, but does not require actual participation in decision making.

Criterion-Referenced Test -- A measurement of achievement of specific criteria stated as levels of mastery. The focus is performance of an individual as measured against a standard or criteria rather than against performance of others who take the same test. (See standardized, norm-referenced tests.)

Cut score --The number of points that represents the criterion for successful completion of an assessment task, such as 8 out of 10, or the percent that must be attained to be judged successful in performing an assessment task (e.g., 80%). Cut score also refers to the critical point for dividing scores into two groups in reference to some criterion. It is possible to set multiple cut scores from differing criteria (e.g., does not meet, meets, and exceeds).
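
The use of multiple cut scores can be sketched in Python as follows. The function name and the specific cut values (8 and 10 points) are assumptions chosen for illustration, not values prescribed by this glossary.

```python
def classify(score, meets_cut, exceeds_cut):
    """Place a score into one of three groups using two cut scores."""
    if meets_cut > exceeds_cut:
        raise ValueError("meets_cut must not be higher than exceeds_cut")
    if score >= exceeds_cut:
        return "exceeds"
    if score >= meets_cut:
        return "meets"
    return "does not meet"

# Hypothetical 10-point task: 8 points to meet, 10 points to exceed.
for points in (6, 8, 10):
    print(points, classify(points, meets_cut=8, exceeds_cut=10))
```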


Developmental Appropriateness --A characteristic of scoring criteria ensuring that the range of quality levels (score points) is appropriate for the grade level or age group being assessed.

Disaggregated Group --Any group of students within a school population for which a group score is computed separately from the total assessed population.
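
A minimal sketch of disaggregation, assuming each student record carries a group label and a score and that the group score is an average. The record layout, field names, and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def disaggregate(records, key):
    """Compute a separate average score for each value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}

# Hypothetical records; "program" is an illustrative grouping field.
records = [
    {"program": "bilingual", "score": 70},
    {"program": "bilingual", "score": 80},
    {"program": "general", "score": 90},
]
print(disaggregate(records, "program"))
```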

Diverse Assessment --Using more than one type of assessment in constructing a standard. The types of assessment selected as parts of a standard must not be exclusively forced choice/short answer (e.g., multiple choice, true/false, matching, fill in the blank) and must reflect the range and depth of the content and thinking skills of the learning outcome being assessed.

Documentation --Written descriptions, reports or summaries of the steps taken and the rationale for those actions as related to the Illinois School Improvement Plan.


Evaluation --A value judgment about the results of assessment data. For example, evaluation of student learning requires that educators compare student performance to a standard to determine how the student measures up. Depending on the result, decisions are made regarding whether and how to improve student performance.

Exemplar --Actual samples of student work which illustrate the essential characteristics of work typical of exemplary student work at the top scoring level on a scoring rubric. Several exemplars are desirable to promote creativity so that students see multiple products/performances are possible.

Expectation --An estimate of the percent of students in a school who will meet the defined standard for a learning outcome.


Fairness -- (See non-discrimination)

Feasibility/Reasonableness --A characteristic of scoring criteria ensuring that the judging of student work is appropriate for the conditions within which the task was completed.

Field Test -- A small-scale administration of an assessment with one or two classes of students taught by staff on the assessment development group. A field test is conducted when staff have insufficient experience with using a task with students to know how well it will work as an assessment.

Forced-Choice Assessment --Testing where responses to items, questions or prompts are checked against a set answer key. Scoring does not require judgment on the part of the scorer because there is one right answer to the item. Multiple choice, true/false, cloze, and matching are examples of forced choice/short answer assessments.


Generalizable --The results of an assessment are generalizable when the score on one assessment can accurately predict a student score on a different assessment covering the same knowledge or skill. Generalizability across time is promoted by ensuring that assessments focus on general level concepts or strategies, not on facts, topics, or skills which are found only at one grade or in one classroom.

Grade Equivalent --The estimated grade level that corresponds to a given score. Caution! Grade equivalents are often misunderstood and, as a result, misused.


Holistic Scoring --Scoring based upon an overall impression (as opposed to traditional test scoring which counts up specific errors and subtracts points on the basis of them). In holistic scoring the rater matches his or her overall impression to the point scale to see how the portfolio, product or performance should be scored. Raters usually are directed to pay attention to particular aspects of a performance in assigning the overall score.


Indicator --A statistic that reveals information about the performance of a school or a student. For a statistic to be an educational indicator, there must be a standard against which it can be judged. Educational indicators must meet certain substantive and technical standards that define the kind of information they should provide and the features they should measure. The primary educational indicator is student performance; other secondary indicators include attendance, graduation, mobility, truancy and dropout rates.

Item --An individual question or exercise in a test.


Map -- A chart which summarizes the major elements of a system and which shows the relationships between the parts of a system.

Measurement --In assessment of student learning, the process of gathering information about student characteristics. Educators use a wide variety of methods such as paper and pencil tests, performance assessments, direct observation, and personal communications with students. (See evaluation.)

Methods of Assessment --Tests and procedures used to measure student performance in meeting the standards for a learning outcome. These assessments must relate to a learning outcome, identify a particular kind of evidence to be evaluated, define exercises that elicit that evidence and describe systematic scoring procedures. Methods of assessment are classified as either forced choice/short answer or complex generated (performance-based) response.


Nondiscrimination --Evidence that differences of race or ethnicity, gender, or disability do not bias results of assessment instruments or procedures.

Normal Curve Equivalent (NCE) --Standard scores with a mean of 50 and a standard deviation of approximately 21. The use of an NCE is an attempt to make different assessments comparable.
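
As a sketch of how an NCE relates to a percentile rank, the Python below uses the conventional formula NCE = 50 + 21.06 * z, where z is the normal deviate for the percentile. The constant 21.06 is the usual published value behind "approximately 21"; the function name is hypothetical.

```python
from statistics import NormalDist

NCE_MEAN = 50.0
NCE_SD = 21.06  # conventional value; the glossary says "approximately 21"

def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must be between 0 and 100")
    # z is the standard normal deviate at this percentile.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return NCE_MEAN + NCE_SD * z

for pr in (1, 50, 99):
    print(pr, round(percentile_to_nce(pr), 1))
```

With this scaling, NCEs and percentile ranks coincide at about 1, 50, and 99, but differ in between.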


Objective --Precise statements which specify the performance or behavior a student is to demonstrate relative to a knowledge or skill. Objectives typically relate to lessons or units, not "big ideas" such as described by an outcome.

Outcome --A statement of what students should know and be able to do in order to demonstrate achievement of a State Goal for Learning or a portion thereof. A learning outcome addresses the content of one or more State Goal(s) for Learning; is broader in focus than a learning objective; probes the range and depth of thinking skills appropriate to the State Goal(s) for Learning; is amenable to assessment; may integrate Fundamental Learning Areas; and may reflect problems and tasks found outside the classroom.

Overall Performance Level --A combination of the cut-scores or proficiency levels of the various assessments used to determine whether students do not meet, meet, or exceed the standard set for a whole learning outcome. Different assessments may be given greater weight when determining an overall performance level. (See weighting)


Performance-Based Assessments --Assessments requiring reasoning about recurring issues, problems and concepts that apply in both academic and practical situations. Students actively engage in generating complex responses requiring integration of knowledge and strategies, not just use of isolated facts and skills. (See Complex Generated Response.)

Pilot -- A large scale administration of an assessment, usually with several classes of students if not all students in a grade. The purpose of the pilot is to detect any flaws in the assessment before the assessment is considered "done" and is fully implemented. (See field test)

Portfolio -- A collection of students' work which can be used to assess not only the outcome of learning, but the process of learning. Using portfolios as a school improvement assessment tool requires the ability to score both individual works and the whole portfolio against a standard for each.

Proficiency Level --The equivalent of a cut score (on a forced-choice assessment) but for a performance/complex assessment. The proficiency level for a performance assessment is set by determining the required performance criteria (such as the required level on a rubric) for a specific grade level. Such a proficiency level could be achievement of all the criteria required for a scoring level, or it could be a set number of points achieved by combining scores for each feature on the rubric.
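
The two ways of setting a proficiency level described above can be sketched as follows. The rubric features, the 4-point scale, and the thresholds are hypothetical examples rather than recommended values.

```python
def meets_by_level(feature_scores, required_level):
    """Proficient only if every rubric feature reaches the required level."""
    return all(score >= required_level for score in feature_scores.values())

def meets_by_total(feature_scores, required_total):
    """Proficient if the combined feature scores reach a set number of points."""
    return sum(feature_scores.values()) >= required_total

# Hypothetical rubric scores on a 4-point scale for one piece of writing.
scores = {"main idea": 4, "organization": 2, "conventions": 4}
print(meets_by_level(scores, required_level=3))   # one feature is below level 3
print(meets_by_total(scores, required_total=10))  # total is exactly 10 points
```

The same piece of work can pass under one rule and fail under the other, so the rule itself needs to be stated as part of the standard.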

Prompt --In a narrow sense a prompt is a statement to which a student responds in an assessment, often a reading passage, picture, chart or other form of information. In the fullest sense a prompt is the directions which ask the student to undertake a task. Prompts should include the context of the situation, the problem to be solved, the role the student takes, and the audience for the product or performance.


Rationale --Written statements providing the reasons for steps taken and choices made.

Raw Score --The number of items that are answered correctly out of the total possible.

Reliability --Consistency or stability of assessment results over time. Of particular importance for performance assessment is inter-rater reliability: an estimate of how consistently two or more raters assign the same ratings, which depends on their agreeing on the criteria used to evaluate the performance.
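
One simple way to estimate inter-rater consistency is the proportion of exact agreement between two raters. This is only one of several possible measures and is an assumption here, not a method this glossary prescribes; the ratings are hypothetical.

```python
def exact_agreement(rater_a, rater_b):
    """Return the fraction of items on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical rubric levels given to the same four papers by two raters.
print(exact_agreement([4, 3, 2, 4], [4, 3, 1, 4]))
```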

Representativeness --A characteristic of performance tasks and of scoring criteria ensuring that the task and criteria focus on the significant elements, concepts and strategies in the outcome(s) assessed.


Score --The result obtained by a student on an assessment, expressed as a number. An assessment always has exactly one score. Each score is recorded as a positive number, with a larger numerical value implying a better result.

Scoring Rubric --A set of related scoring scales used for judging student work and awarding points to reflect the evaluation of the work.

Scoring Scale -- Assessment criteria formatted as levels of quality ranging from poorest to best, used to judge student work on a single feature such as "clarity of main idea". Scales may combine several traits within a feature. Scoring levels on the scale are assigned points, each level specifying the characteristics of the quality of content or skills needed to attain the points.

Self-Assessment --Students reflect about their own abilities and performance, related to specified content and skills and related to their effectiveness as learners, using specific performance criteria, assessment standards, and personal goal setting. The intent is to teach students to monitor their own learning continuously.

Standard for a Learning Outcome --The qualitative and quantitative assessment criteria by which it is decided if students have attained a specified level of performance related to an outcome. The parts of a standard include: (a) the learning outcome, (b) the assessment tasks which will measure student learning relative to the learning outcome, (c) the cut-score or proficiency level required to "pass" the assessment and (d) the overall level of performance needed to combine assessments and indicate whether a student has mastered the whole outcome.

Standardized, Norm-Referenced Test --A form of assessment in which a student is compared to other students. Results have been normed against a specific population (usually national). Standardization (uniformity) is obtained by administering the test to a given population under controlled conditions and then calculating means, standard deviations, standardized scores, and percentiles. Equivalent scores are then produced for comparisons of an individual score to the norm group's performance.

Standard Score --A score that is expressed as a deviation from a population mean.
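
A common form of standard score is the z-score, which expresses the deviation from the population mean in standard deviation units. The sketch below assumes that form and treats the supplied scores as the whole population; the function name and data are hypothetical.

```python
from statistics import mean, pstdev

def z_score(raw_score, population_scores):
    """Express a raw score as standard deviations from the population mean."""
    spread = pstdev(population_scores)  # population standard deviation
    if spread == 0:
        raise ValueError("all population scores are identical")
    return (raw_score - mean(population_scores)) / spread

population = [60, 70, 80, 90, 100]  # hypothetical population of raw scores
print(round(z_score(100, population), 2))
```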

Stanine --One of the steps in a nine-point scale of standard scores.
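
By the usual convention, stanines have a mean of 5 and a standard deviation of 2, with each step half a standard deviation wide. The sketch below assumes that convention and a z-score as input; how scores that fall exactly on a boundary are assigned is an arbitrary choice made here.

```python
from bisect import bisect_right

# Boundaries between stanines 1..9, in standard deviation units (z-scores).
STANINE_BOUNDARIES = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]

def stanine(z):
    """Map a z-score to a stanine from 1 (lowest) to 9 (highest)."""
    # Count how many boundaries lie at or below z, then shift to a 1-based step.
    return bisect_right(STANINE_BOUNDARIES, z) + 1

for z in (-2.0, 0.0, 0.5, 2.0):
    print(z, stanine(z))
```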

Sufficiency --A judgment on whether an assessment task is comprehensive enough to produce a sample of student work broad enough in depth relative to a body of knowledge or skill to be considered an adequate measure of whether the student has attained the knowledge or achieved the skill. For forced choice assessments the number of items used to decide this is the crucial issue for sufficiency.


Task --A goal-directed assessment activity or project which prescribes that the student use their background knowledge and skill in a somewhat long term process to solve complex problems or answer a multi-faceted question.


Utility --A characteristic of scoring criteria that ensures that the criteria are diagnostic and can communicate information about performance quality with clear implications for improvement.


Validity --The extent to which an assessment method produces accurate, meaningful, and useful measures of the skills and knowledge it was designed to assess. The primary issue is content validity, which is whether an assessment and instructional program align (match).

Variety of Assessments --Two or more assessment instruments or procedures which are of different kinds or at least rich enough to capture the full range of knowledge, strategies and levels of thinking in an outcome. For each standard, students must be assessed by multiple tasks which are distinctly different in content and the student's processing of information or demonstration of skill. (see diverse)

Validation -- The process of developing, field testing, refining, piloting and refining assessment items, tasks, scoring tools, directions, etc. to increase validity, reliability, fairness and instructional usefulness.


Weighting --A method to combine the results of two or more assessments used in calculating the percent who meet the standard for a learning outcome. If some assessments are deemed more important due to the amount of time for completion or the number of items included in the assessment, etc. the cut-scores on those assessments may be given greater consideration or weight in determining the overall performance level.
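
One possible weighting scheme, sketched below, is a weighted average of a student's percent scores on each assessment. The glossary does not prescribe a formula, so the scheme, the assessment names, and the weights are all assumptions for illustration.

```python
def weighted_overall(percent_scores, weights):
    """Combine percent scores from several assessments using weights."""
    total_weight = sum(weights[name] for name in percent_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(percent_scores[name] * weights[name] for name in percent_scores)
    return weighted_sum / total_weight

# Hypothetical: the essay takes longer to complete, so it counts double.
percent_scores = {"essay": 90, "quiz": 70}
weights = {"essay": 2, "quiz": 1}
print(round(weighted_overall(percent_scores, weights), 1))
```

The combined result would then be compared against the overall performance level set for the learning outcome.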


You may copy and distribute this as long as it is for free and you retain the following credits:
1995, Barry Sweeny, Resources for Staff and Organization Development
26 W 413 Grand Ave., Wheaton, IL 60187, (630) 668-2605, e-mail [email protected]

