Article
November 2016

Occupational Requirements Survey: results from a job observation pilot test

The Bureau of Labor Statistics (BLS) conducts the Occupational Requirements Survey (ORS), an establishment survey, for the Social Security Administration. The survey collects information on the vocational preparation for and the cognitive and physical requirements of occupations in the U.S. economy, as well as the environmental conditions in which those jobs are performed. BLS field economists, who conduct interviews with establishment representatives, collect these data. Because stakeholders asked whether data collected through this mode would yield measurements similar to those obtained through direct job observation, BLS conducted a job observation pilot test during the summer of 2015. As part of this test, field economists recontacted establishments that had responded to the ORS preproduction test and, while observing workers performing their jobs, recorded data on the physical job requirements. The results showed relatively high rates of agreement between observed and collected data for most physical requirements.

In the summer of 2012, the Social Security Administration (SSA) and the Bureau of Labor Statistics (BLS) signed an interagency agreement to begin testing the collection of occupational requirements data. As a result, BLS established the Occupational Requirements Survey (ORS) as a test survey in late 2012. The goal of ORS is to collect and publish occupational information that meets the needs of SSA at the level of an eight-digit code based on the Standard Occupational Classification (SOC) system that is used by the Occupational Information Network (O*NET).1

The ORS data are collected under the umbrella of the National Compensation Survey, whose field economists collect the data. Field economists generally collect data elements either through a personal visit to the establishment or remotely by telephone, email, mail, or a combination of modes.

For ORS, field economists collect occupationally specific data elements to meet the needs of the SSA in the following categories:

  • Physical demands
  • Specific vocational preparation
  • Mental and cognitive demands
  • Environmental conditions in which the work is performed

In fiscal year 2015, BLS completed data collection for the ORS preproduction test. The preproduction test might better be described as a “dress rehearsal” because the collection procedures, data-capture systems, and review were structured to be as close as possible to those that would be used in production.2

Background on job observation pilot test

The ORS job observation pilot test was intended to assess whether the data collected through ORS interview collection methods are systematically different from the data collected through direct observation. This test was conducted in response to both public comments in a Federal Register notice and an external subject matter expert’s recommendations for testing and validating ORS design.3

The observation test ran from June to September 2015 and involved recontacting a subset of establishments that were interviewed as part of the preproduction test. Pairs of field economists were sent to observe select jobs within each establishment and record data on the physical and environmental data elements during a 1-hour observation period.4 The 1-hour observation period sought to balance gathering data that represent the job against the respondent burden involved in conducting such a test.

Because the goal of ORS is to produce estimates at the eight-digit O*NET SOC level, the observation test was structured to allow us to compare preproduction data with observed data at the eight-digit SOC level as well. Thus, a subset of occupations was chosen for inclusion in the test. The subset was chosen on the basis of two criteria:

  1. The occupation must appear in the preproduction microdata at least 40 times.
  2. The occupation must have substantial physical requirements and be relatively simple to observe (e.g., long-haul truck drivers were excluded).

These criteria resulted in the following occupations sampled for the observation test:

  • Nursing assistants
  • Cooks, institution and cafeteria
  • Cooks, restaurant
  • Waiters and waitresses
  • Dishwashers
  • Janitors and cleaners
  • Maids and housekeeping cleaners
  • Childcare workers
  • Cashiers
  • Retail salespersons
  • Receptionists and information clerks
  • Team assemblers
  • Industrial truck and tractor operators
  • Laborers and freight, stock, and material movers, hand

Procedures for job observation pilot test

The sample consisted of 540 jobs (456 from private industry and 84 from state and local governments) from existing ORS preproduction establishments. Establishments were sampled across geography, industry, and size to ensure a good distribution of available establishments within each occupation.

For each of the sampled establishments and occupations, a field economist secured the appointment and explained to the respondent the reason for the followup visit. A pair of field economists then collected data by way of a personal visit. When possible, the paired field economists simultaneously observed the same employee in a preselected job and documented the situation for a 60-minute period; when pairing was not possible, one economist observed a separate employee. The field economists were instructed not to look at data recorded from the preproduction test for their establishments or to discuss their data with one another. Each field economist independently recorded and coded his or her observations during the personal visit. Field economists remained as inconspicuous as possible and did not ask questions of the observed employee.

The field economists were instructed to code the duration in minutes and to code a duration of zero if the element was not observed. Some elements had additional questions, such as, “Was the job performed with one hand or both?” For these elements, each field economist checked the appropriate category and noted the duration in minutes. Field economists then checked their data for accurate recording before marking the schedule and quote as complete. The team overseeing the test held two collection debriefing meetings with the field economists (one midtest and one at the end of collection) to assess how the process worked.

Response rates

Field economists contacted establishments for 405 of the 540 jobs in the sample and observed 244 jobs, a response rate of 60 percent (244 of 405).5 As shown in table 1, response rates varied by occupation.

Table 1. Response rates for Occupational Requirements Survey job observation test, by occupation
Occupation | Jobs observed | Response rate (percent)
All occupations | 244 | 60
Nursing assistants | 9 | 31
Cooks, institution and cafeteria | 19 | 61
Cooks, restaurant | 16 | 59
Waiters and waitresses | 19 | 66
Dishwashers | 13 | 52
Janitors and cleaners | 25 | 74
Maids and housekeeping cleaners | 20 | 71
Childcare workers | 6 | 37
Cashiers | 22 | 67
Retail salespersons | 17 | 57
Receptionists and information clerks | 23 | 68
Team assemblers | 17 | 71
Industrial truck and tractor operators | 17 | 53
Laborers and freight, stock, and material movers, hand | 21 | 64
Source: U.S. Bureau of Labor Statistics.

Childcare workers and nursing assistants had very high refusal rates, largely stemming from establishments’ concerns about privacy under state and national laws. Although some observations of these two occupations were completed during the test, the resulting samples were too small, so we excluded childcare workers and nursing assistants from our test analysis.

Measures of agreement between preproduction and observed duration

We evaluated the agreement between the observed values of the data elements and the values collected during the preproduction test interviews. We refer to these as “observed” and “interview” values hereafter.6 Our analysis focused on the physical elements defined in table 2.

Table 2. Description of physical elements in Occupational Requirements Survey
Physical demand | Description
Crawling | Moving about on hands and knees or hands and feet.
Crouching | Bending the body downward and forward by bending legs and spine.
Kneeling | Bending legs at knees to come to rest on knee(s).
Stooping | Bending the body downward and forward by bending the spine at the waist.
Reaching overhead | Extending hands and arms in a 150- to 180-degree vertical arc.
Reaching at or below shoulder level | Extending hands and arms from 0 up to 150 degrees in a vertical arc.
Communicating verbally | Expressing or exchanging ideas by means of the spoken word to impart oral information.
Keyboarding | Entering text or data into a computer or other machine by using a keyboard or other device. This element is measured separately for standard keyboards and for touchscreen, 10-key, and other keyboards.
Fine manipulation | Picking, pinching, or otherwise working primarily with fingers rather than the whole hand or arm.
Gross manipulation | Seizing, holding, grasping, turning, or otherwise working with hand(s).
Pushing or pulling | Exerting force upon an object so that it moves away from (pushing) or toward (pulling) the force. This element is measured separately for hands and arms, feet and legs, and feet.
Climbing ramps or stairs | Ascending or descending ramps and/or stairs by using feet and legs.
Climbing ladders, ropes, or scaffolding | Ascending or descending ladders, scaffolding, ropes, or poles and the like.
Source: U.S. Bureau of Labor Statistics.

The durations of most physical elements for preproduction were classified into five categories:

  1. Not present
  2. Seldom—up to 2 percent of the day
  3. Occasionally—2 percent up to one-third of the day
  4. Frequently—one-third up to two-thirds of the day
  5. Constantly—two-thirds or more of the day

Measuring agreement between observed and interview data was complicated by two factors:

  1. The duration of the observation test was short, which may lead to discrepancies in the recorded presence or absence of certain physical requirements. In particular, we expected high degrees of agreement for physical requirements with durations that fall into the “frequently” or “constantly” categories and low degrees of agreement for elements that occur only “occasionally” or “seldom.”
  2. In preproduction collection, roughly 20 percent of the physical requirements that were classified as “present” in the job had no duration provided by the respondent. The unknown duration is especially high in particular elements—in the sample of jobs that were observed, the interview data had missing duration in nearly 30 percent of the cases for communicating verbally and 25 percent of the cases for fine manipulation.

To address the challenges posed by the short duration of the job observation test, we recategorized the durations into four categories, aggregating the classifications “not present” and “seldom” into one category (a coding sketch of this mapping follows the list):

  1. Not present or seldom—less than 2 percent of the day
  2. Occasionally—2 percent up to one-third of the day
  3. Frequently—one-third up to two-thirds of the day
  4. Constantly—two-thirds or more of the day
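
To make the recategorization concrete, the Python sketch below maps the share of the workday during which an element is present to one of the four categories. It is illustrative only; the cutoffs come from the category definitions above, but the function name and code are ours, not part of any BLS collection system.

```python
def duration_category(share_of_day: float) -> str:
    """Map the share of the workday an element is present (0.0 to 1.0)
    to one of the four aggregated ORS duration categories."""
    if share_of_day < 0.02:          # less than 2 percent of the day
        return "Not present or seldom"
    elif share_of_day < 1 / 3:       # 2 percent up to one-third
        return "Occasionally"
    elif share_of_day < 2 / 3:       # one-third up to two-thirds
        return "Frequently"
    else:                            # two-thirds or more
        return "Constantly"

# Example: an element observed for 20 of 60 minutes falls exactly at
# the one-third boundary and is therefore coded "Frequently."
print(duration_category(20 / 60))
```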

First, we calculated “raw” levels of agreement between the observed data and interview data. These levels are presented in column 2 of table 3. The levels of agreement are relatively high, ranging from 71.2 percent for reaching at or below shoulder level to 97.4 percent for pushing or pulling with feet.

Table 3. Percent agreement and Cohen’s kappa measure of agreement between observed and interview data for duration of physical elements
Occupational Requirements Survey element | Agreement (percent) | Expected agreement (percent) | Cohen’s kappa | p-value | PABAK
Crawling | 97.1 | 97.1 | −0.01 | 0.58 | 0.96
Crouching | 79.3 | 77.8 | 0.07 | 0.18 | 0.63
Kneeling | 87.7 | 85.4 | 0.16 | <0.01 | 0.78
Stooping | 74.0 | 71.7 | 0.08 | 0.02 | 0.38
Reaching overhead | 84.3 | 81.0 | 0.18 | <0.01 | 0.62
Reaching at or below shoulder level | 71.2 | 67.3 | 0.12 | <0.01 | 0.31
Communicating verbally | 75.6 | 67.3 | 0.25 | <0.01 | 0.41
Keyboarding | 92.1 | 81.5 | 0.58 | <0.01 | 0.81
Keyboarding, touchscreen | 93.9 | 88.4 | 0.47 | <0.01 | 0.85
Keyboarding, 10-key | 96.5 | 94.9 | 0.32 | <0.01 | 0.94
Keyboarding, other | 95.9 | 94.8 | 0.20 | <0.01 | 0.90
Fine manipulation | 76.7 | 71.4 | 0.19 | <0.01 | 0.44
Gross manipulation | 76.3 | 70.5 | 0.21 | <0.01 | 0.44
Pushing or pulling with hands and arms | 73.6 | 66.6 | 0.21 | <0.01 | 0.37
Pushing or pulling with feet and legs | 79.8 | 76.0 | 0.16 | <0.01 | 0.52
Pushing or pulling with feet | 97.4 | 97.4 | −0.01 | 0.57 | 0.94
Climbing ramps or stairs | 89.8 | 88.6 | 0.11 | 0.05 | 0.82
Climbing ladders, ropes, or scaffolding | 95.8 | 94.9 | 0.17 | <0.01 | 0.92

Note: PABAK = prevalence-adjusted and bias-adjusted kappa.

Source: U.S. Bureau of Labor Statistics.

The levels of agreement reported in table 3 suggest that the durations of physical requirements were similar across modes of collection, but statistical tests of agreement were required to ensure that the levels of agreement did not occur by chance. Therefore, we used a weighted version of Cohen’s kappa to assess agreement across modes of collection.7 The kappa statistic measures the agreement against a benchmark of the expected agreement. Bear in mind, however, that if only a few possible categories exist, the observed and interview data could fall into the same duration category simply by chance. The weighting in our kappa measure penalized less for disagreements that were close (e.g., the observed duration was “frequently” and the duration from the interview was “constantly”) and more severely for disagreements that were farther apart (e.g., the observed duration was “not present or seldom” and the duration from the interview was “constantly”).
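
As an illustration of the weighted kappa calculation, the Python sketch below uses scikit-learn’s implementation with linear weights on hypothetical paired category codes. The linear weighting scheme and the data are assumptions for illustration; BLS performed its calculations in Stata (see note 9), and the article does not state the exact weight matrix used.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Ordinal duration categories coded 0-3:
# 0 = not present or seldom, 1 = occasionally, 2 = frequently, 3 = constantly
interview = np.array([0, 1, 3, 2, 0, 3, 1, 2, 3, 0])  # hypothetical interview codes
observed = np.array([0, 1, 3, 3, 0, 3, 0, 2, 2, 0])   # hypothetical observed codes

# Raw (unadjusted) proportion of agreement, as in column 2 of table 3
raw_agreement = np.mean(interview == observed)

# Weighted kappa: linear weights penalize adjacent-category disagreements
# less severely than disagreements that are farther apart
kappa_w = cohen_kappa_score(interview, observed, labels=[0, 1, 2, 3], weights="linear")

print(f"raw agreement = {raw_agreement:.2f}, weighted kappa = {kappa_w:.2f}")
```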

Kappa generally ranges from −1 to +1. Negative values of kappa indicate that the level of agreement is less than the expected agreement. Kappa statistics close to 1 imply a higher level of agreement. Although some controversy exists in the literature regarding thresholds of kappa, J. Richard Landis and Gary Koch have proposed the following standards (encoded in the sketch after the list):8

  • Less than or equal to 0 is poor agreement.
  • 0.01–0.20 is slight agreement.
  • 0.21–0.40 is fair agreement.
  • 0.41–0.60 is moderate agreement.
  • 0.61–0.80 is substantial agreement.
  • 0.81–1.00 is almost perfect agreement.
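
These benchmarks can be encoded as a simple lookup; this small helper is ours, included only to make the thresholds explicit.

```python
def landis_koch_label(kappa: float) -> str:
    """Interpret a kappa statistic using the Landis-Koch benchmarks."""
    if kappa <= 0:
        return "poor"
    elif kappa <= 0.20:
        return "slight"
    elif kappa <= 0.40:
        return "fair"
    elif kappa <= 0.60:
        return "moderate"
    elif kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.58))  # keyboarding's kappa from table 3 -> "moderate"
```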

As can be seen in column 3 of table 3, however, the expected levels of agreement are relatively high and the kappa statistics are relatively low.9 With the exception of crawling, crouching, and pushing and/or pulling with feet, the agreement levels are greater than the expected levels of agreement in a 5-percent one-tailed test (column 5 of table 3). The kappa statistics vary considerably; the average value of the kappa statistic is 0.20, which denotes low levels of agreement beyond what would be expected by chance.

A well-known issue with kappa is the influence of prevalence and bias on the kappa measures. Generally, variables with underlying uniform distributions result in higher values of kappa. Measuring kappa using data that have skewed prevalence can cause the “kappa paradox,” in which high levels of rater agreement produce relatively low kappa statistics.10 The distributions of the physical elements in ORS, however, tend to be highly skewed (in terms of both the underlying discrete values of duration and the categorical measures). Figure 1 illustrates the distribution of observed duration for two physical elements: crouching and gross manipulation. The mode for crouching is “not present or seldom,” which was recorded in over 70 percent of the observed jobs; frequent or constant crouching rarely occurs. At the other end of the spectrum, gross manipulation is constant for most of the observed jobs, which is not surprising when one considers the occupations sampled (cashier, maid, etc.).

To reduce the impact of skewness on our test statistics, we used a measure of prevalence-adjusted and bias-adjusted kappa (PABAK). The PABAK measure is presented in column 6 of table 3. The average is 0.68, which is considered “substantial”; however, the PABAK measure varies considerably by data element. In particular, stooping, reaching at or below the shoulder, communicating verbally, fine manipulation, gross manipulation, and pushing and/or pulling with hands and arms have low measures (less than 0.50), even after prevalence and bias are adjusted.
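
For reference, the standard (unweighted) prevalence-adjusted and bias-adjusted kappa is a simple transformation of the raw proportion of agreement. Because our kappa was weighted, the PABAK values in table 3 may reflect a weighted analogue, so they need not match this expression exactly. For $k$ duration categories and an observed proportion of agreement $p_o$,

\[
\mathrm{PABAK} = \frac{k\,p_o - 1}{k - 1}.
\]

For example, with $k = 4$ and $p_o = 0.971$ (crawling), this expression gives $(4 \times 0.971 - 1)/3 \approx 0.96$, in line with the table.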

Of particular concern, given the potential uses of ORS data in the disability determination process, was whether the preproduction interview data understate the duration of the physical elements, especially for those elements with relatively low PABAK measures. We evaluated potential under- or overstatement of duration using a sign test, which tests the difference in medians. We were particularly interested in elements for which the sign test rejects the null hypothesis that the observed median is less than or equal to the interview median—rejecting this hypothesis implies that the observed durations exceeded the durations collected through interviews (see column 2 of table 4).
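
The Python sketch below illustrates the sign-test logic on hypothetical paired duration codes, using the exact binomial form with ties dropped; the data and the use of scipy are our assumptions, as the article does not describe the exact implementation.

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical paired duration codes (0-3) for one element across jobs
observed = np.array([3, 2, 3, 1, 2, 3, 0, 2, 3, 1])
interview = np.array([2, 2, 1, 1, 2, 2, 0, 1, 3, 0])

diff = observed - interview
n_pos = int(np.sum(diff > 0))  # observed duration exceeds interview duration
n_neg = int(np.sum(diff < 0))  # interview duration exceeds observed duration
n = n_pos + n_neg              # ties carry no sign and are dropped

# One-sided test of H0: observed median <= interview median.
# Rejecting H0 suggests the observed durations exceed the interview durations.
result = binomtest(n_pos, n, p=0.5, alternative="greater")
print(f"positives = {n_pos}, negatives = {n_neg}, p-value = {result.pvalue:.3f}")
```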

Table 4. P-values associated with test of differences in median values between interview and observation
Occupational Requirements Survey element | Null hypothesis: observed median less than interview | Null hypothesis: observed median greater than interview
Crawling | 0.66 | 0.66
Crouching | 0.50 | 0.59
Kneeling | 0.97 | 0.05
Stooping | <0.01 | 1.00
Reaching overhead | 0.81 | 0.25
Reaching at or below shoulder level | <0.01 | 1.00
Communicating verbally | 0.95 | 0.07
Keyboarding | 0.68 | 0.44
Keyboarding, touchscreen | 0.35 | 0.79
Keyboarding, 10-key | 0.29 | 0.87
Keyboarding, other | <0.01 | 1.00
Fine manipulation | <0.01 | 1.00
Gross manipulation | <0.01 | 1.00
Pushing or pulling with hands and arms | <0.01 | 1.00
Pushing or pulling with feet and legs | 0.22 | 0.84
Pushing or pulling with feet | 0.02 | 1.00
Climbing ramps or stairs | 0.98 | 0.05
Climbing ladders, ropes, or scaffolding | 0.96 | 0.21
Source: U.S. Bureau of Labor Statistics.

The elements with longer durations associated with observation are stooping, reaching at or below the shoulder, other keyboarding, fine manipulation, gross manipulation, pushing and/or pulling with hands and arms, and pushing and/or pulling with feet. Among these elements, only reaching at or below the shoulder level differed in mode between the interview and observed values: the mode for this element was “occasionally” (2 percent up to one-third of the day) in the interview data and “constantly” (two-thirds or more of the day) in the job observation data.

Missing duration was identified as an issue with ORS preproduction data for some of the physical elements. In the case of reaching at or below the shoulder level, 53 of the job observation duration measures could not be compared with interview data because the interview duration was missing. Notably, for 64 percent of these 53 quotes, the job observation test recorded durations of “frequently” or “constantly.” This pattern was common among the elements for which the sign test rejected the null hypothesis that observed duration was equal to or below the preproduction duration: the missing data in preproduction aligned with observed durations above the median and mode.

Conclusions

The job observation pilot test validated the ORS physical elements by comparing the data collected through preproduction interviews with those data collected through a different source—observation. Pairs of field economists were assigned to observe the same job for 60 minutes and record the duration of each physical element of the job.

For elements that workers performed infrequently, the 60-minute observation period may have led to more disagreement between observed data and interview data collected during preproduction. Even so, the PABAK measures of duration were relatively strong, suggesting that the data collected during interviews and observations had high levels of agreement across most elements.

Suggested citation:

Tiffany Y. Chang, Kristen Monaco, and Kristin Smyth, "Occupational Requirements Survey: results from a job observation pilot test," Monthly Labor Review, U.S. Bureau of Labor Statistics, November 2016, https://doi.org/10.21916/mlr.2016.50

Notes

1 Most typically used by BLS, the Standard Occupational Classification (SOC) system uses six-digit codes (https://www.bls.gov/soc/), generally referred to as “detailed occupations.” O*NET uses a more detailed occupational taxonomy (https://www.onetcenter.org/taxonomy.html), classifying occupations by eight-digit codes and referring to these as “O*NET-SOC 2010 occupations.” The SOC has 840 six-digit codes, and the O*NET-SOC taxonomy has 1,110 eight-digit codes.

2 The sample design was similar to what will be used in production but was altered to meet test goals. A report on the Occupational Requirements Survey preproduction test is available at https://www.bls.gov/ncs/ors/pre-prod-report.pdf.

3 A link to the subject matter expert’s report can be found at https://www.bls.gov/ncs/ors/pre-prod_estval.pdf.

4 The field economists used devices to collect data on a subset of the Occupational Requirements Survey environmental elements. Unfortunately, problems with the readings of some devices resulted in the analysis of those elements being dropped from this research.

5 The remaining 135 jobs were at establishments that were not contacted during the time of the test.

6 We also assessed the agreement between the paired field economists. The analysis of interrater agreement is available in the “Occupational Requirements Survey job observation report” at https://www.bls.gov/ncs/ors/preprod_job_ob.pdf.

7 Jacob Cohen, “Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, 1968, pp. 213–220.

8 J. Richard Landis and Gary G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, March 1977, pp. 159–174, https://www.jstor.org/stable/2529310.

9 Measures of expected agreement, kappa statistics, and the standard errors needed to compute p-values were calculated with use of Stata (version 12). Stata’s calculations are based on J. Richard Landis and Gary G. Koch, “A one-way components of variance model for categorical data,” Biometrics, vol. 33, no. 4, December 1977, pp. 671–679, https://www.jstor.org/stable/2529465; and standard errors are based on Joseph L. Fleiss, John C. Nee, and J. Richard Landis, “Large sample variance of kappa in the case of different sets of raters,” Psychological Bulletin, vol. 86, no. 5, September 1979, pp. 974–977, http://psycnet.apa.org/psycinfo/1979-32706-001.

10 For more information, see Domenic V. Cicchetti and Alvan R. Feinstein, “High agreement but low kappa II: resolving the paradoxes,” Journal of Clinical Epidemiology, vol. 43, no. 6, 1990, pp. 551–558, https://www.ncbi.nlm.nih.gov/pubmed/2189948; and Alvan R. Feinstein and Domenic V. Cicchetti, “High agreement but low kappa I: the problem of two paradoxes,” Journal of Clinical Epidemiology, vol. 43, no. 6, 1990, pp. 543–549, https://www.ncbi.nlm.nih.gov/pubmed/2348207.

About the Authors

Tiffany Y. Chang
chang.tiffany@bls.gov

Tiffany Y. Chang is a statistician in the Office of Employment and Unemployment Statistics, U.S. Bureau of Labor Statistics.

Kristen Monaco
Monaco.Kristen@bls.gov

Kristen Monaco is the Associate Commissioner of the Office of Compensation and Working Conditions, U.S. Bureau of Labor Statistics.

Kristin Smyth
smyth.kristin@bls.gov

Kristin Smyth is a labor economist in the Office of Compensation and Working Conditions, U.S. Bureau of Labor Statistics.
