T. Griffith Jones
EDF 6938 - Survey Design and Analysis
Spring 1998 - Dr. Wolfe - University of Florida
Declining trends in both high school and college-level science enrollment present science educators with the need to make informed decisions regarding the effectiveness and appropriateness of current instructional approaches and curricular resources. In physics, these decisions often involve a choice between a more traditional mathematics-based approach and an alternative conceptual approach (Sousa, 1996). Because of the long-established philosophy of teaching physics with a mathematics-based approach, physics instructors who support a conceptual approach are faced with the difficult task of changing the manner in which curriculum is presented to students.
As is the case with implementation of any new type of curriculum reform, initial assessment of student attitudes is necessary.
This study therefore examined students' attitudes on the subject and teaching of physics. Particular attention was focused on students' views on the influence of mathematics in physics instruction. The data were analyzed using a Rasch (1980) rating scale model. The implications of the study for curriculum developers and physics instructors are discussed.
It is no secret that high school students perceive physics as the most abstract, irrelevant, and confusing science course in the high school curriculum (Franz, 1983). Just mentioning the word physics seems to make students cringe and elicits responses like: "Ooh, that is hard stuff." or "I can't understand physics because I'm not good in math." Even most adults react with sour expressions and responses such as: "I never liked that class when I was in school, too much math." or "Physics was so complicated, I was always lost in that class." It is also no secret that most high school and college students avoid physics because of its reputation as an applied mathematics course (Toews, 1988). This has resulted in decreased enrollments in physics at a time when our society desperately needs scientifically literate citizens.
Some high school and community college physics instructors have reacted to this concern by implementing a conceptually-based curriculum that stresses the mastery of key physics concepts before computations are attempted (Hewitt, 1994). But many physics instructors believe a decision to focus on concepts equates to sacrificing the mathematical component of physics (Sousa, 1996). Proponents of the conceptual physics approach believe it does not eliminate mathematics; rather, it introduces all of the standard mathematical equations but uses them as guides to aid student thinking rather than as "recipes" for computation (Hewitt, 1990).
Some university physics instructors believe that a conceptual approach to introductory-level physics courses does a disservice by not providing students with the mathematical foundation necessary for success in higher-level, traditional physics courses (Franz, 1983). In fact, a recent study found that students who had completed a conceptually-based high school physics course did quite well in their subsequent college-level physics class (Linder & Hillhouse, 1996).
For some time now, many researchers have argued that students' achievement in the sciences is determined by their mastery of mathematics (Fakuade 1977, Lewis 1972, Ogunsulire 1977, Rennie & Parker 1996). This argument has been especially strong for physics with its heavy application of mathematics. Students' perception that physics is difficult and that it is beyond their capabilities due to their deficient mathematical skills discourages some students from enrolling in introductory physics classes (Chandavarkar, 1991).
From the foregoing, it seems that most high school students are finding it difficult to succeed in physics with its heavy mathematical focus. At the college level, physics has been described by students with aspirations of becoming our future scientists, doctors, and engineers as the "weed-out course," the "flunk-out course," and even the "killer course" (Hewitt, 1990). Physics is seen as a filter that allows the passage of only a few mathematically elite students while defeating the others. Since our society is facing a critical need for scientists and engineers, particularly minority representation, we cannot afford to "weed out" students during their initial efforts. We must not only attract more students to science but also facilitate their success. Therefore, the present study examines college students' attitudes to the subject of physics, the teaching and learning of physics, and the students' views/attitudes on the influence of mathematics on learning physics.
The 20-question survey was divided into three parts: the subject of physics (6 items), the teaching of physics (8 items), and mathematics and physics (6 items). In Part I, the subject of physics, the focus was on identifying the students' general attitudes towards physics and its value in our society. Part II, the teaching of physics, explored what students thought were effective instructional methods and whether they felt adequately prepared for their first university physics course by their high school physics course. Part III focused on students' views on the influence of mathematics in physics.
A Likert-type response scale, with options ranging from 4 (Strongly Agree) to 1 (Strongly Disagree), was used. Participants could also circle NA for not applicable. The instrument included statements that were both positive and negative to ensure that the students read all the statements carefully before responding. We also invited respondents to include additional written comments if desired. Surveys were distributed by hand to participants during their regularly scheduled class time. To ensure maximum participation, surveys were distributed midway through the class and collected immediately upon completion. This avoided missing potential participants who were either late to class or who might decide to leave early. The data were coded on a scale from 3 (Strongly Agree) to 0 (Strongly Disagree). Not applicable responses were coded with a period (.). Negatively written items were coded in reverse to adhere to the linear continuum (see below).
Linear Continuum of Survey
Low Score High Score
very negative very positive
attitude towards attitude towards
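As a minimal sketch of the coding scheme described above, the following Python function maps a circled response to its coded value. The set of negatively worded item numbers is passed in as a parameter because the paper does not list them; the item numbers used in the usage note are purely illustrative, and `None` stands in for the period used for not-applicable responses.

```python
def code_response(item, raw, negative_items):
    """Map a circled response (4..1 or 'NA') to the 3..0 analysis scale.

    negative_items is a hypothetical set of negatively worded item numbers;
    the paper does not say which items these were.
    """
    if raw == "NA":
        return None  # stands in for the period (.) used for missing data
    if raw not in (1, 2, 3, 4):
        raise ValueError(f"unexpected response: {raw!r}")
    coded = raw - 1  # 4 (Strongly Agree) -> 3, 1 (Strongly Disagree) -> 0
    if item in negative_items:
        coded = 3 - coded  # reverse so a high score always means a positive attitude
    return coded
```

For example, with item 2 assumed to be negatively worded, `code_response(2, 4, negative_items={2})` gives 0, while the same response to a positively worded item gives 3.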
Norms & Sampling
Since this was an exploratory study, a nonprobability convenience sample was selected. The participants (N=286) in this study were university students currently enrolled in one of two introductory-level physics courses (PHY 2020 or PHY 2048). Three classes were surveyed: one of PHY 2020 and two of PHY 2048. All were one-semester, three-credit-hour courses that met for one hour, three times a week. The PHY 2020 class, with 33 participants, met from 2:30 p.m. to 3:30 p.m. and was coded as Group 1. The first class of PHY 2048, with 79 participants, met from 7:30 a.m. to 8:30 a.m. and was coded as 2048A or Group 2. The second class of PHY 2048, with 174 participants, met from 8:50 a.m. to 9:50 a.m. and was coded as 2048B or Group 3.
The target population for this study was all state university students enrolled in introductory-level physics courses throughout the United States. The sampling frame was derived from the University of Florida's Spring registration catalog, which listed all physics classes, including 14 introductory classes, taught for the Spring Semester. A convenience sample was selected from the 14 courses by coordinating dates and times with the professors' schedules.
For future studies a simple random sampling from a complete list of all state universities offering introductory-level physics classes would be preferred, since this would allow the study to be generalized to a larger population with a smaller margin of error. For the sampling frame, sampling all the students for all the introductory classes would have been preferred.
The survey items were calibrated using a rating scale model, based on the Rasch (1980) measurement theory, as implemented under the framework of the item response theory (IRT) program, FACETS (Linacre, 1993). The Rasch analysis is a statistical method that can be used to determine and verify participants' perceptions of the ordering of category meanings (Lopez, 1996).
When applying the Rasch analysis with the FACETS program, one can analyze both person and item calibrations on the same linear logistic scale. How well the calibrations fit the linear scale or rating scale model is described by the Infit and Outfit statistics. The Infit statistic denotes the information-weighted mean-square residual difference between observed and expected responses. The Outfit statistic reports the unweighted mean-square residual and is more sensitive to outliers (Zhu, 1997). Infit and Outfit values of 1 are considered a satisfactory model-data fit, while values greater than 1.5 or less than 0.6 are considered misfit (Wright & Linacre, 1994). Table 1 displays the Outfit statistics for the Scoring Categories, with Category 0 having an Outfit of 0.9, Category 1 also having 0.9, Category 2 having 0.8, and Category 3 having 1.1. The satisfactory fit of the Outfit scores indicates that the responses associated with each of the categories contain meaningful and uniquely identifiable information. This is supported by the separation of the Probability Curves into four distinct hills (see Table 1).
Table 1: Category Statistics and Probability Curves
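The category probabilities and fit statistics discussed above can be illustrated with a short sketch. It assumes the usual rating scale formulation of the Rasch model, with a person measure `theta`, an item difficulty `delta`, and step thresholds `taus` shared by all items. All names and numeric values here are hypothetical; they are not the FACETS estimates for this survey, and this is not how FACETS itself is implemented.

```python
import math

def category_probs(theta, delta, taus):
    """Probability of each category 0..len(taus) under a rating scale model."""
    # Cumulative sums of (theta - delta - tau_j); category 0 has an empty sum.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - delta - tau))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def fit_mean_squares(observed, thetas, delta, taus):
    """Infit and Outfit mean squares for one item across persons."""
    resid_sq_sum = 0.0  # sum of squared residuals
    variance_sum = 0.0  # sum of model variances (the "information")
    z_sq_sum = 0.0      # sum of squared standardized residuals
    for x, theta in zip(observed, thetas):
        probs = category_probs(theta, delta, taus)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        resid_sq = (x - expected) ** 2
        resid_sq_sum += resid_sq
        variance_sum += variance
        z_sq_sum += resid_sq / variance
    infit = resid_sq_sum / variance_sum  # information-weighted mean square
    outfit = z_sq_sum / len(observed)    # unweighted mean square
    return infit, outfit
```

Plotting `category_probs` over a range of `theta` values for a four-category scale (three thresholds) would give the kind of probability curves referred to above; when the thresholds are well ordered, each category is the most probable one over some interval of the scale.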
Rasch statistics and parameters. Reliability is a statistical measure of how reproducible the survey instrument's data are over time (Litwin, 1995). Reliability is a necessary but not a sufficient condition for validity (i.e., your scoring is consistent, but are you measuring what you really want to know?). Different reliability measures should be selected according to the nature of the use. There are several methods of estimating reliability, most of which are reported as correlation coefficients between two sets of similar measurements. Reliability coefficients vary between values of 0.00 and 1.00, with 1.00 indicating the unattainable perfect reliability, and 0.00 indicating no reliability. The closer the reliability is to 1.00, the more the test is free of error variance (the summed effect of the chance differences between participants due to uncontrolled variables). A reliability score of 0.7 or higher is considered good.
The FACETS analysis of the survey data gave a People Measurement Reliability of 0.46 for all groups (N=286). For Group 1 (PHY 2020, n=33) it was 0.58; for Group 2 (PHY 2048A, n=79) it was 0.48; and for Group 3 (PHY 2048B, n=174) it was 0.41. The low reliability scores could be a result of the following conditions within the study:
a) homogeneous group being surveyed, thereby limiting detection of differences;
b) not enough items to distinguish reliable differences between participants;
c) items not a good measure of the construct;
d) rating scale not working well.
Since the Category Statistics and the Probability Curves indicate that the rating scale was working well, the cause is likely one of the other three factors. To determine whether the items are measuring the construct, the fit and point-biserial correlation coefficient (a conventional statistic which reflects the correlation between responses and respondents' total scores) should be checked. The Infit and Outfit were both 1.0 for All Groups, and 1.2 and 1.2 for Group 1, 0.9 and 0.9 for Group 2, and 1.0 and 1.0 for Group 3. Since the Fit should be as close as possible to 1.0, Group 1's score of 1.2 indicates their attitudes were different from those of the other groups. Therefore, they were being mismeasured by the survey. Group 1 was the smallest of the three classes surveyed and even though it was an introductory physics class (as were the other two classes), the professor noted that they seemed even more anxious about the class than usual.
The point-biserial correlations were -0.3, -0.9, -0.16, and -0.07 for All Groups, Group 1, Group 2, and Group 3, respectively. The higher the point-biserial the better, but 1.0 is considered good. A score greater than zero indicates: 1) the item correlates well with the composite score; and 2) a better discrimination of the item. A negative score or a score of zero indicates that the item correlates poorly with the composite score and that the items could be measuring more than one thing. The latter is more likely since the survey was divided into three sections (Subject of Physics, Teaching Physics, Mathematics and Physics). In particular, the items relating to mathematics and physics are problematic due to many students' belief that the two are inseparable. In future studies, it would be preferable to expand and separate these sections into individual surveys.
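The correlation between responses to an item and respondents' total scores can be sketched as a plain Pearson correlation. This is a minimal illustration under stated assumptions: the function names and data layout (one list of coded responses per person, no missing values) are hypothetical, and FACETS may compute its point-biserial statistic differently.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

def item_total_correlation(responses, item_index):
    """Correlate one item's responses with each person's total score."""
    item_scores = [person[item_index] for person in responses]
    totals = [sum(person) for person in responses]
    return pearson(item_scores, totals)
```

A value near zero or below zero for an item would flag it as a candidate for measuring something other than what the total score reflects.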
In applying the Rasch analysis to the Items, the Infit and Outfit statistics are examined to evaluate the fit of the item calibrations, while a Reliability value is provided to quantify the spread of parameter estimates. A reliability of separation index assesses whether the items are separated enough to reliably define a scale. It specifies the proportion of observed variance in parameter estimates not due to estimation error. Differences due to error alone will result in a value of zero (Wolfe and Miller, 1997). The Infit, Outfit and Reliability of Separation values were as follows (respectively): All Groups, 1.0, 1.0, 0.99; Group 1, 0.8, 0.8, 0.98; Group 2, 1.0, 1.0, 0.99; Group 3, 1.2, 1.2, 0.98. Thus these analyses indicate acceptable fit of the data to the rating scale model and sufficient variability between elements to define a scale that can be used to investigate the attitudinal differences among the participants.
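The reliability of separation index, described above as the proportion of observed variance in the estimates that is not due to estimation error, can be sketched as follows. The function name and inputs are hypothetical, and the use of the population variance and of the mean squared standard error as the error variance are assumptions for illustration rather than a description of the FACETS computation.

```python
from statistics import pvariance

def separation_reliability(estimates, standard_errors):
    """Share of observed variance in the estimates not attributable to error."""
    observed_var = pvariance(estimates)
    if observed_var == 0:
        return 0.0  # no spread at all, so nothing to separate
    # Error variance taken as the mean of the squared standard errors (assumption).
    error_var = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    # If error accounts for all of the observed spread, the index is zero.
    adjusted_var = max(observed_var - error_var, 0.0)
    return adjusted_var / observed_var
```

With widely spread estimates and small standard errors the index approaches 1, which is the pattern reported for the items here.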
Conventional Statistics. Besides the Rasch statistics and parameter estimates, internal consistency was determined by computing the split-half correlation (splitting test items into odd and even items, then correlating the two subtest scores). The split-half correlation was 0.39 and was corrected via the Spearman-Brown Prophecy formula to become 0.56. The closer a score is to 1.00, the more the test is free of error variance and is a measure of true differences among persons in the dimensions assessed by the test. Again, the survey's low score indicates that the test was not a good measure of the construct. As indicated previously by the Rasch data, this is probably due to the test measuring more than one facet.
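The Spearman-Brown correction for a test split into two halves is r_full = 2r / (1 + r), where r is the split-half correlation. The sketch below applies it to the reported value of 0.39 and also shows one way to form the odd and even half scores; the function names and the data layout (one list of coded responses per person) are hypothetical.

```python
def odd_even_scores(responses):
    """Sum each person's odd-numbered and even-numbered items separately."""
    odd = [sum(person[0::2]) for person in responses]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in responses]  # items 2, 4, 6, ...
    return odd, even

def spearman_brown(split_half_r):
    """Step a split-half correlation up to the full test length."""
    return 2 * split_half_r / (1 + split_half_r)

corrected = spearman_brown(0.39)  # about 0.56, matching the value reported above
```

The two lists returned by `odd_even_scores` would be correlated with each other to obtain the split-half correlation before the correction is applied.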
Cronbach's alpha was also used to measure the internal consistency of the scale. The coefficient is a direct function of both the number of items in the scale and their magnitude of intercorrelation (Zhu, 1997), with a score close to 1.00 being desirable. Cronbach's alpha was 0.47, which reaffirms the split-half and Rasch results.
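Cronbach's alpha can be computed from the item variances and the variance of the total scores as alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals), where k is the number of items. The sketch below is a minimal version of that formula; the function name and data layout (one list of coded responses per person, no missing values) are assumptions for illustration.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items table of coded responses."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across persons.
    item_vars = [variance([person[i] for person in responses])
                 for i in range(n_items)]
    # Sample variance of each person's total score.
    total_var = variance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
```

Because the formula depends on both the number of items and how strongly they intercorrelate, a 20-item scale whose items tap several different constructs can still produce a low alpha.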
The Standard Error of Measurement (SEM) is a measure of reliability which allows you to estimate the range within which the participant's true score probably falls. The SEM is in the scale of original scores with 0.0 indicating perfect reliability with zero error. Since it is based solely on error and not on both error and true score, SEM is more stable compared to split-half and Cronbach's alpha correlations which are correlations that depend on variability. The survey's lack of item homogeneity also contributed to a very high SEM of 3.73. This high SEM makes it extremely difficult to infer the true score of the participants.
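A common classical formula for the SEM is SD * sqrt(1 - reliability), where SD is the standard deviation of the observed scores. The paper does not state which formula or which reliability estimate produced the value of 3.73, so the sketch below is an assumption, and the numbers in the usage note are hypothetical.

```python
import math

def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: score standard deviation times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)

def score_band(observed_score, sem, z=1.96):
    """Range within which the true score probably falls (about 95% for z=1.96)."""
    return observed_score - z * sem, observed_score + z * sem
```

For instance, a hypothetical score standard deviation of 10 with a reliability of 0.75 gives an SEM of 5, so an observed score of 30 would carry a band of roughly 20 to 40 at z = 2.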
The stability of the test was not measured. In future studies, it would be desirable to test the instrument's stability by administering a retest. From these data a coefficient of stability could be calculated to determine the stability of participants' responses and item positions. It would also be interesting to determine if there was any change in the participants' attitudes over the course of the semester. However, this is difficult to determine given that changes could originate from a variety of sources. For example, the participants could undergo genuine attitudinal changes, or the changes could come from changes in the items due to changes in the construct, or from changes in how participants view the rating scale. However, the Rasch rating scale model can be used to measure change over time by disentangling these facets and creating a common frame of reference (Wolfe & Chiu, 1997).
A survey may be reliable without being valid. A survey's validity addresses the appropriateness, meaningfulness and usefulness of the specific inferences made from the survey scores (Miller, 1997). In other words, it assesses how well the survey measures what it set out to measure. Validity is not a property of the test itself but rather a property of an interpretation being made; therefore, a test may be valid for one use but not another (Albert, Cluxton & Miller, 1997). Several types of validity are measured when assessing the performance of a survey instrument: content, criterion, consequential, and construct.
Content validity is a subjective measure of how appropriate the items seem to a set of reviewers who have some knowledge of the subject matter (Litwin, 1995). The reviewers examine the questionnaire to ensure it includes everything it should and to remove any items they feel are not appropriate. The content validity of this survey was established by having the items evaluated by two university physics professors, a university science education professor, a university professor of educational statistics, two university physics students, and four graduate students currently enrolled in a survey design and evaluation course. Their suggestions led to the elimination and reframing of some of the attitudinal items. Some survey questions were borrowed from a previous survey that examined South African high school students' attitudes towards physics (Ogunsola-Bandele, 1996). The content validity of the South African instrument was established by three physics teachers and two psychologists. (The reliability of the South African instrument was 0.78 using the Spearman-Brown split-half reliability method.)
Criterion validity is demonstrated by comparing the survey scores with one or more external variables (criteria) considered a direct measure of the characteristic in question (Albert, Cluxton & Miller, 1997). This comparison relates the ability of the survey to measure an individual's behavior on some other measured variable (criterion). The other measure is either administered at the same time (concurrent validity) or at a later time (predictive validity). Criterion validity for this survey was not established. However, if time and resources allowed, it would be interesting to conduct individual exit interviews after the students have completed their final exam and to correlate both survey results with the results of the students' final course grades.
Consequential validity deals with systemic and fairness issues that arise from use of the survey. For example, test scores on an instrument could rise over the years due to the teacher teaching to the test or the students becoming more familiar with the instrument. Consequential validity was not addressed with this survey since it was a one-time administration.
Construct validity is the extent to which a test can be shown to measure a hypothetical construct (e.g., attitudes, anxiety, intelligence) (Albert, Cluxton & Miller, 1997). This form of validity is the most valuable because it truly assesses how well a test measures the trait or construct you are interested in. The assessment of construct validity for this survey focused on how attitudes toward mathematics influence one's attitudes towards physics. Based on the premise that students perceive mathematics and physics as inseparable, I predicted that students who struggle with and dislike mathematics will also dislike physics. Because of a history of teaching physics from a mathematical viewpoint, students who excel in mathematics have traditionally been more likely to succeed in physics and therefore have a more positive attitude towards the subject.
Evidence for the construct analysis of the survey was provided by comparing the average group measures for the three different parts of the survey: Part I-The Subject of Physics (Group 1); Part II-The Teaching of Physics (Group 2); and Part III-Mathematics and Physics (Group 3).
To compare the groups with one another, independent-samples t-tests were conducted. In comparing Group 1 to Group 3, with 10 degrees of freedom, a significance level of 0.05, and a critical value of 2.228, no significant difference was found (t = 1.074). In comparing Group 1 to Group 2, with 12 degrees of freedom, a significance level of 0.05, and a critical value of 2.179, no significant difference was found (t = 0.029). And in comparing Group 2 to Group 3, with 12 degrees of freedom, a significance level of 0.05, and a critical value of 2.179, no significant difference was found (t = 1.055).
For statistical validity, the assumptions of normal distribution, independence of groups, and equal variance among the groups were made. The statistical power of the t-test was very poor due to the low number of items in each group.
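A pooled-variance t statistic of the kind used in these comparisons can be sketched as below. The function name and the example measures are hypothetical (the paper does not report the item measures); the critical values in the lookup table are the two-tailed 0.05 values quoted above for 10 and 12 degrees of freedom.

```python
import math
from statistics import mean, variance

# Two-tailed critical values at the 0.05 level, as quoted in the text.
CRITICAL_T = {10: 2.228, 12: 2.179}

def pooled_t_test(sample_a, sample_b):
    """Independent-samples t statistic assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (mean(sample_a) - mean(sample_b)) / std_error
    return t_stat, df

def is_significant(t_stat, df):
    """Compare |t| with the tabled critical value for this df."""
    return abs(t_stat) > CRITICAL_T[df]
```

With six items in Parts I and III and eight in Part II, the degrees of freedom come out as 6 + 6 - 2 = 10 and 6 + 8 - 2 = 12, matching the values given above; none of the reported t values exceeds its critical value.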
External validity for this survey was poor since the sampling method was one of convenience and did not represent the population well. Therefore the results of this study cannot be generalized to the population. For future studies, a simple random sample or a stratified random sample of the population would be preferable.
In review of the internal validity, the design of the study could be improved by redefining the Linear Continuum to be:
Linear Continuum of Survey
Low Score High Score
very negative very positive
attitude towards attitude towards
mathematics in physics mathematics in physics
This would narrow the purpose of the survey, and the items could be reworked to better differentiate the participants' attitudes. Reviewing other components of the survey design for internal validity (a la Cook & Campbell), the survey design is fairly strong. Since it was a one-time test, there were no problems in the areas of history, maturation, testing familiarity, statistical regression, mortality, or change in instrumentation. One situation that might have confounded the results was the selection of PHY 2020 for the survey. Since this course has a lower course catalog number (2020 vs. 2048), students with lower capabilities in mathematics may have selected it, therefore violating the assumption of equal variances.
With respect to evaluating the rating scale's construct validity, the rank ordering of the rating scale thresholds produced by FACETS can be seen in Table 2.0, entitled "All Facet Vertical Rulers". The step thresholds for the rating scale are about the same, which supports the conclusion that the Likert scale is working. Yet the separate grouping of the Math items indicates that they could be their own stratum of items. As was discussed previously, the fit statistics were all within the acceptable range of 0.6 to 1.5.
Table 2.0 All Facet Vertical "Rulers"
With the weak reliability and validity evidence indicated by the preceding analysis, it is difficult to make interpretive statements about the attitudes of university students enrolled in introductory physics courses. Since the linear continuum selected was confounded by some of the items, particularly those which entangled mathematics and physics (Part III), it is difficult to interpret the overall average score from the analysis. However, a cursory review of the observed averages for each item does offer a few interesting insights into students' attitudes. With the coding of the rating scale from 0 (strongly disagree) to 3 (strongly agree), the majority of the items (70%) fell between 1 and 2. This indicates, as did the step thresholds seen earlier, that while the rating scale was working it could be cut down to a three point scale. As it is, it is very tough to differentiate and interpret the correct attitudes of the students. Therefore a review of the items that scored at the extreme ends (0 to 1 or 2 to 3) of the scale might prove more useful.
In the first section, the Subject of Physics (Group 1), only two of the six items scored in the extreme. Item 1 scored an observed average of 2.4 (n=285) with the statement "Everyone should have some basic knowledge of physics." Item 6 scored an observed average of 2.3 (n=283) with the statement "Studying physics has made me realize that a lot of everyday things I see outside of class can be explained by physics." The responses to both these statements are encouraging since they indicate that students see the value of physics in a well-rounded education. Yet the other questions in the first section failed to differentiate the students' attitudes. The item with the lowest observed average, 1.3 (n=283), was item 4, which stated "The only reason I am taking physics is because it is required." This item also supports items 1 and 6, which indicated students' appreciation for the subject matter.
In the second section, the Teaching of Physics (Group 2), three of the eight items scored in the extreme. Item 12, "I really like it when the instructor uses demonstrations to help explain a physics concept.", scored an observed average of 2.5 (n=286). Item 7, "I do not need to read my physics textbook to succeed in this class.", scored an observed average of 2.4 (n=286). Item 11, "I find the drawings in my current physics textbook helpful in understanding the basic concepts.", scored an observed average of 2.0 (n=285). These results support the theory of introducing the concepts to the students initially before they are inundated with difficult calculations. Since the assessments in the courses were entirely multiple-choice mathematical problems, the need to read the conceptual theory in the textbook was not valued as highly as working math problems. The lowest observed average was 1.3 (n=261) with Item 14, "I think my high school physics course gave me an adequate preparation for my first university physics class." With the faster pace and greater emphasis on mathematical problem-solving, only the students with a similar experience in their high school physics course would feel well prepared.
In the third section, Mathematics and Physics (Group 3), only one of the six items scored in the extreme. Item 15, "Mathematics helps me understand physics.", scored an observed average of 2.3 (n=285). But since the students' understanding of physics is only assessed using math problems, it is unclear whether they have an adequate qualitative understanding of the concepts.
The literature (Rennie & Parker 1996, Hazel et al 1996) on student conception of physics suggests that students might be able to perform well on mathematical physics questions without adequate understanding of the concepts.
The results of this survey are inconclusive due to the ineffectiveness of the survey in distinguishing between the qualitative and quantitative nature and assessment of physics. Yet there is enough evidence to warrant further study of students' perceptions and attitudes toward physics. Theoretically, it would be ideal for all physics courses to utilize assessment packages that include both mathematical and conceptually based questions. Students could demonstrate their conceptual knowledge through short flexible questions, multiple-choice questions, or essays which require the students to explain the applications of the concepts or the workings involved in a quantitative question. Combining the two approaches in one assessment would encourage students to link the abstract to the concrete.
The strengths of the survey were in its appearance, ease of use, and large number of participants. The survey was weak in its validity and reliability due to its measuring of more than one construct and its convenience sampling method. For further research, the items should be refined to more accurately assess the students' perceptions of physics and mathematics. More research is needed on how students' attitudes and understanding are driven by the types of assessment used.
Albert, R.T., Cluxton, J.C. and Miller, M.D. (1997). Practical Application and Understanding of Quantitative Foundations of Educational Research. Course notes for University of Florida EDF 6403.
Escalda, L.T., H. P. Baptiste Jr., D. A. Zollman, and N. S. Rebello. (1997). Physics for all. The Science Teacher, 64 (2), 26-29.
Franz, J. (1983). The crisis in high school physics teaching: Paths to a solution. Physics Today, 36, 44-49.
Hewitt, P. G. (1994). Concepts before computation. The Physics Teacher, 32 (4), 224.
Hewitt, P. G. (1990). Conceptually speaking.... The Science Teacher, 57 (5), 54-57.
Kalton, Graham. (1983). Introduction to Survey Sampling. SAGE Publications, Inc. Thousand Oaks, California.
Karplus, R. (1972). Physics for Beginners. Physics Today, 25 (6), 36-47.
Litwin, Mark S. (1995). How to Measure Survey Reliability and Validity. SAGE Publications, Inc. Thousand Oaks, California.
Lopez, W. (1996). Communication validity and rating scales. Rasch Measurement Transactions, 10(1), 482-483.
Ogunsola-Bandele, M. F. (1996, September). Mathematics in Physics - Which Way Forward: The Influence of Mathematics on Students' Attitudes toward the Teaching of Physics. Paper presented at Annual Meeting of the National Science Teachers Association.
Roth, W. (1994). Experimenting in a constructivist high school physics laboratory. Journal of Research in Science Teaching, 31 (2), 197-223.
Sousa, D. A. (1996). Are we teaching high school science backward? National Association of Secondary School Principals Bulletin, 80 (577), 9-15.
Toews, William (1988). Why take physics in high school -- why plan to teach physics? The Physics Teacher, 26 (7), 458-460.
Wolfe, Edward W., Miller, T.R. (1997). Barriers to the Implementation of Portfolio Assessment in Secondary Education. Applied Measurement in Education, 10(3), 235-251.
Wright, B.D., & Linacre, M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 170.
Zhu, Weimo, Updike, W. and Lewandowski, C. (1997). Post-Hoc Rasch Analysis of Optimal Categorization of an Ordered-Response Scale. Journal of Outcome Measurement, 1(4), 286-304.