EPID 503 Biostatistics 1

Take-Home Final Project Instructions




The Baltimore-Washington Infant Study (BWIS) was the first and largest epidemiological study to address the question of risk factors for structural defects of the heart in newborns (congenital heart defects). As you will see in the assigned readings, this class of birth defects is the most common and, depending on the specific type of malformation, can result in mortality and important, lifelong morbidities. Yet nothing was known about its causes when the study was implemented in 1981 (it completed enrollment of nearly 8,000 families in 1989), and so the possibilities for prevention were equally unknown. Given the rarity of the disease in the general population (about 4 per 1,000 babies), a case-control design was the most efficient way of generating the information. Over 100 scientific reports (books, book chapters, and peer-reviewed journal articles) resulted from the study.


From the readings below you will see that the control group (N=3,572) was enrolled as a population-based sample of all infants in the study region who were free from heart defects at birth. As such, the controls can be studied for various factors that influenced their own developmental outcomes, such as weight at birth and gestational age.




For this project, you will be evaluating two questions:

(a) what are the factors that had a statistically significant effect on birthweight? and

(b) were there any differences between white and African-American babies regarding these factors?


Therefore, we can consider the project as providing an opportunity to put into practice the biostatistical tests and SAS programming techniques you have learned in the course to address a potential health disparities problem.


Background readings


The following articles are posted on Canvas.


Loffredo et al. 2000: read this article for an overview of the purpose of the BWIS and a summary of its major findings regarding possible causes of congenital heart defects.


Loffredo et al. 1993: read this chapter concerning the BWIS database that you will be using for your project. This chapter is from a book about the BWIS methods.


Petrossian et al. 2014: read the Introduction and Materials and Methods sections as an introduction to the topic of the epidemiology of infant birthweight.


Rosenthal et al. 1991: read the Methods section to understand how the reliability of maternally reported birthweights was assessed.


The Database


You will be using a subset of the entire BWIS database, i.e. the records of the control group only. On Canvas you will find 8 raw data files and a Word file named “BWISdat sas code.” You need to download all of the data files to have a complete database.


  • Note that the code book of variable names, titles, and their codes is embedded within the SAS code provided above. It is vitally important to know the coding so that you can plan for the appropriate analysis and interpret the results.


Copy and paste the entire Word file into your SAS program editor. That SAS program reads in the raw data and merges the files to create a new data set called FINAL.


  • DO NOT CHANGE ANY OF THE SAS CODE IN THIS FILE. To work on the data, you will simply add new lines at the bottom of the file to conduct your descriptive work (e.g. PROC MEANS, PROC UNIVARIATE, PROC FREQ) and analytical work (e.g. PROC TTEST, PROC GLM, etc.) on the combined data.


  • To select ONLY THE CONTROLS from the database, insert the following lines of code at the bottom of the SAS file.


data controls;
  set final;
  if gp=2;  /* keep only the control infants */
run;


This will create a subset of the database, named CONTROLS, that will contain only the records of the 3,572 control infants. The cases will not be included. You can then add lines of code after these statements to begin your analysis.
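Before going further, it can be worth confirming that the subset has the expected number of records. The lines below are only a suggested check, not a required part of the assignment: PROC FREQ tabulates the group variable gp in FINAL, and PROC CONTENTS reports the number of observations in CONTROLS, which should be 3,572.

```sas
/* Suggested check only: how many records fall in each study group? */
proc freq data=final;
  tables gp;
run;

/* Number of observations and variable list for the control subset */
proc contents data=controls;
run;
```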


The Variables


The database contains hundreds of variables, many of which are not relevant to this assignment (i.e. concerning the diagnosis and medical history of the cases). You will consider ONLY the following categories of variables (see the code book to identify these variable names).


Infant factors: crace, sex, bwt, gestage

Parental sociodemographic factors: meduc, feduc, mage, income

Maternal medical history: diab, gdiab, prevpreg, pvprem, iv90, iv91, nv1078 (=BMI)

Parental lifestyle factors: msmoke, fsmoke, malc, malc2


The Assignment


Step 1: begin by looking at the distributions of infant birthweight (grams), for the entire control group and then separately by race. The results will help inform your decisions on how to proceed in terms of parametric versus nonparametric approaches, for example. You can choose whether to continue to evaluate birthweight as a continuous variable for the rest of the assignment, or categorize it as low (<2,500 grams) versus normal birthweight (>=2,500 grams). Once you decide, please be consistent.
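As a starting point for the overall distribution, a sketch along the following lines may help. It assumes the data set CONTROLS has already been created and that missing birthweights are stored as SAS missing values (check the code book, since a special missing code would need to be handled first). The data set name CONTROLS2 and the variable name lbw are made up for this illustration.

```sas
/* Distribution of birthweight in the entire control group.      */
/* NORMAL requests tests of normality; HISTOGRAM draws the plot. */
proc univariate data=controls normal;
  var bwt;
  histogram bwt / normal;
run;

/* Optional: a low-birthweight indicator (1 = <2,500 g, 0 = >=2,500 g). */
/* SAS treats a missing value as smaller than any number, so missing    */
/* birthweights must be set aside before the comparison.                */
data controls2;           /* hypothetical data set name */
  set controls;
  if bwt = . then lbw = .;
  else if bwt < 2500 then lbw = 1;
  else lbw = 0;
run;

proc freq data=controls2;
  tables lbw / missing;
run;
```

Whichever form of the outcome you choose, the histogram and normality tests are what inform the parametric versus nonparametric decision.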


Note that race was reported in three categories: 1=white, 2=black, and 3=other. The “other” group is too small for separate analysis and it should be excluded from the disparities analysis (but be sure to INCLUDE them in the overall analysis). You can exclude the group by including the following statement in a SAS data step: if crace=3 then delete;
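One way to keep the overall and the race-specific analyses apart is to leave CONTROLS untouched and put the exclusion in a second data set. The sketch below assumes CONTROLS exists; the data set name BYRACE is made up for this illustration.

```sas
/* Separate data set for the disparities analysis (hypothetical name). */
/* CONTROLS itself still contains all three race categories.           */
data byrace;
  set controls;
  if crace=3 then delete;
run;

/* Birthweight distribution shown separately for crace=1 and crace=2 */
proc univariate data=byrace normal;
  class crace;
  var bwt;
  histogram bwt;
run;
```

Running the overall analysis on CONTROLS and the stratified analysis on the second data set avoids accidentally dropping the “other” group from the overall results.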


Next, look at all the other variables in the dataset that are listed in the Variables section above. Run PROC UNIVARIATE and/or PROC FREQ to have a look at how they are coded and distributed. From those results, make decisions on how you are going to handle them in the analysis.
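A first pass over the listed variables might look like the sketch below. Which variables are treated as categorical and which as continuous here is a guess made for illustration only; the code book is the authority on how each one is actually coded.

```sas
/* Frequency tables for variables assumed to be categorical.       */
/* MISSING shows missing values as their own row in each table.    */
proc freq data=controls;
  tables crace sex meduc feduc income diab gdiab prevpreg pvprem
         iv90 iv91 msmoke fsmoke malc malc2 / missing;
run;

/* Summary statistics for variables assumed to be continuous */
proc means data=controls n nmiss mean std min median max;
  var bwt gestage mage nv1078;
run;
```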


For example, if a variable is categorical but one of the categories has a small N, consider grouping some of the categories together, or excluding that particular code. Suppose we had a variable (fictitious in this case) called occupation that has 8 categories. In SAS you can exclude one of the codes by writing: if occupation=8 then delete; If you want to create a new grouping of codes, you could write: if occupation=1 then newoccup=1; if occupation=2 or occupation=3 then newoccup=2; if occupation>3 then newoccup=3; and that series of commands will create a new variable (newoccup) consisting of 3 codes, in comparison to the original variable that had 8 of them.
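The regrouping described above can be tried out on a small made-up data set before applying the same idea to a real variable. Everything in this sketch (the data set names DEMO and DEMO2 and the values read in) is fictitious, just like the occupation variable itself.

```sas
/* Made-up data: one record per occupation code, plus one missing value */
data demo;
  input occupation @@;
  datalines;
1 2 3 4 5 6 7 8 .
;
run;

/* Collapse the 8 codes into 3 groups */
data demo2;
  set demo;
  if occupation=1 then newoccup=1;
  if occupation=2 or occupation=3 then newoccup=2;
  if occupation>3 then newoccup=3;
run;

/* Cross-check the recode: each original code should map to one new code, */
/* and the missing occupation should stay missing in newoccup.            */
proc freq data=demo2;
  tables occupation*newoccup / list missing;
run;
```

Checking the old variable against the new one in this way is a quick safeguard against recoding mistakes.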


Step 2: find out which of the variables had a statistically significant effect on birthweight for the entire control group.
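The choice of test depends on the decisions you made in Step 1, so the following is only a sketch of the kinds of procedures that could apply, not a prescribed analysis. It assumes that sex and msmoke each have exactly two levels and that meduc has more than two, which you should verify in the code book. The data set name STEP2 and the variable lbw are made up for this illustration.

```sas
/* Continuous birthweight, two-level factor: two-sample t test */
proc ttest data=controls;
  class sex;
  var bwt;
run;

/* Nonparametric alternative if birthweight is clearly non-normal */
proc npar1way data=controls wilcoxon;
  class sex;
  var bwt;
run;

/* Continuous birthweight, factor with more than two levels: one-way ANOVA */
proc glm data=controls;
  class meduc;
  model bwt = meduc;
run;
quit;

/* Continuous birthweight, continuous factors: correlations */
proc corr data=controls pearson spearman;
  var bwt gestage mage;
run;

/* Binary outcome instead: chi-square test on a low-birthweight indicator */
data step2;               /* hypothetical data set name */
  set controls;
  if bwt = . then lbw = .;
  else lbw = (bwt < 2500);   /* 1 = low, 0 = normal */
run;

proc freq data=step2;
  tables msmoke*lbw / chisq;
run;
```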


Step 3: find out whether the associations from Step 2 differ by race (white versus black). There are more complex ways to do this, such as multiple regression models that include interaction terms. But for this assignment you will only use stratified analysis, i.e. look at whites and blacks separately and see which factors are associated with birthweight in one group or the other, or in both groups (recall and use the Breslow-Day test: look for evidence of confounding and effect modification).
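A stratified analysis could be set up roughly as follows. This is a sketch under assumptions: that sex and msmoke each have exactly two levels (PROC FREQ reports the Breslow-Day test only for stratified 2x2 tables) and that missing birthweights are stored as SAS missing values. The data set name STRAT and the variable lbw are made up for this illustration.

```sas
/* Whites and blacks only, with a low-birthweight indicator */
data strat;               /* hypothetical data set name */
  set controls;
  if crace=3 then delete;
  if bwt = . then lbw = .;
  else lbw = (bwt < 2500);   /* 1 = low, 0 = normal */
run;

/* BY-group processing requires the data to be sorted by the BY variable */
proc sort data=strat;
  by crace;
run;

/* Continuous outcome: the same t test run separately within each race */
proc ttest data=strat;
  by crace;
  class sex;
  var bwt;
run;

/* Binary outcome: one 2x2 table per race. CMH adds the Mantel-Haenszel */
/* summary estimates and the Breslow-Day test for homogeneity of the    */
/* odds ratios across the race strata.                                  */
proc freq data=strat;
  tables crace*msmoke*lbw / cmh;
run;
```

In the TABLES request the stratifying variable comes first, so that each stratum produces its own msmoke by lbw table.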


The Project Report


The Project Report should consist of an overview of the analytical strategy you deployed, together with tables and figures, a description of the findings, conclusions, and appendix materials.


The Analytical Overview (suggested length: one to two pages, single spaced) should describe what you did to complete the project. Be sure to address the following issues.

  • How did you begin the project?
  • Did you decide to examine birthweight as continuous or binary? What were the consequences of that decision?
  • How did you decide on how to handle any non-binary variables?
  • How did the results of the descriptive phase of the work (step 1) inform your decisions on the plan for the analytical phase (steps 2-3)?
  • What did you do to identify the variables that had a significant effect on birthweight (step 2)?
  • How did you assess the white-black disparities in birthweight (step 3)?


The Results section of the paper (no set length) should consist of a written narrative of what you found (e.g. “I examined the distributions of birthweights in white and black infants. As shown in Figure 1, the two groups had somewhat different distributions... Table 1 shows a list of maternal factors, revealing that...”). Include tables and figures of the main results, focusing on the statistically significant findings. Those factors that were not significant can be summarized in words, rather than taking up space in tables and illustrations.


Concisely state your Conclusions from the results (suggested length: one page). Most importantly, what disparities did you discover in birthweight and its risk factors? Also describe your experience in working on the project, in terms of what challenges you faced, how you overcame them, and what you learned in the process.


As an Appendix, include a printed copy of your SAS program(s).


Due Date


Project reports are due by December 22, uploaded to Canvas.