# Applied Statistics and Data Analysis for Public Health

Statistics is the branch of applied mathematics concerned with collecting, processing, summarizing, interpreting and presenting data. It describes and analyses the data provided and draws conclusions from the information contained in a sample (Peck, Olsen & Devore, 2015). Statistics is mainly used to collect data appropriately and to present complex data in suitable graphs, tables and diagrams so that they are clear and easy to understand. Moreover, it helps to explain the complexity and pattern of variation in natural phenomena and is useful for planning a statistical analysis correctly and efficiently in any area of research (James et al., 2013). Statistical methods play a vital role in the monitoring and estimation process of public safety: they are used to classify populations at risk, diagnose emerging health risks, organise public health interventions, assess their performance and effectiveness, and manage policy budgets and funds (Kellar & Kelvin, 2013). They are therefore commonly used in public health and clinical medicine, and they help public health administrators understand what the populations under their responsibility are experiencing (Armitage & Berry, 1994).

Statistics is broadly classified into two groups: descriptive statistics and inferential statistics. Descriptive statistics are techniques that deal with the enumeration, organization, summarization and graphical description of collected data, and thus help to reveal the features of a specific data set. Inferential statistics, on the other hand, is concerned with drawing conclusions about a population from sample information; it is usually more complex and involves more error. Descriptive statistics provide the basis for inferential statistics, so the two are interrelated (Fisher & Marshall, 2009). In this report, both types of statistics are used according to the nature of the question and the type of variable.

Statistical analysis of a collected data set requires software. Choosing the right software is crucial, as the wrong choice can give the researcher errors and false results (Cavaliere, 2015). To address the questions in this report, the data are provided in IBM SPSS, which manipulates data, quickly generates tables and graphs to summarize them, and performs statistical analyses ranging from basic descriptive statistics to advanced inferential statistics.

**Aim of the Report**

The purpose of this report is to apply appropriate statistical techniques to a given dataset from a hypothetical study evaluating a healthy lifestyle education intervention intended to encourage a healthier way of living among university students. The key objectives of the study were to increase health education and reduce the weight of the selected participants. To address the questions, different statistical techniques will be used and critically discussed to investigate the effects of the health education intervention on Body Mass Index (BMI), diabetes status and health literacy across both groups. Any change in BMI in the intervention group will be compared with that in the control group. Further, the report will examine how well health literacy, age and sex predict post-intervention differences in participants' BMI.

**Preliminary Analysis and Investigations**

Designing a clear, concise and measurable question, setting clear measurement priorities and collecting data are the initial steps of data analysis, and these are already defined in the provided task. After data collection, screening and cleaning is the next important step: it detects outliers and miscoded or missing values, assesses the normality of each variable and checks for other possible errors, which helps to ensure the reliability and validity of the data used for testing causal theory (Odem & Henson, 2002). Data screening gives a general impression of the collected data, which helps in selecting and conducting a suitable analytical method and improves the performance of statistical techniques (Abubakar et al., 2017).

The types of variables must be determined before screening, because they determine which descriptive and analytical methods can be used to summarize and analyse the data (Mayya et al., 2017). According to McDonald (2009), there are two main categories of variables, numerical and categorical, and both have subcategories. Categorical data express information descriptively; where they take numeric values, these are codes with qualitative properties and no mathematical meaning. Categorical data are unstructured or semi-structured, lack a standardized order scale, and can be visualized with a bar chart when measuring frequencies and a pie chart when measuring percentages. Numeric data, on the other hand, are structured and compatible with more statistical methods than categorical data. They take values with numerical properties on a standardized order scale and are visualized using scatter plots and line graphs (Campbell & Swinscow, 2009).
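The screening checks described above can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure used in this report: the function names, the example values and the 1.5 × IQR outlier rule are assumptions introduced here, and the text does not state which outlier criterion was applied.

```python
from statistics import quantiles

def screen_numeric(values):
    """Count missing values and flag outliers in a numeric variable.

    Assumption: missing values are stored as None and outliers are
    defined by the 1.5 * IQR rule (one common convention among several).
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    q1, _, q3 = quantiles(present, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low_fence or v > high_fence]
    return {"missing": missing, "outliers": outliers}

def find_miscoded(codes, allowed):
    """Return the codes of a categorical variable that are not valid."""
    return [c for c in codes if c not in allowed]

# Hypothetical values, for illustration only (not the study data).
ages = [18, 19, 20, 21, 22, 23, 24, None, 95]
age_check = screen_numeric(ages)

group_codes = [1, 2, 2, 3, 1]
bad_codes = find_miscoded(group_codes, allowed={1, 2})
```

Here `age_check` reports one missing value and flags 95 as an outlier, and `bad_codes` contains the invalid code 3. Flagged values should be checked against the original records rather than deleted automatically.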

The study included 81 participants, divided into a control group and an intervention group, coded 1 and 2 respectively. Intervention status, gender (male and female), location (campus A and campus B), smoking status (not smoking and smoking), asthmadiag, i.e. asthma status (no asthma and asthma), and diabdiag, i.e. diabetes status (no diabetes and diabetes), are the categorical (nominal) variables (table 1.3) and are appropriately coded. The participant's age, health literacy, height, weight1 (weight before intervention) and weight2 (weight after intervention) are numeric variables (table 1.2). Both categorical and numerical data will be displayed graphically so that they can be understood, remembered and compared at a glance. The general description, coding and labelling of all variables in the given data set are displayed in **table 1**.
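As a minimal sketch of how a coded nominal variable is summarized, the Python snippet below maps group codes to labels and builds a frequency and percentage table. The coding 1 = control and 2 = intervention comes from the text; the function name and the example codes are hypothetical and do not reproduce the study data.

```python
from collections import Counter

# Coding stated in the text: 1 = control group, 2 = intervention group.
GROUP_LABELS = {1: "control", 2: "intervention"}

def frequency_table(codes, labels):
    """Return {label: (count, percentage)} for a coded nominal variable."""
    counts = Counter(labels[c] for c in codes)
    total = sum(counts.values())
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}

# Hypothetical codes, for illustration only.
example_codes = [1, 2, 2, 1, 2]
group_table = frequency_table(example_codes, GROUP_LABELS)
```

The counts in such a table are what a bar chart displays, and the percentages are what a pie chart displays.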

**Presentation of continuous variables**

Descriptive statistics are used to summarize numerical data, which are displayed as histograms, line graphs, scatter plots and box-and-whisker plots. Summarizing numerical data usually involves presenting the distribution, central tendency and dispersion, the three major sample characteristics of each variable (Mishra et al., 2018).

**Table 1.1** Descriptive statistics for continuous variables in the healthy lifestyle data.

Table 1.1 gives a general impression of the descriptive statistics for the continuous variables in the study. Participants were aged from **18** years (lowest) to **44** years (highest), a range of **26** years (the difference between the highest and lowest values), with a mean age of **25**. The minimum and maximum heights were **1.40** m and **1.90** m, with a mean of **1.67** m and a range of **0.50** m.
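The summary measures reported above (minimum, maximum, range and mean) are simple to compute. The sketch below uses Python's standard library with hypothetical ages chosen only to match the reported minimum and maximum; the function name is an assumption, and the values are not the study data.

```python
from statistics import mean

def describe(values):
    """Return the minimum, maximum, range and mean of a numeric variable."""
    lowest, highest = min(values), max(values)
    return {
        "min": lowest,
        "max": highest,
        "range": highest - lowest,  # range = highest value - lowest value
        "mean": mean(values),
    }

# Hypothetical ages, for illustration only.
ages = [18, 21, 25, 30, 44]
age_summary = describe(ages)
```

With the reported extremes, the range works out as 44 − 18 = 26 years for age and 1.90 − 1.40 = 0.50 m for height, which agrees with table 1.1.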

Across the control and intervention groups, health literacy scores ranged from a minimum of **28.57** to a maximum of **92.86**, with a mean of **58.73**. The scores follow an approximately normal (Gaussian) distribution (figures 1.1 and 1.2), meaning that values near the mean occur more frequently than values far from it.

**Figure 1.1**

**Figure 1.2**

Before the intervention, the minimum weight among participants was **41.5 kg** (weight1); after the intervention it was **42 kg** (weight2). The maximum weight was **133.3 kg** before the intervention (weight1) and **129 kg** after it (weight2). Weight1 had a mean of **67.7 kg**, while weight2 had a mean of **67 kg** (table 1.1).

Three new continuous variables were added to the data set to evaluate the effect of the intervention on BMI in the intervention and control groups: BMI1, BMI2 and BMI_Diff, which represent BMI before the intervention, BMI after the intervention and the change in BMI between the two measurements respectively. Table 1.2 shows that the lowest and highest BMI before the intervention were **15.62** and **38.95** respectively, with a mean of **23.87**. After the intervention, the lowest and highest values were **15.81** and **37.69** respectively, with a mean of **23.62**. The change in BMI among participants in both groups is approximately normally distributed (figures 1.5 and 1.6).
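The derived variables can be illustrated with a short Python sketch. BMI is weight in kilograms divided by the square of height in metres. The sign convention for BMI_Diff (after minus before, so that a negative value indicates a reduction) is an assumption, since the text does not state the direction of the subtraction; the function names and example values are also hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

def bmi_change(weight1_kg, weight2_kg, height_m):
    """Return (BMI1, BMI2, BMI_Diff) for one participant.

    Assumption: BMI_Diff = BMI2 - BMI1, so a negative value means that
    BMI fell between the two measurements.
    """
    bmi1 = bmi(weight1_kg, height_m)
    bmi2 = bmi(weight2_kg, height_m)
    return bmi1, bmi2, bmi2 - bmi1

# Hypothetical participant: 70 kg before, 68 kg after, 1.75 m tall.
bmi1, bmi2, bmi_diff = bmi_change(70.0, 68.0, 1.75)
```

For this hypothetical participant BMI falls from about 22.86 to about 22.20, giving a BMI_Diff of about −0.65.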