Multivariate regression


Multiple regression analysis is a statistical technique that examines the relationship between a dependent (outcome) variable and a set of independent variables in a specified model. The technique, often loosely called multivariate regression, works on numerical data, so every variable included in the regression must be numeric. Data measured on a nominal scale can be converted to numerical form, for example with dummy variables, before the regression is applied. The data analysed here concern the factors that affect people’s general health: health is the outcome variable, and it is studied against 20 independent variables. This report presents the regression analysis together with the model specification and the model-fitting diagnostics.

Part one

Description of the data and variables

The data contain 290 observations collected in a scientific study of the determinants of an individual’s health. There are 21 variables in total. The first variable, health, is the outcome variable; the other 20 are the independent variables, selected to explain the variation in health. These 20 independent variables are complex measurements of gene sequence collected over a very long period of time for a given set of patients. All of them are continuous and are explored further in the model.

Regression analysis

The first step in a regression analysis is to obtain descriptive statistics for all the variables in the model, which gives a simple snapshot of the nature and shape of the data. Such an analysis typically includes measures of central tendency (mean, median), measures of dispersion (standard deviation, variance, range) and measures of shape (skewness and kurtosis). These statistics provide the basis for further analysis and statistical modelling of the variables.
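As a rough illustration of how these statistics are computed, the following Python sketch defines a small helper. The function name describe and the sample values are made up for illustration; the report's own figures came from R-studio, and packages differ in the exact skewness and kurtosis formulas they use, so this sketch need not reproduce Table 1 to the last digit.

```python
import math
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    # Central moments with an n denominator, used for the shape statistics.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "se": sd / math.sqrt(n),        # standard error of the mean
        "median": statistics.median(values),
        "sd": sd,
        "variance": sd ** 2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,   # excess kurtosis (one common convention)
        "range": max(values) - min(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up, right-skewed sample (not the study data).
sample = [0.6, 1.1, 1.9, 2.2, 2.4, 3.0, 3.3, 4.5, 14.6]
stats = describe(sample)

# Consistency check against Table 1: health has variance 4.10 and n = 290,
# so its standard error should be sqrt(4.10) / sqrt(290).
se_health = math.sqrt(4.10) / math.sqrt(290)
print(round(se_health, 2))
```

The last line ties the sketch back to the report: the standard error listed for health in Table 1 is the standard deviation divided by the square root of the 290 observations.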

Descriptive statistics

Table 1: R-studio output for descriptive statistics of the variables in the model

Variable        Mean  S.E.  Median Std. Dev.  Variance  Kurtosis  Skewness  Range   Min    Max
health          2.85  0.12    2.19      2.02      4.10      9.05      2.76  14.05  0.61  14.65
PCDH12          1.11  0.07    0.94      1.18      1.40    194.78     12.74  19.13  0.20  19.32
DLG5            1.11  0.04    0.93      0.63      0.39      3.68      1.64   3.79  0.16   3.96
BC038559        1.07  0.03    0.98      0.47      0.22      0.86      0.88   2.89  0.00   2.89
SHISA5          1.12  0.03    1.03      0.49      0.24      1.57      0.95   3.09  0.08   3.17
AF161342        1.20  0.04    0.96      0.71      0.50      3.18      1.55   4.43  0.00   4.43
CARKD           1.25  0.06    0.98      1.04      1.08     36.28      4.89  11.02  0.23  11.25
F2R             1.06  0.02    1.01      0.41      0.17      0.20      0.58   2.32  0.09   2.41
PHKG1           1.14  0.03    1.06      0.57      0.32      2.20      1.15   3.73  0.19   3.92
CDCP1           1.32  0.04    1.15      0.73      0.54      2.88      1.43   4.69  0.22   4.91
PLEKHM1         1.00  0.03    0.88      0.50      0.25      0.11      0.78   2.61  0.01   2.62
SMC2            1.72  0.14    1.13      2.38      5.65     31.38      5.16  20.14  0.08  20.23
PSMB6           1.03  0.03    0.95      0.44      0.19      0.68      0.73   2.78  0.20   2.98
BX440400        1.16  0.03    1.06      0.52      0.27      0.87      0.90   2.98  0.12   3.10
A_24_P936373    1.09  0.04    0.95      0.62      0.38     11.39      2.64   4.72  0.06   4.78
PPAN            1.14  0.03    1.02      0.49      0.24      1.53      1.16   2.96  0.12   3.09
BC007917        1.09  0.03    1.00      0.47      0.22      1.15      1.03   2.60  0.34   2.93
C14orf143       1.29  0.06    1.03      0.96      0.92     12.08      2.80   7.91  0.20   8.11
LOC440104       1.06  0.03    0.96      0.56      0.31      1.47      1.12   3.33  0.13   3.46
THC2578957      1.13  0.03    1.02      0.48      0.23      2.02      1.15   3.20  0.23   3.43
ANKIB1          1.06  0.03    0.99      0.45      0.20      3.47      1.31   3.15  0.08   3.23

Note: the source lists ten values per variable under nine column headings. The fourth value is labelled Std. Dev. here because its square matches the Variance column.

Source: Analysis from the Training data

The table above gives the following means and standard errors: health (Mean = 2.85, SE = 0.12), PCDH12 (Mean = 1.11, SE = 0.07), DLG5 (Mean = 1.11, SE = 0.04), BC038559 (Mean = 1.07, SE = 0.03), SHISA5 (Mean = 1.12, SE = 0.03), AF161342 (Mean = 1.20, SE = 0.04), CARKD (Mean = 1.25, SE = 0.06), F2R (Mean = 1.06, SE = 0.02), PHKG1 (Mean = 1.14, SE = 0.03), CDCP1 (Mean = 1.32, SE = 0.04), PLEKHM1 (Mean = 1.00, SE = 0.03), SMC2 (Mean = 1.72, SE = 0.14), PSMB6 (Mean = 1.03, SE = 0.03), BX440400 (Mean = 1.16, SE = 0.03), A_24_P936373 (Mean = 1.09, SE = 0.04), PPAN (Mean = 1.14, SE = 0.03), BC007917 (Mean = 1.09, SE = 0.03), C14orf143 (Mean = 1.29, SE = 0.06), LOC440104 (Mean = 1.06, SE = 0.03), THC2578957 (Mean = 1.13, SE = 0.03) and ANKIB1 (Mean = 1.06, SE = 0.03).

The means of the predictors lie in a narrow range (1.00 to 1.72) with small standard errors, so the variables are on comparable scales and are suitable for regression analysis. Most variables show only moderate skewness and kurtosis, but a few are heavily right-skewed and heavy-tailed, notably PCDH12 (skewness 12.74, kurtosis 194.78), CARKD (4.89, 36.28) and SMC2 (5.16, 31.38), which points to extreme observations that the leverage diagnostics should check. OLS does not require normally distributed predictors, so this does not prevent fitting the regression model. Having obtained the basic descriptive statistics, a primary regression model containing all the initial variables is presented next.

Developing the model

An ordinary least squares (OLS) model containing all 21 variables, with health regressed on the 20 predictors, is presented, followed by an in-depth analysis of the model diagnostics to investigate whether any of the basic assumptions of the regression model are violated. The parameter estimates obtained from the sample data are used for inference. The output reports the coefficient of each variable, the corresponding t-tests and P-values for hypothesis testing, the residual variance and other goodness-of-fit statistics, which are summarized using both tables and graphics.

Table 2: R-output for regression analysis

Residuals:
    Min      1Q  Median      3Q     Max
-2.3069 -0.7844 -0.1422  0.5988 12.8654

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.403259   0.883576  -3.852 0.000147 ***
PCDH12       -0.112777   0.076233  -1.479 0.140209
DLG5         -0.743025   0.227193  -3.270 0.001214 **
BC038559     -0.298175   0.367246  -0.812 0.417555
SHISA5        0.292498   0.323874   0.903 0.367269
AF161342      0.568462   0.149707   3.797 0.000181 ***
CARKD         0.173380   0.097949   1.770 0.077842 .
F2R           1.587816   0.470229   3.377 0.000842 ***
PHKG1         0.500789   0.237720   2.107 0.036075 *
CDCP1         0.124248   0.180410   0.689 0.491607
PLEKHM1       4.061004   0.309838  13.107  < 2e-16 ***
SMC2          0.086447   0.041384   2.089 0.037657 *
PSMB6        -0.872610   0.354422  -2.462 0.014441 *
BX440400     -0.005739   0.265745  -0.022 0.982787
A_24_P936373  0.528840   0.206534   2.561 0.010997 *
PPAN          0.324097   0.265618   1.220 0.223472
BC007917     -0.282504   0.307060  -0.920 0.358383
C14orf143     0.198250   0.112007   1.770 0.077864 .
LOC440104    -0.868232   0.309428  -2.806 0.005384 **
THC2578957    0.437893   0.321298   1.363 0.174059
ANKIB1        0.083391   0.342849   0.243 0.808013
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.39 on 269 degrees of freedom
Multiple R-squared: 0.5611, Adjusted R-squared: 0.5284
F-statistic: 17.19 on 20 and 269 DF, p-value: < 2.2e-16

Interpretation of the results

The results of the regression output are interpreted at the 5% level of significance (95% confidence level). From the regression output summarized in Table 2 above, the gene sequence of PCDH12 (B = -0.112777, T = -1.479, P-value = 0.140209) has a negative relationship with the outcome variable but is not significant in the model. The gene sequence of DLG5 (B = -0.743025, T = -3.270, P-value = 0.001214) also has a negative effect on the health outcome variable, and this relationship is significant because its P-value is less than the 0.05 level of significance. The gene sequence of BC038559 (B = -0.298175, T = -0.812, P-value = 0.417555) has an insignificant negative effect on the health outcome variable, as indicated by the large P-value.

Further analysis indicates that the gene sequence of PSMB6 (B = -0.872610, T = -2.462, P-value = 0.014441) is negatively related to the health outcome variable, and this relation is significant because its P-value is below 0.05. LOC440104 (B = -0.868232, T = -2.806, P-value = 0.005384) likewise has a significant negative effect. The gene sequence of BX440400 (B = -0.005739, T = -0.022, P-value = 0.982787) also has a negative coefficient, though its effect is insignificant because its P-value is greater than the 0.05 level of significance. The last gene sequence with a negative coefficient is BC007917 (B = -0.282504, T = -0.920, P-value = 0.358383). Its P-value is clearly larger than the 0.05 level of significance, so this variable is also not significant in explaining the variation in the dependent variable.

On the positive side, the gene sequence of SHISA5 (B = 0.292498, T = 0.903, P-value = 0.367269) has a positive but insignificant relationship with the health outcome variable. The gene sequence of AF161342 (B = 0.568462, T = 3.797, P-value = 0.000181) has a positive and highly significant relationship with the outcome variable. The gene sequence of CARKD (B = 0.173380, T = 1.770, P-value = 0.077842) has a positive relationship that narrowly misses significance at the 5% level. F2R (B = 1.587816, T = 3.377, P-value = 0.000842) has a highly significant positive relationship with the outcome variable. PHKG1 (B = 0.500789, T = 2.107, P-value = 0.036075) has a positive relationship that is significant at the 5% level, though not at the 1% level. CDCP1 (B = 0.124248, T = 0.689, P-value = 0.491607) has a positive but insignificant relationship with the dependent variable, as shown by its large P-value. PLEKHM1 (B = 4.061004, T = 13.107, P-value < 2e-16) is positively related to the outcome variable and is by far the most significant predictor in the model. SMC2 (B = 0.086447, T = 2.089, P-value = 0.037657) also has a positive relationship that is significant at the 5% level.

The gene sequence of A_24_P936373 (B = 0.528840, T = 2.561, P-value = 0.010997) also has a positive and significant relationship with the outcome variable. The gene sequences of PPAN (B = 0.324097, T = 1.220, P-value = 0.223472), C14orf143 (B = 0.198250, T = 1.770, P-value = 0.077864), THC2578957 (B = 0.437893, T = 1.363, P-value = 0.174059) and ANKIB1 (B = 0.083391, T = 0.243, P-value = 0.808013) all have positive but insignificant relationships with the health outcome variable at the 5% level.

Thus, the variables PCDH12 (P-value = 0.140209), BC038559 (P-value = 0.417555), SHISA5 (P-value = 0.367269), CARKD (P-value = 0.077842), CDCP1 (P-value = 0.491607), BX440400 (P-value = 0.982787), PPAN (P-value = 0.223472), BC007917 (P-value = 0.358383), C14orf143 (P-value = 0.077864), THC2578957 (P-value = 0.174059) and ANKIB1 (P-value = 0.808013) are insignificant at the 5% level. The remaining nine variables (DLG5, AF161342, F2R, PHKG1, PLEKHM1, SMC2, PSMB6, A_24_P936373 and LOC440104) are significant in this model.
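The significance decisions can be cross-checked mechanically. The Python sketch below is illustrative only and is not the code used in the report: it copies the P-values from Table 2, splits the predictors at the 0.05 level, and recomputes one t value as the estimate divided by its standard error. The value 2e-16 for PLEKHM1 stands in for the "< 2e-16" that R prints.

```python
ALPHA = 0.05  # level of significance used in the report

# P-values copied from Table 2 (PLEKHM1 is reported as "< 2e-16").
p_values = {
    "PCDH12": 0.140209, "DLG5": 0.001214, "BC038559": 0.417555,
    "SHISA5": 0.367269, "AF161342": 0.000181, "CARKD": 0.077842,
    "F2R": 0.000842, "PHKG1": 0.036075, "CDCP1": 0.491607,
    "PLEKHM1": 2e-16, "SMC2": 0.037657, "PSMB6": 0.014441,
    "BX440400": 0.982787, "A_24_P936373": 0.010997, "PPAN": 0.223472,
    "BC007917": 0.358383, "C14orf143": 0.077864, "LOC440104": 0.005384,
    "THC2578957": 0.174059, "ANKIB1": 0.808013,
}

significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

# A t value is the coefficient estimate divided by its standard error.
t_dlg5 = -0.743025 / 0.227193

print(len(significant), "significant:", ", ".join(significant))
print(len(not_significant), "not significant:", ", ".join(not_significant))
print("t value for DLG5:", round(t_dlg5, 3))
```

Changing ALPHA to 0.10 would move CARKD and C14orf143 into the significant group, which is why both carry the "." code in the R output.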

Model fitting and diagnostics test statistics

The model-fitting statistics indicate a residual standard error of 1.39 on 269 degrees of freedom. The 20 independent variables explain 56.11% of the overall variation in the health outcome variable (multiple R-squared = 0.5611), or 52.84% after adjusting for the number of predictors (adjusted R-squared = 0.5284). The overall fit was tested using Fisher’s F-statistic. Under the null hypothesis all 20 slope coefficients are zero, meaning the predictors jointly explain none of the variation; with F(20, 269) = 17.19 and P-value < 2.2e-16, the null hypothesis is rejected. It can be concluded that, although the 20 independent variables explain only a little over half of the variation in the outcome variable, the overall model is a good fit on statistical grounds and can be adopted for further inference, analysis and decision making.
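The fit statistics in Table 2 are tied together by two standard identities, which the following Python sketch evaluates from the reported values. It is a worked check, not part of the original analysis; because the inputs are rounded to four decimals, the results agree with the R output only up to rounding.

```python
n = 290        # observations
k = 20         # predictors (excluding the intercept)
r2 = 0.5611    # multiple R-squared from Table 2

df_resid = n - k - 1  # residual degrees of freedom, 269

# Overall F-statistic: explained variation per predictor over
# unexplained variation per residual degree of freedom.
f_stat = (r2 / k) / ((1 - r2) / df_resid)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

print("df:", df_resid)
print("F:", round(f_stat, 2))
print("adjusted R-squared:", round(adj_r2, 4))
```

The F value comes out at 17.19 and the adjusted R-squared at about 0.528, in line with the figures R reports.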

Although the model fits on statistical grounds, inference is not complete until it has been established that the basic assumptions of the regression model are not violated. For that reason, tests of linearity, multicollinearity, normality, leverage and heteroscedasticity were carried out. For multicollinearity, the variance inflation factors (VIF) were obtained to check whether the data contain collinear predictors. For the other assumptions, graphical plots were used to show whether they were violated.

Table 3: R-output of VIF tests for testing multicollinearity in the data

Variable       Sqrt(VIF)
PCDH12         1.104033
DLG5           1.737295
BC038559       2.110274*
SHISA5         1.927498
AF161342       1.299877
CARKD          1.246167
F2R            2.3849*
PHKG1          1.648705
CDCP1          1.620839
PLEKHM1        1.900232
SMC2           1.202885
PSMB6          1.908476
BX440400       1.689062
A_24_P936373   1.557716
PPAN           1.587869
BC007917       1.758422
C14orf143      1.313639
LOC440104      2.116962
THC2578957     1.872446
ANKIB1         1.876502

Source: Analysis from the Training data

The variance inflation factor (VIF) indicates whether an independent variable in the model is collinear with one or more of the other independent variables. By the rule of thumb used here, a variable is treated as collinear if the square root of its VIF is greater than 2. Using the R-studio environment, the VIFs and their square roots were obtained as summarized in Table 3 above, where collinear variables are marked with an asterisk.

The results indicate that the variables BC038559 (sqrt(VIF) = 2.110274*) and F2R (sqrt(VIF) = 2.3849*) are collinear in the model, since the square roots of their VIF statistics are greater than 2. LOC440104 (sqrt(VIF) = 2.116962) also exceeds this threshold, although it is not marked in the table. Collinearity inflates the variances of the affected coefficient estimates, which widens their standard errors and makes their t-tests less reliable. Although common practice suggests dropping such variables from the model, formal variable selection methods are discussed later to see which variables ought to be dropped from the analysis.
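The rule of thumb can be applied directly to the values in Table 3. The Python sketch below is an illustration rather than the report's own code: it flags every predictor whose sqrt(VIF) exceeds 2 and converts each value back to a VIF and to the R-squared of that predictor regressed on the others, using the standard identity VIF = 1 / (1 - R-squared).

```python
THRESHOLD = 2.0  # rule-of-thumb cut-off on sqrt(VIF)

# sqrt(VIF) values copied from Table 3.
sqrt_vif = {
    "PCDH12": 1.104033, "DLG5": 1.737295, "BC038559": 2.110274,
    "SHISA5": 1.927498, "AF161342": 1.299877, "CARKD": 1.246167,
    "F2R": 2.3849, "PHKG1": 1.648705, "CDCP1": 1.620839,
    "PLEKHM1": 1.900232, "SMC2": 1.202885, "PSMB6": 1.908476,
    "BX440400": 1.689062, "A_24_P936373": 1.557716, "PPAN": 1.587869,
    "BC007917": 1.758422, "C14orf143": 1.313639, "LOC440104": 2.116962,
    "THC2578957": 1.872446, "ANKIB1": 1.876502,
}

flagged = sorted(name for name, s in sqrt_vif.items() if s > THRESHOLD)

# Recover the VIF and the implied R-squared of each predictor on the rest.
vif = {name: s ** 2 for name, s in sqrt_vif.items()}
r2_on_others = {name: 1 - 1 / v for name, v in vif.items()}

for name in flagged:
    print(name, round(vif[name], 2), round(r2_on_others[name], 3))
```

For F2R, for example, the implied R-squared is about 0.82, meaning that roughly four fifths of its variation is already carried by the other 19 predictors.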

Figure 1: Graphical model Diagnostics

Source: Analysis from the Training data

The diagnostic plots above indicate no heteroscedasticity: in the Residuals vs Fitted plot the residuals are scattered around zero across the fitted values with no systematic pattern, which also supports the linearity of the model, and the Scale-Location plot is consistent with a constant residual variance. To check normality, a quantile-quantile (Q-Q) plot of the residuals was produced. The residuals lie close to the reference line of the Normal Q-Q plot, indicating that they are approximately normally distributed and that the normality assumption is not violated. Finally, the data have little leverage, with only two extreme observations. That implies that the basic assumptions of the regression model are satisfied and the results can hold.

Part two

The data include many independent variables: 20 predictors and one dependent variable, 21 variables in total, which is a large number for a regression model. The previous analysis indicated that several of these variables are not significant in explaining the variation in the health outcome variable, and the collinearity statistics flagged some variables as collinear, which makes them candidates for removal. However, deciding which variables to drop is more complex than removing the few independent variables that appear insignificant in the model.

Three variable selection methods are in common use. Forward selection begins with a model containing only the intercept. Independent variables are then added one at a time, based on their F-for-change statistic, until the model contains only the significant independent variables.

Each time a variable is added, the model fit is tested to see whether the added variable is significant and whether it contributes useful information to the model. A model selection criterion, for instance the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), is also computed. The goal is to identify the model that minimizes the chosen criterion.

Backward selection starts with the full model and drops independent variables one at a time. For each variable dropped, an F-for-change statistic is obtained to see whether the removal significantly changes the fit. As in forward selection, each reduced model is refitted with the remaining independent variables, and its BIC and AIC statistics are compared with those of the previous models. The model that minimizes the chosen information criterion is considered the better model and is adopted for further analysis.

The backward-forward (stepwise) selection method combines the forward and backward selection methods. Whenever a variable is dropped or added, the AIC and F-for-change statistics are obtained, and the variables already in the model are re-examined to see whether they are still significant. Those that are no longer significant are dropped, and variables removed earlier may be added back. The process is repeated until a model that minimizes the AIC or BIC is obtained. It is this third method that is used in this analysis. The process was run in R-studio with the AIC as the criterion, and its output is reproduced from the line "Start: AIC=211.15" onwards. The selection involved nine steps combining forward and backward moves, with the AIC computed at each stage to identify the best-fitting model.
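The search logic of the combined method can be sketched generically. The Python function below is a minimal illustration and not the R procedure used in the report: stepwise, toy_score and the variable names a to d are hypothetical, and the score function stands in for whatever criterion (such as the AIC of a fitted model) is being minimized.

```python
def stepwise(candidates, start, score):
    """Greedy bidirectional search that minimises score(subset)."""
    current = frozenset(start)
    best = score(current)
    path = [(sorted(current), best)]
    while True:
        # Every model reachable by dropping or adding exactly one variable.
        moves = [current - {v} for v in sorted(current)]
        moves += [current | {v} for v in candidates if v not in current]
        if not moves:
            break
        challenger = min(moves, key=score)
        if score(challenger) >= best:
            break  # no single move improves the criterion, so stop
        current, best = challenger, score(challenger)
        path.append((sorted(current), best))
    return sorted(current), best, path


# Toy criterion: a large penalty for each missing useful variable plus
# a small penalty per variable kept (mimicking a fit-plus-complexity score).
USEFUL = {"a", "c"}


def toy_score(subset):
    return 10 * len(USEFUL - subset) + 2 * len(subset)


# Start from the full model, as in the analysis described here.
selected, best, path = stepwise(["a", "b", "c", "d"], ["a", "b", "c", "d"], toy_score)
for model, value in path:
    print(value, model)
```

Starting from the full set, the toy search drops the two uninformative variables one at a time and stops when neither a further removal nor a re-addition lowers the score.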

Start: AIC=211.15
health ~ PCDH12 + DLG5 + BC038559 + SHISA5 + AF161342 + CARKD +
    F2R + PHKG1 + CDCP1 + PLEKHM1 + SMC2 + PSMB6 + BX440400 +
    A_24_P936373 + PPAN + BC007917 + C14orf143 + LOC440104 +
    THC2578957 + ANKIB1

The search started from the full model containing all 20 independent variables: PCDH12, DLG5, BC038559, SHISA5, AF161342, CARKD, F2R, PHKG1, CDCP1, PLEKHM1, SMC2, PSMB6, BX440400, A_24_P936373, PPAN, BC007917, C14orf143, LOC440104, THC2578957 and ANKIB1. This model had an AIC of 211.15, and the procedure then considered adding or removing variables.

Step 2: AIC=209.15
health ~ PCDH12 + DLG5 + BC038559 + SHISA5 + AF161342 + CARKD +
    F2R + PHKG1 + CDCP1 + PLEKHM1 + SMC2 + PSMB6 + A_24_P936373 +
    PPAN + BC007917 + C14orf143 + LOC440104 + THC2578957 + ANKIB1

At step two, BX440400 was removed, leaving PCDH12, DLG5, BC038559, SHISA5, AF161342, CARKD, F2R, PHKG1, CDCP1, PLEKHM1, SMC2, PSMB6, A_24_P936373, PPAN, BC007917, C14orf143, LOC440104, THC2578957 and ANKIB1 in the model. The model had an AIC of 209.15, and the process of adding or removing variables was repeated.

Step 3: AIC=207.21
health ~ PCDH12 + DLG5 + BC038559 + SHISA5 + AF161342 + CARKD +
    F2R + PHKG1 + CDCP1 + PLEKHM1 + SMC2 + PSMB6 + A_24_P936373 +
    PPAN + BC007917 + C14orf143 + LOC440104 + THC2578957

At step three, ANKIB1 was removed, leaving PCDH12, DLG5, BC038559, SHISA5, AF161342, CARKD, F2R, PHKG1, CDCP1, PLEKHM1, SMC2, PSMB6, A_24_P936373, PPAN, BC007917, C14orf143, LOC440104 and THC2578957 in the model. The model had an AIC of 207.21, and the process of adding or removing variables was repeated.

Step 4: AIC=205.84
health ~ PCDH12 + DLG5 + BC038559 + SHISA5 + AF161342 + CARKD +
    F2R + PHKG1 + PLEKHM1 + SMC2 + PSMB6 + A_24_P936373 + PPAN +
    BC007917 + C14orf143 + LOC440104 + THC2578957

At step four, CDCP1 was removed, leaving PCDH12, DLG5, BC038559, SHISA5, AF161342, CARKD, F2R, PHKG1, PLEKHM1, SMC2, PSMB6, A_24_P936373, PPAN, BC007917, C14orf143, LOC440104 and THC2578957 in the model. The model had an AIC of 205.84, and the process of adding or removing variables was repeated.

Step 5: AIC=204.51
health ~ PCDH12 + DLG5 + SHISA5 + AF161342 + CARKD + F2R + PHKG1 +
    PLEKHM1 + SMC2 + PSMB6 + A_24_P936373 + PPAN + BC007917 +
    C14orf143 + LOC440104 + THC2578957

Step 6: AIC=203.03
health ~ PCDH12 + DLG5 + SHISA5 + AF161342 + CARKD + F2R + PHKG1 +
    PLEKHM1 + SMC2 + PSMB6 + A_24_P936373 + PPAN + C14orf143 +
    LOC440104 + THC2578957

Step 7: AIC=202.08
health ~ PCDH12 + DLG5 + SHISA5 + AF161342 + CARKD + F2R + PHKG1 +
    PLEKHM1 + SMC2 + PSMB6 + A_24_P936373 + C14orf143 + LOC440104 +
    THC2578957

Step 8: AIC=201.11
health ~ PCDH12 + DLG5 + AF161342 + CARKD + F2R + PHKG1 + PLEKHM1 +
    SMC2 + PSMB6 + A_24_P936373 + C14orf143 + LOC440104 + THC2578957

Step 9: AIC=200.86
health ~ PCDH12 + DLG5 + AF161342 + CARKD + F2R + PHKG1 + PLEKHM1 +
    SMC2 + PSMB6 + A_24_P936373 + C14orf143 + LOC440104

Using the Akaike Information Criterion (AIC), variable selection was carried out to identify the independent variables that best explain the variation in the outcome variable. At step nine of the analysis, the independent variables PCDH12, DLG5, AF161342, CARKD, F2R, PHKG1, PLEKHM1, SMC2, PSMB6, A_24_P936373, C14orf143 and LOC440104 remained in the model. This final model had an AIC value of 200.86. The results of the backward-forward variable selection are shown in the output below.

Table 4: Variable selection output from the regression model

               Df Sum of Sq    RSS    AIC
<none>                      529.99 200.86
+ THC2578957    1      3.20 526.80 201.11
- PCDH12        1      4.98 534.97 201.58
+ SHISA5        1      2.21 527.78 201.65
+ CDCP1         1      1.77 528.22 201.89
+ PPAN          1      1.64 528.35 201.96
- CARKD         1      6.37 536.36 202.33
- PHKG1         1      6.77 536.76 202.55
+ BC007917      1      0.44 529.55 202.62
+ BX440400      1      0.06 529.93 202.83
+ ANKIB1        1      0.05 529.95 202.84
+ BC038559      1      0.00 529.99 202.86
- C14orf143     1      7.65 537.64 203.02
- SMC2          1      9.81 539.80 204.18
- PSMB6         1     13.35 543.34 206.08
- A_24_P936373  1     14.49 544.48 206.69
- DLG5          1     20.12 550.12 209.67
- LOC440104     1     27.02 557.01 213.28
- AF161342      1     30.06 560.05 214.86
- F2R           1     54.35 584.34 227.18
- PLEKHM1       1    392.33 922.32 359.53

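In Table 4, a "+" marks a variable outside the current model that could be added and a "-" marks a variable in the model that could be removed. Every such move gives a higher AIC than the 200.86 of the current model, so the search stops there. Of the 20 independent variables, the eight marked "+" (THC2578957, SHISA5, CDCP1, PPAN, BC007917, BX440400, ANKIB1 and BC038559) were therefore left out of the stepwise model, and the twelve marked "-" were retained. The regression reported next was fitted on the eight "+" variables.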
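The AIC column of Table 4 can be reproduced from the RSS column. The Python sketch below assumes the form of the criterion that R's step procedure uses for linear models, n * log(RSS / n) + 2 * (number of estimated coefficients); this assumption is supported by the fact that it recovers the tabulated values, but the function name step_aic is made up for illustration.

```python
import math

N = 290  # observations in the training data


def step_aic(rss, n_predictors, n=N):
    """AIC up to an additive constant, counting the intercept as a parameter."""
    return n * math.log(rss / n) + 2 * (n_predictors + 1)


# Rows of Table 4: (RSS, number of predictors in the resulting model).
final_model = step_aic(529.99, 12)    # no change: the 12-variable model
add_thc = step_aic(526.80, 13)        # "+ THC2578957"
drop_plekhm1 = step_aic(922.32, 11)   # "- PLEKHM1"

print(round(final_model, 2), round(add_thc, 2), round(drop_plekhm1, 2))
```

Adding THC2578957 lowers the RSS slightly, but the penalty of 2 for the extra coefficient outweighs the gain, so its AIC (201.11) is higher than that of the current model (200.86).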

Table 5: R-studio output for final regression model after variable section

Residuals:
    Min      1Q  Median      3Q     Max
-2.5972 -1.1044 -0.4952  0.3743 12.2916

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.7370     0.9106   3.006 0.002888 **
THC2578957   -0.1031     0.3454  -0.298 0.765576
SHISA5        0.4561     0.3704   1.231 0.219190
CDCP1         0.1236     0.2219   0.557 0.578079
PPAN         -0.2365     0.3303  -0.716 0.474560
BC007917     -0.3231     0.3626  -0.891 0.373685
BX440400     -1.1484     0.3445  -3.333 0.000974 ***
ANKIB1        0.4751     0.3388   1.402 0.161955
BC038559      0.9429     0.4018   2.346 0.019647 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.947 on 281 degrees of freedom
Multiple R-squared: 0.1001, Adjusted R-squared: 0.07447
F-statistic: 3.907 on 8 and 281 DF, p-value: 0.0002159

A new regression was then run and the results above were obtained. From the table, this model contains only THC2578957, SHISA5, CDCP1, PPAN, BC007917, BX440400, ANKIB1 and BC038559. These are the variables marked "+" in Table 4, that is, the ones the stepwise search left out, rather than the twelve it retained. In this regression, only the variables BX440400 (B = -1.1484, T = -3.333, P-value = 0.000974) and BC038559 (B = 0.9429, T = 2.346, P-value = 0.019647) were significant, as their P-values are below the 0.05 level of significance.

All the other variables were insignificant in the model, with large P-values. Furthermore, the R-squared statistic indicates that this model explains only 10.01% of the overall variation in the dependent variable (adjusted R-squared = 7.45%). Nevertheless, the model as a whole was significant, since Fisher’s F-statistic was associated with a very low P-value (F(8, 281) = 3.907, P-value = 0.0002159).

The analysis indicates that variable selection methods, though vital for selecting an appropriate model, must be matched to the goal of the analysis. If the goal is a model that explains as much of the variation in the outcome variable as possible, variable selection will reduce the R-squared statistic, because removing a predictor can never increase it. If the goal is a parsimonious model that minimizes an information criterion such as the AIC, the variable selection methods can judiciously be used. In the data at hand, the stepwise search lowered the AIC from 211.15 for the full model to 200.86 at step nine. The regression in Table 5, however, has an R-squared of 10.01% against 56.11% for the full model. Because Table 5 uses the eight variables that Table 4 marks with "+", this loss should not be read as the cost of AIC-based selection itself: Table 4 gives the step-nine model a residual sum of squares of 529.99, far below the roughly 1,065 implied by Table 5's residual standard error of 1.947 on 281 degrees of freedom. The predicted values after variable selection were added to a separate column in the Submit Fitted.csv file.
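The R-squared of the models discussed here can be approximated from figures already reported, using R-squared = 1 - RSS / TSS. The Python sketch below is a rough cross-check under stated assumptions, not a refit of any model: the total sum of squares is rebuilt from the rounded sample variance of health in Table 1, so the results are accurate to about two decimals only.

```python
n = 290
var_health = 4.10                # sample variance of health (Table 1, rounded)
tss = var_health * (n - 1)       # total sum of squares around the mean

# Residual sums of squares: RSE squared times the residual degrees of freedom,
# or the RSS column of Table 4 for the step-nine model.
rss_full = 1.39 ** 2 * 269       # full 20-variable model (Table 2)
rss_step9 = 529.99               # 12-variable model at step nine (Table 4)
rss_table5 = 1.947 ** 2 * 281    # 8-variable regression (Table 5)

r2_full = 1 - rss_full / tss
r2_step9 = 1 - rss_step9 / tss
r2_table5 = 1 - rss_table5 / tss

print(round(r2_full, 2), round(r2_step9, 2), round(r2_table5, 2))
```

On these approximations the full model and the Table 5 regression come out near their reported R-squared values of 0.5611 and 0.1001, while the twelve-variable model from step nine sits at roughly 0.55, only slightly below the full model.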

R CODE USED IN THE ANALYSIS

library(readxl)

training

April 06, 2023
