Assignment 3

The following assignment is intended to give you practice exploring data and running a linear regression on your own using statistical software. You are welcome to use any statistical software you wish, and you are free to work in groups of up to 3 for this assignment. If you work in a group, please submit one completed assignment per group on ICON. Please make sure to add everyone’s name to the submission, either as a comment on ICON or on the document itself.

Instructions

What to turn in

Please turn in a document that contains the following:

  1. answers to the questions below
  2. any relevant statistics/figures that support your answers.

Finally, upload the final document to ICON.

Including Statistical Evidence

This assignment is worth a total of 10 points spread over 8 questions. Please make sure to include statistical evidence, such as graphics or statistics, to support your statements. This assignment also gives you a choice about which question to explore within these data, so including the statistical evidence is extremely important. Failure to include statistical evidence to support claims will result in a 1 pt deduction.

Due Date

Due May 9th, 2024.

The data for this assignment are fertility data for countries. The data were originally compiled from a variety of sources, such as the World Bank, and come from a colleague of mine who uses them in their class.

Data to use: The data can be obtained in CSV format from GitHub. The data are also posted to the data folder within the IDAS. A short description of each attribute is as follows.

variable | class | description
country | character | Country name
region | character | Region of the world (7 total)
fertility_rate | double | Average number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with age-specific fertility rates
educ_female | double | Average number of years of formal education (schooling) for females
infant_mortality | double | Number of infants dying before reaching one year of age, per 1,000 live births in a given year
contraceptive | double | Percentage of women who are practicing, or whose sexual partners are practicing, any form of contraception. It is usually measured for women ages 15–49 who are married or in union
gni_class | character | Categorization based on the country’s gross national income (GNI) per capita, calculated using the World Bank Atlas method. Low = low-income economies (GNI per capita of $1,025 or less); Low/Middle = lower-middle-income economies (GNI per capita between $1,026 and $3,995); Upper/Middle = upper-middle-income economies (GNI per capita between $3,996 and $12,375); Upper = high-income economies (GNI per capita of $12,376 or more)
high_gni | double | Dummy variable indicating whether the country has an upper-middle- or high-income economy (low or low/middle income = 0; upper/middle or upper income = 1)
region_simple | character | Region of the world, simplified. North and South America are combined, and South Asia is combined with East Asia and Pacific (5 total regions)
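If you are not sure how the attribute classes above translate into your software, the following is a minimal sketch in Python (standard library only) of reading a CSV with these column names, converting the "double" columns to numbers, and summarizing an outcome by one categorical attribute. The rows in `SAMPLE_CSV` are made up purely for illustration and are not the course data, and the helper names `read_rows` and `group_means` are hypothetical; with the real file you would pass an open file handle instead of the in-memory string.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Made-up rows for illustration only; these are NOT values from the course data.
SAMPLE_CSV = """country,region_simple,fertility_rate,educ_female,high_gni
Country A,Region 1,5.0,4.0,0
Country B,Region 1,4.0,6.0,0
Country C,Region 2,2.0,11.0,1
Country D,Region 2,1.5,12.0,1
"""

# Columns described as "double" in the table above (subset used in this sketch).
NUMERIC = ("fertility_rate", "educ_female", "high_gni")


def read_rows(handle):
    """Read CSV rows as dicts and convert the numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle):
        for name in NUMERIC:
            row[name] = float(row[name])
        rows.append(row)
    return rows


def group_means(rows, group, outcome):
    """Mean of `outcome` within each level of the categorical column `group`."""
    values = defaultdict(list)
    for row in rows:
        values[row[group]].append(row[outcome])
    return {level: fmean(vals) for level, vals in values.items()}


rows = read_rows(io.StringIO(SAMPLE_CSV))
means = group_means(rows, "region_simple", "fertility_rate")
print(means)
```

Any statistical package will do the same thing in a line or two; the point is only that character columns act as grouping variables and double columns as numeric ones.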

Note

This is a new assignment that does not build off of the first two assignments in the course.

Questions

Note: Each question is worth 1 pt unless otherwise specified.

  1. Using the data descriptions above, create a research question to explore for this assignment that includes a categorical predictor. Explore the research question descriptively. Summarize key similarities/differences from the descriptive analysis. Note: for this first part, include just a single categorical predictor in your research question.

  2. Fit a linear regression model to explore the research question from #1. Interpret each regression coefficient estimate from the linear regression model. That is, what do the specific linear regression coefficients mean in the context of the data problem at hand?

  3. Evaluate the linear regression data conditions/assumptions for the model fitted in #2. Summarize the degree to which the model meets the statistical data conditions of linear regression.

  4. Summarize the statistical evidence to answer the research question from #1. Be as specific as possible about which tests you are exploring and whether the statistical evidence supports or does not support the null hypothesis.

  5. Add, as main (additive) effects, any predictors you think may increase the utility of the fitted statistical model. Note: you are welcome to include more than one attribute in this step, but include at least one. For now, let’s not consider any interactions; make sure the terms are additive.

    • Does the model fit better than the one in #2?
    • Do the primary conclusions from the model fitted in #2 differ for the first categorical attribute considered?
  6. Explore an interaction between two categorical attributes. Ensure both categorical attributes are included as main effects too.

    • Interpret at least one of the interaction estimates. That is, what does the parameter estimate(s) mean in the context of the data? Note: I recommend including at most a two-way interaction effect (i.e., an interaction between two attributes).
    • Also, as part of the interpretation, what does the statistical evidence suggest about whether the interaction effect differs from 0? (2 pts)
  7. Build a new model that adds an interaction effect between a categorical and a continuous attribute (ensure both terms are included as main effects as well). Note: to aid interpretability for this assignment, I recommend including just one categorical attribute, one continuous attribute, and their interaction.

    • Interpret the interaction estimate, that is, what does the interaction parameter estimate mean in the context of the data?
    • As part of the interpretation, what does the statistical evidence suggest about whether the interaction effect differs from 0? (2 pts)
  8. Evaluate the data conditions for the model fitted in #7. Summarize the degree to which the model meets the statistical data conditions of linear regression.
