class: center, middle, inverse, title-slide

# Introduction to DAV 5300
## Computational Mathematics and Statistics
### Jason Bryer, Ph.D.
### August 27, 2024

---
class: hide-logo, bottom, right, title-slide
background-image: url(images/Greetings_from_Statistics.jpeg)
background-size: contain

.font70[
[@skyetetra](https://twitter.com/ChelseaParlett/status/1340463322118856705)
]

---
# Agenda

.pull-left[
* Introductions
* Syllabus
* Class meetups
* Course Schedule
* Assignments (how you will be graded)
* Software setup
* Brief introduction to R
]

.pull-right[
While waiting, please complete this formative assessment:

<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />
]

---
class: font120
# A little about me...

* Earned my Ph.D. in Educational Psychology and Methodology from the University at Albany. Dissertation: [A National Study Comparing Charter and Traditional Public Schools Using Propensity Score Analysis](https://github.com/jbryer/dissertation)
* Assistant Professor at CUNY in Data Science and Information Systems
* Principal Investigator for a Department of Education Grant to develop and test the Diagnostic Assessment and Achievement of College Skills ([www.DAACS.net](http://www.daacs.net))
* Authored over a dozen R packages including:
    * [likert](http://github.com/jbryer/likert)
    * [sqlutils](http://github.com/jbryer/sqlutils)
    * [timeline](http://github.com/jbryer/timeline)
* Specialize in propensity score methods. Three new methods/R packages developed include:
    * [multilevelPSA](http://github.com/jbryer/multilevelPSA)
    * [TriMatch](http://github.com/jbryer/TriMatch)
    * [PSAboot](http://github.com/jbryer/PSAboot)

---
# Also a Father...

<img src="images/BoysFall2019.jpg" width="65%" style="display: block; margin: auto;" />

---
# Runner...

<table border='0' width='100%'><tr><td>
<center><img src='images/2020Dopey.jpg' height='450'></center>
</td><td>
<center><img src='images/2019NYCMarathon.jpg' height='450'></center>
</td></tr></table>

---
# And photographer.

<img src="images/Sleeping_Empire.jpg" width="80%" style="display: block; margin: auto;" />

---
# Syllabus <img src="images/hex/rmarkdown.png" class="title-hex"><img src="images/hex/blogdown.png" class="title-hex">

.pull-left[
Syllabus and course materials are here: https://fall2024.dav5300.net

We will use Canvas primarily for submitting assignments. Please submit PDFs. PDFs are preferred for the homework because there is some LaTeX formatting in the R Markdown files. The `tinytex` R package helps with installing LaTeX, but you can also install LaTeX using [MiKTeX](http://miktex.org) (for Windows) and [BasicTeX](http://www.tug.org/mactex/morepackages.html) (for Mac).
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---
class: font90
# Class Meetings

Class will meet every Tuesday. To get the most out of this class, attendance is required.

**One Minute Papers** - Complete the one minute paper after each Meetup (whether you watch live or watch the recordings). It should take approximately one to two minutes to complete.

---
class: font60
# Schedule

.pull-left[
<table>
<thead>
<tr> <th style="text-align:left;"> Start </th> <th style="text-align:left;"> Topic </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Tuesday, August 27, 2024 </td> <td style="text-align:left;"> Intro to the Course </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, September 03, 2024 </td> <td style="text-align:left;"> Intro to Data </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, September 10, 2024 </td> <td style="text-align:left;"> Summarizing Data </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, September 17, 2024 </td> <td style="text-align:left;"> Probability </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, September 24, 2024 </td> <td style="text-align:left;"> Distributions </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, October 01, 2024 </td> <td style="text-align:left;"> Foundation for Inference </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, October 08, 2024 </td> <td style="text-align:left;"> Inference for Categorical Data </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, October 15, 2024 </td> <td style="text-align:left;"> Inference for Numerical Data </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, October 22, 2024 </td> <td style="text-align:left;"> Linear Regression </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, October 29, 2024 </td> <td style="text-align:left;"> Maximum Likelihood Estimation </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, November 05, 2024 </td> <td style="text-align:left;"> Multiple Regression </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, November 12, 2024 </td> <td style="text-align:left;"> Conferences (online) </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, November 19, 2024 </td> <td style="text-align:left;"> Predictive Modeling </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, November 26, 2024 </td> <td style="text-align:left;"> NO MEETUP - Thanksgiving </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, December 03, 2024 </td> <td style="text-align:left;"> Bayesian Analysis </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, December 10, 2024 </td> <td style="text-align:left;"> Presentations </td> </tr>
<tr> <td style="text-align:left;"> Tuesday, December 17, 2024 </td> <td style="text-align:left;"> Final Exam </td> </tr>
</tbody>
</table>
]

.pull-right[
Assignments are due on the Monday before the next class.
]

---
# Textbooks <img src="images/hex/openintro.png" class="title-hex">

[*OpenIntro Statistics*](https://github.com/jbryer/DATA606Fall2020/blob/master/Textbook/os4.pdf?raw=true) by David Diez, Mine Çetinkaya-Rundel, and Christopher D. Barr.

[*Learning Statistics with R*](https://github.com/jbryer/DATA606Fall2020/blob/master/Textbook/lsr-0.6.pdf?raw=true) by Danielle Navarro - We will only use the Bayesian chapter from this book.

## Optional

[*R for Data Science*](https://r4ds.hadley.nz) by Hadley Wickham and Garrett Grolemund - Recommended reference for those new to R.

---
class: font90
# Assignments

**Labs** (30%) - Labs are designed to give you an opportunity to apply statistical concepts using statistical software.

**Textbook questions** (15%) - The assigned questions from the textbook provide an opportunity to assess conceptual understanding.

**Participation** (10%) - You are expected to attend every class and to complete a [one minute paper](https://forms.gle/CD5Qxkq3xtdxSheW8) at the conclusion of class.

**Data Project** (25%) - In groups of 2 to 3, students will present the results of an analysis using a data set of their choice. More details will be provided a few weeks into the class.

**Final exam** (20%) - A multiple choice exam will be given on the last day of class.

**All assignments are due on Monday.** Assignments submitted late will be penalized. Assignments will not be accepted more than one week after their due date.

---
# Academic Integrity

With the exception of the data project, I expect you to complete all assignments (e.g. homework, labs) on your own. It is fine to ask questions of your peers and professor, but working together and/or sharing answers is not allowed.

## Yeshiva's Policy

The submission by a student of any examination, course assignment, or degree requirement is assumed to guarantee that the thoughts and expressions therein not expressly credited to another are literally the student’s own. Evidence to the contrary will result in appropriate penalties. For more information, visit https://www.yu.edu/academic-integrity.

---
# Communication

* Email: [jason.bryer@yu.edu](mailto:jason.bryer@yu.edu).
* Canvas
* Office hours before and after class and by appointment.

---
class: inverse, middle, center
# Software Setup

---
# Why R?

.pull-left[
There are many languages data scientists use. [R](https://r-project.org) is specifically designed for statistics. We will leverage many R packages that are specifically designed to conduct, teach, and communicate statistical analysis.

To be a well-rounded data scientist, I believe you need to have experience in both R and Python. For this course:

* Use R for the labs (they are designed to help you learn the core commands).
* You may use Python or R for the homework and data project.
]

.pull-right[
<img src="images/R_logo.png" width="965" style="display: block; margin: auto;" />
]

---
# Software <img src="images/hex/tinytex.png" class="title-hex"><img src="images/hex/RStudio.png" class="title-hex"><img src="images/hex/rmarkdown.png" class="title-hex">

This is an applied statistics course, so we will make extensive use of the [R statistical programming language](https://www.r-project.org).

* Install [R](https://cran.r-project.org) and [RStudio](https://rstudio.com) on your own computer. I encourage everyone to do this at some point by the end of the semester.

You will also need to have [LaTeX](https://www.latex-project.org) installed in order to create PDFs. The [`tinytex`](https://yihui.org/tinytex/) R package helps with this process:

```
install.packages('tinytex')
tinytex::install_tinytex()
```

---
class: inverse, middle, center
# Introduction to R

---
# Workflow

.center[
<img src='images/data-science-wrangle.png' alt = 'Data Science Workflow' width='1000' />
]

.font80[Source: [Wickham & Grolemund, 2017](https://r4ds.had.co.nz)]

---
# Tidy Data

.center[
<img src='images/tidydata_1.jpg' height='500' />
]

See Wickham (2014) [Tidy data](https://vita.had.co.nz/papers/tidy-data.html).

---
# Types of Data

.pull-left[
* Numerical (quantitative)
    * Continuous
    * Discrete
* Categorical (qualitative)
    * Regular categorical
    * Ordinal
]

.pull-right[
.center[
<img src='images/continuous_discrete.png' height='400' />
]
]

---
# Data Types in R

<img src="images/DataTypesConceptModel.png" width="1000" style="display: block; margin: auto;" />

---
# Data Types / Descriptives / Visualizations

Data Type    | Descriptive Stats                             | Visualization
-------------|-----------------------------------------------|-------------------
Continuous   | mean, median, mode, standard deviation, IQR   | histogram, density, box plot
Discrete     | contingency table, proportional table, median | bar plot
Categorical  | contingency table, proportional table         | bar plot
Ordinal      | contingency table, proportional table, median | bar plot
Two quantitative | correlation                               | scatter plot
Two qualitative  | contingency table, chi-squared            | mosaic plot, bar plot
Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot

---
# Variance

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Consider a dataset with five values (black points in the figure). For the largest value, the deviation from the mean is represented by the blue line ( `\(x_i - \bar{x}\)` ).

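As a minimal sketch of the deviation `\(x_i - \bar{x}\)`, the R code below uses five made-up values (an assumption for illustration; they are not the values plotted in the figure) and computes how far each value lies from the mean.

``` r
# Hypothetical data: five illustrative values, not those plotted in the figure
x <- c(2, 4, 4, 5, 10)

x_bar <- mean(x)          # mean of the five values
deviations <- x - x_bar   # one deviation (x_i - x_bar) per value

deviations                # deviation of every point from the mean
max(x) - x_bar            # deviation for the largest value
sum(deviations)           # deviations from the mean sum to zero
```

Because the deviations sum to zero (up to rounding), they are squared before being averaged.
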
See also:
https://shiny.rit.albany.edu/stat/visualizess/
https://github.com/jbryer/VisualStats/
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

In the numerator, we square each of these deviations. We can conceptualize this as a square. Here, we add the deviation in the *y* direction.
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We end up with a square.
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We can plot the squared deviation for every data point. That is, each component in the numerator is the area of one of these squares.
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

The variance is therefore the average of the areas of all these squares, here represented by the orange square.
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---
# Population versus Sample Variance

.pull-left[
Typically we want the sample variance. The difference is that we divide by `\(n - 1\)` to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by `\(n\)`.

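A short sketch of the two calculations in R, using five made-up values (an assumption for illustration only). Base R's `var()` divides by `\(n - 1\)`, so it returns the sample variance; the population version is computed by hand here.

``` r
# Hypothetical data for illustration only
x <- c(2, 4, 4, 5, 10)
n <- length(x)

ss <- sum((x - mean(x))^2)   # numerator: sum of squared deviations
pop_var  <- ss / n           # population variance: divide by n
samp_var <- ss / (n - 1)     # sample variance: divide by n - 1

pop_var
samp_var
var(x)                       # base R var() uses the n - 1 denominator
```

Both versions share the same numerator, so the sample variance is always the larger of the two.
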
Population Variance (yellow):

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Sample Variance (green):

$$ s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$$
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---
# Robust Statistics

Consider the following data randomly selected from the normal distribution:

.pull-left[

``` r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
```

```
## [1] 103.1934
```

```
## [1] 16.8945
```

``` r
median(x); IQR(x)
```

```
## [1] 103.9947
```

```
## [1] 25.68004
```
]

.pull-right[
<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]

---
# Robust Statistics

<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---
# Robust Statistics

<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

Let's add an extreme value:

``` r
x <- c(x, 1000)
```

--

<img src="01-Intro_to_Course_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---
# Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

* for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
* for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread

---
class: inverse, right, middle, hide-logo

<!--img src="images/hex/DATA606.png" width="150px"/-->

# Good luck with the semester!

[jason.bryer@yu.edu](mailto:jason.bryer@yu.edu)

[@jbryer](https://github.com/jbryer)

[@jbryer@vis.social](https://vis.social/@jbryer)

[github.com/jbryer/DAV5300-2024-Spring](https://github.com/jbryer/DAV5300-2024-Spring)