# Statistics and Data Science

24 Hillhouse Avenue, 203.432.0666

http://statistics.yale.edu

M.A., Ph.D.

**Chair**

Harrison Zhou

**Directors of Graduate Studies**

John Emerson (24 Hlh, john.emerson@yale.edu)

Andrew Barron (24 Hlh, andrew.barron@yale.edu)

**Professors** Donald Andrews (*Economics*), Andrew Barron, Joseph Chang, Xiaohong Chen (*Economics*), John Emerson (*Adjunct*), John Hartigan (*Emeritus*), Theodore Holford (*Public Health; Biostatistics*), John Lafferty, Peter Phillips (*Economics*), David Pollard, Van Vu (*Mathematics*), Heping Zhang (*Public Health; Biostatistics*), Hongyu Zhao (*Public Health; Biostatistics*), Harrison Zhou

**Associate Professor** Sekhar Tatikonda (*Electrical Engineering*)

**Assistant Professors** Jessi Cisewski, Sahand Negahban, Yihong Wu

**Senior Lecturer** Jonathan Reuning-Scherer

**Lecturer** Susan Wang

## Fields of Study

Fields of study include the main areas of statistical theory (with emphasis on foundations, Bayes theory, decision theory, nonparametric statistics), probability theory (stochastic processes, asymptotics, weak convergence), information theory, bioinformatics and genetics, classification, data mining and machine learning, neural nets, network science, optimization, statistical computing, and graphical models and methods.

## Special Admissions Requirements

GRE scores for the General Test are required. A GRE Subject Test in the area closest to the undergraduate major is recommended for the Ph.D. program and encouraged for the M.A. program. All applicants should have a strong mathematical background, including advanced calculus, linear algebra, elementary probability theory, and at least one course providing an introduction to mathematical statistics. An undergraduate major may be in statistics, mathematics, computer science, or a subject in which significant statistical problems may arise. For those whose native language is not English, scores from the Test of English as a Foreign Language (TOEFL) are required. This requirement is waived only for applicants who, prior to matriculation at Yale, will have received a baccalaureate degree or its international equivalent with three years of residency from a college or university where English is the primary language of instruction.

## Special Requirements for the Ph.D. Degree in Statistics and Data Science

There is no foreign language requirement. Students take at least twelve courses, usually during the first two years. The department strongly recommends that students take S&DS 551 (Stochastic Processes), S&DS 600 (Advanced Probability), S&DS 610 (Statistical Inference), S&DS 612 (Linear Models), S&DS 625 (Statistical Case Studies), and S&DS 661 (Data Analysis), and requires that students take S&DS 626 (Practical Work). Substitutions are possible with the permission of the director of graduate studies (DGS); courses from other complementary departments such as Mathematics and Computer Science are encouraged.

The qualifying examination consists of three parts: a written report on an analysis of a data set, a written examination, and an oral examination. The examinations are taken as scheduled by the department, with provision for one subsequent reexamination of one or more parts in the event that a student does not pass the first time. All parts of the qualifying examination must be completed before the beginning of the third year. A prospectus for the dissertation should be submitted no later than the first week of March in the third year. The prospectus must be accepted by the department before the end of the third year if the student is to register for a fourth year. Upon successful completion of the qualifying examination and the prospectus (and meeting of Graduate School requirements), the student is admitted to candidacy. Students are expected to attend weekly departmental seminars.

Students normally serve as teaching fellows (at level 20 or the equivalent) during four terms to acquire professional training. Although this may be completed during the third and fourth years, some students elect to satisfy part of this requirement in the earlier years of study, with approval of the DGS and their adviser, in areas contributing to their professional development.

## Master’s Degrees

**M.A. (en route to the Ph.D. in Statistics and Data Science)** This degree may be awarded upon completion of eight term courses in Statistics with an average grade of HP or higher, and two terms of residence.

**Terminal Master’s Degree Program in Statistics** Students are also admitted directly to a terminal master’s degree program in Statistics. To qualify for the M.A., the student must successfully complete an approved program of eight term courses in Statistics with an average grade of HP or higher, chosen in consultation with the DGS. Full-time students must take a minimum of four courses per term. Part-time students are also accepted into the master’s degree program. See Terminal M.A./M.S. Degrees, under Policies and Regulations.

Program information is available online at http://statistics.yale.edu.

## Courses

**S&DS 500b, Introductory Statistics** John Emerson

An introduction to statistical reasoning. Topics include numerical and graphical summaries of data, data acquisition and experimental design, probability, hypothesis testing, confidence intervals, correlation and regression. Application of statistical concepts to data; analysis of real-world problems.

MWF 10:30am-11:20am

**S&DS 501a / E&EB 510a, Introduction to Statistics: Life Sciences** Jonathan Reuning-Scherer and Walter Jetz

Statistical and probabilistic analysis of biological problems, presented with a unified foundation in basic statistical theory. Problems are drawn from genetics, ecology, epidemiology, and bioinformatics. *Note:* S&DS 501–506 offer a basic introduction to statistics, including numerical and graphical summaries of data, probability, hypothesis testing, confidence intervals, and regression. Each course focuses on applications to a particular field of study and is taught jointly by two instructors, one specializing in statistics and the other in the relevant area of application. The first seven weeks are attended by all students in S&DS 501–506 together as general concepts and methods of statistics are developed. The course separates for the last six and a half weeks, which develop the concepts with examples and applications. Computers are used for data analysis. These courses are alternatives; they do not form a sequence, and only one may be taken for credit.

TTh 1pm-2:15pm

**S&DS 502a, Introduction to Statistics: Political Science** Jonathan Reuning-Scherer and Kelly Rader

Statistical analysis of politics, elections, and political psychology. Problems presented with reference to a wide array of examples: public opinion, campaign finance, racially motivated crime, and public policy. *Note:* S&DS 501–506 offer a basic introduction to statistics, including numerical and graphical summaries of data, probability, hypothesis testing, confidence intervals, and regression. Each course focuses on applications to a particular field of study and is taught jointly by two instructors, one specializing in statistics and the other in the relevant area of application. The first seven weeks are attended by all students in S&DS 501–506 together as general concepts and methods of statistics are developed. The course separates for the last six and a half weeks, which develop the concepts with examples and applications. Computers are used for data analysis. These courses are alternatives; they do not form a sequence, and only one may be taken for credit.

TTh 1pm-2:15pm

**S&DS 503a, Introduction to Statistics: Social Sciences** Jonathan Reuning-Scherer

Descriptive and inferential statistics applied to analysis of data from the social sciences. Introduction of concepts and skills for understanding and conducting quantitative research. *Note:* S&DS 501–506 offer a basic introduction to statistics, including numerical and graphical summaries of data, probability, hypothesis testing, confidence intervals, and regression. Each course focuses on applications to a particular field of study and is taught jointly by two instructors, one specializing in statistics and the other in the relevant area of application. The first seven weeks are attended by all students in S&DS 501–506 together as general concepts and methods of statistics are developed. The course separates for the last six and a half weeks, which develop the concepts with examples and applications. Computers are used for data analysis. These courses are alternatives; they do not form a sequence, and only one may be taken for credit.

TTh 1pm-2:15pm

**S&DS 505a, Introduction to Statistics: Medicine** Jonathan Reuning-Scherer and Russell Barbour

Statistical methods relied upon in medicine and medical research. Practice in reading medical literature competently and critically, as well as practical experience performing statistical analysis of medical data. *Note:* S&DS 501–506 offer a basic introduction to statistics, including numerical and graphical summaries of data, probability, hypothesis testing, confidence intervals, and regression. Each course focuses on applications to a particular field of study and is taught jointly by two instructors, one specializing in statistics and the other in the relevant area of application. The first seven weeks are attended by all students in S&DS 501–506 together as general concepts and methods of statistics are developed. The course separates for the last six and a half weeks, which develop the concepts with examples and applications. Computers are used for data analysis. These courses are alternatives; they do not form a sequence, and only one may be taken for credit.

TTh 1pm-2:15pm

**S&DS 510a, An Introduction to R for Statistical Computing and Data Science** John Emerson

An introduction to the R language for statistical computing and graphics. R is a widely accepted language for advanced statistical computing and data science in industry as well as in a wide range of academic disciplines. This course is a useful complement (concurrently or in advance) to many courses in S&DS. One-half credit; meets for eight weeks. ½ Course cr

TTh 9am-10:15am

**S&DS 520b, Intensive Introductory Statistics** Joseph Chang

An introduction to statistical reasoning designed for students with particular interest in data science and computing. Using the R language, topics include exploratory data analysis, probability, hypothesis testing, confidence intervals, regression, statistical modeling, and simulation. Computing is taught and used extensively throughout the course. Application of statistical concepts to the analysis of real-world data science problems.

TTh 9am-10:15am

**S&DS 530a or b / PLSC 530a or b, Data Exploration and Analysis** Staff

Survey of statistical methods: plots, transformations, regression, analysis of variance, clustering, principal components, contingency tables, and time series analysis. The R computing language and Web data sources are used.

HTBA

**S&DS 538a, Probability and Statistics** Joseph Chang

Fundamental principles and techniques of probabilistic thinking, statistical modeling, and data analysis. Essentials of probability: conditional probability, random variables, distributions, law of large numbers, central limit theorem, Markov chains. Statistical inference with emphasis on the Bayesian approach: parameter estimation, likelihood, prior and posterior distributions, Bayesian inference using Markov chain Monte Carlo. Introduction to regression and linear models. Computers are used throughout for calculations, simulations, and analysis of data. Prerequisite: differential calculus of several variables; some acquaintance with matrix algebra and computing is assumed.

TTh 1pm-2:15pm

**S&DS 541a, Probability Theory** Winston Lin

A first course in probability theory: probability spaces, random variables, expectations and probabilities, conditional probability, independence, some discrete and continuous distributions, central limit theorem, Markov chains, probabilistic modeling. Prerequisite: calculus of functions of several variables.

MW 9am-10:15am

**S&DS 542b, Theory of Statistics** Andrew Barron

Principles of statistical analysis: maximum likelihood, sampling distributions, estimation, confidence intervals, tests of significance, regression, analysis of variance, and the method of least squares. Prerequisite: S&DS 541.

MWF 9:25am-10:15am

**S&DS 551b, Stochastic Processes** Sahand Negahban

Introduction to the study of random processes, including Markov chains, Markov random fields, martingales, random walks, Brownian motion, and diffusions. Techniques in probability such as coupling and large deviations. Applications chosen from image reconstruction, Bayesian statistics, finance, probabilistic analysis of algorithms, genetics, and evolution.

MW 1pm-2:15pm

**S&DS 562a, Computational Tools for Data Science** Sahand Negahban

An introduction to computational tools for data science. The analysis of data using regression, classification, clustering, principal component analysis, independent component analysis, dictionary learning, topic modeling, dimension reduction, and network analysis. Optimization by gradient methods and alternating minimization. The application of high-performance computing and streaming algorithms to the analysis of large data sets. Prerequisites: linear algebra, multivariable calculus, and programming.

TTh 2:30pm-3:45pm

**S&DS 563b, Multivariate Statistical Methods for the Social Sciences** Jonathan Reuning-Scherer

An introduction to the analysis of multivariate data. Topics include principal components analysis, factor analysis, cluster analysis (hierarchical clustering, k-means), discriminant analysis, multidimensional scaling, and structural equation modeling. Emphasis on practical application of multivariate techniques to a variety of examples in the social sciences. Students complete extensive computer work using either SAS or SPSS. Prerequisites: knowledge of basic inferential procedures and experience with linear models (regression and ANOVA). Experience with some statistical package and/or familiarity with matrix notation is helpful but not required.

TTh 1pm-2:15pm

**S&DS 565a or b, Applied Data Mining and Machine Learning** Staff

Techniques for data mining and machine learning are covered from both a statistical and a computational perspective, including support vector machines, bagging, boosting, neural networks, and other nonlinear and nonparametric regression methods. The course gives the basic ideas and intuition behind these methods, a more formal understanding of how and why they work, and opportunities to experiment with machine-learning algorithms and apply them to data. Prerequisite: after or concurrent with S&DS 542.

HTBA

**S&DS 600b, Advanced Probability** David Pollard

Measure-theoretic probability, conditioning, laws of large numbers, convergence in distribution, characteristic functions, central limit theorems, martingales. Some knowledge of real analysis is assumed.

TTh 2:30pm-3:45pm

**S&DS 610a, Statistical Inference** Harrison Zhou

A systematic development of the mathematical theory of statistical inference covering methods of estimation, hypothesis testing, and confidence intervals. An introduction to statistical decision theory. Knowledge of probability theory at the level of S&DS 541 is assumed.

TTh 11:35am-12:50pm

**S&DS 611b, Selected Topics in Statistical Decision Theory** Harrison Zhou

Recent developments in statistical decision theory, including nonparametric estimation, high-dimensional (non)linear estimation, low-rank and sparse matrix estimation, covariance matrix estimation, graphical models, and network analysis. Prerequisite: S&DS 610.

W 9:25am-11:15am

**S&DS 612a, Linear Models** Joseph Chang

The geometry of least squares; distribution theory for normal errors; regression, analysis of variance, and designed experiments; numerical algorithms (with particular reference to the R statistical language); alternatives to least squares. Prerequisites: linear algebra and some acquaintance with statistics.

MW 11:35am-12:50pm

**S&DS 625a, Statistical Case Studies** Xiaofei Wang

Analysis of a variety of statistical problems using real data. Emphasis on methods of choosing data, acquiring data, assessing data quality, and the issues posed by extremely large data sets. Extensive computations using R.

MW 1pm-2:15pm

**S&DS 626b, Practical Work** John Emerson

Individual one-term projects, with students working on studies outside the department, under the guidance of a statistician.

HTBA

**S&DS 627a and S&DS 628b, Statistical Consulting** John Emerson

Statistical consulting and collaborative research projects often require statisticians to explore new topics outside their area of expertise. This course exposes students to real problems, requiring them to draw on their expertise in probability, statistics, and data analysis. Students complete the course with individual projects supervised jointly by faculty outside the department and by one of the instructors. Students enroll for both terms (S&DS 627 and 628) and receive one credit at the end of the year. ½ Course cr per term

F 2:30pm-4:30pm

**S&DS 630a, Optimization Techniques** Sekhar Tatikonda

Fundamental theory and algorithms of optimization, emphasizing convex optimization. The geometry of convex sets, basic convex analysis, the principle of optimality, duality. Numerical algorithms: steepest descent, Newton’s method, interior point methods, dynamic programming, unimodal search. Applications from engineering and the sciences.

TTh 1pm-2:15pm

**S&DS 661b, Data Analysis** Winston Lin

A selection of statistical topics is studied through the analysis of data sets using the R statistical computing language: linear and nonlinear models, maximum likelihood, resampling methods, curve estimation, model selection, classification, and clustering. Prerequisite: after or concurrent with S&DS 542.

MW 2:30pm-3:45pm

**S&DS 662b, Statistical Computing** John Emerson

Topics in the practice of data analysis and statistical computing, with particular attention to problems involving massive data sets or large, complex simulations and computations. Programming with R, C/C++, and Perl/Python; computational efficiency; memory management; interactive and dynamic graphics; and parallel computing.

TTh 9am-10:15am

**S&DS 669b, Statistical Learning Theory** Sahand Negahban

Introduction to the theoretical analysis of machine-learning algorithms, focusing on the statistical and computational aspects, and covering such subjects as decision theory, empirical process theory, and convex optimization. Prerequisites: linear algebra, multivariable calculus, stochastic processes, and an introduction to machine learning, such as S&DS 565 or a similar course.

MW 2:30pm-3:45pm

**S&DS 670a, Neural Nets** Andrew Barron

Artificial neural networks and related statistical learning methods are developed for high-dimensional function estimation and classification. Approximation capability, statistical accuracy, and computational methodology are explored. Students are expected to provide a report, usually on recent literature, and a computational exploration of one of the methods discussed. Prerequisite: background in probability, statistics, and computation.

W 9am-11:15am

**S&DS 690a or b, Independent Study** Staff

By arrangement with faculty. Approval of DGS required.

HTBA