A full-day tutorial at CSCL 2017
Philadelphia, PA
June 18, 2017

In this full-day tutorial, we will discuss the use and analysis of Big Data in Education. Participants will learn data science methods appropriate for handling large data sets, and methods for conducting data mining on the types of interaction and text data often seen in educational data sets. We will also discuss appropriate practice around data and data science in education.

The ability to collect, manipulate, analyze, and act on vast amounts of data presents potentially powerful research opportunities for scientists and educators. The explosive growth of interest in large-scale data analysis has in turn been driven by research in online education, by scholars with access to data at an unprecedented scale, and by government, social services and non-profits leveraging data for social good. This emerging discipline of data science relies on a novel mix of mathematical and statistical modeling, computational thinking and methods, data representation and management, data mining and machine learning, and domain expertise. In education, the educational data mining and learning analytics communities have emerged, with an interest in understanding how best to apply the methods of data science to education. The CSCL community has an opportunity to contribute significantly towards the research needed to drive development of the field of data science. In addition, the CSCL community has the obligation to engage in developing guidelines for the responsible use of data science methods and analytics in education technology.

This full-day tutorial will discuss emerging theories, algorithms, and applications of data-driven education. For example, one focal point will be how to support participants in creating and evaluating models of human behavior and in developing computationally efficient and accurate methods.


Beverly Park Woolf, Ph.D., Ed. D. is a Research Professor Emeritus in the College of Information and Computer Science at the University of Massachusetts, Amherst. She brings a multi-disciplinary background to the project with a Ph.D. in computer science and artificial intelligence, a doctorate in Education and 25 years of experience in computer science/education research. She has received 40 funded grants, primarily to study the design and use of digital technology to support human learning. One of her Internet-based homework systems was licensed and is widely used throughout the U.S. by approximately 150,000 students at 500 institutions. She has designed, implemented and deployed dozens of intelligent tutors and educational systems. Dr. Woolf will oversee the Big Data and CSCL tutorial and help design the curriculum.

Ryan Baker, Ph.D. is an Associate Professor of Cognitive Studies at the University of Pennsylvania, Associate Editor of the Journal of Educational Data Mining and the International Journal of Artificial Intelligence and Education, and was Founding President of the International Educational Data Mining Society. Previously, he was Technical Director of the Pittsburgh Science of Learning Center DataShop. He has extensive experience teaching educational data mining. His three MOOCs and open online textbook on educational data mining and learning analytics have served over 100,000 students. He has conducted workshops and tutorials on this subject at several conferences worldwide. Baker has been PI, Co-PI, or senior personnel on 15 funded grants and has conducted research on over a dozen online learning systems. He has garnered a dozen best paper awards or nominations and around 6000 citations. Ryan Baker will lead the teaching of the data science tutorial.

Ivon Arroyo, Ed. D. is an Assistant Professor in Social Science and Policy Studies at Worcester Polytechnic Institute with a doctorate in education and a master’s degree in computer science. She has ten years of experience in the design and development of intelligent tutoring systems for mathematics education and has designed, implemented and evaluated several NSF-supported intelligent tutors. Her work focuses on the infrastructure, pedagogy and content resources for digital instructional systems for mathematics teaching at both the elementary and high school levels. Dr. Arroyo has taught several courses about educational data mining and has extensive experience in randomized controlled evaluations of mathematics education software with thousands of K-12 students. She is a Fulbright Scholar and an elected member of the executive committee of the International Society of Artificial Intelligence in Education. Ivon Arroyo will teach the tutorial course with Ryan Baker.


In this full-day tutorial, participants will learn about several tools appropriate for use in education, including tools for data mining, data processing and preparation, tools for large-scale data management, and the MOOC Replication Framework (MORF), which enables CSCL analyses. There will be particular focus on RapidMiner, a software platform that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the data mining process, including data preparation, results visualization, validation and optimization. RapidMiner provides data mining and machine learning procedures including data loading and transformation, data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. During the tutorial, participants will be taught about existing large-scale databases in CSCL and related areas.

Participants will also learn about key algorithms for conducting data mining, including prediction modeling algorithms and association rule mining. Participants will learn how to evaluate their models appropriately and ensure model validity. During the tutorial itself, participants will work with a database that contains up to 10 years of data for several cohorts of students. Participants will use feature engineering and basic data mining techniques to conduct longitudinal prediction modeling analyses.
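
As a small illustration of the association rule mining mentioned above, the sketch below computes the support and confidence of one rule over a toy set of student sessions. The data, the action names, and the function names are all invented for this example; they are not tutorial materials and do not reflect how RapidMiner or MORF implement rule mining.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: each set holds the actions seen in one student session.
SESSIONS: List[FrozenSet[str]] = [
    frozenset({"hint", "retry", "correct"}),
    frozenset({"hint", "retry"}),
    frozenset({"retry", "correct"}),
    frozenset({"hint", "correct"}),
    frozenset({"hint", "retry", "correct"}),
]


def support(itemset: Iterable[str], sessions: List[FrozenSet[str]]) -> float:
    """Fraction of sessions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= session for session in sessions) / len(sessions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               sessions: List[FrozenSet[str]]) -> float:
    """Share of sessions containing the antecedent that also contain the consequent."""
    a = frozenset(antecedent)
    base = support(a, sessions)
    if base == 0:
        return 0.0  # the rule never fires, so its confidence is reported as 0
    return support(a | frozenset(consequent), sessions) / base


if __name__ == "__main__":
    # Rule under test (illustrative only): {hint, retry} -> {correct}
    print("support:", support({"hint", "retry"}, SESSIONS))
    print("confidence:", confidence({"hint", "retry"}, {"correct"}, SESSIONS))
```

In practice, rules would be mined over many candidate itemsets and filtered by minimum support and confidence thresholds; this sketch only scores a single hand-picked rule.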

We will conclude with a discussion of responsible practice and data privacy in education.


This tutorial will consist of co-lectures from two experts in the field, as well as graduate students in their laboratories. We invite researchers from both academia and industry to attend the lecture. The tutorial will be designed as collaborative knowledge-building sessions on the focused issue of big data, where participants actively work together (e.g., analyzing data, discussing design criteria, collaborating on an analysis).


Please submit a brief statement of interest/application to by 11:59 pm Eastern Standard Time on 4/28/2017.

In this statement, please note what background you have in data mining, statistics, mathematical modeling, data analysis (including Excel), and computer programming, and indicate what your goals are for participating in this workshop (e.g., what you hope to do with the material you learn). Note that neither an extensive statistical background nor prior programming experience is required. Our primary goal is to verify that all applicants will be able to benefit from participation in this tutorial.

Can’t make it to CSCL?

Take Big Data and Education online!