Data mining

2018/2019
Programme:
Computer Science and Mathematics, Second Cycle
Year:
1 of 2
Semester:
second
Kind:
optional
ECTS:
6
Language:
Slovenian, English
Course director:

Blaž Zupan

Hours per week – 2nd semester:
Lectures
3
Seminar
1.33
Tutorial
0.67
Lab
0
Content (Syllabus outline)

The course will cover theoretical and practical aspects of the following data mining approaches:
Introduction to data mining, taxonomy of data mining approaches and tasks
Data mining programming environments (scripting, visual programming)
Data preprocessing (dimensionality reduction, feature construction, identification of outliers)
Classification, including support vector machines and feature interaction discovery
Clustering, with emphasis on techniques that can handle very large data sets and techniques for determining an appropriate number of clusters
Evaluation, including permutation-based and cross-validation approaches, statistical scoring of models
Data and model visualization techniques, visualization of networks
Text mining, text-based kernels for support vector machines
Integrative aspects, including ensemble methods and mining with inclusion of prior knowledge
Typical mistakes in data mining and how to avoid them
The course will consist of lectures on core data mining techniques and tools, which will then be applied to practical problems during lab work. We will focus on open source solutions and modern scripting languages (e.g., Python). Students will use scripting to access various data mining techniques and combine them, within a programming framework, into their own data mining procedures.

Readings
  1. Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Pearson Education, Boston.
  2. Leskovec J, Rajaraman A, Ullman J (2014) Mining of Massive Datasets, 2nd edition. Cambridge University Press.
  3. Chollet F (2018) Deep Learning with Python. Manning Publications.
Objectives and competences

Students will learn a number of core techniques for data mining. The course will include an introduction to data mining as well as a detailed study of several selected methods. It will also focus on the practical use of these methods on real-life problems. The course will use a scripting data mining environment, in which students will learn how to use existing data mining libraries and how to design and implement their own data mining solutions in code.

Intended learning outcomes

After completing the course, the student will be able to:
recognize problems where one can apply machine learning,
understand the process of transforming problem-specific data into a form suitable for data mining,
understand the differences between various data mining techniques when applied to real-world data,
identify the advantages that different machine learning techniques offer for specific data sets,
write Python scripts for data analytics and integrate various data mining libraries within them,
use libraries for deep learning,
understand the mathematics behind most of the data mining approaches.

Learning and teaching methods

Lectures combining the blackboard with computer projection (coding, visualization of models, results). Lab work in computer-equipped lecture rooms. Individual and team work. Emphasis on practical problem solving.

Assessment

Continuous assessment (homework, midterm exams, project work)
Final (written and oral exam)
Grading: 5 (fail), 6-10 (pass) (according to the Statute of UL)

Lecturer's references

Five most important works:
Stajdohar M, Rosengarten RD, Kokosar J, Jeran L, Blenkus D, Shaulsky G, Zupan B (2017) dictyExpress: a web-based platform for sequence data management and analytics in Dictyostelium and beyond. BMC Bioinformatics 18(1):291.
Zitnik M, Zupan B (2016) Jumping across biomedical contexts using compressive data fusion. Bioinformatics 32(12):i90-i100.
Zitnik M, Nam EA, Dinh C, Kuspa A, Shaulsky G, Zupan B (2015) Gene prioritization by compressive data fusion and chaining. PLoS Computational Biology 11(10):e1004552.
Staric A, Demsar J, Zupan B (2015) Concurrent software architectures for exploratory data analysis. WIREs Data Mining and Knowledge Discovery 5(4):165-180.
Zitnik M, Zupan B (2015) Data fusion by matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(1):41-53.
The complete bibliography is available on SICRIS:
http://sicris.izum.si/search/rsr.aspx?lang=slv&,id=7764.