Seminar Distributed Data Mining

Year: 2016-2017
Catalog number: 4343SDDM6
Teacher(s):
  • dr. W. Kowalczyk
Language: English
Blackboard: Yes
EC: 6.0
Level: 500
Period: Semester 2
Hours of study: 26:00 hrs
  • Yes Elective choice
  • Yes Contractual enrollment
  • Yes Exchange
  • Yes Study Abroad
  • No Evening course
  • No A la Carte
  • No Honours Class

Admission requirements

• Any course related to processing big data sets, e.g., Advances in Data Mining, Social Network Analysis, Multimedia Systems, Audio Processing and Indexing
• Fluency in Linux, Python, and optionally C/C++, Java

Description

In recent years we have witnessed a rapidly growing gap between the amount of collected data and the data processing capabilities of conventional computers. This is not surprising: according to Moore’s Law, the processing power of an “average computer” doubles every 18 months, while, according to Lyman and Varian from Berkeley, the amount of stored data doubles every 12 months. On top of this growing gap, there is an increasing need to analyze the data more quickly, more precisely, and more “intelligently”. Besides the traditional data mining tasks (classification, regression, and clustering), new challenges have emerged that require completely new algorithms for:
• analysis of big networks: web pages, social networks (Facebook, Twitter), traffic, financial networks
• recommender systems: Amazon, Netflix
• digital forensics: analysis of data related to cybercrime
• analysis of large text corpora (Wikipedia, Github, Twitter)
• scientific data mining (bioinformatics, astronomy, physics)
• analysis of sensor data
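To make the growth gap mentioned above concrete, the following minimal sketch computes how far stored data outpaces processing power, assuming both quantities double at the stated rates (12 and 18 months). The function names are illustrative and not part of the course material.

```python
def doubling_growth(months, doubling_period_months):
    """Growth factor after `months` for a quantity that doubles every period."""
    return 2.0 ** (months / doubling_period_months)


def data_to_compute_gap(months, data_period=12.0, compute_period=18.0):
    """Ratio of data growth to processing-power growth after `months`.

    Assumes pure exponential growth with the doubling periods quoted in
    the description; real-world growth is of course less regular.
    """
    return doubling_growth(months, data_period) / doubling_growth(months, compute_period)


for years in (3, 6, 9):
    gap = data_to_compute_gap(12 * years)
    print(f"after {years} years the gap has grown by a factor of {gap:.1f}")
```

Under these assumptions the ratio equals 2^(t/12 - t/18) = 2^(t/36), so the gap itself doubles every three years.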
In order to cope with this overwhelming data flow, several frameworks for distributed data mining, together with specialized data mining algorithms, have been developed, e.g., Hadoop and MapReduce, Spark, and GraphLab. During the seminar, students (organized in small teams) will work on challenging data mining problems of their own choosing, perform experiments on our cluster computers (DAS4 or DAS5), and report on their problems, approaches, and results during weekly meetings. Each team will summarize its work in a final report.
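As an illustration of the MapReduce programming model mentioned above, here is a minimal single-machine sketch of a word count in plain Python. It does not use the Hadoop or Spark APIs; the function names (map_phase, shuffle_phase, reduce_phase) are hypothetical and only mirror the three conceptual stages that a real framework would distribute over a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages sequentially; a cluster would run them in parallel."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


documents = ["big data needs big clusters", "data mining on clusters"]
counts = word_count(documents)
print(counts)
```

Because each call to map_phase depends on a single document and each call to reduce_phase on a single key, both stages can in principle be spread over many machines, which is the property the frameworks above exploit.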

Course objectives

During the seminar, students will:
• gain detailed knowledge of some modern tools used in distributed data mining
• gain some hands-on experience with mining big data sets on distributed platforms
• learn to work together in small research teams
• identify some promising research directions

Timetable

The most recent timetable can be found on the LIACS website.

Mode of instruction

  • Weekly presentations and discussions
  • An experimental research project

Assessment method

The grade will be based on 3 components:

  • presentations (30%)
  • software developed during the seminar (30%)
  • final report (40%)

Blackboard

See Blackboard

Reading list

  • A. Rajaraman, J. Leskovec, and J. Ullman, Mining of Massive Datasets
  • Additional articles will be distributed during the first meeting.

Registration

You have to sign up for classes and examinations (including resits) in uSis. Check this link for more information and activity codes.

Contact information

Study coordinator Computer Science, Riet Derogee
