Web Information Extraction and Retrieval

2022/2023
Programme:
Computer Science and Mathematics, Second Cycle
Year:
1st and 2nd year
Semester:
first or second
Kind:
mandatory
ECTS:
6
Language:
Slovenian, English
Lecturers:

Marko Bajec

Hours per week – 1st or 2nd semester:
Lectures
3
Seminar
0.67
Tutorial
1.33
Lab
0
Content (Syllabus outline)

This course covers the following topics:

  • Information Retrieval and Web Search:
    Basic Concepts of Information Retrieval
    Information Retrieval Models
    Relevance Feedback
    Evaluation Measures
    Text and Web Page Pre-Processing
    Inverted Index and Its Compression
    Latent Semantic Indexing
    Web Search
    Meta-Search: Combining Multiple Rankings

  • Web Crawling:
    A Basic Crawler Algorithm
    Implementation Issues
    Universal Crawlers
    Focused Crawlers
    Topical Crawlers

  • Structured Data Extraction:
    Wrapper Induction
    Instance-Based Wrapper Learning
    Automatic Wrapper Generation
    String Matching and Tree Matching
    Multiple Alignment
    Building DOM Trees
    Extraction Based on a Single List Page or Multiple Pages

  • Information Integration:
    Schema-Level Matching
    Domain and Instance-Level Matching
    Combining Similarities
    1:m Match
    Integration of Web Query Interfaces
    Constructing a Unified Global Query Interface

  • Opinion Mining and Sentiment Analysis:
    Document Sentiment Classification
    Sentence Subjectivity and Sentiment Classification
    Opinion Lexicon Expansion
    Aspect-Based Opinion Mining
    Opinion Search and Retrieval

Readings
  • Bing Liu: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications), Springer, August 2013
  • Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search, 2nd Edition, ACM Press Books, 2010
Objectives and competences

The main objective of this course is to teach students how to develop programs for web search (covering both the surface web and the deep web) and for extracting structured data from both static and dynamic web pages. Besides the basic concepts of web search and retrieval, students will learn the relevant techniques and approaches. Students who complete the course successfully will be able to develop programs for automatic web search and structured data extraction from web pages, including search and extraction from online social media.

Intended learning outcomes

After successful completion of the module, students will be able to:

  • summarize the most important approaches and techniques for searching and extracting data from the web,
  • select the approaches and techniques that are most suitable for individual problems in web information extraction and retrieval,
  • develop applications for data acquisition and analysis,
  • construct new algorithms for web data search and extraction,
  • explain the behavior and time complexity of specific web search algorithms,
  • integrate and employ different open-source solutions from the field.
Learning and teaching methods

Lectures, seminars, homework, oral presentations, project work.

Assessment

Continuous assessment (homework, midterm exams, project work)
Final (written and oral exam)
Grading: 5 (fail), 6-10 (pass) (according to the Statute of UL)

Lecturer's references

Five most important works:
ŠUBELJ, Lovro, BAJEC, Marko. Group detection in complex networks: an algorithm and comparison of the state of the art. Physica A, 2014
ŽITNIK, Slavko, ŠUBELJ, Lovro, LAVBIČ, Dejan, VASILECAS, Olegas, BAJEC, Marko. General context-aware data matching and merging framework. Informatica, 2013
LAVBIČ, Dejan, BAJEC, Marko. Employing semantic web technologies in financial instruments trading. International journal of new computer architectures and their applications, 2012
ŠUBELJ, Lovro, FURLAN, Štefan, BAJEC, Marko. An expert system for detecting automobile insurance fraud using social network analysis. Expert systems with applications, 2011
ŠUBELJ, Lovro, JELENC, David, ZUPANČIČ, Eva, LAVBIČ, Dejan, TRČEK, Denis, KRISPER, Marjan, BAJEC, Marko. Merging data sources based on semantics, contexts and trust. The IPSI BgD transactions on internet research, 2011
The complete bibliography is available on SICRIS:
http://sicris.izum.si/search/rsr.aspx?lang=slv&,id=9270.