Web Information Extraction and Retrieval

2024/2025

Programme:

Computer Science and Mathematics, Second Cycle

Year:

1st and 2nd year

Semester:

first or second

Kind:

mandatory

ECTS:

Language:

Slovenian, English

Course director:

Marko Bajec

Hours per week – 1st or 2nd semester:

Lectures

Seminar

0.67

Tutorial

1.33

Lab

There are no prerequisites.

Content of the course:
This course will cover the following topics:

Information Retrieval and Web Search:
Basic Concepts of Information Retrieval
Information Retrieval Models
Relevance Feedback
Evaluation Measures
Text and Web Page Pre-Processing
Inverted Index and Its Compression
Latent Semantic Indexing
Web Search
Meta-Search: Combining Multiple Rankings
Web Crawling:
A Basic Crawler Algorithm
Implementation Issues
Universal Crawlers
Focused Crawlers
Topical Crawlers
Structured Data Extraction:
Wrapper Induction
Instance-Based Wrapper Learning
Automatic Wrapper Generation
String Matching and Tree Matching
Multiple Alignment
Building DOM Trees
Extraction Based on a Single List Page or Multiple Pages
Information Integration:
Schema-Level Matching
Domain and Instance-Level Matching
Combining Similarities
1:m Match
Integration of Web Query Interfaces
Constructing a Unified Global Query Interface
Opinion Mining and Sentiment Analysis:
Document Sentiment Classification
Sentence Subjectivity and Sentiment Classification
Opinion Lexicon Expansion
Aspect-Based Opinion Mining
Opinion Search and Retrieval

Bing Liu: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications), Springer, August 2013
Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search, 2nd Edition, ACM Press Books, 2010

The main objective of this course is to teach students how to develop programs for web search (including surface web and deep web search) and for the extraction of structured data from both static and dynamic web pages. Besides the basic concepts of web search and retrieval, students will learn about the relevant techniques and approaches. After successfully completing the course, students will be able to develop programs for automatic web search and structured data extraction from web pages (including search and extraction from online social media).

After successful completion of the module, students will be able to:

summarize the most important approaches and techniques for searching and extracting data from the web,
select the approaches and techniques that are most suitable for individual problems in web information extraction and retrieval,
develop applications for data acquisition and analysis,
construct new algorithms for web data search and extraction,
explain the behavior and time complexity of specific web search algorithms,
integrate and employ different open-source solutions from the field.

Lectures, seminars, homework, oral presentations, project work.

Continuous assessment (homework, midterm exams, project work)
Final (written and oral exam)
Grading: 5 (fail), 6-10 (pass), according to the Statute of UL

Five most important works:
ŠUBELJ, Lovro, BAJEC, Marko. Group detection in complex networks: an algorithm and comparison of the state of the art. Physica A, 2014
ŽITNIK, Slavko, ŠUBELJ, Lovro, LAVBIČ, Dejan, VASILECAS, Olegas, BAJEC, Marko. General context-aware data matching and merging framework. Informatica, 2013
LAVBIČ, Dejan, BAJEC, Marko. Employing semantic web technologies in financial instruments trading. International journal of new computer architectures and their applications, 2012
ŠUBELJ, Lovro, FURLAN, Štefan, BAJEC, Marko. An expert system for detecting automobile insurance fraud using social network analysis. Expert systems with applications, 2011
ŠUBELJ, Lovro, JELENC, David, ZUPANČIČ, Eva, LAVBIČ, Dejan, TRČEK, Denis, KRISPER, Marjan, BAJEC, Marko. Merging data sources based on semantics, contexts and trust. The IPSI BgD transactions on internet research, 2011
The complete bibliography is available on SICRIS:
http://sicris.izum.si/search/rsr.aspx?lang=slv&,id=9270.