1379th Wednesday Seminar: Dana Donevska: Big Data Analysis of Scholarly Networks Using the OpenAlex Dataset and Apache Spark SQL
Big Data Analysis of Scholarly Networks Using the OpenAlex Dataset and Apache Spark SQL
Dana Donevska, FAMNIT
This practical work analyzes large-scale scholarly data in computer science using the OpenAlex dataset, an openly accessible knowledge graph containing metadata on over 474 million works linked to authors, institutions, funders, and other entities. The analysis is restricted to computer science works from Slovenian institutions, which matches the scope of my thesis. It is carried out in Apache Spark SQL, using temporary views and three main queries (CS_SI_stats.sql, CS_SI_yearly.sql, and SI_best_authors.sql). These queries extract topic trends, citation behavior, and collaboration structures across works and authors. Visualizations created with Tableau Public 2025.3 illustrate institutional activity and the evolution of research themes over time. The work demonstrates how modern big-data tools can extract meaningful insights from massive scholarly datasets and support a deeper understanding of research dynamics in Slovenian computer science.
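The temporary-view pattern mentioned in the abstract can be illustrated with a small, self-contained sketch. It is not the content of CS_SI_yearly.sql or any of the other queries from the talk. It uses Python's built-in sqlite3 module as a lightweight stand-in for Spark SQL, and the table name `works`, the view name `cs_si_works`, the column names, and the toy rows are all hypothetical. In Spark SQL the view would typically be declared with `CREATE OR REPLACE TEMPORARY VIEW` instead.

```python
import sqlite3

# Hypothetical toy rows, not real OpenAlex records:
# (work_id, publication_year, country_code, field, cited_by_count)
ROWS = [
    ("W1", 2021, "SI", "Computer Science", 10),
    ("W2", 2021, "SI", "Computer Science", 4),
    ("W3", 2022, "SI", "Computer Science", 7),
    ("W4", 2022, "DE", "Computer Science", 30),
    ("W5", 2022, "SI", "Physics", 12),
]


def yearly_cs_si_stats(rows):
    """Return (year, number of works, total citations) for SI computer science works."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE works ("
            "work_id TEXT, publication_year INTEGER, country_code TEXT, "
            "field TEXT, cited_by_count INTEGER)"
        )
        conn.executemany("INSERT INTO works VALUES (?, ?, ?, ?, ?)", rows)

        # The temporary view narrows the data once; later queries reuse it.
        conn.execute(
            "CREATE TEMP VIEW cs_si_works AS "
            "SELECT * FROM works "
            "WHERE country_code = 'SI' AND field = 'Computer Science'"
        )

        # Yearly aggregation over the view (assumed shape of a per-year query).
        return conn.execute(
            "SELECT publication_year, COUNT(*) AS n_works, "
            "SUM(cited_by_count) AS total_citations "
            "FROM cs_si_works "
            "GROUP BY publication_year "
            "ORDER BY publication_year"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for year, n_works, total_citations in yearly_cs_si_stats(ROWS):
        print(year, n_works, total_citations)
```

Defining the filter once in a view keeps the follow-up queries short, since each of them can select from the narrowed view rather than repeating the country and field conditions.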
P.S. Anyone who would like to present at one of the upcoming seminars should send me the title of the topic along with a short abstract.