Data-Driven Media Research Methods – Artificial Intelligence – How to Analyze an Ocean of Text?

Eötvös Loránd Tudományegyetem

Delivery institution

Eötvös Loránd Tudományegyetem

ELTE/BTK

Department of Media and Communication

Instructor(s):

Márton Gosztonyi, PhD

Start date

17 February 2025

End date

17 May 2025

Study field

Social Sciences, journalism and information

CHARM priority field

Technology and STEM, Transversal Skills

Study level

BA/BSc

Study load, ECTS

Short description

The goal of the course is to provide students with comprehensive knowledge of fundamental techniques and methods in natural language processing (NLP) and machine learning. Throughout the semester, students will learn the Python programming language and its application to various text processing tasks, including web scraping, text preparation, tokenization, and lemmatization. The course covers vector models, probabilistic models, and different types of machine learning models, enabling students to summarize texts, classify texts, and apply various linguistic models and text modeling techniques. In addition to theoretical knowledge, students will deepen their understanding through practical exercises, preparing them to apply advanced NLP solutions in real-world contexts.

Full description

Course Topics and Schedule (based on a 14-week semester):

1) Introduction to Artificial Intelligence and NLP
2) Python I: Basic operations, variables, file reading
3) Python II: if/else statements, strings, collections, lists, list functions, loops
4) Python III: Web Scraping, collections, tuples, dictionaries
5) Vector Models: Text preparation, tokenization, lemmatization
6) Vector Models: TF-IDF, Neural Word Embeddings
7) Probabilistic Models: Markov model, text classification
8) Probabilistic Models: Language models, text generation, poetry writing
9) Probabilistic Models: N-Gram-based word substitution
10) Machine Learning Models: Naive Bayes – How to determine if my model is adequate?
11) Machine Learning Models: Logistic Regression – Sentiment Analysis
12) Machine Learning Models: Text summarization
13) Machine Learning Models: Latent Dirichlet Allocation – Topic Modeling, Non-negative Matrix Factorization (NMF)
14) Machine Learning Models: Latent Semantic Analysis (Latent Semantic Indexing)

Learning outcomes

At the end of the course, the learner will be able to analyze large text datasets using natural language processing (NLP) and machine learning techniques.

At the end of the course, the learner will be able to apply Python programming skills to perform tasks such as web scraping, text preparation, tokenization, and lemmatization.

At the end of the course, the learner will be able to utilize vector models, probabilistic models, and machine learning models to summarize, classify, and model texts.

At the end of the course, the learner will be able to create language models for text generation and perform sentiment analysis using logistic regression.

At the end of the course, the learner will be able to implement machine learning techniques such as Naive Bayes and topic modeling with Latent Dirichlet Allocation (LDA) to explore patterns in textual data.

At the end of the course, the learner will be able to evaluate the effectiveness of NLP models by interpreting the results of machine learning algorithms.

At the end of the course, the learner will be able to collaborate in group projects to develop Python scripts, apply NLP methods, and present findings based on the analysis of real-world data.

Course requirements

No requirements.

Places available

Course literature (compulsory or recommended):

Required Reading:

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly Media.
Antić, Z. (2021). Python Natural Language Processing Cookbook. Packt Publishing Ltd.

Wittgenstein, L. (1998). Philosophical Investigations. Atlantis Budapest.

Chomsky, N. (1968). Linguistic Contributions to the Study of Mind: Future. In Language and Mind.

Gadamer, H.G. (2003). Truth and Method: Outline of a Philosophical Hermeneutics. Sapientia Humana Osiris.

Planned educational activities and teaching methods:

Interactive sessions that encourage discussion, analysis, and practical exploration of NLP techniques and machine learning models. Students will engage with real-world examples, case studies, and academic papers.

Group Work:

Collaborative projects where students will work in teams to apply machine learning methods to analyze text datasets. This includes creating Python scripts, performing data processing tasks, and presenting their findings.

Practical Lab Sessions:

Hands-on programming exercises conducted in a computer lab setting. Students will implement the techniques learned in the lectures, such as web scraping, tokenization, text classification, and summarization, using Python.

Students will present their group projects to the class, offering an opportunity to develop presentation skills and receive feedback from peers and the instructor.