Data-Driven Media Research Methods – Artificial Intelligence – How to Analyze an Ocean of Text?

Delivery institution

ELTE/BTK
Department of Media and Communication

Instructor(s):

Márton Gosztonyi, PhD

Start date

17 February 2025

End date

17 May 2025

Study field

CHARM priority field

Study level

Study load, ECTS

3

Short description

The goal of the course is to provide students with comprehensive knowledge of fundamental techniques and methods in natural language processing (NLP) and machine learning. Throughout the semester, students will learn the Python programming language and its application to various text processing tasks, including web scraping, text preparation, tokenization, and lemmatization. The course covers vector models, probabilistic models, and different types of machine learning models, enabling students to summarize and classify texts and to apply various language models and text modeling techniques. In addition to theoretical knowledge, students will deepen their understanding through practical exercises that prepare them to apply advanced NLP solutions in real-world contexts.

Full description

Course Topics and Schedule (based on a 14-week semester):

1) Introduction to Artificial Intelligence and NLP
2) Python I: Basic operations, variables, file reading
3) Python II: If-else statements, strings, collections, lists, list functions, loops
4) Python III: Web Scraping, collections, tuples, dictionaries
5) Vector Models: Text preparation, tokenization, lemmatization
6) Vector Models: TF-IDF, Neural Word Embeddings
7) Probabilistic Models: Markov model, text classification
8) Probabilistic Models: Language models, text generation, poetry writing
9) Probabilistic Models: N-Gram-based word substitution
10) Machine Learning Models: Naive Bayes – How to determine if my model is adequate?
11) Machine Learning Models: Logistic Regression – Sentiment Analysis
12) Machine Learning Models: Text summarization
13) Machine Learning Models: Latent Dirichlet Allocation – Topic Modeling, Non-negative Matrix Factorization (NMF)
14) Machine Learning Models: Latent Semantic Analysis (Latent Semantic Indexing)

Learning outcomes

At the end of the course, the learner will be able to analyze large text datasets using natural language processing (NLP) and machine learning techniques.

At the end of the course, the learner will be able to apply Python programming skills to perform tasks such as web scraping, text preparation, tokenization, and lemmatization.

At the end of the course, the learner will be able to utilize vector models, probabilistic models, and machine learning models to summarize, classify, and model texts.

At the end of the course, the learner will be able to create language models for text generation and perform sentiment analysis using logistic regression.

At the end of the course, the learner will be able to implement machine learning techniques such as Naive Bayes and topic modeling with Latent Dirichlet Allocation (LDA) to explore patterns in textual data.

At the end of the course, the learner will be able to evaluate the effectiveness of NLP models by interpreting the results of machine learning algorithms.

At the end of the course, the learner will be able to collaborate in group projects to develop Python scripts, apply NLP methods, and present findings based on the analysis of real-world data.

Course requirements

No requirements.

Places available

20

Course literature (compulsory or recommended):

Required Reading:

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly Media.

Antić, Z. (2021). Python Natural Language Processing Cookbook. Packt Publishing Ltd.

Wittgenstein, L. (1998). Philosophical Investigations. Atlantis Budapest.

Chomsky, N. (1968). Linguistic Contributions to the Study of Mind: Future. Language and Thinking.

Gadamer, H.G. (2003). Truth and Method: Outline of a Philosophical Hermeneutics. Sapientia Humana Osiris.

Recommended Reading:

Shannon, C. (2001). A Mathematical Theory of Communication. ACM SIGMOBILE Mobile Computing and Communications Review.

Barrios, F., López, F., Argerich, L., & Wachenchauzer, R. (2016). Variations of the Similarity Function of TextRank for Automated Summarization.

Steinberger, J., & Jezek, K. (2004). Using Latent Semantic Analysis in Text Summarization and Summary Evaluation.

Metsis, V., Androutsopoulos, I., & Paliouras, G. (2006). Spam Filtering with Naive Bayes – Which Naive Bayes?

Aizawa, A. (2003). An Information-Theoretic Perspective of TF-IDF Measures.

Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Texts.

Gong, Y., & Liu, X. (2001). Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis.

Ramadhan, W.P., Novianty, S.A., & Setianingsih, S.C. (2017). Sentiment Analysis Using Multinomial Logistic Regression.

Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research.

Planned educational activities and teaching methods:

Interactive sessions that encourage discussion, analysis, and practical exploration of NLP techniques and machine learning models. Students will engage with real-world examples, case studies, and academic papers.

Group Work:

Collaborative projects where students will work in teams to apply machine learning methods to analyze text datasets. This includes creating Python scripts, performing data processing tasks, and presenting their findings.

Practical Lab Sessions:

Hands-on programming exercises conducted in a computer lab setting. Students will implement the techniques learned in the lectures, such as web scraping, tokenization, text classification, and summarization, using Python.

Students will present their group projects to the class, offering an opportunity to develop presentation skills and receive feedback from peers and the instructor.

Course code

BBN-MTU-465

Language

Assessment method

Final certification

Transcript of records

none
17 May 2025

Modality

Learning management System in use

Canvas

Contact hours per week for the student:

90 min

Specific regular weekly teaching day/time

10:00-11:30

Time zone