NOTE: Please register for this course on the UniPortal by 10th February
2025. If too many students have registered by then, places will be
allocated by draw, and you will be notified on 11th February whether you
can attend the course. LUMACSS students are prioritized, as this course is
mandatory for them.
CONTENT: Data analysis increasingly involves mining data from the Internet and using innovative tools to handle large datasets. With the rise of Large Language Models (LLMs) such as ChatGPT, data mining practices are undergoing a significant transformation. This course bridges traditional data mining techniques and the potential of LLMs, equipping students with essential skills to automate and enhance their research workflows. The course takes a self-learning approach in which students use LLMs to explore, learn, and apply data mining tools. Under the guidance of the instructor, students gain hands-on experience in collecting and handling web data, developing reproducible workflows, and critically evaluating LLM outputs. They acquire both technical and analytical skills in a collaborative learning environment.
The course is structured in three blocks:
1. An introductory block covers the essential knowledge for working with big data (basics of R programming, developing reproducible code, reporting in automated notebooks, version control with Git/GitHub; secondary datasets for social science research and MySQL).
2. A data access block focuses on web scraping and related tools (introduction to regular expressions, HTML, XML, and JSON data structures).
3. A third block introduces more advanced data access concepts, such as API interaction, and allows students to practice in live coding sessions in class.