Processing and analyzing data I-E2: Shell-based data processing fundamentals

Numbering Code: U-LAS30 20033 SE11
Year/Term: 2022 ・ Second semester
Number of Credits: 2
Course Type: Seminar
Target Year: All students
Target Student: For all majors
Language: English
Day/Period: Fri.3
Instructor name: VEALE, Richard Edmund (Graduate School of Medicine, Assistant Professor)
Outline and Purpose of the Course: As the world and the sciences become increasingly computerized, it is ever more important to understand how to search, process, and analyze large bodies of digital data. This course is designed for students of all disciplines. Its purpose is to teach the basic concepts and methods for systematic processing of data encountered in any field.
Lectures will focus on basic command-line tools for automatic processing of data, including sorting, filtering, summarizing, searching, and related programming.
Course Goals: At the end of the course, students should be able to operate a computer to automatically:
(1) search for specific entries in large collections of data
(2) search for pattern-like entries in large collections of data
(3) filter desired content from large collections of data
(4) perform basic summary and counting statistics on data
(5) assemble small processing pipelines from the various tools they will study
Schedule and Contents:
(1) What is a computer, what is an operating system?
Remove Microsoft/Apple preconceptions.
Using Command Line Interfaces (CLI) to interact with computers: Shell.
Logging in to a remote machine (SSH, public/private keys, etc.)

(2) Using remote and local machines.
Basic Networking: TCP, FTP/HTTP, IP.
Managing data: Disk management, file systems, file system structure (tree), file permissions.
Moving data between machines: scp, rsync.
Installing software: package managers (RPM, APT).
Security: Super User (su, sudo), users, groups.
Diagnostic tools: ps, htop, df, etc.

(3) Complex commands for string manipulation and search.
Moving data between programs: standard in/out/error streams, piping, redirecting.
String manipulation: regular expressions, wildcards, awk, sed.
Loops: for/while loops, loop conditions.
Finding information: Stack Overflow, man pages.

(4) Shell Scripts and programming languages.
What is a "program"? Libraries, functions, paths, environment variables.
Programming languages: interpreted versus compiled, lazy versus strict evaluation, data types. Python, R, Perl, Fortran, C/C++, Java.

(5) Data Formats
Binary versus textual (CSV etc.). HDF5 (machine-independent representation).
Statistics: Summary statistics on data. Good/bad ways of thinking.

(6) Data representation/presentation
Simple plotting/graphing (MATLAB, matplotlib, R, ggplot, gnuplot).
Why Excel is bad (limitations).
Formats: PDF, vector versus raster.

(7) Representation of large data sets.
(Relational) Databases, SQL, "queries", subsets.

(8) Keeping track of your work (Version Control).
Version Control: CVS, SVN, Git, Mercurial. Remote versus local repositories.
Backing up: Version Control is not back-up. Backing up practices (tape, disks, etc.).

(9) Data processing THEORY
Best practices: concepts of reproducibility and reusability.
Basic parallelization (GNU parallel).

(10) "Big Data" processing.
Parallelizing: MapReduce, Hadoop, Spark, MPI.
Big filesystems: HDFS, Lustre, NFS.
Clusters, Supercomputers.
Scheduling computer time and resources (scheduler): TORQUE.

(11) Modeling, optimization, parameter search
Gradient descent methods, neural networks.
Parameter estimation: Markov chain Monte Carlo, evolutionary algorithms.
Random seeds: pseudorandom number issues on large machines.

(12) Project

(13) Project

(14) Project (presentations)

(15-16) Feedback
Evaluation Methods and Policy: Class attendance and participation (10%), Quizzes (40%), Final Project/Report (50%)
Course Requirements: No prior knowledge of computer programming or data processing is necessary.
Study outside of Class (preparation and review): Students are strongly encouraged to practice with the class materials and with their own data outside of class to deepen their understanding.
Textbooks/References: No textbook is used; lecture materials will be provided in class and online via PANDA.
Documentation about processing tools (e.g. manpages) will be introduced in class.
References, etc.: Introduced during class.