Processing and analyzing data I-E2: Shell-based data processing fundamentals
Numbering Code | U-LAS30 20033 SE11 | Year/Term | 2022 ・ Second semester | |
---|---|---|---|---|
Number of Credits | 2 | Course Type | Seminar | |
Target Year | All students | Target Student | For all majors | |
Language | English | Day/Period | Fri.3 | |
Instructor name | VEALE,Richard Edmund (Graduate School of Medicine Assistant Professor) | |||
Outline and Purpose of the Course |
As the world and the sciences become increasingly computerized, it is ever more important to understand how to search, process, and analyse large bodies of digital data. This course is designed for students of all disciplines. Its purpose is to teach the basic concepts and methods for systematic processing of data encountered in any field. Lectures will focus on basic command line tools for automatic processing of data, including sorting, filtering, summarizing, searching, and related programming. |
|||
Course Goals |
At the end of the course, students should be able to operate a computer to automatically: (1) search for specific entries in large collections of data; (2) search for pattern-like entries in large collections of data; (3) filter desired content from large collections of data; (4) perform basic summary and counting statistics on data; (5) assemble small processing pipelines from the various tools they will study. |
|||
Schedule and Contents |
(1) What is a computer, what is an operating system? Removing Microsoft/Apple preconceptions. Using Command Line Interfaces (CLI) to interact with computers: Shell. Logging in to a remote machine (SSH, public/private keys, etc.).
(2) Using remote and local machines. Basic networking: TCP, FTP/HTTP, IP. Managing data: disk management, file systems, file system structure (tree), file permissions. Moving data between machines: SCP, RSYNC. Installing software: package managers (RPM, APT). Security: super user (su, sudo), users, groups. Diagnostic tools: PS, HTOP, DF, etc.
(3) Complex commands for string manipulation and search. Moving data between programs: standard in/out/error streams, piping, redirecting. String manipulation: regular expressions, wildcards, AWK, SED. Loops: for/while loops, loop conditions. Finding information: Stack Overflow, MAN pages.
(4) Shell scripts and programming languages. What is a "program"? Libraries, functions, paths, environment variables. Programming languages: interpreted versus compiled, lazy versus strict evaluation, data types. Python, R, Perl, Fortran, C/C++, Java.
(5) Data formats. Binary versus textual (CSV etc.). HDF5 (computer-independent representation). Statistics: summary statistics on data. Good/bad ways of thinking.
(6) Data representation/presentation. Simple plotting/graphing (MATLAB, matplotlib, R, ggplot, gnuplot). Why Excel is bad (limitations). Formats: PDF, vector versus raster.
(7) Representation of large data sets. (Relational) databases, SQL, "queries", subsets.
(8) Keeping track of your work (version control). Version control: CVS, SVN, Git, Mercurial. Remote versus local repositories. Backing up: version control is not a backup. Backup practices (tape, disks, etc.).
(9) Data processing theory. Best practices: concepts for reproducibility and reusability. Basic parallelization (GNU parallel).
(10) "Big Data" processing. Parallelizing: MapReduce, Hadoop, Spark, MPI. Big filesystems: HDFS, Lustre, NFS. Clusters, supercomputers. Scheduling computer time and resources (scheduler): TORQUE.
(11) Modeling, optimization, parameter search. Gradient descent methods, neural networks. Parameter estimation: Markov chain Monte Carlo, evolutionary algorithms. Random seeds: pseudorandom issues on large machines.
(12) Project
(13) Project
(14) Project (presentations)
(15-16) Feedback |
|||
Evaluation Methods and Policy | Class attendance and participation (10%), Quizzes (40%), Final Project/Report (50%) | |||
Course Requirements | No prior knowledge of computer programming or data processing is necessary | |||
Study outside of Class (preparation and review) | Students are strongly encouraged to practice with the class materials and with their own data outside of class to deepen their understanding. | |||
Textbooks | Textbooks/References |
No textbook is used; lecture materials will be provided in class and online via PANDA. Documentation for processing tools (e.g. man pages) will be introduced in class. |
||
References, etc. | Introduced during class |
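
To give a rough feel for the five course goals listed above, here is a minimal sketch in Python. It is not course material: the sample records, the function names, and the choice of Python (the lectures focus on command line tools) are all assumptions made for illustration. Each function loosely mirrors a command line idiom named in its comment.

```python
import re
from collections import Counter

# Hypothetical sample data: one "date level message" record per line.
LOG_LINES = [
    "2022-10-07 INFO job=17 started",
    "2022-10-07 ERROR job=17 disk full",
    "2022-10-08 INFO job=18 started",
    "2022-10-08 WARN job=18 slow read",
    "2022-10-09 ERROR job=19 timeout",
]


def search_exact(lines, text):
    """Goal 1: keep lines containing a fixed string (similar to grep -F)."""
    return [line for line in lines if text in line]


def search_pattern(lines, pattern):
    """Goal 2: keep lines matching a regular expression (similar to grep -E)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def extract_field(lines, index):
    """Goal 3: keep one whitespace-separated column (similar to awk '{print $N}').

    Note that index is 0-based here, whereas awk fields are 1-based.
    """
    return [line.split()[index] for line in lines]


def count_values(values):
    """Goal 4: count occurrences of each value (similar to sort | uniq -c)."""
    return Counter(values)


def error_counts_by_day(lines):
    """Goal 5: chain the steps above into a small pipeline."""
    errors = search_exact(lines, "ERROR")
    days = extract_field(errors, 0)
    return count_values(days)


if __name__ == "__main__":
    print(search_pattern(LOG_LINES, r"job=1[78]\b"))
    print(error_counts_by_day(LOG_LINES))
```

Each step takes a list of lines and returns a new list (or a count), so steps can be chained in any order, which is the same idea as piping the output of one shell command into the next.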