
This repository contains the course materials from Math 157: Intro to Mathematical Software.

Creative Commons BY-SA 4.0 license.

Kernel: SageMath 8.1

Math 157: Intro to Mathematical Software

UC San Diego, winter 2018

February 28, 2018: More on machine learning

Administrivia (to be updated):

  • CAPE evaluations are available! They close Monday, March 19 at 8am. Since this course is highly experimental, your feedback will be very helpful for shaping future offerings.

  • Attendance scores through Wednesday, February 21 are now posted on TritonEd. Contact Peter with any issues.

  • My office hours this week will be held Thursday, 3:30-4:30 (rather than 4-5).

  • Grades for HW5 will be available shortly.

  • The HW6 solution set is available.

  • HW 7 is now available. Some early comments:

  • Problem 1: The "Orange" dataset is not available via statsmodels. However, you can access it directly from R using the rpy2 module (a sketch of this approach appears after these announcements). Also, some students have had issues with the "FacetGrid" command crashing their kernel; if this occurs, simply state that and skip this part of the problem. (Added in class: try restarting your project before giving up.)

  • Problem 3a: The mpg dataset is in ggplot, not statsmodels: `from ggplot import mpg`

Added in class:

  • If you are unable to be on campus during week 10, please contact me immediately to set up a workaround for the final project.

  • It looks like everyone can have their first choice for the final project. However, please fill out the Google Form (see Monday's lecture or Part 2 of the final project for the link) by Friday.
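As mentioned in the HW 7 comments above, the "Orange" dataset can be pulled in from R via rpy2. Here is a hedged sketch of that approach, assuming rpy2 2.x (as installed on CoCalc at the time); the conversion helper has a different name in later rpy2 versions.

```python
# Hedged sketch: load R's built-in "Orange" data frame into pandas via rpy2.
# Assumes rpy2 2.x; in rpy2 3.x the conversion API is different.
from rpy2.robjects import r, pandas2ri

pandas2ri.activate()                    # enable automatic R <-> pandas conversion
orange = pandas2ri.ri2py(r['Orange'])   # Orange tree growth data as a pandas DataFrame
print(orange.head())
```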

Advance notice for week 9:

  • No sections on Tuesday, March 6. However, during this time, you are welcome to use APM 6402 as a study room; we will also try to monitor the chat room.

  • Thomas's office hours (usually Tuesday 11am-12pm) are moved to Friday 11:30am-12:50pm.

  • Peter's office hours (usually Wednesday 3-5pm) are moved to Wednesday 5-7pm.

  • There will be an extra virtual office hour Thursday 6-7pm.

Advance notice for week 10:

  • No lectures on Monday, March 12 or Wednesday, March 14. You may wish to use this time to meet your assigned group for Part 2 of the final project.

  • There will be a lecture on Friday, March 16, on the topic of "Where to go from here?" This lecture will not be counted for course attendance; that is, the last lecture for which attendance counts is Friday, March 9.

  • My office hours on Thursday, March 15 are cancelled. All other sections and office hours meet as scheduled.

xkcd(1838)

Machine Learning

Machine learning is basically everywhere these days. Here are a few common examples, together with some potential downsides caused by undesirable features of the input data (e.g., biased or insufficiently diverse training data). This [TED talk](https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms) summarizes the key issue; it mentions [this book](https://weaponsofmathdestructionbook.com/) which takes a deeper look at the phenomenon (disclosure: the author is married to my PhD advisor).

The rest of this lecture is not about ethical issues; however, I did want to highlight them before continuing. The proliferation of such issues makes it important for everyone to understand a bit about how machine learning works.

Modalities of machine learning

This cheat sheet can be used to navigate the maze of estimators available in scikit-learn. While the predictions made in real-world applications can be quite complex, at the most basic level machine-learning tasks tend to fall into one of the following two categories.

  • Predicting a quantity (i.e., one or more real numbers).

  • Predicting a category (i.e., an element of a prescribed finite set).

This dichotomy parallels the conventional (and imprecise) distinction between continuous mathematics (like calculus) and discrete mathematics (like combinatorics).
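As a quick illustration of this dichotomy (a toy sketch using two of scikit-learn's bundled datasets, not part of any assignment), compare a regressor, which predicts a quantity, with a classifier, which predicts a category:

```python
# Sketch: the same fit/predict pattern, but one model outputs a quantity
# and the other outputs a category.
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# Predicting a quantity: disease progression (a real number) from patient features.
X, y = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X, y)
print(reg.predict(X[:1]))   # a real-valued prediction

# Predicting a category: iris species (one of three labels) from flower measurements.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:1]))   # a class label (0, 1, or 2)
```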

Pause to go through the cheat sheet.

Demonstration on text data

Let's try an example that shows off a lot of different estimators at once; first, take a look here.

Since this is Python code, we could try it out here. Better yet, the example is also available as a Jupyter notebook, so we can download that notebook directly into CoCalc and run it there!

Try this now:

  • Copy the link to the Jupyter notebook at the bottom of the example (which I've also reproduced here).

  • Go to the Files view in your project (the folder icon at top left).

  • Hit the "Create..." button.

  • Paste the URL into the text box at the top of the page.

  • Hit "Download from Internet". You should end up viewing a new Jupyter notebook called document_classification_20newsgroups.ipynb with the header "Classification of text documents using sparse features".

  • Select "Run all" from the "Cell" menu.

  • Examine the results!

Question for you: Which estimator does the cheat sheet suggest for this case? Is this consistent with your experimental result?
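For reference, here is a stripped-down sketch of the kind of pipeline the downloaded notebook benchmarks: sparse tf-idf features feeding a linear classifier. This condenses the example (using only two newsgroups for speed) and is not a substitute for running the full notebook; the fetch call downloads the dataset on first use.

```python
# Condensed sketch of text classification with sparse features
# (the full example benchmarks many more classifiers).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

categories = ['sci.space', 'rec.autos']           # two groups keep this fast
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train.data)    # sparse matrix of tf-idf weights
X_test = vectorizer.transform(test.data)

clf = LinearSVC().fit(X_train, train.target)      # one of the estimators the notebook compares
print(accuracy_score(test.target, clf.predict(X_test)))
```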

Clustering

While I formulated machine learning above in terms of predicting a function, another modality is to identify clusters among a collection of objects. Let's try an example of this based on the US stock market: first take a look here, then download and run the notebook using this link.

Question for you: This notebook is attempting to classify stocks based on the extent to which their price movements are correlated. Can you identify what the various clusters have in common?
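To see the idea on a small scale without downloading market data, here is a toy sketch (synthetic series standing in for stock prices; the actual notebook estimates the correlation structure more carefully before clustering) of grouping time series by how correlated their movements are:

```python
# Toy sketch: cluster synthetic "price movement" series by correlation.
import numpy as np
from sklearn.cluster import affinity_propagation

rng = np.random.RandomState(0)
factor1, factor2 = rng.randn(2, 200)      # two hidden "market sectors"
series = np.vstack([factor1 + 0.1 * rng.randn(200) for _ in range(3)] +
                   [factor2 + 0.1 * rng.randn(200) for _ in range(3)])

corr = np.corrcoef(series)                # 6 x 6 correlation matrix
_, labels = affinity_propagation(corr)    # cluster using correlations as similarities
print(labels)                             # series driven by the same factor should share a label
```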

Here is another example of clustering, in the form of image segmentation in machine vision. Let's first take a look here, then download the notebook using this link.

Here the goal is to discretize, i.e., to take what is effectively a "continuous" (or at least high-resolution) input, and break it up into its key features. For some applications it is important to be able to do this very efficiently, e.g., autonomous vehicle navigation ("self-driving cars").
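As a simplified stand-in for the linked example (which builds a graph from the image and applies spectral clustering), one can already get a crude segmentation by clustering pixel colors with k-means on one of scikit-learn's sample images; this is only a sketch of the idea.

```python
# Crude segmentation sketch: group pixels by color with k-means.
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.cluster import KMeans

china = load_sample_image('china.jpg')                # 427 x 640 x 3 RGB sample image
pixels = china.reshape(-1, 3).astype(float)           # one row per pixel

kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(pixels[::20])                              # fit on a subsample for speed
segments = kmeans.predict(pixels).reshape(china.shape[:2])
print(np.bincount(segments.ravel()))                  # pixel counts in each of the 4 segments
```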

For this course...

... you will not be expected to do much implementation of machine learning "from scratch". Doing so would require more time to develop the relevant statistics background, and more intricate programming than we have done so far.

What I mostly want you to take away is the underlying taxonomy of ideas, that is, how to sort different machine-learning problems into related categories (you might call this "meta-clustering").