Project: Ruida He - CITS2402 Introduction to Data Science

Path: Labs / DiversityInStudy / DiversityInStudy.ipynb

Kernel: Python 3 (system-wide)

Lab 1

Diversity in Study: What (Else) Are We All Studying?

Data Science Across the Disciplines

Data Science has application right across the disciplines. In this course we'll see case studies ranging from the Environment to Economics to Health.

Before we start this journey, it's interesting to see what diverse areas the group is studying. As we said in the Welcome lecture, there are many students in the unit with many different study backgrounds and focusses.

To get a feel for this, we'll look at the range of enrolment patterns for this year.

Note that some randomisation has been applied to the enrolment patterns to ensure privacy and anonymity. However, the disciplines are preserved.

Reference links

In this lab we will practise reading in data from a file, cleaning the data, and observing features of the data.

We'll be using standard Python. The latest documentation will be useful, and it is suggested you bookmark it for future use.

The Python Language Reference, https://docs.python.org/3/reference/
The Python Standard Library, https://docs.python.org/3/library/
The Python Tutorial, https://docs.python.org/3/tutorial/

Data Acquisition

In future labs we'll be hunting down data as part of the work. In this case, however, the data is provided in a text file.

Open the directory from the folder icon. Click on the file unit-patterns.txt to inspect the file.

The file contains all the different study plans of enrolled students (at the time of writing) with some obfuscation for privacy and anonymity.

This is a tab-separated values (tsv) file with two fields.

The first field contains a list of units studied (separated by plus signs), the second (after the tab character) contains a number and a percentage.

Note that in this lab we will refer to an instance of a unit name combined with a teaching period as a 'unit' - so for example, CITS1401-1 and CITS1401-2 would be treated as two distinct units.
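As a rough sketch of the two-field layout described above, the snippet below takes one sample line apart. The helper name parse_line is hypothetical, the sample line is shortened from the file's first line, and reading the number before the parenthesis as a count is an assumption about the file rather than something the lab states.

```python
def parse_line(line):
    """Split one raw line into (units, count, percent_text).

    Assumes the layout 'UNIT + UNIT + ... <tab> N (P%)'.
    """
    # Field 1 is left of the tab, field 2 is right of it.
    units_field, _, stats_field = line.partition("\t")
    units = [unit.strip() for unit in units_field.split("+")]
    # Field 2 looks like '1 (2.0%)': a number, a space, then a percentage.
    count_text, _, percent_text = stats_field.strip().partition(" ")
    return units, int(count_text), percent_text.strip("()")


sample = " CITS2224-1 + CITS2023-2 + CITS3221-2 \t 1 (2.0%) \n"
units, count, percent = parse_line(sample)
print(units, count, percent)
```

The rest of the lab only needs the first field, so the second field can safely be discarded once the line has been split at the tab.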

Let's start with a constant to identify the file. By convention we'll use uppercase for constants. Run the following cell to make the assignment.

In [69]:

DATAFILE = "unit-patterns.txt"

As we saw in the 'Getting Started' lab, we can check to see that the assignment was made just by typing the variable name as the last thing in the cell and running the cell. Try this now.

In [70]:

DATAFILE

Reading the file

Use the built-in function open() and the text file IO method readline() to read in and print out the first 5 lines of the file.

Your output should start like this:

 CITS2224-1 + CITS2023-2 + CITS3221-2 + CITS3224-1 + CITS3225-1 	 1 (2.0%) 

 CITS2023-2 + ENSC3221-2 + ENSC3224-1 + ENSC3225-1 + ENSC3227-2 + ENSC3210-1 + ENSC3221-1 + MECH3023-2 	 1 (2.0%) 

 CITS1021-2 + CITS2021-1 + CITS2023-2 + ENSC2224-1 + ENSC3211-2 + ENSC3213-2 + MATH2221-1 + STAT1524-1 	 1 (2.0%) 
...

Tip: If you're a little rusty on reading files, you may find this tutorial useful: 7.2. Reading and Writing Files.

Also the Case Study video "CITS2402 FakeNews 2" provides a couple of examples of working with a tsv file.

In [71]:

# Read and print the first 5 lines; the with block closes the file automatically.
with open(DATAFILE, "r") as f:
    for i in range(5):
        print(f.readline())

 CITS2224-1 + CITS2023-2 + CITS3221-2 + CITS3224-1 + CITS3225-1 	 1 (2.0%) 

 CITS2023-2 + ENSC3221-2 + ENSC3224-1 + ENSC3225-1 + ENSC3227-2 + ENSC3210-1 + ENSC3221-1 + MECH3023-2 	 1 (2.0%) 

 CITS1021-2 + CITS2021-1 + CITS2023-2 + ENSC2224-1 + ENSC3211-2 + ENSC3213-2 + MATH2221-1 + STAT1524-1 	 1 (2.0%) 

 CITS2023-2 + CITS3220-2 + CITS3223-2 + CITS3021-1 + CITS3025-1 + MATH3224-1 + MATH3221-1 + STAT3264-1 + STAT3260-2 	 1 (2.0%) 

 BIOC3221-1 + BIOC3224-1 + BIOC3225-2 + BIOC3225-2 + CITS1023-2 + CITS2021-1 + CITS2023-2 + MATH1211-1 	 1 (2.0%)

Read the file again, but this time, as you read the lines from the file, save them in a list called pattern_strings.

Printing the first 5 lines of your list should give:

[' CITS2224-1 + CITS2023-2 + CITS3221-2 + CITS3224-1 + CITS3225-1 \t 1 (2.0%) \n', ' CITS2023-2 + ENSC3221-2 + ENSC3224-1 + ENSC3225-1 + ENSC3227-2 + ENSC3210-1 + ENSC3221-1 + MECH3023-2 \t 1 (2.0%) \n', ' CITS1021-2 + CITS2021-1 + CITS2023-2 + ENSC2224-1 + ENSC3211-2 + ENSC3213-2 + MATH2221-1 + STAT1524-1 \t 1 (2.0%) \n', ' CITS2023-2 + CITS3220-2 + CITS3223-2 + CITS3021-1 + CITS3025-1 + MATH3224-1 + MATH3221-1 + STAT3264-1 + STAT3260-2 \t 1 (2.0%) \n', ' BIOC3221-1 + BIOC3224-1 + BIOC3225-2 + BIOC3225-2 + CITS1023-2 + CITS2021-1 + CITS2023-2 + MATH1211-1 \t 1 (2.0%) \n']

Tip: Rather than using readline(), use an iterator on the file directly to get each line in turn.

In [72]:

pattern_strings = []

# Iterating over the file object yields one line at a time.
with open(DATAFILE, "r") as f:
    for line in f:
        pattern_strings.append(line)

print(pattern_strings[:5])

[' CITS2224-1 + CITS2023-2 + CITS3221-2 + CITS3224-1 + CITS3225-1 \t 1 (2.0%) \n', ' CITS2023-2 + ENSC3221-2 + ENSC3224-1 + ENSC3225-1 + ENSC3227-2 + ENSC3210-1 + ENSC3221-1 + MECH3023-2 \t 1 (2.0%) \n', ' CITS1021-2 + CITS2021-1 + CITS2023-2 + ENSC2224-1 + ENSC3211-2 + ENSC3213-2 + MATH2221-1 + STAT1524-1 \t 1 (2.0%) \n', ' CITS2023-2 + CITS3220-2 + CITS3223-2 + CITS3021-1 + CITS3025-1 + MATH3224-1 + MATH3221-1 + STAT3264-1 + STAT3260-2 \t 1 (2.0%) \n', ' BIOC3221-1 + BIOC3224-1 + BIOC3225-2 + BIOC3225-2 + CITS1023-2 + CITS2021-1 + CITS2023-2 + MATH1211-1 \t 1 (2.0%) \n']
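The iterator tip can be tried without the data file. This is a minimal sketch that uses io.StringIO as a stand-in for the open file (an assumption made so the example runs anywhere); the two sample lines are shortened versions of the lines printed above.

```python
import io

# Stand-in for the open file; a real file object iterates the same way.
fake_file = io.StringIO(
    " CITS2224-1 + CITS2023-2 \t 1 (2.0%) \n"
    " CITS2023-2 + ENSC3221-2 \t 1 (2.0%) \n"
)

# The with block closes the object on exit, just as it would a real file.
with fake_file as f:
    lines = [line for line in f]

print(len(lines))
print(repr(lines[0]))
```

Each line keeps its trailing newline, which is why the list printed above shows \n at the end of every string.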

Data Cleaning

We are only interested in the patterns to the left of the tab.

Using the string partition() method, modify your code so that pattern_strings contains only the unit patterns.

The documentation for String Methods may be helpful.

Printing the first 5 lines of your list (print(pattern_strings[:5])) should now give:

[' CITS2224-1 + CITS2023-2 + CITS3221-2 + CITS3224-1 + CITS3225-1 ', ' CITS2023-2 + ENSC3221-2 + ENSC3224-1 + ENSC3225-1 + ENSC3227-2 + ENSC3210-1 + ENSC3221-1 + MECH3023-2 ', ' CITS1021-2 + CITS2021-1 + CITS2023-2 + ENSC2224-1 + ENSC3211-2 + ENSC3213-2 + MATH2221-1 + STAT1524-1 ', ' CITS2023-2 + CITS3220-2 + CITS3223-2 + CITS3021-1 + CITS3025-1 + MATH3224-1 + MATH3221-1 + STAT3264-1 + STAT3260-2 ', ' BIOC3221-1 + BIOC3224-1 + BIOC3225-2 + BIOC3225-2 + CITS1023-2 + CITS2021-1 + CITS2023-2 + MATH1211-1 ']

How many students were enrolled at the time the list was taken?

In [73]:

# Keep only the text to the left of the first tab on each line.
pattern_strings = [line.partition("\t")[0] for line in pattern_strings]

print(pattern_strings[:5])

[' CITS2224-1 + CITS2023-2 + CITS3221-2 + CITS3224-1 + CITS3225-1 ', ' CITS2023-2 + ENSC3221-2 + ENSC3224-1 + ENSC3225-1 + ENSC3227-2 + ENSC3210-1 + ENSC3221-1 + MECH3023-2 ', ' CITS1021-2 + CITS2021-1 + CITS2023-2 + ENSC2224-1 + ENSC3211-2 + ENSC3213-2 + MATH2221-1 + STAT1524-1 ', ' CITS2023-2 + CITS3220-2 + CITS3223-2 + CITS3021-1 + CITS3025-1 + MATH3224-1 + MATH3221-1 + STAT3264-1 + STAT3260-2 ', ' BIOC3221-1 + BIOC3224-1 + BIOC3225-2 + BIOC3225-2 + CITS1023-2 + CITS2021-1 + CITS2023-2 + MATH1211-1 ']
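To see why indexing the result of partition() with [0] gives the unit pattern, here is a small sketch of what the method returns. The sample line is a shortened version of the first line of the file.

```python
line = " CITS2224-1 + CITS2023-2 \t 1 (2.0%) \n"

# partition() always returns a 3-tuple: (before, separator, after).
before, sep, after = line.partition("\t")
print(repr(before))  # text left of the first tab
print(repr(sep))     # the separator itself
print(repr(after))   # everything to the right

# With no tab present, the whole string comes back as the first item.
no_tab = "CITS2224-1 + CITS2023-2".partition("\t")
print(no_tab)
```

Because the first item is always a string, no extra conversion is needed before storing it back in the list.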

Rather than the units in each line being part of a string, it is more useful to have them in a list. This will enable us to iterate through the list later.

Use the split method to split each line into a list of units. Store the result in a variable pattern_lists.

This should result in a list of lists. Your output should start like this:

[[' CITS2224-1 ', ' CITS2023-2 ', ' CITS3221-2 ', ' CITS3224-1 ', ' CITS3225-1 '], [' CITS2023-2 ', ' ENSC3221-2 ', ' ENSC3224-1 ', ' ENSC3225-1 ', ' ENSC3227-2 ', ' ENSC3210-1 ', ' ENSC3221-1 ', ' MECH3023-2 '], [' CITS1021-2 ', ' CITS2021-1 ', ' CITS2023-2 ', ' ENSC2224-1 ', ' ENSC3211-2 ', ' ENSC3213-2 ', ' MATH2221-1 ', ' STAT1524-1 '], [' CITS2023-2 ', ' CITS3220-2 ', ' CITS3223-2 ', ' CITS3021-1 ', ' CITS3025-1 ', ' MATH3224-1 ', ' MATH3221-1 ', ' STAT3264-1 ', ' STAT3260-2 '], [' BIOC3221-1 ', ' BIOC3224-1 ', ' BIOC3225-2 ', ' BIOC3225-2 ', ' CITS1023-2 ', ' CITS2021-1 ', ' CITS2023-2 ', ' MATH1211-1 ']]

In [74]:

pattern_lists = []

# Split each pattern string on the plus signs.
for pattern in pattern_strings:
    pattern_lists.append(pattern.split("+"))

print(pattern_lists[:5])

[[' CITS2224-1 ', ' CITS2023-2 ', ' CITS3221-2 ', ' CITS3224-1 ', ' CITS3225-1 '], [' CITS2023-2 ', ' ENSC3221-2 ', ' ENSC3224-1 ', ' ENSC3225-1 ', ' ENSC3227-2 ', ' ENSC3210-1 ', ' ENSC3221-1 ', ' MECH3023-2 '], [' CITS1021-2 ', ' CITS2021-1 ', ' CITS2023-2 ', ' ENSC2224-1 ', ' ENSC3211-2 ', ' ENSC3213-2 ', ' MATH2221-1 ', ' STAT1524-1 '], [' CITS2023-2 ', ' CITS3220-2 ', ' CITS3223-2 ', ' CITS3021-1 ', ' CITS3025-1 ', ' MATH3224-1 ', ' MATH3221-1 ', ' STAT3264-1 ', ' STAT3260-2 '], [' BIOC3221-1 ', ' BIOC3224-1 ', ' BIOC3225-2 ', ' BIOC3225-2 ', ' CITS1023-2 ', ' CITS2021-1 ', ' CITS2023-2 ', ' MATH1211-1 ']]

Finally, notice that we've been left with unnecessary whitespace around each unit name.

Modify your code to strip off the extra whitespace.

Your first 5 patterns should now look like this:

[['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1'], ['CITS2023-2', 'ENSC3221-2', 'ENSC3224-1', 'ENSC3225-1', 'ENSC3227-2', 'ENSC3210-1', 'ENSC3221-1', 'MECH3023-2'], ['CITS1021-2', 'CITS2021-1', 'CITS2023-2', 'ENSC2224-1', 'ENSC3211-2', 'ENSC3213-2', 'MATH2221-1', 'STAT1524-1'], ['CITS2023-2', 'CITS3220-2', 'CITS3223-2', 'CITS3021-1', 'CITS3025-1', 'MATH3224-1', 'MATH3221-1', 'STAT3264-1', 'STAT3260-2'], ['BIOC3221-1', 'BIOC3224-1', 'BIOC3225-2', 'BIOC3225-2', 'CITS1023-2', 'CITS2021-1', 'CITS2023-2', 'MATH1211-1']]

In [59]:

for i in range(len(pattern_lists)):
    for j in range(len(pattern_lists[i])):
        pattern_lists[i][j] = pattern_lists[i][j].strip()

print(pattern_lists[:5])

[['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1'], ['CITS2023-2', 'ENSC3221-2', 'ENSC3224-1', 'ENSC3225-1', 'ENSC3227-2', 'ENSC3210-1', 'ENSC3221-1', 'MECH3023-2'], ['CITS1021-2', 'CITS2021-1', 'CITS2023-2', 'ENSC2224-1', 'ENSC3211-2', 'ENSC3213-2', 'MATH2221-1', 'STAT1524-1'], ['CITS2023-2', 'CITS3220-2', 'CITS3223-2', 'CITS3021-1', 'CITS3025-1', 'MATH3224-1', 'MATH3221-1', 'STAT3264-1', 'STAT3260-2'], ['BIOC3221-1', 'BIOC3224-1', 'BIOC3225-2', 'BIOC3225-2', 'CITS1023-2', 'CITS2021-1', 'CITS2023-2', 'MATH1211-1']]
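The split and strip steps can also be combined. The following is one possible sketch using a nested list comprehension; the names pattern_strings_demo and pattern_lists_demo are made up for illustration, and the sample strings are shortened from the data above.

```python
pattern_strings_demo = [
    " CITS2224-1 + CITS2023-2 + CITS3221-2 ",
    " CITS2023-2 + ENSC3221-2 ",
]

# Outer loop: one list per pattern. Inner loop: one stripped unit per piece.
pattern_lists_demo = [
    [unit.strip() for unit in pattern.split("+")]
    for pattern in pattern_strings_demo
]
print(pattern_lists_demo)
```

This builds new lists rather than modifying existing ones by index, which makes the intermediate whitespace-padded list unnecessary.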

Putting it all together

Congratulations, you have now 'wrangled' your data from its 'raw' format in the file, into a very useable form.

To consolidate this section, turn your code into a function get_patterns(filename) that takes a filename, and returns a list of lists containing all the patterns in the file.

print(get_patterns(DATAFILE)[:5]) should now print the same output as above.

In [60]:

def get_patterns(filename):
    """Return a list of unit lists, one per line of the tab-separated file."""
    # Read every line; the with block closes the file automatically.
    with open(filename, "r") as f:
        pattern_strings = list(f)

    # Keep only the unit pattern to the left of the tab.
    pattern_strings = [line.partition("\t")[0] for line in pattern_strings]

    # Split each pattern on "+" and strip whitespace from every unit.
    pattern_lists = []
    for pattern in pattern_strings:
        pattern_lists.append([unit.strip() for unit in pattern.split("+")])

    return pattern_lists

DATAFILE = "unit-patterns.txt"
print(get_patterns(DATAFILE)[:5])

[['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1'], ['CITS2023-2', 'ENSC3221-2', 'ENSC3224-1', 'ENSC3225-1', 'ENSC3227-2', 'ENSC3210-1', 'ENSC3221-1', 'MECH3023-2'], ['CITS1021-2', 'CITS2021-1', 'CITS2023-2', 'ENSC2224-1', 'ENSC3211-2', 'ENSC3213-2', 'MATH2221-1', 'STAT1524-1'], ['CITS2023-2', 'CITS3220-2', 'CITS3223-2', 'CITS3021-1', 'CITS3025-1', 'MATH3224-1', 'MATH3221-1', 'STAT3264-1', 'STAT3260-2'], ['BIOC3221-1', 'BIOC3224-1', 'BIOC3225-2', 'BIOC3225-2', 'CITS1023-2', 'CITS2021-1', 'CITS2023-2', 'MATH1211-1']]

Checked Questions and Answers

Many of the labs this semester will have one or two checked (or 'autograded') answers. These will contribute to the (35%) practical marks for the semester.

The checked answers will generally have two parts, or two cells.

Answer cells

The first cell is where you enter your code to be checked. This may contain a "stub" - that is, a part of the code to be completed. For example, it may be a function definition.

Test cells

The second cell is where the code to check the answer will go and is not editable. It may be blank, but will usually contain some example tests that you can use to test your own code before the answers are collected. There will be additional tests applied that are not shown.

Collection and deadlines

The notebook will be collected and marked, by running the tests, after the lab deadline. (As the same notebook will be collected, you may wish to take a backup copy of the notebook before starting work. Alternatively you can use the TimeTravel feature to revert the notebook if you need to start again.)

Everyone will automatically receive a one week "extension" from the week of the lab to finish off the work and to account for any unforeseen circumstances.

The deadline for credit will therefore be 11:59pm on the Friday of the week after the lab is released.

Checked Solution [1 mark]

Copy your solution for get_patterns(filename) to complete the following stub.

(We wouldn't normally enter the function a second time, it is only because we are introducing the graded answer for the first time.)

In [61]:

def get_patterns(filename):
    """Return a list of unit lists, one per line of the tab-separated file."""
    # Read every line; the with block closes the file automatically.
    with open(filename, "r") as f:
        pattern_strings = list(f)

    # Keep only the unit pattern to the left of the tab.
    pattern_strings = [line.partition("\t")[0] for line in pattern_strings]

    # Split each pattern on "+" and strip whitespace from every unit.
    pattern_lists = []
    for pattern in pattern_strings:
        pattern_lists.append([unit.strip() for unit in pattern.split("+")])

    return pattern_lists

# YOUR CODE HERE
# raise NotImplementedError()

Before a final test of your code, it is worth restarting the kernel from the "Kernel" drop-down menu, and re-running your code. This will clean up any leftover 'junk' in the memory from running earlier versions of bits of code.

You may find, for example, that after you fixed or changed a piece of code, you forgot to change a variable name, but the code still worked because an old value for the variable was still in memory.

Restart the kernel and test your code again to make sure it is doing everything correctly.

Running the grading tests

The test cells will often include some 'non-hidden' tests that you can run to see if your code passes. There will also be some 'hidden' tests that are only run once the code is collected after the deadline.

If your code produces the correct answers it will simply run successfully.

If your code does not pass the non-hidden tests, you will normally get an error message. For example, if I only stripped the whitespace from the left side and not both sides of the unit names, I might get an error message like this:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-82-f1838aea55b1> in <module>
      1 from nose.tools import assert_equal
      2 DATA = "unit-patterns.txt"
----> 3 assert_equal(get_patterns(DATA)[0],['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1'])
      4 assert_equal(get_patterns(DATA)[5],['CITS2021-1', 'CITS2023-2', 'GRMN2025-1', 'GRMN2020-2', 'MATH1214-1', 'MATH2521-2', 'PHYS2221-1', 'PHYS2223-2'])
...
668     def fail(self, msg=None):
    669         """Fail immediately, with the given message."""
--> 670         raise self.failureException(msg)
    671 
    672     def assertFalse(self, expr, msg=None):
AssertionError: Lists differ: ['CITS2224-1 ', 'CITS2023-2 ', 'CITS3221-2 ', 'CITS3224-1 ', 'CITS3225-1 '] != ['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1']

First differing element 0:
'CITS2224-1 '
'CITS2224-1'

- ['CITS2224-1 ', 'CITS2023-2 ', 'CITS3221-2 ', 'CITS3224-1 ', 'CITS3225-1 ']
?             -              -              -              -              -

+ ['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1']

The error messages may vary in their usefulness - they are not intended as a debugging tool, but often they will give some pointer as to what went wrong. In this case it is fairly clear that there is whitespace that shouldn't be there.

You can then correct your code and try again.
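The mistake behind the error message above can be reproduced in a few lines. This sketch compares lstrip() with strip() on one padded unit name; a plain assert stands in for assert_equal here.

```python
raw_unit = " CITS2224-1 "

left_only = raw_unit.lstrip()   # trailing space survives
both_sides = raw_unit.strip()   # leading and trailing whitespace removed

# repr() makes the leftover space visible.
print(repr(left_only))
print(repr(both_sides))

# Passes silently; the same check on left_only would raise AssertionError.
assert both_sides == "CITS2224-1"
```

Printing with repr() is a quick way to spot stray whitespace that a normal print would hide.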

The following test is worth 1 prac mark. Give it a try. You can run the tests as many times as you wish before the deadline.

In [62]:

from nose.tools import assert_equal
DATA = "unit-patterns.txt"
assert_equal(get_patterns(DATA)[0],['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1'])
assert_equal(get_patterns(DATA)[5],['CITS2021-1', 'CITS2023-2', 'GRMN2025-1', 'GRMN2020-2', 'MATH1214-1', 'MATH2521-2', 'PHYS2221-1', 'PHYS2223-2'])

Data Inspection and Interpretation

Now let's have a closer look at the data. The file contains all the different study plans of enrolled students. Let's start by working out how many different or unique plans there are.

To make it easier to see what is going on, start by defining a better print function.

Write a function print_patterns (patterns, numlines) that prints the first numlines patterns each on a new line.

For example:

print_patterns(get_patterns(DATA), 8)

['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1']
['CITS2023-2', 'ENSC3221-2', 'ENSC3224-1', 'ENSC3225-1', 'ENSC3227-2', 'ENSC3210-1', 'ENSC3221-1', 'MECH3023-2']
['CITS1021-2', 'CITS2021-1', 'CITS2023-2', 'ENSC2224-1', 'ENSC3211-2', 'ENSC3213-2', 'MATH2221-1', 'STAT1524-1']
['CITS2023-2', 'CITS3220-2', 'CITS3223-2', 'CITS3021-1', 'CITS3025-1', 'MATH3224-1', 'MATH3221-1', 'STAT3264-1', 'STAT3260-2']
['BIOC3221-1', 'BIOC3224-1', 'BIOC3225-2', 'BIOC3225-2', 'CITS1023-2', 'CITS2021-1', 'CITS2023-2', 'MATH1211-1']
['CITS2021-1', 'CITS2023-2', 'GRMN2025-1', 'GRMN2020-2', 'MATH1214-1', 'MATH2521-2', 'PHYS2221-1', 'PHYS2223-2']
['CITS1221-1', 'CITS1021-1', 'CITS1023-2', 'CITS2023-2', 'MGMT1135-1', 'MGMT1136-2', 'STAT1024-1', 'STAT2023-2']
['CITS1221-2', 'CITS2023-2', 'CITS3223-2', 'CITS3021-1', 'CITS3025-1', 'CLAN3224-1', 'STAT2021-1', 'STAT3260-2']

In [63]:

def print_patterns (patterns, numlines):
    for i in range(numlines):
        print(patterns[i], end = "\n")

# print_patterns(get_patterns(DATA),8)
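Indexing with patterns[i] raises an IndexError when numlines is larger than the number of patterns. A possible variant, sketched here under the hypothetical name print_patterns_safe, uses a slice so that it prints only what exists.

```python
def print_patterns_safe(patterns, numlines):
    """Print up to numlines patterns, one per line."""
    # A slice past the end of a list is simply shorter, never an error.
    for pattern in patterns[:numlines]:
        print(pattern)


demo = [["CITS2224-1", "CITS2023-2"], ["CITS2023-2", "ENSC3221-2"]]
print_patterns_safe(demo, 5)   # asks for 5, prints the 2 that exist
```

Whether silently printing fewer lines is preferable to an error is a design choice; for quick inspection in a notebook the forgiving version is usually more convenient.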

Write a function sort_patterns (patterns) that takes a list of patterns and returns a new list in which:
- the units in each list (pattern) are sorted alpha-numerically
- the lists (patterns) themselves are sorted alpha-numerically

The following test code:

print_patterns(get_patterns(DATA),3)
print()
print_patterns(sort_patterns(get_patterns(DATA)),3)

should produce the following output:

['CITS2224-1', 'CITS2023-2', 'CITS3221-2', 'CITS3224-1', 'CITS3225-1']
['CITS2023-2', 'ENSC3221-2', 'ENSC3224-1', 'ENSC3225-1', 'ENSC3227-2', 'ENSC3210-1', 'ENSC3221-1', 'MECH3023-2']
['CITS1021-2', 'CITS2021-1', 'CITS2023-2', 'ENSC2224-1', 'ENSC3211-2', 'ENSC3213-2', 'MATH2221-1', 'STAT1524-1']

['AACE1223-2AA', 'ACCT1121-1', 'CARS1223-2AA', 'CITS1021-1', 'CITS2023-2', 'ECON1121-1', 'FINA1221-2', 'INDG1223-2AA', 'MGMT1135-2', 'MKTG1225-2', 'STAT1524-1']
['AACE1223-2AA', 'CARS1223-2AA', 'CITS1021-1', 'CITS1023-2', 'CITS2023-2', 'INDG1223-2AA', 'LAWS1113-2', 'PSYC1121-1', 'PSYC1123-2', 'SCIE1121-1', 'STAT1024-1']
['AACE1223-2AA', 'CARS1223-2AA', 'CITS1021-2', 'CITS1023-2', 'COMM1223-2', 'INDG1223-2AA', 'STAT1023-2']

Caution: If you are using Python's built-in sorting tools, it is highly recommended that you use the function sorted() to return a new sorted list, rather than the list method sort() to sort the list in-place. Methods that alter a data structure 'in-place' can lead to confusing behaviour when debugging.
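The difference described in the caution can be seen in a short sketch. The unit codes are taken from the data above; the variable names are for illustration only.

```python
units = ["STAT1524-1", "CITS2023-2", "ACCT1121-1"]

new_list = sorted(units)      # returns a new list; units is unchanged
print(units)
print(new_list)

result = units.sort()         # sorts in place and returns None
print(units)
print(result)
```

A common slip is writing units = units.sort(), which replaces the list with None.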

In [64]:

def sort_patterns(patterns):
    # sorted() builds new lists, so the caller's patterns are left unchanged.
    sorted_units = [sorted(pattern) for pattern in patterns]
    return sorted(sorted_units)

        
# print_patterns(get_patterns(DATA),3)
# print()
# print_patterns(sort_patterns(get_patterns(DATA)),3)

Write a function remove_duplicates (patterns) that returns a (sorted) list of patterns with any duplicate patterns removed.

How many unique study patterns are there in the class?

In [65]:

def remove_duplicates(patterns):
    # Sort first so the result is a sorted list of sorted patterns.
    unique_patterns = []
    for pattern in sort_patterns(patterns):
        if pattern not in unique_patterns:
            unique_patterns.append(pattern)
    return unique_patterns

# all_patterns = get_patterns(DATA)
# unique_patterns = remove_duplicates(all_patterns)
# print(len(unique_patterns))
# print("The percentage is {:.1%}".format(len(unique_patterns) / len(all_patterns)))

Print the percentage (to one decimal place) of the total number of study plans in the class that are different.

You should find that we are indeed a diverse group!
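One alternative way to remove duplicates and report the percentage is sketched below. It relies on a set, which is a swapped-in technique rather than anything the lab prescribes; the function name remove_duplicates_via_set and the three demo patterns are hypothetical.

```python
def remove_duplicates_via_set(patterns):
    """Return a sorted list of sorted patterns with duplicates removed."""
    # Lists are unhashable, so each sorted pattern becomes a tuple first.
    unique = {tuple(sorted(pattern)) for pattern in patterns}
    return sorted(list(pattern) for pattern in unique)


demo = [
    ["CITS2023-2", "CITS1021-1"],
    ["CITS1021-1", "CITS2023-2"],   # same units, different order
    ["STAT1024-1", "CITS2023-2"],
]
unique_demo = remove_duplicates_via_set(demo)
print(unique_demo)

# The .1% format multiplies by 100 and keeps one decimal place.
print("{:.1%}".format(len(unique_demo) / len(demo)))
```

Set membership checks avoid rescanning a growing list for every pattern, which matters little at this data size but scales better.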

Checked Solution [1 mark]

Write a function get_unique_patterns (patterns) that returns a pair (unique_patterns, num_patterns) where:
- unique_patterns contains a sorted list of sorted patterns, with any duplicate patterns removed (as above)
- num_patterns contains the number of unique patterns

In [66]:

def get_unique_patterns (patterns):
    # Sort the units within each pattern, then sort the patterns themselves.
    sorted_patterns = sorted(sorted(pattern) for pattern in patterns)

    # Keep the first occurrence of each pattern.
    unique_patterns = []
    for pattern in sorted_patterns:
        if pattern not in unique_patterns:
            unique_patterns.append(pattern)

    return unique_patterns, len(unique_patterns)
    
    
# YOUR CODE HERE
# raise NotImplementedError()

In [75]:

from nose.tools import assert_equal
DATA = "unit-patterns.txt"
(patterns, num_patterns) = get_unique_patterns(get_patterns(DATA))
assert_equal(patterns[0],['AACE1223-2AA', 'ACCT1121-1', 'CARS1223-2AA', 'CITS1021-1', 'CITS2023-2', 'ECON1121-1', 'FINA1221-2', 'INDG1223-2AA', 'MGMT1135-2', 'MKTG1225-2', 'STAT1524-1'])
assert_equal(patterns[5][0],'AACE1224-ACE-2')
assert_equal(patterns[10][-1],'STAT1523-2')

Examining the range of disciplines

For this part, it is left to you to think about the best way to break the problems down. You won't want to write new code for each different discipline, so think about how you can use structured programming to efficiently answer the questions.

What proportion of study patterns contain:
- a maths unit?
- a stats unit?
- a business unit?
- a medicine unit?
- a law unit?
- a psychology unit?
- a music unit?
- a philosophy unit?
- a service learning unit?

How many different disciplines are studied altogether? (For the purposes of this question, assume the first four characters of the unit code represent the discipline.)
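One way to avoid writing separate code for each discipline is a single function that takes the code prefix as a parameter. The sketch below is only a starting point: the function names and demo data are hypothetical, and the mapping from a discipline to a prefix (for example maths to "MATH") is an assumption you would need to check against the unit codes in the file.

```python
def proportion_with_prefix(patterns, prefix):
    """Fraction of patterns containing at least one unit starting with prefix."""
    if not patterns:
        return 0.0
    hits = sum(
        1 for pattern in patterns
        if any(unit.startswith(prefix) for unit in pattern)
    )
    return hits / len(patterns)


def count_disciplines(patterns):
    """Number of distinct four-character unit-code prefixes."""
    return len({unit[:4] for pattern in patterns for unit in pattern})


demo = [
    ["CITS2023-2", "MATH1211-1"],
    ["CITS2023-2", "STAT1024-1"],
    ["CITS2023-2", "MATH2221-1", "MATH1214-1"],
    ["CITS2023-2", "LAWS1113-2"],
]
# Prefix-to-discipline mapping is an assumption (e.g. maths -> "MATH").
for prefix in ("MATH", "STAT", "LAWS"):
    print(prefix, "{:.1%}".format(proportion_with_prefix(demo, prefix)))
print(count_disciplines(demo))
```

A pattern with two maths units still counts once, because any() stops at the first match within each pattern.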

Challenge

Print a list of each unit that is being studied, and how many students are studying it, in order from the unit with the most students to the unit with the least.

Hints: Use a dictionary, look at the get method, and consider the key argument of the inbuilt function sorted.

(My code is 7 lines including the print statement.)

How many units are being studied altogether?

In [68]:

def unit_studied(patterns):
    # Count how many times each unit appears across all patterns.
    counts = {}
    for pattern in patterns:
        for unit in pattern:
            counts[unit] = counts.get(unit, 0) + 1

    # Order the (unit, count) pairs from most to least studied.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

    return ranked, len(ranked)

print(unit_studied(get_patterns(DATA)))

([('CITS2023-2', 255), ('CITS1023-2', 79), ('CITS1021-1', 77), ('CITS3025-1', 74), ('STAT2021-1', 72), ('STAT2023-2', 67), ('STAT1024-1', 60), ('CITS3021-1', 53), ('CITS2223-2', 48), ('CITS3223-2', 45), ('CITS2224-1', 43), ('CITS3220-2', 43), ('CITS2211-2', 36), ('STAT3260-2', 33), ('CITS3221-2', 28), ('CITS2021-1', 27), ('CITS1221-1', 26), ('CITS3224-1', 23), ('STAT1023-2', 21), ('CITS3225-1', 17), ('MATH2221-1', 13), ('MATH1214-1', 13), ('ENSC2224-1', 12), ('FINA1129-2', 11), ('STAT1524-1', 10), ('CITS1221-2', 10), ('FINA2223-2', 10), ('ACCT1121-1', 10), ('ECON1124-1', 10), ('MATH2231-1', 10), ('FINA1221-1', 9), ('MKTG1225-1', 9), ('MATH1211-1', 8), ('MKTG1225-2', 8), ('ENSC2221-1', 8), ('MATH1721-1', 8), ('PHIL1224-1', 8), ('ECON1121-1', 8), ('CITS2021-2', 8), ('INDG1223-2AA', 8), ('ENSC3221-2', 7), ('ENSC3227-2', 7), ('CITS1021-2', 7), ('FINA3320-1', 7), ('PSYC1121-1', 7), ('MUSC1255-2', 7), ('STAT2263-2', 7), ('ENSC3211-2', 6), ('PHYS2221-1', 6), ('MGMT1135-1', 6), ('FINA2229-1', 6), ('PHAR1121-2', 6), ('MATH3225-2', 6), ('CARS1223-2AA', 6), ('FINA2227-1', 6), ('ENSC3224-1', 5), ('ENSC3213-2', 5), ('ECON1121-2', 5), ('FINA2224-1', 5), ('SCIE1121-1', 5), ('MATH1724-1', 5), ('PSYC1123-2', 5), ('GEOG2221-1', 5), ('MUSC1981-1', 5), ('AACE1223-2AA', 5), ('ENSC3221-1', 4), ('MATH2521-2', 4), ('PHYS2223-2', 4), ('ENSC3223-2', 4), ('MGMT1135-2', 4), ('MUSC2275-1', 4), ('MKTG1220-1', 4), ('FINA2220-1', 4), ('MATH1211-2', 4), ('MATH1721-2', 4), ('ENSC3216-2', 4), ('MUSC1250-1', 4), ('STAT3025-1', 4), ('ACCT3323-2', 4), ('ENSC3220-1', 4), ('ACCT2114-1', 4), ('PSYC2225-1', 4), ('SCOM1121-1', 4), ('ACCT1121-2', 4), ('FINA3326-2', 4), ('ENSC1225-1', 4), ('FINA1221-2', 4), ('MATH3223-2', 4), ('ENSC3225-1', 3), ('GRMN2025-1', 3), ('GRMN2020-2', 3), ('ENSC1225-2', 3), ('FINA2225-1', 3), ('GRMN1021-1', 3), ('ECON2230-2', 3), ('ACCT2331-1', 3), ('SSEH1120-1', 3), ('LAWS2321-2', 3), ('MATH3220-2', 3), ('LAWS1120-1', 3), ('INDG1153-2', 3), ('MKTG2238-2', 3), ('ENSC1220-2', 3), 
('FINA3327-2', 3), ('ENSC3215-2', 3), ('GENE2233-2', 3), ('GENE2254-1', 3), ('ENSC3210-1', 2), ('MECH3023-2', 2), ('MATH3224-1', 2), ('STAT3264-1', 2), ('BIOC3225-2', 2), ('SCIE1120-2', 2), ('MKTG2321-2', 2), ('MKTG2325-1', 2), ('ENSC1220-1', 2), ('IMED2223-2', 2), ('SCIE2124-1', 2), ('SPAN1021-1', 2), ('ENSC3228-2', 2), ('ENSC3229-1', 2), ('MATH1621-1', 2), ('SCIE1123-2', 2), ('COMM1223-2', 2), ('ENGL2021-1', 2), ('CHEM2221-1', 2), ('PUBH1121-1', 2), ('GEOG2221-2', 2), ('ACCT2113-2', 2), ('SVLG1223-2', 2), ('BIOL1134-1', 2), ('GRMN1021-2', 2), ('SCIE2267-2', 2), ('MUSC1981-2', 2), ('PSYC2228-2', 2), ('EDUC1124-1', 2), ('SSEH1125-1', 2), ('ECON1123-2', 2), ('IMED2221-1', 2), ('IMED2224-1', 2), ('IMED2225-2', 2), ('IMED2220-2', 2), ('GRMN1023-2', 2), ('ASIA3223-2', 2), ('SVLG1225-Y3', 2), ('MATH1723-2', 2), ('KORE1021-1', 2), ('JAPN2025-1', 2), ('PSYC2217-1', 2), ('PHYS1221-1', 2), ('SSEH2203-2', 2), ('MATH3221-1', 1), ('BIOC3221-1', 1), ('BIOC3224-1', 1), ('MGMT1136-2', 1), ('CLAN3224-1', 1), ('ARTF2250-2', 1), ('POLS1121-1', 1), ('JAPN1023-2', 1), ('MKTG3326-2', 1), ('ENSC3215-1', 1), ('ASIA1221-1', 1), ('PHIL2221-2', 1), ('PHIL2227-2', 1), ('MUSC2270-1', 1), ('ARCY2221-2', 1), ('MUSC3571-1', 1), ('CHEM2223-2', 1), ('PHYS1224-1', 1), ('ASIA2220-1', 1), ('CHIN3029-1', 1), ('SCIE2225-1', 1), ('IMED3221-1', 1), ('IMED3224-1', 1), ('IMED3225-2', 1), ('IMED3220-2', 1), ('BIOC2221-1', 1), ('BIOC2223-2', 1), ('PATH2214-1', 1), ('PATH2211-1', 1), ('GCRL1224-1', 1), ('INMT2233-2', 1), ('ECON3324-1', 1), ('ECON2220-1', 1), ('SVLG1224-Y3', 1), ('EMPL1121-2', 1), ('ANTH1221-1', 1), ('ACCT2221-1', 1), ('MKTG3327-1', 1), ('ECON1111-1', 1), ('PHYS1221-2', 1), ('LAWS3328-1', 1), ('ANHB1121-1', 1), ('SCOM2228-1', 1), ('PACM1124-1', 1), ('LING1221-1', 1), ('ECON2126-1', 1), ('ECON2235-1', 1), ('ANHB1123-2', 1), ('CHEM1221-1', 1), ('SCIE1126-2', 1), ('STAT1523-2', 1), ('FREN1021-2', 1), ('JAPN3025-1', 1), ('JAPN3026-2', 1), ('AACE1224-ACE-2', 1), ('CARS1224-ACE-2', 1), 
('INDG1224-ACE-2', 1), ('PACM1121-1', 1), ('ASIA3226-2', 1), ('KORE2025-1', 1), ('KORE2020-2', 1), ('LAWS2221-1', 1), ('LAWS2227-2', 1), ('BUSN1123-2', 1), ('PHAR2214-1', 1), ('PHAR2223-2', 1), ('ENVT2221-2', 1), ('ENVT2254-1', 1), ('CITS0229-2', 1), ('MGMT5520-2', 1), ('STAT0260-2', 1), ('JAPN1021-1', 1), ('ENVT2224-1', 1), ('SPAN1021-2', 1), ('ASIA1223-2', 1), ('CLAN1221-1', 1), ('EDUC1120-1', 1), ('CHIN2025-1', 1), ('CHIN2026-2', 1), ('ECON2205-2', 1), ('PACM1121-U0', 1), ('PHYS1234-1', 1), ('SVLG1227-1', 1), ('LAWS1113-2', 1), ('JAPN2221-1', 1), ('JAPN2020-2', 1), ('MATH3233-2', 1), ('SSEH1125-2', 1), ('ENVT2251-2', 1), ('GCRL2223-2', 1), ('ARTF1254-1', 1), ('MGMT1136-1', 1), ('KORE3025-1', 1), ('STAT3026-2', 1), ('BIOL2220-2', 1), ('ENVT3327-2', 1), ('SCIE2220-1', 1), ('ANIM1221-1', 1), ('SSEH1123-2', 1), ('MATH1213-2', 1), ('SVLG1226-U0', 1), ('ENRL3225-2', 1), ('HIST1125-1', 1), ('POLS2221-2', 1), ('POLS2216-2', 1), ('POLS2224-1', 1), ('CHEM1225-2', 1), ('LING1223-2', 1), ('EMPL3274-1', 1), ('MGMT3335-1', 1), ('MGMT3307-2', 1), ('ARLA1224-1', 1), ('EART1120-2', 1), ('EART1125-1', 1), ('GCRL2224-1', 1), ('ITAL1021-1', 1), ('GCRL2224-Y3', 1), ('SVLG1226-1', 1), ('ENSC3226-1', 1), ('ENSC3218-2', 1), ('ENSC3219-2', 1), ('PSYC3313-2', 1), ('ENSC3225-2', 1), ('SSEH3355-1', 1), ('ENSC2225-2', 1), ('ACCT2221-2', 1), ('CHEM1225-1', 1), ('ENSC2220-2', 1), ('ACCT3325-1', 1), ('MGMT2301-1', 1), ('CHEM1220-2', 1), ('GEOG2224-A0A', 1), ('GEOG2225-1', 1), ('GEOG2226-2', 1), ('ANHB2214-1', 1), ('ANHB2215-1', 1), ('ANHB3320-2', 1), ('PHIL3225-1', 1), ('PHIL3220-2', 1), ('SSEH3366-2', 1), ('PHIL2220-1', 1), ('SVLG1225-1', 1)], 293)
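The standard library's collections.Counter offers another route to the same ranking. This is a sketch of an alternative to the dictionary approach suggested in the hints, not the intended solution; the function name unit_counts and the demo data are hypothetical.

```python
from collections import Counter


def unit_counts(patterns):
    """Return (unit, count) pairs, most-studied unit first."""
    counts = Counter(unit for pattern in patterns for unit in pattern)
    # most_common() sorts by count, highest first.
    return counts.most_common()


demo = [
    ["CITS2023-2", "STAT1024-1"],
    ["CITS2023-2", "CITS1021-1"],
    ["CITS2023-2", "STAT1024-1"],
]
ranked = unit_counts(demo)
for unit, count in ranked:
    print(unit, count)
print(len(ranked))   # number of distinct units
```

Either approach counts occurrences rather than students, so a unit listed twice within one pattern (as BIOC3225-2 is in the fifth line of the file) contributes two to its total.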
