{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['text.usetex'] = True" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## MA934\n", "\n", "## Data types and data structures\n", "\n", "### There is more to life than linear arrays ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Data types\n", "\n", "A *data type* is an attribute of data that tells the compiler/interpreter how that data will be used. For example, ```Float64``` and ```Int64``` are both 64-bit binary strings but are interpreted differently. \n", "\n", "*Primitive* types: ```int``` etc, ```float``` etc, ```bool```, ```str```\n", "\n", "*Composite* types: derived from multiple primitive types: ```list```, ```tuple```.\n", "\n", "Python also provides some special types, such as ```None``` - see a few sources such as [this collection](https://www.pythoncheatsheet.org/) or this [cheat sheet](https://docs.julialang.org/en/v1/manual/noteworthy-differences/).\n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Working with types\n", "\n", "Python provides functions for type checking that can be very useful:\n", "\n", "* ```type(x)``` : returns the type of x\n", "* ```isinstance(x, T)``` : checks if x has type T" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "cell_style": "split" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(, )\n" ] } ], "source": [ "n = 10\n", "x = 10.0\n", "print((type(n), type(x)))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "cell_style": "split", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(False, True)\n" ] } ], "source": [ "print((isinstance(x, int), isinstance(x,float)))" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "Note ```DataType``` is itself a type:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "cell_style": "split" }, "outputs": [ { "data": { "text/plain": [ "type" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(type(x))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The ```None``` special type\n", "\n", "Confusingly, ```None``` is a type that can only take the special value ```none```. This represents the value returned by functions which do not return anything." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type of nada: , Value of nada : None\n" ] } ], "source": [ "nada = None\n", "print('Type of nada: ', type(nada), ', Value of nada : ', nada)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the ```NULL``` value in C or ```Nothing``` in Julia." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The ```Union``` special type in Julia\n", "\n", "The ```Union``` is a type that includes all instances of any of its argument types, such as bringing together e.g. ```Union(Float64, Nothing)``` to represent the possibility of absent values. Python, by contrast, is a dynamically (rather than statically) typed language and hence such functionality is only done as part of an operation called [Type Hinting](https://docs.python.org/3/library/typing.html#typing.Union) (useful for annotating code or for work with IDEs).\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from typing import Union,TypeVar\n", "\n", "T = TypeVar('T')\n", "def f(x: T) -> Union[str, None]:\n", " if x:\n", " return \"x\"\n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Composite data types\n", "\n", "A collection of named fields, that can be treated as a single value. We have a few options in Python, such as structures via the ```struct``` keyword:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "cell_style": "split", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\n\\x00\\x00\\x00Code\\x00\\xc0\\xfcD'\n", "(10, b'Code', 2022.0)\n", "Size in bytes: 12\n" ] } ], "source": [ "import struct\n", "\n", "# packs data into binary form\n", "packedData = struct.pack('i 4s f', 10, b'Code', 2022)\n", "print(packedData)\n", "\n", "# converts back to original form\n", "packedData = b'\\n\\x00\\x00\\x00Code\\x00\\xc0\\xfcD'\n", "unpackedData = struct.unpack('i 4s f', packedData)\n", "print(unpackedData)\n", " \n", "# calculates size\n", "structSize = struct.calcsize('i 4s f')\n", "print(\"Size in bytes: {}\".format(structSize))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arguably sets, as shown in the example below, are a more natural way to collect data of different types" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Hi there', 2, 4.1, None}\n", "{'Hi there', 2, 4.1, None, (2, 4, 6)}\n" ] } ], "source": [ "x = {2, 'Hi there', 4.10, None}\n", "print(x)\n", "\n", "# Set elements must be immutable and duplicates are also not allowed.\n", "# But adding a tuple for example is\n", "x = {2, 'Hi there', 4.10, None, (2,4,6)}\n", "print(x)\n", "\n", "# Note that lists and dictionaries are mutable, so they can’t be set elements (try this out)\n", "\n", "# Also have a look through the documentation and consider trying out some standard commands \n", "# e.g. len, union, intersection, ..." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "center", "slideshow": { "slide_type": "slide" } }, "source": [ "### Data structures?\n", "\n", "A data structure is a specialised way of organising data in a computer so that certain operations can be performed efficiently.\n", "\n", "* Composite types are simplest examples.\n", "* *Static* data structures have a fixed size. *Dynamic* data structures can grow and shrink depending on the data that they contain.\n", "* Associated with almost every data structure is a set of basic operations that it is designed to perform efficiently (conversely some other operations might be very inefficient or impossible.)\n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Examples of some common data structures (presented generically)" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "-" } }, "source": [ "* Linear arrays\n", "* Linked lists\n", "* Stacks\n", "* Queues" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "* Hash tables\n", "* Binary trees\n", "* Heaps\n", "* Graphs" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Arrays\n", "\n", "\"array\" \n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "Basic operations:\n", "\n", "* access(i) : return get value at index i\n", "* update(i,v) : set value at index i equal to v.\n", "\n", "insert() and delete() not possible - static data structure.\n", "\n", "Building block for many other data structures." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Linked lists\n", "\n", "\"array\" \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A linked list is a sequence of elements called *nodes* in linear order that are linked to each other.\n", "\n", "The first/last node is called the *head*/*tail* respectively." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Linked lists\n", "\n", "\"array\" \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Each node consists of a data container and a link to the next node.\n", "* Dynamic data structure but only sequential access is possible.\n", "* Variants: singly linked, doubly linked, circularly linked." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Linked lists : basic operations\n", "\n", "\"array\" \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* search(x): determine if data x is in the list (and perhaps return a reference to it).\n", "* insert(x): add new node with data x at beginning/middle/end. \n", "* delete(x): delete node with data x from the list." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Aside: pointers and references\n", "\n", "Discussions of linked lists often refer to linking nodes using *pointers*. A pointer (especially in C/C++) is a data type that contains the memory address of another object/variable.\n", "\n", "Python does not make use of pointers (at least not in standard mode). Julia does not have pointers either - variables are accessed via *references*.\n", "\n", "A reference is also a data type that contains the memory address of another object/variable." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Aside: pointers and references - so what's the difference?\n", "\n", "* A reference must refer to an existing object. It cannot change once created.\n", "* A pointer can be NULL and can be updated to refer to a different memory location by changing its value." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "Pointers are powerful but dangerous:\n", "* segmentation faults\n", "* memory leaks\n", "* dangling pointers" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "fragment" } }, "source": [ "If [Maslov](https://en.wikipedia.org/wiki/Law_of_the_instrument) were a software engineer:\n", "\n", "\"When the only tool you have is C++, every problem looks like your thumb\"." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Stacks\n", "\n", "\"array\" \n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "A *stack* is a linear data store with a LIFO (Last In First Out) access protocol: the last inserted element must be accessed first.\n", "\n", "Can be static or dynamic.\n", "\n", "So named because it resembles a stack of plates...\n", "\n", "Used, for example, to implement function calls in recursive programming. \n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Stacks : basic operations\n", "\n", "\"array\" \n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "* push(x) : add the element x to the top of the stack.\n", "* pop() : remove the top element from the stack and return it.\n", "* peek() : return the top element from the stack without deleting it.\n", "* isempty() : check if the stack is empty." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Queues\n", "\n", "\"queue\" " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "A *queue* is a linear data store with a FIFO (First In First Out) access protocol: the first inserted element must be accessed first.\n", "\n", "Can be static or dynamic.\n", "\n", "So named because it resembles a real queue!\n", "\n", "Used, for example, to serve requests on a shared resource." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Queues : basic operations\n", "\n", "\"queue\" " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "* enqueue(x): insert element x to the end of the queue. \n", "* dequeue(): return the element at the beginning of the queue and delete it from the queue." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Hash tables (also associative array or dictionary)\n", "\n", "\"hash\" " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "A hash table stores a set of values, \n", "$$\\left\\{A, B, C, D, E\\right\\},$$ \n", "associated with a set of keys,\n", "$$\\left\\{key\\ A, key\\ B, key\\ C, key\\ D, key\\ E\\right\\},$$\n", "in a way that supports efficient lookup - i.e. $\\mathcal{O}(1)$.\n", "\n", "Direct addressing (convert key X to an integer, k, and store value X in slot k) is often not feasible." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Hash tables - an example\n", "\n", "Suppose the keys are integers in the range 1 - 1024 and we need to store, say, 4 random key-value pairs. \n", "\n", "* Direct addressing would require an array of size 1024.\n", "* Instead use an array of size 23 and the hash function\n", "$$h(k) = k\\%23 + 1$$" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "cell_style": "split" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[872, 222, 44, 547]\n", "Key 872 -> index 22\n", "Key 222 -> index 16\n", "Key 44 -> index 22\n", "Key 547 -> index 19\n" ] } ], "source": [ "import random\n", "rng = np.random.default_rng()\n", "\n", "keys = random.sample(range(0, 1024), 4)\n", "print(keys)\n", "\n", "idx = [k%23 + 1 for k in keys]\n", "\n", "for i in range(0,4):\n", " print('Key ', keys[i], ' -> ', ' index ',idx[i])" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "Of course need a strategy to resolve conflicts. e.g. use buckets.\n", "\n", "Probability of conflicts grows as the *load factor* (#entries/#buckets) increases." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Binary trees\n", "\n", "\"tree\" " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "A binary tree is a hierarchical data structure in which nodes are linked together in parent/child relationships.\n", "\n", "Each node contains a data container and pointers/references to left and right child nodes.\n", "\n", "*Height* of the tree : maximal number of edges from the *root* to the *leaves*." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Binary search trees (BST)\n", "A BST stores integer keys in a sorted order to facilitate fast search:\n", "* All nodes, y, in left subtree of any node, x, have y.key ≤ x.key.\n", "* All nodes, y, in the right subtree of any node x, have y.key ≥ x.key." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "Here is a BST storing the keys {0,1,2,3,5,6,7,9}\n", "\"BST1\" " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Binary search trees (BST)\n", "A BST stores integer keys in a sorted order to facilitate fast search:\n", "* Nodes, y, in left subtree of node, x, have y.key ≤ x.key.\n", "* Nodes, y, in the right subtree of node x, have y.key ≥ x.key." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "-" } }, "source": [ "Here is a another BST storing the keys {0,1,2,3,5,6,7,9}\n", "\n", "\"BST2\" \n", "\n", "Not unique." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Fast search :\n", "Recursive algorithm to search for a key in a BST.\n", "\n", "Maximum number of comparisons is the depth of the tree.\n", "\n", "If the tree is *balanced*, depth is $\\mathcal{O}(\\log_2 n)$.\n", "\n", "Note *building* the tree is $\\mathcal{O}(n)$.\n", "\n", "Note: generally you will find good examples (and practice problems) online for many of these structures, e.g. [here](https://www.geeksforgeeks.org/binary-search-tree-set-1-search-and-insertion/). Do not be afraid to explore during the practice classes!\n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "```\n", "search(T::BST, k::int)\n", " if T is empty\n", " return false\n", " elseif T.key == k\n", " return true\n", " else\n", " if k <= T.key\n", " search(T.left, k)\n", " else\n", " search(T.right, k)\n", " end\n", " end\n", "end\n", "```" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Another application: event selection in the Gillespie algorithm\n", "\n", "Simulates trajectories from a continuous time Markov chain.\n", "\n", "From $S$ at time $t$, 8 possible states, $S_1\\ldots S_8$, accessible with transition rates, $r_1\\ldots r_8$.\n", "\n", "Probability of transition $S\\to S_i$ is proportional to $r_i$." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "\"Gillespie\" " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Gillespie algorithm\n", "\n", "Build the list of partial sums:\n", "\n", "$$ x_i = \\sum_{j=1}^i r_j $$\n", "\n", "Generate $x \\sim \\text{Uniform}(0, x_8)$\n", "\n", "\"Interval\" " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "Find which interval $x$ falls in: find $k$ such that $x_{k-1} \\leq x < x_k$. \n", "\n", "Update state $S \\to S_k$ and update time $t \\to t+\\Delta t$ where $\\Delta t \\sim \\text{Exponential}(x_8)$.\n", "\n", "In practice number of transitions, $n$, large. Can we find $k$ faster than $\\mathcal{O}(n)$?\n", "\n", "*Interval membership problem*." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Fenwick trees\n", "\n", "\"Fenwick\" \n", "\n", "A BST variant called a Fenwick tree can solve the interval membership problem in $\\mathcal{O}(\\log_2 n)$ comparisons." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "Each node in a Fenwick tree stores the sum of the values stored in its children.\n", "\n", "Leaf nodes also need to store an integer key identifying the interval. \n", "\n", "Similar to tree search but when descending the right subtree, must remember to exclude the partial sum on the left subtree. " ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "slide" } }, "source": [ "### Fast interval membership\n", "```\n", "search(T::FT, x::Float)\n", " if T is leaf\n", " return T.key\n", " else\n", " if x <= T.left.value\n", " search(T.left, x)\n", " else\n", " search(T.right, x - T.left.value)\n", " end\n", " end\n", "end\n", "```\n" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "If the tree is balanced, this search is $\\mathcal{O}(\\log_2 n)$ (depth of tree).\n", "\n", "Transition rates usually depend on state. Reconstructing the tree at each step would be $\\mathcal{O}(n)$.\n", "\n", "Partial sums can be updated in $\\mathcal{O}(\\log_2 n)$ operations. OK if small number of rates change at each step.\n", "\n", "Need occasional rebalancing." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }