Friday 4 April 2014

Sorting and Efficiency (Week 11)

I feel the phrase "worst case runtime" really confuses people. I like to think of the "runtimes" (c, n, n^2, n!, etc.) more as worst case "rates of growth", because that's what they really are: how the running time of the algorithm CHANGES as the input grows. From what I've seen, a lot of people try to assign a concrete value to this "runtime", for example, seconds. But unless a very deep analysis is done, it is difficult to put a real seconds value on some code without actually testing it.

We talked about some sorting algorithms such as quicksort, mergesort, and the n^2 sorts. The one I liked the most was mergesort (aside from bogosort, of course). Mergesort is an n log n sorting algorithm in its best, average, and worst case. It works by:
1) Recursively splitting up the list to be sorted (log n levels of splitting)
2) When the list has been broken up into sublists each of size 1, the "merge" step kicks in.
The merge step consists of combining two sorted lists into one larger sorted list. It does this in linear (n) time by repeatedly comparing the first items of the two lists and appending the smaller (or larger, depending on sort order) item to the combined list.
3) Step 2 is applied recursively as the lists are rebuilt from their split size of 1 (from step 1), until finally two lists, each half the size of the original, are merged together into a fully sorted list.
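Here's a minimal sketch of the steps above (a toy version, nothing official):

def merge(left, right):
    """Combine two sorted lists into one sorted list in linear time."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:              # append the smaller front item
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    return result + left[i:] + right[j:]     # one side may have leftovers

def mergesort(lst):
    if len(lst) <= 1:                        # a list of size 1 is already sorted
        return lst
    mid = len(lst) // 2                      # step 1: split in half
    return merge(mergesort(lst[:mid]), mergesort(lst[mid:]))   # steps 2 and 3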

The most impressive sort, though, was timsort, which is Python's built-in list sort.
Timsort is roughly a combination of mergesort and insertion sort. It is also intelligent enough to recognize partially sorted chunks of data, so it doesn't waste time re-sorting data that is already in order.
Why insertion sort? Insertion sort is used because it has a very good runtime on small lists (around 5~10 items). Because of the small size of the lists, there is a high probability that insertion sort will run near its best case runtime of (n). Therefore, what timsort roughly does is: split the list until the pieces are smaller than a certain threshold, insertion sort the small lists, and then merge the results into larger lists. Also, because it recognizes sorted chunks, timsort is said to have a "supernatural" best case runtime of (n) steps. This happens when the list is already pre-sorted in ascending or descending order.
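Here's a plain insertion sort sketch to show where that near-linear best case comes from: on an already (or nearly) sorted list, the inner while loop almost never runs, so the whole thing does about n comparisons.

def insertion_sort(lst):
    for i in range(1, len(lst)):
        item = lst[i]
        j = i
        # shift larger items right until item's spot is found
        while j > 0 and lst[j - 1] > item:
            lst[j] = lst[j - 1]
            j -= 1
        lst[j] = item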

Tuesday 25 March 2014

Regular expressions

First off, I belong to the side which pronounces it "Ree-jex"; it flows off the tongue better, sue me.
Anyways, A2 was due a few days ago. It brought along with it: staring into space for a long time, many goofy hand gestures and a lot of pacing up and down. Overall a fun assignment, 8.42/10.

For this assignment we needed to be able to take in a regular expression, validate that it is syntactically correct, represent it as a tree, and check a string to see if it fulfils the rules stated by the regex.

To do this, we needed to create some functions, namely:
is_regex(s)
regex_match(r,s)
build_regex_tree(r)

is_regex(s)
When I started out on is_regex, I originally had a hybrid iterative and recursive approach in mind. It was efficient but was quite long (around 80~90 lines). It involved locating nested brackets and processing each nested regex (deepest to shallowest). When a valid nested regex was found, it was fully replaced by a placeholder which made the rest of the regex easier to validate. For example:
- (1.(2|0)*)
-> (1.e*)
-> (1.e)
-> e
VALID.
Unhappy with the code length, I rewrote everything from the ground up and ended up with a fully recursive solution which was around 20 lines long and didn't use any helpers. It worked by splitting the regex around its binary operators (. , |) and checking whether each side was a valid regex. If both sides were valid, then the entire regex was valid.
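Roughly, the splitting idea looks like this (not my actual submission; I'm assuming the A2 grammar here: the leaves are '0', '1', '2' and 'e', a trailing * is valid whenever the part before it is, and binary operators always come wrapped in brackets):

def is_regex(s):
    """Return True iff s is a syntactically valid regex."""
    if s in ('0', '1', '2', 'e'):                # leaf
        return True
    if s.endswith('*'):                          # star of a valid regex
        return is_regex(s[:-1])
    if s.startswith('(') and s.endswith(')'):
        inner = s[1:-1]
        # try every possible split around a binary operator
        for i, ch in enumerate(inner):
            if ch in '.|':
                if is_regex(inner[:i]) and is_regex(inner[i + 1:]):
                    return True
    return False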

build_regex_tree(r)
This was a relatively easy function to build. I needed to read in a regex and output its tree representation.
Again, I built an iterative-recursive solution, and unlike is_regex, I stuck with it. This one involved finding the shallowest nested regex, rooting the tree at that regex, and delving deeper (recursively). The iterative part came in finding the shallowest nested regex, which was done by looping over the regex and locating all the brackets.
The fully recursive approach, which I thought could also work, would be to reuse the same binary-operator splitting as is_regex. I would locate a binary operator and run is_regex on the left and right side. Only the shallowest binary operator would pass the is_regex test; the rest would fail because of bracket mismatches. Then I could build the tree starting at that shallowest operator, and use recursion to go deeper into the regex. I decided against this because I felt that even though the code would be comparatively short, the runtime would be a lot longer than my original approach.
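For completeness, here's roughly what that fully recursive alternative would look like, piggybacking on is_regex from above. The RegexTree class here is a made-up stand-in, not the assignment's actual tree classes:

class RegexTree:
    def __init__(self, symbol, children):
        self.symbol = symbol
        self.children = children

def build_regex_tree(r):
    """Return the tree representation of the valid regex r."""
    if r in ('0', '1', '2', 'e'):
        return RegexTree(r, [])                  # leaf node
    if r.endswith('*'):
        return RegexTree('*', [build_regex_tree(r[:-1])])
    inner = r[1:-1]                              # strip the outer brackets
    for i, ch in enumerate(inner):
        # only the shallowest operator passes is_regex on both sides
        if ch in '.|' and is_regex(inner[:i]) and is_regex(inner[i + 1:]):
            return RegexTree(ch, [build_regex_tree(inner[:i]),
                                  build_regex_tree(inner[i + 1:])])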

regex_match(r, s)
I had mixed feelings about the difficulty of this function. It was relatively easy for binary and leaf trees. The issue arose when I needed to check the star trees. It seemed relatively straightforward, but there was slight ambiguity in the rules about which strings match a star tree. This edge case in particular:
(1*.2*)*
What strings would match that? The real question being, how many times is the inner bracket evaluated?
-If it is evaluated only once, then:
For example, if the inner bracket produced a string match of "1122", would this regex only match strings made of any number of "1122"s concatenated? eg: 112211221122

-The inner bracket is evaluated indefinitely: (This is the approach I took)
Basically, evaluate the inner bracket over and over, consuming s piece by piece. This means that the above regex would match every single string containing any number of 1's and 2's. eg: 112, 21, 21122, '' (the empty string), 1212
Let's take '21122' to show how this would evaluate to true:
It can be broken up into 2 + 1 + 1 + 2 + 2,
and each of those pieces matches the inner bracket, (1*.2*), therefore making the whole string match.
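Here's a sketch of a matcher built on this interpretation, reusing the made-up RegexTree from the build_regex_tree sketch above (and assuming 'e' stands for the empty string, as in A2). The star case is the interesting one: s matches r* iff s is empty, or some non-empty prefix of s matches r and the leftover still matches r*.

def regex_match(r, s):
    """Return True iff string s matches the regex tree r."""
    if r.symbol == 'e':                          # 'e' matches only the empty string
        return s == ''
    if r.symbol in ('0', '1', '2'):              # leaf: a single character
        return s == r.symbol
    if r.symbol == '.':                          # concatenation: try every split
        left, right = r.children
        return any(regex_match(left, s[:i]) and regex_match(right, s[i:])
                   for i in range(len(s) + 1))
    if r.symbol == '|':                          # alternation: either side works
        return any(regex_match(child, s) for child in r.children)
    if r.symbol == '*':                          # zero or more repetitions
        if s == '':
            return True
        inner = r.children[0]
        return any(regex_match(inner, s[:i]) and regex_match(r, s[i:])
                   for i in range(1, len(s) + 1))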

All in all, this was an interesting assignment with a reasonable difficulty. It was more enjoyable to do than assignment 1.

Sunday 23 March 2014

Binary Search Trees

Did someone say ADTs? Here's another one!

Binary Search Trees are specialized types of Trees. Let's break up the name:
Binary - Each node can have a maximum of two children.
Search - They follow a special protocol when structuring elements. All elements which are less than the root are placed to one side of the root, and the greater values to the other side; these are usually left and right, respectively. If you were to do an inorder traversal of a BST, you would get the values in ascending order.

Because of their strict form, binary search trees are useful for a number of applications, the most obvious being binary search. Every time a node is checked for a value, the remaining pool of values to check is more or less halved.
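A quick sketch of that halving in code (a toy node class, nothing fancy):

class BSTNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left                 # everything smaller lives here
        self.right = right               # everything greater lives here

def bst_search(node, value):
    """Return True iff value is somewhere in the BST rooted at node."""
    if node is None:
        return False
    if value == node.value:
        return True
    # discard the half of the tree that cannot contain value
    if value < node.value:
        return bst_search(node.left, value)
    return bst_search(node.right, value)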

Saturday 15 March 2014

LinkedLists

And just when you thought we were done with the ADTs, there's more!
What now? Stacked Queues? Tree Stacks? What more can you think of? LinkedLists, of course!
"LinkedLists?" I hear you ask. Well, let's start at the basics.
A LinkedList is like a chain. It is a linear network of "links" which connect together to form a large structure.
Each "link" is only aware of other links in it's immediate vicinity.
So, a LinkedList consists of nodes (links). Each node contains 2-3 parameters. 2-3, that's it? Yep, that's it.
This is because each node is a very simple object, and what differentiates it from other ADTs, it the connection protocol of the nodes.
So each nodes holds some data object. This can be anything. The remaining (1-2) parameters store the information of the other adjacent nodes to which our current node is "linked" to.
So we have a collection of nodes, all these nodes are connected to each other like links in a chain. Lets say I want to access the data at node 5. I can not just immediate jumped to node 5, I need to start at the tip of the chain (head) and work my way down, 1 node at a time, till I reach node 5. Because of their recursive design, LinkedLists are very versatile and relative easy to code. LinkedLists can easily be edited. By changing the head or changing a pointer inside of one of the nodes, a LL can be shortened or extended.
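Here's what I mean, as a bare-bones sketch (a singly linked node with the minimal 2 parameters: the data and one link):

class LinkedListNode:
    def __init__(self, value, nxt=None):
        self.value = value               # the data this link carries
        self.nxt = nxt                   # the only other link this node knows about

def get(head, index):
    """Walk from the head, one link at a time, to the node at index."""
    node = head
    for _ in range(index):
        node = node.nxt
    return node.value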

To be honest, I really don't see how LinkedLists are useful yet. They seem to sit in some grey zone between a Stack and a Tree. They can work as a dynamic stack which can pop elements from both ends and remove elements from the middle, but then it really isn't a stack if you can do all that stuff.
They also resemble very constricted trees. SURE, each node can be thought of as having a child, but then again, what's the use of a tree with only one hierarchy of elements forming one long chain?
The one good thing I do see in LinkedLists is that they can greatly reduce the operation complexity of a Queue when it comes to things such as enqueue (sketched at the end of this entry).
But that comes back to the fundamental optimization trade-off: you either burden your memory to make things quicker, or you do a lot of processing to go light on the memory. So it is not some amazing game-changing design, just a different way of structuring your problem (queue efficiency).
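To illustrate that enqueue point (a sketch reusing the LinkedListNode from above): by spending a little extra memory on both a head AND a tail pointer, enqueue and dequeue each become constant-time, instead of shifting every element the way list.insert(0, x) does on a plain Python list.

class Queue:
    def __init__(self):
        self.head = None                 # dequeue from this end
        self.tail = None                 # enqueue at this end

    def enqueue(self, value):
        node = LinkedListNode(value)
        if self.tail is None:            # empty queue
            self.head = self.tail = node
        else:
            self.tail.nxt = node
            self.tail = node

    def dequeue(self):
        node = self.head
        self.head = node.nxt
        if self.head is None:
            self.tail = None
        return node.value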

Wednesday 5 March 2014

Recursion Part 2 (Week 7)

Recursion: the process of breaking down a complex problem of a specific type into much smaller problems of the same type. This week we applied recursive solutions to a couple of data structures, namely Trees and LinkedLists. These are two great data structures to apply recursion to, because both are just made of smaller instances of themselves. For example, a tree can be thought of as being made up of many smaller trees, each with their own "value", "root" and "branches". So if a function can work on a single tree, it can be easily modified to work on a more complex tree. Same thing for LinkedLists: each node can be thought of as the head of another LL. So let's say I want to remove an item from a LL. Instead of writing a loop to traverse the LL, delete the item, and then fix all the pointers, what I can do is traverse to the node and just call the decapitate method on that node. This is possible because that internal node has the exact same properties as the external LL, only on a smaller, simpler LL.
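Roughly what that decapitation looks like (a toy sketch with a minimal node, not the actual lab code):

class Node:
    def __init__(self, value, nxt=None):
        self.value, self.nxt = value, nxt

def remove(node, value):
    """Return the head of the list with the first occurrence of value removed."""
    if node is None:
        return None
    if node.value == value:
        return node.nxt                  # "decapitate": skip this node entirely
    # the rest of the list is itself a smaller, simpler LL
    node.nxt = remove(node.nxt, value)
    return node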
Over the past couple of weeks I have really started to see the use of recursion: it is actually very widely used, very useful, and usually a very sleek alternative to iteration.

Saturday 1 March 2014

Dendrology

Trees. The Tree data structure has some similarities to its biological counterpart. They both have 'leaves', 'branches' and a 'root'. That's about where the similarities end. Let's go into a few more details about trees. But first, I like to think of trees as containing "Nodes". Each node contains two items: a value and a collection of branches. The value is just the data that needs to be stored in the node, and the branches are the other nodes that this specific node points to.
Trees are just a collection of nodes. A node can only be accessed through one unique path, the start point being the top of the tree, called the root. The root is a unique node because no other node points to it. Each node can have an unlimited number of branches, or even no branches. If a node has no branches, it is called a leaf node. Leaf nodes signify the end of that specific chain of nodes.
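In code, that description of a node is about this much (a toy sketch):

class Tree:
    def __init__(self, value, branches=None):
        self.value = value                               # the data stored in this node
        self.branches = branches if branches is not None else []

    def is_leaf(self):
        return self.branches == []                       # no branches: end of the chain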

Trees offer an interesting way to think about different things, such as inheritance and text auto-complete.

At least for single inheritance (where a class can only inherit from one class, such as in Java), we can think of the "object" class as the root of all other classes. Object inherits from no other class, and therefore has no other nodes pointing to it. Below the root, we have all the other classes, which occur below Object. And we can make our own classes and superclasses which follow this same tree/inheritance structure (all of them inheriting from the root: object).

In the case of auto-complete, we can think of each node holding a combination of letters, where the ones near the top are very short combinations (1-3 letters) and each of those links downwards to more complete words. So the auto-complete can guess what word you will type based on what you are actively typing. Of course it isn't this simple a tree, but it follows a lot of the same ideas.
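Here's a toy version of that idea, a so-called prefix tree; the real auto-complete is much smarter, but the shape is the same: each node links downward to longer and longer prefixes.

class TrieNode:
    def __init__(self):
        self.children = {}               # letter -> child node
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def completions(root, prefix):
    """Yield every stored word that starts with prefix."""
    node = root
    for ch in prefix:                    # walk down to the prefix's node
        if ch not in node.children:
            return
        node = node.children[ch]
    stack = [(node, prefix)]
    while stack:                         # collect every word below that node
        node, word = stack.pop()
        if node.is_word:
            yield word
        for ch, child in node.children.items():
            stack.append((child, word + ch))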

Sunday 23 February 2014

Property, getters/setters, and Names

Coming from a Java background, in the beginning I was genuinely taken aback, and slightly disgusted, by the lack of "order" (from my perspective) in Python. Here are the 2 things that really struck me:

1) How DARE you just go and assign ANYTHING to a variable?
>>> is_totally_a_list = [1, 2, 3]
>>> is_totally_a_list = "Not a list"
No error? Really? C'mon.

2) Kind of related to point 1: what do you mean you aren't going to write getters and setters? Yeah yeah, I know they kludge up the code and are just plain annoying to write, but what if you want to change the code later without breaking compatibility?

But as time went on, I learnt a little more about Python and am starting to see the appeal of it (It's ok Java, I still love you. That is, until C++ comes and sweeps me away.).
Regarding the above points, what I have learnt so far:

1) Not all languages are built equally. Python is a lot more dynamic than languages like Java. Instead of plain variables, Python has "names". These "names" can be tagged on to pretty much anything (and I mean ANYTHING), and they are a way to refer to the memory address of an object. So a "name" acts like a little nametag on a memory 'box'. It doesn't really do anything other than point to that specific object in memory. So how is this useful? For starters, you can "name" anything, such as function objects.
So I can actually pass functions in as arguments to other functions. One use of this is a benchmarking function which compares the runtimes of whatever functions are passed in.
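A toy version of that benchmarking idea; note that sorted, the function object itself, is being passed in:

import time

def benchmark(fn, arg):
    """Return how long fn(arg) takes, in seconds."""
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start

print(benchmark(sorted, list(range(100000))))    # timing the built-in sort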
Of course, one of the slightly disconcerting side effects is being able to put a nametag on any type of object or datatype, anytime, anywhere (unlike in Java, where you declare a variable to hold a specific datatype, such as an int). But once you realize how harmless the nametag is, most of the confusion dissipates.

2) This one was actually quite neat. You DO end up writing getters and setters in Python if need be, but you aren't burnt at the stake if you implement them in later iterations of your code (unlike in Java). Python has a built-in called property. What property does is modify the behaviour of a class attribute, so every time someone tries to access/modify it, they are actually routed through a getter/setter (the ones which you just wrote) and all the getter/setter logic is applied (such as raising exceptions, etc.). The biggest implication of this is that you don't need to write getters/setters if you don't presently need them. This results in much smaller and cleaner code. And if you realise you actually DO need to sanity-check the assignment of a variable, you can implement the code without changing how the variable is accessed. You just write your getters and setters and be on your way.
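A minimal sketch of that retrofit; the Thermostat class is made up, but the pattern is the standard property one. Callers keep writing t.temperature the whole time, and the sanity check arrives later without breaking them:

class Thermostat:
    def __init__(self, temperature):
        self._temperature = temperature

    @property
    def temperature(self):                       # the getter, added later
        return self._temperature

    @temperature.setter
    def temperature(self, value):                # the setter, with the new check
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._temperature = value

t = Thermostat(20)
t.temperature = 25        # routed through the setter; no call-site changes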