Career cup graph problem debugging approaches

## The original problem

Careercup is one of the many sites that offers interesting questions to solve that requires abilities in the field of logic, math or computer science (all the three fields very closely related). The problems there are defined for interviews that won't last days, so they can be very interesting but not too long. In other words they are not projects. Projects can be less interesting but very long, and going through this "length" without going astray is quite a skill too, but it is not the case of careercup interview questions for developers.

Furthermore many problems are not cleanly stated, and others can be expanded. I took one randomly that caught my interest, related to a graph problem, and expanded it due to the not exhaustive formulation on careercup.

This happened while I was searching problems to expand the processing list challenged for machines capable of interpreting userRPL (HP calculators). This because the userRPL environment offers a lot of interesting functions (mostly math based), especially for lists, so if one wants to explore them a list of challenges based on list processing is interesting.

The problem can be described as follows:

• Input:
• In input one has a list of triplets S D T where: S is the source node (a positive integer), D is the destination node (a positive integer), T is the connection type ( of value equal to 1 or 2 or 3).
• In the input list there are maximum E triplets, that could be redundant, and the maximum node number is at most $N= \sqrt{E}$ , of course this can vary.
• The graph is oriented, that is, if a node S reaches D , it does not mean that D reaches S.
• If T is 1, only goods of type 1 can flow. If T is 2, only goods of type 2 can flow. If T is 3, both goods of type 1 or type 2 can flow.
• Output:
• After processing the graph input to compact redundacies, we want to know the largest list of nodes in the graph that can reach other nodes in the list with connections of type 3, direct or indirect. So for example the triplets 1 2 3 ; 2 3 3 ; 3 1 3 would create the list { 1 2 3 } since every node can reach the other with flow of type 3 although sometimes through another node.

## Naive solution, userRPL, bash + sqlite

While I know that there are a lot of techniques developed for graphs, I do certain task to keep trained my solving skills and not just implement other ideas. Therefore if I'm in training mode, I first try by myself until I see that I do not make enough progress, then I look for better solutions. Or I do my solution, possible optimizations within a certain effort, and then I compare with existing solutions.

It depends all on the context, one thing is when I want to solve a problem quickly, then I prefer to check what solution exists, another is when I want to solve the problem for personal experience.

So since the problem was defined in the context of challenges for HP calculators, I tried to solve it with the userRPL and the hp 50g. Unfortunately, at least for my current skills, using the fast list processing functions in the userRPL is not debug friendly, and my approach produced a long code that needs debug. Actually whatever non trivial code that I produce needs at last 20 (mostly minor) fixes until it runs properly.

The alternative was to write my own version of the fast list processing, and I do not have interest for it at the moment, or use slow list processing, but it is so slow (processing 100 triplets in a minute, compared to few seconds, and that only for input processing without all the other steps!) that was not interesting1. So, since the hp 50g has other fast languages that are not too close to assembly (HRAST BASIC, newRPL, C through hpgcc) I decided to defer the challenge for the moment until I pick those languages. In the meanwhile I searched for another language I wanted to explore.

The choice fell on SQL and SQLITE. I used SQL but never so intensively and I never user SQLITE aside from quick tests. Therefore having an interesting problem, challenging but - as I wrote above - not too long to solve, helps to explore a new tool/language/whatever. Of course one has to have also the time. I have a long list of explorations to do, but in order of priority SQL was pretty high due to being deferred too many times. Moreover it is also needed for possible database operations regarding gladiabots statistics, before I use again postgresql in combination with plpgsql procedures.

So I decided to wrap SQLite calls in bash, that I already knew but a recent interview fostered my interest again mentioning the google code style for shell scripts, and went on to solve the problem, at least naively. You can check the code here.

Note that while the careercup problems are meant to be solved in, likely, some hours, I send several hours (and in gross terms, days) to first try to solve it with userRPL and then getting stuck on not debug friendly functions and then I needed to get used to translate procedures in SQL to get useful answers. Sure if I would have stuck with one language I would have been faster. It is always the same problem, either one standardizes on a language or tool and get "deep", so advanced operations with that tool, or one uses different tools but never get so deep with one.

## Further problem, the solution does not sound right

Now that I had the algorithmic solution, running in an openwrt 12.09 environment, on an asus 500 v2 router (32 mb ram) and 8gb flash usb disk2 I realized an unexpected pattern.

With low numbers of triplets in input, and therefore a low number of nodes and edges, the results were rarely containing the majority of nodes, with large number of triplets and edges, the results were likely to contain the majority of the nodes if not all.

Why? Surely, given the imposed relation between number of triplets in input and nodes, $N= \sqrt{E}$ , one could see that the number of nodes grow very slowlycompared to the number of edges (better triplets) in input. But even doubling the number, thus $N= 2 \cdot \sqrt{E}$, did not help much. Furthermore the triplets in input are random and maybe redundant, so it is not obvious that with 100 triplets and 20 nodes (or 10), all of those become well connected with flow of type 3.

So I was puzzled. One thing is to debug obvious syntax mistakes or design errors in an algorithm, another thing is to debug something subtle. Was my code wrong? If not, why did the nodes become well connected so easily?

In those cases a possible approach is to get the relationships underlying the problem, or at least an approximation of them, with math, hoping to make as fewer conceptual mistakes as possible.

## Trying to get a naive mathematical model to predict the result

So the triplets in input are built randomly. That is: the source node is picked randomly among the available ones, the destination node is picked randomly among the remaining ones, and the connection type is picked randomly among the values {1,2,3}.

So what is the probability that a node A connects to a node B with flow of type 3, directly?
If we choose to see the formula on the side of the destination, a node is reached by a flow of type 3 by another node with the following probability.

Where the first term means "what is the probability that a node different than the destination node is chosen", the second term means "what is the probability to pick a certain destination node", the third term means "what is the probability to pick a 3".

This simplifies to the term after the equation.

It is not finished here though. What if a node reaches another node through nodes in the middle? There are several combinations. For example:

(bigger picture: http://imgur.com/9r5Dmae )

So we approach very slowly3 and we use the same idea of before. We have actually two connections. One equal like before, between the source and the middle node, another between the middle node and the destination. Of course when we pick the nodes, we have to keep in mind that we should not repeat them. Plus we can consider (this may be wrong) the two connections independent, since they are randomly built.

Therefore we have:

Where the first three terms represent the probability to connect with flow 3 from middle node to the destination node (remember that we are reasoning backwards, from the destination node, otherwise we mess up the reasoning) and the second three terms represent the probability to connect from the source node to the middle with flow 3. Now for the second connection from source to the middle node, cannot pick from a pool of nodes containing the destination node, so the pool is not N-1 but N-2 since the destination node is picked already. Then the remaining formula follows the same structure of building randomly a connection. Note that in the two macro terms (composed by three single terms), the probability of picking the middle node appears twice. Once given the fact that we put aside the destination node (the very first term of the left formula) and once given the fact that we pick a random node that is not the source or the destination (the 5th term of the left formula).

So we see a pattern here, that combining a connection after another, is just multiplying 1/(3N) by itself. One can extend the same reasoning with three nodes connections and forun nodes connections until all the nodes are used.

There are also other combinations though. The case, for example, that the source reaches the destination using two different middle nodes, one that let the flow 1 pass, and another that let the flow 2 pass (in total, the flow 3 is allowed from source to destination). This is equal to two connections composed by two edges that exist, and since they are built independently from each other, it happens that their probability is equal to the result shown in the picture before (the generating formula is a bit different, because there is a N-3 in one case since a node cannot be the destination, the source, and the other middle node) multiplied by itself (because there are two different ways to reach the destination). So we have the fourth power of 1/(3N). The 1/3 remains because picking 1 or 2 or 3 has the same probability.

One could argue, through, that those type of combined flows to reach a destination node with type 3 using type 3 flows and combined type 1 and 2 flows, have more than one combination. Like shown in the following picture.

(full pic: http://imgur.com/u2On2QB )

We observe that we do not want to be super precise, we want just an idea, so even approximating the result rounding down (without considering a coefficient higher than one that multiplies a formula and so on) gives us a conservative (we round down, not up) estimate of the result. So we observe that aside from a direct connection, that is 1/(3N) in chances, a double connection is the same term squared, quite smaller since N is normally 10 or more. Then we have a double connection and a direct connection (like the destination reached directly with 1 and indirectly with 2), and that is 1/(3) to the third power (direct connection odds multiplied by a double connection odds, same term but with exponent 1 and exponent 2, together one gets exponent 3). Then we have one singe triple connection, third power again. Then we have two double connections, same term, fourth power. Then we have a direct connection and an indirect connection made by 3 connections, a fourth power again. We can continue this listing all the possible combinations.

The point is that we do not really care when the term has an exponent bigger than 3, the result is likely small, even considering all the possible combinations that we can have to have a, say, a direct connection and an indirect connection using two edges.

Therefore what we really consider is: the probability that a node reaches another, directly or indirectly, with flow of type 3 given a list of random triplets, is:

That just says "look, we are not interested in whatever term of 1/(3) starting with the third power is there, we know is there, [ https://en.wikipedia.org/wiki/Big_O_notation#Little-o_notation but it is too small]".

We (I and you, the reader. I always wonder if I make a mistake using "we") are there more or less. I mean we just need to plot what is this probability given the number of triplets in input. In other words we can give the number of triplets and a fixed number of nodes and see how the probability changes.

Before it was said that there is a fixed relationship between the input size and the nodes, the latter are the double of the square (rounded) of the former.

We make another conservative estimation assuming that we process triplets until we succeed for a given destination node (because the triplets in input can be redundant and repeated, so we can succeed more than once), therefore we use the Bernoulli trials stopping at the first success. We can succeed at the first triplet, at the second, at the third and so on.

This would tell us the probability that a node is reached by a flow of type 3 by at least another node in the graph. It is not exactly (I think) the probability that two nodes would reach it, or three and so on but it is a good indicator since this "other node" is random.

For example if we have 100 triplets in input and the double of the square root of this value in nodes, so 20, we would have the following formula to compute the probability that sooner or later through those triplets one random node would reach another random one with flow of type 3:
This due to the form:

So, having an approximate probability of 0.81, that is 81%, with 100 triplets indicates that it is very likely that one node may be reached with flow 3 and so may be in the result.

We can then plot what happens when we have an input of 10 triplets, or 11 or 12 and so on, having the number of nodes equal to the square root of the triplets in input, times two, rounded down.

(full pic http://imgur.com/4WgeoE4 )

The zig zag is a curious phenomena that I still have to justify. The point is, though, that the more the triplets in input, the more a node is likely to be reached by another with a flow of type 3. So having results with many nodes seems less strange.

The next step will be, thought, to compute the probability that two or more nodes reaches a node at the same time with flow of type 3, this would provide better insights, even though the entire computation is done discarding possible factors that would increase the resulting probability. (so an round down approximation)

### What if we want two nodes reaching another with a flow of type 3?

Let's try to attack the next step. So we know the approximate probability of having a node in the graph reached, directly or indirectly, by another with a flow of type 3.

What if we want two nodes reaching another with a flow of type 3?

Given N nodes. So we choose one source, that should not be like the destination, therefore the probability of this pick should be: (N-1) / N because we can pick whatever node except the destination among a pool of N nodes. The destination, then, can be picked considering the pool of nodes aside from the picked source, and we want one precise node. Therefore 1 / (N-1) . This is exactly the formula seen above (we have to multiple this for 1/3, the probability to pick a flow of type 3).

Then we have a third node, that wants to reach the same destination. So to pick this node, we can pick whatever node that is not the destination or the first source, among a pool of N nodes, so (N-2)/N . This time we want to pick the same destination, so a certain one, given that we choose among a pool of N-2 nodes due to the fact that we exclude the previous source and the new one. So 1 / (N-2) , "1" because we want a specific node. The result, due to the semplification of N-2 is exactly like the formula above.

Since the events have to happen together but in an independent way, the resulting probability would be 1/(3N) squared.
With 3 nodes, would be the probability above, but to the third power, and so on. And those are the direct connection probabilities, if one considers also indirect connections, the additional probability is way smaller but is there.

The point is, though, that we can approximate the value for the following argument. A result with , say, 5 nodes well connected between each other could be approximated by a pentagon, where the vertex are those nodes and the edges are the connections (oriented all clockwise or anticlockwise) with flow of type 3. Each connection has a probability of happening of 1/(3N) so the probability of happening of all those five connections is the previous formula to the fifth power. This discarding all the other possibilities, that would make the event more probable due to the definition of bringing more possibilities.

So what we really want as probability is, as minimum, that a sort of a path between all those nodes happens, therefore we want to compute:

Actually I has to start to a number equal to the number of edges needed, otherwise it is surely always a failure, so I should start from 5 and the exponent should be I-5, my bad.

Anyway the result is dismally small. With 100 edges or triplets in input and 20 nodes (so the double of the square root of the edges) the probability is like one over ten millions. I guess I made a mistake somewhere (likely in the probability function, not in the code) but I leave this one here as reminder for myself to find a proper reply in the next future. Even negative results (or personal mistakes) are results, to remind us what no to do or what to investigate more. In this case this is a negative result that stands as TODO as soon as I sort out stuff that I deferred for too long now.

It seems dumb to write them down but I always like when I get hint about tools used in articles that I like and furthermore it is useful for me of the future. More often than not I see what I did in the past and, when I like it, I ask "how did I do it?". Even the smallest tool can be crucial. So I hope to always remember to mention my tools if the documentation that I write is not so small, and I hope all the others out there writing documentation would do it too!

• asus 500 v2 with openwrt 12.09 for arm CPU
• bash
• sqlite
• winscp