Ten interesting algorithms every Software Engineer should know

Arun Kumar Dave
10 min read · Dec 24, 2020

#1 Bloom Filter

A Bloom Filter is a data structure designed to tell, rapidly and memory-efficiently, whether an element is present in a set. It is based on a probabilistic mechanism where false-positive retrieval results are possible, but false negatives are not. In other words, a query returns either “possibly in set” or “definitely not in set”. Elements can only be added, not removed; the more items added, the larger the probability of false positives. Bloom filters are both fast and space-efficient.
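
As a rough illustration, here is a minimal sketch of the idea in Python. It derives several bit positions per item from a single SHA-256 digest; the array size, hash count, and digest-slicing trick are illustrative choices, not canonical parameters (production filters size these from the expected item count and target error rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Carve num_hashes 4-byte chunks out of one digest to get bit positions
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in set"; True means "possibly in set"
        return all(self.bits[pos] for pos in self._positions(item))
```

A returned False is authoritative; a returned True may be a false positive, exactly as described above.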

#2 PageRank

PageRank is named after Google co-founder Larry Page and is used to rank websites in Google’s search results. It counts the number and quality of links to a page to estimate how important the page is, on the underlying assumption that important pages are more likely to receive links from other pages.

PageRank is defined in the original Google paper as follows (a code sketch of this formula appears after the definitions):

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where,

  • we assume that a page A has pages T1 to Tn which point to it (i.e., are citations).
  • d is a damping factor which can be set between 0 and 1. It is usually set to 0.85.
  • C(A) is defined as the number of links going out of page A.
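
A minimal iterative sketch of this formula in Python, assuming a small hypothetical link graph where every page has at least one outgoing link (note that in this formulation the ranks sum to the number of pages rather than to 1):

```python
def pagerank(links, d=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            # PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A
            page: (1 - d) + d * sum(pr[t] / len(links[t])
                                    for t in pages if page in links[t])
            for page in pages
        }
    return pr

# A tiny hypothetical web of three pages
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))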

#3 Hilbert Curve

A Hilbert curve is a space-filling curve that maps a one-dimensional index onto points in a 2D grid while preserving locality: points that are close along the curve stay close in the plane. The standard algorithm is a recursive procedure involving only simple integer operations; it is elegant, short, considerably easier to implement than previous recursive and nonrecursive algorithms, and can be implemented efficiently in any programming language that has integer operations and allows recursion. Location-based databases, used extensively by apps like Google Maps, Uber, and Swiggy, rely on data structures and algorithms that enable spatial queries, such as the quad tree and the Hilbert curve.
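
As an illustration, here is the well-known iterative index-to-coordinate conversion for a curve filling an n × n grid (a Python translation of the version popularized on Wikipedia; n must be a power of two):

```python
def d2xy(n, d):
    # Convert distance d along the Hilbert curve into (x, y) in an n x n grid
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves join up correctly
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# The first few points of the curve on a 4 x 4 grid; each step moves one cell
print([d2xy(4, d) for d in range(8)])
```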

#4 HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset; it was introduced by Philippe Flajolet and his co-authors and is used at scale by companies such as Facebook. Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities greater than 10⁹ with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.

Suppose you have a large data set of elements with duplicate entries chosen from a set of cardinality n and you want to find n, the number of distinct elements in the set. For example, you would like to compute the number of distinct people who visited Facebook in a given week, where every person logs in multiple times a day. This would result in a large data set with many duplicates. Obvious approaches, such as sorting the elements or simply maintaining the set of unique elements seen, are impractical because they are either too computationally intensive or demand too much memory.
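
A bare-bones sketch of the estimator in Python, using 64-bit hashes and 2^b registers (real implementations such as Redis’s add corrections for small and very large ranges, which are omitted here):

```python
import hashlib

def hyperloglog(items, b=10):
    # m = 2^b registers; each stores the max "rank" seen for its substream
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - b)                    # first b bits pick a register
        w = h & ((1 << (64 - b)) - 1)          # remaining 64 - b bits
        rank = (64 - b) - w.bit_length() + 1   # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction, valid for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(hyperloglog(range(100_000))))  # typically within a few percent of 100000
```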

#5 Collaborative Filtering

Collaborative filtering filters information by using the interactions and data collected by the system from other users. It’s based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future. The concept is simple: when we want to find a new movie to watch we’ll often ask our friends for recommendations. Naturally, we have greater trust in the recommendations from friends who share tastes similar to our own.

Most collaborative filtering systems apply the so-called similarity index-based technique. In the neighborhood-based approach, a number of users are selected based on their similarity to the active user. Inference for the active user is made by calculating a weighted average of the ratings of the selected users. Collaborative-filtering systems focus on the relationship between users and items. The similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items.

This is, at its core, how services like Netflix recommend content to their users.
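
A toy sketch of the neighborhood-based approach in Python, with users represented as hypothetical item-to-rating dicts (real systems use more robust similarity measures and heavy optimization):

```python
import math

def cosine_similarity(a, b):
    # a and b are dicts mapping item -> rating; compare on co-rated items only
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_rating(active, others, item):
    # Weighted average of the neighbours' ratings for `item`,
    # weighted by each neighbour's similarity to the active user
    weighted = total = 0.0
    for user in others:
        if item in user:
            sim = cosine_similarity(active, user)
            weighted += sim * user[item]
            total += abs(sim)
    return weighted / total if total else None

active = {"Heat": 5, "Alien": 4}
others = [{"Heat": 5, "Alien": 5, "Up": 2},
          {"Heat": 1, "Alien": 2, "Up": 5}]
print(predict_rating(active, others, "Up"))  # a blend weighted by similarity
```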

#6 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a search technique in the field of Artificial Intelligence (AI). It is a probabilistic, heuristic-driven search algorithm that combines classic tree search with ideas from reinforcement learning.

In tree search, there is always the possibility that the current best action is not actually the optimal one. In such cases, the MCTS algorithm becomes useful, as it continues to periodically evaluate other alternatives during the learning phase by executing them, instead of the currently perceived optimal strategy. This is known as the “exploration-exploitation trade-off”: exploit the actions and strategies found to be best so far, but keep exploring the space of alternative decisions to find out whether any could replace the current best.
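
This trade-off is usually implemented with the UCT formula (UCB1 applied to trees): a node’s score balances its observed win rate against how rarely it has been visited. A minimal sketch:

```python
import math

def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    # Exploitation (average reward) + exploration (bonus for rarely tried nodes)
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

During the selection phase, MCTS repeatedly descends to the child with the highest score; larger values of c favor exploration, smaller values favor exploitation.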

#7 Kadane’s Algorithm

The idea behind Kadane’s algorithm is to scan the array once, tracking the maximum sum of a subarray ending at the current position (max_ending_here is used for this) and the best sum seen anywhere so far (max_so_far is used for this). Whenever max_ending_here exceeds max_so_far, max_so_far is updated.

Kadane’s algorithm works by maintaining the start position of a subarray and repeatedly looking at the next element in the array, deciding to either (as sketched below):

  • extend the subarray by appending that element, or
  • discard the subarray and start a new one at that element.
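
A compact Python version of this, using the two variable names from above:

```python
def max_subarray_sum(arr):
    max_so_far = max_ending_here = arr[0]
    for x in arr[1:]:
        # Either extend the current subarray with x, or start fresh at x
        max_ending_here = max(x, max_ending_here + x)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far

print(max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # 6, from [4, -1, 2, 1]
```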

#8 Consistent Hashing

Load balancing is a key concept in system design. One of the popular ways to balance load in a system is to use consistent hashing. Consistent Hashing allows requests to be mapped into hash buckets while allowing the system to add and remove nodes flexibly, so as to maintain a good load factor on each machine. The naive way to hash objects is to map them directly to a key space and transfer each key’s load to the mapped machine; a system using this policy is likely to suffer when nodes are added or removed, because most keys then get remapped.

Consistent Hashing instead maps the servers themselves onto the key space and assigns each request (mapped to a hash bucket, called load) to the next server clockwise on the ring. Servers can then store the relevant request data while the system retains flexibility and scalability.
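
A minimal sketch of a hash ring in Python, with virtual nodes (replicas) to spread load more evenly; the hash function and replica count here are arbitrary illustrative choices:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual nodes per server, for smoother load
        self.ring = []             # sorted list of (hash, server) points
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        # The next point clockwise from the key's hash owns the key
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get("user:42"))  # whichever server's arc contains this key's hash
ring.remove("server-b")     # only keys on server-b's arcs move elsewhere
```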

#9 Diffie Hellman — Curve25519

The Diffie-Hellman algorithm is used to establish a shared secret for secret communication while exchanging data over a public network; in the elliptic-curve variant, the two parties derive the shared secret from points on an elliptic curve.

WhatsApp uses Curve25519, a state-of-the-art Diffie-Hellman function, which computes a user’s 32-byte public key. Given one user’s 32-byte secret key and another user’s 32-byte public key, Curve25519 computes a 32-byte secret shared by the two users. This secret can then be used to authenticate and encrypt messages between them.

The protocol uses compressed elliptic-curve points (only X coordinates), which allows efficient use of the Montgomery ladder for ECDH, using only XZ coordinates. Curve25519 is constructed to avoid many potential implementation pitfalls: by design it is resistant to timing attacks, and it accepts any 32-byte string as a valid public key, without requiring validation that a given point belongs to the curve or is generated by the base point.
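
A short sketch of an X25519 key agreement using Python’s third-party cryptography package; this illustrates the primitive itself, not WhatsApp’s full Signal-protocol handshake:

```python
# Requires the `cryptography` package (pip install cryptography)
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

# Each party generates a 32-byte secret key and derives a 32-byte public key
alice_private = X25519PrivateKey.generate()
bob_private = X25519PrivateKey.generate()

# Each side combines its own secret key with the other's public key...
alice_shared = alice_private.exchange(bob_private.public_key())
bob_shared = bob_private.exchange(alice_private.public_key())

# ...and both arrive at the same 32-byte shared secret
assert alice_shared == bob_shared
```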

#10 Elo Rating

The Elo rating algorithm is a widely used rating method for ranking players in many competitive games; it has also been used to rate people in dating apps like Tinder, OkCupid, and Bumble.

FaceMash was Facebook’s predecessor, developed by Mark Zuckerberg in his second year at Harvard. If you’ve seen The Social Network, you might remember the scene where Zuckerberg asks Eduardo Saverin for the equation used to rank chess players, except this time it was for rating the ‘attractiveness’ of female Harvard students.

The movie claims that the Elo equations (although written incorrectly in the film) were used in the original FaceMash website. Two students were shown side by side and users could vote on who was more attractive. Here, Rᴬ can be interpreted as student A’s Elo rating, Rᴮ as student B’s Elo rating, Eᴬ as the probability that student A is more attractive than student B, and Eᴮ as the probability that student B is more attractive than student A.

Applying the same principle in Tinder, the difference in the ratings between two users serves as a predictor of the outcome of a swipe. Two users with equal ratings who are compared against each other are expected to score an equal number of wins. A user whose rating is 100 points greater than their opponent’s is expected to score 64%; if the difference is 200 points, then the expected score for the stronger player is 76%.
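
In formula form, Eᴬ = 1 / (1 + 10^((Rᴮ − Rᴬ)/400)), and after each result a rating moves by K times the surprise. A minimal Python sketch (K = 32 is a common choice, not a universal constant):

```python
def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_rating(rating, expected, actual, k=32):
    # actual is 1 for a win, 0.5 for a draw, 0 for a loss
    return rating + k * (actual - expected)

print(round(expected_score(1500, 1400), 2))  # 0.64: a 100-point edge
print(round(expected_score(1500, 1300), 2))  # 0.76: a 200-point edge
```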

Many of the references here come from various articles, publications, and videos by Gaurav Sen.
