Emir's blog
https://emiruz.com/
Recent content on Emir's blog
Hugo -- gohugo.io
en-gb
Sun, 28 Apr 2024 00:00:00 +0000

RBF kernel approximation with random Fourier features
https://emiruz.com/post/2024-04-29-random-fourier/
Sun, 28 Apr 2024 00:00:00 +0000
Linear methods are my favourite for elegance. I’m not sure any other area of applied maths has such a rich toolbox of practical theorems to draw from. A basic application of linear methods is the linear regression model. However, in a machine learning setting, vanilla models have the problem that they (1) typically have few parameters and saturate quickly, and (2) imply a rigid geometry (a hyper-plane) which is often unrealistic.

Metric learning with linear methods
https://emiruz.com/post/2024-04-24-metric-learning/
Wed, 24 Apr 2024 00:00:00 +0000
I read this paper a while ago, which sets out the problem of linear metric learning nicely but then proceeds to solve it in a way I personally thought unnecessarily indirect. It seemed to me that there was a neat analytical solution, so I thought I’d have a go. It turned out to be quite straightforward.
Say we have some feature vectors \(x_i \in \mathbb{R}^p\) and some responses \(y_i \in \mathbb{R}^k\), I want:

The "Billion Row Challenge!" with Fortran
https://emiruz.com/post/2024-03-27-1brc/
Sun, 24 Mar 2024 00:00:00 +0000
SUMMARY
I tackle 1BRC in Fortran, which requires processing 1B rows of weather station data (~15GB) to obtain min/max/mean for each station as quickly as you can muster. I started out with a time of 2m8s and reduced it to a best run time of <6s on a 4 i7 laptop with 16GB RAM. I herein document how.
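As a point of reference for the task described above, the per-station aggregation can be sketched in a few lines of Python. This is a naive, single-threaded illustration and not the Fortran approach documented in the post. It assumes one `station;temperature` reading per line; the function name `aggregate` and the second Hamburg reading in the sample are hypothetical.

```python
def aggregate(lines):
    """Return {station: (min, mean, max)} from 'station;temperature' lines."""
    stats = {}  # station -> [min, max, sum, count]
    for line in lines:
        station, sep, value = line.strip().partition(";")
        if not sep:
            continue  # skip blank or malformed lines
        t = float(value)
        s = stats.get(station)
        if s is None:
            stats[station] = [t, t, t, 1]
        else:
            s[0] = min(s[0], t)
            s[1] = max(s[1], t)
            s[2] += t
            s[3] += 1
    return {k: (lo, total / n, hi) for k, (lo, hi, total, n) in stats.items()}


# Hypothetical sample; the second Hamburg reading is made up for illustration.
sample = ["Hamburg;12.0", "Bulawayo;8.9", "Palembang;38.8", "Hamburg;10.0"]
result = aggregate(sample)
```

A single pass with a running min, max, sum and count per station keeps memory proportional to the number of stations rather than the number of rows.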
INTRODUCTION
The 1BRC data looks like this:
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St.

Advent of Code in Prolog, Haskell, Python and Scala
https://emiruz.com/post/2024-02-02-prolog-haskell/
Fri, 02 Feb 2024 00:00:00 +0000
Here are some Advent of Code solutions:
2023 (Prolog)
2022 (Haskell)
2021 (Python & Scala) (in progress at the time of writing).
Here are some comparative notes:
My Haskell solutions were mostly < 27 LoC. The Prolog solutions were considerably longer. The Prolog solutions were, on average, much harder to code for me.
My Prolog solutions ended up looking rather functional for the most part.

Domicles: a novel logic puzzle using Dominoe tiles
https://emiruz.com/post/2023-11-13-domicles/
Sun, 19 Nov 2023 00:00:00 +0000
INTRODUCTION
[If you want to have a go straight away, jump to the examples at the bottom of this post.]
Making a novel logic puzzle has been a bucket list item for me since yesteryear and I was finally handy enough with Prolog to endeavour for something elegant without having to write reams of code. I arbitrarily decided that I wanted the puzzle to be expressed in terms of Dominoe tiles.

A minimal probabilistic Prolog meta-interpreter
https://emiruz.com/post/2023-10-18-minimal-prob-prolog/
Wed, 18 Oct 2023 00:00:00 +0000
What follows are some notes about a minimal proof-of-concept for a stochastic simulator in Prolog via a meta-interpreter.
META-INTERPRETER
Here is a Prolog meta-interpreter which supports probabilistic head clauses through the use of the p/2 predicate:
prove(true) :- !.
prove((A,B)) :- !, prove(A), prove(B).
prove(Head) :-
    clause(Head,Body),
    (p(Head,P)->(random(X),1-P<X);true),
    prove(Body).
sim(_,0,S,S) :- !.
sim(Goal,N0,Acc0,S) :-
    (prove(Goal)->Acc is Acc0+1;Acc is Acc0),
    N is N0-1,
    sim(Goal,N,Acc,S).
sim(Goal,N,P) :- sim(Goal,N,0,S), P is S/N.
Here is an example program:

Better data analysis with logic programming
https://emiruz.com/post/2023-10-15-logical-data-analysis/
Sun, 15 Oct 2023 00:00:00 +0000
INTRODUCTION
Gentle reader, permit me to try and convince you that data analysis is better with logic programming. In this post I’ll analyse a staple dataset – the ggplot2 diamond prices – using a symbolic approach which, I will demonstrate, is able to establish a robust model, otherwise difficult to recover.
DATA
I’ll use the diamond prices data which comes with the R ggplot2 package. It consists of information about 50k+ round-cut diamonds.

Hidden information and solving Dominoes
https://emiruz.com/post/2023-10-06-dominoes/
Fri, 06 Oct 2023 00:00:00 +0000
Summary
Some notes about the construction of a Block Dominoe playing algorithm for a hidden information variant of the game. I build a game simulator, learn from a heuristic algorithm and then develop some play-out based algorithms which seem fairly good. I conjecture the final algorithm approximates optimal play.
The final SWI Prolog implementation is available here.
I am selling an optimised Javascript library version, embeddable in both the browser and the backend.

Analysis of the data job market using "Ask HN: Who is hiring?" posts
https://emiruz.com/post/2023-08-12-data-jobs/
Sat, 12 Aug 2023 00:00:00 +0000
SUMMARY
I parse HackerNews (HN) “Ask HN: Who is hiring?” posts from 2013 to time of writing and analyse them to better understand the trends in the data job market with a focus on the fate of data science. Here are my main conclusions:
It is likely that the Data Scientist role is in a long-term decline and that skills such as data mining and visualisation are also out of favour.

An optimal-stopping quant riddle
https://emiruz.com/post/2023-07-30-optimal-stopping/
Sun, 30 Jul 2023 00:00:00 +0000
Introduction
I happened upon a post by Gwern discussing, in some detail, various solutions to riddle #14 from Nigel Coldwell’s list of quant riddles. I initially got as far as the problem description in Gwern’s article and avoided reading further so I could first solve it for myself. The problem is stated as follows:
You have 52 playing cards (26 red, 26 black). You draw cards one by one. A red card pays you a dollar.

Estimating gym goers: a mark and recapture experiment
https://emiruz.com/post/2023-07-05-gym-mark-recapture/
Wed, 05 Jul 2023 00:00:00 +0000
Introduction
I had recently started going to a new specialist gym that runs 3 classes per day during the working week and is closed the rest of the time. I’ve been at a few different times on a few different days, and already I was seeing many of the same people from the first class. It occurred to me that the chance of seeing the same faces should somehow scale with the number of people going to the gym, hence it may be possible to estimate the total number of gym members from the number of people I repeatedly see.

Blocking, covariate adjustment and optimal experiment design
https://emiruz.com/post/2023-06-18-doe/
Sun, 18 Jun 2023 00:00:00 +0000
Summary
I explain blocking, optimal design and covariate adjustment as methods to improve power in design of experiments. I try to motivate this as something data scientists working with online experiments ought to be doing since it can drastically improve the power of an experiment and make design of experiments tractable where otherwise it would not be. I also implement a D-optimal design fitting algorithm from first principles in Python to give the reader a deeper sense of what optimal design does, and I provide a slightly hand-wavy example to sketch how all these methods could be used together in the real world.

Semi-supervised clustering with logic programming
https://emiruz.com/post/2023-05-12-semi-supervised-clustering/
Fri, 12 May 2023 00:00:00 +0000
Summary
I motivate clustering as a problem well suited to logic programming in the general case, and I volunteer a couple of artisanal clustering algorithms in Prolog demonstrated on some mock data.
Note: the code herein is my own. If you see bugs, or are a Prolog mage and can write it even more concisely, I’d be grateful if you could let me know.
Introduction
There are many clustering algorithms born in specific circumstances such as k-means (via vector quantisation), biclustering (via gene expression analysis), DBSCAN (via spatial analysis) and so on, which went on to mostly be abused in the general setting.

Prolog for data science
https://emiruz.com/post/2023-04-30-prolog-for-data-science/
Sun, 30 Apr 2023 00:00:00 +0000
Summary
I demonstrate a widely applicable pattern which integrates Prolog as a critical component in a data science analysis. Analytic methods are used to generate properties about the data under study and Prolog is used to reason about the data via the generated properties. The post includes some examples of piece-wise regression on timeseries data by symbolic reasoning. I also discuss the general pattern of application a bit.
Introduction
Given some data, the bulk of “data science” for me is the study of what the data implies and whether it can be coerced into the context-specific role usually decided by someone other than me.

SQL + M4 = Composable SQL
https://emiruz.com/post/2022-12-28-composable-sql/
Wed, 28 Dec 2022 00:00:00 +0000
Introduction
I often work with clients who have large “data lakes” or big star schema style enterprise databases with fact and dimension tables as far as the eye can see. Invariably said clients end up with a substantial SQL codebase composed of hundreds of independent queries with lots of overlap between them. I want to be able to treat SQL repositories like I’d treat other codebases. That is, I’d like to create libraries, share code, test blocks independently, and so on.

A beautiful embedding applied to defect detection
https://emiruz.com/post/2022-11-16-defect-detection/
Wed, 16 Nov 2022 00:00:00 +0000
Introduction
“Data science” has a handful of fundamental metaphors for problem solving, few more versatile than the “point cloud”. That is, translate your data into points in an n-dimensional metric space and then do linear algebra to it. The point cloud metaphor applies most simply to numeric tabular data, but with a little creativity it readily extends to text, images, time-series and so on. In this post I’m going to tackle the KolektorSDD2 image dataset – a collection of normal and defective surfaces for some unnamed product – by using the point cloud metaphor to create a simple custom embedding and then exploiting it with elementary regression methods.

A fixed effect UK house price imputation model
https://emiruz.com/post/2022-05-21-uk-houses/
Sat, 21 May 2022 00:00:00 +0000
SUMMARY
I show how assumptions about price structure can be used to build a compelling fixed effect (deterministic) price imputation model for the UK residential housing market. The model uses just public price paid data. I describe how the data is collected and processed, how the model is designed, and how it is fitted using the Jax Python package. I showcase some results, I discuss shortcomings and I highlight further necessary work prior to use for decision making under uncertainty.

Fast thinking on lichess.org
https://emiruz.com/post/2022-04-15-lichess1/
Fri, 15 Apr 2022 00:00:00 +0000
SUMMARY
I use lichess.org games data to investigate the extent to which fast thinking is the dominant factor affecting game outcomes at any time control. I show how to (1) frame a pseudo-experiment, (2) database lichess.org data, and (3) carry out the analysis. I argue that fast thinking is most prominent in quick games. I analyse a sample containing games from pairs of users who have played each other at multiple time controls and show that win probabilities established using 180 sec Blitz games are heavily discounted in 600 sec Rapid games.

Hello and goodbye to the J language
https://emiruz.com/post/2021-07-02-j/
Fri, 02 Jul 2021 00:00:00 +0000
I spent about 50 hours making things with a language called J. It’s an APL progeny and it promises to make possible the expression of general programming tasks as if in mathematical notation. In J, arrays are first class citizens, and most functions natively support array operations. It also has fancy composition rules, so rather than the usual f(a,b), in J you have either f a or a f b.

Some less usual IQ scepticism
https://emiruz.com/post/2020-12-01-iq-rabbit-hole/
Tue, 01 Dec 2020 00:00:00 +0000
INTRODUCTION
The crux with IQ – so far as I understand it – is that performance across abstract reasoning tasks is correlated no matter what the tasks are. That is, being good at one type of abstract puzzle implies that you’re more likely to be good at any other such puzzle. If you got \(M\) people to do \(N\) puzzles and made an \(M\times N\) matrix of their scores, and then did some SVD on it, the first singular value would account for much of the weight, and if you were to rank the rows (participants) by their weight in the first left singular vector, you’d have created something akin to an IQ score.

About me
https://emiruz.com/about-me/
My name is Emir. I do research commercially, mostly by applying maths, stats and comp. sci. I’ve been at it about 7 years. I’m also a software engineer of 18+ years and an astronomer (PhD candidate). Aside from the PhD in progress, my academic background is in Analytical Philosophy (BA) and Applied Maths (GDip, MSc). My LinkedIn is here. You can contact me by email. Construct my email address by prefixing the domain name with hi@.