We continue from DeepSeek and the AI Bubble; Napkin Calculation. You may wish to first peruse AI for Dummies, in Bite-Size Pieces, Part 1.
Some may accept the probabilities outlined in DeepSeek and the AI Bubble; Napkin Calculation as worthy of consideration in the same manner as the Technological Singularity. Both are typical of futurology, lacking hard justification. Yet those who have infrequent contact with futurology may demand a little more justification of the napkin calculation. The gist is this: when there exist a sufficient number of reasonably independent ways an event can happen, that event has a statistical character. Mathematicians use this very precisely with the ergodic theorems; we take license, applying it imprecisely but with strong analogy.
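To make that statistical character concrete, here is a minimal sketch, assuming N reasonably independent routes to an outcome, each with a small per-period probability p. The numbers are illustrative only and are not drawn from the napkin calculation itself.

```python
# Illustrative only: a toy version of the "many independent ways" principle,
# not a model of any particular forecast.

def prob_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent routes pays off."""
    return 1.0 - (1.0 - p) ** n

# Individually unlikely routes become a near-certain aggregate as n grows.
for n in (1, 10, 50, 200):
    print(f"p=0.02, N={n:3d} -> P(at least one) = {prob_at_least_one(0.02, n):.3f}")
```

The point is only the shape of the curve: as the count of independent routes grows, the aggregate probability drifts toward certainty, which is what gives the event its statistical character.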
AI is a field where many things can happen. See (Wikipedia) Timeline of artificial intelligence. 1943 saw publication by Warren Sturgis McCulloch and Walter Pitts of the seminal paper, “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Since then, many speculative AI ideas have temporarily been the focus of attention, with minor development of some, before sinking under the intermittent surge of newer ones.
In most fields, the past is littered with discredited ideas. Rather uniquely, AI's past is not. The vast majority of the undeveloped past of AI remains fertile ground. Most of it is eligible for renewed interest. We cannot tell which idea and when; the statistical principle applies.
The bedrock ideas of AI, which are distinct from modeling, are a set of contradictions, governed by tradeoffs and practicality:
- Structure versus amorphous hardware architecture.
- Self-organization of the hardware versus design.
- Self-organization of the weights versus specification.
- Algorithmic versus statistical operation during training.
- Neuromorphic hardware versus simulation of neuronal structures.
- Strong AI versus Weak AI.
The first item is relevant to DeepSeek, so let’s first focus on it. The designers claim to obtain performance comparable to ChatGPT-4 while using less energy, less computing, and less training. Detractors suspect that DeepSeek cleverly cannibalized other AIs, with the premasticated data responsible for the training economies. Even if this turns out to be the case, a significant aspect of the DeepSeek claim may be due to genuine improvement.
In the early years of AI, notwithstanding the early work of McCulloch and Pitts, neural networks were assumed to have little in the way of complex structure, other than that induced by the weights. In theory, if every neuron were connected to every other neuron, in a structure resembling an amorphous glass in hyperspace, and if you knew how to set the weights or train the thing, the useless and redundant connections would vanish, and the optimal network would self-assemble. This has no practicality; the time to train is unrealizable in the real world. It does have a mysterious attraction: the possibility of elusive, spontaneous, general intelligence, g.
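As a minimal sketch of that amorphous ideal, the toy below wires every neuron to every other neuron and then prunes small weights. Random weights stand in for the training step, which is exactly the impractical part, and the magnitude threshold is an arbitrary stand-in for the hoped-for vanishing of useless connections.

```python
import numpy as np

# A toy "amorphous" network: every neuron connected to every other neuron.
rng = np.random.default_rng(0)
n_neurons = 8
weights = rng.normal(size=(n_neurons, n_neurons))
np.fill_diagonal(weights, 0.0)        # no self-connections

# The hope behind the ideal: redundant connections shrink toward zero during
# training, so pruning them leaves a structure that has, in effect,
# self-assembled.  A simple magnitude threshold plays that role here.
threshold = 1.0
pruned = np.where(np.abs(weights) > threshold, weights, 0.0)

print("connections before pruning:", np.count_nonzero(weights))
print("connections after pruning: ", np.count_nonzero(pruned))
```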
So structure was imposed. Some structures, like Rosenblatt’s perceptron, are layered, with an orderly progression from input to output. Some structures incorporate loops, saving processing units through re-use, which allows the net to consider the results of its work multiple times. Some networks start with everything zeroed out, while others are initialized with hopefully helpful patterns. With the advent of large language models, layered systems took the lead.
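Here is a minimal sketch contrasting the two styles of imposed structure, in toy NumPy form; the dimensions, the random weights, and the choice of three passes through the loop are arbitrary, chosen only to show the shape of each idea.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                      # a toy input vector

# Layered, perceptron-style: an orderly progression from input to output,
# with a separate weight matrix for each layer.
w1 = rng.normal(size=(5, 4))
w2 = rng.normal(size=(3, 5))
layered_out = np.tanh(w2 @ np.tanh(w1 @ x))

# Looped: the same unit is re-used, letting the net reconsider the results
# of its own work several times instead of spending units on extra layers.
w_loop = rng.normal(size=(4, 4))
state = x
for _ in range(3):                          # three passes through one unit
    state = np.tanh(w_loop @ state)

print("layered output:", layered_out)
print("looped output: ", state)
```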
Why they took the lead is not explainable in formal terms, because neural nets are the Wild West of mathematics. There appears to be a lot of truth in this bold statement: If a principle can be proven, it has limited utility, and if it is useful, it can’t be proven. The atmosphere contrasts with the frontiers of physics, where the latest developments are rigorous and disconnected from validation.
The hardware support for a natural language model has to be sufficient to store enough information to discriminate between all the patterns (words, sentences, logical statements, etc.) that the designers intend to train with. Given a vocabulary of a certain size, how big does a network have to be? The Wild West has no answer to this question. The designers of ChatGPT et al. proceed empirically. If a designer desired a better answer, he would look at two papers: On Neural Network Kernels and the Storage Capacity Problem and Memory capacity of large structured neural networks. You need look only at the abstracts. The first paper works with a two-layer, infinite network, which cannot be made or used, to derive some rigorous results that have no real-world application. The second abstract depicts frustrated, limited research on more practical configurations, but is noteworthy for the ratio of kvetch to result.
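For flavor, here is a minimal sketch of the try-it-and-see procedure, using scikit-learn's MLPClassifier as a stand-in. The pattern counts, input width, and hidden-layer size are arbitrary, nothing here is specific to language models, and exactly where recall starts to slip will vary from run to run.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Empirical capacity check: how many random pattern -> label pairs can a
# fixed-size net memorize before recall on its own training set degrades?
rng = np.random.default_rng(2)

for n_patterns in (50, 200, 800):
    X = rng.normal(size=(n_patterns, 20))        # random "patterns"
    y = rng.integers(0, 2, size=n_patterns)      # random labels to memorize
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
    net.fit(X, y)
    print(f"{n_patterns:4d} patterns -> recall accuracy {net.score(X, y):.2f}")
```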
In consequence, there is no first-principles way to tell how efficient a net is at storage. Try it and see is the order of the day. Regardless of the truth, or lack thereof, of accusations that DeepSeek is a DeepFake, this uncertainty is one hole DeepSeek found, crawled through, and successfully exploited.
This is a lot for your neural network to digest, so we’ll continue shortly.