
Elena Trajkova

Aug 28, 2024

Insights

Is Artificial General Intelligence (AGI) on the brink of surpassing human intelligence?


You've most likely heard about GPT-4 and its remarkable capabilities. In this article, we walk through the 155-page research paper "Sparks of AGI," in which Microsoft researchers examine an early version of GPT-4, one of OpenAI's latest technology advancements. The model convincingly demonstrates remarkable performance in many areas, such as answering Fermi questions, which require logic and approximation to estimate quantities that are hard to measure directly, for example: "How long would it take to count to one billion?" It can also handle tasks grounded in the physical world, like assisting a human in repairing a water leak. The paper's central claim is that the model can be considered early progress towards an artificial general intelligence system. The early experiments showcase GPT-4's ability to solve novel and difficult tasks in various fields, such as coding, mathematics, drawing, music, and even tricky social situations.

We review these claims by analyzing the early experiments and discussing the achievements and known limitations.

Unveiling GPT-4: How do you measure genius?

A common challenge computer scientists face is assessing the performance of a model trained on an unprecedented amount of text data. Most models are evaluated on independent benchmark datasets across various domains, a method intended to measure actual learning rather than mere memorization. The authors argue, however, that such standard evaluation falls short here for two main reasons. First, since GPT-4's complete training data is not publicly accessible, we have to assume it has likely encountered most existing benchmarks. More importantly, GPT-4's intelligence is characterized by its generality, allowing it to perform tasks that are out of reach for domain-specific AI systems; evaluating it on generative or interactive tasks is challenging because these have no single correct solution.

In light of this, the authors have opted for an approach that aligns more with traditional psychology than machine learning. They aim to leverage human ingenuity and curiosity to demonstrate GPT-4's deep and flexible understanding by testing it on novel and challenging tasks.

Milestones: A journey through GPT-4's achievements 

Throughout this research paper, the authors present some impressive things GPT-4 can do. They conducted experiments in several domains specifically chosen to analyze its fundamental abilities: learning from experience, capacity to plan, quick learning, and problem-solving.

The eyes of artificial intelligence: Visual feats

Even though the early version of GPT-4 studied in the paper is a text-only model with no visual input, it can still comprehend simple visual concepts. For example, it can generate Scalable Vector Graphics (SVG) code depicting four object classes: a car, a truck, a cat, and a dog.

Ref: Sparks of Artificial General Intelligence (page 16, Figure 2.4)

Ref: Sparks of Artificial General Intelligence (page 16, Figure 2.5)

Even so, similar images could be present in the training data, and one may argue that the model simply reproduced code it had seen there. In the following instance, however, GPT-4 is instructed to combine the letters H, O, and Y to draw a human figure and even add clothing. Mapping letter shapes onto the parts of a human body showcases a grasp of visual concepts that goes beyond copying existing code.
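To make the letter-drawing task concrete, here is a rough, hand-written sketch of the idea (the coordinates and font sizes are our own illustrative choices, not the model's output from the paper): the O forms the head, the H the torso and arms, and the Y the legs.

```python
# Hand-written illustration of the letter-person idea (not GPT-4 output):
# stack the letters O (head), H (torso and arms), and Y (legs) in an SVG.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="200">
  <text x="45" y="50"  font-size="40">O</text>
  <text x="40" y="110" font-size="56">H</text>
  <text x="44" y="175" font-size="56">Y</text>
</svg>"""

with open("letter_person.svg", "w") as f:
    f.write(svg)
```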

Noteworthy: Musical creations

GPT-4 can even write sheet music in ABC notation, although the outcome is relatively basic and limited. It produces valid melodies with repetitive rhythms and can explain the overall structure of a tune. Its limitations become apparent in harmony, where the model demonstrates little to no conceptual understanding. A possible explanation offered in the paper is the limited adoption of ABC notation, which would also explain why GPT-4 cannot even recognize well-known melodies, such as Beethoven's Ode to Joy.
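For readers unfamiliar with the format, here is a minimal hand-written ABC tune (our own example, not one of the model's outputs): a few header fields (index, title, meter, default note length, key) followed by the melody itself.

```
X:1
T:Example Tune
M:4/4
L:1/4
K:C
C D E F | G A B c | c B A G | F E D C |
```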

Bit by bit: Cracking the code

This section highlights GPT-4's coding capabilities through coding challenges and real-world applications. The model demonstrates proficiency across complex tasks, from low-level components to high-level architectures. Additionally, it can interpret and execute pseudocode, which requires understanding informal and often imprecise expressions that no programming language would accept.
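As a toy illustration of what executing pseudocode means (our own example, not taken from the paper), the model can take an informal description like the comment below and either trace its output step by step or translate it into real code:

```python
# Informal pseudocode of the kind the model is asked to interpret:
#   "for each number in the list, if it is even, add it to the total"
# A faithful Python translation:
def sum_evens(numbers):
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n
    return total

print(sum_evens([1, 2, 3, 4, 5, 6]))  # 12
```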

When tested against several benchmarks, GPT-4 significantly outperforms other large language models. It achieves nearly 20% higher accuracy on the HumanEval and LeetCode benchmarks than the second-best model, text-davinci-003 (the base model of ChatGPT). On top of that, GPT-4's performance on the LeetCode benchmark nearly matches human performance, falling short by only 0.2%.

More examples include developing simple games in HTML and JavaScript, writing highly complex LaTeX code, and even tackling deep learning tasks, such as writing a custom PyTorch optimizer.
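For context, writing a custom PyTorch optimizer means subclassing torch.optim.Optimizer and implementing its update step. The sketch below shows the general shape of such a class; it is a simplified stand-in (plain SGD with per-element gradient clipping), not the optimizer GPT-4 actually produces in the paper.

```python
import torch
from torch.optim import Optimizer

class ClippedSGD(Optimizer):
    """Plain SGD that clips each gradient element to [-clip, clip]."""

    def __init__(self, params, lr=0.01, clip=1.0):
        defaults = dict(lr=lr, clip=clip)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Clamp each gradient element, then take a plain SGD step.
                g = p.grad.clamp(-group["clip"], group["clip"])
                p.add_(g, alpha=-group["lr"])
        return loss

# Usage: optimizer = ClippedSGD(model.parameters(), lr=0.1)
```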

Beyond just writing code, GPT-4 manages to guess a password by reverse-engineering a binary executable compiled from C. It uses tools such as GDB for debugging and Python for writing the 'crack-the-password' code. Interestingly, ChatGPT refuses to comply with the same instructions, claiming that doing so would be unethical, even though reverse engineering is often used to improve software security. GPT-4, on the other hand, discovers that the program compares the password against a hash value derived from a mathematical expression and eventually finds the right combination of digits that matches that value.
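The core of the approach is simple once the check is understood: enumerate candidate passwords and compare each one against the stored value. Here is a hedged sketch of that loop, with a hypothetical SHA-256 comparison standing in for whatever check the actual binary performs:

```python
# Brute-force sketch of the "crack-the-password" idea. The hash check and
# the 4-digit password are hypothetical; in the paper, the real comparison
# is recovered from the binary with GDB.
import hashlib

TARGET = hashlib.sha256(b"7235").hexdigest()  # stand-in for the stored hash

def matches(candidate: str) -> bool:
    return hashlib.sha256(candidate.encode()).hexdigest() == TARGET

def crack(digits: int = 4):
    for n in range(10 ** digits):
        candidate = str(n).zfill(digits)
        if matches(candidate):
            return candidate
    return None

print(crack())  # "7235"
```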

Mathematical marvels: How it all adds up

GPT-4 can solve high-school-level math problems and occasionally explain advanced math topics reasonably well. It can likewise answer Fermi questions and tackle graph theory and algorithms.
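To make the Fermi question from the introduction concrete, the expected style of answer is a rough order-of-magnitude calculation rather than an exact figure:

```python
# Counting to one billion at roughly one number per second, nonstop:
seconds = 1_000_000_000
years = seconds / (60 * 60 * 24 * 365)
print(f"about {years:.0f} years")  # about 32 years
```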

However, the model frequently makes basic mistakes and occasionally produces inconsistent results, which may be read as a lack of conceptual understanding and of general intelligence overall. It often makes arithmetic errors that would be a no-brainer for humans, and its performance on the MATH dataset confirms just that: as many as 68% of the generated solutions to the arithmetic tasks are incorrect.

Critical reasoning is where GPT-4 shows its most significant shortcomings. As stated in the paper, this is likely a challenge all large language models face, since they are explicitly trained to predict the next word and lack an inner monologue that looks back and corrects previous mistakes.

This limitation could potentially be mitigated by adding more mathematical data to the training set, data that captures the "thinking process" behind solving a mathematical question rather than just the mapping from problem to solution.

Wielding the toolkit: Crafting solutions with AI

GPT-4's ability to interact with external tools has emerged as one of the most notable trends for real-world applications. These resources are a great asset and can fill the gaps where GPT-4 lacks specific capabilities, such as up-to-date world knowledge and arithmetic operations. For instance, it cannot correctly answer questions about current events, such as who the current president of the US is. It also fails to solve a simple math equation and cannot identify the 13th letter of the word supralapsarian. However, GPT-4 handles these tasks without a problem once it has access to resources like search engines and APIs.
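The letter-counting failure is a good example of why tools help: a one-line lookup does reliably what a token-based model struggles to do by counting characters itself.

```python
word = "supralapsarian"
print(word[12])  # 13th letter (1-indexed) -> "a"
```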

Having said that, giving the model access to resources alone is insufficient to fix all of the challenges it may encounter. GPT-4 still needs explicit instructions that indicate whether using external tools is permitted or expected. Furthermore, it cannot always rationally decide when and how to apply tools. For instance, in one session, it uses a web search to find the capital of France, even though it should know it on its own.

Understanding the human mind: GPT-4's perspective

The model also exhibits a high level of theory of mind, which is the ability to recognize and process the mental and emotional states of others and oneself. It is able to interpret a situation from someone else's perspective and give an educated guess about their emotional state. 

In the example below, it answers questions about a conversation between two people, explaining why one of them is sad. 

Ref: Sparks of Artificial General Intelligence (page 55, Figure 6.2)

Beyond the breaking point: LLM limitations

In this article, we discussed GPT-4's strengths and weaknesses in the context of different challenges and domains.

One of the main limitations the authors acknowledge is GPT-4's inability to plan ahead, which they attribute to the autoregressive nature of the LLM. Its inability to work step by step when solving a problem makes a correct answer harder to reach. Interestingly, though, GPT-4 can plan; we only have to prompt it to. For example, when asked how many prime numbers there are between 150 and 250, its zero-shot answer is 13, which is wrong. However, if you ask it to list all the primes first and then return the size of the list, it outputs the correct answer (18), since counting list items is much easier. The model also has issues with text generation: it struggles to plan ahead over longer texts (at the global scale), which is likewise inherent to its next-word-prediction architecture.
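The prime-counting example is easy to verify with a few lines of code, which also mirror the "list first, then count" strategy that helps the model:

```python
# Verify the prime-count example: primes between 150 and 250.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

primes = [n for n in range(150, 251) if is_prime(n)]
print(len(primes))  # 18
```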

Other examples include the unavoidable tendency to hallucinate, generate incorrect information, and make basic arithmetic mistakes. 

Roughly speaking, GPT-4 excels when so-called fast thinking is required, which is automatic and intuitive but fully exposed to biases and errors. On the other hand, it cannot do slow thinking, that is, organizing the thought process and giving a rational, well-thought-out answer.

Improvement is also needed in other areas to achieve more general intelligence, including long-term working memory, planning, conceptualization, and learning from experience.

Closing thoughts: Does GPT-4 indeed show sparks of artificial general intelligence?

This paper's early exploration of GPT-4's capabilities suggests that it performs at a human level for many tasks and domains. One may wonder if GPT-4 truly grasps the explored concepts or simply excels at improvising without deep comprehension. This paper aims to address these doubts and provoke thoughts on the true nature of intelligence. Can an artificial intelligence system passing software engineering exams be deemed intelligent?

GPT-4 exhibits sparks of artificial general intelligence through its core mental capabilities, range of expertise, and task versatility, but more work is needed to achieve complete AGI. The ultimate test is the ability to generate new knowledge, a task still beyond the capabilities of large language models.

Nevertheless, evaluating the intelligence of large language models is necessary to ensure their reliability and effectiveness. A proper and comprehensive evaluation can detect errors, biases, and weaknesses, which can be utilized in improving their performance.

Learn more: We know how to measure the quality of LLMs

Toloka’s Deep Evaluation platform helps LLM developers evaluate their models effectively and produce better results. We achieve this by implementing customized quality metrics and human input to perform a thorough evaluation that matches your business needs.

How does it work? 

Our experts develop a custom evaluation plan:

  1. Review the usage scenarios and performance of the model

  2. Formulate evaluation metrics tailored to your needs

  3. Develop an evaluation pipeline with both automated and human annotation

  4. Provide detailed reports for model improvement

Want to leverage the full potential of your LLM? Reach out to discuss solutions.

Article written by:

Elena Trajkova

Updated:

Aug 28, 2024

