Optimizing LLMs to be proficient at specific tests backfires on Meta and Stability AI

by Theo Leone (2025-02-10)



Hugging Face has released its second LLM leaderboard, ranking the best language models it has tested. The new leaderboard seeks to be a tougher, uniform benchmark for testing open large language model (LLM) performance across a range of tasks. Alibaba's Qwen models look dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.


Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-Pro for all major open LLMs! Some learnings: Qwen 72B is the king and Chinese open models are dominating overall; previous evaluations have become too easy for recent ... June 26, 2024


Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning over very long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layperson's terms, and, most difficult of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.
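
For readers who want to try this kind of scoring locally, here is a minimal sketch that runs an MMLU-Pro-style benchmark against an open model with EleutherAI's lm-evaluation-harness, the evaluation tooling the leaderboard is built on. The model id, task name, and settings below are illustrative assumptions, not the leaderboard's exact configuration.

# Minimal sketch, not Hugging Face's official pipeline: score an open model on an
# MMLU-Pro-style benchmark with EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Model id and task identifier are assumptions; exact task names vary by harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                           # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-7B-Instruct,dtype=bfloat16",
    tasks=["mmlu_pro"],                                   # one of the leaderboard's six benchmark families
    batch_size=4,
    limit=100,                                            # small subset for a quick sanity check
)

for task, metrics in results["results"].items():
    print(task, metrics)

Running the full six-benchmark suite at leaderboard scale is what required the hundreds of H100s mentioned below; on a single GPU, a smaller model and the limit argument keep a sanity-check run tractable.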


The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes first, third, and tenth place with its handful of variants. Also showing up are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.


Tests to qualify for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted array of significant models, to avoid a confusing glut of small LLMs.
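
The scores themselves are published on the Hugging Face Hub, so they can be pulled and filtered without the web UI. The sketch below is hedged: the dataset id and column names are assumptions about where the current results are mirrored, so check the leaderboard Space for the live repo name before relying on it.

# Hedged sketch: read the public leaderboard scores straight from the Hugging Face Hub.
# "open-llm-leaderboard/contents" and the column names are assumptions; verify the
# current repo id on the leaderboard Space.
from datasets import load_dataset

ds = load_dataset("open-llm-leaderboard/contents", split="train")   # assumed repo id
df = ds.to_pandas()

print(df.columns.tolist())                                # inspect the available score columns first
candidates = [c for c in df.columns if "Average" in c]    # aggregate score column; name assumed
if candidates:
    print(df.sort_values(candidates[0], ascending=False).head(10))

This is also a quick way to apply the same "significant models only" style of filtering described above in a script rather than in the browser.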


As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a means to compare and reproduce testing results from several established LLMs, the board quickly took off in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second variant.


Some LLMs, including newer versions of Meta's Llama, significantly underperformed on the new leaderboard compared to their high marks on the first. This stems from a trend of over-training LLMs on the first leaderboard's benchmarks alone, leading to regression in real-world performance. This performance regression, driven by hyperspecific and self-referential data, follows a pattern of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.
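
As a rough illustration of the regression described above, one simple check is to compare where a model ranks on the old leaderboard versus the new one and flag large drops. The sketch below uses made-up scores purely to show the idea; none of these numbers are real leaderboard data.

# Illustrative only: flag models whose rank falls sharply between leaderboard
# generations, a crude signal of tuning to the old benchmarks. All scores here
# are hypothetical placeholders.
v1_scores = {"model_a": 78.2, "model_b": 74.9, "model_c": 71.3}
v2_scores = {"model_a": 41.0, "model_b": 52.6, "model_c": 50.8}

def ranks(scores):
    # Map each model to its 1-based rank, best score first.
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

r1, r2 = ranks(v1_scores), ranks(v2_scores)
for name in v1_scores:
    drop = r2[name] - r1[name]
    if drop > 0:
        print(f"{name}: fell {drop} place(s) on the harder v2 suite")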


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin has a handle on all the latest tech news.



Reader Comments

bit_user:
"LLM performance is only as good as its training data and that true artificial 'intelligence' is still many, many years away."

First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly the way humans do, or else the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently need not necessarily do so, either.


jp7189:
I don't like the click-bait China vs. the world title. The truth is Qwen is open source, open weights, and can be run anywhere. It can be (and already has been) fine-tuned to add or remove bias. I applaud Hugging Face's work to develop standardized tests for LLMs, and for putting the focus on open source, open weights first.


jp7189:
bit_user said:
"First, this statement discounts the role of network architecture.

Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and abilities you may be familiar with if you study child development or animal intelligence.

The definition of 'intelligence' cannot be whether something processes information exactly the way humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently need not necessarily do so, either."

We're creating tools to help people, therefore I would argue LLMs are more useful if we grade them by human intelligence standards.
