#Microsoft Phi 4

81 messages · Page 1 of 1 (latest)

halcyon monolith
#

Phi 4 just dropped, only on Azure AI Foundry for now. It’s a dense 14B model claimed to be competitive with Llama 3.3 70b in many benchmarks. Most of the gains are from improved post-training rather than architectural improvements.
Paper: https://arxiv.org/abs/2412.08905
https://x.com/SebastienBubeck/status/1867379311067512876
https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft’s-newest-small-language-model-specializing-in-comple/4357090

Surprise #NeurIPS2024 drop for y'all: phi-4 available open weights and with amazing results!!!

Tl;dr: phi-4 is in Llama 3.3-70B category (win some lose some) with 5x fewer parameters, and notably outperforms on pure reasoning like GPQA (56%) and MATH (80%).

TECHCOMMUNITY.MICROSOFT.COM

Today we are introducing Phi-4, our 14B parameter state-of-the-art small language model (SLM) that excels at complex reasoning in areas such as math, in...

opal snow
opal snow
#

Why did I get a notification for an emoji reaction 😭

regal slate
#

it's a decent model around Nemo 12B & Qwen2.5 14B level, with decent reasoning, very good STEM capability but lackluster code & instruct following. default vibe is very neutral and quite sterile, as expected from a microsoft model.

compared to the phi-3/3.5 models I tested half a year ago, it's much less stringent in terms of over-censoring.

tardy thorn
#

Q8_0, Q6_K, Q4_K_M and f16 ggufs and bf16 safetensors

lilac oxide
#

it fails to multiply 223*9271 which even the smaller marco-o1 model from alibaba can calculate correctly

lilac oxide
#

but other then that it seems to be quite a solid model

lilac oxide
#

i did a translation test for phi4, this is the result (i mean I wouldnt have expected anything else from a 14b model):
input:

On 14 December 2024, Yoon Suk Yeol, the president of South Korea, was impeached by the National Assembly following the second impeachment motion raised against him. This action came in response to Yoon's declaration of martial law on 3 December, which was overturned by the National Assembly and officially withdrawn six hours later. 

translating to german -> chinese -> french -> vietnamese -> norwegian -> english (without chat history ofc)
output:

On December 14, 2024, the president of South Korea, Yoon Suk-yeol, was impeached by the national 
parliament following another motion for his removal. This occurred in response to his announcement on 
December 3 to implement a ban on meetings—a decision that had already been deemed invalid by the 
parliament and officially withdrawn six hours later.
#

far from a correct translation, but also far better then any other 14b parameter model

#

(q8_0 quantization, maybe fp16 is better)

lilac oxide
#

wow fp16 much better, this is crazy good:

On December 14, 2024, the President of South Korea, Yoon Suk-yeol, was suspended from office by the National Assembly after 
undergoing his second temporary suspension during his term. This action followed his decision on December 3 to declare a state 
of emergency, which was later rejected and formally revoked by the National Assembly within six hours.
smoky sundial
#

you can run fp16 locally? how much vram does it consume.

lilac oxide
#

i have 24gigs of vram, llama.cpp does the offloading to ram

#

its slow but i still can make these translations in a few minutes

smoky sundial
#

anyway, i think there is model that is specifically being train to be translation and it could be better than it i suppose.

lilac oxide
#

still impressive for a 14b param model

smoky sundial
lilac oxide
#

i havent checked but since the model file is 29gigs, i suppose so

smoky sundial
#

so at least 48ram mac needed to run that thing.
anyway, thanks for the information.

lilac oxide
#

or the upcoming rtx 5090 xD

#

assuming 32gb would be enough

regal slate
#

32gb won't be enough unless you limit context to like 500tok. you need vram for system, multimonitor, and context. pretty sure you'd bust past 32GB even at 4k context

lilac oxide
#

i get 2.34 tokens/s for fp16 which is barely useable, but fine for me, if you do something different while the text is generating

woven marsh
#

φ

whole flare
lilac oxide
#

I've already seen a big difference between q8_0 and fp16

ornate star
lilac oxide
#

And was only correct with fp16

lilac oxide
#

Yes

ornate star
#

Okay this is strange, did you test any other models like this?

#

I usually use q4_k_m and q5_k_m for everything

lilac oxide
#

I đid test some other models, but don't remember which ones exactly, but was a 32b param model I think

vocal lake
lilac oxide
#

I am currently not at home, but if you want you (or someone else) can test this

jagged flare
lilac oxide
#

I haven't used or tested 4o mini at all so I can't say. But phi 4 is definitely a solid model for sure

ornate star
ornate star
#

nvm i only have an hdd and the verification is taking literal hours, somebody else pls do it

opal snow
#

There's a website that hosts any model I tried phi4 on there. I found that phi-4 had overfitted responses, in that it would have completely wrong reasoning and steps, but would get correct answers.

I think chatglf might have some type of deal.

fringe viper
#

Seems to be just added. https://openrouter.ai/microsoft/phi-4

Microsoft Research Phi-4 is designed to perform well in complex reasoning tasks and can operate efficiently in situations with limited memory or where quick responses are needed.

At 14 billion parameters, it was trained on a mix of high-quality synthetic datasets, data from curated websites, and academic materials. Run Phi 4 with...

stark linden
#

Template totally wrong and it significantly impacted the intelligence of the model.

regal slate
# stark linden How is q8 busted?

It is not. Both the Q8's I tested (pre official launch and post) perform exactly as expected compared to full bf16, with very little degradation.
Others have tested this too, such as here https://youtu.be/eKSMKSZkN-I?t=2548 - and found the same results.

It's possible that someone downloads a bad quant with faulty template/params, or receiving poor results from user error, but this is not a general issue. Anyone making such claim should state exactly which model and parameters were used in which environment with which specific output issues, otherwise it's just baseless claims that cannot be verified.

tldr; it's not busted

stark linden
abstract compass
#

When will DeepSeek cram their stuff into Phi-4 instead of Qwen?

woven marsh
#

Interesting but a little unrelated to OR, the multimodal here (5.6B Params) is now at the top of the OpenASR leaderboard
(I didn't know until now that whisperv3large has 1.54B Params, qwen2-audio has 7B)

lilac oxide
minor citrus
#

And if it has image, it has video by extension?

languid comet
#

Can anyone using function calling via openrouter?

#

No endpoints found that support tool use

woven marsh
#

To fiddle with today

ornate star
lilac oxide
# ornate star

tbh logo memorization is not a good test for such a model

ornate star
#

Yeah but that output is so random

woven marsh
#

Deepinfra can be bumpy when they first start a model so may need testing & feedback

ornate star
#

Seems to be working now

#

This is an svg file btw, does openrouter automatically convert it to an image?

woven marsh
#

I think Chromium (Chrome+Edge+Brave etc) now includes auto svg-to-png functionality

high fox
lone flare
#

Getting garbled output using Phyi 4 multimodal instruct via Openrouter. I have lowered the temp and top p to 0.5 but nothing seems to fix it.

#

Dropping the temp all the way down to 0.1, and it barely responds even when asking for more detail...

lone flare
#

I just realized I'm referring to Phi 4 multimodal, which has another thread here I think.

crystal radish
#

So, I have seen that these models underperform in GPQA. What is the purpose of these smaller models if trained specifically on research papers? What makes then better at research than, say, o3-mini or gemini thinking where they excel in GPQA and STEM domain-specific questioning. Why might someone prefer to use Phi-4 over other current SOTA models excelling in STEM? I am trying to decide which would be better for a research assistant.

regal slate