Thoughts on LLMs in PL


关于LLM与PL的一些想法 Some thoughts on LLMs and Programming Language research

Notice: The English version (after the Chinese version) of this post was translated by an LLM and verified by myself. This workflow is faithful to what I propose in this post.

Chinese Version – 中文版

最近听了很多在各种任务上用LLM的talk。不可否认,现阶段试图在各类任务上引入LLM已经是大势所趋——有些我觉得挺有趣,有些我觉得有点扯。于是最近我一直在思考这个问题:到底是什么让我产生了这样的不同的感受?

一切的起点

为了讨论这个问题,我准备从我最熟悉的技术:Abstract Interpretation来讲。关于Abs Int,以下是一些我的基本认知:

  • Abstract Interpretation的核心思想之一是sound over approximation。也就是说为了找到所有的potential problem,我们允许一定的false alarm。
  • Abs Int是一个proof technique。它的目的是证明一个更大的集合(也就是更严格的标准)都成立。比如说如果你想知道自己每门课是否都拿到了60分,你如果能确定自己每门课甚至是都在80分以上,那么是否是60分已经不言自明了。

从核心来说,Abs Int对结果最大的要求是soundness。如果这个性质被破坏,则证明无意义。

Artificial Intelligence 与 Abstract Interpretation

基于如上的认知,我们凡是试图在Abs Int中引入Arti Intelli,其重要的要求之一就是结果要保证soundness。由于soundness proof只能通过某种abstract domain来提供,因此如果出现了这么一个场景:LLM提供了一个比当前abs int更精确的solution。这看似是很多利用LLM的分析想提供的benefit,但是当我们试图验证其正确性的时候,我们会陷入两个可能的情况:

  • 我们使用了某种abs int来验证/比对结果。其后果就是如果我们比这个abs int更精确,它依旧会给你report。此时我们无法判断这个report究竟是不是false alarm,所以结果是uncheckable。
  • 我们使用了一个更expensive的验证方式(如symbolic execution),这会破坏abstraction带来的scalability。对于更大的问题,其结果就是分析不出来,这还是一种”uncheckable”。

因此,我们发现:LLM/各种人工智能技术提供的结果,都只能和某个well-established abstract domain分析出来的结果对齐。 否则它的结果无法得到soundness check,在我们前文的共识下它就没啥用。

进一步来讲的话,任何试图在formal methods里引入Arti Intelli的行为都只能替代人本来做的事情,而无法替代本来由数学/逻辑来做的事情。这也就是为什么用LLM生成Lean candidate proof是有道理的(因为正确性来自于type system),生成test cases也是有道理的(正确性来自testing本身的under approximation),而生成dataflow dependency是扯淡的(因为这应该由某个确定性的iterative algorithm来做)。

再进一步来讲的话,我们之所以需要Formal Methods,是因为我们希望我们的结果是formal的,能获得一些guarantee。这些guarantee的背后是确定性的数学推理和计算,这些是无法被取代的。好比当你买东西的时候,如果你问小贩为什么是这个价格,你希望看到的是对方给你展示原价、税费、附加费等等计算出来的结果,而不是 “Trust me bro I am a Math PhD”。

所以看似轰轰烈烈的取代一切的LLM运动,于我看来在Formal Methods里其实更是一种dead end:它只能是一些engineering的问题,却无法触及scientific的地方。

Dead End?

“只能替代人能做的事情”是否就意味着没什么好做的呢? 其实也未必,engineering也并非就一定低人一等。

在Formal Methods的practice中,其实有很多人类参与的地方:选择abstract domain,猜测program invariant,包括上文提到的选择ITP candidate proof,其实都是一些需要人去设计和参与的heuristic。现今的LLM作为一个更加见多识广、更加吃苦耐劳但是没那么聪明的“人”,其实有很多可以做的地方。

比如猜invariant就是一个现在非常常见的应用,当然另一个更好的例子可能是program synthesis:LLM作为一个拥有一定domain knowledge的expert,可以用各种办法尝试生成一个满足spec的程序,它可以来自于SyGuS之类的搜索框架,也可以来自LLM生成,也可以去鸣濑神社请巫女小姐帮你写一个。但是最终想要确定得到的是否为需要的程序的话,还是不可避免地需要用verification techniques来检查;invariant同理,我们可以用各种办法,无论是搜索、咨询LLM、咨询巫女还是使用水晶球来获得candidate,只要最后能够验证其是一个invariant,我们的目标就成立了。

在这里多提两句LLM,在这些任务里LLM可能和巫女小姐与水晶球没有本质的差异。我听过一些LLM for SE的talk,发现从PL researcher眼里的program是一个由type等元素构成的形式系统,而从NLP的视角来看它们只是一些有统计特征的文本。虽然这个观点当然也是正确的,其统计特征当然很大一部分来自于其背后的形式系统。但是这种本质的观念分歧会导致对同一个任务产生截然不同的观念。这种似是而非的结果对于formal guarantee来说是远远不够的,但是如果你想得到一些candidate result,或者只是想得到“一些东西”并且希望它的质量尽可能高点,那这自然也就绰绰有余了。

上面那段话有一点“中文房间”的意味,所以我们自然也不妨干脆推得更大胆一点:即使是人类也并没有真正“理解”某个概念,只是在按照自己学到的pattern去模仿它罢了。因此不如干脆可以说:任何不经过formal verification的东西都是不可信的,不管它的作者是人还是LLM! 这难道不就是formal methods的魅力之一么——我可以犯错,但是数学上的推导不会!

Is that really the end?

当然以上的讨论都基于formal methods,放眼整个computer science world,formal methods自然也不是它的全部。比如soundness guarantee在formal methods researcher眼里极其的重要,但是在更广大的领域,这并不一定是必需品。 一个很好的例子是,尽管大部分静态程序分析的基础都是abstract interpretation,但是静态程序分析的一些应用并不是在使用其soundness feature。比如bug hunting,我可以用一个分析器去扫描一些开源软件,它不一定要报出所有的bug,也不一定非得没有误报,很多时候大家或许更像一个淘金客:只要它的report能找到bug,并且这个bug是新的,刚被发现的,那我就能获得一个CVE。于此情形之下,基于抽象解释的静态分析也只是一种工具,这些“包含real bug的金沙”可以来自于一个程序分析器,也可以来自LLM,当然也可以来自巫女小姐了。

English Version

It's now a prevailing trend to explore the use of LLMs across various tasks. As a result, I've recently had the opportunity to hear about a wide range of LLM-related projects from different groups at different venues: some of them genuinely interesting, others somewhat questionable. This led me to reflect on a key question: what exactly is it that makes me react so differently to these projects?

Where It All Starts

To approach this question, I’ll begin with the technique I’m most familiar with: Abstract Interpretation. Here are some of my basic understandings of it:

  • One of the core ideas of Abstract Interpretation is sound over-approximation. That is, in order to capture all potential problems, we allow a certain number of false alarms (see the interval sketch after this list).
  • Abstract Interpretation is a proof technique. Its goal is to prove that a property holds for a larger set (i.e., under a stricter standard), so that the original claim follows. For example, if you want to check whether you scored at least 60 in every course, proving that you scored above 80 in all of them obviously suffices.
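
To make the first point concrete, here is a minimal, hypothetical sketch of an interval domain (not taken from any particular analyzer): the abstraction covers every value that might occur, possibly together with values that cannot, which is exactly where false alarms come from.

```python
# A toy interval domain: each abstract value is a pair (lo, hi).

def interval_join(a, b):
    """Least upper bound: covers both branches of a conditional."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def interval_add(a, b):
    """Abstract addition: sound for every pair of concrete values."""
    return (a[0] + b[0], a[1] + b[1])

# Suppose x is 0 on one branch and 10 on the other:
x = interval_join((0, 0), (10, 10))   # (0, 10) -- also covers 3, 7, ...
y = interval_add(x, (1, 1))           # (1, 11)

# Soundness: every concrete execution is covered.  Over-approximation:
# the abstraction cannot rule out x == 5, so a query like "can x be 5?"
# yields a false alarm even though no real execution produces that value.
print(x, y)
```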

At its core, Abstract Interpretation places its highest priority on soundness. If this property is broken, the proof becomes meaningless.

AI vs. AI: Artificial Intelligence Meets Abstract Interpretation

Based on this understanding, any attempt to introduce Artificial Intelligence into Abstract Interpretation must preserve soundness. Since a soundness proof can only be established through some form of abstract domain, we quickly arrive at a dilemma when an LLM produces a solution more precise than the traditional abstract interpretation (which is exactly the benefit much LLM-based analysis claims to offer). When we try to verify such a result, there are two possibilities:

  • If we use some abstract interpretation to verify or compare against the LLM's result, then, precisely because that result is more precise, the verifier will still flag the discrepancy. We cannot tell whether the report is a false alarm or a real unsoundness, so the result becomes uncheckable.
  • If we use a more expensive method like symbolic execution for verification, we sacrifice scalability—meaning the analysis may not complete at all on larger problems. Again, the result becomes uncheckable.

Therefore, we realize a fundamental truth:
LLM/Artificial Intelligence-generated results can only be trusted if they align with those derived from a well-established abstract domain.
Otherwise, the results cannot be soundness-checked and are thus useless under the principles of formal methods.
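
To illustrate the dilemma with a hypothetical sketch (the function names and the fixed range below are invented for this example), the only affordable soundness check is containment in an abstract-domain result, so any extra precision the LLM claims is precisely the part we cannot certify:

```python
def abs_int_range(var):
    # Stand-in for a sound interval-analysis result for `var`:
    # the proven over-approximation of its possible values.
    return (0, 100)

def check_llm_claim(var, llm_lo, llm_hi):
    lo, hi = abs_int_range(var)
    if llm_lo < lo or llm_hi > hi:
        return "rejected: the claim exceeds the proven over-approximation"
    if (llm_lo, llm_hi) != (lo, hi):
        # Tighter than what we can prove: without a costlier method we
        # cannot distinguish a genuine precision gain from a missed behaviour.
        return "uncheckable: more precise than the abstract domain"
    return "accepted: aligned with the abstract domain"

print(check_llm_claim("x", 10, 50))    # uncheckable
```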

Put differently, any attempt to introduce Artificial Intelligence into formal methods can only replace what was previously done by humans, not what was done by logic or mathematics. That’s why it makes sense to use LLMs to generate Lean candidate proofs (because correctness is guaranteed by the type system), or to generate test cases (because the correctness of testing comes from its under-approximation). But using LLMs to generate dataflow dependencies is nonsense—because that should be computed by a deterministic iterative algorithm.
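
For contrast, here is a minimal sketch of the kind of deterministic iterative algorithm meant above: a classic liveness (dataflow) fixpoint over a tiny, made-up control-flow graph. The result is fully determined by the dataflow equations and the iteration, leaving nothing for a learned guess to contribute.

```python
# Hypothetical three-node CFG: node -> (defs, uses, successors).
CFG = {
    "entry": (set(), set(), ["loop"]),
    "loop":  ({"i"}, {"i", "n"}, ["loop", "exit"]),
    "exit":  (set(), {"i"}, []),
}

def liveness(cfg):
    """Iterate the classic liveness equations to a fixpoint."""
    live_in = {n: set() for n in cfg}
    live_out = {n: set() for n in cfg}
    changed = True
    while changed:
        changed = False
        for n, (defs, uses, succs) in cfg.items():
            out = set().union(*(live_in[s] for s in succs)) if succs else set()
            new_in = uses | (out - defs)   # in[n] = use[n] | (out[n] - def[n])
            if new_in != live_in[n] or out != live_out[n]:
                live_in[n], live_out[n] = new_in, out
                changed = True
    return live_in, live_out

print(liveness(CFG))
```

An LLM could at best guess this table; the worklist computes it, and the two should not be confused.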

In the end, we use formal methods because we want our results to be formal and come with guarantees. These guarantees are backed by deterministic reasoning and mathematics—things that cannot be replaced. It’s like when you’re buying something: if you ask the vendor why the price is what it is, you expect a breakdown—base price, tax, fees—not “Trust me bro I’m a Math PhD.”

So this seemingly unstoppable movement of LLMs replacing everything is, in the world of formal methods, more of a dead end: useful for engineering tasks, but unable to reach the scientific core.

Dead End?

But does “only replacing what humans do” mean there’s nothing to do here?
Not necessarily—engineering is not inherently inferior.

In the practice of formal methods, there are many places where humans are involved: selecting abstract domains, guessing program invariants, or selecting ITP candidate proofs. These are all heuristics that require human design and input. An LLM, as a more experienced, harder-working, but not-so-bright “human,” actually has a lot it can help with.

For example, generating invariants is already a very common application. Another even better example is program synthesis: LLMs, acting as domain experts, can try various ways to generate a program that meets a given spec. This can come from SyGuS-style search, from LLM generation, or even from asking a Miko at Naruse Shrine. But in the end, whether it’s the right program must be determined using verification techniques.
Invariants are similar: no matter how you obtain a candidate—be it via search, LLM, a Miko, or a crystal ball—if it can be verified as a valid invariant, then the goal is achieved.
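
As a minimal guess-and-check sketch (assuming the z3-solver Python package; the loop, names, and candidate below are invented for illustration), the candidate invariant may come from an LLM, a search procedure, or a crystal ball, and only the checks at the end make it trustworthy:

```python
from z3 import Int, Solver, Implies, And, Not, unsat

# Hypothetical loop: x = 0; while x < n: x = x + 1   (with n >= 0)
x, n, x_next = Int("x"), Int("n"), Int("x_next")
inv = lambda v: v <= n                      # the candidate invariant, obtained from anywhere

def valid(formula):
    """A formula is valid iff its negation is unsatisfiable."""
    s = Solver()
    s.add(Not(formula))
    return s.check() == unsat

init      = valid(Implies(And(x == 0, n >= 0), inv(x)))                        # initiation
preserved = valid(Implies(And(inv(x), x < n, x_next == x + 1), inv(x_next)))   # consecution
useful    = valid(Implies(And(inv(x), Not(x < n)), x == n))                    # implies the postcondition

print(init, preserved, useful)   # True True True: the guess really is an invariant
```

The same three checks apply no matter who (or what) produced the candidate, which is exactly the division of labor argued for above.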

A quick note on LLMs: in these tasks, LLMs may not be fundamentally different from Mikos or crystal balls. I’ve listened to a few talks on LLM for SE and noticed a distinction:

  • From a PL researcher's view, a program is a formal system built from types and similar constructs.
  • From the NLP perspective, programs are just text with statistical patterns.

The latter view is of course also valid: after all, a large part of those statistical features comes from the underlying formal system. But this fundamental difference in perspective leads to very different expectations of the same task.
For formal guarantees, such results are far from enough. But if you are just looking for a candidate result, or simply "something" of reasonably high quality, then they are more than good enough.

The previous paragraph might remind you of the "Chinese Room" thought experiment, so let's push the argument even further:
Even humans don’t really “understand” concepts—they just mimic patterns they’ve learned.
So we might as well say:
Anything not formally verified is untrustworthy—no matter whether it was written by an LLM or a human.
Isn’t this one of the great appeals of formal methods? I might make mistakes, but mathematical derivations won’t!

Is That Really the End?

All of the above is based on the principles of formal methods, but formal methods are not the whole of computer science.
For example, while soundness guarantees are crucial to formal methods researchers, they’re not always necessary in broader contexts.

A great example: while most static program analysis is based on abstract interpretation, some of its applications do not rely on soundness.
Take bug hunting: I can use a tool to scan open-source software, and it doesn’t have to find every bug, or even be free of false positives. Often it’s more like gold panning: if it finds a bug—and that bug is new—then I can get a CVE.
In such cases, abstract-interpretation-based static analysis is just one tool among many. Those "gold nuggets" containing real bugs might come from a program analyzer, from an LLM, or, of course, from a Miko.