
AI Software Design Security and Language Choice

Or, rather, the quality of Artificial Intelligence (AI) generated software in general, but the same goes for security, naturally, since security and quality are two sides of the same coin. Software engineers have happily embraced the possibility of programming in a natural language: describing the task to an agent that uses a Large Language Model (LLM) to produce the actual code. The question I want to look into today is the choice of programming language for such endeavours.

Interestingly, when you let the LLM code for you, you should not really care all that much which programming language it actually produces the code in. You program, that is, you explain the tasks, in English (or whatever other language you prefer), and the LLM produces code that should match the expected result. Especially if you use proper design techniques and practices like Test Driven Development, you should get the correct result, provided you do everything right.

It would seem then that the choice of the language is mostly cosmetic and a matter of personal preference, wouldn’t it? Or would it?

Consider the fact that models have to be trained on existing material to do their work. The bigger and better the training base, the better the result. You are, in effect, getting slightly below the average of all the previously written code that suits your task. So the better the average, the better the result.

So, then, we have to pose the question: if LLMs are trained on the existing code base, what language should we choose? The answer should be obvious: we should pick a language with the biggest and most stable code base to get the best results. Preferably, a language that has been used many times over to perform tasks similar to ours. We want a language that has been in use for a long time, so that all its corner cases and fine points are well known, extensively exercised and described in as much detail as possible.

Here are the top 10 languages by the estimated size of their public knowledge base, courtesy of an LLM.

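| Rank | Language   | Est. Public Code (Lines of Code) | Est. Public KB Size | Years in Use | Stability   |
|------|------------|----------------------------------|---------------------|--------------|-------------|
| 1    | JavaScript | ~50 billion LOC                  | 100% (baseline)     | 30 years     | Medium      |
| 2    | Python     | ~40 billion LOC                  | ~90%                | 34 years     | High        |
| 3    | Java       | ~35 billion LOC                  | ~85%                | 30 years     | High        |
| 4    | C/C++      | ~25 billion LOC                  | ~70%                | 53/42 years  | Very High   |
| 5    | PHP        | ~15 billion LOC                  | ~55%                | 30 years     | Medium-High |
| 6    | C#         | ~12 billion LOC                  | ~50%                | 24 years     | High        |
| 7    | TypeScript | ~8 billion LOC                   | ~35%                | 13 years     | Medium      |
| 8    | Ruby       | ~5 billion LOC                   | ~25%                | 30 years     | High        |
| 9    | Go         | ~4 billion LOC                   | ~20%                | 16 years     | Very High   |
| 10   | Rust       | ~3 billion LOC                   | ~15%                | 15 years     | High        |
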
  • Public KB includes: open-source code, package repositories, documentation, forums, books
  • COBOL (250-800B total LOC) excluded: >99% hidden in proprietary banking/government systems
  • Fortran, VBA, SQL excluded: majority locked in corporate/classified systems
  • Stability: Very High = rare breaking changes; Medium = frequent ecosystem evolution

Suddenly, it would appear that writing code in a more modern language is not a good idea when you use agentic development, right? You can clearly see that there are several languages with 30 or more years of history that are stable and have a huge knowledge base to back them up. Those languages should be preferred for LLM-assisted design and chosen as the main language for systems written entirely by LLMs.
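To make that selection concrete, here is a minimal sketch that filters the table above by age and stability. The figures are copied from the table (themselves LLM estimates), while the `shortlist` function, the ordering of the stability labels and the choice to count C/C++ as 42 years (the smaller of the two figures listed) are assumptions made purely for illustration.

```python
# Rows copied from the table above:
# (language, est. public LOC in billions, years in use, stability).
# For C/C++ the table lists 53/42 years; the smaller figure (42) is used
# here, which is an assumption of this sketch.
LANGUAGES = [
    ("JavaScript", 50, 30, "Medium"),
    ("Python", 40, 34, "High"),
    ("Java", 35, 30, "High"),
    ("C/C++", 25, 42, "Very High"),
    ("PHP", 15, 30, "Medium-High"),
    ("C#", 12, 24, "High"),
    ("TypeScript", 8, 13, "Medium"),
    ("Ruby", 5, 30, "High"),
    ("Go", 4, 16, "Very High"),
    ("Rust", 3, 15, "High"),
]

# Hypothetical ordering of the stability labels, lowest to highest.
STABILITY_ORDER = ["Medium", "Medium-High", "High", "Very High"]


def shortlist(rows, min_years=30, min_stability="High"):
    """Keep languages that are old enough and stable enough,
    sorted by estimated public code size, largest first."""
    threshold = STABILITY_ORDER.index(min_stability)
    picked = [
        row for row in rows
        if row[2] >= min_years and STABILITY_ORDER.index(row[3]) >= threshold
    ]
    return sorted(picked, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, loc, years, stability in shortlist(LANGUAGES):
        print(f"{name}: ~{loc}B LOC, {years} years, {stability}")
```

With the default thresholds (at least 30 years in use and stability of High or better), the shortlist comes out as Python, Java, C/C++ and Ruby. Relaxing `min_stability` to "Medium" brings JavaScript and PHP back in.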

It’s a pity COBOL is mostly used for closed proprietary systems. According to a recent survey, there are now over 800 billion lines of COBOL in daily use on production systems. Imagine if that knowledge base could be used for training LLMs. We could rewrite everything in COBOL. Or imagine the power of LLM programming in C, were you able to use the 100-200 billion LOC of C programs hidden in corporate infrastructure and billions of embedded devices. Even Fortran, with its estimated corporate presence of 50 billion lines of code, would be pretty awesome. This is mind-blowing.

Well, we have to choose from what’s available to us and now you know what the best choices are.
