Waarom evalueren we (Open Source) LLM’s? Artwork

Greatminds podcast

Greatminds duikt in allerlei onderwerpen die te maken hebben met software architectuur: van AI tot integratie architectuur voor ieder komt er wel wat aan bod of je nu een tech-liefhebber bent, in de software-industrie werkt, een business owner bent, of gewoon nieuwsgierig naar wat de toekomst ons kan brengen.

All Episodes

Greatminds podcast

Waarom evalueren we (Open Source) LLM’s?

April 23, 2024 • Hildo van Es en Robin Smits • Season 1 • Episode 1

Send us a text

AI biedt veelbelovende kansen, maar brengt ook serieuze risico’s met zich mee — vooral als we vergeten te evalueren wat we precies gebruiken. In deze aflevering van de greatminds podcast gaan Hildo van Es en data scientist Robin Smits in op de noodzaak van het evalueren van open source LLM’s (Large Language Models). Waarom moet je niet blind vertrouwen op bestaande modellen? Wat zijn de risico’s als je dat wel doet?

🔑 Belangrijkste inzichten:

Evaluatie gaat verder dan prestatie: ethiek, bias en veiligheid zijn net zo belangrijk.
Hugging Face biedt standaardbenchmarks, maar handmatig testen blijft onmisbaar.
Kleine benchmarks kunnen evaluatie toegankelijker maken zonder veel performanceverlies.

📱 Connect met onze gast en host:

Robin Smits | Hildo van Es

⏱ Tijdstempels:

00:00 – Introductie Hildo en Robin

01:43 – Waarom je altijd moet evalueren: de DPD-chatbot en Cortana

03:32 – Wat is evalueren in de context van LLMs?

05:09 – Hugging Face en het Open LLM Leaderboard

08:19 – Van GLUE naar SuperGLUE naar moderne benchmarks

09:56 – Meertalige evaluatie en het Nederlandse leaderboard

11:58 – Fine-tuning op je eigen dataset: moet je opnieuw testen?

19:20 – Chatbot Arena & subjectieve vergelijking

20:35 – Kosten, hardware en stroomverbruik

21:40 – Tiny Benchmarks: minder data, bijna dezelfde betrouwbaarheid

22:47 – Vooruitblik op volgende aflevering over bias

For English: scroll down
[00:00 - 00:10] Hallo, welkom bij de Great Minds podcast. Dit is de eerste uit een lange serie podcasts die wij gaan hebben. En ik ben Hildo van Es.
[00:11 - 00:24] Ik ben een architect en medeoprichter van Great Minds. Dat is een kennisdomein waarbij we alles rond de architectuur willen verzamelen.
[00:24 - 00:35] En een van die onderdelen is AI. En naast mij zit Robin, met wie ik een hele serie AI topics de review laat passeren.
[00:36 - 00:43] Robin, stel jezelf eens even voor. Ja, Robin Smits. Ik werk als data scientist bij het NBWI.
[00:44 - 00:54] Daarnaast heb ik mijn eigen parttime AI consultancy bedrijf, Lumie ML Consulting. En zoals je al zei, werken wij al een langere tijd samen.
[00:54 - 01:06] En gaan we dat nog veel verder uitbreiden. En een van de dingen, hoe we dat gaan doen, is een hele reeks aan podcasts opnemen. Dus leuk. Ja, super.
[01:06 - 01:17] Nou, laten we er dan maar gelijk inspringen. Vandaag staat op het programma dat wij het gaan hebben over evaluatie van LLMs.
[01:17 - 01:27] En dan met name open source LLMs, dus Large Language Models. Nou ja, laten we maar gewoon... ...beginnen met de vraag van waarom zouden we het überhaupt doen?
[01:27 - 01:40] Ja, nou, laten we de vragen omdraaien. Wat zou er allemaal fout kunnen gaan als we het niet doen? Tijdje terug is in het nieuws geweest de chatbot van DPD.
[01:40 - 01:52] Ja, ik weet het nog. En dat ze die toch maar gauw offline hadden gehaald. Nadat een aantal mensen die ermee aan het chatten waren, hadden ontdekt... ...dat die toch wel heel anti voor zijn eigen werkgever was.
[01:53 - 02:05] En lekker aan het vloeken tegen... Dus ik denk dat dat een heel mooi voorbeeld is van hoe kan het fout gaan als jij niet test? Als jij een AI model niet evalueert.
[02:05 - 02:12] Als jij hem niet, zeker bij een chatbot, real time monitort. Ja, inderdaad. Inderdaad.
[02:12 - 02:24] En ik kan me zo voorstellen dat als ik bijvoorbeeld zelf software schrijf, gewoon normaal software schrijf... ...dat dan begin ik altijd met unittests te schrijven.
[02:24 - 02:33] En dan test ik. Op functionaliteit meestal. Maar omdat AI eigenlijk een heel ander soort beest is...
[02:33 - 02:43] ...kan ik me voorstellen dat ook de tests wel wat anders zijn. Want ik hoef dus bijvoorbeeld nooit te testen of die wel zich aan alle ethische regels houdt.
[02:43 - 02:54] Of mijn knopje geen seksisme ontwikkelt of racisme of iets dergelijks. Zoals een paar jaar geleden hadden we de Cortana bot...
[02:54 - 03:06] ...die in één nacht tijd nazi ideeën begon te ontwikkelen. Ik kan me voorstellen dat je daar dus anders op wil gaan testen. Ja, kijk voor vanuit de data science.
[03:07 - 03:14] Als jij een model traint, moet je hem altijd testen, altijd evalueren. En dat wordt al vele jaren gedaan.
[03:14 - 03:27] Hé, je hebt een training dataset en dat was tien jaar geleden was dat misschien simpele tekst met het voorspellen van een label. Bijna klassiek voorbeeld, sentiment analyse.
[03:27 - 03:40] Is mijn Twitter bericht of X dan nu positief of negatief? Ja, inderdaad. Je traint hem op het herkennen daarvan, het voorspellen. En vervolgens evalueer je met een dataset dat je model dat doet.
[03:40 - 03:52] Dus dat is eigenlijk al wat heel lang gedaan werd. Wat je nu ziet met LLM modellen zoals ze nu beschikbaar zijn, is dat je veel breder moet testen.
[03:52 - 04:05] Veel breder moet evalueren. Tijdens de ontwikkeling. Dus niet alleen doet hij datgene wat wij verwachten. Maar hoe zijn de, wat zijn de resultaten? Vertoont hij bias?
[04:06 - 04:12] Heb je last van seksisme, racisme? Heb je er zo min mogelijk last van?
[04:13 - 04:24] Zelfs elk model wat in den treuren ontwikkeld en getraind wordt om dat niet te vertonen, loop je nog steeds het risico. Dus het is ook niet dat je...
[04:24 - 04:35] Dat je doortasten en evalueren alles voorkomt, maar je minimaliseert je risico's. Ja, inderdaad. En nou ja, dat testen, dat doen we dus niet. Niet zelf alleen maar.
[04:36 - 04:39] Daar zijn ook leaderboards voor op Hugging Face bijvoorbeeld.
[04:40 - 04:52] Als je dus een model als basis gebruikt, dan wordt die ook gepubliceerd met al een heel overzicht aan scores erbij. Kun je daar iets meer over vertellen? Ja, zeker.
[04:52 - 05:00] Ik denk dat Hugging Face zullen voor de meeste mensen wel bekend zijn. Indien het niet is, ga eens op de site kijken van Hugging Face.
[05:02 - 05:16] Qua open source LLM modellen, overigens niet alleen LLM modellen, ook computer vision modellen, diverse andere type modellen zijn daar gewoon beschikbaar.
[05:16 - 05:27] Maar wat je dus ziet voor LLM's, is dat ze inderdaad een evaluatie leaderboard hebben. Waar je gewoon de scores op een aantal standaard evaluaties kan zien.
[05:28 - 05:38] En een set van zeven evaluaties. En die geven gewoon een hele brede indruk van de kwaliteiten van jouw model. Ja.
[05:38 - 05:51] En dat is eigenlijk, we zeggen het al, open LLM evaluation board. Als tegenhanger van de closes AI modellen.
[05:51 - 06:00] Ja, inderdaad. En soorten. Dus al is die grap geloof ik niet meer origineel en heel veel mensen schijnen hem te claimen, zag ik toevallig van de week op internet.
[06:00 - 06:10] Maar kijk, als wij mensen willen weten hoe goed is het model, dan kijk je naar performance gegevens. Ja.
[06:10 - 06:20] Dus hoe hoger het model scoort, in principe hoe beter die is. Dat zegt niet alles, maar het geeft wel een hele goede indicatie. Ja, precies.
[06:20 - 06:34] En zijn er ook dan standaard evaluaties die we dan, dus op het moment dat wij een LLM hebben of een model hebben, hoe wordt zoiets dan getest?
[06:34 - 06:41] Zijn er standaarden voor? Ja. Laten we vooropstellen dat er heel veel benchmarks zijn.
[06:42 - 06:53] Het is niet te doen denk ik om elke benchmark voor elk model te draaien. Al is het alleen maar de hoeveelheid hardware resources die je daarvoor nodig hebt.
[06:54 - 07:09] Wat je dus bij Hugging V ziet is dat ze een zevental standaard evaluaties hebben die in zo breed mogelijk scala aan performance metric voor je model aangeven.
[07:09 - 07:22] En daarbij gewoon een heel breed spectrum kennis, logica aan redeneren. Zeg maar aantonen wat de performance is of niet. Ja.
[07:22 - 07:31] Het is op zich wel mooi hoe dat zo ontwikkeld is. Als je een aantal jaren teruggaat, 2018 meen ik.
[07:32 - 07:40] Er was op dat moment een van de meest gebruikte evaluatie benchmarks was GLUE. The General Language Understanding Evaluation.
[07:40 - 07:53] En je zag dat heel veel modellen, die haalden daar een bepaalde score op. Er zijn... In de kern de huidige LLM modellen zijn gewoon een doorontwikkeling van de transformer modellen.
[07:54 - 08:05] Zoals die vanaf 2017 zeg maar verder ontwikkeld zijn. Zo'n beroemd paper wat toen uitkwam. Attention is all you need. Ja. Nou, hoe mooi ChatGPT ook is.
[08:06 - 08:17] Het is een doorontwikkeling van wat toen al neergezet is. En wat je dus zag in de jaren daarna is waar in 2018. The General Language Understanding Evaluation. Nog genoeg was.
[08:18 - 08:29] Het jaar later scoorden alle modellen daar al gewoon de maximum score op. Ja. Nou, toen zijn ze gekomen met de Super General Language Understanding Evaluation. Super GLUE.
[08:30 - 08:42] Rond 2021 was die ook alweer achterhaald. Ja. En eigenlijk sinds die tijd zie je dat ze veel complexere problemen, veel complexere testen.
[08:42 - 08:55] Om zeg maar die performance te evalueren. En wat je nu dus bij Huggingvee ziet is dat ze een zevental metric hebben. Ja. En dat geeft gewoon een hele goede indruk. Ja.
[08:55 - 09:07] Zeg, kun je misschien iets dieper ingaan op de verschillende evaluatie methodes en ook met name over de ondersteuning van evaluaties in andere talen?
[09:07 - 09:13] Want dat lijkt mij ook een ding. Ja, dat is zeker een ding.
[09:14 - 09:26] Wat je ziet bij de testen is dat ze eigenlijk allemaal geoptimaliseerd zijn voor Engels. Ja. Of Chinees. Ja, dat klinkt ook logisch.
[09:26 - 09:38] Amerika en China, dat zijn de drijvende krachten achter de AI ontwikkeling. Ja. Dus je ziet bijvoorbeeld met, noemen we dat de standaard evaluaties op het Huggingvee's leaderboard. Dat is allemaal Engels.
[09:38 - 09:47] Eén test is bijvoorbeeld test Amerikaanse high school test. Ja. Dat is heel erg leuk, maar daar heb je in Europa weer weinig aan. Nee, inderdaad.
[09:49 - 09:59] Dus dat is iets wat je nu ook ziet komen. Er is vorig jaar is er van een researchen een project geweest om in ieder geval een aantal
[09:59 - 10:12] van die testen te vertalen naar meerdere talen. Nederlands is daar één van geweest. Er is Nederlandse, ik denk dat het een Nederlandse student is, Bram van Roy.
[10:12 - 10:24] Die heeft dat gedaan. Die heeft op Huggingvee's, heeft die dus bijvoorbeeld de Open Dutch LLM leaderboard opgezet. Oh, dat is interessant. Dat zijn weliswaar vertaalde testen, machinaal vertaald. Ja.
[10:24 - 10:33] Dus daar kunnen eventueel kwaliteitsproblemen mee zijn. Dat zal niet heel groot zijn, maar dat staat ook aangegeven.
[10:34 - 10:46] Maar zo hebben we dus in ieder geval, ik meen dat er dan vier van de zeven evaluatieschets, zoals op het officiële leaderboard, die worden nu specifiek voor Nederlands gebruikt. Interessant.
[10:46 - 10:56] Interessant. En als jij nu, stel je dat je dus een LLM dus in je eigen bedrijf zou gaan gebruiken,
[10:56 - 11:07] wat voor, dan ga je hem dus ook trainen met je eigen, met je eigen dataset, je eigen bedrijfsdataset. Verandert dat dan ook de manier van testen?
[11:09 - 11:19] Nee. Ik denk het zozeer niet. Kijk, allereerst is het de vraag, ga jij een LLM trainen op je eigen bedrijfsdataset?
[11:19 - 11:25] Het kan, maar misschien is jouw toepassing dermate dat het niet nodig is.
[11:25 - 11:35] Als jij een model hebt wat al prima functioneert als chatbot, hè, en jij zou daar een applicatie
[11:35 - 11:46] omheen bouwen met een framework als Langchain. Waarbij jouw bedrijfsdata met RAC de informatie eruit kan halen. Ja.
[11:46 - 11:59] Dan hoef jij misschien niet strikt jouw model te trainen op jouw bedrijfsdata. Nee, inderdaad. Nou zou het kunnen zijn dat het wel interessant is puur omdat het performance verbetert.
[12:00 - 12:09] Daar kun je op gaan testen. Ja. En dan ga je inderdaad, waar je dan voor moet zorgen is dat jij inderdaad, je hebt je bedrijfsstatus
[12:09 - 12:22] set en daar maak jij een splitsing in tussen wat je als trainingsdata hanteert en een stukje wat je als test of evaluatiedata hanteert. Ja.
[12:22 - 12:33] Je wil niet dat jij traint op data die in je test set zit om leakage te voorkomen. Ja, dat heb ik. Dus dat die die data al kent. Ja.
[12:33 - 12:43] Maar de verdere stappen, hè, je kan allereerst kun je alle bestaande bedrijfsdata, je kan testen op benchmark en vanuitgaande dat je als bedrijf, dat je de hardware resources
[12:43 - 12:51] en het budget hebt, kun je al die standaard testen al draaien wat al enorm veel toegevoegde waarde geeft.
[12:52 - 13:04] Je hebt weliswaar op kleinere schaal, er zijn Nederlandse evaluatieschets beschikbaar, maar uiteraard testen op jouw eigen bedrijfsdata is natuurlijk het meest ideale. Ja.
[13:04 - 13:13] En vanuitgaande dat jij binnen een bedrijf jouw model voor een bepaald doel inzet. Ja. Is dat primair waarmee je wilt testen. Precies.
[13:13 - 13:21] Niet alleen of die zijn basis performance, maar of die ook echt zeg maar die toegevoegde waarde op businessgebied levert. Ja, precies.
[13:21 - 13:34] Hé en er komt ook wel een hoop handmatig testen bij kijken bij het testen van van van LLM's, hè, want je nog weer even terugkerend naar wat er met DPD is gebeurd.
[13:34 - 13:45] Daar zijn bepaalde soorten van tests, hè, dus prompt testings die je dan kunt doen. Kun je daar iets meer over vertellen? Ja. Jazeker.
[13:45 - 13:52] Wat we zeiden, je kan heel veel met standaard datasets, jouw bedrijfsdataset kun je testen.
[13:52 - 14:02] Alleen wat je nu ziet is als jij uiteindelijk een LLM echt als een chatbot binnen jouw bedrijf inzet. Ja.
[14:03 - 14:11] Dan denk ik dat dat toch wel met extra functionaliteiten bij je dat het toch wel de meest gebruikte toepassing is, omdat het gewoon de meeste waarde heeft.
[14:13 - 14:20] Maar wat jij ziet is mensen gaan zo'n chatbot uitlopen testen. Ja, precies. Kun je hem laten vloeken? Ja.
[14:21 - 14:32] Kun je hem, ja, als het een financiële chatbot zou zijn, kun je ermee over het het weer praten of over films? Ja. Nou zou het kunnen zijn dat je dat als bedrijf toestaat.
[14:32 - 14:44] Maar dat lijkt me. Ik denk het niet. Ik denk het eerlijk gezegd ook niet, maar ik wil open minded zijn. Dus je gaat zo'n model alijnen. Wat is mijn voorkeur?
[14:44 - 14:47] Hoe wil ik dat het model zich gedraagt? Ja.
[14:47 - 14:58] Nog even los van alle aspecten van seksisme, racisme, bias waar je niks mee te maken wil hebben.
[14:59 - 15:10] Daar moet je op gaan testen. Nou, daar zijn er wel een paar. Testen voor. Ik denk dat een heel belangrijk onderdeel is dat je ook als mens test. Ja.
[15:10 - 15:16] Want je kan niet alles als mens testen, maar je kan zeker met een stukje chat prompt engineering
[15:16 - 15:24] kun je heel erg inspelen op wat een model doet, wat die zegt en proberen daar omheen te werken. Ja.
[15:25 - 15:33] Daarnaast zijn er genoeg voorbeelden beschikbaar van jailbreak, waarbij je dus die beveiliging, die alignment probeert te omzeilen.
[15:33 - 15:42] Ik denk dat je daar echt aandacht aan moet besteden om te kijken dat jouw model zich goed gedraagt. Ja. Ja.
[15:42 - 15:52] En dan nog doe je alle testen, alle voorzorgsmaatregelen, dan nog kan het gewoon fout gaan. Ja. Die modellen zijn zo complex.
[15:52 - 16:03] Je kan daar, ze bieden heel veel waarde, maar er blijft altijd een vorm van risico inzitten. Ja. Er blijft altijd een vorm van bias aanwezig. Ja.
[16:03 - 16:14] Er blijft altijd een vorm van hallucinaties aanwezig. Dus ondanks al die testen moet je als bedrijf ook nadenken van hoe kunnen wij nou op realistische
[16:14 - 16:27] wijze zeg maar het online gedrag monitoren. Ja. Of in ieder geval al doe je maar steekproefgewijs dat in de gaten houden. Ja. Precies.
[16:27 - 16:39] Want je wil inderdaad niet dat je als een dpd chatpont in het nieuws komt of er zijn meer voorbeelden geweest. Ja. Dat wil je niet.
[16:39 - 16:52] Ja. Precies. Dus als ik het goed begrijp allemaal, dan er zijn risico's aan AI, er zijn grote voordelen aan AI, er zijn grote risico's aan AI. Ja. Ja. Ja.
[16:52 - 16:59] En dat kan je ook doen door het model te testen op waarheid, dus door functioneel
[16:59 - 17:11] te testen, maar ook onder dus de maatschappelijke elementen die daar omheen hangen, die ook te testen, proberen we de risico's te beperken. Ja.
[17:12 - 17:22] De meeste van die tests zijn geautomatiseerd. En kun je ook nog iets vertellen, is zo ook zoiets mogelijk bijvoorbeeld voor het chatgedeelte? Ja.
[17:22 - 17:31] Ja, wat we net al zeiden is, je moet handmatig testen, gewoon om die flexibiliteit te hebben.
[17:32 - 17:42] Alleen wat je dus specifiek ziet juist voor het testen van het chatgedeelte, het is gewoon lastiger met een standaard benchmark te doen.
[17:42 - 17:54] En wat je daar dus ziet is, dat noemen ze de chatbot arena. Daar zetten ze als het ware die chatbot, die laten ze tegen elkaar praten. Cool. Die zijn lekker onderlinge aan het kletsen en die gaan er ook continu weer door.
[17:54 - 18:02] En dan zie je gewoon dat het mensen zijn die beoordelen, dat is toch een beetje subjectief
[18:02 - 18:14] van hoe goed chat een chatbot nou, die mensen geven de beoordeling van hoe goed of hoe slecht de kwaliteit van een chatbot is. Ja.
[18:15 - 18:25] Dat is misschien wat subjectiever, maar als daar genoeg mensen een oordeel over hebben, dan krijg je toch best een aardige benchmark. Ja, inderdaad.
[18:25 - 18:37] En daar, nou ja, de laatste keer dat ik daar keek stond geloof ik GPT-4 Turbo inderdaad bovenaan. Maar je ziet daar wel dat open source LLMs komen daar steeds hoger. Ja.
[18:37 - 18:49] Worden steeds beter. Dus dat onderscheid wordt wel kleiner. Ja, ja. En ja, dat gaat automatisch. Ja. En dan kan ik me voorstellen dat je daar best wel een bak aan hardware voor nodig hebt.
[18:50 - 19:02] Ja, nou ja, als je kijkt naar de beurskoers van Nvidia, die varen daar wel bij. Ja, die is niet door het oude einde. Ja. Dat is, nou zijn het ook wel, ja, smaakverschil, maar ik vind het ook wel hele mooie GPU's,
[19:02 - 19:10] maar zeker voor dit soort testen, ja, de resources die daarvoor nodig zijn, dat is immens.
[19:10 - 19:19] Ik geloof, als ik een eigen model aanmeld bij het Hugging Face open LLM leaderboard, dan
[19:19 - 19:31] is hij enkele uren bezig om alle testen te draaien en dat draait op een set van high-end NVIDIA Enterprise GPU's. Ja.
[19:31 - 19:40] Dus dat is nog best wel kostbaar qua hardware, maar ook qua stroomgebruik. Maar daar zie je ook weer interessante ontwikkelingen.
[19:40 - 19:49] Toevallig net vanavond voordat ik hierheen kwam voor de podcast, zie ik op LinkedIn langskomen, Paper wat recent is uitgebracht.
[19:49 - 20:01] Tiny Benchmarks, wat hebben research gedaan voor zeg maar de standaard benchmarks, zoals ze gebruikt worden door Hugging Face, hebben ze heel goed gekeken van nou, kunnen we daar
[20:01 - 20:09] nou een kleinere set van nemen? En je moet je voorstellen, sommige van die benchmarks hebben tienduizenden vragen en
[20:09 - 20:18] ze hebben dat teruggebracht tot enkele honderden en de uiteindelijke score op basis van die tiny benchmarks. Ja.
[20:18 - 20:25] En dat blijft nog steeds binnen 2% van zeg maar de originele grote test.
[20:25 - 20:37] Kijk en dat is ook wel, als dat zoveel kleiner is, dat scheelt weer een hoop resources. En stroom denk ik ook. Ook. Ja. Ja, zeker. Ook niet geheel onbelangrijk. Zeker.
[20:37 - 20:49] Dus daar kun je gewoon, ook daar zie je weer dat er voortgang wordt gemaakt. Ja, inderdaad. Inderdaad. Ik wil graag gaan afronden. Want wij zitten hier. Ja. Volgende week weer. Volgende week weer.
[20:49 - 20:59] En even kort samengevat, AI biedt heel veel, maar het kan ook aardig risicovol worden op
[20:59 - 21:12] het moment dat je het niet test, niet evalueert. En dan evalueer je je model op twee manieren. Dus je, op functionaliteit, zoals we altijd al in softwareontwikkeling ook wel gewend
[21:12 - 21:24] zijn. Maar je moet dus ook je model moet je testen op dingen. Dingen die wij in de maatschappij relevant vinden en daar hebben dus een aantal hele goede methodes voor.
[21:25 - 21:34] Vergeet ook zeker niet naar Hug & Face te kijken en de leaderboards daar te evalueren.
[21:35 - 21:47] En volgende week gaan wij het hebben over bias. Ja. Al een vooruitblik. Bias hebben we het vandaag ook genoemd. Ja, zeker.
[21:47 - 21:56] Het moet ook getest worden. Ja. Dus dat is een mooi bruggetje voor volgende week. Leuk. Gaan we het volgende week verder over hebben. Super. Hartstikke leuk.
[21:57 - 22:02] Voor nu, hartstikke bedankt voor het luisteren en tot de volgende keer. Bedankt.

And the transcription in English
[00:00 - 00:10] Hello, welcome to the greatminds podcast. This is the first in a long series of podcasts that we're going to have. And I'm Hildo van Es.

[00:11 - 00:24] I'm an architect and co-founder of greatminds. It's a knowledge domain where we want to collect everything related to architecture.

[00:24 - 00:35] And one of those components is AI. And next to me sits Robin, with whom I'll be reviewing a whole series of AI topics.

[00:36 - 00:43] Robin, why don't you introduce yourself? Yes, Robin Smits. I work as a data scientist at NBWI.

[00:44 - 00:54] Besides that, I have my own part-time AI consultancy company, Lumie ML Consulting. And as you mentioned, we've been working together for quite some time.

[00:54 - 01:06] And we're going to expand that much further. And one of the ways we're going to do that is by recording a whole series of podcasts. So, exciting. Yes, super.

[01:06 - 01:17] Well, let's dive right in then. Today's program is about the evaluation of LLMs.

[01:17 - 01:27] And specifically open source LLMs, so Large Language Models. Well, let's just... start with the question of why we would even do it?

[01:27 - 01:40] Yes, well, let's turn the question around. What could go wrong if we don't do it? Some time ago, the DPD chatbot was in the news.

[01:40 - 01:52] Yes, I remember. And that they quickly took it offline. After several people who were chatting with it discovered... that it was quite anti its own employer.

[01:53 - 02:05] And was cursing at... So I think that's a perfect example of how things can go wrong if you don't test? If you don't evaluate an AI model.

[02:05 - 02:12] If you don't, especially with a chatbot, monitor it in real-time. Yes, indeed. Indeed.

[02:12 - 02:24] And I can imagine that if I write software myself, just normal software... I always start by writing unit tests.

[02:24 - 02:33] And then I test. Usually for functionality. But because AI is actually a very different kind of beast...

[02:33 - 02:43] ...I can imagine that the tests are quite different too. Because I never need to test whether it adheres to all ethical rules.

[02:43 - 02:54] Or if my button develops sexism or racism or something like that. Like a few years ago we had the Cortana bot...

[02:54 - 03:06] ...that developed Nazi ideas overnight. I can imagine you want to test differently for that. Yes, look from a data science perspective.

[03:07 - 03:14] If you train a model, you must always test it, always evaluate it. And that has been done for many years.

[03:14 - 03:27] Hey, you have a training dataset and ten years ago that might have been simple text with predicting a label. Almost classic example, sentiment analysis.

[03:27 - 03:40] Is my Twitter message or X now positive or negative? Yes, indeed. You train it to recognize that, to predict. And then you evaluate with a dataset that your model does that.

[03:40 - 03:52] So that's actually what has been done for a very long time. What you now see with LLM models as they are available now, is that you need to test much more broadly.

[03:52 - 04:05] Much broader evaluation. During development. So not just does it do what we expect. But what are the results? Does it show bias?

[04:06 - 04:12] Do you have issues with sexism, racism? Do you have as little of it as possible?

[04:13 - 04:24] Even every model that's developed and trained endlessly to not show that, you still run the risk. So it's not that you...

[04:24 - 04:35] That your probing and evaluation prevents everything, but you minimize your risks. Yes, indeed. And well, we don't just do that testing. Not just by ourselves.

[04:36 - 04:39] There are also leaderboards for that on Hugging Face for example.

[04:40 - 04:52] So if you use a model as a base, it's also published with a whole overview of scores attached. Can you tell us more about that? Yes, certainly.

[04:52 - 05:00] I think Hugging Face will be familiar to most people. If it's not, go take a look at the Hugging Face website.

[05:02 - 05:16] For open source LLM models, by the way not just LLM models, also computer vision models, various other types of models are just available there.

[05:16 - 05:27] But what you see for LLMs, is that they indeed have an evaluation leaderboard. Where you can just see the scores on a number of standard evaluations.

[05:28 - 05:38] And a set of seven evaluations. And those just give a very broad impression of the qualities of your model. Yes.

[05:38 - 05:51] And that is actually, we already say it, open LLM evaluation board. As a counterpart to the closed AI models.

[05:51 - 06:00] Yes, indeed. And types. So even though that joke is apparently no longer original and many people seem to claim it, I saw by chance this week on the internet.

[06:00 - 06:10] But look, if we humans want to know how good is the model, then you look at performance data. Yes.

[06:10 - 06:20] So the higher the model scores, in principle the better it is. That doesn't say everything, but it does give a very good indication. Yes, exactly.

[06:20 - 06:34] And are there also standard evaluations that we then, so when we have an LLM or a model, how is something like that tested?

[06:34 - 06:41] Are there standards for that? Yes. Let's establish first that there are many benchmarks.

[06:42 - 06:53] I don't think it's feasible to run every benchmark for every model. If only because of the amount of hardware resources you need for that.

[06:54 - 07:09] What you see at Hugging Face is that they have a set of seven standard evaluations that indicate a performance metric for your model in as broad a range as possible.

[07:09 - 07:22] And with that just a very broad spectrum of knowledge, logic in reasoning. Let's say demonstrate what the performance is or isn't. Yes.

[07:22 - 07:31] It's quite interesting how that has developed. If you go back a few years, 2018 I believe.

[07:32 - 07:40] At that time one of the most used evaluation benchmarks was GLUE. The General Language Understanding Evaluation.

[07:40 - 07:53] And you saw that many models achieved a certain score on that. There are... At their core the current LLM models are just a further development of the transformer models.

[07:54 - 08:05] As they have been developed further from 2017 onwards. Such a famous paper that came out then. Attention is all you need. Yes. Well, how beautiful ChatGPT may be.

[08:06 - 08:17] It's a further development of what was already established then. And what you saw in the years after is where in 2018. The General Language Understanding Evaluation. Was still enough.

[08:18 - 08:29] The year after all models already scored the maximum score on that. Yes. Well, then they came up with the Super General Language Understanding Evaluation. Super GLUE.

[08:30 - 08:42] Around 2021 that was already outdated too. Yes. And actually since that time you see that they use much more complex problems, much more complex tests.

[08:42 - 08:55] To evaluate that performance, let's say. And what you now see at Hugging Face is that they have seven metrics. Yes. And that just gives a very good impression. Yes.

[08:55 - 09:07] Say, could you perhaps go a bit deeper into the different evaluation methods and particularly about the support of evaluations in other languages?

[09:07 - 09:13] Because that seems like a thing to me too. Yes, that's certainly a thing.

[09:14 - 09:26] What you see with the tests is that they're actually all optimized for English. Yes. Or Chinese. Yes, that sounds logical too.

[09:26 - 09:38] America and China, those are the driving forces behind AI development. Yes. So you see for example with, let's call it the standard evaluations on the Hugging Face leaderboard. That's all English.

[09:38 - 09:47] One test is for example test American high school test. Yes. That's very nice, but you have little use for that in Europe. No, indeed.

[09:49 - 09:59] So that's something you now see coming too. Last year there was a research project to at least translate some

[09:59 - 10:12] of those tests into multiple languages. Dutch is one of them. There is a Dutch, I think it's a Dutch student, Bram van Roy.

[10:12 - 10:24] Who did that. Who has set up the Open Dutch LLM leaderboard on Hugging Face. Oh, that's interesting. Those are translated tests, machine translated. Yes.

[10:24 - 10:33] So there could be quality issues with that. That won't be very big, but that's also indicated.

[10:34 - 10:46] But so we have at least, I believe that four of the seven evaluation sets, as on the official leaderboard, are now specifically used for Dutch. Interesting.

[10:46 - 10:56] Interesting. And if you now, suppose that you're going to use an LLM in your own company,

[10:56 - 11:07] what kind of, then you're also going to train it with your own, with your own dataset, your own company dataset. Does that also change the way of testing?

[11:09 - 11:19] No. I don't think so much. Look, first of all the question is, are you going to train an LLM on your own company dataset?

[11:19 - 11:25] It's possible, but maybe your application is such that it's not necessary.

[11:25 - 11:35] If you have a model that already functions well as a chatbot, right, and you would build an application

[11:35 - 11:46] around it with a framework like Langchain. Where your company data with RAG can extract the information from it. Yes.

[11:46 - 11:59] Then you might not strictly need to train your model on your company data. No, indeed. Well it could be that it is interesting purely because it improves performance.

[12:00 - 12:09] You can test for that. Yes. And then you indeed, what you need to ensure then is that you indeed, you have your company status

[12:09 - 12:22] set and you make a split in that between what you use as training data and a piece that you use as test or evaluation data. Yes.

[12:22 - 12:33] You don't want to train on data that's in your test set to prevent leakage. Yes, I have that. So that it already knows that data. Yes.

[12:33 - 12:43] But the further steps, right, you can first use all existing company data, you can test on benchmark and assuming that as a company, you have the hardware resources

[12:43 - 12:51] and the budget, you can already run all those standard tests which already adds enormous value.

[12:52 - 13:04] You do have on a smaller scale, there are Dutch evaluation sets available, but of course testing on your own company data is naturally the most ideal. Yes.

[13:04 - 13:13] And assuming that you deploy your model within a company for a specific purpose. Yes. That's primarily what you want to test. Exactly.

[13:13 - 13:21] Not only if it has its basic performance, but if it also really delivers that added value in business terms. Yes, exactly.

[13:21 - 13:34] Hey and there's also quite a lot of manual testing involved in testing LLMs, right, because going back to what happened with DPD.

[13:34 - 13:45] There are certain types of tests, right, so prompt testing that you can do. Can you tell us more about that? Yes. Certainly.

[13:45 - 13:52] What we said, you can do a lot with standard datasets, you can test your company dataset.

[13:52 - 14:02] Only what you now see is if you ultimately deploy an LLM as a chatbot within your company. Yes.

[14:03 - 14:11] Then I think that it is with extra functionalities that it is still the most used application, because it just has the most value.

[14:13 - 14:20] But what you see is people are going to stress test such a chatbot. Yes, exactly. Can you make it curse? Yes.

[14:21 - 14:32] Can you, yes, if it were a financial chatbot, can you talk to it about the weather or about movies? Yes. Well it could be that you allow that as a company.

[14:32 - 14:44] But that seems to me. I don't think so. I honestly don't think so either, but I want to be open-minded. So you're going to align such a model. What is my preference?

[14:44 - 14:47] How do I want the model to behave? Yes.

[14:47 - 14:58] Even apart from all aspects of sexism, racism, bias that you want nothing to do with.

[14:59 - 15:10] You need to test for that. Well, there are a few. Tests for that. I think a very important part is that you also test as a human. Yes.

[15:10 - 15:16] Because you can't test everything as a human, but you can certainly with a bit of chat prompt engineering

[15:16 - 15:24] you can really play into what a model does, what it says and try to work around that. Yes.

[15:25 - 15:33] Additionally, there are plenty of examples available of jailbreak, where you try to bypass that security, that alignment.

[15:33 - 15:42] I think you really need to pay attention to that to check that your model behaves well. Yes. Yes.

[15:42 - 15:52] And then you do all the tests, all the precautions, then still things can just go wrong. Yes. Those models are so complex.

[15:52 - 16:03] You can, they offer a lot of value, but there always remains a form of risk in it. Yes. There always remains a form of bias present. Yes.

[16:03 - 16:14] There always remains a form of hallucinations present. So despite all those tests you as a company also need to think about how we can in a realistic

[16:14 - 16:27] way monitor the online behavior. Yes. Or at least if you just do it on a sample basis keep an eye on that. Yes. Exactly.

[16:27 - 16:39] Because you indeed don't want to end up in the news like a DPD chatbot or there have been more examples. Yes. You don't want that.

[16:39 - 16:52] Yes. Exactly. So if I understand it all correctly, then there are risks to AI, there are great advantages to AI, there are great risks to AI. Yes. Yes. Yes.

[16:52 - 16:59] And you can do that by testing the model for truth, so by testing functionally,

[16:59 - 17:11] but also testing the societal elements surrounding it, we try to limit the risks. Yes.

[17:12 - 17:22] Most of these tests are automated. And can you also tell us, is something like that possible for example for the chat part? Yes.

[17:22 - 17:31] Yes, what we just said is, you need to test manually, just to have that flexibility.

[17:32 - 17:42] Only what you specifically see just for testing the chat part, it's just harder to do with a standard benchmark.

[17:42 - 17:54] And what you see there is, they call it the chatbot arena. There they put that chatbot, they let them talk to each other. Cool. They're chatting among themselves and they keep going through it.

[17:54 - 18:02] And then you just see that it's people who judge, that's a bit subjective

[18:02 - 18:14] about how well does a chatbot chat, those people give the assessment of how good or how bad the quality of a chatbot is. Yes.

[18:15 - 18:25] That might be a bit more subjective, but if enough people have an opinion about it, you still get quite a decent benchmark. Yes, indeed.

[18:25 - 18:37] And there, well, the last time I looked I believe GPT-4 Turbo was indeed at the top. But you do see open source LLMs are getting higher there. Yes.

[18:37 - 18:49] Are getting better. So that distinction is getting smaller. Yes, yes. And yes, that happens automatically. Yes. And then I can imagine you need quite a bit of hardware for that.

[18:50 - 19:02] Yes, well, if you look at Nvidia's stock price, they're benefiting from that. Yes, that's not through the roof. Yes. That is, well they are also, yes, matter of taste, but I do think they're very beautiful GPUs,

[19:02 - 19:10] but especially for these kinds of tests, yes, the resources needed for that are immense.

[19:10 - 19:19] I believe, if I submit my own model to the Hugging Face open LLM leaderboard, then

[19:19 - 19:31] it takes several hours to run all the tests and that runs on a set of high-end NVIDIA Enterprise GPUs. Yes.

[19:31 - 19:40] So that's quite costly in terms of hardware, but also in terms of power usage. But there you see interesting developments too.

[19:40 - 19:49] Coincidentally just tonight before I came here for the podcast, I see on LinkedIn passing by, Paper that was recently released.

[19:49 - 20:01] Tiny Benchmarks, where research has been done for let's say the standard benchmarks, as they are used by Hugging Face, they looked very carefully at well, can we

[20:01 - 20:09] take a smaller set of that? And you have to imagine, some of those benchmarks have tens of thousands of questions and

[20:09 - 20:18] they've reduced that to several hundreds and the final score based on those tiny benchmarks. Yes.

[20:18 - 20:25] And that still stays within 2% of let's say the original large test.

[20:25 - 20:37] Look and that's also, if that's so much smaller, that saves a lot of resources. And power I think too. Also. Yes. Yes, certainly. Not entirely unimportant. Certainly.

[20:37 - 20:49] So there you can just, there too you see that progress is being made. Yes, indeed. Indeed. I want to start wrapping up. Because we're here. Yes. Next week again. Next week again.

[20:49 - 20:59] And briefly summarized, AI offers a lot, but it can also become quite risky

[20:59 - 21:12] when you don't test it, don't evaluate it. And then you evaluate your model in two ways. So you, on functionality, as we've always been used to in software development too.

[21:12 - 21:24] But you also need to test your model on things. Things that we find relevant in society and there are several very good methods for that.

[21:25 - 21:34] Don't forget to look at Hugging Face and evaluate the leaderboards there.

[21:35 - 21:47] And next week we're going to talk about bias. Yes. A preview. Bias we mentioned today too. Yes, certainly.

[21:47 - 21:56] It needs to be tested too. Yes. So that's a nice bridge to next week. Nice. We'll talk more about it next week. Super. Really nice.

[21:57 - 22:02] For now, thanks very much for listening and until next time. Thanks.

People on this episode

Hildo van Es

Host

Robin Smits

Co-host