Evals (evaluations)
Evaluations (evals) help you track how your agent's behavior changes over time. When you save a conversation as an eval, you can replay it later to check whether the agent's responses have changed. Evals measure consistency, not correctness: they tell you whether behavior changed, not whether it is right or wrong.
What are evals
An eval is a saved conversation that you can replay. When you run evals, cagent replays the user messages and compares the new responses against the originally saved conversation. High scores mean the agent behaved much as it did before; low scores mean its behavior changed.
What you do with that information depends on why you saved the conversation. You might save successful conversations to catch regressions, or save failure cases to document known issues and track whether they improve.
Common workflows
How you use evals depends on what you are trying to accomplish.
Regression testing: Save conversations where your agent performs well. When you make changes later (upgrading models, updating prompts, refactoring code), run the evals. High scores mean behavior stayed consistent, which is usually what you want. Low scores mean something changed, so examine the new behavior to confirm it is still correct.
Tracking improvements: Save conversations where your agent struggles or fails. Each time you make an improvement, run these evals to see how behavior evolves. Low scores indicate that the agent now behaves differently, which might mean you fixed the issue. You need to verify manually that the new behavior is actually better.
Documenting edge cases: Save interesting or unusual conversations regardless of quality. Use them to understand how your agent handles edge cases and whether that behavior changes over time.
Evals measure whether behavior changed. You decide whether that change is good or bad.
Creating an eval
Save a conversation from an interactive session:
$ cagent run ./agent.yaml
Have a conversation with your agent, then save it as an eval:
> /eval test-case-name
Eval saved to evals/test-case-name.json
The conversation is saved to the evals/ directory in your current working directory. You can organize eval files into subdirectories if needed.
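Because saved evals are ordinary files under evals/, a short script can show what has been collected so far. The following is a minimal Python sketch and not part of cagent: list_eval_files is a hypothetical helper, and it assumes only that eval files use the .json extension shown above and may sit in subdirectories.

```python
from pathlib import Path


def list_eval_files(evals_dir="evals"):
    """Return every .json file under evals_dir, including subdirectories."""
    root = Path(evals_dir)
    if not root.is_dir():
        # No evals saved yet (or a wrong path): report nothing instead of failing.
        return []
    return sorted(root.rglob("*.json"))


# Example: print each saved eval relative to the current working directory.
for eval_path in list_eval_files():
    print(eval_path)
```

Run from the directory that contains evals/, this prints one path per saved conversation.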
Running evals
Run all evals in the default directory:
$ cagent eval ./agent.yaml
Use a custom eval directory:
$ cagent eval ./agent.yaml ./my-evals
Run evals against an agent from a registry:
$ cagent eval agentcatalog/myagent
Example output:
$ cagent eval ./agent.yaml
--- 0
First message: tell me something interesting about kil
Eval file: c7e556c5-dae5-4898-a38c-73cc8e0e6abe
Tool trajectory score: 1.000000
Rouge-1 score: 0.447368
Cost: 0.00
Output tokens: 177
Understanding results
For each eval, cagent shows:
- First message - The first user message from the saved conversation
- Eval file - The UUID of the eval file being run
- Tool trajectory score - How similarly the agent used tools (0 to 1 scale, higher is better)
- Rouge-1 score - Text similarity between responses (0 to 1 scale, higher is better)
- Cost - The cost of this eval run
- Output tokens - The number of tokens generated
Higher scores mean the agent behaved more like the originally recorded conversation. A score of 1.0 indicates identical behavior.
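To track these fields over time, it can help to turn the printed report into structured data. The sketch below is one possible parser in Python, not something cagent provides. It assumes the output looks exactly like the example shown earlier (a "--- N" separator followed by "Label: value" lines); parse_eval_output and SAMPLE_OUTPUT are hypothetical names, and other cagent versions may print a different format.

```python
import re

# Labels copied from the example output; treat them as an assumption.
CONVERTERS = {
    "First message": str,
    "Eval file": str,
    "Tool trajectory score": float,
    "Rouge-1 score": float,
    "Cost": float,
    "Output tokens": int,
}
FIELD_PATTERN = re.compile(r"^(" + "|".join(map(re.escape, CONVERTERS)) + r"):\s*(.*)$")

SAMPLE_OUTPUT = """--- 0
First message: tell me something interesting about kil
Eval file: c7e556c5-dae5-4898-a38c-73cc8e0e6abe
Tool trajectory score: 1.000000
Rouge-1 score: 0.447368
Cost: 0.00
Output tokens: 177
"""


def parse_eval_output(text):
    """Return one dict per eval, keyed by the labels printed in the report."""
    results = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("--- "):
            # A separator line starts a new eval record.
            current = {"index": line[4:].strip()}
            results.append(current)
            continue
        match = FIELD_PATTERN.match(line)
        if match and current is not None:
            label, value = match.groups()
            current[label] = CONVERTERS[label](value)
    return results


for record in parse_eval_output(SAMPLE_OUTPUT):
    print(record["Eval file"], record["Tool trajectory score"], record["Rouge-1 score"])
```

Storing the parsed records alongside each change makes it easier to see when a score moved.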
What the scores mean
Tool trajectory score measures whether the agent called the same tools in the same order as in the original conversation. A lower score suggests the agent may have found a different approach to the problem. That is not necessarily wrong, but it is worth investigating.
Rouge-1 score measures how similar the response text is to the original. It is a heuristic measure: differently worded responses can still be correct, so treat it as one signal rather than absolute truth.
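To see why wording alone can lower a text-similarity score, here is a minimal Python sketch of a ROUGE-1 style measure: the F1 of unigram overlap between two texts. It illustrates the general technique only. How cagent tokenizes text, and whether it reports precision, recall, or F1, is not stated here, so the lowercase whitespace tokenization and the rouge1_f1 helper are assumptions.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two texts (lowercased, split on whitespace)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


original = "The capital of France is Paris"
reworded = "Paris is France's capital city"
print(rouge1_f1(original, original))  # identical text gives the maximum score
print(rouge1_f1(original, reworded))  # same fact, different wording, lower score
```

Both responses state the same fact, yet the reworded one shares only three of its words with the original, which is why a low text score calls for reading the response rather than assuming a regression.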
Interpreting your results
Scores close to 1.0 mean that behavior stayed consistent after your changes. The agent takes the same approach and produces similar responses. This is generally a good sign: it shows that your changes did not break existing functionality.
Lower scores mean that behavior changed compared with the saved conversation. This could be a regression, where the agent now performs worse, or an improvement, where the agent found a better approach.
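One way to apply this when you have many evals is to list the ones whose scores fall below a cutoff and review those by hand. The Python sketch below assumes you have already collected scores into dictionaries. The keys eval_file, tool_trajectory, and rouge1, the evals_to_review helper, and the 0.8 cutoff are all hypothetical choices for illustration; cagent does not define a threshold.

```python
def evals_to_review(results, threshold=0.8):
    """Return (eval file, reasons) pairs for evals with any score below threshold."""
    flagged = []
    for result in results:
        reasons = [
            f"{name}={result[name]:.2f}"
            for name in ("tool_trajectory", "rouge1")
            if result[name] < threshold
        ]
        if reasons:
            flagged.append((result["eval_file"], reasons))
    return flagged


scores = [
    {"eval_file": "checkout-flow", "tool_trajectory": 1.0, "rouge1": 0.45},
    {"eval_file": "refund-edge-case", "tool_trajectory": 1.0, "rouge1": 0.95},
]
for eval_file, reasons in evals_to_review(scores):
    print(f"review {eval_file}: {', '.join(reasons)}")
```

A flagged eval only means that behavior changed; whether the change is a regression or an improvement still needs a manual look.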
When scores drop, examine the actual behavior to decide whether it is better or worse than before. Eval files are stored as JSON in the evals directory, so open the file to see the original conversation. Then test your modified agent with the same input and compare the responses. If the new response is better, save a new conversation to replace the eval. If it is worse, you have found a regression.
The scores guide you to what changed. Your judgment determines whether the change is good or bad.
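Since eval files are JSON, a few lines of Python can outline what a saved conversation contains before you read it in full. The layout of cagent's eval files is not described here, so this sketch deliberately assumes no schema: it only reports keys and value types. summarize_json and describe_eval_file are hypothetical helpers, and the path reuses the file name from the example above.

```python
import json
from pathlib import Path


def summarize_json(value, depth=0, max_depth=2):
    """Return indented lines describing the shape of a parsed JSON value."""
    indent = "  " * depth
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{indent}{key}: {type(child).__name__}")
            if depth < max_depth:
                lines.extend(summarize_json(child, depth + 1, max_depth))
    elif isinstance(value, list) and value:
        # Describe only the first element; lists of messages tend to be uniform.
        lines.append(f"{indent}[{len(value)} items, first is {type(value[0]).__name__}]")
        if depth < max_depth:
            lines.extend(summarize_json(value[0], depth + 1, max_depth))
    return lines


def describe_eval_file(path):
    """Load one eval file and describe its structure without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return summarize_json(data)


eval_path = Path("evals/test-case-name.json")
if eval_path.exists():
    print("\n".join(describe_eval_file(eval_path)))
```

The outline shows where the recorded messages live in the file, which makes it quicker to find the original response you want to compare against.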
When to use evals
Evals help you track how behavior changes over time. They are useful for catching regressions when you upgrade models or dependencies, documenting known failure cases that you want to fix, and understanding how edge cases change as you iterate.
Evals are not meant for determining which agent configuration works best. They measure similarity to a saved conversation, not correctness. To evaluate different configurations and decide which is better, use manual testing.
Save conversations that are worth tracking. Build a collection of important workflows, interesting edge cases, and known issues. Run your evals when you make changes to see what shifted.
Next steps
- Check the CLI reference for all cagent eval options
- Learn best practices for building effective agents
- Review example configurations for different agent types