Megatest: Changes On Branch acb5b0b2be22dbeb

Changes In Branch v1.70-defunct Through [acb5b0b2be] Excluding Merge-Ins

This is equivalent to a diff from 2769e4b7c9 to acb5b0b2be

2021-03-01
17:42		Manually patched in the new view check-in: f5206150ee user: mrwellan tags: v1.6569-new-view
2021-01-26
14:00		Fix for the > crash. Maybe... Leaf check-in: 5a05fc04ff user: matt tags: v1.6569-gt-crash-fix
2021-01-25
12:03		rebased lazy-queue rollup check-in: 07ab120544 user: matt tags: v1.65-lazyqueue-items-rollup
2021-01-15
22:46		begin diet check-in: badd71f3b3 user: matt tags: v1.6569-diet
21:34		eval-string-in-environment if was disabled, re-enabled check-in: 9564772564 user: matt tags: v1.6569-reenable-eval-if
2021-01-08
11:42		enable custom value for max delay between archive time and test last update time Leaf check-in: 86a3d1148e user: pjhatwal tags: v1.6569-refactor
2020-11-25
12:00		Fixed issues in server gating code Leaf check-in: 063273e8cb user: mrwellan tags: v1.6569-server-gate-fix
2020-11-24
22:27		Added support for resetting run - allows reloading tests-paths to add tests to a run part way through. Just run megatest -clean-cache -runname $MT_RUNNAME Leaf check-in: 213021e02d user: mrwellan tags: v1.6596-reload-tests-paths
2020-10-13
16:46		Changed version from 69 to 76. No other changes. Will compile with chicken 13 check-in: 87ca35010f user: mmgraham tags: v1.65, v1.6576
2020-10-12
16:49		Reduced message from failed to info. Reverted a delay which seems to help pass full stack ext-tests. Leaf check-in: 9e35b1252c user: mrwellan tags: v1.65-minor-patch
10:18		Safe vector access in rmt. check-in: 58bb6d997a user: mrwellan tags: v1.65-side2
2020-10-11
22:46		Patched forward adjutant code. check-in: f936717bfa user: matt tags: v1.65-adjutant-again
2020-10-05
22:49		Do not exit on failure to create directory - race conditions on NFS cause false fail scenarios - just keep going and cross your fingers... (cherrypicked from v1.6572) check-in: 05b253a452 user: matt tags: v1.65-sidework
22:46		run duration testdat check-in: 4a0b43f3c6 user: matt tags: v1.65-test-rundat2
2020-10-04
02:21		Attempt to merge all across. Closed-Leaf check-in: 5e97f11795 user: mrwellan tags: v1.70-defunct
2020-10-03
22:25		Fixed merge related issues. check-in: acb5b0b2be user: mrwellan tags: v1.70-defunct
21:25		Fixed (again?) the DEAD issue. Bad logic on re-calc of prereq needed. The (runs:testdat-prereqs-not-met testdat) is telling you that this needs recalc as it was previously not met. Thus can bypass if was met previously (although why would we reach here if it was met previously?). check-in: d336ea7394 user: matt tags: v1.70-defunct
2020-09-21
15:36		merged in 1.65-test-rundat branch ==/FAIL/orion,mars/== check-in: cfd25d66e9 user: mmgraham tags: v1.6571, v1.65-failed-testdat
07:00		Added get-testsuite-name all over launch:setup and still not set when needed! This did NOT work. Closed-Leaf check-in: 2efe8ad422 user: mrwellan tags: v1.65-get-testsuitename
2020-09-19
04:21		Start moving test_rundat to no-sync db. ==/20/2/WARN/1203/mars/== check-in: abfabdb839 user: matt tags: v1.65-test-rundat
2020-09-18
17:30		added check for file existence before file delete ==/14/1.9/WARN/orion,mars/== NOTE: This is the last v1.65 before the split off. I.e., code from before this point IS in the far future v1.65 branch. Code from this point to that branch might NOT be in the branch. check-in: 2769e4b7c9 user: mmgraham tags: v1.65, v1.6569
12:27		cherry picked 2 fixes, changed version to 1.6569 ==/7.2/2.0/PASS/1201/mars/== check-in: d145d0eb02 user: mmgraham tags: v1.65

Modified TODO from [da5eae4898] to [0885dee1e5].

Modified common.scm from [33c7316880] to [9136bd0109].

Modified dashboard.scm from [935bf4d2df] to [59b903a7ca].

Modified db.scm from [fb3a18f52f] to [900db7dda7].

Modified launch.scm from [d0067277fa] to [9905b8fdbe].

Modified megatest-version.scm from [f253e5978b] to [291b748ecb].

Modified rmt.scm from [39d97c528a] to [29d7593e43].

Modified runs.scm from [030b929939] to [c2e599115c].

Modified tests.scm from [0094b671e6] to [af455125f4].

︙
@@ -568,15 +568,15 @@
   (for-each
    (lambda (file)
      (let* ((fullname (conc "logs/" file)))
        (if (directory? fullname)
            (debug:print-info 0 default-log-port fullname " in logs directory is a directory! Cannot rotate it, it is best to not put subdirectories in the logs dir.")
            (handle-exceptions
             exn
-            (debug:print-error 0 default-log-port "failed to remove " fullname ", exn=" exn)
+            (debug:print-info 0 default-log-port "failed to remove " fullname ", exn=" exn)
             (delete-file* fullname)))))
    files)
   (debug:print-info 0 default-log-port "Deleted " (length files) " files from logs, keeping " max-allowed " files."))))))
 
 ;; Force a megatest cleanup-db if version is changed and skip-version-check not specified
 ;; Do NOT check if not on homehost!
 ;;
︙
1313 1314 1315 1316 1317 1318 1319 ~~1320 1321~~ 1322 1323 ~~1324 1325~~ 1326 ~~1327 1328~~ ~~1329 1330 1331~~ 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341	1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348	+ - - + + + - - + + + + + + + - - + + - - - + + +	rtestpatt) (else (debug:print-info 0 default-log-port "using testpatt " args-testpatt " rtestpatt:" rtestpatt) args-testpatt)))) ;; ~~(define (common:false-on-exception thunk #!key (message #f)) (handle-exceptions ~~exn~~~~ (define (common:false-on-exception thunk #!key (message #f)(tries 1)) (handle-exceptions exn (begin (if message ~~(debug:print-info 0 default-log-port message)) ~~#f)~~ (thunk) ))~~ (debug:print-info 0 default-log-port message " exn=" exn)) (if (> tries 1) (begin (thread-sleep! 1) (common:false-on-exception thunk message: message tries: (- tries 1))) #f)) (thunk))) ~~(define (common:file-exists? path-string #!key (silent #f)) ;; this avoids stack dumps in the case where~~ (define (common:file-exists? path-string #!key (silent #f)(tries 1)) ;; this avoids stack dumps in the case where NFS is slow or flakey ;;;; TODO: catch permission denied exceptions and emit appropriate warnings, eg: system error while trying to access file: "/nfs/pdx/disks/icf_env_disk001/bjbarcla/gwa/issues/mtdev/randy-slow/reproduce/q... (common:false-on-exception ~~(lambda () (file-exists? path-string))~~ (common:false-on-exception (lambda ()(file-exists? path-string)) message: (if (not silent) (conc "Unable to access path: " path-string) #f) tries: tries )) (define (common:directory-exists? path-string) ;;;; TODO: catch permission denied exceptions and emit appropriate warnings, eg: system error while trying to access file: "/nfs/pdx/disks/icf_env_disk001/bjbarcla/gwa/issues/mtdev/randy-slow/reproduce/q... (common:false-on-exception (lambda () (directory-exists? 
path-string)) message: (conc "Unable to access path: " path-string) ))
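Note on the hunk above: common:false-on-exception now accepts a tries count and sleeps one second between attempts, and common:file-exists? passes its own tries through. The following is a minimal sketch of that retry behaviour, assuming CHICKEN 4 (as the `(use ...)` forms elsewhere in this diff suggest). The names false-on-exception, pause, flaky and calls are hypothetical and exist only for illustration; logging is left out and the pause is shortened.

```scheme
;; CHICKEN 4 style, assumed from the (use ...) forms seen in this diff.
(use srfi-18) ;; provides thread-sleep!

;; Hypothetical stand-in for common:false-on-exception, without the logging.
;; Returns the thunk's value, or #f once all tries have raised an exception.
(define (false-on-exception thunk #!key (tries 1) (pause 0.01))
  (handle-exceptions
   exn
   (if (> tries 1)
       (begin
         (thread-sleep! pause) ;; the real code sleeps 1 second
         (false-on-exception thunk tries: (- tries 1) pause: pause))
       #f)
   (thunk)))

;; A thunk that raises on its first two calls and succeeds on the third,
;; standing in for a file-exists? call on a slow NFS mount.
(define calls 0)
(define (flaky)
  (set! calls (+ calls 1))
  (if (< calls 3)
      (error "simulated NFS hiccup")
      'ok))

;; With a single try the first failure is final.
(define result-1 (false-on-exception flaky tries: 1))

;; With three tries the third attempt succeeds.
(set! calls 0)
(define result-3 (false-on-exception flaky tries: 3))
```

Because tries defaults to 1, existing callers that do not pass it keep the old single-attempt behaviour.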
︙

︙
210 211 212 213 214 215 216 217 218 219 220 221 222 223	210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225	+ +	;; (define (dboard:common-set-tabdat! commondat tabnum tabdat) (hash-table-set! (dboard:commondat-tabdats commondat) tabnum tabdat)) (define updater-running #f) ;; move this into one of the stucts ;; gets and calls updater list based on curr-tab-num ;; (define (dboard:common-run-curr-updaters commondat #!key (tab-num #f)) (if (dboard:common-get-tabdat commondat tab-num: tab-num) ;; only update if there is a tabdat (let* ((tnum (or tab-num (dboard:commondat-curr-tab-num commondat))) (updaters (hash-table-ref/default (dboard:commondat-updaters commondat) tnum
︙
238 239 240 241 242 243 244 ~~245~~ 246 247 248 249 250 251 252	240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256	- + + +	(curr-updaters (hash-table-ref/default (dboard:commondat-updaters commondat) tnum '()))) (hash-table-set! (dboard:commondat-updaters commondat) tnum (cons updater curr-updaters)))) ;; data for each specific tab goes here ;; ~~(defstruct dboard:tabdat~~ (defstruct dboard:tabdat ;; flags ((already-running #f) : boolean) ;; the updater is already running. skip ;; runs ((allruns '()) : list) ;; list of dboard:rundat records ((allruns-by-id (make-hash-table)) : hash-table) ;; hash of run-id -> dboard:rundat records ((done-runs '()) : list) ;; list of runs already drawn ((not-done-runs '()) : list) ;; list of runs not yet drawn (header #f) ;; header for decoding the run records (keys #f) ;; keys for this run (i.e. target components)
︙
@@ -643,15 +647,15 @@
 ;;
 ;; NOTE: Yes, this is used
 ;;
 (define (dboard:get-tests-for-run-duplicate tabdat run-id run testnamepatt key-vals)
   (let* ((start-time (current-seconds))
          (access-mode (dboard:tabdat-access-mode tabdat))
          (num-to-get (string->number (or (configf:lookup configdat "setup" "num-tests-to-get")
-                                         "200")))
+                                         "50"))) ;; was 200, which is fine in a normal run area.
          (states (hash-table-keys (dboard:tabdat-state-ignore-hash tabdat)))
          (statuses (hash-table-keys (dboard:tabdat-status-ignore-hash tabdat)))
          (do-not-use-db-file-timestamps #t) ;; (configf:lookup configdat "setup" "do-not-use-db-file-timestamps")) ;; this still hosts runs-summary-tab
          (do-not-use-query-timestamps #t) ;; (configf:lookup configdat "setup" "do-not-use-query-timestamps")) ;; this no longer troubles runs-summary-tab
          (sort-info (get-curr-sort))
          (sort-by (vector-ref sort-info 1))
          (sort-order (vector-ref sort-info 2))
︙
714 715 716 717 718 719 720 721 722 723 724 725 726 727	718 719 720 721 722 723 724 725 726 727 728 729 730 731 732	+	;; ;; (debug:print 0 default-log-port "got-all: " got-all " multi-get: " multi-get " num-to-get: " num-to-get " (length tmptests): " (length tmptests) " db-modified: " db-modified " db-mod-time: " db-mod-time " db-path: " db-path) (if got-all (begin (dboard:rundat-last-update-set! run-dat (- start-time 2)) (dboard:rundat-run-data-offset-set! run-dat 0)) (begin ;;; (thread-sleep! 0.25) ;; give the rest of the gui some time to update. <-- this did NOT help (dboard:rundat-run-data-offset-set! run-dat (+ num-to-get (dboard:rundat-run-data-offset run-dat))))) (for-each (lambda (tdat) (let ((test-id (db:test-get-id tdat)) (state (db:test-get-state tdat)))
︙
831 832 833 834 835 836 837 ~~838~~ 839 840 841 842 843 844 845	836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852	+ + - +	;; this calls dboard:get-tests-for-run-duplicate for each run ;; ;; create a virtual table of all the tests ;; keypatts: ( (KEY1 "abc%def")(KEY2 "%") ) ;; (define (dboard:update-rundat tabdat runnamepatt numruns testnamepatt keypatts) (dboard:tabdat-already-running-set! tabdat #t) (let* (;; (already-running (dboard:tabdat-already-running tabdat)) ~~~~(let* (~~(access-mode (dboard:tabdat-access-mode tabdat))~~ (access-mode (dboard:tabdat-access-mode tabdat)) (keys (dboard:tabdat-keys tabdat)) ;; (db:dispatch-query access-mode rmt:get-keys db:get-keys))) (last-runs-update (- (dboard:tabdat-last-runs-update tabdat) 2)) (allruns (rmt:get-runs runnamepatt numruns (dboard:tabdat-start-run-offset tabdat) keypatts)) ;;(allruns-tree (rmt:get-runs-by-patt (dboard:tabdat-keys tabdat) "%" #f #f #f #f)) (allruns-tree (rmt:get-runs-by-patt keys "%" #f #f #f #f 0)) ;; last-runs-update));;'("id" "runname") (header (db:get-header allruns)) (runs (db:get-rows allruns)) ;; RA => Filtered as per runpatt selected
︙
899 900 901 902 903 904 905 ~~906~~ 907 908 909 910 ~~911 912~~ ~~913 914 915~~ 916 917 918 919 920 921 ~~922~~ 923 924 925 926 927 928 929	906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935	- + + - - + + - - - - + +	(elapsed-time (- (current-seconds) start-time))) (if (null? all-test-ids) (hash-table-delete! (dboard:tabdat-allruns-by-id tabdat) run-id) (hash-table-set! (dboard:tabdat-allruns-by-id tabdat) run-id run-struct)) (if (or (null? tal) (> elapsed-time 2)) ;; stop loading data after 5 seconds, on the next call more data should be loaded since get-tests-for-run uses last update (begin ~~(when (> elapsed-time 2)~~ #;(when (> elapsed-time 2) (debug:print 0 default-log-port "NOTE: updates are taking a long time, " elapsed-time "s elapsed.") (let* ((old-val (iup:attribute tim "TIME")) (new-val (number->string (inexact->exact (floor (* 2 (string->number old-val))))))) (if (< (string->number new-val) 5000) (begin ~~((debug:print 0 default-log-port "NOTE: increasing poll interval from "old-val" to "new-val) (iup:attribute-set! tim "TIME" new-val))))~~ (debug:print 0 default-log-port "NOTE: increasing poll interval from "old-val" to "new-val) (iup:attribute-set! tim "TIME" new-val))))) ) (dboard:tabdat-allruns-set! tabdat new-res) maxtests) (if (> (dboard:rundat-run-data-offset run-struct) 0) (loop run tal new-res newmaxtests) ;; not done getting data for this run (loop (car tal)(cdr tal) new-res newmaxtests))))))) (dboard:tabdat-filters-changed-set! tabdat #f) ~~(dboard:update-tree tabdat runs-hash header tb)))~~ (dboard:update-tree tabdat runs-hash header tb) (dboard:tabdat-already-running-set! tabdat #f))) (define collapsed (make-hash-table)) (define (toggle-hide lnum uidat) ; fulltestname) (let* ((btn (vector-ref (dboard:uidat-get-lftcol uidat) lnum)) (fulltestname (iup:attribute btn "TITLE")) (parts (string-split fulltestname "("))
︙
2503 2504 2505 2506 2507 2508 2509 ~~2510 2511 2512 2513~~ 2514 2515 2516 2517 2518 2519 2520	2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522	- - - -	sort-lb))) ) ;; insert extra widget here (if extra-widget extra-widget (iup:hbox)) ;; empty widget ))) (let* ((status-toggles (map (lambda (status) (iup:toggle (conc status) #:fontsize 8 ;; btn-fontsz ;; "10" ;; #:expand "HORIZONTAL" #:action (lambda (obj val)
︙
3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 ~~3738 3739 3740 3741 3742 3743 3744 3745 3746 3747~~ 3748 3749 3750 3751 3752 3753 3754	3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764	+ + + + + + + - - - - - - - - - - + + + + + + + + + + +	;; removing the tabdat-values proc ;; ;; (define (tabdat-values tabdat) ;; runs update-rundat using the various filters from the gui ;; (define (dashboard:do-update-rundat tabdat) ;; this seems like a good place to check for already running and skip if so ;; ;; (set! updater-running #t) ;;(if (dboard:tabdat-already-running tabdat) ;; (begin ;; (debug:print-info 0 default-log-port "Dashboard overloaded - updates will be slow, skipping update.") ;; (dboard:tabdat-target tabdat)) (dboard:update-rundat tabdat (hash-table-ref/default (dboard:tabdat-searchpatts tabdat) "runname" "%") (dboard:tabdat-numruns tabdat) (hash-table-ref/default (dboard:tabdat-searchpatts tabdat) "test-name" "%/%") ;; generate key patterns from the target stored in tabdat (let* ((dbkeys (dboard:tabdat-dbkeys tabdat))) (let ((fres (if (dboard:tabdat-target tabdat) (let ((ptparts (append (dboard:tabdat-target tabdat)(make-list (length dbkeys) "%")))) (map (lambda (k v)(list k v)) dbkeys ptparts)) (let ((res '())) (for-each (lambda (key) (if (not (equal? key "runname")) (let ((val (hash-table-ref/default (dboard:tabdat-searchpatts tabdat) key #f))) (if val (set! res (cons (list key val) res)))))) dbkeys) res)))) fres)))) (let ((ptparts (append (dboard:tabdat-target tabdat)(make-list (length dbkeys) "%")))) (map (lambda (k v)(list k v)) dbkeys ptparts)) (let ((res '())) (for-each (lambda (key) (if (not (equal? key "runname")) (let ((val (hash-table-ref/default (dboard:tabdat-searchpatts tabdat) key #f))) (if val (set! res (cons (list key val) res)))))) dbkeys) res)))) fres))) #;(set! 
updater-running #f)) (define (dashboard:runs-tab-updater commondat tab-num) (debug:catch-and-dump (lambda () (let* ((tabdat (dboard:common-get-tabdat commondat tab-num: tab-num)) (dbkeys (dboard:tabdat-dbkeys tabdat))) (dashboard:do-update-rundat tabdat)
︙
3799 3800 3801 3802 3803 3804 3805 ~~3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817~~ 3818 3819 3820 3821 3822 3823 3824	3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837	- - - - - - - - - - - - + + + + + + + + + + + + + + +	commondat (lambda () (dashboard:runs-tab-updater commondat 1)) tab-num: 2) (iup:callback-set! tim "ACTION_CB" (lambda (time-obj) (let ((update~~-is~~-running ~~#f)~~) (mutex-lock! (dboard:commondat-update-mutex commondat)) (set! update-is-running (dboard:commondat-updating commondat)) (if (not update-is-running) (dboard:commondat-updating-set! commondat #t)) (mutex-unlock! (dboard:commondat-update-mutex commondat)) (if (not update-is-running) ;; we know that the update was not running and we now have a lock on doing an update (begin (dboard:common-run-curr-updaters commondat) ;; (dashboard:run-update commondat) (mutex-lock! (dboard:commondat-update-mutex commondat)) (dboard:commondat-updating-set! commondat #f) (mutex-unlock! (dboard:commondat-update-mutex commondat))) (if (not updater-running) (begin ;; (mutex-lock! (dboard:commondat-update-mutex commondat)) ;; (set! update-is-running (dboard:commondat-updating commondat)) ;;(if (not update-is-running) ;; (dboard:commondat-updating-set! commondat #t)) ;;(mutex-unlock! (dboard:commondat-update-mutex commondat)) ;;(if (not update-is-running) ;; we know that the update was not running and we now have a lock on doing an update ;; (begin (set! updater-running #t) (dboard:common-run-curr-updaters commondat) ;; (dashboard:run-update commondat) (set! updater-running #f) ;; (mutex-lock! (dboard:commondat-update-mutex commondat)) ;; (dboard:commondat-updating-set! commondat #f) ;; (mutex-unlock! (dboard:commondat-update-mutex commondat))) )) 1)))) (let ((th1 (make-thread (lambda () (thread-sleep! 
1) (dboard:common-run-curr-updaters commondat 0) ;; force update of summary tab ) "update buttons once"))
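Note on the hunk above: the timer callback now skips an update whenever the global updater-running flag is already set, replacing the mutex-protected commondat flag. A minimal sketch of that skip-if-busy guard follows. The dynamic-wind wrapper is an addition for illustration and is not part of this check-in; it clears the flag even if the updater leaves through an exception. run-guarded is a hypothetical name.

```scheme
;; Flag shared by all callers; #t while an update is in progress.
(define updater-running #f)

;; Run thunk unless an update is already in progress.
;; Returns the thunk's value, or the symbol skipped when busy.
(define (run-guarded thunk)
  (if updater-running
      'skipped
      (dynamic-wind
        (lambda () (set! updater-running #t))
        thunk
        ;; runs on normal return and on escape through an exception handler
        (lambda () (set! updater-running #f)))))

;; Normal call: the thunk runs.
(define r-normal (run-guarded (lambda () 'done)))

;; Re-entrant call while the outer update is still running: skipped.
(define r-nested
  (run-guarded (lambda () (run-guarded (lambda () 'inner)))))

;; An updater that raises still leaves the flag cleared afterwards.
(define r-error
  (handle-exceptions exn 'caught
    (run-guarded (lambda () (error "updater failed")))))
```

Without something like the cleanup step, an exception inside the updater would leave the flag stuck at #t and every later timer tick would be skipped.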
︙

︙
1774 1775 1776 1777 1778 1779 1780 ~~1781 1782 1783 1784 1785~~ 1786 1787 1788 1789 1790 1791 1792	1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810	+ + - - - - - + + + + + + + + + + + + + + + + + + + + +	(debug:print 0 default-log-port "ERROR: cannot read " infile) (debug:print 0 default-log-port "ERROR: run-dir is " run-dir) #f ) (with-input-from-file infile read-lines) ))) ;; check duration against test-run.dat file if it exists and update the value in ;; the db if necessary ;; ~~select end_time-now from~~ ;; (~~select te~~st~~name,item_path,event_time+~~run_duration as ;; end~~_time,strftime('%s','now') as now from~~ test~~s where st~~at~~e in~~ ;; (~~'RUNNING','REMOTEHOSTSTART','LAUNCHED'~~)); ;; (define (db:adjust-run-duration dbstruct test-id run-dir event-time run-duration) (let* ((datf (conc run-dir ".mt_data/test-run.dat")) (modt (if (and (file-exists? datf) (file-read-access? datf)) (file-modification-time datf) #f)) ;; (+ event-time run-duration)))) (alt-run-duration (if modt (- modt event-time) #f))) (if (and alt-run-duration (> alt-run-duration run-duration)) (begin (debug:print 0 default-log-port "Test " test-id " run duration mismatch. Setting to " alt-run-duration) (db:with-db dbstruct #f #f (lambda (db) (sqlite3:execute db "UPDATE tests SET run_duration=? WHERE id=?;" alt-run-duration test-id) #t))) #f))) ;; #f = we did NOT adjust the time (define (db:find-and-mark-incomplete dbstruct run-id ovr-deadtime) (let* ((incompleted '()) (oldlaunched '()) (toplevels '()) ;; The default running-deadtime is 720 seconds = 12 minutes. ;; "(running-deadtime-default (+ server-start-allowance (* 2 launch-monitor-period)))" = 200 + (2 * (200 + 30 + 30)) (deadtime-trim (or ovr-deadtime (configf:lookup-number configdat "setup" "deadtime")))
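Note on the hunk above: db:adjust-run-duration treats the modification time of test-run.dat as evidence that a test ran longer than the database records. When it returns a true value, db:find-and-mark-incomplete leaves the test alone. A worked sketch of just the arithmetic, with file and database access left out; adjusted-run-duration and the timestamps are made up for illustration.

```scheme
;; modt:         modification time of test-run.dat (epoch seconds), or #f if unreadable
;; event-time:   test start time recorded in the db (epoch seconds)
;; run-duration: duration recorded in the db (seconds)
;; Returns the longer duration implied by the file, or #f if no adjustment applies.
(define (adjusted-run-duration modt event-time run-duration)
  (let ((alt (and modt (- modt event-time))))
    (if (and alt (> alt run-duration))
        alt
        #f)))

;; File touched 900 s after start, db says 600 s: adjust to 900.
(define grew (adjusted-run-duration 1600000900 1600000000 600))

;; File touched 300 s after start, db already says 600 s: no adjustment.
(define same (adjusted-run-duration 1600000300 1600000000 600))

;; No readable file: no adjustment.
(define nofile (adjusted-run-duration #f 1600000000 600))
```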
︙
1826 1827 1828 1829 1830 1831 1832 ~~1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843~~ 1844 1845 ~~1846~~ 1847 1848 ~~1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859~~ 1860 1861 1862 1863 1864 1865 1866	1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886	+ - - - - - - - - - - - + + + + + + + + + + + - + + - - - - - - - - - - - + + + + + + + + + + +	;; HOWEVER: this code in run:test seems to work fine ;; (> (- (current-seconds)(+ (db:test-get-event_time testdat) ;; (db:test-get-run_duration testdat))) ;; 600) ;; (db:delay-if-busy dbdat) (sqlite3:for-each-row (lambda (test-id run-dir uname testname item-path event-time run-duration) (if (not (db:adjust-run-duration dbstruct test-id run-dir event-time run-duration)) (if (and (equal? uname "n/a") (equal? item-path "")) ;; this is a toplevel test ;; what to do with toplevel? call rollup? (begin (set! toplevels (cons (list test-id run-dir uname testname item-path run-id) toplevels)) (debug:print-info 0 default-log-port "Found old toplevel test in RUNNING state, test-id=" test-id)) (begin (set! incompleted (cons (list test-id run-dir uname testname item-path run-id) incompleted)) (debug:print-info 0 default-log-port "Found old test in RUNNING state, test-id=" test-id" exceeded running-deadtime "running-deadtime" now="(current-seconds) " event-time="event-time" run-duration="run-duration)))) (if (and (equal? uname "n/a") (equal? item-path "")) ;; this is a toplevel test ;; what to do with toplevel? call rollup? (begin (set! toplevels (cons (list test-id run-dir uname testname item-path run-id) toplevels)) (debug:print-info 0 default-log-port "Found old toplevel test in RUNNING state, test-id=" test-id)) (begin (set! 
incompleted (cons (list test-id run-dir uname testname item-path run-id) incompleted)) (debug:print-info 0 default-log-port "Found old test in RUNNING state, test-id=" test-id" exceeded running-deadtime "running-deadtime" now="(current-seconds) " event-time="event-time" run-duration="run-duration))))) stmth1 run-id running-deadtime) ;; default time 720 seconds (sqlite3:for-each-row (lambda (test-id run-dir uname testname item-path event-time run-duration) (if (not (db:adjust-run-duration dbstruct test-id run-dir event-time run-duration)) (if (and (equal? uname "n/a") (equal? item-path "")) ;; this is a toplevel test ;; what to do with toplevel? call rollup? (begin (set! toplevels (cons (list test-id run-dir uname testname item-path run-id) toplevels)) (debug:print-info 0 default-log-port "Found old toplevel test in RUNNING state, test-id=" test-id)) (begin (debug:print-info 0 default-log-port "Found old test in REMOTEHOSTSTART state, test-id=" test-id " exceeded running-deadtime "running-deadtime" now="(current-seconds)" event-time="event-time " run-duration="run-duration) (set! incompleted (cons (list test-id run-dir uname testname item-path run-id) incompleted))))) (if (and (equal? uname "n/a") (equal? item-path "")) ;; this is a toplevel test ;; what to do with toplevel? call rollup? (begin (set! toplevels (cons (list test-id run-dir uname testname item-path run-id) toplevels)) (debug:print-info 0 default-log-port "Found old toplevel test in RUNNING state, test-id=" test-id)) (begin (debug:print-info 0 default-log-port "Found old test in REMOTEHOSTSTART state, test-id=" test-id " exceeded running-deadtime "running-deadtime" now="(current-seconds)" event-time="event-time " run-duration="run-duration) (set! incompleted (cons (list test-id run-dir uname testname item-path run-id) incompleted)))))) stmth2 run-id remotehoststart-deadtime) ;; default time 230 seconds ;; in LAUNCHED for more than one day. 
Could be long due to job queues TODO/BUG: Need override for this in config ;; ;; (db:delay-if-busy dbdat) (sqlite3:for-each-row
︙
3045 3046 3047 3048 3049 3050 3051 ~~3052~~ 3053 3054 3055 3056 3057 3058 3059	3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079	- +	-1 "-" "-")) ;; ;; 1. cache tests-match-qry ;; 2. compile qry and store in hash ;; 3. convert for-each-row to fold ;; ~~(define (db:get-tests-for-run-state-status dbstruct run-id testpatt)~~ #;(define (db:get-tests-for-run-state-status dbstruct run-id testpatt) (db:with-db dbstruct run-id #f (lambda (db) (let* ((res '()) (stmt-cache (dbr:dbstruct-stmt-cache dbstruct)) (stmth (let* ((sh (db:hoh-get stmt-cache db testpatt))) (or sh
︙
@@ -3465,15 +3485,15 @@
   (let* ((run-ids (db:get-all-run-ids mtdb)))
     (for-each
      (lambda (run-id)
        (let ((testrecs (db:get-all-tests-info-by-run-id mtdb run-id)))
          (db:prep-megatest.db-adj-test-ids (db:dbdat-get-db mtdb) run-id testrecs)))
      run-ids)))
 
-;; Get test data using test_id, run-id is not used
+;; Get test data using test_id, run-id is not used - but it will be!
 ;;
 (define (db:get-test-info-by-id dbstruct run-id test-id)
   (db:with-db
    dbstruct
    #f ;; run-id
    #f
    (lambda (db)
︙

︙
203 204 205 206 207 208 209 ~~210~~ 211 212 213 214 215 ~~216~~ 217 218 219 220 221 222 223	203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225	- + - + + +	(round (- (current-seconds) start-seconds))))) (kill-tries 0)) ;; (tests:set-full-meta-info #f test-id run-id (calc-minutes) work-area) ;; (tests:set-full-meta-info test-id run-id (calc-minutes) work-area) ~~(tests:set-full-meta-info #f test-id run-id (calc-minutes) work-area 10)~~ (tests:set-full-meta-info #f test-id run-id (calc-minutes) work-area 10 update-db: #t) (let loop ((minutes (calc-minutes)) (cpu-load (alist-ref 'adj-core-load (common:get-normalized-cpu-load #f))) (disk-free (get-df (current-directory))) (last-sync (current-seconds))) ~~(common:telemetry-log "zombie" (conc "launch:monitor-job - ~~top of loop encountered at "(current-seconds)" with last-sync="last-sync))~~~~ ;; (common:telemetry-log "zombie" (conc "launch:monitor-job - ;; top of loop encountered at "(current-seconds)" with ;; last-sync="last-sync)) (let* ((over-time (> (current-seconds) (+ last-sync update-period))) (new-cpu-load (let* ((load (alist-ref 'adj-core-load (common:get-normalized-cpu-load #f))) (delta (abs (- load cpu-load)))) (if (> delta 0.1) ;; don't bother updating with small changes load #f))) (new-disk-free (let* ((df (if over-time ;; only get df every 30 seconds
︙
231 232 233 234 235 236 237 ~~238~~ 239 240 241 242 243 244 245 246 ~~247~~ 248 249 250 251 252 253 ~~254~~ ~~255 256 257 258~~ ~~259 260~~ 261 262 263 264 265 266 267	233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264	- + - + - + - - - - + - -	(do-sync (or new-cpu-load new-disk-free over-time)) (test-info (rmt:get-test-info-by-id run-id test-id)) (state (db:test-get-state test-info)) (status (db:test-get-status test-info)) (kill-reason "no kill reason specified") (kill-job? #f)) (common:telemetry-log "zombie" (conc "launch:monitor-job - decision time encountered at "(current-seconds)" with last-sync="last-sync" do-sync="do-sync" over-time="over-time" update-period="update-period)) #;(common:telemetry-log "zombie" (conc "launch:monitor-job - decision time encountered at "(current-seconds)" with last-sync="last-sync" do-sync="do-sync" over-time="over-time" update-period="update-period)) (cond ((test-get-kill-request run-id test-id) (set! kill-reason "KILLING TEST since received kill request (KILLREQ)") (set! kill-job? #t)) ((and runtlim (> (- (current-seconds) start-seconds) runtlim)) (set! kill-reason (conc "KILLING TEST DUE TO TIME LIMIT EXCEEDED! Runtime=" (- (current-seconds) start-seconds) " seconds, limit=" runtlim)) (set! kill-job? #t)) ((equal? status "DEAD") ~~(tests:update-central-meta-info run-id test-id new-cpu-load new-disk-free (calc-minutes) #f #f)~~ (tests:update-central-meta-info run-id test-id new-cpu-load new-disk-free (calc-minutes) #f #f update-db: #t) (rmt:set-state-status-and-roll-up-items run-id test-id 'foo "RUNNING" "n/a" "was marked dead; really still running.") ;;(set! kill-reason "KILLING TEST because it was marked as DEAD by launch:handle-zombie-tests (might indicate really overloaded server or else overzealous setup.deadtime)") ;; MARK RUNNING (set! kill-job? 
#f))) (debug:print 4 default-log-port "cpu: " new-cpu-load " disk: " new-disk-free " last-sync: " last-sync " do-sync: " do-sync) (launch:handle-zombie-tests run-id) ~~(~~when~~ do-sync~~ (if do-sync ;; save meta data about the running of this test ~~;;(with-output-to-file (conc (getenv "MT_TEST_RUN_DIR") "/last-loadinfo.log" #:append)~~ ~~;; (lambda () (pp (list (current-seconds) new-cpu-load new-disk-free (calc-minutes)))))~~ ~~(common:telemetry-log "zombie" (conc "launch:monitor-job - dosync started at "(current-seconds)))~~ (tests:update-central-meta-info run-id test-id new-cpu-load new-disk-free (calc-minutes) #f #f) (tests:update-central-meta-info run-id test-id new-cpu-load new-disk-free (calc-minutes) #f #f)) ~~(common:telemetry-log "zombie" (conc "launch:monitor-job - dosync finished at "(current-seconds))))~~ (if kill-job? (begin (debug:print-info 0 default-log-port "proceeding to kill test: "kill-reason) (mutex-lock! m) ;; NOTE: The pid can change as different steps are run. Do we need handshaking between this ;; section and the runit section? Or add a loop that tries three times with a 1/4 second ;; between tries?
︙
310 311 312 313 314 315 316 ~~317~~ 318 319 320 321 322 323 324	307 308 309 310 311 312 313 314 315 316 317 318 319 320 321	- +	(begin (thread-sleep! 3) ;; (+ 3 (random 6))) ;; add some jitter to the call home time to spread out the db accesses (if (hash-table-ref/default misc-flags 'keep-going #f) ;; keep originals for cpu-load and disk-free unless they change more than the allowed delta (loop (calc-minutes) (or new-cpu-load cpu-load) (or new-disk-free disk-free) (if do-sync (current-seconds) last-sync))))))) ~~(tests:update-central-meta-info run-id test-id (get-cpu-load) (get-df (current-directory))(calc-minutes) #f #f))) ;; NOTE: Checking twice for keep-going is intentional~~ (tests:update-central-meta-info run-id test-id (get-cpu-load) (get-df (current-directory))(calc-minutes) #f #f update-db: #t))) ;; NOTE: Checking twice for keep-going is intentional (define (launch:execute encoded-cmd) (let* ((cmdinfo (common:read-encoded-string encoded-cmd)) (tconfigreg #f)) (setenv "MT_CMDINFO" encoded-cmd) ;;(bb-check-path msg: "launch:execute incoming")
︙
463 464 465 466 467 468 469 ~~470~~ 471 472 473 474 475 476 477	460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476	- + + +	(db:test-get-host test-info) (begin (debug:print 0 default-log-port "ERROR: failed to find a record for test-id " test-id ", exiting.") (exit)))) (test-pid (db:test-get-process_id test-info))) (cond ;; -mrw- I'm removing KILLREQ from this list so that a test in KILLREQ state is treated as a "do not run" flag. ~~((member (db:test-get-state test-info) '("INCOMPLETE" "KILLED" "UNKNOWN" "STUCK")) ;; prior run of this test didn't complete, go ahead and try to rerun~~ ((or (member (db:test-get-state test-info) '("INCOMPLETE" "KILLED" "UNKNOWN" "STUCK")) ;; prior run of this test didn't complete, go ahead and try to rerun (and (equal? (db:test-get-state test-info) "COMPLETED") ;; completed/abort => rerun if asked (member (db:test-get-status test-info) '("ABORT")))) (debug:print 0 default-log-port "INFO: test is INCOMPLETE or KILLED, treat this execute call as a rerun request") ;; (tests:test-force-state-status! run-id test-id "REMOTEHOSTSTART" "n/a") (rmt:general-call 'set-test-start-time #f test-id) (rmt:test-set-state-status run-id test-id "REMOTEHOSTSTART" "n/a" #f) ) ;; prime it for running ((member (db:test-get-state test-info) '("RUNNING" "REMOTEHOSTSTART"))
︙
              (item-path (vector-ref running-test 11)))
          (debug:print 0 default-log-port "test " test-name "/" item-path " not completed")
          (if (not (null? tal))
              (loop (car tal) (cdr tal)))))))))))

 (define (launch:is-test-alive host pid)
   (if (and host pid (not (equal? host "n/a")))
-      (let* ((cmd (conc "ssh " host " pstree -A " pid))
+      (let* ((is-local (equal? host (get-host-name)))
+             (ssh-cmd  (if is-local " " (conc "ssh " host " ")))
+             (cmd      (conc ssh-cmd "pstree -A " pid))
              (output (with-input-from-pipe cmd read-lines)))
         (debug:print 2 default-log-port "Running " cmd " received " output)
         (if (eq? (length output) 0)
             #f
             #t))
       #t))
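The effect of this hunk is that `pstree` runs directly when the test host is the local machine and goes through `ssh` otherwise. A small sketch of just the command construction, assuming a numeric pid and using portable `string-append` in place of `conc`; `pstree-command` and its `local-host` parameter are hypothetical names, not Megatest code.

```scheme
;; Hypothetical helper, not part of Megatest: builds the liveness-check
;; command. The ssh prefix is dropped when host is the local machine.
(define (pstree-command host local-host pid)
  (string-append
   (if (string=? host local-host)
       ""                                 ; local: run pstree directly
       (string-append "ssh " host " "))   ; remote: wrap the call in ssh
   "pstree -A "
   (number->string pid)))

;; Usage
(pstree-command "node1" "node1" 4242)   ; "pstree -A 4242"
(pstree-command "node2" "node1" 4242)   ; "ssh node2 pstree -A 4242"
```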
︙
       (begin
         ;; (let ((lnktarget (conc lnkpath "/" item-path)))
         (debug:print 2 default-log-port "Setting up sub test run area")
         (debug:print 2 default-log-port " - creating run area in " test-path)
         (handle-exceptions exn
          (begin
            (debug:print-error 0 default-log-port " Failed to create directory " test-path ((condition-property-accessor 'exn 'message) exn)
-                              ", exiting, exn=" exn)
+                              ", continuing (might cause downstream issues?), exn=" exn)
-           (exit 1))
+           #f)
          (create-directory test-path #t))
         (debug:print 2 default-log-port
                      " - creating link from: " test-path "\n"
                      " to: " lnktarget)

         ;; If there is already a symlink delete it and recreate it.
         (handle-exceptions
︙

︙
20 21 22 23 24 25 26 27 28 29 30 31 32 33	20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35	+ +	(use format typed-records) ;; RADT => purpose of json format?? (declare (unit rmt)) (declare (uses api)) (declare (uses http-transport)) (include "common_records.scm") (include "db_records.scm") ;; (declare (uses rmtmod)) ;; (import rmtmod) ;; ;; THESE ARE ALL CALLED ON THE CLIENT SIDE!!! ;;
︙
 ;; Just some syntatic sugar
 (define (rmt:register-test run-id test-name item-path)
   (rmt:general-call 'register-test run-id run-id test-name item-path))

 (define (rmt:get-test-id run-id testname item-path)
   (rmt:send-receive 'get-test-id run-id (list run-id testname item-path)))

-;; run-id is NOT used
+;; run-id is NOT used - but it will be!
 ;;
 (define (rmt:get-test-info-by-id run-id test-id)
   (if (number? test-id)
-      (rmt:send-receive 'get-test-info-by-id run-id (list run-id test-id))
+      (let* ((testdat  (rmt:send-receive 'get-test-info-by-id run-id (list run-id test-id)))
+             (trundatf (conc (db:test-get-rundir testdat) "/.mt_data/test-run.dat")))
+        ;; now we can update a couple fields from the filesystem
+        (if (and (db:test-get-rundir testdat)
+                 (file-exists? trundatf))
+            (let* ((duration   (db:test-get-run_duration testdat))
+                   (event-time (db:test-get-event_time testdat))
+                   (last-touch (file-modification-time trundatf)))
+              (db:test-set-run_duration! testdat (max duration (- last-touch event-time)))))
+        testdat)
       (begin
         (debug:print 0 default-log-port "WARNING: Bad data handed to rmt:get-test-info-by-id run-id=" run-id ", test-id=" test-id)
         (print-call-chain (current-error-port))
         #f)))

 (define (rmt:test-get-rundir-from-test-id run-id test-id)
   (rmt:send-receive 'test-get-rundir-from-test-id run-id (list run-id test-id)))
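The added block refreshes the run duration from the modification time of `.mt_data/test-run.dat`, never letting the value shrink. The arithmetic can be isolated as below; this is a sketch under the assumption that all three values are in seconds, and `effective-run-duration` is a hypothetical name rather than anything in Megatest.

```scheme
;; Hypothetical helper, not part of Megatest: the duration reported is the
;; larger of the recorded duration and the time between the test's start
;; event and the last touch of its run data file.
(define (effective-run-duration recorded-duration event-time last-touch)
  (max recorded-duration (- last-touch event-time)))

;; Usage
(effective-run-duration 30 1000 1090)    ; 90: the file was touched later than recorded
(effective-run-duration 120 1000 1090)   ; 120: the recorded value is already larger
```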
︙

︙
   (last-load-check-time 0)
   (last-jobs-check-time 0)
   )

 (defstruct runs:testdat
   hed tal reg reruns test-record
   test-name item-path jobgroup
-  waitons testmode newtal itemmaps prereqs-not-met)
+  waitons testmode newtal itemmaps
+  (prereqs-not-met #f)
+  (last-update 0)
+  ;;
+  )

 ;; look in the $MT_RUN_AREA_HOME/.softlocks directory for key-host-pid.softlock files
 ;; - remove any that are over 3600 seconds old
 ;; - if there are any that are younger than 10 seconds
 ;;     * sleep 10 seconds
 ;;     * touch my key-host-pid.softlock file
 ;;     * return
︙
 ;;       => sometimes need to squeeze things in (added to reg)
 ;;       => review of a previously seen test is higher priority of never visited test
 ;; reg - list of previously visited tests
 ;; tal - list of never visited tests
 ;; prefer next hed to be from reg than tal.

 (define runs:nothing-left-in-queue-count 0)
+
+;; cache the result of get-prereqs-not-met and don't call it if called in past 10 seconds
+;; NOTE: This is assuming that testdat is highly specific to this test
+;;
+(define (runs:lazy-get-prereqs-not-met testdat run-id waitons hed item-path #!key (mode '(normal))(itemmaps #f)) ;; mode: testmode itemmaps: itemmaps)
+  (if (and (runs:testdat-prereqs-not-met testdat)
+           (< (- (current-seconds) (runs:testdat-last-update testdat)) 10)) ;;; only refresh for this test if
+                                                                            ;;; it has been at least 10 seconds
+      (runs:testdat-prereqs-not-met testdat) ;; return the cached result
+      (let* ((res (rmt:get-prereqs-not-met run-id waitons hed item-path mode: mode itemmaps: itemmaps)))
+        (runs:testdat-prereqs-not-met-set! testdat res)
+        (runs:testdat-last-update-set! testdat (current-seconds))
+        res)))
+

 ;;======================================================================
 ;; runs:expand-items is called by runs:run-tests-queue
 ;;======================================================================
 ;;
 ;; return value of runs:expand-items is passed back to runs-tests-queue and is fed to named loop with this signature:
 ;;    (let loop ((hed         (car sorted-test-names))
 ;;               (tal         (cdr sorted-test-names))
 ;;               (reg         '())  ;; registered, put these at the head of tal
 ;;               (reruns      '()))
-(define (runs:expand-items hed tal reg reruns regfull newtal jobgroup max-concurrent-jobs run-id waitons item-path testmode test-record can-run-more items runname tconfig reglen test-registry test-records itemmaps)
+(define (runs:expand-items hed tal reg reruns regfull newtal jobgroup max-concurrent-jobs
+                           run-id waitons item-path testmode test-record can-run-more items runname tconfig reglen test-registry test-records itemmaps testdat)
   (let* ((loop-list       (list hed tal reg reruns))
-         (prereqs-not-met (let ((res (rmt:get-prereqs-not-met run-id waitons hed item-path mode: testmode itemmaps: itemmaps)))
-                            (if (list? res)
-                                res
-                                (begin
-                                  (debug:print 0 default-log-port
-                                               "ERROR: rmt:get-prereqs-not-met returned non-list!\n"
-                                               " res=" res " run-id=" run-id " waitons=" waitons " hed=" hed " item-path=" item-path " testmode=" testmode " itemmaps=" itemmaps)
-                                  '()))))
-         (have-itemized (not (null? (lset-intersection eq? testmode '(itemmatch itemwait)))))
-         ;; (prereqs-not-met (mt:lazy-get-prereqs-not-met run-id waitons item-path mode: testmode itemmap: itemmap))
+         (prereqs-not-met (runs:lazy-get-prereqs-not-met testdat run-id waitons hed item-path
+                                                         mode: testmode itemmaps: itemmaps))
+         (have-itemized   (not (null? (lset-intersection eq? testmode '(itemmatch itemwait)))))
          (fails           (runs:calc-fails prereqs-not-met))
          (prereq-fails    (runs:calc-prereq-fail prereqs-not-met))
          (non-completed   (runs:calc-not-completed prereqs-not-met))
          (runnables       (runs:calc-runnable prereqs-not-met))
          (unexpanded-prereqs
           (filter (lambda (testname)
                     (let* ((test-rec (hash-table-ref test-records testname))
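The caching pattern introduced by `runs:lazy-get-prereqs-not-met` is a time-limited memo: reuse the stored result while it is younger than 10 seconds, otherwise fetch again and restamp. The sketch below shows the same idea in isolation. It is an illustration only: `make-ttl-cache`, `fetch`, and the explicit `now` argument are assumptions introduced here (the real code stores the value in the `runs:testdat` record and reads `current-seconds`).

```scheme
;; Hypothetical helper, not part of Megatest. Returns a procedure that takes
;; the current time in seconds and yields a possibly cached result.
(define (make-ttl-cache fetch ttl)
  (let ((value #f)   ; last fetched result; #f means nothing is cached yet
        (stamp 0))   ; time of the last fetch, in seconds
    (lambda (now)
      (if (and value (< (- now stamp) ttl))
          value                     ; still fresh: reuse the cached result
          (let ((res (fetch)))      ; empty or stale: fetch again and restamp
            (set! value res)
            (set! stamp now)
            res)))))

;; Usage: count how often the expensive fetch actually runs.
(define fetch-count 0)
(define cached-prereqs
  (make-ttl-cache (lambda ()
                    (set! fetch-count (+ fetch-count 1))
                    (list 'test-a 'test-b))
                  10))

(cached-prereqs 100)   ; first call fetches
(cached-prereqs 105)   ; 5 seconds later: served from the cache
(cached-prereqs 111)   ; 11 seconds after the fetch: fetched again
```

As in the hunk, a `#f` result is never treated as cached, so a fetch that returns `#f` is retried on every call. An empty list is a true value in Scheme and is therefore cached like any other result.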
︙
                ;; wait for load here
                (if (runs:dat-load-mgmt-function runsdat)((runs:dat-load-mgmt-function runsdat)))
                (loop-can-run-more (runs:can-run-more-tests runsdat run-id jobgroup max-concurrent-jobs)
                                   (- remtries 1)))))))
          )))))

-      (runs:testdat-prereqs-not-met-set! testdat (rmt:get-prereqs-not-met run-id waitons hed item-path mode: testmode itemmaps: itemmaps))
+      ;; I'm not clear on why prereqs are gathered here TODO: verfiy this is needed
+      (runs:lazy-get-prereqs-not-met testdat run-id waitons hed item-path mode: testmode
+                                     itemmaps: itemmaps)

       ;; I'm not clear on why we'd capture running job counts here TODO: verify this is needed
       (runs:dat-can-run-more-tests-set! runsdat (runs:can-run-more-tests runsdat run-id jobgroup max-concurrent-jobs))

       (let ((loop-list (runs:process-expanded-tests runsdat testdat)))
         ;; in process-expanded-tests ultimately run:test -> launch-test -> test actually running
         (if loop-list (apply loop loop-list))))
︙
       ;; if items is a proc then need to run items:get-items-from-config, get the list and loop
       ;;    - but only do that if resources exist to kick off the job
       ;; EXPAND ITEMS
       ((or (procedure? items)(eq? items 'have-procedure))
        (debug:print-info 4 default-log-port "cond branch - " "rtq-4")
        (let ((can-run-more #f)) ;; (runs:can-run-more-tests runsdat run-id jobgroup max-concurrent-jobs)))
-         (if (not can-run-more) #;(and (list? can-run-more)
+         (if (not can-run-more) #;(and (list? can-run-more) ;; IDEA, this mechanism may have had some value, make it configurable to test pros/cons TODO
                                        (car can-run-more))
-             (let ((loop-list (runs:expand-items hed tal reg reruns regfull newtal jobgroup max-concurrent-jobs run-id waitons item-path testmode test-record can-run-more items runname tconfig reglen test-registry test-records itemmaps))) ;; itemized test expanded here
+             (let ((loop-list (runs:expand-items hed tal reg reruns regfull newtal jobgroup max-concurrent-jobs
+                                                 run-id waitons item-path testmode test-record can-run-more items runname tconfig reglen test-registry test-records itemmaps testdat)))
                (if loop-list
                    (apply loop loop-list)
-                   (debug:print-info 4 default-log-port " -- Can't expand hed="hed)
+                   (debug:print-info 4 default-log-port " -- Can't expand hed="hed)))
-                 )
-               )
              ;; if can't run more just loop with next possible test
              (loop (car newtal)(cdr newtal) reg reruns))))

       ;; this case should not happen, added to help catch any bugs
       ((and (list? items) itemdat)
        (debug:print-info 4 default-log-port "cond branch - " "rtq-5")
        (debug:print-error 0 default-log-port "Should not have a list of items in a test and the itemspath set - please report this")
︙