groupmonitor.lua is sometimes faulty
#1
Hello,
In my company we handle a pretty big pool of LogicMachines and we noticed that sometimes, event-based script that are active, don't trigger when they should. Here are more details about the issue:
  • It can happen on LMs with any firmware (we noticed that issue on LMs with 2017 firmwares as well a LMs with 2018, 2020 and 2021 firmwares)
  • Most of the time, when it happens, the LM has a pretty long uptime (it hasn't rebooted since 50 or even 100 days)
  • We noticed that by placing a simple "log('hello')" at the beginning of the active event-based script, then when writing into the trigger object, the log doesn't appear in the Logs panel.
  • When it happens, it doesn't happen for every KNX object of the LM, but only for some of them (which means some event-based scripts keep on triggering when they should, and some others don't).
  • Since "groupmonitor.lua" is the process responsible for listening to the bus and triggering the event-based scripts, we tried killing groupmonitor.lua and then waiting for it to be reloaded (I think there is a daemon/cron job that runs every minute and launches groupmonitor.lua if it isn't running). 
  • After reloading groupmonitor.lua, the problem disappears and every event-based script triggers properly.
What do you know about this issue? Would it be possible to imagine a script that could detect this issue on a LogicMachine, and when it detects it, kills groupmonitor.lua? Would it be possible to do it without any false positive (we think that the problem is happening while it actually isn't) nor any false negative (the reverse)?
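For what it's worth, here is a rough, untested sketch of the kind of detection we have in mind. It assumes a spare group address (1/1/255 below) with an event-based script attached to it whose only job is to run storage.set('gm_heartbeat', os.time()); the group address, storage key and timings are placeholders. If that heartbeat event script is missing or broken, this would kill groupmonitor.lua needlessly, which is exactly the kind of false positive we would like to avoid.
Code:
-- scheduled script (e.g. every 5 minutes), rough sketch only
-- assumes an event-based script on 1/1/255 that does:
--   storage.set('gm_heartbeat', os.time())
wrote = os.time()
grp.write('1/1/255', true) -- poke the heartbeat object
os.sleep(10) -- give the event script some time to run

seen = storage.get('gm_heartbeat', 0)

if seen < wrote then
  -- the heartbeat event script did not fire: assume groupmonitor.lua is stuck,
  -- kill it and let the system watchdog restart it within a minute
  res = io.readproc('ps -w') or ''

  for line in res:gmatch('[^\n]+') do
    if line:find('groupmonitor.lua', 1, true) then
      local pid = line:match('^%s*(%d+)')
      if pid then
        os.execute('kill ' .. pid)
      end
    end
  end
end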
Reply
#2
Have you tested this with the latest firmware?

The most likely cause is that the LM is overloaded. Run this script and check the values in the last column (drops). If at least one row has a non-zero value, it means that a certain system part was unable to process incoming data quickly enough: its receive buffer got full and some messages were dropped.
Code:
res = io.readfile('/proc/net/udp')
log(res)

To prevent this, scripts must use grp.checkwrite/grp.checkupdate where applicable, and Modbus send delta values and KNX filtering tables should be adjusted.
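For example, a polling script should look roughly like this (the group address and the polled value are placeholders):
Code:
-- send to the bus only when the value has actually changed:
-- grp.checkwrite compares the new value against the current object value
-- and skips the telegram when they are equal
value = 21.5 -- placeholder for a value polled from Modbus or another source
grp.checkwrite('1/1/1', value)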
Reply
#3
I see a lot of drops.

* string: sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
88: 0100007F:EAD0 FFFFFF7F:1A40 01 00000000:00000000 00:00000000 00000000 0 0 492 2 cf9c6f00 0
111: 0100007F:03E7 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 500 2 cf9c7400 0
113: 00000000:14E9 00000000:0000 07 00000000:00000000 00:00000000 00000000 65534 0 1148 2 cf32ef00 2401
148: 0100007F:0B0C 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 836 2 cf9c7900 0
159: 0100007F:2717 00000000:0000 07 00000000:00000B00 00:00000000 00000000 0 0 914 2 cf9c6500 0
180: FFFFFF7F:1A2C 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 993 2 cf32e280 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 65729512 2 cc8b6f00 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 1020 2 cf32ea00 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 1015 2 cf32ec80 663
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 1006 2 cf32e780 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 998 2 cf32e500 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 960 2 cf32e000 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00050180 00:00000000 00000000 0 0 938 2 cf9c6280 6853867
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 837 2 cf9c7b80 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 507 2 cf9c7680 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 491 2 cf9c6c80 0
200: FFFFFF7F:1A40 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 485 2 cf9c6780 66
223: 00000000:0E57 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 494 2 cf9c7180 0
241: 0100007F:8B69 FFFFFF7F:04B0 01 00000000:00000000 00:00000000 00000000 0 0 486 2 cf9c6a00 0
Reply
#4
(23.10.2023, 08:08)admin Wrote: Have you tested this with the latest firmware? The most likely cause is that the LM is overloaded. [...]

I haven't encountered this issue yet on LMs equipped with 2023 firmware.
Thanks for the suggestion to check for dropped UDP packets; the next time I encounter this bug, I will check it straight away.
Reply
#5
@gjniewenhuijse, your system looks overloaded too. Another indication is the non-zero rx_queue value: it means that messages cannot be processed quickly enough and are being queued.

You can run this script; it will log every process that has at least one dropped message.
Code:
inodes = {}

-- map each socket inode to the command line of the process that owns it
function collect(pid)
  local res = io.readproc('ls -l /proc/' .. pid .. '/fd') or ''
  local cmd = io.readfile('/proc/' .. pid .. '/cmdline') or ''
  cmd = cmd:gsub('%z', ' '):trim()

  for inode in res:gmatch('socket:%[(%d+)%]') do
    inodes[ tonumber(inode) ] = cmd
  end
end

-- scan all numeric entries in /proc (i.e. all running processes)
pids = io.ls('/proc')

for _, pid in ipairs(pids) do
  pid = tonumber(pid)
  if pid then
    collect(pid)
  end
end

-- parse /proc/net/udp, skipping the header line
data = io.readfile('/proc/net/udp'):split('\n')
table.remove(data, 1)

-- log the owning process when the drops column (13) is non-zero,
-- using the inode column (10) to look it up
function check(chunks)
  local drops = tonumber(chunks[ 13 ]) or 0

  if drops == 0 then
    return
  end

  local inode = tonumber(chunks[ 10 ]) or 0
  local cmd = inodes[ inode ]

  if cmd then
    log(cmd .. ', drops: ' .. drops)
  end
end

for _, line in ipairs(data) do
  chunks = line:trim():gsub('%s+', ' '):split(' ')
  if #chunks == 13 then
    check(chunks)
  end
end

Consider installing the Group monitor app and checking for duplicates and unnecessary object traffic that cause such overloads.
Reply
#6
Important output:
string: lua /lib/genohm-scada/core/scripting-resident.lua 159, drops: 6908586

Script 159: I check the KNX bus every 10 seconds with this resident script:
Code:
-- resident KNX connection handler
-- check device physical address: log( knxlib.ping('2.1.1') )

if not client then
  function statushandler(status)
    if status == 0x00 then
      -- KNX Connected
      --log('connected')
      require('user.nit_notify')
      sendMessage('Knx bus','Knx bus hersteld.',-2) -- "KNX bus restored."
    elseif status == 0xFF then
      -- KNX Disconnected
      --log('not connected')
      require('user.nit_notify')
      sendMessage('Knx bus','Knx bus probleem, controleer de verbinding.',-2) -- "KNX bus problem, check the connection."
    end
  end

  client = require('localbus').new(1)
  client:sethandler('status', statushandler)
end

client:step()
Reply
#7
Such a monitoring script must use a sleep time of 0. With a 10-second sleep interval, telegrams pile up in the localbus receive buffer between client:step() calls, and once the buffer is full they are dropped.
Reply
#8
lua /lib/genohm-scada/core/scripting-resident.lua 210, drops: 246
lua /lib/genohm-scada/core/groupmonitor.lua, drops: 1694

The groupmonitor.lua drops really are a critical problem. Sometimes I do a grp.write() on an object, a script listening on the localbus sees the telegram on the bus, but the object is never updated in the database! So I can't even be sure that when I do a grp.write() on an object and then a grp.getvalue() on the same object 30 seconds later, I will actually read back the value I wrote 30 seconds earlier. This is really critical. What could be done to ensure it?
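To make the failure concrete, this is roughly what we observe (1/1/1 stands for any affected object):
Code:
grp.write('1/1/1', 42) -- the telegram is visible on the bus

-- about 30 seconds later, e.g. in another script:
value = grp.getvalue('1/1/1')
-- when groupmonitor.lua dropped the telegram, the database copy was never
-- updated, so value may still be the old one instead of 42
log(value)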
Reply
#9
The only thing that you can do is to reduce the overall system load (telegram count) because you are hitting the system limits.
Reply

