00:20.01 | *** join/#asterisk-dev infobot (ibot@rikers.org) |
00:20.01 | *** topic/#asterisk-dev is Asterisk Development Discussion -=- http://www.asterisk.org/developers -=- Tier 2 and 3.14159265 support is in #asterisk -=- Check out our blog! blogs.asterisk.org -=- Follow on Twitter at @AsteriskDev |
02:25.03 | *** join/#asterisk-dev snuff-work (~snuff-wor@210.9.148.102) |
02:25.03 | *** mode/#asterisk-dev [+o snuff-work] by ChanServ |
02:33.38 | *** join/#asterisk-dev cresl1n (~Adium@asterisk/libpri-and-libss7-expert/Cresl1n) |
02:33.38 | *** mode/#asterisk-dev [+o cresl1n] by ChanServ |
11:38.48 | *** join/#asterisk-dev jkroon (~jkroon@165.16.204.169) |
12:46.06 | *** join/#asterisk-dev scgm11_ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
12:47.10 | *** join/#asterisk-dev coreyfarrell (~coreyfarr@24-177-250-191.dhcp.nwtn.ct.charter.com) |
12:47.10 | *** mode/#asterisk-dev [+o coreyfarrell] by ChanServ |
13:30.39 | *** join/#asterisk-dev bford (uid283514@gateway/web/irccloud.com/x-fqmdopsilmhjmwoc) |
13:30.39 | *** mode/#asterisk-dev [+o bford] by ChanServ |
14:13.57 | *** join/#asterisk-dev kharwell (kharwell@nat/digium/x-tvhzjqoekejxtkdf) |
14:13.57 | *** mode/#asterisk-dev [+o kharwell] by ChanServ |
14:42.29 | coreyfarrell | when people have time could I get reviews for https://gerrit.asterisk.org/#/q/topic:ASTERISK-27824 ? This is required for Fedora 28 to compile Asterisk with --enable-dev-mode. |
14:42.59 | *** join/#asterisk-dev cresl1n (Adium@asterisk/libpri-and-libss7-expert/Cresl1n) |
14:42.59 | *** mode/#asterisk-dev [+o cresl1n] by ChanServ |
15:17.25 | seanbright | coreyfarrell: just a general comment that - i dislike magic numbers in general but when they are powers of two i generally think to myself: "ok, this is a buffer big enough to hold the data and then some" |
15:17.36 | seanbright | when they aren't powers of two, then i have to think |
15:18.10 | seanbright | so in the example of going from 512 to 520 - the question i have is "why?" |
15:18.36 | seanbright | so maybe 512 + SOME_IDENTIFIER_THAT_EXPLAINS_THE_ADDITIONAL_PADDING? |
15:19.18 | cresl1n | Yeah, I hate it when that happens too, but respected gcc's static analysis prowess |
15:19.29 | cresl1n | It seems like there's really a code problem in that case |
15:20.05 | cresl1n | Linus has traditionally not just accepted "make the gcc warning go away" patches that don't look at underlying causes |
15:20.22 | cresl1n | Not saying I'm gonna go that route, but it did make me wonder in a few places |
15:23.27 | *** join/#asterisk-dev csavinovich (sid296765@gateway/web/irccloud.com/x-gfutqfahcjadkdmb) |
15:23.27 | *** mode/#asterisk-dev [+o csavinovich] by ChanServ |
15:31.09 | coreyfarrell | I did suppress the warning for a few sources where possibility of truncation is pretty much unavoidable (or in the case of test_strings where it is intentional). I can look into using more calculated sizes like `char buf[512 + sizeof(somefield)];` but I won't be able to update it today. |
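The "calculated size" idea above can be sketched as follows. This is a minimal illustration, not code from the Asterisk tree: `EXAMPLE_PREFIX`, `struct example_peer`, `EXAMPLE_LABEL_SIZE`, and `build_label` are all hypothetical names. The point is that the buffer size is derived from the pieces that go into it, so a reader does not have to guess why it is not a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical fixed prefix that is always written before the field. */
#define EXAMPLE_PREFIX "Local/"

/* Hypothetical structure with a fixed-size field (not an Asterisk type). */
struct example_peer {
	char name[80];
};

/*
 * Size derived from the inputs instead of a magic number:
 * sizeof(EXAMPLE_PREFIX) includes the prefix's terminating NUL and
 * sizeof(name) includes the field's NUL, so the result always has room
 * for prefix + name + one NUL, with one byte to spare.
 */
#define EXAMPLE_LABEL_SIZE \
	(sizeof(EXAMPLE_PREFIX) + sizeof(((struct example_peer *)0)->name))

/* Writes "<prefix><name>" into dst and returns the formatted length. */
static size_t build_label(char *dst, size_t dst_size,
	const struct example_peer *peer)
{
	int n = snprintf(dst, dst_size, "%s%s", EXAMPLE_PREFIX, peer->name);

	assert(n >= 0);
	return (size_t)n;
}
```

A caller would declare `char label[EXAMPLE_LABEL_SIZE];` and pass `sizeof(label)`; if the field ever grows, the buffer grows with it.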
15:32.49 | *** join/#asterisk-dev Worldexe (~Worldexe@95-107-33-134.dsl.orel.ru) |
15:33.10 | seanbright | in the snprintf cases, you can elide the warning by checking the return value |
15:33.26 | coreyfarrell | in some cases it could be dealt with by switching to 'struct ast_str' but I wanted to avoid changes to logic for the system-wide 'get gcc to compile'. |
15:33.48 | coreyfarrell | seanbright: oh so (void)snprintf(...) might allow individual warning to be suppressed? |
15:33.59 | seanbright | testing |
15:34.08 | seanbright | (although it still is a band-aid) |
15:34.46 | seanbright | no, casting to void does not silence it |
15:38.01 | coreyfarrell | huh.. that seems like maybe a gcc bug / overly aggressive warning? should be possible to say "I know this call to snprintf can truncate and I don't care". |
15:39.47 | seanbright | i think the logic is "you should care if this truncates because it might affect something else" |
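A minimal sketch of the return-value check mentioned above, assuming GCC's documented behaviour that level 1 of `-Wformat-truncation` warns only about bounded calls whose return value is unused. A `(void)` cast still leaves the value unused, which fits the test result reported in the channel. `format_peer_label` is a hypothetical helper, not an Asterisk function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats "name@host" into dst.
 * Returns 0 on success, 1 if the output was truncated, -1 on a
 * formatting error. Because the snprintf() result is inspected, the
 * caller decides what truncation means instead of ignoring it.
 */
static int format_peer_label(char *dst, size_t dst_size,
	const char *name, const char *host)
{
	int needed;

	assert(dst != NULL && dst_size > 0);

	/* snprintf returns the length it wanted to write, excluding the NUL. */
	needed = snprintf(dst, dst_size, "%s@%s", name, host);
	if (needed < 0) {
		return -1;
	}
	if ((size_t)needed >= dst_size) {
		/* dst holds a NUL-terminated prefix of the full string. */
		return 1;
	}
	return 0;
}
```

Whether truncation should be logged, treated as an error, or accepted silently is a per-call-site decision; the sketch only makes that decision explicit.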
15:40.08 | gtjoseph | coreyfarrell: weren't we going to suppress the test_runner bits from these messages...Test ['./lib/python/asterisk/test_runner.py', 'tests/channels/pjsip/ami/pjsip_qualify'] passe |
15:40.17 | gtjoseph | i forgot |
15:40.57 | coreyfarrell | gtjoseph: my python3-compat patch includes that, switches it to just print the test name. |
15:41.06 | gtjoseph | ah, yeah ok |
15:52.16 | gtjoseph | coreyfarrell: without the ['...'] ?? |
15:52.31 | coreyfarrell | gtjoseph: I think the pretty_print updates are not needed for 15.4 / 13.21 since the python3-compat will only go to 13, 14, 15 and master? |
15:52.55 | coreyfarrell | gtjoseph: correct. the ['...'] was because the code was printing cmd (an array of strings). |
15:53.01 | gtjoseph | actually, i discovered the issue with cert 13.21 |
15:53.30 | gtjoseph | for some reason when you run custom tests, the second form of the result is used. |
15:54.44 | gtjoseph | oh it appears that ./self_test is missing from 13.21 and 15.4 branches |
15:55.01 | gtjoseph | that needs to be cherry-picked |
15:55.48 | gtjoseph | or we need to change the groovy to check for its existence. |
15:57.17 | coreyfarrell | up to you. keep in mind the self_test had two commits, second one switched to using '#!/usr/bin/env sh' and 'set -e' instead of '#!/usr/bin/sh -e' |
16:00.21 | gtjoseph | i'd rather cherry-pick to keep the code bases as consistent as possible |
16:01.19 | coreyfarrell | gtjoseph: ok I think I have a few minutes. you want me to just merge the two commits for the minor branches or cherry-pick both patches in series? |
16:01.42 | gtjoseph | cherry-pick in sequence i think would be safer. |
16:06.36 | *** join/#asterisk-dev dakudos (~dakudos@c-73-203-6-107.hsd1.co.comcast.net) |
16:07.34 | coreyfarrell | gtjoseph: ok all set. obviously ignore the jenkins error for 8964 as that's what 8965 fixes. |
16:07.44 | gtjoseph | yep, thx |
16:13.01 | *** join/#asterisk-dev scgm11_ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
16:38.07 | *** join/#asterisk-dev Deeewayne (~dwayne@2605:a600:8050:5600:829d:1142:5677:7e57) |
16:38.07 | *** mode/#asterisk-dev [+o Deeewayne] by ChanServ |
18:22.19 | *** join/#asterisk-dev scgm11_ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
18:27.12 | *** join/#asterisk-dev scgm11_ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
18:37.43 | *** join/#asterisk-dev elguero (~miguel323@74-95-21-41-Connecticut.hfc.comcastbusiness.net) |
18:47.18 | *** join/#asterisk-dev scgm11_ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
19:11.03 | *** join/#asterisk-dev scgm11_ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
19:25.09 | *** join/#asterisk-dev scgm11_ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
19:31.27 | seanbright | you digium folks have any nortel phones laying around that you want to send me? |
19:39.57 | *** join/#asterisk-dev Bhakimi (~textual@208.78.139.170) |
19:40.40 | Bhakimi | hi guys! we are running asterisk 11 (yes i know it's EOL but he customized it so we are stuck) and it kills the cpu after about 50k calls processed |
19:40.46 | Bhakimi | we tried to run a profiler on it and couldn't find an issue |
19:41.04 | *** join/#asterisk-dev scgm11__ (~scgm11@r186-49-50-18.dialup.adsl.anteldata.net.uy) |
19:41.36 | seanbright | sorry to hear that |
19:41.47 | seanbright | you should upgrade. how much was the code customized? |
19:45.31 | *** join/#asterisk-dev CELYA_ (~Thunderbi@LFbn-1-11898-144.w90-93.abo.wanadoo.fr) |
19:45.44 | Bhakimi | tons of custom modules |
19:45.51 | Bhakimi | we built a whole application around it |
19:45.57 | Bhakimi | it would take a few years to upgrade |
19:46.09 | Bhakimi | it's basically not an option which is why i mentioned it :) |
19:46.26 | seanbright | hmm. ok. |
19:46.35 | seanbright | so you are running a highly customized version of asterisk 11 |
19:46.38 | seanbright | how can we help? |
19:46.58 | Bhakimi | i'm trying to find out what tools we can use to track it down |
19:47.21 | seanbright | core show threads & gdb |
19:47.38 | *** join/#asterisk-dev jpsharp (~jsharp@45.79.209.207) |
19:47.42 | seanbright | when it slows down, use gcore to create a core dump |
19:47.51 | Bhakimi | let me bring my dev in, he knows better but i think we did both |
19:47.58 | seanbright | no that's ok |
19:48.02 | jpsharp | I'm here. |
19:48.09 | seanbright | ok, then read above |
19:48.37 | coreyfarrell | also check memory usage when it slows down, possible you might be using too much memory and thrashing swap. that would slow the whole system down not just asterisk. |
19:48.45 | Bhakimi | no memory usage |
19:48.47 | Bhakimi | no swap |
19:48.49 | jpsharp | Memory is good. it's not swapping. |
19:48.56 | Bhakimi | swap is happy and it doesn't use much memory |
19:51.11 | jpsharp | I gotta install gcore/gdb. It'll take a moment. |
19:53.58 | seanbright | if you run 'htop' are there one or two threads running hot? |
19:55.10 | Bhakimi | all threads |
19:55.18 | seanbright | hmm |
19:55.31 | seanbright | do your custom modules create new threads? |
19:55.34 | Bhakimi | it looks like after 50k calls once we push the system it goes cpu crazy |
19:56.43 | jpsharp | no, the only fully custom module we have is a CEL module that uses redis. |
19:57.25 | seanbright | it would take a few years to upgrade that, eh? |
19:57.36 | seanbright | cel hasn't changed at all since asterisk 11 that i am aware of |
19:57.52 | Bhakimi | testing etc |
19:58.00 | seanbright | ok, so that's 3 weeks |
19:58.17 | seanbright | so the other 101 weeks are what? |
19:58.19 | Bhakimi | it's a whole call center platform, we also took freepbx and built the ui for it |
19:58.26 | Bhakimi | there are tons of areas |
19:58.30 | seanbright | gotcha |
19:58.42 | seanbright | well, god speed and all that. |
19:58.43 | Bhakimi | it's like a five9 type system |
19:58.56 | seanbright | k |
19:59.27 | Bhakimi | it's possible that an upgrade would bring new issues |
19:59.53 | seanbright | how do you resolve the slow-down-after-50k-calls issue? |
19:59.54 | Bhakimi | if i knew 100% 15 won't cause any other issues i would but it's too much risk to be honest |
20:00.07 | jpsharp | We do a full asterisk restart. |
20:00.21 | Bhakimi | i hate that we are on 11, and to solve this i could throw a bunch of hardware at it and build systems to restart them but that's also not a good idea |
20:00.32 | seanbright | we upgraded from 11 to 15 (also run a hosted contact center platform) and didn't run into any problems |
20:00.39 | Bhakimi | we unloaded all modules and reloaded them and the issue still happens, only way to fix it is to restart |
20:00.47 | seanbright | how often does this happen? |
20:00.51 | jpsharp | Every day. |
20:00.56 | seanbright | once a day? |
20:01.09 | jpsharp | yeah. After about 4-5 hours of calling. |
20:01.13 | Bhakimi | yea and then it has to run for about 4 hours or so |
20:01.18 | seanbright | gotcha |
20:01.47 | seanbright | well with a perfect system like that, there's no reason to upgrade |
20:01.50 | seanbright | :D |
20:02.37 | jpsharp | If we wanted that kind of stability, we'd run Windows :) |
20:02.46 | seanbright | yeesh |
20:02.52 | seanbright | i'm tapping out |
20:04.31 | Bhakimi | that soon lol |
20:04.45 | seanbright | you just seem like you are in over your head |
20:04.46 | Bhakimi | any suggestions what tools we can use to track the high cpu usage |
20:04.54 | seanbright | i already told you |
20:04.54 | Bhakimi | at least if we knew where in the code to look |
20:05.01 | seanbright | gcore & gdb |
20:05.09 | jpsharp | I'm running a gcore dump right now. |
20:05.12 | seanbright | gcore will let you get a core dump of the running process |
20:05.22 | seanbright | gdb will let you see where the threads are spending their time |
20:05.30 | seanbright | there is no way that *all* of the threads are 100% |
20:05.34 | Bhakimi | k 1 sec, may we show you the output ? |
20:05.36 | seanbright | it's simply not possible |
20:06.04 | seanbright | not me |
20:06.07 | seanbright | sounds like we're competitors |
20:06.21 | seanbright | someone else though maybe |
20:07.38 | Bhakimi | lol |
20:07.41 | Bhakimi | it's for 1 center |
20:07.46 | Bhakimi | i doubt we are competitors |
20:07.59 | jpsharp | gcore failed to create core. |
20:09.31 | seanbright | shrugs |
20:12.50 | jpsharp | https://imgur.com/a/OAiX3tm |
20:12.54 | jpsharp | That's the output of htop |
20:12.56 | Bhakimi | what call center platform do you work for? |
20:13.41 | seanbright | 1095% is a lot |
20:13.45 | seanbright | but i'm no expert |
20:14.17 | jpsharp | On the other instances, none of the channel threads exceed 1-2% cpu |
20:15.11 | seanbright | what's the output of: |
20:15.21 | seanbright | find /proc/<pid of bad asterisk>/fd -type l | wc |
20:15.53 | *** join/#asterisk-dev jkroon_ (~jkroon@165.16.204.166) |
20:16.07 | Bhakimi | 1 sec let me get it |
20:17.30 | Bhakimi | <PROTECTED> |
20:18.07 | seanbright | seems fine |
20:18.12 | seanbright | do you record? |
20:18.21 | Bhakimi | no |
20:18.43 | Bhakimi | all this server does is place calls and send them to the agent server once the calls connect |
20:18.52 | Bhakimi | it executes an agi script for the routing and that's pretty much it |
20:20.01 | seanbright | gotcha |
20:20.13 | seanbright | without a core dump i don't know that there is much you can do |
20:20.25 | seanbright | so figure out why gcore isn't working |
20:20.30 | Bhakimi | thanks for all the help, working on getting it |
20:27.12 | jpsharp | Okay, got a core dump. |
20:38.39 | seanbright | ok. pastebin it. |
20:41.40 | *** join/#asterisk-dev Tim_Toady (~fuzzy@snf-33276.vm.okeanos.grnet.gr) |
20:47.44 | jpsharp | Too big for paste bin, but here it is: http://n5xns.org/gdb.txt |
20:50.25 | seanbright | what protocols are you using? |
20:50.32 | Bhakimi | sip |
20:50.32 | seanbright | sip, iax2, dahdi? |
20:50.37 | seanbright | only sip? |
20:50.40 | Bhakimi | yes |
20:50.47 | seanbright | ok, stop loading chan_iax2.so |
20:50.55 | seanbright | stop loading chan_dahdi.so |
20:50.56 | Bhakimi | and dahdi |
20:51.07 | Bhakimi | yup, but can they still cause issues even if we don't use them? |
20:51.22 | seanbright | of course |
20:51.25 | seanbright | it's running code |
20:51.59 | Bhakimi | which timing source is most stable to use? |
20:52.26 | seanbright | res_timing_timerfd |
20:52.32 | seanbright | unless you have dahdi hardware installed |
20:52.39 | seanbright | in which case i would use res_timing_dahdi |
20:53.07 | jpsharp | There's no dahdi hardware installed. |
20:53.52 | coreyfarrell | kharwell: have you (or anyone else) done a full testsuite run with the python3-compat (using python2)? |
20:55.11 | kharwell | I haven't. can't speak for others, but doubtful on that front as well |
20:56.05 | seanbright | jpsharp: can you run a 'core show locks' also please? |
20:56.31 | jpsharp | I don't have locking debugging enabled. |
20:56.41 | seanbright | hmm |
20:56.53 | jpsharp | Having that enabled completely hoses the system |
20:57.49 | coreyfarrell | kharwell: I've done some here and there checks but my system does poorly with the full testsuite (even without the patch I get some lock-ups and failures). as much as I'd love to see the python3 testsuite patch move forward I think +1 might be premature. |
20:58.02 | seanbright | 27 threads sitting in pthread_mutex_lock() |
20:58.47 | coreyfarrell | kharwell: that's one of the comments I posted, I'm hoping someone with better hardware can help in that department. |
20:59.45 | seanbright | jpsharp: is the box usable at all when it gets like this? does it still handle calls? |
21:00.42 | jpsharp | Yeah, it still processes calls. Not properly, but it still originates and hangs up calls. It's not a full deadlock. |
21:01.09 | kharwell | coreyfarrell: aah okay I'd missed that comment. wish there was a way to allow some comments to pin or not auto collapse in gerrit. |
21:01.21 | kharwell | coreyfarrell: I guess you don't want to -1 it so folks will look at it? |
21:02.21 | kharwell | (some people might have filters that show only reviews with out -1s) |
21:02.23 | seanbright | jpsharp: are manager events getting emitted? |
21:02.39 | coreyfarrell | kharwell: exactly. |
21:02.57 | kharwell | I'll just remove my +1 for now |
21:03.36 | kharwell | the starpy stuff is minor and should be okay though I'd think |
21:03.51 | jpsharp | seanbright: Let me attach to the manager and look. |
21:06.08 | coreyfarrell | kharwell: yes, I ran tests/manager/originate under python2 and python3 using the starpy patch, both passed (obviously using my python3-compat testsuite patch). |
21:07.34 | coreyfarrell | I know that doesn't give "complete" coverage but it watches for events and sends commands, plus starpy patch doesn't have inter-dependencies like the testsuite python does. |
21:08.07 | jpsharp | seanbright: A few events, but not nearly to the same rate as what's actually happening. |
21:08.58 | jpsharp | the account I'm logged into the manager as has a filter for CPD-Result from our AMD system. I know the system placed about 100 calls, but I only saw 4 events from the manager. |
21:09.11 | seanbright | i'm sorry - i have to leave for the day. there is definitely something going on with locking, i'm just not sure what. |
21:09.16 | seanbright | 'core show locks' output would help |
21:09.19 | seanbright | gotta go |
21:10.05 | jpsharp | I agree with the locking issue. |
21:34.10 | *** join/#asterisk-dev cresl1n (Adium@asterisk/libpri-and-libss7-expert/Cresl1n) |
21:34.10 | *** mode/#asterisk-dev [+o cresl1n] by ChanServ |
21:40.12 | jpsharp | If I do a "manager show eventq", should I get about 2-3 minutes of output? Why would the eventq be backing up like that? |
22:11.23 | *** part/#asterisk-dev kharwell (kharwell@nat/digium/x-tvhzjqoekejxtkdf) |
23:21.56 | *** join/#asterisk-dev pchero (~pchero@176-23-78-252-cable.dk.customer.tdc.net) |
23:25.41 | jpsharp | And I think I've found the problem. Events were backing up via the HTTP manager interface, causing the event queue to become extremely stupid large to the point where the ao2 iterator takes forever to go through, resulting in lots of excessive lock times on the list. |
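The failure mode described above can be modelled with a toy queue. This is a simplified sketch under stated assumptions, not Asterisk's manager code: it assumes events stay queued until no consumer references them, and `struct ev`, `struct evq`, `evq_append`, and `evq_purge` are hypothetical names. One stalled consumer pinning an old event is enough to make the list grow without bound, and any walk over that list under a lock gets proportionally slower.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy event: usecount is the number of consumers still pointing at it. */
struct ev {
	int usecount;
	struct ev *next;
};

/* Toy singly linked queue with a length counter for illustration. */
struct evq {
	struct ev *head;
	struct ev *tail;
	size_t len;
};

/* Appends a new unreferenced event; returns NULL on allocation failure. */
static struct ev *evq_append(struct evq *q)
{
	struct ev *e = calloc(1, sizeof(*e));

	if (!e) {
		return NULL;
	}
	if (q->tail) {
		q->tail->next = e;
	} else {
		q->head = e;
	}
	q->tail = e;
	q->len++;
	return e;
}

/*
 * Frees events from the head while nobody references them. The newest
 * event is always kept. A single pinned event at the head blocks the
 * purge of everything queued behind it.
 */
static void evq_purge(struct evq *q)
{
	while (q->head && q->head->next && q->head->usecount == 0) {
		struct ev *old = q->head;

		q->head = old->next;
		free(old);
		q->len--;
	}
}
```

In this model, a consumer that stops reading (for example an idle HTTP session) never drops its reference, so `evq_purge` cannot make progress until that consumer is released or timed out.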
23:36.13 | *** join/#asterisk-dev snuff-work (~snuff-wor@210.9.148.102) |
23:36.13 | *** mode/#asterisk-dev [+o snuff-work] by ChanServ |