BMC Atrium Discovery Community Forum

This forum is now closed. Please check sticky posts and announcements for further information.

Consolidation Performance
Posted: 20 October 2011 09:56 AM
Newbie
Total Posts:  4
Joined:  2010-11-01

Hello all.

We’ve had serious performance issues on our consolidation appliances for the last 6 months, ever since we increased our OSI count from 3000 to 20000.
Currently we can push through no more than approximately 200 hosts per hour, which is nowhere near the customer’s business requirements. We’ve had a high-priority call open for the last 6 months, during which we have tested just about every scenario. We have now been informed that the product simply doesn’t scale to our requirements (bi-daily consolidation of 35000 OSIs). Our consolidation architecture sits on an ESXi virtual infrastructure with 32 GB of RAM and 8 logical CPUs (all reserved for our use). We are also aware of other customers with large OSI counts that have the same issues even though they use a physical appliance on a higher-spec server.

Essentially, I’m posting this thread to see whether anyone else is using consolidation and scanning more than 10000 OSIs. I would like to know if you’re having similar issues (in terms of OSIs-per-hour performance) and, more importantly, if you’re NOT. If you’re not, what are your OSIs-per-hour performance stats?

The measures we have taken so far are listed below. Please note that these problems are on the Consolidator, so Discovery performance is not the issue: the consolidation data on our appliances is already there. We also see the same performance issues when running scans on the same number of OSIs via playback data.

Disabled all non-essential TKU Core patterns (526 of 551). All other TKU patterns are disabled.
We currently have 60 app modelling patterns, which appear to be performing reasonably.
Tested various datastore cache limits to no avail
Tested ECA engine count to extremely limited avail
Tested datastore indexing beta code change to limited avail
Tested number of concurrent discovery requests to limited avail
Disabled automatic Host Grouping to extremely limited avail
Tried modifying /etc/security/limits.conf file (now back at default)
Initialised the datastore to “clean up”. This appeared to work fine until our host count increased to approx 10000, at which point the reasoning performance nose-dived once again.
Modified the DDD destroy frequency from the default of 15 minutes to 4 hours and then switched off completely. No apparent avail.
Checked disk I/O performance using ‘iozone’ utility. Disk I/O stats came back higher than in documentation.
Configured consolidation to run on a different appliance (located on a different physical cluster): same performance issues.
Our DDD ageing is currently set to 7 days and Dark Space is set to ‘remove all’

Whilst I want to clarify that we’ve had the full attention of the support guys on our case, we are currently having to work on an alternative solution to consolidate our data. The purpose of my posting is purely to ask other customers whether they have any additional suggestions that may have escaped us and BMC.

Many thanks.

[ Edited: 28 October 2011 06:49 PM by Simon Woodward]
 
Posted: 20 October 2011 02:42 PM   [ # 1 ]
Guru
Total Posts:  225
Joined:  2010-06-17

We may be on the same page. Can you tell me how big your snapshot is when run from the consolidation server? And do you believe this is a consolidation server issue? It appears from what you say that the scanning appliance is not the issue. To clarify, if you disconnect the scanning appliance from the consolidation server, are your scans able to get through your real estate?

Thanks,

Craig

Posted: 20 October 2011 03:03 PM   [ # 2 ]
Guru
Total Posts:  597
Joined:  2011-03-16

We have noticed considerable performance issues in general. From what we are seeing, ADDM uses up the memory after a scan and then never releases any of it. We noticed that this happens on our consolidator as well during consolidation jobs (since it is receiving data just like a scanner would). We too currently have a support issue open. We are running ADDM 8.2.2, but I have noticed the same thing with ADDM 8.2.3 and ADDM 8.3. I will say that ADDM 8.1.1 did not have this issue, and even one of the appliances that was upgraded from ADDM 8.1.1 to ADDM 8.2.02 is not having the issue. It looks like it is related to a new install of anything from ADDM 8.2.2 upwards (I can’t speak for 8.2.1). We even turned off consolidation and CMDB sync to make sure that they were not affecting us. We ran tests scanning only 3 subnets (765 endpoints) and we still had the same problem: memory being used and then not released until the appliance is rebooted. As for specs, we are at 2 CPUs and 8 GB of RAM and swap. Like I said, we cannot even handle 765 endpoints. After a few runs of the 765 endpoints, our appliance hard locks and literally crashes. Even through the VM Center application, you cannot get to the CLI (you get an out of memory error). I hope this helps you, or at least gives you comfort that it is not something that you are doing; it sounds like ADDM 8.2.2+ has a memory leak.

Posted: 20 October 2011 03:04 PM   [ # 3 ]
Guru
Total Posts:  2740
Joined:  2008-01-25
Craig Nicholls - 20 October 2011 02:42 PM
To clarify, if you disconnect the scanning appliance from the consolidation server, are your scans able to get through your real estate?

Craig, I think if you disconnected the Scanner you might actually slow your scans down. While they are disconnected, the Scanning role appliances will be caching on disk the data they should have been able to send to the Consolidator role appliance, ready for when it is reconnected.

There is nothing about the way the Scanning and Consolidation appliances work that would allow the Consolidator to put back pressure on the Scanner to get it to slow down, at least not by design, and I can’t think of a way that it could arise.

Do you think you’ve seen this happening?

Posted: 20 October 2011 03:10 PM   [ # 4 ]
Newbie
Total Posts:  4
Joined:  2010-11-01

Hi, the size of the datastore is 88GB because we recently initialised the datastore.

It’s definitely consolidation, as we have 8 scanning appliances globally, each ultimately scanning roughly 4-5K OSIs. They process around 400+ OSIs per hour.

The issue comes when they try to consolidate the data. Discovery completes within the allotted windows on the scanning appliances, but consolidation is so slow that even consolidating once a week it wouldn’t finish in time.
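
To put rough numbers on that, here is a back-of-envelope sketch using only figures quoted in this thread (about 200 OSIs per hour through the consolidator, 35000 OSIs to consolidate). The function names are made up for illustration, and the 48-hour window is only one assumed reading of “bi-daily”, not a stated requirement:

```python
def hours_to_consolidate(osi_count, osis_per_hour):
    """Hours needed to push osi_count OSIs through at a steady rate."""
    return osi_count / osis_per_hour


def required_rate(osi_count, window_hours):
    """Steady OSIs-per-hour rate needed to finish inside window_hours."""
    return osi_count / window_hours


if __name__ == "__main__":
    target_osis = 35000    # requirement quoted in the first post
    observed_rate = 200    # approximate consolidator throughput, first post
    hours = hours_to_consolidate(target_osis, observed_rate)
    print(f"{hours:.0f} hours at {observed_rate}/hour, a week is {7 * 24} hours")
    # ASSUMPTION: a 48-hour window is just one way to read "bi-daily".
    window = 48
    print(f"rate needed for a {window} h window: "
          f"{required_rate(target_osis, window):.0f} OSIs/hour")
```

At the quoted rate the full estate needs 175 hours, which is longer than the 168 hours in a week, so this is consistent with a weekly consolidation not finishing in time.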

Posted: 20 October 2011 03:16 PM   [ # 5 ]
Guru
Total Posts:  225
Joined:  2010-06-17

Charles, no, I am seeing the same thing as Timothy and Neil. Memory is not released, performance goes down the toilet, the UI will not respond, and then you are hosed. I have case ISS03843949 open with support, and since my production environment was toast this morning, I am in the process of upgrading to 8.3 GA.

Neil, so you have 8 scanning appliances running against one consolidation server? If that is the case, I would create another consolidation server, split my scanners 4 and 4, then pipe CMDB sync from both consolidators.

Posted: 20 October 2011 03:35 PM   [ # 6 ]
Guru
Total Posts:  2740
Joined:  2008-01-25

@Craig

Apologies, I meant: have you seen your Scanning appliances being slowed down when they are attached to a consolidator, as you suggested when you said “To clarify, if you disconnect the scanning appliance from the consolidation server, are your scans able to get through your real estate?”

Posted: 20 October 2011 03:39 PM   [ # 7 ]
Guru
Total Posts:  2740
Joined:  2008-01-25

@Timothy – do you have an active case on this? I want us to keep an eye on the inflight ones in case we can spot anything that may be common.

Posted: 20 October 2011 03:39 PM   [ # 8 ]
BMC ADDM Staff
Administrator
Total Posts:  160
Joined:  2008-02-14

On the subject of memory usage, it is totally normal for any Linux system to fill up its memory as it runs. It is not a problem at all if there is very little free memory because the OS fills any spare memory with cached data. Memory usage is only ever an issue if you see significant swap usage, since that shows the system needs more memory than it has available.
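
As a rough illustration of that distinction (a sketch only, not an ADDM tool), the snippet below reads the standard Linux `/proc/meminfo` file and separates memory held as buffers and page cache from memory that is actually in use, and reports swap usage. The function names and output keys are made up for this example:

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Split memory into cache and non-cache use, and compute swap usage."""
    cache_kb = info.get("Buffers", 0) + info.get("Cached", 0)
    used_kb = info["MemTotal"] - info["MemFree"] - cache_kb
    swap_total_kb = info.get("SwapTotal", 0)
    swap_used_kb = swap_total_kb - info.get("SwapFree", 0)
    swap_pct = 100.0 * swap_used_kb / swap_total_kb if swap_total_kb else 0.0
    return {
        "free_kb": info["MemFree"],
        "cache_kb": cache_kb,            # reclaimable by the kernel on demand
        "used_excluding_cache_kb": used_kb,
        "swap_used_kb": swap_used_kb,
        "swap_used_pct": swap_pct,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        for key, value in memory_summary(parse_meminfo(handle.read())).items():
            print(f"{key}: {value}")
```

A low `free_kb` alongside a large `cache_kb` is the normal case described above; a `swap_used_pct` that keeps climbing is the figure worth watching.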

We do know that sometimes performance of the system degrades as it runs, and that restarting the system brings performance back up. We do not yet know why that is, but we do know that it is not as simple as having large memory usage or a memory leak. That is under active investigation.

On the subject of consolidation performance, I am not aware of any evidence that it is consolidation itself that causes an issue. In all the cases I’ve seen, it’s the volume of data that is the problem, not the fact that consolidation is involved — a single scanning machine with the same volume of data would show the same performance, as would a single machine playing back record data. If anyone has any clear evidence that it is the consolidation step itself that is an issue, please let me know.

One thing that makes it hard for us to work on these performance issues is lack of access to representative data in our labs. If anyone is in a position to give us a large amount of record data from an environment that shows these performance issues, that would make a big difference to our ability to reproduce and analyse the situation. Any data we get will of course be treated carefully and kept confidential. Let me know if you’d be in a position to help us out in that respect!

Cheers,

Duncan.

Posted: 20 October 2011 03:51 PM   [ # 9 ]
Guru
Total Posts:  225
Joined:  2010-06-17

Charles, no, I do not have an issue with my two production scanning appliances going against the one consolidator.

Duncan, my scenario started after the September TKU & EOL updates, although I cannot prove it. What you state is correct in that the appliance always uses all the memory it can get. What I see are two things:
1.) When the environment gets totally hosed (cannot log in from the UI, and only doing a couple of transactions every 5 minutes), my 20 GB swap space usage is at 50%.
2.) SNMP discoveries are extremely painful and that memory never seems to be released.

Things got better after the 8.2.02 upgrade, then went back to restarting all the time after the September TKUs. I have attached a document from way back when I was trying to prove this to support.

Finally, no one in support ever asked for recorded data which I would have been happy to provide. Now that I am in-flight with an upgrade to 8.3, I do not know when or if I will be able to reproduce this problem. I hope not, but based on this thread, I should be able to provide it to you after I upgrade my 3 production appliances.

File Attachments
FR2 .doc  (File Size: 198KB - Downloads: 780)
 
Posted: 20 October 2011 05:28 PM   [ # 10 ]
Guru
Total Posts:  597
Joined:  2011-03-16

Our issue is logged under ISS03849345. We even had a webex with support to allow them to see exactly what we see in real-time.

Our memory issues also come with swap usage. After the memory was fully used, the swap would begin to fill up. Once both filled up, which took less than one day if multiple scans of 765 endpoints were performed, the system would crash. I know restarting is a fix, but as I told support, it should not have to be restarted every day. We do know that Linux likes to “use” memory, but it also returns most of it when it’s done. Our other Linux-based servers (non-ADDM) do not seem to have problems like this.

I can tell you that it is not a consolidation issue on our side, since as a test we took the brand new ADDM 8.3 image and hosted it without customizing TKUs, settings, or even the hostname. Again, no customizations were done (we did not even give it a static IP) and the system had the same issues. This is one reason we are leaning towards a code issue with ADDM. We have even left our appliance sitting idle with no scans or user activity, and the memory usage slowly increases. I am not sure what ADDM would be doing, but within a day it had used up another 500 MB of RAM and never returned it. Our ADDM 8.1.1 with POC specs can outperform our ADDM 8.2.02 Data Center sized appliance.
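
For anyone wanting to quantify idle growth like this, one simple approach is to log used memory at intervals and fit a straight line through the samples. The sketch below does a plain least-squares fit; the function name is made up, and the sample values are hypothetical numbers chosen to mirror the roughly 500 MB per day described above, not measurements from any appliance:

```python
def growth_per_day_mb(samples):
    """Least-squares slope of used memory, in MB per day.

    samples: list of (unix_seconds, used_kb) pairs covering at least
    two distinct timestamps.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    slope_kb_per_s = sum((t - mean_t) * (m - mean_m) for t, m in samples) / denom
    return slope_kb_per_s * 86400 / 1024  # kB/s -> MB/day


if __name__ == "__main__":
    # HYPOTHETICAL samples: (seconds since start, used memory in kB).
    idle_samples = [(0, 1000000), (43200, 1256000), (86400, 1512000)]
    print(f"{growth_per_day_mb(idle_samples):.1f} MB/day")
```

Feeding it the non-cache used figure rather than total used memory avoids counting page cache as growth.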

I believe our support ticket should have a lot of troubleshooting steps that we performed on our own.

Posted: 20 October 2011 06:06 PM   [ # 11 ]
Guru
Total Posts:  225
Joined:  2010-06-17

You said you are running VMs and not physical (as am I). My Linux VM administrator made some changes to make sure that what was specified for the appliance was really there. It is something like “reserving” the resources; I can check with them. Do you know if that was done in your case?

Posted: 20 October 2011 06:15 PM   [ # 12 ]
Guru
Total Posts:  2740
Joined:  2008-01-25

We don’t disbelieve it’s happening, but there are limits to what we can see via Webex. Indeed, we’ve had direct site visits with some folks experiencing the issue, which helped find some things, and I’m grateful for the support of the folks that let us do that.

Our frustration, as Duncan says, is that to get to the bottom of this we need to see the issue in the lab where we can have a system hooked up to our test infrastructure and add instrumentation as needed because right now we can’t provoke one of our systems to do this even though we have a dedicated effort to do so.

If you can share data with us to help us see this problem in the lab so we can fix it then either get in touch with Duncan or if you contact support ask for Richard Gwilliam who has agreed to be a point of contact for gathering data.

Posted: 20 October 2011 06:17 PM   [ # 13 ]
Guru
Total Posts:  597
Joined:  2011-03-16

Craig,
Correct. We are not using shared resources and our VM team has dedicated the resources to our appliances. That was our requirement out of the gate and we verified that as our first troubleshooting step.

Posted: 20 October 2011 06:18 PM   [ # 14 ]
Guru
Total Posts:  225
Joined:  2010-06-17

Richard G. has been onsite at my company and we discussed this issue. Maybe Duncan can provide the recording; I am into the 2nd of 3 appliances being upgraded to 8.3. It’s going to be COB tomorrow and maybe Saturday before I will be back online with ADDM 8.3 production.

Posted: 20 October 2011 06:19 PM   [ # 15 ]
Guru
Total Posts:  597
Joined:  2011-03-16

Charles,
Please let us know what type of information (exactly what type) and how to get it, and I will see if I can clear it within our company. No promises, but I will see what I can do.
