2007-Jun-26 - A real example of troubleshooting
The problem found during testing: SMTP performance drops sharply when running a volume test with both SMTP and POP3 traffic on a gateway product.
Related modules: SMTP module, POP3 module, Scan module (which scans the network traffic)
Possible causes: Related modules problem (bug), data volume, specific test sample, others.
To narrow down the causes step by step (notation: --> Result ==> Conclusion):
TEST 1: Load SMTP traffic only, with no POP3 traffic at all. --> Much higher SMTP performance. ==> POP3 traffic has an impact on SMTP performance.
TEST 2: Load POP3 traffic only, with no SMTP traffic at all. --> POP3 performance is not much different from the test with both SMTP and POP3 traffic. ==> SMTP traffic does not have an impact on POP3 performance.
The following test tries to rule out the "others" cause:
TEST 3: Disable unrelated features/actions, e.g. notifications, agent, debug logs, transaction logs, etc. Load both SMTP and POP3 traffic. --> SMTP performance still drops seriously; the same result. ==> Exclude the "others" cause.
The following tests try to rule out the "data volume" cause:
TEST 4: Adjust the POP3 traffic to 5 mails/connection and then 1 mail/connection, and compare the results. --> No better; the same. ==> Exclude the POP3 traffic volume cause.
TEST 5: Reduce the POP3 sample mail size to between 15KB and 26KB. --> No better. ==> Exclude the POP3 Scan volume cause.
The following test tries to rule out the "specific test sample" cause:
TEST 6: Replace all of the POP3 sample mails with SMTP sample mails. --> No better. ==> Exclude the "specific test sample" cause.
==> That leaves the "Related modules problem" as the cause to track down.
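The elimination in TESTS 3 to 6 can be written down as a simple log. This is only an illustration: the cause labels and outcomes come from the tests above, while `remaining_causes` and the data layout are hypothetical. It uses the same reasoning as the tests: if the symptom persists after a factor is removed, that factor is excluded.

```python
# Illustrative sketch of the elimination log from TESTS 3-6.
# Cause labels and outcomes are taken from the post; the structure is hypothetical.

CANDIDATE_CAUSES = [
    "Related modules problem",
    "data volume",
    "specific test sample",
    "others",
]

# Each entry: (test name, cause the test removes, whether the symptom persisted).
ELIMINATION_TESTS = [
    ("TEST 3: disable unrelated features", "others", True),
    ("TEST 4/5: reduce POP3 volume and mail size", "data volume", True),
    ("TEST 6: replace POP3 samples with SMTP samples", "specific test sample", True),
]


def remaining_causes(candidates, tests):
    """Return the causes that no test has excluded (input list is not modified)."""
    remaining = list(candidates)
    for name, cause, symptom_persisted in tests:
        if symptom_persisted and cause in remaining:
            # Removing this factor did not fix the problem, so exclude it.
            remaining.remove(cause)
            print(f"{name} ==> exclude {cause!r}")
    return remaining


if __name__ == "__main__":
    print("Left to investigate:", remaining_causes(CANDIDATE_CAUSES, ELIMINATION_TESTS))
```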
TEST 7: Run the same test on the previous version of this product. --> SMTP performance is better on the old version. ==> The problem may be caused by the new features in this version.
TEST 8: Disable all of the POP3 Scan features and run the test on the current version again. --> No better. ==> Exclude the POP3 Scan features.
TEST 9: Also disable all of the SMTP Scan features except NRS (Network Reputation Service, a new feature that applies only to SMTP). --> No better. ==> Exclude the old SMTP Scan features.
TEST 10: Disable all Scan features, including NRS. --> SMTP performance rises to the old version's level. ==> NRS is the likely cause!
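TESTS 8 to 10 disable feature groups one after another until SMTP performance returns to the old version's level. A minimal sketch of that search, assuming a `measure` callable that runs a volume test with the given groups disabled and returns SMTP throughput. `isolate_feature`, `fake_smtp_benchmark`, and all numbers are hypothetical; the fake benchmark only reproduces the outcomes reported above.

```python
from typing import Callable, FrozenSet, List, Optional, Set

# Feature groups in the order they were disabled in TESTS 8-10.
FEATURE_GROUPS: List[str] = ["POP3 Scan features", "old SMTP Scan features", "NRS"]


def isolate_feature(
    groups: List[str],
    measure: Callable[[FrozenSet[str]], float],
    baseline: float,
    tolerance: float = 0.05,
) -> Optional[str]:
    """Disable groups cumulatively and return the first group whose removal
    brings throughput back to within `tolerance` of `baseline`, else None."""
    disabled: Set[str] = set()
    for group in groups:
        disabled.add(group)
        if measure(frozenset(disabled)) >= baseline * (1 - tolerance):
            return group
    return None


def fake_smtp_benchmark(disabled: FrozenSet[str]) -> float:
    """Hypothetical stand-in for a real volume test run. The numbers are
    invented; SMTP throughput recovers only once NRS is disabled."""
    return 100.0 if "NRS" in disabled else 40.0


if __name__ == "__main__":
    OLD_VERSION_LEVEL = 100.0  # hypothetical throughput of the previous version
    print(isolate_feature(FEATURE_GROUPS, fake_smtp_benchmark, OLD_VERSION_LEVEL))
```

Because the groups are disabled cumulatively, the search only shows that disabling the last group was needed for recovery. It does not show that group acts alone, which is why TEST 11 toggles NRS by itself.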
TEST 11: Load SMTP traffic only and run the following tests for comparison:
11-1: NRS disabled. --> High performance.
11-2: NRS enabled. --> Performance drops a lot.
11-3: NRS enabled, but with the IP addresses added to the white list. --> Performance is slightly lower than 11-1 but better than 11-2.
==> NRS IS the cause of this problem!! (Or at least one of the main causes.)
One question remains: how does the result of TEST 1 fit the final conclusion? NRS and the other SMTP/POP3 scan features are all implemented by the Scan module, so SMTP and POP3 traffic share it. When the traffic load is light, the Scan module can keep up with its tasks. When the load becomes heavier, scan tasks queue up and performance drops noticeably. So it looks as if POP3 causes the problem, but it actually does not: POP3 only increases the traffic load and magnifies the symptom.
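The queuing argument can be illustrated with a toy model. This sketch assumes the Scan module behaves like a single FIFO server with Poisson arrivals and exponential scan times per traffic type (an M/G/1 queue, using the Pollaczek-Khinchine mean-wait formula). All rates and scan times are made up, and `smtp_response_time` is a hypothetical helper. The model only looks at SMTP and does not try to explain TEST 2.

```python
"""Toy M/G/1 model of a Scan module shared by SMTP and POP3 traffic.

All rates and service times are hypothetical, chosen only to illustrate
how extra load magnifies the cost of NRS; they are not measurements.
"""


def smtp_response_time(lam_smtp: float, s_smtp: float,
                       lam_pop3: float, s_pop3: float) -> float:
    """Mean time (seconds) an SMTP scan task spends in the Scan module.

    Assumes one FIFO server, Poisson arrivals and exponential service times
    per traffic type, so E[S^2] = 2 * s^2 per type. The Pollaczek-Khinchine
    mean wait then simplifies to Wq = sum(lam_i * s_i**2) / (1 - rho).
    """
    rho = lam_smtp * s_smtp + lam_pop3 * s_pop3  # Scan module utilization
    if rho >= 1.0:
        return float("inf")  # overloaded: the queue grows without bound
    wait = (lam_smtp * s_smtp ** 2 + lam_pop3 * s_pop3 ** 2) / (1.0 - rho)
    return wait + s_smtp  # queueing delay plus the task's own scan time


LAM_SMTP, LAM_POP3 = 100.0, 100.0  # hypothetical tasks per second
S_POP3 = 0.003                     # hypothetical POP3 scan time (s)
S_SMTP_NO_NRS = 0.002              # hypothetical SMTP scan time without NRS (s)
S_SMTP_NRS = 0.005                 # hypothetical SMTP scan time with NRS (s)

if __name__ == "__main__":
    for pop3_rate in (0.0, LAM_POP3):
        off = smtp_response_time(LAM_SMTP, S_SMTP_NO_NRS, pop3_rate, S_POP3)
        on = smtp_response_time(LAM_SMTP, S_SMTP_NRS, pop3_rate, S_POP3)
        print(f"POP3 load {pop3_rate:5.0f}/s: NRS off {off * 1000:5.1f} ms, "
              f"NRS on {on * 1000:5.1f} ms, NRS penalty {(on - off) * 1000:5.1f} ms")
```

With these made-up numbers, NRS adds about 7.5 ms per SMTP task without POP3 load and about 17.4 ms with it, because the extra POP3 work pushes the shared server's utilization from 0.5 to 0.8 and the wait grows as 1/(1 - rho).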