Troubleshooting Guide
This guide provides step-by-step instructions for troubleshooting common issues with the Xshield Gatekeeper.
1. Internet Connectivity Issues
Symptoms
- Gatekeeper cannot reach external services
- Unable to connect to ng.colortokens.com or other required endpoints
Verification Steps
-
Check Firewall Rules
- Review deployment-checklist.md for required outbound rules
- Verify firewall allows traffic to:
- ng.colortokens.com
- logs.ng.colortokens.com
- artifacts.ng.colortokens.com
- telemetry.ng.colortokens.com
- apt-ng.colortokens.com
- registration.colortokens.com
-
Verify DNS Configuration
- Check DNS resolution using nslookup or dig
- Verify DNS/hosts entries for:
- ng.colortokens.com
- logs.ng.colortokens.com
- For on-prem installations:
- Check private DNS entries
- Verify /etc/hosts file entries
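The firewall and DNS checks above can be combined into one pass over the endpoint list. The following is a minimal sketch, not an official tool: it assumes the endpoints are reached over TCP port 443 (confirm the actual ports in deployment-checklist.md), that getent and nc are installed on the gatekeeper, and that traffic does not go through a proxy. The function names are hypothetical.

```shell
#!/bin/sh
# Endpoints taken from this guide. PORT=443 is an assumption.
ENDPOINTS="ng.colortokens.com
logs.ng.colortokens.com
artifacts.ng.colortokens.com
telemetry.ng.colortokens.com
apt-ng.colortokens.com
registration.colortokens.com"
PORT=443

# Return 0 if the name resolves (getent honours /etc/hosts as well as DNS).
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Return 0 if a TCP connection to host:port succeeds within 5 seconds.
reachable() {
    nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}

# Print one status line per endpoint; return non-zero if any check failed.
check_endpoints() {
    failed=0
    for host in $ENDPOINTS; do
        if ! resolves "$host"; then
            echo "FAIL dns  $host"
            failed=1
        elif ! reachable "$host" "$PORT"; then
            echo "FAIL tcp  $host:$PORT"
            failed=1
        else
            echo "OK        $host:$PORT"
        fi
    done
    return $failed
}

# Run on the gatekeeper:
#   check_endpoints
```

A "FAIL dns" line points at the DNS or /etc/hosts configuration, while a "FAIL tcp" line for a name that resolves points at an outbound firewall rule.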
2. DHCP Issues
Symptoms
- Devices not receiving IP addresses
- DHCP service not responding
Verification Steps
-
Check DHCP Service Status
systemctl status ctcoredhcp
-
Verify Configuration
- Check DHCP relay vs server mode
- Verify DHCP range configuration
- Ensure no other DHCP servers on the network
-
Additional Checks for DHCP Relay
systemctl status coredhcp
- Verify upstream DHCP server is running
- Check network connectivity to upstream DHCP server
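The DHCP checks above can be scripted roughly as follows. This is a sketch under assumptions: the unit name ctcoredhcp comes from this guide, the helper names are hypothetical, and the port check relies on the DHCP service (server or relay) listening on the standard DHCP server port, UDP 67.

```shell
#!/bin/sh
# Print "<unit>: <state>" and return 0 only if the unit is active.
unit_state() {
    # `systemctl is-active` prints active, inactive, failed, and so on.
    state=$(systemctl is-active "$1" 2>/dev/null) || true
    echo "$1: ${state:-unknown}"
    [ "$state" = "active" ]
}

# Return 0 if any local socket is listening on UDP port 67.
dhcp_port_listening() {
    ss -lun | grep -Eq ':67[[:space:]]'
}

# Return 0 if the upstream DHCP server answers ping (relay mode only).
# $1 = upstream DHCP server IP
upstream_reachable() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# Example, run on the gatekeeper:
#   unit_state ctcoredhcp
#   dhcp_port_listening || echo "nothing listening on UDP 67"
#   upstream_reachable <upstream_dhcp_ip>
```

If the unit is active but nothing listens on UDP 67, the service logs (for example `journalctl -u ctcoredhcp`) are the next place to look.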
3. Gatekeeper Status Issues
Symptoms
- Gatekeeper reported as down
- Unresponsive in Xshield console
Verification Steps
-
Physical Checks
- Verify power supply
- Check network connectivity
- Ensure machine is powered on
-
System Status
systemctl status ctappliance
- Check Xshield console status
- Verify internet connectivity
- Check cloud site availability
- Verify on-premise site connectivity
4. Network Traffic Issues
Symptoms
- Devices cannot reach internet/intranet
- Traffic not flowing through gatekeeper
Verification Steps
-
Basic Connectivity
- Ping gatekeeper from affected devices
- Use traceroute to identify traffic drop points
-
Network Verification
tcpdump -i <lan_interface>
- Verify traffic on gatekeeper's LAN interface
- Check for ARP entries
- Verify correct gateway configuration
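To narrow the capture to one affected device, the checks above might be scripted as in the sketch below. The function names are hypothetical, and it assumes the ip (iproute2), tcpdump and timeout utilities are available on the gatekeeper.

```shell
#!/bin/sh
# Return 0 if the neighbour (ARP) table holds a resolved entry for the IP.
# `ip neigh show <ip>` prints lines such as:
#   192.0.2.10 dev eth1 lladdr 00:11:22:33:44:55 REACHABLE
# Entries in FAILED or INCOMPLETE state carry no lladdr field.
has_arp_entry() {
    ip neigh show "$1" | grep -q 'lladdr'
}

# Capture a bounded sample of one device's traffic on the LAN interface.
# $1 = LAN interface, $2 = device IP, $3 = packet count (default 20)
# -n disables name resolution; timeout stops the capture after 30 seconds.
sample_traffic() {
    timeout 30 tcpdump -ni "$1" -c "${3:-20}" host "$2"
}

# Example, run on the gatekeeper:
#   has_arp_entry <device_ip> || echo "no ARP entry for device"
#   sample_traffic <lan_interface> <device_ip>
```

No ARP entry usually points at a layer 2 problem (cabling, VLAN, switch port). An ARP entry with no captured packets suggests the device is not using the gatekeeper as its gateway.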
5. Traffic Flow Recording Issues
Symptoms
- No traffic flows recorded
- Conntrack not registering traffic
Verification Steps
-
Check Conntrack
conntrack -L
- Verify traffic registration
- Check for appropriate rules in:
nft list ruleset
-
Network Verification
tcpdump -i <lan_interface>
- Verify traffic visibility on LAN interface
- Check device gateway configuration
- Verify the device netmask is /32, or whatever the environment requires
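The conntrack and nftables checks above can be narrowed to a single device. A minimal sketch, assuming conntrack-tools and nft are installed; the function names are hypothetical, and which nftables rules count as "appropriate" is deployment-specific and not checked here.

```shell
#!/bin/sh
# Count conntrack entries whose original source is the given device IP.
# -L lists the table; -s filters on the source address. The summary line
# that conntrack prints goes to stderr and is discarded.
flows_from() {
    conntrack -L -s "$1" 2>/dev/null | grep -cF "src=$1 " || true
}

# Return 0 if nftables has at least one table loaded.
ruleset_loaded() {
    nft list ruleset | grep -q '^table '
}

# Example, run on the gatekeeper:
#   flows_from <device_ip>      # 0 means no traffic from the device is tracked
#   ruleset_loaded || echo "nftables ruleset is empty"
```

A count of 0 for a device that is known to be sending traffic suggests the traffic is not passing through the gatekeeper, which brings the gateway and netmask checks above into play.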
6. Device Discovery Issues
Symptoms
- Devices not appearing in gatekeeper
- Devices remain unmanaged
Verification Steps
-
Check ARP Table
arp -a
-
Verify Asset Cache
- Check /etc/colortokens/asset-cache/asset-cache*.json
- Verify device status (managed/unmanaged)
-
Additional Steps
- Verify device is sending/receiving traffic
- Check for traffic registration in conntrack
- Ping the device from outside its subnet to generate traffic through the gatekeeper
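Because the asset cache schema is not described in this guide, the sketch below only does a plain text search for a device's IP or MAC address in the cache files. The function name is hypothetical; the default directory is the path given above.

```shell
#!/bin/sh
# List asset-cache files that mention the device.
# $1 = IP or MAC address, $2 = cache directory (optional)
# -l prints matching file names, -F treats the address as a literal string,
# -w avoids matching 192.0.2.1 inside 192.0.2.10.
find_in_asset_cache() {
    grep -lwF -- "$1" "${2:-/etc/colortokens/asset-cache}"/asset-cache*.json 2>/dev/null
}

# Example, run on the gatekeeper:
#   find_in_asset_cache <device_ip> || echo "device not in asset cache"
```

If the device is absent from the cache, check the ARP table and conntrack for it before assuming a discovery fault, since a silent device cannot be discovered.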
7. Remote Access Issues
Symptoms
- Cannot SSH into gatekeeper
- No console access
Best Practices
-
Management LAN Connection
- Always connect gatekeeper to management LAN
- Configure BMC/ILoM/iDRAC
- Ensure management LAN has IP connectivity
-
Remote Access Options
- BMC/ILoM/iDRAC console
- Management LAN SSH access
- Note: GK-2000 boxes may not have a console port
8. Additional Troubleshooting Tips
HA Cluster Issues
- Verify VRRP ID consistency across cluster
- Check peer IP configuration
- Verify virtual IP assignment
Network Configuration
- Verify VLAN trunking on switch ports
- Check network segmentation
- Verify correct subnet configuration
Logs and Diagnostics
- Collect diagnostics from Xshield console
- Check system logs
- Verify application logs
Best Practices
-
Documentation
- Maintain network documentation
- Document VLAN assignments
- Keep firewall rules documented
-
Monitoring
- Set up monitoring alerts
- Regularly check system status
- Verify network connectivity
9. Troubleshooting Checklist
-
Basic Connectivity
- Verify power
- Check network cables
- Verify switch connectivity
-
Network Configuration
- Check VLAN settings
- Verify IP configurations
- Check firewall rules
-
Services
- Verify DHCP service
- Check DNS resolution
- Verify conntrack
-
System Status
- Check system logs
- Verify service status
- Check resource usage
10. Emergency Recovery Procedures
-
Physical Access
- Power cycle if necessary
- Verify hardware connections
- Check system LEDs
-
Remote Recovery
- Use BMC/ILoM/iDRAC
- Follow vendor-specific recovery procedures
- Document recovery steps
11. Documentation References
- Refer to deployment-checklist.md for network requirements
- Check deployment-guide.md for configuration details
- Review gatekeeper-operation-maintenance.md for operational procedures
12. Support Information
When contacting support:
- Provide system logs
- Include diagnostic bundle
- Document troubleshooting steps taken
- Provide network topology information
- Include error messages and symptoms
13. HA Cluster Issues
Symptoms
- Both gatekeepers showing as Standby
- No active node in cluster
Verification Steps
-
Check Network Connectivity
ping <peer_gk_ip>
- Verify GK devices can ping each other
- Check multicast packet flow
- Verify no firewall blocking VRRP traffic
-
Verify VRRP Configuration
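- Check that both nodes in the cluster use the same VRRP ID
- Verify both nodes have the same WAN VIP and LAN VIP
- Confirm both nodes use an identical WAN gateway IP
- Ensure the VRRP ID is not reused by any other cluster in the tenant
- Verify the VRRP password matches on both nodes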
-
Check VRRP Traffic
tcpdump -i <interface> vrrp
- Verify VRRP packets are visible
- Check packet frequency and integrity
-
Review Configuration
cat /root/ctgatekeeper.json
- Verify IP configurations
- Check VRRP settings
- Confirm network interface configurations
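When reading the VRRP capture, it helps to reduce it to which source address is advertising which VRRP ID. The sketch below assumes tcpdump's default one-line output (no -v or -e), in which advertisements look like "IP 192.0.2.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, ...". The function name is hypothetical.

```shell
#!/bin/sh
# Summarise "source vrid=<id>" pairs from tcpdump text output on stdin.
vrrp_senders() {
    sed -n 's/.*IP \([^ ]*\) > .*VRRPv[0-9]*, .*vrid \([0-9]*\),.*/\1 vrid=\2/p' | sort -u
}

# Example, run on each gatekeeper (listens for 10 seconds):
#   -n disables name resolution, -l line-buffers output for the pipe.
#   timeout 10 tcpdump -nli <interface> vrrp 2>/dev/null | vrrp_senders
```

In VRRP only the active node sends advertisements, so an empty result on both nodes is consistent with the "no active node" symptom or with VRRP traffic being blocked. A vrid in the output that differs from the configured VRRP ID may belong to another cluster on the same network.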
Best Practices
- Document VRRP ID allocation
- Regularly verify network connectivity
- Monitor VRRP traffic
- Keep VRRP passwords secure
14. Agent Alerts
Agent Unreachable Alert
Symptoms
- Agent unreachable alert in console
- Both HA nodes down
- Standalone GK down
Verification Steps
-
Check Physical Status
- Verify power supply
- Check network cables
- Verify switch connectivity
-
Check Network Status
ping <gk_ip>
- Verify network connectivity
- Check for network outages
- Verify firewall rules
Agent Failover Alert
Symptoms
- Agent failover alert in console
- Interface down
- Connectivity loss
Verification Steps
-
Check System Logs
cat /var/log/syslog
- Look for interface down events
- Check for connectivity issues
- Verify error messages
-
Collect Diagnostics
- Collect diagnostics from the Xshield console
- Collect syslog
- Gather interface status
- Check network configuration
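Rather than reading the whole syslog, the interface events can be filtered out first. This is a sketch only: the match patterns are assumptions based on common Linux kernel and NIC driver messages, not on what the gatekeeper is known to log, so adjust them as needed. The function name is hypothetical.

```shell
#!/bin/sh
# Print the most recent log lines that look like link state changes.
# $1 = log file (default /var/log/syslog), $2 = max lines (default 20)
link_events() {
    grep -iE 'link is (down|up)|link (down|up)|carrier (lost|acquired)' \
        "${1:-/var/log/syslog}" | tail -n "${2:-20}"
}

# Example, run on the gatekeeper:
#   link_events
#   ip -br link      # current state of every interface
```

Matching the timestamp of a link-down line against the time of the failover alert shows whether the alert was triggered by an interface event or by something else.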
Common Alert Scenarios
-
Network Connectivity Issues
- Interface failures
- Network outages
- Firewall blocks
-
Hardware Issues
- Power failures
- Hardware failures
- Network card issues
-
Configuration Issues
- Incorrect IP settings
- Invalid VRRP configurations
- Network misconfigurations
Alert Resolution Steps
-
Network Connectivity
- Check physical connections
- Verify network configuration
- Check firewall rules
-
Hardware Issues
- Verify power supply
- Check hardware status
- Replace faulty components
-
Configuration Issues
- Verify IP configurations
- Check VRRP settings
- Validate network settings
15. Additional Monitoring and Prevention
Best Practices
-
Monitoring
- Set up network monitoring
- Configure alert thresholds
- Regularly check system logs
-
Documentation
- Document VRRP configurations
- Keep network diagrams updated
- Maintain configuration backups
-
Regular Maintenance
- Check system health
- Verify network connectivity
- Test failover procedures
16. Support Information
When contacting support for HA cluster issues:
- Provide system logs
- Include VRRP configuration
- Document network topology
- Share tcpdump captures
- Include error messages and symptoms
- Provide ctgatekeeper.json contents
17. Diagnostics Collection
Using Xshield Console
- Select Gatekeeper in UI
- Choose "Collect Diagnostics"
- This process may take several minutes
- Once complete:
- View diagnostics in UI
- Download diagnostics zip file
- Send zip file to support
Best Practices
- Always use UI for diagnostics collection
- Wait for complete diagnostics collection
- Download and verify diagnostics zip before sending
18. Gatekeeper Operations
Service Management
-
Reboot Options
- Use UI for planned reboots
- Use console access for emergency reboots
-
Service Restart
# Restart Gatekeeper
systemctl restart ctappliance
# Restart HA Services
systemctl restart ctgatekeeper-ha
# Restart DHCP
systemctl restart ctcoredhcp
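A restart is only useful if the service comes back, so it may help to confirm the state afterwards. A minimal sketch, assuming the unit names listed above; the function name and the 3-second settle time are arbitrary assumptions.

```shell
#!/bin/sh
# Restart a unit, wait briefly, then confirm it reports "active".
# $1 = unit name, $2 = seconds to wait before checking (default 3)
restart_and_verify() {
    systemctl restart "$1" || { echo "$1: restart command failed"; return 1; }
    sleep "${2:-3}"
    if systemctl is-active --quiet "$1"; then
        echo "$1: active"
    else
        echo "$1: not active, see 'journalctl -u $1 -n 50'"
        return 1
    fi
}

# Example, run on the gatekeeper:
#   restart_and_verify ctappliance
#   restart_and_verify ctgatekeeper-ha
#   restart_and_verify ctcoredhcp
```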
Device Management
- Device Scan
- Initiate from Xshield Console
- Re-identifies managed devices
- Updates device information
- Reassesses vulnerabilities
- Collects firmware version info
Best Practices
- Use UI for most operations
- Reserve console access for:
- WAN settings changes
- Proxy configuration
- Initial registration
- Emergency situations
19. Gatekeeper Uninstallation
Using Xshield Console
- Select Gatekeeper
- Choose "Decommission" option
- Confirm decommissioning
- Wait for completion
- Unplug from network
Console-Based Uninstallation
# Only as last resort
ssh <gk_ip>
ctgatekeeper decommission
Best Practices
- Always use UI for decommissioning
- Use console only for:
- WAN settings changes
- Proxy configuration
- Initial registration
- Emergency situations
20. Support Information
Required Information
- System Logs
- VRRP Configuration
- Network Topology
- Tcpdump Captures
- Error Messages
- ctgatekeeper.json Contents
- Diagnostics Zip File
Best Practices
- Include all requested information
- Document troubleshooting steps taken
- Provide network diagrams
- Share configuration backups
21. Support Workflow
Initial Contact
-
Contact ColorTokens Support
- Phone: [Support Number]
- Email: support@colortokens.com
- Portal: [Support Portal URL]
-
Provide Initial Information
- Customer details
- Issue description
- Impact level
- System logs
- Diagnostics zip file
Support Response Levels
Level 1 Support
- Initial issue triage
- Basic troubleshooting
- Configuration verification
- Common issue resolution
Level 2 Support
- Advanced troubleshooting
- System analysis
- Configuration review
- Escalation for complex issues
Product Engineering
- Root cause analysis
- Bug identification
- Feature requests
- Emergency fixes
Escalation Criteria
-
Critical Issues
- System downtime
- Security breaches
- Data loss
- Major functionality failure
-
High Priority Issues
- Significant performance impact
- Multiple users affected
- Business-critical operations
-
Standard Issues
- Configuration questions
- Minor functionality issues
- Documentation requests
Support Process Flow
-
Issue Reporting
- Open support ticket
- Provide required information
- Attach diagnostics
-
Initial Analysis
- Basic troubleshooting
- Configuration review
- Log analysis
-
Escalation Points
- If issue unresolved
- If impact critical
- If additional expertise needed
-
Resolution Steps
- Provide fix/workaround
- Document solution
- Close ticket
Best Practices for Support
-
Before Contacting Support
- Follow troubleshooting guide
- Collect diagnostics
- Document steps taken
- Gather system information
-
During Support Interaction
- Provide clear issue description
- Share all relevant information
- Follow support team instructions
- Document all interactions
-
After Resolution
- Verify fix
- Document solution
- Update procedures
- Close ticket
Support Response Times
-
Critical Issues
- Response: Within 1 hour
- Resolution: Within 4 hours
-
High Priority Issues
- Response: Within 4 hours
- Resolution: Within 24 hours
-
Standard Issues
- Response: Within 24 hours
- Resolution: Within 5 business days
Additional Support Resources
-
Knowledge Base
- Documentation
- Troubleshooting guides
- Best practices
-
Community Forums
- User discussions
- Best practices sharing
- Common solutions
-
Training Resources
- Webinars
- Training materials
- Best practices guides
Last Updated: [Date]
Version: 1.3