Troubleshooting Guide

This guide provides step-by-step instructions for troubleshooting common issues with the Xshield Gatekeeper.

1. Internet Connectivity Issues

Symptoms

Gatekeeper cannot reach external services
Unable to connect to ng.colortokens.com or other required endpoints

Verification Steps

Check Firewall Rules
- Review deployment-checklist.md for required outbound rules
- Verify firewall allows traffic to:
  - ng.colortokens.com
  - logs.ng.colortokens.com
  - artifacts.ng.colortokens.com
  - telemetry.ng.colortokens.com
  - apt-ng.colortokens.com
  - registration.colortokens.com
Verify DNS Configuration
- Check DNS resolution using nslookup or dig
- Verify DNS/hosts entries for:
  - ng.colortokens.com
  - logs.ng.colortokens.com
- For on-prem installations:
  - Check private DNS entries
  - Verify /etc/hosts file entries
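
The firewall and DNS checks above can be run in one pass. The following is a minimal sketch, not an official tool: it assumes `getent` and `nc` are available on the gatekeeper, and it assumes the endpoints are reached over TCP 443 (confirm the actual ports in deployment-checklist.md). The function names are illustrative only.

```shell
#!/bin/sh
# Endpoints copied from the firewall list above.
ENDPOINTS="ng.colortokens.com logs.ng.colortokens.com artifacts.ng.colortokens.com telemetry.ng.colortokens.com apt-ng.colortokens.com registration.colortokens.com"

# Assumption: HTTPS on TCP 443; override with PORT=... if the checklist says otherwise.
PORT="${PORT:-443}"

# Print one endpoint per line.
list_endpoints() {
    for host in $ENDPOINTS; do
        printf '%s\n' "$host"
    done
}

# Resolve a name through the system resolver (which consults /etc/hosts
# and DNS as configured in nsswitch.conf), then attempt a TCP connection.
check_endpoint() {
    host="$1"
    if ! getent hosts "$host" >/dev/null; then
        printf '%-32s DNS FAILED\n' "$host"
        return 1
    fi
    if nc -z -w 5 "$host" "$PORT" >/dev/null 2>&1; then
        printf '%-32s OK (tcp/%s)\n' "$host" "$PORT"
    else
        printf '%-32s resolved, but tcp/%s blocked or unreachable\n' "$host" "$PORT"
        return 1
    fi
}

# Check every endpoint; returns non-zero if any check failed.
check_all() {
    rc=0
    for host in $(list_endpoints); do
        check_endpoint "$host" || rc=1
    done
    return "$rc"
}

# Usage (run on the gatekeeper):
#   check_all
```

A "DNS FAILED" line points at the DNS or /etc/hosts configuration, while a "resolved, but ... blocked" line points at a firewall rule or routing problem.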

2. DHCP Issues

Symptoms

Devices not receiving IP addresses
DHCP service not responding

Verification Steps

Check DHCP Service Status
```
systemctl status ctcoredhcp
```
Verify Configuration
- Check DHCP relay vs server mode
- Verify DHCP range configuration
- Ensure no other DHCP servers on the network
Additional Checks for DHCP Relay
```
systemctl status coredhcp
```
- Verify upstream DHCP server is running
- Check network connectivity to upstream DHCP server
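
To check for other DHCP servers on the network, one option is to capture DHCP traffic on the LAN interface and list which addresses send replies. The sketch below assumes `tcpdump` is installed and produces its usual one-line text output with `-n`; `dhcp_servers` is a hypothetical helper name, and the output format may differ between tcpdump versions.

```shell
#!/bin/sh
# Read tcpdump text output on stdin and print the distinct source
# addresses of DHCP replies (offers and acks), one per line.
dhcp_servers() {
    awk '/BOOTP\/DHCP, Reply/ {
        for (i = 1; i < NF; i++) {
            if ($i == "IP") {
                src = $(i + 1)
                sub(/\.[0-9]+$/, "", src)   # drop the trailing ".67" port
                print src
                break
            }
        }
    }' | sort -u
}

# Usage (capture 50 DHCP packets, then summarize):
#   tcpdump -nli <lan_interface> -c 50 port 67 or port 68 | dhcp_servers
```

More than one address in the output is worth investigating, although in relay mode replies forwarded by the relay can legitimately appear alongside the upstream server.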

3. Gatekeeper Status Issues

Symptoms

Gatekeeper reported as down
Unresponsive in Xshield console

Verification Steps

Physical Checks
- Verify power supply
- Check network connectivity
- Ensure machine is powered on
System Status
```
systemctl status ctappliance
```
- Check Xshield console status
- Verify internet connectivity
- Check cloud site availability
- Verify on-premise site connectivity

4. Network Traffic Issues

Symptoms

Devices cannot reach internet/intranet
Traffic not flowing through gatekeeper

Verification Steps

Basic Connectivity
- Ping gatekeeper from affected devices
- Use traceroute to identify traffic drop points
Network Verification
```
tcpdump -i <lan_interface>
```
- Verify traffic on gatekeeper's LAN interface
- Check for ARP entries
- Verify correct gateway configuration
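
The gateway check can be scripted on an affected device. This is a sketch that assumes the device runs Linux with the iproute2 `ip` command; the helper names are illustrative, and `<gk_lan_ip>` stands for the gatekeeper's LAN address.

```shell
#!/bin/sh
# Read `ip route` output on stdin and print the default gateway address.
default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "via") { print $(i + 1); exit }
        }
    }'
}

# Compare the default gateway found on stdin with the expected address.
check_gateway() {
    expected="$1"
    actual="$(default_gateway)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: default gateway is $actual"
    else
        echo "MISMATCH: default gateway is ${actual:-unset}, expected $expected"
        return 1
    fi
}

# Usage (on the affected device):
#   ip route show default | check_gateway <gk_lan_ip>
#   ip neigh show <gk_lan_ip>     # confirm an ARP entry exists for the gateway
```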

5. Traffic Flow Recording Issues

Symptoms

No traffic flows recorded
Conntrack not registering traffic

Verification Steps

Check Conntrack
```
conntrack -L
```
- Verify traffic registration
- Check for appropriate rules in:
```
nft list ruleset
```
Network Verification
```
tcpdump -i <lan_interface>
```
- Verify traffic visibility on LAN interface
- Check device gateway configuration
- Verify netmask is /32 or appropriate for environment
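
To confirm that a particular device's traffic is being registered, the conntrack table can be filtered by originating address. The sketch below assumes the usual `conntrack -L` text format, in which the first `src=` field on each line is the original direction of the flow; `count_flows_from` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `conntrack -L` output on stdin and count the entries whose
# original-direction source address equals the given IP.
count_flows_from() {
    awk -v ip="$1" '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^src=/) {
                # Only the first src= field (original direction) is compared.
                if ($i == "src=" ip) n++
                break
            }
        }
    } END { print n + 0 }'
}

# Usage (the summary line conntrack prints goes to stderr):
#   conntrack -L 2>/dev/null | count_flows_from <device_ip>
```

A count of 0 while the device is actively sending traffic suggests the traffic is not passing through the gatekeeper, which fits a wrong gateway or netmask on the device.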

6. Device Discovery Issues

Symptoms

Devices not appearing in gatekeeper
Devices remain unmanaged

Verification Steps

Check ARP Table
```
arp -a
```
Verify Asset Cache
- Check /etc/colortokens/asset-cache/asset-cache*.json
- Verify device status (managed/unmanaged)
Additional Steps
- Verify device is sending/receiving traffic
- Check for traffic registration in conntrack
- Ping device from outside subnet to create traffic
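
The asset cache check can be narrowed to a single device by searching the cache files for its IP or MAC address. This sketch makes no assumption about the JSON schema and only looks for the address as a whole word; `cache_files_mentioning` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the asset cache files that mention the given address.
# -w avoids matching 10.0.0.5 inside 10.0.0.50; -F treats dots literally.
cache_files_mentioning() {
    needle="$1"
    dir="${2:-/etc/colortokens/asset-cache}"
    grep -lwF -- "$needle" "$dir"/asset-cache*.json 2>/dev/null
}

# Usage (on the gatekeeper):
#   cache_files_mentioning <device_ip>
#   arp -a | grep -wF <device_ip>     # confirm the device is in the ARP table
```

If the device is in the ARP table but in no cache file, generating traffic from it (for example with the ping described above) and re-checking is a reasonable next step.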

7. Remote Access Issues

Symptoms

Cannot SSH into gatekeeper
No console access

Best Practices

Management LAN Connection
- Always connect gatekeeper to management LAN
- Configure BMC/ILoM/iDRAC
- Ensure management LAN has IP connectivity
Remote Access Options
- BMC/ILoM/iDRAC console
- Management LAN SSH access
- Note: GK-2000 boxes may not have a console port

8. Resetting the password

If the gatekeeper's password is lost or has expired, recover access as follows.

Re-flash the gatekeeper box, configure the WAN interface so that it can reach the platform, and register it under a new name.
Wait until the gatekeeper status changes to "Absent" on the Gatekeepers page. Then follow the recovery procedure in the Gatekeeper Recovery Guide, starting from Step 2.

8. Additional Troubleshooting Tips

HA Cluster Issues

Verify VRRP ID consistency across cluster
Check peer IP configuration
Verify virtual IP assignment

Network Configuration

Verify VLAN trunking on switch ports
Check network segmentation
Verify correct subnet configuration

Logs and Diagnostics

Collect diagnostics from Xshield console
Check system logs
Verify application logs

Best Practices

Documentation
- Maintain network documentation
- Document VLAN assignments
- Keep firewall rules documented
Monitoring
- Set up monitoring alerts
- Regularly check system status
- Verify network connectivity

9. Troubleshooting Checklist

Basic Connectivity
- Verify power
- Check network cables
- Verify switch connectivity
Network Configuration
- Check VLAN settings
- Verify IP configurations
- Check firewall rules
Services
- Verify DHCP service
- Check DNS resolution
- Verify conntrack
System Status
- Check system logs
- Verify service status
- Check resource usage

10. Emergency Recovery Procedures

Physical Access
- Power cycle if necessary
- Verify hardware connections
- Check system LEDs
Remote Recovery
- Use BMC/ILoM/iDRAC
- Follow vendor-specific recovery procedures
- Document recovery steps

11. Documentation References

Refer to deployment-checklist.md for network requirements
Check deployment-guide.md for configuration details
Review gatekeeper-operation-maintenance.md for operational procedures

12. Support Information

When contacting support:

Provide system logs
Include diagnostic bundle
Document troubleshooting steps taken
Provide network topology information
Include error messages and symptoms

13. HA Cluster Issues

Symptoms

Both gatekeepers showing as Standby
No active node in cluster

Verification Steps

Check Network Connectivity
```
ping <peer_gk_ip>
```
- Verify GK devices can ping each other
- Check multicast packet flow
- Verify no firewall blocking VRRP traffic
Verify VRRP Configuration
- Check VRRP ID consistency
- Verify same WAN VIP and LAN VIP
- Confirm identical WAN gateway IP
- Ensure VRRP ID is unique across tenant
- Verify VRRP password match
Check VRRP Traffic
```
tcpdump -i <interface> vrrp
```
- Verify VRRP packets are visible
- Check packet frequency and integrity
Review Configuration
```
cat /root/ctgatekeeper.json
```
- Verify IP configurations
- Check VRRP settings
- Confirm network interface configurations
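
VRRP advertisements are sent as IP protocol 112 to the multicast address 224.0.0.18, so a short capture shows which nodes are advertising and with which VRRP ID. The sketch below summarizes such a capture; it assumes tcpdump's usual one-line VRRP output, which may vary by version, and `vrrp_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Read tcpdump text output on stdin and print one line per distinct
# (source address, VRRP ID, priority) seen in VRRP advertisements.
vrrp_summary() {
    awk '/VRRPv[23], Advertisement/ {
        src = ""; vrid = ""; prio = ""
        for (i = 1; i < NF; i++) {
            if ($i == "IP")   src  = $(i + 1)
            if ($i == "vrid") vrid = $(i + 1)
            if ($i == "prio") prio = $(i + 1)
        }
        sub(/,$/, "", vrid)
        sub(/,$/, "", prio)
        if (src != "" && vrid != "") print "src=" src, "vrid=" vrid, "prio=" prio
    }' | sort -u
}

# Usage (capture 20 VRRP packets, then summarize):
#   tcpdump -nli <interface> -c 20 vrrp | vrrp_summary
```

Compare the reported VRRP ID with the value in /root/ctgatekeeper.json on both nodes. An empty summary means no advertisements reached this interface, and an unexpected source address using the same ID suggests the ID is not unique on the segment.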

Best Practices

Document VRRP ID allocation
Regularly verify network connectivity
Monitor VRRP traffic
Keep VRRP passwords secure

14. Agent Alerts

Agent Unreachable Alert

Symptoms

Agent unreachable alert in console
Both HA nodes down
Standalone GK down

Verification Steps

Check Physical Status
- Verify power supply
- Check network cables
- Verify switch connectivity
Check Network Status
```
ping <gk_ip>
```
- Verify network connectivity
- Check for network outages
- Verify firewall rules

Agent Failover Alert

Symptoms

Agent failover alert in console
Interface down
Connectivity loss

Verification Steps

Check System Logs
```
cat /var/log/syslog
```
- Look for interface down events
- Check for connectivity issues
- Verify error messages
Collect Diagnostics
- Collect diagnostics from the Xshield console (select the Gatekeeper, then choose "Collect Diagnostics")
- Collect syslog
- Gather interface status
- Check network configuration
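
Reading the whole syslog is slow, so it can help to filter for link state changes first. The patterns in this sketch are assumptions based on common kernel and network service messages; the exact wording depends on the NIC driver, and `link_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Print syslog lines that look like interface link state changes.
# The patterns are a best guess; adjust them to the driver's wording.
link_events() {
    grep -iE 'link (is )?(down|up)|(lost|gained) carrier' "${1:-/var/log/syslog}"
}

# Usage (on the gatekeeper):
#   link_events | tail -n 20
#   ip -br link show        # current state of each interface
```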

Common Alert Scenarios

Network Connectivity Issues
- Interface failures
- Network outages
- Firewall blocks
Hardware Issues
- Power failures
- Hardware failures
- Network card issues
Configuration Issues
- Incorrect IP settings
- Invalid VRRP configurations
- Network misconfigurations

Alert Resolution Steps

Network Connectivity
- Check physical connections
- Verify network configuration
- Check firewall rules
Hardware Issues
- Verify power supply
- Check hardware status
- Replace faulty components
Configuration Issues
- Verify IP configurations
- Check VRRP settings
- Validate network settings

15. Additional Monitoring and Prevention

Best Practices

Monitoring
- Set up network monitoring
- Configure alert thresholds
- Regularly check system logs
Documentation
- Document VRRP configurations
- Keep network diagrams updated
- Maintain configuration backups
Regular Maintenance
- Check system health
- Verify network connectivity
- Test failover procedures

16. Support Information

When contacting support for HA cluster issues:

Provide system logs
Include VRRP configuration
Document network topology
Share tcpdump captures
Include error messages and symptoms
Provide ctgatekeeper.json contents

17. Diagnostics Collection

Using Xshield Console

Select Gatekeeper in UI
Choose "Collect Diagnostics"
- This process may take several minutes
Once complete:
- View diagnostics in UI
- Download diagnostics zip file
- Send zip file to support

Best Practices

Always use UI for diagnostics collection
Wait for complete diagnostics collection
Download and verify diagnostics zip before sending

18. Gatekeeper Operations

Service Management

Reboot Options
- Use UI for planned reboots
- Console access for emergency reboots

Service Restart

```
# Restart Gatekeeper
systemctl restart ctappliance

# Restart HA Services
systemctl restart ctgatekeeper-ha

# Restart DHCP
systemctl restart ctcoredhcp
```

Device Management

Device Scan
- Initiate from Xshield Console
- Re-identifies managed devices
- Updates device information
- Reassesses vulnerabilities
- Collects firmware version info

Best Practices

Use UI for most operations
Reserve console access for:
- WAN settings changes
- Proxy configuration
- Initial registration
- Emergency situations

19. Gatekeeper Uninstallation

Using Xshield Console

Select Gatekeeper
Choose "Decommission" option
Confirm decommissioning
Wait for completion
Unplug from network

Console-Based Uninstallation

```
# Only as a last resort
ssh <gk_ip>
ctgatekeeper decommission
```

Best Practices

Always use UI for decommissioning
Use console only for:
- WAN settings changes
- Proxy configuration
- Initial registration
- Emergency situations

20. Support Information

Required Information

System Logs
VRRP Configuration
Network Topology
Tcpdump Captures
Error Messages
ctgatekeeper.json Contents
Diagnostics Zip File

Best Practices

Include all requested information
Document troubleshooting steps taken
Provide network diagrams
Share configuration backups

21. Support Workflow

Initial Contact

Contact ColorTokens Support
- Phone: [Support Number]
- Email: support@colortokens.com
- Portal: [Support Portal URL]
Provide Initial Information
- Customer details
- Issue description
- Impact level
- System logs
- Diagnostics zip file

Support Response Levels

Level 1 Support

Initial issue triage
Basic troubleshooting
Configuration verification
Common issue resolution

Level 2 Support

Advanced troubleshooting
System analysis
Configuration review
Escalation for complex issues

Product Engineering

Root cause analysis
Bug identification
Feature requests
Emergency fixes

Escalation Criteria

Critical Issues
- System downtime
- Security breaches
- Data loss
- Major functionality failure
High Priority Issues
- Significant performance impact
- Multiple users affected
- Business-critical operations
Standard Issues
- Configuration questions
- Minor functionality issues
- Documentation requests

Support Process Flow

Issue Reporting
- Open support ticket
- Provide required information
- Attach diagnostics
Initial Analysis
- Basic troubleshooting
- Configuration review
- Log analysis
Escalation Points
- If issue unresolved
- If impact critical
- If additional expertise needed
Resolution Steps
- Provide fix/workaround
- Document solution
- Close ticket

Best Practices for Support

Before Contacting Support
- Follow troubleshooting guide
- Collect diagnostics
- Document steps taken
- Gather system information
During Support Interaction
- Provide clear issue description
- Share all relevant information
- Follow support team instructions
- Document all interactions
After Resolution
- Verify fix
- Document solution
- Update procedures
- Close ticket

Support Response Times

Critical Issues
- Response: Within 1 hour
- Resolution: Within 4 hours
High Priority Issues
- Response: Within 4 hours
- Resolution: Within 24 hours
Standard Issues
- Response: Within 24 hours
- Resolution: Within 5 business days

Additional Support Resources

Knowledge Base
- Documentation
- Troubleshooting guides
- Best practices
Community Forums
- User discussions
- Best practices sharing
- Common solutions
Training Resources
- Webinars
- Training materials
- Best practices guides

Last Updated: [Date] Version: 1.3
