DBCollect/Usage

From Dirty Cache Wiki
Jump to navigation Jump to search



Usage

Where to run

Where to run dbcollect

  • All Database nodes (in scope of the project) including all RAC nodes and all Data Guard Standby nodes
  • Optional: On all application nodes and hypervisors
On non-database hosts, dbcollect will only pick up OS config and performance data

Short summary

If dbcollect is installed correctly:

# Simply run dbcollect
dbcollect

# Error: Zipfile already exists
# Overwrite previous zipfile
dbcollect --overwrite

# Error: No AWR or Statspack detected
# If you are sure the system is licensed with Diagnostics Pack
dbcollect --overwrite --force-awr

# If you want to retrieve more than 10 days worth of AWR files:
dbcollect --days 20

# Limit dbcollect to use only 2 tasks (CPUs) for AWR generation (default = 50% of CPUs, with a maximum of 8)
dbcollect --tasks 2

Strip SQL code

AWR reports by default contain snippets of Oracle SQL queries. These are not used for reports but can be useful in performance diagnosis. Some customers consider the SQL code confidential information and don't allow this to leave the company. dbcollect has the option to remove all SQL code from the AWR reports:

dbcollect --strip

Notes:

  • This ONLY works for AWR, not for Statspack
  • It removes the SQL Statistics (such as SQL ordered by CPU Time) from AWR, as well as the entire ADDM section
  • It is based on XML parsing in Python, if the XML parsing fails, it will write the original AWR (HTML) report (but show a warning in the logfile).
  • Mileage may vary. So far it works flawlessly but no guarantee can be made. Inspect the ZIP files if you want to make sure.

Resolve AWR reporting issues

dbcollect checks for prior AWR usage to make sure it will not trigger Oracle license violations ( More info)

If you get Error [DBC-E021] No AWR or Statspack detected:

# If you are sure you have Diagnostics Pack license:
dbcollect --force-awr

# If you want to ignore databases without AWR usage:
# Note that this will result in under-reporting of performance and capacity due to missing data
dbcollect --ignore-awr

# If the problem databases are not important, try excluding them:
dbcollect --exclude inst1,inst2

Oracle RAC

On RAC, dbcollect will generate AWR reports for all instances on all nodes. This means if you run dbcollect on all RAC nodes (which is recommended) then all AWR reports will be generated multiple times (2 times for a 2-node cluster). To avoid this, run with the --no-rac option. This will cause AWR reports ONLY to be generated for the actual node, and saves on diskspace and resource overhead.

Oracle Data Guard

An Oracle instance that runs as Data Guard Standby is usually in MOUNT mode and does not allow AWR reports to be generated. Standby databases can be running in Real-Time Query or Real-Time Apply mode, but running standard AWR reports causes the AWR reports to be generated for the PRIMARY instance (there is no specific Standby performance data).

Data Guard can be configured with RMF (Remote Management Framework) which causes performance statistics to be collected for the STANDBY database.

dbcollect will pick up AWR reports for the STANDBY database if this is enabled. To disable this feature, use --no-stby.

Monitoring progress

Once dbcollect is running, it will pick up OS info, and after a few seconds it will start generating AWR reports.

[root@db08 ~](-) # dbcollect -o
dbcollect 1.16.3 - collect Oracle AWR/Statspack, database and system info
INFO     : dbcollect 1.16.3 - database and system info collector
INFO     : Python version 3.6.8
INFO     : Current user is oracle
INFO     : Zip file is /tmp/dbcollect-db08.lan.zip
INFO     : Collecting OS info (Linux)
INFO     : Collecting Linux SAR files
INFO     : Collecting Oracle info
INFO     : Detecting Oracle instances
INFO     : Stopped instances: demo81
INFO     : Running instances: demo08
INFO     : demo08: AWR usage detected, generating reports 
INFO     : demo08: 25 reports
INFO     : demo08: Running opatch lspatches
INFO     : demo08: Running dbinfo scripts
INFO     : demo08: DBInfo processor finished, elapsed time 8.38 seconds
INFO     : demo08: Started 2 SQLPlus sessions
Report 16 of 25 (64.0% done), elapsed: 0:00:30, remaining: 0:00:17, reports/s: 0.53

The remaining time is an estimate on how long is required to generate all reports.

Once complete, dbcollect should show a message that the ZIP file creation was succesful:

INFO     : Report 25 of 25 (100.0% done), elapsed: 0:00:40, remaining: 0:00:00, reports/s: 0.63
INFO     : Zip file /tmp/dbcollect-db08.lan.zip is created succesfully.
INFO     : Finished
[root@db08 ~](-) #

Check the output for any critical errors.

Checking results

The output of dbcollect is written by default to /tmp/dbcollect-<hostname>.zip.

The zipfile should be readable with normal ZIP tools (unzip):

# Check if the ZIP file exists
ls -al /tmp/dbcollect-*.zip 
-rw-r--r-- 1 oracle oinstall 2074253 Oct 10 14:37 /tmp/dbcollect-db08.lan.zip

# Verify contents
unzip -v /tmp/dbcollect-db08.lan.zip 
Archive:  /tmp/dbcollect-db08.lan.zip
dbcollect version=1.16.3 hostname=db08.lan
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     761  Defl:N      385  49% 10-10-2024 14:36 1296357e  db08.lan/meta.json
     855  Defl:N      402  53% 10-10-2024 14:36 f4d6f4f2  db08.lan/hostinfo.json
   11824  Defl:N     1536  87% 10-10-2024 14:36 83805993  db08.lan/diskinfo.json
-- snip --
   47823  Defl:N     6520  86% 10-10-2024 14:36 0b993954  db08.lan/oracle/log/demo08_sqlplus_7658.log
    2520  Defl:N      329  87% 10-10-2024 14:37 d4e5c89c  db08.lan/oracle/log/demo08_sqlplus_8094.log
    2730  Defl:N      342  88% 10-10-2024 14:37 5f3fd91a  db08.lan/oracle/log/demo08_sqlplus_8095.log
    4998  Defl:N      960  81% 10-10-2024 14:37 c2c4f51c  db08.lan/dbcollect.log
--------          -------  ---                            -------
18105774          2054255  89%                            125 files

# Note that the files in the ZIP always start with the hostname (to prevent a mess if you unzip it in your home dir).
# The first file in the archive should be the meta.json, the last file should be the dbcollect.log
# If dbcollect.log is missing, then the ZIP file was not written completely.
# The dbcollect.log is probably still in /tmp. Reading it may reveal what went wrong.

Do not change the generated ZIP archive (it has metadata and structures in place so it can be correctly processed by our reporting tools). Send the ZIP file as-is.

Running unattended

If dbcollect runs a very long time and you don't want to keep a CLI session open, there are a few ways to do that:

  1. Run in a screen or tmux session [1].
  2. Run in a nohup session: nohup dbcollect <options> &

Another way (if there are multiple systems) is to run using Ansible.

Troubleshooting

dbcollect fails to execute

  • Check if /usr/bin/env python starts up a Python session. It should show the version. Exit with CTRL-D.
  • If it does not work, check if Python is installed at all: python
  • If Python is available but running dbcollect does not work, run python /usr/local/bin/dbcollect <options> or try python3 or something similar

dbcollect runs but fails with errors

A list of all error messages and explanation can be found here DBCollect Error messages

Performance

This is frequently the case if dbcollect runs against Oracle RAC. There are a number of known issues that cause it to be very slow. I have observed up to 20 seconds per AWR report (usually about 0.5 seconds) making dbcollect run for over an hour or more for a typical 10-day, 1-hour interval, single database cluster. Check Support notes for fixes and workarounds: 2404906.1, 2565465.1, 2318124.1, 29932310.8, 2148489.1, 29470291.8. Or be very patient.

Update: As of version 1.11, dbcollect runs multiple AWR reports in parallel for each instance, making it much faster, by default 50% of CPUs. To use all available CPUs, use --tasks 0 (note that CPU load will likely go to 100%, be careful)

  • dbcollect seems to hang
    1. Check if the database itself works correctly. Can you switch logfiles, manually run a report?
  • Timeouts
    1. Sometimes a SELECT on DBA_FREE_SPACE or CDB_FREE_SPACE in Oracle is very slow, causing the dbinfo scripts to run very long which in turn causes the job processors to timeout. Possible fix:
SQL> purge dba_recyclebin;
SQL> EXEC DBMS_STATS.GATHER_DICTIONARY_STATS;
SQL> EXEC DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;