DBCollect/Usage: Difference between revisions

From Dirty Cache Wiki
Jump to navigation Jump to search
No edit summary
 
(20 intermediate revisions by the same user not shown)
Line 1: Line 1:
[[Category:DBCollect]]
[[Category:DBCollect]]
{{PrevNext|DBCollect/Install|DBCollect/Options}}
= Usage =
== Where to run ==
Where to run {{dbcollect}}


= Usage =
* All Database nodes (in scope of the project) including all RAC nodes and all Data Guard Standby nodes
* Optional: On all application nodes and hypervisors
: On non-database hosts, {{dbcollect}} will only pick up OS config and performance data


== Short summary ==
== Short summary ==
Line 24: Line 32:
# Limit dbcollect to use only 2 tasks (CPUs) for AWR generation (default = 50% of CPUs, with a maximum of 8)
# Limit dbcollect to use only 2 tasks (CPUs) for AWR generation (default = 50% of CPUs, with a maximum of 8)
dbcollect --tasks 2
dbcollect --tasks 2
</syntaxhighlight>
== Verify correct install ==
<syntaxhighlight lang=bash>
# Show version
dbcollect -V
# Show usage
dbcollect -h
# Update to last version (if the system has https access to github)
dbcollect --update
</syntaxhighlight>
</syntaxhighlight>


== Run with default settings ==
== Strip SQL code ==


<syntaxhighlight lang=bash>
AWR reports by default contain snippets of Oracle SQL queries. These are not used for reports but can be useful in performance diagnosis. Some customers consider the SQL code confidential information and don't allow this to leave the company. {{dbcollect}} has the option to remove all SQL code from the AWR reports:
# Default run
dbcollect


# If you get Error DBC-020 (zipfile already exists), you can overwrite the old zipfile:
<code>dbcollect --strip</code>
dbcollect -o
# or
dbcollect --overwrite
</syntaxhighlight>


Notes:
* This ONLY works for AWR, not for Statspack
* It removes the SQL Statistics (such as <code>SQL ordered by CPU Time</code>) from AWR, as well as the entire ADDM section
* It is based on XML parsing in Python, if the XML parsing fails, it will write the original AWR (HTML) report (but show a warning in the logfile).
* Mileage may vary. So far it works flawlessly but no guarantee can be made. Inspect the ZIP files if you want to make sure.


== Resolve AWR reporting issues ==
== Resolve AWR reporting issues ==
Line 64: Line 57:


# If you want to ignore databases without AWR usage:
# If you want to ignore databases without AWR usage:
# Note that this will result in under-reporting of performance and capacity due to missing data
dbcollect --ignore-awr
dbcollect --ignore-awr


# If the problem databases are not important, try excluding them:
# If the problem databases are not important, try excluding them:
dbcollect --exclude inst1,inst2
dbcollect --exclude inst1,inst2
</syntaxhighlight>
</syntaxhighlight>


= Explanation of all options =
== Oracle RAC ==


== help ==
On RAC, {{dbcollect}} will generate AWR reports for ''all'' instances on all nodes. This means if you run {{dbcollect}} on all RAC nodes (which is recommended) then all AWR reports will be generated multiple times (2 times for a 2-node cluster). To avoid this, run with the <code>--no-rac</code> option. This will cause AWR reports ONLY to be generated for the actual node, and saves on diskspace and resource overhead.


<code>--help</code> or <code>-h</code>
== Oracle Data Guard ==


Shows usage (list of all arguments)
An Oracle instance that runs as Data Guard Standby is usually in MOUNT mode and does not allow AWR reports to be generated. Standby databases can be running in Real-Time Query or Real-Time Apply mode, but running standard AWR reports causes the AWR reports to be generated for the PRIMARY instance (there is no specific Standby performance data).


== version ==
Data Guard can be configured with RMF (Remote Management Framework) which causes performance statistics to be collected for the ''STANDBY'' database.


<code>-V | --version</code>
{{dbcollect}} will pick up AWR reports for the STANDBY database if this is enabled. To disable this feature, use <code>--no-stby</code>.


Shows detailed version, build and copyright details
= Monitoring progress =


== debug ==
Once {{dbcollect}} is running, it will pick up OS info, and after a few seconds it will start generating AWR reports.


<code>-D | --debug</code>
<syntaxhighlight lang=bash>
[root@db08 ~](-) # dbcollect -o
dbcollect 1.16.3 - collect Oracle AWR/Statspack, database and system info
INFO    : dbcollect 1.16.3 - database and system info collector
INFO    : Python version 3.6.8
INFO    : Current user is oracle
INFO    : Zip file is /tmp/dbcollect-db08.lan.zip
INFO    : Collecting OS info (Linux)
INFO    : Collecting Linux SAR files
INFO    : Collecting Oracle info
INFO    : Detecting Oracle instances
INFO    : Stopped instances: demo81
INFO    : Running instances: demo08
INFO    : demo08: AWR usage detected, generating reports
INFO    : demo08: 25 reports
INFO    : demo08: Running opatch lspatches
INFO    : demo08: Running dbinfo scripts
INFO    : demo08: DBInfo processor finished, elapsed time 8.38 seconds
INFO    : demo08: Started 2 SQLPlus sessions
Report 16 of 25 (64.0% done), elapsed: 0:00:30, remaining: 0:00:17, reports/s: 0.53
</syntaxhighlight>


enables debugging. This will result in more logging (sometimes dumping unhandled Python exceptions and traces.
The remaining time is an estimate on how long is required to generate all reports.


Even without this option, debug info is written to the {{dbcollect}} logfile for troubleshooting purposes.
Once complete, {{dbcollect}} should show a message that the ZIP file creation was succesful:


== quiet ==
<syntaxhighlight lang=bash>
INFO    : Report 25 of 25 (100.0% done), elapsed: 0:00:40, remaining: 0:00:00, reports/s: 0.63
INFO    : Zip file /tmp/dbcollect-db08.lan.zip is created succesfully.
INFO    : Finished
[root@db08 ~](-) #
</syntaxhighlight>


<code>-q | --quiet</code>
Check the output for any critical errors.


Hides most normal terminal output (except errors). Useful for scripted runs (such as with [[DBCollect/Ansible]])
= Checking results =


== overwrite ==
The output of {{dbcollect}} is written by default to <code>/tmp/dbcollect-<hostname>.zip</code>.


<code>-o | --overwrite</code>
The zipfile should be readable with normal ZIP tools (<code>unzip</code>):


Overwrites the previous dbcollect ZIP file if it exsts.
<syntaxhighlight lang=bash>
# Check if the ZIP file exists
ls -al /tmp/dbcollect-*.zip
-rw-r--r-- 1 oracle oinstall 2074253 Oct 10 14:37 /tmp/dbcollect-db08.lan.zip


== update ==
# Verify contents
unzip -v /tmp/dbcollect-db08.lan.zip
Archive:  /tmp/dbcollect-db08.lan.zip
dbcollect version=1.16.3 hostname=db08.lan
Length  Method    Size  Cmpr    Date    Time  CRC-32  Name
--------  ------  ------- ---- ---------- ----- --------  ----
    761  Defl:N      385  49% 10-10-2024 14:36 1296357e  db08.lan/meta.json
    855  Defl:N      402  53% 10-10-2024 14:36 f4d6f4f2  db08.lan/hostinfo.json
  11824  Defl:N    1536  87% 10-10-2024 14:36 83805993  db08.lan/diskinfo.json
-- snip --
  47823  Defl:N    6520  86% 10-10-2024 14:36 0b993954  db08.lan/oracle/log/demo08_sqlplus_7658.log
    2520  Defl:N      329  87% 10-10-2024 14:37 d4e5c89c  db08.lan/oracle/log/demo08_sqlplus_8094.log
    2730  Defl:N      342  88% 10-10-2024 14:37 5f3fd91a  db08.lan/oracle/log/demo08_sqlplus_8095.log
    4998  Defl:N      960  81% 10-10-2024 14:37 c2c4f51c  db08.lan/dbcollect.log
--------          -------  ---                            -------
18105774          2054255  89%                            125 files


<code>--update</code>
# Note that the files in the ZIP always start with the hostname (to prevent a mess if you unzip it in your home dir).
# The first file in the archive should be the meta.json, the last file should be the dbcollect.log
# If dbcollect.log is missing, then the ZIP file was not written completely.
# The dbcollect.log is probably still in /tmp. Reading it may reveal what went wrong.
</syntaxhighlight>


Downloads new version of {{dbcollect}} from github and replaces the old version.
Do not change the generated ZIP archive (it has metadata and structures in place so it can be correctly processed by our reporting tools). Send the ZIP file as-is.


Requires https access to github.com
= Running unattended =


== sudoers ==
If {{dbcollect}} runs a very long time and you don't want to keep a CLI session open, there are a few ways to do that:


<code>--sudoers</code>
# Run in a <code>screen</code> or <code>tmux</code> session [https://linuxize.com/post/how-to-use-linux-screen/].
# Run in a nohup session: <code>nohup dbcollect <options> &</code>


Writes sudoers settings to <code>/etc/sudoers.d/dbcollect</code>. This allows {{dbcollect}} to run some commands as root, that otherwise would fail.
Another way (if there are multiple systems) is to run using [[DBCollect/Ansible | Ansible]].


This retrieves some extra, optional information. Not required, but on HP-UX highly recommended, as otherwise disk devices cannot be properly detected.
= Troubleshooting =


== filename ==
== dbcollect fails to execute ==


<code>--filename FILENAME</code>
* Check if <code>/usr/bin/env python</code> starts up a Python session. It should show the version. Exit with CTRL-D.
* If it does not work, check if Python is installed at all: <code>python</code>
* If Python is available but running {{dbcollect}} does not work, run <code>python /usr/local/bin/dbcollect <options></code> or try <code>python3</code> or something similar


Write to the given FILENAME instead of /tmp/dbcollect-<hostname>.zip.
== dbcollect runs but fails with errors ==


If FILENAME is only an identifier, the new name will be /tmp/<filename>.zip.
A list of all error messages and explanation can be found here [[DBC:Errors | DBCollect Error messages]]


If FILENAME is a full path, dbcollect will use it as new location (.zip extension will be added if needed)
== Performance ==


== tempdir ==
* {{dbcollect}} takes very long to run


<code>--tempdir TEMPDIR</code>
This is frequently the case if dbcollect runs against Oracle RAC. There are a number of known issues that cause it to be very slow. I have observed up to 20 seconds per AWR report (usually about 0.5 seconds) making dbcollect run for over an hour or more for a typical 10-day, 1-hour interval, single database cluster. Check Support notes for fixes and workarounds: 2404906.1, 2565465.1, 2318124.1, 29932310.8, 2148489.1, 29470291.8. Or be very patient.


Use TEMPDIR as alternative temporary folder (for example <code>/var/tmp</code>.
Update: As of version 1.11, dbcollect runs multiple AWR reports in parallel for each instance, making it much faster, by default 50% of CPUs. To use all available CPUs, use --tasks 0 (note that CPU load will likely go to 100%, be careful)


Useful if <code>/tmp</code> freespace is insufficient. {{dbcollect}} will always create a subdirectory such as <code>/tmp/dbcollect_uwfb4li0</code> so it will never overwrite existing files or directories.
* {{dbcollect}} seems to hang
*# Check if the database itself works correctly. Can you switch logfiles, manually run a report?


== user ==
* Timeouts
 
*# Sometimes a SELECT on DBA_FREE_SPACE or CDB_FREE_SPACE in Oracle is very slow, causing the dbinfo scripts to run very long which in turn causes the job processors to timeout. Possible fix:
<code>-u USER | --user USER</code>
<syntaxhighlight lang=sql>
 
SQL> purge dba_recyclebin;
Normally, {{dbcollect}} will use the owner of the first detected Oracle instance (i.e., running <code>ora_pmon_<instance></code>. Use this option to force using another user.
SQL> EXEC DBMS_STATS.GATHER_DICTIONARY_STATS;
 
SQL> EXEC DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;
To work properly, it has to be a user with SYSDBA privileges.
</syntaxhighlight>
 
== days ==
 
<code>-d DAYS | --days DAYS</code>
 
The number of days for which to generate AWR or Statspack reports. The default is 10 days, so with an AWR interval of 1 hour and long enough AWR retention, about 240 AWR reports will be generated per instance.
 
Increasing this value will only work if the retention in Oracle is also high enough (and AWR data is actually available).
 
Note that the Linux SAR/Sysstat reports will always be the entire period of 30 or 31 days (where available).
 
== end_days ==
 
<code>--end_days END_DAYS</code>
 
Number of days ago to stop AWR report generation. For example, you want to pick up reports from 20 days ago to 10 days ago:
 
<code>dbcollect --days 20 --end_days 10</code>
 
This can be used if you want to pick up a very specific period of time only.
 
== force-awr ==
 
<code>--force-awr</code>
 
This causes {{dbcollect}} to ignore prior AWR report detection and always create AWR reports. You need to have Diagnostics Pack if you use this to avoid licensing violations.
 
== statspack ==
 
<code>--statspack</code>
 
Force usage of Statspack even if AWR usage is detected.
 
== ignore-awr ==
 
<code>--ignore-awr</code>
 
Ignores AWR reporting for databases that have no prior AWR usage.
 
== strip ==
 
<code>--strip</code>
 
AWR reports usually contain (parts of) SQL code. This is sometimes considered a security issue. This option removes all SQL code sections from the AWR reports. It only works for AWR, not Statspack.
 
Note that it works on best-effort basis, if the AWR report cannot be parsed, the report will be saved in its original form.
 
== no-rac ==
 
<code>--no-rac</code>
 
To make {{dbcollect}} as robust as possible, by default if running RAC, it generates AWR reports for all database instances which significantly increases runtime and ZIP file size.
 
If {{dbcollect}} is executed on all RAC nodes (recommended), this is not necessary and the double generation of AWR can be disabled using this option.
 
== no-stby ==
 
<code>--no-stby</code>
 
By default, AWR reports cannot be created for Data Guard Standby databases. This is typically not a big problem as standby databases do not usually generate much workload. However, if real-time queries are running against the standby database (requiring Active Data Guard license), we are also interested in the workload on those instances.
 
Since Oracle 12, Remote Management Framework (RMF) allows AWR reports to be generated for standby databases (but it has to be configured by the DBA first). {{dbcollect}} will pick up AWR reports from RMF managed standby databases (through the primary instance).
 
Use this option to disable this feature.
 
== no-awr ==
 
<code>--no-awr</code>
 
Skip AWR/Statspack reports completely. Note that the {{dbinfo}} reports are still created.
 
== no-sar ==
 
<code>--no-sar</code>
 
Skips collection of SAR reports (Linux/UNIX).
 
== no-ora ==
 
<code>--no-ora</code>
 
Skips entire collection of Oracle related information. Only OS info will be collected.
 
== no-sys ==
 
<code>--no-sys</code>
 
Skips entire collection of OS related information. Only Oracle info will be collected.
 
== no-orainv ==
 
<code>--no-orainv</code>
 
Skip using Oracle Inventory for ORACLE_HOME and instances detection. Only instances in <code>oratab</code> will be detected.
 
This can be useful for troubleshooting Oracle Instances detection.
 
== no-oratab ==
 
<code>--no-oratab</code>
 
Skip using Oracle <code>oratab</code> for ORACLE_HOME and instances detection. Only instances in Oracle Inventory will be detected.
 
This can be useful for troubleshooting Oracle Instances detection.
 
== splunk ==
 
<code>--splunk</code>
 
Generate additional reports for usage with [https://dell.com Dell's] internal SPLUNK toolset (for Dell Presales Engineers) or [https://www.liveoptics.com/ Dell LiveOptics]. These reports contain extra info such as database size.
 
== include ==
 
<code>--include INSTANCE_LIST</code>
 
Include only the given instances (comma separated). Without this option, all detected instances are included.
 
== exclude ==
 
<code>--exclude INSTANCE_LIST</code>
 
Exclude only the given instances (comma separated).
 
Note that excluded instances are prioritized over included.
 
== tasks ==
 
<code>-t TASKS | --tasks TASKS</code>
 
Set the maximum number of workers (for AWR generation) to TASKS (default = 50% of available CPUs).
 
== timeout ==
 
<code>--timeout TIMEOUT</code>
 
Change the timeout for long running SQL scripts. The default is 10 minutes.
 
Useful for troubleshooting timeout errors.

Latest revision as of 15:38, 10 October 2024



Usage

Where to run

Where to run dbcollect

  • All Database nodes (in scope of the project) including all RAC nodes and all Data Guard Standby nodes
  • Optional: On all application nodes and hypervisors
On non-database hosts, dbcollect will only pick up OS config and performance data

Short summary

If dbcollect is installed correctly:

# Simply run dbcollect
dbcollect

# Error: Zipfile already exists
# Overwrite previous zipfile
dbcollect --overwrite

# Error: No AWR or Statspack detected
# If you are sure the system is licensed with Diagnostics Pack
dbcollect --overwrite --force-awr

# If you want to retrieve more than 10 days worth of AWR files:
dbcollect --days 20

# Limit dbcollect to use only 2 tasks (CPUs) for AWR generation (default = 50% of CPUs, with a maximum of 8)
dbcollect --tasks 2

Strip SQL code

AWR reports by default contain snippets of Oracle SQL queries. These are not used for reports but can be useful in performance diagnosis. Some customers consider the SQL code confidential information and don't allow this to leave the company. dbcollect has the option to remove all SQL code from the AWR reports:

dbcollect --strip

Notes:

  • This ONLY works for AWR, not for Statspack
  • It removes the SQL Statistics (such as SQL ordered by CPU Time) from AWR, as well as the entire ADDM section
  • It is based on XML parsing in Python, if the XML parsing fails, it will write the original AWR (HTML) report (but show a warning in the logfile).
  • Mileage may vary. So far it works flawlessly but no guarantee can be made. Inspect the ZIP files if you want to make sure.

Resolve AWR reporting issues

dbcollect checks for prior AWR usage to make sure it will not trigger Oracle license violations ( More info)

If you get Error [DBC-E021] No AWR or Statspack detected:

# If you are sure you have Diagnostics Pack license:
dbcollect --force-awr

# If you want to ignore databases without AWR usage:
# Note that this will result in under-reporting of performance and capacity due to missing data
dbcollect --ignore-awr

# If the problem databases are not important, try excluding them:
dbcollect --exclude inst1,inst2

Oracle RAC

On RAC, dbcollect will generate AWR reports for all instances on all nodes. This means if you run dbcollect on all RAC nodes (which is recommended) then all AWR reports will be generated multiple times (2 times for a 2-node cluster). To avoid this, run with the --no-rac option. This will cause AWR reports ONLY to be generated for the actual node, and saves on diskspace and resource overhead.

Oracle Data Guard

An Oracle instance that runs as Data Guard Standby is usually in MOUNT mode and does not allow AWR reports to be generated. Standby databases can be running in Real-Time Query or Real-Time Apply mode, but running standard AWR reports causes the AWR reports to be generated for the PRIMARY instance (there is no specific Standby performance data).

Data Guard can be configured with RMF (Remote Management Framework) which causes performance statistics to be collected for the STANDBY database.

dbcollect will pick up AWR reports for the STANDBY database if this is enabled. To disable this feature, use --no-stby.

Monitoring progress

Once dbcollect is running, it will pick up OS info, and after a few seconds it will start generating AWR reports.

[root@db08 ~](-) # dbcollect -o
dbcollect 1.16.3 - collect Oracle AWR/Statspack, database and system info
INFO     : dbcollect 1.16.3 - database and system info collector
INFO     : Python version 3.6.8
INFO     : Current user is oracle
INFO     : Zip file is /tmp/dbcollect-db08.lan.zip
INFO     : Collecting OS info (Linux)
INFO     : Collecting Linux SAR files
INFO     : Collecting Oracle info
INFO     : Detecting Oracle instances
INFO     : Stopped instances: demo81
INFO     : Running instances: demo08
INFO     : demo08: AWR usage detected, generating reports 
INFO     : demo08: 25 reports
INFO     : demo08: Running opatch lspatches
INFO     : demo08: Running dbinfo scripts
INFO     : demo08: DBInfo processor finished, elapsed time 8.38 seconds
INFO     : demo08: Started 2 SQLPlus sessions
Report 16 of 25 (64.0% done), elapsed: 0:00:30, remaining: 0:00:17, reports/s: 0.53

The remaining time is an estimate on how long is required to generate all reports.

Once complete, dbcollect should show a message that the ZIP file creation was succesful:

INFO     : Report 25 of 25 (100.0% done), elapsed: 0:00:40, remaining: 0:00:00, reports/s: 0.63
INFO     : Zip file /tmp/dbcollect-db08.lan.zip is created succesfully.
INFO     : Finished
[root@db08 ~](-) #

Check the output for any critical errors.

Checking results

The output of dbcollect is written by default to /tmp/dbcollect-<hostname>.zip.

The zipfile should be readable with normal ZIP tools (unzip):

# Check if the ZIP file exists
ls -al /tmp/dbcollect-*.zip 
-rw-r--r-- 1 oracle oinstall 2074253 Oct 10 14:37 /tmp/dbcollect-db08.lan.zip

# Verify contents
unzip -v /tmp/dbcollect-db08.lan.zip 
Archive:  /tmp/dbcollect-db08.lan.zip
dbcollect version=1.16.3 hostname=db08.lan
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     761  Defl:N      385  49% 10-10-2024 14:36 1296357e  db08.lan/meta.json
     855  Defl:N      402  53% 10-10-2024 14:36 f4d6f4f2  db08.lan/hostinfo.json
   11824  Defl:N     1536  87% 10-10-2024 14:36 83805993  db08.lan/diskinfo.json
-- snip --
   47823  Defl:N     6520  86% 10-10-2024 14:36 0b993954  db08.lan/oracle/log/demo08_sqlplus_7658.log
    2520  Defl:N      329  87% 10-10-2024 14:37 d4e5c89c  db08.lan/oracle/log/demo08_sqlplus_8094.log
    2730  Defl:N      342  88% 10-10-2024 14:37 5f3fd91a  db08.lan/oracle/log/demo08_sqlplus_8095.log
    4998  Defl:N      960  81% 10-10-2024 14:37 c2c4f51c  db08.lan/dbcollect.log
--------          -------  ---                            -------
18105774          2054255  89%                            125 files

# Note that the files in the ZIP always start with the hostname (to prevent a mess if you unzip it in your home dir).
# The first file in the archive should be the meta.json, the last file should be the dbcollect.log
# If dbcollect.log is missing, then the ZIP file was not written completely.
# The dbcollect.log is probably still in /tmp. Reading it may reveal what went wrong.

Do not change the generated ZIP archive (it has metadata and structures in place so it can be correctly processed by our reporting tools). Send the ZIP file as-is.

Running unattended

If dbcollect runs a very long time and you don't want to keep a CLI session open, there are a few ways to do that:

  1. Run in a screen or tmux session [1].
  2. Run in a nohup session: nohup dbcollect <options> &

Another way (if there are multiple systems) is to run using Ansible.

Troubleshooting

dbcollect fails to execute

  • Check if /usr/bin/env python starts up a Python session. It should show the version. Exit with CTRL-D.
  • If it does not work, check if Python is installed at all: python
  • If Python is available but running dbcollect does not work, run python /usr/local/bin/dbcollect <options> or try python3 or something similar

dbcollect runs but fails with errors

A list of all error messages and explanation can be found here DBCollect Error messages

Performance

This is frequently the case if dbcollect runs against Oracle RAC. There are a number of known issues that cause it to be very slow. I have observed up to 20 seconds per AWR report (usually about 0.5 seconds) making dbcollect run for over an hour or more for a typical 10-day, 1-hour interval, single database cluster. Check Support notes for fixes and workarounds: 2404906.1, 2565465.1, 2318124.1, 29932310.8, 2148489.1, 29470291.8. Or be very patient.

Update: As of version 1.11, dbcollect runs multiple AWR reports in parallel for each instance, making it much faster, by default 50% of CPUs. To use all available CPUs, use --tasks 0 (note that CPU load will likely go to 100%, be careful)

  • dbcollect seems to hang
    1. Check if the database itself works correctly. Can you switch logfiles, manually run a report?
  • Timeouts
    1. Sometimes a SELECT on DBA_FREE_SPACE or CDB_FREE_SPACE in Oracle is very slow, causing the dbinfo scripts to run very long which in turn causes the job processors to timeout. Possible fix:
SQL> purge dba_recyclebin;
SQL> EXEC DBMS_STATS.GATHER_DICTIONARY_STATS;
SQL> EXEC DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;