DBCollect/Usage: Difference between revisions
No edit summary |
|||
(36 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
[[Category:DBCollect]] | |||
{{PrevNext|DBCollect/Install|DBCollect/Options}} | |||
= Usage = | = Usage = | ||
== | == Where to run == | ||
Where to run {{dbcollect}} | |||
dbcollect | |||
* All Database nodes (in scope of the project) including all RAC nodes and all Data Guard Standby nodes | |||
dbcollect | * Optional: On all application nodes and hypervisors | ||
: On non-database hosts, {{dbcollect}} will only pick up OS config and performance data | |||
== Short summary == | |||
If {{dbcollect}} is installed correctly: | |||
<syntaxhighlight lang=bash> | <syntaxhighlight lang=bash> | ||
# | # Simply run dbcollect | ||
dbcollect | dbcollect | ||
# | # Error: Zipfile already exists | ||
# Overwrite previous zipfile | |||
# | |||
dbcollect --overwrite | dbcollect --overwrite | ||
# Error: No AWR or Statspack detected | |||
# If you are sure the system is licensed with Diagnostics Pack | |||
dbcollect --overwrite --force-awr | |||
# If you want to retrieve more than 10 days worth of AWR files: | |||
dbcollect --days 20 | |||
# Limit dbcollect to use only 2 tasks (CPUs) for AWR generation (default = 50% of CPUs, with a maximum of 8) | |||
dbcollect --tasks 2 | |||
</syntaxhighlight> | </syntaxhighlight> | ||
== Strip SQL code == | |||
AWR reports by default contain snippets of Oracle SQL queries. These are not used for reports but can be useful in performance diagnosis. Some customers consider the SQL code confidential information and don't allow this to leave the company. {{dbcollect}} has the option to remove all SQL code from the AWR reports: | |||
<code>dbcollect --strip</code> | |||
Notes: | |||
* This ONLY works for AWR, not for Statspack | |||
* It removes the SQL Statistics (such as <code>SQL ordered by CPU Time</code>) from AWR, as well as the entire ADDM section | |||
* It is based on XML parsing in Python, if the XML parsing fails, it will write the original AWR (HTML) report (but show a warning in the logfile). | |||
* Mileage may vary. So far it works flawlessly but no guarantee can be made. Inspect the ZIP files if you want to make sure. | |||
== Resolve AWR reporting issues == | == Resolve AWR reporting issues == | ||
Line 39: | Line 57: | ||
# If you want to ignore databases without AWR usage: | # If you want to ignore databases without AWR usage: | ||
# Note that this will result in under-reporting of performance and capacity due to missing data | |||
dbcollect --ignore-awr | dbcollect --ignore-awr | ||
# If the problem databases are not important, try excluding them: | # If the problem databases are not important, try excluding them: | ||
dbcollect --exclude inst1,inst2 | dbcollect --exclude inst1,inst2 | ||
</syntaxhighlight> | </syntaxhighlight> | ||
= | == Oracle RAC == | ||
On RAC, {{dbcollect}} will generate AWR reports for ''all'' instances on all nodes. This means if you run {{dbcollect}} on all RAC nodes (which is recommended) then all AWR reports will be generated multiple times (2 times for a 2-node cluster). To avoid this, run with the <code>--no-rac</code> option. This will cause AWR reports ONLY to be generated for the actual node, and saves on diskspace and resource overhead. | |||
== Oracle Data Guard == | |||
An Oracle instance that runs as Data Guard Standby is usually in MOUNT mode and does not allow AWR reports to be generated. Standby databases can be running in Real-Time Query or Real-Time Apply mode, but running standard AWR reports causes the AWR reports to be generated for the PRIMARY instance (there is no specific Standby performance data). | |||
Data Guard can be configured with RMF (Remote Management Framework) which causes performance statistics to be collected for the ''STANDBY'' database. | |||
<code>- | {{dbcollect}} will pick up AWR reports for the STANDBY database if this is enabled. To disable this feature, use <code>--no-stby</code>. | ||
= Monitoring progress = | |||
Once {{dbcollect}} is running, it will pick up OS info, and after a few seconds it will start generating AWR reports. | |||
< | <syntaxhighlight lang=bash> | ||
[root@db08 ~](-) # dbcollect -o | |||
dbcollect 1.16.3 - collect Oracle AWR/Statspack, database and system info | |||
INFO : dbcollect 1.16.3 - database and system info collector | |||
INFO : Python version 3.6.8 | |||
INFO : Current user is oracle | |||
INFO : Zip file is /tmp/dbcollect-db08.lan.zip | |||
INFO : Collecting OS info (Linux) | |||
INFO : Collecting Linux SAR files | |||
INFO : Collecting Oracle info | |||
INFO : Detecting Oracle instances | |||
INFO : Stopped instances: demo81 | |||
INFO : Running instances: demo08 | |||
INFO : demo08: AWR usage detected, generating reports | |||
INFO : demo08: 25 reports | |||
INFO : demo08: Running opatch lspatches | |||
INFO : demo08: Running dbinfo scripts | |||
INFO : demo08: DBInfo processor finished, elapsed time 8.38 seconds | |||
INFO : demo08: Started 2 SQLPlus sessions | |||
Report 16 of 25 (64.0% done), elapsed: 0:00:30, remaining: 0:00:17, reports/s: 0.53 | |||
</syntaxhighlight> | |||
The remaining time is an estimate on how long is required to generate all reports. | |||
Once complete, {{dbcollect}} should show a message that the ZIP file creation was succesful: | |||
= | <syntaxhighlight lang=bash> | ||
INFO : Report 25 of 25 (100.0% done), elapsed: 0:00:40, remaining: 0:00:00, reports/s: 0.63 | |||
INFO : Zip file /tmp/dbcollect-db08.lan.zip is created succesfully. | |||
INFO : Finished | |||
[root@db08 ~](-) # | |||
</syntaxhighlight> | |||
Check the output for any critical errors. | |||
= Checking results = | |||
The output of {{dbcollect}} is written by default to <code>/tmp/dbcollect-<hostname>.zip</code>. | |||
<code> | The zipfile should be readable with normal ZIP tools (<code>unzip</code>): | ||
<syntaxhighlight lang=bash> | |||
# Check if the ZIP file exists | |||
ls -al /tmp/dbcollect-*.zip | |||
-rw-r--r-- 1 oracle oinstall 2074253 Oct 10 14:37 /tmp/dbcollect-db08.lan.zip | |||
# Verify contents | |||
unzip -v /tmp/dbcollect-db08.lan.zip | |||
Archive: /tmp/dbcollect-db08.lan.zip | |||
dbcollect version=1.16.3 hostname=db08.lan | |||
Length Method Size Cmpr Date Time CRC-32 Name | |||
-------- ------ ------- ---- ---------- ----- -------- ---- | |||
761 Defl:N 385 49% 10-10-2024 14:36 1296357e db08.lan/meta.json | |||
855 Defl:N 402 53% 10-10-2024 14:36 f4d6f4f2 db08.lan/hostinfo.json | |||
11824 Defl:N 1536 87% 10-10-2024 14:36 83805993 db08.lan/diskinfo.json | |||
-- snip -- | |||
47823 Defl:N 6520 86% 10-10-2024 14:36 0b993954 db08.lan/oracle/log/demo08_sqlplus_7658.log | |||
2520 Defl:N 329 87% 10-10-2024 14:37 d4e5c89c db08.lan/oracle/log/demo08_sqlplus_8094.log | |||
2730 Defl:N 342 88% 10-10-2024 14:37 5f3fd91a db08.lan/oracle/log/demo08_sqlplus_8095.log | |||
4998 Defl:N 960 81% 10-10-2024 14:37 c2c4f51c db08.lan/dbcollect.log | |||
-------- ------- --- ------- | |||
18105774 2054255 89% 125 files | |||
# Note that the files in the ZIP always start with the hostname (to prevent a mess if you unzip it in your home dir). | |||
# The first file in the archive should be the meta.json, the last file should be the dbcollect.log | |||
# If dbcollect.log is missing, then the ZIP file was not written completely. | |||
# The dbcollect.log is probably still in /tmp. Reading it may reveal what went wrong. | |||
</syntaxhighlight> | |||
Do not change the generated ZIP archive (it has metadata and structures in place so it can be correctly processed by our reporting tools). Send the ZIP file as-is. | |||
= Running unattended = | |||
If {{dbcollect}} runs a very long time and you don't want to keep a CLI session open, there are a few ways to do that: | |||
# Run in a <code>screen</code> or <code>tmux</code> session [https://linuxize.com/post/how-to-use-linux-screen/]. | |||
# Run in a nohup session: <code>nohup dbcollect <options> &</code> | |||
Another way (if there are multiple systems) is to run using [[DBCollect/Ansible | Ansible]]. | |||
= Troubleshooting = | |||
== dbcollect fails to execute == | |||
* Check if <code>/usr/bin/env python</code> starts up a Python session. It should show the version. Exit with CTRL-D. | |||
* If it does not work, check if Python is installed at all: <code>python</code> | |||
* If Python is available but running {{dbcollect}} does not work, run <code>python /usr/local/bin/dbcollect <options></code> or try <code>python3</code> or something similar | |||
== dbcollect runs but fails with errors == | |||
A list of all error messages and explanation can be found here [[DBC:Errors | DBCollect Error messages]] | |||
== Performance == | |||
* {{dbcollect}} takes very long to run | |||
This is frequently the case if dbcollect runs against Oracle RAC. There are a number of known issues that cause it to be very slow. I have observed up to 20 seconds per AWR report (usually about 0.5 seconds) making dbcollect run for over an hour or more for a typical 10-day, 1-hour interval, single database cluster. Check Support notes for fixes and workarounds: 2404906.1, 2565465.1, 2318124.1, 29932310.8, 2148489.1, 29470291.8. Or be very patient. | |||
Update: As of version 1.11, dbcollect runs multiple AWR reports in parallel for each instance, making it much faster, by default 50% of CPUs. To use all available CPUs, use --tasks 0 (note that CPU load will likely go to 100%, be careful) | |||
* {{dbcollect}} seems to hang | |||
*# Check if the database itself works correctly. Can you switch logfiles, manually run a report? | |||
* Timeouts | |||
*# Sometimes a SELECT on DBA_FREE_SPACE or CDB_FREE_SPACE in Oracle is very slow, causing the dbinfo scripts to run very long which in turn causes the job processors to timeout. Possible fix: | |||
<syntaxhighlight lang=sql> | |||
SQL> purge dba_recyclebin; | |||
SQL> EXEC DBMS_STATS.GATHER_DICTIONARY_STATS; | |||
SQL> EXEC DBMS_STATS.GATHER_FIXED_OBJECTS_STATS; | |||
</syntaxhighlight> |
Latest revision as of 15:38, 10 October 2024
Usage
Where to run
Where to run dbcollect
- All Database nodes (in scope of the project) including all RAC nodes and all Data Guard Standby nodes
- Optional: On all application nodes and hypervisors
- On non-database hosts, dbcollect will only pick up OS config and performance data
Short summary
If dbcollect is installed correctly:
# Simply run dbcollect
dbcollect
# Error: Zipfile already exists
# Overwrite previous zipfile
dbcollect --overwrite
# Error: No AWR or Statspack detected
# If you are sure the system is licensed with Diagnostics Pack
dbcollect --overwrite --force-awr
# If you want to retrieve more than 10 days worth of AWR files:
dbcollect --days 20
# Limit dbcollect to use only 2 tasks (CPUs) for AWR generation (default = 50% of CPUs, with a maximum of 8)
dbcollect --tasks 2
Strip SQL code
AWR reports by default contain snippets of Oracle SQL queries. These are not used for reports but can be useful in performance diagnosis. Some customers consider the SQL code confidential information and don't allow this to leave the company. dbcollect has the option to remove all SQL code from the AWR reports:
dbcollect --strip
Notes:
- This ONLY works for AWR, not for Statspack
- It removes the SQL Statistics (such as
SQL ordered by CPU Time
) from AWR, as well as the entire ADDM section - It is based on XML parsing in Python, if the XML parsing fails, it will write the original AWR (HTML) report (but show a warning in the logfile).
- Mileage may vary. So far it works flawlessly but no guarantee can be made. Inspect the ZIP files if you want to make sure.
Resolve AWR reporting issues
dbcollect checks for prior AWR usage to make sure it will not trigger Oracle license violations ( More info)
If you get Error [DBC-E021] No AWR or Statspack detected:
# If you are sure you have Diagnostics Pack license:
dbcollect --force-awr
# If you want to ignore databases without AWR usage:
# Note that this will result in under-reporting of performance and capacity due to missing data
dbcollect --ignore-awr
# If the problem databases are not important, try excluding them:
dbcollect --exclude inst1,inst2
Oracle RAC
On RAC, dbcollect will generate AWR reports for all instances on all nodes. This means if you run dbcollect on all RAC nodes (which is recommended) then all AWR reports will be generated multiple times (2 times for a 2-node cluster). To avoid this, run with the --no-rac
option. This will cause AWR reports ONLY to be generated for the actual node, and saves on diskspace and resource overhead.
Oracle Data Guard
An Oracle instance that runs as Data Guard Standby is usually in MOUNT mode and does not allow AWR reports to be generated. Standby databases can be running in Real-Time Query or Real-Time Apply mode, but running standard AWR reports causes the AWR reports to be generated for the PRIMARY instance (there is no specific Standby performance data).
Data Guard can be configured with RMF (Remote Management Framework) which causes performance statistics to be collected for the STANDBY database.
dbcollect will pick up AWR reports for the STANDBY database if this is enabled. To disable this feature, use --no-stby
.
Monitoring progress
Once dbcollect is running, it will pick up OS info, and after a few seconds it will start generating AWR reports.
[root@db08 ~](-) # dbcollect -o
dbcollect 1.16.3 - collect Oracle AWR/Statspack, database and system info
INFO : dbcollect 1.16.3 - database and system info collector
INFO : Python version 3.6.8
INFO : Current user is oracle
INFO : Zip file is /tmp/dbcollect-db08.lan.zip
INFO : Collecting OS info (Linux)
INFO : Collecting Linux SAR files
INFO : Collecting Oracle info
INFO : Detecting Oracle instances
INFO : Stopped instances: demo81
INFO : Running instances: demo08
INFO : demo08: AWR usage detected, generating reports
INFO : demo08: 25 reports
INFO : demo08: Running opatch lspatches
INFO : demo08: Running dbinfo scripts
INFO : demo08: DBInfo processor finished, elapsed time 8.38 seconds
INFO : demo08: Started 2 SQLPlus sessions
Report 16 of 25 (64.0% done), elapsed: 0:00:30, remaining: 0:00:17, reports/s: 0.53
The remaining time is an estimate on how long is required to generate all reports.
Once complete, dbcollect should show a message that the ZIP file creation was succesful:
INFO : Report 25 of 25 (100.0% done), elapsed: 0:00:40, remaining: 0:00:00, reports/s: 0.63
INFO : Zip file /tmp/dbcollect-db08.lan.zip is created succesfully.
INFO : Finished
[root@db08 ~](-) #
Check the output for any critical errors.
Checking results
The output of dbcollect is written by default to /tmp/dbcollect-<hostname>.zip
.
The zipfile should be readable with normal ZIP tools (unzip
):
# Check if the ZIP file exists
ls -al /tmp/dbcollect-*.zip
-rw-r--r-- 1 oracle oinstall 2074253 Oct 10 14:37 /tmp/dbcollect-db08.lan.zip
# Verify contents
unzip -v /tmp/dbcollect-db08.lan.zip
Archive: /tmp/dbcollect-db08.lan.zip
dbcollect version=1.16.3 hostname=db08.lan
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
761 Defl:N 385 49% 10-10-2024 14:36 1296357e db08.lan/meta.json
855 Defl:N 402 53% 10-10-2024 14:36 f4d6f4f2 db08.lan/hostinfo.json
11824 Defl:N 1536 87% 10-10-2024 14:36 83805993 db08.lan/diskinfo.json
-- snip --
47823 Defl:N 6520 86% 10-10-2024 14:36 0b993954 db08.lan/oracle/log/demo08_sqlplus_7658.log
2520 Defl:N 329 87% 10-10-2024 14:37 d4e5c89c db08.lan/oracle/log/demo08_sqlplus_8094.log
2730 Defl:N 342 88% 10-10-2024 14:37 5f3fd91a db08.lan/oracle/log/demo08_sqlplus_8095.log
4998 Defl:N 960 81% 10-10-2024 14:37 c2c4f51c db08.lan/dbcollect.log
-------- ------- --- -------
18105774 2054255 89% 125 files
# Note that the files in the ZIP always start with the hostname (to prevent a mess if you unzip it in your home dir).
# The first file in the archive should be the meta.json, the last file should be the dbcollect.log
# If dbcollect.log is missing, then the ZIP file was not written completely.
# The dbcollect.log is probably still in /tmp. Reading it may reveal what went wrong.
Do not change the generated ZIP archive (it has metadata and structures in place so it can be correctly processed by our reporting tools). Send the ZIP file as-is.
Running unattended
If dbcollect runs a very long time and you don't want to keep a CLI session open, there are a few ways to do that:
- Run in a
screen
ortmux
session [1]. - Run in a nohup session:
nohup dbcollect <options> &
Another way (if there are multiple systems) is to run using Ansible.
Troubleshooting
dbcollect fails to execute
- Check if
/usr/bin/env python
starts up a Python session. It should show the version. Exit with CTRL-D. - If it does not work, check if Python is installed at all:
python
- If Python is available but running dbcollect does not work, run
python /usr/local/bin/dbcollect <options>
or trypython3
or something similar
dbcollect runs but fails with errors
A list of all error messages and explanation can be found here DBCollect Error messages
Performance
- dbcollect takes very long to run
This is frequently the case if dbcollect runs against Oracle RAC. There are a number of known issues that cause it to be very slow. I have observed up to 20 seconds per AWR report (usually about 0.5 seconds) making dbcollect run for over an hour or more for a typical 10-day, 1-hour interval, single database cluster. Check Support notes for fixes and workarounds: 2404906.1, 2565465.1, 2318124.1, 29932310.8, 2148489.1, 29470291.8. Or be very patient.
Update: As of version 1.11, dbcollect runs multiple AWR reports in parallel for each instance, making it much faster, by default 50% of CPUs. To use all available CPUs, use --tasks 0 (note that CPU load will likely go to 100%, be careful)
- dbcollect seems to hang
- Check if the database itself works correctly. Can you switch logfiles, manually run a report?
- Timeouts
- Sometimes a SELECT on DBA_FREE_SPACE or CDB_FREE_SPACE in Oracle is very slow, causing the dbinfo scripts to run very long which in turn causes the job processors to timeout. Possible fix:
SQL> purge dba_recyclebin;
SQL> EXEC DBMS_STATS.GATHER_DICTIONARY_STATS;
SQL> EXEC DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;