Chapter3? Access and Join Methods 访问和连接方法
Karen Morton???? (page 83)
??? The optimizer must determine how to access the data your SQL statements require. You formulate your statement and the optimizer, during a hard parse, will figure out which operations should provide the data in the most effective way possible.? Using statistics as the primary guide, the optimizer will compute the cost of the possible alternatives to first access data and then join multiple tables to get the final result set.? The more you understand about the different access and join methods the optimizer will consider, the more likely you will be to formulate your SQL to help the optimizer make the best choices.? And, when the operation chosen by the optimizer doesn’t provide the performance you need, you can more accurately determine which operations would be more suited to produce the response times you want.
??? 优化器必须确定如何访问你的SQL语句所需的数据。你公式化你的语句,而在硬解析时,优化器将计算出所需操作,为了尽可能最有效的给出数据。使用统计信息作为首要的向导,优化器将计算出可能的替代方案的成本,首次访问在到多次连接多表获得最终的结果。你理解越多关于优化器所考虑的不同的访问和连接方法,越有可能你能公式化书写你的SQL以帮助优化器做出最佳的选择。而且,当优化器所选的操作并没有达到你所需的性能要求时,你能够更准确的确定哪种操作将更适合于产生你想要的响应时间。
??? After your SQL statement’s expressions and conditions are evaluated and any query transformation? that might help it more accurately develop the execution plan are complete, the next step in the development of the execution plan is for the optimizer to determine which method of accessing the data will be best.? In general, there are only two basic ways to access data: either via a full scan or an index scan.? During a full scan (which, by the way, can be a full table scan or a fast full index scan) multiple blocks are read in a single IO operation.? Index scans first scan index leaf blocks to retrieve specific rowids and then hand those rowids to the parent table access step to retrieve the actual row data.? These accesses are performed via single block reads.? If there are additional filters that need to be applied to the data after the table access step, the rows will pass through that filter before being included in the final result set from that step.? ?
??? 在你的SQL语句的表达式和条件被评估,且任何可能有助于准确开发执行计划的查询变换完成之后,开发执行计划的下一步就是优化器确定哪种访问数据的方法是最佳的。一般说来,只有两种最基本的访问数据的方法:要么全扫描,要么索引扫描。全扫描期间(顺便说一下,可以是全表扫描或者快速全索引扫描)多个块在一次IO操作中读取。索引扫描首先扫描索引页块检索指定的rowid然后传递这些rowid给父(上一步)的表访问步骤检索实际的行数据。这些访问都是通过单独的块读取(步骤)执行的。如果有附加的过滤器需要在表访问步骤之后应用于数据,在被包括在,自那步起的最终结果集之前,行集将穿过过滤器。
??? The access method that is chosen for a table is used to determine the join methods and join orders that will be used in the final plan.? So, if the access method chosen for a table is suboptimal, the likelihood that the whole plan is faulty is high.? As discussed in Chapter 2, statistics play a vital role in how accurate the optimizer is in determining the best method.? Along with representative statistics, the optimizer will use your query to figure out how much data you are requesting and which access method will provide that data as quickly as possible.? Each table in the query will first be evaluated independently from the others to determine its optimal access path.? In the next sections, I’ll review each of the access methods in detail.
??? 对某表所选的访问方法用于确定连接方法和连接顺序从而将用于最终的执行计划中。因此,如果对某表所选访问方法是欠优化的,很有可能整个执行计划的缺陷就很高。正如第二章讨论的,统计信息在让优化器如何准确的确定最佳方法上扮演了十分重要的角色。根据有代表性的统计数据,优化器将使用你的查询计算出你请求了多少数据和哪种访问方法能尽可能快的提供数据。在查询中每张表将首先相互独立的评估从而确定他的最佳访问路径。在接下的章节中,我将详细的检阅各种访问方法。
Full Scan Access Methods???? 全扫描访问方法
??? When full scanning an object, all the blocks associated with that object must be retrieved and
processed to determine if rows in a block match your query’s needs.? Remember that Oracle must read
an entire block into memory in order to get to the row data stored in that block.? So, when a full scan
occurs, there are actually two things the optimizer (and you) needs to consider: how many blocks must
be read and how much data in each block will be thrown away.? The idea I want you to grab on to at this
point is that the decision as to whether a full scan is the right choice isn’t just based on how many rows
your query will return.? There have been many “rules of thumb” published that state things like “if your
query will retrieve more than x% of rows from the table, then a full scan should be chosen.”? There’s
more to the decision than just that ROT (Rule Of Thumb = ROT) and I don’t want you to get stuck on a
rule that limits the consideration that should be given to the choice.
??? 当全扫描一个对象时,所有与对象相关联的块必须被检索到然后进一步判断块中的行集是否匹配你的需要。记住Oracle必须把整个块读入内存才能获取存在块中的行数据。因此,当一全扫描发生时,优化器(和你)需要考虑两件事:需要读多少块和每块中要抛弃多少数据。对于这点我想要你领悟的是:是否全扫描是(最终)正确的选择,该结论并不取决于你的查询所返回的行数。许多已发布的“经验”宣称如“若你的查询检索超出表的行集的x%,则会选择全扫描”。相比经验而言,还有更多的考量,而我不想让你卡在某个规则上,限制你的应该做出选择思路。
??? I'm not saying the theory behind the rule of thumb doesn’t make logical sense.? I’m just saying that
it isn’t everything that must be considered.? In a case where the query will return a very high
percentage of rows, the likelihood that a full scan should be use is certainly high, but the trouble with a
generalized rule is that the percentage of rows chosen is somewhat arbitrary.? Over the years, I’ve
seen this rule published in various books, articles, and forums with percentages varying from 20% to
70%.? Why should it change ???
??? 我并不是说规则背后的理论不符合逻辑。我只想说它没有考虑到全部。在某种情况下查询返回非常高百分比的行集,使用全扫描的可能性也必然高,但这条通用的规则的麻烦是:所选的行集的百分比有些武断。很多年来,我所见的在各种书籍,文章和论坛中所发布的这条规则,百分比值从20%变动到70%。为什么它要变动?
How Full Scan Operations are Chosen????? 全扫描操作如何被选中
??? At this point, now that I’ve briefly discussed the problem with generalizing how full table scans are chosen, I can continue with the rest of the story.? It’s not just about rows, it’s also about blocks and about throwaway.? The combination of all of these pieces of information may lead to a conclusion that it makes sense to do a full scan even when the percentage of rows is quite small.? On the other hand, a full scan may not be chosen even when a large percentage of the rows are returned.? Let’s walk through an example in Listing 3-1 showing how even when a small percentage of rows satisfies the query, the optimizer may choose a full table scan plan.? First, two tables are created that contain the exact same 10,000 rows.? Next, the execution plans for the same query against each table are shown.? Notice how even though the query returns 100 rows (only 1% of the total data), a full scan plan can be chosen.
???? 现在,既然我已经简要的讨论了如何选择全表扫描的一般规则的问题所在,我把故事讲完。它不仅关乎行集,它也关乎块集和抛弃。所有这些信息片组合得出结论,即使(所取行占)行集的百分比相当低使用全表扫描也合理。让我们通过列表3-1中的一个例子展示如何即使是是小百分比的行集满足查询,优化器可能还是使用全表扫描计划。首先,创建两表包含完全相同的10000行。接下来,展示针对每张表相同的执行计划。注意即使查询返回100行(仅1%的总数),全表扫描计划是如何被选中的。
Listing 3-1. Creating Two Test Tables
SQL> create table t1 as
? 2? select? trunc((rownum-1)/100) id,
? 3????????? rpad(rownum,100) t_pad
? 4? from??? dba_source
? 5? where?? rownum <= 10000;
?
Table created.
?
SQL> create index t1_idx1 on t1(id);
?
Index created.
?
SQL> exec dbms_stats.gather_table_stats(user,'t1',method_opt=>'FOR ALL COLUMNS SIZE
1',cascade=>TRUE);
?
PL/SQL procedure successfully completed.
?SQL> create table t2 as
? 2? select? mod(rownum,100) id,
? 3????????? rpad(rownum,100) t_pad
? 4? from??? dba_source
? 5? where?? rownum <= 10000;
?
Table created.
?
SQL> create index t2_idx1 on t2(id);
?
Index created.
?
SQL> exec dbms_stats.gather_table_stats(user,'t2',method_opt=>'FOR ALL COLUMNS SIZE
1',cascade=>TRUE);
?
PL/SQL procedure successfully completed.
?
??? Both tables will have 10,000 rows.? The id columns in both tables will have 100 rows for each value
between 0 and 99.? So, in terms of the data content, the tables are identical.? However, notice that for
t1 , the id column was populated using the expression? trunc((rownum-1)/100)? while for? t2 the id column
was populated using mod(rownum,100) .? Figure 3-1 shows how the rows might be stored physically in the
table’s data blocks.
??? 两张表都有10000行,两表的id列都有100行各自的值从0到99。如此,根据数据内容,表是相同的。然而,注意对于t1,id列使用的是表达式 trunc((rownum-1)/100) 填充的,而表t2的id列是用mod(rownum,100)填充的。图3-1展示行是如何在表的数据块中物理存储的。
????
??? Figure 3-1. Diagram of random vs. sequentially stored row values 随机和顺序存储行值比较图
??? Given what you just inserted, you’d expect to get a result set of 100 rows if you executed a query for
any single value of either table.? You know how many rows you should get because you just created the
tables yourself.? But, how could you get an idea of what the tables contained and how those rows were
stored otherwise?? One way is to run a query and use the COUNT? aggregate function, as shown in?
Listing 3-2.??
??? 要查出你刚才插入的数据,假设你执行查询取每张表的任意单值,你想要得到100行的结果集。你知道你将获取多少行因为你刚刚自己创建的表。但是,你如何获得概念关于表中包含了什么以及这些行是如何存储的?一种方法是运行查询使用count聚合函数,如列表3-2所示。
Listing 3-2. Count(*) Queries Against Tables T1 and T2
SQL> select count(*) ct from t1 where id = 1 ;
????????????? CT
---------------
??????????? 100
?1 row selected.
?
SQL> select count(*) ct from t2 where id = 1 ;
????????????? CT
---------------
??????????? 100
?1 row selected.
???? Notice that, as expected, you get 100 rows from both tables.? If it is reasonable to query actual data
to determine the result set sizes, this is a great way to know what to expect from your query.? For each
table involved in the query you write, you can execute individual queries that apply the predicates for
that table and count the number of rows returned.? This will? help you estimate which access method
will likely be best suited for your final query.? But, knowing row counts is only part of the information
you need.? Now, you need to go back to how the data is stored.
??? 注意,如期望的那样,你从两张表中都获取了100行。如果查询实际数据确定结果集的大小是合理的话,这将是个好方法,从你的查询就可知道期望(的数据大小)。在你写查询所涉及的每张表,你都能对那表执行独立的应用谓词的查询和count返回的行数。这将有助于你估计何种访问方法将可能最适合于你最终的查询。但是,知道行数仅是你所需的部分信息。现在,你需要返回去了解数据是如何存储的了。
??? Out of 10,000 total rows in each table, if you query for a single value ( where id = 1), you know
you’ll get back 100 rows.? That’s just 1% of the total rows.? Given that small percentage, you’d also then
likely expect that the optimizer would choose to use the index on? id? to access those rows, right?? That
certainly seems like a logical conclusion, but here is where knowing how your data is stored comes in.?
If your data is stored sequentially with most of the rows where? id = 1 stored physically in just a few
blocks, like is the case with table t1, this conclusion is correct, as shown in the explain plan in?
Listing 3-3.
??? 如果你查询单值(where id=1),从每张表的总行数10000的数据中,你知道你将获取到100行。就是总行数的1%。这是小的百分比,你也期望优化器将选择使用id上的索引来访问这些行,对吗?这当然看起来是很符合逻辑的结论,但是这里需要知道你的数据是如何存入的。如果你的数据是顺序存入的,大部分where id = 1的行集物理存储于几个块中,像表t1那中情况,这个结论是正确的,正如列表3-3的执行计划所展示的那样。
Listing 3-3. EXPLAIN PLAN for Query Against T1
---------------------------------------------------------------------------------------
| Id? | Operation????????????????? ? ? ? ? ? ? ? ? ? ? ? ? | Name????? | Rows? | Bytes? | Cost (%CPU)| Time???? |
---------------------------------------------------------------------------------------
|?? 0 | SELECT STATEMENT?????????????????????? |??????? ? ? ? ? |?? 100?? | 10300 |???? 3?? (0)?????? | 00:00:01 |
|?? 1 |? TABLE ACCESS BY INDEX ROWID?? | T1?????????? |?? 100?? | 10300 |???? 3?? (0)?????? | 00:00:01 |
|*? 2 |?? INDEX RANGE SCAN????????????????????? | T1_IDX1 |?? 100 ? |??????????? |???? 1?? (0)?????? | 00:00:01 |
---------------------------------------------------------------------------------------
?
Predicate Information (identified by operation id):
---------------------------------------------------
?? 2 - access("ID"=1)
??? So, wouldn’t you expect the query against t2 to do exactly the same thing since it will return the
same 100 rows?? As you can see from the explain plan shown in Listing 3-4, that is not the case at all.
??? 如此,你不是期望对t2表的查询也做同样的事情,因为它返回相同的100行么?你将看到在列表3-4的执行计划中,情况全然不同。
Listing 3-4. EXPLAIN PLAN for Query Against T2
--------------------------------------------------------------------------
| Id? | Operation???????????????????????? | Name | Rows? | Bytes?? | Cost (%CPU)? | Time???? |
--------------------------------------------------------------------------
|?? 0 | SELECT STATEMENT ? ?? |? ??? ? ?? |?? 100?? | 10300 |??? 39?? (3)???????? | 00:00:01 |
|*? 1 |? TABLE ACCESS FULL??? | T2?????? |?? 100?? | 10300 |??? 39?? (3)??????? | 00:00:01 |
--------------------------------------------------------------------------
?
Predicate Information (identified by operation id):
---------------------------------------------------
?? 1 - filter("ID"=1)
??? Why didn’t the optimizer make the same plan choice for both queries?? It’s because of how? the data
is stored in each table.? The query against table t1 will require that Oracle access only a few blocks to
get the 100 rows needed to satisfy the query.? Therefore, the index costs out to be the most attractive
option.? But, the query against table? t2 will end up having to read practically every block in the table to
get the same 100 rows since the rows are physically scattered throughout all the table blocks.? The
optimizer calculates that the time to read every block in the table using an index would likely be more
than just reading all the blocks in the table using a full table scan and simply throwing away rows that
aren’t needed from each block.? Retrieving the rows from an index would require approximately 200
block accesses.? I’ll discuss why it’s 200 in the next section when I cover index scans.? So, the query
against? t2 will use a? TABLE ACCESS FULL? operation instead of an index.
??? 为什么优化器没有对两个查询选择相同的计划?因为(还取决于)数据在每张表是如何存储的。对t1表的查询只需要Oracle访问几个数据块即可获取满足查询的所需的100行。因此,索引成本优势成为最有吸引力的选项。但是对t2表的查询将最终必须实际读在表中的每个块才能获取相同的100行,因为行集是物理分散贯穿到所有的表块上。优化器计算使用索引读表中的每个块的时间很可能比使用全表扫描只是读表中所有块再简单的从每块中抛弃不需要的行(的时间)要多。从一索引上检索行集可能需要大概200个块访问。下一节当我讲索引扫描的时候我将讨论为什么是200。如此以来,对t2表的查询使用TABLE ACCESS FULL(全表扫描操作)而不是索引了。
??? This demonstration shows you that there can be differences in the optimizer’s plan choices based
on how the data is stored.? While knowing this may not necessarily make a difference in how you end
up writing a query, it can make a difference in how you determine if the performance of the query will
meet your SLAs(service level agreement).? If you kept seeing a full table scan plan operation, you may think you needed to change or hint your query to force the use of the index.? But, doing so might make performance worse in the long term.? If you don’t understand how the data is stored, you can make poor decisions about what should happen when your query executes.
??? 这个示例告诉你基于数据是如何存储的,优化器选择的计划是能不同的。虽然知道这点可能不会最终对你写查询产生区别,它却能对你如何确定查询的性能是否将达到你的要求(服务级协议)产生区别。如果你总是看到(选择的是)全表扫描计划操作,你可能想你需要改变或提示你的查询而强制使用索引。但是,从长远看,这样做可能使得性能变糟。如果你不理解数据是如何存储的,你就可能对查询执行时发生什么做出错误的判断。
?
译者注:实际在译者电脑中实验Listing 3-1,和作者所述有所差异。当然,这并不是说作者说的概念是错误的,相反我认为她举得例子很形象,但是可能个人的环境还是有一定的影响。
1.如果两表数据量都是10000条,count(*)的执行计划都是用的全表扫描
1.如果两表数据量都是1000000条,count(*)的执行计划都是用的索引范围扫描
实验结果见附件。
?
她的例子中用到了autotrace追踪执行计划,但是她到现在为止也没有给出如何配置的说明。这里补充一下,
关于autotrace和PLUSTRACE角色的创建,参见
《Expert Oracle Database Architecture》中的"Setting up you Enviorment"
?
http://books.google.com.hk/books?id=HPJDlGmecwcC&printsec=frontcover&dq=Expert+Oracle+Database+Architecture&hl=zh-CN&sa=X&ei=vGlfT8GdEamjiAfOlJzdBw&ved=0CDEQ6AEwAA#v=onepage&q=Expert%20Oracle%20Database%20Architecture&f=false
?
?
?