
MapReduce: A Major Step Backwards

Preface

The database heavyweights at The Database Column (among them Michael Stonebraker, the original Berkeley leader of the PostgreSQL project) recently published a commentary on the currently red-hot MapReduce technology, and it has sparked heated discussion. I have taken some time to translate it here so we can study it together.

Translator's note: a Tanenbaum vs. Linus style debate like this naturally produces very heated argument. Honestly, though, judging from how the Tanenbaum vs. Linus debate actually unfolded, Linux increasingly learned from and applied, in its own way, the experience of Tanenbaum and other OS researchers (rather than turning its back on them). I hope the MapReduce vs. DBMS discussion will likewise give those who come later more insight, rather than mere opposition.

Original article: http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html

MapReduce: A major step backwards

Note: the authors are David J. DeWitt and Michael Stonebraker.

On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.

For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework.

As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on

First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.

What is MapReduce?

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.

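To make the division of labor concrete, here is a minimal single-process sketch in Python. The names map_fn, reduce_fn, and run_mapreduce and the toy word-count task are invented for illustration; a real framework distributes the same three roles across a cluster rather than running them in one process.

```python
from collections import defaultdict

def map_fn(record):
    """User-written map: emit (key, data) pairs for one input record."""
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    """User-written reduce: arbitrary computation over all values for one key."""
    return key, sum(values)

def run_mapreduce(records):
    """Stand-in for the framework: run map, group by key, run reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

print(run_mapreduce(["a rose is a rose", "a daisy is not"]))
```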

The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.

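The split step can be sketched as follows. The choice of M = 4 buckets and the use of MD5 as the hash are illustrative assumptions (Python's built-in hash() is avoided because it is not deterministic across processes, and the text requires a deterministic function).

```python
import hashlib

M = 4  # number of buckets, i.e. of reduce instances (illustrative value)

def split(key, m=M):
    """Deterministic split function: the same key maps to the same bucket on every node."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % m

# Each map instance appends its output records to M buckets, spilling a bucket
# to a local file whenever it fills up.
buckets = {j: [] for j in range(M)}
for key, data in [("apple", 1), ("pear", 1), ("apple", 1)]:
    buckets[split(key)].append((key, data))
```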

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.

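A sketch of what one reduce instance R_j does with the files F_{i,j} it collects, assuming in-memory lists stand in for those files and grouping is done by sorting (one of the two strategies mentioned above); the helper names are invented.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_instance(j, map_outputs, reduce_fn):
    """map_outputs[i][j] plays the role of file F_{i,j} written by map instance i."""
    records = [rec for per_map in map_outputs for rec in per_map[j]]
    records.sort(key=itemgetter(0))                    # group on the key by sorting
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(records, key=itemgetter(0))]

# Example with N = 2 map instances and M = 2 buckets; reduce_fn sums the values.
outputs = [
    [[("apple", 1)], [("pear", 1)]],                   # buckets from map instance 1
    [[("apple", 1)], [("plum", 1)]],                   # buckets from map instance 2
]
print(run_reduce_instance(0, outputs, lambda k, vs: (k, sum(vs))))
```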

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

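The analogy can be made concrete with a small query, run here through Python's built-in sqlite3 module purely for convenience; the emp table and its columns are invented. The GROUP BY attribute plays the role of the map key, and AVG() plays the role of the reduce computation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("a", "shoe", 10000), ("b", "shoe", 12000), ("c", "toy", 9000)])

# "map" corresponds to extracting the group-by key (dept);
# "reduce" corresponds to the aggregate computed over each group.
for dept, avg_salary in conn.execute(
        "SELECT dept, AVG(salary) FROM emp GROUP BY dept"):
    print(dept, avg_salary)      # shoe 11000.0 / toy 9000.0
```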

We now turn to the five concerns we have with this computing paradigm.

MapReduce is a step backwards in database access

As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.

  • Schemas are good.
  • Separation of the schema from the application is good.
  • High-level access languages are good.

MapReduce has learned none of these lessons and represents a throw back to the 1960s, before modern DBMSs were invented.

The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.

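As a small illustration of that difference, the sketch below uses SQLite only as a stand-in for "a DBMS with a declared schema"; the table, columns, and constraint are invented. The engine rejects a record that violates the schema, while nothing guards a schema-less pile of records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emp (
                    name   TEXT    NOT NULL,
                    salary INTEGER NOT NULL CHECK (salary >= 0))""")

try:
    conn.execute("INSERT INTO emp VALUES (?, ?)", ("alice", -5))
except sqlite3.IntegrityError as exc:
    print("rejected by the schema:", exc)    # the garbage never enters the data set

dataset = []                   # a schema-less "input file" for a MapReduce job
dataset.append(("alice", -5))  # nothing stops the malformed record from going in
```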

It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.

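The catalog lookup described above can be sketched with SQLite's catalog (a conventional SQL engine would expose the same information through information_schema); the table is invented. The point is that the record structure is discoverable by query rather than by reading application source code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")

# PRAGMA table_info plays the role of the system catalogs mentioned above.
for cid, name, col_type, notnull, default, pk in conn.execute("PRAGMA table_info(emp)"):
    print(name, col_type)        # name TEXT / dept TEXT / salary REAL
```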

During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:

  • By stating what you want - rather than presenting an algorithm for how to get it (relational view)
  • By presenting an algorithm for data access (Codasyl view)

The result is now ancient history, but the entire world saw the value of high-level languages, and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.

MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.

Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}), different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.

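For illustration only, the self-describing format might be pictured as plain data like the following; the row keys, column names, and values are invented. Two rows of the same "table" carry their own, possibly different, structure, and nothing plays the role of a view to shield applications when that structure drifts.

```python
# Each cell is a (row key, column name, {values}) triple that describes itself.
row_1 = [("emp#1", "name", {"alice"}), ("emp#1", "salary", {10000})]
row_2 = [("emp#2", "name", {"bob"}), ("emp#2", "dept", {"shoe"}),
         ("emp#2", "nickname", {"b"})]

# Nothing requires the two rows to agree on a common schema, and there is no
# view layer to keep existing applications working if these structures drift.
```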

MapReduce is a poor implementation

All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.

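A quick demonstration of the point, again using SQLite as a convenient stand-in; the table and index names are invented, and the exact wording of the plan output varies by SQLite version. EXPLAIN QUERY PLAN shows the optimizer switching from a sequential scan to an index search once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")

def plan(query):
    """Return the optimizer's chosen access path for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(plan("SELECT * FROM emp WHERE salary = 10000"))   # a full sequential scan of emp
conn.execute("CREATE INDEX emp_salary ON emp (salary)")
print(plan("SELECT * FROM emp WHERE salary = 10000"))   # a search using index emp_salary
```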

MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.

One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.

There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.

One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database Systems: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.

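A toy simulation of the skew effect; the key distribution, the four reduce instances, and the cost assumption are all made up. One over-popular key loads a single reduce instance, and that instance's running time becomes the job's running time.

```python
from collections import Counter

M = 4                                                      # reduce instances
keys = ["hot"] * 9000 + [f"key{i}" for i in range(1000)]   # one very popular key

load = Counter(abs(hash(k)) % M for k in keys)             # records per reduce instance
print(sorted(load.values()))                               # three small loads, one huge one

# If reduce time grows with input size, the whole job finishes only when the
# overloaded instance does, however fast the other instances are.
print("job time ~", max(load.values()), "ideal ~", len(keys) // M)
```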

There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously-- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.

MapReduce is not novel

The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in "Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7], Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.

Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.

While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid-1980s. Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.

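As a small, hedged sketch of such user-defined aggregates, SQLite's Python binding lets one register a custom aggregate via create_aggregate; the geometric-mean aggregate and its name are invented here. POSTGRES and other engines expose the same idea through CREATE FUNCTION / CREATE AGGREGATE.

```python
import math
import sqlite3

class GeoMean:
    """A user-defined aggregate: geometric mean of a numeric column."""
    def __init__(self):
        self.log_sum, self.count = 0.0, 0
    def step(self, value):                 # called once per input row
        self.log_sum += math.log(value)
        self.count += 1
    def finalize(self):                    # called once per group
        return math.exp(self.log_sum / self.count) if self.count else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("geomean", 1, GeoMean)
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (4.0,), (16.0,)])
print(conn.execute("SELECT geomean(x) FROM t").fetchone()[0])   # 4.0
```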

MapReduce is missing features

All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:

  • Bulk loader -- to transform input data in files into a desired format and load it into a DBMS
  • Indexing -- as noted above
  • Updates -- to change the data in the data base
  • Transactions -- to support parallel update and recovery from failures during update
  • Integrity constraints -- to help keep garbage out of the data base
  • Referential integrity-- again, to help keep garbage out of the data base
  • Views -- so the schema can change without having to rewrite the application program

In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.

MapReduce is incompatible with the DBMS tools

A modern SQL DBMS has available all of the following classes of tools:

  • Report writers (e.g., Crystal reports) to prepare reports for human visualization
  • Business intelligence tools (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses
  • Data mining tools (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets
  • Replication tools (e.g., Golden Gate) to allow a user to replicate data from one DBMS to another
  • Database design tools (e.g., Embarcadero) to assist the user in constructing a data base.

MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.

In Summary

It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.

We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally, we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig [10] project at Yahoo! Research is one such effort.

References

[1] "MapReduce: Simplified Data Processing on Large Clusters," Jeff Dean and Sanjay Ghemawat, Proceedings of the 2004 OSDI Conference, 2004.

[2] "The Gamma Database Machine Project," DeWitt, et. al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.

[4] "Gamma - A High Performance Dataflow Database Machine," DeWitt, D, R. Gerber, G. Graefe, M. Heytens, K. Kumar, and M. Muralikrishna, Proceedings of the 1986 VLDB Conference, 1986.

[5] "Prototyping Bubba, A Highly Parallel Database System," Boral, et. al., IEEE Transactions on Knowledge and Data Engineering,Vol. 2, No. 1, March 1990.

[6] "Parallel Database System: The Future of High Performance Database Systems," David J. DeWitt and Jim Gray, CACM, Vol. 35, No. 6, June 1992.

[7] "Multiprocessor Hash-Based Join Algorithms," David J. DeWitt and Robert H. Gerber, Proceedings of the 1985 VLDB Conference, 1985.

[8] "The Case for Shared-Nothing," Michael Stonebraker, Data Engineering Bulletin, Vol. 9, No. 1, 1986.

[9] "Adaptive Parallel Aggregation Algorithms," Ambuj Shatdal and Jeffrey F. Naughton, Proceedings of the 1995 SIGMOD Conference, 1995.

[10] "Pig", Chris Olston, 
