
Survey on Distributed Data Mining Systems

by Swetha Reddy Allam (Author), Kotagiri Santhosh (Author)

Scientific Essay 2014 5 Pages

Computer Science - Applied

Excerpt

ABSTRACT

As databases are used more widely across fields and domains, more and more of them are distributed over networks to overcome the challenges of a centralized data mining environment. The objective of distributed data mining (DDM) is to perform data mining operations based on the type and availability of distributed resources. To choose a suitable DDM system or model, the basic differences between them must be understood. This paper presents a survey of some of the available DDM systems, focusing mainly on the homogeneous DDM models. It discusses methods based on semantic web and grid, multi-agent, mobile agent and i-Analyst. A hybrid method, AGrIP, is also discussed. A comparative analysis is made considering different key issues of DDM. Each method is described in detail by its method/algorithm.

Keywords

DDM; Multi-agent; i-Agent; Ontology; Semantic Web; Grid; CDM; DAP

1. INTRODUCTION

Technologies such as communication, computation and science are growing fast. This growth is the fundamental reason why distributed databases must be used in networks.

Centralized systems are being replaced by distributed systems for various reasons: feasibility issues, security issues, limited bandwidth, organizational policies, cross-platform restrictions, etc. DDM systems have the following advantages over traditional centralized systems [1]. First, when the model is smaller than the data, transferring the model rather than the data reduces the load on the network. Second, sharing the model is more secure than sharing the data, which also addresses the security and privacy concerns of an organization. The key issues that determine the performance and utility of any DDM system can be listed [1] as:

- Data Copy: Data from one local site may have to be copied to other sites. A change in one copy that is not reflected in the other copies raises the problem of data inconsistency.
- Communication Cost: Unlike a centralized environment, where only I/O and CPU costs are considered, distributed data mining must also consider communication cost. This cost depends on the bandwidth and the amount of data transferred.
- Knowledge Integration: In a distributed scenario, the local results need to be integrated to form the global result. While doing so, the local results must be verified against the global result to avoid ambiguity.
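
As a rough illustration of the communication cost and knowledge integration issues listed above, the following Python sketch has each site send a small summary (a count and a sum) instead of its raw records, and a coordinator merges the summaries into a global mean. The site data, the names `local_summary` and `integrate`, and the choice of a mean as the "model" are assumptions made for illustration; they are not taken from any of the surveyed systems.

```python
from typing import List, Tuple

Summary = Tuple[int, float]  # (record count, sum of values)


def local_summary(values: List[float]) -> Summary:
    """Computed at a local site: a compact model of the local data."""
    return len(values), sum(values)


def integrate(summaries: List[Summary]) -> float:
    """Computed at the coordinator: merge local summaries into a global mean."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(partial for _, partial in summaries)
    if total_count == 0:
        raise ValueError("no records at any site")
    return total_sum / total_count


# Hypothetical data held at three sites (300 records in total).
sites = [
    [float(i) for i in range(0, 100)],
    [float(i) for i in range(100, 250)],
    [float(i) for i in range(250, 300)],
]
summaries = [local_summary(values) for values in sites]
global_mean = integrate(summaries)

# Numbers shipped over the network: two per site versus every record.
values_sent_as_models = 2 * len(sites)
values_sent_as_data = sum(len(values) for values in sites)
```

In this toy setting the integrated result equals the mean a centralized system would compute, while only six numbers cross the network instead of 300. Real DDM models are rarely this easy to merge, which is why knowledge integration is listed as a key issue.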

A general distributed data mining architecture is shown in Figure 1.

Figure 1: Distributed Data Mining Architecture

Figure not included in this excerpt.

The rest of this paper is structured as follows. The classification of different DDM methods and their categorization is explained in section 2. Section 3 discusses in detail some of the homogeneous methods. A comparative analysis is made in section 4. Section 5 concludes the paper.

2. Classification of DDM Systems

Of the many existing DDM systems, a few have been studied and classified into the categories and subcategories below. The classification hierarchy is shown in Figure 2.

2.1 Heterogeneous Vs. Homogeneous

Distributed data sources are partitions of a global virtual data table, and these partitions are mined separately. Depending on whether the table is partitioned horizontally or vertically, DDM systems are classified as homogeneous or heterogeneous, respectively [2].
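
The difference between the two kinds of partition can be sketched in a few lines of Python. The table contents, the column names and the helper functions `horizontal_partition` and `vertical_partition` are hypothetical and serve only to illustrate the idea: a horizontal split gives every site the same columns, while a vertical split gives each site different columns linked by a shared key.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical global virtual table; column names are illustrative only.
global_table: List[Row] = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 45, "income": 61000},
    {"id": 3, "age": 29, "income": 48000},
    {"id": 4, "age": 51, "income": 75000},
]


def horizontal_partition(table: List[Row], n_sites: int) -> List[List[Row]]:
    """Split by rows: every site holds the same columns (homogeneous case)."""
    return [table[i::n_sites] for i in range(n_sites)]


def vertical_partition(
    table: List[Row], column_groups: List[List[str]], key: str = "id"
) -> List[List[Row]]:
    """Split by columns: each site holds different attributes
    (heterogeneous case), linked across sites by a shared key."""
    return [
        [{key: row[key], **{col: row[col] for col in cols}} for row in table]
        for cols in column_groups
    ]


homogeneous_sites = horizontal_partition(global_table, 2)
heterogeneous_sites = vertical_partition(global_table, [["age"], ["income"]])
```

Here `homogeneous_sites` holds two sites with two complete rows each, whereas each entry of `heterogeneous_sites` holds all four rows but only one attribute plus the key.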

Figure 2: Schematic of DDM techniques

Figure not included in this excerpt.

2.1.1 Homogeneous DDM Systems

Homogeneous systems are formed when the global virtual table is divided horizontally. The centralized data mining system is also considered homogeneous: all the data are contained in a single DBMS and maintained by a single management model, so everything is treated as local. The subcategories of this type of system are described below.

2.1.1.1 DDM systems based on Data Mining Agents

Data mining agents are programs designed to find patterns in data, retrieve relevant data, monitor changes in data, and so on. Agents have properties such as self-governance, intelligence, persistence and cooperation [6]. Their automation, initiative, collaboration and adaptivity are used to obtain privacy, automation, cooperative mining and dynamic search capability, respectively [2]. EMADS, CAKE, i-Analyst, AATP, Mobile Agent based DDM, TREAMA and AOC are some of the models in this category.

2.1.1.2 DDM models based on Grid

A grid can be described as a non-interactive workload: the sum of the individual workloads that operate together to produce a final result [9]. Grids are mainly used for DDM because of advantages such as resource sharing, open services and cooperative working. Applications with geographically distributed data use this approach for mining. Methods in this category include DataMiningGrid, All Pairs and NORSC. Agent and grid technologies are combined to form an Agent Grid, which is used to formulate a model named AGrIP.

2.1.1.3 Meta-Learning based DDM systems

The term meta-learning is coined from the words learning and meta-data. Meta-learning means learning about the performance of applications by applying automatic learning algorithms to meta-data. Methods in this category include the DDM architecture by Tozicka et al. [1,2], which uses a data source agent for meta-learning, and another by Luo et al., which uses multiple agents to learn from meta-data. Others include SOA4KD, Weka4GML, XCS and EKDTO.
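
One common reading of meta-learning in a distributed setting is that base classifiers are trained locally and a meta-level then learns how much to trust each of them from their observed performance. The Python sketch below illustrates that reading only; it is not the method of Tozicka et al., Luo et al. or any other system named above. The threshold classifier, the accuracy-based weights and all sample values are assumptions chosen to keep the example small.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, int]          # (feature value, class label 0 or 1)
Classifier = Callable[[float], int]


def train_local(samples: List[Sample]) -> Classifier:
    """Base learner at one site: threshold halfway between the class means."""
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def accuracy(clf: Classifier, samples: List[Sample]) -> float:
    """Fraction of samples the classifier labels correctly."""
    return sum(clf(x) == y for x, y in samples) / len(samples)


def meta_combine(classifiers: List[Classifier], validation: List[Sample]) -> Classifier:
    """Meta-level: weight each base classifier by its validation accuracy,
    then predict by weighted vote."""
    weights = [accuracy(clf, validation) for clf in classifiers]

    def predict(x: float) -> int:
        vote_for_one = sum(w for clf, w in zip(classifiers, weights) if clf(x) == 1)
        return 1 if vote_for_one > sum(weights) / 2 else 0

    return predict


# Hypothetical local training sets at three sites and a shared validation set.
site_a = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]    # local threshold 5.0
site_b = [(3.0, 0), (5.0, 0), (9.0, 1), (11.0, 1)]   # local threshold 7.0
site_c = [(0.0, 0), (1.0, 0), (5.0, 1), (6.0, 1)]    # local threshold 3.0
validation = [(2.0, 0), (4.0, 0), (6.0, 1), (8.0, 1)]

base_classifiers = [train_local(s) for s in (site_a, site_b, site_c)]
global_classifier = meta_combine(base_classifiers, validation)
```

Only the trained classifiers and their validation scores reach the meta-level; the local training records stay at their sites.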

2.1.2 Heterogeneous Systems

Dividing the global virtual table vertically gives heterogeneous DDM models. Heterogeneous systems are all based on collective data mining frames (CDM).

2.1.2.1 DDM models based on CDM

CDM is used to improve the quality of local results and hence reduce ambiguity in the global results. It provides a DDM framework that guarantees correct local analysis and correct aggregation of local data models with minimal data communication. BODHI is a model in this category.

3. Methods & Architecture

This section explains some of the methods mentioned above in detail. It concentrates on the agent based methods and Grid based methods.

3.1 Extendible Multi Agent Data mining System

Abbreviated as EMADS, this method is a homogeneous DDM technique. EMADS is a multi-agent driven approach and is advantageous over Agent-driven data mining. The architecture of this model is shown in figure 3. EMADS agents are responsible for accessing local data sources and for collaborative data analysis. EMADS includes data mining agents, data agents, task agents, user agents and mediators.

illustration not visible in this excerpt

Figure 3: EMADS conceptual framework

The data and mining agents are responsible for accessing data and carrying out the data mining process. These agents work in parallel and share information through the task agent. The task agent coordinates the data mining operations, and presents results to the user agent. Mediators are used for agents’ coordination. Data mining is carried out by means of local data mining agents to preserve privacy. Depending on the modes of operation, EMADS can be used by:

- EMADS developers: They develop algorithms.
- End users: They have restricted access and perform data mining tasks.
- Contributors: They have restricted access and make data available for use.
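
The division of labour among the EMADS agents described above can be sketched roughly as follows. This is a minimal, hypothetical Python model, not EMADS code: the class names mirror the agent roles in the text, the mining task (item counting) and the data are invented, mediators are omitted, and the mining agents run one after another here rather than in parallel.

```python
from collections import Counter
from typing import Dict, List, Tuple


class DataAgent:
    """Gives access to one local data source (a hypothetical in-memory list)."""

    def __init__(self, records: List[str]) -> None:
        self._records = records

    def read(self) -> List[str]:
        return list(self._records)


class MiningAgent:
    """Mines one local source and returns only the result, not the raw data."""

    def __init__(self, data_agent: DataAgent) -> None:
        self._data_agent = data_agent

    def mine(self) -> Counter:
        return Counter(self._data_agent.read())


class TaskAgent:
    """Coordinates the mining agents and merges their local results."""

    def __init__(self, mining_agents: List[MiningAgent]) -> None:
        self._mining_agents = mining_agents

    def run(self) -> Dict[str, int]:
        merged: Counter = Counter()
        for agent in self._mining_agents:  # sequential here for simplicity
            merged.update(agent.mine())
        return dict(merged)


class UserAgent:
    """Presents the task agent's result to the user, most frequent item first."""

    def __init__(self, task_agent: TaskAgent) -> None:
        self._task_agent = task_agent

    def request(self) -> List[Tuple[str, int]]:
        result = self._task_agent.run()
        return sorted(result.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical local data at three sites.
local_sources = [["bread", "milk", "bread"], ["milk", "eggs"], ["bread", "eggs", "eggs"]]
task_agent = TaskAgent([MiningAgent(DataAgent(records)) for records in local_sources])
task_result = task_agent.run()
report = UserAgent(task_agent).request()
```

Because each mining agent returns only its counts, the raw records never leave their site, which reflects the privacy argument made for local data mining agents.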

3.2 CAKE

CAKE stands for Classifying, Associating & Knowledge Discovery. CAKE uses Parallel Data Mining Agents (PADMAs) [13]. CAKE has a user interface to display the results. The architecture of CAKE is as shown in figure 4.

illustration not visible in this excerpt

Figure 4: CAKE architecture

It is a 4-tier architecture [11] containing:

- Distributed data warehouses: They are physically or logically located at different sites.
- PADMAs: Based on their operation, they fall into three categories:
  - Rule-definer agents: They are used to define meta-data based on the rules.
  - Intelligent data mining agents: They are responsible for all the calculations needed to mine the data and produce the desired result.
  - Knowledge discovery agents: They determine whether the final output is a success or a failure.

[...]

Details

Pages: 5
Year: 2014
ISBN (eBook): 9783656929604
ISBN (Book): 9783656929611
File size: 802 KB
Language: English
Catalog Number: v294717
Institution / College: University of North Texas – Department of Computer Science
Grade: A
Tags: survey, distributed data mining systems
