Notes 4 Learners: DMDW Unit-4

Unit-4

Multidimensional Analysis and Descriptive Mining of Complex Data Objects

What are complex data object?

1. Many advanced, data-intensive applications, such as scientific research and engineering design, need to store, access, and analyze complex but relatively structured data objects.

2. These objects cannot be represented as simple and uniformly structured records (i.e., tuples) in data relations.

3. These kinds of systems deal with the efficient storage and access of vast amounts of disk-based complex structured data objects.

4. These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies.

5. Each object in a class is associated with (1) an object-identifier, (2) a set of attributes that may contain sophisticated data structures, set- or list-valued data, class composition hierarchies, multimedia data, and (3) a set of methods that specify the computational routines or rules associated with the object class.

How can we Generalize the Structured Data

Typically, set-valued data can be generalized by:-

(1) Generalization of each value in the set to its corresponding higher-level concept,

(2) Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, the weighted average for numerical data, or the major clusters formed by the set.

Suppose that the hobby of a person is a set-valued attribute containing the set of values {tennis, hockey, soccer, violin, this set can be generalized to a set of high-level concepts, such as {sports, music, computer games}

How Aggregation and Approximation is done in Spatial and Multimedia Data Generalization

1. Aggregation and approximation are another important means of generalization

2. In a spatial merge, it is necessary to not only merge the regions of similar types within the same general class but also to compute the total areas, average density, or other aggregate functions while ignoring some scattered regions with different types if they are unimportant to the study.

3. Spatial aggregation and approximation. Suppose that we have different pieces of land for various purposes of agricultural usage, such as the planting of vegetables, grains, and fruits. These pieces can be merged or aggregated into one large piece of agricultural land by a spatial merge. However, such a piece of agricultural land may contain highways, houses, and small stores. If the majority of the land is used for agriculture, the scattered regions for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation

4. A multimedia database may contain complex texts, graphics, images, video fragments, maps, voice, music, and other forms of audio/video information

5. Generalization on multimedia data can be performed by recognition and extraction of the essential features and/or general patterns of such data. There are many ways to extract such information. For an image, the size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image can be extracted by aggregation and/or approximation. For a segment of music, its melody can be summarized based on the approximate patterns that repeatedly occur in the segment, while its style can be summarized based on its tone, tempo, or the major musical instruments played.

What is Spatial Data Mining?

1. A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial databases have many features distinguishing them from relational databases.

2. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands an integration of data mining with spatial database technologies.

3. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.

4. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used.

5. A crucial challenge to spatial data mining is the exploration of efficient spatial data mining techniques due to the huge amount of spatial data and the complexity of spatial data types and spatial access methods.

“Can we construct a spatial data warehouse?” Yes, as with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-datarelated decision-making processes.

There are three types of dimensions in a spatial data cube

1. A nonspatial dimension contains only nonspatial data. (such as “hot” for temperature and “wet” for precipitation)

2. A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but whose generalization, starting at a certain high level, becomes nonspatial.eg:-city

3. A spatial-to-spatial dimension is a dimension whose primitive level and all of its highlevel generalized data are spatial.eg:equitemp.

Two types of measures in a spatial data cube:

1. A numerical measure contains only numerical data. For example, one measure in a

spatial data warehouse could be the monthly revenue of a region

2. A spatial measure contains a collection of pointers to spatial objects.eg temperature and precipitation

There are several challenging issues regarding the construction and utilization of spatial datawarehouses.

1. The first challenge is the integration of spatial data from heterogeneous sources and systems.

2. The second challenge is the realization of fast and flexible on-line analytical processing in spatial data warehouses.

Mining Spatial Association

A spatial association rule is of the form A=>B [s%;c%], where A and B are sets of spatial or nonspatial predicates, s% is the support of the rule, and c%is the confidence of the rule. For example, the following

is a spatial association rule: is a(X; “school”)^close to(X; “sports center”))=>close to(X; “park”) [0:5%;80%].

This rule states that 80% of schools that are close to sports centers are also close to parks, and 0.5% of the data belongs to such a case.

------------------------------------------------------------------------------------------------à

What is Multimedia Data Mining?

A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages.

Similarity Search in Multimedia Data

For similarity searching in multimedia data, we consider two main families of multimedia indexing and retrieval systems:

(1) description-based retrieval systems, which build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation;

(2) content-based retrieval systems, which support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.

In a content-based image retrieval system, there are often two kinds of queries:

Image sample- based queries and image feature specification queries.

· Image-sample-based queries find all of the images that are similar to the given image sample. This search compares the feature vector (or signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database. Based on this comparison, images that are close to the sample image are returned.

· Image feature specification queries specify or sketch image features like color, texture, or shape, which are translated into a feature vector to be matched with the feature vectors of the images in the database

Mining Associations in Multimedia Data

1. Associations between image content and non image content features:

A rule like “If at least 50% of the upper part of the picture is blue, then it is likely to represent sky” belongs to this category since it links the image content to the keyword sky.

2. Associations among image contents that are not related to spatial relationships: A rule like “If a picture contains two blue squares, then it is likely to contain one red circle a swell” belongs to this category since the associations are all regarding image contents.

3. Associations among image contents related to spatial relationships: A rule like “If a red triangle is between two yellow squares, then it is likely a big oval-shaped object is underneath” belongs to this category since it associates objects in the image with spatial relationship.

Several approaches have been proposed and studied for similarity-based retrieval in image databases, based on image signature

1. Color histogram–based signature: In this approach, the signature of an image includes color histograms based on the color composition of an image regardless of its scale or orientation. This method does not contain any information about shape, image topology, or texture.

2. Multifeature composed signature: In this approach, the signature of an image includes a composition of multiple features: color histogram, shape, image topology, and texture.

“Can we construct a data cube for multimedia data analysis?” To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes from relational data. A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.

What is Text Mining?

Text databases (or document databases), which consist of large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, and Web pages. Text databases are rapidly growing due to the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic documents, e-mail, and the World Wide Web .

What is IR(Information Retrieval System)?

A typical information retrieval problem is to locate relevant documents in a document collection based on a user’s query, which is often some keywords describing an information need, although it could also be an example relevant document. In such a search problem, a user takes the initiative to “pull” the relevant information out from the collection; this is most appropriate when a user has some ad hoc (i.e., short-term)information need, such as finding information to buy a used car. When a user has a long-term information need (e.g., a researcher’s interests), a retrieval system may also take the initiative to “push” any newly arrived information item to a user if the item is judged as being relevant to the user’s information need. Such an information access

process is called information filtering, and the corresponding systems are often called filtering

systems or recommender systems.

Basic Measures for Text Retrieval: Precision and Recall

“Suppose that a text retrieval system has just retrieved a number of documents for me based

on my input in the form of a query. How can we assess how accurate or correct the system

was?”

Precision: This is the percentage of retrieved documents that are in fact relevant to

the query (i.e., “correct” responses). It is formally defined as

Recall: This is the percentage of documents that are relevant to the query and were,

in fact, retrieved. It is formally defined as

How Mining theWorld WideWeb is done?

The World Wide web serves as a huge,widely distributed, global information service center

for news, advertisements, consumer information, financial management, education,

government, e-commerce, and many other information services. The Web also contains

a rich and dynamic collection of hyperlink information and Web page access and usage

information, providing rich sources for data mining.

challenges for effective resource and knowledge discovery in web

1. The Web seems to be too huge for effective data warehousing and data mining. The size of the Web is in the order of hundreds of terabytes and is still growing rapidly. Many organizations and societies place most of their public-accessible information on the Web. It is barely possible to set up a data warehouse to replicate, store, or integrate all of the data on the Web.

2. The complexity of Web pages is far greater than that of any traditional text document

collection. Web pages lack a unifying structure.

3. The Web is a highly dynamic information source. Not only does the Web grow rapidly,

but its information is also constantly updated.

4. TheWeb serves a broad diversity of user communities. The Internet currently connects

more than 100 million workstations, and its user community is still rapidly expanding.

These challenges have promoted research into efficient and effective discovery and use of resources on the Internet.

Mining the WWW

1. Mining theWeb Page Layout Structure

The basic structure of a Web page is its DOM(Document Object Model) structure. The DOM structure of a Web page is a tree structure, where every HTML tag in the page corresponds to a node in the DOM tree. The Web page can be segmented by some predefined structural tags. Thus the DOM structure can be used to facilitate information extraction.

Here, we introduce an algorithm called VIsion-based Page Segmentation (VIPS).

VIPS aims to extract the semantic structure of a Web page based on its visual presentation

2. Mining the Web’s Link Structures to Identify Authoritative Web Pages

The Web consists not only of pages, but also of hyperlinks pointing from one page to another.

These hyperlinks contain an enormous amount of latent human annotation that can help automatically infer the notion of authority. These properties of Web link structures have led researchers to consider another important category of Web pages called a hub. A hub is one or a set ofWeb pages that provides collections of links to authorities.

What are the various Data Mining Applications?

· Data Mining for Financial Data Analysis-

Design and construction of data warehouses for multidimensional data analysis and

data mining: Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality, which facilitates systematic data analysis and data mining. One may like to view the debt and revenue changes by month, by region, by sector, and by other factors, along with maximum, minimum, total, average, trend, and other statistical information.

Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance and customer credit rating.

Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing.

For example, we can use classification to identify the most crucial factors that may influence a customer’s decision regarding banking. Customers with similar behaviors regarding loan payments may be identified by multidimensional clustering techniques.

Detection of money laundering and other financial crimes: To detect money laundering

and other financial crimes, it is important to integrate information from multiple

databases (like bank transaction databases, and federal or state crime history databases), as long as they are potentially related to the study

· Data Mining for the Retail Industry

Design and construction of data warehouses based on the benefits of data mining: Because retail data cover a wide spectrum (including sales, customers, employees, goods transportation, consumption, and services), there can be many ways to design a data warehouse for this industry.

Multidimensional analysis of sales, customers, products, time, and region:

The retail industry requires timely information regarding customer needs, product sales, trends,and fashions, as well as the quality, cost, profit, and service of commodities

Analysis of the effectiveness of sales campaigns: The retail industry conducts sales

campaigns using advertisements, coupons, and various kinds of discounts and bonuses

to promote products and attract customers

Customer retention—analysis of customer loyalty: With customer loyalty card information,

one can register sequences of purchases of particular customers. Customer loyalty and purchase trends can be analyzed systematically

Product recommendation and cross-referencing of items: By mining associations

from sales records, one may discover that a customer who buys a digital camera is

likely to buy another set of items. Such information can be used to form product

recommendations. Collaborative recommender systems use data mining techniques

to make personalized product recommendations during live customer transactions,

based on the opinions of other customers

· Data Mining for the Telecommunication Industry

Fraudulent pattern analysis and the identification of unusual patterns:

Fraudulent activity costs the telecommunication industry millions of dollars per year. It

is important to (1) identify potentially fraudulent users and their atypical usage patterns;

(2) detect attempts to gain fraudulent entry to customer accounts; and(3) discover unusual patterns that may need special attention, such as busy-hour frustrated call attempts, switch and route congestion patterns, and periodic calls from automatic dial-out equipment (like fax machines) that have been improperly programmed

Multidimensional association and sequential pattern analysis: The discovery of association

and sequential patterns in multidimensional analysis can be used to promote telecommunication services. For example, suppose you would like to find usage patterns for a set of communication services by customer group, by month, and by time of day.

Mobile telecommunication services: Mobile telecommunication, Web and information

services, and mobile computing are becoming increasingly integrated and

common in our work and life.

Explain the Social Impacts of Data Mining?

· Ubiquitous data mining is the ever presence of data mining in many aspects of our daily lives. It can influence how we shop, work, search for information, and use a computer, as well as our leisure time, health, and well-being. In invisible data mining, “smart” software, such as Web search engines, customer-adaptive Web services (e.g., using recommender algorithms), e-mail managers, and so on, incorporates data mining into its functional components, often unbeknownst to the user.

· From grocery stores that print personalized coupons on customer receipts to on-line stores that recommend additional items based on customer interests, data mining has innovatively influenced what we buy, the way we shop, as well as our experience while shopping.

· Data mining has shaped the on-line shopping experience. Many shoppers routinely turn to on-line stores to purchase books, music, movies, and toys

· Many companies increasingly use data mining for customer relationship management (CRM), which helps provide more customized, personal service addressing individual customer’s needs, in lieu of mass marketing

· While you are viewing the results of your Google query, various ads pop up relating

to your query. Google’s strategy of tailoring advertising to match the user’s interests is

successful—it has increased the clicks for the companies involved by four to five times.

· Web-wide tracking is a technology that tracks a user across each site she visits. So,while

Surfing the Web, information about every site you visit may be recorded,which can provide

marketers with information reflecting your interests, lifestyle, and habits

· Finally, data mining can contribute toward our health and well-being. Several pharmaceutical companies use data mining software to analyze data when developing drugs and to find associations between patients, drugs, and outcomes. It is also being used to detect beneficial side effects of drugs

What are the Major concern of Data mining?

A major social concern of data mining is the issue of privacy and data security, particularly as the amount of data collected on individuals continues to grow. Fair information practices were established for privacy and data protection and cover aspects regarding the collection and use of personal data. Data mining for counterterrorism can benefit homeland security and save lives, yet raises additional concerns for privacy due to the possible access of personal data. Efforts towards

ensuring privacy and data security include the development of privacy-preserving data mining (which deals with obtaining valid data mining results without learning the underlying data values) and data security–enhancing techniques (such as encryption)

What are the Recent trends in Data mining?

Trends in data mining include further efforts toward the exploration of new application areas, improved scalable and interactive methods (including constraint-based mining), the integration of data mining with data warehousing and database systems, the standardization of data mining languages, visualization methods, and new methods for handling complex data types. Other trends include biological data mining, mining software bugs, Web mining, distributed and real-time mining, graph mining, social network analysis, multi relational and multi database data mining, data privacy protection, and data security

……………………………………………………………………………….

Notes 4 Learners

DMDW Unit-4

7 comments:

FEEDBACK

Blog Archive