IEEE Projects on Data Mining in Java

NNexus: An Automatic Linker for Collaborative Web-Based
Corpora
Abstract: In this paper, we introduce the Noosphere Networked Entry eXtension
and Unification System (NNexus), a generalization of the automatic linking
engine of Noosphere (at PlanetMath.org) and the first system that automates the
process of linking disparate “encyclopedia” entries into a fully connected
conceptual network. The main challenges of this problem space include: 1)
linking quality (correctly identifying which terms to link and which entry to link
to with minimal effort on the part of users), 2) efficiency and scalability, and 3)
generalization to multiple knowledge bases and web-based information
environments. We present the NNexus approach, which utilizes subject classification
and other metadata to address these challenges. We also present evaluation results
demonstrating the effectiveness and efficiency of the approach and discuss
ongoing and future directions of research.


Predicting Missing Items in Shopping Carts
Abstract: Existing research in association mining has focused mainly on how to
expedite the search for frequently co-occurring groups of items in “shopping cart”
type of transactions; less attention has been paid to methods that exploit these
“frequent itemsets” for prediction purposes. This paper contributes to the latter
task by proposing a technique that uses partial information about the contents of a
shopping cart for the prediction of what else the customer is likely to buy. Using
the recently proposed data structure of itemset trees (IT-trees), we obtain, in a
computationally efficient manner, all rules whose antecedents contain at least one
item from the incomplete shopping cart. Then, we combine these rules using
uncertainty-processing techniques, including classical Bayesian decision
theory and a new algorithm based on the Dempster-Shafer (DS) theory of
evidence combination.
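
The rule-combination step can be illustrated with a small example. Below is a minimal, hypothetical Java sketch (not the paper's IT-tree retrieval or its Dempster-Shafer machinery) that scores candidate items by combining, noisy-OR style, the confidences of all rules whose antecedents are contained in the partial cart:

    import java.util.*;

    public class CartCompletion {
        // A hypothetical association rule: antecedent -> consequent, with a confidence.
        record Rule(Set<String> antecedent, String consequent, double confidence) {}

        // Score each candidate item by combining the confidences of all applicable
        // rules with a simple noisy-OR; this stands in for the Bayesian and
        // Dempster-Shafer combination schemes described in the abstract.
        static Map<String, Double> predict(Set<String> cart, List<Rule> rules) {
            Map<String, Double> score = new HashMap<>();
            for (Rule r : rules) {
                if (cart.containsAll(r.antecedent()) && !cart.contains(r.consequent())) {
                    double prev = score.getOrDefault(r.consequent(), 0.0);
                    score.put(r.consequent(), 1.0 - (1.0 - prev) * (1.0 - r.confidence()));
                }
            }
            return score;
        }

        public static void main(String[] args) {
            List<Rule> rules = List.of(
                new Rule(Set.of("bread"), "butter", 0.7),
                new Rule(Set.of("bread", "milk"), "butter", 0.8),
                new Rule(Set.of("milk"), "cereal", 0.5));
            System.out.println(predict(Set.of("bread", "milk"), rules)); // butter ranks highest
        }
    }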



Hierarchically Distributed Peer-to-Peer Document
Clustering and Cluster Summarization
Abstract: In distributed data mining, adopting a flat node distribution model can
affect scalability. To address the problem of modularity, flexibility, and
scalability, we propose a Hierarchically distributed Peer-to-Peer (HP2PC)
architecture and clustering algorithm. The architecture is based on a multilayer
overlay network of peer neighborhoods. Supernodes, which act as representatives
of neighborhoods, are recursively grouped to form higher level neighborhoods.
Within a certain level of the hierarchy, peers cooperate within their respective
neighborhoods to perform P2P clustering. Using this model, we can partition the
clustering problem in a modular way across neighborhoods, solve each part
individually using a distributed K-means variant, and then successively combine
clusterings up the hierarchy where increasingly more global solutions are
computed. In addition, for document clustering applications, we summarize the
distributed document clusters using a distributed keyphrase extraction algorithm,
thus providing interpretation of the clusters. Results show decent speedup,
reaching 165 times faster than centralized clustering for a 250-node simulated
network, with comparable clustering quality to the centralized approach. We also
provide comparison to the P2P K-means algorithm and show that HP2PC
accuracy is better for typical hierarchy heights. Results for distributed cluster
summarization match those of their centralized counterparts with up to 88 percent
accuracy.
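
As a rough illustration of the local step in such a distributed K-means variant, the hypothetical Java sketch below assigns a peer's local document vectors to the current centroids and returns the per-cluster sums and counts that a supernode could merge with those of its neighborhood peers; the actual HP2PC protocol, hierarchy handling, and keyphrase summarization are not shown.

    import java.util.Arrays;

    public class LocalKMeansStep {
        // Result of one local assignment pass: per-cluster vector sums and counts,
        // which a (hypothetical) supernode would aggregate across its neighborhood.
        record Partial(double[][] sums, int[] counts) {}

        static Partial assignLocally(double[][] docs, double[][] centroids) {
            int k = centroids.length, dim = centroids[0].length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] doc : docs) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = doc[j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                counts[best]++;
                for (int j = 0; j < dim; j++) sums[best][j] += doc[j];
            }
            return new Partial(sums, counts);
        }

        public static void main(String[] args) {
            double[][] docs = {{1, 0}, {0.9, 0.1}, {0, 1}};
            double[][] centroids = {{1, 0}, {0, 1}};
            Partial p = assignLocally(docs, centroids);
            System.out.println(Arrays.deepToString(p.sums()) + " " + Arrays.toString(p.counts()));
        }
    }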


Effective Collaboration with Information Sharing in Virtual
Universities
Abstract: A global education system, as a key area in future IT, has encouraged
developers to provide various low-cost learning systems. While the advantages of
e-learning have been recognized for a long time and many advances in
e-learning systems have been implemented, the need for effective information
sharing in a secure manner has to date been largely ignored, especially for virtual
university collaborative environments. Information sharing of virtual universities
usually occurs in broad, highly dynamic network-based environments, and
formally accessing the resources in a secure manner poses a difficult and vital
challenge. This paper aims to build a new rule-based framework to identify and
address issues of sharing in virtual university environments through role-based
access control (RBAC) management. The framework includes a role-based group
delegation granting model, group delegation revocation model, authorization
granting, and authorization revocation. We analyze various revocations and the
impact of revocations on role hierarchies. The implementation with XML-based
tools demonstrates the feasibility of the framework and authorization methods.
Finally, the current proposal is compared with other related work.
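
A minimal Java sketch of the kind of role-based check such a framework performs is shown below; the role names, permissions, and delegation semantics are illustrative assumptions and do not reproduce the paper's group-delegation or revocation models.

    import java.util.*;

    public class RbacDemo {
        private final Map<String, Set<String>> rolePermissions = new HashMap<>();
        private final Map<String, Set<String>> userRoles = new HashMap<>();

        void grantPermission(String role, String permission) {
            rolePermissions.computeIfAbsent(role, r -> new HashSet<>()).add(permission);
        }

        void assignRole(String user, String role) {
            userRoles.computeIfAbsent(user, u -> new HashSet<>()).add(role);
        }

        // A simple delegation: the delegatee receives a role held by the delegator.
        void delegateRole(String delegator, String delegatee, String role) {
            if (userRoles.getOrDefault(delegator, Set.of()).contains(role)) {
                assignRole(delegatee, role);
            }
        }

        // Revocation removes the delegated role again.
        void revokeRole(String user, String role) {
            userRoles.getOrDefault(user, new HashSet<>()).remove(role);
        }

        boolean canAccess(String user, String permission) {
            return userRoles.getOrDefault(user, Set.of()).stream()
                    .anyMatch(r -> rolePermissions.getOrDefault(r, Set.of()).contains(permission));
        }

        public static void main(String[] args) {
            RbacDemo rbac = new RbacDemo();
            rbac.grantPermission("lecturer", "read:courseMaterial");
            rbac.assignRole("alice", "lecturer");
            rbac.delegateRole("alice", "bob", "lecturer");
            System.out.println(rbac.canAccess("bob", "read:courseMaterial")); // true
            rbac.revokeRole("bob", "lecturer");
            System.out.println(rbac.canAccess("bob", "read:courseMaterial")); // false
        }
    }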




Histogram-Based Global Load Balancing in Structured
Peer-to-Peer Systems
Abstract: Over the past few years, peer-to-peer (P2P) systems have rapidly grown
in popularity and have become a dominant means for sharing resources. In these
systems, load balancing is a key challenge because nodes are often heterogeneous.
While several load-balancing schemes have been proposed in the literature, these
solutions are typically ad hoc, heuristic based, and localized. In this paper, we
present a general framework, HiGLOB, for global load balancing in structured
P2P systems. Each node in HiGLOB has two key components: 1) a histogram
manager that maintains a histogram reflecting a global view of the distribution of
the load in the system, and 2) a load-balancing manager that redistributes the load
whenever the node becomes overloaded or underloaded. We exploit the routing
metadata to partition the P2P network into nonoverlapping regions corresponding
to the histogram buckets. We propose mechanisms to keep the cost of
constructing and maintaining the histograms low. We further show that our
scheme can control and bound the amount of load imbalance across the system.
Finally, we demonstrate the effectiveness of HiGLOB by instantiating it over
three existing structured P2P systems: Skip Graph, BATON, and Chord. Our
experimental results indicate that our approach works well in practice.
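
The overload test such a load-balancing manager might apply can be sketched in a few lines of Java; the bucket contents and tolerance factor below are assumptions for illustration, not HiGLOB's actual histogram construction.

    public class LoadHistogramCheck {
        // histogram[i] = average load reported for region (bucket) i of the network.
        static double globalAverage(double[] histogram) {
            double sum = 0;
            for (double v : histogram) sum += v;
            return sum / histogram.length;
        }

        // A node considers itself overloaded (or underloaded) when its own load
        // deviates from the global average by more than a tolerance factor.
        static boolean needsRebalancing(double ownLoad, double[] histogram, double tolerance) {
            double avg = globalAverage(histogram);
            return ownLoad > avg * tolerance || ownLoad < avg / tolerance;
        }

        public static void main(String[] args) {
            double[] histogram = {10.0, 12.0, 9.0, 50.0}; // per-region average loads
            System.out.println(needsRebalancing(50.0, histogram, 2.0)); // true: overloaded
            System.out.println(needsRebalancing(15.0, histogram, 2.0)); // false
        }
    }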


Improving Personalization Solutions through Optimal
Segmentation of Customer Bases
Abstract: On the Web, where the search costs are low and the competition is just
a mouse click away, it is crucial to segment the customers intelligently in order to
offer more targeted and personalized products and services to them. Traditionally,
customer segmentation is achieved using statistics-based methods that compute a
set of statistics from the customer data and group customers into segments by
applying distance-based clustering algorithms in the space of these statistics. In
this paper, we present a direct grouping-based approach to computing customer
segments that groups customers not based on computed statistics, but in terms of
optimally combining transactional data of several customers to build a data
mining model of customer behavior for each group. Then, building customer
segments becomes a combinatorial optimization problem of finding the best
partitioning of the customer base into disjoint groups. This paper shows that
finding an optimal customer partition is NP-hard, proposes several suboptimal
direct grouping segmentation methods, and empirically compares them among
themselves, traditional statistics-based hierarchical and affinity propagation-based
segmentation, and one-to-one methods across multiple experimental conditions. It
is shown that the best direct grouping method significantly dominates the
statistics-based and one-to-one approaches across most of the experimental
conditions, while still being computationally tractable. It is also shown that the
distribution of the sizes of customer segments generated by the best direct
grouping method follows a power law distribution and that microsegmentation
provides the best approach to personalization.


Storing and Indexing Spatial Data in P2P Systems
Abstract: The peer-to-peer (P2P) paradigm has become very popular for storing
and sharing information in a totally decentralized manner. At first, research
focused on P2P systems that host 1D data. Nowadays, the need for P2P
applications with multidimensional data has emerged, motivating research on P2P
systems that manage such data. The majority of the proposed techniques are based
either on the distribution of centralized indexes or on the reduction of
multidimensional data to one dimension. Our goal is to create from scratch a
technique that is inherently distributed and also maintains the multidimensionality
of data. Our focus is on structured P2P systems that share spatial information. We
present SPATIALP2P, a totally decentralized indexing and searching framework
that is suitable for spatial data. SPATIALP2P supports P2P applications in which
spatial information of various sizes can be dynamically inserted or deleted, and
peers can join or leave. The proposed technique preserves the locality and
directionality of the space well.


Olex: Effective Rule Learning for Text Categorization
Abstract: This paper describes Olex, a novel method for the automatic induction
of rule-based text classifiers. Olex supports a hypothesis language of the form “if
T1 or ... or Tn occurs in document d, and none of Tn+1, ..., Tn+m occurs in d,
then classify d under category c,” where each Ti is a conjunction of terms. The
proposed method is simple and elegant. Despite this, the results of a systematic
experimentation performed on the REUTERS-21578, the OHSUMED, and the
ODP data collections show that Olex provides classifiers that are accurate,
compact, and comprehensible. A comparative analysis conducted against some of
the most well-known learning algorithms (namely, Naive Bayes, Ripper, C4.5,
SVM, and Linear Logistic Regression) demonstrates that it is more than
competitive in terms of both predictive accuracy and efficiency.
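
Applying a learned hypothesis of this form to a document is straightforward. The Java sketch below evaluates one such rule (a disjunction of positive term conjunctions with excluded negative conjunctions) against a document's term set; the terms used are hypothetical and this is only the classification step, not Olex's rule-induction procedure.

    import java.util.*;

    public class OlexRuleDemo {
        // Classify d under the category if at least one positive conjunction of terms
        // occurs in d and none of the negative conjunctions occurs in d.
        static boolean matches(Set<String> docTerms,
                               List<Set<String>> positiveConjunctions,
                               List<Set<String>> negativeConjunctions) {
            boolean positive = positiveConjunctions.stream().anyMatch(docTerms::containsAll);
            boolean negative = negativeConjunctions.stream().anyMatch(docTerms::containsAll);
            return positive && !negative;
        }

        public static void main(String[] args) {
            List<Set<String>> pos = List.of(Set.of("crude", "oil"), Set.of("barrel"));
            List<Set<String>> neg = List.of(Set.of("cooking"));
            Set<String> doc = Set.of("crude", "oil", "prices");
            System.out.println(matches(doc, pos, neg)); // true: assign the category
        }
    }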


Exact Knowledge Hiding through Database Extension
Abstract: In this paper, we propose a novel, exact border-based approach that
provides an optimal solution for the hiding of sensitive frequent item sets by 1)
minimally extending the original database by a synthetically generated database
part - the database extension, 2) formulating the creation of the database extension
as a constraint satisfaction problem, 3) mapping the constraint satisfaction
problem to an equivalent binary integer programming problem, 4) exploiting
underutilized synthetic transactions to proportionally increase the support of
nonsensitive item sets, 5) minimally relaxing the constraint satisfaction problem
to provide an approximate solution close to the optimal one when an ideal
solution does not exist, and 6) using a partitioning in the universe of the items to
increase the efficiency of the proposed hiding algorithm. Extending the original
database for sensitive item set hiding is proved to provide optimal solutions to an
extended set of hiding problems compared to previous approaches and to provide
solutions of higher quality. Moreover, the application of binary integer
programming enables the simultaneous hiding of the sensitive item sets and thus
allows for the identification of globally optimal solutions.



Monitoring Online Tests through Data Visualization
Abstract: We present an approach and a system to let tutors monitor several
important aspects related to online tests, such as learner behavior and test quality.
The approach includes the logging of important data related to learner interaction
with the system during the execution of online tests and exploits data visualization
to highlight information useful to let tutors review and improve the whole
assessment process. We have focused on the discovery of behavioral patterns of
learners and conceptual relationships among test items. Furthermore, we have
conducted several experiments in our faculty in order to assess the whole approach. In
particular, by analyzing the data visualization charts, we have detected several
previously unknown test strategies used by the learners. Last, we have detected
several correlations among questions, which gave us useful feedback on the test
quality.


Evaluating the Effectiveness of Personalized Web Search
Abstract: Although personalized search has been under way for many years and
many personalization algorithms have been investigated, it is still unclear whether
personalization is consistently effective on different queries for different users and
under different search contexts. In this paper, we study this problem and provide
some findings. We present a large-scale evaluation framework for personalized
search based on query logs and then evaluate five personalized search algorithms
(including two click-based ones and three topical-interest-based ones) using 12-
day query logs of Windows Live Search. By analyzing the results, we reveal that
personalized Web search does not work equally well under various situations. It
represents a significant improvement over generic Web search for some queries,
while it has little effect and even harms query performance under some situations.
We propose click entropy as a simple measurement on whether a query should be
personalized. We further propose several features to automatically predict when a
query will benefit from a specific personalization algorithm. Experimental results
show that using a personalization algorithm for queries selected by our prediction
model is better than using it simply for all queries.
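
Click entropy itself is simple to compute from a query log. The hypothetical Java sketch below derives it from per-URL click counts for a single query; a low value (most clicks on one result) suggests little room for personalization, while a high value marks the query as a candidate for it.

    import java.util.Map;

    public class ClickEntropy {
        // Entropy of the click distribution over URLs for one query:
        // H = -sum over u of p(u|q) * log2 p(u|q), where p(u|q) is the fraction of clicks on u.
        static double clickEntropy(Map<String, Integer> clicksPerUrl) {
            double total = clicksPerUrl.values().stream().mapToInt(Integer::intValue).sum();
            double entropy = 0.0;
            for (int clicks : clicksPerUrl.values()) {
                if (clicks == 0) continue;
                double p = clicks / total;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
            return entropy;
        }

        public static void main(String[] args) {
            // Navigational query: almost everyone clicks the same URL, so entropy is low.
            System.out.println(clickEntropy(Map.of("a.com", 98, "b.com", 2)));
            // Ambiguous query: clicks spread over several URLs, so entropy is high.
            System.out.println(clickEntropy(Map.of("a.com", 30, "b.com", 40, "c.com", 30)));
        }
    }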



Similarity-Profiled Temporal Association Mining
Abstract: Given a time stamped transaction database and a user-defined
reference sequence of interest over time, similarity-profiled temporal association
mining discovers all associated item sets whose prevalence variations over time
are similar to the reference sequence. The similar temporal association patterns
can reveal interesting relationships of data items which co-occur with a particular
event over time. Most works in temporal association mining have focused on
capturing special temporal regulation patterns such as cyclic patterns and calendar
scheme-based patterns. However, our model is flexible in representing interesting
temporal patterns using a user-defined reference sequence. The dissimilarity
degree of the sequence of support values of an item set to the reference sequence
is used to capture how well its temporal prevalence variation matches the
reference pattern. By exploiting interesting properties such as an envelope of
support time sequence and a lower bounding distance for early pruning candidate
item sets, we develop an algorithm for effectively mining similarity-profiled
temporal association patterns. We prove the algorithm is correct and complete in
the mining results and provide the computational analysis. Experimental results
on real data as well as synthetic data show that the proposed algorithm is more
efficient than a sequential method using a traditional support-pruning scheme.
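
As a hedged illustration of the pruning idea (not the paper's exact envelope construction), the Java sketch below compares an itemset's support time sequence with the reference sequence using Euclidean distance, and uses an upper envelope derived from the itemset's subsets to obtain a cheap lower bound on that distance so that hopeless candidates can be discarded before their supports are counted.

    public class SupportSequencePruning {
        static double euclidean(double[] a, double[] b) {
            double sum = 0;
            for (int t = 0; t < a.length; t++) sum += (a[t] - b[t]) * (a[t] - b[t]);
            return Math.sqrt(sum);
        }

        // The support of an itemset at time t can never exceed upperEnvelope[t]
        // (e.g., the minimum support of its subsets), so every time point where the
        // reference lies above the envelope contributes at least
        // (reference[t] - upperEnvelope[t])^2 to the true squared distance.
        static double lowerBound(double[] upperEnvelope, double[] reference) {
            double sum = 0;
            for (int t = 0; t < reference.length; t++) {
                double gap = reference[t] - upperEnvelope[t];
                if (gap > 0) sum += gap * gap;
            }
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            double[] reference = {0.4, 0.5, 0.6, 0.5};
            double[] upperEnvelope = {0.2, 0.3, 0.3, 0.2}; // from the itemset's subsets
            double threshold = 0.3;
            if (lowerBound(upperEnvelope, reference) > threshold) {
                System.out.println("prune candidate without counting its supports");
            } else {
                double[] actualSupports = {0.1, 0.2, 0.2, 0.1};
                System.out.println("distance = " + euclidean(actualSupports, reference));
            }
        }
    }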


Ranking and Suggesting Popular Items
Abstract: We consider the problem of ranking the popularity of items and
suggesting popular items based on user feedback. User feedback is obtained by
iteratively presenting a set of suggested items, and users selecting items based on
their own preferences either from this suggestion set or from the set of all possible
items. The goal is to quickly learn the true popularity ranking of items (unbiased
by the suggestions made), and to suggest truly popular items. The difficulty is that
making suggestions to users can reinforce popularity of some items and distort the
resulting item ranking. The described problem of ranking and suggesting items
arises in diverse applications including search query suggestions and tag
suggestions for social tagging systems. We propose and study several algorithms
for ranking and suggesting popular items, provide analytical results on their
performance, and present numerical results obtained using the inferred popularity
of tags from a month-long crawl of a popular social bookmarking service. Our
results suggest that lightweight, randomized update rules that require no special
configuration parameters provide good performance.
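
One family of such lightweight randomized rules can be sketched as follows. The Java class below is a hypothetical variant, not necessarily one of the paper's exact rules: it keeps running selection counts for the ranking and, after each user selection, swaps the selected item into the suggestion set with a small probability, which limits the bias the suggestions themselves induce.

    import java.util.*;

    public class RandomizedSuggestions {
        private final Map<String, Integer> clickCounts = new HashMap<>();
        private final List<String> suggestionSet;
        private final double swapProbability;
        private final Random random = new Random();

        RandomizedSuggestions(List<String> initialSuggestions, double swapProbability) {
            this.suggestionSet = new ArrayList<>(initialSuggestions);
            this.swapProbability = swapProbability;
        }

        // Called whenever a user selects an item, from the suggestions or elsewhere.
        void recordSelection(String item) {
            clickCounts.merge(item, 1, Integer::sum);
            // Randomized update: occasionally replace a random suggestion with the
            // selected item instead of always promoting it.
            if (!suggestionSet.contains(item) && random.nextDouble() < swapProbability) {
                suggestionSet.set(random.nextInt(suggestionSet.size()), item);
            }
        }

        // The popularity ranking is read directly from the observed counts.
        List<String> ranking() {
            return clickCounts.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .map(Map.Entry::getKey)
                    .toList();
        }

        List<String> suggestions() { return List.copyOf(suggestionSet); }

        public static void main(String[] args) {
            RandomizedSuggestions rs = new RandomizedSuggestions(List.of("java", "p2p"), 0.2);
            for (int i = 0; i < 50; i++) rs.recordSelection(i % 3 == 0 ? "datamining" : "java");
            System.out.println(rs.ranking() + " " + rs.suggestions());
        }
    }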


A Relation-Based Page Rank Algorithm for Semantic Web
Search Engines
Abstract: With the tremendous growth of information available to end users
through the Web, search engines have come to play an ever more critical role.
Nevertheless, because of their general-purpose approach, it is increasingly
common for result sets to contain a large number of useless pages. The next-generation
Web architecture, represented by the Semantic Web, provides a
layered architecture that may allow this limitation to be overcome. Several search
engines have been proposed that increase information retrieval
accuracy by exploiting a key feature of Semantic Web resources, namely relations.
However, in order to rank results, most of the existing solutions need to work on
the whole annotated knowledge base. In this paper, we propose a relation-based
page rank algorithm to be used in conjunction with Semantic Web search engines
that simply relies on information that could be extracted from user queries and on
annotated resources. Relevance is measured as the probability that a retrieved
resource actually contains those relations whose existence was assumed by the
user at the time of query definition.


Optimal Lot Sizing Policies for Sequential Online Auctions
Abstract: This study proposes methods for determining the optimal lot sizes for
sequential online auctions that are conducted to sell sizable quantities of an item.
These auctions are common in business-to-consumer (B2C) settings. In these
auctions, the tradeoff for the auctioneer is between the alacrity with which funds
are received and the amount of funds collected by the faster clearing of inventory
using larger lot sizes. Observed bids in these auctions impact the auctioneer’s
decision on lot sizes in future auctions. We first present a goal programming
approach for estimating the bid distribution for the bidder population from the
observed bids, readily available in these auctions. We then develop models to
compute optimal lot sizes for both stationary and nonstationary bid distributions.
For stationary bid distributions, we present closed-form solutions and structural
results. Our findings show that the optimal lot size increases with inventory
holding costs and the number of bidders. Our model for nonstationary bid
distributions captures inter-auction dynamics such as the number of bidders,
their bids, past winning bids, and lot size. We use simulated data to test the
robustness of our model.



Clustering and Sequential Pattern Mining of Online
Collaborative Learning Data
Abstract: Group work is widespread in education. The growing use of online
tools supporting group work generates huge amounts of data. We aim to exploit
this data to support mirroring: presenting useful high-level views of information
about the group, together with desired patterns characterizing the behavior of
strong groups. The goal is to enable the groups and their facilitators to see
relevant aspects of the group’s operation, provide feedback on whether these are more
likely to be associated with positive or negative outcomes, and indicate where the
problems are. We explore how useful mirror information can be extracted via a
theory-driven approach and a range of clustering and sequential pattern mining.
The context is a senior software development project where students use the
collaboration tool TRAC. We extract patterns distinguishing the better from the
weaker groups and gain insights into the success factors. The results point to the
importance of leadership and group interaction, and give promising indications of
whether these are occurring. Patterns indicating good individual practices were also
identified. We found that some key measures can be mined from early data. The
results are promising for advising groups at the start and early identification of
effective and poor practices, in time for remediation.




An Efficient Algorithm for Web Recommendation Systems
Abstract: Different efforts have been made to address the problem of information
overload on the Internet. Web recommendation systems based on web usage
mining try to mine users’ behavior patterns from web access logs, and recommend
pages to the online user by matching the user’s browsing behavior with the mined
historical behavior patterns. In this paper, we propose an effective and scalable
technique to solve the web page recommendation problem. We use distributed
learning automata to learn the behavior of previous users and to cluster pages based
on the learned patterns. One of the challenging problems in recommendation systems
is dealing with unvisited or newly added pages. As they would never be
recommended, we need to provide an opportunity for these rarely visited or newly
added pages to be included in the recommendation set. By considering this
problem, and introducing a novel Weighted Association Rule mining algorithm,
we present an algorithm for recommendation purposes. We employ the HITS
algorithm to extend the recommendation set. We evaluate the proposed algorithm
under different settings and show how this method can improve the overall quality
of web recommendations.
Java
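
Since the abstract mentions HITS explicitly, a compact Java sketch of the standard HITS hub/authority iteration on a small page graph is given below; how the paper builds this graph from usage data and folds the scores into the recommendation set is assumed rather than shown.

    import java.util.Arrays;

    public class HitsDemo {
        // Standard HITS power iteration on an adjacency matrix:
        // authority(j) = sum of hub scores of pages linking to j,
        // hub(i)       = sum of authority scores of the pages i links to.
        static double[][] hits(boolean[][] links, int iterations) {
            int n = links.length;
            double[] hub = new double[n], auth = new double[n];
            Arrays.fill(hub, 1.0);
            Arrays.fill(auth, 1.0);
            for (int it = 0; it < iterations; it++) {
                double[] newAuth = new double[n], newHub = new double[n];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        if (links[i][j]) newAuth[j] += hub[i];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        if (links[i][j]) newHub[i] += newAuth[j];
                normalize(newAuth);
                normalize(newHub);
                auth = newAuth;
                hub = newHub;
            }
            return new double[][]{hub, auth};
        }

        static void normalize(double[] v) {
            double norm = Math.sqrt(Arrays.stream(v).map(x -> x * x).sum());
            if (norm > 0) for (int i = 0; i < v.length; i++) v[i] /= norm;
        }

        public static void main(String[] args) {
            // Tiny example graph: pages 0 and 1 both link to page 2; page 2 links to page 3.
            boolean[][] links = {
                {false, false, true, false},
                {false, false, true, false},
                {false, false, false, true},
                {false, false, false, false}};
            double[][] scores = hits(links, 20);
            System.out.println("hubs:        " + Arrays.toString(scores[0]));
            System.out.println("authorities: " + Arrays.toString(scores[1]));
        }
    }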




Extension of Protégé to support evolution of ontology
Abstract: An ontology needs to be modified over a period of time to reflect changes
in the real world, changes in the user’s requirements, and drawbacks in the initial
design. The changes allow incorporating additional functionalities and ensuring
incremental improvements. Although changes are inevitable during the
development of an ontology, most current ontology editors unfortunately do
not provide enough support for coping with changes efficiently. We have
developed an extension of the Protégé editor for automatically supporting the
evolution of ontologies and for guiding users in performing other tasks for which
their intervention is required. In this paper, we present the extension of Protégé
and show an application example representing the Tunisian higher
education system.


A metamodel of WSDL Web services using SAWSDL
semantic annotations
Abstract: The Web services technology is founded on the use of a number of
standards based particularly on XML, such as SOAP, WSDL, UDDI and BPEL.
However, these standards are not sufficient to allow the automation of the various
tasks of the Web service life cycle, namely discovery, invocation,
publication, and composition. Recently, the W3C consortium produced the
SAWSDL language. This language is a new standard enabling the description of
semantic Web services. It allows the semantic annotation of WSDL elements and
XML schemas. Within the framework of the MDA approach, this paper proposes
a semantic Web services metamodel founded on the SAWSDL language. This
metamodel would abstract away the complexity of using the
WSDL standard. It would also enable developers to easily implement semantic
Web service-based applications.




Adaptive focused crawler based on tunneling and link
analysis
Abstract: At present, using a focused crawler has become a way to seek needed
information. The main characteristic of a focused web crawler is to select and
retrieve only relevant web pages in each crawling process. In this paper, we
propose a learnable algorithm that combines link analysis with web content in
order to retrieve specific web documents, and it can predict the next URL through
learning. The algorithm also uses adaptive tunneling to overcome some of the
limitations of normal focused crawlers. We apply three metrics to compare its
efficiency with other well-known web crawling techniques.
Java
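
The core of any focused crawler is a frontier ordered by predicted relevance. The Java sketch below shows such a frontier with a placeholder relevance predictor and a simple tunneling allowance (a bounded number of low-relevance hops before a path is abandoned); the scoring model, threshold, and depth limit are assumptions, not the paper's learned algorithm.

    import java.util.*;

    public class FocusedCrawlerSketch {
        // A URL queued for crawling, with its predicted relevance and the number of
        // consecutive off-topic pages already traversed to reach it.
        record Candidate(String url, double predictedRelevance, int offTopicHops) {}

        static final double RELEVANCE_THRESHOLD = 0.5; // assumed cut-off
        static final int MAX_TUNNEL_DEPTH = 2;         // assumed tunneling allowance

        // Placeholder for a learned predictor combining link analysis and page content.
        static double predictRelevance(String url) {
            return url.contains("data-mining") ? 0.9 : 0.3;
        }

        public static void main(String[] args) {
            PriorityQueue<Candidate> frontier = new PriorityQueue<>(
                    Comparator.comparingDouble(Candidate::predictedRelevance).reversed());
            frontier.add(new Candidate("http://example.org/data-mining",
                    predictRelevance("http://example.org/data-mining"), 0));
            frontier.add(new Candidate("http://example.org/sports",
                    predictRelevance("http://example.org/sports"), 0));

            while (!frontier.isEmpty()) {
                Candidate next = frontier.poll();
                boolean relevant = next.predictedRelevance() >= RELEVANCE_THRESHOLD;
                // Tunneling: follow an off-topic page for a bounded number of hops,
                // hoping a relevant region lies behind it.
                if (!relevant && next.offTopicHops() >= MAX_TUNNEL_DEPTH) continue;
                System.out.println((relevant ? "crawl " : "tunnel through ") + next.url());
                // fetching the page and enqueuing its out-links would happen here
            }
        }
    }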



Effective Snippet Clustering with Domain Knowledge
Abstract: Clustering Web search results is a promising way to help alleviate the
information overload for Web users. In this paper, we focus on clustering snippets
returned by Google Scholar. We propose a novel similarity function based on
mining domain knowledge and an outlier-conscious clustering algorithm.
Experimental results show the improved effectiveness of the proposed approach
compared with existing methods.


CROEQS: Contemporaneous Role Ontology-based
Expanded Query Search —Implementation and Evaluation
Abstract: Searching annotated items in multimedia databases is becoming
increasingly important. The traditional approach is to build a search engine based
on textual metadata. However, in manually annotated multimedia databases, the
conceptual level of what is searched for might differ from the abstraction level of
the annotations of the items. To address this problem, we present CROEQS, a
semantically enhanced search engine. It allows the user to query the annotated
persons not only on their name, but also on their roles at the time the multimedia
item was broadcast. We also present the ontology used to expand such queries: it
allows us to semantically represent the domain knowledge on people fulfilling a
role during a temporal interval in general, and politicians holding a political office
specifically. The evaluation results show that query expansion using data retrieved
from the ontology considerably filters the result set, although there is a performance
penalty.


An Efficient Static Compressed Data Management System
for an Embedded DBMS
Abstract: Recently, embedded DBMSs have been widely used in mobile computing
devices to manage information efficiently. Moreover, flash memory is prevalent
in these devices as data storage. However, since flash memory restricts the number of
data I/O operations and is more expensive than a conventional magnetic hard disk, high
memory utilization has become necessary. Therefore, we present a Compressed
Data Management System to efficiently manage the memory for an embedded
DBMS. In addition, we demonstrate the efficiency of the proposed system through
various experimental results.
Java


Using Association Rule Mining to Improve Semantic Web
Services Composition Performance
Abstract: Web services are widely used on the World Wide Web. To create new
services, we can compose other developed services in the way we want to use
them. However, the large number of web services makes composing services a
time-consuming, and often impractical, job. Therefore, automated and
semi-automated composition approaches have been developed; one of these is
semantic web services (SWS). In this paper, we introduce a method based on
association rule mining techniques over web services to find the best composition
among the possible compositions, improving the quality and performance of web service
composition.


ApproxRank: Estimating rank for a subgraph
Abstract: Customized semantic query answering, personalized search, focused
crawlers and localized search engines frequently focus on ranking the pages
contained within a subgraph of the global Web graph. The challenge for these
applications is to compute PageRank-style scores efficiently on the subgraph, i.e.,
the ranking must reflect the global link structure of the Web graph but it must do
so without paying the high overhead associated with a global computation. We
propose a framework of an exact solution and an approximate solution for
computing ranking on a subgraph. The IdealRank algorithm is an exact solution
with the assumption that the scores of external pages are known. We prove that
the IdealRank scores for pages in the subgraph converge. Since the PageRank-style
scores of external pages may not typically be available, we propose the
ApproxRank algorithm to estimate scores for the subgraph. Both IdealRank and
ApproxRank represent the set of external pages with an external node Λ and
extend the subgraph with links to Λ. They also modify the PageRank-style
transition matrix with respect to Λ. We analyze the L1 distance between
IdealRank scores and ApproxRank scores of the subgraph and show that it is
within a constant factor of the L1 distance of the external pages (e.g., the true
PageRank scores and uniform scores assumed by ApproxRank). We compare
ApproxRank and a stochastic complementation approach (SC) [1], a current best
solution for this problem, on different types of subgraphs. ApproxRank has
similar or superior performance to SC and typically improves on the runtime
performance of SC by an order of magnitude or better. We demonstrate that
ApproxRank provides a good approximation to PageRank for a variety of
subgraphs.
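
A hedged Java sketch of the underlying idea, a PageRank-style power iteration on the subgraph extended with a single extra node standing in for all external pages, is given below; the link structure and the treatment of the external node are illustrative assumptions, not the paper's IdealRank or ApproxRank construction.

    import java.util.Arrays;

    public class SubgraphRankSketch {
        // PageRank-style power iteration on a subgraph extended with one extra node
        // (the last index) representing all external pages, in the spirit of the
        // Lambda node described above.
        static double[] rank(boolean[][] links, double damping, int iterations) {
            int n = links.length;
            double[] r = new double[n];
            Arrays.fill(r, 1.0 / n);
            int[] outDegree = new int[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (links[i][j]) outDegree[i]++;
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - damping) / n);
                for (int i = 0; i < n; i++) {
                    if (outDegree[i] == 0) {                 // dangling node: spread evenly
                        for (int j = 0; j < n; j++) next[j] += damping * r[i] / n;
                    } else {
                        for (int j = 0; j < n; j++)
                            if (links[i][j]) next[j] += damping * r[i] / outDegree[i];
                    }
                }
                r = next;
            }
            return r;
        }

        public static void main(String[] args) {
            // Three subgraph pages (0..2) plus one external node (index 3) that links
            // into the subgraph and receives the subgraph's outgoing links.
            boolean[][] links = {
                {false, true, false, true},
                {false, false, true, false},
                {true, false, false, false},
                {true, true, true, false}};
            System.out.println(Arrays.toString(rank(links, 0.85, 50)));
        }
    }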

BioNav: Effective Navigation on Query Results of
Biomedical Databases
Abstract: Search queries on biomedical databases like PubMed often return a
large number of results, only a small subset of which is relevant to the user.
Ranking and categorization, which can also be combined, have been proposed to
alleviate this information overload problem. Results categorization for biomedical
databases is the focus of this work. A natural way to organize biomedical citations
is according to their MeSH annotations, a comprehensive concept hierarchy used
by PubMed. In this paper, we present the BioNav system, a novel search interface
that enables the user to navigate a large number of query results by organizing them
using the MeSH concept hierarchy. First, the query results are organized into a
navigation tree. Previous works expand the hierarchy in a predefined static
manner. In contrast, BioNav uses an intuitive navigation cost model to decide
what concepts to display at each step. Another difference from previous works is
that the hierarchy is not strictly displayed level-by-level.
Java


iBroker: An Intelligent Broker for Ontology Based
Publish/Subscribe Systems
Abstract: In this paper, we present iBroker, an Intelligent Broker for ontology-based
publish/subscribe systems that syntactically and semantically matches
incoming OWL data against multiple user profiles. iBroker effectively manages user
profiles based on the semantics of query patterns in user profiles formulated in
SPARQL. iBroker uses a semantic matching algorithm to efficiently process
OWL data and generate the complete results for user profiles, considering the core
semantics of OWL. Experimental results demonstrate that iBroker is more
efficient and scalable compared to an existing broker for ontology based
publish/subscribe systems.

Contextual Ranking of Keywords Using Click Data
Abstract: The problem of automatically extracting the most interesting and
relevant keyword phrases in a document has been studied extensively as it is
crucial for a number of applications. These applications include contextual
advertising, automatic text summarization, and user-centric entity detection
systems. All these applications can potentially benefit from a successful solution
as it enables computational efficiency (by decreasing the input size), noise
reduction, or overall improved user satisfaction. In this paper, we study this
problem and focus on improving the overall quality of user-centric entity
detection systems. First, we review our concept extraction technique, which relies
on search engine query logs. We then define a new feature space to represent the
interestingness of concepts, and describe a new approach to estimate their
relevancy for a given context. We utilize click-through data obtained from a
large-scale user-centric entity detection system, Contextual Shortcuts, to train a
model to rank the extracted concepts, and we evaluate the resulting model extensively,
again based on click-through data. Our results show that the learned model
outperforms the baseline model, which employs similar features but whose
weights are tuned carefully based on empirical observations, and reduces the error
rate from 30.22% to 18.66%.


Deriving Private Information from Association Rule Mining
Results
Abstract: Data publishing can provide enormous benefits to the society.
However, due to privacy concerns, data cannot be published in their original
forms. Two types of data publishing can address the privacy issue: one is to
publish the sanitized version of the original data, and the other is to publish the
aggregate information from the original data, such as data mining results. There
have been extensive studies to understand the privacy consequences of the first
approach, but there has not been much investigation into the privacy consequences of
publishing data mining results, although it is widely believed that publishing data
mining results can lead to the disclosure of private information. We propose a
systematic method to study the privacy consequence of data mining results. Based
on a well-established theory, the principle of maximum entropy, we have
developed a method to precisely quantify the privacy risk when data mining
results are published. We take the association rule mining as an example in this
paper, and demonstrate how we quantify the privacy risk based on the published
association rules. We have conducted experiments to evaluate the effectiveness
and performance of our method. We have drawn several interesting observations
from our experiments.
