🇧🇷 View this post in Portuguese ↗
Welcome!
By Felipe Lamounier, state of Minas Gerais, Brazil – powered by 🙂My Easy B.I.
Data mining is a set of techniques derived from Statistics and Artificial Intelligence (AI) whose specific purpose is to discover new knowledge that may be hidden within large amounts of data.
Data mining is the process of discovering new knowledge hidden in large amounts of data.
📑 Table of Contents:
- Datamining: The Intelligence of Business
- The Stages of Data Mining in Business
- The Data Mining Methodology
- Data Mining Techniques
- Artificial Intelligence
- Data Mining Tools
- Conclusion
Datamining: The Intelligence of Business
Data mining specifically involves discovering relationships between products, classifying consumers, predicting sales, identifying potentially profitable geographic areas for new branches, and inferring needs, among other tasks.
We define data mining as the use of automatic techniques to explore large amounts of data in order to discover new patterns and relationships that, due to the volume of data, would not be easily uncovered by the human eye.
The algorithms and formulas that form the basis for data mining techniques are old, but it is only in recent years that they have been used for data exploration, for several reasons:
- The volume of available data is now enormous: data mining only applies to large masses of data, which it needs in order to calibrate its algorithms and extract reliable conclusions.
- Data is increasingly being organized and made accessible for analysis.
- Computational resources are powerful: data mining requires substantial computational resources to run its algorithms over large amounts of data. Advances in distributed databases have also helped.
The Stages of Data Mining in Business
Data mining is about turning “bytes” into business returns. Data mining techniques provide the means to discover interesting relationships, but for them to be truly useful, the company must approach their use as a whole, acting proactively rather than reactively.
Problem Identification
The first phase of the Datamining process is problem identification, which entails defining the objectives to be achieved.
Knowledge Discovery
The second phase of the data mining process is the discovery of new relationships that are not identifiable to the naked eye but can be revealed by automated Artificial Intelligence procedures, through a systematic and exhaustive analysis of the thousands of customer records in the company’s databases.
Analysis of Discovered Relationships
Once the phase of data mining related to discovering new relationships is complete, the phase of analyzing the discovered relationships begins. Usually, this phase relies on human reasoning to be evaluated. However, with the increasing power of Artificial Intelligence, in some cases, this analysis can be automated.
Use of Discovered Relations
Once the analysis of the discovered relationships is complete, the phase of using them begins: decisions are made to put the relationships provided by the data mining process to the best possible use.
Evaluation of Results
The last phase of the datamining process is the evaluation of the results, as only after a thorough evaluation can we truly affirm that the causes of the problem to be solved have been addressed or the company’s objectives have been achieved.
In summary, the Phases are:
- Identification of a problem or definition of a goal to be achieved;
- Discovery of new relationships by data mining techniques;
- Human analysis of the newly discovered relationships;
- Rational use of new relationships discovered;
- Evaluation of the results.
The Data Mining Methodology
Data mining can be performed in three different ways, depending on the level of knowledge one has about the studied problem. If nothing is known about the behavior of the phenomenon, one can simply let the automatic data mining techniques search for “new” hidden relationships in the data that would not be easily identifiable to the naked eye. This method is called unsupervised discovery of relationships. When there is some knowledge about the field of operation of the company or some idea of the new relationship being sought, one can define a hypothesis and verify its confirmation or refutation through the data mining methodology known as hypothesis testing. Finally, when there is a higher level of knowledge about the area and the relationship that is being investigated, one proceeds with the data modeling methodology.
Unsupervised Discovery of Relationships
When there is no specific problem to be solved, computers are allowed to freely search the databases using data mining algorithms. This “searching” is unconstrained by any predetermined relationship; it is simply an exhaustive observation of the data in order to, perhaps, discover a new and useful relationship. The lack of commitment in this quest for something new justifies the adjective “unsupervised” in the technique’s name. Each time the unsupervised discovery technique is used, many “new” relationships emerge. These relationships are then reported, and a human analyst must examine them to separate the truly interesting ones from the useless ones.
Hypothesis Testing
The analyst who examines the discovered relationships can raise hypotheses associated with them. For example, noticing that chocolate consumers are concerned with their appearance and health and consume reasonable amounts of diet products, one might suppose that these consumers also commonly use beauty products. To verify whether this hypothesis is true, the analyst employs the data mining methodology called hypothesis testing. Suppose data mining concludes that this group of consumers does indeed invest reasonably in beauty products. This new information can then be used to place the beauty products shelf next to the chocolate and diet products shelf, increasing sales by reminding consumers of their consumption habits.
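As a toy illustration, the hypothesis above could be checked with a few lines of Python over hypothetical purchase records, comparing how often beauty products appear among diet-product buyers versus among all customers; the data and column names are invented for the sketch:

```python
import pandas as pd

# Hypothetical purchase records: one row per customer, flags per product group.
purchases = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "buys_diet":   [1, 1, 0, 1, 0, 1, 0, 0],
    "buys_beauty": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Hypothesis: diet-product buyers also buy beauty products more often than average.
p_beauty = purchases["buys_beauty"].mean()
p_beauty_given_diet = purchases.loc[purchases["buys_diet"] == 1, "buys_beauty"].mean()
lift = p_beauty_given_diet / p_beauty

print(f"P(beauty) = {p_beauty:.2f}")
print(f"P(beauty | diet) = {p_beauty_given_diet:.2f}")
print(f"Lift = {lift:.2f}  (lift > 1 supports the hypothesis)")
```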
Mathematical Modeling of Data
Finally, the analyst may wish to get to know this consumer better by obtaining data about their economic level, as it would be interesting to assess whether this consumer has the means to invest in fine, imported chocolates, in addition to the domestic ones already sold by the supermarket. To do so, the analyst needs to use the data mining methodology called data modeling and assess whether the quantity of consumers of this type and their purchasing power would justify the profitable creation of a new section for imported chocolates.
Mathematical relationships between the data will then be created, allowing the analyst to check profit margins and sales forecasts for these potential consumers of imported chocolate based on their profile and purchasing power.
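A minimal sketch of what such a model might look like, using scikit-learn’s linear regression on hypothetical consumer profiles; the features (income and current chocolate spend) and all figures are assumptions for illustration, not data from the text:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [monthly_income, current_chocolate_spend] per customer.
X = np.array([
    [2500,  40], [3200,  55], [4800,  80], [6000, 120],
    [7500, 150], [9000, 200], [3800,  60], [5200,  95],
])
# Observed monthly spend on imported chocolate for those same customers.
y = np.array([10, 15, 30, 55, 70, 95, 20, 40])

model = LinearRegression().fit(X, y)

# Forecast imported-chocolate spend for a new consumer profile.
new_profile = np.array([[5500, 100]])
forecast = model.predict(new_profile)[0]
print(f"Expected monthly spend on imported chocolate: {forecast:.2f}")
```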
Data Mining Techniques
Any of the three possible data mining methodologies essentially require the same techniques for their execution. These techniques are of a generic nature and can be implemented using different tools such as Artificial Neural Networks, Statistics, or Symbolic Artificial Intelligence.
There are a large number of basic techniques; however, five general techniques encompass all the others and allow for a more comprehensive and appropriate understanding of the subject:
- Classification
- Estimation
- Prediction
- Affinity Analysis
- Cluster Analysis
Classification
Classification is one of the most used techniques in data mining, simply because it is one of the most performed cognitive tasks by humans to aid in understanding the environment we live in. Humans are always classifying what they perceive around them, creating classes of relationships. When humans receive any stimulus from the environment and prepare to respond to it, they instinctively classify this stimulus into categories of other stimuli they have received in the past and for which they have a ready and immediate response.
The task of classification typically involves comparing an object or data with other data or objects that are supposed to belong to previously defined classes. To compare data or objects, a metric or measure of differences between them is used.
In data mining, classification tasks are common: classifying customers as low, medium, or high bank-loan risk; identifying potential consumers of a given product by their profile; flagging financial transactions as legal, illegal, or suspicious in a market-surveillance system; among many others.
Artificial Neural Networks, Statistics, and Genetic Algorithms are some of the widely used tools for data classification.
Classifying an object means determining which group of previously classified entities this object has the greatest similarity to.
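For illustration, a minimal classification sketch in Python: a k-nearest-neighbors classifier assigns a new customer to the loan-risk class of the most similar previously classified customers, using Euclidean distance as the measure of difference. All figures and feature names are hypothetical:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical customers already classified: [income, existing_debt] -> risk class.
X_train = [[2000, 1500], [2500, 2000], [6000, 500], [7000, 300],
           [4000, 2500], [8000, 4000], [3000, 400], [9000, 800]]
y_train = ["high", "high", "low", "low", "medium", "medium", "low", "low"]

# Classification by similarity: the new customer receives the class of the
# nearest (most similar) previously classified customers.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

new_customer = [[5000, 2200]]
print(clf.predict(new_customer))
```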
Estimation
Estimating an index is determining its most likely value based on past data or on data from similar indices that are known. The art of estimation is precisely this: determining the best possible value based on values from situations that are similar, though never exactly the same.
Artificial Neural Networks, Statistics, Genetic Algorithms, and Simulated Annealing are some of the tools widely used for estimating quantities.
Estimating a magnitude is evaluating it based on similar cases in which this magnitude is present.
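A small sketch of estimation by similarity, under assumed data: the unknown value is estimated from the k most similar known cases. The example (property area, rooms, and prices) is entirely hypothetical:

```python
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical past cases: [floor_area_m2, rooms] -> known sale price.
X_known = [[50, 2], [65, 3], [80, 3], [90, 4], [120, 4], [45, 1]]
y_known = [150_000, 190_000, 230_000, 260_000, 340_000, 130_000]

# Estimate the unknown value from the most similar known cases.
estimator = KNeighborsRegressor(n_neighbors=3).fit(X_known, y_known)
print(estimator.predict([[85, 3]]))   # estimated value for a new, similar case
```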
Prediction
The Prediction technique is based on evaluating the future value of an index, relying on past data of its behavior. The only way to determine if a forecast was accurate is to wait for the event to occur and assess the accuracy of the prediction made.
Certainly, prediction is one of the most challenging tasks not only in data mining but also in our everyday lives. Artificial Neural Networks and Statistics are tools used in prediction.
Prediction involves determining the future of a magnitude.
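As a hedged illustration, a very simple prediction sketch: fitting a lag-1 autoregressive model to a hypothetical sales history and projecting the next value. Real forecasting would use richer models; this only shows the idea of predicting the future from past behavior:

```python
import numpy as np

# Hypothetical monthly sales history for one product.
sales = np.array([120, 135, 150, 148, 165, 180, 175, 190, 205, 220], dtype=float)

# Fit a simple lag-1 autoregressive model: sales[t] ~ a * sales[t-1] + b.
x_prev, x_next = sales[:-1], sales[1:]
a, b = np.polyfit(x_prev, x_next, deg=1)

# Predict the next (future) value from the last observed one.
forecast = a * sales[-1] + b
print(f"Forecast for next month: {forecast:.1f}")
```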
Affinity Analysis
Determining which events occur simultaneously with reasonable probability (co-occurrence), or which items in a dataset are present together with a certain likelihood (correlation), are typical tasks of affinity analysis.
The simplest example is perhaps the supermarket cart, from which a great deal of information can be extracted about which products consumers tend to buy together. This allows for targeted sales in which items are already offered together as kits. From the numbers obtained through affinity analysis, “rules” governing the consumption of certain items can be extracted.
Affinity analysis is concerned with discovering which elements of events have relationships over time.
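A minimal affinity-analysis sketch over hypothetical baskets, counting pair co-occurrences and deriving support and confidence for simple “A implies B” rules; the basket contents are invented:

```python
from itertools import combinations
from collections import Counter

# Hypothetical supermarket baskets (one set of items per checkout).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cookies"},
    {"bread", "milk", "butter", "cookies"},
    {"bread", "butter", "jam"},
]

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

n = len(baskets)
for pair, count in pair_counts.most_common(3):
    a, b = sorted(pair)
    support = count / n
    confidence = count / item_counts[a]   # confidence of the rule "a -> b"
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```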
Cluster Analysis
Clustering is the process of classifying a given set of data into unknown classes, without prior knowledge of their number or shape. One task is to assign a certain data point to a known category or class, while another, more complex task is to determine how many classes exist within a dataset and how they are structured. In cluster analysis, the groups or classes are formed based on the similarity between elements, and it is up to the analyst to determine if these resulting classes have any meaningful interpretation.
Cluster analysis is typically a preliminary technique used when little or nothing is known about the data, such as in the methodology of unsupervised discovery of relationships. Artificial Neural Networks, Statistics, and Genetic Algorithms are tools used for cluster analysis.
Clustering is defining, based on similarity measures, how many classes exist in a set of entities and which entities belong to each.
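For illustration, a minimal cluster-analysis sketch with k-means on hypothetical customer data; the choice of three clusters is itself an assumption that, as the text notes, the analyst would have to validate:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [age, average_monthly_spend].
customers = np.array([
    [22, 150], [25, 180], [23, 160],    # younger, lower spend
    [45, 900], [50, 950], [48, 880],    # older, higher spend
    [33, 450], [36, 500], [30, 420],    # intermediate profile
])

# Ask for 3 groups; in practice the number of clusters is unknown and is
# chosen by inspecting the results (e.g. the "elbow" of the inertia curve).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assigned to each customer
print(kmeans.cluster_centers_)  # the "profile" of each discovered group
```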
Implementation of a Data Mining Protocol
We can establish a generic protocol for implementing datamining by following the phases:
- Problem Definition
- Discovering New Relationships
- Analysis of New Relationships
- Application of New Relationships
- Evaluation of Results
Problem Definition
- If there is little knowledge, unsupervised discovery is done;
- If there is any suspicion of an interesting relationship, hypothesis testing is performed;
- If there is a lot of knowledge, mathematical modeling of the relationship is carried out.
⬇
Discovering New Relationships
- Depending on the defined problem, the technique (classification, estimation, prediction, etc.) and the tool (artificial neural networks, genetic algorithms, etc.) capable of executing it are chosen;
- Data preparation is carried out (selection, complementation, etc.) according to the tool to be used;
- The tool is applied, generating “new” relationships.
⬇
Analysis of New Relationships
A team of experts analyzes and chooses viable and promising relationships.
⬇
Application of New Relationships
The new relationships are applied (or exploited) on an experimental basis.
⬇
Evaluation of Results
The results of applying (or exploiting) the new relationships are compared against the initial objectives. If necessary, the process returns to problem redefinition.
Artificial Intelligence
A definition of Artificial Intelligence would be the study of how to create machines that can perform tasks in which, at the moment, humans are better. Therefore, if there is currently no machine capable of doing something better than humans, then the goal of Artificial Intelligence is to generate such a machine. However, this puts Artificial Intelligence in an insoluble paradox: if we create a machine capable of performing a given task similar to a human, that machine, from the moment it exists, is no longer the subject of study of Artificial Intelligence because the task has already been mechanized. In other words, Artificial Intelligence considers that an intelligent task that has already been mechanized is no longer considered intelligent precisely because it has been mechanized.
The Methodology of Artificial Intelligence
Intelligent processes are always carried out through a sequence of operations controlled by a centralizing or supervising element. These operations must be represented by symbols, which would be the roots of intelligence. The actual intelligence would be stored in special high-level symbols called heuristics. Intelligence would be expressed when the machine is involved in solving a specific problem, and its efficiency could be measured.
The Symbolic Artificial Intelligence methodology can be described in three phases:
- Choose an intelligent activity to study;
- Develop a logical-symbolic structure capable of imitating it;
- Compare the efficiency of this structure with real intelligent activity.
Because the methodology of Artificial Intelligence is based on the choice of an intelligent activity, subdivisions of the paradigm arise, such as:
Natural Language Processing
NLP (Natural Language Processing) is the subfield of Artificial Intelligence that focuses on creating algorithms capable of understanding human language, both written and spoken. Language involves complex computational mechanisms that are still not fully understood today. Natural language processing systems need to possess a large, well-represented, easily accessible knowledge base, as well as the ability to perform inferences. Some developed systems can engage in dialogue within certain contexts, summarize texts, and “understand” questions asked for database queries.
Expert Systems
Expert systems are systems that mimic the reasoning of an expert in a specific field of knowledge. Various experts are consulted, and their procedures for handling specific situations are represented and programmed into the system. The system then responds to questions and suggests actions as if it were the expert. Many expert systems have been developed in fields such as medicine, finance, and management.
Planning
Planning actions or policies is a task we perform in our day-to-day lives. Some systems are capable of planning strategies in the administrative field, while others can generate plans for turning on and off networks of equipment without causing damage, a common problem in oil refineries. Since computer programs are plans, within this subarea of Artificial Intelligence, we find Automatic Programming, which is the study of how to create programs capable of programming the computer with minimal human interference.
Problem Solving
This area of Artificial Intelligence aims to develop new methodologies for solving mathematical problems. Many such problems are so complex that they cannot be solved exactly in a reasonable time; Artificial Intelligence therefore seeks methods that solve them approximately, rather than exactly, but quickly.
Pattern Recognition
When we see an object or hear a word, we are recognizing a visual or auditory pattern, respectively. Machines that recognize auditory patterns can be used to receive voice commands directly from their operator. Similarly, machines that recognize visual patterns can be used to detect defective parts on an assembly line, targets for attack, abnormalities on an X-ray image, and so on. Economic patterns of bank or company bankruptcies can also be detected using pattern recognition techniques.
Machine Learning
This subfield concerns the creation of algorithms that allow computers to learn from the environment they are exposed to. If we provide a learning algorithm with a large amount of data, it will be able to draw some conclusions about the relationships existing within that data. Machine learning algorithms transform data into rules that express what is important in the data.
Data Mining Tools
Self-Organizing Maps in Data Mining
Self-organizing maps play an important role in data mining when determining clusters of data with similar patterns (patients, clients, products, etc.). After identifying these clusters, we can utilize them for marketing purposes, such as offering specific promotions to customers with profiles that are most likely to purchase a particular product, for example.
Whenever we want to discover new knowledge in a dataset, we should consider the possibility of presenting this dataset to a self-organizing map neural network. Maps are a simple geometric way to check if there is something interesting or organized in the dataset. Many commercial data mining programs already have ready-made routines for self-organizing maps. This type of neural network has few control parameters and is also computationally efficient due to the simplicity of its learning rule.
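The sketch below implements a deliberately minimal self-organizing map from scratch in NumPy rather than using a commercial tool; the grid size, decay schedules, and customer features (recency, frequency, monetary value) are all illustrative assumptions:

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=200, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal self-organizing map: a grid of weight vectors pulled toward the data."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every neuron, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)        # decaying learning rate
        sigma = sigma0 * np.exp(-t / epochs)  # shrinking neighborhood radius
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: neuron whose weights are closest to the sample.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Pull the BMU and its grid neighbors toward the sample.
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * influence[..., None] * (x - weights)
    return weights

# Hypothetical normalized customer features: [recency, frequency, monetary value].
data = np.random.default_rng(1).random((60, 3))
som = train_som(data)
print(som.shape)  # (5, 5, 3): each map cell holds a prototype customer profile
```

After training, similar customers map to neighboring cells, which is what makes the map useful for spotting clusters visually.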
Backpropagation Learning Neural Networks
The name “error backpropagation” comes from the fact that when applying the learning rule, the learning errors of the neurons in the intermediate and input layers are calculated based on the errors of the neurons in the output layer. In other words, the errors of the neurons in the output layer are “propagated backwards” towards the input, allowing the modification of the synapses for learning. The learning of the neural network involves a variable that is propagated in the opposite direction to the normal flow of information in its neurons.
The applications of this type of neural network are endless across many fields of work. Whenever past data exhibit relationships among themselves, we can use a neural network with error-backpropagation learning to learn those relationships in a supervised way. Later, when new data is presented to it, the network will provide the associated responses in its output layer based on the relationships it has already learned. In this way, we can teach the network, for example, to associate banking-transaction data at the input layer with the degree of suspicion of its legality or illegality at the output layer, using a set of past transactions whose legality or illegality is known. For each new banking transaction whose legality we want to evaluate, we simply stimulate the input-layer neurons with its details, and a neuron in the output layer will provide the transaction’s level of legality.
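A hedged sketch of the transaction example using scikit-learn’s MLPClassifier, which trains a multilayer network by error backpropagation; the transaction features and labels are invented for illustration:

```python
from sklearn.neural_network import MLPClassifier

# Hypothetical past transactions: [amount, hour_of_day, transfers_last_24h].
X_train = [
    [120,  14, 1], [80,  10, 0], [5000, 3, 6], [45,  16, 0],
    [7000,  2, 8], [300, 11, 1], [250, 13, 0], [9000, 4, 7],
]
# Known outcome for each past transaction: 0 = legal, 1 = suspicious/illegal.
y_train = [0, 0, 1, 0, 1, 0, 0, 1]

# A small multilayer network trained with error backpropagation.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)

# New transaction presented to the input layer; the output layer gives the verdict.
new_transaction = [[6500, 3, 5]]
print(net.predict(new_transaction), net.predict_proba(new_transaction))
```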
Statistical Cluster Analysis
Cluster analysis encompasses a variety of techniques and algorithms whose aim is to classify a sample of entities (individuals or objects) into mutually exclusive groups based on the similarities between the entities, according to a predetermined criterion. The resulting groups of objects should exhibit high internal homogeneity (within the group) and high external heterogeneity (between the groups). Therefore, if the classification is successful, the objects within the groups will be grouped together when represented geometrically, and different groups will be separated.
Thus, the problem that cluster analysis aims to solve is, given a sample of n objects (or individuals), each measured according to p variables, to seek a classification scheme that groups the objects into g groups based on their similarities. The number and characteristics of these groups must also be determined.
We also have multivariate data analysis, which refers to all statistical methods that simultaneously analyze multiple measurements of each individual or object under investigation. Multivariate analysis is an ever-expanding set of techniques for data analysis. Among the most established techniques are:
- Multiple regression and multiple correlation;
- Multiple discriminant analysis;
- Principal components and common factor analysis;
- Multivariate analysis of variance and covariance;
- Canonical correlation;
- Cluster analysis;
- Multidimensional scaling;
- Conjoint analysis;
- Correspondence analysis;
- Linear probability models;
- Simultaneous/structural equation modeling.
Similarity Measures
A fundamental concept in the use of clustering analysis techniques is the choice of a criterion that measures the distance between two objects or quantifies how similar they are. This measure is called the similarity coefficient. Technically, it can be divided into two categories: similarity measures and dissimilarity measures. In the former, the higher the observed value, the more similar the objects are. In the latter, the higher the observed value, the less similar the objects are.
Correlation coefficients are an example of a similarity measure, while the Euclidean distance is an example of a dissimilarity measure. Most clustering analysis algorithms are programmed to operate using the concept of distance (dissimilarity).
The most commonly used algorithms for forming clusters can be classified into two general categories: (1) hierarchical techniques and (2) non-hierarchical techniques or partitioning methods.
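As an illustration of the hierarchical category, a short SciPy sketch: Euclidean distances (a dissimilarity measure) feed an agglomerative linkage, and the resulting hierarchy is cut into a chosen number of groups. The data points and the choice of three groups are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical objects measured on p = 2 variables.
X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 1.0], [8.8, 1.2]])

# Pairwise Euclidean distances (a dissimilarity measure).
distances = pdist(X, metric="euclidean")

# Hierarchical (agglomerative) clustering with Ward linkage.
tree = linkage(distances, method="ward")

# Cut the hierarchy into g = 3 mutually exclusive groups.
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 3 3]: internally homogeneous, externally distinct groups
```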
Machine Learning
Machine Learning is a research area of Symbolic Artificial Intelligence whose goal is to extract heuristic rules that may be embedded in large datasets. Machine learning algorithms are very interesting because, in addition to modeling the data to enable predictions and classifications, they provide heuristic rules that explain the patterns present in the data.
Among the most commonly used machine learning algorithms are the so-called recursive partitioning algorithms. These start from the original dataset and partition it into subgroups, which are in turn further partitioned until the desired level of detail is reached to extract precise heuristic rules about the patterns found in the data. Typically, each subgroup is generated from its parent group by a heuristic rule that classifies the data into one subgroup or the other. A natural representation for recursive partitioning is therefore a binary tree called a decision tree, since at each node a decision must be made about which of the two sides, or subgroups, the data falls into.
Once the decision tree is constructed, a new data point can be classified by following a path from the root node to a leaf node, turning left or right at each node according to the associated heuristic rule (decision). The aim is to make the resulting subgroups increasingly homogeneous, so as to reach leaves containing well-defined and organized data classes.
The Concept of Entropy
To measure the homogeneity of groups, algorithms use the concept of variance or entropy. In Physics, especially in Thermodynamics, the concept of entropy is associated with disorder.
Entropy is a measure of diversity, degeneration, disorder, disorganization, and chaos. Minimum entropy (zero), representing total organization, is defined as that of a geometrically perfect crystal at absolute zero temperature, where no atoms move; zero entropy therefore means inertia, total order, and death.
Organized systems (such as a perfect crystal at absolute zero) are so predictable that they do not require any information for understanding. In other words, systems with zero entropy do not require information because they are already understood. However, if the phenomenon is complex, it becomes more difficult to understand it, and consequently, any information about the phenomenon becomes valuable. In other words, disorganized, chaotic systems with high entropy require a lot of information for their clarification. We call information entropy the amount of additional information required to understand a phenomenon or system.
We define information entropy as the average amount of information required to understand a phenomenon. If a complex and disorganized phenomenon depends on several hard-to-predict events, the amount of information needed to predict each event is high, and certainly the average of these quantities will also be high, resulting in high information entropy.
In Computer Science, entropy is the lack of knowledge in the present that must be supplied in the future; it is disorder due to lack of knowledge. The computational methods of data mining are precisely the pursuit of order, or knowledge, and the consequent reduction of entropy.
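A small sketch of information entropy as used to measure group homogeneity, computed as -Σ p·log2(p) over the class proportions of a group; the example labels are made up:

```python
import numpy as np

def information_entropy(labels):
    """Shannon entropy (in bits) of a class distribution: -sum(p * log2(p))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(information_entropy(["A"] * 10))             # 0.0   -> totally ordered group
print(information_entropy(["A"] * 5 + ["B"] * 5))  # 1.0   -> maximum disorder for 2 classes
print(information_entropy(["A"] * 9 + ["B"] * 1))  # ~0.47 -> almost homogeneous
```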
Building Decision Trees
Understanding this important concept of entropy, we can easily grasp how decision trees are built. At each level of the decision tree, we need to define heuristic rules that partition the data into subgroups with the lowest possible entropies. In other words, each subgroup is more homogeneous and clearer in its behavior pattern, hence having a lower entropy. In the leaves of the decision tree, we have groups that are so homogeneous that they embrace only one type of data, leaving no doubt about what they are, and therefore, any additional information would be unnecessary (zero entropy).
Another interesting aspect of decision trees is their ability to extract heuristics from the classified data. It is only necessary to combine the questions asked at each level of the tree. Allowing decision tree algorithms to freely search the company’s data warehouse for heuristics is a valuable technique for unsupervised discovery.
Decision tree algorithms are widely available and can be easily found in commercial software for Statistics and Artificial Intelligence, making them accessible for data mining purposes. Extracting heuristics from large datasets is not exclusive to decision trees. Many other tools, including artificial neural networks, are part of the Machine Learning toolkit. Nevertheless, due to their widespread use, decision trees are often the go-to option for heuristic extraction.
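A minimal decision-tree sketch with scikit-learn showing both recursive partitioning driven by entropy and the extraction of readable heuristic rules; the customer features and classes are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical customer records: [monthly_income, purchases_last_year] -> class.
X = [[2000, 2], [2500, 1], [3000, 3], [8000, 15], [9000, 20],
     [7500, 12], [4000, 5], [4500, 4]]
y = ["occasional", "occasional", "occasional", "frequent", "frequent",
     "frequent", "occasional", "occasional"]

# Recursive partitioning: each split is chosen to reduce the entropy
# (impurity) of the resulting subgroups.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# The trained tree itself is a set of readable heuristic rules.
print(export_text(tree, feature_names=["income", "purchases"]))
```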
Conclusion
In conclusion, datamining is a powerful technique that uses statistical methods and artificial intelligence to uncover hidden knowledge in large amounts of data.
Through data mining, it is possible to identify relationships between products, classify consumers, predict sales, and even locate profitable geographical areas for new branches. The phases of data mining in a company include problem identification, discovery of new relationships, human analysis of these relationships, rational use of them, and evaluation of results.
Furthermore, datamining can be carried out in different ways, such as unsupervised discovery of relationships, hypothesis testing, and mathematical modeling of the data. Among the techniques used in datamining, self-organizing maps, backpropagation neural networks, statistical cluster analysis, and decision tree construction are noteworthy. Machine learning is an essential area within datamining, enabling the extraction of heuristic rules from data and the analysis of complex patterns.
In summary, datamining is a valuable tool for companies looking to make the most of the information contained in their data, improving decision-making and driving business success.
Bibliographic Reference
Datamining – a Mineracao de Dados no Marketing, Medicina, Economia, Engenharia e Administração ↗
Did you like the content? Want to get more tips? Subscribe ↗ for free!