- Big Data: The term Big Data is used when the amount of data that an organization has to manage reaches a critical volume that requires new technological approaches in terms of storage, processing and usage. Volume, velocity and variety are the three criteria usually used to qualify data as “Big Data”.
- Data: This term comprises facts, observations and raw information. Data itself has little meaning if it is not processed.
- Data analysis: This is a class of statistical methods that makes it possible to process a very large volume of data and identify the most interesting aspects of its structure. Some methods help to extract relations between different sets of data, and thus to draw statistical information that makes it possible to describe the most important information contained in the data in the most succinct manner possible. Other techniques make it possible to group data in order to identify its common denominators clearly, and thereby understand it better.
- Data governance: This is a quality-control framework for managing and protecting a company’s key information resources. Its mission is to ensure that data is managed in accordance with the company’s values and convictions, to oversee its quality, and to put mechanisms in place that monitor and maintain that quality. Data governance covers data management, oversight, quality evaluation, coherence, integrity and IT resource security within a company.
- Data journalism: The term designates a new form of journalism based on data analysis and (often) on its visual representation. The journalist uses databases as sources and deduces from them knowledge, meaningful relationships or intuitions that would not be accessible through traditional research methods. Even when the article itself remains the main component of the work, illustrating ideas through graphs, diagrams, maps, etc. is becoming more important day by day.
- Data mining: Also referred to as knowledge discovery from data, data mining is intended to extract knowledge from large amounts of data using automatic or semi-automatic methods. It uses algorithms drawn from disciplines as diverse as statistics, artificial intelligence and computer science to develop models from the data; that is, to find interesting structures or recurrent themes according to criteria determined beforehand, and to extract the largest possible amount of knowledge useful to companies. It groups together all technologies capable of analyzing database information in order to find useful information and possibly significant and useful relationships within the data (a receipt-analysis sketch follows this glossary).
- Data reuse: This practice consists of taking a dataset in order to visualize it, merge it with other datasets, use it in an application, modify it, correct it, comment on it, etc.
- Data science: It is a new discipline that combines elements of mathematics, statistics, computer science and data visualization. The objective is to extract information from data sources. In this sense, data science is devoted to database exploration and analysis. This discipline has recently received much attention due to the growing interest in Big Data.
- Data visualization: Also known as “data viz”, it deals with data visualization technology, methods and tools. It can take the form of graphs, pie charts, diagrams, maps, timelines or even original graphic representations. Presenting data through illustrations makes it easier to read and understand (a chart sketch follows this glossary).
- Dataset: A structured and documented collection of data on which reusers rely.
- Hadoop: A Big Data software infrastructure that includes a storage system and a distributed processing tool based on the map/reduce model (a word-count sketch follows this glossary).
- Information: It consists of interpreted data and has discernible meaning. It describes and answers questions like “who?”, “what?”, “when?” and “how many?”.
- Knowledge: It is a type of know-how that makes it possible to transform information into instructions. Knowledge can either be obtained through transmission from those who possess it, or by extraction from experience.
- Linked Open Data (LOD): This term designates a Web approach proposed by supporters of the “Semantic Web”: all data is described in a way that computers can scan, and it is linked by describing its relationships or by making it easier for the data to be related. Open public data is arranged in a “Semantic Web” format, such that each item has a unique identifier and datasets are linked together through those identifiers (an RDF sketch follows this glossary).
- Open data: This term refers to the principle according to which public data (gathered, maintained and used by government bodies) should be made available to be accessed and reused by citizens and companies.
- Semi-structured information: The boundary between structured and unstructured information is rather fuzzy, and it is not always easy to classify a given document into one category or the other. In such cases, one is most likely dealing with semi-structured information.
- Smart Data: The flood of data encountered by ordinary users and economic actors will bring about changes in behavior, as well as the development of new services and value creation. This data must be processed and developed in order to become “Smart Data”. Smart Data is the result of analysis and interpretation of raw data, which makes it possible to effectively draw value from it. It is, therefore, important to know how to work with the existing data in order to create value.
- Structured information: It can be found, for example, in databases or in programming languages. It can thus be recognized by the fact that it is arranged in a way such that it can be processed automatically and efficiently by a computer, but not necessarily by a human. According to Alain Garnier, the author of the book Unstructured Information in Companies, “information is structured when it is presentable, systematic, and calculable”. Some examples include forms, bills, pay slips, text documents, etc.
- Unstructured information: Unlike structured information, unstructured information constitutes the set of information for which it is impossible to find a predefined structure. It is always intended for humans, and is therefore composed mainly of text and multimedia documents, like letters, books, reports, video and image collections, patents, satellite images, service offers, resumes, calls for tenders, etc. The list is long.
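To make the receipt analysis mentioned under Data mining concrete, here is a minimal sketch in Python that counts how often pairs of products appear on the same receipt. The receipts and the support threshold are invented for illustration; real association-rule mining would use a dedicated algorithm such as Apriori.

```python
from itertools import combinations
from collections import Counter

# Hypothetical receipts: each set lists the products bought in one transaction.
receipts = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "milk", "butter", "coffee"},
]

# Count how often each pair of products appears together (the "support").
pair_counts = Counter()
for receipt in receipts:
    for pair in combinations(sorted(receipt), 2):
        pair_counts[pair] += 1

# Report pairs bought together in at least half of the receipts.
min_support = len(receipts) / 2
for pair, count in pair_counts.most_common():
    if count >= min_support:
        print(f"{pair[0]} + {pair[1]}: {count} receipts")
```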
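For Data visualization, a bar chart is one of the simplest forms listed above. The sketch below assumes the widely used matplotlib library; the yearly figures are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical dataset: open datasets published per year by an agency.
years = [2019, 2020, 2021, 2022]
datasets_published = [12, 30, 55, 80]

fig, ax = plt.subplots()
ax.bar([str(y) for y in years], datasets_published)
ax.set_xlabel("Year")
ax.set_ylabel("Datasets published")
ax.set_title("Growth of an open data portal (illustrative numbers)")
plt.show()
```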
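For Hadoop, the distributed processing tool follows the map/reduce model. Assuming Hadoop Streaming, which runs scripts written in any language as map and reduce steps, the classic word count can be sketched as two small Python scripts; the file names and setup are illustrative.

```python
# mapper.py -- Hadoop Streaming feeds raw input lines on stdin.
# Emit one "word<TAB>1" pair per word; Hadoop groups and sorts these by key.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- receives the mapper output sorted by word, so identical
# words arrive on consecutive lines and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pair can be tested locally, without a cluster, with `cat input.txt | python mapper.py | sort | python reducer.py`, since the local `sort` mimics Hadoop's shuffle step.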
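For Linked Open Data, the sketch below uses the rdflib Python library to give each resource a unique identifier (a URI) and to link resources through typed relationships. The namespace, dataset and property names are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical namespace for a city's open data portal.
EX = Namespace("http://data.example.org/")

g = Graph()
dataset = EX["dataset/street-trees"]

# Each resource gets a unique identifier (a URI), and typed links between
# URIs are what allow independently published datasets to be joined.
g.add((dataset, RDF.type, EX.Dataset))
g.add((dataset, EX.title, Literal("Street trees inventory")))
g.add((dataset, EX.publishedBy, EX["org/parks-department"]))

# Serialize in Turtle, a common Linked Data exchange format.
print(g.serialize(format="turtle"))
```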
Chapter 2: Open Data: A New Challenge
- Definition of Open Data (Page 26):
- availability and access: data must be available as a whole, and it must be accessible in a convenient and modifiable form;
- reuse and redistribution: data must be made available in ways that make its reuse and redistribution possible, including the possibility of combining it with other sets of data;
- universal participation: everyone must be able to use, reuse, and disseminate data.
- Open data in five stages (Page 28):
- Big Data comes mainly from (Pages 31-32):
- the Web: newspapers, social networks, e-commerce, indexing, document, photo and video storage, linked data, etc.;
- the Internet and connected objects: sensor networks, call logs;
- science: genomics, astronomy, sub-atomic physics;
- commercial data (e.g. transaction histories in a supermarket chain);
- personal data (e.g. medical files);
- public (open) data.
- Intersection of Open Data and Big Data (Page 32):
- involves massive amounts of data
- an example of how data can create value in terms of usefulness
- Making data open means making it available online, which involves a new concept, “linked data”.
- This concept represents a growing movement for companies, which must ensure their data is in a format that can be read by machines.
- This allows users to create and combine datasets and make their own interpretations of them.
- The data web consists of presenting that structured data online and linking it, which will increase its visibility and its capacity to be reused.
- The foregoing reflections on Open Data point to two needs on the part of data reusers (Page 38):
- more raw data, updated in real time and, of course, processed;
- contextualizing documents that make it possible to understand how and why a given set of data has been constructed.
Chapter 3: Data Development Mechanisms
- Statistical techniques for data development mechanisms (Pages 48-49):
- descriptive techniques: they seek to bring out information that already exists but is hidden by the volume of data (e.g. customer segmentation, or finding product associations on receipts). These techniques thus help reduce, summarize and synthesize data, and include the following (a clustering sketch follows this list):
- factor analysis: projection of the data in graphical form to visualize the overall connections (affinities and oppositions) between the various data;
- automatic classification (clustering/segmentation): grouping homogeneous data together to bring out a segmentation of individuals into classes;
- association analysis (receipt analysis): this involves spotting dependencies between the objects or individuals observed.
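As announced above, here is a minimal sketch of automatic classification (clustering), assuming the scikit-learn library; the customer figures are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: (annual purchases in EUR, store visits per year).
customers = np.array([
    [200, 4], [250, 5], [300, 6],       # occasional shoppers
    [2200, 45], [2500, 50], [2400, 48]  # frequent shoppers
])

# Automatic classification: group the customers into 2 homogeneous classes.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("cluster labels:", model.labels_)
print("cluster centers:", model.cluster_centers_)
```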
- predictive techniques: predictive analysis allows for very accurate projections, making it possible to identify new opportunities (or threats) and thus anticipate appropriate responses. These techniques aim to extrapolate new information from existing sources (as in scoring). They are statistical methods that analyze the relation between one or more dependent variables and a set of independent variables, such as (a regression sketch follows this list):
- classification/discrimination: the dependent variable is qualitative;
- discriminant analysis/logistic regression: finding rules for allocating individuals to their groups;
- decision trees: they divide the individuals of a population into classes; one begins by selecting the variable that best separates the individuals of each class into sub-populations called nodes, according to the target variable;
- neural networks: coming from the field of artificial intelligence, a neural network is a set of interconnected nodes carrying weighted values; it is trained by adjusting the connection weights until an optimal solution is found or a fixed number of iterations is reached;
- forecasting: the dependent variable is continuous or quantitative;
- linear regression (simple and multiple): it models the variation of a dependent variable with respect to one or more independent variables, and thus predicts the evolution of the former in relation to the latter;
- general linear model: it generalizes linear regression to several continuous explanatory variables.
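The sketch announced above illustrates both families of predictive techniques, assuming scikit-learn: a linear regression forecasting a quantitative variable, and a logistic regression scoring a qualitative one. All figures are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Forecasting: the dependent variable is quantitative (linear regression).
# Hypothetical data: advertising spend (kEUR) vs. resulting sales (kEUR).
spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 45, 62, 85, 101])
reg = LinearRegression().fit(spend, sales)
print("predicted sales for 60 kEUR of spend:", reg.predict([[60]]))

# Classification/scoring: the dependent variable is qualitative (logistic
# regression). Hypothetical data: (income kEUR, overdue payments) -> default.
X = np.array([[50, 0], [60, 1], [20, 4], [25, 5], [45, 1], [22, 6]])
y = np.array([0, 0, 1, 1, 0, 1])
clf = LogisticRegression().fit(X, y)
print("default probability:", clf.predict_proba([[30, 3]])[0][1])
```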
- A model of economic intelligence (Page 62):
- Phases of dealing with data mining (Page 67):
- understanding the task: this first phase is essential in understanding the objectives and requirements of the task, in order to integrate them into the data mining project and to outline a plan to achieve them;
- understanding the data: this involves collecting and becoming familiar with the available data. It also involves identifying data quality problems at the earliest possible opportunity, developing initial intuitions, and detecting the first subsets and hypotheses to be analyzed;
- preparation of data: this phase comprises all the steps necessary to build the datasets that will be used by the model(s). These steps are often performed several times, depending on the proposed model and the results of analyses already carried out. It involves extracting, transforming, formatting, cleaning and storing data appropriately. Data preparation constitutes about 60-70% of the work;
- modeling: this is where the modeling methodologies of statistics come into play. Models are often validated and built with the help of business analysts and quantitative method experts, called “data scientists”. There are in most cases several ways of modeling the same data mining problem and several techniques for fitting a model to the data;
- evaluation of the model: at this stage, one or several models have been built. It must be verified that their results are satisfactory and coherent, notably in relation to the targets;
- use of the model: developing the model is not the end of the data mining process. Once information has been extracted from the data, it still needs to be organized and presented so as to make it usable for the recipient (a minimal end-to-end sketch follows this list).
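To tie the phases together, here is a minimal end-to-end sketch assuming scikit-learn and its bundled iris sample data. It compresses data understanding, preparation, modeling, evaluation and use into a few lines, and is an illustration of the flow rather than of a real project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Understanding/preparing the data: a ready-made sample dataset stands in
# for the extraction, cleaning and formatting work described above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Modeling: fit one candidate model (a decision tree).
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluation: check that results are satisfactory before deploying the model.
print("accuracy on held-out data:", accuracy_score(y_test, model.predict(X_test)))

# Use: the fitted model can now score new, unseen records.
print("prediction for a new record:", model.predict([[5.0, 3.4, 1.5, 0.2]]))
```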