BACKGROUND:
Advances in information technologies for describing geospatial phenomena have revolutionized information handling in research and industry. Effective access to geospatial information is critically important in these knowledge-based contexts. However, the volume of geospatial data grows every day, which makes direct search infeasible. As an alternative, many information systems search geospatial metadata, that is, data that describes geospatial data. From emergency, rescue and location systems to geopolitical, military and industrial decision-making systems based on geographic information, it is essential to access geospatial resources through consistent metadata descriptions with a minimum level of quality, so that the resources remain retrievable (Hartmann and Stuckenschmidt, 2002; Martins et al., 2007). Moreover, for some authors, the quality and consistency of a metadata description can be the difference between life and death, or between success and failure (Dushay and Hillmann, 2003; Bruce and Hillmann, 2004; Hillmann et al., 2004).
This thesis investigates how to assess the quality of a particular kind of metadata: the metadata that describes the spatial location of a resource. In particular, it investigates the problems that may surface when a metadata record has more than one property intended to describe the location of the resource, that is, when the record contains semantically close geographical properties (Miles and Bechhofer, 2009). This problem is closely tied to the ease with which georeferenced resources can be found in an information system, because metadata properties are key to the discovery, access and retrieval of resources in many systems based on indexed catalogues (Goodchild and Zhou, 2003; Hill, 2006).
OBJECTIVE:
The problems derived from the inconsistency of Geospatial metadata might be mitigated after performing a Quality Assessment (QA) of the geospatial description.
This thesis analyses a semi-automatic approach that detects geospatial inconsistencies and suggests possible solutions for each inconsistency based on a geospatial context, which is built from the consensual geospatial descriptions surrounding the inconsistent resource. In addition, the analysed solution should support quality processes such as the curation and preservation of cartographic material in digital repositories.
SCOPE:
The QA analysis is restricted to two kinds of resources: (1) metadata of Web services compliant with the Open Geospatial Consortium (OGC) specifications and (2) MARC21 (http://www.loc.gov/marc/bibliographic/) metadata describing cartographic materials. The OGC has led the development of open and standardized Web service interface specifications for accessing geospatial information since 1994. Many companies, government agencies and universities are members of the OGC, and they participate in consensual processes to develop publicly available Web service interface standards for access to geospatial resources. Many OGC Web service interface specifications have become standards of the International Organization for Standardization (ISO).
Regarding OGC standards, this thesis focuses on those most used in Spatial Data Infrastructures (SDI) (Nebert, 2004). In particular, we apply our QA analysis to Web Map Service metadata stored in OGC Catalogue Services (CSW; Nebert et al., 2007). However, the systems we developed can be applied to other kinds of Web services and SDI resources with semantically close geographical properties.
Although our methodology could be used to assess the quality of thematic and temporal properties, we will solely consider semantically close geographical properties.
MARC standards, on the other hand, are a set of digital formats for describing items catalogued by libraries, such as books. They were developed by the US Library of Congress during the 1960s to create records that computers can use and that libraries can share. By 1971, MARC formats had become the national standard for the dissemination of bibliographic data in the United States, and by 1973 the international standard. Several versions of MARC are in use around the world; the most prevalent are MARC21, created in 1999 by harmonizing the U.S. and Canadian MARC formats, and UNIMARC, widely used in Europe. Additionally, in many libraries around the world, MARC21 is the most used standard for documenting resources that describe geographic phenomena on the surface of the Earth (Furrie, 2009).
In MARC21 there are several different fields that can encode different aspects of direct/indirect spatial references including different ways to associate geographic codes, or different ways for expressing the geospatial reference method used for the coordinates in the direct spatial references. We will solely consider the two most frequent semantically close geographical properties that we found analysing the experimental datasets: the Direct Spatial References (geographical extent/spatial footprint) and the Indirect Spatial References (place name) (FGDC, 1998). In this thesis we do not take into account other geographical properties.
The SDI scenario is used as a test case to validate the whole proposed architecture. The descriptive information of spatial resources in an SDI is structured and comes from experts in the geographical domain. This leads us to expect that spatial descriptions in SDI are better than those in scenarios and domains where descriptions come from unstructured information and from non-experts in the geographical domain, for example, Digital Libraries. For this reason, the methodology and the architecture have also been tested with a more difficult case, cartographic materials. Our analysis has been restricted to two geographic areas, Spain and the United States of America. This choice of scenarios ensures that the contributions of this thesis can benefit two of the most frequent settings that use geographic metadata (Spatial Data Infrastructures and Digital Libraries).
Regarding the geographical Knowledge Organisation Systems (KOS) (Hodge, 2000; Miles and Bechhofer, 2009) used in the spatial ranking process, this work has analysed several knowledge representation systems, mainly SKOS (Simple Knowledge Organisation System; Isaac and Summers, 2009) vocabularies and RDF/XML graphs. The KOS used for spatial ranking must cover the geographical extent of the assessed geographical resources. Also, the granularity of the spatial footprints in the KOS must match that of the analysed collection. In this thesis we restrict ourselves to KOS with two-dimensional (2D) footprints. However, by means of simple processes, 2D footprints can be simplified to classical points (1D). The open research line and the challenge is to shift from 1D to 2D geographical footprint analysis to assess their quality.
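As a minimal sketch of the simplification mentioned above, a 2D footprint stored as a minimum bounding box can be collapsed to its centre point. The `BBox` type, the function names and the two example boxes are assumptions made for illustration only; they are not taken from the thesis, and the sketch assumes boxes that do not cross the antimeridian.

```python
from typing import NamedTuple, Tuple


class BBox(NamedTuple):
    """Minimum bounding box in decimal degrees (hypothetical helper type)."""
    west: float
    south: float
    east: float
    north: float


def to_point(box: BBox) -> Tuple[float, float]:
    """Collapse a 2D footprint to its centre as (longitude, latitude).

    Assumes the box does not cross the antimeridian.
    """
    return ((box.west + box.east) / 2.0, (box.south + box.north) / 2.0)


def area(box: BBox) -> float:
    """Planar area in square degrees; only used to compare extents."""
    return (box.east - box.west) * (box.north - box.south)


# Two made-up footprints of very different scale.
country_scale = BBox(-10.0, 35.0, 4.0, 44.0)
city_scale = BBox(-3.5, 39.0, -2.5, 40.0)

# Both collapse to the same point, so the difference in extent is lost.
print(to_point(country_scale), area(country_scale))
print(to_point(city_scale), area(city_scale))
```

In this example the two footprints differ in area by two orders of magnitude yet map to the same point, which is the kind of information a point-based (1D) analysis cannot see.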
METHOD:
The methodological approach combines aspects of software engineering, knowledge engineering and artificial intelligence. The software engineering methodology is a classic incremental development of the solution (Boehm, 1988; Larman and Basili, 2003). The knowledge engineering methodology is based on the steps proposed by the Methontology framework (specification, conceptualization, formalization, integration, implementation, and maintenance) (Fernández-López et al., 1997).
PREVIOUS EXPERIENCE, FUTURE WORK:
Ensuring access to, and the retrieval and visualization of, resources in distributed and interoperable information systems is a common, high-priority goal in many domains, for example, SDI and Digital Libraries. One of the most consolidated cases is the European INSPIRE initiative, whose aim is to create a European SDI. One of the research lines of the IAAA research group (http://iaaa.cps.unizar.es/showContent.do?cid=presentacion.EN) focuses on SDI aspects related to the description of geospatial data and services, the discovery of these resources through standard catalogues, and the conceptual and architectural aspects of geospatial data and services.
Some results of the SDI research line in which the author has participated are the exploration of new alternatives to ensure the quality of the descriptive information of geospatial resources and the identification of hidden geospatial resources in catalogues (Renteria-Agualimpia et al., 2013c), the exploration of advances in semantic search engines and the integration of geospatial aspects (Renteria-Agualimpia et al., 2010), and the development of multi-criteria geographic information retrieval models based on geospatial semantic integration (Renteria-Agualimpia and Levashkin, 2011).
These works have involved the identification, analysis and characterization of the most common errors of geospatial inconsistencies of web services metadata in the context of Geographic Information Retrieval (Renteria-Agualimpia et al., 2013b, 2014). Additionally, the author has collaborated in the development of reality checks of the status and availability of the OGC Web Services (López-Pellicer et al., 2011, 2012b,c).
Some research results of the Digital Library research line where the author has participated are the exploration of new alternatives to ensure the quality of the descriptive information of cartographic resources (Renteria-Agualimpia et al., 2013a). Additionally, the author has collaborated in the study of new ways for improving the visibility of geospatial resources on the Web (Lacasta et al., 2014b,a), and new ways for improving the detection of spatial inconsistency, ambiguous toponyms, and the detection of the existence of problems derived from the lack of enough coverage for fine-grain toponyms in gazetteers (Moncla et al., 2014).
This thesis belongs to the aforementioned research lines and is the result of the cited research. Future work will improve the contributions to the QA of descriptions of geospatial resources, the characterization of other kinds of inconsistencies, and the evaluation of their impact on information retrieval processes.
CONCLUSIONS:
This thesis has investigated how to assess the quality of metadata that describes the spatial location of a resource, and the problems that may surface when a metadata record has semantically close geographical properties, that is, a pair of properties that describe its location using different reference systems (e.g. text and coordinates). This problem is closely tied to the ease with which georeferenced resources can be retrieved in an information system. This approach has been used to show the need for methods and tools that analyse the geospatial semantic consistency of these properties in order to improve the discovery, access and retrieval of geographical information from different perspectives. With this aim, the main contributions of this thesis are the following:
* An approach that takes advantage of the spatial co-occurrence found in large volumes of geospatial information: We have presented a methodology that uses spatially co-occurring metadata, and the cumulative knowledge it holds about a given place, to validate a particular resource description or to find discrepancies with respect to its neighbourhood.
The volume of geospatial data grows every day, which makes it infeasible to search through its content directly. Many information systems use geospatial metadata instead. However, we have shown that a large volume of spatial information can be exploited to provide Quality Assessment for co-occurring metadata.
With this methodology, an adapted two-dimensional clustering algorithm has been proposed to capture geospatial co-occurrence and to detect when co-occurring metadata merely overlaps a cluster and in fact belongs to an inferior or superior one.
One of the particularities of our methodology is its flexibility. This methodology has shown the capability to integrate different clustering algorithms, reverse geocoders, and two-dimensional ranking methods.
* A comparative study of spatial ranking approaches for one-dimensional and two-dimensional data: This thesis has compared different ranking approaches and their ability to work with one-dimensional and two-dimensional data. The results have shown a significant advantage for two-dimensional approaches in detecting geospatial inconsistencies. The nature of the geospatial inconsistencies became apparent mostly when we shifted from one dimension to two. The results have revealed that macro and micro geographical extents are traditionally conflated into a point, whereas approaches based on two dimensions help to discover inconsistencies that remain hidden to one-dimensional approaches.
Also, in the comparison we contrasted approaches that use social knowledge sources, Wikipedia and DBpedia, with approaches that use official sources. In general, accuracy was better when we used official sources to validate metadata descriptions. However, for the smallest geographical extents, the social sources provided spatial descriptions not found in official sources.
* Two real tests in two real scenarios with two two-dimensional datasets: This thesis has introduced an empirical and quantitative study of the spatial quality of semantically close geographical properties in two scenarios: SDI and Digital Libraries. With these scenarios we have performed a dual validation of our methodology. The first validation used a dataset of more than 1000 Web services from the Spanish SDI, while the second used a dataset of more than 42,000 MARC21 metadata records from the U.S. Library of Congress. The empirical study has provided an overview of the characteristics of common spatial inconsistencies in published metadata resources, and has also revealed common and systematic errors in the current practices of these communities when providing metadata for cartographic resources. The study has characterised and summarised these common spatial inconsistencies. The characterization of inconsistencies in the Digital Library scenario takes into account the experience gathered with SDI catalogues. Although inconsistency problems exist in the SDI scenario, they were less frequent than in the Digital Library scenario. We have found that SDI quality problems are minimised because SDI personnel are experts and technicians with specialised, advanced and detailed spatial knowledge of the geographical domain, and because of the specialised geographical focus of SDI catalogues and the standards developed for them. For these reasons, the inconsistency problems of resource descriptions in SDI are probably caused by technical issues.
* A semi-automatic Quality Assessment tool for Geospatial Metadata: Correcting geospatial inconsistencies in SDI and Digital Library resources is not trivial for personnel and users who are not experts in geospatial disciplines. Our methodology can assist them with a semi-automatic Quality Assessment tool that improves retrieval and system interoperability by reducing the invisibility of geospatial resources, specifically the invisibility caused by geospatial inconsistencies in the semantically close geographical properties used to retrieve those resources.
Also, we have pointed out some of the implications of geospatial inconsistency problems. A resource with a poor quality description is for most purposes invisible. Invisible resources degrade the effectiveness of the information system devoted to managing the information. Ensuring the quality of the description is vital to ensure future access to and discovery of the resources held by SDI, libraries and archives. Our work has provided a mechanism to raise alerts and generate reports of inconsistencies, and thus to help in digital curation processes. This inconsistency detection mechanism can also be used to warn about potential disconnection problems in interoperable and distributed information systems.
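The co-occurrence idea behind the first contribution can be sketched as a neighbourhood vote: records whose footprints closely match the footprint of a target record are treated as its neighbours, and the target is flagged when its place name disagrees with the name most of them share. This is an illustrative sketch only, not the thesis implementation. The `Record` type, the Jaccard overlap used as similarity, the 0.5 threshold and the sample records are all assumptions.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


class Record(NamedTuple):
    """Hypothetical metadata record with two semantically close properties."""
    identifier: str
    place_name: str   # indirect spatial reference
    footprint: BBox   # direct spatial reference


def area(b: BBox) -> float:
    return max(0.0, b.east - b.west) * max(0.0, b.north - b.south)


def jaccard(a: BBox, b: BBox) -> float:
    """Intersection area over union area of two boxes (assumed similarity)."""
    inter = area(BBox(max(a.west, b.west), max(a.south, b.south),
                      min(a.east, b.east), min(a.north, b.north)))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def consensus_place_name(target: Record, collection: List[Record],
                         min_similarity: float = 0.5,
                         min_neighbours: int = 2) -> Optional[str]:
    """Return the place name shared by a strict majority of neighbours, if any."""
    names = [r.place_name for r in collection
             if r.identifier != target.identifier
             and jaccard(target.footprint, r.footprint) >= min_similarity]
    if len(names) < min_neighbours:
        return None  # not enough evidence to judge
    name, votes = Counter(names).most_common(1)[0]
    return name if votes > len(names) / 2 else None


def is_suspect(target: Record, collection: List[Record]) -> bool:
    """Flag a record whose place name contradicts the neighbourhood consensus."""
    agreed = consensus_place_name(target, collection)
    return agreed is not None and agreed != target.place_name


# Made-up records: four near-identical footprints, one deviating place name.
records = [
    Record("r1", "Aragón", BBox(-2.2, 39.8, 0.8, 43.0)),
    Record("r2", "Aragón", BBox(-2.1, 39.9, 0.7, 42.9)),
    Record("r3", "Aragón", BBox(-2.0, 40.0, 0.8, 43.0)),
    Record("r4", "Galicia", BBox(-2.2, 39.9, 0.8, 42.9)),
]
suspects = [r.identifier for r in records if is_suspect(r, records)]
print(suspects)
```

A flagged record is only a candidate for review, which matches the semi-automatic nature of the assessment: the consensus name can be offered as a suggestion, and a person decides whether to apply it.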
FUTURE WORK:
The goal of this thesis is the improvement of the accessibility, retrieval, and visualization for geospatial information resources in the context of digital repositories in general. Many open questions remain that require further research. These are the opportunities identified for following them up:
Apply lessons learned to the analysis of geospatial consistency in other domains that use other kinds of metadata: The implementation of the approach proposed in this thesis only considers two kinds of metadata documents with geospatial information, OGC Web service metadata and MARC21 metadata. With respect to the first, other metadata schemas in the geospatial domain, such as ISO 19119 (ISO/TC 211, 2005), also use semantically close geospatial properties to access and retrieve geospatial Web services. With respect to the second, archives also hold important geospatial resources whose semantically close geographical properties could be assessed. It seems natural to perform further research on these scenarios.
Apply lessons learned in assessing the quality of semantically close geographic properties to other semantically close properties: The invisibility problems caused by geospatial inconsistencies can also arise when users search by other facets. An open line is to measure the impact of other semantically close properties, such as temporal and thematic ones. Here we need to take into account at least two additional issues: (1) The development of knowledge organization systems, such as temporal and thematic ontologies, must be in accordance with the kind of semantically close properties to be assessed. (2) Also, similarity measures must be developed to provide Quality Assessment for each kind of property.
The knowledge organization systems and ontologies used in this thesis provide an inventory of the spatial entities that exist at a given time. However, services and spatial data are dynamic and change over time. For example, a Web service can change part of its contents within a short period. Modelling the dynamics of some geographical resources and their content (e.g. Web services) over time is a complex problem (López-Pellicer, 2011). It would be interesting to investigate and detect spatial inconsistencies related to time (e.g. moving from (x,y) to (x,y,t)).
Extend the analysis using more fine-grained Knowledge Organization Systems: Part of the success of the reverse geocoding process depends on the accuracy, completeness and level of detail of the gazetteers used, that is, of the knowledge organization systems that support the transformations (the spatial conversion between reference systems). In our case, these have been the spatial ontologies used in the reverse geocoding process. Although the results were relevant for the main cases, in areas where metadata documents refer to the smallest extents it has been difficult to establish a spatial match between the required/searched area and the spatial entities in the ontology. In one of the results of our research work (Moncla et al., 2014), we point out the need for official gazetteers and public spatial ontologies with more fine-grained toponyms.
Explore new Spatial Ranking methods for Reverse Geocoding in the context of two-dimensional datasets: In this research work we use the concept of spatial ranking to transform (reverse geocode) Direct Spatial References into the most relevant Indirect Spatial Reference for the referenced location. In particular, we rank query results by the spatial similarity of two-dimensional footprints. Although we have tested several distance measures, new ones should be developed and tested to retrieve the resource with the best spatial match. Our research has found that search systems need improved measures to rank geospatial resources better.
Explore new two-dimensional clustering algorithms and metrics: Collective metadata validation by means of clustering can be applied when we have additional information about neighbours with a good spatial consensus, that is to say, there must be agreement among the indirect spatial references that describe the referenced location. The development of techniques to find this geospatial agreement in the presence of noisy and huge volumes of information is therefore an open research issue. Many approaches deal with these problems in one dimension (resources referenced by a point), but the open research line and the challenge is to shift from 1D (a point) to 2D (MBBox, multi-polygons and complex geometries) geographical footprints to assess their quality. The last point regarding two-dimensional clustering is the internal metric used to measure the co-occurrence of resources. When resources have two-dimensional footprints, the metric must measure the geospatial match between the compared resources, that is to say, their spatial similarity. The clustering algorithm and the internal metric could be exchanged for others in order to find more accurate clusters and thus avoid potential errors.
Apply lessons learned to Curation and Preservation processes: Bearing in mind the notion of Digital Library lifetime, i.e. "a Digital Library provides access to information whose value is preserved across long periods of time" (Dragland, 2005), digital curation is a research field with many opportunities and challenges (Janée, 2009). We believe that geospatial Quality Assessment can help preserve digital resources across long periods of time. The increasing volume of geospatial resources accumulated in Digital Libraries will make it ever more necessary to ensure proper and consistent spatial descriptions. Preservation of datasets must cover both data and metadata, i.e. it must ensure that in the future a resource will not be invisible, that is, that the resource can be found among millions by means of the metadata used to describe, explore, geo-visualise and retrieve it. We hope that our research results will motivate data and metadata creators to create metadata records that are consistent and to keep them so. The development of policies and Quality Assessment tools will help to ensure efficient retrieval in future search systems.
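To make the ranking step concrete, the following sketch scores gazetteer entries against a query footprint and sorts them by spatial similarity. It uses the Jaccard overlap of bounding boxes as one possible measure, chosen here for illustration and not claimed to be one of the measures evaluated in the thesis. The gazetteer entries are rough, made-up boxes, not authoritative extents.

```python
from typing import Dict, List, NamedTuple, Tuple


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def area(b: BBox) -> float:
    # Clamp to zero so that disjoint boxes yield an empty intersection.
    return max(0.0, b.east - b.west) * max(0.0, b.north - b.south)


def jaccard(a: BBox, b: BBox) -> float:
    """Intersection area divided by union area, in the range [0, 1]."""
    inter = area(BBox(max(a.west, b.west), max(a.south, b.south),
                      min(a.east, b.east), min(a.north, b.north)))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def rank_place_names(footprint: BBox,
                     gazetteer: Dict[str, BBox]) -> List[Tuple[str, float]]:
    """Reverse geocode a footprint: place names sorted by decreasing similarity."""
    scored = [(name, jaccard(footprint, box)) for name, box in gazetteer.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical gazetteer with approximate, illustrative boxes.
GAZETTEER = {
    "Spain": BBox(-9.3, 36.0, 3.3, 43.8),
    "Aragón": BBox(-2.2, 39.8, 0.8, 43.0),
    "Zaragoza": BBox(-1.1, 41.5, -0.7, 41.8),
}

query = BBox(-2.0, 40.0, 0.5, 42.9)  # footprint taken from a metadata record
ranking = rank_place_names(query, GAZETTEER)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

All three entries contain or overlap the query, so a simple containment or point-in-box test could not separate them. The area-based score prefers the entry whose extent is closest in size and position to the query, here the region rather than the country or the city.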
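The point about exchanging the clustering algorithm and its internal metric can be illustrated with a grouping routine that receives the similarity function as a parameter. The sketch below is a simple threshold-based, single-linkage grouping and only a stand-in, not the adapted clustering algorithm of the thesis. The Jaccard metric, the 0.5 threshold and the sample boxes are assumptions.

```python
from typing import Callable, Dict, List, NamedTuple


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def area(b: BBox) -> float:
    return max(0.0, b.east - b.west) * max(0.0, b.north - b.south)


def jaccard(a: BBox, b: BBox) -> float:
    """One possible internal metric: intersection area over union area."""
    inter = area(BBox(max(a.west, b.west), max(a.south, b.south),
                      min(a.east, b.east), min(a.north, b.north)))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def cluster(footprints: List[BBox],
            similarity: Callable[[BBox, BBox], float],
            threshold: float) -> List[List[int]]:
    """Group indices of footprints linked by similarity >= threshold."""
    parent = list(range(len(footprints)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(footprints)):
        for j in range(i + 1, len(footprints)):
            if similarity(footprints[i], footprints[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(footprints)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up footprints: two region-scale, two city-scale, one country-scale.
boxes = [
    BBox(-2.2, 39.8, 0.8, 43.0),
    BBox(-2.0, 40.0, 0.8, 43.0),
    BBox(-1.1, 41.5, -0.7, 41.8),
    BBox(-1.0, 41.5, -0.7, 41.8),
    BBox(-9.3, 36.0, 3.3, 43.8),
]
clusters = cluster(boxes, jaccard, threshold=0.5)
print(clusters)
```

Every box in the example overlaps the others, yet the area-based metric keeps the city-scale, region-scale and country-scale footprints in separate groups. Passing a different `similarity` function or threshold changes the grouping without touching the rest of the routine.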
FINAL CONCLUSION:
Resources described by inconsistent metadata are often difficult to retrieve, especially with complex queries. The work developed in this thesis has shown that it is possible to detect inconsistencies in geospatial digital repositories. This is done by applying geospatial Quality Assessment, in particular to the semantically close properties of the descriptive information. Metadata records in digital collections may become unretrievable due to inconsistencies between the semantically close properties of their metadata. In general, the efficiency of numerous tasks and processes in information systems depends on the consistency of the semantically close properties, in particular processes such as discovery, retrieval, visualization, analysis, sharing and interoperability (e.g. Linked Data), curation, preservation and re-use. In the geographical domain, geospatial Quality Assessment of the semantically close geographical properties can help to detect and fix inconsistencies.
Producing data, and quality data in particular, is expensive. This explains why re-use makes sense as a way to reduce and share costs. However, re-use requires a careful assessment of the metadata descriptions that make the described resources easy for external consumers to discover, share and re-use. Professionals often assume that data management only entails preserving local consistency (not collective agreement or consensus about the proper description of a phenomenon). But this is not true. This thesis has shown that neglecting the quality of a pair of properties in a metadata record can cause serious invisibility and retrievability problems. A resource without consistent metadata is for most purposes invisible and effectively lost. However, as this thesis shows, it is possible to run semi-automatic Quality Assessments over large collections that detect those invisible records. Further research should analyse whether this approach can be implemented as an off-the-shelf component that can be added to popular information retrieval software.
REFERENCES:
B. W. Boehm. A spiral model of software development and enhancement. Computer, 21(5):61–72, 1988.
T. R. Bruce and D. I. Hillmann. The continuum of metadata quality: defining, expressing, exploiting. In Metadata in Practice, Edited by Diane I. Hillmann and Elaine L. Westbrooks. Chicago: American Library Association. ALA editions, 2004.
K. Dragland. Adding a local node to a global georeferenced digital library. Master's thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, Norway, 2005.
N. Dushay and D. I. Hillmann. Analyzing metadata for effective use and re-use. In DCMI Metadata Conference and Workshop, Seattle. Dublin Core Metadata Initiative, 2003.
M. Fernández-López, A. Gómez-Pérez, and N. Juristo. Methontology: from ontological art towards ontological engineering. In Proceedings of the AAAI97 Spring Symposium Series on Ontological Engineering, pages 33–40, 1997.
FGDC. Content standard for digital geospatial metadata. Federal Geographic Data Committee, 1998b.
B. Furrie. Understanding MARC bibliographic: machine-readable cataloging. Cataloging Distribution Service, Library of Congress, in collaboration with the Follett Software Company, 2009.
M. F. Goodchild and J. Zhou. Finding Geographic Information: Collection-Level Metadata. GeoInformatica, 7(2):95–112, 2003.
J. Hartmann and H. Stuckenschmidt. Automatic metadata analysis for environmental information systems. In Proceedings of the International Symposium on Environmental Informatics, 2002.
G. Hodge. Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. ERIC, 2000.
ISO/TC 211. ISO 19119:2005 Geographic Information – Services. Published standard ISO 19119:2005, International Organization for Standardization, October 2005.
L. L. Hill. Georeferencing: The Geographic Associations of Information (Digital Libraries and Electronic Publishing). The MIT Press, 2006.
D. I. Hillmann, N. Dushay, and J. Phipps. Improving metadata quality: augmentation and recombination. In DC-2004, Shanghai, China. Dublin Core Metadata Initiative, 2004.
A. Isaac and E. Summers. SKOS Simple Knowledge Organization System Primer, W3C Working Group Note 18 August 2009, 2009.
G. Janée. Digital curation. In Encyclopedia of Database Systems, pages 816–817. Springer, 2009.
J. Lacasta, J. López-Pellicer, W. Renteria-Agualimpia, and J. Nogueras-Iso. Improving the visibility of geospatial data on the Web. In Proceedings of Digital Libraries 2014: ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014) and International Conference on Theory and Practice of Digital Libraries (TPDL 2014), London, September 8-12th 2014. ACM/IEEE, 2014b.
J. Lacasta, J. López-Pellicer, W. Renteria-Agualimpia, and J. Nogueras-Iso. Agregador automático de servicios Web geoespaciales. Scire: representación y organización del conocimiento, 20(2):43–48, 2014a.
C. Larman and V. R. Basili. Iterative and incremental development: A brief history. Computer, 36(6):47–56, 2003.
F. J. López-Pellicer. Semantic Linkage of the Invisible Geospatial Web. PhD thesis, Universidad de Zaragoza, 2011.
F. López-Pellicer, R. Béjar, W. Renteria-Agualimpia, A. Florczyk, P. Muro-Medrano, and F. Zaragoza-Soria. Status of INSPIRE inspired OGC Web Services. In INSPIRE Conference, 2011.
B. Martins, J. Borbinha, G. Pedrosa, J. Gil, and N. Freire. Geographically-aware information retrieval for collections of digitized historical maps. In Proceedings of the 4th ACM Workshop on Geographical information Retrieval, pages 39–42. ACM, 2007.
A. Miles and S. Bechhofer. SKOS Simple Knowledge Organization System Reference. W3C Recommendation, 18 August 2009.
L. Moncla, W. Renteria-Agualimpia, J. Nogueras-Iso, and M. Gaio. Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, Texas, USA, November 4-7. ACM, 2014.
D. Nebert, A. Whiteside, and P. Vretanos. Open GIS Catalogue Services Specification. OpenGIS Publicly Available Standard OGC-07-006r1, Open GIS Consortium Inc., February 2007. Version 2.0.2.
D. D. Nebert. Developing Spatial Data Infrastructures: The SDI Cookbook. Global Spatial Data Infrastructure, 2004.
W. Renteria-Agualimpia, F. J. López-Pellicer, J. Lacasta, F. J. Zarazaga-Soria, and P. R. Muro-Medrano. Identifying hidden geospatial resources in catalogues. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, page 32. ACM, 2013c.
W. Renteria-Agualimpia, F. J. López-Pellicer, P. R. Muro-Medrano, J. Nogueras-Iso, and F. J. Zarazaga-Soria. Exploring the advances in semantic search engines. In Distributed Computing and Artificial Intelligence, pages 613–620. Springer, 2010.
W. Renteria-Agualimpia and S. Levashkin. Multi-criteria geographic information retrieval model based on geospatial semantic integration. In GeoSpatial Semantics, pages 166–181. Springer, 2011.
W. Renteria-Agualimpia, F. J. López-Pellicer, J. Lacasta, P. R. Muro-Medrano, and F. J. Zarazaga-Soria. Aproximación geosemántica para detectar inconsistencias en los metadatos de servicios Web geoespaciales. GeoFocus: International Review of Geographical Information Science and Technology, 13(1):154–176, 2013b.
W. Renteria-Agualimpia, F. J. López-Pellicer, J. Lacasta, P. R. Muro-Medrano, and F. J. Zarazaga-Soria. Identifying geospatial inconsistency of Web services metadata using spatial ranking. Earth Science Informatics, pages 1–11, 2014. ISSN 1865-0473. doi: 10.1007/s12145-014-0172-4.
W. Renteria-Agualimpia, F. J. López-Pellicer, A. J. Florczyk, J. López de Larrinzar, J. Lacasta, P. R. Muro-Medrano, and F. J. Zarazaga-Soria. Detectando anomalías en los metadatos de cartotecas. Scire: representación y organización del conocimiento, 19(1):23–29, 2013a.