A boat can also be a vessel, a bombarda, a capitana (flagship), a caravel, a ship, a galleon or a sloop. Any of these words may be waiting, hidden among the thousands of old documents held in a historical archive. They may be handwritten, more or less legibly, in humanistic, procesal, encadenada (chained) or cortesana script. There is no historian who has not faced this matryoshka of research. But one of them, Carlos Alonso, wondered whether an artificial intelligence system could take on that cumbersome task. And the ‘Carabela’ project has just demonstrated that an algorithm can be a kind of Rosetta stone for historical archives.
“Time and money”: that is the cost of any research into shipwrecks, according to Alonso, a historian at the Centre for Underwater Archaeology (CAS) in Cádiz. He is one of the architects of this intelligent system, which can find words and combinations of words in digitized old documents. The system has taken more than two years of work by researchers from the CAS, which is attached to the Andalusian Institute of Historical Heritage (IAPH), and from the Pattern Recognition and Human Language Technology research centre (PRHLT) of the Universitat Politècnica de València (UPV), led by professor Enrique Vidal.
The Valencian physicist and his team (made up of José Miguel Benedí, Lorenzo Quirós, Francisco Casacuberta, Moisés Pastor, Vicente Bosch, Alejandro Toselli, Verónica Romero and Joan Andreu Sánchez) have spent more than 12 years on research aimed at developing technologies capable of processing handwritten text. They have obtained good results for particular collections, such as the manuscripts of the English philosopher Jeremy Bentham. But they had never taken on the ambitious challenge that Alonso had had in mind since 2011, when he learned of Vidal's work through an interview: getting the system to understand different types of handwriting, usually convoluted, in images of varying quality.
“There were difficulties that we had never tackled,” explains Vidal. That was until the Carabela project, carried out between 2017 and 2019 with funding from the BBVA Foundation, showed that the technology is ready to read words in photographs of low contrast and quality, of up to 125 pixels per inch, written in varied and sometimes nearly illegible script styles from the 15th to the 19th century. “We pushed the system to its limit and the result has been very good,” says Alonso. This variability in images, quality and writing styles was an essential requirement for the system to be useful in the research on sunken ships that the CAS carries out to draw up its underwater archaeological chart.
The system is based on a probabilistic indexing method, with an interface similar to a keyword search engine. The algorithm works through the image pixel by pixel, using both optical models, which decipher the written characters, and language models, which analyze how these combine to form words and phrases. Searches produced successful results in more than 80% of cases, and the system always gives the user a percentage indicating how reliable each result is. “The success is due in large part to the fact that we do not insist on transcribing verbatim, but instead build index maps with the probabilities of everything that could be written at each point of each image,” explains Vidal.
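To make the idea of a probabilistic index more concrete, here is a minimal sketch in Python. It is not the code used by the project: the class names, the page identifiers, the coordinates and the probabilities are all invented for illustration, and the step in which optical and language models produce the word hypotheses is assumed to have already happened. The sketch only shows how, instead of storing one fixed transcription, an index can keep several candidate words with a probability for each region of an image and return matches together with their reliability.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    image_id: str       # hypothetical page identifier
    position: tuple     # (x, y) of the region in the image, in pixels
    probability: float  # estimated chance that the word is written there


class ProbabilisticIndex:
    """Toy index: word -> every place where it *might* be written."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add_region(self, image_id, position, hypotheses):
        # hypotheses maps each candidate word to its probability; in a real
        # system these would come from optical and language models.
        for word, probability in hypotheses.items():
            self._entries[word.lower()].append(Hit(image_id, position, probability))

    def search(self, word, min_confidence=0.5):
        # Keep only hits at or above the threshold, most reliable first.
        hits = [
            hit
            for hit in self._entries.get(word.lower(), [])
            if hit.probability >= min_confidence
        ]
        return sorted(hits, key=lambda hit: hit.probability, reverse=True)


# Invented example data: three regions on three different pages.
index = ProbabilisticIndex()
index.add_region("page_001", (120, 340), {"naufragio": 0.82, "naufragos": 0.11})
index.add_region("page_002", (60, 90), {"navio": 0.64, "nabio": 0.30})
index.add_region("page_003", (200, 15), {"naufragio": 0.35, "sufragio": 0.55})

for hit in index.search("naufragio", min_confidence=0.3):
    print(f"{hit.image_id} at {hit.position}: {hit.probability:.0%}")
```

Lowering `min_confidence` brings back more doubtful matches, while raising it keeps only the most reliable ones. That trade-off is why a reliability percentage shown next to each result is useful to the researcher.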
But the algorithm did not learn to do this task on its own. “In Valencia they set up the school and we taught the child to read,” explains Alonso, referring to the work carried out together with Carmen García Rivera, director of the CAS, Lourdes Márquez, and collaborators María del Carmen Orcero and David Garrido. The team selected more than 130,000 images, one photograph per page, from the collections of the Provincial Historical Archive of Cádiz and the Archivo General de Indias in Seville. From these, Alonso chose 514 documents at random, according to the different types of script, quality and image contrast.
The historian transcribed them word for word, teaching the algorithm the variations in spelling that terms have undergone over the centuries (abbreviations, the interchange of v and b) and their synonyms, so that it could then search on its own. “When I had done only 10 documents, the system had already learned and was helping with the task of manual transcription,” recalls the historian. It took more than a year of teaching, with the uncertainty of whether ‘Carabela’ would really work or not. The doubt was dispelled when, for the first time, he searched the 130,000 documents for the word “shipwreck” and the system returned 400 references. Of these, 150 contained information that was new to the CAS.
“Even if the documents are catalogued or digitized, you have to bear in mind that 80% or 90% of the content of the archives is unknown,” says the archaeologist. ‘Carabela’, which is in beta and available online, has proven able to overcome that hurdle, but it could also become a danger in the hands of treasure hunters and pirates who trawl written references to sinkings in order to plunder underwater sites. For this reason, the creators of the program have chosen to limit access to the images used from the Archivo General de Indias, which holds 80 million documents on centuries of trade with America. In addition, the program has been used to classify the indexed documents according to their level of risk for public display. This makes it easy to know which parts of sensitive archives need protecting.
The new Rosetta stone of the archives is already shaping up as a highly useful future tool for researchers, “although there is still much to develop and improve,” says the historian from Cádiz. In fact, its developers hope to keep improving the algorithm in future projects, to fine-tune the search further and even enable the system to produce approximate transcriptions of paragraphs selected by the user. “It is a pilot project with good results. The key now is for the archive world to be receptive enough to demonstrate it,” concludes an enthusiastic Alonso.