Who Maps Language?

Mapping language ultimately depends on years, even decades, of painstaking ethnographic and linguistic study, and on the cooperation of many small communities across the world.

SIL (Summer Institute of Linguistics, Inc.) has investigated over 2,590 languages spoken by over 1.7 billion people in nearly 100 countries. SIL makes its data and publications available via the Ethnologue: Languages of the World. Other organizations, such as the United Nations Educational, Scientific and Cultural Organization (UNESCO), track language statistics gathered by census and other means.

The sites listed below present language data in a geospatial context. For the most part, these sites aim toward one of several purposes:

  • advancing academic understanding
  • promoting the awareness and preservation of cultural diversity
  • and Christian evangelism.

Some sites present data to answer specific questions.

  • Where is a language spoken?
  • What languages are spoken in a given region?
  • For which languages has evangelical contact been made?

For simple questions, a search interface may suffice. A few of the sites below provide exploratory interfaces: the user is invited to look at multivariate data through user-selected layers and other map features.

The list of language mapping sites below is probably not comprehensive. The first set of sites are focal data providers that also publish language maps. Next come sites that combine various kinds of information from multiple data providers to enable some form of data exploration. Finally, a few data providers are listed that have no native mapping capability but clearly fit this space well.

Focal language mapping data providers

UNESCO Language Atlas
“There is no perfect way to reflect the complexities of languages and their communities on a map. The print edition of the Atlas seeks to provide global coverage, dividing the world somewhat arbitrarily into regions; those with the greatest linguistic diversity are presented at smaller scale than those with less diversity. For the online edition, users determine the zoom level themselves, allowing a panoramic view or a very detailed one. No attempt is made to show population density or the area in which a language is spoken; we have instead selected a central point for each language.”

Unesco Language Mapping

Essentially, this site is a Google Maps search interface for endangered languages. Pushpins mark a central point for the area where each language is spoken. Detailed information for a language is available by clicking its pushpin.

The World Atlas of Language Structures Online (WALS) is a… “large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of more than 40 authors (many of them the leading authorities on the subject).

WALS consists of 141 maps with accompanying texts on diverse features (such as vowel inventory size, noun-genitive order, passive constructions, and “hand”/“arm” polysemy), each of which is the responsibility of a single author (or team of authors). Each map shows between 120 and 1370 languages, each language being represented by a symbol, and different symbols showing different values of the feature. Altogether 2,650 languages are shown on the maps, and more than 58,000 datapoints give information on features in particular languages.”

Wals.info

Wals.info Detailed Language Data

Wals.info also provides a Google Maps interface to language data. The user can display either language “location” via pushpins or particular linguistic features (e.g., consonant inventories) via colored pushpins. Only univariate data can be viewed and, unfortunately, not all languages are represented.

The purpose of the Ethnologue is to provide a comprehensive listing of the known living languages of the world. “The Ethnologue is intended more as a catalog than as an encyclopedia and so provides summary data rather than more extensive descriptions of identified languages. Information comes from numerous sources and is confirmed by consulting both reliable published sources and a network of field correspondents. Much of the focus of Ethnologue is on the less commonly known languages. Greater detail and depth of description of many of the languages, especially the larger, more commonly studied languages, can be found in other works such as the International Encyclopedia of Linguistics (Frawley 2003), The World’s Major Languages (Comrie 1987), and The Atlas of Languages (Comrie, Matthews, and Polinsky 1997).”

Ethnologue has no interactive mapping and displays static language maps on pages separate from the content about those languages. However, in conjunction with Ethnologue, Global Mapping International’s (GMI) World Language Mapping System makes available Geographic Information System (GIS) data that maps language locations as both points and polygons and includes attribute information from Ethnologue. “The World Language Mapping System (WLMS) is the result of over 20 years of collaborative work between GMI and the SIL International (SIL), to map the over 6,800 languages described in SIL’s 16th edition Ethnologue.”

Here is an example of such a map, as posted on the Ethnologue website.

Ethnologue Language Map

Sites that consolidate and present language data across multiple data sources

LL-MAP “…is a project designed to integrate language information with data from the physical and social sciences by means of a Geographical Information System (GIS).” Data sources include genetic relationships between languages, topography, political boundaries, demographics, climate, vegetation, and wildlife, “…thus providing a basis upon which to build hypotheses about language movement across territory. Some cultural information, e.g., on religion, ethnicity, and economics, will also be included.”

Recently, an Australian researcher, Quentin D. Atkinson, published a paper offering evidence for the theory that language originated in Africa. He traced a theoretical migration path corresponding to human migratory paths using data from WALS (number of phonemes), Ethnologue (population data), and GMI (geographic extents). Perhaps LL-MAP will combine data sources in ways that give rise to hypotheses about other, more modern migration paths.

LLMap Viewer

LL-MAP is an interactive viewer that uses the OpenLayers and Ext JS libraries for dynamic presentation. The viewer presents data layers in a panel to the left of the map. Users navigate a tree structure in that panel to choose any number of layers to display. To select a layer, the user drags it onto the map. Transparency can be set by right-clicking in the “active layers” pane. In addition, LL-MAP can harvest and display data from any WMS (Web Map Service) compliant server.
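
To make the WMS point concrete, here is a minimal sketch of the kind of request a WMS client sends when it asks a server for a map image. The parameter names are the standard WMS 1.1.1 GetMap parameters; the server URL, the layer name `language_areas`, and the bounding box are hypothetical placeholders and do not describe LL-MAP's actual configuration.

```python
from urllib.parse import urlencode


def build_getmap_url(base_url, layer, bbox, width=512, height=256):
    """Return a WMS 1.1.1 GetMap URL for a single layer.

    bbox is (min_lon, min_lat, max_lon, max_lat) in EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value asks for the server's default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
        "TRANSPARENT": "TRUE",  # lets the image be overlaid on a base map
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and layer, for illustration only.
url = build_getmap_url(
    "https://example.org/wms", "language_areas", (20.0, 40.5, 23.5, 42.5)
)
print(url)
```

Because every compliant server accepts the same request shape, a viewer only needs a server address and a layer name to add a new overlay, which is presumably what makes harvesting from arbitrary WMS servers practical.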

One of the more sophisticated mapping applications listed in this post, LL-MAP intends to link data to graphs of language trees generated via the MultiTree project. MultiTree is an NSF-funded project conducted by the Institute for Language Information and Technology (The LINGUIST List) at Eastern Michigan University. Relating the genetic relationships among languages to their geographic dispersion seems to invite linked visualizations, as LL-MAP suggests.

Joshua Project is a Christian research initiative highlighting ethnic groups around the world. This site includes data from the World Christian Database, International Mission Board, and Ethnologue.

Joshua Project

Joshua Project Language Detail Map

Joshua Project uses interactive Flash maps from AmMap to provide limited geospatial context. Most information is provided in separate tabbed panes below the map. A great deal of information is available from this site, including audio clips, pie charts of statistics about religions, ministries and contact, etc. The maps provide only a simple geospatial context showing country boundaries; there are no markers, and nothing links the map to the associated text.

“The World Missions Atlas Project contains various forms of information including maps, tabular data sets, and written descriptions. The information is helpful in assessing the current status of Missions progress throughout the world. It is a constantly expanding site that seeks to produce a strategically significant World Missions Atlas.”

WorldMAP ArcGIS Viewer

The WorldMap viewer is an ESRI ArcGIS Adobe Flex-based viewer. A great deal of interactivity is afforded via semi-transparent layers, search, drawing tools, etc. Language mapping is provided by a GMI data layer. This map has a very polished design and accommodates both multivariate data exploration and search.

Sites offering related data, but without custom mapping

The World Christian Database (WCD) includes “detailed information on 9,000 Christian denominations and on religions in every country of the world. Extensive data are available on 232 countries and 13,000 ethnolinguistic peoples, as well as on 5,000 cities and 3,000 provinces.”

The International Mission Board contains downloadable data (spreadsheets) consolidating information across several sources regarding peoples, languages, and status of evangelism.

Design considerations

Language maps are largely produced for either reference (answering very specific questions) or data exploration. In both cases, frequently used contextual features include population, related languages, geospatial extent, geophysical features, ethnicity, and religion. Cartographers often make explicit choices with regard to which features are presented. However, in more dynamic sites, maps tend to be more exploratory and users are presented with an array of data sources to choose from. In either case, most rely on the same set of original data providers such as Ethnologue, GMI, and official government statistics (surveys and census).

Language mapping sites range from static map images produced with traditional desktop GIS, through pushpin layers on web maps, to multivariate data representations built with OpenLayers and JavaScript or rich interactive Flash/Flex-based visualizations. Generally, map content appears to be designed separately from the textual narrative: maps are explored independently of other text on the site.

In future posts, I will examine how this paradigm might be shifted toward content-driven maps which are coordinated with, and embedded in, page content. Though not discussed here, Wikipedia contains language maps at varying levels of detail. In contrast to the sites listed above, a language map in Wikipedia provides specific geospatial context for the article in which it is embedded. These maps vary greatly in style, detail, and quality. This is the sort of content that might benefit from more automated map generation. Ethnologue, which currently makes a clear separation between textual content and maps, might also benefit.

Content-Based Generation of Language Maps

In a previous post, I presented examples of sites actively engaged in language mapping. In this post, we will look more deeply at two sites: Wikipedia and Ethnologue.

Wikipedia articles about particular languages typically have embedded maps to provide geospatial context. The design, style, and detail of such maps vary widely. This is not terribly surprising given that Wikipedia is a crowd-sourced content publishing platform. Anyone may create an article, edit published articles, and contribute graphics and maps.

Ethnologue, on the other hand, is a carefully curated reference. Text and maps have equal prominence but are not combined into integrated views. Users may flow freely between two main content views, language detail and country detail, and view maps by following embedded text links. Ethnologue is, by nature, a reference book with maps. Where maps are concerned, it targets the question of where a language is spoken and little else.

Language Maps in Wikipedia

Infobox for the Macedonian Language


Perhaps the most standard location of a language map in Wikipedia is the infobox. Because anyone may create a page, the schemes for organizing collections of language pages vary. Pages fall under container categories such as “Languages of Country”, “Endangered languages”, “Lists of endangered languages”, “Germanic Languages”, etc. Language pages themselves generally have rich infoboxes containing both a simple map and the relevant portion of a language tree.

The infobox for the Macedonian language gives a quick peek into where the language is spoken, how many speakers it has, its writing system, and its place within its family tree. The infobox map is necessarily quite small and simple. It need only answer the question of where that language is predominantly spoken: as an official language, by population density, or by some other metric. Climbing up the tree, infoboxes contain relatively less information, but geographical and family tree information is often still available.

Within any particular article, there may be an arbitrary number of embedded maps. However, such maps are probably rarely designed around the content.

Dialect Map of Macedonian (Wikipedia)

For example, this dialect map of Macedonian contains categories that either differ from those listed in the adjacent Wikipedia text or carry variant names. Given the contentious nature of dialect and language classification, the map was presumably created under some particular theory that does not quite match the theory or theories represented in the Wikipedia text.

Language Maps in Ethnologue

Maps on Ethnologue were probably originally designed to match relevant textual content on the site. As noted here, problems may have later emerged where maps and content diverge. In general, there are two major types of pages on Ethnologue: country pages and language pages. Click the thumbnails below to view in more detail.

Ethnologue Country Page


Ethnologue Language Page


Both of these page types have a high degree of internal structure. For every country page, there is a link to a map page: maps are clearly designed to augment country pages. Language descriptions on both country and language pages link to the language's classification in terms of its family tree.

Clicking on this tree (see image below) provides slightly more structure and content.

Family Tree on Ethnologue

Content-Based Mapping

Both Wikipedia and Ethnologue are very large sites with thousands of pages dedicated to describing individual languages as well as relations between languages. This is not only a significant challenge for asset management and maintenance, but also usability. The challenge for Wikipedia is much reduced given that human curators are constantly updating pages. However, there are many language pages with little content and no maps. The goal of content-based mapping is to automatically generate maps taking into account accompanying textual content. Three scenarios are described below.

Scenario 1: Wikipedia 1 – Generating static maps for infoboxes

An innovation of Wikipedia is the presentation of a relatively structured summary of page content as an infobox. Embedded maps must necessarily be small with little detail. It is possible for users to expand these images to see more detail (such as a choropleth population distribution), but it’s important that maps viewed inside the infobox be clear and relevant.

One approach to content-based mapping in this context is to leverage an ontology based on the infobox content of language pages. By accessing the ontological types and properties of infoboxes (e.g., via DBpedia) and combining them with geodata such as that available from GADM (Database of Global Administrative Areas), simple static maps can be generated, cached, and stored on Wikimedia Commons. A reason to do so would be to standardize map design across language pages and to provide maps where none exist.
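
As a rough sketch of that pipeline, the snippet below joins a structured infobox record to boundary geometry and emits GeoJSON that a renderer could turn into a static map. Everything here is an assumption made for illustration: the `INFOBOX` record is a hand-made stand-in for what a source like DBpedia might expose, the country list is an example value only, and the rectangle in `BOUNDARIES` is a placeholder rather than a real border from GADM.

```python
import json

# Hypothetical stand-in for structured infobox data; values are illustrative.
INFOBOX = {
    "language": "Macedonian",
    "spoken_in": ["MKD", "ALB"],  # ISO 3166-1 alpha-3 country codes
}

# Toy boundary lookup keyed by country code. The rectangle is a placeholder,
# not a real outline; real geometry would come from a boundary dataset.
BOUNDARIES = {
    "MKD": [[[20.4, 40.8], [23.0, 40.8], [23.0, 42.4], [20.4, 42.4], [20.4, 40.8]]],
}


def infobox_to_geojson(infobox, boundaries):
    """Build a GeoJSON FeatureCollection with one polygon per mapped country."""
    features = []
    for code in infobox["spoken_in"]:
        rings = boundaries.get(code)
        if rings is None:
            continue  # no geometry available: skip rather than guess
        features.append(
            {
                "type": "Feature",
                "properties": {"language": infobox["language"], "country": code},
                "geometry": {"type": "Polygon", "coordinates": rings},
            }
        )
    return {"type": "FeatureCollection", "features": features}


collection = infobox_to_geojson(INFOBOX, BOUNDARIES)
print(json.dumps(collection, indent=2))
```

Keeping the output in a standard format such as GeoJSON would let the same data feed either a static image renderer for the infobox or a larger interactive view.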

Scenario 2: Ethnologue – Embedded maps on country and language pages

Ethnologue is a use case where embedded maps could potentially make a large impact on site usability. It is currently very easy to look up information on Ethnologue. But it is also very difficult to find answers to some kinds of questions such as “In what countries is French an official language?” Or, “how related are French and English?”

The eminently useful feature of Ethnologue is its highly structured nature. It is reasonable to expect good information extraction using a domain ontology. Also, the sorts of questions users might have are enumerable and focused on a very specific domain. Unlike the Wikipedia example in Scenario 1 above, this use case may benefit from a more exploratory interface where multiple questions can be explored.
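
To illustrate why structure helps, here is a minimal sketch of how the two example questions could be answered once page content has been extracted into records. The `LANGUAGES` table is a small hand-made sample, not Ethnologue data: the classification paths are simplified and the country lists are deliberately incomplete. The function names are hypothetical as well.

```python
# Hand-made sample records; simplified and incomplete, for illustration only.
LANGUAGES = {
    "French": {
        "classification": ["Indo-European", "Italic", "Romance"],
        "official_in": ["France", "Canada", "Senegal"],
    },
    "English": {
        "classification": ["Indo-European", "Germanic", "West"],
        "official_in": ["Canada", "Nigeria"],
    },
}


def countries_where_official(name, languages):
    """Answer: in what countries is this language an official language?"""
    return sorted(languages[name]["official_in"])


def shared_ancestry(a, b, languages):
    """Answer: how related are two languages?

    Returns the classification levels both languages share, from the root
    of the family tree downward. A longer list means a closer relationship.
    """
    shared = []
    path_a = languages[a]["classification"]
    path_b = languages[b]["classification"]
    for level_a, level_b in zip(path_a, path_b):
        if level_a != level_b:
            break
        shared.append(level_a)
    return shared


print(countries_where_official("French", LANGUAGES))
print(shared_ancestry("French", "English", LANGUAGES))
```

The country list maps naturally onto highlighted polygons, and the shared ancestry onto a highlighted branch of a family tree, so both answers could drive coordinated map and tree views.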

Scenario 3: Wikipedia 2 – Maps accompanying article content

This scenario is far richer and more problematic. Wikipedia pages are often segmented into individual sections, and custom maps are sometimes embedded in those sections (for example, the Macedonian dialect map above). But there are generally relatively few embedded maps in Wikipedia articles. One could imagine embedding a geographic visualization designed for language exploration. A domain ontology could be devised that addresses specific sorts of information in language pages (e.g., place names, language names, dialects, etc.) but ignores other sorts of content. In this case, maps would not so much be designed around specific page content as relate to it only in a topical way. Some pages explore concepts in interesting sub-domains (e.g., phonological isoglosses) that would be interesting to map. But such rich domain content would require detailed ontological knowledge, tailored information extraction, and access to geospatially rich data that likely does not exist.

Summary Implications

The first two scenarios described above seem ripe for investigation. The Wikipedia infobox use case is certainly worth exploring and the simpler of the two. The second seems potentially a valuable contribution to Ethnologue, which is a website run by a non-governmental organization on a limited budget. The Ethnologue investigation combines ontology-based extraction with rich geographic visualization and innovative interaction design. Both of these first two scenarios could be enriched by other data visualization techniques such as language family graphs.

R as a “Sketchy” Tool in Graphics Development

This is an interesting video from Amanda Cox in the graphics department of the New York Times. She talks about how the NY Times Graphics department uses R. Basically, she notes that R is particularly good for doing three things:

  • reading data
  • manipulating it
  • and producing graphics.

The NY Times doesn’t use generated R graphics for direct publication, but for producing “sketchy” drafts. As Amanda says, she mostly does one-off work and it’s very quick and easy to generate visuals (in minutes) to see if they may have value and are worth the time and effort to develop further. Code is generated quickly and not intended for re-use.

Below is an interesting Flash graphic that the NY Times team created. It was originally conceived as a series of R graphics generated from zip code rental data available from Netflix. (Click on this link to visit the interactive NY Times graphic.)

Interactive graphic that was conceived using R

Amanda sums up the value of R in graphics design very well: “When it’s trivial to go out and get data in the real world… and sketch your data… then I think you do better things.”