What can big data tell us about the social meaning of language variation? A case study on socially meaningful spelling variation in English

Laura Rosseel, Dong Nguyen

Research output: Unpublished contribution to conference (unpublished abstract)


Spelling variation is abundant in written language use on social media platforms. Crucially, many of the non-conventional spellings found there are not misspellings. Various studies have analyzed the patterns and functions of online spelling variation (e.g. Tatman 2015 and Ilbury 2020 for Twitter) and have suggested a strong connection between phonological and orthographic variation (Eisenstein 2015). This suggests that spelling variation, like other forms of linguistic variation, can be used to express aspects of the language user's social identity (Sebba 2007). Yet little quantitative research has been carried out on the social meanings of spelling variants. This study aims to help fill this descriptive lacuna in sociolinguistic research. We do so by comparing the social meanings of spelling variants, elicited through human experiments, with data-driven meaning representations learnt automatically from large corpora. The study thus supplements its descriptive research aim with a methodological one: to what extent can traditional sociolinguistic 'small data' approaches and recent NLP-based 'big data' approaches complement each other?

In this paper, we focus on spelling variation on the popular online platform Twitter. We look at two types of spelling variation in British English: (1) spelling variation representing phonetic variation (e.g. alveolar vs. velar pronunciation of ING, as in workin vs. working), and (2) spelling variation restricted to the orthographic level (e.g. flooding of characters, as in fun vs. funnnn).

First, the social meaning of the linguistic variants is measured experimentally in a written version of the speaker evaluation paradigm (cf. Leigh 2018). Stimuli for this experiment are selected from a Twitter corpus controlled for region (i.e. the London metropolitan area). A sample of participants (N = 120), geographically matched to the producers of the corpus data, is presented with a series of tweets containing the linguistic variants under study. Participants are asked to rate the personality of the writer on a series of semantic differential scales representing social traits that have been shown to be potentially associated with the social meanings of the targeted linguistic variation.

Second, we compare our experimental measurements of social meaning with word embeddings, i.e. automatically learnt mappings from words to high-dimensional vectors based on co-occurrences in the Twitter corpus (Mikolov et al. 2013; Smith 2020). We analyze the linguistic variants computationally in the embedding space by measuring the distances between the variants and clustering them based on their embeddings. For example, are linguistic variants that received similar ratings in the human experiments clustered together in the embedding space?

Our paper brings novel insights into the social meaning of spelling variation. It furthermore draws attention to the opportunities and limitations of data-driven meaning representations for sociolinguistic research on language variation.

References
Eisenstein, Jacob. 2015. Systematic patterning in phonologically-motivated orthographic variation. Journal of Sociolinguistics 19(2): 161-188.
Ilbury, Christian. 2020. "Sassy Queens": Stylistic orthographic variation in Twitter and the enregisterment of AAVE. Journal of Sociolinguistics 24(2): 245-264.
Leigh, Daisy. 2018. Expecting a performance: Listener expectations of social meaning in social media. Paper presented at NWAV43 [New Ways of Analyzing Variation], New York, 20 October 2018.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado & Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems: 3111-3119.
Sebba, Mark. 2007. Spelling and society: The culture and politics of orthography around the world. Cambridge: CUP.
Smith, Noah A. 2020. Contextual word representations: Putting words into computers. Communications of the ACM 63(6): 66-74.
Tatman, Rachel. 2015. #go awn: Sociophonetic Variation in Variant Spellings on Twitter. Working Papers of the Linguistics Circle of the University of Victoria 25(2).
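
The embedding-space comparison described in the abstract (measuring distances between spelling variants and clustering them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three-dimensional vectors in TOY_EMBEDDINGS are invented placeholders rather than embeddings trained on a Twitter corpus, and the function names, the use of cosine distance, the single-linkage grouping, and the 0.2 threshold are all assumptions made for illustration only.

```python
import math
from itertools import combinations

# Hypothetical toy vectors; real embeddings would be learnt from a corpus
# and have many more dimensions. These values are invented for illustration.
TOY_EMBEDDINGS = {
    "working": [0.9, 0.1, 0.0],
    "workin": [0.8, 0.3, 0.1],
    "fun": [0.1, 0.9, 0.2],
    "funnnn": [0.0, 0.8, 0.5],
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def single_linkage_clusters(vectors, threshold):
    """Merge words into one cluster whenever any pair across two clusters
    lies closer than `threshold` (a simple single-linkage grouping)."""
    clusters = [{word} for word in vectors]
    for a, b in combinations(vectors, 2):
        if cosine_distance(vectors[a], vectors[b]) < threshold:
            cluster_a = next(c for c in clusters if a in c)
            cluster_b = next(c for c in clusters if b in c)
            if cluster_a is not cluster_b:
                # Drop cluster_b from the list, then fold it into cluster_a.
                clusters = [c for c in clusters if c is not cluster_b]
                cluster_a |= cluster_b
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    for a, b in combinations(TOY_EMBEDDINGS, 2):
        d = cosine_distance(TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b])
        print(f"{a:>8} vs {b:<8} distance = {d:.3f}")
    print(single_linkage_clusters(TOY_EMBEDDINGS, threshold=0.2))
```

In a study of this kind, the resulting clusters would then be compared with the groupings suggested by the human ratings; the toy output here says nothing about the actual findings.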
Original language: English
Publication status: Published - 2022
Event: International Conference on Language Variation in Europe: ICLaVE11 - Universität Wien, Vienna, Austria
Duration: 11 Apr 2022 to 14 Apr 2022


Conference: International Conference on Language Variation in Europe
Abbreviated title: ICLaVE11


  • sociolinguistics
  • social meaning
  • NLP
  • word embeddings
  • spelling variation
  • big data

