A guest post by Angela Xiao Wu. Angela is an Assistant Professor in Media, Culture, and Communication at New York University. She uses mixed methods to study Chinese digital cultures and politics, as well as how macro-patterns of online activities in China, the US, and globally take shape in relation to various (infra)structural factors. Twitter: @angelaxiaowu


How can we be confident that what we know about China accurately reflects reality on the ground there? And how can we better recognize what we don’t know, or can’t know? At a moment when journalists and researchers outside of the country find it nearly impossible to travel to and report from China, Chinese Internet data has taken on a new centrality as a source of information purportedly able to reveal shifting political attitudes and social dynamics there. Yet as Professor Angela Xiao Wu argues, it is a mistake to take this data at face value. In her feature for AI Now’s China in Global Tech Discourse series, Professor Wu provides a behind-the-scenes guide to how platform companies, third-party intermediaries, and government actors affect which data are removed, curated, and posted. As big data analyses of Chinese social datasets become more common in the social sciences, what are the necessary caveats that researchers — and the institutions and individuals dependent on their findings for critical understandings of China — should bear in mind? In developing an awareness of the skewed and incomplete picture of life in China that platform data taken at face value can generate, how can we become sharper consumers of stories told using Chinese Internet data?


China has become a land of massive datasets. Publicly accessible content on Weibo (Twitter-like but larger), WeChat (an all-in-one mega-platform that grew out of a messaging app), and Douban (a film and music review website) is amassed and curated by the online opinion monitoring industry to serve its “data enthusiast” niche clientele. Transaction records of Taobao (which forced eBay out of China in 2006) and Didi (which forced Uber out in 2016) reach outside analysts through formal collaboration. All these datasets are also traded behind closed doors and on the black market. China’s rise in this global data-driven enterprise can be attributed to a number of factors: the predominance of huge platforms, lax market regulations, the relative absence of public organizing about data privacy, and an enormous digital world thanks to the government’s long-term infrastructural investments.

Along with those created by China’s burgeoning data-labelling industry, data gathered through Chinese platforms are moving across borders to feed the neural networks of startups, fuel Silicon Valley innovations, and lay the groundwork for an increasing number of academic publications. Many have celebrated the explosion of Chinese platform data as an invaluable opportunity to study China, where political sensitivity restricts survey methods, historical archives, and access to public records. Yet if existing structures of power shape what data are available and in what forms — a dynamic crystallized by none other than the notorious scarcity of Chinese data in the past — we might ask: What undergirds the creation of platform data today? And further: What happens when these data are repurposed for academic social science?

Platform Trace Data for the Social Sciences

To be clear, platform trace data are increasingly used to examine human behavior across geographies. Such data are impressive due to their granularity, their volume, and their capacity to be “passively measured”: they are “unobtrusive” recordings of our activities and behaviors when no one is watching. The prevalence of these data will only increase as more people are brought online and researchers’ demand for these datasets grows. Indeed, in recent proposals to develop computational social science, much effort is concentrated on expanding data collaborations and data infrastructures with platform companies.

Despite this trend, my coauthor Harsh Taneja and I have documented major pushback that addresses data representativeness, data privacy, precarious access at the mercy of platforms, and the data’s commercial origin (see our brief review). On top of these, we focus on the “measurement conditions” of platform trace data, which foreground a different set of epistemic, methodological, and ethical issues. By “measurement conditions” I mean how platform trace data are created. All data result from measurement processes designed and executed to serve a given institutional context. For example, unified standards for observing the weather are essential to meteorological research and human intervention on a planetary scale. The rubric of “best college” metrics induces investment in certain areas but not others.

Platform data are generated as platforms’ records of their own behavioral experimentation, calibrated to answer questions germane to these companies. Platforms constantly modify their digital architectures to observe user activities, meaningfully changing the “measurement methodology” as they do so, and thus creating data that are not necessarily comparable from one moment to the next. Yet what standards exist for the creation and management of these data, or for their responsible use?

In the Shadows of Platform Episteme

Our recent paper illustrates the distinct nature of platform datafication in comparison with third-party audience measurement. Whereas third-party measurement firms in the longer tradition of audience research, such as Nielsen and comScore, measure people’s browsing patterns to provide “currency” for a multi-sided advertising market, platforms measure internally for their own administrative purposes.

Third-party measurement firms make their living selling measurement data. They must convince their customers — content producers and advertisers, whose interests conflict — that their methods are “unbiased” (and the best to inform their customers’ profitable decisions). This convincing involves varied mechanisms such as periodic auditing. Beyond this, they have no investment in how web users behave or in any particular measurement rationale. Platform companies, in sharp contrast, are invested in both. Platforms often define themselves by managing user behavior, and their profit hinges on such measurement. A platform’s datafication infrastructure is set up and deployed to “improve” design features, fuel advertising analytics, and please investors. It is constitutive of what we call “platform episteme” — platforms’ way of knowing.

Meanwhile, the platform datasets that external researchers lay their hands on seldom contain information about platforms’ own methods for creating and managing data on user behavior, nor information on how they deploy such data to influence user behavior. (This is unsurprising, as such secrets enable their business models.) What external researchers identify and interpret as correlations within platform trace data thus amount to partial and contorted accounts of human conduct, which essentially conceal platform interventions — rankings, recommendations, display aesthetics, and so forth, including commercial content moderation under proliferating rules. As researchers draw conclusions from this data, they are also drawing on a historical set of conditions and decisions made by the tech platforms.

When treating platforms as transparent vehicles for users’ inherent intentions, scholarship effectively obscures platform companies’ prevailing power. At its core, this methodological pitfall stems from a lack of “contextual knowledge” about platform datafication. We cannot extrapolate human behavior from platform traces without accounting for platform governance.

Seeing China through Platform Trace Data

In any society, platform datafication operates under broader structural and infrastructural forces. Indeed, Chinese platform governance involves the commercial platforms’ ever-shifting architectures and modulations. But it also involves the platforms’ often hidden attempts to fulfill political directives from government agencies through content curation and removal. For example, Weibo’s Trending list, which researchers often cite to indicate the center of public attention, is constantly doctored. When researchers of popular emotional expression rely on sentiment analysis of WeChat content supplied by online opinion monitoring companies, they seldom account for the fact that the analytic software itself takes shape as part of the larger political imperative to contain digital activism.

Repurposing Chinese platform data thus tends to replicate a platform episteme that embeds political control. Such analyses are likely to obscure algorithmic and institutional governance over user activity, including the censorship regime at work.

Moreover, when platform data serve as the only data source, their intrinsic features also limit knowledge production. As I have elaborated in a recent essay, platform trace data and data scraped from online venues are typically about the behavior of individual user accounts or the semantic content of sprawling textual fields. Lacking direct measures for gender, income, ethnicity, and so forth, these data — despite their abundance — are “unfriendly” to the investigation of structural inequality.

Further, in American universities, government funds and private philanthropy used to be a lifeline for developing so-called “indigenous knowledge” of foreign territories. Since the end of the Cold War, however, this support has been dwindling. As academic jobs have become fast-paced and competitive, graduate mentors in the social sciences discourage students from spending time on foreign language study and fieldwork. Adding to this shift away from “older” methods is an embrace of STEM talent that comes with the expansion of “computational social science,” which itself reflects changes in institutional resource allocation.

Meanwhile, big data social science research within China tends to stay away from “politically sensitive topics” and concentrates on narrow, so-called “application” problems. It’s not difficult to see that research in- and outside China has a similar tendency — both employ Chinese platform data to build up claims about abstracted human behavior.

In its extreme form, research may actively occlude distinctly Chinese histories and configurations in attempts to justify these data as proxies for generic social constructs. Such constructs, like “social capital,” “public sphere,” and “agenda-setting,” are usually developed from examining phenomena in Western contexts and presume particular institutional conditions that China lacks. As such, we see a simulacrum of China, devoid of organizational and cultural complexity, presented more and more in conferences and journals. How can we pursue and present accounts approximating “truth” if we’re relying on data that is being obliquely manipulated and managed?

In sum, as social science data infrastructures continue to expand in collaboration with platforms and transnational data brokers, various types of contextual knowledge are also being diluted, warped, and sidelined. It is time we confront how China is being known in this new academic ecology.

Staying with the Trouble

How to do better? I offer a few suggestions for starting a collective discussion. First and foremost, social scientists repurposing Chinese platform data must restore and confront their measurement conditions. What is the life trajectory of the dataset? How was it created, processed, and acquired? And is it possible to make responsible use of such data if you cannot answer these questions?

Second, from class, gender, and citizenship structures (e.g., hukou, China’s internal passport system), to the arrangements of the Chinese media system and state governance, to corporate capital and platform technologies — these institutions remain particularly powerful by keeping their operations out of sight. Analyses therefore should aspire to reveal the circumstances and voices that these institutions intend to obscure.

But to do this, we as social scientists must step outside of platform data. We must painstakingly weave together contextual knowledge, so that our research design speaks to the lived cultural, social, and political realities of Chinese people whose data are taken for our analysis. Quantitative abstraction inevitably flattens lived experience. But by striving in this direction, and working to rehydrate data with its rich contingent context, we may wield it to challenge power.

At the same time, on the receiving end of the burgeoning studies using Chinese platform data, we must be mindful of their methodological pitfalls. Importantly, these pitfalls, while highlighted in the China case, are broadly shared whenever platform trace data are repurposed to study human behavior. Measurement choices, obscured platform experimentation, and political pressure dictating acceptable content shape non-Chinese platform data as well. In short, we need to pause and consider such findings by asking ourselves whether they

  • Replicate a platform episteme that glosses over platform interference;
  • Conceal forces shaping platform governance including political control;
  • Sideline investigations of structural inequality in a society; and
  • Leave out formative institutional and cultural contexts when interpreting numerical results.

Finally, both the reliance on platform trace data to know China, and the inability to detect pitfalls in knowing China through these data, bespeak recent changes in scholarly and personnel exchanges between China and the US (and other parts of the world). Both are exacerbated by the shrinking access to actual lives and social worlds in China amidst intensifying geopolitical conflicts.

Both, therefore, can be mitigated by building and enriching ties under institutional support. The prior administration’s termination of the Fulbright and the Peace Corps programs in China, as well as immigration hurdles imposed on Chinese international students and scholars, were ironically celebrated by Chinese hard-liners. But it’s perhaps not that ironic after all, if we reckon that making Chinese realities legible holds extraordinary power beyond national borders.