Crowdsourcing: citizen science and research projects

The process of innovation has been profoundly altered by the emergence of two-way communication mediated through the Internet. Large groups of people can now connect and collaborate in the innovation process. Von Hippel and Jasanoff have both argued that a large body of users of a given technology will come up with innovative ideas. Discussions on “citizen science”, “open innovation” and “co-production” show that this shift is having a sizable impact on science.

In this context, crowdsourcing, a term introduced by Howe in a 2006 Wired article, was primarily used from a business perspective. Howe compared crowdsourcing to outsourcing in the manufacturing and service industries, which essentially meant moving costly operations to countries with low labor costs. Crowdsourcing can thus best be described as a large-scale problem-solving model in which participants replace an automated process because they can potentially deliver better results. In a science context, however, crowdsourcing appears to have a different quality and notion. Alan Irwin has described this phenomenon as “citizen science”.

The notion of the scientist as a lone researcher in the lab has already been replaced by intra-academic collaboration. Crowdsourcing takes this development even further: beyond the expert community, it enables the wider public to contribute to science projects.

The most prominent examples of citizen science projects include Galaxy Zoo and Folding@home (see also TED talk on protein folding).

The basic concept behind these initiatives, as described by Yochai Benkler in his book The Wealth of Networks, is to split a bigger task into more manageable smaller tasks so that individuals or groups can contribute their results.
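Benkler's split-and-merge principle can be sketched in a few lines of code. This is a purely illustrative toy, not anyone's actual platform: the names (`split_into_tasks`, `classify_image`, the "galaxy images" represented as integers) are all hypothetical stand-ins.

```python
# Illustrative sketch of a crowdsourcing-style workflow: split one big job
# into small, independent tasks, hand them to many contributors, merge results.

def split_into_tasks(dataset, chunk_size):
    """Break a large dataset into bite-sized tasks one person can handle."""
    return [dataset[i:i + chunk_size] for i in range(0, len(dataset), chunk_size)]

def classify_image(image):
    # Stand-in for a human judgement, e.g. "spiral" vs "elliptical" in Galaxy Zoo
    return "spiral" if image % 2 else "elliptical"

galaxies = list(range(10))            # pretend these are 10 galaxy images
tasks = split_into_tasks(galaxies, 2)  # 5 small tasks of 2 images each

# Each "volunteer" processes one small task; the results are merged afterwards
results = [classify_image(img) for task in tasks for img in task]
print(len(tasks), "tasks,", len(results), "classifications")
```

The point is that no single contributor needs to see, or even comprehend, the whole dataset: the value emerges when the small results are aggregated.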

Nevertheless, critics have raised concerns about the method of crowdsourcing in science. David Weinberger, for instance, argues that ‘people are not doing the work of scientist’ but are simply ‘scientific instruments’ that gather information without much comprehension.

Still, both projects are quite successful, as their results are published in acclaimed academic journals (for examples see (1) here and (2) here). Protein folding has even been turned into a computer game, Foldit, in which participants fold proteins in different ways. Its players made a significant contribution by solving the configuration of a retroviral enzyme within three weeks, where scientists had struggled to find an answer for years.

Galaxy Zoo, on the other hand, enables participation and discussion through forums. Additionally, Quench, an initiative from Galaxy Zoo, enables participants to analyse the results and even contribute to a paper.

Assertions such as Weinberger’s therefore simply ignore the fact that participants have made credible and knowledgeable contributions.

It has to be pointed out that the professionalisation of science is a relatively recent development, which manifested itself through universities competing for research grants to engage in the solution of research problems. This was in stark contrast to the naturalist researchers, who were primarily self-funded.

Science nowadays has become a partly guarded profession, with elite institutions and private research labs restricting interaction with and access to their research. Crowdsourcing, however, may constitute a crucial attempt to lower these barriers to entry, thereby fostering public engagement and participation in science.
But is crowdsourcing/citizen science an acceptable form of expertise or are trained scientists the only credible source for producing knowledge?
There is no definitive answer to this question, because crowdsourcing/citizen science in its current form is still evolving. In an article, Collins and Evans proposed an interesting categorisation of expertise into different types. In view of the discussion on crowdsourcing, their distinctions between ‘no’, ‘interactional’ and ‘contributory’ expertise transfer easily to the different levels of crowdsourcing projects.

According to von Hayek, any member of the public holds some sort of so-called ‘local’ or unique information. Bringing this local information into a crowdsourcing project can therefore lead to more than just low-level contributions (e.g. cataloguing galaxies, as in Galaxy Zoo).

In a study, Page observed that groups with diverse backgrounds can outperform groups consisting only of experts. Yet, as Wiggins and Crowston outline, there are no agreed standards for managing the relationship between formally accredited scientists and the informed ‘crowd’ that wishes deeper participation in science projects.

Crowdsourced projects are often structured and centrally controlled, as in the examples mentioned earlier in this blog, but there are also examples where the crowd controls the whole project (for further reading see Wicks et al.).

In conclusion, crowdsourcing should be embraced by the academic community, because it enables communication between experts and the public and is a first step towards creating standards to manage this relationship. As Sheila Jasanoff puts it: “people should be engaged as active parts, as well as sources of knowledge and insight.”


We cannot deny that the informed and experienced expert is needed to put all these results into context. However, the public can bring a ‘beginner’s’ view to a set of problems and may overcome an expert’s bias. In a similar vein, Barbara Prainsack acknowledged that there needs to be a better understanding of citizen science in order to assess the effects it could have on scientific knowledge creation.

Understanding information growth and the emergence of ‘big data’ sets

We know more than we can tell (Michael Polanyi, 1967)

We are truly living in an age of overflowing information. Astronomical amounts of data are captured with the help of computational technologies and stored in electronic databases. Researchers, individuals and corporations accumulate new information every second.


According to Moore’s Law, the capability of microprocessors (memory, speed) improves continuously at an exponential rate, while production costs and physical dimensions fall dramatically.
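As a back-of-the-envelope illustration of what exponential improvement means, the snippet below projects transistor counts from the Intel 4004 of 1971 (2,300 transistors), assuming the commonly quoted doubling period of roughly two years; the doubling period is an approximation, not an exact law.

```python
# Back-of-the-envelope Moore's law: transistor count doubling every ~2 years,
# starting from the Intel 4004 (1971, ~2,300 transistors).

def transistors(year, base_year=1971, base_count=2300, doubling_years=2):
    """Projected transistor count under an idealised Moore's-law curve."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

for year in (1971, 1991, 2011):
    print(year, f"{transistors(year):,.0f}")
```

Twenty years means ten doublings, i.e. a thousand-fold increase; forty years means a million-fold, which is why the curve so quickly dwarfs any linear trend.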

However, with the levels of information produced, a point has been reached where the information created exceeds the available storage space. Current electronic storage systems simply cannot cope with the vast amounts of data produced (see the Economist chart below).


Over the past decades, the digital revolution has altered the ephemeral character of information. The identification of patterns in so-called “Big Data”, or simply put, “making sense of data”, has become a discipline of statistical data analysis in its own right.

The problem of too much information is not new: the Royal Society acknowledged in 1948 that “science itself was under threat” from the vast stream of information and that mechanisms were needed to screen for the relevant bits (see Bawden and Robinson).

Nowadays, examples of analysing big data are omnipresent: in 2012, the statistician and blogger Nate Silver correctly predicted the outcome of the US presidential election in all 50 states.

See Nate Silver talk about his methodology:

What this example reveals is that information is interconnected: the combined analysis of large data sets highlights aspects that would otherwise have remained undiscovered. Scientific research is one area where the production and availability of large data sets will become significant.

Just to give a sense of the magnitude of research data: genomics, for instance, is expected to hold around 1 exabyte of data in 2014. By 2024, however, a new generation of radio telescopes will produce this amount of data in a single day!

To give a reference point for the sheer volume of this data, I have attached the following figures below:


Experiments at CERN (the particle physics lab in Geneva) produce around 40 TB of data every second, an amount that is impossible to store permanently in a database with current technology. CERN scientists therefore store only selected parts of their data.
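Taking the 40 TB/s figure cited above at face value, a quick sketch of the arithmetic shows why permanent storage is out of the question: the raw stream amounts to several exabytes per day, the same order of magnitude the next-generation radio telescopes are expected to produce.

```python
# Quick arithmetic on the cited figure of 40 TB of raw data per second.
TB_PER_S = 40
SECONDS_PER_DAY = 24 * 60 * 60

tb_per_day = TB_PER_S * SECONDS_PER_DAY        # terabytes per day
eb_per_day = tb_per_day / 1_000_000            # 1 exabyte = 1,000,000 terabytes

print(f"{tb_per_day:,} TB/day = {eb_per_day:.3f} EB/day")
```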

The challenge is to process these amounts of raw data into structured information and, in a final step, to transform it into knowledge. The scientific method has so far been the cornerstone of scientific inquiry. Yet in an article published by Wired magazine, Chris Anderson claims that

…the scientific method becomes obsolete … correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

Chris Anderson, Wired Magazine

Anderson readily assumes that large enough data sets can reveal the answers to scientific problems without the need for model building or hypothesis testing. He misses the point here: interpreting big data sets in most cases involves sophisticated algorithms and filters, and these algorithms and filters represent a model, a lens through which computers scan for correlations. ‘Big data’ therefore marks not the end of the scientific method but must be seen as a useful addition for making sense of large data sets in a scientific environment. Essentially we are confronted with a ‘black box’ (Bruno Latour) that we need to deconstruct into its parts in order to separate the signal from the noise.
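A toy simulation makes the danger of "correlation without a model" concrete: scan enough unrelated variables and some will correlate strongly with your target purely by chance. Everything below is synthetic random data, nothing more.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
target = [random.gauss(0, 1) for _ in range(20)]

# 1,000 random "variables" with no causal link to the target whatsoever:
# among this many, some will correlate strongly with it by pure chance.
candidates = [[random.gauss(0, 1) for _ in range(20)] for _ in range(1000)]
best = max(abs(pearson(c, target)) for c in candidates)
print(f"strongest spurious correlation: |r| = {best:.2f}")
```

Without a model of why a correlation should exist, a data-mining pipeline cannot distinguish this chance hit from a real effect, which is exactly the black box that needs opening.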

Nevertheless, attention has to be paid to the potential risks associated with the use of ‘big data’ in science. Scientists need to pose questions about data reliability, randomness and representativeness. It is equally important to ask about access to these data sets: those without access can neither reproduce nor evaluate the methodological claims of a study. Access to large data sets is often tied to heavy investment in technology and computer equipment for recording and storing the data. This is likely to create a new form of digital divide between those who can afford to create and access ‘big data’ sets and those who cannot: the data ‘rich’ and the data ‘poor’. Likewise, the emergence of ‘big data’ research will create new skill requirements for scientists, who will ultimately become ‘data scientists’.

In conclusion, the phenomenon of information growth and the resulting ‘big data’ will have wide-ranging implications for future research agendas; we have to find a way to convert tacit knowledge into explicit knowledge. David Bollier, however, expresses concerns about how big data can be governed. According to Bollier, big data has become omnipresent, and he poses the question of ‘how to define what is socially and legally acceptable’ when big data is such a new phenomenon.

Does IT matter?

What is it about?

In 2003 Nicholas Carr published a controversial article with the provocative title IT Doesn’t Matter. The article sparked a lively debate about the role of IT within organizations and was heavily criticised by the IT industry and academics alike. Steve Ballmer (CEO, Microsoft) labelled Carr’s argument as “hogwash”. Craig Barrett (CEO, Intel) took strong issue with Carr’s assumptions and pointed out that

“IT is the vehicle by which you take information and data and turn it into intellectual content.”

Steven Alter, Professor of Information Systems at USF, elegantly criticized Carr’s core argument with the following quote:

“Kidneys don’t matter. Kidneys are basically a commodity. Just about everyone has kidneys. There is no evidence that CEOs with superior kidneys are more successful than CEOs with average kidneys. In fact, CEOs who spend more on their kidneys often don’t do as well.”

Technology vendors tended to present IT as a miracle cure for all problems. The claim was: “buy this technology and all your problems will be solved”.

Carr argued that technological advancement might in essence be of little value because IT has lost its ability to generate competitive advantage for businesses; in other words, CEOs of big companies are overspending on IT products and services. The article thus gives an interesting outline of emerging technologies and their influence on the economic arena.

So why does Carr’s argument matter?

In fact, the global IT market is a $3.6 trillion industry; compared with the global pharmaceutical market ($900 billion), which is heavily research-based, that number seems strikingly high. It is evident that ICTs have played a vital role over the last 20 years in the advancement of several sectors, not only in the corporate world. Nowadays ICTs and computational power have become essential tools for scientists in a large variety of fields, and the way scientists approach research problems has fundamentally changed due to IT.


Carr argues that IT is becoming a commodity and thus can no longer provide any substantial advantage for advancement and breakthrough innovation. He sees IT as a transport mechanism, building the infrastructure for the transportation of knowledge and information, and therefore compares it to the water and electricity grids, railroads, and highways. Carr identified a pattern all of these have in common: a rapid build-out of capacity in a short period of time.


Source: Carr, 2003 

In other words, IT has become a commodity that is readily available at low cost for everyone who wishes to use it.

If we take Carr’s argument as a correct diagnosis, there should theoretically be no organization out there gaining a substantial competitive advantage from heavy investment in IT. Yet in reality there are many examples that prove quite the opposite. Take for instance Amazon, which revolutionized the book market through the use of IT. Google and Facebook are further examples of how quickly IT can drive innovation in the economic and social landscape; Google spent approximately $2.9 billion on infrastructure alone in early 2013.


(Interesting part: first 12min of video, skip the intro)

Google and Amazon are also good examples of Collingridge’s argument that we cannot predict the outcomes of a technology.

This points to a critical distinction that Carr may have missed at the time of writing: the notion of IT as a productivity/automation tool, as Carr interpreted it, has changed. The question now is how to use the available physical IT infrastructure to create new services and products.

In 2012 Meeker and Wu published an interesting report on Internet trends, in which they coined the phrase “re-imagination of everything”, which superbly describes how technology has transformed different areas of economic activity. I have attached some interesting comparisons below.

For a detailed account and the PowerPoint presentation click here!

In my opinion, Carr’s flaw lies in his assumption of an inevitable future, whereas Collingridge rightly suggests that the outcomes of technologies are unpredictable. Nevertheless, the economist Tyler Cowen later picked up Carr’s argument about the dwindling importance of IT for innovation, publishing a book in 2011 with the title “The Great Stagnation”.

see also his TED talk:

Cowen focused his analysis on the United States and concluded that the US economy has hit its technological peak. He supported Schumpeter’s view that innovation is the only viable option for long-term corporate growth. Consequently, some big IT companies have tried to replicate the big-science model by hiring specialized scientists to kick-start innovative projects; investment in IT thus remains an important factor in pursuing innovation.

It certainly cannot be inferred that IT doesn’t matter; rather, it was the development of and investment in IT that has helped advance science (e.g. the Human Genome Project) and enabled the re-imagination of established business models. From this perspective, IT really does matter.