Thursday, February 7, 2013

Big Data

Big Data - another sign that we're rapidly approaching the technological singularity:
Reference table for big numbers (source):

1 bit = a 1 or 0 (b)                               1/8 byte
4 bits = 1 nybble                                  1/2 byte
8 bits = 1 byte (B)                                1 byte
1024 bytes = 1 Kilobyte (KB)                 ~10^3 bytes     thousand
1024 Kilobytes = 1 Megabyte (MB)             ~10^6 bytes     million
1024 Megabytes = 1 Gigabyte (GB)             ~10^9 bytes     billion
1024 Gigabytes = 1 Terabyte (TB)             ~10^12 bytes    trillion
1024 Terabytes = 1 Petabyte (PB)             ~10^15 bytes    quadrillion
1024 Petabytes = 1 Exabyte (EB)              ~10^18 bytes    quintillion
1024 Exabytes = 1 Zettabyte (ZB)             ~10^21 bytes    sextillion
1024 Zettabytes = 1 Yottabyte (YB)           ~10^24 bytes    septillion
1 googolplex = 10^(10^100), a number far beyond all of the above (not to be confused with the "Googleplex", which is Google's HQ)
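As a quick sanity check on the table, here is a minimal Python sketch that converts a raw byte count into the binary (1024-based) units above (the helper name bytes_to_unit is my own, purely illustrative):

    # Convert a raw byte count into human-readable binary (1024-based) units.
    UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

    def bytes_to_unit(n: float) -> str:
        """Express n in the largest unit that keeps the value above 1."""
        for unit in UNITS:
            if n < 1024:
                return f"{n:.2f} {unit}"
            n /= 1024
        return f"{n * 1024:.2f} {UNITS[-1]}"   # past yottabytes, stay in YB

    # Example: the 5 exabytes of pre-2003 data mentioned below.
    print(bytes_to_unit(5e18))                 # -> "4.34 EB"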

Big data are the very large data sets that have started to appear in just the last seven years - data sets so big, and growing so fast, that they cannot be managed or processed by traditional means.

Data sets such as:
  • the Facebook dataset, which contains the information shared by the more than 1 billion people who have Facebook accounts - lists of friends, photos, daily entries, likes, etc.;
  • the Google Maps dataset, which shows the world's roadways with terrain and photo underlays, plus millions of GPS-coded photos and videos linked to the maps;
  • the YouTube dataset, to which an additional 103,000 hours of new video are uploaded each day;
  • the human genome dataset - each individual (excluding identical siblings) has a unique genome consisting of 3.2 billion base pairs (source). At the minimal 2 bits per base, that is about 0.8 GB of information; multiply this by the 7 billion people alive today and the total comes to roughly 5.6 exabytes (5.6 x 10^18 bytes; the arithmetic is worked in the sketch below). Today only a small number of individuals have had their genomes read, but it's starting to happen, and in a few years the procedure will be inexpensive enough that billions of people will have their genomes decoded each year;
  • The pre-2003 everything dataset - all the books, photos, videos, maps, and other data generated by the human race from the dawn of recorded history to the year 2003, which amounts to 5 exabytes (5 x 10^18 bytes) of data;
  • the 5 exabytes of new data that are generated every 2 days, as of August 2010;
  • the 1 exabyte of video data recorded by a 1.8-gigapixel camera on a Predator-type drone in an eight-hour flight (original cool video source, text source re the video source, and, by the way, a 2-21-2013 Yahoo article "Drones Large and Small Coming to US").
The 3 petabytes of total information in the Library of Congress, which includes 32 million books and 61 million manuscripts plus photos and videos (source), is tiny compared to any one of the above items.
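To make the genome arithmetic concrete, here is a minimal Python sketch of the back-of-the-envelope calculation (2 bits per base is the standard minimal encoding for the four nucleotides; the variable names are my own):

    # Back-of-the-envelope size of an all-of-humanity genome dataset.
    BASE_PAIRS_PER_GENOME = 3.2e9   # base pairs in one human genome
    BITS_PER_BASE = 2               # A, C, G, or T -> 2 bits each
    PEOPLE = 7e9                    # world population, circa 2013

    bytes_per_genome = BASE_PAIRS_PER_GENOME * BITS_PER_BASE / 8
    total_bytes = bytes_per_genome * PEOPLE

    print(f"one genome  : {bytes_per_genome / 1e9:.1f} GB")        # ~0.8 GB
    print(f"all genomes : {total_bytes / 1e18:.1f} exabytes")      # ~5.6 EB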


"Big data" has increased the demand of information management specialists in that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms only specializing in data management and analytics. This industry on its own is worth more than $100 billion and growing at almost 10 percent a year: about twice as fast as the software business as a whole.[5]

Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet.[5] Between 1990 and 2005, more than 1 billion people worldwide entered the middle class; as more people become more affluent and more literate, information growth accelerates. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007,[15] and it is predicted that the amount of traffic flowing over the Internet will reach 667 exabytes annually by 2013.[5]
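Those telecom figures imply a remarkably steady compound growth rate. Here is a minimal Python sketch (my own arithmetic, derived only from the numbers quoted above):

    # Compound annual growth of world telecom capacity, 1986 -> 2007,
    # using the figures quoted above.
    capacity_1986 = 281e15          # 281 petabytes
    capacity_2007 = 65e18           # 65 exabytes
    years = 2007 - 1986

    multiple = capacity_2007 / capacity_1986
    cagr = multiple ** (1 / years) - 1
    print(f"growth over {years} years : {multiple:.0f}x")          # ~231x
    print(f"compound annual growth    : {cagr:.0%} per year")      # ~30%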

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report[39] suggests suitable technologies include A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modelling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualisation. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation,[40] such as multilinear subspace learning.[41]
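To give a concrete taste of one technique from that list, here is a minimal Python sketch of cluster analysis using the classic k-means algorithm (a toy version on 2-D points; real big-data clustering runs distributed across many machines):

    import random

    # Toy k-means: group 2-D points into k clusters.
    def kmeans(points, k, iterations=20):
        centers = random.sample(points, k)              # k starting centers
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:                            # assign each point
                nearest = min(range(k), key=lambda i:   # to nearest center
                    (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
                clusters[nearest].append(p)
            for i, members in enumerate(clusters):      # move each center to
                if members:                             # its cluster's mean
                    centers[i] = (sum(p[0] for p in members) / len(members),
                                  sum(p[1] for p in members) / len(members))
        return centers

    # Two well-separated blobs; k-means should find a center near each.
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
    pts += [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(100)]
    print(kmeans(pts, k=2))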

This from the commercial website of Big Data Storage:

Big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualization. This trend continues because of the benefits of working with larger and larger data sets, which allow analysts to "spot business trends, prevent diseases, combat crime." Though a moving target, current limits are on the order of petabytes, exabytes and zettabytes of data. Data sets also grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s (about every 3 years), and every day 2.5 quintillion bytes of data are created.
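The two growth figures in that quote combine into a quick projection. A minimal Python sketch (my own arithmetic, purely illustrative):

    # 2.5 quintillion bytes created per day, doubling every ~40 months.
    BYTES_PER_DAY = 2.5e18          # 2.5 exabytes per day (2013 figure)
    DOUBLING_MONTHS = 40

    per_year = BYTES_PER_DAY * 365
    print(f"created this year        : {per_year / 1e21:.2f} zettabytes")

    # Project the daily rate a decade out, assuming the doubling continues.
    doublings = 10 * 12 / DOUBLING_MONTHS
    future_daily = BYTES_PER_DAY * 2 ** doublings
    print(f"daily rate in ten years  : {future_daily / 1e18:.0f} exabytes")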


The motives behind collecting, processing and analyzing these huge data sets include:
  • commercial interest - knowing the searches a person makes on his computer, what his emails say and who his contacts are gives marketers the ability to send that person targeted advertising; 
  • academic interest - seeing what information appears in Twitter and other social media and how fast it spreads enables studying our society, how it works and where it may be going; 
  • security interest - seeing where and when a person travels, what he does with his time, and who he associates with allows zeroing in on those who might be pursuing malicious or nefarious activities;
  • nefarious interest - seeing where and when a person travels, what he does, and who he associates with allows zeroing in on people whom, for some reason, you, the government, or some other agent may want to target, and this information can be invaluable in plans to attack or "neutralize" that person. Or, imagine what a health insurance actuary could do with access to her clients' genomes;
  • medical interest - comparing individuals' genomes will likely allow identifying such things as how a person will respond to various powerful drugs that benefit some but that harm others;
   The list goes on...

So, what is all of this telling us? First, the amount of information in big data is huge - unlike anything we've ever experienced. Second, the rates at which big data sets are expanding and are being disseminated are growing exponentially - what we experience today will be tiny compared with what we will see just a few years from now. Third, many entities are very interested in the information (and are already using it) and for good reason - it is hugely powerful. In the near future it will likely change our societies in ways as yet unimagined.  

And fourth and finally, this is a predictable step in our progression to a Technological Singularity, a major event that is predicted to arise from our continued acceleration in technological advancement and information growth. (The singularity is expected to occur on or about the year 2022.)  For some excellent speculative fiction on the singularity subject I suggest reading the novel Accelerando.

Additional random stuff on this subject:

The Great Disk Drive in the Sky: How Web giants store big—and we mean big—data

Much of this big data is stored in server farms in, of all places, Oregon. It seems the area along the Columbia River about 100 miles inland from Portland is quite well suited to server farms because of favorable taxes, a dependable and cheap electricity supply, and a desert climate: Article on this from "The Economist".

For more related reading, my niece Elena recommends Rainbows End and A Fire Upon the Deep. "Both deal with superintelligence and the implications of such a point in human history."

Elena also suggested making some mention of new methods of data storage, specifically DNA. This topic is worth at least a post of its own (maybe I can entice someone else to write it), but for what it's worth, here's a bit of recent news on DNA storage. I heard on NPR two days ago that some researchers had recently done a cool demonstration in the lab: they encoded onto a DNA strand an MP3 recording of part of Martin Luther King's "I have a dream" speech, and then read it back using a DNA-sequencing machine. Article related to this from "The Economist"
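Here is a minimal Python sketch of the basic idea behind DNA storage (this simple 2-bits-per-base mapping is my own illustration; the actual experiment used a more elaborate, error-resistant encoding):

    # Toy DNA storage: map every 2 bits of data to one nucleotide.
    TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
    TO_BITS = {base: bits for bits, base in TO_BASE.items()}

    def encode(data: bytes) -> str:
        """Turn bytes into a strand of A/C/G/T, 4 bases per byte."""
        return "".join(TO_BASE[(b >> shift) & 0b11]
                       for b in data for shift in (6, 4, 2, 0))

    def decode(strand: str) -> bytes:
        """Reverse the mapping: 4 bases back into one byte."""
        out = bytearray()
        for i in range(0, len(strand), 4):
            byte = 0
            for base in strand[i:i + 4]:
                byte = (byte << 2) | TO_BITS[base]
            out.append(byte)
        return bytes(out)

    strand = encode(b"I have a dream")
    print(strand)                        # the A/C/G/T strand, 4 per byte
    assert decode(strand) == b"I have a dream"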

A next-to-final note: to send data into the indefinitely distant future, we could write the data onto a DNA strand and then insert the strand as "junk DNA" into the genes of the German cockroach. To ensure the data stay with the species, we could add some code that gives a carrier a slight advantage over his siblings - for example, some code that would grow an extra eye on the roach's rear end.


By the way, it might be worth looking through the junk DNA of all of Earth's lifeforms to see if some star-traveling intelligence of the distant past left us some messages...

2-20-13 added link:   10 Exabyte storage systems
