
1.4 The Future of Computer Science

1.4.1 Introduction

Modern computer science is undergoing a fundamental change. In the early years of the field, computer scientists were primarily concerned with the size, efficiency and reliability of computers. They attempted to increase computational speed and reduce the physical size of computers to make them more practical and useful. The research dealt mainly with hardware, programming languages, compilers, operating systems and databases. Meanwhile, theoretical computer science developed an underlying mathematical foundation to support this research, which in turn led to the creation of automata theory, formal languages, computability and algorithm analysis. Through the efforts of these researchers, computers have shrunk from the size of a room to that of a dime, nearly every modern household has access to the internet, and communication across the globe is virtually instantaneous. Computers can be found everywhere, from satellites hundreds of miles above us to pacemakers inside beating human hearts. The prevalence of computers, together with communication and data storage devices, has made vast quantities of data accessible. This data carries important information that reveals a closer approximation of the real world and is fundamentally different from what can be extracted from individual entities. Rather than analyzing and interpreting individual messages, we are more interested in understanding the complete set of information from a collective perspective. However, such large-scale data sets are usually far larger than can be processed by traditional means. Thus, future computer science research and applications will be less concerned with how to make computers work and more focused on the processing and analysis of such large amounts of data.

Consider the example of internet search. At the beginning of the internet era, users were required to know the IP address of the site to which they wished to connect; no form of search was available. As websites proliferated, online search services became necessary to make the internet navigable. The first internet search tool was developed in 1993, and dozens more were created over the next several years. Ask Jeeves, founded in 1996, relied partially on human editors who manually selected the best websites to return for various search queries. Given the huge number of websites available today, such a strategy is clearly no longer feasible. Google, founded in 1998, is a leader among today’s search engines. It relies on a search algorithm that uses the link structure of the internet to determine the most popular and thus, perhaps, the most reputable websites.

However, while Google’s search engine was a major advance in search technology, more significant advances can be expected in the future. Consider a user who asks the question, “When was Einstein born?” Instead of returning hundreds of webpages for such a search, one might expect the answer: “Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879”, along with pointers to the source from which the answer was extracted. Other similar searches might be:

(1) Construct an annotated bibliography on graph theory.

(2) Which are the key papers in theoretical computer science?

(3) Which car should I buy?

(4) Where should I go to college?

(5) How did the field of computer science develop?

Search engine companies have saved billions of search records along with a whole archive of information. When we search for the answer to the question “Which car should I buy?”, they can examine the pages viewed by other individuals who performed similar searches, extract a list of factors that might be important (e.g. fuel economy, price, crash safety), and prompt the user to rank them. Given these priorities, the search engine can provide a list of automobiles ranked according to the user’s preferences, together with key specifications and hyperlinks to related articles about the recommended models (a minimal sketch of this kind of ranking appears after the list below). Another interesting question is, “Which are the key papers in theoretical computer science?” One would expect a list such as:

(1) Juris Hartmanis and Richard Stearns, “On the computational complexity of algorithms”.

(2) Manuel Blum, “A machine-independent theory of the complexity of recursive functions”.

(3) Stephen Cook, “The complexity of theorem-proving procedures”.

(4) Richard Karp, “Reducibility among combinatorial problems”.
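
Returning to the “Which car should I buy?” scenario above, the ranking step can be illustrated with a minimal sketch. The factor names, scores and weights below are invented for illustration; a real search engine would derive them from its search logs and from the user’s input.

```python
# Minimal sketch of preference-based ranking (illustrative only; all data invented).
# Each candidate has normalized scores per factor; the user supplies factor weights.

def rank_by_preferences(candidates, weights):
    """candidates: {name: {factor: score in [0, 1]}}; weights: {factor: importance}."""
    def total(attrs):
        return sum(weights.get(factor, 0.0) * value for factor, value in attrs.items())
    return sorted(candidates.items(), key=lambda item: total(item[1]), reverse=True)

cars = {
    "Car A": {"fuel_economy": 0.9, "price": 0.6, "crash_safety": 0.80},
    "Car B": {"fuel_economy": 0.5, "price": 0.9, "crash_safety": 0.70},
    "Car C": {"fuel_economy": 0.7, "price": 0.4, "crash_safety": 0.95},
}
user_weights = {"fuel_economy": 3, "price": 1, "crash_safety": 2}   # user's priorities

for name, _ in rank_by_preferences(cars, user_weights):
    print(name)   # Car A, Car C, Car B for these invented numbers
```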

With thousands of scientific papers published every year, information on which research areas are growing or declining would be of great help in ranking the popularity and predicting the evolutionary trend of various research topics. For example, Shaparenko et al. used sophisticated artificial intelligence techniques to cluster papers from the Neural Information Processing Systems (NIPS) conference held between 1987 and 2000 into several groups, as shown in Fig.1-3. Since all papers presented at the NIPS conference are in digital format, one can use this information to plot the sizes of the clusters over time. Clusters 10 and 11 clearly show two growing research areas in NIPS, namely “Bayesian methods” and “Kernel methods”. The graph correctly indicates that the “Bayesian methods” cluster emerged before the “Kernel methods” cluster, with both topics starting to dominate the NIPS conference by 2000. In addition, Cluster 1 on neural networks, Cluster 4 on supervised neural network training, and Cluster 8 on biologically-inspired neural memories were popular in the early years of NIPS, but had almost disappeared from the conference by 2000. With the help of advanced techniques, we should eventually be able to predict how important a paper will be when it is first published, as well as how a research area will evolve and who will be the key players.

Fig.1-3 The distribution of k = 13 clusters of NIPS papers
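
The exact techniques used by Shaparenko et al. are not reproduced here, but the general idea can be sketched with standard tools: represent each paper by its text, cluster the collection, and count cluster sizes per year. The sketch below uses TF-IDF and k-means from scikit-learn as stand-ins, with a handful of invented (year, abstract) pairs.

```python
# Rough stand-in for the clustering behind Fig.1-3 (not the method of Shaparenko et al.):
# cluster time-stamped documents by text similarity, then count cluster sizes per year.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented (year, abstract) pairs standing in for the NIPS collection.
papers = [
    (1990, "training feedforward neural networks with backpropagation"),
    (1992, "a biologically inspired associative memory model"),
    (1998, "bayesian inference in graphical models"),
    (1999, "kernel methods and support vector machines"),
    (2000, "bayesian methods for model selection"),
    (2000, "kernel functions for structured data"),
]
years, texts = zip(*papers)

X = TfidfVectorizer().fit_transform(texts)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster size per year: the quantity plotted over time in Fig.1-3.
sizes = Counter(zip(years, labels))
for (year, cluster), count in sorted(sizes.items()):
    print(year, "cluster", cluster, "->", count, "paper(s)")
```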

In the early years of computer science, researchers established a mathematical foundation consisting of areas such as automata theory and algorithm analysis to support the applications of that time. As applications develop over time, so must the underlying theories. Perhaps surprisingly, the intuition and mathematics behind the theory of large or high-dimensional data are completely different from those for small or low-dimensional data. Heuristics and methods that were effective merely a decade ago may already be outdated.

1.4.2 Innovative Research Projects

Traditional research in theoretical computer science has focused primarily on problems with small input sets. For instance, in the past, stores tracked the items purchased by each individual customer and gave that customer discounts on future purchases of those items. However, with the help of modern algorithms, service providers such as Netflix are now able not only to make predictions based on a customer’s past preferences, but also to amalgamate preferences from millions of customers to make accurate and intelligent suggestions, effectively increasing sales revenue. The following subsections describe four ongoing research projects involving the analysis and interpretation of large data sets. Each represents a promising direction for rediscovering fundamental properties of large-scale networks that will reshape our understanding of the world.

1. Tracking Communities in Social Networks

A social network is usually modeled as a graph in which vertices represent entities and edges represent interactions between pairs of entities. In previous studies, a community was often defined to be a subset of vertices that are densely connected internally but sparsely connected to the rest of the network. Accordingly, the best community of a graph was typically a peripheral set of vertices barely connected to the rest of the network by a small number of edges. However, it is our view that in large-scale real-world societies, communities, though better connected internally than expected solely by chance, may also be well connected to the rest of the network. It is hard to imagine a small close-knit community with only a few edges connecting it to the outside world. Rather, members of a community, such as a computer science department, are likely to have many connections outside the community, such as family, religious groups, other academic departments and so on. Empirically, a community displays a higher-than-average ratio of internal edges to the square of its number of vertices, which reflects the probability of an edge between two randomly picked members, but it can also be connected to the rest of the network by a significant number of edges, which may even exceed the number of its internal edges.
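
The ratio just mentioned is easy to compute directly. The sketch below, on an invented toy graph, compares a candidate community’s internal edge-to-vertex-squared ratio against the same ratio for the whole network.

```python
# Minimal sketch: compare a candidate community's edge-to-vertex-squared ratio with
# the same ratio for the whole graph. The edge list and candidate set are invented.

def edge_vertex_squared_ratio(vertices, edges):
    """Internal edges divided by the square of the number of vertices: roughly the
    probability that two randomly picked members are directly connected."""
    vs = set(vertices)
    internal = sum(1 for u, v in edges if u in vs and v in vs)
    return internal / len(vs) ** 2

edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6), (2, 5)]
whole_graph = {1, 2, 3, 4, 5, 6}
candidate = {1, 2, 3}            # densely connected internally, but not isolated

print(edge_vertex_squared_ratio(whole_graph, edges))   # baseline for the network
print(edge_vertex_squared_ratio(candidate, edges))     # higher => community-like
```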

With this intuitive notion of community, two types of structure are defined: the “whiskers” and the “core”. Whiskers are peripheral subsets of vertices that are barely connected to the rest of the network, while the core is the central piece that exclusively contains the type of community we are interested in. The algorithm for finding a community then reduces to two steps: ① identifying the core, in which no whiskers exist, and ② identifying communities within the core. Extracting the exact core from both weighted and unweighted graphs has been proved to be NP-complete. Alternative heuristic algorithms have therefore been developed, all of which are capable of finding an approximate core, and their performance is supported by experimental results on various large-scale social graphs. In this way, one can obtain communities that are not only more densely connected than expected by chance alone, but also well connected to the rest of the network.
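
Since exact core extraction is NP-complete and the published heuristics are not reproduced here, the sketch below uses one simple stand-in heuristic (not the authors’ algorithm): repeatedly strip small vertex sets that hang off the rest of the graph by a single bridge edge, and treat what remains as an approximate core. It assumes the networkx library; max_whisker_size is a hypothetical tuning parameter.

```python
# One simple stand-in heuristic for step (1): treat a small component that hangs off
# the rest of the graph by a single bridge edge as a "whisker", strip all whiskers,
# and keep the remainder as an approximate core.
import networkx as nx

def approximate_core(G, max_whisker_size=10):
    core = G.copy()
    changed = True
    while changed:
        changed = False
        for u, v in list(nx.bridges(core)):
            H = core.copy()
            H.remove_edge(u, v)
            # The smaller side of the cut is the whisker candidate.
            side = min(nx.node_connected_component(H, u),
                       nx.node_connected_component(H, v), key=len)
            if len(side) <= max_whisker_size:
                core.remove_nodes_from(side)
                changed = True
                break   # bridge list is stale after removal; recompute it
    return core

# Toy example: a dense block with a small tail attached by one edge.
G = nx.complete_graph(6)
G.add_edges_from([(5, 6), (6, 7), (7, 8)])      # the tail 6-7-8 is a whisker
print(sorted(approximate_core(G).nodes()))      # the dense block 0..5 remains
```

Any community-detection routine would then be run on the returned core, in line with step ② above.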

2. Tracking Flow of Ideas in Scientific Literature

Remarkable developments in data storage have facilitated the creation of gigantic digital document collections available for searching and downloading. When navigating and seeking information in such a collection, the ability to identify topics, along with their time of appearance, and to predict their evolution over time would be of significant help. Before starting research in a specific area, a researcher might quickly survey the area, determine how topics in the area have evolved, and locate important ideas and the papers that introduced them. Knowing a specific topic, a researcher might find out whether it has been discussed in previous papers or is a fairly new concept. As another example, a funding agency that administers a digital document collection might be interested in visualizing the landscape of topics in the collection to show the emergence and evolution of topics, the bursts of topics, and the interactions between different topics as they change over time. Such information-seeking activities often require the ability to identify topics with their time of appearance and to follow their evolution. Recently, in their unpublished work, Jo et al. have developed a unique approach to achieving this goal in a time-stamped document collection with an underlying document network, a format that represents a wide range of digital texts available over the internet. Examples are scientific paper collections, text collections associated with social networks such as blogs and Twitter, and, more generally, web documents with hyperlinks. A document collection without an explicit network can be converted into this format by connecting textually similar documents to generate a document network.

3. Reconstructing Networks

The study of large networks has brought about many interesting questions, such as how to determine which members of a population to vaccinate in order to slow the spread of an infection, or where to place a limited number of sensors to detect the flow of a toxin through a water network. Most algorithms for solving such questions assume that the structure of the underlying network is known. For example, detectives may want to use such algorithms to identify the leaders of a criminal network and to decide which members to turn into informants. Unfortunately, the exact structure of a criminal network cannot be easily determined. However, it is possible that the police department has some information about the spread of a certain property through the network; for instance, some new drug may have first appeared in one neighborhood, and then in two other neighborhoods, and so on. The work of Soundarajan et al. attempts to create algorithms that recover the structure of a network given information about how some property, such as disease or crime, has spread through it. This work begins by defining a model of contagion describing how the property spreads through a network. The model of contagion for information spread may be: “a vertex learns a piece of information in the time interval after one of its neighbors learns that information.” A more complex model of contagion, corresponding to the spread of belief, may be: “a vertex adopts a new belief in the time interval after a proportion p of its neighbors adopts that belief.” For example, a person will probably not join a political party as soon as one of his friends joins that party, but he may join it after two-thirds of his friends have joined.
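
The contagion models above are easy to state as a forward simulation; the recovery problem described next is the inverse. The sketch below simulates the proportion-p model on a small invented graph and records each vertex’s adoption interval.

```python
# Minimal forward simulation of the proportion-p contagion model on an invented
# undirected graph. Recovery (the harder problem) must work backward from the result.

def simulate_contagion(neighbors, seeds, p, steps):
    """neighbors: {vertex: set of neighbors}; seeds: vertices adopting at interval 0."""
    adopted_at = {v: 0 for v in seeds}
    for t in range(1, steps + 1):
        for v in neighbors:
            if v in adopted_at:
                continue
            # Count neighbors that adopted strictly before interval t.
            adopters = sum(1 for u in neighbors[v] if adopted_at.get(u, t) < t)
            if neighbors[v] and adopters / len(neighbors[v]) >= p:
                adopted_at[v] = t
    return adopted_at   # vertex -> time interval of adoption

graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
print(simulate_contagion(graph, seeds={1, 2}, p=2/3, steps=4))
# {1: 0, 2: 0, 3: 1, 4: 2, 5: 3} -- the belief cascades one interval at a time
```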

Next, the network recovery algorithm assumes that vertices are partitioned into discrete time intervals, corresponding to the time when they adopt the property. For a given model of contagion, the algorithm attempts to find a network over the set of vertices such that when the property in question (e.g. information, belief) is introduced to some vertices in the first time interval, and then spreads to other vertices in accordance with the model of contagion, every vertex adopts the property at an appropriate time. Initial work has created such algorithms for two models of contagion: the model corresponding to the spread of information, where a vertex adopts a property in the time interval after one of its neighbors has adopted that property, and the model corresponding to the spread of belief, where a vertex adopts a property in the time interval after at least half of its neighbors have adopted that property.
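
For the simplest model, where a vertex adopts in the interval after any one neighbor adopts, one consistent network can be constructed directly from the observed adoption times, as sketched below on a hypothetical toy instance (real recovery algorithms must reason about the full space of consistent networks).

```python
# Minimal recovery sketch for the "one neighbor" information model: under that model a
# vertex adopting at interval t must have its earliest-adopting neighbor at exactly
# t - 1, so linking each vertex to one vertex from the previous interval is consistent.
from collections import defaultdict

def recover_network(adopted_at):
    """adopted_at: {vertex: adoption interval}. Assumes every interval is non-empty.
    Returns one set of edges consistent with the observations."""
    by_time = defaultdict(list)
    for v, t in adopted_at.items():
        by_time[t].append(v)
    edges = set()
    for t in sorted(by_time):
        if t == 0:
            continue                      # seed vertices need no incoming explanation
        for v in by_time[t]:
            parent = by_time[t - 1][0]    # any vertex from the previous interval works
            edges.add((parent, v))
    return edges

observed = {"a": 0, "b": 1, "c": 1, "d": 2, "e": 3}
print(recover_network(observed))   # one of many consistent networks
```

Many different networks are consistent with the same observations; the questions below about edges shared by all, some, or none of the solution graphs concern exactly this solution space.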

Future work will focus on finding algorithms for other models of contagion, especially the models in which a vertex adopts the property after a proportion p of its neighbors has adopted that property, for arbitrary values of p. Other directions include finding algorithms for networks in which there are two or more properties spreading through the network. This work also opens up questions about the types of graphs produced by these algorithms. For instance, do all possible graphs have some edges in common? Are there any edges that do not appear in any of the solution graphs? Which edges are the most or least likely?

4. Tracking Bird Migration in North America

Hidden Markov models (HMMs) assume a generative process for sequential data whereby a sequence of states (i.e. a sample path) is drawn from a Markov chain in a hidden experiment. Each state generates an output symbol from a given alphabet, and these output symbols constitute the sequential data (i.e. observations). The classic single path problem, solved by the Viterbi algorithm, is to find the most probable sample path given certain observations for a given Markov model.
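
The Viterbi algorithm itself is standard dynamic programming. The sketch below implements it for a tiny two-state model; the states, transition and emission probabilities are invented and are not taken from the eBird work.

```python
# Textbook Viterbi algorithm: the most probable hidden state path for one observation
# sequence. The tiny two-state model below is illustrative only.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (probability of the best path ending in state s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o], V[-2][prev][1] + [s])
                for prev in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())   # (probability, most likely state path)

states = ("North", "South")
start_p = {"North": 0.5, "South": 0.5}
trans_p = {"North": {"North": 0.7, "South": 0.3}, "South": {"North": 0.2, "South": 0.8}}
emit_p = {"North": {"few": 0.6, "many": 0.4}, "South": {"few": 0.1, "many": 0.9}}

prob, path = viterbi(["few", "many", "many"], states, start_p, trans_p, emit_p)
print(path, prob)
```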

Two generalizations of the single path problem for performing collective inference on Markov models are introduced, motivated by an effort to model bird migration patterns using a large database of static observations. The eBird database maintained by the Cornell Lab of Ornithology contains millions of bird observations from throughout North America, reported by the general public using the eBird Web application. Recorded observations include location, date, species and number of birds observed. The eBird data set is very rich, and the human eye can easily discern migration patterns from animations showing the observations as they unfold over time on a map of North America. However, the eBird data entries are static; movement is not explicitly recorded, only the distributions at different points in time. Conclusions about migration patterns are drawn by the human observer, and the goal is to build a mathematical framework that infers dynamic migration models from the static eBird data. Quantitative migration models are of great scientific and practical importance. For example, this problem arose from an interdisciplinary project at Cornell University to model the possible spread of avian influenza in North America through wild bird migration.

The migratory behavior of a species of birds can be modeled by a single generative process that independently governs how individual birds fly between locations. This gives rise to the following inference problem: a hidden experiment draws many independent sample paths simultaneously from a Markov chain, and the observations reveal collective information about the set of sample paths at each time step, from which the observer attempts to reconstruct the paths.
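
The generative side of this problem is straightforward to simulate, which also makes the difficulty of the inverse problem clear: the sketch below moves many independent “birds” through an invented Markov chain but records only per-location counts at each time step, the kind of static snapshot the inference must start from.

```python
# Minimal generative sketch of the collective setting: many birds move independently
# according to one Markov chain, but only per-location counts are recorded at each
# time step (like static eBird snapshots). Inference must work backward from counts.
import random
from collections import Counter

random.seed(0)
trans = {"Texas": {"Texas": 0.3, "Kansas": 0.6, "Minnesota": 0.1},
         "Kansas": {"Kansas": 0.3, "Minnesota": 0.6, "Texas": 0.1},
         "Minnesota": {"Minnesota": 0.8, "Kansas": 0.15, "Texas": 0.05}}

def step(loc):
    r, cum = random.random(), 0.0
    for nxt, p in trans[loc].items():
        cum += p
        if r < cum:
            return nxt
    return loc

paths = [["Texas"] for _ in range(1000)]          # 1,000 independent sample paths
for _ in range(3):                                # three time steps of migration
    for path in paths:
        path.append(step(path[-1]))

for t in range(4):
    print("t =", t, Counter(path[t] for path in paths))   # the only data observed
```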

1.4.3 Theoretical Foundation

As demonstrated in the previous section, the focus of modern computer science research is shifting to problems concerning large data sets. Thus, a theoretical foundation and science base are required for rigorously conducting studies in the many related areas. The theory of large data sets is quite different from that of smaller data sets: when dealing with smaller data sets, discrete mathematics is widely used, but for large data sets, asymptotic analysis and probabilistic methods must be applied. This change in the theoretical foundation also requires a completely different kind of mathematical intuition.

Large graphs have become an increasingly important tool for representing real world data in modern computer science research. Many empirical experiments have been performed on large-scale graphs to reveal interesting findings. A computer network may have consisted of only a few hundred nodes in previous years, but now we must be able to deal with large-scale networks containing millions or even billions of nodes. Many important features of such large graphs remain constant when small changes are made to the network. Since the exact structure of large graphs is often unknown, one way to study these networks is to consider generative graph models instead, where a graph is constructed by adding vertices and edges in each time interval. Although such graphs typically differ from real-world networks in many important ways, researchers can use the similarities between the two types of networks to gain insight into real-world data sets.

A simple but commonly used model for creating random graphs is the Erdős-Rényi model, in which an edge exists between each pair of vertices with equal probability, independent of the other edges. A more realistic model is known as the “preferential attachment” model, in which the probability that a new edge attaches to a particular vertex is proportional to the number of edges already adjacent to that vertex. In other words, a high-degree vertex is likely to gain more edges than a low-degree vertex. The preferential attachment model gives rise to the power-law degree distribution observed in many real-world graphs.
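
A minimal version of the preferential attachment process is sketched below (each new vertex attaches to a single existing vertex, a common simplification). Maintaining a list in which every vertex appears once per incident edge makes a uniform draw from that list exactly a degree-proportional choice.

```python
# Minimal preferential attachment sketch: each new vertex attaches to an existing
# vertex chosen with probability proportional to that vertex's current degree.
import random
random.seed(1)

def preferential_attachment(n):
    edges = [(0, 1)]                  # start with a single edge
    targets = [0, 1]                  # each vertex appears once per incident edge
    for new in range(2, n):
        old = random.choice(targets)  # uniform over targets = degree-proportional
        edges.append((new, old))
        targets.extend([new, old])
    return edges

edges = preferential_attachment(10000)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print("max degree:", max(degree.values()))   # heavy-tailed, unlike Erdős-Rényi
```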

Consider the Erdős-Rényi random graph model in which each edge is added independently with equal probability. Suppose that we start with 1,000 vertices and zero edges. Then there are clearly 1,000 components of size one. If we add one edge, we will have 998 components of size one and one component of size two. However, a giant component begins to emerge as more edges are added, because the larger a component becomes, the more likely it is that a new random edge will touch it and attach additional vertices to it.
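
The emergence of the giant component can be watched directly in simulation. The sketch below mirrors the 1,000-vertex thought experiment above, adding random edges one at a time and tracking the largest component with a small union-find structure (an implementation choice, not part of the model). With 1,000 vertices the largest component typically stays small until roughly 500 edges have been added and then grows rapidly.

```python
# Minimal sketch of the giant component emerging in an Erdős-Rényi-style process:
# start with 1,000 isolated vertices, add random edges one by one, and track the
# largest connected component with union-find.
import random
random.seed(2)

n = 1000
parent = list(range(n))
size = [1] * n

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx == ry:
        return
    if size[rx] < size[ry]:
        rx, ry = ry, rx
    parent[ry] = rx
    size[rx] += size[ry]

for m in range(1, 2001):
    u, v = random.sample(range(n), 2)   # a random edge between distinct vertices
    union(u, v)
    if m % 250 == 0:
        largest = max(size[find(w)] for w in range(n))
        print(m, "edges -> largest component:", largest)
```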

Since random graph models mimic some vital features of real-world networks, it is often helpful to study the processes that generate these features in random graph models. An understanding of these processes can provide valuable insights for analyzing real-world data sets.

1.4.4 An Interview

This is an interview with Ken Calvert and Jim Griffioen on “The Future of Computer Science”.

Ken Calvert, Ph.D., is chair of the Department of Computer Science, University of Kentucky. Jim Griffioen, Ph.D., is professor of computer science and director of the Laboratory for Advanced Networking.

Q. What are shaping up to be the greatest areas of opportunity in the computer science field over the next few years?

K.C. I think this is an exciting time in computer science. Hardware has become so cheap that both compute cycles and storage bytes have essentially become commoditized. We’re seeing this right now with the cloud computing model. A company can now pay someone a relatively low monthly fee to run their Web server instead of shelling out thousands of dollars for hardware, software and maintenance. It’s basically the same transition that happened with electric power 100 years ago. Nicholas Carr’s book, The Big Switch , describes how, back then, factories had to be located next to big streams because that’s where they got the power to run their machines. When electric power grids came along, generation of power became centralized. The same exact centralization is happening with the advent of cloud computing. It makes a lot more sense to have one big centralized data center run by people who know what they’re doing than for every little company to run its own.

J.G. Historically, computer scientists have created technology without fully knowing how it’s going to play out. The Internet was built so machines could communicate back and forth and share information. Well, then users came along and said, “I need this to be easy to use. I need a Web interface. I need a browser.” None of those uses were part of the original design. Now we have virtualization through cloud computing as well as ubiquitous networking—you can be on the network at all times. In addition, we also have a very mobile society. Devices which can maximize the benefits of the cloud will need to be developed. I think we’re on the edge of some of these things just exploding and once it explodes, we’ll have a whole new set of issues to address—how to secure such a world, etc.

K.C. What virtualization also means is that software is going to be king. Everything is going to be about software because hardware is so cheap. I think the opportunities in software are tremendous. However, as Jim mentioned, we now have to consider questions such as: how do I keep control of my information? How do I know what information people are collecting about me? Businesses already know a lot about us and they are going to try to monetize that any way they can. Why do Facebook and Twitter have such astronomical valuations? I believe it’s because they know who is talking to whom and what they’re saying. Privacy is a huge issue going forward and it’s not just “old people” who are concerned about it. We need to understand how to maximize the benefits of virtualization without the Big Brother risks.

Q. What does the future look like on the security front?

J.G. When everyday users weigh the prospective gain of a new application against the possible security risks, they almost always accept the tradeoff. It is difficult to keep up with potential threats and understand the risks because the landscape changes so quickly. On the positive side, though, industry has finally recognized that security is not an afterthought. In the past, companies created products and tacked security onto the back end of the development process. Often, that made it hard to add the security because it wasn’t present from the start. Now, computer scientists are asking, “How do I design the architecture so that if it doesn’t have security now, it is amenable to it later?” There are discussions going on right now about the next generation of the Internet. Naturally, security is a central topic.

K.C. As long as we have the current Internet architecture, we’re not going to solve many of the current problems. The architecture doesn’t have the things we need to solve them, and there’s just too much inertia to counteract. So it’s hard to say what the future is going to look like there. But again, almost as important as security is privacy. When it comes to the leaders in software and social media, people aren’t given a choice to use the product and still maintain their privacy. Those companies say, “Here are our policies, take them or leave them.” And people agree, even though the policies are not in their favor, because they want to use the product. I printed out the iTunes license agreement once. It was 29 pages of 9-point font. No one is going to read that! That’s why I think we really need more collaboration between experts in computer science and experts in psychology. As systems get more and more complex and everyday people have to make decisions about privacy settings on their computer or home router, we need to design systems and educate users so that the consequences of each decision they have to make are much clearer. That is certainly not the case right now. Unfortunately, until software providers accept accountability for their products—until they have incentive to change—the situation will remain challenging.

Q. What areas in the field besides security and privacy need attention?

K.C. We need to focus on parallelism. You often hear that Moore’s Law is running out of gas. On the contrary, Moore’s Law is still going strong; but the dividends of Moore’s Law are now being paid in parallelism, not in faster sequential computation. Rather than doing each step of the computation faster, you can do multiple steps at once in the same amount of time.

J.G. As far as teaching parallelism in the classroom, we have to change our approach. We’ve been teaching the students a step-by-step process; basically, that’s how computer scientists have always conceived writing programs. Well, now we have multiple processors running on chips and we have to start thinking, “How do I write a program that does three things at once? Eight things at once?” What happens when the chips allow us to do hundreds of things at once? We need to start changing the mindset of everyone in our program and challenge them to think, “I’m going to do lots of things at once.”

K.C. If you’re only doing one thing at a time, you cannot take advantage of the additional power that Moore’s Law is giving you. So, like Jim said, we have to be able to figure out how to do multiple things at once, like putting meat on the stove to brown and, while that’s happening, mixing other ingredients. That’s the way we need to think about things all the time. It’s not trivial. We want to turn out graduates who can master doing things in parallel because this is the way it’s going to be from now on. Right now, though, the tools we have for taking advantage of Moore’s Law and parallelism aren’t very good, so it’s definitely an area that needs attention.

Q. How much of a challenge is it to stay on the leading edge of an industry where technology changes so rapidly, let alone translate those changes into your curricula?

K.C. It’s almost impossible. We could spend all of our time just trying to keep up. It’s a catch-22: we have to show our students technology and let them get their hands dirty, but the reality is whatever we show them as freshmen will have changed and might even be obsolete by the time they are seniors. Five years ago, everybody was using Perl and CGI scripts on the Web. Now those tools have been replaced by a new generation of languages and platforms. So, our task is to teach fundamental principles and I think we do a good job of that. Fortunately, students quickly adapt to the rate of change. They’re fearless and not afraid to pick up new technology and play with it. I consider that a good thing and we need to try to leverage it in the classroom.

J.G. At the same time, we faculty have to make the purpose of learning fundamental concepts and principles clear to them. They have to know that chances are whatever programming language we teach them their freshman year will probably be out of date by the time they graduate. The turnaround times really are that short.

K.C. That actually seems to make it easier to motivate our students to learn the fundamentals, though, because incoming students have seen the short life cycles of various technologies several times already. It’s pretty obvious to them now that if they don’t focus on the stuff that doesn’t change, they’re not going to be able to adapt when they’re forced to.

J.G. Even though I’m a longstanding faculty member, I often learn from the students. There is so much software out there, so many programs, so many computing languages, that I can’t play with them all. Students will come to me and tell me about a program and I’ll say, “Explain it to me. How does it work? What does it do?” I learn a lot from interacting with them.

K.C. The only way to stay on the leading edge is to invent everything. We have a weekly “Keeping Current” seminar, where students share what they’ve learned or some new technology they’ve discovered. They’re always coming in and telling us about stuff we’ve never heard of. It’s a volunteer thing, very informal, but a lot of fun. There are so many tools around, it’s just unbelievable.

Q. How does the future of computer science look from the perspective of college students choosing it as a career?

K.C. It couldn’t be better. In the early 2000s, people were afraid all the computer science jobs were going to be outsourced overseas. That hasn’t happened. In fact, the Bureau of Labor Statistics projects that software engineering jobs will grow by 38% over the next ten years—one of the top professions as far as growth. Our students are in demand and will continue to be in demand for a long time. I am constantly being contacted by people wanting to hire our graduates. It’s clear there are more jobs than people to do them, and I don’t see that changing.

J.G. I was contacted by a mid-sized company the other day that decided they were going to get into the mobile world, but didn’t have a clue as to how to go about it and wanted to know if any of our students or graduates could help them figure it out. Companies need people who know how to take advantage of the technology, not just throw around terms. One aspect that will change in light of the switch to cloud computing, however, will be the kinds of jobs available. There won’t be as much need for systems administrator jobs if everything is run through a centralized data center. So what a graduate might do once they’re in the marketplace might change, but the demand is still very high.

K.C. Our goal is to equip students to be able to adapt to change. We teach them how to think and how to learn because that’s the only way they’re going to survive. If they think they’re going to learn C++, graduate and be a C++ programmer all their lives, it’s just not going to happen.

Q. What are some myths and misconceptions about the computer science industry?

J.G. One myth I often hear is that all the exciting stuff is happening in industry. “Companies are where the exciting things are happening,” someone will say, downplaying the need for education in the field. While it’s now true that bright high school kids can get programming jobs with big companies right away, we still believe in the importance of developing a skill set based on the fundamentals that will last a long time.

K.C. I think another myth is that computer science is all about programming. Computing professionals need to have an understanding of programming, but it’s even more important to have a broad understanding of the business you’re in: social networking, data mining, business concepts, etc. The future is about applications and applying computing to problems in biology, medicine, engineering, the environment, business, entertainment and other industries—it’s a great time to be a software entrepreneur! Another myth is that computer science is something only guys would want to do. The stereotypical image of scruffy-haired guys with beards staring at computer screens needs to be replaced by one which illustrates the openness of the field to anyone who wants to get in on the opportunities available.
