July 30, 2013
JSTOR is a not-for-profit digital library that provides access to scholarly journals, books, and primary sources to people around the world. Established in 1995, JSTOR was originally conceived with three primary goals: to help libraries save space (and therefore lower costs) by digitizing what were then little-used older print journals; to help scholarly societies and publishers make the transition to digital forms of communication; and to expand access for scholars and students. At the core of the JSTOR library is an archive of more than 2,000 journals, comprising upwards of 50 million pages of content that JSTOR has digitized, actively preserves, and makes accessible. The archive contains the entire run for each journal, from its earliest issue through the most recent 3 to 5 years. Most of the content in the archive is under copyright. This content is made available with the permission and cooperation of more than 1,000 publishers who trust us to act as responsible stewards of their content.
Today, the JSTOR archive is available primarily through institutions. Currently 8,400 educational, cultural, and other institutions in 168 countries around the world participate in supporting and sustaining the archive. Most institutions pay fees to cover the costs of JSTOR’s work. These fees are scaled to enable access for as many institutions as possible. Institutions in Africa and in developing nations elsewhere have had free access to JSTOR since 2006; today, nearly 900 institutions from 44 countries have JSTOR access for free. In addition, for the general public, JSTOR provides free access to nearly 500,000 articles no longer under copyright and offers free reading access to more than 1,300 journal titles. In 2012, JSTOR was accessed approximately 540 million times around the world. To learn more, please see 10 Things about JSTOR.
Chronology of Events
On Saturday, September 25, 2010, we noted a tremendous increase in the number of articles downloaded from JSTOR, many of which appeared to be generated by a robot using the MIT network. The volume of activity, hundreds of downloads per minute, was having a negative impact on our servers and therefore was prohibited by JSTOR’s terms of service. We took the preventative measure of blocking the single IP address from which the requests were originating to stop it from continuing to access content on the JSTOR site (see note 1). We posted a message to this IP address on September 25 stating that access from the address was suspended and asking the user(s) to review our Terms & Conditions of Use and to contact us with any questions. We did not hear back. Instead, the downloading continued the following morning, September 26, from a different IP address associated with MIT.
We then took the further step of blocking a Class C range of MIT IP addresses, which brought the downloading to an end. We contacted MIT on Sunday, September 26 alerting them to the step we had taken and asking them to investigate. On Wednesday, September 29, MIT contacted us and said that they believed the origin of the activity was a guest visiting MIT, and that they believed the activity would not recur. We immediately restored access for the Class C addresses.
While the rate of downloading exceeded anything we had previously experienced, attempts to download large numbers of articles from JSTOR are not unusual. There are many legitimate reasons why people want to download large data sets for their research. We frequently support such research by providing access to datasets, free of cost, in a way that does not affect access for other users (see note 2). In this case, no user contacted us, and it appeared that the user switched IPs to continue the downloading despite our attempts to block the activity.
On Saturday, October 9, we again detected rapid downloading, this time at an even faster rate. The downloading overloaded several servers, disrupting access to JSTOR for users beyond MIT. It appeared that the overloading would spread to other servers if the activity were allowed to continue. The evening of October 9, we blocked access to all MIT IP addresses (a Class A range). We alerted MIT on Saturday night, October 9. We did not take this step lightly, but the nature and scale of the activity were unprecedented. Although we were able to stop the downloading more quickly this time and the negative impact on access for non-MIT users of the JSTOR website was relatively brief, it came at the cost of having to suspend access for all MIT staff and students, an undesirable outcome for everyone.
At this point, there had been multiple events of large-scale and rapid downloading from MIT in a matter of a few weeks. An unknown user had downloaded approximately 450,000 articles, including many under copyright, and had done so in ways that seemed intent on evading our efforts to stop the activity. We could see that the downloader was sequentially accessing the articles and therefore appeared to be trying to acquire a significant portion or perhaps the entire database, but we could not identify the individual or ascertain a motivation. Where were the hundreds of thousands of articles that had already been downloaded, and what was being done with them? It was still possible that the articles were being downloaded for a research project, but why hadn’t the user contacted us?
It was now well into the fall semester and for several days all MIT students and faculty had been unable to access JSTOR, a resource used by hundreds of researchers and students per day at MIT during that time of year. Even though we did not know who was responsible for the downloading, what they intended to do with the copies of articles they had acquired, and whether they would recommence downloading once the block on MIT access was lifted, we were extremely reluctant to continue to deny access to all MIT users. We therefore restored access to the Class A range at MIT on Tuesday, October 12. We did not detect additional accelerated downloading for the rest of October.
Meanwhile, on October 14, we asked MIT if they could identify the person responsible because we wanted to understand the downloader’s motivation, to ensure the articles already downloaded would not be distributed, and to prevent further downloading. MIT responded on October 16 that they had not been able to identify the person using their network. This caused us great concern. We thought it was possible that the downloading was being conducted to acquire and re-distribute the entire database and discussed internally how best to deal with the situation, including the possibility of involving law enforcement. In the end we neither asked MIT to contact law enforcement nor did so ourselves. Subsequently, we began working with MIT to develop alternative, more secure approaches to authenticating MIT users to JSTOR, noted in a message from MIT to JSTOR on October 20.
By October 26, we and MIT had agreed on an alternative approach for authenticating MIT users (“the redirect”) that would increase security by requiring users to log in to the MIT library with their MIT credentials before using JSTOR. This step would prevent access for guest users, thereby closing off the entry point that had been used for the downloading incidents. Putting such a restriction in place was not ideal. It meant that MIT users would have to log in to MIT to access JSTOR when other academic and research resources were available to them without such a step. It also prevented “walk in” users from gaining access to JSTOR at MIT, which we allow in almost all libraries that provide access to JSTOR. We ultimately agreed that the redirect seemed necessary, despite our reservations.
Our monitoring systems did not alert us to accelerated downloading at MIT in November and most of December. By mid-December we had completed work on the redirect and, pending testing by JSTOR and by MIT, planned to implement the change in early January 2011. Later, we discovered that significant downloading had, in fact, continued during this time using a method that we did not detect. We only became aware of the downloading during a retrospective analysis of the incident conducted in January 2011. As has been widely reported, approximately 4.8 million articles (80% of our entire database at the time) were downloaded between late September 2010 and early January 2011, many of them during December.
It was not until Sunday, December 26, that our monitoring once again identified unusual downloading activity, the nature of which seemed consistent with the previous events. At this point, we concluded that the downloader was purposefully and systematically attempting to acquire the entire JSTOR database. We suspended a Class C range of IP addresses at MIT. We sent an email to MIT late in the day on December 26 that informed them of the group of IP addresses we had suspended and provided the building name from which the download requests were originating. We also reiterated our desire that MIT identify the individual(s) responsible and our desire to ensure that the content downloaded was secured and deleted.
As a follow-up to our message, we called MIT and were informed by an automated voice message that MIT staff was on a budget-mandated furlough and would return on Monday, January 3. We subsequently learned that our block of the Class C range of addresses had not stopped the activity, as the downloading had shifted to a different IP address outside that range. MIT responded by email on January 3, suggesting that we immediately implement the redirect, which would block all guest access. We agreed and began implementing the redirect.
The next day, Tuesday, January 4, we received an email from MIT suggesting that we block a larger Class B range of IP addresses since they believed additional downloading activity was taking place, and urging us to move as swiftly as possible to implement the redirect. The message also informed us that MIT did not expect to be able to identify the individual responsible in these incidents. We replied later on January 4, inquiring again whether the individual could be identified based on the location that was associated with the most recent downloading activity.
The following morning, Wednesday, January 5, we received an email from MIT indicating that the machine they believed was the source of the excessive downloading had been located, and that they wanted to leave it there for several days while the investigation continued. Later on January 5, we received another email noting that the investigation had moved beyond MIT to law enforcement, including federal authorities. We do not know how law enforcement became involved. We did not hear further from MIT for several days. On January 10, we learned in an email from law enforcement, forwarded by MIT, that a suspect had been detained and that the suspect was Aaron Swartz.
Shortly thereafter, the United States Attorney’s Office reached out to us with initial questions they had about the incident. At that point, our primary concern remained the same—where were the copies of the articles and what had been done with them? We responded to the United States Attorney’s Office and also reached out directly to Mr. Swartz’s attorney. Ultimately, we learned that Mr. Swartz had retained possession of the downloaded articles and was prepared to stipulate that he had the only copies and that they had not been uploaded or distributed. In June 2011, we reached a settlement with Mr. Swartz in which we agreed not to pursue civil litigation and in which he agreed to turn over all copies of the articles that he had downloaded and certified that he had not distributed the articles. These assurances enabled us to confirm that the harm to JSTOR was limited, and that the content of the archive—for which we act as steward on behalf of the publishers—was secure. We then communicated to the United States Attorney’s Office that although we recognized that any charging decision was entirely up to the government, from our perspective, we preferred that no charges be brought.
A federal grand jury indicted Mr. Swartz in July 2011, and we made a public statement at the time indicating that we had no interest in this becoming an ongoing legal matter. Following the indictment, we continued to respond to subpoenas as required by law.
We did not know at the time, nor do we know now, what Mr. Swartz was going to do with the nearly 5 million articles he downloaded; however, distribution of these articles would have undermined our relationships with participating JSTOR publishers and the sustainability of our service, including our ability to provide access and to preserve the content for future generations.
Creating and sustaining the JSTOR archive is a substantial enterprise involving, among other things, the physical effort of locating and scanning millions of hard-copy journal pages; the legal complexities of licensing content from hundreds of copyright holders for whom JSTOR acts as a custodian and steward; and the operational challenges of keeping this global archive up and running. These activities cost money, and we charge fees to cover these costs.
We are a not-for-profit organization, and when possible, we offer deeply discounted or free access in furtherance of our mission. Taking this approach, we have made enormous progress in expanding access to knowledge. Millions of students and researchers at institutions from all over the world now regularly access a massive amount of material formerly available only in print copies and that for all practical purposes had been inaccessible to them. We continually recalibrate our approach and evolve to meet the changing needs of our publishers, libraries, and users, including finding new ways to offer affordable and free access to more people.
The downloading incidents were a serious event for us. We attempted to respond to what transpired in a measured way, taking the interests of all affected parties into account. With these documents available, readers can reach their own conclusions as to how we responded. We remain saddened by the tragic loss of a gifted young man who contributed a great deal in his short life. Our hope is that providing these documents might contribute to a more complete understanding of these events so that we as a community might learn from them in ways that benefit us all.
1 It is important to note that nearly all licensed online academic resources rely on IP addresses as the means for identifying which users are authorized to get access. An IP address is a unique number that identifies a particular computer accessing the internet. A university will have a unique range of these IP addresses assigned to it. If a user’s IP address is recognized as part of the authorized list under the control of a licensed institution, that user gets access without having to log in or enter an ID or password. It is a very convenient form of authenticating and permitting access to users, but it is limited in its ability to provide security.
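The IP-based authorization described in this note can be illustrated with a minimal sketch in Python. It is not JSTOR's actual implementation: the institution names, the address ranges, and the `authorize` function are all hypothetical, chosen only to show the general technique. In the CIDR notation used below, the "Class C", "Class B", and "Class A" ranges mentioned in the chronology correspond to /24, /16, and /8 networks respectively.

```python
import ipaddress

# Hypothetical licensed institutions and their address ranges.
# These are private/documentation ranges, not real institutional ones.
AUTHORIZED_NETWORKS = {
    "Example University": [ipaddress.ip_network("10.0.0.0/8")],   # a "Class A" sized range
    "Example College": [ipaddress.ip_network("192.0.2.0/24")],    # a "Class C" sized range
}

# Hypothetical ranges that have been temporarily suspended.
SUSPENDED_NETWORKS = [ipaddress.ip_network("10.31.5.0/24")]


def authorize(client_ip: str):
    """Return the name of the licensed institution for client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    # A suspended range overrides any license that would otherwise apply.
    if any(addr in net for net in SUSPENDED_NETWORKS):
        return None
    # Access is granted purely on the basis of the source address;
    # no login, ID, or password is involved.
    for institution, networks in AUTHORIZED_NETWORKS.items():
        if any(addr in net for net in networks):
            return institution
    return None


if __name__ == "__main__":
    for ip in ["10.1.2.3", "10.31.5.77", "10.31.6.1", "203.0.113.9"]:
        print(ip, "->", authorize(ip))
```

In this sketch, any machine on an authorized network is granted access, whether it belongs to a student or to a guest, which is the security limitation the note describes. It also shows why a narrow block is easy to sidestep: suspending 10.31.5.0/24 leaves the neighboring address 10.31.6.1 authorized.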