Crowdsourcing public records requests

The discussions about using body cams in the law enforcement community have continued this week.  NPR’s Morning Edition aired a piece entitled “Transparency vs. Privacy.”  The transparency comes from laying bare the interactions of police with the public.  But Martin Kaste interviewed the chief of the Los Angeles Police Department, who said that while videos would be made available for legal cases, privacy concerns would prevent their widespread distribution.  The confidentiality provisions that apply to law enforcement records are numerous.

The man in Seattle who made a public records request for all police videos has now been identified.  Timothy Clemans explained his rationale to Kaste: “If we make all these videos public and people really start watching them, that any inappropriate use of force and bias policing will eventually go away because there’ll just be so many people complaining all day long.”  But instead of fighting the request, the chief operating officer of the Seattle police department has taken the novel approach of enlisting Clemans and other techies to help devise a way to redact information from the police videos that should not be made public.  Clemans has suggested a method for blurring people’s identities in videos.  The Seattle PD hosted a hackathon in the hopes of generating more ideas of how to balance transparency with privacy.  While it will still take some time to parse the results, the Seattle Times reported the COO considered the event a success.  Obviously Seattle has the advantage of being located in a tech hub — it will be interesting to see whether other localities are able to coordinate similar sorts of events to harness the power of digital activists and whether the solutions proposed in Seattle get wider usage.

By the way, Clemans has withdrawn his public records request.

Electronic Records Day

The Council of State Archivists (CoSA) declares October 10 to be Electronic Records Day.  This is a day to raise awareness among government agencies, related professional organizations, the general public, and other stakeholders about the crucial role electronic records play in our world.  Here is the list of ten reasons they suggest people should be focusing on electronic records:

  1. Managing electronic records is like caring for a perpetual toddler: they need regular attention and care in order to remain accessible.
  2. Electronic records can become unreadable very quickly.  While records on paper can sometimes be read after thousands of years, digital files can be virtually inaccessible after just a few.
  3. Scanning paper records is not the end of the preservation process: it is the beginning.  Careful planning for ongoing management expenses must be involved as well.
  4. There are no permanent storage media.  Hard drives, CDs, magnetic tape or any other storage formats will need to be tested and replaced on a regular schedule.  Proactive management is required to avoid catastrophic loss of records.
  5. The lack of a “physical” presence can make it very easy to lose track of electronic records.  Special care must be taken to ensure they remain in controlled custody and do not get lost in masses of other data.
  6. It can be easy to create copies of electronic records and share them with others, but this can raise concerns about the authenticity of those records.  Extra security precautions are needed to ensure e-records are not altered inappropriately.
  7. The best time to plan for electronic records preservation is when they are created.  Don’t wait until software is being replaced or a project is ending to think about how records are going to be preserved.
  8. No one system you buy will solve all your e-records problems.  Despite what vendors say, there’s no magic bullet that will manage and preserve your e-records for you.
  9. Electronic records can help ensure the rights of the public through greater accessibility than ever before, but only if creators, managers and users all recognize their importance and contribute resources to their preservation.
  10. While they may seem commonplace now, electronic records will form the backbone of the historical record for researchers of the future.
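
Points 4 and 6 on this list (media that must be tested on a schedule, and copies whose authenticity can be questioned) are commonly addressed with a fixity check: record a checksum for every file, then recompute the checksums later and compare. Here is a minimal sketch in Python. The function names and the manifest layout are illustrative assumptions, not anything CoSA prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(folder, manifest):
    """Compare the files now in folder against a stored manifest.

    Returns (missing, changed): names that have disappeared, and names
    whose contents no longer match the recorded checksum.
    """
    current = build_manifest(folder)
    missing = sorted(name for name in manifest if name not in current)
    changed = sorted(
        name for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return missing, changed
```

In practice the manifest would be saved somewhere separate from the records themselves and the audit rerun on a regular schedule, so that silent corruption or an unauthorized edit is noticed while a good copy still exists.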

CoSA has also generated a document called Survival Strategies for Personal Digital Records that provides suggestions for dealing with backups, migration, and other issues for personal files and digital images.

Earlier this week, an article was published entitled “The New Digital Workplace,” and some of its points about the future of work are interesting to consider through the lens of archives and records management.  Among the expectations it describes, these are the ones I believe could (or should) apply to archives:

  • search that works — standards and interoperability and catalogs have been discussed for years, but there’s still much to be improved about how patrons can find and utilize archival collections
  • rich media tools to communicate — many repositories have embraced social media, but I think there are still more ways that the reference experience in particular could be improved (e.g., reference interviews could take place via Skype before a researcher makes a trip to the repository)

While there’s no questioning the allure of mobile apps, I think the general lack of budget and IT support is going to make it hard for most repositories to begin designing their own apps (though perhaps a hackathon could offer its services).  What remains to be seen is whether archives will provide access to digital collections only in the search room, as has been the norm for most paper records, or devise a way to provide more robust service online.  Once this is determined, it will also be interesting to see whether the new emphasis on collaboration that is sweeping the worlds of business and education will affect the realm of archival research.

Embracing the power of big data

I attended the North Carolina Digital Government Summit last week.  The keynote speaker was Cynthia Storer, a former CIA analyst.  She made several comments that I found especially relevant:

  • information is power
  • strategic analysis = pointing something out that no one knew they needed to know
  • transparency is key to success

A session on Digital Analytics incorporated a quote from George Dyson: “Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away.”

A session on open data and crowdsourcing explained the eight principles of open government data.  Here are some interesting examples of what state and local governments are doing with big data:

  • Seattle is posting 911 call data in real time
  • Montgomery County, Maryland, posts both the recommended and approved operating budgets for its municipal programs
  • New York’s Open Data Portal includes a list of application programming interfaces (APIs) that have been developed using its open data
  • House Facts Standard was deployed by San Francisco to report government data on the health and safety of residential buildings
  • Open311 was developed as a means of reporting public requests and has been adopted in cities such as Chicago
  • NC OneMap is a public service providing comprehensive discovery and access to North Carolina’s geospatial data resources
  • Raleigh has GIS data comparing current zoning with proposed UDO zoning

This range of uses of big data is fascinating in and of itself, but more striking is that it seems these governments are embracing the ideas put forth by Cynthia Storer:

  • information is power
  • strategic analysis = pointing something out that no one knew they needed to know
  • transparency is key to success

I don’t know the backstory to any of these open data projects, but it is fascinating to watch governments being proactive about sharing information.  It will be interesting to continue watching what develops at this intersection of transparent government, big data, and public records.

Harnessing the good of the Internet

Less than two weeks ago, the Smithsonian entered the ranks of institutions that have crowdsourced transcriptions of documents in their collections.  As explained in their press release, they have digitized images of millions of documents, but those that are handwritten or otherwise cannot easily be deciphered by a computer have limited discoverability because their text cannot be searched.  So volunteers can register to participate and then choose among a diverse group of projects to transcribe, which are listed according to the individual museum (e.g., the National Museum of American History) or by theme — American Experience, Biodiverse Planet, Civil War Era, Field Book Registry, Mysteries of the Universe, and World Cultures.  In this short period of time, dozens of projects both short and long have already been completed.  Once a transcription is finished, it is reviewed by other volunteers before it is marked complete.

This work is done anonymously, which fascinates me considering that so much of American society today seems intent on calling attention to individuals.  In their call for volunteers, the Smithsonian targets “researchers, educators, citizen scientists and history buffs” — but I think it would be a great project to find out more about who joins these ranks of volunteers.  Are they retired persons? subject experts? students? technological experts?  While I support doing work for the greater good rather than for individual fame, there will remain a part of me that is very curious about the membership of these new crowdsourcing communities that are being created.

Digital hoarding

Hoarding seems to have become a fascination for Americans in recent years.  Consider that A&E squeezed 6 seasons (41 episodes) out of its show Hoarders.  In 2012, Melinda Beck wrote an article in the Wall Street Journal about digital hoarding.  She cites experts who suggest that the accumulation of digital files verges on hoarding when it is disorganized and interferes with relationships and responsibilities.  She estimates that people use only about 20% of what they save.

While physical hoarding has visible signs, digital hoarding is harder to recognize.  I come across a lot of people who are determined to pare down their stacks and drawers of paper, in favor of scanning these same items and keeping them in digital form, FOREVER.  I’ll confess, I have taken to scanning magazine articles and saving them as tagged PDF files rather than filing the print version.  Having them as files that can be searched through the index function of Windows Explorer, and that are fully keyword searchable, makes them more useful to me, and I have developed a system for filing them that makes them findable.  But I also try to weed out electronic files that have outlived their usefulness, just as I do with my paper files.

It’s not only people — governments are getting into the digitizing craze.  The National Archives and Records Administration has as one of its strategic goals “Make Access Happen,” and according to a blog by David Ferriero, one of the methods of accomplishing this is to digitize records.  The Pittsburgh Post-Gazette reports that a new law takes effect in Pennsylvania today that will allow counties to store court records electronically rather than requiring paper or microfilm record copies.  They have not yet finalized the requisite standards and procedures, but soon enough, PA courts will no longer be required to maintain human-readable court records (i.e., records that can be read without the use of a machine).  The article touts the cost savings this will bring to the counties because of decreased physical storage requirements.

I like the increased access that comes with electronic records.  But my fear is that the rush to digitize ignores the costs of digital preservation.  The Nationaal Archief of the Netherlands has a report on the Costs of Digital Preservation that breaks the costs down in this way:

  • creation of a digital repository — physical space, hardware, and software
  • personnel
  • preservation — software to guarantee the authenticity of records plus efforts required to migrate and/or emulate records
  • public services — training, etc.

In his famous 1995 article for Scientific American, Jeff Rothenberg warns, “digital information lasts forever – or five years, whichever comes first.”  So I guess my main concern is that we not put everything into our digital “file cabinets” and then think we can walk away.  There’s still a lot of work to be done to maintain these files — and there will be costs.  And just as there can be disasters that compromise paper records, electronic records are also vulnerable.  Take as an extreme example this Dropbox disaster that was reported last week.  (Spoiler alert: this story will make you want to start keeping photo albums on your coffee table rather than in the cloud!)  As with all things in life, decisions regarding how to maintain records should be made after thoughtful review and with careful analysis of the costs and benefits.

Updated stories

Several of the topics about which I’ve previously written — specifically, “the right to be forgotten” and the IRS email scandal — continue making news.  So here are some updates.

  • Forbes reported this week that Google has received about 100,000 requests over two months from Europeans wishing certain search results to be removed.  It points out two big problems with Google’s application of the court decision: (1) it only removes the link from the “local” version of Google, so the incriminating link can always be found on the US or some other country’s version of the Google search page; and (2) Google has been notifying web site owners of the removal of the links, which then tends to spawn an investigation that brings the story back into the news cycle, which of course defeats the whole purpose of making it harder for people to locate this information.  It’ll be interesting to see how the interpretation and application of this decision morph over time.
  • A records management take that was posted last week on the IRS email scandal acknowledges the complications of capturing federal records produced as email.  It cites Meg Phillips of the National Archives and Records Administration (NARA) as saying, “the scale of electronic records being created requires more individual decisions than users can be reasonably expected to process in a manual way.”  One possible solution is being tested by the Department of the Interior, whose Office of Records Management is working with NARA on automating the process of identifying which emails fit into which of 500 retention categories in the general records schedule.  They’re piloting an auto-classification system that can index records using algorithms and automatically file them.  While this wouldn’t necessarily have prevented the IRS debacle, it does address a growing problem.  The article goes on to suggest that in order to guarantee honesty, the government needs to be tested more often — in terms of being required to produce public records within a given period of time.  The founder of a company that provides archiving platforms for organizations is quoted as saying that the Securities and Exchange Commission regularly requires private companies to produce records within 48 hours, but the same rigor is not applied to government requests.  Perhaps if the IRS had been in the habit of promptly supplying records, they would have recognized their IT problem long before it was irrecoverable.
  • Of course, it’s increasingly looking like those emails are recoverable.  According to the Washington Post, the House Oversight and Government Reform Committee released testimony Tuesday showing that IRS Deputy Associate Chief Counsel Thomas Kane told congressional investigators that the agency is no longer certain whether it recycled all of the backup tapes containing Lerner’s e-mails.  And perhaps that hard drive was just scratched.  The chairman of the committee said it best: “It is unbelievable that we cannot get a simple, straight answer from the IRS about this hard drive.”  Maybe one of these days we’ll actually get to the bottom of this story.  But don’t count on it happening any time soon.
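
The article does not say how the Interior pilot actually classifies messages. Purely to illustrate what “automatically filing” an email could mean at its very simplest, here is a keyword-rule sketch in Python. The category names, the keywords, and the `classify` function are all hypothetical; a real system working against 500 retention categories would need far richer rules or a trained model.

```python
import re

# Hypothetical retention categories and trigger keywords (assumptions for
# illustration only; not taken from any actual records schedule).
RULES = {
    "personnel": {"hiring", "leave", "payroll"},
    "procurement": {"invoice", "contract", "vendor"},
    "transitory": {"lunch", "reminder", "fyi"},
}


def classify(subject, body, rules=RULES, default="needs-review"):
    """Return the category whose keywords occur most often in the message.

    Messages that match no keyword at all fall through to `default`,
    so a person still reviews anything the rules cannot place.
    """
    words = re.findall(r"[a-z]+", f"{subject} {body}".lower())
    scores = {
        category: sum(1 for word in words if word in keywords)
        for category, keywords in rules.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


print(classify("Vendor invoice", "Please pay the attached contract invoice."))
print(classify("Hello", "See you at the meeting."))
```

The fallback category is the part worth noticing: even a crude classifier reduces the number of manual decisions only if the cases it cannot place are routed to a human rather than filed by guesswork.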

Reusing data

I have spent a good portion of the last week rehearsing and performing Handel’s Messiah with the Duke University Chapel Choir, so it’s on my mind for obvious reasons.  During one rehearsal, our director was encouraging us to emphasize the word “glory” and asserted that glory is the most frequently used word in Messiah.  Given that my library science studies (and more specifically, a digital humanities class) exposed me to the existence of software programs that facilitate word frequency counts, I decided to verify this assertion.  So I scanned the libretto into Adobe Acrobat Pro, cleaned up the text that came through its OCR process, and plugged it into the Character and Word Counter with Frequency Statistics Calculator.  Below is a link to the results in a PDF file, along with a word cloud generated at Word Cloud Generator.  (NOTE: a libretto is printed to help an audience follow the words that are being sung, but it does not generally include every occurrence of a sung word, as the text might differ slightly from one voice part to another.)

Messiah words  Unique words: 483  Total words: 1485
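
For anyone who would rather script this kind of count than paste text into an online calculator, here is a minimal sketch in Python. It is not the tool used for the numbers above, and the sample sentence is only a stand-in for the cleaned-up libretto text; the function name and the rule for what counts as a word are assumptions, so totals from a different tool may not match exactly.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count words case-insensitively, keeping internal apostrophes."""
    # Normalize curly apostrophes, which often survive OCR cleanup.
    normalized = text.lower().replace("\u2019", "'")
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", normalized)
    return Counter(words)


# Stand-in text; replace with the full libretto after OCR cleanup.
sample = "Glory to God, glory to God in the highest, and peace on earth."
counts = word_frequencies(sample)

print("Unique words:", len(counts))
print("Total words:", sum(counts.values()))
for word, n in counts.most_common(3):
    print(word, n)
```

`Counter.most_common` returns the words sorted by frequency, which is all that is needed to test a claim like “glory is the most frequently used word.”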

Messiah word cloud


In addition to being an interesting exercise for a performance week, doing a word frequency count on the text of Messiah made me stop to think about the various ways that data can be reused.  Obviously, those creating the program for the concert did not anticipate my doing this exercise, but with a clear copy of a program that could be scanned, I was able to complete this task with relatively little effort.  But what about the items that are housed in archives of various sorts: do archivists have a responsibility to make these items available for reuse?  And is it sufficient for archives to react to requests for use, such as those from the digital humanities community, or should archives be proactive in anticipating and/or suggesting other uses for the records they house?  (For an example of how digital humanists have presented online items that were originally analog records in an archives, see the William Blake Archive.)  Given that most archives have excessive backlogs that they can’t afford to process, I don’t imagine that archivists will be devoting a lot of time to brainstorming different ways to utilize their records.  But I think special collections archivists would do well to look at what scientists and social scientists are doing to preserve and share datasets in data repositories like the Inter-university Consortium for Political and Social Research (ICPSR), the Odum Institute Dataverse Network, and the Dryad Digital Repository, to name a few.  Encouraging scholarship has always been at the heart of the work of special collections archives, so it’s time to embrace the possibility of new ways to facilitate that scholarship.
