Organizing Data
How you organize your data depends on your project requirements and on how you need to access and share the data. Using consistent methods to name, store, and describe your files based on those parameters will enable you to find and share them more easily. Before you start generating your data, carefully consider:
- file naming conventions
- directory structures
- metadata
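As an illustration of a file naming convention, here is a minimal Python sketch. The pattern it uses (project, short description, date as YYYYMMDD, zero-padded version) is only one possible convention and is an assumption, not something this guide prescribes. The function name `build_filename` and the sample values are hypothetical.

```python
import re
from datetime import date


def build_filename(project, description, created, version, extension):
    """Build a file name following one hypothetical convention:
    project_description_YYYYMMDD_vNN.ext
    """
    def slug(text):
        # Lowercase the text and collapse anything that is not a letter
        # or digit into a single hyphen, so names are safe across systems.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (
        f"{slug(project)}_{slug(description)}_"
        f"{created:%Y%m%d}_v{version:02d}.{extension.lstrip('.')}"
    )


# Sample values are made up for illustration.
print(build_filename("SoilSurvey", "Site A pH", date(2024, 3, 15), 2, "csv"))
# prints: soilsurvey_site-a-ph_20240315_v02.csv
```

Putting the date in year-month-day order and zero-padding the version number keeps files in chronological order when they are sorted by name.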
What is metadata?
Metadata is structured information about data: it explains the who, what, where, and when of data creation, as well as methods of use. It is the basic means of discovery, it facilitates the reuse of data, and it is the foundation for citing datasets. Metadata standards make it easier to search across similar items by describing them with the same terms and constructs. Most subject data repositories mandate a metadata standard. If you don't know which standard to use, try a generic one such as Dublin Core. Metadata records are usually created in XML or another machine-readable format so they can be integrated easily into other systems.
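To make this concrete, the following minimal Python sketch writes a small Dublin Core record as XML using the standard library. The element names (`title`, `creator`, `date`, `description`) and the namespace URI come from the Dublin Core element set. The `metadata` wrapper element, the function name `dublin_core_record`, and all field values are assumptions made for illustration; a repository will normally specify its own required wrapper and fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Return an XML string with one dc: element per entry in `fields`."""
    # "metadata" is a hypothetical wrapper element, not part of Dublin Core.
    root = ET.Element("metadata")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# All values below are invented sample data.
record = dublin_core_record({
    "title": "Stream temperature readings, 2023 field season",
    "creator": "Example Researcher",
    "date": "2023-09-30",
    "description": "Hourly temperature readings from a hypothetical study site.",
})
print(record)
```

Each field answers one of the who, what, where, and when questions, and because the record is XML, another system can read it without knowing anything about how the files themselves are laid out.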
Examples of Discipline-based Metadata Standards
Fantastically Helpful MIT Tutorial
Storing Data
Things to consider when you store your data:
- Will the data be used internally (lab group only) or shared with others at other institutions?
- Are there any security restrictions that need to be in place? Who can have access to your data?
- Are you storing the data as working documents, or is it archived? Is versioning a priority for your project?
- How long do you need to store the data? Are there any requirements to destroy data after a certain period of time?
- How will you store your data: on a portable hard drive, in the cloud, on a department or project server, or in an external data repository?
- What will it cost to store your data, and who will pay for it?
Data Working Group
Science Data Repositories
ChemSpider:
ChemSpider is a free chemical structure database providing fast text and structure search access to over 26 million structures from hundreds of data sources.
Dryad:
Dryad is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences.
EarthChem:
EarthChem is a community-driven effort to facilitate the preservation, discovery, access and visualization of the widest and richest geochemical datasets.
GenBank:
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
ICPSR:
ICPSR maintains a data archive of more than 500,000 files of research in the social sciences.
NSIDC:
National Snow and Ice Data Center distributes more than 500 cryospheric data sets for researchers, from both satellite and ground observations.




