Over 80% of the data available in the world today is currently unreadable by computers. These “dark data” are unstructured and include a wide range of invaluable information sources, from the text of scientific articles to the notes written by your doctor. Transforming these data into a form readable by machines is called knowledge base construction and is a vital process for unlocking the potential found in these resources.
Current approaches for automatically building knowledge bases require large, labeled datasets for training. These gold standard datasets are difficult to come by, particularly in biomedicine, limiting our ability to create new knowledge bases that can be analyzed.
Snorkel was created in response to this challenge. It constructs knowledge bases from “dark data,” and unlike other approaches, which require precisely labeled data to train and build models, it can work with just a set of user-written rules. We invite you to participate in a two-day hands-on workshop to learn more about the Snorkel platform and to get assistance in applying Snorkel to your own research.
This workshop targets individuals who are interested in applying state-of-the-art machine reading approaches to extracting information from the text and tables of scientific documents.
Attendees should have an idea for a specific biomedical problem of their own to which they would like to apply Snorkel. Applicants whose project ideas use a dataset to which they already have access, or use PubMed or other open access document collections, are more likely to be accepted. There are several such collections available at http://deepdive.stanford.edu/opendata/.
Attendees do not need to have machine learning backgrounds, but they do need to have some basic Python programming skills.
Snorkel was designed to address problems such as extracting all drugs and the diseases they treat from the scientific literature, or automatically building a complex network capturing how proteins interact with other proteins. With Snorkel, you don’t need thousands of labeled examples to produce such models, and in some cases, high-accuracy systems can be built in as little as a single day of development.
Snorkel achieves this by using the new data programming paradigm, in which the user writes a set of labeling functions, which are descriptions of how to label things. A simple example of a labeling function is: the appearance of the phrase “caused by” between a disease name and an environmental factor indicates that the environmental factor is a risk factor for the disease. The resulting labels are noisy, but Snorkel automatically models this process (learning, essentially, which labeling functions are more accurate than others) and then uses this model to train an end model.
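To make the “caused by” example concrete, here is a minimal sketch of what such a labeling function might look like in plain Python. It does not use the Snorkel API; the function name `lf_caused_by`, its arguments, and the label constants are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: plain Python, not the Snorkel API.
# All names below (POSITIVE, ABSTAIN, lf_caused_by) are hypothetical.

POSITIVE = 1   # the factor is labeled a risk factor for the disease
ABSTAIN = 0    # the labeling function has no opinion


def lf_caused_by(sentence, disease, factor):
    """Vote POSITIVE if 'caused by' appears between the two mentions."""
    text = sentence.lower()
    d = text.find(disease.lower())
    f = text.find(factor.lower())
    if d == -1 or f == -1:
        return ABSTAIN  # one of the mentions is missing from the sentence

    # Order the two mentions by position, then look at the text between them.
    (first_start, first_len), (second_start, _) = sorted(
        [(d, len(disease)), (f, len(factor))]
    )
    between = text[first_start + first_len:second_start]
    return POSITIVE if "caused by" in between else ABSTAIN


if __name__ == "__main__":
    print(lf_caused_by(
        "Mesothelioma is often caused by asbestos exposure.",
        "mesothelioma", "asbestos"))
    print(lf_caused_by(
        "Asbestos was found near patients with mesothelioma.",
        "mesothelioma", "asbestos"))
```

A rule like this will sometimes fire incorrectly or miss valid cases, which is why its output is treated as a noisy vote rather than a final label.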
Snorkel is currently used for several high-impact applications:
- The U.S. Food and Drug Administration’s project to extract relationships related to the microbiome from the scientific literature
- Mining millions of EHR clinical records to quantify pain and to understand osteoarthritis progression and treatment outcomes
- Extracting tables and other semi-structured data from scientific and technical publications
The workshop runs July 19-20, 2017 from 9:30am to 5:00pm. On the first day, participants will learn about the Snorkel workflow through brief lectures and hands-on activities. This will include:
- Writing labeling functions using pattern-matching and comparisons against existing dictionaries (e.g., Unified Medical Language System)
- Fitting a model to the labeling functions, and assessing it, to generate the training data
- Hearing about examples of problems that can and cannot be addressed with Snorkel
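The first two activities in the list above can be sketched roughly as follows. This is a simplified illustration in plain Python, not the Snorkel API: the labeling functions, the tiny `KNOWN_DRUGS` dictionary (standing in for a resource such as the Unified Medical Language System), and the helper names are all hypothetical. In particular, the unweighted majority vote is a deliberately simple stand-in for the model Snorkel fits, which instead learns how accurate each labeling function is.

```python
import re

# Illustrative sketch only; every name here is hypothetical.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# Hypothetical mini-dictionary; a real one might be drawn from UMLS.
KNOWN_DRUGS = {"aspirin", "metformin", "ibuprofen"}


def lf_treats_pattern(sentence):
    """Pattern matching: vote POSITIVE on treatment-like phrases."""
    pattern = r"\b(treats|treated with|therapy for)\b"
    return POSITIVE if re.search(pattern, sentence, re.IGNORECASE) else ABSTAIN


def lf_known_drug(sentence):
    """Dictionary comparison: vote POSITIVE if a known drug is mentioned."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return POSITIVE if tokens & KNOWN_DRUGS else ABSTAIN


def lf_negation(sentence):
    """Vote NEGATIVE when the sentence contains a negation cue."""
    pattern = r"\b(not|no evidence|failed to)\b"
    return NEGATIVE if re.search(pattern, sentence, re.IGNORECASE) else ABSTAIN


LFS = [lf_treats_pattern, lf_known_drug, lf_negation]


def label_matrix(sentences):
    """One row per sentence, one column per labeling function."""
    return [[lf(s) for lf in LFS] for s in sentences]


def coverage(matrix):
    """Fraction of sentences on which each labeling function does not abstain."""
    n = len(matrix)
    return [sum(row[j] != ABSTAIN for row in matrix) / n for j in range(len(LFS))]


def majority_vote(row):
    """Unweighted vote; a simple stand-in for a learned label model."""
    total = sum(row)
    return POSITIVE if total > 0 else NEGATIVE if total < 0 else ABSTAIN


if __name__ == "__main__":
    sentences = [
        "Metformin is a first-line therapy for type 2 diabetes.",
        "Aspirin did not reduce symptoms in this cohort.",
        "The study enrolled 40 adults.",
    ]
    matrix = label_matrix(sentences)
    print(matrix)
    print(coverage(matrix))
    print([majority_vote(row) for row in matrix])
```

Inspecting coverage and disagreements between labeling functions is one simple way to assess a rule set before any training labels are generated from it.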
Using their new knowledge, participants will then design a plan to apply Snorkel to their own biomedical research question. On the second day, participants will receive feedback on their plans and begin implementing them, pending data availability.
How to Apply
To be considered for the workshop, submit your application by midnight PDT on Wednesday, May 31, 2017. The workshop is free to attend, but space is limited.
We will select up to 4 individuals to be reimbursed for allowable travel and lodging expenses up to $1000 each. Individuals will be selected based on the potential impact of their proposed application using Snorkel and the project’s likelihood of success (e.g., data availability, suitability of the applicant).
We highly encourage those collaborating on a single project to participate in the workshop together. The project description only needs to be filled out by one member of the group. However, each individual within the group should fill out and submit the personal information part of the application.
Meeting Location and Logistics
Jerry Yang and Akiko Yamazaki Environmental and Energy Building (Y2E2), Room 299
473 Via Ortega
Stanford, California 94305
Participants will be required to bring their own laptop to the workshop. More logistical details are available here.
- Deadline to Submit Application – 11:59PM PDT, Wednesday, May 31, 2017
- Acceptance Notifications – Mid-June, 2017
- Workshop Dates – July 19-20, 2017
- [blog post] Data Programming: ML with Weak Supervision
- [tutorials] Explore several demo Snorkel applications
Email us at firstname.lastname@example.org.