Natural language processing with the spaCy library in CoCalc

Home page: spaCy

Jupyter kernel

Run this notebook with the Python 3 (Ubuntu Linux) jupyter kernel.

Setting up

Python packages in a CoCalc project should be installed either for the current user (pip's --user option) or inside a virtual environment, e.g. one created with Anaconda or virtualenv. This example follows the user-install approach.

Install spacy and dependencies.

Open a .term file in CoCalc and run the following steps. The installation takes about a minute:

~$ time pip3 install --user spacy
...
Installing collected packages: cymem, preshed, plac, pathlib, murmurhash, msgpack-numpy, cytoolz, thinc, regex, spacy
Successfully installed cymem-1.31.2 cytoolz-0.8.2 msgpack-numpy-0.4.1 murmurhash-0.28.0 pathlib-1.0.1 plac-0.9.6 preshed-1.0.0 regex-2017.4.5 spacy-2.0.11 thinc-6.10.2
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
 
real    0m51.041s
user    0m37.709s
sys     0m7.039s

Install one or more models.

The spaCy instructions for installing models recommend the spacy download command, which selects the model version that matches the current installation. In CoCalc this command fails, because a regular user has no write permission for the system-wide site-packages directory. Running it is still useful: the output shows the URL of the selected model.

~$ python3 -m spacy download en
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
...
    error: could not create '/usr/lib/python3.5/site-packages/en_core_web_sm': Read-only file system

Now use pip3 to install the model.

~$ pip3 install --user \
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
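
The URL printed by the failed download appears to follow a simple pattern of model name and version. The sketch below rebuilds it from those two parts. Note that model_url and pip_command are hypothetical helpers written for this notebook, and the pattern is inferred from the single URL shown above rather than taken from the spaCy documentation, so check it against the actual download output before relying on it.

```python
# Hypothetical helpers, not part of spaCy. The URL pattern is an assumption
# inferred from the one URL printed by "python3 -m spacy download en" above.
RELEASES = "https://github.com/explosion/spacy-models/releases/download"


def model_url(name, version):
    """Return the tarball URL for a model release (assumed pattern)."""
    tag = "{}-{}".format(name, version)
    return "{}/{}/{}.tar.gz".format(RELEASES, tag, tag)


def pip_command(name, version):
    """Return the user-install pip command as a list of arguments."""
    return ["pip3", "install", "--user", model_url(name, version)]


# Name and version as selected by the download command in this example
print(" ".join(pip_command("en_core_web_sm", "2.0.0")))
```

The printed command can be pasted into the .term file. Returning the command as a list also makes it easy to hand to subprocess if you prefer to install from Python.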

Ready to go.

After the above steps, you can run this notebook, which follows the example on the spaCy website.
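
Before running the cells, it can help to confirm that the kernel actually sees the user-installed packages. The following is a minimal sketch using only the standard library. The helper is_installed is a name made up for this notebook, and the check assumes the model was installed as the importable package en_core_web_sm, as in the pip3 step above.

```python
import importlib.util
import site


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# pip3 install --user places packages in this directory
print("user site-packages:", site.getusersitepackages())

# Assumed package names: the library and the model installed above
for name in ("spacy", "en_core_web_sm"):
    print(name, "found" if is_installed(name) else "MISSING")
```

If a package is reported as missing even though pip3 succeeded, the kernel is probably using a different Python than the one in the terminal.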

In [1]:

# check kernel - verify we're running Python 3.5 or later
import sys
print(sys.version)
assert sys.version_info >= (3, 5), "this notebook expects Python 3.5 or later"

3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]

In [2]:

import spacy

In [3]:

# Load English tokenizer, tagger, parser and NER
# (the small model en_core_web_sm does not ship with word vectors)
nlp = spacy.load('en_core_web_sm')

In [4]:

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

In [5]:

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
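
The flat list above is often easier to use when grouped by label, for example to collect all PERSON mentions. A minimal sketch of such a grouping follows. The function group_by_label is a hypothetical helper, not a spaCy API, and the pairs are copied from the output above so that the cell runs on its own.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (text, label) pairs into {label: [texts]}, keeping first-seen order."""
    groups = defaultdict(list)
    for text, label in pairs:
        if text not in groups[label]:  # skip repeated mentions
            groups[label].append(text)
    return dict(groups)


# (text, label) pairs copied from the output above. In the notebook, pass
# ((ent.text, ent.label_) for ent in doc.ents) instead.
pairs = [
    ("Sebastian Thrun", "PERSON"),
    ("Google", "ORG"),
    ("2007", "DATE"),
    ("American", "NORP"),
    ("Thrun", "PERSON"),
    ("Recode", "ORG"),
    ("earlier this week", "DATE"),
]

for label, texts in sorted(group_by_label(pairs).items()):
    print(label, texts)
```

Note that "Sebastian Thrun" and "Thrun" stay separate entries: the grouping compares strings only and does not resolve which mentions refer to the same person.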

In [6]:

# Determine semantic similarities
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)

my fries were super gross such disgusting fries 0.7139700916321534
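
By default, spaCy computes similarity as the cosine similarity of the two objects' vectors. The sketch below shows that measure in plain Python on made-up toy vectors; it is an illustration of the formula, not spaCy's implementation. Because the small model has no real word vectors, the score above is derived from context tensors and is best read as a rough indication.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.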
