Author: William A. Stein
TITLE: The Architecture I am Using for my Next Modular Forms Database
SPEAKER: William Stein
DATE: October 2010

------------------------------------

ABSTRACT: I have oodles of data, available on various web pages,
computed in files only I know how to use, and I have the potential to
generate much new data. This year, I am putting all of this data into
a massive database server, and making everything available in a
queryable form on the web. Thanks to the National Science Foundation,
I have no real worries about hardware; I can easily allocate several
terabytes of very fast disk space to this database, and have several
thousand dollars budgeted that can be used to buy extra computers and
disks for backup purposes. The last time I seriously pursued putting
together a big database like this was in 2003, and technology has come
a long way since then. Computers are 64-bit, which helps enormously
with scaling, and there are finally useful document-oriented
databases. This talk is about the technological architecture I intend
to use to put together this cutting-edge database, and I hope it will
be of interest to other people in our Focused Research Group (FRG),
who have similar goals. I am very willing to help them get going with
this technology. (This is joint work with Mike Hansen.)

--------------------------------------------------------
Part 1. Overall Architecture
--------------------------------------------------------

1. Database:
    * MongoDB master  -- disk.math.washington.edu (in Seattle)
    * MongoDB slave 1 -- in Seattle on William Stein's OS X desktop (?)
    * MongoDB slave 2 -- in Waterloo on Mike Rubinstein's OS X computer

    I will attempt to limit the database footprint for this project to
    4 terabytes, so that a single $350 USB disk plugged into any
    computer (Linux, OS X, etc.) can serve as a redundant MongoDB
    slave. No sharding will be used for my project.

    The master MongoDB server will run directly on our 24TB fileserver,
    listening only on localhost; this machine has very few user
    accounts. A user who needs direct access to the database will have
    their ssh key added to a limited account on this machine. Via
    ssh port forwarding, they will then be able to access the database,
    using a login and password that gives them access to a subset of
    the databases or collections served by MongoDB. Note that a
    single MongoDB server can simultaneously serve numerous
    completely independent databases.

2. Web Interface:
    * Flask microframework: http://flask.pocoo.org/
    * Apache: via mod_wsgi

    Mike Hansen and I are using the Flask Python library (Flask is
    from the same group that brought us Jinja, Sphinx, etc.) to
    develop a web front-end to the database that will enable
    anybody to easily make fast queries on certain collections of
    data, and easily scan through the results. We will create indexes
    in the MongoDB database that specifically support the queries that
    are available through the web interface. We will deploy our Flask
    application using Apache's mod_wsgi module, which is quick and
    scalable.

3. Programmatic Interface:

    We will set up a read-only MongoDB slave server on a separate
    machine, which will be available to sophisticated users who wish
    to make arbitrary queries against the database, or use Sage to
    grab objects out of the database. Some interesting queries (e.g.,
    map-reduce, which can run JavaScript on millions of documents) can
    take a long time and put a heavy load on the database server, but
    since this will be a separate server, such queries will have no
    impact at all on our master MongoDB server.

    MongoDB has official support for fully accessing a MongoDB server
    using any of C, C++, Java, JavaScript, Perl, PHP, Python (hence
    Sage!), and Ruby. Drivers for numerous other languages are not
    officially supported, but are listed here:
        http://www.mongodb.org/display/DOCS/Drivers
    Unfortunately, no other math software, e.g., Magma, Mathematica, Maple,
    or Matlab, is in that list.


WEB UPLOAD?

    Data upload is also done through the programmatic interface
    (item 3), but with a connection to the master server instead of a
    slave. There is definitely no web page upload for data in this
    model, and I have no interest in, or plans for, creating such a
    thing as part of this architecture, due to the security issues.
    However, if somebody else makes one, they could act as an "editor"
    and submit the results of uploads to the MongoDB database.


--------------------------------------------------------
Part 2. The Database -- MongoDB
--------------------------------------------------------

David Farmer has put forth an idea that "the basic building blocks in
the project are the individual homepages of each object of interest."
MongoDB (http://www.mongodb.org/) is a relatively new
*document-oriented* database management system, and hence quite
different from a SQL database such as SQLite, MySQL, or PostgreSQL.
MongoDB documents correspond to Farmer's idea of homepages. With
MongoDB, not only can you store and retrieve *documents*, you can also
build indexes and do elaborate optimized queries. Also, all data can
be automatically replicated on any number of backup servers.
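
To make the "document as homepage" idea concrete, here is a minimal
sketch of what one elliptic curve document might look like as a Python
dictionary. The keys level, iso_class, and number are the ones queried
by the demo code later in this talk; the remaining keys (ainvs, rank)
are illustrative assumptions, not the actual schema.

```python
import json

# One "homepage" for the elliptic curve 11a1, written as a plain dictionary.
# 'level', 'iso_class' and 'number' mirror the keys used by the demo site's
# query; 'ainvs' and 'rank' are hypothetical extra fields for illustration.
curve_11a1 = {
    'level': 11,
    'iso_class': 'a',
    'number': 1,
    'ainvs': [0, -1, 1, -10, -20],   # Weierstrass coefficients [a1,a2,a3,a4,a6]
    'rank': 0,
}

# Only JSON-like values (numbers, strings, lists, nested dictionaries) are
# used, so the document survives a round trip through JSON unchanged.
round_tripped = json.loads(json.dumps(curve_11a1))
print(round_tripped == curve_11a1)
```

Everything one wants to show on the object's web page can live in this
one record, and any of its keys can later be indexed and queried.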

I've tested using MongoDB to deal with tons of data I generated this
summer, related to modular forms, for a project with Barry Mazur. I
also tested putting all of Cremona's tables of elliptic curves and the
Stein-Watkins tables of elliptic curves in a single big MongoDB
database. It made a vast amount of data (hundreds of gigabytes) feel
"small". This "feel" is critical to a database solution for this
project, and I've never had this feeling before with any other
database I've seriously used, which includes: PostgreSQL, MySQL,
SQLite, ZODB, and custom filesystem-based stores.

How to learn about MongoDB: Go to http://www.mongodb.org/ and start
browsing. There are tons of quickstarts, tutorials, articles,
videos of talks, slides, etc. Though MongoDB is free and open source
(and written in C++), there is a company behind MongoDB, which does a
lot of proselytizing. There are also some not-quite-finished books
about MongoDB; I read them by temporarily signing up for an
O'Reilly Safari books membership (my.safaribooksonline.com), reading
them, then unsubscribing. Perhaps they have been published by now.

(NOTE: As far as I know, MongoDB's main direct competitor is the
"Apache CouchDB" project: http://couchdb.apache.org/.)

Setting up your own simple MongoDB server is easy:

1. Download binaries from http://www.mongodb.org/downloads and put
   them somewhere in your PATH. They are available for Linux,
   OS X, Windows, and Solaris.

2. Start a MongoDB server by typing "mongod".

   TECHNICAL NOTES: I usually type something more involved:

       mongod --dbpath /lvm/array/lmfdb/mongodb --bind_ip localhost --port 29000

   The dbpath option specifies where the files for the database
   are stored, and the bind_ip and port options make mongod
   accept connections only on localhost, port 29000; otherwise,
   anybody in the world could just connect to your mongodb and
   delete all your data!! If you want to run mongod on a remote
   server somewhere, but easily connect to it from your laptop
   (say), set up an ssh tunnel by simply typing:

       ssh -L 29000:localhost:29000 remote.computer.edu

   Then you can pretend that port 29000 on your laptop *is* port
   29000 on the remote server, and things will just work. It's
   also possible to create accounts with various permissions from
   the mongo console -- see the mongo documentation for details.
   However, if you set up accounts, make sure to use an ssh tunnel
   whenever using them, since mongod itself doesn't use a secure
   socket, so your password would get sent in the clear.

3. You can connect to your new MongoDB server with the mongo console:

       $ mongo localhost:29000
       MongoDB shell version: 1.6.1
       connecting to: localhost:29000/test
       > show dbs
       admin
       local
       research
       > help
               db.help()                    help on db methods
               ...

4. More importantly, you can also connect from Sage (or any Python):

   (a) If you have not already done so, install pymongo, which
       takes about 10 seconds:

       sage: !easy_install pymongo
       Searching for pymongo
       Reading http://pypi.python.org/simple/pymongo/
       Reading http://github.com/mongodb/mongo-python-driver
       Best match: pymongo 1.9
       Downloading http://pypi.python.org/packages/source/p/pymongo/pymongo-1.9.tar.gz#md5=12e12163e6cc22993808900fb9629252
       Processing pymongo-1.9.tar.gz
       Running pymongo-1.9/setup.py -q bdist_egg --dist-dir /tmp/easy_install-nIodTu/pymongo-1.9/egg-dist-tmp-gCPRfG
       warning: no files found matching '*.h' under directory 'pymongo'
       bson/time64.c:279: warning: ‘check_tm’ defined but not used
       zip_safe flag not set; analyzing archive contents...
       Adding pymongo 1.9 to easy-install.pth file

       Installed /usr/local/sage/sage-4.6.alpha1/local/lib/python2.6/site-packages/pymongo-1.9-py2.6-linux-x86_64.egg
       Processing dependencies for pymongo
       Finished processing dependencies for pymongo

       sage: quit   # important!

   (b) Now use it:

       sage: import pymongo
       sage: C = pymongo.Connection('localhost:29000')
       sage: C.database_names()
       [u'research', u'admin', u'local']
       sage: R = C.research; R
       Database(Connection('localhost', 29000), u'research')
       sage: R.[tab key]
       R.add_son_manipulator  R.create_collection  R.logout           R.remove_user
       R.add_user             R.dereference        R.name             R.reset_error_history
       R.authenticate         R.drop_collection    R.next             R.set_profiling_level
       R.collection_names     R.error              R.previous_error   R.system_js
       R.command              R.eval               R.profiling_info   R.validate_collection
       R.connection           R.last_status        R.profiling_level
       sage: R.collection_names()
       [u'mazur_irreg.done', u'system.indexes', u'mazur_irreg', u'mazur_irreg2.done',
        u'mazur_irreg2', u'mazur_irreg3.done', u'mazur_irreg3', u'mazur_irreg.f2_multiplicities',
        u'ellcurves', u'heegner_point_heights', u'shimura_curves',
        u'fs.chunks', u'fs.files']



DATABASES, COLLECTIONS, and DOCUMENTS:

    A MongoDB server serves a collection of completely *independent*
    databases. A database is a set of *collections*, and a collection
    is a set of documents. A document is basically like a
    Python dictionary, but only a limited number of datatypes are
    allowed. Technically, a document is a "BSON" document, where BSON
    is a slight binary generalization of JSON (JavaScript Object
    Notation).
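
One practical consequence of the limited datatypes: BSON integers are
at most 64 bits, whereas the integers that come up in number theory
are often much larger. Below is a minimal sketch of one possible
workaround, namely storing oversized integers as decimal strings. The
helper name to_bson_safe and the string convention are assumptions made
for illustration; nothing here is provided by pymongo, and the sketch
is written for Python 3.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1

def to_bson_safe(value):
    """Recursively convert a value so it contains only BSON-friendly data.

    Hypothetical helper: integers outside the signed 64-bit range are
    replaced by their decimal string; lists and dictionaries are walked.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if INT64_MIN <= value <= INT64_MAX:
            return value
        return str(value)
    if isinstance(value, (list, tuple)):
        return [to_bson_safe(v) for v in value]
    if isinstance(value, dict):
        return {str(k): to_bson_safe(v) for k, v in value.items()}
    return value

doc = to_bson_safe({'level': 389, 'big': 2**100, 'coeffs': (1, -2, 3**50)})
print(doc)
```

The price of this convention is that string-encoded integers no longer
sort or compare numerically inside the database, so it only suits
fields you retrieve rather than query by range.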

DOCUMENTS are limited:

    A MongoDB document must be at most 4MB in size (as of the 1.6
    series used here). Let's push the limits, to see what this means
    in practice:

        sage: foo = R.foo
        sage: foo.insert({'test':'a'*(4*10^6)})
        ObjectId('4cae369075688b3eab000006')
        sage: foo.insert({'test':'a'*(5*10^6)})
        Traceback (most recent call last):
        ...
        InvalidDocument: document too large - BSON documents are limited to 4 MB

    So you could store a string with 4 million characters, but not
    5 million; for reference, this is about 1,000 typed pages of text.
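
To see why 4 million characters fit and 5 million do not, one can
compute the encoded size by hand. The sketch below follows the BSON
layout for a document with a single string field; it assumes that
"4 MB" means 4*1024*1024 bytes and it ignores the small _id field that
insert adds automatically. The function name is made up for this
illustration.

```python
MAX_BSON_SIZE = 4 * 1024 * 1024   # assumed meaning of "4 MB"

def bson_size_single_string(key, value):
    """Size in bytes of the BSON document {key: value} for a string value."""
    key_bytes = key.encode('utf-8')
    value_bytes = value.encode('utf-8')
    element = (1                        # type byte (0x02 = string)
               + len(key_bytes) + 1     # key as a NUL-terminated cstring
               + 4                      # int32 length prefix of the string
               + len(value_bytes) + 1)  # string bytes plus trailing NUL
    return 4 + element + 1              # int32 total length + element + final NUL

for n in (4 * 10**6, 5 * 10**6):
    size = bson_size_single_string('test', 'a' * n)
    print(n, size, size <= MAX_BSON_SIZE)
```

The per-document overhead is only a handful of bytes, so in practice
the limit is a limit on the payload itself.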

HOW TO STORE BIG STUFF:

    The "fs.chunks" and "fs.files" collections above are created
    automatically in MongoDB to implement something called "GridFS" =
    "Grid Filesystem". As mentioned above, each MongoDB document must
    be at most 4MB in size. GridFS allows you to get around this
    and store gigantic data in MongoDB, but with no indexing or
    searching capabilities on the stored data. It's basically just a
    key:value store, built on top of MongoDB.

    Here is how to use it (continuing our example):

        sage: import gridfs
        sage: G = gridfs.GridFS(R)
        sage: G.put('a'*(5*10^6), filename='test1')
        ObjectId('4cae3ba075688b3eab000008')
        sage: x = G.get_last_version('test1').read()
        sage: len(x)
        5000000
        sage: x[:30]
        'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
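
How does GridFS get around the per-document limit? Roughly, it keeps
one small metadata record per file (in fs.files) and splits the data
into many small pieces (in fs.chunks). The following is a toy
in-memory model of that idea, not the gridfs module itself: the two
Python lists stand in for the two collections, the integer ids and the
256KB chunk size are assumptions made for illustration, and the
functions put and get here are hypothetical.

```python
CHUNK_SIZE = 256 * 1024   # assumed chunk size, for illustration only

def put(files, chunks, filename, data):
    """Store data as one metadata record plus a list of small chunk records."""
    file_id = len(files)
    files.append({'_id': file_id, 'filename': filename, 'length': len(data)})
    for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunks.append({'files_id': file_id, 'n': n,
                       'data': data[start:start + CHUNK_SIZE]})
    return file_id

def get(files, chunks, filename):
    """Reassemble the most recently stored file with the given name."""
    file_id = max(f['_id'] for f in files if f['filename'] == filename)
    parts = sorted((c for c in chunks if c['files_id'] == file_id),
                   key=lambda c: c['n'])
    return b''.join(c['data'] for c in parts)

files, chunks = [], []
put(files, chunks, 'test1', b'a' * (5 * 10**6))
x = get(files, chunks, 'test1')
print(len(x), len(chunks))
```

Every record in this model is far below the document size limit, which
is the whole point; the cost is that the payload is opaque to queries.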

    You can store arbitrary Sage objects using dumps and loads:

        sage: M = ModularSymbols(389,2)
        sage: G.put(dumps(M), filename='modsym389')
        ObjectId('4cae3c1a75688b3eab00001d')
        sage: loads(G.get_last_version('modsym389').read())
        Modular Symbols space of dimension 65 for Gamma_0(389) of weight 2 with sign 0 over Rational Field

    By default there is a single GridFS per database (the "fs"
    collections above), so if you have documents in all sorts of
    collections that somehow point to GridFS "files", you'll need to
    choose some systematic way of naming the files.
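
One possible "systematic way" is to derive the filename from the
collection name together with the fields that identify the document.
The convention and the helper gridfs_filename below are hypothetical,
offered only as a sketch of the idea.

```python
def gridfs_filename(collection, **key):
    """Build a deterministic GridFS filename from a collection name and key fields.

    Hypothetical convention: '<collection>/<field>=<value>/...' with the
    fields sorted by name, so the same document always maps to the same name.
    """
    parts = ['%s=%s' % (field, key[field]) for field in sorted(key)]
    return '/'.join([collection] + parts)

name = gridfs_filename('ellcurves', level=11, iso_class='a', number=1)
print(name)
```

A document in the ellcurves collection could then store this name (or
recompute it) to find its large attached data in GridFS.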



--------------------------------------------------------
Part 3. The Web Interface -- FLASK, mod_wsgi, apache
--------------------------------------------------------

FLASK: "Flask is a micro webdevelopment framework for Python."

    http://flask.pocoo.org/

Here is "hello world" written using Flask:

    from flask import Flask
    app = Flask(__name__)

    @app.route("/")
    def hello():
        return "Hello World!"

    if __name__ == "__main__":
        app.run()

To install Flask into any Python instance, just do "easy_install Flask". For example,
to make Flask work in Sage, do

    sage -sh
    easy_install Flask

and you've got it.

I don't have time in this talk to go into detail about how to use Flask in
general. The documentation at their website is excellent. In short,
you use decorators to construct the URL mapping, deal with GET and
POST requests, etc. You can also put static/ and templates/
subdirectories in your Python project, and relevant files will get
pulled from there, e.g., for static HTML and Jinja2 templates (Flask
is heavily tied to Jinja2).
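
If decorator-based URL mapping is unfamiliar, the toy below shows the
mechanism in a few lines of plain Python. It is NOT Flask and omits
everything a real framework does (requests, responses, URL variables);
the class TinyApp and its dispatch method are made up for this sketch.
Flask's real route decorator similarly accepts a methods argument.

```python
class TinyApp(object):
    """Toy stand-in for a web framework; NOT Flask, just the same idea."""

    def __init__(self):
        self.routes = {}

    def route(self, path, methods=('GET',)):
        """Return a decorator that registers a function for path and methods."""
        def decorator(func):
            for method in methods:
                self.routes[(method, path)] = func
            return func
        return decorator

    def dispatch(self, method, path):
        """Call the function registered for (method, path), if any."""
        handler = self.routes.get((method, path))
        if handler is None:
            return '404 Not Found'
        return handler()

app = TinyApp()

@app.route('/')
def hello():
    return 'Hello World!'

@app.route('/submit', methods=('POST',))
def submit():
    return 'Thanks!'

print(app.dispatch('GET', '/'))
print(app.dispatch('GET', '/submit'))
```

The decorator does nothing to the function itself; it only records it
in a table keyed by method and URL, which the framework consults when
a request arrives.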

Mike Hansen and I are building a demo Flask site that illustrates the
architecture sketched above and provides access to a large table of
over a hundred million elliptic curves, consisting of the union of the
Cremona tables and the Stein-Watkins tables. This section describes
this demo in detail. This will in fact form the core for the new
modular forms database, though I'm sure much of it will get
rewritten and polished. In the rest of this talk, I'll give a quick
demo of this site and a walk-through of the code.

The *DEMO* is here:

    http://db.modform.org/

Try it out! The rest of this talk is about the architecture of this
website, which is running on boxen.math.washington.edu, in Seattle.

APACHE/WSGI setup:

(1) We created a file

        /etc/apache2/sites-available/lmfdb

    with the contents:

        NameVirtualHost db.modform.org:80
        <VirtualHost db.modform.org:80>
            ServerName db.modform.org
            WSGIDaemonProcess lmfdb threads=5
            WSGIScriptAlias / /home/mhansen/lmfdb/lmfdb.wsgi
            <Directory /home/mhansen/lmfdb>
                WSGIProcessGroup lmfdb
                WSGIApplicationGroup %{GLOBAL}
                Order deny,allow
                Allow from all
            </Directory>
        </VirtualHost>

    and made a symbolic link in sites-enabled that points to it:

        /etc/apache2/sites-enabled/lmfdb --> /etc/apache2/sites-available/lmfdb

(2) The WSGI application is defined by this file:

        http://sage.math.washington.edu/home/mhansen/lmfdb/lmfdb.wsgi

    The main thing that this file has to do is define some object called "application"
    which will obey the WSGI protocol. There are a few other things in there to let
    it know about the virtual environment where Mike has Flask, etc. installed. Here
    are the contents of the file:

        import os, sys
        sys.path.append('/home/mhansen/lmfdb')
        os.environ['PYTHON_EGG_CACHE'] = '/home/mhansen/lmfdb/.python-eggs'

        activate_this = '/home/mhansen/lmfdb/env/bin/activate_this.py'
        execfile(activate_this, dict(__file__=activate_this))

        from lmfdb import app as application

    Note that this file doesn't really make sense out of context, and it is important
    to look at the files mentioned above.
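
For readers who have not seen the WSGI protocol, "an object called
application which obeys the WSGI protocol" means a callable that takes
the request environment and a start_response function and returns an
iterable of byte strings. Here is a minimal sketch, independent of
Flask and of the file above; fake_start_response and the sample
environ are stand-ins for what a server such as mod_wsgi would supply.

```python
def application(environ, start_response):
    """Smallest useful WSGI application: always answers with plain text."""
    body = b'Hello from WSGI\n'
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]

# Exercise the callable by hand, the way a WSGI server would.
captured = {}

def fake_start_response(status, headers):
    captured['status'] = status
    captured['headers'] = dict(headers)

result = b''.join(application({'REQUEST_METHOD': 'GET', 'PATH_INFO': '/'},
                              fake_start_response))
print(captured['status'], result)
```

A Flask app object is itself such a callable, which is why the single
line "from lmfdb import app as application" is enough for mod_wsgi.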

(3) To understand the application, you need to look at the files in the following tarball:

        http://wstein.org/talks/2010-10-lmfdb/lmfdb-flack.tar.bz2

    In there, in addition to the templates you'll find lmfdb.py, which looks like this:

    ##################################################################

    from flask import Flask, url_for, render_template, request
    app = Flask(__name__)

    # Connect to the "research" database on the local mongod (port 29000).
    from pymongo import Connection
    db = Connection(port=int(29000)).research

    # Decorator: the wrapped view returns a dict containing a 'curves'
    # query; this paginates it and renders the template named after the view.
    def ellcurves_list(f):
        from functools import wraps
        from utils import LazyMongoDBPagination
        @wraps(f)
        def wrapper(**kwargs):
            kwds = f(**kwargs)
            pagination = LazyMongoDBPagination(query=kwds.pop('curves'),
                                               per_page=50,
                                               page=request.args.get('page', 1),
                                               endpoint=f.__name__,
                                               endpoint_params=kwargs)

            return render_template(f.__name__ + '.html',
                                   pagination=pagination, **kwds)
        return wrapper

    # One integer in the URL means a single conductor; two mean a range.
    @app.route('/ellcurves/conductor/<query>')
    def ellcurves(query):
        import re
        values = map(int, re.findall(r'(\d+)', query))
        if len(values) == 2:
            return ellcurves_conductor_range(*values)
        elif len(values) == 1:
            return ellcurves_of_conductor(*values)
        else:
            return render_template('invalid_query.html', query=query)

    # omitted ...

    # Look up a single curve by level, isogeny class, and number.
    @app.route('/ellcurve')
    def ellcurve():
        a = request.args
        level = int(a.get('level','11'))
        iso_class = a.get('iso_class', 'a')
        number = int(a.get('number', 1))
        cursor = db.ellcurves.find({'level':level, 'iso_class':iso_class, 'number':number})
        return render_template('ellcurve.html', count=cursor.count(True),
                               curve=cursor.next())

    @app.route('/')
    def index():
        return render_template('index.html')

    if __name__ == '__main__':
        app.run(debug=True,host='0.0.0.0', port=8765)

    ##################################################################
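
The conductor route above decides between "single conductor" and
"range" purely by counting the integers found in the URL fragment. The
standalone sketch below isolates that parsing step so its behavior is
easy to see; the function parse_conductor_query and its tuple return
values are made up for this illustration and are not part of lmfdb.py.
It uses a list comprehension instead of map so that it also runs under
Python 3.

```python
import re

def parse_conductor_query(query):
    """Classify a conductor query string the way the route above does.

    Returns ('range', lo, hi), ('single', n), or ('invalid',).
    """
    values = [int(s) for s in re.findall(r'(\d+)', query)]
    if len(values) == 2:
        return ('range',) + tuple(values)
    elif len(values) == 1:
        return ('single', values[0])
    return ('invalid',)

for q in ['389', '11-100', '11..100', 'abc', '1,2,3']:
    print(q, parse_conductor_query(q))
```

Note that any non-digit separator is accepted between the two numbers,
and that nothing here checks that the first number is the smaller one.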

SUMMARY:

This talk has laid out the architecture that I will be using for my
new web-based databases, which I recently developed jointly with Mike
Hansen. It uses the following free open source tools together in a
natural way:

    * Python: a high quality programming language
    * MongoDB: a scalable document-oriented database
    * Flask: a "micro-framework" for Python-based web apps
    * Jinja2: a general purpose templating language
    * Apache + WSGI: high performance scalable web server for Python

You can try out a demo that combines the above right now at:

    http://db.modform.org

There are other related technologies that I'm currently not planning
on using, but might use, depending on further investigation, e.g.,

    * MongoKit -- a Python module that adds a structured schema and
      validation layer on top of the pymongo driver.