% Author: William A. Stein
\documentclass{beamer}
\definecolor{dblackcolor}{rgb}{0.0,0.0,0.0}
\definecolor{dbluecolor}{rgb}{.01,.02,0.7}
\definecolor{dredcolor}{rgb}{0.6,0,0}
\definecolor{dgraycolor}{rgb}{0.30,0.3,0.30}
\newcommand{\dblue}{\color{dbluecolor}}
\newcommand{\dred}{\color{dredcolor}}
\newcommand{\dblack}{\color{dblackcolor}}

\usepackage{listings}
\lstdefinelanguage{Sage}[]{Python}
{morekeywords={True,False,sage,singular},
sensitive=true}
\lstset{frame=none,
  showtabs=False,
  showspaces=False,
  showstringspaces=False,
  commentstyle={\ttfamily\color{dredcolor}},
  keywordstyle={\ttfamily\color{dbluecolor}\bfseries},
  stringstyle ={\ttfamily\color{dgraycolor}\bfseries},
  language = Sage,
  basicstyle={\scriptsize \ttfamily},
  aboveskip=.3em,
  belowskip=.1em
}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{colorlinks=true, urlcolor=blue}
\usepackage{comment}
\usepackage{colortbl}
\usepackage{fancybox}
\usepackage{beamerarticle}
\usepackage[utf8x]{inputenc}
\mode<presentation>
{
%  \usetheme{Rochester}
%  \usetheme{Berkeley}
  \usetheme{PaloAlto}
  %\usecolortheme{crane}
%  \usecolortheme{orchid}
  \usecolortheme{whale}
%  \usecolortheme{lily}
  \setbeamercovered{transparent}
  % or whatever (possibly just delete it)
}
\usepackage{ngerman}

\title[Database Architecture]{My Next Modular Forms Database}
\date{October 2010}
\author[W. Stein]{William Stein (joint work with Mike Hansen)}

%\newcommand{\todo}[1]{[[#1]]}
\newcommand{\todo}[1]{}

\begin{document}

\begin{frame}
\titlepage
\end{frame}

\begin{frame}{Abstract}

\begin{block}{I Have Data}
  I have oodles of data on web pages and in files only I know how to
  use, and with Sage I can generate much more.  I am putting all of
  this data into a web-accessible database server.  Thanks to the NSF,
  I can allocate terabytes of disk space to this database, and I have
  money to buy extra computers for redundancy.
\end{block}

\begin{block}{Today's Technology is Better}
  The last time I tried to put together a database like this was in
  2003; technology has dramatically improved since then.  Most
  computers are 64-bit, which removes numerous annoying barriers, and
  there are good document-oriented databases.  This talk is about how
  I intend to put together this database, and it should be of interest
  to others with similar goals.
\end{block}

\end{frame}

\section{The Overall Architecture}

\begin{frame}{Database}
\begin{block}{Servers}
\begin{itemize}
\item MongoDB master -- disk.math.washington.edu (in Seattle)
\item MongoDB slave 1 -- in Seattle on William Stein's OS X desktop (?)
\item MongoDB slave 2 -- in Waterloo on Mike Rubinstein's OS X computer
\end{itemize}
\end{block}

\begin{block}{Disk Space?}
\begin{itemize}
\item Try to limit the database footprint for this project to 4
  terabytes, so that a single $\sim$\$350 USB disk plugged into any
  computer (Linux, OS X, etc.) can serve as a redundant MongoDB
  slave.
\item But if things get too big, I'll use ``sharding'' to split the
  data across several servers.
\end{itemize}
\end{block}
\end{frame}
104
105
\begin{frame}{Security and Users}
106
\begin{itemize}
107
\item
108
The {\em master MongoDB server} will run directly on our big Ubuntu Linux
109
fileserver, listening only on localhost.
110
111
\item A user who needs direct {\em write} access to the database will have their ssh
112
key added to a limited account on this machine, and via ssh port
113
forwarding, they will be able to access the database, using a login
114
and password that gives them access to a subset of the databases or
115
collections served by MongoDB.
116
117
\item A single MongoDB server can {\em simultaneously} serve numerous
118
completely independent databases, and independent requests from
119
different users.
120
\end{itemize}
121
\end{frame}
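
\begin{frame}[fragile]{Sketch: Addressing the Tunnelled Server}
A minimal sketch, not the project's actual code: once an ssh port
forward is in place, a client addresses the master as if it were local.
The helper {\tt tunnel\_uri}, the credentials, and the default port
29000 are illustrative assumptions; only the {\tt mongodb://} URI
syntax is standard.
\begin{block}{}\begin{lstlisting}
from urllib.parse import quote_plus

def tunnel_uri(user, password, database, port=29000):
    """Build a MongoDB URI for a server reached via a local ssh tunnel."""
    # Percent-encode credentials so characters like '@' or ':' are safe.
    return "mongodb://%s:%s@localhost:%d/%s" % (
        quote_plus(user), quote_plus(password), port, database)

print(tunnel_uri("alice", "p@ss:word", "research"))
\end{lstlisting}\end{block}
The resulting string can be handed to a MongoDB driver as its
connection URI.
\end{frame}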

\begin{frame}{Web Interface}
\begin{block}{Software}
\begin{itemize}
\item Flask microframework: \url{http://flask.pocoo.org/}
\item Apache: via {\tt mod\_wsgi}
\end{itemize}
\end{block}

\begin{itemize}
\item Use the Flask Python library (Flask is
  from the same group that brought us Jinja, Sphinx, etc.) to
  develop a web front end to the database.
\item The web page will enable anybody to easily make fast queries.
\item We will create indexes in the MongoDB database that optimize
  the queries available through the web interface.
\item We will deploy our Flask application using Apache's {\tt
    mod\_wsgi} module, which scales well.
\end{itemize}
\end{frame}
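
\begin{frame}[fragile]{Sketch: An Index for a Web Query}
A hedged sketch of the index step, assuming the web interface filters
curves on a rank field {\tt r} and sorts on {\tt level} (the field
names used in the demo code in this talk).  The function name
{\tt ensure\_web\_indexes} is hypothetical; {\tt create\_index} with a
list of (field, direction) pairs is the pymongo call.
\begin{block}{}\begin{lstlisting}
def ensure_web_indexes(collection):
    """Create the index that the rank listing relies on."""
    # Compound index: filter on rank 'r', then sort by 'level'.
    # 1 means ascending order (the value of pymongo.ASCENDING).
    return collection.create_index([("r", 1), ("level", 1)])
\end{lstlisting}\end{block}
It would be called once, e.g.\ as {\tt ensure\_web\_indexes(db.ellcurves)},
before the site goes live.
\end{frame}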


\begin{frame}{Use MongoDB from C, C++, JavaScript, Python (Sage), Perl, etc.}

\begin{itemize}
\item We will run several {\em read-only MongoDB slave servers}.  These
  are good for arbitrary queries against the database.  Some queries
  involve server-side JavaScript run on millions of documents, and can
  take a long time and put a heavy load on the database server.

\item MongoDB officially supports accessing a MongoDB server from any
  of C, C++, Java, JavaScript, Perl, PHP, Python (hence Sage!), and
  Ruby.  Drivers for numerous other languages are not officially
  supported, but are listed here:
  \url{http://www.mongodb.org/display/DOCS/Drivers}\\No math software
  besides Sage, e.g., none of Magma, Mathematica, Maple, or Matlab, is
  in that list.
\end{itemize}
\end{frame}
161
162
\begin{frame}{Web Upload?}
163
164
\begin{itemize}
165
\item Data upload is done by connecting to the {\em master} database
166
via a programming language.
167
\item There is {\em no web page upload for data} as part of my planned
168
architecture, due to security issues and time constraints.
169
However, if somebody else makes a web upload system, they could act
170
as an ``editor'' and submit the results of uploads to my MongoDB
171
database.
172
\end{itemize}
173
174
\end{frame}
175
176
\section{The Database -- MongoDB}
177
178
\begin{frame}{MongoDB: a Documented Oriented Database}
179
At the Paris meeting, David Farmer has put forth an idea that ``the
180
basic building blocks in the project are the individual homepages of
181
each object of interest.''
182
183
MongoDB (\url{http://www.mongodb.org/})
184
\begin{enumerate}
185
\item is a new free open source {\em documented oriented} database
186
management system, written in C++.
187
\item is much different than a SQL database such as SQLite, MySQL,
188
or PostgreSQL.
189
\item data model corresponds to Farmer's idea of
190
homepages.
191
\item easily builds indexes and does elaborate
192
optimized queries.
193
\item automatically {\em replicates}
194
to any number of backup servers.
195
\end{enumerate}
196
\end{frame}
197
198
\begin{frame}{MongoDB makes your data ``feel smaller''}
199
200
\begin{itemize}
201
\item This summer I tested using MongoDB to deal with masses of data I
202
generated related to modular forms for a research project with
203
Barry Mazur.
204
\item I also tested putting all of the Cremona and Stein-Watkins
205
tables of elliptic curves in a single big MongoDB database.
206
\item It made a vast amount of data (hundreds of gigabytes) feel
207
``small''.
208
\end{itemize}
209
210
I have {\em never} had this feeling before with huge number theory
211
tables using any other database, including
212
PostgreSQL, MySQL, sqlite, ZODB, and custom filesystem based stores.
213
214
\end{frame}
215
216
\begin{frame}{How to Learn MongoDB}
217
\begin{block}{}
218
Go to \url{http://www.mongodb.org/} and start browsing.
219
\end{block}
220
\begin{itemize}
221
\item Tons of quickstarts, tutorials, articles, and videos of
222
talks, slides, etc.
223
\item A Company is behind MongoDB; but don't worry, MongoDB is free and open source
224
\item Some not-quite-finished {\em books} about MongoDB; I read them
225
by temporarily signing up for an O'Reilly Safari books membership
226
(\url{http://my.safaribooksonline.com}), reading them, then
227
unsubscribing.
228
\end{itemize}
229
230
\end{frame}
231
232
\begin{frame}{Setting up a MongoDB Server}
233
\begin{block}{Step by step (more details below)}
234
\begin{enumerate}
235
\item Binaries (Linux, OS X, Windows, Solaris) from \url{http://www.mongodb.org/downloads}.
236
\item Start a MongoDB server running by typing {\tt mongod}.
237
\item Connect by typing {\tt mongo} in another window.
238
\item Connect from Python (or Sage) using {\tt pymongo}.
239
\end{enumerate}
240
\end{block}
241
\end{frame}
242
243
\begin{frame}[fragile]{Starting a MongoDB server}
244
I run my mongod server by typing:
245
246
\begin{block}{}\begin{lstlisting}
247
mongod --dbpath /lvm/array/lmfdb/mongodb \
248
--bind_ip localhost --port 29000
249
\end{lstlisting}\end{block}
250
251
The dbpath option specifies where the files for the database are
252
stored and the bind\_ip and port options makes it so mongod accepts
253
connections on localhost port 29000; otherwise, anybody in the world
254
could just connect to your mongodb and delete all your data!! If you
255
want to run mongod on a remote server somewhere, but easily connect to
256
it from your laptop (say), setup an ssh tunnel by simply typing:
257
\begin{block}{}
258
\begin{lstlisting}
259
ssh -L 29000:localhost:29000 remote.computer.edu
260
\end{lstlisting}
261
\end{block}
262
%Then you can pretend that port 29000 on your laptop {\em is} port
263
%29000 on the remote server, and things will just work.
264
It's also
265
possible to create accounts with various permissions from the mongo
266
console.
267
% (see the mongo documentation for details). However, if you
268
%setup accounts make sure to use an ssh tunnel whenever using them,
269
%since mongod itself doesn't use a secure socket, so your password
270
%would get sent in the clear.
271
\end{frame}

\begin{frame}[fragile]{Connecting to MongoDB via the console}

\begin{block}{Connect to your new MongoDB server with the Mongo console}\begin{lstlisting}
wstein@disk$ mongo localhost:29000
MongoDB shell version: 1.6.1
connecting to: localhost:29000/test
> show dbs
admin
local
research
> help
    db.help()                    help on db methods
    ...
\end{lstlisting}\end{block}

\end{frame}

\begin{frame}[fragile]{Connecting to MongoDB from Sage (Python)}
\begin{block}{Install PyMongo (about 10 seconds)}\begin{lstlisting}
sage: !easy_install pymongo
...
sage: quit    # important!
\end{lstlisting}\end{block}

\begin{block}{Use it}\begin{lstlisting}
sage: import pymongo
sage: C = pymongo.Connection('localhost:29000')
sage: C.database_names()
[u'research', u'admin', u'local']
sage: R = C.research; R
Database(Connection('localhost', 29000), u'research')
sage: R.[tab key] ...
sage: R.collection_names()
[u'mazur_irreg.done', ..., u'fs.chunks', u'fs.files']
\end{lstlisting}\end{block}
\end{frame}

\begin{frame}{MongoDB's structure: Databases, Collections and Documents}

\begin{itemize}
\item A MongoDB server serves any number of {\em independent}
  databases.
\item A {\em database} is a set of collections, and a {\em collection}
  is a set of documents.
\item A {\em document} is like a
  Python dictionary, but only a limited number of datatypes are
  allowed.
\item Technically, a document is a ``BSON'' document, where BSON
  is a binary format very similar to JSON.
\end{itemize}

\end{frame}
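
\begin{frame}[fragile]{Sketch: A Document as a Python Dictionary}
To make ``a document is like a Python dictionary'' concrete, here is a
small sketch.  The helper {\tt to\_document} and the field names
{\tt label} and {\tt ainvs} are assumptions made for illustration;
the curve data is for the elliptic curve 389a.
\begin{block}{}\begin{lstlisting}
import json

def to_document(label, conductor, rank, ainvs):
    """Return a plain dict that could be stored as a MongoDB document."""
    return {
        "label": label,           # string
        "level": int(conductor),  # small enough for a 64-bit integer
        "r": int(rank),
        # Coefficients can exceed 64 bits, so store them as strings.
        "ainvs": [str(a) for a in ainvs],
    }

doc = to_document("389a", 389, 2, [0, 1, 1, -2, 0])
# Every value here is also valid JSON, which BSON closely mirrors.
print(json.dumps(doc, sort_keys=True))
\end{lstlisting}\end{block}
Storing big integers as strings is one design choice among several; it
keeps exact values at the cost of numeric comparisons in queries.
\end{frame}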

\begin{frame}[fragile]{MongoDB documents are limited}
A MongoDB document must be at most 4 MB in size.  Let's push
the limits, to see what this means in practice:
\begin{block}{Use it}\begin{lstlisting}
sage: foo = R.foo
sage: foo.insert({'test':'a'*(4*10^6)})
ObjectId('4cae369075688b3eab000006')
sage: foo.insert({'test':'a'*(5*10^6)})
Traceback (most recent call last):
...
InvalidDocument: document too large - BSON documents are limited to 4 MB
\end{lstlisting}\end{block}

So you can store a string with 4 million characters, but not 5
million; for reference, 4 million characters is about {\em 1,000 typed
pages of text}.
\end{frame}
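
\begin{frame}[fragile]{Sketch: Checking the Size Before Inserting}
A rough pre-check one might run before an insert, as a sketch only.
It assumes the 4 MB limit means $4 \cdot 1024^2$ bytes and reserves a
guessed {\tt overhead} for field names and BSON bookkeeping; both
numbers, and the name {\tt fits\_in\_document}, are assumptions.
\begin{block}{}\begin{lstlisting}
MAX_DOC_BYTES = 4 * 1024 * 1024  # assumed meaning of "4 MB"

def fits_in_document(text, overhead=1024):
    """Estimate whether a string field fits in one MongoDB document."""
    size = len(text.encode("utf-8"))  # BSON stores strings as UTF-8
    # Leave room for field names, the _id and other bookkeeping.
    return size + overhead <= MAX_DOC_BYTES

print(fits_in_document("a" * (4 * 10**6)))
print(fits_in_document("a" * (5 * 10**6)))
\end{lstlisting}\end{block}
The two calls mirror the two inserts in the session shown earlier in
the talk.
\end{frame}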

\begin{frame}[fragile]{How to Store Huge Stuff: GridFS}
\begin{itemize}
\item Recall: MongoDB documents can be at most 4 MB in size!
\item GridFS gets around this: it stores gigantic data in
  MongoDB.
\item It offers no indexing or searching capabilities.
\item GridFS is just a key:value store, built on top of MongoDB.
\end{itemize}

\begin{block}{Using GridFS}\begin{lstlisting}
sage: import gridfs
sage: G = gridfs.GridFS(R)     # we defined the db R above
sage: G.put('a'*(5*10^6), filename='test1')
ObjectId('4cae3ba075688b3eab000008')
sage: x = G.get_last_version('test1').read(); len(x)
5000000
sage: x[:30]
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
\end{lstlisting}\end{block}
\end{frame}
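
\begin{frame}[fragile]{Sketch: How Chunked Storage Works}
GridFS keeps its data in the {\tt fs.files} and {\tt fs.chunks}
collections seen earlier, cutting each file into pieces that fit under
the document limit.  The toy below imitates that idea in plain Python;
it is not the GridFS implementation, and the 256 KB chunk size is an
assumed value.
\begin{block}{}\begin{lstlisting}
def split_chunks(data, chunk_size=256 * 1024):
    """Split bytes into numbered chunks, roughly as GridFS does."""
    return [{"n": i // chunk_size, "data": data[i:i + chunk_size]}
            for i in range(0, len(data), chunk_size)]

def join_chunks(chunks):
    """Reassemble the original bytes from chunks, ordered by n."""
    ordered = sorted(chunks, key=lambda c: c["n"])
    return b"".join(c["data"] for c in ordered)

payload = b"a" * (5 * 10**6)
chunks = split_chunks(payload)
print(len(chunks), max(len(c["data"]) for c in chunks))
assert join_chunks(chunks) == payload  # round trip is lossless
\end{lstlisting}\end{block}
Each chunk is small enough to be an ordinary document, which is why
the 5 million character string can be stored this way.
\end{frame}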

\begin{frame}[fragile]{Using GridFS to Store Pickles}

You can store arbitrary Sage objects using dumps and loads, which are
wrappers around Python's pickle module:

\begin{block}{Store a mathematical object}\begin{lstlisting}
sage: M = ModularSymbols(389, 2)
sage: G.put(dumps(M), filename='modsym389')
ObjectId('4cae3c1a75688b3eab00001d')
sage: loads(G.get_last_version('modsym389').read())
Modular Symbols space of dimension 65 for Gamma_0(389)
of weight 2 with sign 0 over Rational Field
\end{lstlisting}\end{block}

You get one GridFS per database, so if you have documents in all sorts
of collections that somehow point to GridFS ``files'', you'll need to
choose some systematic way of naming the files.

\end{frame}
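
\begin{frame}[fragile]{Sketch: One Possible Naming Scheme}
One way to name GridFS files systematically is to encode the owning
collection, the document key and the field in the filename.  The
scheme and the helper {\tt gridfs\_name} are hypothetical, and plain
{\tt pickle} stands in for Sage's dumps and loads so that the sketch
runs without Sage.
\begin{block}{}\begin{lstlisting}
import pickle

def gridfs_name(collection, key, field):
    """Build a filename recording which document owns the file."""
    return "%s/%s/%s" % (collection, key, field)

obj = {"level": 389, "weight": 2, "dimension": 65}
blob = pickle.dumps(obj, protocol=2)   # bytes one could pass to G.put
name = gridfs_name("modsym", "389-2", "space")
print(name)
assert pickle.loads(blob) == obj       # round trip, as with dumps/loads
\end{lstlisting}\end{block}
A document in the {\tt modsym} collection could then store {\tt name}
as its pointer to the pickled object.
\end{frame}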



\section{The Web Interface -- Flask, mod\_wsgi, Apache}

\begin{frame}[fragile]{Flask: a web development ``micro-framework''}

\begin{block}{FLASK}
\url{http://flask.pocoo.org/}
\end{block}

\begin{block}{``Hello world'' written using Flask}\begin{lstlisting}
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
\end{lstlisting}\end{block}

\begin{block}{Put the above in a file hello.py and...}\begin{lstlisting}
$ easy_install Flask   # 30 seconds?
$ python hello.py
\end{lstlisting}\end{block}
\end{frame}

\begin{frame}{Using Flask}
\begin{itemize}
\item I don't have time in this talk to go into detail about how to use Flask in
  general.
\item The documentation at \url{http://flask.pocoo.org/docs/} is excellent.
\item You use decorators to construct the URL mapping, deal with GET and POST requests, etc.
\item You can also put static/ and templates/ subdirectories in your Python project, and
  Flask will pick up the relevant files from them.
\item You need to learn the Jinja2 templating engine: \url{http://jinja.pocoo.org/2/}.
\end{itemize}

\end{frame}
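
\begin{frame}[fragile]{Sketch: How a Routing Decorator Can Work}
To show what ``decorators construct the URL mapping'' means, here is a
toy router in plain Python.  It is not Flask's implementation; the
names {\tt routes}, {\tt route} and {\tt dispatch} are made up for
this sketch.
\begin{block}{}\begin{lstlisting}
routes = {}  # maps (method, path) to a view function

def route(path, methods=("GET",)):
    """Register the decorated function as the handler for path."""
    def decorator(view):
        for method in methods:
            routes[(method, path)] = view
        return view
    return decorator

@route("/")
def hello():
    return "Hello World!"

@route("/submit", methods=("GET", "POST"))
def submit():
    return "submitted"

def dispatch(method, path):
    """Call the registered view, or report a 404."""
    view = routes.get((method, path))
    return view() if view else "404 Not Found"

print(dispatch("GET", "/"))
print(dispatch("POST", "/"))
\end{lstlisting}\end{block}
\end{frame}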

\section{Demo Site}
\begin{frame}[fragile]{Demo Site}
\begin{block}{The DEMO is here}
\url{http://db.modform.org/}
\end{block}
\begin{itemize}
\item Mike Hansen and I built a demo site.
\item Illustrates the architecture sketched above by providing access to a large table of
  over a hundred million elliptic curves (Cremona plus Stein-Watkins).
\item Will form the core for the new modular forms database.
\end{itemize}
{\em Try it out!}  It's running on boxen.math.washington.edu, in Seattle.
\end{frame}


\begin{frame}[fragile]{Apache Setup}
\begin{block}{/etc/apache2/sites-available/lmfdb}
\begin{lstlisting}
NameVirtualHost db.modform.org:80
<VirtualHost db.modform.org:80>
    ServerName db.modform.org
    WSGIDaemonProcess lmfdb threads=5
    WSGIScriptAlias / /home/mhansen/lmfdb/lmfdb.wsgi
    <Directory /home/mhansen/lmfdb>
        WSGIProcessGroup lmfdb
        WSGIApplicationGroup %{GLOBAL}
        Order deny,allow
        Allow from all
    </Directory>
</VirtualHost>
\end{lstlisting}
\end{block}
And a symbolic link in sites-enabled that points to this file:
\begin{block}{}
\begin{lstlisting}
ln -s /etc/apache2/sites-available/lmfdb \
      /etc/apache2/sites-enabled/lmfdb
\end{lstlisting}
\end{block}
\end{frame}

\begin{frame}[fragile]{WSGI Setup}
\begin{block}{The WSGI application is defined by this file:}
\url{http://sage.math.washington.edu/home/mhansen/lmfdb/lmfdb.wsgi}
\end{block}

The main thing that this file has to do is define an object called
{\tt application} that obeys the WSGI protocol.  There are a few
other things in there to let it know about the environment.  Here are
the contents:
\begin{block}{}\begin{lstlisting}
import os, sys
# Make the lmfdb module importable.
sys.path.append('/home/mhansen/lmfdb')
# Give setuptools a writable directory for unpacked eggs.
os.environ['PYTHON_EGG_CACHE'] = '/home/mhansen/lmfdb/.python-eggs'
# Activate the project's virtualenv (execfile is a Python 2 builtin).
activate_this = '/home/mhansen/lmfdb/env/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))
# mod_wsgi looks for a module-level object named "application".
from lmfdb import app as application
\end{lstlisting}\end{block}
(It is important to look at the files mentioned above.)
\end{frame}
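
\begin{frame}[fragile]{Sketch: What ``Obeying the WSGI Protocol'' Means}
A Flask app already provides the WSGI interface, so the following is
only an illustration of the protocol: a callable taking the request
environment and a {\tt start\_response} function, and returning an
iterable of byte strings.  The response text is made up for this
sketch.
\begin{block}{}\begin{lstlisting}
def application(environ, start_response):
    """A minimal WSGI application: one callable, two arguments."""
    path = environ.get("PATH_INFO", "/")
    body = ("You requested %s\n" % path).encode("utf-8")
    headers = [("Content-Type", "text/plain; charset=utf-8"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]  # an iterable of byte strings

# Call it by hand, the way a WSGI server would.
calls = []
result = application({"PATH_INFO": "/ellcurves/"},
                     lambda status, headers: calls.append(status))
print(calls[0], result[0])
\end{lstlisting}\end{block}
\end{frame}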


\begin{frame}[fragile]{Python/Flask Code}

Look at the files in
\begin{block}{}
\small
\url{http://wstein.org/talks/2010-10-lmfdb/demo.tar.bz2}
\end{block}

In addition to the templates, there's a file lmfdb.py:
\begin{block}{}\begin{lstlisting}
from flask import Flask, url_for, render_template, request
app = Flask(__name__)
from pymongo import Connection
db = Connection(port=29000).research
...
@app.route('/ellcurves/rank/<int:rank>/')
@ellcurves_list
def ellcurves_of_rank(rank):
    curves = db.ellcurves.find({'r':rank}).sort('level')
    return locals()
...
\end{lstlisting}\end{block}
This file defines what happens when a URL
is accessed.
\end{frame}
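
\begin{frame}[fragile]{Sketch: A View That Returns {\tt locals()}}
The definition of {\tt ellcurves\_list} is not shown in this talk.
One plausible pattern, purely an assumption, is a decorator that takes
the dictionary returned by the view and passes it to a template.  The
names {\tt templated} and {\tt fake\_render} and the template name are
hypothetical; {\tt fake\_render} stands in for Flask's
{\tt render\_template} so the sketch runs without Flask.
\begin{block}{}\begin{lstlisting}
import functools

def templated(render, template_name):
    """Wrap a view that returns a dict so it returns rendered text."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(*args, **kwargs):
            context = view(*args, **kwargs)  # e.g. the view's locals()
            return render(template_name, **context)
        return wrapper
    return decorator

def fake_render(name, **context):
    # Stand-in renderer: report the template and the variable names.
    return "%s: %s" % (name, sorted(context))

@templated(fake_render, "ellcurves.html")
def ellcurves_of_rank(rank):
    curves = ["curve1", "curve2"]  # placeholder for a database cursor
    return locals()

print(ellcurves_of_rank(2))
\end{lstlisting}\end{block}
\end{frame}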
512
513
\section{Summary}
514
515
\begin{frame}{Summary}
516
This talk has laid out the architecture that I will be using for my
517
new web-based databases. It uses the following free open source
518
tools together in a natural way:
519
\begin{itemize}
520
\item {\bf Python}: a high quality programming language
521
\item {\bf MongoDB}: a scalable documented-oriented database
522
\item {\bf Flask}: a "micro-framework" for Python-based web apps
523
\item {\bf Jinja2}: a general purpose templating language
524
\item {\bf Apache + WSGI}: scalable web server
525
\end{itemize}
526
527
You can try out a demo that combines the above right now at:
528
\url{http://db.modform.org}
529
530
\end{frame}
531
532
533
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: