Connexions Production Configuration Changes

Plone Hotfix Thu Jun 2 13:26:10 CDT 2011

Bugfix rollout Wed Jun 1 12:52:35 CDT 2011

  • Fix devel-code accidentally rolled out with previous rollout: large files in modules broken
  • PloneHotfix20110531-1.0

Cataloging Fixes lands Thu May 26 21:15:03 CDT 2011

munin page load timers Thu Apr 28 12:39:11 CDT 2011

  • point tachi munin page loaders to haproxy, instead of varnish, so they don't blast the cache

So, you really want it fresh? Tough! Fri Apr 22 14:51:03 CDT 2011

  • change varnish config to ignore client no-cache (cache invalidation req)

Shields to repulsar mode! (frontend redirects) Fri Apr 22 00:36:52 CDT 2011

  • change iptables rules from blocks to port redirects
  • implement/deploy microserver redirect that 301 redirects to cnx.org
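
The redirect microserver described above could look roughly like the following. This is a minimal sketch, not the deployed code: the handler class, the `redirect_target` and `make_server` helpers, and the plain `http://` scheme are assumptions for illustration. Only the 301 status and the cnx.org target come from the entry.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL_HOST = "cnx.org"  # redirect target named in the entry above


def redirect_target(path):
    """Build the canonical URL for a requested path (hypothetical helper)."""
    if not path.startswith("/"):
        path = "/" + path
    return "http://" + CANONICAL_HOST + path


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET or HEAD with a permanent redirect to the canonical host."""

    def _redirect(self):
        self.send_response(301)
        self.send_header("Location", redirect_target(self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    # Both methods share the same behavior.
    do_GET = _redirect
    do_HEAD = _redirect


def make_server(port):
    """Create (but do not start) a redirect server on the given port."""
    return HTTPServer(("", port), RedirectHandler)
```

Calling `make_server(8080).serve_forever()` would start it; the port is an assumption, chosen as the kind of local port an iptables redirect rule might point at.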

Caching milestone lands Thu Apr 14 14:32:45 CDT 2011

Social Links release (twitbook) Tue Apr 5 15:23:00 CDT 2011

Sprint bug fix code release Fri Feb 18 11:23:00 CST 2011

Authenticated load times Tue Feb 15 13:30:00 CST 2011

  • seems test user Crouton got his password changed some time ago

Shields up! (iptables) Wed Jan 19 11:14:04 CST 2011

  • block non-Rice IPs from direct access to FEs (using iptables)
  • block non FE from access to postgresql on ballpoint (already blocked ZEO)

Frontend limits Tue Jan 18 14:43:00 CST 2011

  • limit haproxy to 2 connections per FE: this matches FE config
  • limit total simultaneous connections to 30 (2 * 15 FE) (haproxy will queue new requests)
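
The arithmetic behind these limits can be sketched as below. The function names are hypothetical, and the model is deliberately simple: it assumes every request beyond the total limit waits in the haproxy queue and ignores queue timeouts.

```python
def frontend_capacity(num_frontends, maxconn_per_frontend):
    """Total requests that can be in flight across all frontends."""
    return num_frontends * maxconn_per_frontend


def split_load(incoming, num_frontends=15, maxconn_per_frontend=2):
    """Return (active, queued) for a burst of simultaneous requests.

    Defaults mirror the entry above: 15 FEs at 2 connections each.
    """
    capacity = frontend_capacity(num_frontends, maxconn_per_frontend)
    active = min(incoming, capacity)
    return active, incoming - active
```

With the defaults, a burst of 45 simultaneous requests would leave 30 being served and 15 queued.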

Rhaptos print download host Thu Jan 13 17:06:24 CST 2011

  • change download host to cnx.org, to potentially use the cache, and avoid overloading the local FE w/ too many threads

Varnish - enable client refresh Tue Jan 11 12:55:19 CST 2011

  • need to re-enable client-refresh for anyone, to allow force-refresh for authors dealing with changing images in content

The Cache is Dead! Long Live the Cache! Fri Jan 7 15:40:44 CST 2011

  • switched DNS for cnx.org from butter to bowie - will leave both active over the weekend, and manually do hitcount stats monday

Varnish - client refresh Tue Jan 4 09:13:47 CST 2011

  • enable client refresh for 'magic header' - use for munin performance monitoring

Performance Tweaks 1 Wed Dec 8 09:11:00 CST 2010

  • mycnx performance update, based on performance design and tests

See: ConnexionsReleaseDetails#PerformanceTweaks1December082010

Varnish again Fri Nov 19 10:43:56 CST 2010

  • back to varnish

Varnish test: stop Thu Nov 18 17:12:15 CST 2010

  • revert back to squid for overnight

Varnish test: running config change Thu Nov 18 15:33:23 CST 2010

  • edited varnish.vcl - set randomContent to pass (no cache)

Varnish test Thu Nov 18 13:17:00 CST 2010

  • testing varnish instead of squid
  • config w/ 10GB disk cache

haproxy/squid update Wed Nov 10 13:34:00 CST 2010

  • Add a hard limit of 100 connections to squid
    • haproxy w/ leastconn is da bomb -- except it's a bit more heavy load fragile

haproxy test Fri Nov 5 09:23:58 CDT 2010

  • testing w/ leastconn balancing
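
As a rough illustration of what leastconn balancing does, the sketch below sends each new request to the backend with the fewest active connections. It is a simplification, not haproxy's implementation: it ignores server weights and haproxy's own tie-breaking, and the connection counts in the usage note are made up.

```python
def pick_leastconn(active_connections):
    """Return the backend with the fewest active connections.

    Ties go to the first such backend in dictionary order.
    """
    return min(active_connections, key=active_connections.get)


def dispatch(active_connections):
    """Assign one new request and record it against the chosen backend."""
    backend = pick_leastconn(active_connections)
    active_connections[backend] += 1
    return backend
```

For example, with `{"claymore": 2, "gladius": 0, "quill": 1}`, the first request goes to gladius. A slow backend naturally accumulates connections and so receives fewer new ones, which is the appeal over plain round-robin.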

Collection Composer rollout Wed Nov 3 16:26:00 CDT 2010

  • Collection Composer (a.k.a. ajax collection editor, ace)
  • various bug fixes

See: ConnexionsReleaseDetails#CollectionComposerReleaseNovember032010

Sword Oops rollout Tue Sep 28 13:47:00 CDT 2010

  • missed a version number bump.

See: ConnexionsReleaseDetails#SwordOopsReleaseSeptember282010

Squid config change: basic auth Tue Sep 28 11:16:28 CDT 2010

  • enable passing of basic auth to the actual frontends (squid was eating it)

Squid config change Mon Sep 27 11:30:17 CDT 2010

  • throttled a pair of fast-crawling bots

Sword rollout Fri Sep 24 17:32:10 CDT 2010

  • Sword importer rolled out
  • epub and offline zip hotfixes rolled out

See: ConnexionsReleaseDetails#SwordreleaseSeptember242010

Squid url_rewriter Wed Sep 22 16:59:26 CDT 2010

  • bump # of url_rewriter procs from 20 to 30 (squid reporting running out occasionally)

Squid ICP again Fri Sep 17 14:26:14 CDT 2010

  • put squid ICP based peer selection back in

Tweak the Squid Fri Sep 17 09:47:51 CDT 2010

  • round-robin on squid

Back to Squid Thu Sep 16 15:50:25 CDT 2010

  • back to direct icp based peer selection via squid

HAProxy again Thu Sep 16 12:24:25 CDT 2010

  • try a longer test w/ pure roundrobin selection

reconfigure squid Wed Sep 15 23:37:53 CDT 2010

  • using squid for direct access again - HAProxy out of the loop

HAProxy config change Wed Sep 15 22:03:00 CDT 2010

  • change to roundrobin selection

tachi config Wed Sep 15 17:18:21 CDT 2010

  • nscd had died on tachi. (ldap outage?) Restart it (Name Services Caching Daemon: affects every uid,gid, and hostname lookup)

HAProxy config change Wed Sep 15 17:02:39 CDT 2010

  • remove glad1 backend

HAProxy config change Wed Sep 15 16:52:55 CDT 2010

  • turn off cookie-based backend selection - try URI hash

HAProxy again Wed Sep 15 15:58:02 CDT 2010

  • attempting HAProxy config, behind squid

added passthru.cnx.org Wed Sep 8 14:12:00 CDT 2010

  • added a CNAME for cache1 (butter) of passthru.cnx.org, that is not redirected by default, but is cached and load-balanced for Roché's mobile work.

robots.txt update Wed Sep 8 12:29:00 CDT 2010

  • added additional large downloads (epub, offline, zip) to denied lists

added mobileproxy.cnx.org Fri Sep 3 09:07:52 CDT 2010

  • added a CNAME for frontend1 (claymore) of mobileproxy.cnx.org, that is not redirected by default for Roché's work.

Increase queue timeout Thu Aug 12 13:00:00 CDT 2010

  • bump timeout for Lineup queue_tool to 30 min (1800 sec); should help w/ epub false timeouts

Epub release Wed Aug 11 12:33:00 CDT 2010

See: ConnexionsReleaseDetails#EpubReleaseAugust112010

Customized skin prefs_users_overview Wed Jun 30 12:01:00 CDT 2010

  • customized the page template prefs_users_overview to allow the plone user management screens to work

Data_pdf swap Wed May 19 12:10:00 CDT 2010

  • swap in the now-packed Data_pdf.fs (32GB vs. 183GB); required bouncing FEs

Turn off HAProxy Thu May 13 17:19:23 CDT 2010

  • reset squid to go direct, after having messed with various HAProxy params for a bit

Turn on HAProxy Thu May 13 15:36:57 CDT 2010

  • have squid go through haproxy

HAProxy Thu May 13 15:10:30 CDT 2010

  • add an haproxy instance on butter, port 8888 talking to all FEs - squid does not yet talk to it

fix checkinterval config Mon Apr 26 13:44:00 CDT 2010

  • both quill and fountain had only 3 FEs configured (though 6 running), so 3 were check-interval 100, and could not be restarted for memory growth. Fixed.

python checkinterval config Thu Apr 22 10:57:00 CDT 2010

  • set check-interval to 1300 for claymore & gladius FEs, 1800 for quill and fountain FEs

cachefu is off Thu Apr 22 9:40:00 CDT 2010

  • Noticed that cachefu was turned off sometime ago: looks like around Mon Mar 29 16:30:00 CDT 2010

gladius dual FE Mon Apr 12 00:50:18 CDT 2010

  • only two FEs on gladius (drop 3rd)

claymore single FE max mem test Fri Apr 9 14:34:00 CDT 2010

  • move max-mem restart for FE up, so we actually get to see if it plateaus

claymore single FE max mem test Tue Apr 6 12:20:00 CDT 2010

  • same trick, but claymore (seems plateau is > 8GB)
  • move gladius back to 3 FEs

gladius single FE max mem test Thu Apr 1 09:44:00 CDT 2010

  • drop to one frontend on gladius, make all RAM available (see if we can get a mem plateau)

cachefu turned off Mon Mar 29 16:30:00 CDT 2010

  • based on looking at squid logs, cachefu got turned off at this time

gladius wipe/reinstall completed Tue Mar 30 9:24:04 CDT 2010

  • bring Zope FEs up
  • complete nagios/munin config

gladius wipe/reinstall started Mon Mar 29 16:00:00 CDT 2010

  • a complete system wipe/reinstall of gladius to 64-bit

claymore wipe/reinstall Fri Mar 26 2010

  • a complete system wipe/reinstall of claymore, to get 64-bit system

ZEO port blocks Thu Mar 25 11:15:00 CDT 2010

  • block everyone except 4 FE and 1 BE machine from attaching to ZEO port (safety net)

restart_instance fixup Thu Mar 25 09:14:00 CDT 2010

  • restart_instance has been blocking the wrong IP address (in blockport/unblockport); fixed.

Nagios memory monitoring Wed Mar 24 4:30:00 CST 2010

  • decreased individual FE memory restart limits to 2 GB on claymore and gladius
  • altered restart_instance script to hit a _lot_ of content and site urls

memory-size test Fri Mar 12 11:30:58 CST 2010

  • reduced number of FEs on fountain to 6, bump up individual RAM size to warn at 5+GB, restart at 8GB
  • System level mem restart kept at 5% free

memory-size test Fri Mar 12 09:17:24 CST 2010

  • reduced number of FEs on quill to 6, bump up individual RAM size to warn at 5+GB, restart at 8GB
  • System level mem restart kept at 5% free

Nagios memory monitoring Tues Mar 9 11:45:00 CST 2010

  • created a new 'free system memory' nagios alert: warns at 20%, restarts biggest frontend at 5%
  • increased individual FE memory restart limits (4GB on claymore and gladius, 5GB on quill and fountain)
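
The two-threshold logic of the 'free system memory' alert can be sketched as follows. This is an assumed shape, not the actual plugin: the function names, the inclusive comparisons, and the frontend names in the usage note are hypothetical. The 20% and 5% thresholds come from the entry, and the exit codes follow the usual Nagios plugin convention.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def check_free_memory(free_pct, warn_at=20.0, restart_at=5.0):
    """Map the percentage of free system memory to a Nagios status."""
    if free_pct <= restart_at:
        return CRITICAL  # this is the level that triggers a restart
    if free_pct <= warn_at:
        return WARNING
    return OK


def frontend_to_restart(frontend_rss_gb):
    """Pick the frontend currently using the most memory."""
    return max(frontend_rss_gb, key=frontend_rss_gb.get)
```

Given `{"fe1": 3.1, "fe2": 4.7, "fe3": 2.2}` (illustrative figures in GB), `frontend_to_restart` would select `fe2` once free memory drops to 5% or below.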

CacheFu fixup Mon Mar 8 15:40:52 CST 2010

  • Convinced CacheSetup (aka CacheFu) product to actually upgrade itself from 1.2 to 1.2.1. We should have active cache policies again

rollout Mon Mar 8 14:36:00 CST 2010

  • bounce all FE for module PDF gen fix, and complete textarea "wrap" attrib removal

Buildout FE move! Tue Mar 2 17:50:00 CST 2010

  • move frontend2 and 4 (gladius and fountain) to buildout configed FEs

Buildout FE move! Mon Mar 1 15:45:39 CST 2010

  • Move frontend3 (quill) to buildout configed FEs

Buildout FE move! Mon Mar 1 15:41:50 CST 2010

  • Move frontend1 (claymore) to buildout configed FEs

FE restart test Tue Feb 23 11:55:00 CST 2010

  • slam restart all FEs (zopectl restart rather than restart_instance)

Queue workers - zombies Fri Feb 19 15:00:00 CST 2010

  • found hundreds of defunct procs on fountain (where print queue processing happens)

Varnish removed Fri Feb 19 15:00:00 CST 2010

  • stopped varnish on quill (not part of production)

More ZEO Cache Experiments Thu Feb 18 17:20:03 CST 2010

  • change zeoclient cache-size back from 50MB to 20MB

More ZEO Cache Experiments Thu Feb 18 14:21:29 CST 2010

  • change zeoclient cache-size from 20MB to 50MB

The Great ZODB Cache Experiment Wed Feb 17 09:30:00 CST 2010

  • set FEs to 2 zserver-threads
  • set FEs to 500000 cache-size (up from before) (main) and 5000 (down from 20000) (pdf)

munin tracking of ZODB Mon Feb 15 17:30:00 CST 2010

  • put two python scripts into ZMI to report ZODB cache params
  • add munin plugins to all FE servers to report these values

Squid 5:5 test Tue Dec 8 09:09:00 CST 2009

  • set FE's to 5 zserver-threads
  • set squid to max-conn 5

Squid FE tuning: put it back Mon Dec 7 12:15:00 CST 2009

  • restored FEs to 'default' (4 threads)
  • bump max-conn to 10 in squid

Squid FE tuning: reduce max-conn Mon Dec 7 09:25:00 CST 2009

  • dropped max-conn for all peer_cache to 5

LVM tuning redux Wed Nov 11 14:07:15 CST 2009

  • get the two mapper interfaces as well
    • blockdev --setra 16384 /dev/mapper/ballpoint*

LVM tuning on db.cnx.org Wed Nov 11 13:45:42 CST 2009

  • increased Read-Ahead on LVM disks on db.cnx.org
    • for D in a b c d; do sudo blockdev --setra 16384 /dev/sd$D; done
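
To put the read-ahead value in more familiar units: `blockdev --setra` counts 512-byte sectors, so the setting can be converted as in this small sketch (the helper names are hypothetical).

```python
SECTOR_BYTES = 512  # blockdev --setra is expressed in 512-byte sectors


def readahead_bytes(sectors):
    """Convert a --setra value to bytes."""
    return sectors * SECTOR_BYTES


def readahead_mib(sectors):
    """Convert a --setra value to mebibytes."""
    return readahead_bytes(sectors) / (1024 * 1024)
```

For the value used here, `readahead_mib(16384)` works out to 8 MiB of read-ahead per device.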

Dealing w/ bad PDFs Tue Oct 27 09:46:00 CDT 2009

  • Customized two templates TTW:
    • collection_view
    • bar_content_actions_view

These are now robust against broken PDFs

Queue processing worker on paring Mon Oct 26 11:38:00 CDT 2009

  • having processing on quill seems to have been an issue, so moved to a new instance on paring

Bad PDFs Mon Oct 26 9:20:00 CDT 2009

  • Find zero length PDFs (1118 of them) nuk'em, as well.

Bad PDFs Mon Oct 26 9:20:00 CDT 2009

  • POSKeyErrors caused by 31 bad PDF ATFiles. Find them and nuk'em.
  • also, PURGE cache for those.

Collxml rollout: Queuetool changes Fri Oct 23 15:29:22 CDT 2009

  • rolled out w/ 1 worker per physical hardware FE: saw lots of "stuck" jobs in pending queue
  • changed config to 1 worker on 1 FE: seems to be working through the queue properly.

reset changes Mon Oct 5 10:47:32 CDT 2009

  • change squid.conf:
    • max-conn 10
    • remove (non-functional) sourcehash

  • change zope FE zope.conf:
    • default (4) threads
    • ZODB cache 20MB

increase number of url_rewriters Fri Oct 2 17:37:09 CDT 2009

  • increase number of rewriters from 20 to 40

add source hashing Fri Oct 2 13:38:37 CDT 2009

  • alter squid config to use sourcehash for cache_peer selection: this should lead to some 'session stickiness' and better cache use

This had no effect: icp RTT overrides it. Testing on secondary squid instance instead
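
For reference, the idea behind source hashing is that the same client address always maps to the same peer, so a client's repeat requests land on a frontend whose caches are already warm. The sketch below illustrates the principle only: the MD5-based hash is an assumption and is not squid's actual algorithm, and the addresses in the usage note are made up.

```python
import hashlib


def pick_peer(client_ip, peers):
    """Deterministically map a client address to one peer in the list."""
    digest = hashlib.md5(client_ip.encode("ascii")).digest()
    # Use the first 4 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:4], "big") % len(peers)
    return peers[index]
```

With `peers = ["claymore", "gladius", "quill", "fountain"]`, repeated calls to `pick_peer("10.0.0.1", peers)` always return the same name. Note that a simple modulo scheme like this reshuffles most clients whenever the peer list changes.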

back off some changes Thu Oct 1 09:50:08 CDT 2009

  • reduce ZEO client cache to 100MB
  • reduce # of threads to 5
  • reduce # of squid connections to 5 to match

increase postgresql connections Wed Sep 30 17:48:00 CDT 2009

  • upped max con to 250 (needed to bump maxshmem as well in /etc/sysctl.conf)

increase ZEO client cache Wed Sep 30 10:52:00 CDT 2009

  • bump ZEO client cache from 20MB (Default) to 200 MB

increase zserver-threads Fri Sep 25 14:26:24 CDT 2009

  • up the # of threads per frontend to match the number of connections squid is throttled to (10, rather than the default of 4)

limit DB access Thu Sep 24 11:15:57 CDT 2009

  • limit (vi pg_hba.conf) access to postgresql on ballpoint to the actual frontend machines (for rhaptos user. The backup user can still get read access from all our machines)

Block multimediazip rangers Sat Aug 29 13:28:41 CDT 2009

  • add acl to block miss_access to multimediazip files for ranger requests (TEMP: need to fix by not advertising Range capability on zips)

DNS fallback config Fri Aug 28 22:20:00 CDT 2009

  • add backup DNS (and swap order of primary/backup)

OAI squid cache config Thu Aug 27 13:21:35 CDT 2009

  • had forced min time less than max time: reset
  • 'refresh' cronjob was not running: changed to www-data system user

NEW TTW caching rule Wed Aug 26 11:53:54 CDT 2009

  • added 1 hour frontpage caching for anonymous only

New TTW cache rules Tue Aug 25 16:58:14 CDT 2009

  • tried to add 1 hour cache rule for frontpage, removed
  • added cache-forever rule for some skins (opensearchdescription, transMenus0_9_2.js)

new fountain FEs Tue Aug 25 16:40:00 CDT 2009

  • add 3 additional fountain FEs, local total of 9, global 21

fountain squid Tue Aug 25 16:28:00 CDT 2009

  • redirect connections to cnx.org

move squid Fri Aug 21 16:18:29 CDT 2009

  • bring up alternate squid on butter - remap cnx.org in DNS
  • remove all weights (not working in the presence of local FEs)

squid Zope FE config - adding Fri Aug 21 13:54:29 CDT 2009

  • added 3 additional quill FEs

squid Zope FE config - weights Fri Aug 21 01:30:29 CDT 2009

  • kicked fountain FEs back up to 6
  • kicked quill weights up to 100

squid Zope FE config - weights Thu Aug 20 20:15:54 CDT 2009

  • added weights to cache_peer lines:

quill: 20 claymore: 10 gladius: 10 fountain: 5

  • removed monitor_url from cache_peer lines: ICP handles UP/DOWN

squid cache config Thu Aug 20 15:33:00 CDT 2009

  • Added second squid cache_dir (260 GB)

squid Zope FE config Thu Aug 20 09:01:00 CDT 2009

  • reduce number of gladius FEs from 6 to 3 to deal with swapping

OAI "fullfeed" caching Fri Aug 7 16:42:00 CDT 2009

  • added squid config to cache un-restricted OAI metadata pulls for 7 days

squid Zope FE config Thu Aug 6 10:44:02 CDT 2009

  • reduce number of fountain FEs from 6 to 3 to deal with load spikes

siyavula.cnx.org Tue Jun 30 10:59:01 CDT 2009

  • added acl changes and redirect.py changes to allow siyavula.cnx.org to work

force OAI caching Wed Jun 24 13:07:04 CDT 2009

  • add 60 min forced squid caching of all OAI pages

Take NetScaler out of it 06/17/09 14:14:44

  • change DNS: cnx.org -> 128.42.169.18 (direct) no netscaler (128.42.206.135)

Reduce max connections to fountain frontends 2009-06-17 10:20

  • reduce from 10 to 5 connections / frontend for f1 f2 and f3

Reduce max connections to fountain frontends 2009-06-17 09:40

  • reduce from 10 to 5 connections / frontend for f4 f5 and f6

Reduce maximum_icp_timeout for squid 2009-06-15 09:17

  • drop from 1 sec to 0.2 sec

put gladius frontends back in rotation 2009-06-09 17:30

  • upgraded gladius to lenny. Not yet rebooted, nor 64bit

removed gladius frontends from rotation 2009-06-09 10:06

  • one or more gladius frontends anomalously hung. Perhaps timeout is taking too long.

timestamp resync restart June 03, 14:20 - 14:27

  • run script to correct far-future timestamps in ZODB. Details on the admin wiki

Deal w/ collection PDF storm May 27

  • No config changes, just killed lots of make print jobs for a specific collection. Asked Jonothan to lock it in the future

restart frontends May 20, 12:05-12:20 pm

  • optimistic restart of all frontends for content serving speed - maybe a cache fragmentation issue

module_export_template caching May 12, 5:21 pm

  • modified ruleset for CacheControl of modules, to purge module_export_template from squid
  • manually PURGE all existing mets from squid (260 were cached)

PostgreSQL DB Backend move - May 6, 1:46pm

  • transfer PostgreSQL database from naginata to ballpoint (new hardware)

ZEO Backend move - May 1, 4:40pm

  • transfer ZEO backend from naginata to ballpoint (new hardware)

PostgreSQL file transfer speed April 30, 4:15 pm

  • turn off ssl on connection to postgresql server: streaming time for 40MB direct file download: ~ 60 sec before, 3.2 sec after
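
The effect of the timings quoted above is easier to see as throughput. A small sketch, with hypothetical helper names and treating the 40MB figure as exact:

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate for a download of the given size."""
    return size_mb / seconds


def speedup(before_s, after_s):
    """How many times faster the 'after' timing is than the 'before' timing."""
    return before_s / after_s
```

Using the numbers from the entry, the download went from roughly 0.67 MB/s with SSL (`throughput_mb_per_s(40, 60)`) to about 12.5 MB/s without it, a speedup of just under 19x.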

backups - pack and incremental April 29

  • reinstituted nightly zeopack before backup
  • changed postgres dump to be incremental for the files table only

zeo pack April 28, 1:53-2:55 pm

  • manually triggered zeopack, since the weekly hadn't happened

Quill Squid - April 23, 2:50 pm

  • remove squid on quill, using only fountain again

Quill Squid - April 22, 2:30 pm

  • added a second squid on quill, round-robin selection via netscaler

NOT_CONFIG: Gentext improvements - April 15

  • released gentext code improvements with expected significant performance impact (1pm)
  • released cache code and fix 1pm, 3:30 pm

Fountain Squid - April 15

  • reduce max mem for frontends on new suns (quill and fountain) (11:30am)
  • moved squid from tachi to fountain 12pm - 1pm

Netscaler - April 10

  • moved cnx.org dns to point to netscaler

Fountain Frontends - April 9, 2009

  • Put 6 frontends on fountain and pulled tanto and kissaki out of production

Sim Calc off - April 9, 2009

  • Turned similarity calculation off during publish

Backup Procedure - April 8, 2009

  • Changed to pack ZODB weekly rather than nightly
  • backup cnxconsortium.org

Squid Cache Parameters - April 7, 2009 3pm

  • Upped Disk Cache to 10G with up to 500MB cacheable
  • RAM Cache to 1G with up to 200K cacheable

Quill - April 5, 2009

  • 6 frontends removed due to backend write contention

Quill - April 3, 2009

  • 12 frontends added

Wakizashi frontends added -