Subject: [NYCGA Internet] Re: Call for volunteers: Linux Administration and MY SQL DBA resources
From: Chaz Cheadle
Date: Wed, 19 Oct 2011 14:41:01 -0400
To: Kevin <king.feruke@gmail.com>, Sam Boyer <act@samboyer.org>, Ron Suarez <ron.suarez@loudfeed.com>, Drew H <drew@nycga.net>, Todd Grayson <tgraysonco@gmail.com>, Jake <jakedeg@gmail.com>, Internet Working Group <internet_working_group@googlegroups.com>, ows_solutions <ows_solutions@freenetworkfoundation.org>

I can be available tomorrow in the city, not today though.
Why not stay with Rackspace? From reviewing the Panix website, I was not bowled over with confidence in its reliability. I am extremely happy with Rackspace technical support. If we can afford to stay with them, I'd do it. Depending on the traffic we hit, their transfer rates are pretty competitive.

On Wed, Oct 19, 2011 at 2:33 PM, Tom Gillis <thomaswgillis@gmail.com> wrote:
*Short summary - site is on Rackspace Cloud, not Panix, for now.
*Let's try to get interested sysadmin folks in a room or skype tonight.

Chaz / Kevin  - what's your real-time availability in NYC today ? We'd
like to have another work session - Dan R and Jake also have root on
the server and I'd like to get us all in a room or at least a skype to
coordinate (I won't be available onsite until tonight).

BTW - one other thing that we need to come to a consensus on is our
longterm hosting provider - the ppl at the work session decided last
night (around 2 AM) that after the Panix hosting server that i had
access on became unresponsive (looking into the cause of this) and
none of us had the account creds to restart the server or provision
another one we moved the site to a rackspace cloud account (cloned
from a legacy machine that I have set up as a LAMP box for freelance
projects).  Not the best setup but the best we could do at the last
minute, with calling off the site launch (for the 2nd time in a week)
not an option (since the staging site url was already starting to leak
to the general public and it just meant that the data / user migration
was going to get more complicated as time went by).

ANYWAY - folks on the ground can sync up with Jake or Dan to get
access to the box, or we can do it over skype later (I'm trying to
stick to not sending credentials over the internet).

There may be some duplication of effort here - getting a deployment
going that will get us thru the next few weeks, and then coming up
with a better long term solution.



On Wed, Oct 19, 2011 at 1:13 PM, Kevin <king.feruke@gmail.com> wrote:
> lighttpd/Nginx or any async server is exactly where I was going with fcgi
> ... Having them use fam/gamin will help file stats if we find I/o being the
> problem
>
> Sorry for short mails but I'm mobile only for a while
>
> DON'T PANIC
>
> On Oct 19, 2011 12:05 PM, "Sam Boyer" <act@samboyer.org> wrote:
>>
>> i wish i could volunteer to hit this round the clock, but i just can't
>> at the moment :( :(
>>
>> some thoughts on scaling this thing. first - we don't know squat until
>> we get into the box. second, once we do, installing some monitoring
>> tools - e.g. cacti - should be high priority, otherwise, we're just
>> gonna be flailing around in the dark. nagios is fine, but that'll get us
>> monitoring, not usage logs. alternatively/additionally we could look at
>> paying for monitoring from a service like new relic (which might be
>> better if only because it means less that we have to maintain ourselves,
>> at least at first). beyond that:
>>
>> - get xhprof onto a prod clone somewhere so we can actually look at
>> what's taking up the processing time. beyond low-hanging fruit, though,
>> that's gonna take some expertise to actually make a dent with.
>> - big duh, but we've got an opcode cache running...right? the site seems
>> too responsive right now for this NOT to be the case.
>> - getting mysql onto baremetal, or rackspace cloud (though that would
>> mean moving everything to rackspace, and i've already heard security
>> concerns about that), should probably be a priority. heavy db io through
>> virt layer...meh.
>> - due to the 1s-minimum granularity issue the mysql slow query log is
>> almost a too-late-to-be-useful thing (unless the percona people FINALLY
>> got that patch in to add ms granularity in mainline...but i doubt it),
>> but we do need to run it, as it'll give us a hit list of queries for
>> optimization and/or caching.
>> - as kevin mentioned on the other thread, fcgi; and if we do that,
>> really, no reason not to switch to nginx. i don't know what our request
>> volume looks like so i don't know how much we'd be getting back there,
>> but really, there's no reason to be serving static assets with bloaty
>> apache workers.
>> - ordinarily, for a drupal site of this type, i'd advocate ESI. i have
>> no idea how well WP supports content chunking like that (and truth is,
>> good ESI strategies take a *while* to craft), but at the very least some
>> internal data caching could help with query volume (e.g., cache the
>> output of the query that generates the global activity feed for 30s or
>> so). again, though, i don't know how easy that is to layer in with WP,
>> and the more custom we get, the more difficult it's gonna be to maintain.
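
A rough illustration of the "cache the activity feed query for 30s" idea from the list above. This is a language-neutral sketch written in Python, not WordPress or BuddyPress code; the `TTLCache` class and the `load_activity_feed` callback in the usage note are hypothetical names, and a real deployment would more likely use a shared cache than an in-process dictionary.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Keep a computed value for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        # Serve the stored value while it is younger than the TTL.
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        # Otherwise run the expensive query once and remember the result.
        value = compute()
        self._store[key] = (now, value)
        return value
```

Usage would look something like `feed = cache.get_or_compute("global_activity", load_activity_feed)` with `cache = TTLCache(30)`. The trade-off is that users may see a feed up to 30 seconds stale in exchange for the query running at most once per window.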
>>
>> like i said, though, until we actually *know* where the problem(s) are,
>> we can't address them. also, somewhere in this thread i remember seeing
>> someone set up for the expectation that we might just need ~500MB/proc.
>> dear god, i hope not. if that's the case, we could blow the entire war
>> chest that's been accumulated thus far for liberty plaza (~$230,000 i
>> read somewhere) and still only be able to support several hundred
>> concurrent users. that needs to be brought *down*.
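
The memory concern above can be made concrete with some back-of-the-envelope arithmetic. The 16 GB figure and the ~500MB/proc estimate come from this thread; the 2 GB reserved for the OS and MySQL and the smaller per-process sizes are assumptions chosen only for comparison.

```python
def max_php_workers(total_ram_mb: int, per_proc_mb: int,
                    reserved_mb: int = 0) -> int:
    """How many worker processes fit in RAM before the box starts swapping."""
    usable = total_ram_mb - reserved_mb
    if usable <= 0 or per_proc_mb <= 0:
        return 0
    return usable // per_proc_mb


BOX_RAM_MB = 16 * 1024   # the 16 GB virtual box mentioned in the thread
RESERVED_MB = 2048       # assumption: headroom for the OS and MySQL

# 500 MB is the worst case discussed; 128 and 64 are hypothetical targets.
for per_proc in (500, 128, 64):
    workers = max_php_workers(BOX_RAM_MB, per_proc, RESERVED_MB)
    print(f"{per_proc} MB/proc -> {workers} concurrent workers")
```

Under these assumptions the box fits 28 workers at 500 MB each, versus 112 at 128 MB and 224 at 64 MB, which is why bringing per-process memory down matters more than adding hardware.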
>>
>> cheers
>> s
>>
>> On 10/19/11 8:47 AM, Kevin wrote:
>> > Agree nagios for the win, we should get logwatch going as well
>> >
>> > Rackspace cloud machines guarantee proc and allow bursting if available.
>> >
>> > Once we have more than one machine we need to think about config
>> > mgmt...i would suggest puppet. We could use blueprint to analyze the
>> > current machine and generate puppet files.
>> >
>> > https://github.com/devstructure/blueprint
>> >
>> > DON'T PANIC
>> >
>> > On Oct 19, 2011 11:32 AM, "Chaz Cheadle" <ccheadle@gmail.com
>> > <mailto:ccheadle@gmail.com>> wrote:
>> >
>> >        I'd like to suggest zenoss/nagios for monitoring.
>> >     As for hardware configurations, I'd say we definitely should have
>> >     physical/dedicated DB servers with cloud webhosting. Unless we're on
>> >     Rackspace or Linode, it may be hard to ensure we'll get the needed
>> >     processor or I/O from a vps.
>> >       If we have one server now, we can start serving the whole thing
>> >     from there, then purchase cloud webservers to lighten the webload
>> >     then add mysql replication in later if the DB reads start getting
>> >     high. Unless we're doing heavy editing on the site one DB server for
>> >     now should handle all of the read requests.
>> >        With Zen/nagios we will be able to monitor the server and make
>> >     decisions on expansion. Let's figure out the resource issue we have
>> >     now with WP before jumping to cloud web hosts and MySql replication.
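
If MySQL replication is added later to absorb read load, the application side needs some way to send reads to replicas and writes to the primary. The following is only a sketch of that routing decision; the class and the host names are hypothetical, and it ignores replication lag and statements such as `SELECT ... FOR UPDATE` that must go to the primary.

```python
import itertools
from typing import Sequence


class ReadWriteRouter:
    """Send SELECTs round-robin to replicas and everything else to the primary."""

    def __init__(self, primary: str, replicas: Sequence[str]) -> None:
        self.primary = primary
        # With no replicas configured, every statement goes to the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str) -> str:
        words = sql.lstrip().split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.primary
```

For example, `ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])` would alternate reads between the two replicas while an `UPDATE` always lands on `db-primary`.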
>> >
>> >     What is the current panix host package we're on?
>> >
>> >     chaz
>> >
>> >     On Wed, Oct 19, 2011 at 11:14 AM, Todd Grayson <tgraysonco@gmail.com
>> >     <mailto:tgraysonco@gmail.com>> wrote:
>> >
>> >         Adding Chaz and Kevin,
>> >
>> >         guys once consensus can be reached with dev leads, folks can be
>> >         ID'd and get started on the "how to move forward" as a working
>> >         team? Tom needs additional eyeballs and hands covering production
>> >         deploy as well as ongoing release engineering. The subject is
>> >         going to become a bigger deal as work continues.  Please review,
>> >         and let's get a working plan together that approaches the list in
>> >         a way that resource on-boarding is clean and effective?
>> >
>> >         Todd
>> >
>> >         On 10/19/2011 8:06 AM, Todd Grayson wrote:
>> >>         OK:
>> >>
>> >>         As a conversation before going to the list, I'm reaching out
>> >>         to you folks to establish consensus on what is going to happen
>> >>         next.  Please identify WHO should be included in this
>> >>         conversation not currently a part of it.  Once consensus is in
>> >>         place we can go to the lists for specific volunteers.  To make
>> >>         this efficient and quick the team in NYC should have on hand
>> >>         the following items for the folks coming forward:
>> >>
>> >>           * Development leads who are overseeing configuration for
>> >>             current wordpress deploy and able to answer questions
>> >>               o available for q&a and facilitating access to repo's
>> >>                 etc. when needed
>> >>           * ID who the contacts are for the panix hosting
>> >>             services; a conference call with them to talk through
>> >>             what is being seen now and what we feel will be needed to
>> >>             reach capacity should be scheduled ASAP
>> >>           * Is there any way to get current perf statistics from where
>> >>             it's running now?
>> >>
>> >>
>> >>         The call for specific volunteers will be based on the fact we
>> >>         need a team of folks to help out with systems and DB
>> >>         administration tasks as well as performance tuning and
>> >>         capacity planning.  This will give the working technical team
>> >>         depth and allow for a more continuous support model, as one
>> >>         worker will only have limited hours in a day to contribute,
>> >>         whereas a team model can support sustained activity over a
>> >>         period of time.
>> >>
>> >>         Here is what is needed from the current and previous volunteer
>> >>         lists as well as contacts on the ground once they are
>> >>         identified:
>> >>
>> >>           * Technical Project Manager
>> >>           * Linux systems administrators with web hosting backgrounds
>> >>             (and virtual hosting infrastructure)
>> >>           * MySql DBA's supporting web hosted env's, wordpress
>> >>             environments
>> >>
>> >>         Resources like this have already come forward to the IWG list,
>> >>         we can start with these people.  IMHO a target team of 6
>> >>         people should be the goal (3 dba, 3 sysadmin)
>> >>         MySql
>> >>
>> >> http://groups.google.com/group/internet_working_group/browse_thread/thread/4bde061a2adacee6/ede85b8a1812c3cc?lnk=gst&q=MySQL#ede85b8a1812c3cc
>> >>         Linux Administration
>> >>
>> >> http://groups.google.com/group/internet_working_group/browse_thread/thread/694bde580564c681/f4afcbc78aa06aa3?lnk=gst&q=Linux+Administration#f4afcbc78aa06aa3
>> >>
>> >>         This call for volunteers will be the creation of a team that
>> >>         will be dedicated to the infrastructure of the wordpress
>> >>         sites, the DB infrastructure supporting them, and the apache /
>> >>         php / wordpress install and configuration over your
>> >>         dev/test/release environments moving forward.  The folks that
>> >>         will be coming forward from online will need to be included in
>> >>         communication, brought into the planning, and then included in
>> >>         communications as a team moving forward.
>> >>
>> >>         If you want to start the ball rolling on this let me know who
>> >>         the contacts are from the "on the ground" requirements and we
>> >>         can get going asap.
>> >>
>> >>         IMHO the actual MySql DB's might have to be on physical
>> >>         hardware if the IO we are seeing on the VM's shared backplane
>> >>         is the bottleneck.... or just reside in a MySQL DB farm.. That
>> >>         will have to be evaluated with iostat output as system access
>> >>         is regained and the cause is isolated.  It might just simply
>> >>         be memory related; disk IO pressure as paging/swap attempted
>> >>         to scale for the  demand of resources.   If we know the right
>> >>         process names ssh pkill statements can be sent to try and free
>> >>         up the system as well?
>> >>
>> >>         ssh username@hostfqdn 'pkill httpd'
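
One quick way to test the "memory pressure causing swap IO" theory once access is regained is to compare `SwapTotal` and `SwapFree` in `/proc/meminfo`. The sketch below parses that file's `Name: value kB` lines; the function names are made up for illustration, and it is a complement to `iostat`, not a replacement for it.

```python
def swap_used_kb(meminfo_text: str) -> int:
    """Return swap in use (kB) given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


def local_swap_used_kb(path: str = "/proc/meminfo") -> int:
    """Convenience wrapper for running directly on a Linux box."""
    with open(path) as fh:
        return swap_used_kb(fh.read())
```

A large and growing result while the site is slow would point toward memory exhaustion rather than a shared-backplane IO limit.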
>> >>
>> >>         Todd
>> >>
>> >>         On 10/19/2011 6:33 AM, Tom Gillis wrote:
>> >>>         And I feel like "scalable wordpress deployment" is a little
>> >>>         bit of an oxymoron - but:
>> >>>         good news - we have the nycga 2.0 site up, and the
>> >>>         functionality is all working as expected.
>> >>>
>> >>>         bad news - we needed to rush deployment so that working groups
>> >>>         could start using new features, but wordpress is killing the
>> >>>         cpu / memory on the server (a 16gb virtual box) and we know
>> >>>         that a single server hosting setup is not going to be viable.
>> >>>
>> >>>         caching doesn't help us much since most of the content is
>> >>>         dynamic, and near-real time - it's wordpress with buddypress
>> >>>         on top so there's tons of forums, and social-networky
>> >>>         activity feeds.
>> >>>
>> >>>         what we need:
>> >>>         1 - move mysql to its own server, and set up master / slave
>> >>>         replication (2 virtual servers)
>> >>>         2 - set up a shared file hosting server for user-uploaded
>> >>>         images - nfs mounts to a single box (1 box)
>> >>>         3 - set up load-balanced web frontends with sticky
>> >>>         sessions (4 virtual boxes probably)
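
To illustrate what "sticky sessions" means for item 3: each session has to keep landing on the same web frontend. One simple way to get that property is to hash the session id onto the pool, sketched below. The frontend names are hypothetical, and this is not how any particular load balancer implements stickiness (many use a cookie instead).

```python
import hashlib
from typing import Sequence


def pick_backend(session_id: str, backends: Sequence[str]) -> str:
    """Map a session id to one backend, always the same one for a fixed pool."""
    if not backends:
        raise ValueError("no backends configured")
    digest = hashlib.sha1(session_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical names for the four web frontends proposed above.
WEB_FRONTENDS = ["web1", "web2", "web3", "web4"]
```

The catch with plain modulo hashing is that adding or removing a frontend remaps most sessions, so it suits a pool whose size rarely changes.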
>> >>>
>> >>>         I'm hoping to find a few people who will volunteer to work
>> >>>         with internet group, either in nyc or remotely, over the next
>> >>>         72 hrs to make a push to get this infrastructure in place.  in
>> >>>         parallel, we'll be making code optimizations to the site.
>> >>>         (lots of low-hanging fruit here, like minifying js and css).
>> >>>         i'm hoping to find somebody who can set up one aspect of the
>> >>>         infrastructure and I'll hook them up with a cloned version of
>> >>>         the production server, which they can modify to fulfill one
>> >>>         of these other roles - then we can deploy that back into the
>> >>>         main infrastructure.   I'm probably going to be asleep until
>> >>>         around 1pm nyc time, but I'm hoping to have some volunteers by
>> >>>         the time I come back. And right now we really need people who
>> >>>         can free up most of their time for the rest of the week on
>> >>>         this (we're literally working around the clock in nyc so
>> >>>         you'll have people to coordinate with no matter where / when
>> >>>         you're available)
>> >>>
>> >>>         Any takers?
>> >>
>> >
>> >
>>
>
