#RealDataStories - CrowdChat

Infrastructure at Scale

Join to discuss lessons learned while running infrastructures at scale. Share tips & strategies.

#realdatastories @jbgeorge @furrier @lleung scality

We'll be talking about Infrastructure at Scale and the recent HP Apollo announcement.

Let's tee it up with our first question to the crowd http://www.via-cc.at... - are you ready for "always on"

@gregorygbishop - thoughts?

We're joined by @gregorygbishop , who ran the huge Time Warner infrastructure

the issue of course is always recovery - this is what makes "enterprise ready" very difficult - all the processes and procedures that are in place create terribly cemented infrastructure - but it typically works

. @lleung here at #bigdataSV #strataconf, questions about scaling hardware keep coming up, especially how to scale #bigdata

At TWC, I supported a mail system with 20M mailboxes, and over 10B objects in storage

@dvellante yes and no - recovery is relevant, but so is running continuously even with failures

the old way was to harden every piece of infra, the new way is distributed systems. The big hurdle is getting applications to make this change; infrastructure is getting there faster.

Joseph B George (JBG)

Definitely a common theme that I hear from customers these days

hadoop is still useful for petabyte processing. and yes many companies do have that problem. spark seems cool only if you can figure out how to keep it running at scale - what should folks do for this

Recovery isn't the issue - the issue is figuring out how not to need recovery

@gregorygbishop - please elucidate

Joseph B George (JBG)

and as tech evolves (a la Hadoop 2.0 and features like erasure coding), scaling becomes ever more interesting

When making a system "always on" and "at scale," one must assume that the system always has something in a failed state.

@dvellante - the old way of downtime or system slowdown while you recover is no longer valid

Andrew Reichman

with massive data sets it's just not viable to think that you can have a primary running with copies somewhere else that you would recover to when things break - it just takes too long to move the data and build out a new environment - you need HA
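To put a rough number on "it just takes too long to move the data", here is a back-of-the-envelope sketch. The 1 PB data set and the single 10GbE link are assumed figures for illustration, not numbers from this chat, and `transfer_time_days` is a hypothetical helper name.

```python
def transfer_time_days(data_bytes: float, link_bits_per_sec: float,
                       efficiency: float = 1.0) -> float:
    """Days needed to push data_bytes over one link.

    efficiency is the fraction of line rate actually achieved (assumed
    1.0 here, which is optimistic for any real network).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 1e15   # bytes (assumed data set size)
TEN_GBE = 10e9    # bits per second (assumed link speed)

print(f"{transfer_time_days(PETABYTE, TEN_GBE):.1f} days")  # 9.3 days
```

Even at full line rate that is more than a week before the copy is usable, which is the argument for keeping the data continuously available rather than restoring it after the fact.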

So in the old sense, the system is always 'in recovery'

@gregorygbishop - agree. there are always disks down and nodes down... cc @dvellante @stu

Andrew Reichman

But building HA requires deep integration with the apps that use the data, technology to keep multiple sites synchronized, and double the huge infrastructure

Joseph B George (JBG)

I'm seeing more people put more thought into things like fault domains - embracing that downtime will happen and planning with it in mind
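One way to picture planning around fault domains is replica placement: never put two copies of the same data in the same domain. This is only a minimal sketch, assuming a rack counts as a fault domain; the node names, rack names and `place_replicas` are all hypothetical and not taken from any product mentioned here.

```python
from collections import defaultdict


def place_replicas(nodes: dict[str, str], replicas: int) -> list[str]:
    """Pick one node per fault domain so no two replicas share a domain.

    nodes maps node name -> fault domain (for example a rack).
    """
    by_domain: dict[str, list[str]] = defaultdict(list)
    for node, domain in sorted(nodes.items()):
        by_domain[domain].append(node)
    if replicas > len(by_domain):
        raise ValueError("not enough fault domains for the requested replicas")
    # Deterministic choice: first node of each of the first `replicas` domains.
    return [by_domain[domain][0] for domain in sorted(by_domain)[:replicas]]


cluster = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackC"}
print(place_replicas(cluster, 3))  # ['n1', 'n3', 'n4']
```

Losing any single rack in this layout takes out at most one copy, so the data stays readable while the failed domain is repaired.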

Andrew Reichman

@gregorygbishop exactly- instead of recovery being a declared event when things hit the fan, it's more of a constant scenario that you're mitigating in smaller, non-disruptive ways

@reichmanIT - @gregorygbishop - do you agree with the notion of deep integration or is the infrastructure smarter?

@jbgeorge fault domain seems to be a common issue I hear from customers #realdatastories

Joseph B George (JBG)

Fail fast, right :)

There's definitely a law of large numbers effect - with 1,000s of disks and 1,000s of nodes, things will fail
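The arithmetic behind "things will fail" can be sketched in a few lines. This assumes components fail independently, and the 2% annual failure rate is a made-up illustrative figure, not a number from this chat; the function names are hypothetical.

```python
def prob_at_least_one_failure(n_components: int, p_fail: float) -> float:
    """Chance that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_fail) ** n_components


def expected_failures(n_components: int, p_fail: float) -> float:
    """Average number of failed components over the same period."""
    return n_components * p_fail


ANNUAL_FAILURE_RATE = 0.02  # assumed per-disk rate, for illustration only

for disks in (1, 100, 1_000, 10_000):
    p_any = prob_at_least_one_failure(disks, ANNUAL_FAILURE_RATE)
    avg = expected_failures(disks, ANNUAL_FAILURE_RATE)
    print(f"{disks:>6} disks: P(any failure) = {p_any:.4f}, expected = {avg:.0f}")
```

Under these assumptions a single disk is very likely to survive the year, while a fleet of 1,000 is all but certain to lose some, which is why the system has to be designed to run with something always in a failed state.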

Joseph B George (JBG)

I will also say that as the infrastructure is evolving - esp as we are looking at networking beyond 10GbE - the infrastructure design gets more interesting

I'm not sure that deep integration with the infrastructure is required to support the resiliency required for 'always on'

Andrew Reichman

@gregorygbishop depends on who's talking - if it's the infra team, they will say deep integration. If it's the app team, they will say that they can control dumb infra with their smart software

Joseph B George (JBG)

back in 100Mb times, it was a source that had to be "designed around" - that is changing

@gregorygbishop - certainly, our prescription is a different kind of infrastructure - "distributed" is one piece @stu

polarization between apps at scale (business applications) and infra at scale (infra software) - lots of innovation at the infra level

@gregorygbishop - given "continuous recovery" what do you do differently from before?

. @lleung this brings up the notion of hw as a service - consumption has to be easy to stand up and provision for the app scale world - I'm interested in what solutions are out there

The value and the ability to build something that can scale is now a necessity

@furrier Definitely - not so much "as a service", but service oriented yes. Have to work with old and new apps.

Joseph B George (JBG)

I know @zehicle has been talking about this for many years

At TWC, the mail application interfaces with the storage infrastructure using a standard web interface, but has no concept of how the infrastructure keeps data available

. @lleung many think that containers are a big part of the transition from old apps to new and powering #devops

Joseph B George (JBG)

I actually WOULD say it is HWaaS - the tools behind can give it that level of delivery

So, the goal was to make the infrastructure smart, not HA in the traditional sense, as the application sees no 'failover'

Joseph B George (JBG)

totally agree on containers @furrier

@gregorygbishop - cool - my point these days is traditional notions of failover, recovery, availability... need an update

Joseph B George (JBG)

In that vein, we are seeing more and more HP customers start looking at infra closer - purpose built vs general purpose - getting great results

Andrew Reichman

@gregorygbishop decoupled architecture allows each piece to scale indefinitely and not break the others so long as everybody is reliable and speaking a language the others understand

Joseph B George (JBG)

The recently announced HP Big Data Ref Arch is a good example

@reichmanIT - that's what i mean by service oriented vs. "as a service" - probably need a longer piece on that

Joseph B George (JBG)

http://h30507.www3.h...

OK - about to switch to next topic

@gregorygbishop how utopian - that would be a computer industry first!

Andrew Reichman

agree - as a service just means that someone else is doing it. Service oriented means that separate domains have rules of engagement wherever they might live and whoever might have built them

you have to think about "disposable infrastructure" but imo if you ignore recovery you are a foolish practitioner - remember - even google has to recover from tape at times

.@lleung Cloud scalability and performance should be at the heart of every successful internet venture.