
Stuff The Internet Says On Scalability For August 16th, 2019


Wake up! It’s HighScalability time:

Do you like this sort of Stuff? I’d love your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 53 mostly 5 star reviews (124 on Goodreads). They’ll learn a lot and likely add you to their will.

Number Stuff:

  • $1 million: Apple finally using their wealth to improve security through bigger bug bounties.
  • $4B: Alibaba cloud service yearly run rate, growth of 66%. Says they’ll overtake Amazon in 4 years. 
  • 200 billion: Pinterest pins pinned across more than 4 billion boards by 300 million users.
  • 21: technology startups took in mega-rounds of $100 million or more. 
  • 3%: of users pass their queries through resolvers that actively work to minimize the extent of leakage of superfluous information in DNS queries.
  • < 50%: Google searches result in a click. SEO dies under walled garden shade.
  • 4 million: DDoS attacks in the last 6 months, frequency grew by 39 percent in the first half of 2019. IoT devices are under attack within minutes. Rapid weaponization of vulnerable services continued. 
  • 200: distributed microservices in S3, up from 8 when it started 13 years ago.
  • 50%: cumulative improvement to Instagram.com’s feed page load time.
  • $318 million: Fortnite monthly revenue, likely had more than six consecutive months with at least one million concurrent active users.
  • $18,000: in fines because you just had to have the license plate NULL. 
  • $6.1 billion: Uber created Dutch weapon to avoid paying taxes.
  • 14.5%: drop in 1H19 global semiconductor sales.
  • 13%: fall in ad revenue for newspapers. 

Quotable Stuff:

  • Donald Hoffman: That is what evolution has done. It has endowed us with senses that hide the truth and display the simple icons we need to survive long enough to raise offspring. Space, as you perceive it when you look around, is just your desktop—a 3D desktop. Apples, snakes, and other physical objects are simply icons in your 3D desktop. These icons are useful, in part, because they hide the complex truth about objective reality.
  • rule11: First lesson of security: there is (almost) always a back door.
  • Paul Ormerod: A key discovery in the maths of how things spread across networks is that in any networked system, any shock, no matter how small, has the potential to create a cascade across the system as a whole. Watts coined the phrase “robust yet fragile” to describe this phenomenon. Most of the time, a network is robust when it is given a small shock. But a shock of the same size can, from time to time, percolate through the system. I collaborated with Colbaugh on this seeming paradox. We showed that it is in fact an inherent property of networked systems. Increasing the number of connections causes an improvement in the performance of the system, yet at the same time, it makes it more vulnerable to catastrophic failures on a system-wide scale.
  • @jeremiahg: InfoSec is ~$127B industry, yet there’s no price tags on any vendor website. For some reason it’s easier to find out what a private plane costs than a ‘next-gen’ security product. Oh yah, and let’s not forget the lack of warranties.
  • Hall’s Law:  the maximum complexity of artifacts that can be manufactured at scales limited only by resource availability doubles every 10 years. 
  • YouTube: “Our responsibility was never to the creators or to the users,” one former moderator told the Post. “It was to the advertisers.”
  • reaperducer: It’s for this reason that I’ve stopped embedding micro data in the HTML I write. Micro data only serves Google. Not my clients. Not my sites. Just Google. Every month or so I get an e-mail from a Google bot warning me that my site’s micro data is incomplete. Tough. If Google wants to use my content, then Google can pay me. If Google wants to go back to being a search engine instead of a content thief and aggregator, then I’m on board.
  • Maxime Puteaux: The small satellite launch market has grown to account for “69% of the satellites launched last year in number of satellites but only 4% of the total mass launched (i.e. 372 tons). … The smallsat market experienced a 23% compound annual growth rate (CAGR) from 2009 to 2018” with even greater growth expected in the future, dominated by the launch needs of constellations.
  • @Electric_Genie: San Diego has a huge, machine-intelligence-powered smart streetlight network that monitors traffic to time traffic signals. Now, they’ve added ability to detect pedestrians and cyclists
  • Simon Wardley: How to create a map? Well, I start off with a systems diagram, I give it an anchor at the top. In this case, I put customer and then I describe position through a value chain. A customer wants online photo storage, which needs website, which needs platform, which needs computer, which needs power, and of course, the stuff at the bottom is less visible to the customer than the stuff at the top.
  • Charity Majors: When we blew up the monolith into many services, we lost the ability to step through our code with a debugger: it now hops the network.  Our tools are still coming to grips with this seismic shift.
  • Livia Gershon: According to McLaren, from 1884 to 1895, the Matrimonial Herald and Fashionable Marriage Gazette promised to provide “HIGH CLASS MATCHES” to U.K. men and women looking for wives and husbands. Prospective spouses could place ads in the paper or work directly with staff of the associated World’s Great Marriage Association to privately make a connection.
  • @KarlBode: There is absolutely ZERO technical justification for bandwidth caps and overage fees on cable networks. Zero. It’s a glorified price hike on captive US customers who already pay more for bandwidth than most developed nations due to limited competition.
  • Fowler: That’s the other piece of app trackers: they do a whole bunch of bad things to our phone. Over the course of a week, I found 5,400 different trackers activated on my iPhone. Yours might be different. I may have more apps than you. But that’s still quite a lot. If you multiplied that out by an entire month, it would have taken up 1.5 gigabytes of data just going to trackers from my phone. To put that in some context, the basic data plan from AT&T is only 3 gigabytes.
  • Kate Green: Starshot is straightforward, at least in theory. First, build an enormous array of moderately powerful lasers. Yoke them together—what’s called “phase lock”—to create a single beam with up to 100 gigawatts of power. Direct the beam onto highly reflective light sails attached to spacecraft weighing less than a gram and already in orbit. Turn the beam on for a few minutes, and the photon pressure blasts the spacecraft to relativistic speeds.
  • Markham Heid: Beeman says activities that are too demanding of our brain or attention — checking email, reading the news, watching TV, listening to podcasts, texting a friend, etc. — tend to stifle the kind of background thinking or mind-wandering that leads to creative inspiration. 
  • @ben11kehoe: Aurora never downsizes storage. Continue to pay at the highest roll you’ve ever made.
  • John Allspaw: Resilience is not preventative design, it is not fault-tolerance, it is not redundancy. If you want to say fault-tolerance, just say fault-tolerance. If you want to say redundancy, just say redundancy. You don’t have to say resilience just because, you can, and you absolutely are able to. I wish you wouldn’t, but you absolutely can, and that’ll be fine as well.
  • Matthew Ball: But, again, lucrative “free-to-play” games have been around for more than a decade. In fact, it turns out the most effective way to generate billions of dollars is to not require a player spend a single one (all of the aforementioned billion-dollar titles are also free-to-play). 
  • TrailofBits: Smart contract vulnerabilities are more like vulnerabilities in other systems than the literature would suggest. A large portion (about 78%) of the most important flaws (those with severe consequences that are also easy to exploit) could probably be detected using automated static or dynamic analysis tools.
  • @sfiscience: 1/2 “Once you induce [auto safety] regulatory protection, there is a decline in the number of highway deaths. And then in 3-4 years, it goes right up to where it was before the safety regulation is imposed.” 2/2 There’s a kind of “risk homeostasis” with regulation: as people feel safer, they take more risks (eg, seatbelts led to faster driving and more pedestrian deaths). One exception: @NASCAR deaths went UP with safety innovations. “People are not dumb, but they’re not rational-expectations-efficient either.”
  • Michael F. Cohen: It may be hard to believe, but only a few years ago we debated when the first computer graphics would appear in a movie such that you could not tell if what you were looking at was real or CG. Of course, now this question seems silly, as almost everything we see in action movies is CG and you have no chance of knowing what is real or not.
  • Dropbox: Much like our data storage goals, the actual cost savings of switching to SMR (Shingled Magnetic Recording) have met our expectations. We’re able to store roughly 10 to 20 percent more data on an SMR drive than on a PMR drive of the same capacity at little to no cost difference. But we also found that moving to the high-capacity SMR drives we’re using now has resulted in more than a 20 percent savings overall compared to the last generation storage design.
  • Riot Games: The patch size was 68 MB for RADS and 83 MB for the new patcher. Despite the larger download size, the average player was able to update the game in less than 40 seconds, compared to over 8 minutes with the old patcher.
  • @grossdm: For a decade, VCs have been subsidizing the below-market provision of services to urban-dwellers: transport, food delivery, office space. Now the baton is being passed to public shareholders, who will likely have less patience. 20 years ago, public investors very quickly walked away from the below-market provision of e-commerce and delivery services  — i.e. Webvan. 
  • Julia Grace: Looking back, I should have done a lot more reorgs [at Slack] and I should’ve broken up a lot more parts of the organization so that they could have more specialization, but instead, it was working so we kept it all together.
  • Thomas Claburn: “No iCloud subscriber bargained for or agreed to have Apple turn his or her data – whether encrypted or not – to others for storage,” the complaint says. “…The subscribers bargained for, agreed, and paid to have Apple – an entity they trusted – store their data. Instead, without their knowledge or consent, these iCloud subscribers had their data turned over by Apple to third-parties for these third-parties to store the data in a manner completely unknown to the subscribers.”
  • @glitchx86: Some merit to TM: it solves the problem of the correctness of lock-based concurrent programs. TM hides all the complexity of verifying deadlock-free software .. and it isn’t an easy task 
  • @narayanarjun: We were experiencing 40ms latency spikes on queries at @MaterializeInc and @nikhilbenesch tracked it down to TCP NODELAY, and his PR just cracks me up. The canonical cite is a hacker news comment ((link: https://news.ycombinator.com/item?id=10608356) news.ycombinator.com/item?id=106083…) signed by John Nagle himself, and I can’t even.
  • Donald Hoffman: Perhaps the universe itself is a massive social network of conscious agents that experience, decide, and act. If so, consciousness does not arise from matter; this is a big claim that we will explore in detail. Instead, matter and spacetime arise from consciousness—as a perceptual interface.
  • MacCárthaigh: From the very beginning at AWS, we were building for internet scale. AWS came out of amazon.com and had to support amazon.com as an early customer, which is audacious and ambitious. They’re a pretty tough customer, as you can imagine, one of the busiest websites on Earth. At internet scale, it’s almost all uncoordinated. If you think about CDNs, they’re just distributed caches, and everything’s eventually consistent, and that’s handling the vast majority of things.
  • Jack Clark: Being able to measure all the ways in which AI systems fail is a superpower, because such measurements can highlight the ways existing systems break and point researchers towards problems that can be worked on.
  • Google: We investigated the remote attack surface of the iPhone, and reviewed SMS, MMS, VVM, Email and iMessage. Several tools which can be used to further test these attack surfaces were released. We reported a total of 10 vulnerabilities, all of which have since been fixed. The majority of vulnerabilities occurred in iMessage due to its broad and difficult to enumerate attack surface. Most of this attack surface is not part of normal use, and does not have any benefit to users. Visual Voicemail also had a large and unintuitive attack surface that likely led to a single serious vulnerability being reported in it.  Overall, the number and severity of the remote vulnerabilities we found was substantial. Reducing the remote attack surface of the iPhone would likely improve its security.
  • sleepydog: I work in GCP support. I think you would be surprised. Of course Linux is more common, but we still support a lot of customers who use Windows Server, SQL Server, and .NET for production.
  • Laurence Tratt: performance nondeterminism increasingly worries me, because even a cursory glance at computing history over the last 20 years suggests that both hardware (mostly for performance) and software (mostly for security) will gradually increase the levels of performance nondeterminism over time. In other words, using the minimum time of a benchmark is likely to become more inaccurate and misleading in the future…
  • Geoff Tate: A year ago, if you talked to 10 automotive customers, they all had the same plan. Everyone was going straight to fully autonomous, 7nm, and they needed boatloads of inference throughput. They wanted to license IP that they would integrate into a full ADAS chip they would design themselves. They didn’t want to buy chips. That story has backpedaled big time. Now they’re probably going to buy off-the-shelf silicon, stitch it together to do what they want, and they’re going to take baby steps rather than go to Level 5 right away.
  • Ann Steffora Mutschler: In discussions with one of the Tier 0.5 suppliers about whether sensor fusion is the way to go or if it makes better sense to do more of the computation at the sensor itself, one CTO remarked that certain types of sensor data are better handled centrally, while other types of sensor data are better handled at the edge of the car, namely the sensor, Fritz said.
  • Dai Zovi: A software engineering team would write security features, then actively go to the security team to talk about it and for advice. We want to develop generative cultures, where risk is shared. It’s everyone’s concern. If you build security responsibility into every team, you can scale much more powerfully than if security is only the security staff’s responsibility.
  • Nitasha Tiku: But that didn’t mean things would go back to normal at Google. Over the past three years, the structures that once allowed executives and internal activists to hash out tensions had badly eroded. In their place was a new machinery that the company’s activists on the left had built up, one that skillfully leveraged media attention and drew on traditional organizing tactics. Dissent was no longer a family affair. And on the right, meanwhile, the pipeline of leaks running through Google’s walls was still going as strong as ever.
  • Graham Allan: There’s another bottleneck that SoC designers are starting to struggle with, and it’s not just about bandwidth. It’s bandwidth per millimeter of die edge. So if you have a bandwidth budget that you need for your SoC, a very easy exercise is to look at all the major technologies you can find. If you have HBM2E, you can get on the order of 60+ gigabytes per second per millimeter of die edge. You can only get about a sixth of that for GDDR6. And I can only get about a tenth of that with LPDDR5.
  • Brian Bailey: If the industry is willing to give von Neumann the boot, it should perhaps go the whole way and stop considering memory to be something shared between instructions and data and start thinking about it as an accelerator. Viewed that way, it no longer has to be compared against logic or memory, but should be judged on its own merits. If it accelerates the task and uses less power, then it is a purely economic decision if the area used is worth it, which is the same as every other accelerator.
  • Barbara Tversky: This brings us to our First Law of Cognition: There are no benefits without costs. Searching through many possibilities to find the best can be time consuming and exhausting. Typically, we simply don’t have enough time or energy to search and consider all the possibilities. The evidence on action is sufficient to declare the Second Law of Cognition: Action molds perception. There are those who go farther and declare that perception is for action. Yes, perception serves action, but perception serves so much more. 
  • Jez Humble: testing is for known knowns, monitoring is for known unknowns, observability is for unknown unknowns
  • @briankrebs: Being in infosec for so long takes its toll. I’ve come to the conclusion that if you give a data point to a company, they will eventually sell it, leak it, lose it or get hacked and relieved of it. There really don’t seem to be any exceptions, and it gets depressing
  • Brendon Foye: The hyperscale giant today released a new co-branding guide (pdf), instructing partners in the AWS Partner Network (APN) how to position their marketing material when going to market with AWS. Among the guidelines, AWS said it won’t approve the use of terms like “multi-cloud,” “cross cloud,” “any cloud,” “every cloud,” “or any other language that implies designing or supporting more than one cloud provider.”
  • Newley Purnell: Startup Engineer.ai says it uses artificial-intelligence technology to largely automate the development of mobile apps, but several current and former employees say the company exaggerates its AI capabilities to attract customers and investors.
  • George Dyson: If you look at the most interesting computation being done on the Internet, most of it now is analog computing, analog in the sense of computing with continuous functions rather than discrete strings of code. The meaning is not in the sequence of bits; the meaning is just relative. Von Neumann very clearly said that relative frequency was how the brain does its computing. It’s pulse frequency coded, not digitally coded. There is no digital code.
  • Brendon Dixon: Because they’ve chosen to not deeply learn their deep learning systems—continuing to believe in the “magic”—the limitations of the systems elude them. Failures “are seen as merely the result of too little training data rather than existential limitations of their correlative approach” (Leetaru). This widespread lack of understanding leads to misuse and abuse of what can be, in the right venue, a useful technology.
  • Ewan Valentine: I could be completely wrong on this, but over the years, I’ve found that OO is great for mapping concepts and domain models together, and holding state. Therefore I tend to use classes to give a name to a concept and map data to it. For example, entities, repositories, and services, things which deal with data and state, I tend to create classes for. Whereas deliveries and use cases, I tend to treat functionally. The way this ends up looking, I have functions, which have instances of classes injected through a higher-order function. The functional code then interacts with the various objects and classes passed into it, in a functional manner. I may fetch a list of items from a repository class, map through them, filter them, and pass the results into another class which will store them somewhere, or put them in a bucket.
  • Timothy Morgan: But what we do know is that the [Cray] machine will weigh in at around 30 megawatts of power consumption, which means it will have more than 10X the sustained performance of the current Sierra system on DOE applications and around 4X the performance per watt. This is a lot better energy efficiency than many might have been expecting – a few years back there was talk of exascale systems requiring as much as 80 megawatts of juice, which would have been very rough to pay for at a $1 per kilowatt per year. With those power consumption numbers, it would have cost $500 million to build El Capitan but it would have cost around $400 million to power it for five years; at 30 megawatts, you are in the range of $150 million, which is a hell of a lot more feasible even if it is an absolutely huge electric bill by any measure.
  • Timothy Prickett Morgan: All of us armchair architecture quarterbacks have been thinking the CPU of the future looks like a GPU card, with some sort of high bandwidth memory that’s really close. 
  • Garrett Heinlen (Netflix): I believe GraphQL also goes a step further beyond REST and it helps an entire organization of teams communicate in a much more efficient way. It really does change the paradigm of how we build systems and interact with other teams, and that’s where the power truly lies. Instead of the back end dictating, “Here are the APIs you receive and here’s the shape in the format you’re going to get,” they express what’s possible to access. The clients have all the power to pull in just the data they need. The schema is the API contract between all teams and it’s a living, evolving source of truth for your organization. Gone are the days of people throwing code over the wall saying, “Good luck, it’s done.” Instead, GraphQL promotes more of a uniform working experience between front end and back end, and I would go further to say even product and design could be involved in this process as well, to understand the business domain that you’re all working within.

Useful Stuff:

  • Fun thread. @jessfraz: Tell me about the weirdest bug you had that caused a datacenter outage, can be anywhere in the stack including human error. @dormando: one day all the sun servers fired temp alarms and shut off. thought AC had died or there was a fire. Turns out cleaners had wedged the DC door open, causing a rapid humidity shift, tricking the sensors. @ewindisch: connection pool leak in a distributed message queue I wrote caused the cascade failure of a datacenter’s network switches. This brought offline a large independent cloud provider around 2013. @davidbrunelle: Unexpected network latency caused TCP sockets to stay open indefinitely on a fleet of servers running an application. This eventually led to PAT exhaustion causing around ~50% of outbound calls from the datacenter to fail causing a DC-wide brownout.
  • What happens when you go from LAMP to serverless: case study of externals.io. 90% of the requests are below 100ms. $17.37/month. Generally low effort migration.
  • By continuously monitoring increases in spend, we end up building scalable, secure and resilient Lambda based solutions while maintaining maximum cost-effectiveness. How We Reduced Lambda Functions Costs by Thousands of Dollars: In the last 7 months, we started using Lambda based functions heavily in production. It allowed us to scale quickly and brought agility to our development activities…We were serving +80M Lambda invocations per day across multiple AWS regions with an unpleasant surprise in the form of a significant bill…once we started running heavy workloads in production, the cost became significant and we spent thousands of dollars daily…to reduce AWS Lambda costs, we monitored Lambda memory usage and execution time based on logs stored in CloudWatch…we created dynamic visualizations on Grafana based on metrics available in the timeseries database and we were able to monitor Lambda runtime usage in near real-time…we gained insights into the right sizing of each Lambda function deployed in our AWS account and we avoided excessive over-allocation of memory, significantly reducing Lambda costs…To gather more insights and uncover hidden costs, we had to identify the most expensive functions. That’s where Lambda tags come into play. We leveraged that metadata to break down the cost per stack…By reducing the invocation frequency (controlling concurrency with SQS), we reduced the cost by up to 99%…we’re evaluating alternative services like Spot Instances & Batch Jobs to run heavy non-critical workloads considering the hidden costs of Serverless…we were using SNS and we had issues with handling errors and Lambda timeouts, so we changed our architecture to use SQS instead and configured a dead letter queue to reduce the number of times the same message can be handled by the Lambda function (avoiding recursion), reducing the number of invocations. (A minimal cost-estimation sketch appears after this list.)
  • Six Shades of Coupling: Content Coupling, Common Coupling, External Coupling, Control Coupling, Stamp Coupling and Data Coupling. (A small sketch contrasting two of the shades appears after this list.)
  • When does redundancy actually help availability?: The complexity added by introducing redundancy mustn’t cost more availability than it adds. The system must be able to run in degraded mode. The system must reliably detect which of the redundant components are healthy and which are unhealthy. The system must be able to return to fully redundant mode.
  • AI Algorithms Need FDA-Style Drug Trials. The problem with this idea is that molecules do not change, whereas software continuously changes, and learning software by definition changes reactively. No static process like a one-and-done drug trial will yield meaningful results. We need a different approach that considers the unique role software plays in systems. Certainly vendors can’t be trusted. Any AI will tell you that. Perhaps create a set of test courses that platforms can be continuously tested and fuzzed against?
  • AWS Lambda is not ready to replace conventional EC2. Why we didn’t brew our Chai on AWS Lambda: Chai Point, India’s largest organized chai retailer, with over 150 stores and 1000+ boxC (IoT-enabled chai and coffee vending machines) designed for corporates, serves approximately 250k cups of chai per day across all channels…Most of Chai Point’s stores and boxC machines typically run between 7 AM and 9 PM…[Lambda cold start is] one of the most critical and deciding factors for us to move the Shark infrastructure back to EC2…AWS Lambda has a limit of 50 MB as the maximum deployment package…it takes 1–2 minutes for logs to appear in CloudWatch, which makes immediate debugging difficult in a test environment…when it comes to deploying it in enterprise solutions where there are inter-service dependencies, I think there is still time, especially for languages like Java.
  • Facebook Performance @Scale 2019 recap videos are now available. 
  • Sharing is caring until it becomes overbearing. Dropbox no longer shares code between platforms. Their policy now is to use the native language on each platform. It is simply easier and quicker to write code twice. And you don’t have to train people on using a custom stack. The tools are native. So when people move on you have not lost critical expertise. The one codebase to rule them all dream dies hard. No doubt it will be back in short order, filtered through some other promising stack.
  • Everyone these days wants your functions. Oracle Functions Now Generally Available. It’s built on the Apache 2.0 licensed Fn Project. Didn’t see much in the way of reviews or on costs.
  • On LeanXcale database. Interview with Patrick Valduriez and Ricardo Jimenez-Peris: There is a class of new NewSQL databases in the market, called Hybrid Transaction and Analytics Processing (HTAP). NewSQL is a recent class of DBMS that seeks to combine the scalability of NoSQL systems with the strong consistency and usability of RDBMSs. LeanXcale’s architecture is based on three layers that scale out independently: 1) KiVi, the storage layer, which is a relational key-value data store, 2) the distributed transactional manager that provides ultra-scalable transactions, and 3) the distributed query engine that enables scaling out both OLTP and OLAP workloads. The storage layer is a proprietary relational key-value data store, called KiVi, which we have developed. Unlike traditional key-value data stores, KiVi is not schemaless, but relational. Thus, KiVi tables have a relational schema, but can also have a part that is schemaless. The relational part enabled us to enrich KiVi with predicate filtering, aggregation, grouping, and sorting. As a result, we can push down all algebraic operators below a join to KiVi and execute them in parallel, thus saving the movement of a very large fraction of rows between the storage layer and the query engine layer.
  • Apollo Day New York City 2019 Recap
    • During his keynote, DeBergalis announced one of Apollo’s most anticipated innovations, Federation, which utilizes the idea of a new layer in the data stack to directly meet developers’ needs for a more scalable, reliable, and structured solution to a centralized data graph.
    • Federation paired with existing features of Apollo’s platform like schema change validation listing creates a flow where teams can independently push updates to product microservices. This triggers re-computation of the whole graph, which is validated and then pushed into the gateway. Once completed, all applications contain changes in the part of the graph that is available to them. These events happen independently, so there is a way to operate, which allows each team to be responsible solely for its piece.
    • Another key concept that DeBergalis detailed was the idea that a “three-legged” stack is emerging in front-end development. The “legs” of this new “stool” that form the basis of this stack are React, Apollo, and Typescript. React provides developers with a system for managing user components, Apollo provides developers a system for managing data, and Typescript provides a foundation underneath that provides static typing end-to-end through the stack.
  • Lesson: sticker shock—in Google Cloud everything costs more than you think it will, but it’s still worth it. Etsy’s Big Data Cloud Migration. Etsy generates a terabyte of data a day; they run hundreds of Hadoop workflows and thousands of jobs daily. They started out on-prem and migrated to the cloud over a year and a half ago, driven by needing both the machine and people resources required to keep up with machine learning and data processing tasks. Moving into the cloud decoupled systems so groups can operate independently. With their on-prem system they didn’t worry about optimization, but in the cloud you must, because the cloud will do whatever you tell it to do—at a price. In the cloud there’s a business case for making things more efficient. They rearchitected as they moved over. Managed services were a huge win. As they grew bigger they simply didn’t have the resources and the expertise to run all the needed infrastructure. That’s now Google’s job. This allowed having more generalized teams. It would be impossible for their team of 4 to manage all the things they use in GCP. Specialization is not required to run things. If you need it you just turn it on. That includes services like BigTable, k8s, Cloud Pub/Sub, Cloud Dataflow, and AI. It allows Etsy to punch above their weight class. They have a high level of support, with Google employees embedded on their team. Etsy didn’t lift and shift; they remade the platform as they moved over. If they had to do it over again they might have tried for a middle road, changing things before the migration.
  • Facebook Systems @Scale 2019 recap videos are now available.
  • The human skills we need in an unpredictable world. Efficiency and robustness trade off against each other. The more efficient something is, the less slack there is to handle the unexpected. When you target efficiency you may be making yourself more vulnerable to shocks.
  • The lesson is, you can’t wait around for Netflix or anyone else to promote your show. It’s up to you to create the buzz. How a Norwegian Viking Comedy Producer Hacked Netflix’s Algorithm: The key to landing on Netflix’s radar, he knew, would be to hack its recommendation engine: get enough people interested in the show early…Three weeks before launch, he set up a campaign on Facebook, paying for targeted posts and Facebook promotions. The posts were fairly simple — most included one of six short (20- to 25-second) clips of the show and a link, either to the show’s webpage or to media coverage. They used so-called A/B testing — showing two versions of a campaign to different audiences and selecting the most successful — to fine-tune. The U.S. campaign didn’t cost much — $18,500, which Tangen and his production partners put up themselves — and it was extremely precise. In just 28 days, the Norsemen campaign reached 5.5 million Facebook users, generating 2 million video views and some 6,000 followers for the show. Netflix noticed. “Three weeks after we launched, Netflix called me: ‘You need to come to L.A., your show is exploding,'” Tangen recalls. Tangen invested a further $15,000 to promote the show on Facebook worldwide, using what he had learned during the initial U.S. campaign.
  • How did NASA Steer the Saturn V? Triply redundant in logic, doubly redundant in memory. Two channels are compared to make sure they return the same answer. If the numbers don’t match, a subroutine is called to determine which number makes the most sense at that point in the flight. Across all Saturn flights there were fewer than 10 miscompares. More components mean less reliability, yet it never had a catastrophic failure. The biggest problem was vibration. (A toy majority-vote sketch appears after this list.)
  • Interesting idea: instead of interviews, use how well a candidate performs on training software to determine how well the candidate knows a set of skills. The Cloudcast – Cloud Computing. The role of the generalist is gone. Pick a problem people are struggling with, become an expert at solving that problem, and market yourself as the person who has the skill of solving it.
  • The end state for any application is to write its own scheduler. Making Instagram.com faster: Part 1. Use preload tags to start fetching resources as soon as possible. You can even preload GraphQL requests to get a head start on those long queries. Preloads have a higher network priority. Use a preload tag for all script resources and place them in the order they will be needed. Load in new batches before the user hits the end of their current feed. A prioritized task abstraction handles queueing of asynchronous work (in this case, a prefetch for the next batch of feed posts). If the user scrolls close enough to the end of the current feed, the priority of this prefetch task is raised to ‘high’ by cancelling the pending idle callback and firing off the prefetch immediately. Once the JSON data for the next batch of posts arrives, a sequential background prefetch is queued for all the images in that preloaded batch of posts. The images are prefetched sequentially in the order the posts are displayed in the feed, rather than in parallel, so the download and display of images in posts closest to the user’s viewport can be prioritized. (A rough sketch of the priority-bump idea appears after this list.) Also Preemption in Nomad — a greedy algorithm that scales
  • Native lazy loading has arrived! Adding the loading attribute to the images decreased the load time on a fast network connection by ~50% — it went from ~1 second to < 0.5 seconds, as well as saving up to 40 requests to the server 🎊. All of those performance enhancements just from adding one attribute to a bunch of images!
  • Maybe it should just be simpler to create APIs? Simple Two-way Messaging using the Amazon SQS Temporary Queue Client. It seems a lot of people use queues for front-end/back-end communication because a queue is simpler to set up and easier to secure than an HTTP endpoint. So AWS came up with virtual queues that let you multiplex many virtual queues over one physical queue, at no extra cost. It’s all done on the client. A clever tag-based heartbeat mechanism is used to garbage collect queues. (A hedged sketch of the underlying request-response pattern appears after this list.)
  • Monolith to Microservices to Serverless — Our journey: A large part of our technology stack at that time comprised a Spring-based application and a MySQL database running on VMs in a data centre…The application was working for our thousands of customers, day in, day out, with little to no downtime. But it couldn’t be denied that new features were becoming difficult to build and the underlying infrastructure was beginning to struggle to scale as we continued to grow as a business…We needed a drastic rethink of our infrastructure and that came in the shape of Docker containers and Kubernetes…We took a long hard look at our codebase and with the ‘independent loosely coupled services’ mantra at the forefront of our minds we were quickly able to break off large parts of the monolith into smaller, much more manageable services. New functionality was designed and built in the same way and we were quickly up to a 2-node K8s cluster with over 35 running pods…Fast forward to today and we have now been using AWS for well over 2 years; we have migrated the core parts of our reporting suite into the cloud and, where appropriate, all new functionality is built using serverless AWS services. Our ‘serverless first’ ethos allows us to build highly performant and highly scalable systems that are quick to provision and easy to manage.
  • This is Crypto 101. Security Now 727 BlackHat & DefCon. Steve Gibson details how electronic hotel locks can protect themselves against replay attacks: All that’s needed to prevent this is for the door, when challenged to unlock, to provide a nonce for the phone to sign and return. The door contains a software ratchet: a counter which feeds a secretly-keyed AES symmetric cipher. Each door lock is configured with its own secret key which is never exposed. The AES cipher, which encrypts the counter, produces a public elliptic key which is used to verify signatures. So the door lock first checks the key that is currently valid and in use. If that fails, it checks ahead to the next public key to see whether that one can verify the returned signature. If not, it ignores the request. But if the next key does successfully verify the request’s signature, it makes the next key permanent, ratcheting forward and forgetting the previous no-longer-valid key. This means the door locks do not need to communicate with the hotel. Each door lock operates autonomously with its own secret key, which determines the sequence of its public keys. The hotel system knows each room’s secret key, so it’s able to issue the proper private signing key to each guest for the proper room. If the system is designed correctly, no one with a copy of the Mobile Key software and the ability to eavesdrop on the conversation is able to gain any advantage from doing so. (A heavily simplified sketch appears after this list.)
  • Trip report: Summer ISO C++ standards meeting (Cologne). Reddit trip report. C++20 is now feature complete. Added: modules, coroutines, concepts including in the standard library via ranges, <=> spaceship including in the standard library, broad use of normal C++ for direct compile-time programming, ranges, calendars and time zones, text formatting, span, and lots more. Contracts were pulled from C++20 and deferred for a future standard.
  • Ingesting data at “Bharat” Scale: Initially, we considered Redis for our failover store, but serving an average ingestion rate of 250K events per second, we would end up needing large Redis clusters just to support minutes’ worth of panic in our message bus. Finally, we decided to use a failover log producer that writes logs locally to disk and periodically rotates & uploads them to S3…We’ve seen outages where our origin crashes and, as it tries to recover, is inundated with client retries & pending requests in the surge queue. That’s a recipe for cascading failure…We want to continue to serve the requests we can sustain; for anything over that, sorry, no entry. So we added a rate limit to each of our API servers. We arrived at this configuration after a series of simulations & load tests, to truly understand at what RPS our boxes will not sustain the load. We use nginx to control the number of requests per second using a leaky bucket algorithm. The target-tracking scaling trigger is 3/4 of the rate limit, to allow room to scale; but there are still occasions where large surges are too quick for target-tracking scaling to react. (A toy leaky-bucket limiter appears after this list.)
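
For the Lambda cost item above, a minimal sketch of the core idea, assuming the standard REPORT line format Lambda writes to CloudWatch Logs. The price constant and the half-memory threshold are illustrative, not the team's actual setup:

```python
import re

GB_SECOND_PRICE = 0.0000166667  # USD; illustrative, check current AWS pricing

REPORT_RE = re.compile(
    r"Billed Duration: (?P<billed_ms>\d+) ms\s+"
    r"Memory Size: (?P<mem_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<used_mb>\d+) MB"
)

def analyze(report_lines):
    total_cost, overallocated = 0.0, 0
    for line in report_lines:
        match = REPORT_RE.search(line)
        if not match:
            continue
        billed_s = int(match["billed_ms"]) / 1000
        mem_gb = int(match["mem_mb"]) / 1024
        total_cost += billed_s * mem_gb * GB_SECOND_PRICE
        # Flag invocations using less than half their allocated memory.
        if int(match["used_mb"]) * 2 < int(match["mem_mb"]):
            overallocated += 1
    return total_cost, overallocated

sample = [
    "REPORT RequestId: abc Duration: 102.3 ms Billed Duration: 200 ms "
    "Memory Size: 1024 MB Max Memory Used: 180 MB"
]
print(analyze(sample))  # -> (cost estimate in USD, over-allocated invocation count)
```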
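
For the Six Shades of Coupling item, a hypothetical contrast between two of the shades: control coupling, where the caller passes a flag that steers the callee's internal logic, versus data coupling, where the callee receives only the data it needs. The render functions are invented for illustration:

```python
# Control coupling: 'as_pdf' is a control flag that dictates the callee's flow.
def render(document, as_pdf):
    if as_pdf:
        return document.to_pdf()
    return document.to_html()

# Data coupling: each callee receives only the data it needs and owns its own
# control flow. Looser, easier to test, and easier to reuse.
def render_pdf(document):
    return document.to_pdf()

def render_html(document):
    return document.to_html()
```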
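
For the Saturn V item, a toy majority-vote sketch of the triple-redundancy scheme described: any two agreeing channels win, and a fallback stands in for the "which number makes the most sense" subroutine. The median fallback is an invented placeholder, not NASA's actual logic:

```python
def vote(a, b, c, fallback):
    # Triple modular redundancy: any two matching channels win.
    if a == b or a == c:
        return a
    if b == c:
        return b
    # Miscompare: defer to a flight-phase plausibility check.
    return fallback(a, b, c)

def pick_most_plausible(a, b, c):
    # Illustrative fallback: take the median reading.
    return sorted((a, b, c))[1]

print(vote(42, 42, 41, pick_most_plausible))  # -> 42 (two channels agree)
print(vote(40, 42, 44, pick_most_plausible))  # -> 42 (median fallback)
```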
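
For the Instagram item, a rough sketch of the priority-bump idea, written in Python for consistency with the other examples here rather than the JavaScript Instagram actually uses. Queue a low-priority prefetch for the next feed batch, then cancel and re-queue it at high priority when the user nears the end of the feed; all names are invented:

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # lower value is served first

class PrefetchQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps FIFO order
        self._entries = {}                  # task name -> live heap entry

    def schedule(self, name, task, priority=LOW):
        entry = [priority, next(self._counter), name, task]
        self._entries[name] = entry
        heapq.heappush(self._heap, entry)

    def promote(self, name):
        # Mirror "cancelling the pending idle callback": tombstone the
        # low-priority entry and re-queue the task at high priority.
        entry = self._entries.pop(name, None)
        if entry is not None:
            task, entry[3] = entry[3], None
            self.schedule(name, task, HIGH)

    def run_next(self):
        while self._heap:
            _, _, name, task = heapq.heappop(self._heap)
            if task is not None:            # skip tombstoned entries
                self._entries.pop(name, None)
                return task()

queue = PrefetchQueue()
queue.schedule("feed-batch-2", lambda: "prefetched batch 2")
queue.promote("feed-batch-2")   # user scrolled near the end of the feed
print(queue.run_next())         # -> prefetched batch 2
```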
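
For the SQS item: the Temporary Queue Client is a Java library, so what follows is only a hedged boto3 sketch of the underlying request-response pattern it optimizes, where the requester names a reply queue in a message attribute and the responder answers there. The real client's virtual-queue multiplexing and heartbeat-based garbage collection are omitted:

```python
import uuid

import boto3

sqs = boto3.client("sqs")

def request(request_queue_url, body):
    # Requester: create a short-lived reply queue and reference it in a
    # message attribute, then wait for the answer.
    reply_queue = sqs.create_queue(QueueName=f"reply-{uuid.uuid4().hex}")["QueueUrl"]
    sqs.send_message(
        QueueUrl=request_queue_url,
        MessageBody=body,
        MessageAttributes={"ReplyTo": {"DataType": "String", "StringValue": reply_queue}},
    )
    msgs = sqs.receive_message(QueueUrl=reply_queue, WaitTimeSeconds=20).get("Messages", [])
    sqs.delete_queue(QueueUrl=reply_queue)
    return msgs[0]["Body"] if msgs else None

def respond_once(request_queue_url):
    # Responder: read a request and send the answer to its reply queue.
    msgs = sqs.receive_message(
        QueueUrl=request_queue_url,
        MessageAttributeNames=["ReplyTo"],
        WaitTimeSeconds=20,
    ).get("Messages", [])
    for msg in msgs:
        reply_to = msg["MessageAttributes"]["ReplyTo"]["StringValue"]
        sqs.send_message(QueueUrl=reply_to, MessageBody=f"handled: {msg['Body']}")
        sqs.delete_message(QueueUrl=request_queue_url, ReceiptHandle=msg["ReceiptHandle"])
```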
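
For the hotel lock item, a heavily simplified sketch of the ratchet using the cryptography package. Deriving a deterministic Ed25519 keypair per counter value by AES-encrypting the counter is an illustrative reading of Gibson's description, not the actual product's key-derivation scheme:

```python
import os

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def signing_key(secret: bytes, counter: int) -> Ed25519PrivateKey:
    # Illustrative derivation: AES-encrypt two counter blocks under the
    # lock's secret key to get a deterministic 32-byte Ed25519 seed.
    enc = Cipher(algorithms.AES(secret), modes.ECB()).encryptor()
    block = counter.to_bytes(8, "big")
    seed = enc.update(block + b"\x00" * 8) + enc.update(block + b"\x01" * 8)
    return Ed25519PrivateKey.from_private_bytes(seed)

class DoorLock:
    def __init__(self, secret: bytes):
        self.secret, self.counter = secret, 0

    def challenge(self) -> bytes:
        self.nonce = os.urandom(16)   # a fresh nonce defeats replay
        return self.nonce

    def try_unlock(self, signature: bytes) -> bool:
        for step in (0, 1):           # current key first, then the next key
            pub = signing_key(self.secret, self.counter + step).public_key()
            try:
                pub.verify(signature, self.nonce)
                self.counter += step  # ratchet forward if the next key matched
                return True
            except InvalidSignature:
                continue
        return False

secret = os.urandom(32)
lock = DoorLock(secret)
guest_key = signing_key(secret, 1)    # hotel issues the key for the next counter
sig = guest_key.sign(lock.challenge())
print(lock.try_unlock(sig))           # True: lock ratchets forward to counter 1
lock.challenge()                      # new nonce for the next attempt
print(lock.try_unlock(sig))           # False: the old signature is a replay
```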
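
For the "Bharat" scale item, a toy leaky-bucket limiter in the spirit of the nginx mechanism mentioned: requests drain at a fixed rate, a bounded burst absorbs spikes, and anything beyond that is rejected outright. The parameters are placeholders to be tuned from load tests:

```python
import time

class LeakyBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate, self.burst = rate_per_s, burst
        self.level, self.last = 0.0, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain the bucket at the configured rate for the elapsed time.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level < self.burst:
            self.level += 1
            return True
        return False            # over the limit: sorry, no entry

limiter = LeakyBucket(rate_per_s=100, burst=20)  # placeholders; tune via load tests
print(sum(limiter.allow() for _ in range(50)))   # roughly 20 pass in a tight loop
```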

Soft Stuff:

  • jedisct1/libsodium: Sodium is a new, easy-to-use software library for encryption, decryption, signatures, password hashing and more. It is a portable, cross-compilable, installable, packageable fork of NaCl, with a compatible API, and an extended API to improve usability even further. Its goal is to provide all of the core operations needed to build higher-level cryptographic tools.
  • amejiarosario/dsa.js-data-structures-algorithms-javascript: In this repository, you can find the implementation of algorithms and data structures in JavaScript. This material can be used as a reference manual for developers, or you can refresh specific topics before an interview. Also, you can find ideas to solve problems more efficiently.
  • linkedin/brooklin: Brooklin is a distributed system intended for streaming data between various heterogeneous source and destination systems with high reliability and throughput at scale. Designed for multitenancy, Brooklin can simultaneously power hundreds of data pipelines across different systems and can easily be extended to support new sources and destinations.
  • gojekfarm/hospital: Hospital is an autonomous healing system for any system. Any failures or faults occurring in the system will be resolved automatically by the Hospital according to a given run-book, without manual intervention.
  • BlazingDB/pyBlazing: BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
  • serverless/components: Forget infrastructure — Serverless Components enables you to deploy entire serverless use-cases, like a blog, a user registration system, a payment system or an entire application — without managing complex cloud infrastructure configurations.

Pub Stuff:

  • Zooming in on Wide-area Latencies to a Global Cloud Provider: The network communications between the cloud and the client have become the weak link for global cloud services that aim to provide low-latency services to their clients. In this paper, we first characterize WAN latency from the viewpoint of a large cloud provider, Azure, whose network edges serve hundreds of billions of TCP connections a day across hundreds of locations worldwide.
  • What is Applied Category Theory? Two themes that appear over and over (and over and over and over) in applied category theory are functorial semantics and compositionality. 
  • ML can never be fair. On Fairness and Calibration: In this paper, we investigate the tension between minimizing error disparity across different population groups while maintaining calibrated probability estimates. We show that calibration is compatible only with a single error constraint (i.e. equal false-negatives rates across groups), and show that any algorithm that satisfies this relaxation is no better than randomizing a percentage of predictions for an existing classifier. These unsettling findings, which extend and generalize existing results, are empirically confirmed on several datasets.

from High Scalability

Digital Transformation Boosts Manufacturing Agility, Competitiveness


Few industries face the level of global competition that manufacturing does. To compete and realize the promise of Industry 4.0, manufacturers are increasingly embracing digital transformation. Evolving the business — from the manufacturing floor to the sales office — is a holistic effort that requires a smart IT roadmap strategy and effective execution. In today’s blog, we’re taking a look at the digital transformation journey of several manufacturers and how it has benefited their productivity, efficiency and ultimately their strategic market position.

Drive Scientific Innovation with DevOps Automation
While the current shortage of digital talent in manufacturing is “very high”, according to research by The Manufacturing Institute, DevOps automation increases employee efficiency by creating a platform that enables researchers, engineers, and scientists to focus on their core work. And so it was that the Infrastructure Engineering team at Toyota Research Institute decided to support its researchers and engineers by making it easier for them to utilize the power of the cloud with automation.

Working with the Flux7 DevOps consulting team, they implemented DevOps methods, processes, and automation that reduce tactical, manual IT operations activities. Researchers and engineers can use a self-serve portal to easily and quickly provision the AWS assets they need to test new ideas, helping them become more productive since they no longer wait for the infrastructure team to spin up resources.

Having a secure cloud sandbox environment enables them to try new ideas, fail fast, destroy the sandbox if needed, and start over, enabling researchers to innovate at velocity and at scale. According to Mike Garrison, the technical lead for Infrastructure Engineering at TRI, as quoted in DevOps.com, “Modern cloud infrastructure and DevOps automation are empowering us to quickly remove any barriers that get in the way, allowing the team to do their best work, advance research quickly, push boundaries and transform the industry.”

Similarly, Flux7 worked with a large US manufacturer to adopt elastic high-performance computing (HPC) that facilitates the company’s scientific simulations for various aspects of designing new machinery. These HPC simulations were hosted in the company’s traditional data center, yet required scalability to meet dynamic demand, which meant advance planning and a great deal of capital expense. Moving its HPC simulations to the cloud meant it could innovate for the future faster, with scalable, dynamic capacity, while greatly reducing internal resource overhead and costs.

IoT for Industry 4.0

Linking IoT devices with the cloud and analytics infrastructure can unlock critical real-time data that enables preventive maintenance and extends system productivity. This kind of data can help staff proactively address issues before they occur thus creating greater system uptime, overall equipment effectiveness, and a greater ROI for capital equipment. For a large equipment manufacturer looking to gather data from its geographically dispersed machines, Flux7 helped set up an AWS IoT infrastructure.

The two teams modernized and migrated several applications to the cloud, connecting them with a new AWS Data Lake. (AWS recently announced AWS Lake Formation to help with this process; check out our blog on it here.) The new system collects important data from the field, processing it to make predictions helpful to its customers’ operations. The data also helps with machine maintenance schedules, ensuring that machines are serviced appropriately, thus increasing uptime. Moreover, processes that previously took days were reduced to 15 minutes, freeing developer time for strategic work while creating a new revenue stream for the manufacturer.

Set the foundation for the Agile Enterprise

While becoming an Agile Enterprise will help manufacturers realize the promise of Industry 4.0, digital transformation is a journey that requires a smart roadmap and solid execution. Flux7 partnered with a Fortune 500 manufacturer in its Agile Enterprise evolution. The company reached out to AWS premier consulting partner Flux7 to help it embark on a digital transformation that would eventually work its way through the company’s various departments — from enterprise architecture to application development and security — and business units, such as embedded systems and credit services.

The transformation started with a limited Amazon cloud migration and moved on to include:

  • IoT and an AWS Data Lake,
  • EU data privacy regulatory compliance,
  • Serverless monitoring and notification, with a goal of using advanced automation to alert operations and information security teams to any known issues surfacing in the account or violations of the corporate security standard,
  • Advanced automation to simplify maintenance and improve security and compliance, and
  • Amazon VPC automation for faster onboarding.

The outcome has been a complete agile adoption of Flux7’s Enterprise DevOps Framework for greater security, cost efficiencies, and reliability. Enabled by solutions that connect its equipment and customer communities, the digital transformation effectively supports the company’s ultimate goal to create an unrivaled experience for its customers and partners.

From smart production to smart logistics and even smart product design and smarter sales and marketing efforts, a technology-driven transformation will help manufacturers achieve greater fault-tolerance, productivity, and ultimately revenue.



from Flux7 DevOps Blog

Build A Best Practice AWS Data Lake Faster with AWS Lake Formation



The world’s first gigabyte hard drive was the size of a refrigerator — and that wasn’t all that long ago. Clearly, technology has evolved, and so have our data storage and analysis needs. With data serving a key role in helping companies unearth intelligence that can provide a competitive advantage, solutions that allow organizations to end data silos and help create actionable business outcomes from intelligent data analysis are gaining traction. 

According to the 2018 Big Data Trends and Challenges report by Dimensional Research, the number of firms with an average data lake size over 100 terabytes has grown from 36% in 2017 to 44% in 2018. That trend is sure to continue, especially as cloud providers like AWS introduce services such as the newly announced AWS Lake Formation that help streamline the process of creating and managing a data lake. In today’s blog, we take a look at the new AWS Lake Formation service and share our take on its features, benefits, and things we’d like to see in the next version of the service.

What is AWS Lake Formation

AWS Lake Formation is the newest service from AWS. It is designed to streamline the process of building a data lake in AWS, creating a full solution in just days. At a high level, AWS Lake Formation provides best practice templates and workflows for creating data lakes that are secure, compliant and operate effectively. The overall goal is to provide a solution that is well architected to identify, ingest, clean and transform data while enforcing appropriate security policies to enable firms to focus on gaining new insights, rather than building data lake infrastructure.

Before the release of AWS Lake Formation, organizations would need to take several steps to build their data lake. Not only was the process time-consuming, but there were several points in the process that proved difficult for the average operator. For example, users needed to set up their own Amazon S3 storage; deploy AWS Glue to prepare the data for analysis through the automated extract, transform and load (ETL) process; configure and enforce security policies; ensure compliance and more.  Each part of the process offered room for missteps, making the overall data lake set up challenging and a month+ long process for many.

AWS Data Lake Benefits

AWS has addressed many of these challenges with AWS Lake Formation, which offers three key areas of benefit, plus one supporting feature we think is neat.

  1. Templates – The new AWS Lake Formation provides templates for a number of things. We are most excited about the templates for AWS Glue which is important as this is an area where many organizations find they need to loop in AWS engineering for best practice help. Glue templates show that AWS really is listening to its customers and providing guidance where they need it most. In addition, our AWS consulting team was really happy to see templates that simplify the import of data and templates for the management of long-running cron jobs. These reusable templates will streamline each part of the data lake process.
  2. Cloud Security Solutions – Data is the lifeblood of an organization and for many companies, it is the foundation of their IP. As a result, sound security (and compliance) must be a key consideration for any data lake solution. AWS is definitely singing from that hymn book with AWS Lake Formation, as they have created opportunities for security at the most granular of levels — not just securing the S3 bucket, but the data catalog as well. For example, at the data catalog level, you could specify which columns of data a Lambda function can read, or revoke a user’s permissions to a specific database. (AWS notes that row-level tagging will be in a future version of the solution. A hedged sketch of such a grant appears after this list.)
  3. Machine Learning Transformations – AWS provides algorithms for its customers to create their own machine learning solutions. AWS cites record de-duplication as a use case here, illustrating how ML can help clean and update data. However, we see this feature as being particularly interesting to firms in industries like pharmaceuticals where a company could, for example, use it to mine and predictively match chemical patterns to patients or in the oil and gas industry where ML can be applied to learn from field-based data points to maximize oil production.
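
As promised in item 2, a hedged boto3 sketch of column-level grants: allow a role to SELECT just two columns of a table, then revoke a user's access to a whole database. All names and ARNs are placeholders:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on specific columns only (e.g., to a Lambda's execution role).
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/etl-lambda"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    Permissions=["SELECT"],
)

# Revoke a departing user's permissions on an entire database.
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/alice"},
    Resource={"Database": {"Name": "sales"}},
    Permissions=["ALL"],
)
```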

Also neat, but not marquee-stealing, is the AWS Lake Formation feature that allows users to add metadata and tag data catalog objects. For developers, in particular, this is a nice-to-have feature as it will allow them to more easily search all this data. Separately, we also like that AWS Lake Formation users will only pay for the underlying services used and that there are no additional charges.  

Ready to Swim?

One feature we’d like to see in an upcoming release of Lake Formation is integration with directory services like Active Directory (AD). This will help further streamline the process of controlling data access, ensuring permissions are revoked when, for example, an employee leaves the organization or changes workgroups.

Moreover, while AWS Lake Formation greatly streamlines the process of building a data lake, being able to create your own templates moving forward may still remain a challenge for some organizations. At Flux7, we teach organizations how to build, manage and maintain templates for this — and many other AWS solutions — and can help your team ensure your templates incorporate Well Architected best practice standards on an ongoing basis.

Ready to dive into your own AWS data lake solution? Check out our AWS Data Lake solution case study on how a healthcare provider addressed its rapid data expansion and data complexity with AWS and Flux7 DevOps consulting, enabling it to quickly analyze information and make important data connections. Impact your time to market, customer experience and market position today with our AWS database services.


from Flux7 DevOps Blog

IT Modernization and DevOps News Week in Review


DevOps Blog IT Modernization DevOps News

Container security was top of mind this week as Kubernetes announced the results of its first security audit. The review looked at Kubernetes 1.13.4 and found 37 vulnerabilities, including five high-severity and 17 medium-severity issues. We are happy to report that fixes for these issues have already been deployed.

Container security was also top of mind for McAfee, which said this week it has acquired NanoSec, a California container security startup. Meanwhile, the Cloud Security Alliance introduced its Egregious Eleven, the most salient threats, risks and vulnerabilities in cloud environments identified in its Fourth Annual Top Threats survey. Two key themes emerged this year: a maturation in the understanding of the cloud, and respondents’ desire to address security issues higher up the technology stack that are the result of senior management decisions. While you can check out the report yourself, the top concerns are: Data Breaches; Misconfiguration and Inadequate Change Control; Lack of Cloud Security Architecture and Strategy; and Insufficient Identity, Credential, Access and Key Management.

To stay up-to-date on DevOps security, CI/CD and IT Modernization, subscribe to our blog here:

Subscribe to the Flux7 Blog

DevOps News

  • This past week HashiCorp released an official Helm Chart for Vault. Operators can reduce the complexity of running Vault on Kubernetes with the new Helm Chart, which provides a repeatable deployment process in less time; HashiCorp reports that the Helm Chart lets operators start a Vault cluster on Kubernetes in just minutes. Because the chart runs Vault directly on Kubernetes, any other tool built for Kubernetes can choose to leverage Vault alongside Vault’s own native integrations. Note that a Helm Chart for Vault Enterprise will be available in the future.
  • In response to feedback, GitHub is bringing CI/CD support to GitHub Actions. Available November 13, the new support will allow users to easily automate how they build, test, and deploy projects across platforms — Linux, macOS, and Windows — in containers or virtual machines, and across languages and frameworks such as Node.js, Python, Java, PHP, Ruby, C/C++, .NET, Android, and iOS. GitHub Actions is an API that orchestrates workflows based on any event, while GitHub manages the execution, provides rich feedback, and secures every step along the way.
  • Jenkins monitoring got a boost this week as Instana announced the addition of Jenkins monitoring to its automatic Application Performance Management (APM) solution, part of its focus on adding performance management for systems in other steps of the application delivery process. According to Peter Abrams, the company’s COO and co-founder, “A common theme amongst Instana customers is the need to deliver and deploy quality applications faster, and Jenkins is a critical component of that delivery process.” The new capabilities include performance visibility of individual builds and deployments, and health monitoring of the Jenkins tool stack.

AWS News 

  • The long-awaited AWS Lake Formation is now generally available. Introduced at re:Invent last fall, Lake Formation makes it easy to ingest, clean, catalog, transform, and secure data, making it available for analytics and machine learning. Operators work from a central console to manage their data lake and are able to configure the right access permissions and secure access to metadata in the Glue Data Catalog and data stored in S3 using a single set of granular data access policies defined in Lake Formation. AWS Lake Formation notably works with data already in S3, allowing operators to easily register their existing data with Lake Formation.
  • In related news, it was announced that Amazon Redshift Spectrum now supports column-level access control for data stored in Amazon S3 and managed by AWS Lake Formation. Column-level access control limits access to specific columns of a table rather than all of its columns, a key part of the data governance and security needs of many enterprises.
  • Our AWS Consulting team enjoyed these two AWS blogs. The first, Auto-populate instance details by integrating AWS Config with your ServiceNow CMDB, shares how to keep a CMDB accurate by integrating AWS Config with ServiceNow so that configuration change notifications automatically create server records in the CMDB, and then walks through testing the setup.
  • Focused on security by design, we are always interested in how to securely share keys, so the blog How to deploy CloudHSM to securely share your keys with your SaaS provider caught our attention. In it, Vinod Madabushi shares two options for deploying and managing a CloudHSM cluster to secure keys while still allowing trusted third-party SaaS providers to securely access the HSM cluster to perform cryptographic operations.
  • Amazon announced that operators can now use AWS PrivateLink in the AWS GovCloud (US-East) Region. Already available in several other regions, AWS PrivateLink allows operators to privately access services hosted on AWS without using public IPs and without requiring the traffic to traverse the internet.

Flux7 News

  • Read our latest AWS Case Study, the story of how Flux7 DevOps consultants teamed with a global retailer to create a platform for scalable innovation. To accelerate its cloud migration and standardize its development efforts, the joint client-Flux7 team identified a solution: a DevOps Dashboard that would automatically apply the company’s various standards as cloud infrastructure is deployed. 
  • For CIOs and technology leaders looking to lead the transition to an Agile Enterprise, Flux7 has published a new paper on How CIOs Can Prepare an IT Platform for the Agile Enterprise. Download it today to learn how a technology platform that supports agility with IT automation and DevOps best practices can be a key lever to helping IT engage with and improve the business. 

Download the Paper Today

Written by Flux7 Labs

Flux7 is the only Sherpa on the DevOps journey that assesses, designs, and teaches while implementing a holistic solution for its enterprise customers, thus giving its clients the skills needed to manage and expand on the technology moving forward. Not a reseller or an MSP, Flux7 recommendations are 100% focused on customer requirements and creating the most efficient infrastructure possible that automates operations, streamlines and enhances development, and supports specific business goals.

from Flux7 DevOps Blog

Global Retailer Standardizes Hybrid Cloud with DevOps Dashboard

From luxury to grocery, the retail war continues. While some would say we’re witnessing a retail apocalypse, others contend it’s really the death of the boring middle. (HT Steve Dennis) With a vision to innovate and extend its leadership in this competitive environment, our newest customer, a top 50 global retailer, approached the DevOps consulting team at Flux7. Today’s blog is the story of how Flux7 DevOps consultants teamed with the retailer to create a platform for scalable innovation.

Read More: Download the full case study 

Growing geographically and looking to support its thousands of locations with innovative new solutions, this retailer has embraced digital transformation, starting with an AWS migration. Doing so, however, meant moving hundreds of applications from different on-premises platforms, a task that required the retailer’s IT teams to consistently ensure that operational, security and regulatory standards were maintained.

To standardize and accelerate its development efforts on AWS, the joint client-Flux7 team identified a solution: a DevOps Dashboard that would automatically apply the company’s various standards as cloud infrastructure is deployed.

The DevOps Dashboard

The DevOps Dashboard standardizes infrastructure creation and streamlines the process of developing applications on AWS. Developers can quickly start and/or continue development of their applications on AWS using the dashboard. Developers simply enter parameters into the UI and behind the scenes, the dashboard triggers pipelines to deploy infrastructure, connects to a repository, deploys code and sets up the environment. 

The DevOps Dashboard also features:

  • Infrastructure provisioning defined and implemented as code  
  • The ability to create ECS, EKS, and Serverless infrastructure in AWS
  • Jenkins automation to provision infrastructure and deploy sample apps to new and/or existing repositories
  • The ability to create a repository or use an existing one and implement a Webhook for continuous deployment 
  • A standard repository structure
  • The ability to automatically update/push the code of new sample applications to the appropriate environment (Dev/QA/Production) once placed in the repository.

DevOps Dashboard Benefits

Using the DevOps Dashboard allows developers to work on the code repository while their code or application is automatically deployed to the selected environment. Engineers can focus on editing applications rather than worrying about infrastructure standard compliance. The result of this advanced DevOps automation is higher quality code delivered faster, which means teams can quickly experiment and get winning ideas to market sooner.

In addition, the DevOps Dashboard increases the retailer’s development agility while increasing its consistency and standardization of cloud builds across its hybrid cloud environment. Greater standardization has resulted in less risk, greater security, and compliance as code. 

For further reading on how Flux7 helps retailers establish an agile IT platform that harnesses the power of automation to grow IT productivity, download the full case study above.

For ongoing case studies, DevOps news and analysis, subscribe to our blog:

Subscribe to the Flux7 Blog

from Flux7 DevOps Blog

IT Modernization and DevOps News Week in Review

At IBM’s Investor Briefing 2019, CEO Ginni Rometty addressed questions about the future of Red Hat now that the acquisition has closed. Framing what she calls Chapter Two of the cloud, she noted that Red Hat brings the vehicle. “Eighty percent is still to be moved into a hybrid cloud environment,” she said, noting further that, “Hybrid cloud is the destination because you can modularize apps.” The strategy moving forward is to scale Red Hat, selling more IBM services tied to Red Hat while optimizing the IBM portfolio for Red Hat OpenShift, a move Rometty called “middleware everywhere.”

To stay up-to-date on DevOps security, CI/CD and IT Modernization, subscribe to our blog here:

Subscribe to the Flux7 Blog

DevOps News

  • HashiCorp announced the public availability of HashiCorp Vault 1.2. According to the company, the new features focus on supporting new architectures for automated credential and cryptographic key management at a global, highly distributed scale. Specifically, the release includes the KMIP Server Secret Engine (Vault Enterprise only), which allows Vault to serve as a KMIP server for automating secrets-management and encryption-as-a-service workflows with enterprise systems; integrated storage; identity tokens; and database static credential rotation.
  • CodeStream is now available for deployment through the Slack app store. With CodeStream, developers can more easily use Slack to discuss code; instead of cutting and pasting, developers can now share code blocks in context right from their IDE. Replies can be made in Slack or CodeStream, and in either case, they become part of the thread that is permanently linked to the code.
  • Armory announced it has raised $28M to fund further development of Spinnaker, the open-source, multi-cloud continuous delivery platform used by developers to release quality software with greater speed and efficiency.
  • Our DevOps consulting team enjoyed Mike Cohn’s article, Overcoming Four Common Objections to the Daily Scrum, in which he discusses best practices for well-run daily Scrums.

AWS News

  • Operators can now use AWS CloudFormation templates to specify AWS IoT Events resources. According to the firm, this improvement enables you to use CloudFormation to deploy AWS IoT Events resources—along with the rest of your AWS infrastructure—in a secure, efficient, and repeatable way. The new capability is available now where IoT Events are available.
  • Amazon has added a new Predictions category to its Amplify Framework, allowing operators to easily add and configure AI/ML use cases in their web and/or mobile applications.
  • In a move toward greater transparency, Amazon has launched the AWS CloudFormation Coverage Roadmap. In it, AWS shares its priorities for CloudFormation in four areas: features that have shipped and are production-ready; features on the near horizon that you should expect to see within the next few months; longer-term features actively being worked on; and features being researched.
  • AWS introduced the availability of the Middle East (Bahrain) Region, the first AWS Region in the Middle East; it comprises three Availability Zones.
  • Our AWS Consulting team enjoyed this AWS blog, Analyzing AWS WAF logs with Amazon ES, Amazon Athena, and Amazon QuickSight, by Aaron Franco in which he discusses how to aggregate AWS WAF logs into a central data lake repository. Check out our resource page for additional reading on AWS WAF.

Flux7 News

  • We continued our blog series about becoming an Agile Enterprise with the Flux7 case study of our OKR (Objectives and Key Results) journey, sharing lessons we learned along the way and the greater role of OKRs in an Agile Enterprise. In case you missed the first article in the series, on choosing a flatarchy organizational structure, you can read it here.
  • For CIOs and technology leaders looking to lead the transition to an Agile Enterprise, Flux7 has published a new paper on How CIOs Can Prepare an IT Platform for the Agile Enterprise. Download it today to learn how a technology platform that supports agility with IT automation and DevOps best practices can be a key lever to helping IT engage with and improve the business.

Download the Paper Today

Written by Flux7 Labs

Flux7 is the only Sherpa on the DevOps journey that assesses, designs, and teaches while implementing a holistic solution for its enterprise customers, thus giving its clients the skills needed to manage and expand on the technology moving forward. Not a reseller or an MSP, Flux7 recommendations are 100% focused on customer requirements and creating the most efficient infrastructure possible that automates operations, streamlines and enhances development, and supports specific business goals.

from Flux7 DevOps Blog

Stuff The Internet Says On Scalability For August 2nd, 2019

Wake up! It’s HighScalability time—once again:

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 52 mostly 5 star reviews (121 on Goodreads). They’ll learn a lot and hold you in even greater awe.

Number Stuff:

  • $9.6B: games investment in last 18 months, equal to the previous five years combined.
  • $3 million: won by a teenager in the Fortnite World Cup.  
  • 100,000: issues in Facebook’s codebase fixed from bugs found by static analysis. 
  • 106 million: Capital One IDs stolen by a former Amazon employee. (complaint)
  • 2 billion: IoT devices at risk because of 11 VXWorks zero day vulnerabilities.
  • 2.1 billion: parking spots in the US, taking 30% of city real estate, totaling 34 billion square meters, the size of West Virginia, valued at 60 trillion dollars.
  • 2.1 billion: people use Facebook, Instagram, WhatsApp, or Messenger every day on average. 
  • 100: words per minute from Facebook’s machine-learning algorithms capable of turning brain activity into speech. 
  • 51%: Facebook and Google’s ownership of the global digital ad market space on the internet.
  • 56.9%: Raleigh, NC was the top U.S. city for tech job growth.
  • 20-30: daily CPAN (Perl) uploads. 700-800 for Python.
  • 476 miles: LoRaWAN (Low Power, Wide Area (LPWA)) distance world record broken using 25mW transmission power.
  • 74%: Skyscanner savings using spot instances and containers on the Kubernetes cluster.
  • 49%: say convenience is more important than price when selecting a provider.
  • 30%: Airbnb app users prefer a non-default font size.
  • 150,000: number of databases migrated to AWS using the AWS Database Migration Service.
  • 1 billion: Google Photos users. @MikeElgan: same size as Instagram but far larger than Twitter, Snapchat or Pinterest
  • 300M: Pinterest monthly active users, with revenue of $261 million, up 64% year-over-year, on losses of $26 million for the second quarter of 2019.
  • 7%: of all dating app messages were rated as false.
  • $100 million: Goldman Sachs spend to improve stock trades from hundreds of milliseconds down to 100 microseconds while handling more simultaneous trades. The article mentions using microservices and event sourcing, but it’s not clear how that’s related.

Quotable Stuff:

  • Josh Frydenberg, Australian Treasurer: Make no mistake, these companies are among the most powerful and valuable in the world. They need to be held to account and their activities need to be more transparent.
  • Neil Gershenfeld: Fabrication merges with communication and computation. Most fundamentally, it leads to things like morphogenesis and self-reproducing assemblers. Most practically, it leads to almost anybody can make almost anything, which is one of the most disruptive things I know happening right now. Think about this range I talked about as for computing the thousand, million, billion, trillion now happening for the physical world, it’s all here today but coming out on many different length scales.
  • Alan Kay: Marvin and Seymour could see that most interesting systems were crossconnected in ways that allowed parts to be interdependent on each other—not hierarchical—and that the parts of the systems needed to be processes rather than just “things”
  • Lawrence Abrams: Now that ransomware developers know that they can earn monstrous payouts from local cities and insurance policies, we see a new government agency, school district, or large company getting hit with a ransomware attack every day.
  • @tmclaughbos: A lot of serverless adoption will fail because organizations will push developers to assume more responsibility down the stack instead of forcing them to move up the stack closer to the business.
  • Lightstep: Google Cloud Functions’ reusable connection insertion makes the requests more than 4 times faster [than S3] both in region and cross region.
  • Henry A. Kissinger, Eric Schmidt, Daniel Huttenlocher: The evolution of the arms-control regime taught us that grand strategy requires an understanding of the capabilities and military deployments of potential adversaries. But if more and more intelligence becomes opaque, how will policy makers understand the views and abilities of their adversaries and perhaps even allies? Will many different internets emerge or, in the end, only one? What will be the implications for cooperation? For confrontation? As AI becomes ubiquitous, new concepts for its security need to emerge. The three of us differ in the extent to which we are optimists about AI. But we agree that it is changing human knowledge, perception, and reality—and, in so doing, changing the course of human history. We seek to understand it and its consequences, and encourage others across disciplines to do the same.
  • minesafetydisclosures: Visa’s business is all about scale. That’s because the company’s fixed costs are high, but the cost of processing a transaction is essentially zero. Said more simply, it takes a big upfront investment in computers, servers, personnel, marketing, and legal fees to run Visa. But those costs don’t increase as volume increases; i.e., they’re “fixed”. So as Visa processes more transactions through their network, profit swells. As a result, the company’s operating margin has increased from 40% to 65%. And the total expense per transaction has dropped from a dime to a nickel; of which only half of a penny goes to the processing cost. Both trends are likely to continue.
  • noobiemcfoob: Summarizing my views: MQTT seems as opaque as WebSockets without the benefits of being built on a very common protocol (HTTP) and being used in industries beyond just IoT. The main benefits proponents of MQTT argue for (low bandwidth, small libraries) don’t seem particularly true in comparison to HTTP and WebSockets.
  • erincandescent: It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I’d likely implement the architecture to benefit from the existing tooling.
  • Director Jon Favreau~  the plan was to create a virtual Serengeti in the Unity game engine, then apply live action filmmaking techniques to create the film — the “Lion King” team described this as a “virtual production process.”
  • Alex Heath: In confidential research Mr. Cunningham prepared for Facebook CEO Mark Zuckerberg, parts of which were obtained by The Information, he warned that if enough users started posting on Instagram or WhatsApp instead of Facebook, the blue app could enter a self-sustaining decline in usage that would be difficult to undo. Although such “tipping points” are difficult to predict, he wrote, they should be Facebook’s biggest concern. 
  • jitbit: Well, to be embarrassingly honest… We suck at pricing. We were offering “unlimited” plans to everyone until recently. And the “impressive names” like you mention, well, they mostly pay us around $250 a month – which used to be our “Enterprise” pricing plan with unlimited everything (users, storage, agents etc.) So I guess the real answer is – we suck at positioning and we suck at marketing. As the result – profits were REALLY low (Lesson learned – don’t compete on pricing). P.S. Couple of years ago I met Thomas from “FE International” at some conference, really experienced guy, who told me “dude, this is crazy, dump the unlimited plan like right now” so we did. So I guess technically we can afford a PaaS now…
  • 1e-9: The markets are kind of like a massive, distributed, realtime, ensemble, recursive predictor that performs much better than any one of its individual component algorithms could. The reason why shaving a few milliseconds (or even microseconds) can be beneficial is because the price discovery feedback loops get faster, which allows the system to determine a giant pricing vector that is more self-consistent, stable, and beneficial to the economy. It’s similar to how increasing the sample rate of a feedback control system improves performance and stability. Providers of such benefits to the markets get rewarded through profit.
  • @QuinnyPig: There’s something else afoot too. I fix cloud bills. If I offer $10k to save $100k people sign off. If I offer $10 million to save $100 million people laugh me out of the room. Large numbers are psychologically scary.
  • mrjn:  Is it worth paying $20K for any DB or DB support? If it would save you 1/10th of an engineer per year, it becomes immediately worth. That means, can you avoid 5 weeks of one SWE by using a DB designed to better suit your dataset? If the answer is yes (and most cases it is), then absolutely that price is worth. See my blog post about how much money it must be costing big companies building their graph layers. Second part is, is Dgraph worth paying for compared to Neo or others? Note that the price is for our enterprise features and support. Not for using the DB itself. Many companies run a 6-node or a 12-node distributed/replicated Dgraph cluster and we only learn that much later when they’re close to pushing it into production and need support. They don’t need to pay for it, the distributed/replicated/transactional architecture of Dgraph is all open source. How much would it cost if one were to run a distributed/replicated setup of another graph DB? Is it even possible, can it execute and perform well? And, when you add support to that, what’s the cost?
  • @codemouse: It’s halfway to 2020. At this point, if any of your strategy is continued investment into your data centers you’re doing it wrong. Yes migration may take years, but you’re not going to be doing #cloud or #ops better than @awscloud
  • hermitdev: Not Citibank, but previously worked for a financial firm that sold a copy of its back office fund administration stack. Large, on site deployment. It would take a month or two to make a simple DNS change so they could locate the services running on their internal network. The client was a US depository trust with trillions on deposit. No, I won’t name any names. But getting our software installed and deployed was as much fun as extracting a tooth with a dull wood chisel and a mallet.
  • Insikt Group: Approximately 50% of all activity concerning ransomware on underground forums are either requests for any generic ransomware or sales posts for generic ransomware from lower-level vendors. We believe this reflects a growing number of low-level actors developing and sharing generic ransomware on underground forums.
  • Facebook: For classes of bugs intended for all or a wide variety of engineers on a given platform, we have gravitated toward a “diff time” deployment, where analyzers participate as bots in code review, making automatic comments when an engineer submits a code modification. Later, we recount a striking situation where the diff time deployment saw a 70% fix rate, where a more traditional “offline” or “batch” deployment (where bug lists are presented to engineers, outside their workflow) saw a 0% fix rate.
  • Andy Rachleff: Venture capitalists know that the thing that causes their companies to go out of business is lack of a market, not poor execution. So it’s a fool’s errand to back a company that proposes to do a ride-hailing service or renting a room or something as crazy as that. Again–how would you know if it’s going to work? So the venture industry outsourced that market risk to the angel community. The angel community thinks they won it away from the venture community, but nothing could be further from the truth, because it’s a sucker bet. It’s a horrible risk/reward. The venture capitalists said, “Okay, let the angels invest at a $5 million valuation and take all of that market risk. We’ll invest at a $50 million valuation. We have to pay up if it works.” Now they hope the company will be worth $5 billion to make the same return as they would have in the old model. Interestingly, there now are as many companies worth $5 billion today as there were companies worth $500 million 20 years ago, which is why the returns of the premier venture capital firms have stayed the same or even gone up.
  • imagetic: I dealt with a lot of high traffic live streaming video on Facebook for several years. We saw interaction rates decline almost 20x in a 3 year period but views kept increasing. Things just didn’t add up when the dust settled and we’d look at the stats. I wouldn’t be the least bit surprised if every stat FB has fed me was blown extremely out of proportion.
  • prism1234: If you are designing a small embedded system, and not a high performance general computing device, then you already know what operations your software will need and can pick what extensions your core will have. So not including a multiply by default doesn’t matter in this case, and may be preferred if your use case doesn’t involve a multiply. That’s a large use case for risc-v, as this is where the cost of an arm license actually becomes an issue. They don’t need to compete with a cell phone or laptop level cpu to still be a good choice for lots of devices.
  • oppositelock: You don’t have time to implement everything yourself, so you delegate. Some people now have credentials to the production systems, and to ease their own debugging, or deployment, spin up little helper bastion instances, so they don’t have to use 2FA each time to use SSH or don’t have to deal with limited-time SSH cert authorities, or whatever. They roll out your fairly secure design, and forget about the little bastion they’ve left hanging around, open to 0.0.0.0 with the default SSH private key every dev checks into git. So, any former employee can get into the bastion.
  • Lyft: Our tech stack comprises Apache Hive, Presto, an internal machine learning (ML) platform, Airflow, and third-party APIs.
  • Casey Rosenthal: It turns out that redundancy is often orthogonal to robustness, and in many cases it is absolutely a contributing factor to catastrophic failure. The problem is, you can’t really tell which of those it is until after an incident definitively proves it’s the latter.
  • Colm MacCárthaigh: There are two complementary tools in the chest that we all have these days, that really help combat Open Loops. The first is Chaos Engineering. If you actually deliberately go break things a lot, that tends to find a lot of Open Loops and make it obvious that they have to be fixed.
  • @eeyitemi: I’m gonna constantly remind myself of this everyday. “You can outsource the work, but you can’t outsource the risk.” @Viss 2019
  • Ben Grossman~ this could lead to a situation where filmmaking is less about traditional “filmmaking or storytelling,” and more about “world-building”: “You create a world where characters have personalities and they have motivations to do different things and then essentially, you can throw them all out there like a simulation and then you can put real people in there and see what happens.”
  • cheeze: I’m a professional dev and we own a decent amount of perl. That codebase is by far the most difficult to work in out of anything we own. New hires have trouble with it (nobody learns perl these days). Lots of it is next to unreadable.
  • Annie Lowrey: All that capital from institutional investors, sovereign wealth funds, and the like has enabled start-ups to remain private for far longer than they previously did, raising bigger and bigger rounds. (Hence the rise of the “unicorn,” a term coined by the investor Aileen Lee to describe start-ups worth more than $1 billion, of which there are now 376.) Such financial resources “never existed at scale before” in Silicon Valley, says Steve Blank, a founder and investor. “Investors said this: ‘If we could pull back our start-ups from the public market and let them appreciate longer privately, we, the investors, could take that appreciation rather than give it to the public market.’ That’s it.”
  • alexis_fr: I wonder if the human life calculation worked well this time. As far as I see, Boeing lost more than the sum of the human lives; they also lost reputation for everything new they’ve designed in the last 7 years being corrupted, and they also engulfed the reputation of FAA with them, whose agents would fit the definition of “corrupted” by any people’s definition (I know, they are not, they just used agents of Boeing to inspect Boeing because they were understaffed), and the FAA showed the last step of failure by not admitting that the plane had to be stopped until a few days after the European agencies. In other words, even in financial terms, it cost more than damages. It may have cost the entire company. They “DeHavailland”’ed their company. Ever heard of DeHavailland? No? That’s probably to do with their 4 successive deintegrating planes that “CEOs have complete trust in.” It just died, as a name. The risk is high.
  • Neil Gershenfeld: computer science was one of the worst things ever to happen to computers or science, why I believe that, and what that leads me to. I believe that because it’s fundamentally unphysical. It’s based on maintaining a fiction that digital isn’t physical and happens in a disconnected virtual world.
  • @benedictevans: Netflix and Sky both realised that a new technology meant you could pay vastly more for content than anyone expected, and take it to market in a new way. The new tech (satellite, broadband) is a crowbar for breaking into TV. But the questions that matter are all TV questions
  • @iamdevloper: Therapist: And what do we do when we feel like this? Me: buy a domain name for the side project idea we’ve had for 15 seconds. Therapist: No
  • @dvassallo: Step 1: Forget that all these things exist: Microservices, Lambda, API Gateway, Containers, Kubernetes, Docker. Anything whose main value proposition is about “ability to scale” will likely trade off your “ability to be agile & survive”. That’s rarely a good trade off. 4/25 Start with a t3.nano EC2 instance, and do all your testing & staging on it. It only costs $3.80/mo. Then before you launch, use something bigger for prod, maybe an m5.large (2 vCPU & 8 GB mem). It’s $70/mo and can easily serve 1 million page views per day.
  • PeteSearch: I believe we have an opportunity right now to engineer-in privacy at a hardware level, and set the technical expectation that we should design our systems to be resistant to abuse from the very start. The nice thing about bundling the audio and image sensors together with the machine learning logic into a single component is that we have the ability to constrain the interface. If we truly do just have a single pin output that indicates if a person is present, and there’s no other way (like Bluetooth or WiFi) to smuggle information out, then we should be able to offer strong promises that it’s just not possible to leak pictures. The same for speech interfaces, if it’s only able to recognize certain commands then we should be able to guarantee that it can’t be used to record conversations.
  • Murat: As I have mentioned in the previous blog post, MAD questions, Cosmos DB has operationalized a fault-masking streamlined version of replication via nested replica-sets deployed in fan-out topology. Rather than doing offline updates from a log, Cosmos DB updates database at the replicas online, in place, to provide strong consistent and bounded-staleness consistency reads among other read levels. On the other hand, Cosmos DB also maintains a change log by way of a witness replica, which serves several useful purposes, including fault-tolerance, remote storage, and snapshots for analytic workload.
  • grauenwolf: That’s where I get so frustrated. Far too often I hear “premature optimization” as a justification for inefficient code when doing it the right way would actually require the same or less effort and be more readable.
  • Murat: Leader – I tell you Paxos joke, if you accept me as leader. Quorum – Ok comrade. Leader – Here is joke! (*Transmits joke*) Quorum – Oookay… Leader – (*Laughs* hahaha). Now you laugh!! Quorum – Hahaha, hahaha.
  • Manmax75: The amount of stories I’ve heard from SysAdmins who jokingly try to access a former employers network with their old credentials only to be shocked they still have admin access is a scary and boggling thought.
  • @dougtoppin: Fargate brings significant opportunity for cost savings and to get the maximum benefit the minimal possible number of tasks must be running to handle your capacity needs. This means quickly detecting request traffic, responding just as quickly and then scaling back down.
  • @evolvable: At a startup bank we got management pushback when revealing we planned to start testing in production – concerns around regulation and employees accessing prod. We changed the name to “Production Verification”. The discussion changed to why we hadn’t been doing it until now. 
  • @QuinnyPig: I’m saying it a bit louder every time: @awscloud’s data transfer pricing is predatory garbage. I have made hundreds of thousands of consulting dollars straightening these messes out. It’s unconscionable. I don’t want to have to do this for a living. To be very clear, it’s not that the data transfer pricing is too expensive, it’s that it’s freaking inscrutable to understand. If I can cut someone’s bill significantly with a trivial routing change, that’s not the customer’s fault.
  • @PPathole: Alternative Big O notations: O(1) = O(yeah) O(log n) = O(nice) O(nlogn) = O(k-ish) O(n) = O(ok) O(n²) = O(my) O(2ⁿ) = O(no) O(n^n) = O(f*ck) O(n!) = O(mg!)
  • Brewster Kahle: There’s only a few hackers I’ve known like Richard Stallman, he’d write flawless code at typing speed. He worked himself to the bone trying to keep up with really smart former colleagues who had been poached from MIT. Carpal tunnel, sleeping under the desk, really trying hard for a few years and it was killing him. So he basically says I give up, we’re going to lose the Lisp machine. It was going into this company that was flying high, it was going to own the world, and he said it was going to die, and with it the Lisp machine. He said all that work is going to be lost, we need a way to deal with the violence of forking. And he came up with the GNU public license. The GPL is a really elegant hack in the classic sense of a hack. His idea of the GPL was to allow people to use code but to let people put it back into things. Share and share alike.

Useful Stuff:

  • It’s probably not a good idea to start a Facebook poll on the advisability of your pending nuptials a day before the wedding. But it is very funny and disturbingly plausible. Made Public. Another funny/sad one is using a ML bot to “deal with” phone scams. The sad part will be when both sides are just AIs trying to socially engineer each other and half the world’s resources become dedicated to yet another form of digital masturbation. Perhaps we should just stop the MADness?
  • URGENT/11 Zero Day Vulnerabilities Impacting VxWorks, the Most Widely Used Real-Time Operating System (RTOS). I read this with special interest because I’ve used VxWorks on several projects. Not once do I ever remember anyone saying “I wonder if the TCP/IP stack has security vulnerabilities?” We looked at licensing costs, board support packages, device driver support, tool chain support, ISR service latencies, priority inversion handling, task switch determinacy, etc. Why did we never think of these kinds of potential vulnerabilities? One reason is social proof. Surely all these other companies use VxWorks, it must be good, right? Another reason is VxWorks is often used within a secure perimeter. None of the network interfaces are supposed to be exposed to the internet, so remote code execution is not part of your threat model. But in reality you have no idea if a customer will expose a device to the internet. And you have no idea if later product enhancements will place the device on the internet. Since it seems all network devices expand until they become a router, this seems a likely path to Armageddon. At that point nobody is going to requalify their entire toolchain. That just wouldn’t be done in practice. VxWorks is dangerous because everything is compiled into a single image that boots and runs, much like a unikernel. At least when I used it that was the case. VxWorks is basically just a library you link into your application that provides OS functionality. You write the boot code, device drivers, and other code to make your application work. So if there’s a remote code execution bug it has access to everything. And a lot of these images are built into ROM, so they aren’t upgradeable. And even if the images are upgradeable in EEPROM or flash, how many people will actually do that? Unless you pay a lot of money you do not get the source to VxWorks. You just get libraries and header files. So you have no idea what’s going on in the network stack. I’m surprised the stack was never tested against a fuzzing kind of attack (a toy sketch of the idea follows). That’s a great way to find bugs in protocols. Though nobody can define simplicity, many of the bugs were in the handling of the little-used TCP Urgent Pointer feature. Anyone surprised that the code around it is broken? Who uses it? It shouldn’t be in the stack at all. Simple to say, harder to do.
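
As an aside to the fuzzing point above: a first-pass fuzzer for urgent-pointer handling is only a few lines with scapy. This is a toy illustration, not the URGENT/11 researchers’ methodology; the target address is a placeholder, and anything like this should only ever be pointed at hardware you own.

```python
# Toy fuzzer for TCP urgent-pointer handling (requires scapy; sending needs root).
import random
from scapy.all import IP, TCP, send

TARGET = "192.0.2.10"  # placeholder address from TEST-NET; replace with your own device

for _ in range(1000):
    pkt = IP(dst=TARGET) / TCP(
        dport=random.choice([21, 23, 80]),              # ports the device exposes
        flags="SU" if random.random() < 0.5 else "AU",  # URG combined with SYN or ACK
        urgptr=random.randint(0, 0xFFFF),               # the rarely exercised urgent pointer
        seq=random.getrandbits(32),
    )
    send(pkt, verbose=False)
# Then watch the device for resets or hangs; a real fuzzer would also track TCP
# state so that malformed segments arrive on established connections.
```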
  • JuliaCon 2019 videos are now available. You might like Keynote: Professor Steven G. Johnson and The Unreasonable Effectiveness of Multiple Dispatch
  • CERN is Migrating to open-source technologies. Microsoft wants too much for their licenses so CERN is giving MS the finger.
  • Memory and Compute with Onur Mutlu:
    • The main problem is that DRAM latency is hardly improving at all. From 1999 to 2017, DRAM capacity has increased by 128x, bandwidth by 20x, but latency only by 1.3x! This means that more and more effort has to be spent tolerating memory latency.  But what could be done to actually improve memory latency?
    • You could “easily” get a 30% latency improvement by having DRAM chips provide a bit more precise information to the memory controller about actual latencies and current temperatures.
    • Another concept to truly break the memory barrier is to move the compute to the memory. Basically, why not put the compute operations in memory?  One way is to use something like High-Bandwidth Memory (HBM) and shorten the distance to memory by stacking logic and memory.
    • Another rather cool (but also somewhat limited) approach is to actually use the DRAM cells themselves as a compute engine. It turns out that you can do copy, clear, and even logic ops on memory rows by using the existing way that DRAMs are built and adding a tiny amount of extra logic.
  • Want to make something in hardware? Like Pebble, Dropcam, or Ring. Who you gonna call? Dragon Innovation. Hear how on the AMP Hour podcast episode #451 – An Interview with Scott Miller
    • Typical customers build between 5k and 1 million units, but Dragon will talk with you at 100 units. Customers usually start small. Dragon has built a big toolbox for IoT, so customers don’t need to reinvent the wheel every time; they have designs for sensing, processing, electronics on the edge, radios, and all the different security layers, and can deploy quickly with few customizations.
    • Dragon is moving into doing the design, manufacturing, packaging, issuing all POs, and installation support. They call this Product as a Service (PaaS), a full end-to-end offering. Say you have a sensor to determine when avocados are ripe: you would pay per sensor per month, or maybe per avocado, instead of making a one-time sale. Seeing more non-traditional players getting into the IoT space with different revenue models, Dragon has an opportunity to innovate on its business model.
    • Consumer is dying and industrial is growing. A trend they are seeing in the US is a contraction of business-to-consumer startups in the hardware space, but an expansion of industrial IoT. There have been a bunch of high-profile bankruptcies in the consumer space (Anki, Jibo).
    • Europe is growing. Overall huge growth in industrial startups across Europe. Huge number of capable factories in the EU. They get feet on the ground to find and qualify factories. They have over 2000 factories in their database. 75% in China, increasingly more in the EU and the US. 
    • Factories are going global. Seeing a lot of companies driven out of China by the 25% tariffs, moving into Asia-Pacific countries like Taiwan, Singapore, Vietnam, Indonesia, and Malaysia. Coming up quickly, but not up to China’s level yet. Dragon will run RFQs on a global basis, including factories from the US, China, EU, Indonesia, and Vietnam, to see what the landed cost is as a function of geography.
    • Factories are different in different countries. In China factories are vertically integrated: mold making, injection molding, final assembly, test, and packaging, all under one roof, which is very convenient. In the US and Europe factories are more horizontal, so it takes a lot more effort to put together your supply chain. As an example of the degree of vertical integration, one factory in China made its own paint and cardboard.
    • Automation is huge in China. Chinese labor rates are on average 5 to 6 dollars an hour, depending on region, factory, and training, so the focus is on automation. One factory they worked with had 100,000 workers; because of automation it now has 30,000.
    • Automation is different in China. Automation in China is bottom-up: they’ll build a simple robot that attaches to a soldering iron and solders the leads. In the US it is top-down: build a huge fully-functioning robot that can do anything, instead of a task-specific one. China is really good at building stuff, so factories build task-specific robots to make their processes more efficient. Since products are always changing, this allows them to stay nimble.
    • Also from Strange Parts: Design for Manufacturing Course, How I Made My Own iPhone – in China.
  • BigQuery best practices: Controlling costs: Query only the columns that you need; don’t run queries to explore or preview table data; before running queries, preview them to estimate costs; use the query validator; use the maximum bytes billed setting to limit query costs; do not use a LIMIT clause as a method of cost control; create a dashboard to view your billing data so you can make adjustments to your BigQuery usage, and consider streaming your audit logs to BigQuery so you can analyze usage patterns; partition your tables by date; if possible, materialize your query results in stages; if you are writing large query results to a destination table, use the default table expiration time to remove the data when it’s no longer needed; use streaming inserts only if your data must be immediately available. A couple of these are sketched in code below.
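
Two of those controls are easy to show with the google-cloud-bigquery Python client: a dry run to preview the bytes a query would scan, and maximum_bytes_billed as a hard cap. A minimal sketch; the table name is hypothetical.

```python
# Estimate a query's cost with a dry run, then enforce a byte cap when running it.
from google.cloud import bigquery

client = bigquery.Client()  # project and credentials come from the environment
sql = "SELECT user_id, event_ts FROM `myproject.mydataset.events`"  # only needed columns

# 1. Dry run: nothing executes, but BigQuery reports the bytes it would scan.
estimate = client.query(
    sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)
print(f"Would scan {estimate.total_bytes_processed / 1e9:.2f} GB")

# 2. Hard cap: the job fails fast instead of billing more than 10 GB.
capped = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 10**9)
rows = client.query(sql, job_config=capped).result()
```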
  • Boeing has changed a lot over the years. Once upon a time I worked on a project with Boeing and the people were excellent. This is something I heard: “The changes can be attributed to the influence of the McDonnell family, who maintain extremely high influence through their stock shares resulting from the merger. It has been gradually getting better recently but is still a problem for those inside who understand the real potential impact.”
  • Maybe we are all just random matrices? What Is Universality? It turns out there are deep patterns in complex correlated systems that lie somewhere between randomness and order. They arise from components that interact and repel one another. Do such patterns exist in software systems? Also, Bubble Experiment Finds Universal Laws
  • PID Loops and the Art of Keeping Systems Stable
    • I see a lot of places where control theory is directly applicable but rarely applied. Auto-scaling and placement are really obvious examples, we’re going to walk through some, but another is fairness algorithms. A really common fairness algorithm is how TCP achieves fairness. You’ve got all these network users and you want to give them all a fair slice. Turns out a PID loop is what’s happening. In system stability, how do we absorb errors, recover from those errors? (A minimal PID sketch follows this list.)
    • Something we do in CloudFront is we run a control system. We’re constantly measuring the utilization of each site and depending on that utilization, we figure out what’s our error, how far are we from optimized? We change the mass or radius of effect of each site, so that at our really busy time of day, really close to peak, it’s servicing everybody in that city, everybody directly around it drawing those in, but that at our quieter time of day can extend a little further and go out. It’s a big system of dynamic springs all interconnected, all with PID loops. It’s amazing how optimal a system like that can be, and how applying a system like that has increased our effectiveness as a CDN provider. 
    • A surprising number of control systems are just like this; they’re just Open Loops. I can’t count the number of customers I’ve gone through control systems with who told me, “We have this system that pushes out some state, some configuration, and sometimes it doesn’t do it.” I find that scary, because what it’s saying is nothing’s actually monitoring the system. Nothing’s really checking that everything is as it should be. My classic favorite example of an Open Loop process is certificate rotation. I happen to work on TLS a lot, it’s something I spent a lot of my time on. Not a week goes by without some major website having a certificate outage.
    • We have two observability systems at AWS, CloudWatch and X-Ray. One of the things I didn’t appreciate until I joined AWS – it was a bit like being Charlie in the chocolate factory, seeing the insides – was that I expected all sorts of cool algorithms and fancy techniques, things I just never imagined. There was a little bit of that once I got inside and working, but mostly what I found was really mundane: people were just doing a lot of things at scale that I didn’t realize. One of those things was just the sheer volume of monitoring. The number of metrics we keep on every single host, every single system, I still find staggering.
    • Exponential Back-off is a really strong example. Exponential Back-off is basically an integral: an error happens and we retry; if that fails, we wait longer before trying again. Rate limiters are like derivatives; they’re rate estimators deciding what to let in and what to let out. We’ve built both of these into the AWS SDKs. We’ve got other back-pressure strategies too; we’ve got systems where servers can tell clients, “Back off, please, I’m a little busy right now,” all those things working together. If I look at a system design and it doesn’t have any of this, if it doesn’t have exponential back-off, if it doesn’t have rate limiters in some place, if it’s not able to fight some power law that I think might arise due to errors propagating, that tells me I need to be a bit more worried and start digging deeper.
    • I like to watch out for edge triggering in systems, it tends to be an anti-pattern. One reason is because edge triggering seems to imply a modal behavior. You cross the line, you kick into a new mode, that mode is probably rarely tested and it’s now being kicked into at a time of high stress, that’s really dangerous. Your system has to be idempotent, if you’re going to build an idempotent system, you might as well make a level-triggered system in the first place, because generally, the only benefit of building an edge-triggered system is it doesn’t have to be idempotent.
    • There is definitely tension between stability and optimality; in general, the more finely tuned you want to make a system to achieve absolute optimality, the more risk you run of driving it into an unstable state. There are people who do entire PhDs on nothing else than finding that balance for one system. Oil refineries are a good example, where the oil industry will pay people a lot of money just to optimize that, even very slightly. Computer Science, in my opinion, and distributed systems, are nowhere near that level of advanced control theory practice yet. We have a long way to go. We’re still down at the baby steps of, “We’ll at least measure it.”
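
For reference, the controller being described fits in a screenful of Python. A minimal sketch; the auto-scaling framing, the gains, and all the numbers are made-up illustrations, not anything from the talk.

```python
# A textbook PID loop: P reacts to the current error, I to its history, D to its trend.
class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0     # accumulated error, the I term's state
        self.prev_error = None  # previous error, for the D term

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical use: sample fleet CPU once a minute against a 60% target.
# Utilization above target makes the output negative, which we read as "add capacity".
ctl = PID(kp=0.8, ki=0.05, kd=0.1, setpoint=60.0)
signal = ctl.update(measurement=75.0, dt=60.0)
instances_delta = -round(signal / 10)  # crude mapping from control signal to instances
```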
  • Re:Inforce 2019 videos are now available.
  • Top Seven Myths of Robust Systems: The number one myth we hear out in the field is that if a system is unreliable, we can fix that with redundancy; rather than trying to simplify or remove complexity, learn to live with it. Ride complexity like a wave. Navigate the complexity; The adaptive capacity to improvise well in the face of a potential system failure comes from frequent exposure to risk; Both sides — the procedure-makers and the procedure-not-followers — have the best of intentions, and yet neither is likely to believe that about the other; Unfortunately it turns out catastrophic failures in particular tend to be a unique confluence of contributing factors and circumstances, so protecting yourself from prior outages, while it shouldn’t hurt, also doesn’t help very much; Best practices aren’t really a knowable thing; Don’t blame individuals. That’s the easy way out, but it doesn’t fix the system. Change the system instead. 
  • They grow up so slow. What’s new in JavaScript: Google I/O 2019 Summary
  • From a rough calculation we saw about 40% decrease in the amount of CPU resources used. Overall, we saw latency stabilize for both avg and max p99. Max p99 latency also decreased a bit. Safely Rewriting Mixpanel’s Highest Throughput Service in Golang. Mixpanel moved from Python to Go for their data collection API. They had already migrated the Python API to use the Google Load Balancer to route messages to Kubernetes pods on Google Cloud, where an Envoy container load-balanced between eight Python API containers. The Python API containers then submitted the data to a Google Pub/Sub queue via a pubsub sidecar container that had a kestrel interface. To enable testing against live traffic, we created a dedicated setup. The setup was a separate Kubernetes pod running in the same namespace and cluster as the API deployments. The pod ran an open source API correctness tool, Diffy, along with copies of the old and new API services. Diffy is a service that accepts HTTP requests and forwards them to two copies of an existing HTTP service and one copy of a candidate HTTP service (a toy version is sketched below). One huge improvement is we only need to run a single API container per pod.
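
The Diffy pattern is easy to approximate for a single request: tee it to both services and diff the responses. A toy sketch with hypothetical endpoints; real Diffy also queries a second copy of the legacy service so it can filter out noise from nondeterministic fields.

```python
# Tee one request to the legacy and candidate services and report any divergence.
import json
import requests

LEGACY = "http://legacy-api:8080/track"        # hypothetical existing Python service
CANDIDATE = "http://candidate-api:8080/track"  # hypothetical new Go service

def diff_request(payload):
    old = requests.post(LEGACY, json=payload, timeout=5)
    new = requests.post(CANDIDATE, json=payload, timeout=5)
    if old.status_code != new.status_code:
        return f"status mismatch: {old.status_code} vs {new.status_code}"
    if old.json() != new.json():
        return "body mismatch:\n%s\n%s" % (json.dumps(old.json()), json.dumps(new.json()))
    return None  # the services agree on this request

print(diff_request({"event": "signup", "distinct_id": "u123"}) or "services agree")
```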
  • Satisfactory: Network Optimizations: It would be a big gain to stop replicating the inventory when it’s not viewed, which is essentially what we did, but the method of doing so was a bit complicated and required a lot of rework…Doing this also helps to reduce CPU time, as an inventory is a big state to compare, and look for changes in. If we can reduce that to a maximum of 4x the number of players it is a huge gain, compared to the hundreds, if not thousands, that would otherwise be present in a big base…There is, of course, a trade-off. As I mentioned there is a chance the inventory is not there when you first open to view it, as it has yet to arrive over the network…In this case the old system actually varied in size but landed around 48 bytes per delta, compared to the new system of just 3 bytes…On top of this, we also reduced how often a conveyor tries to send an update to just 3 times a second compared to the previous of over 20…the accuracy of item placements on the conveyors took a small hit, but we have added complicated systems in order to compensate for that…we’ve noticed that the biggest issue for running smooth multiplayer in large factories is not the network traffic anymore, it’s rather the general performance of the PC acting as a server.
  • MariaDB vs MySQL Differences: MariaDB is fully GPL licensed while MySQL takes a dual-license approach. Each handles thread pools differently. MariaDB supports a lot of different storage engines. In many scenarios, MariaDB offers improved performance.
  • Our pySpark pipeline churns through tens of billions of rows on a daily basis. Calculating 30 billion speed estimates a week with Apache Spark: Probes generated from the traces are matched against the entire world’s road network. At the end of the matching process we are able to assign each trace an average speed, a 5 minute time bucket and a road segment. Matches on the same road that fall within the same 5 minute time bucket are aggregated to create a speed histogram. Finally, we estimate a speed for each aggregated histogram which represents our prediction of what a driver will experience on a road at a given time of the week…On a weekly basis, we match on average 2.2 billion traces to 2.3 billion roads to produce 5.4 billion matches. From the matches, we build 51 billion speed histograms to finally produce 30 billion speed estimates…The first thing we spent time on was designing the pipeline and schemas of all the different datasets it would produce. In our pipeline, each pySpark application produces a dataset persisted in a hive table readily available for a downstream application to use…Instead of having one pySpark application execute all the steps (map matching, aggregation, speed estimation, etc.) we isolated each step to its own application…We favored normalizing our tables as much as possible and getting to the final traffic profiles dataset through relationships between relevant tables…Partitioning makes querying part of the data faster and easier. We partition all the resulting datasets by both a temporal and spatial dimension. 
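
The bucket-and-aggregate step described above reduces to a few DataFrame operations. A toy pySpark sketch with hypothetical table and column names; the real pipeline builds full speed histograms, so a median stands in for the estimation step here.

```python
# Aggregate map-matched probes into per-road, per-5-minute speed estimates.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("speed-estimates").getOrCreate()
matches = spark.table("matched_probes")  # hypothetical output of the map-matching app

estimates = (
    matches
    # 5-minute bucket within the week (0..2015), from seconds since Monday 00:00
    .withColumn("bucket", (F.col("seconds_into_week") / 300).cast("int"))
    .groupBy("road_id", "bucket")
    .agg(
        F.count("*").alias("n_probes"),
        F.expr("percentile_approx(speed_kmh, 0.5)").alias("speed_estimate"),
    )
    .where(F.col("n_probes") >= 5)  # drop under-observed road/bucket pairs
)

# Persist as a partitioned table for downstream applications, as the post recommends.
estimates.write.mode("overwrite").partitionBy("bucket").saveAsTable("speed_estimates")
```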
  • Do not read this unless you can become comfortable with the feeling that everything you’ve done in your life is trivial and vainglorious. Morphogenesis for the Design of Design
    • One of my students built and runs all the computers Facebook runs on, one of my students used to run all the computers Twitter runs on—this is because I taught them to not believe in computer science. In other words, their job is to take billions of dollars, hundreds of megawatts, and tons of mass, and make information while also not believing that the digital is abstracted from the physical. Some of the other things that have come out from this lineage were the first quantum computations, or microfluidic computing, or part of creating some of the first minimal cells.
    • The Turing machine was never meant to be an architecture. In fact, I’d argue it has a very fundamental mistake, which is that the head is distinct from the tape. And the notion that the head is distinct from the tape—meaning, persistence of tape is different from interaction—has persisted. The computer in front of Rod Brooks here is spending about half of its work just shuttling from the tape to the head and back again.
    • There’s a whole parallel history of computing, from Maxwell to Boltzmann to Szilard to Landauer to Bennett, where you represent computation with physical resources. You don’t pretend digital is separate from physical. Computation has physical resources. It has all sorts of opportunities, and getting that wrong leads to a number of false dichotomies that I want to talk through now. One false dichotomy is that in computer science you’re taught many different models of computation, each with its adherents, and there’s a whole taxonomy of them. In physics there’s only one model of computation: A patch of space occupies space, it takes time to transit, it stores state, and states interact—that’s what the universe does. Anything other than that model of computation is physics and you need epicycles to maintain the fiction, and in many ways that fiction is now breaking.
    • We did a study for DARPA of what would happen if you rewrote from scratch a computer software and hardware so that you represented space and time physically.
    • One of the places that I’ve been involved in pushing that is in exascale high-performance computing architecture, really just a fundamental do-over to make software look like hardware and not to be in an abstracted world.
    • Digital isn’t ones and zeroes. One of the hearts of what Shannon did is threshold theorems. A threshold theorem says I can talk to you as a wave form or as a symbol. If I talk to you as a symbol, if the noise is above a threshold, you’re guaranteed to decode it wrong; if the noise is below a threshold, for a linear increase in the physical resources representing the symbol there’s an exponential reduction in the fidelity to decode it. That exponential scaling means unreliable devices can operate reliably. The real meaning of digital is that scaling property. But the scaling property isn’t one and zero; it’s the states in the system. 
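As a concrete instance of that scaling property (my illustration, not Shannon's own formulation): send each bit n times over a binary symmetric channel that flips bits with probability p < 1/2, and decode by majority vote. Hoeffding's inequality bounds the probability that the majority is wrong, and the bound falls exponentially in the linear resource n:

```latex
% Repetition code over a binary symmetric channel, p < 1/2:
% linear growth in resources (n) buys exponential decay in decode error.
P_{\mathrm{error}} \;=\; \Pr\!\Big[\sum_{i=1}^{n} X_i \ge n/2\Big]
\;\le\; e^{-2n\,(1/2 - p)^2},
\qquad X_i \sim \mathrm{Bernoulli}(p)
```

Below threshold (p < 1/2) the exponent is negative and fidelity improves exponentially with n; above it, no amount of repetition helps, which is the guaranteed-to-fail side of the theorem.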
    • If you mix chemicals and make a chemical reaction, a yield of a part per 100 is good. When the ribosome (the molecular assembler that makes your proteins) elongates, it makes an error of one in 10^4. When DNA replicates, it adds one extra error-correction step, and that brings the error down to one in 10^8, and that’s exactly the scaling of the threshold theorem. The exponential complexity that makes you possible comes from error detection and correction in your construction. It’s everything Shannon and von Neumann taught us about codes and reconstruction, but now done in physical systems.
    • One of the projects I’m working on in my lab that I’m most excited about is making an assembler that can assemble assemblers from the parts that it’s assembling—a self-reproducing machine. What it’s based on is us. 
    • If you look at scaling coding construction by assembly, ribosomes are slow (they run at one hertz, one amino acid a second), but a cell can have a million of them, and you can have a trillion cells. As you sit here listening, you’re placing 10^18 parts a second (1 Hz × 10^6 ribosomes per cell × 10^12 cells), and it’s because you can build up this capacity by assembling assemblers. The heart of the project is the exponential scaling of self-reproducing assemblers.
    • As we work on the self-reproducing assembler, and writing software that looks like hardware that respects geometry, they meet in morphogenesis. This is the thing I’m most excited about right now: the design of design. Your genome doesn’t store anywhere that you have five fingers. It stores a developmental program, and when you run it, you get five fingers. It’s one of the oldest parts of the genome. Hox genes are an example. It’s essentially the only part of the genome where the spatial order matters. It gets read off as a program, and the program never represents the physical thing it’s constructing. The morphogenes are a program that specifies morphogens that do things like climb gradients and symmetry break; it never represents the thing it’s constructing, but the morphogens then following the morphogenes give rise to you.
    • What’s going on in morphogenesis, in part, is compression. A billion bases can specify a trillion cells, but the more interesting thing that’s going on is almost anything you perturb in the genome is either inconsequential or fatal. The morphogenes are a curated search space where rearranging them is interesting—you go from gills to wings to flippers. The heart of success in machine learning, however you represent it, is function representation. The real progress in machine learning is learning representation. 
    • We’re at an interesting point now where it makes as much sense to take that scaling seriously as it did to take Moore’s law scaling seriously in 1965, when he made his first graph. We started doing these fab labs just as outreach for NSF, and then they went viral, and they let ordinary people go from consumers to producers. It’s leading to very fundamental questions about what is work, what is money, what is an economy, what is consumption.
    • Looking at exactly this question of how a code and a gene give rise to form: Turing and von Neumann both completely understood that the interesting place in computation is how computation becomes physical, how it becomes embodied, and how you represent it. That’s where they both ended their lives. That’s neglected in the canon of computing.
    • If I’m doing morphogenesis with a self-reproducing system, I don’t want to then just paste in some lines of code. The computation is part of the construction of the object. I need to represent the computation in the construction, so it forces you to be able to overlay geometry with construction.
    • Why align computer science and physical science? There are at least five reasons for me. Only lightly is it philosophical. It’s the cracks in the matrix. The matrix is cracking. 1) The fact that whoever has their laptop open is spending about half of its resources shuttling information from memory transistors to processor transistors, even though the memory transistors have the same computational power as the processor transistors, is a bad legacy of the EDVAC. It’s a bit annoying for the computer, but when you get to things like an exascale supercomputer, it breaks. You just can’t maintain the fiction as you push the scaling. In very large-scale computing, the resource cost of maintaining the fiction so that programmers can pretend it’s not true is getting so painful that you need to redo it. In fact, if you look down in the trenches, things like emerging ways to do very large-scale GPU programming are beginning to inch in that direction. So, it’s breaking in performance.
    •  What’s interesting is a lot of the things that are hard—for example, in parallelization and synchronization—come for free. By representing time and space explicitly, you don’t need to do the annoying things like thread synchronization and all the stuff that goes into parallel programming.
    • Communication degraded with distance. Along came Shannon. We now have the Internet. Computation degraded with time. The last great analog computer work was Vannevar Bush’s differential analyzer. One of the students working on it was Shannon. He was so annoyed that he invented our modern digital notions in his Master’s thesis to get over the experience of working on the differential analyzer.
    • When you merge communication with computation with fabrication, it’s not that there’s a duopoly of communication and computation and then, over here, manufacturing; they all belong together. The heart of how we work is this trinity of communication plus computation plus fabrication, and for me the real point is merging them.
    • I almost took over running research at Intel. It ended up being a bad idea on both sides, but when I was talking to them about it, I was warned off. It was like the godfather: “You can do that other stuff, but don’t you dare mess with the mainline architecture.” We weren’t allowed to even think about that. In defense of them, it’s billions and billions of dollars investment. It was a good multi-decade reign. They just weren’t able to do it. 
    • Again, the embodiment of everything we’re talking about, for me, is the morphogenes—the way evolution searches for design by coding for construction. And they’re the oldest part of the genome. They were invented a very long time ago and nobody has messed with them since.
    • Get over the idea that digital and physical are separate; they can be united. Get over the idea that analog is separate from digital; there’s a really profound place in between. We’re at the beginning of fifty years of Moore’s law, but for the physical world. We didn’t talk much about it, but if anybody can make anything, that has the biggest impact of anything I know.

Soft Stuff:

  • paypal/hera (article): Hera multiplexes connections for MySQL and Oracle databases. It supports sharding the databases for horizontal scaling. It is a data access gateway that PayPal uses to scale database access for hundreds of billions of SQL queries per day. Additionally, Hera improves database availability through sophisticated protection mechanisms and provides application resiliency through transparent traffic failover. Hera is now available outside of PayPal as an Apache 2-licensed project. A toy sketch of the multiplexing idea follows below.
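For intuition, here's a toy illustration of connection multiplexing, the core idea behind a data access gateway like Hera. This is the concept only, not Hera's design or API: many client sessions share a small fixed pool of backend connections, borrowing one just for the duration of a statement.

```python
# Toy connection multiplexer (concept only, not Hera's actual design):
# thousands of client sessions share a few real database connections.
import queue
import sqlite3

class ConnectionMultiplexer:
    def __init__(self, connect, pool_size=8):
        self._pool = queue.Queue()
        for _ in range(pool_size):      # a handful of real connections...
            self._pool.put(connect())

    def execute(self, sql, params=()):
        conn = self._pool.get()         # ...borrowed per statement, so an
        try:                            # idle client session holds nothing
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            self._pool.put(conn)        # return the connection immediately

# Self-contained demo with sqlite3 standing in for MySQL/Oracle:
mux = ConnectionMultiplexer(
    lambda: sqlite3.connect(":memory:", check_same_thread=False))
print(mux.execute("SELECT 1"))          # [(1,)]
```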
  • zerotier/lf: a fully decentralized, fully replicated key/value store. LF is built on a directed acyclic graph (DAG) data model that makes synchronization easy and allows many different security and conflict resolution strategies to be used. One way to think of LF’s DAG is as a gigantic conflict-free replicated data type (CRDT). Proof of work is used to rate limit writes to the shared data store on public networks, and as one input that can be taken into consideration for conflict resolution.
  • pahud/fargate-fast-autoscaling: This reference architecture demonstrates how to build an AWS Fargate workload that can detect spiky traffic in less than 10 seconds and immediately scale out horizontally.
  • ailidani/paxi: Paxi is a framework that implements WPaxos and other Paxos protocol variants. Paxi provides most of the elements that any Paxos implementation or replication protocol needs, including network communication, the state machine of a key-value store, a client API, and multiple types of quorum systems. A toy illustration of the quorum idea follows below.
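The "multiple types of quorum systems" piece is the pluggable part a Paxos variant cares about most. Here's a toy illustration (the concept, not Paxi's Go API) of a majority quorum system and the intersection property that makes Paxos-style protocols safe:

```python
# Majority quorums and the intersection property at the heart of Paxos
# safety (concept only; Paxi's actual quorum interfaces are in Go).
from itertools import combinations

def majority_quorums(nodes):
    """All minimal quorums: every subset holding a strict majority."""
    need = len(nodes) // 2 + 1
    return [set(q) for q in combinations(nodes, need)]

nodes = ["a", "b", "c", "d", "e"]
quorums = majority_quorums(nodes)       # ten 3-node quorums for 5 nodes

# Any two quorums intersect in at least one node, so a new ballot always
# contacts someone who saw the latest accepted value.
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
print(f"{len(quorums)} quorums, pairwise intersection verified")
```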

Pub Stuff:

from High Scalability

The Agile Enterprise: A Flux7 OKR Case Study

The Agile Enterprise is becoming the way successful companies operate and at Flux7 we like to lead by example. As a result, we have embraced many Agile practices across our business — from OKRs to a flatarchy (for additional background, read our blog, Flatarchies and the Agile Enterprise) — and plan to share in a short blog series how we are implementing these agile best practices, lessons we’ve learned along the way and the impacts they’ve had on our business. In today’s blog, we start by taking a look at our OKR (Objectives and Key Results) story and the greater role of OKRs in an Agile Enterprise.

Created by Intel and made popular by organizations like Amazon, Google, Microsoft, and Slack, OKR is a goal-setting management style that is gaining traction. The goal of OKRs is to align individuals, teams, and the organization as a whole around measurable results that have everyone rowing in the same direction.

Our OKR Timeline

Excited to begin, we started experimenting with OKRs in early Q4 of 2018, and our first serious attempt came as we built them for Q1 of 2019. After trying it once, we saw the shortcomings of what we had done (keep reading as we discuss lessons learned from that exercise below) and brought in an expert who could help us learn and improve. We found Dan Montgomery, founder of Agile Strategies and author of Start Less, Finish More, to be exactly what we were looking for.

Dan helped us understand the theory behind OKRs and gave us practical how-to steps to implement them across Flux7. As an organization that already uses Agile methodologies in our consulting practice, we learned from Dan how readily we could apply those principles to the OKR process, growing our corporate strategic agility. With Dan’s guidance, we began implementing OKRs across the organization.

We started with an initial training session on OKRs at Flux7’s All Hands Meeting, followed by an in-depth training and project orientation session for company leads. This training was bolstered with a session with our co-founders to assess company strategy, goals and performance as well as prepare for the development of company OKRs with the leads.

With this foundation in place, we began drafting our company OKRs. While our leads helped pave the way, Dan was instrumental in reviewing drafts and providing feedback. With company OKRs in place, we next turned to team OKRs. Over the course of two weeks, our leads worked with team members to draft team OKRs based on corporate OKRs. We finalized OKRs with a workshop where we made sure everyone was in alignment for the upcoming quarter and our leads committed to integrating OKRs into weekly action planning and accomplishments moving forward.

OKR Lessons Learned

While we tried our hand at developing OKRs before we engaged with Dan, we learned a few important things through that first exercise, which were later underscored by his expertise:

  1. Less can be more.
    Regardless of the team or role, we found that people erred on the side of having more OKRs rather than fewer. We quickly realized that Dan’s “Start Less, Finish More” mantra was spot on: fewer OKRs mean we all have a laser focus on achieving key organizational goals, minimizing distractions and forcing the real prioritization that generates greater output.

    We have a rule of thumb that no team shall have more than two objectives, and we would recommend that others have no more than three OKRs per group. In this vein, we would also recommend no more than three to five key results per objective. For example, if People Ops has an objective to grow employee success, that might be measured through employee engagement, the percentage of employees who take advantage of professional development, and the percentage of employees taking part in the mentorship program.

  2. Cross-dependencies must be flagged.
    While our teams quickly grokked how OKRs roll up in support of top-level business goals, we could have done a better job initially of identifying OKR cross-dependencies between teams and individuals. Since one of the goals of OKRs is to improve employee engagement and teamwork, we quickly saw how imperative it is to flag any OKRs that bridge workgroups and/or individual employees. By ensuring that individuals are working in tandem and not duplicating efforts, we are able to maximize productivity.
  3. Transparency remains vital.
    Transparency has been a core value since we opened our doors in 2013, and the OKR process has served to highlight its importance in all we do. We are as transparent about OKRs as we are about everything else at Flux7; since moving to an OKR process, we have taken several steps to ensure transparency:
  • We have integrated a team-by-team discussion of OKRs into each of our monthly meetings, rotating which team members present progress.
  • Like everything else at Flux7, we encourage questions and participation from everyone.
  • We have created an OKR Trello board where team members can see progress to date on our quarterly OKRs.
  4. Translate quarterly OKRs to weekly actions.
    It is really important to map OKRs to weekly actions, as they are the stepping stones to reaching the broader goal. While we still have room for improvement here, assessing our progress to goal on a weekly basis allows us to track overall success more accurately and institute a course correction (when/if needed) in order to reach our ultimate OKR goal.

    Two things are worth noting here: First, mapping weekly actions to goals was easier for some groups than others, as the nature of some groups’ work has a naturally longer horizon. Second, we highly recommend setting quarterly OKRs; this cadence allows us to be aggressive and in tune with the fast-changing pace of the market while not so fast that we’re constantly reworking OKRs.

  5. Apply learning for constant improvement.
    Another core value at Flux7 is applying learning for constant improvement. After our first quarterly OKR setting, we took a hard look at what went well and what could be improved, and applied those lessons to our second round of OKR setting. They say the first pancake always comes out flat, and that proved true with our OKR process: the second set of OKRs came together much more smoothly, thanks to insight and guidance from Dan on what we were doing well and where we could improve.

OKRs and the Agile Enterprise

The Agile Enterprise is defined by its ability to create business value as it reacts to swift market changes. OKRs support this goal by replacing traditional goal-setting (a yearly top-down exercise) with quarterly bottom-up objectives and key results. We’ve seen the benefits first-hand:

  • As employees play a key role in developing the objectives and results that they are personally responsible for, they take ownership and accountability. They are invested in achieving results.
  • With ownership comes empowerment. Our employees know we trust them to create their own OKRs, take the reins, and drive the results. As Henrik Kniberg points out here, what we seek, and achieve, is Aligned Autonomy. The business needs alignment, which is what we get when everyone is bought in on the ultimate objectives. And teams need autonomy, which is what we get when people are empowered. The result: we can all row in the same direction very efficiently and effectively.
  • Last, with an agile-focused culture and a handful of objectives, we are all able to see clear progress toward our goals. As everyone feels they are a part of the company’s success, employee satisfaction grows, which creates a virtuous cycle of greater ownership, empowerment, and ultimately business value to customers, partners and shareholders.

Transition is hard; it is chaotic, and it doesn’t have easy answers. Having a guide who knows how to navigate these issues is important. Just as we learned from working with Dan, our customers learn from working with us: a partner who understands how to chart a path to the unique solutions that will work best for your enterprise is invaluable.

The Agile Enterprise extends beyond agile development or lean product management; it is a mindset that must permeate corporate strategy as well. OKRs can play an integral role in bringing agility to corporate strategy, in the process growing employee engagement, removing silos, and accelerating responsiveness to quickly changing market forces. Make sure you don’t miss the rest of the series on becoming an Agile Enterprise.

from Flux7 DevOps Blog

How CIOs Can Prepare an IT Platform for the Agile Enterprise

Today’s marketplace is volatile. It is uncertain. It is complex and difficult to navigate. And to stay competitive, enterprises must react to change with unprecedented speed. As many of the external pressures on business today stem from changes happening in the digital world, IT has naturally become one of the first areas to adopt change with an aim of helping the business become an Agile Enterprise.

To help CIOs embrace this as an opportunity to be a guiding force for the Agile Enterprise, we have just published a new paper on how technology leaders can prepare the IT platform to serve as an effective foundation for the Agile Enterprise.

Download the Paper Today

While achieving an Agile Enterprise must be rooted in the business and focused on reaching corporate goals, a technology platform that supports agility with IT automation and DevOps best practices can be a key lever in helping IT engage with and improve the business. As a result, in this new paper we discuss:

  • The tale of two digital transformations, examining what went well and the lessons we can all learn and apply to our businesses.
  • The role of an Agile Culture, particularly within IT, and how CIOs can set the right tone from the outset.
  • Five key areas of automation that CIOs should incorporate into the IT platform to ensure agility, grow IT productivity, and deliver specific business outcomes.
  • How an Enterprise DevOps Framework can give CIOs an IT platform that enables DevOps at scale, facilitates enterprise agility, and helps technology leaders deliver greater business value.

For organizations looking to leapfrog the competition and create an Agile Enterprise capable of competing effectively today, and well into the future, CIOs can be the change agents that drive responsiveness, starting with an agile IT culture and a flexible IT platform. As the pace of the market continues to accelerate, led by digitalization, technology leaders have a distinct opportunity to embrace and lead the Agile Enterprise, driving greater business value and business results.

Written by Flux7 Labs

Flux7 is the only Sherpa on the DevOps journey that assesses, designs, and teaches while implementing a holistic solution for its enterprise customers, giving clients the skills they need to manage and extend the technology going forward. Neither a reseller nor an MSP, Flux7 makes recommendations that are 100% focused on customer requirements: creating the most efficient infrastructure possible, one that automates operations, streamlines and enhances development, and supports specific business goals.

from Flux7 DevOps Blog

IT Modernization and DevOps News Week in Review

Underscoring why DevOps security continues to be the leading thing that keeps CIOs up at night, the newest data breach report from IBM and the Ponemon Institute finds that the average cost of a data breach has grown 12% since 2014. However, companies with an incident response team and extensive incident response testing were able to blunt some of a breach’s impact, reporting $1.23 million less in losses. Similarly, companies using encryption reduced the total cost of a breach by $360,000.

In related news, Palo Alto Networks unveiled its Summer 2019 Unit 42 Cloud Threat Risk Report, in which it found that “over the last 18 months, 65% of publicly disclosed cloud security incidents were due to misconfigurations, and 25% were due to account compromises.” Over the same period, cloud complexity has grown with the adoption of Docker and Kubernetes, opening the door to greater exposure.

Last, our DevOps consulting team enjoyed this blog, Manufacturers’ Digital Transformation Will Fail Without Both IT And OT, in which Forrester analyst Paul Miller discusses why manufacturers need to combine the best of both IT and OT to meet the needs of the business and its customers.

AWS News

  • Last week AWS announced AWS Chatbot, a new service that, according to the company, enables DevOps teams to receive AWS notifications and execute commands in Slack channels and Amazon Chime chat rooms with only minimal effort. While currently in beta, AWS Chatbot already supports Amazon CloudWatch, AWS Health, AWS Budgets, AWS Security Hub, Amazon GuardDuty and AWS CloudFormation.
  • Are PCI compliance and AWS security best practices important to your organization? If so, AWS wants you to know that it has expanded its PCI DSS certification scope by 79%, from 62 services to 111, including 12 newly added services: Amazon AppStream 2.0, Amazon CloudWatch, Amazon CloudWatch Events, Amazon Managed Streaming for Apache Kafka (Amazon MSK), AWS Amplify Console, AWS Control Tower, AWS CodeDeploy, AWS CodePipeline, AWS Elemental MediaConvert, AWS Elemental MediaLive, AWS Organizations, and AWS SDK Metrics for Enterprise Support.

Flux7 News

  • We are kicking off a new blog series about becoming an Agile Enterprise, starting with this week’s article on organizational structures. Many organizations embrace agile ways of working in an attempt to build faster, more customer-focused, and more resilient organizations. Read how we at Flux7 went about choosing an organizational structure that would best support our Agile Enterprise.
  • We are honored that Forrester has named Flux7 among the companies included in its Now Tech: Application Modernization and Migration Services Q1 2019 report. In the report, Flux7 is named a cloud development and AWS cloud migration services specialist serving markets in North America including software, finance, and life sciences. Download your complimentary copy today.

Written by Flux7 Labs

from Flux7 DevOps Blog