Tag: Automation

PostgreSQL Connection Pooling: Part 1 – Pros & Cons

A long time ago, in a galaxy far far away, ‘threads’ were a programming novelty rarely used and seldom trusted. In that environment, the first PostgreSQL developers decided that forking a process for each connection to the database was the safest choice. It would be a shame if your database crashed, after all.

Since then, a lot of water has flowed under that bridge, but the PostgreSQL community has stuck by their original decision. It is difficult to fault their argument, as it’s absolutely true that:

  • Each client having its own process prevents a poorly behaving client from crashing the entire database.
  • On modern Linux systems, the difference in overhead between forking a process and creating a thread is much smaller than it used to be.
  • Moving to a multithreaded architecture would require extensive rewrites.

However, in modern web applications, clients tend to open a lot of connections. Developers are often strongly discouraged from holding a database connection while other operations take place. “Open a connection as late as possible, close a connection as soon as possible”. But that causes a problem with PostgreSQL’s architecture – forking a process becomes expensive when transactions are very short, as the common wisdom dictates they should be. In this post, we cover the pros and cons of PostgreSQL connection pooling.

The PostgreSQL Architecture | Source

The Connection Pool Architecture

Using a modern language library does reduce the problem somewhat – connection pooling is an essential feature of most popular database-access libraries. It ensures ‘closed’ connections are not really closed, but returned to a pool, and ‘opening’ a new connection returns the same ‘physical connection’ back, reducing the actual forking on the PostgreSQL side.
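
As a minimal sketch of what such a client-side pool looks like, here is Python with psycopg2 (the connection details are examples):

    # A minimal sketch of client-side pooling, assuming Python with psycopg2
    # and example connection details.
    from psycopg2.pool import ThreadedConnectionPool

    pool = ThreadedConnectionPool(
        minconn=2,    # physical connections forked up front on the server
        maxconn=10,   # hard cap; psycopg2 raises PoolError beyond this
        dsn="dbname=appdb user=app password=secret host=localhost",
    )

    conn = pool.getconn()       # 'opening' hands back an existing connection
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())
    finally:
        pool.putconn(conn)      # 'closing' just returns it to the pool

    pool.closeall()             # only now do the backends actually exit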

The architecture of a generic connection pool

However, modern web applications are rarely monolithic, and often use multiple languages and technologies. Using a connection pool in each module is hardly efficient:

  • Even with a relatively small number of modules, and a small pool size in each, you end up with a lot of server processes. Context-switching between them is costly.
  • The pooling support varies widely between libraries and languages – one badly behaving pool can consume all resources and leave the database inaccessible by other modules.
  • There is no centralized control – you cannot use measures like client-specific access limits.

As a result, popular middleware options have been developed for PostgreSQL. These sit between the database and the clients, sometimes on a separate server (physical or virtual) and sometimes on the same box, and create a pool that clients can connect to. These middleware:

  • Are optimized for PostgreSQL and its rather unique architecture amongst modern DBMSes.
  • Provide centralized access control for diverse clients.
  • Allow you to reap the same rewards as client-side pools, and then some more (we will discuss these in more detail in our next posts)! Because the pooler speaks the PostgreSQL wire protocol, clients need no code changes; see the sketch after this list.
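
As a minimal sketch (assuming Python with psycopg2, a PgBouncer instance on its default port 6432, and an example host and credentials), moving a client from direct connections to pooled ones is just a change of port:

    # A minimal sketch, assuming psycopg2 and a PgBouncer instance listening
    # on its default port 6432 in front of PostgreSQL (host and credentials
    # are examples). The client code is identical either way, because the
    # pooler speaks the PostgreSQL wire protocol.
    import psycopg2

    # Direct connection: PostgreSQL forks a new backend for this client.
    direct = psycopg2.connect("dbname=appdb user=app password=secret "
                              "host=db.internal port=5432")

    # Pooled connection: PgBouncer hands out an already-forked backend.
    pooled = psycopg2.connect("dbname=appdb user=app password=secret "
                              "host=db.internal port=6432")

    for conn in (direct, pooled):
        with conn.cursor() as cur:
            cur.execute("SELECT version()")
            print(cur.fetchone()[0])
        conn.close()  # in the pooled case, only the pooler's slot closes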

PostgreSQL Connection Pooler Cons

A connection pooler is an almost indispensable part of a production-ready PostgreSQL setup. While there are plenty of well-documented benefits to using a connection pooler, there are some arguments to be made against using one:

  • Introducing middleware into the communication path inevitably adds some latency. However, when the pooler is located on the same host, and factoring in the overhead of forking a connection, this is negligible in practice, as we will see in the next section.
  • The middleware becomes a single point of failure. Using a cluster at this level can resolve the issue, but that adds complexity to the architecture.

Redundancy in middleware to avoid a single point of failure | Source

  • Middleware implies extra costs. You either need an extra server (or three), or your database server(s) must have enough resources to support a connection pooler, in addition to PostgreSQL.
  • Sharing connections between different modules can become a security vulnerability. It is very important that we configure Pgpool-II or PgBouncer to clean connections before they are returned to the pool (see the sketch after this list).
  • The authentication shifts from the DBMS to the connection pooler. This may not always be acceptable.

PgBouncer Authentication Model | Source

  • It increases the surface area for attack, unless access to the underlying database is locked down to allow access only via the connection pooler.
  • It creates yet another component that must be maintained, fine-tuned for your workload, regularly patched for security, and upgraded as required.
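
To make the connection-cleaning point concrete, here is a minimal sketch (Python with psycopg2 and an example DSN) of what a reset between clients does. PgBouncer’s server_reset_query setting, which typically runs DISCARD ALL in session pooling mode, performs essentially this before a server connection is reused:

    # A minimal sketch of what "cleaning" a pooled connection means, using
    # an example DSN. PgBouncer's server_reset_query (typically DISCARD ALL
    # in session pooling mode) runs a statement like this between clients.
    import psycopg2

    conn = psycopg2.connect("dbname=appdb user=app host=localhost")
    conn.autocommit = True  # DISCARD ALL cannot run inside a transaction
    with conn.cursor() as cur:
        # Session state a previous client might leave behind:
        cur.execute("SET application_name = 'client_a'")
        cur.execute("PREPARE fetch_one AS SELECT 1")
        # DISCARD ALL drops prepared statements, temporary tables, session
        # settings, advisory locks, etc., so the next client starts clean.
        cur.execute("DISCARD ALL")
        cur.execute("SHOW application_name")
        print(cur.fetchone())  # the default again, not 'client_a'
    conn.close()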

Should You Use a PostgreSQL Connection Pooler?

All of these problems are well discussed in the PostgreSQL community, however, and mitigation strategies ensure that the pros of a connection pooler far outweigh the cons. Our tests show that even a small number of clients can benefit significantly from using a connection pooler. Poolers are well worth the added configuration and maintenance effort.

In the next post, we will discuss one of the most popular connection poolers in the PostgreSQL world – PgBouncer, followed by Pgpool-II, and lastly a performance test comparison of these two PostgreSQL connection poolers in our final post of the series.

from High Scalability

Sponsored Post: Fauna, Sisu, Educative, PA File Sight, Etleap, PerfOps, Triplebyte, Stream

Who’s Hiring? 

  • Sisu Data is looking for machine learning engineers who are eager to deliver their features end-to-end, from Jupyter notebook to production, and provide actionable insights to businesses based on their first-party, streaming, and structured relational data. Apply here.
  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here! 

Cool Products and Services

  • Stateful JavaScript Apps. Effortlessly add state to your JavaScript apps with FaunaDB. Generous free tier. Try now!
  • Grokking the System Design Interview is a popular course on Educative.io (taken by 20,000+ people) that’s widely considered the best System Design interview resource on the Internet. It goes deep into real-world examples, offering detailed explanations and useful pointers on how to improve your approach. There’s also a no questions asked 30-day return policy. Try a free preview today.
  • PA File Sight – Actively protect servers from ransomware, audit file access to see who is deleting files, reading files or moving files, and detect file copy activity from the server. Historical audit reports and real-time alerts are built-in. Try the 30-day free trial!
  • For heads of IT/Engineering responsible for building an analytics infrastructure, Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike older enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own. Read stories from customers like Okta and PagerDuty, or try Etleap yourself.
  • PerfOps is a data platform that digests real-time performance data for CDN and DNS providers as measured by real users worldwide. Leverage this data across your monitoring efforts and integrate with PerfOps’ other tools such as Alerts, Health Monitors and FlexBalancer – a smart approach to load balancing. FlexBalancer makes it easy to manage traffic between multiple CDN providers, API’s, Databases or any custom endpoint helping you achieve better performance, ensure the availability of services and reduce vendor costs. Creating an account is Free and provides access to the full PerfOps platform.
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it’s easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring Devops and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, this includes apps with 30 million users. With your help we’d like to add a few zeros to that number. Check out the job opening on AngelList.
  • Advertise your product or service here! 

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


PA File Sight monitors file access on a server in real-time.

It can track who is accessing what, and with that information can help detect file copying, detect (and stop) ransomware attacks in real-time, and record the file activity for auditing purposes. The collected audit records include user account, target file, the user’s IP address and more. This solution does NOT require Windows Native Auditing, which means there is no performance impact on the server. Join thousands of other satisfied customers by trying PA File Sight for yourself. No sign up is needed for the 30-day fully functional trial.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they’re a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte’s multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


If you are interested in a sponsored post for an event, job, or product, please contact us for more information.

from High Scalability

Stuff The Internet Says On Scalability For October 11th, 2019

 Wake up! It’s HighScalability time:

Light is fast—or is it?

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. And I wrote Explain the Cloud Like I’m 10 for all who want to understand the cloud. On Amazon it has 57 mostly 5 star reviews (135 on Goodreads). Please consider recommending it. You’ll be a cloud hero.

Number Stuff:

  • 1,717,077,725: number of web servers in 2019. In 1994? 623.
  • 7,000,000,000,000: LinkedIn messages sent per day with Apache Kafka (sort of).
  • more than all telescopes ever—combined: data the LSST (Large Synoptic Survey Telescope) will collect in its first year. It will keep that pace for 10 years. That’s 15TB of data collected every night.
  • 3200 megapixels: the LSST camera sensor, 250x better than an iPhone’s, the equivalent of half a basketball court filled with 4K TVs to gather one raw image.
  • 4 million: new jobs created in Africa because of investment in cell phone networks.
  • 442%: ROI running Windows workloads on AWS; 56% lower five-year cost of operations; 98% less unplanned downtime; 31% higher internal customer satisfaction; 37% lower IT infrastructure costs; 75% more efficient IT infrastructure team; 26% higher developer productivity; 32% higher gross productivity.
  • several petabytes: logs generated per hour by millions of machines at Facebook. Scribe processes logs with an input rate that can exceed 2.5 terabytes per second and an output rate that can exceed 7 terabytes per second. 
  • lowest: spending on tech acquisitions is at its lowest quarterly level in nearly two years, due to rising global uncertainty coupled with slowing economic growth.
  • 5%-8%: per year battery energy density improvement. We expect the storage per unit mass and volume of batteries will probably plateau within 10 to 20 years. At the same time, the market penetration of lithium batteries is doubling every 4 to 5 years. 
  • 27: tech companies raised $100 million or more, taking in a total of $7.1 billion during the month of September.
  • 50%: price reduction for Intel’s Cascade Lake X-Series. 
  • 16x: how much faster Redis is at reading JSON blobs compared to PostgreSQL.
  • $225B: worth of Google cloud. @lacker: I was working at Google when AWS launched in 2006. The internal message was, we don’t want to build a competitor. We have the technology to compete in this area, but it is fundamentally a low-margin business, whereas reinvesting in our core business is high-margin, so we should keep our infrastructure to ourselves.
  • $21.9B: app (App Store 65%, Google Play 35%) revenue in Q3 2019, a 23% increase. WhatsApp is #1. TikTok is #2. Messenger is #3. Facebook is #4. Instagram is #5. Mobile gaming is 74% of total revenue. 
  • 77,000: virtual neurons simulated in real-time on a 1 million processor supercomputer. 

 Quotable Stuff:

  • @sh: What are some famous last words in your industry? @Liv_Lanes: “How does it scale?”
  • Erin Griffith: It is now difficult for “a growth-at-all-costs company burning hundreds of millions of dollars with negative unit economics” to get funding, he said. “This is going to be a healthy reset for the tech industry.”
  • @garykmrivers: Milk delivery 25 years ago was essentially a subscription service offering products with recyclable/reusable packaging, delivered by electric vehicles. Part of me thinks that if a techie firm were to have proposed this same idea today people would think it was incredible.
  • @investing_cit: Costco is a fascinating business. You know all those groceries you buy? Yeah, they basically sell those at break even and then make all of their profit from the $60 annual membership fees. This is the key. The company keeps gross margins as low as possible.  In turn, this gives it pricing authority. In other words, you don’t even look at the price because you know it’s going to be the best. Off of merchandise, Costco’s gross margins are only 11%. Compare this to Target. Gross margins are almost 30%. Or against Walmart. About 25%. The company sells its inventory typically before it needs to pay suppliers. In other words, suppliers do what Costco tells them to do. Costco has essentially aggregated demand which it can then leverage against its suppliers in the form of payment terms. See, the DSI and DPO are basically the same.  On top of this, Costco collects cash in about 4 days, so that’s the extent of the cash conversion cycle.
  • peterwwillis: In fact, I’m going to make a very heretical suggestion and say, don’t even start writing app code until you know exactly how your whole SDLC, deployment workflow, architecture, etc will work in production. Figure out all that crap right at the start. You’ll have a lot of extra considerations you didn’t think of before, like container and app security scanning, artifact repository, source of truth for deployment versions, quality gates, different pipelines for dev and prod, orchestration system, deployment strategy, release process, secrets management, backup, access control, network requirements, service accounts, monitoring, etc. 
  • Jessica Quillin: We are likely facing a new vision for work, one in which humans work at higher levels of productivity (think less work, but more output), thanks to co-existing with robots, working side-by-side personal robots, digital assistants, or artificial intelligence tools. Rather than being bogged down by easily automated processes, humans can leverage robots to focus on more abstract, creative tasks, bringing about new innovative solutions.
  • Edsger W. Dijkstra: Abstraction is not about vagueness, it is about being precise at a new semantic level.
  • @TooMuchMe: Tomorrow, the City of Miami will vote on whether to grant a 30-year contract on light poles that will have cameras, license plate readers and flood sensors. For free. The catch: Nothing would stop the contracting company from keeping all your data and selling it to others.
  • K3wp: I used to work in the same building as [Ken Thompson]. He’s a nice guy, just not one for small talk. Gave me a flying lesson (which terrified me!) once. My father compares him to Jamie Hyneman, which is apt. Just a gruff, no-nonsense engineer with no time or patience for shenanigans
  • Richard Lawson: McDonnell recalls, wistfully, the bygone days, when a creator could directly email the guy who ran YouTube’s homepage. These days, nearly every creator I spoke to seemed haunted and awed by the platform’s fabled algorithm. They spoke of it as one would a vague god or as a scapegoat, explaining away the fading of clout or relevance.
  • @Inc: A 60-year-old founder is 3 times as likely to found a successful startup as a 30-year-old founder.
  • Andy Greenberg: Elkins programmed his tiny stowaway chip to carry out an attack as soon as the firewall boots up in a target’s data center. It impersonates a security administrator accessing the configurations of the firewall by connecting their computer directly to that port. Then the chip triggers the firewall’s password recovery feature, creating a new admin account and gaining access to the firewall’s settings. 
  • DSHR: If running a Lightning Network node were to be even a break-even business, the transaction fees would have to more than cover the interest on the funds providing the channel liquidity. But this would make the network un-affordable compared with conventional bank-based electronic systems, which can operate on a net basis because banks trust each other.
  • Marc Benioff: What public markets do is indeed the great reckoning. But it cleanses [a] company of all of the bad stuff that they have. I think in a lot of private companies these days, we’re seeing governance issues all over the place. I can’t believe this is the way they were running internally in all of these cases. They are staying private way too long.
  • Benjamin Franklin: I began now gradually to pay off the debt I was under for the printing house. In order to secure my credit and character as a tradesman, I took care not only to be in reality industrious and frugal, but to avoid all appearances to the contrary. I dressed plainly; I was seen at no places of idle diversion; I never went out a-fishing or shooting; a book, indeed, sometimes debauched me from my work, but that was seldom, snug, and gave no scandal; and, to show that I was not above my business, I sometimes brought home the paper I purchased at the stores through the streets on a wheelbarrow. Thus, being esteemed an industrious, thriving young man, and paying duly for what I bought, the merchants who imported stationery solicited my custom; others proposed supplying me with books, and I went on swimmingly.
  • @brightball: “BMW’s greatest product isn’t a car, it’s the factory.” – Best quote from #SAFeSummit #SAFe @ScaledAgile
  • @robmay: As an example, some research shows that more automation in warehouses increases overall humans working in the industry.  Why?  Because when you lower the human labor costs of a warehouse, you can put more warehouses in smaller towns that weren’t economically feasible before.  Having more automation will, initially, increase the desire for human skills like judgment, empathy, and just good old human to human interaction in some fields. The important point here is that you can’t think linearly about what will happen. It’s not a 1:1 replacement of automation taking human jobs. It is complex, and will change work in many different ways.
  • Quinn: The most consistent mistake that everyone makes when using AWS—this extends to life as well—is once people learn something, they stop keeping current on that thing. There is an entire ecosystem of people who know something about AWS, with a certainty. That is simply no longer true, because capabilities change. Restrictions get relaxed. Constraints stop applying. If you learned a few years ago that there are only 10 tags permitted per resource, you aren’t necessarily keeping current to understand that that limit is now 50.
  • @BrianRoemmele: Consider: The 1843 facsimile machine invented by Alexander Bain a clock inventor. A clock synchronize movement of two pendulums for line-by-line scanning of a message. It wasn’t until the 1980s that network effect, cost of machines made it very popular. Mechanical to digital. 
  • @joshuastager: “If the ISPs had not repeatedly sued to repeal every previous FCC approach, we wouldn’t be here today.” – @sarmorris
  • @maria_fibonacci: – Make each program do one thing well. – Expect the output of every program to become the input to another, as yet unknown, program. I think the UNIX philosophy is very Buddhist 🙂
  • @gigastacey: People love their Thermomixers so much that of the 3 million connected devices they have sold, those who use their app have a 50% conversion to a subscription. That is an insane conversion rate. #sks2019
  • eclipsetheworld: I think this quote sums up my thoughts quite nicely: “When I was a product manager at Facebook and Instagram, building a true content-first social network was the holy grail. We never figured it out. Yet somehow TikTok has cracked the nut and leapfrogged everyone else.” — Eric Bahn, General Partner at Hustle Fund & Ex Instagram Product Manager
  • Doug Messier: The year 2018 was the busiest one for launches in decades. There were a total of 111 completely successful launches out of 114 attempts. It was the highest total since 1990, when 124 launches were conducted. China set a new record for launches in 2018. The nation launched 39 times with 38 successes in a year that saw a private Chinese company fail in the country’s first ever orbital launch attempt. The United States was in second place behind China with 34 launches. Traditional leader Russia launched 20 times with one failure. Europe flew eight times with a partial failure, followed by India and Japan with seven and six successful flights, respectively.
  • John Preskill: The recent achievement by the Google team bolsters our confidence that quantum computing is merely really, really hard. If that’s true, a plethora of quantum technologies are likely to blossom in the decades ahead.
  • Quinn: What people lose sight of is that infrastructure, in almost every case, costs less than payroll.
  • Lauren Smiley: More older people than ever are working: 63% of Americans age 55 to 64 and 20% of those over 65. 
  • Sparkle: These 12 to 18 core CPU have lost most of their audience. The biggest audience for these consumer CPU was video editing and streaming. Video encoding and decoding with Nvidia NVENC is 10 times faster and now has the same or higher quality than CPU encoding. Software like OBS, Twitch studio, Handbrake, Sony Vegas now all support NVENC. The only major software suite that doesn’t support NVENC officially yet is Premiere.
  • Timothy Prickett Morgan: To move data from DRAM memory on the PIM modules to one of the adjacent DPUs on the memory chips takes about 150 picoJoules (pJ) of energy, and this is a factor of 20X lower than what it costs to move data from a DRAM chip on a server into the CPU for processing. It takes on the order of 20 pJ of energy to do an operation on that data in the PIM DPU, which is inexplicably twice as much energy in this table. The server with PIM memory will run at 700 watts because that in-memory processing does not come for free, but we also do not think that a modern server comes in at 300 watts of wall power.
  • William Stein: The supply/demand pendulum has swung away from providers in favor of customers, with various new entrants bringing speculative supply online, while the most voracious consumers remain in digestion mode. Ultimately, we believe it’s a question of when, not if hyperscale procurement cycles enter their next phase of growth, and the pendulum can swing back the other direction quickly.
  • Jen Ayers: Big-game hunters are essentially targeting people within an organization for the sole purpose of identifying critical assets for the purpose of deploying their ransomware. [Hitting] one financial transaction server, you can charge a lot more for that than you could for a thousand consumers with ransomware—you’re going to make a lot more money a lot faster.
  • Eric Berger: Without the landing vision system, the rover would most likely still make it to Mars. There is about an 85% chance of success. But this is nowhere near good enough for a $2 billion mission. With the landing camera and software Johnson has led development of, the probability of success increases to 99%.
  • s32167: The headline should be “Intel urges everyone to use new type of memory that lowers performance for every CPU architecture to fix their own architecture security issues.”
  • Robert Haas: So, the “trap” of synchronous replication is really that you might focus on a particular database feature and fail to see the whole picture. It’s a useful tool that can supply a valuable guarantee for applications that are built carefully and need it, but a lot of applications probably don’t report errors reliably enough, or retry transactions carefully enough, to get any benefit.  If you have an application that’s not careful about such things, turning on synchronous replication may make you feel better about the possibility of data loss, but it won’t actually do much to prevent you from losing data.
  • Scott Aaronson: If you were looking forward to watching me dismantle the p-bit claims, I’m afraid you might be disappointed: the task is over almost the moment it begins. “p-bit” devices can’t scalably outperform classical computers, for the simple reason that they are classical computers. A little unusual in their architecture, but still well-covered by the classical Extended Church-Turing Thesis. Just like with the quantum adiabatic algorithm, an energy penalty is applied to coax the p-bits into running a local optimization algorithm: that is, making random local moves that preferentially decrease the number of violated constraints. Except here, because the whole evolution is classical, there doesn’t seem to be even the pretense that anything is happening that a laptop with a random-number generator couldn’t straightforwardly simulate. 
  • Handschuh: Adding security doesn’t happen by chance. In some cases it requires legislation or standardization, because there’s liability involved if things go wrong, so you have to start including a specific type of solution that will address a specific problem. Liability is what’s going to drive it. Nobody will do it just because they are so paranoid that they think that it must be done. It will be somebody telling them
  • Battery: The best marketing today—particularly mobile marketing—is not about providing a point solution but, instead, offering a broader technology ecosystem to understand and engage customers on their terms. The Braze-powered Whopper campaign, for instance, helped transform an app that had been primarily a coupon-delivery service into a mobile-ordering system that also offered a deeper connection to the Burger King brand.
  • Jakob: I think that we need to think of programming just like any other craft, trade, or profession with an intersection on everyday life: it is probably good to be able to do a little bit of it at home for household needs. But don’t equate that to the professional development of industrial-strength software.  Just like being able to use a screwdriver does not mean you are qualified to build a house, being able to put some blocks or lines of code together does not make you a programmer capable of building commercial-grade software.
  • @benedictevans: TikTok is introducing Americans to a question that Europeans have struggled with for 20 years: a lot of your citizens might use an Internet platform created somewhere that doesn’t know or care about your laws or cultural attitudes and won’t turn up to a committee hearing
  • Robert Pollack: So let me say something about our uniqueness, which is embedded in our DNA. Simple probabilities. Every base pair in DNA has four possible base pairs. Three billion letters long. Each position in the text could have one of four choices. So how many DNAs are there? There are four times four two-letter words in DNA, four for the first letter, four for the second—sixteen possible two-letter words. Sixty-four possible three-letter words. That is to say, how many possible human genomes are there? Four to the power 3 billion, which is to say a ridiculous, infinite number. There are only 10^80 elementary particles in the universe. Each of us is precisely, absolutely unique while we are alive. And in our uniqueness, we are absolutely different from each other, not by more or less, but absolutely different.

Useful Stuff:

  • After 2000 years of taking things apart into smaller things, we have learned that all matter is made of molecules, and that molecules are made of atoms. Has Reductionism Run its Course? Or in the context of the cloud: Has FaaS Run Its Course? The “everything is a function” meme is a form of reductionism. And like reductionism in science FaaS reductionism has been successful, as the “business value” driven crowd is fond of pointing out. But that’s not enough when you want to understand the secrets of the universe, which in this analogy is figuring how to take the next step in building systems. Lambda is like the Large Hadron Collider in that it confirmed the standard model, but hasn’t moved us forward. At some point we need to stop looking at functions and explore using some theory driven insight. We see tantalizing bits of a greater whole as we layer abstractions on top of functions. There are event busses, service meshes, service discovery services, work flow systems, pipelines, etc.—but these are all still part of the standard model of software development. Software development like physics is stuck looking for a deeper understanding of its nature, yet we’re trapped in a gilded cage of methodological reductionism. Like for physics,  “the next step forward will be a case of theory reduction that does not rely on taking things apart into smaller things.”
  • SmashingConf Freiburg 2019 videos are now available. You might like The Anatomy Of A Click
  • There’s a lot of energy and ideas at serverlessconf:
    • If you’re looking for the big picture: ServerlessConf NYC 2019: everything you missed
    • @jeremy_daly: Great talk by @samkroon. Every month, @acloudguru uses 240M Lambda calls, 180M API Gateway calls, and 90TB of data transfer through CloudFront. Total cost? ~$2,000 USD. #serverless #serverlessftw #Serverlessconf
    • @ryans140: We’re in a similar situation.  3 environments,  60+ microservices,  serverless datakake. $1400 a month.   Down from $12k monthly in a  vm based datacenter.
    • @gitresethard: This is a very real feeling at #Serverlessconf this year. There’s a mismatch between the promise of focusing on your core differentiators and the struggle with tooling that hasn’t quite caught up.
    • @hotgazpacho: “Kubernetes is over-hyped and elevating the least-interesting part of your application. Infrastructure should be boring.” – @lindydonna 
    • @QuinnyPig: Lambda: “Get a file” S3: “Here it is.” There’s a NAT in there.  (If it’s the Managed NAT gateway you pay a 4.5¢ processing fee / @awscloud tax on going about your business.) #serverlessconf
    • @ben11kehoe: Great part the @LEGO_Group serverless story: they started with a single Lambda, to calculate sales tax. Your journey can start with a small step! #Serverlessconf
    • @ryanjonesirl: Thread about @jeremy_daly talk at #Serverlessconf #Serverless is (not so) simple Relational and Lambda don’t mix well.
    • @jssmith: Just presented the Berkeley View on #Serverless at #Serverlessconf
      • Serverless is more than FaaS
      • Cloud programming simplified
      • Next phase in cloud evolution
      • Using servers will seem like using assembly language
      • Serverless computing will serve just about every use case
      • Serverless computing bill will converge to the serverful cost
      • Machine learning will play an important role in optimizing execution
      • Serverless computing will embrace heterogeneous hardware (GPU, TPU, etc) 
      • Serverful cloud computing will decline relative to serverless computing
  • Awesome writeup. Lots to learn on how to handle 4,200 Black Friday orders per minute, especially if you’re interested in running an ecommerce site on k8s in AWS using microservices. Building and running application at scale in Zalando
    • In our last Black Friday, we broke all the records of our previous years, and we had around 2 million orders. In the peak hour, we reached more than 4,200 orders per minute.
    • We have come a long way: we migrated from monolith to microservices around 2015. Nowadays, in 2019, we have more than 1,000 microservices. Our current tech organization is composed of more than 1,000 developers, and we are more than 200 teams. Every team is organized strategically to cover a customer journey, and also a business area. Every team can also have different team members with multidisciplinary skills like frontend, backend, data science, UX, research, product, whatever the team needs to fulfill its mission.
    • Since we have all of these things, we also have end-to-end responsibility for the services that every team has to manage…We also found out that it’s not easy when every team does things its own way, so we ended up having standard processes for how we develop software. This was enabled by the tools that our developer productivity team provides. Every team can easily start a new project, set it up, start coding, build it, test it, deploy it, monitor it, and so on, across the whole software development cycle
    • All our microservices run in AWS and Kubernetes. When we migrated from monolith to microservices, we also migrated to the cloud. We started using AWS services like EC2 instances and CloudFormation…All our microservices, not only checkout, but also lambda microservices, are running in containers. Every microservice environment is abstracted from our infrastructure.
    • After this, we also have frontend fragments, which are frontend microservices. Frontend microservices are services that provide server-side rendering of what we call fragments. A fragment is a piece of a page, for example, a header, a body, some content, or a footer. You can have one page where you see one thing, but every piece can be owned by a different team.
    • Putting it all together, we do retries of operations with exponential backoff. We wrap operations with a circuit breaker. We handle failures with fallbacks when possible. Otherwise, we have to make sure to handle the exceptions to avoid unexpected errors. (A sketch of this retry-with-backoff pattern appears after this list.)
    • Every microservice that we have has the same infrastructure. We have a load balancer that handles the incoming requests. Then this distributes the requests across the replicas of our microservice in multiple instances, or, if we are using Kubernetes, in multiple pods. Every instance is running with a Zalando-based image. This Zalando-based image contains a lot of things that are needed to be compliant, to be secure, to make sure that we have the right policies implemented, because we are a serious company, and because we take our business seriously
    • What we didn’t know is that having more instances also means having more database connections. Before, even with 26 million active customers using the website in different patterns, it was not a problem. Now, we have 10 times more instances creating connections to our Cassandra database. The poor Cassandra was not able to handle all of these connections.
    • When doing rollouts, consider keeping the same capacity for the traffic that you currently have. Otherwise, your service is likely to become unavailable just because you’ve introduced a new feature; you have to make sure that this is also handled.
    • For our Black Friday preparation, we have a business forecast that tells us we want to make this and that amount of orders, and then we also have load testing of the real customer journey
    • Then all the services involved in this journey are identified, and we do load testing on top of this. With this, we were able to do capacity planning, so we could scale our services accordingly, and we could also identify bottlenecks, or things that we might need to fix for Black Friday.
    • For every microservice that is involved in Black Friday, we also have a checklist where we review: are the architecture and dependencies reviewed? Are the possible points of failure identified and mitigated? Do we have reliability patterns in all the microservices that are involved? Are configurations adjustable without the need for a deployment?
    • we are one company doing Black Friday. Then we have another 100 companies or more also doing Black Friday. What happened to us already in one Black Friday, I think, or two, was that AWS ran out of resources. We don’t want to make a deployment and start new instances because we might get into the situation where we get no more resources in AWS
    • In the final day of Black Friday, we have a situation room. All teams that are involved in the services that are relevant for the Black Friday are gathered in one situation room. We only have one person per team. Then we are all together in this space where we monitor, and we support each other in case there is an incident or something that we need to handle
  • Videos from CppCon 2019 are now available. You might like Herb Sutter “De-fragmenting C++: Making Exceptions and RTTI More Affordable and Usable
  • Introducing SLOG: Cheating the low-latency vs. strict serializability tradeoff: Bottom line: there is a fundamental tradeoff between consistency and latency. And there is another fundamental tradeoff between serializability and latency… it is impossible to achieve both strict serializability and low latency reads and writes…By cheating the latency-tradeoff, SLOG is able to get average latencies on the order of 10 milliseconds for both reads and writes for the same geographically dispersed deployments that require hundreds of milliseconds in existing strictly serializable systems available today. SLOG does this without giving up strict serializability, without giving up throughput scalability, and without giving up availability (aside from the negligible availability difference relative to Paxos-based systems from not being as tolerant to network partitions). In short, by improving latency by an order of magnitude without giving up any other essential feature of the system, an argument can be made that SLOG is strictly better than the other strictly serializable systems in existence today.
  • Data races are very hard to find. Usually the way you find them is a late night call when a system locks up for no discernible reason. So it’s remarkable Google’s Kernel Concurrency Sanitizer (KCSAN) found over 300 data races within the Linux kernel. Here’s the announcement.
  • How much will you save running Windows on AWS? A lot says IDC in The Infrastructure Cost and Staff Productivity Benefits of Running High-Performing Windows Workloads in the AWS Cloud: Based on interviews with these organizations, IDC quantifies the value they will achieve by running Windows workloads on AWS at an average of $157,300 per 100 users per year ($6.59 million per organization)…IT infrastructure cost reductions: Study participants reduce costs associated with running on-premises environments and benefit from more efficient use of infrastructure and application licenses…IT staff productivity benefits: Study participants reduce the day-to-day burden on IT infrastructure, database, application management, help desk, and security teams and enable application development teams to work more effectively…Risk mitigation — user productivity benefits: Study participants minimize the operational impact of unplanned application outages…Business productivity benefits: Study participants better address business opportunities and provide their employees with higher-performing and more timely applications and features.
    • Food and beverage organization: “We definitely go ‘on the cheap’ to start with AWS because it’s easy just to add extra storage per server instance in seconds. We will spin up a workload with what we feel is the minimum, and then add to it as needed. It definitely has put us in a better place to utilize resources regarding services and infrastructure.”
    • Healthcare organization: Licensing cost efficiencies was one of the reasons we went to the cloud with AWS. The way that you collaborate these licensing contracts through AWS for software licenses versus having to buy the licenses on our own has already been more cost effective for us. We’re saving 10%.
  • A fun approach to learning SQL. NUKnightLab/sql-mysteries: There’s been a Murder in SQL City! The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.
  • Caching improves your serverless application’s scalability and performance. It helps you keep your cost in check even when you have to scale to millions of users. All you need to know about caching for serverless applications: Lambda auto-scales by traffic. But it has limits… if your traffic is very spiky then the 500/min limit will be a problem…Caching improves response time as it cuts out unnecessary roundtrips…My general preference is to cache as close to the end-user as possible…Where should you implement caching? Route53 as the DNS. CloudFront as the CDN. API Gateway to handle authentication, rate limiting and request validation. Lambda to execute business logic. DynamoDB as the database. (A sketch of in-function caching appears after this list.)
  • To quote the Good Place, “This is forked.” But in a good way. A Multithreaded Fork of Redis That’s 5X Faster Than Redis
    • In regards to why fork Redis in the first place, KeyDB has a different philosophy on how the codebase should evolve. We feel that ease of use, high performance, and a “batteries included” approach is the best way to create a good user experience. While we have great respect for the Redis maintainers it is our opinion that the Redis approach focusses too much on simplicity of the code base at the expense of complexity for the user. This results in the need for external components and workarounds to solve common problems.
    • KeyDB works by running the normal Redis event loop on multiple threads. Network IO, and query parsing are done concurrently. Each connection is assigned a thread on accept(). Access to the core hash table is guarded by spinlock. Because the hashtable access is extremely fast this lock has low contention. Transactions hold the lock for the duration of the EXEC command. Modules work in concert with the GIL which is only acquired when all server threads are paused. This maintains the atomicity guarantees modules expect.
    • @kellabyte: I’ve been saying for years the architecture of Redis has been poorly designed in its single-threaded nature among several other issues. KeyDB is a multi-threaded fork that attempts to fix some of these issues and achieves 5x the perf. Antirez has convinced a lot of people that whatever he says must be true 😛 Imagine running 64 instances of redis on a 64 core box? Oh god haha…I do. Having built Haywire up to 15 million HTTP requests/second using the same architecture myself I believe the numbers. It’s good engineering.
  • Frugal computing: Companies care about cheap computing…How can we trade-off speed with monetary cost of computing?…With frugal computing, we should try to avoid the cost of state synchronization as much as possible. So work should be done on one machine if it is cheaper to do so and the generous time budget is not exceeded…Memory is expensive but storage via local disk is not. And time is not pressing. So we can consider out-of-core execution, juggling between memory and disk…Communication costs money. So batching communication and trading off computation with communication…We may then need schemes for data-naming (which may be more sophisticated then simple key), so that a node can locate the result it needs in S3 instead of computing itself. This can allow nodes to collaborate with other nodes in an asynchronous, offline, or delay-tolerant way…In frugal computing, we cannot afford to allocate extra resources for fault-tolerance, and we need to do in a way commensurate with the risk of fault and the cost of restarting computation from scratch. Snapshots that are saved for offline collaboration may be useful for building frugal fault-tolerance.
  • A good summary from DevSecCon Seattle 2019 Round Up
  • Corruption is a workaround; it’s a utility in a place where there are few better options to solve a problem. Innovation is the antidote to corruption. Corruption is not the problem hindering our development. In fact, conventional thinking on corruption and its relationship to development is not only wrong, it’s holding many poor countries back…many programs fail to reduce corruption because we have the equation backwards. Societies don’t develop because they’ve reduced corruption, they are able to reduce corruption because they’ve developed. And societies develop through investment in innovation…there’s a relationship between scarcity and corruption, in most poor countries way too many basic things are scarce…this creates the perfect breeding ground for corruption to occur…investing in businesses that make things affordable and accessible to more people attacks this scarcity and creates the revenues for governments to reinvest in their economies. When this happens on a country-wide level it can revolutionize nations…as South Korea became prosperous it was able to transition from an authoritarian government to a democratic government and has been able to reinvest in building its institutions, and this has paid off…what we found when we looked at the most prosperous countries today is that they were able to reduce corruption as they became prosperous, not before.
  • My take on: Percona Live Europe and ProxySQL Technology Day: It goes without saying that MySQL was predominant and the more interesting tracks were there. This is not because I come from MySQL, but because the ecosystem was helping the track to be more interesting. Postgres had some interesting talks, but let us say clearly, we had just a few from the community. Mongo was really low in attendees. The number of attendees during the talks and the absence of the MongoDB community clearly indicated that the event is not in the area of interest of MongoDB users.
  • Put that philosophy degree to work. Study some John Stuart Mill and you’re ready for a job in AI. What am I talking about? Peter Norvig in Artificial Intelligence: A Modern Approach talks about how AI started out by defining AI as maximize expected utility; just give us the utility function and we have all these cool techniques on how optimizing them. But now we’re saying maybe the optimization part is the easy part and the hard part is deciding what is my utility function. What do we want as a society? What is utility? Utilitarianism is filled with just these kind of endless debates. And as usual when you dive deep absolutes fade away and what remains are shades of grey. As of yet there’s no utility calculus. So if you’re expecting AI to solve life’s big questions it turns out we’ll need to solve them before AI can.
  • You too can use these techniques. Walmart Labs on Here’s What Makes Apache Flink scale:
    • I have been using Apache Flink in production for the last three years, and every time it has managed to excel at any workload that is thrown at it. I have run Flink jobs handling datastream at more than 10 million RPM with not more than 20 cores.
    • Reduce Garbage Collection – Flink takes care of this by managing memory itself.
    • Minimize data transfer – several mapping and filter transformations are done sequentially in a single slot. This chaining minimizes the sharing of data between slots and multiple JVM processes. As a result, jobs have a low network I/O, data transfer latencies, and minimal synchronization between objects.
    • Squeeze your bytes – To avoid storing such heavy objects, Flink implements its serialization algorithm, which is much more space-efficient.
    • Avoid blocking everyone – Flink revamped its network communications after Flink 1.4. This new policy is called credit-based flow control. Receiver sub-tasks announce how many buffers they have left to sender sub-tasks. When a sender becomes aware that a receiver doesn’t have any buffers left, it merely stops sending to that receiver. This helps in preventing the blocking of TCP channels with bytes for the blocking sub-task.
  • A good experience report from The Full Stack Fest Experience 2019
  • Places to intervene in a system: 12. Constants, parameters, numbers (such as subsidies, taxes, standards); 11. The sizes of buffers and other stabilizing stocks, relative to their flows; 10. The structure of material stocks and flows (such as transport networks, population age structures); 9. The lengths of delays, relative to the rate of system change; 8. The strength of negative feedback loops, relative to the impacts they are trying to correct against; 7. The gain around driving positive feedback loops; 6. The structure of information flows (who does and does not have access to information); 5. The rules of the system (such as incentives, punishments, constraints); 4. The power to add, change, evolve, or self-organize system structure; 3. The goals of the system; 2. The mindset or paradigm out of which the system — its goals, structure, rules, delays, parameters — arises; 1. The power to transcend paradigms.
  • The big rewrite can work, but perhaps the biggest lesson is big design up front is almost always a losing strategy. Why we decided to go for the Big Rewrite: We used to be heavily invested into Apache Spark – but we have been Spark-free for six months now…One of our original mistakes (back in 2014) had been that we had tried to “future-proof” our system by trying to predict our future requirements. One of our main reasons for choosing Apache Spark had been its ability to handle very large datasets (larger than what you can fit into memory on a single node) and its ability to distribute computations over a whole cluster of machines. At the time, we did not have any datasets that were this large. In fact, 5 years later, we still do not…With hindsight, it seems obvious that divining future requirements is a fool’s errand. Prematurely designing systems “for scale” is just another instance of premature optimization…We do not need a distributed file system, Postgres will do…We do not need a distributed compute cluster, a horizontally sharded compute system will do…We do not need a complicated caching system, we can simply cache whole datasets in memory instead…We do not need cluster-wide parallelism, single-machine parallelism will do…We do not need to migrate the storage layer and the compute layer at the same time, we can do one after the other…Avoid feature creep…Test critical assumptions early…Break project up into a dependency tree…Prototype as proof-of-concept…Get new code quickly into production…Opportunistically implement new features…Use black-box testing to ensure identical behavior…Build metrics into the system right from the start…Single-core performance first, parallelism later.
  • Interesting mix of old and new. What is the technology behind Nextdoor in 2019? 
    • Deploying to production 12–15 times. Inserting billions of rows to our Postgres and DynamoDB tables. Handling millions of user sessions concurrently. 
    • Django Framework for web applications; NGINX and uWSGI to serve our Python 3 code behind an Amazon Elastic Load Balancer; Conda to manage our Python environments; MyPy to add type safety to the codebase.
    • PostgreSQL is the database. Horizontal scaling uses a combination of application-specific read replicas as well as a connection pooler (PgBouncer); a Load Balancer is used as a custom microservice in front of the databases; DynamoDB for documents that need fast retrieval.
    • Memcached and HAProxy help with performance; Redis via ElastiCache is used to use the right data type for the job; CloudFront as the CDN; SQS for job queues.
    • Jobs are consumed off SQS using a custom Python-based distributed job processor called Taskworker. They built a cron-type system on top of Taskworker.
    • Microservices are written in Go and use gorilla/mux as the router. Zookeeper for service configuration. Communicating between services uses a mix of SQS, Apache Thrift and JSON APIs. Storage is mostly DynamoDB. 
    • Most data processing is done via AirFlow, which aggregates PostgreSQL data to S3 that then loads it into Presto.
    • For Machine Learning: Scikit-Learn, Keras, and Tensorflow.
    • Services are deployed as Docker images, using docker-compose for local development and ECS / Kubernetes for prod/staging environments.
    • Considering moving everything to k8s in the future.
    • Python deployments are done via Nextdoor/conductor, a Go app in charge of continuously releasing our application via Trains – a group of commits to be delivered together. Releases are made using CloudFormation via Nextdoor/Kingpin.
    • React and Redux on the frontend speaking GraphQL and JSON APIs. 
    • PostGIS extension is used for spatial operations using libraries like GDAL and GEOS for spatial algorithms and abstractions, and tools like Mapnik and the Google Maps API to render map data.
    • Currently in the process of developing a brand new data store and custom processing pipeline to manage the high volume of geospatial data they expect to store (1B+ rows) as they expand internationally.
  • How LinkedIn customizes Apache Kafka for 7 trillion messages per day
    • At LinkedIn, some larger clusters have more than 140 brokers and host one million replicas in a single cluster. With those large clusters, we experienced issues related to slow controllers and controller failure caused by memory pressure. Such issues have a serious impact on production and may cause cascading controller failure, one after another. We introduced several hotfix patches to mitigate those issues—for example, reducing controller memory footprint by reusing UpdateMetadataRequest objects and avoiding excessive logging.
    • As we increased the number of brokers in a cluster, we also realized that slow startup and shutdown of a broker can cause significant deployment delays for large clusters. This is because we can only take down one broker at a time for deployment to maintain the availability of the Kafka cluster. To address this deployment issue, we added several hotfix patches to reduce startup and shutdown time of a broker (e.g., a patch to improve shutdown time by reducing lock contention). 
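
Referring back to the Zalando item above: here is a minimal sketch, in Python and not Zalando’s actual code, of the reliability pattern it describes, retries with exponential backoff plus a fallback. All names are illustrative:

    # Retry an operation with exponential backoff and jitter; degrade to a
    # fallback instead of failing the whole request. Illustrative names only.
    import random
    import time

    def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                          fallback=None):
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    break
                # Sleep 0.1s, 0.2s, 0.4s, ... plus jitter, so many clients
                # don't retry in lockstep after a shared failure.
                time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
        if fallback is not None:
            return fallback()  # e.g. an empty cart instead of an error page
        raise RuntimeError("operation failed after retries; no fallback given")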
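
And for the serverless caching item above: a minimal sketch of caching inside the function itself, exploiting the fact that module-level state survives warm invocations of the same Lambda execution environment. The fetch function is a hypothetical expensive call (say, a DynamoDB read):

    # Memoize an expensive lookup across warm Lambda invocations with a TTL.
    import time

    _cache = {}          # module-level: survives while the container is warm
    _TTL_SECONDS = 60

    def cached(key, fetch):
        entry = _cache.get(key)
        if entry and time.time() - entry[1] < _TTL_SECONDS:
            return entry[0]               # warm hit: no round trip at all
        value = fetch(key)                # cold or expired: pay the round trip
        _cache[key] = (value, time.time())
        return value

    def handler(event, context):
        flags = cached("feature-flags", fetch=lambda k: {"example": k})
        return {"statusCode": 200, "body": str(flags)}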

Soft Stuff:

  • Hydra (article): a framework for elegantly configuring complex applications. Hydra offers an innovative approach to composing an application’s configuration, allowing changes to a composition through configuration files as well as from the command line.
  • uttpal/clockwork (article): a general purpose distributed job scheduler. It offers a horizontally scalable scheduler with at-least-once delivery guarantees. The currently supported task delivery mechanism is Kafka; at task execution time the schedule data is pushed to the given Kafka topic.
  • linkedin/kafka: the version of Kafka running at LinkedIn. Kafka was born at LinkedIn. We run thousands of brokers to deliver trillions of messages per day. We run a slightly modified version of Apache Kafka trunk. This branch contains the LinkedIn Kafka release.
  • serverlessunicorn/ServerlessNetworkingClients (article): Serverless Networking adds back the “missing piece” of serverless functions, enabling you to perform distributed computations, high-speed workflows, easy to use async workers, pre-warmed capacity, inter-function file transfers, and much more.

Pub Stuff: 

  • LSST Active Optics System Software Architecture:  In this paper, we describe the design and implementation of the AOS. More particularly, we will focus on the software architecture as well as the AOS interactions with the various subsystems within LSST.
  • Content Moderation for End-to-End Encrypted Messaging: I would like to reemphasize the narrow goal of this paper: demonstrating that forms of content moderation may be technically possible for end-to-end secure messaging apps, and that enabling content moderation is a different problem from enabling law enforcement access to content. I am not yet advocating for or against the protocols that I have described. But I do see enough of a possible path forward to merit further research and discussion.
  • SLOG: Serializable, Low-latency, Geo-replicated Transactions (article): For decades, applications deployed on a world-wide scale have been forced to give up at least one of (1) strict serializability (2) low latency writes (3) high transactional throughput. In this paper we discuss SLOG: a system that avoids this tradeoff for workloads which contain physical region locality in data access. SLOG achieves high-throughput, strictly serializable ACID transactions at geo-replicated distance and scale for all transactions submitted across the world, all the while achieving low latency for transactions that initiate from a location close to the home region for data they access. Experiments find that SLOG can reduce latency by more than an order of magnitude relative to state-of-the-art strictly serializable geo-replicated database systems such as Spanner and Calvin, while maintaining high throughput under contention.
  • FCC-hh: The Hadron Collider: This report contains the description of a novel research infrastructure based on a highest-energy hadron collider with a centre-of-mass collision energy of 100 TeV and an integrated luminosity of at least a factor of 5 larger than the HL-LHC. It will extend the current energy frontier by almost an order of magnitude. The mass reach for direct discovery will reach several tens of TeV, and allow, for example, the production of new particles whose existence could be indirectly exposed by precision measurements during the earlier preceding e+e− collider phase.

from High Scalability

Stuff The Internet Says On Scalability For October 4th, 2019

Wake up! It’s HighScalability time:

SpaceX ready to penetrate space with their super heavy rocket. (announcement)

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. And I wrote Explain the Cloud Like I’m 10 for all who want to understand the cloud. On Amazon it has 57 mostly 5 star reviews (135 on Goodreads). Please recommend it. They’ll love you even more.

Number Stuff: 

  • 94%: lost value in Algorand cryptocurrency in first three months.
  • 38%: increase in machine learning and analytics driven predictive maintenance in manufacturing in the next 5 years.
  • 97.5%: Roku channels tracked using Doubleclick. Nearly all TVs tested contact Netflix, even without a configured Netflix account. Half of TVs talk to tracking services. 
  • 78: SpaceX launches completed in 11 years.
  • 70: countries that have had disinformation campaigns. 
  • 99%: of misconfigurations go unreported in the public cloud.
  • 12 million: reports of illegal images of child sex abuse on Facebook Messenger in 2018.
  • 40%: decrease in trading volume on some major crypto exchanges last month. 
  • 2016: year of peak global smartphone shipments. 
  • 400,000: trees a drone can plant per day. It fires seed missiles into the ground. In less than a year the trees are 20 inches tall.
  • 14 million: Uber rides per day.
  • 90%: large, public game companies (Epic, Ubisoft, Nintendo) run on the AWS cloud. 
  • $370B: public cloud spending by 2022.
  • C: language with the most curse words in the comments.
  • 45 million: DynamoDB TPS. Bitcoin? 14 transactions per second.
  • 50,000: Zoho requests per second.
  • 700 nodes: all it takes for Amazon to cover Los Angeles with a 900 MHz network.
  • One Quadrillion: per day real-time metrics at Datadog.

Quotable Stuff:

  • @Hippotas: The LSM tree is the workhorse for many modern data management systems. Last week we lost Pat O’Neil, one of the inventors of the LSM tree. Pat had great impact in the field of databases (LSM, Escrow xcts, LRU-K, Bitmap Indices, Isolation levels, to name few). He will be missed.
  • @ofnumbers: pro-tip: there is no real point in using a blockchain – much less a licensed “distributed ledger” – for internal use within a silo’ed organization.  marketing that as a big innovative deal is disingenuous.
  • @amcafee: More on this: US digital industries have exploded over the past decade (and GDP has grown by ~25%), yet total electricity use has been ~flat. This is not a coincidence; it’s cause and effect. Digitization makes the entire economy more energy efficient.
  • ICO: GDPR and DPA 2018 strengthened the requirement for organisations to report PDBs. As a result, we received 13,840 PDB reports during 2018-19, an increase from 3,311 in 2017-18.

  • Jeanne Whalen: Demand for labeling is exploding in China as large tech companies, banks and others attempt to use AI to improve their products and services. Many of these companies are clustered in big cities like Beijing and Shanghai, but the lower-tech labeling business is spreading some of the new-tech money out to smaller towns, providing jobs beyond agriculture and manufacturing.
  • IDC: Overall, the IT infrastructure industry is at crossing point in terms of product sales to cloud vs. traditional IT environments. In 3Q18, vendor revenues from cloud IT environments climbed over the 50% mark for the first time but fell below this important tipping point since then. In 2Q19, cloud IT environments accounted for 48.4% of vendor revenues. For the full year 2019, spending on cloud IT infrastructure will remain just below the 50% mark at 49.0%. Longer-term, however, IDC expects that spending on cloud IT infrastructure will grow steadily and will sustainably exceed the level of spending on traditional IT infrastructure in 2020 and beyond.
  • Brian Roemmele: I can not overstate enough how important this Echo Network will become. This is Amazon owning the entire stack. Bypassing the ancient cellular network concepts and even the much heralded 5G networks.
  • Ebru Cucen: Why serverless? Everyone from management to engineering wanted serverless. It was the first time on a project that everyone was on board.
  • Jessica Kerr: Every piece of software and infrastructure that the big company called a capital investment, that they value because they put money into it, that they keep using because it still technically works — all of this weight slows them down.
  • @Obdurodon: Saw someone say recently that bad code crowds out good code because good code is easy to change and bad code isn’t. It’s not just code. (1/2)
  • Paul Nordstrom: spend more time talking to your users about how they would use your system, show your design to more people, shed the ego and shed this need for secrecy if you can, so that you get a wider spectrum of people who can tell you, I’m going to use it like this. Then, when you run into the inevitable problem, having done that work beforehand, your system will have a cleaner design and you’ll have this mathematical model.
  • @shorgio: Hotels are worse for long term rent prices.  Airbnb keeps hotel profits in check.  Without Airbnb, hotel margins grow so there is an incentive to rezone to build more hotels, which can’t be converted back into actual homes
  • @techreview: In January, WhatsApp limited how often messages can be forwarded—to only five groups instead of 256—in an attempt to slow the spread of disinformation. New research suggests that the change is working.
  • @dhh: The quickest way to ruin the productivity of a small company is to have it adopt the practices of a large company. Small companies don’t just need the mini version of whatever maxi protocol or approach that large companies use. They more often than not need to do the complete opposite.
  • @random_walker: When we watch TV, our TVs watch us back and track our habits. This practice has exploded recently since it hasn’t faced much public scrutiny. But in the last few days, not one but *three* papers have dropped that uncover the extent of tracking on TVs. Let me tell you about them.
  • @lrvick: The WhatsApp backdoor is now public and official. I have said this many times: there is no future for privacy or security tools that are centralized or proprietary. If you can’t decentralize it some government will strongarm you for access.
  • Rahbek: The global pattern of biodiversity shows that mountain biodiversity exhibits a visible signature of past evolutionary processes. Mountains, with their uniquely complex environments and geology, have allowed the continued persistence of ancient species deeply rooted in the tree of life, as well as being cradles where new species have arisen at a much higher rate than in lowland areas, even in areas as amazingly biodiverse as the Amazonian rainforest
  • @mipsytipsy: key honeycomb use cases.  another: “You *could* upgrade your db hardware to a m2.4xl.  Or you could sum up the db write lock time held, break down by app, find the user consuming 92% of all lock time, realize they are on your free tier…and throttle that dude.”
  • Dale Rowe: The internet is designed as a massive distributed network with no single party having total control. Fragmenting the internet (breaking it down into detached networks) would be the more likely result of an attempt. To our knowledge this hasn’t been attempted but one would imagine that some state actors have committed significant research to develop internet kill switches.
  • @cloud_opinion: Although we typically highlight issues with GCP, there are indeed some solid products there – have been super impressed with GKE – its solid, priced right and works great. Give this an A+.
  • David Wootton: This statement seems obvious to us, so we are surprised to discover that the word competition was a new one in Hobbes’ time, as was the idea of a society in which competition is pervasive. In the pre-Hobbesian world, ambition, the desire to get ahead and do better than others, was universally condemned as a vice; in the post-Hobbesian world, it became admirable, a spur to improvement and progress.
  • John Currey: What’s really nice with the randomization is that every node is periodically checking every other node. They’re not checking that particular node so often, but collectively, all the nodes are still checking all of the other nodes. This greatly reduces the chance of a particular node failure not being discovered
  • E-Retail Expansion Report: With over $140 billion in ecommerce sales to consumers in other countries, some U.S. retailers are thinking globally. But only half of U.S. retailers in the Internet Retailer Top 1000 accept online orders from consumers in other countries. The most common way Top 1000 e-retailers sell to shoppers in foreign nations is by accepting orders on their primary websites and then shipping parcels abroad. However, only 46.4% of these retailers ship to the United Kingdom, and 43.4% ship to Japan, two of the largest ecommerce markets. Larger retailers are more likely than smaller ones to ship to foreign addresses, with 70.1% of Top 1000 retailers ranked Nos. 1-100 shipping outside of North America, compared to only 48.4% of those ranked 901-1000
  • Ruth Williams: The finding that a bacterium within a bacterium within an animal cell cooperates with the host on a biosynthetic pathway suggests the endosymbiont is, practically speaking, an organelle.

Useful Stuff: 

  • WhatsApp experiences a connection churn of 600k to 1.5 million connections per second. WhatsApp is famous for using very few servers running Erlang in their core infrastructure. With the 2014 Facebook acquisition a lot has changed, but a lot hasn’t changed too. Seems like they’ve kept that same Erlang spirit. Here’s a WhatsApp update on Scaling Erlang Cluster to 10,000 Nodes
    • Grew from 200m users in 2013 to 1.5 billion in 2018, so they needed more processing power as they added more features and users. In the process they were moving from SoftLayer (IBM, FreeBSD, Erlang R16) to Facebook’s infrastructure (Open Compute, Linux, Erlang R21) after the 2014 acquisition. This required moving from large, powerful dual-socket servers to tiny blades with a max of 32 gigs of RAM; Facebook’s approach is to pack a lot of servers into a tiny space. They had to move to Erlang R21 to get the networking performance and connection density on Linux that they had on FreeBSD. Now they have a combination of old and new machines in a single cluster, and they went from just a few servers to 10,000 smaller Facebook servers.
    • An Erlang cluster is a mesh: every node connects to every other node in the cluster. That’s a lot of connections, but not a problem, because a million users are assigned to a single server, so adding 10,000 connections to a server is no big deal. They put 1,500 nodes in a single cluster with no connection problems. The problem is discovery: when a user on one server talks to a user on a different server. They use two process registries. One is centralized, for high-rate registrations, and acts as a session manager for phones connecting to servers: every time a phone connects, it registers itself in a session manager. A second process registry uses pg2 and globally replicated state for rare changes. A phone connects to an Erlang server called a chat node. When a phone wants to talk to another phone, it asks a session manager which server that phone is connected to. They see a connection churn of 600k to 1.5 million connections per second. pg2 is used for service discovery, mapping servers to services. Phone numbers are hashed to servers (a minimal sketch of this routing scheme follows this list). Meta-clusters are clusters of services (chat, offline, session, contacts, notifications, groups) that are mesh-connected as needed. Even with all their patches they can’t scale pg2 to 1,500 nodes, so clusters are connected with wandist, a custom service.
    • It wasn’t easy to move from FreeBSD to Linux: kqueue is awesome and epoll is not as awesome. Erlang R21 supports multiple poll sets, so it leverages existing Linux network capabilities. With kqueue you can update a million file descriptors with a single call; with epoll you would need a million individual kernel calls, and given recent security concerns, system calls are not as cheap as you would like them to be.
    • As in 2014, most scalability problems are caused by a lack of concurrency, which means locking bottlenecks. Bottlenecks must be identified and fixed. Routing performance was a problem, and moving to multiple datacenters meant dealing with long-range communication, which added more latency. Some bottlenecks were found and overcome by adding more concurrency and more workers. Another problem: SSL is really slow on Erlang.
    • There were also lots of Erlang bugs they had to fix. The built-in tools are great for fixing problems; the first line of defense is the built-in inspection facilities. For distributed problems they use MSACC (microstate accounting) with extra accounting turned on. Lock counting is the tool for finding lock contention. Since Erlang is open source, you can change the code to help with debugging.
    • Erlang is getting better so many of the patches they made originally are no longer needed. For example, Erlang introduced off heap messages to reduce garbage collection pressure. But as WhatsApp grows they run into new bottlenecks, like the need for SSL/TLS handshake acceleration. WhatsApp adds more monitoring, statistics, wider lock tables, more concurrency. Some of these patches will go upstream, but many never will. The idea is because Erlang is open source you can make your own version. They are now trying to be more open and push more of their changes upstream.
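The routing scheme above is simple to sketch. Here is a minimal, hypothetical Go rendering (WhatsApp's real system is Erlang and far more involved; all names here are invented): hash a phone number onto a chat node, and record live sessions in a session manager.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// chatNodes stands in for the cluster's chat servers; names are hypothetical.
var chatNodes = []string{"chat-0", "chat-1", "chat-2", "chat-3"}

// nodeFor hashes a phone number onto a chat node, mirroring the
// "phone numbers are hashed to servers" idea described above.
func nodeFor(phone string) string {
	h := fnv.New32a()
	h.Write([]byte(phone))
	return chatNodes[h.Sum32()%uint32(len(chatNodes))]
}

// sessionManager is the high-rate registry: phone -> node it connected to.
type sessionManager struct {
	mu       sync.RWMutex
	sessions map[string]string
}

func (s *sessionManager) register(phone, node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[phone] = node
}

func (s *sessionManager) lookup(phone string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	node, ok := s.sessions[phone]
	return node, ok
}

func main() {
	sm := &sessionManager{sessions: map[string]string{}}
	sm.register("+15551234567", nodeFor("+15551234567"))
	if node, ok := sm.lookup("+15551234567"); ok {
		fmt.Println("route messages via", node)
	}
}
```

In the real system the registry itself is distributed and must absorb that 600k–1.5M connections/second churn; the sketch only shows the two lookups involved in routing a message.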
  • eBay created a 5000 node k8s cluster for their cloud platform. Here’s how they made it workish. Scalability Tuning on a Tess.IO Cluster.
    • To achieve the reliability goal of 99.99%, we deploy five master nodes in a Tess.IO cluster to run Kubernetes core services (apiserver, controller manager, scheduler, etcd, etcd sidecar, etc.). Besides core services, there are also Tess add-ons on each node that expose metrics, set up networks, or collect logs. All of them watch the resources they care about from the cluster control plane, which puts additional load on the Kubernetes control plane. All the IPs used by the pod network are globally routable in the eBay data center; the network agent on each node is in charge of configuring the network on the host.
    • There were problems: Failed to recover from failures on cluster with: 5k nodes, 150k pods; Pod scheduling is slow in a large cluster; Large list requests will destroy the cluster; Etcd keeps changing leaders.
    • There were solutions, but it took a lot of work. If you aren’t eBay it might be difficult to pull off.
  • The Evolution of Spotify Home Architecture. This is a common story these days. The move from batch to streaming; the move from running your own infrastructure to moving to the cloud; the move from batch recommendations to real-time recommendations; the move from relative simplicity to greater system complexity; the move from more effort put into infrastructure to more effort being put into product.
    • At Spotify, we have 96 million subscribers, 207 million monthly active users, we’ve paid out over €10 billion to rights holders. There are over 40 million songs on our platform, over 3 billion playlists on our service, and we’re available in 79 markets.
    • We were running a lot of Hadoop jobs back in 2016. We had a big Hadoop cluster, one of the largest in Europe at the time, and we were managing our services and our databases in-house, so we were running a lot of things on-premise. Experimentation in the system can be difficult. Let’s say you have a new idea for a shelf, a new way you want to make a recommendation to a user, there’s a lot in the system that you need to know about to be able to get to an A/B test. There’s also a lot of operational overhead needed to maintain Cassandra and Hadoop. At that time we were running our own Hadoop cluster, we had a team whose job it was just to make sure that that thing was running.
    • We started to adopt services in 2017, at the time when Spotify was investing in and moving to GCP. What are some of the cons of this? As we added more and more content and more and more recommendations for users, it took longer to load home, because we were computing those recommendations on the request path. And since we didn’t store these recommendations anywhere, if for some reason the request failed, the user would just see nothing on the homepage, which is a very bad experience.
    • In 2018, Spotify invested heavily in moving the data stack to Google Cloud as well. Today, we use a combination of streaming pipelines and services to compute the recommendations on home that you see. What’s the streaming pipeline? We now update recommendations based on user events: we listen to the songs you have listened to, the artists you have followed, and the tracks you have hearted, and we make decisions based on that. We’ve separated computing recommendations from serving those recommendations. What are some of the cons? Since we added streaming pipelines into this ecosystem, the stack has become a bit more complex. Debugging is more complicated: if there is an incident, you have to know whether it’s the streaming pipeline, your service, the logic, or Bigtable having an issue.
  • Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
    • The key challenge, for any of us who have worked with S3, is that it’s great at bulk durable storage where you just want the blob back. But for any type of high throughput, certainly with our real-time requirements, S3 in and of itself is never going to perform well enough. So S3 is great as a long-term durable store, but to really scale this we need something faster. That turns into the question of what.
    • Suppose everyone who wants to do things fast just uses an in-memory database; what does the math of that look like? 300 terabytes is 80 x1e.32xlarge instances for a month, which comes to $300,000 a month. Now you’re into a really expensive system for one customer, and this is with no indexes or other overhead.
    • We use LevelDB for SSD storage, where it’s very high-performing; DRAM for in-memory; Cassandra, which we can trust to scale horizontally, for more mid-performance workloads; and RocksDB and SQLite when we want a lot of flexibility in the types of queries we run.
    • The other thing, particularly when we’re storing stuff in memory, is that we found there is no real substitute for picking the right data structures and just storing them in memory. We use a lot of Go code with the right indexes and the right clever patterns to store everything else. The key takeaway: this is a lot of data, but to do this well and at scale takes a very hybrid approach built on traditional systems. (A toy version of such an in-memory index is sketched below.)
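As a toy illustration of "the right data structures in memory", here is a hypothetical Go sketch of an inverted tag index of the kind a metrics store might keep; all names are invented and Datadog's real code is certainly very different.

```go
package main

import "fmt"

// tagIndex is an inverted index from a tag (e.g. "env:prod") to the IDs of
// the time series carrying that tag; set intersection answers queries
// without scanning every series.
type tagIndex map[string]map[uint64]struct{}

func (idx tagIndex) add(seriesID uint64, tags ...string) {
	for _, t := range tags {
		if idx[t] == nil {
			idx[t] = map[uint64]struct{}{}
		}
		idx[t][seriesID] = struct{}{}
	}
}

// query returns the series that match all of the given tags.
func (idx tagIndex) query(tags ...string) []uint64 {
	var out []uint64
	if len(tags) == 0 {
		return out
	}
	for id := range idx[tags[0]] {
		match := true
		for _, t := range tags[1:] {
			if _, ok := idx[t][id]; !ok {
				match = false
				break
			}
		}
		if match {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	idx := tagIndex{}
	idx.add(1, "env:prod", "service:api")
	idx.add(2, "env:prod", "service:db")
	fmt.Println(idx.query("env:prod", "service:api")) // [1]
}
```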
  • Attention. HTTP/3 is a thing. HTTP/3: the past, the present, and the future
    • Chrome, curl, and Cloudflare (and soon Mozilla) are rolling out experimental but functional support for HTTP/3. A minimal client sketch follows this list.
    • instead of using TCP as the transport layer for the session, it uses QUIC, a new Internet transport protocol, which, among other things, introduces streams as first-class citizens at the transport layer. QUIC streams share the same QUIC connection, so no additional handshakes and slow starts are required to create new ones, but QUIC streams are delivered independently such that in most cases packet loss affecting one stream doesn’t affect others. This is possible because QUIC packets are encapsulated on top of UDP datagrams.
    • QUIC also combines the typical 3-way TCP handshake with TLS 1.3’s handshake. Combining these steps means that encryption and authentication are provided by default, and also enables faster connection establishment. 
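From a client's point of view the change is mostly invisible. Here is a small Go sketch assuming the quic-go library's http3 package (import path as of 2019); the URL is just a public HTTP/3 test endpoint.

```go
package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/lucas-clemente/quic-go/http3"
)

func main() {
	// An http.Client whose transport speaks HTTP/3 over QUIC/UDP instead
	// of HTTP/1.1 or HTTP/2 over TCP.
	rt := &http3.RoundTripper{}
	defer rt.Close()
	client := &http.Client{Transport: rt}

	resp, err := client.Get("https://cloudflare-quic.com/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Proto, len(body), "bytes")
}
```

Because the transport is just an http.RoundTripper, the rest of the application code is unchanged; that is the point of keeping HTTP semantics on top of QUIC.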
  • Processing 40 TB of code from ~10 million projects with a dedicated server and Go for $100:
    • I have 12 million or so git repositories which I need to download and process.
    • This worked brilliantly. However, there were two problems: first the cost, and second that Lambda behind API Gateway/ALB has a 30-second timeout, so it couldn’t process large repositories in time. I knew going in that this was not going to be the most cost-effective solution, but assuming it came close to $100 I would have been willing to live with it. After processing 1 million repositories I checked, and the cost was about $60; since I didn’t want a $700 AWS bill, I decided to rethink my solution.
    • How does one process 10 million JSON files taking up just over 1 TB of disk space in an S3 bucket?
    • The first thought I had was AWS Athena. But since it’s going to cost something like $2.50 USD per query for that dataset I quickly looked for an alternative.
    • My answer was another simple Go program to pull the files down from S3 and store them in a tar file. I could then process that file over and over. The processing itself is done through a very ugly Go program that works over the tar file, so I could re-run my questions without having to trawl S3 again and again. (A minimal version of the download step is sketched after this list.)
    • In the end, however, I chose not to use AWS because of cost.
    • So, were someone to do this from scratch using the same method I eventually went with, it would cost under $100 USD to redo the same calculations.
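A minimal Go sketch of that download step, assuming the aws-sdk-go v1 API and a hypothetical bucket name: stream every object in the bucket into a single local tar file that can then be re-processed as often as needed.

```go
package main

import (
	"archive/tar"
	"io"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	const bucket = "repo-results" // hypothetical bucket name

	svc := s3.New(session.Must(session.NewSession()))
	out, err := os.Create("results.tar")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	tw := tar.NewWriter(out)
	defer tw.Close()

	// Page through every object and append it to the tar archive.
	err = svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{Bucket: aws.String(bucket)},
		func(page *s3.ListObjectsV2Output, last bool) bool {
			for _, obj := range page.Contents {
				body, err := svc.GetObject(&s3.GetObjectInput{
					Bucket: aws.String(bucket), Key: obj.Key,
				})
				if err != nil {
					continue // sketch: skip objects that fail to download
				}
				tw.WriteHeader(&tar.Header{Name: *obj.Key, Mode: 0644, Size: *obj.Size})
				io.Copy(tw, body.Body)
				body.Body.Close()
			}
			return true // keep paging
		})
	if err != nil {
		panic(err)
	}
}
```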
  • Here’s Facebook’s Networking @Scale 2019 recap: “This year’s conference focused on a theme of reliable networking at scale. Speakers talked about various systems and processes they have developed to improve overall network reliability or to detect network outages quickly. They also shared stories about specific network outages, how they were immediately handled, and some general lessons for improving failure resiliency and availability.” You might like: Failing last and least: Design principles for network availability; BGP++ deployment and outages; What we have learned from bootstrapping 1.1.1.1; Operating Facebook’s SD-WAN network; Safe: How AWS prevents and recovers from operational events.
  • 25 Experts Share Their Tips for Building Scalable Web Applications: Tip #1: choosing the correct tool with scalability in mind reduces a lot of overhead. Tip #2: caching comes with a price; do it only to decrease costs associated with performance and scalability. Tip #3: use multiple levels of caching to minimize the risk of a cache miss (a minimal two-level cache is sketched below).
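As a sketch of tip #3, here is a minimal, hypothetical two-level cache in Go; in production the second level would be something like Redis or memcached rather than a stubbed function.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is a cached value with an expiry, so stale data ages out.
type entry struct {
	val     string
	expires time.Time
}

// twoLevelCache checks a fast in-process map first (level 1), then a
// slower shared lookup (level 2), and only then the source of truth.
type twoLevelCache struct {
	mu     sync.Mutex
	local  map[string]entry
	remote func(key string) (string, bool) // stands in for a shared cache
	origin func(key string) string         // stands in for the database
}

func (c *twoLevelCache) get(key string) string {
	c.mu.Lock()
	if e, ok := c.local[key]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.val // level-1 hit
	}
	c.mu.Unlock()

	if v, ok := c.remote(key); ok {
		c.set(key, v) // promote a level-2 hit into level 1
		return v
	}
	v := c.origin(key) // full miss: hit the database
	c.set(key, v)
	return v
}

func (c *twoLevelCache) set(key, val string) {
	c.mu.Lock()
	c.local[key] = entry{val, time.Now().Add(30 * time.Second)}
	c.mu.Unlock()
}

func main() {
	c := &twoLevelCache{
		local:  map[string]entry{},
		remote: func(string) (string, bool) { return "", false },
		origin: func(k string) string { return "value-for-" + k },
	}
	fmt.Println(c.get("user:42")) // miss, then cached
	fmt.Println(c.get("user:42")) // level-1 hit
}
```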
  • Google’s approach requires 38x more bandwidth than a WebSocket + delta solution, and delivers latencies that are 25x higher on average. Google – polling like it’s the 90s:
    • Google strangely chose HTTP Polling.  Don’t confuse this with HTTP long polling where HTTP requests are held open (stalled) until there is an update from the server. Google is literally dumb polling their servers every 10 seconds on the off-chance there’s an update. This is about as blunt a tool as you can imagine.
    • Google’s HTTP polling is 80x less efficient than a raw Websocket solution.  Over a 5 minute window, the total overhead uncompressed is 68KiB vs 16KiB for long polling and a measly 852 bytes for Websockets
    • The average latency for Google’s long polling solution is roughly 25x slower than any streaming transport that could have been used.
    • Every request, every 10 seconds, sends the entire state object… Over a 5 minute window, once the initial state is set up, Google’s polling solution consumes 282KiB of data from the Google servers, whereas using Xdelta (encoded with base-64) over a WebSocket transport only 426 bytes are needed. That represents 677x less bandwidth over a 5 minute window, and 30x less bandwidth when including the initial state setup. (A toy delta computation is sketched after this list.)
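The core of the delta idea is easy to demonstrate. A toy Go sketch (it compares flat maps and ignores deleted keys; Xdelta does real binary diffing) showing how much smaller a changed-fields-only payload is than re-sending the full state:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// delta returns only the keys whose values changed since the previous
// state: this is what you would push over a WebSocket instead of
// re-sending the entire state object on every poll.
func delta(prev, next map[string]interface{}) map[string]interface{} {
	d := map[string]interface{}{}
	for k, v := range next {
		if pv, ok := prev[k]; !ok || fmt.Sprint(pv) != fmt.Sprint(v) {
			d[k] = v
		}
	}
	return d
}

func main() {
	prev := map[string]interface{}{"title": "Doc", "cursor": 10, "rev": 41}
	next := map[string]interface{}{"title": "Doc", "cursor": 12, "rev": 42}

	full, _ := json.Marshal(next)
	diff, _ := json.Marshal(delta(prev, next))
	fmt.Printf("full state: %d bytes, delta: %d bytes\n", len(full), len(diff))
}
```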
  • The 8base tech stack:  we chose Amazon Web Services (AWS) as our computing infrastructure; serverless computing using AWS Lambda; AWS Aurora MySQL and MongoDB Atlas as databases; AWS S3 (Simple Storage Service) for object storage service; AWS’s API Gateway; 8base built an incredibly powerful GraphQL API engine; React; Auth0.
  • The rise of the Ctaic family of programming languages. Altaic: Rise and Fall of a Linguistic Hypothesis. Interesting parallels with the lineage of programming languages. Spoken languages living side by side for long periods of times come to share vocabulary and even grammar, yet they are not part of the same family tree. Words transfer sidewise between unrelated languages rather than having a parent child relationship. I don’t even know if there is formal programming language lineage chart, but it does seem languages tend to converge over time as user populations agitate for the adoption of language features from other languages into their favorite language. Even after years of principled objection to generics being added to Go, many users relentlessly advocate for Go generics. And though C++ is clearly in parent child relationship with C, over the years C++ has adopted nearly every paradigm under the sun.
  • Orchestrating Robot Swarms with Java
    • How does a busy-loop work? Here’s that RealTimeEventScheduler I didn’t show you before, this time with a busy-loop implemented in it. As in our discrete-event version, we have our TimeProvider (this time probably the system TimeProvider) and our queue of events. Rather than iterating over the queue while we have tasks, we loop forever and check: has the current time gone past the time this event was scheduled for? If it has, we run the event; otherwise, we loop back around. It is basically asking, “Do I have any work? If so, execute it; if not, loop back around,” until it finds some work and does it. Why did we do this? Latency for individual events went down from around 5 milliseconds to effectively 0, because you’re not waiting for anything to wake up or for a thread to be created; you’re constantly polling, and as soon as you’ve got your event you can execute it. We also saw event throughput go up by three times, which is quite good for us. (A minimal Go rendering of the busy-loop appears after this list.)
    • Some parts of our computation can be precomputed. At application startup time, we can take all the common parts of the calculations, eagerly calculate them, and cache the results. That means when we come to communicating with our robots, we don’t have to do the full computation in our algorithms; we only have to do the smallest amount of computation based on what the robot is telling us.
    • To reduce garbage collection overhead: remove Optional from heavily used APIs, use for-loops instead of the Streams API, use array-backed data structures instead of something like HashSet or LinkedList, and avoid primitive boxing, especially in places like log lines. What these all have in common is excess object creation.
    • ZGC is new in Java 11, labeled as experimental, but it promises seriously low pause times, on the order of 10 milliseconds, on heaps of over 100 gigabytes. By just switching to ZGC, that 50 milliseconds moves beyond the 99th percentile: less than 1 in 100 pauses is greater than 50 milliseconds, and for us that’s amazing.
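Here is a minimal Go rendering of the busy-loop scheduler idea described above (the talk's code is Java; the names here are invented): pop events off a time-ordered heap, spinning instead of sleeping.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// event is a task due at a specific time.
type event struct {
	due time.Time
	run func()
}

// eventQueue is a min-heap ordered by due time.
type eventQueue []event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return q[i].due.Before(q[j].due) }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	e := old[len(old)-1]
	*q = old[:len(old)-1]
	return e
}

// busyLoop spins instead of sleeping: no timer or thread wake-up latency,
// at the cost of pinning a core at 100%.
func busyLoop(q *eventQueue) {
	for q.Len() > 0 {
		if next := (*q)[0]; !time.Now().Before(next.due) {
			heap.Pop(q)
			next.run()
		}
		// otherwise: loop straight back around and check again
	}
}

func main() {
	q := &eventQueue{}
	heap.Init(q)
	start := time.Now()
	heap.Push(q, event{start.Add(10 * time.Millisecond), func() { fmt.Println("B") }})
	heap.Push(q, event{start.Add(5 * time.Millisecond), func() { fmt.Println("A") }})
	busyLoop(q)
}
```

The trade-off is exactly as the talk describes: near-zero dispatch latency in exchange for burning a core, which only makes sense when the loop is the machine's main job.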
  • It would be hard to find a better example of the risks of golden-path testing and development. New Cars’ Pedestrian-Safety Features Fail in Deadliest Situations. Somewhere in the test matrix should be regression tests for detecting pedestrians at night. We really need a standardized test for every autonomous car software update; it might even be a virtual course. The key is that cars are not phones or DVRs. A push-based over-the-air update process means a once-safe vehicle is one unstable point release away from causing chaos on the road, or the death of a loved one. What if iOS 13 were the software that ran your car? Not a pretty thought.
  • CockroachDB offers the lowest price for OLTP workloads while offering the highest level of consistency. Just How “Global” Is Amazon Aurora? CockroachDB would like you to know Aurora is not all that. Why? 1) It’s optimized for read-heavy workloads where write scalability can be limited to a single master node in a single region. 2) Replication between regions is asynchronous, so there is potential for data loss of up to a second (a non-zero recovery point objective, or RPO), and it can take up to a minute to promote a read node to the primary write node (a one-minute recovery time objective, or RTO). 3) Multi-master: there is no option to scale reads to an additional region. 4) Multi-master: doubling the maximum write throughput comes at the expense of significantly decreased maximum read throughput. 5) Aurora multi-master does not allow SERIALIZABLE isolation. 6) It depends on a single write node for all global writes. 7) It performs well in a single region and is durable enough to survive failures of an availability zone, but there can be latency issues with writes because of the dependence on a single write instance: the distance between the client and the write node defines the write latency, as all writes are performed by that node. 8) It does not have the ability to anchor execution close to the data.
  • Scaling the Hotstar Platform for 50M
    • One of our key insights from 2018 was that auto-scaling would not work, which meant we had static “ladders” that we stepped up and down, based on the amount of “headroom” left.
    • Our team took an audacious bet to run 2019 on Kubernetes (K8s). It became possible to think of building our own auto-scaling engine that took into account multiple variables that mattered to our system. 
    • We supported 2x more concurrency in 2019 with 10x less compute overall. This was a 6–8 month journey that had its roots in 2–3 months of ideation before we undertook it. This section might make it sound easy; it isn’t.
    • Your system is unique, and it will require a unique solution.
    • We found a system taking up more than 70% compute for a feature that we weren’t even using.
  • We hear a lot about bad crypto, but what does good crypto even look like? Who talks about that? Steve Gibson, that’s who. In The Joy of Sync Steve describes how Sync.com works and declares it good: Zero-knowledge, end-to-end encryption ● File and file meta data is encrypted client-side and remains encrypted in transit and at rest. ● Web panel, file sharing and share collaboration features are also zero-knowledge. ● Private encryption keys are only accessible by the user, never by Sync. ● Passwords are never transmitted or stored, and are only ever known by the user….A randomly generated 2048 bit RSA private encryption key serves as the basis for all encryption at Sync. During account creation, a unique private key is generated and encrypted with 256 bit AES GCM, locked with the user’s password. This takes place client-side, within the web browser or app. PBKDF2 key stretching with a high iteration count is used to help make weak passwords more cryptographically secure. Encrypted private keys are stored on Sync’s servers, and downloaded and decrypted locally by the desktop app, web panel or mobile apps after successful authentication. At no time does Sync have access to a user’s private key. And there’s much more about how good crypto works. (A toy Go version of this key-wrapping scheme is sketched below.)
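The scheme Steve describes maps directly onto standard library primitives. A minimal Go sketch, not Sync.com's code: stretch a password with PBKDF2, then wrap a freshly generated RSA private key with AES-256-GCM, so only someone holding the password can unwrap it.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"fmt"

	"golang.org/x/crypto/pbkdf2"
)

func main() {
	// Generate the account's 2048-bit RSA private key client-side.
	priv, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	keyBytes := x509.MarshalPKCS1PrivateKey(priv)

	// Stretch the user's password into an AES-256 key with PBKDF2.
	// The iteration count here is illustrative, not Sync's actual value.
	salt := make([]byte, 16)
	rand.Read(salt)
	aesKey := pbkdf2.Key([]byte("correct horse battery staple"), salt, 200000, 32, sha256.New)

	// Wrap the private key with AES-256-GCM. The wrapped blob (and salt,
	// and nonce) can be stored server-side without revealing the key.
	block, _ := aes.NewCipher(aesKey)
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	wrapped := gcm.Seal(nonce, nonce, keyBytes, nil)

	fmt.Printf("wrapped key blob: %d bytes (safe to store server-side)\n", len(wrapped))
}
```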
  • Lyft on Operating Apache Kafka Clusters 24/7 Without A Global Ops Team. A kind of old-school example of how to run your own service and make it reliable for you, and an example of why the cloud isn’t magic sauce: things fail and don’t fix themselves, which is why people pay extra for managed services. But Lyft built the monitoring and repair software themselves, and now Kafka runs without a huge operational burden.
  • This could help things actually become the internet of things. Photovoltaic-powered sensors for the “internet of things”: Perovskite [solar] cells, on the other hand, can be printed using easy roll-to-roll manufacturing techniques for a few cents each; made thin, flexible, and transparent; and tuned to harvest energy from any kind of indoor and outdoor lighting. The idea, then, was combining a low-cost power source with low-cost RFID tags, which are battery-free stickers used to monitor billions of products worldwide. The stickers are equipped with tiny, ultra-high-frequency antennas that each cost around three to five cents to make…enough to power up a circuit — about 1.5 volts — and send data around 5 meters every few seconds.

Soft Stuff:

  • Programming with Bigmachine is as if you have a single process with “cloudroutines” across a large cluster of physical nodes. Bigmachine (article): is an attempt to reclaim programmability in cloud computing. Bigmachine is a Go library that lets the user construct a system by writing ordinary Go code in a single, self-contained, monolithic program. This program then manages the necessary compute resources, and distributes itself across them. No infrastructure is required besides credentials to your cloud provider. Bigmachine achieves this by defining a unified programming model: a Bigmachine binary distributes callable services onto abstract “machines”. The Bigmachine library manages how these machines are provisioned, and transparently provides a mutually authenticated RPC mechanism. 
    • @marius: When we built Reflow, we came upon an interesting way to do cluster computing: self-managing processes. The idea was that, instead of using complicated cluster management infrastructure, we could build a vertically integrated compute stack. Because of Reflow’s simplified needs, cluster management could also be a lot simpler, so we built Reflow directly on top of a much lower level interface that could be implemented by EC2 (or really any VM provider) directly. Bigmachine is this idea reified in a Go package. It defines a programming model around the idea of an abstract “machine” that exposes a set of services. The Bigmachine runtime manages machine creation, bootstrapping, and secure RPC. Bigmachine supports any comms topology. Bigmachine also goes to great lengths to provide transparency. For example, standard I/O is sent back to the user; Go’s profile tooling “just works” and gives you profiles that are merged across the whole cluster; stats are automatically aggregated.
    • @everettpberr: It’s hard to believe a gap like this exists in cloud computing today but it’s absolutely there. We deal with this every week. If I have a local program that now I want to execute across a lot of data – there’s _still_ a lot of hassle involved. BigSlice may solve this.
  • Bigslice: a cluster computing system in the style of Spark. Bigslice is a Go library with which users can express high-level transformations of data. These operate on partitioned input which lets the runtime transparently distribute fine-grained operations, and to perform data shuffling across operation boundaries. We use Bigslice in many of our large-scale data processing and machine learning workloads.
    • @marius: Bigslice is a distributed data processing system built on top of Bigmachine. It’s similar to Spark and FlumeJava, but: (1) it’s built for Go; (2) it fully embodies the idea of self-managing serverless computing. We’re using Bigslice for many of our large scale workloads at GRAIL. Because Bigslice is built on top of Bigmachine, it is also fully “self-managing”: the user writes their code, compiles a binary, and runs it. The binary has the capability of transparently distributing itself across a large ad hoc cluster managed by the same runtime. This model of cluster computing has turned out to be very pleasant in practice. It’s easy to make modifications across the stack, and from an operator’s perspective, all you need to do is bring along some cloud credentials. Simplicity and transparency in cloud computing.
  • cloudflare/quiche: an implementation of the QUIC transport protocol and HTTP/3 as specified by the IETF. It provides a low level API for processing QUIC packets and handling connection state.

Pub Stuff:  

  • Numbers limit how accurately digital computers model chaos: Our work shows that the behaviour of the chaotic dynamical systems is richer than any digital computer can capture. Chaos is more commonplace than many people may realise and even for very simple chaotic systems, numbers used by digital computers can lead to errors that are not obvious but can have a big impact. Ultimately, computers can’t simulate everything.
  • The Effects of Mixing Machine Learning and Human Judgment: Considered in tandem, these findings indicate that collaboration between humans and machines does not necessarily lead to better outcomes, and human supervision does not sufficiently address problems when algorithms err or demonstrate concerning biases. If machines are to improve outcomes in the criminal justice system and beyond, future research must further investigate their practical role: an input to human decision makers.

from High Scalability

Redis Cloud Gets Easier with Fully Managed Hosting on Azure

ScaleGrid, a rapidly growing leader in the Database-as-a-Service (DBaaS) space, has just launched their new fully managed Redis on Azure service. This Redis management solution allows organizations, from startups up to the enterprise level, to automate their Redis operations on Microsoft Azure dedicated cloud servers, alongside their other open source database deployments, including MongoDB, MySQL, and PostgreSQL.

Redis, the #1 key-value store and a top-10 database in the world, has grown by over 300% in popularity over the past 5 years, per the DB-Engines knowledge base. The demand for Redis is skyrocketing across dozens of use cases, particularly for caching, queues, geospatial data, and high-speed transactions. This simple database management system makes it very easy to store and retrieve pairs of keys and values, and it is commonly paired with other database types to increase the speed and performance of an application (a minimal cache-aside example follows below). According to the 2019 Open Source Database Report, a majority of Redis deployments are used in conjunction with MySQL, and over half of Redis deployments are used with PostgreSQL, MongoDB, or Elasticsearch.
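That "paired with other database types" usage is the classic cache-aside pattern. A minimal Go sketch assuming the go-redis client (v8 API); loadFromMySQL is a hypothetical stand-in for the primary database query.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// getUser tries Redis first, falls back to the primary database on a
// miss, then populates the cache with a TTL so later reads are fast.
func getUser(ctx context.Context, rdb *redis.Client, id string) (string, error) {
	key := "user:" + id
	if val, err := rdb.Get(ctx, key).Result(); err == nil {
		return val, nil // cache hit
	} else if err != redis.Nil {
		return "", err // a real Redis error, not just a miss
	}
	val := loadFromMySQL(id) // cache miss: hit the slower primary store
	if err := rdb.Set(ctx, key, val, 5*time.Minute).Err(); err != nil {
		return "", err
	}
	return val, nil
}

// loadFromMySQL is a hypothetical stand-in for the primary database.
func loadFromMySQL(id string) string { return "row-for-" + id }

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	val, err := getUser(ctx, rdb, "42")
	fmt.Println(val, err)
}
```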

ScaleGrid’s Redis hosting service allows these organizations to automate all of their time-consuming management tasks, such as backups, upgrades, scaling, replication, sharding, monitoring, alerts, log rotations, and OS patching, so their DBAs, developers, and DevOps teams can focus on new product development and optimizing performance. Additionally, organizations can customize their Redis persistence and host through their own Azure account which allows them to leverage advanced cloud capabilities like Azure Virtual Networks (VNET), Security Groups, and Reserved Instances to reduce long-term hosting costs up to 60%. 

“Cloud reliability has never been so important,” says Dharshan Rangegowda, Founder and CEO of ScaleGrid. “It’s crucial for organizations to properly configure their Redis deployments for high availability and disaster recovery, as a couple minutes of downtime can be detrimental to a company’s security and reputation.”

ScaleGrid is the only Redis cloud service that allows you to customize your master-slave and cross-datacenter configurations for 100% uptime and availability across 30 different Azure regions. They also allow you to keep full Redis admin access and SSH access to your machines, and you can learn more about their advantages over competitors Compose for Redis, RedisGreen, Redis Labs and Elasticache for Redis on their Compare Redis Providers page.

from High Scalability

Sponsored Post: Sisu, Educative, PA File Sight, Etleap, PerfOps, InMemory.Net, Triplebyte, Stream, Scalyr

Who’s Hiring? 

  • Sisu Data is looking for machine learning engineers who are eager to deliver their features end-to-end, from Jupyter notebook to production, and provide actionable insights to businesses based on their first-party, streaming, and structured relational data. Apply here.
  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here! 

Cool Products and Services

  • Grokking the System Design Interview is a popular course on Educative.io (taken by 20,000+ people) that’s widely considered the best System Design interview resource on the Internet. It goes deep into real-world examples, offering detailed explanations and useful pointers on how to improve your approach. There’s also a no questions asked 30-day return policy. Try a free preview today.
  • PA File Sight – Actively protect servers from ransomware, audit file access to see who is deleting files, reading files or moving files, and detect file copy activity from the server. Historical audit reports and real-time alerts are built-in. Try the 30-day free trial!
  • For heads of IT/Engineering responsible for building an analytics infrastructure, Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike older enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own. Read stories from customers like Okta and PagerDuty, or try Etleap yourself.
  • PerfOps is a data platform that digests real-time performance data for CDN and DNS providers, as measured by real users worldwide. Leverage this data across your monitoring efforts and integrate with PerfOps’ other tools such as Alerts, Health Monitors and FlexBalancer, a smart approach to load balancing. FlexBalancer makes it easy to manage traffic between multiple CDN providers, APIs, databases or any custom endpoint, helping you achieve better performance, ensure the availability of services and reduce vendor costs. Creating an account is free and provides access to the full PerfOps platform.
  • InMemory.Net provides a .NET native in-memory database for analysing large amounts of data. It runs natively on .NET, and provides native .NET, COM & ODBC APIs for integration. It also has an easy to use language for importing data, and supports standard SQL for querying data. http://InMemory.Net
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it’s easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring DevOps and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure; this includes apps with 30 million users. With your help we’d like to add a few zeros to that number. Check out the job opening on AngelList.
  • Scalyr is a lightning-fast log management and operational data platform. It’s a tool (actually, multiple tools) that your entire team will love. Get visibility into your production issues without juggling multiple tabs and different services: all of your logs, server metrics and alerts are in your browser and at your fingertips. Loved and used by teams at Codecademy, ReturnPath, Grab, and InsideSales. Learn more today, or see why Scalyr is a great alternative to Splunk.
  • Advertise your product or service here!

Fun and Informative Events

  • Advertise your event here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


PA File Sight monitors file access on a server in real-time.

It can track who is accessing what, and with that information can help detect file copying, detect (and stop) ransomware attacks in real-time, and record the file activity for auditing purposes. The collected audit records include user account, target file, the user’s IP address and more. This solution does NOT require Windows Native Auditing, which means there is no performance impact on the server. Join thousands of other satisfied customers by trying PA File Sight for yourself. No sign up is needed for the 30-day fully functional trial.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they’re a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte’s multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


The Solution to Your Operational Diagnostics Woes

Scalyr gives you instant visibility of your production systems, helping you turn chaotic logs and system metrics into actionable data at interactive speeds. Don’t be limited by the slow and narrow capabilities of traditional log monitoring tools. View and analyze all your logs and system metrics from multiple sources in one place. Get enterprise-grade functionality with sane pricing and insane performance. Learn more today



from High Scalability

Stuff The Internet Says On Scalability For September 27th, 2019

Wake up! It’s HighScalability time:

Nifty diagram of what testing looks like in an era of progressive delivery. (@alexsotob, @samnewman)

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for all who want to understand the cloud. On Amazon it has 55 mostly 5 star reviews (131 on Goodreads). They’ll thank you for changing their life forever.

Number Stuff:

  • 2: percentage of human DNA coding for genes, so all the extra code in your project is perfectly natural. And 99.9% of your DNA is like any other person’s, so all that duplicate code in your project is also perfectly natural.
  • 40%: do some form of disaster testing annually in production. 
  • 1 billion: Windows 10 devices in 2020. 
  • ~1.5x: how much better Bit-Swap compresses than GNU Gzip. 
  • 1 billion: Slack messages sent weekly.
  • 1.4%: decline in electronic equipment sales compared to the same quarter of last year. 
  • 50%: increase year-over-year in enterprise adoption and deployments of multi-cloud. 80%+ of customers on all three clouds use Kubernetes. 1 in 3 enterprises are using serverless in production. AWS Lambda adoption grew to 36% in 2019, up 12% from 2017.
  • $100 million: fund to empower individual creators, galvanize open-standard monetization service providers, and allow users to directly support content they value.
  • #1: most dangerous software error is: Improper Restriction of Operations within the Bounds of a Memory Buffer.
  • $0.030 – $0.035: Backblaze’s target per gigabyte of storage cost.
  • $16.5 billion: record investment in robot sector along with a staggering jump in the number of collaborative robot installations last year.
  • 100,000: free AI generated headshots.
  • 37%: Wi-Fi 6 single-user data rate is faster than 802.11ac.

Quotable Stuff:

  • @cassidoo: My 55-year-old father-in-law has been trying to get a junior coding job for over 2 years and seeing him constantly get rejected for younger candidates at the last interview round is ageism in tech at its finest 😞
  • Robert Wolkow: We’ve built an atomic scale device that’s as disruptive to the transistor as the transistor was to the vacuum tube. It will change everything. (paper)
  • @gitlab: We learned that @NASA will be flying Kubernetes clusters to the moon  🚀
  • Yunong Shi: The first classical bits, used on the giant ENIAC machine, were a room of vacuum tubes (around 17,000 in total). On average, only one tube failed every two days. On the other hand, for the first generation of qubits we have now, the average lifetime is on the scale of a millisecond to a second… after about one hundred to one thousand operations, all qubits are expected to fail.
  • @wattersjames: “The latest SQL benchmarks on AWS demonstrated that YSQL is 10x more scalable than the maximum throughput possible with Amazon Aurora. “
  • outworlder: AWS support is stellar. We have workloads on the three major cloud providers, and AWS support is better by an order of magnitude. If anything, this has spoiled us. When things that take minutes to solve (or figure out) on AWS takes days on another cloud provider, no amount of technical wizardry can make up for it. They won’t be dragging your feet just because you have a lower level plan, but if you want to call them to solve stuff right now on the phone or have them showing up at your company with specialists in tow, then you have to fork over the required amount. It’s well spent, IMHO.
  • Andrei Alexandrescu: Speed Is Found In The Minds of People
  • throwsurveill: It took basically until now for the bloom to come off the rose. Think about that. For 20 years Google has been supposedly hiring the smartest guys in the room and all it took was free food, some ball pits and slides, contributing some tech to open source, and working on a handful of “moonshots” that haven’t gone anywhere to keep the sheen of innovation going. And it worked. For 20 years. People have been saying Google is the new Microsoft for a few years but it basically took until now for that to become consensus. Microsoft, who’s been on the back foot until recently, has recast themselves as the new Open Source Champion, basically using the Google playbook from 20 years ago. And it’s working!
  • Jerry Neumann: Moats draw their power to prevent imitation from one of four basic sources: The state, Special know-how, Scale, or System rigidity.
  • QuestionsHurt: I’ve used most hosting setups in my time, shared hosting, dedicated servers, VPS, PaaS like Heroku, EC2 et al., Serverless, and JAMStack like Netlify. Plus other things that sound new but aren’t. I keep coming back to VPS like Digital Ocean, Vultr and the likes. You get more control of the server and more control of your bill. Which is vital to newborn projects.
  • Slack: It’s a race to scale shared channels to fit the needs of our largest customers, some of which have upward of 160,000 active users and more than 5,000 shared channels.
  • @colmmacc: I know that Internet is a great success and all but goddamnit UDP is such a piece of garbage. Even in the 70s, the designers should have had the sense to do fragmentation at layer 4, not layer 3, and put a UDP header in every packet. Don’t get me started on DNS.
  • @elkmovie: N.B.: the highest Geekbench 5 single-core score for *any* Mac is 1262. (2019 iMac 3.6) So the iPhone 11 now offers the fastest single-core performance of any computer Apple has ever made.
  • @anuraggoel: Serverless has its place, but for the love of everything that is holy, please don’t move your whole stack to serverless just because an AWS consultant told you to.
  • @edjgeek: Best practice is not to couple lambda’s in this pattern. For resiliency we recommend SNS/SQS/EventBridge for pub/sub and queueing in serverless. When locally testing, an event from any of these can be mocked for testing via ‘sam local generate-event’ use –help if needed
  • @it4sec: If you plan to fuzz CAN Bus in real vehicle, please make sure that Airbags are disabled.
  • 250bpm: It is said that every year the IQ needed to destroy the world drops by one point. Well, yes, but let me add a different spin on the problem: Every year, the IQ needed to make sense of the world raises by one point. If your IQ is 100 and you want to see yourself in 2039 just ask somebody with IQ 80 and listen carefully.
  • Haowei Yuan: [Here’s what] a Dropbox user request goes through before it reaches backend services where application logic is executed. The Global Server Load Balancer (GSLB) distributes a user request to one of our 20+ Points of Presence (PoPs) via DNS. Within each PoP, TCP/IP (layer-4) load balancing determines which layer-7 load balancer (i.e., edge proxy) is used to early-terminate and forward this request to data centers. Inside a data center, Bandaid is a layer-7 load balancing gateway that routes this request to a suitable service… The core concept of our work is to leverage real-time information to make better load balancing decisions. We chose to piggyback the server load information in HTTP responses because it was simple to implement and worked well with our setup.
  • Marc Greenberg: There needs to be a fundamental shift in what a computer looks like for compute in memory to really take off. There are companies using analog properties of memory cells to do interesting things. Those technologies are still very early in their development, but they’re really interesting. 
  • @QuinnyPig: There’s at least a 60% chance that I could start talking about a fictitious @awscloud MoonBase and I’d be suspected of breaking an NDA somewhere.
  • Jeff Klaus: Globally we are still seeing increasing data center growth. As noted in a recent report, “the seven primary U.S. data center markets saw 171 megawatts (MW) of net absorption in H1 2019, nearly 57 percent of 2018’s full-year record. That absorption nearly eclipsed the 200 MW of capacity added in H1. Northern Virginia, the largest data center market in the world, accounted for 74 percent of net absorption in the primary markets.”
  • Burke Holland: So is the cost of Serverless over-hyped? No. It’s for real. Until you reach a sizeable scale, you’ll pay very little if anything at all. Serverless is one of the most remarkable technologies to come your way in quite some time. Couple that with the automatic infinite scaling and the fact that you don’t even have to deal with a runtime anymore, and this one is a no-brainer.
  • Graham Allan: And 3D stacking for DDR4 and eventually DDR5, as well. And then increased capacity beyond that, you’re taking it on the DIMM and adding all the RC buffers and the data buffers. Registered DIMMs of 3D stacked devices is probably where you’re going to see the sweet spot for DDR5 for very very high capacity requirements. You can get 128 to 256 gigabytes, and maybe 512 gigabytes in the not-too-distant future, in one DIMM card. And that’s just DRAM.
  • Andy Heinig: the ever-increasing expansion of autonomous driving will also place significantly higher demands on integration technology. The data transfer rate between the circuits will be very high because extensive image and radar data is processed, leading to large data quantities per unit of time. Then this data must be processed in the circuits of the chiplets, and therefore regularly exchanged between the circuits. This requires high data rates that can only be realized with fast and massively parallel interfaces, so the corresponding package technology also has to be prepared. Only approaches such as 2.5D integration (interposers) or fan-out technologies can satisfy these requirements.
  • Kurt Shuler: It was also clear from the conference [HotChips] that AI is driving some huge chips, the largest being Cerebras’ 1.2 Trillion transistor 46,225 square mm wafer-sized chip, and that interconnect topologies and more distributed approaches to caching are becoming fundamental to making these designs work with acceptable throughput and power. 
  • JoeAltmaier: Ah, my sister endured all this sort of thing during 20 years as a VP in corporate America. She successfully deployed new data systems to 120 plants in 50 regions in one year. Didn’t cost $25M. Her method? Ruthlessly purge the region of the old data system and install the new (web-based API to a central, new data system). Investigate every regional difference and consolidate into one model. Before deployment day, get all the regional Directors in one room and tell them it was going to happen. Tell them there was no going back, no push-back would be permitted, and have the CEO in the room to confirm this.
  • NoraCodes: The elephant in the room (post?) is that the reason all these open chat protocols are failing is because of deliberate and serious damage done by attack from corporate software companies, especially Facebook and Google. Back in the day, I used XMPP to chat with people from all over the Internet, and so did a lot of my friends, precisely because it was easy to connect with people outside whatever walled garden you used primarily from a single desktop client software. Google and Facebook deliberately killed that model. That’s on them. Same thing with Slack, which had IRC and XMPP gateways for a long time.
  • Maureen Tkacik: Nearly two decades before Boeing’s MCAS system crashed two of the plane-maker’s brand-new 737 MAX jets, Stan Sorscher knew his company’s increasingly toxic mode of operating would create a disaster of some kind. A long and proud “safety culture” was rapidly being replaced, he argued, with “a culture of financial bullshit, a culture of groupthink.”
  • PaulAJ: “Core competencies” is a widely misunderstood term. Lots of people equate it to “business model”, as in “we sell widgets so therefore selling widgets is our core competence”. A thing is a core competence if, and only if: * It makes a difference to your customers. * It is difficult for your competitors to replicate. * It provides access to a wide range of markets.
  • @mweagle: “Guarantees do not necessarily compose into systems.” On Eliminating Error in Distributed Software Systems
  • Erik Brynjolfsson: Artificial intelligence (AI) is advancing rapidly, but productivity growth has been falling for a decade, and real income has stagnated. The most plausible explanation is that it will take considerable time for AI-related technologies to be deployed throughout the economy.
  • Lauren Smiley: The criminal oversights didn’t end there. As Karen’s body was unzipped from the body bag and laid out at the morgue, the coroner took note of a black band still encircling her left wrist: a Fitbit Alta HR—a smartwatch that tracks heartbeat and movement. A judge signed a warrant to extract its data, which seemed to tell the story Karen couldn’t: On Saturday, September 8, five days before she was found, Karen’s heart rate had spiked and then plummeted. By 3:28 in the afternoon, the Fitbit wasn’t registering a heartbeat.
  • DSHR: When HAMR and MAMR finally ship in volume they will initially be around 20% lower $/GB. L2 Drive promises a cost decrease twice as big as the cost decrease the industry has been struggling to deliver for a decade. What is more, their technology is orthogonal to HAMR and MAMR; drives could use both vacuum and HAMR or MAMR in the 2022-3 timeframe, leading to drives with capacities in the 25-28TB range and $/GB perhaps half the current value.
  • Duje Tadin: We often neglect how we get rid of the things that are less important. And oftentimes, I think that’s a more efficient way of dealing with information. If you’re in a noisy room, you can try raising your voice to be heard — or you can try to eliminate the source of the noise.
  • Desire Athow: With an uncompressed capacity of 9TB, it translates into a per TB cost of $6.55, about 12x less than the cheapest SSD on the market and 1/4 the price of the 12TB Seagate Exos X14, currently the most affordable hard disk drive on the market on a per TB basis. In other words, if you want a LOT of capacity, then tape is the obvious answer
  • Michael Graziano: Attention is the main way the brain seizes on information and processes it deeply. To control its roving attention, the brain needs a model, which I call the attention schema. Our attention schema theory explains why people think there is a hard problem of consciousness at all. Efficiency requires the quickest and dirtiest model possible, so the attention schema leaves aside all the little details of signals and neurons and synapses. Instead, the brain describes a simplified version of itself, then reports this as a ghostly, non-physical essence, a magical ability to mentally possess items. Introspection – or cognition accessing internal information – can never return any other answer. It is like a machine stuck in a logic loop. The attention schema is like a self-reflecting mirror: it is the brain’s representation of how the brain represents things, and is a specific example of higher-order thought. In this account, consciousness isn’t so much an illusion as a self-caricature.

Useful Stuff:

  • The equivalent of razor blades in SaaS is paying double for all the services you need to “support” the loss leader service. That’s wrong. (A metric-batching sketch follows this thread.)
    • @aripalo: Someone once asked how AWS makes money with #serverless as you don’t pay for idle. I’m glad that someone asked, I can tell you: CloudWatch. One account CW cost 55% because putMetricData. I’ll have to channel my inner @QuinnyPig, start combing through bills & figure out options.
    • @magheru_san: You do pay for idle but in a different way. If your Lambda function responds in single digit milliseconds, you are getting charged for 100ms or >10x than what your function actually consumed. Including if the function is sleeping or waiting for network traffic with an idle CPU
    • @QuinnyPig: …Then CloudWatch gets you again on GetMetric calls when Datadog pulls data in. Then you pay Datadog.
    • @sysproc: Don’t forget the price you pay per month for each unique metric namespace that you then pay more to populate via those PutMetricData calls.
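The standard mitigation for the PutMetricData tax is client-side aggregation: collect samples locally over a flush interval, then publish one statistic set, so many observations cost one API call. A minimal boto3 sketch, with the namespace and metric name invented for illustration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Aggregate locally, then publish a single StatisticSet instead of
# one PutMetricData call per observation.
samples = [12.0, 15.5, 9.8, 30.1]  # e.g. request latencies collected this interval

cloudwatch.put_metric_data(
    Namespace="MyApp",  # hypothetical namespace
    MetricData=[{
        "MetricName": "RequestLatency",
        "Unit": "Milliseconds",
        "StatisticValues": {
            "SampleCount": len(samples),
            "Sum": sum(samples),
            "Minimum": min(samples),
            "Maximum": max(samples),
        },
    }],
)
```

You still pay for the custom metric itself, but collapsing the call volume is what shrinks the per-request line item the thread is complaining about.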
  • Videos from CloudNative London 2019 are now available
  • Fun graph of Moore’s Law vs. actual transistor count along with a lively discussion thread. On the 2002 spike, michaelmalak: Not quite. It was the Itanium flop. The x86 instruction set was ugly. Everyone — Intel, Microsoft, software developers — wanted a clean slate. So Intel kind of put x86 development on the back burner and started working on Itanium. Itanium is what you see explode in the graph in 2002, leapfrogging the dreadfully slow Pentium 4 (although it had high transistor count and high clock rate, it was just bad). Despite Microsoft making a version of Windows for Itanium, Itanium was a commercial flop due to lack of x86 backward compatibility (outside of slow emulation).
  • Behold the power of the in-memory cache. Splash the cache: how caching improved our reliability. The problem: a spike in webhook requests caused a backup due to slow DynamoDB lookups. Doubling the provisioned capacity was an insanely expensive temporary workaround that didn’t really solve the problem. Switching to auto provisioning was 7x more expensive. The solution: an in-memory 3 second cache in the publisher. How much difference could 3 seconds make? A lot. They went from 300 reads per second to 1.4 reads per second—200x fewer database reads. And 3 seconds is short enough that when a webhook URL is updated the cache won’t be inconsistent for long. Why use an in-memory cache rather than an external cache like Redis? So many reasons: no external dependencies; minimal failure rate; no runtime errors; in-memory is orders of magnitude faster than any network request.
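A cache like that is only a few lines in-process. A minimal read-through fixed-TTL sketch (names invented; not the post’s actual code):

```python
import time

class TTLCache:
    """Tiny in-process read-through cache with a fixed TTL."""

    def __init__(self, ttl_seconds=3.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]               # fresh hit: no database read
        value = loader(key)               # miss or stale: hit the database
        self._store[key] = (now + self.ttl, value)
        return value

# Hypothetical usage against a DynamoDB table:
# cache = TTLCache(ttl_seconds=3)
# item = cache.get(webhook_id, lambda k: table.get_item(Key={"id": k}))
```

The design trade is explicit: up to 3 seconds of staleness bought 200x fewer reads, with no new infrastructure to operate.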
  • A few videos from AppSec Global DC 2019 are now available.  
  • 10 Things to Consider While Using Spot Instances: Cost savings – customers can realize as much as 90% in cost savings. In fact, IDC has recently estimated that enterprises can save up to $4.85 Million over a period of 5 years by using Spot instances; Business flexibility and growth; Cost vs. Performance; Right-sizing instances for optimal performance; Developer & DevOps productivity; Application architectures; Enterprise grade SLA; Multi-cloud; Integration; Competitive advantage. @QuinnyPig: I have a laundry list of reasons why SaaS companies are a bad fit for meaningful cost optimization. Spotinst avoids nearly all of them. I like what they’re up to. 
  • Serverless: 15% slower and 8x more expensive. CardGames.io runs on AWS using a traditional mix of S3, CloudFront, EC2, Elastic Beanstalk, ELB, EBS, etc. at a cost of $164.21 a month. What happens if you move to a serverless platform? It’s not all cookies and cream. Serverless setup was 15% slower and 8x more expensive: “Our API accepts around 10 million requests a day. That’s ~35$ every day, just for API Gateway. On top of that Lambda was costing around ~10$ a day, although that could maybe be reduced by using less memory. Together that’s around 45$ per day, or 1350$ per month, vs. ~164$ per month for Elastic Beanstalk.” This generated much useful discussion.
    • Hugo Grzeskowiak: If you have constant load on your API, use EC2 or (if containerised) ECS. Choose instances based on the application profile, e.g. CPU, memory or throughput. There’s also instances with ephemeral volumes which are the fastest “hard drives” in case that’s the bottleneck. – If your load is high but fluctuating a lot (e.g. only having traffic from one time zone during rush hours), consider the burstable instances (t3 family). For non customer facing services (backups, batch jobs, orchestration) Lambda is often a good choice. – Lambda can be used for switching Ops costs to AWS costs.
    • Samuel Smith: Though I’ve only moved one off serverless, the general idea is to leave them all there until they find some traction, then when the cost warrants it, convert them
    • Jeremy Cummins: You can use an ALB instead of API Gateway in front of a Lambda. You can also configure an Elastic Beanstalk cluster, ECS cluster, or any other container service to serve as a proxy to serve requests from your lambdas (instead of an ALB or API Gateway). If you are serving your lambda requests through a CDN as you described you can use Lambda@Edge to modify the request responses directly, so no load balancer or proxy needed.
    • dread_username: Don’t forget to factor in the managed UNIX Administration costs too. I think this is the real argument. Fully burdened cost for a senior developer where I live is about US$150,000 a year. Given the article number of $1200 a month extra ($16,200 a year), if a single developer can leverage serverless for an extra 11% revenue, it’s paid for itself, and the product potentially has more features for the market.
    • marcocom: So, the reason serverless doesn’t work for most is because they don’t truly buy-in to the heavy front-end necessary to run it. They use their old JSP-style approach and that doesn’t fit the philosophy. You have to believe in JavaScript and a server side comprised of small stupid lambdas that only know their tiny slice of the whole picture and the data they send to the front end to be consumed and persisted by a very smart stateful single-page-application.
    • endproof: Serverless is not meant to run your api or anything with relatively consistent volume. It’s meant to serve for things that have huge spikes in traffic that would be uneconomical to have resources allocated for consistently. Of course it’s slower if they’re dynamically spinning up processes to handle spikes in traffic.
    • TheBigLewinski: Given how the code was running in the first place, directly from a server behind a load balancer, why was the API gateway used? This could have been loaded into a Lambda function and attached to the ALB, as a direct replacement for the EC2 instances. The author then admits memory simply could have been lowered, but doesn’t provide any more detail. I’m guessing if that level of traffic is currently being handled by a “small” instance, the level of memory per request should be reduced to the bare minimum. But there were no details provided about that. There are billing details on the instances, but for the latter parts, we’ll just have to take their word that it was all properly -and optimally- implemented (And they obviously were not). This is, at best, a lesson on the consequences of haphazard deployments and mindlessly buying into hype. But instead of digging in, to more deeply understand the mechanics of building an app and how to improve, they blamed the technology with a sensationalist headline.
    • abiro: PSA: porting an existing application one-to-one to serverless almost never goes as expected. 1. Don’t use .NET, it has terrible startup time. Lambda is all about zero-cost horizontal scaling, but that doesn’t work if your runtime takes 100 ms+ to initialize. The only valid options for performance sensitive functions are JS, Python and Go. 2. Use managed services whenever possible. You should never handle a login event in Lambda, there is Cognito for that. 3. Think in events instead of REST actions. Think about which events have to hit your API, what can be directly processed by managed services or handled by you at the edge. Eg. never upload an image through a Lambda function, instead upload it directly to S3 via a signed URL and then have S3 emit a change event to trigger downstream processing. 4. Use GraphQL to pool API requests from the front end. 5. Websockets are cheaper for high throughput APIs. 6. Make extensive use of caching. A request that can be served from cache should never hit Lambda. 7. Always factor in labor savings, especially devops. The web application needs of most startups are fairly trivial and best supported by a serverless stack. Put it another way: If your best choice was Rails or Django 10 years ago, then it’s serverless today. [A presigned-URL sketch follows this thread.]
    • claudiusd: did the same experiment as OP and ran into the same issues, but eventually realized that I was “doing serverless” wrong. “Serverless” is not a replacement for cloud VMs/containers. Migrating your Rails/Express/Flask/.Net/whatever stack over to Lambda/API Gateway is not going to improve performance or costs. You really have to architect your app from the ground-up for serverless by designing single-responsibility microservices that run in separate lambdas, building a heavy javascript front-end in your favorite framework (React/Ember/Amber/etc), and taking advantage of every service you can (Cognito, AppSync, S3, Cloudfront, API Gateway, etc) to eliminate the need for a web framework. I have been experimenting with this approach lately and have been having some success with it, deploying relatively complex, reliable, scalable web services that I can support as a one-man show.
    • danenania: “It’s also great for when you’re first starting out and don’t know when or where you’ll need to scale.” To me this is probably the most significant benefit, and one that many folks in this discussion strangely seem to be ignoring. If you launch a startup and it has some success, it’s likely you’ll run into scaling problems. This is a big, stressful distraction and a serious threat to your customers’ confidence when reliability and uptime suffer. Avoiding all that so you can focus on your product and your business is worth paying a premium for. Infrastructure costs aren’t going to bankrupt you as a startup, but infrastructure that keeps falling over, requires constant fiddling, slows you down, and stresses you out just when you’re starting to claw your way to early traction very well might.
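On abiro’s third point above, the usual pattern is a presigned URL: the client uploads straight to S3, the payload never touches Lambda or API Gateway, and an S3 event notification (assuming the bucket has one configured) triggers downstream processing. A minimal boto3 sketch; the bucket and key are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Short-lived URL the browser can PUT the file to directly, bypassing
# Lambda and API Gateway for the payload itself.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-upload-bucket", "Key": "incoming/photo.jpg"},
    ExpiresIn=300,  # seconds before the URL stops working
)
print(url)  # hand this to the client, which PUTs the file to it
```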
  • This nuanced discussion from Riot Games on the Future of League’s Engine is a situation a lot of projects find themselves in. Do they go engine-heavy and move more functionality into C++ or engine-light and move more functionality into scripting? You adopt a scripting language to make your life simpler. Put the high performing core in C++ and bind it to a much easier to use scripting language. Who wants to go through a compile cycle when you can dynamically run code and get stuff done? But eventually you find clean abstractions break down and core functionality is all over the place and it becomes a nightmare to extend and maintain. Riot has consciously chosen to move away from scripting and use more C++: “The reasoning described in this article has been the direction that the Gameplay group on League has been walking for a couple years now. This has led to some shifts on how we approach projects, for example how the Champions team encapsulated the complexity of Sylas into higher-level constructs and dramatically simplified the script implementation involved. The movement towards engine-heavy and explicitly away from engine-light will provide us with a more secure footing for the increasing complexity of League.” You may still need a scripting layer for designers and users, but put the effort into making the abstractions in your core easier to use and keep it there.
  • One person’s boring is another’s pit of complexity. The boring technology behind a one-person Internet company. This is a great approach, but is it really that boring? It’s only boring if you already know the technology. Imagine a person just starting out having to learn all this “boring” stuff. It would be daunting. It’s only boring because you already know it. The new boring is always being reborn.
  • Root Cause is a Myth: root cause can’t be determined in complex socio-technical systems…Instead of choosing blame and finger-pointing when breaches happen, DevSecOps practitioners should seek shared understanding and following blameless retrospective procedures to look at a wider picture of how the event actually unfolded. We shouldn’t fire the engineer who didn’t apply the patches, nor the CISO who hired the engineer. Instead, we look at what organizational decisions contributed to the breach.
  • It’s always hard to change a fundamental assumption of your architecture. Twitter took a long time to double their character count to 280. Slack now supports shared channels—a shared channel is one that connects two separate organizations. Yah, it’s a pain to change, but the pain of trying to create an architecture so flexible it has no limits is much greater. Make limits. Optimize around those limits. And stick your tongue out at anyone who bitches about technical debt over taking pride in a working system. How Slack Built Shared Channels. (A routing sketch follows the notes below.)
    • The backend systems used the boundaries of the workspace as a convenient way to scale the service, by spreading out load among sharded systems. Specifically, when a workspace was created, it was assigned to a specific database shard, messaging server shard, and search service shard. This design allowed Slack to scale horizontally and onboard more customers by adding more server capacity and putting new workspaces onto new servers.
    • We decided to have one copy of the shared channel data and instead route read and write requests to the single shard that hosts a given channel. We used a new database table called shared_channels as a bridge to connect workspaces in a shared channel.
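A toy version of that routing decision, with invented names throughout, just to make the bridge-table idea concrete: a shared channel has exactly one home shard, and every workspace’s reads and writes for it are routed there.

```python
# Invented data; stands in for the shared_channels bridge table and the
# original workspace -> shard assignment.
SHARED_CHANNELS = {"C42": "shard-7"}                     # channel_id -> home shard
WORKSPACE_SHARDS = {"acme": "shard-1", "globex": "shard-2"}

def shard_for(channel_id: str, workspace_id: str) -> str:
    home = SHARED_CHANNELS.get(channel_id)
    if home is not None:
        return home                        # one copy of shared-channel data
    return WORKSPACE_SHARDS[workspace_id]  # ordinary workspace-local channel

# Both organizations land on the same shard for the shared channel:
assert shard_for("C42", "acme") == shard_for("C42", "globex") == "shard-7"
```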
  • This makes sense. Replace limits with smarts. Password Limits on Banks Don’t Matter:  banks aggressively lock out accounts being brute forced. They have to because there’s money at stake and once you have a financial motivator, the value of an account takeover goes up and consequently, so does the incentive to have a red hot go at it. Yes, a 5-digit PIN only gives you 100k attempts, but you’re only allowed two mistakes…Banks typically use customer registration numbers as opposed to user-chosen usernames or email addresses so there goes the value in credential stuffing lists…”Do you really think the only thing the bank does to log people on is to check the username and password?”…implement additional verification processes at key stages of managing your money.
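The lockout logic itself is trivial; the point is that a tiny attempt budget dwarfs the keyspace. A minimal sketch of the counter, with the threshold and storage invented for illustration:

```python
MAX_FAILURES = 3          # roughly "you're only allowed two mistakes"
failures = {}             # customer_id -> consecutive failed attempts

def verify_pin(customer_id: str, supplied: str, actual: str) -> bool:
    if failures.get(customer_id, 0) >= MAX_FAILURES:
        raise PermissionError("locked out; re-verify identity out of band")
    if supplied == actual:
        failures.pop(customer_id, None)   # reset the counter on success
        return True
    failures[customer_id] = failures.get(customer_id, 0) + 1
    return False
```

Three tries against a 100,000-PIN space is a 0.003% chance per lockout cycle, which is why the lockout matters far more than the PIN length.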
  • HotChips 31 keynote videos are available.
  • From the Critical Watch Report: Encryption-related misconfigurations are the largest group of SMB security issues; In SMB AWS environments, encryption & S3 bucket configuration are a challenge; Weak encryption is a top SMB workload configuration concern; Most unpatched vulnerabilities in the SMB space are more than a year old; The three most popular TCP ports account for 65% of SMB port vulnerabilities; Unsupported Windows versions are rampant in mid-sized businesses; Outdated Linux kernels are present in nearly half of all SMB systems; Active unprotected FTP servers lurk in low-level SMB devices; SMB email servers are old and vulnerable.
  • Update on fsync Performance: In this post, instead of focusing on the performance of various devices, we’ll see what can be done to improve fsync performance using an Intel Optane card…The above results are pretty amazing. The fsync performance is on par with a RAID controller with a write cache, for which I got a rate of 23000/s and is much better than a regular NAND based NVMe card like the Intel PC-3700, able to deliver a fsync rate of 7300/s. Even enabling the full ext4 journal, the rate is still excellent although, as expected, cut by about half.
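For a rough feel of the numbers in the post, an fsync rate is easy to measure yourself. A minimal sketch; the file path is a placeholder, and results depend entirely on the device and filesystem under it:

```python
import os
import time

PATH = "fsync_test.dat"   # hypothetical scratch file on the device under test
N = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
start = time.monotonic()
for _ in range(N):
    os.write(fd, b"x" * 128)  # small record, like a transaction log entry
    os.fsync(fd)              # force it to stable storage before continuing
elapsed = time.monotonic() - start
os.close(fd)

print(f"{N / elapsed:.0f} fsyncs/second")
```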
  • It turns out the decentralized DNS system is actually quite centralized in practice. DNS Resolver Centrality: While 90% of users have a common set of 1.8% of open resolvers and AS resolver sets configured (Figure 4), 90% of users have the entirety of their DNS queries directed to some 2.6% of grouped resolvers. In this case, out of some 15M experiments on unique end points, some 592 grouped resolvers out of a total pool of 23,092 such resolver sets completely serve 90% of these 15M end points, and these users direct all their queries to resolvers in these 592 resolver sets. Is this too centralised? Or is it a number of no real concern? Centrality is not a binary condition, and there is no threshold value where a service can be categorised as centralised or distributed. It should be noted that the entire population of Internet endpoints could also be argued to be centralised in some fashion. Out of a total of an estimated 3.6 billion Internet users, 90% of these users appear to be located within 1.2% of networks, or within 780 out of a total number of 65,815 of ASNs advertised in the IPv4 BGP routing system
  • Why is Securing BGP So Hard?: BGP security is a very tough problem. The combination of the loosely coupled decentralized nature of the Internet and a hop-by-hop routing protocol that has limited hooks on which to hang credentials relating to the veracity of the routing information being circulated unite to form a space that resists most conventional forms of security. 
  • So many ways to shoot yourself in the lambda. Serverless Cost Containment: concurrency can bite you by parallelising your failures, enabling you to rack up expenses 1,000 times faster than you thought!; A common error cause I’ve seen in distributed systems is malformed or unexpected messages being passed between systems, causing retry loops; If a Lambda listening to an SQS queue can’t process the message, it returns it to the queue… and then gets given it back again and again!; A classic new-to-serverless example is related to loops: an S3 bucket event (or any other Lambda event source) triggers a function that then writes back to the same source, causing an infinite loop; Using messages and queues as a way to decouple your functions is generally a good architectural practice; it can also protect you from some cost surprises; Create dashboards to visually monitor for anomalies; Setting a billing alert also serves as a catch-all for other scenarios that you’d want to know about (e.g. being attacked in a way that causes you to consume resources).
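The S3-triggers-itself loop deserves a concrete guard. A hypothetical handler (bucket layout, prefixes, and the transform are invented): write results under a prefix the trigger doesn’t watch, and refuse to reprocess anything that is already output.

```python
import boto3

s3 = boto3.client("s3")
OUTPUT_PREFIX = "processed/"   # the S3 trigger should watch "incoming/" only

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if key.startswith(OUTPUT_PREFIX):
            continue  # belt and braces: never reprocess our own output
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(Bucket=bucket, Key=OUTPUT_PREFIX + key, Body=transform(body))

def transform(data: bytes) -> bytes:
    return data.upper()  # stand-in for the real processing step
```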
  • High autonomy and little hierarchy are traits tech companies and startups share. Software Architecture is Overrated, Clear and Simple Design is Underrated: Start with the business problem; Brainstorm the approach; Whiteboard your approach; Write it up via simple documentation with simple diagrams; Talk about tradeoffs and alternatives; Circulate the design document within the team/organization and get feedback. 
    • No disagreement with this well written and well thought out article, but the idea that anyone can agree on what is simple and clean in any complex domain is wishful thinking. That’s why there are so many rewrites on projects. New people come in who were not part of the context that produced the “simple and clean” design, so they inevitably don’t understand what’s going on, and so they create their own new “simple and clean” system. Complexity happens one decision at a time and if you weren’t part of those decisions chances are you won’t understand the resulting code. Programmers produce solutions; let’s stop pretending they are simple and clean.
    • gregdoesit: We had a distributed systems expert join the team, who was also a long-time architect. A junior person invited him to review his design proposal on a small subsystem. This experienced engineer kept correcting the junior engineer on how he’s not naming things correctly and mis-using terms. The design was fine and the tradeoffs were good and there was no discussion about anything needing changes there, but the junior engineer came out devastated. He stopped working on this initiative, privately admitting that he feels he’s not experienced enough to design anything this complex and first needs to read books and learn how it’s done “properly”. This person had similar impact on other teams, junior members all becoming dis-engaged from architecture discussions. After we figured out this pattern, we pulled this experienced engineer aside and had a heart to heart about using jargon as a means to prove you’re smart, as opposed to making design accessible to everyone and using it to explain things. I see the pattern of engineers of all backgrounds commenting and asking questions on design documents that are simple to read. But ones that are limited by jargon that’s not explained in the scope of the document get far less input.
    • And here’s a great explanation of a common attractor in the chaotic dynamical system that is a company. Why are large companies so difficult to rescue (regarding bad internal technology): There are two big problems that plague rescue efforts at big companies: history and trust…All of which helps explain why technology rescues at bigger, older companies are so difficult. One is constantly fighting against history…To a large extent “be agile” is almost synonymous with “trust each other.” If you’re wondering why large companies have trouble being agile, it is partly because it is impossible for 11,000 people to trust each other the way 5 people can. That is simply reality. Until someone can figure out the magic spell that allows vast groups of people, in different countries, with different cultures, speaking different languages, to all trust each other as if they were good friends, then people in the startup community need to be a lot more careful about how carelessly they recommend that larger companies should be more agile. 
  • Everything You Need To Know About API Rate Limiting. An excellent overview of different methods—request queues, throttling, rate-limiting algorithms—one to add would be to use an API Gateway and let it worry about rate limiting for you.
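Of the algorithms covered, the token bucket is the one most worth keeping in your head: a steady refill rate with a bounded burst. A minimal single-process sketch; a real deployment would keep this state somewhere shared, such as Redis:

```python
import time

class TokenBucket:
    """Allow `rate` requests/second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should queue, throttle, or return HTTP 429

bucket = TokenBucket(rate=5.0, burst=10)
if not bucket.allow():
    print("rate limited")
```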
  • The 64 Milliseconds Manifesto: In an interactive software application, any user action SHOULD result in a noticeable change within 16ms, actionable information within 32ms, and at least one full screen of content within 64ms. dahart: Waiting for the response before updating the screen is the wrong answer. It’s not impossible, you’re assuming incorrectly that the manifesto is saying the final result needs to be on screen. It didn’t say that, it said the app needs to respond to action visually, not that the response or interaction sequence must be completed within that time frame. The right answer for web and networked applications is to update the screen with UI that acknowledges the user action and shows the user that the result is pending. Ideally, progress is visible, but that’s tangential to the point of the manifesto. A client can, in fact, almost always respond to actions within these time constraints. The point is to do something, rather than wait for the network response.
  • StackOverflow on why they love love love .NET Core 3.0. The presentation is glitzy, not your normal tech blog post. Can’t wait to see the series on Netflix. It’s faster; apps can run on Windows, Macs, Linux, and run in Azure cloud; cloud deploys are easier because there are fewer moving pieces; SO is being broken up into modules that can be run in different areas which allows experimenting with k8s and Docker; most interestingly since they can run in a container they can ship appliances to customers which lowers support costs and makes it easier to onboard customers; they can move code to become middleware; it’s easier to test because they can test end-to-end; since .NET core is on GitHub they can fix errors; they can just build software instead of dealing with the meta of building software. 

Soft Stuff:

  • RSocket (video): Developed by Netifi in collaboration with Netflix, Facebook, Pivotal, Alibaba and others, RSocket combines messaging, stream processing and observability in a single, lightweight solution that provides the connectivity needed for today’s web, mobile and IoT applications. Unlike older technologies such as REST or gRPC, RSocket is equally adept at handling service calls as well as high-throughput streaming data and is at home in the datacenter as well as in the cloud, browsers and mobile/IoT devices.
  • fhkingma/bitswap (article, video): We introduce Bit-Swap, a scalable and effective lossless data compression technique based on deep learning. It extends previous work on practical compression with latent variable models, based on bits-back coding and asymmetric numeral systems. In our experiments Bit-Swap is able to beat benchmark compressors on a highly diverse collection of images. 
  • dgraph-io/ristretto (article):  a fast, concurrent cache library using a TinyLFU admission policy and Sampled LFU eviction policy.

Pub Stuff:

  • Weld: A Common Runtime for High Performance Data Analytics (article): Weld uses a common intermediate representation to capture the structure of diverse data-parallel workloads, including SQL, machine learning and graph analytics. It then performs key data movement optimizations and generates efficient parallel code for the whole workflow.
  • Low-Memory Neural Network Training: A Technical Report: Using appropriate combinations of these techniques, we show that it is possible to reduce the memory required to train a WideResNet-28-2 on CIFAR-10 by up to 60.7x with a 0.4% loss in accuracy, and reduce the memory required to train a DynamicConv model on IWSLT’14 German to English translation by up to 8.7x with a BLEU score drop of 0.15.
  • Quantum Supremacy Using a Programmable Superconducting Processor: The tantalizing promise of quantum computers is that certain computational tasks might be executed exponentially faster on a quantum processor than on a classical processor. A fundamental challenge is to build a high-fidelity processor capable of running quantum algorithms in an exponentially large computational space. Here, we report using a processor with programmable superconducting qubits to create quantum states on 53 qubits, occupying a state space of 2^53 (about 10^16). Measurements from repeated experiments sample the corresponding probability distribution, which we verify using classical simulations. While our processor takes about 200 seconds to sample one instance of the quantum circuit 1 million times, a state-of-the-art supercomputer would require approximately 10,000 years to perform the equivalent task. This dramatic speedup relative to all known classical algorithms provides an experimental realization of quantum supremacy on a computational task and heralds the advent of a much-anticipated computing paradigm.

from High Scalability

IT Modernization and DevOps News Week in Review

2019 Week in Review 4

GitLab Commit, held in New York last week, brought us news that GitLab completed a $268 million Series E round of fundraising. The company reports that it is now valued at $2.75 billion and that it plans to invest the cash infusion in its DevOps platform offerings — including monitoring, security, and planning.

In addition, the firm announced GitLab 12.3, which underscores that point: it includes a WAF built into the GitLab SDLC platform for monitoring and reporting of security concerns related to Kubernetes clusters. It also includes new analytics features and enhanced compliance capabilities.

To stay up-to-date on DevOps best practices, cloud security, and IT Modernization, subscribe to our blog here:
Subscribe to the Flux7 Blog

DevOps News

  • GitHub announced that they have integrated the Checks API with GitHub Pages, allowing operators to easily understand why a GitHub Pages build failed. And, as Pages is now a GitHub App, users are able to see build status via the Checks interface.
  • And in other Git news, Semmle revealed that it is joining GitHub. According to the companies, security researchers use Semmle to find vulnerabilities in code with simple declarative queries, which they then share with the Semmle community to improve the safety of code in other codebases.
  • At FutureStack in New York last week, New Relic announced the “industry’s first observability platform that is open, connected and programmable, enabling companies to create more perfect software.” The new capabilities include New Relic Logs, New Relic Traces, New Relic Metrics, and New Relic AI. In addition, the company unveiled the ability for customers and partners to build new applications via programming on the New Relic One Platform.
  • Kubernetes has delivered Kubernetes 1.16, which it reports consists of 31 enhancements including custom resources, a metrics registry, and significant changes to the Kubernetes API.

AWS News

  • Amazon has unveiled a new Step Functions feature, support for dynamic parallelism, in all regions where Step Functions is offered. According to AWS, this was probably the most requested feature for Step Functions, as it unblocks the implementation of new use cases and can help optimize existing ones. Specifically, Step Functions now supports a new Map state type for dynamic parallelism.
  • Heavy CloudFormation users will be happy to see that Amazon has expanded its capabilities; now operators can use CloudFormation templates to configure and provision additional features for Amazon EC2, Amazon ECS, Amazon ElastiCache, Amazon ES, and more. You can see the full list of new capabilities here.
  • AWS has brought to market a new Amazon WorkSpaces feature that restores a WorkSpace to its last known healthy state, allowing you to easily recover from WorkSpaces made inaccessible by incompatible third-party updates.
  • AWS continues to evolve its IoT solution set with AWS IoT Greengrass 1.9.3. Now available, it adds support for ARMv6 and new machine learning inference capabilities.
  • AWS introduced in preview the NoSQL Workbench for Amazon DynamoDB. The application to help operators design and visualize data models, run queries on data, and generate code for applications is free, client-side, and available for Windows and macOS.

Flux7 News
Flux7 has several upcoming educational opportunities. Please join us at:

  • Our October 9, 2019 Webinar, DevOps as a Foundation for Digital Transformation. This free 1-hour webinar from GigaOm Research brings together experts in DevOps, featuring GigaOm analyst Jon Collins and a special guest from Flux7, CEO and co-founder Aater Suleman. The discussion will focus on how to scale DevOps efforts beyond the pilot and deliver a real foundation for innovation and digital transformation.
  • The High-Performance Computing Immersion Day on October 11, 2019, in Houston, TX, where attendees will gain in-depth, hands-on training with services such as Batch, Parallel Cluster, Elastic Fabric Adapter (EFA), FSx for Lustre, and more in an introductory session. Register Here Today.
  • The AWS Container Services workshop October 17, 2019 in San Antonio, TX. Designed for infrastructure administrators, developers, and architects, this introductory workshop provides a mix of classroom training and hands-on labs. Register Here.

Subscribe to the Flux7 Blog

Written by Flux7 Labs

Flux7 is the only Sherpa on the DevOps journey that assesses, designs, and teaches while implementing a holistic solution for its enterprise customers, thus giving its clients the skills needed to manage and expand on the technology moving forward. Not a reseller or an MSP, Flux7 recommendations are 100% focused on customer requirements and creating the most efficient infrastructure possible that automates operations, streamlines and enhances development, and supports specific business goals.

from Flux7 DevOps Blog

DevOps Foundation for Digital Transformation: Live GigaOm Webinar

Join us on October 9th at noon from the comfort of your desk as we bring you a free 1-hour webinar on how to scale DevOps efforts beyond the pilot and deliver a real foundation for innovation and digital transformation. Hosted by GigaOm Research analyst Jon Collins and Aater Suleman, CEO and co-founder of DevOps consulting firm Flux7, the discussion will share how to effectively create a DevOps foundation and scale for success.

Specifically, attendees to the Webinar will learn:

  • Causes underlying some of the key challenges to scaling DevOps today
  • A starting baseline for achieving the benefits of an enterprise DevOps implementation
  • How to link DevOps improvements with digital transformation goals
  • Trade-offs between technical, process automation and skills improvements
  • Steps to delivering on the potential of DevOps and enterprise agility
  • How to make a real difference to their organizations, drawing from first-hand, in-the-field experience across multiple transformation projects.

Register now to join GigaOm and Flux7 for this free expert webinar.

We all know the strategy — transform the enterprise to use digital technologies and deliver significantly increased levels of customer engagement and new business value through innovation. Key to this is DevOps effectiveness, that is, how fast an organization can take new ideas, translate them into software and deploy them into a live environment.

But many organizations struggle to get beyond the starting blocks, coming up against a legion of challenges from skills to existing systems and platforms. Innovation speed and efficiency suffer, costs rise and the potential value does not materialize. So, what to do? Join our Webinar and learn new skills for scaling DevOps efforts beyond the pilot to deliver a real foundation for innovation and digital transformation.

Join us and GigaOm as we explore how to scale a strong DevOps foundation across the enterprise to achieve key business benefits. Interested in additional reading before the presentation? Enjoy these resources on AWS DevOps, DevOps automation and Agile DevOps and be sure to subscribe to our DevOps blog below to stay on top of the latest trends and industry news.

from Flux7 DevOps Blog
