
42 Monster Problems That Attack As Loads Increase

Posted: 2013-05-12 14:28 | Source: 开源中国社区 | Author: 网络

This is a look at all the bad things that can happen to your carefully crafted program as loads increase: all hell breaks loose. Sure, you can scale out or scale up, but you can also choose to program better and make your system handle larger loads. That saves money, because fewer boxes are needed, and it makes the entire application more reliable and gives it better response times. It can also be quite satisfying as a programmer.

For solutions, take a look at: 7 Life Saving Scalability Defenses Against Load Monster Attacks.

Large Number Of Objects

We usually get into scaling problems when the number of objects grows large. As the number of objects grows, resource usage of every type comes under stress.

Continuous Failures Make An Infinite Event Stream

During large network failure scenarios there is never time for the system to recover. The system is in a continual state of stress.


Lots Of High Priority Work

For example, rerouting is a high priority activity. If there is a large amount of rerouting work that cannot be shed or squelched, resources will be continuously consumed to support that high priority work.

Larger Data Streams

As media assets grow larger, system load increases. As the number of request sources increases, system load increases.

Feature Creep

As more features are added than the system was ever designed for, holes in the system are uncovered.

Increases In Clients

More clients mean more resource usage. Threads are set up per client so events can be pushed to them. Memory is used for client queues. Network bandwidth is used for communication. Per-client data must be maintained for each client.


Less Than Good Design Decisions

There are numerous design problems that can contribute to scaling problems:

  • A design that does not handle large numbers of objects.
  • Lack of end-to-end, application-level flow control (see the sketch after this list).
  • Application-level retries that cause messages to be reallocated and resent.
  • Overhead in the memory library.
  • Publishes that are not really reliable.
  • Extravagant memory usage by particular data structures.
  • Messaging protocols that do not handle all failure scenarios.
  • Using a hard disk for storage.
  • Not using block replication for disk syncing.
  • Application-level protocols that could be better.
  • An underpowered CPU for the increased feature load.
  • An OS that does not support the process architecture.
  • Lack of hardware support for operations that are sensitive to even a single dropped message.
  • Funky network problems, like ARPs that get dropped as network load increases.
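
The flow-control item in the list above is worth a concrete illustration. Below is a minimal, hypothetical sketch (not from the original article; the names and the window size are made up) of end-to-end, application-level flow control using a credit window: the sender may only have a fixed number of unacknowledged messages in flight, so a slow receiver throttles a fast sender instead of letting queues grow without bound.

```python
import queue
import threading
import time

CREDITS = 8  # illustrative window size: max messages in flight

def run_demo(total_messages: int = 100) -> None:
    credits = threading.Semaphore(CREDITS)   # sender-side credit window
    channel: queue.Queue = queue.Queue()     # stands in for the network

    def sender() -> None:
        for i in range(total_messages):
            credits.acquire()                # block when the window is full
            channel.put(i)

    def receiver() -> None:
        for _ in range(total_messages):
            channel.get()
            time.sleep(0.001)                # simulate slow processing
            credits.release()                # grant a credit back to the sender

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("done; in-flight messages never exceeded", CREDITS)

if __name__ == "__main__":
    run_demo()
```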

Assumptions Are Invalidated

Large numbers invalidate most of the assumptions you have been making: how much memory you are using, how long something takes, what a reasonable timeout is, how much you can consume at a time, how likely failures are, the latency at various points in the system, how big your queues need to be, and so on.


Out Of Memory

Baseline memory usage increases, and spike memory usage increases. These can cause OOM (Out of Memory) conditions.

CPU Starvation

With more objects, operations take longer because they have to operate over more objects. This leaves less CPU available, which may starve other operations. Starvation in one place can propagate to other places. There may not be enough CPU to take care of the work that must be done, either because the baseline amount of work is high or because certain scenarios cause a lot of high priority work to be done.

Raw Resource Usage Increases

More objects take more memory. If someone wants to support 10 million simultaneous objects, you may not be able to, simply because you do not have enough memory.

Implied Resource Usage Increases

Most features use a lot of additional resources on top of every resource used in the "raw" sense. If you store an object in two different lists, then increasing the number of objects doubles the memory usage. Queue sizes need to be adjusted upwards. The amount of disk needed increases. The time to replicate data to the secondary increases. The time to load data into an application increases. CPU usage increases to handle all of this. Boot time increases.


Events Multiply

Infinite work streams are the new reality of many systems. Web servers and application servers serve very large user populations, where it is realistic to expect an endless stream of new work. The work never ends. Requests come in 24 hours a day, 7 days a week. Work can easily saturate servers at 100% CPU usage.

Traditionally we have considered 100% CPU usage a bad sign. As compensation, we build complicated infrastructures to load balance work, replicate state, and cluster machines.

CPUs don't get tired, so you might think we would try to use the CPU as much as possible.

In other fields we try to increase productivity by using a resource to the greatest extent possible.

In the server world we try to guarantee a certain level of responsiveness by forcing an artificially low CPU usage. The idea is that if there is no CPU available, we cannot respond to new work within an acceptable latency or complete existing work.

Is there really a problem with the CPU being used 100% of the time? Isn't the real problem that we use CPU availability and task priority as a simple cognitive shorthand for architecting a system, rather than understanding our system's low-level work streams and using that information to make specific scheduling decisions?

We simply don't have the tools to do anything other than make clumsy architecture decisions based on load-balanced servers and guesses at the number of threads to use and the priorities for those threads.

Scaling a system requires careful attention to architecture. In current frameworks, applications have very little say in how they are run.

 


Latency Increases

The latency you experience may be totally different as scale increases. CPU starvation is the major cause here.

Task Priorities Are Shown To Be Wrong

Task priority schemes that worked under a light load can cause problems under heavier load. In particular, when there is a poor flow-control mechanism, a high priority task feeding work to a lower priority task causes drops and spike memory usage, because the lower priority task gets very little chance to run.

Queue Sizes Not Big Enough

A larger number of objects implies more simultaneous operations, which means queue sizes will probably need to increase.

Boot Times Are Longer

The more objects there are, the longer it takes to load them from disk into applications.

Sync Times Are Longer

The more objects there are, the longer it takes applications to sync data between each other.

Not As Much Testing In Large Configurations

Because large test setups are so expensive, the actual time spent testing large configurations is small. You will not have access to a large system during development, so it is likely your design will not scale at first.

 

   


Operations Take Longer

If an operation is applied to every object, then as more objects are added it will take longer than it used to. Tables are larger, so a search that was fast enough with one amount of data may now take much longer.

More Random Failures

You may not see certain failures in normal operation. At scale, replies will drop, ARP requests will drop, the file system may show certain errors, messages may drop, and so on.

Bigger Windows For Failure

Scale means there is more opportunity for failure to occur, because everything takes longer. A protocol for exchanging data may complete quickly with a small data set, which means it has little chance of seeing a reboot or timeout, but at larger scales the time windows grow, and new problems may be seen for the first time.

Timeouts Don't Scale

Timeouts that worked for smaller data sets won't apply as data sets get larger. Add the CPU starvation issue, where your code may not even get a chance to run, and timeouts are invalidated further.
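
One way to hedge against this, sketched below under the assumption that transfer time grows roughly linearly with payload size, is to derive the timeout from the amount of work instead of hard-coding a constant. The function name and numbers are illustrative, not from the article.

```python
def scaled_timeout(payload_bytes: int,
                   base_seconds: float = 2.0,
                   bytes_per_second: float = 1_000_000.0,
                   safety_factor: float = 3.0) -> float:
    """A timeout that grows with the size of the work instead of staying fixed."""
    expected_transfer = payload_bytes / bytes_per_second
    return base_seconds + safety_factor * expected_transfer

# A timeout tuned for a 10 KB exchange is far too tight for a 1 GB one.
print(scaled_timeout(10_000))         # ~2.03 seconds
print(scaled_timeout(1_000_000_000))  # ~3002 seconds
```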

Retries Don't Scale

There is no way for an application to pick the right number of retries before declaring failure, because it does not have enough information on which to base the decision. Are 4 retries, one second apart, any good? Why not 20?
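
A common mitigation, shown here only as a hedged sketch rather than the article's prescription, is to stop guessing a fixed retry count and instead bound retries by a time budget, backing off exponentially with jitter so a large population of clients does not retry in lockstep. flaky_rpc in the usage comment is a placeholder.

```python
import random
import time

def retry_with_backoff(operation,
                       budget_seconds: float = 30.0,
                       base_delay: float = 0.1,
                       max_delay: float = 5.0):
    """Retry until the time budget is spent, with capped exponential backoff."""
    deadline = time.monotonic() + budget_seconds
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            if time.monotonic() >= deadline:
                raise                                   # budget exhausted: give up
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))        # full jitter

# Usage: retry_with_backoff(lambda: flaky_rpc())  # flaky_rpc is hypothetical
```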


Priority Inheritance

Locks held at large scope are held for longer, which creates a better chance of seeing priority inheritance problems.

Consumption Patterns Break

At one scale you can take all of the data from a producer. At another scale you run out of queue space/memory. An example of this is a poller that polls all the data from a remote queue before moving on to the next queue. This works well when there is a small number of items on the queue. But then a feature change expands the number of items on the remote queue, and the poller causes a node to run out of memory.
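
The poller failure above can be avoided by bounding how much is drained per visit. The sketch below is a hypothetical illustration (BATCH and poll_round are made-up names): instead of taking everything a remote queue has, the poller takes at most BATCH items and then moves on, so one unexpectedly large queue cannot exhaust memory.

```python
from collections import deque

BATCH = 100  # illustrative cap on items taken from each queue per round

def poll_round(remote_queues):
    """Visit each queue once, taking at most BATCH items from each."""
    collected = []
    for q in remote_queues:
        for _ in range(min(BATCH, len(q))):
            collected.append(q.popleft())
    return collected

# One queue unexpectedly holds a million items; memory use per round stays bounded.
queues = [deque(range(10)), deque(range(1_000_000)), deque(range(5))]
print(len(poll_round(queues)))  # 10 + 100 + 5 = 115
```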

Watchdog Timeouts

A 100% CPU condition can cause a watchdog timeout. This rarely happens in smaller scale systems, but poor design at larger scales can make it happen.

Slow Memory Leaks Become Fast Leaks

A memory leak that went unnoticed at a smaller scale may become significant at larger scales.

Missed Locks Become Noticed

A lock that should be in place, but is not, can go unnoticed in a smaller scale system, because a thread may never happen to give up the processor right before the instruction that would cause the problem. In a larger scale system there is more preemption, which means there is more chance of seeing simultaneous access to data by different threads.
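
A missed lock of this kind is easy to demonstrate. In the hypothetical sketch below, two threads increment a shared counter with a read-modify-write that is not atomic; with enough iterations (that is, more preemption) lost updates can show up, while the locked version always lands on the exact total.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        counter += 1          # read-modify-write without the lock: racy

def safe_increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:            # the lock that "should be in place"
            counter += 1

def run(worker, n: int = 500_000) -> int:
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print("without lock:", run(unsafe_increment))  # may fall short of 1,000,000
print("with lock:   ", run(safe_increment))    # always 1,000,000
```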


Chance Of Deadlock Increases

Different scheduling patterns exercise code along different paths, so there is a greater chance of seeing deadlock. For example, when the file system doesn't get a chance to run because of high CPU usage, it can somehow break in a way that it takes 100% CPU and never runs again.

Time Synchronization Suffers

Time synchronization is not a very high priority, so as CPU and network resources become scarce, clocks on different nodes will drift.

Logger Drops Data

The logger can start dropping data because its queues are too small to handle the increased load, or because the CPU is too busy to give the logger time to dispatch log data. Depending on the size and type of the queues, this can cause OOM (Out of Memory).
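
Whether a logger drops or OOMs comes down to whether its queue is bounded. The sketch below is illustrative, not the article's logger: it uses a bounded queue plus a drop counter, so when the writer cannot keep up, messages are counted and discarded instead of growing memory without limit.

```python
import queue
import threading

class BoundedLogger:
    """Log sink with a bounded queue: it drops (and counts) rather than grow forever."""

    def __init__(self, maxsize: int = 1000):
        self.q: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        threading.Thread(target=self._writer, daemon=True).start()

    def log(self, message: str) -> None:
        try:
            self.q.put_nowait(message)   # never block the calling task
        except queue.Full:
            self.dropped += 1            # shed load instead of running out of memory

    def _writer(self) -> None:
        while True:
            line = self.q.get()
            print(line)                  # stand-in for writing to disk

logger = BoundedLogger(maxsize=10)
for i in range(100):
    logger.log(f"event {i}")
print("dropped:", logger.dropped)
```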

Timers Don't Fire At The Right Time

A busy system will not be able to fire timers at the expected time, which can cause a cascade of delays in the rest of the system.
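
Timer lateness is straightforward to measure: compare when a timer was supposed to fire with when it actually ran. The sketch below is a generic illustration (the delay and sample count are arbitrary); on a heavily loaded box the reported lateness grows.

```python
import threading
import time

def measure_timer_lateness(delay: float = 0.05, samples: int = 20) -> None:
    """Schedule timers and report how late the worst one actually fired."""
    lateness = []
    done = threading.Event()

    def fire(expected: float) -> None:
        lateness.append(time.monotonic() - expected)
        if len(lateness) == samples:
            done.set()

    for _ in range(samples):
        expected = time.monotonic() + delay
        threading.Timer(delay, fire, args=(expected,)).start()
    done.wait()
    print(f"worst lateness: {max(lateness) * 1000:.2f} ms")

measure_timer_lateness()
```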


ARP Drops

During high CPU or network load, ARPs can drop between machines. This means packets are sent to the wrong cards until the tables are updated, which may never happen.

File Descriptor Limits

There is usually a fixed limit to the number of file descriptors available on a box. The design must keep the maximum number used under that limit. If socket descriptors are taken out of the file descriptor pool, then a design with a large number of connections (ftp, com, booting, clients, etc.) will cause problems. At scale it is possible to see a spike in the number of descriptors, and descriptor leaks will use up the available pool as scale increases.
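
On a POSIX system the per-process cap is visible, and adjustable up to the hard limit, through RLIMIT_NOFILE, so a design can check it at startup instead of discovering it later as a burst of "too many open files" errors. A small sketch, assuming Linux/Unix and an illustrative target of 4096 descriptors:

```python
import resource

def ensure_fd_headroom(expected_descriptors: int) -> None:
    """Raise the soft file-descriptor limit if the design needs more than it allows."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"file descriptors: soft={soft} hard={hard}")
    if expected_descriptors > soft:
        # The soft limit can only be raised as far as the hard limit.
        if hard == resource.RLIM_INFINITY:
            new_soft = expected_descriptors
        else:
            new_soft = min(expected_descriptors, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        print(f"raised soft limit to {new_soft}")
    if hard != resource.RLIM_INFINITY and expected_descriptors > hard:
        print("design exceeds the hard limit; connections must be pooled or shed")

ensure_fd_headroom(expected_descriptors=4096)
```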

Socket Buffer Limits

Every socket has an allocated amount of buffer space. Large numbers of sockets can reduce the overall amount of available memory. As scale increases, drops increase, because the amount of buffer space available to receive messages cannot keep up with the load. This is also related to priority, because a task may not have enough priority to read data out of the socket. On the sender side, a high priority task may overwhelm a lower priority task with messages.
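
Per-socket buffer sizes can be inspected through SO_RCVBUF/SO_SNDBUF, and the kernel caps what setsockopt will actually grant, so it is worth reading the value back. A hypothetical check:

```python
import socket

def report_socket_buffers(requested_rcvbuf: int = 1 << 20) -> None:
    """Show default buffer sizes and what the kernel actually grants on request."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    default_rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    default_snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    print(f"default rcvbuf={default_rcv} sndbuf={default_snd}")

    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_rcvbuf)
    granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    # On Linux the granted value is capped by net.core.rmem_max (and doubled for bookkeeping).
    print(f"requested rcvbuf={requested_rcvbuf}, granted={granted}")
    s.close()

report_socket_buffers()
```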


Boot Image Serving Limit

The number of booting cards a node can serve at a time is limited to some value X. The ftp server infrastructure must limit the number of cards it serves, or the node becomes CPU starved.

Out Of Order Messaging

Under stress, the messaging system may start delivering messages out of order, which causes problems for operations that are not idempotent.
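
A common defense, given here as a sketch rather than the article's prescription, is to make message handling idempotent and to track sequence numbers, so duplicates and reordering are detected instead of being applied twice or in the wrong order. The class and field names are made up.

```python
class IdempotentReceiver:
    """Apply each message at most once, and only in sequence order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}   # out-of-order messages waiting for their turn

    def receive(self, seq: int, payload: str) -> None:
        if seq < self.next_seq:
            return                      # duplicate or stale message: ignore
        self.pending[seq] = payload
        while self.next_seq in self.pending:
            print("applying", self.pending.pop(self.next_seq))
            self.next_seq += 1

r = IdempotentReceiver()
for seq, msg in [(0, "a"), (2, "c"), (1, "b"), (1, "b")]:  # reordered and duplicated
    r.receive(seq, msg)
# applies a, b, c exactly once, in order
```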

Protocol Weakness

Unless application protocols are created very carefully, increased scale brings out a lot of problems.

Connection Limits

A central server of some kind with ten clients may perform adequately, but with a thousand clients it might fail to meet response time requirements. In this case, the average response time probably scales linearly with the number of clients; we say it has a complexity of O(N) ("order N"). But there are problems with other complexities too. For example, if we want N nodes in a network to be able to communicate with each other, we could connect each one to a central exchange, requiring O(N) wires, or we could provide a direct connection between each pair, requiring O(N^2) wires (the exact number or formula usually matters less than the highest power of N involved).
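
The wiring arithmetic in that example is easy to make concrete; the short sketch below just evaluates both growth rates for a few values of N.

```python
def star_links(n: int) -> int:
    """Every node connects to one central exchange: O(N) links."""
    return n

def full_mesh_links(n: int) -> int:
    """Every pair of nodes connects directly: N*(N-1)/2, i.e. O(N^2) links."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(f"N={n}: star={star_links(n)}, full mesh={full_mesh_links(n)}")
# N=1000: star=1000, full mesh=499500 -- the highest power of N dominates.
```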


Tiered Architectures

There is a good summary of this elsewhere, so just the gist here: tier-based architecture was never meant to enable building low latency, high throughput applications. The fundamental problem with tiered architecture is that it was created to solve yesterday's problem. In the transition from the client-server era to the internet era, it was the perfect solution for scalability.

The problem domain was how to scale applications to support hundreds of thousands of users. The solution to that problem was the n-tier architecture we know today. The scalability dimension that was chosen was the presentation layer, scaled through load balancers, and it did in fact solve that problem. However, the problem has since evolved. These days, in many industries, the problem is not just scaling up the user experience; it is also a question of data volumes.

Multiprocessor Performance Problems

Magic hardware caching speedups tend to break down when processors are asked to switch between a large number of unrelated workloads.



This article is reposted from linux系统运维: http://www.linuxyw.com/linux/yenei/315.html
