SAFARI Technical Report No. 2016-001 (January 26, 2016)

This is a summary of the original paper, entitled "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture", which appears in HPCA 2013 [37].

Tiered-Latency DRAM (TL-DRAM)

Donghyuk Lee    Yoongu Kim    Vivek Seshadri    Jamie Liu    Lavanya Subramanian    Onur Mutlu

Carnegie Mellon University
Abstract

This paper summarizes the idea of Tiered-Latency DRAM, which was published in HPCA 2013 [37]. The key goal of TL-DRAM is to provide low DRAM latency at low cost, a critical problem in modern memory systems [55]. To this end, TL-DRAM introduces heterogeneity into the design of a DRAM subarray by segmenting the bitlines, thereby creating a low-latency, low-energy, low-capacity portion in the subarray (called the near segment), which is close to the sense amplifiers, and a high-latency, high-energy, high-capacity portion, which is farther away from the sense amplifiers. Thus, DRAM becomes heterogeneous, with a small portion having lower latency and a large portion having higher latency. Various techniques can be employed to take advantage of the low-latency near segment and this new heterogeneous DRAM substrate, including hardware-based caching and software-based caching and memory allocation of frequently used data in the near segment. Evaluations with such simple techniques show significant performance and energy-efficiency benefits [37].

1 Summary

1.1 The Problem: High DRAM Latency

Primarily due to its low cost-per-bit, DRAM has long been the choice substrate for architecting main memory subsystems. In fact, DRAM's cost-per-bit has been decreasing at a rapid rate as DRAM process technology scales to integrate ever more cells into the same die area. As a result, each successive generation of DRAM technology has enabled increasingly large-capacity main memory subsystems at low cost.

In stark contrast to the continued scaling of cost-per-bit, the latency of DRAM has remained almost constant. During the same 11-year interval in which DRAM's cost-per-bit decreased by a factor of 16, DRAM latency (as measured by the tRCD and tRC timing constraints) decreased by only 30.5% and 26.3% [5, 25], as shown in Figure 1 of our paper [37]. From the perspective of the processor, an access to DRAM takes hundreds of cycles, time during which the processor may be stalled, waiting for DRAM. Such wasted time leads to large performance degradations.

1.2 Key Observations and Our Goal

Bitline: Dominant Source of Latency. In DRAM, each bit is represented as electrical charge in a capacitor-based cell. The small size of this capacitor necessitates the use of an auxiliary structure, called a sense-amplifier, to detect the small amount of charge held by the cell and amplify it to a full digital logic value. But a sense-amplifier is approximately one hundred times larger than a cell [61]. To amortize their large size, each sense-amplifier is connected to many DRAM cells through a wire called a bitline.

Every bitline has an associated parasitic capacitance whose value is proportional to the length of the bitline. Unfortunately, such parasitic capacitance slows down DRAM operation for two reasons. First, it increases the latency of the sense-amplifiers. When the parasitic capacitance is large, a cell cannot quickly create a voltage perturbation on the bitline that could be easily detected by the sense-amplifier. Second, it increases the latency of charging and precharging the bitlines. Although the cell and the bitline must be restored to their quiescent voltages during and after an access to a cell, such a procedure takes much longer when the parasitic capacitance is large. Due to the above reasons and a detailed latency break-down (refer to our HPCA-19 paper [37]), we conclude that long bitlines are the dominant source of DRAM latency [22, 70, 51, 52].
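To make the first effect concrete, the standard first-order charge-sharing relation (a textbook approximation that we add here for illustration; it is not taken from the paper) shows why a longer bitline yields a weaker sense signal:

  \Delta V_{bitline} \approx \frac{V_{DD}}{2} \cdot \frac{C_{cell}}{C_{cell} + C_{bitline}}, \qquad C_{bitline} \propto \text{cells per bitline}

A smaller voltage perturbation takes the sense-amplifier longer to resolve, and charging or precharging a larger bitline capacitance through a roughly fixed drive resistance is likewise slower; both effects therefore worsen as the bitline gets longer.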
Latency vs. Cost Trade-Off. The bitline length is a key design parameter that exposes the important trade-off between latency and die-size (cost). Short bitlines (few cells per bitline) constitute a small electrical load (parasitic capacitance), which leads to low latency. However, they require more sense-amplifiers for a given DRAM capacity (Figure 1a), which leads to a large die-size. In contrast, long bitlines have high latency and a small die-size (Figure 1b). As a result, neither of these two approaches can optimize for both latency and cost-per-bit.

[Figure 1. DRAM: Latency vs. Cost Optimized, Our Proposal. (a) Latency Optimized: short bitlines with many sense-amplifiers. (b) Cost Optimized: long bitlines with few sense-amplifiers. (c) Our Proposal: a long bitline segmented by an isolation transistor.]

Figure 2 shows the trade-off between DRAM latency and die-size by plotting the latency (tRCD and tRC) and the die-size for different values of cells-per-bitline. Existing DRAM architectures are either optimized for die-size (commodity DDR3 [64, 50]) and are thus low cost but high latency, or optimized for latency (RLDRAM [49], FCRAM [65]) and are thus low latency but high cost.

The goal of our paper [37] is to design a new DRAM architecture that approximates the best of both worlds (i.e., low latency and low cost), based on the key observation that long bitlines are the dominant source of DRAM latency.
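The shape of this trade-off can be sketched with a simple first-order model. The sketch below is illustrative only; it is not the circuit-level SPICE methodology used in the paper [37], and its constants are arbitrary, chosen merely to show why latency falls and die-size rises as cells-per-bitline shrinks.

  # Toy model of the bitline-length trade-off behind Figure 2.
  # Assumptions (not from the paper): latency has a fixed part plus a part that
  # scales with bitline capacitance; die-size is cell area plus one sense-amplifier
  # stripe per bitline, so fewer cells per bitline means more sense-amplifiers.

  def relative_latency(cells_per_bitline, baseline=512, fixed_fraction=0.3):
      scaled = (1.0 - fixed_fraction) * (cells_per_bitline / baseline)
      return fixed_fraction + scaled

  def relative_die_size(cells_per_bitline, baseline=512, sense_amp_weight=0.1):
      cell_area = 1.0
      sense_amp_area = sense_amp_weight * (baseline / cells_per_bitline)
      return (cell_area + sense_amp_area) / (cell_area + sense_amp_weight)

  for n in (512, 256, 128, 64, 32):
      print(f"{n:3d} cells/bitline: latency x{relative_latency(n):.2f}, "
            f"die-size x{relative_die_size(n):.2f}")

Even this crude model reproduces the qualitative behavior in Figure 2: short-bitline designs (RLDRAM/FCRAM-like) trade a much larger die for lower latency, while long-bitline designs (DDR3-like) do the opposite.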
[Figure 2. Bitline Length: Latency vs. Die-Size. Normalized die-size per chip versus latency (tRCD and tRC, in ns) for designs ranging from 512 cells-per-bitline (73.5 mm2 die) down to 16 cells-per-bitline (492 mm2 die); latency-optimized RLDRAM and FCRAM and cost-optimized DDR3 are marked.]

1.3 Tiered-Latency DRAM

To achieve the latency advantage of short bitlines and the cost advantage of long bitlines, we propose the Tiered-Latency DRAM (TL-DRAM) architecture, which is shown in Figures 1c and 3a. The key idea of TL-DRAM is to divide the long bitline into two shorter segments using an isolation transistor: the near segment (connected directly to the sense-amplifier) and the far segment (connected through the isolation transistor).

[Figure 3. TL-DRAM: Near vs. Far Segments. (a) Organization: the near and far segments of a bitline, separated by an isolation transistor. (b) Near segment access: isolation transistor turned off. (c) Far segment access: isolation transistor turned on.]

The primary role of the isolation transistor is to electrically decouple the two segments from each other. This changes the effective bitline length (and also the effective bitline capacitance) as seen by the cell and sense-amplifier. Correspondingly, the latency to access a cell is also changed, albeit differently depending on whether the cell is in the near or the far segment.

When accessing a cell in the near segment, the isolation transistor is turned off, disconnecting the far segment (Figure 3b). Since the cell and the sense-amplifier see only the reduced bitline capacitance of the shortened near segment, they can drive the bitline voltage more easily. As a result, the bitline voltage is restored more quickly, so that the latency (tRC) for the near segment is significantly reduced. On the other hand, when accessing a cell in the far segment, the isolation transistor is turned on to connect the entire length of the bitline to the sense-amplifier. In this case, the isolation transistor acts like a resistor inserted between the two segments (Figure 3c) and limits how quickly charge flows to the far segment. Because the far segment capacitance is charged more slowly, it takes longer for the far segment voltage to be restored, so that the latency (tRC) is increased for cells in the far segment.

Latency, Power, and Die-Area. Table 1 summarizes the latency, power, and die-area characteristics of TL-DRAM compared to other DRAMs, estimated using circuit-level SPICE simulation [56] and power/area models from Rambus [61]. Compared to commodity DRAM (long bitlines), which incurs high latency (tRC) for all cells, TL-DRAM offers significantly reduced latency (tRC) for cells in the near segment, while increasing the latency for cells in the far segment due to the additional resistance of the isolation transistor. In DRAM, a large fraction of the power is consumed by the bitlines. Since the near segment in TL-DRAM has a lower bitline capacitance, it consumes less power. On the other hand, accessing the far segment requires toggling the isolation transistors, leading to increased power consumption. Mainly due to the additional isolation transistors, TL-DRAM increases die-area by 3%. Our paper includes detailed circuit-level analyses of TL-DRAM (Section 4 of [37]).

                              Short Bitline    Long Bitline     Segmented Bitline (Fig. 1c)
                              (Fig. 1a,        (Fig. 1b,        Near             Far
                              unsegmented)     unsegmented)
  Length (Cells)              32               512              32               480
  Latency (tRC)               Low (23.1 ns)    High (52.5 ns)   Low (23.1 ns)    Higher (65.8 ns)
  Normalized Power Consump.   Low (0.51)       High (1.00)      Low (0.51)       Higher (1.49)
  Normalized Die-Size (Cost)  High (3.76)      Lower (1.00)     Low (1.03)

Table 1. Latency, Power, and Die-Area Comparison
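As a rough back-of-the-envelope illustration of why this asymmetry is attractive (our own arithmetic, not an analysis from the paper, and it ignores caching and migration overheads): if, as reported later in Section 1.5, over 90% of requests are served from the near segment when it is used as a cache, the average row-cycle time is approximately

  0.9 \times 23.1\,\text{ns} + 0.1 \times 65.8\,\text{ns} \approx 27.4\,\text{ns},

roughly half of the 52.5 ns of a conventional long-bitline design.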
1.4 Leveraging TL-DRAM

TL-DRAM enables the design of many new memory management policies that exploit the asymmetric latency characteristics of the near and the far segments. Our HPCA-19 paper (in Section 5) describes four ways of taking advantage of TL-DRAM. Here, we describe two approaches in particular.

In the first approach, the memory controller uses the near segment as a hardware-managed cache for the far segment. In our HPCA-19 paper [37], we discuss three policies for managing the near segment cache. (The three policies differ in deciding when a row in the far segment is cached into the near segment and when it is evicted.) In addition, we propose a new data transfer mechanism (Inter-Segment Data Transfer) that efficiently migrates data between the segments by taking advantage of the fact that the bitline is a bus connected to the cells in both segments. By using this technique, the data from the source row can be transferred to the destination row over the bitlines at very low latency (an additional 4 ns over tRC). Furthermore, this Inter-Segment Data Transfer happens exclusively within a DRAM bank without utilizing the DRAM channel, allowing concurrent accesses to other banks.
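The controller-side logic of this first approach can be outlined as follows. This is a deliberately simplified sketch: the paper's best-performing policy is Benefit-Based Caching, whereas the sketch below uses a plain LRU policy, and every name in it (NearSegmentCache, the timing constants, the access() interface) is hypothetical and for illustration only.

  # Near segment as a hardware-managed cache for far-segment rows (simplified sketch).
  from collections import OrderedDict

  T_RC_NEAR, T_RC_FAR, T_MIGRATE_EXTRA = 23.1, 65.8, 4.0   # ns, from Table 1 / Section 1.4

  class NearSegmentCache:
      def __init__(self, num_near_rows=32):
          self.capacity = num_near_rows
          self.cached = OrderedDict()            # far-row id -> cached flag, in LRU order

      def access(self, far_row):
          """Return the latency of one access, caching the row on a miss."""
          if far_row in self.cached:             # hit: served by the fast near segment
              self.cached.move_to_end(far_row)
              return T_RC_NEAR
          if len(self.cached) >= self.capacity:  # miss: evict the least recently used row
              self.cached.popitem(last=False)
          self.cached[far_row] = True
          # serve from the far segment, then copy the row over the shared bitlines
          return T_RC_FAR + T_MIGRATE_EXTRA      # inter-segment copy adds ~4 ns over tRC

  cache = NearSegmentCache()
  stream = [7, 7, 9, 7, 9, 42, 7, 9]             # reuse mostly hits in the near segment
  print(sum(cache.access(r) for r in stream) / len(stream), "ns average")

A real design must also decide which rows are worth caching at all and when cached rows are evicted; the three policies evaluated in Section 5 of [37] differ exactly in these decisions.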
In the second approach, the near segment capacity is exposed to the OS, enabling the OS to use the full DRAM capacity. We propose two concrete mechanisms: one where the memory controller uses an additional layer of indirection to map frequently accessed pages to the near segment, and another where the OS uses static/dynamic profiling to directly map frequently accessed pages to the near segment. In both approaches, accesses to pages that are mapped to the near segment are served faster and with lower power than in conventional DRAM, resulting in improved system performance and energy efficiency.
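For the second approach, the profiling-based variant can be sketched at a similarly high level. The mechanism below is an illustrative outline only, not the design specified in the paper; the epoch-based rebalancing and all names in it (hottest_pages, rebalance, near_segment_pages) are hypothetical.

  # Sketch: periodically map the most frequently accessed pages to the near segment.
  from collections import Counter

  def hottest_pages(access_profile: Counter, near_segment_pages: int):
      """Pick the pages with the highest access counts for the low-latency near segment."""
      return {page for page, _ in access_profile.most_common(near_segment_pages)}

  def rebalance(access_profile: Counter, current_near_pages: set, near_segment_pages: int):
      """Return (pages to migrate in, pages to migrate out) for the next epoch."""
      desired = hottest_pages(access_profile, near_segment_pages)
      return desired - current_near_pages, current_near_pages - desired

  # Example epoch: pages 3 and 5 are hot, so they displace the colder pages 1 and 2.
  profile = Counter({3: 900, 5: 750, 1: 40, 2: 10})
  move_in, move_out = rebalance(profile, {1, 2}, near_segment_pages=2)
  print("migrate in:", move_in, "migrate out:", move_out)

In a TL-DRAM system, the resulting migrations could themselves use the Inter-Segment Data Transfer mechanism described above, keeping the remapping traffic off the DRAM channel.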
1.5 Results: Performance and Power

Our HPCA-19 paper [37] provides extensive detail about both of the above approaches. But, due to space constraints, we present the evaluation results of only the first approach, in which the near segment is used as a hardware-managed cache managed under our best policy (Benefit-Based Caching), to show the advantage of our TL-DRAM substrate.

Performance & Power Analysis. Figure 4 shows the average performance improvement and power-efficiency of our proposed mechanism over the baseline with conventional DRAM, on 1-, 2- and 4-core systems. As described in Section 1.3, access latency and power consumption are significantly lower for near segment accesses, but higher for far segment accesses, compared to accesses in a conventional DRAM. We observe that a large fraction (over 90% on average) of requests hit in the rows cached in the near segment, thereby accessing the near segment with low latency and low power consumption. As a result, TL-DRAM achieves significant performance improvements of 12.8%/12.3%/11.0% and power savings of 23.6%/26.4%/28.6% in 1-/2-/4-core systems, respectively.

[Figure 4. IPC Improvement & Power Consumption. (a) IPC improvement and (b) power reduction over the conventional-DRAM baseline, versus core count (# of channels): 1 (1-ch), 2 (2-ch), 4 (4-ch).]

Sensitivity to Near Segment Capacity. The number of rows in the near segment presents a trade-off, since increasing the near segment's size increases its capacity but also increases its access latency. Figure 5 shows the performance improvement of our proposed mechanisms over the baseline as we vary the near segment size. Initially, performance improves as the number of rows in the near segment increases, since more data can be cached. However, increasing the number of rows in the near segment beyond 32 reduces the performance benefit due to the increased capacitance.

[Figure 5. Effect of Varying Near Segment Capacity.]

Other Results. In our HPCA-19 paper, we provide a detailed analysis of how timing parameters and power consumption vary when varying the near segment length, in Sections 4 and 6.3, respectively. We also provide a comprehensive evaluation of the mechanisms we build on top of the TL-DRAM substrate for single- and multi-core systems in Section 8.

All of our results are gathered using an in-house version of Ramulator [31], an open-source DRAM simulator [30], which is integrated into an in-house processor simulator.

2 Significance

2.1 Novelty

To our knowledge, our HPCA-19 paper is the first to enable latency heterogeneity in DRAM without significantly increasing cost-per-bit and to propose hardware/software mechanisms that leverage this latency heterogeneity to improve system performance. We make the following major contributions.

A Cost-Efficient Low-Latency DRAM. Based on the key observation that long internal wires (bitlines) are the dominant source of DRAM latency, we propose a new DRAM architecture called Tiered-Latency DRAM (TL-DRAM). To our knowledge, this is the first work to enable low-latency DRAM without significantly increasing the cost-per-bit. By adding a single isolation transistor to each bitline, we carve out a region within a DRAM chip, called the near segment, that is fast and energy-efficient. This comes at a modest overhead of a 3% increase in DRAM die-area. While there are two prior approaches to reducing DRAM latency (using short bitlines [49, 65], or adding an SRAM cache in DRAM [20, 18, 16, 84]), both of these approaches significantly increase die-area due to additional sense-amplifiers or additional area for the SRAM cache, as we evaluate in our paper [37]. Compared to these prior approaches, TL-DRAM is a much more cost-effective architecture for achieving low latency.

There are many works that reduce overall memory access latency by modifying DRAM, the DRAM-controller interface, and DRAM controllers. These works enable more parallelism and bandwidth [29, 10, 66, 40], reduce refresh counts [42, 43, 26, 79, 60], accelerate bulk operations [66, 68, 69, 11], accelerate computation in the logic layer of 3D-stacked DRAM [2, 1, 83, 17], enable better communication between the CPU and other devices through DRAM [39], leverage process variation and temperature dependency in DRAM [38], leverage DRAM access patterns [19], reduce write-related latencies by better designing DRAM and DRAM control policies [13, 36, 67], and reduce overall queuing latencies in DRAM by better scheduling memory requests [53, 54, 27, 28, 75, 73, 21, 78]. Our proposal is orthogonal to all of these approaches and can be applied in conjunction with them to achieve higher latency and energy benefits.

Inter-Segment Data Transfer. By implementing latency heterogeneity within a DRAM subarray, TL-DRAM enables efficient data transfer between the fast and slow segments by utilizing the bitlines as a wide bus. This mechanism takes advantage of the fact that both the source and destination cells share the same bitlines. Furthermore, this inter-segment migration happens only within a DRAM bank and does not utilize the DRAM channel, thereby allowing concurrent accesses to other banks over the channel. This inter-segment data transfer enables fast and efficient movement of data within DRAM, which in turn enables efficient ways of taking advantage of latency heterogeneity.

Son et al. propose a low-latency DRAM architecture [71] that has fast (short bitline) and slow (long bitline) subarrays in DRAM. This approach provides the largest benefit when allocating latency-critical data to the low-latency regions (the low-latency subarrays). Therefore, overall memory system performance is sensitive to the page placement policy.
However, our inter-segment data transfer enables efficient relocation of pages, leading to dynamic page placement based on the latency criticality of each page.

2.2 Potential Long-Term Impact

Tolerating High DRAM Latency by Enabling New Layers in the Memory Hierarchy. Today, there is a large latency cliff between the on-chip last-level cache and off-chip DRAM, leading to a large performance fall-off when applications start missing in the last-level cache. By introducing an additional fast layer (the near segment) within the DRAM itself, TL-DRAM smoothens this latency cliff.

Note that many recent works added a DRAM cache or created heterogeneous main memories [33, 35, 59, 47, 81, 62, 57, 48, 44, 12, 63, 41, 14] to smooth the latency cliff between the last-level cache and a longer-latency non-volatile main memory, e.g., Phase Change Memory [33, 35, 59], or to take advantage of multiple different types of memories to optimize for multiple metrics. Our approach is similar at the high level (i.e., to reduce the latency cliff at low cost by taking advantage of heterogeneity), yet we introduce the new low-latency layer within DRAM itself instead of adding a completely separate device.

Applicability to Future Memory Devices. We show the benefits of TL-DRAM's asymmetric latencies. Considering that most memory devices adopt a similar cell organization (i.e., a 2-dimensional cell array and row/column bus connections), our approach of reducing the electrical load of connecting to a bus (bitline) to achieve low access latency can be applicable to other memory devices.

Furthermore, the idea of performing inter-segment data transfer can also potentially be applied to other memory devices, regardless of the memory technology. For example, we believe it is promising to examine similar approaches for emerging memory technologies like Phase Change Memory [33, 59, 58, 46, 82, 34] or STT-MRAM [32, 80], as well as the NAND flash memory technology [45, 8, 9, 7, 6].

New Research Opportunities. The TL-DRAM substrate creates new opportunities by enabling mechanisms that can leverage the latency heterogeneity offered by the substrate. We briefly describe three directions, but we believe many new possibilities abound.

• New ways of leveraging TL-DRAM. TL-DRAM is a substrate that can be utilized for many applications. Although we describe two major ways of leveraging TL-DRAM in our HPCA-19 paper, we believe there are several more ways to leverage the TL-DRAM substrate both in hardware and software. For instance, new mechanisms could be devised to detect data that is latency-critical (e.g., data that causes many threads to become serialized [15, 77, 23, 76, 24] or data that belongs to threads that are more latency-sensitive [27, 28, 72, 78, 3, 4, 73, 75, 74]) or could become latency-critical in the near future, and allocate/prefetch such data into the near segment.

• Opening up new design spaces with multiple tiers. TL-DRAM can be easily extended to have multiple latency tiers by adding more isolation transistors to the bitlines, providing more latency asymmetry. (Our HPCA-19 paper provides an analysis of the latency of a TL-DRAM design with three tiers, showing the spread in latency across the three tiers.) This enables new mechanisms, both in hardware and software, that can allocate data appropriately to different tiers based on their access characteristics such as locality, criticality, etc.

• Inspiring new ways of architecting latency heterogeneity within DRAM. To our knowledge, TL-DRAM is the first to enable latency heterogeneity within DRAM by significantly modifying the existing DRAM architecture. We believe that this could inspire research on other possible ways of architecting latency heterogeneity within DRAM or other memory devices.

References

[1] J. Ahn et al. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In ISCA, 2015.
[2] J. Ahn et al. PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture. In ISCA, 2015.
[3] R. Ausavarungnirun et al. Staged memory scheduling: achieving high performance and scalability in heterogeneous systems. In ISCA, 2012.
[4] R. Ausavarungnirun et al. Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance. In PACT, 2015.
[5] S. Borkar and A. A. Chien. The future of microprocessors. In CACM, 2011.
[6] Y. Cai et al. Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation. In ICCD, 2013.
[7] Y. Cai et al. Neighbor-cell Assisted Error Correction for MLC NAND Flash Memories. In SIGMETRICS, 2014.
[8] Y. Cai et al. Data retention in MLC NAND flash memory: Characterization, optimization, and recovery. In HPCA, 2015.
[9] Y. Cai et al. Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery. In DSN, 2015.
[10] K. K. Chang et al. Improving DRAM performance by parallelizing refreshes with accesses. In HPCA, 2014.
[11] K. K. Chang et al. Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. In HPCA, 2016.
[12] N. Chatterjee et al. Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access. In MICRO, 2012.
[13] N. Chatterjee et al. Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads. In HPCA, 2012.
[14] G. Dhiman et al. PDRAM: A hybrid PRAM and DRAM main memory system. In DAC, 2009.
[15] E. Ebrahimi et al. Parallel Application Memory Scheduling. In MICRO, 2011.
[16] Enhanced Memory Systems. Enhanced SDRAM SM2604, 2002.
[17] Q. Guo et al. 3D-Stacked Memory-Side Acceleration: Accelerator and System Design. In WoNDP, 2013.
[18] C. A. Hart. CDRAM in a unified memory architecture. In Compcon Spring '94, Digest of Papers, 1994.
[19] H. Hassan et al. ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality. In HPCA, 2016.
[20] H. Hidaka et al. The Cache DRAM Architecture: A DRAM with an On-Chip Cache Memory. In IEEE Micro, 1990.
[21] E. Ipek et al. Self optimizing memory controllers: A reinforcement learning approach. In ISCA, 2008.
[22] JEDEC. DDR3 SDRAM STANDARD. http://www.jedec.org/standards-documents/docs/jesd-79-3d, 2010.
[23] J. A. Joao et al. Bottleneck identification and scheduling in multithreaded applications. In ASPLOS, 2012.
[24] J. A. Joao et al. Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs. In ISCA, 2013.
[25] T. S. Jung. Memory technology and solutions roadmap. http://www.sec.co.kr/images/corp/ir/irevent/techforum_01.pdf, 2005.
[26] S. Khan et al. The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study. In SIGMETRICS, 2014.
[27] Y. Kim et al. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. In HPCA, 2010.
[28] Y. Kim et al. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In MICRO, 2010.
[29] Y. Kim et al. A case for exploiting subarray-level parallelism (SALP) in DRAM. In ISCA, 2012.
[30] Y. Kim et al. Ramulator source code. https://github.com/CMU-SAFARI/ramulator, 2015.
[31] Y. Kim, W. Yang, and O. Mutlu. Ramulator: A Fast and Extensible DRAM Simulator. In IEEE CAL, 2015.
[32] E. Kultursay et al. Evaluating STT-RAM as an energy-efficient main memory alternative. In ISPASS, 2013.
[33] B. C. Lee et al. Architecting Phase Change Memory As a Scalable DRAM Alternative. In ISCA, 2009.
[34] B. C. Lee et al. Phase Change Memory Architecture and the Quest for Scalability. In CACM, 2010.
[35] B. C. Lee et al. Phase-Change Technology and the Future of Main Memory. In IEEE Micro, 2010.
[36] C. J. Lee et al. DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems. In UT Tech Report TR-HPS-2010-002, 2010.
[37] D. Lee et al. Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture. In HPCA, 2013.
[38] D. Lee et al. Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case. In HPCA, 2015.
[39] D. Lee et al. Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM. In PACT, 2015.
[40] D. Lee et al. Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost. In ACM TACO, 2016.
[41] Y. Li et al. Managing Hybrid Main Memories with a Page-Utility Driven Performance Model. In CoRR abs/1507.03303, 2015.
[42] J. Liu et al. RAIDR: Retention-Aware Intelligent DRAM Refresh. In ISCA, 2012.
[43] J. Liu et al. An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. In ISCA, 2013.
[44] Y. Luo et al. Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost via Heterogeneous-Reliability Memory. In DSN, 2014.
[45] Y. Luo et al. WARM: Improving NAND flash memory lifetime with write-hotness aware retention management. In MSST, 2015.
[46] J. Meza et al. A case for small row buffers in non-volatile main memories. In ICCD, 2012.
[47] J. Meza et al. Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management. In IEEE CAL, 2012.
[48] J. Meza et al. A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory. In WEED, 2013.
[49] Micron. RLDRAM 2 and 3 Specifications. http://www.micron.com/products/dram/rldram-memory.
[50] Y. Moon et al. 1.2V 1.6Gb/s 56nm 6F2 4Gb DDR3 SDRAM with hybrid-I/O sense amplifier and segmented sub-array architecture. ISSCC, 2009.
[51] O. Mutlu. Memory Scaling: A Systems Architecture Perspective. In IMW, 2013.
[52] O. Mutlu. Main Memory Scaling: Challenges and Solution Directions. In More than Moore Technologies for Next Generation Computer Design. Springer, 2015.
[53] O. Mutlu and T. Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO, 2007.
[54] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In ISCA, 2008.
[55] O. Mutlu and L. Subramanian. Research Problems and Opportunities in Memory Systems. In SUPERFRI, 2015.
[56] S. Narasimha et al. High performance 45-nm SOI technology with enhanced strain, porous low-k BEOL, and immersion lithography. In IEDM, 2006.
[57] S. Phadke and S. Narayanasamy. MLP aware heterogeneous memory system. In DATE, 2011.
[58] M. K. Qureshi et al. Enhancing Lifetime and Security of PCM-based Main Memory with Start-gap Wear Leveling. In MICRO, 2009.
[59] M. K. Qureshi et al. Scalable High Performance Main Memory System Using Phase-change Memory Technology. In ISCA, 2009.
[60] M. K. Qureshi et al. AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems. In DSN, 2015.
[61] Rambus. DRAM Power Model. http://www.rambus.com/energy, 2010.
[62] L. E. Ramos et al. Page placement in hybrid memory systems. In ICS, 2011.
[63] J. Ren et al. ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems. In MICRO, 2015.
[64] Samsung. DRAM Data Sheet. http://www.samsung.com/global/business/semiconductor/product.
[65] Y. Sato et al. Fast Cycle RAM (FCRAM); a 20-ns random row access, pipe-lined operating DRAM. In Symposium on VLSI Circuits, 1998.
[66] V. Seshadri et al. RowClone: Fast and Energy-efficient in-DRAM Bulk Data Copy and Initialization. In MICRO, 2013.
[67] V. Seshadri et al. The Dirty-Block Index. In ISCA, 2014.
[68] V. Seshadri et al. Fast Bulk Bitwise AND and OR in DRAM. In IEEE CAL, 2015.
[69] V. Seshadri et al. Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses. In MICRO, 2015.
[70] S. M. Sharroush et al. Dynamic random-access memories without sense amplifiers. In Elektrotechnik & Informationstechnik, 2012.
[71] Y. H. Son et al. Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations. In ISCA, 2013.
[72] L. Subramanian et al. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In HPCA, 2013.
[73] L. Subramanian et al. The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost. In ICCD, 2014.
[74] L. Subramanian et al. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory. In MICRO, 2015.
[75] L. Subramanian et al. The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity. In TPDS, 2016.
[76] M. A. Suleman et al. Accelerating critical section execution with asymmetric multi-core architectures. In ASPLOS, 2009.
[77] M. A. Suleman et al. Data Marshaling for Multi-core Architectures. In ISCA, 2010.
[78] H. Usui et al. DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators. In ACM TACO, 2016.
[79] R. Venkatesan et al. Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM. In HPCA, 2006.
[80] J. Wang et al. Enabling High-performance LPDDRx-compatible MRAM. In ISLPED, 2014.
[81] H. Yoon et al. Row Buffer Locality Aware Caching Policies for Hybrid Memories. In ICCD, 2012.
[82] H. Yoon et al. Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories. In ACM TACO, 2014.
[83] D. Zhang et al. TOP-PIM: Throughput-oriented Programmable Processing in Memory. In HPCA, 2014.
[84] Z. Zhang et al. Cached DRAM for ILP processor memory access latency reduction. IEEE Micro, July 2001.