Advance Access published on December 19, 2005. doi:10.1093/comjnl/bxh157
Instruction Level Parallelism through Microthreading—A Scalable Approach to Chip Multiprocessors

Kostas Bousias1, Nabil Hasasneh2 and Chris Jesshope1,∗
1Department of Computer Science, University of Amsterdam, NL
2Department of Electronic Engineering, University of Hull, UK
∗Corresponding author: Jesshope@science.uva.nl
Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. Wide issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width, leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity in instruction issue and eliminates speculative execution, is the microthreaded model. This model fragments sequential code at compile time and executes the fragments out of order while maintaining in-order execution within the fragments. The only constraints on the execution of fragments are the dependencies between them, which are managed in a distributed and scalable manner using synchronizing registers. The fragments of code are called microthreads and they capture ILP and loop concurrency. Fragments can be interleaved on a single processor to give tolerance to latency in operands or distributed to many processors to achieve speedup. The implementation of this model is fully scalable. It supports distributed instruction issue and a fully scalable register file, which implements a distributed, shared-register model of communication and synchronization between multiple processors on a single chip. This paper introduces the model, compares it with current approaches and presents an analysis of some of the implementation issues. It also presents results showing scalable performance with issue width over several orders of magnitude, from the same binary code.
Keywords: concurrency, CMP, microthreads, code fragments

Received 10 May 2005; revised 24 August 2005
1. INTRODUCTION
For many years now, researchers have been interested in the idea of achieving major increases in the computational power of computers by the use of chip multiprocessors (CMPs). Examples of CMPs are the Compaq Piranha [1], the Stanford Hydra [2] and that of Hammond et al. [3]. Several architectures have been proposed and some manufacturers have produced commercial designs, such as the IBM PowerPC [4] and Sun's MAJC project [5]. Ideally, the performance of such systems should be directly proportional to the number of processors used, i.e. should be scalable. CMPs scale well, with the limit to scalability defined by Moore's law. We calculate that current technology could support hundreds of in-order processors and achieve significant speedup over current architectures that use implicit concurrency and achieve minimal speedup through concurrent instruction issue [6]. One of the major barriers to the use of CMPs is the problem of programming them without using explicit concurrency in the user code. Ideally they should be programmed using legacy sequential code.

In theory, there is no limit to the number of processors that can be used in a CMP provided that the concurrency derived from the sequential code scales with the problem size. The problem is how to split the code into a number of independent threads, schedule these on many processors and to do this with a low and scalable overhead in terms of the control logic and processor efficiency. In fact, on general-purpose code it will
The Computer Journal, Vol. 49 No. 2, 2006
be impossible to eliminate all dependencies between threads and hence synchronization is also required. The goal of this work therefore is to define a feasible architecture for a scalable CMP that is easy to program, that maximizes throughput for a given technology and that minimizes the communication and synchronization overheads between different threads. It should be noted that in this work the term thread is used to describe very small code fragments with minimal context.

Today Intel's Itanium-2 (Madison) microprocessor features over 410 million transistors in a 0.13 µm semiconductor process technology operating at a speed of 1.5 GHz. This is a dual-processor version of the previous Itanium processor (McKinley), which has an issue width of six. Moore's law would indicate that the billion-transistor chip will become feasible in 65 nm technology within the next 3 or 4 years [7]. Intel expects that the Itanium processor will reach 1.7 billion transistors at the end of 2005. The questions we must ask are where do we go from here and what is the best way to utilize this wealth of transistors, while maximizing performance and minimizing power dissipation?

It can be argued that the current trend of increasing the clock speed and using a large area to marginally improve the instructions executed per cycle (IPC) of the chip is a poor strategy for future generations of microprocessors and does not guarantee better performance. The use of aggressive clock speed as a means of achieving performance only exacerbates the memory wall and requires larger on-chip caches. More importantly, this strategy cannot be continued indefinitely, as power density is a function of frequency and is becoming a critical design constraint. Using concurrency as a means of increasing performance without increasing power density is a much better strategy, so long as models and implementations can be found that are scalable. Conversely, by exploiting concurrency for constant performance, clock frequencies can be reduced, allowing voltages to be scaled down and gaining a quadratic decrease in power density with concurrency. Of course, this assumes completely scalable models and implementations.

Another problem area in future technology is the scaling of wire delays compared with gate delays. As transistor dimensions scale down, the number of gates which are reachable within the scaled clock is at best constant, which means that distributed rather than monolithic architectures need to be exploited [8].

Superscalar processors today issue up to eight instructions per clock cycle but instruction issue is not scalable [9] and a linear increase in parallelism requires at least a quadratic increase in area [10]. The logic required occupies ∼30% of the total chip area in a 6-way superscalar processor [11]. In addition, more and more area is being used for on-chip memory. Typically the second-level on-chip cache occupies 25–30% of the die area on a modern microprocessor and between 50 and 75% on the recently announced Itanium-2. Moreover, a significant delay and power consumption are seen in high-issue-width processors due to tag matching, wake-up signals to waiting instructions and selection mechanisms for issuing instructions. These delays increase quadratically with the instruction window size for most building blocks [12]. Finally, even with the implementation of a large instruction window, it is difficult for processors to find sufficient fine-grain parallelism, which has made most chip manufacturers, like Compaq, Intel and Sun, look at simultaneous multi-threading (SMT) [13] to expose more instruction level parallelism (ILP) through a combination of coarse- and fine-grain parallelism.

Future microprocessor technology needs efficient new architectures to meet the demands of many grand-challenge applications, such as weather and environmental modelling, computational physics (on both sub-atomic and galactic scales) and biomolecular simulation. In the near future, it should be possible to integrate thousands of arithmetic units on a single chip [14] but out-of-order execution is not a good candidate, because of the limited concurrency it exposes and the non-linear increase in hardware complexity of instruction issue with issue width. Instruction issue is not the only scaling problem; the complexity of the register file [15] and bypass logic [12] also scale badly, giving further barriers to a scalable architecture. Very long instruction word (VLIW) architectures transfer the task of instruction scheduling from the hardware to the compiler, which avoids the scaling problems in instruction issue. However, in practice, many modern VLIW architectures end up requiring many of the same complex mechanisms as superscalar processors [16], such as branch prediction, speculative loads, pipeline interlocks, and new mechanisms like code compressors.

Multi-threading can expose higher levels of concurrency and can also hide latency by switching to a new thread when one thread stalls. SMT appears to be the most popular form of multi-threading. In this technique, instructions from multiple threads are issued to a single wide-issue processor out of programmed order using the same problematic structures we have already described. It increases the amount of concurrency exposed in out-of-order issue by working on multiple threads simultaneously. The main drawback to SMT is that it complicates the instruction issue stage, which is central for the multiple threads [17]. Scalability in instruction issue is no easier to achieve because of this, and the other scalability problems remain unchanged. Thus SMT suffers from the same implementation problems [18] as superscalar processors.

An alternative approach to multi-threading that eliminates speculation and does provide scalable instruction issue is the microthreaded model. The threads in this model are small code fragments with an associated program counter. Little other state is required to manage them. The model is able to expose and support much higher levels of concurrency using explicit but dynamic controls. Like VLIW, the concurrency exposed comes from the compiler and expresses loops as well as basic-block concurrency. Unlike VLIW, the concurrency is parametric and is created dynamically. In pipelines that execute this
model, instructions are issued in-order from any of the threads allocated to it but the schedule of instructions executed is non-deterministic, being determined by data availability. Threads can be deterministically distributed to multiple pipelines based on a simple scheduling algorithm. The allocation of these threads is dynamic, being determined by resource availability, as the concurrency exposed is parametric and not limited by the hardware resources. The instruction issue schedule is also dynamic and requires linear hardware complexity to support it. Instructions can be issued from any microthread already allocated and active. If such an approach could also give a linear performance increase with the number of pipelines used, then it can provide a solution to both CMP and ILP scalability [19]. The performance of the microthreaded model is compared with a conventional pipeline in [20] for a variety of memory latencies. The results show that the microthreaded microprocessor always provides a superior performance to that of a non-threaded pipeline and consistently shows a high IPC on a single-issue pipeline (0.85), even for highly dependent code and in the presence of very high memory latencies. The model naturally requires more concurrency to achieve the same asymptotic performance as memory delays are increased. It will be shown in this paper that applying this model to a CMP can also provide orders-of-magnitude speedup, and IPCs of well over 1000 have been demonstrated. Note that both the instruction issue logic and the size of the register file in this model scale linearly with issue width.

2. CURRENT APPROACHES

2.1. Out-of-order execution
To achieve a higher performance, modern microprocessors use an out-of-order execution mechanism to keep multiple execution units as busy as possible. This is achieved by allowing instructions to be issued and completed out of the original program sequence as a means of exposing concurrency in a sequential instruction stream. More than one instruction can be issued in each cycle, but only independent instructions can be executed in parallel; other instructions must be kept waiting or, under some circumstances, can proceed speculatively.

Speculation refers to executing an instruction before it is known whether the results of the instruction will be used or not; this means that a guess is made as to the outcome of a control or data hazard as a means to continue executing instructions, rather than stalling the pipeline. Register renaming is also used to eliminate the artificial data dependencies introduced by issuing instructions out of order. This also enables the extension of the architectural register set of the original ISA, which is necessary to support concurrency in instruction execution. Any concurrent execution of a sequential program will require some similar mechanism to extend the synchronization memory available to instructions. Speculative execution and out-of-order issue are used in superscalar processors to expose concurrency from sequential binary code. A reorder buffer or future file of checkpoints and repairs is also required to re-order the completion of instructions before committing their results to the registers specified in the ISA, in order to achieve a sequentially consistent state on exceptions.

Control speculation predicts branch targets based on the prior history for the same instruction. Execution continues along this path as if the prediction was correct, so that when the actual target is resolved, a comparison with the predicted target will either match, giving a correctly predicted branch, or not, in which case there was a misprediction. A misprediction can require many pipeline cycles to clean up and, in a wide-issue pipeline, this can lead to hundreds of instruction slots being unused or, to be more precise if we focus on power, being used unnecessarily. Both prediction and cleaning up a misprediction require additional hardware, which consumes extra area and also power, often to execute instructions whose results are never used. It can therefore be described as wasteful of chip resources and, moreover, has unpredictable performance characteristics [21]. We will show that it is possible to obtain high performance without speculation and, moreover, save power in doing so.

As already noted in Section 1, as the issue width increases in an out-of-order, superscalar architecture, the size of the instruction window and associated logic increase quadratically, which results in a large percentage of the chip being devoted to instruction issue. The out-of-order execution mechanism therefore prevents concurrency from scaling with technology and will ultimately restrict performance over time. The only reason for using this approach is that it provides an implicit mechanism to achieve concurrent execution from sequential binary code.

2.2. VLIW

An alternative and explicit approach to concurrency in instruction issue is VLIW, where multiple functional units are used concurrently as specified by a single instruction word. This usually contains a fixed number of operations that are fetched, decoded, issued and executed concurrently. To avoid control or data hazards, VLIW compilers must hoist later independent instructions into the VLIW or, if this is not possible, must explicitly add no-op instructions instead of relying on hardware to stall the instruction issue until the operands are ready. This can cause two problems: firstly, a stall in one instruction will stall the entire width of the instruction; secondly, adding no-op instructions increases the program size. In terms of performance, if the program size is large compared to the I-cache or TLB size, it may result in higher miss rates, which in turn degrades the performance of the processor [22]. It is not possible to identify all possible sources of pipeline stalls and their duration at compile time. For example, suppose a memory access causes a cache miss; this leads to a longer than expected stall. Therefore, in instructions with non-deterministic delay, like a load instruction to a cache hierarchy,
the number of no-op instructions required is not known and most VLIW compilers will schedule load instructions using the cache-hit latency rather than the maximum latency. This means that the processor will stall on every cache miss. The alternative of scheduling all loads with the cache-miss latency is not feasible for most programs, because the maximum latency may not be known due to bus contention or memory port delays, and it also requires considerable ILP. This problem with non-determinism in cache access limits VLIW to cacheless architectures unless speculative solutions are embraced. This is a significant problem with modern technology, where processor speeds are significantly higher than memory speeds [19]. Also, pure VLIW architectures are not good for general-purpose applications, due to their lack of compatibility in binary code [23]. The most significant use of VLIW therefore is in embedded systems, where these constraints are both solved (i.e. single applications using small, fast memories). A number of projects described below have attempted to apply speculation to VLIW in order to solve the scheduling problems and one, the Transmeta Crusoe, has applied dynamic binary code translation to solve the backward compatibility problem.

The Sun MAJC 5200 [24] is a CMP based on four-way issue, VLIW pipelines. This architecture provides a set of predicated instructions to support control speculation. The MAJC architecture attempts to use speculative, thread-level parallelism (TLP) to support the multiple processors. This aggressively executes code in parallel that cannot be fully parallelized by the compiler [25, 26]. It requires new hardware mechanisms to eliminate most squashes due to data dependencies [25]. This method of execution is again speculative and can degrade the processor's performance when the speculation fails. MAJC replicates its shared registers in all pipelines to avoid sharing resources. From the implementation point of view, replicating the registers costs significant power and area [27] and also restricts the scalability. Furthermore, the MAJC compiler must know the instruction latencies before it can create a schedule. As described previously, it is not simple to detect all instructions' latencies due to the variety of the hardware communication overheads.

2.3. EPIC

Intel's explicitly parallel instruction computing (EPIC) architecture is another speculative evolution of VLIW, which also solves the forward (although not backward) code compatibility problem. It does this through the run-time binding of instruction words to execution units. The IA-64 [28] architecture supports binary code compatibility across a range of processor widths by utilizing instruction packets that are not determined by issue width. This means a scheduler is required to select instructions for execution on the available hardware from the current instruction packet. This gives more flexibility as well as supporting binary code compatibility across future generations of implementation. The IA-64 also provides architectural support for control and data speculation through predicated instruction execution and binding pre-fetches of data into cache. In this architecture each operation is guarded by one of the predicate registers, each of which stores 1 bit that determines whether the results of the instruction are required or not. Predication is a form of delayed branch control and this bit is set based on a comparison operation. In effect, instructions are executed speculatively but state update is determined by the predicate bit, so an operation is completed only if the value of its guard bit is true; otherwise the processor invalidates the operation. This is a form of speculation that executes both arms of some branches concurrently, but this action restricts the effective ILP, depending on the density and nesting of branches.

Pre-fetching is achieved in a number of ways: for example, by an instruction identical to a load word instruction that does not perform a load but touches the cache and continues, setting in motion any required transactions and cache misses up the hierarchy. These instructions are hoisted by the compiler up the instruction stream, not just within the same basic block. They can therefore tolerate high latencies in memory, provided the correct loads can be predicted. There are many more explicit controls on caching in the instruction set to attempt to manage the non-deterministic nature of large cache hierarchies. Problems again arise from the speculative nature of the solution. If for some reason the pre-fetch fails, either because of a conflict or insufficient delay between the pre-fetch and the genuine load word instruction, then a software interrupt is triggered, incurring a large delay and overhead.

EPIC compilers face a major problem in constructing a plan of execution: they cannot predict all conditional branches and know which execution path is taken [29]. To some extent this uncertainty is mitigated by predicated execution but, as already indicated, this is wasteful of resources and power and, like all speculative approaches, can cause unpredictability in performance. Although object code compatibility has been solved to some extent, the forward compatibility is only as good as the compiler's ability to generate good schedules in the absence of dynamic information. Also the code size problem is still a challenge facing the EPIC architecture [29].

2.4. Multiscalar

Another paradigm to extract even more ILP from sequential code is the multiscalar architecture. This architecture extends the concept of superscalar processors by splitting one wide processor into multiple superscalar processors. In a superscalar architecture, the program code has no explicit information regarding ILP; only the hardware can be employed to discover the ILP from the program. In multiscalar, the program code is divided into a set of tasks or code fragments, which can be identified statically by a combination of the hardware and software. These tasks are blocks in the control flow graph of the program and are identified by the compiler. The purpose of
this approach is to expose a greater concurrency explicitly by the compiler.

The global control unit used in this architecture distributes the tasks among multiple parallel execution units. Each execution unit can fetch and execute only the instructions belonging to its assigned task. So, when a task misprediction is detected, all execution units between the incorrect speculation point and the later task are squashed [30]. Like superscalar, this can result in many wasted cycles; however, as the depth of speculation is much greater, the unpredictability in performance is correspondingly wider.

The benefit of this architecture over a superscalar architecture is that it provides more scalability. The large instruction window is divided into smaller instruction windows, one per processing unit, and each processing unit searches a smaller instruction window for independent instructions. This mitigates the problems of scaling instruction issue with issue width. The multiple tasks are derived from loops and function calls, allowing the effective size of the instruction window to be extremely large. Note that not all instructions within this wide range are simultaneously being considered for execution [31]. This optimization of the instruction window is offset by a potentially large amount of communication, which may affect the overall system performance.

Communication arises because of dependencies between tasks; examples are loop-carried dependencies, function arguments and results. Results stored to registers that are required by another task are routed from one processor to another at run-time via a unidirectional ring network. Recovery from misspeculation is achieved by additional hardware that maintains two copies of the registers, along with a set of register masks, in each processing unit [32]. In summary then, although the multiscalar approach mitigates the instruction window scaling, allowing wider issue width, in practice it requires many of the same complex mechanisms as superscalar and, being speculative, is unlikely to be able to perform as consistently as a scalable CMP.

2.5. Multi-threading

In order to improve processor performance, modern microprocessors try to exploit TLP through a multi-threading approach, even at the same time as they exploit ILP. Multi-threading is a technique that tolerates delays associated with synchronizing, including synchronizing with remote memory accesses, by switching to a new thread when one thread stalls. Many forms of explicit multi-threading techniques have been described, such as interleaved multi-threading (IMT), blocked multi-threading (BMT) and SMT. A good survey of multi-threading is given in [17].

A number of supercomputers designed by Burton Smith have successfully exploited IMT; these include the Denelcor HEP and the Horizon, and culminated in the Tera architecture [33]. This approach is perhaps the closest to that of microthreading described in this paper, although the processor was designed as a component of a large multi-computer and not as a general-purpose chip. The interleaved approach requires a large concurrency in order to maintain efficient pipeline utilization, as it must be filled with instructions from independent threads. Unlike the earlier approaches, Tera avoids this requirement using something called explicit-dependence lookahead, which uses an instruction tag of 3 bits that specifies how many instructions can be issued from the same stream before encountering a dependency on it. This minimizes the number of threads required to keep the pipeline running efficiently, which is ∼70 in the case of memory accesses. It will be seen that microthreading uses a different approach that maintains full backward compatibility in the ISA, as well as in the pipeline structure.

Unlike IMT, which usually draws concurrency from ILP and loops, BMT usually exploits regular software threads. There have been many BMT proposals, see [17], and even some commercial designs such as Sun's Niagara processor [34]. However, the concurrency exposed in BMT architectures is limited, as resources such as register files must be duplicated to avoid excessive context switching times. This limits the applicability of BMT to certain classes of applications, such as servers.

SMT is probably the most popular and commercial form of multi-threading in use today. In this approach, multiple instructions from multiple threads provide ILP for multiple execution units in an out-of-order pipeline. Several recent architectures have either used or proposed SMT, such as the Hyper-Threading Technology in the Intel Xeon processor [35] and the Alpha 21464 [36]. As already described, the main problem with an SMT processor is that it suffers from the same scalability issues as a superscalar processor, i.e. layout blocks and circuit delays grow faster than linearly with issue width. In addition to this, multiple threads share the same level-1 I-cache, which can cause high cache miss rates, all of which provides limits to its ultimate performance [18].

2.6. Recent CMPs

From the above discussion we see that most current techniques for exploiting concurrency suffer from software and/or hardware difficulties, and the focus of research and development activity now seems to be on CMPs. These designs give a more flexible and scalable approach to instruction issue, freeing them to exploit Moore's law through system-level concurrency. Some applications can exploit such concurrency through the use of multi-threaded applications. Web and other servers are good examples; however, the big problem is how to program CMPs for general-purpose computation and whether performance can ever be achieved from legacy sequential code, either in binary or even source form.

Several recent projects have investigated CMP designs [1, 2, 3, 37]. Typically, the efficiency of a CMP depends on
the degree and characteristics of the parallelism. Executing multiple processes or threads in parallel is the most common way to extract a high level of parallelism, but this requires concurrency in the source code of an application. Previous research has demonstrated that a CMP with four 2-issue processors will reach a higher utilization than an 8-issue superscalar processor [17]. Also, work described in [3] shows that a CMP with eight 2-issue superscalar processors would occupy the same area as a conventional 12-issue superscalar. The use of CMPs is a very powerful technique to obtain more performance in a power-efficient manner [38]. However, using superscalar processors as a basis for CMPs, with their complex issue window, large on-chip memory, large multi-ported register file and speculative execution, is not such a good strategy because of the scaling problems already outlined. It would be more efficient to use simpler in-order processors and exploit more concurrency at the CMP level, provided that this can be utilized by a sufficiently wide range of applications. This is an active area of research in the compiler community and, until this problem is solved, CMPs based on user-level threads will only be used in applications which match this requirement, such as large server applications, where multiple service requests are managed by threads.

3. REGISTER FILES

The register file is a major design obstacle to scalable systems. All systems that implement concurrency require some form of synchronizing memory. In a dataflow architecture, this is the matching store; in an out-of-order issue processor it is the register file, supported by the instruction window, reservation stations and re-order buffer. To implement more concurrency and higher levels of latency tolerance, this synchronizing memory must be increased in size. This would not be a problem except that in centralized architectures, as issue width increases, the number of ports to this synchronizing memory must also increase. The problem is that the cell size grows quadratically with the number of ports or issue width. If N instructions can be issued in one cycle, then a central register file requires 2N read ports and N write ports to handle the worst-case scenario. This means that the register cell size grows quadratically with N. Moreover, as the number of registers also increases with the issue width, a typical scaling of register file area is as the cube of N.

The register file in the proposed 8-way issue Alpha 21464 had a single 512-register file with 24 ports in total. It occupied an area some five times the size of the 64 Kbyte L1 D-cache [39]. Also, in Motorola's M.CORE architecture, the register file energy consumption can be 16% of the total processor power and 42% of the datapath power [40]. It is clear therefore that the multi-ported register files in modern microprocessors consume significant power and die area. Several projects have investigated the register file problem in terms of reducing the number of registers or minimizing the number of read or write ports [14, 15, 41, 42].

Work done in [41] describes a bypass scheme to reduce the number of register file read ports by avoiding unnecessary register file reads for the cases where values are bypassed. In this scheme an extra bypass hint bit is added to each operand of instructions waiting in the issue window and a wake-up mechanism is used to reduce the number of register file read ports. As described in [43], this technique has two main problems. First, the scheme is only a prediction, which can be incorrect, requiring several additional repair cycles for recovery on misprediction. Second, because the bypass hint is not reset on every cycle, the hint is optimistic and can be incorrect if the source instruction has written back to the register file before the dependent instruction is issued. Furthermore, an extra pipeline stage is required to determine whether to read data operands from the bypass network or from the register file.

Other approaches include a delayed write-back scheme [42], where a memory structure is used to delay the write-back results for a few cycles to reduce register file ports. The disadvantage of this scheme is that it is necessary to write the results both to the register file and the write-back queue concurrently to avoid consistency problems during register renaming. The authors propose an extension to this scheme to reduce the number of register write ports. However, this extension suffers from an IPC penalty and it degrades the pipeline performance. Furthermore, in this model, any branch misprediction causes a pipeline stall and insufficient use of the delayed write-back queue. In fact, most previous schemes for minimizing the multi-ported register file have required changes in the pipeline design and do not enable full scalability. At best they provide a constant remission in the scalability of the register file.

Recently Rixner et al. [14] suggested several partitioning schemes for the register file from the perspective of streaming applications, including designs spanning a central register file through to a distributed register file organization. Their results, not surprisingly, show that a centralized register file is costly and scales as O(N³), while in the distributed scheme, each ALU has its own port to connect to the local register file and another port to access other register files via a fast crossbar switch network. This partitioning proved to use less area and power and caused less delay compared with the purely global scheme, and was also shown to provide a scalable solution. The distributed configuration also has a smaller access time compared with the centralized organization. In this work, a 128 × 32-bit register file with 16 read ports and 8 write ports is used for the central register file and is compared to 8 local register files of 32 × 32-bit registers, each with 2 read ports, 1 write port and 1 read/write port for external access. The result from their CACTI model showed a 47.8% reduction in access time for the distributed register file organization across all technologies [44].

It is not clear from this work whether the programming model for the distributed register file is sufficiently general for most computations. With a distributed register file
and explicitly routed network, operations must be scheduled by the compiler and routing information must also be generated with the code for each processor in order to route results from one processor's register file to another. Although it may be possible to program streaming applications using such a model, in general, concurrency and scheduling cannot be defined statically.
Other previous work has described a distributed register file configuration [44], where a fully distributed register file organization is used in a superscalar processor. The architecture exploits a local register mapping table and a dedicated register transfer network to implement this configuration. This architecture requires an extra hardware recopy unit to handle the register file dispatch operations. Also, this architecture suffers from a delay penalty, as the execution unit of an instruction that requires a value from a remote register file must stall until it is available. The authors have proposed an eager transfer mechanism to reduce this penalty, but this still suffers from an IPC penalty and requires both central issue logic and global renaming.
In our research, it seems that only the microthreaded model provides sufficient information to implement a penalty-free distributed register file organization. Such a proposal is given in [6], where each processor in a CMP has its own register file in a shared-register model. Access to remote data is described in the binary code and does not require speculative execution or routing. The decoupling is provided by a synchronization mechanism on registers and the routing is decoupled from the operation of the microthreaded pipeline, exploiting the same latency tolerance mechanisms as used for main memory access. Section 4.5 explains in more detail how the microthreaded CMP distributes data to the register files and hides the latency during a remote register file access.
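To illustrate how synchronization on registers can decouple producer and consumer threads, the following Python sketch models a register file in which each register carries a full/empty bit. The class and method names are our own and the scheme is deliberately simplified; it is not the implementation proposed in [6]. A read of an empty register suspends the reading thread, and the write that eventually fills the register returns the set of threads that can be rescheduled.

```python
# Illustrative sketch only: a register file with a full/empty
# synchronization bit per register. Reading an empty register suspends
# the reader; the write that fills it releases any suspended readers.
class SyncRegisterFile:
    def __init__(self, size):
        self.values = [None] * size                   # register contents
        self.full = [False] * size                    # full/empty bits
        self.waiting = {r: [] for r in range(size)}   # suspended readers

    def read(self, reg, thread):
        """Return the value, or None after suspending `thread` on an empty register."""
        if self.full[reg]:
            return self.values[reg]
        self.waiting[reg].append(thread)              # thread suspends on this register
        return None

    def write(self, reg, value):
        """Fill the register and return the threads woken by this write."""
        self.values[reg] = value
        self.full[reg] = True
        resumed, self.waiting[reg] = self.waiting[reg], []
        return resumed                                # ready to be rescheduled

rf = SyncRegisterFile(4)
assert rf.read(0, "T1") is None       # T1 suspends on the empty register R0
assert rf.write(0, 42) == ["T1"]      # the write to R0 resumes T1
assert rf.read(0, "T2") == 42         # later reads succeed immediately
```

Because suspension is recorded per register rather than in any central structure, this style of synchronization distributes naturally across per-processor register files, which is what makes the shared-register model scalable.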
4. THE MICROTHREADED MODEL

4.1. Overview
In this section we consider the microthreaded concurrency model in more detail and describe the features that support the implementation of a scalable CMP based on it. This model was first described in [45] and was then extended in [6, 19, 46] to support systems with multiple processors on-chip.
Like the Tera, this model combines the advantages of BMT and IMT but does so by explicitly interleaving microthreads on a cycle-by-cycle basis in a conventional pipeline. This is achieved using an explicit context switch instruction, which is acted upon in the first stage of the pipeline. Context switching is performed when the compiler cannot guarantee that data will be available to the current instruction and is used in conjunction with a synchronization mechanism on the register file that suspends the thread until the data becomes available. The context switch control is not strictly necessary, as this can be signalled from the synchronization failure on the register read. However, it significantly increases the efficiency of the pipeline, especially when a large number of thread suspensions occur together, in which case the model resembles that of an IMT architecture. Only when the compiler can define a static schedule are instructions from the same thread scheduled in BMT mode. Exceptions to this are cache misses, iterative operations and inter-thread communications. There is one other situation where the compiler will flag a context switch and that is following any branch instruction. This allows execution to proceed non-speculatively, eliminates the branch prediction and cleanup logic and fills any control-hazard bubbles with instructions from other threads, if any are active.
The model is defined incrementally and can be applied to any RISC or VLIW instruction set. See Table 1 for details of the extensions to an instruction set required to implement the concurrency controls required by this model. The incremental nature of the model allows a minimal backward compatibility, where existing binary code can execute unchanged on the conventional pipeline, although without realizing any of the benefits of the model.
Microthreading defines ILP in two ways. Sets of threads can be specified where those threads generate MIMD concurrency within a basic block. Each thread is defined by a pointer to its first instruction and is terminated by one or more Kill instructions, depending on whether it branches or not. Sets of threads provide concurrency on one pipeline and share registers. They provide latency tolerance through explicit context switching for data and control hazards. Iterators, on the other hand, define SPMD concurrency by exploiting a variety of loop structures, including for and while loops. Iterators give parametric concurrency by executing iterations in parallel subject to dataflow constraints. Independent loops have no loop-carried dependencies and can execute with minimal overhead on multiple processors. Dependent loops can also execute on multiple processors, exploiting instruction level concurrency, but during the execution of dependency chains activity will move from one processor to another and speedup will not be linear. Ideally, dependency chains should execute with minimal latency, and parameters for the create instruction in Table 1 allow dependencies to be bypassed on iterations executed on a single processor, giving the minimal latency possible, i.e. one pipeline cycle per link in the chain.
Iterators share code between iterations and use a set of threads to define the loop body. This means that some form of context must be provided to differentiate multiple iterations executing concurrently. This is achieved by allocating registers to iterations dynamically. A family of threads, then, is defined by an iterator comprising a triple of start, step and limit over a set of threads. Information is also required that defines the micro-context associated with an iteration and, as each iteration is created, registers for its micro-context are allocated dynamically. To create a family of threads a single instruction is executed on one processor, which points to a thread control block (TCB) containing the above parameters. Iterations can
TABLE 1. Concurrency-control instructions.

Instruction   Instruction behaviour
Cre           Creates a new family of threads
Swch          Causes a context switch to occur
Kill          Terminates the thread being executed
Bsync         Waits for all other threads to terminate
Brk           Terminates all other threads
then be scheduled on one or more processors as required to achieve the desired performance.
Virtual concurrency on a single pipeline defines the latency that can be tolerated and is limited by the size of the local register file or continuation queue in the scheduler. The latter holds the minimal state associated with each thread. Both are related by two characteristics of the code: the number of registers per micro-context and the cardinality of the set of threads defining the loop body. In this model, all threads are drawn from the same context and the only state manipulated in the architecture is the thread's execution state, its PC and some information about the location of its micro-context in the local register file. This mechanism removes any need to swap register values on a context switch.
Theoretically, physical concurrency is limited only by the silicon available to implement a CMP, as all structures supporting this model are scalable and are related to the amount of the virtual concurrency required for latency tolerance, i.e. register file, continuation queue and register allocation logic. Practically, physical concurrency will be limited by the extent of the loops that the compiler can generate, whether they are independent or contain loop-carried dependencies and, ultimately, the overheads in distribution and synchronization that frame the SPMD execution. Note that thread creation proceeds in two stages. A conceptual schedule is determined algorithmically on each processor following the creation of a family of microthreads, but the actual thread creation, i.e. the creation of entries in the continuation queue, occurs over a period of time at the rate of one thread per cycle, keeping up with the maximum context-switch rate. This continues while resources are available.
Figure 1 shows a simple microthreaded pipeline with five stages and the required shared components used in this model. Notice that no additional stages are required for instruction issue, retiring instructions, or routing data between processors' register files. Short pipelines provide low latency for global operations. Note that more pipeline stages could be used to reduce the clock period, as is the current trend in microprocessor architecture. However, as concurrency provides the most power-efficient solution to performance, it is moot whether this is a sound strategy. Two processors can give the same performance as one double-speed processor but do so with less power dissipated. Dynamic power dissipation is proportional to frequency f and V², but V can be reduced with f, giving a quadratic reduction in the energy required by exploiting concurrency in a computation, at least over some range of voltage scaling.
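The power argument above can be illustrated numerically. The sketch below assumes, for simplicity, that supply voltage scales linearly with frequency inside the permissible voltage-scaling range; the function names and the unit constants are illustrative, not taken from the paper.

```python
# Sketch: dynamic power P ~ C * f * V^2, with V assumed (for
# illustration) to scale linearly with f within the scaling range.
def dynamic_power(f, v, c=1.0):
    """Relative dynamic power dissipation."""
    return c * f * v * v

# One processor at frequency 2f needs voltage 2v under the
# linear-scaling assumption.
single = dynamic_power(2.0, 2.0)    # 8.0 units
# Two processors at frequency f and voltage v deliver the same
# aggregate throughput.
dual = 2 * dynamic_power(1.0, 1.0)  # 2.0 units
print(single / dual)                # 4.0: the quadratic advantage
```

The 4x ratio is the "quadratic reduction" referred to in the text: halving f allows V to halve, so each processor dissipates one eighth of the power and two of them together one quarter.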
The set of shared components used in this model to support the microthreaded CMP is minimal and their use is infrequent. These components are the broadcast bus, used to create a family of threads, and a ring network for register sharing. In a Cre instruction a pointer to the TCB is distributed to each processor, where the scheduler will use this information to determine the subset of the iterations it will execute. The broadcast bus is also used to replicate global state to each processor's local register file instead of accessing a centralized register file. Both operations are low in frequency and can be amortized over the execution of multiple iterations.
Replication of global variables is one of two mechanisms that allow the register file in the microthreaded model to be fully distributed between the multiple processors on a chip. The other is the shared-register ring network, which allows communications between pairs of threads to implement loop-carried dependencies. This communication between the shared and dependent threads is described in more detail below, but uses the ring network only if two iterations are allocated to different processors. Note that schedules can be defined to minimize inter-processor communication; more importantly, this communication is totally decoupled from the pipeline's operation through the use of explicit context switching.
Both types of global communication, i.e. the broadcast bus and the ring network, are able to use asynchronous communications, creating independent clocking domains for each processor. Indeed, the broadcast mechanisms could be implemented at some level by eager propagation within the ring network.

4.2. Concurrency controls
The microthreaded model is a generic one, as it can be applied to any ISA, so long as its instructions are executed in-order. In addition, the model can be designed to maintain full backward compatibility, allowing existing binary code to run without speedup [6] on a microthreaded pipeline. Binary compatibility with speedup can also be obtained using binary-to-binary translation to identify loops and dependencies, adding instructions to support the concurrent execution of those loops and/or the concurrency within the basic blocks.
Table 1 shows the five instructions required to support this model on an existing ISA. The instructions create a family of threads, explicitly context switch between threads and kill a thread, and two instructions provide for global synchronization. One is a barrier synchronization, the other a form of break instruction, which forces a break from a loop executed concurrently. These concurrency controls provide an efficient and flexible method to extract high levels of ILP from existing code. Each instruction will now be described in more detail.
FIGURE 1. Microthreaded microprocessor pipeline.
TABLE 2. TCB containing parameters that describe a family of microthreads.

Threads        Cardinality of the set of threads representing an iteration
Dependency     Iteration offset for any loop-carried dependencies, e.g. a[i] := ... a[i-d]
Preambles      Number of iterations using preamble code
Postambles     Number of iterations using postamble code
Start          Start of loop index value
Limit          Limit of loop index value
Step           Step between loop indices
Locals         Number of local registers dynamically allocated per iteration
Shareds        Number of shared registers dynamically allocated per iteration
Pre-pointer    One pointer per thread in set for preamble code
Main-pointer   One pointer per thread in set for main loop-body code
Post-pointer   One pointer per thread in set for postamble code
4.2.1. Thread creation
The microthreaded model defines explicit and parametric concurrency using the Cre instruction. This instruction broadcasts a pointer to the TCB to all processors assigned to the current context; see [47] for details of dynamic processor allocation. The TCB contains parameters which define a family of threads, i.e. the set of threads representing the loop body and the triple defining the loop. It also defines the dynamic resources required by each thread (its micro-context) in terms of local and shared registers. For loops which carry a dependency, it also defines a dependency distance and, optionally, pointers to a preamble and/or postamble thread, which can be used to set up and/or terminate a dependency chain. The dependency distance is a constant offset in the index space which defines regular loop-carried dependencies. A family of threads can be created without requiring a pipeline slot, as the create instruction is executed concurrently with a regular instruction in the IF stage of the pipeline. The TCB for our current work on implementation overheads is defined in Table 2.
A global scheduling algorithm determines which iterations will execute on which processors. This algorithm is built into the local scheduler, but the parameters in the TCB and the number of processors used to execute the family may be dynamic. The concurrency described by this instruction is therefore parametric and may exceed the resources available in terms of registers and thread slots in the continuation queues. The register allocation unit in each local scheduler maintains the allocation state of all registers in each register file and this controls the creation of threads at a rate of one per pipeline cycle. Once allocated to a processor, a thread runs to completion, i.e. until it encounters a Kill instruction and then terminates.
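As a concrete reading of Table 2 and the global scheduling algorithm, the sketch below encodes the TCB fields and one possible schedule. The field names and the modulo distribution are illustrative assumptions; the paper leaves the schedule algorithmic and implementation-defined.

```python
from dataclasses import dataclass

# Hypothetical encoding of the TCB fields of Table 2; the names are
# illustrative, not the paper's binary layout.
@dataclass
class TCB:
    start: int        # start of loop index value
    limit: int        # limit of loop index value
    step: int         # step between loop indices
    dependency: int   # iteration offset of any loop-carried dependency
    locals: int       # local registers allocated per iteration
    shareds: int      # shared registers allocated per iteration

def local_schedule(tcb, proc_id, n_procs):
    """One possible global schedule: iteration k runs on processor
    k mod n_procs, so neighbouring iterations sit on neighbouring
    processors and constant-stride dependencies can use only
    nearest-neighbour links on the ring network."""
    iterations = range(tcb.start, tcb.limit, tcb.step)
    return [i for k, i in enumerate(iterations) if k % n_procs == proc_id]

tcb = TCB(start=0, limit=12, step=1, dependency=1, locals=4, shareds=2)
print(local_schedule(tcb, 0, 4))   # [0, 4, 8]
print(local_schedule(tcb, 1, 4))   # [1, 5, 9]
```

Each scheduler evaluates the same function with its own processor id, so no central dispatcher is needed; the family's concurrency remains parametric in both the TCB triple and the number of processors.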
A terminated thread releases its resources so long as any dependent thread has also terminated. To do so before this may destroy data that has not yet been read by the dependent thread. Note that microthreads are usually (but not exclusively) very short sequences of instructions without internal loops.

4.2.2. Context switching

The microthreaded context switching mechanism is achieved using the Swch instruction, which is acted upon in the first stage of the pipeline, giving a cycle-by-cycle interleaving if
necessary. When a Swch instruction is executed, the IF stage reads the next instruction from another ready thread, whose state is passed to the IF stage as a result of the context switch. As this action only requires the IF stage of the pipeline, it can be performed concurrently with an instruction from the base ISA, so long as the Swch instruction is pre-fetched with it. The context switching mechanism is used to manage both control and data dependencies. It is used to eliminate control dependencies by context switching following every transfer of control, in order to keep the pipeline full without any branch prediction. This has the advantage that no instruction is executed speculatively and consequently, power is neither dissipated in making a prediction nor in executing instructions on the wrong dynamic path. Context switching also eliminates bubbles in the pipeline on data dependencies that have non-deterministic timing, such as loads from memory or thread-to-thread communication. Context switching provides an arbitrarily large tolerance to latency, determined by the size of the local register file.

4.2.3. Thread synchronization

The only synchronizing memory in the microthreaded model is provided by the registers and this gives an efficient and scalable mechanism for synchronizing data dependencies. The synchronization is performed using two synchronization bits associated with every register, which differentiate between the following states: full, empty, waiting-local and waiting-remote. Registers are allocated to micro-contexts in the empty state and a read to an empty register will fail, resulting in a reference to the microthread that issued the instruction being stored in that register. This reference passes down the pipeline with each instruction executed. Using the continuation queue in the scheduler, lists of continuations may be suspended on a register, which is required when multiple threads are dependent on the value to be stored there. All registers therefore implement I-structures in a microthreaded microprocessor. In the full state, registers operate normally, providing data upon a register read and, if no synchronization is required, a register can be repeatedly written to without changing its synchronization state, to provide backward compatibility. The compiler can easily recognize the potential for a synchronization failure if a schedule for the dependency is not known at compile time. If so, it inserts a context switch on the dependent instruction. Examples include instructions dependent on a prior load word, on data produced in another thread, or on data produced in iterative CPU operations.

The register is set to one of the waiting states when it holds a continuation. Two kinds of continuation are distinguished: waiting-local, when the register holds the head of a list of continuations to local microthreads, and waiting-remote, when the register holds a remote request for data from another processor. The latter enables the micro-context for one iteration to be stored for read-only access on a remote processor when managing loop-carried dependencies. This implements a scalable and distributed shared-register model between processors without using a single, multi-ported register file, which is known to be unscalable.

The use of dataflow synchronization between threads enables a policy of conservative instruction execution to be applied. When no microthreads are active because all are waiting on external events, such as load-word requests, the pipeline will stall and, if the pipe is flushed completely, the scheduler will stop clocks and power down the processor, going into a standby mode in which it consumes minimal power. This is a major advantage of data-driven models. Conservative instruction execution policies conserve power, in contrast to the eager policies used in out-of-order issue pipelines, which have no mechanisms to recognize such a conjunction of schedules. This will have a major impact on power conservation and efficiency.

Context switching and successful synchronization have no overhead in terms of additional pipeline cycles. The context switch interleaves threads in the first stage of the pipeline, if necessary on a cycle-by-cycle basis. Synchronization occurs at the register-read stage and only if it fails will any exceptional action be triggered. On a synchronization failure, control for the instruction is mutated to store a reference to the microthread in the register being read. This means that the only overhead in supporting these explicit concurrency controls is the additional cycle required to reissue the failed instruction when the suspended thread is reactivated by the arrival of the data. Of course there are overheads in hardware, but this is true for any model.

The model also provides a barrier synchronization (Bsync) instruction, which suspends the issuing thread until all other threads have completed, and a Brk instruction, which explicitly kills all other threads, leaving only the main thread. These instructions are required to provide bulk synchronization for memory consistency. There is no synchronization on main memory; only the registers are synchronizing. This means that two different microthreads in the same family may not read after write to the same location in memory, because the ordering of those operations cannot be guaranteed. It also means that any loop-carried dependencies must be compiled to use register variables. A partitioning of the micro-context supports this mechanism efficiently.

We do not address the design of the shared memory system in this paper, although we are currently investigating several approaches. Indeed, the latency tolerance provided by this model makes the design of the memory system somewhat flexible. For example, a large, banked, multi-ported memory would give a solution that would provide all the buffering required for the large number of concurrent requests generated by this model. It is important to note that, using in-order processors and a block-based memory consistency model, memory ordering does not pose the same problem as it does in an out-of-order processor.
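The I-structure behaviour of a synchronizing register described in Section 4.2.3 can be sketched as follows. This is a simplified software model: it collapses the waiting-local and waiting-remote cases into a single waiting state with a continuation list, and the class and method names are illustrative.

```python
# Sketch of a synchronizing register: a read of an empty register
# suspends the reader's continuation in the register itself; the
# matching write reactivates every suspended reader.
EMPTY, FULL, WAITING = "empty", "full", "waiting"

class SyncRegister:
    def __init__(self):
        self.state = EMPTY       # registers are allocated empty
        self.value = None
        self.continuations = []  # suspended microthread references

    def read(self, thread_ref):
        if self.state == FULL:
            return self.value                  # normal register read
        self.continuations.append(thread_ref)  # synchronization failure:
        self.state = WAITING                   # the reader suspends here
        return None

    def write(self, value):
        woken = self.continuations
        self.state, self.value, self.continuations = FULL, value, []
        return woken                           # threads to reschedule

r = SyncRegister()
assert r.read("thread-3") is None     # read fails, thread-3 suspends
assert r.write(42) == ["thread-3"]    # the write wakes the waiting thread
assert r.read("thread-7") == 42       # subsequent reads succeed normally
```

In hardware the same behaviour is driven by the two synchronization bits per register and a small state machine on each register port, with the continuation list threaded through the scheduler's continuation queue rather than stored in the register itself.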
4.2.4. Thread termination
Thread termination in the microthreaded model is achieved through a Kill instruction, which of course causes a context switch as well as updating the microthread's state to killed. The resources of the killed thread are released at this stage, unless there is another thread dependent upon it, in which case its resources will not be released until the dependent thread has also been killed. (Note that this is the most conservative policy and more efficient policies may be implemented that detect when all loop-carried dependencies have been satisfied.)

4.3. Scalable instruction issue
Current microprocessors attempt to extract high levels of ILP by issuing independent instructions out of sequence. They do this most successfully by predicting loop branches and unrolling multiple iterations of a loop within the instruction window. The problem with this approach has already been described; a large instruction window is required in order to find sufficient independent instructions and the logic associated with it grows at least with the square of the issue width.
If we compare this with what is happening in the microthreaded model, we see that almost exactly the same mechanism is being used to extract ILP, with one major difference: a microthreaded microprocessor executes fragments of the sequential program out of order. These fragments (the microthreads) are identified at compile time from loop bodies and conventional ILP and may execute in any order, subject only to dataflow constraints. Instructions within fragments, however, issue and complete in-order. We have already seen that a context switch suspends a fragment at instructions whose operands have non-deterministic timing. The dependent instruction is issued and stores a pointer to its fragment if a register operand is found to be empty. Any suspended fragments are rescheduled when data is written to the waiting register. Thus only instructions up to the first dependency in each fragment (loop body) are issued and only that instruction will be waiting for the dependency to be resolved; all subsequent instructions in that pipeline will come from other fragments. In an out-of-order issue model the instruction window is filled with all instructions from each loop unrolled by branch prediction, because it knows nothing a priori about the instruction schedules.
Consider a computation that only ever contains one independent instruction per loop of l instructions; then to get n-way issue, n loops must be unrolled and the instruction window will contain n*l instructions for each n instructions issued. In comparison, the microthreaded model would issue the first n independent instructions from n threads (iterations), then it would issue the first dependent instructions from the same n threads before context switching. The next n instructions would then come from the next n iterations (threads). Synchronization, instead of taking place in a global structure with O(n²) complexity, is distributed to n registers and has linear complexity. Each thread waits for
the dependency to be resolved before being able to issue any new instructions. In effect, the instruction window in a microthreaded model is distributed to the whole of the architectural register set and only one link in the dependency graph for each fragment of code is ever exposed simultaneously. Moreover, no speculation is ever required and consequently, if the schedules are such that all processors would become inactive, then this state can be recognized and used to power down the processors to conserve energy.
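The window arithmetic in the comparison above can be made explicit. The sketch below restates the n*l versus n counts; the function names are illustrative.

```python
# Sketch of the instruction-window arithmetic above: a loop body of
# l instructions containing one independent instruction per iteration.
def ooo_window(n, l):
    """Out-of-order issue: n unrolled iterations of l instructions
    must all sit in the window to expose n independent instructions,
    with wake-up/select logic growing at least as O(n^2)."""
    return n * l

def microthreaded_exposed(n):
    """Microthreading: at most one suspended instruction per thread,
    one exposed link per dependency chain; synchronization state is
    distributed over n registers and grows linearly."""
    return n

n, l = 8, 16
print(ooo_window(n, l))          # 128 window slots
print(microthreaded_exposed(n))  # 8 suspended instructions
```

For an 8-wide machine and a 16-instruction loop body the out-of-order window must hold 128 instructions to find the same 8-way independence that microthreading exposes with 8 suspended instructions spread across the register file.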
Compare this to the execution in an out-of-order processor, where instructions are executed speculatively regardless of whether they are on the correct execution path. Although predictions are generally accurate in determining the execution path in loops, if the code within a loop contains unpredictable, data-dependent branches, this can result in a lot of energy being consumed for no useful work. Researchers now talk about 'breaking the dependency barrier' using data in addition to control speculation, but what does this mean? Indices can be predicted readily, but these are not true dependencies and do not constrain the microthreaded model; addresses based on those indices can also be predicted with a reasonable amount of accuracy, but again these do not constrain the microthreaded model. This leaves true computational data dependencies, which can only be predicted under very extraordinary circumstances. It seems therefore that there is no justification for consuming power in attempting data speculation.
Out-of-order issue has no global knowledge of concurrency or synchronization. Microthreading, on the other hand, is able to execute conservatively as it does have that global knowledge. Real dependencies are flagged by context switching, concurrency is exposed by dynamically executing parametric Cre instructions and the namespace for synchronization spans an entire loop, as registers are allocated dynamically. At any instant the physical namespace is determined by the registers that have been allocated to threads.

4.4. Thread state
When a thread is assigned resources by the scheduler, it is initially set to the waiting state, as it must wait for its code to be loaded into the I-cache before it can be considered active. A thread will go into a suspended state when it has been context switched, until either the register synchronization has been completed or the branch target has been defined, when it again goes into the waiting state. The scheduler generates a request to the I-cache to pre-fetch the required code for any thread that enters the waiting state. If the required code is available, then the I-cache acknowledges the scheduler immediately, otherwise not until the required code is in the cache. The thread's state becomes ready at this stage. A killed state is also required to indicate those threads that have completed but whose data may still be in use. At any time there is just one thread per processor which is in the running state; on start-up this will be the main thread.
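The thread life cycle above can be summarized as a transition table. The state names are the paper's; the event names and the initial "created" pseudo-state are illustrative assumptions used to make the table self-contained.

```python
# Sketch of the thread life cycle of Section 4.4 as a transition
# table; event names are illustrative.
TRANSITIONS = {
    ("created",   "allocate"):       "waiting",    # awaits I-cache pre-fetch
    ("waiting",   "icache_ack"):     "ready",      # code is in the cache
    ("ready",     "dispatch"):       "running",    # one running thread/processor
    ("running",   "context_switch"): "suspended",  # awaits sync or branch target
    ("suspended", "data_arrived"):   "waiting",    # re-fetch before active again
    ("running",   "kill"):           "killed",     # data may still be in use
}

def step(state, event):
    return TRANSITIONS[(state, event)]

s = "created"
for e in ["allocate", "icache_ack", "dispatch", "context_switch", "data_arrived"]:
    s = step(s, e)
print(s)  # waiting
```

Note that a suspended thread returns to waiting rather than ready, because its code must be re-fetched (or re-acknowledged by the I-cache) before it can be dispatched again.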
On a context switch or kill, the instruction fetch stage is provided with the state of a new thread if any are active; otherwise the pipeline stalls for a few cycles to resolve the synchronization and, if it fails, the pipeline simply stops. This action is simple, requires no additional flush or cleanup logic and, most importantly, is conservative in its use of power. Note that, by definition, when no local threads are active, the synchronization event has to be asynchronous and hence does not require any local clocks.

The state of a thread also includes its program counter, the base address of its micro-context and the base address and location of any micro-contexts it is dependent upon. The state also includes an implicit slot number, which is the address of the entry in the continuation queue and which uniquely identifies the thread on a given processor. The last field required is a link field, which holds a slot number for building linked lists of threads to identify empty slots, ready queues and an arbitrary number of continuation queues that support multiple continuations on different registers. The slot reference is used as a tag to the I-cache and is also passed through the pipeline and stored in the relevant operand register if a register read fails, where it forms the head of that continuation queue.

4.5. Register file partitioning and distribution

We have already seen that Rixner et al. [14] have shown that a distributed register file architecture achieves a better performance compared with a global solution and also provides superior scaling properties. Their work was based on streaming applications, where register sources and destinations are compiled statically. We will show that such a distributed organization can also be based on extensions to a general-purpose ISA with dynamic scheduling. The concept of a dynamic micro-context associated with parallelizing different iterations has already been introduced and is required in order to manage communications between micro-contexts in a scalable manner. It is necessary for the compiler to partition the micro-context into different windows representing different types of communication and for the hardware to recognize these windows, to trap a register read to an appropriate distributed communication mechanism.

A microthreaded compiler must recognize and identify four different types of communication patterns. There are a number of ways in which this partitioning can be encoded, and here we describe a simple and efficient scheme that supports a fully distributed register file based on a conventional RISC ISA, assuming a 5-bit register specifier and hence a 32-register address space per microthread (although not the same 32 registers for each thread).

The first register window is the global window (represented by $Gi). These registers are used to store loop invariants or any other data that is shared by all threads. In other models of concurrency these would represent broadcast data, which are written by one and read by many processes. Their access patterns have the characteristic that they are written to infrequently but read from frequently. The address space in a conventional RISC ISA is partitioned so that the lower 16 registers form this global window. These are statically allocated for a given context and every thread can read and/or write to them. Note that the main thread has 32 statically allocated registers, 16 of which are visible to all microthreads as globals and 16 of which are visible only to the main thread. Each thread sees 32 registers. The lower 16 of these are the globals, which are shared by all threads, and those in the upper half are local to a given thread.

The upper 16 registers are used to address the micro-context of each iteration in a family of threads. As each iteration shares common code, the address of each micro-context in the register file must be unique to that iteration. As we have seen, the base address of a thread's micro-context forms a part of its state. This immediately gives a means of implementing a distributed, shared-register model. We need to know the processor on which a thread is running and the base address of its micro-context in order to share its data. However, we can further partition the micro-context into a local part and a shared part to avoid too much additional complexity in implementing the pipeline. Three register windows are mapped to the upper, or dynamic, half of the address space for each micro-context. These are the local window ($Li), the shared window ($Si) and the dependent window ($Di). Thus the sum of the sizes of these three windows must be 16. The local window stores values that are local to a given thread, for example values from indexed arrays used only in a single iteration. Reads and writes to the local window are all local to the processor a thread is running on and no distribution of the L window is therefore required. The S and D register windows provide the means of sharing a part of a micro-context between threads. The S window is written by one thread and is read by another thread using its D window.

It should be noted that many different models can be supported by this basic mechanism. In this paper a simple model is described, but different models of communication, with different constraints and solutions to resource deadlock, can be implemented. The mechanism would even support a hierarchy of micro-contexts, by allowing an iteration in one family of threads to create a subordinate family, where the dynamic part of the address space in the creating family became the static part in the subordinate family. This would support nested multi-dimensional loops as well as breadth-first recursion. There are difficulties, however, in resolving resource deadlock problems in all but the simplest models and these require further research to resolve.

In this paper we describe a simple model that supports a single level of loop, with communication being allowed only between iterations that differ by a create-time constant. An example of this type of communication can be found in loop-carried dependencies, where one iteration produces a value which is used by another iteration. For
example, A[i] := ... A[i-k] ..., where k is invariant of the loop. Such dependencies normally act as a deterrent to loop vectorization or parallelization, but this is not so in this model, as the independent instructions in each loop can execute concurrently. This is the same ILP as is extracted from an out-of-order model.
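The window partitioning of Section 4.5 amounts to decoding a 5-bit register specifier into a window and an offset. The sketch below shows one way this decode could work; the particular split of the dynamic half into L, S and D windows (10, 4 and 2 registers here) is an illustrative choice, set per family at create time, subject only to the constraint that the three sizes sum to 16.

```python
# Sketch: decoding a 5-bit register specifier into (window, offset).
# Specifiers 0-15 name the statically allocated globals ($G); the
# upper 16 are split into the L, S and D windows, whose sizes must
# sum to 16. The (10, 4, 2) split is illustrative only.
def decode(spec, n_locals=10, n_shareds=4, n_dependents=2):
    assert 0 <= spec < 32 and n_locals + n_shareds + n_dependents == 16
    if spec < 16:
        return ("G", spec)                    # static, shared by all threads
    dyn = spec - 16                           # dynamic half of address space
    if dyn < n_locals:
        return ("L", dyn)                     # private to this iteration
    if dyn < n_locals + n_shareds:
        return ("S", dyn - n_locals)          # written for a consumer thread
    return ("D", dyn - n_locals - n_shareds)  # read-only view of producer's S

print(decode(3))    # ('G', 3)
print(decode(18))   # ('L', 2)
print(decode(27))   # ('S', 1)
print(decode(31))   # ('D', 1)
```

The hardware adds the micro-context base address held in the thread's state to the dynamic offset (base-displacement addressing), so the same binary code addresses a different physical micro-context in each iteration.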
Consider now the implementation of this basic model. It is straightforward to distribute the global register window and its characteristics suggest a broadcast bus as an appropriate implementation. This requires that all processors executing a family of microthreads be defined prior to any loop invariants being written (or re-written) to the global window. The hardware then traps any writes to the global window and replicates the values, using the broadcast bus, to the corresponding location in all processors' global windows. As multiple threads may read the values written to the global register window, registers must support arbitrarily large continuation queues, bounded above only by the number of threads that can be active at any time on one processor.
The write to the global window can be from any processor and thus can be used to return a value from an iteration to the global state of a context. The write is also asynchronous and independent of pipeline operation, provided there is local buffering for the data in the event of a conflict on the bus. Contention for this bus should not occur regularly, as writes to globals are generally much less frequent than reads (by a factor proportional to the concurrency of the code). This is analysed later in the paper.
The distribution of the S and D windows is a little more complex than the global window. Normally, a producer thread writes to its S window and the consumer reads from its D window, which maps in some sense onto the S window of the producer; we will return to this later. However, there is no restriction on a thread reading a register from its S window so long as data has already been written to it (it would deadlock otherwise). There is also no physical restriction on multiple writes to the S window, although this may introduce non-determinism if a synchronization is pending on it. As far as the hardware is concerned, therefore, the S window is identical to the L window, as all reads and writes to it are local and are mapped to the dynamic half of the register-address space. On the other hand, a thread may never write to its D window, which is strictly read-only. The hardware need only recognize reads to the D window in order to implement sharing between two different threads. In order to perform a read from a D window, a processor needs the location (processor id) and base address of the S window of the producer thread. There are two cases to consider in supporting the distribution of register files in the base-level model we have described.
The first and easiest case is when the consumer iteration is scheduled to the same processor as the producer. In this case a read to the D window can be implemented as a normal pipeline read by mapping the D window of the consumer micro-context onto the S window of the producer micro-context. The
thread’sstatemustthereforecontainthebaseaddressofitsownmicro-contextforlocalreadsandalsothebaseaddressofanymicro-contextitisdependentupon.Inthebase-levelmodelwepresent,onlyoneothermicro-contextisaccessed,ataconstantoffsetintheindexspace.
In the second case, the producer and consumer iterations are scheduled to different processors. Now, the consumer's read to the D window will generate a remote request to the processor on which the producer iteration is running. Whereas in the first case a micro-context's D window is not physically allocated, in this second case it must be. It is used to cache a local copy of the remote micro-context's S window. It is also used to store the thread continuation locally. The communication is again asynchronous and independent of the pipeline operation. The consumer thread is suspended on its read to the D window location until the data arrives from the remote processor. For this constant-strided communication, iteration schedules exist that require only nearest-neighbour communication in a ring network to implement the distributed shared-register scheme. Note that a request from the consumer thread may find an empty register, in which case the request gets suspended in the producer's S window until the required data has been produced. Thus a shared-register transaction may involve two continuations: a thread suspended in the D window of the consumer (waiting-local) and a remote request suspended in the S window of the producer (waiting-remote). As these states are mutually exclusive, the compiler must ensure that the producer thread does not suspend on one of its own S-window locations. This can happen if a load from memory to an S location is also used in the local thread. However, as dependencies are passed via register variables, this can only happen in the initialization of a dependency chain. This case can be avoided by loading to a location in the L window when the value is required locally and then copying it to the S window with a deterministic schedule. The additional complexity required in this distributed register file implementation is 2 bits in each register to encode the four synchronization states (full, empty, waiting-local and waiting-remote); a small amount of additional logic to address the dynamically allocated registers using base-displacement addressing; and a simple state machine on each register port to implement the required action based on the synchronization state.
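The two cases for a D-window read can be sketched as a simple decision. Everything here is an illustrative rendering of the mechanism just described, not a hardware specification: same-processor reads map directly onto the producer's S window, while cross-processor reads become asynchronous ring-network requests.

```python
# Sketch of the two cases for a D-window read. Names and the tuple
# encodings are illustrative only.
def d_window_read(consumer_proc, producer_proc, producer_s_base, offset):
    if consumer_proc == producer_proc:
        # Case 1: a normal pipeline read, the consumer's D window
        # mapped onto the producer's S window in the same file.
        return ("local_read", producer_s_base + offset)
    # Case 2: an asynchronous remote request over the ring network.
    # The consumer thread suspends in its own D window (waiting-local);
    # if the producer's S register is still empty, the request itself
    # suspends there (waiting-remote) until the data is produced.
    return ("ring_request", producer_proc, producer_s_base + offset)

print(d_window_read(2, 2, 40, 1))  # ('local_read', 41)
print(d_window_read(2, 3, 40, 1))  # ('ring_request', 3, 41)
```

Because the remote case is decoupled from the pipeline by the context switch on the dependent instruction, the ring transfer proceeds at its own pace without stalling the consumer's processor.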
A method has now been described to distribute all classes of communication required in the base-level model. However, we must ensure that this distribution does not require us to implement register files locally that are not scalable. This requires the number of local ports in the register file to be constant. Accesses to the L, S and local D windows require at most two read and one write port for a single-issue pipeline. The G window requires an additional write port independent of the pipeline ports. Finally, reads to a remote D window require one read port and one write port per processor. Contention for this port will depend on the pattern of dependencies, which for the model described is regular and hence evenly distributed with
The Computer Journal, Vol. 49 No. 2, 2006
appropriate scheduling. Each iteration is allocated a separate micro-context in the dynamic half of the register-address space and the first local register ($L0) is initialized by the scheduler to the loop index for that iteration, so this also requires a write port. Finally, a write port is required to support decoupled access to the memory on a cache miss, in order to avoid a bubble in the pipeline when data becomes available.

Figure 2 is a block diagram of the microthreaded register file illustrating these ports. As shown, it has a maximum of eight local ports per processor. The register file could be implemented with just three ports by stalling the pipeline whenever one of the asynchronous reads or writes occurs, but this would degrade its performance significantly. An analysis of the accesses to these ports is given in Section 5 below, where we attempt to reduce these eight ports to five using contention, i.e. the three pipeline ports and one additional read and write port for all other cases.

FIGURE 2. Register-file ports analysed: the pipeline read and write ports, the initialization port, the global-write port, the shared-dependent ports and the cache-miss write port.

4.5.1. Globally asynchronous locally synchronous communication

Modern synchronous CMP architectures are based on a single clock domain with global synchronization and control signals. The control-signal distribution must be very carefully designed in order to meet the operation rate of each component used, and the larger the chip, the more power is required to distribute these signals. In fact, clock skew, and the large power consumption required to eliminate it, is one of the most significant problems in modern synchronous processor design.

Fully asynchronous design is difficult, but one promising technique is to use a Globally-Asynchronous, Locally-Synchronous (GALS) clocking scheme [48]. This approach promises to eliminate the global clocking problem and provides a significant power reduction over globally synchronous designs. It divides the system into multiple independent domains, which are independently clocked but which communicate in an asynchronous manner. A GALS system not only mitigates the clock-distribution problem, the problem of clock skew and the resulting power consumption, it can also simplify the reuse of modules, as they have asynchronous interfaces that do not require redesign for timing issues when composed [49]. In CMP design, global communication is one of the most significant problems in both current and future systems [6], yet not every system can be decomposed easily into asynchronously communicating synchronous blocks; there must be a clear decoupling of local and remote activity. To achieve this, the local activity should not be overly dependent on a remote communication. The model we have described has just this property; each processor is independent and, when it does need to communicate with other processors, that communication occurs independently without limiting the local activity. In short, the local processor has tolerance to any latency involved in global communication, as in most circumstances it will have many other independent instructions it can process and, if this is not the case, it will simply switch off its clocks, reduce its voltage levels and wait until it has work to accomplish, dissipating minimal power. The size of the synchronous block in a microthreaded CMP can be from a single processor upwards. The size of this block is a low-level design decision. The issue is that, as technology continues to scale, this block size will scale down with the problems of signal propagation. Thus the model provides solutions to the end of scaling in silicon CMOS. Compare this with the current approach, which seeks to gain performance by clock speed in a single large wide-issue processor, where all strategies are working against the technology.

4.6. Microthreaded CMP architecture

FIGURE 3. Microthreaded CMP architecture, showing communication structures and clocking domains.

A block diagram of a microthreaded CMP is shown in Figure 3. As shown, N microthreaded pipelines are connected to two shared communications systems. The first is a broadcast bus, for which there must be contention, although in practice the access to this bus is at a low frequency. It is used by one processor to create a family of threads; it will probably be used by the same processor to distribute any loop invariants and finally, if there is a scalar result, one processor may write values back to global locations. This situation occurs when searching an iteration space (the only situation where contention might be required), as a number of processors might find a solution simultaneously and attempt to write to the bus. In
this case a break instruction acquires the bus and terminates all other threads, allowing the winner to write its results back to the global state of the main context.

The second communication system is the shared-register ring network, which is used by processors to communicate results along a dependency chain. For the model described, this requires only local connectivity between independently clocked processors.

All global communication systems are decoupled from the operation of the microthreaded pipeline, and thread scheduling provides latency hiding during the remote access. This technique gives a microthreaded CMP a significant advantage as a long-term solution to silicon scaling.
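The claim that constant-stride dependencies need only nearest-neighbour links on the ring can be checked directly for a modulo schedule. The helper names below are ours, and M and n are arbitrary illustrative values:

```python
def processor_for(i, M, n):
    """Modulo schedule: M consecutive iterations per processor, n processors in a ring."""
    return (i // M) % n

def ring_distance(a, b, n):
    """Hop count between two nodes on an n-node ring."""
    d = (a - b) % n
    return min(d, n - d)

# With a dependency distance of 1, the producer of iteration i is always
# on the same node or the neighbouring ring node.
M, n = 4, 8
assert all(
    ring_distance(processor_for(i, M, n), processor_for(i - 1, M, n), n) <= 1
    for i in range(1, 200)
)
```

The same property holds for any M and n, since decreasing i by one changes i // M by at most one and the modulo wrap-around is exactly one ring hop.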
5. ANALYSIS OF REGISTER FILE PORTS
In this section, an analysis of the microthreaded register-file ports is made in terms of the average number of accesses to each port of the register file in every pipeline cycle. This analysis is based on the hand compilation of a variety of loop kernels. The loops considered include a number of Livermore kernels, some of which are independent and some of which contain loop-carried dependencies. They also include both affine and non-affine loops, vector and matrix problems, and a recursive doubling algorithm. We have used loop kernels at this stage as we currently have no compiler to compile complete benchmarks. However, as the model only gains speedup via loops, we have chosen a broad set of representative loops from scientific and other applications. Analysis of complete programs and other standard benchmarks will be undertaken when the compiler we are developing is able to generate microthreaded code.
The results are based on a static analysis of the accesses to the various register windows and investigate the average traffic on the microthreaded register-file ports. The five types of register-file port are shown in Figure 2 and include the pipeline ports (read, R, and write, W), the initialization port (I), the shared-dependent ports (Sd), the broadcast port (Br) and the write port that is required in the case of a cache miss (Wm). The goal of this analysis is to guide the implementation parameters of such a system. We aim to show that all accesses other than the synchronous pipeline ports can be implemented by a pair of read and write ports, with arbitration between the different sources. In this case a register file with five fixed ports would be sufficient for each of the processors in our CMP design. The microthreaded pipeline uses three synchronous ports. These ports are used to access three classes of register window, i.e. the $L, $S and $G register windows. If we assume that the average number of reads on the pipeline ports in each cycle is R and the average number of writes on the pipeline ports in each cycle is W, then these values are defined by the following equations, where Ne is the total number of instructions executed:

\[ R = \frac{\sum_{\mathrm{inst.}} \bigl(\mathrm{Read}(\$L) + \mathrm{Read}(\$S) + \mathrm{Read}(\$G)\bigr)}{N_e} \quad (1) \]

\[ W = \frac{\sum_{\mathrm{inst.}} \bigl(\mathrm{Write}(\$L) + \mathrm{Write}(\$S) + \mathrm{Write}(\$G)\bigr)}{N_e} \quad (2) \]

The initialization port, on the other hand, is used in register allocation to initialize $L0 to the loop index. This port is accessed once when each iteration is allocated to a processor, and so the average number of accesses to this port is constant and equal to the inverse of the number of instructions executed by the thread
TABLE 3. Average number of accesses to each class of register-file port over a range of loop kernels; m = problem size, n = number of processors, M = consecutive iterations per processor, Ne = total instructions executed.

Loop | Ne | R | W | I | Br | Sd
A: Partial Products | 3m | (4m−3)/Ne | (2m−1)/Ne | 0.333 | 0 | (m−1)/(M·Ne)
B: 2-D SOR | 5m−2 | (8m−15)/Ne | (4m−4)/Ne | 0.2 | 0 | (m−2)/(M·Ne)
L3: Inner Product | 4m+4 | (5m+3)/Ne | (4m+1)/Ne | 0.25 | 0 | m/(M·Ne)
L4: Banded Linear Equation | 3m+34 | (2.4m+37.4)/Ne | (3m+22)/Ne | 0.2 | 4n/Ne | (1.8m+1.8)/(M·Ne)
L5: Tri-Diagonal Elimination | 5m+3 | 7m/Ne | 4m/Ne | 0.25 | 0 | (m−1)/(M·Ne)
L6: General Linear Recurrence | 2.5m+6.5m^1/2−5 | (5.5m+2.5m^1/2−5)/Ne | (3m+m^1/2−2)/Ne | 0.1429 | (m^1/2−1)n/Ne | (0.5m−0.5m^1/2)/(M·Ne)
C: Pointer Chasing | 14m+5 | (9m+3)/Ne | (6m+2)/Ne | 0.0714 | n/Ne | m/(M·Ne)
L1: Hydro Fragment | 9m+5 | 15m/Ne | (8m+3)/Ne | 0.1111 | 3n/Ne | 0
L2: ICCG | 11m+2 log m−21 | (17m−5 log m−27)/Ne | (10m−5 log m−12)/Ne | 0.0909 | (log m−1)n/Ne | 0
L7: Equation of State Fragment | 26m+5 | (43m+3)/Ne | (25m+3)/Ne | 0.0385 | 3n/Ne | 0
before it is killed, n_o. Therefore, if I is the average number of accesses to the initialization port per cycle, we can say that:

\[ I = \frac{1}{n_o} \quad (3) \]
A dependent read to a remote processor uses a read port on the remote processor and a write port on the local processor, as well as a read of the synchronous pipeline port on the local processor. The average number of accesses to these ports per cycle depends on the type of scheduling algorithm used. If we use modulo scheduling, where M consecutive iterations are scheduled to one processor, then inter-processor communication is minimized. An equation for dependent reads and writes is given based on modulo scheduling, although we consider only the worst-case scenario. The average number of accesses per cycle to the dependent window, Sd, is given by the following equation, where M is the number of consecutive threads scheduled to one processor and Ne is the total number of instructions executed. Clearly the worst case is M = 1, i.e. iterations are distributed one per processor in a modulo manner.

\[ S_d = \frac{\sum_{\mathrm{inst.}} \mathrm{Read}(\$D)}{M \cdot N_e} \quad (4) \]

The global write port is used to store data from the broadcast bus to the global window in every processor's local register file. If we assume that the average number of accesses per cycle to this port is Br, then Br can be obtained from the following equation, where Ne is the total number of instructions executed and n is the number of processors in the system. The result is proportional to the number of processors, as one write instruction will cause a write to every processor in the system.

\[ B_r = \frac{\sum_{\mathrm{inst.}} \mathrm{Write}(\$G) \cdot n}{N_e} \quad (5) \]

Finally, the frequency of accesses to the port that is required for the deferred register write in the case of a cache miss can also be obtained. It is parameterized by cache-miss rate in this static analysis and again we look at the worst case (100% miss rate). The average number of writes per cycle to the cache-miss port is given by Wm in the formula below, where Lw is the number of load instructions in each thread body, n_o is the number of instructions executed per thread body and Cm is the cache-miss rate. Again the average access to this port is constant for a given miss rate.

\[ W_m = \frac{L_w}{n_o} \cdot C_m \quad (6) \]

Table 3 shows the average number of accesses to each class of register-file port over a range of loop kernels using the above formulae. The first seven kernels are dependent loops, where the dependencies are carried between iterations using registers. The last three are independent loops, where all iterations of the loop are independent of each other.
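Equations (1) through (6) are simple enough to transcribe directly. The sketch below does so; the function names and the example operand counts are ours, chosen only to illustrate the shape of the calculation:

```python
def pipeline_reads(read_L, read_S, read_G, Ne):
    """Equation (1): average pipeline-port reads per cycle."""
    return (read_L + read_S + read_G) / Ne

def pipeline_writes(write_L, write_S, write_G, Ne):
    """Equation (2): average pipeline-port writes per cycle."""
    return (write_L + write_S + write_G) / Ne

def init_port(n_o):
    """Equation (3): one $L0 initialization per thread body of n_o instructions."""
    return 1 / n_o

def dependent_reads(read_D, M, Ne):
    """Equation (4): $D-window reads under modulo scheduling, M iterations per processor."""
    return read_D / (M * Ne)

def broadcast_port(write_G, n, Ne):
    """Equation (5): each global write reaches all n processors."""
    return write_G * n / Ne

def cache_miss_port(Lw, n_o, Cm):
    """Equation (6): deferred writes from Lw loads per thread body at miss rate Cm."""
    return (Lw / n_o) * Cm
```

For instance, a thread body of n_o = 4 instructions gives init_port(4) = 0.25, matching the I column for the inner-product kernel in Table 3, and the worst case for equation (4) is obtained with M = 1.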
As described previously, each of the distributed register files has four sources of write accesses in addition to the pipeline
FIGURE 4. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 4 processors.
FIGURE 5. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 16 processors.
ports. These are the $G write, the initialization write, the $D return data and the write to the port that supports decoupled access to memory on a cache miss. Our analysis shows that the average number of accesses from these sources is much less than one access per cycle over all analysed loop kernels. This is shown in Figures 4–7, where accesses to the initialization (I), broadcast (Br) and network ports (Sd, shown as S/D) are given. The four figures illustrate the scalability of the results (from n = 4 to
FIGURE 6. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 64 processors.

FIGURE 7. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 256 processors.
n = 256 processors). Results are plotted against the normalized problem size, where m is the size of the problem in terms of the number of iterations, although not all iterations are executed concurrently in all codes (for example, the recursive doubling algorithm has a sequence of concurrent loops varying by powers of 2 from 2 to m/2). Normalized problem size is therefore a measure of the number of iterations executed per processor. It can be seen that only accesses from the broadcast bus increase with the number of processors, and even this is only significant where few iterations are mapped to each processor. Even in
TABLE 4. Average number of accesses to all additional write ports for different numbers of processors, m/n = 8.

Miss rate (%) | Wm port | All write ports, n=4 | n=16 | n=64 | n=256
10  | 0.038 | 0.405 | 0.453 | 0.517 | 0.634
20  | 0.076 | 0.443 | 0.491 | 0.555 | 0.672
30  | 0.113 | 0.480 | 0.528 | 0.592 | 0.709
40  | 0.151 | 0.518 | 0.566 | 0.630 | 0.747
50  | 0.189 | 0.556 | 0.604 | 0.668 | 0.785
60  | 0.227 | 0.594 | 0.642 | 0.706 | 0.823
70  | 0.265 | 0.632 | 0.680 | 0.744 | 0.861
80  | 0.302 | 0.669 | 0.717 | 0.781 | 0.898
90  | 0.340 | 0.707 | 0.755 | 0.819 | 0.936
100 | 0.378 | 0.744 | 0.785 | 0.857 | 0.974
the case of 256 processors, providing we schedule more than a few iterations to each processor, the overall number of writes per cycle is <50% of the shared port's capacity. Note that a register file of 512 registers supports at least 32 micro-contexts per processor. To put this in perspective, this means that on average a single port sharing all I, Br and Sd writes would be busy only 50% of the time. There may be peaks in the distribution of writes per cycle; however, all of these accesses are asynchronous and they can be queued without stalling the operation of any of the pipelines. This still leaves capacity to include writes from the decoupled-memory accesses.
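That queued asynchronous writes impose only a negligible backlog on one arbitrated port can be illustrated with a small Monte-Carlo sketch. The per-source request rates here are made-up placeholders, not the measured figures of this section:

```python
import random

def simulate_shared_write_port(rates, cycles=100_000, seed=0):
    """Queue-length statistics for one write port shared by several
    asynchronous sources, each issuing a request in a cycle with the
    given probability; the port serves one request per cycle."""
    rng = random.Random(seed)
    queue = 0
    max_queue = 0
    total_backlog = 0
    for _ in range(cycles):
        queue += sum(rng.random() < r for r in rates)  # new requests this cycle
        if queue:
            queue -= 1                                  # port retires one request
        max_queue = max(max_queue, queue)
        total_backlog += queue
    return max_queue, total_backlog / cycles

# three asynchronous sources whose combined demand is well below one per cycle
peak, backlog = simulate_shared_write_port([0.11, 0.05, 0.10])
```

With a combined arrival rate of about 0.26 requests per cycle, the queue is almost always empty and its occasional peaks are shallow, which is why the arbitrated port never forces a pipeline stall.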
The analysis of the decoupled-memory port also shows that the average number of accesses per cycle is small. If we assume a cache-miss rate of 50%, then the average number of accesses is <20% over all loop kernels. Thus, a single write port would not be fully utilized at this miss rate. For completeness, Table 4 shows the average number of accesses per cycle to all write ports, including the Wm port, with a variable cache-miss rate. This table is compiled for a normalized problem size of m/n = 8, which corresponds to fully utilizing a small register file. Actual cache-miss rates for one of the kernels are given in the simulation results presented below.
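The way Table 4 is built, combining the miss-independent I, Br and Sd terms with the Wm term of equation (6), can be reproduced in a few lines. The per-port figures below are illustrative stand-ins, not the table's entries:

```python
def write_port_demand(I, Br, Sd, Lw, n_o, miss_rates):
    """Combined demand on the single arbitrated write port for each
    cache-miss rate Cm; only Wm = (Lw/n_o)*Cm varies with the miss rate."""
    return [I + Br + Sd + (Lw / n_o) * cm for cm in miss_rates]

rates = [r / 100 for r in range(10, 101, 10)]  # 10% .. 100% miss rate
demand = write_port_demand(I=0.11, Br=0.15, Sd=0.10, Lw=2, n_o=9,
                           miss_rates=rates)
# demand rises linearly with the miss rate and stays below one access per cycle
```

The linear rise with miss rate, and the constant increment between rows, mirrors the structure visible in Table 4.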
6. CMP SIMULATION RESULTS
This section gives some preliminary simulation results, which show scalability of performance in executing a single loop kernel. Reference [47] also shows scalability of power for the same simulations. The results are preliminary as they cannot show the relationship between the scalar part of the code and the concurrent part; a microthreaded compiler is required before we can simulate complete applications. The significant point, however, is that all of these results were obtained from a single binary code that was hand compiled from the Livermore hydro fragment and executed on a microthreaded CMP using from 1 to 2048 processors. The code performs a fixed number of iterations (64K). Clearly this demonstrates the schedule invariance of microthreaded code.

The results are presented as functions of the number of processors on which the code was run and include: the number of cycles to execute the code; the number of instructions executed (note that failed synchronization will cause an increase in the overall number of instructions executed); IPC, which is computed from the previous two; the number of completely inactive cycles, when the pipe is empty and no threads are able to execute; and finally the L1 D-cache hit rate.

All results use identical processors with 1024 registers per processor and a 512-entry continuation queue. The I-cache is also identical in all results, a relatively small 2-Kbyte 8-way set-associative cache with a 32-byte line size. The first set of results is for a relatively complex level-1 D-cache of 64 Kbyte, with 8-way associativity and a 64-byte line size. Figure 8 shows the results. The largest increase in instructions executed due to instruction reissue on failed synchronization is <4% and occurred at 256 processors. The maximum IPC was 1613, representing a speedup of 1602 times the single-processor execution, and occurred, as expected, with 2048 processors. This is nearly 80% efficient. Indeed, the speedup was within 4% of the ideal up to 256 processors. The beginning of saturation in speedup is not unexpected given the fixed problem size. When using 2048 processors, for example, the processor's resources (registers and continuation-queue slots) are only partially utilized and hence latency tolerance suffers. Also the fixed overhead of thread creation, including broadcast of the TCB pointer to all processors, is amortized over fewer cycles. Note that all simulations are started with cold caches. The overhead is represented by the relatively static 100 or so cycles during which the processors are inactive. As the solution time approaches this, the percentage overhead will increase rapidly. Finally, it can be seen that the cache hit rate varies between 75 and 95%. The best hit rate is seen on one processor but the worst occurs on ∼64 processors, where the resources start to become less efficiently used. Note that the local schedules
FIGURE 8. Characteristics of microthreaded kernel execution (64K D-cache): cycles to solution, IPC, instructions executed, inactive cycles and hit rate (%), against the number of processors used to execute the kernel.
were set to match the cache line size and hence maximize hit rate.

The second set of results was undertaken to illustrate what effect, if any, the D-cache had on performance. In this set, the D-cache parameters were altered drastically to give a residual cache of just 1 Kbyte with direct-mapped cache lines (n.b. the register file is substantially larger than this, at 8 Kbyte). Figure 9 shows these results. It can be seen that the number of instructions executed, time to solution and IPC are all virtually unchanged. The only significant change is in the cache hit rate, which is now much worse, varying from 40 up to 88% on a single processor. Indeed, the minimal time to solution and maximum IPC were observed, by a small margin, with 2048 processors using the residual D-cache.

We note that there are some caveats to these results. They are obtained from the execution of a single independent code kernel. Simulations of dependent loops show that speedup saturates, with the maximum speedup being determined by the ratio of independent to dependent instructions in these kernels. For example, a simulation of a naive multiply-accumulate implementation of matrix multiplication (a dependent loop) saturated with a speedup of between 3 and 4, achieved using only four processors. This represents the maximum instruction parallelism of the combined iterations. The results also ignore the scalar component of the program, which we cannot effectively evaluate until we have fully developed a microthreaded compiler. Finally, the memory model used in these simulations is non-blocking, as we do not have a realistic model fully simulated yet. The results use a pipelined memory with an 8-cycle start-up for the first word and 2 cycles per word to complete a cache-line transfer. Note that this limitation is not a major issue with the kernel simulated, as data can be distributed in memory according to the local schedule and blocking would be unlikely. In the 64-Kbyte cache only 3% of memory accesses cause a request to second-level memory. The remaining cache misses are same-line misses due to the regularity of scheduling. We are currently working on a second-level memory architecture and will present full simulation results of this when we have a working compiler. The results presented here, however, show great promise for this approach.
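As a sanity check on the headline numbers above, parallel efficiency is just the achieved speedup over the processor count; the figures below are the ones reported for the 64K D-cache run:

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors

# speedup of 1602 on 2048 processors, as reported for the hydro fragment
eff = parallel_efficiency(1602, 2048)  # about 0.78, i.e. nearly 80% efficient
```

The same function gives an efficiency within 4% of 1.0 at 256 processors, consistent with the near-ideal speedup reported up to that point.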
7. CONCLUSIONS
The characteristics of advanced integrated circuits (ICs) will in future require powerful and scalable CMP architectures. However, current techniques like wide-issue, superscalar processors suffer from complexity in instruction issue and in the large multi-ported register file required. The complexity of these components grows at least quadratically with increasing issue width; also, execution of instructions using these techniques must proceed speculatively, which does not always
FIGURE 9. Characteristics of microthreaded kernel execution (1K D-cache): cycles to solution, IPC, instructions executed, inactive cycles and hit rate (%), against the number of processors used to execute the kernel.
provide results for the power consumed. In addition, more on-chip memory is required in order to ameliorate the effects of the so-called 'memory wall'. These obstacles limit the processor's performance, by constraining parallelism or through having large and slow structures. In short, this approach does not provide scalability in a processor's performance, in the on-chip area or in power dissipation.

An alternative solution, which eliminates this complexity in instruction issue and the global register file, and avoids speculation, has been presented in this paper. The model is based on decomposing a sequential program into small fragments of code called microthreads, which are scheduled dynamically and which can communicate and synchronize with each other very efficiently. This process allows sequential code to be compiled for execution on scalable CMPs. Moreover, as the code is schedule invariant, the same code will execute on any number of processors, limited only by problem size. The model exploits ILP within basic blocks and across loop bodies. In addition, this approach supports a pre-fetching mechanism that avoids many instruction-cache misses in the pipeline. The fully distributed register-file configuration used in this approach has the additional advantage of full scalability in a CMP, with the decoupling of all forms of communication from the pipeline's operation. This includes memory accesses and communication between micro-contexts.

The distributed implementation of a microthreaded CMP includes two forms of asynchronous communication. The first is the broadcast bus, used for creating threads and distributing invariants. The second is the shared-register ring network, used to perform communication between the register files of the producer and consumer threads. The asynchronous implementation of the bus and switch provides many opportunities for power saving in large CMP systems. The decoupled approach to register-file design avoids a centralized register-file organization and, as we have shown, requires a small, fixed number of ports for each processor's register file, regardless of the number of processors in the system.

An analysis of the register-file ports in terms of the frequency of accesses to each logical port is described in this paper. This analysis involved different types of dependent and independent loop kernels. The analysis illustrates a number of interesting issues, which can be summarized as follows:
• A single write port with arbitration between different sources is sufficient to support all non-pipeline writes. This port has an average access rate of <100% under normal operating conditions. This is true even in the case of a 100% cache-miss rate.
• A second port is required to handle reads to the $D window. The analysis shows that the average access to this port is <10% over all analysed loop kernels.
• As a consequence, the distributed register files require only five ports per processor and these ports are fixed regardless of the number of processors in the system. This
provides a scalable and efficient solution for a large number of processors on-chip.
• Finally, the average accesses to all write ports do not exceed 100% even in the case of n = 256 processors. However, with a larger number of processors, the performance would degrade gracefully due to the inherent latency tolerance of the model. Eventually all threads would be suspended waiting for data, and in this case the stalled pipeline(s) would free up contention on the non-pipeline write port.

Finally, we present results of the simulation of an independent loop kernel that clearly demonstrate the schedule invariance of the binary code and the linear speedup characteristics over a wide range of processors on which the kernel is scheduled. Clearly, a microthreaded CMP based on a fully distributed and scalable register-file organization and asynchronous global communication buses is a good candidate for future CMPs.

REFERENCES

[1] Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, S. and Verghese, B. (2000) Piranha: a scalable architecture based on single-chip multiprocessing. In Proc. 27th Annual Int. Symp. Computer Architecture, Vancouver, British Columbia, Canada, June 12–14, pp. 282–293. ACM Press, New York, NY.
[2] Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M. and Olukotun, K. (2000) The Stanford Hydra CMP. IEEE Micro, 20, 71–84.
[3] Hammond, L., Nayfeh, B. A. and Olukotun, K. (1997) A single-chip multiprocessor. IEEE Computer, 30, 79–85.
[4] Tendler, J. M., Dodson, J. S., Fields, J. S., Le, H. and Sinharoy, B. (2002) POWER4 system microarchitecture. IBM J. Res. Develop., 46, 5–25.
[5] Tremblay, M., Chan, J., Chaudhry, S., Conigliaro, A. W. and Tse, S. S. (2000) The MAJC architecture: a synthesis of parallelism and scalability. IEEE Micro, 20, 12–25.
[6] Jesshope, C. R. (2004) Scalable instruction-level parallelism. In Proc. Computer Systems: Architectures, Modeling and Simulation, 3rd and 4th Int. Workshops, SAMOS 2004, Samos, Greece, July 19–21, LNCS 3133, pp. 383–392. Springer, Berlin, Germany.
[7] Bhandarkar, D. (2003) Billion transistor chips in mainstream enterprise platforms of the future. In Proc. 9th Int. Symp. High-Performance Computer Architecture, Anaheim, CA, February 8–12, p. 3. IEEE Computer Society, Washington, DC.
[8] Agarwal, V., Hrishikesh, M. S., Keckler, S. W. and Burger, D. (2000) Clock rate versus IPC: the end of the road for conventional microarchitectures. In Proc. 27th Annual Int. Symp. Computer Architecture, Vancouver, British Columbia, Canada, June 10–14, pp. 248–259. ACM Press, New York, NY.
[9] Onder, S. and Gupta, R. (2001) Instruction wake-up in wide issue superscalars. In Proc. 7th Int. Euro-Par Conf. on Parallel Processing, Manchester, UK, August 28–31, pp. 418–427. Springer-Verlag, London, UK.
[10] Onder, S. and Gupta, R. (1998) Superscalar execution with dynamic data forwarding. In Proc. Int. Conf. Parallel Architectures and Compilation Techniques, Paris, France, October 12–18, pp. 130–135. IEEE Computer Society, Washington, DC.
[11] Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K. and Chang, K. (1996) The case for a single-chip multiprocessor. In Proc. Seventh Int. Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-7), Cambridge, MA, October 1–5, pp. 2–11. ACM Press, New York, NY.
[12] Palacharla, S., Jouppi, N. P. and Smith, J. (1997) Complexity-effective superscalar processors. In Proc. 24th Int. Symp. Computer Architecture, Denver, CO, June 1–4, pp. 206–218. ACM Press, New York, NY.
[13] Tullsen, D. M., Eggers, S. and Levy, H. M. (1995) Simultaneous multithreading: maximizing on-chip parallelism. In Proc. 22nd Annual Int. Symp. Computer Architecture, Santa Margherita Ligure, Italy, June 22–24, pp. 392–403. ACM Press, New York, NY.
[14] Rixner, S., Dally, W. J., Khailany, B., Mattson, P. R., Kapasi, U. J. and Owens, J. D. (2000) Register organization for media processing. In Proc. Int. Symp. High Performance Computer Architecture, Toulouse, France, January 8–12, pp. 375–386. IEEE CS Press, Los Alamitos, CA.
[15] Balasubramonian, R., Dwarkadas, S. and Albonesi, D. (2001) Reducing the complexity of the register file in dynamic superscalar processors. In Proc. 34th Int. Symp. on Microarchitecture, Austin, TX, December 1–5, pp. 237–248. IEEE Computer Society, Washington, DC.
[16] Diefendorff, K. and Duquesne, Y. (2002) Complex SOCs require new architectures. EE Times. Available at http://www.eetimes.com/issue/se/OEG20020911S0076.
[17] Ungerer, T., Robec, B. and Silc, J. (2003) A survey of processors with explicit multithreading. ACM Comput. Surveys, 35, 29–63.
[18] Burns, J. and Gaudiot, J.-L. (2001) Area and system clock effects on SMT/CMP processors. In Proc. 2001 Int. Conf. Parallel Architectures and Compilation Techniques, Barcelona, Spain, September 8–12, pp. 211–218. IEEE Computer Society, Washington, DC.
[19] Jesshope, C. R. (2003) Multi-threaded microprocessors: evolution or revolution. In Proc. 8th Asia-Pacific Conf. ACSAC 2003, Aizu, Japan, September 23–26, LNCS 2823, pp. 21–45. Springer, Berlin, Germany.
[20] Luo, B. and Jesshope, C. R. (2002) Performance of a micro-threaded pipeline. In Proc. 7th Asia-Pacific Conf. Computer Systems Architecture, Melbourne, Victoria, Australia, January 28–February 2, pp. 83–90. Australian Computer Society, Darlinghurst, Australia.
[21] Jesshope, C. R. (2001) Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines. In Proc. ACSAC 2001, Gold Coast, Queensland, Australia, January 29–30, pp. 80–88. IEEE Computer Society, Los Alamitos, CA.
[22] Zhou, H. and Conte, T. M. (2002) Code Size Efficiency in Global Scheduling for VLIW/EPIC Style Embedded Processors. Technical Report, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC.
[23] Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill, New York, St Louis, San Francisco.
[24] Sudharsanan, S., Sriram, P., Frederickson, H. and Gulati, A. (2000) Image and video processing using MAJC 5200. In Proc. 2000 IEEE Int. Conf. Image Processing, Vancouver, BC, Canada, September 10–13, pp. 122–125. IEEE Computer Society, Washington, DC.
[25] Cintra, M. and Torrellas, J. (2002) Eliminating squashes through learning cross-thread violations in speculative parallelisation for multiprocessors. In Proc. 8th Int. Symp. High-Performance Computer Architecture, Boston, MA, February 2–6, pp. 43–54. IEEE Computer Society, Washington, DC.
[26] Cintra, M., Martinez, J. S. and Torrellas, J. (2000) Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In Proc. Int. Symp. Computer Architecture, Vancouver, Canada, June 10–14, pp. 13–24. ACM Press, New York, NY.
[27] Terechko, A., Thenaff, E. L., Garg, M. J., Van Eijndhoven, J. V. and Corporaal, H. (2003) Inter-cluster communication models for clustered VLIW processors. In Proc. 9th Int. Symp. High-Performance Computer Architecture, Anaheim, CA, February 8–12, pp. 354–364. IEEE Computer Society, Washington, DC.
[28] Halfhill, T. (1998) Inside IA-64. Byte Magazine, 23, 81–88.
[29] Schlansker, M. S. and Rau, B. R. (2000) EPIC: An Architecture for Instruction-Level Parallel Processors. Technical Report HPL-1999-111, Compiler and Architecture Research, HP Laboratories, Palo Alto, CA.
[30] Sundararaman, K. and Franklin, M. (1997) Multiscalar execution along a single flow of control. In Proc. IEEE Int. Conf. Parallel Processing, Bloomington, IL, August 11–15, pp. 106–113. IEEE Computer Society, Washington, DC.
[31] Sohi, G. S., Breach, S. E. and Vijaykumar, T. N. (1995) Multiscalar processors. In Proc. 22nd Annual Int. Symp. Computer Architecture, Santa Margherita Ligure, Italy, June 22–24, pp. 414–425. ACM Press, New York, NY.
[32] Breach, S. E., Vijaykumar, T. N. and Sohi, G. S. (1994) The anatomy of the register file in a multiscalar processor. In Proc. 27th Int. Symp. Microarchitecture, San Jose, CA, November 30–December 2, pp. 181–190. ACM Press, New York, NY.
[33] Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A. and Smith, B. (1990) The Tera computer system. In Proc. 4th Int. Conf. Supercomputing, Amsterdam, The Netherlands, June 11–15, pp. 1–6. ACM Press, New York, NY.
[34] Kongetira, P., Aingaran, K. and Olukotun, K. (2005) Niagara: a 32-way multithreaded Sparc processor. IEEE Micro, 25, 21–29.
[35] Marr, D. T., Binns, F., Hill, D. L., Hinton, G., Koufaty, D. A. and Upton, M. (2002) Hyper-threading technology architecture and microarchitecture. Intel Technol. J., 6, 4–15.
[36] Emer, J. (1999) Simultaneous multithreading: multiplying Alpha's performance. Presentation at the Microprocessor Forum '99, MicroDesign Resources, San Jose, CA.
[37] Codrescu, L., Wills, D. S. and Meindl, J. D. (2001) Architecture of the Atlas chip-multiprocessor: dynamically parallelizing irregular applications. IEEE Trans. Comput., 50, 67–82.
[38] Diefendorff, K. (1999) Power4 focuses on memory bandwidth: IBM confronts IA-64, says ISA not important. Microprocessor Rep., 13, 11–17.
[39] Preston, R. P. et al. (2002) Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Proc. 2002 IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 4–6, pp. 334–335. IEEE, USA.
[40] Scott, L., Lee, L., Arends, J. and Moyer, B. (1998) Designing the low-power M-CORE architecture. In Proc. IEEE Power Driven Microarchitecture Workshop at ISCA 98, Barcelona, Spain, June 28, pp. 145–150.
[41] Park, I., Powell, M. D. and Vijaykumar, T. N. (2002) Reducing register ports for higher speed and lower energy. In Proc. 35th Annual ACM/IEEE Int. Symp. Microarchitecture, Istanbul, Turkey, November 18–22, pp. 171–182. IEEE Computer Society, Los Alamitos, CA.
[42] Kim, N. S. and Mudge, T. (2003) Reducing register ports using delayed write-back queues and operand pre-fetch. In Proc. 17th Annual Int. Conf. Supercomputing, San Francisco, CA, June 23–26, pp. 172–182. ACM Press, New York, NY.
[43] Tseng, J. H. and Asanovic, K. (2003) Banked multiported register files for high-frequency superscalar microprocessors. In Proc. 30th Int. Symp. Computer Architecture, San Diego, CA, June 9–11, pp. 62–71. ACM Press, New York, NY.
[44] Bunchua, S., Wills, D. S. and Wills, L. M. (2003) Reducing operand transport complexity of superscalar processors using distributed register files. In Proc. 21st Int. Conf. Computer Design, San Jose, CA, October 13–15, pp. 532–535. IEEE Computer Society, Los Alamitos, CA.
[45] Bolychevsky, A., Jesshope, C. R. and Muchnick, V. (1996) Dynamic scheduling in RISC architectures. IEE Proc. Comput. Digit. Tech., 143, 309–317.
[46] Jesshope, C. R. (2005) Micro-grids: the exploitation of massive on-chip concurrency. In Proc. HPC Workshop 2004, Grid Computing: A New Frontier of High Performance Computing, L. Grandinetti (ed.), Cetraro, Italy, May 31–June 3. Elsevier, Amsterdam.
[47] Bousias, K. and Jesshope, C. R. (2005) The challenges of massive on-chip concurrency. In Proc. Tenth Asia-Pacific Computer Systems Architecture Conference, Singapore, October 24–26, LNCS 3740, pp. 157–170. Springer-Verlag, Berlin, Germany.
[48] Shapiro, D. (1984) Globally Asynchronous Locally Synchronous Circuits. PhD Thesis, Report No. STAN-CS-84-1026, Stanford University, Stanford, CA.
[49] Shengxian, Z., Li, W., Carlsson, J., Palmkvist, K. and Wanhammar, L. (2002) An asynchronous wrapper with novel handshake circuits for GALS systems. In Proc. IEEE 2002 Int. Conf. Communications, Circuits and Systems, Chengdu, China, June 29–July 1, pp. 1521–1525. IEEE.