Advance Access published on December 19, 2005. doi:10.1093/comjnl/bxh157
Instruction Level Parallelism through Microthreading—A Scalable Approach to Chip Multiprocessors

Kostas Bousias1, Nabil Hasasneh2 and Chris Jesshope1,∗
1Department of Computer Science, University of Amsterdam, NL
2Department of Electronic Engineering, University of Hull, UK
∗Corresponding author: Jesshope@science.uva.nl
Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. Wide issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width, leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity in instruction issue and eliminates speculative execution, is the microthreaded model. This model fragments sequential code at compile time and executes the fragments out of order while maintaining in-order execution within the fragments. The only constraints on the execution of fragments are the dependencies between them, which are managed in a distributed and scalable manner using synchronizing registers. The fragments of code are called microthreads and they capture ILP and loop concurrency. Fragments can be interleaved on a single processor to give tolerance to latency in operands or distributed to many processors to achieve speedup. The implementation of this model is fully scalable. It supports distributed instruction issue and a fully scalable register file, which implements a distributed, shared-register model of communication and synchronization between multiple processors on a single chip. This paper introduces the model, compares it with current approaches and presents an analysis of some of the implementation issues. It also presents results showing scalable performance with issue width over several orders of magnitude, from the same binary code.
Keywords: concurrency, CMP, microthreads, code fragments

Received 10 May 2005; revised 24 August 2005
1. INTRODUCTION
For many years now, researchers have been interested in the idea of achieving major increases in the computational power of computers by the use of chip multiprocessors (CMPs). Examples of CMPs are the Compaq Piranha [1], the Stanford Hydra [2] and that of Hammond et al. [3]. Several architectures have been proposed and some manufacturers have produced commercial designs, such as the IBM PowerPC [4] and Sun's MAJC project [5]. Ideally, the performance of such systems should be directly proportional to the number of processors used, i.e. should be scalable. CMPs scale well, with the limit to scalability defined by Moore's law. We calculate that current technology could support hundreds of in-order processors and achieve significant speedup over current architectures that use implicit concurrency and achieve minimal speedup through concurrent instruction issue [6]. One of the major barriers to the use of CMPs is the problem of programming them without using explicit concurrency in the user code. Ideally they should be programmed using legacy sequential code.

In theory, there is no limit to the number of processors that can be used in a CMP provided that the concurrency derived from the sequential code scales with the problem size. The problem is how to split the code into a number of independent threads, schedule these on many processors and to do this with a low and scalable overhead in terms of the control logic and processor efficiency. In fact, on general-purpose code it will
The Computer Journal, Vol. 49 No. 2, 2006
be impossible to eliminate all dependencies between threads and hence synchronization is also required. The goal of this work therefore is to define a feasible architecture for a scalable CMP that is easy to program, that maximizes throughput for a given technology and that minimizes the communication and synchronization overheads between different threads. It should be noted that in this work the term thread is used to describe very small code fragments with minimal context.

Today Intel's Itanium-2 (Madison) microprocessor features over 410 million transistors in a 0.13 µm semiconductor process technology operating at a speed of 1.5 GHz. This is a dual-processor version of the previous Itanium processor (McKinley), which has an issue width of six. Moore's law would indicate that the billion-transistor chip will become feasible in 65 nm technology within the next 3 or 4 years [7]. Intel expects that the Itanium processor will reach 1.7 billion transistors at the end of 2005. The questions we must ask are where do we go from here and what is the best way to utilize this wealth of transistors, while maximizing performance and minimizing power dissipation?

It can be argued that the current trend of increasing the clock speed and using a large area to marginally improve the instructions executed per cycle (IPC) of the chip is a poor strategy for future generations of microprocessors and does not guarantee better performance. The use of aggressive clock speed as a means of achieving performance only exacerbates the memory wall and requires larger on-chip caches. More importantly, this strategy cannot be continued indefinitely, as power density is a function of frequency and is becoming a critical design constraint. Using concurrency as a means of increasing performance without increasing power density is a much better strategy, so long as models and implementations can be found that are scalable. Conversely, by exploiting concurrency for constant performance, clock frequencies can be reduced, allowing voltages to be scaled down and gaining a quadratic decrease in power density with concurrency. Of course, this assumes completely scalable models and implementations.

Another problem area in future technology is the scaling of wire delays compared with gate delays. As transistor dimensions scale down, the number of gates which are reachable within the scaled clock is at best constant, which means that distributed rather than monolithic architectures need to be exploited [8].

Superscalar processors today issue up to eight instructions per clock cycle but instruction issue is not scalable [9] and a linear increase in parallelism requires at least a quadratic increase in area [10]. The logic required occupies ∼30% of the total chip area in a 6-way superscalar processor [11]. In addition, more and more area is being used for on-chip memory. Typically the second-level on-chip cache occupies 25–30% of the die area on a modern microprocessor and between 50 and 75% on the recently announced Itanium-2. Moreover, a significant delay and power consumption are seen in high-issue-width processors due to tag matching, wake-up signals to waiting instructions and selection mechanisms for issuing instructions. These delays increase quadratically with the instruction window size for most building blocks [12]. Finally, even with the implementation of a large instruction window, it is difficult for processors to find sufficient fine-grain parallelism, which has made most chip manufacturers, like Compaq, Intel and Sun, look at simultaneous multi-threading (SMT) [13] to expose more instruction level parallelism (ILP) through a combination of coarse- and fine-grain parallelism.

Future microprocessor technology needs efficient new architectures to meet the demands of many grand-challenge applications, such as weather and environmental modelling, computational physics (on both sub-atomic and galactic scales) and biomolecular simulation. In the near future, it should be possible to integrate thousands of arithmetic units on a single chip [14] but out-of-order execution is not a good candidate, because of the limited concurrency it exposes and the non-linear increase in hardware complexity of instruction issue with issue width. Instruction issue is not the only scaling problem; the complexity of the register file [15] and bypass logic [12] also scale badly, giving further barriers to a scalable architecture. Very long instruction word (VLIW) architectures transfer the task of instruction scheduling from the hardware to the compiler, which avoids the scaling problems in instruction issue. However, in practice, many modern VLIW architectures end up requiring many of the same complex mechanisms as superscalar processors [16], such as branch prediction, speculative loads, pipeline interlocks, and new mechanisms like code compressors.

Multi-threading can expose higher levels of concurrency and can also hide latency by switching to a new thread when one thread stalls. SMT appears to be the most popular form of multi-threading. In this technique, instructions from multiple threads are issued to a single wide-issue processor out of programmed order using the same problematic structures we have already described. It increases the amount of concurrency exposed in out-of-order issue by working on multiple threads simultaneously. The main drawback to SMT is that it complicates the instruction issue stage, which is central for the multiple threads [17]. Scalability in instruction issue is no easier to achieve because of this, and the other scalability problems remain unchanged. Thus SMT suffers from the same implementation problems [18] as superscalar processors.

An alternative approach to multi-threading that eliminates speculation and does provide scalable instruction issue is the microthreaded model. The threads in this model are small code fragments with an associated program counter. Little other state is required to manage them. The model is able to expose and support much higher levels of concurrency using explicit but dynamic controls. Like VLIW, the concurrency exposed comes from the compiler and expresses loops as well as basic-block concurrency. Unlike VLIW, the concurrency is parametric and is created dynamically. In pipelines that execute this
model, instructions are issued in-order from any of the threads allocated to it but the schedule of instructions executed is non-deterministic, being determined by data availability. Threads can be deterministically distributed to multiple pipelines based on a simple scheduling algorithm. The allocation of these threads is dynamic, being determined by resource availability, as the concurrency exposed is parametric and not limited by the hardware resources. The instruction issue schedule is also dynamic and requires linear hardware complexity to support it. Instructions can be issued from any microthread already allocated and active. If such an approach could also give a linear performance increase with the number of pipelines used, then it can provide a solution to both CMP and ILP scalability [19]. The performance of the microthreaded model is compared with a conventional pipeline in [20] for a variety of memory latencies. The results show that the microthreaded microprocessor always provides a superior performance to that of a non-threaded pipeline and consistently shows a high IPC on a single-issue pipeline (0.85), even for highly dependent code and in the presence of very high memory latencies. The model naturally requires more concurrency to achieve the same asymptotic performance as memory delays are increased. It will be shown in this paper that applying this model to a CMP can also provide orders-of-magnitude speedup, and IPCs of well over 1000 have been demonstrated. Note that both the instruction issue logic and the size of the register file in this model scale linearly with issue width.

2. CURRENT APPROACHES

2.1. Out-of-order execution
To achieve a higher performance, modern microprocessors use an out-of-order execution mechanism to keep multiple execution units as busy as possible. This is achieved by allowing instructions to be issued and completed out of the original program sequence as a means of exposing concurrency in a sequential instruction stream. More than one instruction can be issued in each cycle, but only independent instructions can be executed in parallel; other instructions must be kept waiting or, under some circumstances, can proceed speculatively.

Speculation refers to executing an instruction before it is known whether the results of the instruction will be used or not; this means that a guess is made as to the outcome of a control or data hazard as a means to continue executing instructions, rather than stalling the pipeline. Register renaming is also used to eliminate the artificial data dependencies introduced by issuing instructions out of order. This also enables the extension of the architectural register set of the original ISA, which is necessary to support concurrency in instruction execution. Any concurrent execution of a sequential program will require some similar mechanism to extend the synchronization memory available to instructions. Speculative execution and out-of-order issue are used in superscalar processors to expose concurrency from sequential binary code. A reorder buffer or future file of checkpoints and repairs is also required to re-order the completion of instructions before committing their results to the registers specified in the ISA, in order to achieve a sequentially consistent state on exceptions.

Control speculation predicts branch targets based on the prior history for the same instruction. Execution continues along this path as if the prediction was correct, so that when the actual target is resolved, a comparison with the predicted target will either match, giving a correctly predicted branch, or not, in which case there was a misprediction. A misprediction can require many pipeline cycles to clean up and, in a wide-issue pipeline, this can lead to hundreds of instruction slots being unused or, to be more precise if we focus on power, being used unnecessarily. Both prediction and cleaning up a misprediction require additional hardware, which consumes extra area and also power, often to execute instructions whose results are never used. It can therefore be described as wasteful of chip resources and, moreover, has unpredictable performance characteristics [21]. We will show that it is possible to obtain high performance without speculation and, moreover, save power in doing so.

As already noted in Section 1, as the issue width increases in an out-of-order, superscalar architecture, the size of the instruction window and associated logic increase quadratically, which results in a large percentage of the chip being devoted to instruction issue. The out-of-order execution mechanism therefore prevents concurrency from scaling with technology and will ultimately restrict performance over time. The only reason for using this approach is that it provides an implicit mechanism to achieve concurrent execution from sequential binary code.

2.2. VLIW

An alternative and explicit approach to concurrency in instruction issue is VLIW, where multiple functional units are used concurrently as specified by a single instruction word. This usually contains a fixed number of operations that are fetched, decoded, issued and executed concurrently. To avoid control or data hazards, VLIW compilers must hoist later independent instructions into the VLIW or, if this is not possible, must explicitly add no-op instructions instead of relying on hardware to stall the instruction issue until the operands are ready. This can cause two problems: firstly, a stall in one instruction will stall the entire width of the instruction; secondly, adding no-op instructions increases the program size. In terms of performance, if the program size is large compared to the I-cache or TLB size, it may result in higher miss rates, which in turn degrades the performance of the processor [22]. It is not possible to identify all possible sources of pipeline stalls and their duration at compile time. For example, suppose a memory access causes a cache miss; this leads to a longer than expected stall. Therefore, in instructions with non-deterministic delay, like a load instruction to a cache hierarchy,
the number of no-op instructions required is not known and most VLIW compilers will schedule load instructions using the cache-hit latency rather than the maximum latency. This means that the processor will stall on every cache miss. The alternative of scheduling all loads with the cache-miss latency is not feasible for most programs, because the maximum latency may not be known due to bus contention or memory port delays, and it also requires considerable ILP. This problem with non-determinism in cache access limits VLIW to cacheless architectures unless speculative solutions are embraced. This is a significant problem with modern technology, where processor speeds are significantly higher than memory speeds [19]. Also, pure VLIW architectures are not good for general-purpose applications, due to their lack of compatibility in binary code [23]. The most significant use of VLIW therefore is in embedded systems, where these constraints are both solved (i.e. single applications using small, fast memories). A number of projects described below have attempted to apply speculation to VLIW in order to solve the scheduling problems and one, the Transmeta Crusoe, has applied dynamic binary code translation to solve the backward compatibility problem.

The Sun MAJC 5200 [24] is a CMP based on four-way issue, VLIW pipelines. This architecture provides a set of predicated instructions to support control speculation. The MAJC architecture attempts to use speculative, thread-level parallelism (TLP) to support the multiple processors. This aggressively executes code in parallel that cannot be fully parallelized by the compiler [25, 26]. It requires new hardware mechanisms to eliminate most squashes due to data dependencies [25]. This method of execution is again speculative and can degrade the processor's performance when the speculation fails. MAJC replicates its shared registers in all pipelines to avoid sharing resources. From the implementation point of view, replicating the registers costs significant power and area [27] and also restricts the scalability. Furthermore, the MAJC compiler must know the instruction latencies before it can create a schedule. As described previously, it is not simple to detect all instructions' latencies due to the variety of the hardware communication overheads.

2.3. EPIC

Intel's explicitly parallel instruction computing (EPIC) architecture is another speculative evolution of VLIW, which also solves the forward (although not backward) code compatibility problem. It does this through the run-time binding of instruction words to execution units. The IA-64 [28] architecture supports binary code compatibility across a range of processor widths by utilizing instruction packets that are not determined by issue width. This means a scheduler is required to select instructions for execution on the available hardware from the current instruction packet. This gives more flexibility as well as supporting binary code compatibility across future generations of implementation. The IA-64 also provides architectural support for control and data speculation through predicated instruction execution and binding pre-fetches of data into cache. In this architecture each operation is guarded by one of the predicate registers, each of which stores 1 bit that determines whether the results of the instruction are required or not. Predication is a form of delayed branch control and this bit is set based on a comparison operation. In effect, instructions are executed speculatively but state update is determined by the predicate bit, so an operation is completed only if the value of its guard bit is true; otherwise the processor invalidates the operation. This is a form of speculation that executes both arms of some branches concurrently, but this action restricts the effective ILP, depending on the density and nesting of branches.

Pre-fetching is achieved in a number of ways: for example, by an instruction identical to a load word instruction that does not perform a load but touches the cache and continues, setting in motion any required transactions and cache misses up the hierarchy. These instructions are hoisted by the compiler up the instruction stream, not just within the same basic block. They can therefore tolerate high latencies in memory, provided the correct loads can be predicted. There are many more explicit controls on caching in the instruction set to attempt to manage the non-deterministic nature of large cache hierarchies. Problems again arise from the speculative nature of the solution. If for some reason the pre-fetch fails, either because of a conflict or insufficient delay between the pre-fetch and the genuine load word instruction, then a software interrupt is triggered, incurring a large delay and overhead.

EPIC compilers face a major problem in constructing a plan of execution: they cannot predict all conditional branches and know which execution path is taken [29]. To some extent this uncertainty is mitigated by predicated execution but, as already indicated, this is wasteful of resources and power and, like all speculative approaches, can cause unpredictability in performance. Although object code compatibility has been solved to some extent, the forward compatibility is only as good as the compiler's ability to generate good schedules in the absence of dynamic information. Also the code size problem is still a challenge facing the EPIC architecture [29].

2.4. Multiscalar

Another paradigm to extract even more ILP from sequential code is the multiscalar architecture. This architecture extends the concept of superscalar processors by splitting one wide processor into multiple superscalar processors. In a superscalar architecture, the program code has no explicit information regarding ILP; only the hardware can be employed to discover the ILP from the program. In multiscalar, the program code is divided into a set of tasks or code fragments, which can be identified statically by a combination of the hardware and software. These tasks are blocks in the control flow graph of the program and are identified by the compiler. The purpose of
this approach is to expose a greater concurrency explicitly by the compiler.

The global control unit used in this architecture distributes the tasks among multiple parallel execution units. Each execution unit can fetch and execute only the instructions belonging to its assigned task. So, when a task misprediction is detected, all execution units between the incorrect speculation point and the later task are squashed [30]. Like superscalar, this can result in many wasted cycles; however, as the depth of speculation is much greater, the unpredictability in performance is correspondingly wider.

The benefit of this architecture over a superscalar architecture is that it provides more scalability. The large instruction window is divided into smaller instruction windows, one per processing unit, and each processing unit searches a smaller instruction window for independent instructions. This mitigates the problems of scaling instruction issue with issue width. The multiple tasks are derived from loops and function calls, allowing the effective size of the instruction window to be extremely large. Note that not all instructions within this wide range are simultaneously being considered for execution [31]. This optimization of the instruction window is offset by a potentially large amount of communication, which may affect the overall system performance.

Communication arises because of dependencies between tasks; examples are loop-carried dependencies, function arguments and results. Results stored to registers that are required by another task are routed from one processor to another at run-time via a unidirectional ring network. Recovery from misspeculation is achieved by additional hardware that maintains two copies of the registers, along with a set of register masks, in each processing unit [32]. In summary then, although the multiscalar approach mitigates the instruction window scaling, allowing wider issue width, in practice it requires many of the same complex mechanisms as superscalar and, being speculative, is unlikely to be able to perform as consistently as a scalable CMP.

2.5. Multi-threading

In order to improve processor performance, modern microprocessors try to exploit TLP through a multi-threading approach, even at the same time as they exploit ILP. Multi-threading is a technique that tolerates delays associated with synchronizing, including synchronizing with remote memory accesses, by switching to a new thread when one thread stalls. Many forms of explicit multi-threading techniques have been described, such as interleaved multi-threading (IMT), blocked multi-threading (BMT) and SMT. A good survey of multi-threading is given in [17].

A number of supercomputers designed by Burton Smith have successfully exploited IMT; these include the Denelcor HEP and the Horizon, and culminated in the Tera architecture [33]. This approach is perhaps the closest to that of microthreading described in this paper, although the processor was designed as a component of a large multi-computer and not as a general-purpose chip. The interleaved approach requires a large concurrency in order to maintain efficient pipeline utilization, as it must be filled with instructions from independent threads. Unlike the earlier approaches, Tera avoids this requirement using something called explicit-dependence lookahead, which uses an instruction tag of 3 bits that specifies how many instructions can be issued from the same stream before encountering a dependency on it. This minimizes the number of threads required to keep the pipeline running efficiently, which is ∼70 in the case of memory accesses. It will be seen that microthreading uses a different approach that maintains full backward compatibility in the ISA, as well as in the pipeline structure.

Unlike IMT, which usually draws concurrency from ILP and loops, BMT usually exploits regular software threads. There have been many BMT proposals, see [17], and even some commercial designs such as Sun's Niagara processor [34]. However, the concurrency exposed in BMT architectures is limited, as resources such as register files must be duplicated to avoid excessive context switching times. This limits the applicability of BMT to certain classes of applications, such as servers.

SMT is probably the most popular and commercial form of multi-threading in use today. In this approach, multiple instructions from multiple threads provide ILP for multiple execution units in an out-of-order pipeline. Several recent architectures have either used or proposed SMT, such as the Hyper-Threading Technology in the Intel Xeon processor [35] and the Alpha 21464 [36]. As already described, the main problem with an SMT processor is that it suffers from the same scalability issues as a superscalar processor, i.e. layout blocks and circuit delays grow faster than linearly with issue width. In addition to this, multiple threads share the same level-1 I-cache, which can cause high cache miss rates, all of which provides limits to its ultimate performance [18].

2.6. Recent CMPs

From the above discussion we see that most current techniques for exploiting concurrency suffer from software and/or hardware difficulties, and the focus of research and development activity now seems to be on CMPs. These designs give a more flexible and scalable approach to instruction issue, freeing them to exploit Moore's law through system-level concurrency. Some applications can exploit such concurrency through the use of multi-threaded applications. Web and other servers are good examples; however, the big problem is how to program CMPs for general-purpose computation and whether performance can ever be achieved from legacy sequential code, either in binary or even source form.

Several recent projects have investigated CMP designs [1, 2, 3, 37]. Typically, the efficiency of a CMP depends on
the degree and characteristics of the parallelism. Executing multiple processes or threads in parallel is the most common way to extract a high level of parallelism, but this requires concurrency in the source code of an application. Previous research has demonstrated that a CMP with four 2-issue processors will reach a higher utilization than an 8-issue superscalar processor [17]. Also, work described in [3] shows that a CMP with eight 2-issue superscalar processors would occupy the same area as a conventional 12-issue superscalar. The use of CMPs is a very powerful technique to obtain more performance in a power-efficient manner [38]. However, using superscalar processors as a basis for CMPs, with their complex issue window, large on-chip memory, large multi-ported register file and speculative execution, is not such a good strategy because of the scaling problems already outlined. It would be more efficient to use simpler in-order processors and exploit more concurrency at the CMP level, provided that this can be utilized by a sufficiently wide range of applications. This is an active area of research in the compiler community and, until this problem is solved, CMPs based on user-level threads will only be used in applications which match this requirement, such as large server applications, where multiple service requests are managed by threads.

3. REGISTER FILES

The register file is a major design obstacle to scalable systems. All systems that implement concurrency require some form of synchronizing memory. In a dataflow architecture, this is the matching store; in an out-of-order issue processor it is the register file, supported by the instruction window, reservation stations and re-order buffer. To implement more concurrency and higher levels of latency tolerance, this synchronizing memory must be increased in size. This would not be a problem except that in centralized architectures, as issue width increases, the number of ports to this synchronizing memory must also increase. The problem is that the cell size grows quadratically with the number of ports or issue width. If N instructions can be issued in one cycle, then a central register file requires 2N read ports and N write ports to handle the worst-case scenario. This means that the register cell size grows quadratically with N. Moreover, as the number of registers also increases with the issue width, a typical scaling of register file area is as the cube of N.

The register file in the proposed 8-way issue Alpha 21464 had a single 512-register file with 24 ports in total. It occupied an area some five times the size of the 64 Kbyte L1 D-cache [39]. Also, in Motorola's M.CORE architecture, the register file energy consumption can be 16% of the total processor power and 42% of the datapath power [40]. It is clear therefore that the multi-ported register files in modern microprocessors consume significant power and die area. Several projects have investigated the register file problem in terms of reducing the number of registers or minimizing the number of read or write ports [14, 15, 41, 42].

Work done in [41] describes a bypass scheme to reduce the number of register file read ports by avoiding unnecessary register file reads for the cases where values are bypassed. In this scheme an extra bypass hint bit is added to each operand of instructions waiting in the issue window and a wake-up mechanism is used to reduce the number of register file read ports. As described in [43], this technique has two main problems. First, the scheme is only a prediction, which can be incorrect, requiring several additional repair cycles for recovery on misprediction. Second, because the bypass hint is not reset on every cycle, the hint is optimistic and can be incorrect if the source instruction has written back to the register file before the dependent instruction is issued. Furthermore, an extra pipeline stage is required to determine whether to read data operands from the bypass network or from the register file.

Other approaches include a delayed write-back scheme [42], where a memory structure is used to delay the write-back results for a few cycles to reduce register file ports. The disadvantage of this scheme is that it is necessary to write the results both to the register file and the write-back queue concurrently to avoid consistency problems during register renaming. The authors propose an extension to this scheme to reduce the number of register write ports. However, this extension suffers from an IPC penalty and it degrades the pipeline performance. Furthermore, in this model, any branch misprediction causes a pipeline stall and insufficient use of the delayed write-back queue. In fact, most previous schemes for minimizing the multi-ported register file have required changes in the pipeline design and do not enable full scalability. At best they provide a constant remission in the scalability of the register file.

Recently Rixner et al. [14] suggested several partitioning schemes for the register file from the perspective of streaming applications, including designs spanning a central register file through to a distributed register file organization. Their results, not surprisingly, show that a centralized register file is costly and scales as O(N³), while in the distributed scheme, each ALU has its own port to connect to the local register file and another port to access other register files via a fast crossbar switch network. This partitioning proved to use less area and power and caused less delay compared with the purely global scheme, and was also shown to provide a scalable solution. The distributed configuration also has a smaller access time compared with the centralized organization. In this work, a 128 × 32-bit register file with 16 read ports and 8 write ports is used for the central register file and is compared to 8 local register files of 32 × 32-bit registers, each with 2 read ports, 1 write port and 1 read/write port for external access. The result from their CACTI model showed a 47.8% reduction in access time for the distributed register file organization across all technologies [44].

It is not clear from this work whether the programming model for the distributed register file is sufficiently general for most computations. With a distributed register file
and explicitly routed network, operations must be scheduled by the compiler and routing information must also be generated with the code for each processor in order to route results from one processor's register file to another. Although it may be possible to program streaming applications using such a model, in general, concurrency and scheduling cannot be defined statically.
Other previous work has described a distributed register file configuration [44], where a fully distributed register file organization is used in a superscalar processor. The architecture exploits a local register mapping table and a dedicated register transfer network to implement this configuration. This architecture requires an extra hardware recopy unit to handle the register file dispatch operations. Also, this architecture suffers from a delay penalty, as the execution unit of an instruction that requires a value from a remote register file must stall until it is available. The authors have proposed an eager transfer mechanism to reduce this penalty, but this still suffers from an IPC penalty and requires both central issue logic and global renaming.
In our research, it seems that only the microthreaded model provides sufficient information to implement a penalty-free distributed register file organization. Such a proposal is given in [6], where each processor in a CMP has its own register file in a shared-register model. Access to remote data is described in the binary code and does not require speculative execution or routing. The decoupling is provided by a synchronization mechanism on registers and the routing is decoupled from the operation of the microthreaded pipeline, exploiting the same latency tolerance mechanisms as used for main memory access. Section 4.5 explains in more detail how the microthreaded CMP distributes data to the register files and hides the latency during a remote register file access.
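To illustrate how synchronization on registers can decouple producer and consumer threads, the following Python sketch models a register file in which each register carries a full/empty bit. The class and method names are our own and the scheme is deliberately simplified; it is not the implementation proposed in [6]. A read of an empty register suspends the reading thread, and the write that eventually fills the register returns the set of threads that can be rescheduled.

```python
# Illustrative sketch only: a register file with a full/empty
# synchronization bit per register. Reading an empty register suspends
# the reader; the write that fills it releases any suspended readers.
class SyncRegisterFile:
    def __init__(self, size):
        self.values = [None] * size                   # register contents
        self.full = [False] * size                    # full/empty bits
        self.waiting = {r: [] for r in range(size)}   # suspended readers

    def read(self, reg, thread):
        """Return the value, or None after suspending `thread` on an empty register."""
        if self.full[reg]:
            return self.values[reg]
        self.waiting[reg].append(thread)              # thread suspends on this register
        return None

    def write(self, reg, value):
        """Fill the register and return the threads woken by this write."""
        self.values[reg] = value
        self.full[reg] = True
        resumed, self.waiting[reg] = self.waiting[reg], []
        return resumed                                # ready to be rescheduled

rf = SyncRegisterFile(4)
assert rf.read(0, "T1") is None       # T1 suspends on the empty register R0
assert rf.write(0, 42) == ["T1"]      # the write to R0 resumes T1
assert rf.read(0, "T2") == 42         # later reads succeed immediately
```

Because suspension is recorded per register rather than in any central structure, this style of synchronization distributes naturally across per-processor register files, which is what makes the shared-register model scalable.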
4. THE MICROTHREADED MODEL

4.1. Overview
In this section we consider the microthreaded concurrency model in more detail and describe the features that support the implementation of a scalable CMP based on it. This model was first described in [45] and was then extended in [6, 19, 46] to support systems with multiple processors on-chip.
Like the Tera, this model combines the advantages of BMT and IMT but does so by explicitly interleaving microthreads on a cycle-by-cycle basis in a conventional pipeline. This is achieved using an explicit context switch instruction, which is acted upon in the first stage of the pipeline. Context switching is performed when the compiler cannot guarantee that data will be available to the current instruction and is used in conjunction with a synchronization mechanism on the register file that suspends the thread until the data becomes available. The context switch control is not strictly necessary, as this can be signalled from the synchronization failure on the register read. However, it significantly increases the efficiency of the pipeline, especially when a large number of thread suspensions occur together, in which case the model resembles that of an IMT architecture. Only when the compiler can define a static schedule are instructions from the same thread scheduled in BMT mode. Exceptions to this are cache misses, iterative operations and inter-thread communications. There is one other situation where the compiler will flag a context switch and that is following any branch instruction. This allows execution to proceed non-speculatively, eliminates the branch prediction and cleanup logic and fills any control-hazard bubbles with instructions from other threads, if any are active.
The model is defined incrementally and can be applied to any RISC or VLIW instruction set. See Table 1 for details of the extensions to an instruction set required to implement the concurrency controls required by this model. The incremental nature of the model allows a minimal backward compatibility, where existing binary code can execute unchanged on the conventional pipeline, although without realizing any of the benefits of the model.
Microthreading defines ILP in two ways. Sets of threads can be specified where those threads generate MIMD concurrency within a basic block. Each thread is defined by a pointer to its first instruction and is terminated by one or more Kill instructions, depending on whether it branches or not. Sets of threads provide concurrency on one pipeline and share registers. They provide latency tolerance through explicit context switching for data and control hazards. Iterators, on the other hand, define SPMD concurrency by exploiting a variety of loop structures, including for and while loops. Iterators give parametric concurrency by executing iterations in parallel subject to dataflow constraints. Independent loops have no loop-carried dependencies and can execute with minimal overhead on multiple processors. Dependent loops can also execute on multiple processors, exploiting instruction level concurrency, but during the execution of dependency chains activity will move from one processor to another and speedup will not be linear. Ideally, dependency chains should execute with minimal latency, and parameters for the create instruction in Table 1 allow dependencies to be bypassed on iterations executed on a single processor, giving the minimal latency possible, i.e. one pipeline cycle per link in the chain.
Iterators share code between iterations and use a set of threads to define the loop body. This means that some form of context must be provided to differentiate multiple iterations executing concurrently. This is achieved by allocating registers to iterations dynamically. A family of threads, then, is defined by an iterator comprising a triple of start, step and limit over a set of threads. Information is also required that defines the micro-context associated with an iteration and, as each iteration is created, registers for its micro-context are allocated dynamically. To create a family of threads a single instruction is executed on one processor, which points to a thread control block (TCB) containing the above parameters. Iterations can
TABLE 1. Concurrency-control instructions.

Instruction   Instruction behaviour
Cre           Creates a new family of threads
Swch          Causes a context switch to occur
Kill          Terminates the thread being executed
Bsync         Waits for all other threads to terminate
Brk           Terminates all other threads
then be scheduled on one or more processors as required to achieve the desired performance.
Virtual concurrency on a single pipeline defines the latency that can be tolerated and is limited by the size of the local register file or continuation queue in the scheduler. The latter holds the minimal state associated with each thread. Both are related by two characteristics of the code: the number of registers per micro-context and the cardinality of the set of threads defining the loop body. In this model, all threads are drawn from the same context and the only state manipulated in the architecture is the thread's execution state, its PC and some information about the location of its micro-context in the local register file. This mechanism removes any need to swap register values on a context switch.
Theoretically, physical concurrency is limited only by the silicon available to implement a CMP, as all structures supporting this model are scalable and are related to the amount of the virtual concurrency required for latency tolerance, i.e. register file, continuation queue and register allocation logic. Practically, physical concurrency will be limited by the extent of the loops that the compiler can generate, whether they are independent or contain loop-carried dependencies and, ultimately, the overheads in distribution and synchronization that frame the SPMD execution. Note that thread creation proceeds in two stages. A conceptual schedule is determined algorithmically on each processor following the creation of a family of microthreads, but the actual thread creation, i.e. the creation of entries in the continuation queue, occurs over a period of time at the rate of one thread per cycle, keeping up with the maximum context-switch rate. This continues while resources are available.
Figure 1 shows a simple microthreaded pipeline with five stages and the required shared components used in this model. Notice that no additional stages are required for instruction issue, retiring instructions, or routing data between processors' register files. Short pipelines provide low latency for global operations. Note that more pipeline stages could be used to reduce the clock period, as is the current trend in microprocessor architecture. However, as concurrency provides the most power-efficient solution to performance, it is moot whether this is a sound strategy. Two processors can give the same performance as one double-speed processor but do so with less power dissipated. Dynamic power dissipation is proportional to frequency f and V², but V can be reduced with f, giving a quadratic reduction in the energy required by exploiting concurrency in a computation, at least over some range of voltage scaling.
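The power argument above can be illustrated numerically. The sketch below assumes, for simplicity, that supply voltage scales linearly with frequency inside the permissible voltage-scaling range; the function names and the unit constants are illustrative, not taken from the paper.

```python
# Sketch: dynamic power P ~ C * f * V^2, with V assumed (for
# illustration) to scale linearly with f within the scaling range.
def dynamic_power(f, v, c=1.0):
    """Relative dynamic power dissipation."""
    return c * f * v * v

# One processor at frequency 2f needs voltage 2v under the
# linear-scaling assumption.
single = dynamic_power(2.0, 2.0)    # 8.0 units
# Two processors at frequency f and voltage v deliver the same
# aggregate throughput.
dual = 2 * dynamic_power(1.0, 1.0)  # 2.0 units
print(single / dual)                # 4.0: the quadratic advantage
```

The 4x ratio is the "quadratic reduction" referred to in the text: halving f allows V to halve, so each processor dissipates one eighth of the power and two of them together one quarter.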
The set of shared components used in this model to support the microthreaded CMP is minimal and their use is infrequent. These components are the broadcast bus, used to create a family of threads, and a ring network for register sharing. In a Cre instruction a pointer to the TCB is distributed to each processor, where the scheduler will use this information to determine the subset of the iterations it will execute. The broadcast bus is also used to replicate global state to each processor's local register file instead of accessing a centralized register file. Both operations are low in frequency and can be amortized over the execution of multiple iterations.
Replication of global variables is one of two mechanisms that allow the register file in the microthreaded model to be fully distributed between the multiple processors on a chip. The other is the shared-register ring network, which allows communications between pairs of threads to implement loop-carried dependencies. This communication between the shared and dependent threads is described in more detail below, but uses the ring network only if two iterations are allocated to different processors. Note that schedules can be defined to minimize inter-processor communication; more importantly, this communication is totally decoupled from the pipeline's operation through the use of explicit context switching.
Both types of global communication, i.e. the broadcast bus and the ring network, are able to use asynchronous communications, creating independent clocking domains for each processor. Indeed, the broadcast mechanisms could be implemented at some level by eager propagation within the ring network.

4.2. Concurrency controls
The microthreaded model is a generic one, as it can be applied to any ISA, so long as its instructions are executed in-order. In addition, the model can be designed to maintain full backward compatibility, allowing existing binary code to run without speedup [6] on a microthreaded pipeline. Binary compatibility with speedup can also be obtained using binary-to-binary translation to identify loops and dependencies, adding instructions to support the concurrent execution of those loops and/or the concurrency within the basic blocks.
Table 1 shows the five instructions required to support this model on an existing ISA. The instructions create a family of threads, explicitly context switch between threads and kill a thread, and two instructions provide for global synchronization. One is a barrier synchronization, the other a form of break instruction, which forces a break from a loop executed concurrently. These concurrency controls provide an efficient and flexible method to extract high levels of ILP from existing code. Each instruction will now be described in more detail.
FIGURE 1. Microthreaded microprocessor pipeline.
TABLE 2. TCB containing parameters that describe a family of microthreads.

Threads        Cardinality of the set of threads representing an iteration
Dependency     Iteration offset for any loop-carried dependencies, e.g. a[i] := ... a[i-d]
Preambles      Number of iterations using preamble code
Postambles     Number of iterations using postamble code
Start          Start of loop index value
Limit          Limit of loop index value
Step           Step between loop indices
Locals         Number of local registers dynamically allocated per iteration
Shareds        Number of shared registers dynamically allocated per iteration
Pre-pointer    One pointer per thread in set for preamble code
Main-pointer   One pointer per thread in set for main loop-body code
Post-pointer   One pointer per thread in set for postamble code
4.2.1. Thread creation
The microthreaded model defines explicit and parametric concurrency using the Cre instruction. This instruction broadcasts a pointer to the TCB to all processors assigned to the current context; see [47] for details of dynamic processor allocation. The TCB contains parameters which define a family of threads, i.e. the set of threads representing the loop body and the triple defining the loop. It also defines the dynamic resources required by each thread (its micro-context) in terms of local and shared registers. For loops which carry a dependency, it also defines a dependency distance and, optionally, pointers to a preamble and/or postamble thread, which can be used to set up and/or terminate a dependency chain. The dependency distance is a constant offset in the index space which defines regular loop-carried dependencies. A family of threads can be created without requiring a pipeline slot, as the create instruction is executed concurrently with a regular instruction in the IF stage of the pipeline. The TCB for our current work on implementation overheads is defined in Table 2.
A global scheduling algorithm determines which iterations will execute on which processors. This algorithm is built into the local scheduler, but the parameters in the TCB and the number of processors used to execute the family may be dynamic. The concurrency described by this instruction is therefore parametric and may exceed the resources available in terms of registers and thread slots in the continuation queues. The register allocation unit in each local scheduler maintains the allocation state of all registers in each register file and this controls the creation of threads at a rate of one per pipeline cycle. Once allocated to a processor, a thread runs to completion, i.e. until it encounters a Kill instruction and then terminates.
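As a concrete reading of Table 2 and the global scheduling algorithm, the sketch below encodes the TCB fields and one possible schedule. The field names and the modulo distribution are illustrative assumptions; the paper leaves the schedule algorithmic and implementation-defined.

```python
from dataclasses import dataclass

# Hypothetical encoding of the TCB fields of Table 2; the names are
# illustrative, not the paper's binary layout.
@dataclass
class TCB:
    start: int        # start of loop index value
    limit: int        # limit of loop index value
    step: int         # step between loop indices
    dependency: int   # iteration offset of any loop-carried dependency
    locals: int       # local registers allocated per iteration
    shareds: int      # shared registers allocated per iteration

def local_schedule(tcb, proc_id, n_procs):
    """One possible global schedule: iteration k runs on processor
    k mod n_procs, so neighbouring iterations sit on neighbouring
    processors and constant-stride dependencies can use only
    nearest-neighbour links on the ring network."""
    iterations = range(tcb.start, tcb.limit, tcb.step)
    return [i for k, i in enumerate(iterations) if k % n_procs == proc_id]

tcb = TCB(start=0, limit=12, step=1, dependency=1, locals=4, shareds=2)
print(local_schedule(tcb, 0, 4))   # [0, 4, 8]
print(local_schedule(tcb, 1, 4))   # [1, 5, 9]
```

Each scheduler evaluates the same function with its own processor id, so no central dispatcher is needed; the family's concurrency remains parametric in both the TCB triple and the number of processors.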
A terminated thread releases its resources so long as any dependent thread has also terminated. To do so before this may destroy data that has not yet been read by the dependent thread. Note that microthreads are usually (but not exclusively) very short sequences of instructions without internal loops.

4.2.2. Context switching

The microthreaded context switching mechanism is achieved using the Swch instruction, which is acted upon in the first stage of the pipeline, giving a cycle-by-cycle interleaving if
necessary. When a Swch instruction is executed, the IF stage reads the next instruction from another ready thread, whose state is passed to the IF stage as a result of the context switch. As this action only requires the IF stage of the pipeline, it can be performed concurrently with an instruction from the base ISA, so long as the Swch instruction is pre-fetched with it. The context switching mechanism is used to manage both control and data dependencies. It is used to eliminate control dependencies by context switching following every transfer of control, in order to keep the pipeline full without any branch prediction. This has the advantage that no instruction is executed speculatively and consequently, power is neither dissipated in making a prediction nor in executing instructions on the wrong dynamic path. Context switching also eliminates bubbles in the pipeline on data dependencies that have non-deterministic timing, such as loads from memory or thread-to-thread communication. Context switching provides an arbitrarily large tolerance to latency, determined by the size of the local register file.

4.2.3. Thread synchronization

The only synchronizing memory in the microthreaded model is provided by the registers and this gives an efficient and scalable mechanism for synchronizing data dependencies. The synchronization is performed using two synchronization bits associated with every register, which differentiate between the following states: full, empty, waiting-local and waiting-remote. Registers are allocated to micro-contexts in the empty state and a read to an empty register will fail, resulting in a reference to the microthread that issued the instruction being stored in that register. This reference passes down the pipeline with each instruction executed. Using the continuation queue in the scheduler, lists of continuations may be suspended on a register, which is required when multiple threads are dependent on the value to be stored there. All registers therefore implement I-structures in a microthreaded microprocessor. In the full state, registers operate normally, providing data upon a register read and, if no synchronization is required, a register can be repeatedly written to without changing its synchronization state, to provide backward compatibility. The compiler can easily recognize the potential for a synchronization failure if a schedule for the dependency is not known at compile time. If so, it inserts a context switch on the dependent instruction. Examples include instructions dependent on a prior load word, on data produced in another thread, or on data produced in iterative CPU operations.

The register is set to one of the waiting states when it holds a continuation. Two kinds of continuation are distinguished: waiting-local, when the register holds the head of a list of continuations to local microthreads, and waiting-remote, when the register holds a remote request for data from another processor. The latter enables the micro-context for one iteration to be stored for read-only access on a remote processor when managing loop-carried dependencies. This implements a scalable and distributed shared-register model between processors without using a single, multi-ported register file, which is known to be unscalable.

The use of dataflow synchronization between threads enables a policy of conservative instruction execution to be applied. When no microthreads are active because all are waiting on external events, such as load-word requests, the pipeline will stall and, if the pipe is flushed completely, the scheduler will stop clocks and power down the processor, going into a standby mode in which it consumes minimal power. This is a major advantage of data-driven models. Conservative instruction execution policies conserve power, in contrast to the eager policies used in out-of-order issue pipelines, which have no mechanisms to recognize such a conjunction of schedules. This will have a major impact on power conservation and efficiency.

Context switching and successful synchronization have no overhead in terms of additional pipeline cycles. The context switch interleaves threads in the first stage of the pipeline, if necessary on a cycle-by-cycle basis. Synchronization occurs at the register-read stage and only if it fails will any exceptional action be triggered. On a synchronization failure, control for the instruction is mutated to store a reference to the microthread in the register being read. This means that the only overhead in supporting these explicit concurrency controls is the additional cycle required to reissue the failed instruction when the suspended thread is reactivated by the arrival of the data. Of course there are overheads in hardware, but this is true for any model.

The model also provides a barrier synchronization (Bsync) instruction, which suspends the issuing thread until all other threads have completed, and a Brk instruction, which explicitly kills all other threads, leaving only the main thread. These instructions are required to provide bulk synchronization for memory consistency. There is no synchronization on main memory; only the registers are synchronizing. This means that two different microthreads in the same family may not read after write to the same location in memory, because the ordering of those operations cannot be guaranteed. It also means that any loop-carried dependencies must be compiled to use register variables. A partitioning of the micro-context supports this mechanism efficiently.

We do not address the design of the shared memory system in this paper, although we are currently investigating several approaches. Indeed, the latency tolerance provided by this model makes the design of the memory system somewhat flexible. For example, a large, banked, multi-ported memory would give a solution that would provide all the buffering required for the large number of concurrent requests generated by this model. It is important to note that, using in-order processors and a block-based memory consistency model, memory ordering does not pose the same problem as it does in an out-of-order processor.
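The I-structure behaviour of a synchronizing register described in Section 4.2.3 can be sketched as follows. This is a simplified software model: it collapses the waiting-local and waiting-remote cases into a single waiting state with a continuation list, and the class and method names are illustrative.

```python
# Sketch of a synchronizing register: a read of an empty register
# suspends the reader's continuation in the register itself; the
# matching write reactivates every suspended reader.
EMPTY, FULL, WAITING = "empty", "full", "waiting"

class SyncRegister:
    def __init__(self):
        self.state = EMPTY       # registers are allocated empty
        self.value = None
        self.continuations = []  # suspended microthread references

    def read(self, thread_ref):
        if self.state == FULL:
            return self.value                  # normal register read
        self.continuations.append(thread_ref)  # synchronization failure:
        self.state = WAITING                   # the reader suspends here
        return None

    def write(self, value):
        woken = self.continuations
        self.state, self.value, self.continuations = FULL, value, []
        return woken                           # threads to reschedule

r = SyncRegister()
assert r.read("thread-3") is None     # read fails, thread-3 suspends
assert r.write(42) == ["thread-3"]    # the write wakes the waiting thread
assert r.read("thread-7") == 42       # subsequent reads succeed normally
```

In hardware the same behaviour is driven by the two synchronization bits per register and a small state machine on each register port, with the continuation list threaded through the scheduler's continuation queue rather than stored in the register itself.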
4.2.4. Thread termination
Thread termination in the microthreaded model is achieved through a Kill instruction, which of course causes a context switch as well as updating the microthread's state to killed. The resources of the killed thread are released at this stage, unless there is another thread dependent upon it, in which case its resources will not be released until the dependent thread has also been killed. (Note that this is the most conservative policy and more efficient policies may be implemented that detect when all loop-carried dependencies have been satisfied.)

4.3. Scalable instruction issue
Current microprocessors attempt to extract high levels of ILP by issuing independent instructions out of sequence. They do this most successfully by predicting loop branches and unrolling multiple iterations of a loop within the instruction window. The problem with this approach has already been described; a large instruction window is required in order to find sufficient independent instructions and the logic associated with it grows at least with the square of the issue width.
If we compare this with what is happening in the microthreaded model, we see that almost exactly the same mechanism is being used to extract ILP, with one major difference: a microthreaded microprocessor executes fragments of the sequential program out of order. These fragments (the microthreads) are identified at compile time from loop bodies and conventional ILP and may execute in any order, subject only to dataflow constraints. Instructions within fragments, however, issue and complete in-order. We have already seen that a context switch suspends a fragment at instructions whose operands have non-deterministic timing. The dependent instruction is issued and stores a pointer to its fragment if a register operand is found to be empty. Any suspended fragments are rescheduled when data is written to the waiting register. Thus only instructions up to the first dependency in each fragment (loop body) are issued and only that instruction will be waiting for the dependency to be resolved; all subsequent instructions in that pipeline will come from other fragments. In an out-of-order issue model the instruction window is filled with all instructions from each loop unrolled by branch prediction, because it knows nothing a priori about the instruction schedules.
Consider a computation that only ever contains one independent instruction per loop of l instructions; then to get n-way issue, n loops must be unrolled and the instruction window will contain n*l instructions for each n instructions issued. In comparison, the microthreaded model would issue the first n independent instructions from n threads (iterations), then it would issue the first dependent instructions from the same n threads before context switching. The next n instructions would then come from the next n iterations (threads). Synchronization, instead of taking place in a global structure with O(n²) complexity, is distributed to n registers and has linear complexity. Each thread waits for
the dependency to be resolved before being able to issue any new instructions. In effect, the instruction window in a microthreaded model is distributed to the whole of the architectural register set and only one link in the dependency graph for each fragment of code is ever exposed simultaneously. Moreover, no speculation is ever required and consequently, if the schedules are such that all processors would become inactive, then this state can be recognized and used to power down the processors to conserve energy.
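The window arithmetic in the comparison above can be made explicit. The sketch below restates the n*l versus n counts; the function names are illustrative.

```python
# Sketch of the instruction-window arithmetic above: a loop body of
# l instructions containing one independent instruction per iteration.
def ooo_window(n, l):
    """Out-of-order issue: n unrolled iterations of l instructions
    must all sit in the window to expose n independent instructions,
    with wake-up/select logic growing at least as O(n^2)."""
    return n * l

def microthreaded_exposed(n):
    """Microthreading: at most one suspended instruction per thread,
    one exposed link per dependency chain; synchronization state is
    distributed over n registers and grows linearly."""
    return n

n, l = 8, 16
print(ooo_window(n, l))          # 128 window slots
print(microthreaded_exposed(n))  # 8 suspended instructions
```

For an 8-wide machine and a 16-instruction loop body the out-of-order window must hold 128 instructions to find the same 8-way independence that microthreading exposes with 8 suspended instructions spread across the register file.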
Compare this to the execution in an out-of-order processor, where instructions are executed speculatively regardless of whether they are on the correct execution path. Although predictions are generally accurate in determining the execution path in loops, if the code within a loop contains unpredictable, data-dependent branches, this can result in a lot of energy being consumed for no useful work. Researchers now talk about 'breaking the dependency barrier' using data in addition to control speculation, but what does this mean? Indices can be predicted readily, but these are not true dependencies and do not constrain the microthreaded model; addresses based on those indices can also be predicted with a reasonable amount of accuracy, but again these do not constrain the microthreaded model. This leaves true computational data dependencies, which can only be predicted under very extraordinary circumstances. It seems therefore that there is no justification for consuming power in attempting data speculation.
Out-of-order issue has no global knowledge of concurrency or synchronization. Microthreading, on the other hand, is able to execute conservatively as it does have that global knowledge. Real dependencies are flagged by context switching, concurrency is exposed by dynamically executing parametric Cre instructions and the namespace for synchronization spans an entire loop, as registers are allocated dynamically. At any instant the physical namespace is determined by the registers that have been allocated to threads.

4.4. Thread state
When a thread is assigned resources by the scheduler, it is initially set to the waiting state, as it must wait for its code to be loaded into the I-cache before it can be considered active. A thread will go into a suspended state when it has been context switched, until either the register synchronization has been completed or the branch target has been defined, when it again goes into the waiting state. The scheduler generates a request to the I-cache to pre-fetch the required code for any thread that enters the waiting state. If the required code is available, then the I-cache acknowledges the scheduler immediately, otherwise not until the required code is in the cache. The thread's state becomes ready at this stage. A killed state is also required to indicate those threads that have completed but whose data may still be in use. At any time there is just one thread per processor which is in the running state; on start-up this will be the main thread.
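The thread life cycle above can be summarized as a transition table. The state names are the paper's; the event names and the initial "created" pseudo-state are illustrative assumptions used to make the table self-contained.

```python
# Sketch of the thread life cycle of Section 4.4 as a transition
# table; event names are illustrative.
TRANSITIONS = {
    ("created",   "allocate"):       "waiting",    # awaits I-cache pre-fetch
    ("waiting",   "icache_ack"):     "ready",      # code is in the cache
    ("ready",     "dispatch"):       "running",    # one running thread/processor
    ("running",   "context_switch"): "suspended",  # awaits sync or branch target
    ("suspended", "data_arrived"):   "waiting",    # re-fetch before active again
    ("running",   "kill"):           "killed",     # data may still be in use
}

def step(state, event):
    return TRANSITIONS[(state, event)]

s = "created"
for e in ["allocate", "icache_ack", "dispatch", "context_switch", "data_arrived"]:
    s = step(s, e)
print(s)  # waiting
```

Note that a suspended thread returns to waiting rather than ready, because its code must be re-fetched (or re-acknowledged by the I-cache) before it can be dispatched again.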
On a context switch or kill, the instruction fetch stage is provided with the state of a new thread if any are active; otherwise the pipeline stalls for a few cycles to resolve the synchronization and, if it fails, the pipeline simply stops. This action is simple, requires no additional flush or cleanup logic and, most importantly, is conservative in its use of power. Note that, by definition, when no local threads are active, the synchronization event has to be asynchronous and hence does not require any local clocks.

The state of a thread also includes its program counter, the base address of its micro-context and the base address and location of any micro-contexts it is dependent upon. The state also includes an implicit slot number, which is the address of the entry in the continuation queue and which uniquely identifies the thread on a given processor. The last field required is a link field, which holds a slot number for building linked lists of threads to identify empty slots, ready queues and an arbitrary number of continuation queues that support multiple continuations on different registers. The slot reference is used as a tag to the I-cache and is also passed through the pipeline and stored in the relevant operand register if a register read fails, where it forms the head of that continuation queue.

4.5. Register file partitioning and distribution

We have already seen that Rixner et al. [14] have shown that a distributed register file architecture achieves a better performance compared with a global solution and also provides superior scaling properties. Their work was based on streaming applications, where register sources and destinations are compiled statically. We will show that such a distributed organization can also be based on extensions to a general-purpose ISA with dynamic scheduling. The concept of a dynamic micro-context associated with parallelizing different iterations has already been introduced and is required in order to manage communications between micro-contexts in a scalable manner. It is necessary for the compiler to partition the micro-context into different windows representing different types of communication and for the hardware to recognize these windows, to trap a register read to an appropriate distributed communication mechanism.

A microthreaded compiler must recognize and identify four different types of communication patterns. There are a number of ways in which this partitioning can be encoded, and here we describe a simple and efficient scheme that supports a fully distributed register file based on a conventional RISC ISA, assuming a 5-bit register specifier and hence a 32-register address space per microthread (although not the same 32 registers for each thread).

The first register window is the global window (represented by $Gi). These registers are used to store loop invariants or any other data that is shared by all threads. In other models of concurrency these would represent broadcast data, which are written by one and read by many processes. Their access patterns have the characteristic that they are written to infrequently but read from frequently. The address space in a conventional RISC ISA is partitioned so that the lower 16 registers form this global window. These are statically allocated for a given context and every thread can read and/or write to them. Note that the main thread has 32 statically allocated registers, 16 of which are visible to all microthreads as globals and 16 of which are visible only to the main thread. Each thread sees 32 registers. The lower 16 of these are the globals, which are shared by all threads, and those in the upper half are local to a given thread.

The upper 16 registers are used to address the micro-context of each iteration in a family of threads. As each iteration shares common code, the address of each micro-context in the register file must be unique to that iteration. As we have seen, the base address of a thread's micro-context forms a part of its state. This immediately gives a means of implementing a distributed, shared-register model. We need to know the processor on which a thread is running and the base address of its micro-context in order to share its data. However, we can further partition the micro-context into a local part and a shared part to avoid too much additional complexity in implementing the pipeline. Three register windows are mapped to the upper, or dynamic, half of the address space for each micro-context. These are the local window ($Li), the shared window ($Si) and the dependent window ($Di). Thus the sum of the sizes of these three windows must be 16. The local window stores values that are local to a given thread, for example values from indexed arrays used only in a single iteration. Reads and writes to the local window are all local to the processor a thread is running on and no distribution of the L window is therefore required. The S and D register windows provide the means of sharing a part of a micro-context between threads. The S window is written by one thread and is read by another thread using its D window.

It should be noted that many different models can be supported by this basic mechanism. In this paper a simple model is described, but different models of communication, with different constraints and solutions to resource deadlock, can be implemented. The mechanism would even support a hierarchy of micro-contexts, by allowing an iteration in one family of threads to create a subordinate family, where the dynamic part of the address space in the creating family became the static part in the subordinate family. This would support nested multi-dimensional loops as well as breadth-first recursion. There are difficulties, however, in resolving resource deadlock problems in all but the simplest models and these require further research to resolve.

In this paper we describe a simple model that supports a single level of loop, with communication being allowed only between iterations that differ by a create-time constant. An example of this type of communication can be found in loop-carried dependencies, where one iteration produces a value which is used by another iteration. For
example, A[i] := ... A[i-k] ..., where k is invariant of the loop. Such dependencies normally act as a deterrent to loop vectorization or parallelization, but this is not so in this model, as the independent instructions in each loop can execute concurrently. This is the same ILP as is extracted from an out-of-order model.
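The window partitioning of Section 4.5 amounts to decoding a 5-bit register specifier into a window and an offset. The sketch below shows one way this decode could work; the particular split of the dynamic half into L, S and D windows (10, 4 and 2 registers here) is an illustrative choice, set per family at create time, subject only to the constraint that the three sizes sum to 16.

```python
# Sketch: decoding a 5-bit register specifier into (window, offset).
# Specifiers 0-15 name the statically allocated globals ($G); the
# upper 16 are split into the L, S and D windows, whose sizes must
# sum to 16. The (10, 4, 2) split is illustrative only.
def decode(spec, n_locals=10, n_shareds=4, n_dependents=2):
    assert 0 <= spec < 32 and n_locals + n_shareds + n_dependents == 16
    if spec < 16:
        return ("G", spec)                    # static, shared by all threads
    dyn = spec - 16                           # dynamic half of address space
    if dyn < n_locals:
        return ("L", dyn)                     # private to this iteration
    if dyn < n_locals + n_shareds:
        return ("S", dyn - n_locals)          # written for a consumer thread
    return ("D", dyn - n_locals - n_shareds)  # read-only view of producer's S

print(decode(3))    # ('G', 3)
print(decode(18))   # ('L', 2)
print(decode(27))   # ('S', 1)
print(decode(31))   # ('D', 1)
```

The hardware adds the micro-context base address held in the thread's state to the dynamic offset (base-displacement addressing), so the same binary code addresses a different physical micro-context in each iteration.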
Consider now the implementation of this basic model. It is straightforward to distribute the global register window and its characteristics suggest a broadcast bus as an appropriate implementation. This requires that all processors executing a family of microthreads be defined prior to any loop invariants being written (or re-written) to the global window. The hardware then traps any writes to the global window and replicates the values, using the broadcast bus, to the corresponding location in all processors' global windows. As multiple threads may read the values written to the global register window, registers must support arbitrarily large continuation queues, bounded above only by the number of threads that can be active at any time on one processor.
The write to the global window can be from any processor and thus can be used to return a value from an iteration to the global state of a context. The write is also asynchronous and independent of pipeline operation, provided there is local buffering for the data in the event of a conflict on the bus. Contention for this bus should not occur regularly, as writes to globals are generally much less frequent than reads (by a factor proportional to the concurrency of the code). This is analysed later in the paper.
The distribution of the S and D windows is a little more complex than the global window. Normally, a producer thread writes to its S window and the consumer reads from its D window, which maps in some sense onto the S window of the producer; we will return to this later. However, there is no restriction on a thread reading a register from its S window so long as data has already been written to it (it would deadlock otherwise). There is also no physical restriction on multiple writes to the S window, although this may introduce non-determinism if a synchronization is pending on it. As far as the hardware is concerned, therefore, the S window is identical to the L window, as all reads and writes to it are local and are mapped to the dynamic half of the register-address space. On the other hand, a thread may never write to its D window, which is strictly read-only. The hardware need only recognize reads to the D window in order to implement sharing between two different threads. In order to perform a read from a D window, a processor needs the location (processor id) and base address of the S window of the producer thread. There are two cases to consider in supporting the distribution of register files in the base-level model we have described.
The first and easiest case is when the consumer iteration is scheduled to the same processor as the producer. In this case a read to the D window can be implemented as a normal pipeline read by mapping the D window of the consumer micro-context onto the S window of the producer micro-context. The
thread’sstatemustthereforecontainthebaseaddressofitsownmicro-contextforlocalreadsandalsothebaseaddressofanymicro-contextitisdependentupon.Inthebase-levelmodelwepresent,onlyoneothermicro-contextisaccessed,ataconstantoffsetintheindexspace.
In the second case, the producer and consumer iterations are scheduled to different processors. Now, the consumer's read to the D window will generate a remote request to the processor on which the producer iteration is running. Whereas in the first case a micro-context's D window is not physically allocated, in this second case it must be. It is used to cache a local copy of the remote micro-context's S window. It is also used to store the thread continuation locally. The communication is again asynchronous and independent of the pipeline operation. The consumer thread is suspended on its read to the D window location until the data arrives from the remote processor. For this constant-strided communication, iteration schedules exist that require only nearest-neighbour communication in a ring network to implement the distributed shared-register scheme. Note that a request from the consumer thread may find an empty register, in which case the request gets suspended in the producer's S window until the required data has been produced. Thus a shared-register transaction may involve two continuations: a thread suspended in the D window of the consumer (waiting-local) and a remote request suspended in the S window of the producer (waiting-remote). As these states are mutually exclusive, the compiler must ensure that the producer thread does not suspend on one of its own S-window locations. This can happen if a load from memory to an S location is also used in the local thread. However, as dependencies are passed via register variables, this can only happen in the initialization of a dependency chain. This case can be avoided by loading to a location in the L window when the value is required locally and then copying it to the S window with a deterministic schedule. The additional complexity required in this distributed register file implementation is 2 bits in each register to encode the four synchronization states (full, empty, waiting-local and waiting-remote); a small amount of additional logic to address the dynamically allocated registers using base-displacement addressing; and a simple state machine on each register port to implement the required action based on the synchronization state.
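The two cases for a D-window read can be sketched as a simple decision. Everything here is an illustrative rendering of the mechanism just described, not a hardware specification: same-processor reads map directly onto the producer's S window, while cross-processor reads become asynchronous ring-network requests.

```python
# Sketch of the two cases for a D-window read. Names and the tuple
# encodings are illustrative only.
def d_window_read(consumer_proc, producer_proc, producer_s_base, offset):
    if consumer_proc == producer_proc:
        # Case 1: a normal pipeline read, the consumer's D window
        # mapped onto the producer's S window in the same file.
        return ("local_read", producer_s_base + offset)
    # Case 2: an asynchronous remote request over the ring network.
    # The consumer thread suspends in its own D window (waiting-local);
    # if the producer's S register is still empty, the request itself
    # suspends there (waiting-remote) until the data is produced.
    return ("ring_request", producer_proc, producer_s_base + offset)

print(d_window_read(2, 2, 40, 1))  # ('local_read', 41)
print(d_window_read(2, 3, 40, 1))  # ('ring_request', 3, 41)
```

Because the remote case is decoupled from the pipeline by the context switch on the dependent instruction, the ring transfer proceeds at its own pace without stalling the consumer's processor.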
A method has now been described to distribute all classes of communication required in the base-level model. However, we must ensure that this distribution does not require us to implement register files locally that are not scalable. This requires the number of local ports in the register file to be constant. Accesses to the L, S and local D windows require at most two read and one write port for a single-issue pipeline. The G window requires an additional write port independent of the pipeline ports. Finally, reads to a remote D window require one read port and one write port per processor. Contention for this port will depend on the pattern of dependencies, which for the model described is regular and hence evenly distributed with
The Computer Journal, Vol. 49 No. 2, 2006
appropriate scheduling. Each iteration is allocated a separate micro-context in the dynamic half of the register-address space and the first local register ($L0) is initialized by the scheduler to the loop index for that iteration, so this also requires a write port. Finally, a write port is required to support decoupled access to the memory on a cache miss, in order to avoid a bubble in the pipeline when data becomes available.

Figure 2 is a block diagram of the microthreaded register file illustrating these ports. As shown, it has a maximum of eight local ports per processor. The register file could be implemented with just three ports by stalling the pipeline whenever one of the asynchronous reads or writes occurs, but this would degrade its performance significantly. An analysis of the accesses to these ports is given in Section 5 below, where we attempt to reduce these eight ports to five using contention, i.e. the three pipeline ports and one additional read and write port for all other cases.

FIGURE 2. Register-file ports analysed: the pipeline read and write ports, the initialization port, the global-write port, the shared-dependent ports and the cache-miss write port.

4.5.1. Globally asynchronous locally synchronous communication

Modern synchronous CMP architectures are based on a single clock domain with global synchronization and control signals. The control-signal distribution must be very carefully designed in order to meet the operation rate of each component used, and the larger the chip, the more power is required to distribute these signals. In fact, clock skew, and the large power consumption required to eliminate it, is one of the most significant problems in modern synchronous processor design.

Fully asynchronous design is difficult, but one promising technique is to use a Globally-Asynchronous, Locally-Synchronous (GALS) clocking scheme [48]. This approach promises to eliminate the global clocking problem and provides a significant power reduction over globally synchronous designs. It divides the system into multiple independent domains, which are independently clocked but which communicate in an asynchronous manner. A GALS system not only mitigates the clock-distribution problem, the problem of clock skew and the resulting power consumption, it can also simplify the reuse of modules, as they have asynchronous interfaces that do not require redesign for timing issues when composed [49]. In CMP design, global communication is one of the most significant problems in both current and future systems [6], yet not every system can be decomposed easily into asynchronously communicating synchronous blocks; there must be a clear decoupling of local and remote activity. To achieve this, the local activity should not be overly dependent on a remote communication. The model we have described has just this property; each processor is independent and, when it does need to communicate with other processors, that communication occurs independently without limiting the local activity. In short, the local processor has tolerance to any latency involved in global communication, as in most circumstances it will have many other independent instructions it can process and, if this is not the case, it will simply switch off its clocks, reduce its voltage levels and wait until it has work to accomplish, dissipating minimal power. The size of the synchronous block in a microthreaded CMP can be from a single processor upwards. The size of this block is a low-level design decision. The issue is that, as technology continues to scale, this block size will scale down with the problems of signal propagation. Thus the model provides solutions to the end of scaling in silicon CMOS. Compare this with the current approach, which seeks to gain performance by clock speed in a single large wide-issue processor, where all strategies are working against the technology.

4.6. Microthreaded CMP architecture

FIGURE 3. Microthreaded CMP architecture, showing communication structures and clocking domains.

A block diagram of a microthreaded CMP is shown in Figure 3. As shown, N microthreaded pipelines are connected to two shared communications systems. The first is a broadcast bus, for which there must be contention, although in practice the access to this bus is at a low frequency. It is used by one processor to create a family of threads; it will probably be used by the same processor to distribute any loop invariants and finally, if there is a scalar result, one processor may write values back to global locations. This situation occurs when searching an iteration space (the only situation where contention might be required), as a number of processors might find a solution simultaneously and attempt to write to the bus. In
this case a break instruction acquires the bus and terminates all other threads, allowing the winner to write its results back to the global state of the main context.

The second communication system is the shared-register ring network, which is used by processors to communicate results along a dependency chain. For the model described, this requires only local connectivity between independently clocked processors.

All global communication systems are decoupled from the operation of the microthreaded pipeline, and thread scheduling provides latency hiding during the remote access. This technique gives a microthreaded CMP a significant advantage as a long-term solution to silicon scaling.
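The claim that constant-stride dependencies need only nearest-neighbour links on the ring can be checked directly for a modulo schedule. The helper names below are ours, and M and n are arbitrary illustrative values:

```python
def processor_for(i, M, n):
    """Modulo schedule: M consecutive iterations per processor, n processors in a ring."""
    return (i // M) % n

def ring_distance(a, b, n):
    """Hop count between two nodes on an n-node ring."""
    d = (a - b) % n
    return min(d, n - d)

# With a dependency distance of 1, the producer of iteration i is always
# on the same node or the neighbouring ring node.
M, n = 4, 8
assert all(
    ring_distance(processor_for(i, M, n), processor_for(i - 1, M, n), n) <= 1
    for i in range(1, 200)
)
```

The same property holds for any M and n, since decreasing i by one changes i // M by at most one and the modulo wrap-around is exactly one ring hop.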
5. ANALYSIS OF REGISTER FILE PORTS
In this section, an analysis of the microthreaded register-file ports is made in terms of the average number of accesses to each port of the register file in every pipeline cycle. This analysis is based on the hand compilation of a variety of loop kernels. The loops considered include a number of Livermore kernels, some of which are independent and some of which contain loop-carried dependencies. They also include both affine and non-affine loops, vector and matrix problems, and a recursive doubling algorithm. We have used loop kernels at this stage as we currently have no compiler to compile complete benchmarks. However, as the model only gains speedup via loops, we have chosen a broad set of representative loops from scientific and other applications. Analysis of complete programs and other standard benchmarks will be undertaken when the compiler we are developing is able to generate microthreaded code.
The results are based on a static analysis of the accesses to the various register windows and investigate the average traffic on the microthreaded register-file ports. The five types of register-file port are shown in Figure 2 and include the pipeline ports (read, R, and write, W), the initialization port (I), the shared-dependent ports (Sd), the broadcast port (Br) and the write port that is required in the case of a cache miss (Wm). The goal of this analysis is to guide the implementation parameters of such a system. We aim to show that all accesses other than the synchronous pipeline ports can be implemented by a pair of read and write ports, with arbitration between the different sources. In this case a register file with five fixed ports would be sufficient for each of the processors in our CMP design. The microthreaded pipeline uses three synchronous ports. These ports are used to access three classes of register window, i.e. the $L, $S and $G register windows. If we assume that the average number of reads on the pipeline ports in each cycle is R and the average number of writes on the pipeline ports in each cycle is W, then these values are defined by the following equations, where Ne is the total number of instructions executed:

\[ R = \frac{\sum_{\mathrm{inst.}} \bigl(\mathrm{Read}(\$L) + \mathrm{Read}(\$S) + \mathrm{Read}(\$G)\bigr)}{N_e} \quad (1) \]

\[ W = \frac{\sum_{\mathrm{inst.}} \bigl(\mathrm{Write}(\$L) + \mathrm{Write}(\$S) + \mathrm{Write}(\$G)\bigr)}{N_e} \quad (2) \]

The initialization port, on the other hand, is used in register allocation to initialize $L0 to the loop index. This port is accessed once when each iteration is allocated to a processor, and so the average number of accesses to this port is constant and equal to the inverse of the number of instructions executed by the thread
TABLE 3. Average number of accesses to each class of register-file port over a range of loop kernels; m = problem size, n = number of processors, M = consecutive iterations per processor, Ne = total instructions executed.

Loop | Ne | R | W | I | Br | Sd
A: Partial Products | 3m | (4m−3)/Ne | (2m−1)/Ne | 0.333 | 0 | (m−1)/(M·Ne)
B: 2-D SOR | 5m−2 | (8m−15)/Ne | (4m−4)/Ne | 0.2 | 0 | (m−2)/(M·Ne)
L3: Inner Product | 4m+4 | (5m+3)/Ne | (4m+1)/Ne | 0.25 | 0 | m/(M·Ne)
L4: Banded Linear Equation | 3m+34 | (2.4m+37.4)/Ne | (3m+22)/Ne | 0.2 | 4n/Ne | (1.8m+1.8)/(M·Ne)
L5: Tri-Diagonal Elimination | 5m+3 | 7m/Ne | 4m/Ne | 0.25 | 0 | (m−1)/(M·Ne)
L6: General Linear Recurrence | 2.5m+6.5m^1/2−5 | (5.5m+2.5m^1/2−5)/Ne | (3m+m^1/2−2)/Ne | 0.1429 | (m^1/2−1)n/Ne | (0.5m−0.5m^1/2)/(M·Ne)
C: Pointer Chasing | 14m+5 | (9m+3)/Ne | (6m+2)/Ne | 0.0714 | n/Ne | m/(M·Ne)
L1: Hydro Fragment | 9m+5 | 15m/Ne | (8m+3)/Ne | 0.1111 | 3n/Ne | 0
L2: ICCG | 11m+2 log m−21 | (17m−5 log m−27)/Ne | (10m−5 log m−12)/Ne | 0.0909 | (log m−1)n/Ne | 0
L7: Equation of State Fragment | 26m+5 | (43m+3)/Ne | (25m+3)/Ne | 0.0385 | 3n/Ne | 0
before it is killed, n_o. Therefore, if I is the average number of accesses to the initialization port per cycle, we can say that:

\[ I = \frac{1}{n_o} \quad (3) \]
A dependent read to a remote processor uses a read port on the remote processor and a write port on the local processor, as well as a read of the synchronous pipeline port on the local processor. The average number of accesses to these ports per cycle depends on the type of scheduling algorithm used. If we use modulo scheduling, where M consecutive iterations are scheduled to one processor, then inter-processor communication is minimized. An equation for dependent reads and writes is given based on modulo scheduling, although we consider only the worst-case scenario. The average number of accesses per cycle to the dependent window, Sd, is given by the following equation, where M is the number of consecutive threads scheduled to one processor and Ne is the total number of instructions executed. Clearly the worst case is M = 1, i.e. iterations are distributed one per processor in a modulo manner.

\[ S_d = \frac{\sum_{\mathrm{inst.}} \mathrm{Read}(\$D)}{M \cdot N_e} \quad (4) \]

The global write port is used to store data from the broadcast bus to the global window in every processor's local register file. If we assume that the average number of accesses per cycle to this port is Br, then Br can be obtained from the following equation, where Ne is the total number of instructions executed and n is the number of processors in the system. The result is proportional to the number of processors, as one write instruction will cause a write to every processor in the system.

\[ B_r = \frac{\sum_{\mathrm{inst.}} \mathrm{Write}(\$G) \cdot n}{N_e} \quad (5) \]

Finally, the frequency of accesses to the port that is required for the deferred register write in the case of a cache miss can also be obtained. It is parameterized by cache-miss rate in this static analysis and again we look at the worst case (100% miss rate). The average number of writes per cycle to the cache-miss port is given by Wm in the formula below, where Lw is the number of load instructions in each thread body, n_o is the number of instructions executed per thread body and Cm is the cache-miss rate. Again the average access to this port is constant for a given miss rate.

\[ W_m = \frac{L_w}{n_o} \cdot C_m \quad (6) \]

Table 3 shows the average number of accesses to each class of register-file port over a range of loop kernels using the above formulae. The first seven kernels are dependent loops, where the dependencies are carried between iterations using registers. The last three are independent loops, where all iterations of the loop are independent of each other.
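Equations (1) through (6) are simple enough to transcribe directly. The sketch below does so; the function names and the example operand counts are ours, chosen only to illustrate the shape of the calculation:

```python
def pipeline_reads(read_L, read_S, read_G, Ne):
    """Equation (1): average pipeline-port reads per cycle."""
    return (read_L + read_S + read_G) / Ne

def pipeline_writes(write_L, write_S, write_G, Ne):
    """Equation (2): average pipeline-port writes per cycle."""
    return (write_L + write_S + write_G) / Ne

def init_port(n_o):
    """Equation (3): one $L0 initialization per thread body of n_o instructions."""
    return 1 / n_o

def dependent_reads(read_D, M, Ne):
    """Equation (4): $D-window reads under modulo scheduling, M iterations per processor."""
    return read_D / (M * Ne)

def broadcast_port(write_G, n, Ne):
    """Equation (5): each global write reaches all n processors."""
    return write_G * n / Ne

def cache_miss_port(Lw, n_o, Cm):
    """Equation (6): deferred writes from Lw loads per thread body at miss rate Cm."""
    return (Lw / n_o) * Cm
```

For instance, a thread body of n_o = 4 instructions gives init_port(4) = 0.25, matching the I column for the inner-product kernel in Table 3, and the worst case for equation (4) is obtained with M = 1.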
As described previously, each of the distributed register files has four sources of write accesses in addition to the pipeline
FIGURE 4. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 4 processors.
FIGURE 5. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 16 processors.
ports. These are the $G write, the initialization write, the $D return data and the write to the port that supports decoupled access to memory on a cache miss. Our analysis shows that the average number of accesses from these sources is much less than one access per cycle over all analysed loop kernels. This is shown in Figures 4–7, where accesses to the initialization (I), broadcast (Br) and network ports (Sd, shown as S/D) are given. The four figures illustrate the scalability of the results (from n = 4 to
FIGURE 6. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 64 processors.

FIGURE 7. Average accesses per cycle on additional ports (I, Br, S/D and all writes) against normalised problem size (m/n), n = 256 processors.
n = 256 processors). Results are plotted against the normalized problem size, where m is the size of the problem in terms of the number of iterations, although not all iterations are executed concurrently in all codes (for example, the recursive doubling algorithm has a sequence of concurrent loops varying by powers of 2 from 2 to m/2). Normalized problem size is therefore a measure of the number of iterations executed per processor. It can be seen that only accesses from the broadcast bus increase with the number of processors, and even this is only significant where few iterations are mapped to each processor. Even in
TABLE 4. Average number of accesses to all additional write ports for different numbers of processors, m/n = 8.

Miss rate (%) | Wm port | All write ports, n=4 | n=16 | n=64 | n=256
10  | 0.038 | 0.405 | 0.453 | 0.517 | 0.634
20  | 0.076 | 0.443 | 0.491 | 0.555 | 0.672
30  | 0.113 | 0.480 | 0.528 | 0.592 | 0.709
40  | 0.151 | 0.518 | 0.566 | 0.630 | 0.747
50  | 0.189 | 0.556 | 0.604 | 0.668 | 0.785
60  | 0.227 | 0.594 | 0.642 | 0.706 | 0.823
70  | 0.265 | 0.632 | 0.680 | 0.744 | 0.861
80  | 0.302 | 0.669 | 0.717 | 0.781 | 0.898
90  | 0.340 | 0.707 | 0.755 | 0.819 | 0.936
100 | 0.378 | 0.744 | 0.785 | 0.857 | 0.974
the case of 256 processors, providing we schedule more than a few iterations to each processor, the overall number of writes per cycle is <50% of the shared port's capacity. Note that a register file of 512 registers supports at least 32 micro-contexts per processor. To put this in perspective, this means that on average a single port sharing all I, Br and Sd writes would be busy only 50% of the time. There may be peaks in the distribution of writes per cycle; however, all of these accesses are asynchronous and they can be queued without stalling the operation of any of the pipelines. This still leaves capacity to include writes from the decoupled-memory accesses.
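That queued asynchronous writes impose only a negligible backlog on one arbitrated port can be illustrated with a small Monte-Carlo sketch. The per-source request rates here are made-up placeholders, not the measured figures of this section:

```python
import random

def simulate_shared_write_port(rates, cycles=100_000, seed=0):
    """Queue-length statistics for one write port shared by several
    asynchronous sources, each issuing a request in a cycle with the
    given probability; the port serves one request per cycle."""
    rng = random.Random(seed)
    queue = 0
    max_queue = 0
    total_backlog = 0
    for _ in range(cycles):
        queue += sum(rng.random() < r for r in rates)  # new requests this cycle
        if queue:
            queue -= 1                                  # port retires one request
        max_queue = max(max_queue, queue)
        total_backlog += queue
    return max_queue, total_backlog / cycles

# three asynchronous sources whose combined demand is well below one per cycle
peak, backlog = simulate_shared_write_port([0.11, 0.05, 0.10])
```

With a combined arrival rate of about 0.26 requests per cycle, the queue is almost always empty and its occasional peaks are shallow, which is why the arbitrated port never forces a pipeline stall.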
The analysis of the decoupled-memory port also shows that the average number of accesses per cycle is small. If we assume a cache-miss rate of 50%, then the average number of accesses is <20% over all loop kernels. Thus, a single write port would not be fully utilized at this miss rate. For completeness, Table 4 shows the average number of accesses per cycle to all write ports, including the Wm port, with a variable cache-miss rate. This table is compiled for a normalized problem size of m/n = 8, which corresponds to fully utilizing a small register file. Actual cache-miss rates for one of the kernels are given in the simulation results presented below.
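The way Table 4 is built, combining the miss-independent I, Br and Sd terms with the Wm term of equation (6), can be reproduced in a few lines. The per-port figures below are illustrative stand-ins, not the table's entries:

```python
def write_port_demand(I, Br, Sd, Lw, n_o, miss_rates):
    """Combined demand on the single arbitrated write port for each
    cache-miss rate Cm; only Wm = (Lw/n_o)*Cm varies with the miss rate."""
    return [I + Br + Sd + (Lw / n_o) * cm for cm in miss_rates]

rates = [r / 100 for r in range(10, 101, 10)]  # 10% .. 100% miss rate
demand = write_port_demand(I=0.11, Br=0.15, Sd=0.10, Lw=2, n_o=9,
                           miss_rates=rates)
# demand rises linearly with the miss rate and stays below one access per cycle
```

The linear rise with miss rate, and the constant increment between rows, mirrors the structure visible in Table 4.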
6. CMP SIMULATION RESULTS
This section gives some preliminary simulation results, which show scalability of performance in executing a single loop kernel. Reference [47] also shows scalability of power for the same simulations. The results are preliminary as they cannot show the relationship between the scalar part of the code and the concurrent part; a microthreaded compiler is required before we can simulate complete applications. The significant point, however, is that all of these results were obtained from a single binary code that was hand compiled from the Livermore hydro fragment and executed on a microthreaded CMP using from 1 to 2048 processors. The code performs a fixed number of iterations (64K). Clearly this demonstrates the schedule invariance of microthreaded code.

The results are presented as functions of the number of processors on which the code was run and include: the number of cycles to execute the code; the number of instructions executed (note that failed synchronization will cause an increase in the overall number of instructions executed); IPC, which is computed from the previous two; the number of completely inactive cycles, when the pipe is empty and no threads are able to execute; and finally the L1 D-cache hit rate.

All results use identical processors with 1024 registers per processor and a 512-entry continuation queue. The I-cache is also identical in all results, a relatively small 2-Kbyte 8-way set-associative cache with a 32-byte line size. The first set of results is for a relatively complex level-1 D-cache of 64 Kbyte, with 8-way associativity and a 64-byte line size. Figure 8 shows the results. The largest increase in instructions executed due to instruction reissue on failed synchronization is <4% and occurred at 256 processors. The maximum IPC was 1613, representing a speedup of 1602 times the single-processor execution, and occurred, as expected, with 2048 processors. This is nearly 80% efficient. Indeed, the speedup was within 4% of the ideal up to 256 processors. The beginning of saturation in speedup is not unexpected given the fixed problem size. When using 2048 processors, for example, the processor's resources (registers and continuation-queue slots) are only partially utilized and hence latency tolerance suffers. Also the fixed overhead of thread creation, including broadcast of the TCB pointer to all processors, is amortized over fewer cycles. Note that all simulations are started with cold caches. The overhead is represented by the relatively static 100 or so cycles during which the processors are inactive. As the solution time approaches this, the percentage overhead will increase rapidly. Finally, it can be seen that the cache hit rate varies between 75 and 95%. The best hit rate is seen on one processor but the worst occurs on ∼64 processors, where the resources start to become less efficiently used. Note that the local schedules
FIGURE 8. Characteristics of microthreaded kernel execution (64K D-cache): cycles to solution, IPC, instructions executed, inactive cycles and hit rate (%), against the number of processors used to execute the kernel.
were set to match the cache line size and hence maximize hit rate.

The second set of results was undertaken to illustrate what effect, if any, the D-cache had on performance. In this set, the D-cache parameters were altered drastically to give a residual cache of just 1 Kbyte with direct-mapped cache lines (n.b. the register file is substantially larger than this, at 8 Kbyte). Figure 9 shows these results. It can be seen that the number of instructions executed, time to solution and IPC are all virtually unchanged. The only significant change is in the cache hit rate, which is now much worse, varying from 40 up to 88% on a single processor. Indeed, the minimal time to solution and maximum IPC were observed, by a small margin, with 2048 processors using the residual D-cache.

We note that there are some caveats to these results. They are obtained from the execution of a single independent code kernel. Simulations of dependent loops show that speedup saturates, with the maximum speedup being determined by the ratio of independent to dependent instructions in these kernels. For example, a simulation of a naive multiply-accumulate implementation of matrix multiplication (a dependent loop) saturated with a speedup of between 3 and 4, achieved using only four processors. This represents the maximum instruction parallelism of the combined iterations. The results also ignore the scalar component of the program, which we cannot effectively evaluate until we have fully developed a microthreaded compiler. Finally, the memory model used in these simulations is non-blocking, as we do not have a realistic model fully simulated yet. The results use a pipelined memory with an 8-cycle start-up for the first word and 2 cycles per word to complete a cache-line transfer. Note that this limitation is not a major issue with the kernel simulated, as data can be distributed in memory according to the local schedule and blocking would be unlikely. In the 64-Kbyte cache only 3% of memory accesses cause a request to second-level memory. The remaining cache misses are same-line misses due to the regularity of scheduling. We are currently working on a second-level memory architecture and will present full simulation results of this when we have a working compiler. The results presented here, however, show great promise for this approach.
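As a sanity check on the headline numbers above, parallel efficiency is just the achieved speedup over the processor count; the figures below are the ones reported for the 64K D-cache run:

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors

# speedup of 1602 on 2048 processors, as reported for the hydro fragment
eff = parallel_efficiency(1602, 2048)  # about 0.78, i.e. nearly 80% efficient
```

The same function gives an efficiency within 4% of 1.0 at 256 processors, consistent with the near-ideal speedup reported up to that point.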
7. CONCLUSIONS
The characteristics of advanced integrated circuits (ICs) will in future require powerful and scalable CMP architectures. However, current techniques like wide-issue, superscalar processors suffer from complexity in instruction issue and in the large multi-ported register file required. The complexity of these components grows at least quadratically with increasing issue width; also, execution of instructions using these techniques must proceed speculatively, which does not always
FIGURE 9. Characteristics of microthreaded kernel execution (1K D-cache): cycles to solution, IPC, instructions executed, inactive cycles and hit rate (%), against the number of processors used to execute the kernel.
provide results for the power consumed. In addition, more on-chip memory is required in order to ameliorate the effects of the so-called 'memory wall'. These obstacles limit the processor's performance, by constraining parallelism or through having large and slow structures. In short, this approach does not provide scalability in a processor's performance, in the on-chip area or in power dissipation.

An alternative solution, which eliminates this complexity in instruction issue and the global register file, and avoids speculation, has been presented in this paper. The model is based on decomposing a sequential program into small fragments of code called microthreads, which are scheduled dynamically and which can communicate and synchronize with each other very efficiently. This process allows sequential code to be compiled for execution on scalable CMPs. Moreover, as the code is schedule invariant, the same code will execute on any number of processors, limited only by problem size. The model exploits ILP within basic blocks and across loop bodies. In addition, this approach supports a pre-fetching mechanism that avoids many instruction-cache misses in the pipeline. The fully distributed register-file configuration used in this approach has the additional advantage of full scalability in a CMP, with the decoupling of all forms of communication from the pipeline's operation. This includes memory accesses and communication between micro-contexts.

The distributed implementation of a microthreaded CMP includes two forms of asynchronous communication. The first is the broadcast bus, used for creating threads and distributing invariants. The second is the shared-register ring network, used to perform communication between the register files of the producer and consumer threads. The asynchronous implementation of the bus and switch provides many opportunities for power saving in large CMP systems. The decoupled approach to register-file design avoids a centralized register-file organization and, as we have shown, requires a small, fixed number of ports for each processor's register file, regardless of the number of processors in the system.

An analysis of the register-file ports in terms of the frequency of accesses to each logical port is described in this paper. This analysis involved different types of dependent and independent loop kernels. The analysis illustrates a number of interesting issues, which can be summarized as follows:
• A single write port with arbitration between different sources is sufficient to support all non-pipeline writes. This port has an average access rate of <100% under normal operating conditions. This is true even in the case of a 100% cache-miss rate.
• A second port is required to handle reads to the $D window. The analysis shows that the average access to this port is <10% over all analysed loop kernels.
• As a consequence, the distributed register files require only five ports per processor and these ports are fixed regardless of the number of processors in the system. This
provides a scalable and efficient solution for a large number of processors on-chip.
• Finally, the average accesses to all write ports do not exceed 100% even in the case of n = 256 processors. However, with a larger number of processors, the performance would degrade gracefully due to the inherent latency tolerance of the model. Eventually all threads would be suspended waiting for data, and in this case the stalled pipeline(s) would free up contention on the non-pipeline write port.

Finally, we present results of the simulation of an independent loop kernel that clearly demonstrate the schedule invariance of the binary code and the linear speedup characteristics over a wide range of processors on which the kernel is scheduled. Clearly, a microthreaded CMP based on a fully distributed and scalable register-file organization and asynchronous global communication buses is a good candidate for future CMPs.

REFERENCES

[1] Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, S. and Verghese, B. (2000) Piranha: a scalable architecture based on single-chip multiprocessing. In Proc. 27th Annual Int. Symp. Computer Architecture, Vancouver, British Columbia, Canada, June 12–14, pp. 282–293. ACM Press, New York, NY.
[2] Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M. and Olukotun, K. (2000) The Stanford Hydra CMP. IEEE Micro, 20, 71–84.
[3] Hammond, L., Nayfeh, B. A. and Olukotun, K. (1997) A single-chip multiprocessor. IEEE Computer, 30, 79–85.
[4] Tendler, J. M., Dodson, J. S., Fields, J. S., Le, H. and Sinharoy, B. (2002) POWER4 system microarchitecture. IBM J. Res. Develop., 46, 5–25.
[5] Tremblay, M., Chan, J., Chaudhry, S., Conigliaro, A. W. and Tse, S. S. (2000) The MAJC architecture: a synthesis of parallelism and scalability. IEEE Micro, 20, 12–25.
[6] Jesshope, C. R. (2004) Scalable instruction-level parallelism. In Proc. Computer Systems: Architectures, Modeling and Simulation, 3rd and 4th Int. Workshops, SAMOS 2004, Samos, Greece, July 19–21, LNCS 3133, pp. 383–392. Springer, Berlin, Germany.
[7] Bhandarkar, D. (2003) Billion transistor chips in mainstream enterprise platforms of the future. In Proc. 9th Int. Symp. High-Performance Computer Architecture, Anaheim, CA, February 8–12, p. 3. IEEE Computer Society, Washington, DC.
[8] Agarwal, V., Hrishikesh, M. S., Keckler, S. W. and Burger, D. (2000) Clock rate versus IPC: the end of the road for conventional microarchitectures. In Proc. 27th Annual Int. Symp. Computer Architecture, Vancouver, British Columbia, Canada, June 10–14, pp. 248–259. ACM Press, New York, NY.
[9] Onder, S. and Gupta, R. (2001) Instruction wake-up in wide issue superscalars. In Proc. 7th Int. Euro-Par Conf. on Parallel Processing, Manchester, UK, August 28–31, pp. 418–427. Springer-Verlag, London, UK.
[10] Onder, S. and Gupta, R. (1998) Superscalar execution with dynamic data forwarding. In Proc. Int. Conf. Parallel Architectures and Compilation Techniques, Paris, France, October 12–18, pp. 130–135. IEEE Computer Society, Washington, DC.
[11] Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K. and Chang, K. (1996) The case for a single-chip multiprocessor. In Proc. Seventh Int. Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-7), Cambridge, MA, October 1–5, pp. 2–11. ACM Press, New York, NY.
[12] Palacharla, S., Jouppi, N. P. and Smith, J. (1997) Complexity-effective superscalar processors. In Proc. 24th Int. Symp. Computer Architecture, Denver, CO, June 1–4, pp. 206–218. ACM Press, New York, NY.
[13] Tullsen, D. M., Eggers, S. and Levy, H. M. (1995) Simultaneous multithreading: maximizing on-chip parallelism. In Proc. 22nd Annual Int. Symp. Computer Architecture, Santa Margherita Ligure, Italy, June 22–24, pp. 392–403. ACM Press, New York, NY.
[14] Rixner, S., Dally, W. J., Khailany, B., Mattson, P. R., Kapasi, U. J. and Owens, J. D. (2000) Register organization for media processing. In Proc. Int. Symp. High Performance Computer Architecture, Toulouse, France, January 8–12, pp. 375–386. IEEE CS Press, Los Alamitos, CA.
[15] Balasubramonian, R., Dwarkadas, S. and Albonesi, D. (2001) Reducing the complexity of the register file in dynamic superscalar processors. In Proc. 34th Int. Symp. on Microarchitecture, Austin, TX, December 1–5, pp. 237–248. IEEE Computer Society, Washington, DC.
[16] Diefendorff, K. and Duquesne, Y. (2002) Complex SOCs require new architectures. EE Times. Available at http://www.eetimes.com/issue/se/OEG20020911S0076.
[17] Ungerer, T., Robec, B. and Silc, J. (2003) A survey of processors with explicit multithreading. ACM Comput. Surveys, 35, 29–63.
[18] Burns, J. and Gaudiot, J.-L. (2001) Area and system clock effects on SMT/CMP processors. In Proc. 2001 Int. Conf. Parallel Architectures and Compilation Techniques, Barcelona, Spain, September 8–12, pp. 211–218. IEEE Computer Society, Washington, DC.
[19] Jesshope, C. R. (2003) Multi-threaded microprocessors: evolution or revolution. In Proc. 8th Asia-Pacific Conf. ACSAC 2003, Aizu, Japan, September 23–26, LNCS 2823, pp. 21–45. Springer, Berlin, Germany.
[20] Luo, B. and Jesshope, C. R. (2002) Performance of a micro-threaded pipeline. In Proc. 7th Asia-Pacific Conf. Computer Systems Architecture, Melbourne, Victoria, Australia, January 28–February 2, pp. 83–90. Australian Computer Society, Darlinghurst, Australia.
[21] Jesshope, C. R. (2001) Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines. In Proc. ACSAC 2001, Gold Coast, Queensland, Australia, January 29–30, pp. 80–88. IEEE Computer Society, Los Alamitos, CA.
[22] Zhou, H. and Conte, T. M. (2002) Code Size Efficiency in Global Scheduling for VLIW/EPIC Style Embedded Processors. Technical Report, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC.
[23] Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill, New York, St Louis, San Francisco.
[24] Sudharsanan, S., Sriram, P., Frederickson, H. and Gulati, A. (2000) Image and video processing using MAJC 5200. In Proc. 2000 IEEE Int. Conf. Image Processing, Vancouver, BC, Canada, September 10–13, pp. 122–125. IEEE Computer Society, Washington, DC.
[25] Cintra, M. and Torrellas, J. (2002) Eliminating squashes through learning cross-thread violations in speculative parallelisation for multiprocessors. In Proc. 8th Int. Symp. High-Performance Computer Architecture, Boston, MA, February 2–6, pp. 43–54. IEEE Computer Society, Washington, DC.
[26] Cintra, M., Martinez, J. S. and Torrellas, J. (2000) Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In Proc. Int. Symp. Computer Architecture, Vancouver, Canada, June 10–14, pp. 13–24. ACM Press, New York, NY.
[27] Terechko, A., Thenaff, E. L., Garg, M. J., Van Eijndhoven, J. V. and Corporaal, H. (2003) Inter-cluster communication models for clustered VLIW processors. In Proc. 9th Int. Symp. High-Performance Computer Architecture, Anaheim, CA, February 8–12, pp. 354–364. IEEE Computer Society, Washington, DC.
[28] Halfhill, T. (1998) Inside IA-64. Byte Magazine, 23, 81–88.
[29] Schlansker, M. S. and Rau, B. R. (2000) EPIC: An Architecture for Instruction-Level Parallel Processors. Technical Report HPL-1999-111, Compiler and Architecture Research, HP Laboratories, Palo Alto, CA.
[30] Sundararaman, K. and Franklin, M. (1997) Multiscalar execution along a single flow of control. In Proc. IEEE Int. Conf. Parallel Processing, Bloomington, IL, August 11–15, pp. 106–113. IEEE Computer Society, Washington, DC.
[31] Sohi, G. S., Breach, S. E. and Vijaykumar, T. N. (1995) Multiscalar processors. In Proc. 22nd Annual Int. Symp. Computer Architecture, Santa Margherita Ligure, Italy, June 22–24, pp. 414–425. ACM Press, New York, NY.
[32] Breach, S. E., Vijaykumar, T. N. and Sohi, G. S. (1994) The anatomy of the register file in a multiscalar processor. In Proc. 27th Int. Symp. Microarchitecture, San Jose, CA, November 30–December 2, pp. 181–190. ACM Press, New York, NY.
[33] Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A. and Smith, B. (1990) The Tera computer system. In Proc. 4th Int. Conf. Supercomputing, Amsterdam, The Netherlands, June 11–15, pp. 1–6. ACM Press, New York, NY.
[34] Kongetira, P., Aingaran, K. and Olukotun, K. (2005) Niagara: a 32-way multithreaded Sparc processor. IEEE Micro, 25, 21–29.
[35] Marr, D. T., Binns, F., Hill, D. L., Hinton, G., Koufaty, D. A. and Upton, M. (2002) Hyper-threading technology architecture and microarchitecture. Intel Technol. J., 6, 4–15.
[36] Emer, J. (1999) Simultaneous multithreading: multiplying Alpha's performance. Presentation at the Microprocessor Forum '99, MicroDesign Resources, San Jose, CA.
[37] Codrescu, L., Wills, D. S. and Meindl, J. D. (2001) Architecture of the Atlas chip-multiprocessor: dynamically parallelizing irregular applications. IEEE Trans. Comput., 50, 67–82.
[38] Diefendorff, K. (1999) Power4 focuses on memory bandwidth: IBM confronts IA-64, says ISA not important. Microprocessor Rep., 13, 11–17.
[39] Preston, R. P. et al. (2002) Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Proc. 2002 IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 4–6, pp. 334–335. IEEE, USA.
[40] Scott, L., Lee, L., Arends, J. and Moyer, B. (1998) Designing the low-power M-CORE architecture. In Proc. IEEE Power Driven Microarchitecture Workshop at ISCA 98, Barcelona, Spain, June 28, pp. 145–150.
[41] Park, I., Powell, M. D. and Vijaykumar, T. N. (2002) Reducing register ports for higher speed and lower energy. In Proc. 35th Annual ACM/IEEE Int. Symp. Microarchitecture, Istanbul, Turkey, November 18–22, pp. 171–182. IEEE Computer Society, Los Alamitos, CA.
[42] Kim, N. S. and Mudge, T. (2003) Reducing register ports using delayed write-back queues and operand pre-fetch. In Proc. 17th Annual Int. Conf. Supercomputing, San Francisco, CA, June 23–26, pp. 172–182. ACM Press, New York, NY.
[43] Tseng, J. H. and Asanovic, K. (2003) Banked multiported register files for high-frequency superscalar microprocessors. In Proc. 30th Int. Symp. Computer Architecture, San Diego, CA, June 9–11, pp. 62–71. ACM Press, New York, NY.
[44] Bunchua, S., Wills, D. S. and Wills, L. M. (2003) Reducing operand transport complexity of superscalar processors using distributed register files. In Proc. 21st Int. Conf. Computer Design, San Jose, CA, October 13–15, pp. 532–535. IEEE Computer Society, Los Alamitos, CA.
[45] Bolychevsky, A., Jesshope, C. R. and Muchnick, V. (1996) Dynamic scheduling in RISC architectures. IEE Proc. Comput. Digit. Tech., 143, 309–317.
[46] Jesshope, C. R. (2005) Micro-grids: the exploitation of massive on-chip concurrency. In Proc. HPC Workshop 2004, Grid Computing: A New Frontier of High Performance Computing, L. Grandinetti (ed.), Cetraro, Italy, May 31–June 3. Elsevier, Amsterdam.
[47] Bousias, K. and Jesshope, C. R. (2005) The challenges of massive on-chip concurrency. In Proc. Tenth Asia-Pacific Computer Systems Architecture Conference, Singapore, October 24–26, LNCS 3740, pp. 157–170. Springer-Verlag, Berlin, Germany.
[48] Shapiro, D. (1984) Globally Asynchronous Locally Synchronous Circuits. PhD Thesis, Report No. STAN-CS-84-1026, Stanford University, Stanford, CA.
[49] Shengxian, Z., Li, W., Carlsson, J., Palmkvist, K. and Wanhammar, L. (2002) An asynchronous wrapper with novel handshake circuits for GALS systems. In Proc. IEEE 2002 Int. Conf. Communications, Circuits and Systems, Chengdu, China, June 29–July 1, pp. 1521–1525. IEEE.