January 1, 2016
By: Wayne Dyck

GNU grep and regex lookarounds

I need just the user names from the Rig.xml file. Below is a sample of one RightsGroup element contained in the XML file; in reality there could be a hundred of these.

<RightsGroup GUID="{F4B45F3B-1C90-4B3C-9C3E-57E92A45A961}">
    <Versions Count="1">
        <Version xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" IsDependent="1" InternalID="2197" GUID="{F4B45F3B-1C90-4B3C-9C3E-57E92A45A961}" IsShortcut="0" Parent="{E4D19237-9DB3-12D1-B43E-106097071364}" Name="Star Wars Editors" Location="/User Roles/Editors/" IsNew="0" IsModified="0" IsDeleted="0" UseOuterScriptForPostings="0" UserRoleType="4" IsRobotIndexable="0" IsRobotFollowable="0" IsHiddenModePublished="0" SortOrdinal="0" Expiredate="401769" Effectivedate="0" ModifiedWhen="42054.745502395832" CreatedWhen="37998.6830443287" ApprovalStatusModifiedBy="" ReadyForApproval="0" ApprovalStatus="1" IsHighPriority="0" SameRightsAsParent="1" Objects="0" Containers="0" TotalCount="0">
            <RoleMembers>
                <Member UserName="WinNT://STARWARS/DVader" />
                <Member UserName="WinNT://STARWARS/Yoda" />
                <Member UserName="WinNT://STARWARS/BFett" />
                <Member UserName="WinNT://STARWARS/LSkywalker" />
                <Member UserName="WinNT://STARWARS/ATano" />
                <Member UserName="WinNT://STARWARS/HSolo" />
                <Member UserName="WinNT://STARWARS/PAmidala" />
                <Member UserName="WinNT://STARWARS/OKenobi" />
                <Member UserName="WinNT://STARWARS/JFett" />
                <Member UserName="WinNT://STARWARS/LOrgana" />
            </RoleMembers>
        </Version>
    </Versions>
</RightsGroup>

My initial plan is to write a small Python program that parses the XML and uses regular expressions to match and extract the required text. I’ve done this many times before, and it’s easy enough to step through the various elements and attributes.

from xml.dom import minidom
xmldoc = minidom.parse('Rig.xml')
memberlist = xmldoc.getElementsByTagName('Member')
...
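
For illustration only, here is one way that Python approach might be fleshed out. It is a minimal sketch, not the program I actually wrote: the member_names helper and the trimmed SAMPLE_XML string are made up for this example, and it uses a plain string split rather than a regular expression to pull out the account name.

```python
from xml.dom import minidom

# Trimmed-down stand-in for Rig.xml (assumption: the real file uses the
# same Member/UserName layout as the sample shown above).
SAMPLE_XML = """<RightsGroup GUID="{F4B45F3B-1C90-4B3C-9C3E-57E92A45A961}">
  <Versions Count="1">
    <Version Name="Star Wars Editors">
      <RoleMembers>
        <Member UserName="WinNT://STARWARS/DVader" />
        <Member UserName="WinNT://STARWARS/Yoda" />
      </RoleMembers>
    </Version>
  </Versions>
</RightsGroup>"""


def member_names(xmldoc):
    """Return the text after the last '/' of every Member's UserName."""
    names = []
    for member in xmldoc.getElementsByTagName('Member'):
        user = member.getAttribute('UserName')  # e.g. WinNT://STARWARS/DVader
        names.append(user.rsplit('/', 1)[-1])
    return names


if __name__ == '__main__':
    # For the real file, swap in minidom.parse('Rig.xml').
    for name in member_names(minidom.parseString(SAMPLE_XML)):
        print(name)
```

Splitting on the last slash keeps the sketch independent of the domain name, which the grep pattern further down hard-codes.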

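I'm all for writing code; however, the pattern match can actually be done with a one-line grep expression using a lookbehind assertion. The -P option tells GNU grep to interpret the pattern as a Perl-compatible regular expression, which is what makes lookarounds available, and -o prints only the matched part of each line instead of the whole line.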

grep -oP "(?<=WinNT:\/\/STARWARS\/)([\w\s-]*)" Rig.xml

The result is this:

DVader
Yoda
BFett
LSkywalker
ATano
HSolo
PAmidala
OKenobi
JFett
LOrgana
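
To show what the lookbehind is doing, the same pattern can be tried in Python's re module. This is a rough sketch rather than a replacement for the grep one-liner: the usernames helper and the two-line sample string are invented for the example, and it uses + where the grep pattern uses * so that an empty user name would not produce an empty match.

```python
import re

# (?<=...) asserts that the text immediately before the match is
# WinNT://STARWARS/ without making that prefix part of the match itself.
# Python's re only accepts fixed-width lookbehinds; this prefix qualifies.
USERNAME = re.compile(r'(?<=WinNT://STARWARS/)[\w\s-]+')


def usernames(text):
    """Return every run of word, whitespace, or hyphen characters that follows the prefix."""
    return USERNAME.findall(text)


# Hypothetical two-line excerpt in the same shape as the Member elements above.
sample = '''<Member UserName="WinNT://STARWARS/DVader" />
<Member UserName="WinNT://STARWARS/Yoda" />'''

if __name__ == '__main__':
    for name in usernames(sample):
        print(name)
```

Because the prefix sits inside the lookbehind, it has to be present but is never included in what is returned, which is why only the bare names come back.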

For a fantastic tutorial on lookarounds, refer to Mastering Lookahead and Lookbehind.

Tags: grep regex