January 1, 2016
By: Wayne Dyck
GNU grep and regex lookarounds
I need just the user names from the Rig.xml file. Below is a sample of one RightsGroup element contained in the XML file; in reality there could be a hundred of these.
<RightsGroup GUID="{F4B45F3B-1C90-4B3C-9C3E-57E92A45A961}">
<Versions Count="1">
<Version xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" IsDependent="1" InternalID="2197" GUID="{F4B45F3B-1C90-4B3C-9C3E-57E92A45A961}" IsShortcut="0" Parent="{E4D19237-9DB3-12D1-B43E-106097071364}" Name="Star Wars Editors" Location="/User Roles/Editors/" IsNew="0" IsModified="0" IsDeleted="0" UseOuterScriptForPostings="0" UserRoleType="4" IsRobotIndexable="0" IsRobotFollowable="0" IsHiddenModePublished="0" SortOrdinal="0" Expiredate="401769" Effectivedate="0" ModifiedWhen="42054.745502395832" CreatedWhen="37998.6830443287" ApprovalStatusModifiedBy="" ReadyForApproval="0" ApprovalStatus="1" IsHighPriority="0" SameRightsAsParent="1" Objects="0" Containers="0" TotalCount="0">
<RoleMembers>
<Member UserName="WinNT://STARWARS/DVader" />
<Member UserName="WinNT://STARWARS/Yoda" />
<Member UserName="WinNT://STARWARS/BFett" />
<Member UserName="WinNT://STARWARS/LSkywalker" />
<Member UserName="WinNT://STARWARS/ATano" />
<Member UserName="WinNT://STARWARS/HSolo" />
<Member UserName="WinNT://STARWARS/PAmidala" />
<Member UserName="WinNT://STARWARS/OKenobi" />
<Member UserName="WinNT://STARWARS/JFett" />
<Member UserName="WinNT://STARWARS/LOrgana" />
</RoleMembers>
</Version>
</Versions>
</RightsGroup>
My initial plan is to write a small Python program that parses the XML and uses regular expressions to match and extract the required text. I’ve done this many times before, and it’s easy enough to step through the various elements and attributes.
from xml.dom import minidom
xmldoc = minidom.parse('Rig.xml')
memberlist = xmldoc.getElementsByTagName('Member')
...
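A minimal sketch of how the rest of such a program could look. The inline sample, the `member_names` helper, and the use of `rsplit` in place of a regular expression are assumptions made for illustration, not the original code.

```python
from xml.dom import minidom

# Trimmed-down stand-in for Rig.xml (hypothetical sample; the real file
# holds many RightsGroup elements with more attributes).
SAMPLE_XML = """<RightsGroup GUID="{F4B45F3B-1C90-4B3C-9C3E-57E92A45A961}">
  <RoleMembers>
    <Member UserName="WinNT://STARWARS/DVader" />
    <Member UserName="WinNT://STARWARS/Yoda" />
    <Member UserName="WinNT://STARWARS/BFett" />
  </RoleMembers>
</RightsGroup>"""


def member_names(xmldoc):
    """Return the last path segment of every Member's UserName attribute."""
    names = []
    for member in xmldoc.getElementsByTagName('Member'):
        user_name = member.getAttribute('UserName')  # e.g. WinNT://STARWARS/DVader
        names.append(user_name.rsplit('/', 1)[-1])   # keep the text after the final slash
    return names


# For the real file, minidom.parse('Rig.xml') would replace parseString.
xmldoc = minidom.parseString(SAMPLE_XML)
names = member_names(xmldoc)
print('\n'.join(names))
```

Splitting on the final slash avoids hard-coding the STARWARS domain, at the cost of accepting any UserName value without checking its prefix.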
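I'm all for writing code; however, the pattern match can actually be done with a one-line grep expression using a lookbehind assertion. The -P option enables Perl-compatible regular expressions, which lookarounds require, and -o prints only the matched part of each line.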
grep -oP "(?<=WinNT:\/\/STARWARS\/)([\w\s-]*)" Rig.xml
The result is this:
DVader
Yoda
BFett
LSkywalker
ATano
HSolo
PAmidala
OKenobi
JFett
LOrgana
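The same lookbehind also works in Python's re module, which may be handy where grep -P is not available. This is a small sketch under that assumption; the `PATTERN` and `sample` names are illustrative, and the sample is a two-line excerpt rather than the full file.

```python
import re

# Same lookbehind as the grep command. Forward slashes need no escaping in
# Python, and the prefix is fixed-width, as Python's re requires for lookbehinds.
PATTERN = re.compile(r"(?<=WinNT://STARWARS/)[\w\s-]*")

# Two-line excerpt standing in for the contents of Rig.xml.
sample = '''<Member UserName="WinNT://STARWARS/DVader" />
<Member UserName="WinNT://STARWARS/Yoda" />'''

# With no capture groups, findall returns the full text of each match.
matches = PATTERN.findall(sample)
print('\n'.join(matches))
```

The prefix itself is never part of the match because the lookbehind only asserts that it precedes the current position.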
For a fantastic tutorial on lookarounds, refer to Mastering Lookahead and Lookbehind.