[Tut] Python RegEx – Match Whitespace But Not Newline


<div>
<h2>Problem Formulation</h2>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4ac.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Challenge</strong>: How do you design a regular expression pattern that matches whitespace characters such as the space <code>' '</code> and the tab character <code>'\t'</code>, but not the newline character <code>'\n'</code>?</p>
<p>An example would be replacing all whitespace (except newlines) in a <a href="https://blog.finxter.com/how-to-convert-space-delimited-file-to-csv-in-python/" data-type="post" data-id="522346" target="_blank" rel="noreferrer noopener">space-delimited file</a> with commas to obtain a CSV file.</p>
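<p>To see why the usual <code>\s</code> shorthand does not solve this, note that in Python's <code>re</code> module <code>\s</code> also matches newline characters. The following minimal sketch, using a sample string chosen for illustration, shows how replacing <code>\s+</code> with a comma merges all lines into one:</p>

```python
import re

txt = 'a \t b c\nd e f'  # sample string chosen for illustration

# \s also matches '\n', so the line break is replaced along with the spaces
merged = re.sub(r'\s+', ',', txt)
print(merged)
# a,b,c,d,e,f
```

<p>The line structure is lost, which is why a pattern that excludes newlines is needed.</p>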
<h2>Method 1: Use Character Class</h2>
<p class="has-global-color-8-background-color has-background">The <a rel="noreferrer noopener" href="https://blog.finxter.com/python-character-set-regex-tutorial/" data-type="post" data-id="6208" target="_blank">character class</a> pattern <code>[ \t]</code> matches a single space <code>' '</code> or tab character <code>'\t'</code>, but not a newline character. To match a run of one or more spaces and tabs (but no newlines), append the plus quantifier to the pattern like so: <code>[ \t]+</code>. </p>
<p>Here’s an example that replaces all separating whitespace (except newlines) with a comma to produce CSV-formatted output:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

txt = 'a \t b c\nd e f'
# Replace each run of spaces and tabs (but not newlines) with a comma
csv_txt = re.sub('[ \t]+', ',', txt)
print(csv_txt)</pre>
<p>Output:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">a,b,c
d,e,f</pre>
<h3>Why the space in the pattern <code>[ \t]</code>?</h3>
<p>The space in the pattern is there to match the space character itself. A character class expresses an OR relationship: it matches exactly one character out of those listed inside the brackets. The given pattern therefore matches either the space <code>' '</code> or the tab character <code>'\t'</code>. </p>
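<p>The difference between matching one character at a time and matching whole runs can be checked with <code>re.findall()</code>. This is a small sketch on a sample string, not a definitive recipe:</p>

```python
import re

txt = 'a \t b c\nd e f'  # sample string chosen for illustration

# Without a quantifier, every space or tab is its own match
single = re.findall('[ \t]', txt)

# With '+', each run of adjacent spaces and tabs is one match
runs = re.findall('[ \t]+', txt)

print(single)
# [' ', '\t', ' ', ' ', ' ', ' ']
print(runs)
# [' \t ', ' ', ' ', ' ']
```

<p>In both cases the newline between <code>c</code> and <code>d</code> is never part of a match.</p>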
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: <a href="https://blog.finxter.com/python-character-set-regex-tutorial/" data-type="post" data-id="6208" target="_blank" rel="noreferrer noopener">Character Class (Character Set) — The Ultimate Guide for Python</a></p>
<h2>Method 2: Match Individual Whitespace Characters</h2>
<p>The previous method matches only the horizontal tab (<a rel="noreferrer noopener" href="https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=0x" target="_blank">U+0009</a>) and the space (<a href="https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=0x">U+0020</a>) characters. If you want finer-grained control over which whitespace characters to match, you can use the following baseline approach.</p>
<p>The following list of Unicode whitespace characters, <code><a rel="noreferrer noopener" href="https://www.lesinskis.com/python-unicode-whitespace.html" data-type="URL" data-id="https://www.lesinskis.com/python-unicode-whitespace.html" target="_blank">UNICODE_WHITESPACES</a></code>, contains all major whitespace variants you may want to check your string for. You can build a character class from it with the string expression <code>'[' + ''.join(UNICODE_WHITESPACES) + ']'</code>. </p>
<p>Here’s a variant that finds all whitespace characters in a given text:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

UNICODE_WHITESPACES = [
    "\u0009", # character tabulation
    "\u000a", # line feed
    "\u000b", # line tabulation
    "\u000c", # form feed
    "\u000d", # carriage return
    "\u0020", # space
    "\u0085", # next line
    "\u00a0", # no-break space
    "\u1680", # ogham space mark
    "\u2000", # en quad
    "\u2001", # em quad
    "\u2002", # en space
    "\u2003", # em space
    "\u2004", # three-per-em space
    "\u2005", # four-per-em space
    "\u2006", # six-per-em space
    "\u2007", # figure space
    "\u2008", # punctuation space
    "\u2009", # thin space
    "\u200A", # hair space
    "\u2028", # line separator
    "\u2029", # paragraph separator
    "\u202f", # narrow no-break space
    "\u205f", # medium mathematical space
    "\u3000", # ideographic space
]

txt = ' \t\n\r'
pattern = '[' + ''.join(UNICODE_WHITESPACES) + ']'
matches = re.findall(pattern, txt)
print(matches)
# [' ', '\t', '\n', '\r']</pre>
<p>Of course, you can restrict this list to whitespace characters that are not newline-related.</p>
<h2>Method 3: Match Individual Whitespace Characters Except Newlines</h2>
<p>The following code snippet uses the <code>UNICODE_WHITESPACES</code> constant but comments out the newline-related entries, so that characters such as <code>'\n'</code> and <code>'\r'</code> are no longer matched.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="5,7,21-22,29,32" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

UNICODE_WHITESPACES = [
    "\u0009", # character tabulation
    # "\u000a", # line feed
    "\u000b", # line tabulation
    "\u000c", # form feed
    # "\u000d", # carriage return
    "\u0020", # space
    # "\u0085", # next line
    "\u00a0", # no-break space
    "\u1680", # ogham space mark
    "\u2000", # en quad
    "\u2001", # em quad
    "\u2002", # en space
    "\u2003", # em space
    "\u2004", # three-per-em space
    "\u2005", # four-per-em space
    "\u2006", # six-per-em space
    "\u2007", # figure space
    "\u2008", # punctuation space
    "\u2009", # thin space
    "\u200A", # hair space
    # "\u2028", # line separator
    # "\u2029", # paragraph separator
    "\u202f", # narrow no-break space
    "\u205f", # medium mathematical space
    "\u3000", # ideographic space
]

txt = ' \t\n\r'
pattern = '[' + ''.join(UNICODE_WHITESPACES) + ']'
matches = re.findall(pattern, txt)
print(matches)
# [' ', '\t']
</pre>
<p>In the same way, you can comment out any other Unicode whitespace characters that your own application should not match.</p>
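<p>As an alternative to commenting out entries by hand, the exclusion can be expressed as a filter. The sketch below is one possible approach and not part of the original post: the shortened list, the <code>NEWLINE_CHARS</code> set, and the <code>build_pattern()</code> helper are hypothetical names introduced for illustration only.</p>

```python
import re

# Shortened list for illustration; the full list is shown in Method 2
UNICODE_WHITESPACES = [
    "\u0009",  # character tabulation
    "\u000a",  # line feed
    "\u000d",  # carriage return
    "\u0020",  # space
    "\u00a0",  # no-break space
    "\u2003",  # em space
    "\u2028",  # line separator
]

# Assumption: the characters treated as newline-related, mirroring
# the entries commented out in Method 3
NEWLINE_CHARS = {"\u000a", "\u000d", "\u0085", "\u2028", "\u2029"}


def build_pattern(chars, excluded=frozenset()):
    """Build a character class from chars, skipping those in excluded."""
    kept = [c for c in chars if c not in excluded]
    return "[" + "".join(kept) + "]"


pattern = build_pattern(UNICODE_WHITESPACES, NEWLINE_CHARS)

matches = re.findall(pattern, " \t\n\r\u00a0")
print(matches)
# [' ', '\t', '\xa0']

# Replace runs of non-newline whitespace with commas, keeping line breaks
csv_txt = re.sub(pattern + "+", ",", "a \u00a0b\tc\nd\u2003e f")
print(csv_txt)
# a,b,c
# d,e,f
```

<p>Keeping the exclusions in a separate set leaves the full list intact, so the same list can serve several patterns with different exclusions.</p>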
</div>


https://www.sickgaming.net/blog/2022/07/...t-newline/