Geekzone: technology news, blogs, forums
Guest
Welcome Guest.
You haven't logged in yet. If you don't have an account you can register now.




134 posts

Master Geek
+1 received by user: 9


Topic # 242698 9-Nov-2018 21:10
Send private message quote this post

 

 

Hi there

 

I have some CSV files that come with double quote delimiters and every now and then there is a rogue quote in a field that I would like to replace.

 

I am using TCL as the application has this natively embedded and I am struggling with the regex that would identify this pattern and thus use the regex in the regsub function of TCL. In essence, open the file, loop each line and regsub

 

Here is an example: 100,"Hello","big",190,"Longer than 10" lengths",99 (I am trying to replace the " after 10 with say zz)

 

A unix regex of (?<!^|,)"(?!,|$) works a treat, but does not confirm with TCL script.

 

Any ideas/assistance/alternative approaches...I could use unix (ksh) script, but was keen on TCL for is ease and I use it else where.

 


Create new topic
2547 posts

Uber Geek
+1 received by user: 1238

Lifetime subscriber

  Reply # 2124247 12-Nov-2018 16:19
Send private message quote this post

Try liberal insertion of backslashes in your regex. particularly the " in the middle.

 

Or do it in Notepad++... it allows regex search & replace.

 

 


18 posts

Geek


  Reply # 2124677 13-Nov-2018 12:04
Send private message quote this post

I don't think TCL handles lookbehind (your /(?<!^|,)/).  But you should be able to easily work around this by including the ',' within the match, and putting it back by replacing the match with ',zz'. Since this now requires the ',', you'll need to check for /^"/ separately.


 
 
 
 




134 posts

Master Geek
+1 received by user: 9


  Reply # 2124968 13-Nov-2018 19:24
Send private message quote this post

frankv:

 

Try liberal insertion of backslashes in your regex. particularly the " in the middle.

 

Or do it in Notepad++... it allows regex search & replace.

 

 

Thanks for that - a big NP++ user, bit weak in regex




134 posts

Master Geek
+1 received by user: 9


  Reply # 2124980 13-Nov-2018 19:27
Send private message quote this post

shaned:

 

I don't think TCL handles lookbehind (your /(?<!^|,)/).  But you should be able to easily work around this by including the ',' within the match, and putting it back by replacing the match with ',zz'. Since this now requires the ',', you'll need to check for /^"/ separately.

 

 

Thanks - spot on with my issue...no look behind in TCL regex. What I have I found using google foo and works a treat in Unix etc., just not in TCL. As my regex ability is weak...a sample would be appreciated.


18 posts

Geek


  Reply # 2125547 14-Nov-2018 12:49
Send private message quote this post

I don't know TCL, so I'll just break down the regex for you.  I'll jump around a bit to explain the simpler parts first.

 

"    nothing special here, matched a quite character
,    like above, just matched a comma
$    matched the end of the string or end of the line, depending the command/language/function and options
^   matched the beginning of the string/line
,|$    matched either a comma OR the end of string
^|,   matched the beginning of the string OR a comma


X(?=Y)   this is where things get tricky. This is a lookahead that would match an X, but only if it is followed by a Y. The important part is that the Y is not considered part of the text that matches. Once you get your head around that the rest of loohahead and lookbehind is pretty simple.
X(?!Y)    this is a 'negative' lookahead, it will match X only if it is NOT followed by Y
"(?!,|$)    so this is a negative lookahead that ensures the quote is matched only when it is not followed by a comma or an end of string

(?<=Y)X   lookbehinds are similar, this matched X, but only when it is preceded by Y
(?<!Y)X   matched X only if it is NOT preceded by Y
(?<!^|,)"   matched a quote, but only if it is not preceded by the beginning of the string or a comma

So know you should be able to get your head completely around your (?<!^|,)"(?!,|$). If not, have a coffee and read the above again.

 

 

 

We have to find a way to do without the lookbehind. We need to find X" so that we can substitute the " part, but the X part is not a known character so it will be included in the match, but we have no idea what it will be. For this we need regex captures. The match operation captures specified parts of the match for us to refer to later.  (X)" will match just like X", but will capture the X.

 

Ok, so we need to match a quite, but only when it is immediately after the start of string or after a character that is not a comma. And we need to capture whatever was before the quote so that we can include it in the regsub substitution.

 

[abc]  match any of the characters a,b or c
[^abc]  match any character other than a,b,c
[^,]  match any character except a comma
[^,]|^   match any character not a comma, or match the beginning of the string
([^,]|^)"  match like above, followed by a quote, and put the prefix part into a capture

 

For the substation, you want to replace the " with zz. This is as before, but you need to include the captured part in your replacement. It looks like TCL uses \1,\2 etc syntax for captures, so it's simply a matter of replacing with \1zz.

 

We could do the same trick for the lookahead suffix, but since TCL supports lookahead you can either keep it as is, or replace it with another capture to be consistent.

 

replace ([^,]|^)"(?!,|$) with \1zz   (capture for prefix, lookahead for suffix)

 

replace ([^,]|^)"([^,]|$) with \1zz\2  (captures for prefix and suffix)

 

 

 

I'll leave it for you to work out how to write this in TCL


2547 posts

Uber Geek
+1 received by user: 1238

Lifetime subscriber

  Reply # 2126641 14-Nov-2018 15:26
Send private message quote this post

Yoban:

 

frankv:

 

Try liberal insertion of backslashes in your regex. particularly the " in the middle.

 

Or do it in Notepad++... it allows regex search & replace.

 

 

Thanks for that - a big NP++ user, bit weak in regex

 

 

And here I was, being very impressed by your use of the look-behind feature, which I've only used once or twice. lol

 

 




134 posts

Master Geek
+1 received by user: 9


  Reply # 2128950 18-Nov-2018 16:54
Send private message quote this post

frankv:

 

Yoban:

 

frankv:

 

Try liberal insertion of backslashes in your regex. particularly the " in the middle.

 

Or do it in Notepad++... it allows regex search & replace.

 

 

Thanks for that - a big NP++ user, bit weak in regex

 

 

And here I was, being very impressed by your use of the look-behind feature, which I've only used once or twice. lol

 

 

All the credit goes to my google foo that found the expression on stackoverflow




134 posts

Master Geek
+1 received by user: 9


  Reply # 2128951 18-Nov-2018 16:55
Send private message quote this post

Thanks @shaned, will digest and see who it goes.


446 posts

Ultimate Geek
+1 received by user: 94

Subscriber

  Reply # 2128955 18-Nov-2018 17:20
Send private message quote this post

I'm a NP++ user too, but the last time I had to do a regex replacement I used TextPad https://www.textpad.com/

 

It allowed me to do a replace with incremental numbering.

 

 





Speedtest

Create new topic



Twitter »

Follow us to receive Twitter updates when new discussions are posted in our forums:



Follow us to receive Twitter updates when news items and blogs are posted in our frontpage:



Follow us to receive Twitter updates when tech item prices are listed in our price comparison site:



Geekzone Live »

Try automatic live updates from Geekzone directly in your browser, without refreshing the page, with Geekzone Live now.


Geekzone Live »

Our community of supporters help make Geekzone possible. Click the button below to join them.

Support Geezone on PressPatron



Are you subscribed to our RSS feed? You can download the latest headlines and summaries from our stories directly to your computer or smartphone by using a feed reader.

Alternatively, you can receive a daily email with Geekzone updates.