[Hidden-tech] Text Manipulation Problem

Elijah Gwynn eli at egwynn.com
Mon Oct 23 09:33:06 EDT 2017


ASCII’s FS, GS, RS, and US chars almost certainly won’t be in the text you’re dealing with. They’re not printable so they’d look really out of place in a text doc. That said, I don’t know if they’re actually well enough supported (e.g., in Excel) to make their use “a good idea”. The theoretical use-case would be that you could use them instead of their CSV “counterpart” delimiters (commas and newlines) in a program generating your new file, without worrying about any commas/newlines in the source text screwing up your table. Search/replace would probably not work with them.

Eli

Sent from my iPhone

> On Oct 23, 2017, at 7:31 AM, Shel Horowitz <shel at principledprofit.com> wrote:
> 
> But those all look like strings that might show up for other reasons, which makes global search-and-replace not an option. How do yo work around that?
> 
> 
> Shel Horowitz - "The Transformpreneur"(sm)
> ________________________________________________
> Watch (and please share) my TEDx Talk,
> "Impossible is a Dare: Business for a Better World"
> http://www.ted.com/tedx/events/11809 
> (move your mouse to "event videos")
> 
> Contact me to bake in profitability while addressing hunger, 
> poverty, war, and catastrophic climate change
> 
> Twitter: @shelhorowitz
> 
> * First business ever to be Green America Gold Certified
> * Inducted into the National Environmental Hall of Fame
> 
> http://goingbeyondsustainability.com 
> http://transformpreneur.com
> mailto:shel at greenandprofitable.com * 413-586-2388
> Award-winning, best-selling author of 10 books. Latest:
> Guerrilla Marketing to Heal the World (co-authored with Jay Conrad Levinson)
> 
> _________________________________________________
> 
>> On Sun, Oct 22, 2017 at 2:39 PM, Elijah Gwynn <eli at egwynn.com> wrote:
>> Regarding tabs-vs-commas: it's a real tragedy that more programs don't make use of any of the four ASCII delimiter characters (https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text) that have been available since ASCII-1965. The whole world of character-escaping problems that we programmers deal with in order to support CSV/TSV could have been avoided!
>> 
>> Eli
>> 
>> On 22 Oct 2017, at 12:22, Rich Roth wrote:
>> 
>> Since I do a lot of text handling for a number of projects, I'll add a few more comments:
>> 
>> 1) programmed (perl,sed,script or saved regex) vs find & replace.
>> I find any repeatable method far better then find/replace for a number of reasons.
>> David didn't mention if this is a one time need or repeating, clearly repeating requires more of a saved method.
>> Even with a one time need, F/R has a fatal flaw if you pick a bad pattern or just mis-type, while using a saved technique
>> you can test your method until it's right.  
>> A few comments about unexpected variations in data re-enforce this idea.
>> 
>> 2) Create Tab delimited vs CSV
>> He didn't say which spreadsheet, Excel and most others will accept tab delimited and using tabs does reduce a variety of bumps that extra commas produce.
>> 3) OpenRefine
>> Leave it to HT (thanks Steve) to add a tangent idea of use to others.  I am working on a variety of text processing tasks, using OCR and various scripts and that looks to be a useful tools.  
>> 
>> In one case, I have processed some 15 data sources into a common display system for Shaker community members over the 200 years of 17 communities and some 15,000 members.  I still am working on a OCR of a 1970 microfilm data set of 16,000 more entries. You can see some of the results at: http://memoirs.shakerpedia.com/
>> If any Shaker aficionados on HT, any help is welcome on that or http://shakerpedia.com/ in general.
>> 
>> 4) If anyone has such conversion/scanning projects for community groups, esp historical society, please contact me.
>> We are now doing some work for ours: Historical society of Greenfield.
>> 
>> Good luck to David - Rich
>> 
>>> On 10/22/2017 6:21 AM, Steven Brewer wrote:
>>> I see people have made all the obvious suggestions. Let me add that
>>> NeoOffice can do search and replace with regular expressions.
>>> 
>>> But folks should also be aware of OpenRefine: It's a tool for taking
>>> messy data sets and cleaning them up. It's perhaps overkill for
>>> something like this, but maybe not: It has a bunch of tools for
>>> identifying classes of problems (like those that crop up with dodgy OCR)
>>> and being able to correct them all at once. It's worth being aware of
>>> anyway.
>>> 
>>> Good luck!
>>> 
>>> On 10/21/17 7:34 AM, David Greenberg wrote:
>>>> I have a hard copy list of names, addresses and phone numbers. I can
>>>> scan to PDF and then copy and paste to a text editor (BBEdit) or other
>>>> file. I then need to manipulate the text so that I end up with a csv
>>>> file that can be opened by a spreadsheet program. Tools that I have at
>>>> my disposal include BBEdit (with Grep), a MAMP stack, NeoOffice (Mac
>>>> version of OpenOffice) and FileMaker.
>>>> 
>>>> Input looks like this:
>>>> 
>>>> John Doe
>>>> (413) 111-1111
>>>> 123 First St Greenfield 01301
>>>> Jane Smith
>>>> 456 So Main Ln Greenfield 01301
>>>> Jane Ann Smith
>>>> (413) 222-2222
>>>> 78 Main Ct Greenfield 01301
>>>> 
>>>> Note that all addresses will include 'Greenfield 01301' and, /if /the
>>>> data includes a phone number, it will start with '(413)'.
>>>> 
>>>> Output should look like this:
>>>> 
>>>> John,Doe,(413) 111-1111,123 First St,Greenfield,01301
>>>> Jane,Smith,,456 So Main Ln,Greenfield,01301
>>>> Jane Ann,Smith,(413) 222-2222,78 Main Ct,Greenfield,01301
>>>> 
>>>> Any suggestions greatly appreciated. Thanks.
>>>> 
>>>> David
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Hidden-discuss mailing list - home page: http://www.hidden-tech.net
>>>> Hidden-discuss at lists.hidden-tech.net
>>>> 
>>>> You are receiving this because you are on the Hidden-Tech Discussion list.
>>>> If you would like to change your list preferences, Go to the Members   
>>>> page on the Hidden Tech Web site.
>>>> http://www.hidden-tech.net/members
>>>> 
>> 
>> -- 
>> Rich Roth
>> Webmaster/Steering Committee Member
>> Hidden-tech http://www.hidden-tech.net
>> The Talent you need is right here,
>> Join and share your skills
>> ((Sponsored by Thrives Media))
>> http://www.thrivesmedia.com
>> http://www.welovemuseums.com
>> _______________________________________________
>> Hidden-discuss mailing list - home page: http://www.hidden-tech.net
>> Hidden-discuss at lists.hidden-tech.net
>> 
>> You are receiving this because you are on the Hidden-Tech Discussion list.
>> If you would like to change your list preferences, Go to the Members
>> page on the Hidden Tech Web site.
>> http://www.hidden-tech.net/members
>> 
>> 
>> _______________________________________________
>> Hidden-discuss mailing list - home page: http://www.hidden-tech.net
>> Hidden-discuss at lists.hidden-tech.net
>> 
>> You are receiving this because you are on the Hidden-Tech Discussion list.
>> If you would like to change your list preferences, Go to the Members
>> page on the Hidden Tech Web site.
>> http://www.hidden-tech.net/members
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.hidden-tech.net/pipermail/hidden-discuss/attachments/20171023/036cf733/attachment-0001.html 


Google

More information about the Hidden-discuss mailing list