[Hidden-tech] Text Manipulation Problem

Robert Heller heller at deepsoft.com
Sun Oct 22 18:08:40 EDT 2017


At Sun, 22 Oct 2017 14:39:10 -0400 "Elijah Gwynn" <eli at egwynn.com> wrote:

> 
> 
> 
> Regarding tabs-vs-commas: it's a real tragedy that more programs don't 
> make use of any of the *four* ASCII delimiter characters 
> (https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text) that have 
> been available since ASCII-1965. The whole world of character-escaping 
> problems that we programmers deal with in order to support CSV/TSV could 
> have been avoided!
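For anyone who has not seen those delimiters in use, here is a minimal sketch in Python. It assumes only two of the four are needed (unit separator 0x1F between fields, record separator 0x1E between records); the helper names `dumps` and `loads` and the sample rows are made up for illustration.

```python
# ASCII control characters reserved for delimiting data.
UNIT_SEP = "\x1f"    # US: separates fields within a record
RECORD_SEP = "\x1e"  # RS: separates records


def dumps(rows):
    """Join rows of string fields into one string.

    No quoting or escaping is done; this only works as long as no
    field contains either of the two control characters.
    """
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)


def loads(text):
    """Split ASCII-delimited text back into a list of field lists."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Hypothetical sample data: commas, quotes and newlines need no escaping.
rows = [
    ["Doe, John", 'said "hi"', "line one\nline two"],
    ["Jane", "Smith", ""],
]
encoded = dumps(rows)
decoded = loads(encoded)
assert decoded == rows
```

The catch, of course, is that few editors or spreadsheets display or accept these characters, which is probably why CSV and TSV won out.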

You do know that an apostrophe is not really an apostrophe and a double quote 
mark is not really a double quote mark?  That is, the ASCII apostrophe (') and 
ASCII double quote mark (") seem to be avoided like the plague.  Many "modern" 
programs (mailers, word processors, etc.) want to use the alternate characters 
off in the world beyond 0x7f...  I get that all of the time. :-)
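If those characters end up in data you need to process, one way to get back to plain ASCII is a translation table. This is only a sketch: it assumes the text has already been decoded to Unicode, it covers just the four most common typographic quotes, and the names `QUOTE_MAP` and `ascii_quotes` are made up for illustration.

```python
# Map common typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (often used as apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def ascii_quotes(text):
    """Replace curly quotes with the plain ASCII ' and " characters."""
    return text.translate(QUOTE_MAP)


sample = "\u201cDon\u2019t\u201d"
cleaned = ascii_quotes(sample)
assert cleaned == '"Don\'t"'
```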

> 
> Eli
> 
> On 22 Oct 2017, at 12:22, Rich Roth wrote:
> 
> > Since I do a lot of text handling for a number of projects, I'll add a 
> > few more comments:
> >
> > 1) *Programmed (Perl, sed, script or saved regex) vs find & replace.*
> > I find any repeatable method far better than find/replace for a number 
> > of reasons.
> > David didn't mention whether this is a one-time need or a repeating 
> > one; clearly a repeating need calls for a saved method.
> > Even with a one-time need, F/R has a fatal flaw if you pick a bad 
> > pattern or just mistype, while with a saved technique
> > you can test your method until it's right.
> > A few comments about unexpected variations in the data reinforce this 
> > idea.
> >
> > 2) *Create tab-delimited vs CSV*
> > He didn't say which spreadsheet; Excel and most others will accept 
> > tab-delimited files, and using tabs avoids a variety of bumps that 
> > extra commas produce.
> >
> > 3) *OpenRefine*
> > Leave it to HT (thanks Steve) to add a tangential idea of use to 
> > others.  I am working on a variety of text processing tasks, using 
> > OCR and various scripts, and that looks to be a useful tool.
> >
> > In one case, I have processed some 15 data sources into a common 
> > display system for Shaker community members over the 200 years of 17 
> > communities and some 15,000 members.  I am still working on an OCR of 
> > a 1970 microfilm data set of 16,000 more entries. You can see some of 
> > the results at: http://memoirs.shakerpedia.com/
> > If there are any Shaker aficionados on HT, any help is welcome on that 
> > or on http://shakerpedia.com/ in general.
> >
> > 4) If anyone has such conversion/scanning projects for community 
> > groups, esp. historical societies, please contact me.
> > We are now doing some work for ours: the Historical Society of Greenfield.
> >
> > Good luck to David - Rich
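To make the tabs-vs-commas point concrete, here is a small sketch using Python's standard `csv` module. The row is hypothetical (a name written as "Smith, Jane" was not in David's sample) and `render` is just an illustrative helper.

```python
import csv
import io


def render(rows, delimiter):
    """Write rows with the given delimiter and return the text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical row whose first field contains a comma.
rows = [["Smith, Jane", "456 So Main Ln", "Greenfield", "01301"]]

as_csv = render(rows, ",")   # the comma forces the first field to be quoted
as_tsv = render(rows, "\t")  # no field contains a tab, so nothing is quoted

assert as_csv == '"Smith, Jane",456 So Main Ln,Greenfield,01301\n'
assert as_tsv == "Smith, Jane\t456 So Main Ln\tGreenfield\t01301\n"
```

Tabs only move the problem, though: a field that itself contains a tab would still have to be quoted.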
> >
> > On 10/22/2017 6:21 AM, Steven Brewer wrote:
> >> I see people have made all the obvious suggestions. Let me add that
> >> NeoOffice can do search and replace with regular expressions.
> >>
> >> But folks should also be aware of OpenRefine: It's a tool for taking
> >> messy data sets and cleaning them up. It's perhaps overkill for
> >> something like this, but maybe not: It has a bunch of tools for
> >> identifying classes of problems (like those that crop up with dodgy
> >> OCR) and being able to correct them all at once. It's worth being
> >> aware of
> >> anyway.
> >>
> >> Good luck!
> >>
> >> On 10/21/17 7:34 AM, David Greenberg wrote:
> >>> I have a hard copy list of names, addresses and phone numbers. I can
> >>> scan to PDF and then copy and paste to a text editor (BBEdit) or other
> >>> file. I then need to manipulate the text so that I end up with a csv
> >>> file that can be opened by a spreadsheet program. Tools that I have at
> >>> my disposal include BBEdit (with Grep), a MAMP stack, NeoOffice (Mac
> >>> version of OpenOffice) and FileMaker.
> >>>
> >>> Input looks like this:
> >>>
> >>> John Doe
> >>> (413) 111-1111
> >>> 123 First St Greenfield 01301
> >>> Jane Smith
> >>> 456 So Main Ln Greenfield 01301
> >>> Jane Ann Smith
> >>> (413) 222-2222
> >>> 78 Main Ct Greenfield 01301
> >>>
> >>> Note that all addresses will include 'Greenfield 01301' and, /if/ the
> >>> data includes a phone number, it will start with '(413)'.
> >>>
> >>> Output should look like this:
> >>>
> >>> John,Doe,(413) 111-1111,123 First St,Greenfield,01301
> >>> Jane,Smith,,456 So Main Ln,Greenfield,01301
> >>> Jane Ann,Smith,(413) 222-2222,78 Main Ct,Greenfield,01301
> >>>
> >>> Any suggestions greatly appreciated. Thanks.
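One possible scripted approach, sketched in Python rather than with the tools David listed. It leans on the two rules he gave (phone lines start with '(413)', address lines end with 'Greenfield 01301') and adds assumptions of its own: each record is a name line, an optional phone line, then an address line, and the last word of the name is the surname. The function names are made up, and real OCR output would likely need cleanup first.

```python
import csv
import io

CITY, ZIP_CODE = "Greenfield", "01301"
SUFFIX = f" {CITY} {ZIP_CODE}"

SAMPLE = """John Doe
(413) 111-1111
123 First St Greenfield 01301
Jane Smith
456 So Main Ln Greenfield 01301
Jane Ann Smith
(413) 222-2222
78 Main Ct Greenfield 01301
"""


def parse(text):
    """Group name / optional phone / address lines into records."""
    records = []
    name, phone = "", ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("(413)"):
            phone = line
        elif line.endswith(SUFFIX):
            # The address line closes the current record.
            street = line[: -len(SUFFIX)]
            # Assumption: the last word of the name is the surname.
            first, _, last = name.rpartition(" ")
            records.append([first, last, phone, street, CITY, ZIP_CODE])
            name, phone = "", ""
        else:
            name = line
    return records


def to_csv(records):
    """Serialize records as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(records)
    return buf.getvalue()


print(to_csv(parse(SAMPLE)), end="")
```

On the sample input this prints the three lines David asked for. Surnames of more than one word, suffixes such as "Jr.", or a misread city name would all break the assumptions above, which is a good argument for keeping the script around and re-running it as the data gets corrected.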
> >>>
> >>> David
> >>>
> >>>
> >>> _______________________________________________
> >>> Hidden-discuss mailing list - home page: http://www.hidden-tech.net
> >>> Hidden-discuss at lists.hidden-tech.net
> >>>
> >>> You are receiving this because you are on the Hidden-Tech Discussion 
> >>> list.
> >>> If you would like to change your list preferences, Go to the Members
> >>> page on the Hidden Tech Web site.
> >>> http://www.hidden-tech.net/members
> >>>
> >
> > -- 
> > Rich Roth
> > Webmaster/Steering Committee Member
> > Hidden-tech http://www.hidden-tech.net
> > The Talent you need is right here,
> > Join and share your skills
> > ((Sponsored by Thrives Media))
> > http://www.thrivesmedia.com
> > http://www.welovemuseums.com
> 

-- 
Robert Heller             -- 978-544-6933
Deepwoods Software        -- Custom Software Services
http://www.deepsoft.com/  -- Linux Administration Services
heller at deepsoft.com       -- Webhosting Services
             

