[Hidden-tech] Text Manipulation Problem

Rich Roth webmaster at hidden-tech.net
Sun Oct 22 12:22:23 EDT 2017


Since I do a lot of text handling for a number of projects, I'll add a 
few more comments:

1) *Programmed (perl, sed, script or saved regex) vs find & replace.*
I find any repeatable method far better than find/replace for a number 
of reasons.
David didn't mention whether this is a one-time need or a repeating one; 
a repeating need clearly calls for a saved method.
Even with a one-time need, F/R has a fatal flaw if you pick a bad 
pattern or just mistype, while with a saved technique
you can rerun and test your method until it's right.
The few comments in this thread about unexpected variations in the data 
reinforce this idea.
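
A minimal sketch of such a saved method, written in Python purely for 
illustration (perl or sed would serve equally well). It assumes exactly 
what David described: every address ends in 'Greenfield 01301', any phone 
line starts with '(413)', and the last word of a name line is the surname. 
The names parse_records, to_csv and SAMPLE are made up for this example.

```python
import csv
import io
import re

# Assumption from David's post: every address line ends in "Greenfield 01301".
ADDRESS_RE = re.compile(r"^(?P<street>.+?)\s+(?P<city>Greenfield)\s+(?P<zip>01301)$")

SAMPLE = """\
John Doe
(413) 111-1111
123 First St Greenfield 01301
Jane Smith
456 So Main Ln Greenfield 01301
Jane Ann Smith
(413) 222-2222
78 Main Ct Greenfield 01301
"""


def parse_records(text):
    """Group scanned lines into [first, last, phone, street, city, zip] rows."""
    rows = []
    name, phone = None, ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        address = ADDRESS_RE.match(line)
        if address:
            # An address line closes the current record.
            if name is None:
                raise ValueError(f"address without a name: {line!r}")
            # Assumption: the last word is the surname, the rest are given names.
            first, _, last = name.rpartition(" ")
            rows.append([first, last, phone,
                         address["street"], address["city"], address["zip"]])
            name, phone = None, ""
        elif line.startswith("(413)"):
            phone = line
        else:
            name = line
    return rows


def to_csv(rows):
    """Serialize rows as CSV text; the csv module quotes fields only when needed."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


print(to_csv(parse_records(SAMPLE)), end="")
```

Run against the sample above, this prints the three lines David asked for. 
The point is less the script itself than that it can be rerun on the 
untouched scan each time a new variation in the data turns up.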

2) *Create tab-delimited vs CSV*
He didn't say which spreadsheet; Excel and most others will accept 
tab-delimited files, and using tabs avoids a variety of bumps that extra 
commas produce.
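
To illustrate the kind of bump meant here, a small Python sketch. The 
"Apt 2" suffix is invented for the example (it is not in David's data) 
just to put a comma inside a field; write_row is a hypothetical helper.

```python
import csv
import io

# Hypothetical row: ", Apt 2" is invented to show a comma inside a field.
row = ["John", "Doe", "(413) 111-1111", "123 First St, Apt 2", "Greenfield", "01301"]


def write_row(fields, delimiter):
    """Write one row with Python's csv module using the given delimiter."""
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerow(fields)
    return out.getvalue()


naive_csv = ",".join(row)        # hand-built CSV, no quoting
csv_line = write_row(row, ",")   # csv module quotes the address field
tsv_line = write_row(row, "\t")  # tab-delimited, nothing needs quoting

print(len(naive_csv.split(",")))               # 7 columns: the address was split
print(len(tsv_line.rstrip("\n").split("\t")))  # 6 columns, as intended
print(csv_line, end="")
```

With commas, the field has to be quoted (as the csv module does) or the 
row gains a column; with tabs, the same row goes through untouched, as 
long as the data itself contains no tabs.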

3) *OpenRefine*
Leave it to HT (thanks Steve) to add a tangential idea of use to others.  I 
am working on a variety of text processing tasks, using OCR and various 
scripts, and that looks to be a useful tool.

In one case, I have processed some 15 data sources into a common display 
system for Shaker community members, covering 17 communities and some 
15,000 members over 200 years.  I am still working on an OCR of a 1970 
microfilm data set of 16,000 more entries. You can see some of the 
results at: http://memoirs.shakerpedia.com/
If there are any Shaker aficionados on HT, any help is welcome on that or 
on http://shakerpedia.com/ in general.

4) If anyone has such conversion/scanning projects for community groups, 
especially historical societies, please contact me.
We are now doing some work for ours: the Historical Society of Greenfield.

Good luck to David - Rich

On 10/22/2017 6:21 AM, Steven Brewer wrote:
> I see people have made all the obvious suggestions. Let me add that
> NeoOffice can do search and replace with regular expressions.
>
> But folks should also be aware of OpenRefine: It's a tool for taking
> messy data sets and cleaning them up. It's perhaps overkill for
> something like this, but maybe not: It has a bunch of tools for
> identifying classes of problems (like those that crop up with dodgy OCR)
> and being able to correct them all at once. It's worth being aware of
> anyway.
>
> Good luck!
>
> On 10/21/17 7:34 AM, David Greenberg wrote:
>> I have a hard copy list of names, addresses and phone numbers. I can
>> scan to PDF and then copy and paste to a text editor (BBEdit) or other
>> file. I then need to manipulate the text so that I end up with a csv
>> file that can be opened by a spreadsheet program. Tools that I have at
>> my disposal include BBEdit (with Grep), a MAMP stack, NeoOffice (Mac
>> version of OpenOffice) and FileMaker.
>>
>> Input looks like this:
>>
>> John Doe
>> (413) 111-1111
>> 123 First St Greenfield 01301
>> Jane Smith
>> 456 So Main Ln Greenfield 01301
>> Jane Ann Smith
>> (413) 222-2222
>> 78 Main Ct Greenfield 01301
>>
>> Note that all addresses will include 'Greenfield 01301' and, /if/ the
>> data includes a phone number, it will start with '(413)'.
>>
>> Output should look like this:
>>
>> John,Doe,(413) 111-1111,123 First St,Greenfield,01301
>> Jane,Smith,,456 So Main Ln,Greenfield,01301
>> Jane Ann,Smith,(413) 222-2222,78 Main Ct,Greenfield,01301
>>
>> Any suggestions greatly appreciated. Thanks.
>>
>> David
>>

-- 
Rich Roth
Webmaster/Steering Committee Member
Hidden-tech http://www.hidden-tech.net
The Talent you need is right here,
Join and share your skills
((Sponsored by Thrives Media))
http://www.thrivesmedia.com
http://www.welovemuseums.com
