12-18-2008
Removing Embedded Newline from Delimited File
Hey there. First off, a bit of background on what I'm trying to accomplish: I am trying to load the data from a pipe-delimited file into a database. The loading tool that I use cannot handle embedded newline characters within a field, so I need to scrub them out.
Solutions that I have tried so far:
1) From a thread here in 2005:
{
    record = record $0
    if (gsub(/"/, "&", record) % 2) {
        record = record " "
        next
    }
}
{
    print record
    record = ""
}
Problems - This worked beautifully on the test data. Then it was working just fine on the main data... until I received the following error:
"The result [...] of the gsub function cannot be longer than 3000 bytes."
Did I mention that the field with embedded newline characters is going to be loaded as a character large object into the database? Granted, it's only going to be about 6k at max, but that's still more than gsub can handle.
Another note - the test data didn't have any embedded double-quotes. I doubt that this would cause a problem, but in the interest of full disclosure, I should state it.
2) Monster regex:
sed -n '
H
g
s/\n//g
h
/^"\(\(""\)*[^"]*\)*"\(;"\(\(""\)*[^"]*\)*"\)*$/{p;s/.*//g;h;d;}
$p
' filename
Problem - this removes EVERY newline from the file, not just the embedded ones. Definitely can't use this to load the data. Plus, it takes a pretty substantial chunk of CPU to run through it.
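One possible way around the gsub limit in attempt 1, sketched here rather than tested on the real data: count the double quotes on each physical line only and keep a running parity, so gsub never sees the whole accumulated record. The function name `join_quoted` and the variable `inside` are made up for illustration. The sketch assumes every embedded newline sits inside a double-quoted field, that literal quotes inside a field are doubled (""), and that no single physical line is itself over the 3000-byte limit. The string concatenation could still hit some other limit in an older awk.

```shell
# join_quoted: hypothetical helper name, not from the original thread.
# Reads the named files (or stdin) and joins physical lines that belong
# to one record, replacing each embedded newline with a space.
join_quoted() {
  awk '
    {
      # Count the double quotes on this physical line only. gsub() on a
      # copy returns the number of matches and leaves $0 untouched.
      line = $0
      n = gsub(/"/, "&", line)
      record = record $0
      # An odd running total means we are still inside a quoted field.
      inside = (inside + n) % 2
      if (inside) {
        record = record " "
        next
      }
      print record
      record = ""
    }
    END {
      # Flush a trailing record with an unbalanced quote instead of dropping it.
      if (record != "") print record
    }
  ' "$@"
}
```

Usage would be something like `join_quoted filename > cleaned`. Doubled quotes add two to the count, so they do not disturb the parity.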
Other issues with the data:
The "good" newlines at the end of each record are in the same format as the embedded newlines. The FTP client that they use must auto-apply dos2unix when it detects a "text" filetype.
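Since the good and the embedded newlines look identical, another option is to lean on the record structure instead of the bytes: keep joining lines until a record has the expected number of pipe delimiters. This is only a sketch. The name `join_by_fields` and the column count passed to it are assumptions, not something taken from the actual data. It also assumes that no field contains a literal pipe, and it cannot catch a newline embedded in the last field, because the record already looks complete at that point.

```shell
# join_by_fields: hypothetical helper. The first argument is the number
# of columns a complete record is assumed to have; any remaining
# arguments are input files (stdin is used when there are none).
join_by_fields() {
  nfields=$1
  shift
  awk -F'|' -v want="$nfields" '
    {
      # Glue this line onto an unfinished record, using a space in
      # place of the embedded newline.
      if (pending) record = record " " $0
      else         record = $0
      # NF - 1 is the number of pipe delimiters on this physical line.
      pipes += (NF > 0) ? NF - 1 : 0
      if (pipes < want - 1) {
        pending = 1
        next
      }
      print record
      record = ""
      pipes = 0
      pending = 0
    }
    END {
      # Emit a short trailing record instead of silently losing it.
      if (pending) print record
    }
  ' "$@"
}
```

For a four-column file the call would be `join_by_fields 4 filename > cleaned`.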
Any help with this would be appreciated. If you need me to clear anything up, let me know.
-Brandon