Converting txt file into CSV using awk or sed


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Converting txt file into CSV using awk or sed
# 1  
Old 07-15-2012
Converting txt file into CSV using awk or sed

Hello folks

I have a txt file of information about journal articles from different fields. I need to convert this information into a format that is easier for computers to manipulate for some research that I'm doing on how articles are cited. The file has some header information and then details of records. For example,
Code:
Tue Jun 19 14:07:34 EDT 2012
CSA
Database: EconLit

Record 1 of 500

DN: Database Name
    EconLit
TI: Title
    Statistical Modeling of Monetary Policy and Its Effects
AU: Author
    Sims, Christopher A
SO: Source
    American Economic Review, vol. 102, no. 4, June 2012, pp. 1187-1205
DE: Descriptors
    History of Economic Thought: Macroeconomics (B220); Economic
    Methodology (B410); Methodological Issues: General (C180); Business
    Fluctuations, Cycles (E320); Prices, Business Fluctuations, and
    Cycles: Forecasting and Simulation: Models and Applications (E370);
    Monetary Policy (E520); Modeling; Monetary; Monetary Policy; Policy
PY: Publication Year
    2012

Record 2 of 500

DN: Database Name
    EconLit
TI: Title
    Targeting the Poor: Evidence from a Field Experiment in Indonesia
AU: Author
    Alatas, Vivi; Banerjee, Abhijit; Hanna, Rema; Olken, Benjamin A;
    Tobias, Julia
SO: Source
    American Economic Review, vol. 102, no. 4, June 2012, pp. 1206-40
DE: Descriptors
    Field Experiments (C930); Measurement and Analysis of Poverty (I320);
    Welfare and Poverty: Government Programs, Provision and Effects of
    Welfare Programs (I380); Microeconomic Analyses of Economic
    Development (O120); Economic Development: Human Resources, Human
    Development, Income Distribution, Migration (O150); Economic
    Development: Urban, Rural, Regional, and Transportation Analysis,
    Housing, Infrastructure (O180); Urban, Rural, Regional, and
    Transportation Economics: Regional Migration, Regional Labor Markets,
    Population, Neighborhood Characteristics (R230); Indonesia; Asia;
    Experiment; Experiments; Field Experiment; Poor; Poverty; Village
PY: Publication Year
    2012

.
.
.

My goal is to convert this information into CSV format like so:
Code:
"TITLE","AUTHOR(S)","SOURCE","DESCRIPTOR CODES ONLY",PUBLICATION

So the above should turn into
Code:
"Statistical Modeling of Monetary Policy and Its Effects","Sims, Christopher A","American Economic Review, vol. 102, no. 4, June 2012, pp. 1187-1205","B220,B410,C180,E320,E370,E520",2012
"Targeting the Poor: Evidence from a Field Experiment in Indonesia","Alatas, Vivi; Banerjee, Abhijit; Hanna, Rema; Olken, Benjamin A; Tobias, Julia","American Economic Review, vol. 102, no. 4, June 2012, pp. 1206-40","C930,I320,I380,O120,O150,O180,R230",2012

Note that there are some descriptors that do not have codes, (eg. `Modeling' at the end of the first record). The code needs to drop those descriptors and only include the 4 character/numeric codes in brackets.

I am certain this is a fairly simple task for either awk or sed, but I don't know either with the proficiency that I should. I'd be grateful if someone can out with this.

Last edited by Scrutinizer; 07-15-2012 at 02:58 PM.. Reason: quote tags => code tags plus extra
# 2  
Old 07-15-2012
One way:

Code:
awk '
    function printit( )
    {
        if( stuff["auth"] )
        {
            sub( ",$", "",  stuff["des"] );
            for( x in stuff )
                if( index( stuff[x], "," ) )            # must quote if it has commas
                    stuff[x] = "\"" stuff[x] "\"";
            printf( "%s,%s,%s,%s,%s\n", stuff["title"], stuff["auth"], stuff["source"], stuff["des"], stuff["year"] );
        }
    }

    /^Record/ { printit(); delete stuff; n = ""; next; }
    /TI:/ { n = "title"; next; }
    /DN:/ { n = "dname"; next; }
    /AU:/ { n = "auth"; next; }
    /DE:/ { n = "des"; next; }
    /PY:/ { n = "year"; next; }
    /SO:/ { n = "source"; next; }
    /^..:/ { n = ""; next; }

    n == "" { next; }

    n == "des" {
        split( $0, a, ";" );
        for( i = 1; i <= length( a ); i++ )
        {
            if( substr( a[i], length(a[i]) ) == ")" )
            {
                l = split( a[i], b, " " );
                stuff[n] = stuff[n] substr( b[l], 2, length( b[l] ) - 2 ) ",";
            }
        }
        next;
    }

    {
        sub( "^ +", "" );
        stuff[n] = stuff[n] $0 " "; 
        next; 
    }

    END { printit(); }

' input-file >output.csv

Assumption made: descriptors were always the last token before the semicolon, and in parens.

Last edited by agama; 07-15-2012 at 04:32 PM.. Reason: formatting
# 3  
Old 07-15-2012
Hi agama, I just gave your code a try. It says
Code:
awk: line 27: illegal reference to array a

.
# 4  
Old 07-15-2012
What version of awk are you running? (run `awk --version` to see)

---------- Post updated at 15:45 ---------- Previous update was at 15:42 ----------

This might be a fix (assuming it was length( a ) that awk was complaining about).

Code:
awk '
    function printit( )
    {
        if( stuff["auth"] )
        {
            sub( ",$", "",  stuff["des"] );
            for( x in stuff )
                if( index( stuff[x], "," ) )            # must quote if it has commas
                    stuff[x] = "\"" stuff[x] "\"";
            printf( "%s,%s,%s,%s,%s\n", stuff["title"], stuff["auth"], stuff["source"], stuff["des"], stuff["year"] );
        }
    }

    /^Record/ { printit(); delete stuff; n = ""; next; }
    /TI:/ { n = "title"; next; }
    /DN:/ { n = "dname"; next; }
    /AU:/ { n = "auth"; next; }
    /DE:/ { n = "des"; next; }
    /PY:/ { n = "year"; next; }
    /SO:/ { n = "source"; next; }
    /^..:/ { n = ""; next; }

    n == "" { next; }

    n == "des" {
        al = split( $0, a, ";" );
        for( i = 1; i <= al; i++ )
        {
            if( substr( a[i], length(a[i]) ) == ")" )
            {
                l = split( a[i], b, " " );
                stuff[n] = stuff[n] substr( b[l], 2, length( b[l] ) - 2 ) ",";
            }
        }
        next;
    }

    {
        sub( "^ +", "" );
        stuff[n] = stuff[n] $0 " ";
         next; 
    }

    END { printit(); }

'


Last edited by agama; 07-15-2012 at 04:46 PM.. Reason: formatting again
# 5  
Old 07-15-2012
'awk --version' gives 'mawk: not an option: --version'. 'man awk' gives the man page for mawk. Should I get something else?

---------- Post updated at 01:19 AM ---------- Previous update was at 01:16 AM ----------

Hi agama, the fix makes it work almost perfectly. Oddly, the CSV file has the title in quotes sometimes but not other times, and it doesn't seem to follow any pattern. Should I attach the txt file and csv for reference?

---------- Post updated at 01:28 AM ---------- Previous update was at 01:19 AM ----------

Ah, some of the title entries have quotes because they have commas in them. Can I modify the code so it puts all title entries in quotes by default? I tried removing
Code:
if( index( stuff[x], "," ) )            # must quote if it has commas

. That puts quotes on everything which is fine except then I need the quotes to disappear on the date field.
# 6  
Old 07-15-2012
Very good.

Adding quotes regardless, except for the date field, with a small change to the print function:

Code:
function printit( )
    {
        if( stuff["auth"] )
        {
            sub( ",$", "",  stuff["des"] );
            printf( "\"%s\",\"%s\",\"%s\",\"%s\",%s\n", stuff["title"], stuff["auth"], stuff["source"], stuff["des"], stuff["year"] );
        }
    }


Justs adds quotes to all things in the printf(), so it's a bit more efficient this way too.
This User Gave Thanks to agama For This Post:
# 7  
Old 07-15-2012
Thanks agama! I wonder if I can get your advice on wrapping this all up into a nice shell script. I have several of these text files that are generated through another script and land up in a folder. Would it be possible for me to write a .sh file that performs this awk operation on each text file in the directory that I'm in, and append everything into a large CSV file?
Once I'm done with this, there is one last step, but it might be easier to explain in a separate thread or perhaps once I am to get this master CSV file.

Thanks again for you help!
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Awk, sed, shell all words in INPUT.txt find in column1 of TABLE.txt and replce with column2 in

Hi dears i have text file like this: INPUT.txt 001_1_173 j nuh ]az 001_1_174 j ]esma. nuh ]/.xori . . . and have another text like this TABLE.txt j j nuh word1... (6 Replies)
Discussion started by: alii
6 Replies

2. Shell Programming and Scripting

Using awk for converting xml to txt

Hi, I have a xml script, I converted it to .txt with values comma seperated using awk function. But I want the output values should be inside double quotes My xml script (Workorders.xml) is shown like below: <?xml version="1.0" encoding="utf-8" ?> <scbm-extract version="3.3">... (8 Replies)
Discussion started by: Viswanatheee55
8 Replies

3. Shell Programming and Scripting

awk to print value from txt file to csv

Hi, I want to print two columns from a .txt file to a .csv file using awk. data in text file: Application -------------------------------------------------- ----------- OS Related Issues 1 EMEA Solutions ... (8 Replies)
Discussion started by: prashu_g
8 Replies

4. Shell Programming and Scripting

Using csh / awk / sed to compare database sizes in a txt file

Hello, I have an output file showing database sizes across the 3 environments that I use (LIVE, TEST & DEVELOPMENT). I am trying to write a script that lets me know if the size of a db on one environment is different to its corresponding db on the other environments. Here is an example... (4 Replies)
Discussion started by: stevie_g
4 Replies

5. Shell Programming and Scripting

awk/sed/something else for csv file

Hi, I have a filename.csv in which there are 3 colums, ie: Name ; prefixnumber ; number root ; 020 ; 1234567 user1,2,3 ; 070 ; 7654321 What I want is to merge colum 2 and 3 that it becomes 0201234567 or even better +31201234567 so the country number is used and drop the leading 0.... (9 Replies)
Discussion started by: necron
9 Replies

6. Shell Programming and Scripting

Converting txt file in csv

HI All, I have a text file memory.txt which has following values. Average: 822387 7346605 89.93 288845 4176593 2044589 51883 2.47 7600 i want to convert this file in csv format and i am using following command to do it. sed s/_/\./g <... (3 Replies)
Discussion started by: mkashif
3 Replies

7. Shell Programming and Scripting

Using awk/sed in handling csv file.

Please study the below script and the output Script: echo "Minimum ${host} ${process} response time=${min} ms" >> ${OUTDIR}/${OUTFILE}; echo "Maximum ${host} ${process} response time=${max} ms" >> ${OUTDIR}/${OUTFILE}; echo "Average ${host} ${process} response time=${avg} ms" >>... (0 Replies)
Discussion started by: ajincoep
0 Replies

8. UNIX for Dummies Questions & Answers

Converting txt file to csv file

Hi, Using rsync, I've sent the output to a text file. This is the text file : Pls help me on converting this text file to a csv file. Probably a script or sth to convert the text file to a csv file. (3 Replies)
Discussion started by: anaigini45
3 Replies

9. Shell Programming and Scripting

converting xls file to txt file and xls to csv

I need to convert an excel file into a text file and an excel file into a CSV file.. any code to do that is appreciated thanks (6 Replies)
Discussion started by: bandar007
6 Replies

10. Shell Programming and Scripting

AWK CSV to TXT format, TXT file not in a correct column format

HI guys, I have created a script to read 1 column in a csv file and then place it in text file. However, when i checked out the text file, it is not in a column format... Example: CSV file contains name,age aa,11 bb,22 cc,33 After using awk to get first column TXT file... (1 Reply)
Discussion started by: mdap
1 Replies
Login or Register to Ask a Question