Tabulate data in a table.


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Tabulate data in a table.
# 1  
Old 10-19-2016
Tabulate data in a table.

I have about a million tables stored as .txt files in a directory. The content of the files look like this:
Code:

ages-eligible-for-study:
genders-eligible-for-study:
accepts-healthy-volunteers:
CONTINUE

18-months-to-36-months
both
yes
CONTINUE
DONE

Other examples,
Code:


CONTINUE
non-electric-grass-pitch
electric-grass-pitch
towan-two-bedroom-holiday-home
crantock-two-bedroom-apartment
CONTINUE
10-x
10-x
6-x
4-x
CONTINUE
DONE

My aim is to change the format of text in each file to a tabular form. The output that I expect is:
Code:
ages-eligible-for-study: 18-months-to-36-months
genders-eligible-for-study: both
accepts-healthy-volunteers:yes

Code:
non-electric-grass-pitch 10-x
electric-grass-pitch 10-x
towan-two-bedroom-holiday-home 6-x
crantock-two-bedroom-apartment 4-x

Note that above is an example with just two columns, there can be tables more than two columns, like the one below:

Code:
size
waist-(on-half)
hip-(on-half)
inseam
CONTINUE
alt
14"
24-3/4"
34"
CONTINUE
axlt
15"
26-3/4"
34-1/2"
CONTINUE
a2xlt
16"
28-3/4"
35"
CONTINUE
a3xlt
17"
30-3/4"
35-1/2"
CONTINUE
DONE

I have written a bash script, which does the following in a for loop for each file:
1. Count number of times CONTINUE occurs in a file:
Code:
num_char=`grep -w 'CONTINUE' filename | wc -l`

2. Save text above CONTINUE in separate temp files using this command, which does not seem to work:
Code:
awk '/CONTINUE/{x="temp_file"++i;next}{print > x;}' $file

3. Delete DONE from the "temp_file"'s using this command:
Code:
sed -i '/DONE/d' ./$temp_file

4. Paste the contents in another file but in a same order as the columns occur in the main file, which I am finding it difficult to do too. I think I need another for loop here:
Code:
paste -d ' ' temp_file1 temp_fil2 > output_file

5. Delete all blank lines from output_file
Code:
awk 'NF' output_file > tmp && mv tmp ./output_file

The above process, as I suspect, will be too slow too on millions of files. I also wrote a Python program, but that is also not giving the desired output. I am using Linux and BASH.
# 2  
Old 10-19-2016
Unfortunately, your files' structure doesn't seem to be clear nor understandable to me. How would the first CONTINUE in file2 influence the desired output? And what should be the output of file3?
# 3  
Old 10-19-2016
Hi,

See this sample running on one input file :

Code:
$ cat cont1db.txt
size
waist-(on-half)
hip-(on-half)
inseam
CONTINUE
alt
14"
24-3/4"
34"
CONTINUE
axlt
15"
26-3/4"
34-1/2"
CONTINUE
a2xlt
16"
28-3/4"
35"
CONTINUE
a3xlt
17"
30-3/4"
35-1/2"
CONTINUE
DONE
# running the script 
$ ./contdb.sh cont1db.txt
Processing : cont1db.txt
# Results :
$ cat cont1db.csv
size;waist-(on-half);hip-(on-half);inseam
alt;14";24-3/4";34"
axlt;15";26-3/4";34-1/2"
a2xlt;16";28-3/4";35"
a3xlt;17";30-3/4";35-1/2"

And the bash script :
Code:
$ cat contdb.sh
#
echo "Processing : $1"
awk '
function joincsv(array,start,end)
{
    result = array[start++]
    for (i = start ; i <= end; i++)
        result = result ";" array[i]
    return result
}
BEGIN { indice = 0 }
/CONTINUE/ {print joincsv(A,1,indice)
            indice=0 ; next}
/DONE/ {exit}
 {A[++indice]=$0}' $1 > $(basename $1 .txt).csv

Now you just have to enhance the script to manage wildcards for all your input files.
This User Gave Thanks to blastit.fr For This Post:
# 4  
Old 10-19-2016
I agree with RudiC that the leading CONTINUE line makes no sense in your 1st sample input file. The following code silently ignores that line. Maybe something like:
Code:
#!/bin/bash
find . -name '*.txt' |
  awk '
  {	path = $0
	for(i in line)
		delete line[i]
	mlc = lc = 0
  	while((rc = (getline < path)) == 1) {
		if(!NF)	continue
		else if($0 == "CONTINUE")
			lc = 0
		else if($0 == "DONE") {
			close(path)
			if(mlc) {
				for(i = 1; i <= mlc; i++)
					print line[i] > path
				printf("%d line table placed in \"%s\".\n", mlc,
				    path)
			} else	printf("No data found in \"%s\".\n", path)
			break
		} else {if(++lc > mlc)
				mlc = lc
			line[lc] = ((line[lc] != "") ? line[lc] " " : "") $0
		}
	}
	if(rc == 0)
		printf("Previously converted \"%s\"?  DONE not found.\n", path)
	else if(rc == -1)
		printf("Error while reading \"%s\"!  No changes made.\n", path)
	close(path)
  }'

will give you something close to what you want???

If your sample input files are named 1.txt, 2.txt, and 3.txt, respectively, running the above code produces the output:
Code:
3 line table placed in "./1.txt".
4 line table placed in "./2.txt".
4 line table placed in "./3.txt".

and changes the contents of 1.txt to:
Code:
ages-eligible-for-study: 18-months-to-36-months
genders-eligible-for-study: both
accepts-healthy-volunteers: yes

2.txt to:
Code:
non-electric-grass-pitch 10-x
electric-grass-pitch 10-x
towan-two-bedroom-holiday-home 6-x
crantock-two-bedroom-apartment 4-x

and 3.txt to:
Code:
size alt axlt a2xlt a3xlt
waist-(on-half) 14" 15" 16" 17"
hip-(on-half) 24-3/4" 26-3/4" 28-3/4" 30-3/4"
inseam 34" 34-1/2" 35" 35-1/2"

If someone else would like to try this code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 10-19-2016
Try also
Code:
awk '
!NF             {next
                }
/^DONE/         {for (i=1; i<=MX; i++) print T[i]
                }
/^CONTINUE/     {n = 0
                 next
                }
                {T[++n] = T[n] DL[n] $0
                 DL[n] = " "
                 MX = n
                }
' file

This works for file1:
Code:
ages-eligible-for-study: 18-months-to-36-months
genders-eligible-for-study: both
accepts-healthy-volunteers: yes

and file2:
Code:
non-electric-grass-pitch 10-x
electric-grass-pitch 10-x
towan-two-bedroom-holiday-home 6-x
crantock-two-bedroom-apartment 4-x

but for the (unspecified) file3 output we have
Code:
size alt axlt a2xlt a3xlt
waist-(on-half) 14" 15" 16" 17"
hip-(on-half) 24-3/4" 26-3/4" 28-3/4" 30-3/4"
inseam 34" 34-1/2" 35" 35-1/2"

This User Gave Thanks to RudiC For This Post:
# 6  
Old 10-19-2016
This
Code:
awk '
BEGIN           {c = 1
                }

!NF             {next
                }

/^DONE/         {for (i=2; i<c; i++) for (j=1; j<=MX; j++) print T[1,j], T[i,j] " > " (FILENAME ".new")
                 delete T
                 delete DL
                 n = 0
                 c = 1
                 next
                }

/^CONTINUE/     {if (n) c++; else c = 1
                 n = 0
                 next
                }

                {T[c,++n] = $0
                 MX = n
                }
' file[123]
ages-eligible-for-study: 18-months-to-36-months > file1.new
genders-eligible-for-study: both > file1.new
accepts-healthy-volunteers: yes > file1.new
non-electric-grass-pitch 10-x > file2.new
electric-grass-pitch 10-x > file2.new
towan-two-bedroom-holiday-home 6-x > file2.new
crantock-two-bedroom-apartment 4-x > file2.new
size alt > file3.new
waist-(on-half) 14" > file3.new
hip-(on-half) 24-3/4" > file3.new
inseam 34" > file3.new
size axlt > file3.new
waist-(on-half) 15" > file3.new
hip-(on-half) 26-3/4" > file3.new
inseam 34-1/2" > file3.new
size a2xlt >> file3.new
waist-(on-half) 16" > file3.new
hip-(on-half) 28-3/4" > file3.new
inseam 35" > file3.new
size a3xlt > file3.new
waist-(on-half) 17" > file3.new
hip-(on-half) 30-3/4" > file3.new
inseam 35-1/2" > file3.new

would work on several files, and it would distribute the multiple "columns" differently... remove the quotes around the redirection operator if happy and want to produce new result files.
This User Gave Thanks to RudiC For This Post:
# 7  
Old 10-20-2016
The table data has been obtained from web pages, which are very noisy. In the first table, there are two columns where the first few strings could not be obtained in both columns. This could be because there might be some non-ASCII characters or characters other than English. But the second table i.e. the second example above, consists of three columns separated by CONTINUE, where the first column data might consist of non-ASCII characters or something else, but the HTML structure suggested the web page parsing program that there are three columns. I can understand that this is a bit tricky. The trick is to consider each CONTINUE to separate one column in a file The output that I expect from file3 is:
pre.cjk { font-family: "Nimbus Mono L",monospace; }p { margin-bottom: 0.1in; line-height: 120%; }
Code:
size alt axlt a2xlt a3xlt
waist-(on-half) 14" 15" 16" 17"
hip-(on-half) 24-3/4" 26-3/4" 28-3/4" 30-3/4"
inseam 34" 34-1/2" 35" 35-1/2"

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Tabulate nodes and subnodes of XML file script

Hello to all, I'd like to tabulate the content of the XML file below, I've been trying with AWK putting the Top node (<KIND>) as RS and so far I'm able to print the first and second field of each record, but I don't know how to continue with the following nodes that have more inner nodes. The... (4 Replies)
Discussion started by: Ophiuchus
4 Replies

2. Shell Programming and Scripting

Getting data in table form

Hi, I have a csv file from which i am fetching few columns as below: IFILE=/home/home1/Report1.csv OFILE=/home/home1/`date +"%m%d%y%H%M%S"`.dat if #Checks if file exists and readable then awk -F "," '(NR>4) {print $1,$6,$2,$3,$4,$5,$6}' ${IFILE} >> ${OFILE} fi cat $OFILE | mail... (7 Replies)
Discussion started by: Vivekit82
7 Replies

3. Shell Programming and Scripting

Arrange data in table

Hello All, I have following data into my file named record. Name City phone number email Jhon Newyork 123456987 jhon@gmail.com Maria Texas 569865612 Maria_Sweet@rediffmail.com Chan Durben NA Chan123@gmail.com The output should be in straight columns.. There should not be any... (1 Reply)
Discussion started by: Nakul_sh
1 Replies

4. Shell Programming and Scripting

Arrange data in table

Hello All, I have following data into my file named record. Name City phone number email Jhon Newyork 123456987 jhon@gmail.com Maria Texas 569865612 Maria_Sweet@rediffmail.com Chan Durben NA Chan123@gmail.com |---------------------------------------------------------------| |Name ... (2 Replies)
Discussion started by: Nakul_sh
2 Replies

5. Shell Programming and Scripting

Take Data From a table and send it through mail

Hi can anyone help me in writing a code for taking data from a table and need to send that data through mail using mail -x command.. (3 Replies)
Discussion started by: ginrkf
3 Replies

6. Shell Programming and Scripting

How to take data from table?

Hi all , am using unix aix .. Actually i have a table called table 1 in that table year period startdate enddate 2013 1 26/03/2012 29/04/2012 2013 2 30/04/2012 27/05/2012 2013 3 28/05/2012 28/06/2012 2013 4 25/06/2012 ... (10 Replies)
Discussion started by: Venkatesh1
10 Replies

7. UNIX for Dummies Questions & Answers

Append data 1 table to other

Hi. I have 2 create 2 temporary tables.the data will be same with same cols..but after creating 2 tables..i have to merge data in file and send..however the query is after merging data no duplicates shud be present..and only 1 record for a entity must be present.. for eg: table1 has foll cols... (3 Replies)
Discussion started by: musu
3 Replies

8. UNIX for Advanced & Expert Users

Data from table to column

Hi All, in bash I have a text file which is something like 7.96634E-07 1.0000 0.00000E+00 0.0000 0.00000E+00 0.0000 1.59327E-06 0.7071 2.23058E-05 0.1890 6.61207E-05 0.1098 1.13919E-04 0.0865 1.47377E-04 0.0747 .... .... 0.00000E+00 0.0000 0.00000E+00 0.0000 ... (6 Replies)
Discussion started by: f_o_555
6 Replies

9. Shell Programming and Scripting

Data table

Hi there, I get a list of data set by means a script with many echo command that looks like this: CASE 1 A= 4 B= 4 CASE 2 A= 3 B= 5 Is... (3 Replies)
Discussion started by: Giordano Bruno
3 Replies

10. Shell Programming and Scripting

Showing several data in one table

hi again :$ i am creating a script to be able to monitor a machine's performance. the code in the file is: ================q============== sar 2 3 |awk '{print $3}' vmstat 2 3 |awk '{print $19 " " $21}' iostat -cx 2 3 |awk '{print $8 " " $10 " " $13}' ================uq============= the... (4 Replies)
Discussion started by: mohamedh_14
4 Replies
Login or Register to Ask a Question