Post 302976078 by rbatte1 on Thursday 23rd of June 2016 01:35:45 PM
Splitting files using awk and reading filename value from input data

I have a process that requires me to read data from huge log files and find the most recent entry on a per-user basis. The number of users may fluctuate wildly month to month, so I can't code for it with names or a set number of variables to capture the data, and the files are large so I don't want to read them several times.

The entries of interest contain a particular string, so I can extract just those from the overall log file, and I have a way to split the output into separate files on a per-user basis. My plan is then to read just the last line of each file created with tail -1, with the filename giving me the user account in question.

My boss, however, worries about false-positive data matches for my expression (by chance or maliciously) that might try to overwrite a critical file.


My data has a syslog-type date in it which means doing a sort -u is proving tricky too. I've got this far with splitting the data out to files under /tmp/logs as splitlog.rbatte1 or similar but if field 11 were ever */../../etc/passwd then potentially I would be in trouble.

The date is the first three fields and 'as far as I am aware' a valid user name would be in field 11, but ........

A simplified part of the code would be:-
Code:
# Field 11 is used unchecked as part of the output file name
grep "Active transaction started" /var/log/qapplog |
   awk '{ print $1, $2, $3, $11 > ("/tmp/logs/splitlog." $11) }'
for userfile in /tmp/logs/splitlog.*
do
   lastrecord=$(tail -n 1 "$userfile")
   printf "User %s last record is %s\n" "$userfile" "$lastrecord"
   # .... whatever else here ....
done

I have considered adding tr -d "\/" to strip out the characters, but now that it's been raised, I'm concerned that there may be other things I'm not considering.
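Stripping characters still lets an unexpected name through in altered form. As a minimal sketch of the opposite approach, the following accepts a name only if it matches an allow-list pattern and reports everything else. The function name split_by_user, the character set in the pattern, and the assumption that the output directory already exists are illustrative choices, not part of the real application; the pattern would need adjusting to whatever counts as a valid account name on the system in question.

```shell
#!/bin/sh
# Sketch only: split_by_user and the allow-list pattern are hypothetical.
split_by_user() {
    # $1 = input log file, $2 = output directory (assumed to exist already)
    grep "Active transaction started" "$1" |
    awk -v dir="$2" '
        # Accept names made of letters, digits, "_", "." or "-" only.
        # The first character may not be "." so "." and ".." never match,
        # and "/" is not in the set, so no path component can get through.
        $11 ~ /^[A-Za-z0-9_][A-Za-z0-9_.-]*$/ {
            out = dir "/splitlog." $11
            print $1, $2, $3, $11 >> out
            close(out)    # keep the number of open files at one
            next
        }
        # Anything else is reported on stderr instead of becoming a file name.
        { print "rejected user field: " $11 | "cat 1>&2" }
    '
}
```

Because the files are opened in append mode, the output directory would have to be emptied before each run. Closing the file after every line is slow on very large logs; it is done here only so the sketch does not depend on how many files a given awk can hold open at once.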

Is there a better way to do this, perhaps with awk doing the equivalent of basename "$11" or of the shell parameter expansion "${11##*/}"?
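For reference, awk has no built-in basename, but a greedy sub() gives the same result as the shell's ##*/ expansion. The sketch below assumes the user name is in field 11 as described above; the function name user_basename is made up for illustration.

```shell
#!/bin/sh
# Sketch only: user_basename is a hypothetical helper name.
user_basename() {
    # Reads log lines on stdin and prints field 11 with any leading path removed.
    awk '{
        name = $11
        sub(/.*\//, "", name)    # greedy: deletes everything up to the last "/"
        print name
    }'
}
```

Note that this only removes the directory part. A value such as x/../../etc/passwd becomes passwd, which is safe as a file name under /tmp/logs but is still not a real user, and a value ending in "/" becomes an empty string, so some validation of the result would still be needed.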


Any suggestions welcome. Perhaps there is a better design overall that will find the last entry on a per-user basis. The log is thankfully written in time order, so the last in the file by user name is the last by time already.
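Since the log is already in time order, one possible design avoids the intermediate files altogether: a single awk pass can keep the latest record for each user in an associative array and print the array at the end. This is a sketch under the assumptions stated in the post (date in fields 1 to 3, user name in field 11); the function name last_per_user is hypothetical.

```shell
#!/bin/sh
# Sketch only: last_per_user is a hypothetical function name.
last_per_user() {
    # $1 = log file; prints one line per user seen in a matching entry.
    awk '
        /Active transaction started/ {
            # Later lines overwrite earlier ones, so the final value
            # stored for each user is the most recent entry.
            last[$11] = $1 " " $2 " " $3
        }
        END {
            for (user in last)
                printf "User %s last record is %s\n", user, last[user]
        }
    ' "$1"
}
```

Here field 11 is only ever an array key and a printed string, never part of a path, so a crafted value cannot cause a file to be written. The file is read once, memory use grows with the number of distinct users rather than the size of the log, and the output order of for (user in last) is unspecified, so pipe the result through sort if a stable order matters.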

Kind regards,
Robin
 
