String copy


 
# 15  
Old 03-26-2011
Hi

The raw data comes from a website scrape and contains 12,000 product names but no brand names, e.g.

Filippo Berio Mild & Light Olive Oil (500ml)

What I did initially was get the scrape program to take the first word of each product and use it as the brand name. This works for about 90% of brands, but of course only for those that are a single word.

I then use a simple sed script to expand the incomplete brands, e.g.

sed -i "s/^7/7 Up/g" brands.txt
sed -i "s/^Ainsley/Ainsley Harriott/g" brands.txt
sed -i "s/^Air/Air Wick/g" brands.txt
sed -i "s/^Alfa/Alfa One/g" brands.txt
sed -i "s/^All/All About Shine/g" brands.txt
sed -i "s/^Alta/Alta Italia/g" brands.txt
sed -i "s/^Ambi/Ambi Pur/g" brands.txt
sed -i "s/^Angel/Angel Delight/g" brands.txt
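
A possible tightening of these substitutions, offered as a sketch rather than a fix for the whole problem: it assumes brands.txt holds exactly one first-word stub per line, and `fix_brands` is a hypothetical helper name. Anchoring each pattern at both ends (`^Air$`) keeps a rule from touching longer words that merely start with the same letters, and makes a second run harmless, since an already expanded line no longer matches. Only a few of the rules above are repeated here.

```shell
# fix_brands FILE: print FILE with known one-word stubs expanded to full brand names.
# Assumption: each line of FILE is a single first-word stub such as "Air".
fix_brands() {
    sed -e 's/^7$/7 Up/' \
        -e 's/^Ainsley$/Ainsley Harriott/' \
        -e 's/^Air$/Air Wick/' \
        -e 's/^All$/All About Shine/' \
        "$1"
}
```

It writes to standard output instead of editing in place, so one way to use it is `fix_brands brands.txt > brands.fixed && mv brands.fixed brands.txt`.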

However, this again only fixes about 70% of the brands, as some brands share the same first word, e.g.

John West
John Frieda

I can pull out a list of all the brands if needed. I've attached just the product list file for now but will produce and attach the brand file later today. What makes this a little bit more complicated is that this list will be updated weekly so new brands are constantly being added.

I can get my brand list to a point where it will only take me 10 - 15 minutes of manual editing so it's not the end of the world :-)

Many thanks for your kind help on this little problem...
# 16  
Old 03-26-2011
The best way of doing this would be to find a way to extract only the brand name from the web site.
Otherwise, you have to build a list containing all the "multi-word" brand names, so that we can then set up a script to parse the product file with pseudocode like:

for each product whose name starts with a brand in the multi-word brand list
    put the : separator right after that brand name
for every other product
    put the : separator right after the first word
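
The pseudocode above could be sketched in awk roughly as follows. This is only an illustration under stated assumptions: `split_brand` is a hypothetical function name, the first argument is a file listing one multi-word brand per line, the second is the product list, and product lines have no leading whitespace. When several listed brands match, the longest one wins.

```shell
# split_brand MULTIWORD_BRANDS_FILE PRODUCTS_FILE
# Prints "brand:rest of product name" for every non-empty product line.
split_brand() {
    awk '
        # First file: remember every known multi-word brand.
        FILENAME == ARGV[1] { if ($0 != "") brands[$0] = 1; next }
        # Skip empty product lines.
        $0 == "" { next }
        {
            # Pick the longest known brand that starts the line and is
            # followed by a space or by the end of the line.
            best = ""
            for (b in brands) {
                n = length(b)
                if (n > length(best) && substr($0, 1, n) == b &&
                    (length($0) == n || substr($0, n + 1, 1) == " "))
                    best = b
            }
            # No known brand matched: fall back to the first word.
            if (best == "") best = $1
            rest = substr($0, length(best) + 1)
            sub(/^ +/, "", rest)
            print best ":" rest
        }
    ' "$1" "$2"
}
```

For example, with `Filippo Berio` in the brand list, the line `Filippo Berio Mild & Light Olive Oil (500ml)` comes out as `Filippo Berio:Mild & Light Olive Oil (500ml)`, while a product whose brand is not listed is split after its first word. Since the product list changes weekly, only the brand list file would need maintaining.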

If you give more details about how you generate the initial file, it may be possible to separate the brand and product fields directly at the generation step.
# 17  
Old 03-26-2011
The file is initially generated using a piece of web scraping software. I've just had a look at the web pages I scrape from, and it looks like I can produce a scrape of all the brand names.
# 18  
Old 03-26-2011
So if you can generate a list of the brand names only, plus the other file containing everything, you can then put the separator at the right place.
Can you upload the file containing only the brand names?
# 19  
Old 03-26-2011
I'm afraid automatically generating the brand list is harder than I initially thought. I will still continue to look at this. The support forum for the website for the scrape software is currently down and I need to ask a question there. Many thanks for your help :-)
# 20  
Old 03-26-2011
Some lines appear more than once in your file.

Here you can see at which line numbers the duplicates occur:

Code:
$ sort tst | uniq -d | while read a
> do
> cat -n tst | grep "$a"
> done
  6776  Bloo Acticlean Cistern Blocks Citrus (2)
  6777  Bloo Acticlean Cistern Blocks Citrus (2)
  7704  Tesco Premium Supermeat Variety Pack (6x400g)
  7981  Tesco Premium Supermeat Variety Pack (6x400g)
  7711  Winalot Classics in Jelly Variety Pack (6x400g)
  7712  Winalot Classics in Jelly Variety Pack (6x400g)

---------- Post updated at 03:24 PM ---------- Previous update was at 03:21 PM ----------

Code:
$ sort tst | uniq -d | while read a
> do
> grep -n "$a" tst
> done
6776:Bloo Acticlean Cistern Blocks Citrus (2)
6777:Bloo Acticlean Cistern Blocks Citrus (2)
7704:Tesco Premium Supermeat Variety Pack (6x400g)
7981:Tesco Premium Supermeat Variety Pack (6x400g)
7711:Winalot Classics in Jelly Variety Pack (6x400g)
7712:Winalot Classics in Jelly Variety Pack (6x400g)
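
A possible variant of the same check, as a sketch only: `grep "$a"` treats each duplicated line as a regular expression and also matches lines that merely contain it, so the awk version below compares whole lines literally instead. `find_dups` is a hypothetical helper name. It reads the file twice, counting lines on the first pass and printing on the second.

```shell
# find_dups FILE: print "line_number:text" for every line whose exact text
# occurs more than once in FILE.
find_dups() {
    awk '
        # First pass: count how often each exact line occurs.
        NR == FNR { count[$0]++; next }
        # Second pass: print the lines seen more than once, with line numbers.
        count[$0] > 1 { print FNR ":" $0 }
    ' "$1" "$1"
}
```

Called as `find_dups tst`, it lists the duplicates in line-number order rather than grouped by text.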

# 21  
Old 03-26-2011
Wow, I didn't realise that. The website I scraped the data from has the duplicates too! They are a big company with lots of developers, so you'd think they would have noticed :-)