URL: https://www.my.target.site.com
Thanks to this script you can extract any content from an entire site using CSS selectors and regular expressions
Description
Start by indicating the directory where the CSV file will be written on your computer and the file name. The settings are then split into 2 parts, CSS selectors and regular expressions. These are two different content extraction methods.
For each piece of data, indicate the column name and the extraction pattern (a CSS selector or a regular expression).
Two tutorials for learning regular expressions and CSS selectors are available. Allow half a day to learn the techniques if you do not know these tools. It's less complicated than you might think at first glance. You will find many video and text tutorials on the Internet about these techniques. These are very powerful tools.
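To make the difference between the two methods concrete, here is a minimal standalone Python sketch (not the crawler's own script language). It pulls a title from a small HTML string by tag name, which is a simplified stand-in for the CSS selector `h1`, and a price with a regular expression. The sample HTML, the `TagText` class, the helper functions and the `PRICE_RE` pattern are all assumptions made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page used only for this illustration
HTML = '<html><body><h1>Blue mug</h1><p class="price">Price: 12.50 EUR</p></body></html>'


class TagText(HTMLParser):
    """Collect the text of every element with a given tag name.

    This mimics only the simplest CSS selector (a bare tag name such as "h1");
    a real selector engine also supports classes, ids, nesting and more.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data


def select_by_tag(html, tag):
    """Return the stripped text of each element named `tag`."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Regular expression: capture the number that follows "Price:"
PRICE_RE = re.compile(r"Price:\s*([0-9]+(?:\.[0-9]+)?)")


def extract_by_regex(html, pattern):
    """Return the first captured group, or an empty string when nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else ""


title = select_by_tag(HTML, "h1")
price = extract_by_regex(HTML, PRICE_RE)
print(title, price)
```

A selector follows the structure of the page, so it is usually the better choice when the data sits in a recognizable element. A regular expression works on the raw text, which helps when the value is buried inside a longer string.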
Initial script (INITIAL)
//WIZ_COMMENT This script allows you to extract information from a website by targeting the data with CSS selectors or regular expressions. Once configured, press save, then launch script to send a crawler to the entire target site. The information will be gathered in a CSV. You must specify a different name for each column.
pathDir = ''; //WIZ_VARIABLE #name:Path of the directory where the CSV must be written (leave blank to choose your Desktop)
nameCSV = 'my_export'; //WIZ_VARIABLE #name:Name of the CSV file
deleteCSVstart= true; //WIZ_VARIABLE #name:Remove the CSV on startup if a file already exists
//WIZ_TITLE Find data by CSS Selectors
//WIZ_LINK #link:https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors #name:How do you build a CSS selector?
column1 = 'title'; //WIZ_VARIABLE #name:Name of the column 1
cssSelector1 = "h1"; //WIZ_VARIABLE #name:CSS selector of the data in the column 1 #css
column2 = ''; //WIZ_VARIABLE #name:Name of the column 2
cssSelector2 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 2 #css
column3 = ''; //WIZ_VARIABLE #name:Name of the column 3
cssSelector3 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 3 #css
column4 = ''; //WIZ_VARIABLE #name:Name of the column 4
cssSelector4 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 4 #css
column5 = ''; //WIZ_VARIABLE #name:Name of the column 5
cssSelector5 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 5 #css
column6 = ''; //WIZ_VARIABLE #name:Name of the column 6
cssSelector6 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 6 #css
column7 = ''; //WIZ_VARIABLE #name:Name of the column 7
cssSelector7 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 7 #css
column8 = ''; //WIZ_VARIABLE #name:Name of the column 8
cssSelector8 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 8 #css
column9 = ''; //WIZ_VARIABLE #name:Name of the column 9
cssSelector9 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 9 #css
column10 = ''; //WIZ_VARIABLE #name:Name of the column 10
cssSelector10 = ""; //WIZ_VARIABLE #name:CSS selector of the data in the column 10 #css
//WIZ_TITLE Find data by regular expressions
//WIZ_LINK #link:https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285 #name:How do you build a regular expression?
column11 = ''; //WIZ_VARIABLE #name:Name of the column 11
regex11 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 11 #regex
column12 = ''; //WIZ_VARIABLE #name:Name of the column 12
regex12 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 12 #regex
column13 = ''; //WIZ_VARIABLE #name:Name of the column 13
regex13 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 13 #regex
column14 = ''; //WIZ_VARIABLE #name:Name of the column 14
regex14 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 14 #regex
column15 = ''; //WIZ_VARIABLE #name:Name of the column 15
regex15 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 15 #regex
column16 = ''; //WIZ_VARIABLE #name:Name of the column 16
regex16 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 16 #regex
column17 = ''; //WIZ_VARIABLE #name:Name of the column 17
regex17 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 17 #regex
column18 = ''; //WIZ_VARIABLE #name:Name of the column 18
regex18 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 18 #regex
column19 = ''; //WIZ_VARIABLE #name:Name of the column 19
regex19 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 19 #regex
column20 = ''; //WIZ_VARIABLE #name:Name of the column 20
regex20 = ""; //WIZ_VARIABLE #name:Regular expression for the data in the column 20 #regex
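// Default to the Desktop when no directory is given
if(!pathDir) pathDir=path("desktop")
// Full path of the CSV file (pathDir is concatenated as-is, so it must end with a path separator)
path0=pathDir+nameCSV+".csv"
// Start from a clean file when requested
if(deleteCSVstart) delete(path0)
// Maps of column name -> CSS selector and column name -> regular expression
cssSeletors=[:]
regexs=[:]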
if(cssSelector1) cssSeletors.put(column1,cssSelector1)
if(cssSelector2) cssSeletors.put(column2,cssSelector2)
if(cssSelector3) cssSeletors.put(column3,cssSelector3)
if(cssSelector4) cssSeletors.put(column4,cssSelector4)
if(cssSelector5) cssSeletors.put(column5,cssSelector5)
if(cssSelector6) cssSeletors.put(column6,cssSelector6)
if(cssSelector7) cssSeletors.put(column7,cssSelector7)
if(cssSelector8) cssSeletors.put(column8,cssSelector8)
if(cssSelector9) cssSeletors.put(column9,cssSelector9)
if(cssSelector10) cssSeletors.put(column10,cssSelector10)
if(regex11) regexs.put(column11,regex11)
if(regex12) regexs.put(column12,regex12)
if(regex13) regexs.put(column13,regex13)
if(regex14) regexs.put(column14,regex14)
if(regex15) regexs.put(column15,regex15)
if(regex16) regexs.put(column16,regex16)
if(regex17) regexs.put(column17,regex17)
if(regex18) regexs.put(column18,regex18)
if(regex19) regexs.put(column19,regex19)
if(regex20) regexs.put(column20,regex20)
// Share the configuration with the script that runs on each page (FORPAGE)
global.put("path0",path0)
global.put("cssSeletors",cssSeletors)
global.put("regexs",regexs)
//MATRICULE BYNW
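The initial script registers a column only when its selector or expression is non-empty, and it stores the pairs in a map keyed by column name. The Python sketch below (a standalone illustration, not the crawler's script language) mirrors that logic with a dictionary. It also shows why each column needs a different name: a repeated key silently replaces the earlier entry. The function name `build_columns` and the sample values are hypothetical.

```python
def build_columns(pairs):
    """Return {column name: pattern}, skipping entries whose pattern is empty."""
    columns = {}
    for name, pattern in pairs:
        if pattern:  # same kind of test as `if(cssSelector1)` in the script
            columns[name] = pattern
    return columns


# Hypothetical configuration: the last entry reuses the name of the first
config = [
    ("title", "h1"),
    ("", ""),            # unused slot, ignored
    ("price", ".price"),
    ("title", "h2"),     # duplicate name: replaces the "h1" entry
]

columns = build_columns(config)
print(columns)
```

With this input only two columns survive, and `title` points at `h2` rather than `h1`, so the first selector would never be extracted.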
Script running on the page (FORPAGE)
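// Retrieve the configuration stored by the initial script
path0=global.get("path0")
cssSeletors=global.get("cssSeletors")
regexs=global.get("regexs")
// One row per crawled page, starting with the page URL
datas=[:]
datas.put("URL of the page",urlPage)
// Columns extracted with CSS selectors
cssSeletors.each{title,cssSelector->
    datas.put(title,cleanSelect(cssSelector))
}
// Columns extracted with regular expressions
regexs.each{title,reg->
    datas.put(title,cleanRegex(reg))
}
// Write the row to the CSV file
csv(path0,datas)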
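The per-page script builds one row per crawled page and hands it to `csv(path0,datas)`. How that built-in writes the file is not shown here, so the following Python sketch is only one plausible way to append such a row, writing the header when the file is first created. The function `append_row`, the temporary file path and the sample rows are assumptions.

```python
import csv
import os
import tempfile


def append_row(path, row):
    """Append one row (dict of column -> value) to a CSV file.

    The header line is written only when the file does not exist yet or is empty.
    """
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row.keys()))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Hypothetical rows, one per crawled page, written to a temporary directory
csv_path = os.path.join(tempfile.mkdtemp(), "my_export.csv")
append_row(csv_path, {"URL of the page": "https://www.my.target.site.com/a", "title": "Page A"})
append_row(csv_path, {"URL of the page": "https://www.my.target.site.com/b", "title": "Page B"})

# Read the file back to check that both rows share a single header
with open(csv_path, newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
print(rows)
```

Because every page produces the same set of columns, the header written for the first page stays valid for all later rows.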