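Tags: SAS programming, regular expressions, data mining, pattern matching, hash
Continuing the study notes on 《SAS编程与数据挖掘商业案例》 (SAS Programming and Data Mining Business Cases), this post focuses on data-processing practice: HASH objects, user-defined formats, and the powerful regular expression functions.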
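1: HASH objects
A hash object, also called a hash table, is a data structure that is accessed directly by key value.
SAS provides two classes for working with hash tables: hash, which stores the data, and hiter, which traverses it. The hash class provides methods for finding, adding, modifying, and deleting entries; hiter provides positioning and traversal methods such as first and next.
Advantages: key lookups are done in memory, which improves performance;
              entries can be added, updated, or deleted dynamically while the data step runs;
              data can be located quickly in the hash table, which reduces the number of lookups.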
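Common methods:
definekey: defines the key variables
definedata: defines the data (value) variables
definedone: completes the definition; data can now be loaded
add: adds a key/value pair; if the key already exists in the hash table, the add fails with a nonzero return code and the existing entry is kept
replace: if the key exists in the hash table, replaces its data; otherwise adds the key/value pair
remove: removes the key/value pair for the current key
find: looks up the key; if it exists, writes the stored data into the corresponding variables
check: looks up the key; if it exists, returns rc=0 without modifying the current variable values
output: writes the hash table to a data set
clear: empties the hash table without deleting the object
equal: tests whether two hash objects are equal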
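The difference between add, replace, check, find, and remove is easiest to see on a one-entry table. The following is a minimal sketch, not code from the book: the variable names code and label are made up for illustration, and num_items is a hash attribute that is not in the method list above.

```sas
data _null_;
   length code $ 3 label $ 20;
   declare hash h();
   h.definekey('code');
   h.definedata('label');
   h.definedone();

   code = 'A01';
   label = 'first';
   rc = h.add();                 /* key is new, so rc=0 */
   put 'add (new key):      ' rc=;

   label = 'second';
   rc = h.add();                 /* key exists: nonzero rc, stored value unchanged */
   put 'add (existing key): ' rc=;

   rc = h.replace();             /* overwrites the stored label with 'second' */

   call missing(label);
   rc = h.check();               /* rc=0, label stays missing */
   put 'check:              ' rc= label=;

   rc = h.find();                /* rc=0, label is filled from the table */
   put 'find:               ' rc= label=;

   rc = h.remove();              /* deletes the entry for code='A01' */
   n = h.num_items;              /* number of entries left in the table */
   put 'after remove:       ' n=;
run;
```

Assigning the return code to rc matters for add: when the return code is not captured, a failed add is reported as an error in the log.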
 
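Example of the find method:
libname chapt12 'f:\data_model\book_data\chapt12';
data results;
 /* never executed; only adds the variables of participants to the PDV at compile time */
 if _n_=0 then set chapt12.participants;
   if _n_ = 1 then do;
    /* load participants into the hash table once, keyed by name */
    declare hash h(dataset:'chapt12.participants');
    h.definekey('name');
    h.definedata('gender', 'treatment');
    h.definedone();
  end;
   set chapt12.weight;
  /* find() returns 0 when name is in the table and copies gender and treatment into the PDV */
  if h.find() = 0 then
    output;
run;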
 
Introductory example of the hiter object:
data patients;
  length patient_id $ 16 discharge 8;
  input patient_id discharge:date9.;
datalines;
smith-4123 15mar2004
hagen-2834 23apr2004
smith-2437 15jan2004
flinn-2940 12feb2004
;
data _null_;
  if _n_=0 then set patients;
  declare hash ht(dataset:"patients",ordered:"ascending");
  ht.definekey("patient_id");
  ht.definedata("patient_id", "discharge");
  ht.definedone();
  declare hiter iter("ht");
  rc = iter.first();
  do while (rc=0);
    put patient_id discharge:date9.;
    rc = iter.next();
  end;
run;
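The statement declare hiter iter("ht"); defines an iterator iter over the hash table ht. The first method then positions the iterator on the first entry of the hash table, and next steps through the remaining entries, writing each one to the log. Because the table is declared with ordered:"ascending", the entries come out in ascending patient_id order.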
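The same iterator can also walk the table backwards. The sketch below assumes the patients data set created above still exists in WORK; it uses the hiter methods last and prev, which are not shown in the original example.

```sas
data _null_;
  if _n_=0 then set patients;       /* compile-time only: defines patient_id and discharge */
  declare hash ht(dataset:"patients", ordered:"ascending");
  ht.definekey("patient_id");
  ht.definedata("patient_id", "discharge");
  ht.definedone();
  declare hiter iter("ht");
  rc = iter.last();                 /* position on the last entry in key order */
  do while (rc=0);
    put patient_id discharge:date9.;
    rc = iter.prev();               /* step back one entry; nonzero rc at the start */
  end;
run;
```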
 
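Business practice: merging two data sets:
    data both1(drop=rc);
      declare hash plan ();
   rc = plan.definekey ('plan_id');
   rc = plan.definedata ('plan_desc');
   rc = plan.definedone ();
   /* first loop: load every plan into the hash table */
   do until (eof1) ;
     set chapt12.plans end = eof1;
     rc = plan.add ();
  end;
   /* second loop: look up the plan description for each member */
  do until (eof2) ;
     set chapt12.members end = eof2;
     call missing(plan_desc);   /* clear the value left over from the previous row */
     rc = plan.find ();
     output;
  end;
  stop;
run;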
The program above can be simplified to:
data both2;
   length plan_id $3 plan_desc $20;
   if _n_ = 1 then do;
         /* the dataset: argument loads chapt12.plans without an explicit loop */
         declare hash h(dataset:'chapt12.plans');
         h.definekey('plan_id');
         h.definedata('plan_desc');
         h.definedone();
         call missing(plan_desc);
      end;
   set chapt12.members;
   rc=h.find();
run;
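Since the chapt12 library is not available outside the book's data files, here is a self-contained sketch of the same lookup pattern. The data sets plans and members and all of their values are invented for illustration; only the hash logic mirrors the programs above.

```sas
/* hypothetical lookup table */
data plans;
   length plan_id $ 3 plan_desc $ 20;
   input plan_id $ plan_desc & $20.;
datalines;
P01 Basic plan
P02 Family plan
;

/* hypothetical detail table; P09 has no matching plan */
data members;
   length member $ 8 plan_id $ 3;
   input member $ plan_id $;
datalines;
Ann P01
Bob P02
Cid P09
;

data both3(drop=rc);
   if _n_ = 0 then set plans;      /* compile-time only: defines plan_id and plan_desc */
   if _n_ = 1 then do;
      declare hash h(dataset:'plans');
      h.definekey('plan_id');
      h.definedata('plan_desc');
      h.definedone();
   end;
   set members;
   rc = h.find();
   /* plan_desc comes from a SET statement and is therefore retained,
      so it has to be cleared by hand when the key is not found */
   if rc ne 0 then call missing(plan_desc);
run;
```

This behaves like a left join of members to plans: every member row is kept, and plan_desc is missing where the plan_id has no entry in the hash table.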
2: format
User-defined formats:
Proc Format;
    Value $Sex_Fmt
    'F'='Female'
    'M'='Male'
    Other = 'Unknown';
    Value Age_Dur
    Low-10="10 and under"
    11-13="11-13"
    14-<15="14-15"
    15-High="15 and over";
Run;
Usage:
Data  test;
Set  sashelp.class(keep=sex age);
x=put(sex,$sex_fmt.);y=put(age,age_dur.);
Run;
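3: Regular expressions:
/.../  delimits the start and end of a regular expression
|  alternation between several items (an "or" operation)
()   a group; marks the start and end of a subexpression
.    any character except a newline
\w  any word character: a digit, an upper- or lowercase letter, or an underscore
\W  any non-word character
\s   any whitespace character, including spaces, tabs, newlines, and carriage returns
\S   any non-whitespace character
\d   any digit from 0 to 9
\D  any non-digit character
[...]  any one of the characters listed inside the brackets
[^...]  any one character not listed inside the brackets
[a-z]  any character in the range a to z
[^a-z]  any character outside the range a to z
^  matches the start of the input string
$  matches the end of the input string
\b  a word boundary (the start or end of a word)
\B  a non-word boundary
*  matches zero or more times
+ matches one or more times
?  matches zero or one time
{n}  matches exactly n times
{n,}  matches n or more times
{n,m}  matches between n and m times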
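Several of these metacharacters can be tried out in one short step. This is an illustrative sketch with made-up input lines; it passes the pattern to prxmatch as a string constant instead of a compiled pattern id, the same way the where statement in eg2 does.

```sas
data _null_;
   input text $char30.;
   /* ^ and a character class: the line starts with an uppercase letter */
   starts_upper = prxmatch('/^[A-Z]/', text) > 0;
   /* \d+ anchored at both ends; \s* absorbs the trailing blanks that pad a SAS string */
   only_digits  = prxmatch('/^\d+\s*$/', text) > 0;
   /* \b and {n}: a standalone run of exactly five digits */
   five_digits  = prxmatch('/\b\d{5}\b/', text) > 0;
   /* grouping and alternation: the whole word cat or dog */
   cat_or_dog   = prxmatch('/\b(cat|dog)\b/', text) > 0;
   put text= starts_upper= only_digits= five_digits= cat_or_dog=;
datalines;
12345
Zip 27513 here
a dog and a cat
category
;
```

Character variables are padded with blanks to their full length, so a pattern that ends in $ usually needs \s* before it (or the value has to be trimmed first).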
 
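Common functions:
Prxparse     compiles a regular expression and returns its pattern id
Prxmatch  returns the position of the first match of the pattern
Call prxsubstr   returns the start position and length of the match in the target string
Prxposn    returns the text matched by a subexpression (capture group) of the regular expression
Call prxposn    returns the start position and length of the text matched by a subexpression
Call prxnext  returns the positions and lengths of successive matches in the target string
Prxchange    replaces the text that matches the pattern
Call prxchange   replaces the text that matches the pattern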
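As a quick illustration of the prxchange function (as opposed to the call routine), the sketch below swaps a "Last, First" name using the capture groups $1 and $2 in the replacement. The variable names are made up for the example.

```sas
data _null_;
   length name swapped $ 32;
   name = 'Jones, Fred';
   /* s/pattern/replacement/ : $1 is the last name, $2 the first name;
      -1 means replace every match in the string */
   swapped = prxchange('s/(\w+), (\w+)/$2 $1/', -1, name);
   put name= swapped=;
run;
```

The function form returns the changed string directly, which is convenient in an assignment; call prxchange additionally reports the result length, a truncation flag, and the number of replacements.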
 
eg1:
data _null_;
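   /* compile the Perl-style pattern once and keep its id across rows */
   if _n_ = 1 then pattern_num = prxparse("/cat/");

   retain pattern_num;
   input string $30.;
   /* position of the first match, or 0 when there is none */
   position = prxmatch(pattern_num,string);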
   file print;
   put pattern_num= string= position=;
datalines;
there is a cat in this line.
does not match cat
cat in the beginning
at the end, a cat
cat
;
run;
eg2: data validation
data match_phone;
   set chapt12.phone_numbers;
   if _n_ = 1 then pattern = prxparse("/\(\d\d\d\) ?\d\d\d-\d{4}/");
   retain pattern;
   if prxmatch(pattern,phone) gt 0 then output;
run;
Find the phone numbers that do not match the pattern:
data unmatch_phone;
   set chapt12.phone_numbers;
   where not prxmatch("/\(\d\d\d\) ?\d\d\d-\d{4}/",phone);
run;
eg3: extracting the substring that matches a pattern
data extract;
   if _n_ = 1 then do;
      pattern = prxparse("/\(\d\d\d\) ?\d\d\d-\d{4}/");
      if missing(pattern) then do;
         put "error in compiling regular expression";
         stop;
      end;
   end;
   retain pattern;
   length number $ 15;
   input string $char80.;
   call prxsubstr(pattern,string,start,length);
      if start gt 0 then do;
      number = substr (string,start,length); 
      number = compress(number," ");
      output;
   end;
   keep number;
datalines;
this line does not have any phone numbers on it
this line does: (123)345-4567 la di la di la
also valid (123) 999-9999
two numbers here (333)444-5555 and (800)123-4567
;
run;
eg4: extracting names
data ReversedNames;
   input name & $32.;
   datalines;
Jones, Fred
Kavich, Kate
Turley, Ron
Dulix, Yolanda
;
data FirstLastNames;
   length first last $ 16;
   keep first last;
   retain re;
   if _N_ = 1 then
      re = prxparse('/(\w+), (\w+)/');
   set ReversedNames;
   if prxmatch(re, name) then
      do;
         last = prxposn(re, 1, name);
         first = prxposn(re, 2, name);
      end;
run;
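Note: the arguments 1 and 2 refer to the two capture groups in the regular expression.
eg5: extracting names that fit a given pattern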
data old;
   input name $60.;
   datalines;
Judith S Reaveley
Ralph F. Morgan
Jess Ennis
Carol Echols
Kelly Hansen Huff
Judith
Nick
Jones
;
data new;
   length first middle last $ 40;
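   /* re1: first name, optional middle name, last name */
   re1 = prxparse('/(\S+)\s+([^\s]+\s+)?(\S+)/o');
   /* re2: same pattern with the separating whitespace captured as group 2,
      so the last name becomes group 4 */
   re2 = prxparse('/(\S+)(\s+)([^\s]+\s+)?(\S+)/o');
   set old;
   id1=prxmatch(re1, name);
   id2=prxmatch(re2, name);
   if id1 then
      do;
         first = prxposn(re1, 1, name);
         middle = prxposn(re1, 2, name);
         last = prxposn(re1, 3, name);
      end;
   if id2 then test=prxposn(re2, 4, name);
   put test=;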
run;
eg6: returning the positions of multiple matches
data _null_;
   expressionid = prxparse('/[crb]at/');
   text = 'the woods have a bat, cat, and a rat!';
   start = 1;
   stop = length(text);
   call prxnext(expressionid, start, stop, text, position, length);
      do while (position > 0);
         found = substr(text, position, length);
         put found= position= length=;
         call prxnext(expressionid, start, stop, text, position, length);
      end;
run;
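Note: the first call prxnext returns a position and the loop is entered. After the matching substring has been extracted, call prxnext is executed again and returns the position of the next match; when no further match exists, position is 0 and the loop ends.
eg7: replacing text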
data cat_and_mouse;
   input text $char40.;
   length new_text $ 80;
   if _n_ = 1 then match = prxparse("s/[Cc]at/mouse/");
   retain match;
   call prxchange(match,-1,text,new_text,len,trunc,num);   
   if trunc then put "note: new_text was truncated";
datalines;
the Cat in the hat
there are two cat cats in this line
here is no replacement
;
run;
 
 
《SAS编程与数据挖掘商业案例》 study notes, part 19
Original article: http://blog.csdn.net/goodhuajun/article/details/39893829