
Kaggle Competition Practice: Studying the M5 Baseline

Date: 2020-04-27 13:19:07


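This baseline uses a LightGBM model.

Data preparation and training

Loading the calendar.csv dataset.

This dataset describes the dates on which the items are sold, including the event names and event types that fall on those dates.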

  • date: The date in a “y-m-d” format.
  • wm_yr_wk: The id of the week the date belongs to.
  • weekday: The type of the day (Saturday, Sunday, …, Friday).
  • wday: The id of the weekday, starting from Saturday.
  • month: The month of the date.
  • year: The year of the date.
  • event_name_1: If the date includes an event, the name of this event.
  • event_type_1: If the date includes an event, the type of this event.
  • event_name_2: If the date includes a second event, the name of this event.
  • event_type_2: If the date includes a second event, the type of this event.
  • snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.

import pandas as pd

# Correct data types for "calendar.csv"
calendarDTypes = {"event_name_1": "category", 
                  "event_name_2": "category", 
                  "event_type_1": "category", 
                  "event_type_2": "category", 
                  "weekday": "category", 
                  "wm_yr_wk": "int16", 
                  "wday": "int16",
                  "month": "int16", 
                  "year": "int16", 
                  "snap_CA": "float32", 
                  "snap_TX": "float32", 
                  "snap_WI": "float32"}

# Read csv file
calendar = pd.read_csv("./calendar.csv", 
                       dtype = calendarDTypes)
calendar["date"] = pd.to_datetime(calendar["date"])
calendar.head(10)
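
To see what an explicit dtype mapping changes, here is a minimal sketch on a made-up three-column excerpt. The CSV text is invented for illustration and is not taken from calendar.csv. It compares pandas' default type inference with a mapping in the same style as the one above.

```python
import io

import pandas as pd

# Invented sample rows, only for illustration.
csv_text = "weekday,wday,snap_CA\nSaturday,1,0\nSunday,2,1\n"

# Default inference: integer columns become int64.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit mapping: smaller integer/float types and a categorical column.
typed = pd.read_csv(io.StringIO(csv_text),
                    dtype={"weekday": "category",
                           "wday": "int16",
                           "snap_CA": "float32"})

print(inferred.dtypes)
print(typed.dtypes)
```

On the full files, the narrower types mainly reduce memory use, which matters once the sales table is reshaped into one row per item and day.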


# Transform categorical features into integers
for col, colDType in calendarDTypes.items():
    if colDType == "category":
        calendar[col] = calendar[col].cat.codes.astype("int16")
        calendar[col] -= calendar[col].min()

calendar.head(10)
  • calendar[col].cat.codes.astype("int16") is a simple label encoding of the categories. Later we will try switching to one-hot encoding to compare.
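
The difference between the two encodings can be sketched on a toy column. The event values below are invented, and using pd.get_dummies for the one-hot variant is an assumption about how the later experiment might be done, not something the original code does.

```python
import pandas as pd

# Toy event column; None stands for a day without an event.
events = pd.Series(["Easter", None, "SuperBowl", "Easter"], dtype="category")

# Label encoding: each category gets an integer code, missing values get -1.
codes = events.cat.codes.astype("int16")

# Subtracting the minimum shifts the missing-value code from -1 to 0.
shifted = codes - codes.min()

# One-hot alternative: one 0/1 indicator column per category.
one_hot = pd.get_dummies(events, prefix="event", dtype="int8")

print(shifted.tolist())
print(one_hot)
```

With label encoding the column stays a single integer feature, while one-hot encoding adds one column per category, so the row for the missing event is all zeros.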

sell_prices.csv

File 2: “sell_prices.csv”

This dataset contains the per-unit selling price of each item, given per store and week.

  • store_id: The id of the store where the product is sold.
  • item_id: The id of the product.
  • wm_yr_wk: The id of the week.
  • sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant on a weekly basis, they may change through time (both training and test set).
# Correct data types for "sell_prices.csv"
priceDTypes = {"store_id": "category", 
               "item_id": "category", 
               "wm_yr_wk": "int16",
               "sell_price":"float32"}

# Read csv file
prices = pd.read_csv("./sell_prices.csv", 
                     dtype = priceDTypes)

prices.head()


# Transform categorical features into integers
for col, colDType in priceDTypes.items():
    if colDType == "category":
        prices[col] = prices[col].cat.codes.astype("int16")
        prices[col] -= prices[col].min()
        
prices.head()


sales_train_validation.csv

File 3: “sales_train.csv”

Contains the historical daily unit sales data per product and store.

  • item_id: The id of the product.
  • dept_id: The id of the department the product belongs to.
  • cat_id: The id of the category the product belongs to.
  • store_id: The id of the store where the product is sold.
  • state_id: The State where the store is located.
  • d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29.
firstDay = 250
lastDay = 1913

# Use x sales days (columns) for training
numCols = [f"d_{day}" for day in range(firstDay, lastDay+1)]

# Define all categorical columns
catCols = ["id", "item_id", "dept_id", "store_id", "cat_id", "state_id"]

# Define the correct data types for "sales_train_validation.csv"
dtype = {numCol: "float32" for numCol in numCols} 
dtype.update({catCol: "category" for catCol in catCols if catCol != "id"})

[(k,v)  for k,v in dtype.items()][:10]
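
To make the resulting mapping concrete, here is the same construction with a shortened, made-up day range (days 1 to 3 instead of 250 to 1913). The variable names are illustrative only.

```python
# Hypothetical short day range, just to keep the printed mapping small.
first_day, last_day = 1, 3
num_cols = [f"d_{day}" for day in range(first_day, last_day + 1)]
cat_cols = ["id", "item_id", "dept_id", "store_id", "cat_id", "state_id"]

# Sales columns are read as float32 ...
dtype = {num_col: "float32" for num_col in num_cols}
# ... and every categorical column except "id" is read as category.
dtype.update({cat_col: "category" for cat_col in cat_cols if cat_col != "id"})

print(dtype)
```

The "id" column is left out of the mapping, so pandas reads it with its default type and it is never converted to integer codes.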


# Read csv file
ds = pd.read_csv("./sales_train_validation.csv", 
                 usecols = catCols + numCols, dtype = dtype)

ds.head()


# Transform categorical features into integers
for col in catCols:
    if col != "id":
        ds[col] = ds[col].cat.codes.astype("int16")
        ds[col] -= ds[col].min()
        
ds = pd.melt(ds,
             id_vars = catCols,
             value_vars = [col for col in ds.columns if col.startswith("d_")],
             var_name = "d",
             value_name = "sales")

# Merge "ds" with "calendar" and "prices" dataframe
ds = ds.merge(calendar, on = "d", copy = False)
ds = ds.merge(prices, on = ["store_id", "item_id", "wm_yr_wk"], copy = False)

ds.head()
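
The reshape and the two merges can be traced on a toy table. All values below are invented. The sketch only assumes what the code above already uses: pd.melt to go from wide to long, and DataFrame.merge with its default inner join.

```python
import pandas as pd

# Toy wide sales table: one row per item/store, one column per day.
wide = pd.DataFrame({
    "id": ["A_S1", "B_S1"],
    "item_id": [0, 1],
    "store_id": [0, 0],
    "d_1": [3.0, 0.0],
    "d_2": [1.0, 2.0],
})

# Wide -> long: one row per (item, store, day).
long_df = pd.melt(wide,
                  id_vars=["id", "item_id", "store_id"],
                  value_vars=["d_1", "d_2"],
                  var_name="d",
                  value_name="sales")

# Toy calendar: maps each day label to its week id.
cal = pd.DataFrame({"d": ["d_1", "d_2"], "wm_yr_wk": [11101, 11101]})

# Toy weekly prices; item 1 has no price row for this week.
px = pd.DataFrame({"store_id": [0], "item_id": [0],
                   "wm_yr_wk": [11101], "sell_price": [2.5]})

# Default merge is an inner join, so rows without a price are dropped.
merged = (long_df.merge(cal, on="d")
                 .merge(px, on=["store_id", "item_id", "wm_yr_wk"]))

print(len(long_df), len(merged))
print(merged)
```

In this toy data the long table has four rows, but only the two rows for item 0 survive the price merge. On the real data the same effect drops item/week combinations that have no entry in sell_prices.csv.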


Original article: https://www.cnblogs.com/wqbin/p/12785680.html
