By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRING columns, any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW FORMAT ….
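For instance, as a minimal sketch of overriding these defaults (not from the original post; it assumes the u_data table created at the end of these notes and uses /bin/cat as a pass-through script), both the rows fed to the script and the rows read back from it can be comma-delimited instead of TAB-delimited:
FROM u_data
SELECT TRANSFORM(userid, movieid)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
USING '/bin/cat'
AS (uid STRING, mid STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';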
Note:
Formally, MAP … and REDUCE … are syntactic transformations of SELECT TRANSFORM ( … ). In other words, they serve as comments or notes to the reader of the query. BEWARE: Use of these keywords may be dangerous as (e.g.) typing “REDUCE” does not force a reduce phase to occur and typing “MAP” does not force a new map phase!
clusterBy: CLUSTER BY colName (',' colName)*
distributeBy: DISTRIBUTE BY colName (',' colName)*
sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*

rowFormat
  : ROW FORMAT
    (DELIMITED [FIELDS TERMINATED BY char]
               [COLLECTION ITEMS TERMINATED BY char]
               [MAP KEYS TERMINATED BY char]
               [ESCAPED BY char]
               [LINES SEPARATED BY char]
     | SERDE serde_name [WITH SERDEPROPERTIES
                              property_name=property_value,
                              property_name=property_value, ...])

outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER className

query:
  FROM (
    FROM src
    MAP expression (',' expression)*
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  REDUCE expression (',' expression)*
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?

  FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?
Transformation example 1:
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script'
AS date, count;
FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid, pv_users.date)
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
SELECT TRANSFORM(map_output.dt, map_output.uid)
USING 'reduce_script'
AS date, count;
Transformation example 2:
FROM (
FROM src
SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
USING '/bin/cat'
AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
RECORDREADER 'org.apache.hadoop.hive.contrib.util.typedbytes.TypedBytesRecordReader'
) tmap
INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue
Typing the output of TRANSFORM
The script's output columns are STRING by default; to cast them to other types, write the query as follows.
SELECT TRANSFORM(stuff)
USING 'script'
AS thing1, thing2
With explicit type casting:
SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)
A multi-insert example: scan staged_employees once and write the matching rows into each partition of the employee table.
FROM staged_employees se
INSERT OVERWRITE TABLE employee PARTITION (country = 'US', state = 'OR')
  SELECT * WHERE se.country = 'US' AND se.state = 'OR'
INSERT OVERWRITE TABLE employee PARTITION (country = 'US', state = 'CA')
  SELECT * WHERE se.country = 'US' AND se.state = 'CA';
A CASE ... WHEN example that buckets the ratings in u_data into levels:
SELECT userid, movieid,
  CASE
    WHEN rating <= 1 THEN "low"
    WHEN rating > 1 AND rating <= 3 THEN "middle"
    ELSE "high"
  END AS level
FROM u_data LIMIT 10;
4. WHERE clauses (corresponds to section 6.2, WHERE Clauses, in Programming Hive)
4.1 Column aliases cannot be used in a WHERE clause.
SELECT name, salary, salary * (1-deductions["Federal Taxes"]) as salary_minus_fed_taxes
FROM employees
WHERE round(salary * (1-deductions["Federal Taxes"])) > 70000
This query repeats the same expression twice. The following attempt uses a column alias to remove the duplication, but it does not work:
SELECT name, salary, salary * (1-deductions["Federal Taxes"]) as salary_minus_fed_taxes
FROM employees
WHERE round(salary_minus_fed_taxes) > 70000;
Error: Invalid table alias or column reference 'salary_minus_fed_taxes'
As the error message indicates, a column alias cannot be referenced in the WHERE clause. We can, however, use a nested SELECT statement:
SELECT e.* FROM
(SELECT name,salary, salary * (1-deductions["Federal Taxes"]) as salary_minus_fed_taxes FROM employees) e
WHERE round(e.salary_minus_fed_taxes) > 70000;
4.2 Predicate operators
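The operators themselves (=, <>, <, <=, >, >=, IS NULL, IS NOT NULL, LIKE, RLIKE, and so on) are standard; a minimal sketch (not in the original notes) using the employees table from section 4.1 above:
SELECT name, salary
FROM employees
WHERE salary >= 60000
  AND name LIKE 'A%'
  AND deductions["Federal Taxes"] IS NOT NULL;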
4.3 GROUP BY clauses
GROUP BY is usually used together with aggregate functions: it groups the result set by one or more columns and then applies an aggregate function to each group.
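A minimal sketch (not in the original notes), reusing the stocks table and its symbol and price_close columns from the join example below, grouping by symbol to compute an average closing price per stock:
SELECT symbol, avg(price_close) AS avg_close
FROM stocks
GROUP BY symbol;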
4.4 JOIN statements
Hive supports the usual SQL JOIN syntax, but only equi-joins, and the ON clause may only combine conditions with AND.
4.4.1 INNER JOIN
SELECT a.ymd, a.price_close, b.price_close, c.price_close
FROM stocks a JOIN stocks b ON a.ymd = b.ymd
JOIN stocks c ON a.ymd = c.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
UNION ALL combines two or more subqueries into a single result set. Every union subquery must return the same columns, and the type of each column must be consistent across subqueries.
For example, if the second column is a FLOAT, then the second column of every other subquery must also be a FLOAT.
An example that merges log data:
SELECT log.ymd, log.level, log.message
FROM(
SELECT l1.ymd, l1.level, l1.message, 'Log1' AS source
FROM log1 l1
UNION ALL
SELECT l2.ymd, l2.level, l2.message, 'Log2' AS source
FROM log2 l2
) log
SORT BY log.ymd ASC;
5 Views (chapter 7, Views, in Programming Hive)
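That chapter centers on CREATE VIEW; as a minimal sketch (not in the original notes), the nested query from section 4.1 can be wrapped in a view (the view name employee_net_pay is made up for illustration):
CREATE VIEW IF NOT EXISTS employee_net_pay AS
SELECT name, salary,
       salary * (1 - deductions["Federal Taxes"]) AS salary_minus_fed_taxes
FROM employees;
SELECT * FROM employee_net_pay
WHERE round(salary_minus_fed_taxes) > 70000;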
3.3 Some usage examples
SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
A simple example, the MovieLens u_data table. Create the table:
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Load the data, then count the rows:
LOAD DATA LOCAL INPATH '/Users/keything/Downloads/ml-100k/u.data' OVERWRITE INTO TABLE u_data;
SELECT COUNT(*) FROM u_data;