如何从火花数据框中过滤出一个空值

By simon at 2018-02-28 • 0人收藏 • 43人看过

我使用以下模式在spark中创建了一个数据框:

root
 |-- user_id: long (nullable = false)
 |-- event_id: long (nullable = false)
 |-- invited: integer (nullable = false)
 |-- day_diff: long (nullable = true)
 |-- interested: integer (nullable = false)
 |-- event_owner: long (nullable = false)
 |-- friend_id: long (nullable = false)
数据如下所示:
+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|     null|
| 516353916|1076364848|      0|       3|         1| 3597645735|     null|
| 528218683|1151525474|      0|       1|         0| 3433080956|     null|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|     null|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+
我想过滤在“friendid”字段中的行具有空值。
scala> val aaa = test.filter("friend_id is null")

scala> aaa.count
我得到了:res52:Long = 0这显然不对。什么是正确的方式 得到它? 还有一个问题,我想替换friend
id字段中的值。我想要 用null替换null对于除null以外的任何其他值,都为0和1。我可以的代码 弄清楚是:
val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)
此代码也不起作用。任何人可以电话我怎样才能解决它?谢谢

7 个回复 | 最后更新于 2018-02-28
2018-02-28   #1

假设您有这样的数据设置(以便结果可重现):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF
数据是:
+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n2|   2|n2@c1.com|[c1,1,d1]|
|  n3|   3|n3@c1.com|[c1,1,d1]|
|  n4|   4|n4@c2.com|[c2,2,d2]|
|  n5|null|n5@c2.com|[c2,2,d2]|
|  n6|   6|n6@c2.com|[c2,2,d2]|
|  n7|   7|n7@c3.com|[c3,3,d3]|
|  n8|   8|n8@c3.com|[c3,3,d3]|
+----+----+---------+---------+
现在过滤员工与null ID,你会做 -
df.filter("id is null").show
这将正确向您显示以下内容:
+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n5|null|n5@c2.com|[c2,2,d2]|
+----+----+---------+---------+
来到第二页你的问题的艺术,你可以取代null ids与0和其他值与1与此 -
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show
这个结果s在:
+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|n1@c1.com|[c1,1,d1]|
|  n2|  1|n2@c1.com|[c1,1,d1]|
|  n3|  1|n3@c1.com|[c1,1,d1]|
|  n4|  1|n4@c2.com|[c2,2,d2]|
|  n5|  0|n5@c2.com|[c2,2,d2]|
|  n6|  1|n6@c2.com|[c2,2,d2]|
|  n7|  1|n7@c3.com|[c3,3,d3]|
|  n8|  1|n8@c3.com|[c3,3,d3]|
+----+---+---------+---------+

2018-02-28   #2

df.where(df.col("friend_id").isNull)

2018-02-28   #3

我使用下面的代码来解决我的问题。有用。但我们都知道,我 在一个国家的一英里工作解决它。所以,是的那有什么捷径? 谢谢

def filter_null(field : Any) : Int = field match {
    case null => 0
    case _    => 1
}

val test = train_event_join.join(
    user_friends_pair,
    train_event_join("user_id") === user_friends_pair("user_id") &&
    train_event_join("event_owner") === user_friends_pair("friend_id"),
    "left"
).select(
    train_event_join("user_id"),
    train_event_join("event_id"),
    train_event_join("invited"),
    train_event_join("day_diff"),
    train_event_join("interested"),
    train_event_join("event_owner"),
    user_friends_pair("friend_id")
).rdd.map{
    line => (
        line(0).toString.toLong,
        line(1).toString.toLong,
        line(2).toString.toLong,
        line(3).toString.toLong,
        line(4).toString.toLong,
        line(5).toString.toLong,
        filter_null(line(6))
        )
    }.toDF("user_id", "event_id", "invited", "day_diff", "interested", "event_owner", "creator_is_friend")

2018-02-28   #4

假设您有这样的数据设置(以便结果可重现):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF
数据是:
+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n2|   2|n2@c1.com|[c1,1,d1]|
|  n3|   3|n3@c1.com|[c1,1,d1]|
|  n4|   4|n4@c2.com|[c2,2,d2]|
|  n5|null|n5@c2.com|[c2,2,d2]|
|  n6|   6|n6@c2.com|[c2,2,d2]|
|  n7|   7|n7@c3.com|[c3,3,d3]|
|  n8|   8|n8@c3.com|[c3,3,d3]|
+----+----+---------+---------+
现在过滤员工与null ID,你会做 -
df.filter("id is null").show
这将正确向您显示以下内容:
+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n5|null|n5@c2.com|[c2,2,d2]|
+----+----+---------+---------+
来到第二页你的问题的艺术,你可以取代null ids与0和其他值与1与此 -
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show
这个结果s在:
+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|n1@c1.com|[c1,1,d1]|
|  n2|  1|n2@c1.com|[c1,1,d1]|
|  n3|  1|n3@c1.com|[c1,1,d1]|
|  n4|  1|n4@c2.com|[c2,2,d2]|
|  n5|  0|n5@c2.com|[c2,2,d2]|
|  n6|  1|n6@c2.com|[c2,2,d2]|
|  n7|  1|n7@c3.com|[c3,3,d3]|
|  n8|  1|n8@c3.com|[c3,3,d3]|
+----+---+---------+---------+

2018-02-28   #5

从Michael Kopaniov的提示,下面的作品

df.where(df("id").isNotNull).show

2018-02-28   #6

或者像df.filter($"friend_id".isNotNull)一样

2018-02-28   #7

我找到了一个很好的解决方案,让我用任何空值删除行: Dataset<Row> filtered = df.filter(row -> !row.anyNull()); 如果有人对另一种情况感兴趣,j请拨打row.anyNull().(Spark 2.1.0使用Java API)

登录后方可回帖

Loading...