Type error on first steps with Apache Parquet

By simon at 2018-02-07 • 0 favorites • 54 views

Running into this kind of error while trying out the Apache Parquet file format for the first time is confusing. Shouldn't Parquet support all of pandas' data types? What am I missing?

import pandas
import pyarrow
import numpy

data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)
This raises:
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)

table.pxi in pyarrow.lib.Table.from_pandas()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
    354             arrays = list(executor.map(convert_column,
    355                                        columns_to_convert,
--> 356                                        convert_types))
    357 
    358     types = [x.type for x in arrays]

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    423                 raise CancelledError()
    424             elif self._state == FINISHED:
--> 425                 return self.__get_result()
    426 
    427             self._condition.wait(timeout)

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
    343 
    344     def convert_column(col, ty):
--> 345         return pa.array(col, from_pandas=True, type=ty)
    346 
    347     if nthreads == 1:

array.pxi in pyarrow.lib.array()

array.pxi in pyarrow.lib._ndarray_to_array()

error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer
data.dtypes is:
0      object
1      object
2      object
3      object
4      object
5     float64
6     float64
7      object
8      object
9      object
10     object
11     object
12     object
13    float64
14     object
15    float64
16     object
17    float64
...
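
The ArrowInvalid above means pyarrow inferred an integer type for one of these object columns and then hit a string value. A minimal diagnostic sketch (assuming the DataFrame is loaded as data, as in the snippet above) for finding which object columns actually mix Python types:

import pandas

# Re-create the DataFrame from the question.
data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")

# For each object-dtype column, count the distinct Python types it holds;
# a column mixing e.g. int and str is what Arrow cannot convert.
for name in data.select_dtypes(include="object").columns:
    kinds = data[name].map(type).value_counts()
    if len(kinds) > 1:
        print(name, kinds.to_dict())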

1 reply | Last updated 2018-02-07
2018-02-07   #1

In Apache Arrow, table columns must be homogeneous in their data types. pandas supports Python object columns in which the values can be of different types, so you will need to do some data scrubbing before writing to the Parquet format. We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow Python bindings, but we don't take any guesses about how to handle bad data. I opened the JIRA issue https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values in situations like this, which may help in the future.
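
A minimal scrubbing sketch along those lines ("bad_col" and "mixed_col" are placeholder column names, not taken from the question):

import pandas
import pyarrow
import pyarrow.parquet as pq

data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")

# Coerce a numeric-looking object column to a real numeric dtype;
# values that cannot be parsed become NaN instead of staying as str.
data["bad_col"] = pandas.to_numeric(data["bad_col"], errors="coerce")

# Alternatively, force a genuinely mixed column to a single type (string)
# so the resulting Arrow column is homogeneous.
data["mixed_col"] = data["mixed_col"].astype(str)

# Once every column holds a single type, the conversion and write succeed.
table = pyarrow.Table.from_pandas(data)
pq.write_table(table, "data/BigData.parquet")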
