Image: ubuntu2004
Kernel: Python 3 (Anaconda 2020)
!pip install pandas_read_xml
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas_read_xml in /home/user/.local/lib/python3.7/site-packages (0.3.1)
Requirement already satisfied: pyarrow in /home/user/.local/lib/python3.7/site-packages (from pandas_read_xml) (3.0.0)
Requirement already satisfied: requests in /ext/anaconda2020.02/lib/python3.7/site-packages (from pandas_read_xml) (2.24.0)
Requirement already satisfied: zipfile36 in /home/user/.local/lib/python3.7/site-packages (from pandas_read_xml) (0.1.3)
Requirement already satisfied: distlib in /ext/anaconda2020.02/lib/python3.7/site-packages (from pandas_read_xml) (0.3.1)
Requirement already satisfied: xmltodict in /ext/anaconda2020.02/lib/python3.7/site-packages (from pandas_read_xml) (0.12.0)
Requirement already satisfied: pandas in /ext/anaconda2020.02/lib/python3.7/site-packages (from pandas_read_xml) (1.1.5)
Requirement already satisfied: urllib3>=1.26.3 in /home/user/.local/lib/python3.7/site-packages (from pandas_read_xml) (1.26.4)
Requirement already satisfied: numpy>=1.16.6 in /ext/anaconda2020.02/lib/python3.7/site-packages (from pyarrow->pandas_read_xml) (1.18.5)
Requirement already satisfied: chardet<4,>=3.0.2 in /ext/anaconda2020.02/lib/python3.7/site-packages (from requests->pandas_read_xml) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /ext/anaconda2020.02/lib/python3.7/site-packages (from requests->pandas_read_xml) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /ext/anaconda2020.02/lib/python3.7/site-packages (from requests->pandas_read_xml) (2020.12.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /ext/anaconda2020.02/lib/python3.7/site-packages (from pandas->pandas_read_xml) (2.8.0)
Requirement already satisfied: pytz>=2017.2 in /ext/anaconda2020.02/lib/python3.7/site-packages (from pandas->pandas_read_xml) (2019.3)
Requirement already satisfied: six>=1.5 in /ext/anaconda2020.02/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas->pandas_read_xml) (1.14.0)
test_xml = """<?xml version="1.0" encoding="UTF-8"?>
<!-- bookstore.xml -->
<bookstore>
  <book ISBN="0123456001">
    <title>Java For Dummies</title>
    <author>Tan Ah Teck</author>
    <category>Programming</category>
    <year>2009</year>
    <edition>7</edition>
    <price>19.99</price>
  </book>
  <book ISBN="0123456002">
    <title>More Java For Dummies</title>
    <author>Tan Ah Teck</author>
    <category>Programming</category>
    <year>2008</year>
    <price>25.99</price>
  </book>
  <book ISBN="0123456010">
    <title>The Complete Guide to Fishing</title>
    <author>Bill Jones</author>
    <author>James Cook</author>
    <author>Mary Turing</author>
    <category>Fishing</category>
    <category>Leisure</category>
    <language>French</language>
    <year>2000</year>
    <edition>2</edition>
    <price>49.99</price>
  </book>
</bookstore>"""
import pandas_read_xml as pdxi
from pandas_read_xml import flatten, fully_flatten, auto_separate_tables
/ext/anaconda2020.02/lib/python3.7/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.4) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
df = pdxi.read_xml(test_xml, ['bookstore'])
df
book
0 [{'@ISBN': '0123456001', 'title': 'Java For Du...
df = df.pipe(flatten)
df
book
0 {'@ISBN': '0123456001', 'title': 'Java For Dum...
1 {'@ISBN': '0123456002', 'title': 'More Java Fo...
2 {'@ISBN': '0123456010', 'title': 'The Complete...
df = df.pipe(flatten)
df
   book|@ISBN  book|title                     book|author                            book|category       book|year  book|edition  book|price  book|language
0  0123456001  Java For Dummies               Tan Ah Teck                            Programming         2009       7             19.99       NaN
1  0123456002  More Java For Dummies          Tan Ah Teck                            Programming         2008       NaN           25.99       NaN
2  0123456010  The Complete Guide to Fishing  [Bill Jones, James Cook, Mary Turing]  [Fishing, Leisure]  2000       2             49.99       French
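In the flattened frame above, the third book's author and category cells hold lists because those tags repeat inside a single `<book>` element. The sketch below reproduces that shape with the standard library only. It is an illustration, not how `pandas_read_xml` works internally; `sample_xml` is a trimmed copy of the notebook's sample, and `book_to_row` is a hypothetical helper introduced here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the notebook's sample: two books, fewer fields (assumption for brevity).
sample_xml = """<bookstore>
  <book ISBN="0123456001">
    <title>Java For Dummies</title>
    <author>Tan Ah Teck</author>
    <category>Programming</category>
  </book>
  <book ISBN="0123456010">
    <title>The Complete Guide to Fishing</title>
    <author>Bill Jones</author>
    <author>James Cook</author>
    <author>Mary Turing</author>
    <category>Fishing</category>
    <category>Leisure</category>
  </book>
</bookstore>"""


def book_to_row(book):
    """Turn one <book> element into a flat dict (hypothetical helper)."""
    row = {"@ISBN": book.get("ISBN")}
    for child in book:
        if child.tag not in row:
            row[child.tag] = child.text                    # first occurrence stays a scalar
        elif isinstance(row[child.tag], list):
            row[child.tag].append(child.text)              # third or later occurrence
        else:
            row[child.tag] = [row[child.tag], child.text]  # second occurrence becomes a list
    return row


rows = [book_to_row(book) for book in ET.fromstring(sample_xml).iter("book")]
for row in rows:
    print(row)
```

A tag that appears once stays a plain string, while a repeated tag turns into a list, which is why only the third book shows bracketed values.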
key_columns = ['book|@ISBN']
data = df.pipe(auto_separate_tables, key_columns)
data.keys()
dict_keys(['author', 'category', 'book'])
data['author']
@ISBN author
0 0123456010 Bill Jones
1 0123456010 James Cook
2 0123456010 Mary Turing
3 0123456001 Tan Ah Teck
4 0123456002 Tan Ah Teck
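The author table pairs each ISBN with one author per row, so a book with three authors contributes three rows. A minimal standard-library sketch of that key-plus-repeated-child idea follows. It is not the library's implementation: `child_table` is a hypothetical helper, `sample_xml` is a trimmed copy of the notebook's sample, and the rows come out in document order, which differs from the order the notebook output shows.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the notebook's sample: two books, only the author tags (assumption for brevity).
sample_xml = """<bookstore>
  <book ISBN="0123456001">
    <author>Tan Ah Teck</author>
  </book>
  <book ISBN="0123456010">
    <author>Bill Jones</author>
    <author>James Cook</author>
    <author>Mary Turing</author>
  </book>
</bookstore>"""


def child_table(root, key_attr, tag):
    """Return one (key, value) tuple per <tag> child of each <book> (hypothetical helper)."""
    return [
        (book.get(key_attr), child.text)
        for book in root.iter("book")
        for child in book.findall(tag)
    ]


root = ET.fromstring(sample_xml)
author_rows = child_table(root, "ISBN", "author")
for isbn, author in author_rows:
    print(isbn, author)
```

Keeping the ISBN on every row is what lets the separate table be joined back to the main book table later.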