I'm trying to parse XML and convert it to CSV. The tricky part is that the headers must exactly match the terms specified in the documentation of a third-party CSV parser, and those terms contain spaces between words, e.g. "Item title", "Item description", etc.
Since item fields are defined as class attributes in items.py, I can't create fields whose names contain spaces, e.g.
Item title = scrapy.Field()
I've tried adding to settings.py:
FEED_EXPORT_FIELDS = ["Item title", "Item description"]
It changes the CSV headers, but then they no longer match the Item fields, so no data is written to the .csv file.
class MySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/feed.xml']
    itertag = 'item'

    def parse_node(self, response, node):
        item = FeedItem()
        item['id'] = node.xpath('//*[name()="g:id"]/text()').get()
        item['title'] = node.xpath('//*[name()="g:title"]/text()').get()
        item['description'] = node.xpath('//*[name()="g:description"]/text()').get()
        return item
Parser works fine, I get all the data I need. The issue is just with csv headers.
Is there an easy way to add custom headers that don't match the Item field names and can contain several words?
Output I currently get:
id, title, description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.
Desired output should look like this:
ID, Item title, Item description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.
Input for testing:
<rss>
  <channel>
    <title>Example</title>
    <link>http://www.example.com</link>
    <description>Description of Example.com</description>
    <item>
      <g:id>12345</g:id>
      <g:title>Lorem Ipsum</g:title>
      <g:description>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</g:description>
    </item>
    <item>
      <g:id>12346</g:id>
      <g:title>Quick Fox</g:title>
      <g:description>The quick brown fox jumps over the lazy dog.</g:description>
    </item>
  </channel>
</rss>
And this is the content of items.py:
import scrapy


class FeedItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    pass
score:1
You can write your own CSV exporter. The simplest approach is to subclass the built-in CsvItemExporter and override the method that writes the header row:
# exporters.py
from scrapy.exporters import CsvItemExporter


class MyCsvItemExporter(CsvItemExporter):
    # maps Item field names to the header text written to the CSV
    header_map = {
        'id': 'ID',
        'title': 'Item title',
        'description': 'Item description',
    }

    def _write_headers_and_set_fields_to_export(self, item):
        if not self.include_headers_line:
            return
        # this is the parent logic taken from the parent class
        # (it may differ between Scrapy versions, so check yours)
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts try using fields of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use fields declared in Item
                self.fields_to_export = list(item.fields.keys())
        headers = list(self._build_row(self.fields_to_export))
        # here we add our own extra mapping:
        # replace each header with its mapped value, if one is defined
        headers = [self.header_map.get(header, header) for header in headers]
        self.csv_writer.writerow(headers)
And then activate it in your settings:
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.MyCsvItemExporter',
}
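The header-mapping step itself can be tried outside Scrapy. Below is a minimal sketch using only the standard library `csv` module. `HEADER_MAP` and `export_csv` are hypothetical names made up for this illustration, not part of Scrapy, and the sample rows are shortened from the question's feed. It shows the same idea as the exporter above: rows are still looked up by the original field names and only the header row is renamed.

```python
import csv
import io

# Hypothetical mapping from item field names to the headers
# the third-party parser expects.
HEADER_MAP = {
    "id": "ID",
    "title": "Item title",
    "description": "Item description",
}


def export_csv(items, fields, header_map):
    """Return CSV text for items, renaming only the header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Fall back to the field name when no custom header is defined.
    writer.writerow([header_map.get(field, field) for field in fields])
    for item in items:
        # Values are still looked up by the original field names.
        writer.writerow([item.get(field, "") for field in fields])
    return buffer.getvalue()


items = [
    {"id": "12345", "title": "Lorem Ipsum", "description": "Lorem ipsum dolor sit amet"},
    {"id": "12346", "title": "Quick Fox", "description": "The quick brown fox jumps over the lazy dog."},
]
csv_text = export_csv(items, ["id", "title", "description"], HEADER_MAP)
print(csv_text)
```

Keeping the mapping separate from the item definition means the spider and items.py stay unchanged if the third-party parser later renames a column.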
score:0
You can use a built-in dict as the item, with the required CSV header values as the dictionary keys:
def parse_node(self, response, node):
    item = dict()  # item = {}
    item['ID'] = node.xpath('//*[name()="g:id"]/text()').get()
    item['Item title'] = node.xpath('//*[name()="g:title"]/text()').get()
    item['Item description'] = node.xpath('//*[name()="g:description"]/text()').get()
    return item  # yield item
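If you would rather keep FeedItem and its short field names in the spider, a variation on the same idea is to rename the keys just before export, for example in an item pipeline. This is only a sketch: `RenameFieldsPipeline` and `HEADER_MAP` are hypothetical names, and the code assumes the item can be converted with `dict(item)`, which holds for plain dicts and for `scrapy.Item` instances.

```python
# Hypothetical mapping; the keys are the FeedItem field names from the question.
HEADER_MAP = {
    "id": "ID",
    "title": "Item title",
    "description": "Item description",
}


class RenameFieldsPipeline:
    """Hypothetical item pipeline that swaps field names for CSV header labels."""

    def process_item(self, item, spider):
        # Build a plain dict whose keys are the desired headers;
        # keys without a mapping are kept unchanged.
        return {HEADER_MAP.get(key, key): value for key, value in dict(item).items()}


# Quick check outside Scrapy; spider is unused here, so None is passed.
renamed = RenameFieldsPipeline().process_item(
    {"id": "12345", "title": "Lorem Ipsum", "description": "Lorem ipsum dolor sit amet"},
    spider=None,
)
print(renamed)
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting in settings.py. Because the exported items are then dicts keyed by the final header text, a FEED_EXPORT_FIELDS list such as ["ID", "Item title", "Item description"] should match them and fix the column order.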
Credit To: stackoverflow.com