For the past two days I have been scanning the Internet trying to find the solution to my problem. I have a folder of different files; they run the gamut of file types. I am trying to write a Python script that will read the metadata from each file, if it exists. The intent is to eventually output the data to a file to compare with another program's metadata extraction.
I have found some examples where it worked for only a few of the files in the directory. All the approaches I have found deal with opening a Storage Container object. I am new to Python and am not sure what a Storage Container object is. I just know that most of my files error out when trying to use
pythoncom.StgOpenStorage(<File Name>, None, flags)
With the few that actually work, I am able to get the main metadata tags, like Title, Subject, Author, Created, etc.
Does anyone know a way other than Storage Containers to get to the metadata? Also, if there is an easier way to do this with another language, by all means, suggest it.
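For reference, here is roughly how I am calling it (a minimal sketch; the flag constants come from pywin32's storagecon module, and the file path is just a made-up example):
import pythoncom
from win32com import storagecon

filename = r'C:\some\file.doc'  # hypothetical path
flags = storagecon.STGM_READ | storagecon.STGM_SHARE_EXCLUSIVE
try:
    stg = pythoncom.StgOpenStorage(filename, None, flags)
except pythoncom.com_error:
    # Most plain files land here: they are not compound/structured-storage files.
    stg = None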
Thanks
You can use the Shell COM objects to retrieve any metadata visible in Explorer:
import win32com.client

sh = win32com.client.gencache.EnsureDispatch('Shell.Application', 0)
ns = sh.NameSpace(r'm:\music\Aerosmith\Classics Live!')

colnum = 0
columns = []
while True:
    colname = ns.GetDetailsOf(None, colnum)
    if not colname:
        break
    columns.append(colname)
    colnum += 1

for item in ns.Items():
    print(item.Path)
    for colnum in range(len(columns)):
        colval = ns.GetDetailsOf(item, colnum)
        if colval:
            print('\t', columns[colnum], colval)
I decided to write my own answer as an attempt to combine and clarify the answers above (which helped me a great deal in solving my problem).
I'd say there are two approaches to this problem.
Situation 1: you know which metadata the file contains (which metadata you're interested in).
In this case, let's say you have a list of strings containing the metadata you're interested in. I assume here that these tags are correct (i.e. you're not asking for the number of pixels of a .txt file).
metadata = ['Name', 'Size', 'Item type', 'Date modified', 'Date created']
Now, using the code provided by Greedo and Roger Upole I created a function which accepts the file's full path and name separately and returns a dictionary containing the metadata of interest:
import win32com.client

def get_file_metadata(path, filename, metadata):
    # path should not end with a backslash, i.e. "E:\Images\Paris"
    # filename must include the extension, i.e. "PID manual.pdf"
    # Returns a dictionary containing the requested file metadata.
    sh = win32com.client.gencache.EnsureDispatch('Shell.Application', 0)
    ns = sh.NameSpace(path)
    # Enumeration is necessary because ns.GetDetailsOf only accepts an integer as its 2nd argument
    file_metadata = dict()
    item = ns.ParseName(str(filename))
    for ind, attribute in enumerate(metadata):
        attr_value = ns.GetDetailsOf(item, ind)
        if attr_value:
            file_metadata[attribute] = attr_value
    return file_metadata
# Note: you must know the full path to the file.
# Example usage:
if __name__ == '__main__':
    folder = r'E:\Docs\BMW'
    filename = 'BMW series 1 owners manual.pdf'
    metadata = ['Name', 'Size', 'Item type', 'Date modified', 'Date created']
    print(get_file_metadata(folder, filename, metadata))
Results with:
{'Name': 'BMW series 1 owners manual.pdf', 'Size': '11.4 MB', 'Item type': 'Foxit Reader PDF Document', 'Date modified': '8/30/2020 11:10 PM', 'Date created': '8/30/2020 11:10 PM'}
Which is correct, as I just created the file and I use Foxit PDF reader as my main pdf reader.
So this function returns a dictionary, where the keys are the metadata tags and the values are the values of those tags for the given file.
Situation 2: you don't know which metadata the file contains
This is a somewhat tougher situation, especially in terms of optimality. I analyzed the code proposed by Roger Upole: basically, he attempts to read the metadata of a None file, which results in a list of all possible metadata tags. So I thought it might be easier to hard-code this list and then attempt to read every tag. That way, once you're done, you'll have a dictionary containing all the tags the file actually possesses.
Simply copy what I THINK is every possible metadata tag and just attempt to obtain all of them from the file.
Basically, just copy this declaration of a python list, and use the code above (replace metadata with this new list):
metadata = ['Name', 'Size', 'Item type', 'Date modified', 'Date created', 'Date accessed', 'Attributes', 'Offline status', 'Availability', 'Perceived type', 'Owner', 'Kind', 'Date taken', 'Contributing artists', 'Album', 'Year', 'Genre', 'Conductors', 'Tags', 'Rating', 'Authors', 'Title', 'Subject', 'Categories', 'Comments', 'Copyright', '#', 'Length', 'Bit rate', 'Protected', 'Camera model', 'Dimensions', 'Camera maker', 'Company', 'File description', 'Masters keywords', 'Masters keywords']
I don't think this is a great solution, but on the other hand, you can keep this list as a global variable and then use it without needing to pass it to every function call. For the sake of completeness, here is the output of the previous function using this new metadata list:
{'Name': 'BMW series 1 owners manual.pdf', 'Size': '11.4 MB', 'Item type': 'Foxit Reader PDF Document', 'Date modified': '8/30/2020 11:10 PM', 'Date created': '8/30/2020 11:10 PM', 'Date accessed': '8/30/2020 11:10 PM', 'Attributes': 'A', 'Perceived type': 'Unspecified', 'Owner': 'KEMALS-ASPIRE-E\\kemal', 'Kind': 'Document', 'Rating': 'Unrated'}
As you can see, the dictionary returned now contains all the metadata that the file contains.
This works because of the if statement:
if attr_value:
which means that whenever an attribute's value is empty, it won't be added to the returned dictionary.
I'd emphasize that when processing many files it is better to declare the list as a global/static variable instead of passing it to the function every time.
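If you'd rather not maintain that hard-coded list, a small sketch in the spirit of Roger Upole's loop can build it for you by scanning a fixed range of column indices (Windows leaves gaps in the numbering, so you shouldn't stop at the first empty name). The function name here is just illustrative:
import win32com.client

def get_all_metadata_tags(path, max_columns=512):
    # Ask the folder for the name of every detail column it knows about.
    # Scanning a fixed range instead of breaking on the first empty name
    # picks up columns that come after a gap in the numbering.
    sh = win32com.client.gencache.EnsureDispatch('Shell.Application', 0)
    ns = sh.NameSpace(path)
    tags = {}
    for colnum in range(max_columns):
        name = ns.GetDetailsOf(None, colnum)
        if name:
            tags[colnum] = name
    return tags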
The problem is that there are two ways that Windows stores file metadata. The approach you're using is suitable for files created by COM applications; this data is included inside the file itself. However, with the introduction of NTFS5, any file can contain metadata as part of an alternate data stream. So it's possible the files that succeed are COM-app created ones, and the ones that are failing aren't.
Here's a possibly more robust way of dealing with the COM-app created files: Get document summary information from any file.
With alternate data streams, it's possible to read them directly:
meta = open('myfile.ext:StreamName').read()
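For example, a quick round-trip sketch (the stream name here is made up) just to show the syntax on an NTFS volume:
# Write to and read back an alternate data stream attached to an existing file.
with open(r'myfile.ext:MyStream', 'w') as stream:
    stream.write('hello from an alternate data stream')
print(open(r'myfile.ext:MyStream').read())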
Update: okay, now I see none of this is relevant because you were after document metadata and not file metadata. What a difference clarity in a question can make :|
Try this: How to retrieve author of a office file in python?
Windows API Code Pack may be used with Python for .NET to read/write file metadata.
Download the NuGet packages for WindowsAPICodePack-Core and WindowsAPICodePack-Shell.
Extract the .nupkg files with a compression utility like 7-Zip to the script's path or someplace defined in the system path variable.
Install Python for .NET with pip install pythonnet.
Example code to get and set the title of an MP4 video:
import clr
clr.AddReference("Microsoft.WindowsAPICodePack")
clr.AddReference("Microsoft.WindowsAPICodePack.Shell")
from Microsoft.WindowsAPICodePack.Shell import ShellFile
# create shell file object
f = ShellFile.FromFilePath(r'movie.mp4')
# read video title
print(f.Properties.System.Title.Value)
# set video title
f.Properties.System.Title.Value = 'My video'
Hack to check available properties:
dir(f.Properties.System)
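As a rough, untested sketch building on that hack, you could dump whichever properties happen to be filled in, assuming each entry exposes a .Value the way Title does above:
# Walk the names reported by dir() and print any that look like filled-in
# shell properties. Entries that are methods or nested property groups are
# simply skipped by the getattr default and the try/except.
for name in dir(f.Properties.System):
    if name.startswith('_'):
        continue
    try:
        prop = getattr(f.Properties.System, name)
        value = getattr(prop, 'Value', None)
        if value:
            print(name, '=', value)
    except Exception:
        pass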
Roger Upole's answer helped immensely. However, I also needed to read the "last saved by" detail in an ".xls" file.
XLS file attributes can be read with win32com. The Workbook object has a BuiltinDocumentProperties.
https://gist.github.com/justengel/87bac3355b1a925288c59500d2ce6ef5
import os
import win32com.client  # Requires "pip install pywin32"

__all__ = ['get_xl_properties', 'get_file_details']

# https://learn.microsoft.com/en-us/dotnet/api/microsoft.office.tools.excel.workbook.builtindocumentproperties?view=vsto-2017
BUILTIN_XLS_ATTRS = ['Title', 'Subject', 'Author', 'Keywords', 'Comments', 'Template', 'Last Author', 'Revision Number',
                     'Application Name', 'Last Print Date', 'Creation Date', 'Last Save Time', 'Total Editing Time',
                     'Number of Pages', 'Number of Words', 'Number of Characters', 'Security', 'Category', 'Format',
                     'Manager', 'Company', 'Number of Bytes', 'Number of Lines', 'Number of Paragraphs',
                     'Number of Slides', 'Number of Notes', 'Number of Hidden Slides', 'Number of Multimedia Clips',
                     'Hyperlink Base', 'Number of Characters (with spaces)']


def get_xl_properties(filename, xl=None):
    """Return the known XLS file attributes for the given .xls filename."""
    quit = False
    if xl is None:
        xl = win32com.client.DispatchEx('Excel.Application')
        quit = True

    # Open the workbook
    wb = xl.Workbooks.Open(filename)

    # Save the attributes in a dictionary
    attrs = {}
    for attrname in BUILTIN_XLS_ATTRS:
        try:
            val = wb.BuiltinDocumentProperties(attrname).Value
            if val:
                attrs[attrname] = val
        except Exception:
            pass

    # Quit the Excel application
    if quit:
        try:
            xl.Quit()
            del xl
        except Exception:
            pass

    return attrs


def get_file_details(directory, filenames=None):
    """Collect the attributes of a file or list of files.

    Args:
        directory (str): Directory or filename to get attributes for
        filenames (str/list/tuple): If the given directory is a directory, a filename or list of filenames must be given

    Returns:
        file_attrs (dict): Dictionary of {filename: {attribute_name: value}} or dictionary of {attribute_name: value}
            if a single file is given.
    """
    if os.path.isfile(directory):
        directory, filenames = os.path.dirname(directory), [os.path.basename(directory)]
    elif filenames is None:
        filenames = os.listdir(directory)
    elif not isinstance(filenames, (list, tuple)):
        filenames = [filenames]
    if not os.path.exists(directory):
        raise ValueError('The given directory does not exist!')

    # Open the COM object
    sh = win32com.client.gencache.EnsureDispatch('Shell.Application', 0)  # Generates local compiled with make.py
    ns = sh.NameSpace(os.path.abspath(directory))

    # Get the directory file attribute column names
    cols = {}
    for i in range(512):  # 308 seemed to be the max for an Excel file
        attrname = ns.GetDetailsOf(None, i)
        if attrname:
            cols[i] = attrname

    # Get the information for the files.
    files = {}
    for file in filenames:
        item = ns.ParseName(os.path.basename(file))
        files[os.path.abspath(item.Path)] = attrs = {}  # Store attributes in dictionary

        # Save attributes
        for i, attrname in cols.items():
            attrs[attrname] = ns.GetDetailsOf(item, i)

        # For xls files save special properties
        if os.path.splitext(file)[-1] == '.xls':
            xls_attrs = get_xl_properties(item.Path)
            attrs.update(xls_attrs)

    # Clean up the COM object
    try:
        del sh
    except Exception:
        pass

    if len(files) == 1:
        return files[list(files.keys())[0]]
    return files


if __name__ == '__main__':
    import argparse

    P = argparse.ArgumentParser(description="Read and print file details.")
    P.add_argument('filename', type=str, help='Filename to read and print the details for.')
    P.add_argument('-v', '--show-empty', action='store_true', help='If given print keys with empty values.')
    ARGS = P.parse_args()

    # Argparse Variables
    FILENAME = ARGS.filename
    SHOW_EMPTY = ARGS.show_empty

    DETAILS = get_file_details(FILENAME)
    print(os.path.abspath(FILENAME))
    for k, v in DETAILS.items():
        if v or SHOW_EMPTY:
            print('\t', k, '=', v)
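If you'd rather call the helpers from another script instead of the command line, a minimal usage sketch might look like the following (the path and filename are made up for illustration):
# Hypothetical usage: point get_file_details at a single file and print
# whichever attributes come back non-empty.
details = get_file_details(r'E:\Docs\report.xls')
for attr_name, value in details.items():
    if value:
        print(attr_name, '=', value)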
I know this is an old question, but I had the same problem and ended up creating a package to solve my problem: windows-metadata.
As an aside, Roger Upole's answer was a good starting point; however, it doesn't capture all the attributes a file can have (the break on an empty column name ends the loop too soon, since Windows skips some column numbers for whatever reason). So Roger's answer gives the first 30 or so attributes, when there are actually nearly 320.
Now, to answer the question using this package:
from windows_metadata import WindowsAttributes
attr = WindowsAttributes(<File Name>) # this will load all the filled attributes a file has
title = attr["Title"] # dict-like access
title = attr.title # attribute like access -> these two will return the same value
subject = attr.subject
author = attr.author
...
And so on for any available attributes a file has.
Related
I have multiple projects in my GitLab repository, and I make multiple commits to them as required.
I have developed a Python script that gives me a report of all the commits done by me, in CSV format, for all the projects in the GitLab repository; the project ids are hard-coded in my Python code as a list.
The header of the CSV file is: Date, Submitted, Gitlab_url, Project, Username, Subject.
Now I want to run the pipeline manually by setting up an environment variable named 'Project_Ids'
and pass some of the project ids as its value (more than one project id per value), so that the CSV report gets generated only for the projects passed in the environment variable.
So my question is: how can I pass multiple project ids as the value of the 'Project_Ids' key while running the pipeline manually?
import gitlab
import os
import datetime
import csv
import re

Project_id_list = ['9427', '8401', '17937', '26813', '24899', '23729', '34779', '27638', '28600']
headerList = ['Date', 'Submitted', 'Gitlab_url', 'Project', 'Branch', 'Status', 'Username', 'Ticket', 'Subject']
filename = 'mydemo_{}'.format(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))

# private token authentication
gl = gitlab.Gitlab('https://main.gitlab.in.com/', private_token="MLyWwLyEhU2zZjjjhZXog")
gl.auth()

# list all projects
for m in Project_id_list:
    i = 0
    if i < len(Project_id_list):
        i = +1
        print(m)
        projects = gl.projects.get(m)
        commits = projects.commits.list(all=True, query_parameters={'ref_name': 'master'})
        with open(f"{filename}_{m}.csv", 'w', newline="") as file:
            dw = csv.DictWriter(file, delimiter=',',
                                fieldnames=headerList)
            dw.writeheader()
            for commit in commits:
                print(commit)
                msg = commit.message
                if 'master' in msg or 'LCS-' in msg:
                    projectName = projects.path_with_namespace
                    branch = 'master'
                    status = 'merged'
                    date = commit.committed_date.split('T')[0]
                    submitted1 = commit.created_at.split('T')[1]
                    submitted = submitted1.split('.000')[0]
                    Gitlab_url = commit.web_url.split('-')[0]
                    username = commit.author_name
                    subject = commit.title
                    subject1 = commit.message.splitlines()
                    print(subject1)
                    subject2 = subject1[0:3]
                    print(subject2)
                    subject3 = ' '.join(subject2)
                    print(subject3)
                    match = re.search(r'S-\d+', subject3)
                    if match:
                        ticket = match.group(0)
                        ticket_url = 'https://.in.com/browse/' + str(ticket)
                        ticket1 = ticket_url
                        dw.writerow({'Date': date, 'Submitted': submitted, 'Gitlab_url': Gitlab_url, 'Project': projectName,
                                     'Branch': branch, 'Status': status, 'Username': username, 'Ticket': ticket1,
                                     'Subject': subject3})
                    else:
                        ticket1 = 'Not Found'
                        dw.writerow({'Date': date, 'Submitted': submitted, 'Gitlab_url': Gitlab_url, 'Project': projectName,
                                     'Branch': branch, 'Status': status, 'Username': username, 'Ticket': ticket1,
                                     'Subject': subject3})
Just use a space or some other delimiter in the variable value. For example, a string like 123 456 789
Then in Python, simply parse the variable. For example, using the string .split method to split on whitespace.
import os
...
project_ids_variable = os.environ.get('PROJECT_IDS', '')  # '123 456 789'
project_ids = project_ids_variable.split()  # ['123', '456', '789']

for project_id in project_ids:
    project = gl.projects.get(project_id)
    print(project)
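Applied to the script in the question, that might look roughly like this (assuming the CI/CD variable is named Project_Ids, as in the question, and contains space-separated ids):
import os

# Hypothetical adaptation of the script above: take the project ids from the
# 'Project_Ids' variable (e.g. "9427 8401 17937") instead of the hard-coded
# list; an unset variable yields an empty list.
Project_id_list = os.environ.get('Project_Ids', '').split()

for m in Project_id_list:
    print(m)  # the rest of the per-project report code stays the same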
How do we assign a dictionary a name taken from user input, save that dictionary to a txt file, and then search for it by its name and print the output to the user?
I am currently here:
Any ideas how?
import sys
import pathlib

'''Arg-V's should be in following order <app.py> <action> <nick_name> <name> <phone> <email>'''

if str(sys.argv[1]).lower == 'add':
    current = {'Name': sys.argv[3], 'Phone Number': sys.argv[4], 'Email': sys.argv[5]}
    with open('contacts.txt', 'w') as f:
        f.write(current)
As per Naming Lists Using User Input:
An indirect answer is that, as several other users have told you, you don't want to let the user choose your variable names; you want to use a dictionary of lists, so that all the lists users have created are kept together in one place.
import json

name = input('name/s of dictionary/ies : ')
names = {}
name = name.split()
print(name)

for i in name:
    names[i] = {}
print(names)

for i in names:
    print(i, '--------->', names[i])

for i in names:
    names[i] = '1' * len(i)

for i in names:
    with open(i + '.txt', 'w+') as file:
        file.write('prova : ' + json.dumps(names[i]))
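As a small follow-on sketch (using the same file-naming scheme as above), you could then ask the user which dictionary to look up and print back whatever was saved under that name:
# Ask for the name the user gave earlier and print the matching saved file.
wanted = input('name of dictionary to look up: ')
try:
    with open(wanted + '.txt') as file:
        print(file.read())
except FileNotFoundError:
    print('No saved dictionary called', wanted)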
I am trying to use the Spotipy package to export audio features for various songs and plot them using Plotly.
I used a tkinter GUI so I can enter the song name there, and the command sends it to the CSV.
All of that seems to be working properly; however, when the data is exported to the CSV it is added to the same line as the previous entry instead of a new line.
The problem did not exist the entire time, but I can't recall a specific thing that may have prompted it starting. I'm really new to Python/programming, so I apologize for my code.
Here's the function in which the writing to the CSV occurs:
def find_song(trackid):
    # Returns results as a list of dictionaries
    results = sp.audio_features(tracks=[trackid])
    # Change the list of dictionaries into a dataframe
    audio_features = pd.DataFrame(results)
    # Acquiring and labeling song name, artist name, today's date
    song = sp.track(trackid)['name']
    artist = sp.track(trackid)['artists']
    artist_info = pd.DataFrame(artist)
    now = datetime.datetime.now()
    artistname = artist_info.name[0]
    date = '%d/%d/%d' % (now.month, now.day, now.year)
    audio_features['Artist'] = artistname
    audio_features['Song Name'] = song
    audio_features['Date'] = date
    # New dataframe of only relevant information
    audio_features = audio_features[['Date', 'Artist', 'Song Name', 'acousticness', 'analysis_url', 'danceability', 'duration_ms', 'energy', 'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'track_href', 'type', 'uri', 'valence']]
    plot_me()
    audio_features.to_csv("/Users/username/Documents/CSV's_Landing/spotify_data.csv", mode='a', header=False)
I intend for the audio feature data for each new song to show up on a new line in the csv like so:
'Date','Artist','SongName','acousticness','analysis_url','danceability','duration_ms'
'0','8/8/2019','Post Animal', 'When I Get Home', '0.0665', 'https://api.spotify.com/v1/audio-analysis/5azJUob8ahbXB3M9YFwTpd', '0.245','323200'
'0','8/8/2019','BROCKHAMPTON', 'I BEEN BORN AGAIN', '0.324', 'https://api.spotify.com/v1/audio-analysis/5fiR9Dy9hNXEPZOLo1kyNb', '0.73', '219720'
Instead, it looks like this:
'Date','Artist','SongName','acousticness','analysis_url','danceability','duration_ms'
'0','8/8/2019','Post Animal', 'When I Get Home', '0.0665', 'https://api.spotify.com/v1/audio-analysis/5azJUob8ahbXB3M9YFwTpd', '0.245','323200','8/8/2019','BROCKHAMPTON', 'I BEEN BORN AGAIN', '0.324', 'https://api.spotify.com/v1/audio-analysis/5fiR9Dy9hNXEPZOLo1kyNb', '0.73', '219720'
So basically, tacking what's supposed to be the new line onto the previous line. It doesn't add another "Index" number, either. Any insight would be greatly appreciated!
I am stuck setting up Python and the dedupe library from dedupe.io to deduplicate a set of entries in a Postgres database. The error is "Records do not line up with data model", which should be easy to solve, but I just do not get why I am getting this message.
What I have now (code focused on the problem, other functions removed):
# ## Setup
settings_file = 'lead_dedupe_settings'
training_file = 'lead_dedupe_training.json'
start_time = time.time()
...

def training():
    # We'll be using variations on this following select statement to pull
    # in campaign donor info.
    #
    # We did a fair amount of preprocessing of the fields in

    """ Define Lead Query """
    sql = "select id, phone, mobilephone, postalcode, email from dev_manuel.somedata"

    # ## Training
    if os.path.exists(settings_file):
        print('reading from ', settings_file)
        with open(settings_file, 'rb') as sf:
            deduper = dedupe.StaticDedupe(sf, num_cores=4)
    else:
        # Define the fields dedupe will pay attention to
        #
        # The address, city, and zip fields are often missing, so we'll
        # tell dedupe that, and we'll learn a model that takes that into
        # account
        fields = [
            {'field': 'id', 'type': 'ShortString'},
            {'field': 'phone', 'type': 'String', 'has missing': True},
            {'field': 'mobilephone', 'type': 'String', 'has missing': True},
            {'field': 'postalcode', 'type': 'ShortString', 'has missing': True},
            {'field': 'email', 'type': 'String', 'has missing': True}
        ]

        # Create a new deduper object and pass our data model to it.
        deduper = dedupe.Dedupe(fields, num_cores=4)

        # connect to db and execute
        conn = None
        try:
            # read the connection parameters
            params = config()
            # connect to the PostgreSQL server
            conn = psycopg2.connect(**params)
            print('Connecting to the PostgreSQL database...')
            cur = conn.cursor()
            # execute sql
            cur.execute(sql)
            temp_d = dict((i, row) for i, row in enumerate(cur))
            print(temp_d)
            deduper.sample(temp_d, 10000)
            print('Done stage 1')
            del temp_d
            # close communication with the PostgreSQL database server
            cur.close()
        except (Exception, psycopg2.DatabaseError) as error:
            print(error)
        finally:
            if conn is not None:
                conn.close()
                print('Closed Connection')

        # If we have training data saved from a previous run of dedupe,
        # look for it and load it in.
        #
        # __Note:__ if you want to train from
        # scratch, delete the training_file
        if os.path.exists(training_file):
            print('reading labeled examples from ', training_file)
            with open(training_file) as tf:
                deduper.readTraining(tf)

        # ## Active learning
        print('starting active labeling...')
        # Starts the training loop. Dedupe will find the next pair of records
        # it is least certain about and ask you to label them as duplicates
        # or not.

        # debug
        print(deduper)
        # vars(deduper)

        # use 'y', 'n' and 'u' keys to flag duplicates
        # press 'f' when you are finished
        dedupe.convenience.consoleLabel(deduper)

        # When finished, save our labeled, training pairs to disk
        with open(training_file, 'w') as tf:
            deduper.writeTraining(tf)

        # Notice our argument here
        #
        # `recall` is the proportion of true dupe pairs that the learned
        # rules must cover. You may want to reduce this if you are making
        # too many blocks and too many comparisons.
        deduper.train(recall=0.90)

        with open(settings_file, 'wb') as sf:
            deduper.writeSettings(sf)

        # We can now remove some of the memory hogging objects we used
        # for training
        deduper.cleanupTraining()
The error message is "Records do not line up with data model. The field 'id' is in data_model but not in a record". As you can see, I am defining 5 fields to be "learned". The query I am using returns me exactly these 5 columns with the data in it.
The output of
print(temp_d)
is
{0: ('00Q1o00000OjmQmEAJ', '+4955555555', None, '01561', None), 1: ('00Q1o00000JhgSUEAZ', None, '+4915555555', '27729', 'email#aemail.de')}
Which looks to me like valid input for the dedupe library.
What I tried
I checked whether a training file from a previous run had already been written and was somehow being read and used; this is not the case (the code would even say so).
I tried debugging the "deduper" object, where the definitions of the fields and such go in; I can see the field definitions there.
I looked at other examples, like the CSV or MySQL ones, which do pretty much the same thing I do.
Please point me in the direction where I am wrong.
It looks like the issue may be that your temp_d is a dictionary of tuples, as opposed to the expected input of a dictionary of dictionaries. I just started working with this package and found an example here which works for my purposes, which provides this function for setting up the dictionary albeit from a csv instead of the data pull you have in yours.
data_d = {}

with open(filename) as f:
    reader = csv.DictReader(f)
    for row in reader:
        clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
        row_id = int(row['Id'])
        data_d[row_id] = dict(clean_row)

return data_d
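For the database case in the question, a rough sketch of the same idea (building a dictionary of dictionaries straight from the psycopg2 cursor, assuming the same SELECT as above) could be:
# Build {row_number: {column_name: value}} instead of {row_number: tuple}.
# cur.description gives the column names of the last executed SELECT.
col_names = [desc[0] for desc in cur.description]
temp_d = {i: dict(zip(col_names, row)) for i, row in enumerate(cur)}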
I made an improvement to my code according to this suggestion from @paultrmbrth. What I need is to scrape data from pages similar to this and this one, and I want the CSV output to look like the picture below.
But my code's CSV output is a little messy, like this:
I have two questions. Is there any way the CSV output can look like the first picture? My second question is that I want the movie title to be scraped too. Please give me a hint or provide code I can use to scrape the movie title and the contents.
UPDATE
The problem has been solved by Tarun Lalwani perfectly. But now the CSV file's header only contains the first scraped URL's categories. For example, when I try to scrape this webpage, which has the References, Referenced in, Features, Featured in and Spoofed in categories, and this webpage, which has the Follows, Followed by, Edited from, Edited into, Spin-off, References, Referenced in, Features, Featured in, Spoofs and Spoofed in categories, the output CSV file header only contains the first webpage's categories (References, Referenced in, Features, Featured in and Spoofed in). So some categories from the 2nd webpage, like Follows, Followed by, Edited from, Edited into and Spoofs, will not be in the output CSV file header, and neither will their contents.
Here is the code i used:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/title/tt0093777/trivia?tab=mc&ref_=tt_trv_cnn',
        'http://www.imdb.com/title/tt0096874/trivia?tab=mc&ref_=tt_trv_cnn',
    )

    def parse(self, response):
        item = {}
        for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
            item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
            key = h4.xpath('normalize-space()').get().strip()
            if key in ['Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']:
                values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', cnt=cnt).xpath(
                    'string(.//a)').getall(),
                item[key] = values
        yield item
and here is exporters.py file:
try:
    from itertools import zip_longest as zip_longest
except:
    from itertools import izip_longest as zip_longest

from scrapy.exporters import CsvItemExporter
from scrapy.conf import settings


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow([unicode(s).encode("utf-8") for s in row])
What I'm trying to achieve is for all of these categories to be in the CSV output header:
'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from', 'Features'
Any help would be appreciated.
You can extract the title using the code below:
item = {}
item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
For the CSV part, you would need to create a feed exporter which can split each item into multiple rows:
from itertools import zip_longest

from scrapy.contrib.exporter import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow(row)
Then you need to assign the feed exporter in your settings
FEED_EXPORTERS = {
    'csv': '<yourproject>.exporters.NewLineRowCsvItemExporter',
}
This assumes you put the code in the exporters.py file. The output will be as desired.
Edit-1
To set the fields and their order you will need to define FEED_EXPORT_FIELDS in your settings.py
FEED_EXPORT_FIELDS = ['Title', 'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                      'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                      'Features']
https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS
To set up the CSV data format, one of the easiest ways is to clean the data using Excel Power Query. Follow these steps:
1: Open the CSV file in Excel.
2: Select all values using Ctrl+A.
3: Then click Table on the Insert tab and create a table.
4: After creating the table, click Data in the top menu and select From Table.
5: Now a new Power Query window opens.
6: Select any column and click Split Column.
7: From Split Column, select By Delimiter.
8: Now select a delimiter such as comma, space, etc.
9: As the final step, select the advanced option, which offers two choices: split into rows or into columns.
10: You can do all kinds of data cleaning using these Power Queries; this is the easiest way to set up the data format according to your needs.