I have a folder C:\test_xml containing a list of XML files. I want to get all the XML file names and store them in a CSV file, xml_file.csv. I am trying the Python code below but don't know how to proceed, as I am quite new to Python.
import os
import glob
# use a raw string so the backslash is not treated as an escape (\t is a tab)
files = list(glob.glob(os.path.join(r'C:\temp', '*.xml')))
print(files)
A way to get a list of only the filenames:
import pathlib
files = [file.name for file in pathlib.Path(r"C:\temp").glob("*.xml")]
The documentation for the csv module has some examples on how to write a .csv file
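Putting the two together, here is a minimal sketch (the function name is mine) that globs the folder for .xml files and writes one name per row with csv.writer:

```python
import csv
import pathlib

def write_names_to_csv(folder, out_csv):
    """Write the name of every .xml file in folder to out_csv, one per row."""
    # sorted() gives a stable order regardless of filesystem ordering
    names = sorted(p.name for p in pathlib.Path(folder).glob("*.xml"))
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        for name in names:
            writer.writerow([name])

# e.g. write_names_to_csv(r"C:\test_xml", "xml_file.csv")
```

The `newline=""` argument is what the csv module's documentation asks for when opening files for it, so it can handle newline translation itself.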
So I have working code that searches for the files I want in a directory. I can save the results as a dataframe or a list. I then figured out how to use findall() and find().text to get tags and text from a single XML file. I now want to make it dynamic over each position in that dataframe or list. I am getting the error "expected str, bytes, or os.PathLike object, not tuple." Where is my understanding off in this code? I am also not sure whether I am overcomplicating things with a dataframe and should just use a list.
.....
import os
import pandas as pd
import xml.etree.ElementTree as ET
.....
current_dur = r'Workplace Investing'
#empty data frame to put the file paths in.
file_results = []
#logic to search through the directories.
for root, dirs, files in os.walk(current_dur):
    for file in files:
        if file.endswith(('.ldm', '.cdm', '.pdm')):
            file_results.append(os.path.join(root, file))
filesdataframe = pd.DataFrame(file_results)
filesdataframe.rename(columns={0: 'Directory Path'}, inplace=True)
......
#In my working code I set the xmlfile equal to a single directory path.
#This worked for one file. Now my thinking is maybe I set xmlfile equal to dataframe.iterrows().
#That way when I make my loop it will go by each row.
xmlfile = next(filesdataframe.iterrows())
for ind in filesdataframe.index:
    # This next line is where I am getting my issue.
    tree = ET.parse(xmlfile)
    root = tree.getroot()
.....
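The error comes from `next(filesdataframe.iterrows())`, which returns an (index, Series) tuple, and ET.parse() expects a single path, hence "expected str, bytes, or os.PathLike object, not tuple." A plain list is indeed enough here; a minimal sketch (the function name is mine):

```python
import xml.etree.ElementTree as ET

def parse_all(paths):
    """Parse every XML path in a plain list and return the root elements."""
    roots = []
    for path in paths:            # each item is a single path string
        tree = ET.parse(path)     # parse() takes one path, not an (index, row) tuple
        roots.append(tree.getroot())
    return roots
```

If you prefer to keep the DataFrame, the equivalent loop is `for _, row in filesdataframe.iterrows(): tree = ET.parse(row['Directory Path'])`, unpacking the tuple instead of passing it to parse().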
I have about 20 CSV files that I need to read in. Is it possible to read in the whole folder instead of doing them individually? I am using Python. Thanks.
You can't. The fileinput module almost meets your needs, letting you treat a bunch of files as a single file, but it doesn't satisfy the csv module's requirements for file objects (namely, that newline translation be turned off). Just open the files one by one and append the parsed results to a single list; it's not much more effort. Whatever you do, something must "do them individually"; there is no magic that reads 20 files exactly as if they were one. Even handing the work off to cat or the like (to concatenate all the files into a single stream you can read from) just shunts the same file-by-file work elsewhere.
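A minimal sketch of the open-one-by-one approach described above (the function name is mine), assuming the files share a common layout:

```python
import csv

def read_all_csvs(paths):
    """Read several CSV files one by one into a single list of rows."""
    rows = []
    for path in paths:
        # newline="" is what the csv module asks for, so it can do its
        # own newline translation
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    return rows
```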
You can pull a list of files in Python by using os.listdir. From there, you can loop over your list of files, and generate a list of CSV files:
import os
filenames = os.listdir("path/to/directory/")
csv_files = []
for name in filenames:
    if name.endswith(".csv"):
        csv_files.append(name)
From there, you'll have a list containing every CSV in your directory.
The shortest thing I can think of is this. It is not a one-liner because you have to import a bunch of things, so that line would get rather long:
from os import listdir
from os.path import isfile
from os.path import splitext
from os.path import join
import pandas as pd
source = '/tmp/'
dfs = [
    pd.read_csv(join(source, path))
    for path in listdir(source)
    if isfile(join(source, path)) and splitext(path)[1] == '.csv'
]
Quite new to XML and the python-pptx module, I want to remove a single hyperlink that is present on every page.
My own attempt so far has been to retrieve my files, change them to zip format, and unzip them into separate folders.
I then locate the following attribute <a:hlinkClick r:id="RelId4"> and remove it, while also removing the Relationship entry within the .rels file which corresponds to this slide.
I then rezip and change the extension back to pptx, and this loads fine. I then tried to replicate this in Python so I can create an ongoing automation.
My attempt:
from pathlib import Path
import zipfile as zf
from pptx import Presentation
import re
import xml.etree.ElementTree as ET
path = 'mypath'
ppts = list(Path(path).glob('*.pptx'))
for file in ppts:
    file.rename(file.with_suffix('.zip'))

zip_files = list(Path(path).glob('*.zip'))
for zips in zip_files:
    with zf.ZipFile(zips, 'r') as zip_ref:
        zip_ref.extractall(Path(path).joinpath('zipFiles', zips.stem))
I then do some further filtering and end up with my XMLs from the rels folder and the ppt/slides folder.
It's here that I get stuck: I can read my XML with the ElementTree module, but I cannot find the relevant tag to remove.
for file in normal_xmls:
    tree = ET.parse(file).getroot()
    y = tree.findall('a')
    print(y)
This yields nothing. I tried to use the python-pptx module, but Action.Hyperlink doesn't seem to be a complete feature, unless I am misunderstanding the API.
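For the record, findall('a') yields nothing because a is a namespace prefix, not a tag name: <a:hlinkClick> lives in the DrawingML namespace, and ElementTree needs a prefix-to-URI mapping to match it. A self-contained sketch with a stand-in document:

```python
import xml.etree.ElementTree as ET

# "a" in <a:hlinkClick> is a namespace prefix; map it to the DrawingML
# namespace URI so ElementTree can match the qualified tag name.
NSMAP = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}

# Stand-in document with the same shape as a slide XML fragment.
xml = (
    '<root xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" '
    'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    '<a:hlinkClick r:id="RelId4"/>'
    "</root>"
)

tree = ET.fromstring(xml)
links = tree.findall("a:hlinkClick", NSMAP)   # plain findall("a") finds nothing
print(len(links))  # prints 1
```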
To remove a hyperlink from a shape (the kind where clicking on the shape navigates somewhere), set the hyperlink address to None:
shape.click_action.hyperlink.address = None
In one of my directories, I have multiple CSV files. I want to read the content of all the CSV files through a Python script and print the data, but so far I have not been able to do so.
All the CSV files have the same number of columns and the same column names as well.
I know a way to list all the CSV files in the directory and iterate over them with the "os" module and a "for" loop:
for files in os.listdir("C:\\Users\\AmiteshSahay\\Desktop\\test_csv"):
Then I use the "csv" module to read the file names:
    reader = csv.reader(files)
Till here I expect the output to be the names of the CSV files, which happen to be sorted; for example, the names are 1.csv, 2.csv, and so on. But the output is as below:
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
If I add the next() function after csv.reader(), I get the output below:
['1']
['2']
['3']
['4']
['5']
['6']
This happens to be the first characters of my CSV file names, which is partially correct but not fully.
Apart from this, once I have the files iterated, how do I see the contents of the CSV files on the screen? Today I have 6 files. Later on, I could have 100 files. So, it's not possible to use the file handling method in my scenario.
Any suggestions?
The easiest way I found while developing my project is to use pandas with read_csv and glob.
import glob
import pandas as pd

folder_name = 'train_dataset'
file_type = 'csv'
separator = ','

dataframe = pd.concat(
    (pd.read_csv(f, sep=separator) for f in glob.glob(folder_name + "/*." + file_type)),
    ignore_index=True,
)
Here, all the CSV files are loaded into one big dataframe.
I would recommend reading your CSVs using the pandas library.
Check this answer here: Import multiple csv files into pandas and concatenate into one DataFrame
Although you asked about Python in general, pandas does a great job at data I/O and would, in my opinion, help you here.
till here I expect the output to be the names of the CSV files
This is the problem. csv.reader objects do not represent filenames. They are lazy objects which may be iterated to yield rows from a CSV file. Note also that csv.reader must be given an open file object (or another iterable of lines), not a filename string. If you wish to print each entire CSV file, open it first and call list on the csv.reader object:
directory = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(directory):
    with open(os.path.join(directory, name), newline="") as f:
        reader = csv.reader(f)
        print(list(reader))
If I add the next() function after csv.reader(), I get the output below
Yes, this is to be expected. Calling next on an iterator gives you the next value that iterator produces. With a reader built on an open file, that is the first row of the file. For example:
from io import StringIO
import csv
some_file = StringIO("""1
2
3""")
with some_file as fin:
    reader = csv.reader(fin)
    print(next(reader))
['1']
which happens to be sorted; for example, the names are 1.csv, 2.csv, and so on.
Actually, this is neither a coincidence nor the file contents. In your code, csv.reader was given the filename string itself, not an open file object. Iterating over a string yields its characters one at a time, so next(reader) returns the first character of each filename as a one-field row: ['1'] from 1.csv, ['2'] from 2.csv, and so on. Once you open each file and pass the file object to csv.reader, next(reader) will return the first row of the file's contents instead.
Apart from this once I have the files iterated, how to see the
contents of the csv files on the screen?
Use the print command, as in the examples above.
Today I have 6 files. Later on, I could have 100 files. So, it's not
possible to use the file handling method in my scenario.
This is not true. You can define a function to print all or part of a CSV file, then call that function in a for loop with each filename as input.
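A sketch of that suggestion, using the directory path from the question (the function name is mine):

```python
import csv
import os

def print_csv(path):
    """Print every row of one CSV file."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            print(row)

# Directory from the question; guard so the loop is skipped if it is absent.
directory = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
if os.path.isdir(directory):
    for name in os.listdir(directory):
        if name.endswith(".csv"):
            print_csv(os.path.join(directory, name))
```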
If you want to import your files as separate dataframes, you can try this:
import pandas as pd
import os
filenames = os.listdir("../data/")  # lists all files in your directory

def extract_name_files(text):  # removes .csv from the name of each file
    # str.strip('.csv') removes any of the characters '.', 'c', 's', 'v'
    # from both ends, which can mangle names; slice off the suffix instead
    name_file = text[:-len('.csv')].lower()
    return name_file

names_of_files = list(map(extract_name_files, filenames))  # names for your dataframes

for i in range(len(names_of_files)):  # saves each csv in a dataframe structure
    exec(names_of_files[i] + " = pd.read_csv('../data/' + filenames[i])")
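A dict keyed by file stem avoids exec entirely and scales to any number of files; a sketch under the same ../data/ layout (the function name is mine):

```python
import os

import pandas as pd

def load_csvs(folder):
    """Map each CSV file's lowercased stem to its DataFrame."""
    frames = {}
    for name in os.listdir(folder):
        if name.endswith(".csv"):
            key = name[:-len(".csv")].lower()
            frames[key] = pd.read_csv(os.path.join(folder, name))
    return frames

# e.g. frames = load_csvs("../data/"); frames["users"]
```

Looking a dataframe up as `frames["users"]` instead of a bare variable also makes it easy to loop over all of them later.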
You can read and store several dataframes into separate variables using two lines of code.
import pandas as pd
datasets_list = ['users', 'calls', 'messages', 'internet', 'plans']
users, calls, messages, internet, plans = [
    pd.read_csv(f'datasets/{dataset_name}.csv') for dataset_name in datasets_list
]