Using Python, I'm trying to create a summary from existing CSV data and am having difficulty extracting data from one of the cells.
The input CSV file:
I want to include only the city name and the file path from the Info 4 column, and I expect the summary to look like AlexxxxxyyyyzzzzzNewyork\Folder1\Folder2\Test.txt
The code:
csv_data_out[csv_line_out].append(conten['Name'])
csv_data_out[csv_line_out].append(conten['info 1'])
csv_data_out[csv_line_out].append(conten['info 2'])
csv_data_out[csv_line_out].append(conten['info 3'])
csv_data_out[csv_line_out].append(conten['info 4'])
csv_summary = "".join(csv_data_out[csv_line_out])
with open(outputfile, 'w', newline='') as newfile:
    writer = csv.writer(newfile, delimiter=';')
    writer.writerow(csv_columns_out[:])
    writer.writerows(csv_data_out)
Any idea how to fetch only the required details from the Info 4 column?
Essentially you have a CSV inside a CSV. There isn't enough info posted to give a fully complete answer, but here's most of it.
You can take a string and process it as a CSV using io.StringIO (or io.BytesIO if it's a byte string).
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import csv
from io import StringIO

# Create somewhere to put the inputs in case needed later
stored_items = []

with open('data.csv', 'r') as csvfile:
    inputs = csv.reader(csvfile)
    # Skip the header row
    next(inputs)
    for row in inputs:
        # Extract the Info 4 column for processing
        f = StringIO(row[4])
        string_file = csv.reader(f, quotechar='"')
        build_string = ""
        for string_row in string_file:
            build_string = f"{string_row[0]}{string_row[1]}"
        # Merge everything into a summary
        summary_string = f"{row[0]}{row[1]}{row[2]}{row[3]}{build_string}"
        # Add all the data back to storage
        stored_items.append((row[0], row[1], row[2], row[3], row[4], summary_string))
        print(summary_string)
The reason I say there's not enough information posted is because, for example: will the location always be (a), which could have a fixed-text replacement, or will it be conditional, e.g. it could be (a) or (b), in which case it would possibly require regex? (My preference is not to use regex unless absolutely necessary.) Also, is it always the first two terms you're after from Info 4, or will the terms be found in different places in the text? Without seeing more samples of the data it's impossible to answer definitively.
I have two CSV files that have been renamed to text files. I need to compare a column in each one (a date) to confirm they have been updated.
For example, c:\temp\oldfile.txt has 6 columns and the last one is called version. I need to make sure that c:\temp\newfile.txt has a different value for version. It doesn't need to do any date verification of any kind, as long as the comparison sees that they're different, it can proceed. If possible, I would prefer to stick with 'standard' libraries as I'm just learning and don't want to start creating dictionaries and learning pandas and numpy just yet.
Edit
Here's a copy of oldfile.txt and newfile.txt.
oldfile.txt:
feed_publisher_name,feed_publisher_url,feed_lang,feed_start_date,feed_end_date,feed_version
MyStuff,http://www.mystuff.com,en,20220103,20220417,22APR_20220401
newfile.txt:
feed_publisher_name,feed_publisher_url,feed_lang,feed_start_date,feed_end_date,feed_version
MyStuff,http://www.mystuff.com,en,20220103,20220417,22APR_20220414
In this case the comparison would note that the last column has a different value and would know to proceed with the rest of the script. Otherwise, if the values are the same, it will know that it was not updated and I'll have the program exit.
You can do it by using the csv module in the standard library since that's the format of your files.
import csv

with open('oldfile.txt', 'r', newline='') as oldfile, \
        open('newfile.txt', 'r', newline='') as newfile:
    old_reader = csv.DictReader(oldfile)
    new_reader = csv.DictReader(newfile)
    old_row = next(old_reader)
    new_row = next(new_reader)
    same = old_row['feed_version'] == new_row['feed_version']
    print(f"The files are {'the same' if same else 'different'}.")
If you are only interested in checking whether the two files are equal (essentially "updated"), you can compute the hash of one file and compare it with the hash of the other.
To compute a hash (for example, SHA-256), you can use the following function:
import hashlib

def sha256sum(filename):
    # Read the whole file and hash its bytes
    with open(filename, 'rb') as file:
        content = file.read()
    hasher = hashlib.sha256()
    hasher.update(content)
    return hasher.hexdigest()
hashlib is part of the standard library, so it is available in a default installation.
For example, if you write "v1.0" in a text document, the hasher function will give "fa8b919c909d5eb9e373d090928170eb0e7936ac20ccf413332b96520903168e"
If you later change it to "v1.1", the hasher function will give "eb79768c42dbbf9f10733e525a06ea9eb08f28b7b8edf9c6dcacb63940aedcb0".
These are two different hex digests, so the two files are different.
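Putting this together for the question's scenario, a minimal sketch (it writes two small sample files first, purely to make the example self-contained):

```python
import hashlib

def sha256sum(filename):
    # Read the whole file and hash its bytes
    with open(filename, 'rb') as file:
        return hashlib.sha256(file.read()).hexdigest()

# Stand-ins for oldfile.txt / newfile.txt with different feed_version values
with open('oldfile.txt', 'w') as f:
    f.write("MyStuff,en,22APR_20220401\n")
with open('newfile.txt', 'w') as f:
    f.write("MyStuff,en,22APR_20220414\n")

updated = sha256sum('oldfile.txt') != sha256sum('newfile.txt')
print("Updated!" if updated else "Not updated; exiting.")
```

Note this compares the whole file, so any change at all (not just feed_version) counts as "updated".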
Reading the file-
We don't need any libraries for this; just open the file, read it, then do a little parsing:
a, b = "", ""  # set the globals for the comparison

with open("c:/temp/oldfile.txt") as f:  # open the file as f
    text = f.read().split('\n')[1]  # read the contents and take just the second line
    a = text.split(',')[5]  # split the line on ',' into a list, then take the 6th element
Then opening the other one:
with open("c:/temp/newfile.txt") as f:
    text = f.read().split('\n')[1]
    b = text.split(',')[5]
Comparing the lines-
if a == b:
    print("The date is the same!")
else:
    print("The date is different...")
Of course, you can turn this into a function that returns whether or not they're equal, then use that value to decide how the program proceeds.
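Wrapped up as a function along those lines (a sketch; the name versions_differ and the hard-coded column index 5 are assumptions based on the sample files):

```python
def versions_differ(old_path, new_path):
    # Read the 6th field (feed_version) of the second line of each file
    with open(old_path) as f:
        old_version = f.read().split('\n')[1].split(',')[5]
    with open(new_path) as f:
        new_version = f.read().split('\n')[1].split(',')[5]
    return old_version != new_version
```

The caller can then exit early, e.g. `if not versions_differ(old, new): sys.exit()`.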
Hope this helps!
I've got a bunch of .dcm files (DICOM files) from which I would like to extract the header and save the information in a CSV file.
As you can see in the following picture, I've got a problem with the delimiters:
For example when looking at the second line in the picture: I'd like to split it like this:
0002 | 0000 | File Meta Information Group Length | UL | 174
But as you can see, not only are there multiple delimiters, but sometimes ' ' is a delimiter and sometimes it isn't. Also, the length of the 3rd column varies, so sometimes there is only shorter text there, e.g. Image Type further down in the picture.
Does anyone have a clever idea, how to write it in a CSV file?
I use pydicom to read and display the files in my IDE.
I'd be very thankful for any advice :)
I would suggest going back to the data elements themselves and working from there, rather than from the string output (which is really meant for exploring in interactive sessions).
The following code should work for a dataset with no Sequences; it would need some modification to work with sequences:
import csv
import pydicom
from pydicom.data import get_testdata_file

filename = get_testdata_file("CT_small.dcm")  # substitute your own filename here
ds = pydicom.dcmread(filename)

with open('my.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow("Group Elem Description VR value".split())
    for elem in ds:
        writer.writerow([
            f"{elem.tag.group:04X}", f"{elem.tag.element:04X}",
            elem.description(), elem.VR, str(elem.value)
        ])
It may also require a bit of tweaking to make the elem.value part look the way you want, or you may want to set the CSV writer to use quotes around items, etc.
Output looks like:
Group,Elem,Description,VR,value
0008,0005,Specific Character Set,CS,ISO_IR 100
0008,0008,Image Type,CS,"['ORIGINAL', 'PRIMARY', 'AXIAL']"
0008,0012,Instance Creation Date,DA,20040119
0008,0013,Instance Creation Time,TM,072731
...
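As one example of that quoting tweak, it is a one-argument change to the writer; sketched here with an in-memory StringIO buffer and one row from the sample output rather than a real dataset:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_ALL wraps every field in quote characters
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["0008", "0005", "Specific Character Set", "CS", "ISO_IR 100"])
print(buf.getvalue())
# "0008","0005","Specific Character Set","CS","ISO_IR 100"
```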
I am new to Python and am attempting to process some XML into CSV files for later diff validation against the output from a database. The code I have below does a good job of taking the 'tct-id' attributes from the XML and outputting them in a nice column under the heading 'DocumentID', as I need for my validation.
However, the outputs from the database just come as numbers, whereas the outputs from this code include the version number of the XML ID; for example
tct-id="D-TW-0010054;3;"
where I need the ;3; removed so I can validate properly.
This is the code I have; is there any way I can rewrite it so that it pre-processes the XML attributes to remove that suffix, e.g. takes only the first 12 characters from each attribute and writes those to the CSV?
from lxml import etree
import csv

xml_fname = 'example.xml'
csv_fname = 'output.csv'
fields = ['tct-id']

xml = etree.parse(xml_fname)
with open(csv_fname, 'w', newline='') as outfile:
    w = csv.DictWriter(outfile, fields, delimiter=';', extrasaction="ignore")
    wtr = csv.writer(outfile)
    wtr.writerow(["DocumentID"])
    for node in xml.xpath("//*[self::StringVariables or self::ElementVariables or self::PubInfo or self::Preface or self::DocHistory or self::Glossary or self::StatusInfo or self::Chapter]"):
        atts = node.attrib
        atts["elm_name"] = node.tag
        w.writerow(node.attrib)
All help is very appreciated.
Assuming you'll only have one ;3;-style string to remove from the tct-id, you can use regular expressions:
import re

tct_id = "D-TW-0010054;3;"
to_rem = re.findall(r'(;.*;)', tct_id)[0]
tct_id = tct_id.replace(to_rem, '')
Note I'm using tct_id instead of tct-id, as Python doesn't allow hyphens in variable names.
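If the document ID is always everything before the first semicolon, plain string splitting is a simpler alternative that avoids regex entirely:

```python
tct_id = "D-TW-0010054;3;"
doc_id = tct_id.split(';')[0]
print(doc_id)  # D-TW-0010054
```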
I have a log record like this (millions of rows):
previous_status>SERVICE</previous_status><reason>1</>device_id>SENSORS</device_id><DEVICE>ISCS</device_type><status>OK
I would like to extract all the words in capitals into individual columns in Excel using Python, to look like this:
SERVICE SENSORS DEVICE
As per the comments from #peter-wood, it isn't clear what your input is. However, assuming that your input is as you posted, then here is a minimal solution that works off the given structure. If it is not quite right, you should be able to easily change it to search on whatever is really your structure.
import csv

# You need to change this path.
lines = [row.strip() for row in open('/path/to/log.txt').readlines()]

# You need to change this path to where you want to write the file.
with open('/path/to/write/to/mydata.csv', 'w') as fh:
    # If you want a different delimiter, like tabs '\t', change it here.
    writer = csv.writer(fh, delimiter=',')
    for l in lines:
        # You can cut and paste the tokens that start and stop the pieces you are looking for here.
        service = l[l.find('previous_status>') + len('previous_status>'):l.find('</previous_status')]
        sensor = l[l.find('device_id>') + len('device_id>'):l.find('</device_id>')]
        device = l[l.find('<DEVICE>') + len('<DEVICE>'):l.find('</device_type>')]
        writer.writerow([service, sensor, device])
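Since the question asks for "all the words in capitals", a regex is another option, though on this sample it also picks up the uppercase tag name DEVICE, so the position-based slicing above is more precise:

```python
import re

line = "previous_status>SERVICE</previous_status><reason>1</>device_id>SENSORS</device_id><DEVICE>ISCS</device_type><status>OK"
# Find every run of two or more consecutive capital letters
caps = re.findall(r'[A-Z]{2,}', line)
print(caps)  # ['SERVICE', 'SENSORS', 'DEVICE', 'ISCS', 'OK']
```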
EDIT
The solution was to wrap the text in the column. This restores the original format.
I am trying to create a CSV using the csv module provided in Python. My issue is that when the CSV is created, the contents of the file inserted into the field lose their formatting.
Example input can be pulled from 'whois 8.8.8.8'. I want the field to hold the formatting from that input.
Is there a way to maintain the files original format within the cell?
#!/usr/bin/python
import sys
import csv

file1 = sys.argv[1]
file2 = sys.argv[2]

with open(file1) as myfile1, open(file2) as myfile2, \
        open('information.csv', 'w', newline='') as ofile:
    stuffwriter = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    stuffwriter.writerow([myfile1.read(), myfile2.read()])
Example Input(All In One Cell):
#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
#
# Query terms are ambiguous. The query is assumed to be:
# "n 8.8.8.8"
#
# Use "?" to get help.
#
#
# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=8.8.8.8?showDetails=true&showARIN=false&ext=netref2
#
Level 3 Communications, Inc. LVLT-ORG-8-8 (NET-8-0-0-0-1) 8.0.0.0 - 8.255.255.255
Google Incorporated LVLT-GOOGL-1-8-8-8 (NET-8-8-8-0-1) 8.8.8.0 - 8.8.8.255
#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
I would like the cell to hold the format above. Currently, when I open it in Excel, it is all on one line.
I am getting my data from executing:
whois 8.8.8.8 > inputData.txt
echo "8.8.8.8 - Google" > inputData2.txt
python CreateCSV inputData2.txt inputData.txt
This is what I would like to see:
http://www.2shared.com/photo/WZwDC7w2/Screen_Shot_2013-06-06_at_1231.html
This is what I'm seeing:
http://www.2shared.com/photo/9dRFGCxh/Screen_Shot_2013-06-06_at_1222.html
Convert the .CSV to an .XLSX:
1. In Excel, right-click the column with the data that lost its formatting
2. Select 'Format Cells...'
3. Select the 'Alignment' tab
4. Check 'Wrap Text'
All is good!
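The same wrap-text setting can also be applied programmatically when converting to .xlsx; sketched here with the third-party openpyxl package (not part of the standard library, so this goes beyond the question's "standard libraries only" preference):

```python
# Requires: pip install openpyxl
from openpyxl import Workbook
from openpyxl.styles import Alignment

wb = Workbook()
ws = wb.active
# A multi-line value, like the whois output, kept inside one cell
ws["A1"] = "line one\nline two\nline three"
# Wrap text so Excel shows the embedded newlines instead of one long line
ws["A1"].alignment = Alignment(wrap_text=True)
wb.save("information.xlsx")
```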