Parsing .txt files to a single .csv output

I am currently trying to parse 2 text files and produce a .csv output. One contains a list of path/file locations, and the other contains other info related to each path/file location.
1st text file contains (path.txt):
C:/Windows/System32/vssadmin.exe
C:/Users/Administrator/Desktop/google.com
2nd text file contains (filelist.txt):
-= List of files in hash: =-
$VAR1 = {
'File' => [
{
'RootkitInfo' => 'Normal',
'FileVersionLabel' => '6.1.7600.16385',
'ProductVersion' => '6.1.7601.17514',
'Path' => 'C:/Windows/System32/vssadmin.exe',
'Signer' => 'Microsoft Windows',
'Size' => '210944',
'SHA1' => 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
},
{
'RootkitInfo' => 'Normal',
'FileVersionLabel' => '6.1.7600.16385',
'ProductVersion' => '6.1.7601.17514',
'Path' => 'C:/Users/Administrator/Desktop/steam.exe',
'Signer' => 'Valve Inc.',
'Size' => '300944',
'SHA1' => 'cf23df2207d99a74fbe169e3eba035e633b65d94'
},
{
'RootkitInfo' => 'Normal',
'FileVersionLabel' => '6.1.7600.16385',
'ProductVersion' => '6.1.7601.17514',
'Path' => 'C:/Users/Administrator/Desktop/google.com',
'Signer' => 'Valve Inc.',
'Size' => '300944',
'SHA1' => 'cf23df2207d99a74fbe169e3eba035e633b78987'
},
.
.
.
]
}
How do I go about producing a .csv output containing the path of each file with its corresponding hash value? Also, how would I add additional columns/info corresponding to the path?
Sample table output:
<table>
<tr>
<th>File Path</th>
<th>Hash Value</th>
</tr>
<tr>
<td>C:/Windows/System32/vssadmin.exe</td>
<td>da39a3ee5e6b4b0d3255bfef95601890afd80709</td>
</tr>
<tr>
<td>C:/Users/Administrator/Desktop/google.com</td>
<td>cf23df2207d99a74fbe169e3eba035e633b78987</td>
</tr>
</table>

You could construct a regex pattern that matches what you are looking for:
pattern = r"""{.*?(C:/Windows/System32/vssadmin.exe).*?'SHA1' => '([^']*)'.*?}"""
To use it with multiple file names in a loop turn that pattern into a format string.
fmt = r"""{{.*?({}).*?'SHA1' => '([^']*)'.*?}}"""
Something like this:
import re
with open('filelist.txt') as f:
    s = f.read()
with open('path.txt') as f:
    for line in f:
        fname = line.strip()
        # re.escape keeps characters such as '.' in the path literal
        pattern = fmt.format(re.escape(fname))
        m = re.search(pattern, s, flags=re.DOTALL)
        if m:
            print(m.groups())
        else:
            print('no match for', fname)
It's a little inefficient (the whole file is searched once per path) and depends on the contents of the files being exactly as you showed them, including capitalization.
Or without regular expressions: iterate over the lines of filelist.txt; find the Path line; extract the path with a slice, see if it is a path from path.txt; find the very next SHA1 line; extract the hash with a slice. This relies on the position of the two lines relative to each other and the position of the characters in each line. This will probably be more efficient.
with open('path.txt') as f:
    fnames = set(line.strip() for line in f)
with open('filelist.txt') as f:
    for line in f:
        line = line.strip()
        if line.startswith("'Path'") and line[11:-2] in fnames:
            name = line[11:-2]
            while not line.startswith("'SHA1'"):
                line = next(f)
                line = line.strip()
            # the SHA1 line has no trailing comma in the sample, so strip the
            # closing quote (and any comma) instead of slicing with [11:-2]
            print((name, line[11:].rstrip("',")))
This one also assumes the text files are as you represented them.

To parse the second file (which is not really plain text; the $VAR1 = { ... } layout looks like a Perl data dump), you will need to restructure it so that it looks like a normal Python data structure. It's pretty close, and there are ways to coerce it into one:
import ast
contents = ""  # this will hold the read contents of that file
filestart = False
with open('filelist.txt') as fh:
    for line in fh:
        if not filestart and not line.startswith("$VAR"):
            continue
        elif line.startswith("$VAR"):
            contents += "{"  # start the dictionary
            filestart = True  # to kill the first if statement
        else:
            contents += line  # fill out with rest of file
# create dictionary, we use ast here because json will fail
result = ast.literal_eval(contents.replace("=>", ":"))
# {'File': [{'RootkitInfo': 'Normal', 'FileVersionLabel': '6.1.7600.16385', 'ProductVersion': '6.1.7601.17514', 'Path': 'C:/Windows/System32/vssadmin.exe', 'Signer': 'Microsoft Windows', 'Size': '210944', 'SHA1': 'da39a3ee5e6b4b0d3255bfef95601890afd80709'}, {'RootkitInfo': 'Normal', 'FileVersionLabel': '6.1.7600.16385', 'ProductVersion': '6.1.7601.17514', 'Path': 'C:/Users/Administrator/Desktop/steam.exe', 'Signer': 'Valve Inc.', 'Size': '300944', 'SHA1': 'cf23df2207d99a74fbe169e3eba035e633b65d94'}, {'RootkitInfo': 'Normal', 'FileVersionLabel': '6.1.7600.16385', 'ProductVersion': '6.1.7601.17514', 'Path': 'C:/Users/Administrator/Desktop/google.com', 'Signer': 'Valve Inc.', 'Size': '300944', 'SHA1': 'cf23df2207d99a74fbe169e3eba035e633b78987'}]}
files = result["File"] # get your list from here
Now that it's in a tolerable format, I'd convert it to a dict of file: hash key-value pairs for easy lookup against your other file
files_dict = {file['Path']: file['SHA1'] for file in files}
# now grab your other file, and lookups should be quite simple
with open("path.txt") as fh:
    results = [f"{filepath.strip()},{files_dict.get(filepath.strip())}" for filepath in fh]
# Now you can put that to a csv
with open("paths.csv", "w") as fh:
    fh.write('File Path,Hash Value\n')  # write the header on its own line
    fh.write('\n'.join(results) + '\n')
There are better ways to do this, but that could be left as an exercise to the reader
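One of those better ways might look like the sketch below, which assumes the entries have already been parsed into a list of dicts like `files`. The standard csv module takes care of quoting and line endings, and adding a column is just one more name in the list of fields. The names `records`, `wanted`, and `write_report`, the extra Signer column, and the made-up C:/missing.exe path are illustrative and not part of the original answer.

```python
import csv
import io

# Hypothetical records standing in for the parsed 'File' list
records = [
    {"Path": "C:/Windows/System32/vssadmin.exe", "Signer": "Microsoft Windows",
     "Size": "210944", "SHA1": "da39a3ee5e6b4b0d3255bfef95601890afd80709"},
    {"Path": "C:/Users/Administrator/Desktop/steam.exe", "Signer": "Valve Inc.",
     "Size": "300944", "SHA1": "cf23df2207d99a74fbe169e3eba035e633b65d94"},
]
# Stands in for the lines of path.txt; the second path is deliberately unknown
wanted = ["C:/Windows/System32/vssadmin.exe", "C:/missing.exe"]

def write_report(fh, records, wanted, fields=("Path", "SHA1", "Signer")):
    """Write one CSV row per wanted path; unknown paths get empty cells."""
    by_path = {rec["Path"]: rec for rec in records}
    # extrasaction="ignore" drops keys (such as Size) that are not in fields
    writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for path in wanted:
        # fall back to a row holding only the path when there is no match
        writer.writerow(by_path.get(path, {"Path": path}))

buf = io.StringIO()  # swap for open("paths.csv", "w", newline="") to write a file
write_report(buf, records, wanted)
print(buf.getvalue())
```

To add or remove a column, change `fields`; the lookup and the writing loop stay the same.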

Related

AttributeError: 'dict' object has no attribute 'split'

I am trying to run this code where data of a dictionary is saved in a separate csv file.
Here is the dict:
body = {
'dont-ask-for-email': 0,
'action': 'submit_user_review',
'post_id': 76196,
'email': email_random(),
'subscribe': 1,
'previous_hosting_id': prev_hosting_comp_random(),
'fb_token': '',
'title': review_title_random(),
'summary': summary_random(),
'score_pricing': star_random(),
'score_userfriendly': star_random(),
'score_support': star_random(),
'score_features': star_random(),
'hosting_type': hosting_type_random(),
'author': name_random(),
'social_link': '',
'site': '',
'screenshot[image][]': '',
'screenshot[description][]': '',
'user_data_process_agreement': 1,
'user_email_popup': '',
'subscribe_popup': 1,
'email_asked': 1
}
Now this is the code to write in a CSV file and finally save it:
columns = []
rows = []
chunks = body.split('}')
for chunk in chunks:
    row = []
    if len(chunk)>1:
        entry = chunk.replace('{','').strip().split(',')
        for e in entry:
            item = e.strip().split(':')
            if len(item)==2:
                row.append(item[1])
                if chunks.index(chunk)==0:
                    columns.append(item[0])
        rows.append(row)
df = pd.DataFrame(rows, columns = columns)
df.head()
df.to_csv ('r3edata.csv', index = False, header = True)
but this is the error I get:
Traceback (most recent call last):
File "codeOffshoreupdated.py", line 125, in <module>
chunks = body.split('}')
AttributeError: 'dict' object has no attribute 'split'
I know that dict has no attribute named split but how do I fix it?
Edit:
format of the CSV I want:
dont-ask-for-email, action, post_id, email, subscribe, previous_hosting_id, fb_token, title, summary, score_pricing, score_userfriendly, score_support, score_features, hosting_type,author, social_link, site, screenshot[image][],screenshot[description][],user_data_process_agreement,user_email_popup,subscribe_popup,email_asked
0,'submit_user_review',76196,email_random(),1,prev_hosting_comp_random(),,review_title_random(),summary_random(),star_random(),star_random(),star_random(),star_random(),hosting_type_random(),name_random(),,,,,1,,1,1
Note: all these functions mentioned are return values
Edit2:
I am picking emails from the email_random() function like this:
def email_random():
    with open('emaillist.txt') as emails:
        read_emails = csv.reader(emails, delimiter = '\n')
        return random.choice(list(read_emails))[0]
and the emaillist.txt is like this:
xyz#gmail.com
xya#gmail.com
xyb#gmail.com
xyc#gmail.com
xyd#gmail.com
other functions are also picking the data from the files like this too.

Since body is a dictionary, you don't have to do any manual parsing to get it into a CSV format.
If you want the function calls (like email_random()) to be written into the CSV as such, you need to wrap them into quotes (as I have done below). If you want them to resolve as function calls and write the results, you can keep them as they are.
import csv
def email_random():
    return "john#example.com"
body = {
'dont-ask-for-email': 0,
'action': 'submit_user_review',
'post_id': 76196,
'email': email_random(),
'subscribe': 1,
'previous_hosting_id': "prev_hosting_comp_random()",
'fb_token': '',
'title': "review_title_random()",
'summary': "summary_random()",
'score_pricing': "star_random()",
'score_userfriendly': "star_random()",
'score_support': "star_random()",
'score_features': "star_random()",
'hosting_type': "hosting_type_random()",
'author': "name_random()",
'social_link': '',
'site': '',
'screenshot[image][]': '',
'screenshot[description][]': '',
'user_data_process_agreement': 1,
'user_email_popup': '',
'subscribe_popup': 1,
'email_asked': 1
}
with open('example.csv', 'w') as fhandle:
    writer = csv.writer(fhandle)
    items = body.items()
    writer.writerow([key for key, value in items])
    writer.writerow([value for key, value in items])
What we do here is:
with open('example.csv', 'w') as fhandle:
This opens a new file (named example.csv) with writing permissions ('w') and stores the reference in the variable fhandle. If the with statement is not familiar to you, you can learn more about it in PEP 343.
body.items() returns an iterable of (key, value) tuples. We store it once in items so that both rows are built from the same view and the keys and values line up. The contents look like [('dont-ask-for-email', 0), ('action', 'submit_user_review'), ...].
We then write all the keys to the first row using a list comprehension, and all the values to the next row.
This results in
dont-ask-for-email,action,post_id,email,subscribe,previous_hosting_id,fb_token,title,summary,score_pricing,score_userfriendly,score_support,score_features,hosting_type,author,social_link,site,screenshot[image][],screenshot[description][],user_data_process_agreement,user_email_popup,subscribe_popup,email_asked
0,submit_user_review,76196,john#example.com,1,prev_hosting_comp_random(),,review_title_random(),summary_random(),star_random(),star_random(),star_random(),star_random(),hosting_type_random(),name_random(),,,,,1,,1,1
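If the goal is several rows of generated data rather than one, keep in mind that the values in a dict literal are computed once, when the dict is created. A minimal sketch of one way around that is to build a fresh dict per row inside a function and hand the rows to csv.DictWriter. The names `make_body` and `write_rows`, the shortened set of keys, and the placeholder email list are assumptions made for illustration.

```python
import csv
import io
import random

EMAILS = ["a@example.com", "b@example.com"]  # placeholder for emaillist.txt

def make_body():
    """Build a new dict on every call so the random values are re-drawn."""
    return {
        "action": "submit_user_review",
        "post_id": 76196,
        "email": random.choice(EMAILS),
        "subscribe": 1,
    }

def write_rows(fh, count):
    """Write a header plus `count` generated rows (count must be at least 1)."""
    rows = [make_body() for _ in range(count)]
    # the keys of the first row become the column names
    writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()  # swap for open("example.csv", "w", newline="") to write a file
write_rows(buf, 3)
print(buf.getvalue())
```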

Python JSON append if value doesn't exist

I've got a JSON file with 30-ish blocks of "dicts" where every block has an ID, like this:
{
"ID": "23926695",
"webpage_url": "https://.com",
"logo_url": null,
"headline": "aewafs",
"application_deadline": "2020-03-31T23:59:59",
}
Since my script pulls information in the same way from an API more than once, I would like to append new "blocks" to the json file only if the ID doesn't already exist in the JSON file.
I've got something like this so far:
import os
import json
check_empty = os.stat('pbdb.json').st_size
if check_empty == 0:
    with open('pbdb.json', 'w') as f:
        f.write('[\n]') # Writes '[' then linebreaks with '\n' and writes ']'
output = json.load(open("pbdb.json"))
for i in jobs:
    output.append({
        'ID': job_id,
        'Title': jobtitle,
        'Employer' : company,
        'Employment type' : emptype,
        'Fulltime' : tid,
        'Deadline' : deadline,
        'Link' : webpage
    })
with open('pbdb.json', 'w') as job_data_file:
    json.dump(output, job_data_file)
but I would like to only do the "output.append" part if the ID doesn't exist in the Json file.

I am not able to complete the code you provided, but here is an example showing how you can get a list of jobs without duplicates (hopefully it helps):
import json
# suppose `data` is your input data with duplicate ids included
data = [{'id': 1, 'name': 'john'}, {'id': 1, 'name': 'mary'}, {'id': 2, 'name': 'george'}]
# a dict comprehension keyed on the id eliminates the duplicates (the last item seen for each id wins); calling `values()` on the dict then gives the results
noduplicate = list({itm['id']:itm for itm in data}.values())
with open('pbdb.json', 'w') as job_data_file:
    json.dump(noduplicate, job_data_file)
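The question asks for something slightly different: append a block only when its ID is not in the file yet, which keeps the first record seen for each ID. A minimal sketch of that idea, assuming the existing file content and the new API results are both lists of dicts with an "ID" key; the function name `merge_new_jobs` and the sample records are hypothetical.

```python
import json

def merge_new_jobs(existing, incoming, key="ID"):
    """Return existing plus the incoming records whose key is not present yet."""
    seen = {rec[key] for rec in existing}
    merged = list(existing)
    for rec in incoming:
        if rec[key] not in seen:
            merged.append(rec)
            seen.add(rec[key])  # also guards against duplicates within incoming
    return merged

# Hypothetical data: what is already in pbdb.json and what the API returned
existing = [{"ID": "23926695", "Title": "first"}]
incoming = [{"ID": "23926695", "Title": "duplicate"}, {"ID": "42", "Title": "new"}]

merged = merge_new_jobs(existing, incoming)
print(json.dumps(merged))
```

In the original script, `existing` would be the result of json.load on pbdb.json and `merged` would be what gets passed to json.dump.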

I'll just go with a database guys, thank you for your time, we can close this thread now

Dictionary from a String with particular structure

I am using Python 3 to read this file and convert it to a dictionary.
I have this string from a file and I would like to know how to create a dictionary from it.
[User]
Date=10/26/2003
Time=09:01:01 AM
User=teodor
UserText=Max Cor
UserTextUnicode=392039n9dj90j32
[System]
Type=Absolute
Dnumber=QS236
Software=1.1.1.2
BuildNr=0923875
Source=LAM
Column=OWKD
[Build]
StageX=12345
Spotter=2
ApertureX=0.0098743
ApertureY=0.2431899
ShiftXYZ=-4.234809e-002
[Text]
Text=Here is the Text files
DataBaseNumber=The database number is 918723
..... (There are more than 1000 lines per file) ...
On the text I have "Name=Something" and then I would like to convert it as follows:
{'Date':'10/26/2003',
'Time':'09:01:01 AM'
'User':'teodor'
'UserText':'Max Cor'
'UserTextUnicode':'392039n9dj90j32'.......}
The word between [ ] can be removed, like [User], [System], [Build], [Text], etc...
In some fields there is only the first part of the string:
[Colors]
Red=
Blue=
Yellow=
DarkBlue=

What you have is an ordinary properties file. In Java, you can use this example to read the values into a map:
try (InputStream input = new FileInputStream("your_file_path")) {
    Properties prop = new Properties();
    prop.load(input);
    // prop.getProperty("User") returns "teodor"
} catch (IOException ex) {
    ex.printStackTrace();
}
EDIT:
For a Python solution, refer to the answered question.
You can use configparser to read .ini or .properties files (the format you have).
import configparser
config = configparser.ConfigParser()
config.read('your_file_path')
# config['User'] == {'Date': '10/26/2003', 'Time': '09:01:01 AM'...}
# config['User']['User'] == 'teodor'
# config['System'] == {'Type': 'Absolute', ...}
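Since the question asks for a single flat dictionary with the section names dropped, the sections can be merged after parsing. The sketch below assumes key names do not repeat across sections (a later section would overwrite an earlier one) and uses a shortened inline sample in place of the real file. Note that configparser lower-cases keys by default; setting `optionxform = str` keeps them as written.

```python
import configparser

# Shortened stand-in for the real file
sample = """\
[User]
Date=10/26/2003
User=teodor
[Build]
ShiftXYZ=-4.234809e-002
[Colors]
Red=
"""

# interpolation=None so a literal '%' in a value is not treated specially
config = configparser.ConfigParser(interpolation=None)
config.optionxform = str  # keep 'Date' as 'Date' instead of 'date'
config.read_string(sample)  # use config.read('your_file_path') for a file

flat = {}
for section in config.sections():
    flat.update(config[section])  # later sections overwrite duplicate keys

print(flat)
```

Keys with nothing after the = sign, such as Red, come back as empty strings.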

This can easily be done in Python. Assuming your file is named test.txt.
This will also work for lines with nothing after the = as well as lines with multiple =.
d = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.strip() # Remove any space or newline characters
        parts = line.split('=') # Split around the `=`
        if len(parts) > 1:
            # re-join with '=' so a value that itself contains '=' is kept intact
            d[parts[0]] = '='.join(parts[1:])
print(d)
Output:
{
"Date": "10/26/2003",
"Time": "09:01:01 AM",
"User": "teodor",
"UserText": "Max Cor",
"UserTextUnicode": "392039n9dj90j32",
"Type": "Absolute",
"Dnumber": "QS236",
"Software": "1.1.1.2",
"BuildNr": "0923875",
"Source": "LAM",
"Column": "OWKD",
"StageX": "12345",
"Spotter": "2",
"ApertureX": "0.0098743",
"ApertureY": "0.2431899",
"ShiftXYZ": "-4.234809e-002",
"Text": "Here is the Text files",
"DataBaseNumber": "The database number is 918723"
}

I would suggest doing some cleaning first to get rid of the [] lines.
After that you can split the remaining lines on the "=" separator and build a dictionary from the resulting pairs.

PYTHON JSONtoDICT help needed - It appears python is interpreting my json-dictionary conversion as a list

The following code is giving me the error:
Traceback (most recent call last):
  File "AMZGetPendingOrders.py", line 66, in <module>
    item_list.append(item['SellerSKU'])
TypeError: string indices must be integers
The code:
from mws import mws
import time
import json
import xmltodict
access_key = 'xx' #replace with your access key
seller_id = 'yy' #replace with your seller id
secret_key = 'zz' #replace with your secret key
marketplace_usa = '00'
orders_api = mws.Orders(access_key, secret_key, seller_id)
orders = orders_api.list_orders(marketplaceids=[marketplace_usa], orderstatus=('Pending'), fulfillment_channels=('MFN'), created_after='2018-07-01')
#save as XML file
filename = 'c:order.xml'
with open(filename, 'w') as f:
    f.write(orders.original)
#ConvertXML to JSON
dictString = json.dumps(xmltodict.parse(orders.original))
#Write new JSON to file
with open("output.json", 'w') as f:
    f.write(dictString)
#Read JSON and parse our order number
with open('output.json', 'r') as jsonfile:
    data = json.load(jsonfile)
#initialize blank dictionary
id_list = []
for order in data['ListOrdersResponse']['ListOrdersResult']['Orders']['Order']:
    id_list.append(order['AmazonOrderId'])
#This "gets" the orderitem info - this code actually is similar to the initial Amazon "get" though it has fewer switches
orders_api = mws.Orders(access_key, secret_key, seller_id)
#opens and empties the orderitem.xml file
open('c:orderitem.xml', 'w').close()
#iterated through the list of AmazonOrderIds and writes the item information to orderitem.xml
for x in id_list:
    orders = orders_api.list_order_items(amazon_order_id = x)
    filename = 'c:orderitem.xml'
    with open(filename, 'a') as f:
        f.write(orders.original)
    #ConvertXML to JSON
    amz_items_pending = json.dumps(xmltodict.parse(orders.original))
    #Write new JSON to file
    with open("pending.json", 'w') as f:
        f.write(amz_items_pending)
    #read JSON and parse item_no and qty
    with open('pending.json', 'r') as jsonfile1:
        data1 = json.load(jsonfile1)
    #initialize blank dictionary
    item_list = []
    for item in data1['ListOrderItemsResponse']['ListOrderItemsResult']['OrderItems']['OrderItem']:
        item_list.append(item['SellerSKU'])
    #print(item)
    #print(id_list)
    #print(data1)
    #print(item_list)
    time.sleep(10)
I don't understand why Python thinks this is a list and not a dictionary. When I print id_list it looks like a dictionary (curly braces, single quotes, colons, etc)
print(data1) shows my dictionary
{
    'ListOrderItemsResponse':{
        '#xmlns':'https://mws.amazonservices.com/Orders/2013-09-01',
        'ListOrderItemsResult':{
            'OrderItems':{
                'OrderItem':{
                    'QuantityOrdered':'1',
                    'Title':'Delta Rothko Rolling Bicycle Stand',
                    'ConditionId':'New',
                    'IsGift':'false',
                    'ASIN':'B00XXXXTIK',
                    'SellerSKU':'9934638',
                    'OrderItemId':'49624373726506',
                    'ProductInfo':{
                        'NumberOfItems':'1'
                    },
                    'QuantityShipped':'0',
                    'ConditionSubtypeId':'New'
                }
            },
            'AmazonOrderId':'112-9XXXXXX-XXXXXXX'
        },
        'ResponseMetadata':{
            'RequestId':'8XXXXX8-0866-44a4-96f5-XXXXXXXXXXXX'
        }
    }
}
Any ideas?

Because OrderItem is a single dict here, your for loop iterates over its keys:
{'QuantityOrdered': '1', 'Title': 'Delta Rothko Rolling Bicycle Stand', 'ConditionId': 'New', 'IsGift': 'false', 'ASIN': 'B00XXXXTIK', 'SellerSKU': '9934638', 'OrderItemId': '49624373726506', 'ProductInfo': {'NumberOfItems': '1'}, 'QuantityShipped': '0', 'ConditionSubtypeId': 'New'}
So the first value of item is the string 'QuantityOrdered', and you are trying to index that string as if it were a dictionary.
You can just do:
item_list.append(data1['ListOrderItemsResponse']['ListOrderItemsResult']['OrderItems']['OrderItem']['SellerSKU'])
and avoid the for loop over the dictionary.

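I guess you are trying to iterate over the order items and collect their SellerSKU values. xmltodict gives you a single dict when an element occurs once and a list only when it repeats, so wrap the single case in a list before looping:
order_item = data1['ListOrderItemsResponse']['ListOrderItemsResult']['OrderItems']['OrderItem']
for item in (order_item if isinstance(order_item, list) else [order_item]):
    item_list.append(item['SellerSKU'])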

How to retrieve key value pair from a cfg file

I am new to python.
I have a config file as shown below in the same order. I need to retrieve key, value pairs from config file and will use those values in my script
# Name and details
(
{ group => 'abc',
host => 'pqr.com',
user => 'anonymous',
src => '/var/tmp',
dest => '/tmp',
},
{ group => 'abc',
host =>'pqr.com',
user => 'anonymous',
src => '/tmp'
dest => '/var/tmp'
},
{ group => 'pqr',
host =>'abc.com',
user => 'xyz',
src => '/home/pp',
dest => '/var/tmp',
},
{ group => 'xyz',
host =>'p.com',
user => 'x',
src => '/home/',
dest => '/tmp',
}
)
Each { } is considered one block. The group, user, and host values may be unique or repeated across blocks.
I have to read and parse the config file and display each key and value pair. Please help.
Key : group,Value : 'abc'(say)
key : host ,Value :'pqr.com'
Key : user, Value :'anonymous'
Key : src,Value :'/var/tmp',
key : dest,Value : '/tmp'
Thank you,
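I have written code which displays the keys and values, taking the cfg file (shown above) as input.
# config is assumed to hold the text of the cfg file
idx = 0
dictList = []
while True:
    try:
        start = config.index("{", idx)
        end = config.index("}", start+1)
        slice = config[start+1:end-1]
        sliceList = [s.strip() for s in slice.split(",") if s.strip()]
        dd = {}
        for item in sliceList:
            key, value = [s.strip() for s in item.split("=>")]
            print(key, value)
        idx = end + 1  # added so the loop moves on to the next block
    except ValueError:
        break  # added: index() raises ValueError when no block is left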
Output while displaying keys,values
key 'value'
group 'abc'
host 'pqr.com'
user 'anonymous'
src '/var/tmp'
Now the problem is how to display the value corresponding to a key.
E.g. print group should display abc,
print host should display pqr.com, and so on.

You'll probably need to parse it; here's a small example of how to do this.
import re
def parse(data):
    '''Parse data block, return iterator over the objects inside'''
    for block in re.finditer(r'\{[^}]*\}', data):  # Split to objects
        obj = {}
        # \s* around => because some entries have no space after the arrow
        for match in re.finditer(r"([a-z]+)\s*=>\s*'([^']*)'", block.group()):
            obj[match.group(1)] = match.group(2)
        yield obj
Now you have two problems :)
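To get from there to "print group should display abc", each parsed block can be used as an ordinary dict. The sketch below repeats a compact version of the regex approach so that it runs on its own; the shortened `sample` string stands in for the contents of the cfg file, and the variable names are illustrative.

```python
import re

def parse(data):
    """Yield one dict per { ... } block, mapping key names to quoted values."""
    for block in re.finditer(r"\{[^}]*\}", data):
        yield {
            m.group(1): m.group(2)
            for m in re.finditer(r"([a-z]+)\s*=>\s*'([^']*)'", block.group())
        }

# Shortened stand-in for the cfg file contents
sample = """(
{ group => 'abc',
host => 'pqr.com',
user => 'anonymous',
},
{ group => 'pqr',
host =>'abc.com',
user => 'xyz',
}
)"""

blocks = list(parse(sample))

# Look up a value by key in the first block
print(blocks[0]["group"])
print(blocks[0]["host"])

# Or collect the hosts of every block that belongs to a given group
hosts = [b["host"] for b in blocks if b["group"] == "pqr"]
print(hosts)
```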

Your data is a bit malformed to be interpreted directly by Python, so you have to pre-process it before interpreting it:
Change all occurrences of => to : with data.replace("=>", ":")
Quote all the keys: re.sub(r" (\w+) ", r"'\1'", data.replace("=>", ":"))
You can then feed the result to ast.literal_eval. This assumes every key is surrounded by single spaces and every entry in a block is followed by a comma.
import re, ast
ast.literal_eval(re.sub(r" (\w+) ", r"'\1'", data.replace("=>", ":")))

http://docs.python.org/library/configparser.html
You want to try that out for this.
But your config file would need to change to a more INI-like format:
[section]
key = value
http://deron.meranda.us/python/demjson/
demjson also is nice for python objects -> strings and back.
I tend to use these in this situation.
