Search in text file and save in Excel - python

I have a text file with information about 1000 students.
I need to save each student's details in an Excel sheet.
Here's a sample of the data:
0000:
name=Jack
Age=16
Grade=90
0001:
name=Max
Age=18
Grade=85
0002:
name=Kayle
Age=17
Grade=92
I want to have a result like this:

It's quite easy using pandas and a dict:
import pandas as pd

with open('file.txt', 'r') as f:
    lines = f.readlines()

students = []
student = {}
for line in lines:
    if ':' in line:
        student['id'] = line.split(':')[0]
    elif 'name' in line:
        student['Name'] = line.split('=')[1].replace('\n','')
    elif 'Age' in line:
        student['Age'] = line.split('=')[1].replace('\n','')
    elif 'Grade' in line:
        student['Grade'] = line.split('=')[1].replace('\n','')
        # Grade is the last field of a record, so store it and start a new one
        students.append(student)
        print(student)
        student = {}

df = pd.DataFrame(students)
df.to_excel("output.xlsx")
print(df)
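If some records might be missing a field, or the fields might come in a different order, a variant that starts a new record at each "NNNN:" header line may be more forgiving. This is only a sketch using the standard library; the function name parse_students is made up for illustration, and it assumes every record begins with a line ending in a colon:

```python
def parse_students(lines):
    """Group 'key=value' lines under the most recent 'NNNN:' header line."""
    students = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(':'):
            # A header such as "0000:" starts a new record.
            current = {'id': line[:-1]}
            students.append(current)
        elif '=' in line and current is not None:
            key, value = line.split('=', 1)
            # Normalise the key so "name" becomes "Name".
            current[key.strip().capitalize()] = value.strip()
    return students


# Sample in the question's format; the second record has no Grade on purpose.
sample = """0000:
name=Jack
Age=16
Grade=90
0001:
name=Max
Age=18
"""
records = parse_students(sample.splitlines())
```

The resulting list of dicts can be passed to pd.DataFrame(records) just like the students list above; a field that is missing from a record shows up as NaN in the DataFrame.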

I always use Word for such a job. With Replace, search for Paragraph Marks and replace them with a Tab-character.
E.g. replace :[paragraph mark][space][space][space][space]name= with a [tab character]. With that you get rid of all the rubbish and you end up with 0000[tab character]Jack.
When you're done with all lines of tab separated data, select all the lines with data (make sure not to select empty lines without the three tab-characters, otherwise it won't work) and click on Insert -> Table -> Insert Table... Now the data is converted into a Word table. Just copy the table to Excel and you're done.

Related

Python entire XML file to list and then into dataframe, missing most of the file

My final goal is to take each XML file and load its raw contents into Snowflake, and this is the result I have so far. For some reason, though, when I convert the list to a DataFrame, the DataFrame takes only a couple of items from the list for each file, not the entire 5000 rows of the XML.
My list data grabs all the contents from multiple files. In the list you can see the following:
Each list item is a NumPy array, and from the looks of it the elements are being split up.
from datetime import datetime
import glob

import numpy as np
import pandas as pd

dated = datetime.today().strftime('%Y-%m-%d')
source_dir = r'C:\Users\jSmith\.spyder-py3\SampleXML'
table_name = 'LV_XML'
file_list = glob.glob(source_dir + '/*.XML')
data = []
for file_path in file_list:
    data.append(
        np.genfromtxt(file_path,dtype='str',delimiter='|',encoding='utf-8')) #delimiter used to make sure it is not splitting based on spaces, might be the issue?
df = pd.DataFrame(list(zip(data)),
                  columns =['SRC_XML'])
df['SRC_XML']=df['SRC_XML'].astype(str)
df = df.replace(',','', regex=True)
df["TPR_AS_OF_DT"] = dated
The data frame has the following in each column:
Solution via Dave, with a small tweak:
for file_path in file_list:
    with open(file_path,'r') as afile:
        content = ''
        for aline in afile:
            content += aline.replace('\n',' ') # changed to replace for my needs
    data.append(content)
This puts each file's contents into a single string, ready to be inserted into the Snowflake table as one value for future queries.
Perhaps replace the file reading with this:
for file_path in file_list:
    with open(file_path,'r') as afile:
        content = ''
        for aline in afile:
            content += aline.strip('\n')
    data.append(content)
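The same idea can be written without the inner loop by reading each file in one go. A minimal sketch, assuming the files are UTF-8 text; the helper name read_files_as_rows and the newline_replacement parameter are made up for illustration, while the SRC_XML key mirrors the column name in the question:

```python
from pathlib import Path


def read_files_as_rows(paths, newline_replacement=" "):
    """Return one dict per file, holding the file's text as a single string."""
    rows = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        # Collapse line breaks so each file becomes one flat string.
        rows.append({"SRC_XML": text.replace("\n", newline_replacement)})
    return rows
```

Calling pd.DataFrame(read_files_as_rows(file_list)) should then give one row per file, with the whole document in the SRC_XML column.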

python csv file add to field based off another field

I have a CSV file that looks like this:
I have a column called “Inventory”. I pulled its data from another source, which put it in a dictionary format, as you can see.
What I need to do is iterate through the 1000+ lines: if the keywords comforter, sheets and pillow all exist, write “bedding” to the “Location” column for that row; otherwise write “home-fashions”.
I have been able to get the if statement to tell me whether a row goes into “bedding” or “home-fashions”; I just do not know how to write the corresponding result to the “Location” field for that line.
In my script I'm printing just to see my results, but in the end I want to write to the same CSV file.
from csv import DictReader
with open('test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for line in csv_dict_reader:
        if 'comforter' in line['Inventory'] and 'sheets' in line['Inventory'] and 'pillow' in line['Inventory']:
            print('Bedding')
            print(line['Inventory'])
        else:
            print('home-fashions')
            print(line['Inventory'])
The last column of your csv contains commas. You cannot read it using DictReader.
import re
data = []
with open('test.csv', 'r') as f:
    # Get the header row
    header = next(f).strip().split(',')
    for line in f:
        # Parse 4 columns
        row = re.findall('([^,]*),([^,]*),([^,]*),(.*)', line)[0]
        # Create a dictionary of one row
        item = {header[0]: row[0], header[1]: row[1], header[2]: row[2],
                header[3]: row[3]}
        # Add each row to the list
        data.append(item)
After preparing your data, you can check with your conditions.
for item in data:
    if all([x in item['Inventory'] for x in ['comforter', 'sheets', 'pillow']]):
        item['Location'] = 'Bedding'
    else:
        item['Location'] = 'home-fashions'
Write output to a file.
import csv
with open('output.csv', 'w') as f:
    dict_writer = csv.DictWriter(f, data[0].keys())
    dict_writer.writeheader()
    dict_writer.writerows(data)
csv.DictReader returns a dict, so just assign the new value to the column:
if 'comforter' in line['Inventory'] and ...:
    line['Location'] = 'Bedding'
else:
    line['Location'] = 'home-fashions'
print(line['Inventory'])
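Assigning to the row dict only changes it in memory, so the rows still have to be written back out. Here is a rough sketch of the full round trip with csv.DictReader and csv.DictWriter. It assumes the Inventory field is quoted in the file (so its commas do not split the column); the function name tag_locations and the sample rows are invented for illustration:

```python
import csv
import io

KEYWORDS = ("comforter", "sheets", "pillow")


def tag_locations(src, dst):
    """Copy CSV rows from src to dst, filling in the Location column."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames)
    if "Location" not in fieldnames:
        fieldnames.append("Location")
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        inventory = row["Inventory"]
        # All three keywords must be present for the row to count as bedding.
        if all(word in inventory for word in KEYWORDS):
            row["Location"] = "Bedding"
        else:
            row["Location"] = "home-fashions"
        writer.writerow(row)


# Hypothetical sample data; Inventory is quoted because it contains commas.
sample = io.StringIO("""Item,Location,Inventory
A1,,"{'comforter': 1, 'sheets': 2, 'pillow': 4}"
B2,,"{'towel': 3}"
""")
result = io.StringIO()
tag_locations(sample, result)
```

With real files, open the input with open('test.csv', newline='') and write to a second file, then replace the original once the write has finished; writing to the same file while it is still being read is likely to lose data.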

Read data from excel after a string matches

I want to read the entire row data and store it in variables, later use them in selenium to write it to webelements. Programming language is Python.
Example: I have an excel sheet of Incidents and their details regarding priority, date, assignee etc
If I give the string as INC00000 it should match the excel data, fetch all the above details and store it in separate variables like
INC #= INC0000 Priority= Moderate Date = 11/2/2020
Is this feasible? I tried and failed writing a code. Please suggest other possible ways to do this.
I would:
1. load the sheet into a pandas DataFrame
2. filter the corresponding column in the DataFrame by the INC # of interest
3. convert the row to a dictionary (assuming the INC filter produces only 1 row)
4. get the corresponding value in the dictionary to assign to the corresponding webelement
Example:
import pandas as pd
df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# assuming the incident numbers are in a column named "INC #" in the spreadsheet
dict_data = df[df['INC #'] == 'INC00000'].to_dict("records")[0]
webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
# ...and so on for the remaining web elements
Save your Excel file as CSV, then adapt the code below to your own variables:
Please find the dummy data image
import csv
# Set up input and output variables for the script
gTrack = open("file1.csv", "r")
# Set up CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)
id_index = header.index("id")
date_index = header.index("date ")
var1_index = header.index("var1")
var2_index = header.index("var2")
# Make an empty list
cList = []
# Loop through the lines in the file and get required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id,date,var1,var2])
# Print the matching rows
print(cList)
gTrack.close()
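If you would rather look values up by column name than by index, csv.DictReader gives you each row as a dict. A small sketch along those lines; the function name find_row and the sample rows are made up, and the column names are taken from the example in the question:

```python
import csv
import io


def find_row(csv_file, key_column, key_value):
    """Return the first row whose key_column equals key_value, or None."""
    for row in csv.DictReader(csv_file):
        if row[key_column] == key_value:
            return row
    return None


# Hypothetical data using the column names from the question.
sample = io.StringIO(
    "INC #,Priority,Date\n"
    "INC00000,Moderate,11/2/2020\n"
    "INC00001,High,11/3/2020\n"
)
incident = find_row(sample, "INC #", "INC00000")
```

Each detail is then available by name, for example incident["Priority"], and can be passed to send_keys on the matching web element.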

Python CSV writer

I have a csv that looks like this:
HA-MASTER,CategoryID
38231-S04-A00,14
39790-S10-A03,14
38231-S04-A00,15
39790-S10-A03,15
38231-S04-A00,16
39790-S10-A03,16
38231-S04-A00,17
39790-S10-A03,17
38231-S04-A00,18
39790-S10-A03,18
38231-S04-A00,19
39795-ST7-000,75
57019-SN7-000,75
38251-SV4-911,75
57119-SN7-003,75
57017-SV4-A02,75
39795-ST7-000,76
57019-SN7-000,76
38251-SV4-911,76
57119-SN7-003,76
57017-SV4-A02,76
What I would like to do is reformat this data so that there is only one line for each CategoryID, for example:
14,38231-S04-A00,39790-S10-A03
76,39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02
I have not found a way to accomplish this programmatically in Excel. I have over 100,000 lines. Is there a way to do something like this using Python's csv reader and writer?
Yes there is a way:
import csv

def addRowToDict(row):
    global myDict
    key=row[1]
    if key in myDict.keys():
        #append values if entry already exists
        myDict[key].append(row[0])
    else:
        #create entry
        myDict[key]=[row[1],row[0]]

global myDict
myDict=dict()
inFile='C:/Users/xxx/Desktop/pythons/test.csv'
outFile='C:/Users/xxx/Desktop/pythons/testOut.csv'
with open(inFile, 'r') as f:
    reader = csv.reader(f)
    ignore=True
    for row in reader:
        if ignore:
            #ignore first row
            ignore=False
        else:
            #add entry to dict
            addRowToDict(row)
with open(outFile,'w') as f:
    writer = csv.writer(f)
    #write everything to file
    writer.writerows(myDict.values())
Just edit inFile and outFile
This is pretty trivial using a dictionary of lists (Python 2.7 solution):
#!/usr/bin/env python
import fileinput
categories={}
for line in fileinput.input():
    # Skip the first line in the file (assuming it is a header).
    if fileinput.isfirstline():
        continue
    # Split the input line into two fields.
    ha_master, cat_id = line.strip().split(',')
    # If the given category id is NOT already in the dictionary
    # add a new empty list
    if not cat_id in categories:
        categories[cat_id]=[]
    # Append a new value to the category.
    categories[cat_id].append(ha_master)
# Iterate over all category IDs and lists. Use ','.join() to
# output a comma-separated list from a Python list.
for k,v in categories.iteritems():
    print '%s,%s' %(k,','.join(v))
I would read in the entire file, create a dictionary where the key is the ID and the value is a list of the other data.
data = {}
with open("test.csv", "r") as f:
    for line in f:
        temp = line.rstrip().split(',')
        if len(temp[0].split('-')) == 3: # => specific format that ignores the header...
            if temp[1] in data:
                data[temp[1]].append(temp[0])
            else:
                data[temp[1]] = [temp[0]]
with open("output.csv", "w+") as f:
    for id, datum in data.items():
        f.write("{},{}\n".format(id, ','.join(datum)))
Use pandas!
import pandas
csv_data = pandas.read_csv('path/to/csv/file')
use_this = csv_data.groupby('CategoryID')['HA-MASTER'].apply(list)
You will get a Series holding the list of HA-MASTER values for each CategoryID; now you just have to format it.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Cheers.
I see many beautiful answers have come up while I was trying it, but I'll post mine as well.
import re
csvIN = open('your csv file','r')
csvOUT = open('out.csv','w')
cat = dict()
for line in csvIN:
    line = line.rstrip()
    if not re.search('^[0-9]+',line): continue
    ham, cid = line.split(',')
    if cat.get(cid,False):
        cat[cid] = cat[cid] + ',' + ham
    else:
        cat[cid] = ham
for i in sorted(cat):
    csvOUT.write(i + ',' + cat[i] + '\n')
csvIN.close()
csvOUT.close()
Pandas approach:
import pandas as pd
df = pd.read_csv('data.csv')
#new = df.groupby('CategoryID')['HA-MASTER'].apply(lambda row: '%s' % ','.join(row))
new = df.groupby('CategoryID')['HA-MASTER'].agg(','.join)
new.to_csv('out.csv')
out.csv:
14,"38231-S04-A00,39790-S10-A03"
15,"38231-S04-A00,39790-S10-A03"
16,"38231-S04-A00,39790-S10-A03"
17,"38231-S04-A00,39790-S10-A03"
18,"38231-S04-A00,39790-S10-A03"
19,38231-S04-A00
75,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
76,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
This was an interesting question. My solution was to append each new item for a given key to a single string in the value, along with a comma to delimit the columns.
set_vals = {}
with open('Input01.csv') as input_file:
    file_lines = [item.strip() for item in input_file.readlines()]
for item in iter([i.split(',') for i in file_lines]):
    if item[1] in set_vals:
        set_vals[item[1]] = set_vals[item[1]] + ',' + item[0]
    else:
        set_vals[item[1]] = item[0]
with open('Results01.csv','w') as output_file:
    for i in sorted(set_vals.keys()):
        output_file.write('{},{}\n'.format(i, set_vals[i]))
MaxU's implementation, using pandas, has good potential and looks really elegant, but all the values are placed into one cell, because each of the joined strings is double-quoted. For example, the line for category '18' ("38231-S04-A00,39790-S10-A03") would place both values in the second column.
import csv
from collections import defaultdict
inpath = '' # Path to input CSV
outpath = '' # Path to output CSV
output = defaultdict(list) # To hold {category: [serial_numbers]}
for row in csv.DictReader(open(inpath)):
    output[row['CategoryID']].append(row['HA-MASTER'])
with open(outpath, 'w') as f:
    f.write('CategoryID,HA-MASTER\n')
    for category, serial_numbers in output.items():
        # Join the list so each serial number lands in its own column
        row = '%s,%s\n' % (category, ','.join(serial_numbers))
        f.write(row)

Inserting data into two columns of csv

My test1111.csv looks similar to this:
Sales #, Date, Tel Number, Comment
393ED3, 5/12/2010, 5555551212, left message
585E54, 6/15/2014, 5555551213, voice mail
585868, 8/16/2010, , number is 5555551214
I have the following code:
import re
import csv
from collections import defaultdict
# Below code places csv entries into dictionary so that they can be parsed
# by column. Then print statement prints Sales # column.
columns = defaultdict(list)
with open("c:\\test1111.csv", "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        for (k,v) in row.items():
            columns[k].append(v)
# To print all columns, use: print(columns)
# To print a specific column, use: print(columns['ST'])
# Below line takes list output and separates into new lines
sales1 = "\n".join(columns['Sales #'])
print(sales1)
# Below code searches all columns for a 10 digit number and outputs the
# results to a new csv file.
with open("c:\\test1111.csv", "r") as old, \
     open("c:\\results1111.csv", 'w') as new:
    for line in old:
        #Regex to match exactly 10 digits
        match = re.search(r'(?<!\d)\d{10}(?!\d)', line)
        if match:
            match1 = match.group()
            print(match1)
            new.writelines((match1) + '\n')
        else:
            nomatch = "No match"
            print(nomatch)
            new.writelines((nomatch) + '\n')
The first section of the code opens the original csv and prints all entries from the Sales # column to stdout with each entry in its own row.
The second section of the code opens the original csv and searches every row for a 10 digit number. When it finds one it writes each one (or no match) to each row of a new csv.
What I would like to now do is to also write the sales column data to the new csv. So ultimately, the sales column data would appear as rows in the first column and the regex data would appear as rows in the second column in the new csv. I have been having trouble getting that to work as the new.writelines won't take two arguments. Can someone please help me with how to accomplish this?
I would like the results1111.csv to look like this:
393ED3, 5555551212
585E54, 5555551213
585868, 5555551214
Starting with the second part of your code, all you need to do is concatenate the sales data within your writelines:
sales_list = sales1.split('\n')
# Below code searches all columns for a 10 digit number and outputs the
# results to a new csv file.
with open("c:\\test1111.csv", "r") as old, \
     open("c:\\results1111.csv", 'w') as new:
    next(old) # skip the header row so the counter lines up with sales_list
    i = 0 # counter to add the proper sales figure
    for line in old:
        #Regex to match exactly 10 digits
        match = re.search(r'(?<!\d)\d{10}(?!\d)', line)
        if match:
            match1 = match.group()
            print(match1)
            new.writelines(str(sales_list[i])+ ',' + (match1) + '\n')
        else:
            nomatch = "No match"
            print(nomatch)
            new.writelines(str(sales_list[i])+ ',' + (nomatch) + '\n')
        i += 1
Using the counter i, you can keep track of what row you're on and use that to add the corresponding sales column figure.
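An alternative that avoids the counter is to let the csv module parse each row, so the sales number and the regex match come from the same row. This is only a sketch: the function name extract_phone_rows and the id_column parameter are invented for illustration, and skipinitialspace=True is there because the sample file has a space after each comma:

```python
import csv
import io
import re

# Exactly 10 digits, not part of a longer run of digits.
TEN_DIGITS = re.compile(r"(?<!\d)\d{10}(?!\d)")


def extract_phone_rows(src, dst, id_column="Sales #"):
    """Write one 'id,phone' line per input row, searching every field."""
    reader = csv.DictReader(src, skipinitialspace=True)
    writer = csv.writer(dst)
    for row in reader:
        # Search the whole row, since the number may sit in the comment.
        text = ",".join(str(value) for value in row.values())
        match = TEN_DIGITS.search(text)
        writer.writerow([row[id_column], match.group() if match else "No match"])


sample = io.StringIO(
    "Sales #, Date, Tel Number, Comment\n"
    "393ED3, 5/12/2010, 5555551212, left message\n"
    "585E54, 6/15/2014, 5555551213, voice mail\n"
    "585868, 8/16/2010, , number is 5555551214\n"
)
result = io.StringIO()
extract_phone_rows(sample, result)
```

With real files, pass open("c:\\test1111.csv", newline="") and open("c:\\results1111.csv", "w", newline="") in place of the StringIO objects.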
Just to point out that in a CSV, unless the spaces are really needed, they shouldn't be there. Your data should look like this:
Sales #,Date,Tel Number,Comment
393ED3,5/12/2010,5555551212,left message
585E54,6/15/2014,5555551213,voice mail
585868,8/16/2010,,number is 5555551214
As another way of getting the same answer, you can use the pandas data analysis library for tasks involving data tables. It takes only two lines to do what you want:
>>> import pandas as pd
# Read data
>>> data = pd.read_csv('/tmp/in.cvs', index_col=0)
>>> data
Date Tel Number Comment
Sales#
393ED3 5/12/2010 5555551212 left message
585E54 6/15/2014 5555551213 voice mail
585868 8/16/2010 NaN number is 5555551214
# Write data
>>> data.to_csv('/tmp/out.cvs', columns=['Tel Number'], na_rep='No match')
That last line will write to out.cvs the column Tel Number inserting No match when no telephone number is found, exactly what you want. Output file:
Sales#,Tel Number
393ED3,5555551212.0
585E54,5555551213.0
585868,No match
