Read a csv file with multiple data sections into an addressable structure - python

I have made a csv file which looks like this:
Now, in my Python file I want it to take the data from the place column of the food field, which is only:
a
b
c
d
e
Then I want it to take only the taste data from the drink field, and so on.
My question is: how do I make a database-like structure that has "fields" (i.e. food/drinks) and, inside each field, lets me address the specific cells I described?

This question is pretty wide open, so I will show two possible ways to parse this data into a structure that can be accessed in the manner you described.
Solution #1
This code uses slightly more advanced Python and an external library. It wraps a generator around a csv reader so that the multiple sections of the data can be read efficiently. Each section is then placed into its own pandas.DataFrame, and each data frame is accessible from a dict.
The data can be accessed like:
ratings['food']['taste']
This will give a pandas.Series. A regular python list can be had with:
list(ratings['food']['taste'])
Code to read the data into pandas DataFrames using a generator:
import csv
import pandas as pd

def csv_record_reader(csv_reader):
    """ Read a csv reader iterator until a blank line is found. """
    prev_row_blank = True
    for row in csv_reader:
        row_blank = (row[0] == '')
        if not row_blank:
            yield row
            prev_row_blank = False
        elif not prev_row_blank:
            return

ratings = {}
ratings_reader = csv.reader(my_csv_data)
while True:
    category_row = list(csv_record_reader(ratings_reader))
    if len(category_row) == 0:
        break
    category = category_row[0][0]
    # get the generator for the data section
    data_generator = csv_record_reader(ratings_reader)
    # first row of data is the column names
    columns = next(data_generator)
    # use the rest of the data to build a data frame
    ratings[category] = pd.DataFrame(data_generator, columns=columns)
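As a small usage illustration (my addition, assuming the code above has run against the test data shown at the bottom of this answer), the place values the question asks for can then be pulled out like this:
# the 'place' column of the 'food' section, as a plain list
places = list(ratings['food']['place'])
# -> ['a', 'b', 'c', 'd', 'e']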
Solution #2
Here is a solution that reads the data into a dict. The data can be accessed with something like:
ratings['food']['taste']
Code to read CSV to dict:
import csv
from collections import namedtuple

ratings_reader = csv.reader(my_csv_data)
ratings = {}
need_category = True
need_header = True
for row in ratings_reader:
    if row[0] == '':
        if not (need_category or need_header):
            # this is the end of a data set
            need_category = True
            need_header = True
    elif need_category:
        # read the category (food, drink, ...)
        category = ratings[row[0]] = dict(rows=[])
        need_category = False
    elif need_header:
        # read the header (place, taste, ...)
        for key in row:
            category[key] = []
        DataEnum = namedtuple('DataEnum', row)
        need_header = False
    else:
        # read a row of data
        row_data = DataEnum(*row)
        category['rows'].append(row_data)
        for k, v in row_data._asdict().items():
            category[k].append(v)
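As a quick illustration (again assuming the test data below), the resulting dict can be used like this:
# per-column lists
print(ratings['food']['taste'])     # ['good', 'good', 'awesome', 'nice', 'ok']
# or whole rows as namedtuples
print(ratings['drink']['rows'][0])  # DataEnum(place='a', taste='good', day='1')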
Test Data:
my_csv_data = [x.strip() for x in """
food,,
,,
place,taste,day
a,good,1
b,good,2
c,awesome,3
d,nice,4
e,ok,5
,,
,,
,,
drink,,
,,
place,taste,day
a,good,1
b,good,2
c,awesome,3
d,nice,4
e,ok,5
""".split('\n')[1:-1]]
To read the data from a file:
with open('ratings_file.csv', newline='') as ratings_file:
    ratings_reader = csv.reader(ratings_file)
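Putting that together with Solution #1, a small sketch of a wrapper function (the name read_ratings is just illustrative) might look like:
import csv
import pandas as pd

def read_ratings(path):
    """Parse a sectioned ratings CSV into a dict of DataFrames."""
    ratings = {}
    with open(path, newline='') as ratings_file:
        ratings_reader = csv.reader(ratings_file)
        while True:
            # reuses csv_record_reader from Solution #1
            category_row = list(csv_record_reader(ratings_reader))
            if len(category_row) == 0:
                break
            category = category_row[0][0]
            data_generator = csv_record_reader(ratings_reader)
            # first data row holds the column names
            columns = next(data_generator)
            ratings[category] = pd.DataFrame(data_generator, columns=columns)
    return ratings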

Related

Is there a way to export a list of 100+ dataframes to excel?

So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
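For completeness (not stated in the original post): openpyxl only writes changes to disk when the workbook is saved, so any variation of the snippet above also needs something like:
# persist the appended rows back to the file
wb.save(outfile_path)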
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list; I will include the entire program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what counts as good Python practice and what doesn't.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing my lists that I will store looped data into
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read in the files to a dataframe, main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # adding to store where headers are stored in DF
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare, if str and matches search params, then do...
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store row number where it is in that data frame
                        column_list.append(name)  # store column number where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans where it's true if
    # there's something there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store actual dataframe values into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel like:
So the comment from Ashish really helped me. All of the dataframes had different column titles, so my 100+ dataframes eventually concat'd to a dataframe that is 569x52. Here is the code that I used; I completely abandoned openpyxl, because once I was able to concat all of the dataframes together I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.
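As a side note (my addition, not from the original post), the final concatenation loop can also be written as a single pd.concat call, which avoids repeatedly copying the growing frame:
# equivalent to the loop above: concatenate all collected frames at once
to_xlsx_df = pd.concat(main_df)
to_xlsx_df.to_excel(outfile_path)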

Read data from excel after a string matches

I want to read an entire row of data and store it in variables, then later use them in Selenium to write them to web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all the above details, and store them in separate variables like:
INC # = INC0000, Priority = Moderate, Date = 11/2/2020
Is this feasible? I tried writing code for it and failed. Please suggest possible ways to do this.
I would:
load the sheet into a pandas DataFrame
filter the corresponding column in the DataFrame by the INC # of interest
convert the row to dictionary (assuming the INC filter produces only 1 row)
get the corresponding value in the dictionary to assign to the corresponding webelement
Example:
import pandas as pd

df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# filter on the incident number of interest; assuming the INC numbers are in a
# column named "INC #" in the spreadsheet and the filter produces exactly one row
dict_data = df[df['INC #'] == 'INC00000'].to_dict("records")[0]
webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
# ... and so on for the remaining columns
Please find the code below and change the variables as per your data after saving your Excel file as CSV (see the dummy data image):
import csv

# Set up input and output variables for the script
gTrack = open("file1.csv", "r")

# Set up CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)

id_index = header.index("id")
date_index = header.index("date ")
var1_index = header.index("var1")
var2_index = header.index("var2")

# Make an empty list
cList = []

# Loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])

# Print the collected list
print(cList)

Copying the Matching Columns from CSV File

Input:
I have two csv files (file1.csv and file2.csv).
file1 looks like:
ID,Name,Gender
1,Smith,M
2,John,M
file2 looks like:
name,gender,city,id
Problem:
I want to compare the header of file1 with file2 and copy the data of the matching columns. The header in file1 needs to be converted to lowercase prior to finding the matching columns in file2.
Output:
the output should be like this:
name,gender,city,id # name,gender,and id are the only matching columns btw file1 and file2
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
I have tried the following code so far:
import csv

file1 = csv.DictReader(open("file1.csv"))  # reading file1.csv
file1_Dict = {}  # the dictionary of lists that will store the keys and values as lists
for row in file1:
    for column, value in row.items():
        file1_Dict.setdefault(column, []).append(value)
for key in list(file1_Dict):  # converting the keys of the dictionary to lowercase
    file1_Dict[key.lower()] = file1_Dict.pop(key)

file2 = open("file2.csv")  # reading file2.csv
file2_Dict = {}  # store the keys into a dictionary with empty values
for row2 in file2:
    row2 = row2.split(",")
    for i in row2:
        file2_Dict[i] = ""
Any idea how to solve this problem?
I had a crack at this problem using Python, without taking performance into consideration. Took me quite a while, phew!
This is my solution.
import csv

csv_data1_filepath = './file1.csv'
csv_data2_filepath = './file2.csv'


def main():
    # read both csv files into memory
    with open(csv_data1_filepath, 'r') as f1, open(csv_data2_filepath, 'r') as f2:
        data1 = list(csv.reader(f1))
        data2 = list(csv.reader(f2))

    file1_header = data1[0][:]  # get f1 header
    file2_header = data2[0][:]  # get f2 header

    lowered_file1_header = [item.lower() for item in file1_header]  # lowercase it
    lowered_file2_header = [item.lower() for item in file2_header]  # do it for header 2 anyway

    col_index_dict = {}
    for column in lowered_file1_header:
        if column in lowered_file2_header:
            col_index_dict[column] = lowered_file1_header.index(column)
        else:
            col_index_dict[column] = -1  # mark as column that will not be worked on later
    for column in lowered_file2_header:
        if column not in lowered_file1_header:
            col_index_dict[column] = -1  # mark as column that will not be worked on later

    # build header
    output = [list(col_index_dict.keys())]
    is_header = True
    for row in data1:
        if is_header is False:
            rowData = []
            for column in col_index_dict:
                column_index = col_index_dict[column]
                if column_index != -1:
                    rowData.append(row[column_index])
                else:
                    rowData.append('')
            output.append(rowData)
        else:
            is_header = False
    print(output)


if __name__ == '__main__':
    main()
This will give you the output of:
[
['gender', 'city', 'id', 'name'],
['M', '', '1', 'Smith'],
['M', '', '2', 'John']
]
Note that the output has lost its original column ordering, but this should be fixable by using an ordered dictionary instead.
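A rough sketch of that idea (my addition, using the same variable names as above; on Python 3.7+ a plain dict already preserves insertion order, so this mostly makes the intent explicit):
from collections import OrderedDict

# build col_index_dict in a fixed order: file1's columns first, then any
# file2-only columns, so the output header order is predictable
col_index_dict = OrderedDict()
for column in lowered_file1_header:
    if column in lowered_file2_header:
        col_index_dict[column] = lowered_file1_header.index(column)
    else:
        col_index_dict[column] = -1
for column in lowered_file2_header:
    if column not in col_index_dict:
        col_index_dict[column] = -1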
Hope this helps.
You don't need Python for this. This is a task for SQL.
SQLite Browser supports CSV Import. Take the below steps to get the desired output:
Download and install SQLite Browser
Create a new Database
Import both CSV's as tables (let's say the table names are file1 and file2, respectively)
Now, you can decide how you want to match the data sets. If you only want to match the files on ID, then you can do something like:
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id
If you want to match on every column, you can do something like:
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id and f1.name = f2.name and f1.gender = f2.gender
Finally, simply export the query results back to a CSV.
I spent a lot of time myself trying to perform tasks like this with scripting languages. The benefit of using SQL is that you simply tell what you want to match on, and then let the database do the optimization for you. Generally, it ends up doing the matching faster than any code I could write.
In case you're interested, Python also has a sqlite3 module that comes out of the box. I've gravitated towards using this as my source of data in Python scripts for the above reason, and I simply import the required CSVs in SQLite Browser before running the Python script.
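For illustration (a rough sketch of mine, not from the original answer), you can also skip SQLite Browser entirely and load both CSVs into an in-memory SQLite database from Python, then run the join there; the column and file names below are taken from the example data above:
import csv
import sqlite3

conn = sqlite3.connect(':memory:')

def load_csv(conn, path, table):
    # read the whole CSV and create a table whose columns are the lowercased header
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    header = [h.lower() for h in rows[0]]
    conn.execute('CREATE TABLE {} ({})'.format(table, ', '.join(header)))
    placeholders = ', '.join('?' * len(header))
    conn.executemany('INSERT INTO {} VALUES ({})'.format(table, placeholders), rows[1:])

load_csv(conn, 'file1.csv', 'file1')
load_csv(conn, 'file2.csv', 'file2')

for row in conn.execute('select * from file1 f1 inner join file2 f2 on f1.id = f2.id'):
    print(row)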

Dealing with strings amidst int in csv file, None value

I'm reading in data from a csv file where some of the values are "None". The values that are being read in are then contained in a list.
The list is then passed to a function which requires all values within the list to be ints.
However, I can't do this while the "None" string values are present. I've tried replacing "None" with None, or with "", but that hasn't worked; it results in an error. The data in the list also needs to stay in the same position, so I can't just ignore it altogether.
I could replace all "None" with 0 but None != 0 really.
EDIT: I've added my code so hopefully it'll make a bit more sense. I'm trying to create a line chart from data in a csv file:
import csv
import sys
from collections import Counter
import pygal
from pygal.style import LightSolarizedStyle
from operator import itemgetter

# Read in file to data variable and set header variable
filename = sys.argv[1]
data = []
with open(filename) as f:
    reader = csv.reader(f)
    header = next(reader)
    data = [row for row in reader]

# count rows in spreadsheet (minus header)
row_count = (sum(1 for row in data)) - 1

# extract headers which I want to use
headerlist = []
for x in header[1:]:
    headerlist.append(x)

# initialise line chart in module pygal. set style, title, and x axis labels using headerlist variable
line_chart = pygal.Line(style=LightSolarizedStyle)
line_chart.title = 'Browser usage evolution (in %)'
line_chart.x_labels = map(str, headerlist)

# create lists for data from spreadsheet to be put in to
empty1 = []
empty2 = []

# select which data I want from spreadsheet
for dataline in data:
    empty1.append(dataline[0])
    empty2.append(dataline[1:-1])

# DO SOMETHING TO "NONE" VALUES IN EMPTY TWO SO THEY CAN BE PASSED TO INT CONVERTER ASSIGNED TO EMPTY 3

# convert all items in the sublists of empty2 to int
empty3 = [[int(x) for x in sublist] for sublist in empty2]

# add data to chart line by line
count = -1
for dataline in data:
    while count < row_count:
        count += 1
        line_chart.add(empty1[count], [x for x in empty3[count]])

# function that only takes int data
line_chart.render_to_file("browser.svg")
There will be a lot of inefficiencies or weird ways of doing things; I'm trying to slowly learn.
The above script gives this chart:
With all the Nones set to 0, but this doesn't really reflect the fact that Chrome didn't exist before a certain date. Thanks.
Without seeing your code, I can only offer limited help.
It sounds like you need to utilize ast.literal_eval().
import ast
import csv

csvread = csv.reader(file)
list = []
for row in csvread:
    list.append(ast.literal_eval(row[0]))
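Applied to the asker's chart script, one possible sketch (my illustration, not part of the answer) is to convert numeric strings to int and map the "None" strings to an actual None value, which pygal generally draws as a gap rather than a zero:
def to_int_or_none(value):
    # "None" cells become real None values; everything else is converted to int
    return None if value == 'None' else int(value)

empty3 = [[to_int_or_none(x) for x in sublist] for sublist in empty2]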

Remove unwanted columns from CSV file

I have two lists of CSV files that my program is combining into a single file.
The first group of files has 5 columns of data that I do not want to include in the output. How do I remove those 5 columns, whether I do it row-by-row or all at one time, from the data I have read in using csv.reader?
Here's my function (I would like to keep the function def and structure mostly the same):
import csv

def get_data(filename, rowlen, delimit=','):
    data = []
    with open(filename, 'r', newline='') as f:
        raw = csv.reader(f, dialect='excel', delimiter=delimit)
        if raw != None:
            for row in raw:
                if row[-1] == '':
                    row.pop()
                for i in range(len(row), rowlen):
                    row.append('-999')
                data.append(row)
    return data
I tried doing this:
raw = csv.reader(f, dialect='excel', delimiter=delimit)
if raw != None:
    for row in raw:
        if rowlen == 13:  # This is true only for csv files I want to shorten
            row = row[0:8]
            rowlen = 8
        if row[-1] == '':
But the output file remained the same. Also, I tried commenting out rowlen = 8, but this just filled the columns I don't want with -999.
You need to replace the row in raw, or create a new list that will contain your sliced rows. Here is a correction of part of your code, using enumerate to keep track of the index of the row to be replaced in raw (note that this assumes raw has been materialized as a list, e.g. raw = list(csv.reader(f)), since you can't assign back into a csv reader):
for i, row in enumerate(raw):
    if rowlen == 13:  # This is true only for csv files I want to shorten
        raw[i] = row[0:8]
        rowlen = 8
Another example, where you don't alter raw:
new_container = []
for row in raw:
    if rowlen == 13:  # This is true only for csv files I want to shorten
        new_container.append(row[0:8])  # we just append your slice to new_container each iteration
        rowlen = 8
You should check out pandas. It makes working with CSV files much, much better.
from pandas import read_csv

def get_data(filename, rowlen, delimit=','):
    df = read_csv(filename, header=None, sep=delimit, usecols=range(rowlen))
    df.to_csv('output.csv', index=False)

get_data('input.csv', 4)
