I am pretty new to Python, so I may be overlooking an easy solution, but everything I have tried thus far has been fruitless.
I have hundreds of CSV files with identical format. The format I have is
--Name of File (unimportant)
--Single Number Value (unimportant)
--Important Row of Column Names
--Two More Rows of Unimportant Formatting Garbage
--Thousands of Rows of Important Data
--Several Blank Rows
--Thousands of Rows of Unimportant Garbage Again
I need to format it so that I can easily grab the Column Names and the Important Data underneath. The column names are always on row 5 and the data always starts on row 8, but the amount of data can vary from several hundred to several thousand rows.
EDIT: I got the exact row number of the heading wrong. Also, I forgot to mention that I need to save the result to a dataframe for future analysis.
This is an image of the top of the csv file
This is an image of the bottom of the csv file. Note that when it switches from 'important data' to 'unimportant data' the number of columns increases, which might make programming difficult.
You can use the code below. It takes the column names from line 5, reads the data starting at line 8, and stops at the first blank line.
import csv
import pandas as pd

columns = []
data_rows = []
with open("C:/Users/user/PycharmProjects/spacysample/MrX.csv", 'r', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        if csvreader.line_num == 5:
            # To get column names; drop empty trailing cells
            columns = list(row)
            while columns and columns[-1] == '':
                columns.pop()
            print("THE COLUMN NAMES IN LINE NUMBER 5 ARE...........")
            print(', '.join(columns))
        elif csvreader.line_num >= 8:
            # Stop where the first blank line is encountered after the data
            if not any(cell.strip() for cell in row):
                print('Loop breaks at line number: ' + str(csvreader.line_num))
                break
            # To get data values; keep only as many cells as there are column names
            data_rows.append(row[:len(columns)])

# Build the DataFrame once, after all rows have been collected
df = pd.DataFrame(data_rows, columns=columns)
print(df.head())
Hope this does exactly what you want.
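import pandas as pd
# header is zero-based, so header=5 takes the sixth line as the column names
# (blank lines are not counted by default); [7:] then drops the first seven
# rows below the header. Adjust both numbers to your file.
df = pd.read_csv('path_to_your_csv', header=5)[7:]
# List Columns
df.columns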
In case you don't have pandas : pip install pandas
read_csv docs : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
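If the wider rows at the bottom of the file trip up read_csv, one option is to cut out the block by hand first. Below is a minimal sketch using only the standard library, assuming the column names sit on line 5 and the data starts on line 8 as described in the question. The function extract_block and its parameters are hypothetical names, and the sample content is made up.

```python
import csv
import io


def extract_block(lines, header_line=5, first_data_line=8):
    """Return (column_names, data_rows) from an iterable of CSV text lines.

    Line numbers are 1-based. Reading stops at the first row whose
    cells are all empty, so the trailing block is never touched.
    """
    columns, rows = [], []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if line_number == header_line:
            columns = list(row)
            # Drop empty trailing cells left over from wider rows
            while columns and columns[-1] == "":
                columns.pop()
        elif line_number >= first_data_line:
            if not any(cell.strip() for cell in row):
                break  # first blank row: the important data has ended
            rows.append(row[:len(columns)])
    return columns, rows


# Small in-memory stand-in for one of the files (made-up values)
sample = io.StringIO(
    "MrX.csv\n"
    "42\n"
    "\n"
    "\n"
    "time,speed,,\n"
    "junk,junk\n"
    "junk,junk\n"
    "0,1.5,,\n"
    "1,2.5,,\n"
    ",,,\n"
    "a,b,c,d,e\n"
)
columns, rows = extract_block(sample)
print(columns)
print(rows)
```

For a real file, pass an open file object instead of the StringIO; the result can then go into pd.DataFrame(rows, columns=columns). The sketch counts rows as lines, so it assumes no quoted field spans several lines.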
I'm using PyQt5 and want to compare values from a csv file with values entered by the user through QLineEdit(). Then, if the values are the same, I want to import the whole row into a QTableWidget.
The csv file contains 3 different columns, with width values, height values and thickness values.
I've tried this to solve the first problem:
import csv
with open('csvTest.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        if row[0] == self.widthTextbox.text() or row[1] == self.heightTextbox.text() or row[2] == self.thickTextbox.text():
            print("Found: {}".format(row))
This didn't work, and I know that using "or" is problematic because I want this to act like a filter: if the user inputs only one of the three attributes he'll get some rows, if he inputs two he'll get fewer rows, and if he inputs all three he'll get fewer still. But using "or" accepts any line that satisfies any one of the conditions.
The second problem is, if this worked, I'd like to make the number of rows in the table equal to the number of rows that passed through the filter, using something like self.tableWidget.setRowCount('''number of rows found''') .
Finally, the last issue would be to make the QTableWidget rows identical to the ones that the filter found.
To solve the first and second issues, this could be a way:
import csv
from collections import Counter
rows_finded = []
with open('csvTest.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        values = [self.widthTextbox.text(), self.heightTextbox.text(), self.thickTextbox.text()]
        if Counter(values) == Counter(row):
            rows_finded.append(row)
self.tableWidget.setRowCount(len(rows_finded))
To solve the last issue (source: Python - PyQt - QTable Widget - adding rows):
for i, row in enumerate(rows_finded):
    for j, col in enumerate(row):
        item = QTableWidgetItem(col)
        self.tableWidget.setItem(i, j, item)
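Note that the Counter comparison only matches when all three boxes are filled in and equal the row's values. If empty boxes should simply be ignored, as the question describes, a filter along the following lines may be closer. This is only a sketch: row_matches and filter_rows are hypothetical helpers, the plain strings stand in for the text() values of the three QLineEdit widgets, and the data is made up.

```python
import csv
import io


def row_matches(row, wanted):
    """True if every non-empty wanted value equals the cell in the same column."""
    return all(not value or cell == value for cell, value in zip(row, wanted))


def filter_rows(lines, wanted):
    """Return the CSV rows that satisfy all filled-in filters."""
    return [row for row in csv.reader(lines) if row and row_matches(row, wanted)]


# Made-up width, height, thickness rows
data = "10,20,3\n10,25,3\n10,25,4\n12,20,5\n"

one_filter = filter_rows(io.StringIO(data), ["10", "", ""])
two_filters = filter_rows(io.StringIO(data), ["10", "", "3"])
three_filters = filter_rows(io.StringIO(data), ["10", "25", "3"])
print(len(one_filter), len(two_filters), len(three_filters))
```

The length of the returned list can be passed to setRowCount, and the list itself can be fed to the setItem loop.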
First of all, keep in mind that I'm a complete beginner with Python. I've been trying to figure this out all afternoon with no luck.
This is what I'm trying to do:
Let's say we have two csv files:
file 1:
col1;col2
659039;16,9
659038;27,8
659037;36,4
file 2:
col1;col2
659037;36,4
659039;16,9
659038;30
I want to search col1 of file 2 for all the items in col1 of file 1, and if an item is found and its col2 value differs, return that line. In the above case only the last line of file 2 would be returned, because the other lines are identical (line order doesn't matter). I only want those that are different.
Poorly explained. Hope you understand what I mean. Any help would be hugely appreciated!
Try to do one thing at a time. First, extract all the values you need to check file2 for from file1 and store them in a data structure that is easy to work with. In the example below, I looped through all of the lines in file1 and collected the contents in a dictionary. Specifically, the keys are from column one and the values are from column two.
Now you can loop through each row in file2 and look up its column-one value in the dictionary. If the key exists and its stored value differs from column two, that line should be returned. Note that the sample files are semicolon-separated, so the reader needs delimiter=';'.
import csv

fileItems = {}
linesToReturn = []
with open('file1.csv', newline='', encoding='utf-8-sig') as file1:
    reader = csv.reader(file1, delimiter=';')
    for row in reader:
        fileItems[row[0]] = row[1]
with open('file2.csv', newline='', encoding='utf-8-sig') as file2:
    reader = csv.reader(file2, delimiter=';')
    for row in reader:
        # Keep rows whose key exists in file1 but whose value differs
        if row[0] in fileItems and fileItems[row[0]] != row[1]:
            linesToReturn.append(row)
print(linesToReturn)
If you're using csv to search through the files, check out the documentation here.
Break your problem down into sub-problems. You can use the pandas framework to achieve this in the following steps:
1. Read the csv files.
2. Use pandas to compare both columns. You can refer to https://www.shanelynn.ie/python-pandas-read_csv-load-data-from-csv-files/
3. If you find the desired difference, add the line to a Python list.
4. Return the list at the end of the code.
I have the header name of a column from a series of massive csv files with 50+ fields. Across the files, the index of the column I need is not always the same.
I have written code that finds the index number of the column in each file. Now I'd like to add only this column as the key in a dictionary where the value counts the number of unique strings in this column.
Because these csv files are massive and I'm trying to use best-practices for efficient data engineering, I'm looking for a solution that uses minimal memory. Every solution I find for writing a csv to a dictionary involves writing all of the data in the csv to the dictionary and I don't think this is necessary. It seems that the best solution involves only reading in the data from this one column and adding this column to the dictionary key.
So, let's take this as sample data:
FOODS;CALS
"PIZZA";600
"PIZZA";600
"BURGERS";500
"PIZZA";600
"PASTA";400
"PIZZA";600
"SALAD";100
"CHICKEN WINGS";300
"PIZZA";600
"PIZZA";600
The result I want:
food_dict = {'PIZZA': 6, 'PASTA': 1, 'BURGERS': 1, 'SALAD': 1, 'CHICKEN WINGS': 1}
Now let's say that I want the data from only the FOODS column and in this case, I have set the index value as the variable food_index.
Here's what I have tried, the problem being that the columns are not always in the same index location across the different files, so this solution won't work:
from itertools import islice

food_dict = {}
with open(input_data_txt, "r") as file:
    # This enables skipping the header line.
    skipped = islice(file, 1, None)
    for i, line in enumerate(skipped, 2):
        try:
            food, cals = line.split(";")
        except ValueError:
            # Skip lines that do not have exactly two fields
            continue
        if food not in food_dict:
            food_dict[food] = 1
        else:
            food_dict[food] += 1
This solution works for only this sample -- but only if I know the location of the columns ahead of time -- and again, a reminder that I have upwards of 50 columns and the index position of the column I need is different across files.
Is it possible to do this? Again, built-ins only -- no Pandas or Numpy or other such packages.
The important part here is that you do not skip the header line! You need to split that line and find the indices of the columns you need! Since you know the column headers for the information you need, put those into a reference list:
wanted_headers = ["FOODS", "RECYCLING"]
with open(input_data_txt, "r") as infile:
    # Read only the header line, not the whole file
    header = infile.readline().rstrip('\n').split(';')
    wanted_cols = [header.index(label) for label in wanted_headers if label in header]
    # wanted_cols is now a list of column numbers you want
    for line in infile:  # Iterate through the remaining lines one at a time
        fields = line.rstrip('\n').split(';')
        data = [fields[col] for col in wanted_cols]
You now have the data in the same order as your existing headers; you can match it up or rearrange as needed.
Does that solve your blocking point? I've left plenty of implementation for you ...
Use Counter and csv:
from collections import Counter
import csv
with open(filename, newline='') as f:
    reader = csv.reader(f, delimiter=';')  # the sample data is semicolon-separated
    next(reader, None)  # skips header
    histogram = Counter(line[0] for line in reader)
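The two answers can be combined so that the column is found by name in each file rather than by a fixed position. The following is a sketch with built-ins only; count_column is a hypothetical helper name, and the in-memory sample (with the columns deliberately swapped) stands in for a real file.

```python
import csv
import io
from collections import Counter


def count_column(lines, column_name, delimiter=";"):
    """Count the values of one named column, reading one row at a time."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    index = header.index(column_name)  # raises ValueError if the column is missing
    # The generator feeds Counter row by row, so only the counts are kept in memory
    return Counter(row[index] for row in reader if len(row) > index)


sample = io.StringIO(
    'CALS;FOODS\n'
    '600;"PIZZA"\n'
    '500;"BURGERS"\n'
    '600;"PIZZA"\n'
    '400;"PASTA"\n'
)
food_dict = dict(count_column(sample, "FOODS"))
print(food_dict)
```

With a real file, pass the object returned by open(path, newline='') as lines. The csv module also removes the surrounding double quotes, so the keys come out as PIZZA rather than "PIZZA".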
CVS Sample
So I have a CSV file (sample in link above), with variable names in row 7 and values in row 8. The variables all have units after them, and the values are just numbers, like this:
Velocity (ft/s)   Volumetric (Mgal/d)   Mass Flow (klb/d)   Sound Speed (ft/s)
.-0l.121          1.232                 1.4533434           1.233423
There are a lot more variables, but basically I need some way to search the CSV file for specific unit groups and then append the values associated with them to a list. For example, search for the text "(ft/s)" and then make a dictionary with Velocity and Sound Speed as keys and their associated values. I am unable to do this because the CSV is formatted like an Excel spreadsheet, and each cell contains the whole variable name with its unit.
In the end I will have a dictionary for each unit group. I need to do it this way because the unit groups change with each generated CSV file (ft/s becomes m/s). I also can't use an Excel reader, because it doesn't work in IronPython.
You can use csv module to read the appropriate lines into lists.
defaultdict is a good choice for data aggregation, while variable
names and units can be easily separated by splitting on '('.
import csv
import collections

with open(csv_file_name) as fp:
    reader = csv.reader(fp)
    for k in range(6):  # skip 6 lines
        next(reader)
    varnames = next(reader)  # 7th line
    values = next(reader)  # 8th line

groups = collections.defaultdict(dict)
for i, (col, value) in enumerate(zip(varnames, values)):
    if i < 2:
        continue
    # "Velocity (ft/s)" -> name "Velocity", units "ft/s"
    name, units = map(str.strip, col.strip(')').split('(', 1))
    groups[units][name] = float(value)
Edit: added the code to skip first two columns
I'll help with the part I think you're stuck on, which is extracting the units from the category. Given your data, your best bet may be to use a regex. The following should work:
import re

with open('data.csv') as f:
    # I assume the first row has the header you listed in your question
    header = f.readline().strip().split(',')  # since you said it's a csv
for item in header:
    # The unit, including its parentheses
    print(re.search(r'\(.+\)', item).group())
    # The variable name with the unit removed
    print(re.sub(r'\(.+\)', '', item).strip())
That should print the following for you:
(ft/s)
Velocity
(Mgal/d)
Volumetric
(klb/d)
Mass Flow
(ft/s)
Sound Speed
You can modify the above to store these in a list, then iterate through them to find duplicates and merge the appropriate strings to dictionaries or whatnot.
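As a sketch of that last step, the names and units can be captured with one pattern and collected into one dictionary per unit. The helper group_by_unit and the pattern name are hypothetical; the header names come from the question, the numbers are made up, and the syntax is kept simple so it should also run on a Python 2.7-style interpreter.

```python
import re
from collections import defaultdict

# "Name (unit)" with optional surrounding whitespace
HEADER_PATTERN = re.compile(r"^\s*(.*?)\s*\(([^)]*)\)\s*$")


def group_by_unit(varnames, values):
    """Return {unit: {name: value}} for every header of the form 'Name (unit)'."""
    groups = defaultdict(dict)
    for header, value in zip(varnames, values):
        match = HEADER_PATTERN.match(header)
        if match is None:
            continue  # header carries no unit, e.g. a label column
        name, unit = match.groups()
        groups[unit][name] = float(value)
    return dict(groups)


# Header names from the question; the numbers are made up
varnames = ["Velocity (ft/s)", "Mass Flow (klb/d)", "Sound Speed (ft/s)"]
values = ["1.5", "2.25", "3.0"]
groups = group_by_unit(varnames, values)
print(groups["ft/s"])
```

Because the unit is read from the header itself, the same code keeps working when a file reports m/s instead of ft/s; only the dictionary keys change.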
I am new to Python, and I am trying to 'migrate' an Excel Solver model that I created to Python, in the hope of faster processing.
I receive a .csv sheet that I use as my input for the model, it is always in the same format.
This model essentially uses 4 different metrics associated with product A, B and C, and I essentially determine how to price A, B, and C accordingly.
I am at the very nascent stage of effectively inputting this data to Python. This is what I have, and I would not be surprised if there is a better approach, so open to trying anything you veterans have to recommend!
import csv
f = open("141881.csv")
for row in csv.reader(f):
    price = row[0]
    a_metric1 = row[1]
    a_metric2 = row[2]
    a_metric3 = row[3]
    a_metric4 = row[4]
    b_metric1 = row[7]
    b_metric2 = row[8]
    b_metric3 = row[9]
    b_metric4 = row[10]
    c_metric1 = row[13]
    c_metric2 = row[14]
    c_metric3 = row[15]
    c_metric4 = row[16]
The .csv file comes in the format of price,a_metric1,a_metric2,a_metric3,a_metric4,,price,b_metric1,b_metric2,b_metric3,b_metric4,price,,c_metric1,c_metric2,c_metric3,c_metric4
I skip the second and third price column as they are identical to the first one.
However when I run the python script, I get the following error:
c_metric1 = row[13]
IndexError: list index out of range
And I have no idea why this occurs, when I can see the data is there myself (in Excel, this .csv file goes all the way to column Q, which I understand to be row[16]).
Your help is appreciated, and any advice on my approach is more than welcomed.
Thanks in advance!
Using print() can be your friend here:
import csv
with open('141881.csv') as file_handle:
    file_reader = csv.reader(file_handle)
    for row in file_reader:
        print(row)
The code above will print out EACH row.
To print out ONLY the first row, replace the for loop with: print(next(file_reader))
Printing out row(s) will allow you to see what exactly a "row" is.
P.S.
Using with is advisable because it handles the opening and closing of the file for you
Look into pandas.
Read the file as:
import pandas as pd
data = pd.read_csv('141881.csv')
To read a column:
col = data['column_name']
To read a row by position:
row = data.iloc[row_number]
The csv module in Python turns a spreadsheet into a matrix: a list of lists
The csv module turns each line of your input into a list.
It splits each row into a list of cells. In other words, each list has as many items as that line of your Excel spreadsheet has columns.
Try in a terminal (the output shown is only illustrative):
>>> import csv
>>> f = open("141881.csv")
>>> print(list(csv.reader(f)))
[['id', 'name', 'company', 'email'], ['1563', 'defoe', 'SuperFastCompany', 'def#superfastcie.net'], ['1564', 'doe', 'Awsomestartup', 'doe#awesomestartup'], ...]
So that's why you iterate through the rows of your spreadsheet, assigning each value to a new variable.
I recommend reading up on the basics of list manipulation.
But...
What is an IndexError? Catching exceptions:
If a cell is missing or one row has fewer columns than the others, Python raises an error such as the one you described. IndexError means Python could not find a value at that position in the list. In other words, if some lines of your spreadsheet are shorter than others, there is no such value to assign, and indexing past the end raises IndexError. That is why knowing how to catch exceptions is useful for seeing the problem. Try to verify that every row has the same length, and if not, assign an empty value, for example:
try:
    # If the row always has 17 cells, they can be unpacked in one statement;
    # "_" marks the blank and repeated price columns that are ignored
    (price, a_metric1, a_metric2, a_metric3, a_metric4, _, _,
     b_metric1, b_metric2, b_metric3, b_metric4, _, _,
     c_metric1, c_metric2, c_metric3, c_metric4) = row
except ValueError:
    # Unpacking raises ValueError (not IndexError) when the row
    # does not have exactly 17 cells; show how many it actually has
    print(len(row))
Now you can avoid the error by assigning None to the values that don't appear in the csv file.
You can read more about Catching Exception
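An alternative to catching the exception is to check each row before indexing into it. Below is a minimal sketch, assuming the 17-column layout from the question; read_metric_rows and EXPECTED_CELLS are hypothetical names, and the sample content is made up.

```python
import csv
import io

EXPECTED_CELLS = 17  # columns A to Q in the layout from the question


def read_metric_rows(lines):
    """Yield (price, a_metrics, b_metrics, c_metrics) for each full-width row."""
    for row in csv.reader(lines):
        if len(row) < EXPECTED_CELLS:
            continue  # short or blank row: skip instead of raising IndexError
        try:
            price = float(row[0])
        except ValueError:
            continue  # not a number, e.g. a header row
        yield price, row[1:5], row[7:11], row[13:17]


# Made-up file: a short header line followed by one data row
sample = io.StringIO(
    "price,a_metric1\n"
    "9.5,1,2,3,4,,9.5,5,6,7,8,9.5,,9,10,11,12\n"
)
parsed = list(read_metric_rows(sample))
print(parsed)
```

The slices skip the blank cells and the repeated price columns, which mirrors the indices used in the question (1 to 4, 7 to 10 and 13 to 16).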
Thanks everyone for your input - printing the results made me realize that I was getting the IndexError because of the very first row, which only had headers. Skipping that row got rid of the error.
I will look into pandas, it seems like that will be useful for the type of work I am doing.
Thanks again for all of your help, much appreciated.