I have a CSV file and I would like to read it cell by cell so that I can write the values into Excel. I am using csv.reader and enumerating the result so that I can put values into the corresponding cells in Excel.
With the current code, once I enumerate, the values turn into strings. If I write to Excel with sheet.write(rowi, coli, value), all cells are formatted as text. I can't have this, because I need to sum the columns afterward, and they need to be treated as numbers.
For example, my text file will have: 1, a, 3, 4.0, 5, 6, 7
After first enumeration, the first row: (0, '1, a, 3, 4.0, 5, 6, 7')
After second enumeration, first column of first row: (0, 0, '1')
QUESTION: How can I read this csv file to yield (0, 0, 1) (etc.)?
Here's some code I'm working with:
import csv, xlwt
with open('file.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=",")
    wbk = xlwt.Workbook()
    sheet = wbk.add_sheet("file")
    for rowi, row in enumerate(data):
        for coli, value in enumerate(row):
            sheet.write(rowi, coli, value)
            # print(rowi, coli, value) gives (rowi, coli, 'value')
import csv, xlwt
with open('file.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=",")
    wbk = xlwt.Workbook()
    sheet = wbk.add_sheet("file")
    for rowi, row in enumerate(data):
        for coli, value in enumerate(row):
            sheet.write(rowi, coli, value)
wbk.save("workbook_file")
Even though print(rowi, coli, value) shows 'value' with quotes, the cell in the output file will show it without quotes.
If your data is in the format 1, 2, 3 and not 1,2,3 include this after your for coli, value in enumerate(row): line:
value = value.lstrip(" ")
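To directly get the (0, 0, 1) tuples the question asks for, one option is to attempt a numeric conversion on each value before writing it. This is a minimal sketch of that idea (the coerce helper is my name, not from the original code), assuming Python 3:
import csv, xlwt

def coerce(value):
    # Try int first, then float; fall back to the original string.
    value = value.strip()
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value

wbk = xlwt.Workbook()
sheet = wbk.add_sheet("file")
with open('file.csv', 'r', newline='') as csvfile:
    for rowi, row in enumerate(csv.reader(csvfile, delimiter=",")):
        for coli, value in enumerate(row):
            sheet.write(rowi, coli, coerce(value))
wbk.save("workbook_file")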
Well, I think the csv module of Python is still lacking a crystal ball... More seriously, in the csv file there is no indication of the type of a variable: integer, float, string, or date. By default, the reader transforms each row into a list of strings.
If you want some columns to be integers, you can add a list of booleans to your script. Say you have 4 columns and the third is integer:
int_col = [False, False, True, False]
...
for rowi, row in enumerate(data):
    for coli, value in enumerate(row):
        val = int(value) if int_col[coli] else value
        sheet.write(rowi, coli, val)
You can also try to guess which columns are integers by reading n rows (for example n = 10) and saying that, for each column where you found n integers, you treat that column as integer.
Or you can even imagine a two-pass operation: the first pass determines the types of the columns and the second does the inserts.
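A rough sketch of that guessing approach (looks_int and the n = 10 sample size are illustrative choices of mine, not part of the answer):
import csv

def looks_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

# Pass 1: inspect up to n rows to decide which columns hold integers.
n = 10
with open('file.csv', newline='') as f:
    sample = [row for _, row in zip(range(n), csv.reader(f))]
int_col = [all(looks_int(row[i]) for row in sample)
           for i in range(len(sample[0]))]

# Pass 2: re-read the file and convert the flagged columns.
with open('file.csv', newline='') as f:
    for rowi, row in enumerate(csv.reader(f)):
        vals = [int(v) if int_col[i] else v for i, v in enumerate(row)]
        # ... write vals[coli] with sheet.write(rowi, coli, ...) here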
I find Python's standard library functions a bit lacking for processing CSV files. I prefer to work with pandas when possible.
import xlwt  # pandas uses xlwt as the engine for writing .xls files
import pandas as pd

df = pd.read_csv('file.csv')
#number the columns sequentially
df.columns = [i for i, e in enumerate(df.columns)]
#unstack the columns to make 2 indices plus a column, make row come before col,
#sort row major order, and then unset the indices to get a DataFrame
newDf = df.unstack().swaplevel(0,1).sort_index().reset_index()
#rename the cols to reflect the types of data
newDf.columns = ['row', 'col', 'value']
#write to excel
newDf.to_excel('output.xls', index=False)
This will also keep the row and column numbers as integer values. I tried it on an example csv file, and both row and col came out integer-valued, not strings.
Related
I have several large csv files, each with 100 columns and 800k rows. Starting from the first column, every other column has cells that look like a Python list: for example, cell A2 contains [1000], cell A3 contains [2300], and so forth. Column 2 is fine and holds plain numbers, but columns 1, 3, 5, 7, ..., 99 are like column 1, with their values inside list brackets. Is there an efficient way to remove the list brackets [] from those columns and make their cells plain numbers?
import os
import pandas as pd

files_directory = r"D:\my_files"
dir_files = os.listdir(files_directory)
for file in dir_files:
    edited_csv = pd.read_csv(os.path.join(files_directory, file))
    for column in list(edited_csv.columns):
        if (column % 2) != 0:
            edited_csv[column] = ?
Please try:
import pandas as pd

df = pd.read_csv('file.csv', header=None)
df.columns = df.iloc[0]
df = df[1:]
for x in df.columns[::2]:
    df[x] = df[x].apply(lambda x: float(x[1:-1]))
print(df)
Note that the csv is read as text, so a cell like [4554.8433] arrives as the string '[4554.8433]', not as an array. Once the cell has been parsed into an actual Python list, you can read the numerical value inside it by indexing:
value = column_1[3]  # a parsed list such as [4554.8433]
print(value[0])      # prints 4554.8433 instead of [4554.8433]
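If the cells are still raw strings, ast.literal_eval is one safe way to do that parsing step; a small sketch (not part of the answer above):
import ast

cell = "[4554.8433]"             # raw string as read from the CSV
parsed = ast.literal_eval(cell)  # -> [4554.8433], an actual Python list
print(parsed[0])                 # 4554.8433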
I have output a pandas df into an Excel file using XlsxWriter. I'm trying to create a totals row at the top. To do so, I'm trying to write a function that dynamically populates the totals based on the column I choose.
Here is an example of what I'm intending to do:
worksheet.write_formula('G4', '=SUM(G4:G21)')
#G4 = Where total should be placed
I need this to be a function because the row counts can change (summation range should be dynamic), and I want there to be an easy way to apply this formula to various columns.
Therefore I've come up with the following:
def get_totals(column):
    start_row = '4'  # row which the totals will be on
    row_count = str(tl_report.shape[0])  # number of rows in the table, so I can sum up every row
    return worksheet.write_formula(str(column + start_row), "'=SUM(" + str(column + start_row) + ":" + str(column + row_count) + ")'")
When running get_totals("G") it just results in 0. I suspect it has to do with the str() calls I had to apply, because they are adding single quotes to the formula and therefore rendering it incorrectly.
However, I cannot take str() out, because apparently I cannot concatenate ints with strings.
Maybe I'm coding this all wrong; I'm new to Python, so any help is appreciated.
Thank you!
You could also do something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1,2,3,4], 'B': [5,6,7,8],
                   'C': [np.nan, np.nan, np.nan, np.nan]})

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=2)
workbook = writer.book
worksheet = writer.sheets['Sheet1']

def get_totals(start_row, sum_column, column1='A', column2='B'):
    for row in range(start_row, df.shape[0] + start_row):
        worksheet.write_formula(f'{sum_column}{row}', f'=SUM({column1}{row},{column2}{row})')

get_totals(4, 'C')
writer.save()
Output: column C of Sheet1 ends up holding =SUM(A4,B4) through =SUM(A7,B7) for the data rows.
In almost all cases XlsxWriter methods support two forms of notation to designate the position of cells: Row-column notation and A1 notation.
Row-column notation uses a zero based index for both row and column while A1 notation uses the standard Excel alphanumeric sequence of column letter and 1-based row. For example:
(6, 2) # Row-column notation.
('C7') # The same cell in A1 notation.
So for your case you could do the following, setting the row-column values programmatically (you may have to adjust by -1 to get zero indexing):
worksheet.write_formula(start_row, start_column, '=SUM(G4:G21)')
For the range inside the formula you could use one of XlsxWriter's utility functions:
from xlsxwriter.utility import xl_range
my_range = xl_range(3, 6, 20, 6) # G4:G21
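Putting those pieces together, here is a hedged sketch of a dynamic totals function using zero-based indices (the argument names and the example indices are my assumptions, not the asker's exact setup):
from xlsxwriter.utility import xl_range

def get_totals(worksheet, total_row, col, first_data_row, last_data_row):
    # All arguments are zero-based row/column indices.
    data_range = xl_range(first_data_row, col, last_data_row, col)
    worksheet.write_formula(total_row, col, '=SUM(%s)' % data_range)

# e.g. a total in G4 (row index 3, column index 6),
# summing the data in G5:G21 (row indices 4..20):
# get_totals(worksheet, 3, 6, 4, 20)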
I have a csv file which I am splitting on the delimiter ','. My target is to iterate through the first column of the entire file and, if it matches the word I have, put the subsequent values of that particular row into different lists.
Example:
AAA,man,2300,
AAA,woman,3300,
BBB,man,2300,
BBB,man,3300,
BBB,man,2300,
BBB,woman,3300,
CCC,woman,2300,
CCC,man,3300,
DDD,man,2300,
My code:
import csv

datafile = "test.txt"
with open('C:/Users/Dev/Desktop/TEST/Details/' + datafile, 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row)
If I search for a value BBB, I want to have the rest of the details of those rows in 3 different lists. (The CSV file will always have only 4 columns; the fourth column might be empty sometimes, so we just leave it with a trailing comma.)
Sample:
list1 = [man, man, man, woman]
list2 = [2300, 3300, 2300, 3300]
list3 = ['', '', '', '']
How can I do this?
Try it with pandas:
import pandas as pd
df = pd.read_csv('path/to/file', sep=',', header=None)
Now just use:
list1,list2,list3 = df[df[0] == "BBB"].T.values.tolist()
Example df:
df = pd.DataFrame(dict(col1=["AAA","BBB","BBB"],
                       col2=[1,2,3],
                       col3=[4,5,6]))
Outputs:
(['BBB', 'BBB'], [2, 3], [5, 6]) #list1,list2,list3
The answer to your question is right there in your statement: "If I search for a value, say BBB, I want to have the rest of the details of the rows in 3 different lists".
Create empty lists:
list1 = []
list2 = []
list3 = []
Append values to those lists:
for row in reader:
    if row[0] == "BBB":
        list1.append(row[1])
        list2.append(row[2])
        list3.append(row[3])
You can initialize three empty list variables and then, in the loop over the rows, if column 1 matches your value, append the subsequent columns to the lists.
Edit: or use pandas, as Anton VBR has answered.
I'll ignore the part where you read the data from the csv file.
Let us begin with a list (a 2D array). Construct a for loop that searches only the first column for your condition; this yields a vector of the indices meeting the condition, say result_vector = [1, 2, 7, 8, 9].
Now, to get the "filtered" lists, just make another for loop extracting the other columns of the rows whose indices are in result_vector.
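A minimal sketch of that index-vector idea against the sample data (the variable names are mine):
import csv

with open('test.txt', newline='') as f:
    rows = list(csv.reader(f, delimiter=','))

# First loop: collect the indices of rows whose first column matches.
result_vector = [i for i, row in enumerate(rows) if row[0] == "BBB"]

# Second loop: extract the remaining columns of the rows at those indices.
list1 = [rows[i][1] for i in result_vector]
list2 = [rows[i][2] for i in result_vector]
list3 = [rows[i][3] for i in result_vector]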
I have a csv file with 330k+ rows and 12 columns. I need to put column 1 (numeric ID) and column 3 (text string) into a list or array so I can analyze the data in column 3.
This code worked for me to pull out the third col:
for row in csv_strings:
    string1.append(row[2])
Can someone point me to the correct class of commands that I can research to get the job done?
Thanks.
Pandas is the best tool for this.
import pandas as pd
df = pd.read_csv("filename.csv", usecols=[ 0, 2 ])
You can pull them out into a list of key-value pairs:
points = []
for row in csv_strings:
    points.append({'id': row[0], 'text': row[2]})
A different answer, using tuples, which ensure immutability and are pretty fast, but less convenient than dictionaries:
# build results
results = []
for row in csv_lines:
    results.append((row[0], row[2]))

# read results
for result in results:
    result[0]  # id
    result[1]  # string
import csv

x, z = [], []
csv_reader = csv.reader(open('Data.csv'))
for line in csv_reader:
    x.append(line[0])
    z.append(line[2])
This gets you the data from the 1st and 3rd columns.
What is the best approach for importing a CSV that has a different number of columns in each row into a Pandas DataFrame, using Pandas or the csv module?
"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"
Using this code:
import pandas as pd
data = pd.read_csv("smallsample.txt", header=None)
the following error is generated
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Supplying a list of column names in read_csv() should do the trick.
ex: names=['a', 'b', 'c', 'd', 'e']
https://github.com/pydata/pandas/issues/2981
Edit: if you don't want to supply column names, then do what Nicholas suggested.
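For the sample above, where the widest row has 8 fields, a sketch of that approach (the names themselves are arbitrary):
import pandas as pd

# Eight names to match the widest row; shorter rows are padded with NaN.
names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
data = pd.read_csv("smallsample.txt", header=None, names=names)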
You can dynamically generate column names as simple counters (0, 1, 2, etc).
import pandas as pd

# Input
data_file = "smallsample.txt"

# Delimiter
data_file_delimiter = ','

# The max column count a line in the file could have
largest_column_count = 0

# Loop the data lines
with open(data_file, 'r') as temp_f:
    # Read the lines
    lines = temp_f.readlines()
    for l in lines:
        # Count the columns in the current line
        column_count = len(l.split(data_file_delimiter))
        # Keep the largest column count seen so far
        largest_column_count = column_count if largest_column_count < column_count else largest_column_count

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = [i for i in range(0, largest_column_count)]

# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)
Missing values will be assigned to the columns which your CSV lines don't have a value for.
A polished version of P.S.'s answer is as follows. It works.
Remember that we have inserted a lot of missing values in the dataframe.
import pandas as pd

### Loop the data lines
with open("smallsample.txt", 'r') as temp_f:
    # get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
If you want something really concise without explicitly giving column names, you could do this:
Make a one column DataFrame with each row being a line in the .csv file
Split each row on commas and expand the DataFrame
df = pd.read_fwf('<filename>.csv', header=None)
df = df[0].str.split(',', expand=True)
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
The error itself gives the clue to the problem: "Expected 4 fields in line 2, saw 8" means the second row has length 8 while the first row has 4.
import pandas as pd

# inside range(), set the maximum value you see in "Expected 4 fields in line 2, saw 8"
# here it will be 8
data = pd.read_csv("smallsample.txt", header=None, names=range(8))
Use range instead of manually setting names, as it would be cumbersome when you have many columns.
You can use shantanu pathak's method to find the longest row length in your data.
Additionally, you can fill the NaN values with 0 if you need a uniform data length, e.g. for clustering (k-means):
new_data = data.fillna(0)
We could even use the pd.read_table() method to read the csv file: it loads the file as a single-column DataFrame, which can then be split on ','.
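A sketch of that read_table idea (it assumes the file contains no tab characters, and the naive comma split will break quoted fields that themselves contain commas):
import pandas as pd

# With no tabs in the file, read_table keeps each line as one string cell.
raw = pd.read_table("smallsample.txt", header=None)

# Split the single column on commas, expanding into as many columns as needed.
df = raw[0].str.split(',', expand=True)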
Manipulate your csv so that the first row is the row with the most elements, so that all subsequent rows have fewer. Pandas will create as many columns as the first row has.
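If you go that route, a hedged sketch that moves the widest line to the top before reading (note it rewrites the file, changes the row order, and assumes no quoted fields contain commas):
with open("smallsample.txt") as f:
    lines = f.readlines()

# Put the line with the most comma-separated fields first.
widest = max(lines, key=lambda l: len(l.split(',')))
lines.remove(widest)

with open("reordered.txt", "w") as f:
    f.writelines([widest] + lines)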