I'm a beginner at Python, and I'm looking to take 3 specific columns, starting at a certain row, from a .csv spreadsheet and import each into Python.
For example
I would need to take 1000 rows worth of data from column F starting at
row 12.
I've looked at options using csv and pandas, but I can't figure out how
to have them start importing at a certain row/column.
Any help would be greatly appreciated.
If the spreadsheet is not huge, the easiest approach is to load the entire CSV file into Python using the csv module and then extract the required rows and columns. For example:
import csv

with open('Book1.csv', newline='') as f:
    rows = list(csv.reader(f))
data = [row[5] for row in rows[11:11 + 1000]]
will do the trick. Remember that Python numbers from 0, so row[5] is column F from your spreadsheet and rows[11] is row 12.
CSV files are text files, so there is no way to jump straight to a certain line. You will have to read line by line and count. Have a look at the csv module in Python, which explains how to read lines easily.
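If the file is too big to load whole, the read-and-count approach can be sketched with itertools.islice; the column index, start row, and filename below are just the example from the question:

```python
import csv
from itertools import islice

def read_column(path, col, start_row, n_rows):
    """Read n_rows values of one 0-indexed column, starting at a 1-based row."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        # islice skips the first start_row - 1 records, then stops after n_rows
        return [row[col] for row in islice(reader, start_row - 1, start_row - 1 + n_rows)]

# Column F (index 5), 1000 rows starting at row 12:
# data = read_column('Book1.csv', 5, 12, 1000)
```

This never holds more than the selected rows in memory, which matters once the spreadsheet gets large.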
I'm running a Python script to automate some of my day-to-day tasks at work. One task I'm trying to do is simply add a row to an existing ods sheet that I usually open via LibreOffice.
This file has multiple sheets and depending on what my script is doing, it will add data to different sheets.
The thing is, I'm having trouble finding a simple and easy way to just add some data to the first unpopulated row of the sheet.
Reading about odslib3, pyexcel and other packages, it seems that to write a row I need to specify the exact row and column numbers to write to, and opening the ods file just to see which cell to use before telling the Python script seems unproductive.
Is there a way to easily add a row of data to an ods sheet without specifying the row number and column?
If I understand the question, I believe that using .remove() and .append() will do the trick. It will populate the data as a new last row (I can't say it's the most efficient, though). For example:
from pyexcel_ods3 import get_data
from pyexcel_ods3 import save_data

data = get_data("info.ods")
print(data["Sheet1"])
# [['first_row', 'first_row'], []]

if [] in data["Sheet1"]:
    data["Sheet1"].remove([])                            # remove unpopulated row
data["Sheet1"].append(["second_row", "second_row"])      # add new row

print(data["Sheet1"])
# [['first_row', 'first_row'], ['second_row', 'second_row']]

save_data("info.ods", data)                              # write the change back to disk
I regularly get sent a csv containing 100+ columns and millions of rows. These csv files always contain a certain set of columns, Core_cols = [col_1, col_2, col_3], and a variable number of other columns, Var_col = [a, b, c, d, e]. The core columns are always there, and there can be 0-200 of the variable columns. Sometimes one of the variable columns will contain a carriage return. I know which columns this can happen in, bad_cols = [a, b, c].
When I import the csv with pd.read_csv, these carriage returns create corrupt rows in the resulting dataframe. I can't re-make the csv without these columns.
How do I either:
Ignore these columns and the carriage return contained within? or
Replace the carriage returns with blanks in the csv?
My current code looks something like this:
df = pd.read_csv("data.csv", dtype=str)
I've tried things like removing the columns after the import, but the damage seems to have already been done by that point. I can't find the code now, but when testing one fix the error said something like "invalid character u000D in data". I don't control the source of the data, so I can't make edits at that end.
Pandas supports multiline CSV files if the file is properly escaped and quoted. If you cannot read a CSV file in Python using the pandas or csv modules, nor open it in MS Excel, then it's probably a non-compliant "CSV" file.
I recommend manually editing a sample of the CSV file until it opens cleanly in Excel, then recreating those steps programmatically in Python to normalize the large file.
Use this code to create a sample CSV file, copying the first ~100 lines into a new file:
with open('bigfile.csv', 'r') as csvin, open('test.csv', 'w') as csvout:
    line = csvin.readline()
    count = 0
    while line and count < 100:
        csvout.write(line)
        count += 1
        line = csvin.readline()
Now you have a small test file to work with. If the original CSV file has millions of rows and "bad" rows are found much later in the file then you need to add some logic to find the "bad" lines.
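One way to locate the "bad" lines is to count the fields in each parsed record; a minimal sketch (the expected field count is something you'd read off the header row, and the function name is illustrative):

```python
import csv

def find_bad_rows(path, expected_fields):
    """Return (record_number, field_count) pairs for records whose
    field count differs from expected_fields."""
    bad = []
    with open(path, newline='') as f:
        for recno, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected_fields:
                bad.append((recno, len(row)))
    return bad
```

An unescaped carriage return typically splits one logical row into two short records, so both halves show up in the result and point you at the damage.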
It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, not transposed data file has 9000+ rows and 20ish cols. I'm using Excel 2003 which supports up to 256 columns.
EDIT: I figured out a solution that works for me. It's a lot simpler than I expected. I did end up using CSV instead of Excel (I found no serious difference in terms of my project). Here it is for whoever may have the same problem:
import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select
df = pd.read_csv(your_csv_file, index_col=0).T

selection = {}
for item in selectionList:
    selection[item] = df[item]

selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can provide a long (0-indexed) list of rows not to import, e.g. [0, 5, 6, 10]. That might end up being a huge list, though. If you provide it a single integer, it will skip that number of rows and start reading. Then set nrows to the number of rows you want to pick up from the point where reading starts.
If I've misunderstood the issue, let me know.
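skiprows also accepts a callable, which avoids building a huge skip list when you only want a handful of rows; a sketch, assuming 0-indexed file lines with the header on line 0 (the function name is illustrative):

```python
import pandas as pd

def read_selected_rows(path, wanted):
    """Read only the 0-indexed file lines in `wanted`, keeping the header line."""
    return pd.read_csv(path, skiprows=lambda i: i != 0 and i not in wanted)
```

The callable is asked about each file line number and returns True for lines to skip, so only the header and the selected rows are ever parsed.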
I have a csv file that I am trying to split based on the number of columns. The original file has about 24000 columns, and I want to split this into files each having a fixed number of columns (say 1000), so that I can run feature selection in Weka on the individual files. I have the following code in Python.
import pandas as pd
import numpy as np

i = 0
df = pd.read_csv("glio.csv")
#row_split = int(input("Enter the Row Split: "))
row_split = 6000
name = "temp_file_"
ext = ".csv"
rows, columns = df.shape

df_temp = df.iloc[:, :row_split]
df_temp.to_csv(name + str(i) + ext)
i = i + 1

while row_split < columns:
    df_temp = df.iloc[:, row_split + 1:row_split + 100]
    df_temp.to_csv(name + str(i) + ext)
    i = i + 1
    row_split += 1000
It is generating the individual files as expected, but after splitting I am not able to load the individual files in Weka. I am getting the following error.
I am new to this and have no idea why this occurs. I cannot find answers online. It would be really helpful if someone could explain why this is happening and how to correct it.
First of all, add index=False to the to_csv call, so that pandas doesn't write the row index as an extra, unnamed first column (which consumers like Weka tend to choke on):
df_temp.to_csv(name+str(i)+ext, index=False)
Also, please upload a screenshot of the csv file when you open it in some csv viewer application (e.g. Excel).
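Separately, the slicing in the question's loop also loses data: iloc[:, row_split + 1:row_split + 100] skips one column at each boundary and takes only 100 columns while stepping by 1000. A sketch of a corrected splitter, with index=False applied (function and file names are illustrative):

```python
import pandas as pd

def split_columns(path, chunk, prefix="temp_file_"):
    """Write the CSV at `path` out as files of at most `chunk` columns each."""
    df = pd.read_csv(path)
    n_cols = df.shape[1]
    # range(0, n_cols, chunk) yields each chunk's starting column,
    # so no column is skipped or written twice
    for i, start in enumerate(range(0, n_cols, chunk)):
        df.iloc[:, start:start + chunk].to_csv(f"{prefix}{i}.csv", index=False)
```

With 24000 columns and chunk=1000 this produces 24 files that together contain every column exactly once.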
(Before I go ahead and ask this question, please understand that I have done research; this is just to fill in holes in my information.)
I have a standard Excel spreadsheet (.xls) that contains one table with the following info in it:
Now what I would like to achieve is to directly translate this .xls file into a data type that can be stored in memory, so that the Python application I'm writing can access the information accordingly, like a dictionary.
I have read up a fair bit on this, but my coding is a bit rusty as it has been a while.
You can use the pandas library, an excellent library for Excel manipulation.
import pandas as pd
data_frame = pd.read_excel("path_to_excel", "sheet_name")
data_frame is like a table or matrix holding your data; you can manipulate this data_frame really easily.
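Since you mentioned wanting dictionary-like access, the data frame can be converted with to_dict; a small sketch, assuming the first column holds unique keys (the function name and orient choice are illustrative):

```python
import pandas as pd

def sheet_to_dict(df):
    """Turn a sheet's DataFrame into {first-column value: {column: value}}."""
    return df.set_index(df.columns[0]).to_dict(orient="index")

# e.g. sheet_to_dict(pd.read_excel("workbook.xls", sheet_name="Sheet1"))
```

Each row then becomes one dictionary entry keyed by its first-column value, which matches the lookup-style access you described.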
So what you want should be possible with the csv module, assuming you convert your .xls to a .csv (just Save As... in Excel). Like so:
import csv

with open('filepath.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['A'], row['B'])
DictReader takes the first row, and assumes those are the dictionary keys. Then it turns each row into a dictionary where you can access the values using the keys defined in the first row.
If you don't want it to be an actual dict, you can just use csv.reader(f) which allows you to access the rows using list indexing (the above example would end with print(row[0], row[1])).
This all has the nice bonus of using only the standard library, without any 3rd-party imports, so it will run on any machine with Python.
You can use xlrd to loop through the Excel file and build a dictionary, as you suggested.
A better alternative is pandas, which reads your Excel sheet as a table, called a data frame. You can access any cell, row or column of this data frame.
E.g., if you have:
X Y
0 0.213784 0.461443
1 0.703082 0.600445
2 0.111101 0.648624
3 0.101367 0.924729
>>> import pandas as pd
>>> df = pd.read_excel(filename)
>>> df["X"]
0 0.213784
1 0.703082
2 0.111101
3 0.101367
>>> df["Y"]
0 0.461443
1 0.600445
2 0.648624
3 0.924729
>>> df["X"][0]
0.21378370373100195