New to pandas here.
I have a pandas DataFrame called all_invoice with only one column called 'whole_line'.
Each row in all_invoice is a fixed width string. I need a new DataFrame from all_invoice using read_fwf.
I have a working solution that looks like this:
invoice = pd.DataFrame()
for i, r in all_invoice['whole_line'].iteritems():
    temp_df = pd.read_fwf(StringIO(r), colspecs=in_specs,
                          names=in_cols, converters=in_convert)
    invoice = invoice.append(temp_df, ignore_index=True)
in_specs, in_cols, and in_convert have been defined earlier in my script.
So this solution works, but it is very slow: for 18K rows with 85 columns, this part of the code takes about 6 minutes to execute. I'm hoping for a more elegant solution that doesn't involve iterating over the rows of the DataFrame or Series, one that uses the apply function to call read_fwf and makes this go faster. So I tried:
invoice = all_invoice['whole_line'].apply(pd.read_fwf, colspecs=in_specs,names=in_cols, converters=in_convert)
The tail end of my traceback looks like:
OSError: [Errno 36] File name too long:
Following that colon is the string that was passed to the read_fwf method. I suspect this is happening because read_fwf needs a file path or a buffer. In my working (but slow) code I can call StringIO() on each string to turn it into a buffer, but I cannot do that with the apply function. Any help with getting apply to work, or another way to use read_fwf on the entire Series/DataFrame at once and avoid iterating over the rows, is appreciated. Thanks.
Have you tried just doing:
invoice = pd.read_fwf(filename, colspecs=in_specs,
                      names=in_cols, converters=in_convert)
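If the original fixed-width file isn't available and all you have is the whole_line column, here is a minimal sketch of an alternative (reusing in_specs, in_cols, and in_convert from the question): join all the rows into a single in-memory buffer and parse them with one read_fwf call instead of one call per row.
from io import StringIO
import pandas as pd

# Join every fixed-width row into one buffer and parse it in a single pass.
buffer = StringIO("\n".join(all_invoice['whole_line']))
invoice = pd.read_fwf(buffer, colspecs=in_specs, names=in_cols,
                      converters=in_convert)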
I have a big Excel file whose column names I'm organizing into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook

wb = load_workbook(r"E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in wb.worksheets:
    # Read only the first (header) row of each sheet.
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns[sheet.title] = value
Assuming you only have one sheet, columns will map that sheet's name to a tuple of its column names.
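For example, continuing the sketch above (sheetnames is openpyxl's list of sheet names):
# Pull the header tuple of the first sheet out of the dict and turn it into a list.
get_col = list(columns[wb.sheetnames[0]])
print(get_col)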
If you want faster reading, then I suggest you use a different file type. Excel files, while convenient, are binary files, so for pandas to read and correctly parse one it must load the full file. Trimming with nrows or skipfooter only happens after the full data is loaded, so it doesn't really reduce the waiting time. By contrast, when working with a .csv file, given its format and the absence of significant metadata, you can pull just its first rows as an iterable using the chunksize parameter of pd.read_csv().
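For instance, a minimal sketch assuming the same data were exported to a CSV (the E:\DATA\dbo.csv path is a hypothetical export, not a file from the question):
import pandas as pd

# Read only the first one-row chunk and take its columns; the rest of the file
# is never loaded.
reader = pd.read_csv(r"E:\DATA\dbo.csv", chunksize=1)
get_col = list(next(iter(reader)).columns)
print(get_col)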
Other than that, calling list() on a DataFrame already returns a list of its columns, so my only suggestion for the code you use is:
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl'))
The stronger suggestion is to change the file type if you specifically want to address this issue.
I'm working on a rather large Excel file, and as part of that I'd like to insert two columns at the far right of a new Excel file. That works, but whenever I do so, an unnamed column full of numbers appears at the far left.
I've tried using the .drop feature, as well as using a new file, and I've read about CSV files, but I cannot seem to apply any of that here, so nothing has solved it.
import numpy as np
import pandas as pd

wdf = pd.read_excel(tLoc)
sheet_wdf_map = pd.read_excel(tLoc, sheet_name=None)
wdf['Adequate'] = np.nan
wdf['Explanation'] = np.nan
wdf = wdf.drop(" ", axis=1)
I expect the output to be my original columns plus the two new columns at the far right, without the unnamed column.
Add index_col=[0] as an argument to read_excel.
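That unnamed column is most likely the old DataFrame index that was written into the sheet; index_col=[0] tells read_excel to read it back as the index instead of as data. A minimal sketch (the template.xlsx path is a placeholder; use the same tLoc as in the question):
import numpy as np
import pandas as pd

tLoc = "template.xlsx"  # hypothetical path; substitute your own tLoc

# Treat the leftmost (unnamed) column as the index rather than as data.
wdf = pd.read_excel(tLoc, index_col=[0])
wdf['Adequate'] = np.nan
wdf['Explanation'] = np.nan

# Writing with index=False also avoids creating that column in the first place.
wdf.to_excel("output.xlsx", index=False)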
I am trying to read a large CSV file and run some code on it, using chunksize to do so.
file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)
print len(df.index)
I get the following error in the code:
AttributeError: 'TextFileReader' object has no attribute 'index'
How to resolve this?
That error stems from the fact that your pd.read_csv call, in this case, does not return a DataFrame object. Instead, it returns a TextFileReader object, which is an iterator. Essentially, when you set the iterator parameter to True, what is returned is NOT a DataFrame; it is an iterator of DataFrame objects, each holding as many rows as the integer passed to the chunksize parameter (in this case 1000000).
Specific to your case, you can't just call df.index because an iterator object simply does not have an index attribute. That does not mean you cannot access the DataFrames inside the iterator. It means you would either have to loop through the iterator to access one DataFrame at a time, or you would have to concatenate all of those DataFrames into one giant one.
If you are considering just working with one DataFrame at a time, then the following is what you would need to do to print the indexes of each DataFrame:
file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)
for df in dfs:
print(df.index)
# do something
df.to_csv('output_file.csv', mode='a', index=False)
This will save the DataFrames into an output file named output_file.csv. With the mode parameter set to 'a', each write appends to the file, so nothing is overwritten.
However, if the goal for you is to concatenate all the DataFrames into one giant DataFrame, then the following would perhaps be a better path:
file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)
giant_df = pd.concat(dfs)
print(giant_df.index)
Since you are already using the iterator parameter here, I assume you are concerned about memory. If so, the first strategy is the better one: it takes advantage of the memory-management benefits that iterators offer for large datasets.
I hope this proves useful.
I want to put some data from an Excel file into a DataFrame in Python.
The code I use is below (two examples I use to read an Excel file):
d=pd.ExcelFile(fileName).parse('CT_lot4_LDO_3Tbin1')
e=pandas.read_excel(fileName, sheetname='CT_lot4_LDO_3Tbin1',convert_float=True)
The problem is that the DataFrame I get has values with only two digits after the decimal point. In other words, the Excel values are like 0.123456 and the DataFrame values come out like 0.12.
Some kind of rounding seems to be happening, but I cannot find how to change it.
Can anyone help me?
Thanks for the help!
You can try this. I used test.xlsx, which has two sheets, and 'CT_lot4_LDO_3Tbin1' is the second sheet. I also formatted the first value as Text in Excel.
import pandas as pd
fileName = 'test.xlsx'
df = pd.read_excel(fileName, sheetname='CT_lot4_LDO_3Tbin1')
Result:
In [9]: df
Out[9]:
Test
0 0.123456
1 0.123456
2 0.132320
Without seeing the real raw data file, I think this is the best answer I can think of.
Well, when I try:
df = pd.read_csv(r'my file name')
I get something like this in df:
http://imgur.com/a/Q2upp
And I cannot put .fileformat in that statement.
You might be interested in disabling the column datatype inference that pandas performs automatically. This is done by manually specifying the datatype for the column. Here is what you might be looking for:
Python pandas: how to specify data types when reading an Excel file?
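For instance, a minimal sketch along those lines (the sheet name comes from the question; the column name 'Test' and the float dtype are assumptions for illustration, and newer pandas versions spell the argument sheet_name rather than sheetname):
import pandas as pd

fileName = 'test.xlsx'  # hypothetical file, as in the answer above
# Force the column to a specific dtype instead of letting pandas infer it.
df = pd.read_excel(fileName, sheet_name='CT_lot4_LDO_3Tbin1',
                   dtype={'Test': float})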
Using pandas 0.20.1 something like this should work:
df = pd.read_csv('CT_lot4_LDO_3Tbin1.fileformat')
for example, for an Excel file:
df = pd.read_excel('CT_lot4_LDO_3Tbin1.xlsx')
Read this documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I'm trying to copy these files over from S3 to Redshift, and they are all in the format of Row(column1=value, column2=value, ...), which obviously causes issues. How do I get a DataFrame to write out as a normal CSV?
I'm calling it like this:
final_data.rdd.saveAsTextFile(
    path=r's3n://inst-analytics-staging-us-standard/spark/output',
    compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec'
)
I've also tried writing out with the spark-csv module, and it seems like it ignores any of the computations I did, and just formats the original parquet file as a csv and dumps it out.
I'm calling that like this:
df.write.format('com.databricks.spark.csv').save('results')
The spark-csv approach is a good one and should work. Looking at your code, it seems you are calling df.write on the original DataFrame df, which is why your transformations are being ignored. To make it work, you should do something like:
final_data = ...  # do your logic on df and return a new DataFrame
final_data.write.format('com.databricks.spark.csv').save('results')
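For instance, a minimal sketch of that pattern (df is the DataFrame loaded from the parquet file in the question; the column names and the filter are made-up placeholders, and the header option is optional):
# Hypothetical transformation: replace the filter and column names with your own logic.
final_data = df.filter(df['column1'].isNotNull()).select('column1', 'column2')

final_data.write \
    .format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .save('results')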