How to efficiently remove junk above headers in an .xls file - python

I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets also writes the selected parameters for the data set above the headers (E.G 1). I am looking to automate the removal of these rows.
The number of rows this takes up varies, so I am unable to blanket-remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution, where I convert the file to a CSV, read it as a string, and remove everything before the first column header:
import pandas as pd

data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')

with open("Change1.csv") as f:
    s = f.read() + '\n'

# Drop everything before the first column header
a = s[s.index("Col1"):]
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works, but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file when the only rows being removed occur within the first ~20
The dataframe ends up with its column headers shifted over by one and must be re-aligned (less of a concern)
With some of the files being around 20 MB, merging a batch of 8 can take close to 10 minutes.

A little hacky, but here is an idea to speed up your process by doing the operations directly on your dataframe. Since you know your first column name to be Col1, you could try something like this:
import pandas as pd

df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1" in the first column
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as the header
df.columns = df.iloc[column_row]
# Remove the columns' name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) header row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
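If the end goal is to merge a batch of these sheets, you could wrap the steps above in a small helper and concatenate the results. A sketch, assuming every file shares the same first column title (here "Col1") and that the *.xls glob pattern matches your batch:
import glob
import pandas as pd

def clean_sheet(path, first_col="Col1"):
    # first_col is whatever the real first header is in your sheets
    df = pd.read_excel(path, index_col=None)
    header_row = df.index[df.iloc[:, 0] == first_col][0]
    df.columns = df.iloc[header_row]
    df.columns.name = None
    return df.iloc[header_row + 1:].reset_index(drop=True)

# Merge a batch of sheets into a single frame
merged = pd.concat((clean_sheet(p) for p in glob.glob("*.xls")), ignore_index=True)
merged.to_csv("Merged.csv", index=False)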

If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
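If the number of junk rows varies, the two ideas can be combined: peek at just the first ~25 rows to locate the header, then re-read with skiprows. A sketch, assuming the first real header is "Col1" (this still opens the workbook twice, but it avoids the CSV round-trip):
import pandas as pd

# Cheap preview: read only the first 25 rows, without treating any row as a header
preview = pd.read_excel('InputFile.xls', header=None, nrows=25)
header_row = preview.index[preview.iloc[:, 0] == "Col1"][0]

# Full read, skipping the junk rows so the real header row is used
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=header_row)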

Related

Is there a way to view my data frame in pandas without reading in the file every time?

Here is my code:
import pandas as pd
df = pd.read_parquet("file.parqet", engine='pyarrow')
df_set_index = df.set_index('column1')
row_count = df.shape[0]
column_count = df.shape[1]
print(df_set_index)
print(row_count)
print(column_count)
Can I run this without reading in the parquet file each time I want to do a row count, column count, etc.? It takes a while to read the file because it's large, and I have already read it in once, but I'm not sure how to avoid doing it again.
pd.read_parquet reads the whole file from disk into memory, which is naturally slow with a lot of data. So you could engineer a solution like this:
1.) column_count
read_parquet has no nrows-style option, so the cheap way to get the column count is to read only the Parquet metadata with pyarrow:
import pyarrow.parquet as pq
column_count = pq.ParquetFile("file.parqet").metadata.num_columns
-> This gives you the number of columns without loading any of the data.
2.) row_count
cols_want = ['column1']  # put whatever column names you want here
row_count = pd.read_parquet("file.parqet", engine='pyarrow', columns=cols_want).shape[0]
-> This gives you the number of rows while reading in only "column1" instead of all the other columns (which is the reason your solution takes a while). Note that read_parquet takes columns rather than usecols; .shape returns a (rows, columns) tuple, so grab the first item for the row count.
3.) df.set_index('column1') just returns a new DataFrame indexed by that column; storing it in a variable is fine, but it still requires the full read, so I'm not sure what you want to do there. If you're just trying to see what is in the column, use #2 above and remove the ".shape[0]" call.
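For what it's worth, the same metadata object also carries the row count, so both numbers can come from a single call that reads only the file footer (a sketch, keeping the filename from the question):
import pyarrow.parquet as pq

meta = pq.ParquetFile("file.parqet").metadata  # reads only the footer, not the data
print(meta.num_rows, "rows x", meta.num_columns, "columns")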

Get index and column with multiple headers and index_col in Pandas DataFrame

I have a dataframe with multiple header rows and multiple index columns, and I would like to retrieve the list of entries that are non-zero. The dataframe is constructed from a .csv file provided by another party.
It's hard to include the data as it's sensitive, but I read in the data and remove NaNs to make it smaller, keeping only the non-zero rows and columns.
import pandas as pd

df = pd.read_csv('Example.csv', header=[0, 1, 2, 3], index_col=[0, 1])
a = df.where(df == 1).dropna(how='all').dropna(axis=1)
x = [(df[col][df[col].eq(1)].index[i], df.columns.get_loc(col))
     for col in df.columns
     for i in range(len(df[col][df[col].eq(1)].index))]
for i in range(len(x)):
    print(x[i])
I am hoping for the output
((index col1, index col2), (header 3))
So I guess a hypothetical would be:
If I listed every iteration of comic book characters under the headers I would have:
Brand: Marvel/DC/etc.
Hero: Spiderman/Captain America/...
Person: Parker/Riley/Morales
Then my index columns would be the comic name, with the next column holding the issue number of that comic.
Each entry would be 1 if the character is present, and nothing otherwise, in the .csv read from Excel.
I would like the output to be
((Amazing Spiderman, 1),( Parker, Spiderman))
etc.
I hope that makes sense.
I resolved this by removing the rows not being used in the query at that time. It is not an ideal solution, but it will make your version operational, though it can be fiddly if you need both/N header levels in the output.
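For reference, a more direct way to list the (row index, column header) pair for every non-zero cell would be something like the sketch below. This is an alternative to the resolution above, not what was actually used, and it assumes the cells of interest are exactly 1:
import numpy as np
import pandas as pd

df = pd.read_csv('Example.csv', header=[0, 1, 2, 3], index_col=[0, 1])

# Positions of every cell equal to 1
rows, cols = np.where(df.to_numpy() == 1)
for r, c in zip(rows, cols):
    # df.index[r] is the 2-level row key, df.columns[c] is the 4-level column key
    print((df.index[r], df.columns[c]))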

Parse multiple tables of different sizes from a single csv file

I have a CSV file that contains multiple tables. Each table has a title and a variable number of rows and columns (these numbers may vary between files). The titles, as well as the names of rows and columns, may also change between the files I will need to parse in the future, so I cannot hardcode them. Some columns may contain empty cells as well.
Here is a screenshot of an example CSV file with this structure:
I need to find a solution that will parse all the tables from the CSV into Pandas DFs. Ideally the final output would be an Excel file, where each table is saved as a sheet, and the name of each sheet will be the corresponding table title.
I tried the solution suggested in this post, but it kept failing to identify the start/end of the tables. When I used a simpler version of the input CSV file, the suggested code returned only one table.
I would appreciate any assistance!!
You could try this:
df = pd.read_csv("file.csv")
dfs = []
start = 0
for i, row in df.iterrows():
if all(row.isna()): # Empty row
# Remove empty columns
temp_df = df.loc[start:i, :].dropna(how="all", axis=1)
if start: # Grab header, except for first df
new_header = temp_df.iloc[0]
temp_df = temp_df[1:]
temp_df.columns = new_header
temp_df = temp_df.dropna(how="all", axis=0)
dfs.append(temp_df)
start = i + 1
Then you can reach each df by calling dfs[0], dfs[1], ... (note that this assumes the file ends with an empty row; otherwise the last table will not be appended).
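To get the Excel output described in the question, each parsed table can then be written to its own sheet. A sketch, assuming openpyxl is installed and using placeholder sheet names (swap in the real table titles if you capture them while parsing):
import pandas as pd

with pd.ExcelWriter("tables.xlsx") as writer:
    for n, table in enumerate(dfs, start=1):
        # Hypothetical sheet names; replace with each table's real title
        table.to_excel(writer, sheet_name=f"Table{n}", index=False)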

Remove rows containing blank space in python data frame

I imported a CSV file into a pandas DataFrame, and there are some missing values in the CSV file. In the data frame I have rows like the following:
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements like the last (empty) ones in the data above. Nothing works. I do not know whether the above is categorized as white space, an empty string, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I did the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my dataframe is 100 by 1. When I import it from the CSV file, all the columns become 1. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a CSV into a pandas DataFrame, you normally want each field to go into its own column, precisely so you can later process the column values individually. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because each row only contains one single column; you want to remove rows containing consecutive commas (,,). So you could use:
df = df[~df.iloc[:, 0].str.contains(',,', na=False)]
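A minimal sketch of that "fix the import" route, assuming the file really is comma-separated, has no header row, and the trailing ,, marks empty fields:
import pandas as pd

result = pd.read_csv(file, header=None)    # default sep=',' puts each field in its own column
result = result.dropna(axis=1, how='all')  # drop columns that are empty for every row
result = result.dropna(how='any')          # then drop rows that still have missing fields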
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.

Pandas ignore many rows when reading csv

I have a dataset with more than 1,000,000 rows.
However, read_csv cannot read them all.
products = pd.read_csv("PAD_NEW.csv", encoding = "ISO-8859-1", error_bad_lines=False)
products.shape
(859971, 137)
But in R, using fread, I can get all 1048575 rows:
> dim(products)
[1] 1048575 137
I tried reading it with R first and then writing a new file for Python, but that did not work either.
UPDATE: I manually checked the rows being ignored. There is a column named description containing sentences like "new product, next week"; I think Python takes the "," in this column as a separator, because after I deleted this column it worked.
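A small way to confirm that diagnosis without deleting the column, assuming pandas >= 1.3 (where error_bad_lines is replaced by on_bad_lines), is to have pandas report each malformed line and compare the counts:
import pandas as pd

# Count the raw data lines in the file (minus the header)
with open("PAD_NEW.csv", encoding="ISO-8859-1") as f:
    raw_lines = sum(1 for _ in f) - 1

# 'warn' prints every line with too many fields instead of silently dropping it
products = pd.read_csv("PAD_NEW.csv", encoding="ISO-8859-1", on_bad_lines="warn")
print(raw_lines, "lines in the file,", len(products), "rows parsed")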
