Pandas skipping certain columns - python

I'm trying to format an Amazon Vendor CSV using Pandas but I'm running into an issue. The issue stems from the fact that Amazon inserts a row with report information before the headers.
When trying to skip over that row when assigning headers to the dataframe, not all columns are captured. Below is my attempt at explicitly stating which row to pull columns from but it doesn't appear to be correct.
df = pd.read_csv(path + 'Amazon Search Terms_Search Terms_US.csv', sep=',',
                 error_bad_lines=False, index_col=False, encoding='utf-8')
headers = df.loc[0]   # the row that actually holds the column names
new_df = pd.DataFrame(df.values[1:], columns=headers)
print('Copying data into new data frame....')
Before, it looks like this (I want row 2 to be all the columns in the new df):
After the fact, it looks like this (it only selects 5 columns):
I've also tried using skiprows when opening the CSV, but it doesn't treat the report row as data, so it just ends up skipping actual data. Not really sure what is going wrong here; any help would be appreciated.

As posted in the comment by @suvayu, adding header=1 to the read_csv call did the job.
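For reference, a minimal sketch of that fix against the file and path variable from the question (error_bad_lines is deprecated in recent pandas, so it is omitted here):
import pandas as pd

# header=1 tells read_csv the column names live on the second physical
# row (0-indexed row 1); the report-info row above it is skipped
df = pd.read_csv(path + 'Amazon Search Terms_Search Terms_US.csv',
                 sep=',', header=1, index_col=False, encoding='utf-8')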

Related

pandas read excel without unnamed columns

Trying to read an Excel table that looks like this:
      B     C
A     data  data
data  data  data
but read_excel doesn't recognize that one column doesn't start from the first row, and it reads it like this:
Unnamed: 0   B     C
A            data  data
data         data  data
Is there a way to read the data the way I need it? I have checked parameters like header=, but that's not what I need.
A similar question was asked/solved here. So basically the easiest thing would be to either drop the first column (if that's always the problematic column) with
df = pd.read_csv('data.csv', index_col=0)
or remove the unnamed column via
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
You can skip the automatic column labeling with something like pd.read_excel(..., header=None).
This will skip the made-up labeling.
Then you can use a more elaborate computation (e.g. the first non-empty value in each column) to get the labels, such as
df.apply(lambda s: s.dropna().reset_index(drop=True)[0])
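Putting the two steps together, a rough sketch (assuming every column's first non-empty cell really is its label; the file name is a placeholder):
import pandas as pd

# Read with header=None so pandas doesn't invent "Unnamed: N" labels
raw = pd.read_excel('data.xlsx', header=None)

# First non-empty value in each column becomes its label
labels = raw.apply(lambda s: s.dropna().reset_index(drop=True)[0])

# Blank out the cells that held the labels, attach the labels as the
# header, and drop any rows that are now completely empty
for col in raw.columns:
    raw.loc[raw[col].first_valid_index(), col] = None
raw.columns = labels
df = raw.dropna(how='all').reset_index(drop=True)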

how to read excel file with nested columns with pandas?

I am trying to read an Excel file using pandas but my columns and index are changed:
df = pd.read_excel('Assignment.xlsx',sheet_name='Assignment',index_col=0)
(Screenshots of the Excel file and of the resulting dataframe in the Jupyter notebook are omitted here.)
By default pandas considers the first row to be the header. You need to tell it to take two rows as the header.
df = pd.read_excel("xyz.xlsx", header=[0,1], usecols = "A:I", skiprows=[0])
print df
You can choose whether to pass skiprows depending on the requirement. If you remove skiprows, it will show the first row as a header without any unnamed entries.
Refer to this link.
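If the resulting two-level columns are awkward to work with downstream, one option (a sketch, reusing the xyz.xlsx example from above) is to flatten the MultiIndex after reading:
import pandas as pd

df = pd.read_excel("xyz.xlsx", header=[0, 1], usecols="A:I")

# header=[0, 1] produces MultiIndex columns like ('Group', 'Metric');
# join the levels into single strings such as 'Group_Metric'
df.columns = ['_'.join(str(level) for level in col) for col in df.columns]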

How to efficiently remove junk above headers in an .xls file

I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets displays the selected parameters for the data set above the headers (e.g. example 1). I am looking to automate the removal of these.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution where I convert the file to a CSV, read it as a string, and remove everything before the first column header.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
    s = f.read() + '\n'
# Drop everything before the first real header, "Col1"
a = s[s.index("Col1"):]
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
- Multiple format conversions
- Reading every line in the file when the only rows being altered occur within the first ~20
- The dataframe ends up with column headers shifted over by one and must be re-aligned (less of a concern)
With some of the files being around 20 MB, merging a batch of 8 can take close to 10 minutes.
A little hacky, but an idea to speed up your process by doing some operations directly on your dataframe. Assuming you know your first column's name to be Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the column index's name (currently a useless index number);
# on older pandas this was written as: del df.columns.name
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
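If even that single full read feels heavy, a variant of the same idea (a sketch; nrows support in read_excel depends on your pandas version) probes only the first ~20 rows to locate the header, then reads the file once more with skiprows. It still opens the file twice, but avoids the CSV round trip:
import pandas as pd

# Peek at the first 20 rows only, to locate the row holding "Col1"
preview = pd.read_excel('InputFile.xls', header=None, nrows=20)
header_row = preview.index[preview.iloc[:, 0] == "Col1"][0]

# Read again, skipping the junk above the header row
df = pd.read_excel('InputFile.xls', skiprows=header_row)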
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)

How to read the same column header, but "spelt" differently in new files. Pandas

So I am writing in Python using pandas. The code that I wrote extracts specific column headers from an Excel file, which works, but I don't want to have to go into the code every time to change out the names of the column headers to extract when working on new files with the same data.
Here is my extraction method:
xlsx = pd.ExcelFile('filepath')
df = pd.read_excel(xlsx, 'Tabelle1')
df2 = df[['ZONE_NAME','POLYGONSTRING']]
df2.to_csv(r'filepath\name', sep=';', index=False, header=True)  # raw string, so \n isn't read as a newline
So when I run this code on another Excel file, I want it to accept any possible spelling of "ZONE_NAME", such as "zonename", "Zone Name", etc.
If your problem is just limited to different ways to write a column like "ZONE_NAME" (e.g., "zone_name", "ZONENAME", "ZONE_name", etc.) then why not just use some type of filter on the column names:
import re

xlsx = pd.ExcelFile('filepath')
df = pd.read_excel(xlsx, 'Tabelle1')
# This will filter out any non-alphabetical characters from each
# column name and lower-case it (so "ZONE_NAME", "ZONENAME" and
# "zone_NAME" would all become "zonename")
filtered_columns = [re.sub('[^a-zA-Z]', '', c).lower() for c in df.columns]
df.columns = filtered_columns
df2 = df[filtered_columns]
df2.to_csv(r'filepath\name', sep=';', index=False, header=True)
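With the names normalized, the two columns from the question can then be selected by their canonical lowercase forms (a short sketch; assumes the filter above has already run):
df2 = df[['zonename', 'polygonstring']]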
Hope this helps.
Pandas read_csv will automatically detect column headers; no need to specify anything ahead of time. Your post is missing the links to any images, though. It would be better to post samples of the data in-line anyway.

How to write specific columns of a dataframe to a CSV?

I'm writing a script to reduce a large .xlsx file with headers into a CSV, and then write a new CSV file with only the required columns based on the header names.
import pandas
import csv
df = pandas.read_csv('C:\\Python27\\Work\\spoofing.csv')
time = df["InviteTime (Oracle)"]
orignum = df["Orig Number"]
origip = df["Orig IP Address"]
destnum = df["Dest Number"]
df.to_csv('output.csv', header=[time,orignum,origip,destnum])
The error I'm getting is with that last bit of code, and it says
ValueError: Writing 102 cols but got 4 aliases
I'm sure I'm overlooking something stupid, but I've read over the to_csv documentation on the pandas website and I'm still at a loss. I know I'm misusing the to_csv parameters but I can't seem to get my head around the documentation.
Any help is appreciated, thanks!
The way to select specific columns is the columns argument (header= renames the columns being written, so it expects an alias for every one of the 102 columns, which is what the ValueError is complaining about; columns= instead picks which columns to write) -
header = ["InviteTime (Oracle)", "Orig Number", "Orig IP Address", "Dest Number"]
df.to_csv('output.csv', columns=header)
column_list=["column_name1", "column_name2", "column_name3", "column_name4"]
#filter the dataframe beforehand
ds[column_list].to_csv('output.csv',index=False)
#or use columns arg
ds.to_csv('output.csv', columns = column_list,index=False)
I provide index=False arg in order to write only column values
