How to read excel data starting from specific col - python

I have an Excel workbook with multiple sheets, and I am trying to import/read the data starting from an empty column.
The data looks like this (an empty column sits between A and C):

    A        C
    One      Two
             Three

and I am trying to get only the data after the empty column:

    C
    Two
    Three
I can't use usecols, as the position of this empty column changes in each sheet of the workbook.
I have tried this, but it didn't work out for me:

    df = df[~df.header.shift().eq('').cummax()]

I would appreciate any suggestions or hints. Many thanks in advance!

Assuming that you want to start from the first empty header, then:

    df = df[df.columns[list(df.columns).index('Unnamed: 1'):]]

(When reading the sheet, pandas labels a blank header 'Unnamed: <position>', hence 'Unnamed: 1' here.)
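Since the empty column sits at a different position in every sheet, here is a small sketch that finds it dynamically instead of hard-coding 'Unnamed: 1'. It assumes pandas' usual 'Unnamed: <n>' naming for blank headers; the sample frame is made up:

```python
import pandas as pd

def after_first_unnamed(df):
    """Return the slice of df that starts right after the first
    blank-header ('Unnamed: <n>') column; unchanged if none exists."""
    for pos, col in enumerate(df.columns):
        if str(col).startswith("Unnamed:"):
            return df.iloc[:, pos + 1:]
    return df

# toy frame mimicking the question's layout: A | <blank> | C
df = pd.DataFrame({"A": ["One", None],
                   "Unnamed: 1": [None, None],
                   "C": ["Two", "Three"]})
print(after_first_unnamed(df).columns.tolist())  # ['C']
```

The same helper can then be applied to each frame returned by pd.read_excel(..., sheet_name=None).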

Related

pandas read excel without unnamed columns

Trying to read an Excel table that looks like this (column A starts one row below B and C):

          B     C
    A     data  data
    data  data  data
but read_excel doesn't recognize that one column doesn't start from the first row, and it reads it like this:
    Unnamed: 0    B     C
    A             data  data
    data          data  data
Is there a way to read the data the way I need? I have checked parameters like header= but that's not what I need.
A similar question was asked/solved here. So basically the easiest thing would be to either drop the first column (if that's always the problematic column) with

    df = pd.read_csv('data.csv', index_col=0)

or remove the unnamed column via

    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
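As a quick illustration of that second one-liner, with a made-up frame standing in for the mis-read Excel file:

```python
import pandas as pd

# toy frame mimicking a read that produced a stray unnamed column
df = pd.DataFrame({"Unnamed: 0": ["A", "data"],
                   "B": ["data", "data"],
                   "C": ["data", "data"]})

# keep only the columns whose name does not start with 'Unnamed'
cleaned = df.loc[:, ~df.columns.str.contains('^Unnamed')]
print(cleaned.columns.tolist())  # ['B', 'C']
```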
You can skip the automatic column labeling with something like pd.read_excel(..., header=None).
This avoids the autogenerated 'Unnamed' labels.
Then you can use a more elaborate computation to get the labels, e.g. the first non-empty value in each column:

    df.apply(lambda s: s.dropna().reset_index(drop=True)[0])
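A runnable sketch of that idea, using an in-memory CSV that mimics the sheet above (the data values are made up):

```python
import io
import pandas as pd

# column A's header sits one row below B's and C's
raw = io.StringIO(",B,C\nA,10,20\n30,40,50\n")
df = pd.read_csv(raw, header=None)  # no automatic 'Unnamed' labels

# the label of each column is its first non-empty value
labels = df.apply(lambda s: s.dropna().reset_index(drop=True)[0])
print(labels.tolist())  # ['A', 'B', 'C']
```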

Pandas skipping certain columns

I'm trying to format an Amazon Vendor CSV using pandas, but I'm running into an issue: Amazon inserts a row with report information before the headers.
When I try to skip over that row while assigning headers to the dataframe, not all columns are captured. Below is my attempt at explicitly stating which row to pull the columns from, but it doesn't appear to be correct.

    df = pd.read_csv(path + 'Amazon Search Terms_Search Terms_US.csv', sep=',', error_bad_lines=False, index_col=False, encoding='utf-8')
    headers = df.loc[0]
    new_df = pd.DataFrame(df.values[1:], columns=headers)
    print('Copying data into new data frame....')
Before, it looks like this (I want row 2 to be all the columns in the new df):
After the fact it looks like this (it only selects 5):
I've also tried skiprows when opening the CSV, but it doesn't treat the report row as data, so it just ends up skipping actual data. Not really sure what is going wrong here; any help would be appreciated.
As posted in the comment by @suvayu, adding header=1 to the read_csv call did the job.
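header=1 tells read_csv to use the file's second line (zero-based index 1) as the header and discard everything above it. A sketch with an in-memory CSV whose first line imitates Amazon's report-info row (the content is invented):

```python
import io
import pandas as pd

raw = io.StringIO(
    "Search Terms report,US,2021-01-01\n"   # metadata row to skip
    "Search Term,Impressions,Clicks\n"      # actual header row
    "widget,100,7\n"
    "gadget,50,3\n"
)
# use line index 1 as the header; line 0 is discarded
df = pd.read_csv(raw, header=1)
print(df.columns.tolist())  # ['Search Term', 'Impressions', 'Clicks']
print(len(df))              # 2
```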

I would like to change all the columns of my data frame into a csv file

I am trying to write each column of my data frame to its own CSV file, and I think the code I have is wrong. If the data frame has 15 columns, I want 15 CSV files.
Here is what I am doing:
    t = None
    for i in range(len(VF.columns)):
        t = pd.Dataframe(VF[i])
        t.to_csv()
I am using Jupyter Notebook. Could anybody explain what's happening in the code given above?
If I'm understanding correctly, you can try:

    for col in VF.columns:
        VF[[col]].to_csv('%s.csv' % col)
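A self-contained version of that loop, writing into a temporary directory (the toy VF frame and its column names are assumptions):

```python
import os
import tempfile
import pandas as pd

VF = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

outdir = tempfile.mkdtemp()
for col in VF.columns:
    # double brackets keep a one-column DataFrame, so the CSV
    # gets a header line with the column's name
    VF[[col]].to_csv(os.path.join(outdir, '%s.csv' % col), index=False)

print(sorted(os.listdir(outdir)))  # ['a.csv', 'b.csv', 'c.csv']
```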
Why not just use:

    for i in range(len(VF.columns)):
        VF[VF.columns[i]].to_csv('%s.csv' % VF.columns[i])

No need to store it as a separate dataframe.
Also, your loop needs the ".columns" on VF when iterating i; otherwise you're grabbing all the columns for each row.
I think this is what you want:

    import numpy as np

    for i in np.arange(0, 15, 1):
        VF.iloc[:, i].to_csv('location-of-file/%d.csv' % i)

Search entire excel sheet with Pandas for word(s)

I am trying to essentially replicate the Find function (control-f) in Python with pandas. I want to search an entire sheet (all rows and columns) to see whether any of the cells on the sheet contain a word, and then print out the row in which the word was found. I'd like to do this across multiple sheets as well.
I've imported the sheet:
    pdTestDataframe = pd.read_excel(TestFile, sheet_name="Sheet Name",
                                    keep_default_na=False, na_values=[""])
And tried to create a list of columns that I could index into the values of all of the cells but it's still excluding many of the cells in the sheet. The attempted code is below.
columnsList = []
for i, data in enumerate(pdTestDataframe.columns):
columnList.append(pdTestDataframe.columns[i])
for j, data1 in enumerate(pdTestDataframe.index):
print(pdTestDataframe[columnList[i]][j])
I want to make sure that no matter the formatting of the excel sheet, all cells with data inside can be searched for the word(s). Would love any help I can get!
Pandas has a different way of thinking about this. Just calling df[df.text_column.str.contains('whatever')] will show you all the rows in which the text is contained in one specific column. To search the entire dataframe, you can use:

    mask = np.column_stack([df[col].str.contains(r"\^", na=False) for col in df])
    df.loc[mask.any(axis=1)]
(Source is here)
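Putting that together as a runnable sketch (the frame and the search word are made up; casting each column to str first lets numeric cells be searched as well):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"],
                   "note": ["call later", "urgent: call back"],
                   "count": [1, 2]})

word = "urgent"
# one boolean column per dataframe column, stacked side by side
mask = np.column_stack([df[col].astype(str).str.contains(word, na=False)
                        for col in df])
# keep rows where the word appears in any column
matches = df.loc[mask.any(axis=1)]
print(matches.index.tolist())  # [1]
```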

concatenate excel data with python or Excel

Here's my problem: I have an Excel sheet with 2 columns (see below).
I'd like to print (on the Python console or in an Excel cell) all the data in this form:

    "1": ["1123", "1165", "1143", "1091", n], with n ∈ [A2; A205]

We don't really care about column B, but I need to add every postal code in this specific form.
Is there a way to do it with Excel, or in Python with pandas? (If you have any other ideas, I would love to hear them.)
Cheers
I think you can use parse_cols to parse only the first column, and then filter out all rows from 205 to 1000 with skiprows in read_excel:

    df = pd.read_excel('test.xls',
                       sheet_name='Sheet1',
                       parse_cols=0,
                       skiprows=list(range(205, 1000)))
    print(df)

Last, use tolist to convert the first column to a list:

    print({"1": df.iloc[:, 0].tolist()})
The simplest solution is to parse only the first column and then use iloc:

    df = pd.read_excel('test.xls',
                       parse_cols=0)
    print({"1": df.iloc[:206, 0].astype(str).tolist()})
I am not familiar with Excel, but pandas can easily handle this problem.
First, read the Excel file into a DataFrame:

    import pandas as pd
    df = pd.read_excel(filename)

Then print as you like:

    print({"1": list(df.iloc[0:N]['A'])})

where N is the number of rows you would like to print. That is it. If the list is not a string list, you need to cast the ints to strings.
Also, there are a lot of parameters that control the loading part of read_excel; you can go through the documentation to set suitable ones.
Hope this is helpful to you.
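Note that parse_cols was later renamed to usecols in pandas. The dictionary-building step itself can be sketched independently of the file read (the postal codes are taken from the question; the frame is a stand-in for the parsed sheet):

```python
import pandas as pd

# stand-in for the first column of the parsed sheet
df = pd.DataFrame({"A": [1123, 1165, 1143, 1091]})

N = 4  # number of postal codes to include
result = {"1": df.iloc[:N, 0].astype(str).tolist()}
print(result)  # {'1': ['1123', '1165', '1143', '1091']}
```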
