Trying to iterate over a dataframe using iterrows, but it's telling me it is not defined.
After opening the Excel file with read_excel and getting the data into what I believe to be a dataframe, it will not let me use iterrows() on the dataframe.
import pandas as pd

df = pd.read_excel('file.xlsx')
objDF = pd.DataFrame(df['RDX'])  # throws: does not exist
for (i, r) in objDF.iterrows():
    # do stuff
    pass
Expected to be able to iterate over the rows and perform a calculation
Why are you trying to create a dataframe from a dataframe? Is the sole intention to just iterate across one column of the original dataframe? If so, you could access the column as follows:
df = pd.read_excel('file.xlsx')
for index, row in df.iterrows():
    print(row['RDX'])
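If the goal is just to run a calculation on that column, you usually don't need iterrows at all; vectorized column arithmetic is much faster. A minimal sketch, assuming (hypothetically) that the calculation is doubling each value:
import pandas as pd

df = pd.read_excel('file.xlsx')
# hypothetical calculation: double every value in the RDX column
df['RDX_doubled'] = df['RDX'] * 2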
I have a database which has two columns with unique numbers. This is my reference dataframe (df_reference). In another dataframe (df_data) I want to get the rows whose column values exist in this reference dataframe. I tried something like:
df_new = df_data[df_data['ID'].isin(df_reference)]
However, this way I don't get any results. What am I doing wrong here?
From what I see, you are passing the whole dataframe to the .isin() method.
Try:
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
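A small self-contained sketch of why this matters, with made-up IDs: iterating over a dataframe yields its column labels, so .isin(df_reference) checks membership against the column names rather than the ID values.
import pandas as pd

df_reference = pd.DataFrame({'ID': [1, 3, 5]})
df_data = pd.DataFrame({'ID': [1, 2, 3, 4], 'value': ['a', 'b', 'c', 'd']})

# passing the column keeps the rows with ID 1 and 3;
# passing the whole frame would match against column names and return nothing
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
print(df_new)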
Alternatively, convert the ID column to the index of the df_data data frame. Then you could do:
df_data = df_data.set_index('ID')
matching_index = df_reference['ID']
df_new = df_data.loc[matching_index, :]
This should solve the issue.
I have one Excel file with several identically structured sheets (same headers and number of columns); the sheet names are 01, 02, ..., 12.
How can I get this into one dataframe?
Right now I would load each sheet separately with:
df1 = pd.read_excel('path.xls', sheet_name='01')
df2 = pd.read_excel('path.xls', sheet_name='02')
...
and would then concatenate them.
What is the most pythonic way to do this and get one dataframe with all the sheets directly? Also, assume I do not know every sheet name in advance.
read the file as:
collection = pd.read_excel('path.xls', sheet_name=None)
combined = pd.concat([value.assign(sheet_source=key)
                      for key, value in collection.items()],
                     ignore_index=True)
sheet_name=None ensures all the sheets are read in.
collection is a dictionary, with the sheet_name as key, and the actual data as the values. combined uses the pandas concat method to get you one dataframe. I added the extra column sheet_source, in case you need to track where the data for each row comes from.
You can read more about it in the pandas documentation.
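A quick usage sketch of the result, assuming the sheets really are named '01' through '12':
print(list(collection.keys()))            # e.g. ['01', '02', ..., '12']
print(combined['sheet_source'].unique())  # which sheet each row came from
january = combined[combined['sheet_source'] == '01']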
you can use:
df_final = pd.concat([pd.read_excel('path.xls', sheet_name="{:02d}".format(sheet)) for sheet in range(1, 13)], axis=0)
I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
P.S. - I am new to Python.
This is the piece of code I've written:
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
            pass
Also, this way of iterating is quite expensive; is there any better way to do the same?
Thanks in advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the AND and OR logical operators (& and |), for example:
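A hedged sketch of combining two conditions (the column names and the threshold are made up):
df_and = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 10)]   # AND
df_or = df[(df["Column1"] == "ValueToFind") | (df["Column2"] > 10)]    # OR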
You can try:
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
        pass
You can use the isin() method of the pd.Series object.
Assuming you have a data frame named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.
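If what you ultimately want is one True/False per array element (whether it appears anywhere in MKT), a hedged alternative is to flip the check around:
present = pd.Series(uniqueArray).isin(df['MKT'])
# boolean Series aligned with uniqueArray: True where that element
# appears somewhere in the MKT column
print(present.all())   # True only if every element occurs at least once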
I would like to create a very long pivot table using pandas.
I import a .csv file, creating the dataframe df. The .csv file looks like:
LOC,surveyor_name,test_a,test_b
A,Bob,FALSE,FALSE
A,Bob,TRUE,TRUE
B,Bob,TRUE,FALSE
B,Ryan,TRUE,TRUE
I have the basic pivot table setup here, creating the pivot on index LOC
table = pd.pivot_table(df, values=['surveyor_name'], index=['LOC'], aggfunc={'surveyor_name': np.count_nonzero})
I would like to pass a dictionary into the aggfunc argument with an entry for each column heading.
I created a csv with the list of column headings and the aggregation function, i.e.:
a,b
surveyor_name, np.count_nonzero
test_a,np.count_nonzero
test_b,np.count_nonzero
I create a dataframe and convert this dataframe to a dict here:
keys = pd.read_csv('keys.csv')
x = keys.to_dict()
I now have the object x that I want to pass into aggfunc, but at this point I can't move forward.
So the issue with this came in two parts.
Firstly, the creation of the dict was not correct; it should be:
x = dict(zip(keys['a'], keys['b']))
Secondly, instead of np.count_nonzero, using nunique worked.
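Put together, a minimal sketch of the whole flow; it assumes keys.csv has been updated so column b holds the string 'nunique' (a name pandas recognizes) instead of np.count_nonzero, and that the survey data lives in a hypothetical data.csv:
import pandas as pd

df = pd.read_csv('data.csv')     # LOC, surveyor_name, test_a, test_b
keys = pd.read_csv('keys.csv')   # columns a (heading) and b (agg function name)
aggs = dict(zip(keys['a'], keys['b']))   # e.g. {'surveyor_name': 'nunique', ...}

table = pd.pivot_table(df,
                       values=list(aggs.keys()),
                       index=['LOC'],
                       aggfunc=aggs)
print(table)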
I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas; the small piece of code below explains what I did. Thanks for helping.
df.set_index('colx', drop=False, inplace=True)
# sort the index
df.sort_index(inplace=True)
I expect the result to be a dataframe with 'colx' as index.
Add an index to the PySpark dataframe as a column and use it:
rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# and extract the columns
df_index = df_index.withColumn('colA', df_index['_1'].getItem("colA"))
df_index = df_index.withColumn('colB', df_index['_1'].getItem("colB"))
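To actually use the generated index, for example to keep it as its own column and sort the dataframe on it, a hedged continuation of the sketch above:
df_index = df_index.withColumn('row_index', df_index['_2'])
df_index = df_index.drop('_1', '_2').orderBy('row_index')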
This is not how it works with Spark; no such concept as a row index exists there.
One can add a column via zipWithIndex by converting the DF to an RDD and back, but that is a new column, so it is not the same thing.
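If the end goal of the original pandas snippet was simply a dataframe sorted by colx (rather than an index object as such), a hedged Spark equivalent is an explicit sort on that column:
df_sorted = df.orderBy('colx')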