I would like to create a very long pivot table using pandas.
I import a .csv file, creating the dataframe df. The .csv file looks like:
LOC,surveyor_name,test_a,test_b
A,Bob,FALSE,FALSE
A,Bob,TRUE,TRUE
B,Bob,TRUE,FALSE
B,Ryan,TRUE,TRUE
I have the basic pivot table setup here, creating the pivot on index LOC
table = pd.pivot_table(df, values=['surveyor_name'], index=['LOC'],aggfunc={'surveyor_name': np.count_nonzero})
I would like to pass a dictionary into the aggfunc argument, with an entry for each column heading.
I created a csv with the list of column headings and the aggregation function for each, i.e.:
a,b
surveyor_name, np.count_nonzero
test_a,np.count_nonzero
test_b,np.count_nonzero
I create a dataframe and convert this dataframe to a dict here:
keys = pd.read_csv('keys.csv')
x = keys.to_dict()
I now have the object x that I want to pass into aggfunc, but at this point I can't move forward.
The issue with this came in two parts.
Firstly, the creation of the dict was not correct; it should be:
x = dict(zip(keys['a'], keys['b']))
Secondly, using nunique instead of np.count_nonzero worked.
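Putting the two fixes together, a minimal sketch (assuming the b column of keys.csv holds a function name pandas understands as a string, such as 'nunique', and that the survey data lives in a hypothetical data.csv):
import pandas as pd

keys = pd.read_csv('keys.csv', skipinitialspace=True)  # drop the stray space after the comma
agg_map = dict(zip(keys['a'], keys['b']))               # e.g. {'surveyor_name': 'nunique', ...}

df = pd.read_csv('data.csv')                             # hypothetical filename for the survey data
table = pd.pivot_table(df, values=list(agg_map), index=['LOC'], aggfunc=agg_map)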
Related
Given a dictionary with multiple dataframes in it, how can I add a column to each dataframe with every row in that df filled with the key name?
I tried this code:
for key, df in sheet_to_df_map.items():
    df['sheet_name'] = key
This code does add the key column to each dataframe inside the dictionary, but it also creates an additional dataframe.
Can't this be done without creating an additional dataframe?
Furthermore, I want to separate the dataframes in the dictionary by number of columns: all the dataframes that have 10 columns concatenated together, the ones with 9 concatenated together, and so on. I don't know how to do this.
I could do it with the assign() method on the DataFrames and then replace the whole value in the dictionary, though I'm not sure if that's exactly what you want:
for key, df in myDictDf.items():
    myDictDf[key] = df.assign(sheet_name=[key for w in range(len(df.index))])
To sort your dictionary, I think you can use an OrderedDict with the columns property of the DataFrames.
By using len(df.columns) you can get the number of columns for each frame, as sketched below.
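For the second part, a minimal sketch of bucketing the dataframes by column count and concatenating each bucket (using a plain defaultdict rather than an OrderedDict; the small sheet_to_df_map here just stands in for your dict):
from collections import defaultdict
import pandas as pd

# stand-in for the dict of dataframes from the question
sheet_to_df_map = {
    'sheet1': pd.DataFrame({'a': [1], 'b': [2]}),
    'sheet2': pd.DataFrame({'a': [3], 'b': [4]}),
    'sheet3': pd.DataFrame({'a': [5], 'b': [6], 'c': [7]}),
}

# tag each dataframe with its sheet name, in place
for key, df in sheet_to_df_map.items():
    df['sheet_name'] = key

# bucket by number of columns, then concatenate each bucket
by_ncols = defaultdict(list)
for df in sheet_to_df_map.values():
    by_ncols[len(df.columns)].append(df)

concatenated = {n: pd.concat(dfs, ignore_index=True) for n, dfs in by_ncols.items()}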
I think these links can be useful for you:
https://note.nkmk.me/en/python-pandas-len-shape-size/
https://www.geeksforgeeks.org/python-sort-python-dictionaries-by-key-or-value/
I've found a related question too:
Adding new column to existing DataFrame in Python pandas
Assuming I have a pandas DF as follows:
mydf=pd.DataFrame([{'sectionId':'f0910b98','xml':'<p/p>'},{'sectionId':'f0345b98','xml':'<a/a>'}])
mydf.set_index('sectionId', inplace=True)
I would like to get a dictionary out of it as follows:
{'f0910b98':'<p/p>', 'f0345b98':'<a/a>'}
I tried the following:
mydf.to_dict()
mydf.to_dict('records')
And it is not what I am looking for.
I am looking for the correct way to use to_dict()
Note: I know I can get the two columns into two lists and pack them in a dict like in:
mydict = dict(zip(mydf.sectionId, mydf.xml))
but I am looking for a direct pandas method (if there is one).
You could transpose your dataframe, call to_dict on it, and select the first item (the xml row, which is the only record after transposing).
mydf.T.to_dict(orient='records')[0]
returns
{'f0910b98': '<p/p>', 'f0345b98': '<a/a>'}
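For reference, since sectionId is already the index, selecting the single column also yields the same mapping, because Series.to_dict() maps index to values (a minimal alternative, assuming the same mydf):
mydf['xml'].to_dict()
# {'f0910b98': '<p/p>', 'f0345b98': '<a/a>'}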
Trying to iterate over a dataframe using iterrows, but it's telling me it is not defined.
After opening the Excel file with read_excel and getting the data into what I believe to be a dataframe, it will not let me use iterrows() on the dataframe.
df = pd.read_excel('file.xlsx')
objDF = pd.DataFrame(df['RDX'])  # throws: does not exist
for (i, r) in objDF.iterrows():
    # do stuff
Expected to be able to iterate over the rows and perform a calculation
Why are you trying to create a dataframe from a dataframe? Is the sole intention to just iterate across one column of the original dataframe? If so, you could access the column as follows:
df = pd.read_excel('file.xlsx')
for index, row in df.iterrows():
    print(row['RDX'])
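If the error is that 'RDX' does not exist, it may also be worth checking the exact header names read from the sheet; a minimal sketch (the whitespace-stripping step is only a guess at the cause, the file name is taken from the question):
import pandas as pd

df = pd.read_excel('file.xlsx')
print(df.columns.tolist())            # inspect the exact column names read from the sheet
df.columns = df.columns.str.strip()   # remove stray whitespace, if that is the culprit

for index, row in df.iterrows():
    value = row['RDX']                # per-row calculation goes here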
I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas; the small piece of code below shows what I did. Thanks for helping.
df.set_index('colx',drop=False,inplace=True)
# sort the index
df.sort_index(inplace=True)
I expect the result to be a dataframe with 'colx' as index.
Add the index to the PySpark dataframe as a column and use it:
rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# and extract the columns
df_index = df_index.withColumn('colA', df_index['_1'].getItem("colA"))
df_index = df_index.withColumn('colB', df_index['_1'].getItem("colB"))
This is not how it works with Spark; no such concept exists there.
One can add a column via zipWithIndex by converting the DF to an RDD and back, but that is a new column, so it is not the same thing.
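A minimal sketch of that zipWithIndex route, plus the closest Spark equivalent of set_index + sort_index (assuming an existing SparkSession named spark; the column names and sample data are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, 'b'), (1, 'a')], ['colx', 'coly'])  # hypothetical data

# zipWithIndex yields (Row, index) pairs; name the resulting columns explicitly
df_index = df.rdd.zipWithIndex().toDF(['row', 'idx'])
df_index = df_index.select('idx', 'row.colx', 'row.coly')  # pull the original columns out of the struct

# the closest equivalent of set_index + sort_index is simply ordering by the column
df_sorted = df.orderBy('colx')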
I get an empty dataframe when I try to group values using pivot_table. Let's first create some dummy data:
import pandas as pd
df = pd.DataFrame({"size": ['large', 'middle', 'xsmall', 'large', 'middle', 'small'],
                   "color": ['blue', 'blue', 'red', 'black', 'red', 'red']})
When I use:
df1 = df.pivot_table(index='size', aggfunc='count')
it returns what I expect. Now I would like to have a complete pivot table with the color as columns:
df2 = df.pivot_table(index='size', aggfunc='count',columns='color')
But this results in an empty dataframe. Why? How can I get a simple pivot table which counts me the number of combinations?
Thank you.
You need to use len as the aggfunc, like so:
df.pivot_table(index='size', aggfunc=len, columns='color')
If you want to use count, here are the steps:
First add a frequency columns, like so:
df['freq'] = df.groupby(['color', 'size'])['color'].transform('count')
Then create the pivot table using the frequency column:
df.pivot_table(values='freq', index='size', aggfunc='count', columns='color')
You need another column to be used as the values for the aggregation.
Add a column:
df['freq'] = 1
and your code will work, as sketched below.
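A minimal sketch of this second approach, using the same dummy data (values='freq' is passed explicitly to keep the column headers flat):
import pandas as pd

df = pd.DataFrame({"size": ['large', 'middle', 'xsmall', 'large', 'middle', 'small'],
                   "color": ['blue', 'blue', 'red', 'black', 'red', 'red']})
df['freq'] = 1   # constant column to serve as the values
df2 = df.pivot_table(values='freq', index='size', columns='color', aggfunc='count')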