Value error when plotting Dataframe from index - python

I have a dataframe which is of the following structure:
A          B
Location1  1
Location2  2
1          3
2          4
In the above example column A is the index. I am attempting to produce a scatter plot using the index and column B. This data frame is made by resampling and averaging another dataframe like so:
df = df.groupby("A").mean()
Now obviously this sets the index equal to column A, and I can plot it using the following, which is adapted from: Use index in pandas to plot data
df.reset_index().plot(x = "A",y = "B",kind="scatter", figsize=(10,10))
Now when I run this it returns the following:
ValueError: scatter requires x column to be numeric
Since the index column is intended to be a column of strings that I want to scatter-plot, how can I go about fixing this?

You may want to select only the integer rows:
import pandas as pd
d = {'A': ["Location1", "Location2", 1, 2], 'B': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df_numeric = df[pd.to_numeric(df.A, errors='coerce').notnull()]
print(df_numeric)
   A  B
2  1  3
3  2  4
Grouped by A:
df_numeric_grouped_by_A = df_numeric.groupby("A").mean()
print(df_numeric_grouped_by_A)
   B
A
1  3
2  4

You may have to transpose the DataFrame, so that you have the index (column A) values as column names, and then calculate the mean of the columns and plot them.
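Alternatively, since pandas' scatter plot insists on a numeric x column, one sketch (not from the answers above; the data below is a hypothetical stand-in for the question's frame) is to encode the string index as category codes and plot against those:

```python
import pandas as pd

# Hypothetical frame mirroring the question: a string-valued index "A"
df = pd.DataFrame({"B": [1, 2, 3, 4]},
                  index=["Location1", "Location2", "1", "2"])
df.index.name = "A"

# pandas' scatter needs a numeric x, so encode the strings as category codes
plot_df = df.reset_index()
plot_df["A_code"] = plot_df["A"].astype("category").cat.codes

# plot_df.plot(x="A_code", y="B", kind="scatter", figsize=(10, 10))
print(plot_df["A_code"].tolist())
```

You would then relabel the x-ticks with the original strings (e.g. via matplotlib's xticks), since the codes themselves are just lexically-ordered integers.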

Related

New column in dataset based on last value of item

I have this dataset
In [4]: df = pd.DataFrame({'A':[1, 2, 3, 4, 5]})
In [5]: df
Out[5]:
   A
0  1
1  2
2  3
3  4
4  5
I want to add a new column to the dataset based on the last value of each item, like this:
   A  New Column
0  1         NaN
1  2           1
2  3           2
3  4           3
4  5           4
I tried to use apply with iloc, but it didn't work.
Can you help?
Thank you
With your shown samples, please try the following. You can use the shift function to build the new column: it moves all elements of the given column down by one row, leaving a NaN in the first element.
import pandas as pd
df['New_Col'] = df['A'].shift()
OR
If you would like to fill the NaN with zero instead, try the following; the approach is the same as above:
import pandas as pd
df['New_Col'] = df['A'].shift().fillna(0)
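A minimal run of the two variants above on the question's frame, to show what shift() produces:

```python
import pandas as pd

# Reproducing the question's frame to show both variants side by side
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df['New_Col'] = df['A'].shift()                  # NaN in the first row
df['New_Col_filled'] = df['A'].shift().fillna(0) # NaN replaced by 0
print(df)
```

Note that shift() promotes the integer column to float, because NaN is a float value.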

Add multi level column to dataframe

At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
   test
  value
1     1
3     2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
       test  test2
      value value2
nodes
1         1      3
3         2      4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict, then concat along axis=1. The keys of the dict become levels of the column MultiIndex (tuple keys add multiple levels depending on their length; scalar keys add a single level) and the DataFrame columns stay as-is. Alignment is enforced on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo foo2
      bar bar2 bar1
      val val2 val2
nodes
0       1    4    7
1       2    5    8
2       3    6    9
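For the original two-column case there is also a more direct sketch (an assumption on my part, not from the answer above): assign with tuple keys and then promote the tuple column labels to a real MultiIndex.

```python
import pandas as pd

# Rebuild the question's empty frame with index "nodes"
df = pd.DataFrame({"nodes": [1, 3]}).set_index("nodes")

# Assigning with a tuple key creates a column whose label is that tuple
df[("test", "value")] = [1, 2]
df[("test2", "value2")] = [3, 4]

# Promote the tuple labels to a proper two-level MultiIndex
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
```

This avoids building an intermediate DataFrame per column, at the cost of a manual from_tuples step at the end.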

Plot top n metrics from a pandas pivot table

In a pandas dataframe, if I create a pivot and plot based on the columns, how can I restrict the number of columns being plotted based on an aggregated function?
Example,
Suppose I have a dataframe pivot as
     sum
name 'a' 'b' 'c'  'd'
key
1      1   2   3    4
2      1   2   3    4
3      1   2   3  NaN
which I plot with pivot.plot() (plot image omitted).
Now, I would want to plot only the top (2) columns based on average, such that only 'c' and 'd' show up, suppressing 'a' and 'b'.
How can I achieve this using pandas?
Input DataFrame and Plot
from io import StringIO
import numpy as np
import pandas as pd
TESTDATA = StringIO("""name;key;value
'a';1;1
'a';2;1
'a';3;1
'b';1;2
'b';2;2
'b';3;2
'c';1;3
'c';2;3
'c';3;3
'd';1;4
'd';2;4
""")
df = pd.read_csv(TESTDATA, sep=";")
pivot = pd.pivot_table(df, index='key', columns='name', values='value',aggfunc=[np.sum])
pivot.plot()
You can take the mean of the pivot, find the nlargest with n=2, and then use .loc[] to select only those columns and plot:
pivot.loc[:,pivot.mean().nlargest(2).index].plot()
#or pivot[pivot.mean().nlargest(2).index].plot()
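Putting the whole thing together as a self-contained run (same shape as the question's data, with the quotes dropped from the names for simplicity):

```python
from io import StringIO
import pandas as pd

TESTDATA = StringIO("""name;key;value
a;1;1
a;2;1
b;1;2
b;2;2
c;1;3
c;2;3
d;1;4
d;2;4
""")
df = pd.read_csv(TESTDATA, sep=";")
pivot = pd.pivot_table(df, index="key", columns="name",
                       values="value", aggfunc="sum")

top2 = pivot.mean().nlargest(2).index   # labels of the two largest column means
# pivot[top2].plot()                    # plot only those columns
print(list(top2))
```

nlargest returns the labels in descending order of the mean, so 'd' comes before 'c' here.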

Numeric differences between two different dataframes in python

I would like to find the numeric difference between two or more columns of two different dataframes.
The starting table (Table 1) and Table 2, which contains the single row of values that I need to subtract from Table 1, were shown as images (omitted here).
I would like to get a third table holding the numeric differences between each row of Table 1 and the single row from Table 2. Any help?
Try
df.subtract(df2.values)
with df being your starting table and df2 being Table 2.
Can you try this and see if this is what you need:
import pandas as pd
df = pd.DataFrame({'A':[5, 3, 1, 2, 2], 'B':[2, 3, 4, 2, 2]})
df2 = pd.DataFrame({'A':[1], 'B':[2]})
pd.DataFrame(df.values-df2.values, columns=df.columns)
Out:
   A  B
0  4  0
1  2  1
2  0  2
3  1  0
4  1  0
You can just do df1 - df2.values, like below. This uses NumPy broadcasting to subtract the single row of df2 from every row of df1, but df2 must have only one row.
example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(-1, 3), columns="A B C".split())
df2 = pd.DataFrame(np.ones(3).reshape(-1, 3), columns="A B C".split())
df1 - df2.values
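If Table 2 is a single-row DataFrame, a label-aligned alternative (a sketch of my own, not from the answers above) is to take its first row as a Series and use sub, which matches columns by name instead of by position:

```python
import pandas as pd

# Same frames as the earlier example
df = pd.DataFrame({'A': [5, 3, 1, 2, 2], 'B': [2, 3, 4, 2, 2]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# df2.iloc[0] is a Series indexed by column name; sub(..., axis=1)
# aligns it against df's columns, so column order no longer matters
result = df.sub(df2.iloc[0], axis=1)
print(result)
```

This is safer than the .values approach when the two tables might have their columns in different orders.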

Assign a series to ALL columns of the dataFrame (columnwise)?

I have a dataframe and a series of the same vertical size as the df, and I want to assign
that series to ALL columns of the DataFrame.
What is the natural way to do it?
For example
df = pd.DataFrame([[1, 2 ], [3, 4], [5 , 6]] )
ser = pd.Series([1, 2, 3 ])
I want all columns of "df" to be equal to "ser".
PS Related: one way to solve it is via this answer:
How to assign dataframe[boolean Mask] = Series - make it row-wise? I.e. where Mask = true, take values from the same row of the Series (creating an all-true mask), but I guess there should be some simpler way.
If I need NOT all, but only SOME columns, the answer is given here:
Assign a Series to several Rows of a Pandas DataFrame
Use to_frame with reindex:
a = ser.to_frame().reindex(columns=df.columns, method='ffill')
print (a)
   0  1
0  1  1
1  2  2
2  3  3
But the solution from the comments seems easier; the columns parameter was added in case you need the same column order as the original with real data:
df = pd.DataFrame({c:ser for c in df.columns}, columns=df.columns)
Maybe a different way to look at it:
df = pd.concat([ser] * df.shape[1], axis=1)
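Yet another way to look at it (a NumPy-based sketch, equivalent in result to the concat approach above) is to broadcast the Series across all columns and rebuild the frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
ser = pd.Series([1, 2, 3])

# Broadcast the (n,) Series to df's (n, m) shape; .copy() makes the
# read-only broadcast view writable before the constructor takes it
out = pd.DataFrame(
    np.broadcast_to(ser.to_numpy()[:, None], df.shape).copy(),
    index=df.index,
    columns=df.columns,
)
print(out)
```

Unlike the concat variant, this keeps the original column labels without needing to reassign them afterwards.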
