Is there a way, with a pandas DataFrame, to name only the first column, or the first and second columns, even if there are 4 columns?
Here
for x in range(1, len(table2_query) + 1):
    if x == 1:
        cursor.execute(table2_query[x])
        df = pd.DataFrame(data=cursor.fetchall(), columns=['Q', col_name[x-1]])
and it gives me this :
AssertionError: 2 columns passed, passed data had 4 columns
Consider the df:
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
df
Then use rename and pass a dictionary with the name changes to the columns argument:
df.rename(columns=dict(A='a', B='b'))
Instantiating a DataFrame while only naming a subset of the columns
When constructing a dataframe with pd.DataFrame, you either omit the index/columns arguments and let pandas auto-generate them, or you pass them in yourself. If you pass them in yourself, they must match the dimensions of your data. Mimicking pandas' auto-generated labels while overriding just the ones you want is ugly, probably slower, and not worth the trouble. In other words, I can't think of a good reason to do it.
On the other hand, it is super easy to rename the columns/index values. In fact, we can rename just a few. I think below is more in line with the spirit of your question:
df = pd.DataFrame(np.arange(8).reshape(2, 4)).rename(columns=str).rename(columns={'1': 'A', '3': 'F'})
df
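For the original cursor.fetchall() case, the same idea can be sketched as follows. The rows list is a hypothetical stand-in for the query result, and 'col_name' is just an illustrative label; the auto-generated labels 0..3 are what pandas assigns when no columns argument is passed.

```python
import pandas as pd

# Hypothetical stand-in for cursor.fetchall(): four values per row.
rows = [(1, "a", 10, 0.5), (2, "b", 20, 1.5)]

# Let pandas auto-number all four columns (0, 1, 2, 3) ...
df = pd.DataFrame(rows)

# ... then rename only the first two by their auto-generated labels.
df = df.rename(columns={0: "Q", 1: "col_name"})

print(list(df.columns))
```

The remaining two columns keep their integer labels 2 and 3.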
I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:
from numpy.random import randn
import pandas as pd
df = pd.DataFrame({'a':range(0,10,2), 'c':range(0,1000,200)},
                  columns=list('ac'))
df['b'] = 10*df.a
df
gives the following result:
Yet, if I were to try to create column b by substituting with the following line, there is no error message, yet the dataframe df remains with only the columns a and c.
df.b = 10*df.a ### rather than the previous df['b'] = 10*df.a ###
What has pandas done and why is my command incorrect?
What you did was add an attribute b to your df:
In [70]:
df.b = 10*df.a
df.b
Out[70]:
0 0
1 20
2 40
3 60
4 80
Name: a, dtype: int32
but we see that no new column has been added:
In [73]:
df.columns
Out[73]:
Index(['a', 'c'], dtype='object')
which means we get a KeyError if we try df['b']. To avoid this ambiguity you should always use square brackets when assigning.
For instance, if you had a column named index, sum or max, then df.index would return the index and not the index column, and similarly df.sum and df.max would return the DataFrame methods rather than those columns.
I strongly advise always using square brackets: it avoids any ambiguity, and the latest IPython is able to resolve column names using square brackets. It's also useful to think of a dataframe as a dict of Series, in which case it makes sense to use square brackets for assigning and returning a column.
Always use square brackets for assigning columns
Dot notation is a convenience for accessing columns in a dataframe. If they conflict with existing properties (e.g. if you had a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']. You also need to use square brackets when the column name contains spaces, e.g. df['max value'].
A DataFrame is just a Python object with the usual attributes and methods. If you use dot notation for assignment, you are setting an attribute on the dataframe object. So df.val = 2 gives df an attribute val with the value two. This is very different from df['val'] = 2, which creates a new column in the dataframe and assigns each element in that column the value of two.
To be safe, using square bracket notation will always provide the correct result.
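A minimal sketch of the difference, reusing the frame from the question. As far as I know, recent pandas versions emit a UserWarning on the dot assignment, so it is silenced here; treat that detail as an assumption about your installed version.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"a": range(0, 10, 2), "c": range(0, 1000, 200)})

# Dot assignment: sets a plain Python attribute on the object.
# The warning filter only keeps the demo quiet on versions that warn.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    df.b = 10 * df["a"]

attr_only = "b" not in df.columns  # no column was created

# Bracket assignment: creates a real column.
df["b"] = 10 * df["a"]
has_column = "b" in df.columns

print(attr_only, has_column, list(df.columns))
```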
As an aside, columns=list('ac') is a keyword argument to pd.DataFrame (the call continues onto the second line). When the data is a dict, it selects the listed keys and fixes their order. On Python 3.7+ with a current pandas, dicts keep their insertion order, so here the argument is redundant. On older Python versions dictionaries were unordered, so pd.DataFrame({'a': [...], 'b': [...]}) could return a dataframe with columns ['b', 'a'], and passing columns was how you pinned the order. Assigning df.columns = [...] afterwards, by contrast, only relabels the columns by position, so with an unpredictable order it could mix up the column headers.
The issue has to do with how attributes are handled in Python. Python places no restriction on setting new attributes on an instance, so for example you could do something like
df.myspecialstuff = ["dog", "cat", 5]
So when you do assignment like
df.b = 10*df.a
It is ambiguous whether you want to add a property or a new column, and a property is set. The easiest way to actually see what is going on with this is to use pdb and step through the code
import pdb
x = df.a
pdb.run("df.a1 = x")
This will step into the __setattr__() whereas pdb.run("df['a2'] = x") will step into __setitem__()
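If you would rather not step through pdb, the dispatch itself can be seen with a toy class. Recorder below is a hypothetical illustration, not pandas code; it only shows which special method Python calls for each assignment syntax.

```python
class Recorder:
    """Toy object that logs which assignment hook Python invokes."""

    def __init__(self):
        # Bypass our own __setattr__ so the log itself is not recorded.
        object.__setattr__(self, "calls", [])

    def __setattr__(self, name, value):
        self.calls.append(("__setattr__", name))
        object.__setattr__(self, name, value)

    def __setitem__(self, key, value):
        self.calls.append(("__setitem__", key))


r = Recorder()
r.a1 = 1      # dot assignment   -> __setattr__
r["a2"] = 1   # bracket assignment -> __setitem__
print(r.calls)
```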
I am writing a function that operates on the labels of a pandas dataframe and I want to have a parameter axis to decide whether to operate on index or columns.
So I wrote something like:
if axis==0:
    to_sort = df.index
elif axis==1:
    to_sort = df.columns
else:
    raise AttributeError
where df is a pandas dataframe.
Is there a better way of doing this?
Note I am not asking for a code review, but more specifically asking if there is a pandas attribute (something like labels would make sense to me) that allows me to get index or columns depending on a parameter/index to be passed.
For example (code not working):
df.labels[0] # index
df.labels[1] # columns
Short answer: You can use iloc(axis=...)
Documentation: http://pandas.pydata.org/pandas-docs/stable/advanced.html
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
(They seem to have omitted iloc in regards to the axis parameter)
A complete example
df = pd.DataFrame({"A":['a1', 'a2'], "B":['b1', 'b2']})
print(df)
Output:
    A   B
0  a1  b1
1  a2  b2
With axis=0
print(df.iloc(axis=0)[0].index)
Output:
Index(['A', 'B'], dtype='object')
With axis=1
print(df.iloc(axis=1)[0].index)
Output:
RangeIndex(start=0, stop=2, step=1)
Looking at reindex documentation examples, I realized I can do something like this:
Let the parameter be axis={'index', 'columns'}
Get the relevant labels using getattr: labels = getattr(df, axis)
Open to other pandas specific solutions.
If I were forced to use axis={1, 0}, then @Bharath's suggestion to use a helper function makes sense.
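One more option that seems close to the df.labels idea: DataFrame.axes is a list holding the row index first and the columns second, so it can be indexed by 0 or 1 directly. The get_labels helper below is a hypothetical name, a sketch that accepts both the integer and the string spelling of the axis.

```python
import pandas as pd


def get_labels(df, axis=0):
    """Return df.index for axis 0/'index', df.columns for axis 1/'columns'."""
    positions = {0: 0, 1: 1, "index": 0, "columns": 1}
    if axis not in positions:
        raise ValueError(f"No axis named {axis!r}")
    # df.axes is [row index, column index]
    return df.axes[positions[axis]]


df = pd.DataFrame({"A": ["a1", "a2"], "B": ["b1", "b2"]})
print(get_labels(df, 0))
print(get_labels(df, "columns"))
```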
I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row - specialSauce
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns but NaN for the df1 (and vice versa if df1/df2 switched in append)
So my problem is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with map() and doStuff, while df1 does not have that column. When you append one pd.DataFrame to the other, the result will have all the columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'], since this column does not exist in df1.
This would be the same if you were to use pd.concat(); both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
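To make the two behaviours concrete, here is a small sketch with made-up numbers (the values are placeholders, not the asker's data). It uses pd.concat for both directions, since DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Made-up stand-ins for the two frames in the question.
df1 = pd.DataFrame({"wins": [1, 2, 3], "runs": [4, 5, 6]})
df2 = pd.DataFrame({"specialSauce": [5, 7, 9]})

# Row-wise (what append did): columns are unioned, gaps become NaN.
stacked = pd.concat([df2, df1], ignore_index=True)

# Column-wise: rows are aligned on the index, no NaN here.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```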
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
   winnings   returns     spent      runs      wins  expected  specialSauce
0  0.166085  0.781964  0.852285 -0.707071 -0.931657  0.886661     -0.752067
1 -0.055704  1.163688  0.079710  0.155916 -1.212917 -0.045265     -1.102266
2 -0.554241  1.928014  0.271214 -0.462848  0.452802  1.692924      1.682878
3  0.627985  3.047389 -1.594841 -1.099262 -0.308115  4.356977      2.949601
4  0.796156  3.228755 -0.273482 -0.661442 -0.111355  2.827409      2.054611
Also
You tried to use pd.DataFrame.append(). Per its documentation, it attaches the DataFrame passed as the argument to the end of the DataFrame it is called on, i.e. it adds rows. You would have wanted to use pd.concat() with axis=1.
I am having a few issues with Dask DataFrames.
Let's say I have a dataframe with 2 columns ['a','b'].
If I want a new column c = a + b,
in pandas I would do:
df['c'] = df['a'] + df['b']
In dask I am doing the same operation as follows:
df = df.assign(c=(df.a + df.b).compute())
Is it possible to write this operation in a better way, similar to what we do in pandas?
The second question is troubling me more.
In pandas, if I want to change the value of 'a' for rows 2 and 6 to np.pi, I do the following:
df.loc[[2,6],'a'] = np.pi
I have not been able to figure out how to do a similar operation in Dask. My logic selects some rows and I only want to change values in those rows.
Edit: Add new columns
Setitem syntax now works in dask.dataframe
df['z'] = df.x + df.y
Old answer: Add new columns
You're correct that the setitem syntax doesn't work in dask.dataframe.
df['c'] = ... # mutation not supported
As you suggest you should instead use .assign(...).
df = df.assign(c=df.a + df.b)
In your example you have an unnecessary call to .compute(). Generally you want to call compute only at the very end, once you have your final result.
Change rows
As before, dask.dataframe does not support changing rows in place. Inplace operations are difficult to reason about in parallel codes. At the moment dask.dataframe has no nice alternative operation in this case. I've raised issue #653 for conversation on this topic.
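For what it's worth, both operations can be written without in-place mutation. The sketch below is plain pandas, not dask; whether your dask version accepts the same Series.mask and Index.isin calls is an assumption you would need to verify.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(8, dtype=float),
                   "b": np.arange(8, dtype=float)})

# New column, expressed as a new frame rather than a mutation.
df = df.assign(c=df["a"] + df["b"])

# "Change rows 2 and 6" expressed as a whole-column replacement:
# mask() swaps in np.pi wherever the boolean condition is True.
rows_to_change = df.index.isin([2, 6])
df = df.assign(a=df["a"].mask(rows_to_change, np.pi))

print(df)
```

The result matches what df.loc[[2,6],'a'] = np.pi would give in pandas, but every step returns a new object.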
Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!), you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for latter statement:
import pandas as pd
df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>> df.columns
Index([u'a'], dtype='object')
For some reason, though, df.b returns the correct results.
They do return the same thing. The column names in pandas are akin to dictionary keys that refer to a series. The column names themselves are named attributes that are part of the dataframe object.
The first method is preferred as it allows for spaces and other characters that are illegal in Python identifiers.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
They're the same, but the first method handles spaces and illegal characters in column names, so I prefer it. Example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a,  a, 1a]
Index: []
In [116]:
print(df.a) # works
print(df[' a']) # works
print(df.1a) # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
When you use dot notation, pandas tries to look the name up as an attribute. If you have used column names that match an existing attribute, the dot will not do what you expect.
Example:
In [121]:
df = pd.DataFrame(columns=['index'], data = np.random.randn(3))
df
Out[121]:
index
0 0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above has now shown the index as opposed to the column 'index'
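Square brackets still reach the column in that situation. A small sketch with made-up values, where 'max value' is an extra hypothetical column included to show the space case:

```python
import pandas as pd

df = pd.DataFrame({"index": [0.1, 0.2, 0.3], "max value": [1, 2, 3]})

row_labels = df.index            # the row index, not the column
index_column = df["index"]       # the column named 'index'
spaced_column = df["max value"]  # no dot-notation equivalent exists

print(row_labels)
print(index_column.tolist())
```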
If you are working on an ML project and need to extract the feature and target variables separately, the code below will be useful.
It selects the feature and target column names by indexing the list of columns; the names can then be applied to the dataframe. In this code, data is the DataFrame.
len_col=len(data.columns)
total_col=list(data.columns)
Target_col_Y=total_col[-1]
Feature_col_X=total_col[0:-1]
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output for the same can be obtained as given below:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]
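To go from the names to the actual feature and target frames, positional slicing with iloc is one option. The tiny data frame below is made up for illustration and assumes, as above, that the target is the last column.

```python
import pandas as pd

# Made-up example data; the last column is assumed to be the target.
data = pd.DataFrame({
    "age": [31, 45, 27],
    "job": ["admin", "technician", "services"],
    "output": ["no", "yes", "no"],
})

X = data.iloc[:, :-1]  # every column except the last
y = data.iloc[:, -1]   # the last column as a Series

print(list(X.columns), y.name)
```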