I'm confused why the following pandas code does not successfully assign the last two values of column A to the first two entries of column B2:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7], 'B': [10, 20, 30, 40, 50, 60, 70]})
df = df.join(pd.DataFrame({'C': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}))
df['B2'] = df.B.shift(2)
df[:2].B2 = list(df[-2:].A)
What's perplexing to me is that in an (apparently) equivalent "real" application, it does appear to work (and to generate some strange behavior).
Why does the final assignment fail to change the values of the two entries in the dataframe?
It can work, and that's why it's insidious; see here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Generally, with multi-dtyped frames, whether it works depends on how the frame was constructed (e.g. if you create it all at once, I think it will always work). Since you are adding the column afterwards (via join), it depends on the underlying numpy view-creation mechanics.
don't ever ever ever assign like that, use loc
df.loc[:2,'B2'] = ....
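For reference, here is a minimal sketch of the .loc approach applied to the frame from the question (note that .loc slices by label and is inclusive, so on a default RangeIndex the first two rows are 0:1):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7],
                   'B': [10, 20, 30, 40, 50, 60, 70]})
df = df.join(pd.DataFrame({'C': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}))
df['B2'] = df.B.shift(2)

# .loc writes into the frame directly, so there is no view-versus-copy ambiguity
df.loc[0:1, 'B2'] = list(df.A.iloc[-2:])
print(df.head())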
Related
I'm new to Dash and trying to filter the following dataframe with a dropdown and show the filtered dataframe. I am not able to define the callback function properly. Any sample code is really appreciated.
df = pd.DataFrame({
    'col1': ['A', 'B', 'B', 'A', 'B', 'A'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
})
Here is the page from the docs that I think might be a good place to start. If you know the dataframe ahead of time, then you can populate the dropdown with the values that you know you'll need. If you don't, then you would need to use a callback to update the dropdown options prop.
The first example on that docs page shows how to get the value from a dropdown, and output it. In your case, you'd use the dropdown to filter the dataframe perhaps like:
@app.callback(
    Output('data-table-id', 'data'),
    [Input('dropdown-id', 'value')]
)
def callback_func(dropdown_value):
    df_filtered = df[df['col1'].eq(dropdown_value)]
    return df_filtered.to_dict(orient='records')
And then you would set the output of the callback to the data prop of a Dash DataTable. I've used some made-up ids for the Output and Input components; hopefully that gives the general idea.
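If it helps, here is a rough, self-contained sketch of how the pieces could fit together. The component ids, the dropdown default, and the Dash 2-style imports are my assumptions, not taken from your code:
import dash
from dash import dcc, html, dash_table, Input, Output
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'B', 'A', 'B', 'A'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
})

app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id='dropdown-id',
                 options=[{'label': v, 'value': v} for v in df['col1'].unique()],
                 value='A'),
    dash_table.DataTable(id='data-table-id',
                         columns=[{'name': c, 'id': c} for c in df.columns]),
])

@app.callback(Output('data-table-id', 'data'),
              [Input('dropdown-id', 'value')])
def filter_table(dropdown_value):
    # keep only the rows matching the dropdown selection
    return df[df['col1'].eq(dropdown_value)].to_dict(orient='records')

if __name__ == '__main__':
    app.run_server(debug=True)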
I have a dataframe of decent size (95,000 rows, 68 columns). When I load the Excel file, a certain column goes from text to being interpreted as an int. I either need a way to specify on loading that a certain column should hold strings, or I need to figure out why astype(str) is performing so slowly.
Example code below
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': ['1.1', '2.1', '3.0', '4.1', '5.1']})
df['A'] = df['A'].astype(str) + 'addedtext'
print(df)
This code works fine and does exactly what I want: it changes column A from ints to strings, and I verify they are strings by concatenating them with other strings via + and getting the result I want.
The problem is, running this on a single column (95,000 rows) of my other dataframe takes 7-8 minutes. I feel like that's very slow for such a simple change? Maybe I'm crazy. Is there a faster method? Is there a way to load a csv or Excel file but specify beforehand that a certain column should have a particular data type? Could it have to do with the fact that there is a very large number of nulls in that column?
EDIT: I'm just dumb. My timing loop contained the saving of the file.
I would maybe try a list comprehension:
import time
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': ['1.1', '2.1', '3.0', '4.1', '5.1']})

start = time.time()
df['A'] = df['A'].astype(str) + 'addedtext'
end = time.time()
print('original:', end - start)

start = time.time()
# note: df['A'] already holds strings at this point from the previous step
df['A'] = [str(x) + 'addedtext' for x in df['A']]
end = time.time()
print('slightly faster alternative', end - start)
my results were:
original: 0.002001523971557617
slightly faster alternative 0.000997781753540039
Hope this helps
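Since the question also asks about forcing a column's type at load time: read_csv and read_excel both accept a dtype mapping, so something along these lines should avoid the conversion entirely (the file name and column name here are just placeholders):
import pandas as pd

# keep 'my_text_col' as strings while loading; other columns are still inferred
df = pd.read_excel('my_file.xlsx', dtype={'my_text_col': str})
# the same keyword works for CSV files
df = pd.read_csv('my_file.csv', dtype={'my_text_col': str})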
I want to add two constant columns to a list of lists, as the 1st and the 2nd column.
I wrote two loops, because I wanted to add two columns.
I use this code:
list_a = [[1, 'aa'], [2, 'bb'], [3, 'cc'], [4, 'dd']]
constant1 = 2018
constant2 = 30

for item in list_a:
    item.insert(0, constant1)
for item in list_a:
    item.insert(1, constant2)

print(list_a)
The output is:
[[2018, 30, 1, 'aa'], [2018, 30, 2, 'bb'], [2018, 30, 3, 'cc'], [2018, 30, 4, 'dd']]
But I think this is not a good idea; is there any other way?
You can use a list comprehension:
res = [[2018, 30, *v] for v in list_a]
[[2018, 30, 1, 'aa'],
[2018, 30, 2, 'bb'],
[2018, 30, 3, 'cc'],
[2018, 30, 4, 'dd']]
Making operations "in place" has no significant advantage here. You have the same time complexity either way.
It is easy with a pandas dataframe, if you are ok with creating a new list rather than changing it in place and do not mind installing an extra package. pandas is highly recommended if you work with large tables. For simplicity I assume the columns are typed, which is the case in your example.
The code below is based on
how do I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame(list_a)
df.insert(0, "first", value=constant1)
df.insert(1, "second", value=constant2)
list_a = df.values.tolist()
Converting back to lists from the dataframe is a bit slow, but you probably won't need to: once you get a taste of working with dataframes, you rarely want to go back to nested lists.
While I kept this close to your solution, pandas is quite flexible; alternatively, you can create a dataframe with the two constant columns and concatenate it, if you fancy a one-liner, see
Is it possible to add several columns at once to a pandas DataFrame?
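A rough sketch of that concatenation idea (the column names "first" and "second" are just placeholders):
import pandas as pd

list_a = [[1, 'aa'], [2, 'bb'], [3, 'cc'], [4, 'dd']]
constant1, constant2 = 2018, 30

# broadcast the two constants across all rows, then glue the original data on the right
extra = pd.DataFrame({'first': constant1, 'second': constant2},
                     index=range(len(list_a)))
df = pd.concat([extra, pd.DataFrame(list_a)], axis=1)
list_a = df.values.tolist()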
If you are restricted to standard list operations, or prefer to modify the list in place, one loop should suffice:
for item in list_a:
    item.insert(0, constant2)
    item.insert(0, constant1)
And you can rebuild each row in one line:
for (i, item) in enumerate(list_a):
    list_a[i] = [constant1, constant2] + item
If you are ok with creating a new list rather than changing list_a in place, and prefer standard list operations, a list comprehension would be even more compact and slightly faster than the loop.
It's well known (and understandable) that pandas behavior is essentially unpredictable when assigning to a slice. But I'm used to being warned about it by the SettingWithCopyWarning.
Why is the warning not generated in either of the following two code snippets, and what techniques could reduce the chance of writing such code unintentionally?
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data
data[0] == 100
True
I thought the explanation was that pandas only produces the warning when the parent DataFrame is still reachable from the current context. (This would be a weakness of the detection algorithm, as my previous examples show.)
In the next snippet, AFAIK the original two-column DataFrame is no longer reachable, and yet the pandas warning mechanism manages to trigger (luckily):
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data[['a']]
new_data.loc[0] = 100 # warning, so we're safe
Edit:
While investigating this, I found another case of a missing warning:
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Even though an almost identical example does trigger a warning:
data = pd.DataFrame({'a': [1, 2, 2], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Update: I'm responding to the answer by @firelynx here because it's hard to put it in a comment.
In that answer, @firelynx says that the first code snippet results in no warning because I'm taking the entire dataframe. But even if I take only part of it, I still don't get a warning:
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': range(3)})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True
Explaining what you're doing, step by step
The DataFrame you create is not a view:
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view
False
new_data is also not a view, because you are taking all columns
new_data = data[['a', 'b']]
new_data._is_view
False
Now you are assigning data to be the Series 'a':
data = data['a']
type(data)
pandas.core.series.Series
Which is a view
data._is_view
True
Now you update a value in the non-copy new_data
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
This should not give a warning. It is the whole dataframe.
The Series you've created flags itself as a view, but it's not a DataFrame and does not behave as a DataFrame view.
Avoiding writing code like this
The Series vs. DataFrame problem is a very common one in pandas [citation not needed if you've worked with pandas for a while].
The problem is really that you should always be writing
data[['a']] not data['a']
The left one gives you a single-column DataFrame; the right one gives you a Series.
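A quick illustration of the difference, using the data frame from the question:
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
print(type(data[['a']]))   # <class 'pandas.core.frame.DataFrame'>
print(type(data['a']))     # <class 'pandas.core.series.Series'>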
Some people may argue that you should never write data['a'] but use data.a instead, so that you can add warnings to your environment for data['a'] code.
This does not work. First of all, the data.a syntax causes cognitive dissonance.
A dataframe is a collection of columns. In Python we access members of collections with the [] operator, and attributes with the . operator. Switching these around causes cognitive dissonance for anyone who is a Python programmer, especially when you start doing things like del data.a and notice that it does not work. See this answer for a more extensive explanation.
Clean code to the rescue
It is hard to see the difference between data[['a']] and data['a']
This is a smell. We should be doing neither.
The proper way, using clean code principles and the Zen of Python's "Explicit is better than implicit", is this:
columns = ['a']
data[columns]
This may not be so mind-boggling, but take a look at the following example:
data[['ad', 'cpc', 'roi']]
What does this mean? What are these columns? What data are you getting here?
These are the first questions to arrive in anyone's head when reading this line of code.
How to solve it? Not with a comment.
ad_performance_columns = ['ad', 'cpc', 'roi']
data[ad_performance_columns]
More explicit is always better.
For more, please consider buying a book on clean code. Maybe this one
I came across a strange result while playing around with pandas and I am not sure why it works like this. I am wondering if it is a bug.
cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'], 'nn': [1, 2, 3, 4], 'mvl': [10, 20, 30, 40]})
df = cf.groupby('sc').mean()
df.loc['b', 'mvl']
This gives "15.0" as result.
cf1 = cf
cf1['sc'] = cf1['sc'].astype('category', categories=['b', 'c', 'd'], ordered=True)
df1 = cf1.groupby('sc').mean()
df1.loc['b','mvl']
This gives as result a Series:
sc
b 15.0
Name: mvl, dtype: float64
type(df1.loc['b','mvl']) -> pandas.core.series.Series
type(df.loc['b','mvl']) -> numpy.float64
Why would declaring the variable as categorical change the output of the loc from a scalar to a Series?
I hope it is not a stupid question. Thanks!
This may be a pandas bug. The difference is due to the fact that when you group on a categorical variable, you get a categorical index. You can see it more simply without any groupby:
nocat = pandas.Series(['a', 'b', 'c'])
cat = nocat.astype('category', categories=['a', 'b', 'c'], ordered=True)
xno = pandas.Series([8, 88, 888], index=nocat)
xcat = pandas.Series([8, 88, 888], index=cat)
>>> xno.loc['a']
8
>>> xcat.loc['a']
a 8
dtype: int64
The docs note that indexing operations on a CategoricalIndex preserve the categorical index. It appears they even do this if you get only one result, which doesn't exactly contradict the docs but seems like undesirable behavior.
There is a related pull request that seems to fix this behavior, but it was only recently merged. It looks like the fix should be in pandas 0.18.1.
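Until you can upgrade to a version with that fix, one possible workaround (just a sketch, using the xcat series from above) is to pull the scalar out explicitly:
import pandas

val = xcat.loc['a']
if isinstance(val, pandas.Series):   # on affected versions this is a 1-element Series
    val = val.iloc[0]
print(val)   # 8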