I keep getting different attribute errors when trying to run this file in IPython. I'm a beginner with pandas, so maybe I'm missing something.
Code:
from pandas import Series, DataFrame
import pandas as pd
import json
nan = float('NaN')
data = []
with open('file.json') as f:
    for line in f:
        data.append(json.loads(line))
df = DataFrame(data, columns=['accepted', 'user', 'object', 'response'])
clean = df.replace('NULL', nan)
clean = clean.dropna()
print(clean.value_counts())
AttributeError: 'DataFrame' object has no attribute 'value_counts'
Any ideas?
value_counts is a Series method rather than a DataFrame method (and you are trying to use it on a DataFrame, clean). You need to perform this on a specific column:
clean[column_name].value_counts()
It doesn't usually make sense to perform value_counts on a DataFrame, though I suppose you could apply it to every entry by flattening the underlying values array:
pd.value_counts(df.values.flatten())
To get the number of non-null entries in each column of a DataFrame, use df.count() (note that this counts entries, not distinct values).
value_counts() is also available as a DataFrame method since pandas 1.1.0:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html
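So on pandas 1.1.0 or later the original call works directly; a minimal sketch reusing the clean DataFrame from the question (the column subset in the second call is just an example):

# pandas >= 1.1.0: counts each unique row in the DataFrame
print(clean.value_counts())
# or restrict the count to a subset of columns
print(clean[['accepted', 'user']].value_counts())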
I had the same problem; it was working, but for some reason it stopped. I replaced it with a groupby:
grouped = pd.DataFrame(data.groupby(['col1','col2'])['col2'].count())
grouped.columns = ['Value_counts']
grouped
value_counts() works only on a Series; it won't work on an entire DataFrame. Try selecting a single column and calling the method on that.
For example:
df['accepted'].value_counts()
It also won't work if you have duplicate column names: when you select a particular column by name, the selection also picks up the duplicate and returns a DataFrame instead of a Series. In that case, remove the duplicate columns first:
df = df.loc[:,~df.columns.duplicated()]
df['accepted'].value_counts()
If you are using groupby(), store the result of data.groupby('column_name') in a variable, then select the column from that variable and apply value_counts() to it, like df = data.groupby('city') followed by df['city'].value_counts(). This worked for me.
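Spelled out as a sketch (using a hypothetical city column, and grouped instead of df so the original DataFrame name isn't shadowed):

# group by 'city', then count the values of that column within each group
grouped = data.groupby('city')
print(grouped['city'].value_counts())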
Related
I am trying to add a new column at the end of my pandas dataframe that will contain the other cells of each row as key:value pairs. I have tried the following:
import json
df["json_formatted"] = df.apply
(
lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)
It creates the column json_formatted successfully with all the required data, but the problem is that it also adds json_formatted itself as another extra key. I don't want that. I want the JSON data to contain only the information from the original df columns. How can I do that?
Note: I made ensure_ascii=False because the column names are in Japanese characters.
Create a new variable holding the created column and add it afterwards:
json_formatted = df.apply(lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1)
df['json_formatted'] = json_formatted
This behaviour shouldn't happen, but might be caused by your having run this function more than once. (You added the column, and then ran df.apply on the same dataframe).
You can avoid this by making your columns explicit: df[['col1', 'col2']].apply()
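For example, a sketch with hypothetical column names col1 and col2 standing in for your original columns:

# build the JSON from the original columns only, so json_formatted never feeds back into itself
df["json_formatted"] = df[["col1", "col2"]].apply(
    lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)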
apply is an expensive operation in pandas, and if performance matters it is better to avoid it. An alternative way to do this is:
df["json_formatted"] = [json.dumps(s, ensure_ascii=False) for s in df.T.to_dict().values()]
I am trying to filter out all rows that contain a specific character (¬) in pandas.
Note: there are about 1000 columns, so I can't reference column names explicitly in the code.
Attempt
filtered = pd.loc[:(pd == '¬').any(1).idxmax()]
output: 'DataFrame' object has no attribute 'str'
Try:
from pandas.api.types import is_string_dtype
import numpy as np
filtered=df.loc[np.array([df[col].str.contains("¬") for col in df.columns if is_string_dtype(df[col])]).any(axis=0)]
Here df is your DataFrame. You can restrict the number of columns to iterate through by tweaking the inner list comprehension and df.columns accordingly (as written, it iterates over all columns that are of string type).
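An equivalent sketch that selects the string columns up front and treats missing values as non-matches (the na=False choice is an assumption about how you want NaNs handled):

# keep rows where any string-typed column contains the character
str_cols = df.select_dtypes(include="object").columns
mask = df[str_cols].apply(lambda col: col.str.contains("¬", na=False)).any(axis=1)
filtered = df[mask]   # use df[~mask] instead if you want to drop those rows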
I am using python 2.7 with dask
I have a dataframe with one column of tuples that I created like this:
table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame)
I want to convert this tuple column back into two separate columns.
In pandas I would do it like this:
table[[col1,col2]] = table[col].apply(pd.Series)
The point of doing this is that a dask DataFrame does not support a MultiIndex. I want to group by multiple columns, so I wish to create a column of tuples that gives me a single index containing all the values I need (please ignore the efficiency comparison with a MultiIndex, since dask does not yet fully support it).
When I try to unpack the tuple column in dask using this code:
rxTable[["a","b"]] = rxTable["tup"].apply(lambda x: s(x), meta = pd.DataFrame, axis = 1)
I get this error
AttributeError: 'Series' object has no attribute 'columns'
when I try
rxTable[["a","b"]] = rxTable["tup"].apply(dd.Series, axis = 1, meta = pd.DataFrame)
I get the same
How can I take a column of tuples and convert it into two columns, like I can do in pandas with no problem?
Thanks
The best I found so far is converting to a pandas DataFrame, converting the column there, and then going back to dask:
df1 = df.compute()
df1[["a","b"]] = df1["c"].apply(pd.Series)
df = dd.from_pandas(df1,npartitions=1)
This will work well. If the df is too big for memory, you can either:
1. compute only the wanted column, convert it into two columns, and then use merge to get the split results back into the original df, or
2. split the df into chunks, convert each chunk, append it to an HDF5 file, and then use dask to read the entire HDF5 file into a dask DataFrame.
I found this methodology works well and avoids converting the Dask DataFrame to Pandas:
df['a'] = df['tup'].str.partition(sep)[0]
df['b'] = df['tup'].str.partition(sep)[2]
where sep is whatever delimiter was used in the column to separate the two elements (this assumes the column holds delimited strings rather than actual tuple objects).
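If the column really holds Python tuples rather than delimited strings, another option that stays in dask is to map each tuple element out separately; a minimal sketch, assuming the rxTable and "tup" column from the question:

# extract the two tuple elements without leaving dask
# (meta describes the resulting column; the "object" dtype here is an assumption)
rxTable["a"] = rxTable["tup"].map(lambda t: t[0], meta=("a", "object"))
rxTable["b"] = rxTable["tup"].map(lambda t: t[1], meta=("b", "object"))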
I am trying to filter rows on the column sample.single by excluding './.' values. All data types are objects. I have attempted the options below; I suspect the special character is compromising the second attempt. The DataFrame is comprised of 195 columns.
My gtdata columns:
Index(['sample.single', 'sample2.single', 'sample3.single'], dtype='object')
Please advise, thank you!
gtdata = gtdata[('sample.single')!= ('./.') ]
I receive a key error: KeyError: True
When I try:
gtdata = gtdata[gtdata.sample.single != ('./.') ]
I receive an attribute error:
AttributeError: 'DataFrame' object has no attribute 'single'
Not 100% sure what you're trying to achieve, but assuming you have cells containing the "./." string which you want to filter out, the below is one way to do it:
import pandas as pd

# generate some sample data
gtdata = pd.DataFrame({'sample.single': ['blah1', 'blah2', 'blah3./.'],
                       'sample2.single': ['blah1', 'blah2', 'blah3'],
                       'sample3.single': ['blah1', 'blah2', 'blah3']})

# filter out all cells in column sample.single containing ./.
# (regex=False makes this a literal match; otherwise "." would be treated as a regex wildcard)
gtdata = gtdata[~gtdata['sample.single'].str.contains("./.", regex=False)]
When subsetting rows in pandas you should pass a boolean vector with the same length as the number of rows in the DataFrame.
The problem with your first approach was that ('sample.single') != ('./.') evaluates to a single boolean value rather than a boolean vector. You're also comparing two strings, not any column in the DataFrame.
The problem with your second approach is that gtdata.sample.single doesn't make sense in pandas syntax (gtdata.sample actually resolves to the DataFrame.sample method, which is why the error complains about 'single'). To get the sample.single column you have to refer to it as gtdata['sample.single']. If your column name did not contain a ".", you could use the shorthand you were trying to use, e.g. gtdata.sample_single.
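For example, a corrected version of your first attempt compares the actual column against the string and keeps every row where sample.single is not './.':

# boolean vector over the rows, then use it to subset
gtdata = gtdata[gtdata['sample.single'] != './.']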
I recommend reviewing the documentation for subsetting Pandas DataFrames.
If I read a csv file into a pandas dataframe and then use a groupby (data.groupby(['column1', ...])), why is it that I cannot call to_excel on the new grouped object?
import pandas as pd
data = pd.read_csv("some file.csv")
data2 = data.groupby(['column1', 'column2'])
data2.to_excel("some file.xlsx") #spits out an error about series lacking the attribute 'to_excel'
data3 = pd.DataFrame(data=data2)
data3.to_excel("some file.xlsx") #works just perfectly!
Can someone explain why pandas needs to go through the whole process of converting from a dataframe to a series to group the rows?
I believe I was unclear in my question.
Re-framed question: Why does pandas convert the dataframe into a different kind of object (groupby object) when you use pd.groupby()? Clearly, you can cast this object as a dataframe, where the grouped columns become the (multi-level) indices.
Why not do this by default (without the user having to manually cast it as a dataframe)?
To answer your reframed question about why groupby gives you a groupby object and not a DataFrame: it does this for efficiency. The groupby object doesn't duplicate all the information in the original data; it essentially stores indices into the original DataFrame, indicating which group each row is in. This lets you use a single groupby object for multiple aggregating group operations, each of which may use different columns (e.g., you can do g = df.groupby('Blah') and then separately do g.SomeColumn.sum() and g.OtherColumn.mean()).
In short, the main point of groupby is to let you do aggregating computations on the groups. Simply pivoting the values of a single column out to an index level isn't what most people do with groupby. If you want to do that, you have to do it yourself.
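If the practical goal is just to write the grouped result to Excel, aggregate first so you get a regular DataFrame back; a sketch reusing the file and column names from the question (the count aggregation is only an example, use whatever aggregation you actually need):

import pandas as pd

data = pd.read_csv("some file.csv")
# size() counts rows per group; reset_index turns the group keys back into columns
summary = data.groupby(['column1', 'column2']).size().reset_index(name='count')
summary.to_excel("some file.xlsx", index=False)  # a plain DataFrame, so to_excel works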