I am trying to filter rows on column sample.single by excluding the value './.'. All of the data types are objects. I have attempted the options below; I suspect the special characters are compromising the second attempt. The DataFrame comprises 195 columns.
My gtdata columns:
Index(['sample.single', 'sample2.single', 'sample3.single'], dtype='object')
Please advise, thank you!
gtdata = gtdata[('sample.single')!= ('./.') ]
I receive a key error: KeyError: True
When I try:
gtdata = gtdata[gtdata.sample.single != ('./.') ]
I receive an attribute error:
AttributeError: 'DataFrame' object has no attribute 'single'
Not 100% sure what you're trying to achieve, but assuming you have cells containing the "./." string which you want to filter out, the below is one way to do it:
import pandas as pd
# generate some sample data
gtdata = pd.DataFrame({'sample.single': ['blah1', 'blah2', 'blah3./.'],
                       'sample2.single': ['blah1', 'blah2', 'blah3'],
                       'sample3.single': ['blah1', 'blah2', 'blah3']})
# filter out all rows whose sample.single value contains the literal "./."
gtdata = gtdata[~gtdata['sample.single'].str.contains("./.", regex=False)]
When subsetting in pandas you should pass a boolean vector with the same length as the DataFrame's index.
The problem with your first approach is that ('sample.single') != ('./.') evaluates to a single boolean value rather than a boolean vector. You are also comparing two string literals, not a column of the DataFrame.
The problem with your second approach is that gtdata.sample.single doesn't make sense in pandas syntax. To get the sample.single column you have to refer to it as gtdata['sample.single']. If your column name did not contain a ".", you could use the attribute shorthand you were trying to use, e.g. gtdata.sample_single.
I recommend reviewing the documentation for subsetting Pandas DataFrames.
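If you actually want an exact match against './.' rather than a substring match, a minimal sketch of the boolean-vector approach described above would be:
gtdata = gtdata[gtdata['sample.single'] != './.']
Here gtdata['sample.single'] != './.' produces one boolean per row, which is exactly what the subsetting syntax expects.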
def store_press_release_links(all_sublinks_df, column_names):
    all_press_release_links_df = pd.DataFrame(columns=column_names)
    for i in range(len(all_sublinks_df)):
        if (len(all_sublinks_df.loc[i,'sub_link'].str.contains('press|releases|Press|Releases|PRESS|RELEASES'))>0):
            all_press_release_links_df.loc[i, 'link'] = all_sublinks_df[i,'link']
            all_press_release_links_df.loc[i, 'sub_link'] = all_sublinks_df[i,'sublink']
        else:
            continue
    all_press_release_links_df = all_press_release_links_df.drop_duplicates()
    all_press_release_links_df.reset_index(drop=True, inplace=True)
    return all_press_release_links_df
store_press_release_links() is a function that accepts a dataframe, all_sublinks_df, which has two columns: 1. link 2. sub_link
The contents of both these columns are link names.
I want to look through all the link names present in the sub_link column of the all_sublinks_df dataframe one by one and check whether the link has the keywords 'press|releases|Press|Releases|PRESS|RELEASES' in it.
If it does, then I want to store that entire row of the all_sublinks_DF Dataframe to a new dataframe all_press_release_links_df.
But when I run this function it gives the error : AttributeError: 'str' object has no attribute 'str'
Where am I going wrong?
I think what you want to do, instead of the loop, is just:
all_press_release_links_df = all_sublinks_df[all_sublinks_df['sub_link'].str.contains('press|releases|Press|Releases|PRESS|RELEASES')]
There are many things wrong here. There are almost no situations where you want to loop through a pandas DataFrame, and equally few where you want to build a DataFrame row by row.
Here, the whole task can be done in a single operation (split over two lines for readability), like so:
def store_press_release_links(all_sublinks_df):
    check_str = 'press|releases|Press|Releases|PRESS|RELEASES'
    return all_sublinks_df[all_sublinks_df.sub_link.str.contains(check_str)]
The reason you got that error message is that you were selecting individual cells of the DataFrame, which are plain strings rather than pandas.Series objects; the .str accessor only exists on a pd.Series.
Also note that the column_names argument is no longer needed.
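As a side note, the long alternation of upper- and lower-case variants can be avoided with the case argument of str.contains. A small sketch, assuming the same column names as in the question:
check_str = 'press|releases'
mask = all_sublinks_df['sub_link'].str.contains(check_str, case=False, na=False)
press_release_links_df = all_sublinks_df[mask]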
I'm using pandas to read a simple CSV file of election results:
constituency,anug,apnuafc,cg,ljp,pppc,...
Barima-Waini,0,3905,0,170,8022,...
Pomeroon-Supenaam,86,7343,149,120,18788,...
Essequibo Islands-West Demerara,310,23811,318,0,47855,...
...
I access this with election.votes in views.py:
results = pd.read_csv(election.votes)
For each row I want to add a new column for the winning party. I've tried:
results["winner"] = results.max(axis=1)
But this adds the highest value, not the corresponding column header. So I've tried:
results["winner"] = results.idxmax(axis=1)
I then get the error reduction operation 'argmax' not allowed for this dtype.
Because of the strings of the constituencies I can't use to_numeric to make idxmax work.
Is there another efficient way to get the column header?
Use DataFrame.select_dtypes to get only the numeric columns:
import numpy as np
results["winner"] = results.select_dtypes(np.number).idxmax(axis=1)
I'm new to Python and coding in general. I am attempting to automate the processing of some groundwater model output data in Python. One pandas dataframe has measured stream flow with multiple columns of various types (left), the other has modeled stream flow (right). I've attempted to use pd.merge on column "Name" in order to link the correct modeled output value to the corresponding measured site value. When I use the following script I get this error:
left = measured_df
right = modeled_df
combined_df = pd.merge(left, right, on= 'Name')
ValueError: The column label 'Name' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
The modeled data for each stream starts out as a numpy array (not sure about the dtype)
array(['silver_drn', '24.681524615195002'], dtype='<U18')
I then use np.concatenate to combine the 6 stream outputs into one array:
modeled = np.concatenate([[blitz_drn],[silvies_ss_drn],[silvies_drn],[bridge_drn],[krumbo_drn], [silver_drn]])
Then pd.DataFrame to create a pandas data frame with a column header:
modeled_df = pd.DataFrame(data=modeled, columns= [['Name','Modeled discharge (CFS)']])
Perhaps I'm misunderstanding how pd.merge works, or maybe the datatypes are different even if they appear to be text, but I figured that if each column was a string it would append the modeled output to the corresponding row where the "Name" matches in each dataframe. Any help would be greatly appreciated.
When you do this:
modeled_df = pd.DataFrame(data=modeled,
                          columns=[['Name','Modeled discharge (CFS)']])
you create a MultiIndex on the columns, and merging a DataFrame with MultiIndex columns into one with a normal column index doesn't work the way you might expect.
You should instead do:
modeled_df = pd.DataFrame(data=modeled,
                          columns=['Name','Modeled discharge (CFS)'])  # single brackets, not double
Then the merge should work as expected.
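A minimal end-to-end sketch of the fix (the 'silver_drn' row comes from the question; the other numbers and the measured column name are made up for illustration):
import pandas as pd
import numpy as np

modeled = np.array([['silver_drn', '24.681524615195002'],
                    ['blitz_drn', '10.5']])
modeled_df = pd.DataFrame(data=modeled, columns=['Name', 'Modeled discharge (CFS)'])

measured_df = pd.DataFrame({'Name': ['silver_drn', 'blitz_drn'],
                            'Measured discharge (CFS)': [25.1, 9.8]})

# with plain (non-MultiIndex) columns on both sides, the merge on 'Name' works
combined_df = pd.merge(measured_df, modeled_df, on='Name')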
I have implemented the code below:
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a dataframe consisting of two columns extracted from a previously imported dataset. I use the parameter i because this code is part of a loop that uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them; each iteration checks the correlation with a different variable [i].
Python treats this object as a list, or as a tuple when I change the brackets to round ones. I need this object to be a dataframe (the next step is to remove NaN values using .dropna, which is a DataFrame attribute).
How can I fix that issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0,i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
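For example, inside the loop described in the question it could look like the sketch below (the table contents and the correlation call are only illustrative of the "predefined function" step):
import pandas as pd
import numpy as np

# `table` stands in for the previously imported dataset
table = pd.DataFrame({'fixed': [1.0, 2.0, 3.0, np.nan],
                      'var_a': [2.0, 4.0, 6.0, 8.0],
                      'var_b': [1.0, np.nan, 2.0, 3.0]})

for i in range(1, table.shape[1]):
    sub = table.iloc[:, [0, i]].dropna()          # two-column DataFrame, NaNs removed
    print(sub.columns[1], sub.corr().iloc[0, 1])  # correlation with the fixed column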
What is your goal with the dataframe?
A DataFrame is the standard tabular data structure in pandas for data analysis.
Pandas was developed to facilitate exactly this kind of analysis; reading the data in a .csv file into a dataframe is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you want:
df.loc[:, ['COLUMN_NAME_1', 'COLUMN_NAME_2']]
Let us know if that's what you are looking for.
I keep getting different attribute errors when trying to run this file in IPython... I'm a beginner with pandas, so maybe I'm missing something.
Code:
from pandas import Series, DataFrame
import pandas as pd
import json
nan=float('NaN')
data = []
with open('file.json') as f:
    for line in f:
        data.append(json.loads(line))
df = DataFrame(data, columns=['accepted', 'user', 'object', 'response'])
clean = df.replace('NULL', nan)
clean = clean.dropna()
print clean.value_counts()
AttributeError: 'DataFrame' object has no attribute 'value_counts'
Any ideas?
value_counts is a Series method rather than a DataFrame method (and you are trying to use it on a DataFrame, clean). You need to perform this on a specific column:
clean[column_name].value_counts()
It doesn't usually make sense to perform value_counts on a DataFrame, though I suppose you could apply it to every entry by flattening the underlying values array:
pd.value_counts(df.values.flatten())
To get the number of non-null entries in each column of a dataframe, it's just df.count()
value_counts() is also available as a DataFrame method since pandas 1.1.0:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html
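A small sketch of what the DataFrame-level method does; it counts unique rows, i.e. unique combinations of column values:
import pandas as pd

df = pd.DataFrame({'accepted': ['yes', 'yes', 'no'],
                   'user': ['a', 'a', 'b']})
print(df.value_counts())
# counts unique rows: ('yes', 'a') -> 2, ('no', 'b') -> 1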
I had the same problem; it was working, but now for some reason it is not. I replaced it with a groupby:
grouped = pd.DataFrame(data.groupby(['col1','col2'])['col2'].count())
grouped.columns = ['Value_counts']
grouped
value_counts works only on a Series; it won't work on an entire DataFrame. Try selecting a single column and using this method.
For example:
df['accepted'].value_counts()
It also won't work if you have duplicate column names, because selecting that column then returns a DataFrame (all columns with that name) instead of a Series. In that case, remove the duplicate columns first:
df = df.loc[:,~df.columns.duplicated()]
df['accepted'].value_counts()
If you are using groupby(), just create a new variable to store data.groupby('column_name'), then take that variable and access the column again, applying value_counts(). For example, df = data.groupby('city'), then df['city'].value_counts(). This worked for me.
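A small sketch of that groupby-based approach (the 'city' column name is just illustrative):
import pandas as pd

data = pd.DataFrame({'city': ['Lima', 'Lima', 'Quito']})
grouped = data.groupby('city')
print(grouped['city'].value_counts())
# Lima/Lima -> 2, Quito/Quito -> 1; effectively the same counts as data['city'].value_counts()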