Creating a dataframe in a for loop based on another dataframe - python

I have a data frame, df, and I'd like to get every column in it along with the count of unique values in that column, and save the result as another data frame. I can't seem to find a way to do that. I can, however, print what I want to the console. Here's what I mean:
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print(evry_colm, "-", df[evry_colm].value_counts().count())
Now that prints what I want just fine. Instead of printing, if I do something like newdf = pd.DataFrame(evry_colm, df[evry_colm].value_counts().count(), columns = ('a', 'b')), it throws an error that reads "TypeError: object of type 'numpy.int32' has no len()". Obviously, that isn't right.
So, how can I make a data frame with columns like columnName and UniqueCounts?

To count unique values per column you can use apply with the nunique function on the data frame.
Something like:
import pandas as pd

df = pd.DataFrame([
    {'a': 1, 'b': 2},
    {'a': 2, 'b': 2}
])

count_series = df.apply(lambda col: col.nunique())
# the returned object is a pandas Series:
# a    2
# b    1
# to map it to a DataFrame, try
pd.DataFrame(count_series).T
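If you want the exact columnName / UniqueCounts layout from the question, here is a minimal sketch (the two column names are just the ones the question suggested):

import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 2}])

# df.nunique() returns a Series indexed by column name;
# reset_index(name=...) turns it into a two-column DataFrame
newdf = df.nunique().rename_axis('columnName').reset_index(name='UniqueCounts')
print(newdf)
#   columnName  UniqueCounts
# 0          a             2
# 1          b             1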

import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
print(df)
print()
df = pd.DataFrame({col: [df[col].nunique()] for col in df})
print(df)
Output:

   A  B
0  1  1
1  1  2
2  2  3
3  2  4

   A  B
0  2  4

Related

Python Pandas find value in dataframe regardless of column

Is there a simple way to check for a value within a dataframe when it could be in any of several columns? For example, iterating with iterrows and searching each row for the value to find which column it is in, or checking the dataframe as a whole and getting the value's position (like iat coordinates).
import pandas as pd
d = {'id': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [8,3,9]}
df = pd.DataFrame(data=d)
df = df.set_index('id')
df
Sample Data:

    col2  col3
id
1      3     8
2      4     3
3      5     9
Find 3
df.isin([3]).any()
Output (per column):
col2    True
col3    True
dtype: bool
Want more details? Here you go:
df[df.isin([3])].stack().index.tolist()
Co-ordinates output:
[(1, 'col2'), (2, 'col3')]
You can also search for the value across the dataframe and get a Boolean dataframe for your search. This gives you all rows where var1 appears anywhere in df:
df[df.eq(var1).any(axis=1)]
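For instance, with the sample data above (using var1 = 3 purely for illustration), a quick sketch:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [8, 3, 9]}).set_index('id')
var1 = 3

# keep only the rows in which any column equals var1
print(df[df.eq(var1).any(axis=1)])
#     col2  col3
# id
# 1      3     8
# 2      4     3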

Python to retrieve condition based columns

I have to retrieve all rows from w_loaded_updated_iod.xlsx where the column Waived = Yes.
I have tried this:
import pandas as pd
excel1 = 'C:/Users/gopoluri/Desktop/Latest/w_loaded_updated_iod.xlsx'
df1 = pd.read_excel(excel1)
values1 = df1[0 : 7]
dataframes = [values1]
df1.loc[df1['Waived'] == 'Yes'].to_excel("output11.xlsx")
But I am getting all columns. I need all matching rows, but only from column 2, column 3, column 5, and column 8. Can anyone please correct my code if anything is wrong?
Like below:
you can get columns x, y, z from your dataframe by selecting them as follows:
df = df[["x", "y", "z"]]
(Note that df.loc[["x", "y", "z"]] would select rows by index label, not columns; use df.loc[:, ["x", "y", "z"]] if you prefer loc.)
Example:
df = pd.DataFrame(dict(a=[1,2,3],b=[3,4,5],c=[5,6,7]))
df = df[["a","b"]]
df # prints output:
   a  b
0  1  3
1  2  4
2  3  5
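Applied to the original question, a sketch that combines the row filter with the column selection; the column names below are placeholders, since the real headers in w_loaded_updated_iod.xlsx aren't shown:

import pandas as pd

df1 = pd.read_excel('C:/Users/gopoluri/Desktop/Latest/w_loaded_updated_iod.xlsx')

# filter rows where Waived == 'Yes' and keep only the wanted columns
wanted_cols = ['col2', 'col3', 'col5', 'col8']  # placeholder names
df1.loc[df1['Waived'] == 'Yes', wanted_cols].to_excel('output11.xlsx')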

pandas groupby() with custom aggregate function and put result in a new column

Suppose I have a dataframe with 3 columns. I want to group it by one of the columns and compute a new value for each group using a custom aggregate function.
This new value has a totally different meaning, and its column is simply not present in the original dataframe. So, in effect, I want to change the shape of the dataframe during the groupby() + agg() transformation. The original dataframe looks like (foo, bar, baz) and has a range index, while the resulting dataframe needs to have only the (qux) column and baz as an index.
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], 'bar': ['a', 'b', 'c'], 'baz': [0, 0, 1]})
df.head()
#    foo bar  baz
# 0    1   a    0
# 1    2   b    0
# 2    3   c    1

def calc_qux(gdf, **kw):
    qux = ','.join(map(str, gdf['foo'])) + ''.join(gdf['bar'])
    return (None, None)  # but I want (None, None, qux)

df = df.groupby('baz').agg(calc_qux, axis=1)  # ['qux'] -- but then it fails, since 'qux' is not present in the frame
df.head()
#        qux
# baz
# 0    1,2ab
# 1       3c
The code above produces the error ValueError: Shape of passed values is (2, 3), indices imply (2, 2) if I try to return a different number of values from the aggregation function than the number of columns in the original dataframe.
You want to use apply() here since you are not operating on a single column (in which case agg() would be appropriate):
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], 'bar': ['a', 'b', 'c'], 'baz': [0, 0, 1]})

def calc_qux(x):
    return ','.join(x['foo'].astype(str).values) + ''.join(x['bar'].values)

df.groupby('baz').apply(calc_qux).to_frame('qux')
Yields:

       qux
baz
0    1,2ab
1       3c
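If you later need more than one computed column per group, a common pattern (a sketch, not part of the original answer) is to return a pd.Series from the applied function; each entry of the Series becomes a column:

import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], 'bar': ['a', 'b', 'c'], 'baz': [0, 0, 1]})

def calc_many(g):
    # the Series index labels become the output column names
    return pd.Series({
        'qux': ','.join(g['foo'].astype(str)) + ''.join(g['bar']),
        'n_rows': len(g),
    })

print(df.groupby('baz').apply(calc_many))
#        qux  n_rows
# baz
# 0    1,2ab       2
# 1       3c       1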

How to convert data of type Panda to Panda.Dataframe?

I have an object whose type is Pandas, and print(object) gives the output below:
print(type(recomen_total))
print(recomen_total)
Output is
<class 'pandas.core.frame.Pandas'>
Pandas(Index=12, instrument_1='XXXXXX', instrument_2='XXXX', trade_strategy='XXX', earliest_timestamp='2016-08-02T10:00:00+0530', latest_timestamp='2016-08-02T10:00:00+0530', xy_signal_count=1)
I want to convert this object to a pd.DataFrame. How can I do it?
I tried pd.DataFrame(object) and from_dict as well, but they throw errors.
Interestingly, it will not convert to a DataFrame directly, but to a Series. Once it is converted to a Series, use the to_frame method of the Series to convert it to a DataFrame.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])

for row in df.itertuples():
    print(pd.Series(row).to_frame())
Hope this helps!!
EDIT
In case you want to keep the column names, use the _asdict() method like this:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])

for row in df.itertuples():
    d = dict(row._asdict())
    print(pd.Series(d).to_frame())
Output:

         0
Index    a
col1     1
col2   0.1

         0
Index    b
col1     2
col2   0.2
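If you only have a single Pandas row object (like recomen_total in the question), a minimal sketch that turns it straight into a one-row DataFrame:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
row = next(df.itertuples())  # stand-in for the question's recomen_total

# _asdict() keeps the field names, so the DataFrame gets proper columns
one_row_df = pd.DataFrame([row._asdict()])
print(one_row_df)
#   Index  col1  col2
# 0     a     1   0.1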
To create a new DataFrame from an itertuples namedtuple, you can use list() or Series too:
import pandas as pd

# source DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# empty DataFrame
df_new_fromAppend = pd.DataFrame(columns=['x', 'y'], data=None)

for r in df.itertuples():
    # create a new DataFrame from itertuples() via list() ([1:] skips the index):
    df_new_fromList = pd.DataFrame([list(r)[1:]], columns=['c', 'd'])
    # or create a new DataFrame from itertuples() via Series
    # (drop(0) removes the index, T transposes the column to a row):
    df_new_fromSeries = pd.DataFrame(pd.Series(r).drop(0)).T
    # or append the row to an existing DataFrame via .loc ([1:] skips the index):
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)[1:]

print('df_new_fromList:')
print(df_new_fromList, '\n')
print('df_new_fromSeries:')
print(df_new_fromSeries, '\n')
print('df_new_fromAppend:')
print(df_new_fromAppend, '\n')
Output:

df_new_fromList:
   c  d
0  2  4

df_new_fromSeries:
   1  2
0  2  4

df_new_fromAppend:
   x  y
0  1  3
1  2  4
To omit the index, use the param index=False (but I mostly need the index for the iteration):
for r in df.itertuples(index=False):
    # the [1:] needn't be used now, for example:
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)
The following works for me:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

for row in df.itertuples():
    row_as_df = pd.DataFrame.from_records([row], columns=row._fields)
    print(row_as_df)
The result is:

  Index  col1  col2
0     a     1   0.1

  Index  col1  col2
0     b     2   0.2
Sadly, AFAIU, there's no simple way to keep the column names without explicitly utilizing "protected attributes" such as _fields.
With some tweaks to #Igor's answer,
I ended up with this satisfactory code, which preserves column names and uses as little pandas code as possible.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]})
# or initialize another dataframe above

# get the list of column names
column_names = df.columns.values.tolist()

filtered_rows = []
for row in df.itertuples(index=False):
    # some code logic to filter rows
    filtered_rows.append(row)

# convert the pandas.core.frame.Pandas namedtuples back to a DataFrame:
# combine the filtered rows into a single dataframe
concatenated_df = pd.DataFrame.from_records(filtered_rows, columns=column_names)
concatenated_df.to_csv("path_to_csv", index=False)
The result is a csv containing:
col1,col2
1,0.1
2,0.2
To convert a list of objects returned by Pandas .itertuples to a DataFrame, while preserving the column names:
import pandas as pd

# example source DF
data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])

      animal  top_speed
0    cheetah     120.00
1      human      44.72
2  dragonfly      54.00
Since Pandas does not recommend building DataFrames by adding single rows in a for loop, we will iterate and build the DataFrame at the end:
WOW_THAT_IS_FAST = 50

list_ = []
for animal in source_df.itertuples(index=False, name='animal'):
    if animal.top_speed > WOW_THAT_IS_FAST:
        list_.append(animal)
Now build the DF in a single command and without manually recreating the column names.
filtered_df = pd.DataFrame(list_)
      animal  top_speed
0    cheetah      120.0
1  dragonfly       54.0

how to add columns label on a Pandas DataFrame

I can't understand how to add column names to a pandas dataframe; an easy example will clarify my issue:
import pandas as pd

dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
Now if I type df, then I get
   a  b  c
0  4  4  5
1  1  2  7
2  3  1  9
3  1  4  1
Say now that I generate another dataframe just by summing up the columns of the previous one:
a = df.sum()
If I type 'a' then I get
a     9
b    11
c    22
That looks like a dataframe with an index but without a name on its only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, because it didn't give me any error message. But still, if I type 'a', I can't see the column name anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. If you want to create a DataFrame out of your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
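Equivalently, a Series has a to_frame method, so a shorter sketch of the same idea:

import pandas as pd

df = pd.DataFrame({'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]})

# Series.to_frame names the single column in one step
a = df.sum().to_frame('column')
print(a)
#    column
# a       9
# b      11
# c      22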
