list of DataFrames as an argument of function/loop

I have multiple DataFrames and need to perform the same operations on each of them. I want to put them in one list so that I don't have to list them all every time, as in the example below:
    for df in (df1, df2, df3, df4, df5, df6, df7):
        df.columns = ['COUNTRY', '2018', '2019']
        df.replace({':': ''}, regex=True, inplace=True)
        df.replace({' ': ''}, regex=True, inplace=True)
        df["2018"] = pd.to_numeric(df["2018"], downcast="float")
        df["2019"] = pd.to_numeric(df["2019"], downcast="float")
I tried to make a list of them (DataFrames=[df1,df2,df3,df4,df5,df6,df7]), but it works neither in the loop nor as an argument of a function.
    for df in (DataFrame):
        df.columns = ['COUNTRY', '2018', '2019']
        df.replace({':': ''}, regex=True, inplace=True)
        df.replace({' ': ''}, regex=True, inplace=True)
        df["2018"] = pd.to_numeric(df["2018"], downcast="float")
        df["2019"] = pd.to_numeric(df["2019"], downcast="float")

You can place the dataframes in a list and select the columns like this:
    import pandas as pd
    from pandas import DataFrame
    data = {'COUNTRY': ['country1', 'country2', 'country3'],
            '2018': [12.0, 27, 35],
            '2019': [23, 39.6, 40.3],
            '2020': [35, 42, 56]}
    df_list = [DataFrame(data), DataFrame(data), DataFrame(data),
               DataFrame(data), DataFrame(data), DataFrame(data),
               DataFrame(data)]
    def change_dataframes(data_frames=None):
        for df in data_frames:
            df = df.loc[:, ['COUNTRY', '2018', '2019']]
            df.replace({':': ''}, regex=True, inplace=True)
            df.replace({' ': ''}, regex=True, inplace=True)
            pd.to_numeric(df['2018'], downcast="float")
            pd.to_numeric(df['2019'], downcast="float")
        return data_frames

Using nunvie's answer as a base, here is another option for you:
    import pandas as pd
    data = {
        'COUNTRY': ['country1', 'country2', 'country3'],
        '2018': ['12.0', '27', '35'],
        '2019': ['2:3', '3:9.6', '4:0.3'],
        '2020': ['35', '42', '56']
    }
    df_list = [pd.DataFrame(data) for _ in range(5)]
    def data_prep(df: pd.DataFrame) -> pd.DataFrame:
        # .copy() gives an independent frame, so the in-place calls below
        # do not operate on a slice of the caller's DataFrame
        df = df.loc[:, ['COUNTRY', '2018', '2019']].copy()
        df.replace({':': ''}, regex=True, inplace=True)
        df.replace({' ': ''}, regex=True, inplace=True)
        df['2018'] = pd.to_numeric(df['2018'], downcast="float")
        df['2019'] = pd.to_numeric(df['2019'], downcast="float")
        return df
    # map() is lazy, so wrap it in list() to get the processed frames
    new_df_list = list(map(data_prep, df_list))
The improvements (in my opinion) are as follows. First, it is more concise to use a list comprehension for the test setup (that's not directly related to the answer). Second, pd.to_numeric has no inplace parameter (at least in pandas 1.2.3); it returns the converted Series, so you need to assign the result back explicitly: df['my_col'] = pd.to_numeric(df['my_col']).
Third, I've used map to apply the data_prep function to each DataFrame in the list. This makes data_prep responsible for only one data frame and also saves you from writing loops. Note that map returns a lazy iterator, which is why it is wrapped in list(). The benefit is leaner and more readable code, if you like the functional flavour of it, of course.
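To illustrate why it matters whether the loop rebinds its variable or changes the DataFrame held in the list, here is a minimal sketch. The frames and the column name x are made up for the example and do not come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for df1..df7 from the question.
frames = [pd.DataFrame({"x": ["1", "2"]}) for _ in range(3)]

# Rebinding: `df = ...` points the loop variable at a new object,
# so the frames stored in the list are left untouched.
for df in frames:
    df = df.astype({"x": "int64"})
dtypes_after_rebinding = [str(f["x"].dtype) for f in frames]

# Mutating: assigning a column changes the very object the list holds.
for df in frames:
    df["x"] = pd.to_numeric(df["x"])
dtypes_after_mutation = [str(f["x"].dtype) for f in frames]

print(dtypes_after_rebinding)
print(dtypes_after_mutation)
```

If your function returns a new DataFrame rather than mutating its argument, remember to keep the results, for example by rebuilding the list with a list comprehension.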

Related

Get column names with corresponding index in python pandas

I have this dataframe df where
>>> df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
...                    'Event': ['Music', 'Poetry', 'Theatre', 'Comedy'],
...                    'Cost': [10000, 5000, 15000, 2000],
...                    'Name': ['Roy', 'Abraham', 'Blythe', 'Sophia'],
...                    'Age': ['20', '10', '13', '17']})
I want to determine the column index with the corresponding name. I tried it with this:
>>> list(df.columns)
But the solution above only returns the column names without index numbers.
How can I code it so that it would return the column names and the corresponding index for that column? Like This:
0 Date
1 Event
2 Cost
3 Name
4 Age

The simplest way is to pass the column names to the pd.Series constructor:
    pd.Series(list(df.columns))
Or convert the columns to a Series and create a default index:
    df.columns.to_series().reset_index(drop=True)
Or, if a one-column DataFrame is acceptable:
    df.columns.to_frame(index=False)

You can use a loop like this:
    myList = list(df.columns)
    index = 0
    for value in myList:
        print(index, value)
        index += 1

A nice short way to get a dictionary:
    d = dict(enumerate(df))
Output: {0: 'Date', 1: 'Event', 2: 'Cost', 3: 'Name', 4: 'Age'}
For a Series, pd.Series(list(df)) is sufficient, as iteration occurs directly on the column names.
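For completeness, here is a small sketch that goes in both directions: from position to name with enumerate, and from name to position with Index.get_loc. The one-row frame is a shortened, illustrative version of the one in the question.

```python
import pandas as pd

# Shortened version of the question's frame; the values are illustrative.
df = pd.DataFrame({'Date': ['10/2/2011'], 'Event': ['Music'], 'Cost': [10000],
                   'Name': ['Roy'], 'Age': ['20']})

# Position -> name, built by iterating over the column labels.
position_to_name = dict(enumerate(df.columns))

# Name -> position for a single column (labels must be unique for an int result).
cost_position = df.columns.get_loc('Cost')

print(position_to_name)
print(cost_position)
```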

In addition to using enumerate, you can also pair each column with its position using zip, as follows:
    import pandas as pd
    df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
                       'Event': ['Music', 'Poetry', 'Theatre', 'Comedy'],
                       'Cost': [10000, 5000, 15000, 2000],
                       'Name': ['Roy', 'Abraham', 'Blythe', 'Sophia'],
                       'Age': ['20', '10', '13', '17']})
    # range() is already iterable, so no list comprehension is needed
    result = list(zip(range(len(df.columns)), df.columns))
    for r in result:
        print(r)
    #(0, 'Date')
    #(1, 'Event')
    #(2, 'Cost')
    #(3, 'Name')
    #(4, 'Age')

Convert a multi-valued dict into a pandas dataframe

I want to convert this dict into a pandas dataframe where each key becomes a column and the values in each list become the rows:
my_dict:
    {'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
     'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
     'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
     'Rank': [1, 3, 7, 4, 25],
    }
The lists in my_dict can also have some missing values, which should appear as NaNs in the dataframe.
This is how I'm currently trying to append it to my dataframe:
    df = pd.DataFrame(columns=['Last updated',
                               'Symbol',
                               'Name',
                               'Rank'])
    df = df.append(my_dict, ignore_index=True)
    #print(df)
    df.to_excel(r'\walletframe.xlsx', index=False, header=True)
But my output only has a single row containing all the values.

The answer was pretty simple. Instead of using
    df = df.append(my_dict)
I used
    df = pd.DataFrame.from_dict(my_dict, orient='index').T
With orient='index' each key becomes a row and shorter lists are padded with missing values; .T then transposes the result so that each key becomes a column.
Credits to #Ank who helped me find the solution!
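A variation that avoids the transpose, offered as a sketch rather than as the method used above: wrap each list in a pd.Series so that pandas aligns the columns on their positional index and fills the shorter ones with NaN.

```python
import pandas as pd

my_dict = {
    'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
    'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
    'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
    'Rank': [1, 3, 7, 4, 25],
}

# Each list becomes a Series indexed 0..n-1; the DataFrame constructor
# aligns them on that index and pads the shorter columns with NaN.
df = pd.DataFrame({key: pd.Series(values) for key, values in my_dict.items()})

print(df)
```

Because full-length columns are built directly, they keep their own dtype (for example, Rank stays numeric), whereas a transposed frame typically ends up with object columns.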

Pandas: How to work with sliced data using .loc?

    df1 = pd.DataFrame({'id_imp': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                        'name': ['jon', 'jon', 'tom', 'ber', 'gary', 'gary', 'zul'],
                        'state': ['ca', 'ny', 'tn', 'ca', 'tn', 'tn', 'il'],
                        'county': ['wood', 'wood', 'fair', 'bridge', 'rosewelt', 'rosewelt', 'lili']})
    df2 = pd.DataFrame({'id_sal': ['h', 'i', 'j', 'k', 'l'],
                        'name': ['jon', 'zolie', 'tom', 'ber', 'gary'],
                        'state': ['ca', 'ch', 'tn', 'ca', 'tn'],
                        'county': ['wood', 'plas', 'fair', 'bridge', 'rosewelt']})
    df3 = df1.loc[(~df1.name.isin(df2.name))]
I am trying to do a small operation with the code below, but it gives me a warning. What could be the problem?
    df3['name'] = df3.loc[:, 'name'].fillna(0)
    SettingWithCopyWarning: Try using .loc[row_indexer,col_indexer] = value instead

It looks like df['name'] returns an entirely new object, i.e. a copy, but you want to work with the original object. So use df3.loc[:, 'name'], which returns a subset of the original object, i.e. a view:
    df3.loc[:, 'name'] = df3['name'].fillna(0)
If you are selecting rows and columns in the same line of code, .loc[] works better.
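Another common way to avoid the warning, assuming df3 is meant to be independent of df1, is to take an explicit copy when filtering. The sketch below uses smaller, made-up frames, and a hypothetical 'unknown' placeholder instead of 0 so that the column keeps holding strings.

```python
import pandas as pd

# Made-up, smaller versions of the frames in the question.
df1 = pd.DataFrame({"id_imp": ["a", "b", "c"], "name": ["jon", None, "zul"]})
df2 = pd.DataFrame({"id_sal": ["h", "i"], "name": ["jon", "tom"]})

# .copy() makes df3 its own DataFrame, so later assignments are
# unambiguous and never touch df1.
df3 = df1.loc[~df1["name"].isin(df2["name"])].copy()
df3["name"] = df3["name"].fillna("unknown")

print(df3)
```

After this, df1 still contains its missing value, and only df3 has been filled.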

Dask categorize() won't work after using .loc

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.
However, after I split the dataframes and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require category columns to be 'known').
NOTE: I have created a minimal example, as suggested, using pandas instead of read_csv().
    import pandas as pd
    import dask.dataframe as dd
    # Specify dtypes
    b_dtypes = {
        'symbol': 'category',
        'price': 'float64',
    }
    i_dtypes = {
        'symbol': 'category',
        'price': 'object'
    }
    # Specify a function to quickly set dtypes
    def to_dtypes(df, dtypes):
        for column, dtype in dtypes.items():
            if column in df.columns:
                df[column] = df.loc[:, column].astype(dtype)
        return df
    # Set up our test data
    data = [
        ['B', 'IBN', '9.9800'],
        ['B', 'PAY', '21.5000'],
        ['I', 'PAY', 'seventeen'],
        ['I', 'SPY', 'ten']
    ]
    # Create pandas dataframe
    pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
    # Convert into dask
    df = dd.from_pandas(pdf, npartitions=3)
    #
    ## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
    #
    # Split the dataframe by the 'type' column
    b_df = df.loc[df['type'] == 'B', :]
    i_df = df.loc[df['type'] == 'I', :]
    # Convert columns into our intended dtypes
    b_df = to_dtypes(b_df, b_dtypes)
    i_df = to_dtypes(i_df, i_dtypes)
    # Let's convert our 'symbol' column to known categories
    b_df = b_df.categorize(columns=['symbol'])
    i_df['symbol'] = i_df['symbol'].cat.as_known()
    # Is our symbol column known now?
    print(b_df['symbol'].cat.known, flush=True)
    print(i_df['symbol'].cat.known, flush=True)
    #
    ## print() shows 'False' for both.
    ## (Please help...)
    #
UPDATE: It seems that if I set the 'npartitions' parameter to 1, then print() shows True in both cases. So this appears to be an issue with the partitions containing different categories. However, loading both dataframes into only two partitions is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?

The answer to your problem is basically contained in the docs. I'm referring to the part of the code commented with "# categorize requires computation, and results in known categoricals". I'll expand here because it seems to me you're misusing loc.
    import pandas as pd
    import dask.dataframe as dd
    # Set up our test data
    data = [['B', 'IBN', '9.9800'],
            ['B', 'PAY', '21.5000'],
            ['I', 'PAY', 'seventeen'],
            ['I', 'SPY', 'ten']
            ]
    # Create pandas dataframe
    pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
    # Convert into dask
    ddf = dd.from_pandas(pdf, npartitions=3)
    # Split the dataframe by the 'type' column
    # reset_index is not necessary
    b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
    i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)
    # Convert columns into our intended dtypes
    b_df = b_df.categorize(columns=['symbol'])
    b_df["price"] = b_df["price"].astype('float64')
    i_df = i_df.categorize(columns=['symbol'])
    # Is our symbol column known now? YES
    print(b_df['symbol'].cat.known, flush=True)
    print(i_df['symbol'].cat.known, flush=True)
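As a rough illustration of what "consistent categories across partitions" means, here is a pandas-only sketch. It is not Dask code and not what categorize() does internally; the two Series are hypothetical stand-ins for two partitions, and the idea is simply that every piece ends up sharing one category set.

```python
import pandas as pd

# Hypothetical stand-ins for the 'symbol' column in two partitions.
part_a = pd.Series(["IBN", "PAY"], dtype="category")
part_b = pd.Series(["PAY", "SPY"], dtype="category")

# Each piece only knows the categories it has seen so far.
print(list(part_a.cat.categories), list(part_b.cat.categories))

# Build one shared dtype from the union of all categories...
all_symbols = sorted(set(part_a.cat.categories) | set(part_b.cat.categories))
shared = pd.CategoricalDtype(categories=all_symbols)

# ...and cast every piece to it, so the category sets agree.
part_a = part_a.astype(shared)
part_b = part_b.astype(shared)

print(list(part_a.cat.categories), list(part_b.cat.categories))
```

Finding the union requires looking at all of the data, which is consistent with the docs comment quoted above that categorize requires computation.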

Python 3.6 Pandas - Left justify data in dataframe

I am trying to left-justify the column data in a dataframe when printing it, using a hack based on code I borrowed from another question. However, it does not appear to work in the context I am using it in (the df.stack line):
    import pandas as pd
    master_list = [['cat', 123, 'yellow'], ['dog', 12345, 'green'], ['horse', 123456, 'red']]
    df = pd.DataFrame(master_list)
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.colheader_justify', 'left', 'display.width', 2000, 'display.max_colwidth', 500):
        df = df.stack().str.lstrip().unstack()
        print(df)
What do I need to amend? There doesn't seem to be a built-in option in pandas to do this in a straightforward manner.
Thanks

(Moving to an answer for easier formatting and readability.)
    import pandas as pd
    master_list = [['cat', 123, 'yellow'], ['dog', 12345, 'green'], ['horse', 123456, 'red']]
    df = pd.DataFrame(master_list)
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.colheader_justify', 'left', 'display.width', 2000, 'display.max_colwidth', 500):
        df = df.stack().str.lstrip().unstack()
    df = df.style.set_properties(**{'text-align': 'left'})
    df
Edited. However, the output isn't changed; it looks the same as before.
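For plain-text output (as opposed to styled HTML in a notebook), one possible workaround is to convert every cell to text and pad it on the right to the width of its column. This is only a sketch under that assumption: it changes the values to padded strings, so it is meant for display, not for further computation.

```python
import pandas as pd

master_list = [['cat', 123, 'yellow'], ['dog', 12345, 'green'], ['horse', 123456, 'red']]
df = pd.DataFrame(master_list)

# Convert every cell to text so that numbers can be padded too.
text = df.astype(str)

# Pad each column on the right to the length of its longest entry.
padded = text.apply(lambda col: col.str.ljust(int(col.str.len().max())))

# All cells in a column now have equal width, so they line up on the left.
print(padded.to_string(index=False, header=False))
```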