Loop to split a large dataframe into small dataframes - python

DF is a dataframe with columns [A1, B1, C1, A2, B2, C2, A3, B3, C3].
I want to split DF into three smaller dataframes DF1, DF2, DF3:
DF1 should have columns [A1, B1, C1]
DF2 should have columns [A2, B2, C2]
DF3 should have columns [A3, B3, C3]
The number in the dataframe's name (DF'3') should match the number in its columns ([A'3', B'3', C'3']).
I tried
for i in range(1,4):
    'DF{}'.format(i) = DF[DF['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]]
Getting the error
SyntaxError: cannot assign to function call
Is it possible to do this in a single loop?

You can't create variable names dynamically like that. You can use a list comprehension and unpack it into explicitly named dfs:
df1,df2,df3=[df[['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]] for i in range(1,4)]
Update based on ViettelSolutions' comment
Here is a more concise way of doing that: df1, df2, df3 = [df[[f'A{i}', f'B{i}', f'C{i}']] for i in range(1, 4)]
You can also keep the dataframes in a list instead of naming them explicitly, and unpack them when needed.
n = 3  # define the number of dfs
dfs = [df[['A{}'.format(i), 'B{}'.format(i), 'C{}'.format(i)]] for i in range(1, n + 1)]
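You can then unpack the list or index into it when needed, for example:
df1, df2, df3 = dfs   # unpack into named variables
first = dfs[0]        # or access by position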

The error message stems from trying to assign a dataframe to the result of a string-format call instead of to a variable.
Dynamically creating variables DF1 through DFN is tricky, but it is easy to create key-value pairs in a dict. Try the following:
dfs = {}
for i in range(1, 4):
    dfs["DF{}".format(i)] = DF[["A{}".format(i), "B{}".format(i), "C{}".format(i)]]
Instead of the variables DF1, DF2 and DF3, you get dfs["DF1"], dfs["DF2"], and dfs["DF3"].
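You can then iterate over the dict if you need all of them, for example:
for name, sub_df in dfs.items():
    print(name)
    print(sub_df)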

You could make it completely configurable:
def split_dataframe(df, letters, numbers):
    return [df[[f'{letter}{number}' for letter in letters]] for number in numbers]
letters = ("A","B","C")
numbers = range(1,4)
df1, df2, df3 = split_dataframe(df, letters, numbers)
You can make the function even more general as follows:
import re
letters_pattern = re.compile(r"^\D+")
numbers_pattern = re.compile(r"\d+$")
def split_dataframe(df):
    letters = sorted(set(letters_pattern.findall(x)[0] for x in df.columns))
    numbers = sorted(set(numbers_pattern.findall(x)[0] for x in df.columns))
    return [df[[x for x in [f'{letter}{number}' for letter in letters] if x in df.columns]] for number in numbers]
This method has two advantages:
you don't need to provide the letters and numbers in advance; the method discovers what is available in the header and proceeds
it handles "irregular" situations, when, for example, D1 exists but D2 doesn't
To give a concrete example:
df = pd.DataFrame({"A1":[1,2], "B1":[2,3], "C1":[3,4], "D1":[4,5], "A2":[2,3], "B2":[10,11], "C2":[12,13]})
for sub_df in split_dataframe(df):
    print(sub_df)
OUTPUT
A1 B1 C1 D1
0 1 2 3 4
1 2 3 4 5
A2 B2 C2
0 2 10 12
1 3 11 13
The columns names discovery process could be set as optional if you pass letters and numbers you only want to consider, as follows:
def split_dataframe(df, letters=None, numbers=None):
    letters = sorted(set(letters_pattern.findall(x)[0] for x in df.columns)) if letters is None else letters
    numbers = sorted(set(numbers_pattern.findall(x)[0] for x in df.columns)) if numbers is None else numbers
    return [df[[x for x in [f'{letter}{number}' for letter in letters] if x in df.columns]] for number in numbers]
for sub_df in split_dataframe(df, letters=("B", "C"), numbers=[1, 2]):
    print(sub_df)
OUTPUT
B1 C1
0 2 3
1 3 4
B2 C2
0 10 12
1 11 13


Filter a dataframe by column index in a chain, without using the column name or table name

Generate an example dataframe
import random
import string
import numpy as np
import pandas as pd
df = pd.DataFrame(
    columns=[random.choice(string.ascii_uppercase) for i in range(5)],
    data=np.random.rand(10, 5))
df
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
1 0.933778 0.393021 0.547383 0.469255 0.053089
2 0.994518 0.156547 0.917894 0.070152 0.201373
3 0.077694 0.685540 0.865004 0.830740 0.605135
4 0.760294 0.838441 0.905885 0.146982 0.157439
5 0.116676 0.340967 0.400340 0.293894 0.220995
6 0.632182 0.663218 0.479900 0.931314 0.003180
7 0.726736 0.276703 0.057806 0.624106 0.719631
8 0.677492 0.200079 0.374410 0.962232 0.915361
9 0.061653 0.984166 0.959516 0.261374 0.361677
Now I want to filter a dataframe using the values in the first column, but since I make heavy use of chaining (e.g. df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).pipe(func)) I need a much more compact notation for the operation. Normally you'd do something like
df[df.iloc[:, 0] < 0.5]
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
but the awkwardly redundant syntax is horrible for chaining. I want to replace it with a .query(), and normally you'd use the column name like df.query('V < 0.5'), but here I want to query the table by column index number instead of by name. So in the example, I've deliberately randomized the column names. I also cannot use the table name in the query like df.query('@df[0] < 0.5'), since in a long chain the intermediate result has no name.
I'm hoping there is some syntax such as df.query('_[0] < 0.05') where I can refer to the source table as some symbol _.
You can use f-string notation in df.query:
df.query(f'{df.columns[0]} < .5')
Output (the column names differ here because the random example dataframe was regenerated):
J M O R N
3 0.114554 0.131948 0.650307 0.672486 0.688872
4 0.272368 0.745900 0.544068 0.504299 0.434122
6 0.418988 0.023691 0.450398 0.488476 0.787383
7 0.040440 0.220282 0.263902 0.660016 0.955950
Update: using the "walrus" operator in Python 3.8+
Let's try this:
((dfout := df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).to_frame(name='values'))
 .query(f'{dfout.columns[0]} > -2'))
output:
values
N -1.356779
O -1.202353
M -1.591623
T -1.557801
You can use a lambda function in loc; the calling dataframe is passed to it, and you can then use iloc for positional indexing. So you could do:
df.loc[lambda x: x.iloc[:, 0] > 0.5]
This works in a method chain.
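For example, in the middle of a chain (a sketch; the rename step is just a stand-in for any intermediate operation):
out = (df
       .rename(columns=str.lower)           # intermediate result has no name
       .loc[lambda x: x.iloc[:, 0] < 0.5])  # positional filter without naming the table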
For a single column with index:
df.query(f"{df.columns[0]}<0.5")
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
For multiple columns with index:
idx = [0,1]
col = df.columns[np.r_[idx]]
val = 0.5
query = ' and '.join([f"{i} < {val}" for i in col])
# V < 0.5 and O < 0.5
print(df.query(query))
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
5 0.116676 0.340967 0.400340 0.293894 0.220995

Ignoring an invalid filter among multiple filters on a DataFrame

Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional: if the user enters an invalid value for a certain condition, that condition is skipped completely, and the result defaults to the DataFrame without that specific condition applied.
While I can implement this quite easily with multiple if-conditions, modifying the DataFrame sequentially, I am looking for something more elegant and scalable (as the number of input parameters grows), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
df = pd.DataFrame({'One': ['a','a','a','b'],
                   'Two': ['x','y','y','y'],
                   'Three': ['l','m','m','l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are values that don't occur in the respective column. So for column 'One', all values other than 'a' and 'b' are invalid. If the user inputs 'a', I should be able to filter the DataFrame with df[df['One']=='a']; however, if the user inputs any invalid value, no such filter should be applied and the original dataframe df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One'] == inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two'] == inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three'] == inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With few invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T
df.loc[(df.eq(inp)|~df.eq(inp).any(0)).all(1)]
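A quick check of the one-liner with the example data (a sketch; the invalid 'NA' filters are ignored):
inp = ['a', 'NA', 'NA']
print(df.loc[(df.eq(inp) | ~df.eq(inp).any(0)).all(1)])
#   One Two Three
# 0   a   x     l
# 1   a   y     m
# 2   a   y     m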
Here is one way: create a Boolean dataframe by comparing each value of inp with its column. Then use any along the rows to find the columns with at least one True, and, once those columns are selected, use all along the columns to keep the rows that match all valid filters.
def valid_filtering(df, inp):
    # check where inp values are the same as in df
    m = (df == pd.DataFrame(data=[inp] * len(df), index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True amongst the wanted columns
    rows = m[cols].all(axis=1)
    # return df with selected rows
    return df.loc[rows]
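For example, with the question's df (a sketch):
print(valid_filtering(df, ['a', 'y', 'm']))
#   One Two Three
# 1   a   y     m
# 2   a   y     m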
Note that if you don't have the same number of filters as columns in your original df, you can use a dictionary instead; it works too, as in the example below, where the column Three is ignored because its comparison is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
The key is: if a column returns all False (~b.any(), an invalid filter), then replace it with True to accept all values of that column:
mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with few invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with all invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no matching rows)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns) if len(df[df[column] == inp[i]].values) > 0]
    for s in series:
        df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m
Related: applying lambda row on multiple columns pandas

Retrieve certain value located in dataframe in any row or column and keep it in separate column without forloop

I have a dataframe like below:
df
           A          B          C
0          0          1  TRANSIT_1
1  TRANSIT_3       None       None
2          0  TRANSIT_5       None
And I want to change it to below, with the TRANSIT value of each row copied into a new column:
Resulting DF
           A          B          C          D
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
So I tried str.contains, and once I receive the series of True/False values, I put it in the eval function to somehow get the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought to use the eval function to write it to a separate column, like:
df = eval(df[result]=the index)  # I don't know, but eval does evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)
a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])
df1['ans'] = df1.lookup(a, b)
Output
           A          B          C        ans
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
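Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of an alternative that takes the first TRANSIT_ match in each row (the startswith test is an assumption about what marks a wanted cell):
mask = df1.astype(str).apply(lambda col: col.str.startswith('TRANSIT_'))
df1['ans'] = df1.where(mask).stack().groupby(level=0).first()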

How to add a specific number of characters to the end of string in Pandas?

I am using the pandas library in Python and I am trying to pad a text column so that all values have the same length. I want to do this by appending a specific character (normally whitespace; in this example I will use "_") as many times as needed to reach the maximum length in that column.
For example:
Col1_Before
A
B
A1R
B2
AABB4
Col1_After
A____
B____
A1R__
B2___
AABB4
So far I have got this far (using the above table as the example). It is the next part (actually appending the characters) that I am stuck on.
df['Col1_Max'] = df.Col1.map(lambda x: len(x)).max()
df['Col1_Len'] = df.Col1.map(lambda x: len(x))
df['Difference_Len'] = df['Col1_Max'] - df['Col1_Len']
I may have not explained myself well as I am still learning. If this is confusing let me know and I will clarify.
Consider the pd.Series s:
s = pd.Series(['A', 'B', 'A1R', 'B2', 'AABB4'])
Solution: use str.ljust.
m = s.str.len().max()
s.str.ljust(m, '_')
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4
dtype: object
For your case:
m = df.Col1.str.len().max()
df.Col1 = df.Col1.str.ljust(m, '_')
It isn't the most pandas-like solution, but you can try the following:
import numpy as np
import pandas as pd
col = np.array(["A", "B", "A1R", "B2", "AABB4"])
data = pd.DataFrame(col, columns=["Before"])
Now compute the maximum length, the list of individual lengths, and the differences:
max_ = data.Before.map(lambda x: len(x)).max()
lengths_ = data.Before.map(lambda x: len(x))
diffs_ = max_ - lengths_
Create a new column called After adding the underscores, or any other character:
data["After"] = data["Before"] + ["_"*i for i in diffs_]
All this gives:
  Before  After
0      A  A____
1      B  B____
2    A1R  A1R__
3     B2  B2___
4  AABB4  AABB4
Without creating extra columns:
In [63]: data
Out[63]:
Col1
0 A
1 B
2 A1R
3 B2
4 AABB4
In [64]: max_length = data.Col1.map(len).max()
In [65]: data.Col1 = data.Col1.apply(lambda x: x + '_'*(max_length - len(x)))
In [66]: data
Out[66]:
Col1
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4

Fill pandas data frame using .append()

I have a dataframe with a column containing comma-separated strings. What I want to do is split them by comma, count the parts, and append the count to a new data frame. If the column contains a list with only one element, I want to differentiate whether it is a string or an integer. If it is an integer, I want to append the value 0 for that row to the new df.
My code looks as follows:
def decide(dataframe):
    df = pd.DataFrame()
    for liste in DataFrameX['Column']:
        x = liste.split(',')
        if len(x) > 1:
            df.append(pd.Series([len(x)]), ignore_index=True)
        else:
            # check if element in list is int
            for i in x:
                try:
                    int(i)
                    print i
                    x = []
                    df.append(pd.Series([int(len(x))]), ignore_index=True)
                except:
                    print i
                    x = [1]
                    df.append(pd.Series([len(x)]), ignore_index=True)
    return df
The Input data look like this:
C1
0 a,b,c
1 0
2 a
3 ab,x,j
If I now run the function with my original dataframe as input, it returns an empty dataframe. Through the print statements in the try/except blocks I could see that everything works; the problem is appending the resulting values to the new dataframe. What do I have to change in my code? If possible, please do not give an entirely different solution, but tell me what I am doing wrong so I can learn.
******************UPDATE************************************
I edited the code so that it can be called as lambda function. It looks like this now:
def decide(x):
    for liste in DataFrameX['Column']:
        x = liste.split(',')
        if len(x) > 1:
            x = len(x)
            print x
        else:
            # check if element in list is int
            for i in x:
                try:
                    int(i)
                    x = []
                    x = len(x)
                    print x
                except:
                    x = [1]
                    x = len(x)
                    print x
And I call it like this:
df['Count']=df['C1'].apply(lambda x: decide(x))
It prints the right values, but the new column only contains None.
Any ideas why?
This is a good start; it could be simplified, but I think it works as expected.
#I have a dataframe with a column containing comma separated strings.
df = pd.DataFrame({'data': ['apple, peach', 'banana, peach, peach, cherry','peach','0']})
# What I want to do is separate them by comma, count them and append the counted number to a new data frame.
df['data'] = df['data'].str.split(',')
df['count'] = df['data'].apply(lambda row: len(row))
# If the column contains a list with only one element
df['first'] = df['data'].apply(lambda row: row[0])
# I want to differentiate wheather it is a string or an integer
df['first'] = pd.to_numeric(df['first'], errors='coerce')
# if the element in x is an integer, len(x) should be set to zero
df.loc[pd.notnull(df['first']), 'count'] = 0
# Dropping temp column
df.drop(columns='first', inplace=True)
df
data count
0 [apple, peach] 2
1 [banana, peach, peach, cherry] 4
2 [peach] 1
3 [0] 0
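To answer the "what am I doing wrong" part directly: DataFrame.append returns a new DataFrame instead of modifying df in place, so every df.append(...) result in the original function is discarded and the still-empty df is returned. (Likewise, the updated decide never returns anything, which is why the new column is all None.) Assigning the result back with df = df.append(...) would fix the first version, but append was deprecated in pandas 1.4 and removed in 2.0, so collecting the counts in a plain list and building the frame once is the more durable pattern. A sketch, reusing the question's DataFrameX['Column']:
counts = []
for liste in DataFrameX['Column']:
    x = liste.split(',')
    if len(x) > 1:
        counts.append(len(x))
    else:
        # single element: count 0 if it parses as an int, else 1
        try:
            int(x[0])
            counts.append(0)
        except ValueError:
            counts.append(1)
df = pd.DataFrame({'count': counts})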
