Group by with lists in Dataframe - python

I have a problem with a Dataframe looking like this:
It contains "ClusterLabels" (0-44) and I want to group the "Document" column by the ClusterLabel value. I want these lists from "Document" to be combined into one list per cluster (duplicate words should be kept).
I tried ".groupby", but it gives the error "sequence item 0: expected str instance, list found".
Can someone help?

Don't use sum to concatenate lists. It looks fancy, but it is quadratic and should be considered bad practice.
Better is to use a list comprehension to flatten the lists:
df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: [z for y in x for z in y])
         .reset_index())
Or flatten with itertools.chain:
from itertools import chain

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: list(chain(*x)))
         .reset_index())
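As a minimal sketch of what this produces, assuming a small hypothetical DataFrame with two clusters:
import pandas as pd
from itertools import chain

df = pd.DataFrame({"Document": [["a", "b"], ["a", "d"], ["c", "d"]],
                   "ClusterLabel": [0, 0, 1]})

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: list(chain(*x)))
         .reset_index())
print(df1)
# expected roughly:
#    ClusterLabel      Document
# 0             0  [a, b, a, d]
# 1             1        [c, d]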

You can do this like:
import pandas as pd
df = pd.DataFrame({"Document": [["a","b","c","d"],["a","d"],["a","b"],["c","d"],["d"]],
                   "ClusterLabel": [0,0,0,1,1]})
df
df.groupby("ClusterLabel").sum()

Related

how to make withColumnRenamed query generic in pyspark

Description
I have 2 lists:
List1=['curentColumnName1','curentColumnName2','currentColumnName3']
List2=['newColumnName1','newColumnName2','newColumnName3']
There is a dataframe df which contains all the columns.
I want to check whether column 'curentColumnName1' is present in the dataframe; if yes, rename it to 'newColumnName1'.
This needs to be done for all the columns, if they are present in the dataframe.
How can this be achieved using pyspark?
Just iterate over the first list, check if it is in the column list, and rename:
for i in range(len(List1)):
    if List1[i] in df.columns:
        df = df.withColumnRenamed(List1[i], List2[i])
P.S. Instead of two lists, it's better to use a dictionary: it's easier to maintain, and you avoid errors when you add/remove elements in only one of the lists.
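A minimal sketch of that dictionary variant, building the mapping from the two existing lists (rename_map is a hypothetical helper name):
# build the old-name -> new-name mapping once
rename_map = dict(zip(List1, List2))

for old_name, new_name in rename_map.items():
    if old_name in df.columns:
        df = df.withColumnRenamed(old_name, new_name)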
Here is another way of doing it as a one-liner:
from functools import reduce

df = reduce(
    lambda a, b: a.withColumnRenamed(b[0], b[1]),
    zip(List1, List2),
    df,
)
You can achieve it in one line:
df.selectExpr(*[f"{old_col} AS {new_col}" for old_col, new_col in zip(List1, List2)]).show()
Note that this keeps only the listed columns and assumes they all exist in the dataframe.

Filtering a large Pandas DataFrame based on a list of strings in column names

Stack Overflow Family,
I have recently started learning Python and am using Pandas to handle some factory data. The csv file is essentially a large dataframe (1621 rows × 5633 columns). While I need all the rows, as these are the data for each unit, I need to filter out many unwanted columns. I have identified a list of strings in these column names that I can use to find only the wanted columns; however, I am not able to figure out what a good logic here would be, or whether there are any built-in Python functions for this.
dropna is not an option for me, as some of the wanted columns have NA as values (for example, test limit).
dropna for columns with all NA is also not good enough, as I will still end up with a large number of columns.
Looking for some guidance here. Thank you for your time.
If you have a list of valid columns you can just use df.filter(cols_subset, axis=1) to drop everything else.
You could use a regex to also match substrings from your list in column names:
df.filter(regex='|'.join(cols_subset), axis=1)
Or you could match only columns starting with a substring from your list:
df.filter(regex='^('+'|'.join(cols_subset)+')', axis=1)
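A minimal sketch of the regex variant, assuming a hypothetical toy DataFrame and substring list:
import pandas as pd

df = pd.DataFrame(columns=['temp_A', 'temp_B', 'pressure_A', 'serial'])
cols_subset = ['temp', 'serial']

# keeps every column whose name contains one of the substrings
print(df.filter(regex='|'.join(cols_subset), axis=1).columns.tolist())
# expected roughly: ['temp_A', 'temp_B', 'serial']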
EDIT:
Given the time complexity of my previous solution, I came up with a way to use a list comprehension:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
app = ["app", "ban"]
new_list = [x for x in fruits if any(y in x for y in app)]
output:
['apple', 'banana']
This should only display the columns you need. In your case you just need to do:
my_strings = ["A", "B", ...]
new_list = [x for x in df.columns if any(y in x for y in my_strings)]
print(new_list)
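To actually reduce the DataFrame to those columns, the resulting list of names can then be passed to the indexer (a small sketch, assuming df is your original frame):
filtered_df = df[new_list]
print(filtered_df.shape)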
If you know the column names exactly, what you could do is something like this:
unwanted_cols = ['col1', 'col4'] #list of unwanted cols names
df_cleaned = current_df.drop(unwanted_cols, axis=1)
# or
current_df.drop(unwanted_cols, inplace=True, axis=1)
If you don't know the exact column names, you could first retrieve all the columns
all_cols = current_df.columns.tolist()
and then apply a regex to all of the column names to obtain the ones that match your list of strings, and apply the same code as above.
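A minimal sketch of that regex step, assuming a hypothetical wanted_patterns list of substrings; columns that do not match are then dropped as above:
import re

wanted_patterns = ['temp', 'limit']  # hypothetical substrings of wanted column names
pattern = re.compile('|'.join(wanted_patterns))

all_cols = current_df.columns.tolist()
unwanted_cols = [c for c in all_cols if not pattern.search(c)]
df_cleaned = current_df.drop(unwanted_cols, axis=1)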
You can also drop columns from a dataframe by applying string contains with a regular expression. Below is an example:
df.drop(df.columns[df.columns.str.contains('^abc')], axis=1)

How to create a set of sets effectively in Python?

I have two dataframes, each with two columns. The rows are value pairs, where order is not important: a-b == b-a for me. I need to compare these value pairs between the two dataframes.
I have a solution, but it is terribly slow for a dataframe with 300k rows:
import pandas as pd
df1 = pd.DataFrame({"col1" : [1,2,3,4], "col2":[2,1,5,6]})
df2 = pd.DataFrame({"col1" : [2,1,3,4], "col2":[1,9,8,9]})
mysets = [{x[0],x[1]} for x in df1.values.tolist()]
df1sets = []
for element in mysets:
    if element not in df1sets:
        df1sets.append(element)
mysets = [{x[0],x[1]} for x in df2.values.tolist()]
df2sets = []
for element in mysets:
    if element not in df2sets:
        df2sets.append(element)
intersect_sets = [x for x in df1sets if x in df2sets]
This works, but it is terribly slow, and there must be an easier way to do this. One of my problems is that I cannot add a set to a set; I cannot create {{1,2}, {2,3}}, etc.
A pandas solution is to merge on the sorted values of the columns, remove duplicates and convert to sets:
import numpy as np

intersect_sets = [set(x) for x in pd.DataFrame(np.sort(df1.to_numpy(), axis=1))
                                    .merge(pd.DataFrame(np.sort(df2.to_numpy(), axis=1)))
                                    .drop_duplicates()
                                    .to_numpy()]
print (intersect_sets)
[{1, 2}]
Another idea with a set of frozensets:
intersect_sets = (set([frozenset(x) for x in df1.to_numpy()]) &
                  set([frozenset(x) for x in df2.to_numpy()]))
print (intersect_sets)
{frozenset({1, 2})}
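If plain sets are needed downstream, the frozensets can be converted back afterwards (a small sketch under that assumption):
plain_sets = [set(fs) for fs in intersect_sets]
print(plain_sets)
# expected roughly: [{1, 2}]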

Apply Pandas series string function to the whole dataframe

I want to apply the method pd.Series.str.join() to my whole dataframe
A B
[foo,bar] [1,2]
[bar,foo] [3,4]
Desired output:
A B
foobar 12
barfoo 34
For now I used a quite slow method:
a = [df[x].str.join('') for x in df.columns]
I tried
df.apply(pd.Series.str.join)
and
df.agg(pd.Series.str.join)
and
df.applymap(str.join)
but none of them seems to work. As an extension of the question, how can I efficiently apply a Series method to the whole dataframe?
Thank you.
There will always be a problem when trying to join lists that contain numeric values, which is why I suggest we first turn them into strings. Afterwards, we can solve it with a nested list comprehension:
import pandas as pd

df = pd.DataFrame({'A':[['Foo','Bar'],['Bar','Foo']],'B':[[1,2],[3,4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])
df_new = pd.DataFrame([[''.join(x) for x in df[i]] for i in df], index=df.columns).T
Which correctly outputs:
A B
FooBar 12
BarFoo 34
import pandas as pd
df=pd.DataFrame({'A':[['foo','bar'],['bar','foo']],'B':[[1,2],[3,4]]})
# Needed if 'B' contains lists of integers; otherwise this step can be skipped
df['B']=df['B'].transform(lambda value: [str(x) for x in value])
df=df.applymap(lambda value:''.join(value))
Explanation: applymap() applies a function to every value of the dataframe.
I came up with this solution:
df_sum = df_sum.stack().str.join('').unstack()
I have quite a big dataframe, so a for loop is not really scalable.
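A minimal sketch of this stack/unstack approach, assuming the numeric column has already been converted to lists of strings (str.join returns NaN for lists containing non-strings):
import pandas as pd

df_sum = pd.DataFrame({'A': [['foo', 'bar'], ['bar', 'foo']],
                       'B': [['1', '2'], ['3', '4']]})
df_sum = df_sum.stack().str.join('').unstack()
print(df_sum)
# expected roughly:
#         A   B
# 0  foobar  12
# 1  barfoo  34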

filtering a dataframe on values in a list

I have the below data frame:
I want to filter wherever there is 11 in claim_status
and aa1 in claim_status_reason.
I am trying the below code, but it simply gives me all the rows:
my_list = 'aa1'
df[df['claim_status_reason'].str.contains( "|".join(my_list), regex=True)].reset_index(drop=True)
Expected output:
1.) where there is 11 in claim_status
2.) where there is aa1 in claim_status_reason
You can use apply to obtain your desired filter like:
df[(df['claim_status'].apply(lambda x: 11 in x)) & (df['claim_status_reason'].apply(lambda x: 'aa1' in x))]
Don't use string operations on lists within Series; use list comprehensions instead. This data structure choice is anti-pandas: you should avoid putting lists in Series in the first place, as these operations are not vectorisable.
import numpy as np

mask1 = np.array([11 in x for x in df['claim_status']])
mask2 = np.array(['aa1' in x for x in df['claim_status_reason']])
df = df[mask1 & mask2]
