Merge specific rows in a pandas DataFrame - Python

I have a df after read_excel where some values (from one string column) are split across several rows. How can I merge them back?
For example, the df I have:
{'CODE': ['A', None, 'B', None, None, 'C'],
'TEXT': ['A', 'a', 'B', 'b', 'b', 'C'],
'NUMBER': ['1', None, '2', None, None, '3']}
The df I want:
{'CODE': ['A','B','C'],
'TEXT': ['Aa','Bbb','C'],
'NUMBER': ['1','2','3']}
I can't find the right solution. I tried importing the data in different ways, but that did not help either.

You can forward fill the missing values (Nones) in CODE to form groups, then aggregate TEXT with join and take the first non-None value for the NUMBER column:
import pandas as pd

d = {'CODE': ['A', None, 'B', None, None, 'C'],
     'TEXT': ['A', 'a', 'B', 'b', 'b', 'C'],
     'NUMBER': ['1', None, '2', None, None, '3']}
df = pd.DataFrame(d)

df1 = df.groupby(df['CODE'].ffill()).agg({'TEXT': ''.join, 'NUMBER': 'first'}).reset_index()
print(df1)
CODE TEXT NUMBER
0 A Aa 1
1 B Bbb 2
2 C C 3
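The grouping key here is the forward-filled CODE column; a quick look at what ffill produces for this data:

print(df['CODE'].ffill().tolist())
# ['A', 'A', 'B', 'B', 'B', 'C']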
You can also generate the aggregation dictionary dynamically:
cols = df.columns.difference(['CODE'])
d1 = dict.fromkeys(cols, 'first')
d1['TEXT'] = ''.join
df1 = df.groupby(df['CODE'].ffill()).agg(d1).reset_index()
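A quick check of the dynamically built result. Note that Index.difference returns a sorted index, so the generated dict puts NUMBER before TEXT and the output column order differs slightly from the first version:

print(df1)
#   CODE NUMBER TEXT
# 0    A      1   Aa
# 1    B      2  Bbb
# 2    C      3    C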

Related

Iterate over rows in pandas dataframe. If blanks exist before a specific column, move all column values over

I am attempting to iterate over all rows in a pandas dataframe and shift the leftmost column values in each row to the right until all the non-null values in the row sit flush against a cutoff column. The amount of movement depends on the number of empty columns between the first null value and the cutoff column.
In this case I am attempting to 'close the gap' between the values in the leftmost columns and the cutoff column 'eee', so that they touch at column 'ddd'. The consecutive 'abc' values in each row should help to visualize the problem.
Column 'eee' and the columns to its right should not be touched or moved.
import pandas as pd

df = pd.DataFrame({
    'aaa': ['a', 'a', 'a', 'a', 'a', 'a'],
    'bbb': ['', 'b', 'b', 'b', '', 'b'],
    'ccc': ['', '', 'c', 'c', '', 'c'],
    'ddd': ['', '', '', 'd', '', ''],
    'eee': ['b', 'c', 'd', 'e', 'b', 'd'],
    'fff': ['c', 'd', 'e', 'f', 'c', 'e'],
    'ggg': ['d', 'e', 'f', 'g', 'd', 'f']
})
In rows 1 and 5, 'a' would be moved over three column indices to column 'ddd'.
In row 2, ['a', 'b'] would be moved over two column indices to columns ['ccc', 'ddd'] respectively.
etc.
finalOutput = {
    'aaa': ['', '', '', 'a', '', ''],
    'bbb': ['', '', 'a', 'b', '', 'a'],
    'ccc': ['', 'a', 'b', 'c', '', 'b'],
    'ddd': ['a', 'b', 'c', 'd', 'a', 'c'],
    'eee': ['b', 'c', 'd', 'e', 'b', 'd'],
    'fff': ['c', 'd', 'e', 'f', 'c', 'e'],
    'ggg': ['d', 'e', 'f', 'g', 'd', 'f']
}
You can do this:
import numpy as np
from collections import Counter

keep_cols = df.columns[0:df.columns.get_loc('eee')]  # columns strictly left of 'eee'
df.loc[:, keep_cols] = [np.roll(v, Counter(v)['']) for v in df[keep_cols].values]
print(df)
aaa bbb ccc ddd eee fff ggg
0 a b c d
1 a b c d e
2 a b c d e f
3 a b c d e f g
4 a b c d
5 a b c d e f
Explanation:
You want to consider only the columns to the left of 'eee', so you keep those columns, stored in keep_cols.
Next, each row needs to be shifted right by some amount; for the shift I used NumPy's roll. But by how much? By the number of blank values in the row, which Counter from collections provides.
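Since the blanks in each of these rows sit together at the right-hand end of keep_cols, rolling right by the blank count wraps them around to the front, pushing the values flush against 'eee'. A minimal illustration of the two building blocks:

from collections import Counter
import numpy as np

row = np.array(['a', 'b', '', ''])  # blanks at the right end, as in the data above
shift = Counter(row)['']            # 2 blanks -> roll right by 2
print(np.roll(row, shift))          # ['' '' 'a' 'b']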

Comparing values between dataframes

I'm trying to find a way to compare the equality of values between two dataframes that have different column names.
import pandas as pd

label = {
    'aoo': ['a', 'b', 'c'],
    'boo': ['a', 'b', 'c'],
    'coo': ['a', 'b', 'c'],
    'label': ['label', 'label', 'label']
}
unlabel = {
    'unlabel1': ['a', 'b', 'c'],
    'unlabel2': ['a', 'b', 'c'],
    'unlabel3': ['a', 'b', 'hhh']
}
label = pd.DataFrame(label)
unlabel = pd.DataFrame(unlabel)
The desired output is a dataframe that contains only the columns whose values are all equal, plus the label column. Since a single value differs in unlabel['unlabel3'], I don't want to keep that column in the output.
desired_output = {
    'unlabel1': ['a', 'b', 'c'],
    'unlabel2': ['a', 'b', 'c'],
    'label': ['label', 'label', 'label']
}
If the labels were numbers I could try np.where, but I can't find a similar helper for strings.
Could you help?
Thanks
You can use pd.merge and specify the columns to merge on with left_on and right_on:
out = (unlabel.merge(label,
                     left_on=['unlabel1', 'unlabel2', 'unlabel3'],
                     right_on=['aoo', 'boo', 'coo'],
                     how='left')
              .drop(['unlabel3', 'aoo', 'boo', 'coo'], axis=1))
print(out)
unlabel1 unlabel2 label
0 a a label
1 b b label
2 c c NaN
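If the comparison is strictly row-by-row (row i of unlabel against row i of label), a hedged alternative is to compare the columns elementwise and keep only the ones that match everywhere; string equality works fine here, no np.where needed. A sketch, assuming the column pairing shown above:

# pair each unlabel column with its label counterpart (assumed mapping)
pairs = dict(zip(['unlabel1', 'unlabel2', 'unlabel3'], ['aoo', 'boo', 'coo']))
equal_cols = [u for u, l in pairs.items() if unlabel[u].eq(label[l]).all()]
out = unlabel[equal_cols].assign(label=label['label'])
print(out)
#   unlabel1 unlabel2  label
# 0        a        a  label
# 1        b        b  label
# 2        c        c  label

Unlike the merge above, this drops the mismatched column entirely and keeps the label on every row, which matches the desired_output.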

Reverse the group/items in Python

I have a table like this:
Group   Item
A       a, b, c
B       b, c, d
And I want to convert it to this:

Item   Group
a      A
b      A, B
c      A, B
d      B
What is the best way to achieve this?
Thank you!!
If you are working in pandas, you can use explode to unpack the items, then aggregate with a tolist lambda at the grouping stage.
Here is the documentation for the explode method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html
import pandas as pd
df = pd.DataFrame(data={'Group': ['A', 'B'], 'Item': [['a','b','c'], ['b','c','d']]})
Exploding:
df.explode('Item').reset_index(drop=True).to_dict(orient='records')
[{'Group': 'A', 'Item': 'a'},
{'Group': 'A', 'Item': 'b'},
{'Group': 'A', 'Item': 'c'},
{'Group': 'B', 'Item': 'b'},
{'Group': 'B', 'Item': 'c'},
{'Group': 'B', 'Item': 'd'}]
Exploding and then grouping with a tolist lambda:
df.explode('Item').groupby('Item')['Group'].apply(lambda x: x.tolist()).reset_index().to_dict(orient='records')
[{'Item': 'a', 'Group': ['A']},
{'Item': 'b', 'Group': ['A', 'B']},
{'Item': 'c', 'Group': ['A', 'B']},
{'Item': 'd', 'Group': ['B']}]
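If you want Group back as comma-separated strings, matching the desired table above, a small variation (a sketch):

df.explode('Item').groupby('Item')['Group'].agg(', '.join).reset_index()
#   Item Group
# 0    a     A
# 1    b  A, B
# 2    c  A, B
# 3    d     B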
Not the most efficient, but very short:
>>> table = {'A': ['a', 'b', 'c'], 'B': ['b', 'c', 'd']}
>>> reversed_table = {v: [k for k, vs in table.items() if v in vs] for v in set(v for vs in table.values() for v in vs)}
>>> print(reversed_table)
{'b': ['A', 'B'], 'c': ['A', 'B'], 'd': ['B'], 'a': ['A']}
With dictionaries, you would typically approach it like this:
table = {'A': ['a', 'b', 'c'], 'B': ['b', 'c', 'd']}
revtable = dict()
for v, keys in table.items():
    for k in keys:
        revtable.setdefault(k, []).append(v)
print(revtable)
# {'a': ['A'], 'b': ['A', 'B'], 'c': ['A', 'B'], 'd': ['B']}
Assuming that your tables are in the form of a pandas dataframe, you could try something like this:
import pandas as pd
import numpy as np
# Create initial dataframe
data = {'Group': ['A', 'B'], 'Item': [['a','b','c'], ['b','c','d']]}
df = pd.DataFrame(data=data)
Group Item
0 A [a, b, c]
1 B [b, c, d]
# Expand number of rows based on list column ("Item") contents
list_col = 'Item'
df = pd.DataFrame({
    col: np.repeat(df[col].values, df[list_col].str.len())
    for col in df.columns.drop(list_col)
}).assign(**{list_col: np.concatenate(df[list_col].values)})[df.columns]
Group Item
0 A a
1 A b
2 A c
3 B b
4 B c
5 B d
*The above snippet is taken from an existing answer, which includes a more detailed explanation of the code.
# Perform groupby operation
df = df.groupby('Item')['Group'].apply(list).reset_index(name='Group')
Item Group
0 a [A]
1 b [A, B]
2 c [A, B]
3 d [B]

How to do a distinct count of one field, grouped by another in Pandas

If I wanted to create a dataframe in pandas that is equivalent to this SQL, how would I do it?
SELECT COUNTRY, COUNT(DISTINCT PRODUCT) AS UNIQUE_PRODUCTS
FROM SALES
GROUP BY COUNTRY
import pandas as pd

df = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Product': ['X', 'X', 'X', 'Y', 'Z', 'Y', 'Z']
})
df.groupby('Country').Product.nunique()

Country
A    1
B    3
C    2
Name: Product, dtype: int64
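If you also want the SQL alias UNIQUE_PRODUCTS and a flat dataframe rather than a Series, named aggregation (available since pandas 0.25) is one option; a sketch:

out = df.groupby('Country', as_index=False).agg(UNIQUE_PRODUCTS=('Product', 'nunique'))
print(out)
#   Country  UNIQUE_PRODUCTS
# 0       A                1
# 1       B                3
# 2       C                2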

Lowercase columns by name using dataframe method

I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name, to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
# a b c
#0 a b C
#1 a b c
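Note that assign returns a new DataFrame and leaves df itself unchanged, which is what "without assigning to the selected columns" asks for; rebind the result if you want to keep it. A sketch:

df = df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))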
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient assuming you have more rows than columns.
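One caveat given the NaNs mentioned in the question: .str.lower() leaves NaN untouched, while astype(str) first turns NaN into the literal string 'nan'. A minimal check:

import numpy as np
import pandas as pd

s = pd.Series(['A', np.nan])
print(s.str.lower().tolist())              # ['a', nan]
print(s.astype(str).str.lower().tolist())  # ['a', 'nan']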
