I have a dataframe "bb" like this:
Response Unique Count
I love it so much! 246_0 1
This is not bad, but can be better. 246_1 2
Well done, let's do it. 247_0 1
If Count is larger than 1, I would like to split the string on the comma and make the dataframe "bb" become this (the result I expect):
Response Unique
I love it so much! 246_0
This is not bad 246_1_0
but can be better. 246_1_1
Well done, let's do it. 247_0
My code:
from pandas import DataFrame
import numpy as np

bb = DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(), index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response', 'Unique']
bb = bb.replace('', np.nan)
bb = bb.dropna()
print(bb)
But the result is like this:
Response Unique
0 This is not bad 246_1
1 but can be better. 246_1
How can I keep the original dataframe in this case?
First split only the values matching the condition into a new helper Series, then add counter values with GroupBy.cumcount, but only for the duplicated index values identified by Index.duplicated:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0
EDIT: If you need the _0 suffix for all other values as well, remove the mask:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0_0
Step by step, we can solve this problem as follows:
Split your dataframe by Count.
Use the explode_str function from the linked answer (shown below) to explode the string to rows.
Group by the index and use cumcount to get the correct Unique column values.
Finally, concat the dataframes together again.
df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1
df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter
# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)
df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0
Function used from linked answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # repeat each positional index once per part (separator count + 1)
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # duplicate the rows, then overwrite the column with the flattened parts
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
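On newer pandas (0.25+), the same idea can also be written with DataFrame.explode; a minimal self-contained sketch of that variant (my addition, not from the answers above):

import pandas as pd

# Toy frame matching the question's data
df = pd.DataFrame({
    'Response': ['I love it so much!', 'This is not bad, but can be better.',
                 "Well done, let's do it."],
    'Unique': ['246_0', '246_1', '247_0'],
    'Count': [1, 2, 1],
})

df['Response'] = df['Response'].str.split(',')
out = df.explode('Response').reset_index()   # keep the old index as a column
dup = out.duplicated('index', keep=False)    # rows that came from a split
cc = out.groupby('index').cumcount().astype(str)
out.loc[dup, 'Unique'] = out.loc[dup, 'Unique'] + '_' + cc[dup]
out = out.drop(columns=['index', 'Count'])
print(out)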
I have the following code:
import pandas.util.testing as testing
df = testing.makeDataFrame()
df
With this I have created 2 dataframes, with one dataframe having 2 fewer lines than the original one.
This is df - Original
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
71ZFgWaeDE 1.887420 0.337702 -0.176539 0.149089
alWOjkQ2eZ 1.997701 -0.354276 1.997802 -0.086803
This is df1 - with 2 less lines
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
What I am trying to do is to remove all the rows which are not common between the two dataframes. To do this, we find the indices common to both:
duplicates = set(df.index).intersection(df1.index)
Could you please advise how I can remove the rows whose index is not in duplicates?
If you want to remove the indices in place:
idx = df.index.difference(df1.index)
df.drop(idx, inplace=True)
If you want to create a new object:
idx = df.index.intersection(df1.index)
new_df = df.loc[idx]
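For reference, here are both approaches in a minimal runnable form, with tiny made-up frames standing in for df and df1:

import pandas as pd

# Hypothetical miniature frames with overlapping indices
df = pd.DataFrame({'A': range(4)}, index=['w', 'x', 'y', 'z'])
df1 = pd.DataFrame({'A': range(2)}, index=['w', 'x'])

# New object built from the shared index
idx = df.index.intersection(df1.index)
new_df = df.loc[idx]

# Or drop the non-common rows in place
idx = df.index.difference(df1.index)
df.drop(idx, inplace=True)

print(new_df.index.equals(df.index))  # True: both keep only ['w', 'x']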
I have the following columns, among others, in my dataframe: 'dom_pop', 'an_dom_n', 'an_dom_ncmplt'. Equivalent columns exist in multiple dataframes, with the suffix changing. For example, in another dataframe they may be called 'pa_pop', 'an_pa_n', 'an_pa_ncmplt'. I want to append '_kwh' to these cols across all my dataframes.
I wrote the following code:
cols = ['_n$', '_ncmplt', '_pop']  # the $ is added to indicate a string ending in _n
filterfuel = 'kwh'
for c in cols:
    dfdom.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfdom.columns]
    dfpa.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfpa.columns]
    dfsw.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfsw.columns]
kwh gets appended to the _ncmplt and _pop cols, but not to the _n column. If I remove the $, _n gets appended, but then _ncmplt looks like 'an_dom_n_kwh_cmplt'.
For dfdom the corrected names should look like 'dom_pop_kwh', 'an_dom_n_kwh', 'an_dom_ncmplt_kwh'.
Why is $ not being recognised as an end-of-string anchor?
Python's built-in str.replace (which is what col.replace calls here) is purely literal and never interprets regular expressions, so the $ is treated as an ordinary character that matches nothing. Instead, you can use np.where with a regex:
cols = ['_n$', '_ncmplt', '_pop']
filterfuel = 'kwh'
pattern = fr"(?:{'|'.join(cols)})"
for df in [dfdom, dfpa, dfsw]:
    df.columns = np.where(df.columns.str.contains(pattern, regex=True),
                          df.columns + f"_{filterfuel}", df.columns)
Output:
>>> pattern
'(?:_n$|_ncmplt|_pop)'
# dfdom = pd.DataFrame([[0]*4], columns=['dom_pop', 'an_dom_n', 'an_dom_ncmplt', 'hello'])
# After:
>>> dfdom
dom_pop_kwh an_dom_n_kwh an_dom_ncmplt_kwh hello
0 0 0 0 0
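An alternative sketch (my addition): unlike the plain str.replace in the question, the Index.str accessor does understand regexes, so the $ anchor works there. This assumes the same cols and filterfuel as above:

pattern = fr"({'|'.join(cols)})"  # '(_n$|_ncmplt|_pop)'
for df in [dfdom, dfpa, dfsw]:
    df.columns = df.columns.str.replace(pattern, rf"\1_{filterfuel}", regex=True)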
I have the following format of a csv file:
id a_mean_val_1 a_mean_val_2 a_var_val_1 a_var_val_2 b_mean_val_1 b_mean_val_2 b_var_val_1 b_var_val_2
I would like to melt the columns 1 and 2 for all a and b features into rows as follows:
id a_mean a_var b_mean b_var
1 val1 val1 val1 val1
1 val2 val2 val2 val2
I am unsure how to achieve this with the melt function in pandas, where I could basically have an expression that keeps the base name (e.g. a_mean) as the root column and melts everything with a suffix for that variable into rows.
Is there another method I could use to specify these rules?
Thank you
Like this:
import pandas

rows = []
for line in open('mycsv.csv'):
    fields = line.strip().split(',')
    rows.append(fields[0::2])
    rows.append(fields[1::2])
# assumes no header row and no id column (see the comment below)
df = pandas.DataFrame(rows, columns=['a_mean', 'a_var', 'b_mean', 'b_var'])
That doesn't provide an ID number. Is the ID part of the CSV file?
I went through the columns, and if a column belonged to one of the base columns, I appended its values to a list. Finally, I converted those lists to a dataframe.
So this code works regardless of the order of the columns.
[UPDATED WITH ID]
Since we're adding the entire columns one after the other, the ids will always start from the top, go to the end, and then repeat. So we can take the "id" column of the original df and repeat it once per suffix group to get the "id" for the new df (with two suffixes here, that happens to equal the number of rows).
Here's the CSV I used:
id,a_mean_val_1,a_mean_val_2,a_var_val_1,a_var_val_2,b_mean_val_1,b_mean_val_2,b_var_val_1,b_var_val_2
1,a_mean_val_1, a_mean_val_2, a_var_val_1, a_var_val_2, b_mean_val_1 ,b_mean_val_2, b_var_val_1, b_var_val_2
2,a_mean_val_5, a_mean_val_6, a_var_val_5, a_var_val_6, b_mean_val_5 ,b_mean_val_6, b_var_val_5, b_var_val_6
import pandas as pd

df = pd.read_csv('data_csv.csv')
# Ignore ID
columns = df.columns.tolist()[1:]
df_dict = {}
base = ['a_mean', 'a_var', 'b_mean', 'b_var']
for bas in base:
    df_dict[bas] = []
    for col in columns:
        # for example, "a_mean" is in "a_mean_val_1", so append
        if bas in col:
            df_dict[bas] = df_dict[bas] + df[col].tolist()
ids = df['id'].tolist()
df_new = pd.DataFrame(df_dict)
# Repeat the ids once per suffix group (2 groups here), not per row count
df_new['id'] = ids * (len(df_new) // len(ids))
a_mean a_var b_mean b_var id
a_mean_val_1 a_var_val_1 b_mean_val_1 b_var_val_1 1
a_mean_val_5 a_var_val_5 b_mean_val_5 b_var_val_5 2
a_mean_val_2 a_var_val_2 b_mean_val_2 b_var_val_2 1
a_mean_val_6 a_var_val_6 b_mean_val_6 b_var_val_6 2
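As an aside, pandas' built-in pd.wide_to_long can do this reshape directly; a minimal sketch assuming the same data_csv.csv as above (the _val part is stripped from the column names afterwards):

import pandas as pd

df = pd.read_csv('data_csv.csv')
long_df = pd.wide_to_long(
    df,
    stubnames=['a_mean_val', 'a_var_val', 'b_mean_val', 'b_var_val'],
    i='id', j='num', sep='_',  # the suffix defaults to '\d+', matching _1/_2
).reset_index()
long_df = long_df.rename(columns=lambda c: c.replace('_val', ''))
print(long_df)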
I have the following df.
data = [
['DWWWWD'],
['DWDW'],
['WDWWWWWWWWD'],
['DDW'],
['WWD'],
]
df = pd.DataFrame(data, columns=['letter_sequence'])
I want to subset the rows that contain the pattern 'D' + '[whichever number of W's]' + 'D'. Examples of rows I want in my output df: DWD, DWWWWWWWWWWWD, WWWWWDWDW...
I came up with the following, but it does not really work for 'whichever number of W's'.
df[df['letter_sequence'].str.contains(
'DWD|DWWD|DWWWD|DWWWWD|DWWWWWD|DWWWWWWD|DWWWWWWWD|DWWWWWWWWD', regex=True
)]
Desired output new_df:
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
Any alternatives?
Use [W]{1,} for one or more W; regex=True is the default, so it can be omitted:
df = df[df['letter_sequence'].str.contains('D[W]{1,}D')]
print (df)
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
You can use the regex D\w+D. Note that \w matches any word character (letter, digit, or underscore), so it is broader than a run of W's; use DW+D if only W should be allowed between the D's.
The code is shown below:
df = df[df['letter_sequence'].str.contains(r'D\w+D')]
Please let me know if it helps.
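To see the difference between the two patterns (my illustration, not part of the original answers):

import pandas as pd

s = pd.Series(['DWWD', 'DxD'])
print(s.str.contains(r'D\w+D').tolist())  # [True, True]  -- \w matches any word char
print(s.str.contains(r'DW+D').tolist())   # [True, False] -- only literal W runs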
I am web-scraping tables from a website, and I am putting it to the Excel file.
My goal is to split a column into 2 columns in the correct way.
The column I want to split: "FLIGHT"
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So, I need to separate the FIRST 2 characters into the first column, and after that the remaining characters, which can be 1-4 long, into the second column. If there are 4, it's okay, I keep them; if 3, I want to put one 0 before them; if 2, I want to put 00 before them (so my goal is to get exactly 4 characters/numbers in the second column).
How Can I do this?
Here is my relevant code, which already contains some formatting logic.
df2 = pd.DataFrame(datatable, columns=cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
    df1 = pd.read_csv("output.csv", sep=";")
    df4 = pd.concat([df1, df3])
    df4.to_csv("output.csv", index=False, sep=";")
else:
    df3.to_csv("output.csv", index=False, sep=";")
Here is a screenshot of my table in Excel:
You can use string indexing with .str together with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
FLIGHT a b
0 KL744 KL 0744
1 BE1013 BE 1013
I believe in your code you need:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...
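If the airline code is always exactly two letters, a variant with str.extract is also possible; a sketch (my addition, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'FLIGHT': ['KL744', 'BE1013']})
# One regex pass: two leading letters, then the flight number padded to 4 digits
parts = df['FLIGHT'].str.extract(r'^([A-Z]{2})(\d+)$')
df['a'] = parts[0]
df['b'] = parts[1].str.zfill(4)
print(df)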