I have a problem with pandas: from a CSV file with multiple columns I would like to generate a new CSV file with a single column containing all the values, as in the example below:
From:
column 1, column 2, column 3
1, 2, 3
4, 5, 6
I would like to obtain:
column 1
1
2
3
4
5
6
Thank you all in advance
You can use ravel on DataFrame values:
pd.DataFrame({'column': df.values.ravel()})
Output:
column
0 1
1 2
2 3
3 4
4 5
5 6
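Since the question is about going from one CSV file to another, here is a minimal sketch of the full round trip (the file and column names are stand-ins for your actual ones; the in-memory CSV text just mirrors the example data):

```python
import pandas as pd
from io import StringIO

# stand-in for reading your actual multi-column file, e.g. pd.read_csv("input.csv")
csv_text = "column 1, column 2, column 3\n1, 2, 3\n4, 5, 6\n"
df = pd.read_csv(StringIO(csv_text), skipinitialspace=True)

# flatten all values row by row into a single column
flat = pd.DataFrame({"column 1": df.to_numpy().ravel()})

# index=False keeps the output file to exactly one column
flat.to_csv("output.csv", index=False)
```

`skipinitialspace=True` handles the spaces after the commas in the header and data.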
You can try this:
import pandas as pd
df = pd.DataFrame({"column1": [1, 4], "column2": [2, 5], "column3": [3, 6]})
print("df : \n", df)
reshaped_value = df.values.reshape(-1)
new_df = pd.DataFrame(columns=["column1"])
new_df["column1"] = reshaped_value
print("\nnew_df : \n", new_df)
Output:
df :
column1 column2 column3
0 1 2 3
1 4 5 6
new_df :
column1
0 1
1 2
2 3
3 4
4 5
5 6
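Both answers flatten row by row, which matches the desired output here. If you ever need the values column by column instead, `ravel` accepts NumPy's `order='F'` flag; a small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 4], "column2": [2, 5], "column3": [3, 6]})

# default order is row-major: 1, 2, 3, 4, 5, 6
row_major = df.to_numpy().ravel()

# order='F' walks column by column instead: 1, 4, 2, 5, 3, 6
col_major = df.to_numpy().ravel(order='F')
```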
Related
Is there a way to drop values in one column based on a comparison with another column, assuming the columns are of equal length?
For example, iterate through each row and drop values in col1 greater than values in col2? Something like this:
df['col1'].drop.where(df['col1'] >= df['col2'])
Pandas compare columns and drop rows based on values in another column
import pandas as pd
d = {
'1': [1, 2, 3, 4, 5],
'2': [2, 4, 1, 6, 3]
}
df = pd.DataFrame(d)
print(df)
dfd = df.drop(df[(df['1'] >= df['2'])].index)
print('update')
print(dfd)
Output
1 2
0 1 2
1 2 4
2 3 1
3 4 6
4 5 3
update
1 2
0 1 2
1 2 4
3 4 6
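As a side note, the same filtering can be written as a plain boolean mask, which avoids the drop/index round trip; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'1': [1, 2, 3, 4, 5], '2': [2, 4, 1, 6, 3]})

# keep only the rows where column '1' is strictly smaller than column '2'
kept = df[df['1'] < df['2']]
```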
Here is the original DataFrame: I want rows in which column 4 contains ":" to have that value replaced by shifting in the values from columns 5 to 9 [image 1].
Here is what I want the result to look like [image 2].
I tried my best but can't figure it out.
I tried to recreate the problem with a dummy data.
data = {'col_1': ['1:', '1:', '1:', 1, 1],
'col_2': [2, 2, 2, 2, 2],
'col_3': [':', ":" , 3, 3, ":"],
'col_4': [4, 4, 4, 4, 4],
'col_5': [5, 5, 5, 5, 5]}
df = pd.DataFrame.from_dict(data)
This is what it looks like.
print(df)
col_1 col_2 col_3 col_4 col_5
0 1: 2 : 4 5
1 1: 2 : 4 5
2 1: 2 3 4 5
3 1 2 3 4 5
4 1 2 : 4 5
If your data is similar to the dummy data I created, the following code works:
col_names = df.columns
for idx, col in enumerate(col_names):  # loop over every column
    indices = df[df[col] == ":"].index  # indices of rows that have a ":" value
    if len(indices) > 0:  # only shift if some rows have ":" values
        for i in range(idx, len(col_names) - 1):  # shift the columns left one by one
            df.loc[indices, col_names[i]] = df.loc[indices, col_names[i + 1]]
            df.loc[indices, col_names[i + 1]] = ""  # leave the freed-up last column empty
After running the code, your dataframe should be shifted, like this:
print(df)
col_1 col_2 col_3 col_4 col_5
0 1: 2 4 5
1 1: 2 4 5
2 1: 2 3 4 5
3 1 2 3 4 5
4 1 2 4 5
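If the nested loop is slow on a large frame, an alternative is to rebuild each row with the ":" cells removed and the remainder left-aligned. A sketch on the same dummy data (the `left_align` helper is my own, not a pandas API):

```python
import pandas as pd

data = {'col_1': ['1:', '1:', '1:', 1, 1],
        'col_2': [2, 2, 2, 2, 2],
        'col_3': [':', ':', 3, 3, ':'],
        'col_4': [4, 4, 4, 4, 4],
        'col_5': [5, 5, 5, 5, 5]}
df = pd.DataFrame(data)

def left_align(row):
    # drop every ':' cell, then pad the row back out with empty strings
    vals = [v for v in row if v != ':']
    return pd.Series(vals + [''] * (len(row) - len(vals)), index=row.index)

shifted = df.apply(left_align, axis=1)
```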
After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first, so we can set_index after the merging:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
df.assign(idx=df.index)
.merge(df2, on='A')
.sort_values('order')
.set_index('idx')
.drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
The output is (note that this keeps the merged frame's own index rather than the original one):
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3
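If the values in 'A' are unique (as in the example), a merge isn't strictly needed: `Index.get_loc` can map each list value to a row position directly, which preserves the original index, keeps repetitions, and skips values not present. A sketch under that uniqueness assumption:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
list_of_values = [3, 4, 6, 4, 3, 8]

# map each value in the list to its row position, ignoring misses (e.g. 8)
idx = pd.Index(df['A'])
positions = [idx.get_loc(v) for v in list_of_values if v in idx]

# iloc keeps the original index labels, including repeats
result = df.iloc[positions]
```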
How can I add columns of two dataframes (A + B), so that the result (C) takes into account missing values ('---')?
DataFrame A
a = pd.DataFrame({'A': [1, 2, 3, '---', 5]})
A
0 1
1 2
2 3
3 ---
4 5
DataFrame B
b = pd.DataFrame({'B': [3, 4, 5, 6, '---']})
B
0 3
1 4
2 5
3 6
4 ---
Desired Result of A+B
C
0 4
1 6
2 8
3 ---
4 ---
Replace the '---' with np.nan, add the columns, and fill the resulting NaNs back with '---':
import numpy as np
(a['A'].replace('---', np.nan) + b['B'].replace('---', np.nan)).fillna('---')
You can assign the result to a new dataframe or an existing one (note that assign returns a new DataFrame, so capture its result):
df = pd.DataFrame().assign(C=(a['A'].replace('---', np.nan) + b['B'].replace('---', np.nan)).fillna('---'))
OR
a = a.assign(C=(a['A'].replace('---', np.nan) + b['B'].replace('---', np.nan)).fillna('---'))
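A slightly more general variant, in case the columns can contain other non-numeric placeholders besides '---': `pd.to_numeric` with `errors='coerce'` turns anything non-numeric into NaN before the addition. A sketch:

```python
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3, '---', 5]})
b = pd.DataFrame({'B': [3, 4, 5, 6, '---']})

# coerce any non-numeric cell (not just '---') to NaN, add, then restore the marker
total = pd.to_numeric(a['A'], errors='coerce') + pd.to_numeric(b['B'], errors='coerce')
c = total.fillna('---')
```

Note the numeric results come back as floats (4.0, 6.0, ...), since a column containing NaN is promoted to float.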
How can I merge duplicate DataFrame columns and also keep all original column names?
e.g. If I have the DataFrame
df = pd.DataFrame({"col1" : [0, 0, 1, 2, 5, 3, 7],
"col2" : [0, 1, 2, 3, 3, 3, 4],
"col3" : [0, 1, 2, 3, 3, 3, 4]})
I can remove the duplicate columns (yes the transpose is slow for large DataFrames) with
df.T.drop_duplicates().T
but this only preserves one column name per unique column
col1 col2
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
How can I keep the information on which columns were merged? e.g. something like
[col1] [col2, col3]
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
Thanks!
# group columns by their values (each row's values supplies one grouping key)
grouped_columns = df.groupby(list(df.values), axis=1).apply(lambda g: g.columns.tolist())
# pick one column from each group of duplicate columns
unique_df = df.loc[:, grouped_columns.str[0]]
# a list can't serve as a column name, so join each group's names into one string
unique_df.columns = grouped_columns.apply("-".join)
unique_df
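Since `axis=1` in groupby is deprecated in recent pandas versions, the same grouping can also be done with a plain dictionary keyed on each column's value tuple; a sketch of that variant:

```python
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({"col1": [0, 0, 1, 2, 5, 3, 7],
                   "col2": [0, 1, 2, 3, 3, 3, 4],
                   "col3": [0, 1, 2, 3, 3, 3, 4]})

# group column names by their full value tuple; duplicate columns share a key
groups = defaultdict(list)
for col in df.columns:
    groups[tuple(df[col])].append(col)

# keep one representative column per group, renamed to the joined group names
merged = pd.DataFrame({'-'.join(cols): df[cols[0]] for cols in groups.values()})
```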
I also used T and tuple with groupby:
def f(x):
    d = x.iloc[[0]]
    d.index = ['-'.join(x.index.tolist())]
    return d

df.T.groupby(df.apply(tuple), group_keys=False).apply(f).T