I have a CSV file in the following format:
id a_mean_val_1 a_mean_val_2 a_var_val_1 a_var_val_2 b_mean_val_1 b_mean_val_2 b_var_val_1 b_var_val_2
I would like to melt the _1 and _2 columns for all a and b features into rows, as follows:
id a_mean a_var b_mean b_var
1 val1 val1 val1 val1
1 val2 val2 val2 val2
I am unsure how to achieve this with the melt function in pandas. Essentially, I want an expression that keeps the base name (e.g. a_mean) as the root column and melts everything with a suffix for that variable into rows.
Is there another method I could use to specify these rules?
Thank you
Like this:
import pandas

rows = []
with open('mycsv.csv') as f:
    next(f)  # skip the header row
    for line in f:
        fields = line.strip().split(',')
        rows.append(fields[1::2])  # the *_val_1 columns (the id is skipped)
        rows.append(fields[2::2])  # the *_val_2 columns
df = pandas.DataFrame(rows, columns=['a_mean', 'a_var', 'b_mean', 'b_var'])
That doesn't provide an ID number. Is the ID part of the CSV file?
I went through the columns and, if a column belonged to a base column, appended its values to a list. Finally, I converted those lists to a dataframe.
So this code works regardless of the order of the columns.
[UPDATED WITH ID]
Since we append the entire columns one after the other, the ids always start from the top, go to the end, and then repeat. So we can take the "id" column of the original df and repeat it once per value column in each group (here twice, for the _1 and _2 suffixes) to get the "id" for the new df.
Here's the CSV I used:
id,a_mean_val_1,a_mean_val_2,a_var_val_1,a_var_val_2,b_mean_val_1,b_mean_val_2,b_var_val_1,b_var_val_2
1,a_mean_val_1, a_mean_val_2, a_var_val_1, a_var_val_2, b_mean_val_1 ,b_mean_val_2, b_var_val_1, b_var_val_2
2,a_mean_val_5, a_mean_val_6, a_var_val_5, a_var_val_6, b_mean_val_5 ,b_mean_val_6, b_var_val_5, b_var_val_6
import pandas as pd

df = pd.read_csv('data_csv.csv')
# Ignore ID
columns = df.columns.tolist()[1:]
df_dict = {}
base = ['a_mean', 'a_var', 'b_mean', 'b_var']
for bas in base:
    df_dict[bas] = []
    for col in columns:
        # for example, "a_mean" is in "a_mean_val_1", so append
        if bas in col:
            df_dict[bas] = df_dict[bas] + df[col].tolist()
ids = df['id'].tolist()
df_new = pd.DataFrame(df_dict)
# repeat the ids once per value column in each group (here twice)
df_new['id'] = ids * (len(columns) // len(base))
a_mean a_var b_mean b_var id
a_mean_val_1 a_var_val_1 b_mean_val_1 b_var_val_1 1
a_mean_val_5 a_var_val_5 b_mean_val_5 b_var_val_5 2
a_mean_val_2 a_var_val_2 b_mean_val_2 b_var_val_2 1
a_mean_val_6 a_var_val_6 b_mean_val_6 b_var_val_6 2
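For reference, pandas has a built-in pd.wide_to_long that can do this reshape directly, treating '_val_' as the separator between the base name and the numeric suffix. A minimal sketch, assuming the sample CSV above:
import pandas as pd

df = pd.read_csv('data_csv.csv')
# stubnames are the base column names; sep='_val_' strips the suffix marker
df_long = pd.wide_to_long(
    df,
    stubnames=['a_mean', 'a_var', 'b_mean', 'b_var'],
    i='id', j='n', sep='_val_'
).reset_index()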
I have this dataframe:
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
I would like to work on two separate dataframes according to the value (id) of the first column. Ideally, I would like to have:
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
and
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
As you can see I have one dataframe with all 87 in the first column and another with 86.
This is how I read the dataframe:
dfr = pd.read_csv(fname,sep=',',index_col=False,header=None)
I think that groupby is not the right option, if I have understood the command correctly.
I was thinking about query, as in:
aa = dfr.query(dfr.iloc[:,0]==86)
However, I have this error:
expr must be a string to be evaluated, <class 'pandas.core.series.Series'> given
You can simply slice your dataframe:
df_86 = df.loc[df['ColName'] == 86,:]
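Applied to the question's frame, which was read with header=None so the first column is labeled 0, that slice would be:
df_86 = dfr.loc[dfr[0] == 86, :]
df_87 = dfr.loc[dfr[0] == 87, :]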
Another way is to do it dynamically, without having to specify the groups beforehand.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.repeat([1, 2, 3], 4), 'col2': np.repeat([10, 11, 12], 4)})
Get the unique groupings:
groups = df['ID'].unique()
Create an empty dict to store the new data frames:
new_dfs = {}
Loop through and create new data frames from the slice:
for group in groups:
    name = "ID" + str(group)
    new_dfs[name] = df[df['ID'] == group]
new_dfs['ID1']
Which gives:
ID col2
0 1 10
1 1 10
2 1 10
3 1 10
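Incidentally, groupby does work here too: iterating over a groupby yields (key, sub-frame) pairs, so the same dict can be built in one line from the df above:
new_dfs = {'ID' + str(k): g for k, g in df.groupby('ID')}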
I have a dataframe "bb" like this:
Response Unique Count
I love it so much! 246_0 1
This is not bad, but can be better. 246_1 2
Well done, let's do it. 247_0 1
If Count is larger than 1, I would like to split the string and have the dataframe "bb" become this (the result I expected):
Response Unique
I love it so much! 246_0
This is not bad 246_1_0
but can be better. 246_1_1
Well done, let's do it. 247_0
My code:
bb = DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(), index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response','Unique']
bb=bb.replace('', np.nan)
bb=bb.dropna()
print(bb)
But the result is like this:
Response Unique
0 This is not bad 246_1
1 but can be better. 246_1
How can I keep the original dataframe in this case?
First split only the values matching the condition into a new helper Series, then add counter values by GroupBy.cumcount, but only for the duplicated index values identified by Index.duplicated:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0
EDIT: If you need _0 appended for all the other values too, remove the mask:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0_0
Step by step, we can solve this problem as follows:
Split your dataframe by Count
Use the explode_str function shown below (from the linked answer) to explode the string to rows
Groupby on the index and use cumcount to get the correct Unique column values
Finally, concat the dataframes back together
df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1
df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter
# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)
df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0
Function used from linked answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # repeat each row once per separator occurrence in its string
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # reassign the split values across the repeated rows
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
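On pandas 0.25 or later, this helper can arguably be replaced by the built-in DataFrame.explode after splitting the strings into lists; a rough equivalent of the explode step above:
df1 = df[df['Count'].ge(2)].copy()
df1['Response'] = df1['Response'].str.split(',')
df1 = df1.explode('Response')  # one row per list element, original index repeated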
I am web-scraping tables from a website and putting them into an Excel file.
My goal is to split a column into 2 columns in the correct way.
The column I want to split: "FLIGHT"
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So, I need to separate the FIRST 2 characters (into the first column), and after that the remaining characters, which are 1-2-3-4 characters long. If 4, it's okay, I keep it; if 3, I want to put a 0 before it; if 2, I want to put 00 before it (so my goal is to get 4 characters/numbers in the second column).
How Can I do this?
Here is my relevant code, which already contains some formatting code.
df2 = pd.DataFrame(datatable,columns = cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
    df1 = pd.read_csv("output.csv", sep=";")
    df4 = pd.concat([df1, df3])
    df4.to_csv("output.csv", index=False, sep=";")
else:
    df3.to_csv("output.csv", index=False, sep=";")
Here is a screenshot of my table in Excel:
You can use indexing with str together with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
FLIGHT a b
0 KL744 KL 0744
1 BE1013 BE 1013
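An equivalent single pass with str.extract, splitting the two leading letters from the digits by regex (this assumes the carrier code is always exactly two uppercase letters), might look like this:
df[['a', 'b']] = df['FLIGHT'].str.extract(r'^([A-Z]{2})(\d+)$')
df['b'] = df['b'].str.zfill(4)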
I believe in your code you need:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...
I am trying to write my own function in Python 3.5, but not having much luck.
I have a data frame that is 17 columns by 1,200 rows (tiny).
One of the columns is called "placement". Within this column, I have text contained in each row. The naming convention is as follows:
Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_
The following code works perfectly and does exactly what I need it to do; I just don't want to do this for every data set I have:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
df_detailed = df.join(df_detailed)
new_columns = *["Then i rename the columns labelled 0,1,2 etc"]*
df_detailed.columns = new_columns
df_detailed.head()
What I'm trying to do is build a function that takes any column with _ as the delimiter and splits it across new columns.
I have tried the following (but unfortunately, defining my own functions is something I'm horrible at).
def text_to_column(df):
    df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
    headings = df_detailed.columns
    headings.replace(" ", "_")
    df_detailed = df.join(df_detailed)
    df_detailed.columns = headings
    return (df)
and I get the following error "AttributeError: 'RangeIndex' object has no attribute 'replace'"
The end goal here is to write a function where I can pass the column name into the function; it separates the values contained within the column into new columns and then joins this back to my original Data Frame.
If I'm being ridiculous, please let me know. If someone can help me, it would be greatly appreciated.
Thanks,
Adrian
You need the rename function to replace column names:
headings = df_detailed.columns
headings.replace(" ", "_")
change to:
df_detailed = df_detailed.rename(columns=lambda x: x.replace(" ", "_"))
Or convert the columns to_series, because replace does not work on an Index (the column names):
headings.replace(" ", "_")
change to:
headings = headings.to_series().replace(" ", "_")
Also:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
can be changed to:
df_detailed = df['Placement'].str.rstrip('_').str.split('_', expand=True).astype(str)
EDIT:
Sample:
df = pd.DataFrame({'a': [1, 2], 'Placement': ['Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_', 'a_b_c_d_f_g_h_i_']})
print (df)
Placement a
0 Campaign_Publisher_Site_AdType_AdSize_Device_A... 1
1 a_b_c_d_f_g_h_i_ 2
# input is a DataFrame and a column name
def text_to_column(df, col):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    # replace column names if necessary
    df_detailed.columns = df_detailed.columns.to_series().replace(" ", "_")
    # remove the column and join the new df
    df_detailed = df.drop(col, axis=1).join(df_detailed)
    return df_detailed
df = text_to_column(df, 'Placement')
print (df)
a 0 1 2 3 4 5 6 7
0 1 Campaign Publisher Site AdType AdSize Device Audience Tactic
1 2 a b c d f g h i
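If named columns are preferred over 0..7, the function could take an optional list of names and relabel the split frame before the join; a small variant using the naming convention from the question:
def text_to_column(df, col, names=None):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    if names is not None:
        df_detailed.columns = names
    return df.drop(col, axis=1).join(df_detailed)

df = text_to_column(df, 'Placement', names=['Campaign', 'Publisher', 'Site', 'AdType', 'AdSize', 'Device', 'Audience', 'Tactic'])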
I need to give names to previously defined dataframes.
I have a list of dataframes :
liste_verif = ( dffreesurfer,total,qcschizo)
And I would like to give them a name by doing something like:
for h in liste_verif:
    h.name = str(h)
Would that be possible ?
When I test this code, it doesn't work: instead of considering h as a dataframe, Python considers each column of my dataframe.
I would like the names of my dataframes to be 'dffreesurfer', 'total', etc.
You can use a dict comprehension and map the DataFrames to the names in list L:
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer,total,qcschizo)
L = ['dffreesurfer','total','qcschizo']
dfs = {L[i]:x for i,x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
col1
0 7
1 8
print (dfs['total'])
col2
0 1
1 5
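Since the names and frames line up positionally, the same mapping can also be built with zip:
dfs = dict(zip(L, liste_verif))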