Change a dataframe with one header to two headers - python

I have a dataframe and I want to change it to the others which I attached. The name of columns are not important, also the new one does not have index 0,1,2.... This seems rather obvious, but I can't seem to figure out it. Note: I don't want to change the values and columns. I only have the same structure. I've added a simple dataframe as an example:
import pandas as pd
df = pd.DataFrame()
df['timestamp'] =[ 1, 2]
df['cons_id'] = [2, 3]
df[ 'value'] = [ 4,5]
The dataframe which I have:
Here is the output that I want to change the dataframe to:

Are you looking for set_index:
df = df.set_index('timestamp').rename_axis(columns='timestamp')
print(df)
# Output
timestamp cons_id value
timestamp
1 2 4
2 3 5

Related

comapre value in two dataframe for alerting

I have df like below with:-
import pandas as pd
# initialize list of lists
data = [[0, 2, 3],[0,2,2],[1,1,1]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
The clauses are the column names are dynamic sometimes it's 3 columns and sometimes it's 5 columns sometimes 1 column.
and I have on other df which is telling me the anomaly
# initialize list of lists
data = [[0,1,1]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
Now if any of the columns in df2 is having value 1 it means it's an anomaly then I have to alert. the only clause is I want to check if 1090 is 1 in df2 then the value of 1090 in df1 and if it's less than 4 then do nothing
As of now, I am doing it like this:-
if df2.any(axis=1).any() == True:
print("alert")

Split pandas dataframe rows up to searched column value into new dataframes

I have a dataframe that contains multiple header rows (a combination of multiple csvs). Is there a way to split the dataframe back into individual dataframes without using .iloc? iloc works, but will be time consuming for my workflow.
data = {'A': [1,2,3,'A',4,5,6,'A',7,8,9],
'B': [9,8,7,'B',6,5,4,'B',3,2,1]}
df = pd.DataFrame(data, columns = ['A','B'])
## My current approach:
df1 = df.iloc[:3,]
df2 = df.iloc[4:7,]
df3 = df.iloc[8:,]
Is there a better way to split the data frame by searching for the values in the columns? i.e. something like df1,df2,df3 = df.split(df['A']=='A')
One can use eq to check for the header rows, then groupby on the cumsum:
header_rows = df.eq(df.columns).all(1)
dfs = {k:v for k,v in df[~header_rows].groupby(header_rows.cumsum())}
then, for example dfs[0] gives:
A B
0 1 9
1 2 8
2 3 7

Assign a series to ALL columns of the dataFrame (columnwise)?

I have a dataframe, and series of the same vertical size as df, I want to assign
that series to ALL columns of the DataFrame.
What is the natural why to do it ?
For example
df = pd.DataFrame([[1, 2 ], [3, 4], [5 , 6]] )
ser = pd.Series([1, 2, 3 ])
I want all columns of "df" to be equal to "ser".
PS Related:
One way to solve it via answer:
How to assign dataframe[ boolean Mask] = Series - make it row-wise ? I.e. where Mask = true take values from the same row of the Series (creating all true mask), but I guess there should be some more
simple way.
If I need NOT all, but SOME columns - the answer is given here:
Assign a Series to several Rows of a Pandas DataFrame
Use to_frame with reindex:
a = ser.to_frame().reindex(columns=df.columns, method='ffill')
print (a)
0 1
0 1 1
1 2 2
2 3 3
But it seems easier is solution from comment, there was added columns parameter if need same order columns as original with real data:
df = pd.DataFrame({c:ser for c in df.columns}, columns=df.columns)
Maybe a different way to look at it:
df = pd.concat([ser] * df.shape[1], axis=1)

How to analyze a dataframe with multiple headers?

For example, I have a df with 3 headers. I want to analyze data from one of the columns in the first header and one of the columns in the second header. How do i do that?
It's hard to know if this will work because you haven't provided you're data but you can try this.
First access the column names
data.columns
Then isolate the corresponding columns you would like to analyze
data = data[['column_1', 'column_2']]
Index the columns based on the names that appear as the current column names, ignore the column names not currently used and just index based on the corresponding match.
You can then rename the columns.
data.columns = ['new_column_1_name', 'new_column_2_name']
You can pull them out as tuples:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=[["A", "B"], ["a", "b"]])
In [12]: df
Out[12]:
A B
a b
0 1 2
1 3 4
In [13]: df[[("A", "a")]]
Out[13]:
A
a
0 1
1 3
In your case it might be:
df[[("Year", "All ages")]]
See the advanced section of the docs for multi-index indexing and slicing.

How to add an empty column to a dataframe?

What's the easiest way to add an empty column to a pandas DataFrame object? The best I've stumbled upon is something like
df['foo'] = df.apply(lambda _: '', axis=1)
Is there a less perverse method?
If I understand correctly, assignment should fill:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
To add to DSM's answer and building on this associated question, I'd split the approach into two cases:
Adding a single column: Just assign empty values to the new columns, e.g. df['C'] = np.nan
Adding multiple columns: I'd suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe's column index. This also works for adding multiple new rows with .reindex(rows=[...]). Note that newer versions of Pandas (v>0.20) allow you to specify an axis keyword rather than explicitly assigning to columns or rows.
Here is an example adding multiple columns:
mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2'])
or
mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1) # version > 0.20.0
You can also always concatenate a new (empty) dataframe to the existing dataframe, but that doesn't feel as pythonic to me :)
I like:
df['new'] = pd.Series(dtype='int')
# or use other dtypes like 'float', 'object', ...
If you have an empty dataframe, this solution makes sure that no new row containing only NaN is added.
Specifying dtype is not strictly necessary, however newer Pandas versions produce a DeprecationWarning if not specified.
an even simpler solution is:
df = df.reindex(columns = header_list)
where "header_list" is a list of the headers you want to appear.
any header included in the list that is not found already in the dataframe will be added with blank cells below.
so if
header_list = ['a','b','c', 'd']
then c and d will be added as columns with blank cells
Starting with v0.16.0, DF.assign() could be used to assign new columns (single/multiple) to a DF. These columns get inserted in alphabetical order at the end of the DF.
This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe.
Consider the same DF sample demonstrated by #DSM:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
A B
0 1 2
1 2 3
2 3 4
df.assign(C="",D=np.nan)
Out[21]:
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
Note that this returns a copy with all the previous columns along with the newly created ones. In order for the original DF to be modified accordingly, use it like : df = df.assign(...) as it does not support inplace operation currently.
if you want to add column name from a list
df=pd.DataFrame()
a=['col1','col2','col3','col4']
for i in a:
df[i]=np.nan
df["C"] = ""
df["D"] = np.nan
Assignment will give you this warning SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
so its better to use insert:
df.insert(index, column-name, column-value)
#emunsing's answer is really cool for adding multiple columns, but I couldn't get it to work for me in python 2.7. Instead, I found this works:
mydf = mydf.reindex(columns = np.append( mydf.columns.values, ['newcol1','newcol2'])
One can use df.insert(index_to_insert_at, column_header, init_value) to insert new column at a specific index.
cost_tbl.insert(1, "col_name", "")
The above statement would insert an empty Column after the first column.
this will also work for multiple columns:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
df1 = pd.DataFrame(columns=['C','D','E'])
df = df.join(df1, how="outer")
>>>df
A B C D E
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
Then do whatever you want to do with the columns
pd.Series.fillna(),pd.Series.map()
etc.
The below code address the question "How do I add n number of empty columns to my existing dataframe". In the interest of keeping solutions to similar problems in one place, I am adding it here.
Approach 1 (to create 64 additional columns with column names from 1-64)
m = list(range(1,65,1))
dd=pd.DataFrame(columns=m)
df.join(dd).replace(np.nan,'') #df is the dataframe that already exists
Approach 2 (to create 64 additional columns with column names from 1-64)
df.reindex(df.columns.tolist() + list(range(1,65,1)), axis=1).replace(np.nan,'')
You can do
df['column'] = None #This works. This will create a new column with None type
df.column = None #This will work only when the column is already present in the dataframe
If you have a list of columns that you want to be empty, you can use assign, then comprehension dict, then dict unpacking.
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> nan_cols_name = ["C","D","whatever"]
>>> df.assign(**{col:np.nan for col in nan_cols_name})
A B C D whatever
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
You can also unpack multiple dict in a dict that you unpack if you want different values for different columns.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
nan_cols_name = ["C","D","whatever"]
empty_string_cols_name = ["E","F","bad column with space"]
df.assign(**{
**{col:np.nan for col in my_empy_columns_name},
**{col:"" for col in empty_string_cols_name}
}
)
Sorry for I did not explain my answer really well at beginning. There is another way to add an new column to an existing dataframe.
1st step, make a new empty data frame (with all the columns in your data frame, plus a new or few columns you want to add) called df_temp
2nd step, combine the df_temp and your data frame.
df_temp = pd.DataFrame(columns=(df_null.columns.tolist() + ['empty']))
df = pd.concat([df_temp, df])
It might be the best solution, but it is another way to think about this question.
the reason of I am using this method is because I am get this warning all the time:
: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["empty1"], df["empty2"] = [np.nan, ""]
great I found the way to disable the Warning
pd.options.mode.chained_assignment = None
The reason I was looking for such a solution is simply to add spaces between multiple DFs which have been joined column-wise using the pd.concat function and then written to excel using xlsxwriter.
df[' ']=df.apply(lambda _: '', axis=1)
df_2 = pd.concat([df,df1],axis=1) #worked but only once.
# Note: df & df1 have the same rows which is my index.
#
df_2[' ']=df_2.apply(lambda _: '', axis=1) #didn't work this time !!?
df_4 = pd.concat([df_2,df_3],axis=1)
I then replaced the second lambda call with
df_2['']='' #which appears to add a blank column
df_4 = pd.concat([df_2,df_3],axis=1)
The output I tested it on was using xlsxwriter to excel.
Jupyter blank columns look the same as in excel although doesnt have xlsx formatting.
Not sure why the second Lambda call didnt work.

Categories

Resources