I have a dataframe that contains multiple header rows (a combination of multiple csvs). Is there a way to split the dataframe back into individual dataframes without using .iloc? iloc works, but will be time consuming for my workflow.
import pandas as pd
data = {'A': [1,2,3,'A',4,5,6,'A',7,8,9],
        'B': [9,8,7,'B',6,5,4,'B',3,2,1]}
df = pd.DataFrame(data, columns = ['A','B'])
## My current approach:
df1 = df.iloc[:3,]
df2 = df.iloc[4:7,]
df3 = df.iloc[8:,]
Is there a better way to split the data frame by searching for the values in the columns? i.e. something like df1,df2,df3 = df.split(df['A']=='A')
One can use eq to check for the header rows, then groupby on the cumsum:
header_rows = df.eq(df.columns).all(axis=1)
dfs = {k:v for k,v in df[~header_rows].groupby(header_rows.cumsum())}
Then, for example, dfs[0] gives:
A B
0 1 9
1 2 8
2 3 7
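As a minimal sketch, the same idea can be wrapped in a helper that returns the pieces as a list, so they can be unpacked the way the question asks. The function name `split_on_header_rows` is hypothetical, and the sketch assumes that a header row is any row whose values equal the column labels exactly.

```python
import pandas as pd

def split_on_header_rows(df):
    """Split a frame wherever a row repeats the column labels (hypothetical helper)."""
    # True for every row whose values equal the column names
    header_rows = df.eq(df.columns).all(axis=1)
    # Each header row increments the counter, so the rows after it share one group id
    group_ids = header_rows.cumsum()
    return [part.reset_index(drop=True)
            for _, part in df[~header_rows].groupby(group_ids[~header_rows])]

data = {'A': [1, 2, 3, 'A', 4, 5, 6, 'A', 7, 8, 9],
        'B': [9, 8, 7, 'B', 6, 5, 4, 'B', 3, 2, 1]}
df = pd.DataFrame(data, columns=['A', 'B'])
df1, df2, df3 = split_on_header_rows(df)
```

Because the original columns held both numbers and strings, the pieces keep the object dtype; converting them afterwards (for example with `pd.to_numeric`) may be needed.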
At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
test
value
1 1
3 2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
test test2
nodes value value2
1 1 3
3 2 4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict and then concatenating along axis=1. The keys of the dict become levels of the column MultiIndex (tuple keys add as many levels as the tuple has elements, scalar keys add a single level), and the DataFrame columns stay as they are. Rows are aligned on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo foo2
      bar bar2 bar1
      val val2 val2
nodes
0       1    4    7
1       2    5    8
2       3    6    9
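Applied to the frame from the question, the same approach might look like the sketch below. It uses scalar dict keys, so only one extra column level is added; the variable names `nodes`, `frames`, and `result` are illustrative.

```python
import pandas as pd

# Shared row index, named as in the question
nodes = pd.Index([1, 3], name="nodes")

# Scalar keys: each key becomes the top level above that frame's own columns
frames = {
    "test": pd.DataFrame({"value": [1, 2]}, index=nodes),
    "test2": pd.DataFrame({"value2": [3, 4]}, index=nodes),
}
result = pd.concat(frames, axis=1)
```

Adding another top-level column is then a matter of adding one more entry to the dict before concatenating.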
I have a df like the one below:
import pandas as pd
# initialize list of lists
data = [[0, 2, 3],[0,2,2],[1,1,1]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
The catch is that the column names are dynamic: sometimes there are 3 columns, sometimes 5, and sometimes only 1.
I also have another df that flags anomalies:
# initialize list of lists
data = [[0,1,1]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
If any column in df2 has the value 1, that indicates an anomaly and I have to raise an alert. The only exception: if 1090 is 1 in df2, I want to check the value of 1090 in df1, and if it is less than 4, do nothing.
At the moment, I am doing it like this:
if df2.any(axis=1).any():
    print("alert")
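One possible way to express the exception is sketched below. The question does not say which row of df1 should be compared against 4, so this assumes the flag on 1090 is ignored only when every value of 1090 in df1 is below 4. The function name `should_alert` and its parameters `special` and `threshold` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 2, 3], [0, 2, 2], [1, 1, 1]], columns=['10028', '1090', '1058'])
df2 = pd.DataFrame([[0, 1, 1]], columns=['10028', '1090', '1058'])

def should_alert(values, flags, special='1090', threshold=4):
    # One boolean per column: True if that column is flagged with a 1
    flagged = flags.eq(1).any(axis=0)
    # Assumption: ignore the flag on `special` when all of its values are below the threshold
    if special in flagged.index and flagged[special] and values[special].lt(threshold).all():
        flagged[special] = False
    return bool(flagged.any())

if should_alert(df1, df2):
    print("alert")
```

With the sample data this still alerts, because 1058 is flagged as well; only the flag on 1090 is suppressed. The check works per column label, so it does not depend on how many columns are present.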
I have a dataset imported from a CSV file to a dataframe in Python. I want to remove some specific rows from this dataframe and append them to an empty dataframe. So far I have tried to remove row 1 and 0 from the "big" dataframe called df and put these into dff using this code:
dff = pd.DataFrame() # Create empty dataframe
for x in range(0, 2):
    dff = dff.append(df.iloc[x]) # Append the first 2 rows from df to dff
# How to remove appended rows from df?
This seems to work, but the columns are flipped: e.g. if df has the order A, B, C, then dff gets the order C, B, A. Other than that, the data is correct. Also, how do I remove a specific row from a dataframe?
If your goal is just to remove the first two rows into another dataframe, you don't need to use a loop, just slice:
import pandas as pd
df = pd.DataFrame({"col1": [1,2,3,4,5,6], "col2": [11,22,33,44,55,66]})
dff = df.iloc[:2]
df = df.iloc[2:]
Will give you:
dff
Out[6]:
col1 col2
0 1 11
1 2 22
df
Out[8]:
col1 col2
2 3 33
3 4 44
4 5 55
5 6 66
If your list of desired rows is more complex than just the first two, per your example, a more generic method could be:
dff = df.iloc[[1,3,5]] # Your list of row numbers
df = df.iloc[~df.index.isin(dff.index)]
This means that even if the index column isn't sequential integers, any rows that you used to populate dff will be removed from df.
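A minimal sketch of the same idea on a frame whose index is not 0..n-1 is shown below. It swaps in `DataFrame.drop` for the boolean mask, which is an alternative rather than the method above, and it assumes the index labels are unique. The index values are made up for illustration.

```python
import pandas as pd

# Index labels are deliberately not sequential integers starting at 0
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6]}, index=[10, 20, 30, 40, 50, 60])

dff = df.iloc[[1, 3, 5]]        # pick rows by position
df = df.drop(index=dff.index)   # remove the same rows by label
```

After this, `dff` holds the rows labelled 20, 40 and 60, and `df` keeps the rows labelled 10, 30 and 50.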
I managed to solve it by doing:
dff = df.iloc[:0]
This creates an empty dataframe with the same columns as df (e.g. A, B, C) and no rows, so rows added afterwards keep the column order. Any row, e.g. row 1150, can then be moved from df to dff. DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
dff = pd.concat([dff, df.iloc[[1150]]])
df = df.drop(df.index[1150])
I'm new to Python and need to combine many columns from multiple dataframes.
I wrote the code below, but it takes a long time to adapt for many exercises.
This is the code:
import pandas as pd
df = pd.concat([df1[df1.columns[0]], df2[df1.columns[0]], df1[df1.columns[1]],
df2[df1.columns[1]], df1[df1.columns[2]], df2[df1.columns[2]],
df1[df1.columns[3]], df2[df1.columns[3]], df1[df1.columns[4]],
df2[df1.columns[4]], df1[df1.columns[5]], df2[df1.columns[5]],
df1[df1.columns[6]], df2[df1.columns[6]]], axis=1)
The number of dataframes and columns can be much bigger. Thanks.
It looks like what you're trying to do is: for all of the columns in one dataframe, combine the columns from that dataframe with those from another with the same columns, into a single dataframe with two of every column in the same original order.
In your case:
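from pandas import DataFrame, concat
df1 = DataFrame([['a','b','c'], ['d','e','f']])
df2 = DataFrame([['g','h','i'], ['j','k','l']])
# For each column of df1, take that column from df1 and then from df2, and flatten the pairs
df = concat([s for ss in [(df1[c], df2[c]) for c in df1.columns] for s in ss], axis=1)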
print(df)
Result:
0 0 1 1 2 2
0 a g b h c i
1 d j e k f l
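Since the question mentions that the number of dataframes can grow, the comprehension can be generalised to a list of frames. This is a sketch only: the function name `interleave_columns` is hypothetical, and it assumes every frame contains all the columns of the first one.

```python
import pandas as pd

def interleave_columns(frames):
    # Assumption: every frame has all the columns of the first frame
    columns = frames[0].columns
    # Outer loop over columns, inner loop over frames, so the copies of a column sit side by side
    return pd.concat([frame[col] for col in columns for frame in frames], axis=1)

df1 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
df2 = pd.DataFrame([['g', 'h', 'i'], ['j', 'k', 'l']])
df3 = pd.DataFrame([['m', 'n', 'o'], ['p', 'q', 'r']])
combined = interleave_columns([df1, df2, df3])
```

The result has repeated column labels, just like the two-frame version, so later selection by label returns all copies of that column.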
I was able to produce a pandas dataframe with identical column names.
Is this normal for a pandas dataframe?
How can I select only one of the two columns?
Does selecting by the shared name always return both columns of the dataframe?
Example given below:
# Producing a new empty pd dataset
dataset=pd.DataFrame()
# fill in a list with values to be added to the dataset later
cases=[1]*10
# Adding the list of values in the dataset, and naming the variable / column
dataset["id"]=cases
# making a list of columns as it is displayed below:
data_columns = ["id", "id"]
# Then, we call the pd dataframe using the defined column names:
dataset_new=dataset[data_columns]
# dataset_new
# It has as a result two columns with identical names.
# How can I process only one of the two dataset columns?
id id
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
You can use .iloc to access either column by position.
dataset_new.iloc[:,0]
or
dataset_new.iloc[:,1]
and of course you can rename your columns just like you did when you set them both to 'id' using:
dataset_new.columns = ['id_1', 'id_2']
df = pd.DataFrame()
lst = ['1', '2', '3']
df[0] = lst
df[1] = lst
df.rename(columns={0:'id'}, inplace=True)
df.rename(columns={1:'id'}, inplace=True)
print(df.iloc[:, [1]])  # both columns are now named 'id', so select the second one by position
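To show what label-based selection does with duplicate names, and one way to keep a single copy, here is a small sketch built on the question's data. The variable names `both` and `deduplicated` are illustrative; `Index.duplicated` marks every repeat of a label after its first occurrence.

```python
import pandas as pd

dataset = pd.DataFrame({"id": [1] * 10})
dataset_new = dataset[["id", "id"]]   # two columns, both named 'id'

# Selecting by label returns every matching column, so this is a DataFrame, not a Series
both = dataset_new["id"]

# Keep only the first occurrence of each column label
deduplicated = dataset_new.loc[:, ~dataset_new.columns.duplicated()]
```

Duplicate labels are allowed, but they make label-based access ambiguous, which is why positional access or renaming is usually the easier route.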