Python: appended rows in dataframe are flipped

I have a dataset imported from a CSV file into a dataframe in Python. I want to remove some specific rows from this dataframe and append them to an empty dataframe. So far I have tried to remove rows 0 and 1 from the "big" dataframe called df and put them into dff using this code:
dff = pd.DataFrame()  # Create empty dataframe
for x in range(0, 2):
    dff = dff.append(df.iloc[x])  # Append the first 2 rows from df to dff
    # How to remove appended rows from df?
This seems to work, however the columns come out flipped: e.g., if df has the order A, B, C, then dff gets the order C, B, A; other than that the data is correct. Also, how do I remove a specific row from a dataframe?

If your goal is just to move the first two rows into another dataframe, you don't need a loop; just slice:
import pandas as pd
df = pd.DataFrame({"col1": [1,2,3,4,5,6], "col2": [11,22,33,44,55,66]})
dff = df.iloc[:2]
df = df.iloc[2:]
Will give you:
dff
Out[6]:
   col1  col2
0     1    11
1     2    22

df
Out[8]:
   col1  col2
2     3    33
3     4    44
4     5    55
5     6    66
If your list of desired rows is more complex than just the first two, per your example, a more generic method could be:
dff = df.iloc[[1,3,5]]  # your list of row numbers
df = df[~df.index.isin(dff.index)]
This means that even if the index isn't made of sequential integers, any rows that you used to populate dff will be removed from df.
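Since dff.index holds exactly the labels that were moved out, drop can express the removal even more directly; a minimal sketch of that variant:

import pandas as pd

df = pd.DataFrame({"col1": [1,2,3,4,5,6], "col2": [11,22,33,44,55,66]})
dff = df.iloc[[1,3,5]]   # rows to move out
df = df.drop(dff.index)  # drop those same labels from df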

I managed to solve it by doing:
dff = df.iloc[:0]  # empty copy of df: same columns, same order, no rows
This copies the column structure of df (the column labels, e.g. A, B, C) into dff without any rows, so append then keeps the column order. Any row, e.g. row 1150, can be appended and removed using:
dff = dff.append(df.iloc[1150])
df = df.drop(df.index[1150])
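Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions the same move can be written with pd.concat. A minimal sketch:

import pandas as pd

# move row 1150 from df to dff (pandas 2.x style)
dff = pd.concat([dff, df.iloc[[1150]]])  # double brackets keep a one-row DataFrame
df = df.drop(df.index[1150])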

Related

Assign a new column in pandas in a similar way as in pyspark

I have the following dataframe:
df = pd.DataFrame([['A', 1],['B', 2],['C', 3]], columns=['index', 'result'])
  index  result
0     A       1
1     B       2
2     C       3
I would like to create a new column, for example by multiplying the column 'result' by two, and I am curious to know if there is a way to do it in pandas the way pyspark does it.
In pyspark:
df = df\
    .withColumn("result_multiplied", F.col("result")*2)
I don't like having to write the name of the dataframe every time I perform an operation, as is done in pandas:
In pandas:
df['result_multiplied'] = df['result']*2
Use DataFrame.assign:
df = df.assign(result_multiplied = df['result']*2)
Or, if column result is itself modified earlier in the chain, a lambda function is necessary so that the multiplication uses the updated values of column result:
df = df.assign(result_multiplied = lambda x: x['result']*2)
Sample to see the difference: result_multiplied is computed from the original df['result'], while result_multiplied1 uses the column after mul(2):
df = df.mul(2).assign(result_multiplied = df['result']*2,
                      result_multiplied1 = lambda x: x['result']*2)
print (df)
  index  result  result_multiplied  result_multiplied1
0    AA       2                  2                   4
1    BB       4                  4                   8
2    CC       6                  6                  12
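If the appeal of withColumn is chaining without repeating the dataframe name, assign with lambdas chains the same way; a small sketch (result_plus_one is just an illustrative name):

import pandas as pd

df = pd.DataFrame([['A', 1],['B', 2],['C', 3]], columns=['index', 'result'])

# each lambda receives the dataframe as it exists at that point in the chain
df = (df
      .assign(result_multiplied=lambda x: x['result']*2)
      .assign(result_plus_one=lambda x: x['result_multiplied'] + 1))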

Add multi level column to dataframe

At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
                                columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
  test
 value
1    1
3    2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
                                 columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
       test   test2
      value  value2
nodes
1         1       3
3         2       4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict, then concat along axis=1. The keys of the dict become levels of the column MultiIndex (tuple keys add multiple levels depending on their length, scalar keys add a single level) and the DataFrame columns stay as-is. Alignment is enforced on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
        foo foo2
        bar bar2 bar1
        val val2 val2
nodes
0         1    4    7
1         2    5    8
2         3    6    9
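Once the columns are a MultiIndex, a further column can also be added by assigning with a tuple key, which avoids a second concat; a minimal sketch:

import pandas as pd

# tuple dict keys create MultiIndex columns directly
df = pd.DataFrame({('test', 'value'): [1, 2]}, index=pd.Index([1, 3], name='nodes'))

# assigning with a tuple key adds a new column under the existing MultiIndex
df[('test2', 'value2')] = [3, 4]
print(df)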

Compare values in two dataframes for alerting

I have a df like below:
import pandas as pd
# initialize list of lists
data = [[0, 2, 3],[0,2,2],[1,1,1]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
The catch is that the column names are dynamic: sometimes there are 3 columns, sometimes 5, and sometimes only 1.
I also have another df which tells me about anomalies:
# initialize list of lists
data = [[0,1,1]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
Now, if any of the columns in df2 has the value 1, it means an anomaly and I have to alert. The only special clause is column 1090: if 1090 is 1 in df2, I want to check the value of 1090 in df1, and if it's less than 4, do nothing.
As of now, I am doing it like this:
if df2.any(axis=1).any():
    print("alert")

Split pandas dataframe rows up to searched column value into new dataframes

I have a dataframe that contains multiple header rows (a combination of multiple csvs). Is there a way to split the dataframe back into individual dataframes without using .iloc? iloc works, but would be time-consuming for my workflow.
data = {'A': [1,2,3,'A',4,5,6,'A',7,8,9],
        'B': [9,8,7,'B',6,5,4,'B',3,2,1]}
df = pd.DataFrame(data, columns = ['A','B'])
## My current approach:
df1 = df.iloc[:3,]
df2 = df.iloc[4:7,]
df3 = df.iloc[8:,]
Is there a better way to split the dataframe by searching for the values in the columns? I.e. something like df1, df2, df3 = df.split(df['A']=='A')
One can use eq to check for the header rows, then groupby on the cumsum:
header_rows = df.eq(df.columns).all(1)
dfs = {k:v for k,v in df[~header_rows].groupby(header_rows.cumsum())}
then, for example dfs[0] gives:
   A  B
0  1  9
1  2  8
2  3  7
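One caveat: the interleaved header rows force both columns to object dtype, so the split pieces still hold objects rather than numbers. Continuing from the snippet above, a small sketch to restore numeric dtypes (assuming all remaining values are numeric):

dfs = {k: v.apply(pd.to_numeric) for k, v in dfs.items()}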

Pandas: iterate through a dataframe column and concatenate corresponding row values that contain a list

I have a dataframe with column1 containing string values and column2 containing lists of string values.
I want to iterate through column1 and concatenate the column1 values with their corresponding row values into a new dataframe.
Say, my input is
dfd = {'TRAINSET': ['101','102','103','104'],
       'unique': [['a1','x1','b2'], ['a1','b3','b2'], ['d3','g5','x2'], ['x1','b2','a1']]}
After the operation my data will look like this:
dfd2 = {'TRAINSET':['101a1','101x1','101b2', '102a1','102b3','102b2','103d3', '103g5','103x2','104x1','104b2', '104a1']}
What I tried is:
dg = pd.concat([g['TRAINSET'].map(g['unique']).apply(pd.Series)], axis = 1)
but I get KeyError: 'TRAINSET', as this is probably not the proper syntax. Also, I would like to remove the NaN values in the lists.
It is possible to use a list comprehension that flattens the values of the lists, joins the values with +, and passes the result to the DataFrame constructor:
#if necessary
#df = df.reset_index()
#flatten values with filter out missing values
L = [(str(a) + x) for a, b in df[['TRAINSET','unique']].values for x in b if pd.notna(x)]
df1 = pd.DataFrame({'TRAINSET': L})
print (df1)
   TRAINSET
0     101a1
1     101x1
2     101b2
3     102a1
4     102b3
5     102b2
6     103d3
7     103g5
8     103x2
9     104x1
10    104b2
11    104a1
Or use DataFrame.explode (pandas 0.25+), create a default index, remove missing values with DataFrame.dropna, and join the columns with +, using Series.to_frame for a one-column DataFrame:
df = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)
df1 = (df['TRAINSET'].astype(str) + df['unique']).to_frame('TRAINSET')
print (df1)
   TRAINSET
0     101a1
1     101x1
2     101b2
3     102a1
4     102b3
5     102b2
6     103d3
7     103g5
8     103x2
9     104x1
10    104b2
11    104a1
Coming from your original data, you can do the below using explode (new in pandas 0.25+) and agg:
Input:
dfd = {'TRAINSET': ['101','102','103','104'],
       'unique': [['a1','x1','b2'], ['a1','b3','b2'], ['d3','g5','x2'], ['x1','b2','a1']]}
Solution:
df = pd.DataFrame(dfd)
df.explode('unique').astype(str).agg(''.join,1).to_frame('TRAINSET').to_dict('list')
{'TRAINSET': ['101a1',
'101x1',
'101b2',
'102a1',
'102b3',
'102b2',
'103d3',
'103g5',
'103x2',
'104x1',
'104b2',
'104a1']}
Another solution, just to give you some choice...
import pandas as pd

_dfd = {'TRAINSET': ['101','102','103','104'],
        'unique': [['a1','x1','b2'], ['a1','b3','b2'], ['d3','g5','x2'], ['x1','b2','a1']]}
dfd = pd.DataFrame.from_dict(_dfd)
dfd.set_index("TRAINSET", inplace=True)
print(dfd)

dfd2 = dfd.reset_index()

def refactor(row):
    # prefix each list element with its TRAINSET key
    key, l = str(row["TRAINSET"]), row["unique"]
    res = [key + i for i in l]
    return res

dfd2['TRAINSET'] = dfd2.apply(refactor, axis=1)
dfd2.set_index("TRAINSET", inplace=True)
dfd2.drop("unique", inplace=True, axis=1)
print(dfd2)
