I can't seem to subset data by integer column names using .loc.
import numpy as np
import pandas as pd
# 6x4 data set with column names x, y, 8, 9
df = pd.DataFrame(np.random.randint(0,10,(6,4)),
                  index=('a','b','c','1','2','3'),
                  columns=['x','y', 8, 9])
df2 = df.loc[:,:'x']
df3 = df.loc[:,:'8']
df2 works but df3 throws error.
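The column label is the integer 8, not the string '8', so the slice bound '8' is not found among the columns. You can do either: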
df3 = df.loc[:,8]
To get only column 8
Or:
df3 = df.loc[:,df.columns[:list(df.columns).index(8)+1]]
To get all columns until column 8 (inclusive - remove +1 to get exclusive).
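As an alternative to the list-based lookup above, here is a minimal sketch that finds the position of the label with df.columns.get_loc and slices by position with iloc. The names pos and upto_8 are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (6, 4)),
                  index=('a', 'b', 'c', '1', '2', '3'),
                  columns=['x', 'y', 8, 9])

# Position of the integer label 8 (not the string '8') among the columns.
pos = df.columns.get_loc(8)

# iloc slices are end-exclusive, so add 1 to include column 8 itself.
upto_8 = df.iloc[:, :pos + 1]
print(upto_8.columns.tolist())
```

This assumes the column labels are unique; with duplicate labels get_loc does not return a single integer.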
I have many dataframes with one column (same name in all) whose indexes are date ranges. I want to merge/combine these dataframes into one, summing the values where any dates are common. Below is a simplified example:
range1 = pd.date_range('2021-10-01','2021-11-01')
range2 = pd.date_range('2021-11-01','2021-12-01')
df1 = pd.DataFrame(np.random.rand(len(range1),1), columns=['value'], index=range1)
df2 = pd.DataFrame(np.random.rand(len(range2),1), columns=['value'], index=range2)
Here '2021-11-01' appears in both df1 and df2 with different values.
I would like to obtain a single dataframe of 62 rows (32+31-1) where the 2021-11-01 row contains the sum of its values in df1 and df2.
We can use pd.concat() on the two dataframes, then df.reset_index() to move the dates into a regular column with a new integer index, rename that column, and then use df.groupby().sum().
df = pd.concat([df1,df2]) # this gives 63 rows by 1 column, where the column is the values and the dates are the index
df = df.reset_index() # moves the dates to a column, now called 'index', and makes a new integer index
df = df.rename(columns={'index':'Date'}) #renames the column
df.groupby('Date').sum()
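A shorter variant of the same idea, as a sketch: grouping directly on the index with groupby(level=0) avoids the reset_index and rename steps. The name combined is illustrative.

```python
import numpy as np
import pandas as pd

range1 = pd.date_range('2021-10-01', '2021-11-01')
range2 = pd.date_range('2021-11-01', '2021-12-01')
df1 = pd.DataFrame(np.random.rand(len(range1), 1), columns=['value'], index=range1)
df2 = pd.DataFrame(np.random.rand(len(range2), 1), columns=['value'], index=range2)

# Stack the frames, then sum rows that share the same index value (date).
combined = pd.concat([df1, df2]).groupby(level=0).sum()
print(len(combined))
```

The same call should extend to many dataframes by passing a longer list to pd.concat.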
I have a df like the one below:
import pandas as pd
# initialize list of lists
data = [[0, 2, 3],[0,2,2],[1,1,1]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
The catch is that the column names are dynamic: sometimes there are 3 columns, sometimes 5, sometimes 1.
I also have another df that tells me where the anomalies are:
# initialize list of lists
data = [[0,1,1]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns = ['10028', '1090','1058'])
If any column in df2 has the value 1, that is an anomaly and I have to raise an alert. The only exception: if 1090 is 1 in df2, I want to check the value of 1090 in df1, and if it is less than 4, do nothing.
As of now, I am doing it like this:
if df2.any(axis=1).any():
    print("alert")
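One possible way to add the 1090 exception, as a sketch only. The question does not say which df1 row to compare against the threshold, so this assumes the exception applies when every value of 1090 in df1 is below 4. The names THRESHOLD, SPECIAL, flagged and should_alert are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 2, 3], [0, 2, 2], [1, 1, 1]],
                   columns=['10028', '1090', '1058'])
df2 = pd.DataFrame([[0, 1, 1]], columns=['10028', '1090', '1058'])

THRESHOLD = 4     # limit taken from the question
SPECIAL = '1090'  # column with the exception rule

# One boolean per column: True if df2 marks that column as anomalous.
flagged = df2.eq(1).any(axis=0)

# Assumption: ignore the 1090 flag when all df1 values in that column are below 4.
# The column may be absent because the column set is dynamic.
if (SPECIAL in df1.columns and flagged.get(SPECIAL, False)
        and df1[SPECIAL].lt(THRESHOLD).all()):
    flagged[SPECIAL] = False

should_alert = bool(flagged.any())
if should_alert:
    print("alert")
```

With the sample data the 1090 flag is suppressed, but the alert still fires because 1058 is also flagged.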
I have a dataframe that contains multiple header rows (a combination of multiple csvs). Is there a way to split the dataframe back into individual dataframes without using .iloc? iloc works, but will be time consuming for my workflow.
data = {'A': [1,2,3,'A',4,5,6,'A',7,8,9],
'B': [9,8,7,'B',6,5,4,'B',3,2,1]}
df = pd.DataFrame(data, columns = ['A','B'])
## My current approach:
df1 = df.iloc[:3,]
df2 = df.iloc[4:7,]
df3 = df.iloc[8:,]
Is there a better way to split the data frame by searching for the values in the columns? i.e. something like df1,df2,df3 = df.split(df['A']=='A')
One can use eq to check for the header rows, then groupby on the cumsum:
header_rows = df.eq(df.columns).all(axis=1)
dfs = {k:v for k,v in df[~header_rows].groupby(header_rows.cumsum())}
then, for example dfs[0] gives:
   A  B
0  1  9
1  2  8
2  3  7
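The approach above, put together as a self-contained sketch that unpacks the pieces into df1, df2, df3 as the question asks. Two additions are not part of the original answer: infer_objects() to turn the mixed object columns back into numeric ones, and reset_index(drop=True) to renumber each piece. The name block_id is illustrative.

```python
import pandas as pd

data = {'A': [1, 2, 3, 'A', 4, 5, 6, 'A', 7, 8, 9],
        'B': [9, 8, 7, 'B', 6, 5, 4, 'B', 3, 2, 1]}
df = pd.DataFrame(data, columns=['A', 'B'])

# A row is a repeated header if every cell equals its own column name.
header_rows = df.eq(list(df.columns)).all(axis=1)

# The running count of header rows gives each block its own id: 0, 1, 2, ...
block_id = header_rows.cumsum()

# Drop the header rows, then split the remaining rows by block id.
dfs = [part.infer_objects().reset_index(drop=True)
       for _, part in df[~header_rows].groupby(block_id[~header_rows])]

df1, df2, df3 = dfs
print(df2)
```

Unpacking into three names only works when the number of blocks is known in advance; otherwise keeping the list (or the dict from the answer) is probably more practical.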
I have two dataframes with the same column names.
I want to sum the float values of matching columns across the two dataframes.
For that I can use
df3 = df1.add(df2)
However, my dataframes also contain two columns of strings, and these strings get added too.
How can I write the code so that it adds only the floats and not the strings?
The two sample dataframes are as follow:
df1 = pd.DataFrame(dict(Team=['A','B','C','D'],Value=[1,2,3,4]),index=[0,1,2,3])
df2 = pd.DataFrame(dict(Team=['A','B','C','D'],Value=[3,1,2,4]),index=[0,1,2,3])
When I used df3 = df1.add(df2)
it also added the string in column "Team" as follow:
  Team  Value
0   AA      4
1   BB      3
2   CC      5
3   DD      8
How can I write the code so that it adds Value but not Team?
Thanks,
Zep
Use the team names as indices instead of integer indices:
In [2]: df1 = pd.DataFrame(dict(Team=['A','B','C','D'],Value=[1,2,3,4])).set_index('Team')
...: df2 = pd.DataFrame(dict(Team=['A','B','C','D'],Value=[3,1,2,4])).set_index('Team')
In [3]: df1 + df2
Out[3]:
      Value
Team
A         4
B         3
C         5
D         8
In case you have multiple other columns, just sum the columns:
total = df1['Value'] + df2['Value']
If, in addition, you need a dataframe of the same shape as df1 and df2 with Value replaced by the sum, you can do
df3 = df1.copy()
df3['Value'] = total
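If the numeric columns are not known in advance, one option is to pick them by dtype with select_dtypes. This is a sketch under the assumption that both dataframes have the same index and the same numeric columns; numeric_cols is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame(dict(Team=['A', 'B', 'C', 'D'], Value=[1, 2, 3, 4]), index=[0, 1, 2, 3])
df2 = pd.DataFrame(dict(Team=['A', 'B', 'C', 'D'], Value=[3, 1, 2, 4]), index=[0, 1, 2, 3])

# Names of the numeric columns only; string columns such as Team are skipped.
numeric_cols = df1.select_dtypes(include='number').columns

# Start from df1 so the string columns are carried over unchanged.
df3 = df1.copy()
df3[numeric_cols] = df1[numeric_cols] + df2[numeric_cols]
print(df3)
```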
I have two dataframes containing the results of corr() on different parts of a single source (csv). Now I want to compare all the values in the two dataframes to check whether they are equal, or at least fall within a particular range of each other. So the pseudocode would be something like:
df1['column1']['row1'] == df2['column1']['row1']
Is there a simple way of doing this in Pandas?
There are many ways to do that. One that I use is:
df3 = df2[df1.ne(df2).any(axis=1)]
df3 will contain every row of df2 in which at least one cell does not match df1.
FYI, ne here stands for "not equal".
Example:
create df1
data = [['batman', 10], ['joker', 15], ['alfred', 14]]
df1 = pd.DataFrame(data, columns = ['Name', 'Age'])
create df2 which is slightly different from df1
data = [['batman', 10], ['joker', 6], ['alfred', 17]]
df2 = pd.DataFrame(data, columns = ['Name', 'Age'])
extract the rows with at least one unequal cell
df3 = df2[df1.ne(df2).any(axis=1)]
df3
print the resultant df3
     Name  Age
1   joker    6  // the age is different in df1 and df2 for joker
2  alfred   17  // the age is different in df1 and df2 for alfred
Now, from the resultant dataframe, you can check the range requirements as per your business case.
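For the "within a particular range" part of the question, a minimal sketch is to compare the absolute difference against a tolerance. The two small correlation-like frames and the tolerance value are made up for illustration, and both frames are assumed to share the same row and column labels.

```python
import pandas as pd

# Illustrative stand-ins for two corr() results with identical labels.
df1 = pd.DataFrame({'a': [1.0, 0.50], 'b': [0.50, 1.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'a': [1.0, 0.52], 'b': [0.48, 1.0]}, index=['a', 'b'])

tolerance = 0.05  # assumed acceptable difference

# Cell-by-cell: True where the two values differ by at most the tolerance.
within = (df1 - df2).abs().le(tolerance)

# Single yes/no answer for the whole frame.
all_within = bool(within.all().all())

# Exact equality, cell by cell, for comparison.
exact = df1.eq(df2)
print(within)
print(all_within)
```

Here the off-diagonal cells differ by 0.02, so they fail the exact check but pass the tolerance check.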