Subset of a Pandas Dataframe consisting of rows with specific column values - python

I'm having a problem with a single line of my code.
Here is what I'd like to achieve:
reading_now is a string consisting of 3 characters
df2 is a data frame that is a subset of df1
I'd like df2 to consist of the rows from df1 where the first three characters of the value in column "Code" are equal to reading_now.
I tried using the following two lines with no success:
df2 = df1.loc[(df1['Code'])[0:3] == reading_now]
df2 = df1[(str(df1.Code)[0:3] == reading_now)]

Looks like you were really close with your 2nd attempt.
You could solve this a couple of different ways.
import pandas as pd

reading_now = 'AAA'
df1 = pd.DataFrame([{'Code': 'AAA'}, {'Code': 'BBB'}, {'Code': 'CCC'}])
solution:
df2 = df1[df1['Code'].str.startswith(reading_now)]
or
df2 = df1[df1['Code'].str.slice(0, 3) == reading_now]
The df2 dataframe will contain the rows whose Code value starts with the reading_now string. (Note that the .str accessor is needed; df1['Code'][0:3] without it would slice the first three rows, not the first three characters.)
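For the sample df1 above, either line should select only the matching first row:

print(df2)
  Code
0  AAA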

You could use
df2 = df1[df1['Code'].str[0:3] == reading_now]
For example:
data = ['abcd', 'cbdz', 'abcz', 'bdaz']
df1 = pd.DataFrame(data, columns=['Code'])
df2 = df1[df1['Code'].str[0:3] == 'abc']
df2 will be a dataframe whose 'Code' column contains 'abcd' and 'abcz'.
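Printed (assuming pandas has been imported as pd), that gives:

print(df2)
   Code
0  abcd
2  abcz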

Related

replace NaNs with 0 for df columns where column name contains specific string (pandas)

I have a dataframe as a result of a pivot which has several thousand columns (representing time-boxed attributes). Below is a much-shortened version for illustration.
import numpy as np
import pandas as pd

d = {'incount - 14:00': [1, np.nan, 1, 1, np.nan, np.nan, np.nan, np.nan, 1],
     'incount - 15:00': [2, 1, 2, np.nan, np.nan, np.nan, 1, 4, np.nan],
     'outcount - 14:00': [2, np.nan, 1, 1, 1, 1, 2, 2, 1],
     'outcount - 15:00': [2, 2, 1, 1, np.nan, 2, np.nan, 1, 1]}
df = pd.DataFrame(data=d)
I want to replace the NaNs in columns that contain "incount" with 0 (leaving other columns untouched). I have tried the following but predictably it does not recognise the column name.
df['incount'] = df['incount'].fillna(0)
I need the ability to search the column names and only impact those containing a defined string.
try this:
m = df.columns[df.columns.str.startswith('incount')]
df.loc[:, m] = df.loc[:, m].fillna(0)
print(df)
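For the question's data, the mask m computed above holds just the matching column labels:

print(m)
Index(['incount - 14:00', 'incount - 15:00'], dtype='object')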
you can use:
loop_cols = list(df.columns[df.columns.str.contains('incount', na=False)])  # get columns containing incount as a list
#or
#loop_cols = [col for col in df.columns if 'incount' in col]
print(loop_cols)
'''
['incount - 14:00', 'incount - 15:00']
'''
for i in loop_cols:
    df[i] = df[i].fillna(0)
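A more compact variant of the same idea, sketched with DataFrame.filter (which matches column labels, not row values):

cols = df.filter(like='incount').columns  # column labels containing 'incount'
df[cols] = df[cols].fillna(0)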

Python Pandas - How to find rows where a column value differs between two data frames

I am trying to get the rows where the values in a column differ between two data frames.
For example, let say we have these two dataframe below:
import pandas as pd
data1 = {'date': [20210701, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0]}
data2 = {'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0, 0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has different values in column 'a' on 20210704, and Sue has different values in column 'a' on 20210705. Therefore, my desired output should be something like:
import pandas as pd
output = {'date': [20210704, 20210705],
          'name': ['Dave', 'Sue'],
          'a_from_old': [0, 1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same thing as a MINUS statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case where some data is not even in the other data frame; I want to find the rows that are in both data frames but where the value in column 'a' differs.
I edited the sample data set to take the edge case into account.
(Notice that Dave on 20210702 does not appear in the final output because that row is not in the first data frame.)
Another option: an inner merge, keeping only the rows where the a from df1 does not equal the a from df2:
df3 = (
    df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
    .query('a_from_old != a_df2')  # df1 `a` != df2 `a`
    .drop('a_df2', axis=1)         # Remove column with df2 `a` values
)
df3:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Try a left merge() with indicator=True, then filter the result with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' with rename():
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop('_merge', axis=1)
          .rename(columns={'a': 'a_from_old'}))
output of out:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Note: if there are many more columns that you want to rename, pass suffixes=('_from_old', '') to merge() as a parameter.
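A sketch of that variant, combining the note with the inner-merge approach above (df2's columns keep their bare names, so only df1's overlapping columns are renamed):

out = (df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', ''))
          .query("a_from_old != a")  # keep rows where the values disagree
          .drop('a', axis=1))        # drop df2's copy of the column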

How to compare two columns using pandas?

BACKGROUND:
I have two columns: 'address' and 'raw_data'. The dataset looks like this:
[screenshot of the dataset omitted]
(This is just a sample I made up; the original dataset is over 6m rows and in a different language.)
Problem:
I need to find all the rows where 'address' and 'raw_data' do not match, meaning some sort of mistake was made when logging the data from 'address' into 'raw_data'.
I'm fairly new to Pandas. My plan is to split the 'raw_data' column on commas, then compare the newly produced columns with the original 'address' column (to see if the 'address' column has that info; if not, that means a mistake was made).
Like I said, I'm new to pandas and this is what I have so far.
import pandas as pd
columns = ['address', 'raw_data']
df=pd.read_csv('address.csv', usecols=columns)
df = pd.concat([df['address'], df['raw_data'].str.split(',', expand=True)], axis=1)
Now the new columns have info like this: "CITY":"ATLANTA". I want the columns to just have ATLANTA, without the colons and 'CITY', in order to compare the info with the 'address' column.
How should I go on about it?
Also, at this point of my pandas learning experience, I do not yet know how to compare two columns. Could someone help a newbie out please? Thanks a lot!
PS: by comparison of two columns I mean checking whether one column contains the characters of the second column, not whether the two columns are equal. Just want to point that out.
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 2], [3, 6], [1, 1]], columns=["col1", "col2"])
comparison_column = np.where(df["col1"] == df["col2"], True, False)
df["equal"] = comparison_column
   col1  col2  equal
0     2     2   True
1     3     6  False
2     1     1   True
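Note that np.where is not strictly needed here; the element-wise comparison already returns a boolean Series:

df["equal"] = df["col1"] == df["col2"]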
I will use this data:
import numpy as np
import pandas as pd
j = {"address":"foo","b": "bar"}
j2 = {"address":"foo2","b": "bar2"}
values = [["foo", j], ["bar", j2]]
df = pd.DataFrame(data=values, columns=["address", "raw_data"])
df
address raw_data
0 foo {'address': 'foo', 'b': 'bar'}
1 bar {'address': 'foo2', 'b': 'bar2'}
I will separate columns from raw_data (with .values.tolist()) in another df (df2):
df2 = pd.DataFrame(df['raw_data'].values.tolist())
df2
address b
0 foo bar
1 foo2 bar2
To compare you use:
df.address == df2.address
0 True
1 False
If you need to save this in the original df, you can add a column:
df["result"] = df.address == df2.address
Instead of splitting on ',', you can just treat each raw_data entry as a dict. You can map custom functions to columns with the apply function; in this case, you define a function that accesses the keys of the dictionary and extracts the values.
df['address_raw'] = df['raw_data'].apply(lambda x: x['address'])
df['city_raw'] = df['raw_data'].apply(lambda x: x['CITY'])
df['addrline2_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_2'])
df['addrline3_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_3'])
df['utmnorthing_raw'] = df['raw_data'].apply(lambda x: x['UTM_NORTHING'])
These lines will create a column for each field in the dict, and then you can compare them like:
df['address'] == df['address_raw']
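Since the PS asks about containment rather than strict equality, a row-wise substring check is one option. A sketch, assuming both columns hold strings and reusing the city_raw column created above (the new column name is arbitrary):

df['city_found'] = df.apply(lambda row: row['city_raw'] in row['address'], axis=1)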

Appending a column to data frame using Pandas in python

I'm trying some operations on an Excel file using pandas. I want to extract some columns from the file, add another column to those extracted columns, and then write all the columns to a new Excel file. To do this I have to append the new column to the old columns.
Here is my code:
import pandas as pd
#Reading ExcelFIle
#Work.xlsx is input file
ex_file = 'Work.xlsx'
data = pd.read_excel(ex_file,'Data')
#Create subset of columns by extracting columns D,I,J,AU from the file
data_subset_columns = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")
#Compute new column 'Percentage'
#'Num Labels' and 'Num Tracks' are two different columns in given file
data['Percentage'] = data['Num Labels'] / data['Num Tracks']
data1 = data['Percentage']
print(data1)
#Here I'm trying to append data['Percentage'] to data_subset_columns
Final_data = data_subset_columns.append(data1)
print(Final_data)
Final_data.to_excel('111.xlsx')
No error is shown, but Final_data is not giving me the expected results (the data is not getting appended).
There is no need to explicitly append columns in pandas. When you calculate a new column, it is included in the dataframe. When you export it to excel, the new column will be included.
Try this, assuming 'Num Labels' and 'Num Tracks' are in "D,I,J,AU" [otherwise add them]:
import pandas as pd
data_subset = pd.read_excel(ex_file, 'Data', parse_cols="D,I,J,AU")  # parse_cols is called usecols in recent pandas
data_subset['Percentage'] = data_subset['Num Labels'] / data_subset['Num Tracks']
data_subset.to_excel('111.xlsx')
The append function of a dataframe adds rows, not columns to the dataframe. Well, it does add columns if the appended rows have more columns than in the source dataframe.
DataFrame.append(other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
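To see the failure mode concretely, here is a minimal sketch using pd.concat (DataFrame.append itself was deprecated and removed in pandas 2.0): row-wise combination of objects with different columns pads the result with NaN instead of adding a real column.

import pandas as pd

df = pd.DataFrame({'x': [1, 2]})
extra = pd.DataFrame({'Percentage': [0.5, 0.25]})
print(pd.concat([df, extra]))  # rows stacked, columns misaligned
     x  Percentage
0  1.0         NaN
1  2.0         NaN
0  NaN        0.50
1  NaN        0.25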
I think you are looking for something like concat.
Combine DataFrame objects horizontally along the x axis by passing in axis=1.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george

How to coerce pandas dataframe column to be normal index

I create a DataFrame from a dictionary. I want the keys to be used as index and the values as a single column. This is what I managed to do so far:
import pandas as pd
my_counts = {"A": 43, "B": 42}
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
I get the following:
count
letter
A 43
B 42
The problem is I want to concatenate (with pd.concat) this with other dataframes, that have the same index name (letter), and seemingly the same single column (count), but I end up with an
AssertionError: invalid dtype determination in get_concat_dtype.
I discovered that the other dataframes have a different type for their columns: Index(['count'], dtype='object'). The above dataframe has MultiIndex(levels=[['count']], labels=[[0]]).
How can I ensure my dataframe has a normal index?
You can prevent the MultiIndex column by eliminating the trailing ',' in the name:
df = pd.DataFrame(pd.Series(my_counts, name=("count")).rename_axis("letter"))
df.columns
Output:
Index(['count'], dtype='object')
Or you can flatten your MultiIndex columns like this:
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
df.columns = df.columns.map(''.join)
df.columns
Output:
Index(['count'], dtype='object')
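An arguably simpler route to the same flat columns, sketched with Series.to_frame():

df = pd.Series(my_counts, name="count").rename_axis("letter").to_frame()
df.columns
Index(['count'], dtype='object')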
