Python: retrieve condition-based columns with pandas

I have to retrieve all rows from w_loaded_updated_iod.xlsx where the column Waived = Yes.
I have tried this:
import pandas as pd

excel1 = 'C:/Users/gopoluri/Desktop/Latest/w_loaded_updated_iod.xlsx'
df1 = pd.read_excel(excel1)
values1 = df1[0:7]
dataframes = [values1]
df1.loc[df1['Waived'] == 'Yes'].to_excel("output11.xlsx")
But I am getting all columns. I need all rows, but only columns 2, 3, 5, and 8. Can anyone please correct my code if anything is wrong?

You can get columns x, y, z from your dataframe by selecting them as a list:
df = df[["x", "y", "z"]]
Example:
df = pd.DataFrame(dict(a=[1, 2, 3], b=[3, 4, 5], c=[5, 6, 7]))
df = df[["a", "b"]]
df  # prints:
   a  b
0  1  3
1  2  4
2  3  5
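Applied to the original question, the row filter and the column selection can be combined in a single .loc call. A minimal sketch, assuming the sheet really has a Waived column; the other column names below are hypothetical stand-ins for "column 2, 3, 5 and 8":
import pandas as pd

excel1 = 'C:/Users/gopoluri/Desktop/Latest/w_loaded_updated_iod.xlsx'
df1 = pd.read_excel(excel1)

# hypothetical names standing in for columns 2, 3, 5 and 8
wanted = ['col2', 'col3', 'col5', 'col8']

# the boolean mask filters rows, the list selects columns
df1.loc[df1['Waived'] == 'Yes', wanted].to_excel("output11.xlsx", index=False)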

Related

Conditionally insert rows in the middle of a dataframe using pandas

I have a dataset that I need to add rows to based on conditions. Rows can be added anywhere within the dataset, i.e., middle, top, or bottom.
I have 26 columns in the data but will only use a few to set conditions.
I want my code to go through each row and check whether a column named "potveg" has the value 4, 8, or 9. If true, add a row below it, set the 'col' and 'lat' column values to those of the last row, and set the values of the 'icohort' and 'isrccohort' columns to those of the last row + 1. Then export the new data frame to CSV. I have tried several implementations based on this logic: Pandas: Conditionally insert rows into DataFrame while iterating through rows in the middle
PS: I'm new to Python and Pandas.
Here is the code I have so far:
for index, row in df.iterrows():
    last_row = df.iloc[index - 1]
    next_row = df.iloc[index]
    new_row = {
        'col': last_row.col,
        'row': last_row.row,
        'tmpvarname': last_row.tmpvarname,
        'year': last_row.year,
        'icohort': next_row.icohort,
        'isrccohort': next_row.icohort,
        'standage': 3000,
        'chrtarea': 0,
        'potveg': 13,
        'currentveg': 13,
        'subtype': 13,
        'agstate': 0,
        'agprevstate': 0,
        'tillflag': 0,
        'fertflag': 0,
        'irrgflag': 0,
        'disturbflag': 0,
        'disturbmonth': 0,
        'FRI': 2000,
        'slashpar': 0,
        'vconvert': 0,
        'prod10par': 0,
        'prod100par': 0,
        'vrespar': 0,
        'sconvert': 0,
        'tmpregion': last_row.tmpregion
    }
    new_row = {k: v for k, v in new_row.items()}
    if df.iloc[index]['potveg'] == 4:
        newdata = df.append(new_row, ignore_index=True)
Following the steps you suggested, you could write something like:
df = pd.DataFrame({'id': [1, 2, 4, 5], 'before': [1, 2, 4, 5], 'after': [1, 2, 4, 5]})
new_df = pd.DataFrame()
for i, row in df.iterrows():
    new_df = pd.concat([new_df, row.to_frame().T])
    if row['id'] == 2:
        # add the new row: `before` comes from the current row, `after` from the following row
        temp = pd.DataFrame({'id': [3], 'before': [df.loc[i]['before']], 'after': [df.loc[i + 1]['after']]})
        new_df = pd.concat([new_df, temp])
You might want to consider how to approach the problem without iterating over the dataframe, as iteration can be quite slow on a large dataset. I'd suggest checking the apply function; a fully vectorized sketch follows the expected output below.
You should expect new_df to have:
id before after
1 1 1
2 2 2
3 2 4
4 4 4
5 5 5
With a row with id 3 added after the row with id 2.
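As a sketch of the non-iterative route mentioned above: give a copy of each matching row a fractional index so it sorts in just below the original, then call sort_index once. This assumes the toy id/before/after frame from this answer; the mask and the new values would need adapting to the real potveg logic:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 4, 5], 'before': [1, 2, 4, 5], 'after': [1, 2, 4, 5]})

mask = df['id'] == 2                       # rows after which to insert
copies = df[mask].copy()
copies.index = copies.index + 0.5          # fractional index sorts between neighbours
copies['id'] = 3
copies['after'] = df['after'].shift(-1)[mask].values  # `after` from the following row

new_df = pd.concat([df, copies]).sort_index().reset_index(drop=True)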
Inserting rows at a specific position can be done this way:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 4, 5], 'col2': ['A', 'B', 'D', 'E']})
new_row = pd.DataFrame({'col1': [3], 'col2': ['C']})
idx_pos = 2
pd.concat([df.iloc[:idx_pos], new_row, df.iloc[idx_pos:]]).reset_index(drop=True)
Output:
col1 col2
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
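If you need this in several places, the same idea wraps naturally into a small helper; insert_row is a hypothetical name, not a pandas function:
import pandas as pd

def insert_row(df, idx_pos, new_row):
    """Insert new_row (a one-row DataFrame) before positional index idx_pos."""
    return pd.concat([df.iloc[:idx_pos], new_row, df.iloc[idx_pos:]]).reset_index(drop=True)

df = pd.DataFrame({'col1': [1, 2, 4, 5], 'col2': ['A', 'B', 'D', 'E']})
df = insert_row(df, 2, pd.DataFrame({'col1': [3], 'col2': ['C']}))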

Save pandas pivot_table to include index and columns names

I want to save a pandas pivot table for human reading, but DataFrame.to_csv doesn't include the DataFrame.columns.name. How can I do that?
Example:
For the following pivot table:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [6, 7, 8]])
>>> df.columns = list("ABC")
>>> df.index = list("XY")
>>> df
A B C
X 1 2 3
Y 6 7 8
>>> p = pd.pivot_table(data=df, index="A", columns="B", values="C")
When viewing the pivot table, we have both the index name ("A"), and the columns name ("B").
>>> p
B 2 7
A
1 3.0 NaN
6 NaN 8.0
But when exporting as a csv we lose the columns name:
>>> p.to_csv("temp.csv")
===temp.csv===
A,2,7
1,3.0,
6,,8.0
How can I get some kind of human-readable output format which contains the whole of the pivot table, including the .columns.name ("B")?
Something like this would be fine:
B,2,7
A,,
1,3.0,
6,,8.0
Yes, it is possible by prepending a helper DataFrame, but reading the file back is a bit more complicated:
p1 = pd.concat([pd.DataFrame(columns=p.columns, index=[p.index.name]), p])
p1.to_csv('temp.csv', index_label=p.columns.name)
B,2,7
A,,
1,3.0,
6,,8.0
# set the first column as the index
df = pd.read_csv('temp.csv', index_col=0)
# set the columns and index names
df.columns.name = df.index.name
df.index.name = df.index[0]
# remove the first (helper) row of data
df = df.iloc[1:]
print(df)
B 2 7
A
1 3.0 NaN
6 NaN 8.0
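An alternative sketch that avoids the helper-row trick on the writing side: emit the two header lines by hand, then stream the data rows with to_csv (assuming the pivot table p from above; temp2.csv is just an illustrative filename):
with open('temp2.csv', 'w') as f:
    # first line: columns name followed by the column labels -> "B,2,7"
    f.write(','.join([str(p.columns.name)] + [str(c) for c in p.columns]) + '\n')
    # second line: index name padded with empty fields -> "A,,"
    f.write(str(p.index.name) + ',' * len(p.columns) + '\n')
    # data rows, index included but column header suppressed
    p.to_csv(f, header=False)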

Quick ways to fill the missing values of a pandas dataframe with complicated rules

In the m*(n+1) pandas dataframe data_df, there is a timestamp column whose values are possibly repeated integers in range(0, p) (denoting time; there are p unique values in total) and which has no missing values. The other columns data_1, data_2, data_3, ..., data_n each have some missing values.
I would like to fill the missing values in each row of the data columns with specific numbers determined by the timestamp value of that row. To that end, I built a p*n pandas dataframe median_table: the values in the ith row of median_table are used to fill the missing values in the rows of data_df whose timestamp is i.
However, I could not come up with a quick and memory-friendly way to do this. Currently, I use the following code (median_table and data_df are already defined):
new_data_df = pd.DataFrame()
for _timestamp in median_table.timestamp:
    temp_df = data_df.loc[data_df.timestamp == _timestamp]
    temp_df.fillna(median_table.loc[_timestamp, :], inplace=True)
    new_data_df = new_data_df.append(temp_df)
which is extremely inefficient. Another algorithm:
for _timestamp in median_table.timestamp:
    data_df.loc[data_df.timestamp == _timestamp] = \
        data_df.loc[data_df.timestamp == _timestamp] \
            .fillna(median_table.loc[_timestamp, :], inplace=False)
worked equally slowly for me.
Is there a quicker way to do the same thing?
Try this approach: identify the NaNs in data_df and merge those rows with your median_table. I'd expect that to be faster than a for-loop. Apologies if I've incorrectly assumed the structure of your data, but this may at least get you started:
import pandas as pd
import numpy as np
# Create dummy dataframe
data_df = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "data": [1, 2, np.nan, np.nan]
})
print(data_df)
"""
Dataframe looks like:
data timestamp
1.0 1
2.0 2
NaN 3
NaN 4
"""
# Create dummy median table
median_table = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "missing_data": [100, 200, 300, 400]
})
print(median_table)
"""
Median table looks like:
missing_data timestamp
100 1
200 2
300 3
400 4
"""
# Find NaNs in "data" column in data_df
nan_indexes = data_df["data"].isnull()
nan_df = data_df[nan_indexes]
print(nan_df)
"""
nan_df looks like:
data timestamp
NaN 3
NaN 4
"""
# Merge nan_df with median_table based on timestamp column
new_df = pd.merge(left=nan_df, right=median_table, on="timestamp", how="left")
print(new_df)
"""
new_df looks like:
data timestamp missing_data
NaN 3 300
NaN 4 400
"""
# Clean up new_df
new_df = new_df[["timestamp", "missing_data"]] # Discard "data" column
new_df.columns = ["timestamp", "data"] # Rename "missing_data" column to "data"
print(new_df)
"""
new_df now looks like:
timestamp data
3 300
4 400
"""

How to add column labels to a Pandas DataFrame

I can't understand how to add column names to a pandas dataframe; an easy example will clarify my issue:
import pandas as pd

dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
now if I type df then I get
a b c
0 4 4 5
1 1 2 7
2 3 1 9
3 1 4 1
say now that I generate another dataframe just by summing up the columns of the previous one
a = df.sum()
if I type a then I get
a 9
b 11
c 22
That looks like a dataframe with an index but without a name on the only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, because it didn't give me any error message. But still, if I type a I can't see the column name anywhere. What's wrong here?
The method DataFrame.sum() performs an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. If you want to create a DataFrame out of your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
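Equivalently, a Series can be promoted to a one-column DataFrame with to_frame; a quick sketch with the example data:
import pandas as pd

df = pd.DataFrame({'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]})
a = df.sum().to_frame('column')  # Series -> DataFrame with one named column
print(a)
#    column
# a       9
# b      11
# c      22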

How to update values in a specific row in a Python Pandas DataFrame?

With the nice indexing methods in Pandas I have no problem extracting data in various ways. On the other hand, I am still confused about how to change data in an existing DataFrame.
In the following code I have two DataFrames, and my goal is to update values in a specific row of the first df with values from the second df. How can I achieve this?
import pandas as pd

df = pd.DataFrame({'filename': ['test0.dat', 'test2.dat'],
                   'm': [12, 13], 'n': [None, None]})
df2 = pd.DataFrame({'filename': 'test2.dat', 'n': 16}, index=[0])

# this overwrites the first row, but we want to update the second
# df.update(df2)

# this does not update anything
df.loc[df.filename == 'test2.dat'].update(df2)
print(df)
gives
filename m n
0 test0.dat 12 None
1 test2.dat 13 None
[2 rows x 3 columns]
but how can I achieve this:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
[2 rows x 3 columns]
So first of all, pandas updates using the index. When an update command does not update anything, check both the left-hand side and the right-hand side. If you don't update the indices to follow your identification logic, you can do something along the lines of
>>> df.loc[df.filename == 'test2.dat', 'n'] = df2[df2.filename == 'test2.dat'].loc[0]['n']
>>> df
    filename   m     n
0  test0.dat  12  None
1  test2.dat  13    16
If you want to do this for the whole table, I suggest a method I believe is superior to the previously mentioned ones: since your identifier is filename, set filename as your index, and then use update() as you wanted to. Both merge and the apply() approach contain unnecessary overhead:
>>> df.set_index('filename', inplace=True)
>>> df2.set_index('filename', inplace=True)
>>> df.update(df2)
>>> df
            m     n
filename
test0.dat  12  None
test2.dat  13    16
In SQL, I would have done it in one shot as
update table1 set col1 = new_value where col1 = old_value
but in Python Pandas, we can just do this:
data = [['ram', 10], ['sam', 15], ['tam', 15]]
kids = pd.DataFrame(data, columns=['Name', 'Age'])
kids
which will generate the following output :
Name Age
0 ram 10
1 sam 15
2 tam 15
now we can run:
kids.loc[kids.Age == 15, 'Age'] = 17
kids
which will show the following output
Name Age
0 ram 10
1 sam 17
2 tam 17
which should be equivalent to the following SQL
update kids set age = 17 where age = 15
If you have one large dataframe and only a few values to update, I would use apply like this:
import pandas as pd

df = pd.DataFrame({'filename': ['test0.dat', 'test2.dat'],
                   'm': [12, 13], 'n': [None, None]})
data = {'filename': 'test2.dat', 'n': 16}

def update_vals(row, data=data):
    if row.filename == data['filename']:
        row.n = data['n']
    return row

df = df.apply(update_vals, axis=1)
combine_first updates null elements with the value in the same location in other: it combines two DataFrame objects by filling null values in one DataFrame with non-null values from the other. The row and column indexes of the resulting DataFrame will be the union of the two.
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)
A B
0 1.0 3.0
1 0.0 4.0
More information is in the pandas documentation for combine_first.
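Applied to the question's frames, this would look something like the sketch below; the set_index step is needed because combine_first aligns on the index (assumption: filename uniquely identifies rows):
import pandas as pd

df = pd.DataFrame({'filename': ['test0.dat', 'test2.dat'],
                   'm': [12, 13], 'n': [None, None]})
df2 = pd.DataFrame({'filename': 'test2.dat', 'n': 16}, index=[0])

# align both frames on filename and let df2's non-null values win
result = df2.set_index('filename').combine_first(df.set_index('filename')).reset_index()
print(result)  # the test2.dat row now has n = 16; m is carried over from df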
There are probably a few ways to do this, but one approach would be to merge the two dataframes together on the filename column, then populate the column 'n' from the right dataframe where a match was found. The n_x, n_y in the code refer to the left/right dataframes in the merge.
In [100]: df = pd.merge(df, df2, how='left', on='filename')
In [101]: df
Out[101]:
    filename   m   n_x  n_y
0  test0.dat  12  None  NaN
1  test2.dat  13  None   16
In [102]: df['n'] = df['n_y'].fillna(df['n_x'])
In [103]: df = df.drop(['n_x', 'n_y'], axis=1)
In [104]: df
Out[104]:
    filename   m     n
0  test0.dat  12  None
1  test2.dat  13    16
If you want to put anything in the ii-th row, add square brackets:
df.loc[df.iloc[ii].name, 'filename'] = [{'anything': 0}]
I needed to conditionally update a few rows of the dataframe, prepending a marker to one column based on another column's value in the same dataframe.
The df has columns Feature and Entity, and Entity needs the prefix for a specific feature type:
df.loc[df.Feature == 'dnb', 'Entity'] = 'duns_' + df.loc[df.Feature == 'dnb', 'Entity']
