I have a bunch of txt files that I need to compile into a single master file. I use read_csv to extract the information inside. There are some rows to drop, and I was wondering if it's possible to use the skiprows feature without specifying the index numbers of the rows that I want to drop, but rather to tell it which ones to drop according to their row content/value. Here's what the data looks like, to illustrate my point.
Index Column 1 Column 2
0 Rows to drop Rows to drop
1 Rows to drop Rows to drop
2 Rows to drop Rows to drop
3 Rows to keep Rows to keep
4 Rows to keep Rows to keep
5 Rows to keep Rows to keep
6 Rows to keep Rows to keep
7 Rows to drop Rows to drop
8 Rows to drop Rows to drop
9 Rows to keep Rows to keep
10 Rows to drop Rows to drop
11 Rows to keep Rows to keep
12 Rows to keep Rows to keep
13 Rows to drop Rows to drop
14 Rows to drop Rows to drop
15 Rows to drop Rows to drop
What is the most effective way to do this?
Is this what you want to achieve:
import pandas as pd

df = pd.DataFrame({'A': ['row 1', 'row 2', 'drop row', 'row 4', 'row 5',
                         'drop row', 'row 6', 'row 7', 'drop row', 'row 9']})
df1 = df[df['A'] != 'drop row']
print(df)
print(df1)
Original Dataframe:
A
0 row 1
1 row 2
2 drop row
3 row 4
4 row 5
5 drop row
6 row 6
7 row 7
8 drop row
9 row 9
New DataFrame with rows dropped:
A
0 row 1
1 row 2
3 row 4
4 row 5
6 row 6
7 row 7
9 row 9
While you cannot skip rows based on content, you can skip rows based on index. Here are some options for you:
Skip the first n rows:
df = pd.read_csv('xyz.csv', skiprows=2)
# this will skip the first 2 rows of the file
Skip specific rows:
df = pd.read_csv('xyz.csv', skiprows=[0, 2, 5])
# this will skip rows 1, 3, and 6 from the top
# remember row 0 is the 1st line
Skip every nth row in the file:
# you can also skip by counts.
# In the example below, skip the 0th row and every 5th row from there on
def check_row(a):
    if a % 5 == 0:
        return True
    return False

df = pd.read_csv('xyz.txt', skiprows=check_row)
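The same idea works with an inline lambda. A minimal sketch using an in-memory file (io.StringIO stands in for xyz.txt here), which keeps the header line and skips every second data line:

```python
import io

import pandas as pd

data = "a,b\n1,2\n3,4\n5,6\n7,8\n"
# line 0 is the header, so keep it; skip every second data line after it
df = pd.read_csv(io.StringIO(data), skiprows=lambda x: x > 0 and x % 2 == 0)
# df now contains the rows (1, 2) and (5, 6)
```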
More details can be found in the pandas read_csv documentation under skiprows.
No. skiprows will not allow you to drop based on the row content/value.
Based on Pandas Documentation:
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or
number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False
otherwise. An example of a valid callable argument would be lambda x:
x in [0, 2].
Since you cannot do that using skiprows, an efficient alternative is to read the file first and then keep only the rows you want:
df = pd.read_csv(filePath)
df = df.loc[df['column1']=="Rows to keep"]
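Applied to the original question of compiling many txt files into one master file, this could look like the sketch below. The '*.txt' pattern and the 'Column 1' / 'Rows to keep' names are assumptions taken from the example data, so adjust them to your files:

```python
import glob

import pandas as pd

def build_master(pattern='*.txt', column='Column 1', keep='Rows to keep'):
    # read every matching file, keep only the wanted rows, then combine
    frames = [pd.read_csv(path) for path in glob.glob(pattern)]
    kept = [f[f[column] == keep] for f in frames]
    return pd.concat(kept, ignore_index=True)
```

You could then write the result out with `build_master().to_csv('master.csv', index=False)`.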
Related
I want to convert the first 2 values of the first column of my df into the header, i.e. turn those values into column labels.
I mean, I have this:
0 1 2
Person 7.8 10
Person2 6 11
But I want get this
Person Person2
7.8 6
10 11
If the first column is not the index, use DataFrame.set_index, transpose, create a default index, and finally remove the columns name:
df = df.set_index(0).T.reset_index(drop=True).rename_axis(None, axis=1)
Otherwise, if the first column is already the index:
df = df.T.reset_index(drop=True).rename_axis(None, axis=1)
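A quick check of the first variant against the example data from the question:

```python
import pandas as pd

df = pd.DataFrame([['Person', 7.8, 10], ['Person2', 6, 11]])
# first column becomes the index, transpose, then tidy up index and axis name
out = df.set_index(0).T.reset_index(drop=True).rename_axis(None, axis=1)
# out now has columns 'Person' and 'Person2' with the numeric values as rows
```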
I have the following DataFrame:
0       1    2    3    4
First   row  row  row  row
Second  row  row  row  row
Because my dataframe can be longer, I want to rename the first 3 columns, and then I want the rows of the next 3 columns to be moved down as rows under the first 3 columns:
df = df.rename(columns={0: 'data', 1: 'user', 2: 'file'})
data     user    file    3        4       5
dataa_1  user_1  file_1  dataa_2  user_2  file_2
Second   row     row     row      row     row
and then I want to write some code so that the rows of the remaining 3 columns are moved to become additional rows under my first 3 columns:
data     user    file    3    4    5
dataa_1  user_1  file_1
row      row     row
dataa_2  user_2  file_2
row      row     row
Maybe something like this:
import pandas as pd

df = pd.DataFrame([range(6), range(6)],
                  columns=['first', 'second', 'third', 'fourth', 'fifth', 'six'])
df2 = df[['fourth', 'fifth', 'six']]
df2 = df2.rename(columns={'fourth': 'first', 'fifth': 'second', 'six': 'third'})
df3 = pd.concat([df, df2])
This appends the renamed columns as extra rows beneath the original DataFrame.
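A plain concat appends all of df2's rows after all of df's rows. To interleave them as in the expected output, with each moved row directly under its source row, one option is a stable sort on the duplicated index. A sketch with the question's data (the column names are borrowed from the answer above):

```python
import pandas as pd

df = pd.DataFrame([['dataa_1', 'user_1', 'file_1', 'row', 'row', 'row'],
                   ['dataa_2', 'user_2', 'file_2', 'row', 'row', 'row']],
                  columns=['first', 'second', 'third', 'fourth', 'fifth', 'six'])
df2 = df[['fourth', 'fifth', 'six']].rename(
    columns={'fourth': 'first', 'fifth': 'second', 'six': 'third'})
# mergesort is stable, so each original row stays ahead of its moved counterpart
out = pd.concat([df[['first', 'second', 'third']], df2]).sort_index(kind='mergesort')
```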
I have imported an excel sheet using Pandas like this:
w = pd.read_excel(r"C:\Users\lvk\Downloads\Softwares\Prob.xls", header=None)
Once I imported the excel sheet, I need to delete the rows with even a single zero in any column.
Are there any functions in Python to do that?
Please let me know.
Input:
row1: 0 4 3 5
row2: 1 6 5 61
row3: 1 3 6 0
Expected output:
1 6 5 61
Pandas has very powerful interfaces for indexing and selecting data. Among them are the use of the loc keyword to access by rows, and square brackets to pass indexing logic to loc. Normally you might use the names of your columns to do logical operations on their values. Here I don't know the index or columns of your excel data, so we will just loop through all the columns that are there.
# We are going to look in each column
for col in w.columns:
    # And select only the rows in w that don't have a 0 in that column
    w = w.loc[w[col] != 0]
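The loop can also be replaced with a single vectorized filter. A sketch using the sample input from the question:

```python
import pandas as pd

w = pd.DataFrame([[0, 4, 3, 5],
                  [1, 6, 5, 61],
                  [1, 3, 6, 0]])
# keep only the rows in which every value is non-zero
w = w[(w != 0).all(axis=1)]
# only the middle row (1, 6, 5, 61) survives
```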
I have a csv file with 1000 rows and 1000 columns
I have just found that I can call each component from specific rows and columns using
df = pd.read_csv('name.csv', sep=",")
print(df.iloc[120, 250])
which means I am accessing the element at row 120 and column 250.
But my question is: how can I access an element using the name of its column and its row number, instead of the column's position and the row number?
For example, the first-row value of column 1 is 23,
for column 2 I have 43,
and for column 3 the first row is 55.
If I write df.iloc[0, 2] it returns 55, and df.iloc[0, 0] returns 23. Instead of writing the position of the column (for example 2, 0, or 6), I want to force the code to give me the values of the column identified by 55 or 23.
If you are talking about accessing a value by the name of the column and the number of the row, use:
df.at[row_index, column_name]
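If the column "names" actually live in the first data row (as the 23/43/55 example suggests), one option is to promote that row to the header and then use df.at. A sketch with made-up numbers for the remaining rows:

```python
import pandas as pd

df = pd.DataFrame([[23, 43, 55],
                   [1, 2, 3],
                   [4, 5, 6]])
# promote the first row to column labels, then drop it from the data
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
value = df.at[0, 55]   # row 0 of the column whose first-row value was 55
```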
I have two DataFrames, one with expressions and another with values. The Criteria column of DataFrame 1 contains expressions that reference column names of DataFrame 2. My need is to take each row of values from DataFrame 2 and substitute them into the DataFrame 1 criteria, without a loop.
How should I do this in an optimized way?
DataFrame 1:
Criteria point
0 chgdsl='10' 1
1 chgdt ='01022007' 2
3 chgdsl='9' 3
DataFrame 2:
chgdsl chgdt chgname
0 10 01022007 namrr
1 9 02022007 chard
2 9 01022007 exprr
I expect that when I take the first row of DataFrame 2, the output for DataFrame 1 will be 10='10', 01022007='01022007', 10='9'.
I need to take one row at a time from DataFrame 2 and substitute it into all rows of DataFrame 1.
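One loop-free option (a sketch, not necessarily the only way): turn one row of DataFrame 2 into a {column name → value} mapping and apply it to the Criteria column with a regex replace:

```python
import pandas as pd

df1 = pd.DataFrame({'Criteria': ["chgdsl='10'", "chgdt ='01022007'", "chgdsl='9'"],
                    'point': [1, 2, 3]})
df2 = pd.DataFrame({'chgdsl': [10, 9, 9],
                    'chgdt': ['01022007', '02022007', '01022007'],
                    'chgname': ['namrr', 'chard', 'exprr']})

# map each column name of df2 to the value in its first row
mapping = df2.iloc[0].astype(str).to_dict()
# replace every occurrence of a column name inside Criteria with that value
result = df1['Criteria'].replace(mapping, regex=True)
```

Repeating this per row of DataFrame 2 (e.g. `df2.iloc[i]`) gives the substitution for each row in turn.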