Removing rows from a pandas input file - python

I am reading files in pandas whose column names do not start on the first row; instead, row 1 of data.csv contains a headline/name:
>>> df = pd.read_csv("data.csv")
>>> df
Unnamed: 0 Unnamed: 1 name Unnamed: 3
0 col1 col2 col3 col4
1 1 2 3 4
2 2 5 4 6
In this case, how can I delete the row with the headline/name and make sure the actual column names are col1, col2, etc.?
Thanks in advance

You can choose to skip rows:
You can choose specific line numbers to skip or a quantity of lines to skip. If you use specific row numbers, then pass a list to skiprows. In your case you could use the following to be certain things are read correctly:
pd.read_csv("data.csv",header=[0], skiprows=[0])
Data:
I used the following data stored in a file called data.csv
,,name,
0, col1, col2, col3, col4,
1, 1, 2, 3, 4,
2, 2, 5, 4, 6
Output:
0 col1 col2 col3 col4 Unnamed: 5
0 1 1 2 3 4 NaN
1 2 2 5 4 6 NaN
From the docs:
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
Link to source:
Here is a link to the documentation for your reference.
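The two forms of skiprows described above can be compared side by side. This is a minimal sketch that assumes a cleaned-up, in-memory stand-in for data.csv (the string raw is hypothetical, without the stray spaces and trailing commas of the sample file); header=1 is shown as a further option that was not part of the answer.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: a title row, then the real header row.
raw = ",,name,\ncol1,col2,col3,col4\n1,2,3,4\n2,5,4,6\n"

# skiprows as an int: skip that many lines from the start of the file.
by_count = pd.read_csv(io.StringIO(raw), skiprows=1)

# skiprows as a list: skip exactly these 0-indexed line numbers.
by_index = pd.read_csv(io.StringIO(raw), skiprows=[0])

# header=1: take the column names from the second line; earlier lines are ignored.
by_header = pd.read_csv(io.StringIO(raw), header=1)

print(by_count)
```

All three calls should give the same frame here; the list form is mainly useful when the lines to skip are not contiguous.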

Considering your data is in data.csv, you can use below code:
df = pd.read_csv("data.csv", skiprows=1)
Output:
col1 col2 col3 col4 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 1 2 3 4 NaN NaN NaN
1 2 5 4 6 NaN NaN NaN
Remove the unwanted columns with
df = df.dropna(axis=1)
print(df)
Output:
col1 col2 col3 col4
0 1 2 3 4
1 2 5 4 6
As @jpp pointed out, you can also achieve this in one step as follows:
df = pd.read_csv("data.csv", skiprows=1, usecols=['col1', 'col2', 'col3', 'col4'])
Refer to read_csv(), dropna() for more information.
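One caveat with dropna(axis=1): by default it drops every column that contains at least one NaN, so a real column with a single missing value would be removed too. The sketch below assumes a hypothetical variant of the file in which col3 has one missing value, and uses how='all' to drop only the columns that are entirely empty.

```python
import io
import pandas as pd

# Hypothetical file content: trailing commas create empty columns,
# and col3 is missing a value in the last data row.
raw = ",,name,,,,\ncol1,col2,col3,col4,,,\n1,2,3,4,,,\n2,5,,6,,,\n"

df = pd.read_csv(io.StringIO(raw), skiprows=1)

# Default how='any': col3 is dropped as well because it has one NaN.
any_dropped = df.dropna(axis=1)

# how='all': only the completely empty "Unnamed" columns are dropped.
all_dropped = df.dropna(axis=1, how="all")

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```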

Related

How to select rows filtered with condition on the previous and the next rows in pandas and put them in a empty df?

Considering the following dataframe df :
df = pd.DataFrame(
{
"col1": [0,1,2,3,4,5,6,7,8,9,10],
"col2": ["A","B","C","D","E","F","G","H","I","J","K"],
"col3": [1e-0,1e-1,1e-2,1e-3,1e-4,1e-5,1e-6,1e-7,1e-8,1e-9,1e-10],
"col4": [0,4,2,5,6,7,6,3,6,2,1]
}
)
I would like to select rows when the col4 value of the current row is greater than the col4 values of the previous and next rows and to store them in an empty frame.
I wrote the following code that works :
df1 = pd.DataFrame()
for i in range(1, len(df) - 1):
    if (df.iloc[i]['col4'] > df.iloc[i+1]['col4']) and (df.iloc[i]['col4'] > df.iloc[i-1]['col4']):
        df1 = pd.concat([df1, df.iloc[i:i+1]])
I got the expected dataframe df1
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
But this code is very ugly and not readable. Is there a better solution?
Use boolean indexing: compare each value with the previous and next values via Series.shift, use Series.gt for the greater-than test, and chain the two conditions with the bitwise AND operator &:
df = df[df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))]
print (df)
col1 col2 col3 col4
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
EDIT: Solution that always includes the first and last rows:
mask = df['col4'].gt(df['col4'].shift()) & df['col4'].gt(df['col4'].shift(-1))
mask.iloc[[0, -1]] = True
df = df[mask]
print (df)
col1 col2 col3 col4
0 0 A 1.000000e+00 0
1 1 B 1.000000e-01 4
5 5 F 1.000000e-05 7
8 8 I 1.000000e-08 6
10 10 K 1.000000e-10 1
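To see why the shift-based mask excludes the first and last rows by default, it may help to look at the shifted series on their own. This is a small sketch on the col4 values only; the names prev_val, next_val and peaks are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col4": [0, 4, 2, 5, 6, 7, 6, 3, 6, 2, 1]})

# shift() moves values down one row: each row sees the previous row's value.
# The first row has no predecessor, so it gets NaN.
prev_val = df["col4"].shift()

# shift(-1) moves values up one row: each row sees the next row's value.
# The last row has no successor, so it gets NaN.
next_val = df["col4"].shift(-1)

# Any comparison against NaN is False, so the edge rows never pass the test.
mask = df["col4"].gt(prev_val) & df["col4"].gt(next_val)
peaks = df[mask]

print(peaks.index.tolist())
```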

Pandas: Split dataframe with duplicate values into dataframe with unique values

I have a dataframe in Pandas with duplicate values in Col1:
Col1
a
a
b
a
a
b
What I want to do is to split this df into different df-s with unique Col1 values in each.
DF1:
Col1
a
b
DF2:
Col1
a
b
DF3:
Col1
a
DF4:
Col1
a
Any suggestions ?
I don't think you can achieve this in a vectorized way.
One possibility is to use a custom function to iterate the items and keep track of the unique ones. Then use this to split with groupby:
def cum_uniq(s):
    i = 0
    seen = set()
    out = []
    for x in s:
        # a repeated value starts a new group
        if x in seen:
            i += 1
            seen = set()
        out.append(i)
        seen.add(x)
    return pd.Series(out, index=s.index)
out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]
output:
[ Col1
0 a,
Col1
1 a
2 b,
Col1
3 a,
Col1
4 a
5 b]
intermediate:
cum_uniq(df['Col1'])
0 0
1 1
2 1
3 2
4 3
5 3
dtype: int64
If order doesn't matter
Let's add a Col2 to the example:
Col1 Col2
0 a 0
1 a 1
2 b 2
3 a 3
4 a 4
5 b 5
the previous code gives:
[ Col1 Col2
0 a 0,
Col1 Col2
1 a 1
2 b 2,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4
5 b 5]
If order does not matter, you can vectorize it:
out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]
output:
[ Col1 Col2
0 a 0
2 b 2,
Col1 Col2
1 a 1
5 b 5,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4]
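The vectorized variant relies on groupby(...).cumcount(), which numbers the occurrences of each value (0 for the first "a", 1 for the second, and so on). A short sketch of that intermediate step, using the Col1 values from the question; the names occurrence and parts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["a", "a", "b", "a", "a", "b"]})

# For each row, count how many earlier rows hold the same Col1 value.
occurrence = df.groupby("Col1").cumcount()
print(occurrence.tolist())

# Grouping by that counter puts the n-th occurrence of every value
# into the same sub-frame, so each sub-frame has unique Col1 values.
parts = [g for _, g in df.groupby(occurrence)]
for part in parts:
    print(part)
```

Because rows are matched by occurrence number rather than by position, the sub-frames do not follow the original row order, which is why this only applies when order does not matter.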

How to enter the value of one index and column into a new cell with +1 in the iteration?

I have the following DataFrame named df1:
col1 col2 col3
5 3 50
10 4 3
2 0 1
I would like to create a loop that adds a new column called "Total", which takes the value of col1 at index 0 (5) and enters that value under the column "Total" at index 0. The next iteration takes col2 at index 1 (4), and that value goes under the column "Total" at index 1. This step continues until all columns and rows are completed.
The ideal output will be the following:
df1
col1 col2 col3 Total
5 3 50 5
10 4 3 4
2 0 1 1
I have the following code but I would like to find a more efficient way of doing this as I have a large DataFrame:
df1.iloc[0,3] = df1.iloc[0,0]
df1.iloc[1,3] = df1.iloc[1,1]
df1.iloc[2,3] = df1.iloc[2,2]
Thank you!
Numpy has a built in diagonal function:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [5, 10, 2], 'col2': [3, 4, 0], 'col3': [50, 3, 1]})
df['Total'] = np.diag(df)
print(df)
Output
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
You can try apply on rows
df['Total'] = df.apply(lambda row: row.iloc[row.name], axis=1)
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
Hope this logic will help
length = len(df1["col1"])
total = pd.Series([df1.iloc[i, i%3] for i in range(length)])
# in i%3, 3 is number of cols(col1, col2, col3)
# add this total Series to df1
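The modulo idea in the last answer can also be written without a Python-level loop by using NumPy integer-array indexing. This is only a sketch: the fourth row of df1 is hypothetical, added to show the column index wrapping back to col1 when there are more rows than columns.

```python
import numpy as np
import pandas as pd

# The question's data plus one hypothetical extra row (7, 8, 9).
df1 = pd.DataFrame({"col1": [5, 10, 2, 7],
                    "col2": [3, 4, 0, 8],
                    "col3": [50, 3, 1, 9]})

values = df1.to_numpy()
rows = np.arange(len(df1))          # 0, 1, 2, 3
cols = rows % values.shape[1]       # 0, 1, 2, 0 (wraps after the last column)

# Pick one element per row: (row 0, col 0), (row 1, col 1), ...
df1["Total"] = values[rows, cols]
print(df1)
```

For a square frame this selects the same elements as np.diag; the modulo only matters when the frame has more rows than columns.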

How to switch n columns to rows of a r rows pandas dataframe (n*r rows in the final dataframe)?

Let's take this dataframe :
pd.DataFrame(dict(Col1=["a","c"],Col2=["b","d"],Col3=[1,3],Col4=[2,4]))
Col1 Col2 Col3 Col4
0 a b 1 2
1 c d 3 4
I would like to have one row per value in column Col1 and column Col2 (n=2 and r=2, so the expected dataframe has 2*2 = 4 rows).
Expected result :
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
How could I do this, please?
Pandas melt does the job here; the rest just has to do with repositioning and renaming the columns appropriately.
Use pandas melt to transform the dataframe, using Col3 and Col4 as the identifier variables (id_vars). melt typically converts from wide to long.
Next step - reindex the columns, with variable and value as lead columns.
Finally, rename the columns appropriately.
(df.melt(id_vars=['Col3','Col4'])
.reindex(['variable','value','Col3','Col4'],axis=1)
.rename({'variable':'Ind','value':'Value'},axis=1)
)
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
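As a possible shortcut, melt accepts var_name and value_name, so the rename step can be folded into the melt call itself. A minimal sketch of that variant on the question's data; the name result is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(Col1=["a", "c"], Col2=["b", "d"], Col3=[1, 3], Col4=[2, 4]))

# Name the melted columns directly, then put them in the desired order.
result = (
    df.melt(id_vars=["Col3", "Col4"], var_name="Ind", value_name="Value")
      [["Ind", "Value", "Col3", "Col4"]]
)
print(result)
```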

How to make a sum row for two columns python dataframe

I have a pandas dataframe:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
And I want to add a new row summing over two columns [Col1,Col2] like:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
Total 3 5 NaN
Ignoring Col3. What should I do? Thanks in advance.
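You can use pandas.concat together with pandas.DataFrame.sum. (This answer originally used pandas.DataFrame.append, which was removed in pandas 2.0.) Summing only Col1 and Col2 leaves Col3 as NaN in the new row:
df2 = pd.concat([df, df[['Col1', 'Col2']].sum().to_frame().T], ignore_index=True)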
You can use pd.DataFrame.loc. Note the final column will be converted to float since NaN is considered float:
import numpy as np
df.loc['Total'] = [df['Col1'].sum(), df['Col2'].sum(), np.nan]
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(int)
print(df)
Col1 Col2 Col3
0 1 2 3.0
1 2 3 4.0
Total 3 5 NaN
