CSV missing columns with pandas DataFrame - Python

I want to read a CSV file into a pandas DataFrame.
My CSV file has the following format:
a b c d
0 1 2 3 4 5
1 2 3 4 5 6
When I read the CSV with pandas, I get the following dataframe:
a b c d
0 1 2 3 4 5
1 2 3 4 5 6
When I execute print df.columns
I get something like:
Index([u'a', u'b', u'c', u'd'], dtype='object')
And when I execute print df.iloc[0]
I get:
a 2
b 3
c 4
d 5
Name: (0, 1)
I would like to have a dataframe like
a b c d col1 col2
0 1 2 3 4 5
1 2 3 4 5 6
I don't know how many columns I will have to add, but I need as many columns as there are values in the first line after the header. How can I achieve that?

One way to do this would be to read the data twice: once with the header row (the original column names) skipped, and once with only the column names read (and all the data rows skipped).
df = pd.read_csv('data.csv', header=None, skiprows=1)   # 'data.csv' is a placeholder for your file path
columns = pd.read_csv('data.csv', nrows=0).columns.tolist()
columns
Output
['a', 'b', 'c', 'd']
Now find the number of missing columns and use a list comprehension to make the new column names:
num_missing_cols = len(df.columns) - len(columns)
new_cols = ['col' + str(i+1) for i in range(num_missing_cols)]
df.columns = columns + new_cols
df
a b c d col1 col2
0 0 1 2 3 4 5
1 1 2 3 4 5 6
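Putting the pieces together, here is a minimal runnable sketch; it assumes the file is whitespace-delimited and 'data.csv' is a placeholder path:
import pandas as pd

# read the data rows only, letting pandas number the columns 0..n-1
df = pd.read_csv('data.csv', sep=r'\s+', header=None, skiprows=1)

# read just the header row to recover the original column names
columns = pd.read_csv('data.csv', sep=r'\s+', nrows=0).columns.tolist()

# generate names for the extra, unnamed columns
num_missing_cols = len(df.columns) - len(columns)
new_cols = ['col' + str(i + 1) for i in range(num_missing_cols)]

df.columns = columns + new_cols
print(df)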

Related

pandas drop last group element

I have a DataFrame df = pd.DataFrame({'col1': ["a","b","c","d","e", "f","g","h"], 'col2': [1,1,1,2,2,3,3,3]}) that looks like
Input:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
I want to drop the last row of each group based on "col2", which would look like...
Expected Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
I wrote df.groupby('col2').tail(1), which gets me what I want to delete, but when I try df.drop(df.groupby('col2').tail(1)) I get an axis error. What would be a solution to this?
Looks like duplicated would work:
df[df.duplicated('col2', keep='last') |
   (~df.duplicated('col2', keep=False))  # this is to keep all single-row groups
   ]
Or with your approach, you should drop the index:
# this would also drop all single-row groups
df.drop(df.groupby('col2').tail(1).index)
Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
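For reference, a minimal self-contained sketch of the index-drop approach, rebuilt with the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'col1': list('abcdefgh'),
                   'col2': [1, 1, 1, 2, 2, 3, 3, 3]})

# tail(1) picks the last row of every group; dropping those index labels
# removes the last element per group (single-row groups disappear too)
out = df.drop(df.groupby('col2').tail(1).index)
print(out)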
Try this:
df.groupby('col2', as_index=False).apply(lambda x: x.iloc[:-1,:]).reset_index(drop=True)

Pandas dataframe inconsistent data in multiple rows

I have something like this:
>>> x = {'id': [1,1,2,2,2,3,4,5,5], 'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']}
>>> df = pd.DataFrame(x)
>>> df
id value
0 1 a
1 1 a
2 2 b
3 2 b
4 2 c
5 3 d
6 4 e
7 5 f
8 5 g
I want to filter inconsistent values in this table. For example, the rows with id=2 or id=5 are inconsistent, because the same id is associated with different values. I have read solutions about where or any, but they are not something like "check whether rows with this id always have the same value".
How can I solve this problem?
You can use groupby and filter. This should give you the ids with inconsistent values.
df.groupby('id').filter(lambda x: x.value.nunique()>1)
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
In your case, we do groupby + transform with nunique:
unc_df=df[df.groupby('id').value.transform('nunique').ne(1)]
id value
2 2 b
3 2 b
4 2 c
7 5 f
8 5 g
I guess you can use drop_duplicates to drop repeated rows based on the id column:
In [599]: df.drop_duplicates('id', keep='first')
Out[599]:
id value
0 1 a
2 2 b
5 3 d
6 4 e
7 5 f
So the above will pick the first row for each duplicated id, and you will have one row per id in the resulting dataframe.
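For completeness, a minimal runnable sketch of the transform-based filter with the sample data; the complementary mask keeps only the consistent ids:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5, 5],
                   'value': ['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']})

# number of distinct values per id, broadcast back onto every row
n_unique = df.groupby('id')['value'].transform('nunique')

inconsistent = df[n_unique.ne(1)]   # ids mapped to more than one value
consistent = df[n_unique.eq(1)]     # ids mapped to exactly one value
print(inconsistent)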

Pandas: Split columns into multiple columns by two delimiters

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by =, but this does not seem very efficient when I have many rows, and especially when there are so many fields that they cannot be hard-coded beforehand.
Any suggestion would be very welcome.
You could use named groups together with Series.str.extract. At the end, concat the 'ID' back. This assumes every line always has A=, B=, and C=.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2', then we can split on ';' and partition on '='. Pivot at the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
   .str.split(';', expand=True)
   .stack()
   .str.partition('=')
   .reset_index(-1, drop=True)
   .pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
Browsing a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting the dataframe columns twice.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
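A self-contained sketch of the same idea using the column names from the question; passing index=df.index (an extra step, not in the answer above) keeps the parsed columns aligned with the original rows:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'INFO': ['A=2;B=2;C=5', 'A=3;B=4;C=1', 'A=1;B=3;C=2']})

# one dict per row, e.g. {'A': '2', 'B': '2', 'C': '5'}
values = [dict(item.split('=') for item in line.split(';')) for line in df['INFO']]

parsed = pd.DataFrame(values, index=df.index)
out = pd.concat([df['ID'], parsed], axis=1)
print(out)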
Another solution is Series.str.findall to extract the values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop("INFO", axis=1)
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
                   [2, "A=3;B=4;C=1"],
                   [3, "A=1;B=3;C=2"]],
                  columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", axis=1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
Another solution:
# split on ';'
# explode
# then split on '='
# and pivot
df_INFO = (df.INFO
           .str.split(';')
           .explode()
           .str.split('=', expand=True)
           .pivot(columns=0, values=1)
           )
pd.concat([df.ID, df_INFO], axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

How to delete rows in a pandas df

I am trying to remove a row in a pandas df plus the following row. For the df below, I want to remove the row where the value in Code is equal to X, and I also want to remove the subsequent row.
import pandas as pd
d = ({
    'Code': ['A', 'A', 'B', 'C', 'X', 'A', 'B', 'A'],
    'Int': [0, 1, 1, 2, 3, 3, 4, 5],
})
df = pd.DataFrame(d)
If I use this code it removes the desired row. But I can't use the same for value A as there are other rows that contain A, which are required.
df = df[df.Code != 'X']
So my intended output is:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
4 B 4
5 A 5
I need something like df = df[df.Code != 'X'] +1
Using shift, keep only the rows where neither the current row nor the previous row is 'X':
df.loc[(df.Code != 'X') & (df.Code.shift() != 'X')]
Out[99]:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5
You need to find the index of the element you want to delete, and then you can simply delete at that position twice; after the first drop, the same position points at the row that followed 'X' (this assumes the default RangeIndex):
>>> i = df[df.Code == 'X'].index
>>> df.drop(df.index[i], inplace=True)
>>> df.drop(df.index[i], inplace=True, errors='ignore')
>>> df
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5
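A compact variant of the shift idea, written as an explicit mask and with the index reset to match the expected output; fill_value=False keeps the first row from being treated as following an 'X':
import pandas as pd

df = pd.DataFrame({'Code': ['A', 'A', 'B', 'C', 'X', 'A', 'B', 'A'],
                   'Int': [0, 1, 1, 2, 3, 3, 4, 5]})

is_x = df.Code.eq('X')
# drop rows that are 'X' and rows immediately after an 'X'
drop_mask = is_x | is_x.shift(fill_value=False)

out = df[~drop_mask].reset_index(drop=True)
print(out)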

Delete pandas column if column name begins with a number

I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what the best practices are when it comes to handling pandas DataFrames; how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest way is to select all columns which do not start with a number, using filter with a regex - ^ matches the start of the string and \D matches a non-digit:
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and match digits instead:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A': list('abcdef'),
                   '1B': [4,5,4,5,5,4],
                   'C': [7,8,9,4,2,3],
                   'D3': [1,3,5,7,1,0],
                   'E': [5,3,6,9,2,4],
                   'F': list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
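A small self-contained check of this approach, using a hypothetical frame:
import pandas as pd

# hypothetical frame mixing digit-leading and ordinary column names
df = pd.DataFrame({'2A': [1, 2], '1B': [3, 4], 'C': [5, 6], 'D3': [7, 8]})

columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
print(df.columns.tolist())   # ['C', 'D3']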
