'DataFrame' object has no attribute 'string_column'

I am trying to count the delimiters in each line of my CSV file using this piece of code:
import pandas as pd
df = pd.read_csv(path,sep=',')
df['comma_count'] = df.string_column.str.count(',')
print (df)
But I keep getting this error:
'DataFrame' object has no attribute 'string_column'.
Trying to iterate through my dataframe was of no avail.
I am trying to achieve this:
    id    val      new  comma_count
0    a    2.0    234.0            2
1    a    5.0    432.0            2
2    a    4.0    234.0            2
3    a    2.0  23423.0            2
4    a    9.0    324.0            2
5    a    7.0      NaN            1
6  NaN  234.0      NaN            1
7    a    6.0      NaN            1
8    4    NaN      NaN            0
My file:
id,val,new
a,2,234
a,5,432
a,4,234
a,2,23423
a,9,324
a,7
,234
a,6,
4

Read the file a second time with a separator that does not occur in the data (for example '|') so that every line is parsed as a single string, then select the first column and count the commas. The header line is consumed by both reads, so the row indices stay aligned:
df1 = pd.read_csv(path, sep='|')
df['comma_count'] = df1.iloc[:, 0].str.count(',')
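Putting it together, a minimal self-contained sketch of the approach (io.StringIO stands in for the real file path, purely for illustration):
import io
import pandas as pd

# The sample file from the question.
data = """id,val,new
a,2,234
a,5,432
a,4,234
a,2,23423
a,9,324
a,7
,234
a,6,
4"""

# Normal parse to get the id/val/new columns.
df = pd.read_csv(io.StringIO(data), sep=',')

# Re-read with a separator that never occurs, so each line stays one string.
# The header line is consumed here too, keeping the row indices aligned.
df1 = pd.read_csv(io.StringIO(data), sep='|')
df['comma_count'] = df1.iloc[:, 0].str.count(',')
print(df)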

Related

Python Dataframe Duplicated Columns while Merging multiple times

I have a main dataframe and a sub dataframe. I want to merge each column of the sub dataframe into the main dataframe, using the main dataframe's column as the reference. I have successfully arrived at my desired answer, except that I see duplicated copies of the main dataframe's reference column. Below are my expected and present answers.
Present solution:
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df =
Ref A Ref Z
0 1 NaN 1 1.0
1 2 2.0 2 2.0
2 3 3.0 3 NaN
3 4 NaN 4 NaN
Expected Answer:
df =
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Update
Use duplicated to drop the repeated columns; it keeps the first occurrence of each label:
>>> df.loc[:, ~df.columns.duplicated()]
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
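For reference, a minimal end-to-end sketch combining the merge loop from the question with the duplicate-column drop:
import pandas as pd

df = pd.DataFrame({'Ref': [1, 2, 3, 4]})
df1 = pd.DataFrame({'A': [2, 3], 'Z': [1, 2]})

# Merge each sub-dataframe column against Ref, as in the question.
merged = [df.merge(df1[col], left_on='Ref', right_on=col, how='left')
          for col in df1.columns]
out = pd.concat(merged, axis=1)

# Drop the repeated Ref columns; duplicated() keeps the first occurrence.
out = out.loc[:, ~out.columns.duplicated()]
print(out)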
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to clean up your dataframe. With duplicated
# labels, df.pop('Ref') returns a DataFrame holding both 'Ref'
# columns, so .iloc[:, 0] takes the first of them.
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
What about setting the 'Ref' column as the index while building the dataframe list, and then resetting the index to get Ref back as a column?
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df = df.reset_index()
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN
This is a reduction process. Instead of the list comprehension, use a for loop, or even functools.reduce:
from functools import reduce
reduce(lambda x, y: x.merge(df1[y], left_on='Ref', right_on=y, how='left'), df1.columns, df)
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
The above is similar to:
for y in df1.columns:
    df = df.merge(df1[y], left_on='Ref', right_on=y, how='left')
df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN

pandas pivot table where the column contains a string with multiple categories

I have data in the form:
'cat' 'value'
a 1
a,b 2
a,b,c 3
b,c 2
b 1
which I would like to convert using a pivot table:
 'a'  'b'  'c'
  1
  2    2
  3    3    3
       2    2
       1
How do I perform this? If I use the pivot command:
df.pivot(columns= 'cat', values = 'value')
which yields this result:
 'a'  'a,b'  'a,b,c'  'b,c'  'b'
  1
        2
                3
                         2
                              1
You can use .explode() after transforming the string into a list, and then pivot it normally:
df['cat'] = df['cat'].str.split(',')
exploded = df.explode('cat')
df = exploded.pivot_table(index=exploded.index, columns='cat', values='value')
This outputs:
cat a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
You can then rename the columns axis (for example df = df.rename_axis(None, axis=1)) if you don't want it to be named cat.
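As a self-contained sketch of the explode route (df.assign leaves the original frame untouched, so the same df also works for the get_dummies answer below):
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a,b', 'a,b,c', 'b,c', 'b'],
                   'value': [1, 2, 3, 2, 1]})

# Split the comma-separated categories and explode to one row per
# category, keeping the original row index.
exploded = df.assign(cat=df['cat'].str.split(',')).explode('cat')

# Pivot back on the original row index; each category becomes a column.
out = exploded.pivot_table(index=exploded.index, columns='cat', values='value')
print(out)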
Try with str.get_dummies and multiply by the value column (then replace 0 with NaN if necessary):
import numpy as np
df['cat'].str.get_dummies(",").mul(df['value'], axis=0).replace(0, np.nan)
a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN

Python Read all the sheets and combine

I am trying to concatenate all the sheets in an Excel file without introducing NaN values from the other sheets:
import pandas as pd
excel_file = "C:/Users/User/Documents/UiPath/Endo Bot/endoProcess/NEW ENDO PASTE HERE/-r1- (07-23-2020).xlsx"
fil = pd.ExcelFile(excel_file)
names = fil.sheet_names
df = pd.concat([fil.parse(name) for name in names])
print(df)
It looks like this only appends each sheet below the first one.
The result:
COUNT NAME Number count2
0 4.0 kiko NaN NaN
1 5.0 esmer NaN NaN
2 6.0 jason NaN NaN
0 NaN NaN 9.0 23.0
1 NaN NaN 10.0 13.0
2 NaN NaN 11.0 14.0
The result that I want:
COUNT NAME Number count2
0 4.0 kiko 9.0 23.0
1 5.0 esmer 10.0 13.0
2 6.0 jason 11.0 14.0
Concatenate on axis 1 (columns) instead of axis 0 (index, the default), like so: df = pd.concat([fil.parse(name) for name in names], axis=1).
Code
import pandas as pd
excel_file = "C:/Users/User/Documents/UiPath/Endo Bot/endoProcess/NEW ENDO PASTE HERE/-r1- (07-23-2020).xlsx"
fil = pd.ExcelFile(excel_file)
names = fil.sheet_names
# concatenated
df = pd.concat([fil.parse(name) for name in names], axis=1)
print(df)
Output
COUNT NAME Number count2
0 4 kiko 9 23
1 5 esmer 10 13
2 6 jason 11 14
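As a side note, pd.read_excel can load every sheet in one call with sheet_name=None, which returns a dict of {sheet_name: DataFrame}; a minimal sketch of the same concatenation using that:
import pandas as pd

excel_file = "C:/Users/User/Documents/UiPath/Endo Bot/endoProcess/NEW ENDO PASTE HERE/-r1- (07-23-2020).xlsx"

# sheet_name=None loads all sheets at once.
sheets = pd.read_excel(excel_file, sheet_name=None)
df = pd.concat(sheets.values(), axis=1)
print(df)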

How to remove multiple headers from a dataframe and keep just the first

I'm working with a csv file that contains multiple header rows, all repeated like in this example:
1  2           3     4
0  POSITION_T  PROB  ID
1  2.385       2.0   1
2  POSITION_T  PROB  ID
3  3.074       6.0   3
4  6.731       8.0   4
6  POSITION_T  PROB  ID
7  12.508      2.0   1
8  12.932      4.0   2
9  12.985      4.0   2
I want to remove the duplicated headers to get a final document like this:
0  POSITION_T  PROB  ID
1  2.385       2.0   1
3  3.074       6.0   3
4  6.731       8.0   4
7  12.508      2.0   1
8  12.932      4.0   2
9  12.985      4.0   2
The way I am trying to remove these headers is:
df1 = [df!='POSITION_T'][df!='PROB'][df!='ID']
But that produce the error TypeError: Could not compare ['TRACK_ID'] with block values
Some ideas? Thanks in advance!
Filtering out by field value:
df = pd.read_table('yourfile.csv', header=None, delim_whitespace=True, skiprows=1)
df.columns = ['0','POSITION_T','PROB','ID']
del df['0']
# filtering out the rows with `POSITION_T` value in corresponding column
df = df[df.POSITION_T.str.contains('POSITION_T') == False]
print(df)
The output:
POSITION_T PROB ID
1 2.385 2.0 1
3 3.074 6.0 3
4 6.731 8.0 4
6 12.508 2.0 1
7 12.932 4.0 2
8 12.985 4.0 2
This is not ideal! The best way to deal with this would be to handle it in the file parsing, for example:
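A sketch of that idea, assuming the file is whitespace-delimited and named yourfile.csv as in the previous answer:
import io
import pandas as pd

with open('yourfile.csv') as fh:
    lines = fh.readlines()

# Drop the leading "1 2 3 4" line and every repeated header line,
# then let pandas parse the cleaned buffer.
cleaned = [ln for ln in lines[1:] if 'POSITION_T' not in ln]
df = pd.read_csv(io.StringIO(''.join(cleaned)), delim_whitespace=True,
                 header=None, names=['row', 'POSITION_T', 'PROB', 'ID'],
                 index_col=0)
print(df)
If the frame has already been read, the cleanup can also be done in pandas: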
mask = df.iloc[:, 0] == 'POSITION_T'
d1 = df[~mask]
d1.columns = df.loc[mask.idxmax].values
d1 = d1.apply(pd.to_numeric, errors='ignore')
d1
  POSITION_T  PROB  ID
1      2.385   2.0   1
3      3.074   6.0   3
4      6.731   8.0   4
7     12.508   2.0   1
8     12.932   4.0   2
9     12.985   4.0   2
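A self-contained version of the same approach, reconstructing the frame as it would look right after a header-less read (the literal values are taken from the question):
import pandas as pd

# Repeated header rows mixed in with the data rows.
df = pd.DataFrame([
    ['POSITION_T', 'PROB', 'ID'],
    ['2.385', '2.0', '1'],
    ['POSITION_T', 'PROB', 'ID'],
    ['3.074', '6.0', '3'],
    ['6.731', '8.0', '4'],
    ['POSITION_T', 'PROB', 'ID'],
    ['12.508', '2.0', '1'],
    ['12.932', '4.0', '2'],
    ['12.985', '4.0', '2'],
])

mask = df.iloc[:, 0] == 'POSITION_T'     # True on repeated header rows
d1 = df[~mask]                           # keep only the data rows
d1.columns = df.loc[mask.idxmax].values  # column names from the first header row
d1 = d1.apply(pd.to_numeric)             # restore numeric dtypes
print(d1)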
If the headers were read in as a MultiIndex, to keep the bottom level of the column names only:
df.columns = [multicols[-1] for multicols in df.columns]
past_data = pd.read_csv("book.csv")
# Drop every row where the column still contains its own header text
past_data = past_data[past_data.LAT.astype(str).str.contains('LAT') == False]
print(past_data)
Replace the CSV (here: book.csv), replace the variable name (here: past_data), and replace LAT with any one of your column names. That's all; the repeated headers will be removed.

How To Create A Dataframe Series

I have a dictionary of Pandas Series objects that I want to turn into a DataFrame. The key for each series should be the column heading. The individual series overlap, but each label is unique.
I thought I should be able to just do
df = pd.DataFrame(data)
But I keep getting the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I get the same error if I try to turn each series into a frame and use pd.concat(data, axis=1), which doesn't make sense if you take the column labels into account. What am I doing wrong, and how do I fix it?
I believe you need reset_index with the parameter drop=True for each Series in a dict comprehension, because of the duplicates in the index:
s = pd.Series([1,4,5,2,0], index=[1,2,2,3,5])
s1 = pd.Series([5,7,8,1],index=[1,2,3,4])
data = {'a':s, 'b': s1}
print (s.reset_index(drop=True))
0 1
1 4
2 5
3 2
4 0
dtype: int64
df = pd.concat({k:v.reset_index(drop=True) for k,v in data.items()}, axis=1)
print (df)
a b
0 1 5.0
1 4 7.0
2 5 8.0
3 2 1.0
4 0 NaN
If you need to drop the rows with a duplicated index, use boolean indexing with duplicated:
print (s[~s.index.duplicated()])
1 1
2 4
3 2
5 0
dtype: int64
df = pd.concat({k:v[~v.index.duplicated()] for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.0 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN
Another solution, if aggregating rows with duplicated labels is acceptable, is groupby with mean:
print (s.groupby(level=0).mean())
1 1.0
2 4.5
3 2.0
5 0.0
dtype: float64
df = pd.concat({k:v.groupby(level=0).mean() for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.5 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN
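Once the duplicated labels are dealt with, the plain pd.DataFrame(data) call from the question works as well, since building a frame from a dict of Series simply aligns on the union of the (now unique) indexes. A minimal sketch:
import pandas as pd

s = pd.Series([1, 4, 5, 2, 0], index=[1, 2, 2, 3, 5])
s1 = pd.Series([5, 7, 8, 1], index=[1, 2, 3, 4])

# Drop the duplicated labels first, then the constructor aligns fine.
data = {'a': s[~s.index.duplicated()], 'b': s1}
df = pd.DataFrame(data)
print(df)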
