I'm reading a CSV file with pandas in Python, and the values in the last column end with a semicolon (;). How can I remove it? Using ; as the delimiter does not work.
Example :
0 -0.22693644;
1 -0.22602014;
2 0.37201694;
3 -0.27763826;
4 -0.5549711;
Name: Z-Axis, dtype: object
I would use the comment parameter:
df = pd.read_csv(file, comment=';')
NOTE: this will work properly only for the last column, as everything from the comment character to the end of the line is ignored.
PS: as a little bonus, Pandas will treat such a column as numeric, not as a string.
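A minimal sketch to illustrate (the in-memory data below just mimics the question's file):
import io
import pandas as pd

# Hypothetical data standing in for the real file: the last column's values end with ';'
raw = "idx,Z-Axis\n0,-0.22693644;\n1,-0.22602014;\n2,0.37201694;\n"
df = pd.read_csv(io.StringIO(raw), comment=';')
print(df['Z-Axis'].dtype)  # float64 -- the trailing ';' is never parsed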
Use str.rstrip:
df['Z-Axis'] = df['Z-Axis'].str.rstrip(";")
Another option:
df['Z-Axis'] = df['Z-Axis'].str[:-1]
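Either way the result is still a string (object) column; assuming you want numbers afterwards, a small follow-up:
# After the ';' is stripped, convert the column to a numeric dtype
df['Z-Axis'] = pd.to_numeric(df['Z-Axis'])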
I am creating a dataframe based on a csv import:
ID, attachment, attachment, comment, comment
1, lol.jpg, lmfao.png, 'Luigi',
2, cat.docx, , 'It's me', 'Mario'
Basically, the number of 'attachment' and 'comment' columns corresponds to the row that has the largest number of attachments and comments.
Since I am exporting the CSV from a third party software, I do not know in advance how many attachments and comment columns there will be.
Importing this CSV with pd.read_csv creates the following dataframe
   ID attachment attachment.1    comment comment.1
0   1    lol.jpg    lmfao.png    'Luigi'       NaN
1   2   cat.docx          NaN  'It's me'   'Mario'
Is there a simple way to select all attachment/comment columns?
Such as attachments_df = imported_df.attachment.all or comments_df = imported_df['comment'].??
Thanks.
Use DataFrame.filter with a regular expression: ^ anchors the start of the column name, (\.\d+)? optionally matches the .1, .2, ... suffix that pandas appends to duplicated names, and $ anchors the end:
attachments_df = imported_df.filter(regex=r'^attachment(\.\d+)?$')
comments_df = imported_df.filter(regex=r'^comment(\.\d+)?$')
Another possible solution:
attachments_df = imported_df.loc[:, imported_df.columns.str.startswith('attachment')]
comments_df = imported_df.loc[:, imported_df.columns.str.startswith('comment')]
You can also use the like argument of the filter function:
imported_df.filter(like='attach')
attachment attachment.1
0 lol.jpg lmfao.png
1 cat.docx NaN
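A self-contained sketch tying this together (io.StringIO stands in for the real CSV file, and the regex is one possible pattern for the names pandas generates):
import io
import pandas as pd

# Duplicate header names become attachment, attachment.1, comment, comment.1 on import
raw = """ID,attachment,attachment,comment,comment
1,lol.jpg,lmfao.png,'Luigi',
2,cat.docx,,'It's me','Mario'
"""
imported_df = pd.read_csv(io.StringIO(raw))

attachments_df = imported_df.filter(regex=r'^attachment(\.\d+)?$')
comments_df = imported_df.filter(like='comment')
print(list(attachments_df.columns))  # ['attachment', 'attachment.1']
print(list(comments_df.columns))     # ['comment', 'comment.1']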
I am having some difficulty working with a scraped CSV file in pandas.
I have several columns; one of them contains prices such as '1 800 €'.
After importing the CSV as a dataframe, I cannot convert this column to integer.
I removed the euro symbol without problem:
data['prix']= data['prix'].str.strip('€')
I tried to delete the space with the same approach, but the space remained:
data['prix']= data['prix'].str.strip()
or
data['prix']= data['prix'].str.strip(' ')
or
data['prix']= data['prix'].str.replace(' ', '')
I tried to force the conversion to int:
data['prix']= pd.to_numeric(data['prix'], errors='coerce')
My column was filled with NaN values.
I tried converting the dtype before the space-replacement operation:
data = data.convert_dtypes(convert_string=True)
But the result was the same: the spaces are still present and I cannot convert to integer.
I looked at the dataset in Excel and could not identify any special problem in the data.
I also tried changing the encoding in read_csv ... same result.
In this same dataset I had the same kind of values for the mileage, such as '15 256 km', and I had no problem cleaning and converting those to int.
I would like to try a regex to copy only the digits of the field and create a new column with them.
How should I proceed? I am also open to other ideas.
Thank you
Use str.findall:
I would like to try a regex to copy only the digits of the field and create a new column with them.
data['prix2'] = data['prix'].str.findall(r'\d+').str.join('').astype(int)
# Or if it raises an exception
data['prix2'] = pd.to_numeric(data['prix'].str.findall(r'(\d+)').str.join(''), errors='coerce')
To delete the white space, use this line:
data['prix']= data['prix'].str.replace(" ","")
and to convert the strings into ints, use this line:
data['prix'] = [int(i) for i in data['prix']]
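If str.replace(' ', '') seems to have no effect, the 'space' in scraped data is often not a plain space but a non-breaking space (U+00A0), which \s does match; a hedged sketch that strips any whitespace and the euro sign before converting:
# Remove every whitespace character (including non-breaking spaces) and the euro sign,
# then convert; anything still unparseable becomes NaN instead of raising
data['prix'] = pd.to_numeric(
    data['prix'].str.replace(r'[\s€]', '', regex=True),
    errors='coerce'
)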
I have data frame like this:
id : Name
0 : one
1 : one + two
2 : two
3 : two + three + four
I want to filter rows from this dataframe where Name contains '+' and save them to another dataframe. I tried:
df[df.Name.str.contains("+")]
but I'm getting this error:
nothing to repeat at position 0
Any help would be appreciated... Thanks
Looking at the documentation of the str.contains method, it assumes that the string you are passing is a regexp by default.
Therefore, you can either escape the plus character: "\+" or pass the argument regex=False to the method:
df[df.Name.str.contains(r"\+")]
df[df.Name.str.contains("+", regex=False)]
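For illustration, a quick reproduction of the question's example with both variants:
import pandas as pd

df = pd.DataFrame({'Name': ['one', 'one + two', 'two', 'two + three + four']})

# Either escape the regex metacharacter ...
with_plus = df[df.Name.str.contains(r'\+')]
# ... or disable regex matching entirely
with_plus = df[df.Name.str.contains('+', regex=False)]
print(with_plus)
#                  Name
# 1           one + two
# 3  two + three + four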
I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -) the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns, as these correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With the most recent versions of pandas, you can escape a column name that contains special characters with backticks (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a previous step in your process, replacing special characters with something more appropriate.
For instance:
(df
.rename(columns = lambda value: value.replace('$', '_'))
.query("demo_gender == 'male'")
)
For the interested, here is a simple procedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier()]

# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r

inv_replacements = {replacements[k]: k for k in replacements.keys()}

df = df.rename(columns=replacements)      # Rename the columns
df = df.query(query)                      # Carry out the query
df = df.rename(columns=inv_replacements)  # Rename the columns back
Which amounts to identifying the invalid column names, transforming the query and renaming the columns. Finally we perform the query and then translate the column names back.
Credit to @chrisb for their answer, which pointed me in the right direction.
The current implementation of query requires the string to be a valid python expression, so column names must be valid python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] =='male']
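A small sketch, using a made-up two-column frame, showing both options side by side:
import pandas as pd

df = pd.DataFrame({'demo$gender': ['male', 'female'], 'latency': [715, 800]})

# Backtick escaping inside query (recent pandas versions)
males = df.query("`demo$gender` == 'male'")

# Plain boolean indexing, which works regardless of the column name
males = df[df['demo$gender'] == 'male']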
I am using the ALL.zip file located here. My goal is to create a pandas DataFrame with it. However, if I run
data = pd.read_csv('foo.csv')
the column names do not match up. The first column has no name, the second column is labeled with the first column's name, and the last column is a Series of NaN. So I tried
colnames=[list of colnames]
data = pd.read_csv('foo.csv', names=colnames, header=False)
which gave me the exact same thing, so I ran
data = pd.read_csv('foo.csv', names=colnames)
which lined the colnames up perfectly, but left the CSV's own column names (the first line of the CSV file) as the first row of data. So I ran
data=data[1:]
which did the trick.
So I found a workaround without solving the actual problem. I looked at the read_csv documentation, found it a bit overwhelming, and could not figure out a way to fix this using only pd.read_csv.
What was the fundamental problem (I am assuming it is either user error or a problem with the file)? Is there a way to fix it with one of the options of read_csv?
Here is the first 2 rows from the csv file
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE","090960009","INFORMATION REQUESTED PER BEST EFFORTS","INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","","SA17A","1015697","SA17.796904","P2016",
It's not the columns that you're having a problem with, it's the index.
import pandas as pd
df = pd.read_csv('P00000001-ALL.csv', index_col=False, low_memory=False)
print(df.head(1))
cmte_id cand_id cand_nm contbr_nm contbr_city \
0 C00458844 P60006723 Rubio, Marco HEFFERNAN, MICHAEL APO
contbr_st contbr_zip contbr_employer \
0 AE 090960009 INFORMATION REQUESTED PER BEST EFFORTS
contbr_occupation contb_receipt_amt contb_receipt_dt \
0 INFORMATION REQUESTED PER BEST EFFORTS 210 27-JUN-15
receipt_desc memo_cd memo_text form_tp file_num tran_id election_tp
0 NaN NaN NaN SA17A 1015697 SA17.796904 P2016
The low_memory=False is because column 6 has mixed datatypes.
The problem comes from every line in the file except the first ending with a comma (the separator character). Pandas sees an extra, empty column there and resolves the mismatch by treating the first column as the index.
Try
data = pd.read_csv('P00000001-ALL.csv', index_col=False)
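A minimal reproduction of the trailing-comma behaviour (with made-up data), showing what index_col=False changes:
import io
import pandas as pd

# Three header names, but every data row ends with ',' and so has four fields
raw = "a,b,c\n1,2,3,\n4,5,6,\n"

default = pd.read_csv(io.StringIO(raw))                 # 'a' values are swallowed into the index
fixed = pd.read_csv(io.StringIO(raw), index_col=False)  # columns line up, trailing empty field is dropped
print(default.head())
print(fixed.head())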