I have one numeric feature in a data frame, but in Excel some of the values contain quotes which need to be removed.
The table below is what my data looks like in the Excel file; I want to remove the quotes from the last 3 rows using Python.
Col1   Col2
123    A
456    B
789    C
"123"  D
"456"  E
"789"  F
I have used the following code in Python:
df["Col1"] = df['Col1'].replace('"', ' ').astype(int)
But the above code gives me this error message: invalid literal for int() with base 10: '"123"'.
I have also tried the strip() function, but it still does not work.
If I do not convert the data type and use the code below
df["Col1"] = df['Col1'].replace('"', ' ')
then the code executes without any error, but when saving the file to CSV it still shows the quotes.
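Note that Series.replace without regex=True only matches entire cell values, which is why the second call above runs but leaves the quotes in place. A minimal sketch using the str accessor instead (the frame is mocked up from the table above):
import pandas as pd

# mock-up of the question's data: a column mixing ints and quoted strings
df = pd.DataFrame({'Col1': [123, 456, 789, '"123"', '"456"', '"789"'],
                   'Col2': list('ABCDEF')})

# str.replace works on substrings; astype(str) first because the column mixes types
df['Col1'] = df['Col1'].astype(str).str.replace('"', '', regex=False).astype(int)
print(df['Col1'].tolist())  # [123, 456, 789, 123, 456, 789]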
One way is to use a converter function while reading the Excel file. Something along those lines (assuming the data provided is in an Excel file in columns 'A' and 'B'):
import pandas as pd

def conversion(value):
    # integers read from Excel are already fine
    if isinstance(value, int):
        return value
    # strings carry the literal quotes; strip them off
    return value.strip('"')

df = pd.read_excel('remove_quotes_excel.xlsx', header=None,
                   converters={0: conversion})
# df
0 1
0 123 A
1 456 B
2 789 C
3 123 D
4 456 E
5 789 F
Both columns are object type, but now (if needed) it is straightforward to convert to int:
df[0] = df[0].astype(int)
You can do it with this code; regex=True makes the replacement apply to the quote character anywhere inside each value, rather than requiring the whole cell to match (it is also the fix if you get a warning):
df.Col1.replace('"', '', regex=True, inplace=True)
First extract Col1 as a Series
df_Series = df['Col1']
Apply str.replace on the Series (plain Series.replace only matches whole cell values) and convert to int
df_Series = df_Series.astype(str).str.replace('"', '', regex=False).astype(int)
then assign the Series back into the df data frame.
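For example (Col1 as in the question):
df['Col1'] = df_Series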
I'm currently trying to convert a column of different datatypes into a series of lists on each row. My code so far is:
def clean_data(path):
    df = pd.read_csv(path)
    data = pd.DataFrame(df)
    data['Export MPANs/MSIDs'].replace(' ', np.nan, inplace=True)
    data.dropna(subset=['Export MPANs/MSIDs'])
    data['Export MPANs/MSIDs'] = data[data['Export MPANs/MSIDs'].str.contains(pat='[A-Za-z]', regex=True) == False]
    data['Export MPANs/MSIDs'] = data['Export MPANs/MSIDs'].replace(to_replace=';', value=',')
    data['Export MPANs/MSIDs'] = data['Export MPANs/MSIDs'].replace(to_replace=' ', value=',')
    return data
This code was meant to:
Identify any semicolons in the string and replace them with commas
Identify any spaces in the string and replace them with commas
Identify any letters and drop those rows from the data set
Here's the raw data:
0 1030085114723;1030085114955
1 10500018724101050001872400
2 MSID: 394837
3 1050002018370
4 1023518907371 1023518908064
Here's the data-set after calling the function:
0 ARLAFD
1 AW_GRA
2 BREFC2
3 CROWNP
4 ESWMID
Essentially, it returns data from the neighbouring column. I'm not entirely sure why this is happening, as I'm not explicitly calling any other column from the data set - any help would be appreciated!
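A likely culprit is the str.contains line: it assigns a whole filtered DataFrame (all columns) to a single column, which is probably how values from a neighbouring column end up there. A sketch of the stated intent, assuming the pandas/numpy imports and the column name from the question:
import numpy as np
import pandas as pd

def clean_data(path):
    data = pd.read_csv(path)
    col = 'Export MPANs/MSIDs'
    data[col] = data[col].replace(' ', np.nan)                # blank cells -> NaN
    data = data.dropna(subset=[col])                          # dropna returns a copy; keep it
    data = data[~data[col].str.contains('[A-Za-z]')].copy()   # drop rows containing letters
    data[col] = (data[col].str.replace(';', ',', regex=False) # semicolons and spaces -> commas
                          .str.replace(' ', ',', regex=False))
    return data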
I have a Dataframe with a column which contains integers and sometimes a string which contains multiple numbers which are comma separated (like "1234567, 89012345, 65425774").
I want to convert that string to an integer list so it's easier to search for specific numbers.
In [1]: import pandas as pd
In [2]: raw_input = "1111111111 666 10069759 9695011 9536391,2261003 9312405 15542804 15956127 8409044 9663061 7104622 3273441 3336156 15542815 15434808 3486259 8469323 7124395 15956159 3319393 15956184 15956217 13035908 3299927"
In [3]: df = pd.DataFrame({'x':raw_input.split()})
In [4]: df.head()
Out[4]:
x
0 1111111111
1 666
2 10069759
3 9695011
4 9536391,2261003
Since your column contains both strings and integers, you probably want something like this:
def to_integers(column_value):
    if not isinstance(column_value, int):
        return [int(v) for v in column_value.split(',')]
    else:
        return column_value
df.loc[:, 'column_name'] = df.loc[:, 'column_name'].apply(to_integers)
Your best solution to cases like this, where a column has one or more values, is splitting the data into multiple columns.
Try something like
ids = df.ID.str.split(',', expand=True)
for i in range(ids.shape[1]):
    df['ID' + str(i + 1)] = ids.iloc[:, i]
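For instance, with a hypothetical ID column, expand=True pads shorter rows with None:
import pandas as pd

df = pd.DataFrame({'ID': ['1,2,3', '4', '5,6']})
ids = df.ID.str.split(',', expand=True)
#    0     1     2
# 0  1     2     3
# 1  4  None  None
# 2  5     6  None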
I have two data frames containing a common variable, 'citation'. I am trying to check whether values of citation in one data frame are also values in the other data frame. The problem is that the variables are in different formats. In one data frame the values appear as:
0154/0924
0022/0320
whereas in the other data frame they appear as:
154/ 924
22/ 320
the differences being: 1) no zeros before the first non-zero digit of the number before the slash, and 2) zeros that appear after the slash but before the first non-zero digit are replaced with spaces, ' ', in the second data frame.
I am trying to use a function and apply it, as shown in the code below, but I am having trouble with the regex and could not find documentation on this exact problem.
import re

def Clean_citation(citation):
    # Search for an opening bracket in the name followed by
    # any characters repeated any number of times
    if re.search(r'\(.*', citation):
        # Extract the position of the beginning of the pattern
        pos = re.search(r'\(.*', citation).start()
        # return the cleaned name
        return citation[:pos]
    else:
        # if no clean-up is needed, return the name unchanged
        return citation

df['citation'] = df['citation'].apply(Clean_citation)
Aside: maybe something relevant - an integer literal with a leading zero, such as 01, is an 'invalid token' in Python 3.
My solution:
def convert_str(strn):
    new_strn = [s.lstrip('0') for s in strn.split('/')]  # strip only leading 0's
    return '/ '.join(new_strn)
So,
convert_str('0154/0924') #would return
'154/ 924'
Which is in the same format as 'citation' in the other data frame. One could use pandas' apply to run convert_str over the 'citation' column of the first dataframe, as shown below.
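Something along the lines of (frame and column names from the question):
df['citation'] = df['citation'].apply(convert_str)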
Solution
You can use x.str.findall(r'(\d+)') where x is either the pandas.DataFrame column or a pandas.Series object. You can run this on both columns to extract the true numbers, with each row becoming a list of two numbers, or an empty list if no number is present.
You can then join the numbers back into a single string:
num_pair_1 = df1.Values.str.findall(r'(\d+)')
num_pair_2 = df2.Values.str.findall(r'(\d+)')
a = num_pair_1.str.join('/')  # for first data column
b = num_pair_2.str.join('/')  # for second data column
And now finally compare a and b, as they should no longer contain any of those additional zeros or spaces.
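For a membership check rather than a row-by-row comparison, isin works on the joined strings (a sketch, reusing a and b from above):
in_other = a.isin(set(b))  # True where a citation from df1 appears anywhere in df2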
# for a series s with the values
s.str.strip().str.findall(r'(\d+)')
# for a column 'Values' in a dataframe df
df.Values.str.findall(r'(\d+)')
Output
0 []
1 [154, 924]
2 [22, 320]
dtype: object
Data
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
ss = """
154/ 924
22/ 320
"""
s = pd.Series(StringIO(ss))
df = pd.DataFrame(s.str.strip(), columns=['Values'])
Output
Values
0
1 154/ 924
2 22/ 320
Here's a pattern that will handle both formats:
pattern = r'[0\s]*(\d+)/[0\s]*(\d+)'
s = pd.Series(['0154/0924', '0022/0320', '154/ 924', '22/ 320'])
s.str.extract(pattern)
Output:
0 1
0 154 924
1 22 320
2 154 924
3 22 320
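Assuming df1 and df2 hold the two citation columns, the same extract normalizes both frames so they can be compared directly:
left = df1['citation'].str.extract(pattern)
right = df2['citation'].str.extract(pattern)
print((left == right).all(axis=1))  # True per row where both numbers agree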
Convert the str to a list with str.split('/') and map to int:
int will remove the leading zeros (and it ignores surrounding whitespace, so ' 924' parses fine)
If the values in the lists are different, df1['citation'] == df2['citation'] will compare as False by row
Requires no regular expressions or list comprehensions
Dataframe setup:
df1 = pd.DataFrame({'citation': ['0154/0924', '0022/0320']})
df2 = pd.DataFrame({'citation': ['154/ 924', '22/ 320']})
print(df1)
citation
0154/0924
0022/0320
print(df2)
citation
154/ 924
22/ 320
Split on / and set type to int:
def fix_citation(x):
    return list(map(int, x.split('/')))
df1['citation'] = df1['citation'].apply(fix_citation)
df2['citation'] = df2['citation'].apply(fix_citation)
print(df1)
citation
[154, 924]
[22, 320]
print(df2)
citation
[154, 924]
[22, 320]
Compare the columns:
df1 == df2
I am merging two CSVs (data frames) using the code below:
import pandas as pd

a = pd.read_csv(file1, dtype={'student_id': str})
df = pd.read_csv(file2)
c = pd.merge(a, df, on='test_id', how='left')
c.to_csv('test1.csv', index=False)
I have the following CSV files
file1:
test_id, student_id
1, 01990
2, 02300
3, 05555
file2:
test_id, result
1, pass
3, fail
after merge
test_id, student_id , result
1, 1990, pass
2, 2300,
3, 5555, fail
If you notice, student_id has a leading 0 and is supposed to be treated as text, but after merging and using the to_csv function it gets converted to numeric and the leading 0 is removed.
How can I keep the column as text even after to_csv?
I think it is the to_csv function which saves it back as numeric.
I added dtype={'student_id': str} while reading the csv, but while saving with to_csv it converts back to numeric.
Caveat: please use merge or join. This answer is provided to give perspective on the flexibility pandas gives you and how many different ways there are to answer the same question.
a = pd.read_csv('file1.csv', converters=dict(student_id=str), skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = pd.concat(
    [d.set_index('test_id') for d in [a, df]],
    axis=1, join='outer'
).reset_index()
It's not dropping the leading zero on the merge, it's dropping it on the read_csv. You can fix this by specifying that column is a string at import time:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
The important part is the dtype parameter. You are telling pandas to import this column as a string. The skipinitialspace parameter is set to True, because the column headers are defined with spaces, so we strip it:
test_id, student_id
^ The student_id starts here, at the space
The final code looks like this:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = a.merge(df, how='left', on='test_id')
With the results dataframe looking like this:
test_id student_id result
0 1 01990 pass
1 2 02300 NaN
2 3 05555 fail
Then when you run to_csv your result should be:
test_id,student_id, result
1,01990, pass
2,02300,
3,05555, fail
Solution with join: first read_csv with the dtype parameter to load student_id as a string, and remove the whitespace with skipinitialspace:
df1 = pd.read_csv(file1, dtype={'student_id': str}, skipinitialspace=True)
df2 = pd.read_csv(file2, skipinitialspace=True)
df = df1.join(df2.set_index('test_id'), on='test_id')
print (df)
test_id student_id result
0 1 01990 pass
1 2 02300 NaN
2 3 05555 fail
a = pd.read_csv(file1, dtype={'test_id': object})
df = pd.read_csv(file2, dtype={'test_id': object})

In [28]: pd.merge(a, df, on='test_id', how='left')
Out[28]:
  test_id student_id result
0      01       1990   pass
1      02       2300    NaN
2     003       5555   fail
I have a file like this:
name|count_dic
name1 |{'x1':123,'x2,bv.':435,'x3':4}
name2|{'x2,bv.':435,'x5':98}
etc.
I am trying to load the data into a dataframe and count the number of keys in count_dic. The problem is that the dict items are separated by commas, but some of the keys also contain commas. I am looking for a way to replace the commas inside keys with '-' so that I can then separate the different key/value pairs in count_dic, something like this:
name|count_dic
name1 |{'x1':123,'x2-bv.':435,'x3':4}
name2|{'x2-bv.':435,'x5':98}
etc.
This is what I have done.
df = pd.read_csv('file', names=['name', 'count_dic'], delimiter='|')
data = json.loads(df.count_dic)
and I get the following error:
TypeError: the JSON object must be str, not 'Series'
Does any body have any suggestions?
You can use ast.literal_eval as a converter when loading the dataframe, since your data appears to be more Python-dict-like than JSON (JSON uses double quotes) - eg:
import pandas as pd
import ast
df = pd.read_csv('file', delimiter='|', converters={'count_dic': ast.literal_eval})
Gives you a DF of:
name count_dic
0 name1 {'x2,bv.': 435, 'x3': 4, 'x1': 123}
1 name2 {'x5': 98, 'x2,bv.': 435}
Since count_dic is actually a dict, then you can apply len to get the number of keys, eg:
df.count_dic.apply(len)
Results in:
0 3
1 2
Name: count_dic, dtype: int64
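And since the question also asks to swap the commas inside keys for '-', a dict comprehension per row does it (a sketch on the df above, where count_dic already holds dicts):
df['count_dic'] = df['count_dic'].apply(
    lambda d: {k.replace(',', '-'): v for k, v in d.items()})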
Once df is loaded as above, but without the converter (so count_dic holds raw strings):
# get a value to play around with
td = df.iloc[0].count_dic
td
# that looks like a dict definition... evaluate it?
eval(td)
eval(td).keys()  # yup!
# apply to the whole df (wrap map in list() so it also works on Python 3)
df.count_dic = list(map(eval, df.count_dic))
# and a hint towards your key-counting
list(map(lambda i: list(i.keys()), df.count_dic))