Replacing Periods in DF's Columns
I was wondering if there was an efficient way to replace periods in a pandas DataFrame without having to iterate through each row and call .replace() on it.
import pandas as pd
df = pd.DataFrame.from_dict({'column':['Sam M.']})
df.column = df.column.replace('.','')
print(df)
Result
column
0 None
Desired Result
column
0 Sam M
df['column'].str.replace('.', '', regex=False)
0 Sam M
Name: column, dtype: object
Because . is a regex special character, put a '\' in front of it and it will work:
Solution:
df['column'].str.replace(r'\.', '', regex=True)
Example:
df['column'] = df['column'].str.replace(r'\.', '', regex=True)
print(df)
Output:
column
0 Sam M
Related
I have a pandas data frame which looks like this.
Column1 Column2 Column3
0 cat 1 C
1 dog 1 A
2 cat 1 B
I want to identify that the two cat rows are the same value repeated, and hence remove one record and preserve only the first. The resulting data frame should only have:
Column1 Column2 Column3
0 cat 1 C
1 dog 1 A
Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first of the duplicates.
If the dataframe is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
import pandas as pd
df = pd.DataFrame({"Column1":["cat", "dog", "cat"],
"Column2":[1,1,1],
"Column3":["C","A","B"]})
df = df.drop_duplicates(subset=['Column1'], keep='first')
print(df)
Inside the drop_duplicates() method of DataFrame you can provide a list of column names to eliminate duplicate records from your data.
The following tested code does the same:
import pandas as pd
df = pd.DataFrame()
df.insert(loc=0,column='Column1',value=['cat', 'toy', 'cat'])
df.insert(loc=1,column='Column2',value=['bat', 'flower', 'bat'])
df.insert(loc=2,column='Column3',value=['xyz', 'abc', 'lmn'])
df = df.drop_duplicates(subset=['Column1','Column2'],keep='first')
print(df)
Inside the subset parameter you can pass other column names as well; by default all the columns of your data are considered. You can provide the keep value as one of the following (illustrated in the sketch after this list):
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
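A minimal sketch of the three keep values on a small frame (the column names here are chosen just for illustration):
import pandas as pd

df = pd.DataFrame({'Column1': ['cat', 'toy', 'cat'],
                   'Column2': ['bat', 'flower', 'bat']})

# keep='first': the second 'cat'/'bat' row is dropped
print(df.drop_duplicates(subset=['Column1', 'Column2'], keep='first'))

# keep='last': the first 'cat'/'bat' row is dropped
print(df.drop_duplicates(subset=['Column1', 'Column2'], keep='last'))

# keep=False: both duplicated rows are dropped, only 'toy'/'flower' remains
print(df.drop_duplicates(subset=['Column1', 'Column2'], keep=False))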
Use drop_duplicates() with a column name:
import pandas as pd
data = pd.read_excel('your_excel_path_goes_here.xlsx')
#print(data)
data.drop_duplicates(subset=["Column1"], keep="first")
keep='first' instructs pandas to keep the first value and remove the other duplicate values.
keep='last' instructs pandas to keep the last value and remove the other duplicate values.
Suppose we want to remove all duplicate values in the Excel sheet; we can specify keep=False.
I imported a dataset that looks like this.
Peak, Trough
0 1857-06-01, 1858-12-01
1 1860-10-01, 1861-06-01
2 1865-04-01, 1867-12-01
3 1869-06-01, 1870-12-01
4 1873-10-01, 1879-03-01
5 1882-03-01, 1885-05-01
6 1887-03-01, 1888-04-01
It is a CSV file, but when I check the .shape, it is
(7, 1)
I thought a CSV file could automatically be separated by its commas, but this one doesn't work.
I want to split this column into two, separated by the comma, and split the column name as well. How can I do that?
Use the sep parameter in read_csv.
It's like:
df = pd.read_csv(path, sep=', ')
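Note that a multi-character separator like ', ' is treated as a regular expression and requires the Python parsing engine, so it may be worth passing engine='python' explicitly. A minimal sketch using an in-memory string just for illustration:
import io
import pandas as pd

data = "Peak, Trough\n1857-06-01, 1858-12-01\n1860-10-01, 1861-06-01\n"

# a multi-character sep falls back to the python engine;
# passing engine='python' explicitly avoids a ParserWarning
df = pd.read_csv(io.StringIO(data), sep=', ', engine='python')
print(df)
         Peak      Trough
0  1857-06-01  1858-12-01
1  1860-10-01  1861-06-01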
Save the data to a text file or CSV and then use read_csv with the parameter skipinitialspace=True and parse_dates to convert the values to datetimes:
df = pd.read_csv('data.txt', skipinitialspace=True, parse_dates=[0,1])
print (df.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
If the data are in Excel in one column, it is possible to use Series.str.split on the first column, convert to datetimes, and finally set the new column names:
df = pd.read_excel('data.xlsx')
df1 = df.iloc[:, 0].str.split(', ', expand=True).apply(pd.to_datetime)
df1.columns = df.columns[0].split(', ')
print (df1.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df1.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
Is it possible to add a new column to a dataframe that comes from a regular expression applied to the text in the first column? How could this be done?
re.compile(r'\S+#\S+')
I would like to use that regexp on the text in the first column of each row and add the outcome of the regexp as another column.
for idx, data_string in df.itertuples(name='first_column'):
    # do things with the data_string here
    # save result in second column
    df.loc[idx, 'second_column'] = result
Maybe I'm getting you wrong, but isn't it just a matter of iterating over all rows and saving the result from your regexp in the second column?
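A minimal, self-contained sketch of that idea, filling in the undefined result placeholder with the '\S+#\S+' pattern from the question (the column names and sample strings here are just assumptions):
import re
import pandas as pd

df = pd.DataFrame({'first_column': ['user#example.com wrote this', 'nothing to match here']})
pattern = re.compile(r'\S+#\S+')

for row in df.itertuples():
    match = pattern.search(row.first_column)
    # store the matched text (or None when there is no match) in the second column
    df.loc[row.Index, 'second_column'] = match.group() if match else None

print(df)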
All columns of a pandas DataFrame must be the same length, so only the rows whose strings match the regex will remain in the dataframe at the end.
You just need to define a function which applies the regex to a string,
then use the apply method on the pandas Series and insert the result into the dataframe at the end.
import re
import numpy as np
import pandas as pd
df = pd.DataFrame({'col_1':['123','12','b23','134'],'col_2':['a','b','c','d']})
df
Out[1]:
col_1 col_2
0 123 a
1 12 b
2 b23 c
3 134 d
def regex(string):
    pattern = re.compile(r"\d{1,2}")
    result = pattern.match(string)
    if result:
        return result.group()
    return np.nan  # if there is no match, return NaN so the row can be dropped later
new_col = df.col_1.apply(regex)
df.insert(loc =2,column='new_col',value=new_col)
df = df.dropna()
df
Out[2]:
col_1 col_2 new_col
0 123 a 12
1 12 b 12
3 134 d 13
I have a df that looks like this:
col1_test col1_test.1
abc NaN
How do I drop only the .1 while keeping all the other characters in the column name?
current code to drop .1:
df.columns = df.columns.str.extract(r'\.?', expand=False)
but this is dropping the other characters in the column name, like the underscore.
New df:
col1_test col1_test
abc NaN
Once this part is set, I will merge the columns using this:
df = df.groupby(level=0, axis=1).first()
This is not recommended because it becomes difficult to index specific columns when there are duplicate headers.
A better solution, since you are performing a groupby anyway, is to pass a callable.
df
col1_test col1_test.1
0 abc NaN
df.groupby(by=lambda x: x.rsplit('.', 1)[0], axis=1).first()
col1_test
0 abc
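If you are on a recent pandas version where groupby(..., axis=1) is deprecated, a roughly equivalent sketch is to group the transposed frame instead:
df.T.groupby(lambda x: x.rsplit('.', 1)[0]).first().T
  col1_test
0       abc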
For reference, you'd remove column suffixes with str.replace:
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
You can also use str.rsplit:
df.columns = df.columns.str.rsplit('.', n=1).str[0]
df
col1_test col1_test
0 abc NaN
I have a pandas dataframe with two columns of strings. I want to identify all rows where the string in the first column (s1) appears within the string in the second column (s2).
So if my columns were:
abc abcd*ef_gh
z1y xxyyzz
I want to keep the first row, but not the second.
The only approach I can think of is to:
iterate through dataframe rows
apply df.str.contains() to s2 using the contents of s1 as the matching pattern
Is there a way to accomplish this that doesn't require iterating over the rows?
It is probably doable (for simple matching only), in a vectorised way, with numpy chararray methods:
In [326]:
print(df)
s1 s2
0 abc abcd*ef_gh
1 z1y xxyyzz
2 aaa aaabbbsss
In [327]:
print(df.loc[np.char.find(df.s2.values.astype(str),
                          df.s1.values.astype(str)) >= 0,
             's1'])
0 abc
2 aaa
Name: s1, dtype: object
The best I could come up with is to use apply instead of manual iterations:
>> df = pd.DataFrame({'x': ['abc', 'xyz'], 'y': ['1234', '12xyz34']})
>> df
x y
0 abc 1234
1 xyz 12xyz34
>> df.x[df.apply(lambda row: row.y.find(row.x) != -1, axis=1)]
1 xyz
Name: x, dtype: object
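Another option that avoids apply is a plain comprehension over the two columns used as a boolean mask (a sketch; it covers simple substring containment only):
>> df.x[[a in b for a, b in zip(df.x, df.y)]]
1 xyz
Name: x, dtype: object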