Regex Sub and Pandas - python

I am struggling to replace a string in a pandas cell with data from a dictionary. I have a pandas DataFrame:
import pandas as pd
import numpy as np
import re
f = {'GAAP':['<1>','2','3','4'],'CP':['5','6','<7>','8']}
filter = pd.DataFrame(data=f)
filter
and a dictionary:
d = {'GAAP':['100','101'],'CP':['500','501','502']}
d
I am trying to get the following output:
op = {'GAAP':['100|101','2','3','4'],'CP':['5','6','500|501|502','8']}
op = pd.DataFrame(data=op)
op
I tried something like:
def rep1(fr,di):
    op=re.sub(r'\<.*?\>',fr,di)
    return(op)
a='|'.join(d['GAAP'])
op=rep1(filter['GAAP'],a)
op
but I get an error saying Series objects are mutable and cannot be hashed. Any suggestions on what I am doing wrong?

Let us try pd.to_numeric to convert the <> cells to NaN, then fillna with the joined dictionary values:
filter=filter.apply(pd.to_numeric,errors='coerce').fillna(pd.Series(d).str.join('|'))
GAAP CP
0 100|101 5
1 2 6
2 3 500|501|502
3 4 8
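To see why the fillna step works: filling a DataFrame with a Series keyed by column names fills each column's NaN with that column's value. A minimal sketch of just that piece (note that pd.to_numeric also turns the surviving strings like '2' into numbers, so the resulting dtypes differ from an all-string frame):
# pd.Series(d) has the column names as its index, and .str.join('|')
# collapses each list into one pipe-joined string.
fills = pd.Series(d).str.join('|')
print(fills)
# GAAP        100|101
# CP      500|501|502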

One way to go about it is using replace: pair the regexes that match the <> placeholders with their replacements from the dictionary.
outcome = filter.replace({'GAAP': r"\<\d\>", 'CP': r"\<\d\>"},
                         {'GAAP': "|".join(d['GAAP']), 'CP': "|".join(d['CP'])},
                         regex=True)
GAAP CP
0 100|101 5
1 2 6
2 3 500|501|502
3 4 8
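If the dictionary keys always match the column names, both mappings can be built from the dictionary itself. A minimal sketch, assuming every placeholder looks like <1>, <7>, and so on:
# Build the pattern and replacement dicts generically from d.
patterns = {col: r"\<\d+\>" for col in d}
repls = {col: "|".join(vals) for col, vals in d.items()}
outcome = filter.replace(patterns, repls, regex=True)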

Related

How to convert the first column of a dataframe into its headers

I have dataframe df:
0
0 a
1 b
2 c
3 d
4 e
O/P should be:
a b c d e
0
1
2
3
4
5
I want the column containing (a, b, c, d, e) as the header of my dataframe.
Could anyone help?
If your dataframe is a pandas DataFrame named df, try solving it with pandas:
First convert the initial df content to a list, then create a new dataframe defining its columns with that list.
import pandas as pd
cols = df[0].tolist()  # df[0] gets the content of the first column
dfSolved = pd.DataFrame([], columns=cols)
You may provide more details, like the index and values of the expected output or the operation you want to perform, so that we could give a solution specific to your case.
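A minimal runnable sketch of that idea, assuming df holds the single column from the question and the desired index runs 0 to 5:
import pandas as pd
# Hypothetical reconstruction of the question's frame.
df = pd.DataFrame({0: ['a', 'b', 'c', 'd', 'e']})
cols = df[0].tolist()                                  # ['a', 'b', 'c', 'd', 'e']
dfSolved = pd.DataFrame(index=range(6), columns=cols)  # empty frame, index 0..5
print(dfSolved)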
Here is the solution:
import pandas as pd
import io
import numpy as np
data_string = """ columns_name
0 a
1 b
2 c
3 d
4 e
"""
df = pd.read_csv(io.StringIO(data_string), sep=r'\s+')
# Solution
df_result = pd.DataFrame(data=[[np.nan]*5],
                         columns=df['columns_name'].tolist())
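For reference, under those assumptions df_result is a one-row frame whose headers come from the original column:
print(df_result)
#     a   b   c   d   e
# 0 NaN NaN NaN NaN NaN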

How to remove all occurrences of a character in a dataframe?

I have a dataframe with multiple columns and most have special characters like $, % or ^ and so on... How can I delete these characters throughout the entire data frame? I only know how to delete by column, for example:
df['Column'] = df['Column'].str.replace(r'^\d+','')
Just noticed that pandas.DataFrame.replace does not work with special characters like $, %, ^, etc. So you can use the following snippet to get rid of these special characters from the whole dataframe. We need to make sure a given column is of string type before applying str.replace.
import pandas as pd
from pandas.api.types import is_string_dtype
f = pd.DataFrame({'A':[1,2,3],
                  'B':[4,5,6],
                  'C':['f;','d:','sda$sd'],
                  'D':['s%','d;','d^p'],
                  'E':[5,3,6],
                  'F':[7,4,3]})
f looks as follows:
A B C D E F
0 1 4 f; s% 5 7
1 2 5 d: d; 3 4
2 3 6 sda$sd d^p 6 3
Now use a loop to replace the strings:
for col in f.columns:
    if is_string_dtype(f[col]):
        f[col] = f[col].str.replace(r'[^A-Za-z0-9\-\s]+', '', regex=True)
Output:
A B C D E F
0 1 4 f s 5 7
1 2 5 d d 3 4
2 3 6 sdasd dp 6 3
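An equivalent sketch without the explicit loop, selecting the string (object-dtype) columns in one step and applying the same regex:
str_cols = f.select_dtypes(include='object').columns
f[str_cols] = f[str_cols].apply(
    lambda s: s.str.replace(r'[^A-Za-z0-9\-\s]+', '', regex=True))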
UPDATE:
Pandas version 0.24.1 does not replace some special characters, but versions 0.23.4 and 0.25.1 do. So if you have one of those working versions, you can simply use pandas.DataFrame.replace to delete the special characters as follows. Make sure to escape these characters with \.
f = f.replace({r'\$':'', r'\^':'', '%':''}, regex=True)
This results in the same output as above.
I think you want:
pandas.DataFrame.replace(to_replace, value)
The parameters accept regexes, and the call covers the whole df. Hope this helps.
Here's the documentation on this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace
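A minimal sketch of that approach in recent pandas versions, removing $, % and ^ everywhere with a single character class:
# One regex character class, applied across the whole frame;
# numeric columns are left untouched.
cleaned = f.replace(r'[$%^]', '', regex=True)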

See if a value exists in a DataFrame

In Python to check if a value is in a list you can simply do the following:
>>> 9 in [1,2,3,6,9]
True
I would like to do the same for a Pandas DataFrame but unfortunately Pandas does not recognise that notation:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3,4],[5,6,7,8]],columns=["a","b","c","d"])
   a  b  c  d
0  1  2  3  4
1  5  6  7  8
>>> 7 in df
False
How would I achieve this using Pandas DataFrame without iterating through each column/row or anything complicated?
Basically you have to check the matrix without the schema, so:
7 in df.values
x in df checks if x is in the columns:
for x in df:
    print(x, end=' ')
out: a b c d
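If you need a single boolean instead, DataFrame.isin works the same way:
# isin gives an elementwise mask; chaining .any() twice
# collapses it to one boolean.
df.isin([7]).any().any()  # True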

Python pandas Reading specific values from HDF5 files using read_hdf and HDFStore.select

So I created an HDF5 file with a simple dataset that looks like this:
>>> pd.read_hdf('STORAGE2.h5', 'table')
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Using this script
import pandas as pd
import scipy as sp
from pandas.io.pytables import Term
store = pd.HDFStore('STORAGE2.h5')
df_tl = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))
df_tl.to_hdf('STORAGE2.h5','table',append=True)
I know I can select columns using
x = pd.read_hdf('STORAGE2.h5', 'table', columns=['A'])
or
x = store.select('table', where = 'columns=A')
How would I select all values in column 'A' that equal 3, or specific indices with strings in column 'A' like 'foo'? With pandas dataframes I would use df[df["A"]==3] or df[df["A"]=='foo'].
Also, does it make a difference in efficiency whether I use read_hdf() or store.select()?
You need to specify data_columns= (you can use True as well to make all columns searchable)
(FYI, the mode='w' will start the file over, and is just for my example)
In [50]: df_tl.to_hdf('STORAGE2.h5','table',append=True,mode='w',data_columns=['A'])
In [51]: pd.read_hdf('STORAGE2.h5','table',where='A>2')
Out[51]:
A B
3 3 3
4 4 4
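A minimal sketch of both query styles, assuming the table was written with data_columns=['A'] as above. On efficiency: read_hdf with a file path opens and closes the store on each call, while keeping an HDFStore open avoids that overhead for repeated queries.
with pd.HDFStore('STORAGE2.h5') as store:
    rows_eq_3 = store.select('table', where='A == 3')
    # For a string-typed column the literal is quoted: where='A == "foo"'
print(rows_eq_3)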
