Create pandas dataframe from string - python

I can easily build a pandas dataframe from a string that contains only one key value pair. For example:
string1 = '{"Country":"USA","Name":"Ryan"}'
dict1 = json.loads(string1)
df=pd.DataFrame([dict1])
print(df)
However, when I use a string that has more than one key value pair :
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
dict2 = json.loads(string2)
I get the following error:
raise JSONDecodeError("Extra data", s, end)
I am aware that string2 is not a valid JSON.
What modifications can I do on string2 programmatically so that I can convert it to a valid JSON and then get a dataframe output which is as follows:
| Country | Name |
|---------|------|
| USA | Ryan |
| Sweden | Sam |
| Brazil | Ralf |

Your error
The error says it all. The JSON is not valid. Where did you get that string2? Are you typing it in yourself?
In that case you should surround the items with brackets [] and separate them with commas ,.
Working example:
import pandas as pd
import json
string2 = '[{"Country":"USA","Name":"Ryan"},{"Country":"Sweden","Name":"Sam"},{"Country":"Brazil","Name":"Ralf"}]'
df = pd.DataFrame(json.loads(string2))
print(df)
Returns:
  Country  Name
0     USA  Ryan
1  Sweden   Sam
2  Brazil  Ralf
Interestingly, if you are extra observant, in the line df=pd.DataFrame([dict1]) you are actually putting your dictionary inside a list with brackets []. This is because pandas DataFrame accepts arrays of data. What you actually have in your first example is a single item, in which case a Series would make more sense: df = pd.Series(dict1).to_frame().T.
Or:
string1 = '[{"Country":"USA","Name":"Ryan"}]' # <--- brackets here to read json as arr
dict1 = json.loads(string1)
df=pd.DataFrame(dict1)
print(df)
And if you understood this, I think it becomes easier to understand that we need , to separate the elements.
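For completeness, the Series route mentioned above can be sketched like this (same string1 as in the question):

```python
import json
import pandas as pd

string1 = '{"Country":"USA","Name":"Ryan"}'
dict1 = json.loads(string1)

# One record -> a Series; transposing its one-column frame gives one row.
df_series = pd.Series(dict1).to_frame().T
# Equivalent: wrap the dict in a list so DataFrame sees an array of records.
df_list = pd.DataFrame([dict1])

print(df_series)
```

Both produce the same one-row frame with Country and Name columns.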
Alternative inputs
But let's say you are creating this dataset yourself, then you could go ahead and do this:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
dict1 = [{"Country":i, "Name":y} for i,y in data] # <-- dictionaries inside arr
df = pd.DataFrame(dict1)
Or:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
df = pd.DataFrame(dict1, columns=['Country','Name'])
Or, what I would prefer, use a CSV structure:
data = '''\
Country,Name
USA,Ryan
Sweden,Sam
Brazil,Ralf'''
import io
df = pd.read_csv(io.StringIO(data))  # pd.compat.StringIO was removed in pandas 1.0

On the off chance that you are getting data from elsewhere in the weird format that you described, the following regular-expression-based substitutions can fix your JSON, and thereafter you can proceed as per @Anton vBR's solution.
import pandas as pd
import json
import re
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
# create dict of substitutions
rd = {'^{' : '[{',   # substitute starting char with [
      '}$' : '}]',   # substitute ending char with ]
      '}{' : '},{'}  # add , in between two dicts

# replace as per dict (dict.iteritems() is Python 2; use items() on Python 3)
for k, v in rd.items():
    string2 = re.sub(k, v, string2)
df = pd.DataFrame(json.loads(string2))
print(df)
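If you'd rather not patch the string at all, the standard library can parse concatenated JSON objects directly: json.JSONDecoder.raw_decode returns each decoded object together with the position where it ended. A sketch, assuming no whitespace between the objects, as in string2:

```python
import json
import pandas as pd

string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'

def parse_concatenated_json(s):
    """Decode back-to-back JSON objects one at a time."""
    decoder = json.JSONDecoder()
    pos, records = 0, []
    while pos < len(s):
        obj, pos = decoder.raw_decode(s, pos)
        records.append(obj)
    return records

df = pd.DataFrame(parse_concatenated_json(string2))
print(df)
```

This avoids regex entirely and also works if the number of objects is unknown in advance.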

Related

How to remove quotes from Numeric data in Python

I have one numeric feature in a data frame but in excel some of the values contain quotes which need to be removed.
Below table is what my data appears to be in Excel file now I want to remove quotes from last 3 rows using python.
| Col1  | Col2 |
|-------|------|
| 123   | A    |
| 456   | B    |
| 789   | C    |
| "123" | D    |
| "456" | E    |
| "789" | F    |
I have used following code in Python:
df["Col1"] = df['Col1'].replace('"', ' ').astype(int)
But above code gives me error message: invalid literal for int() with base 10: '"123"'.
I have also tried strip() function but still it is not working.
If I do not convert the data type and use below code
df["Col1"] = df['Col1'].replace('"', ' ')
Then the code is getting executed without any error however while saving the file into CSV it is still showing quotes.
One way is to use converter function while reading Excel file. Something along those lines (assuming that data provided is in Excel file in columns 'A' and 'B'):
import pandas as pd
def conversion(value):
    if type(value) == int:
        return value
    else:
        return value.strip('"')

df = pd.read_excel('remove_quotes_excel.xlsx', header=None,
                   converters={0: conversion})
# df
     0  1
0  123  A
1  456  B
2  789  C
3  123  D
4  456  E
5  789  F
Both columns are object type, but now (if needed) it is straightforward to convert to int:
df[0] = df[0].astype(int)
You can also do it with replace, passing regex=True so the quote is matched anywhere inside each value rather than against the whole cell:
df.Col1.replace('\"', '', regex=True, inplace=True)
First extract Col1 as a Series:
df_Series = df['Col1']
Apply str.replace on the Series (str.replace edits inside each value; plain Series.replace only matches whole cells):
df_Series = df_Series.astype(str).str.replace('"', '').astype(int)
Then assign the Series back into the df data frame.
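Putting those steps together, a minimal sketch on hypothetical data (values stored as strings; if your Excel column mixes ints and quoted strings, cast with astype(str) first):

```python
import pandas as pd

# Hypothetical frame mirroring the question: some values carry literal quotes.
df = pd.DataFrame({'Col1': ['123', '456', '"123"', '"456"'],
                   'Col2': ['A', 'B', 'D', 'E']})

# str.replace edits inside each value; plain Series.replace only matches
# whole cells, which is why the asker's version left the quotes in place.
df['Col1'] = df['Col1'].str.replace('"', '', regex=False).astype(int)
print(df)
```

After this, Col1 is a clean integer column and saves to CSV without quotes.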

Pandas: Replacing string with hashed string via regex

I have a DataFrame with 29 columns, and need to replace part of a string in some columns with a hashed part of the string.
Example of the column is as follows:
ABSX, PLAN=PLAN_A ;SFFBJD
ADSFJ, PLAN=PLAN_B ;AHJDG
...
...
Code that captures the part of the string:
Test[14] = Test[14].replace({'(?<=PLAN=)(^"]+ ;)' :'hello'}, regex=True)
I want to change the 'hello' to hash of '(?<=PLAN=)(^"]+ ;)' but it doesn't work this way. Wanted to check if anyone did this before without looping line by line of the DataFrame?
Here is what I suggest:
import hashlib
import re
import pandas as pd

# First I reproduce a similar dataset
df = pd.DataFrame({"v1": ["ABSX", "ADSFJ"],
                   "v2": ["PLAN=PLAN_A", "PLAN=PLAN_B"],
                   "v3": ["SFFBJD", "AHJDG"]})

# I search for the regex and create a column matched_el with the hash
r = re.compile(r'=[a-zA-Z_]+')
df["matched_el"] = ["".join(r.findall(w)) for w in df.v2]
df["matched_el"] = df["matched_el"].str.replace("=", "", regex=False)
df["matched_el"] = [hashlib.md5(w.encode()).hexdigest() for w in df.matched_el]

# Then I replace in v2 using this hash
df["v2"] = df["v2"].str.replace(r"=[a-zA-Z_]+", "=", regex=True) + df["matched_el"]
df = df.drop(columns="matched_el")
Here is the result
      v1                                     v2      v3
0   ABSX  PLAN=8d846f78aa0b0debd89fc1faafc4c40f  SFFBJD
1  ADSFJ  PLAN=3b9a3c8184829ca5571cb08c0cf73c8d   AHJDG
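The intermediate column can be skipped entirely: str.replace also accepts a callable replacement when regex=True, so the hash can be computed per match in one pass. A sketch on the same toy v2 column:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"v2": ["PLAN=PLAN_A", "PLAN=PLAN_B"]})

# The lookbehind keeps "PLAN=" in place; the callable receives each
# re.Match and returns the md5 hex digest of the matched plan name.
df["v2"] = df["v2"].str.replace(
    r"(?<==)\w+",
    lambda m: hashlib.md5(m.group().encode()).hexdigest(),
    regex=True,
)
print(df)
```

This applies column-wise, so no row-by-row loop over the DataFrame is needed.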

Replace string in pandas dataframe if it contains specific substring

I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] contains only strings. What I want is to check this column and, if a string contains a specific substring (not really interested where it is in the string, as long as it exists), to replace it. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records though they contain some of the substrings I defined, do not get replaced. Is it a regex related problem?
Thank you in advance.
Create a dictionary mapping each replacement string (the key) to its list of substrings, loop over it, and join each list's values with | for regex OR. Then you can check the column with contains and replace the matched rows with loc:
df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})
a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']
g='Drugs'
z='Weapons'
c='Flowers'
d = {g:a, z:b}
df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k
print (df)
      category new_category
0  sss mdma df        Drugs
1  milit ss aa      Weapons
2        aa ss      Flowers
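A variant of the same idea without the explicit loop: build one boolean mask per category and let numpy.select pick the first match (shortened keyword lists here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['sss mdma df', 'milit ss aa', 'aa ss']})

# Each condition is a boolean mask; np.select takes the first that is True,
# falling back to the default for rows that match nothing.
drugs = df['category'].str.contains('mdma|xanax|kamagra', case=False)
weapons = df['category'].str.contains('weapon|milit|gun', case=False)
df['new_category'] = np.select([drugs, weapons], ['Drugs', 'Weapons'], default='Flowers')
print(df)
```

The order of the condition list decides precedence when a row matches more than one pattern.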

Pandas: Randomize letters in a column

I have a data frame that looks like this:
id1 | id2
----------------------------
ab51c-ee-1a | cga--=%abd21
I am looking to randomize the letters only:
id1 | id2
----------------------------
ge51r-eq-1b | olp--=%cqw21
I think I can do something like this:
import random
import string

newid1 = []
for index, row in df.iterrows():
    new = ''
    for i in row['id1']:
        if i.isalpha():
            new += random.choice(string.ascii_letters)  # string.letters is Python 2
        else:
            new += i
    newid1.append(new)
But it doesn't seem very efficient. Is there a better way?
Let's use apply with the power of str.replace to replace only letters using regex, i.e.:
import string
import random

letters = list(string.ascii_lowercase)

def rand(match):
    # with regex=True the callable replacement receives a re.Match object
    return random.choice(letters)

df.apply(lambda x: x.str.replace('[a-z]', rand, regex=True))
Output :
           id1           id2
0  gp51e-id-1v  jvj--=%glw21
For one specific column use
df['id1'].str.replace('[a-z]', rand, regex=True)
Added by #antonvbr: For future reference, if we want to change upper and lower case we could do this:
letters = dict(u=list(string.ascii_uppercase), l=list(string.ascii_lowercase))
(df['id1'].str.replace('[a-z]', lambda x: random.choice(letters['l']), regex=True)
          .str.replace('[A-Z]', lambda x: random.choice(letters['u']), regex=True))
How about this:
import pandas as pd
from string import ascii_lowercase as al
import random
df = pd.DataFrame({'id1': ['ab51c-ee-1a'],
                   'id2': ['cga--=%abd21']})
al = list(al)
df = df.applymap(lambda x: ''.join([random.choice(al) if i in al else i for i in list(x)]))

How to replace comma with dash using python pandas?

I have a file like this:
name|count_dic
name1 |{'x1':123,'x2,bv.':435,'x3':4}
name2|{'x2,bv.':435,'x5':98}
etc.
I am trying to load the data into a dataframe and count the number of keys in the count_dic. The problem is that the dict items are separated by commas and some of the keys also contain commas. I am looking for a way to replace the commas inside keys with '-' so that I can then separate the different key,value pairs in the count_dic. Something like this:
name|count_dic
name1 |{'x1':123,'x2-bv.':435,'x3':4}
name2|{'x2-bv.':435,'x5':98}
etc.
This is what I have done.
df = pd.read_csv('file' ,names = ['name','count_dic'],delimiter='|')
data = json.loads(df.count_dic)
and I get the following error:
TypeError: the JSON object must be str, not 'Series'
Does any body have any suggestions?
You can use ast.literal_eval as a converter for loading the dataframe, as it appears you have data that's more Python dict-like... JSON uses double quotes - eg:
import pandas as pd
import ast
df = pd.read_csv('file', delimiter='|', converters={'count_dic': ast.literal_eval})
Gives you a DF of:
name count_dic
0 name1 {'x2,bv.': 435, 'x3': 4, 'x1': 123}
1 name2 {'x5': 98, 'x2,bv.': 435}
Since count_dic is actually a dict, then you can apply len to get the number of keys, eg:
df.count_dic.apply(len)
Results in:
0 3
1 2
Name: count_dic, dtype: int64
Once df is defined as above:
# get a value to play around with
td = df.iloc[0].count_dic
td
# that looks like a dict definition... evaluate it?
eval(td)
eval(td).keys()  # yup!
# apply to the whole df (wrap in list() on Python 3, where map is lazy)
df.count_dic = list(map(eval, df.count_dic))
# and a hint towards your key-counting
[list(i.keys()) for i in df.count_dic]
