How to replace comma with dash using Python pandas?

I have a file like this:
name|count_dic
name1 |{'x1':123,'x2,bv.':435,'x3':4}
name2|{'x2,bv.':435,'x5':98}
etc.
I am trying to load the data into a dataframe and count the number of keys in count_dic. The problem is that the dict items are separated with commas and some of the keys also contain commas. I am looking for a way to replace the commas inside keys with '-' so that I can then separate the different key-value pairs in count_dic. Something like this:
name|count_dic
name1 |{'x1':123,'x2-bv.':435,'x3':4}
name2|{'x2-bv.':435,'x5':98}
etc.
This is what I have done.
df = pd.read_csv('file' ,names = ['name','count_dic'],delimiter='|')
data = json.loads(df.count_dic)
and I get the following error:
TypeError: the JSON object must be str, not 'Series'
Does anybody have any suggestions?

You can use ast.literal_eval as a converter when loading the dataframe, since your data is more Python-dict-like than JSON (JSON requires double quotes), e.g.:
import pandas as pd
import ast
df = pd.read_csv('file', delimiter='|', converters={'count_dic': ast.literal_eval})
Gives you a DF of:
name count_dic
0 name1 {'x2,bv.': 435, 'x3': 4, 'x1': 123}
1 name2 {'x5': 98, 'x2,bv.': 435}
Since each count_dic entry is now an actual dict, you can apply len to get the number of keys, e.g.:
df.count_dic.apply(len)
Results in:
0 3
1 2
Name: count_dic, dtype: int64
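If you also want the commas inside the keys replaced with '-' (as in the desired output above), a dict comprehension over the parsed column does it. A minimal self-contained sketch, with the question's sample file inlined via StringIO:

```python
import ast
from io import StringIO

import pandas as pd

# the question's sample file, inlined for a self-contained example
raw = """name|count_dic
name1|{'x1':123,'x2,bv.':435,'x3':4}
name2|{'x2,bv.':435,'x5':98}
"""

df = pd.read_csv(StringIO(raw), delimiter='|',
                 converters={'count_dic': ast.literal_eval})

# rename keys: replace any comma inside a key with '-'
df['count_dic'] = df['count_dic'].apply(
    lambda d: {k.replace(',', '-'): v for k, v in d.items()})

# count the keys per row
df['n_keys'] = df['count_dic'].apply(len)
print(df)
```

Splitting on '|' is safe here because the commas only ever appear inside the dict column.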

Once df is loaded (with count_dic still read in as plain strings):
# get a value to play around with
td = df.iloc[0].count_dic
td
# that looks like a dict definition... evaluate it?
eval(td)
eval(td).keys()  # yup!
# apply to the whole df (.apply instead of map(), which returns a lazy iterator in Python 3)
df.count_dic = df.count_dic.apply(eval)
# and a hint towards your key-counting
df.count_dic.apply(lambda d: list(d.keys()))

Related

How to remove quotes from Numeric data in Python

I have one numeric feature in a data frame, but in Excel some of the values contain quotes which need to be removed.
The table below is what my data looks like in the Excel file; I want to remove the quotes from the last 3 rows using Python.
Col1    Col2
123     A
456     B
789     C
"123"   D
"456"   E
"789"   F
I have used following code in Python:
df["Col1"] = df['Col1'].replace('"', ' ').astype(int)
But above code gives me error message: invalid literal for int() with base 10: '"123"'.
I have also tried the strip() function but it still does not work.
If I do not convert the data type and use the code below
df["Col1"] = df['Col1'].replace('"', ' ')
then the code executes without any error; however, when saving the file to CSV the quotes are still there.
One way is to use a converter function while reading the Excel file, along these lines (assuming the data provided is in an Excel file in columns 'A' and 'B'):
import pandas as pd

def conversion(value):
    if type(value) == int:
        return value
    else:
        return value.strip('"')

df = pd.read_excel('remove_quotes_excel.xlsx', header=None,
                   converters={0: conversion})
# df
0 1
0 123 A
1 456 B
2 789 C
3 123 D
4 456 E
5 789 F
Both columns are object type, but now (if needed) it is straightforward to convert to int:
df[0] = df[0].astype(int)
You can do it using the code below; regex=True makes replace match the quote inside each string (a plain replace only matches whole cell values):
df.Col1.replace('\"', '', regex=True, inplace=True)
First extract Col1 into a Series:
df_Series = df['Col1']
Strip the quotes using the string accessor and convert to int:
df_Series = df_Series.astype(str).str.replace('"', '').astype(int)
Then assign the Series back into the df data frame.
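Putting the pieces together, here is a self-contained sketch of the astype(str)/strip approach, with the question's table rebuilt inline (the mixed int and quoted-string values stand in for what read_excel would produce):

```python
import pandas as pd

# mixed column: plain ints plus strings that carry literal quotes
df = pd.DataFrame({'Col1': [123, 456, 789, '"123"', '"456"', '"789"'],
                   'Col2': ['A', 'B', 'C', 'D', 'E', 'F']})

# cast everything to str, strip the quotes, then convert to int
df['Col1'] = df['Col1'].astype(str).str.strip('"').astype(int)
print(df)
```

Casting to str first avoids the NaN that .str methods produce on non-string entries.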

Python, how to create a table from JSON data - indexing

I am trying to create a table from JSON data. I have already used json.dumps on my data; this is what I am trying to export to the table:
label3 = json.dumps({'class': CLASSES[idx],
                     "confidence": str(round(confidence * 100, 1)) + "%",
                     "startX": str(startX), "startY": str(startY),
                     "EndX": str(endX), "EndY": str(endY),
                     "Timestamp": now.strftime("%d/%m/%Y, %H:%M")})
I have tried with:
val1 = json.loads(label3)
df = pd.DataFrame(val1)
print(df.T)
The system gives me an error that I must pass an index.
And also with:
val = ast.literal_eval(label3)
val1 = json.loads(json.dumps(val))
print(val1)
val2 = val1["class"][0]["confidence"][0]["startX"][0]["startY"][0]["endX"][0]["endY"][0]["Timestamp"][0]
df = pd.DataFrame(data=val2, columns=["class", "confidence", "startX", "startY", "EndX", "EndY", "Timestamp"])
print(df)
When I try this, the error says that string indices must be integers.
How can I create the index?
Thank you,
There are two ways we can tackle this issue.
Do as directed by the error and pass an index to the DataFrame constructor:
pd.DataFrame(val1, index=[0])  # number of rows is 1 in your case
Alternatively, while dumping the data using json.dumps, dump a dictionary that maps each key to a list of values instead of a single value. For example:
json.dumps({'class': [CLASSES[idx]], "confidence": ['some confidence']})
I have shortened your given example. Note that the values are passed as lists of values (even if there is only one value per key).
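Both options can be sketched end to end; the detection values below (class name, confidence, coordinates) are made-up placeholders for the question's variables:

```python
import json

import pandas as pd

# placeholder values standing in for CLASSES[idx], confidence, startX, etc.
label3 = json.dumps({'class': 'dog', 'confidence': '91.3%',
                     'startX': '10', 'startY': '20',
                     'EndX': '110', 'EndY': '220',
                     'Timestamp': '01/01/2024, 12:00'})

# option 1: the values are scalars, so pass an explicit index
df1 = pd.DataFrame(json.loads(label3), index=[0])

# option 2: wrap every value in a list and let pandas infer the single row
wrapped = {k: [v] for k, v in json.loads(label3).items()}
df2 = pd.DataFrame(wrapped)
print(df1)
```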

How to change all columns in csv file to str?

I am working on a script that imports an Excel file, iterates through a column called "Title," and returns False if a certain keyword is present in "Title." The script runs until I get to the part where I want to export another csv file that gives me a separate column. My error is as follows: AttributeError: 'int' object has no attribute 'lower'
Based on this error, I changed df.Title to a string using df['Title'].astype(str), but I get the same error.
import pandas as pd

data = pd.read_excel(r'C:/Users/Downloads/61_MONDAY_PROCESS_9.16.19.xlsx')
df = pd.DataFrame(data, columns=['Date Added', 'Track Item', 'Retailer Item ID', 'UPC',
                                 'Title', 'Manufacturer', 'Brand', 'Client Product Group',
                                 'Category', 'Subcategory', 'Amazon Sub Category',
                                 'Segment', 'Platform'])
df['Title'].astype(str)
df['Retailer Item ID'].astype(str)
excludes = ['chainsaw', 'pail', 'leaf blower', 'HYOUJIN', 'brush', 'dryer', 'genie',
            'Genuine Joe', 'backpack', 'curling iron', 'dog', 'cat', 'wig', 'animal',
            'dryer', ':', 'tea', 'Adidas', 'Fila', 'Reebok', 'Puma', 'Nike', 'basket',
            'extension', 'extensions', 'batteries', 'battery', '[EXPLICIT]']
my_excludes = [set(x.lower().split()) for x in excludes]
match_titles = [e for e in df.Title.astype(str)
                if any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]

def is_match(title, excludes=my_excludes):
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False
This is the part that returns the error:
df['match_titles'] = df['Title'].apply(is_match)
result = df[df['match_titles']]['Retailer Item ID']
print(df)
df.to_csv('Asin_List(9.18.19).csv',index=False)
Use the following code to import your file:
data = pd.read_excel(r'C:/Users/Downloads/61_MONDAY_PROCESS_9.16.19.xlsx',
                     dtype=str)
For pandas.read_excel you can pass the optional parameter dtype.
You can also use it to pass different data types for different columns, e.g.:
dtype={'Retailer Item ID': int, 'Title': str}
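The other thing worth noting is that astype(str) returns a copy, so the result has to be assigned back to the column. A minimal sketch (the Title values here are invented stand-ins for the spreadsheet data):

```python
import pandas as pd

# invented rows: one Title is an int, which is what triggers the .lower() error
df = pd.DataFrame({'Title': ['Leaf Blower 3000', 12345, 'Cat Wig']})

# assign the converted column back; a bare df['Title'].astype(str) is discarded
df['Title'] = df['Title'].astype(str)

my_excludes = [set(x.lower().split()) for x in ['leaf blower', 'wig']]

def is_match(title, excludes=my_excludes):
    # True when every word of some exclude phrase appears in the title
    return any(keywords.issubset(title.lower().split()) for keywords in excludes)

df['match_titles'] = df['Title'].apply(is_match)
print(df)
```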
At the line where you wrote
match_titles = [e for e in df.Title.astype(str)
                if any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
Python can still hand you an integer as the variable e. This happens because df['Title'].astype(str) returns a new Series rather than changing the column in place, so unless you assign the result back, the original column keeps its integer entries. If you want to iterate through the column, try
match_titles = [e for e in df.iloc[:, 4]
                if any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
df.iloc[:, 4] returns the fifth column of the dataframe df (positions are zero-based), which is the Title column you want. If this doesn't work, try the iteritems() function.
The main idea is that calling astype on df[column] without storing the result leaves the column's contents unchanged.

Dataframe to dictionary, values came out scrambled

I have a dataframe that contains two columns that I would like to convert into a dictionary to use as a map.
I have tried multiple ways of converting, but my dictionary values always come out in the wrong order.
My python version is 3 and Pandas version is 0.24.2.
This is what the first few rows of my dataframe looks like:
geozip.head()
Out[30]:
Geoid ZIP
0 100100 36276
1 100124 36310
2 100460 35005
3 100460 35062
4 100460 35214
I would like my dictionary to look like this:
{100100: 36276,
100124: 36310,
100460: 35005,
100460: 35062,
100460: 35214,...}
But instead my outputs came up with the wrong order for the values.
{100100: 98520,
100124: 36310,
100460: 57520,
100484: 35540,
100676: 19018,
100820: 57311,
100988: 15483,
101132: 36861,...}
I tried this first but the dictionary came out unordered:
geozipmap = geozip.set_index('Geoid')['ZIP'].to_dict()
Then I tried converting the two columns into lists first and converting those to a dictionary, but the same problem occurred:
geoid = geozip.Geoid.tolist()
zipcode = geozip.ZIP.tolist()
geozipmap = dict(zip(geoid, zipcode))
I tried converting to OrderedDict and that didn't work either.
Then I've tried:
geozipmap = {k: v for k, v in zip(geoid, zipcode)}
I've also tried:
geozipmap = {}
for index, g in enumerate(geoid):
    geozipmap[geoid[index]] = zipcode[index]
I've also tried the answers suggested:
panda dataframe to ordered dictionary
None of these work. I'm really not sure what is going on.
Try defaultdict; if the same key has multiple values, you can collect them in a list:
from collections import defaultdict

df = pd.DataFrame(data={"Geoid": [100100, 100124, 100460, 100460, 100460],
                        "ZIP": [36276, 36310, 35005, 35062, 35214]})
data_dict = defaultdict(list)
for k, v in zip(df['Geoid'], df['ZIP']):
    data_dict[k].append(v)
print(data_dict)
defaultdict(<class 'list'>, {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]})
Will this work for you?
dfG = df['Geoid'].values
dfZ = df['ZIP'].values
for g, z in zip(dfG, dfZ):
    print(str(g) + ':' + str(z))
This gives the output as below (but the values are strings)
100100:36276
100124:36310
100460:35005
100460:35062
100460:35214
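Since a plain dict cannot hold the duplicate Geoid keys shown in the desired output, a groupby that collects the ZIPs per Geoid into lists may be closer to what's wanted; a sketch on the sample rows:

```python
import pandas as pd

geozip = pd.DataFrame({'Geoid': [100100, 100124, 100460, 100460, 100460],
                       'ZIP': [36276, 36310, 35005, 35062, 35214]})

# one list of ZIPs per Geoid; duplicate Geoids collapse into a single entry
geozipmap = geozip.groupby('Geoid')['ZIP'].apply(list).to_dict()
print(geozipmap)
```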

turning a collections counter into dictionary

I have a collection outcome resulting from the function:
Counter(df.email_address)
it returns each individual email address with the count of its repetitions.
Counter({nan: 1618, 'store@kiddicare.com': 265, 'testorders@worldstores.co.uk': 1})
what I want to do is to use it as if it was a dictionary and create a pandas dataframe out of it with two columns one for email addresses and one for the value associated.
I tried with:
dfr = repeaters.from_dict(repeaters, orient='index')
but i got the following error:
AttributeError: 'Counter' object has no attribute 'from_dict'
It makes me think that Counter is not a dictionary, even though it looks like one. Any idea on how to turn it into a df?
d = {}
cnt = Counter(df.email_address)
for key, value in cnt.items():
    d[key] = value
EDIT
Or, as @Trif Nefzger suggested:
d = dict(Counter(df.email_address))
As ajcr wrote in the comment, from_dict is a method that belongs to DataFrame, so you can write the following to achieve your goal:
from collections import Counter
import pandas as pd

repeaters = Counter({"nan": 1618, 'store@kiddicare.com': 265, 'testorders@worldstores.co.uk': 1})
dfr = pd.DataFrame.from_dict(repeaters, orient='index')
print(dfr)
Output:
testorders@worldstores.co.uk 1
nan 1618
store@kiddicare.com 265
Alternatively you could use pd.Series.value_counts, which returns a Series object.
df.email_address.value_counts(dropna=False)
Sample output:
b@y.com 2
a@x.com 1
NaN 1
dtype: int64
This is not exactly what you asked for but looks like what you'd like to achieve.
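If the goal is the two-column dataframe described in the question, value_counts can be reshaped directly; a sketch with a small made-up email column:

```python
import pandas as pd

# made-up sample column with a repeated address and a missing value
df = pd.DataFrame({'email_address': ['b@y.com', 'b@y.com', 'a@x.com', None]})

counts = df['email_address'].value_counts(dropna=False)
# move the addresses out of the index into a regular column
dfr = counts.rename_axis('email_address').reset_index(name='count')
print(dfr)
```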
