I have a large dataframe that has certain cells which have values like: <25-27>. Is there a simple way to convert these into something like:25|26|27 ?
Source data frame:
import pandas as pd
import numpy as np
f = {'function':['2','<25-27>','200'],'CP':['<31-33>','210','4001']}
filter = pd.DataFrame(data=f)
filter
Output Required
output = {'function':['2','25|26|27','200'],'CP':['31|32|33','210','4001']}
op = pd.DataFrame(data=output)
op
thanks a lot !
import re
def convert_range(x):
m = re.match("<([0-9]+)+\-([0-9]+)>", x)
if m is None:
return x
s1, s2 = m.groups()
return "|".join([str(s) for s in range(int(s1), int(s2)+1)])
op = filter.applymap(convert_range)
Related
I apologize if this is a rudimentary question. I feel like it should be easy but I cannot figure it out. I have the code that is listed below that essentially looks at two columns in a CSV file and matches up job titles that have a similarity of 0.7. To do this, I use difflib.get_close_matches. However, the output is multiple single lines and whenever I try to convert to a DataFrame, every single line is its own DataFrame and I cannot figure out how to merge/concat them. All code, as well as current and desired outputs are below. Any help would be much appreciated.
Current Code is:
import pandas as pd
import difflib
df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n=3
cutoff = 0.7
for aList in aLists:
best = difflib.get_close_matches(aList, bLists, n, cutoff)
print(best)
Current Output is:
['SW Engineer']
['Manu Engineer']
[]
['IT Help']
Desired Output is:
Output
0 SW Engineer
1 Manu Engineer
2 (blank)
3 IT Help
The table I am attempting to do this one is:
Any help would be greatly appreciated!
Here is a simple way to achieve this.I have converted first to a string.Then the first and last brackets are removed from that string and then is appended to a global list.
import pandas as pd
import difflib
import numpy as np
df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7
best = []
for aList in aLists:
temp = difflib.get_close_matches(aList, bLists, n, cutoff)
temp = str(temp)
strippedString = temp.lstrip("[").rstrip("]")
# print(temp)
best.append(strippedString)
print(best)
Output
[
"'SW Engineer'",
"'Manu Engineer'",
'',
"'IT Help'"
]
Here is another better way to achieve this.
You can simply use numpy to concatenate multiple arrays into single one.And then you can convert it to normal array if you want.
import pandas as pd
import difflib
import numpy as np
df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7
best = []
for aList in aLists:
temp = difflib.get_close_matches(aList, bLists, n, cutoff)
best.append(temp)
# print(best)
# Use concatenate() to join two arrays
combinedNumpyArray = np.concatenate(best)
#Converting numpy array to normal array
normalArray = combinedNumpyArray.tolist()
print(normalArray)
Output
['SW Engineer', 'Manu Engineer', 'IT Help']
Thanks
You could use Panda's .apply() to run your function on each entry. This could then either be added as a new column or a new dataframe created.
For example:
import pandas as pd
import difflib
def get_best_match(word):
matches = difflib.get_close_matches(word, JT, n, cutoff)
return matches[0] if matches else None
df = pd.read_csv('name.csv')
JT = df['JT']
n = 3
cutoff = 0.7
df['Output'] = df['JTs'].apply(get_best_match)
Or for a new dataframe:
df_output = pd.DataFrame({'Output' : df['JTs'].apply(get_best_match)})
Giving you:
JTs JT Output
0 Software Engineer Manu Engineer SW Engineer
1 Manufacturing Engineer SW Engineer Manu Engineer
2 Human Resource Manager IT Help None
3 IT Help Desk f IT Help
Or:
Output
0 SW Engineer
1 Manu Engineer
2 None
3 IT Help
import pandas as pd
import numpy as np
data = np.random.randint(10,100,15).reshape(5,3)
df = pd.DataFrame(data,index = ["a","c","e","f","h"],columns = ["column1","column2","column3"])
df = df.reindex([["a","b","c","d","e","f","g","h"]])
result = df
result = df.drop("column1",axis = 1)
print(result)
Ok guys.I solved the problem.The mistake was the square brackets.In df.reindex part,I used 2 square brackets.But I should have use one.So thank you so much for your helps.
I have a dataframe, where I want to replace some words to others, based on another dataframe:
import pandas as pd
dist = pd.DataFrame([["21","apple"],["25","balana"],["30","lemon"]],columns=["idx","item"])
a = pd.DataFrame(["apple - banana"],columns=["pf"])
a['pf'] = a['pf'].replace(dist["item"], dist["idx"], regex=True)
print(a)
How can I do that? (this does not work in its current form)
You can try this:
dist = pd.DataFrame([["21","apple"],["25","balana"],["30","lemon"]],columns= ["idx","item"])
a = pd.DataFrame(["apple - banana"],columns=["pf"])
b = dict(zip(dist["idx"], dist["item"]))
def replace_items(token):
for key, value in b.items():
token = token.replace(value, key)
return token
a["pf"] = a["pf"].apply(replace_items)
Please be aware that the banana in your dist dataframe is balana. Not sure if this is intended...
Converting the translation table to dictionary seems to solve the problem:
import pandas as pd
dist = pd.DataFrame([["apple","21"],["banana","25"],["lemon","30"]],columns=["item","idx"])
dist = dist.set_index('item')['idx'].to_dict()
a = pd.DataFrame(["apple - banana"],columns=["pf"])
a['pf'] = a['pf'].replace(dist, regex=True)
print(a)
I am trying to replicate a simple Technical-Analysis indicator using xlwings. However, the list/data seems not to be able to read Excel values. Below is the code
import pandas as pd
import datetime as dt
import numpy as np
#xw.func
def EMA(df, n):
EMA = pd.Series(pd.ewma(df['Close'], span = n, min_periods = n - 1), name = 'EMA_' + str(n))
df = df.join(EMA)
return df
When I enter a list of excel data : EMA = ({1,2,3,4,5}, 5}, I get the following error message
TypeError: list indices must be integers, not str EMA = pd.Series(pd.ewma(df['Close'], span = n, min_periods = n - 1), name = 'EMA_' + str(n))
(Expert) help much appreciated! Thanks.
EMA() expects a DataFrame df and a scalar n, and it returns the EMA in a separate column in the source DataFrame. You are passing a simple list of values, this is not supposed to work.
Construct a DataFrame and assign the values to the Close column:
v = range(100) # use your list of values instead
df = pd.DataFrame(v, columns=['Close'])
Call EMA() with this DataFrame:
EMA(df, 5)
I have a dataframe and am trying to get the closest matches using mahalanobis distance across three categories, like:
from io import StringIO
from sklearn import metrics
import pandas as pd
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
d = metrics.pairwise.pairwise_distances(df[stats].as_matrix(),
metric='mahalanobis')
print(df)
print(d)
Where that pid column is a unique identifier.
What I need to do is take that ndarray returned by the pairwise_distances call and update the original dataframe so each row has some kind of list of its closest N matches (so pid 0 might have an ordered list by distance of like 2, 1, 5, 3, 4 (or whatever it actually is), but I'm totally stumped how this is done in python.
from io import StringIO
from sklearn import metrics
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
dist = metrics.pairwise.pairwise_distances(df[stats].as_matrix(),
metric='mahalanobis')
dist = pd.DataFrame(dist)
ranks = np.argsort(dist, axis=1)
df["rankcol"] = ranks.apply(lambda row: ','.join(map(str, row)), axis=1)
df