Python: replace integers with other integers only at a particular location

Hi, I have a file which contains the data shown below. I want to replace the integers that occur after 'A' (fourth column), namely 2, 3, 15, 25, 115, 1215, with other integers which I have in a dictionary (as key/value pairs). The number of white spaces after 'A' ranges from 0 to 3. I tried the str.replace(old, new) method in Python, but it replaces every instance of the integer in the file.
This is the replacement I want to make inside the file:
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56
Please suggest some ways to do it.

replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
s="""Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56"""
res = []
for line in s.splitlines():
    spl = line.split()
    if len(spl) == 8:      # 'A' and the number are separate fields
        ints = map(int, spl[-3:])
        res.append(" ".join(spl[:-3] + [str(replacements.get(k, k)) for k in ints]))
    else:                  # 'A' is glued to the number, e.g. 'A1215'
        spl[-3] = spl[-3].replace("A", "")
        ints = map(int, spl[-3:])
        res.append(" ".join(spl[:-3] + ["A"] + [str(replacements.get(k, k)) for k in ints]))
print(res)
['Name 1 N ASHA A 0 35 23', 'Name 2 R MONA A 5 30 56', 'Name 3 P TERY A 7 23 32', 'Name 4 Q JACK A 30 56 30', 'Name 5 D TOM A 120 57 45', 'Name 3 P SEN A 1220 45 56']
Not sure if you want to use the data or write it back to a file, but if your file looks like your example, this will replace the digits using the dict. If the length of the split is different, we know we have a number and an 'A' without a space, so we strip the 'A' before replacing.
There will also always be a space after 'A' in the result, so if you write it to a file and have to work on that file again, it will be a lot easier.
I would just remove the map and use strings as keys and values unless you actually need ints.
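For example, a sketch of that string-key variant of the first loop, reusing the s string defined above (it replaces all three trailing numbers, just like the first snippet):
replacements = {"2": "0", "3": "5", "15": "7", "25": "30", "115": "120", "1215": "1220"}
res = []
for line in s.splitlines():
    spl = line.split()
    if len(spl) != 8:                      # 'A' glued to the number, e.g. 'A1215'
        spl[4:5] = ["A", spl[4].lstrip("A")]
    res.append(" ".join(spl[:5] + [replacements.get(k, k) for k in spl[5:]]))
print(res)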
If you want to keep the exact same format and only want to change the first number:
replacements = {"2":"0","3":"5","15":"7","25":"30","115":"120","1215":"1220"}
s="""Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56"""
res = []
for line in s.splitlines():
    spl = line.rsplit(None, 3)
    end = spl[-3:]
    if end[0][0] == "A":
        k = end[0][1:]
    else:
        k = end[0]
    # count=1 so only the first occurrence (the number after 'A') is replaced
    res.append(line.replace(k, replacements.get(k, k), 1))
print(res)
['Name 1 N ASHA A 0 35 23', 'Name 2 R MONA A 5 25 56', 'Name 3 P TERY A 7 23 32', 'Name 4 Q JACK A 30 56 25', 'Name 5 D TOM A 120 57 45', 'Name 3 P SEN A1220 45 56']

Edited based on additional information regarding all the other numbers.
This is entirely dependent on the specific characteristics of your file that you mention in your comments.
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
with open('input.txt', 'r') as fin, open('output.txt', 'w') as fout:
    pos_a = 22  # 0-indexed position of 'A' in every line
    for line in fin:
        left_side = line[:pos_a + 1]
        num_to_convert = line[pos_a + 1: pos_a + 5]
        right_side = line[pos_a + 5:]
        # String formatting to preserve padding as per original file
        newline = '{}{:>4}{}'.format(left_side,
                                     replacements[int(num_to_convert)],
                                     right_side)
        fout.write(newline)
If there's a possibility that one of the values in the column will not be in your replacements dict, and you want to keep that value unchanged, then instead of replacements[int(num_to_convert)], use replacements.get(int(num_to_convert), num_to_convert).
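For instance (a small sketch with a made-up field value that has no dict entry):
replacements = {2: 0, 3: 5, 15: 7, 25: 30, 115: 120, 1215: 1220}
num_to_convert = ' 999'  # hypothetical 4-character field not in the dict
value = replacements.get(int(num_to_convert), num_to_convert)
print('{:>4}'.format(value))  # prints ' 999', i.e. the field is left unchanged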

Regex
^[\w\d\s]{23}([\d\s]{1,4}).*$
Note: this is essentially fixed-length parsing.
Python
import re
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
searchString = "Name 1 N ASHA A 2 35 23 "
replace_search = re.search(r'^[\w\d\s]{23}([\d\s]{1,4}).*$', searchString, re.IGNORECASE)
if replace_search:
    result = replace_search.group(1)
    convert_result = int(result)
    dictionary_lookup = int(replacements[convert_result])
    replace_result = '% 4d' % dictionary_lookup
    regex_replace = r"\g<1>" + replace_result + r"\g<3>"
    line = re.sub(r"^([\w\d\s]{23})([\d\s]{1,4})(.*)$", regex_replace, searchString)
    print(line)
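As an alternative to fixed-width slicing, the number after the standalone 'A' can also be targeted with a replacement function. A sketch, assuming (as described in the question) that the number to change always follows 'A' with 0 to 3 spaces in between:
import re

replacements = {2: 0, 3: 5, 15: 7, 25: 30, 115: 120, 1215: 1220}

def repl(match):
    # group(1) is 'A' plus any spaces, group(2) is the number right after it.
    # Assumes no name contains a capital 'A' immediately followed by digits.
    number = int(match.group(2))
    return match.group(1) + str(replacements.get(number, number))

line = "Name 3 P SEN A1215 45 56"
print(re.sub(r"(\bA\s{0,3})(\d{1,4})", repl, line, count=1))
# -> Name 3 P SEN A1220 45 56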

Related

how to sort data inside a text file in ascending order in python?

I have a text file containing pairs of an integer and a string value. I want to sort them in ascending order, but I am not getting it correct.
TextFile.txt
87,Toronto
45,USA
45,PAKISTAN
33,India
38,Jerry
30,Tom
23,Jim
7,Love
38,Hate
30,Stress
My code
def sort_outputFiles():
    print('********* Sorting now **************')
    my_file = open("TextFile.txt", "r", encoding='utf-8')
    data = my_file.read()
    data_into_list = data.split("\n")
    my_file.close()
    score = []
    links = []
    for d in data_into_list:
        if d != '':
            s = d.split(',')
            score.append(s[0])
            links.append(s[1])
    n = len(score)
    for i in range(n):
        for j in range(0, n-i-1):
            if score[j] < score[j+1]:
                score[j], score[j+1] = score[j+1], score[j]
                links[j], links[j+1] = links[j+1], links[j]
    for l, s in zip(links, score):
        print(l, " ", s)
My output
********* Sorting now **************
Toronto 87
Love 7
USA 45
PAKISTAN 45
Jerry 38
Hate 38
India 33
Tom 30
Stress 30
Jim 23
Expected output
********* Sorting now **************
Toronto 87
USA 45
PAKISTAN 45
Jerry 38
Hate 38
India 33
Tom 30
Stress 30
Jim 23
Love 7
The error is at line 2 of the output ("Love 7"); it should be last.
You are comparing strings, not numbers.
In a dictionary (the physical book), words are sorted by whichever has the "lowest" first letter; if it's a tie, we pick the lowest second letter, and so on. This is called lexicographical order.
So the string "aaaa" < "ab", and likewise "1111" < "12".
To fix this, you have to convert the string to a number, using int(s[0]) instead of s[0] in the score.append call.
This will make 1111 > 12, and your code will give the correct result.
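A quick illustration of the difference, and the one-line change:
# Strings are compared character by character, numbers by value.
print("1111" < "12")             # True  ('1' == '1', then '1' < '2')
print(int("1111") < int("12"))   # False (1111 is numerically larger)

# The fix in the loop above:
# score.append(int(s[0]))   instead of   score.append(s[0])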
You can use Python's sorted() function to sort the list. If you use the key parameter, you can specify a custom sorting behaviour, and if you use the reverse parameter, you can sort in descending order.
Also, you can use the csv module to make reading your input file easier.
import csv
with open("TextFile.txt", "r", encoding="utf-8", newline="") as csvfile:
    lines = list(csv.reader(csvfile))
for line in sorted(lines, key=lambda l: int(l[0]), reverse=True):
    print(f"{line[1]} {line[0]}")
Output:
Toronto 87
USA 45
PAKISTAN 45
Jerry 38
Hate 38
India 33
Tom 30
Stress 30
Jim 23
Love 7
Not sure about your intentions, but a compact implementation would look like this:
with open('textfile.txt', 'r') as f:
    d = [l.split(',') for l in f.readlines()]
d = [(dd[1][:-1], int(dd[0])) for dd in d]
d_sorted = sorted(d, key=lambda x: x[1], reverse=True)
print(d_sorted)
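A slightly more defensive sketch of the same idea: strip whitespace instead of slicing off the last character, which assumes every line ends with a newline.
with open('textfile.txt', 'r') as f:
    d = [(name.strip(), int(score)) for score, name in
         (line.split(',', 1) for line in f if line.strip())]
d_sorted = sorted(d, key=lambda x: x[1], reverse=True)
print(d_sorted)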

how to compare two csv files in python and flag the differences?

I am new to Python. Kindly help me.
I have two sets of CSV files. I need to compare them and output the differences, like changed data, deleted data and added data. Here's my example.
file 1:
Sn Name Subject Marks
1 Ram Maths 85
2 sita Engilsh 66
3 vishnu science 50
4 balaji social 60
file 2:
Sn Name Subject Marks
1 Ram computer 85 # subject name has changed
2 sita Engilsh 66
3 vishnu science 90 # marks have changed
4 balaji social 60
5 kishor chem 99 # added new line
Output - I need to get something like this:
Changed Items:
1 Ram computer 85
3 vishnu science 90
Added item:
5 kishor chem 99
Deleted item:
.................
I imported csv and did the comparison via a for loop with readlines. I am not getting the desired output. Flagging the added and deleted items between file 1 and file 2 (CSV files) confuses me a lot. Please suggest effective code, folks.
The idea here is to flatten your dataframe with melt to compare each value:
import pandas as pd
import numpy as np

# Load your csv files
df1 = pd.read_csv('file1.csv', ...)
df2 = pd.read_csv('file2.csv', ...)
# Select columns (not mandatory, it depends on your 'Sn' column)
cols = ['Name', 'Subject', 'Marks']
# Flatten your dataframes
out1 = df1[cols].melt('Name', var_name='Item', value_name='Old')
out2 = df2[cols].melt('Name', var_name='Item', value_name='New')
out = pd.merge(out1, out2, on=['Name', 'Item'], how='outer')
# Flag the state of each item (check for missing values first,
# otherwise a NaN on either side would be flagged as 'changed')
condlist = [out['Old'].isna(),
            out['New'].isna(),
            out['Old'] != out['New']]
out['State'] = np.select(condlist, choicelist=['added', 'deleted', 'changed'],
                         default='unchanged')
Output:
>>> out
Name Item Old New State
0 Ram Subject Maths computer changed
1 sita Subject Engilsh Engilsh unchanged
2 vishnu Subject science science unchanged
3 balaji Subject social social unchanged
4 Ram Marks 85 85 unchanged
5 sita Marks 66 66 unchanged
6 vishnu Marks 50 90 changed
7 balaji Marks 60 60 unchanged
8 kishor Subject NaN chem added
9 kishor Marks NaN 99 added
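If you then want a report shaped like the one in the question, one option (just a sketch, working from the out frame built above) is to print the rows grouped by State:
# Print the rows grouped by state, mirroring the question's report format.
for state in ['changed', 'added', 'deleted']:
    subset = out[out['State'] == state]
    print(f"{state.capitalize()} items:")
    if subset.empty:
        print('.................')
    else:
        print(subset[['Name', 'Item', 'New']].to_string(index=False))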
# Row-by-row comparison (this assumes both files line up on 'Sn' and that any
# new rows are appended at the end of file 2).
count, flag = 0, 1
for i, j in zip(df1.values, df2.values):
    if sum(i == j) != 4:
        if flag:
            print("Changed Items:")
            flag = 0
        print(j)
    count += 1
if count != len(df2):
    print("Newly added:")
    print(*df2.iloc[count:, :].values)

Writing a program that prints out the names of students with more than six quiz scores

I'm new to Python. Please help me out. From a "score.txt" file I have to print out the names of students that have more than 6 quiz scores. It is as follows
joe 10 15 20 30 40,
bill 23 16 19 22,
sue 8 22 17 14 32 17 24 21 2 9 11 17,
grace 12 28 21 45 26 10,
john 14 32 25 16 89
My initial approach to separate the data from the string was like this:
f=open("score.txt", "r")
f.readlines()[1:]
This gave me a list. How can I check the len(elements)>=6 and then print the names?
Since the file contains lines of data and the fields in each line are separated by spaces, you can use string manipulation here:
f=open("score.txt", "r")
entries = f.readlines()
for entry in entries:
# split entry by space, first is name, the rest are scores
chunk = entry.split(' ')
name = chunk[0]
scores = chunk[1:]
if len(scores) > 6:
print(f'{name} had more than 6 quiz scores {len(scores)}')
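As a side note, opening the file with a with block closes it automatically; a minimal sketch of the same loop:
# Same logic, but the context manager closes the file automatically.
with open("score.txt", "r") as f:
    for entry in f:
        chunk = entry.split()           # split on any whitespace, drops the newline
        name, scores = chunk[0], chunk[1:]
        if len(scores) > 6:
            print(f'{name} had more than 6 quiz scores ({len(scores)})')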

Pandas Series and Nan Values for mismatched values

I have these two dictionaries,
import pandas as pd

dico = {'Name': ['Arthur', 'Henri', 'Lisiane', 'Patrice', 'Zadig', 'Sacha'],
        "Age": ["20", "18", "62", "73", '21', '20'],
        "Studies": ['Economics', 'Maths', 'Psychology', 'Medical', 'Cinema', 'CS']
        }
dico2 = {'Surname': ['Arthur1', 'Henri2', 'Lisiane3', 'Patrice4']}
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)
I would like to match the Surname column against the Name column and append it to dico, for the following output:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Nan 73 Medical
4 Zadig Nan 21 Cinema
5 Sacha Nan 20 CS
and ultimately delete the rows for which Surname is Nan
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
from fuzzywuzzy import fuzz

map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx])  # obtain surname

dico['Surname'] = pd.Series(map_list)  # add column
dico = dico[["Name", "Surname", "Age", "Studies"]]  # reorder columns
# if the surname is not a great match, print "Nan"
dico = dico.drop(dico[dico.Surname == "NaN"].index)
but when I print(dico), the output is as follows:
Name Surname Age Studies
0 Arthur Arthur1 20 Economics
1 Henri Henri2 18 Maths
2 Lisiane Lisiane3 62 Psychology
3 Patrice Patrice4 73 Medical
4 Zadig Patrice4 21 Cinema
5 Sacha Patrice4 20 CS
I don't see why there's a mismatch after the Patrice row, when I want it to be "Nan".
Let's try pd.MultiIndex.from_product to create the combinations, then assign a score with zip and fuzz.ratio, do some filtering to create our dict, and finally use Series.map and df.dropna:
from fuzzywuzzy import fuzz
comb = pd.MultiIndex.from_product((dico['Name'],dico2['Surname']))
scores = comb.map(lambda x: fuzz.ratio(*x)) #or fuzz.partial_ratio(*x)
d = dict(a for a,b in zip(comb,scores) if b>90) #change threshold
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])
print(out)
Name Age Studies SurName
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
You could do the following. Define the function:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['Surname'] = m
    m2 = df_1['Surname'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['Surname'] = m2
    return df_1
and run
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = fuzzy_merge(dico, dico2, 'Name', 'Surname',threshold=90, limit=2)
This returns:
Name Age Studies Surname
0 Arthur 20 Economics Arthur1
1 Henri 18 Maths Henri2
2 Lisiane 62 Psychology Lisiane3
3 Patrice 73 Medical Patrice4
4 Zadig 21 Cinema
5 Sacha 20 CS
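To then drop the rows that got no match (as asked in the question), one way is to treat the empty Surname strings as missing values; a sketch, assuming df is the frame returned by fuzzy_merge above:
import numpy as np

# Rows whose Surname came back empty are treated as "no match" and dropped.
df['Surname'] = df['Surname'].replace('', np.nan)
df = df.dropna(subset=['Surname'])
print(df)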

Choose higher value based off column value between two dataframes

A question about choosing values based on two dataframes.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is all rows of df whose 'age' value is higher than the df2['age'] value for the same 'name'.
Expected result:
age name
12 55 Anna
Per the comments, use merge and filter the dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name', 'age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
Try this:
index = df[df['age'] > age].index
df.loc[index]
There are a few edge cases you don't mention how you would like to resolve, but generally what you want to do is iterate down the dataframes, compare the ages, and use the larger. You could do so in the following manner:
df3 = pd.DataFrame(columns=['age', 'name'])
for x in range(len(df)):
    if df['age'][x] > df2['age'][x]:
        df3.loc[x, 'age'] = df['age'][x]
        df3.loc[x, 'name'] = df['name'][x]
    else:
        df3.loc[x, 'age'] = df2['age'][x]
        df3.loc[x, 'name'] = df2['name'][x]
Although you will need to modify this to reflect how you want to resolve names that are only in one list, or if the lists are of different sizes.
One solution that comes to my mind is merge and query:
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
