I would like to remove Chinese-style parentheses and the contents inside them from the following dataframe:
id title
0 1 【第一次拍卖】深圳市光明新区公明街道中心区(拍卖) ---> (拍卖) need to remove
1 2 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 3 【第一次拍卖】(破)广东省深圳市龙岗区布吉新区 ---> (破) need to remove
3 4 【第一次拍卖】深圳市宝安区新安街道新城大道
4 5 (拍卖)【第二次拍卖】深圳市盐田区沙头角东和路 ---> (拍卖) need to remove
I tried df['title'].str.replace(r'\([^()]*\)', '') and df['title'].str.replace(r'\([^)]*\)', ''), but both of them only remove the parentheses when they appear at the end of the string.
0 【第一次拍卖】深圳市光明新区公明街道中心区 ---> this row works
1 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 【第一次拍卖】(拍卖)广东省深圳市龙岗区布吉新区
3 【第一次拍卖】深圳市宝安区新安街道新城大道
4 (拍卖)【第二次拍卖】深圳市盐田区沙头角东和路
How could I modify my code to get the following output? Thank you.
0 【第一次拍卖】深圳市光明新区公明街道中心区
1 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 【第一次拍卖】广东省深圳市龙岗区布吉新区
3 【第一次拍卖】深圳市宝安区新安街道新城大道
4 【第二次拍卖】深圳市盐田区沙头角东和路
The following three solutions all work (pass regex=True so the pattern is treated as a regular expression):
df['title'].str.replace(r'\([^()]*\)', '', regex=True)
df['title'].str.replace(r'\([^)]*\)', '', regex=True)
df['title'].str.replace(r'\(\S+\)', '', regex=True)
Out:
0 【第一次拍卖】深圳市光明新区公明街道中心区
1 【第一次拍卖】深圳市龙岗区龙岗街道新生社区
2 【第一次拍卖】广东省深圳市龙岗区布吉新区
3 【第一次拍卖】深圳市宝安区新安街道新城大道
4 【第二次拍卖】深圳市盐田区沙头角东和路
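For a reproducible check, here is a minimal sketch. It assumes the titles may contain either halfwidth () or fullwidth （） parentheses, so the character classes accept both:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'title': [
        '【第一次拍卖】深圳市光明新区公明街道中心区(拍卖)',
        '【第一次拍卖】(破)广东省深圳市龙岗区布吉新区',
        '(拍卖)【第二次拍卖】深圳市盐田区沙头角东和路',
    ],
})

# remove any parenthesised group, wherever it occurs in the string;
# [(（] and [)）] cover both halfwidth and fullwidth parentheses (assumption about the data)
df['title'] = df['title'].str.replace(r'[(（][^()（）]*[)）]', '', regex=True)
print(df['title'])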
for make in range(1, content+1):
    print(make)
Result:
1
2
3
4
5
But I want them like this:
1 2 3 4 5 6 7
You need to tell print what to put after each value instead of the default newline, using the end keyword argument:
for make in range(1, content+1):
    print(make, end=' ')
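A complete, runnable version of the same loop, assuming content = 7 (the value implied by the desired output), with a final print() so the program still ends its line:

content = 7  # assumed value; in the question it comes from elsewhere

for make in range(1, content + 1):
    print(make, end=' ')  # print a space instead of a newline after each number
print()                   # finish the line once the loop is done

This prints: 1 2 3 4 5 6 7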
1 2
0 ADRC-111-01 ADRC111
1 ADRC-11955-01 ADRC11955
2 ADRC-18133-01 ADRC18133
3 SWAN0023-03 SWAN0023
In column 1, I wish to get rid of the first - sign, regardless of how many there are in the cell. There are one or two - in each entry.
Desired output:
1 2
0 ADRC111-01 ADRC111
1 ADRC11955-01 ADRC11955
2 ADRC18133-01 ADRC18133
3 SWAN002303 SWAN0023
Use .str.replace with n=1 so that only the first occurrence is replaced:
df['1'] = df['1'].str.replace('-', '', n=1)
Output:
>>> df
1 2
0 ADRC111-01 ADRC111
1 ADRC11955-01 ADRC11955
2 ADRC18133-01 ADRC18133
3 SWAN002303 SWAN0023
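For completeness, a small sketch reproducing the data (assuming the column labels are the strings '1' and '2'); n=1 limits the replacement to the first hyphen in each value, and regex=False treats the hyphen as a literal character:

import pandas as pd

df = pd.DataFrame({
    '1': ['ADRC-111-01', 'ADRC-11955-01', 'ADRC-18133-01', 'SWAN0023-03'],
    '2': ['ADRC111', 'ADRC11955', 'ADRC18133', 'SWAN0023'],
})

# replace only the first '-' in each entry
df['1'] = df['1'].str.replace('-', '', n=1, regex=False)
print(df)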
ct_data['IM NO'] = ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x)))
I am trying to encrypt it; here is the head of ct_data:
Unnamed: 0 IM NO CT ID
0 0 214281340 x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1 1 214281244 -vf6738ee3bedf47e8acf4613034069ab0|aa0d2dac654
2 2 175326863 __g3d877adf9d154637be26d9a0111e1cd6|6FfHZRoiWs
3 3 299631931 __gbe204670ca784a01b7207b42a7e5a5d3|54e2c39cd3
4 4 214282320 773840905c424a10a4a31aba9d6458bb|__g1114a30c6e
But I get the following:
Unnamed: 0 ... CT ID
0 0 ... x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1 1 ... aa0d2dac654d4154bf7c09f73faeaf62|-vf6738ee3bed
2 2 ... 6FfHZRoiWs2VO02Pruk07A|__g3d877adf9d154637be26
3 3 ... 54e2c39cd35044ffbd9c0918d07923dc|__gbe204670ca
4 4 ... __g1114a30c6ea548a2a83d5a51718ff0fd|773840905c
5 5 ... 9e6eb976075b4b189ae7dde42b67ca3d|WgpKucd28IcdE
The values in the IM NO column should be encrypted as 20-digit numbers.
Normally the encryption is done as below:
import pyffx

strEncrypt = pyffx.Integer(b'dkrya#Jppl1994', length=20)
strEncrptVal = strEncrypt.encrypt(int('9digit IM No'))  # '9digit IM No' stands in for an actual 9-digit IM NO value
ct_data.iloc[:, 1] displays the following:
0 214281340
1 214281244
2 175326863
3 299631931
4 214282320
5 214279026
This should be a comment but it contains formatted data.
It is probably a mere display problem. With the initial sample of your dataframe, I executed your command and printed its return value:
print(ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x))))
0 88741194526272080902
1 2665012251053580165
2 18983388112345132770
3 85666027666173191357
4 78253063863998100367
Name: IM NO, dtype: object
So it is correctly executed. Let us go one step further:
ct_data['IM NO'] = ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x)))
print(ct_data['IM NO'])
0 88741194526272080902
1 2665012251053580165
2 18983388112345132770
3 85666027666173191357
4 78253063863998100367
Name: IM NO, dtype: object
Again...
That means your command was successful, but since the IM NO column is now wider, your system can no longer display all the columns; it shows only the first and last ones, with an ellipsis (...) in the middle.
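If you want to see every column instead of the truncated view, you can widen pandas' display settings, for example:

import pandas as pd

pd.set_option('display.max_columns', None)  # never hide columns behind '...'
pd.set_option('display.width', None)        # auto-detect the terminal width

print(ct_data.head())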
I have a file full of numbers in the form:
010101228522 0 31010 3 3 7 7 43 0 2 4 4 2 2 3 3 20.00 89165.30
01010222852313 3 0 0 7 31027 63 5 2 0 0 3 2 4 12 40.10 94170.20
0101032285242337232323 7 710153 9 22 9 9 9 3 3 4 80.52 88164.20
0101042285252313302330302323197 9 5 15 9 15 15 9 9 110.63 98168.80
01010522852617 7 7 3 7 31330 87 6 3 3 2 3 2 5 15 50.21110170.50
...
...
I am trying to read this file, but I am not sure how to go about it. When I use the built-in open, numpy's loadtxt, or even pandas, the file is read as one column, i.e. its shape is (364, 1), but I want the numbers separated into columns and the blank spaces replaced by zeros. Any help would be appreciated. Note that in some places two spaces follow each other.
If the column's content is a string, have you tried str.split()? It turns the string into a list, with each number separated at the gaps. You could then loop over the items of that list to build a table out of it. Not quite sure this has answered the question; sorry if not.
So I finally solved my problem. I actually had to strip the lines and then read each 'letter' from the line; in my case I am picking the individual numbers from the stripped line and appending them to an array. Here is the code for my solution:
import pandas as pd

arr = []
with open('Kp2001', 'r') as f:
    for ii, line in enumerate(f):
        arr.append([])              # creates a new row in the nested list
        cnt = line.strip()          # strip the line
        for letter in cnt:          # get each 'letter' from the line, here the individual numbers
            arr[ii].append(letter)  # append them individually so Python does not read them as one string

df = pd.DataFrame(arr)    # converting to a DataFrame gives proper columns and keeps the spaces in their respective columns
df2 = df.replace(' ', 0)  # replace the spaces with whatever you like
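A more compact sketch of the same idea (same file name 'Kp2001', one character per column, blanks turned into zeros); rstrip('\n') keeps any leading blanks so the columns stay aligned:

import pandas as pd

with open('Kp2001', 'r') as f:
    rows = [list(line.rstrip('\n')) for line in f]  # one character per column

df = pd.DataFrame(rows).replace(' ', 0)  # blanks become zeros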
I have a txt file containing multiple rows as below.
56.0000 3 1
62.0000 3 1
74.0000 3 1
78.0000 3 1
82.0000 3 1
86.0000 3 1
90.0000 3 1
94.0000 3 1
98.0000 3 1
102.0000 3 1
106.0000 3 1
110.0000 3 0
116.0000 3 1
120.0000 3 1
Now I am looking for the row which has '0' in the third column.
I am using Python's re module. What I have tried is re.match(r"(.*)\s+(0-9)\s+(1)", line), but to no avail.
What should be the regular expression pattern I should be looking for?
You probably don't need a regex for this. You can strip trailing whitespace from the right side of the line and then check the last character:
if line.rstrip()[-1] == "0":  # since your last column only contains 0 or 1
    ...
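In context, a minimal sketch (the file name data.txt is assumed; it is not given in the question):

with open('data.txt') as f:
    for line in f:
        if line.rstrip()[-1] == '0':  # last column only contains 0 or 1
            print(line.rstrip())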
Just split the line and read the value from the resulting list.
>>> line = "56.0000 3 1"
>>> a = line.split()
>>> a
['56.0000', '3', '1']
>>> print(a[2])
1
>>>
Summary:
f = open("sample.txt", 'r')
for line in f:
    tmp_list = line.split()
    if int(tmp_list[2]) == 0:
        print("Line has 0")
        print(line)
f.close()
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Line has 0
110.0000 3 0