Removing characters from lists in pandas column - python

I have a pandas DataFrame df with two columns (NACE and cleaned) which looks like this:
NACE cleaned
0 071 [260111, 260112]
1 072 [2603, 2604, 2606, 261610, 261690, 2607, 2608]
2 081 [251511, 251512, 251520, 251611, 251612, 25162]
3 089 [251010, 251020, 2502, 25030010, 251110, 25112]
4 101 [020110, 02012020, 02012030a), 02012050, 020130]
... ... ...
92 324 [95030021, 95030041, 95030049, 95030029, 95030]
93 325 [901841, 90184910, 90184990b), 841920, 90183110]
94 329 [960310, 96039010, 96039091, 96039099, 960321]
95 331 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-, 983843]
96 332 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-]
The cleaned column consists of lists of strings, some of which still contain characters that need to be removed. Specifically, I need to remove all +, -, and ) characters.
To focus on just one of these, +, I have tried many methods, including:
df['cleaned'] = df['cleaned'].str.replace('+', '')
but also:
df.replace('+', '', regex = True, inplace = True)
and a desperate:
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
Different versions of these solutions work on most dataframes, but not when the column consists of lists.

Just change:
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
to:
for i in df['cleaned']:
    for x in range(len(i)):
        i[x] = i[x].replace('+', '')
Note that str.replace returns a new string rather than modifying the string in place, so the result has to be assigned back to the list element. With that assignment it should work.
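For completeness, here is a loop-free sketch of the same cleanup, not taken from the answer above: it uses .apply with a list comprehension and re.sub to strip all three characters (+, -, and )) in one pass, assuming each cell really is a list of plain strings:
import re

# Strip every '+', '-', and ')' from each string in each list
df['cleaned'] = df['cleaned'].apply(
    lambda lst: [re.sub(r'[+\-)]', '', s) for s in lst]
)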

Related

Pandas, how to get an even distribution of instances by 10% percentiles from my CD database (as a dataframe)?

What I want is to see how I should group my CDs so that I have a similar count in each 'bin', e.g. A+B, C+D, and E+F+G+H. It's more of an exercise than a need, but I don't have enough space to keep a pile for each letter of the alphabet, so I'd rather have, say, 10 piles; the question is how to split them up.
So I have the following, obtained from my DataFrame, showing the cumulative sum of entries through numbers (#) and the alphabet:
In[135]:csum
Out[135]:
key
# 9
A 25
B 43
C 63
D 76
E 82
F 98
G 105
H 116
I 120
J 125
K 130
L 139
M 154
N 160
O 164
P 186
R 221
S 234
T 298
U 302
V 319
W 325
Y 326
Name: count, dtype: int64
I've written a function 'distribution' to get the result I wanted... i.e. 10 separate groups, showing which alphabetic clusters to use.
dist = distribution(byvar, various=True)
dist
Out[138]:
quants
(8.999, 49.0] #AB
(49.0, 79.6] CD
(79.6, 104.3] EF
(104.3, 121.0] GHI
(121.0, 134.5] JK
(134.5, 158.8] LM
(158.8, 189.5] NOP
(189.5, 259.6] RS
(259.6, 313.9] TU
(313.9, 326.0] VWY
dtype: object
The code is here:
import pandas as pd
import numpy as np

def distribution(df, various=False):
    '''
    Parameters
    ----------
    df : dataframe
    various : boolean, optional
        Select if Various df

    Returns
    -------
    df
        Shows how to distribute groupings to get similar size bunches.
    '''
    global gar, csum
    if various:
        df['AZ'] = df['album'].apply(lambda x: '#' if x[0] in map(str, range(10)) else x[0].upper())
    else:
        df['AZ'] = df['artist'].apply(lambda x: '#' if x[0] in map(str, range(10)) else x[0].upper())
    gar = df.groupby('AZ')
    csum = gar.size().cumsum()  # csum becomes a Series obj
    # items() rather than iteritems(), which was removed in pandas 2.0
    sdf = pd.DataFrame(csum.items(), columns=['key', 'count'])
    sdf['quants'] = pd.qcut(sdf['count'], q=np.array(range(11)) * 0.1)
    gsdf = sdf.groupby('quants')
    return gsdf.apply(lambda x: x['key'].sum())
So my question arises from the fact that I couldn't see how to achieve this without converting my Series object (csum) back into a DataFrame before using pd.qcut to split it up.
Can anyone see a more concise approach that bypasses the creation of the intermediate 'sdf' DataFrame?
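One possible direction, offered as a hedged sketch rather than a verified answer: pd.qcut can bin the csum Series directly, and grouping the index labels by those bins concatenates the letters per bucket, with no intermediate DataFrame:
import numpy as np
import pandas as pd

# Bin the cumulative counts into 10 quantile buckets
bins = pd.qcut(csum, q=np.array(range(11)) * 0.1)

# Concatenate the index labels (the letters) within each bucket
groups = csum.index.to_series().groupby(bins).agg(''.join)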

Pandas - inserting a comma on a number

I'm using Python and pandas, and I have a dataframe containing temperatures (Celsius). After some processing the values now follow this pattern, e.g.
362
370
380
385
376
I want to insert a comma between the second and third digit,
e.g. 36,2
but I just can't do it. Is this possible?
Thanks in advance!
Try with division + astype + str.replace:
df['temp'] = (df['temp'] / 10).astype(str).str.replace('.', ',', regex=False)
temp
0 36,2
1 37,0
2 38,0
3 38,5
4 37,6
DataFrame Used:
import pandas as pd
df = pd.DataFrame({'temp': [362, 370, 380, 385, 376]})
temp
0 362
1 370
2 380
3 385
4 376
Presumably, you want the last digit to be separated by a comma (for example, 88 should be 8,8). In that case, this will work:
ls = [362, 370, 380, 385, 376]
ls = [f"{str(item)[:-1]},{str(item)[-1]}" for item in ls]
# ['36,2', '37,0', '38,0', '38,5', '37,6']
Where:
str(item)[:-1] gets all digits except the final one
str(item)[-1] gets just the final digit
In a dataframe, your values are stored as a pandas series. In that case:
import pandas as pd
ls = pd.Series([362, 370, 380, 385, 376])
ls = ls.astype("str").map(lambda x : f"{x[:-1]},{x[-1]}")
Or, applied directly to a dataframe column:
df["Your column"] = df["Your column"].astype("str").map(lambda x : f"{x[:-1]},{x[-1]}")
Output:
0 36,2
1 37,0
2 38,0
3 38,5
4 37,6
You would have to convert this integer data to a string in order to insert the ','.
For example:
temp=362
x = str(temp)[:-1]+','+str(temp)[-1]
You could use this in a loop or in a list comprehension, which was already mentioned. (List comprehensions can be trickier to understand, so I provided this instead.) Hope it helps!
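As a small aside not taken from the answers above, the same result can be produced with integer arithmetic instead of string slicing; a minimal sketch, assuming exactly one implied decimal place:
vals = [362, 370, 380, 385, 376]

# v // 10 is the integer part, v % 10 the single decimal digit
formatted = [f"{v // 10},{v % 10}" for v in vals]
# ['36,2', '37,0', '38,0', '38,5', '37,6']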

Splitting a Column using Pandas

I am trying to split the following column using Pandas: (df name is count)
Location count
POINT (-118.05425 34.1341) 355
POINT (-118.244512 34.072581) 337
POINT (-118.265586 34.043271) 284
POINT (-118.360102 34.071338) 269
POINT (-118.40816 33.943626) 241
to this desired outcome:
X-Axis Y-Axis count
-118.05425 34.1341 355
-118.244512 34.072581 337
-118.265586 34.043271 284
-118.360102 34.071338 269
-118.40816 33.943626 241
I have tried removing the word 'POINT' and both parentheses, but then I am left with extra whitespace at the beginning of the values. I tried using:
count.columns = count.columns.str.lstrip()
But it was not removing the whitespace (it operates on the column labels, not the values).
I was hoping to use this code to split the column:
count = pd.DataFrame(count.Location.str.split(' ', 1).tolist(),
                     columns=['x-axis', 'y-axis'])
since the space between the x and y values could be used as the separator, but the leading whitespace gets in the way.
You can use .str.extract with regex pattern having capture groups:
df[['x-axis', 'y-axis']] = df.pop('Location').str.extract(r'\((\S+) (\S+)\)')
print(df)
count x-axis y-axis
0 355 -118.05425 34.1341
1 337 -118.244512 34.072581
2 284 -118.265586 34.043271
3 269 -118.360102 34.071338
4 241 -118.40816 33.943626
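One hedged follow-up, since str.extract returns strings: if the coordinates are needed as numbers, a cast is required, e.g.
# The extracted columns are object dtype; convert them to floats
df[['x-axis', 'y-axis']] = df[['x-axis', 'y-axis']].astype(float)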
a quick solution can be:
(df['Location']
 .str.split(' ', 1)             # like what you did
 .str[-1]                       # select only the coordinate part
 .str.strip('(')                # remove the opening parenthesis
 .str.strip(')')                # remove the closing parenthesis
 .str.split(' ', expand=True))  # expand to two columns
then you may rename the columns using .rename or df.columns = colnames

Create Multiple dataframes from a large text file

Using Python, how do I break a text file into data frames where every 84 rows is a new, different dataframe? The first column x_ft is the same value every 84 rows then increments up by 5 ft for the next 84 rows. I need each identical x_ft value and corresponding values in the row for the other two columns (depth_ft and vel_ft_s) to be in the new dataframe too.
My text file is formatted like this:
x_ft depth_ft vel_ft_s
0 270 3535.755 551.735107
1 270 3534.555 551.735107
2 270 3533.355 551.735107
3 270 3532.155 551.735107
4 270 3530.955 551.735107
.
.
33848 2280 3471.334 1093.897339
33849 2280 3470.134 1102.685547
33850 2280 3468.934 1113.144287
33851 2280 3467.734 1123.937134
I have tried many, many different ways but keep running into errors and would really appreciate some help.
I suggest looking into pandas.read_table, which automatically outputs a DataFrame. Once doing so, you can isolate the rows of the DataFrame that you are looking to separate (every 84 rows) by doing something like this:
import pandas as pd

df = pd.read_table('data.txt', sep=r'\s+')  # read the txt data table with pandas (file name is a placeholder)
arr = []
# This gives you an array of all x values in your dataset
for x in range(0, 403):
    val = 270 + 5 * x
    arr.append(val)
# This generates a csv file for every specific x_ft value,
# with its corresponding columns (depth_ft and vel_ft_s)
for x_value in arr:
    tempdf = df[df['x_ft'] == x_value]
    tempdf.to_csv('df' + str(x_value) + '.csv')
You can get indexes to split your data:
rows = 84
datasets = round(len(data) / rows)  # total datasets
index_list = []
for index in data.index:
    x = index % rows
    if x == 0:
        index_list.append(index)
print(index_list)
So, split the original dataset by those indexes:
l_mod = index_list + [len(data)]  # close the final block; max(index_list)+1 would leave it one row long
dfs_list = [data.iloc[l_mod[n]:l_mod[n + 1]] for n in range(len(l_mod) - 1)]
print(len(dfs_list))
Outputs
print(type(dfs_list[1]))
# pandas.core.frame.DataFrame
print(len(dfs_list[0]))
# 84
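Since every block shares a single x_ft value, a groupby-based alternative may be more direct than the index arithmetic above; a minimal sketch, assuming the file has already been read into df:
# One DataFrame per distinct x_ft value, in order of appearance
dfs_list = [group for _, group in df.groupby('x_ft', sort=False)]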

Pandas: How to extract columns on new columns which contain special separators?

My data frame has some columns which contain digits and words. Before the digits and words there are sometimes special characters like ">*".
The columns are mostly divided by , or /. Based on the separators, I want to split the values into new columns and delete the original.
Here is a reproduction of my dataframe, with my code:
d = {'error': [
'test,121',
'123',
'test,test',
'>errrI1GB,213',
'*errrI1GB,213',
'*errrI1GB/213',
'*>errrI1GB/213',
'>*errrI1GB,213',
'>test, test',
'>>test, test',
'>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error'] = df['error'].str.replace(' ', '')
df[['error1', 'error2']] = df['error'].str.extract(r'.*?(\w*)[,|/](\w*)')
df
So far my approach is first to remove the whitespaces with
df['error'] = df['error'].str.replace(' ', '')
Then I constructed my regex with the help of
https://regex101.com/r/UHzTOq/13
.*?(\w*)[,|/](\w*)
Afterwards I delete the messy column with:
df.drop(columns =["error"], inplace = True)
Single values in a row are not matched, and therefore I get NaN as a result. How can I include them in my regex?
The solution is:
df[['error1', 'error2']] = df['error'].str.extract(r'^[>*:]*(.*?)(?:[,|/](.*))?$')
Assuming that we'd like to add those values with only a test or a 123 in error1 column, maybe then we'd just slightly modify your original expression:
^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$
I'm pretty sure there should be other easier ways though.
Test
import pandas as pd
d = {'error': [
'test,121',
'123',
'test',
'test,test',
'>errrI1GB,213',
'*errrI1GB,213',
'*errrI1GB/213',
'*>errrI1GB/213',
'>*errrI1GB,213',
'>test, test',
'>>test, test',
'>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error1'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\1', regex=True)
df['error2'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\2', regex=True)
print(df)
The expression is explained in the top right panel of regex101.com, if you wish to explore/simplify/modify it.
Output
error error1 error2
0 test,121 test 121
1 123 123
2 test test
3 test,test test test
4 >errrI1GB,213 errrI1GB 213
5 *errrI1GB,213 errrI1GB 213
6 *errrI1GB/213 errrI1GB 213
7 *>errrI1GB/213 errrI1GB 213
8 >*errrI1GB,213 errrI1GB 213
9 >test, test test test
10 >>test, test test test
11 >>:test,test test test
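As an aside, the same two columns can be produced in one pass with str.extract, which returns one column per capture group; a sketch reusing the pattern above:
pattern = r'^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$'
df[['error1', 'error2']] = df['error'].str.extract(pattern)
df['error2'] = df['error2'].fillna('')  # single values leave group 2 empty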
