I have a pandas DataFrame df with two columns (NACE and cleaned) which looks like this:
NACE cleaned
0 071 [260111, 260112]
1 072 [2603, 2604, 2606, 261610, 261690, 2607, 2608]
2 081 [251511, 251512, 251520, 251611, 251612, 25162]
3 089 [251010, 251020, 2502, 25030010, 251110, 25112]
4 101 [020110, 02012020, 02012030a), 02012050, 020130]
... ... ...
92 324 [95030021, 95030041, 95030049, 95030029, 95030]
93 325 [901841, 90184910, 90184990b), 841920, 90183110]
94 329 [960310, 96039010, 96039091, 96039099, 960321]
95 331 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-, 983843]
96 332 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-]
The cleaned column consists of lists of strings, some of which still contain characters that need to be removed. Specifically I need to remove all +, -, and ).
To focus on just one of these, the +, I have tried many methods, including:
df['cleaned'] = df['cleaned'].str.replace('+', '')
but also:
df.replace('+', '', regex = True, inplace = True)
and a desperate:
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
Different versions of these solutions work on most dataframes, but not when the column consists of lists.
Just change
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
to:
for i in df['cleaned']:
    for x in range(len(i)):
        i[x] = i[x].replace('+', '')
Note the assignment back to i[x]: str.replace returns a new string rather than modifying the old one in place, which is also why your original loop had no effect. It should work.
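To remove all three characters (+, -, and the stray parenthesis) in one pass, here is a hedged sketch using str.translate; the translation-table approach is one option, not the only one:
# A table whose third argument lists characters to delete
table = str.maketrans('', '', '+-)')
df['cleaned'] = df['cleaned'].apply(lambda lst: [s.translate(table) for s in lst])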
What I want is to see how I should group my CDs so that I have a similar count in each 'bin', e.g. A+B, C+D, and E+F+G+H. It's more of an exercise than a need, but I don't have enough space for a pile per letter of the alphabet, so I'd rather have, say, 10 piles; the question is how to split them up.
So I have the following, obtained from my DataFrame, showing the cumulative sum of entries through numbers (#) and the alphabet:
In [135]: csum
Out[135]:
key
# 9
A 25
B 43
C 63
D 76
E 82
F 98
G 105
H 116
I 120
J 125
K 130
L 139
M 154
N 160
O 164
P 186
R 221
S 234
T 298
U 302
V 319
W 325
Y 326
Name: count, dtype: int64
I've written a function 'distribution' to get the result I wanted... i.e. 10 separate groups, showing which alphabetic clusters to use.
dist = distribution(byvar, various=True)
dist
Out[138]:
quants
(8.999, 49.0] #AB
(49.0, 79.6] CD
(79.6, 104.3] EF
(104.3, 121.0] GHI
(121.0, 134.5] JK
(134.5, 158.8] LM
(158.8, 189.5] NOP
(189.5, 259.6] RS
(259.6, 313.9] TU
(313.9, 326.0] VWY
dtype: object
The code is here:
import pandas as pd
import numpy as np

def distribution(df, various=False):
    '''
    Parameters
    ----------
    df : dataframe
    various : boolean, optional
        Select if Various df

    Returns
    -------
    df
        Shows how to distribute groupings to get similar size bunches.
    '''
    global gar, csum
    if various:
        df['AZ'] = df['album'].apply(lambda x: '#' if x[0] in map(str, range(10)) else x[0].upper())
    else:
        df['AZ'] = df['artist'].apply(lambda x: '#' if x[0] in map(str, range(10)) else x[0].upper())
    gar = df.groupby('AZ')
    csum = gar.size().cumsum()  # csum becomes a Series
    # Series.iteritems was removed in pandas 2.0; .items() is the modern spelling
    sdf = pd.DataFrame(csum.items(), columns=['key', 'count'])
    sdf['quants'] = pd.qcut(sdf['count'], q=np.array(range(11)) * 0.1)
    gsdf = sdf.groupby('quants')
    return gsdf.apply(lambda x: x['key'].sum())
So my question arises from the fact that I couldn't see how to achieve this without converting my Series object (csum) back into a DataFrame before using pd.qcut to split it up.
Can anyone see a more concise approach that bypasses creating the intermediate 'sdf' DataFrame?
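For what it's worth, a hedged sketch of one way to skip the intermediate DataFrame: pd.qcut accepts the Series directly, and the index labels can then be grouped by the resulting bins (the isdigit test is an assumed stand-in for the map(str, range(10)) check, equivalent for single characters 0-9):
import pandas as pd

def distribution(df, various=False):
    col = 'album' if various else 'artist'
    df['AZ'] = df[col].apply(lambda x: '#' if x[0].isdigit() else x[0].upper())
    csum = df.groupby('AZ').size().cumsum()
    # qcut bins the cumulative counts into deciles on the Series itself;
    # joining the index labels per bin reproduces the '#AB', 'CD', ... clusters
    return csum.index.to_series().groupby(pd.qcut(csum, q=10)).apply(''.join)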
I’m using Python and pandas, and I have a dataframe of temperatures (Celsius). After some processing, the values now follow this pattern, e.g.
362
370
380
385
376
I want to insert a comma between the second and third digit,
e.g. 36,2
but I just can’t manage it. Is this possible?
Thanks in advance!
Try with division + astype + str.replace:
# regex=False keeps the '.' literal (by default '.' would be a regex wildcard)
df['temp'] = (df['temp'] / 10).astype(str).str.replace('.', ',', regex=False)
temp
0 36,2
1 37,0
2 38,0
3 38,5
4 37,6
DataFrame Used:
import pandas as pd
df = pd.DataFrame({'temp': [362, 370, 380, 385, 376]})
temp
0 362
1 370
2 380
3 385
4 376
Presumably, you want the last digit to be separated by a comma (for example, 88 should be 8,8). In that case, this will work:
ls = [362, 370, 380, 385, 376]
ls = [f"{str(item)[:-1]},{str(item)[-1]}" for item in ls]
# ['36,2', '37,0', '38,0', '38,5', '37,6']
Where:
str(item)[:-1] gets all digits except the final one
str(item)[-1] gets just the final digit
In a dataframe, your values are stored as a pandas series. In that case:
import pandas as pd
ls = pd.Series([362, 370, 380, 385, 376])
ls = ls.astype("str").map(lambda x : f"{x[:-1]},{x[-1]}")
Or more specifically
df["Your column"] = df["Your column"].astype("str").map(lambda x : f"{x[:-1]},{x[-1]}")
Output:
0 36,2
1 37,0
2 38,0
3 38,5
4 37,6
You would have to convert this integer data to string in order to enter the ','.
For example:
temp=362
x = str(temp)[:-1]+','+str(temp)[-1]
You could use this in a loop or in a list comprehension, as already mentioned (those can be trickier to understand, so I provided this instead). Hope it helps!
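For example, the plain-loop version applied to a whole column (the column name 'temp' is assumed here):
formatted = []
for temp in df['temp']:
    s = str(temp)
    formatted.append(s[:-1] + ',' + s[-1])  # all digits but the last, a comma, then the last digit
df['temp'] = formatted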
I am trying to split the following column using Pandas: (df name is count)
Location count
POINT (-118.05425 34.1341) 355
POINT (-118.244512 34.072581) 337
POINT (-118.265586 34.043271) 284
POINT (-118.360102 34.071338) 269
POINT (-118.40816 33.943626) 241
to this desired outcome:
X-Axis Y-Axis count
-118.05425 34.1341 355
-118.244512 34.072581 337
-118.265586 34.043271 284
-118.360102 34.071338 269
-118.40816 33.943626 241
I have tried removing the word 'POINT' and both parentheses, but then I am left with an extra white space at the beginning of the column. I tried using:
count.columns = count.columns.str.lstrip()
But it was not removing the white space.
I was hoping to use this code to split the column:
count = pd.DataFrame(count.Location.str.split(' ', 1).tolist(),
                     columns=['x-axis', 'y-axis'])
since the space between the x and y values could be used as the separator, but the leading white space gets in the way.
You can use .str.extract with regex pattern having capture groups:
# .pop removes 'Location' from df while passing it to str.extract;
# each regex capture group becomes one output column
df[['x-axis', 'y-axis']] = df.pop('Location').str.extract(r'\((\S+) (\S+)\)')
print(df)
count x-axis y-axis
0 355 -118.05425 34.1341
1 337 -118.244512 34.072581
2 284 -118.265586 34.043271
3 269 -118.360102 34.071338
4 241 -118.40816 33.943626
a quick solution can be:
(df['Location']
 .str.split(' ', n=1)           # like what you did
 .str[-1]                       # keep only the "(x y)" part
 .str.strip('(')                # remove the opening parenthesis
 .str.strip(')')                # remove the closing parenthesis
 .str.split(' ', expand=True))  # expand to two columns
then you may rename column names using .rename or df.columns = colnames
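For instance, a hedged sketch of the full chain, with the column names from the desired output assumed:
df[['X-Axis', 'Y-Axis']] = (df['Location']
                            .str.split(' ', n=1)
                            .str[-1]
                            .str.strip('()')  # strips both parentheses in one call
                            .str.split(' ', expand=True))
df = df.drop(columns='Location')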
Using Python, how do I break a text file into data frames where every 84 rows is a new, different dataframe? The first column x_ft is the same value every 84 rows then increments up by 5 ft for the next 84 rows. I need each identical x_ft value and corresponding values in the row for the other two columns (depth_ft and vel_ft_s) to be in the new dataframe too.
My text file is formatted like this:
x_ft depth_ft vel_ft_s
0 270 3535.755 551.735107
1 270 3534.555 551.735107
2 270 3533.355 551.735107
3 270 3532.155 551.735107
4 270 3530.955 551.735107
.
.
33848 2280 3471.334 1093.897339
33849 2280 3470.134 1102.685547
33850 2280 3468.934 1113.144287
33851 2280 3467.734 1123.937134
I have tried many, many different ways but keep running into errors and would really appreciate some help.
I suggest looking into pandas.read_table, which automatically outputs a DataFrame. Once you've done so, you can isolate the rows of the DataFrame that you are looking to separate (every 84 rows) by doing something like this:
import pandas as pd

df = pd.read_table('data.txt', sep=r'\s+')  # filename assumed; adjust to your file

# This gives you a list of all x_ft values in your dataset (270, 275, ..., 2280)
arr = []
for x in range(0, 403):
    val = 270 + 5 * x
    arr.append(val)

# This generates a CSV file for every x_ft value with its corresponding
# columns (depth_ft and vel_ft_s)
for x_value in arr:
    tempdf = df[df['x_ft'] == x_value]
    tempdf.to_csv("df" + str(x_value) + ".csv")
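A hedged alternative: since every 84-row block shares a single x_ft value, df.groupby('x_ft') produces the same split without precomputing the x values:
# One CSV per distinct x_ft value, same columns as above
for x_value, tempdf in df.groupby('x_ft'):
    tempdf.to_csv(f"df{x_value}.csv", index=False)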
You can get indexes to split your data:
rows = 84
datasets = round(len(data) / rows)  # total number of datasets
index_list = []
for index in data.index:
    if index % rows == 0:
        index_list.append(index)
print(index_list)
So, split the original dataset by those indexes (len(data) closes the final chunk; using max(index_list)+1 here would truncate the last dataset to one row):
l_mod = index_list + [len(data)]
dfs_list = [data.iloc[l_mod[n]:l_mod[n + 1]] for n in range(len(l_mod) - 1)]
print(len(dfs_list))
Outputs
print(type(dfs_list[1]))
# pandas.core.frame.DataFrame
print(len(dfs_list[0]))
# 84
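An alternative sketch, assuming the default RangeIndex: integer-dividing the index labels groups the data into the same 84-row chunks without building the index list by hand:
dfs_list = [chunk for _, chunk in data.groupby(data.index // rows)]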
My data frame has a column which contains digits and words. Before the digits and words there are sometimes special characters like > and *.
The entries are mostly divided by , or /. Based on those separators, I want to split the column into new columns and then delete it.
Here is my reproduced dataframe with my code:
import pandas as pd

d = {'error': [
    'test,121',
    '123',
    'test,test',
    '>errrI1GB,213',
    '*errrI1GB,213',
    '*errrI1GB/213',
    '*>errrI1GB/213',
    '>*errrI1GB,213',
    '>test, test',
    '>>test, test',
    '>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error'] = df['error'].str.replace(' ', '')
df[['error1', 'error2']] = df['error'].str.extract(r'.*?(\w*)[,|/](\w*)')
df
So far my approach is first to remove the whitespaces with
df['error'] = df['error'].str.replace(' ', '')
Then I constructed my regex with the help of
https://regex101.com/r/UHzTOq/13
.*?(\w*)[,|/](\w*)
Afterwards I delete the messy column with:
df.drop(columns =["error"], inplace = True)
Single values in a row (like '123') are not matched, so I get NaN as a result. How do I include them in my regex?
Solution is:
df[['error1', 'error2']] = df['error'].str.extract(r'^[>*:]*(.*?)(?:[,|/](.*))?$')
Assuming that we'd like to place values with only a test or a 123 in the error1 column, maybe we'd just slightly modify your original expression:
^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$
I'm pretty sure there should be other easier ways though.
Test
import pandas as pd
d = {'error': [
'test,121',
'123',
'test',
'test,test',
'>errrI1GB,213',
'*errrI1GB,213',
'*errrI1GB/213',
'*>errrI1GB/213',
'>*errrI1GB,213',
'>test, test',
'>>test, test',
'>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error1'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\1', regex=True)
df['error2'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\2', regex=True)
print(df)
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it.
Output
error error1 error2
0 test,121 test 121
1 123 123
2 test test
3 test,test test test
4 >errrI1GB,213 errrI1GB 213
5 *errrI1GB,213 errrI1GB 213
6 *errrI1GB/213 errrI1GB 213
7 *>errrI1GB/213 errrI1GB 213
8 >*errrI1GB,213 errrI1GB 213
9 >test, test test test
10 >>test, test test test
11 >>:test,test test test
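As for easier ways, one hedged sketch that skips regex extraction: strip the leading special characters first, then split on either separator (str.split's regex=True needs pandas >= 1.4; note that single values end up with NaN rather than an empty string in error2):
cleaned = df['error'].str.replace(' ', '', regex=False).str.lstrip('>*:')
df[['error1', 'error2']] = cleaned.str.split(r'[,/]', n=1, expand=True, regex=True)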
RegEx Circuit: jex.im can visualize how the expression matches (diagram not reproduced here).