Splitting a Column using Pandas - python

I am trying to split the following column using pandas (the DataFrame is named count):
Location count
POINT (-118.05425 34.1341) 355
POINT (-118.244512 34.072581) 337
POINT (-118.265586 34.043271) 284
POINT (-118.360102 34.071338) 269
POINT (-118.40816 33.943626) 241
to this desired outcome:
X-Axis Y-Axis count
-118.05425 34.1341 355
-118.244512 34.072581 337
-118.265586 34.043271 284
-118.360102 34.071338 269
-118.40816 33.943626 241
I have tried removing the word 'POINT' and both parentheses, but then I am left with extra white space at the beginning of the column. I tried using:
count.columns = count.columns.str.lstrip()
but it was not removing the white space.
I was hoping to use this code to split the column:
count = pd.DataFrame(count.Location.str.split(' ', n=1).tolist(),
                     columns=['x-axis', 'y-axis'])
since the space between the x and y values could be used as the separator, but the leading white space gets in the way.

You can use .str.extract with a regex pattern containing two capture groups:
df[['x-axis', 'y-axis']] = df.pop('Location').str.extract(r'\((\S+) (\S+)\)')
print(df)
count x-axis y-axis
0 355 -118.05425 34.1341
1 337 -118.244512 34.072581
2 284 -118.265586 34.043271
3 269 -118.360102 34.071338
4 241 -118.40816 33.943626
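Note that str.extract returns strings, so if you need numeric values afterwards you can convert the new columns, for example:
df[['x-axis', 'y-axis']] = df[['x-axis', 'y-axis']].astype(float)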

A quick solution can be:
(df['Location']
 .str.split(' ', n=1)           # like what you did
 .str[-1]                       # keep only the "(x y)" coordinate part
 .str.strip('(')                # strip the opening parenthesis
 .str.strip(')')                # strip the closing parenthesis
 .str.split(' ', expand=True))  # expand into two columns
Then you can rename the columns using .rename or df.columns = colnames.
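For example, a minimal sketch (assuming the result of the chain above is stored in out):
out = (df['Location']
       .str.split(' ', n=1)
       .str[-1]
       .str.strip('(')
       .str.strip(')')
       .str.split(' ', expand=True))
out = out.rename(columns={0: 'x-axis', 1: 'y-axis'})  # or: out.columns = ['x-axis', 'y-axis']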

Related

Convert number string with commas and negative values to float [Pandas]

I would like to convert negative value strings and strings with commas to floats, but I am struggling to do both operations at the same time.
df:
customer_id Revenue
332 1,293.00
293 -485
4284 1,373.80
284 -327
Output_df:
customer_id Revenue
332 1293.00
293 485
4284 1373.80
284 327
Convert to numeric and then take the absolute value:
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
If the above doesn't work, then try:
df["Revenue"] = pd.to_numeric(df["Revenue"].str.strip().str.replace(",", "")).abs()
Here I first call str.strip() to remove any surrounding whitespace, then remove the commas using str.replace().
Does using .str.replace() help?
df["Revenue"] = pd.to_numeric(df["Revenue"].str.replace(',', '')).abs()
If you are reading the DataFrame from a CSV file, you can use the following at import to address the commas, and then deal with the - afterwards:
df = pd.read_csv('foo.csv', thousands=',')
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()

Remove * from a specific column value

For this dataframe, what is the best way to get rid of the * in "Stad Brussel*"? In the real dataframe the * appears as a superscript. Thanks.
Dutch name postcode Population
0 Anderlecht 1070 118241
1 Oudergem 1160 33313
2 Sint-Agatha-Berchem 1082 24701
3 Stad Brussel* 1000 176545
4 Etterbeek 1040 47414
Desired results:
Dutch name postcode Population
0 Anderlecht 1070 118241
1 Oudergem 1160 33313
2 Sint-Agatha-Berchem 1082 24701
3 Stad Brussel 1000 176545
4 Etterbeek 1040 47414
You can try:
df['Dutch name'] = df['Dutch name'].replace({r'\*': ''}, regex=True)
This will remove all * characters in the 'Dutch name' column. If you need to remove the character from multiple columns use:
df.replace({r'\*': ''}, regex=True)
If you are manipulating plain strings you can use regular expression matching. Something like:
import re

txt = 'Your file as a string here'
out = re.sub(r'\*', '', txt)
out now contains what you want.
For a dataframe, first define the column(s) to be checked:
cols_to_check = ['Dutch name']
then:
df[cols_to_check] = df[cols_to_check].replace({r'\*': ''}, regex=True)
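As a side note (not from the original answers), if you prefer to avoid regex entirely, .str.replace with regex=False treats the pattern as a literal string:
df['Dutch name'] = df['Dutch name'].str.replace('*', '', regex=False)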

Removing characters from lists in pandas column

I have a pandas DataFrame df with two columns (NACE and cleaned) which looks like this:
NACE cleaned
0 071 [260111, 260112]
1 072 [2603, 2604, 2606, 261610, 261690, 2607, 2608]
2 081 [251511, 251512, 251520, 251611, 251612, 25162]
3 089 [251010, 251020, 2502, 25030010, 251110, 25112]
4 101 [020110, 02012020, 02012030a), 02012050, 020130]
... ... ...
92 324 [95030021, 95030041, 95030049, 95030029, 95030]
93 325 [901841, 90184910, 90184990b), 841920, 90183110]
94 329 [960310, 96039010, 96039091, 96039099, 960321]
95 331 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-, 983843]
96 332 [-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-]
The cleaned column consists of lists of strings, some of which still contain characters that need to be removed. Specifically I need to remove all +, -, and ).
To focus on one of these +, I have tried many methods including:
df['cleaned'] = df['cleaned'].str.replace('+', '')
but also:
df.replace('+', '', regex = True, inplace = True)
and a desperate:
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
Different versions of these solutions work on most dataframes, but not when the column consists of lists.
Just change
for i in df['cleaned']:
    for x in i:
        i.replace('+', '')
to:
for i in df['cleaned']:
    for x in range(len(i)):
        i[x] = i[x].replace('+', '')
Strings are immutable, so the replacement has to be assigned back into the list; since the lists themselves are mutable, this mutates them in place and it should work.
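As a side note (not from the original answers), since each cell holds a Python list, a list comprehension inside apply can strip all three characters in one pass, assuming every cell is a list of strings:
import re

# remove every +, - and ) from each string in each list
df['cleaned'] = df['cleaned'].apply(lambda lst: [re.sub(r'[+\-)]', '', s) for s in lst])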

How can I split a text from a parenthesis in a CSV, and create another column with it

I'm completely new to the Python world, so I've been struggling with this issue for a couple of days now. Thank you all in advance.
I have been trying to split the text in a single column into three different ones. To explain myself better, here's where I am.
So this is my pandas dataframe from a csv:
In[2]:
df = pd.read_csv('raw_csv/consejo_judicatura_guerrero.csv', header=None)
df.columns = ["institution"]
df
Out[2]:
institution
0 1.1.2. Consejo Nacional de Ciencias (CNCOO00012)
Then, I first try to separate the 1.1.2. into a new column called number, which I more or less nailed:
In[3]:
new_df = pd.DataFrame(df['institution'].str.split('. ', n=1).tolist(), columns=['number', 'institution'])
Out[3]:
number institution
0 1.1.2. Consejo Nacional de Ciencias (CNCOO00012)
Finally, trying to split the (CNCOO00012) in a new column called unit_id I get the following:
In[4]:
new_df['institution'] = pd.DataFrame(new_df['institution'].str.split('(').tolist(),columns=['institution', 'unit_id'])
Out[4]:
------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-70d13206881c> in <module>
----> 1 new_df['institution'] = pd.DataFrame(new_df['institution'].str.split('(').tolist(),columns=['institution', 'unit_id'])
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
472 if is_named_tuple(data[0]) and columns is None:
473 columns = data[0]._fields
--> 474 arrays, columns = to_arrays(data, columns, dtype=dtype)
475 columns = ensure_index(columns)
476
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in to_arrays(data, columns, coerce_float, dtype)
459 return [], [] # columns if columns is not None else []
460 if isinstance(data[0], (list, tuple)):
--> 461 return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
462 elif isinstance(data[0], abc.Mapping):
463 return _list_of_dict_to_arrays(
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
491 else:
492 # list of lists
--> 493 content = list(lib.to_object_array(data).T)
494 # gh-26429 do not raise user-facing AssertionError
495 try:
pandas/_libs/lib.pyx in pandas._libs.lib.to_object_array()
TypeError: object of type 'NoneType' has no len()
What can I do to successfully achieve this task?
You can use assign with str.split as shown below, but the format of the text must be fixed.
df.assign(number = df.institution.str.split().str[0], \
unit_id = df.institution.str.split().str[-1])
Output:
institution number unit_id
0 1.1.2. Consejo Nacional de Ciencias (CNCOO00012) 1.1.2. (CNCOO00012)
Or, if you want to strip the () from unit_id, use:
df.assign(number = df.institution.str.split().str[0], \
unit_id = df.institution.str.split().str[-1].str.strip('()'))
institution number unit_id
0 1.1.2. Consejo Nacional de Ciencias (CNCOO00012) 1.1.2. CNCOO00012
If your data is tidy enough you can perform all three steps in one command using pd.Series.str.extract(), which can split a series of strings into columns by regex, i.e.:
df.institution.str.extract(r'(?P<number>[0-9.]+) (?P<institution>[A-Za-z ]+) \((?P<unit_id>[A-Z0-9]+)')
If the missing values cause a problem you can add dropna():
df.institution.dropna().str.extract(r'(?P<number>[0-9.]+) (?P<institution>[A-Za-z ]+) \((?P<unit_id>[A-Z0-9]+)')
Output:
number institution unit_id
0 1.1.2. Consejo Nacional de Ciencias CNCOO00012
Just a thought, but what about using named capture groups in a regular expression? For example, use the following after you have imported your CSV file:
df.iloc[:,0].str.extract(r'^(?P<number>[\d.]*)\s+(?P<institution>.*)\s+\((?P<unit_id>[A-Z\d]*)\)$')
This would expand your dataframe as such:
number institution unit_id
0 1.1.2. Consejo Nacional de Ciencias CNCOO00012
About the regular expression's pattern:
^ - A start-of-string anchor.
(?P<number>[\d.]*) - A named capture group, number, made up of zero or more characters (greedy) from a character class of dots and digits.
\s+ - One or more spaces.
(?P<institution>.*) - A named capture group, institution, made up of zero or more characters (greedy) other than newline.
\s+\( - One or more spaces followed by a literal opening parenthesis.
(?P<unit_id>[A-Z\d]*) - A named capture group, unit_id, made up of zero or more characters (greedy) from a character class of uppercase letters and digits.
\)$ - A closing parenthesis followed by an end-of-string anchor.
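To attach the three extracted columns back to the original dataframe, one possible usage (a sketch, assuming the single column is named institution as above):
extracted = df.iloc[:, 0].str.extract(r'^(?P<number>[\d.]*)\s+(?P<institution>.*)\s+\((?P<unit_id>[A-Z\d]*)\)$')
df = df.drop(columns='institution').join(extracted)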

Pandas - Iterating over an index in a loop

I have a weird interaction that I would need help with. Basically:
1) I have created a pandas dataframe that contains 1179 rows x 6 columns. One column holds street names, and the same value will have several duplicates (because each line represents a point, and each point is associated with a street).
2) I also have a list of all the streets in this pandas dataframe.
3) If I run this line, I get an output of all the rows matching that street name:
print(sub_df[sub_df.AQROUTES_3=='AvenueMermoz'])
Result :
FID AQROUTES_3 ... BEARING E_ID
983 983 AvenueMermoz ... 288.058014
984 984 AvenueMermoz ... 288.058014
992 992 AvenueMermoz ... 288.058014
1005 1005 AvenueMermoz ... 288.058014
1038 1038 AvenueMermoz ... 288.058014
1019 1019 AvenueMermoz ... 288.058014
However, if I run this command in a loop, with the strings from my list as the street name, it returns an empty dataframe:
x = ()
for names in pd_streetlist:
    print(names)
    x = names
    print(sub_df[sub_df.AQROUTES_3 == "'" + str(x) + "'"])
    x = ()
Returns:
RangSaint_Joseph
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
AvenueAugustin
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
and so on...
I can't figure out why. Anybody has an idea?
Thanks
I believe the issue is in this line:
print(sub_df[sub_df.AQROUTES_3 =="'"+str(x)+"'"])
To each name you unnecessarily add quote characters at the beginning and at the end, so that each valid street name (in your example AvenueMermoz) turns into "'AvenueMermoz'" (double quotes used here to enclose the single-quoted string).
As @busybear has commented, there is no need to cast to str either. So the corrected line would be:
print(sub_df[sub_df.AQROUTES_3 == x])
So you're adding quotation marks to the filter, which you shouldn't. Now you're filtering on 'AvenueMermoz' while you just want to filter on AvenueMermoz.
So
print(sub_df[sub_df.AQROUTES_3 == "'" + str(x) + "'"])
should become
print(sub_df[sub_df.AQROUTES_3 == str(x)])
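As a side note (not part of the original answers), if the goal is simply to print the rows for every street, groupby avoids filtering once per name (a sketch, assuming sub_df as in the question):
for name, group in sub_df.groupby('AQROUTES_3'):
    print(name)
    print(group)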
