I have a column that I am trying to split into multiple columns in Python. The data in the column looks like this:
1;899.618000;2;0.551582;7;93.643914;8;12.00000
I need to split this column on every other delimiter (;) into separate columns, so the first of those columns would look like the below.
Col1
1;899.618000
Assuming the data is consistently float-like, you can split with a regex lookahead that only matches a separator followed by an integer (no decimal point) and another separator:
import re

s = '1;899.618000;2;0.551582;7;93.643914;8;12.00000'
re.split(r';(?=\d+;)', s)
output:
['1;899.618000', '2;0.551582', '7;93.643914', '8;12.00000']
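If the values live in a DataFrame column, the same pattern works with str.split, which treats a multi-character pattern as a regular expression (on pandas >= 1.4 you can also pass regex=True explicitly). A minimal sketch, assuming the column is named Col, since the real column name is not given in the question:

import pandas as pd

df = pd.DataFrame({'Col': ['1;899.618000;2;0.551582;7;93.643914;8;12.00000']})
# split only on the separators that precede an integer and another ';'
parts = df['Col'].str.split(r';(?=\d+;)', expand=True)
parts.columns = [f'Col{i + 1}' for i in range(parts.shape[1])]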
this should do the trick:
s = "1;899.618000;2;0.551582;7;93.643914;8;12.00000"
l1 = s.split(";")
# join the pieces two at a time, restoring the ';' inside each pair
l2 = [l1[i] + ";" + l1[i+1] for i in range(0, len(l1), 2)]
the value of l2 will be
['1;899.618000', '2;0.551582', '7;93.643914', '8;12.00000']
Below is my data in SQL Server.
After reading the data in Python, it becomes like this.
I use the code below to split the value into multiple columns:
# 1. To split a single array column into multiple columns based on '\t'
df[['v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','v11','v12','v13','v14','v15',\
'v16','v17','v18','v19','v20','v21','v22','v23','v24','v25','v26','v27','v28',\
'v29','v30','v31','v32','v33','v34','v35','v36','v37','v38','v39','v40','v41']] = df['_VALUE'].str.split(pat="\t", expand=True)
# 2. To remove the '\r\n' from the last column
df['v41'] = df['v41'].replace(r'\s+|\\n', ' ', regex=True)
But some data sets have more values, e.g. 100 columns, and the above code gets very long; I would have to write v1 through v100. Is there a simpler way to do this?
You can replace the hardcoded list of column names from your code with one generated by a list comprehension (the range starts at 1 to match your v1..v100 naming, and must produce exactly as many names as the split produces columns):
df[[f'v{x}' for x in range(1, 101)]] = df['_VALUE'].str.split(pat="\t", expand=True)
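If you don't want to hardcode the count at all, you can let the split determine it. A minimal sketch under that assumption, keeping the question's v1, v2, ... naming:

# split once, then name the resulting columns after however many there are
parts = df['_VALUE'].str.split(pat='\t', expand=True)
parts.columns = [f'v{i + 1}' for i in range(parts.shape[1])]
df = df.join(parts)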
I have a dataframe (df) that contains a column with urls. I want to filter out the values that do not contain a '.'.
I tried this:
df = df[~df['Domain'].str.contains('.')]
But the results still have some values with no '.' in them. Any advice on how to filter on the '.' specifically?
str.contains treats the input as a regular expression by default. Try escaping the dot:
df = df[~df['Domain'].str.contains(r'\.')]
Or, turn off the regex input by setting the regex flag to false:
df = df[~df['Domain'].str.contains('.', regex=False)]
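A quick sketch with made-up data shows the difference between the three calls:

import pandas as pd

df = pd.DataFrame({'Domain': ['example.com', 'localhost']})
df[~df['Domain'].str.contains('.')]               # '.' matches any character, so nothing survives
df[~df['Domain'].str.contains(r'\.')]             # keeps 'localhost', which has no literal dot
df[~df['Domain'].str.contains('.', regex=False)]  # same result, without regex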
This is my pandas DataFrame. In the index column I want to keep only the values after the double underscore (__) and remove the rest.
Use str.split with the parameter n=1 to split only on the first separator (in case there are multiple __) and select the second element of each list:
df['index'].str.split('__', n=1).str[1]
Or use a list comprehension if there are no missing values and performance is important:
df['last'] = [x.split('__', 1)[1] for x in df['index']]
df['index'].apply(lambda x: x.split('__')[-1]) will also do the trick (note this keeps the part after the last __, which differs from n=1 when there is more than one __).
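A small sketch with made-up values illustrates the difference between the two approaches when a value contains more than one __:

import pandas as pd

df = pd.DataFrame({'index': ['a__b', 'x__y__z']})
df['index'].str.split('__', n=1).str[1]          # 'b', 'y__z' (everything after the first __)
df['index'].apply(lambda x: x.split('__')[-1])   # 'b', 'z' (only the part after the last __)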
From an itertuples loop, I read in some row values, converted them into Series data, changed the astype to a string, and used concat to add them to dff, as shown below.
In [24]: dff
Out[24]:
SRD Aspectno
0 9450 [9450.01, 9450.02]
1 9880 [9880.01, 9880.02, 9880.03]
When I apply the following command, it strips out all the data. I have used the split command before; it may have something to do with the square brackets, but using str.strip or str(0) also removes all the data.
In [25]: splitdff = dff['Aspectno'].str.split(',', expand=True)
In [26]: splitdff
Out[26]:
0
0 NaN
1 NaN
What am I doing wrong?
Also, when converting the data after reading the rows, how do I get the data in row 0 to be shifted to the left, i.e., [9450.01, 9450.02] shifted over to the left by one column?
It looks like you're trying to split a list on a comma, but .str.split is a method intended for strings, so it returns NaN for non-string values. Try this to break the values out into their own columns:
import pandas as pd
...
dff['Aspectno'].apply(pd.Series)
It will give you a DataFrame with the entries in columns. The lists are different lengths, so there will be a number of columns equal to the length of the longest list. If you know that length you could do this:
dff[['col1','col2','col3']] = dff['Aspectno'].apply(pd.Series)
The code dff['Aspectno'] selects the Series Aspectno, whose values are lists like [9450.01, 9450.02] rather than strings, so the .str.split on ',' has nothing to work on and returns NaN.
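Reconstructing the frame from the question's output, apply(pd.Series) expands the lists into columns, padding the shorter list with NaN:

import pandas as pd

dff = pd.DataFrame({'SRD': [9450, 9880],
                    'Aspectno': [[9450.01, 9450.02], [9880.01, 9880.02, 9880.03]]})
expanded = dff['Aspectno'].apply(pd.Series)
#          0        1        2
# 0  9450.01  9450.02      NaN
# 1  9880.01  9880.02  9880.03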
I have a DataFrame df with a string column, and I was trying to remove a list of special values from this column.
For example, if the column 'number' is onE1, I want it changed to 1; if it is FOur4, I want it changed to 4.
I used the following code:
for i in ['onE', 'TwO', 'ThRee', 'FOur']:
    print(i)
    df['new_number'] = df['number'].str.replace(i, '')
Although print(i) shows that i goes through the whole list of strings, the column 'new_number' only has 'FOur' removed from 'number'; the other strings 'onE', 'TwO', 'ThRee' are still there. That is, onE1 is still onE1, but FOur4 changed to 4 in 'new_number'.
So what is wrong with this piece of code?
To get just the numbers from the strings in the DataFrame you can keep only the digit characters, applied per element:
df['new_number'] = df['number'].apply(lambda s: ''.join(c for c in s if c.isdigit()))
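The same result with a vectorized regex, assuming all the values are strings, is to strip every non-digit character:

df['new_number'] = df['number'].str.replace(r'\D', '', regex=True)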
I found a similar post to this question
pandas replace (erase) different characters from strings
The loop fails because each iteration recomputes new_number from the original number column, so only the last replacement ('FOur') survives. We can use a regex alternation to do all the replacements in one pass:
df['new_number'] = df['number'].str.replace('onE|TwO|ThRee|FOur', '', regex=True)
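Alternatively, here is a minimal sketch of the loop from the question, fixed so earlier removals are not lost: build new_number once, then keep replacing into the same column:

df['new_number'] = df['number']
for i in ['onE', 'TwO', 'ThRee', 'FOur']:
    df['new_number'] = df['new_number'].str.replace(i, '', regex=False)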