Pandas Column Split (Array) - python

Below is my data in SQL Server.
After reading the data into Python it looks like this.
I use the code below to split the value into multiple columns:
# 1. To split single array column to multiple column based on '\t'
df[['v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','v11','v12','v13','v14','v15',\
'v16','v17','v18','v19','v20','v21','v22','v23','v24','v25','v26','v27','v28',\
'v29','v30','v31','v32','v33','v34','v35','v36','v37','v38','v39','v40','v41']] = df['_VALUE'].str.split(pat="\t", expand=True)
# 2. To remove the '\r\n' from the last column
df['v41'] = df['v41'].replace(r'\s+|\\n', ' ', regex=True)
But some data sets have more values, e.g. 100 columns, so the code above gets very long; I would have to write v1 through v100. Is there a simpler way to do this?

You can replace the hardcoded list of column names with one that is generated for you by a list comprehension, like this:
df[[f'v{x}' for x in range(1, 101)]] = df['_VALUE'].str.split(pat="\t", expand=True)
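If the number of parts varies between data sets, you can also let pandas determine the column count and name the columns afterwards. A minimal sketch, assuming the same _VALUE column and the v1..vN naming from the question:
import pandas as pd

# Split once, then derive the column names from the result's width
split_df = df['_VALUE'].str.split(pat="\t", expand=True)
split_df.columns = [f'v{i + 1}' for i in range(split_df.shape[1])]
# str.strip() removes the trailing '\r\n' without hardcoding the last column's name
split_df.iloc[:, -1] = split_df.iloc[:, -1].str.strip()
df = df.join(split_df)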

Related

Splitting column into multiple columns every other delimiter in python

I have a column that I am trying to split into multiple columns in Python. The data in the column looks like this:
1;899.618000;2;0.551582;7;93.643914;8;12.00000
I need to split this column at every other delimiter (;) into separate columns, so it looks like the below.
Col1          Col2        Col3         Col4
1;899.618000  2;0.551582  7;93.643914  8;12.00000
Assuming the data is consistently float-like, you can split with a regex that only matches a ';' followed by an integer index and another ';' (the float values fail the lookahead because of the decimal point):
import re

s = '1;899.618000;2;0.551582;7;93.643914;8;12.00000'
re.split(r';(?=\d+;)', s)
Output:
['1;899.618000', '2;0.551582', '7;93.643914', '8;12.00000']
This should do the trick:
s = "1;899.618000;2;0.551582;7;93.643914;8;12.00000"
l1 = s.split(";")
l2 = [l1[i] + ";" + l1[i+1] for i in range(0, len(l1), 2)]
The value of l2 will be
['1;899.618000', '2;0.551582', '7;93.643914', '8;12.00000']
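Since the original question is about a DataFrame column, here is a minimal sketch applying the same regex split column-wise (the column name 'data' is an assumption; str.split(..., regex=True) requires pandas 1.4 or newer):
import pandas as pd

df = pd.DataFrame({'data': ['1;899.618000;2;0.551582;7;93.643914;8;12.00000']})
# Split only at a ';' that is followed by an integer index and another ';'
pairs = df['data'].str.split(r';(?=\d+;)', regex=True, expand=True)
pairs.columns = [f'Col{i + 1}' for i in range(pairs.shape[1])]
df = df.join(pairs)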

Python sets stored as string in a column of a pandas dataframe

I have a pandas dataframe where one column contains sets of strings (each row is a single set of strings). However, when I save this dataframe to csv and later read it back into a pandas dataframe, each set of strings in this column seems to have been saved as a single string. For example, the value in this particular row should be a set of strings, but it seems to have been read in as one long string:
"{'+-0-', '0---', '+0+-', '0-0-', '++++', '+++0', '+++-', '+---', '0+++', '0++0', '0+00', '+-+-', '000-', '+00-'}"
I need to access this data as a Python set of strings. Is there a way to turn this back into a set? Or, better yet, to have pandas read this back in as a set?
Wrapping the string in set() will not turn it back into a set of strings: set() iterates over the string, so you would get a set of individual characters. Since the saved value is a valid Python set literal, you can parse it with ast.literal_eval:
import ast

string = "{'+-0-', '0---', '+0+-', '0-0-', '++++', '+++0', '+++-', '+---', '0+++', '0++0', '0+00', '+-+-', '000-', '+00-'}"
new_set = ast.literal_eval(string)  # an actual set of strings
I think you could use a different separator while converting dataframe to csv.
import pandas as pd
df = pd.DataFrame(["{'Ramesh','Suresh','Sachin','Venkat'}"],columns=['set'])
print('Old df \n', df)
df.to_csv('mycsv.csv', sep= ';', index=False)
new_df = pd.read_csv('mycsv.csv', sep= ';')
print('New df \n',new_df)
Output:
Old df 
                                      set
0  {'Ramesh','Suresh','Sachin','Venkat'}
New df 
                                      set
0  {'Ramesh','Suresh','Sachin','Venkat'}
Note that the value is still read back as a string, so you still need a conversion step like the one below.
You can use Series.apply, I think:
Let's say your column of sets was called column_of_sets. Assuming you've already read the csv, now do this to convert back to sets.
df['column_of_sets'] = df['column_of_sets'].apply(eval)
I'm taking eval from @Cabara's comment. I think it is the best bet.

Remove rows containing blank space in python data frame

I imported a csv file into Python (using a pandas data frame) and there are some missing values in the CSV file. In the data frame I have rows like the following:
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements like the last elements in the row above; nothing works. I do not know whether the above is categorized as white space, an empty string, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I tried the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my dataframe is 100 by 1. When I import it from the CSV file, all the columns are combined into one. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a csv into a pandas DataFrame, you normally want each field to go into its own column, precisely to be able to process the column values individually later. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because each row only contains one single column, but rather rows containing consecutive commas (,,). So you should use:
df = df.drop(df[df.iloc[:, 0].str.contains(',,')].index)
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis=1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header=None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header=None with read_csv.
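As a self-contained illustration of that approach (assuming the sample row from the question is the entire file), the trailing commas come in as NaN columns, which can then be dropped:
import io
import pandas as pd

csv_text = '08,63.40,86.21,63.12,72.78,,\n'
result = pd.read_csv(io.StringIO(csv_text), header=None)
# dropna(axis=1) is equivalent to the notnull() column selection for this data
result = result.dropna(axis=1)  # drops the two empty trailing columns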

Why does the split command delete all data from the dataframe

In an itertuples loop, I read in some row values, converted them into Series data, changed the astype to string, and used concat to add them to dff, as shown below.
In [24]: dff
Out[24]:
    SRD                     Aspectno
0  9450           [9450.01, 9450.02]
1  9880  [9880.01, 9880.02, 9880.03]
When I apply the following command, it strips out all the data. I have used the split command before; it may have something to do with the square brackets, but using str.strip or str(0) also removes all the data.
In [25]: splitdff = dff['Aspectno'].str.split(',', expand = True)
In [26]: splitdff
Out[26]:
0
0 NaN
1 NaN
What am I doing wrong?
Also, how do I get the data in row 0 to be shifted to the left, i.e., [9450.01, 9450.02] shifted over to the left by one column?
It looks like you're trying to split a list on a comma, but str.split is a method intended for strings. Try this to break the values out into their own columns:
import pandas as pd
...
dff['Aspectno'].apply(pd.Series)
It will give you a DataFrame with the entries in columns. The lists are different lengths, so there will be a number of columns equal to the length of the longest list. If you know that length you could do this:
dff[['col1','col2','col3']] = dff['Aspectno'].apply(pd.Series)
The code dff['Aspectno'] selects the Series Aspectno, whose values are the lists [9450.01, 9450.02] and [9880.01, 9880.02, 9880.03]; .str.split(',') returns NaN for every row because the values are lists, not strings, so there is nothing to split.
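An alternative to apply(pd.Series) that is usually faster is to build the expansion directly from the list values. A sketch using the data from the question:
import pandas as pd

dff = pd.DataFrame({'SRD': [9450, 9880],
                    'Aspectno': [[9450.01, 9450.02],
                                 [9880.01, 9880.02, 9880.03]]})
# Rows with shorter lists are padded with NaN on the right
expanded = pd.DataFrame(dff['Aspectno'].tolist(), index=dff.index)
dff = dff.join(expanded)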

Extract text enclosed between a delimiter and store it as a list in a separate column

I have a pandas dataframe with a text column in the format below. Some values/text are embedded between ## markers. I want to find the text between ## pairs and extract it into a separate column as a list.
##fare_curr.currency####based_fare_90d.price##
htt://www.abcd.lol/abcd-Search?from:##based_best_flight_fare_90d.air##,to:##mbased_90d.water##,departure:##mbased_90d.date_1##TANYT&pas=ch:0Y&mode=search
Consider the above two strings to be two rows of the same column. I want to get a new column with the list [fare_curr.currency, based_fare_90d.price] in the first row and [based_best_flight_fare_90d.air, mbased_90d.water, mbased_90d.date_1] in the second row.
Given this df:
import pandas as pd

df = pd.DataFrame({'data': [
    '##fare_curr.currency####based_fare_90d.price##',
    'htt://www.abcd.lol/abcd-Search?from:##based_best_flight_fare_90d.air##,'
    'to:##mbased_90d.water##,departure:##mbased_90d.date_1##TANYT&pas=ch:0Y&mode=search']})
You can get the desired result in a new column using:
df['new'] = pd.Series(df.data.str.extractall('##(.*?)##').unstack().values.tolist())
You get
                                                 data                                               new
0     ##fare_curr.currency####based_fare_90d.price##  [fare_curr.currency, based_fare_90d.price, None]
1  htt://www.abcd.lol/abcd-Search?from:##based_be...  [based_best_flight_fare_90d.air, mbased_90d.wa...
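If you want to avoid the None padding that unstack() introduces for rows with fewer matches, str.findall is a simpler alternative; continuing from the df above, each row gets exactly its own matches:
# Each row gets the list of captured groups, with no None padding
df['new'] = df['data'].str.findall(r'##(.*?)##')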
