Strings to columns using python

I have an entire table as a string, like below:
a= "id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"
Inside the string there are escaped characters like \xe9 that need to be cleaned out of the text.
My expected output is to convert the above string to a dataframe as below:
id date type status description
1 20-Jan-2019 cat1 active customer is under observation
2 18-Feb-2019 cat2 active customer is genuine
My code :
b = a.splitlines()
c = pd.DataFrame([sub.split(";") for sub in b])
I am getting the following output, but I need the first row as my header, and I also need to convert the escaped characters to proper text.
0 1 2 3 4 5 6
0 id date type status description None None
1 1 20-Jan-2019 cat1 active customer is underée observation None None
2 2 18-Feb-2019 cat2 active customer is genuine None None
Also, please note that it is creating extra columns with value None, which should not be the case.

Here is a bit of a hacky answer, but given that your question isn't really clear, this should hopefully be sufficient.
import pandas as pd
import re
a = "id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"
b = [line for line in a.split('\r\n') if line]  # split into rows, dropping the empty trailing entry
b[1:] = [re.sub('\xe9e', '', row) for row in b[1:]]  # get rid of that \xe9e issue in the data rows
df = pd.DataFrame([row.split(';') for row in b[1:]])  # make the dataframe from the data rows
# the list comprehensions generalize this if more rows are added to the string
df.columns = b[0].split(';')  # split the header row for the column names
df['id'] = range(1, len(b))   # regenerate the id column as integers
df
The resulting df is presumably what you meant by a dataframe.
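A more direct alternative, sketched below under the assumption that the whole table is already well-formed semicolon-separated text, is to hand the string to pd.read_csv through an in-memory buffer; the first row then becomes the header automatically and no extra None columns appear.
import io
import pandas as pd

a = "id;date;type;status;description\r\n1;20-Jan-2019;cat1;active;customer is under\xe9e observation\r\n2;18-Feb-2019;cat2;active;customer is genuine\r\n"

df = pd.read_csv(io.StringIO(a), sep=';')  # first row becomes the header, rows split on ';'
df['description'] = df['description'].str.replace('\xe9e', '', regex=False)  # drop the stray 'ée' (literal, not regex)
print(df)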

Related

How do I change the same string within a column and make it permanent using Pandas

I'm trying to change the strings "SLL" under the competitions column to "League", but when I tried this:
messi_dataset.replace("SLL", "League", regex=True)
it only changed the first "SLL" to "League", and then other strings that were "SLL" became "UCL". I have no idea why. I also tried changing regex=True to inplace=True, but no luck.
https://drive.google.com/file/d/1ldq6o70j-FsjX832GbYq24jzeR0IwlEs/view?usp=sharing
https://drive.google.com/file/d/1OeCSutkfdHdroCmTEG9KqnYypso3bwDm/view?usp=sharing
Suppose you have a dataframe as below:
import pandas as pd
import re
df = pd.DataFrame({'Competitions': ['SLL', 'sll','apple', 'banana', 'aabbSLL', 'ccddSLL']})
# write a regex pattern that replaces 'SLL'
# assuming a case-insensitive match
regex_pat = re.compile(r'SLL', flags=re.IGNORECASE)
df['Competitions'].str.replace(regex_pat, 'league', regex=True)
# Input DataFrame
Competitions
0 SLL
1 sll
2 apple
3 banana
4 aabbSLL
5 ccddSLL
Output:
0 league
1 league
2 apple
3 banana
4 aabbleague
5 ccddleague
Name: Competitions, dtype: object
Hope it clarifies.
Based on this answer, test this code:
messi_dataset['competitions'] = messi_dataset['competitions'].replace("SLL", "League")
Also, there are many different ways to do this, like this one that I tested:
messi_dataset.replace({'competitions': 'SLL'}, "League")
For cases where 'SLL' is part of another word:
messi_dataset.replace({'competitions': 'SLL'}, "League", regex=True)
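A note on making the change permanent (a minimal sketch; the column name 'Competitions' and the sample values here are assumptions, not taken from the linked files): replace() returns a new object, so either assign the result back or pass inplace=True.
import pandas as pd

messi_dataset = pd.DataFrame({'Competitions': ['SLL', 'UCL', 'SLL']})  # hypothetical sample data
# assign the result back so the replacement sticks
messi_dataset['Competitions'] = messi_dataset['Competitions'].replace('SLL', 'League')
# or, equivalently, modify in place:
# messi_dataset.replace({'Competitions': 'SLL'}, 'League', inplace=True)
print(messi_dataset)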

Separate several columns of data that contain hyphens, removing elements in Python

I have a dataset, df, where I would like to separate strings within Python.
Data
Type Id
aa - generation aa - generation01
aa_led - generation aa_led - generation01
ss - generation ss- generation01
Desired
Type Id
aa aa01
aa_led aa_led01
ss ss01
Doing
I am trying to incorporate this code into my script; however, it splits by hyphen and my column names are not preserved.
new = wordstring.strip('-').split('-')
Any suggestion is appreciated
Thank you
If you just want to remove '- generation' from every value in df, you can use applymap:
df = df.applymap(lambda x : x.replace('- generation', '').replace(' ',''))
OUTPUT:
Type Id
0 aa aa01
1 aa_led aa_led01
2 ss ss01
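For reference, a self-contained sketch of the applymap approach above, with sample data built to match the question (the exact spacing in the real df is an assumption; on pandas 2.1+, DataFrame.map is the equivalent name):
import pandas as pd

df = pd.DataFrame({
    'Type': ['aa - generation', 'aa_led - generation', 'ss - generation'],
    'Id': ['aa - generation01', 'aa_led - generation01', 'ss- generation01'],
})
# drop the '- generation' text, then remove any leftover spaces
df = df.applymap(lambda x: x.replace('- generation', '').replace(' ', ''))
print(df)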

Querying a list object from API and returning it into dataframe - issues with format

I have the below script that returns data as a list for each ticker (i). I set up an empty list, query the API function get_kline_data for each ticker, and pass each output into klines_list with the .extend function.
klines_list = []
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_list.extend([i, klines])
klines_list
klines_list then holds data in this format:
['REQ-ETH',
[['1619317500',
'0.0000491',
'0.0000491',
'0.0000491',
'0.0000491',
'5.1147',
'0.00025113177']],
'REQ-BTC',
[['1619317500',
'0.00000219',
'0.00000219',
'0.00000219',
'0.00000219',
'19.8044',
'0.000043371636']],
'XLM-BTC',
[['1619317500',
'0.00000863',
'0.00000861',
'0.00000863',
'0.00000861',
'653.5693',
'0.005629652673']]]
I then try to convert it into a dataframe:
import pandas as py
df = py.DataFrame(klines_list)
And this is the result;
0
0 REQ-ETH
1 [[1619317500, 0.0000491, 0.0000491, 0.0000491,...
2 REQ-BTC
3 [[1619317500, 0.00000219, 0.00000219, 0.000002...
4 XLM-BTC
5 [[1619317500, 0.00000863, 0.00000861, 0.000008..
The structure of the DF is incorrect and it seems to be due to the way I have put my list together.
I would like the quantitative data in a column corresponding to the correct entry in list a, not in rows. Also, the ticker data from list a ("REQ-ETH", "REQ-BTC", etc.) should be in a separate column. What would be a good way to go about restructuring this?
Edit: #Ynjxsjmh
This is the output when following the suggestion below for appending a dictionary within the for loop
REQ-ETH REQ-BTC XLM-BTC
0 [1619317500, 0.0000491, 0.0000491, 0.0000491, ... NaN NaN
1 NaN [1619317500, 0.00000219, 0.00000219, 0.0000021... NaN
2 NaN NaN [1619317500, 0.00000863, 0.00000861, 0.0000086...
pandas.DataFrame() can accept a dict. It will use the dict keys as column headers and the dict values as column values.
import pandas as pd
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
klines_data = {}
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_data[i] = klines[0]
    #           ^
    #           |
    #           Add a key to klines_data
df = pd.DataFrame(klines_data)
print(df)
REQ-ETH REQ-BTC XLM-BTC
0 1619317500 1619317500 1619317500
1 0.0000491 0.00000219 0.00000863
2 0.0000491 0.00000219 0.00000861
3 0.0000491 0.00000219 0.00000863
4 0.0000491 0.00000219 0.00000861
5 5.1147 19.8044 653.5693
6 0.00025113177 0.000043371636 0.005629652673
If the lengths of the klines are not equal, you can use:
df = pd.DataFrame.from_dict(klines_data, orient='index').T
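If you would rather have the ticker in its own column with one row per kline, a sketch along these lines should also work (it reuses client from the question; the OHLCV-style column names below are assumptions, not the API's official field names):
import pandas as pd

a = ["REQ-ETH", "REQ-BTC", "XLM-BTC"]
rows = []
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)  # client is the API client from the question
    for k in klines:
        rows.append([i] + k)  # prepend the ticker to each kline row

# seven fields per kline plus the ticker; the names below are assumed for illustration
cols = ['ticker', 'time', 'open', 'close', 'high', 'low', 'volume', 'turnover']
df = pd.DataFrame(rows, columns=cols)
print(df)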

Extract prefix from string in dataframe column where exists in a list

Looking for some help.
I have a pandas dataframe column, and I want to extract the prefix when that prefix appears in a separate list.
pr_list = ['1 FO-','2 IA-']
Column in df is like
PartNumber
ABC
DEF
1 FO-BLABLA
2 IA-EXAMPLE
What I am looking for is to extract the prefix where present, put it in a new column, and leave the rest of the string in the original column.
PartNumber Prefix
ABC
DEF
BLABLA 1 FO-
EXAMPLE 2 IA-
I have tried some things like str.startswith, but I'm a bit of a Python novice and wasn't able to get it to work.
much appreciated
EDIT
Both solutions below work on the test data; however, I am getting an error:
error: nothing to repeat at position 16
which suggests something is askew in my dataset. I'm not sure what position 16 refers to, but looking at position 16 in both the prefix list and the PartNumber column, nothing seems out of the ordinary.
EDIT 2
I have traced it to an * in pr_list, which seems to be throwing it. Is * a reserved character? Is there a way to escape it so it is read as literal text?
You can try:
df['Prefix']=df.PartNumber.str.extract(r'({})'.format('|'.join(pr_list))).fillna('')
df.PartNumber=df.PartNumber.str.replace('|'.join(pr_list),'',regex=True)
print(df)
PartNumber Prefix
0 ABC
1 DEF
2 BLABLA 1 FO-
3 EXAMPLE 2 IA-
Maybe it's not exactly what you are looking for, but it may help.
import pandas as pd
pr_list = ['1 FO-','2 IA-']
df = pd.DataFrame({'PartNumber':['ABC','DEF','1 FO-BLABLA','2 IA-EXAMPLE']})
extr = '|'.join(x for x in pr_list)
df['Prefix'] = df['PartNumber'].str.extract('('+ extr + ')', expand=False).fillna('')
df['PartNumber'] = df['PartNumber'].str.replace('|'.join(pr_list), '', regex=True)
df
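Regarding EDIT 2: * is a regex metacharacter, and "nothing to repeat" is the error re raises when it appears with nothing before it to repeat. Escape each prefix with re.escape before joining, so it is treated as literal text. A sketch, with a made-up prefix containing * purely for illustration:
import re
import pandas as pd

pr_list = ['1 FO-', '2 IA-', '3*X-']  # '3*X-' is a hypothetical prefix with a literal '*'
df = pd.DataFrame({'PartNumber': ['ABC', '1 FO-BLABLA', '3*X-WIDGET']})

extr = '|'.join(re.escape(x) for x in pr_list)  # re.escape turns '*' into '\*'
df['Prefix'] = df['PartNumber'].str.extract('(' + extr + ')', expand=False).fillna('')
df['PartNumber'] = df['PartNumber'].str.replace(extr, '', regex=True)
print(df)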

Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

Similar to this question, but my CSV has a slightly different format. Here is an example:
id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
I think the double quotation mark in the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:
id employee details createdAt Unnamed: 1 Unnamed: 2
1 John {Country":"USA" Salary:5000 Review:null}" 2018-09-01
2 Sarah {Country":"Australia" Salary:6000 Review:"Hardworking"}" 2018-09-05
My desired output:
id employee details createdAt
1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
2 Sarah {"Country":"Australia","Salary":6000,"Review":"Hardworking"} 2018-09-05
I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?
As an alternative approach, you could read the file in manually, parse each row correctly, and use the resulting data to construct the dataframe. This works by splitting each row both forwards and backwards to get the non-problematic columns and then taking the remaining part:
import pandas as pd
data = []
with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        split = row.split(',', 2)
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This would display your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
I have reproduced your file
With
df = pd.read_csv('e1.csv', index_col=None )
print (df)
Output
id emp details createdat
0 1 john "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.
Here is a less than optimal option:
df = pd.read_csv('e1.csv', sep='##$%^', engine='python')
header = df.columns[0]
print(df)
Why sep='##$%^' ? This is just garbage that allows you to read the file with no sep character. It could be any random character and is just used as a means to import the data into a df object to work with.
df looks like this:
id,employee,details,createdAt
0 1,John,"{"Country":"USA","Salary":5000,"Review...
1 2,Sarah,"{"Country":"Australia", "Salary":6000...
Then you could use str.extract to apply regex and expand the columns:
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)
result.columns = header.strip().split(',')
print(result)
result is:
id employee details createdAt
0 1 John "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 Sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
If you need the starting and ending quotes stripped off of the details string values, you could do:
result['details'] = result['details'].str.strip('"')
If the details items need to be dicts instead of strings, you could do:
from json import loads
result['details'] = result['details'].apply(loads)
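Putting the pieces together, a sketch that combines the manual row-splitting from the first answer with json.loads, so the details column ends up holding dicts (it assumes the sample shown above is saved as e1.csv):
import json
import pandas as pd

data = []
with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        split = row.split(',', 2)                                        # id, employee, rest
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]  # details, createdAt
        data.append(split[0:2] + rsplit)

df = pd.DataFrame(data[1:], columns=data[0])
df['details'] = df['details'].apply(json.loads)  # JSON null becomes Python None
print(df)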
