I have a large string that looks like this:
'1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n ....'
The issue is that I can't split the string on one or two spaces: between 'Start Date' and 'str_date' there are 2 spaces, but the next line might have 3, and another line might have only 1 space as the separator. This makes it very hard to build the DataFrame I want. Is there a way to do this? Thanks.
To get a list of all the words that contain an underscore (as you requested in the comments), you can use a regular expression:
import re
s = '1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n ....'
list(map(re.Match.group, re.finditer(r'\w+_.\w+', s)))
output:
['str_date', 'cal_nt', 'cal_Rate_td']
or you can use a list comprehension:
[e for e in s.split() if '_' in e]
output:
['str_date', 'cal_nt', 'cal_Rate_td']
To get a DataFrame from your string you can use the information above; the underscored words mark each row's third field:
import re
import pandas as pd

s = '1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n'
third_fields = [e for e in s.split() if '_' in e]
rows = []
for third_field, row in zip(third_fields, s.split('\n')):
    current_row = []
    row = row.strip()
    first_field = re.search(r'\d+\b', row).group()
    current_row.append(first_field)
    # remove the first field from the line
    row = row[len(first_field):].strip()
    second_field, rest_of_fields = row.split(third_field)
    parsed_fields = [e.group() for e in re.finditer(r'\b[\w\d]+\b', rest_of_fields)]
    current_row.extend([second_field.strip(), third_field, *parsed_fields])
    rows.append(current_row)

pd.DataFrame(rows)
output:
   0                       1            2  3   4  5
0  1              Start Date     str_date  B  10  C
1  2    Calculation notional       cal_nt  C  10  0
2  3  Calculation RATE Today  cal_Rate_td  C   9  R
Like kederrac's answer, you can use a regex to split the fields:
import re
s = "1 Start Date str_date B 10 C "
l = re.compile(r"\s+").split(s.strip())
# output ['1', 'Start', 'Date', 'str_date', 'B', '10', 'C']
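If the goal is a DataFrame, the same split can be applied line by line; a sketch, not from the original answer (note that multi-word descriptions such as 'Start Date' still land in separate cells, so this only helps when fields are single tokens):
import re
import pandas as pd

s = '1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n'
# split every line on any run of whitespace; rows can end up with different lengths
rows = [re.split(r"\s+", line.strip()) for line in s.strip().split('\n')]
df = pd.DataFrame(rows)  # shorter rows are padded with None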
I have a pandas dataframe consisting of a single column. I want to split this column on the '&' sign and add the data to the right of each '=' sign as a new column. Examples are below.
The dataframe I have:
tags
0 letter1=A&letter2=B&letter3=C
1 letter1=D&letter2=E&letter3=F
2 letter1=G&letter2=H&letter3=I
3 letter1=J&letter2=K&letter3=L
4 letter1=M&letter2=N&letter3=O
5 letter1=P&letter2=R&letter3=S
. .
. .
The dataframe I want to get:
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
.
.
I tried to do something with this code snippet.
columnname= df["tags"][0].split("&")[i].split("=")[0]
value =df["tags"][0].split("&")[i].split("=")[1]
But I'm not sure how to do it for the whole dataframe. I am looking for a fast and stable way.
Thanks in advance,
You can do this:
import pandas as pd
tags = [
"letter1=A&letter2=B&letter3=C",
"letter1=D&letter2=E&letter3=F",
"letter1=G&letter2=H&letter3=I",
"letter1=J&letter2=K&letter3=L",
"letter1=M&letter2=N&letter3=O",
"letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
df["letter1"] = df["tags"].apply(lambda x: x.split("&")[0].split("=")[-1])
df["letter2"] = df["tags"].apply(lambda x: x.split("&")[1].split("=")[-1])
df["letter3"] = df["tags"].apply(lambda x: x.split("&")[2].split("=")[-1])
df = df[["letter1", "letter2", "letter3"]]
df
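If there were many more keys, the three near-identical apply lines could be generalized with a loop; a sketch under the same assumption that the fields always appear in the same order (starting again from the frame that still has the tags column):
for i, name in enumerate(["letter1", "letter2", "letter3"]):
    # i=i freezes the loop variable at lambda definition time
    df[name] = df["tags"].apply(lambda x, i=i: x.split("&")[i].split("=")[-1])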
Split into separate columns via str.split, using & as the separator:
step1 = df.tags.str.split("&", expand=True)
Get the new columns from the first row of step1:
new_columns = step1.loc[0, :].str[:-2].array
Get rid of the letter1= prefix in each column, set the new_columns as the header:
step1.set_axis(new_columns, axis='columns').transform(lambda col: col.str[-1])
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
5 P R S
d = list(df["tags"])
r = {}
for i in d:
    for ele in i.split("&"):
        if ele.split("=")[0] in r.keys():
            r[ele.split("=")[0]].append(ele.split("=")[1])
        else:
            r[ele.split("=")[0]] = []
            r[ele.split("=")[0]].append(ele.split("=")[1])
df = pd.DataFrame({i: pd.Series(r[i]) for i in r})
print(df)
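The same idea can be condensed by parsing each row into a dict and letting the DataFrame constructor align the columns; a sketch, starting from the original frame with the tags column (this also copes with keys appearing in a different order from row to row):
# build one dict per row, e.g. {'letter1': 'A', 'letter2': 'B', 'letter3': 'C'}
parsed = df["tags"].apply(lambda s: dict(pair.split("=") for pair in s.split("&")))
df = pd.DataFrame(parsed.tolist())
print(df)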
Using regex
import pandas as pd
import re
tags = [
"letter1=A&letter2=B&letter3=C",
"letter1=D&letter2=E&letter3=F",
"letter1=G&letter2=H&letter3=I",
"letter1=J&letter2=K&letter3=L",
"letter1=M&letter2=N&letter3=O",
"letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
pattern = re.compile(r"=(\w+)")  # capture the value after each '='
df['letter1'], df['letter2'], df['letter3'] = zip(*df["tags"].apply(lambda x: pattern.findall(x)))
Output
tags letter1 letter2 letter3
0 letter1=A&letter2=B&letter3=C A B C
1 letter1=D&letter2=E&letter3=F D E F
2 letter1=G&letter2=H&letter3=I G H I
3 letter1=J&letter2=K&letter3=L J K L
4 letter1=M&letter2=N&letter3=O M N O
5 letter1=P&letter2=R&letter3=S P R S
Column 'Amount' is a string. I want to change it to float so that I can input these rows into a later calculation.
In [1] import pandas as pd
data = pd.read_csv('input.csv')
data
Out [1]
ID Amount Cost
0 A 9,596,249.09 1000000
1 B 38,385,668.57 50000
2 C 351,740.00 100
3 D - 23
4 E 178,255.96 999
Note that 'D' has an Amount of ' - ' rather than zero.
First I clean up the bad data:
In [2]
data['Amount'] = data['Amount'].replace(' - ', 0)
data
Out [2]
ID Amount Cost
0 A 9,596,249.09 1000000
1 B 38,385,668.57 50000
2 C 351,740.00 100
3 D 0 23
4 E 178,255.96 999
Then I try to convert to float using 2 methods, both unsuccessful:
In [3]
pd.Series(data['Amount']).astype(float)
Out [3]
ValueError: could not convert string to float: '9,596,249.09'
and:
In [4]
pd.to_numeric(data['Amount'])
Out [4]
ValueError: Unable to parse string "9,596,249.09" at position 0
In my desperation I attempt to loop through the rows:
In [5]
def cleandata(x):
return float(x)
data['Amount'] = data['Amount'].apply(cleandata)
Out [5]
ValueError: could not convert string to float: '9,596,249.09'
Appreciate any advice you could give. I have tried for hours. Thank you.
try:
data = pd.read_csv('input.csv', thousands=',', decimal='.')
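With this approach the ' - ' cells will still come through as text and keep the column as object dtype, so it can also help to flag them as missing while reading; a sketch (the exact na_values strings are assumptions and have to match what is actually in your file):
import pandas as pd

# thousands=',' strips the comma separators; na_values turns the dash entries into NaN
data = pd.read_csv('input.csv', thousands=',', na_values=[' - ', '-'])
data['Amount'] = data['Amount'].fillna(0)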
You should get rid of the commas; that should fix the problem. Try this:
data['Amount'] = data['Amount'].apply(lambda x: x.replace(",", "")) # take the commas away
data['Amount'] = data.Amount.astype(float)
Creating a list (y) seems to work.
In [1]:
import pandas as pd
data = pd.read_csv('input.csv')
y = list(data["Amount"])
y = [item.replace(" - " , '0') for item in y]
y = [item.replace("," , '') for item in y]
data["Amount"] = y
data["Amount"] = pd.to_numeric(data['Amount'], errors='coerce')
data['Result'] = data["Amount"] - data["Cost"]
data
Out [1]:
ID Amount Cost Result
0 A 9596249.09 1000000 8596249.09
1 B 38385668.57 50000 38335668.57
2 C 351740.00 100 351640.00
3 D 0.00 23 -23.00
4 E 178255.96 999 177256.96
There is certainly a better and more Pythonic way to write this, I'm sure.
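For what it's worth, a more compact vectorized variant of the same cleanup (a sketch using the .str accessor instead of intermediate lists, starting from the raw string column):
amount = data['Amount'].str.replace(',', '').str.strip()
data['Amount'] = pd.to_numeric(amount.replace('-', '0'))  # exact-match replace of the bare dash
data['Result'] = data['Amount'] - data['Cost']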
Suppose a dataframe consists of two columns, A = {1, 2, 3} and B = {'a b c d', 'e f g h', 'i j k l'}. For A = 2, I would like to change the corresponding entry in column B to 'e f h' (i.e. extract the first, second and last words; this is not the same as dropping the third word, since a string can have more than four words).
It is easy to extract single words using df.loc[df['colA']==2,'colB'].str.split().str[x], where x = 0, 1 or -1, but I'm having difficulty joining the three words back into one string efficiently. The most efficient way I can think of is provided below. Is there a better way of achieving what I'm trying to do? Thanks.
y = lambda x : df.loc[df['colA']==2,'colB'].str.split().str[x]
df.loc[df['colA']==2,'colB'] = y(0) + ' ' + y(1) + ' ' + y(-1)
Expected and actual result:
A B
1 a b c d
2 e f h
3 i j k l
You were pretty close to the solution; the only problem is that str[x] returns a value wrapped in a Series object. You can fix this by extracting the value from the Series, as shown:
y = lambda x : df.loc[df['colA']==2,'colB'].str.split().str[x].values[0]
df.loc[df['colA']==2,'colB'] = y(0) + ' ' + y(1) + ' ' + y(-1)
You can also achieve the same by making use of the apply function
df.loc[df['colA']==2, 'colB'] = df.loc[df['colA']==2,'colB'].apply(lambda x: ' '.join(x.split()[0:2] + [x.split()[-1]]))
How about this:
df = pd.DataFrame(data = {'A': [1,2,3],
'B': ['a b c d', 'e f g h', 'i j k l']})
# character-based slicing; this only works because every word is a single character
y = lambda x : df.loc[df['A']==2,'B'].str[0:2*x+2] + df.loc[df['A']==2,'B'].str[-1]
df.loc[df['A']==2,'B'] = y(1)
Then df is the wanted:
A B
0 1 a b c d
1 2 e f h
2 3 i j k l
I have a pandas dataframe that contains a list of articles: the outlet, publish date, link, etc. One of the columns in this dataframe is a list of keywords. For example, each cell in the keyword column contains a list like [drop, right, states, laws].
My ultimate goal is to count the number of occurrences of each unique word on each day. The challenge that I'm having is breaking the keywords out of their lists and then matching them to the date on which they occurred. ...assuming this is even the most logical first step.
At present I have a solution in the code below; however, I'm new to Python and still think in an Excel mindset when working through these things. The code below works, but it's very slow. Is there a faster way to do this?
# Create a list of the keywords for articles in the last 30 days to determine their quantity
keyword_list = stories_full_recent_df['Keywords'].tolist()
keyword_list = [item for sublist in keyword_list for item in sublist]
# Create a blank dataframe and new iterator to write the keyword appearances to
wordtrends_df = pd.DataFrame(columns=['Captured_Date', 'Brand', 'Coverage', 'Keyword'])
r = 0
print("Creating table on keywords: {:,}".format(len(keyword_list)))
print(time.strftime("%H:%M:%S"))
# Write the keywords out into their own rows with the dates and origins in which they occur
while r <= len(keyword_list):
    for i in stories_full_recent_df.index:
        words = stories_full_recent_df.loc[i]['Keywords']
        for word in words:
            wordtrends_df.loc[r] = [stories_full_recent_df.loc[i]['Captured_Date'], stories_full_recent_df.loc[i]['Brand'],
                                    stories_full_recent_df.loc[i]['Coverage'], word]
            r += 1
print(time.strftime("%H:%M:%S"))
print("Keyword compilation complete.")
Once I have each word on its own row, I simply use .groupby() to figure out the number of occurrences each day.
# Group and count the keywords and days to find the day with the least of each word
test_min = wordtrends_df.groupby(['Keyword', 'Captured_Date'], as_index=False).count().sort_values(by=['Keyword','Brand'], ascending=True)
keyword_min = test_min.groupby(['Keyword'], as_index=False).first()
At present there are about 100,000 words in this list and it takes an hour to run through it. I'd love thoughts on a faster way to do it.
I think you can get the expected result by doing this:
wordtrends_df = pd.melt(pd.concat((stories_full_recent_df[['Brand', 'Captured_Date', 'Coverage']],
stories_full_recent_df.Keywords.apply(pd.Series)),axis=1),
id_vars=['Brand','Captured_Date','Coverage'],value_name='Keyword')\
.drop(['variable'],axis=1).dropna(subset=['Keyword'])
An explanation with a small example below.
Consider an example dataframe:
df = pd.DataFrame({'Brand': ['X', 'Y'],
'Captured_Date': ['2017-04-01', '2017-04-02'],
'Coverage': [10, 20],
'Keywords': [['a', 'b', 'c'], ['c', 'd']]})
# Brand Captured_Date Coverage Keywords
# 0 X 2017-04-01 10 [a, b, c]
# 1 Y 2017-04-02 20 [c, d]
First thing you can do is expand the keywords column so that each keyword occupies its own column:
a = df.Keywords.apply(pd.Series)
# 0 1 2
# 0 a b c
# 1 c d NaN
Concatenate this with the original df without Keywords column:
b = pd.concat((df[['Captured_Date','Brand','Coverage']],a),axis=1)
# Captured_Date Brand Coverage 0 1 2
# 0 2017-04-01 X 10 a b c
# 1 2017-04-02 Y 20 c d NaN
Melt this last result to create a row per keyword:
c = pd.melt(b,id_vars=['Captured_Date','Brand','Coverage'],value_name='Keyword')
# Captured_Date Brand Coverage variable Keyword
# 0 2017-04-01 X 10 0 a
# 1 2017-04-02 Y 20 0 c
# 2 2017-04-01 X 10 1 b
# 3 2017-04-02 Y 20 1 d
# 4 2017-04-01 X 10 2 c
# 5 2017-04-02 Y 20 2 NaN
Finally, drop the useless variable column and drop rows where Keyword is missing:
d = c.drop(['variable'],axis=1).dropna(subset=['Keyword'])
# Captured_Date Brand Coverage Keyword
# 0 2017-04-01 X 10 a
# 1 2017-04-02 Y 20 c
# 2 2017-04-01 X 10 b
# 3 2017-04-02 Y 20 d
# 4 2017-04-01 X 10 c
Now you're ready to count by keywords and dates.
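As an aside: on pandas 0.25 or later, DataFrame.explode produces the same one-row-per-keyword shape in a single call; a sketch of the full counting step on the example frame above:
# one row per keyword, then count occurrences per keyword and day
exploded = df.explode('Keywords').rename(columns={'Keywords': 'Keyword'})
counts = exploded.groupby(['Keyword', 'Captured_Date']).size().reset_index(name='Count')
print(counts)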
I have columns named like this:
1:Arnston 2:Berg 3:Carlson 53:Brown
and I want to strip all the characters up to and including the ':'. I know I can rename the columns, but that would be pretty tedious since my numbers go up to 100.
My desired output is:
Arnston Berg Carlson Brown
Assuming that you have a frame looking something like this:
>>> df
1:Arnston 2:Berg 3:Carlson 53:Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
You can use the vectorized string operators to split each entry at the first colon and then take the second part:
>>> df.columns = df.columns.str.split(":", 1).str[1]
>>> df
Arnston Berg Carlson Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
import re
s = '1:Arnston 2:Berg 3:Carlson 53:Brown'
s_minus_numbers = re.sub(r'\d+:', '', s)
Gets you
'Arnston Berg Carlson Brown'
The best solution IMO is to use pandas' str attribute on the columns. This allows for the use of regular expressions without having to import re:
df.columns.str.extract(r'\d+:(.*)')
Where the regex means: select everything ((.*)) after one or more digits (\d+) and a colon (:).
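To assign the cleaned names back to the frame, pass expand=False so the result stays an Index rather than becoming a one-column DataFrame (a small usage sketch):
df.columns = df.columns.str.extract(r'\d+:(.*)', expand=False)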
You can do it with a list comprehension:
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
print('Before: {!r}'.format(columns))
columns = [col.split(':')[1] for col in columns]
print('After: {!r}'.format(columns))
Output
Before: ['1:Arnston', '2:Berg', '3:Carlson', '53:Brown']
After: ['Arnston', 'Berg', 'Carlson', 'Brown']
Another way is with a regular expression using re.sub():
import re
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
pattern = re.compile(r'^.+:')
columns = [pattern.sub('', col) for col in columns]
print(columns)
Output
['Arnston', 'Berg', 'Carlson', 'Brown']
df = pd.DataFrame({'1:Arnston': [5, 9, 9],
                   '2:Berg': [0, 3, 2],
                   '3:Carlson': [2, 2, 9],
                   '53:Brown': [1, 9, 7]})
[x.split(':')[1] for x in df.columns.factorize()[1]]
output:
['Arnston', 'Berg', 'Carlson', 'Brown']
You could use str.replace and pass a regex expression:
In [52]: df
Out[52]:
1:Arnston 2:Berg 3:Carlson 53:Brown
0 1.340711 1.261500 -0.512704 -0.064384
1 0.462526 -0.358382 0.168122 -0.660446
2 -0.089622 0.656828 -0.838688 -0.046186
3 1.041807 0.775830 -0.436045 0.162221
4 -0.422146 0.775747 0.106112 -0.044917
In [51]: df.columns.str.replace(r'\d+[:]', '')
Out[51]: Index(['Arnston', 'Berg', 'Carlson', 'Brown'], dtype='object')