Converting a large String into a Dataframe - python

I have a large string that looks like this: '2002 | 09| 90|NUMBER|SALE|CLIENT \n 2002 | 39| 96|4958|100|James ...', with fields separated by "|" and rows separated by "\n". Every line has the same number of fields. What's the best way to turn this into a DataFrame that looks like this:
2002 09 90 NUMBER SALE CLIENT
2002 39 96 4958 100 James
.....

You can wrap the string in a StringIO object and pass it to pandas.read_csv to get the desired result. For example:
from io import StringIO
import pandas as pd
data = StringIO("""2002 | 09| 90|NUMBER|SALE|CLIENT \n 2002 | 39| 96|4958|100|James """)
# a regex separator is handled by the Python parsing engine
df = pd.read_csv(data, sep=r"\s*\|\s*", engine="python")
print(df)
Output:
2002 09 90 NUMBER SALE CLIENT
0 2002 39 96 4958 100 James
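If the first line should stay as data rather than become the header, as the desired frame in the question suggests, a minimal sketch might look like the following. It assumes a plain "|" separator, reads every field as text so that values such as 09 keep their leading zero, and strips the padding afterwards. The variable name raw is only illustrative.

```python
from io import StringIO

import pandas as pd

# Sample string from the question (illustrative variable name).
raw = "2002 | 09| 90|NUMBER|SALE|CLIENT \n 2002 | 39| 96|4958|100|James "

# header=None keeps the first line as a data row;
# dtype=str reads every field as text, so "09" is not turned into 9.
df = pd.read_csv(StringIO(raw), sep="|", header=None, dtype=str)

# Remove the spaces that surround each field.
df = df.apply(lambda col: col.str.strip())

print(df)
```

Columns are numbered 0 to 5 here; pass names=[...] to read_csv if you want labels instead.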
UPDATE (As per your requirements discussed in comments):
from io import StringIO
import pandas as pd
import re
data = StringIO("""Start Date str_date B N C \n Calculation notional cal_nt C B R\n price of today price_of_today N B R \n""")
lines = []
for line in data:
    line = line.strip()
    line = re.search(r"^(.*)\s(.*?\s.*?\s.*?\s.*?)$", line)
    grp1 = line.group(1)
    grp2 = line.group(2)
    line = "|".join([grp1, "|".join(grp2.split(" "))])
    lines.append(line)
data = StringIO("\n".join(lines))
columns = ["header1", "header2", "header3", "header4", "header5"]
df = pd.read_csv(data, names=columns, sep=r"\s*\|\s*", engine="python")
print(df)
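The same split can probably be done without a regular expression. The sketch below assumes, like the code above, that the last four whitespace-separated tokens of each line are separate columns and that everything before them belongs to the first column. It uses str.rsplit with maxsplit=4; the names raw and rows are illustrative.

```python
import pandas as pd

# Same sample text as in the update above.
raw = ("Start Date str_date B N C \n"
       " Calculation notional cal_nt C B R\n"
       " price of today price_of_today N B R \n")

columns = ["header1", "header2", "header3", "header4", "header5"]

# Split each non-empty line from the right at most four times, so the
# free-text label on the left stays in one piece.
rows = [line.strip().rsplit(maxsplit=4)
        for line in raw.splitlines() if line.strip()]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

This avoids the round trip through a "|"-joined string, at the cost of assuming the right-hand columns never contain spaces.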

Related

Which is the correct way to use `to_csv` after reading `json` from restapi ? How to get data in tabular format?

I am trying to read data from http://dummy.restapiexample.com/api/v1/employees and write it out in tabular format.
I do get output, but the columns are not created from the fields in the JSON.
How can I do this the right way?
Code:
import pandas as pd
import json
df1 = pd.read_json('http://dummy.restapiexample.com/api/v1/employees')
df1.to_csv('try.txt',sep='\t',index=False)
Expected Output:
employee_name employee_salary employee_age profile_image
Tiger Nixon 320800 61
(along with other rows)
You can read the data directly from the web, like you're doing, but you need to help pandas interpret your data with the orient parameter:
df = pd.read_json('http://dummy.restapiexample.com/api/v1/employees', orient='index')
Then there's a second step to focus on the data you want:
df1 = pd.DataFrame(df.loc['data', 0])
Now you can write your csv.
Here are the different steps (note: the data is in [data] array of the JSON response):
import json
import pandas as pd
import requests
res = requests.get('http://dummy.restapiexample.com/api/v1/employees')
data_str = res.content
data_dict = json.loads(data_str)
data_df = pd.DataFrame(data_dict['data'])
data_df.to_csv('try.txt', sep='\t', index=False)
You have to parse your JSON first:
import pandas as pd
import json
import requests
r = requests.get('http://dummy.restapiexample.com/api/v1/employees')
j = json.loads(r.text)
df = pd.DataFrame(j['data'])
Output:
id employee_name employee_salary employee_age profile_image
0 1 Tiger Nixon 320800 61
1 2 Garrett Winters 170750 63
2 3 Ashton Cox 86000 66
3 4 Cedric Kelly 433060 22
4 5 Airi Satou 162700 33
5 6 Brielle Williamson 372000 61
6 7 Herrod Chandler 137500 59
7 8 Rhona Davidson 327900 55
8 9 Colleen Hurst 205500 39
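To try the idea without calling the API, here is a small offline sketch. The payload below is a hand-written stand-in for the response: only the "data" array is confirmed by the answers above, while the "status" field and the exact value types are assumptions. It writes to an in-memory buffer instead of try.txt so the result can be inspected.

```python
import io
import json

import pandas as pd

# Hand-written stand-in for the API response (shape is an assumption;
# the two records reuse values from the output shown above).
payload = """
{"status": "success",
 "data": [
   {"id": "1", "employee_name": "Tiger Nixon", "employee_salary": "320800",
    "employee_age": "61", "profile_image": ""},
   {"id": "2", "employee_name": "Garrett Winters", "employee_salary": "170750",
    "employee_age": "63", "profile_image": ""}
 ]}
"""

# Parse the JSON, then build the frame from the list under "data" only.
records = json.loads(payload)["data"]
df = pd.DataFrame(records)

# Write tab-separated text; a file path such as 'try.txt' works the same way.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```

Each dictionary in the list becomes one row, and its keys become the column names, which is why the top-level wrapper has to be removed first.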

Python : Inserting hyphens after column header in Pandas

I have created a dataframe in Python Pandas as below:
import pandas as pd
import os
cols = ('Name','AGE','SAL')
df = pd.read_csv("C:\\Users\\agupt80\\Desktop\\POC\\Python\\test.csv",names = cols)
print(df)
When I am printing dataframe I am getting below output:
Name AGE SAL
0 Amit 32 100
1 gupta 33 200
2 hello 34 300
3 Amit 33 100
Please let me know how I can insert a line of hyphens ("-") after the column header, like below:
Name AGE SAL
------------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 300
3 Amit 33 100
I don't know of any pandas customization option for printing separators, but you can just print the df to a string, and then insert the line yourself. Something like this:
string_repr = df.to_string().splitlines()
string_repr.insert(1, "-" * len(string_repr[0]))
out = '\n'.join(string_repr)
>>> print(out)
Name AGE SAL
------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 300
3 Amit 33 100
You could do something like this:
import pandas as pd
df = pd.DataFrame([
    ['Amit', 32, 100],
    ['gupta', 33, 200],
    ['hello', 34, 100],
    ['Amit', 33, 100]],
    columns=['Name', 'AGE', 'SAL'])
lines = df.to_string().splitlines()
num_hyphens = max(len(line) for line in lines)
lines.insert(1, '-' * num_hyphens)
result = '\n'.join(lines)
print(result)
Output:
Name AGE SAL
------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 100
3 Amit 33 100
You can change the calculation of num_hyphens depending on exactly how you want your output to look. For example, you could do:
num_hyphens = 2 * len(lines[0]) - len(lines[0].strip())
To get:
Name AGE SAL
----------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 100
3 Amit 33 100
Note: If the DataFrame has a named index, to_string will output an additional header line with the name of the index. In that case you could either remove that line (replace it with the hyphens) or add the hyphens after it (at position 2 instead of 1).
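The approach can be wrapped in a small helper that also covers the named-index case. This is only a sketch: the function name with_header_rule is made up for illustration, and it assumes a single-level index and single-level columns.

```python
import pandas as pd


def with_header_rule(df, char="-"):
    """Return df.to_string() with a rule line inserted below the header."""
    lines = df.to_string().splitlines()
    # A named index makes to_string emit a second header line,
    # so the rule goes after two lines instead of one.
    header_lines = 2 if df.index.name is not None else 1
    width = max(len(line) for line in lines)
    lines.insert(header_lines, char * width)
    return "\n".join(lines)


df = pd.DataFrame(
    [["Amit", 32, 100], ["gupta", 33, 200], ["hello", 34, 300], ["Amit", 33, 100]],
    columns=["Name", "AGE", "SAL"],
)
print(with_header_rule(df))
```

Calling with_header_rule(df.rename_axis("row")) places the rule below the extra index-name line.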
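You could also shift the rows down by one and fill the first row with hyphens. Note that shift(1) pushes the last row off the end of the frame, so that row is lost:
# shift the values down by one row
df = df.shift(1)
# fill the first row with hyphens
df.iloc[0] = '----'
Output:
Name AGE SAL
---- ---- ----
Amit 32 100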
gupta 33 200
hello 34 300

write rows in pandas dataframe and append it to existing dataframe

I have the output of my script as year and the count of word from an article in that particular year :
abcd
2013
118
2014
23
xyz
2013
1
2014
45
I want to have each year added as a new column to my existing dataframe which contains only words.
Expected output:
Terms 2013 2014 2015
abc 118 76 90
xyz 23 0 36
The input for my script was a csv file :
Terms
xyz
abc
efg
The script I wrote is :
import urllib.request
import xml.etree.ElementTree as ET  # assumed: ET refers to the standard-library ElementTree
import pandas as pd

df = pd.read_csv('a.csv', header = None)
for row in df.itertuples():
    term = (str(row[1]))
    u = "http: term=%s&mindate=%d/01/01&maxdate=%d/12/31"
    print(term)
    startYear = 2013
    endYear = 2018
    for year in range(startYear, endYear+1):
        url = u % (term.replace(" ", "+"), year, year)
        page = urllib.request.urlopen(url).read()
        doc = ET.XML(page)
        count = doc.find("Count").text
        print(year)
        print(count)
The df.head is :
0
0 1,2,3-triazole
1 16s rrna gene amplicons
Any help will be greatly appreciated, thanks in advance !!
I would read the CSV into an array with numpy, reshape it (also with numpy), and then convert the resulting 2D array to a DataFrame.
Something like this should do it:
#!/usr/bin/env python
def mkdf(filename):
    def combine(term, l):
        d = {"term": term}
        d.update(dict(zip(l[::2], l[1::2])))
        return d
    term = None
    other = []
    with open(filename) as I:
        n = 0
        for line in I:
            line = line.strip()
            try:
                int(line)
            except Exception as e:
                # not an int
                if term: # if we have one, create the record
                    yield combine(term, other)
                term = line
                other = []
                n = 0
            else:
                if n > 0:
                    other.append(line)
            n += 1
    # and the last one
    yield combine(term, other)
if __name__ == "__main__":
    import pandas as pd
    import sys
    df = pd.DataFrame([r for r in mkdf(sys.argv[1])])
    print(df)
Usage: python scriptname.py /tmp/IN (or any other file containing your data)
Output:
2013 2014 term
0 118 23 abcd
1 1 45 xyz
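Another option is to skip the printed text entirely and collect the counts while looping. The sketch below assumes the lookup can be wrapped in a function; fetch_count is a hypothetical stand-in for the urllib/ElementTree request in the question, and here it reads from a small dictionary filled with the counts from the sample output so the example runs offline.

```python
import pandas as pd


def build_counts(terms, years, fetch_count):
    """Build one row per term with one column per year."""
    rows = []
    for term in terms:
        row = {"Terms": term}
        for year in years:
            # fetch_count is a placeholder for the real web request.
            row[year] = fetch_count(term, year)
        rows.append(row)
    return pd.DataFrame(rows, columns=["Terms", *years])


# Stand-in data taken from the sample output in the question.
sample = {("abcd", 2013): 118, ("abcd", 2014): 23,
          ("xyz", 2013): 1, ("xyz", 2014): 45}

df = build_counts(["abcd", "xyz"], [2013, 2014],
                  lambda term, year: sample[(term, year)])
print(df)
```

Building a list of dictionaries and creating the frame once at the end is usually simpler than appending rows to an existing DataFrame inside the loop.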

Create dictionary from csv using pandas with date as key

I wish to create dictionary from the table below
ID ArCityArCountry DptCityDptCountry DateDpt DateAr
1922 ParisFrance NewYorkUnitedState 2008-03-10 2001-02-02
1002 LosAngelesUnitedState California UnitedState 2008-03-10 2008-12-01
1901 ParisFrance LagosNigeria 2001-03-05 2001-02-02
1922 ParisFrance NewYorkUnitedState 2011-02-03 2008-12-01
1002 ParisFrance CaliforniaUnitedState 2003-03-04 2002-03-04
1099 ParisFrance BeijingChina 2011-02-03 2009-02-04
1901 LosAngelesUnitedState ParisFrance 2001-03-05 2001-02-02
.
import pandas as pd
import datetime
from pandas_datareader import data, wb
import csv
#import numpy as np
out= open("testfile.csv", "rb")
data = csv.reader(out)
data = [[row[0],row[1] + row[2],row[3] + row[4], row[5],row[6]] for row in data]
out.close()
print data
out=open("data.csv", "wb")
output = csv.writer(out)
for row in data:
    output.writerow(row)
out.close()
df = pd.read_csv('data.csv')
for DateDpt, DateAr in df.iteritems():
    df.DateDpt = pd.to_datetime(df.DateDpt, format='%Y-%m-%d')
    df.DateAr = pd.to_datetime(df.DateAr, format='%Y-%m-%d')
print df
dept_cities = df.groupby('ArCityArCountry')
for city, departures in dept_cities:
    print(city)
    print([list(r) for r in departures.loc[:, ['AuthorID', 'DptCityDptCountry', 'DateDpt', 'DateAr']].to_records()])
Expected output
ParisFrance = { DateAr, ID, ArCityArCountry, DptCityDptCountry}
Note: I want to group by ArCityArCountry and DptCityDptCountry
You will notice I didn't include DateDpt: I want to select all IDs whose stay, between DateAr and DateDpt, was in ParisFrance or CaliforniaUnitedStates during the specified period.
For example, if Mr A arrived in Paris on 1999-10-02 and stayed until 2013-12-12, and Mr B arrived in Paris on 2010-11-04 and left on 2012-09-09, then Mr A and Mr B were both in Paris, because Mr B's visit falls within the time Mr A was there.
CaliforniaUnitedStates = { DateAr, ID, ArCityArCountry, DptCityDptCountry}
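The "who was in this city during this period" part can be expressed as an interval-overlap test: a stay overlaps a period when it starts no later than the period ends and ends no earlier than the period starts. The sketch below uses the Mr A / Mr B example above; the column names City, Arrived and Left and the function present_during are illustrative assumptions, not the columns of the table in the question.

```python
import pandas as pd

# The two stays from the example (column names are illustrative).
stays = pd.DataFrame({
    "ID": ["MrA", "MrB"],
    "City": ["ParisFrance", "ParisFrance"],
    "Arrived": pd.to_datetime(["1999-10-02", "2010-11-04"]),
    "Left": pd.to_datetime(["2013-12-12", "2012-09-09"]),
})


def present_during(df, city, start, end):
    """Return the IDs whose stay in `city` overlaps [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    overlaps = (df["Arrived"] <= end) & (df["Left"] >= start)
    return df.loc[(df["City"] == city) & overlaps, "ID"].tolist()


print(present_during(stays, "ParisFrance", "2011-01-01", "2011-12-31"))
print(present_during(stays, "ParisFrance", "2000-01-01", "2000-12-31"))
```

A dictionary keyed by city could then be built by calling the function once per city of interest.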

How to make table with multi-tier row header (index) using Pandas

I have the following data:
# colh1 rh1 rh2 rh3/up rh4/down
AddaVax ID LV 29 18
AddaVax ID SP 16 13
AddaVax ID LN 61 73
ADX ID LV 11 14
ADX IP LV 160 88
ADX ID SP 14 13
ADX IP SP 346 129
ADX ID LN 25 25
What I'd like to do is to make a table that looks like this
(later to be written in text or Excel file):
The actual data contain more than 2 columns but the number of rows
is always fixed (i.e. 10 rows).
I'm stuck with the following code:
import csv
import pandas as pd
from collections import defaultdict
dod = defaultdict(dict)
with open("mediate.txt", 'r') as tsvfile:
    tabreader = csv.reader(tsvfile, delimiter=' ')
    for row in tabreader:
        if "#" in row[0]: continue
        colh1, rh1, rh2, rhup, rhdown = row
        dod["colh1"] = colh1
        dod["rh1"] = rh1
        dod["rh2"] = rh2
        dod["rhup"] = rhup
        dod["rhdown"] = rhdown
What's the way to do it?
Just using Pandas:
import pandas as pd
df = pd.read_csv('mediate.txt', sep='\t') # or sep=',' if comma delimited.
df.rename(columns={'rh3/up': 'Up', 'rh4/down': 'Down'}, inplace=True)
result = df.pivot_table(values=['Up', 'Down'],
                        columns='colh1',
                        index=['rh1', 'rh2']).stack(0)  # Stack Up/Down
>>> result
colh1          ADX  AddaVax
rh1 rh2
ID  LN  Up      25       61
        Down    25       73
    LV  Up      11       29
        Down    14       18
    SP  Up      14       16
        Down    13       13
IP  LV  Up     160      NaN
        Down    88      NaN
    SP  Up     346      NaN
        Down   129      NaN
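For a version that runs without the input file, the sketch below embeds the sample rows in a string and builds a similar three-level row index. It is a variant of the answer rather than a copy: it assumes whitespace-separated input with the header already renamed to Up and Down, and it uses melt followed by pivot_table instead of stack. The column name direction is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Sample rows from the question, with simplified header names (assumed).
raw = """colh1 rh1 rh2 Up Down
AddaVax ID LV 29 18
AddaVax ID SP 16 13
AddaVax ID LN 61 73
ADX ID LV 11 14
ADX IP LV 160 88
ADX ID SP 14 13
ADX IP SP 346 129
ADX ID LN 25 25
"""

df = pd.read_csv(StringIO(raw), sep=r"\s+")

# Long format: one row per (colh1, rh1, rh2, direction) with a single value.
long_df = df.melt(id_vars=["colh1", "rh1", "rh2"],
                  value_vars=["Up", "Down"],
                  var_name="direction")

# Wide again: rh1/rh2/direction form the row index, colh1 the columns.
result = long_df.pivot_table(index=["rh1", "rh2", "direction"],
                             columns="colh1",
                             values="value")
print(result)
```

The result can then be written out with result.to_csv(...) or result.to_excel(...); combinations missing from the data, such as AddaVax with IP, appear as NaN.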
