I wish to create a dictionary from the table below:
ID ArCityArCountry DptCityDptCountry DateDpt DateAr
1922 ParisFrance NewYorkUnitedState 2008-03-10 2001-02-02
1002 LosAngelesUnitedState California UnitedState 2008-03-10 2008-12-01
1901 ParisFrance LagosNigeria 2001-03-05 2001-02-02
1922 ParisFrance NewYorkUnitedState 2011-02-03 2008-12-01
1002 ParisFrance CaliforniaUnitedState 2003-03-04 2002-03-04
1099 ParisFrance BeijingChina 2011-02-03 2009-02-04
1901 LosAngelesUnitedState ParisFrance 2001-03-05 2001-02-02
import csv
import pandas as pd

# Merge the separate city and country columns of the raw file
with open("testfile.csv", newline="") as src:
    data = [[row[0], row[1] + row[2], row[3] + row[4], row[5], row[6]]
            for row in csv.reader(src)]
print(data)

with open("data.csv", "w", newline="") as out:
    csv.writer(out).writerows(data)

df = pd.read_csv('data.csv')
# Parse both date columns once; no loop over the columns is needed
df.DateDpt = pd.to_datetime(df.DateDpt, format='%Y-%m-%d')
df.DateAr = pd.to_datetime(df.DateAr, format='%Y-%m-%d')
print(df)

dept_cities = df.groupby('ArCityArCountry')
for city, departures in dept_cities:
    print(city)
    # The table's identifier column is 'ID' (there is no 'AuthorID' column)
    print([list(r) for r in departures.loc[:, ['ID', 'DptCityDptCountry', 'DateDpt', 'DateAr']].to_records()])
Expected output:
ParisFrance = { DateAr, ID, ArCityArCountry, DptCityDptCountry}
Note: I want to group by ArCityArCountry and DptCityDptCountry.
You will notice I didn't include DateDpt: I want to select all IDs whose stay falls between DateAr and DateDpt and who were actually in ParisFrance or CaliforniaUnitedStates during the specified period.
For example, Mr A was in Paris from 1999-10-02 until 2013-12-12, and Mr B arrived in Paris on 2010-11-04 and left on 2012-09-09. That means Mr A and Mr B were in Paris at the same time, since Mr B's visit falls within the time Mr A was there.
CaliforniaUnitedStates = { DateAr, ID, ArCityArCountry, DptCityDptCountry}
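One way to read the requirement above is as an interval-overlap test: a person counts as present in a city during a period if their stay starts on or before the period ends and ends on or after the period starts. The following is only a minimal sketch of that idea. It assumes DateAr is the start of a stay and DateDpt its end; the sample rows (including "MrC") and the helper name ids_present are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical sample rows modeled on the Mr A / Mr B example
stays = pd.DataFrame({
    "ID": ["MrA", "MrB", "MrC"],
    "ArCityArCountry": ["ParisFrance", "ParisFrance", "ParisFrance"],
    "DateAr": pd.to_datetime(["1999-10-02", "2010-11-04", "2014-01-01"]),
    "DateDpt": pd.to_datetime(["2013-12-12", "2012-09-09", "2014-02-01"]),
})


def ids_present(stays, city, start, end):
    """Return the sorted IDs whose stay in `city` overlaps [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    in_city = stays["ArCityArCountry"] == city
    # Two closed intervals overlap when each one starts before the other ends
    overlaps = (stays["DateAr"] <= end) & (stays["DateDpt"] >= start)
    return sorted(stays.loc[in_city & overlaps, "ID"].unique())


# One dictionary entry per arrival city, for Mr B's visit period
present = {
    city: ids_present(stays, city, "2010-11-04", "2012-09-09")
    for city in stays["ArCityArCountry"].unique()
}
print(present)
```

The same helper could be called once per period of interest; whether the interval ends should be inclusive is a choice the question does not settle.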
I would like to make my data frame more aesthetically appealing and drop what I believe are the unnecessary first row and column from the multi-index. I would like the column headers to be: 'Rk', 'Team','Conf','G','Rec','ADJOE',.....,'WAB'
Any help is much appreciated.
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)
df = df[0]
df
You only have to iterate over the existing columns and select the second value. Then you can set the list of values as new columns:
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)[0]  # read_html returns a list of DataFrames; take the first table
df.columns = [x[1] for x in df.columns]
df.head()
Output:
Rk Team Conf G Rec AdjOE AdjDE Barthag EFG% EFGD% ... ORB DRB FTR FTRD 2P% 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 122.42 89.05 .97491 60.21 421 ... 30.2120 2318 30.4165 21.710 62.21 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 117.39 89.06 .95982 53.835 42.93 ... 37.26 27.6141 28.2242 33.3247 54.827 424 34.8108 29.418 65.2303 3.416
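To try the column-flattening idea without fetching the page, here is a small offline sketch. The two-level header below is invented to resemble what read_html might return for a table with an empty first header row; the "Unnamed: ..." labels are an assumption, not taken from the actual site.

```python
import pandas as pd

# Hypothetical two-level header: an unwanted first level over the real names
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Team"),
    ("Unnamed: 2_level_0", "Conf"),
])
df = pd.DataFrame([[1, "Gonzaga", "WCC"], [2, "Houston", "Amer"]], columns=columns)

# Option 1: keep only the second element of each column tuple
flat = df.copy()
flat.columns = [second for _, second in df.columns]

# Option 2: drop the first level of the column MultiIndex
dropped = df.droplevel(0, axis=1)

print(flat)
```

Both options leave the data untouched and only change the column labels.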
When you read from HTML, specify the row number you want as header:
df = pd.read_html(url, header=1)[0]
print(df.head())
Output:
Rk Team Conf G Rec ... 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 ... 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 ... 424 34.8108 29.418 65.2303 3.416
2 3 Kentucky SEC 26 21-510–3 ... 46.342 35.478 29.519 68.997 4.89
3 4 Arizona P12 25 23-213–1 ... 39.91 33.7172 31.471 72.99 6.24
4 5 Baylor B12 26 21-59–4 ... 49.2165 35.966 30.440 68.3130 6.15
I am trying to read data from http://dummy.restapiexample.com/api/v1/employees and write it out in tabular format.
I do get output, but the columns are not created from the JSON file.
How can I do this the right way?
Code:
import pandas as pd
import json
df1 = pd.read_json('http://dummy.restapiexample.com/api/v1/employees')
df1.to_csv('try.txt',sep='\t',index=False)
Expected Output:
employee_name employee_salary employee_age profile_image
Tiger Nixon 320800 61
(along with other rows)
You can read the data directly from the web, like you're doing, but you need to help pandas interpret your data with the orient parameter:
df = pd.read_json('http://dummy.restapiexample.com/api/v1/employees', orient='index')
Then there's a second step to focus on the data you want:
df1 = pd.DataFrame(df.loc['data', 0])
Now you can write your csv.
Here are the different steps (note: the data is in [data] array of the JSON response):
import json
import pandas as pd
import requests
res = requests.get('http://dummy.restapiexample.com/api/v1/employees')
data_str = res.content
data_dict = json.loads(data_str)
data_df = pd.DataFrame(data_dict['data'])
data_df.to_csv('try.txt', sep='\t', index=False)
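The same extraction can be tried without a network call. The sketch below uses a hand-written sample payload; its exact shape (a "status" key next to a "data" list, and all values stored as strings) is an assumption modeled on the output shown in this thread, not a guaranteed description of the API.

```python
import json

import pandas as pd

# Hypothetical response body; only the nested "data" list holds the records
sample_response = """
{"status": "success",
 "data": [
   {"id": "1", "employee_name": "Tiger Nixon", "employee_salary": "320800",
    "employee_age": "61", "profile_image": ""},
   {"id": "2", "employee_name": "Garrett Winters", "employee_salary": "170750",
    "employee_age": "63", "profile_image": ""}
 ]}
"""

payload = json.loads(sample_response)

# Build the frame from the list of records, not from the whole response
employees = pd.DataFrame(payload["data"])

# With no path argument, to_csv returns the text instead of writing a file
tsv_text = employees.to_csv(sep="\t", index=False)
print(tsv_text)
```

Passing the whole response to pandas treats the top-level keys as the table, which is why the columns do not come out as expected; selecting the "data" list first avoids that.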
You have to parse your JSON first.
import pandas as pd
import json
import requests
r = requests.get('http://dummy.restapiexample.com/api/v1/employees')
j = json.loads(r.text)
df = pd.DataFrame(j['data'])
Output:
id employee_name employee_salary employee_age profile_image
0 1 Tiger Nixon 320800 61
1 2 Garrett Winters 170750 63
2 3 Ashton Cox 86000 66
3 4 Cedric Kelly 433060 22
4 5 Airi Satou 162700 33
5 6 Brielle Williamson 372000 61
6 7 Herrod Chandler 137500 59
7 8 Rhona Davidson 327900 55
8 9 Colleen Hurst 205500 39
I have a large string that looks like this: '2002 | 09| 90|NUMBER|SALE|CLIENT \n 2002 | 39| 96|4958|100|James ...'. Fields are separated by "|" and rows by "\n", and every row has the same number of fields. What is the best way to turn this into a dataframe that looks like this:
2002 09 90 NUMBER SALE CLIENT
2002 39 96 4958 100 James
.....
You can pass a StringIO object to pandas.read_csv to get the desired result. For example:
from io import StringIO
import pandas as pd
data = StringIO("""2002 | 09| 90|NUMBER|SALE|CLIENT \n 2002 | 39| 96|4958|100|James """)
# A regex separator needs the python parser engine
df = pd.read_csv(data, sep=r"\s*\|\s*", engine="python")
print(df)
Output:
2002 09 90 NUMBER SALE CLIENT
0 2002 39 96 4958 100 James
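If the first row should not become the header, or if leading zeros such as "09" must survive, the string can also be split by hand. This is a minimal sketch of that alternative, not the approach used in the answer above; the variable names are chosen for illustration.

```python
import pandas as pd

raw = "2002 | 09| 90|NUMBER|SALE|CLIENT \n 2002 | 39| 96|4958|100|James "

# Split into rows on newlines, then into cells on "|", trimming whitespace
rows = [
    [cell.strip() for cell in line.split("|")]
    for line in raw.split("\n")
    if line.strip()
]

# Every cell stays a string, so "09" keeps its leading zero
df = pd.DataFrame(rows)
print(df)
```

To use the first row as column names instead, one could build the frame with pd.DataFrame(rows[1:], columns=rows[0]).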
UPDATE (as per your requirements discussed in the comments):
from io import StringIO
import pandas as pd
import re
data = StringIO("""Start Date str_date B N C \n Calculation notional cal_nt C B R\n price of today price_of_today N B R \n""")
lines = []
for line in data:
    line = line.strip()
    # Group 1 is the free-text label; group 2 is the last four space-separated fields
    match = re.search(r"^(.*)\s(.*?\s.*?\s.*?\s.*?)$", line)
    grp1 = match.group(1)
    grp2 = match.group(2)
    lines.append("|".join([grp1, "|".join(grp2.split(" "))]))
data = StringIO("\n".join(lines))
columns = ["header1", "header2", "header3", "header4", "header5"]
# A regex separator needs the python parser engine
df = pd.read_csv(data, names=columns, sep=r"\s*\|\s*", engine="python")
print(df)
With my code I can join two databases into one. Now I need to do the same with another database file.
archivo1:
Fecha Cliente Impresiones Impresiones 2 Revenue
20/12/17 Jose 1312 35 $12
20/12/17 Martin 12 56 $146
20/12/17 Pedro 5443 124 $1,256
20/12/17 Esteban 667 1235 $1
archivo2:
Fecha Cliente Impresiones Impresiones 2 Revenue
21/12/17 Jose 25 5 $2
21/12/17 Martin 6347 523 $123
21/12/17 Pedro 2368 898 $22
21/12/17 Esteban 235 99 $7,890
archivo:
Fecha Cliente Impresiones Impresiones 2 Revenue
22/12/17 Peter 55 5 $2
22/12/17 Juan 634527 523 $123
22/12/17 Pedro 836 898 $22
22/12/17 Esteban 125 99 $7,890
I get these results:
The problem is that I need to add the new database (archivo) to the Data.xlsx file, so that it looks like this:
Code:
import pandas as pd
import pandas.io.formats.excel
import numpy as np
# Read both files and load them into DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")
df = pd.concat([df1, df2])\
    .set_index(['Cliente', 'Fecha'])\
    .stack()\
    .unstack(-2)\
    .sort_index(ascending=[True, False])
i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
df.index = df.index.droplevel()
# Create the output xlsx
pandas.io.formats.excel.header_style = None
with pd.ExcelWriter("Data.xlsx",
                    engine='xlsxwriter',
                    date_format='dd/mm/yyyy',
                    datetime_format='dd/mm/yyyy') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
Extending my comment as an answer, I'd recommend creating a function that will reshape your dataframes to conform to a given format. I'd recommend doing this simply because it is much easier to just reshape your data, rather than reshape new entries to conform to the existing structure. This is because your current structure is a format that makes it extremely hard to work with (take it from me).
So, the easiest thing to do would be to create a function -
def process(dfs):
    df = pd.concat(dfs)\
        .set_index(['Cliente', 'Fecha'])\
        .stack()\
        .unstack(-2)\
        .sort_index(ascending=[True, False])
    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)
    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)
    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()
    return df
Now, load your dataframes -
df_list = []
for file in ['archivo1.xlsx', 'archivo2.xlsx', ...]:
    df_list.append(pd.read_excel(file))
Now, call the process function with your df_list -
df = process(df_list)
df
Fecha 20/12/17 21/12/17
Esteban
Revenue $1 $7,890
Impresiones2 1235 99
Impresiones 667 235
Jose
Revenue $12 $2
Impresiones2 35 5
Impresiones 1312 25
Martin
Revenue $146 $123
Impresiones2 56 523
Impresiones 12 6347
Pedro
Revenue $1,256 $22
Impresiones2 124 898
Impresiones 5443 2368
Save df to a new excel file. Repeat the process for every new dataframe that enters the system.
In summary, your entire code listing would look like this -
import pandas as pd
import pandas.io.formats.excel
import numpy as np

def process(dfs):
    df = pd.concat(dfs)\
        .set_index(['Cliente', 'Fecha'])\
        .stack()\
        .unstack(-2)\
        .sort_index(ascending=[True, False])
    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)
    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)
    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()
    return df

if __name__ == '__main__':
    df_list = []
    for file in ['archivo1.xlsx', 'archivo2.xlsx']:
        df_list.append(pd.read_excel(file))
    df = process(df_list)
    with pd.ExcelWriter("test.xlsx",
                        engine='xlsxwriter',
                        date_format='dd/mm/yyyy',
                        datetime_format='dd/mm/yyyy') as writer:
        df.to_excel(writer, sheet_name='Sheet1')
The alternative to this tedious process is to change your dataset structure, and reconsider a more viable alternative that makes it much easier to add new data to existing data without having to keep reshaping everything from scratch. This is something you'll have to sit down and think about.
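As one possible reading of "change your dataset structure", the data could be kept in a long (one value per row) table, so that adding a file is a single concat and the wide report is only built when needed. This is a hedged sketch, not the answer's own method: the helper to_long, the column names Metric and Value, and the in-memory frames standing in for the Excel files are all assumptions made for illustration.

```python
import pandas as pd


def to_long(df):
    """Reshape one daily file to one row per (Fecha, Cliente, Metric)."""
    return df.melt(id_vars=["Fecha", "Cliente"], var_name="Metric", value_name="Value")


# In-memory stand-ins for archivo1.xlsx, archivo2.xlsx and the new archivo
archivo1 = pd.DataFrame({"Fecha": ["20/12/17", "20/12/17"], "Cliente": ["Jose", "Esteban"],
                         "Impresiones": [1312, 667], "Impresiones 2": [35, 1235],
                         "Revenue": ["$12", "$1"]})
archivo2 = pd.DataFrame({"Fecha": ["21/12/17", "21/12/17"], "Cliente": ["Jose", "Esteban"],
                         "Impresiones": [25, 235], "Impresiones 2": [5, 99],
                         "Revenue": ["$2", "$7,890"]})
archivo = pd.DataFrame({"Fecha": ["22/12/17", "22/12/17"], "Cliente": ["Peter", "Esteban"],
                        "Impresiones": [55, 125], "Impresiones 2": [5, 99],
                        "Revenue": ["$2", "$7,890"]})

# Existing data in long form
long_table = pd.concat([to_long(archivo1), to_long(archivo2)], ignore_index=True)

# Adding a new file is just one more concat; nothing is reshaped from scratch
long_table = pd.concat([long_table, to_long(archivo)], ignore_index=True)

# Build the wide, per-date report only when it is needed
report = long_table.set_index(["Cliente", "Metric", "Fecha"])["Value"].unstack("Fecha")
print(report)
```

Clients that are missing on a given date simply show up as NaN in the report, which can be filled with an empty string before writing to Excel.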
I have two dataframes that I read in from CSV files. The first dataframe consists of a phone number and some additional data. The second dataframe contains country codes and country names.
I want to take the phone number from the first dataset and compare it to the country codes of the second. Country codes can be between one and four digits long, so I go from the longest country code to the shortest. If there is a match, I want to assign the country name to the phone number.
Input longlist:
phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
6323450001, info4
496789521134, info5
Input country_list:
country;country_code;order_info
Spain;34;1
Pakistan;92;4
USA;1;2
Philippines;63;3
Germany;49;4
Poland;48;1
Norway;47;2
Output should be:
phonenumber, add_info, country, order_info
34123425209, info1, Spain, 1
92654321762, info2, Pakistan, 4
12018883637, info3, USA, 2
6323450001, info4, Philippines, 3
496789521134, info5, Germany, 4
I have it solved once like this:
#! /usr/bin/python
import csv
import pandas
with open ('longlist.csv','r') as lookuplist:
    with open ('country_list.csv','r') as inputlist:
        with open('Outputfile.csv', 'w') as outputlist:
            reader = csv.reader(lookuplist, delimiter=',')
            reader2 = csv.reader(inputlist, delimiter=';')
            writer = csv.writer(outputlist, dialect='excel')
            for i in reader2:
                for xl in reader:
                    if xl[0].startswith(i[1]):
                        zeile = [xl[0], xl[1], i[0], i[1], i[2]]
                        writer.writerow(zeile)
                lookuplist.seek(0)
But I would like to solve this problem using pandas. What I have got to work so far:
- Read in the csv files
- Remove duplicates from "longlist"
- Sort the list of countries / country codes
This is what I have working already:
import pandas as pd, numpy as np
longlist = pd.read_csv('path/to/longlist.csv',
                       usecols=[2,3], names=['PHONENUMBER','ADD_INFO'])
country_list = pd.read_csv('path/to/country_list.csv',
                           sep=';', names=['COUNTRY','COUNTRY_CODE','ORDER_INFO'], skiprows=[0])
# remove duplicates and make phone number an index
longlist = longlist.drop_duplicates('PHONENUMBER')
longlist = longlist.set_index('PHONENUMBER')
# Sort country list, from high to low value and make country code an index
country_list = country_list.sort_values(by='COUNTRY_CODE', ascending=False)
country_list = country_list.set_index('COUNTRY_CODE')
(...)
longlist.to_csv('path/to/output.csv')
But whatever I try, doing the same with the dataframes does not work. I cannot apply startswith (I cannot iterate through the objects and cannot apply it to them). I would really appreciate your help.
I would do it this way:
cl = pd.read_csv('country_list.csv', sep=';', dtype={'country_code':str})
ll = pd.read_csv('phones.csv', skipinitialspace=True, dtype={'phonenumber':str})
# Series mapping each country code to itself, so .get() returns the code or None
lookup = cl['country_code']
lookup.index = cl['country_code']
ll['country_code'] = (
    ll['phonenumber']
    # try the 4-, 3-, 2- and 1-digit prefixes, longest first
    .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
                                lookup.get(x[:2]), lookup.get(x[:1])]))
    # keep the first (longest) prefix that matched
    .apply(lambda x: x.get(x.first_valid_index()), axis=1)
)
# remove `how='left'` parameter if you don't need "unmatched" phone-numbers
result = ll.merge(cl, on='country_code', how='left')
Output:
In [195]: result
Out[195]:
phonenumber add_info country_code country order_info
0 34123425209 info1 34 Spain 1.0
1 92654321762 info2 92 Pakistan 4.0
2 12018883637 info3 1 USA 2.0
3 12428883637 info31 1242 Bahamas 3.0
4 6323450001 info4 63 Philippines 3.0
5 496789521134 info5 49 Germany 4.0
6 00000000000 BAD None NaN NaN
Explanation:
In [216]: (ll['phonenumber']
.....: .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
.....: lookup.get(x[:2]), lookup.get(x[:1])]))
.....: )
Out[216]:
0 1 2 3
0 None None 34 None
1 None None 92 None
2 None None None 1
3 1242 None None 1
4 None None 63 None
5 None None 49 None
6 None None None None
phones.csv (I've intentionally added one Bahamas number (1242...) and one invalid number (00000000000)):
phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
12428883637, info31
6323450001, info4
496789521134, info5
00000000000, BAD
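The longest-prefix idea from the answer above can also be written as a plain helper function, which some readers may find easier to follow than the nested apply calls. This is a minimal sketch under stated assumptions: the frames are built in memory from a few of the rows shown in this thread instead of being read from the CSV files, and longest_prefix is a hypothetical helper name.

```python
import pandas as pd

# In-memory stand-ins for country_list.csv and phones.csv (codes kept as strings)
countries = pd.DataFrame({
    "country": ["Spain", "USA", "Bahamas"],
    "country_code": ["34", "1", "1242"],
    "order_info": [1, 2, 3],
})
phones = pd.DataFrame({
    "phonenumber": ["34123425209", "12018883637", "12428883637", "00000000000"],
    "add_info": ["info1", "info3", "info31", "BAD"],
})

known_codes = set(countries["country_code"])


def longest_prefix(number, codes=known_codes, max_len=4):
    """Return the longest prefix of `number` that is a known code, else None."""
    for length in range(max_len, 0, -1):
        if number[:length] in codes:
            return number[:length]
    return None


phones["country_code"] = phones["phonenumber"].map(longest_prefix)

# A left merge keeps unmatched numbers, with NaN in the country columns
result = phones.merge(countries, on="country_code", how="left")
print(result)
```

Checking the four-digit prefix first is what makes the Bahamas number resolve to 1242 instead of to the USA code 1.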