I have a large pipe-separated ("|") file with an address field where a bunch of values look like the following:
...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...
This ends up split across two lines:
line1) ...xxx|yyy|Level 1 2 xxx Street\
line2) (MYCompany)|...
I tried quoting=2 (csv.QUOTE_NONNUMERIC, to treat non-numeric values as strings) in pandas read_table, but pandas still starts a new row after the backslash. What is an efficient way to either skip rows whose fields contain a backslash escape to a new line, or, better, to ignore the newline after the \?
Ideally this would prepare the data file so it can be read into a pandas dataframe.
Update: showing 5 lines of the file, with the breakage in the 3rd record:
1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
Here is another solution using regex:
import pandas as pd
import re

# Read the whole file into one string
with open('input.tsv') as f:
    fl = f.read()

# Replace backslash + newline with a lone backslash, rejoining the broken rows
fl = re.sub(r'\\\n', r'\\', fl)

with open('input_fix.tsv', 'w') as o:
    o.write(fl)

# Prime the number of columns by specifying names for each column;
# this takes care of the issue of a variable number of columns per row
cols = range(1, 17)
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)
This produces a repaired file that then loads cleanly into a dataframe.
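For a file too large to read into memory at once, a line-by-line variant of the same idea (a sketch, using the filenames from above and assuming a trailing backslash is the only source of breakage):

# Rejoin rows that were split after a trailing backslash, streaming line by line
with open('input.tsv') as src, open('input_fix.tsv', 'w') as dst:
    buf = ''
    for line in src:
        buf += line.rstrip('\n')
        if buf.endswith('\\'):   # the row continues on the next physical line
            continue
        dst.write(buf + '\n')
        buf = ''
    if buf:                      # flush a final unterminated row, if any
        dst.write(buf + '\n')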
You can try first reading with read_csv and a sep which is NOT in the values; it then seems to read the broken lines correctly:
import pandas as pd
import io
temp=u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
Then you can create a new file with to_csv and read it back with read_csv and sep="|":
df.to_csv('myfile.csv', header=False, index=False)
print(pd.read_csv('myfile.csv', sep="|", header=None))
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
The next solution creates no new file; instead it writes to the variable output and then reads it back with read_csv and io.StringIO:
import pandas as pd
import io
temp=u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
output = df.to_csv(header=False, index=False)
print(output)
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia
print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
If I test it on your data, rows 1 and 2 have 14 fields, while the next two have 15.
So I removed the last item from rows 3 and 4; maybe this is only a typo (I hope it is):
import pandas as pd
import io
temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 1788768|1831171|208434489|2014-08-14 13:40:02|...
1 1788772|1831177|202234489|2014-08-14 13:41:37|...
2 1788776|1831182|205234489|2014-08-14 13:42:41|...
3 1788780|1831186|202634489|2014-08-14 13:43:46|...
output = df.to_csv(header=False, index=False)
print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13
0 3025 Melbourne
1 2116 Sydney
2 2000 Sydney
3 2444 NSW Other
But if the data are correct, add the parameter names=range(15) to read_csv:
print(pd.read_csv(io.StringIO(u"" + output), sep="|", names=range(15)))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13 14
0 3025 Melbourne NaN
1 2116 Sydney NaN
2 2000 Sydney Sydney
3 2444 NSW Other Port Macquarie
Related
I have a pandas dataframe with a State column, from which I need to extract the token ending in one of [ft, mi, FT, MI] using a regular expression and store it in another column.
df1 = pd.DataFrame({
    'State': ['Arizona 4.47ft', 'Georgia 1023mi', 'Newyork 2022 NY 74.6 FT', 'Indiana 747MI(In)', 'Florida 453mi FL']})
Expected output
State Distance
0 Arizona 4.47ft 4.47ft
1 Georgia 1023mi 1023mi
2 Newyork NY 74.6ft 74.6ft
3 Indiana 747MI(In) 747MI
4 Florida 453mi FL 453mi
Would anyone please help?
Build a regex pattern with the help of the list l, then use str.extract to pull the first occurrence of the pattern out of the State column:
l = ['ft','mi','FT','MI']
df1['Distance'] = df1['State'].str.extract(r'(\S+(?:%s))\b' % '|'.join(l))
State Distance
0 Arizona 4.47ft 4.47ft
1 Georgia 1023mi 1023mi
2 Newyork 2022 NY 74.6FT 74.6FT
3 Indiana 747MI(In) 747MI
4 Florida 453mi FL 453mi
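For reference, the join expands to the single pattern r'(\S+(?:ft|mi|FT|MI))\b'; here is a minimal standalone check (with a hypothetical two-row Series):

import pandas as pd

l = ['ft', 'mi', 'FT', 'MI']
pattern = r'(\S+(?:%s))\b' % '|'.join(l)   # expands to r'(\S+(?:ft|mi|FT|MI))\b'
s = pd.Series(['Arizona 4.47ft', 'Georgia 1023mi'])
print(s.str.extract(pattern))   # column 0 holds '4.47ft' and '1023mi'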
I have tried different combinations to extract the country names from a column and create a new column with solely the countries. I can do it for selected rows, e.g. df.address[9998], but not for the whole column.
import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)
Any ideas what is going wrong here?
edit:
address is an object column in the df, and df.address[:10] looks like this:
Address
0 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 Turin, Italy
9 Kristiansand, Norway
Name: address, Length: 10, dtype: object
Based on Petar's response: when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or ranges like df.address[:5]), I get an empty Cntr.
import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df['address'][1]:
        Cntr.append(country.name)
Cntr
Returns
['Italy']
and the same check on df.address[2] returns []
etc.
I have also run
df['address'] = df['address'].astype('str')
to make sure that there are no floats or ints in the column.
Sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})
df[['city', 'country']] = df['address'].str.split(',', expand=True, n=2)
address city country
0 Turin, Italy Turin Italy
1 NaN NaN NaN
2 Zurich, Switzerland Zurich Switzerland
3 NaN NaN NaN
4 Glyfada, greece Glyfada greece
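Note that splitting on ',' keeps the space that followed the comma at the start of the country values; if that matters, strip it afterwards:

# Remove surrounding whitespace left over from the split
df['city'] = df['city'].str.strip()
df['country'] = df['country'].str.strip()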
You were really close, but we cannot loop like this: for country.name in df.address. Instead:
import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)
If this does not work, please supply more information because I am unsure what df.address looks like.
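As a vectorized alternative to the loop (my own sketch, not part of the answer above), you can build one regex alternation from all pycountry names and extract the first match per address:

import re

import pandas as pd
import pycountry

# One alternation over all country names; re.escape guards special characters
pattern = '(%s)' % '|'.join(re.escape(c.name) for c in pycountry.countries)
df['country'] = df['address'].str.extract(pattern, expand=False)  # NaN where nothing matches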
You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.
import numpy as np
import pandas as pd
from dataprep.clean import clean_country

df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
address address_clean
0 Turin, Italy Italy
1 NaN NaN
2 Zurich, Switzerland Switzerland
3 NaN NaN
4 Glyfada, Greece Greece
I need to automate the validations performed on text files. I have two text files; if a row in file 1 has a unique combination of two columns, and the same combination of columns is present in file 2, then the new column from file 2 needs to be written into file 1.
Text file 1 has thousands of records, and text file 2 is considered a reference for text file 1.
As of now I have written the following code. Please help me solve this.
import pandas as pd

# read_csv already returns a DataFrame, no need to wrap it in pd.DataFrame
df = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample2.txt", delimiter=',')
print(df)
# uniquecal = df[['vehicle_Brought_City', 'Vehicle_Brand']]
# print(uniquecal)

df1 = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample1.txt", delimiter=',')
print(df1)
# uniquecal1 = df1[['vehicle_Brought_City', 'Vehicle_Brand']]
# print(uniquecal1)
How can I put the vehicle price into dataframe one and save it back to text file 1?
Below is my sample dataset:
File1:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda NaN
1 aaa yyy mumbai tvs NaN
2 aaa xxx hyd maruti NaN
3 bbb xxx pune honda NaN
4 bbb aaa mumbai tvs NaN
File2:
vehicle_Brought_City Vehicle_Brand Vehicle_price
0 pune honda 50000
1 mumbai tvs 40000
2 hyd maruti 45000
# Drop the empty price column from file 1, then merge on the two key columns
del df['Vehicle_price']
print(df)
dd = pd.merge(df, df1, on=['vehicle_Brought_City', 'Vehicle_Brand'])
print(dd)
output:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda 50000
1 aaa yyy mumbai tvs 40000
2 bbb aaa mumbai tvs 40000
3 aaa xxx hyd maruti 45000
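To also save the merged result back as text file 1, write it out with to_csv (a sketch, with a hypothetical output path):

# Persist the merged frame with the same comma delimiter as the input files
dd.to_csv("C:\\Users\\hp\\Desktop\\py\\sample2_merged.txt", sep=',', index=False)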
I am attempting to append two DataFrames using pandas, but instead of stacked rows the result is full of NaN values. How can I resolve this?
Here's the first DataFrame (after I load to Python):
name State
0 Tom NY
1 Lee CA
Here's the second DataFrame (after I load to Python) with no header:
0 1
0 Jon FL
1 Tan NJ
I attempt to append the DataFrames using:
pd.concat([df1,df2])
The result is:
name State 0 1
0 Tom NY NaN NaN
1 Lee CA NaN NaN
0 NaN NaN Jon FL
1 NaN NaN Tan NJ
I want the result to be:
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
I've made the following attempt, but it doesn't work:
pd.concat([df1,df2], axis=1)
Here is my second unsuccessful attempt:
pd.concat([df1,df2], ignore_index=True)
Align your column names and use append (note: DataFrame.append was removed in pandas 2.0, so on current versions use pd.concat as in the next answer):
df2.columns = df1.columns
df1.append(df2).reset_index(drop=True)
# Result
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
Rename the columns and then concat them:
df2.columns = df1.columns
pd.concat([df1, df2], ignore_index=True)
Output:
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
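Since the second file has no header row, make sure it is read with header=None in the first place so its first data row is not consumed as column names (a sketch with hypothetical filenames):

import pandas as pd

df1 = pd.read_csv('file1.csv')               # has a header: name, State
df2 = pd.read_csv('file2.csv', header=None)  # headerless; columns become 0, 1
df2.columns = df1.columns
result = pd.concat([df1, df2], ignore_index=True)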
I need to import web-based data (as posted below) into Python. I used urllib2.urlopen (data available here). However, the data was imported as string lines. How can I convert them into a pandas DataFrame while stripping away the double quotes (")? Thank you for your help.
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"
You can do:
>>> import pandas as pd
>>> df=pd.read_csv('https://raw.githubusercontent.com/QuantEcon/QuantEcon.py/master/data/test_pwt.csv')
>>> df
country country isocode year POP XRAT \
0 Argentina ARG 2000 37335.653 0.999500
1 Australia AUS 2000 19053.186 1.724830
2 India IND 2000 1006300.297 44.941600
3 Israel ISR 2000 6114.570 4.077330
4 Malawi MWI 2000 11801.505 59.543808
5 South Africa ZAF 2000 45064.098 6.939830
6 United States USA 2000 282171.957 1.000000
7 Uruguay URY 2000 3219.793 12.099592
tcgdp cc cg
0 295072.218690 75.716805 5.578804
1 541804.652100 67.759026 6.720098
2 1728144.374800 64.575551 14.072206
3 129253.894230 64.436451 10.266688
4 5026.221784 74.707624 11.658954
5 227242.369490 72.718710 5.726546
6 9898700.000000 72.347054 6.032454
7 25255.961693 78.978740 5.108068
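read_csv can also consume the file-like object returned by urlopen directly, and its default quotechar='"' is what strips the double quotes (a sketch using Python 3's urllib.request in place of urllib2):

from urllib.request import urlopen

import pandas as pd

url = 'https://raw.githubusercontent.com/QuantEcon/QuantEcon.py/master/data/test_pwt.csv'
with urlopen(url) as resp:
    df = pd.read_csv(resp)  # the default quotechar='"' handles the quoted fields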