How to make pandas read_csv distinguish strings based on quoting - python

I want pandas.io.parsers.read_csv to distinguish between strings and the other data types based on the fact that strings in my CSV file are always "quoted". Is that possible?
I have the following csv example:
"ID"|"DATE"|"NAME"|"YEAR"|"FLOAT"|"BOOL"
"01"|2000-01-01|"Name1"|1975|1.2|1
"02"||""||||
It should give me a dataframe where all the quoted values are strings. Most likely pandas will make everything else np.float64, but I can deal with that afterwards. I want to hold off on using dtype, because I have many columns and don't want to map types for all of them. I would like to make the distinction purely quote-based, if possible.
I tried quotechar='"' and quoting=3, but quotechar doesn't do anything at all, while quoting=3 keeps the literal "" values, which I don't want either. It seems to me the pandas parsers should be able to do this, since quoting is the way strings are distinguished in CSV files.

Specifying dtypes would be the more straightforward way, but if you don't want to do that I'd suggest using quoting=3 and cleaning up afterwards.
strip_char = lambda x: x.strip('"')
In [40]: df = pd.read_csv(StringIO(s), sep='|', quoting=3)
In [41]: df
Out[41]:
"ID" "DATE" "NAME" "YEAR" "FLOAT" "BOOL"
0 "01" 2000-01-01 "Name1" 1975 1.2 1
1 "02" NaN "" NaN NaN NaN
[2 rows x 6 columns]
In [42]: df = df.rename(columns=strip_char)
In [43]: df[['ID', 'NAME']] = df[['ID', 'NAME']].applymap(strip_char)
In [44]: df
Out[44]:
   ID        DATE   NAME  YEAR  FLOAT  BOOL
0  01  2000-01-01  Name1  1975    1.2     1
1  02         NaN          NaN    NaN   NaN
[2 rows x 6 columns]
In [45]: df.dtypes
Out[45]:
ID object
DATE object
NAME object
YEAR float64
FLOAT float64
BOOL float64
dtype: object
EDIT: Then you can set the index:
In [11]: df = df.set_index('ID')
In [12]: df
Out[12]:
          DATE   NAME  YEAR  FLOAT  BOOL
ID
01  2000-01-01  Name1  1975    1.2     1
02         NaN          NaN    NaN   NaN
[2 rows x 5 columns]
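For reference, here is a self-contained sketch of the same approach. The `s` used in the transcript is not shown there, so it is defined here from the question's sample (the second data row is written with six fields so that read_csv accepts it), and quoting=3 is csv.QUOTE_NONE:
from io import StringIO

import pandas as pd

s = '''"ID"|"DATE"|"NAME"|"YEAR"|"FLOAT"|"BOOL"
"01"|2000-01-01|"Name1"|1975|1.2|1
"02"||""|||'''

strip_char = lambda x: x.strip('"')

df = pd.read_csv(StringIO(s), sep='|', quoting=3)  # quotes are kept as literal characters
df = df.rename(columns=strip_char)                 # strip the quotes from the headers
for col in ['ID', 'NAME']:                         # ...and from the quoted (string) columns
    df[col] = df[col].str.strip('"')
df = df.set_index('ID')
print(df.dtypes)                                   # DATE and NAME are object, the rest float64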

Related

Concatenate multiple columns of dataframe with a separating character for non-null values

I have a data frame like this:
df:
C1 C2 C3
1 4 6
2 NaN 9
3 5 NaN
NaN 7 3
I want to concatenate the 3 columns into a single column with a comma as a separator.
But I want the comma (",") only between non-null values.
I tried this, but it doesn't handle the null values:
df['New_Col'] = df[['C1','C2','C3']].agg(','.join, axis=1)
This gives me the output:
New_Col
1,4,6
2,,9
3,5,
,7,3
This is my ideal output:
New_Col
1,4,6
2,9
3,5
7,3
Can anyone help me with this?
Judging by your (wrong) output, you have a dataframe of strings, and the NaN values are actually empty strings (otherwise ','.join would throw TypeError: expected str instance, float found, because NaN is a float).
Since you're dealing with strings, which pandas is not optimized for, a vanilla Python list comprehension is probably the most efficient choice here.
df['NewCol'] = [','.join([e for e in x if e]) for x in df.values]
In your case, use stack, which drops real NaN values:
df['new'] = df.stack().astype(int).astype(str).groupby(level=0).agg(','.join)
Out[254]:
0 1,4,6
1 2,9
2 3,5
3 7,3
dtype: object
You can use filter to get rid of NaNs:
df['New_Col'] = df.apply(lambda row: ','.join(filter(lambda v: v is not np.nan, list(row))), axis=1)
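For completeness, a runnable sketch of the list-comprehension route, assuming (as the first answer infers) a frame of strings in which the missing entries are empty strings:
import pandas as pd

# Sample frame of strings; missing entries are empty strings, not NaN.
df = pd.DataFrame({'C1': ['1', '2', '3', ''],
                   'C2': ['4', '', '5', '7'],
                   'C3': ['6', '9', '', '3']})

# Join only the non-empty entries of each row.
df['New_Col'] = [','.join([e for e in row if e]) for row in df.values]
print(df['New_Col'].tolist())  # ['1,4,6', '2,9', '3,5', '7,3']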

How to change the format of float values in a column that also has NaN values in a pandas DataFrame in Python?

I have Pandas DataFrame in Python like below:
col
-------
7.0
2.0
NaN
...
"col1" is in float data type but I would like to convert displaying of floar values in this column from for example 7.0 to 7. I can not simply change date type to int because I have also "NaN" values in col1.
So as a result I need something like below:
col
-------
7
2
NaN
...
How can I do that in Python pandas?
You can use convert_dtypes to perform an automatic conversion. For a single column:
df['col'] = df['col'].convert_dtypes()
For all columns:
df = df.convert_dtypes()
output:
col
0 7
1 2
2 <NA>
After conversion:
df.dtypes
col Int64
dtype: object
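Alternatively, if you prefer to be explicit about the target type, recent pandas versions let you cast straight to the nullable integer dtype that convert_dtypes picks here:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [7.0, 2.0, np.nan]})

# Cast to the nullable Int64 dtype; NaN is displayed as <NA>.
df['col'] = df['col'].astype('Int64')
print(df)
#     col
# 0     7
# 1     2
# 2  <NA>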

python, pandas, work through bad data

So I've got a very large dataframe of mostly floats (read from a CSV), but every now and then I get a string or NaN:
date load
0 2016-07-12 19:04:31.604999 0
...
10 2016-07-12 19:04:31.634999 nan
...
50 2016-07-12 19:04:31.664999 ".942.197"
...
I can deal with NaNs (interpolate), but I can't figure out how to use replace to catch the strings and not the numbers.
df.replace(to_replace='^[a-zA-Z0-9_.-]*$', regex=True, value=float('nan'))
returns all NaNs. I want NaNs only where the value is actually a string.
I think you want pandas.to_numeric. It works with series-like data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
In [3]: df
Out[3]:
load
0 0
1 NaN
2 .942.197
In [4]: pd.to_numeric(df['load'], errors='coerce')
Out[4]:
0 0.0
1 NaN
2 NaN
Name: load, dtype: float64
Actually, to_numeric will try to convert every item to numeric, so if you have a string that looks like a number it will be converted:
In [5]: df = pd.DataFrame([0, float('nan'), '123.456'], columns=['load'])
In [6]: df
Out[6]:
load
0 0
1 NaN
2 123.456
In [7]: pd.to_numeric(df['load'], errors='coerce')
Out[7]:
0 0.000
1 NaN
2 123.456
Name: load, dtype: float64
I am not aware of any way to convert every non-numeric type to NaN, other than iterating (or maybe using apply or map) and checking with isinstance.
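A minimal sketch of that iterate-and-check idea, using map with an isinstance test; unlike to_numeric, numeric-looking strings such as '123.456' become NaN rather than numbers:
import numbers

import pandas as pd

df = pd.DataFrame([0, float('nan'), '.942.197', '123.456'], columns=['load'])

# Keep real numbers (NaN is a float, so it survives); everything else becomes NaN.
df['load'] = df['load'].map(lambda v: v if isinstance(v, numbers.Number) else float('nan')).astype(float)
print(df['load'].tolist())  # [0.0, nan, nan, nan]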
It's my understanding that .replace() will only apply to string datatypes. If you apply it to a non-string datatype (e.g. your numeric types), it will return nan. Converting the entire frame/series to string before using replace would work around this, but probably isn't the "best" way of doing so (e.g. see @Goyo's answer)!
See the notes on this page.

Dropping column values that don't meet a requirement

I have a pandas data frame with a 'date_of_birth' column. Values take the form 1977-10-24T00:00:00.000Z for example.
I want to grab the year, so I tried the following:
X['date_of_birth'] = X['date_of_birth'].apply(lambda x: int(str(x)[:4]))
This works if I am guaranteed that the first 4 characters are always digits, but it fails on my data set because some dates are messed up or garbage. Is there a way I can adjust my lambda without using regex? If not, how could I write this in regex?
I think it would be better to just use to_datetime to convert to datetime dtype. You can drop the invalid rows using dropna and access just the year attribute using dt.year:
In [58]:
df = pd.DataFrame({'date':['1977-10-24T00:00:00.000Z', 'duff', '200', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')
df
Out[58]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
1 duff NaT
2 200 NaT
3 2016-01-01 2016-01-01
In [59]:
df.dropna()
Out[59]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
3 2016-01-01 2016-01-01
In [60]:
df['mod_dates'].dt.year
Out[60]:
0 1977.0
1 NaN
2 NaN
3 2016.0
Name: mod_dates, dtype: float64
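The years come back as floats only because the NaT rows force the column to float; dropping the unparseable rows first gives plain integers. A small sketch with simplified sample dates:
import pandas as pd

df = pd.DataFrame({'date': ['1977-10-24', 'duff', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')

# With the NaT row still present the years are floats...
print(df['mod_dates'].dt.year.tolist())           # [1977.0, nan, 2016.0]
# ...after dropping it they are plain integers.
print(df.dropna()['mod_dates'].dt.year.tolist())  # [1977, 2016]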

Python Pandas drop columns based on max value of column

I'm just getting going with pandas as a tool for munging two-dimensional arrays of data. It's super overwhelming, even after reading the docs. You can do so much that I can't figure out how to do anything, if that makes any sense.
My dataframe (simplified):
Date Stock1 Stock2 Stock3
2014.10.10 74.75 NaN NaN
2014.9.9 NaN 100.95 NaN
2010.8.8 NaN NaN 120.45
So each column only has one value.
I want to remove all columns that have a max value less than x. So, as an example, if x = 80, then I want a new DataFrame:
Date Stock2 Stock3
2014.10.10 NaN NaN
2014.9.9 100.95 NaN
2010.8.8 NaN 120.45
How can this be achieved? I've looked at dataframe.max(), which gives me a series. Can I use that, or have a lambda function somehow in select()?
Use df.max() to index with.
In [19]: from pandas import DataFrame
In [23]: df = DataFrame(np.random.randn(3,3), columns=['a','b','c'])
In [36]: df
Out[36]:
a b c
0 -0.928912 0.220573 1.948065
1 -0.310504 0.847638 -0.541496
2 -0.743000 -1.099226 -1.183567
In [24]: df.max()
Out[24]:
a -0.310504
b 0.847638
c 1.948065
dtype: float64
Next, we make a boolean expression out of this:
In [31]: df.max() > 0
Out[31]:
a False
b True
c True
dtype: bool
Next, you can index df.columns by this (this is called boolean indexing):
In [34]: df.columns[df.max() > 0]
Out[34]: Index([u'b', u'c'], dtype='object')
Which you can finally pass to DF:
In [35]: df[df.columns[df.max() > 0]]
Out[35]:
b c
0 0.220573 1.948065
1 0.847638 -0.541496
2 -1.099226 -1.183567
Of course, instead of 0, you use any value that you want as the cutoff for dropping.
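Putting the pieces together, a compact sketch using the question's own data; setting Date aside as the index lets df.max() see only the numeric columns, and .loc does the boolean column selection in one step:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['2014.10.10', '2014.9.9', '2010.8.8'],
                   'Stock1': [74.75, np.nan, np.nan],
                   'Stock2': [np.nan, 100.95, np.nan],
                   'Stock3': [np.nan, np.nan, 120.45]})

x = 80
df = df.set_index('Date')                       # keep the non-numeric column out of max()
result = df.loc[:, df.max() > x].reset_index()  # boolean column selection
print(result)                                   # Date, Stock2 and Stock3 remain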
