I am trying to remove NaN values from a final CSV file, which will source data for digital signage.
I use fillna to blank out the empty data so 'nan' does not appear on the sign. fillna is working, but because the phone-number formatting below is applied to every row, I get ()- in the empty CSV fields.
df = pd.read_csv(filename)
df = df.fillna('')
df = df.astype(str)
df['PhoneNumber'] = df['CONTACT PHONE NUMBER'].apply(lambda x: '(' + x[:3] + ')' + x[3:6] + '-' + x[6:10])
I tried writing an if...else statement to handle the empty rows separately, but since the formatting is applied to the whole column at once, not entry by entry, that doesn't work.
A simple modification to your lambda function should do the job:
>>> y = lambda x: (x and '(' + x[:3] + ')' + x[3:6] + '-' + x[6:10]) or ''
>>> y('123456789')
'(123)456-789'
>>> y('')
''
EDIT:
You could also replace the and/or idiom with an if-else construct:
>>> y = lambda x: '(' + x[:3] + ')' + x[3:6] + '-' + x[6:10] if x else ''
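Putting it together with the question's pipeline, a minimal sketch (the file path is hypothetical; the column names are taken from the question):

import pandas as pd

filename = 'signage.csv'  # hypothetical path
df = pd.read_csv(filename)
df = df.fillna('').astype(str)
# format only non-empty values; empty cells stay '' instead of becoming ()-
df['PhoneNumber'] = df['CONTACT PHONE NUMBER'].apply(
    lambda x: '(' + x[:3] + ')' + x[3:6] + '-' + x[6:10] if x else ''
)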
I have a csv file that has some info inside of it. For my use case, I only need the first four characters in every cell.
So, using Python, I need a solution that will ideally let me remove all characters in each cell after the first four, and optionally remove all spaces. If I could be pointed in the right direction, that'd be great! My data looks like this:
one        two        three
OneOneOne  TwoTwoTwo  ThreeThreeThree
My ideal output should look like:
one   two   three
OneO  TwoT  Thre
It seems like your data contains some numeric values that are not of string type. In that case, you can convert the data to string first, then remove all spaces, and finally take the first 4 characters of each converted string, as follows:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df = df.apply(lambda x: x.astype(str).str.replace(' ', '').str[0:4])
df.to_csv("mycsv.csv") # save to csv
If you don't need to remove spaces, you can use:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df = df.apply(lambda x: x.astype(str).str[0:4])
df.to_csv("mycsv.csv") # save to csv
Result:
print(df)

    one   two  three
0  OneO  TwoT   Thre
Edit
If you want to apply this to only specific columns, for example only columns one and two, you can use:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df[['one', 'two']] = df[['one', 'two']].apply(lambda x: x.astype(str).str.replace(' ', '').str[0:4])
df.to_csv("mycsv.csv") # save to csv
Adapting the answer by @SeaBean to show how to apply it to just selected columns:
df = pd.read_csv("mycsv.csv") # read csv if not already read
cols = ['col_1', 'col_2']  # cols to apply
for col in cols:
    df[col] = df[col].astype(str).str[0:4]
df.to_csv("mycsv.csv") # save to csv
There may be a better way, but I think this could get you started:
import pandas as pd
df = pd.read_csv("myfile.csv")
# remove spaces and keep first four letters
df = df.applymap(lambda x: x.replace(' ', '')[:4])
Update to account for non-string columns. This only changes string columns; if you want to truncate numbers as well, other answers have covered that.
import pandas as pd
file = "myfile.csv"
df = pd.read_csv(file)
# select only columns of type str
cols = (df.applymap(type) == str).all(0)
# first 4 letters of each cell
first_four_no_space = lambda x: x.replace(' ', '')[:4]
df.loc[:, cols] = df.loc[:, cols].applymap(first_four_no_space)
# Warning! This will overwrite your existing file.
# I would rename the output, but it sounds like you want to
# overwrite. Uncomment if you want to overwrite your existing
# file.
# df.to_csv(file, index=False)
I've done some searching and can't figure out how to filter a dataframe by
df["col"].str.contains(word)
However, I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame method?
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by the right-hand side.
contains also accepts a regular expression...
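For instance (a small illustration; the foo|bar pattern here is made up):

# drop rows whose col matches either word, case-insensitively
new_df = df[~df["col"].str.contains("foo|bar", case=False, regex=True)]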
If the above throws a ValueError or TypeError, it is likely because you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
I was having trouble with the not (~) symbol as well, so here's another way from another Stack Overflow thread:
df[df["col"].str.contains('this|that')==False]
You can use apply with a lambda:
df[df["col"].apply(lambda x: word not in x)]
Or if you want to define a more complex rule, you can use and:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
Adding to the answers already posted, here is a framework to find multiple words and negate them from the DataFrame. Here 'word1', 'word2', 'word3', 'word4' are the patterns to search for, df is the DataFrame, and column_a is the name of a column in df:
values_to_remove = ['word1','word2','word3','word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df
    first  second   third
0  myword  myword     NaN
1  myword     NaN  myword
2  myword  myword     NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
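For example, something along these lines (a sketch using the toy frame above):

word = 'myword'
# blank out the NaNs first so contains returns a clean boolean Series
mask = ~df['second'].fillna('').str.contains(word)
filtered = df[mask]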
To negate your query, use ~. Using query has the advantage of returning the valid observations of df directly:
df.query('~col.str.contains("word").values')
In addition to nanselm2's answer, you can use 0 instead of False:
df["col"].str.contains(word)==0
Somehow .contains didn't work for me, but when I tried .isin as mentioned by @kenan in this answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), it worked. Further, if you want to look at the entire dataframe and remove rows containing the specific word (or set of words), just use the loop below:
for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]
Just remove the ~ to get the dataframe that contains the word.
To complement the above question, if someone wants to remove all the rows with strings, one could do:
df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
I have a column named keywords in my pandas dataset. The values of the column are like this:
[jdhdhsn, cultuere, jdhdy]
I want my output to be:
jdhdhsn, cultuere, jdhdy
Try this:
keywords = ['jdhdhsn', 'cultuere', 'jdhdy']
if isinstance(keywords, list):
    output = ', '.join(keywords)
else:
    output = keywords[1:-1]  # a bracketed string: strip the brackets
The column of your dataframe seems to contain lists.
Lists are formatted with brackets wrapped around each element's repr().
Pandas has built-in functions for dealing with strings:
df['column_name'].str lets you apply a string function to each element in the column, just like ', '.join(['foo', 'bar', 'baz']).
Thus df['column_name_str'] = df['column_name'].str.join(', ') will produce a new column with the formatting you're after.
You can also use the .apply to perform arbitrary lambda functions on a column, such as:
df['column_name'].apply(lambda row: ', '.join(row))
But since pandas has the .str accessor built in, this isn't needed for this example.
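A tiny runnable demo of the .str.join approach (the frame below is made up to mirror the question's data):

import pandas as pd

df = pd.DataFrame({'keywords': [['jdhdhsn', 'cultuere', 'jdhdy']]})
# join each list element with ', ' to get a plain string column
df['keywords_str'] = df['keywords'].str.join(', ')
print(df['keywords_str'][0])  # jdhdhsn, cultuere, jdhdy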
Try this:
data = ["[jdhdhsn, cultuere, jdhdy]"]
df = pd.DataFrame(data, columns = ["keywords"])
new_df = df['keywords'].str[1:-1]
print(df)
print(new_df)
I am trying to read DICOM files using pydicom in Python and want to store the header data into a pandas dataframe. How do I extract the data element value for this purpose?
So far I have created a dataframe with columns as the tag names in the DICOM file. I have accessed the data element, but I only need to store the value of the data element and not the entire sequence. For this, I converted the sequence to a string and tried to split it, but that won't work either as different tags have different lengths.
refDs = dicom.dcmread('000000.dcm')
info_header = refDs.dir()
df = pd.DataFrame(columns = info_header)
print(df)
info_data = []
for i in info_header:
    if i in refDs:
        info_data.append(str(refDs.data_element(i)).split(" ")[0])
print(info_data[0], len(info_data))
I have put the data elements in a list as I could not put them into the dataframe directly. The output of the above code is:
(0008, 0050) Accession Number SH: '1091888302507299' 89
But I only want to store the data inside the quotes.
This works for me:
import pydicom as dicom
import pandas as pd
ds = dicom.dcmread('path_to_file')
df = pd.DataFrame(ds.values())
df[0] = df[0].apply(lambda x: dicom.dataelem.DataElement_from_raw(x) if isinstance(x, dicom.dataelem.RawDataElement) else x)
df['name'] = df[0].apply(lambda x: x.name)
df['value'] = df[0].apply(lambda x: x.value)
df = df[['name', 'value']]
Optionally, you can transpose it:
df = df.set_index('name').T.reset_index(drop=True)
Nested fields would require more work if you also need them.
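If you do need them, one possible starting point (a sketch, not a tested solution for your files) is pydicom's Dataset.iterall(), which also walks the elements inside nested sequences:

# collect (name, value) pairs from all elements, including nested ones;
# skip the sequence containers themselves
rows = [(elem.name, elem.value) for elem in ds.iterall() if elem.VR != 'SQ']
df_all = pd.DataFrame(rows, columns=['name', 'value'])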
I am just starting with pandas, so please forgive me if this is something stupid.
I am trying to apply a function to a column, but it's not working and I don't see any errors either.
capitalizer = lambda x: x.upper()
for df in pd.read_csv(downloaded_file, chunksize=2, compression='gzip', low_memory=False):
    df['level1'].apply(capitalizer)
    print(df)
    exit(1)
The print shows the level1 column values the same as in the original csv; upper is not applied. Am I missing something here?
Thanks
apply is not an in-place function - it does not modify values in the original object, so you need to assign the result back:
df['level1'] = df['level1'].apply(capitalizer)
Alternatively, you can use str.upper; it should be much faster:
df['level1'] = df['level1'].str.upper()
df['level1'] = list(map(lambda x: x.upper(), df['level1']))  # list() is needed on Python 3, where map returns an iterator
You can use the code above to make your column uppercase.