I have a csv file that has some info inside of it. For my use case, I only need the first four characters in every cell.
So, using Python, I need a solution that ideally removes all characters in each cell after the first four, and optionally removes all spaces. If I could be pointed in the correct direction, that'd be great!
Sample data:
one        two        three
OneOneOne  TwoTwoTwo  ThreeThreeThree
My ideal output should look like:
one   two   three
OneO  TwoT  Thre
It seems like your data contains some numeric values that are not of string type. In that case, you can convert the data to string first, then remove all spaces, and finally take the first 4 characters of each converted string, as follows:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df = df.apply(lambda x: x.astype(str).str.replace(' ', '').str[0:4])
df.to_csv("mycsv.csv") # save to csv
If you don't need to remove spaces, you can use:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df = df.apply(lambda x: x.astype(str).str[0:4])
df.to_csv("mycsv.csv") # save to csv
Result:
print(df)
    one   two  three
0  OneO  TwoT   Thre
Edit
If you want to apply this to only specific columns, you can. For example, to apply only to columns one and two:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df[['one', 'two']] = df[['one', 'two']].apply(lambda x: x.astype(str).str.replace(' ', '').str[0:4])
df.to_csv("mycsv.csv") # save to csv
Adapting the answer by @SeaBean to show how to apply it to just selected columns:
df = pd.read_csv("mycsv.csv") # read csv if not already read
cols = ['col_1', 'col_2'] # cols to apply
for col in cols:
    df[col] = df[col].astype(str).str[0:4]
df.to_csv("mycsv.csv") # save to csv
There may be a better way, but I think this could get you started:
import pandas as pd
df = pd.read_csv("myfile.csv")
# remove spaces and keep first four letters
df = df.applymap(lambda x: x.replace(' ', '')[:4])
Update to account for non-string columns. This only changes string columns. If you want to truncate numbers as well, other answers have covered that.
import pandas as pd
file = "myfile.csv"
df = pd.read_csv(file)
# select only columns of type str
cols = (df.applymap(type) == str).all(0)
# first 4 letters of each cell
first_four_no_space = lambda x: x.replace(' ', '')[:4]
df.loc[:, cols] = df.loc[:, cols].applymap(first_four_no_space)
# Warning! This will overwrite your existing file.
# I would rename the output, but it sounds like you want to
# overwrite. Uncomment if you want to overwrite your existing
# file.
# df.to_csv(file, index=False)
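For illustration, here is how that column mask behaves on a small, hypothetical mixed-type frame (note that on pandas 2.1+, applymap is deprecated in favor of DataFrame.map):
import pandas as pd

demo = pd.DataFrame({'a': ['x y', 'long text'], 'b': [1, 2]})
print((demo.applymap(type) == str).all(0))  # a: True, b: False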
Related
I have a short script to pivot data. The first column is a 9-digit ID number, often beginning with zeros, such as 000123456.
Here is the script:
import pandas as pd

df = pd.read_csv('source')
new_df = df.pivot_table(index='id',
                        columns=df.groupby('id').cumcount().add(1),
                        values=['prog_id', 'prog_type'],
                        aggfunc='first').sort_index(axis=1, level=1)
new_df.columns = [f'{x}_{y}' for x, y in new_df.columns]
new_df.to_csv('destination')
print(new_df)
Although the CSV is being read with an id such as 000123456, the output only contains 123456.
Even when setting an explicit dtype, pandas removes the leading zeros. Is there a workaround for telling pandas to keep the leading zeros?
Per comment on original post, set dtype as string:
df = pd.read_csv('source', dtype={'id': str})  # np.str was removed in NumPy 1.24; plain str works
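If every column should stay text, a single dtype for the whole file also works:
df = pd.read_csv('source', dtype=str)  # read every column as a string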
You could use pandas' zfill() method right after reading your csv file "source". Basically, you would pad the values of your "id" column with as many zeros as needed, in this particular case making the number 9 digits long (3 zeros + 6 original digits). So we would have:
df = pd.read_csv('source')
df['id'] = df['id'].astype(str).str.zfill(9)  # zfill works on strings, so convert first
# (...)
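For reference, zfill simply left-pads a string with zeros to the requested width:
>>> '123456'.zfill(9)
'000123456'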
I am trying to read DICOM files using pydicom in Python and want to store the header data into a pandas dataframe. How do I extract the data element value for this purpose?
So far I have created a dataframe with columns as the tag names in the DICOM file. I have accessed the data element, but I only need to store the value of the data element and not the entire sequence. For this, I converted the sequence to a string and tried to split it. But that won't work either, as the lengths of different tags differ.
import pydicom as dicom
import pandas as pd

refDs = dicom.dcmread('000000.dcm')
info_header = refDs.dir()
df = pd.DataFrame(columns=info_header)
print(df)
info_data = []
for i in info_header:
    if i in refDs:
        info_data.append(str(refDs.data_element(i)).split(" ")[0])
print(info_data[0], len(info_data))
I have put the data elements in a list as I could not put them into the dataframe directly. The output of the above code is:
(0008, 0050) Accession Number SH: '1091888302507299' 89
But I only want to store the data inside the quotes.
This works for me:
import pydicom as dicom
import pandas as pd
ds = dicom.dcmread('path_to_file')  # dcmread replaces the deprecated read_file
df = pd.DataFrame(ds.values())
df[0] = df[0].apply(lambda x: dicom.dataelem.DataElement_from_raw(x) if isinstance(x, dicom.dataelem.RawDataElement) else x)
df['name'] = df[0].apply(lambda x: x.name)
df['value'] = df[0].apply(lambda x: x.value)
df = df[['name', 'value']]
Eventually, you can transpose it:
df = df.set_index('name').T.reset_index(drop=True)
Nested fields would require more work if you also need them.
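If you only need a handful of named tags rather than the whole header, you could also pull the values directly (a minimal sketch; 'AccessionNumber' is just an example keyword):
import pydicom

ds = pydicom.dcmread('000000.dcm')
print(ds.data_element('AccessionNumber').value)  # just the value, e.g. '1091888302507299'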
The question is quite self-explanatory. Is there any way to read the csv file's time series data while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column, just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A dataframe is like a dictionary, where column names are keys and the columns are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be:
{'b': 2}
This is what I did above to remove a column from the dataframe. Done this way, you don't need to assign the result back to the dataframe as in the other answer(s).
df = df.iloc[:, 1:]
This keeps all rows and every column from the second one onwards; no inplace=True is needed anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
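You can also drop the column while reading, since read_csv's usecols parameter accepts a callable over the column names (a sketch; substitute your real column name):
df = pd.read_csv("occupancyrates.csv", usecols=lambda name: name != 'your_column_to_drop')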
I have one csv, test1.csv (it has no headers!). As you can see, the delimiter is a pipe, but there is also exactly one tab after the eighth column.
ug|s|b|city|bg|1|94|ON-05-0216 9.72|28|288
ug|s|b|city|bg|1|94|ON-05-0217 9.72|28|288
I have a second file, test2.csv, with only the pipe delimiter:
ON-05-0216|100|50
ON-05-0180|244|152
ON-05-0219|269|146
Because only one value (ON-05-0216) from the eighth column of the first file matches the first column of the second file, the output should contain only that one row, with an added SUM column built from the second and third columns of the second file (100+50).
So the final result is the following:
ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
or
ug|s|b|city|bg|1|94|ON-05-0216|Total=150 9.72|28|288
whatever is easier.
I thought that pandas would be the best tool to use. But I'm stuck on handling the multiple delimiters in the first file and on matching columns without column names, so I'm not sure how to continue.
import pandas as pd
a = pd.read_csv("test1.csv", header=None)
b = pd.read_csv("test2.csv", header=None)
merged = a.merge(b)
merged.to_csv("output.csv", index=False)
Thank you in advance
Use:
import pandas as pd
import numpy as np

# Reading files
df1 = pd.read_csv('test1.csv', header=None, sep='|')
df2 = pd.read_csv('test2.csv', header=None, sep='|')
# splitting file on tab and concatenating with rest
ndf = pd.concat([df1.iloc[:,:7], df1[7].str.split('\t', expand=True), df1.iloc[:,8:]], axis=1)
ndf.columns = np.arange(11)
# adding values from df2 and bringing in format Total=sum
df2.columns = ['c1', 'c2', 'c3']
tot = df2.eval('c2+c3').apply(lambda x: 'Total='+str(x))
# Finding which rows needs to be retained
idx_1 = ndf.iloc[:,7].str.split('-',expand=True).iloc[:,2]
idx_2 = df2.c1.str.split('-',expand=True).iloc[:,2]
idx = idx_1.isin(idx_2) # Updated
ndf = ndf[idx].reset_index(drop=True)
tot = tot[idx].reset_index(drop=True)
# appending the total to the key column and writing the output csv
ndf.iloc[:,7] = ndf.iloc[:,7].map(str) + chr(9) + tot  # chr(9) is a tab
ndf.to_csv('out.csv', sep='|', header=None, index=None)
# OUTPUT
# ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
You can use pipe as the delimiter when reading the csv, pd.read_csv(..., sep='|'), and only split the tab-separated column later on with str.split('\t', expand=True), as shown in the answer above.
When merging two dataframes, you must have a common column to merge on. You could use it as the index for easier appending after you do the necessary math on the separate dataframes.
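Putting those two ideas together, a minimal sketch of the merge-based approach could look like this (column positions assumed from the sample data above):
import pandas as pd

df1 = pd.read_csv('test1.csv', sep='|', header=None)
df2 = pd.read_csv('test2.csv', sep='|', header=None, names=['key', 'v1', 'v2'])

# column 7 of test1.csv holds "<key>\t<value>"; split it on the tab
df1[['key', 'rate']] = df1[7].str.split('\t', expand=True)

merged = df1.merge(df2, on='key')              # keeps only the matching row(s)
merged['total'] = merged['v1'] + merged['v2']  # e.g. 100 + 50 -> 150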
I am trying to remove nan values from a final csv file, which will source data for digital signage.
I use fillna to blank out missing data so 'nan' doesn't appear on the sign. fillna is working, but because the formatting below is applied to every row, I get ()- in the empty csv fields.
import pandas as pd

df = pd.read_csv(filename)
df = df.fillna('')
df = df.astype(str)
df['PhoneNumber'] = df['CONTACT PHONE NUMBER'].apply(lambda x: '(' + x[:3] + ')' + x[3:6] + '-' + x[6:10])
I tried writing an if...else statement to separate lines in the array, but since the formatting is applied to the entire list, not entry by entry, that doesn't work.
A simple modification to your lambda function should do the job:
>>> y=lambda x: (x and '('+x[:3]+')'+x[3:6]+'-'+x[6:10]) or ''
>>> y('123456789')
'(123)456-789'
>>> y('')
''
EDIT:
You could also replace the and/or idiom with an if-else construct:
>>> y=lambda x: '('+x[:3]+')'+x[3:6]+'-'+x[6:10] if x else ''
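Applied back to the dataframe from the question, that would look like:
df['PhoneNumber'] = df['CONTACT PHONE NUMBER'].apply(
    lambda x: '(' + x[:3] + ')' + x[3:6] + '-' + x[6:10] if x else '')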