I have a data set in CSV format that I read with pd.read_csv. I want to sort the existing data by descending value.
My code is this:
dataset = pd.read_csv('Kripto.csv')
sorted = dataset.sort_values(by = "Freq1", ascending=False)
x = dataset.iloc[:, :].values
And my data set (print(dataset)) looks like this:
Letter;Freq1
0 A;0.0817
1 B;0.0150
2 C;0.0278
3 D;0.0425
4 E;0.1270
When I run this line:
sorted = dataset.sort_values(by = "Freq1", ascending=False)
Python gives me an error: KeyError: 'Freq1'.
I know that "Freq1" is not being recognized as the name of the column, but I have no idea how to assign a name.
Your CSV file has ';' as the separator, so you need to indicate that in the read_csv call:
import pandas as pd
dataset = pd.read_csv('your.csv', sep=';')
And that's all you need to do.
Your CSV file uses semicolons to separate values. Since pandas by default expects commas, use
dataset = pd.read_csv('Kripto.csv', sep=';')
instead.
You should also use the sorted dataset to get your values in sorted order, instead of dataset, since the latter will remain unsorted:
x = sorted.iloc[:, :].values
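Putting both answers together, a minimal end-to-end sketch (assuming the Kripto.csv layout shown in the question):
import pandas as pd
# Read with the semicolon separator so 'Letter' and 'Freq1' become separate columns
dataset = pd.read_csv('Kripto.csv', sep=';')
# Sort by frequency, highest first; a name like sorted_data avoids shadowing the built-in sorted()
sorted_data = dataset.sort_values(by='Freq1', ascending=False)
x = sorted_data.iloc[:, :].values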
I have a CSV file in the following format:
Date,Time,Open,High,Low,Close,Volume
09/22/2003,00:00,1024.5,1025.25,1015.75,1022.0,720382.0
09/23/2003,00:00,1022.0,1035.5,1019.25,1022.0,22441.0
10/22/2003,00:00,1035.0,1036.75,1024.25,1024.5,663229.0
I would like to add 20 new columns to this file, the value of each new column is synthetically created by simply randomizing a set of numbers.
It would be something like this:
import pandas as pd
from random import randrange

df = pd.read_csv('dataset.csv')
print(len(df))
input()  # pause so the row count can be checked

for i in range(len(df)):
    # Data that already exists
    date = df.values[i][0]
    time = df.values[i][1]
    open_value = df.values[i][2]
    high_value = df.values[i][3]
    low_value = df.values[i][4]
    close_value = df.values[i][5]
    volume = df.values[i][6]
    # This is the new data: 20 random predictions per row
    predictions = [randrange(3) for _ in range(20)]
    # How to concatenate these data row by row in a matrix?
    # How to add new column names and save the file?
I would like to concatenate them (old + synthetic data), add 20 new columns named 'synthetic1', 'synthetic2', ..., 'synthetic20' to the existing column names, and then save the resulting dataset in a new text file.
I could do that easily with NumPy, but here not all of the data is numeric, so I don't know how to do that (or whether it is possible). Is it possible with pandas or another library?
Here's a way you can do it:
import numpy as np
import pandas as pd
# set n_row and n_col; n_row should match the number of rows in the existing df
n_row = 100
n_col = 20
f = pd.DataFrame(np.random.randint(100, size=(n_row, n_col)),
                 columns=['synthetic' + str(x) for x in range(1, n_col + 1)])
# axis=1 appends the synthetic columns alongside the existing ones
df = pd.concat([df, f], axis=1)
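To save the combined dataset to a new file, as the question asks (the filename here is just an example):
df.to_csv('dataset_with_synthetic.csv', index=False)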
I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1':
['12/31/2007','11/25/2009','10/06/2005'],'val1':
[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'val2':[1,3,5],'date3':
['12/31/2027','11/25/2029','10/06/2025'],'val3':[7,9,11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID with values like DC0001, DC0002, etc. Does i always have to be numeric? Instead of reshaping, it adds the stub values as new columns in my dataset and returns zero rows.
My real data might contain NAs as well. Do I have to fill them with default values for wide_to_long to work?
Can you please help with what the issue might be? Any other approach that achieves the same result would also be helpful.
Try adding the suffix argument to the function, which allows string suffixes:
pd.wide_to_long(..., suffix=r'\w+')
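For instance, a small hypothetical sketch with letter suffixes (the column names here are invented for illustration):
import pandas as pd
df = pd.DataFrame({'person_id': [1, 2],
                   'dateA': ['12/31/2007', '11/25/2009'], 'valA': [2, 4],
                   'dateB': ['12/31/2017', '11/25/2019'], 'valB': [1, 3]})
# suffix=r'\w+' lets wide_to_long match the non-numeric suffixes 'A' and 'B'
long_df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
                          j='grp', suffix=r'\w+')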
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix to group by. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re
def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex=regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns=dd, inplace=True)
    return df

tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})
# Change date columns, then val columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
This is quite late to answer this question, but I'm putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})
## You can use m13op22's solution to rename your columns so the numeric part
## is at the end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion, (in this example 'hdate', 'tval') as
## stubnames. The mistake you were doing was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)
I am trying to read DICOM files using pydicom in Python and want to store the header data into a pandas dataframe. How do I extract the data element value for this purpose?
So far I have created a dataframe with columns as the tag names in the DICOM file. I have accessed the data element but I only need to store the value of the data element and not the entire sequence. For this, I converted the sequence to a string and tried to split it. But it won't work either as the length of different tags are different.
import pydicom as dicom
import pandas as pd

refDs = dicom.dcmread('000000.dcm')
info_header = refDs.dir()
df = pd.DataFrame(columns=info_header)
print(df)
info_data = []
for i in info_header:
    if i in refDs:
        info_data.append(str(refDs.data_element(i)).split(" ")[0])
print(info_data[0], len(info_data))
I have put the data elements in a list, as I could not put them into the dataframe directly. The output of the above code is:
(0008, 0050) Accession Number SH: '1091888302507299' 89
But I only want to store the data inside the quotes.
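A note on that: the part inside the quotes is just the element's value, which pydicom exposes directly, so instead of splitting the printed representation you could write something like:
info_data.append(refDs.data_element(i).value)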
This works for me:
import pydicom as dicom
import pandas as pd
ds = dicom.read_file('path_to_file')
df = pd.DataFrame(ds.values())
df[0] = df[0].apply(lambda x: dicom.dataelem.DataElement_from_raw(x) if isinstance(x, dicom.dataelem.RawDataElement) else x)
df['name'] = df[0].apply(lambda x: x.name)
df['value'] = df[0].apply(lambda x: x.value)
df = df[['name', 'value']]
Eventually, you can transpose it:
df = df.set_index('name').T.reset_index(drop=True)
Nested fields would require more work if you also need them.
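A shorter sketch of the same idea, assuming none of the elements you need are nested sequences (VR 'SQ'):
rows = [(elem.name, elem.value) for elem in ds if elem.VR != 'SQ']
df = pd.DataFrame(rows, columns=['name', 'value'])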
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the data frame; I receive Series([], dtype: object) as the output. Below is the code that I'm working with:
My document consists of the columns:
product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_to_company, company_response_to_consumer, timely_response, consumer_disputed?, complaint_id
I want to extract: sub_product, issue, sub_issue, consumer_complaint_narrative.
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Here, specify the column numbers you want to select; in a DataFrame, columns start from index 0:
cols = [1, 2, 3, 4]
You can also select columns by name. Just use the following line:
df = df[["Column Name", "Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
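Note that .loc slices by column label, with both endpoints included, so with the question's actual column names this would read something like:
df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']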
This worked for me, using slicing with iloc:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are both column positions in the range (the end position is excluded), e.g. if you want columns 3-5, use
df1 = df.iloc[:, 3:6]
For the first column, use
df1 = df.iloc[:, 0]
Though not sure how to select a discontinuous range of columns.
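For a discontinuous set of columns, iloc also accepts a list of positions instead of a slice; a small sketch:
df1 = df.iloc[:, [0, 3, 5]]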
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will spit out the top 3 rows of columns 2-3 (remember numbering starts at 0).
I have a pandas DataFrame and I'm trying to change the names of the columns. Before I change the names, the columns are Series (as they should be), but after I change the names, they become DataFrames, and thus cannot be grouped, etc. The line of code that's giving me problems is:
df.rename(columns = varDict, inplace = True)
I have verified:
The same thing happens no matter how I rename; e.g., rename_axis, replacing with a list, etc.
df is a DataFrame both before and after the renaming.
varDict is a well-formed dict (as are my other dicts), and does in fact successfully change the names of the columns.
The columns are Series before the renaming, and DataFrames afterwards.
I cannot replicate this in a toy example, unfortunately.
Any ideas on where I'm going wrong? I'm using python 2.7.5 and pandas 0.12.0 on Mac OS X. Thanks in advance. The full code is below.
import pandas as pd
def FileToDict(filename):
    with open(filename, 'rU') as f:
        l = [line[:-1] for line in f]
    return {x.partition(',')[0]: x.partition(',')[2] for x in l}
#---Preparing the dataset---
df = pd.DataFrame.from_csv('movieDataset.csv')
df = df.set_index(u'row')
#These dictionaries incorporate information from other files
varDict = FileToDict('varNum-to-varName.csv')
titleDict = FileToDict('titleNum-to-titleName.csv')
#Changes the datafile numbers into movie titles
df.datafile = df.datafile.apply(lambda x: titleDict[str(x)])
#Changes the variable numbers into recognizable names
varDict['row'] = 'row'
varDict['datafile'] = 'MovieTitle'
df.rename(columns = lambda x: varDict[str(x)], inplace = True)
#----------------
#At this point the columns become DataFrames rather than Series
#----------------
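One guess at the cause, since varDict itself isn't shown (so this is an assumption): if the dict maps two different old column names to the same new name, the DataFrame ends up with duplicate column labels, and selecting such a label returns a DataFrame instead of a Series. A minimal sketch of that effect:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.rename(columns={'a': 'x', 'b': 'x'}, inplace=True)  # two columns now share the name 'x'
print(type(df['x']))  # <class 'pandas.core.frame.DataFrame'>, not a Series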