Being new to Python, I am looking for some help reshaping this data. I already know how to do it in Excel, but I want a Python-specific solution.
I want it to be in this format.
The entire dataset is 70k rows with different vc_firm_name values; any help would be great.
If you care about performance, then I suggest you take a look at other methods (such as using numpy, or sorting the table):
https://stackoverflow.com/a/42550516/17323241
https://stackoverflow.com/a/66018377/17323241
https://stackoverflow.com/a/22221675/17323241 (look at second comment)
Otherwise, you can do:
import pandas as pd

# load data from CSV file
df = pd.read_csv("example.csv")
# group the industries into a list per firm
df.groupby("vc_firm_name")["investment_industry"].apply(list)
Assuming the original file is "original.csv", and you want to save it as "new.csv" I would do:
pd.read_csv("original.csv").groupby(by=["vc_firm_name"],as_index=False).aggregate(lambda x: ','.join(x)).to_csv("new.csv", index=False)
I have a TXT file that was converted to CSV.
I have the column names, but the unprocessed dataset is effectively a single row.
How do I clean the dataset using pandas, NumPy, etc. methods so that each string/int between every comma is placed in a separate column under the proper column name?
Thanks,
Ido
cols = ['AIRLINE_ID', 'AIRLINE_NAME', 'ALIAS', 'IATA', 'ICAO', 'CALLSIGN', 'COUNTRY', 'ACTIVE']
Airlines_raw_dataset
I looked for videos on this topic on YouTube, but I didn't find anything specific to a dataset this dirty.
Pandas has a built-in method for reading CSV files. It may be used in this fashion:
df = pd.read_csv('filename.csv')
You can read more about this method here -> Official Docs
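Since the raw file appears to have no header row of its own, a minimal sketch (the filename here is hypothetical) would pass the column names explicitly:

import pandas as pd

cols = ['AIRLINE_ID', 'AIRLINE_NAME', 'ALIAS', 'IATA', 'ICAO',
        'CALLSIGN', 'COUNTRY', 'ACTIVE']

# header=None tells pandas the file has no header row;
# names=cols assigns a proper column name to each comma-separated field
df = pd.read_csv('airlines_raw.txt', header=None, names=cols)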
It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel, because I'd run out of columns (for the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, not transposed data file has 9000+ rows and 20ish cols. I'm using Excel 2003 which supports up to 256 columns.
EDIT: Figured out a solution that works for me. It's a lot simpler than I expected. I did end up using CSV instead of Excel (I found no serious difference in terms of my project). Here it is for whoever may have the same problem:
import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select

# transpose so the original rows become columns, selectable by label
df = pd.read_csv(your_csv_file, index_col=0).T

selection = {}
for item in selectionList:
    selection[item] = df[item]

selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)  # transpose back before saving
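As a side note, the dictionary loop can likely be collapsed into a single indexing call; a sketch of the same selection under the same assumptions:

import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select

# select the transposed rows (now columns) in one step
selection = pd.read_csv(your_csv_file, index_col=0).T[list(selectionList)]
selection.T.to_csv(your_path)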
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can pass a list (0-indexed) of rows not to import, e.g. [0,5,6,10], though that list might end up being huge. If you pass a single integer instead, it skips that many rows from the top and starts reading there. Set nrows to the number of rows you want to read from the point where reading starts.
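For instance, a sketch (the filename is hypothetical) that reads only data rows 100-109 of a large file:

import pandas as pd

# row 0 of the file is the header; skip data rows 1-99,
# then read exactly ten rows
df = pd.read_csv("big_file.csv", skiprows=range(1, 100), nrows=10)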
If I've misunderstood the issue, let me know.
I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated. I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key="df_final")
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), key="df_final")
You might need to install PyTables first (pip install tables).
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning appears because pandas has detected conflicting data types in those columns. You can specify the datatypes in the read call if you wish:
, dtype={'FIELD': int, 'FIELD2': str}
etc.
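Applied to the read_table call from the question, a sketch (FIELD and FIELD2 are placeholders for the actual mixed-type columns):

import pandas as pd

# declaring the types of the problematic columns up front means
# pandas no longer has to infer them chunk by chunk
df_final = pd.read_table(
    "data.tsv", encoding="utf8", index_col=[0, 1],
    dtype={"FIELD": int, "FIELD2": str},
)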
I want to put some data available in an excel file into a dataframe in Python.
The code I use is as below (two examples I use to read an excel file):
d = pd.ExcelFile(fileName).parse('CT_lot4_LDO_3Tbin1')
e = pd.read_excel(fileName, sheetname='CT_lot4_LDO_3Tbin1', convert_float=True)
The problem is that the dataframe I get keeps only two digits after the decimal point. In other words, Excel values like 0.123456 come into the dataframe as 0.12.
Some rounding seems to be applied, but I cannot find how to change it.
Can anyone help me?
Thanks for the help!
You can try this. I used test.xlsx, which has two sheets, and 'CT_lot4_LDO_3Tbin1' is the second sheet. I also set the first value to Text format in Excel.
import pandas as pd
fileName = 'test.xlsx'
df = pd.read_excel(fileName, sheetname='CT_lot4_LDO_3Tbin1')
Result:
In [9]: df
Out[9]:
Test
0 0.123456
1 0.123456
2 0.132320
Without seeing the real raw data file, I think this is the best answer I can think of.
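As a hedged side note, it is worth checking whether the values are truly rounded or merely displayed that way. Printing a single cell bypasses the DataFrame's display formatting:

import pandas as pd

df = pd.read_excel('test.xlsx', sheetname='CT_lot4_LDO_3Tbin1')

# if this prints 0.123456 while the frame shows 0.12, only the display
# is rounding; pd.set_option('display.precision', ...) controls that
print(df['Test'].iloc[0])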
Well, when I try:
df = pd.read_csv(r'my file name')
I have something like that in df
http://imgur.com/a/Q2upp
And I cannot put .fileformat in the statement.
You might be interested in disabling the column datatype inference that pandas performs automatically. This is done by manually specifying the datatype for each column. Here is what you might be looking for:
Python pandas: how to specify data types when reading an Excel file?
Using pandas 0.20.1, something like this should work:
df = pd.read_csv('CT_lot4_LDO_3Tbin1.fileformat')
For example, for an Excel file, use the Excel reader rather than read_csv:
df = pd.read_excel('CT_lot4_LDO_3Tbin1.xlsx')
Read this documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Suppose I have a csv file with 400 columns. I cannot load the entire file into a DataFrame (won't fit in memory). However, I only really want 50 columns, and this will fit in memory. I don't see any built in Pandas way to do this. What do you suggest? I'm open to using the PyTables interface, or pandas.io.sql.
The best-case scenario would be a function like: pandas.read_csv(...., columns=['name', 'age',...,'income']). I.e. we pass a list of column names (or numbers) that will be loaded.
Ian, I implemented a usecols option which does exactly what you describe. It will be in upcoming pandas 0.10; development version will be available soon.
Since 0.10, you can use usecols like
df = pd.read_csv(...., usecols=['name', 'age',..., 'income'])
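A concrete sketch (the filename and column names are made up for illustration):

import pandas as pd

# only the listed columns are parsed into the DataFrame;
# the other 350 are never materialized in memory
df = pd.read_csv("survey.csv", usecols=["name", "age", "income"])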
There's no built-in way to do this right now. I would suggest chunking the file, iterating over it, and discarding the columns you don't want.
So something like pd.concat([x.loc[:, cols_to_keep] for x in pd.read_csv(..., chunksize=200)])
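Spelled out, a sketch of that chunked approach (filename and columns hypothetical):

import pandas as pd

cols_to_keep = ["name", "age", "income"]

# stream the file 200 rows at a time, keeping only the wanted
# columns from each chunk before concatenating the pieces
chunks = pd.read_csv("survey.csv", chunksize=200)
df = pd.concat(chunk.loc[:, cols_to_keep] for chunk in chunks)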