Data in the csv file is of the format ("user_id", "group_id", "group_value").
"group_id" ranges from 0 to 100.
For a given user_id, the group_value for a particular group_id may not be available.
I want to create a sparse matrix representation of the above data. ("group_id_0", "group_id_1", ... , "group_id_100")
What is the best way to achieve this in Python?
Edit: Data is too big to iterate over.
You could do this with pandas.
Update 08.08.2018:
As noted by Can Kavaklıoğlu, as_matrix() is deprecated as of pandas 0.23.0; changed to .values.
import pandas as pd
df = pd.read_csv('csv_file.csv', names=['user_id', 'group_id', 'group_value'])
df = df.pivot(index='user_id', columns='group_id', values='group_value')
mat = df.values
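Since the question asks for a sparse representation and the pivot leaves NaN wherever a group_value is missing, here is a minimal follow-up sketch, assuming SciPy is available and that missing entries can be treated as zero:
from scipy import sparse

# NaN cells come from missing (user_id, group_id) pairs; fill them with 0
# before building the sparse matrix (assumption: 0 stands for "no value").
sparse_mat = sparse.csr_matrix(df.fillna(0).values)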
We have some data in a Delta source which has nested structures. For this example we are focusing on a particular field from the Delta named status which has a number of sub-fields: commissionDate, decommissionDate, isDeactivated, isPreview, terminationDate.
In our transformation we currently read the Delta file in using PySpark, convert the DF to pandas using df.toPandas() and operate on this pandas DF using the pandas API. Once we have this pandas DF we would like to access its fields without using row iteration.
The data in Pandas looks like the following when queried using inventory_df["status"][0] (i.e. inventory_df["status"] is a list):
Row(commissionDate='2011-07-24T00:00:00+00:00', decommissionDate='2013-07-15T00:00:00+00:00', isDeactivated=True, isPreview=False, terminationDate=None)
We have found success using row iteration like:
unit_df["Active"] = [
    not row["isDeactivated"] for row in inventory_df["status"]
]
but we have to use row iteration each time we want to access data from inventory_df, which is more verbose and less efficient.
We would love to be able to do something like:
unit_df["Active"] = [
    not inventory_df["status.isDeactivated"]
]
which is similar to the Spark destructuring approach, and allows accessing all of the rows at once but there doesn't seem to be equivalent pandas logic.
The data within PySpark has a format like status: struct<commissionDate:string,decommissionDate:string,isDeactivated:boolean,isPreview:boolean,terminationDate:string> and we can use the format mentioned above, selecting a subcolumn like df.select("status.isDeactivated").
How can this approach be done using pandas?
This may get you where you want to be (the Row objects live in inventory_df["status"], so that is the column to expand):
status_df = inventory_df["status"].apply(lambda x: pd.Series(x.asDict()))
From here I would do:
inventory_df = pd.concat([status_df, inventory_df], axis=1)
which would get you a single pd.DataFrame, now with columns for commissionDate, decommissionDate, etc., so "Active" is simply the negation of the isDeactivated column.
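An alternative worth noting (a sketch, not part of the original answer) is to expand the struct on the Spark side before calling toPandas(), so pandas never sees Row objects at all; this assumes pyspark and a local Spark session are available, and the sample data below is made up:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.master("local[1]").appName("flatten-demo").getOrCreate()

# Hypothetical stand-in for the DataFrame read from the Delta source
sdf = spark.createDataFrame([
    Row(status=Row(commissionDate='2011-07-24T00:00:00+00:00',
                   decommissionDate='2013-07-15T00:00:00+00:00',
                   isDeactivated=True, isPreview=False,
                   terminationDate='2015-01-01T00:00:00+00:00')),
])

# "status.*" expands every sub-field into its own top-level column,
# so the pandas frame arrives already flattened
inventory_df = sdf.select("status.*").toPandas()

# Sub-fields are now ordinary vectorised pandas columns;
# astype(bool) guards against an object dtype when nulls are present
inventory_df["Active"] = ~inventory_df["isDeactivated"].astype(bool)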
I'm in the process of implementing a csv parser using Dask and pandas dataframes. I'd like to make it load only the columns it needs, so that it doesn't have to hold large amounts of data in memory.
Currently the only method I've found of writing a column to a parquet/Dask dataframe is by loading all the data as a pandas dataframe, modifying the column and converting from pandas.
all_data = self.data_set.compute() # Loads all data, compute to pandas dataframe
all_data[column] = column_data # Modifies one column
self.data_set = dd.from_pandas(all_data, npartitions=2) # Store all data into dask dataframe
This seems really inefficient, so I was looking for a way to avoid having to load all the data and perhaps modify one column at a time or write directly to parquet.
I've stripped away most of the code but here is an example function that is meant to normalise the data for just one column.
import pandas as pd
import dask.dataframe as dd

def normalise_column(self, column, normalise_type=NormaliseMethod.MEAN_STDDEV):
    # .compute() converts all the data to a pandas dataframe
    column_data = self.data_set.compute()[column]

    if normalise_type is NormaliseMethod.MIN_MAX:
        col_min, col_max = column_data.min(), column_data.max()
        column_data = column_data.apply(lambda x: (x - col_min) / (col_max - col_min))

    elif normalise_type is NormaliseMethod.MEAN_STDDEV:
        mean, std_dev = column_data.mean(), column_data.std()
        column_data = column_data.apply(lambda x: (x - mean) / std_dev)

    all_data = self.data_set.compute()
    all_data[column] = column_data
    self.data_set = dd.from_pandas(all_data, npartitions=2)
Can someone please help me make this more efficient for large amounts of data?
Due to the binary nature of the parquet format, and because compression is normally applied to the column chunks, it is never possible to update the values of a column in a file without a full load-process-save cycle (the number of bytes would not stay constant). At least Dask should enable you to do this partition by partition, without blowing out memory.
It would be possible to write custom code that avoids parsing the compressed binary data in the columns you know you don't want to change, and simply reads and writes those chunks again, but implementing this would take some work.
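A rough sketch of that partition-by-partition route, with hypothetical paths and column names, assuming the data already lives in parquet:
import dask.dataframe as dd

# Open the parquet dataset lazily; nothing is loaded into memory yet
ddf = dd.read_parquet('input_data/')

# The global statistics are small aggregations computed out of core
mean = ddf['my_column'].mean().compute()
std_dev = ddf['my_column'].std().compute()

# The transformation stays lazy and runs partition by partition
ddf['my_column'] = (ddf['my_column'] - mean) / std_dev

# Parquet cannot be patched in place, so the dataset is rewritten
ddf.to_parquet('output_data/')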
I want to get all the fields from the csv which are numerical and store those field names in an array, so that I can perform mathematical operations on them. I can get the data types, but I am not able to restrict the selection to the numerical ones. I am very new to Python scripting, please help.
Edit: I have added one sample row
Here F1 and F3 are the numerical fields, so I want to keep these two field names in an array variable:
FieldNames=["F1","F3"]
import csv
import pandas as pd
import numpy as np

data = pd.read_csv(r'C:\Users\spanda031\Downloads\test_19.csv')
print(data.dtypes)

with open(r'C:\Users\spanda031\Downloads\test_19.csv') as f:
    d_reader = csv.DictReader(f)

    # get fieldnames from DictReader object and store in list
    headers = d_reader.fieldnames
    print(headers)

for line in headers:
    # print each column name and pull its values into a NumPy array
    print(line)
    v3 = np.array(data[line])
select_dtypes
You can use np.number or, as indicated in the docs, 'number' to select all numeric series:
# read csv file
df = pd.read_csv('file.csv')
# subset dataframe to include only numeric columns
df = df.select_dtypes(include='number')
# get column labels in array
cols = df.columns.values
# extract NumPy array from dataframe
arr = df.values
Notice there's no need for the csv module, as Pandas can read csv files via pd.read_csv.
isdigit can be used to check whether a column of the dataframe contains only digit strings. Say the column name is Measure; then you can write
df['Measure_isdigit'] = df['Measure'].astype(str).str.isdigit()
print(df)
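Tying this back to the original goal of collecting the numeric field names, a sketch along the same lines (with the caveat that str.isdigit only matches non-negative integer strings, not floats or negatives):
# Columns whose every value looks like a digit string
# (assumption: that is what "numerical" means for this data)
FieldNames = [c for c in df.columns if df[c].astype(str).str.isdigit().all()]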
I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following using a dict comprehension with range(100).
import pandas as pd

filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}

df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files always have a varying number of columns, and I am looking for a way to handle this without manually changing the range every time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution (sketched below) is:
1. Read in one row of data just to get the column names and the number of columns.
2. Create the dictionary automatically, where each column has a string type.
3. Re-read the full data using the dictionary created at step 2.
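A minimal sketch of those three steps, reusing the path from the question and assuming a pandas version whose read_excel supports nrows:
import pandas as pd

filename = r'C:\DemoFile.xlsx'

# Step 1: read a single row just to discover the column names
header_df = pd.read_excel(filename, nrows=1)

# Step 2: build the converters dict automatically, mapping every column to str
converters = {col: str for col in header_df.columns}

# Step 3: re-read the full sheet with those converters
df = pd.read_excel(filename, converters=converters)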
I have the following dictionary in python that represents a From - To Distance Matrix.
graph = {'A':{'A':0,'B':6,'C':INF,'D':6,'E':7},
'B':{'A':INF,'B':0,'C':5,'D':INF,'E':INF},
'C':{'A':INF,'B':INF,'C':0,'D':9,'E':3},
'D':{'A':INF,'B':INF,'C':9,'D':0,'E':7},
'E':{'A':INF,'B':4,'C':INF,'D':INF,'E':0}
}
Is it possible to output this matrix into Excel or to a csv file so that it has the following format? I have looked into using csv.writer and csv.DictWriter but cannot produce the desired output.
You may create a pandas dataframe from that dict, then save to CSV or Excel:
import pandas as pd
df = pd.DataFrame(graph).T # transpose to look just like the sheet above
df.to_csv('file.csv')
df.to_excel('file.xls')
Probably not the most minimal result, but pandas would solve this marvellously (and if you're doing data analysis of any kind, I can highly recommend pandas!).
Your data is already in a perfect format for bringing into a DataFrame:
INF = 'INF'
graph = {'A':{'A':0,'B':6,'C':INF,'D':6,'E':7},
'B':{'A':INF,'B':0,'C':5,'D':INF,'E':INF},
'C':{'A':INF,'B':INF,'C':0,'D':9,'E':3},
'D':{'A':INF,'B':INF,'C':9,'D':0,'E':7},
'E':{'A':INF,'B':4,'C':INF,'D':INF,'E':0}
}
import pandas as pd
pd.DataFrame(graph).to_csv('OUTPUT.csv')
but the output you want is this transposed, so:
pd.DataFrame(graph).T.to_csv('OUTPUT.csv')
where T returns the transpose of the DataFrame.
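For what it's worth, an equivalent that skips the explicit transpose is from_dict with orient='index', which makes the outer keys the rows directly:
pd.DataFrame.from_dict(graph, orient='index').to_csv('OUTPUT.csv')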