Pandas read_csv into multiple DataFrames - python

I have some data in a text file that I am reading into pandas. A simplified version of the text file is:
idx_level1|idx_level2|idx_level3|idx_level4|START_NODE|END_NODE|OtherData...
353386066294006|1142|2018-09-20T07:57:26Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:26Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:26Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:31Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:31Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:31Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:36Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:36Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:36Z|3|18260005359901|18260004567689|...
353386066736543|22|2018-04-17T07:08:23Z||||...
353386066736543|22|2018-04-17T07:08:24Z||||...
353386066736543|22|2018-04-17T07:08:25Z||||...
353386066736543|22|2018-04-17T07:08:26Z||||...
353386066736543|403|2018-07-02T16:55:07Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:07Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|6|18260004283163|18260006215338|...
353386066736543|403|2018-07-02T16:55:01Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:01Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|6|18260004283163|18260006215338|...
And the code I use to read it in is as follows:
mydata = pd.read_csv('/myloc/my_simple_data.txt', sep='|',
                     dtype={'idx_level1': 'int',
                            'idx_level2': 'int',
                            'idx_level3': 'str',
                            'idx_level4': 'float',
                            'START_NODE': 'str',
                            'END_NODE': 'str',
                            'OtherData...': 'str'},
                     parse_dates=['idx_level3'],
                     index_col=['idx_level1', 'idx_level2', 'idx_level3', 'idx_level4'])
What I really want is a separate pandas DataFrame for each unique idx_level1 & idx_level2 combination. So in the above example there would be 3 DataFrames, pertaining to the idx_level1|idx_level2 values 353386066294006|1142, 353386066736543|22 and 353386066736543|403 respectively.
Is it possible to read in a text file like this and output each change in idx_level2 to a new Pandas DataFrame, maybe as part of some kind of loop? Alternatively, what would be the most efficient way to split mydata into DataFrame subsets, given that everything I have read suggests that it is inefficient to iterate through a DataFrame.

Read your dataframe as you are currently doing, then groupby on the first two index levels and use a list comprehension:
group = mydata.groupby(level=[0,1])
dfs = [group.get_group(x) for x in group.groups]
You can then access each dataframe with dfs[0] and so on.

To specifically address your last paragraph, you could create a dict of dfs, based on the unique values in the column, using something like:
dfs = {}
col_values = df[column].unique()
for value in col_values:
    key = 'df' + str(value)
    dfs[key] = df[df[column] == value].copy()
    dfs[key].reset_index(inplace=True, drop=True)
where column = 'idx_level2'
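With column = 'idx_level2' and the sample data above, this produces the keys 'df1142', 'df22' and 'df403', so one subset can be retrieved with, for example:
subset = dfs['df1142']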

Read the table as-is and use groupby, for instance:
data = pd.read_table('/myloc/my_simple_data.txt', sep='|')
groups = dict()
for group, subdf in data.groupby(data.columns[:2].tolist()):
    groups[group] = subdf
Now you have all the sub-data frames in a dictionary whose keys are a tuple of the two indexers (e.g. (353386066294006, 1142)).
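For example, one sub-frame can then be pulled back out by its key (a minimal usage sketch, reusing the key from the sample data):
sub_df = groups[(353386066294006, 1142)]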

Related

How to check pandas column names and then append to row data efficiently?

I have a dataframe with several columns, some of which have names that match the keys in a dictionary. I want to append the value of the items in the dictionary to the non-null values of the column whose name matches the key in said dictionary. Hopefully that isn't too confusing.
example:
realms = {}
realms['email'] = '<email>'
realms['android'] = '<androidID>'
df = pd.DataFrame()
df['email'] = ['foo#gmail.com', '', 'foo#yahoo.com']
df['android'] = [1234567, None, 55533321]
How could I append '<email>' to 'foo#gmail.com' and 'foo#yahoo.com' without appending to the empty string or null value too?
I'm trying to do this without using iteritems(), as I have about 200,000 records to apply this logic to.
expected output would be like 'foo#gmail.com<email>',,'foo#yahoo.com<email>'
for column in df.columns:
    df[column] = df[column].astype(str) + realms[column]
>>> df
email android
0 foo#gmail.com<email> 1234567<androidID>
1 foo#yahoo.com<email> 55533321<androidID>
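Note that this appends to every cell, including the empty string and null rows the question wants to skip. A masked variant (a sketch, assuming empty cells are either '' or NaN) would be:
for column in df.columns:
    mask = df[column].notna() & (df[column].astype(str) != '')
    df.loc[mask, column] = df.loc[mask, column].astype(str) + realms[column]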

Load CSV files in dictionary then make data frame for each csv file Python

I have multiple csv files. I was able to load them as data frames into a dictionary by using a keywords dataframe:
# reading files into dataframes
csvDict = {}
for index, rows in keywords.iterrows():
    eachKey = rows.Keyword
    csvFile = "SRT21" + eachKey + ".csv"
    csvDict[eachKey] = pd.read_csv(csvFile)
Now I have other functions to apply on every data frame's specific column.
On a single data frame, the code would look like this:
df['Cat_Frames'] = df['Cat_Frames'].apply(process)
df['Cat_Frames'] = df['Cat_Frames'].apply(cleandata)
df['Cat_Frames'] = df['Cat_Frames'].fillna(' ')
My question is: how do I loop through every data frame in the dictionary to apply those functions?
I have tried:
for item in csvDict.items():
    df = pd.DataFrame(item)
    df
and it gives me an empty result. Any solution or suggestion?
You can chain the apply calls like this:
for key, df in csvDict.items():
    df['Cat_Frames'] = df['Cat_Frames'].apply(process).apply(cleandata).fillna(' ')
items() returns a tuple of key/value, so you should make your for loop actually say:
for key, value in csvDict.items():
    print(value)
Also, you need to print the df if you aren't in Jupyter:
for key, value in csvDict.items():
    df = pd.DataFrame(value)
    print(df)
I think this is how you should traverse the dictionary.
When there is no processing of data from one data set/frame involving another data set, don't collect data sets.
Just process the current one, and proceed to the next.
The conventional name for a variable receiving an "unpacked value" not going to be used is _:
for _, df in csvDict.items():
    df['Cat_Frames'] = df['Cat_Frames'].apply(process).apply(…
- but why ask for keys to ignore them? Iterate the values:
for df in csvDict.values():
    df['Cat_Frames'] = df['Cat_Frames'].apply(process).apply(…
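Putting it together with the three operations from the question (a sketch, assuming process and cleandata are defined as in the question):
for df in csvDict.values():
    df['Cat_Frames'] = df['Cat_Frames'].apply(process)
    df['Cat_Frames'] = df['Cat_Frames'].apply(cleandata)
    df['Cat_Frames'] = df['Cat_Frames'].fillna(' ')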

Create a dictionary from pandas empty dataframe with only column names

I have a pandas data frame with only two column names (a single row, which can also be considered headers). I want to make a dictionary out of this, with the first column being the key and the second column being the value. I already tried the to_dict() method, but it's not working as it's an empty dataframe.
Example
df = |Land|Norway| to {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution :
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list, then the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.

Python .transpose() gets error while transforming dictionary data

AttributeError: 'NoneType' object has no attribute 'transpose'
I have been trying to extract cells as a dictionary (from a pandas dataframe) and join the result with the existing data.
For example, I have a csv file which contains two columns, id and device_type. Each cell in the device_type column contains dictionary data. I am trying to split it and add it to the original data, doing something like below:
import json
import pandas
df = pandas.read_csv('D:\\1. Work\\csv.csv',header=0)
sf = df.head(12)
sf['visitor_home_cbgs'].fillna("{}", inplace = True).transpose()
-- csv file sample
ID,device_type
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5}
f5d639a64a88496099,{}
-- looking for output like below
id,device_type,ttype,tvalue
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"060379800281",11
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"061110053031",5
3c30ee03047b478,{"060379800281":11,"061110053031":5,"060372062002":5},"060372062002",5
f5d639a64a88496099,{},NIL,NIL
Avoid inplace=True:
sf['visitor_home_cbgs'].fillna("{}").transpose()
When you pass inplace=True, it modifies the dataframe in place and returns None.
If you want to use inplace=True, then do it like below:
sf['visitor_home_cbgs'].fillna("{}", inplace=True)
sf.transpose()
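The None return value can be checked directly (a minimal illustration):
result = sf['visitor_home_cbgs'].fillna("{}", inplace=True)
print(result)  # None, so chaining .transpose() onto it raises AttributeError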
To create rows from column values
One solution is to iterate through the dataframe rows and build a new dataframe with the desired columns and values:
import json
import pandas as pd

def extract_JSON(row):
    df2 = pd.DataFrame(columns=['ID', 'device_type', 'ttype', 'tvalue'])
    parsed = json.loads(row['device_type'])
    for key in parsed:
        df2.loc[len(df2)] = [row['ID'], row['device_type'], key, parsed[key]]
    if df2.empty:
        df2.loc[0] = [row['ID'], row['device_type'], '', '']
    return df2

df3 = pd.DataFrame(columns=['ID', 'device_type', 'ttype', 'tvalue'])
for _, row in df.iterrows():
    df3 = df3.append(extract_JSON(row))
df3
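A side note: DataFrame.append is deprecated in recent pandas (removed in 2.0), so on a newer version the accumulation line could instead read:
df3 = pd.concat([df3, extract_JSON(row)], ignore_index=True)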

Changing the dtype for specific columns in a pandas dataframe

I have a pandas dataframe which I have created from data stored in an xml file:
Initially the xml file is opened and parsed:
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary which lists all the data names (which are used as column names) as keys and gives the position of the data in the xml file:
Parameters = {"TreatmentUnit": "Worklist/AdminData/AdminValues/TreatmentUnit",
              "Modality": "Worklist/AdminData/AdminValues/Modality",
              "Energy": "Worklist/AdminData/AdminValues/Energy",
              "FieldSize": "Worklist/AdminData/AdminValues/Fieldsize",
              "SDD": "Worklist/AdminData/AdminValues/SDD",
              "Gantry": "Worklist/AdminData/AdminValues/Gantry",
              "Wedge": "Worklist/AdminData/AdminValues/Wedge",
              "MU": "Worklist/AdminData/AdminValues/MU",
              "My": "Worklist/AdminData/AdminValues/My",
              "AnalyzeParametersCAXMin": "Worklist/AdminData/AnalyzeParams/CAX/Min",
              "AnalyzeParametersCAXMax": "Worklist/AdminData/AnalyzeParams/CAX/Max",
              "AnalyzeParametersCAXTarget": "Worklist/AdminData/AnalyzeParams/CAX/Target",
              "AnalyzeParametersCAXNorm": "Worklist/AdminData/AnalyzeParams/CAX/Norm",
              ....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList=[]
dateList.append('date')
headers = dateList+sortedKeys
I then create an empty pandas dataframe with the same number of rows as the number of records in trendData and with the column headers set to 'headers' and then loop through the file filling the dataframe:
df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for j in Parameters:
        result[j] = b.findtext(Parameters[j])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype for each column is set to object, whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example:
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of the columns which should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column just with another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
Calling data.dtypes now returns the following:
date int64
type object
dtype: object
For multiple columns, use a for loop to run through the int64list you mentioned in your question.
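A sketch of that loop, reusing pd.to_numeric from the question:
for q in int64list:
    df[q] = pd.to_numeric(df[q])
or, without an explicit loop, df[int64list] = df[int64list].apply(pd.to_numeric).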
Alternatively, for multiple columns you can do it this way:
cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)
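Note that filter(like='AnalyzeParameters') matches the substring anywhere in the column name, not just as a prefix; to mirror the startswith logic from the question exactly, one option is:
cols = df.columns[df.columns.str.startswith('AnalyzeParameters')]
df[cols] = df[cols].astype(np.int64)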
