Reshaping pandas DataFrame and saving tuples - python

I have the following pandas DataFrame, df.head():
   userid  followers    experience  fixed_date
0   12134      28266  Intermediate  2012-10-15
1   12134      28266  Intermediate  2012-10-15
2   91638        665       Missing  2012-10-15
3  148401        123  Professional  2012-10-15
4    5890       2436  Professional  2012-10-15
I'd like to make a new DataFrame where the rows are userid, the columns are fixed_date, and the values are a tuple of (followers, experience). As you can see, I have duplicate userid rows, which is why df.pivot() raises an error when I try it. But the number of followers can change at a later date, so I'd like to capture that for each userid.
I can give a little more background about the data. The rows are currently tweets, so a user can (and often does) tweet more than once in a given day. Therefore there will also be duplicate fixed_date values, because I disregard the exact time of the tweet (HH:MM:SS). In cases where a user tweeted multiple times on a given date, it would be great to group those into the cell value and make an array of tuples. If this is already asking for too much, it'd be more than okay to just have multiple columns of the same value. If that's not possible, I can also save the dates to a separate array and enumerate the columns 0..n. Just throwing out thoughts.
Any ideas?

It's not elegant but this will work:
df2 = pd.DataFrame(df.loc[:, ['followers', 'fixed_date']]) # New frame with just two cols
df2.index = df.userid # Set the index to user id
df2 = df2.drop_duplicates() # remove duplicate records
Or if you just want the tuples to pass to an array you could do this:
df.loc[:, ['userid', 'fixed_date', 'followers']].values
# array([[12134, '2012-10-15', 28266],
#        [12134, '2012-10-15', 28266],
#        [91638, '2012-10-15', 665],
#        [148401, '2012-10-15', 123],
#        [5890, '2012-10-15', 2436]], dtype=object)
Which you could pass along to build a sparse matrix, or use directly as a NumPy ndarray/matrix.
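For the reshape the question actually asks for, a minimal sketch (assuming the column names shown in df.head() above) could combine the two value columns into tuples and collect duplicates for the same user and day into a list:
# Hedged sketch: one row per userid, one column per fixed_date, each cell a
# list of (followers, experience) tuples for that user on that day.
pairs = (df.assign(pair=list(zip(df['followers'], df['experience'])))
           .groupby(['userid', 'fixed_date'])['pair']
           .apply(list)
           .unstack('fixed_date'))
Cells where a user has no tweets on a given date come out as NaN.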

Related

ValueError: cannot handle a non-unique multi-index! even when we have unique index

I asked a question a few days ago and got an accepted answer for it. Here is the question:
Group the dataframe based on ids and stick the values of ids to each other with mean of the last days
But the problem is that when I apply this code to a large dataframe, it raises ValueError: cannot handle a non-unique multi-index!. I tried to check the column names of my dataframe with df.columns.value_counts(), and every count is 1. My dataframe has 30 columns and 3000 rows; point_id, date, and Temperatures are among the columns.
Does anybody know how to solve this problem? Thank you so much.
You have 254 rows where there are at least 2 data points for the same (point_id, date). What do you want to do with the records that share a (point_id, date)? You could group this data and keep the mean, for instance.
Here is how to list the duplicates:
df = pd.read_csv('dft.csv', index_col=0)
counts = df.value_counts(['point_id', 'date'], sort=False).loc[lambda x: x > 1]
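A minimal follow-up sketch of the suggestion above (column names are taken from the question; keeping the mean is only one possible choice):
import pandas as pd

df = pd.read_csv('dft.csv', index_col=0)
# Collapse duplicate (point_id, date) pairs by keeping the mean temperature,
# so the later multi-index reshape sees unique keys.
df = df.groupby(['point_id', 'date'], as_index=False)['Temperatures'].mean()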

How could I create a column with matching values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are 5 Regions in total, and every Region consists of some number of ZipCodes. I would like to use the two different datasets to create a new column.
I already tried some code, but it failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets: one of them has 1518 rows x 3 columns and the other one has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset, with the Postcode and Regio columns, i.e. the ZipCodes assigned to their corresponding Regio.
df2 is the second dataset, where the Regio column is missing, as you can see. I would like to add a new column to the df2 dataset which contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the zipcodes from dataframe 2 to the region column from the first dataframe, assuming Postcode and ZipCode are the same thing.
First create a dictionary from df1, then look up each zipcode in it and store the result in a new Regio column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
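An alternative sketch, not part of the original answer: a left merge keeps every row of df2 and pulls in the matching Regio (the column names Postcode, Regio, and ZipCode are assumed from the question):
df2 = (df2.merge(df1[['Postcode', 'Regio']],
                 left_on='ZipCode', right_on='Postcode', how='left')
          .drop(columns='Postcode'))  # keep only the new Regio column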

Group by ids, sort by date and get values as list on big data python

I have big data (30 million rows).
Each table has id, date, and value columns.
I need to go over each id and, per id, get a list of its values sorted by date, so that the first value in the list corresponds to the oldest date.
Example:
ID  DATE        VALUE
1   02/03/2020  300
1   04/03/2020  200
2   04/03/2020  456
2   01/03/2020  300
2   05/03/2020  78
Desired table:
ID  VALUE_LIST_ORDERED
1   [300, 200]
2   [300, 456, 78]
I can do it with a for loop or with apply, but it's not efficient, and with millions of users it's not feasible.
I thought about using groupby and sorting by date, but I don't know how to build the list that way, and is groupby on a pandas df the best way to do it?
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
What you need to do is order your data using pandas.DataFrame.sort_values and then apply the groupby method.
I don't have a huge data set to test this code on, but I believe this would do the trick:
import numpy as np

sorted_df = data.sort_values('DATE')
result = sorted_df.groupby('ID').VALUE.apply(np.array)
And since it's Python you can always put everything in one statement:
print(data.sort_values('DATE').groupby('ID').VALUE.apply(np.array))
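One caveat worth adding as a sketch (an assumption on my part, since the DATE column in the sample looks like dd/mm/yyyy strings): convert DATE to a real datetime first so the sort is chronological rather than alphabetical, then reset the index to get a plain two-column result like the desired table:
import pandas as pd

data['DATE'] = pd.to_datetime(data['DATE'], dayfirst=True)  # chronological, not lexical
result = (data.sort_values('DATE')
              .groupby('ID')['VALUE']
              .apply(list)
              .reset_index(name='VALUE_LIST_ORDERED'))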

Pivot across multiple columns with repeating values in each column

I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date     Location  Action1  Quantity1  Action2    Quantity2  ...  ActionN    QuantityN
<date>   1         Lights   10         CFloor     1          ...  Null       Null
<date2>  2         CFloor   2          CWalls     4          ...  CBasement  15
<date3>  2         CWalls   7          CBasement  4          ...  Null       Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
   Lights  CFloor  CBasement  CWalls
1      10       1          0       0
2       0       2         19      11
The index of the rows becomes the location, while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
   Lights  CFloor  CBasement  CWalls
1       0       0          0       0
2       0       0          0       0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
# Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template
# Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivoting on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed I will end up with multiple dataframes that all have the same row counts, but potentially different column counts. I solved this issue by simply concatenating the columns and if any columns are repeated, I then sum them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the time it took to process the data by a very significant margin (from about 10 s to about 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
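For what it's worth, here is a hedged one-shot sketch along the lines the author invites (the Activity/Quantity/Location column names are taken from the code above; this is an alternative, not the author's method): stack each Action/Quantity pair into one long frame, then pivot once.
import pandas as pd

act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']

# One long (Location, Action, Quantity) frame built from all the column pairs
long_df = pd.concat(
    [df[['Location', a, q]].set_axis(['Location', 'Action', 'Quantity'], axis=1)
     for a, q in zip(act_cols, quant_cols)],
    ignore_index=True,
).dropna(subset=['Action'])

# A single pivot, summing quantities for repeated (Location, Action) pairs
result = long_df.pivot_table(index='Location', columns='Action',
                             values='Quantity', aggfunc='sum', fill_value=0)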

Pandas adding rows to df in loop

I'm parsing data in a loop and once it's parsed and structured I would like to then add it to a data frame.
The end format of the data frame I would like is something like the following:
df:
id 2018-01 2018-02 2018-03
234 2 1 3
345 4 5 1
534 5 3 4
234 2 2 3
When I iterate through the data in the loop I have a dictionary with the id, the month and the value for the month, for example:
{'id':234,'2018-01':2}
{'id':534,'2018-01':5}
{'id':534,'2018-03':4}
.
.
.
What is the best way to take an empty data frame and add rows and columns with their values to it in a loop?
Essentially as I iterate it would look something like this
df:
id 2018-01
234 2
then
df:
id 2018-01
234 2
534 5
then
df:
id 2018-01 2018-03
234 2
534 5 4
and so on...
IIUC, you need to convert each single dict to a dataframe first and then append it; since the same 'id' can appear more than once, we groupby on the index and take the first non-null value per column:
df = pd.DataFrame()
l = [{'id': 234, '2018-01': 2},
     {'id': 534, '2018-01': 5},
     {'id': 534, '2018-03': 4}]
for x in l:
    df = df.append(pd.Series(x).to_frame().T.set_index('id')).groupby(level=0).first()
    print(df)
2018-01
id
234 2
2018-01
id
234 2
534 5
2018-01 2018-03
id
234 2.0 NaN
534 5.0 4.0
It is not advisable to generate a new data frame at each iteration and append it; this is quite expensive. If your data is not too big and fits into memory, you can make a list of dictionaries first, and then pandas allows you to simply do:
df = pd.DataFrame(your_list_of_dicts)
df = df.set_index('id')
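For instance, a minimal sketch of that approach with the OP's sample records (the groupby line is only needed if rows that share an id should be merged, as the first answer does):
import pandas as pd

records = [{'id': 234, '2018-01': 2},
           {'id': 534, '2018-01': 5},
           {'id': 534, '2018-03': 4}]
df = pd.DataFrame(records).set_index('id')
df = df.groupby(level=0).first()  # merge the two rows for id 534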
If making a list is too expensive (because you'd like to save memory for the data frame), consider using a generator instead of a list. The basic anatomy of a generator function is this:
def datagen(your_input):
    for item in your_input:
        # your code to build a dict called record
        yield record
The generator object data = datagen(input) will not store the dicts but yields a dict at each iteration. It can generate items on demand. When you do pd.DataFrame(data), pandas will stream all the data and make a data frame. Generators can be used for data pipelines (like pipes in UNIX) and are very powerful for big data workflows. Be aware, however, that a generator object can be consumed only once, that is if you run pd.DataFrame(data) again, you will get an empty data frame.
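A small usage sketch of that pattern; raw_records and the parsing inside datagen are placeholders for the OP's own loop:
import pandas as pd

raw_records = [('234', '2018-01', '2'), ('534', '2018-01', '5'), ('534', '2018-03', '4')]

def datagen(records):
    for rec_id, month, value in records:
        # parsing stands in for whatever the OP's loop does
        yield {'id': int(rec_id), month: int(value)}

df = pd.DataFrame(datagen(raw_records))  # the generator is consumed here, only once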
The easiest way I've found in Pandas (although not intuitive) to iteratively append new data rows to a dataframe is using df.loc[ ] to reference the last (nonexistent) row, with len(df) as the index:
df.loc[ len(df) ] = [new, row, of, data]
This will "append" the new data row to the end of the dataframe in-place.
The above example is for an empty Dataframe with exactly 4 columns, such as:
df = pandas.DataFrame( columns=["col1", "col2", "col3", "col4"] )
df.loc[ ] indexing can insert data at any row at all, whether or not it exists yet. It seems it will never raise an IndexError, the way a numpy array or a list would if you tried to assign to a nonexistent row.
For a brand-new, empty DataFrame, len(df) returns 0, and thus references the first, blank row, and then increases by one each time you add a row.
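Put together, a minimal runnable sketch of this pattern (the column names and values are just placeholders):
import pandas

df = pandas.DataFrame(columns=["col1", "col2", "col3", "col4"])
df.loc[len(df)] = [1, 2, 3, 4]   # becomes row 0
df.loc[len(df)] = [5, 6, 7, 8]   # becomes row 1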
–––––
I do not know the speed/memory efficiency cost of this method, but it works great for my modest datasets (a few thousand rows). At least from a memory perspective, I imagine that a large loop appending data to the target DataFrame directly would use less memory than generating an intermediate list of duplicate data first and then generating a DataFrame from that list. Time "efficiency" could be a different question entirely, one for the other SO gurus to comment on.
–––––
However, for the OP's specific case, where you also requested to combine the columns if the data is for an existing identically-named column, you'd need some logic during your for loop.
Instead I would make the DataFrame "dumb" and just import the data as-is, repeating dates as they come, eg. your post-loop DataFrame would look like this, with simple column names describing the raw data:
df:
id date data
234 2018-01 2
534 2018-01 5
534 2018-03 4
(has two entries for the same date).
Then I would use the DataFrame's database-like functions to organize this data how you like, probably using some combination of unique() and sort_values(). Will look into that more later.
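As a hedged sketch of that reorganisation step (not spelled out in the original answer): a pivot_table on the raw long frame above gives one column per date and sums any repeats, which roughly matches the layout the OP asked for.
# df is the "dumb" long frame with columns id, date, data
wide = (df.pivot_table(index='id', columns='date', values='data',
                       aggfunc='sum', fill_value=0)
          .reset_index())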
