Only a fraction of the dataframe is merged in pandas - python

My problem is simple. I have a pandas dataframe with 124957 different tweets (all related to a central topic). The problem is that each date has more than one tweet (around 300 per day).
My goal is to perform sentiment analysis on the tweets of each day. In order to solve this, I am trying to combine all tweets of the same day into one string (which corresponds to each date).
To achieve this, I have tried the following:
indx = 0
get_tweet = ""
for i in range(0, len(cdata)-1):
    get_date = cdata.date.iloc[i]
    next_date = cdata.date.iloc[i+1]
    if str(get_date) == str(next_date):
        get_tweet = get_tweet + cdata.text.iloc[i] + " "
    if str(get_date) != str(next_date):
        cdata.loc[indx, 'date'] = get_date
        cdata.loc[indx, 'text'] = get_tweet
        indx = indx + 1
        get_tweet = " "
df.to_csv("/home/development-pc/Documents/BTC_Tweets_1Y.csv")
My problem is that only a small sample of the data is actually converted to my format of choice.
Image of the dataframe
I do not know whether it is of importance, but the dataframe consists of three separate datasets that I combined into one using "pd.concat". After that, I sorted the newly created dataframe by date (ascending order) and reset the index as it was reversed (last input (2020-01-03) = 0 and first input (2019-01-01) = 124958).
Thanks in advance,
Filippos

Without going into the loop you used (I think you are only concatenating the first two instances, but I'm not sure), you could use groupby and apply. Here is an example:
# create some random data for example
import pandas as pd
import random
import string
dates = random.choices(pd.date_range(pd.Timestamp(2020,1,1), pd.Timestamp(2020,1,6)),k=11)
letters = string.ascii_lowercase
texts = [' '.join([''.join(random.choices(letters, k=random.randrange(2,10)))
                   for x in range(random.randrange(3,12))]) for x in range(11)]
df = pd.DataFrame({'date':dates, 'text':texts})
# group
pd.DataFrame(df.groupby('date').apply(lambda g: ' '.join(g['text'])))
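Applied to the dataframe from the question (assuming it is called cdata with columns date and text, as in the posted loop), the same idea would look roughly like this:
daily = cdata.groupby('date')['text'].apply(' '.join).reset_index()
daily.to_csv("/home/development-pc/Documents/BTC_Tweets_1Y.csv", index=False)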

Related

Cannot match two values in two different csvs

I am parsing through two separate csv files with the goal of finding matching customerIDs and dates in order to manipulate the balance.
In my for loop, at some point there should be a match, as I intentionally put duplicate IDs and dates in my csv. However, when parsing and attempting to match the data, the matches aren't working properly even though the values are the same.
main.py:
transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
    columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])
for index, row in transactions.iterrows():
    customer_id = row['customerID']
    date = formatter.convert_date(row['date'])
    minBalance = 0
    maxBalance = 0
    endingBalance = 0
    dict = {
        "customerID": customer_id,
        "MM/YYYY": date,
        "minBalance": minBalance,
        "maxBalance": maxBalance,
        "endingBalance": endingBalance
    }
    print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
    # Returns False
    if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
        # This section never runs
        print("hello")
    else:
        print("world")
    accounts.loc[index] = dict
accounts.to_csv(OUTPUT_PATH, index=False)
Transactions CSV:
customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
Expected Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
Where does the problem come from
Your problem comes from the comparisons you're doing with pandas Series. To make it simple, when you do:
customer_id in accounts['customerID']
you're checking whether customer_id is in the index of the Series accounts['customerID'], whereas you want to check against the values of the Series.
In your if statement, you're using the pd.Series.equals method. Here is what the documentation says the method does:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
So equals is used to compare DataFrames or Series with each other, which is different from what you're trying to do.
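A quick illustrative snippet (made-up values, not from the question) showing the difference:
import pandas as pd
s = pd.Series(['a', 'b'], index=[10, 11])
print('a' in s)         # False: `in` on a Series checks the index
print(10 in s)          # True
print('a' in s.values)  # True: check the values instead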
One of many solutions
There are multiple ways to achieve what you're trying to do; the easiest is simply to get the values from the Series before doing the comparison:
customer_id in accounts['customerID'].values
Note that accounts['customerID'].values returns a NumPy array of the values of your Series.
So your comparison should be something like this:
print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)
And use the same thing in your if statement:
if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
Alternative solutions
You can also use the pandas.Series.isin function, which, given an element (or list of elements) as input, returns a boolean Series showing whether each element in the Series matches the given input; then you just need to check whether that boolean Series contains at least one True value.
Documentation of isin : https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
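For instance, a minimal sketch of that check, reusing the names from the question (accounts, customer_id, date):
mask = accounts['customerID'].isin([customer_id]) & accounts['MM/YYYY'].isin([date])
if mask.any():  # at least one row matches both the customer id and the date
    print("hello")
else:
    print("world")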
It is not clear from the information provided what the formatter.convert_date function does, but from the example CSVs you added it seems like it should do something like:
def convert_date(mmddyy):
    (mm, dd, yy) = mmddyy.split('/')
    return mm + '/' + yy
In addition, make sure that the data types are also equal (both date fields should be strings, and the same goes for the customer ID).
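A minimal sketch of that type alignment, assuming the variable names from the question (row, formatter, accounts):
customer_id = str(row['customerID'])             # compare as string
date = str(formatter.convert_date(row['date']))
accounts['customerID'] = accounts['customerID'].astype(str)
accounts['MM/YYYY'] = accounts['MM/YYYY'].astype(str)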

Looping a function over huge dataset

An (already defined) function takes an ISIN (a unique identifier in finance) as input and gets the corresponding RIC (another identifier) as output by looking at a particular internal web app where this data is available in tabular form. The key limitation of this website is that it can't take more than 500 input IDs at a time. So when 500 or fewer ISINs are entered as input, it returns a dataframe containing those input ISINs and their corresponding RIC codes from the website.
The task is to take a csv containing 30k ISINs as input, batch them in groups of 500 IDs so they can pass through the function, and then store the produced output (dataframe). Keep looping over the input and appending the output incrementally.
Can someone please suggest how to break these 30K IDs into batches of 500, loop them through the function, and store all the results? Many thanks in advance!
.iloc is the method you want to use.
df.iloc[firstrow:lastrow, firstcol:lastcol]
if you put it in a for loop such as
for x in range(0, 30000, 50):
    a = x       # first row
    b = x + 50  # last row
    thisDF = bigdf.iloc[a:b, firstcol:lastcol]
Try it and implement it in your code. You should ask questions with some code you have tried, so you get better help.
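As a rough sketch only (the lookup function name and file names below are assumptions, not from the question), the slicing above could feed the existing 500-ID function and collect the results like this:
import pandas as pd

isin_df = pd.read_csv("isins.csv")            # assumed input file with the 30k ISINs
results = []
for start in range(0, len(isin_df), 500):
    batch = isin_df.iloc[start:start + 500]   # at most 500 rows per call
    results.append(lookup_rics(batch))        # placeholder for the user's existing function
output = pd.concat(results, ignore_index=True)
output.to_csv("isin_ric_mapping.csv", index=False)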
Assuming you read in the .csv file as a pandas Series (e.g., by using something like this: pd.Series.from_csv('ISINs.csv')) or that you have a list, you could split these up as thus:
import pandas as pd
import numpy as np

# mock series of ISINs
isins = pd.Series(np.arange(0, 30002, 1))

data = pd.DataFrame()
for i in range(0, len(isins)//500):
    isins_for_function = isins.iloc[i*500: i*500+500]
    # if you have a list instead of a series, slice it like this instead:
    # isins_for_function = isins[i*500: i*500+500]
    df = func(isins_for_function)
    data = pd.concat([data, df])

# handle whatever is left after the last full chunk of 500
# for a Series
df = func(isins.iloc[-(len(isins) % 500):])
# for a list
# df = func(isins[-(len(isins) % 500):])
data = pd.concat([data, df])
This will concatenate your dataframes together into data. isins is your Series/list of ISINs. You need the last bit after the for loop for any values that come after the last full chunk of 500 (in the Series above, which has 30002 rows, the last two are not included in the chunks of 500, so they still need to be passed to the function func).

Concatenating and appending to a DataFrame inside the For Loop in Python

I have the following problem.
There is a quite big dataset with features and IDs. Due to the task definition, I'm trying to do clustering, but not on the whole dataset; instead, I take each of the IDs and train the model on the feature data of that particular ID. Here is how that looks in detail:
Imagine, that we have our initial dataframe df_init
Then I create the array with unique ID_s:
dd = df_init['ID'].unique()
After that, a dict comprehension is created, just like that:
dds = {x:y for x,y in df_init.groupby('ID')}
Using a for loop and iterating over dds, I take the data and use it to train the clustering algorithm. After that, pd.concat() is used to get the dataframe back (for this example, only two IDs will be shown):
df = pd.DataFrame()
d = {}
n = 5
for i in dd[:2]:
    d[i] = dds[i].iloc[:, 1:5].values
    ac = AgglomerativeClustering(n_clusters=n, linkage='complete').fit(d[i])
    labels = ac.labels_
    labels = pd.DataFrame(labels)
    df = pd.concat([df, labels])
    print(i)
    print('Labels: ', labels)
So the result of this loop will be the following output:
And the output df will look like this (shown only for the first ID; the rest of the labels are also there):
My question is the following: how can I add an additional column to this dataframe inside the loop, matching each ID to its corresponding labels (4 labels for ID_1, another 4 labels for ID_2, etc.)? Is there any pandas solution for achieving that?
Many thanks in advance!
Below this line:
labels = pd.DataFrame(labels)
Add the following:
labels['ID']=i
This will give you the extra column with the proper ID for each subset
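Put together, a minimal sketch of the loop from the question with that extra column (dd, dds and n are the names from the question; this is an illustration, not the exact original code):
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

df = pd.DataFrame()
for i in dd[:2]:
    features = dds[i].iloc[:, 1:5].values
    ac = AgglomerativeClustering(n_clusters=n, linkage='complete').fit(features)
    labels = pd.DataFrame(ac.labels_, columns=['label'])
    labels['ID'] = i  # tag every label row with its ID
    df = pd.concat([df, labels], ignore_index=True)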

How to efficiently match values from 2 series and add them to a dataframe

I have a csv file "qwi_ak_se_fa_gc_ns_op_u.csv" which contains a lot of observations of 80 variables. One of them is geography, which is the county. Every county belongs to something called a Commuting Zone (CZ). Using a matching table given in "czmatcher.csv" I can assign a CZ to every county given in geography.
The code below shows my approach. It simply goes through every row and finds its CZ by going through the whole "czmatcher.csv" and finding the right row. Then I proceed to just copy the values using .loc. The problem is that this took over 10 hours to run on a 0.5 GB file (2.5 million rows), which isn't that much, and my intuition says this should be faster.
This picture illustrates what the csv files look like. The idea would be to construct the "Wanted result (CZ)" column, name it CZ, and add it to the dataframe.
File example
import pandas as pd

data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[0, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[0, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']
Is there a faster way of doing this?
The best way to do this is a left merge of your dataframes:
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
The key column is called geography in data and FIPS in czm, so merge with left_on/right_on:
data_final = data.merge(czm, how='left', left_on='geography', right_on='FIPS')
Alternatively, rename one of the columns first so both dataframes use the same name:
czm.rename(columns={'FIPS': 'geography'}, inplace=True)
data_final = data.merge(czm, how='left', on='geography')
Read the doc for further information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
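If you only need the CZ code after the left_on/right_on merge above, a possible follow-up (an assumption about the desired output, not from the original answer) is to rename the CZID column and drop the join key:
data_final = data_final.rename(columns={'CZID': 'CZ'}).drop(columns='FIPS')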
In order to make it faster without reworking your whole solution, I would recommend using Dask DataFrames. To put it simply, Dask reads your csv in chunks and processes each of them in parallel. After reading the csv, you can use the .compute method to get a pandas df instead of a Dask df.
This will look like this:
import pandas as pd
import dask.dataframe as dd  # IMPORT DASK DATAFRAMES

# YOU NEED TO USE dd.read_csv instead of pd.read_csv
data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
data = data.compute()
czm = dd.read_csv("czmatcher.csv")
czm = czm.compute()

sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[0, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[0, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']

MultiLabelBinarizer mixes up data when inverse transforming

I am using sklearn's MultiLabelBinarizer() to binarize multiple columns of the data I use to train my model.
After using it I noticed it was mixing up my data when it inverse transforms it. I created a test set of random values where I fit the data, transform it, and inverse_transform the data to get back to the original data.
I ran a simple test in jupyter notebook to show the error:
In the inverse_transformed value it messes up row 1, mixing up the state and month.
jupyter notebook code
First of all, is there an error in how I use the MultiLabelBinarizer? Is there a different way to achieve the same output?
EDIT:
Thank you to @Nicolas M. for helping me solve my problem. I ended up solving this issue like this.
Forgive the rough explanation, but it turned out to be more complicated than I originally thought. I switched to using LabelBinarizer instead of MultiLabelBinarizer.
I ended up pickling the label_binarizer defaultdict so I can load it and use it in different modules of my machine learning project.
One thing that might not be trivial is that I add new headers to the dataframe I create for each column, in the form of column name + column number. I did this because I needed to inverse transform the data. To do that, I searched for the columns that contained the original column name, which separated the larger dataframe into individual column chunks.
Here are some variables that I used and what they mean, for reference:
lb_dict - default dict that stores the different label binarizers.
binarize_df - dataframe that stores the binarized data.
binarized_label - the binarized data for one label/column.
header - new headers of the form: column name + column number.
inverse_df - dataframe that stores the inverse_transformed data.
one_label_list - the list of column names carrying the original column tag.
one_label_df - a new dataframe that only stores the binarized data for one column.
single_label - binarized data that gets inverse_transformed back into one column.
In this code, data is the dataframe that I pass to the function.
from collections import defaultdict
from sklearn.preprocessing import LabelBinarizer
import pandas as pd

lb_dict = defaultdict(LabelBinarizer)
# create a place holder dataframe to join new binarized data to
binarize_df = pd.DataFrame(['x'] * len(data.index), columns=['place_holder'])
# loop through each column, create a binarizer and fit/transform the data,
# then add the new data to the binarize_df dataframe
for column in data.columns.values.tolist():
    lb_dict[column].fit(data[column])
    binarized_label = lb_dict[column].transform(data[column])
    header = [column + str(i) for i in range(0, len(binarized_label[0]))]
    binarize_df = binarize_df.join(pd.DataFrame(binarized_label, columns=header))
# drop the place holder value
binarize_df.drop(labels=['place_holder'], axis=1, inplace=True)
Here is the inverse_transform function that I wrote:
inverse_df = pd.DataFrame(['x'] * len(output.index), columns=['place_holder'])
# use a for loop to run through the different output columns that need to be inverse_transformed
for column in output_cols:
    # create a list of the different headers based on whether the name contains the original output column name
    one_label_list = [x for x in output.columns.values.tolist() if column in x]
    one_label_df = output[one_label_list]
    # inverse transform the data frame for one label
    single_label = label_binarizer[column].inverse_transform(one_label_df.values)
    # join the output of the single label df to the entire output df
    inverse_df = inverse_df.join(pd.DataFrame(single_label, columns=[column]))
inverse_df.drop(labels=['place_holder'], axis=1, inplace=True)
The issue comes from the data (and, in this case, a bad use of the model). If you create a DataFrame from the output of your MultiLabelBinarizer you will have:
You can see that all columns are sorted in ascending order. When you ask to reconstruct, the model will reconstruct it by "scanning" values by row.
So if you take line one, you have:
1000 - California - January
Now if you take the second one, you have:
750 - February - New York
And so on...
So your month is swapped because of the sorting order. If you replaced the month by "ZFebruary", it would be OK, but still only by "luck".
What you should do is train 1 model per categorical feature and stack every matrix to get your final matrix. To revert it, you should extract each "sub_matrix" and do the inverse_transform on it.
To create 1 model per feature, you can refer to the answer of Napitupulu Jon in this SO question.
EDIT 1:
I tried the code from the SO question and it doesn't work as the number of columns changed. This is what I have now (but you still have to save somewhere the columns used by every feature):
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from collections import defaultdict

data = {
    "State": ["California", "New York", "Alaska", "Arizona", "Alaska", "Arizona"],
    "Month": ["January", "February", "May", "February", "January", "February"],
    "Number": ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)

d = defaultdict(MultiLabelBinarizer)  # dict of Features => model
list_encoded = []                     # store single matrices

for column in df:
    d[column].fit(df[column])
    list_encoded.append(d[column].transform(df[column]))

merged = np.hstack(list_encoded)  # matrix of 6 x 32
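As for reverting, the answer above only describes it in words; one possible sketch (an assumption, not part of the original answer) is to record the width of each per-feature sub-matrix while encoding and slice merged accordingly:
# widths of each feature's block inside `merged`
widths = [encoded.shape[1] for encoded in list_encoded]
offsets = np.cumsum([0] + widths)
for (start, stop), column in zip(zip(offsets[:-1], offsets[1:]), df.columns):
    sub_matrix = merged[:, start:stop]
    # inverse_transform returns, for each row, the tuple of labels seen by that binarizer
    print(column, d[column].inverse_transform(sub_matrix))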
I hope it helps and the explanation is clear enough,
Nicolas
