I have a csv file where one of the columns represents how many followers a Twitter user has:
The csv file has about 1 000 000 rows. I'd like to create a graph showing the distribution of follower counts across the whole dataset. Since the range of follower counts is quite big (from 0 up to hundreds of thousands), the graph should probably be approximate: each bar could represent 1000 followers or even more (so the first one would be 0-1000, then 1000-2000, etc.). I hope I'm making myself clear.
I've tried a simple code but it gives a weird result.
import pandas as pd

df = pd.read_csv(".csv", encoding='utf8', delimiter=',')
df["user.followers_count"].hist()
Here's the result:
Does it have anything to do with the size and a large range of my data?
There is a bins argument in the hist function you have called. You just need to update it with a more reasonable value.
To understand: if the range is 0-10000 and you have set bins=10, then 0-1000 is one bin, 1000-2000 is another, and so on.
Increasing the number of bins (and thus reducing each bin's width) will give you a smoother distribution curve and get you closer to what you are trying to achieve with this code/dataset.
Documentation link:
https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.hist.html
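For example, to get one bar per 1000 followers you can pass explicit bin edges instead of a bin count. A minimal sketch on synthetic data (the column name comes from the question; the data itself is made up):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-in for the real CSV; only the column name is from the question
rng = np.random.default_rng(0)
df = pd.DataFrame({"user.followers_count":
                   rng.exponential(5000, 100000).astype(int)})

# Explicit bin edges: one bar per 1000 followers (0-1000, 1000-2000, ...)
step = 1000
edges = np.arange(0, df["user.followers_count"].max() + step, step)
ax = df["user.followers_count"].hist(bins=edges)
ax.set_xlabel("Followers")
ax.set_ylabel("Number of users")
plt.tight_layout()
```

Passing an array to bins fixes both the number of bins and their boundaries, which is easier to reason about than a bare count when the range is this wide.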
Given your data, I get the following output:
Currently I am working on a dataset which has more than 26 million rows and 2 columns. I managed to separate it into equal-sized 1-million-row csv files. The problem is that I need to open these files in Python and plot the data, but some rows were accidentally written in date format (for example, the desired value was 11.1118 degrees Celsius, but Excel stored 11/11/2018 instead). There are too many rows to fix one by one. Is there any solution to this?
Given above is part of my data; the same mistake appears hundreds of times in the following rows.
If the image you posted is representative of the whole problem, I think you don't actually have one: only the left column is corrupted, and as far as I can see that column shows nothing but an equidistant list of numbers with step 0.5. If you need it as the x-values of a diagram, you can simply create it separately and import only the data of the right column.
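A small sketch of that idea on hypothetical sample rows (the exact layout of your file is an assumption here):

```python
import io
import numpy as np
import pandas as pd

# Hypothetical sample: the left column is partly corrupted into dates,
# the right column holds the actual measurements
raw = io.StringIO("0.5,11.1118\n11/11/2018,12.3\n1.5,13.0\n")

# Read only the right column and ignore the corrupted left one entirely
values = pd.read_csv(raw, header=None, usecols=[1])[1]

# Recreate the equidistant x-values separately, with step 0.5
x = 0.5 + 0.5 * np.arange(len(values))
```

Since the corrupted column is discarded before parsing, the date-formatted cells never need to be repaired at all.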
I have a data set consisting of the number of page views over 6 months for 30k customers. It also includes the following:
Number of unique OS used
Number of unique browsers used
Number of unique cookies used
All these numbers are taken over a period of six months.
Now I did try to do a normal test using:
from scipy.stats import normaltest
k2, p = normaltest(df)
print(p)
This returns 0.0, meaning the data does not follow a normal distribution.
Now I want to know why that is. I thought that, generally, as the sample size increases we see a normal distribution in data; since the data has 30k rows, I could not understand why it was not normally distributed.
I did try converting them into Z score, but still no luck. Can I transform my data such that I can have a normal distribution? Is there any method using which I can do that?
In the area I work in (mass spectrometry) we typically log-transform data that is heteroscedastic, as yours probably is. In my field, small values are far more likely than large ones, so we end up with an exponential distribution.
I'm guessing your data will look like mine, in which case you will need to do a log transform of your data to make it normally distributed. I would do this so that I can apply t-tests and other stats models.
Something like
df_visits = df_visits.apply(lambda x: np.log(x))
Of course, you will also need to get rid of any zeros before you can log-transform.
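A minimal sketch on synthetic right-skewed data; using np.log1p is one way to sidestep the zero problem (an assumption on my part — shifting or dropping zeros works too):

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Synthetic skewed counts as a stand-in for the page-view data
rng = np.random.default_rng(42)
df_visits = pd.DataFrame({"page_views":
                          rng.lognormal(3.0, 1.0, 30000).round()})

# log1p(x) = log(1 + x) is defined at zero, so no rows need to be dropped
transformed = np.log1p(df_visits["page_views"])

# Re-run the normality test on the transformed column
k2, p = normaltest(transformed)
```

After the transform, the p-value should move well away from the hard 0.0 you saw on the raw counts, provided the underlying data really is log-normal-ish.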
Image showing pre Vs post log transform
I have some trouble understanding the output of the Discrete Cosine Transform.
Background:
I want to achieve a simple audio compression by saving only the most relevant frequencies of a DCT. In order to be somewhat general, I would cut several audio tracks into pieces of a fixed size, say 5 seconds.
Then I would do a DCT on each sample and find out which are the most important frequencies among all short snippets.
This, however, does not work, which might be due to my misunderstanding of the DCT. See for example the images below:
The first image shows the DCT of the first 40 seconds of an audio track (wanted to make it long enough so that I get a good mix of frequencies).
The second image shows the DCT of the first ten seconds.
The third image shows the DCT of a reversed concatenation (like abc->abccba) of the first 40 seconds.
I added a vertical mark at 2e5 for comparison. The sample rate of the music is the usual 44.1 kHz.
So here are my questions:
What is the frequency that corresponds to an individual value of the DCT-output-vector? Is it bin/2? Like if I have a spike at bin=10000, which frequency in the real world does this correspond to?
Why does the first plot show strong amplitudes for so many more frequencies than the second? My intuition was that the DCT would yield values for all frequencies up to 44.1 kHz (so bin number 88.2k if my assumption in #1 is correct), only that the scale of the spikes would differ, which would then make up the difference in the music.
Why does the third plot show strong amplitudes for more frequencies than the first does? I thought that by concatenating the data, I would not get any new frequencies.
As DCT and FFT/DFT are very similar, I tried to learn more about the Fourier transform (this and this helped), but apparently it didn't suffice.
Figured it out myself, and it was indeed written in the link I posted in the question. The frequency that corresponds to a certain bin_id is given by (bin_id * freq/2) / N, which essentially boils down to bin_id / (2t) with N = freq*t. This means that the plots just have different granularities. So if plot#1 has a high point at position x, plot#2 will likely show a high point at x/4 and plot#3 at x*2.
The image below shows the data of plot#1 stretched to twice its size (in blue) and the data of plot#3 in yellow.
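The bin-to-frequency mapping can be checked numerically with a pure tone: for scipy's DCT-II, a tone at f Hz in a t-second snippet peaks near bin 2·f·t, i.e. f = bin_id / (2t). A sketch (the sample rate and the test tone are arbitrary choices):

```python
import numpy as np
from scipy.fft import dct

fs = 44100            # sample rate, Hz
t = 2.0               # snippet length, seconds
n = int(fs * t)
tone = 1000.0         # test tone, Hz

# A pure cosine at `tone` Hz, sampled at fs
signal = np.cos(2 * np.pi * tone * np.arange(n) / fs)
coeffs = dct(signal, type=2)

# The dominant coefficient should sit near bin 2 * tone * t
peak_bin = int(np.argmax(np.abs(coeffs)))
```

With fs = 44100 and t = 2 the peak lands at bin 4000, i.e. 4000 / (2 · 2) = 1000 Hz, matching the input tone.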
Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to the time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one to match the length of the time series seems wasteful, if not impractical. Second, and more importantly, if I want to do basic groupby analyses (e.g., getting the proportion of correct responses at a given difficulty level), the results will be biased (incorrect), because each trial's correct/wrong value is repeated as many times as needed to match the length of that trial's time series (which is irrelevant to computing the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x- y- and z-locations of several motion-capture markers at time intervals of 10ms, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd

subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=[i for i, _ in enumerate(subject_trials)])
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
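Putting the pieces together on toy data (all column names here are made up): duplicate the per-trial value down every row of its trial, then collapse with a groupby when you need unbiased per-trial statistics.

```python
import numpy as np
import pandas as pd

# Toy per-trial frames: a variable-length time series plus a single-valued
# "outcome" column repeated down every row of its trial (hypothetical names)
rng = np.random.default_rng(0)
frames = []
for trial in range(3):
    n = int(rng.integers(5, 10))          # variable trial length
    frames.append(pd.DataFrame({
        "t_ms": np.arange(n),             # 1 kHz samples
        "x": rng.normal(size=n),
        "outcome": "correct" if trial % 2 == 0 else "wrong",
    }))

# Stack into one frame with a (trial, sample) MultiIndex
df = pd.concat(frames, keys=range(3), names=["trial", "sample"])

# Collapse back to one row per trial, so trial-level stats are not
# biased by each trial's time-series length
per_trial = df.groupby(level="trial")["outcome"].first()
```

Grouping on the trial level and taking `first()` undoes the duplication exactly, so means and proportions computed on `per_trial` weight each trial once.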
I am trying to investigate differences between runs/experiments in a continuously logged data set. I am taking a fixed subset of a few months of this data set and then analysing it to come up with an estimate of when each run was started. I have this sorted in a series of times.
With this I then chop the data up into 30 hour chunks (approximate time between runs) and then put it into a dictionary:
data = {}
for time in times:
    timeNow = np.datetime64(time.to_datetime())
    time30hr = timeNow + np.timedelta64(30 * 60 * 60, 's')
    data[time] = df[timeNow:time30hr]
So now I have a dictionary of dataframes, indexed by StartTime, and each one contains all of my data for a run, plus some extra to ensure I have it all for every run. But to compare two runs I need a common x-value to stack them on top of each other. Every run is different, and the point I want to consider "the same" varies depending on what I'm looking at. For the example below I have used the largest value in the dataset to "pivot" on.
for time in data:
    A = data[time]
    # Find the max point for Value, taking the first if there is more than one
    maxTime = A[A['Value'] == A['Value'].max()]['DateTime'].iloc[0]
    # Now we can say we want 12 hours before and 12 after
    new = A[maxTime - datetime.timedelta(0.5):maxTime + datetime.timedelta(0.5)]
    # Stick on a new column with time from the zero point
    new['RTime'] = new['DateTime'] - maxTime
    # Plot values against this new time
    plot(new['RTime'], new['Value'])
This yields a graph like:
This is great, except I can't get a decent legend in order to tell which run was which and work out how much variation there is. I believe half my problem is that I'm iterating over a dictionary of dataframes, which is causing issues.

Could someone recommend how to better organise this (a dictionary of dataframes is all I could get to work)? I've thought of using a hierarchical dataframe and, instead of indexing it by run time, assigning a set of identifiers to the runs (the actual times are contained within the dataframes themselves, so I have no problem losing the assumed start time), then plotting it with a legend.

My final aim is to have a dataset and methodology that let me investigate the similarities and differences between runs using different "pivot points", and to produce a graph of each one which I can then interrogate (or at least tell which data set is which, so I can interrogate the data directly), but I couldn't get past various errors when creating it.

I can upload a sample of the data to a csv if required, but I'm not sure of the best place to upload it. Thanks.
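For reference, here is a sketch of the hierarchical-frame idea on fabricated data (only the Value and DateTime column names come from the snippets above; everything else is an assumption). Labelling each line as it is plotted is what makes the legend work:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Fabricated runs: one frame per run with DateTime and Value columns
rng = np.random.default_rng(1)
runs = {}
for run_id in range(3):
    start = pd.Timestamp("2015-01-01") + pd.Timedelta(hours=30 * run_id)
    dt = pd.date_range(start, periods=200, freq="min")
    runs[run_id] = pd.DataFrame({"DateTime": dt,
                                 "Value": rng.normal(size=200).cumsum()})

# One hierarchical frame keyed by a run identifier, instead of a dict
df = pd.concat(runs, names=["Run", "Sample"])

fig, ax = plt.subplots()
for run_id, group in df.groupby(level="Run"):
    # Pivot on the peak and re-express time relative to it, in hours
    peak_time = group.loc[group["Value"].idxmax(), "DateTime"]
    rel_hours = (group["DateTime"] - peak_time).dt.total_seconds() / 3600
    ax.plot(rel_hours, group["Value"], label=f"Run {run_id}")
ax.set_xlabel("Hours from peak")
ax.legend()
```

Because each `ax.plot` call carries a label, `ax.legend()` can identify every run, and the single hierarchical frame replaces the dictionary-of-dataframes loop.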